Introduction to regular expressions Part 1 - General Mechanics
History of the "regular expression".
Regular expressions were created by an American mathematician named Stephen Kleene (the "*" character is often referred to as the "Kleene star"), who was involved in much of the early development of theoretical computer science. While developing regular expressions, he referred to them as "the algebra of regular sets". Most of the earliest text-processing tools on the Unix operating system (grep in particular) use the regular expressions founded by Stephen Kleene. As you can see, regular expressions have been in use practically since the dawn of advanced operating systems.

Think of the simple line:

    *.*

which you see in almost every text editor, and also in the "files of type:" box of most Windows applications (open Notepad and go to File->Open). When you want to open a .txt file you see the line *.txt, and similarly if you want to open an HTML file you see the line *.html. Strictly speaking these are wildcard patterns rather than full regular expressions (inside a real regular expression, * and . have the more precise meanings explained later in this tutorial), but they are built on the same idea. Basically *.* means "match anything beginning with any text and ending with any extension", *.txt means "match anything beginning with any text and ending with the .txt extension", and *.html of course means "match anything beginning with any text and ending with the .html extension". So, according to which you choose, it will search for either all files, all .txt files, or all .html files within the current directory.

The best way to understand regex is to say it to yourself the way I have written it for you, in a sentence: "match anything with any text ending with any extension", and so forth. Regular expressions are used in many situations that most people never notice, and in many cases people understand their use with little or no knowledge of regex at all.

To begin understanding regex, let's break it down into its most basic parts. Generally regex has two parts: metacharacters and literal text.
But I would like to break it down into five parts: metacharacters, anchors (anchors are in actuality metacharacters, but I will be explaining them separately), literal text, whitespace (whitespace is actually a "piece" of literal text, but I will explain that in a minute...), and character classes.

Metacharacter - A metacharacter is a special character that the regex engine uses to apply "rules" for your regex. The metacharacter in the pattern below is the *:

    *.txt

The pattern above means "match anything containing any text and then the text '.txt'".

Literal text - Literal text is the actual "text" that you want to be matched by your regular expression. The literal text in the pattern below is .txt:

    *.txt

The pattern above means "match anything containing any text and then the text '.txt'".

Character Class - A character class lets you tell the regex engine which characters (literal text) you would like to allow at that point in the regular expression. The character class in the regex below is [Jj]:

    [Jj]ohn

The regular expression above means "match anything containing J or j followed by ohn".

Anchor - An anchor is actually a metacharacter, but it doesn't match text, only a position in the text. The anchors in the regex below are ^ and $ (the surrounding slashes are delimiters, not part of the pattern):

    /^[Jj]ohn$/

The regular expression above means "match the start of the line, followed by J or j, followed by ohn, immediately followed by the end of the line".

Whitespace - Whitespace is actually "literal text", but it is empty space. A string of: $str = " "; is comprised of whitespace.

Common Metacharacters and Anchors

The first metacharacters I would like to introduce are ^ and $. ^ is the 'caret' symbol, and basically means "the start of the line". $ is the 'dollar' symbol, and basically means "the end of the line", but in actuality means "up to a newline character".
How soon $ matches depends on the engine: line-oriented tools such as grep, and engines running in multi-line mode, match $ just before every \n in a string, while other engines match it only at the end of the string (PCRE also allows it just before one trailing newline).

OK, let's have a look at a simple regex using these two metacharacters:

    ^Subject: $

If you were searching through emails to find the subjects, you could use a regex such as this. It basically means "the start of the line, followed by the literal text 'Subject: ', followed by the end of the line" (so, as written, it only matches a subject line with nothing after the space).

The next metacharacter is the * character, which basically means "zero or more occurrences". Take the regex:

    ^[0-9]*$

This means "the beginning of the line, followed by zero or more digits, followed by the end of the line."

The + metacharacter is much like *, except it means "one or more occurrences". Take the regex:

    ^[0-9]+$

This means "the beginning of the line, followed by one or more digits, followed by the end of the line." Now take:

    ^[0-9]*[a-z]+$

This means "the beginning of the line, followed by zero or more digits, followed by one or more lowercase letters, followed by the end of the line."

The next metacharacter, ?, means "optional". Take the regex:

    ^[a-z]?$

This means "the beginning of the line, followed by an optional lowercase letter, followed by the end of the line." Now take:

    ^[A-Z]+[a-z]*_?$

This means "the beginning of the line, followed by one or more uppercase letters, followed by zero or more lowercase letters, followed by an optional underscore, followed by the end of the line."

Now I would like to explain what the parenthesis metacharacters, ( and ), are used for. They are basically used to "group" expressions together, and the | character inside a group means "or". Let's take a look:

    ^(a|b)?$

This means "the beginning of the line, followed by an optional a or b, followed by the end of the line." Let's get a little more advanced:

    ^([a-z]|[0-9])*$

This means "the beginning of the line, followed by zero or more of any lowercase letter or digit, followed by the end of the line."

The . metacharacter (dot, or 'period') is used to match "any single character" (in most engines, any character except a newline).
Take the regex:

    ^(.*)$

This means "the beginning of the line, followed by zero or more of any character, followed by the end of the line", which basically means "any amount of any character".

As you have read, the [] metacharacters denote a "character class". Within a character class you can have what is called a 'range', which I have been using in the examples above. Let's take a look:

    ^[c-f]$

This means "the beginning of the line, followed by a lowercase letter between c and f (c, d, e, or f), followed by the end of the line."

    ^[1-5]$

This means "the beginning of the line, followed by a digit between one and five (1, 2, 3, 4, or 5), followed by the end of the line."

Using the ^ ('caret') to negate a character class:

The caret character, ^, can also be used as the first character inside a character class to "negate" it. Negating a character class basically means "not this". Let's take a look:

    ^(.*)[^a-z5-9]*$

This basically means "the beginning of the line, followed by zero or more of any character, followed by zero or more characters that are NOT a lowercase letter or a digit from 5 to 9, followed by the end of the line."

    ^[^a-z]{1,3}$

This means "the beginning of the line, followed by at least one but no more than three of any character that is not a lowercase letter, followed by the end of the line." Notice the {1,3} part? That means {minimum, maximum}: the first parameter is the minimum number and the second parameter is the maximum number of occurrences the text may have for it to match.

    ^[a-zA-Z0-9]{6,15}$

This regex is commonly used to check the format of a password on "signup" forms across the internet. As you should be able to tell, it means "the beginning of the line, followed by at least six but no more than 15 lowercase letters, uppercase letters, or digits, followed by the end of the line", meaning "the password must be comprised of upper/lowercase letters or digits (or both), and be 6-15 characters long."
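The patterns above can be tried out directly in PHP. What follows is only a minimal sketch, assuming the PCRE function preg_match() is available; the helper name matches_line() and all of the sample strings are made up for illustration and do not come from the tutorial itself.

```php
<?php
// Hypothetical helper (our own name): wraps a bare pattern in / delimiters
// and reports whether preg_match() found a match in the subject string.
function matches_line(string $pattern, string $subject): bool
{
    return preg_match('/' . $pattern . '/', $subject) === 1;
}

// Each row: pattern from the text, an invented sample string,
// and the result we expect from reading the pattern as a sentence.
$checks = [
    ['^[0-9]+$',            '2024',     true],   // one or more digits
    ['^[0-9]+$',            '',         false],  // + needs at least one digit
    ['^[0-9]*$',            '',         true],   // * accepts zero digits
    ['^[A-Z]+[a-z]*_?$',    'Regex_',   true],   // optional trailing underscore
    ['^(a|b)?$',            'ab',       false],  // at most one a or b
    ['^[c-f]$',             'd',        true],   // single letter in the range c-f
    ['^[^a-z]{1,3}$',       'A1!',      true],   // three non-lowercase characters
    ['^[^a-z]{1,3}$',       'A1!?',     false],  // four characters is too many
    ['^[a-zA-Z0-9]{6,15}$', 'secret99', true],   // valid password format
    ['^[a-zA-Z0-9]{6,15}$', 'abc',      false],  // too short
];

foreach ($checks as [$pattern, $subject, $expected]) {
    $actual = matches_line($pattern, $subject);
    printf(
        "%-22s %-12s %-9s %s\n",
        $pattern,
        var_export($subject, true),
        $actual ? 'match' : 'no match',
        $actual === $expected ? '(as expected)' : '(unexpected)'
    );
}
```

The helper only works for patterns that do not themselves contain a / character, since that is the delimiter it adds; a pattern with slashes would need escaping or a different delimiter.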
    ^(.*) 376 (.*) :End of /MOTD command.$

If you do not know anything about the IRC protocol, you probably won't understand what this regex is used for; but if you do (if you have ever programmed an IRC bot, for instance), you will notice that this will match when you receive the "message of the day" upon connecting to an IRC server. Usually when you receive the MOTD it means you have connected successfully.

    (^From: |^To: )(.+)$

This regex could be used to search through email logs to find who is sending and receiving email. Editing that just a bit, like so:

    (^From: user@something.com|^To: user@something.com) (.+)$

will allow you to match all of the emails sent by or sent to user@something.com.

    ^:(.+)@(.+) PART #(.+) : (.*)$

This is another regex for IRC, which will match when someone leaves a channel.

    ^:(.+)@(.+) JOIN :#(.+)$

This is also a regex for IRC which, as you should have noticed by now, will match when someone joins a channel.

The Speed Issue

In any given week there will be at least fifteen to twenty people who point out that the 'string functions' of PHP are faster than regular expressions, and this is (for the most part) true. They are. The benchmark tests performed are hardly accurate, but the picture drawn by many benchmarks is clear - the string functions _ARE_ faster. But they lack true power when it comes to advanced data manipulation. This gives you a rule: only use regular expressions when 'advanced data manipulation' comes into play. If I had to search through 15,000 files replacing every occurrence of "grey" with "gray", I would use regular expressions. Yet if I had to check an array element for the word "dog", I would settle for strstr(). There are many, many debates on IRC over which is more efficient, regex or string functions, but in my opinion it's senseless.
Both regex and native string functions have their uses - and it is up to you to decide what they are - but be careful and wary of using regular expressions in places where they are not needed.
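The rule of thumb above might look like this in practice. This is a rough sketch under stated assumptions rather than a benchmark: the sample data is invented, and the choice of preg_replace() with a capture group (to keep the original capitalisation of "Grey"/"grey") is ours, not something the text prescribes.

```php
<?php
// Invented sample data for illustration.
$items = ['cat', 'hotdog stand', 'bird'];

// Simple containment check: a native string function is enough.
// strstr() returns false when the needle is absent, so compare with !== false.
$withDog = [];
foreach ($items as $item) {
    if (strstr($item, 'dog') !== false) {
        $withDog[] = $item;
    }
}

// Pattern-based rewrite: one regex handles both capitalisations.
// ([Gg]) captures the first letter and $1 puts it back in the replacement.
$text  = 'Grey skies, grey suits.';
$fixed = preg_replace('/([Gg])rey/', '$1ray', $text);

print_r($withDog);
echo $fixed, "\n";
```

Here the strstr() branch needs no pattern at all, while the replacement relies on a character class and a captured group, which is the kind of job the string functions cannot do in a single call.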
More PHP Programming - Regular Expressions Tutorials:
- Introduction to regular expressions Part 2 - ERE POSIX
















