Regular Expressions Part 2

In the previous post, simple regular expressions were explained. Today, regex becomes useful. If you didn’t read the previous post, you should at least skim it.

All examples in this post use Perl.

Getting a Match
Parentheses are used to extract a match from a string. Let’s say you want to know what is inside the “head” HTML tags. Here’s the code:


if ( $html =~ m/<head>(.*)<\/head>/is ) {
    print "HTML Header:\n$1\n";
}

The match is given to the code as the variable $1. Note that this example has an “s” after the closing forward slash. The s lets the period match newline characters, so the pattern can span several lines. Without it, you probably wouldn’t get a match, because a head section usually covers more than one line. Also, this simple regex will not capture the head correctly in all cases. The “.*” is greedy, so if “</head>” appears again later in the document, for example inside the body text, the capture runs all the way to the last “</head>”.
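
To stop at the first closing tag instead, a question mark after the asterisk makes the match non-greedy. Here’s a minimal sketch, assuming the pattern starts at an opening <head> tag; the helper name extract_head and the sample HTML are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between <head> and the first
# </head> that follows it, or undef when there is no head section.
sub extract_head {
    my ($html) = @_;
    # .*? is non-greedy, /s lets . match newlines, /i ignores case.
    if ( $html =~ m/<head>(.*?)<\/head>/is ) {
        return $1;
    }
    return;
}

# Made-up sample with a stray </head> inside the body.
my $html = "<html><HEAD>\n<title>Demo</title>\n</HEAD><body>a stray </head> here</body></html>";
my $head = extract_head($html);
print "HTML Header:$head\n" if defined $head;
```

This is still not a real HTML parser. It only shows how the greedy and non-greedy forms differ.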

Here’s another example:


if ( $text =~ m/ a ([aeiou][a-z]+)/i ) {
    print "Grammar error: use \"an\" when the following word starts with a vowel.\n i.e. an $1\n";
}

Yup, it’s a grammar rule check. Now you know where that green squiggly underline comes from.

Whitespace and Non-whitespace matching
Whitespace refers to spaces, tabs, and line breaks (newlines and carriage returns). “\s” matches a single whitespace character and “\S” matches a single non-whitespace character.
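
Combined with parentheses and the plus sign, these two are handy for pulling words out of messy text. A small sketch; the helper name first_two_words and the sample string are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first two whitespace-separated words,
# or an empty list when the text holds fewer than two words.
sub first_two_words {
    my ($text) = @_;
    # \S+ is a run of non-whitespace, \s+ is the gap between the words.
    if ( $text =~ m/(\S+)\s+(\S+)/ ) {
        return ( $1, $2 );
    }
    return;
}

my ( $first, $second ) = first_two_words("  Hello\t\n World  again");
print "First: $first, second: $second\n";    # prints "First: Hello, second: World"
```

The second pair of parentheses shows up as $2, the same way the first pair shows up as $1.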

Those are the basics of regular expressions. These magical expressions work in almost every language, including Perl, PHP, JavaScript, and Python.

Happy pattern matching.

Regular Expressions Part 1/2

Regular expressions, those oddities that live between two forward slashes, are very powerful and quite mysterious. Staring at something like /([abcdef0123456789]+)/i all day can give you a headache. With a little luck and a bit of hard work, you’ll know exactly what that expression means.

For this post, all examples will be using Perl.

Text Search
A regular expression, or regex, in its simplest form is a text search. Here’s an example:


$var = "Hello World";
if ( $var =~ m/Hello/ ) {
    print "Match\n";
}

In Perl, the operator =~ is used to run a regex against a variable. The m/Hello/ will match if the variable contains “Hello” anywhere.

To make the match case-insensitive, simply add an i after the last forward slash. So change the regex to m/Hello/i to match “Hello”, “HeLlO” and “hello”.
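
Here’s a quick sketch of the i flag in action. The helper name says_hello and the test strings are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: true when the text contains "hello" in any letter case.
sub says_hello {
    my ($text) = @_;
    return $text =~ m/hello/i ? 1 : 0;
}

# The first two greetings match; "Goodbye" does not.
for my $greeting ( "Hello World", "HeLlO there", "Goodbye" ) {
    my $result = says_hello($greeting) ? "Match" : "No match";
    print "$greeting: $result\n";
}
```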

Carets and Dollar Signs
A caret (^) at the beginning of a regex represents the beginning of a string. Here’s an example:


$var = "Hello World";
if ( $var =~ m/^Hello/ ) {
    print "Match\n";
}

A dollar sign ($) at the end of a regex represents the end of a string. Another example:


$var = "Hello World";
if ( $var =~ m/World$/ ) {
    print "Match\n";
}

If you want to match one of these special characters, put a backslash before it.


# Single quotes keep Perl from treating $World as a variable.
$var = 'Hello^ $World';
if ( $var =~ m/o\^ \$W/ ) {
    print "Match\n";
}

Brackets
Putting a list of characters inside square brackets “[]” will match any one of these characters.


$var = "Hello World";
if ( $var =~ m/[aeiou]/ ) {
    print "There is a vowel.\n";
}

You can even tell if a string is a hexadecimal value. This example uses the special character +. It means that the previous character, or bracketed set of characters, must appear 1 or more times.


$var = "0x157afde";
if ( $var =~ m/^0x[0123456789abcdef]+$/ ) {
    print "It is hexadecimal\n";
}

Within the brackets, instead of listing every possible character, you can specify a range to be matched. For instance, 0-9 will match any digit 0 through 9. Here’s a slightly shorter example:


$var = "0x157afde";
if ( $var =~ m/^0x[0-9a-f]+$/ ) {
    print "It is hexadecimal\n";
}
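
The pattern above only accepts lowercase letters. One way to accept uppercase hex digits too is to add the i flag from earlier. A sketch, with a made-up helper name is_hex and made-up sample values:

```perl
use strict;
use warnings;

# Hypothetical helper: true for "0x" followed by one or more hex digits,
# in either letter case.
sub is_hex {
    my ($value) = @_;
    return $value =~ m/^0x[0-9a-f]+$/i ? 1 : 0;
}

# "0x" fails because + needs at least one digit; "0x12g4" fails on the g.
for my $value ( "0x157afde", "0X157AFDE", "0x", "0x12g4" ) {
    my $result = is_hex($value) ? "hexadecimal" : "not hexadecimal";
    print "$value is $result\n";
}
```

Writing the range as [0-9a-fA-F] would also work, without making the “0x” prefix case-insensitive.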

The caret (^) continues its job as a special character within brackets, but with a different meaning. Putting one at the beginning of the brackets will match any character except those listed inside the brackets.


$var = "0x157afdg";
# Matches if any character falls outside 0-9, a-f, and x.
if ( $var =~ m/[^0-9a-fx]/ ) {
    print "It is not hexadecimal\n";
}

Periods and Asterisks
A period (.) will match any character.


$var = "Hello World";
if ( $var =~ m/^H.llo W.+$/ ) {
    print "Match\n";
}

An asterisk (*) is similar to a plus sign (+), but an asterisk will match 0 or more of the previous character.


$var = "Hello World";
if ( $var =~ m/^Hello .*$/ ) {
    print "Saying hello\n";
}
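
The difference between the two shows up when the character is missing entirely. A small sketch; the helper names and test strings are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helpers: the same pattern, once with * and once with +.
sub matches_star {
    my ($text) = @_;
    return $text =~ m/^ab*c$/ ? 1 : 0;    # zero or more b's
}

sub matches_plus {
    my ($text) = @_;
    return $text =~ m/^ab+c$/ ? 1 : 0;    # one or more b's
}

# "ac" has no b at all, so only the * version accepts it.
for my $text ( "ac", "abc", "abbbc" ) {
    print "$text: star=", matches_star($text), " plus=", matches_plus($text), "\n";
}
```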

Tomorrow
Tomorrow: more special characters, including whitespace characters, non-whitespace characters, and parentheses for extracting matches.