Jujusoft

Regular Expressions

Searching...

Regular Expressions are used to enhance searching for text within files. A regular expression (or RE) is a special string which describes which strings will match in a text search. The simplest type of RE is just a normal string. E.g. the RE "hello" will match any occurrence of itself in another string. In the following example it matches twice, once inside "Othello" and once as the separate word "hello":

... Othello said hello...

Note that we picked up an unwanted match with the end of "Othello". To avoid this we could add a space to the front of our search string, but looking for " hello" would fail to find the word if it appeared at the beginning of a sentence (because there would be no preceding space). This is where Regular Expressions become useful, because by using special escape characters (usually indicated with a backslash \), you can specify additional criteria for the search. In this case, we would search for "\bhello\b", and only instances of the whole word "hello" would be found, because the escape character \b means "word break", or the beginning or end of a word. This example barely scratches the surface of the possibilities that Regular Expressions offer. For example, the RE "\b(\w+)\s+\1\b" matches any repeated word separated by whitespace, such as "the the" and "had had" in the following:

... in Paris in the the spring...

... Othello had had a bath ... 
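
Both searches can be tried outside Jujusoft. The following is a minimal sketch using Python's re module, assumed here only as a stand-in because it shares the Perl-style \b, \w, \s and \1 syntax; Jujusoft's own engine may differ in details such as default case sensitivity.

```python
import re

text = "Othello said hello"

# Plain search: also finds the "hello" buried inside "Othello".
plain = [m.start() for m in re.finditer(r"hello", text)]

# Whole-word search: \b pins both ends of the match to a word break.
whole = [m.start() for m in re.finditer(r"\bhello\b", text)]

# Repeated word: (\w+) captures a word, \1 demands the same word again.
repeats = re.compile(r"\b(\w+)\s+\1\b")
found = [m.group(0) for m in repeats.finditer("in Paris in the the spring")]

print(plain, whole, found)
```

The plain search reports two start positions, the whole-word search only one.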

One of the most popular types of regular expressions involves optional repeats (of characters or character types). "\b\w+\b" will match any whole word (\w means "word character"), whereas "\b\w{3,6}\b" will match only those words between 3 and 6 letters long. "\banti-?\w+\b" will match any word with the prefix "anti", with or without a hyphen!
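
As a rough illustration of these three repeat patterns, here is a sketch in Python's re module (an assumed stand-in for the Jujusoft engine; the sample sentence is made up). Note that \w also accepts digits and underscores, so "letters" is a slight simplification.

```python
import re

text = "An anti-hero and an antibody met a cat"

# Every whole word; the hyphen is not a word character, so "anti-hero" splits.
words = re.findall(r"\b\w+\b", text)

# Only words of 3 to 6 word characters.
medium = re.findall(r"\b\w{3,6}\b", text)

# Words with the prefix "anti", with or without a hyphen.
anti = re.findall(r"\banti-?\w+\b", text)

print(words)
print(medium)
print(anti)
```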

So, hopefully you are beginning to see the point of using Regular Expressions to perform sophisticated searches. Now it's time to look at...

 

... & Replacing

As well as cleverly finding text that would otherwise be a nightmare to look for, REs allow you to specify rules to replace that text based on what exactly was found. By marking sub-expressions in the search RE, you can reference them in a replace expression. The previous example "\b(\w+)\s+\1\b" uses a sub-expression to reference a repeated word. The "(\w+)" part of the RE marks a sub-expression (indicated by the brackets) which is later referred to as \1 (because it is the first sub-expression encountered). This reference can be made in the replace string as well, so that if we choose to replace "\b(\w+)\s+\1\b" with "\1" we will be replacing all double instances with just a single word:

"Paris in the the spring" will become "Paris in the spring"

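The same find and replace can be sketched with Python's re.sub, assumed here as a stand-in for Jujusoft's replace dialog since it accepts the same \1 reference in the replacement string:

```python
import re

# Search RE with one marked sub-expression; \1 in the replacement
# inserts whatever that sub-expression matched.
doubled = re.compile(r"\b(\w+)\s+\1\b")

fixed = doubled.sub(r"\1", "Paris in the the spring")
print(fixed)
```
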
The power of regular expressions can only really be appreciated through experimentation, but with great power comes great responsibility! Be careful not to go doing find/replace operations on important documents unless you know what you are doing... you don't want to find out later that you have unwittingly removed the 4th letter of the last word in each paragraph! (replace "\b(\w{3})\w(\w*\W*)$" with "\1\2")

 

Syntax

The syntax used by Jujusoft is based on Perl Regular Expressions, with a couple of minor additions. Unfortunately Perl syntax (which may have originated elsewhere?) can be a little ugly and confusing, but like a lot of standards, people are used to it, so I'm trying to stay compatible with it as much as possible.

Most characters are considered ordinary, meaning they represent only themselves ('A' is 'A', as well as 'a' if the search is case-insensitive). There are a few characters which are considered special, however (and some of them seem poorly chosen).

 

Special characters

. (period)
In the default mode, this matches any character except a newline.
^ (caret)
Matches the start of the string or the beginning of a line.
$ (dollar sign)
Matches the end of the string, or the end of a line. foo matches both 'foo' and 'foobar', while the regular expression foo$ matches only 'foo'.
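
A minimal sketch of the two anchors, using Python's re module as an assumed stand-in for the Jujusoft engine (the candidate strings are made up for illustration). Python only treats ^ and $ as line anchors when re.MULTILINE is set, whereas the descriptions above suggest Jujusoft does so by default.

```python
import re

candidates = ["foo", "foobar", "a foo"]

# Unanchored: "foo" may appear anywhere in the string.
unanchored = [s for s in candidates if re.search(r"foo", s)]

# foo$ must end the string; ^foo must start it.
at_end = [s for s in candidates if re.search(r"foo$", s)]
at_start = [s for s in candidates if re.search(r"^foo", s)]

# With MULTILINE, ^ also matches at the beginning of every line.
line_starts = re.findall(r"^\w+", "first line\nsecond line", flags=re.MULTILINE)

print(unanchored, at_end, at_start, line_starts)
```
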
* (asterisk)
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.
+ (plus sign)
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.
? (question mark)
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.
*?, +?, ?? (question mark qualifiers)
The "*", "+", and "?" qualifiers are all greedy by default; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding "?" after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
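
The greedy and minimal behaviour is easy to see side by side. This sketch assumes Python's re module, which uses the same *? syntax; it is not the Jujusoft engine itself.

```python
import re

html = "<H1>title</H1>"

# Greedy: .* runs to the last ">" it can reach.
greedy = re.search(r"<.*>", html).group(0)

# Minimal: .*? stops at the first ">" that lets the match succeed.
lazy = re.search(r"<.*?>", html).group(0)

# Repeating the minimal search picks up each tag separately.
tags = re.findall(r"<.*?>", html)

print(greedy, lazy, tags)
```
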
{m,n} (curly braces)
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 "a" characters. Omitting n specifies an infinite upper bound; you can't omit m.
{m,n}?
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match the first 5 "a" characters, while a{3,5}? will match only the first 3 characters.
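
The 'aaaaaa' example can be checked with a short sketch in Python's re module (an assumed stand-in that shares the {m,n} and {m,n}? syntax):

```python
import re

s = "aaaaaa"

greedy = re.match(r"a{3,5}", s).group(0)     # as many as allowed
lazy = re.match(r"a{3,5}?", s).group(0)      # as few as allowed
open_ended = re.match(r"a{3,}", s).group(0)  # n omitted: no upper bound

print(len(greedy), len(lazy), len(open_ended))
```
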
[] (square brackets)
[^] (square brackets with caret)
Used to indicate a set of characters. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a "-". Special characters are not active inside sets. For example, [akm$] will match any of the characters "a", "k", "m", or "$"; [a-z] will match any lowercase letter, and [a-zA-Z0-9] matches any letter or digit. Character classes such as \w or \S (defined below) are also acceptable inside a set. If you want to include a "]" or a "-" inside a set, precede it with a backslash, or place it as the first character. The pattern []] will match ']', for example.
You can match the characters not within a range by complementing the set. This is indicated by including a "^" as the first character of the set; "^" elsewhere will simply match the "^" character. For example, [^5] will match any character except "5".
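
A few sets in action, sketched with Python's re module as an assumed stand-in (the sample strings are invented). The set rules described above are the ones Python follows too, but the Jujusoft engine may differ at the edges.

```python
import re

# "$" has no special meaning inside a set.
listed = re.findall(r"[akm$]", "make $5")

# A range of lowercase letters, repeated.
lower_runs = re.findall(r"[a-z]+", "Hello World 42")

# Complemented set: anything except "5".
not_five = re.findall(r"[^5]", "1525")

# "]" placed first in the set is taken literally.
bracket = re.findall(r"[]]", "a]b")

# A character class and a trailing literal "-" inside a set.
hyphenated = re.findall(r"[\w-]+", "well-known fact")

print(listed, lower_runs, not_five, bracket, hyphenated)
```
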
| (vertical bar)  
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. This can be used inside groups (see below) as well. To match a literal "|", use `| or \|, or enclose it inside a character class, as in [|]. For example, "bill|ted" matches both "bill" and "ted".
([RE]) (parentheses)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals "(" or ")", use \( or \), or enclose them inside a character class: [(] [)].
(?:[RE])
A non-grouping version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
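
Alternation, groups and non-grouping parentheses can be compared in one sketch. It assumes Python's re module as a stand-in; the Jujusoft-specific `| escape is not available there, so only \| and [|] are shown, and the sample phrases are invented.

```python
import re

# Alternation: either branch may match.
either = re.findall(r"bill|ted", "bill and ted")

# Ordinary parentheses: each group's text can be retrieved afterwards.
m = re.search(r"(bill|ted)'s (\w+)", "ted's guitar")
captured = m.groups()

# (?:...) groups the alternatives but stores nothing.
m2 = re.search(r"(?:bill|ted)'s (\w+)", "bill's phone")
uncaptured = m2.groups()

# A literal "|" via a backslash or a character class.
bars = re.findall(r"\|", "a|b") + re.findall(r"[|]", "a|b")

print(either, captured, uncaptured, bars)
```
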
\x[hex-number]
A literal character value. This one is pretty sucky, as there is no definite way to terminate the number.
\X[decimal-number]
A literal character value.
\ (backslash)
Either escapes special characters (same as "`", permitting you to match characters like "*", "?", and so forth), or signals a special sequence; special sequences are discussed below. I recommend not using this character to quote literal characters: e.g. to search for backslash itself, standard syntax would have you use \\, while to search for a dollar sign you would use \$, which only makes sense if you know that $ is already a special character. The ` character (grave accent, usually found in the top left of the keyboard below the tilde) can be used to quote any literal character instead.
 

Escaped characters

These special sequences consist of "\" and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character. For example, \$ matches the character "$"

\[digit]
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches all of 'the the' or '55 55', but not all of 'the end' (note the space after the group; a plain search would still find the shorter match 'e e' inside 'the end'). This special sequence can only be used to match one of the first 9 groups.
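
A small sketch of the back-reference, assuming Python's re module as a stand-in. It separates matching the whole string from searching inside it, because a search is free to settle for a shorter repeated piece.

```python
import re

pattern = re.compile(r"(.+) \1")

# fullmatch requires the entire string to fit the pattern.
whole = {s: pattern.fullmatch(s) is not None
         for s in ("the the", "55 55", "the end")}

# search accepts a match anywhere, so it finds "e e" inside "the end".
partial = pattern.search("the end").group(0)

print(whole, repr(partial))
```
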
\A
Matches only at the start of the text.
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.
\B
Matches the empty string, but only when it is NOT at the beginning or end of a word.
\d
Matches any decimal digit; this is equivalent to the set [0-9].
\D
Matches any non-digit character; this is equivalent to the set [^0-9].
\s
Matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the set [^ \t\n\r\f\v].
\w
Matches any alphanumeric character; this is equivalent to the set [a-zA-Z0-9_].
\W
Matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_].
\Z
Matches only at the end of the text.
\\
Matches a literal backslash.
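
Several of these sequences can be tried together. The sketch below assumes Python's re module as a stand-in; there, \w and \d also accept non-ASCII letters and digits unless re.ASCII is passed, which does not matter for this made-up sample.

```python
import re

sample = "Order 66 shipped\ton 2 May"

digits = re.findall(r"\d+", sample)   # runs of decimal digits
words = re.findall(r"\w+", sample)    # runs of word characters
spaces = re.findall(r"\s", sample)    # each whitespace character

# \b needs a word break at that point, \B forbids one.
inner = re.findall(r"\Bell\B", "hello bell")

# \A and \Z refer to the whole text, even in multi-line mode.
two_lines = "one two\nthree four"
first = re.findall(r"\A\w+", two_lines, flags=re.MULTILINE)
last = re.findall(r"\w+\Z", two_lines, flags=re.MULTILINE)

print(digits, words, spaces, inner, first, last)
```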
 

Jujusoft specific extensions

` (grave accent)
Signifies that the next character should always be matched literally, and does not constitute a special string. Whilst use of this character is Jujusoft-specific (not part of Perl regular expression syntax), I consider it more appropriate than the overused backslash, which is sometimes used as an escape and sometimes as a literal.
\=[digit]
Sets a case value for returning the results of a find. For example, "\b(bill\=1|ted\=2)\b" will match either bill or ted, but will return values 1 or 2 respectively. These values can be referenced in the replace string, e.g. the following will replace "bill" with "billy" and "ted" with "teddles":
Find: \b(bill\=1|ted\=2)\b
Replace: \&\=1y\=2dles
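
Engines without \= can approximate this case-value replace. The following is only an emulation in Python's re module, not Jujusoft syntax: the group names case1 and case2, the SUFFIX table and the add_suffix helper are all hypothetical names chosen for this sketch.

```python
import re

# Hypothetical stand-in for \=1 and \=2: each alternative gets its own
# named group, and the suffix is chosen by whichever group matched.
SUFFIX = {"case1": "y", "case2": "dles"}

pattern = re.compile(r"\b(?:(?P<case1>bill)|(?P<case2>ted))\b")

def add_suffix(match):
    # match.lastgroup is the name of the last group that took part in the match.
    return match.group(0) + SUFFIX[match.lastgroup]

result = pattern.sub(add_suffix, "bill met ted")
print(result)
```
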
\=:[character]
Sets internal state... useful for modal expressions
\=?[character]
Tests internal state, fails if not the same
\<
Step back... will move the search position back one character, always succeeding unless already at beginning of text. "\<back\<" will match the string " bac" within the phrase "go backwards"
\>
Step forwards... will move the search position forward one character, always succeeding unless already at the end of the text. Functionally equivalent to the set [.\r\n]. "\>fore\>" will match the string "orew" in the phrase "go forewards".
\;[comment]
(?#[comment]) - (alternative syntax)
A comment, which is ended by either another ';', a newline character or the end of the string. Ignored by the search engine.
\:[flags]:
(?[flags]) - (alternative syntax)
(One or more letters from the set "i", "I", "g", "G") The group matches the empty string; the letters set the corresponding flags for the entire regular expression. This is useful if you wish to include the flags as part of the regular expression.
 'i' = ignore case (default)
 'I' = don't ignore case
 'g' = be greedy (default) attempt to make as big a match as possible
 'G' = be un-greedy find the shortest match possible
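
For comparison, Python's re module (assumed here as a stand-in) supports the (?i) form of the alternative syntax, but its defaults are reversed: searches are case-sensitive unless the flag is given, and there is no inline equivalent of 'g' or 'G'.

```python
import re

text = "Hello HELLO hello"

# Python default: case matters, unlike the Jujusoft default described above.
exact = re.findall(r"hello", text)

# The inline flag goes at the very start of the expression.
ignoring_case = re.findall(r"(?i)hello", text)

print(exact, ignoring_case)
```
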
(?[digit][RE])
A manual sub-expression assignment... This is the same as a regular sub-expression, except that the digit explicitly defines which number should be used to later reference the found text.
(?=[RE])
Look ahead. The search engine will search for the RE but will not eat it. "nice(?=\s+day)" will match "nice day" (note that the word "day" is not actually included in the match) but not "nice week".
(?![RE])
Look ahead NOT. The search engine will check that the RE does not follow, without eating any text. "nice(?!\s+day)" will match "nice week" (note that the word "week" is not actually included in the match) but not "nice day".
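
Both kinds of look-ahead can be checked with a short sketch. It assumes Python's re module, which uses the same (?=...) and (?!...) syntax; in each successful case the match itself is only "nice".

```python
import re

# Look ahead: "nice" only when whitespace and "day" follow.
ahead_day = re.search(r"nice(?=\s+day)", "nice day")
ahead_week = re.search(r"nice(?=\s+day)", "nice week")

# Look ahead NOT: "nice" only when whitespace and "day" do not follow.
not_day = re.search(r"nice(?!\s+day)", "nice day")
not_week = re.search(r"nice(?!\s+day)", "nice week")

print(ahead_day.group(0), ahead_week, not_day, not_week.group(0))
```
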
(?<[name]>[RE])
Sets up a subexpression which can be referenced by name. "(?<id>\w+)" will match a word which can later be referenced with "(?>id)". Using such identifiers makes writing complex REs easier (meaningful names can be chosen) and will not impact on performance, although it should be noted that there is a limit of 16 such identifiers in any one expression.
(?>[name])
References back to a subexpression by name.
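
Named sub-expressions exist in other engines under slightly different spellings. This sketch assumes Python's re module, where the definition is written (?P<name>...) and the back-reference (?P=name) rather than the Jujusoft forms above; the name "word" is chosen purely for illustration.

```python
import re

# The repeated-word search again, with a named sub-expression.
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

m = doubled.search("in the the spring")
repeated = m.group("word")

# In a Python replacement string the name is referenced as \g<word>.
fixed = doubled.sub(r"\g<word>", "in the the spring")

print(repeated, "->", fixed)
```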