Help: Perl Regular Expressions

Adapted from Perl's perlre.pod


This page describes the syntax of regular expressions in Perl. In particular the following metacharacters have their standard egrep-ish meanings:

    \   Quote the next metacharacter
    ^   Match the beginning of the line
    .   Match any character (except newline)
    $   Match the end of the line (or before newline at the end)
    |   Alternation
    ()  Grouping
    []  Character class

By default, the ``^'' character is guaranteed to match at only the beginning of the string, the ``$'' character at only the end (or before the newline at the end), and Perl does certain optimizations with the assumption that the string contains only one line. Embedded newlines will not be matched by ``^'' or ``$''. You may, however, wish to treat a string as a multi-line buffer, such that the ``^'' will match after any newline within the string, and ``$'' will match before any newline. At the cost of a little more overhead, you can do this by using the /m modifier on the pattern match operator. (Older programs did this by setting $*, but this practice is now deprecated.)

To facilitate multi-line substitutions, the ``.'' character never matches a newline unless you use the /s modifier, which in effect tells Perl to pretend the string is a single line--even if it isn't. The /s modifier also overrides the setting of $*, in case you have some (badly behaved) older code that sets it in another module.
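The effect of /m and /s described above can be seen in a small sketch. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, ^ matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;      # ("first")

# With /m, ^ also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;     # ("first", "second")

# Without /s, "." stops at a newline; with /s it runs across it.
my ($no_s)   = $text =~ /(first.*)/;     # "first line"
my ($with_s) = $text =~ /(first.*)/s;    # the whole string

print "plain: @plain\n";
print "multi: @multi\n";
```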

The following standard quantifiers are recognized:

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

(If a curly bracket occurs in any other context, it is treated as a regular character.) The ``*'' quantifier is equivalent to {0,}, the ``+'' quantifier to {1,}, and the ``?'' quantifier to {0,1}. n and m are limited to integral values less than 65536.

By default, a quantified subpattern is ``greedy'', that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a ``?''. Note that the meanings don't change, just the ``greediness'':

    *?     Match 0 or more times
    +?     Match 1 or more times
    ??     Match 0 or 1 time
    {n}?   Match exactly n times
    {n,}?  Match at least n times
    {n,m}? Match at least n but not more than m times
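A short sketch of the difference between greedy and minimal matching, using an invented sample string:

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: .* takes as much as it can, so the match runs to the last ">".
my ($greedy) = $html =~ /(<.*>)/;

# Minimal: .*? stops at the first ">" that lets the pattern succeed.
my ($minimal) = $html =~ /(<.*?>)/;

# {n,m}? prefers the low end of the range, {n,m} the high end.
my ($few)  = "aaaaaa" =~ /(a{2,4}?)/;    # "aa"
my ($many) = "aaaaaa" =~ /(a{2,4})/;     # "aaaa"

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
print "few=$few many=$many\n";
```

Both forms accept the same set of repetition counts. Only the order in which Perl tries them differs.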

Because patterns are processed as double-quoted strings, the following also work:

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \033        octal char (think of a PDP-11)
    \x1B        hex char
    \c[         control char
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \E          end case modification (think vi)
    \Q          quote (disable) regexp metacharacters till \E

If use locale is in effect, the case map used by \l, \L, \u and \U is taken from the current locale. See perllocale.
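As a rough illustration, the same escapes behave identically in an ordinary double-quoted string and inside a pattern. The variable names and sample text are assumptions made for this sketch.

```perl
use strict;
use warnings;

my $name = "perl";

# Case-modification escapes in a double-quoted string.
my $upper = "\U$name\E rules";    # "PERL rules"
my $cap   = "\u$name";            # "Perl"
my $lower = "\LLOUD\E";           # "loud"

# The same escapes are honoured when a variable is interpolated into a pattern.
my $in_pattern = ("PERL rules" =~ /^\U$name\E /) ? 1 : 0;

# \e, \033 and \x1B all name the same ESC character.
my $esc_ok = ("\e" eq "\033" && "\e[0m" =~ /^\x1B\[/) ? 1 : 0;

print "$upper, $cap, $lower\n";
```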

In addition, Perl defines the following:

    \w  Match a "word" character (alphanumeric plus "_")
    \W  Match a non-word character
    \s  Match a whitespace character
    \S  Match a non-whitespace character
    \d  Match a digit character
    \D  Match a non-digit character

Note that \w matches a single alphanumeric character, not a whole word. To match a word you'd need to say \w+. If use locale is in effect, the list of alphabetic characters generated by \w is taken from the current locale. See perllocale. You may use \w, \W, \s, \S, \d, and \D within character classes (though not as either end of a range).
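A minimal sketch of these classes at work, on an invented input line. Note how \w+ splits ``3.14'' in two, because ``.'' is not a word character.

```perl
use strict;
use warnings;

my $line = "id_42 = 3.14  # pi";

# \w+ collects runs of word characters (letters, digits, underscore).
my @words = $line =~ /(\w+)/g;      # ("id_42", "3", "14", "pi")

# \d+ collects runs of digits only.
my @nums = $line =~ /(\d+)/g;       # ("42", "3", "14")

# \s+ as a field separator.
my @fields = split /\s+/, $line;    # ("id_42", "=", "3.14", "#", "pi")

# \d used inside a character class, together with a literal ".".
my ($value) = $line =~ /=\s*([\d.]+)/;    # "3.14"

print "words:  @words\n";
print "fields: @fields\n";
print "value:  $value\n";
```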

Perl defines the following zero-width assertions:

    \b  Match a word boundary
    \B  Match a non-(word boundary)
    \A  Match at only beginning of string
    \Z  Match at only end of string (or before newline at the end)
    \G  Match only where previous m//g left off (works only with /g)

A word boundary (\b) is defined as a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W. (Within character classes \b represents backspace rather than a word boundary.) The \A and \Z are just like ``^'' and ``$'' except that they won't match multiple times when the /m modifier is used, while ``^'' and ``$'' will match at every internal line boundary. To match the actual end of the string, not ignoring newline, you can use \Z(?!\n). The \G assertion can be used to chain global matches (using m//g), as described in perlop.

It is also useful when writing lex-like scanners, where you have several regexps that you want to match against consecutive substrings of your string; see the previous reference. The actual location where \G will match can also be influenced by using pos() as an lvalue. See perlfunc.
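The following is one possible sketch of a lex-like scanner built on \G. The token names and the input string are invented for illustration. It also uses the /c modifier (described in perlop), which keeps pos() unchanged when a match fails so that the next branch can try at the same position.

```perl
use strict;
use warnings;

# \b keeps "cat" from matching inside "concatenate".
my @cats = "cat concatenate cat." =~ /\bcat\b/g;    # two matches

# Each branch is anchored with \G, so it can only match exactly where
# the previous successful match stopped.
my $src = "x1 = 42 + y";
my @tokens;
for ($src) {
    while (1) {
        if    (/\G\s+/gc)            { next }
        elsif (/\G([A-Za-z_]\w*)/gc) { push @tokens, "IDENT($1)" }
        elsif (/\G(\d+)/gc)          { push @tokens, "NUM($1)" }
        elsif (/\G([=+])/gc)         { push @tokens, "OP($1)" }
        else                         { last }    # end of input or unknown char
    }
}
print "@tokens\n";

# pos() as an lvalue moves the point where \G will match next.
pos($src) = 5;
my $resumed = ($src =~ /\G(\d+)/g) ? $1 : "";    # "42"
```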

When the bracketing construct ( ... ) is used, \<digit> matches the digit'th substring. Outside of the pattern, always use ``$'' instead of ``\'' in front of the digit. (While the \<digit> notation can on rare occasion work outside the current pattern, this should not be relied upon. See the WARNING below.) The scope of $<digit> (and $`, $&, and $') extends to the end of the enclosing BLOCK or eval string, or to the next successful pattern match, whichever comes first. If you want to use parentheses to delimit a subpattern (e.g., a set of alternatives) without saving it as a subpattern, follow the ( with a ?:.

You may have as many parentheses as you wish. If you have more than 9 substrings, the variables $10, $11, ... refer to the corresponding substring. Within the pattern, \10, \11, etc. refer back to substrings if there have been at least that many left parentheses before the backreference. Otherwise (for backward compatibility) \10 is the same as \010, a backspace, and \11 the same as \011, a tab. And so on. (\1 through \9 are always backreferences.)
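A small sketch of the two notations side by side: \1 inside the pattern and $1 outside it (here, in the replacement part of a substitution). The sample sentence is invented.

```perl
use strict;
use warnings;

my $text = "this is is a test of of the the parser";

# Inside the pattern, \1 refers to whatever group 1 matched.
my @doubled = $text =~ /\b(\w+) \1\b/g;    # ("is", "of", "the")

# Outside the pattern (in the replacement), use $1 instead.
(my $fixed = $text) =~ s/\b(\w+) \1\b/$1/g;

print "doubled: @doubled\n";
print "fixed:   $fixed\n";
```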

$+ returns whatever the last bracket match matched. $& returns the entire matched string. ($0 used to return the same thing, but not any more.) $` returns everything before the matched string. $' returns everything after the matched string. Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    if (/Time: (..):(..):(..)/) {
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }
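For completeness, a minimal sketch of the match variables themselves, on an invented string:

```perl
use strict;
use warnings;

my $str = "The quick brown fox";
my ($pre, $match, $post, $last) = ("", "", "", "");

if ($str =~ /(quick) (brown)/) {
    # $` = text before the match, $& = the match itself,
    # $' = text after the match, $+ = the last bracket that matched.
    ($pre, $match, $post, $last) = ($`, $&, $', $+);
}

print "pre=[$pre] match=[$match] post=[$post] last=[$last]\n";
```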

Once perl sees that you need one of $&, $` or $' anywhere in the program, it has to provide them on each and every pattern match. This can slow your program down. The same mechanism that handles these provides for the use of $1, $2, etc., so you pay the same price for each regexp that contains capturing parentheses. But if you never use $&, etc., in your script, then regexps without capturing parentheses won't be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price.

You will note that all backslashed metacharacters in Perl are alphanumeric, such as \b, \w, \n. Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric. So anything that looks like \\, \(, \), \<, \>, \{, or \} is always interpreted as a literal character, not a metacharacter. This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all the non-alphanumeric characters:

    $pattern =~ s/(\W)/\\$1/g;

Now it is much more common to see either the quotemeta() function or the \Q escape sequence used to disable the metacharacters' special meanings, like this:

    /$unquoted\Q$quoted\E$unquoted/
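A brief sketch comparing the older idiom with quotemeta() and \Q. The literal string is an invented example that happens to contain several metacharacters.

```perl
use strict;
use warnings;

my $literal = "1+1=2 (really?)";
my $subject = "is 1+1=2 (really?) true";

# The older idiom: backslash every non-word character by hand.
(my $manual = $literal) =~ s/(\W)/\\$1/g;

# The built-in equivalent (identical for this plain ASCII string).
my $quoted = quotemeta($literal);
my $same   = ($manual eq $quoted) ? 1 : 0;

# With \Q...\E the string is matched literally.
my $hit = ($subject =~ /\Q$literal\E/) ? 1 : 0;

# Without quoting, "+", "(", "?" and ")" keep their special meanings,
# so the pattern no longer describes the literal text.
my $miss = ($subject =~ /$literal/) ? 1 : 0;

print "same=$same hit=$hit miss=$miss\n";
```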

Perl defines a consistent extension syntax for regular expressions. The syntax is a pair of parentheses with a question mark as the first thing within the parentheses (this was a syntax error in older versions of Perl). The character after the question mark gives the function of the extension. Several extensions are already supported:

(?#text)
A comment. The text is ignored. If the /x switch is used to enable whitespace formatting, a simple # will suffice.

(?:regexp)
This groups things like ``()'' but doesn't make backreferences like ``()'' does. So

    split(/\b(?:a|b|c)\b/)

is like

    split(/\b(a|b|c)\b/)

but doesn't spit out extra fields.
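To make the difference concrete, here is a sketch with an invented input string. The pattern adds \s* on each side so the fields come out without surrounding spaces.

```perl
use strict;
use warnings;

my $s = "x a y b z";

# Non-capturing group: only the fields between separators are returned.
my @without = split /\s*\b(?:a|b|c)\b\s*/, $s;    # ("x", "y", "z")

# Capturing group: each separator's captured text is returned as well.
my @with = split /\s*\b(a|b|c)\b\s*/, $s;         # ("x", "a", "y", "b", "z")

print "without: @without\n";
print "with:    @with\n";
```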

(?=regexp)
A zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in $&.
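A minimal sketch of that example on an invented tab-separated row. Only the words that sit directly before a tab are returned, and the tab is not part of the matched text.

```perl
use strict;
use warnings;

my $row = "alpha\tbeta gamma\tdelta";

# The lookahead checks for the tab but does not consume it.
my @before_tab = $row =~ /\w+(?=\t)/g;    # ("alpha", "gamma")

print "before tab: @before_tab\n";
```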

(?!regexp)
A zero-width negative lookahead assertion. For example, /foo(?!bar)/ matches any occurrence of ``foo'' that isn't followed by ``bar''. Note, however, that lookahead and lookbehind are NOT the same thing. You cannot use this for lookbehind: /(?!foo)bar/ will not find an occurrence of ``bar'' that is preceded by something which is not ``foo''. That's because the (?!foo) is just saying that the next thing cannot be ``foo''--and it's not, it's a ``bar'', so ``foobar'' will match. You would have to do something like /(?!foo)...bar/ for that. We say ``like'' because there's the case of your ``bar'' not having three characters before it. You could cover that this way: /(?:(?!foo)...|^.{0,2})bar/ (the second alternative allows zero, one, or two characters between the start of the string and ``bar''). Sometimes it's still easier just to say:

    if (/bar/ && $` !~ /foo$/)
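The behaviour of these variants can be compared in a small sketch. The test strings are invented. The padded pattern here uses ^.{0,2} as its second alternative so that a ``bar'' at the very start of the string is covered too, and the last variant finds ``bar'' and then checks that the text before it does not end in ``foo''.

```perl
use strict;
use warnings;

my @inputs = ("foobar", "bazbar", "bar", "xbar");
my (%naive, %padded, %simple);

for my $s (@inputs) {
    # Lookahead in front of "bar" only inspects "bar" itself, so it always passes.
    $naive{$s} = ($s =~ /(?!foo)bar/) ? 1 : 0;

    # Look at the three characters before "bar", or allow a short prefix.
    $padded{$s} = ($s =~ /(?:(?!foo)...|^.{0,2})bar/) ? 1 : 0;

    # Find "bar", then test the text that precedes it.
    $simple{$s} = ($s =~ /bar/ && $` !~ /foo$/) ? 1 : 0;
}

for my $s (@inputs) {
    print "$s: naive=$naive{$s} padded=$padded{$s} simple=$simple{$s}\n";
}
```

Only the naive pattern accepts ``foobar''. The other two reject it and accept the remaining three strings.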

(?imsx)
One or more embedded pattern-match modifiers. This is particularly useful for patterns that are specified in a table somewhere, some of which want to be case sensitive, and some of which don't. The case insensitive ones need to include merely (?i) at the front of the pattern. For example:

    $pattern = "foobar";
    if ( /$pattern/i )

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ )
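A sketch of the table-driven case mentioned above, with an invented two-entry pattern table: one entry opts in to case-insensitive matching through (?i), the other stays case sensitive, and the matching code is the same for both.

```perl
use strict;
use warnings;

# Hypothetical pattern table; each entry carries its own modifiers.
my @table    = ('(?i)foobar', 'Exact');
my @subjects = ('FooBar', 'exact', 'Exact');

my %hits;
for my $pat (@table) {
    # No /i here: case sensitivity is decided by the pattern text itself.
    $hits{$pat} = [ grep { /$pat/ } @subjects ];
}

for my $pat (@table) {
    print "$pat => @{ $hits{$pat} }\n";
}
```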

The specific choice of question mark for this and the new minimal matching construct was because 1) question mark is pretty rare in older regular expressions, and 2) whenever you see one, you should stop and ``question'' exactly what is going on. That's psychology...


see also:
perlre.pod hosted at www.perldoc.com.
Mastering Regular Expressions by Jeffrey Friedl.
Reguläre Ausdrücke by Stefan Münz (German).
