Perl Compatible Regular Expressions - Features

Features

PCRE has developed an extensive and in some ways unique feature set. Although it originally aimed at feature equivalence with Perl, over time a number of features were first implemented in PCRE and only much later added to Perl. During the PCRE 7.x and Perl 5.9.x (development track) phase, the two projects coordinated development and are, to the extent possible, feature-equivalent. In some cases PCRE included in mainline releases features that originated with Perl 5.9.x, and in other cases Perl 5.9.x included features that were previously available only in PCRE.

PCRE includes the following features:

Just-in-time compiler support
This optional feature is available in version 8.20 and above if enabled when the PCRE library is built. Large performance benefits can be expected when, for example, the calling program uses the feature with compatible patterns that are executed repeatedly. The just-in-time compiler support was written by Zoltan Herczeg and is not addressed in the POSIX or C++ wrappers.
Consistent escaping rules
Like Perl, PCRE has consistent escaping rules: any non-alphanumeric character may be escaped to mean its literal value by preceding it with a \ (backslash), and conversely, any alphanumeric character preceded by a backslash typically has a special meaning. Where the sequence has not been defined as special, it is also treated as a literal; however, this usage is not forward compatible, because new versions of PCRE may give such sequences a special meaning. A good example of this is \R, which had no special meaning prior to PCRE 7. In POSIX regular expressions, by contrast, a backslash sometimes escapes a non-alphanumeric character (e.g. \.) and sometimes introduces a special feature (e.g. \(\)).
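The literal-versus-special distinction can be tried out in a few lines. This is a minimal sketch that assumes Python's standard re module rather than PCRE itself; the two engines are different implementations, but they treat an escaped punctuation character and \d the same way.

```python
import re

# An unescaped dot is a metacharacter: it matches any character except a newline.
unescaped_matches_any = re.fullmatch(r"a.b", "axb") is not None

# Escaping the non-alphanumeric dot gives it its literal value.
escaped_matches_dot = re.fullmatch(r"a\.b", "a.b") is not None
escaped_rejects_other = re.fullmatch(r"a\.b", "axb") is None

# A backslash before an alphanumeric character gives it a special meaning:
# \d is the digit class, not the letter "d".
backslash_d_is_digit = re.fullmatch(r"\d", "7") is not None
backslash_d_not_letter = re.fullmatch(r"\d", "d") is None
```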
Extended character classes
Single-letter character classes are supported in addition to the longer POSIX names. For example, \d matches any digit exactly as [[:digit:]] would in POSIX regular expressions.
Minimal matching (a.k.a. “ungreedy”)
A ? may be placed after any repetition quantifier to indicate that the shortest match should be used. The default is to attempt the longest match first and backtrack through shorter matches. For example, "a.*?b" would match "ab" in "ababab", whereas "a.*b" would match the entire string.
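The difference between the two quantifier forms is easy to check. The sketch below assumes Python's standard re module, which is not PCRE but uses the same ? suffix for minimal matching.

```python
import re

subject = "ababab"

# Lazy quantifier: stop at the first "b" that lets the match succeed.
lazy = re.search(r"a.*?b", subject).group()

# Greedy quantifier (the default): take as much as possible, then backtrack.
greedy = re.search(r"a.*b", subject).group()

print(lazy)    # ab
print(greedy)  # ababab
```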
Unicode character properties
Unicode defines several properties for each character, and patterns in PCRE can match these properties. For example, \p{Ps}.*?\p{Pe} would match a string beginning with any "opening punctuation" and ending with any "close punctuation", such as "[abc]". Since version 8.10, matching of certain "normal" metacharacters can be driven by Unicode properties when the compile option PCRE_UCP is set. The option can be set for a pattern by including (*UCP) at the start of the pattern. It alters the behavior of the following metacharacters: \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. For example, the set of characters matched by \w (word characters) is expanded to include letters and accented letters as defined by Unicode properties. Such matching is slower than the normal (ASCII-only) non-UCP alternative. Note that the UCP option requires the PCRE library to have been built with UTF-8 and Unicode property support. Support for UTF-16 was added in version 8.30, and support for UTF-32 in version 8.32.
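The effect of switching \w between ASCII-only and Unicode-driven matching can be illustrated by analogy. The sketch below assumes Python's standard re module, not PCRE: Python has no (*UCP) option, and its default is the reverse of PCRE's (Unicode-aware unless re.ASCII is given), so it only mirrors the distinction described above.

```python
import re

# "naive cafe" written with accented letters (U+00EF and U+00E9).
text = "na\u00efve caf\u00e9"

# ASCII-only \w, comparable to PCRE without UCP: accented letters split words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode-aware \w, comparable to PCRE with UCP: accented letters are word characters.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)    # ['na', 've', 'caf']
print(unicode_words)  # ['naïve', 'café']
```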
Multiline matching
^ and $ can match at the beginning and end of a string only, or at the start and end of each "line" within the string depending on what options are set.
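A short sketch of the two anchoring behaviors, assuming Python's standard re module rather than PCRE; the re.MULTILINE flag plays the role of the multiline option here.

```python
import re

subject = "one\ntwo\nthree"

# Default: ^ and $ anchor to the whole string, so no single word spans it.
whole_string = re.findall(r"^\w+$", subject)

# Multiline: ^ and $ also match at the start and end of each line.
per_line = re.findall(r"^\w+$", subject, flags=re.MULTILINE)

print(whole_string)  # []
print(per_line)      # ['one', 'two', 'three']
```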
Newline/linebreak options
When PCRE is compiled, a newline default is selected. The newline/linebreak in effect determines where PCRE detects line beginnings (^) and line ends ($) in multiline mode, as well as what the dot matches (regardless of multiline mode, unless the dotall option (?s) is set). It also affects PCRE's matching procedure (since version 7.0): when an unanchored pattern fails to match at the start of a newline sequence, PCRE advances past the entire newline sequence before retrying the match. If the newline option in effect includes CRLF as one of the valid linebreaks, PCRE does not skip the \n in a CRLF if the pattern contains specific \r or \n references (since version 7.3). Since version 8.10, the metacharacter \N always matches any character other than linebreak characters. It has the same behavior as "." when the dotall option (?s) is not in effect.
The newline option can be altered with external options when a pattern is compiled as well as when it is run. Few applications using PCRE provide users with the means to apply this setting via an external option. So, new in version 7.3, the newline option can also be stated at the start of the pattern using one of the following:
(*LF) Newline is a linefeed character. Corresponding linebreaks can be matched with \n.
(*CR) Newline is a carriage return. Corresponding linebreaks can be matched with \r.
(*CRLF) Newline/linebreak is a carriage return followed by a linefeed. Corresponding linebreaks can be matched with \r\n.
(*ANYCRLF) Any of the above encountered in the data will trigger newline processing. Corresponding linebreaks can be matched with (?>\r\n|\n|\r) or with \R. See below for configuration and options concerning what matches Backslash-R.
(*ANY) Any of the above plus special Unicode linebreaks. When not in UTF-8 mode, corresponding linebreaks can be matched with (?>\r\n|\n|\x0b|\f|\r|\x85) or \R. In UTF-8 mode, two additional characters are recognized as line breaks with (*ANY): LS (line separator, U+2028), and PS (paragraph separator, U+2029). On Windows, in non-Unicode data, some of the ANY linebreak characters have other meanings. For example, \x85 can match a horizontal ellipsis, and if encountered while the ANY newline is in effect, it would trigger newline processing. See below for configuration and options concerning what matches Backslash-R.
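The three linebreak sequences recognized under (*ANYCRLF) can be matched with an alternation in which \r\n is tried first. The sketch below assumes Python's standard re module, not PCRE, and uses a plain non-capturing group because atomic groups are not available in every Python version; the variable names are illustrative only.

```python
import re

# CRLF must come first so that "\r\n" is consumed as one linebreak.
ANYCRLF = r"(?:\r\n|\n|\r)"

lines = re.split(ANYCRLF, "a\r\nb\nc\rd")

# With the alternatives in the wrong order, "\r\n" is seen as two linebreaks,
# which produces a spurious empty line.
wrong_order = re.split(r"(?:\n|\r|\r\n)", "a\r\nb")

print(lines)        # ['a', 'b', 'c', 'd']
print(wrong_order)  # ['a', '', 'b']
```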
Backslash-R options
New in version 7.4: when PCRE is compiled, a default is selected for what matches \R. The default can be either to match the linebreaks associated with ANYCRLF or those corresponding to ANY. The default can be overridden when necessary by including (*BSR_UNICODE) or (*BSR_ANYCRLF) at the start of the pattern. A (*BSR..) option can be combined with a newline option, e.g., (*BSR_UNICODE)(*ANY)rest-of-pattern. The Backslash-R options can also be changed with external options by the application calling PCRE, when a pattern is compiled as well as when it is run.
Beginning of pattern options
The options that may appear at the start of a pattern are: linebreak options such as (*LF), documented above; Backslash-R options such as (*BSR_ANYCRLF), documented above; the Unicode character properties option (*UCP), documented above; and the (*UTF8) option. Since version 7.9, if your PCRE library has been compiled with UTF-8 support, you can specify the (*UTF8) option at the beginning of a pattern instead of setting an external option to invoke UTF-8 mode.
Named subpatterns
A subpattern (surrounded by parentheses, like (...)) may be named by including a leading "?P<name>" after the opening parenthesis. Named subpatterns are a feature that PCRE adopted from Python regular expressions. Since PCRE 7.0, named groups can be defined using (?<name>...) or (?'name'...) as well as (?P<name>...). Named groups can then be referenced by name, for example with (?P=name) as a backreference or (?P>name) as a subroutine call.
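Because the ?P syntax originated in Python, it can be shown directly with Python's standard re module. This is a minimal sketch of defining named groups and referring back to one by name; the group names and sample strings are made up for illustration.

```python
import re

# Define two named groups and read them back by name.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.search("released 2007-11")
year, month = m.group("year"), m.group("month")

# (?P=word) is a backreference to whatever text the group "word" matched.
doubled = re.compile(r"(?P<word>\w+) (?P=word)")
repeated = doubled.search("it is is fine").group("word")

print(year, month)  # 2007 11
print(repeated)     # is
```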
Backreferences
A pattern may refer back to the results of a previous match. For example, (a|b)c\1 would match "a" or "b" followed by a "c", and would then look for the same character (an "a" or a "b") that matched in the first subpattern. It therefore matches "aca" and "bcb" but not "acb".
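The backreference example can be checked as follows. The sketch assumes Python's standard re module, not PCRE; the \1 syntax is common to both.

```python
import re

pattern = re.compile(r"(a|b)c\1")

# \1 must repeat the exact text captured by group 1, not merely match (a|b) again.
results = {s: pattern.fullmatch(s) is not None for s in ["aca", "bcb", "acb", "bca"]}

print(results)  # {'aca': True, 'bcb': True, 'acb': False, 'bca': False}
```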
Subroutines
While a backreference provides a mechanism to refer to that part of the subject that has previously matched a subpattern, a subroutine provides a mechanism to reuse an underlying previously defined subpattern. The subpattern's options, such as case independence, are fixed when the subpattern is defined. (a.c)(?1) would match aacabc or abcadc, whereas the backreference form (a.c)\1 would not, though both would match aacaac or abcabc. Starting with version 7.7, PCRE also supports a non-Perl Oniguruma construct for subroutines. They are specified using \g<number> or \g<name>.
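Python's standard re module has no subroutine calls, so the sketch below only approximates (?1) by pasting the subpattern text in a second time. That is an assumption made for illustration, not how PCRE implements subroutines, but it shows the contrast with a backreference: one re-applies the pattern, the other re-matches the captured text.

```python
import re

sub = r"a.c"  # the subpattern to reuse

# Approximation of (a.c)(?1): the pattern a.c is applied a second time.
reapply = re.compile(f"({sub}){sub}")

# Backreference: the second part must equal the text captured the first time.
backref = re.compile(r"(a.c)\1")

for s in ["aacabc", "abcadc", "aacaac", "abcabc"]:
    print(s, reapply.fullmatch(s) is not None, backref.fullmatch(s) is not None)
```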
Atomic grouping
Atomic grouping is a way of preventing backtracking in a pattern. For example, (?>a+)bc, which can also be written with a possessive quantifier as a++bc, will match as many "a"s as possible and never back up to try one fewer.
Look-ahead and look-behind assertions
Patterns may assert that previous text or subsequent text contains a pattern without consuming matched text (zero-width assertion). For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab.
Look-behind assertions cannot be of uncertain length.
Since version 7.2, \K can be used in a pattern to reset the start of the current whole match. This provides a flexible alternative approach to look-behind assertions because the discarded part of the match (the part that precedes \K) need not be fixed in length.
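The look-ahead example and the fixed-length restriction on look-behind can be sketched as follows. This assumes Python's standard re module, not PCRE; Python has no \K, so only the assertions themselves are shown, and the sample strings are made up.

```python
import re

# Look-ahead: the tab must follow, but it is not part of the match.
m = re.search(r"\w+(?=\t)", "name\tvalue")
word, end = m.group(), m.end()

# Fixed-length look-behind: the "$" must precede, but is not consumed.
price = re.search(r"(?<=\$)\d+", "cost: $42").group()

# A look-behind of uncertain length is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_length_accepted = True
except re.error:
    variable_length_accepted = False

print(word, end)                 # name 4
print(price)                     # 42
print(variable_length_accepted)  # False
```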
Escape sequences for zero-width assertions
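For example, \b matches zero-width "word boundaries". It is similar to (?<=\W)(?=\w)|(?<=\w)(?=\W), except that \b also matches at the very start or end of the subject when the adjacent character is a word character.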
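The look-around form is only similar to \b, not identical: it requires a non-word character on one side, so it misses boundaries at the very start and end of the subject. The sketch below assumes Python's standard re module, not PCRE, and lists the positions at which each form matches.

```python
import re

subject = "hi there"

# Positions where \b matches (zero-width, so start == end).
b_positions = [m.start() for m in re.finditer(r"\b", subject)]

# Positions where the look-around approximation matches.
lookaround = r"(?<=\W)(?=\w)|(?<=\w)(?=\W)"
la_positions = [m.start() for m in re.finditer(lookaround, subject)]

print(b_positions)   # [0, 2, 3, 8]
print(la_positions)  # [2, 3]
```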
Comments
A comment begins with (?# and ends at the next closing parenthesis.
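A small check that an inline comment does not change what a pattern matches. The sketch assumes Python's standard re module, not PCRE; both accept the (?#...) form.

```python
import re

# The comment runs from "(?#" to the next ")" and is ignored by the matcher.
with_comment = re.compile(r"\d+(?# one or more digits)px")
plain = re.compile(r"\d+px")

same_result = (
    with_comment.fullmatch("12px") is not None
    and plain.fullmatch("12px") is not None
    and with_comment.fullmatch("px") is None
    and plain.fullmatch("px") is None
)
```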
Recursive patterns
A pattern can refer back to itself recursively or to any subpattern. For example, the pattern "\((a*|(?R))*\)" will match any combination of balanced parentheses and "a"s.
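Python's standard re module does not support (?R), so the sketch below is not a regular expression at all. It is a hand-written recursive matcher, with hypothetical function names, that accepts the strings described by "\((a*|(?R))*\)" when the whole subject must match. It is meant only to make the recursive structure of the pattern explicit.

```python
def match_at(s: str, i: int = 0) -> int:
    """Return the index just past one balanced group starting at s[i], or -1."""
    if i >= len(s) or s[i] != "(":
        return -1
    i += 1  # consume "("
    while i < len(s) and s[i] != ")":
        if s[i] == "a":
            i += 1  # the a* alternative
        elif s[i] == "(":
            i = match_at(s, i)  # the (?R) alternative: recurse into a nested group
            if i == -1:
                return -1
        else:
            return -1  # any other character is not allowed
    if i < len(s):
        return i + 1  # consume ")"
    return -1  # ran out of input before the closing ")"


def is_balanced(s: str) -> bool:
    """True if the whole string is a single balanced group of parentheses and "a"s."""
    return match_at(s, 0) == len(s)
```

For instance, is_balanced("(a(aa)()a)") is True, while is_balanced("(a") and is_balanced("(b)") are False.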
Generic callouts
PCRE expressions can embed "(?Cn)" where n is some number. This will call out to an external, user-defined function through the PCRE API, and can be used to embed arbitrary code in a pattern.
