- .TH PCRE2SYNTAX 3 "28 December 2019" "PCRE2 10.35"
- .SH NAME
- PCRE2 - Perl-compatible regular expressions (revised API)
- .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
- .rs
- .sp
- The full syntax and semantics of the regular expressions that are supported by
- PCRE2 are described in the
- .\" HREF
- \fBpcre2pattern\fP
- .\"
- documentation. This document contains a quick-reference summary of the syntax.
- .
- .
- .SH "QUOTING"
- .rs
- .sp
- \ex where x is non-alphanumeric is a literal x
- \eQ...\eE treat enclosed characters as literal
- .
- .
- .SH "ESCAPED CHARACTERS"
- .rs
- .sp
- This table applies to ASCII and Unicode environments. An unrecognized escape
- sequence causes an error.
- .sp
- \ea alarm, that is, the BEL character (hex 07)
- \ecx "control-x", where x is any ASCII printing character
- \ee escape (hex 1B)
- \ef form feed (hex 0C)
- \en newline (hex 0A)
- \er carriage return (hex 0D)
- \et tab (hex 09)
- \e0dd character with octal code 0dd
- \eddd character with octal code ddd, or backreference
- \eo{ddd..} character with octal code ddd..
- \eN{U+hh..} character with Unicode code point hh.. (Unicode mode only)
- \exhh character with hex code hh
- \ex{hh..} character with hex code hh..
- .sp
- If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
- following are also recognized:
- .sp
- \eU the character "U"
- \euhhhh character with hex code hhhh
- \eu{hh..} character with hex code hh.. but only for EXTRA_ALT_BSUX
- .sp
- When \ex is not followed by {, from zero to two hexadecimal digits are read,
- but in ALT_BSUX mode \ex must be followed by two hexadecimal digits to be
- recognized as a hexadecimal escape; otherwise it matches a literal "x".
- Likewise, if \eu (in ALT_BSUX mode) is not followed by four hexadecimal digits
- or (in EXTRA_ALT_BSUX mode) a sequence of hex digits in curly brackets, it
- matches a literal "u".
- .P
- Note that \e0dd is always an octal code. The treatment of backslash followed by
- a non-zero digit is complicated; for details see the section
- .\" HTML <a href="pcre2pattern.html#digitsafterbackslash">
- .\" </a>
- "Non-printing characters"
- .\"
- in the
- .\" HREF
- \fBpcre2pattern\fP
- .\"
- documentation, where details of escape processing in EBCDIC environments are
- also given. \eN{U+hh..} is synonymous with \ex{hh..} in PCRE2 but is not
- supported in EBCDIC environments. Note that \eN not followed by an opening
- curly bracket has a different meaning (see below).
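As a rough illustration of the numeric escapes in the table above, the sketch below uses Python's standard re module. That module is a different engine from PCRE2 and is used here only as a stand-in, on the assumption that these particular forms (two-digit hex, three-digit octal, octal with a leading zero, and \t) read the same way in both.

```python
import re

# \x41 is hex 41 ("A"), \101 is octal 101 ("A"),
# \011 is octal 011 (tab), and \t is the tab escape itself.
pattern = re.compile(r"\x41\101\011\t")

# The subject is written with ordinary Python string escapes.
matched = pattern.fullmatch("AA\t\t") is not None
print(matched)
```

Forms such as \e, \o{...}, \x{...} and \N{U+...} are not accepted by Python's re module, so they are not shown.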
- .
- .
- .SH "CHARACTER TYPES"
- .rs
- .sp
- . any character except newline;
-     in dotall mode, any character whatsoever
- \eC one code unit, even in UTF mode (best avoided)
- \ed a decimal digit
- \eD a character that is not a decimal digit
- \eh a horizontal white space character
- \eH a character that is not a horizontal white space character
- \eN a character that is not a newline
- \ep{\fIxx\fP} a character with the \fIxx\fP property
- \eP{\fIxx\fP} a character without the \fIxx\fP property
- \eR a newline sequence
- \es a white space character
- \eS a character that is not a white space character
- \ev a vertical white space character
- \eV a character that is not a vertical white space character
- \ew a "word" character
- \eW a "non-word" character
- \eX a Unicode extended grapheme cluster
- .sp
- \eC is dangerous because it may leave the current matching point in the middle
- of a UTF-8 or UTF-16 character. The application can lock out the use of \eC by
- setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
- with the use of \eC permanently disabled.
- .P
- By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
- or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
- happening, \es and \ew may also match characters with code points in the range
- 128-255. If the PCRE2_UCP option is set, the behaviour of these escape
- sequences is changed to use Unicode properties and they match many more
- characters.
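The ASCII-versus-Unicode distinction for \ed can be sketched with Python's standard re module, which is not PCRE2 and has the opposite default: for str patterns, \d is Unicode-aware unless re.ASCII is given. In this sketch re.ASCII is assumed to play the role of the PCRE2 default, and the flagless pattern the role of PCRE2_UCP.

```python
import re

# ARABIC-INDIC DIGIT THREE and FOUR: decimal digits outside ASCII.
arabic_indic = "\u0663\u0664"

ascii_digits = re.compile(r"\d+", re.ASCII)   # comparable to the PCRE2 default
unicode_digits = re.compile(r"\d+")           # comparable to PCRE2_UCP

ascii_result = ascii_digits.fullmatch(arabic_indic)
unicode_result = unicode_digits.fullmatch(arabic_indic)

print(ascii_result is None, unicode_result is not None)
```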
- .
- .
- .SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
- .rs
- .sp
- C Other
- Cc Control
- Cf Format
- Cn Unassigned
- Co Private use
- Cs Surrogate
- .sp
- L Letter
- Ll Lower case letter
- Lm Modifier letter
- Lo Other letter
- Lt Title case letter
- Lu Upper case letter
- L& Ll, Lu, or Lt
- .sp
- M Mark
- Mc Spacing mark
- Me Enclosing mark
- Mn Non-spacing mark
- .sp
- N Number
- Nd Decimal number
- Nl Letter number
- No Other number
- .sp
- P Punctuation
- Pc Connector punctuation
- Pd Dash punctuation
- Pe Close punctuation
- Pf Final punctuation
- Pi Initial punctuation
- Po Other punctuation
- Ps Open punctuation
- .sp
- S Symbol
- Sc Currency symbol
- Sk Modifier symbol
- Sm Mathematical symbol
- So Other symbol
- .sp
- Z Separator
- Zl Line separator
- Zp Paragraph separator
- Zs Space separator
- .
- .
- .SH "PCRE2 SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
- .rs
- .sp
- Xan Alphanumeric: union of properties L and N
- Xps POSIX space: property Z or tab, NL, VT, FF, CR
- Xsp Perl space: property Z or tab, NL, VT, FF, CR
- Xuc Universally-named character: one that can be
-     represented by a Universal Character Name
- Xwd Perl word: property Xan or underscore
- .sp
- Perl and POSIX space are now the same. Perl added VT to its space character set
- at release 5.18.
- .
- .
- .SH "SCRIPT NAMES FOR \ep AND \eP"
- .rs
- .sp
- Adlam,
- Ahom,
- Anatolian_Hieroglyphs,
- Arabic,
- Armenian,
- Avestan,
- Balinese,
- Bamum,
- Bassa_Vah,
- Batak,
- Bengali,
- Bhaiksuki,
- Bopomofo,
- Brahmi,
- Braille,
- Buginese,
- Buhid,
- Canadian_Aboriginal,
- Carian,
- Caucasian_Albanian,
- Chakma,
- Cham,
- Cherokee,
- Chorasmian,
- Common,
- Coptic,
- Cuneiform,
- Cypriot,
- Cyrillic,
- Deseret,
- Devanagari,
- Dives_Akuru,
- Dogra,
- Duployan,
- Egyptian_Hieroglyphs,
- Elbasan,
- Elymaic,
- Ethiopic,
- Georgian,
- Glagolitic,
- Gothic,
- Grantha,
- Greek,
- Gujarati,
- Gunjala_Gondi,
- Gurmukhi,
- Han,
- Hangul,
- Hanifi_Rohingya,
- Hanunoo,
- Hatran,
- Hebrew,
- Hiragana,
- Imperial_Aramaic,
- Inherited,
- Inscriptional_Pahlavi,
- Inscriptional_Parthian,
- Javanese,
- Kaithi,
- Kannada,
- Katakana,
- Kayah_Li,
- Kharoshthi,
- Khitan_Small_Script,
- Khmer,
- Khojki,
- Khudawadi,
- Lao,
- Latin,
- Lepcha,
- Limbu,
- Linear_A,
- Linear_B,
- Lisu,
- Lycian,
- Lydian,
- Mahajani,
- Makasar,
- Malayalam,
- Mandaic,
- Manichaean,
- Marchen,
- Masaram_Gondi,
- Medefaidrin,
- Meetei_Mayek,
- Mende_Kikakui,
- Meroitic_Cursive,
- Meroitic_Hieroglyphs,
- Miao,
- Modi,
- Mongolian,
- Mro,
- Multani,
- Myanmar,
- Nabataean,
- Nandinagari,
- New_Tai_Lue,
- Newa,
- Nko,
- Nushu,
- Nyakeng_Puachue_Hmong,
- Ogham,
- Ol_Chiki,
- Old_Hungarian,
- Old_Italic,
- Old_North_Arabian,
- Old_Permic,
- Old_Persian,
- Old_Sogdian,
- Old_South_Arabian,
- Old_Turkic,
- Oriya,
- Osage,
- Osmanya,
- Pahawh_Hmong,
- Palmyrene,
- Pau_Cin_Hau,
- Phags_Pa,
- Phoenician,
- Psalter_Pahlavi,
- Rejang,
- Runic,
- Samaritan,
- Saurashtra,
- Sharada,
- Shavian,
- Siddham,
- SignWriting,
- Sinhala,
- Sogdian,
- Sora_Sompeng,
- Soyombo,
- Sundanese,
- Syloti_Nagri,
- Syriac,
- Tagalog,
- Tagbanwa,
- Tai_Le,
- Tai_Tham,
- Tai_Viet,
- Takri,
- Tamil,
- Tangut,
- Telugu,
- Thaana,
- Thai,
- Tibetan,
- Tifinagh,
- Tirhuta,
- Ugaritic,
- Vai,
- Wancho,
- Warang_Citi,
- Yezidi,
- Yi,
- Zanabazar_Square.
- .
- .
- .SH "CHARACTER CLASSES"
- .rs
- .sp
- [...] positive character class
- [^...] negative character class
- [x-y] range (can be used for hex characters)
- [[:xxx:]] positive POSIX named set
- [[:^xxx:]] negative POSIX named set
- .sp
- alnum alphanumeric
- alpha alphabetic
- ascii 0-127
- blank space or tab
- cntrl control character
- digit decimal digit
- graph printing, excluding space
- lower lower case letter
- print printing, including space
- punct printing, excluding alphanumeric
- space white space
- upper upper case letter
- word same as \ew
- xdigit hexadecimal digit
- .sp
- In PCRE2, POSIX character set names recognize only ASCII characters by default,
- but some of them use Unicode properties if PCRE2_UCP is set. You can use
- \eQ...\eE inside a character class.
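Positive classes, negative classes and ranges written with hex escapes can be tried out with Python's standard re module, used here as a stand-in for PCRE2 on the assumption that this subset of class syntax is common to both. POSIX named sets such as [[:digit:]] and \Q...\E are not available in that module and are not shown.

```python
import re

# A range given with hex escapes (\x30-\x39 is 0-9) plus a literal range a-f.
hex_runs = re.findall(r"[\x30-\x39a-f]+", "0x1f, zz, 7e")

# A negative class: runs of anything that is not a comma or white space.
fields = re.findall(r"[^,\s]+", "a, b,c")

print(hex_runs)
print(fields)
```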
- .
- .
- .SH "QUANTIFIERS"
- .rs
- .sp
- ? 0 or 1, greedy
- ?+ 0 or 1, possessive
- ?? 0 or 1, lazy
- * 0 or more, greedy
- *+ 0 or more, possessive
- *? 0 or more, lazy
- + 1 or more, greedy
- ++ 1 or more, possessive
- +? 1 or more, lazy
- {n} exactly n
- {n,m} at least n, no more than m, greedy
- {n,m}+ at least n, no more than m, possessive
- {n,m}? at least n, no more than m, lazy
- {n,} n or more, greedy
- {n,}+ n or more, possessive
- {n,}? n or more, lazy
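The difference between greedy and lazy quantifiers is easiest to see side by side. This is a minimal sketch using Python's standard re module as a stand-in for PCRE2; the greedy and lazy forms are assumed to behave alike in both engines. Possessive forms are left out because older Python versions do not accept them.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then gives back just enough.
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can.
lazy = re.match(r"<.+?>", subject).group()

# The same contrast with a counted repeat.
greedy_count = re.match(r"a{2,3}", "aaaa").group()
lazy_count = re.match(r"a{2,3}?", "aaaa").group()

print(greedy, lazy, greedy_count, lazy_count)
```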
- .
- .
- .SH "ANCHORS AND SIMPLE ASSERTIONS"
- .rs
- .sp
- \eb word boundary
- \eB not a word boundary
- ^ start of subject
-     also after an internal newline in multiline mode
-     (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
- \eA start of subject
- $ end of subject
-     also before newline at end of subject
-     also before internal newline in multiline mode
- \eZ end of subject
-     also before newline at end of subject
- \ez end of subject
- \eG first matching position in subject
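The effect of multiline mode on ^ and $ can be sketched with Python's standard re module, a different engine used here as a stand-in. The assumption is that ^, $, \A and \b behave as listed above in both engines. \Z is deliberately avoided, because in Python it matches only at the very end and so corresponds to \z in this table.

```python
import re

text = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
default_caret = re.findall(r"^\w+", text)

# In multiline mode, ^ also matches after an internal newline.
multiline_caret = re.findall(r"^\w+", text, re.MULTILINE)

# $ matches before a newline at the end of the subject.
before_final_newline = re.search(r"\w+$", text).group()

# \A stays tied to the start of the subject even in multiline mode.
start_only = re.findall(r"\A\w+", text, re.MULTILINE)

# \b matches only where a word character meets a non-word character.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(default_caret, multiline_caret, before_final_newline, start_only, whole_words)
```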
- .
- .
- .SH "REPORTED MATCH POINT SETTING"
- .rs
- .sp
- \eK set reported start of match
- .sp
- \eK is honoured in positive assertions, but ignored in negative ones.
- .
- .
- .SH "ALTERNATION"
- .rs
- .sp
- expr|expr|expr...
- .
- .
- .SH "CAPTURING"
- .rs
- .sp
- (...) capture group
- (?<name>...) named capture group (Perl)
- (?'name'...) named capture group (Perl)
- (?P<name>...) named capture group (Python)
- (?:...) non-capture group
- (?|...) non-capture group; reset group numbers for
-     capture groups in each alternative
- .sp
- In non-UTF modes, names may contain underscores and ASCII letters and digits;
- in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In
- both cases, a name must not start with a digit.
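A short sketch of named and non-capture groups, using the Python-style (?P<name>...) form from the table. It runs on Python's standard re module, which is not PCRE2; that module accepts only this spelling of named groups, so the Perl spellings are not shown. The pattern and the date string are illustrative only.

```python
import re

# Three named capture groups; the (?:...) wrapper groups the optional
# "-day" part without adding a capture group of its own.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})(?:-(?P<day>\d{2}))?")

m = date.fullmatch("2019-12-28")
by_name = m.groupdict()
by_number = m.groups()

print(by_name)
print(by_number)
```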
- .
- .
- .SH "ATOMIC GROUPS"
- .rs
- .sp
- (?>...) atomic non-capture group
- (*atomic:...) atomic non-capture group
- .
- .
- .SH "COMMENT"
- .rs
- .sp
- (?#....) comment (not nestable)
- .
- .
- .SH "OPTION SETTING"
- .rs
- Changes of these options within a group are automatically cancelled at the end
- of the group.
- .sp
- (?i) caseless
- (?J) allow duplicate named groups
- (?m) multiline
- (?n) no auto capture
- (?s) single line (dotall)
- (?U) default ungreedy (lazy)
- (?x) extended: ignore white space except in classes
- (?xx) as (?x) but also ignore space and tab in classes
- (?-...) unset option(s)
- (?^) unset imnsx options
- .sp
- Unsetting x or xx unsets both. Several options may be set at once, and a
- mixture of setting and unsetting such as (?i-x) is allowed, but there may be
- only one hyphen. Setting (but not unsetting) is allowed after (?^, for example
- (?^in). An option setting may appear at the start of a non-capture group, for
- example (?i:...).
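The scoping of an option set at the start of a non-capture group can be sketched as follows. Python's standard re module (3.6 or later for the scoped form) is used as a stand-in for PCRE2, on the assumption that (?i) and (?i:...) mean the same thing in both. The other option letters and (?^) are not covered.

```python
import re

# (?i:...) applies caseless matching to the enclosed part only.
scoped = re.compile(r"(?i:abc)def")

scoped_ok = scoped.fullmatch("ABCdef") is not None
scoped_rejects = scoped.fullmatch("ABCDEF") is None

# (?i) at the start of the pattern applies to the whole pattern.
whole = re.compile(r"(?i)abcdef")
whole_ok = whole.fullmatch("ABCDEF") is not None

print(scoped_ok, scoped_rejects, whole_ok)
```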
- .P
- The following are recognized only at the very start of a pattern or after one
- of the newline or \eR options with similar syntax. More than one of them may
- appear. For the first three, d is a decimal number.
- .sp
- (*LIMIT_DEPTH=d) set the backtracking limit to d
- (*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
- (*LIMIT_MATCH=d) set the match limit to d
- (*NOTEMPTY) set PCRE2_NOTEMPTY when matching
- (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
- (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
- (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
- (*NO_JIT) disable JIT optimization
- (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
- (*UTF) set appropriate UTF mode for the library in use
- (*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
- .sp
- Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of
- the limits set by the caller of \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP,
- not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
- application can lock out the use of (*UTF) and (*UCP) by setting the
- PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
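The rule that an in-pattern limit can only lower the caller's limit amounts to taking the smaller of the two values. The sketch below models that arithmetic only. The function names effective_limit and heap_limit_bytes are hypothetical and are not part of the PCRE2 API.

```python
def effective_limit(caller_limit: int, pattern_limit: int) -> int:
    """Model of the rule above: a (*LIMIT_...=d) item may reduce the
    limit set by the caller, but never increase it."""
    return min(caller_limit, pattern_limit)


def heap_limit_bytes(d: int) -> int:
    """(*LIMIT_HEAP=d) sets the heap size limit to d * 1024 bytes."""
    return d * 1024


print(effective_limit(1000, 500))    # pattern asks for less: honoured
print(effective_limit(1000, 5000))   # pattern asks for more: ignored
print(heap_limit_bytes(20))
```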
- .
- .
- .SH "NEWLINE CONVENTION"
- .rs
- .sp
- These are recognized only at the very start of the pattern or after option
- settings with a similar syntax.
- .sp
- (*CR) carriage return only
- (*LF) linefeed only
- (*CRLF) carriage return followed by linefeed
- (*ANYCRLF) all three of the above
- (*ANY) any Unicode newline sequence
- (*NUL) the NUL character (binary zero)
- .
- .
- .SH "WHAT \eR MATCHES"
- .rs
- .sp
- These are recognized only at the very start of the pattern or after option
- settings with a similar syntax.
- .sp
- (*BSR_ANYCRLF) CR, LF, or CRLF
- (*BSR_UNICODE) any Unicode newline sequence
- .
- .
- .SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
- .rs
- .sp
- (?=...) )
- (*pla:...) ) positive lookahead
- (*positive_lookahead:...) )
- .sp
- (?!...) )
- (*nla:...) ) negative lookahead
- (*negative_lookahead:...) )
- .sp
- (?<=...) )
- (*plb:...) ) positive lookbehind
- (*positive_lookbehind:...) )
- .sp
- (?<!...) )
- (*nlb:...) ) negative lookbehind
- (*negative_lookbehind:...) )
- .sp
- Each top-level branch of a lookbehind must be of a fixed length.
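A minimal sketch of a positive lookahead and a negative lookbehind, using the (?=...) and (?<!...) spellings. It runs on Python's standard re module as a stand-in for PCRE2. The alphabetic synonyms such as (*pla:...) are PCRE2-specific and are not accepted there. The sample strings are illustrative only.

```python
import re

# Positive lookahead: digits that are followed by " USD"; the currency
# text is checked but is not part of the match.
usd_amounts = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")

# Negative lookbehind (fixed length, one character): digits that are
# not immediately preceded by a dollar sign.
not_dollars = re.findall(r"(?<!\$)\b\d+", "$5 and 7")

print(usd_amounts)
print(not_dollars)
```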
- .
- .
- .SH "NON-ATOMIC LOOKAROUND ASSERTIONS"
- .rs
- .sp
- These assertions are specific to PCRE2 and are not Perl-compatible.
- .sp
- (?*...) )
- (*napla:...) ) synonyms
- (*non_atomic_positive_lookahead:...) )
- .sp
- (?<*...) )
- (*naplb:...) ) synonyms
- (*non_atomic_positive_lookbehind:...) )
- .
- .
- .SH "SCRIPT RUNS"
- .rs
- .sp
- (*script_run:...) ) script run, can be backtracked into
- (*sr:...) )
- .sp
- (*atomic_script_run:...) ) atomic script run
- (*asr:...) )
- .
- .
- .SH "BACKREFERENCES"
- .rs
- .sp
- \en reference by number (can be ambiguous)
- \egn reference by number
- \eg{n} reference by number
- \eg+n relative reference by number (PCRE2 extension)
- \eg-n relative reference by number
- \eg{+n} relative reference by number (PCRE2 extension)
- \eg{-n} relative reference by number
- \ek<name> reference by name (Perl)
- \ek'name' reference by name (Perl)
- \eg{name} reference by name (Perl)
- \ek{name} reference by name (.NET)
- (?P=name) reference by name (Python)
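Backreferences by number and by name can be sketched with the \n and (?P=name) forms. Python's standard re module, used here as a stand-in for PCRE2, accepts both. The \g and \k forms in the table are not accepted inside Python patterns and are not shown.

```python
import re

# \1 refers back to whatever group 1 matched: a doubled letter.
doubled_letter = re.search(r"(\w)\1", "hello").group()

# (?P=w) refers back to the named group w: a repeated word.
repeated_word = re.search(r"\b(?P<w>\w+) (?P=w)\b", "this is is a test").group()

print(doubled_letter)
print(repeated_word)
```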
- .
- .
- .SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
- .rs
- .sp
- (?R) recurse whole pattern
- (?n) call subroutine by absolute number
- (?+n) call subroutine by relative number
- (?-n) call subroutine by relative number
- (?&name) call subroutine by name (Perl)
- (?P>name) call subroutine by name (Python)
- \eg<name> call subroutine by name (Oniguruma)
- \eg'name' call subroutine by name (Oniguruma)
- \eg<n> call subroutine by absolute number (Oniguruma)
- \eg'n' call subroutine by absolute number (Oniguruma)
- \eg<+n> call subroutine by relative number (PCRE2 extension)
- \eg'+n' call subroutine by relative number (PCRE2 extension)
- \eg<-n> call subroutine by relative number (PCRE2 extension)
- \eg'-n' call subroutine by relative number (PCRE2 extension)
- .
- .
- .SH "CONDITIONAL PATTERNS"
- .rs
- .sp
- (?(condition)yes-pattern)
- (?(condition)yes-pattern|no-pattern)
- .sp
- (?(n) absolute reference condition
- (?(+n) relative reference condition
- (?(-n) relative reference condition
- (?(<name>) named reference condition (Perl)
- (?('name') named reference condition (Perl)
- (?(name) named reference condition (PCRE2, deprecated)
- (?(R) overall recursion condition
- (?(Rn) specific numbered group recursion condition
- (?(R&name) specific named group recursion condition
- (?(DEFINE) define groups for reference
- (?(VERSION[>]=n.m) test PCRE2 version
- (?(assert) assertion condition
- .sp
- Note the ambiguity of (?(R) and (?(Rn), which might be named reference
- conditions or recursion tests. Such a condition is interpreted as a reference
- condition if the relevant named group exists.
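The absolute reference condition (?(n)...) can be sketched with a pattern that requires a closing bracket only when an opening one was matched. Python's standard re module is used as a stand-in for PCRE2; it supports this form, but not the recursion, DEFINE, VERSION or assertion conditions.

```python
import re

# Group 1 optionally captures "<". The condition (?(1)>) then demands
# ">" only if group 1 took part in the match.
tag = re.compile(r"(<)?\w+(?(1)>)")

bracketed = tag.fullmatch("<a>") is not None
bare = tag.fullmatch("a") is not None
unclosed = tag.fullmatch("<a") is not None
stray_close = tag.fullmatch("a>") is not None

print(bracketed, bare, unclosed, stray_close)
```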
- .
- .
- .SH "BACKTRACKING CONTROL"
- .rs
- .sp
- All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
- name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
- if :NAME is present. The others just set a name for passing back to the caller,
- but this is not a name that (*SKIP) can see. The following act as soon as
- they are reached:
- .sp
- (*ACCEPT) force successful match
- (*FAIL) force backtrack; synonym (*F)
- (*MARK:NAME) set name to be passed back; synonym (*:NAME)
- .sp
- The following act only when a subsequent match failure causes a backtrack to
- reach them. They all force a match failure, but they differ in what happens
- afterwards. Those that advance the start-of-match point do so only if the
- pattern is not anchored.
- .sp
- (*COMMIT) overall failure, no advance of starting point
- (*PRUNE) advance to next starting character
- (*SKIP) advance to current matching position
- (*SKIP:NAME) advance to position corresponding to an earlier
-     (*MARK:NAME); if not found, the (*SKIP) is ignored
- (*THEN) local failure, backtrack to next alternation
- .sp
- The effect of one of these verbs in a group called as a subroutine is confined
- to the subroutine call.
- .
- .
- .SH "CALLOUTS"
- .rs
- .sp
- (?C) callout (assumed number 0)
- (?Cn) callout with numerical data n
- (?C"text") callout with string data
- .sp
- The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
- start and the end), and the starting delimiter { matched with the ending
- delimiter }. To encode the ending delimiter within the string, double it.
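The delimiter rule above can be sketched as a small helper that builds a callout item with string data. The function name callout_with_string is hypothetical and is not part of PCRE2. The helper only constructs the pattern text, following the doubling rule as stated here.

```python
# Delimiters that are the same at the start and the end of the string.
SAME_DELIMITERS = "`'\"^%#$"


def callout_with_string(text: str, delimiter: str = '"') -> str:
    """Build (?C<delim>text<delim>), doubling the ending delimiter
    wherever it occurs inside the text."""
    if delimiter == "{":
        closing = "}"
    elif len(delimiter) == 1 and delimiter in SAME_DELIMITERS:
        closing = delimiter
    else:
        raise ValueError("unsupported callout delimiter: %r" % delimiter)
    escaped = text.replace(closing, closing * 2)
    return "(?C" + delimiter + escaped + closing + ")"


print(callout_with_string('say "hi"'))
print(callout_with_string("a}b", "{"))
```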
- .
- .
- .SH "SEE ALSO"
- .rs
- .sp
- \fBpcre2pattern\fP(3), \fBpcre2api\fP(3), \fBpcre2callout\fP(3),
- \fBpcre2matching\fP(3), \fBpcre2\fP(3).
- .
- .
- .SH AUTHOR
- .rs
- .sp
- .nf
- Philip Hazel
- University Computing Service
- Cambridge, England.
- .fi
- .
- .
- .SH REVISION
- .rs
- .sp
- .nf
- Last updated: 28 December 2019
- Copyright (c) 1997-2019 University of Cambridge.
- .fi