- .TH PCRE2UNICODE 3 "23 February 2020" "PCRE2 10.35"
- .SH NAME
- PCRE2 - Perl-compatible regular expressions (revised API)
- .SH "UNICODE AND UTF SUPPORT"
- .rs
- .sp
- PCRE2 is normally built with Unicode support, though if you do not need it, you
- can build it without, in which case the library will be smaller. With Unicode
- support, PCRE2 has knowledge of Unicode character properties and can process
- strings of text in UTF-8, UTF-16, and UTF-32 format (depending on the code unit
- width), but this is not the default. Unless specifically requested, PCRE2
- treats each code unit in a string as one character.
- .P
- There are two ways of telling PCRE2 to switch to UTF mode, where characters may
- consist of more than one code unit and the range of values is constrained. The
- program can call
- .\" HREF
- \fBpcre2_compile()\fP
- .\"
- with the PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
- However, the latter facility can be locked out by the PCRE2_NEVER_UTF option.
- That is, the programmer can prevent the supplier of the pattern from switching
- to UTF mode.
- .P
- Note that the PCRE2_MATCH_INVALID_UTF option (see
- .\" HTML <a href="#matchinvalid">
- .\" </a>
- below)
- .\"
- forces PCRE2_UTF to be set.
- .P
- In UTF mode, both the pattern and any subject strings that are matched against
- it are treated as UTF strings instead of strings of individual one-code-unit
- characters. There are also some other changes to the way characters are
- handled, as documented below.
- .
- .
- .SH "UNICODE PROPERTY SUPPORT"
- .rs
- .sp
- When PCRE2 is built with Unicode support, the escape sequences \ep{..},
- \eP{..}, and \eX can be used. This is not dependent on the PCRE2_UTF setting.
- The Unicode properties that can be tested are limited to the general category
- properties such as Lu for an upper case letter or Nd for a decimal number, the
- Unicode script names such as Arabic or Han, and the derived properties Any and
- L&. Full lists are given in the
- .\" HREF
- \fBpcre2pattern\fP
- .\"
- and
- .\" HREF
- \fBpcre2syntax\fP
- .\"
- documentation. Only the short names for properties are supported. For example,
- \ep{L} matches a letter. Its Perl synonym, \ep{Letter}, is not supported.
- Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
- compatibility with Perl 5.6. PCRE2 does not support this.
- .
- .
- .SH "WIDE CHARACTERS AND UTF MODES"
- .rs
- .sp
- Code points less than 256 can be specified in patterns by either braced or
- unbraced hexadecimal escape sequences (for example, \ex{b3} or \exb3). Larger
- values have to use braced sequences. Unbraced octal code points up to \e777 are
- also recognized; larger ones can be coded using \eo{...}.
- .P
- The escape sequence \eN{U+<hex digits>} is recognized as another way of
- specifying a Unicode character by code point in a UTF mode. It is not allowed
- in non-UTF mode.
- .P
- In UTF mode, repeat quantifiers apply to complete UTF characters, not to
- individual code units.
- .P
- In UTF mode, the dot metacharacter matches one UTF character instead of a
- single code unit.
- .P
- In UTF mode, capture group names are not restricted to ASCII, and may contain
- any Unicode letters and decimal digits, as well as underscore.
- .P
- The escape sequence \eC can be used to match a single code unit in UTF mode,
- but its use can lead to some strange effects because it breaks up multi-unit
- characters (see the description of \eC in the
- .\" HREF
- \fBpcre2pattern\fP
- .\"
- documentation). For this reason, there is a build-time option that disables
- support for \eC completely. There is also a less draconian compile-time option
- for locking out the use of \eC when a pattern is compiled.
- .P
- The use of \eC is not supported by the alternative matching function
- \fBpcre2_dfa_match()\fP when in UTF-8 or UTF-16 mode, that is, when a character
- may consist of more than one code unit. The use of \eC in these modes provokes
- a match-time error. Also, the JIT optimization does not support \eC in these
- modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
- contains \eC, it will not succeed, and so when \fBpcre2_match()\fP is called,
- the matching will be carried out by the interpretive function.
- .P
- The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly test
- characters of any code value, but, by default, the characters that PCRE2
- recognizes as digits, spaces, or word characters remain the same set as in
- non-UTF mode, all with code points less than 256. This remains true even when
- PCRE2 is built to include Unicode support, because to do otherwise would slow
- down matching in many common cases. Note that this also applies to \eb
- and \eB, because they are defined in terms of \ew and \eW. If you want
- to test for a wider sense of, say, "digit", you can use explicit Unicode
- property tests such as \ep{Nd}. Alternatively, if you set the PCRE2_UCP option,
- the way that the character escapes work is changed so that Unicode properties
- are used to determine which characters match. There are more details in the
- section on
- .\" HTML <a href="pcre2pattern.html#genericchartypes">
- .\" </a>
- generic character types
- .\"
- in the
- .\" HREF
- \fBpcre2pattern\fP
- .\"
- documentation.
- .P
- Similarly, characters that match the POSIX named character classes are all
- low-valued characters, unless the PCRE2_UCP option is set.
- .P
- However, the special horizontal and vertical white space matching escapes (\eh,
- \eH, \ev, and \eV) do match all the appropriate Unicode characters, whether or
- not PCRE2_UCP is set.
- .
- .
- .SH "UNICODE CASE-EQUIVALENCE"
- .rs
- .sp
- If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing makes use
- of Unicode properties except for characters whose code points are less than 128
- and that have at most two case-equivalent values. For these, a direct table
- lookup is used for speed. A few Unicode characters such as Greek sigma have
- more than two code points that are case-equivalent, and these are treated
- specially. Setting PCRE2_UCP without PCRE2_UTF allows Unicode-style case
- processing for non-UTF character encodings such as UCS-2.
- .
- .
- .\" HTML <a name="scriptruns"></a>
- .SH "SCRIPT RUNS"
- .rs
- .sp
- The pattern constructs (*script_run:...) and (*atomic_script_run:...), with
- synonyms (*sr:...) and (*asr:...), verify that the string matched within the
- parentheses is a script run. In concept, a script run is a sequence of
- characters that are all from the same Unicode script. However, because some
- scripts are commonly used together, and because some diacritical and other
- marks are used with multiple scripts, it is not that simple.
- .P
- Every Unicode character has a Script property, mostly with a value
- corresponding to the name of a script, such as Latin, Greek, or Cyrillic. There
- are also three special values:
- .P
- "Unknown" is used for code points that have not been assigned, and also for the
- surrogate code points. In the PCRE2 32-bit library, characters whose code
- points are greater than the Unicode maximum (U+10FFFF), which are accessible
- only in non-UTF mode, are assigned the Unknown script.
- .P
- "Common" is used for characters that are used with many scripts. These include
- punctuation, emoji, mathematical, musical, and currency symbols, and the ASCII
- digits 0 to 9.
- .P
- "Inherited" is used for characters such as diacritical marks that modify a
- previous character. These are considered to take on the script of the character
- that they modify.
- .P
- Some Inherited characters are used with many scripts, but many of them are only
- normally used with a small number of scripts. For example, U+102E0 (Coptic
- Epact thousands mark) is used only with Arabic and Coptic. In order to make it
- possible to check this, a Unicode property called Script Extension exists. Its
- value is a list of scripts that apply to the character. For the majority of
- characters, the list contains just one script, the same one as the Script
- property. However, for characters such as U+102E0 more than one Script is
- listed. There are also some Common characters that have a single, non-Common
- script in their Script Extension list.
- .P
- The next section describes the basic rules for deciding whether a given string
- of characters is a script run. Note, however, that there are some special cases
- involving the Chinese Han script, and an additional constraint for decimal
- digits. These are covered in subsequent sections.
- .
- .
- .SS "Basic script run rules"
- .rs
- .sp
- A string that is less than two characters long is a script run. This is the
- only case in which an Unknown character can be part of a script run. Longer
- strings are checked using only the Script Extensions property, not the basic
- Script property.
- .P
- If a character's Script Extension property is the single value "Inherited", it
- is always accepted as part of a script run. This is also true for the property
- "Common", subject to the checking of decimal digits described below. All the
- remaining characters in a script run must have at least one script in common in
- their Script Extension lists. In set-theoretic terminology, the intersection of
- all the sets of scripts must not be empty.
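- .P
- The intersection rule can be sketched with one bit per script. This is an
- illustration only, not PCRE2's implementation: the SCR_ values and the function
- \fBis_script_run()\fP are hypothetical, the caller is assumed to supply each
- character's Script Extension set, and the Han and decimal digit rules described
- below are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical script identifiers, one bit each. */
#define SCR_LATIN     (1u << 0)
#define SCR_CYRILLIC  (1u << 1)
#define SCR_ARABIC    (1u << 2)
#define SCR_ROHINGYA  (1u << 3)
#define SCR_SYRIAC    (1u << 4)
#define SCR_THAANA    (1u << 5)
#define SCR_COMMON    (1u << 6)
#define SCR_INHERITED (1u << 7)

/* ext[i] is the Script Extension set of the i-th character; an Unknown
   character is represented here by an empty set (0). Returns 1 if the
   characters form a script run under the basic rules, otherwise 0. */
static int is_script_run(const uint32_t *ext, size_t n)
{
    uint32_t candidates = ~0u;   /* scripts still shared by every character */
    size_t i;

    if (n < 2) return 1;         /* short strings are always script runs */
    for (i = 0; i < n; i++) {
        /* Characters that are only Common or only Inherited never
           restrict the set of candidate scripts. */
        if (ext[i] == SCR_COMMON || ext[i] == SCR_INHERITED) continue;
        candidates &= ext[i];
        if (candidates == 0) return 0;   /* empty intersection */
    }
    return 1;
}
```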
- .P
- A simple example is an Internet name such as "google.com". The letters are all
- in the Latin script, and the dot is Common, so this string is a script run.
- However, the Cyrillic letter "o" looks exactly the same as the Latin "o"; a
- string that looks the same but uses Cyrillic "o"s is not a script run.
- .P
- More interesting examples involve characters with more than one script in their
- Script Extension. Consider the following characters:
- .sp
- U+060C Arabic comma
- U+06D4 Arabic full stop
- .sp
- The first has the Script Extension list Arabic, Hanifi Rohingya, Syriac, and
- Thaana; the second has just Arabic and Hanifi Rohingya. Both of them could
- appear in script runs of either Arabic or Hanifi Rohingya. The first could also
- appear in Syriac or Thaana script runs, but the second could not.
- .
- .
- .SS "The Chinese Han script"
- .rs
- .sp
- The Chinese Han script is commonly used in conjunction with other scripts for
- writing certain languages. Japanese uses the Hiragana and Katakana scripts
- together with Han; Korean uses Hangul and Han; Taiwanese Mandarin uses Bopomofo
- and Han. These three combinations are treated as special cases when checking
- script runs and are, in effect, "virtual scripts". Thus, a script run may
- contain a mixture of Hiragana, Katakana, and Han, or a mixture of Hangul and
- Han, or a mixture of Bopomofo and Han, but not, for example, a mixture of
- Hangul and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
- Standard 39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/)
- in allowing such mixtures.
- .
- .
- .SS "Decimal digits"
- .rs
- .sp
- Unicode contains many sets of 10 decimal digits in different scripts, and some
- scripts (including the Common script) contain more than one set. Some of these
- decimal digits are visually indistinguishable from the common ASCII
- digits. In addition to the script checking described above, if a script run
- contains any decimal digits, they must all come from the same set of 10
- adjacent characters.
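- .P
- The digit constraint can be sketched as follows. The table holds the digit
- zero of only four sets (ASCII, Arabic-Indic, Devanagari, and fullwidth) and is
- an illustrative subset, not the data that PCRE2 uses; the function names are
- hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Code point of digit zero in a few decimal digit sets (illustrative subset). */
static const uint32_t digit_zero[] = {
    0x0030,  /* ASCII */
    0x0660,  /* Arabic-Indic */
    0x0966,  /* Devanagari */
    0xFF10   /* Fullwidth */
};

/* Returns the digit zero of the set containing c, or 0 if c is not a
   decimal digit known to the table above. */
static uint32_t digit_set_of(uint32_t c)
{
    size_t i;
    for (i = 0; i < sizeof digit_zero / sizeof digit_zero[0]; i++)
        if (c >= digit_zero[i] && c <= digit_zero[i] + 9)
            return digit_zero[i];
    return 0;
}

/* Returns 1 if every decimal digit among the n code points comes from
   the same set of 10 adjacent characters, otherwise 0. */
static int digits_from_one_set(const uint32_t *cp, size_t n)
{
    uint32_t seen = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        uint32_t set = digit_set_of(cp[i]);
        if (set == 0) continue;          /* not a digit */
        if (seen == 0) seen = set;       /* first digit fixes the set */
        else if (seen != set) return 0;  /* digit from a different set */
    }
    return 1;
}
```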
- .
- .
- .SH "VALIDITY OF UTF STRINGS"
- .rs
- .sp
- When the PCRE2_UTF option is set, the strings passed as patterns and subjects
- are (by default) checked for validity on entry to the relevant functions. If an
- invalid UTF string is passed, a negative error code is returned. The code unit
- offset to the offending character can be extracted from the match data block by
- calling \fBpcre2_get_startchar()\fP, which is used for this purpose after a UTF
- error.
- .P
- In some situations, you may already know that your strings are valid, and
- therefore want to skip these checks in order to improve performance, for
- example in the case of a long subject string that is being scanned repeatedly.
- If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
- PCRE2 assumes that the pattern or subject it is given (respectively) contains
- only valid UTF code unit sequences.
- .P
- If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
- is undefined and your program may crash or loop indefinitely or give incorrect
- results. There is, however, one mode of matching that can handle invalid UTF
- subject strings. This is enabled by passing PCRE2_MATCH_INVALID_UTF to
- \fBpcre2_compile()\fP and is discussed below in the next section. The rest of
- this section covers the case when PCRE2_MATCH_INVALID_UTF is not set.
- .P
- Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the UTF check
- for the pattern; it does not also apply to subject strings. If you want to
- disable the check for a subject string you must pass this same option to
- \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP.
- .P
- UTF-16 and UTF-32 strings can indicate their endianness by a special code known
- as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
- strings to be in host byte order.
- .P
- Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any other
- processing takes place. In the case of \fBpcre2_match()\fP and
- \fBpcre2_dfa_match()\fP calls with a non-zero starting offset, the check is
- applied only to that part of the subject that could be inspected during
- matching, and there is a check that the starting offset points to the first
- code unit of a character or to the end of the subject. If there are no
- lookbehind assertions in the pattern, the check starts at the starting offset.
- Otherwise, it starts at the length of the longest lookbehind before the
- starting offset, or at the start of the subject if there are not that many
- characters before the starting offset. Note that the sequences \eb and \eB are
- one-character lookbehinds.
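- .P
- For a UTF-8 subject, the position at which the check begins can be sketched as
- below. The function \fButf8_check_start()\fP is hypothetical and is not how
- PCRE2 itself is written; it assumes that the starting offset already points to
- the first code unit of a character, and it counts characters backwards by
- stepping over continuation bytes.

```c
#include <stddef.h>

/* Returns the code unit offset that lies max_lookbehind characters
   before start, or 0 if there are not that many characters. */
static size_t utf8_check_start(const unsigned char *subject, size_t start,
                               size_t max_lookbehind)
{
    size_t pos = start;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Bytes of the form 10xxxxxx are inside a character: keep going
           back until the first byte of that character is reached. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80) pos--;
        max_lookbehind--;
    }
    return pos;
}
```

- For a pattern that contains only \eb or \eB, max_lookbehind would be 1.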
- .P
- In addition to checking the format of the string, there is a check to ensure
- that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate
- area. The so-called "non-character" code points are not excluded because
- Unicode corrigendum #9 makes it clear that they should not be.
- .P
- Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,
- where they are used in pairs to encode code points with values greater than
- 0xFFFF. The code points that are encoded by UTF-16 pairs are available
- independently in the UTF-8 and UTF-32 encodings. (In other words, the whole
- surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
- UTF-32.)
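- .P
- The pairing arithmetic is standard UTF-16 and can be sketched as follows; the
- function names are hypothetical, and the encoding functions assume a code
- point in the range 0x10000 to 0x10ffff.

```c
#include <stdint.h>

/* High (first) surrogate for a code point above 0xFFFF. */
static uint16_t utf16_high(uint32_t cp)
{
    return (uint16_t)(0xD800 + ((cp - 0x10000) >> 10));
}

/* Low (second) surrogate for a code point above 0xFFFF. */
static uint16_t utf16_low(uint32_t cp)
{
    return (uint16_t)(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

/* Recombines a valid surrogate pair into the code point it encodes. */
static uint32_t utf16_combine(uint16_t high, uint16_t low)
{
    return 0x10000 + (((uint32_t)(high - 0xD800) << 10) |
                      (uint32_t)(low - 0xDC00));
}
```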
- .P
- Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is
- given if an escape sequence for an invalid Unicode code point is encountered in
- the pattern. If you want to allow escape sequences such as \ex{d800} (a
- surrogate code point) you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
- option. However, this is possible only in UTF-8 and UTF-32 modes, because these
- values are not representable in UTF-16.
- .
- .
- .\" HTML <a name="utf8strings"></a>
- .SS "Errors in UTF-8 strings"
- .rs
- .sp
- The following negative error codes are given for invalid UTF-8 strings:
- .sp
- PCRE2_ERROR_UTF8_ERR1
- PCRE2_ERROR_UTF8_ERR2
- PCRE2_ERROR_UTF8_ERR3
- PCRE2_ERROR_UTF8_ERR4
- PCRE2_ERROR_UTF8_ERR5
- .sp
- The string ends with a truncated UTF-8 character; the code specifies how many
- bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 characters to be
- no longer than 4 bytes, the encoding scheme (originally defined by RFC 2279)
- allows for up to 6 bytes, and this is checked first; hence the possibility of
- 4 or 5 missing bytes.
- .sp
- PCRE2_ERROR_UTF8_ERR6
- PCRE2_ERROR_UTF8_ERR7
- PCRE2_ERROR_UTF8_ERR8
- PCRE2_ERROR_UTF8_ERR9
- PCRE2_ERROR_UTF8_ERR10
- .sp
- The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of the
- character do not have the binary value 0b10 (that is, either the most
- significant bit is 0, or the next bit is 1).
- .sp
- PCRE2_ERROR_UTF8_ERR11
- PCRE2_ERROR_UTF8_ERR12
- .sp
- A character that is valid by the RFC 2279 rules is either 5 or 6 bytes long;
- these code points are excluded by RFC 3629.
- .sp
- PCRE2_ERROR_UTF8_ERR13
- .sp
- A 4-byte character has a value greater than 0x10ffff; these code points are
- excluded by RFC 3629.
- .sp
- PCRE2_ERROR_UTF8_ERR14
- .sp
- A 3-byte character has a value in the range 0xd800 to 0xdfff; this range of
- code points is reserved by RFC 3629 for use with UTF-16, and so is excluded
- from UTF-8.
- .sp
- PCRE2_ERROR_UTF8_ERR15
- PCRE2_ERROR_UTF8_ERR16
- PCRE2_ERROR_UTF8_ERR17
- PCRE2_ERROR_UTF8_ERR18
- PCRE2_ERROR_UTF8_ERR19
- .sp
- A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes for a
- value that can be represented by fewer bytes, which is invalid. For example,
- the two bytes 0xc0, 0xae give the value 0x2e, whose correct coding uses just
- one byte.
- .sp
- PCRE2_ERROR_UTF8_ERR20
- .sp
- The two most significant bits of the first byte of a character have the binary
- value 0b10 (that is, the most significant bit is 1 and the second is 0). Such a
- byte can only validly occur as the second or subsequent byte of a multi-byte
- character.
- .sp
- PCRE2_ERROR_UTF8_ERR21
- .sp
- The first byte of a character has the value 0xfe or 0xff. These values can
- never occur in a valid UTF-8 string.
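- .P
- The conditions listed above can be collected into a single checking function.
- What follows is a sketch written from those descriptions, not PCRE2's own
- source; \fButf8_error_number()\fP is a hypothetical name, and it returns the
- number n of the matching PCRE2_ERROR_UTF8_ERRn description rather than the
- negative error code itself.

```c
#include <stddef.h>

/* Returns 0 for a valid UTF-8 string, otherwise the number n of the
   condition described under PCRE2_ERROR_UTF8_ERRn. If offset is not
   NULL, it receives the offset of the offending character on error. */
static int utf8_error_number(const unsigned char *s, size_t len,
                             size_t *offset)
{
    size_t i = 0;
    while (i < len) {
        unsigned c = s[i], d;
        size_t extra, k;
        if (c < 0x80) { i++; continue; }     /* one-byte character */
        if (offset != NULL) *offset = i;
        if (c < 0xC0) return 20;             /* isolated 10xxxxxx byte */
        if (c >= 0xFE) return 21;            /* 0xfe or 0xff */

        /* Number of bytes that must follow, by the RFC 2279 scheme. */
        extra = c < 0xE0 ? 1 : c < 0xF0 ? 2 : c < 0xF8 ? 3 : c < 0xFC ? 4 : 5;
        if (len - i - 1 < extra)             /* truncated: ERR1 to ERR5 */
            return (int)(extra - (len - i - 1));
        for (k = 1; k <= extra; k++)         /* bad 2nd..6th byte: ERR6..10 */
            if ((s[i + k] & 0xC0) != 0x80) return 5 + (int)k;

        d = s[i + 1];
        switch (extra) {
        case 1:
            if ((c & 0x3E) == 0) return 15;                  /* overlong */
            break;
        case 2:
            if (c == 0xE0 && (d & 0x20) == 0) return 16;     /* overlong */
            if (c == 0xED && d >= 0xA0) return 14;           /* surrogate */
            break;
        case 3:
            if (c == 0xF0 && (d & 0x30) == 0) return 17;     /* overlong */
            if (c > 0xF4 || (c == 0xF4 && d > 0x8F)) return 13;
            break;
        case 4:
            if (c == 0xF8 && (d & 0x38) == 0) return 18;     /* overlong */
            return 11;                       /* 5-byte character */
        default:
            if (c == 0xFC && (d & 0x3C) == 0) return 19;     /* overlong */
            return 12;                       /* 6-byte character */
        }
        i += extra + 1;
    }
    return 0;
}
```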
- .
- .
- .\" HTML <a name="utf16strings"></a>
- .SS "Errors in UTF-16 strings"
- .rs
- .sp
- The following negative error codes are given for invalid UTF-16 strings:
- .sp
- PCRE2_ERROR_UTF16_ERR1 Missing low surrogate at end of string
- PCRE2_ERROR_UTF16_ERR2 Invalid low surrogate follows high surrogate
- PCRE2_ERROR_UTF16_ERR3 Isolated low surrogate
- .sp
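- .P
- The three UTF-16 conditions can be sketched in a few lines. The function
- \fButf16_error_number()\fP is hypothetical and returns the number n of
- PCRE2_ERROR_UTF16_ERRn rather than the negative error code.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 0 for a valid UTF-16 string in host byte order, otherwise 1,
   2, or 3 as in the list above. If offset is not NULL, it receives the
   offset of the offending code unit on error. */
static int utf16_error_number(const uint16_t *s, size_t len, size_t *offset)
{
    size_t i;
    for (i = 0; i < len; i++) {
        uint16_t c = s[i];
        if ((c & 0xF800) != 0xD800) continue;   /* not a surrogate */
        if (offset != NULL) *offset = i;
        if ((c & 0x0400) != 0) return 3;        /* isolated low surrogate */
        if (i + 1 == len) return 1;             /* high surrogate at end */
        if ((s[i + 1] & 0xFC00) != 0xDC00) return 2;  /* no low surrogate */
        i++;                                    /* skip the low surrogate */
    }
    return 0;
}
```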
- .
- .
- .\" HTML <a name="utf32strings"></a>
- .SS "Errors in UTF-32 strings"
- .rs
- .sp
- The following negative error codes are given for invalid UTF-32 strings:
- .sp
- PCRE2_ERROR_UTF32_ERR1 Surrogate character (0xd800 to 0xdfff)
- PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff
- .sp
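- .P
- The corresponding UTF-32 check is a range test on each code unit. As before,
- \fButf32_error_number()\fP is a hypothetical name, and the return value is the
- number n of PCRE2_ERROR_UTF32_ERRn rather than the negative error code.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 0 for a valid UTF-32 string in host byte order, otherwise 1
   for a surrogate or 2 for a value above 0x10ffff. If offset is not
   NULL, it receives the offset of the offending code unit on error. */
static int utf32_error_number(const uint32_t *s, size_t len, size_t *offset)
{
    size_t i;
    for (i = 0; i < len; i++) {
        uint32_t c = s[i];
        int err = (c >= 0xD800 && c <= 0xDFFF) ? 1 : (c > 0x10FFFF) ? 2 : 0;
        if (err != 0) {
            if (offset != NULL) *offset = i;
            return err;
        }
    }
    return 0;
}
```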
- .
- .
- .\" HTML <a name="matchinvalid"></a>
- .SH "MATCHING IN INVALID UTF STRINGS"
- .rs
- .sp
- You can run pattern matches on subject strings that may contain invalid UTF
- sequences if you call \fBpcre2_compile()\fP with the PCRE2_MATCH_INVALID_UTF
- option. This is supported by \fBpcre2_match()\fP, including JIT matching, but
- not by \fBpcre2_dfa_match()\fP. When PCRE2_MATCH_INVALID_UTF is set, it forces
- PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a
- valid UTF string.
- .P
- Setting PCRE2_MATCH_INVALID_UTF does not affect what \fBpcre2_compile()\fP
- generates, but if \fBpcre2_jit_compile()\fP is subsequently called, it does
- generate different code. If JIT is not used, the option affects the behaviour
- of the interpretive code in \fBpcre2_match()\fP. When PCRE2_MATCH_INVALID_UTF
- is set at compile time, PCRE2_NO_UTF_CHECK is ignored at match time.
- .P
- In this mode, an invalid code unit sequence in the subject never matches any
- pattern item. It does not match dot, it does not match \ep{Any}, it does not
- even match negative items such as [^X]. A lookbehind assertion fails if it
- encounters an invalid sequence while moving the current point backwards. In
- other words, an invalid UTF code unit sequence acts as a barrier which no match
- can cross.
- .P
- You can also think of this as the subject being split up into fragments of
- valid UTF, delimited internally by invalid code unit sequences. The pattern is
- matched fragment by fragment. The result of a successful match, however, is
- given as code unit offsets in the entire subject string in the usual way. There
- are a few points to consider:
- .P
- The internal boundaries are not interpreted as the beginnings or ends of lines
- and so do not match circumflex or dollar characters in the pattern.
- .P
- If \fBpcre2_match()\fP is called with an offset that points to an invalid
- UTF-sequence, that sequence is skipped, and the match starts at the next valid
- UTF character, or the end of the subject.
- .P
- At internal fragment boundaries, \eb and \eB behave in the same way as at the
- beginning and end of the subject. For example, a sequence such as \ebWORD\eb
- would match an instance of WORD that is surrounded by invalid UTF code units.
- .P
- Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
- data, knowing that any matched strings that are returned are valid UTF. This
- can be useful when searching for UTF text in executable or other binary files.
- .
- .
- .SH AUTHOR
- .rs
- .sp
- .nf
- Philip Hazel
- University Computing Service
- Cambridge, England.
- .fi
- .
- .
- .SH REVISION
- .rs
- .sp
- .nf
- Last updated: 23 February 2020
- Copyright (c) 1997-2020 University of Cambridge.
- .fi