add PIRegularExpression
442
3rd/pcre2/doc/html/NON-AUTOTOOLS-BUILD.txt
Normal file
@@ -0,0 +1,442 @@
|
||||
Building PCRE2 without using autotools
|
||||
--------------------------------------
|
||||
|
||||
This document contains the following sections:
|
||||
|
||||
General
|
||||
Generic instructions for the PCRE2 C libraries
|
||||
Stack size in Windows environments
|
||||
Linking programs in Windows environments
|
||||
Calling conventions in Windows environments
|
||||
Comments about Win32 builds
|
||||
Building PCRE2 on Windows with CMake
|
||||
Building PCRE2 on Windows with Visual Studio
|
||||
Testing with RunTest.bat
|
||||
Building PCRE2 on native z/OS and z/VM
|
||||
Building PCRE2 under VMS
|
||||
|
||||
|
||||
GENERAL
|
||||
|
||||
The source of the PCRE2 libraries consists entirely of code written in Standard
|
||||
C, and so should compile successfully on any system that has a Standard C
|
||||
compiler and library.
|
||||
|
||||
The PCRE2 distribution includes a "configure" file for use by the
|
||||
configure/make (autotools) build system, as found in many Unix-like
|
||||
environments. The README file contains information about the options for
|
||||
"configure".
|
||||
|
||||
There is also support for CMake, which some users prefer, especially in Windows
|
||||
environments, though it can also be run in Unix-like environments. See the
|
||||
section entitled "Building PCRE2 on Windows with CMake" below.
|
||||
|
||||
Versions of src/config.h and src/pcre2.h are distributed in the PCRE2 tarballs
|
||||
under the names src/config.h.generic and src/pcre2.h.generic. These are
|
||||
provided for those who build PCRE2 without using "configure" or CMake. If you
|
||||
use "configure" or CMake, the .generic versions are not used.
|
||||
|
||||
|
||||
GENERIC INSTRUCTIONS FOR THE PCRE2 C LIBRARIES
|
||||
|
||||
There are three possible PCRE2 libraries, each handling data with a specific
|
||||
code unit width: 8, 16, or 32 bits. You can build any combination of them. The
|
||||
following are generic instructions for building a PCRE2 C library "by hand". If
|
||||
you are going to use CMake, this section does not apply to you; you can skip
|
||||
ahead to the CMake section. Note that the settings concerned with 8-bit,
|
||||
16-bit, and 32-bit code units relate to the type of data string that PCRE2
|
||||
processes. They are NOT referring to the underlying operating system bit width.
|
||||
You do not have to do anything special to compile in a 64-bit environment, for
|
||||
example.
|
||||
|
||||
(1) Copy or rename the file src/config.h.generic as src/config.h, and edit the
|
||||
macro settings that it contains to whatever is appropriate for your
|
||||
environment. In particular, you can alter the definition of the NEWLINE
|
||||
macro to specify what character(s) you want to be interpreted as line
|
||||
terminators by default. You need to #define at least one of
|
||||
SUPPORT_PCRE2_8, SUPPORT_PCRE2_16, or SUPPORT_PCRE2_32, depending on which
|
||||
libraries you are going to build. You must set all that apply.
|
||||
|
||||
When you subsequently compile any of the PCRE2 modules, you must specify
|
||||
-DHAVE_CONFIG_H to your compiler so that src/config.h is included in the
|
||||
sources.
|
||||
|
||||
An alternative approach is not to edit src/config.h, but to use -D on the
|
||||
compiler command line to make any changes that you need to the
|
||||
configuration options. In this case -DHAVE_CONFIG_H must not be set.
|
||||
|
||||
NOTE: There have been occasions when the way in which certain parameters
|
||||
in src/config.h are used has changed between releases. (In the
|
||||
configure/make world, this is handled automatically.) When upgrading to a
|
||||
new release, you are strongly advised to review src/config.h.generic
|
||||
before re-using what you had previously.
|
||||
|
||||
Note also that the src/config.h.generic file is created from a config.h
|
||||
that was generated by Autotools, which automatically includes settings of
|
||||
a number of macros that are not actually used by PCRE2 (for example,
|
||||
HAVE_DLFCN_H).
|
||||
|
||||
(2) Copy or rename the file src/pcre2.h.generic as src/pcre2.h.
|
||||
|
||||
(3) EITHER:
|
||||
Copy or rename file src/pcre2_chartables.c.dist as
|
||||
src/pcre2_chartables.c.
|
||||
|
||||
OR:
|
||||
Compile src/pcre2_dftables.c as a stand-alone program (using
|
||||
-DHAVE_CONFIG_H if you have set up src/config.h), and then run it with
|
||||
the single argument "src/pcre2_chartables.c". This generates a set of
|
||||
standard character tables and writes them to that file. The tables are
|
||||
generated using the default C locale for your system. If you want to use
|
||||
a locale that is specified by LC_xxx environment variables, add the -L
|
||||
option to the pcre2_dftables command. You must use this method if you
|
||||
are building on a system that uses EBCDIC code.
|
||||
|
||||
The tables in src/pcre2_chartables.c are defaults. The caller of PCRE2 can
|
||||
specify alternative tables at run time.
|
||||
|
||||
(4) For a library that supports 8-bit code units in the character strings that
|
||||
it processes, compile the following source files from the src directory,
|
||||
setting -DPCRE2_CODE_UNIT_WIDTH=8 as a compiler option. Also set
|
||||
-DHAVE_CONFIG_H if you have set up src/config.h with your configuration,
|
||||
or else use other -D settings to change the configuration as required.
|
||||
|
||||
pcre2_auto_possess.c
|
||||
pcre2_chkdint.c
|
||||
pcre2_chartables.c
|
||||
pcre2_compile.c
|
||||
pcre2_compile_class.c
|
||||
pcre2_config.c
|
||||
pcre2_context.c
|
||||
pcre2_convert.c
|
||||
pcre2_dfa_match.c
|
||||
pcre2_error.c
|
||||
pcre2_extuni.c
|
||||
pcre2_find_bracket.c
|
||||
pcre2_jit_compile.c
|
||||
pcre2_maketables.c
|
||||
pcre2_match.c
|
||||
pcre2_match_data.c
|
||||
pcre2_newline.c
|
||||
pcre2_ord2utf.c
|
||||
pcre2_pattern_info.c
|
||||
pcre2_script_run.c
|
||||
pcre2_serialize.c
|
||||
pcre2_string_utils.c
|
||||
pcre2_study.c
|
||||
pcre2_substitute.c
|
||||
pcre2_substring.c
|
||||
pcre2_tables.c
|
||||
pcre2_ucd.c
|
||||
pcre2_valid_utf.c
|
||||
pcre2_xclass.c
|
||||
|
||||
Make sure that you include -I. in the compiler command (or equivalent for
|
||||
an unusual compiler) so that all included PCRE2 header files are first
|
||||
sought in the src directory under the current directory. Otherwise you run
|
||||
the risk of picking up a previously-installed file from somewhere else.
|
||||
|
||||
Note that you must compile pcre2_jit_compile.c, even if you have not
|
||||
defined SUPPORT_JIT in src/config.h, because when JIT support is not
|
||||
configured, dummy functions are compiled. When JIT support IS configured,
|
||||
pcre2_jit_compile.c #includes other files from the sljit dependency,
|
||||
all of whose names begin with "sljit". It also #includes
|
||||
src/pcre2_jit_match.c and src/pcre2_jit_misc.c, so you should not compile
|
||||
those yourself.
|
||||
|
||||
Note also that the pcre2_fuzzsupport.c file contains special code that is
|
||||
useful to those who want to run fuzzing tests on the PCRE2 library. Unless
|
||||
you are doing that, you can ignore it.
|
||||
|
||||
(5) Now link all the compiled code into an object library in whichever form
|
||||
your system keeps such libraries. This is the PCRE2 C 8-bit library,
|
||||
typically called something like libpcre2-8. If your system has static and
|
||||
shared libraries, you may have to do this once for each type.
|
||||
|
||||
(6) If you want to build a library that supports 16-bit or 32-bit code units,
|
||||
set 16 or 32 as the value of -DPCRE2_CODE_UNIT_WIDTH when following step 4
|
||||
above. If you want to build more than one PCRE2 library, repeat steps 4
|
||||
and 5 as necessary.
|
||||
|
||||
(7) If you want to build the POSIX wrapper functions (which apply only to the
|
||||
8-bit library), ensure that you have the src/pcre2posix.h file and then
|
||||
compile src/pcre2posix.c. Link the result (on its own) as the pcre2posix
|
||||
library. If targeting a DLL in Windows, make sure to include
|
||||
-DPCRE2POSIX_SHARED with your compiler flags.
|
||||
|
||||
(8) The pcre2test program can be linked with any combination of the 8-bit,
|
||||
16-bit and 32-bit libraries (depending on what you specified in
|
||||
src/config.h). Compile src/pcre2test.c; don't forget -DHAVE_CONFIG_H if
|
||||
necessary, but do NOT define PCRE2_CODE_UNIT_WIDTH. Then link with the
|
||||
appropriate library/ies. If you compiled an 8-bit library, pcre2test also
|
||||
needs the pcre2posix wrapper library.
|
||||
|
||||
(9) Run pcre2test on the testinput files in the testdata directory, and check
|
||||
that the output matches the corresponding testoutput files. There are
|
||||
comments about what each test does in the section entitled "Testing PCRE2"
|
||||
in the README file. If you compiled more than one of the 8-bit, 16-bit and
|
||||
32-bit libraries, you need to run pcre2test with the -16 option to do
|
||||
16-bit tests and with the -32 option to do 32-bit tests.
|
||||
|
||||
Some tests are relevant only when certain build-time options are selected.
|
||||
For example, test 4 is for Unicode support, and will not run if you have
|
||||
built PCRE2 without it. See the comments at the start of each testinput
|
||||
file. If you have a suitable Unix-like shell, the RunTest script will run
|
||||
the appropriate tests for you. The command "RunTest list" will output a
|
||||
list of all the tests.
|
||||
|
||||
Note that the supplied files are in Unix format, with just LF characters
|
||||
as line terminators. You may need to edit them to change this if your
|
||||
system uses a different convention.
|
||||
|
||||
(10) If you have built PCRE2 with SUPPORT_JIT, the JIT features can be tested
|
||||
by running pcre2test with the -jit option. This is done automatically by
|
||||
the RunTest script. You might also like to build and run the freestanding
|
||||
JIT test program, src/pcre2_jit_test.c.
|
||||
|
||||
(11) The pcre2test program tests the POSIX wrapper library, but there is also a
|
||||
freestanding test program in src/pcre2posix_test.c. It must be linked with
|
||||
both the pcre2posix library and the 8-bit PCRE2 library.
|
||||
|
||||
(12) If you want to use the pcre2grep command, compile and link
|
||||
src/pcre2grep.c; it uses only the 8-bit PCRE2 library (it does not need
|
||||
the pcre2posix library). If you have built the PCRE2 library with JIT
|
||||
support by defining SUPPORT_JIT in src/config.h, you can also define
|
||||
SUPPORT_PCRE2GREP_JIT, which causes pcre2grep to make use of JIT (unless
|
||||
it is run with --no-jit). If you define SUPPORT_PCRE2GREP_JIT without
|
||||
defining SUPPORT_JIT, pcre2grep does not try to make use of JIT.
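
Steps 1 to 6 above can be sketched as a small POSIX shell helper. This is an
illustration under stated assumptions, not something shipped with PCRE2: the
names run, prepare_pcre2_sources, build_pcre2 and DRY_RUN are invented for this
example, "cc" and "ar rcs" stand in for whatever compiler and archiver your
system provides, and the functions are meant to be called from inside the src
directory. The sketch takes the first option in step 3 (copying
pcre2_chartables.c.dist) and does not edit config.h for you.

```shell
#!/bin/sh
# Hypothetical helper for building a PCRE2 library "by hand" (steps 1-6).
# Assumptions, none of which come from the PCRE2 distribution: a POSIX shell,
# the current directory is src/, the compiler is $CC (default "cc"), and
# "ar rcs" creates a static library. DRY_RUN=1 prints commands only.

# Source files listed in step 4, without the .c suffix.
PCRE2_SOURCES="
pcre2_auto_possess pcre2_chkdint pcre2_chartables pcre2_compile
pcre2_compile_class pcre2_config pcre2_context pcre2_convert
pcre2_dfa_match pcre2_error pcre2_extuni pcre2_find_bracket
pcre2_jit_compile pcre2_maketables pcre2_match pcre2_match_data
pcre2_newline pcre2_ord2utf pcre2_pattern_info pcre2_script_run
pcre2_serialize pcre2_string_utils pcre2_study pcre2_substitute
pcre2_substring pcre2_tables pcre2_ucd pcre2_valid_utf pcre2_xclass"

# Either print a command (dry run) or execute it.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Steps 1-3: create config.h, pcre2.h and the default character tables.
# Edit config.h afterwards (NEWLINE, SUPPORT_PCRE2_8/16/32, and so on).
prepare_pcre2_sources() {
  run cp config.h.generic config.h || return 1
  run cp pcre2.h.generic pcre2.h || return 1
  run cp pcre2_chartables.c.dist pcre2_chartables.c
}

# Steps 4-6: compile every source file for one code unit width and archive
# the objects. Usage: build_pcre2 8   (or 16, or 32)
build_pcre2() {
  width=${1:-8}
  objs=""
  for name in $PCRE2_SOURCES; do
    run "${CC:-cc}" -DHAVE_CONFIG_H -DPCRE2_CODE_UNIT_WIDTH="$width" -I. \
      -c "$name.c" -o "$name-$width.o" || return 1
    objs="$objs $name-$width.o"
  done
  # $objs is deliberately unquoted so that it splits into separate arguments.
  run ar rcs "libpcre2-$width.a" $objs
}
```

A cautious way to use it is to run "DRY_RUN=1 build_pcre2 16" first and review
the printed commands. The object files carry the width in their names so that
the 8-bit, 16-bit and 32-bit builds can share one directory.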
|
||||
|
||||
|
||||
STACK SIZE IN WINDOWS ENVIRONMENTS
|
||||
|
||||
Prior to release 10.30 the default system stack size of 1MiB in some Windows
|
||||
environments caused issues with some tests. This should no longer be the case
|
||||
for 10.30 and later releases.
|
||||
|
||||
|
||||
LINKING PROGRAMS IN WINDOWS ENVIRONMENTS
|
||||
|
||||
If you want to statically link a program against a PCRE2 library in the form of
|
||||
a non-dll .a file, you must define PCRE2_STATIC before including src/pcre2.h.
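
One way to satisfy this without touching the program source is to pass the
definition on the compiler command line, since a -D option takes effect before
any #include is processed. The sketch below only prints such a command. It
assumes a MinGW-style gcc driver; myprog.c, myprog.exe, the library location
and the names STATIC_CFLAGS and static_link_command are placeholders invented
for this example. PCRE2_CODE_UNIT_WIDTH is included because pcre2.h expects it
to be defined by the including program.

```shell
#!/bin/sh
# Hypothetical example: static linking against libpcre2-8.a.
# "myprog.c" and the library path are placeholders, not PCRE2 names.
STATIC_CFLAGS="-DPCRE2_STATIC -DPCRE2_CODE_UNIT_WIDTH=8"

static_link_command() {
  # Print the command instead of running it, so it can be reviewed first.
  echo "${CC:-gcc} $STATIC_CFLAGS -o myprog.exe myprog.c libpcre2-8.a"
}
```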
|
||||
|
||||
|
||||
CALLING CONVENTIONS IN WINDOWS ENVIRONMENTS
|
||||
|
||||
It is possible to compile programs to use different calling conventions using
|
||||
MSVC. Search the web for "calling conventions" for more information. To make it
|
||||
easier to change the calling convention for the exported functions in a
|
||||
PCRE2 library, the macro PCRE2_CALL_CONVENTION is present in all the external
|
||||
definitions. It can be set externally when compiling (e.g. in CFLAGS). If it is
|
||||
not set, it defaults to empty; the default calling convention is then used
|
||||
(which is what is wanted most of the time).
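
As a minimal sketch of setting the macro through CFLAGS, the fragment below
appends a definition that selects __stdcall, which is one of the MSVC calling
convention keywords. The variable name PCRE2_CALLCONV_FLAG is invented for this
example, and __stdcall is only an illustrative choice.

```shell
#!/bin/sh
# Hypothetical example: request __stdcall for the exported PCRE2 functions.
# Any other convention supported by your compiler could be substituted.
PCRE2_CALLCONV_FLAG="-DPCRE2_CALL_CONVENTION=__stdcall"

# Append to CFLAGS, keeping whatever was already set.
CFLAGS="${CFLAGS:+$CFLAGS }$PCRE2_CALLCONV_FLAG"
export CFLAGS
```

Because the macro appears in the external definitions, it is presumably
sensible to compile the programs that include the PCRE2 header with the same
setting as the library itself.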
|
||||
|
||||
|
||||
COMMENTS ABOUT WIN32 BUILDS (see also "BUILDING PCRE2 ON WINDOWS WITH CMAKE")
|
||||
|
||||
There are two ways of building PCRE2 using the "configure, make, make install"
|
||||
paradigm on Windows systems: using MinGW or using Cygwin. These are not at all
|
||||
the same thing; they are completely different from each other. There is also
|
||||
support for building using CMake, which some users find a more straightforward
|
||||
way of building PCRE2 under Windows.
|
||||
|
||||
The MinGW home page (http://www.mingw.org/) says this:
|
||||
|
||||
MinGW: A collection of freely available and freely distributable Windows
|
||||
specific header files and import libraries combined with GNU toolsets that
|
||||
allow one to produce native Windows programs that do not rely on any
|
||||
3rd-party C runtime DLLs.
|
||||
|
||||
The Cygwin home page (http://www.cygwin.com/) says this:
|
||||
|
||||
Cygwin is a Linux-like environment for Windows. It consists of two parts:
|
||||
|
||||
. A DLL (cygwin1.dll) which acts as a Linux API emulation layer providing
|
||||
substantial Linux API functionality
|
||||
|
||||
. A collection of tools which provide Linux look and feel.
|
||||
|
||||
On both MinGW and Cygwin, PCRE2 should build correctly using:
|
||||
|
||||
./configure && make && make install
|
||||
|
||||
This should create two libraries called libpcre2-8 and libpcre2-posix. These
|
||||
are independent libraries: when you link with libpcre2-posix you must also link
|
||||
with libpcre2-8, which contains the basic functions.
|
||||
|
||||
Using Cygwin's compiler generates libraries and executables that depend on
|
||||
cygwin1.dll. If a library that is generated this way is distributed,
|
||||
cygwin1.dll has to be distributed as well. Since cygwin1.dll is under the GPL
|
||||
licence, this forces not only PCRE2 to be under the GPL, but also the entire
|
||||
application. A distributor who wants to keep their own code proprietary must
|
||||
purchase an appropriate Cygwin licence.
|
||||
|
||||
MinGW has no such restrictions. The MinGW compiler generates a library or
|
||||
executable that can run standalone on Windows without any third party dll or
|
||||
licensing issues.
|
||||
|
||||
But there is more complication:
|
||||
|
||||
If a Cygwin user uses the -mno-cygwin Cygwin gcc flag, what that really does is
|
||||
to tell Cygwin's gcc to use the MinGW gcc. Cygwin's gcc is only acting as a
|
||||
front end to MinGW's gcc (if you install Cygwin's gcc, you get both Cygwin's
|
||||
gcc and MinGW's gcc). So, a user can:
|
||||
|
||||
. Build native binaries by using MinGW or by getting Cygwin and using
|
||||
-mno-cygwin.
|
||||
|
||||
. Build binaries that depend on cygwin1.dll by using Cygwin with the normal
|
||||
compiler flags.
|
||||
|
||||
The test files that are supplied with PCRE2 are in UNIX format, with LF
|
||||
characters as line terminators. Unless your PCRE2 library uses a default
|
||||
newline option that includes LF as a valid newline, it may be necessary to
|
||||
change the line terminators in the test files to get some of the tests to work.
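
If the terminators do need changing, a line-oriented filter is one way to do
it. The following sketch assumes a POSIX shell and awk; the function name
lf_to_crlf is invented for this example. It is a plain text filter, so the
result should be checked for any test file that contains unusual bytes.

```shell
#!/bin/sh
# Hypothetical helper: rewrite LF line endings as CRLF.
# Reads standard input and writes standard output. A line that already ends
# in CRLF is left with a single CRLF rather than gaining a second CR.
lf_to_crlf() {
  awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }'
}
```

For example, "lf_to_crlf < testdata/testinput1 > testinput1.crlf" writes a
converted copy and leaves the original file untouched.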
|
||||
|
||||
|
||||
BUILDING PCRE2 ON WINDOWS WITH CMAKE
|
||||
|
||||
CMake is an alternative configuration facility that can be used instead of
|
||||
"configure". CMake creates project files (make files, solution files, etc.)
|
||||
tailored to numerous development environments, including Visual Studio,
|
||||
Borland, Msys, MinGW, NMake, and Unix. If possible, use short paths with no
|
||||
spaces in the names for your CMake installation and your PCRE2 source and build
|
||||
directories.
|
||||
|
||||
If you are using CMake and encounter errors, deleting the CMake cache and
|
||||
restarting from a fresh build may fix the error. In the CMake GUI, the cache can
|
||||
be deleted by selecting "File > Delete Cache"; or the file "CMakeCache.txt" can
|
||||
be deleted.
|
||||
|
||||
1. Install the latest CMake version available from http://www.cmake.org/, and
|
||||
ensure that cmake\bin is on your path.
|
||||
|
||||
2. Unzip (retaining folder structure) the PCRE2 source tree into a source
|
||||
directory such as C:\pcre2. You should ensure your local date and time
|
||||
is not earlier than the file dates in your source dir if the release is
|
||||
very new.
|
||||
|
||||
3. Create a new, empty build directory, preferably a subdirectory of the
|
||||
source dir. For example, C:\pcre2\pcre2-xx\build.
|
||||
|
||||
4. Run CMake.
|
||||
|
||||
- Using the CLI, simply run `cmake ..` inside the `build/` directory. You can
|
||||
use the `ccmake` ncurses GUI to select and configure PCRE2 features.
|
||||
|
||||
- Using the CMake GUI:
|
||||
|
||||
a) Run cmake-gui from the Shell environment of your build tool, for
|
||||
example, Msys for Msys/MinGW or Visual Studio Command Prompt for
|
||||
VC/VC++.
|
||||
|
||||
b) Enter C:\pcre2\pcre2-xx and C:\pcre2\pcre2-xx\build for the source and
|
||||
build directories, respectively.
|
||||
|
||||
c) Press the "Configure" button.
|
||||
|
||||
d) Select the particular IDE / build tool that you are using (Visual
|
||||
Studio, MSYS makefiles, MinGW makefiles, etc.)
|
||||
|
||||
e) The GUI will then list several configuration options. This is where
|
||||
you can disable Unicode support or select other PCRE2 optional features.
|
||||
|
||||
f) Press "Configure" again. The adjacent "Generate" button should now be
|
||||
active.
|
||||
|
||||
g) Press "Generate".
|
||||
|
||||
5. The build directory should now contain a usable build system, be it a
|
||||
solution file for Visual Studio, makefiles for MinGW, etc. Exit from
|
||||
cmake-gui and use the generated build system with your compiler or IDE.
|
||||
E.g., for MinGW you can run "make", or for Visual Studio, open the PCRE2
|
||||
solution, select the desired configuration (Debug, or Release, etc.) and
|
||||
build the ALL_BUILD project.
|
||||
|
||||
Regardless of build system used, `cmake --build .` will build it.
|
||||
|
||||
6. If during configuration with cmake-gui you've elected to build the test
|
||||
programs, you can execute them by building the test project. E.g., for
|
||||
MinGW: "make test"; for Visual Studio build the RUN_TESTS project. The
|
||||
most recent build configuration is targeted by the tests. A summary of
|
||||
test results is presented. Complete test output is subsequently
|
||||
available for review in Testing\Temporary under your build dir.
|
||||
|
||||
Regardless of build system used, `ctest` will run the tests.
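
The command-line route through steps 3 to 6 can be sketched as follows. This is
an illustration only: it assumes a POSIX-style shell (for example Msys) with
cmake and ctest on the path, it is run from the top of the PCRE2 source tree,
and the names run, cmake_build_and_test and DRY_RUN are invented here. The
options used are generic CMake ones; PCRE2-specific features would still be
selected with ccmake or cmake-gui as described above.

```shell
#!/bin/sh
# Hypothetical wrapper around the CMake command-line workflow (steps 3-6).
# DRY_RUN=1 prints the commands instead of executing them.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: cmake_build_and_test [Release|Debug|...]
# The body runs in a subshell, so the caller's directory is not changed.
cmake_build_and_test() (
  config=${1:-Release}
  run mkdir -p build || exit 1
  run cd build || exit 1
  # CMAKE_BUILD_TYPE matters for single-configuration generators (makefiles);
  # --config and -C matter for multi-configuration ones (Visual Studio).
  run cmake -DCMAKE_BUILD_TYPE="$config" .. || exit 1
  run cmake --build . --config "$config" || exit 1
  # Only useful if the test programs were enabled during configuration.
  run ctest -C "$config" --output-on-failure
)
```

Running "DRY_RUN=1 cmake_build_and_test Debug" lists the five commands without
touching the source tree.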
|
||||
|
||||
|
||||
BUILDING PCRE2 ON WINDOWS WITH VISUAL STUDIO
|
||||
|
||||
The code currently cannot be compiled without an inttypes.h header, which is
|
||||
available only with Visual Studio 2013 or newer. However, this portable and
|
||||
permissively-licensed implementation of the stdint.h header could be used as an
|
||||
alternative:
|
||||
|
||||
http://www.azillionmonkeys.com/qed/pstdint.h
|
||||
|
||||
Just rename it and drop it into the top level of the build tree.
|
||||
|
||||
|
||||
TESTING WITH RUNTEST.BAT
|
||||
|
||||
If configured with CMake, building the test project ("make test" or building
|
||||
RUN_TESTS in Visual Studio) creates (and runs) pcre2_test.bat (and depending
|
||||
on your configuration options, possibly other test programs) in the build
|
||||
directory. The pcre2_test.bat script runs RunTest.bat with correct source and
|
||||
exe paths.
|
||||
|
||||
For manual testing with RunTest.bat, provided the build dir is a subdirectory
|
||||
of the source directory: Open a command shell window. Chdir to the location
|
||||
of your pcre2test.exe and pcre2grep.exe programs. Call RunTest.bat with
|
||||
"..\RunTest.Bat" or "..\..\RunTest.bat" as appropriate.
|
||||
|
||||
To run only a particular test with RunTest.bat, provide a test number argument.
|
||||
|
||||
Otherwise:
|
||||
|
||||
1. Copy RunTest.bat into the directory where pcre2test.exe and pcre2grep.exe
|
||||
have been created.
|
||||
|
||||
2. Edit RunTest.bat to identify the full or relative location of
|
||||
the PCRE2 source (in which the testdata folder resides), e.g.:
|
||||
|
||||
set srcdir=C:\pcre2\pcre2-10.00
|
||||
|
||||
3. In a Windows command environment, chdir to the location of your bat and
|
||||
exe programs.
|
||||
|
||||
4. Run RunTest.bat. Test outputs will automatically be compared to expected
|
||||
results, and discrepancies will be identified in the console output.
|
||||
|
||||
To independently test the just-in-time compiler, run pcre2_jit_test.exe.
|
||||
|
||||
|
||||
BUILDING PCRE2 ON NATIVE Z/OS AND Z/VM
|
||||
|
||||
z/OS and z/VM are operating systems for mainframe computers, produced by IBM.
|
||||
The character code used is EBCDIC, not ASCII or Unicode. In z/OS, UNIX APIs and
|
||||
applications can be supported through UNIX System Services, and in such an
|
||||
environment it should be possible to build PCRE2 in the same way as in other
|
||||
systems, with the EBCDIC related configuration settings, but it is not known if
|
||||
anybody has tried this.
|
||||
|
||||
In native z/OS (without UNIX System Services) and in z/VM, special ports are
|
||||
required. For details, please see file 939 on this web site:
|
||||
|
||||
http://www.cbttape.org
|
||||
|
||||
Everything in that location, source and executable, is in EBCDIC and native
|
||||
z/OS file formats. The port provides an API for LE languages such as COBOL and
|
||||
for the z/OS and z/VM versions of the Rexx languages.
|
||||
|
||||
|
||||
BUILDING PCRE2 UNDER VMS
|
||||
|
||||
Alexey Chuphin has contributed some auxiliary files for building PCRE2 under
|
||||
OpenVMS. They are in the "vms" directory in the distribution tarball. Please
|
||||
read the file called vms/openvms_readme.txt. The pcre2test and pcre2grep
|
||||
programs contain some VMS-specific code.
|
||||
|
||||
==============================
|
||||
Last updated: 26 December 2024
|
||||
==============================
|
||||
|
||||
970
3rd/pcre2/doc/html/README.txt
Normal file
@@ -0,0 +1,970 @@
|
||||
README file for PCRE2 (Perl-compatible regular expression library)
|
||||
------------------------------------------------------------------
|
||||
|
||||
PCRE2 is a re-working of the original PCRE1 library to provide an entirely new
|
||||
API. Since its initial release in 2015, there has been further development of
|
||||
the code and it now differs from PCRE1 in more than just the API. There are new
|
||||
features, and the internals have been improved. The original PCRE1 library is
|
||||
now obsolete and no longer maintained. The latest release of PCRE2 is available
|
||||
in .tar.gz, .tar.bz2, or .zip form from this GitHub repository:
|
||||
|
||||
https://github.com/PCRE2Project/pcre2/releases
|
||||
|
||||
There is a mailing list for discussion about the development of PCRE2 at
|
||||
pcre2-dev@googlegroups.com. You can subscribe by sending an email to
|
||||
pcre2-dev+subscribe@googlegroups.com.
|
||||
|
||||
You can access the archives and also subscribe or manage your subscription
|
||||
here:
|
||||
|
||||
https://groups.google.com/g/pcre2-dev
|
||||
|
||||
Please read the NEWS file if you are upgrading from a previous release. The
|
||||
contents of this README file are:
|
||||
|
||||
The PCRE2 APIs
|
||||
Documentation for PCRE2
|
||||
Building PCRE2 on non-Unix-like systems
|
||||
Building PCRE2 without using autotools
|
||||
Building PCRE2 using autotools
|
||||
Retrieving configuration information
|
||||
Shared libraries
|
||||
Cross-compiling using autotools
|
||||
Making new tarballs
|
||||
Testing PCRE2
|
||||
Character tables
|
||||
File manifest
|
||||
|
||||
|
||||
The PCRE2 APIs
|
||||
--------------
|
||||
|
||||
PCRE2 is written in C, and it has its own API. There are three sets of
|
||||
functions, one for the 8-bit library, which processes strings of bytes, one for
|
||||
the 16-bit library, which processes strings of 16-bit values, and one for the
|
||||
32-bit library, which processes strings of 32-bit values. Unlike PCRE1, there
|
||||
are no C++ wrappers.
|
||||
|
||||
The distribution does contain a set of C wrapper functions for the 8-bit
|
||||
library that are based on the POSIX regular expression API (see the pcre2posix
|
||||
man page). These are built into a library called libpcre2-posix. Note that this
|
||||
just provides a POSIX calling interface to PCRE2; the regular expressions
|
||||
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
|
||||
and does not give full access to all of PCRE2's facilities.
|
||||
|
||||
The header file for the POSIX-style functions is called pcre2posix.h. The
|
||||
official POSIX name is regex.h, but I did not want to risk possible problems
|
||||
with existing files of that name by distributing it that way. To use PCRE2 with
|
||||
an existing program that uses the POSIX API, pcre2posix.h will have to be
|
||||
renamed or pointed at by a link (or the program modified, of course). See the
|
||||
pcre2posix documentation for more details.
|
||||
|
||||
|
||||
Documentation for PCRE2
|
||||
-----------------------
|
||||
|
||||
If you install PCRE2 in the normal way on a Unix-like system, you will end up
|
||||
with a set of man pages whose names all start with "pcre2". The one that is
|
||||
just called "pcre2" lists all the others. In addition to these man pages, the
|
||||
PCRE2 documentation is supplied in two other forms:
|
||||
|
||||
1. There are files called doc/pcre2.txt, doc/pcre2grep.txt, and
|
||||
doc/pcre2test.txt in the source distribution. The first of these is a
|
||||
concatenation of the text forms of all the section 3 man pages except the
|
||||
listing of pcre2demo.c and those that summarize individual functions. The
|
||||
other two are the text forms of the section 1 man pages for the pcre2grep
|
||||
and pcre2test commands. These text forms are provided for ease of scanning
|
||||
with text editors or similar tools. They are installed in
|
||||
<prefix>/share/doc/pcre2, where <prefix> is the installation prefix
|
||||
(defaulting to /usr/local).
|
||||
|
||||
2. A set of files containing all the documentation in HTML form, hyperlinked
|
||||
in various ways, and rooted in a file called index.html, is distributed in
|
||||
doc/html and installed in <prefix>/share/doc/pcre2/html.
|
||||
|
||||
|
||||
Building PCRE2 on non-Unix-like systems
|
||||
---------------------------------------
|
||||
|
||||
For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
|
||||
your system supports the use of "configure" and "make" you may be able to build
|
||||
PCRE2 using autotools in the same way as for many Unix-like systems.
|
||||
|
||||
PCRE2 can also be configured using CMake, which can be run in various ways
|
||||
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
|
||||
NON-AUTOTOOLS-BUILD has information about CMake.
|
||||
|
||||
PCRE2 has been compiled on many different operating systems. It should be
|
||||
straightforward to build PCRE2 on any system that has a Standard C compiler and
|
||||
library, because it uses only Standard C functions.
|
||||
|
||||
|
||||
Building PCRE2 without using autotools
|
||||
--------------------------------------
|
||||
|
||||
The use of autotools (in particular, libtool) is problematic in some
|
||||
environments, even some that are Unix or Unix-like. See the NON-AUTOTOOLS-BUILD
|
||||
file for ways of building PCRE2 without using autotools.
|
||||
|
||||
|
||||
Building PCRE2 using autotools
|
||||
------------------------------
|
||||
|
||||
The following instructions assume the use of the widely used "configure; make;
|
||||
make install" (autotools) process.
|
||||
|
||||
If you have downloaded and unpacked a PCRE2 release tarball, run the
|
||||
"configure" command from the PCRE2 directory, with your current directory set
|
||||
to the directory where you want the files to be created. This command is a
|
||||
standard GNU "autoconf" configuration script, for which generic instructions
|
||||
are supplied in the file INSTALL.
|
||||
|
||||
The files in the GitHub repository do not contain "configure". If you have
|
||||
downloaded the PCRE2 source files from GitHub, before you can run "configure"
|
||||
you must run the shell script called autogen.sh. This runs a number of
|
||||
autotools to create a "configure" script (you must of course have the autotools
|
||||
commands installed in order to do this).
|
||||
|
||||
Most commonly, people build PCRE2 within its own distribution directory, and in
|
||||
this case, on many systems, just running "./configure" is sufficient. However,
|
||||
the usual methods of changing standard defaults are available. For example:
|
||||
|
||||
CFLAGS='-O2 -Wall' ./configure --prefix=/opt/local
|
||||
|
||||
This command specifies that the C compiler should be run with the flags '-O2
|
||||
-Wall' instead of the default, and that "make install" should install PCRE2
|
||||
under /opt/local instead of the default /usr/local.
|
||||
|
||||
If you want to build in a different directory, just run "configure" with that
|
||||
directory as current. For example, suppose you have unpacked the PCRE2 source
|
||||
into /source/pcre2/pcre2-xxx, but you want to build it in
|
||||
/build/pcre2/pcre2-xxx:
|
||||
|
||||
cd /build/pcre2/pcre2-xxx
|
||||
/source/pcre2/pcre2-xxx/configure
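
The two commands above can be wrapped in a small function that also creates the
build directory and passes any further options through to "configure". This is
a sketch under stated assumptions: the names run, configure_out_of_tree and
DRY_RUN are invented for this example, and the source directory should be given
as an absolute path because the function changes directory before calling
"configure".

```shell
#!/bin/sh
# Hypothetical wrapper for an out-of-tree autotools build.
# DRY_RUN=1 prints the commands instead of executing them.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: configure_out_of_tree SRCDIR BUILDDIR [configure options...]
# The body runs in a subshell, so the caller's directory is not changed.
configure_out_of_tree() (
  [ "$#" -ge 2 ] || {
    echo "usage: configure_out_of_tree SRCDIR BUILDDIR [options]" >&2
    exit 2
  }
  src=$1
  build=$2
  shift 2
  run mkdir -p "$build" || exit 1
  run cd "$build" || exit 1
  run "$src/configure" "$@" || exit 1
  run make
)
```

For example, "configure_out_of_tree /source/pcre2/pcre2-xxx
/build/pcre2/pcre2-xxx --prefix=/opt/local" combines the out-of-tree build with
the installation prefix shown earlier.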
|
||||
|
||||
PCRE2 is written in C and is normally compiled as a C library. However, it is
|
||||
possible to build it as a C++ library, though the provided building apparatus
|
||||
does not have any features to support this.
|
||||
|
||||
There are some optional features that can be included or omitted from the PCRE2
|
||||
library. They are also documented in the pcre2build man page.
|
||||
|
||||
. By default, both shared and static libraries are built. You can change this
|
||||
by adding one of these options to the "configure" command:
|
||||
|
||||
--disable-shared
|
||||
--disable-static
|
||||
|
||||
Setting --disable-shared ensures that PCRE2 libraries are built as static
|
||||
libraries. The binaries that are then created as part of the build process
|
||||
(for example, pcre2test and pcre2grep) are linked statically with one or more
|
||||
PCRE2 libraries, but may also be dynamically linked with other libraries such
|
||||
as libc. If you want these binaries to be fully statically linked, you can
|
||||
set LDFLAGS like this:
|
||||
|
||||
LDFLAGS=--static ./configure --disable-shared
|
||||
|
||||
Note the two hyphens in --static. Of course, this works only if static
|
||||
versions of all the relevant libraries are available for linking. See also
|
||||
"Shared libraries" below.
|
||||
|
||||
. By default, only the 8-bit library is built. If you add --enable-pcre2-16 to
|
||||
the "configure" command, the 16-bit library is also built. If you add
|
||||
--enable-pcre2-32 to the "configure" command, the 32-bit library is also
|
||||
built. If you want only the 16-bit or 32-bit library, use --disable-pcre2-8
|
||||
to disable building the 8-bit library.
|
||||
|
||||
. If you want to include support for just-in-time (JIT) compiling, which can
|
||||
give large performance improvements on certain platforms, add --enable-jit to
|
||||
the "configure" command. This support is available only for certain hardware
|
||||
architectures. If you try to enable it on an unsupported architecture, there
|
||||
will be a compile time error. If in doubt, use --enable-jit=auto, which
|
||||
enables JIT only if the current hardware is supported.
|
||||
|
||||
. If you are enabling JIT under an SELinux environment, you may also want to add
|
||||
--enable-jit-sealloc, which enables the use of an executable memory allocator
|
||||
that is compatible with SELinux. Warning: this allocator is experimental!
|
||||
It does not support the fork() operation and may crash when no disk space is
|
||||
available. This option has no effect if JIT is disabled.
|
||||
|
||||
. If you do not want to make use of the default support for UTF-8 Unicode
|
||||
character strings in the 8-bit library, UTF-16 Unicode character strings in
|
||||
the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
|
||||
library, you can add --disable-unicode to the "configure" command. This
|
||||
reduces the size of the libraries. It is not possible to configure one
|
||||
library with Unicode support, and another without, in the same configuration.
|
||||
It is also not possible to use --enable-ebcdic (see below) with Unicode
|
||||
support, so if this option is set, you must also use --disable-unicode.
|
||||
|
||||
When Unicode support is available, the use of a UTF encoding still has to be
|
||||
enabled by setting the PCRE2_UTF option at run time or starting a pattern
|
||||
with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
|
||||
be either ASCII or UTF-8/16/32, even when running on EBCDIC platforms.
|
||||
|
||||
As well as supporting UTF strings, Unicode support includes support for the
|
||||
\P, \p, and \X sequences that recognize Unicode character properties.
|
||||
However, only a subset of Unicode properties are supported; see the
|
||||
pcre2pattern man page for details. Escape sequences such as \d and \w in
|
||||
patterns do not by default make use of Unicode properties, but can be made to
|
||||
do so by setting the PCRE2_UCP option or starting a pattern with (*UCP).
|
||||
|
||||
. You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
|
||||
of the preceding, or any of the Unicode newline sequences, or the NUL (zero)
|
||||
character as indicating the end of a line. Whatever you specify at build time
|
||||
is the default; the caller of PCRE2 can change the selection at run time. The
|
||||
default newline indicator is a single LF character (the Unix standard). You
|
||||
can specify the default newline indicator by adding --enable-newline-is-cr,
|
||||
--enable-newline-is-lf, --enable-newline-is-crlf,
|
||||
--enable-newline-is-anycrlf, --enable-newline-is-any, or
|
||||
--enable-newline-is-nul to the "configure" command, respectively.
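The newline conventions listed above can be pictured with a small C sketch. It is not PCRE2's code: the enum and the function newline_len() are hypothetical names invented for illustration, and the "any Unicode newline" setting is left out because the full list of sequences is not given here.

```c
#include <stddef.h>

/* Hypothetical names for five of the conventions described above. */
enum newline_kind { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_NUL };

/* Returns the length of the line terminator that starts at s[pos],
   or 0 if no terminator of the chosen kind starts there. */
static size_t newline_len(const char *s, size_t len, size_t pos,
                          enum newline_kind kind)
{
    if (pos >= len) return 0;

    char c = s[pos];
    int crlf = (c == '\r' && pos + 1 < len && s[pos + 1] == '\n');

    switch (kind) {
    case NL_CR:      return c == '\r' ? 1 : 0;
    case NL_LF:      return c == '\n' ? 1 : 0;
    case NL_CRLF:    return crlf ? 2 : 0;
    case NL_ANYCRLF: return crlf ? 2 : ((c == '\r' || c == '\n') ? 1 : 0);
    case NL_NUL:     return c == '\0' ? 1 : 0;
    }
    return 0;
}
```

With the subject "a\r\nb", the CRLF convention sees one two-character terminator at offset 1, whereas the LF convention sees a one-character terminator at offset 2 and treats the CR as ordinary data.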
|
||||
|
||||
. By default, the sequence \R in a pattern matches any Unicode line ending
|
||||
sequence. This is independent of the option specifying what PCRE2 considers
|
||||
to be the end of a line (see above). However, the caller of PCRE2 can
|
||||
restrict \R to match only CR, LF, or CRLF. You can make this the default by
|
||||
adding --enable-bsr-anycrlf to the "configure" command (bsr = "backslash R").
|
||||
|
||||
. In a pattern, the escape sequence \C matches a single code unit, even in a
|
||||
UTF mode. This can be dangerous because it breaks up multi-code-unit
|
||||
characters. You can build PCRE2 with the use of \C permanently locked out by
|
||||
adding --enable-never-backslash-C (note the upper case C) to the "configure"
|
||||
command. When \C is allowed by the library, individual applications can lock
|
||||
it out by calling pcre2_compile() with the PCRE2_NEVER_BACKSLASH_C option.
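To see why matching a single code unit can be dangerous, it may help to count code units and characters separately. The following is a minimal sketch for UTF-8 only; utf8_char_count() is a hypothetical helper, not a PCRE2 function, and it assumes that the input is valid UTF-8.

```c
#include <stddef.h>

/* Counts characters in a UTF-8 string by skipping continuation bytes,
   which always have the bit pattern 10xxxxxx. Assumes valid UTF-8. */
static size_t utf8_char_count(const unsigned char *s, size_t code_units)
{
    size_t chars = 0;
    for (size_t i = 0; i < code_units; i++) {
        if ((s[i] & 0xC0) != 0x80) chars++;
    }
    return chars;
}
```

The character U+00E9 is encoded in UTF-8 as the two code units 0xC3 0xA9, so the function reports one character for two code units. A \C at that position would match only 0xC3, leaving the matcher in the middle of the character.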
|
||||
|
||||
. PCRE2 has a counter that limits the depth of nesting of parentheses in a
|
||||
pattern. This limits the amount of system stack that a pattern uses when it
|
||||
is compiled. The default is 250, but you can change it by setting, for
|
||||
example,
|
||||
|
||||
--with-parens-nest-limit=500
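As a rough illustration of what a nesting limit measures, the sketch below finds the deepest level of parentheses in a pattern string. It is deliberately simplified and is not how PCRE2 parses patterns: it skips backslash-escaped characters but knows nothing about character classes, comments, or \Q...\E quoting. Both function names are hypothetical.

```c
#include <stddef.h>

/* Returns the deepest level of '(' nesting in the pattern. A character
   that follows a backslash is skipped; nothing else is special here. */
static int max_paren_depth(const char *pattern)
{
    int depth = 0, deepest = 0;

    for (size_t i = 0; pattern[i] != '\0'; i++) {
        if (pattern[i] == '\\') {
            if (pattern[i + 1] != '\0') i++;   /* skip the escaped character */
        } else if (pattern[i] == '(') {
            if (++depth > deepest) deepest = depth;
        } else if (pattern[i] == ')' && depth > 0) {
            depth--;
        }
    }
    return deepest;
}

/* Non-zero if the pattern's nesting depth does not exceed the limit. */
static int within_nest_limit(const char *pattern, int limit)
{
    return max_paren_depth(pattern) <= limit;
}
```

For example, "a(b(c))d" has a depth of 2, so it passes easily under the default of 250, whereas within_nest_limit("((x))", 1) yields zero.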
|
||||
|
||||
. PCRE2 has a counter that can be set to limit the amount of computing resource
|
||||
it uses when matching a pattern. If the limit is exceeded during a match, the
|
||||
match fails. The default is ten million. You can change the default by
|
||||
setting, for example,
|
||||
|
||||
--with-match-limit=500000
|
||||
|
||||
on the "configure" command. This is just the default; individual calls to
|
||||
pcre2_match() or pcre2_dfa_match() can supply their own value. There is more
|
||||
discussion in the pcre2api man page (search for pcre2_set_match_limit).
|
||||
|
||||
. There is a separate counter that limits the depth of nested backtracking
|
||||
(pcre2_match()) or nested function calls (pcre2_dfa_match()) during a
|
||||
matching process, which indirectly limits the amount of heap memory that is
|
||||
used, and in the case of pcre2_dfa_match() the amount of stack as well. This
|
||||
counter also has a default of ten million, which is essentially "unlimited".
|
||||
You can change the default by setting, for example,
|
||||
|
||||
--with-match-limit-depth=5000
|
||||
|
||||
There is more discussion in the pcre2api man page (search for
|
||||
pcre2_set_depth_limit).
|
||||
|
||||
. You can also set an explicit limit on the amount of heap memory used by
|
||||
the pcre2_match() and pcre2_dfa_match() interpreters:
|
||||
|
||||
--with-heap-limit=500
|
||||
|
||||
The units are kibibytes (units of 1024 bytes). This limit does not apply when
|
||||
the JIT optimization (which has its own memory control features) is used.
|
||||
There is more discussion on the pcre2api man page (search for
|
||||
pcre2_set_heap_limit).
|
||||
|
||||
. In the 8-bit library, the default maximum compiled pattern size is around
|
||||
64 kibibytes. You can increase this by adding --with-link-size=3 to the
|
||||
"configure" command. PCRE2 then uses three bytes instead of two for offsets
|
||||
to different parts of the compiled pattern. In the 16-bit library,
|
||||
--with-link-size=3 is the same as --with-link-size=4, which (in both
|
||||
libraries) uses four-byte offsets. Increasing the internal link size reduces
|
||||
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
|
||||
link size setting is ignored, as 4-byte offsets are always used.
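The link size rules above can be summarized in a few lines of C. This is a sketch of the arithmetic only, with hypothetical function names; it assumes that the requested link size is 2, 3, or 4, and it says nothing about how PCRE2 stores the offsets internally.

```c
#include <stdint.h>

/* Offset width in bytes that is actually used, following the rules in
   the text: the 32-bit library always uses 4, and the 16-bit library
   treats a request for 3 as 4. */
static int effective_link_bytes(int code_unit_bits, int requested_link_size)
{
    if (code_unit_bits == 32) return 4;
    if (code_unit_bits == 16 && requested_link_size == 3) return 4;
    return requested_link_size;
}

/* Largest value that fits in an offset of the given width in bytes. */
static uint64_t max_offset_value(int link_bytes)
{
    return (UINT64_C(1) << (8 * link_bytes)) - 1;
}
```

A two-byte offset can hold values up to 65535, which is where the figure of around 64 kibibytes for the 8-bit library comes from. Three bytes raise the ceiling to 16777215.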
|
||||
|
||||
. Lookbehind assertions in which one or more branches can match a variable
|
||||
number of characters are supported only if there is a maximum matching length
|
||||
for each top-level branch. There is a limit to this maximum that defaults to
|
||||
255 characters. You can alter this default by a setting such as
|
||||
|
||||
--with-max-varlookbehind=100
|
||||
|
||||
The limit can be changed at runtime by calling pcre2_set_max_varlookbehind().
|
||||
Lookbehind assertions in which every branch matches a fixed number of
|
||||
characters (not necessarily all the same) are not constrained by this limit.
|
||||
|
||||
. For speed, PCRE2 uses four tables for manipulating and identifying characters
|
||||
whose code point values are less than 256. By default, it uses a set of
|
||||
tables for ASCII encoding that is part of the distribution. If you specify
|
||||
|
||||
--enable-rebuild-chartables
|
||||
|
||||
a program called pcre2_dftables is compiled and run in the default C locale
|
||||
when you obey "make". It builds a source file called pcre2_chartables.c. If
|
||||
you do not specify this option, pcre2_chartables.c is created as a copy of
|
||||
pcre2_chartables.c.dist. See "Character tables" below for further
|
||||
information.
|
||||
|
||||
. It is possible to compile PCRE2 for use on systems that use EBCDIC as their
|
||||
character code (as opposed to ASCII/Unicode) by specifying
|
||||
|
||||
--enable-ebcdic --disable-unicode
|
||||
|
||||
This automatically implies --enable-rebuild-chartables (see above). However,
|
||||
when PCRE2 is built this way, it always operates in EBCDIC. It cannot support
|
||||
both EBCDIC and UTF-8/16/32. There is a second option, --enable-ebcdic-nl25,
|
||||
which specifies that the code value for the EBCDIC NL character is 0x25
|
||||
instead of the default 0x15.
|
||||
|
||||
. If you specify --enable-debug, additional debugging code is included in the
|
||||
build. This option is intended for use by the PCRE2 maintainers.
|
||||
|
||||
. In environments where valgrind is installed, if you specify
|
||||
|
||||
--enable-valgrind
|
||||
|
||||
PCRE2 will use valgrind annotations to mark certain memory regions as
|
||||
unaddressable. This allows it to detect invalid memory accesses, and is
|
||||
mostly useful for debugging PCRE2 itself.
|
||||
|
||||
. In environments where the gcc compiler is used and lcov is installed, if you
|
||||
specify
|
||||
|
||||
--enable-coverage
|
||||
|
||||
the build process implements a code coverage report for the test suite. The
|
||||
report is generated by running "make coverage". If ccache is installed on
|
||||
your system, it must be disabled when building PCRE2 for coverage reporting.
|
||||
You can do this by setting the environment variable CCACHE_DISABLE=1 before
|
||||
running "make" to build PCRE2. There is more information about coverage
|
||||
reporting in the "pcre2build" documentation.
|
||||
|
||||
. When JIT support is enabled, pcre2grep automatically makes use of it, unless
|
||||
you add --disable-pcre2grep-jit to the "configure" command.
|
||||
|
||||
. There is support for calling external programs during matching in the
|
||||
pcre2grep command, using PCRE2's callout facility with string arguments. This
|
||||
support can be disabled by adding --disable-pcre2grep-callout to the
|
||||
"configure" command. There are two kinds of callout: one that generates
|
||||
output from inbuilt code, and another that calls an external program. The
|
||||
latter has special support for Windows and VMS; otherwise it assumes the
|
||||
existence of the fork() function. This facility can be disabled by adding
|
||||
--disable-pcre2grep-callout-fork to the "configure" command.
|
||||
|
||||
. The pcre2grep program currently supports only 8-bit data files, and so
|
||||
requires the 8-bit PCRE2 library. It is possible to compile pcre2grep to use
|
||||
libz and/or libbz2, in order to read .gz and .bz2 files (respectively), by
|
||||
specifying one or both of
|
||||
|
||||
--enable-pcre2grep-libz
|
||||
--enable-pcre2grep-libbz2
|
||||
|
||||
Of course, the relevant libraries must be installed on your system.
|
||||
|
||||
. The default starting size (in bytes) of the internal buffer used by pcre2grep
|
||||
can be set by, for example:
|
||||
|
||||
--with-pcre2grep-bufsize=51200
|
||||
|
||||
The value must be a plain integer. The default is 20480. The amount of memory
|
||||
used by pcre2grep is actually three times this number, to allow for "before"
|
||||
and "after" lines. If very long lines are encountered, the buffer is
|
||||
automatically enlarged, up to a fixed maximum size.
|
||||
|
||||
. The default maximum size of pcre2grep's internal buffer can be set by, for
|
||||
example:
|
||||
|
||||
--with-pcre2grep-max-bufsize=2097152
|
||||
|
||||
The default is either 1048576 or the value of --with-pcre2grep-bufsize,
|
||||
whichever is the larger.
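The arithmetic behind the two pcre2grep buffer settings above is simple enough to write down. The sketch below uses hypothetical names and merely restates the rules in the text; it is not taken from the pcre2grep source.

```c
/* Values quoted in the text above. */
#define DEFAULT_BUFSIZE      20480UL
#define DEFAULT_MAX_FLOOR  1048576UL

/* Memory used at start-up: three times the buffer size, to allow for
   "before" and "after" lines. */
static unsigned long initial_buffer_memory(unsigned long bufsize)
{
    return 3UL * bufsize;
}

/* Default maximum buffer size: 1048576 or the starting size, whichever
   is the larger. */
static unsigned long default_max_bufsize(unsigned long bufsize)
{
    return bufsize > DEFAULT_MAX_FLOOR ? bufsize : DEFAULT_MAX_FLOOR;
}
```

With the default starting size, initial_buffer_memory(DEFAULT_BUFSIZE) is 61440 bytes. With --with-pcre2grep-bufsize=51200 it is 153600 bytes, and the default maximum stays at 1048576.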
|
||||
|
||||
. It is possible to compile pcre2test so that it links with the libreadline
|
||||
or libedit libraries, by specifying, respectively,
|
||||
|
||||
--enable-pcre2test-libreadline or --enable-pcre2test-libedit
|
||||
|
||||
If this is done, when pcre2test's input is from a terminal, it reads it using
|
||||
the readline() function. This provides line-editing and history facilities.
|
||||
Note that libreadline is GPL-licensed, so if you distribute a binary of
|
||||
pcre2test linked in this way, there may be licensing issues. These can be
|
||||
avoided by linking with libedit (which has a BSD licence) instead.
|
||||
|
||||
Enabling libreadline causes the -lreadline option to be added to the
|
||||
pcre2test build. In many operating environments with a system-installed
|
||||
readline library this is sufficient. However, in some environments (e.g. if
|
||||
an unmodified distribution version of readline is in use), it may be
|
||||
necessary to specify something like LIBS="-lncurses" as well. This is
|
||||
because, to quote the readline INSTALL, "Readline uses the termcap functions,
|
||||
but does not link with the termcap or curses library itself, allowing
|
||||
applications which link with readline the option to choose an appropriate
|
||||
library." If you get error messages about missing functions tgetstr, tgetent,
|
||||
tputs, tgetflag, or tgoto, this is the problem, and linking with the ncurses
|
||||
library should fix it.
|
||||
|
||||
. The C99 standard defines formatting modifiers z and t for size_t and
|
||||
ptrdiff_t values, respectively. By default, PCRE2 uses these modifiers in
|
||||
environments other than Microsoft Visual Studio versions earlier than 2013
|
||||
when __STDC_VERSION__ is defined and has a value greater than or equal to
|
||||
199901L (indicating C99). However, there is at least one environment that
|
||||
claims to be C99 but does not support these modifiers. If
|
||||
--disable-percent-zt is specified, no use is made of the z or t modifiers.
|
||||
Instead of %td or %zu, %lu is used, with a cast for size_t values.
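The difference between the two forms can be seen in a small sketch. The function names are hypothetical and the code is not taken from PCRE2; it only shows a size_t value printed with the C99 z modifier and with the %lu fallback plus a cast.

```c
#include <stddef.h>
#include <stdio.h>

/* C99 form: the z modifier matches size_t directly. */
static int format_size_c99(char *buf, size_t buflen, size_t value)
{
    return snprintf(buf, buflen, "%zu", value);
}

/* Fallback form: no z modifier, so the value is cast to unsigned long
   and printed with %lu. */
static int format_size_fallback(char *buf, size_t buflen, size_t value)
{
    return snprintf(buf, buflen, "%lu", (unsigned long)value);
}
```

Both functions produce the same text for values that fit in an unsigned long. On a platform where size_t is wider than unsigned long, the cast in the fallback form would truncate very large values.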
|
||||
|
||||
. There is a special option called --enable-fuzz-support for use by people who
|
||||
want to run fuzzing tests on PCRE2. If set, it causes an extra library
|
||||
called libpcre2-fuzzsupport.a to be built, but not installed. This contains
|
||||
a single function called LLVMFuzzerTestOneInput() whose arguments are a
|
||||
pointer to a string and the length of the string. When called, this function
|
||||
tries to compile the string as a pattern, and if that succeeds, to match
|
||||
it. This is done both with no options and with some random option bits that
|
||||
are generated from the string. Setting --enable-fuzz-support also causes an
|
||||
executable called pcre2fuzzcheck-{8,16,32} to be created. This is normally
|
||||
run under valgrind or used when PCRE2 is compiled with address sanitizing
|
||||
enabled. It calls the fuzzing function and outputs information about what it
|
||||
is doing. The input strings are specified by arguments: if an argument
|
||||
starts with "=" the rest of it is a literal input string. Otherwise, it is
|
||||
assumed to be a file name, and the contents of the file are the test string.
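The argument convention described above (a leading "=" marks a literal string, anything else is a file name) might be handled along the following lines. This is only a sketch: literal_input() is a hypothetical helper and is not the code used by the fuzz checking program.

```c
#include <stddef.h>

/* Returns a pointer to the literal test string if the argument starts
   with "=", or NULL if the argument should be treated as a file name. */
static const char *literal_input(const char *arg)
{
    return (arg[0] == '=') ? arg + 1 : NULL;
}
```

A caller would test the result for NULL and, in that case, read the named file into memory before passing its contents to the fuzzing function. Note that an argument consisting of just "=" yields an empty literal string rather than a file name.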
|
||||
|
||||
. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
|
||||
which caused pcre2_match() to use individual blocks on the heap for
|
||||
backtracking instead of recursive function calls (which use the stack). This
|
||||
is now obsolete because pcre2_match() was refactored always to use the heap
|
||||
(in a much more efficient way than before). This option is retained for
|
||||
backwards compatibility, but has no effect other than to output a warning.
|
||||
|
||||
The "configure" script builds the following files for the basic C library:
|
||||
|
||||
. Makefile the makefile that builds the library
|
||||
. src/config.h build-time configuration options for the library
|
||||
. src/pcre2.h the public PCRE2 header file
|
||||
. pcre2-config script that shows the build settings such as CFLAGS
|
||||
that were set for "configure"
|
||||
. libpcre2-8.pc )
|
||||
. libpcre2-16.pc ) data for the pkg-config command
|
||||
. libpcre2-32.pc )
|
||||
. libpcre2-posix.pc )
|
||||
. libtool script that builds shared and/or static libraries
|
||||
|
||||
Versions of config.h and pcre2.h are distributed in the src directory of PCRE2
|
||||
tarballs under the names config.h.generic and pcre2.h.generic. These are
|
||||
provided for those who have to build PCRE2 without using "configure" or CMake.
|
||||
If you use "configure" or CMake, the .generic versions are not used.
|
||||
|
||||
The "configure" script also creates config.status, which is an executable
|
||||
script that can be run to recreate the configuration, and config.log, which
|
||||
contains compiler output from tests that "configure" runs.
|
||||
|
||||
Once "configure" has run, you can run "make". This builds whichever of the
|
||||
libraries libpcre2-8, libpcre2-16 and libpcre2-32 are configured, and a test
|
||||
program called pcre2test. If you enabled JIT support with --enable-jit, another
|
||||
test program called pcre2_jit_test is built as well. If the 8-bit library is
|
||||
built, libpcre2-posix, pcre2posix_test, and the pcre2grep command are also
|
||||
built. Running "make" with the -j option may speed up compilation on
|
||||
multiprocessor systems.
|
||||
|
||||
The command "make check" runs all the appropriate tests. Details of the PCRE2
|
||||
tests are given below in a separate section of this document. The -j option of
|
||||
"make" can also be used when running the tests.
|
||||
|
||||
You can use "make install" to install PCRE2 into live directories on your
|
||||
system. The following are installed (file names are all relative to the
|
||||
<prefix> that is set when "configure" is run):
|
||||
|
||||
Commands (bin):
|
||||
pcre2test
|
||||
pcre2grep (if 8-bit support is enabled)
|
||||
pcre2-config
|
||||
|
||||
Libraries (lib):
|
||||
libpcre2-8 (if 8-bit support is enabled)
|
||||
libpcre2-16 (if 16-bit support is enabled)
|
||||
libpcre2-32 (if 32-bit support is enabled)
|
||||
libpcre2-posix (if 8-bit support is enabled)
|
||||
|
||||
Configuration information (lib/pkgconfig):
|
||||
libpcre2-8.pc
|
||||
libpcre2-16.pc
|
||||
libpcre2-32.pc
|
||||
libpcre2-posix.pc
|
||||
|
||||
Header files (include):
|
||||
pcre2.h
|
||||
pcre2posix.h
|
||||
|
||||
Man pages (share/man/man{1,3}):
|
||||
pcre2grep.1
|
||||
pcre2test.1
|
||||
pcre2-config.1
|
||||
pcre2.3
|
||||
pcre2*.3 (lots more pages, all starting "pcre2")
|
||||
|
||||
HTML documentation (share/doc/pcre2/html):
|
||||
index.html
|
||||
*.html (lots more pages, hyperlinked from index.html)
|
||||
|
||||
Text file documentation (share/doc/pcre2):
|
||||
AUTHORS
|
||||
COPYING
|
||||
ChangeLog
|
||||
LICENCE
|
||||
NEWS
|
||||
README
|
||||
SECURITY
|
||||
pcre2.txt (a concatenation of the man(3) pages)
|
||||
pcre2test.txt the pcre2test man page
|
||||
pcre2grep.txt the pcre2grep man page
|
||||
pcre2-config.txt the pcre2-config man page
|
||||
|
||||
If you want to remove PCRE2 from your system, you can run "make uninstall".
|
||||
This removes all the files that "make install" installed. However, it does not
|
||||
remove any directories, because these are often shared with other programs.
|
||||
|
||||
|
||||
Retrieving configuration information
|
||||
------------------------------------
|
||||
|
||||
Running "make install" installs the command pcre2-config, which can be used to
|
||||
recall information about the PCRE2 configuration and installation. For example:
|
||||
|
||||
pcre2-config --version
|
||||
|
||||
prints the version number, and
|
||||
|
||||
pcre2-config --libs8
|
||||
|
||||
outputs information about where the 8-bit library is installed. This command
|
||||
can be included in makefiles for programs that use PCRE2, saving the programmer
|
||||
from having to remember too many details. Run pcre2-config with no arguments to
|
||||
obtain a list of possible arguments.
|
||||
|
||||
The pkg-config command is another system for saving and retrieving information
|
||||
about installed libraries. Instead of separate commands for each library, a
|
||||
single command is used. For example:
|
||||
|
||||
pkg-config --libs libpcre2-16
|
||||
|
||||
The data is held in *.pc files that are installed in a directory called
|
||||
<prefix>/lib/pkgconfig.
|
||||
|
||||
|
||||
Shared libraries
|
||||
----------------
|
||||
|
||||
The default distribution builds PCRE2 as shared libraries and static libraries,
|
||||
as long as the operating system supports shared libraries. Shared library
|
||||
support relies on the "libtool" script which is built as part of the
|
||||
"configure" process.
|
||||
|
||||
The libtool script is used to compile and link both shared and static
|
||||
libraries. They are placed in a subdirectory called .libs when they are newly
|
||||
built. The programs pcre2test and pcre2grep are built to use these uninstalled
|
||||
libraries (by means of wrapper scripts in the case of shared libraries). When
|
||||
you use "make install" to install shared libraries, pcre2grep and pcre2test are
|
||||
automatically re-built to use the newly installed shared libraries before being
|
||||
installed themselves. However, the versions left in the build directory still
|
||||
use the uninstalled libraries.
|
||||
|
||||
To build PCRE2 using static libraries only you must use --disable-shared when
|
||||
configuring it. For example:
|
||||
|
||||
./configure --prefix=/usr/gnu --disable-shared
|
||||
|
||||
Then run "make" in the usual way. Similarly, you can use --disable-static to
|
||||
build only shared libraries. Note, however, that when you build only static
|
||||
libraries, binary programs such as pcre2test and pcre2grep may still be
|
||||
dynamically linked with other libraries (for example, libc) unless you set
|
||||
LDFLAGS to --static when running "configure".
|
||||
|
||||
|
||||
Cross-compiling using autotools
|
||||
-------------------------------
|
||||
|
||||
You can specify CC and CFLAGS in the normal way to the "configure" command, in
|
||||
order to cross-compile PCRE2 for some other host. However, you should NOT
|
||||
specify --enable-rebuild-chartables, because if you do, the pcre2_dftables.c
|
||||
source file is compiled and run on the local host, in order to generate the
|
||||
inbuilt character tables (the pcre2_chartables.c file). This will probably not
|
||||
work, because pcre2_dftables.c needs to be compiled with the local compiler,
|
||||
not the cross compiler.
|
||||
|
||||
When --enable-rebuild-chartables is not specified, pcre2_chartables.c is
|
||||
created by making a copy of pcre2_chartables.c.dist, which is a default set of
|
||||
tables that assumes ASCII code. Cross-compiling with the default tables should
|
||||
not be a problem.
|
||||
|
||||
If you need to modify the character tables when cross-compiling, you should
|
||||
move pcre2_chartables.c.dist out of the way, then compile pcre2_dftables.c by
|
||||
hand and run it on the local host to make a new version of
|
||||
pcre2_chartables.c.dist. See the pcre2build section "Creating character tables
|
||||
at build time" for more details.
|
||||
|
||||
|
||||
Making new tarballs
|
||||
-------------------
|
||||
|
||||
The command "make dist" creates three PCRE2 tarballs, in tar.gz, tar.bz2, and
|
||||
zip formats. The command "make distcheck" does the same, but then does a trial
|
||||
build of the new distribution to ensure that it works.
|
||||
|
||||
If you have modified any of the man page sources in the doc directory, you
|
||||
should first run the maint/PrepareRelease script before making a distribution.
|
||||
This script creates the .txt and HTML forms of the documentation from the man
|
||||
pages.
|
||||
|
||||
|
||||
Testing PCRE2
|
||||
-------------
|
||||
|
||||
To test the basic PCRE2 library on a Unix-like system, run the RunTest script.
|
||||
There is another script called RunGrepTest that tests the pcre2grep command.
|
||||
When the 8-bit library is built, a test program for the POSIX wrapper, called
|
||||
pcre2posix_test, is compiled, and when JIT support is enabled, a test program
|
||||
called pcre2_jit_test is built. The scripts and the program tests are all run
|
||||
when you obey "make check". For other environments, see the instructions in
|
||||
NON-AUTOTOOLS-BUILD.
|
||||
|
||||
The RunTest script runs the pcre2test test program (which is documented in its
|
||||
own man page) on each of the relevant testinput files in the testdata
|
||||
directory, and compares the output with the contents of the corresponding
|
||||
testoutput files. RunTest uses a file called testtry to hold the main output
|
||||
from pcre2test. Other files whose names begin with "test" are used as working
|
||||
files in some tests.
|
||||
|
||||
Some tests are relevant only when certain build-time options were selected. For
|
||||
example, the tests for UTF-8/16/32 features are run only when Unicode support
|
||||
is available. RunTest outputs a comment when it skips a test.
|
||||
|
||||
Many (but not all) of the tests that are not skipped are run twice if JIT
|
||||
support is available. On the second run, JIT compilation is forced. This
|
||||
testing can be suppressed by putting "-nojit" on the RunTest command line.
|
||||
|
||||
The entire set of tests is run once for each of the 8-bit, 16-bit and 32-bit
|
||||
libraries that are enabled. If you want to run just one set of tests, call
|
||||
RunTest with either the -8, -16 or -32 option.
|
||||
|
||||
If valgrind is installed, you can run the tests under it by putting "-valgrind"
|
||||
on the RunTest command line. To run pcre2test on just one or more specific test
|
||||
files, give their numbers as arguments to RunTest, for example:
|
||||
|
||||
RunTest 2 7 11
|
||||
|
||||
You can also specify ranges of tests such as 3-6 or 3- (meaning 3 to the
|
||||
end), or a number preceded by ~ to exclude a test. For example:
|
||||
|
||||
RunTest 3-15 ~10
|
||||
|
||||
This runs tests 3 to 15, excluding test 10, and just ~13 runs all the tests
|
||||
except test 13. Whatever order the arguments are in, the tests are always run
|
||||
in numerical order.
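The selection rules above can be modelled in C as follows. This is a sketch of the behaviour as described in this section, not a translation of the RunTest script. The names are hypothetical, the highest test number is assumed to be 26 (the last test described in this document), and the treatment of test 0 when only exclusions are given is a guess.

```c
#include <stdbool.h>
#include <stdlib.h>

#define MAX_TEST 26  /* assumption: highest test number described here */

/* Parses a run of decimal digits at *p; fails if there are none. */
static bool parse_number(const char **p, long *value)
{
    char *end;
    if (**p < '0' || **p > '9') return false;
    *value = strtol(*p, &end, 10);
    *p = end;
    return true;
}

/* Fills selected[0..MAX_TEST] from arguments such as "2", "3-6", "3-"
   and "~10". Returns false for a malformed or out-of-range argument. */
static bool select_tests(int count, const char **args,
                         bool selected[MAX_TEST + 1])
{
    bool excluded[MAX_TEST + 1] = { false };
    bool any_included = false;

    for (int t = 0; t <= MAX_TEST; t++) selected[t] = false;

    for (int i = 0; i < count; i++) {
        const char *p = args[i];
        bool exclude = (*p == '~');
        long first, last;

        if (exclude) p++;
        if (!parse_number(&p, &first)) return false;
        last = first;
        if (!exclude && *p == '-') {
            p++;
            if (*p == '\0') last = MAX_TEST;       /* "3-" means 3 to the end */
            else if (!parse_number(&p, &last)) return false;
        }
        if (*p != '\0' || first > last || last > MAX_TEST) return false;

        for (long t = first; t <= last; t++) {
            if (exclude) excluded[t] = true;
            else { selected[t] = true; any_included = true; }
        }
    }

    for (int t = 0; t <= MAX_TEST; t++) {
        if (!any_included) selected[t] = true;     /* only exclusions given */
        if (excluded[t]) selected[t] = false;
    }
    return true;
}
```

Because the result is a table indexed by test number, walking it from 0 upwards visits the chosen tests in numerical order, whatever order the arguments were given in.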
|
||||
|
||||
You can also call RunTest with the single argument "list" to cause it to output
|
||||
a list of tests.
|
||||
|
||||
The test sequence starts with "test 0", which is a special test that has no
|
||||
input file, and whose output is not checked. This is because it will be
|
||||
different on different hardware and with different configurations. The test
|
||||
exists in order to exercise some of pcre2test's code that would not otherwise
|
||||
be run.
|
||||
|
||||
Tests 1 and 2 can always be run, as they expect only plain text strings (not
|
||||
UTF) and make no use of Unicode properties. The first test file can be fed
|
||||
directly into the perltest.sh script to check that Perl gives the same results.
|
||||
The only difference you should see is in the first few lines, where the Perl
|
||||
version is given instead of the PCRE2 version. The second set of tests checks
|
||||
auxiliary functions, error detection, and run-time flags that are specific to
|
||||
PCRE2. It also uses the debugging flags to check some of the internals of
|
||||
pcre2_compile().
|
||||
|
||||
If you build PCRE2 with a locale setting that is not the standard C locale, the
|
||||
character tables may be different (see next paragraph). In some cases, this may
|
||||
cause failures in the second set of tests. For example, in a locale where the
|
||||
isprint() function yields TRUE for characters in the range 128-255, the use of
|
||||
[:isascii:] inside a character class defines a different set of characters, and
|
||||
this shows up in this test as a difference in the compiled code, which is being
|
||||
listed for checking. For example, where the comparison test output contains
|
||||
[\x00-\x7f], the test output might contain [\x00-\xff], and similarly in some other
|
||||
cases. This is not a bug in PCRE2.
|
||||
|
||||
Test 3 checks pcre2_maketables(), the facility for building a set of character
|
||||
tables for a specific locale and using them instead of the default tables. The
|
||||
script uses the "locale" command to check for the availability of the "fr_FR",
|
||||
"french", or "fr" locale, and uses the first one that it finds. If the "locale"
|
||||
command fails, or if its output doesn't include "fr_FR", "french", or "fr" in
|
||||
the list of available locales, the third test cannot be run, and a comment is
|
||||
output to say why. If running this test produces an error like this:
|
||||
|
||||
** Failed to set locale "fr_FR"
|
||||
|
||||
it means that the given locale is not available on your system, despite being
|
||||
listed by "locale". This does not mean that PCRE2 is broken. There are three
|
||||
alternative output files for the third test, because three different versions
|
||||
of the French locale have been encountered. The test passes if its output
|
||||
matches any one of them.
|
||||
|
||||
Tests 4 and 5 check UTF and Unicode property support, test 4 being compatible
|
||||
with the perltest.sh script, and test 5 checking PCRE2-specific things.
|
||||
|
||||
Tests 6 and 7 check the pcre2_dfa_match() alternative matching function, in
|
||||
non-UTF mode and UTF-mode with Unicode property support, respectively.
|
||||
|
||||
Test 8 checks some internal offsets and code size features, but it is run only
|
||||
when Unicode support is enabled. The output is different in 8-bit, 16-bit, and
|
||||
32-bit modes and for different link sizes, so there are different output files
|
||||
for each mode and link size.
|
||||
|
||||
Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
|
||||
16-bit and 32-bit modes. These are tests that generate different output in
|
||||
8-bit mode. Each pair is for general cases and Unicode support, respectively.
|
||||
|
||||
Test 13 checks the handling of non-UTF characters greater than 255 by
|
||||
pcre2_dfa_match() in 16-bit and 32-bit modes.
|
||||
|
||||
Test 14 contains some special UTF and UCP tests that give different output for
|
||||
different code unit widths.
|
||||
|
||||
Test 15 contains a number of tests that must not be run with JIT. They check,
|
||||
among other non-JIT things, the match-limiting features of the interpretive
|
||||
matcher.
|
||||
|
||||
Test 16 is run only when JIT support is not available. It checks that an
|
||||
attempt to use JIT has the expected behaviour.
|
||||
|
||||
Test 17 is run only when JIT support is available. It checks JIT complete and
|
||||
partial modes, match-limiting under JIT, and other JIT-specific features.
|
||||
|
||||
Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
|
||||
the 8-bit library, without and with Unicode support, respectively.
|
||||
|
||||
Test 20 checks the serialization functions by writing a set of compiled
|
||||
patterns to a file, and then reloading and checking them.
|
||||
|
||||
Tests 21 and 22 test \C support when the use of \C is not locked out, without
|
||||
and with UTF support, respectively. Test 23 tests \C when it is locked out.
|
||||
|
||||
Tests 24 and 25 test the experimental pattern conversion functions, without and
|
||||
with UTF support, respectively.
|
||||
|
||||
Test 26 checks Unicode property support using tests that are generated
|
||||
automatically from the Unicode data tables.
|
||||
|
||||
|
||||
Character tables
|
||||
----------------
|
||||
|
||||
For speed, PCRE2 uses four tables for manipulating and identifying characters
|
||||
whose code point values are less than 256. By default, a set of tables that is
|
||||
built into the library is used. The pcre2_maketables() function can be called
|
||||
by an application to create a new set of tables in the current locale. These are
|
||||
passed to PCRE2 by calling pcre2_set_character_tables() to put a pointer into a
|
||||
compile context.
|
||||
|
||||
The source file called pcre2_chartables.c contains the default set of tables.
|
||||
By default, this is created as a copy of pcre2_chartables.c.dist, which
|
||||
contains tables for ASCII coding. However, if --enable-rebuild-chartables is
|
||||
specified for ./configure, a new version of pcre2_chartables.c is built by the
|
||||
program pcre2_dftables (compiled from pcre2_dftables.c), which uses the ANSI C
|
||||
character handling functions such as isalnum(), isalpha(), isupper(),
|
||||
islower(), etc. to build the table sources. This means that the default C
|
||||
locale that is set for your system will control the contents of these default
|
||||
tables. You can change the default tables by editing pcre2_chartables.c and
|
||||
then re-building PCRE2. If you do this, you should take care to ensure that the
|
||||
file does not get automatically re-generated. The best way to do this is to
|
||||
move pcre2_chartables.c.dist out of the way and replace it with your customized
|
||||
tables.
|
||||
|
||||
When the pcre2_dftables program is run as a result of specifying
|
||||
--enable-rebuild-chartables, it uses the default C locale that is set on your
|
||||
system. It does not pay attention to the LC_xxx environment variables. In other
|
||||
words, it uses the system's default locale rather than whatever the compiling
|
||||
user happens to have set. If you really do want to build a source set of
|
||||
character tables in a locale that is specified by the LC_xxx variables, you can
|
||||
run the pcre2_dftables program by hand with the -L option. For example:
|
||||
|
||||
./pcre2_dftables -L pcre2_chartables.c.special
|
||||
|
||||
The second argument names the file where the source code for the tables is
|
||||
written. The first two 256-byte tables provide lower casing and case flipping
|
||||
functions, respectively. The next table consists of a number of 32-byte bit
|
||||
maps which identify certain character classes such as digits, "word"
|
||||
characters, white space, etc. These are used when building 32-byte bit maps
|
||||
that represent character classes for code points less than 256. The final
|
||||
256-byte table has bits indicating various character types, as follows:
|
||||
|
||||
1 white space character
|
||||
2 letter
|
||||
4 lower case letter
|
||||
8 decimal digit
|
||||
16 alphanumeric or '_'
|
||||
|
||||
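As an illustration of the layout just described, the following C sketch fills
a 256-byte type table and the two case tables from the current C locale using
the standard <ctype.h> functions. It is not PCRE2's own source: the macro and
function names (CT_SPACE, build_ctypes, build_case_tables) are invented for
this example, and only the bit values are taken from the list above.

```c
#include <ctype.h>
#include <stdint.h>

/* Bit values from the list above. The macro names are illustrative only. */
#define CT_SPACE   1    /* white space character */
#define CT_LETTER  2    /* letter */
#define CT_LOWER   4    /* lower case letter */
#define CT_DIGIT   8    /* decimal digit */
#define CT_WORD    16   /* alphanumeric or '_' */

/* Fill a 256-byte type table according to the current C locale. */
void build_ctypes(uint8_t table[256])
{
    for (int c = 0; c < 256; c++) {
        uint8_t bits = 0;
        if (isspace(c)) bits |= CT_SPACE;
        if (isalpha(c)) bits |= CT_LETTER;
        if (islower(c)) bits |= CT_LOWER;
        if (isdigit(c)) bits |= CT_DIGIT;
        if (isalnum(c) || c == '_') bits |= CT_WORD;
        table[c] = bits;
    }
}

/* Fill the lower-casing table and the case-flipping table. */
void build_case_tables(uint8_t lower[256], uint8_t flip[256])
{
    for (int c = 0; c < 256; c++) {
        lower[c] = (uint8_t)tolower(c);
        flip[c]  = (uint8_t)(islower(c) ? toupper(c) : tolower(c));
    }
}
```

In the default "C" locale, the type entry for 'a' is 2 + 4 + 16 = 22 and the
entry for '7' is 8 + 16 = 24.
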
You can also specify -b (with or without -L) when running pcre2_dftables. This
|
||||
causes the tables to be written in binary instead of as source code. A set of
|
||||
binary tables can be loaded into memory by an application and passed to
|
||||
pcre2_set_character_tables() in the same way as tables created dynamically by calling
|
||||
pcre2_maketables(). The tables are just a string of bytes, independent of
|
||||
hardware characteristics such as endianness. This means they can be bundled
|
||||
with an application that runs in different environments, to ensure consistent
|
||||
behaviour.
|
||||
|
||||
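A minimal sketch of loading such a binary file into memory follows. It assumes
that the file written with -b holds nothing but the raw table bytes, which
should be checked against the PCRE2 release in use. The function name
load_tables is hypothetical and is not part of the PCRE2 API.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a whole binary tables file into a malloc'd buffer.
   Returns NULL on failure. On success, stores the byte count in *len.
   The caller releases the buffer with free(). */
unsigned char *load_tables(const char *path, size_t *len)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;

    unsigned char *buf = NULL;
    size_t used = 0, cap = 0;

    for (;;) {
        if (used == cap) {                      /* grow the buffer */
            size_t ncap = cap ? cap * 2 : 2048;
            unsigned char *nbuf = realloc(buf, ncap);
            if (nbuf == NULL) { free(buf); fclose(f); return NULL; }
            buf = nbuf;
            cap = ncap;
        }
        size_t got = fread(buf + used, 1, cap - used, f);
        used += got;
        if (got == 0) break;                    /* end of file or error */
    }

    if (ferror(f)) { free(buf); fclose(f); return NULL; }
    fclose(f);
    *len = used;
    return buf;
}
```

The returned pointer can then be handed to pcre2_set_character_tables() on a
compile context, and it must stay valid for as long as any pattern compiled
with that context is in use.
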
See also the pcre2build section "Creating character tables at build time".
|
||||
|
||||
|
||||
File manifest
|
||||
-------------
|
||||
|
||||
The distribution should contain the files listed below.
|
||||
|
||||
(A) Source files for the PCRE2 library functions and their headers are found in
|
||||
the src directory:
|
||||
|
||||
src/pcre2_dftables.c auxiliary program for building pcre2_chartables.c
|
||||
when --enable-rebuild-chartables is specified
|
||||
|
||||
src/pcre2_chartables.c.dist a default set of character tables that assume
|
||||
ASCII coding; unless --enable-rebuild-chartables is
|
||||
specified, used by copying to pcre2_chartables.c
|
||||
|
||||
src/pcre2posix.c )
|
||||
src/pcre2_auto_possess.c )
|
||||
src/pcre2_chkdint.c )
|
||||
src/pcre2_compile.c )
|
||||
src/pcre2_compile_class.c )
|
||||
src/pcre2_config.c )
|
||||
src/pcre2_context.c )
|
||||
src/pcre2_convert.c )
|
||||
src/pcre2_dfa_match.c )
|
||||
src/pcre2_error.c )
|
||||
src/pcre2_extuni.c )
|
||||
src/pcre2_find_bracket.c )
|
||||
src/pcre2_jit_compile.c )
|
||||
src/pcre2_jit_match.c ) sources for the functions in the library,
|
||||
src/pcre2_jit_misc.c ) and some internal functions that they use
|
||||
src/pcre2_maketables.c )
|
||||
src/pcre2_match.c )
|
||||
src/pcre2_match_data.c )
|
||||
src/pcre2_newline.c )
|
||||
src/pcre2_ord2utf.c )
|
||||
src/pcre2_pattern_info.c )
|
||||
src/pcre2_script_run.c )
|
||||
src/pcre2_serialize.c )
|
||||
src/pcre2_string_utils.c )
|
||||
src/pcre2_study.c )
|
||||
src/pcre2_substitute.c )
|
||||
src/pcre2_substring.c )
|
||||
src/pcre2_tables.c )
|
||||
src/pcre2_ucd.c )
|
||||
src/pcre2_ucptables.c )
|
||||
src/pcre2_valid_utf.c )
|
||||
src/pcre2_xclass.c )
|
||||
|
||||
src/pcre2_printint.c debugging function that is used by pcre2test
|
||||
src/pcre2_fuzzsupport.c function for (optional) fuzzing support
|
||||
|
||||
src/config.h.in template for config.h, when built by "configure"
|
||||
src/pcre2.h.in template for pcre2.h when built by "configure"
|
||||
src/pcre2posix.h header for the external POSIX wrapper API
|
||||
src/pcre2_compile.h header for internal use
|
||||
src/pcre2_internal.h header for internal use
|
||||
src/pcre2_intmodedep.h a mode-specific internal header
|
||||
src/pcre2_jit_char_inc.h header used by JIT
|
||||
src/pcre2_jit_neon_inc.h header used by JIT
|
||||
src/pcre2_jit_simd_inc.h header used by JIT
|
||||
src/pcre2_ucp.h header for Unicode property handling
|
||||
src/pcre2_util.h header for internal utils
|
||||
|
||||
deps/sljit/sljit_src/* source files for the JIT compiler
|
||||
|
||||
(B) Source files for programs that use PCRE2:
|
||||
|
||||
src/pcre2demo.c simple demonstration of coding calls to PCRE2
|
||||
src/pcre2grep.c source of a grep utility that uses PCRE2
|
||||
src/pcre2test.c comprehensive test program
|
||||
src/pcre2_jit_test.c JIT test program
|
||||
src/pcre2posix_test.c POSIX wrapper API test program
|
||||
|
||||
(C) Auxiliary files:
|
||||
|
||||
AUTHORS.md information about the authors of PCRE2
|
||||
ChangeLog log of changes to the code
|
||||
HACKING some notes about the internals of PCRE2
|
||||
INSTALL generic installation instructions
|
||||
LICENCE.md conditions for the use of PCRE2
|
||||
COPYING the same, using GNU's standard name
|
||||
SECURITY.md information on reporting vulnerabilities
|
||||
Makefile.in ) template for Unix Makefile, which is built by
|
||||
) "configure"
|
||||
Makefile.am ) the automake input that was used to create
|
||||
) Makefile.in
|
||||
NEWS important changes in this release
|
||||
NON-AUTOTOOLS-BUILD notes on building PCRE2 without using autotools
|
||||
README this file
|
||||
RunTest a Unix shell script for running tests
|
||||
RunGrepTest a Unix shell script for pcre2grep tests
|
||||
RunTest.bat a Windows batch file for running tests
|
||||
RunGrepTest.bat a Windows batch file for pcre2grep tests
|
||||
aclocal.m4 m4 macros (generated by "aclocal")
|
||||
m4/* m4 macros (used by autoconf)
|
||||
configure a configuring shell script (built by autoconf)
|
||||
configure.ac ) the autoconf input that was used to build
|
||||
) "configure" and config.h
|
||||
doc/*.3 man page sources for PCRE2
|
||||
doc/*.1 man page sources for pcre2grep and pcre2test
|
||||
doc/html/* HTML documentation
|
||||
doc/pcre2.txt plain text version of the man pages
|
||||
doc/pcre2-config.txt plain text documentation of pcre2-config script
|
||||
doc/pcre2grep.txt plain text documentation of grep utility program
|
||||
doc/pcre2test.txt plain text documentation of test program
|
||||
libpcre2-8.pc.in template for libpcre2-8.pc for pkg-config
|
||||
libpcre2-16.pc.in template for libpcre2-16.pc for pkg-config
|
||||
libpcre2-32.pc.in template for libpcre2-32.pc for pkg-config
|
||||
libpcre2-posix.pc.in template for libpcre2-posix.pc for pkg-config
|
||||
ar-lib )
|
||||
config.guess )
|
||||
config.sub )
|
||||
depcomp ) helper tools generated by libtool and
|
||||
compile ) automake, used internally by ./configure
|
||||
install-sh )
|
||||
ltmain.sh )
|
||||
missing )
|
||||
test-driver )
|
||||
perltest.sh Script for running a Perl test program
|
||||
pcre2-config.in source of script which retains PCRE2 information
|
||||
testdata/testinput* test data for main library tests
|
||||
testdata/testoutput* expected test results
|
||||
testdata/grep* input and output for pcre2grep tests
|
||||
testdata/* other supporting test files
|
||||
|
||||
(D) Auxiliary files for CMake support
|
||||
|
||||
cmake/COPYING-CMAKE-SCRIPTS
|
||||
cmake/FindEditline.cmake
|
||||
cmake/FindReadline.cmake
|
||||
cmake/pcre2-config-version.cmake.in
|
||||
cmake/pcre2-config.cmake.in
|
||||
CMakeLists.txt
|
||||
config-cmake.h.in
|
||||
|
||||
(E) Auxiliary files for building PCRE2 "by hand"
|
||||
|
||||
src/pcre2.h.generic ) a version of the public PCRE2 header file
|
||||
) for use in non-"configure" environments
|
||||
src/config.h.generic ) a version of config.h for use in non-"configure"
|
||||
) environments
|
||||
|
||||
(F) Auxiliary files for building PCRE2 using other build systems
|
||||
|
||||
BUILD.bazel )
|
||||
MODULE.bazel ) files used by the Bazel build system
|
||||
WORKSPACE.bazel )
|
||||
build.zig file used by zig's build system
|
||||
|
||||
(G) Auxiliary files for building PCRE2 under OpenVMS
|
||||
|
||||
vms/configure.com )
|
||||
vms/openvms_readme.txt ) These files were contributed by a PCRE2 user.
|
||||
vms/pcre2.h_patch )
|
||||
vms/stdint.h )
|
||||
|
||||
==============================
|
||||
Last updated: 18 December 2024
|
||||
==============================
|
||||
|
||||
327
3rd/pcre2/doc/html/index.html
Normal file
@@ -0,0 +1,327 @@
|
||||
<html>
|
||||
<!-- This is a manually maintained file that is the root of the HTML version of
|
||||
the PCRE2 documentation. When the HTML documents are built from the man
|
||||
page versions, the entire doc/html directory is emptied, this file is then
|
||||
copied into doc/html/index.html, and the remaining files therein are
|
||||
created by the 132html script.
|
||||
-->
|
||||
<head>
|
||||
<title>PCRE2 specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>Perl-compatible Regular Expressions (revised API: PCRE2)</h1>
|
||||
<p>
|
||||
The HTML documentation for PCRE2 consists of a number of pages that are listed
|
||||
below in alphabetical order. If you are new to PCRE2, please read the first one
|
||||
first.
|
||||
</p>
|
||||
|
||||
<table>
|
||||
<tr><td><a href="pcre2.html">pcre2</a></td>
|
||||
<td> Introductory page</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2-config.html">pcre2-config</a></td>
|
||||
<td> Information about the installation configuration</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2api.html">pcre2api</a></td>
|
||||
<td> PCRE2's native API</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2build.html">pcre2build</a></td>
|
||||
<td> Building PCRE2</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2callout.html">pcre2callout</a></td>
|
||||
<td> The <i>callout</i> facility</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2compat.html">pcre2compat</a></td>
|
||||
<td> Compatibility with Perl</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2convert.html">pcre2convert</a></td>
|
||||
<td> Experimental foreign pattern conversion functions</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2demo.html">pcre2demo</a></td>
|
||||
<td> A demonstration C program that uses the PCRE2 library</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2grep.html">pcre2grep</a></td>
|
||||
<td> The <b>pcre2grep</b> command</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2jit.html">pcre2jit</a></td>
|
||||
<td> Discussion of the just-in-time optimization support</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2limits.html">pcre2limits</a></td>
|
||||
<td> Details of size and other limits</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2matching.html">pcre2matching</a></td>
|
||||
<td> Discussion of the two matching algorithms</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2partial.html">pcre2partial</a></td>
|
||||
<td> Using PCRE2 for partial matching</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2pattern.html">pcre2pattern</a></td>
|
||||
<td> Specification of the regular expressions supported by PCRE2</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2perform.html">pcre2perform</a></td>
|
||||
<td> Some comments on performance</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2posix.html">pcre2posix</a></td>
|
||||
<td> The POSIX API to the PCRE2 8-bit library</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2sample.html">pcre2sample</a></td>
|
||||
<td> Discussion of the pcre2demo program</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
|
||||
<td> Serializing functions for saving precompiled patterns</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
|
||||
<td> Syntax quick-reference summary</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2test.html">pcre2test</a></td>
|
||||
<td> The <b>pcre2test</b> command for testing PCRE2</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2unicode.html">pcre2unicode</a></td>
|
||||
<td> Discussion of Unicode and UTF-8/UTF-16/UTF-32 support</td></tr>
|
||||
</table>
|
||||
|
||||
<p>
|
||||
There are also individual pages that summarize the interface for each function
|
||||
in the library.
|
||||
</p>
|
||||
|
||||
<table>
|
||||
|
||||
<tr><td><a href="pcre2_callout_enumerate.html">pcre2_callout_enumerate</a></td>
|
||||
<td> Enumerate callouts in a compiled pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_code_copy.html">pcre2_code_copy</a></td>
|
||||
<td> Copy a compiled pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_code_copy_with_tables.html">pcre2_code_copy_with_tables</a></td>
|
||||
<td> Copy a compiled pattern and its character tables</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_code_free.html">pcre2_code_free</a></td>
|
||||
<td> Free a compiled pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_compile.html">pcre2_compile</a></td>
|
||||
<td> Compile a regular expression pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_compile_context_copy.html">pcre2_compile_context_copy</a></td>
|
||||
<td> Copy a compile context</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_compile_context_create.html">pcre2_compile_context_create</a></td>
|
||||
<td> Create a compile context</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_compile_context_free.html">pcre2_compile_context_free</a></td>
|
||||
<td> Free a compile context</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_config.html">pcre2_config</a></td>
|
||||
<td> Show build-time configuration options</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_convert_context_copy.html">pcre2_convert_context_copy</a></td>
|
||||
<td> Copy a convert context</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_convert_context_create.html">pcre2_convert_context_create</a></td>
|
||||
<td> Create a convert context</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_convert_context_free.html">pcre2_convert_context_free</a></td>
|
||||
<td> Free a convert context</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_converted_pattern_free.html">pcre2_converted_pattern_free</a></td>
|
||||
<td> Free converted foreign pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_dfa_match.html">pcre2_dfa_match</a></td>
|
||||
<td> Match a compiled pattern to a subject string
|
||||
(DFA algorithm; <i>not</i> Perl compatible)</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_general_context_copy.html">pcre2_general_context_copy</a></td>
|
||||
<td> Copy a general context</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_general_context_create.html">pcre2_general_context_create</a></td>
|
||||
<td> Create a general context</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_general_context_free.html">pcre2_general_context_free</a></td>
|
||||
<td> Free a general context</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_get_error_message.html">pcre2_get_error_message</a></td>
|
||||
<td> Get textual error message for error number</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_get_mark.html">pcre2_get_mark</a></td>
|
||||
<td> Get a (*MARK) name</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_get_match_data_size.html">pcre2_get_match_data_size</a></td>
|
||||
<td> Get the size of a match data block</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_get_ovector_count.html">pcre2_get_ovector_count</a></td>
|
||||
<td> Get the ovector count</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_get_ovector_pointer.html">pcre2_get_ovector_pointer</a></td>
|
||||
<td> Get a pointer to the ovector</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_get_startchar.html">pcre2_get_startchar</a></td>
|
||||
<td> Get the starting character offset</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_jit_compile.html">pcre2_jit_compile</a></td>
|
||||
<td> Process a compiled pattern with the JIT compiler</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_jit_free_unused_memory.html">pcre2_jit_free_unused_memory</a></td>
|
||||
<td> Free unused JIT memory</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_jit_match.html">pcre2_jit_match</a></td>
|
||||
<td> Fast path interface to JIT matching</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_jit_stack_assign.html">pcre2_jit_stack_assign</a></td>
|
||||
<td> Assign stack for JIT matching</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_jit_stack_create.html">pcre2_jit_stack_create</a></td>
|
||||
<td> Create a stack for JIT matching</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_jit_stack_free.html">pcre2_jit_stack_free</a></td>
|
||||
<td> Free a JIT matching stack</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_maketables.html">pcre2_maketables</a></td>
|
||||
<td> Build character tables in current locale</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_maketables_free.html">pcre2_maketables_free</a></td>
|
||||
<td> Free character tables</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_match.html">pcre2_match</a></td>
|
||||
<td> Match a compiled pattern to a subject string
|
||||
(Perl compatible)</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_match_context_copy.html">pcre2_match_context_copy</a></td>
|
||||
<td> Copy a match context</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_match_context_create.html">pcre2_match_context_create</a></td>
|
||||
<td> Create a match context</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_match_context_free.html">pcre2_match_context_free</a></td>
|
||||
<td> Free a match context</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_match_data_create.html">pcre2_match_data_create</a></td>
|
||||
<td> Create a match data block</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_match_data_create_from_pattern.html">pcre2_match_data_create_from_pattern</a></td>
|
||||
<td> Create a match data block getting size from pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_match_data_free.html">pcre2_match_data_free</a></td>
|
||||
<td> Free a match data block</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_pattern_convert.html">pcre2_pattern_convert</a></td>
|
||||
<td> Experimental foreign pattern converter</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_pattern_info.html">pcre2_pattern_info</a></td>
|
||||
<td> Extract information about a pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_serialize_decode.html">pcre2_serialize_decode</a></td>
|
||||
<td> Decode serialized compiled patterns</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_serialize_encode.html">pcre2_serialize_encode</a></td>
|
||||
<td> Serialize compiled patterns for save/restore</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_serialize_free.html">pcre2_serialize_free</a></td>
|
||||
<td> Free serialized compiled patterns</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_serialize_get_number_of_codes.html">pcre2_serialize_get_number_of_codes</a></td>
|
||||
<td> Get number of serialized compiled patterns</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_bsr.html">pcre2_set_bsr</a></td>
|
||||
<td> Set \R convention</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_callout.html">pcre2_set_callout</a></td>
|
||||
<td> Set up a callout function</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_character_tables.html">pcre2_set_character_tables</a></td>
|
||||
<td> Set character tables</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_compile_extra_options.html">pcre2_set_compile_extra_options</a></td>
|
||||
<td> Set compile time extra options</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_compile_recursion_guard.html">pcre2_set_compile_recursion_guard</a></td>
|
||||
<td> Set up a compile recursion guard function</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_depth_limit.html">pcre2_set_depth_limit</a></td>
|
||||
<td> Set the match backtracking depth limit</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_glob_escape.html">pcre2_set_glob_escape</a></td>
|
||||
<td> Set glob escape character</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_glob_separator.html">pcre2_set_glob_separator</a></td>
|
||||
<td> Set glob separator character</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_heap_limit.html">pcre2_set_heap_limit</a></td>
|
||||
<td> Set the match backtracking heap limit</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_match_limit.html">pcre2_set_match_limit</a></td>
|
||||
<td> Set the match limit</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_max_pattern_compiled_length.html">pcre2_set_max_pattern_compiled_length</a></td>
|
||||
<td> Set the maximum length of a compiled pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_max_pattern_length.html">pcre2_set_max_pattern_length</a></td>
|
||||
<td> Set the maximum length of a pattern</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_max_varlookbehind.html">pcre2_set_max_varlookbehind</a></td>
|
||||
<td> Set the maximum match length for a variable-length lookbehind</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_newline.html">pcre2_set_newline</a></td>
|
||||
<td> Set the newline convention</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_offset_limit.html">pcre2_set_offset_limit</a></td>
|
||||
<td> Set the offset limit</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_optimize.html">pcre2_set_optimize</a></td>
|
||||
<td> Set an optimization directive</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_parens_nest_limit.html">pcre2_set_parens_nest_limit</a></td>
|
||||
<td> Set the parentheses nesting limit</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_recursion_limit.html">pcre2_set_recursion_limit</a></td>
|
||||
<td> Obsolete: use pcre2_set_depth_limit</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_recursion_memory_management.html">pcre2_set_recursion_memory_management</a></td>
|
||||
<td> Obsolete function that (from 10.30 onwards) does nothing</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_substitute_callout.html">pcre2_set_substitute_callout</a></td>
|
||||
<td> Set a substitution callout function</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_set_substitute_case_callout.html">pcre2_set_substitute_case_callout</a></td>
|
||||
<td> Set a substitution case callout function</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_substitute.html">pcre2_substitute</a></td>
|
||||
<td> Match a compiled pattern to a subject string and do
|
||||
substitutions</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_substring_copy_byname.html">pcre2_substring_copy_byname</a></td>
|
||||
<td> Extract named substring into given buffer</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_substring_copy_bynumber.html">pcre2_substring_copy_bynumber</a></td>
|
||||
<td> Extract numbered substring into given buffer</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_substring_free.html">pcre2_substring_free</a></td>
|
||||
<td> Free extracted substring</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_substring_get_byname.html">pcre2_substring_get_byname</a></td>
|
||||
<td> Extract named substring into new memory</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_substring_get_bynumber.html">pcre2_substring_get_bynumber</a></td>
|
||||
<td> Extract numbered substring into new memory</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_substring_length_byname.html">pcre2_substring_length_byname</a></td>
|
||||
<td> Find length of named substring</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_substring_length_bynumber.html">pcre2_substring_length_bynumber</a></td>
|
||||
<td> Find length of numbered substring</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_substring_list_free.html">pcre2_substring_list_free</a></td>
|
||||
<td> Free list of extracted substrings</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_substring_list_get.html">pcre2_substring_list_get</a></td>
|
||||
<td> Extract all substrings into new memory</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_substring_nametable_scan.html">pcre2_substring_nametable_scan</a></td>
|
||||
<td> Find table entries for given string name</td></tr>
|
||||
|
||||
<tr><td><a href="pcre2_substring_number_from_name.html">pcre2_substring_number_from_name</a></td>
|
||||
<td> Convert captured string name to number</td></tr>
|
||||
</table>
|
||||
|
||||
</html>
|
||||
|
||||
102
3rd/pcre2/doc/html/pcre2-config.html
Normal file
@@ -0,0 +1,102 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2-config specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2-config man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
|
||||
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
|
||||
<li><a name="TOC3" href="#SEC3">OPTIONS</a>
|
||||
<li><a name="TOC4" href="#SEC4">SEE ALSO</a>
|
||||
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
|
||||
<li><a name="TOC6" href="#SEC6">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
|
||||
<P>
|
||||
<b>pcre2-config [--prefix] [--exec-prefix] [--version]</b>
|
||||
<b> [--libs8] [--libs16] [--libs32] [--libs-posix]</b>
|
||||
<b> [--cflags] [--cflags-posix]</b>
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
|
||||
<P>
|
||||
<b>pcre2-config</b> returns the configuration of the installed PCRE2 libraries
|
||||
and the options required to compile a program to use them. Some of the options
|
||||
apply only to the 8-bit, or 16-bit, or 32-bit libraries, respectively, and are
|
||||
not available for libraries that have not been built. If an unavailable option
|
||||
is encountered, the "usage" information is output.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">OPTIONS</a><br>
|
||||
<P>
|
||||
<b>--prefix</b>
|
||||
Writes the directory prefix used in the PCRE2 installation for architecture
|
||||
independent files (<i>/usr</i> on many systems, <i>/usr/local</i> on some
|
||||
systems) to the standard output.
|
||||
</P>
|
||||
<P>
|
||||
<b>--exec-prefix</b>
|
||||
Writes the directory prefix used in the PCRE2 installation for architecture
|
||||
dependent files (normally the same as <b>--prefix</b>) to the standard output.
|
||||
</P>
|
||||
<P>
|
||||
<b>--version</b>
|
||||
Writes the version number of the installed PCRE2 libraries to the standard
|
||||
output.
|
||||
</P>
|
||||
<P>
|
||||
<b>--libs8</b>
|
||||
Writes to the standard output the command line options required to link
|
||||
with the 8-bit PCRE2 library (<b>-lpcre2-8</b> on many systems).
|
||||
</P>
|
||||
<P>
|
||||
<b>--libs16</b>
|
||||
Writes to the standard output the command line options required to link
|
||||
with the 16-bit PCRE2 library (<b>-lpcre2-16</b> on many systems).
|
||||
</P>
|
||||
<P>
|
||||
<b>--libs32</b>
|
||||
Writes to the standard output the command line options required to link
|
||||
with the 32-bit PCRE2 library (<b>-lpcre2-32</b> on many systems).
|
||||
</P>
|
||||
<P>
|
||||
<b>--libs-posix</b>
|
||||
Writes to the standard output the command line options required to link with
|
||||
PCRE2's POSIX API wrapper library (<b>-lpcre2-posix</b> <b>-lpcre2-8</b> on many
|
||||
systems).
|
||||
</P>
|
||||
<P>
|
||||
<b>--cflags</b>
|
||||
Writes to the standard output the command line options required to compile
|
||||
files that use PCRE2 (this may include some <b>-I</b> options, but is blank on
|
||||
many systems).
|
||||
</P>
|
||||
<P>
|
||||
<b>--cflags-posix</b>
|
||||
Writes to the standard output the command line options required to compile
|
||||
files that use PCRE2's POSIX API wrapper library (this may include some
|
||||
<b>-I</b> options, but is blank on many systems).
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2(3)</b>
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
This manual page was originally written by Mark Baker for the Debian GNU/Linux
|
||||
system. It has been subsequently revised as a generic PCRE2 man page.
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 28 September 2014
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
214
3rd/pcre2/doc/html/pcre2.html
Normal file
@@ -0,0 +1,214 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2 specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2 man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">INTRODUCTION</a>
|
||||
<li><a name="TOC2" href="#SEC2">SECURITY CONSIDERATIONS</a>
|
||||
<li><a name="TOC3" href="#SEC3">USER DOCUMENTATION</a>
|
||||
<li><a name="TOC4" href="#SEC4">AUTHORS</a>
|
||||
<li><a name="TOC5" href="#SEC5">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">INTRODUCTION</a><br>
|
||||
<P>
|
||||
PCRE2 is the name used for a revised API for the PCRE library, which is a set
|
||||
of functions, written in C, that implement regular expression pattern matching
|
||||
using the same syntax and semantics as Perl, with just a few differences. After
|
||||
nearly two decades, the limitations of the original API were making development
|
||||
increasingly difficult. The new API is more extensible, and it was simplified
|
||||
by abolishing the separate "study" optimizing function; in PCRE2, patterns are
|
||||
automatically optimized where possible. Since forking from PCRE1, the code has
|
||||
been extensively refactored and new features introduced. The old library is now
|
||||
obsolete and is no longer maintained.
|
||||
</P>
|
||||
<P>
|
||||
As well as Perl-style regular expression patterns, some features that appeared
|
||||
in Python and the original PCRE before they appeared in Perl are available
|
||||
using the Python syntax. There is also some support for one or two .NET and
|
||||
Oniguruma syntax items, and there are options for requesting some minor changes
|
||||
that give better ECMAScript (aka JavaScript) compatibility.
|
||||
</P>
|
||||
<P>
|
||||
The source code for PCRE2 can be compiled to support strings of 8-bit, 16-bit,
|
||||
or 32-bit code units, which means that up to three separate libraries may be
|
||||
installed, one for each code unit size. The size of code unit is not related to
|
||||
the bit size of the underlying hardware. In a 64-bit environment that also
|
||||
supports 32-bit applications, versions of PCRE2 that are compiled in both
|
||||
64-bit and 32-bit modes may be needed.
|
||||
</P>
|
||||
<P>
|
||||
The original work to extend PCRE to 16-bit and 32-bit code units was done by
|
||||
Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings
|
||||
can be interpreted either as one character per code unit, or as UTF-encoded
|
||||
Unicode, with support for Unicode general category properties. Unicode support
|
||||
is optional at build time (but is the default). However, processing strings as
|
||||
UTF code units must be enabled explicitly at run time. The version of Unicode
|
||||
in use can be discovered by running
|
||||
<pre>
|
||||
pcre2test -C
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
The three libraries contain identical sets of functions, with names ending in
|
||||
_8, _16, or _32, respectively (for example, <b>pcre2_compile_8()</b>). However,
|
||||
by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or 32, a program that uses just
|
||||
one code unit width can be written using generic names such as
|
||||
<b>pcre2_compile()</b>, and the documentation is written assuming that this is
|
||||
the case.
|
||||
</P>
|
||||
<P>
|
||||
In addition to the Perl-compatible matching function, PCRE2 contains an
|
||||
alternative function that matches the same compiled patterns in a different
|
||||
way. In certain circumstances, the alternative function has some advantages.
|
||||
For a discussion of the two matching algorithms, see the
|
||||
<a href="pcre2matching.html"><b>pcre2matching</b></a>
|
||||
page.
|
||||
</P>
|
||||
<P>
|
||||
Details of exactly which Perl regular expression features are and are not
|
||||
supported by PCRE2 are given in separate documents. See the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
and
|
||||
<a href="pcre2compat.html"><b>pcre2compat</b></a>
|
||||
pages. There is a syntax summary in the
|
||||
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
|
||||
page.
|
||||
</P>
|
||||
<P>
|
||||
Some features of PCRE2 can be included, excluded, or changed when the library
|
||||
is built. The
|
||||
<a href="pcre2_config.html"><b>pcre2_config()</b></a>
|
||||
function makes it possible for a client to discover which features are
|
||||
available. The features themselves are described in the
|
||||
<a href="pcre2build.html"><b>pcre2build</b></a>
|
||||
page. Documentation about building PCRE2 for various operating systems can be
|
||||
found in the
|
||||
<a href="README.txt"><b>README</b></a>
|
||||
and
|
||||
<a href="NON-AUTOTOOLS-BUILD.txt"><b>NON-AUTOTOOLS-BUILD</b></a>
|
||||
files in the source distribution.
|
||||
</P>
|
||||
<P>
|
||||
The libraries contain a number of undocumented internal functions and data
|
||||
tables that are used by more than one of the exported external functions, but
|
||||
which are not intended for use by external callers. Their names all begin with
|
||||
"_pcre2", which hopefully will not provoke any name clashes. In some
|
||||
environments, it is possible to control which external symbols are exported
|
||||
when a shared library is built, and in these cases the undocumented symbols are
|
||||
not exported.
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">SECURITY CONSIDERATIONS</a><br>
|
||||
<P>
|
||||
If you are using PCRE2 in a non-UTF application that permits users to supply
|
||||
arbitrary patterns for compilation, you should be aware of a feature that
|
||||
allows users to turn on UTF support from within a pattern. For example, an
|
||||
8-bit pattern that begins with "(*UTF)" turns on UTF-8 mode, which interprets
|
||||
patterns and subjects as strings of UTF-8 code units instead of individual
|
||||
8-bit characters. This causes both the pattern and any data against which it is
|
||||
matched to be checked for UTF-8 validity. If the data string is very long, such
|
||||
a check might consume enough resources to degrade your application's
performance.
|
||||
</P>
|
||||
<P>
|
||||
One way of guarding against this possibility is to use the
|
||||
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
|
||||
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
|
||||
<b>pcre2_compile()</b>. This causes a compile time error if the pattern contains
|
||||
a UTF-setting sequence.
|
||||
</P>
|
||||
<P>
|
||||
The use of Unicode properties for character types such as \d can also be
|
||||
enabled from within the pattern, by specifying "(*UCP)". This feature can be
|
||||
disallowed by setting the PCRE2_NEVER_UCP option.
|
||||
</P>
|
||||
<P>
|
||||
If your application is one that supports UTF, be aware that validity checking
|
||||
can take time. If the same data string is to be matched many times, you can use
|
||||
the PCRE2_NO_UTF_CHECK option for the second and subsequent matches to avoid
|
||||
running redundant checks.
|
||||
</P>
|
||||
<P>
|
||||
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead to
|
||||
problems, because it may leave the current matching point in the middle of a
|
||||
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used by an
|
||||
application to lock out the use of \C, causing a compile-time error if it is
|
||||
encountered. It is also possible to build PCRE2 with the use of \C permanently
|
||||
disabled.
|
||||
</P>
|
||||
<P>
|
||||
Another way that performance can suffer is by running a pattern that has a very
|
||||
large search tree against a string that will never match. Nested unlimited
|
||||
repeats in a pattern are a common example. PCRE2 provides some protection
|
||||
against this: see the <b>pcre2_set_match_limit()</b> function in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page. There is a similar function called <b>pcre2_set_depth_limit()</b> that can
|
||||
be used to restrict the amount of memory that is used.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br>
|
||||
<P>
|
||||
The user documentation for PCRE2 comprises a number of different sections. In
|
||||
the "man" format, each of these is a separate "man page". In the HTML format,
|
||||
each is a separate page, linked from the index page. In the plain text format,
|
||||
the descriptions of the <b>pcre2grep</b> and <b>pcre2test</b> programs are in
|
||||
files called <b>pcre2grep.txt</b> and <b>pcre2test.txt</b>, respectively. The
|
||||
remaining sections, except for the <b>pcre2demo</b> section (which is a program
|
||||
listing), and the short pages for individual functions, are concatenated in
|
||||
<b>pcre2.txt</b>, for ease of searching. The sections are as follows:
|
||||
<pre>
|
||||
pcre2 this document
|
||||
pcre2-config show PCRE2 installation configuration information
|
||||
pcre2api details of PCRE2's native C API
|
||||
pcre2build building PCRE2
|
||||
pcre2callout details of the pattern callout feature
|
||||
pcre2compat discussion of Perl compatibility
|
||||
pcre2convert details of pattern conversion functions
|
||||
pcre2demo a demonstration C program that uses PCRE2
|
||||
pcre2grep description of the <b>pcre2grep</b> command (8-bit only)
|
||||
pcre2jit discussion of just-in-time optimization support
|
||||
pcre2limits details of size and other limits
|
||||
pcre2matching discussion of the two matching algorithms
|
||||
pcre2partial details of the partial matching facility
|
||||
pcre2pattern syntax and semantics of supported regular expression patterns
|
||||
pcre2perform discussion of performance issues
|
||||
pcre2posix the POSIX-compatible C API for the 8-bit library
|
||||
pcre2sample discussion of the pcre2demo program
|
||||
pcre2serialize details of pattern serialization
|
||||
pcre2syntax quick syntax reference
|
||||
pcre2test description of the <b>pcre2test</b> command
|
||||
pcre2unicode discussion of Unicode and UTF support
|
||||
</pre>
|
||||
In the "man" and HTML formats, there is also a short page for each C library
|
||||
function, listing its arguments and results.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">AUTHORS</a><br>
|
||||
<P>
|
||||
The current maintainers of PCRE2 are Nicholas Wilson and Zoltan Herczeg.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2 was written by Philip Hazel, of the University Computing Service,
|
||||
Cambridge, England. Many others have also contributed.
|
||||
</P>
|
||||
<P>
|
||||
To contact the maintainers, please use the GitHub issue tracker or the PCRE2
|
||||
mailing list, as described at the project page:
|
||||
<a href="https://github.com/PCRE2Project/pcre2">https://github.com/PCRE2Project/pcre2</a>
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 18 December 2024
|
||||
<br>
|
||||
Copyright © 1997-2021 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
63
3rd/pcre2/doc/html/pcre2_callout_enumerate.html
Normal file
@@ -0,0 +1,63 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_callout_enumerate specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_callout_enumerate man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
|
||||
<b> int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
|
||||
<b> void *<i>callout_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function scans a compiled regular expression and calls the <i>callback()</i>
|
||||
function for each callout within the pattern. The yield of the function is zero
|
||||
for success and non-zero otherwise. The arguments are:
|
||||
<pre>
|
||||
<i>code</i> Points to the compiled pattern
|
||||
<i>callback</i> The callback function
|
||||
<i>callout_data</i> User data that is passed to the callback
|
||||
</pre>
|
||||
The <i>callback()</i> function is passed a pointer to a data block containing
|
||||
the following fields (not necessarily in this order):
|
||||
<pre>
|
||||
uint32_t <i>version</i> Block version number
|
||||
uint32_t <i>callout_number</i> Number for numbered callouts
|
||||
PCRE2_SIZE <i>pattern_position</i> Offset to next item in pattern
|
||||
PCRE2_SIZE <i>next_item_length</i> Length of next item in pattern
|
||||
PCRE2_SIZE <i>callout_string_offset</i> Offset to string within pattern
|
||||
PCRE2_SIZE <i>callout_string_length</i> Length of callout string
|
||||
PCRE2_SPTR <i>callout_string</i> Points to callout string or is NULL
|
||||
</pre>
|
||||
The second argument passed to the <b>callback()</b> function is the callout data
|
||||
that was passed to <b>pcre2_callout_enumerate()</b>. The <b>callback()</b>
|
||||
function must return zero for success. Any other value causes the pattern scan
|
||||
to stop, with the value being passed back as the result of
|
||||
<b>pcre2_callout_enumerate()</b>.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
43
3rd/pcre2/doc/html/pcre2_code_copy.html
Normal file
@@ -0,0 +1,43 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_code_copy specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_code_copy man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function makes a copy of the memory used for a compiled pattern, excluding
|
||||
any memory used by the JIT compiler. Without a subsequent call to
|
||||
<b>pcre2_jit_compile()</b>, the copy can be used only for non-JIT matching. The
|
||||
pointer to the character tables is copied, not the tables themselves (see
|
||||
<b>pcre2_code_copy_with_tables()</b>). The yield of the function is NULL if
|
||||
<i>code</i> is NULL or if sufficient memory cannot be obtained.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
44
3rd/pcre2/doc/html/pcre2_code_copy_with_tables.html
Normal file
@@ -0,0 +1,44 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_code_copy_with_tables specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_code_copy_with_tables man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *<i>code</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function makes a copy of the memory used for a compiled pattern, excluding
|
||||
any memory used by the JIT compiler. Without a subsequent call to
|
||||
<b>pcre2_jit_compile()</b>, the copy can be used only for non-JIT matching.
|
||||
Unlike <b>pcre2_code_copy()</b>, a separate copy of the character tables is also
|
||||
made, with the new code pointing to it. This memory will be automatically freed
|
||||
when <b>pcre2_code_free()</b> is called. The yield of the function is NULL if
|
||||
<i>code</i> is NULL or if sufficient memory cannot be obtained.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
42
3rd/pcre2/doc/html/pcre2_code_free.html
Normal file
@@ -0,0 +1,42 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_code_free specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_code_free man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>void pcre2_code_free(pcre2_code *<i>code</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
If <i>code</i> is NULL, this function does nothing. Otherwise, <i>code</i> must
|
||||
point to a compiled pattern. This function frees its memory, including any
|
||||
memory used by the JIT compiler. If the compiled pattern was created by a call
|
||||
to <b>pcre2_code_copy_with_tables()</b>, the memory for the character tables is
|
||||
also freed.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
120
3rd/pcre2/doc/html/pcre2_compile.html
Normal file
@@ -0,0 +1,120 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_compile specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_compile man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_code *pcre2_compile(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
|
||||
<b> uint32_t <i>options</i>, int *<i>errorcode</i>, PCRE2_SIZE *<i>erroroffset</i>,</b>
|
||||
<b> pcre2_compile_context *<i>ccontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function compiles a regular expression pattern into an internal form. Its
|
||||
arguments are:
|
||||
<pre>
|
||||
<i>pattern</i> A string containing the expression to be compiled
|
||||
<i>length</i> The length of the string or PCRE2_ZERO_TERMINATED
|
||||
<i>options</i> Primary option bits
|
||||
<i>errorcode</i> Where to put an error code
|
||||
<i>erroroffset</i> Where to put an error offset
|
||||
<i>ccontext</i> Pointer to a compile context or NULL
|
||||
</pre>
|
||||
The length of the pattern and any error offset that is returned are in code
|
||||
units, not characters. A NULL pattern with zero length is treated as an empty
|
||||
string. A compile context is needed only if you want to provide custom memory
|
||||
allocation functions, or to provide an external function for system stack size
|
||||
checking (see <b>pcre2_set_compile_recursion_guard()</b>), or to change one or
|
||||
more of these parameters:
|
||||
<pre>
|
||||
What \R matches (Unicode newlines, or CR, LF, CRLF only);
|
||||
PCRE2's character tables;
|
||||
The newline character sequence;
|
||||
The compile time nested parentheses limit;
|
||||
The maximum pattern length (in code units) that is allowed;
|
||||
The additional options bits.
|
||||
</pre>
|
||||
The primary option bits are:
|
||||
<pre>
|
||||
PCRE2_ANCHORED Force pattern anchoring
|
||||
PCRE2_ALLOW_EMPTY_CLASS Allow empty classes
|
||||
PCRE2_ALT_BSUX Alternative handling of \u, \U, and \x
|
||||
PCRE2_ALT_CIRCUMFLEX Alternative handling of ^ in multiline mode
|
||||
PCRE2_ALT_EXTENDED_CLASS Alternative extended character class syntax
|
||||
PCRE2_ALT_VERBNAMES Process backslashes in verb names
|
||||
PCRE2_AUTO_CALLOUT Compile automatic callouts
|
||||
PCRE2_CASELESS Do caseless matching
|
||||
PCRE2_DOLLAR_ENDONLY $ not to match newline at end
|
||||
PCRE2_DOTALL . matches anything including NL
|
||||
PCRE2_DUPNAMES Allow duplicate names for subpatterns
|
||||
PCRE2_ENDANCHORED Pattern can match only at end of subject
|
||||
PCRE2_EXTENDED Ignore white space and # comments
|
||||
PCRE2_FIRSTLINE Force matching to be before newline
|
||||
PCRE2_LITERAL Pattern characters are all literal
|
||||
PCRE2_MATCH_INVALID_UTF Enable support for matching invalid UTF
|
||||
PCRE2_MATCH_UNSET_BACKREF Match unset backreferences
|
||||
PCRE2_MULTILINE ^ and $ match newlines within data
|
||||
PCRE2_NEVER_BACKSLASH_C Lock out the use of \C in patterns
|
||||
PCRE2_NEVER_UCP Lock out PCRE2_UCP, e.g. via (*UCP)
|
||||
PCRE2_NEVER_UTF Lock out PCRE2_UTF, e.g. via (*UTF)
|
||||
PCRE2_NO_AUTO_CAPTURE Disable numbered capturing paren-
|
||||
theses (named ones available)
|
||||
PCRE2_NO_AUTO_POSSESS Disable auto-possessification
|
||||
PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .*
|
||||
PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations
|
||||
PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity
|
||||
(only relevant if PCRE2_UTF is set)
|
||||
PCRE2_UCP Use Unicode properties for \d, \w, etc.
|
||||
PCRE2_UNGREEDY Invert greediness of quantifiers
|
||||
PCRE2_USE_OFFSET_LIMIT Enable offset limit for unanchored matching
|
||||
PCRE2_UTF Treat pattern and subjects as UTF strings
|
||||
</pre>
|
||||
PCRE2 must be built with Unicode support (the default) in order to use
|
||||
PCRE2_UTF, PCRE2_UCP and related options.
|
||||
</P>
|
||||
<P>
|
||||
Additional options may be set in the compile context via the
|
||||
<a href="pcre2_set_compile_extra_options.html"><b>pcre2_set_compile_extra_options</b></a>
|
||||
function.
|
||||
</P>
|
||||
<P>
|
||||
If either of <i>errorcode</i> or <i>erroroffset</i> is NULL, the function returns
|
||||
NULL immediately. Otherwise, the yield of this function is a pointer to a
|
||||
private data structure that contains the compiled pattern, or NULL if an error
|
||||
was detected. In the error case, a text error message can be obtained by
|
||||
passing the value returned via the <i>errorcode</i> argument to the
|
||||
<b>pcre2_get_error_message()</b> function. The offset (in code units) where the
|
||||
error was encountered is returned via the <i>erroroffset</i> argument.
|
||||
</P>
|
||||
<P>
|
||||
If there is no error, the value returned via <i>errorcode</i> yields the message
"no error" when passed to <b>pcre2_get_error_message()</b>, and the value returned
via <i>erroroffset</i> is zero.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API, with more detail on
|
||||
each option, in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page, and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
41
3rd/pcre2/doc/html/pcre2_compile_context_copy.html
Normal file
@@ -0,0 +1,41 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_compile_context_copy specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_compile_context_copy man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_compile_context *pcre2_compile_context_copy(</b>
|
||||
<b> pcre2_compile_context *<i>ccontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function makes a new copy of a compile context, using the memory
|
||||
allocation function that was used for the original context. The result is NULL
|
||||
if the memory cannot be obtained.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
42
3rd/pcre2/doc/html/pcre2_compile_context_create.html
Normal file
@@ -0,0 +1,42 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_compile_context_create specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_compile_context_create man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_compile_context *pcre2_compile_context_create(</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function creates and initializes a new compile context. If its argument is
|
||||
NULL, <b>malloc()</b> is used to get the necessary memory; otherwise the memory
|
||||
allocation function within the general context is used. The result is NULL if
|
||||
the memory could not be obtained.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
41
3rd/pcre2/doc/html/pcre2_compile_context_free.html
Normal file
@@ -0,0 +1,41 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_compile_context_free specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_compile_context_free man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>void pcre2_compile_context_free(pcre2_compile_context *<i>ccontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function frees the memory occupied by a compile context, using the memory
|
||||
freeing function from the general context with which it was created, or
|
||||
<b>free()</b> if that was not set. If the argument is NULL, the function returns
|
||||
immediately without doing anything.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
84
3rd/pcre2/doc/html/pcre2_config.html
Normal file
@@ -0,0 +1,84 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_config specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_config man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_config(uint32_t <i>what</i>, void *<i>where</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function makes it possible for a client program to find out which optional
|
||||
features are available in the version of the PCRE2 library it is using. The
|
||||
arguments are as follows:
|
||||
<pre>
|
||||
<i>what</i> A code specifying what information is required
|
||||
<i>where</i> Points to where to put the information
|
||||
</pre>
|
||||
If <i>where</i> is NULL, the function returns the amount of memory needed for
|
||||
the requested information. When the information is a string, the value is in
|
||||
code units; for other types of data it is in bytes.
|
||||
</P>
|
||||
<P>
|
||||
If <i>where</i> is not NULL, for PCRE2_CONFIG_JITTARGET,
|
||||
PCRE2_CONFIG_UNICODE_VERSION, and PCRE2_CONFIG_VERSION it must point to a
|
||||
buffer that is large enough to hold the string. For all other codes it must
|
||||
point to a uint32_t integer variable. The available codes are:
|
||||
<pre>
|
||||
PCRE2_CONFIG_BSR Indicates what \R matches by default:
|
||||
PCRE2_BSR_UNICODE
|
||||
PCRE2_BSR_ANYCRLF
|
||||
PCRE2_CONFIG_COMPILED_WIDTHS Which of 8/16/32 support was compiled
|
||||
PCRE2_CONFIG_DEPTHLIMIT Default backtracking depth limit
|
||||
PCRE2_CONFIG_HEAPLIMIT Default heap memory limit
|
||||
PCRE2_CONFIG_JIT Availability of just-in-time compiler support (1=yes 0=no)
|
||||
PCRE2_CONFIG_JITTARGET Information (a string) about the target architecture for the JIT compiler
|
||||
PCRE2_CONFIG_LINKSIZE Configured internal link size (2, 3, 4)
|
||||
PCRE2_CONFIG_MATCHLIMIT Default internal resource limit
|
||||
PCRE2_CONFIG_NEVER_BACKSLASH_C Whether or not \C is disabled
|
||||
PCRE2_CONFIG_NEWLINE Code for the default newline sequence:
|
||||
PCRE2_NEWLINE_CR
|
||||
PCRE2_NEWLINE_LF
|
||||
PCRE2_NEWLINE_CRLF
|
||||
PCRE2_NEWLINE_ANY
|
||||
PCRE2_NEWLINE_ANYCRLF
|
||||
PCRE2_NEWLINE_NUL
|
||||
PCRE2_CONFIG_PARENSLIMIT Default parentheses nesting limit
|
||||
PCRE2_CONFIG_RECURSIONLIMIT Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
|
||||
PCRE2_CONFIG_STACKRECURSE Obsolete: always returns 0
|
||||
PCRE2_CONFIG_UNICODE Availability of Unicode support (1=yes 0=no)
|
||||
PCRE2_CONFIG_UNICODE_VERSION The Unicode version (a string)
|
||||
PCRE2_CONFIG_VERSION The PCRE2 version (a string)
|
||||
</pre>
|
||||
The function yields a non-negative value on success or the negative value
|
||||
PCRE2_ERROR_BADOPTION otherwise. This is also the result for the
|
||||
PCRE2_CONFIG_JITTARGET code if JIT support is not available. When a string is
|
||||
requested, the function returns the number of code units used, including the
|
||||
terminating zero.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
40
3rd/pcre2/doc/html/pcre2_convert_context_copy.html
Normal file
@@ -0,0 +1,40 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_convert_context_copy specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_convert_context_copy man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_convert_context *pcre2_convert_context_copy(</b>
|
||||
<b> pcre2_convert_context *<i>cvcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function is part of an experimental set of pattern conversion functions.
|
||||
It makes a new copy of a convert context, using the memory allocation function
|
||||
that was used for the original context. The result is NULL if the memory cannot
|
||||
be obtained.
|
||||
</P>
|
||||
<P>
|
||||
The pattern conversion functions are described in the
|
||||
<a href="pcre2convert.html"><b>pcre2convert</b></a>
|
||||
documentation.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
41
3rd/pcre2/doc/html/pcre2_convert_context_create.html
Normal file
@@ -0,0 +1,41 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_convert_context_create specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_convert_context_create man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_convert_context *pcre2_convert_context_create(</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function is part of an experimental set of pattern conversion functions.
|
||||
It creates and initializes a new convert context. If its argument is
|
||||
NULL, <b>malloc()</b> is used to get the necessary memory; otherwise the memory
|
||||
allocation function within the general context is used. The result is NULL if
|
||||
the memory could not be obtained.
|
||||
</P>
|
||||
<P>
|
||||
The pattern conversion functions are described in the
|
||||
<a href="pcre2convert.html"><b>pcre2convert</b></a>
|
||||
documentation.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
40
3rd/pcre2/doc/html/pcre2_convert_context_free.html
Normal file
@@ -0,0 +1,40 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_convert_context_free specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_convert_context_free man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>void pcre2_convert_context_free(pcre2_convert_context *<i>cvcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function is part of an experimental set of pattern conversion functions.
|
||||
It frees the memory occupied by a convert context, using the memory
|
||||
freeing function from the general context with which it was created, or
|
||||
<b>free()</b> if that was not set. If the argument is NULL, the function returns
|
||||
immediately without doing anything.
|
||||
</P>
|
||||
<P>
|
||||
The pattern conversion functions are described in the
|
||||
<a href="pcre2convert.html"><b>pcre2convert</b></a>
|
||||
documentation.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
40
3rd/pcre2/doc/html/pcre2_converted_pattern_free.html
Normal file
@@ -0,0 +1,40 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_converted_pattern_free specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_converted_pattern_free man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>void pcre2_converted_pattern_free(PCRE2_UCHAR *<i>converted_pattern</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function is part of an experimental set of pattern conversion functions.
|
||||
It frees the memory occupied by a converted pattern that was obtained by
|
||||
calling <b>pcre2_pattern_convert()</b> with arguments that caused it to place
|
||||
the converted pattern into newly obtained heap memory. If the argument is NULL,
|
||||
the function returns immediately without doing anything.
|
||||
</P>
|
||||
<P>
|
||||
The pattern conversion functions are described in the
|
||||
<a href="pcre2convert.html"><b>pcre2convert</b></a>
|
||||
documentation.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
86
3rd/pcre2/doc/html/pcre2_dfa_match.html
Normal file
@@ -0,0 +1,86 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_dfa_match specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_dfa_match man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
||||
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
||||
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> int *<i>workspace</i>, PCRE2_SIZE <i>wscount</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function matches a compiled regular expression against a given subject
|
||||
string, using an alternative matching algorithm that scans the subject string
|
||||
just once (except when processing lookaround assertions). This function is
|
||||
<i>not</i> Perl-compatible (the Perl-compatible matching function is
|
||||
<b>pcre2_match()</b>). The arguments for this function are:
|
||||
<pre>
|
||||
<i>code</i> Points to the compiled pattern
|
||||
<i>subject</i> Points to the subject string
|
||||
<i>length</i> Length of the subject string
|
||||
<i>startoffset</i> Offset in the subject at which to start matching
|
||||
<i>options</i> Option bits
|
||||
<i>match_data</i> Points to a match data block, for results
|
||||
<i>mcontext</i> Points to a match context, or is NULL
|
||||
<i>workspace</i> Points to a vector of ints used as working space
|
||||
<i>wscount</i> Number of elements in the vector
|
||||
</pre>
|
||||
The size of the output vector needed to contain all the results depends on the
|
||||
number of simultaneous matches, not on the number of parentheses in the
|
||||
pattern. Using <b>pcre2_match_data_create_from_pattern()</b> to create the match
|
||||
data block is therefore not advisable when using this function.
|
||||
</P>
|
||||
<P>
|
||||
A match context is needed only if you want to set up a callout function or
|
||||
specify the heap limit, the match limit, or the recursion depth limit. The
|
||||
<i>length</i> and <i>startoffset</i> values are code units, not characters. The
|
||||
options are:
|
||||
<pre>
|
||||
PCRE2_ANCHORED Match only at the first position
|
||||
PCRE2_COPY_MATCHED_SUBJECT
|
||||
On success, make a private subject copy
|
||||
PCRE2_ENDANCHORED Pattern can match only at end of subject
|
||||
PCRE2_NOTBOL Subject is not the beginning of a line
|
||||
PCRE2_NOTEOL Subject is not the end of a line
|
||||
PCRE2_NOTEMPTY An empty string is not a valid match
|
||||
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject is not a valid match
|
||||
PCRE2_NO_UTF_CHECK Do not check the subject for UTF validity (only relevant if PCRE2_UTF
|
||||
was set at compile time)
|
||||
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match even if there is a full match
|
||||
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial match if no full matches are found
|
||||
PCRE2_DFA_RESTART Restart after a partial match
|
||||
PCRE2_DFA_SHORTEST Return only the shortest match
|
||||
</pre>
|
||||
There are restrictions on what may appear in a pattern when using this matching
|
||||
function. Details are given in the
|
||||
<a href="pcre2matching.html"><b>pcre2matching</b></a>
|
||||
documentation. For details of partial matching, see the
|
||||
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||
page. There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
42
3rd/pcre2/doc/html/pcre2_general_context_copy.html
Normal file
@@ -0,0 +1,42 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_general_context_copy specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_general_context_copy man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_general_context *pcre2_general_context_copy(</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function makes a new copy of a general context, using the memory
|
||||
allocation functions in the context, if set, to get the necessary memory.
|
||||
Otherwise <b>malloc()</b> is used. The result is NULL if the memory cannot be
|
||||
obtained.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
44
3rd/pcre2/doc/html/pcre2_general_context_create.html
Normal file
@@ -0,0 +1,44 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_general_context_create specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_general_context_create man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_general_context *pcre2_general_context_create(</b>
|
||||
<b> void *(*<i>private_malloc</i>)(size_t, void *),</b>
|
||||
<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function creates and initializes a general context. The arguments define
|
||||
custom memory management functions and a data value that is passed to them when
|
||||
they are called. The <b>private_malloc()</b> function is used to get memory for
|
||||
the context. If either of the first two arguments is NULL, the system memory
|
||||
management function is used. The result is NULL if no memory could be obtained.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
40
3rd/pcre2/doc/html/pcre2_general_context_free.html
Normal file
@@ -0,0 +1,40 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_general_context_free specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_general_context_free man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>void pcre2_general_context_free(pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function frees the memory occupied by a general context, using the memory
|
||||
freeing function within the context, if set. If the argument is NULL, the
|
||||
function returns immediately without doing anything.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
51
3rd/pcre2/doc/html/pcre2_get_error_message.html
Normal file
@@ -0,0 +1,51 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_get_error_message specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_get_error_message man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_get_error_message(int <i>errorcode</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
|
||||
<b> PCRE2_SIZE <i>bufflen</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function provides a textual error message for each PCRE2 error code.
|
||||
Compilation errors are positive numbers; UTF formatting errors and matching
|
||||
errors are negative numbers. The arguments are:
|
||||
<pre>
|
||||
<i>errorcode</i> an error code (positive or negative)
|
||||
<i>buffer</i> where to put the message
|
||||
<i>bufflen</i> the length of the buffer (code units)
|
||||
</pre>
|
||||
The function returns the length of the message in code units, excluding the
|
||||
trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
|
||||
too small. In this case, the returned message is truncated (but still with a
|
||||
trailing zero). If <i>errorcode</i> does not contain a recognized error code
|
||||
number, the negative value PCRE2_ERROR_BADDATA is returned.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
47
3rd/pcre2/doc/html/pcre2_get_mark.html
Normal file
@@ -0,0 +1,47 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_get_mark specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_get_mark man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
After a call of <b>pcre2_match()</b> that was passed the match block that is
|
||||
this function's argument, this function returns a pointer to the last (*MARK),
|
||||
(*PRUNE), or (*THEN) name that was encountered during the matching process. The
|
||||
name is zero-terminated, and is within the compiled pattern. The length of the
|
||||
name is in the preceding code unit. If no name is available, NULL is returned.
|
||||
</P>
|
||||
<P>
|
||||
After a successful match, the name that is returned is the last one on the
|
||||
matching path. After a failed match or a partial match, the last encountered
|
||||
name is returned.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
40
3rd/pcre2/doc/html/pcre2_get_match_data_heapframes_size.html
Normal file
@@ -0,0 +1,40 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_get_match_data_heapframes_size specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_get_match_data_heapframes_size man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>PCRE2_SIZE pcre2_get_match_data_heapframes_size(</b>
|
||||
<b> pcre2_match_data *<i>match_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function returns the size, in bytes, of the heapframes data block that is
|
||||
owned by its argument.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
39
3rd/pcre2/doc/html/pcre2_get_match_data_size.html
Normal file
@@ -0,0 +1,39 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_get_match_data_size specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_get_match_data_size man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *<i>match_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function returns the size, in bytes, of the match data block that is its
|
||||
argument.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
39
3rd/pcre2/doc/html/pcre2_get_ovector_count.html
Normal file
@@ -0,0 +1,39 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_get_ovector_count specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_get_ovector_count man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>uint32_t pcre2_get_ovector_count(pcre2_match_data *<i>match_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function returns the number of pairs of offsets in the ovector that forms
|
||||
part of the given match data block.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
40
3rd/pcre2/doc/html/pcre2_get_ovector_pointer.html
Normal file
@@ -0,0 +1,40 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_get_ovector_pointer specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_get_ovector_pointer man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *<i>match_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function returns a pointer to the vector of offsets that forms part of the
|
||||
given match data block. The number of pairs can be found by calling
|
||||
<b>pcre2_get_ovector_count()</b>.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
44
3rd/pcre2/doc/html/pcre2_get_startchar.html
Normal file
@@ -0,0 +1,44 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_get_startchar specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_get_startchar man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
After a successful call of <b>pcre2_match()</b> that was passed the match block
|
||||
that is this function's argument, this function returns the code unit offset of
|
||||
the character at which the successful match started. For a non-partial match,
|
||||
this can be different to the value of <i>ovector[0]</i> if the pattern contains
|
||||
the \K escape sequence. After a partial match, however, this value is always
|
||||
the same as <i>ovector[0]</i> because \K does not affect the result of a
|
||||
partial match.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
74
3rd/pcre2/doc/html/pcre2_jit_compile.html
Normal file
@@ -0,0 +1,74 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_jit_compile specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_jit_compile man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_jit_compile(pcre2_code *<i>code</i>, uint32_t <i>options</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function requests JIT compilation, which, if the just-in-time compiler is
|
||||
available, further processes a compiled pattern into machine code that executes
|
||||
much faster than the <b>pcre2_match()</b> interpretive matching function. Full
|
||||
details are given in the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
The availability of JIT support can be tested by calling
|
||||
<b>pcre2_jit_compile()</b> with the single option PCRE2_JIT_TEST_ALLOC (the
|
||||
code argument is ignored, so a NULL value is accepted). Such a call
|
||||
returns zero if JIT is available and has a working allocator. Otherwise
|
||||
it returns PCRE2_ERROR_NOMEMORY if JIT is available but cannot allocate
|
||||
executable memory, or PCRE2_ERROR_JIT_UNSUPPORTED if JIT support is not
|
||||
compiled in.
|
||||
</P>
|
||||
<P>
|
||||
Otherwise, the first argument must be a pointer that was returned by a
|
||||
successful call to <b>pcre2_compile()</b>, and the second must contain one or
|
||||
more of the following bits:
|
||||
<pre>
|
||||
PCRE2_JIT_COMPLETE compile code for full matching
|
||||
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
|
||||
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
|
||||
</pre>
|
||||
There is also an obsolete option called PCRE2_JIT_INVALID_UTF, which has been
|
||||
superseded by the <b>pcre2_compile()</b> option PCRE2_MATCH_INVALID_UTF. The old
|
||||
option is deprecated and may be removed in the future.
|
||||
</P>
|
||||
<P>
|
||||
The yield of the function when called with any of the three options above is 0
|
||||
for success, or a negative error code otherwise. In particular,
|
||||
PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or if an unknown
|
||||
bit is set in <i>options</i>. The function can also return PCRE2_ERROR_NOMEMORY
|
||||
if JIT is unable to allocate executable memory for the compiler, even if it was
|
||||
because of a system security restriction. In a few cases, the function may
|
||||
return with PCRE2_ERROR_JIT_UNSUPPORTED for unsupported features.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
43
3rd/pcre2/doc/html/pcre2_jit_free_unused_memory.html
Normal file
@@ -0,0 +1,43 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_jit_free_unused_memory specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_jit_free_unused_memory man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function frees unused JIT executable memory. The argument is a general
|
||||
context, for custom memory management, or NULL for standard memory management.
|
||||
JIT memory allocation retains some memory in order to improve future JIT
|
||||
compilation speed. In low memory conditions,
|
||||
<b>pcre2_jit_free_unused_memory()</b> can be used to cause this memory to be
|
||||
freed.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
70
3rd/pcre2/doc/html/pcre2_jit_match.html
Normal file
@@ -0,0 +1,70 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_jit_match specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_jit_match man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_jit_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
||||
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
||||
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function matches a compiled regular expression that has been successfully
|
||||
processed by the JIT compiler against a given subject string, using a matching
|
||||
algorithm that is similar to Perl's. It is a "fast path" interface to JIT, and
|
||||
it bypasses some of the sanity checks that <b>pcre2_match()</b> applies.
|
||||
</P>
|
||||
<P>
|
||||
In UTF mode, the subject string is not checked for UTF validity. Unless
|
||||
PCRE2_MATCH_INVALID_UTF was set when the pattern was compiled, passing an
|
||||
invalid UTF string results in undefined behaviour. Your program may crash or
|
||||
loop or give wrong results. In the absence of PCRE2_MATCH_INVALID_UTF you
|
||||
should only call <b>pcre2_jit_match()</b> in UTF mode if you are sure the
|
||||
subject is valid.
|
||||
</P>
|
||||
<P>
|
||||
The arguments for <b>pcre2_jit_match()</b> are exactly the same as for
|
||||
<a href="pcre2_match.html"><b>pcre2_match()</b>,</a>
|
||||
except that the subject string must be specified with a length;
|
||||
PCRE2_ZERO_TERMINATED is not supported.
|
||||
</P>
|
||||
<P>
|
||||
The supported options are PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
|
||||
PCRE2_NOTEMPTY_ATSTART, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Unsupported
|
||||
options are ignored.
|
||||
</P>
|
||||
<P>
|
||||
The return values are the same as for <b>pcre2_match()</b> plus
|
||||
PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is requested
|
||||
that was not compiled. For details of partial matching, see the
|
||||
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||
page.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the JIT API in the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
75
3rd/pcre2/doc/html/pcre2_jit_stack_assign.html
Normal file
@@ -0,0 +1,75 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_jit_stack_assign specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_jit_stack_assign man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>void pcre2_jit_stack_assign(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> pcre2_jit_callback <i>callback_function</i>, void *<i>callback_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function provides control over the memory used by JIT as a run-time stack
|
||||
when <b>pcre2_match()</b> or <b>pcre2_jit_match()</b> is called with a pattern
|
||||
that has been successfully processed by the JIT compiler. The information that
|
||||
determines which stack is used is put into a match context that is subsequently
|
||||
passed to a matching function. The arguments of this function are:
|
||||
<pre>
|
||||
mcontext a pointer to a match context
|
||||
callback a callback function
|
||||
callback_data a JIT stack or a value to be passed to the callback
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
If <i>mcontext</i> is NULL, the function returns immediately, without doing
|
||||
anything.
|
||||
</P>
|
||||
<P>
|
||||
If <i>callback</i> is NULL and <i>callback_data</i> is NULL, an internal 32KiB
|
||||
block on the machine stack is used.
|
||||
</P>
|
||||
<P>
|
||||
If <i>callback</i> is NULL and <i>callback_data</i> is not NULL,
|
||||
<i>callback_data</i> must be a valid JIT stack, the result of calling
|
||||
<b>pcre2_jit_stack_create()</b>.
|
||||
</P>
|
||||
<P>
|
||||
If <i>callback</i> is not NULL, it is called with <i>callback_data</i> as an
|
||||
argument at the start of matching, in order to set up a JIT stack. If the
|
||||
result is NULL, the internal 32KiB stack is used; otherwise the return value
|
||||
must be a valid JIT stack, the result of calling
|
||||
<b>pcre2_jit_stack_create()</b>.
|
||||
</P>
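The three cases above can be summarised as a small decision function. The following is only a model of the selection rules described on this page, not PCRE2 code: the names model_stack, model_callback, select_stack, pass_through and always_internal are hypothetical stand-ins, and a NULL result here stands for "use the internal 32KiB block on the machine stack".

```c
#include <stddef.h>

/* Hypothetical stand-ins for pcre2_jit_stack and pcre2_jit_callback. */
typedef struct model_stack { size_t size; } model_stack;
typedef model_stack *(*model_callback)(void *);

/* Model of the rules described above. A NULL result means that the
   internal 32KiB block on the machine stack would be used. */
model_stack *select_stack(model_callback callback, void *callback_data)
{
    if (callback == NULL) {
        /* No callback: callback_data is either NULL (internal stack)
           or is itself the JIT stack to use. */
        return (model_stack *)callback_data;
    }
    /* Callback given: it receives callback_data and returns either a
       JIT stack or NULL (internal stack). */
    return callback(callback_data);
}

/* Example callback that treats its argument as the stack to use. */
model_stack *pass_through(void *data)
{
    return (model_stack *)data;
}

/* Example callback that always asks for the internal stack. */
model_stack *always_internal(void *data)
{
    (void)data;
    return NULL;
}
```

In real code the stack would come from pcre2_jit_stack_create() and the choice would be registered with pcre2_jit_stack_assign(); the model only makes the order of the checks explicit.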
|
||||
<P>
|
||||
You may safely use the same JIT stack for multiple patterns, as long as they
|
||||
are all matched in the same thread. In a multithread application, each thread
|
||||
must use its own JIT stack. For more details, see the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
page.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
50
3rd/pcre2/doc/html/pcre2_jit_stack_create.html
Normal file
@@ -0,0 +1,50 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_jit_stack_create specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_jit_stack_create man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_jit_stack *pcre2_jit_stack_create(size_t <i>startsize</i>,</b>
|
||||
<b> size_t <i>maxsize</i>, pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function is used to create a stack for use by the code compiled by the JIT
|
||||
compiler. The first two arguments are a starting size for the stack, and a
|
||||
maximum size to which it is allowed to grow. The final argument is a general
|
||||
context, for memory allocation functions, or NULL for standard memory
|
||||
allocation. The result can be passed to the JIT run-time code by calling
|
||||
<b>pcre2_jit_stack_assign()</b> to associate the stack with a match context,
|
||||
which can then be passed to <b>pcre2_match()</b> or <b>pcre2_jit_match()</b>.
|
||||
A maximum stack size of 512KiB to 1MiB should be more than enough for any
|
||||
pattern. If the stack couldn't be allocated or the values passed were not
|
||||
reasonable, NULL will be returned. For more details, see the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
page.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
43
3rd/pcre2/doc/html/pcre2_jit_stack_free.html
Normal file
@@ -0,0 +1,43 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_jit_stack_free specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_jit_stack_free man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>void pcre2_jit_stack_free(pcre2_jit_stack *<i>jit_stack</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function is used to free a JIT stack that was created by
|
||||
<b>pcre2_jit_stack_create()</b> when it is no longer needed. If the argument is
|
||||
NULL, the function returns immediately without doing anything. For more
|
||||
details, see the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
page.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
48
3rd/pcre2/doc/html/pcre2_maketables.html
Normal file
@@ -0,0 +1,48 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_maketables specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_maketables man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>const uint8_t *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function builds a set of character tables for character code points that
|
||||
are less than 256. These can be passed to <b>pcre2_compile()</b> in a compile
|
||||
context in order to override the internal, built-in tables (which were either
|
||||
defaulted or made by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
|
||||
<a href="pcre2_set_character_tables.html"><b>pcre2_set_character_tables()</b></a>
|
||||
page. You might want to do this if you are using a non-standard locale.
|
||||
</P>
|
||||
<P>
|
||||
If the argument is NULL, <b>malloc()</b> is used to get memory for the tables.
|
||||
Otherwise it must point to a general context, which can supply pointers to a
|
||||
custom memory manager. The function yields a pointer to the tables.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
44
3rd/pcre2/doc/html/pcre2_maketables_free.html
Normal file
@@ -0,0 +1,44 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_maketables_free specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_maketables_free man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>void pcre2_maketables_free(pcre2_general_context *<i>gcontext</i>,</b>
|
||||
<b> const uint8_t *<i>tables</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function discards a set of character tables that were created by a call
|
||||
to
|
||||
<a href="pcre2_maketables.html"><b>pcre2_maketables()</b>.</a>
|
||||
</P>
|
||||
<P>
|
||||
The <i>gcontext</i> parameter should match what was used in that call to
|
||||
account for any custom allocators that might be in use; if it is NULL
|
||||
the system <b>free()</b> is used.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
87
3rd/pcre2/doc/html/pcre2_match.html
Normal file
@@ -0,0 +1,87 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_match specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_match man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
||||
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
||||
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function matches a compiled regular expression against a given subject
|
||||
string, using a matching algorithm that is similar to Perl's. It returns
|
||||
offsets to what it has matched and to captured substrings via the
|
||||
<b>match_data</b> block, which can be processed by functions with names that
|
||||
start with <b>pcre2_get_ovector_...()</b> or <b>pcre2_substring_...()</b>. The
|
||||
return from <b>pcre2_match()</b> is one more than the highest numbered capturing
|
||||
pair that has been set (for example, 1 if there are no captures), zero if the
|
||||
vector of offsets is too small, or a negative error code for no match and other
|
||||
errors. The function arguments are:
|
||||
<pre>
|
||||
<i>code</i> Points to the compiled pattern
|
||||
<i>subject</i> Points to the subject string
|
||||
<i>length</i> Length of the subject string
|
||||
<i>startoffset</i> Offset in the subject at which to start matching
|
||||
<i>options</i> Option bits
|
||||
<i>match_data</i> Points to a match data block, for results
|
||||
<i>mcontext</i> Points to a match context, or is NULL
|
||||
</pre>
|
||||
A match context is needed only if you want to:
|
||||
<pre>
|
||||
Set up a callout function
|
||||
Set a matching offset limit
|
||||
Change the heap memory limit
|
||||
Change the backtracking match limit
|
||||
Change the backtracking depth limit
|
||||
Set custom memory management specifically for the match
|
||||
</pre>
|
||||
The <i>length</i> and <i>startoffset</i> values are code units, not characters.
|
||||
The length may be given as PCRE2_ZERO_TERMINATED for a subject that is
|
||||
terminated by a binary zero code unit. The options are:
|
||||
<pre>
|
||||
PCRE2_ANCHORED Match only at the first position
|
||||
PCRE2_COPY_MATCHED_SUBJECT
|
||||
On success, make a private subject copy
|
||||
PCRE2_DISABLE_RECURSELOOP_CHECK
|
||||
Only useful in rare cases; use with care
|
||||
PCRE2_ENDANCHORED Pattern can match only at end of subject
|
||||
PCRE2_NOTBOL Subject string is not the beginning of a line
|
||||
PCRE2_NOTEOL Subject string is not the end of a line
|
||||
PCRE2_NOTEMPTY An empty string is not a valid match
|
||||
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject is not a valid match
|
||||
PCRE2_NO_JIT Do not use JIT matching
|
||||
PCRE2_NO_UTF_CHECK Do not check the subject for UTF validity (only relevant if PCRE2_UTF
|
||||
was set at compile time)
|
||||
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match even if there is a full match
|
||||
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial match if no full matches are found
|
||||
</pre>
|
||||
For details of partial matching, see the
|
||||
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||
page. There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
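Because length and startoffset are counted in code units, the two counts differ as soon as the subject contains multi-unit characters. The sketch below illustrates this for the 8-bit library only, where a code unit is one byte; it does not call PCRE2, the helper names utf8_code_units and utf8_characters are hypothetical, and the character count assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of 8-bit code units (bytes) before the terminating zero.
   This is the kind of value that length and startoffset expect when
   the 8-bit library is in use. */
size_t utf8_code_units(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Number of characters in a valid UTF-8 string: every byte that is
   not a continuation byte (bit pattern 10xxxxxx) starts a character. */
size_t utf8_characters(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "caf\xc3\xa9" (the word "café" encoded as UTF-8) the first helper yields 5 and the second yields 4, so an offset that addresses the final character must be given as 3 code units, and the full length as 5.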
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
41
3rd/pcre2/doc/html/pcre2_match_context_copy.html
Normal file
@@ -0,0 +1,41 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_match_context_copy specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_match_context_copy man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_match_context *pcre2_match_context_copy(</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function makes a new copy of a match context, using the memory
|
||||
allocation function that was used for the original context. The result is NULL
|
||||
if the memory cannot be obtained.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
42
3rd/pcre2/doc/html/pcre2_match_context_create.html
Normal file
@@ -0,0 +1,42 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_match_context_create specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_match_context_create man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_match_context *pcre2_match_context_create(</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function creates and initializes a new match context. If its argument is
|
||||
NULL, <b>malloc()</b> is used to get the necessary memory; otherwise the memory
|
||||
allocation function within the general context is used. The result is NULL if
|
||||
the memory could not be obtained.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
41
3rd/pcre2/doc/html/pcre2_match_context_free.html
Normal file
@@ -0,0 +1,41 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_match_context_free specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_match_context_free man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>void pcre2_match_context_free(pcre2_match_context *<i>mcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function frees the memory occupied by a match context, using the memory
|
||||
freeing function from the general context with which it was created, or
|
||||
<b>free()</b> if that was not set. If the argument is NULL, the function returns
|
||||
immediately without doing anything.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
50
3rd/pcre2/doc/html/pcre2_match_data_create.html
Normal file
@@ -0,0 +1,50 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_match_data_create specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_match_data_create man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function creates a new match data block, which is used for holding the
|
||||
result of a match. The first argument specifies the number of pairs of offsets
|
||||
that are required. These form the "output vector" (ovector) within the match
|
||||
data block, and are used to identify the matched string and any captured
|
||||
substrings when matching with <b>pcre2_match()</b>, or a number of different
|
||||
matches at the same point when used with <b>pcre2_dfa_match()</b>. There is
|
||||
always one pair of offsets; if <b>ovecsize</b> is zero, it is treated as one.
|
||||
</P>
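The layout of the offset pairs can be pictured with a plain array. The sketch below does not call PCRE2: it uses a hand-filled size_t array in place of a real ovector, and copy_pair is a hypothetical helper. It assumes the usual arrangement in which pair n occupies elements 2n and 2n+1, holding the offset of the first code unit of the substring and the offset just past its end, with pair 0 describing the whole match. It also assumes that every pair is set and that the end offset is not before the start offset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the substring described by pair number `pair` of a mock
   ovector into `out` as a zero-terminated string. Returns the number
   of code units copied, or 0 if `out` is too small. */
size_t copy_pair(const char *subject, const size_t *ovector, size_t pair,
                 char *out, size_t outsize)
{
    size_t start, end, len;

    assert(subject != NULL && ovector != NULL && out != NULL);
    start = ovector[2 * pair];      /* offset of first code unit */
    end = ovector[2 * pair + 1];    /* offset just past the end  */
    assert(end >= start);

    len = end - start;
    if (len + 1 > outsize)
        return 0;
    memcpy(out, subject + start, len);
    out[len] = '\0';
    return len;
}
```

With the subject "key=value" and a mock ovector of {0, 9, 0, 3, 4, 9}, pair 0 covers the whole string, pair 1 yields "key" and pair 2 yields "value". Three pairs are needed here, which corresponds to an ovecsize of 3.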
|
||||
<P>
|
||||
The second argument points to a general context, for custom memory management,
|
||||
or is NULL for system memory management. The result of the function is NULL if
|
||||
the memory for the block could not be obtained.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
53
3rd/pcre2/doc/html/pcre2_match_data_create_from_pattern.html
Normal file
@@ -0,0 +1,53 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_match_data_create_from_pattern specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_match_data_create_from_pattern man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>pcre2_match_data *pcre2_match_data_create_from_pattern(</b>
|
||||
<b> const pcre2_code *<i>code</i>, pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function creates a new match data block for holding the result of a match.
|
||||
The first argument points to a compiled pattern. The number of capturing
|
||||
parentheses within the pattern is used to compute the number of pairs of
|
||||
offsets that are required in the match data block. These form the "output
|
||||
vector" (ovector) within the match data block, and are used to identify the
|
||||
matched string and any captured substrings when matching with
|
||||
<b>pcre2_match()</b>. If you are using <b>pcre2_dfa_match()</b>, which uses the
|
||||
output vector in a different way, you should use <b>pcre2_match_data_create()</b>
|
||||
instead of this function.
|
||||
</P>
|
||||
<P>
|
||||
The second argument points to a general context, for custom memory management,
|
||||
or is NULL to use the same memory allocator as was used for the compiled
|
||||
pattern. The result of the function is NULL if the memory for the block could
|
||||
not be obtained.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
48
3rd/pcre2/doc/html/pcre2_match_data_free.html
Normal file
@@ -0,0 +1,48 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_match_data_free specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_match_data_free man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>void pcre2_match_data_free(pcre2_match_data *<i>match_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
If <i>match_data</i> is NULL, this function does nothing. Otherwise,
|
||||
<i>match_data</i> must point to a match data block, which this function frees,
|
||||
using the memory freeing function from the general context or compiled pattern
|
||||
with which it was created, or <b>free()</b> if that was not set. If the match
|
||||
data block was previously passed to <b>pcre2_match()</b>, it will have an
|
||||
attached heapframe vector; this is also freed.
|
||||
</P>
|
||||
<P>
|
||||
If the PCRE2_COPY_MATCHED_SUBJECT option was used for a successful match using this
|
||||
match data block, the copy of the subject that was referenced within the block
|
||||
is also freed.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
70
3rd/pcre2/doc/html/pcre2_pattern_convert.html
Normal file
@@ -0,0 +1,70 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_pattern_convert specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_pattern_convert man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_pattern_convert(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
|
||||
<b> uint32_t <i>options</i>, PCRE2_UCHAR **<i>buffer</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>blength</i>, pcre2_convert_context *<i>cvcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function is part of an experimental set of pattern conversion functions.
|
||||
It converts a foreign pattern (for example, a glob) into a PCRE2 regular
|
||||
expression pattern. Its arguments are:
|
||||
<pre>
|
||||
  <i>pattern</i>     The foreign pattern
  <i>length</i>      The length of the input pattern or PCRE2_ZERO_TERMINATED
  <i>options</i>     Option bits
  <i>buffer</i>      Pointer to pointer to output buffer, or NULL
  <i>blength</i>     Pointer to output length field
  <i>cvcontext</i>   Pointer to a convert context or NULL
|
||||
</pre>
|
||||
The length of the converted pattern (excluding the terminating zero) is
|
||||
returned via <i>blength</i>. If <i>buffer</i> is NULL, the function just returns
|
||||
the output length. If <i>buffer</i> points to a NULL pointer, heap memory is
|
||||
obtained for the converted pattern, using the allocator in the context if
|
||||
present (or else <b>malloc()</b>), and the field pointed to by <i>buffer</i> is
|
||||
updated. If <i>buffer</i> points to a non-NULL field, that must point to a
|
||||
buffer whose size is in the variable pointed to by <i>blength</i>. This value is
|
||||
updated.
|
||||
</P>
|
||||
<P>
|
||||
The option bits are:
|
||||
<pre>
|
||||
  PCRE2_CONVERT_UTF                     Input is UTF
  PCRE2_CONVERT_NO_UTF_CHECK            Do not check UTF validity
  PCRE2_CONVERT_POSIX_BASIC             Convert POSIX basic pattern
  PCRE2_CONVERT_POSIX_EXTENDED          Convert POSIX extended pattern
  PCRE2_CONVERT_GLOB                    ) Convert
  PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR  ) various types
  PCRE2_CONVERT_GLOB_NO_STARSTAR        ) of glob
|
||||
</pre>
|
||||
The return value from <b>pcre2_pattern_convert()</b> is zero on success or a
|
||||
non-zero PCRE2 error code.
|
||||
</P>
|
||||
<P>
|
||||
The pattern conversion functions are described in the
|
||||
<a href="pcre2convert.html"><b>pcre2convert</b></a>
|
||||
documentation.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
109
3rd/pcre2/doc/html/pcre2_pattern_info.html
Normal file
@@ -0,0 +1,109 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_pattern_info specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_pattern_info man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>,</b>
|
||||
<b> void *<i>where</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function returns information about a compiled pattern. Its arguments are:
|
||||
<pre>
|
||||
  <i>code</i>    Pointer to a compiled regular expression pattern
  <i>what</i>    What information is required
  <i>where</i>   Where to put the information
|
||||
</pre>
|
||||
The recognized values for the <i>what</i> argument, and the information they
|
||||
request are as follows:
|
||||
<pre>
|
||||
  PCRE2_INFO_ALLOPTIONS      Final options after compiling
  PCRE2_INFO_ARGOPTIONS      Options passed to <b>pcre2_compile()</b>
  PCRE2_INFO_BACKREFMAX      Number of highest backreference
  PCRE2_INFO_BSR             What \R matches:
                               PCRE2_BSR_UNICODE: Unicode line endings
                               PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
  PCRE2_INFO_CAPTURECOUNT    Number of capturing subpatterns
  PCRE2_INFO_DEPTHLIMIT      Backtracking depth limit if set, otherwise PCRE2_ERROR_UNSET
  PCRE2_INFO_EXTRAOPTIONS    Extra options that were passed in the
                               compile context
  PCRE2_INFO_FIRSTBITMAP     Bitmap of first code units, or NULL
  PCRE2_INFO_FIRSTCODETYPE   Type of start-of-match information
                               0 nothing set
                               1 first code unit is set
                               2 start of string or after newline
  PCRE2_INFO_FIRSTCODEUNIT   First code unit when type is 1
  PCRE2_INFO_FRAMESIZE       Size of backtracking frame
  PCRE2_INFO_HASBACKSLASHC   Return 1 if pattern contains \C
  PCRE2_INFO_HASCRORLF       Return 1 if explicit CR or LF matches exist in the pattern
  PCRE2_INFO_HEAPLIMIT       Heap memory limit if set, otherwise PCRE2_ERROR_UNSET
  PCRE2_INFO_JCHANGED        Return 1 if (?J) or (?-J) was used
  PCRE2_INFO_JITSIZE         Size of JIT compiled code, or 0
  PCRE2_INFO_LASTCODETYPE    Type of must-be-present information
                               0 nothing set
                               1 code unit is set
  PCRE2_INFO_LASTCODEUNIT    Last code unit when type is 1
  PCRE2_INFO_MATCHEMPTY      1 if the pattern can match an empty string, 0 otherwise
  PCRE2_INFO_MATCHLIMIT      Match limit if set, otherwise PCRE2_ERROR_UNSET
  PCRE2_INFO_MAXLOOKBEHIND   Length (in characters) of the longest lookbehind assertion
  PCRE2_INFO_MINLENGTH       Lower bound length of matching strings
  PCRE2_INFO_NAMECOUNT       Number of named subpatterns
  PCRE2_INFO_NAMEENTRYSIZE   Size of name table entries
  PCRE2_INFO_NAMETABLE       Pointer to name table
  PCRE2_INFO_NEWLINE         Code for the newline sequence:
                               PCRE2_NEWLINE_CR
                               PCRE2_NEWLINE_LF
                               PCRE2_NEWLINE_CRLF
                               PCRE2_NEWLINE_ANY
                               PCRE2_NEWLINE_ANYCRLF
                               PCRE2_NEWLINE_NUL
  PCRE2_INFO_RECURSIONLIMIT  Obsolete synonym for PCRE2_INFO_DEPTHLIMIT
  PCRE2_INFO_SIZE            Size of compiled pattern
|
||||
</pre>
|
||||
If <i>where</i> is NULL, the function returns the amount of memory needed for
|
||||
the requested information, in bytes. Otherwise, the <i>where</i> argument must
|
||||
point to an unsigned 32-bit integer (uint32_t variable), except for the
|
||||
following <i>what</i> values, when it must point to a variable of the type
|
||||
shown:
|
||||
<pre>
|
||||
  PCRE2_INFO_FIRSTBITMAP     const uint8_t *
  PCRE2_INFO_JITSIZE         size_t
  PCRE2_INFO_NAMETABLE       PCRE2_SPTR
  PCRE2_INFO_SIZE            size_t
|
||||
</pre>
|
||||
The yield of the function is zero on success or:
|
||||
<pre>
|
||||
  PCRE2_ERROR_NULL           the argument <i>code</i> is NULL
  PCRE2_ERROR_BADMAGIC       the "magic number" was not found
  PCRE2_ERROR_BADOPTION      the value of <i>what</i> is invalid
  PCRE2_ERROR_BADMODE        the pattern was compiled in the wrong mode
  PCRE2_ERROR_UNSET          the requested information is not set
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
65
3rd/pcre2/doc/html/pcre2_serialize_decode.html
Normal file
@@ -0,0 +1,65 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_serialize_decode specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_serialize_decode man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int32_t pcre2_serialize_decode(pcre2_code **<i>codes</i>,</b>
|
||||
<b> int32_t <i>number_of_codes</i>, const uint8_t *<i>bytes</i>,</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function decodes a serialized set of compiled patterns back into a list of
|
||||
individual patterns. This is possible only on a host that is running the same
|
||||
version of PCRE2, with the same code unit width, and the host must also have
|
||||
the same endianness, pointer width and PCRE2_SIZE type. The arguments for
|
||||
<b>pcre2_serialize_decode()</b> are:
|
||||
<pre>
|
||||
  <i>codes</i>            pointer to a vector in which to build the list
  <i>number_of_codes</i>  number of slots in the vector
  <i>bytes</i>            the serialized byte stream
  <i>gcontext</i>         pointer to a general context or NULL
|
||||
</pre>
|
||||
The <i>bytes</i> argument must point to a block of data that was originally
|
||||
created by <b>pcre2_serialize_encode()</b>, though it may have been saved on
|
||||
disc or elsewhere in the meantime. If there are more codes in the serialized
|
||||
data than slots in the list, only those compiled patterns that will fit are
|
||||
decoded. The yield of the function is the number of decoded patterns, or one of
|
||||
the following negative error codes:
|
||||
<pre>
|
||||
  PCRE2_ERROR_BADDATA   <i>number_of_codes</i> is zero or less
  PCRE2_ERROR_BADMAGIC  mismatch of id bytes in <i>bytes</i>
  PCRE2_ERROR_BADMODE   mismatch of code unit width or PCRE2 version
  PCRE2_ERROR_NOMEMORY  memory allocation failed
  PCRE2_ERROR_NULL      <i>codes</i> or <i>bytes</i> is NULL
|
||||
</pre>
|
||||
PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was compiled
|
||||
on a system with different endianness.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the serialization functions in the
|
||||
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
66
3rd/pcre2/doc/html/pcre2_serialize_encode.html
Normal file
@@ -0,0 +1,66 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_serialize_encode specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_serialize_encode man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int32_t pcre2_serialize_encode(const pcre2_code **<i>codes</i>,</b>
|
||||
<b> int32_t <i>number_of_codes</i>, uint8_t **<i>serialized_bytes</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>serialized_size</i>, pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function encodes a list of compiled patterns into a byte stream that can
|
||||
be saved on disc or elsewhere. Note that this is not an abstract format like
|
||||
Java or .NET. Conversion of the byte stream back into usable compiled patterns
|
||||
can only happen on a host that is running the same version of PCRE2, with the
|
||||
same code unit width, and the host must also have the same endianness, pointer
|
||||
width and PCRE2_SIZE type. The arguments for <b>pcre2_serialize_encode()</b>
|
||||
are:
|
||||
<pre>
|
||||
  <i>codes</i>             pointer to a vector containing the list
  <i>number_of_codes</i>   number of slots in the vector
  <i>serialized_bytes</i>  set to point to the serialized byte stream
  <i>serialized_size</i>   set to the number of bytes in the byte stream
  <i>gcontext</i>          pointer to a general context or NULL
|
||||
</pre>
|
||||
The context argument is used to obtain memory for the byte stream. When the
|
||||
serialized data is no longer needed, it must be freed by calling
|
||||
<b>pcre2_serialize_free()</b>. The yield of the function is the number of
|
||||
serialized patterns, or one of the following negative error codes:
|
||||
<pre>
|
||||
  PCRE2_ERROR_BADDATA      <i>number_of_codes</i> is zero or less
  PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns
  PCRE2_ERROR_NOMEMORY     memory allocation failed
  PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables
  PCRE2_ERROR_NULL         an argument other than <i>gcontext</i> is NULL
|
||||
</pre>
|
||||
PCRE2_ERROR_BADMAGIC means either that a pattern's code has been corrupted, or
|
||||
that a slot in the vector does not point to a compiled pattern.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the serialization functions in the
|
||||
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
41
3rd/pcre2/doc/html/pcre2_serialize_free.html
Normal file
@@ -0,0 +1,41 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_serialize_free specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_serialize_free man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>void pcre2_serialize_free(uint8_t *<i>bytes</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function frees the memory that was obtained by
|
||||
<b>pcre2_serialize_encode()</b> to hold a serialized byte stream. The argument
|
||||
must point to such a byte stream or be NULL, in which case the function returns
|
||||
without doing anything.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the serialization functions in the
|
||||
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
49
3rd/pcre2/doc/html/pcre2_serialize_get_number_of_codes.html
Normal file
@@ -0,0 +1,49 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_serialize_get_number_of_codes specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_serialize_get_number_of_codes man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int32_t pcre2_serialize_get_number_of_codes(const uint8_t *<i>bytes</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
The <i>bytes</i> argument must point to a serialized byte stream that was
|
||||
originally created by <b>pcre2_serialize_encode()</b> (though it may have been
|
||||
saved on disc or elsewhere in the meantime). The function returns the number of
|
||||
serialized patterns in the byte stream, or one of the following negative error
|
||||
codes:
|
||||
<pre>
|
||||
  PCRE2_ERROR_BADMAGIC  mismatch of id bytes in <i>bytes</i>
  PCRE2_ERROR_BADMODE   mismatch of code unit width or PCRE2 version
  PCRE2_ERROR_NULL      the argument is NULL
|
||||
</pre>
|
||||
PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was compiled
|
||||
on a system with different endianness.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the serialization functions in the
|
||||
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
42
3rd/pcre2/doc/html/pcre2_set_bsr.html
Normal file
@@ -0,0 +1,42 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_bsr specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_bsr man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_bsr(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets the convention for processing \R within a compile context.
|
||||
The second argument must be one of PCRE2_BSR_ANYCRLF or PCRE2_BSR_UNICODE. The
|
||||
result is zero for success or PCRE2_ERROR_BADDATA if the second argument is
|
||||
invalid.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
43
3rd/pcre2/doc/html/pcre2_set_callout.html
Normal file
@@ -0,0 +1,43 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_callout specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_callout man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_callout(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> int (*<i>callout_function</i>)(pcre2_callout_block *, void *),</b>
|
||||
<b> void *<i>callout_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets the callout fields in a match context (the first argument).
|
||||
The second argument specifies a callout function, and the third argument is an
|
||||
opaque data item that is passed to it. The result of this function is always
|
||||
zero.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
45
3rd/pcre2/doc/html/pcre2_set_character_tables.html
Normal file
@@ -0,0 +1,45 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_character_tables specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_character_tables man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> const uint8_t *<i>tables</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets a pointer to custom character tables within a compile
|
||||
context. The second argument must point to a set of PCRE2 character tables or
|
||||
be NULL to request the default tables. The result is always zero. Character
|
||||
tables can be created by calling <b>pcre2_maketables()</b> or by running the
|
||||
<b>pcre2_dftables</b> maintenance command in binary mode (see the
|
||||
<a href="pcre2build.html"><b>pcre2build</b></a>
|
||||
documentation).
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
58
3rd/pcre2/doc/html/pcre2_set_compile_extra_options.html
Normal file
@@ -0,0 +1,58 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_compile_extra_options specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_compile_extra_options man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_compile_extra_options(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>extra_options</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets additional option bits for <b>pcre2_compile()</b> that are
|
||||
housed in a compile context. It completely replaces all the bits. The extra
|
||||
options are:
|
||||
<pre>
|
||||
  PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK     Allow \K in lookarounds
  PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes
  PCRE2_EXTRA_ALT_BSUX                 Extended alternate \u, \U, and \x handling
  PCRE2_EXTRA_ASCII_BSD                \d remains ASCII in UCP mode
  PCRE2_EXTRA_ASCII_BSS                \s remains ASCII in UCP mode
  PCRE2_EXTRA_ASCII_BSW                \w remains ASCII in UCP mode
  PCRE2_EXTRA_ASCII_DIGIT              [:digit:] and [:xdigit:] POSIX classes remain ASCII in UCP mode
  PCRE2_EXTRA_ASCII_POSIX              POSIX classes remain ASCII in UCP mode
  PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL    Treat all invalid escapes as a literal following character
  PCRE2_EXTRA_CASELESS_RESTRICT        Disable mixed ASCII/non-ASCII case folding
  PCRE2_EXTRA_ESCAPED_CR_IS_LF         Interpret \r as \n
  PCRE2_EXTRA_MATCH_LINE               Pattern matches whole lines
  PCRE2_EXTRA_MATCH_WORD               Pattern matches "words"
  PCRE2_EXTRA_NEVER_CALLOUT            Disallow callouts in pattern
  PCRE2_EXTRA_NO_BS0                   Disallow \0 (but not \00 or \000)
  PCRE2_EXTRA_PYTHON_OCTAL             Use Python rules for octal
  PCRE2_EXTRA_TURKISH_CASING           Use Turkish I case folding
|
||||
</pre>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
46
3rd/pcre2/doc/html/pcre2_set_compile_recursion_guard.html
Normal file
@@ -0,0 +1,46 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_compile_recursion_guard specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_compile_recursion_guard man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function defines, within a compile context, a function that is called
|
||||
whenever <b>pcre2_compile()</b> starts to compile a parenthesized part of a
|
||||
pattern. The first argument to the function gives the current depth of
|
||||
parenthesis nesting, and the second is user data that is supplied when the
|
||||
function is set up. The guard function should return zero if all is well, or
|
||||
non-zero to force an error. This feature is provided so that applications can
|
||||
check the available system stack space, in order to avoid running out. The
|
||||
result of <b>pcre2_set_compile_recursion_guard()</b> is always zero.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
40
3rd/pcre2/doc/html/pcre2_set_depth_limit.html
Normal file
@@ -0,0 +1,40 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_depth_limit specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_depth_limit man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_depth_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets the backtracking depth limit field in a match context. The
|
||||
result is always zero.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
43
3rd/pcre2/doc/html/pcre2_set_glob_escape.html
Normal file
@@ -0,0 +1,43 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_glob_escape specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_glob_escape man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_glob_escape(pcre2_convert_context *<i>cvcontext</i>,</b>
|
||||
<b> uint32_t <i>escape_char</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function is part of an experimental set of pattern conversion functions.
|
||||
It sets the escape character that is used when converting globs. The second
|
||||
argument must either be zero (meaning there is no escape character) or a
|
||||
punctuation character whose code point is less than 256. The default is grave
|
||||
accent if running under Windows, otherwise backslash. The result of the
|
||||
function is zero for success or PCRE2_ERROR_BADDATA if the second argument is
|
||||
invalid.
|
||||
</P>
|
||||
<P>
|
||||
The pattern conversion functions are described in the
|
||||
<a href="pcre2convert.html"><b>pcre2convert</b></a>
|
||||
documentation.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
42
3rd/pcre2/doc/html/pcre2_set_glob_separator.html
Normal file
@@ -0,0 +1,42 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_glob_separator specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_glob_separator man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include &#60;pcre2.h&#62;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_glob_separator(pcre2_convert_context *<i>cvcontext</i>,</b>
|
||||
<b> uint32_t <i>separator_char</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function is part of an experimental set of pattern conversion functions.
|
||||
It sets the component separator character that is used when converting globs.
|
||||
The second argument must be one of the characters forward slash, backslash, or
|
||||
dot. The default is backslash when running under Windows, otherwise forward
|
||||
slash. The result of the function is zero for success or PCRE2_ERROR_BADDATA if
|
||||
the second argument is invalid.
|
||||
</P>
|
||||
<P>
|
||||
The pattern conversion functions are described in the
|
||||
<a href="pcre2convert.html"><b>pcre2convert</b></a>
|
||||
documentation.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
40
3rd/pcre2/doc/html/pcre2_set_heap_limit.html
Normal file
@@ -0,0 +1,40 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_heap_limit specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_heap_limit man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_heap_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets the backtracking heap limit field in a match context. The
|
||||
result is always zero.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
40
3rd/pcre2/doc/html/pcre2_set_match_limit.html
Normal file
@@ -0,0 +1,40 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_match_limit specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_match_limit man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets the match limit field in a match context. The result is
|
||||
always zero.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
@@ -0,0 +1,44 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_max_pattern_compiled_length specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_max_pattern_compiled_length man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_max_pattern_compiled_length(</b>
|
||||
<b> pcre2_compile_context *<i>ccontext</i>, PCRE2_SIZE <i>value</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets, in a compile context, the maximum size (in bytes) for the
|
||||
memory needed to hold the compiled version of a pattern that is using this
|
||||
context. The result is always zero. If a pattern that is passed to
|
||||
<b>pcre2_compile()</b> referencing this context needs more memory, an error is
|
||||
generated. The default is the largest number that a PCRE2_SIZE variable can
|
||||
hold, which is effectively unlimited.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
43
3rd/pcre2/doc/html/pcre2_set_max_pattern_length.html
Normal file
@@ -0,0 +1,43 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_max_pattern_length specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_max_pattern_length man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> PCRE2_SIZE <i>value</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets, in a compile context, the maximum text length (in code
|
||||
units) of the pattern that can be compiled. The result is always zero. If a
|
||||
longer pattern is passed to <b>pcre2_compile()</b> there is an immediate error
|
||||
return. The default is effectively unlimited, being the largest value a
|
||||
PCRE2_SIZE variable can hold.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
42
3rd/pcre2/doc/html/pcre2_set_max_varlookbehind.html
Normal file
@@ -0,0 +1,42 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_max_varlookbehind specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_max_varlookbehind man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_max_varlookbehind(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets the maximum number of characters that can be matched by a
|
||||
variable-length lookbehind assertion. The default is set when PCRE2 is built,
|
||||
with the ultimate default being 255, the same as Perl. Lookbehind assertions
|
||||
without a bounding length are not supported. The result is always zero.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
51
3rd/pcre2/doc/html/pcre2_set_newline.html
Normal file
@@ -0,0 +1,51 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_newline specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_newline man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_newline(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets the newline convention within a compile context. This
|
||||
specifies which character(s) are recognized as newlines when compiling and
|
||||
matching patterns. The second argument must be one of:
|
||||
<pre>
|
||||
PCRE2_NEWLINE_CR Carriage return only
|
||||
PCRE2_NEWLINE_LF Linefeed only
|
||||
PCRE2_NEWLINE_CRLF CR followed by LF only
|
||||
PCRE2_NEWLINE_ANYCRLF Any of the above
|
||||
PCRE2_NEWLINE_ANY Any Unicode newline sequence
|
||||
PCRE2_NEWLINE_NUL The NUL character (binary zero)
|
||||
</pre>
|
||||
The result is zero for success or PCRE2_ERROR_BADDATA if the second argument is
|
||||
invalid.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
40
3rd/pcre2/doc/html/pcre2_set_offset_limit.html
Normal file
@@ -0,0 +1,40 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_offset_limit specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_offset_limit man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> PCRE2_SIZE <i>value</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets the offset limit field in a match context. The result is
|
||||
always zero.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
57
3rd/pcre2/doc/html/pcre2_set_optimize.html
Normal file
@@ -0,0 +1,57 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_optimize specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_optimize man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_optimize(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>directive</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function controls which performance optimizations will be applied
|
||||
by <b>pcre2_compile()</b>. It can be called multiple times with the same compile
|
||||
context; the effects are cumulative, with the effects of later calls taking
|
||||
precedence over earlier ones.
|
||||
</P>
|
||||
<P>
|
||||
The result is zero for success, PCRE2_ERROR_NULL if <i>ccontext</i> is NULL,
|
||||
or PCRE2_ERROR_BADOPTION if <i>directive</i> is unknown. The latter could be
|
||||
useful to detect if a certain optimization is available.
|
||||
</P>
|
||||
<P>
|
||||
The possible values for the <i>directive</i> parameter are:
|
||||
<pre>
|
||||
PCRE2_OPTIMIZATION_FULL Enable all optimizations (default)
|
||||
PCRE2_OPTIMIZATION_NONE Disable all optimizations
|
||||
PCRE2_AUTO_POSSESS Enable auto-possessification
|
||||
PCRE2_AUTO_POSSESS_OFF Disable auto-possessification
|
||||
PCRE2_DOTSTAR_ANCHOR Enable implicit dotstar anchoring
|
||||
PCRE2_DOTSTAR_ANCHOR_OFF Disable implicit dotstar anchoring
|
||||
PCRE2_START_OPTIMIZE Enable start-up optimizations at match time
|
||||
PCRE2_START_OPTIMIZE_OFF Disable start-up optimizations at match time
|
||||
</pre>
|
||||
There is a complete description of the PCRE2 native API, including detailed
|
||||
descriptions of the <i>directive</i> parameter values, in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
40
3rd/pcre2/doc/html/pcre2_set_parens_nest_limit.html
Normal file
@@ -0,0 +1,40 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_parens_nest_limit specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_parens_nest_limit man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets, in a compile context, the maximum depth of nested
|
||||
parentheses in a pattern. The result is always zero.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
40
3rd/pcre2/doc/html/pcre2_set_recursion_limit.html
Normal file
@@ -0,0 +1,40 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_recursion_limit specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_recursion_limit man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_recursion_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function is obsolete and should not be used in new code. Use
|
||||
<b>pcre2_set_depth_limit()</b> instead.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
@@ -0,0 +1,42 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_recursion_memory_management specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_recursion_memory_management man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_recursion_memory_management(</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> void *(*<i>private_malloc</i>)(size_t, void *),</b>
|
||||
<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
From release 10.30 onwards, this function is obsolete and does nothing. The
|
||||
result is always zero.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
43
3rd/pcre2/doc/html/pcre2_set_substitute_callout.html
Normal file
@@ -0,0 +1,43 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_substitute_callout specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_substitute_callout man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_substitute_callout(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> int (*<i>callout_function</i>)(pcre2_substitute_callout_block *, void *),</b>
|
||||
<b> void *<i>callout_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets the substitute callout fields in a match context (the first
|
||||
argument). The second argument specifies a callout function, and the third
|
||||
argument is an opaque data item that is passed to it. The result of this
|
||||
function is always zero.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
45
3rd/pcre2/doc/html/pcre2_set_substitute_case_callout.html
Normal file
@@ -0,0 +1,45 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_set_substitute_case_callout specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_set_substitute_case_callout man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_set_substitute_case_callout(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> PCRE2_SIZE (*<i>callout_function</i>)(PCRE2_SPTR, PCRE2_SIZE,</b>
|
||||
<b> PCRE2_UCHAR *, PCRE2_SIZE,</b>
|
||||
<b> int, void *),</b>
|
||||
<b> void *<i>callout_data</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function sets the substitute case callout fields in a match context (the
|
||||
first argument). The second argument specifies a callout function, and the third
|
||||
argument is an opaque data item that is passed to it. The result of this
|
||||
function is always zero.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
111
3rd/pcre2/doc/html/pcre2_substitute.html
Normal file
@@ -0,0 +1,111 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_substitute specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_substitute man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
||||
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
||||
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>, PCRE2_SPTR <i>replacement</i>,</b>
|
||||
<b> PCRE2_SIZE <i>rlength</i>, PCRE2_UCHAR *<i>outputbuffer</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>outlengthptr</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function matches a compiled regular expression against a given subject
|
||||
string, using a matching algorithm that is similar to Perl's. It then makes a
|
||||
copy of the subject, substituting a replacement string for what was matched.
|
||||
Its arguments are:
|
||||
<pre>
|
||||
<i>code</i> Points to the compiled pattern
|
||||
<i>subject</i> Points to the subject string
|
||||
<i>length</i> Length of the subject string
|
||||
<i>startoffset</i> Offset in the subject at which to start matching
|
||||
<i>options</i> Option bits
|
||||
<i>match_data</i> Points to a match data block, or is NULL
|
||||
<i>mcontext</i> Points to a match context, or is NULL
|
||||
<i>replacement</i> Points to the replacement string
|
||||
<i>rlength</i> Length of the replacement string
|
||||
<i>outputbuffer</i> Points to the output buffer
|
||||
<i>outlengthptr</i> Points to the length of the output buffer
|
||||
</pre>
|
||||
A match data block is needed only if you want to inspect the data from the
|
||||
final match that is returned in that block or if PCRE2_SUBSTITUTE_MATCHED is
|
||||
set. A match context is needed only if you want to:
|
||||
<pre>
|
||||
Set up a callout function
|
||||
Set a matching offset limit
|
||||
Change the backtracking match limit
|
||||
Change the backtracking depth limit
|
||||
Set custom memory management in the match context
|
||||
</pre>
|
||||
The <i>length</i>, <i>startoffset</i> and <i>rlength</i> values are code units,
|
||||
not characters, as is the contents of the variable pointed at by
|
||||
<i>outlengthptr</i>. This variable must contain the length of the output buffer
|
||||
when the function is called. If the function is successful, the value is
|
||||
changed to the length of the new string, excluding the trailing zero that is
|
||||
automatically added.
|
||||
</P>
|
||||
<P>
|
||||
The subject and replacement lengths can be given as PCRE2_ZERO_TERMINATED for
|
||||
zero-terminated strings. The options are:
|
||||
<pre>
|
||||
PCRE2_ANCHORED Match only at the first position
|
||||
PCRE2_ENDANCHORED Match only at end of subject
|
||||
PCRE2_NOTBOL Subject is not the beginning of a line
|
||||
PCRE2_NOTEOL Subject is not the end of a line
|
||||
PCRE2_NOTEMPTY An empty string is not a valid match
|
||||
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject is not a valid match
|
||||
PCRE2_NO_JIT Do not use JIT matching
|
||||
PCRE2_NO_UTF_CHECK Do not check for UTF validity in the subject or replacement
|
||||
(only relevant if PCRE2_UTF was set at compile time)
|
||||
PCRE2_SUBSTITUTE_EXTENDED Do extended replacement processing
|
||||
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
|
||||
PCRE2_SUBSTITUTE_LITERAL The replacement string is literal
|
||||
PCRE2_SUBSTITUTE_MATCHED Use pre-existing match data for first match
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH If overflow, compute needed length
|
||||
PCRE2_SUBSTITUTE_REPLACEMENT_ONLY Return only replacement string(s)
|
||||
PCRE2_SUBSTITUTE_UNKNOWN_UNSET Treat unknown group as unset
|
||||
PCRE2_SUBSTITUTE_UNSET_EMPTY Simple unset insert = empty string
|
||||
</pre>
|
||||
If PCRE2_SUBSTITUTE_LITERAL is set, PCRE2_SUBSTITUTE_EXTENDED,
|
||||
PCRE2_SUBSTITUTE_UNKNOWN_UNSET, and PCRE2_SUBSTITUTE_UNSET_EMPTY are ignored.
|
||||
</P>
|
||||
<P>
|
||||
If PCRE2_SUBSTITUTE_MATCHED is set, <i>match_data</i> must be non-NULL; its
|
||||
contents must be the result of a call to <b>pcre2_match()</b> using the same
|
||||
pattern and subject.
|
||||
</P>
|
||||
<P>
|
||||
The function returns the number of substitutions, which may be zero if there
|
||||
are no matches. The result may be greater than one only when
|
||||
PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a negative error code
|
||||
is returned.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
58
3rd/pcre2/doc/html/pcre2_substring_copy_byname.html
Normal file
@@ -0,0 +1,58 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_substring_copy_byname specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_substring_copy_byname man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_substring_copy_byname(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>, PCRE2_UCHAR *<i>buffer</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This is a convenience function for extracting a captured substring, identified
|
||||
by name, into a given buffer. The arguments are:
|
||||
<pre>
|
||||
<i>match_data</i> The match data block for the match
|
||||
<i>name</i> Name of the required substring
|
||||
<i>buffer</i> Buffer to receive the string
|
||||
<i>bufflen</i> Length of buffer (code units)
|
||||
</pre>
|
||||
The <i>bufflen</i> variable is updated to contain the length of the extracted
|
||||
string, excluding the trailing zero. The yield of the function is zero for
|
||||
success or one of the following error numbers:
|
||||
<pre>
|
||||
PCRE2_ERROR_NOSUBSTRING there are no groups of that name
|
||||
PCRE2_ERROR_UNAVAILABLE the ovector was too small for that group
|
||||
PCRE2_ERROR_UNSET the group did not participate in the match
|
||||
PCRE2_ERROR_NOMEMORY the buffer is not big enough
|
||||
</pre>
|
||||
If there is more than one group with the given name, the first one that is set
|
||||
is returned. In this situation PCRE2_ERROR_UNSET means that no group with the
|
||||
given name was set.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
57
3rd/pcre2/doc/html/pcre2_substring_copy_bynumber.html
Normal file
@@ -0,0 +1,57 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_substring_copy_bynumber specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_substring_copy_bynumber man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_substring_copy_bynumber(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> uint32_t <i>number</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>bufflen</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This is a convenience function for extracting a captured substring into a given
|
||||
buffer. The arguments are:
|
||||
<pre>
|
||||
<i>match_data</i> The match data block for the match
|
||||
<i>number</i> Number of the required substring
|
||||
<i>buffer</i> Buffer to receive the string
|
||||
<i>bufflen</i> Length of buffer
|
||||
</pre>
|
||||
On entry, <i>bufflen</i> must contain the size of the buffer in code units. On success it is updated with the length of the extracted string,
|
||||
excluding the terminating zero. The yield of the function is zero for success
|
||||
or one of the following error numbers:
|
||||
<pre>
|
||||
PCRE2_ERROR_NOSUBSTRING there are no groups of that number
|
||||
PCRE2_ERROR_UNAVAILABLE the ovector was too small for that group
|
||||
PCRE2_ERROR_UNSET the group did not participate in the match
|
||||
PCRE2_ERROR_NOMEMORY the buffer is too small
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
41
3rd/pcre2/doc/html/pcre2_substring_free.html
Normal file
@@ -0,0 +1,41 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_substring_free specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_substring_free man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This is a convenience function for freeing the memory obtained by a previous
|
||||
call to <b>pcre2_substring_get_byname()</b> or
|
||||
<b>pcre2_substring_get_bynumber()</b>. Its only argument is a pointer to the
|
||||
string. If the argument is NULL, the function does nothing.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
60
3rd/pcre2/doc/html/pcre2_substring_get_byname.html
Normal file
@@ -0,0 +1,60 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_substring_get_byname specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_substring_get_byname man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_substring_get_byname(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>, PCRE2_UCHAR **<i>bufferptr</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This is a convenience function for extracting a captured substring by name into
|
||||
newly acquired memory. The arguments are:
|
||||
<pre>
|
||||
<i>match_data</i> The match data for the match
|
||||
<i>name</i> Name of the required substring
|
||||
<i>bufferptr</i> Where to put the string pointer
|
||||
<i>bufflen</i> Where to put the string length
|
||||
</pre>
|
||||
The memory in which the substring is placed is obtained by calling the same
|
||||
memory allocation function that was used for the match data block. The
|
||||
convenience function <b>pcre2_substring_free()</b> can be used to free it when
|
||||
it is no longer needed. The yield of the function is zero for success or one of
|
||||
the following error numbers:
|
||||
<pre>
|
||||
PCRE2_ERROR_NOSUBSTRING there are no groups of that name
|
||||
PCRE2_ERROR_UNAVAILABLE the ovector was too small for that group
|
||||
PCRE2_ERROR_UNSET the group did not participate in the match
|
||||
PCRE2_ERROR_NOMEMORY memory could not be obtained
|
||||
</pre>
|
||||
If there is more than one group with the given name, the first one that is set
|
||||
is returned. In this situation PCRE2_ERROR_UNSET means that no group with the
|
||||
given name was set.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
58
3rd/pcre2/doc/html/pcre2_substring_get_bynumber.html
Normal file
@@ -0,0 +1,58 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_substring_get_bynumber specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_substring_get_bynumber man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_substring_get_bynumber(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> uint32_t <i>number</i>, PCRE2_UCHAR **<i>bufferptr</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This is a convenience function for extracting a captured substring by number
|
||||
into newly acquired memory. The arguments are:
|
||||
<pre>
|
||||
<i>match_data</i> The match data for the match
|
||||
<i>number</i> Number of the required substring
|
||||
<i>bufferptr</i> Where to put the string pointer
|
||||
<i>bufflen</i> Where to put the string length
|
||||
</pre>
|
||||
The memory in which the substring is placed is obtained by calling the same
|
||||
memory allocation function that was used for the match data block. The
|
||||
convenience function <b>pcre2_substring_free()</b> can be used to free it when
|
||||
it is no longer needed. The yield of the function is zero for success or one of
|
||||
the following error numbers:
|
||||
<pre>
|
||||
PCRE2_ERROR_NOSUBSTRING there are no groups of that number
|
||||
PCRE2_ERROR_UNAVAILABLE the ovector was too small for that group
|
||||
PCRE2_ERROR_UNSET the group did not participate in the match
|
||||
PCRE2_ERROR_NOMEMORY memory could not be obtained
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
46
3rd/pcre2/doc/html/pcre2_substring_length_byname.html
Normal file
@@ -0,0 +1,46 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_substring_length_byname specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_substring_length_byname man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_substring_length_byname(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>, PCRE2_SIZE *<i>length</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function returns the length of a matched substring, identified by name.
|
||||
The arguments are:
|
||||
<pre>
|
||||
<i>match_data</i> The match data block for the match
|
||||
<i>name</i> The substring name
|
||||
<i>length</i> Where to return the length
|
||||
</pre>
|
||||
The yield is zero on success, or a negative error code if the substring is not found.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
48
3rd/pcre2/doc/html/pcre2_substring_length_bynumber.html
Normal file
@@ -0,0 +1,48 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_substring_length_bynumber specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_substring_length_bynumber man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> uint32_t <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This function returns the length of a matched substring, identified by number.
|
||||
The arguments are:
|
||||
<pre>
|
||||
<i>match_data</i> The match data block for the match
|
||||
<i>number</i> The substring number
|
||||
<i>length</i> Where to return the length, or NULL
|
||||
</pre>
|
||||
The third argument may be NULL if all you want to know is whether or not a
|
||||
substring is set. The yield is zero on success, or a negative error code
|
||||
otherwise. After a partial match, only substring 0 is available.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
41
3rd/pcre2/doc/html/pcre2_substring_list_free.html
Normal file
@@ -0,0 +1,41 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_substring_list_free specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_substring_list_free man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>void pcre2_substring_list_free(PCRE2_UCHAR **<i>list</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This is a convenience function for freeing the memory obtained by a previous
|
||||
call to <b>pcre2_substring_list_get()</b>. Its only argument is a pointer to
|
||||
the list of string pointers. If the argument is NULL, the function returns
|
||||
immediately, without doing anything.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
56
3rd/pcre2/doc/html/pcre2_substring_list_get.html
Normal file
@@ -0,0 +1,56 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_substring_list_get specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_substring_list_get man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This is a convenience function for extracting all the captured substrings after
|
||||
a pattern match. It builds a list of pointers to the strings, and (optionally)
|
||||
a second list that contains their lengths (in code units), excluding a
|
||||
terminating zero that is added to each of them. All this is done in a single
|
||||
block of memory that is obtained using the same memory allocation function that
|
||||
was used to get the match data block. The convenience function
|
||||
<b>pcre2_substring_list_free()</b> can be used to free it when it is no longer
|
||||
needed. The arguments are:
|
||||
<pre>
|
||||
<i>match_data</i> The match data block
|
||||
<i>listptr</i> Where to put a pointer to the list
|
||||
<i>lengthsptr</i> Where to put a pointer to the lengths, or NULL
|
||||
</pre>
|
||||
A pointer to a list of pointers is put in the variable whose address is in
|
||||
<i>listptr</i>. The list is terminated by a NULL pointer. If <i>lengthsptr</i> is
|
||||
not NULL, a matching list of lengths is created, and its address is placed in
|
||||
<i>lengthsptr</i>. The yield of the function is zero on success or
|
||||
PCRE2_ERROR_NOMEMORY if sufficient memory could not be obtained.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
53
3rd/pcre2/doc/html/pcre2_substring_nametable_scan.html
Normal file
@@ -0,0 +1,53 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_substring_nametable_scan specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_substring_nametable_scan man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This convenience function finds, for a compiled pattern, the first and last
|
||||
entries for a given name in the table that translates capture group names into
|
||||
numbers.
|
||||
<pre>
|
||||
<i>code</i> Compiled regular expression
|
||||
<i>name</i> Name whose entries are required
|
||||
<i>first</i> Where to return a pointer to the first entry
|
||||
<i>last</i> Where to return a pointer to the last entry
|
||||
</pre>
|
||||
When the name is found in the table, if <i>first</i> is NULL, the function
|
||||
returns a group number, but if there is more than one matching entry, it is not
|
||||
defined which one. Otherwise, when both pointers have been set, the yield of
|
||||
the function is the length of each entry in code units. If the name is not
|
||||
found, PCRE2_ERROR_NOSUBSTRING is returned.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API, including the format of
|
||||
the table entries, in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page, and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
50
3rd/pcre2/doc/html/pcre2_substring_number_from_name.html
Normal file
@@ -0,0 +1,50 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2_substring_number_from_name specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2_substring_number_from_name man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SYNOPSIS
|
||||
</b><br>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>);</b>
|
||||
</P>
|
||||
<br><b>
|
||||
DESCRIPTION
|
||||
</b><br>
|
||||
<P>
|
||||
This convenience function finds the number of a named capture group in a
|
||||
compiled pattern, provided that the name is unique. The
|
||||
function arguments are:
|
||||
<pre>
|
||||
<i>code</i> Compiled regular expression
|
||||
<i>name</i> Name whose number is required
|
||||
</pre>
|
||||
The yield of the function is the number of the group if the name is
|
||||
found, or PCRE2_ERROR_NOSUBSTRING if it is not found. When duplicate names are
|
||||
allowed (PCRE2_DUPNAMES is set), if the name is not unique,
|
||||
PCRE2_ERROR_NOUNIQUESUBSTRING is returned. You can obtain the list of numbers
|
||||
with the same name by calling <b>pcre2_substring_nametable_scan()</b>.
|
||||
</P>
|
||||
<P>
|
||||
There is a complete description of the PCRE2 native API in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page and a description of the POSIX API in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
page.
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
4496
3rd/pcre2/doc/html/pcre2api.html
Normal file
@@ -0,0 +1,4496 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2api specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2api man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">PCRE2 NATIVE API BASIC FUNCTIONS</a>
|
||||
<li><a name="TOC2" href="#SEC2">PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS</a>
|
||||
<li><a name="TOC3" href="#SEC3">PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS</a>
|
||||
<li><a name="TOC4" href="#SEC4">PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS</a>
|
||||
<li><a name="TOC5" href="#SEC5">PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS</a>
|
||||
<li><a name="TOC6" href="#SEC6">PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS</a>
|
||||
<li><a name="TOC7" href="#SEC7">PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION</a>
|
||||
<li><a name="TOC8" href="#SEC8">PCRE2 NATIVE API JIT FUNCTIONS</a>
|
||||
<li><a name="TOC9" href="#SEC9">PCRE2 NATIVE API SERIALIZATION FUNCTIONS</a>
|
||||
<li><a name="TOC10" href="#SEC10">PCRE2 NATIVE API AUXILIARY FUNCTIONS</a>
|
||||
<li><a name="TOC11" href="#SEC11">PCRE2 NATIVE API OBSOLETE FUNCTIONS</a>
|
||||
<li><a name="TOC12" href="#SEC12">PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS</a>
|
||||
<li><a name="TOC13" href="#SEC13">PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a>
|
||||
<li><a name="TOC14" href="#SEC14">PCRE2 API OVERVIEW</a>
|
||||
<li><a name="TOC15" href="#SEC15">STRING LENGTHS AND OFFSETS</a>
|
||||
<li><a name="TOC16" href="#SEC16">NEWLINES</a>
|
||||
<li><a name="TOC17" href="#SEC17">MULTITHREADING</a>
|
||||
<li><a name="TOC18" href="#SEC18">PCRE2 CONTEXTS</a>
|
||||
<li><a name="TOC19" href="#SEC19">CHECKING BUILD-TIME OPTIONS</a>
|
||||
<li><a name="TOC20" href="#SEC20">COMPILING A PATTERN</a>
|
||||
<li><a name="TOC21" href="#SEC21">JUST-IN-TIME (JIT) COMPILATION</a>
|
||||
<li><a name="TOC22" href="#SEC22">LOCALE SUPPORT</a>
|
||||
<li><a name="TOC23" href="#SEC23">INFORMATION ABOUT A COMPILED PATTERN</a>
|
||||
<li><a name="TOC24" href="#SEC24">INFORMATION ABOUT A PATTERN'S CALLOUTS</a>
|
||||
<li><a name="TOC25" href="#SEC25">SERIALIZATION AND PRECOMPILING</a>
|
||||
<li><a name="TOC26" href="#SEC26">THE MATCH DATA BLOCK</a>
|
||||
<li><a name="TOC27" href="#SEC27">MEMORY USE FOR MATCH DATA BLOCKS</a>
|
||||
<li><a name="TOC28" href="#SEC28">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a>
|
||||
<li><a name="TOC29" href="#SEC29">NEWLINE HANDLING WHEN MATCHING</a>
|
||||
<li><a name="TOC30" href="#SEC30">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a>
|
||||
<li><a name="TOC31" href="#SEC31">OTHER INFORMATION ABOUT A MATCH</a>
|
||||
<li><a name="TOC32" href="#SEC32">ERROR RETURNS FROM <b>pcre2_match()</b></a>
|
||||
<li><a name="TOC33" href="#SEC33">OBTAINING A TEXTUAL ERROR MESSAGE</a>
|
||||
<li><a name="TOC34" href="#SEC34">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a>
|
||||
<li><a name="TOC35" href="#SEC35">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a>
|
||||
<li><a name="TOC36" href="#SEC36">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a>
|
||||
<li><a name="TOC37" href="#SEC37">CREATING A NEW STRING WITH SUBSTITUTIONS</a>
|
||||
<li><a name="TOC38" href="#SEC38">DUPLICATE CAPTURE GROUP NAMES</a>
|
||||
<li><a name="TOC39" href="#SEC39">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a>
|
||||
<li><a name="TOC40" href="#SEC40">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a>
|
||||
<li><a name="TOC41" href="#SEC41">SEE ALSO</a>
|
||||
<li><a name="TOC42" href="#SEC42">AUTHOR</a>
|
||||
<li><a name="TOC43" href="#SEC43">REVISION</a>
|
||||
</ul>
|
||||
<P>
|
||||
<b>#include <pcre2.h></b>
|
||||
<br>
|
||||
<br>
|
||||
PCRE2 is a new API for PCRE, starting at release 10.0. This document contains a
|
||||
description of all its native functions. See the
|
||||
<a href="pcre2.html"><b>pcre2</b></a>
|
||||
document for an overview of all the PCRE2 documentation.
|
||||
</P>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE2 NATIVE API BASIC FUNCTIONS</a><br>
|
||||
<P>
|
||||
<b>pcre2_code *pcre2_compile(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
|
||||
<b> uint32_t <i>options</i>, int *<i>errorcode</i>, PCRE2_SIZE *<i>erroroffset</i>,</b>
|
||||
<b> pcre2_compile_context *<i>ccontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_code_free(pcre2_code *<i>code</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_match_data *pcre2_match_data_create_from_pattern(</b>
|
||||
<b> const pcre2_code *<i>code</i>, pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
||||
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
||||
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
||||
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
||||
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> int *<i>workspace</i>, PCRE2_SIZE <i>wscount</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_match_data_free(pcre2_match_data *<i>match_data</i>);</b>
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS</a><br>
|
||||
<P>
|
||||
<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *<i>match_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>PCRE2_SIZE pcre2_get_match_data_heapframes_size(</b>
|
||||
<b> pcre2_match_data *<i>match_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>uint32_t pcre2_get_ovector_count(pcre2_match_data *<i>match_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *<i>match_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS</a><br>
|
||||
<P>
|
||||
<b>pcre2_general_context *pcre2_general_context_create(</b>
|
||||
<b> void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
|
||||
<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_general_context *pcre2_general_context_copy(</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_general_context_free(pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS</a><br>
|
||||
<P>
|
||||
<b>pcre2_compile_context *pcre2_compile_context_create(</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_compile_context *pcre2_compile_context_copy(</b>
|
||||
<b> pcre2_compile_context *<i>ccontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_compile_context_free(pcre2_compile_context *<i>ccontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_bsr(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> const uint8_t *<i>tables</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_compile_extra_options(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>extra_options</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> PCRE2_SIZE <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_max_pattern_compiled_length(</b>
|
||||
<b> pcre2_compile_context *<i>ccontext</i>, PCRE2_SIZE <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_max_varlookbehind(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_newline(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_optimize(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>directive</i>);</b>
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS</a><br>
|
||||
<P>
|
||||
<b>pcre2_match_context *pcre2_match_context_create(</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_match_context *pcre2_match_context_copy(</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_match_context_free(pcre2_match_context *<i>mcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_callout(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> int (*<i>callout_function</i>)(pcre2_callout_block *, void *),</b>
|
||||
<b> void *<i>callout_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_substitute_callout(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> int (*<i>callout_function</i>)(pcre2_substitute_callout_block *, void *),</b>
|
||||
<b> void *<i>callout_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_substitute_case_callout(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> PCRE2_SIZE (*<i>callout_function</i>)(PCRE2_SPTR, PCRE2_SIZE,</b>
|
||||
<b> PCRE2_UCHAR *, PCRE2_SIZE,</b>
|
||||
<b> int, void *),</b>
|
||||
<b> void *<i>callout_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> PCRE2_SIZE <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_heap_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_depth_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS</a><br>
|
||||
<P>
|
||||
<b>int pcre2_substring_copy_byname(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>, PCRE2_UCHAR *<i>buffer</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_substring_copy_bynumber(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> uint32_t <i>number</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>bufflen</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_substring_get_byname(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>, PCRE2_UCHAR **<i>bufferptr</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_substring_get_bynumber(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> uint32_t <i>number</i>, PCRE2_UCHAR **<i>bufferptr</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>bufflen</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_substring_length_byname(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>, PCRE2_SIZE *<i>length</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> uint32_t <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_substring_list_free(PCRE2_UCHAR **<i>list</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION</a><br>
|
||||
<P>
|
||||
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
||||
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
||||
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>, PCRE2_SPTR <i>replacement</i>,</b>
|
||||
<b> PCRE2_SIZE <i>rlength</i>, PCRE2_UCHAR *<i>outputbuffer</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>outlengthptr</i>);</b>
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">PCRE2 NATIVE API JIT FUNCTIONS</a><br>
|
||||
<P>
|
||||
<b>int pcre2_jit_compile(pcre2_code *<i>code</i>, uint32_t <i>options</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_jit_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
||||
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
||||
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_jit_stack *pcre2_jit_stack_create(size_t <i>startsize</i>,</b>
|
||||
<b> size_t <i>maxsize</i>, pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_jit_stack_assign(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> pcre2_jit_callback <i>callback_function</i>, void *<i>callback_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_jit_stack_free(pcre2_jit_stack *<i>jit_stack</i>);</b>
|
||||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">PCRE2 NATIVE API SERIALIZATION FUNCTIONS</a><br>
|
||||
<P>
|
||||
<b>int32_t pcre2_serialize_decode(pcre2_code **<i>codes</i>,</b>
|
||||
<b> int32_t <i>number_of_codes</i>, const uint8_t *<i>bytes</i>,</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int32_t pcre2_serialize_encode(const pcre2_code **<i>codes</i>,</b>
|
||||
<b> int32_t <i>number_of_codes</i>, uint8_t **<i>serialized_bytes</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>serialized_size</i>, pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_serialize_free(uint8_t *<i>bytes</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int32_t pcre2_serialize_get_number_of_codes(const uint8_t *<i>bytes</i>);</b>
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">PCRE2 NATIVE API AUXILIARY FUNCTIONS</a><br>
|
||||
<P>
|
||||
<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *<i>code</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_get_error_message(int <i>errorcode</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
|
||||
<b> PCRE2_SIZE <i>bufflen</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>const uint8_t *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_maketables_free(pcre2_general_context *<i>gcontext</i>,</b>
|
||||
<b> const uint8_t *<i>tables</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>,</b>
|
||||
<b> void *<i>where</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
|
||||
<b> int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
|
||||
<b> void *<i>user_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_config(uint32_t <i>what</i>, void *<i>where</i>);</b>
|
||||
</P>
|
||||
<br><a name="SEC11" href="#TOC1">PCRE2 NATIVE API OBSOLETE FUNCTIONS</a><br>
|
||||
<P>
|
||||
<b>int pcre2_set_recursion_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_recursion_memory_management(</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> void *(*<i>private_malloc</i>)(size_t, void *),</b>
|
||||
<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
These functions became obsolete at release 10.30 and are retained only for
|
||||
backward compatibility. They should not be used in new code. The first is
|
||||
replaced by <b>pcre2_set_depth_limit()</b>; the second is no longer needed and
|
||||
has no effect (it always returns zero).
|
||||
</P>
|
||||
<br><a name="SEC12" href="#TOC1">PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS</a><br>
|
||||
<P>
|
||||
<b>pcre2_convert_context *pcre2_convert_context_create(</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_convert_context *pcre2_convert_context_copy(</b>
|
||||
<b> pcre2_convert_context *<i>cvcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_convert_context_free(pcre2_convert_context *<i>cvcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_glob_escape(pcre2_convert_context *<i>cvcontext</i>,</b>
|
||||
<b> uint32_t <i>escape_char</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_glob_separator(pcre2_convert_context *<i>cvcontext</i>,</b>
|
||||
<b> uint32_t <i>separator_char</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_pattern_convert(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
|
||||
<b> uint32_t <i>options</i>, PCRE2_UCHAR **<i>buffer</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>blength</i>, pcre2_convert_context *<i>cvcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_converted_pattern_free(PCRE2_UCHAR *<i>converted_pattern</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
These functions provide a way of converting non-PCRE2 patterns into
|
||||
patterns that can be processed by <b>pcre2_compile()</b>. This facility is
|
||||
experimental and may be changed in future releases. At present, "globs" and
|
||||
POSIX basic and extended patterns can be converted. Details are given in the
|
||||
<a href="pcre2convert.html"><b>pcre2convert</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<br><a name="SEC13" href="#TOC1">PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a><br>
|
||||
<P>
|
||||
There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit code
|
||||
units, respectively. However, there is just one header file, <b>pcre2.h</b>.
|
||||
This contains the function prototypes and other definitions for all three
|
||||
libraries. One, two, or all three can be installed simultaneously. On Unix-like
|
||||
systems the libraries are called <b>libpcre2-8</b>, <b>libpcre2-16</b>, and
|
||||
<b>libpcre2-32</b>, and they can also co-exist with the original PCRE libraries.
|
||||
Every PCRE2 function comes in three different forms, one for each library, for
|
||||
example:
|
||||
<pre>
|
||||
<b>pcre2_compile_8()</b>
|
||||
<b>pcre2_compile_16()</b>
|
||||
<b>pcre2_compile_32()</b>
|
||||
</pre>
|
||||
There are also three different sets of data types:
|
||||
<pre>
|
||||
<b>PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32</b>
|
||||
<b>PCRE2_SPTR8, PCRE2_SPTR16, PCRE2_SPTR32</b>
|
||||
</pre>
|
||||
The UCHAR types define unsigned code units of the appropriate widths.
|
||||
For example, PCRE2_UCHAR16 is usually defined as <i>uint16_t</i>.
|
||||
The SPTR types are pointers to constants of the equivalent UCHAR types,
|
||||
that is, they are pointers to vectors of unsigned code units.
|
||||
</P>
|
||||
<P>
|
||||
Character strings are passed to a PCRE2 library as sequences of unsigned
|
||||
integers in code units of the appropriate width. The length of a string may
|
||||
be given as a number of code units, or the string may be specified as
|
||||
zero-terminated.
|
||||
</P>
|
||||
<P>
|
||||
Many applications use only one code unit width. For their convenience, macros
|
||||
are defined whose names are the generic forms such as <b>pcre2_compile()</b> and
|
||||
PCRE2_SPTR. These macros use the value of the macro PCRE2_CODE_UNIT_WIDTH to
|
||||
generate the appropriate width-specific function and macro names.
|
||||
PCRE2_CODE_UNIT_WIDTH is not defined by default. An application must define it
|
||||
to be 8, 16, or 32 before including <b>pcre2.h</b> in order to make use of the
|
||||
generic names.
|
||||
</P>
|
||||
<P>
|
||||
Applications that use more than one code unit width can be linked with more
|
||||
than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to be 0 before
|
||||
including <b>pcre2.h</b>, and then use the real function names. Any code that is
|
||||
to be included in an environment where the value of PCRE2_CODE_UNIT_WIDTH is
|
||||
unknown should also use the real function names. (Unfortunately, it is not
|
||||
possible in C code to save and restore the value of a macro.)
|
||||
</P>
|
||||
<P>
|
||||
If PCRE2_CODE_UNIT_WIDTH is not defined before including <b>pcre2.h</b>, a
|
||||
compiler error occurs.
|
||||
</P>
|
||||
<P>
|
||||
When using multiple libraries in an application, you must take care when
|
||||
processing any particular pattern to use only functions from a single library.
|
||||
For example, if you want to run a match using a pattern that was compiled with
|
||||
<b>pcre2_compile_16()</b>, you must do so with <b>pcre2_match_16()</b>, not
|
||||
<b>pcre2_match_8()</b> or <b>pcre2_match_32()</b>.
|
||||
</P>
|
||||
<P>
|
||||
In the function summaries above, and in the rest of this document and other
|
||||
PCRE2 documents, functions and data types are described using their generic
|
||||
names, without the _8, _16, or _32 suffix.
|
||||
</P>
|
||||
<br><a name="SEC14" href="#TOC1">PCRE2 API OVERVIEW</a><br>
|
||||
<P>
|
||||
PCRE2 has its own native API, which is described in this document. There are
|
||||
also some wrapper functions for the 8-bit library that correspond to the
|
||||
POSIX regular expression API, but they do not give access to all the
|
||||
functionality of PCRE2 and they are not thread-safe. They are described in the
|
||||
<a href="pcre2posix.html"><b>pcre2posix</b></a>
|
||||
documentation. Both these APIs define a set of C function calls.
|
||||
</P>
|
||||
<P>
|
||||
The native API C data types, function prototypes, option values, and error
|
||||
codes are defined in the header file <b>pcre2.h</b>, which also contains
|
||||
definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release numbers
|
||||
for the library. Applications can use these to include support for different
|
||||
releases of PCRE2.
|
||||
</P>
|
||||
<P>
|
||||
In a Windows environment, if you want to statically link an application program
|
||||
against a non-dll PCRE2 library, you must define PCRE2_STATIC before including
|
||||
<b>pcre2.h</b>.
|
||||
</P>
|
||||
<P>
|
||||
The functions <b>pcre2_compile()</b> and <b>pcre2_match()</b> are used for
|
||||
compiling and matching regular expressions in a Perl-compatible manner. A
|
||||
sample program that demonstrates the simplest way of using them is provided in
|
||||
the file called <i>pcre2demo.c</i> in the PCRE2 source distribution. A listing
|
||||
of this program is given in the
|
||||
<a href="pcre2demo.html"><b>pcre2demo</b></a>
|
||||
documentation, and the
|
||||
<a href="pcre2sample.html"><b>pcre2sample</b></a>
|
||||
documentation describes how to compile and run it.
|
||||
</P>
|
||||
<P>
|
||||
The compiling and matching functions recognize various options that are passed
|
||||
as bits in an options argument. There are also some more complicated parameters
|
||||
such as custom memory management functions and resource limits that are passed
|
||||
in "contexts" (which are just memory blocks, described below). Simple
|
||||
applications do not need to make use of contexts.
|
||||
</P>
|
||||
<P>
|
||||
Just-in-time (JIT) compiler support is an optional feature of PCRE2 that can be
|
||||
built in appropriate hardware environments. It greatly speeds up the matching
|
||||
performance of many patterns. Programs can request that it be used if
|
||||
available by calling <b>pcre2_jit_compile()</b> after a pattern has been
|
||||
successfully compiled by <b>pcre2_compile()</b>. This does nothing if JIT
|
||||
support is not available.
|
||||
</P>
|
||||
<P>
|
||||
More complicated programs might need to make use of the specialist functions
|
||||
<b>pcre2_jit_stack_create()</b>, <b>pcre2_jit_stack_free()</b>, and
|
||||
<b>pcre2_jit_stack_assign()</b> in order to control the JIT code's memory usage.
|
||||
</P>
|
||||
<P>
|
||||
JIT matching is automatically used by <b>pcre2_match()</b> if it is available,
|
||||
unless the PCRE2_NO_JIT option is set. There is also a direct interface for JIT
|
||||
matching, which gives improved performance at the expense of less sanity
|
||||
checking. The JIT-specific functions are discussed in the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
A second matching function, <b>pcre2_dfa_match()</b>, which is not
|
||||
Perl-compatible, is also provided. This uses a different algorithm for the
|
||||
matching. The alternative algorithm finds all possible matches (at a given
|
||||
point in the subject), and scans the subject just once (unless there are
|
||||
lookaround assertions). However, this algorithm does not return captured
|
||||
substrings. A description of the two matching algorithms and their advantages
|
||||
and disadvantages is given in the
|
||||
<a href="pcre2matching.html"><b>pcre2matching</b></a>
|
||||
documentation. There is no JIT support for <b>pcre2_dfa_match()</b>.
|
||||
</P>
|
||||
<P>
|
||||
In addition to the main compiling and matching functions, there are convenience
|
||||
functions for extracting captured substrings from a subject string that has
|
||||
been matched by <b>pcre2_match()</b>. They are:
|
||||
<pre>
|
||||
<b>pcre2_substring_copy_byname()</b>
|
||||
<b>pcre2_substring_copy_bynumber()</b>
|
||||
<b>pcre2_substring_get_byname()</b>
|
||||
<b>pcre2_substring_get_bynumber()</b>
|
||||
<b>pcre2_substring_list_get()</b>
|
||||
<b>pcre2_substring_length_byname()</b>
|
||||
<b>pcre2_substring_length_bynumber()</b>
|
||||
<b>pcre2_substring_nametable_scan()</b>
|
||||
<b>pcre2_substring_number_from_name()</b>
|
||||
</pre>
|
||||
<b>pcre2_substring_free()</b> and <b>pcre2_substring_list_free()</b> are also
|
||||
provided, to free memory used for extracted strings. If either of these
|
||||
functions is called with a NULL argument, the function returns immediately
|
||||
without doing anything.
|
||||
</P>
|
||||
<P>
|
||||
The function <b>pcre2_substitute()</b> can be called to match a pattern and
|
||||
return a copy of the subject string with substitutions for parts that were
|
||||
matched.
|
||||
</P>
|
||||
<P>
|
||||
Functions whose names begin with <b>pcre2_serialize_</b> are used for saving
|
||||
compiled patterns on disc or elsewhere, and reloading them later.
|
||||
</P>
|
||||
<P>
|
||||
Finally, there are functions for finding out information about a compiled
|
||||
pattern (<b>pcre2_pattern_info()</b>) and about the configuration with which
|
||||
PCRE2 was built (<b>pcre2_config()</b>).
|
||||
</P>
|
||||
<P>
|
||||
Functions with names ending with <b>_free()</b> are used for freeing memory
|
||||
blocks of various sorts. In all cases, if one of these functions is called with
|
||||
a NULL argument, it does nothing.
|
||||
</P>
|
||||
<br><a name="SEC15" href="#TOC1">STRING LENGTHS AND OFFSETS</a><br>
|
||||
<P>
|
||||
The PCRE2 API uses string lengths and offsets into strings of code units in
|
||||
several places. These values are always of type PCRE2_SIZE, which is an
|
||||
unsigned integer type, currently always defined as <i>size_t</i>. The largest
|
||||
value that can be stored in such a type (that is ~(PCRE2_SIZE)0) is reserved
|
||||
as a special indicator for zero-terminated strings and unset offsets.
|
||||
Therefore, the longest string that can be handled is one less than this
|
||||
maximum. Note that string lengths are always given in code units. Only in the
|
||||
8-bit library is such a length the same as the number of bytes in the string.
|
||||
<a name="newlines"></a></P>
|
||||
<br><a name="SEC16" href="#TOC1">NEWLINES</a><br>
|
||||
<P>
|
||||
PCRE2 supports five different conventions for indicating line breaks in
|
||||
strings: a single CR (carriage return) character, a single LF (linefeed)
|
||||
character, the two-character sequence CRLF, any of the three preceding, or any
|
||||
Unicode newline sequence. The Unicode newline sequences are the three just
|
||||
mentioned, plus the single characters VT (vertical tab, U+000B), FF (form feed,
|
||||
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
|
||||
(paragraph separator, U+2029).
|
||||
</P>
|
||||
<P>
|
||||
Each of the first three conventions is used by at least one operating system as
|
||||
its standard newline sequence. When PCRE2 is built, a default can be specified.
|
||||
If it is not, the default is set to LF, which is the Unix standard. However,
|
||||
the newline convention can be changed by an application when calling
|
||||
<b>pcre2_compile()</b>, or it can be specified by special text at the start of
|
||||
the pattern itself; this overrides any other settings. See the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
page for details of the special character sequences.
|
||||
</P>
|
||||
<P>
|
||||
In the PCRE2 documentation the word "newline" is used to mean "the character or
|
||||
pair of characters that indicate a line break". The choice of newline
|
||||
convention affects the handling of the dot, circumflex, and dollar
|
||||
metacharacters, the handling of #-comments in /x mode, and, when CRLF is a
|
||||
recognized line ending sequence, the match position advancement for a
|
||||
non-anchored pattern. There is more detail about this in the
|
||||
<a href="#matchoptions">section on <b>pcre2_match()</b> options</a>
|
||||
below.
|
||||
</P>
|
||||
<P>
|
||||
The choice of newline convention does not affect the interpretation of
|
||||
the \n or \r escape sequences, nor does it affect what \R matches; this has
|
||||
its own separate convention.
|
||||
</P>
|
||||
<br><a name="SEC17" href="#TOC1">MULTITHREADING</a><br>
|
||||
<P>
|
||||
In a multithreaded application it is important to keep thread-specific data
|
||||
separate from data that can be shared between threads. The PCRE2 library code
|
||||
itself is thread-safe: it contains no static or global variables. The API is
|
||||
designed to be fairly simple for non-threaded applications while at the same
|
||||
time ensuring that multithreaded applications can use it.
|
||||
</P>
|
||||
<P>
|
||||
There are several different blocks of data that are used to pass information
|
||||
between the application and the PCRE2 libraries.
|
||||
</P>
|
||||
<br><b>
|
||||
The compiled pattern
|
||||
</b><br>
|
||||
<P>
|
||||
A pointer to the compiled form of a pattern is returned to the user when
|
||||
<b>pcre2_compile()</b> is successful. The data in the compiled pattern is fixed,
|
||||
and does not change when the pattern is matched. Therefore, it is thread-safe,
|
||||
that is, the same compiled pattern can be used by more than one thread
|
||||
simultaneously. For example, an application can compile all its patterns at the
|
||||
start, before forking off multiple threads that use them. However, if the
|
||||
just-in-time (JIT) optimization feature is being used, it needs separate memory
|
||||
stack areas for each thread. See the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation for more details.
|
||||
</P>
|
||||
<P>
|
||||
In a more complicated situation, where patterns are compiled only when they are
|
||||
first needed, but are still shared between threads, pointers to compiled
|
||||
patterns must be protected from simultaneous writing by multiple threads. This
|
||||
is somewhat tricky to do correctly. If you know that writing to a pointer is
|
||||
atomic in your environment, you can use logic like this:
|
||||
<pre>
|
||||
Get a read-only (shared) lock (mutex) for pointer
|
||||
if (pointer == NULL)
|
||||
{
|
||||
Get a write (unique) lock for pointer
|
||||
if (pointer == NULL) pointer = pcre2_compile(...
|
||||
}
|
||||
Release the lock
|
||||
Use pointer in pcre2_match()
|
||||
</pre>
|
||||
Of course, testing for compilation errors should also be included in the code.
|
||||
</P>
|
||||
<P>
|
||||
The reason for checking the pointer a second time is as follows: Several
|
||||
threads may have acquired the shared lock and tested the pointer for being
|
||||
NULL, but only one of them will be given the write lock, with the rest kept
|
||||
waiting. The winning thread will compile the pattern and store the result.
|
||||
After this thread releases the write lock, another thread will get it, and if
|
||||
it does not retest pointer for being NULL, will recompile the pattern and
|
||||
overwrite the pointer, creating a memory leak and possibly causing other
|
||||
issues.
|
||||
</P>
|
||||
<P>
|
||||
In an environment where writing to a pointer may not be atomic, the above logic
|
||||
is not sufficient. The thread that is doing the compiling may be descheduled
|
||||
after writing only part of the pointer, which could cause other threads to use
|
||||
an invalid value. Instead of checking the pointer itself, a separate "pointer
|
||||
is valid" flag (that can be updated atomically) must be used:
|
||||
<pre>
|
||||
Get a read-only (shared) lock (mutex) for pointer
|
||||
if (!pointer_is_valid)
|
||||
{
|
||||
Get a write (unique) lock for pointer
|
||||
if (!pointer_is_valid)
|
||||
{
|
||||
pointer = pcre2_compile(...
|
||||
pointer_is_valid = TRUE
|
||||
}
|
||||
}
|
||||
Release the lock
|
||||
Use pointer in pcre2_match()
|
||||
</pre>
|
||||
If JIT is being used, but the JIT compilation is not being done immediately
|
||||
(perhaps waiting to see if the pattern is used often enough), similar logic is
|
||||
required. JIT compilation updates a value within the compiled code block, so a
|
||||
thread must gain unique write access to the pointer before calling
|
||||
<b>pcre2_jit_compile()</b>. Alternatively, <b>pcre2_code_copy()</b> or
|
||||
<b>pcre2_code_copy_with_tables()</b> can be used to obtain a private copy of the
|
||||
compiled code before calling the JIT compiler.
|
||||
</P>
|
||||
<br><b>
|
||||
Context blocks
|
||||
</b><br>
|
||||
<P>
|
||||
The next main section below introduces the idea of "contexts" in which PCRE2
|
||||
functions are called. A context is nothing more than a collection of parameters
|
||||
that control the way PCRE2 operates. Grouping a number of parameters together
|
||||
in a context is a convenient way of passing them to a PCRE2 function without
|
||||
using lots of arguments. The parameters that are stored in contexts are in some
|
||||
sense "advanced features" of the API. Many straightforward applications will
|
||||
not need to use contexts.
|
||||
</P>
|
||||
<P>
|
||||
In a multithreaded application, if the parameters in a context are values that
|
||||
are never changed, the same context can be used by all the threads. However, if
|
||||
any thread needs to change any value in a context, it must make its own
|
||||
thread-specific copy.
|
||||
</P>
|
||||
<br><b>
|
||||
Match blocks
|
||||
</b><br>
|
||||
<P>
|
||||
The matching functions need a block of memory for storing the results of a
|
||||
match. This includes details of what was matched, as well as additional
|
||||
information such as the name of a (*MARK) setting. Each thread must provide its
|
||||
own copy of this memory.
|
||||
</P>
|
||||
<br><a name="SEC18" href="#TOC1">PCRE2 CONTEXTS</a><br>
|
||||
<P>
|
||||
Some PCRE2 functions have a lot of parameters, many of which are used only by
|
||||
specialist applications, for example, those that use custom memory management
|
||||
or non-standard character tables. To keep function argument lists at a
|
||||
reasonable size, and at the same time to keep the API extensible, "uncommon"
|
||||
parameters are passed to certain functions in a <b>context</b> instead of
|
||||
directly. A context is just a block of memory that holds the parameter values.
|
||||
Applications that do not need to adjust any of the context parameters can pass
|
||||
NULL when a context pointer is required.
|
||||
</P>
|
||||
<P>
|
||||
There are three different types of context: a general context that is relevant
|
||||
for several PCRE2 operations, a compile-time context, and a match-time context.
|
||||
</P>
|
||||
<br><b>
|
||||
The general context
|
||||
</b><br>
|
||||
<P>
|
||||
At present, this context just contains pointers to (and data for) external
|
||||
memory management functions that are called from several places in the PCRE2
|
||||
library. The context is named "general" rather than specifically "memory"
|
||||
because in future other fields may be added. If you do not want to supply your
|
||||
own custom memory management functions, you do not need to bother with a
|
||||
general context. A general context is created by:
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_general_context *pcre2_general_context_create(</b>
|
||||
<b> void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
|
||||
<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
The two function pointers specify custom memory management functions, whose
|
||||
prototypes are:
|
||||
<pre>
|
||||
<b>void *private_malloc(PCRE2_SIZE, void *);</b>
|
||||
<b>void private_free(void *, void *);</b>
|
||||
</pre>
|
||||
Whenever code in PCRE2 calls these functions, the final argument is the value
|
||||
of <i>memory_data</i>. Either of the first two arguments of the creation
|
||||
function may be NULL, in which case the system memory management functions
|
||||
<i>malloc()</i> and <i>free()</i> are used. (This is not currently useful, as
|
||||
there are no other fields in a general context, but in future there might be.)
|
||||
The <i>private_malloc()</i> function is used (if supplied) to obtain memory for
|
||||
storing the context, and all three values are saved as part of the context.
|
||||
</P>
|
||||
<P>
|
||||
Whenever PCRE2 creates a data block of any kind, the block contains a pointer
|
||||
to the <i>free()</i> function that matches the <i>malloc()</i> function that was
|
||||
used. When the time comes to free the block, this function is called.
|
||||
</P>
|
||||
<P>
|
||||
A general context can be copied by calling:
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_general_context *pcre2_general_context_copy(</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
The memory used for a general context should be freed by calling:
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_general_context_free(pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
If this function is passed a NULL argument, it returns immediately without
|
||||
doing anything.
|
||||
<a name="compilecontext"></a></P>
|
||||
<br><b>
|
||||
The compile context
|
||||
</b><br>
|
||||
<P>
|
||||
A compile context is required if you want to provide an external function for
|
||||
stack checking during compilation or to change the default values of any of the
|
||||
following compile-time parameters:
|
||||
<pre>
|
||||
What \R matches (Unicode newlines or CR, LF, CRLF only)
|
||||
PCRE2's character tables
|
||||
The newline character sequence
|
||||
The compile time nested parentheses limit
|
||||
The maximum length of the pattern string
|
||||
The extra options bits (none set by default)
|
||||
Which performance optimizations the compiler should apply
|
||||
</pre>
|
||||
A compile context is also required if you are using custom memory management.
|
||||
If none of these apply, just pass NULL as the context argument of
|
||||
<b>pcre2_compile()</b>.
|
||||
</P>
|
||||
<P>
|
||||
A compile context is created, copied, and freed by the following functions:
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_compile_context *pcre2_compile_context_create(</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_compile_context *pcre2_compile_context_copy(</b>
|
||||
<b> pcre2_compile_context *<i>ccontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_compile_context_free(pcre2_compile_context *<i>ccontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
A compile context is created with default values for its parameters. These can
|
||||
be changed by calling the following functions, which return 0 on success, or
|
||||
PCRE2_ERROR_BADDATA if invalid data is detected.
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_bsr(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only CR, LF,
|
||||
or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any Unicode line
|
||||
ending sequence. The value is used by the JIT compiler and by the two
|
||||
interpreted matching functions, <b>pcre2_match()</b> and
<b>pcre2_dfa_match()</b>.
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> const uint8_t *<i>tables</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
The value must be the result of a call to <b>pcre2_maketables()</b>, whose only
|
||||
argument is a general context. This function builds a set of character tables
|
||||
in the current locale.
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_compile_extra_options(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>extra_options</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
As PCRE2 has developed, almost all the 32 option bits that are available in
|
||||
the <i>options</i> argument of <b>pcre2_compile()</b> have been used up. To avoid
|
||||
running out, the compile context contains a set of extra option bits which are
|
||||
used for some newer, assumed rarer, options. This function sets those bits. It
always sets all the bits (either on or off): the value passed replaces the
existing setting rather than being combined with it. The available options are
defined in the section entitled "Extra
|
||||
compile options"
|
||||
<a href="#extracompileoptions">below.</a>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> PCRE2_SIZE <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
This sets a maximum length, in code units, for any pattern string that is
|
||||
compiled with this context. If the pattern is longer, an error is generated.
|
||||
This facility is provided so that applications that accept patterns from
|
||||
external sources can limit their size. The default is the largest number that a
|
||||
PCRE2_SIZE variable can hold, which is effectively unlimited.
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_max_pattern_compiled_length(</b>
|
||||
<b> pcre2_compile_context *<i>ccontext</i>, PCRE2_SIZE <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
This sets a maximum size, in bytes, for the memory needed to hold the compiled
|
||||
version of a pattern that is compiled with this context. If the pattern needs
|
||||
more memory, an error is generated. This facility is provided so that
|
||||
applications that accept patterns from external sources can limit the amount of
|
||||
memory they use. The default is the largest number that a PCRE2_SIZE variable
|
||||
can hold, which is effectively unlimited.
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_max_varlookbehind(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
This sets a limit on the number of characters that can be matched by a
|
||||
variable-length lookbehind assertion. The default is set when PCRE2 is built,
|
||||
with the ultimate default being 255, the same as Perl. Lookbehind assertions
|
||||
without a bounding length are not supported.
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_newline(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
This specifies which characters or character sequences are to be recognized as
|
||||
newlines. The value must be one of PCRE2_NEWLINE_CR (carriage return only),
|
||||
PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the two-character
|
||||
sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any of the above),
|
||||
PCRE2_NEWLINE_ANY (any Unicode newline sequence), or PCRE2_NEWLINE_NUL (the
|
||||
NUL character, that is a binary zero).
|
||||
</P>
|
||||
<P>
|
||||
A pattern can override the value set in the compile context by starting with a
|
||||
sequence such as (*CRLF). See the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
page for details.
|
||||
</P>
|
||||
<P>
|
||||
When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
|
||||
option, the newline convention affects the recognition of the end of internal
|
||||
comments starting with #. The value is saved with the compiled pattern for
|
||||
subsequent use by the JIT compiler and by the two interpreted matching
|
||||
functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
This parameter adjusts the limit, set when PCRE2 is built (default 250), on the
|
||||
depth of parenthesis nesting in a pattern. This limit stops rogue patterns
|
||||
using up too much system stack when being compiled. The limit applies to
|
||||
parentheses of all kinds, not just capturing parentheses.
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
There is at least one application that runs PCRE2 in threads with very limited
|
||||
system stack, where running out of stack is to be avoided at all costs. The
|
||||
parenthesis limit above cannot take account of how much stack is actually
|
||||
available during compilation. For a finer control, you can supply a function
|
||||
that is called whenever <b>pcre2_compile()</b> starts to compile a parenthesized
|
||||
part of a pattern. This function can check the actual stack size (or anything
|
||||
else that it wants to, of course).
|
||||
</P>
|
||||
<P>
|
||||
The first argument to the guard function gives the current depth of
|
||||
nesting, and the second is user data that is set up by the last argument of
|
||||
<b>pcre2_set_compile_recursion_guard()</b>. The guard function should return
|
||||
zero if all is well, or non-zero to force an error.
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_optimize(pcre2_compile_context *<i>ccontext</i>,</b>
|
||||
<b> uint32_t <i>directive</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
PCRE2 can apply various performance optimizations during compilation, in order
|
||||
to make matching faster. For example, the compiler might convert some regex
|
||||
constructs into an equivalent construct which <b>pcre2_match()</b> can execute
|
||||
faster. By default, all available optimizations are enabled. However, in rare
|
||||
cases, one might wish to disable specific optimizations. For example, if it is
|
||||
known that some optimizations cannot benefit a certain regex, it might be
|
||||
desirable to disable them, in order to speed up compilation.
|
||||
</P>
|
||||
<P>
|
||||
The permitted values of <i>directive</i> are as follows:
|
||||
<pre>
|
||||
PCRE2_OPTIMIZATION_FULL
|
||||
</pre>
|
||||
Enable all optional performance optimizations. This is the default value.
|
||||
<pre>
|
||||
PCRE2_OPTIMIZATION_NONE
|
||||
</pre>
|
||||
Disable all optional performance optimizations.
|
||||
<pre>
|
||||
PCRE2_AUTO_POSSESS
|
||||
PCRE2_AUTO_POSSESS_OFF
|
||||
</pre>
|
||||
Enable/disable "auto-possessification" of variable quantifiers such as * and +.
|
||||
This optimization, for example, turns a+b into a++b in order to avoid
|
||||
backtracks into a+ that can never be successful. However, if callouts are in
|
||||
use, auto-possessification means that some callouts are never taken. You can
|
||||
disable this optimization if you want the matching functions to do a full,
|
||||
unoptimized search and run all the callouts.
|
||||
<pre>
|
||||
PCRE2_DOTSTAR_ANCHOR
|
||||
PCRE2_DOTSTAR_ANCHOR_OFF
|
||||
</pre>
|
||||
Enable/disable an optimization that is applied when .* is the first significant
|
||||
item in a top-level branch of a pattern, and all the other branches also start
|
||||
with .* or with \A or \G or ^. Such a pattern is automatically anchored if
|
||||
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set for any
|
||||
^ items. Otherwise, the fact that any match must start either at the start of
|
||||
the subject or following a newline is remembered. Like other optimizations,
|
||||
this can cause callouts to be skipped.
|
||||
</P>
|
||||
<P>
|
||||
Dotstar anchor optimization is automatically disabled for .* if it is inside an
|
||||
atomic group or a capture group that is the subject of a backreference, or if
|
||||
the pattern contains (*PRUNE) or (*SKIP).
|
||||
<pre>
|
||||
PCRE2_START_OPTIMIZE
|
||||
PCRE2_START_OPTIMIZE_OFF
|
||||
</pre>
|
||||
Enable/disable optimizations which cause matching functions to scan the subject
|
||||
string for specific code unit values before attempting a match. For example, if
|
||||
it is known that an unanchored match must start with a specific value, the
|
||||
matching code searches the subject for that value, and fails immediately if it
|
||||
cannot find it, without actually running the main matching function. This means
|
||||
that a special item such as (*COMMIT) at the start of a pattern is not
|
||||
considered until after a suitable starting point for the match has been found.
|
||||
Also, when callouts or (*MARK) items are in use, these "start-up" optimizations
|
||||
can cause them to be skipped if the pattern is never actually used. The start-up
|
||||
optimizations are in effect a pre-scan of the subject that takes place before
|
||||
the pattern is run.
|
||||
</P>
|
||||
<P>
|
||||
Disabling start-up optimizations ensures that in cases where the result is "no
|
||||
match", the callouts do occur, and that items such as (*COMMIT) and (*MARK) are
|
||||
considered at every possible starting position in the subject string.
|
||||
</P>
|
||||
<P>
|
||||
Disabling start-up optimizations may change the outcome of a matching operation.
|
||||
Consider the pattern
|
||||
<pre>
|
||||
(*COMMIT)ABC
|
||||
</pre>
|
||||
When this is compiled, PCRE2 records the fact that a match must start with the
|
||||
character "A". Suppose the subject string is "DEFABC". The start-up
|
||||
optimization scans along the subject, finds "A" and runs the first match
|
||||
attempt from there. The (*COMMIT) item means that the pattern must match the
|
||||
current starting position, which in this case, it does. However, if the same
|
||||
match is run without start-up optimizations, the initial scan along the subject
|
||||
string does not happen. The first match attempt is run starting from "D" and
|
||||
when this fails, (*COMMIT) prevents any further matches being tried, so the
|
||||
overall result is "no match".
|
||||
</P>
|
||||
<P>
|
||||
Another start-up optimization makes use of a minimum length for a matching
|
||||
subject, which is recorded when possible. Consider the pattern
|
||||
<pre>
|
||||
(*MARK:1)B(*MARK:2)(X|Y)
|
||||
</pre>
|
||||
The minimum length for a match is two characters. If the subject is "XXBB", the
|
||||
"starting character" optimization skips "XX", then tries to match "BB", which
|
||||
is long enough. In the process, (*MARK:2) is encountered and remembered. When
|
||||
the match attempt fails, the next "B" is found, but there is only one character
|
||||
left, so there are no more attempts, and "no match" is returned with the "last
|
||||
mark seen" set to "2". Without start-up optimizations, however, matches are
|
||||
tried at every possible starting position, including at the end of the subject,
|
||||
where (*MARK:1) is encountered, but there is no "B", so the "last mark seen"
|
||||
that is returned is "1". In this case, the optimizations do not affect the
|
||||
overall match result, which is still "no match", but they do affect the
|
||||
auxiliary information that is returned.
|
||||
<a name="matchcontext"></a></P>
|
||||
<br><b>
|
||||
The match context
|
||||
</b><br>
|
||||
<P>
|
||||
A match context is required if you want to:
|
||||
<pre>
|
||||
Set up a callout function
|
||||
Set an offset limit for matching an unanchored pattern
|
||||
Change the limit on the amount of heap used when matching
|
||||
Change the backtracking match limit
|
||||
Change the backtracking depth limit
|
||||
Set custom memory management specifically for the match
|
||||
</pre>
|
||||
If none of these apply, just pass NULL as the context argument of
|
||||
<b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or <b>pcre2_jit_match()</b>.
|
||||
</P>
|
||||
<P>
|
||||
A match context is created, copied, and freed by the following functions:
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_match_context *pcre2_match_context_create(</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_match_context *pcre2_match_context_copy(</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_match_context_free(pcre2_match_context *<i>mcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
A match context is created with default values for its parameters. These can
|
||||
be changed by calling the following functions, which return 0 on success, or
|
||||
PCRE2_ERROR_BADDATA if invalid data is detected.
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_callout(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> int (*<i>callout_function</i>)(pcre2_callout_block *, void *),</b>
|
||||
<b> void *<i>callout_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
This sets up a callout function for PCRE2 to call at specified points
|
||||
during a matching operation. Details are given in the
|
||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||
documentation.
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_substitute_callout(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> int (*<i>callout_function</i>)(pcre2_substitute_callout_block *, void *),</b>
|
||||
<b> void *<i>callout_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
This sets up a callout function for PCRE2 to call after each substitution
|
||||
made by <b>pcre2_substitute()</b>. Details are given in the section entitled
|
||||
"Creating a new string with substitutions"
|
||||
<a href="#substitutions">below.</a>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_substitute_case_callout(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> PCRE2_SIZE (*<i>callout_function</i>)(PCRE2_SPTR, PCRE2_SIZE,</b>
|
||||
<b> PCRE2_UCHAR *, PCRE2_SIZE,</b>
|
||||
<b> int, void *),</b>
|
||||
<b> void *<i>callout_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
This sets up a callout function for PCRE2 to call when performing case
|
||||
transformations inside <b>pcre2_substitute()</b>. Details are given in the
|
||||
section entitled "Creating a new string with substitutions"
|
||||
<a href="#substitutions">below.</a>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> PCRE2_SIZE <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
The <i>offset_limit</i> parameter limits how far an unanchored search can
|
||||
advance in the subject string. The default value is PCRE2_UNSET. The
|
||||
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b> functions return
|
||||
PCRE2_ERROR_NOMATCH if a match with a starting point before or at the given
|
||||
offset is not found. The <b>pcre2_substitute()</b> function makes no more
|
||||
substitutions.
|
||||
</P>
|
||||
<P>
|
||||
For example, if the pattern /abc/ is matched against "123abc" with an offset
|
||||
limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match can never be
|
||||
found if the <i>startoffset</i> argument of <b>pcre2_match()</b>,
|
||||
<b>pcre2_dfa_match()</b>, or <b>pcre2_substitute()</b> is greater than the offset
|
||||
limit set in the match context.
|
||||
</P>
|
||||
<P>
|
||||
When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT option when
|
||||
calling <b>pcre2_compile()</b> so that when JIT is in use, different code can be
|
||||
compiled. If a match is started with a non-default offset limit when
|
||||
PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
|
||||
</P>
|
||||
<P>
|
||||
The offset limit facility can be used to track progress when searching large
|
||||
subject strings or to limit the extent of global substitutions. See also the
|
||||
PCRE2_FIRSTLINE option, which requires a match to start before or at the first
|
||||
newline that follows the start of matching in the subject. If this is set with
|
||||
an offset limit, a match must occur in the first line and also within the
|
||||
offset limit. In other words, whichever limit comes first is used.
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_heap_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
The <i>heap_limit</i> parameter specifies, in units of kibibytes (1024 bytes),
|
||||
the maximum amount of heap memory that <b>pcre2_match()</b> may use to hold
|
||||
backtracking information when running an interpretive match. This limit also
|
||||
applies to <b>pcre2_dfa_match()</b>, which may use the heap when processing
|
||||
patterns with a lot of nested pattern recursion or lookarounds or atomic
|
||||
groups. This limit does not apply to matching with the JIT optimization, which
|
||||
has its own memory control arrangements (see the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation for more details). If the limit is reached, the negative error
|
||||
code PCRE2_ERROR_HEAPLIMIT is returned. The default limit can be set when PCRE2
|
||||
is built; if it is not, the default is set very large and is essentially
|
||||
unlimited.
|
||||
</P>
|
||||
<P>
|
||||
A value for the heap limit may also be supplied by an item at the start of a
|
||||
pattern of the form
|
||||
<pre>
|
||||
(*LIMIT_HEAP=ddd)
|
||||
</pre>
|
||||
where ddd is a decimal number. However, such a setting is ignored unless ddd is
|
||||
less than the limit set by the caller of <b>pcre2_match()</b> or, if no such
|
||||
limit is set, less than the default.
|
||||
</P>
|
||||
<P>
|
||||
The <b>pcre2_match()</b> function always needs some heap memory, so setting a
|
||||
value of zero guarantees a "heap limit exceeded" error. Details of how
|
||||
<b>pcre2_match()</b> uses the heap are given in the
|
||||
<a href="pcre2perform.html"><b>pcre2perform</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
For <b>pcre2_dfa_match()</b>, a vector on the system stack is used when
|
||||
processing pattern recursions, lookarounds, or atomic groups, and only if this
|
||||
is not big enough is heap memory used. In this case, setting a value of zero
|
||||
disables the use of the heap.
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
The <i>match_limit</i> parameter provides a means of preventing PCRE2 from using
|
||||
up too many computing resources when processing patterns that are not going to
|
||||
match, but which have a very large number of possibilities in their search
|
||||
trees. The classic example is a pattern that uses nested unlimited repeats.
|
||||
</P>
|
||||
<P>
|
||||
There is an internal counter in <b>pcre2_match()</b> that is incremented each
|
||||
time round its main matching loop. If this value reaches the match limit,
|
||||
<b>pcre2_match()</b> returns the negative value PCRE2_ERROR_MATCHLIMIT. This has
|
||||
the effect of limiting the amount of backtracking that can take place. For
|
||||
patterns that are not anchored, the count restarts from zero for each position
|
||||
in the subject string. This limit also applies to <b>pcre2_dfa_match()</b>,
|
||||
though the counting is done in a different way.
|
||||
</P>
|
||||
<P>
|
||||
When <b>pcre2_match()</b> is called with a pattern that was successfully
|
||||
processed by <b>pcre2_jit_compile()</b>, the way in which matching is executed
|
||||
is entirely different. However, there is still the possibility of runaway
|
||||
matching that goes on for a very long time, and so the <i>match_limit</i> value
|
||||
is also used in this case (but in a different way) to limit how long the
|
||||
matching can continue.
|
||||
</P>
|
||||
<P>
|
||||
The default value for the limit can be set when PCRE2 is built; the default is
|
||||
10 million, which handles all but the most extreme cases. A value for the match
|
||||
limit may also be supplied by an item at the start of a pattern of the form
|
||||
<pre>
|
||||
(*LIMIT_MATCH=ddd)
|
||||
</pre>
|
||||
where ddd is a decimal number. However, such a setting is ignored unless ddd is
|
||||
less than the limit set by the caller of <b>pcre2_match()</b> or
|
||||
<b>pcre2_dfa_match()</b> or, if no such limit is set, less than the default.
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_depth_limit(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> uint32_t <i>value</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
This parameter limits the depth of nested backtracking in <b>pcre2_match()</b>.
|
||||
Each time a nested backtracking point is passed, a new memory frame is used
|
||||
to remember the state of matching at that point. Thus, this parameter
|
||||
indirectly limits the amount of memory that is used in a match. However,
|
||||
because the size of each memory frame depends on the number of capturing
|
||||
parentheses, the actual memory limit varies from pattern to pattern. This limit
|
||||
was more useful in versions before 10.30, where function recursion was used for
|
||||
backtracking.
|
||||
</P>
|
||||
<P>
|
||||
The depth limit is not relevant, and is ignored, when matching is done using
|
||||
JIT compiled code. However, it is supported by <b>pcre2_dfa_match()</b>, which
|
||||
uses it to limit the depth of nested internal recursive function calls that
|
||||
implement atomic groups, lookaround assertions, and pattern recursions. This
|
||||
limits, indirectly, the amount of system stack that is used. It was more useful
|
||||
in versions before 10.32, when stack memory was used for local workspace
|
||||
vectors for recursive function calls. From version 10.32, only local variables
|
||||
are allocated on the stack and as each call uses only a few hundred bytes, even
|
||||
a small stack can support quite a lot of recursion.
|
||||
</P>
|
||||
<P>
|
||||
If the depth of internal recursive function calls is great enough, local
|
||||
workspace vectors are allocated on the heap from version 10.32 onwards, so the
|
||||
depth limit also indirectly limits the amount of heap memory that is used. A
|
||||
recursive pattern such as /(.(?2))((?1)|)/, when matched to a very long string
|
||||
using <b>pcre2_dfa_match()</b>, can use a great deal of memory. However, it is
|
||||
probably better to limit heap usage directly by calling
|
||||
<b>pcre2_set_heap_limit()</b>.
|
||||
</P>
|
||||
<P>
|
||||
The default value for the depth limit can be set when PCRE2 is built; if it is
|
||||
not, the default is set to the same value as the default for the match limit.
|
||||
If the limit is exceeded, <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
|
||||
returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be
|
||||
supplied by an item at the start of a pattern of the form
|
||||
<pre>
|
||||
(*LIMIT_DEPTH=ddd)
|
||||
</pre>
|
||||
where ddd is a decimal number. However, such a setting is ignored unless ddd is
|
||||
less than the limit set by the caller of <b>pcre2_match()</b> or
|
||||
<b>pcre2_dfa_match()</b> or, if no such limit is set, less than the default.
|
||||
</P>
|
||||
<br><a name="SEC19" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br>
|
||||
<P>
|
||||
<b>int pcre2_config(uint32_t <i>what</i>, void *<i>where</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
The function <b>pcre2_config()</b> makes it possible for a PCRE2 client to find
|
||||
the value of certain configuration parameters and to discover which optional
|
||||
features have been compiled into the PCRE2 library. The
|
||||
<a href="pcre2build.html"><b>pcre2build</b></a>
|
||||
documentation has more details about these features.
|
||||
</P>
|
||||
<P>
|
||||
The first argument for <b>pcre2_config()</b> specifies which information is
|
||||
required. The second argument is a pointer to memory into which the information
|
||||
is placed. If NULL is passed, the function returns the amount of memory that is
|
||||
needed for the requested information. For calls that return numerical values,
|
||||
the value is in bytes; when requesting these values, <i>where</i> should point
|
||||
to appropriately aligned memory. For calls that return strings, the required
|
||||
length is given in code units, not counting the terminating zero.
|
||||
</P>
|
||||
<P>
|
||||
When requesting information, the returned value from <b>pcre2_config()</b> is
|
||||
non-negative on success, or the negative error code PCRE2_ERROR_BADOPTION if
|
||||
the value in the first argument is not recognized. The following information is
|
||||
available:
|
||||
<pre>
|
||||
PCRE2_CONFIG_BSR
|
||||
</pre>
|
||||
The output is a uint32_t integer whose value indicates what character
|
||||
sequences the \R escape sequence matches by default. A value of
|
||||
PCRE2_BSR_UNICODE means that \R matches any Unicode line ending sequence; a
|
||||
value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF. The
|
||||
default can be overridden when a pattern is compiled.
|
||||
<pre>
|
||||
PCRE2_CONFIG_COMPILED_WIDTHS
|
||||
</pre>
|
||||
The output is a uint32_t integer whose lower bits indicate which code unit
|
||||
widths were selected when PCRE2 was built. The 1-bit indicates 8-bit support,
|
||||
and the 2-bit and 4-bit indicate 16-bit and 32-bit support, respectively.
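Decoding this bit mask is straightforward. The sketch below is plain C with no PCRE2 dependency; the helper name describe_widths() is hypothetical, and the mask is assumed to have been obtained by passing the address of a uint32_t to pcre2_config() with PCRE2_CONFIG_COMPILED_WIDTHS.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the supported code unit widths into "out" as
   text such as "8 32" and returns how many widths are set, or -1 if the
   buffer is too small. */
int describe_widths(uint32_t mask, char *out, size_t out_size)
{
    static const struct { uint32_t bit; const char *name; } widths[] = {
        { 1u, "8" }, { 2u, "16" }, { 4u, "32" }
    };
    size_t used = 0;
    int count = 0;

    if (out_size == 0) return -1;
    out[0] = '\0';

    for (size_t i = 0; i < sizeof widths / sizeof widths[0]; i++) {
        if ((mask & widths[i].bit) == 0) continue;
        int n = snprintf(out + used, out_size - used, "%s%s",
                         count > 0 ? " " : "", widths[i].name);
        if (n < 0 || (size_t)n >= out_size - used) return -1;
        used += (size_t)n;
        count++;
    }
    return count;
}
```

For instance, a mask of 5 (the 1-bit and the 4-bit) describes a build with the 8-bit and 32-bit libraries but not the 16-bit one.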
|
||||
<pre>
|
||||
PCRE2_CONFIG_DEPTHLIMIT
|
||||
</pre>
|
||||
The output is a uint32_t integer that gives the default limit for the depth of
|
||||
nested backtracking in <b>pcre2_match()</b> or the depth of nested recursions,
|
||||
lookarounds, and atomic groups in <b>pcre2_dfa_match()</b>. Further details are
|
||||
given with <b>pcre2_set_depth_limit()</b> above.
|
||||
<pre>
|
||||
PCRE2_CONFIG_HEAPLIMIT
|
||||
</pre>
|
||||
The output is a uint32_t integer that gives, in kibibytes, the default limit
|
||||
for the amount of heap memory used by <b>pcre2_match()</b> or
|
||||
<b>pcre2_dfa_match()</b>. Further details are given with
|
||||
<b>pcre2_set_heap_limit()</b> above.
|
||||
<pre>
|
||||
PCRE2_CONFIG_JIT
|
||||
</pre>
|
||||
The output is a uint32_t integer that is set to one if support for just-in-time
|
||||
compiling is included in the library; otherwise it is set to zero. Note that
|
||||
having the support in the library does not guarantee that JIT will be used for
|
||||
any given match, and neither does it guarantee that JIT will actually be able
|
||||
to function, because it may not be able to allocate executable memory in some
|
||||
environments. There is a special call to <b>pcre2_jit_compile()</b> that can be
|
||||
used to check this. See the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation for more details.
|
||||
<pre>
|
||||
PCRE2_CONFIG_JITTARGET
|
||||
</pre>
|
||||
The <i>where</i> argument should point to a buffer that is at least 48 code
|
||||
units long. (The exact length required can be found by calling
|
||||
<b>pcre2_config()</b> with <i>where</i> set to NULL.) The buffer is filled with a
|
||||
string that contains the name of the architecture for which the JIT compiler is
|
||||
configured, for example "x86 32bit (little endian + unaligned)". If JIT support
|
||||
is not available, PCRE2_ERROR_BADOPTION is returned, otherwise the number of
|
||||
code units used is returned. This is the length of the string, plus one unit
|
||||
for the terminating zero.
|
||||
<pre>
|
||||
PCRE2_CONFIG_LINKSIZE
|
||||
</pre>
|
||||
The output is a uint32_t integer that contains the number of bytes used for
|
||||
internal linkage in compiled regular expressions. When PCRE2 is configured, the
|
||||
value can be set to 2, 3, or 4, with the default being 2. This is the value
|
||||
that is returned by <b>pcre2_config()</b>. However, when the 16-bit library is
|
||||
compiled, a value of 3 is rounded up to 4, and when the 32-bit library is
|
||||
compiled, internal linkages always use 4 bytes, so the configured value is not
|
||||
relevant.
|
||||
</P>
|
||||
<P>
|
||||
The default value of 2 for the 8-bit and 16-bit libraries is sufficient for all
|
||||
but the most massive patterns, since it allows the size of the compiled pattern
|
||||
to be up to 65535 code units. Larger values allow larger regular expressions to
|
||||
be compiled by those two libraries, but at the expense of slower matching.
|
||||
<pre>
|
||||
PCRE2_CONFIG_MATCHLIMIT
|
||||
</pre>
|
||||
The output is a uint32_t integer that gives the default match limit for
|
||||
<b>pcre2_match()</b>. Further details are given with
|
||||
<b>pcre2_set_match_limit()</b> above.
|
||||
<pre>
|
||||
PCRE2_CONFIG_NEWLINE
|
||||
</pre>
|
||||
The output is a uint32_t integer whose value specifies the default character
|
||||
sequence that is recognized as meaning "newline". The values are:
|
||||
<pre>
|
||||
PCRE2_NEWLINE_CR Carriage return (CR)
|
||||
PCRE2_NEWLINE_LF Linefeed (LF)
|
||||
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
|
||||
PCRE2_NEWLINE_ANY Any Unicode line ending
|
||||
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
|
||||
PCRE2_NEWLINE_NUL The NUL character (binary zero)
|
||||
</pre>
|
||||
The default should normally correspond to the standard sequence for your
|
||||
operating system.
|
||||
<pre>
|
||||
PCRE2_CONFIG_NEVER_BACKSLASH_C
|
||||
</pre>
|
||||
The output is a uint32_t integer that is set to one if the use of \C was
|
||||
permanently disabled when PCRE2 was built; otherwise it is set to zero.
|
||||
<pre>
|
||||
PCRE2_CONFIG_PARENSLIMIT
|
||||
</pre>
|
||||
The output is a uint32_t integer that gives the maximum depth of nesting
|
||||
of parentheses (of any kind) in a pattern. This limit is imposed to cap the
|
||||
amount of system stack used when a pattern is compiled. It is specified when
|
||||
PCRE2 is built; the default is 250. This limit does not take into account the
|
||||
stack that may already be used by the calling application. For finer control
|
||||
over compilation stack usage, see <b>pcre2_set_compile_recursion_guard()</b>.
|
||||
<pre>
|
||||
PCRE2_CONFIG_STACKRECURSE
|
||||
</pre>
|
||||
This parameter is obsolete and should not be used in new code. The output is a
|
||||
uint32_t integer that is always set to zero.
|
||||
<pre>
|
||||
PCRE2_CONFIG_TABLES_LENGTH
|
||||
</pre>
|
||||
The output is a uint32_t integer that gives the length of PCRE2's character
|
||||
processing tables in bytes. For details of these tables see the
|
||||
<a href="#localesupport">section on locale support</a>
|
||||
below.
|
||||
<pre>
|
||||
PCRE2_CONFIG_UNICODE_VERSION
|
||||
</pre>
|
||||
The <i>where</i> argument should point to a buffer that is at least 24 code
|
||||
units long. (The exact length required can be found by calling
|
||||
<b>pcre2_config()</b> with <i>where</i> set to NULL.) If PCRE2 has been compiled
|
||||
without Unicode support, the buffer is filled with the text "Unicode not
|
||||
supported". Otherwise, the Unicode version string (for example, "8.0.0") is
|
||||
inserted. The number of code units used is returned. This is the length of the
|
||||
string plus one unit for the terminating zero.
|
||||
<pre>
|
||||
PCRE2_CONFIG_UNICODE
|
||||
</pre>
|
||||
The output is a uint32_t integer that is set to one if Unicode support is
|
||||
available; otherwise it is set to zero. Unicode support implies UTF support.
|
||||
<pre>
|
||||
PCRE2_CONFIG_VERSION
|
||||
</pre>
|
||||
The <i>where</i> argument should point to a buffer that is at least 24 code
|
||||
units long. (The exact length required can be found by calling
|
||||
<b>pcre2_config()</b> with <i>where</i> set to NULL.) The buffer is filled with
|
||||
the PCRE2 version string, zero-terminated. The number of code units used is
|
||||
returned. This is the length of the string plus one unit for the terminating
|
||||
zero.
|
||||
<a name="compiling"></a></P>
|
||||
<br><a name="SEC20" href="#TOC1">COMPILING A PATTERN</a><br>
|
||||
<P>
|
||||
<b>pcre2_code *pcre2_compile(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
|
||||
<b> uint32_t <i>options</i>, int *<i>errorcode</i>, PCRE2_SIZE *<i>erroroffset,</i></b>
|
||||
<b> pcre2_compile_context *<i>ccontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_code_free(pcre2_code *<i>code</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *<i>code</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
The <b>pcre2_compile()</b> function compiles a pattern into an internal form.
|
||||
The pattern is defined by a pointer to a string of code units and a length in
|
||||
code units. If the pattern is zero-terminated, the length can be specified as
|
||||
PCRE2_ZERO_TERMINATED. A NULL pattern pointer with a length of zero is treated
|
||||
as an empty string (NULL with a non-zero length causes an error return). The
|
||||
function returns a pointer to a block of memory that contains the compiled
|
||||
pattern and related data, or NULL if an error occurred.
|
||||
</P>
|
||||
<P>
|
||||
If the compile context argument <i>ccontext</i> is NULL, memory for the compiled
|
||||
pattern is obtained by calling <b>malloc()</b>. Otherwise, it is obtained from
|
||||
the same memory function that was used for the compile context. The caller must
|
||||
free the memory by calling <b>pcre2_code_free()</b> when it is no longer needed.
|
||||
If <b>pcre2_code_free()</b> is called with a NULL argument, it returns
|
||||
immediately, without doing anything.
|
||||
</P>
|
||||
<P>
|
||||
The function <b>pcre2_code_copy()</b> makes a copy of the compiled code in new
|
||||
memory, using the same memory allocator as was used for the original. However,
|
||||
if the code has been processed by the JIT compiler (see
|
||||
<a href="#jitcompiling">below),</a>
|
||||
the JIT information cannot be copied (because it is position-dependent).
|
||||
The new copy can initially be used only for non-JIT matching, though it can be
|
||||
passed to <b>pcre2_jit_compile()</b> if required. If <b>pcre2_code_copy()</b> is
|
||||
called with a NULL argument, it returns NULL.
|
||||
</P>
|
||||
<P>
|
||||
The <b>pcre2_code_copy()</b> function provides a way for individual threads in a
|
||||
multithreaded application to acquire a private copy of shared compiled code.
|
||||
However, it does not make a copy of the character tables used by the compiled
|
||||
pattern; the new pattern code points to the same tables as the original code.
|
||||
(See
|
||||
<a href="#localesupport">"Locale Support"</a>
|
||||
below for details of these character tables.) In many applications the same
|
||||
tables are used throughout, so this behaviour is appropriate. Nevertheless,
|
||||
there are occasions when a copy of a compiled pattern and the relevant tables
|
||||
are needed. The <b>pcre2_code_copy_with_tables()</b> function provides this facility.
|
||||
Copies of both the code and the tables are made, with the new code pointing to
|
||||
the new tables. The memory for the new tables is automatically freed when
|
||||
<b>pcre2_code_free()</b> is called for the new copy of the compiled code. If
|
||||
<b>pcre2_code_copy_with_tables()</b> is called with a NULL argument, it returns
|
||||
NULL.
|
||||
</P>
|
||||
<P>
|
||||
NOTE: When one of the matching functions is called, pointers to the compiled
|
||||
pattern and the subject string are set in the match data block so that they can
|
||||
be referenced by the substring extraction functions after a successful match.
|
||||
After running a match, you must not free a compiled pattern or a subject string
|
||||
until after all operations on the
|
||||
<a href="#matchdatablock">match data block</a>
|
||||
have taken place, unless, in the case of the subject string, you have used the
|
||||
PCRE2_COPY_MATCHED_SUBJECT option, which is described in the section entitled
|
||||
"Option bits for <b>pcre2_match()</b>"
|
||||
<a href="#matchoptions">below.</a>
|
||||
</P>
|
||||
<P>
|
||||
The <i>options</i> argument for <b>pcre2_compile()</b> contains various bit
|
||||
settings that affect the compilation. It should be zero if none of them are
|
||||
required. The available options are described below. Some of them (in
|
||||
particular, those that are compatible with Perl, but some others as well) can
|
||||
also be set and unset from within the pattern (see the detailed description in
|
||||
the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation).
|
||||
</P>
|
||||
<P>
|
||||
For those options that can be different in different parts of the pattern, the
|
||||
contents of the <i>options</i> argument specify their settings at the start of
|
||||
compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and PCRE2_NO_UTF_CHECK
|
||||
options can be set at the time of matching as well as at compile time.
|
||||
</P>
|
||||
<P>
|
||||
Some additional options and less frequently required compile-time parameters
|
||||
(for example, the newline setting) can be provided in a compile context (as
|
||||
described
|
||||
<a href="#compilecontext">above).</a>
|
||||
</P>
|
||||
<P>
|
||||
If <i>errorcode</i> or <i>erroroffset</i> is NULL, <b>pcre2_compile()</b> returns
|
||||
NULL immediately. Otherwise, the variables to which these point are set to an
|
||||
error code and an offset (number of code units) within the pattern,
|
||||
respectively, when <b>pcre2_compile()</b> returns NULL because a compilation
|
||||
error has occurred.
|
||||
</P>
|
||||
<P>
|
||||
There are over 100 positive error codes that <b>pcre2_compile()</b> may return
|
||||
if it finds an error in the pattern. There are also some negative error codes
|
||||
that are used for invalid UTF strings when validity checking is in force. These
|
||||
are the same as given by <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and
|
||||
are described in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
documentation. There is no separate documentation for the positive error codes,
|
||||
because the textual error messages that are obtained by calling the
|
||||
<b>pcre2_get_error_message()</b> function (see "Obtaining a textual error
|
||||
message"
|
||||
<a href="#geterrormessage">below)</a>
|
||||
should be self-explanatory. Macro names starting with PCRE2_ERROR_ are defined
|
||||
for both positive and negative error codes in <b>pcre2.h</b>. When compilation
|
||||
is successful <i>errorcode</i> is set to a value that returns the message "no
|
||||
error" if passed to <b>pcre2_get_error_message()</b>.
|
||||
</P>
|
||||
<P>
|
||||
The value returned in <i>erroroffset</i> is an indication of where in the
|
||||
pattern an error occurred. When there is no error, zero is returned. A non-zero
|
||||
value is not necessarily the furthest point in the pattern that was read. For
|
||||
example, after the error "lookbehind assertion is not fixed length", the error
|
||||
offset points to the start of the failing assertion. For an invalid UTF-8 or
|
||||
UTF-16 string, the offset is that of the first code unit of the failing
|
||||
character.
|
||||
</P>
|
||||
<P>
|
||||
Some errors are not detected until the whole pattern has been scanned; in these
|
||||
cases, the offset passed back is the length of the pattern. Note that the
|
||||
offset is in code units, not characters, even in a UTF mode. It may sometimes
|
||||
point into the middle of a UTF-8 or UTF-16 character.
|
||||
</P>
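<P>
Because the offset is counted in code units, an application that shows error
positions to users may want to convert it to a character index first. The
following is a minimal sketch for the 8-bit library, assuming the pattern is
valid UTF-8. The helper name <b>utf8_offset_to_char_index()</b> is hypothetical
and is not part of PCRE2.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2: convert a code-unit offset in a
   valid UTF-8 string into a character index. If the offset points into the
   middle of a multi-byte character, it is first moved back to that
   character's first byte. */
static size_t utf8_offset_to_char_index(const unsigned char *s, size_t length,
                                        size_t offset)
{
    size_t i, count = 0;

    if (offset > length) offset = length;

    /* Continuation bytes have the form 10xxxxxx; step back over them. */
    while (offset > 0 && offset < length && (s[offset] & 0xC0) == 0x80)
        offset--;

    /* Count the bytes before the offset that start a character. */
    for (i = 0; i < offset; i++)
        if ((s[i] & 0xC0) != 0x80) count++;

    return count;
}
```

For the four-byte pattern consisting of "a", U+00E9, and "(", an offset of 3
gives character index 2, and an offset of 2 (the middle of the two-byte
character) gives character index 1.
</P>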
|
||||
<P>
|
||||
This code fragment shows a typical straightforward call to
|
||||
<b>pcre2_compile()</b>:
|
||||
<pre>
|
||||
pcre2_code *re;
|
||||
PCRE2_SIZE erroffset;
|
||||
int errorcode;
|
||||
re = pcre2_compile(
|
||||
(PCRE2_SPTR)"^A.*Z", /* the pattern, cast for the 8-bit library */
|
||||
PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */
|
||||
0, /* default options */
|
||||
&errorcode, /* for error code */
|
||||
&erroffset, /* for error offset */
|
||||
NULL); /* no compile context */
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<br><b>
|
||||
Main compile options
|
||||
</b><br>
|
||||
<P>
|
||||
The following names for option bits are defined in the <b>pcre2.h</b> header
|
||||
file:
|
||||
<pre>
|
||||
PCRE2_ANCHORED
|
||||
</pre>
|
||||
If this bit is set, the pattern is forced to be "anchored", that is, it is
|
||||
constrained to match only at the first matching point in the string that is
|
||||
being searched (the "subject string"). This effect can also be achieved by
|
||||
appropriate constructs in the pattern itself, which is the only way to do it in
|
||||
Perl.
|
||||
<pre>
|
||||
PCRE2_ALLOW_EMPTY_CLASS
|
||||
</pre>
|
||||
By default, for compatibility with Perl, a closing square bracket that
|
||||
immediately follows an opening one is treated as a data character for the
|
||||
class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the class, which
|
||||
therefore contains no characters and so can never match.
|
||||
<pre>
|
||||
PCRE2_ALT_BSUX
|
||||
</pre>
|
||||
This option requests alternative handling of three escape sequences, which
|
||||
makes PCRE2's behaviour more like ECMAscript (aka JavaScript). When it is set:
|
||||
</P>
|
||||
<P>
|
||||
(1) \U matches an upper case "U" character; by default \U causes a compile
|
||||
time error (Perl uses \U to upper case subsequent characters).
|
||||
</P>
|
||||
<P>
|
||||
(2) \u matches a lower case "u" character unless it is followed by four
|
||||
hexadecimal digits, in which case the hexadecimal number defines the code point
|
||||
to match. By default, \u causes a compile time error (Perl uses it to upper
|
||||
case the following character).
|
||||
</P>
|
||||
<P>
|
||||
(3) \x matches a lower case "x" character unless it is followed by two
|
||||
hexadecimal digits, in which case the hexadecimal number defines the code point
|
||||
to match. By default, as in Perl, a hexadecimal number is always expected after
|
||||
\x, but it may have zero, one, or two digits (so, for example, \xz matches a
|
||||
binary zero character followed by z).
|
||||
</P>
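<P>
Rules (2) and (3) can be modelled in a few lines. This is an illustrative
sketch of the rules as stated above, not code taken from PCRE2; the function
name <b>alt_bsux_escape()</b> and its parameters are assumptions made for the
example.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative model only, not PCRE2 source: interpret the zero-terminated
   text 'rest' that follows "\u" (letter == 'u') or "\x" (letter == 'x') when
   PCRE2_ALT_BSUX is set. Returns the code point to match and stores in
   *consumed the number of characters of 'rest' that belong to the escape. */
static uint32_t alt_bsux_escape(char letter, const char *rest, size_t *consumed)
{
    size_t needed = (letter == 'u') ? 4 : 2;   /* \u needs 4 digits, \x needs 2 */
    uint32_t value = 0;
    size_t i;

    for (i = 0; i < needed; i++) {
        unsigned char c = (unsigned char)rest[i];
        if (!isxdigit(c)) {                    /* also stops at the final zero */
            *consumed = 0;
            return (uint32_t)(unsigned char)letter;   /* literal "u" or "x" */
        }
        value = value * 16 +
                (uint32_t)(isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
    }

    *consumed = needed;
    return value;
}
```

With "00e9" following \u the function returns 0xE9 and consumes four
characters; with "12" it returns the letter "u" and consumes nothing.
</P>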
|
||||
<P>
|
||||
ECMAscript 6 added additional functionality to \u. This can be accessed using
|
||||
the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile options"
|
||||
<a href="#extracompileoptions">below).</a>
|
||||
Note that this alternative escape handling applies only to patterns. Neither of
|
||||
these options affects the processing of replacement strings passed to
|
||||
<b>pcre2_substitute()</b>.
|
||||
<pre>
|
||||
PCRE2_ALT_CIRCUMFLEX
|
||||
</pre>
|
||||
In multiline mode (when PCRE2_MULTILINE is set), the circumflex metacharacter
|
||||
matches at the start of the subject (unless PCRE2_NOTBOL is set), and also
|
||||
after any internal newline. However, it does not match after a newline at the
|
||||
end of the subject, for compatibility with Perl. If you want a multiline
|
||||
circumflex also to match after a terminating newline, you must set
|
||||
PCRE2_ALT_CIRCUMFLEX.
|
||||
<pre>
|
||||
PCRE2_ALT_EXTENDED_CLASS
|
||||
</pre>
|
||||
Alters the parsing of character classes to follow the extended syntax
|
||||
described by Unicode UTS#18. The PCRE2_ALT_EXTENDED_CLASS option has no impact
|
||||
on the behaviour of the Perl-specific "(?[...])" syntax for extended classes,
|
||||
but instead enables the alternative syntax of extended class behaviour inside
|
||||
ordinary "[...]" character classes. See the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation for details of the character classes supported.
|
||||
<pre>
|
||||
PCRE2_ALT_VERBNAMES
|
||||
</pre>
|
||||
By default, for compatibility with Perl, the name in any verb sequence such as
|
||||
(*MARK:NAME) is any sequence of characters that does not include a closing
|
||||
parenthesis. The name is not processed in any way, and it is not possible to
|
||||
include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
|
||||
option is set, normal backslash processing is applied to verb names and only an
|
||||
unescaped closing parenthesis terminates the name. A closing parenthesis can be
|
||||
included in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED
|
||||
or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
|
||||
whitespace in verb names is skipped and #-comments are recognized, exactly as
|
||||
in the rest of the pattern.
|
||||
<pre>
|
||||
PCRE2_AUTO_CALLOUT
|
||||
</pre>
|
||||
If this bit is set, <b>pcre2_compile()</b> automatically inserts callout items,
|
||||
all with number 255, before each pattern item, except immediately before or
|
||||
after an explicit callout in the pattern. For discussion of the callout
|
||||
facility, see the
|
||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||
documentation.
|
||||
<pre>
|
||||
PCRE2_CASELESS
|
||||
</pre>
|
||||
If this bit is set, letters in the pattern match both upper and lower case
|
||||
letters in the subject. It is equivalent to Perl's /i option, and it can be
|
||||
changed within a pattern by a (?i) option setting. If either PCRE2_UTF or
|
||||
PCRE2_UCP is set, Unicode properties are used for all characters with more than
|
||||
one other case, and for all characters whose code points are greater than
|
||||
U+007F.
|
||||
</P>
|
||||
<P>
|
||||
Note that there are two ASCII characters, K and S, that, in addition to
|
||||
their lower case ASCII equivalents, are case-equivalent with U+212A (Kelvin
|
||||
sign) and U+017F (long S) respectively. If you do not want this case
|
||||
equivalence, you can suppress it by setting PCRE2_EXTRA_CASELESS_RESTRICT.
|
||||
</P>
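<P>
The two special cases can be written out as a small table. The sketch below
models only the K and S equivalence sets described in the previous paragraph;
it is not PCRE2 source, and the function name <b>ks_caseless_equal()</b> and
its <i>restrict_ascii</i> flag (standing in for PCRE2_EXTRA_CASELESS_RESTRICT)
are assumptions made for the example.

```c
#include <stdint.h>

/* Illustrative model, not PCRE2 source: caseless equivalence for the two
   ASCII letters that have a non-ASCII partner when Unicode properties are in
   use. A non-zero 'restrict_ascii' models PCRE2_EXTRA_CASELESS_RESTRICT. */
static int ks_caseless_equal(uint32_t a, uint32_t b, int restrict_ascii)
{
    static const uint32_t sets[2][3] = {
        { 'K', 'k', 0x212A },   /* U+212A Kelvin sign */
        { 'S', 's', 0x017F }    /* U+017F long S */
    };
    int i, j, in_a, in_b;

    if (a == b) return 1;

    /* With the restriction, an ASCII character never matches a non-ASCII one. */
    if (restrict_ascii && ((a < 128) != (b < 128))) return 0;

    for (i = 0; i < 2; i++) {
        in_a = in_b = 0;
        for (j = 0; j < 3; j++) {
            if (sets[i][j] == a) in_a = 1;
            if (sets[i][j] == b) in_b = 1;
        }
        if (in_a && in_b) return 1;   /* both belong to the same set */
    }
    return 0;
}
```

Here "k" and U+212A compare equal without the restriction and unequal with it,
while "S" and "s" compare equal in both cases.
</P>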
|
||||
<P>
|
||||
One language family, Turkish and Azeri, has its own case-insensitivity rules,
|
||||
which can be selected by setting PCRE2_EXTRA_TURKISH_CASING. This alters the
|
||||
behaviour of the 'i', 'I', U+0130 (capital I with dot above), and U+0131
|
||||
(small dotless i) characters.
|
||||
</P>
|
||||
<P>
|
||||
For lower valued characters with only one other case, a lookup table is used
|
||||
for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is used
|
||||
for all code points less than 256, and higher code points (available only in
|
||||
16-bit or 32-bit mode) are treated as not having another case.
|
||||
</P>
|
||||
<P>
|
||||
From release 10.45 PCRE2_CASELESS also affects what some of the letter-related
|
||||
Unicode property escapes (\p and \P) match. The properties Lu (upper case
|
||||
letter), Ll (lower case letter), and Lt (title case letter) are all treated as
|
||||
LC (cased letter) when PCRE2_CASELESS is set.
|
||||
<pre>
|
||||
PCRE2_DOLLAR_ENDONLY
|
||||
</pre>
|
||||
If this bit is set, a dollar metacharacter in the pattern matches only at the
|
||||
end of the subject string. Without this option, a dollar also matches
|
||||
immediately before a newline at the end of the string (but not before any other
|
||||
newlines). The PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is
|
||||
set. There is no equivalent to this option in Perl, and no way to set it within
|
||||
a pattern.
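The positions at which a dollar can match under these rules may be sketched as
follows. This is an illustrative model, not PCRE2 source: it assumes an 8-bit
subject, a single-character "\n" newline, and no match-time options that alter
the behaviour of dollar. The function name <b>dollar_matches()</b> is
hypothetical.

```c
#include <stddef.h>

/* Illustrative model, not PCRE2 source: can $ match at offset 'pos'?
   'multiline' and 'dollar_endonly' stand for PCRE2_MULTILINE and
   PCRE2_DOLLAR_ENDONLY. Assumes "\n" is the only newline. */
static int dollar_matches(const char *subject, size_t length, size_t pos,
                          int multiline, int dollar_endonly)
{
    if (pos > length) return 0;
    if (pos == length) return 1;            /* very end of the subject */
    if (subject[pos] != '\n') return 0;     /* otherwise only before a newline */
    if (multiline) return 1;                /* any newline; ENDONLY is ignored */

    /* Single-line mode: only before a newline that ends the subject, and
       only when PCRE2_DOLLAR_ENDONLY is not set. */
    return !dollar_endonly && pos == length - 1;
}
```

For the subject "ab\ncd\n" (length 6), offset 5 is accepted by default but
rejected when <i>dollar_endonly</i> is set, and offset 2 is accepted only when
<i>multiline</i> is set.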
|
||||
<pre>
|
||||
PCRE2_DOTALL
|
||||
</pre>
|
||||
If this bit is set, a dot metacharacter in the pattern matches any character,
|
||||
including one that indicates a newline. However, it only ever matches one
|
||||
character, even if newlines are coded as CRLF. Without this option, a dot does
|
||||
not match when the current position in the subject is at a newline. This option
|
||||
is equivalent to Perl's /s option, and it can be changed within a pattern by a
|
||||
(?s) option setting. A negative class such as [^a] always matches newline
|
||||
characters, and the \N escape sequence always matches a non-newline character,
|
||||
independent of the setting of PCRE2_DOTALL.
|
||||
<pre>
|
||||
PCRE2_DUPNAMES
|
||||
</pre>
|
||||
If this bit is set, names used to identify capture groups need not be unique.
|
||||
This can be helpful for certain types of pattern when it is known that only one
|
||||
instance of the named group can ever be matched. There are more details of
|
||||
named capture groups below; see also the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation.
|
||||
<pre>
|
||||
PCRE2_ENDANCHORED
|
||||
</pre>
|
||||
If this bit is set, the end of any pattern match must be right at the end of
|
||||
the string being searched (the "subject string"). If the pattern match
|
||||
succeeds by reaching (*ACCEPT), but does not reach the end of the subject, the
|
||||
match fails at the current starting point. For unanchored patterns, a new match
|
||||
is then tried at the next starting point. However, if the match succeeds by
|
||||
reaching the end of the pattern, but not the end of the subject, backtracking
|
||||
occurs and an alternative match may be found. Consider these two patterns:
|
||||
<pre>
|
||||
.(*ACCEPT)|..
|
||||
.|..
|
||||
</pre>
|
||||
If matched against "abc" with PCRE2_ENDANCHORED set, the first matches "c"
|
||||
whereas the second matches "bc". The effect of PCRE2_ENDANCHORED can also be
|
||||
achieved by appropriate constructs in the pattern itself, which is the only way
|
||||
to do it in Perl.
|
||||
</P>
|
||||
<P>
|
||||
For DFA matching with <b>pcre2_dfa_match()</b>, PCRE2_ENDANCHORED applies only
|
||||
to the first (that is, the longest) matched string. Other parallel matches,
|
||||
which are necessarily substrings of the first one, must obviously end before
|
||||
the end of the subject.
|
||||
<pre>
|
||||
PCRE2_EXTENDED
|
||||
</pre>
|
||||
If this bit is set, most white space characters in the pattern are totally
|
||||
ignored except when escaped, inside a character class, or inside a \Q...\E
|
||||
sequence. However, white space is not allowed within sequences such as (?> that
|
||||
introduce various parenthesized groups, nor within numerical quantifiers such
|
||||
as {1,3}. Ignorable white space is permitted between an item and a following
|
||||
quantifier and between a quantifier and a following + that indicates
|
||||
possessiveness. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
|
||||
changed within a pattern by a (?x) option setting.
|
||||
</P>
|
||||
<P>
|
||||
When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as
|
||||
white space only those characters with code points less than 256 that are
|
||||
flagged as white space in its low-character table. The table is normally
|
||||
created by
|
||||
<a href="pcre2_maketables.html"><b>pcre2_maketables()</b>,</a>
|
||||
which uses the <b>isspace()</b> function to identify space characters. In most
|
||||
ASCII environments, the relevant characters are those with code points 0x0009
|
||||
(tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D
|
||||
(carriage return), and 0x0020 (space).
|
||||
</P>
|
||||
<P>
|
||||
When PCRE2 is compiled with Unicode support, in addition to these characters,
|
||||
five more Unicode "Pattern White Space" characters are recognized by
|
||||
PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-right mark),
|
||||
U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
|
||||
separator). This set of characters is the same as recognized by Perl's /x
|
||||
option. Note that the horizontal and vertical space characters that are matched
|
||||
by the \h and \v escapes in patterns are a much bigger set.
|
||||
</P>
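<P>
The two sets listed above can be summarized in one function. This is an
illustrative model, not PCRE2 source: it assumes the default tables of an
ASCII environment, and the name <b>is_extended_white_space()</b> and its
<i>unicode</i> flag are hypothetical.

```c
#include <stdint.h>

/* Illustrative model, not PCRE2 source: is code point 'cp' ignored as
   pattern white space under PCRE2_EXTENDED? 'unicode' is non-zero when
   PCRE2 was built with Unicode support. */
static int is_extended_white_space(uint32_t cp, int unicode)
{
    switch (cp) {
    case 0x0009:   /* tab */
    case 0x000A:   /* linefeed */
    case 0x000B:   /* vertical tab */
    case 0x000C:   /* formfeed */
    case 0x000D:   /* carriage return */
    case 0x0020:   /* space */
        return 1;
    case 0x0085:   /* next line */
    case 0x200E:   /* left-to-right mark */
    case 0x200F:   /* right-to-left mark */
    case 0x2028:   /* line separator */
    case 0x2029:   /* paragraph separator */
        return unicode != 0;
    default:
        return 0;
    }
}
```

A character such as U+00A0 (no-break space), which \h matches, is not in this
set, so it remains significant in an extended pattern.
</P>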
|
||||
<P>
|
||||
As well as ignoring most white space, PCRE2_EXTENDED also causes characters
|
||||
between an unescaped # outside a character class and the next newline,
|
||||
inclusive, to be ignored, which makes it possible to include comments inside
|
||||
complicated patterns. Note that the end of this type of comment is a literal
|
||||
newline sequence in the pattern; escape sequences that happen to represent a
|
||||
newline do not count.
|
||||
</P>
|
||||
<P>
|
||||
Which characters are interpreted as newlines can be specified by a setting in
|
||||
the compile context that is passed to <b>pcre2_compile()</b> or by a special
|
||||
sequence at the start of the pattern, as described in the section entitled
|
||||
<a href="pcre2pattern.html#newlines">"Newline conventions"</a>
|
||||
in the <b>pcre2pattern</b> documentation. A default is defined when PCRE2 is
|
||||
built.
|
||||
<pre>
|
||||
PCRE2_EXTENDED_MORE
|
||||
</pre>
|
||||
This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
|
||||
and horizontal tab characters are ignored inside a character class. Note: only
|
||||
these two characters are ignored, not the full set of pattern white space
|
||||
characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
|
||||
equivalent to Perl's /xx option, and it can be changed within a pattern by a
|
||||
(?xx) option setting.
|
||||
<pre>
|
||||
PCRE2_FIRSTLINE
|
||||
</pre>
|
||||
If this option is set, the start of an unanchored pattern match must be before
|
||||
or at the first newline in the subject string following the start of matching,
|
||||
though the matched text may continue over the newline. If <i>startoffset</i> is
|
||||
non-zero, the limiting newline is not necessarily the first newline in the
|
||||
subject. For example, if the subject string is "abc\nxyz" (where \n
|
||||
represents a single-character newline) a pattern match for "yz" succeeds with
|
||||
PCRE2_FIRSTLINE if <i>startoffset</i> is greater than 3. See also
|
||||
PCRE2_USE_OFFSET_LIMIT, which provides a more general limiting facility. If
|
||||
PCRE2_FIRSTLINE is set with an offset limit, a match must occur in the first
|
||||
line and also within the offset limit. In other words, whichever limit comes
|
||||
first is used. This option has no effect for anchored patterns.
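The "abc\nxyz" example can be checked with a small model of the rule. This is a
sketch under stated assumptions (8-bit subject, "\n" as the only newline, no
offset limit), not PCRE2 source, and <b>firstline_limit()</b> is a hypothetical
name.

```c
#include <stddef.h>

/* Illustrative model, not PCRE2 source: the latest offset at which an
   unanchored match may start under PCRE2_FIRSTLINE, given the offset at
   which matching begins. Assumes "\n" is the only newline. */
static size_t firstline_limit(const char *subject, size_t length,
                              size_t startoffset)
{
    size_t i;

    for (i = startoffset; i < length; i++)
        if (subject[i] == '\n') return i;   /* may start before or at it */

    return length;                          /* no newline: no restriction */
}
```

For "abc\nxyz" the limit is 3 when <i>startoffset</i> is 0 or 3, so "yz" at
offset 5 cannot match; when <i>startoffset</i> is 4 no newline follows, and the
match is allowed.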
|
||||
<pre>
|
||||
PCRE2_LITERAL
|
||||
</pre>
|
||||
If this option is set, all meta-characters in the pattern are disabled, and it
|
||||
is treated as a literal string. Matching literal strings with a regular
|
||||
expression engine is not the most efficient way of doing it. If you are doing a
|
||||
lot of literal matching and are worried about efficiency, you should consider
|
||||
using other approaches. The only other main options that are allowed with
|
||||
PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
|
||||
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_MATCH_INVALID_UTF,
|
||||
PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
|
||||
PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
|
||||
PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an error.
|
||||
<pre>
|
||||
PCRE2_MATCH_INVALID_UTF
|
||||
</pre>
|
||||
This option forces PCRE2_UTF (see below) and also enables support for matching
|
||||
by <b>pcre2_match()</b> in subject strings that contain invalid UTF sequences.
|
||||
Note, however, that the 16-bit and 32-bit PCRE2 libraries process strings as
|
||||
sequences of uint16_t or uint32_t code points. They cannot find valid UTF
|
||||
sequences within an arbitrary string of bytes unless such sequences are
|
||||
suitably aligned. This facility is not supported for DFA matching. For details,
|
||||
see the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
documentation.
|
||||
<pre>
|
||||
PCRE2_MATCH_UNSET_BACKREF
|
||||
</pre>
|
||||
If this option is set, a backreference to an unset capture group matches an
|
||||
empty string (by default this causes the current matching alternative to fail).
|
||||
A pattern such as (\1)(a) succeeds when this option is set (assuming it can
|
||||
find an "a" in the subject), whereas it fails by default, for Perl
|
||||
compatibility. Setting this option makes PCRE2 behave more like ECMAscript (aka
|
||||
JavaScript).
|
||||
<pre>
|
||||
PCRE2_MULTILINE
|
||||
</pre>
|
||||
By default, for the purposes of matching "start of line" and "end of line",
|
||||
PCRE2 treats the subject string as consisting of a single line of characters,
|
||||
even if it actually contains newlines. The "start of line" metacharacter (^)
|
||||
matches only at the start of the string, and the "end of line" metacharacter
|
||||
($) matches only at the end of the string, or before a terminating newline
|
||||
(except when PCRE2_DOLLAR_ENDONLY is set). Note, however, that unless
|
||||
PCRE2_DOTALL is set, the "any character" metacharacter (.) does not match at a
|
||||
newline. This behaviour (for ^, $, and dot) is the same as in Perl.
|
||||
</P>
|
||||
<P>
|
||||
When PCRE2_MULTILINE is set, the "start of line" and "end of line"
|
||||
constructs match immediately following or immediately before internal newlines
|
||||
in the subject string, respectively, as well as at the very start and end. This
|
||||
is equivalent to Perl's /m option, and it can be changed within a pattern by a
|
||||
(?m) option setting. Note that the "start of line" metacharacter does not match
|
||||
after a newline at the end of the subject, for compatibility with Perl.
|
||||
However, you can change this by setting the PCRE2_ALT_CIRCUMFLEX option. If
|
||||
there are no newlines in a subject string, or no occurrences of ^ or $ in a
|
||||
pattern, setting PCRE2_MULTILINE has no effect.
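The rules for the circumflex, including the effect of PCRE2_ALT_CIRCUMFLEX, can
be modelled as follows. This is an illustrative sketch, not PCRE2 source: it
assumes an 8-bit subject, "\n" as the only newline, and PCRE2_NOTBOL unset. The
function name <b>circumflex_matches()</b> is hypothetical.

```c
#include <stddef.h>

/* Illustrative model, not PCRE2 source: can ^ match at offset 'pos'?
   'multiline' and 'alt_circumflex' stand for PCRE2_MULTILINE and
   PCRE2_ALT_CIRCUMFLEX. Assumes "\n" is the only newline. */
static int circumflex_matches(const char *subject, size_t length, size_t pos,
                              int multiline, int alt_circumflex)
{
    if (pos == 0) return 1;                   /* start of the subject */
    if (!multiline || pos > length) return 0;
    if (subject[pos - 1] != '\n') return 0;   /* must follow a newline */

    /* After a newline that ends the subject, only with ALT_CIRCUMFLEX. */
    return pos < length || alt_circumflex;
}
```

For the subject "ab\ncd\n" (length 6), offset 3 is accepted only in multiline
mode, and offset 6 is accepted only when <i>alt_circumflex</i> is also set.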
|
||||
<pre>
|
||||
PCRE2_NEVER_BACKSLASH_C
|
||||
</pre>
|
||||
This option locks out the use of \C in the pattern that is being compiled.
|
||||
This escape can cause unpredictable behaviour in UTF-8 or UTF-16 modes, because
|
||||
it may leave the current matching point in the middle of a multi-code-unit
|
||||
character. This option may be useful in applications that process patterns from
|
||||
external sources. Note that there is also a build-time option that permanently
|
||||
locks out the use of \C.
|
||||
<pre>
|
||||
PCRE2_NEVER_UCP
|
||||
</pre>
|
||||
This option locks out the use of Unicode properties for handling \B, \b, \D,
|
||||
\d, \S, \s, \W, \w, and some of the POSIX character classes, as described
|
||||
for the PCRE2_UCP option below. In particular, it prevents the creator of the
|
||||
pattern from enabling this facility by starting the pattern with (*UCP). This
|
||||
option may be useful in applications that process patterns from external
|
||||
sources. The option combination PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
|
||||
<pre>
|
||||
PCRE2_NEVER_UTF
|
||||
</pre>
|
||||
This option locks out interpretation of the pattern as UTF-8, UTF-16, or
|
||||
UTF-32, depending on which library is in use. In particular, it prevents the
|
||||
creator of the pattern from switching to UTF interpretation by starting the
|
||||
pattern with (*UTF). This option may be useful in applications that process
|
||||
patterns from external sources. The combination of PCRE2_UTF and
|
||||
PCRE2_NEVER_UTF causes an error.
|
||||
<pre>
|
||||
PCRE2_NO_AUTO_CAPTURE
|
||||
</pre>
|
||||
If this option is set, it disables the use of numbered capturing parentheses in
|
||||
the pattern. Any opening parenthesis that is not followed by ? behaves as if it
|
||||
were followed by ?: but named parentheses can still be used for capturing (and
|
||||
they acquire numbers in the usual way). This is the same as Perl's /n option.
|
||||
Note that, when this option is set, references to capture groups
|
||||
(backreferences or recursion/subroutine calls) may only refer to named groups,
|
||||
though the reference can be by name or by number.
|
||||
<pre>
|
||||
PCRE2_NO_AUTO_POSSESS
|
||||
</pre>
|
||||
If this (deprecated) option is set, it disables "auto-possessification", which
|
||||
is an optimization that, for example, turns a+b into a++b in order to avoid
|
||||
backtracks into a+ that can never be successful. However, if callouts are in
|
||||
use, auto-possessification means that some callouts are never taken. You can
|
||||
set this option if you want the matching functions to do a full unoptimized
|
||||
search and run all the callouts, but it is mainly provided for testing
|
||||
purposes.
|
||||
</P>
|
||||
<P>
|
||||
If a compile context is available, it is recommended to use
|
||||
<b>pcre2_set_optimize()</b> with the <i>directive</i> PCRE2_AUTO_POSSESS_OFF rather
|
||||
than the compile option PCRE2_NO_AUTO_POSSESS. Note that PCRE2_NO_AUTO_POSSESS
|
||||
takes precedence over the <b>pcre2_set_optimize()</b> optimization directives
|
||||
PCRE2_AUTO_POSSESS and PCRE2_AUTO_POSSESS_OFF.
|
||||
<pre>
|
||||
PCRE2_NO_DOTSTAR_ANCHOR
|
||||
</pre>
|
||||
If this (deprecated) option is set, it disables an optimization that is applied
|
||||
when .* is the first significant item in a top-level branch of a pattern, and
|
||||
all the other branches also start with .* or with \A or \G or ^. The
|
||||
optimization is automatically disabled for .* if it is inside an atomic group
|
||||
or a capture group that is the subject of a backreference, or if the pattern
|
||||
contains (*PRUNE) or (*SKIP). When the optimization is not disabled, such a
|
||||
pattern is automatically anchored if PCRE2_DOTALL is set for all the .* items
|
||||
and PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any
|
||||
match must start either at the start of the subject or following a newline is
|
||||
remembered. Like other optimizations, this can cause callouts to be skipped.
|
||||
(If a compile context is available, it is recommended to use
|
||||
<b>pcre2_set_optimize()</b> with the <i>directive</i> PCRE2_DOTSTAR_ANCHOR_OFF
|
||||
instead.)
|
||||
<pre>
|
||||
PCRE2_NO_START_OPTIMIZE
|
||||
</pre>
|
||||
This is an option whose main effect is at matching time. It does not change
|
||||
what <b>pcre2_compile()</b> generates, but it does affect the output of the JIT
|
||||
compiler. Setting this option is equivalent to calling <b>pcre2_set_optimize()</b>
|
||||
with the <i>directive</i> parameter set to PCRE2_START_OPTIMIZE_OFF.
|
||||
</P>
|
||||
<P>
|
||||
There are a number of optimizations that may occur at the start of a match, in
|
||||
order to speed up the process. For example, if it is known that an unanchored
|
||||
match must start with a specific code unit value, the matching code searches
|
||||
the subject for that value, and fails immediately if it cannot find it, without
|
||||
actually running the main matching function. The start-up optimizations are
|
||||
in effect a pre-scan of the subject that takes place before the pattern is run.
|
||||
</P>
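<P>
The first-code-unit example amounts to a search that runs before the matcher.
The sketch below models that idea for an 8-bit subject; it is not PCRE2 source,
and the function name <b>prescan_first_unit()</b> and its parameters are
assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative model, not PCRE2 source: pre-scan an 8-bit subject for a code
   unit with which every match must start. Returns the offset at which the
   main matching function would first be run, or (size_t)-1 if the value is
   absent and the match can fail without running it at all. */
static size_t prescan_first_unit(const unsigned char *subject, size_t length,
                                 size_t start, unsigned char first_unit)
{
    const unsigned char *p;

    if (start >= length) return (size_t)-1;

    p = memchr(subject + start, first_unit, length - start);
    return p ? (size_t)(p - subject) : (size_t)-1;
}
```

When the function returns (size_t)-1, the matcher is never entered, which is
why such a pre-scan can change what is observed through callouts.
</P>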
|
||||
<P>
|
||||
Disabling the start-up optimizations may cause performance to suffer. However,
|
||||
this may be desirable for patterns which contain callouts or items such as
|
||||
(*COMMIT) and (*MARK). See the above description of PCRE2_START_OPTIMIZE_OFF
|
||||
for further details.
|
||||
<pre>
|
||||
PCRE2_NO_UTF_CHECK
|
||||
</pre>
|
||||
When PCRE2_UTF is set, the validity of the pattern as a UTF string is
|
||||
automatically checked. There are discussions about the validity of
|
||||
<a href="pcre2unicode.html#utf8strings">UTF-8 strings,</a>
|
||||
<a href="pcre2unicode.html#utf16strings">UTF-16 strings,</a>
|
||||
and
|
||||
<a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
|
||||
in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
document. If an invalid UTF sequence is found, <b>pcre2_compile()</b> returns a
|
||||
negative error code.
|
||||
</P>
|
||||
<P>
|
||||
If you know that your pattern is a valid UTF string, and you want to skip this
|
||||
check for performance reasons, you can set the PCRE2_NO_UTF_CHECK option. When
|
||||
it is set, the effect of passing an invalid UTF string as a pattern is
|
||||
undefined. It may cause your program to crash or loop.
|
||||
</P>
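<P>
An application that wants to be sure of a pattern's validity before setting
PCRE2_NO_UTF_CHECK could run its own check. The following is a minimal sketch
for UTF-8 based on the rules of RFC 3629; that it agrees in every detail with
the check PCRE2 itself performs is an assumption, and the helper name
<b>utf8_first_invalid()</b> is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2: return the offset of the first
   code unit of the first invalid UTF-8 character in s[0..length), or
   'length' if the whole string is valid. Rejects overlong forms, surrogates
   (U+D800 to U+DFFF), and values above U+10FFFF. */
static size_t utf8_first_invalid(const unsigned char *s, size_t length)
{
    size_t i = 0;

    while (i < length) {
        unsigned char c = s[i];
        size_t extra, k;
        unsigned long cp, min;

        if (c < 0x80) { i++; continue; }   /* single-byte character */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return i;                     /* stray continuation or bad lead byte */

        if (length - i <= extra) return i; /* character is truncated */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return i;   /* not a continuation */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min) return i;                       /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return i;   /* surrogate */
        if (cp > 0x10FFFF) return i;                  /* beyond Unicode range */

        i += extra + 1;
    }
    return length;
}
```

A return value equal to the length means that no invalid character was found;
any smaller value is the offset of the first code unit of the failing
character.
</P>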
|
||||
<P>
|
||||
Note that this option can also be passed to <b>pcre2_match()</b> and
|
||||
<b>pcre2_dfa_match()</b>, to suppress UTF validity checking of the subject
|
||||
string.
|
||||
</P>
|
||||
<P>
|
||||
Note also that setting PCRE2_NO_UTF_CHECK at compile time does not disable the
|
||||
error that is given if an escape sequence for an invalid Unicode code point is
|
||||
encountered in the pattern. In particular, the so-called "surrogate" code
|
||||
points (0xd800 to 0xdfff) are invalid. If you want to allow escape sequences
|
||||
such as \x{d800} you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
|
||||
option, as described in the section entitled "Extra compile options"
|
||||
<a href="#extracompileoptions">below.</a>
|
||||
However, this is possible only in UTF-8 and UTF-32 modes, because these values
|
||||
are not representable in UTF-16.
|
||||
<pre>
|
||||
PCRE2_UCP
|
||||
</pre>
|
||||
This option has two effects. Firstly, it changes the way PCRE2 processes \B,
|
||||
\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By
|
||||
default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
|
||||
properties are used to classify characters. There are some PCRE2_EXTRA
|
||||
options (see below) that add finer control to this behaviour. More details are
|
||||
given in the section on
|
||||
<a href="pcre2pattern.html#genericchartypes">generic character types</a>
|
||||
in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
page.
|
||||
</P>
|
||||
<P>
|
||||
The second effect of PCRE2_UCP is to force the use of Unicode properties for
|
||||
upper/lower casing operations, even when PCRE2_UTF is not set. This makes it
|
||||
possible to process strings in the 16-bit UCS-2 code. This option is available
|
||||
only if PCRE2 has been compiled with Unicode support (which is the default).
|
||||
</P>
|
||||
<P>
|
||||
The PCRE2_EXTRA_CASELESS_RESTRICT option (see below) restricts caseless
|
||||
matching such that ASCII characters match only ASCII characters and non-ASCII
|
||||
characters match only non-ASCII characters. The PCRE2_EXTRA_TURKISH_CASING option
|
||||
(see below) alters the matching of the 'i' characters to follow their behaviour
|
||||
in Turkish and Azeri languages. For further details on
|
||||
PCRE2_EXTRA_CASELESS_RESTRICT and PCRE2_EXTRA_TURKISH_CASING, see the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
page.
|
||||
<pre>
|
||||
PCRE2_UNGREEDY
|
||||
</pre>
|
||||
This option inverts the "greediness" of the quantifiers so that they are not
|
||||
greedy by default, but become greedy if followed by "?". It is not compatible
|
||||
with Perl. It can also be set by a (?U) option setting within the pattern.
|
||||
<pre>
|
||||
PCRE2_USE_OFFSET_LIMIT
|
||||
</pre>
|
||||
This option must be set for <b>pcre2_compile()</b> if
|
||||
<b>pcre2_set_offset_limit()</b> is going to be used to set a non-default offset
|
||||
limit in a match context for matches that use this pattern. An error is
|
||||
generated if an offset limit is set without this option. For more details, see
|
||||
the description of <b>pcre2_set_offset_limit()</b> in the
|
||||
<a href="#matchcontext">section</a>
|
||||
that describes match contexts. See also the PCRE2_FIRSTLINE
|
||||
option above.
|
||||
<pre>
|
||||
PCRE2_UTF
|
||||
</pre>
|
||||
This option causes PCRE2 to regard both the pattern and the subject strings
|
||||
that are subsequently processed as strings of UTF characters instead of
|
||||
single-code-unit strings. It is available when PCRE2 is built to include
|
||||
Unicode support (which is the default). If Unicode support is not available,
|
||||
the use of this option provokes an error. Details of how PCRE2_UTF changes the
|
||||
behaviour of PCRE2 are given in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
page. In particular, note that it changes the way PCRE2_CASELESS works.
|
||||
<a name="extracompileoptions"></a></P>
|
||||
<br><b>
|
||||
Extra compile options
|
||||
</b><br>
|
||||
<P>
|
||||
The option bits that can be set in a compile context by calling the
|
||||
<b>pcre2_set_compile_extra_options()</b> function are as follows:
|
||||
<pre>
|
||||
PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
|
||||
</pre>
|
||||
Since release 10.38 PCRE2 has forbidden the use of \K within lookaround
|
||||
assertions, following Perl's lead. This option is provided to re-enable the
|
||||
previous behaviour (act in positive lookarounds, ignore in negative ones) in
|
||||
case anybody is relying on it.
|
||||
<pre>
|
||||
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
|
||||
</pre>
|
||||
This option applies when compiling a pattern in UTF-8 or UTF-32 mode. It is
|
||||
forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode "surrogate"
|
||||
code points in the range 0xd800 to 0xdfff are used in pairs in UTF-16 to encode
|
||||
code points with values in the range 0x10000 to 0x10ffff. The surrogates cannot
|
||||
therefore be represented in UTF-16. They can be represented in UTF-8 and
|
||||
UTF-32, but are defined as invalid code points, and cause errors if encountered
|
||||
in a UTF-8 or UTF-32 string that is being checked for validity by PCRE2.
|
||||
</P>
|
||||
<P>
|
||||
These values also cause errors if encountered in escape sequences such as
|
||||
\x{d912} within a pattern. However, it seems that some applications, when
|
||||
using PCRE2 to check for unwanted characters in UTF-8 strings, explicitly test
|
||||
for the surrogates using escape sequences. The PCRE2_NO_UTF_CHECK option does
|
||||
not disable the error that occurs, because it applies only to the testing of
|
||||
input strings for UTF validity.
|
||||
</P>
|
||||
<P>
|
||||
If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surrogate code
|
||||
point values in UTF-8 and UTF-32 patterns no longer provoke errors and are
|
||||
incorporated in the compiled pattern. However, they can only match subject
|
||||
characters if the matching function is called with PCRE2_NO_UTF_CHECK set.
|
||||
<pre>
|
||||
PCRE2_EXTRA_ALT_BSUX
|
||||
</pre>
|
||||
The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and \x in
|
||||
the way that ECMAscript (aka JavaScript) does. Additional functionality was
|
||||
defined by ECMAscript 6; setting PCRE2_EXTRA_ALT_BSUX has the effect of
|
||||
PCRE2_ALT_BSUX, but in addition it recognizes \u{hhh..} as a hexadecimal
|
||||
character code, where hhh.. is any number of hexadecimal digits.
|
||||
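As an illustration of the \u{hhh..} form only, the following sketch parses such a sequence from a C string. It is not how PCRE2 implements the option. The function names are hypothetical, and the upper bound of 0x10ffff is an assumption made for this example; the limit PCRE2 applies depends on the mode in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse "\u{hhh..}" at the start of s. On success store the code point in
   *cp and return the number of characters consumed; return 0 otherwise.
   Hypothetical helper, not part of the PCRE2 API. */
size_t parse_u_brace(const char *s, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (s[0] != '\\' || s[1] != 'u' || s[2] != '{') return 0;

    uint32_t value = 0;
    size_t i = 3;
    for (; hex_value(s[i]) >= 0; i++) {
        value = value * 16 + (uint32_t)hex_value(s[i]);
        if (value > 0x10ffff) return 0;   /* assumed cap for this sketch */
    }
    if (i == 3 || s[i] != '}') return 0;  /* need at least one digit and '}' */
    *cp = value;
    return i + 1;
}
```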
<pre>
|
||||
PCRE2_EXTRA_ASCII_BSD
|
||||
</pre>
|
||||
This option forces \d to match only ASCII digits, even when PCRE2_UCP is set.
|
||||
It can be changed within a pattern by means of the (?aD) option setting.
|
||||
<pre>
|
||||
PCRE2_EXTRA_ASCII_BSS
|
||||
</pre>
|
||||
This option forces \s to match only ASCII space characters, even when
|
||||
PCRE2_UCP is set. It can be changed within a pattern by means of the (?aS)
|
||||
option setting.
|
||||
<pre>
|
||||
PCRE2_EXTRA_ASCII_BSW
|
||||
</pre>
|
||||
This option forces \w to match only ASCII word characters, even when PCRE2_UCP
|
||||
is set. It can be changed within a pattern by means of the (?aW) option
|
||||
setting.
|
||||
<pre>
|
||||
PCRE2_EXTRA_ASCII_DIGIT
|
||||
</pre>
|
||||
This option forces the POSIX character classes [:digit:] and [:xdigit:] to
|
||||
match only ASCII digits, even when PCRE2_UCP is set. It can be changed within
|
||||
a pattern by means of the (?aT) option setting.
|
||||
<pre>
|
||||
PCRE2_EXTRA_ASCII_POSIX
|
||||
</pre>
|
||||
This option forces all the POSIX character classes, including [:digit:] and
|
||||
[:xdigit:], to match only ASCII characters, even when PCRE2_UCP is set. It can
|
||||
be changed within a pattern by means of the (?aP) option setting, but note that
|
||||
this also sets PCRE2_EXTRA_ASCII_DIGIT in order to ensure that (?-aP) unsets
|
||||
all ASCII restrictions for POSIX classes.
|
||||
<pre>
|
||||
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
|
||||
</pre>
|
||||
This is a dangerous option. Use with care. By default, an unrecognized escape
|
||||
such as \j or a malformed one such as \x{2z} causes a compile-time error when
|
||||
detected by <b>pcre2_compile()</b>. Perl is somewhat inconsistent in handling
|
||||
such items: for example, \j is treated as a literal "j", and non-hexadecimal
|
||||
digits in \x{} are just ignored, though warnings are given in both cases if
|
||||
Perl's warning switch is enabled. However, a malformed octal number after \o{
|
||||
always causes an error in Perl.
|
||||
</P>
|
||||
<P>
|
||||
If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
|
||||
<b>pcre2_compile()</b>, all unrecognized or malformed escape sequences are
|
||||
treated as single-character escapes. For example, \j is a literal "j" and
|
||||
\x{2z} is treated as the literal string "x{2z}". Setting this option means
|
||||
that typos in patterns may go undetected and have unexpected results. Also note
|
||||
that a sequence such as [\N{] is interpreted as a malformed attempt at
|
||||
[\N{...}] and so is treated as [N{] whereas [\N] gives an error because an
|
||||
unqualified \N is a valid escape sequence but is not supported in a character
|
||||
class. To reiterate: this is a dangerous option. Use with great care.
|
||||
<pre>
|
||||
PCRE2_EXTRA_CASELESS_RESTRICT
|
||||
</pre>
|
||||
When either PCRE2_UCP or PCRE2_UTF is set, caseless matching follows Unicode
|
||||
rules, which allow for more than two cases per character. There are two
|
||||
case-equivalent character sets that contain both ASCII and non-ASCII
|
||||
characters. The ASCII letter S is case-equivalent to U+017f (long S) and the
|
||||
ASCII letter K is case-equivalent to U+212a (Kelvin sign). This option disables
|
||||
recognition of case-equivalences that cross the ASCII/non-ASCII boundary. In a
|
||||
caseless match, both characters must be ASCII or both must be non-ASCII. The option
|
||||
can be changed within a pattern by the (*CASELESS_RESTRICT) or (?r) option
|
||||
settings.
|
||||
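The boundary rule can be expressed as a one-line predicate. The sketch below uses hypothetical names and is not taken from the PCRE2 sources. It checks only whether two code points lie on the same side of the ASCII/non-ASCII boundary; deciding whether they are case-equivalent in the first place is a separate step that is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* True for code points in the ASCII range 0..127. */
bool is_ascii(uint32_t cp)
{
    return cp < 128;
}

/* Under the restriction, a caseless equivalence between a and b is only
   considered when both are ASCII or both are non-ASCII.
   Hypothetical helper, not part of the PCRE2 API. */
bool caseless_restrict_allows(uint32_t a, uint32_t b)
{
    return is_ascii(a) == is_ascii(b);
}
```

With this predicate, the pair K and U+212a is rejected, as is s and U+017f, whereas K and k remain eligible.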
<pre>
|
||||
PCRE2_EXTRA_ESCAPED_CR_IS_LF
|
||||
</pre>
|
||||
There are some legacy applications where the escape sequence \r in a pattern
|
||||
is expected to match a newline. If this option is set, \r in a pattern is
|
||||
converted to \n so that it matches a LF (linefeed) instead of a CR (carriage
|
||||
return) character. The option does not affect a literal CR in the pattern, nor
|
||||
does it affect CR specified as an explicit code point such as \x{0D}.
|
||||
<pre>
|
||||
PCRE2_EXTRA_MATCH_LINE
|
||||
</pre>
|
||||
This option is provided for use by the <b>-x</b> option of <b>pcre2grep</b>. It
|
||||
causes the pattern only to match complete lines. This is achieved by
|
||||
automatically inserting the code for "^(?:" at the start of the compiled
|
||||
pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set, the matched
|
||||
line may be in the middle of the subject string. This option can be used with
|
||||
PCRE2_LITERAL.
|
||||
<pre>
|
||||
PCRE2_EXTRA_MATCH_WORD
|
||||
</pre>
|
||||
This option is provided for use by the <b>-w</b> option of <b>pcre2grep</b>. It
|
||||
causes the pattern only to match strings that have a word boundary at the start
|
||||
and the end. This is achieved by automatically inserting the code for "\b(?:"
|
||||
at the start of the compiled pattern and ")\b" at the end. The option may be
|
||||
used with PCRE2_LITERAL. However, it is ignored if PCRE2_EXTRA_MATCH_LINE is
|
||||
also set.
|
||||
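The effect of PCRE2_EXTRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD can be pictured as wrapping the pattern text, as in the sketch below. This is only an approximation with hypothetical function names: PCRE2 inserts the equivalent compiled code rather than editing the pattern string, so a textual wrapper of this kind does not behave the same way with PCRE2_LITERAL, and it is assumed here that the pattern has no leading (*...) settings that must stay at the very start.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of pattern wrapped in prefix and suffix,
   or NULL if allocation fails. The caller frees the result. */
char *wrap_pattern(const char *prefix, const char *pattern, const char *suffix)
{
    assert(prefix != NULL && pattern != NULL && suffix != NULL);
    size_t n = strlen(prefix) + strlen(pattern) + strlen(suffix) + 1;
    char *out = malloc(n);
    if (out != NULL) snprintf(out, n, "%s%s%s", prefix, pattern, suffix);
    return out;
}

/* Textual picture of PCRE2_EXTRA_MATCH_LINE: ^(?:pattern)$ */
char *wrap_match_line(const char *pattern)
{
    return wrap_pattern("^(?:", pattern, ")$");
}

/* Textual picture of PCRE2_EXTRA_MATCH_WORD: \b(?:pattern)\b */
char *wrap_match_word(const char *pattern)
{
    return wrap_pattern("\\b(?:", pattern, ")\\b");
}
```

The non-capturing group matters: without it, an alternation such as cat|dog would bind the anchors to only the first and last alternatives.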
<pre>
|
||||
PCRE2_EXTRA_NO_BS0
|
||||
</pre>
|
||||
If this option is set (note that its final character is the digit 0) it locks
|
||||
out the use of the sequence \0 unless at least one more octal digit follows.
|
||||
<pre>
|
||||
PCRE2_EXTRA_PYTHON_OCTAL
|
||||
</pre>
|
||||
If this option is set, PCRE2 follows Python's rules for interpreting octal
|
||||
escape sequences. The rules for handling sequences such as \14, which could
|
||||
be an octal number or a backreference, are different. Details are given in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation.
|
||||
<pre>
|
||||
PCRE2_EXTRA_NEVER_CALLOUT
|
||||
</pre>
|
||||
If this option is set, PCRE2 treats callouts in the pattern as a syntax error,
|
||||
returning PCRE2_ERROR_CALLOUT_CALLER_DISABLED. This is useful if the application
|
||||
knows that a callout will not be provided to <b>pcre2_match()</b>, so that
|
||||
callouts in the pattern are not silently ignored.
|
||||
<pre>
|
||||
PCRE2_EXTRA_TURKISH_CASING
|
||||
</pre>
|
||||
This option alters case-equivalence of the 'i' letters to follow the
|
||||
alphabet used by Turkish and Azeri languages. The option can be changed within
|
||||
a pattern by the (*TURKISH_CASING) start-of-pattern setting. Either the UTF or
|
||||
UCP options must be set. In the 8-bit library, UTF must be set. This option
|
||||
cannot be combined with PCRE2_EXTRA_CASELESS_RESTRICT.
|
||||
<a name="jitcompiling"></a></P>
|
||||
<br><a name="SEC21" href="#TOC1">JUST-IN-TIME (JIT) COMPILATION</a><br>
|
||||
<P>
|
||||
<b>int pcre2_jit_compile(pcre2_code *<i>code</i>, uint32_t <i>options</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_jit_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
||||
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
||||
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_jit_stack *pcre2_jit_stack_create(size_t <i>startsize</i>,</b>
|
||||
<b> size_t <i>maxsize</i>, pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_jit_stack_assign(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> pcre2_jit_callback <i>callback_function</i>, void *<i>callback_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_jit_stack_free(pcre2_jit_stack *<i>jit_stack</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
These functions provide support for JIT compilation, which, if the just-in-time
|
||||
compiler is available, further processes a compiled pattern into machine code
|
||||
that executes much faster than the <b>pcre2_match()</b> interpretive matching
|
||||
function. Full details are given in the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
JIT compilation is a heavyweight optimization. It can take some time for
|
||||
patterns to be analyzed, and for one-off matches and simple patterns the
|
||||
benefit of faster execution might be offset by a much slower compilation time.
|
||||
Most (but not all) patterns can be optimized by the JIT compiler.
|
||||
<a name="localesupport"></a></P>
|
||||
<br><a name="SEC22" href="#TOC1">LOCALE SUPPORT</a><br>
|
||||
<P>
|
||||
<b>const uint8_t *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_maketables_free(pcre2_general_context *<i>gcontext</i>,</b>
|
||||
<b> const uint8_t *<i>tables</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
PCRE2 handles caseless matching, and determines whether characters are letters,
|
||||
digits, or whatever, by reference to a set of tables, indexed by character code
|
||||
point. However, this applies only to characters whose code points are less than
|
||||
256. By default, higher-valued code points never match escapes such as \w or
|
||||
\d.
|
||||
</P>
|
||||
<P>
|
||||
When PCRE2 is built with Unicode support (the default), certain Unicode
|
||||
character properties can be tested with \p and \P, or, alternatively, the
|
||||
PCRE2_UCP option can be set when a pattern is compiled; this causes \w and
|
||||
friends to use Unicode property support instead of the built-in tables.
|
||||
PCRE2_UCP also causes upper/lower casing operations on characters with code
|
||||
points greater than 127 to use Unicode properties. These effects apply even
|
||||
when PCRE2_UTF is not set. There are, however, some PCRE2_EXTRA options (see
|
||||
above) that can be used to modify or suppress them.
|
||||
</P>
|
||||
<P>
|
||||
The use of locales with Unicode is discouraged. If you are handling characters
|
||||
with code points greater than 127, you should either use Unicode support, or
|
||||
use locales, but not try to mix the two.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2 contains a built-in set of character tables that are used by default.
|
||||
These are sufficient for many applications. Normally, the internal tables
|
||||
recognize only ASCII characters. However, when PCRE2 is built, it is possible
|
||||
to cause the internal tables to be rebuilt in the default "C" locale of the
|
||||
local system, which may cause them to be different.
|
||||
</P>
|
||||
<P>
|
||||
The built-in tables can be overridden by tables supplied by the application
|
||||
that calls PCRE2. These may be created in a different locale from the default.
|
||||
As more and more applications change to using Unicode, the need for this locale
|
||||
support is expected to die away.
|
||||
</P>
|
||||
<P>
|
||||
External tables are built by calling the <b>pcre2_maketables()</b> function, in
|
||||
the relevant locale. The only argument to this function is a general context,
|
||||
which can be used to pass a custom memory allocator. If the argument is NULL,
|
||||
the system <b>malloc()</b> is used. The result can be passed to
|
||||
<b>pcre2_compile()</b> as often as necessary, by creating a compile context and
|
||||
calling <b>pcre2_set_character_tables()</b> to set the tables pointer therein.
|
||||
</P>
|
||||
<P>
|
||||
For example, to build and use tables that are appropriate for the French locale
|
||||
(where accented characters with values greater than 127 are treated as
|
||||
letters), the following code could be used:
|
||||
<pre>
|
||||
setlocale(LC_CTYPE, "fr_FR");
|
||||
tables = pcre2_maketables(NULL);
|
||||
ccontext = pcre2_compile_context_create(NULL);
|
||||
pcre2_set_character_tables(ccontext, tables);
|
||||
re = pcre2_compile(..., ccontext);
|
||||
</pre>
|
||||
The locale name "fr_FR" is used on Linux and other Unix-like systems; if you
|
||||
are using Windows, the name for the French locale is "french".
|
||||
</P>
|
||||
<P>
|
||||
The pointer that is passed (via the compile context) to <b>pcre2_compile()</b>
|
||||
is saved with the compiled pattern, and the same tables are used by the
|
||||
matching functions. Thus, for any single pattern, compilation and matching both
|
||||
happen in the same locale, but different patterns can be processed in different
|
||||
locales.
|
||||
</P>
|
||||
<P>
|
||||
It is the caller's responsibility to ensure that the memory containing the
|
||||
tables remains available while they are still in use. When they are no longer
|
||||
needed, you can discard them using <b>pcre2_maketables_free()</b>, which should
|
||||
pass as its first parameter the same general context that was used to create the
|
||||
tables.
|
||||
</P>
|
||||
<br><b>
|
||||
Saving locale tables
|
||||
</b><br>
|
||||
<P>
|
||||
The tables described above are just a sequence of binary bytes, which makes
|
||||
them independent of hardware characteristics such as endianness or whether the
|
||||
processor is 32-bit or 64-bit. A copy of the result of <b>pcre2_maketables()</b>
|
||||
can therefore be saved in a file or elsewhere and re-used later, even in a
|
||||
different program or on another computer. The size of the tables (number of
|
||||
bytes) must be obtained by calling <b>pcre2_config()</b> with the
|
||||
PCRE2_CONFIG_TABLES_LENGTH option because <b>pcre2_maketables()</b> does not
|
||||
return this value. Note that the <b>pcre2_dftables</b> program, which is part of
|
||||
the PCRE2 build system, can be used stand-alone to create a file that contains
|
||||
a set of binary tables. See the
|
||||
<a href="pcre2build.html#createtables"><b>pcre2build</b></a>
|
||||
documentation for details.
|
||||
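Because the tables are plain bytes, saving and restoring them needs nothing beyond standard C file I/O. The following is a minimal sketch with hypothetical function names. It assumes the caller has already obtained the table length from pcre2_config() with PCRE2_CONFIG_TABLES_LENGTH, as described above, and passes it in as length.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Write length bytes of tables to fp. Returns 0 on success, -1 on error.
   Hypothetical helper, not part of the PCRE2 API. */
int save_tables(FILE *fp, const uint8_t *tables, size_t length)
{
    assert(fp != NULL && tables != NULL);
    return fwrite(tables, 1, length, fp) == length ? 0 : -1;
}

/* Read length bytes from fp into a malloc()ed block. Returns NULL on error.
   Hypothetical helper, not part of the PCRE2 API. */
uint8_t *load_tables(FILE *fp, size_t length)
{
    assert(fp != NULL);
    uint8_t *tables = malloc(length);
    if (tables == NULL) return NULL;
    if (fread(tables, 1, length, fp) != length) {
        free(tables);
        return NULL;
    }
    return tables;
}
```

A block loaded in this way was allocated by the application with malloc(), so it is released with free(). It must stay allocated for as long as any pattern compiled with it is in use.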
<a name="infoaboutpattern"></a></P>
|
||||
<br><a name="SEC23" href="#TOC1">INFORMATION ABOUT A COMPILED PATTERN</a><br>
|
||||
<P>
|
||||
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>, void *<i>where</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
The <b>pcre2_pattern_info()</b> function returns general information about a
|
||||
compiled pattern. For information about callouts, see the
|
||||
<a href="#infoaboutcallouts">next section.</a>
|
||||
The first argument for <b>pcre2_pattern_info()</b> is a pointer to the compiled
|
||||
pattern. The second argument specifies which piece of information is required,
|
||||
and the third argument is a pointer to a variable to receive the data. If the
|
||||
third argument is NULL, the first argument is ignored, and the function returns
|
||||
the size in bytes of the variable that is required for the information
|
||||
requested. Otherwise, the yield of the function is zero for success, or one of
|
||||
the following negative numbers:
|
||||
<pre>
|
||||
PCRE2_ERROR_NULL the argument <i>code</i> was NULL
|
||||
PCRE2_ERROR_BADMAGIC the "magic number" was not found
|
||||
PCRE2_ERROR_BADOPTION the value of <i>what</i> was invalid
|
||||
PCRE2_ERROR_UNSET the requested field is not set
|
||||
</pre>
|
||||
The "magic number" is placed at the start of each compiled pattern as a simple
|
||||
check against passing an arbitrary memory pointer. Here is a typical call of
|
||||
<b>pcre2_pattern_info()</b>, to obtain the length of the compiled pattern:
|
||||
<pre>
|
||||
int rc;
|
||||
size_t length;
|
||||
rc = pcre2_pattern_info(
|
||||
re, /* result of pcre2_compile() */
|
||||
PCRE2_INFO_SIZE, /* what is required */
|
||||
&length); /* where to put the data */
|
||||
</pre>
|
||||
The possible values for the second argument are defined in <b>pcre2.h</b>, and
|
||||
are as follows:
|
||||
<pre>
|
||||
PCRE2_INFO_ALLOPTIONS
|
||||
PCRE2_INFO_ARGOPTIONS
|
||||
PCRE2_INFO_EXTRAOPTIONS
|
||||
</pre>
|
||||
Return copies of the pattern's options. The third argument should point to a
|
||||
<b>uint32_t</b> variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that
|
||||
were passed to <b>pcre2_compile()</b>, whereas PCRE2_INFO_ALLOPTIONS returns
|
||||
the compile options as modified by any top-level (*XXX) option settings such as
|
||||
(*UTF) at the start of the pattern itself. PCRE2_INFO_EXTRAOPTIONS returns the
|
||||
extra options that were set in the compile context by calling the
|
||||
pcre2_set_compile_extra_options() function.
|
||||
</P>
|
||||
<P>
|
||||
For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED
|
||||
option, the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED and PCRE2_UTF.
|
||||
Option settings such as (?i) that can change within a pattern do not affect the
|
||||
result of PCRE2_INFO_ALLOPTIONS, even if they appear right at the start of the
|
||||
pattern. (This was different in some earlier releases.)
|
||||
</P>
|
||||
<P>
|
||||
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
|
||||
the first significant item in every top-level branch is one of the following:
|
||||
<pre>
|
||||
^ unless PCRE2_MULTILINE is set
|
||||
\A always
|
||||
\G always
|
||||
.* sometimes - see below
|
||||
</pre>
|
||||
When .* is the first significant item, anchoring is possible only when all the
|
||||
following are true:
|
||||
<pre>
|
||||
.* is not in an atomic group
|
||||
.* is not in a capture group that is the subject of a backreference
|
||||
PCRE2_DOTALL is in force for .*
|
||||
Neither (*PRUNE) nor (*SKIP) appears in the pattern
|
||||
PCRE2_NO_DOTSTAR_ANCHOR is not set
|
||||
Dotstar anchoring has not been disabled with PCRE2_DOTSTAR_ANCHOR_OFF
|
||||
</pre>
|
||||
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
|
||||
options returned for PCRE2_INFO_ALLOPTIONS.
|
||||
<pre>
|
||||
PCRE2_INFO_BACKREFMAX
|
||||
</pre>
|
||||
Return the number of the highest backreference in the pattern. The third
|
||||
argument should point to a <b>uint32_t</b> variable. Named capture groups
|
||||
acquire numbers as well as names, and these count towards the highest
|
||||
backreference. Backreferences such as \4 or \g{12} match the captured
|
||||
characters of the given group, but in addition, the check that a capture
|
||||
group is set in a conditional group such as (?(3)a|b) is also a backreference.
|
||||
Zero is returned if there are no backreferences.
|
||||
<pre>
|
||||
PCRE2_INFO_BSR
|
||||
</pre>
|
||||
The output is a uint32_t integer whose value indicates what character sequences
|
||||
the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that \R
|
||||
matches any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means
|
||||
that \R matches only CR, LF, or CRLF.
|
||||
<pre>
|
||||
PCRE2_INFO_CAPTURECOUNT
|
||||
</pre>
|
||||
Return the highest capture group number in the pattern. In patterns where (?|
|
||||
is not used, this is also the total number of capture groups. The third
|
||||
argument should point to a <b>uint32_t</b> variable.
|
||||
<pre>
|
||||
PCRE2_INFO_DEPTHLIMIT
|
||||
</pre>
|
||||
If the pattern set a backtracking depth limit by including an item of the form
|
||||
(*LIMIT_DEPTH=nnnn) at the start, the value is returned. The third argument
|
||||
should point to a uint32_t integer. If no such value has been set, the call to
|
||||
<b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note that this
|
||||
limit will only be used during matching if it is less than the limit set or
|
||||
defaulted by the caller of the match function.
|
||||
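The rule that a limit set in the pattern applies only when it is lower than the caller's limit amounts to taking a minimum. The sketch below states it in code. The function name is hypothetical, and a NULL pointer stands for the case in which the pattern sets no limit, that is, where <b>pcre2_pattern_info()</b> would report PCRE2_ERROR_UNSET.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the limit that applies during matching. pattern_limit is NULL when
   the pattern itself sets no limit. Hypothetical helper, not PCRE2 code. */
uint32_t effective_limit(uint32_t caller_limit, const uint32_t *pattern_limit)
{
    if (pattern_limit != NULL && *pattern_limit < caller_limit)
        return *pattern_limit;   /* the pattern can only lower the limit */
    return caller_limit;
}
```

A pattern can therefore tighten a limit for itself but can never raise it above what the application allows.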
<pre>
|
||||
PCRE2_INFO_FIRSTBITMAP
|
||||
</pre>
|
||||
In the absence of a single first code unit for a non-anchored pattern,
|
||||
<b>pcre2_compile()</b> may construct a 256-bit table that defines a fixed set of
|
||||
values for the first code unit in any match. For example, a pattern that starts
|
||||
with [abc] results in a table with three bits set. When code unit values
|
||||
greater than 255 are supported, the flag bit for 255 means "any code unit of
|
||||
value 255 or above". If such a table was constructed, a pointer to it is
|
||||
returned. Otherwise NULL is returned. The third argument should point to a
|
||||
<b>const uint8_t *</b> variable.
|
||||
<pre>
|
||||
PCRE2_INFO_FIRSTCODETYPE
|
||||
</pre>
|
||||
Return information about the first code unit of any matched string, for a
|
||||
non-anchored pattern. The third argument should point to a <b>uint32_t</b>
|
||||
variable. If there is a fixed first value, for example, the letter "c" from a
|
||||
pattern such as (cat|cow|coyote), 1 is returned, and the value can be retrieved
|
||||
using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but it is
|
||||
known that a match can occur only at the start of the subject or following a
|
||||
newline in the subject, 2 is returned. Otherwise, and for anchored patterns, 0
|
||||
is returned.
|
||||
<pre>
|
||||
PCRE2_INFO_FIRSTCODEUNIT
|
||||
</pre>
|
||||
Return the value of the first code unit of any matched string for a pattern
|
||||
where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. The third
|
||||
argument should point to a <b>uint32_t</b> variable. In the 8-bit library, the
|
||||
value is always less than 256. In the 16-bit library the value can be up to
|
||||
0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
|
||||
and up to 0xffffffff when not using UTF-32 mode.
|
||||
<pre>
|
||||
PCRE2_INFO_FRAMESIZE
|
||||
</pre>
|
||||
Return the size (in bytes) of the data frames that are used to remember
|
||||
backtracking positions when the pattern is processed by <b>pcre2_match()</b>
|
||||
without the use of JIT. The third argument should point to a <b>size_t</b>
|
||||
variable. The frame size depends on the number of capturing parentheses in the
|
||||
pattern. Each additional capture group adds two PCRE2_SIZE variables.
|
||||
<pre>
|
||||
PCRE2_INFO_HASBACKSLASHC
|
||||
</pre>
|
||||
Return 1 if the pattern contains any instances of \C, otherwise 0. The third
|
||||
argument should point to a <b>uint32_t</b> variable.
|
||||
<pre>
|
||||
PCRE2_INFO_HASCRORLF
|
||||
</pre>
|
||||
Return 1 if the pattern contains any explicit matches for CR or LF characters,
|
||||
otherwise 0. The third argument should point to a <b>uint32_t</b> variable. An
|
||||
explicit match is either a literal CR or LF character, or \r or \n or one of
|
||||
the equivalent hexadecimal or octal escape sequences.
|
||||
<pre>
|
||||
PCRE2_INFO_HEAPLIMIT
|
||||
</pre>
|
||||
If the pattern set a heap memory limit by including an item of the form
|
||||
(*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argument
|
||||
should point to a uint32_t integer. If no such value has been set, the call to
|
||||
<b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note that this
|
||||
limit will only be used during matching if it is less than the limit set or
|
||||
defaulted by the caller of the match function.
|
||||
<pre>
|
||||
PCRE2_INFO_JCHANGED
|
||||
</pre>
|
||||
Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise
|
||||
0. The third argument should point to a <b>uint32_t</b> variable. (?J) and
|
||||
(?-J) set and unset the local PCRE2_DUPNAMES option, respectively.
|
||||
<pre>
|
||||
PCRE2_INFO_JITSIZE
|
||||
</pre>
|
||||
If the compiled pattern was successfully processed by
|
||||
<b>pcre2_jit_compile()</b>, return the size of the JIT compiled code, otherwise
|
||||
return zero. The third argument should point to a <b>size_t</b> variable.
|
||||
<pre>
|
||||
PCRE2_INFO_LASTCODETYPE
|
||||
</pre>
|
||||
Return 1 if there is a rightmost literal code unit that must exist in any
|
||||
matched string, other than at its start. The third argument should point to a
|
||||
<b>uint32_t</b> variable. If there is no such value, 0 is returned. When 1 is
|
||||
returned, the code unit value itself can be retrieved using
|
||||
PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
|
||||
recorded only if it follows something of variable length. For example, for the
|
||||
pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned from
|
||||
PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
|
||||
<pre>
|
||||
PCRE2_INFO_LASTCODEUNIT
|
||||
</pre>
|
||||
Return the value of the rightmost literal code unit that must exist in any
|
||||
matched string, other than at its start, for a pattern where
|
||||
PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argument
|
||||
should point to a <b>uint32_t</b> variable.
|
||||
<pre>
|
||||
PCRE2_INFO_MATCHEMPTY
|
||||
</pre>
|
||||
Return 1 if the pattern might match an empty string, otherwise 0. The third
|
||||
argument should point to a <b>uint32_t</b> variable. When a pattern contains
|
||||
recursive subroutine calls it is not always possible to determine whether or
|
||||
not it can match an empty string. PCRE2 takes a cautious approach and returns 1
|
||||
in such cases.
|
||||
<pre>
|
||||
PCRE2_INFO_MATCHLIMIT
|
||||
</pre>
|
||||
If the pattern set a match limit by including an item of the form
|
||||
(*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument
|
||||
should point to a uint32_t integer. If no such value has been set, the call to
|
||||
<b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note that this
|
||||
limit will only be used during matching if it is less than the limit set or
|
||||
defaulted by the caller of the match function.
|
||||
<pre>
|
||||
PCRE2_INFO_MAXLOOKBEHIND
|
||||
</pre>
|
||||
A lookbehind assertion moves back a certain number of characters (not code
|
||||
units) when it starts to process each of its branches. This request returns the
|
||||
largest of these backward moves. The third argument should point to a uint32_t
|
||||
integer. The simple assertions \b and \B require a one-character lookbehind
|
||||
and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
|
||||
longer. \A also registers a one-character lookbehind, though it does not
|
||||
actually inspect the previous character.
|
||||
</P>
|
||||
<P>
|
||||
Note that this information is useful for multi-segment matching only
|
||||
if the pattern contains no nested lookbehinds. For example, the pattern
|
||||
(?<=a(?<=ba)c) returns a maximum lookbehind of 2, but when it is processed, the
|
||||
first lookbehind moves back by two characters, matches one character, then the
|
||||
nested lookbehind also moves back by two characters. This puts the matching
|
||||
point three characters earlier than it was at the start.
|
||||
PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
|
||||
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||
documentation for a discussion of multi-segment matching.
|
||||
<pre>
|
||||
PCRE2_INFO_MINLENGTH
|
||||
</pre>
|
||||
If a minimum length for matching subject strings was computed, its value is
|
||||
returned. Otherwise the returned value is 0. This value is not computed when
|
||||
PCRE2_NO_START_OPTIMIZE is set. The value is a number of characters, which in
|
||||
UTF mode may be different from the number of code units. The third argument
|
||||
should point to a <b>uint32_t</b> variable. The value is a lower bound to the
|
||||
length of any matching string. There may not be any strings of that length that
|
||||
do actually match, but every string that does match is at least that long.
|
||||
<pre>
|
||||
PCRE2_INFO_NAMECOUNT
|
||||
PCRE2_INFO_NAMEENTRYSIZE
|
||||
PCRE2_INFO_NAMETABLE
|
||||
</pre>
|
||||
PCRE2 supports the use of named as well as numbered capturing parentheses. The
|
||||
names are just an additional way of identifying the parentheses, which still
|
||||
acquire numbers. Several convenience functions such as
|
||||
<b>pcre2_substring_get_byname()</b> are provided for extracting captured
|
||||
substrings by name. It is also possible to extract the data directly, by first
|
||||
converting the name to a number in order to access the correct pointers in the
|
||||
output vector (described with <b>pcre2_match()</b> below). To do the conversion,
|
||||
you need to use the name-to-number map, which is described by these three
|
||||
values.
|
||||
</P>
|
||||
<P>
|
||||
The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives
|
||||
the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each
|
||||
entry in code units; both of these return a <b>uint32_t</b> value. The entry
|
||||
size depends on the length of the longest name.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is
|
||||
a PCRE2_SPTR pointer to a block of code units. In the 8-bit library, the first
|
||||
two bytes of each entry are the number of the capturing parenthesis, most
|
||||
significant byte first. In the 16-bit library, the pointer points to 16-bit
|
||||
code units, the first of which contains the parenthesis number. In the 32-bit
|
||||
library, the pointer points to 32-bit code units, the first of which contains
|
||||
the parenthesis number. The rest of the entry is the corresponding name, zero
|
||||
terminated.
|
||||
</P>
|
||||
<P>
|
||||
The names are in alphabetical order. If (?| is used to create multiple capture
|
||||
groups with the same number, as described in the
|
||||
<a href="pcre2pattern.html#dupgroupnumber">section on duplicate group numbers</a>
|
||||
in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
page, the groups may be given the same name, but there is only one entry in the
|
||||
table. Different names for groups of the same number are not permitted.
|
||||
</P>
|
||||
<P>
|
||||
Duplicate names for capture groups with different numbers are permitted, but
|
||||
only if PCRE2_DUPNAMES is set. They appear in the table in the order in which
|
||||
they were found in the pattern. In the absence of (?| this is the order of
|
||||
increasing number; when (?| is used this is not necessarily the case because
|
||||
later capture groups may have lower numbers.
|
||||
</P>
|
||||
<P>
|
||||
As a simple example of the name/number table, consider the following pattern
|
||||
after compilation by the 8-bit library (assume PCRE2_EXTENDED is set, so white
|
||||
space - including newlines - is ignored):
|
||||
<pre>
|
||||
(?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) )
|
||||
</pre>
|
||||
There are four named capture groups, so the table has four entries, and each
|
||||
entry in the table is eight bytes long. The table is as follows, with
|
||||
non-printing bytes shown in hexadecimal, and undefined bytes shown as ??:
|
||||
<pre>
|
||||
00 01 d a t e 00 ??
|
||||
00 05 d a y 00 ?? ??
|
||||
00 04 m o n t h 00
|
||||
00 02 y e a r 00 ??
|
||||
</pre>
|
||||
When writing code to extract data from named capture groups using the
|
||||
name-to-number map, remember that the length of the entries is likely to be
|
||||
different for each compiled pattern.
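As a minimal sketch of using the map directly, the following C fragment
searches an 8-bit name table for a name and returns the group number. The
helper name_to_number() and the array example_table are hypothetical names
used only for illustration. In a real program the table pointer, the entry
count, and the entry size would come from the three PCRE2_INFO_NAME* requests
described above. The undefined bytes of the example table are set to zero here.

```c
#include <stdint.h>
#include <string.h>

/* Look up a group name in an 8-bit name table laid out as described above:
   each entry is entry_size bytes long and starts with a two-byte group
   number (most significant byte first) followed by the zero-terminated name.
   Returns the group number, or -1 if the name is not in the table.
   name_to_number() is a hypothetical helper, not a PCRE2 function. */
static int name_to_number(const uint8_t *table, uint32_t name_count,
                          uint32_t entry_size, const char *name)
{
  for (uint32_t i = 0; i < name_count; i++)
    {
    const uint8_t *entry = table + (size_t)i * entry_size;
    if (strcmp((const char *)(entry + 2), name) == 0)
      return (entry[0] << 8) | entry[1];
    }
  return -1;
}

/* The four-entry table from the example; undefined bytes are zero here. */
static const uint8_t example_table[4 * 8] = {
  0, 1, 'd', 'a', 't', 'e', 0, 0,
  0, 5, 'd', 'a', 'y', 0, 0, 0,
  0, 4, 'm', 'o', 'n', 't', 'h', 0,
  0, 2, 'y', 'e', 'a', 'r', 0, 0
};
```

A linear search is used for simplicity. The entry size is always passed in
rather than assumed, because it differs between patterns. If PCRE2_DUPNAMES is
in use, a name may occupy more than one entry, and this sketch returns only the
first of them.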
|
||||
<pre>
|
||||
PCRE2_INFO_NEWLINE
|
||||
</pre>
|
||||
The output is one of the following <b>uint32_t</b> values:
|
||||
<pre>
|
||||
PCRE2_NEWLINE_CR Carriage return (CR)
|
||||
PCRE2_NEWLINE_LF Linefeed (LF)
|
||||
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
|
||||
PCRE2_NEWLINE_ANY Any Unicode line ending
|
||||
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
|
||||
PCRE2_NEWLINE_NUL The NUL character (binary zero)
|
||||
</pre>
|
||||
This identifies the character sequence that will be recognized as meaning
|
||||
"newline" while matching.
|
||||
<pre>
|
||||
PCRE2_INFO_SIZE
|
||||
</pre>
|
||||
Return the size of the compiled pattern in bytes (for all three libraries). The
|
||||
third argument should point to a <b>size_t</b> variable. This value includes the
|
||||
size of the general data block that precedes the code units of the compiled
|
||||
pattern itself. The value that is used when <b>pcre2_compile()</b> is getting
|
||||
memory in which to place the compiled pattern may be slightly larger than the
|
||||
value returned by this option, because there are cases where the code that
|
||||
calculates the size has to over-estimate. Processing a pattern with the JIT
|
||||
compiler does not alter the value returned by this option.
|
||||
<a name="infoaboutcallouts"></a></P>
|
||||
<br><a name="SEC24" href="#TOC1">INFORMATION ABOUT A PATTERN'S CALLOUTS</a><br>
|
||||
<P>
|
||||
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
|
||||
<b> int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
|
||||
<b> void *<i>user_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
A script language that supports the use of string arguments in callouts might
|
||||
like to scan all the callouts in a pattern before running the match. This can
|
||||
be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
|
||||
pointer to a compiled pattern, the second points to a callback function, and
|
||||
the third is arbitrary user data. The callback function is called for every
|
||||
callout in the pattern in the order in which they appear. Its first argument is
|
||||
a pointer to a callout enumeration block, and its second argument is the
|
||||
<i>user_data</i> value that was passed to <b>pcre2_callout_enumerate()</b>. The
|
||||
contents of the callout enumeration block are described in the
|
||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||
documentation, which also gives further details about callouts.
|
||||
</P>
|
||||
<br><a name="SEC25" href="#TOC1">SERIALIZATION AND PRECOMPILING</a><br>
|
||||
<P>
|
||||
It is possible to save compiled patterns on disc or elsewhere, and reload them
|
||||
later, subject to a number of restrictions. The host on which the patterns are
|
||||
reloaded must be running the same version of PCRE2, with the same code unit
|
||||
width, and must also have the same endianness, pointer width, and PCRE2_SIZE
|
||||
type. Before compiled patterns can be saved, they must be converted to a
|
||||
"serialized" form, which in the case of PCRE2 is really just a bytecode dump.
|
||||
The functions whose names begin with <b>pcre2_serialize_</b> are used for
|
||||
converting to and from the serialized form. They are described in the
|
||||
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
|
||||
documentation. Note that PCRE2 serialization does not convert compiled patterns
|
||||
to an abstract format like Java or .NET serialization.
|
||||
<a name="matchdatablock"></a></P>
|
||||
<br><a name="SEC26" href="#TOC1">THE MATCH DATA BLOCK</a><br>
|
||||
<P>
|
||||
<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_match_data *pcre2_match_data_create_from_pattern(</b>
|
||||
<b> const pcre2_code *<i>code</i>, pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_match_data_free(pcre2_match_data *<i>match_data</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
Information about a successful or unsuccessful match is placed in a match
|
||||
data block, which is an opaque structure that is accessed by function calls. In
|
||||
particular, the match data block contains a vector of offsets into the subject
|
||||
string that define the matched parts of the subject. This is known as the
|
||||
<i>ovector</i>.
|
||||
</P>
|
||||
<P>
|
||||
Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or
|
||||
<b>pcre2_jit_match()</b> you must create a match data block by calling one of
|
||||
the creation functions above. For <b>pcre2_match_data_create()</b>, the first
|
||||
argument is the number of pairs of offsets in the <i>ovector</i>.
|
||||
</P>
|
||||
<P>
|
||||
When using <b>pcre2_match()</b>, one pair of offsets is required to identify the
|
||||
string that matched the whole pattern, with an additional pair for each
|
||||
captured substring. For example, a value of 4 creates enough space to record
|
||||
the matched portion of the subject plus three captured substrings.
|
||||
</P>
|
||||
<P>
|
||||
When using <b>pcre2_dfa_match()</b> there may be multiple matched substrings of
|
||||
different lengths at the same point in the subject. The ovector should be made
|
||||
large enough to hold as many as are expected.
|
||||
</P>
|
||||
<P>
|
||||
A minimum of 1 pair is imposed by <b>pcre2_match_data_create()</b>, so
|
||||
it is always possible to return the overall matched string in the case of
|
||||
<b>pcre2_match()</b> or the longest match in the case of
|
||||
<b>pcre2_dfa_match()</b>. The maximum number of pairs is 65535; if the first
|
||||
argument of <b>pcre2_match_data_create()</b> is greater than this, 65535 is
|
||||
used.
|
||||
</P>
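<P>
The sizing rule for <b>pcre2_match()</b> can be sketched as a small
calculation: one pair for the whole match, one more for each capture group,
kept within the range 1 to 65535. The helper pairs_for_captures() is a
hypothetical name, and its argument is assumed to be a capture group count
that the application has already obtained for the pattern.
</P>

```c
#include <stdint.h>

/* Number of ovector pairs to request when matching with pcre2_match():
   one pair for the whole match plus one per capture group, limited to the
   maximum of 65535 pairs. pairs_for_captures() is a hypothetical helper,
   not part of the PCRE2 API. */
static uint32_t pairs_for_captures(uint32_t capture_count)
{
  return (capture_count >= 65535u) ? 65535u : capture_count + 1u;
}
```

<P>
The result is suitable as the first argument of
<b>pcre2_match_data_create()</b>. Testing the count before adding one avoids
any possibility of arithmetic wrap-around for very large inputs.
</P>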
|
||||
<P>
|
||||
The second argument of <b>pcre2_match_data_create()</b> is a pointer to a
|
||||
general context, which can specify custom memory management for obtaining the
|
||||
memory for the match data block. If you are not using custom memory management,
|
||||
pass NULL, which causes <b>malloc()</b> to be used.
|
||||
</P>
|
||||
<P>
|
||||
For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a
|
||||
pointer to a compiled pattern. The ovector is created to be exactly the right
|
||||
size to hold all the substrings a pattern might capture when matched using
|
||||
<b>pcre2_match()</b>. You should not use this call when matching with
|
||||
<b>pcre2_dfa_match()</b>. The second argument is again a pointer to a general
|
||||
context, but in this case if NULL is passed, the memory is obtained using the
|
||||
same allocator that was used for the compiled pattern (custom or default).
|
||||
</P>
|
||||
<P>
|
||||
A match data block can be used many times, with the same or different compiled
|
||||
patterns. You can extract information from a match data block after a match
|
||||
operation has finished, using functions that are described in the sections on
|
||||
<a href="#matchedstrings">matched strings</a>
|
||||
and
|
||||
<a href="#matchotherdata">other match data</a>
|
||||
below.
|
||||
</P>
|
||||
<P>
|
||||
When a call of <b>pcre2_match()</b> fails, valid data is available in the match data
|
||||
block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ERROR_PARTIAL, or one
|
||||
of the error codes for an invalid UTF string. Exactly what is available depends
|
||||
on the error, and is detailed below.
|
||||
</P>
|
||||
<P>
|
||||
When one of the matching functions is called, pointers to the compiled pattern
|
||||
and the subject string are set in the match data block so that they can be
|
||||
referenced by the extraction functions after a successful match. After running
|
||||
a match, you must not free a compiled pattern or a subject string until after
|
||||
all operations on the match data block (for that match) have taken place,
|
||||
unless, in the case of the subject string, you have used the
|
||||
PCRE2_COPY_MATCHED_SUBJECT option, which is described in the section entitled
|
||||
"Option bits for <b>pcre2_match()</b>"
|
||||
<a href="#matchoptions">below.</a>
|
||||
</P>
|
||||
<P>
|
||||
When a match data block itself is no longer needed, it should be freed by
|
||||
calling <b>pcre2_match_data_free()</b>. If this function is called with a NULL
|
||||
argument, it returns immediately, without doing anything.
|
||||
</P>
|
||||
<br><a name="SEC27" href="#TOC1">MEMORY USE FOR MATCH DATA BLOCKS</a><br>
|
||||
<P>
|
||||
<b>PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *<i>match_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>PCRE2_SIZE pcre2_get_match_data_heapframes_size(</b>
|
||||
<b> pcre2_match_data *<i>match_data</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
The size of a match data block depends on the size of the ovector that it
|
||||
contains. The function <b>pcre2_get_match_data_size()</b> returns the size, in
|
||||
bytes, of the block that is its argument.
|
||||
</P>
|
||||
<P>
|
||||
When <b>pcre2_match()</b> runs interpretively (that is, without using JIT), it
|
||||
makes use of a vector of data frames for remembering backtracking positions.
|
||||
The size of each individual frame depends on the number of capturing
|
||||
parentheses in the pattern and can be obtained by calling
|
||||
<b>pcre2_pattern_info()</b> with the PCRE2_INFO_FRAMESIZE option (see the
|
||||
section entitled "Information about a compiled pattern"
|
||||
<a href="#infoaboutpattern">above).</a>
|
||||
</P>
|
||||
<P>
|
||||
Heap memory is used for the frames vector; if the initial memory block turns
|
||||
out to be too small during matching, it is automatically expanded. When
|
||||
<b>pcre2_match()</b> returns, the memory is not freed, but remains attached to
|
||||
the match data block, for use by any subsequent matches that use the same
|
||||
block. It is automatically freed when the match data block itself is freed.
|
||||
</P>
|
||||
<P>
|
||||
You can find the current size of the frames vector that a match data block owns
|
||||
by calling <b>pcre2_get_match_data_heapframes_size()</b>. For a newly created
|
||||
match data block the size will be zero. Some types of match may require a lot
|
||||
of frames and thus a large vector; applications that run in environments where
|
||||
memory is constrained can check this and free the match data block if the heap
|
||||
frames vector has become too big.
|
||||
</P>
|
||||
<br><a name="SEC28" href="#TOC1">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a><br>
|
||||
<P>
|
||||
<b>int pcre2_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
||||
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
||||
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
The function <b>pcre2_match()</b> is called to match a subject string against a
|
||||
compiled pattern, which is passed in the <i>code</i> argument. You can call
|
||||
<b>pcre2_match()</b> with the same <i>code</i> argument as many times as you
|
||||
like, in order to find multiple matches in the subject string or to match
|
||||
different subject strings with the same pattern.
|
||||
</P>
|
||||
<P>
|
||||
This function is the main matching facility of the library, and it operates in
|
||||
a Perl-like manner. For specialist use there is also an alternative matching
|
||||
function, which is described
|
||||
<a href="#dfamatch">below</a>
|
||||
in the section about the <b>pcre2_dfa_match()</b> function.
|
||||
</P>
|
||||
<P>
|
||||
Here is an example of a simple call to <b>pcre2_match()</b>:
|
||||
<pre>
|
||||
pcre2_match_data *md = pcre2_match_data_create(4, NULL);
|
||||
int rc = pcre2_match(
|
||||
re, /* result of pcre2_compile() */
|
||||
"some string", /* the subject string */
|
||||
11, /* the length of the subject string */
|
||||
0, /* start at offset 0 in the subject */
|
||||
0, /* default options */
|
||||
md, /* the match data block */
|
||||
NULL); /* a match context; NULL means use defaults */
|
||||
</pre>
|
||||
If the subject string is zero-terminated, the length can be given as
|
||||
PCRE2_ZERO_TERMINATED. A match context must be provided if certain less common
|
||||
matching parameters are to be changed. For details, see the section on
|
||||
<a href="#matchcontext">the match context</a>
|
||||
above.
|
||||
</P>
|
||||
<br><b>
|
||||
The string to be matched by <b>pcre2_match()</b>
|
||||
</b><br>
|
||||
<P>
|
||||
The subject string is passed to <b>pcre2_match()</b> as a pointer in
|
||||
<i>subject</i>, a length in <i>length</i>, and a starting offset in
|
||||
<i>startoffset</i>. The length and offset are in code units, not characters.
|
||||
That is, they are in bytes for the 8-bit library, 16-bit code units for the
|
||||
16-bit library, and 32-bit code units for the 32-bit library, whether or not
|
||||
UTF processing is enabled. As a special case, if <i>subject</i> is NULL and
|
||||
<i>length</i> is zero, the subject is assumed to be an empty string. If
|
||||
<i>length</i> is non-zero, an error occurs if <i>subject</i> is NULL.
|
||||
</P>
|
||||
<P>
|
||||
If <i>startoffset</i> is greater than the length of the subject,
|
||||
<b>pcre2_match()</b> returns PCRE2_ERROR_BADOFFSET. When the starting offset is
|
||||
zero, the search for a match starts at the beginning of the subject, and this
|
||||
is by far the most common case. In UTF-8 or UTF-16 mode, the starting offset
|
||||
must point to the start of a character, or to the end of the subject (in UTF-32
|
||||
mode, one code unit equals one character, so all offsets are valid). Like the
|
||||
pattern string, the subject may contain binary zeros.
|
||||
</P>
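<P>
An application that computes its own starting offsets for an 8-bit UTF-8
subject may want to check them before calling <b>pcre2_match()</b>. The
following sketch applies the two rules just described. The function name
utf8_offset_ok() is hypothetical, and the code assumes that the subject is
otherwise valid UTF-8; it tests only the offset.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if start_offset is acceptable for a UTF-8 subject of the given
   length in bytes: it must not be greater than the length, and it must be
   either the end of the subject or the first byte of a character.
   utf8_offset_ok() is a hypothetical helper; it does not validate the
   subject itself. */
static int utf8_offset_ok(const uint8_t *subject, size_t length,
                          size_t start_offset)
{
  if (start_offset > length) return 0;    /* the PCRE2_ERROR_BADOFFSET case */
  if (start_offset == length) return 1;   /* the end of the subject is valid */
  /* Continuation bytes have the bit pattern 10xxxxxx. */
  return (subject[start_offset] & 0xC0) != 0x80;
}
```

<P>
For the four-byte subject "a", 0xC3, 0xA9, "b", offsets 0, 1, 3, and 4 are
accepted, whereas offset 2 points into the middle of a two-byte character and
offset 5 is beyond the end.
</P>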
|
||||
<P>
|
||||
A non-zero starting offset is useful when searching for another match in the
|
||||
same subject by calling <b>pcre2_match()</b> again after a previous success.
|
||||
Setting <i>startoffset</i> differs from passing over a shortened string and
|
||||
setting PCRE2_NOTBOL in the case of a pattern that begins with any kind of
|
||||
lookbehind. For example, consider the pattern
|
||||
<pre>
|
||||
\Biss\B
|
||||
</pre>
|
||||
which finds occurrences of "iss" in the middle of words. (\B matches only if
|
||||
the current position in the subject is not a word boundary.) When applied to
|
||||
the string "Mississippi" the first call to <b>pcre2_match()</b> finds the first
|
||||
occurrence. If <b>pcre2_match()</b> is called again with just the remainder of
|
||||
the subject, namely "issippi", it does not match, because \B is always false
|
||||
at the start of the subject, which is deemed to be a word boundary. However, if
|
||||
<b>pcre2_match()</b> is passed the entire string again, but with
|
||||
<i>startoffset</i> set to 4, it finds the second occurrence of "iss" because it
|
||||
is able to look behind the starting point to discover that it is preceded by a
|
||||
letter.
|
||||
</P>
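<P>
The difference can be illustrated without calling the library at all. The
following sketch is a hand-written test for \B, assuming ASCII word characters
only; is_word_char() and not_word_boundary() are hypothetical names, and this
is not how PCRE2 itself is implemented. The point is that the test needs to see
the character before the current position, which is possible only when the
whole subject is passed.
</P>

```c
#include <ctype.h>
#include <stddef.h>

/* ASCII-only word character test (letters, digits, underscore). */
static int is_word_char(unsigned char c)
{
  return isalnum(c) || c == '_';
}

/* Returns 1 if \B would succeed at position pos in the subject, that is, if
   the characters on either side of pos are both word characters or both
   non-word characters. The start and end of the subject count as non-word.
   A hypothetical illustration, not PCRE2 code. */
static int not_word_boundary(const char *subject, size_t length, size_t pos)
{
  int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
  int after = pos < length && is_word_char((unsigned char)subject[pos]);
  return before == after;
}
```

<P>
With the subject "Mississippi" and a position of 4, the preceding "s" is
visible and the test succeeds. With the shortened subject "issippi" and a
position of 0, nothing precedes the "i", so the position is a word boundary and
the test fails.
</P>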
|
||||
<P>
|
||||
Finding all the matches in a subject is tricky when the pattern can match an
|
||||
empty string. It is possible to emulate Perl's /g behaviour by first trying the
|
||||
match again at the same offset, with the PCRE2_NOTEMPTY_ATSTART and
|
||||
PCRE2_ANCHORED options, and then if that fails, advancing the starting offset
|
||||
and trying an ordinary match again. There is some code that demonstrates how to
|
||||
do this in the
|
||||
<a href="pcre2demo.html"><b>pcre2demo</b></a>
|
||||
sample program. In the most general case, you have to check to see if the
|
||||
newline convention recognizes CRLF as a newline, and if so, and the current
|
||||
character is CR followed by LF, advance the starting offset by two characters
|
||||
instead of one.
|
||||
</P>
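<P>
The step of advancing the starting offset after a failed retry can be sketched
as follows for the 8-bit library. The function advance_start() and its flag
arguments are hypothetical; the caller is assumed to have worked out already
whether the newline convention recognizes CRLF and whether the pattern was
compiled in UTF mode. In UTF-8 mode the sketch also steps over continuation
bytes so that the new offset is at the start of a character.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the next starting offset to try after an empty match at offset
   could not be extended. Advances over CRLF as a unit when that is a
   recognized newline, otherwise by one character (one byte, plus any UTF-8
   continuation bytes when utf8 is non-zero). advance_start() is a
   hypothetical helper for illustration. */
static size_t advance_start(const uint8_t *subject, size_t length,
                            size_t offset, int crlf_is_newline, int utf8)
{
  if (offset >= length) return length;    /* nothing left to advance over */
  if (crlf_is_newline && subject[offset] == '\r' &&
      offset + 1 < length && subject[offset + 1] == '\n')
    return offset + 2;
  offset++;
  if (utf8)
    while (offset < length && (subject[offset] & 0xC0) == 0x80) offset++;
  return offset;
}
```

<P>
A caller would stop the loop when the offset that failed was already at the end
of the subject, because there is then nowhere further to advance.
</P>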
|
||||
<P>
|
||||
If a non-zero starting offset is passed when the pattern is anchored, a single
|
||||
attempt to match at the given offset is made. This can only succeed if the
|
||||
pattern does not require the match to be at the start of the subject. In other
|
||||
words, the anchoring must be the result of setting the PCRE2_ANCHORED option or
|
||||
the use of .* with PCRE2_DOTALL, not by starting the pattern with ^ or \A.
|
||||
<a name="matchoptions"></a></P>
|
||||
<br><b>
|
||||
Option bits for <b>pcre2_match()</b>
|
||||
</b><br>
|
||||
<P>
|
||||
The unused bits of the <i>options</i> argument for <b>pcre2_match()</b> must be
|
||||
zero. The only bits that may be set are PCRE2_ANCHORED,
|
||||
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_DISABLE_RECURSELOOP_CHECK, PCRE2_ENDANCHORED,
|
||||
PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
|
||||
PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT.
|
||||
Their action is described below.
|
||||
</P>
|
||||
<P>
|
||||
Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not supported by
|
||||
the just-in-time (JIT) compiler. If it is set, JIT matching is disabled and the
|
||||
interpretive code in <b>pcre2_match()</b> is run.
|
||||
PCRE2_DISABLE_RECURSELOOP_CHECK is ignored by JIT, but apart from PCRE2_NO_JIT
|
||||
(obviously), the remaining options are supported for JIT matching.
|
||||
<pre>
|
||||
PCRE2_ANCHORED
|
||||
</pre>
|
||||
The PCRE2_ANCHORED option limits <b>pcre2_match()</b> to matching at the first
|
||||
matching position. If a pattern was compiled with PCRE2_ANCHORED, or turned out
|
||||
to be anchored by virtue of its contents, it cannot be made unanchored at
|
||||
matching time. Note that setting the option at match time disables JIT
|
||||
matching.
|
||||
<pre>
|
||||
PCRE2_COPY_MATCHED_SUBJECT
|
||||
</pre>
|
||||
By default, a pointer to the subject is remembered in the match data block so
|
||||
that, after a successful match, it can be referenced by the substring
|
||||
extraction functions. This means that the subject's memory must not be freed
|
||||
until all such operations are complete. For some applications where the
|
||||
lifetime of the subject string is not guaranteed, it may be necessary to make a
|
||||
copy of the subject string, but it is wasteful to do this unless the match is
|
||||
successful. After a successful match, if PCRE2_COPY_MATCHED_SUBJECT is set, the
|
||||
subject is copied and the new pointer is remembered in the match data block
|
||||
instead of the original subject pointer. The memory allocator that was used for
|
||||
the match block itself is used. The copy is automatically freed when
|
||||
<b>pcre2_match_data_free()</b> is called to free the match data block. It is also
|
||||
automatically freed if the match data block is re-used for another match
|
||||
operation.
|
||||
<pre>
|
||||
PCRE2_DISABLE_RECURSELOOP_CHECK
|
||||
</pre>
|
||||
This option is relevant only to <b>pcre2_match()</b> for interpretive matching.
|
||||
It is ignored when JIT is used, and is forbidden for <b>pcre2_dfa_match()</b>.
|
||||
</P>
|
||||
<P>
|
||||
The use of recursion in patterns can lead to infinite loops. In the
|
||||
interpretive matcher these would be eventually caught by the match or heap
|
||||
limits, but this could take a long time and/or use a lot of memory if the
|
||||
limits are large. There is therefore a check at the start of each recursion.
|
||||
If the same group is still active from a previous call, and the current subject
|
||||
pointer is the same as it was at the start of that group, and the furthest
|
||||
inspected character of the subject has not changed, an error is generated.
|
||||
</P>
|
||||
<P>
|
||||
There are rare cases of matches that would complete, but nevertheless trigger
|
||||
this error. This option disables the check. It is provided mainly for testing
|
||||
when comparing JIT and interpretive behaviour.
|
||||
<pre>
|
||||
PCRE2_ENDANCHORED
|
||||
</pre>
|
||||
If the PCRE2_ENDANCHORED option is set, any string that <b>pcre2_match()</b>
|
||||
matches must be right at the end of the subject string. Note that setting the
|
||||
option at match time disables JIT matching.
|
||||
<pre>
|
||||
PCRE2_NOTBOL
|
||||
</pre>
|
||||
This option specifies that the first character of the subject string is not the
|
||||
beginning of a line, so the circumflex metacharacter should not match before
|
||||
it. Setting this without having set PCRE2_MULTILINE at compile time causes
|
||||
circumflex never to match. This option affects only the behaviour of the
|
||||
circumflex metacharacter. It does not affect \A.
|
||||
<pre>
|
||||
PCRE2_NOTEOL
|
||||
</pre>
|
||||
This option specifies that the end of the subject string is not the end of a
|
||||
line, so the dollar metacharacter should not match it nor (except in multiline
|
||||
mode) a newline immediately before it. Setting this without having set
|
||||
PCRE2_MULTILINE at compile time causes dollar never to match. This option
|
||||
affects only the behaviour of the dollar metacharacter. It does not affect \Z
|
||||
or \z.
|
||||
<pre>
|
||||
PCRE2_NOTEMPTY
|
||||
</pre>
|
||||
An empty string is not considered to be a valid match if this option is set. If
|
||||
there are alternatives in the pattern, they are tried. If all the alternatives
|
||||
match the empty string, the entire match fails. For example, if the pattern
|
||||
<pre>
|
||||
a?b?
|
||||
</pre>
|
||||
is applied to a string not beginning with "a" or "b", it matches an empty
|
||||
string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
|
||||
valid, so <b>pcre2_match()</b> searches further into the string for occurrences
|
||||
of "a" or "b".
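The effect can be imitated with a hand-written matcher for this one pattern.
The sketch below is not PCRE2 code, and the names match_a_opt_b_opt() and
find_a_opt_b_opt() are hypothetical; it only shows how rejecting an empty match
causes the search to move on to later positions.

```c
#include <stddef.h>

/* Length of the greedy match of a?b? at position pos (may be zero). */
static size_t match_a_opt_b_opt(const char *s, size_t length, size_t pos)
{
  size_t p = pos;
  if (p < length && s[p] == 'a') p++;
  if (p < length && s[p] == 'b') p++;
  return p - pos;
}

/* Finds the first match of a?b? in s. If notempty is non-zero, empty
   matches are rejected, imitating PCRE2_NOTEMPTY. Returns 1 and sets
   *start and *match_length on success, or returns 0 if there is no match.
   A hypothetical illustration for this single pattern only. */
static int find_a_opt_b_opt(const char *s, size_t length, int notempty,
                            size_t *start, size_t *match_length)
{
  for (size_t pos = 0; pos <= length; pos++)
    {
    size_t n = match_a_opt_b_opt(s, length, pos);
    if (n == 0 && notempty) continue;     /* empty match is not acceptable */
    *start = pos;
    *match_length = n;
    return 1;
    }
  return 0;
}
```

For the subject "xxab", the search without the restriction stops at offset 0
with an empty match, whereas with the restriction it finds "ab" at offset 2.
For a subject such as "xyz" the restricted search fails altogether.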
|
||||
<pre>
|
||||
PCRE2_NOTEMPTY_ATSTART
|
||||
</pre>
|
||||
This is like PCRE2_NOTEMPTY, except that it locks out an empty string match
|
||||
only at the first matching position, that is, at the start of the subject plus
|
||||
the starting offset. An empty string match later in the subject is permitted.
|
||||
If the pattern is anchored, such a match can occur only if the pattern contains
|
||||
\K.
|
||||
<pre>
|
||||
PCRE2_NO_JIT
|
||||
</pre>
|
||||
By default, if a pattern has been successfully processed by
|
||||
<b>pcre2_jit_compile()</b>, JIT is automatically used when <b>pcre2_match()</b>
|
||||
is called with options that JIT supports. Setting PCRE2_NO_JIT disables the use
|
||||
of JIT; it forces matching to be done by the interpreter.
|
||||
<pre>
|
||||
PCRE2_NO_UTF_CHECK
|
||||
</pre>
|
||||
When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
|
||||
string is checked unless PCRE2_NO_UTF_CHECK is passed to <b>pcre2_match()</b> or
|
||||
PCRE2_MATCH_INVALID_UTF was passed to <b>pcre2_compile()</b>. The latter special
|
||||
case is discussed in detail in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
In the default case, if a non-zero starting offset is given, the check is
|
||||
applied only to that part of the subject that could be inspected during
|
||||
matching, and there is a check that the starting offset points to the first
|
||||
code unit of a character or to the end of the subject. If there are no
|
||||
lookbehind assertions in the pattern, the check starts at the starting offset.
|
||||
Otherwise, it starts at the length of the longest lookbehind before the
|
||||
starting offset, or at the start of the subject if there are not that many
|
||||
characters before the starting offset. Note that the sequences \b and \B are
|
||||
one-character lookbehinds.
|
||||
</P>
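<P>
The rule for where the check begins can be sketched for UTF-8 as follows. This
is an illustration of the description above, not the library's own code; the
name utf8_check_start() is hypothetical, and max_lookbehind is assumed to be
the value, in characters, that PCRE2_INFO_MAXLOOKBEHIND reports for the
pattern.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the byte offset at which a validity check of a UTF-8 subject
   would begin: max_lookbehind characters before start_offset, or the start
   of the subject if there are fewer characters than that. Characters are
   counted by skipping backwards over continuation bytes (10xxxxxx).
   utf8_check_start() is a hypothetical helper, not a PCRE2 function. */
static size_t utf8_check_start(const uint8_t *subject, size_t start_offset,
                               uint32_t max_lookbehind)
{
  size_t pos = start_offset;
  while (max_lookbehind > 0 && pos > 0)
    {
    pos--;
    while (pos > 0 && (subject[pos] & 0xC0) == 0x80) pos--;
    max_lookbehind--;
    }
  return pos;
}
```

<P>
Because the backward scan relies only on the continuation-byte pattern, it
terminates at the start of the subject even when the bytes it passes over are
not valid UTF-8.
</P>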
|
||||
<P>
|
||||
The check is carried out before any other processing takes place, and a
|
||||
negative error code is returned if the check fails. There are several UTF error
|
||||
codes for each code unit width, corresponding to different problems with the
|
||||
code unit sequence. There are discussions about the validity of
|
||||
<a href="pcre2unicode.html#utf8strings">UTF-8 strings,</a>
|
||||
<a href="pcre2unicode.html#utf16strings">UTF-16 strings,</a>
|
||||
and
|
||||
<a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
|
||||
in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
If you know that your subject is valid, and you want to skip this check for
|
||||
performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
|
||||
<b>pcre2_match()</b>. You might want to do this for the second and subsequent
|
||||
calls to <b>pcre2_match()</b> if you are making repeated calls to find multiple
|
||||
matches in the same subject string.
|
||||
</P>
|
||||
<P>
|
||||
<b>Warning:</b> Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
|
||||
PCRE2_NO_UTF_CHECK is set at match time the effect of passing an invalid
|
||||
string as a subject, or an invalid value of <i>startoffset</i>, is undefined.
|
||||
Your program may crash or loop indefinitely or give wrong results.
|
||||
<pre>
|
||||
PCRE2_PARTIAL_HARD
|
||||
PCRE2_PARTIAL_SOFT
|
||||
</pre>
|
||||
These options turn on the partial matching feature. A partial match occurs if
|
||||
the end of the subject string is reached successfully, but there are not enough
|
||||
subject characters to complete the match. In addition, either at least one
|
||||
character must have been inspected or the pattern must contain a lookbehind, or
|
||||
the pattern must be one that could match an empty string.
|
||||
</P>
|
||||
<P>
|
||||
If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
|
||||
is set, matching continues by testing any remaining alternatives. Only if no
|
||||
complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
|
||||
PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that the
|
||||
caller is prepared to handle a partial match, but only if no complete match can
|
||||
be found.
|
||||
</P>
|
||||
<P>
|
||||
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
|
||||
a partial match is found, <b>pcre2_match()</b> immediately returns
|
||||
PCRE2_ERROR_PARTIAL, without considering any other alternatives. In other
|
||||
words, when PCRE2_PARTIAL_HARD is set, a partial match is considered to be more
|
||||
important than an alternative complete match.
|
||||
</P>
|
||||
<P>
|
||||
There is a more detailed discussion of partial and multi-segment matching, with
|
||||
examples, in the
|
||||
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<br><a name="SEC29" href="#TOC1">NEWLINE HANDLING WHEN MATCHING</a><br>
|
||||
<P>
|
||||
When PCRE2 is built, a default newline convention is set; this is usually the
|
||||
standard convention for the operating system. The default can be overridden in
|
||||
a
|
||||
<a href="#compilecontext">compile context</a>
|
||||
by calling <b>pcre2_set_newline()</b>. It can also be overridden by starting a
|
||||
pattern string with, for example, (*CRLF), as described in the
|
||||
<a href="pcre2pattern.html#newlines">section on newline conventions</a>
|
||||
in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
page. During matching, the newline choice affects the behaviour of the dot,
|
||||
circumflex, and dollar metacharacters. It may also alter the way the match
|
||||
starting position is advanced after a match failure for an unanchored pattern.
|
||||
</P>
|
||||
<P>
|
||||
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set as
|
||||
the newline convention, and a match attempt for an unanchored pattern fails
|
||||
when the current starting position is at a CRLF sequence, and the pattern
|
||||
contains no explicit matches for CR or LF characters, the match position is
|
||||
advanced by two characters instead of one, in other words, to after the CRLF.
|
||||
</P>
|
||||
<P>
|
||||
The above rule is a compromise that makes the most common cases work as
|
||||
expected. For example, if the pattern is .+A (and the PCRE2_DOTALL option is
|
||||
not set), it does not match the string "\r\nA" because, after failing at the
|
||||
start, it skips both the CR and the LF before retrying. However, the pattern
|
||||
[\r\n]A does match that string, because it contains an explicit CR or LF
|
||||
reference, and so advances only by one character after the first failure.
|
||||
</P>
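<P>
The advancing rule can be written as a small function. The following is a
sketch under stated assumptions, not the library's own code: next_attempt() is
a hypothetical name, and the two flags are assumed to have been worked out
already from the newline convention and from the contents of the pattern.
</P>

```c
#include <stddef.h>

/* Returns the position of the next match attempt after a failure at pos
   for an unanchored pattern. If CRLF is a recognized newline, the pattern
   has no explicit CR or LF match, and pos is at a CRLF sequence, the
   attempt moves past both characters; otherwise it moves by one. The
   caller is assumed to pass pos < length. next_attempt() is a hypothetical
   helper for illustration. */
static size_t next_attempt(const char *subject, size_t length, size_t pos,
                           int crlf_is_newline, int explicit_cr_or_lf)
{
  if (crlf_is_newline && !explicit_cr_or_lf &&
      pos + 1 < length && subject[pos] == '\r' && subject[pos + 1] == '\n')
    return pos + 2;
  return pos + 1;
}
```

<P>
For the subject "\r\nA", a pattern such as .+A is retried at position 2, where
only "A" remains and the match fails. A pattern such as [\r\n]A is retried at
position 1, where "\nA" matches.
</P>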
|
||||
<P>
|
||||
An explicit match for CR or LF is either a literal appearance of one of those
|
||||
characters in the pattern, or one of the \r or \n or equivalent octal or
|
||||
hexadecimal escape sequences. Implicit matches such as [^X] do not count, nor
|
||||
does \s, even though it includes CR and LF in the characters that it matches.
|
||||
</P>
|
||||
<P>
|
||||
Notwithstanding the above, anomalous effects may still occur when CRLF is a
|
||||
valid newline sequence and explicit \r or \n escapes appear in the pattern.
|
||||
<a name="matchedstrings"></a></P>
|
||||
<br><a name="SEC30" href="#TOC1">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a><br>
|
||||
<P>
|
||||
<b>uint32_t pcre2_get_ovector_count(pcre2_match_data *<i>match_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *<i>match_data</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
In general, a pattern matches a certain portion of the subject, and in
|
||||
addition, further substrings from the subject may be picked out by
|
||||
parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's
|
||||
book, this is called "capturing" in what follows, and the phrase "capture
|
||||
group" (Perl terminology) is used for a fragment of a pattern that picks out a
|
||||
substring. PCRE2 supports several other kinds of parenthesized group that do
|
||||
not cause substrings to be captured. The <b>pcre2_pattern_info()</b> function
|
||||
can be used to find out how many capture groups there are in a compiled
|
||||
pattern.
|
||||
</P>
|
||||
<P>
|
||||
You can use auxiliary functions for accessing captured substrings
|
||||
<a href="#extractbynumber">by number</a>
|
||||
or
|
||||
<a href="#extractbyname">by name,</a>
|
||||
as described in sections below.
|
||||
</P>
|
||||
<P>
|
||||
Alternatively, you can make direct use of the vector of PCRE2_SIZE values,
|
||||
called the <b>ovector</b>, which contains the offsets of captured strings. It is
|
||||
part of the
|
||||
<a href="#matchdatablock">match data block.</a>
|
||||
The function <b>pcre2_get_ovector_pointer()</b> returns the address of the
|
||||
ovector, and <b>pcre2_get_ovector_count()</b> returns the number of pairs of
|
||||
values it contains.
|
||||
</P>
|
||||
<P>
|
||||
Within the ovector, the first in each pair of values is set to the offset of
|
||||
the first code unit of a substring, and the second is set to the offset of the
|
||||
first code unit after the end of a substring. These values are always code unit
|
||||
offsets, not character offsets. That is, they are byte offsets in the 8-bit
|
||||
library, 16-bit offsets in the 16-bit library, and 32-bit offsets in the 32-bit
|
||||
library.
|
||||
</P>
|
||||
<P>
|
||||
After a partial match (error return PCRE2_ERROR_PARTIAL), only the first pair
|
||||
of offsets (that is, <i>ovector[0]</i> and <i>ovector[1]</i>) are set. They
|
||||
identify the part of the subject that was partially matched. See the
|
||||
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||
documentation for details of partial matching.
|
||||
</P>
|
||||
<P>
|
||||
After a fully successful match, the first pair of offsets identifies the
|
||||
portion of the subject string that was matched by the entire pattern. The next
|
||||
pair is used for the first captured substring, and so on. The value returned by
|
||||
<b>pcre2_match()</b> is one more than the highest numbered pair that has been
|
||||
set. For example, if two substrings have been captured, the returned value is
|
||||
3. If there are no captured substrings, the return value from a successful
|
||||
match is 1, indicating that just the first pair of offsets has been set.
|
||||
</P>
|
||||
<P>
|
||||
If a pattern uses the \K escape sequence within a positive assertion, the
|
||||
reported start of a successful match can be greater than the end of the match.
|
||||
For example, if the pattern (?=ab\K) is matched against "ab", the start and
|
||||
end offset values for the match are 2 and 0.
|
||||
</P>
|
||||
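<P>
As a rough illustration of how the first ovector pair and the return value
might be interpreted, here is a minimal sketch. The helper names are
hypothetical, and plain size_t stands in for PCRE2_SIZE. Treating the \K case
(start greater than end) as an empty match is an assumption of this sketch; an
application may prefer to treat it as an error.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Length in code units of the whole match, taken from the first ovector
   pair. Hypothetical helper: when \K in an assertion has made the start
   greater than the end, the match is reported as empty. */
static size_t whole_match_length(const size_t *ovector)
{
    size_t start, end;

    assert(ovector != NULL);
    start = ovector[0];
    end = ovector[1];
    return (start > end) ? 0 : end - start;
}

/* Highest numbered pair that was set, given a positive return value rc
   from pcre2_match(). A result of 0 means only the whole match was set. */
static int highest_pair_set(int rc)
{
    assert(rc > 0);
    return rc - 1;
}
```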
<P>
|
||||
If a capture group is matched repeatedly within a single match operation, it is
|
||||
the last portion of the subject that it matched that is returned.
|
||||
</P>
|
||||
<P>
|
||||
If the ovector is too small to hold all the captured substring offsets, as much
|
||||
as possible is filled in, and the function returns a value of zero. If captured
|
||||
substrings are not of interest, <b>pcre2_match()</b> may be called with a match
|
||||
data block whose ovector is of minimum length (that is, one pair).
|
||||
</P>
|
||||
<P>
|
||||
It is possible for capture group number <i>n+1</i> to match some part of the
|
||||
subject when group <i>n</i> has not been used at all. For example, if the string
|
||||
"abc" is matched against the pattern (a|(z))(bc) the return from the function
|
||||
is 4, and groups 1 and 3 are matched, but 2 is not. When this happens, both
|
||||
values in the offset pairs corresponding to unused groups are set to
|
||||
PCRE2_UNSET.
|
||||
</P>
|
||||
<P>
|
||||
Offset values that correspond to unused groups at the end of the expression are
|
||||
also set to PCRE2_UNSET. For example, if the string "abc" is matched against
|
||||
the pattern (abc)(x(yz)?)? groups 2 and 3 are not matched. The return from the
|
||||
function is 2, because the highest used capture group number is 1. The offsets
|
||||
for the second and third capture groups (assuming the vector is large enough,
|
||||
of course) are set to PCRE2_UNSET.
|
||||
</P>
|
||||
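<P>
The handling of unset groups can be sketched as follows. This is an
illustrative helper for 8-bit subjects, not part of the PCRE2 API: the name
copy_group, its negative return values, and the SKETCH_UNSET macro are all
assumptions made for this example. Real code would include pcre2.h and compare
against PCRE2_UNSET, and would pass the value of pcre2_get_ovector_count() as
the pair count.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for PCRE2_UNSET so that the sketch compiles without pcre2.h. */
#define SKETCH_UNSET (~(size_t)0)

/* Copies group n into buf as a zero-terminated string.
   Returns the length in code units, or a negative value:
     -1  n has no pair in the ovector
     -2  the group is unset (or start > end because of \K)
     -3  buf is too small */
static long copy_group(const char *subject, const size_t *ovector,
                       size_t pair_count, size_t n,
                       char *buf, size_t buflen)
{
    size_t start, end, len;

    assert(subject != NULL && ovector != NULL && buf != NULL);

    if (n >= pair_count) return -1;
    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    if (start == SKETCH_UNSET || end == SKETCH_UNSET) return -2;
    if (start > end) return -2;

    len = end - start;
    if (len + 1 > buflen) return -3;   /* need room for the trailing zero */
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (long)len;
}
```
<P>
With the ovector that results from matching "abc" against (a|(z))(bc), this
helper yields "a" for group 1, reports group 2 as unset, and yields "bc" for
group 3.
</P>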
<P>
|
||||
Elements in the ovector that do not correspond to capturing parentheses in the
|
||||
pattern are never changed. That is, if a pattern contains <i>n</i> capturing
|
||||
parentheses, no more than <i>ovector[0]</i> to <i>ovector[2n+1]</i> are set by
|
||||
<b>pcre2_match()</b>. The other elements retain whatever values they previously
|
||||
had. After a failed match attempt, the contents of the ovector are unchanged.
|
||||
<a name="matchotherdata"></a></P>
|
||||
<br><a name="SEC31" href="#TOC1">OTHER INFORMATION ABOUT A MATCH</a><br>
|
||||
<P>
|
||||
<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
As well as the offsets in the ovector, other information about a match is
|
||||
retained in the match data block and can be retrieved by the above functions in
|
||||
appropriate circumstances. If they are called at other times, the result is
|
||||
undefined.
|
||||
</P>
|
||||
<P>
|
||||
After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure
|
||||
to match (PCRE2_ERROR_NOMATCH), a mark name may be available. The function
|
||||
<b>pcre2_get_mark()</b> can be called to access this name, which can be
|
||||
specified in the pattern by any of the backtracking control verbs, not just
|
||||
(*MARK). The same function applies to all the verbs. It returns a pointer to
|
||||
the zero-terminated name, which is within the compiled pattern. If no name is
|
||||
available, NULL is returned. The length of the name (excluding the terminating
|
||||
zero) is stored in the code unit that precedes the name. You should use this
|
||||
length instead of relying on the terminating zero if the name might contain a
|
||||
binary zero.
|
||||
</P>
|
||||
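<P>
Reading the stored length of a mark name might look like the sketch below. It
assumes the 8-bit library, where a code unit is one byte and PCRE2_SPTR is a
pointer to const unsigned 8-bit values. The helper name is hypothetical, and
its argument is assumed to be a non-NULL pointer returned by
<b>pcre2_get_mark()</b>.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Returns the length of a mark name in code units by reading the code unit
   that precedes the name. Sketch for the 8-bit library only. */
static size_t mark_name_length(const unsigned char *mark)
{
    assert(mark != NULL);   /* NULL means that no mark name is available */
    return (size_t)mark[-1];
}
```
<P>
For a name such as "a\0b" (three code units with an embedded binary zero),
this returns 3, whereas a scan for the terminating zero would stop after the
first code unit.
</P>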
<P>
|
||||
After a successful match, the name that is returned is the last mark name
|
||||
encountered on the matching path through the pattern. Instances of backtracking
|
||||
verbs without names do not count. Thus, for example, if the matching path
|
||||
contains (*MARK:A)(*PRUNE), the name "A" is returned. After a "no match" or a
|
||||
partial match, the last encountered name is returned. For example, consider
|
||||
this pattern:
|
||||
<pre>
|
||||
^(*MARK:A)((*MARK:B)a|b)c
|
||||
</pre>
|
||||
When it matches "bc", the returned name is A. The B mark is "seen" in the first
|
||||
branch of the group, but it is not on the matching path. On the other hand,
|
||||
when this pattern fails to match "bx", the returned name is B.
|
||||
</P>
|
||||
<P>
|
||||
<b>Warning:</b> By default, certain start-of-match optimizations are used to
|
||||
give a fast "no match" result in some situations. For example, if the anchoring
|
||||
is removed from the pattern above, there is an initial check for the presence
|
||||
of "c" in the subject before running the matching engine. This check fails for
|
||||
"bx", causing a match failure without seeing any marks. You can disable the
|
||||
start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option for
|
||||
<b>pcre2_compile()</b> or by starting the pattern with (*NO_START_OPT).
|
||||
</P>
|
||||
<P>
|
||||
After a successful match, a partial match, or one of the invalid UTF errors
|
||||
(for example, PCRE2_ERROR_UTF8_ERR5), <b>pcre2_get_startchar()</b> can be
|
||||
called. After a successful or partial match it returns the code unit offset of
|
||||
the character at which the match started. For a non-partial match, this can be
|
||||
different to the value of <i>ovector[0]</i> if the pattern contains the \K
|
||||
escape sequence. After a partial match, however, this value is always the same
|
||||
as <i>ovector[0]</i> because \K does not affect the result of a partial match.
|
||||
</P>
|
||||
<P>
|
||||
After a UTF check failure, <b>pcre2_get_startchar()</b> can be used to obtain
|
||||
the code unit offset of the invalid UTF character. Details are given in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
page.
|
||||
<a name="errorlist"></a></P>
|
||||
<br><a name="SEC32" href="#TOC1">ERROR RETURNS FROM <b>pcre2_match()</b></a><br>
|
||||
<P>
|
||||
If <b>pcre2_match()</b> fails, it returns a negative number. This can be
|
||||
converted to a text string by calling the <b>pcre2_get_error_message()</b>
|
||||
function (see "Obtaining a textual error message"
|
||||
<a href="#geterrormessage">below).</a>
|
||||
Negative error codes are also returned by other functions, and are documented
|
||||
with them. The codes are given names in the header file. If UTF checking is in
|
||||
force and an invalid UTF subject string is detected, one of a number of
|
||||
UTF-specific negative error codes is returned. Details are given in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
page. The following are the other errors that may be returned by
|
||||
<b>pcre2_match()</b>:
|
||||
<pre>
|
||||
PCRE2_ERROR_NOMATCH
|
||||
</pre>
|
||||
The subject string did not match the pattern.
|
||||
<pre>
|
||||
PCRE2_ERROR_PARTIAL
|
||||
</pre>
|
||||
The subject string did not match, but it did match partially. See the
|
||||
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||
documentation for details of partial matching.
|
||||
<pre>
|
||||
PCRE2_ERROR_BADMAGIC
|
||||
</pre>
|
||||
PCRE2 stores a 4-byte "magic number" at the start of the compiled code, to
|
||||
catch the case when it is passed a junk pointer. This is the error that is
|
||||
returned when the magic number is not present.
|
||||
<pre>
|
||||
PCRE2_ERROR_BADMODE
|
||||
</pre>
|
||||
This error is given when a compiled pattern is passed to a function in a
|
||||
library of a different code unit width, for example, a pattern compiled by
|
||||
the 8-bit library is passed to a 16-bit or 32-bit library function.
|
||||
<pre>
|
||||
PCRE2_ERROR_BADOFFSET
|
||||
</pre>
|
||||
The value of <i>startoffset</i> was greater than the length of the subject.
|
||||
<pre>
|
||||
PCRE2_ERROR_BADOPTION
|
||||
</pre>
|
||||
An unrecognized bit was set in the <i>options</i> argument.
|
||||
<pre>
|
||||
PCRE2_ERROR_BADUTFOFFSET
|
||||
</pre>
|
||||
The UTF code unit sequence that was passed as a subject was checked and found
|
||||
to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the value of
|
||||
<i>startoffset</i> did not point to the beginning of a UTF character or the end
|
||||
of the subject.
|
||||
<pre>
|
||||
PCRE2_ERROR_CALLOUT
|
||||
</pre>
|
||||
This error is never generated by <b>pcre2_match()</b> itself. It is provided for
|
||||
use by callout functions that want to cause <b>pcre2_match()</b> or
|
||||
<b>pcre2_callout_enumerate()</b> to return a distinctive error code. See the
|
||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||
documentation for details.
|
||||
<pre>
|
||||
PCRE2_ERROR_DEPTHLIMIT
|
||||
</pre>
|
||||
The nested backtracking depth limit was reached.
|
||||
<pre>
|
||||
PCRE2_ERROR_HEAPLIMIT
|
||||
</pre>
|
||||
The heap limit was reached.
|
||||
<pre>
|
||||
PCRE2_ERROR_INTERNAL
|
||||
</pre>
|
||||
An unexpected internal error has occurred. This error could be caused by a bug
|
||||
in PCRE2 or by overwriting of the compiled pattern.
|
||||
<pre>
|
||||
PCRE2_ERROR_JIT_STACKLIMIT
|
||||
</pre>
|
||||
This error is returned when a pattern that was successfully compiled using JIT
|
||||
is being matched, but the memory available for the just-in-time processing
|
||||
stack is not large enough. See the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation for more details.
|
||||
<pre>
|
||||
PCRE2_ERROR_MATCHLIMIT
|
||||
</pre>
|
||||
The backtracking match limit was reached.
|
||||
<pre>
|
||||
PCRE2_ERROR_NOMEMORY
|
||||
</pre>
|
||||
Heap memory is used to remember backtracking points. This error is given when
|
||||
the memory allocation function (default or custom) fails. Note that a different
|
||||
error, PCRE2_ERROR_HEAPLIMIT, is given if the amount of memory needed exceeds
|
||||
the heap limit. PCRE2_ERROR_NOMEMORY is also returned if
|
||||
PCRE2_COPY_MATCHED_SUBJECT is set and memory allocation fails.
|
||||
<pre>
|
||||
PCRE2_ERROR_NULL
|
||||
</pre>
|
||||
Either the <i>code</i>, <i>subject</i>, or <i>match_data</i> argument was passed
|
||||
as NULL.
|
||||
<pre>
|
||||
PCRE2_ERROR_RECURSELOOP
|
||||
</pre>
|
||||
This error is returned when <b>pcre2_match()</b> detects a recursion loop within
|
||||
the pattern. Specifically, it means that either the whole pattern or a
|
||||
capture group has been called recursively for the second time at the same
|
||||
position in the subject string. Some simple patterns that might do this are
|
||||
detected and faulted at compile time, but more complicated cases, in particular
|
||||
mutual recursions between two different groups, cannot be detected until
|
||||
matching is attempted.
|
||||
<a name="geterrormessage"></a></P>
|
||||
<br><a name="SEC33" href="#TOC1">OBTAINING A TEXTUAL ERROR MESSAGE</a><br>
|
||||
<P>
|
||||
<b>int pcre2_get_error_message(int <i>errorcode</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
|
||||
<b> PCRE2_SIZE <i>bufflen</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
A text message for an error code from any PCRE2 function (compile, match, or
|
||||
auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code
|
||||
is passed as the first argument, with the remaining two arguments specifying a
|
||||
code unit buffer and its length in code units, into which the text message is
|
||||
placed. The message is returned in code units of the appropriate width for the
|
||||
library that is being used.
|
||||
</P>
|
||||
<P>
|
||||
The returned message is terminated with a trailing zero, and the function
|
||||
returns the number of code units used, excluding the trailing zero. If the
|
||||
error number is unknown, the negative error code PCRE2_ERROR_BADDATA is
|
||||
returned. If the buffer is too small, the message is truncated (but still with
|
||||
a trailing zero), and the negative error code PCRE2_ERROR_NOMEMORY is returned.
|
||||
None of the messages are very long; a buffer size of 120 code units is ample.
|
||||
<a name="extractbynumber"></a></P>
|
||||
<br><a name="SEC34" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br>
|
||||
<P>
|
||||
<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> uint32_t <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_substring_copy_bynumber(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> uint32_t <i>number</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>bufflen</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_substring_get_bynumber(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> uint32_t <i>number</i>, PCRE2_UCHAR **<i>bufferptr</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>bufflen</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
Captured substrings can be accessed directly by using the ovector as described
|
||||
<a href="#matchedstrings">above.</a>
|
||||
For convenience, auxiliary functions are provided for extracting captured
|
||||
substrings as new, separate, zero-terminated strings. A substring that contains
|
||||
a binary zero is correctly extracted and has a further zero added on the end,
|
||||
but the result is not, of course, a C string.
|
||||
</P>
|
||||
<P>
|
||||
The functions in this section identify substrings by number. The number zero
|
||||
refers to the entire matched substring, with higher numbers referring to
|
||||
substrings captured by parenthesized groups. After a partial match, only
|
||||
substring zero is available. An attempt to extract any other substring gives
|
||||
the error PCRE2_ERROR_PARTIAL. The next section describes similar functions for
|
||||
extracting captured substrings by name.
|
||||
</P>
|
||||
<P>
|
||||
If a pattern uses the \K escape sequence within a positive assertion, the
|
||||
reported start of a successful match can be greater than the end of the match.
|
||||
For example, if the pattern (?=ab\K) is matched against "ab", the start and
|
||||
end offset values for the match are 2 and 0. In this situation, calling these
|
||||
functions with a zero substring number extracts a zero-length empty string.
|
||||
</P>
|
||||
<P>
|
||||
You can find the length in code units of a captured substring without
|
||||
extracting it by calling <b>pcre2_substring_length_bynumber()</b>. The first
|
||||
argument is a pointer to the match data block, the second is the group number,
|
||||
and the third is a pointer to a variable into which the length is placed. If
|
||||
you just want to know whether or not the substring has been captured, you can
|
||||
pass the third argument as NULL.
|
||||
</P>
|
||||
<P>
|
||||
The <b>pcre2_substring_copy_bynumber()</b> function copies a captured substring
|
||||
into a supplied buffer, whereas <b>pcre2_substring_get_bynumber()</b> copies it
|
||||
into new memory, obtained using the same memory allocation function that was
|
||||
used for the match data block. The first two arguments of these functions are a
|
||||
pointer to the match data block and a capture group number.
|
||||
</P>
|
||||
<P>
|
||||
The final arguments of <b>pcre2_substring_copy_bynumber()</b> are a pointer to
|
||||
the buffer and a pointer to a variable that contains its length in code units.
|
||||
This is updated to contain the actual number of code units used for the
|
||||
extracted substring, excluding the terminating zero.
|
||||
</P>
|
||||
<P>
|
||||
For <b>pcre2_substring_get_bynumber()</b> the third and fourth arguments point
|
||||
to variables that are updated with a pointer to the new memory and the number
|
||||
of code units that comprise the substring, again excluding the terminating
|
||||
zero. When the substring is no longer needed, the memory should be freed by
|
||||
calling <b>pcre2_substring_free()</b>.
|
||||
</P>
|
||||
<P>
|
||||
The return value from all these functions is zero for success, or a negative
|
||||
error code. If the pattern match failed, the match failure code is returned.
|
||||
If a substring number greater than zero is used after a partial match,
|
||||
PCRE2_ERROR_PARTIAL is returned. Other possible error codes are:
|
||||
<pre>
|
||||
PCRE2_ERROR_NOMEMORY
|
||||
</pre>
|
||||
The buffer was too small for <b>pcre2_substring_copy_bynumber()</b>, or the
|
||||
attempt to get memory failed for <b>pcre2_substring_get_bynumber()</b>.
|
||||
<pre>
|
||||
PCRE2_ERROR_NOSUBSTRING
|
||||
</pre>
|
||||
There is no substring with that number in the pattern, that is, the number is
|
||||
greater than the number of capturing parentheses.
|
||||
<pre>
|
||||
PCRE2_ERROR_UNAVAILABLE
|
||||
</pre>
|
||||
The substring number, though not greater than the number of captures in the
|
||||
pattern, is greater than the number of slots in the ovector, so the substring
|
||||
could not be captured.
|
||||
<pre>
|
||||
PCRE2_ERROR_UNSET
|
||||
</pre>
|
||||
The substring did not participate in the match. For example, if the pattern is
|
||||
(abc)|(def) and the subject is "def", and the ovector contains at least two
|
||||
capturing slots, substring number 1 is unset.
|
||||
</P>
|
||||
<br><a name="SEC35" href="#TOC1">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a><br>
|
||||
<P>
|
||||
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_substring_list_free(PCRE2_UCHAR **<i>list</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
The <b>pcre2_substring_list_get()</b> function extracts all available substrings
|
||||
and builds a list of pointers to them. It also (optionally) builds a second
|
||||
list that contains their lengths (in code units), excluding a terminating zero
|
||||
that is added to each of them. All this is done in a single block of memory
|
||||
that is obtained using the same memory allocation function that was used to get
|
||||
the match data block.
|
||||
</P>
|
||||
<P>
|
||||
This function must be called only after a successful match. If called after a
|
||||
partial match, the error code PCRE2_ERROR_PARTIAL is returned.
|
||||
</P>
|
||||
<P>
|
||||
The address of the memory block is returned via <i>listptr</i>, which is also
|
||||
the start of the list of string pointers. The end of the list is marked by a
|
||||
NULL pointer. The address of the list of lengths is returned via
|
||||
<i>lengthsptr</i>. If your strings do not contain binary zeros and you do not
|
||||
therefore need the lengths, you may supply NULL as the <i>lengthsptr</i>
|
||||
argument to disable the creation of a list of lengths. The yield of the
|
||||
function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the memory block
|
||||
could not be obtained. When the list is no longer needed, it should be freed by
|
||||
calling <b>pcre2_substring_list_free()</b>.
|
||||
</P>
|
||||
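<P>
Walking the returned list could be sketched as below. The helper is
hypothetical and uses unsigned char and size_t in place of PCRE2_UCHAR and
PCRE2_SIZE for the 8-bit library. It relies only on the two properties stated
above: the pointer list ends with a NULL pointer, and the optional list of
lengths runs parallel to it.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Counts the strings in a NULL-terminated substring list. If a parallel
   list of lengths is supplied, the sum of the lengths (in code units,
   excluding terminating zeros) is stored in *total. */
static size_t substring_list_summary(unsigned char **list,
                                     const size_t *lengths, size_t *total)
{
    size_t count = 0, sum = 0;

    assert(list != NULL);
    while (list[count] != NULL) {
        if (lengths != NULL) sum += lengths[count];
        count++;
    }
    if (lengths != NULL && total != NULL) *total = sum;
    return count;
}
```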
<P>
|
||||
If this function encounters a substring that is unset, which can happen when
|
||||
capture group number <i>n+1</i> matches some part of the subject, but group
|
||||
<i>n</i> has not been used at all, it returns an empty string. This can be
|
||||
distinguished from a genuine zero-length substring by inspecting the
|
||||
appropriate offset in the ovector, which contains PCRE2_UNSET for unset
|
||||
substrings, or by calling <b>pcre2_substring_length_bynumber()</b>.
|
||||
<a name="extractbyname"></a></P>
|
||||
<br><a name="SEC36" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
|
||||
<P>
|
||||
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_substring_length_byname(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>, PCRE2_SIZE *<i>length</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_substring_copy_byname(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>, PCRE2_UCHAR *<i>buffer</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_substring_get_byname(pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>, PCRE2_UCHAR **<i>bufferptr</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
To extract a substring by name, you first have to find the associated number.
|
||||
For example, for this pattern:
|
||||
<pre>
|
||||
  (a+)b(?&lt;xxx&gt;\d+)...
|
||||
</pre>
|
||||
the number of the capture group called "xxx" is 2. If the name is known to be
|
||||
unique (PCRE2_DUPNAMES was not set), you can find the number from the name by
|
||||
calling <b>pcre2_substring_number_from_name()</b>. The first argument is the
|
||||
compiled pattern, and the second is the name. The yield of the function is the
|
||||
group number, PCRE2_ERROR_NOSUBSTRING if there is no group with that name, or
|
||||
PCRE2_ERROR_NOUNIQUESUBSTRING if there is more than one group with that name.
|
||||
Given the number, you can extract the substring directly from the ovector, or
|
||||
use one of the "bynumber" functions described above.
|
||||
</P>
|
||||
<P>
|
||||
For convenience, there are also "byname" functions that correspond to the
|
||||
"bynumber" functions, the only difference being that the second argument is a
|
||||
name instead of a number. If PCRE2_DUPNAMES is set and there are duplicate
|
||||
names, these functions scan all the groups with the given name, and return the
|
||||
captured substring from the first named group that is set.
|
||||
</P>
|
||||
<P>
|
||||
If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
|
||||
returned. If all groups with the name have numbers that are greater than the
|
||||
number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is returned. If there
|
||||
is at least one group with a slot in the ovector, but no group is found to be
|
||||
set, PCRE2_ERROR_UNSET is returned.
|
||||
</P>
|
||||
<P>
|
||||
<b>Warning:</b> If the pattern uses the (?| feature to set up multiple
|
||||
capture groups with the same number, as described in the
|
||||
<a href="pcre2pattern.html#dupgroupnumber">section on duplicate group numbers</a>
|
||||
in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
page, you cannot use names to distinguish the different capture groups, because
|
||||
names are not included in the compiled code. The matching process uses only
|
||||
numbers. For this reason, the use of different names for groups with the
|
||||
same number causes an error at compile time.
|
||||
<a name="substitutions"></a></P>
|
||||
<br><a name="SEC37" href="#TOC1">CREATING A NEW STRING WITH SUBSTITUTIONS</a><br>
|
||||
<P>
|
||||
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
||||
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
||||
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>, PCRE2_SPTR <i>replacement</i>,</b>
|
||||
<b> PCRE2_SIZE <i>rlength</i>, PCRE2_UCHAR *<i>outputbuffer</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>outlengthptr</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
This function optionally calls <b>pcre2_match()</b> and then makes a copy of the
|
||||
subject string in <i>outputbuffer</i>, replacing parts that were matched with
|
||||
the <i>replacement</i> string, whose length is supplied in <i>rlength</i>, which
|
||||
can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As a
|
||||
special case, if <i>replacement</i> is NULL and <i>rlength</i> is zero, the
|
||||
replacement is assumed to be an empty string. If <i>rlength</i> is non-zero, an
|
||||
error occurs if <i>replacement</i> is NULL.
|
||||
</P>
|
||||
<P>
|
||||
There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just
|
||||
the replacement string(s). The default action is to perform just one
|
||||
replacement if the pattern matches, but there is an option that requests
|
||||
multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL below).
|
||||
</P>
|
||||
<P>
|
||||
If successful, <b>pcre2_substitute()</b> returns the number of substitutions
|
||||
that were carried out. This may be zero if no match was found, and is never
|
||||
greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A negative value is
|
||||
returned if an error is detected.
|
||||
</P>
|
||||
<P>
|
||||
Matches in which a \K item in a lookahead in the pattern causes the match to
|
||||
end before it starts are not supported, and give rise to an error return. For
|
||||
global replacements, matches in which \K in a lookbehind causes the match to
|
||||
start earlier than the point that was reached in the previous iteration are
|
||||
also not supported.
|
||||
</P>
|
||||
<P>
|
||||
The first seven arguments of <b>pcre2_substitute()</b> are the same as for
|
||||
<b>pcre2_match()</b>, except that the partial matching options are not
|
||||
permitted, and <i>match_data</i> may be passed as NULL, in which case a match
|
||||
data block is obtained and freed within this function, using memory management
|
||||
functions from the match context, if provided, or else those that were used to
|
||||
allocate memory for the compiled code.
|
||||
</P>
|
||||
<P>
|
||||
If <i>match_data</i> is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the
|
||||
provided block is used for all calls to <b>pcre2_match()</b>, and its contents
|
||||
afterwards are the result of the final call. For global changes, this will
|
||||
always be a no-match error. The contents of the ovector within the match data
|
||||
block may or may not have been changed.
|
||||
</P>
|
||||
<P>
|
||||
As well as the usual options for <b>pcre2_match()</b>, a number of additional
|
||||
options can be set in the <i>options</i> argument of <b>pcre2_substitute()</b>.
|
||||
One such option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an external
|
||||
<i>match_data</i> block must be provided, and it must have already been used for
|
||||
an external call to <b>pcre2_match()</b> with the same pattern and subject
|
||||
arguments. The data in the <i>match_data</i> block (return code, offset vector)
|
||||
is then used for the first substitution instead of calling <b>pcre2_match()</b>
|
||||
from within <b>pcre2_substitute()</b>. This allows an application to check for a
|
||||
match before choosing to substitute, without having to repeat the match.
|
||||
</P>
|
||||
<P>
|
||||
The contents of the externally supplied match data block are not changed when
|
||||
PCRE2_SUBSTITUTE_MATCHED is set. If PCRE2_SUBSTITUTE_GLOBAL is also set,
|
||||
<b>pcre2_match()</b> is called after the first substitution to check for further
|
||||
matches, but this is done using an internally obtained match data block, thus
|
||||
always leaving the external block unchanged.
|
||||
</P>
|
||||
<P>
|
||||
The <i>code</i> argument is not used for matching before the first substitution
|
||||
when PCRE2_SUBSTITUTE_MATCHED is set, but it must be provided, even when
|
||||
PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains information such as the
|
||||
UTF setting and the number of capturing parentheses in the pattern.
|
||||
</P>
|
||||
<P>
|
||||
The default action of <b>pcre2_substitute()</b> is to return a copy of the
|
||||
subject string with matched substrings replaced. However, if
|
||||
PCRE2_SUBSTITUTE_REPLACEMENT_ONLY is set, only the replacement substrings are
|
||||
returned. In the global case, multiple replacements are concatenated in the
|
||||
output buffer. Substitution callouts (see
|
||||
<a href="#subcallouts">below)</a>
|
||||
can be used to separate them if necessary.
|
||||
</P>
|
||||
<P>
|
||||
The <i>outlengthptr</i> argument of <b>pcre2_substitute()</b> must point to a
|
||||
variable that contains the length, in code units, of the output buffer. If the
|
||||
function is successful, the value is updated to contain the length in code
|
||||
units of the new string, excluding the trailing zero that is automatically
|
||||
added.
|
||||
</P>
|
||||
<P>
|
||||
If the function is not successful, the value set via <i>outlengthptr</i> depends
|
||||
on the type of error. For syntax errors in the replacement string, the value is
|
||||
the offset in the replacement string where the error was detected. For other
|
||||
errors, the value is PCRE2_UNSET by default. This includes the case of the
|
||||
output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
|
||||
too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
|
||||
this option is set, however, <b>pcre2_substitute()</b> continues to go through
|
||||
the motions of matching and substituting (without, of course, writing anything)
|
||||
in order to compute the size of buffer that is needed, which will include the
|
||||
extra space for the terminating NUL. This value is passed back via the
|
||||
<i>outlengthptr</i> variable, with the result of the function still being
|
||||
PCRE2_ERROR_NOMEMORY.
|
||||
</P>
|
||||
<P>
|
||||
Passing a buffer size of zero is a permitted way of finding out how much memory
|
||||
is needed for a given substitution. However, this does mean that the entire
|
||||
operation is carried out twice. Depending on the application, it may be more
|
||||
efficient to allocate a large buffer and free the excess afterwards, instead of
|
||||
using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
|
||||
</P>
|
||||
<P>
|
||||
The replacement string, which is interpreted as a UTF string in UTF mode, is
|
||||
checked for UTF validity unless PCRE2_NO_UTF_CHECK is set. An invalid UTF
|
||||
replacement string causes an immediate return with the relevant UTF error code.
|
||||
</P>
|
||||
<P>
|
||||
If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is not interpreted
|
||||
in any way. By default, however, a dollar character is an escape character that
|
||||
can specify the insertion of characters from capture groups and names from
|
||||
(*MARK) or other control verbs in the pattern. Dollar is the only escape
|
||||
character (backslash is treated as literal). The following forms are
|
||||
recognized:
|
||||
<pre>
  $$                  insert a dollar character
  $n or ${n}          insert the contents of group <i>n</i>
  $0 or $&            insert the entire matched substring
  $`                  insert the substring that precedes the match
  $'                  insert the substring that follows the match
  $_                  insert the entire input string
  $*MARK or ${*MARK}  insert a control verb name
</pre>
|
||||
Either a group number or a group name can be given for <i>n</i>, for example $2 or
|
||||
$NAME. Curly brackets are required only if the following character would be
|
||||
interpreted as part of the number or name. The number may be zero to include
|
||||
the entire matched string. For example, if the pattern a(b)c is matched with
|
||||
"=abc=" and the replacement string "+$1$0$1+", the result is "=+babcb+=".
|
||||
</P>
|
||||
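<P>
The "=+babcb+=" example above can be reproduced with a small stand-alone
sketch. This is a toy model and not PCRE2's implementation: the name
<i>expand_simple</i> and its parameters are hypothetical, only the forms $$, $n
and ${n} with numbered groups are handled, and the captured strings are passed
in directly instead of coming from a match.
</P>

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy model: expand $$, $n and ${n} in "rep" using groups[0..ngroups-1],
   where groups[0] is the whole match. Returns the output length, or -1 on
   a syntax error, an unknown group, or insufficient space. */
static int expand_simple(const char *rep, const char *const *groups,
                         int ngroups, char *out, size_t cap)
{
  size_t n = 0;

  if (cap == 0) return -1;
  while (*rep != '\0') {
    const char *ins;
    size_t inslen = 1;

    if (*rep != '$') {
      ins = rep++;                        /* ordinary literal character */
    } else if (rep[1] == '$') {
      ins = "$";                          /* $$ inserts one dollar */
      rep += 2;
    } else {
      int braced = (rep[1] == '{');
      const char *p = rep + 1 + braced;
      int num = 0;

      if (!isdigit((unsigned char)*p)) return -1;
      while (isdigit((unsigned char)*p)) num = num * 10 + (*p++ - '0');
      if (braced && *p++ != '}') return -1;
      if (num >= ngroups) return -1;      /* group not in the pattern */
      ins = groups[num];
      inslen = strlen(ins);
      rep = p;
    }
    if (n + inslen >= cap) return -1;     /* keep room for the trailing NUL */
    memcpy(out + n, ins, inslen);
    n += inslen;
  }
  out[n] = '\0';
  return (int)n;
}
```

<P>
With groups { "abc", "b" } and the replacement "+$1$0$1+", the sketch writes
"+babcb+", which is the text that replaces "abc" in the subject "=abc=".
</P>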
<P>
|
||||
The JavaScript form $<name>, where the angle brackets are part of the syntax,
|
||||
is also recognized for group names, but not for group numbers or *MARK.
|
||||
</P>
|
||||
<P>
|
||||
$*MARK inserts the name from the last encountered backtracking control verb on
|
||||
the matching path that has a name. (*MARK) must always include a name, but the
|
||||
other verbs need not. For example, in the case of (*MARK:A)(*PRUNE) the name
|
||||
inserted is "A", but for (*MARK:A)(*PRUNE:B) the relevant name is "B". This
|
||||
facility can be used to perform simple simultaneous substitutions, as this
|
||||
<b>pcre2test</b> example shows:
|
||||
<pre>
  /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
      apple lemon
   2: pear orange
</pre>
|
||||
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string,
|
||||
replacing every matching substring. If this option is not set, only the first
|
||||
matching substring is replaced. The search for matches takes place in the
|
||||
original subject string (that is, previous replacements do not affect it).
|
||||
Iteration is implemented by advancing the <i>startoffset</i> value for each
|
||||
search, which is always passed the entire subject string. If an offset limit is
|
||||
set in the match context, searching stops when that limit is reached.
|
||||
</P>
|
||||
<P>
|
||||
You can restrict the effect of a global substitution to a portion of the
|
||||
subject string by setting either or both of <i>startoffset</i> and an offset
|
||||
limit. Here is a <b>pcre2test</b> example:
|
||||
<pre>
  /B/g,replace=!,use_offset_limit
  ABC ABC ABC ABC\=offset=3,offset_limit=12
   2: ABC A!C A!C ABC
</pre>
|
||||
When continuing with global substitutions after matching a substring with zero
|
||||
length, an attempt to find a non-empty match at the same offset is performed.
|
||||
If this is not successful, the offset is advanced by one character except when
|
||||
CRLF is a valid newline sequence and the next two characters are CR, LF. In
|
||||
this case, the offset is advanced by two characters.
|
||||
</P>
|
||||
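<P>
The offset rule for empty matches can be written down as a short helper. The
following is a sketch under stated assumptions, not code taken from PCRE2: the
function name and its parameters are hypothetical, the subject is held in 8-bit
code units, and the <i>utf8</i> flag stands for "the pattern was compiled in
UTF mode", so that one character may occupy several code units.
</P>

```c
#include <stddef.h>

/* Toy model of the documented rule: after an empty match at "offset" with
   no non-empty match at the same place, step forward by one character, or
   by two code units when CRLF is a valid newline and the next two code
   units are CR, LF. */
static size_t advance_after_empty(const unsigned char *subject, size_t length,
                                  size_t offset, int crlf_is_newline, int utf8)
{
  if (offset >= length) return length;          /* nothing left to skip */
  if (crlf_is_newline && offset + 1 < length &&
      subject[offset] == '\r' && subject[offset + 1] == '\n')
    return offset + 2;                          /* skip CR and LF together */
  offset++;                                     /* skip one code unit */
  if (utf8)                                     /* then continuation bytes */
    while (offset < length && (subject[offset] & 0xc0) == 0x80) offset++;
  return offset;
}
```

<P>
An application that implements its own global loop around
<b>pcre2_match()</b> needs logic of this kind; <b>pcre2_substitute()</b> does
the equivalent internally when PCRE2_SUBSTITUTE_GLOBAL is set.
</P>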
<P>
|
||||
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that do
|
||||
not appear in the pattern to be treated as unset groups. This option should be
|
||||
used with care, because it means that a typo in a group name or number no
|
||||
longer causes the PCRE2_ERROR_NOSUBSTRING error.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including unknown
|
||||
groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated as empty
|
||||
strings when inserted as described above. If this option is not set, an attempt
|
||||
to insert an unset group causes the PCRE2_ERROR_UNSET error. This option does
|
||||
not influence the extended substitution syntax described below.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
|
||||
replacement string. Without this option, only the dollar character is special,
|
||||
and only the group insertion forms listed above are valid. When
|
||||
PCRE2_SUBSTITUTE_EXTENDED is set, several things change:
|
||||
</P>
|
||||
<P>
|
||||
Firstly, backslash in a replacement string is interpreted as an escape
|
||||
character. The usual forms such as \x{ddd} can be used to specify particular
|
||||
character codes, and backslash followed by any non-alphanumeric character
|
||||
quotes that character. Extended quoting can be coded using \Q...\E, exactly
|
||||
as in pattern strings. The escapes \b and \v are interpreted as the
|
||||
characters backspace and vertical tab, respectively.
|
||||
</P>
|
||||
<P>
|
||||
The interpretation of backslash followed by one or more digits is the same as
|
||||
in a pattern, which in Perl has some ambiguities. Details are given in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
page.
|
||||
</P>
|
||||
<P>
|
||||
The Python form \g<n>, where the angle brackets are part of the syntax and <i>n</i>
|
||||
is either a group name or number, is recognized as an alternative way of
|
||||
inserting the contents of a group, for example \g<3>.
|
||||
</P>
|
||||
<P>
|
||||
There are also four escape sequences for forcing the case of inserted letters.
|
||||
Case forcing applies to all inserted characters, including those from capture
|
||||
groups and letters within \Q...\E quoted sequences. The insertion mechanism
|
||||
has three states: no case forcing, force upper case, and force lower case. The
|
||||
escape sequences change the current state: \U and \L change to upper or lower
|
||||
case forcing, respectively, and \E (when not terminating a \Q quoted
|
||||
sequence) reverts to no case forcing. The sequences \u and \l force the next
|
||||
character (if it is a letter) to upper or lower case, respectively, and then
|
||||
the state automatically reverts to no case forcing.
|
||||
</P>
|
||||
<P>
|
||||
However, if \u is immediately followed by \L or \l is immediately followed
|
||||
by \U, the next character's case is forced by the first escape sequence, and
|
||||
subsequent characters by the second. This provides a "title casing" facility
|
||||
that can be applied to group captures. For example, if group 1 has captured
|
||||
"heLLo", the replacement string "\u\L$1" becomes "Hello".
|
||||
</P>
|
||||
<P>
|
||||
If either PCRE2_UTF or PCRE2_UCP was set when the pattern was compiled, Unicode
|
||||
properties are used for case forcing characters whose code points are greater
|
||||
than 127. However, only simple case folding, as determined by the Unicode file
|
||||
<b>CaseFolding.txt</b>, is supported. PCRE2 does not support language-specific
|
||||
special casing rules such as using different lower case Greek sigmas in the
|
||||
middle and ends of words (as defined in the Unicode file
|
||||
<b>SpecialCasing.txt</b>).
|
||||
</P>
|
||||
<P>
|
||||
Note that case forcing sequences such as \U...\E do not nest. For example,
|
||||
the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final \E has no
|
||||
effect. Note also that the PCRE2_ALT_BSUX and PCRE2_EXTRA_ALT_BSUX options do
|
||||
not apply to replacement strings.
|
||||
</P>
|
||||
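<P>
The case forcing states can be illustrated with a small stand-alone model. It
is only a sketch of the rules described above, not PCRE2's code: it handles
ASCII letters only, it applies the escapes to literal text (in a real
replacement the text would usually come from a capture group such as $1), and
the names <i>force_case</i> and <i>apply_force</i> are hypothetical.
</P>

```c
#include <ctype.h>
#include <stddef.h>

enum force { FORCE_NONE, FORCE_UPPER, FORCE_LOWER };

static char apply_force(enum force f, char c)
{
  if (f == FORCE_UPPER) return (char)toupper((unsigned char)c);
  if (f == FORCE_LOWER) return (char)tolower((unsigned char)c);
  return c;
}

/* Toy model (ASCII only): process \U \L \E \u \l in "in" and write the
   case-forced text to "out" (capacity "cap"). Returns the length, or -1
   if the text does not fit or an unknown escape is seen. */
static int force_case(const char *in, char *out, size_t cap)
{
  enum force state = FORCE_NONE;    /* persistent state: \U, \L, \E */
  enum force once = FORCE_NONE;     /* pending one-character force */
  size_t n = 0;

  if (cap == 0) return -1;
  for (; *in != '\0'; in++) {
    if (*in == '\\') {
      switch (*++in) {
        case 'U': state = FORCE_UPPER; break;
        case 'L': state = FORCE_LOWER; break;
        case 'E': state = FORCE_NONE;  break;
        /* A one-character force resets the persistent state; a \L or \U
           that follows immediately sets it again (the title casing form). */
        case 'u': once = FORCE_UPPER; state = FORCE_NONE; break;
        case 'l': once = FORCE_LOWER; state = FORCE_NONE; break;
        default:  return -1;
      }
      continue;
    }
    if (n + 1 >= cap) return -1;
    out[n++] = apply_force(once != FORCE_NONE ? once : state, *in);
    once = FORCE_NONE;              /* the one-character force is used up */
  }
  out[n] = '\0';
  return (int)n;
}
```

<P>
Given the input "\Uaa\LBB\Ecc\E" the model produces "AAbbcc", and given
"\u\LheLLo" it produces "Hello", matching the two examples in the text.
</P>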
<P>
|
||||
The final effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
|
||||
flexibility to capture group substitution. The syntax is similar to that used
|
||||
by Bash:
|
||||
<pre>
  ${n:-string}
  ${n:+string1:string2}
</pre>
|
||||
As in the simple case, <i>n</i> may be a group number or a name. The first form
|
||||
specifies a default value. If group <i>n</i> is set, its value is inserted; if
|
||||
not, the string is expanded and the result inserted. The second form specifies
|
||||
strings that are expanded and inserted when group <i>n</i> is set or unset,
|
||||
respectively. The first form is just a convenient shorthand for
|
||||
<pre>
|
||||
${n:+${n}:string}
|
||||
</pre>
|
||||
Backslash can be used to escape colons and closing curly brackets in the
|
||||
replacement strings. A change of the case forcing state within a replacement
|
||||
string remains in force afterwards, as shown in this <b>pcre2test</b> example:
|
||||
<pre>
  /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
      body
   1: hello
      somebody
   1: HELLO
</pre>
|
||||
The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
|
||||
substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause unknown
|
||||
groups in the extended syntax forms to be treated as unset.
|
||||
</P>
|
||||
<P>
|
||||
If PCRE2_SUBSTITUTE_LITERAL is set, PCRE2_SUBSTITUTE_UNKNOWN_UNSET,
|
||||
PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrelevant and
|
||||
are ignored.
|
||||
</P>
|
||||
<br><b>
|
||||
Substitution errors
|
||||
</b><br>
|
||||
<P>
|
||||
In the event of an error, <b>pcre2_substitute()</b> returns a negative error
|
||||
code. Except for PCRE2_ERROR_NOMATCH (which is never returned), errors from
|
||||
<b>pcre2_match()</b> are passed straight back.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring insertion,
|
||||
unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_ERROR_UNSET is returned for an unset substring insertion (including an
|
||||
unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) when the simple
|
||||
(non-extended) syntax is used and PCRE2_SUBSTITUTE_UNSET_EMPTY is not set.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. If the
|
||||
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size of buffer that is
|
||||
needed is returned via <i>outlengthptr</i>. Note that this does not happen by
|
||||
default.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the
|
||||
<i>match_data</i> argument is NULL or if the <i>subject</i> or <i>replacement</i>
|
||||
arguments are NULL. For backward compatibility reasons an exception is made for
|
||||
the <i>replacement</i> argument if the <i>rlength</i> argument is also 0.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the
|
||||
replacement string, with more particular errors being PCRE2_ERROR_BADREPESCAPE
|
||||
(invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE (closing curly bracket
|
||||
not found), PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group
|
||||
substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before
|
||||
it started or the match started earlier than the current position in the
|
||||
subject, which can happen if \K is used in an assertion).
|
||||
</P>
|
||||
<P>
|
||||
As for all PCRE2 errors, a text message that describes the error can be
|
||||
obtained by calling the <b>pcre2_get_error_message()</b> function (see
|
||||
"Obtaining a textual error message"
|
||||
<a href="#geterrormessage">above).</a>
|
||||
<a name="subcallouts"></a></P>
|
||||
<br><b>
|
||||
Substitution callouts
|
||||
</b><br>
|
||||
<P>
|
||||
<b>int pcre2_set_substitute_callout(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> int (*<i>callout_function</i>)(pcre2_substitute_callout_block *, void *),</b>
|
||||
<b> void *<i>callout_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
The <b>pcre2_set_substitute_callout()</b> function can be used to specify a
|
||||
callout function for <b>pcre2_substitute()</b>. This information is passed in
|
||||
a match context. The callout function is called after each substitution has
|
||||
been processed, but it can cause the replacement not to happen.
|
||||
</P>
|
||||
<P>
|
||||
The callout function is not called for simulated substitutions that happen as a
|
||||
result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option. In this mode, when
|
||||
substitution processing exceeds the buffer space provided by the caller,
|
||||
processing continues by counting code units. The simulation is unable to
|
||||
populate the callout block, and so the simulation is pessimistic about the
|
||||
required buffer size. Whichever of the accepted or rejected substitution is larger
|
||||
is reported as the required size. Therefore, the returned buffer length may be
|
||||
an overestimate (without a substitution callout, it is normally an exact
|
||||
measurement).
|
||||
</P>
|
||||
<P>
|
||||
The first argument of the callout function is a pointer to a substitute callout
|
||||
block structure, which contains the following fields, not necessarily in this
|
||||
order:
|
||||
<pre>
  uint32_t    <i>version</i>;
  uint32_t    <i>subscount</i>;
  PCRE2_SPTR  <i>input</i>;
  PCRE2_SPTR  <i>output</i>;
  PCRE2_SIZE <i>*ovector</i>;
  uint32_t    <i>oveccount</i>;
  PCRE2_SIZE  <i>output_offsets[2]</i>;
</pre>
|
||||
The <i>version</i> field contains the version number of the block format. The
|
||||
current version is 0. The version number will increase in future if more fields
|
||||
are added, but the intention is never to remove any of the existing fields.
|
||||
</P>
|
||||
<P>
|
||||
The <i>subscount</i> field is the number of the current match. It is 1 for the
|
||||
first callout, 2 for the second, and so on. The <i>input</i> and <i>output</i>
|
||||
pointers are copies of the values passed to <b>pcre2_substitute()</b>.
|
||||
</P>
|
||||
<P>
|
||||
The <i>ovector</i> field points to the ovector, which contains the result of the
|
||||
most recent match. The <i>oveccount</i> field contains the number of pairs that
|
||||
are set in the ovector, and is always greater than zero.
|
||||
</P>
|
||||
<P>
|
||||
The <i>output_offsets</i> vector contains the offsets of the replacement in the
|
||||
output string. This has already been processed for dollar and (if requested)
|
||||
backslash substitutions as described above.
|
||||
</P>
|
||||
<P>
|
||||
The second argument of the callout function is the value passed as
|
||||
<i>callout_data</i> when the function was registered. The value returned by the
|
||||
callout function is interpreted as follows:
|
||||
</P>
|
||||
<P>
|
||||
If the value is zero, the replacement is accepted, and, if
|
||||
PCRE2_SUBSTITUTE_GLOBAL is set, processing continues with a search for the next
|
||||
match. If the value is not zero, the current replacement is not accepted. If
|
||||
the value is greater than zero, processing continues when
|
||||
PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than zero or
|
||||
PCRE2_SUBSTITUTE_GLOBAL is not set), the rest of the input is copied to the
|
||||
output and the call to <b>pcre2_substitute()</b> exits, returning the number of
|
||||
matches so far.
|
||||
</P>
|
||||
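<P>
The return value rules can be summarized in code. The sketch below is a model
for illustration only: <i>interpret_callout</i>, <i>keep_first_n</i> and the
stand-in structure are hypothetical names, and the stand-in holds just the
<i>subscount</i> field. A real callout receives a
<b>pcre2_substitute_callout_block</b> pointer instead.
</P>

```c
#include <stdbool.h>

struct callout_outcome {
  bool replacement_accepted;  /* is the replacement kept in the output? */
  bool search_continues;      /* does the search for more matches go on? */
};

/* Toy model of the rules above: "rc" is the callout's return value and
   "global" says whether PCRE2_SUBSTITUTE_GLOBAL is set. */
static struct callout_outcome interpret_callout(int rc, bool global)
{
  struct callout_outcome o;
  o.replacement_accepted = (rc == 0);
  /* Zero and positive values let a global substitution carry on; a negative
     value, or a non-global call, ends processing after this match. */
  o.search_continues = global && rc >= 0;
  return o;
}

/* Stand-in for the one field used here; the real block has more fields. */
struct model_callout_block { unsigned subscount; };

/* Callout-shaped function: accept the first "*limit" replacements, then
   reject the next one and stop, leaving the rest of the input unchanged. */
static int keep_first_n(const struct model_callout_block *block, void *data)
{
  unsigned limit = *(const unsigned *)data;
  return (block->subscount <= limit) ? 0 : -1;
}
```

<P>
Returning a positive value instead of -1 in <i>keep_first_n</i> would skip the
current replacement but keep searching when PCRE2_SUBSTITUTE_GLOBAL is set.
</P>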
<br><b>
|
||||
Substitution case callouts
|
||||
</b><br>
|
||||
<P>
|
||||
<b>int pcre2_set_substitute_case_callout(pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> PCRE2_SIZE (*<i>callout_function</i>)(PCRE2_SPTR, PCRE2_SIZE,</b>
|
||||
<b> PCRE2_UCHAR *, PCRE2_SIZE,</b>
|
||||
<b> int, void *),</b>
|
||||
<b> void *<i>callout_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
The <b>pcre2_set_substitute_case_callout()</b> function can be used to specify
|
||||
a callout function for <b>pcre2_substitute()</b> to use when performing case
|
||||
transformations. This does not affect any case insensitivity behaviour when
|
||||
performing a match, but only the user-visible transformations performed when
|
||||
processing a substitution such as:
|
||||
<pre>
|
||||
pcre2_substitute(..., "\\U$1", ...)
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
The default case transformations applied by PCRE2 are reasonably complete, and,
|
||||
in UTF or UCP mode, perform the simple locale-invariant case transformations as
|
||||
specified by Unicode. This is suitable for the internal (invisible)
|
||||
case-equivalence procedures used during pattern matching, but an application
|
||||
may wish to use more sophisticated locale-aware processing for the user-visible
|
||||
substitution transformations.
|
||||
</P>
|
||||
<P>
|
||||
One example implementation of the <i>callout_function</i> using the ICU
|
||||
library would be:
|
||||
<br>
|
||||
<br>
|
||||
<pre>
  /* Needs &lt;unicode/ustring.h&gt;. Assumes 16-bit code units, and that
     first_char_only is a break iterator, set up elsewhere by the
     application, that restricts titlecasing to the first character. */
  PCRE2_SIZE
  icu_case_callout(
    PCRE2_SPTR input, PCRE2_SIZE input_len,
    PCRE2_UCHAR *output, PCRE2_SIZE output_cap,
    int to_case, void *data_ptr)
  {
    UErrorCode err = U_ZERO_ERROR;
    int32_t r = to_case == PCRE2_SUBSTITUTE_CASE_LOWER
      ? u_strToLower(output, output_cap, input, input_len, NULL, &err)
      : to_case == PCRE2_SUBSTITUTE_CASE_UPPER
      ? u_strToUpper(output, output_cap, input, input_len, NULL, &err)
      : u_strToTitle(output, output_cap, input, input_len, &first_char_only,
                     NULL, &err);
    /* A too-small buffer is not an error here: ICU still returns the
       required length, which is what PCRE2 expects to be told. */
    if (err != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(err))
      return (~(PCRE2_SIZE)0);
    return (PCRE2_SIZE)r;
  }
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
The first and second arguments of the case callout function are the Unicode
|
||||
string to transform and its length.
|
||||
</P>
|
||||
<P>
|
||||
The third and fourth arguments are the output buffer and its capacity.
|
||||
</P>
|
||||
<P>
|
||||
The fifth is one of the constants PCRE2_SUBSTITUTE_CASE_LOWER,
|
||||
PCRE2_SUBSTITUTE_CASE_UPPER, or PCRE2_SUBSTITUTE_CASE_TITLE_FIRST.
|
||||
PCRE2_SUBSTITUTE_CASE_LOWER and PCRE2_SUBSTITUTE_CASE_UPPER are passed to the
|
||||
callout to indicate that the entire callout input should be
|
||||
case-transformed. PCRE2_SUBSTITUTE_CASE_TITLE_FIRST is passed to indicate that
|
||||
only the first character or glyph should be transformed to Unicode titlecase
|
||||
and the rest to Unicode lowercase (note that titlecasing sometimes uses Unicode
|
||||
properties to titlecase each word in a string; but PCRE2 is requesting that only
|
||||
the single leading character is to be titlecased).
|
||||
</P>
|
||||
<P>
|
||||
The sixth argument is the <i>callout_data</i> supplied to
|
||||
<b>pcre2_set_substitute_case_callout()</b>.
|
||||
</P>
|
||||
<P>
|
||||
The resulting string in the destination buffer may be larger or smaller than the
|
||||
input, if the casing rules merge or split characters. The return value is the
|
||||
length required for the output string. If a buffer of sufficient size was
|
||||
provided to the callout, then the result must be written to the buffer and the
|
||||
number of code units returned. If the result does not fit in the provided
|
||||
buffer, then the required capacity must be returned and PCRE2 will not make use
|
||||
of the output buffer. PCRE2 provides input and output buffers which overlap, so
|
||||
the callout must support this by suitable internal buffering.
|
||||
</P>
|
||||
<P>
|
||||
Alternatively, if the callout wishes to indicate an error, then it may return
|
||||
(~(PCRE2_SIZE)0). In this case pcre2_substitute() will immediately fail with
|
||||
error PCRE2_ERROR_REPLACECASE.
|
||||
</P>
|
||||
<P>
|
||||
When a case callout is combined with the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
|
||||
option, there are situations when pcre2_substitute() will return an
|
||||
underestimate of the required buffer size. If you call pcre2_substitute() once
|
||||
with PCRE2_SUBSTITUTE_OVERFLOW_LENGTH, and the output buffer is too small for
|
||||
the replacement string to be constructed, then instead of calling the case
|
||||
callout, pcre2_substitute() will make an estimate of the required buffer size.
|
||||
The second call should also pass PCRE2_SUBSTITUTE_OVERFLOW_LENGTH, because that
|
||||
second call is not guaranteed to succeed either, if the case callout requires
|
||||
more buffer space than expected. The caller must make repeated attempts in a
|
||||
loop.
|
||||
</P>
|
||||
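<P>
The retry loop described above might be structured as follows. This is a
sketch under stated assumptions and does not call PCRE2: <i>attempt_fn</i> is
a hypothetical wrapper type that would make one <b>pcre2_substitute()</b> call
with PCRE2_SUBSTITUTE_OVERFLOW_LENGTH set, MODEL_NOMEMORY is a local stand-in
for PCRE2_ERROR_NOMEMORY, and <i>fake_attempt</i> exists only to exercise the
loop by under-reporting the required size on its first call.
</P>

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_NOMEMORY (-48)  /* stand-in for PCRE2_ERROR_NOMEMORY */

/* Hypothetical wrapper type: one substitution attempt. On entry *outlen is
   the buffer capacity; on return it is the result length, or the estimated
   capacity needed when the return value is MODEL_NOMEMORY. */
typedef int (*attempt_fn)(char *out, size_t *outlen, void *ctx);

/* Retry until the attempt stops reporting a too-small buffer. On success
   *result is a malloc'ed buffer that the caller frees. */
static int substitute_with_retry(attempt_fn attempt, void *ctx,
                                 char **result, size_t *result_len)
{
  size_t cap = 64;                        /* arbitrary first guess */
  char *buf = NULL;

  *result = NULL;
  *result_len = 0;
  for (;;) {
    size_t len = cap;
    int rc;
    char *grown = realloc(buf, cap);
    if (grown == NULL) { free(buf); return MODEL_NOMEMORY; }
    buf = grown;
    rc = attempt(buf, &len, ctx);
    if (rc == MODEL_NOMEMORY && len > cap) {
      cap = len;                          /* adopt the reported estimate */
      continue;                           /* and try again */
    }
    if (rc < 0) { free(buf); return rc; } /* other error, or no estimate */
    *result = buf;
    *result_len = len;
    return rc;
  }
}

/* Stand-in attempt for demonstration only. */
struct fake_ctx { const char *text; int calls; };

static int fake_attempt(char *out, size_t *outlen, void *ctx)
{
  struct fake_ctx *f = ctx;
  size_t need = strlen(f->text) + 1;      /* includes the trailing NUL */
  f->calls++;
  if (*outlen < need) {
    /* First call under-reports, as a case callout might cause. */
    *outlen = (f->calls == 1 && need > 80) ? need - 10 : need;
    return MODEL_NOMEMORY;
  }
  memcpy(out, f->text, need);
  *outlen = need - 1;
  return 1;                               /* one substitution made */
}
```

<P>
The loop grows the buffer only when the reported estimate exceeds the current
capacity, so an attempt that fails without a usable estimate ends the loop
instead of repeating forever.
</P>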
<br><a name="SEC38" href="#TOC1">DUPLICATE CAPTURE GROUP NAMES</a><br>
|
||||
<P>
|
||||
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
|
||||
<b> PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
When a pattern is compiled with the PCRE2_DUPNAMES option, names for capture
|
||||
groups are not required to be unique. Duplicate names are always allowed for
|
||||
groups with the same number, created by using the (?| feature. Indeed, if such
|
||||
groups are named, they are required to use the same names.
|
||||
</P>
|
||||
<P>
|
||||
Normally, patterns that use duplicate names are such that in any one match,
|
||||
only one of each set of identically-named groups participates. An example is
|
||||
shown in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
When duplicates are present, <b>pcre2_substring_copy_byname()</b> and
|
||||
<b>pcre2_substring_get_byname()</b> return the first substring corresponding to
|
||||
the given name that is set. Only if none are set is PCRE2_ERROR_UNSET
|
||||
returned. The <b>pcre2_substring_number_from_name()</b> function returns the
|
||||
error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate names.
|
||||
</P>
|
||||
<P>
|
||||
If you want to get full details of all captured substrings for a given name,
|
||||
you must use the <b>pcre2_substring_nametable_scan()</b> function. The first
|
||||
argument is the compiled pattern, and the second is the name. If the third and
|
||||
fourth arguments are NULL, the function returns a group number for a unique
|
||||
name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
|
||||
</P>
|
||||
<P>
|
||||
When the third and fourth arguments are not NULL, they must be pointers to
|
||||
variables that are updated by the function. After it has run, they point to the
|
||||
first and last entries in the name-to-number table for the given name, and the
|
||||
function returns the length of each entry in code units. In both cases,
|
||||
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
|
||||
</P>
|
||||
<P>
|
||||
The format of the name table is described
|
||||
<a href="#infoaboutpattern">above</a>
|
||||
in the section entitled <i>Information about a pattern</i>. Given all the
|
||||
relevant entries for the name, you can extract each of their numbers, and hence
|
||||
the captured data.
|
||||
</P>
|
||||
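<P>
Walking the entries between <i>first</i> and <i>last</i> could look like the
sketch below. It assumes the 8-bit library, where each name table entry starts
with the group number in two bytes, most significant first, followed by the
zero-terminated name. The function name and its parameters are hypothetical,
and the table used to try it out would normally come from
<b>pcre2_substring_nametable_scan()</b> and not be written by hand.
</P>

```c
#include <stddef.h>

/* Sketch for the 8-bit library: collect the group numbers of all name
   table entries from "first" to "last" inclusive. "entry_size" is the
   value returned by the scan (the length of each entry). At most "max"
   numbers are stored; the return value is the number of entries seen. */
static int collect_group_numbers(const unsigned char *first,
                                 const unsigned char *last,
                                 int entry_size, int *numbers, int max)
{
  int count = 0;
  const unsigned char *entry;

  if (entry_size <= 0) return 0;
  for (entry = first; entry <= last; entry += entry_size) {
    if (count < max) numbers[count] = (entry[0] << 8) | entry[1];
    count++;
  }
  return count;
}
```

<P>
Each collected number can then be used with the by-number extraction
functions, or as an index into the ovector, to find out which of the
identically named groups actually took part in the match.
</P>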
<br><a name="SEC39" href="#TOC1">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a><br>
|
||||
<P>
|
||||
The traditional matching function uses a similar algorithm to Perl, which stops
|
||||
when it finds the first match at a given point in the subject. If you want to
|
||||
find all possible matches, or the longest possible match at a given position,
|
||||
consider using the alternative matching function (see below) instead. If you
|
||||
cannot use the alternative function, you can kludge it up by making use of the
|
||||
callout facility, which is described in the
|
||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
What you have to do is to insert a callout right at the end of the pattern.
|
||||
When your callout function is called, extract and save the current matched
|
||||
substring. Then return 1, which forces <b>pcre2_match()</b> to backtrack and try
|
||||
other alternatives. Ultimately, when it runs out of matches,
|
||||
<b>pcre2_match()</b> will yield PCRE2_ERROR_NOMATCH.
|
||||
<a name="dfamatch"></a></P>
|
||||
<br><a name="SEC40" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br>
|
||||
<P>
|
||||
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
|
||||
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
|
||||
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
|
||||
<b> pcre2_match_context *<i>mcontext</i>,</b>
|
||||
<b> int *<i>workspace</i>, PCRE2_SIZE <i>wscount</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
The function <b>pcre2_dfa_match()</b> is called to match a subject string
|
||||
against a compiled pattern, using a matching algorithm that scans the subject
|
||||
string just once (not counting lookaround assertions), and does not backtrack
|
||||
(except when processing lookaround assertions). This has different
|
||||
characteristics to the normal algorithm, and is not compatible with Perl. Some
|
||||
of the features of PCRE2 patterns are not supported. Nevertheless, there are
|
||||
times when this kind of matching can be useful. For a discussion of the two
|
||||
matching algorithms, and a list of features that <b>pcre2_dfa_match()</b> does
|
||||
not support, see the
|
||||
<a href="pcre2matching.html"><b>pcre2matching</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
The arguments for the <b>pcre2_dfa_match()</b> function are the same as for
|
||||
<b>pcre2_match()</b>, plus two extras. The ovector within the match data block
|
||||
is used in a different way, and this is described below. The other common
|
||||
arguments are used in the same way as for <b>pcre2_match()</b>, so their
|
||||
description is not repeated here.
|
||||
</P>
|
||||
<P>
|
||||
The two additional arguments provide workspace for the function. The workspace
|
||||
vector should contain at least 20 elements. It is used for keeping track of
|
||||
multiple paths through the pattern tree. More workspace is needed for patterns
|
||||
and subjects where there are a lot of potential matches.
|
||||
</P>
|
||||
<P>
|
||||
Here is an example of a simple call to <b>pcre2_dfa_match()</b>:
|
||||
<pre>
  int wspace[20];
  pcre2_match_data *md = pcre2_match_data_create(4, NULL);
  int rc = pcre2_dfa_match(
    re,             /* result of pcre2_compile() */
    "some string",  /* the subject string */
    11,             /* the length of the subject string */
    0,              /* start at offset 0 in the subject */
    0,              /* default options */
    md,             /* the match data block */
    NULL,           /* a match context; NULL means use defaults */
    wspace,         /* working space vector */
    20);            /* number of elements (NOT size in bytes) */
</PRE>
|
||||
</P>
|
||||
<br><b>
|
||||
Option bits for <b>pcre2_dfa_match()</b>
|
||||
</b><br>
|
||||
<P>
|
||||
The unused bits of the <i>options</i> argument for <b>pcre2_dfa_match()</b> must
|
||||
be zero. The only bits that may be set are PCRE2_ANCHORED,
|
||||
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL,
|
||||
PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD,
|
||||
PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last
|
||||
four of these are exactly the same as for <b>pcre2_match()</b>, so their
|
||||
description is not repeated here.
|
||||
<pre>
|
||||
PCRE2_PARTIAL_HARD
|
||||
PCRE2_PARTIAL_SOFT
|
||||
</pre>
|
||||
These have the same general effect as they do for <b>pcre2_match()</b>, but the
|
||||
details are slightly different. When PCRE2_PARTIAL_HARD is set for
|
||||
<b>pcre2_dfa_match()</b>, it returns PCRE2_ERROR_PARTIAL if the end of the
|
||||
subject is reached and there is still at least one matching possibility that
|
||||
requires additional characters. This happens even if some complete matches have
|
||||
already been found. When PCRE2_PARTIAL_SOFT is set, the return code
|
||||
PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL if the end of the
|
||||
subject is reached, there have been no complete matches, but there is still at
|
||||
least one matching possibility. The portion of the string that was inspected
|
||||
when the longest partial match was found is set as the first matching string in
|
||||
both cases. There is a more detailed discussion of partial and multi-segment
|
||||
matching, with examples, in the
|
||||
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||
documentation.
|
||||
<pre>
|
||||
PCRE2_DFA_SHORTEST
|
||||
</pre>
|
||||
Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to stop as
|
||||
soon as it has found one match. Because of the way the alternative algorithm
|
||||
works, this is necessarily the shortest possible match at the first possible
|
||||
matching point in the subject string.
|
||||
<pre>
|
||||
PCRE2_DFA_RESTART
|
||||
</pre>
|
||||
When <b>pcre2_dfa_match()</b> returns a partial match, it is possible to call it
|
||||
again, with additional subject characters, and have it continue with the same
|
||||
match. The PCRE2_DFA_RESTART option requests this action; when it is set, the
|
||||
<i>workspace</i> and <i>wscount</i> arguments must reference the same vector as
|
||||
before because data about the match so far is left in them after a partial
|
||||
match. There is more discussion of this facility in the
|
||||
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<br><b>
|
||||
Successful returns from <b>pcre2_dfa_match()</b>
|
||||
</b><br>
|
||||
<P>
|
||||
When <b>pcre2_dfa_match()</b> succeeds, it may have matched more than one
|
||||
substring in the subject. Note, however, that all the matches from one run of
|
||||
the function start at the same point in the subject. The shorter matches are
|
||||
all initial substrings of the longer matches. For example, if the pattern
|
||||
<pre>
|
||||
<.*>
|
||||
</pre>
|
||||
is matched against the string
|
||||
<pre>
|
||||
This is <something> <something else> <something further> no more
|
||||
</pre>
|
||||
the three matched strings are
|
||||
<pre>
  <something> <something else> <something further>
  <something> <something else>
  <something>
</pre>
|
||||
On success, the yield of the function is a number greater than zero, which is
|
||||
the number of matched substrings. The offsets of the substrings are returned in
|
||||
the ovector, and can be extracted by number in the same way as for
|
||||
<b>pcre2_match()</b>, but the numbers bear no relation to any capture groups
|
||||
that may exist in the pattern, because DFA matching does not support capturing.
|
||||
</P>
|
||||
<P>
|
||||
Calls to the convenience functions that extract substrings by name
|
||||
return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a
|
||||
DFA match. The convenience functions that extract substrings by number never
|
||||
return PCRE2_ERROR_NOSUBSTRING.
|
||||
</P>
|
||||
<P>
|
||||
The matched strings are stored in the ovector in reverse order of length; that
|
||||
is, the longest matching string is first. If there were too many matches to fit
|
||||
into the ovector, the yield of the function is zero, and the vector is filled
|
||||
with the longest matches.
|
||||
</P>
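<P>
One way to observe this ordering is to run the example above through the
<b>pcre2test</b> program with its <b>dfa</b> subject modifier. The sketch below
assumes that <b>pcre2test</b> (8-bit) is installed and on the PATH; the function
name <i>dfa_demo</i> is only illustrative.
</P>

```shell
# Hypothetical helper: run the example above through pcre2test's DFA matcher.
# Assumes pcre2test is on PATH; if it is not, say so and return success.
dfa_demo() {
  if command -v pcre2test >/dev/null 2>&1; then
    # First input line is the pattern, second is the subject. The \=dfa
    # modifier makes pcre2test call pcre2_dfa_match() for that subject,
    # and -q suppresses the version banner.
    printf '%s\n' \
      '/<.*>/' \
      'This is <something> <something else> <something further> no more\=dfa' |
      pcre2test -q
  else
    echo "pcre2test not found; skipping DFA demonstration"
  fi
}

dfa_demo
```

<P>
If the tool is available, the numbered output lines should follow the
longest-first order described above.
</P>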
|
||||
<P>
|
||||
NOTE: PCRE2's "auto-possessification" optimization usually applies to character
|
||||
repeats at the end of a pattern (as well as internally). For example, the
|
||||
pattern "a\d+" is compiled as if it were "a\d++". For DFA matching, this
|
||||
means that only one possible match is found. If you really do want multiple
|
||||
matches in such cases, either use an ungreedy repeat such as "a\d+?" or set
|
||||
the PCRE2_NO_AUTO_POSSESS option when compiling.
|
||||
</P>
|
||||
<br><b>
|
||||
Error returns from <b>pcre2_dfa_match()</b>
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>pcre2_dfa_match()</b> function returns a negative number when it fails.
|
||||
Many of the errors are the same as for <b>pcre2_match()</b>, as described
|
||||
<a href="#errorlist">above.</a>
|
||||
There are in addition the following errors that are specific to
|
||||
<b>pcre2_dfa_match()</b>:
|
||||
<pre>
|
||||
PCRE2_ERROR_DFA_UITEM
|
||||
</pre>
|
||||
This return is given if <b>pcre2_dfa_match()</b> encounters an item in the
|
||||
pattern that it does not support, for instance, the use of \C in a UTF mode or
|
||||
a backreference.
|
||||
<pre>
|
||||
PCRE2_ERROR_DFA_UCOND
|
||||
</pre>
|
||||
This return is given if <b>pcre2_dfa_match()</b> encounters a condition item
|
||||
that uses a backreference for the condition, or a test for recursion in a
|
||||
specific capture group. These are not supported.
|
||||
<pre>
|
||||
PCRE2_ERROR_DFA_UINVALID_UTF
|
||||
</pre>
|
||||
This return is given if <b>pcre2_dfa_match()</b> is called for a pattern that
|
||||
was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for DFA
|
||||
matching.
|
||||
<pre>
|
||||
PCRE2_ERROR_DFA_WSSIZE
|
||||
</pre>
|
||||
This return is given if <b>pcre2_dfa_match()</b> runs out of space in the
|
||||
<i>workspace</i> vector.
|
||||
<pre>
|
||||
PCRE2_ERROR_DFA_RECURSE
|
||||
</pre>
|
||||
When a recursion or subroutine call is processed, the matching function calls
|
||||
itself recursively, using private memory for the ovector and <i>workspace</i>.
|
||||
This error is given if the internal ovector is not large enough. This should be
|
||||
extremely rare, as a vector of size 1000 is used.
|
||||
<pre>
|
||||
PCRE2_ERROR_DFA_BADRESTART
|
||||
</pre>
|
||||
When <b>pcre2_dfa_match()</b> is called with the <b>PCRE2_DFA_RESTART</b> option,
|
||||
some plausibility checks are made on the contents of the workspace, which
|
||||
should contain data about the previous partial match. If any of these checks
|
||||
fail, this error is given.
|
||||
</P>
|
||||
<br><a name="SEC41" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2build</b>(3), <b>pcre2callout</b>(3), <b>pcre2demo</b>(3),
|
||||
<b>pcre2matching</b>(3), <b>pcre2partial</b>(3), <b>pcre2posix</b>(3),
|
||||
<b>pcre2sample</b>(3), <b>pcre2unicode</b>(3).
|
||||
</P>
|
||||
<br><a name="SEC42" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC43" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 26 December 2024
|
||||
<br>
|
||||
Copyright © 1997-2024 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
652
3rd/pcre2/doc/html/pcre2build.html
Normal file
@@ -0,0 +1,652 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2build specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2build man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">BUILDING PCRE2</a>
|
||||
<li><a name="TOC2" href="#SEC2">PCRE2 BUILD-TIME OPTIONS</a>
|
||||
<li><a name="TOC3" href="#SEC3">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
|
||||
<li><a name="TOC4" href="#SEC4">BUILDING SHARED AND STATIC LIBRARIES</a>
|
||||
<li><a name="TOC5" href="#SEC5">UNICODE AND UTF SUPPORT</a>
|
||||
<li><a name="TOC6" href="#SEC6">DISABLING THE USE OF \C</a>
|
||||
<li><a name="TOC7" href="#SEC7">JUST-IN-TIME COMPILER SUPPORT</a>
|
||||
<li><a name="TOC8" href="#SEC8">NEWLINE RECOGNITION</a>
|
||||
<li><a name="TOC9" href="#SEC9">WHAT \R MATCHES</a>
|
||||
<li><a name="TOC10" href="#SEC10">HANDLING VERY LARGE PATTERNS</a>
|
||||
<li><a name="TOC11" href="#SEC11">LIMITING PCRE2 RESOURCE USAGE</a>
|
||||
<li><a name="TOC12" href="#SEC12">LIMITING VARIABLE-LENGTH LOOKBEHIND ASSERTIONS</a>
|
||||
<li><a name="TOC13" href="#SEC13">CREATING CHARACTER TABLES AT BUILD TIME</a>
|
||||
<li><a name="TOC14" href="#SEC14">USING EBCDIC CODE</a>
|
||||
<li><a name="TOC15" href="#SEC15">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a>
|
||||
<li><a name="TOC16" href="#SEC16">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
|
||||
<li><a name="TOC17" href="#SEC17">PCRE2GREP BUFFER SIZE</a>
|
||||
<li><a name="TOC18" href="#SEC18">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
|
||||
<li><a name="TOC19" href="#SEC19">INCLUDING DEBUGGING CODE</a>
|
||||
<li><a name="TOC20" href="#SEC20">DEBUGGING WITH VALGRIND SUPPORT</a>
|
||||
<li><a name="TOC21" href="#SEC21">CODE COVERAGE REPORTING</a>
|
||||
<li><a name="TOC22" href="#SEC22">DISABLING THE Z AND T FORMATTING MODIFIERS</a>
|
||||
<li><a name="TOC23" href="#SEC23">SUPPORT FOR FUZZERS</a>
|
||||
<li><a name="TOC24" href="#SEC24">OBSOLETE OPTION</a>
|
||||
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
|
||||
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
|
||||
<li><a name="TOC27" href="#SEC27">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">BUILDING PCRE2</a><br>
|
||||
<P>
|
||||
PCRE2 is distributed with a <b>configure</b> script that can be used to build
|
||||
the library in Unix-like environments using the applications known as
|
||||
Autotools. Also in the distribution are files to support building using
|
||||
<b>CMake</b> instead of <b>configure</b>. The text file
|
||||
<a href="README.txt"><b>README</b></a>
|
||||
contains general information about building with Autotools (some of which is
|
||||
repeated below), and also has some comments about building on various operating
|
||||
systems. The files in the <b>vms</b> directory support building under OpenVMS.
|
||||
There is a lot more information about building PCRE2 without using
|
||||
Autotools (including information about using <b>CMake</b> and building "by
|
||||
hand") in the text file called
|
||||
<a href="NON-AUTOTOOLS-BUILD.txt"><b>NON-AUTOTOOLS-BUILD</b>.</a>
|
||||
You should consult this file as well as the
|
||||
<a href="README.txt"><b>README</b></a>
|
||||
file if you are building in a non-Unix-like environment.
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">PCRE2 BUILD-TIME OPTIONS</a><br>
|
||||
<P>
|
||||
The rest of this document describes the optional features of PCRE2 that can be
|
||||
selected when the library is compiled. It assumes use of the <b>configure</b>
|
||||
script, where the optional features are selected or deselected by providing
|
||||
options to <b>configure</b> before running the <b>make</b> command. However, the
|
||||
same options can be selected in both Unix-like and non-Unix-like environments
|
||||
if you are using <b>CMake</b> instead of <b>configure</b> to build PCRE2.
|
||||
</P>
|
||||
<P>
|
||||
If you are not using Autotools or <b>CMake</b>, option selection can be done by
|
||||
editing the <b>config.h</b> file, or by passing parameter settings to the
|
||||
compiler, as described in
|
||||
<a href="NON-AUTOTOOLS-BUILD.txt"><b>NON-AUTOTOOLS-BUILD</b>.</a>
|
||||
</P>
|
||||
<P>
|
||||
The complete list of options for <b>configure</b> (which includes the standard
|
||||
ones such as the selection of the installation directory) can be obtained by
|
||||
running
|
||||
<pre>
|
||||
./configure --help
|
||||
</pre>
|
||||
The following sections include descriptions of "on/off" options whose names
|
||||
begin with --enable or --disable. Because of the way that <b>configure</b>
|
||||
works, --enable and --disable always come in pairs, so the complementary option
|
||||
always exists as well, but as it specifies the default, it is not described.
|
||||
Options that specify values have names that start with --with. At the end of a
|
||||
<b>configure</b> run, a summary of the configuration is output.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a><br>
|
||||
<P>
|
||||
By default, a library called <b>libpcre2-8</b> is built, containing functions
|
||||
that take string arguments contained in arrays of bytes, interpreted either as
|
||||
single-byte characters, or UTF-8 strings. You can also build two other
|
||||
libraries, called <b>libpcre2-16</b> and <b>libpcre2-32</b>, which process
|
||||
strings that are contained in arrays of 16-bit and 32-bit code units,
|
||||
respectively. These can be interpreted either as single-unit characters or
|
||||
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
|
||||
the following to the <b>configure</b> command:
|
||||
<pre>
|
||||
--enable-pcre2-16
|
||||
--enable-pcre2-32
|
||||
</pre>
|
||||
If you do not want the 8-bit library, add
|
||||
<pre>
|
||||
--disable-pcre2-8
|
||||
</pre>
|
||||
as well. At least one of the three libraries must be built. Note that the POSIX
|
||||
wrapper is for the 8-bit library only, and that <b>pcre2grep</b> is an 8-bit
|
||||
program. Neither of these is built if you select only the 16-bit or 32-bit
|
||||
libraries.
|
||||
</P>
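<P>
The rules above can be sketched as a small shell helper that turns a choice of
code unit widths into <b>configure</b> arguments. The function name
<i>pcre2_width_flags</i> and its yes/no arguments are assumptions made for this
illustration; they are not part of PCRE2.
</P>

```shell
# Illustrative helper (not part of PCRE2): turn yes/no choices for the
# 8-, 16- and 32-bit libraries into configure arguments.
pcre2_width_flags() {
  want8=$1; want16=$2; want32=$3
  if [ "$want8" != yes ] && [ "$want16" != yes ] && [ "$want32" != yes ]; then
    echo "error: at least one of the three libraries must be built" >&2
    return 1
  fi
  flags=""
  # The 8-bit library is built by default, so only its absence needs a flag.
  if [ "$want8" != yes ]; then flags="$flags --disable-pcre2-8"; fi
  if [ "$want16" = yes ]; then flags="$flags --enable-pcre2-16"; fi
  if [ "$want32" = yes ]; then flags="$flags --enable-pcre2-32"; fi
  # Strip the leading space left by the concatenations above.
  echo "${flags# }"
}

# Dry run: show the command for a 16-bit plus 32-bit build without the 8-bit
# library (so no POSIX wrapper and no pcre2grep).
echo "./configure $(pcre2_width_flags no yes yes)"
```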
|
||||
<br><a name="SEC4" href="#TOC1">BUILDING SHARED AND STATIC LIBRARIES</a><br>
|
||||
<P>
|
||||
The Autotools PCRE2 building process uses <b>libtool</b> to build both shared
|
||||
and static libraries by default. You can suppress an unwanted library by adding
|
||||
one of
|
||||
<pre>
|
||||
--disable-shared
|
||||
--disable-static
|
||||
</pre>
|
||||
to the <b>configure</b> command. Setting --disable-shared ensures that PCRE2
|
||||
libraries are built as static libraries. The binaries that are then created as
|
||||
part of the build process (for example, <b>pcre2test</b> and <b>pcre2grep</b>)
|
||||
are linked statically with one or more PCRE2 libraries, but may also be
|
||||
dynamically linked with other libraries such as <b>libc</b>. If you want these
|
||||
binaries to be fully statically linked, you can set LDFLAGS like this:
|
||||
<br>
|
||||
<br>
|
||||
LDFLAGS=--static ./configure --disable-shared
|
||||
<br>
|
||||
<br>
|
||||
Note the two hyphens in --static. Of course, this works only if static versions
|
||||
of all the relevant libraries are available for linking.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">UNICODE AND UTF SUPPORT</a><br>
|
||||
<P>
|
||||
By default, PCRE2 is built with support for Unicode and UTF character strings.
|
||||
To build it without Unicode support, add
|
||||
<pre>
|
||||
--disable-unicode
|
||||
</pre>
|
||||
to the <b>configure</b> command. This setting applies to all three libraries. It
|
||||
is not possible to build one library with Unicode support and another without
|
||||
in the same configuration.
|
||||
</P>
|
||||
<P>
|
||||
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
|
||||
or UTF-32. To do that, applications that use the library can set the PCRE2_UTF
|
||||
option when they call <b>pcre2_compile()</b> to compile a pattern.
|
||||
Alternatively, patterns may be started with (*UTF) unless the application has
|
||||
locked this out by setting PCRE2_NEVER_UTF.
|
||||
</P>
|
||||
<P>
|
||||
UTF support allows the libraries to process character code points up to
|
||||
0x10ffff in the strings that they handle. Unicode support also gives access to
|
||||
the Unicode properties of characters, using pattern escapes such as \P, \p,
|
||||
and \X. Only the general category properties such as <i>Lu</i> and <i>Nd</i>,
|
||||
script names, and some bi-directional properties are supported. Details are
|
||||
given in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
Pattern escapes such as \d and \w do not by default make use of Unicode
|
||||
properties. The application can request that they do by setting the PCRE2_UCP
|
||||
option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
|
||||
request this by starting with (*UCP).
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">DISABLING THE USE OF \C</a><br>
|
||||
<P>
|
||||
The \C escape sequence, which matches a single code unit, even in a UTF mode,
|
||||
can cause unpredictable behaviour because it may leave the current matching
|
||||
point in the middle of a multi-code-unit character. The application can lock it
|
||||
out by setting the PCRE2_NEVER_BACKSLASH_C option when calling
|
||||
<b>pcre2_compile()</b>. There is also a build-time option
|
||||
<pre>
|
||||
--enable-never-backslash-C
|
||||
</pre>
|
||||
(note the upper case C) which locks out the use of \C entirely.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
|
||||
<P>
|
||||
Just-in-time (JIT) compiler support is included in the build by specifying
|
||||
<pre>
|
||||
--enable-jit
|
||||
</pre>
|
||||
This support is available only for certain hardware architectures. If this
|
||||
option is set for an unsupported architecture, a build error occurs.
|
||||
If in doubt, use
|
||||
<pre>
|
||||
--enable-jit=auto
|
||||
</pre>
|
||||
which enables JIT only if the current hardware is supported. You can check
|
||||
if JIT is enabled in the configuration summary that is output at the end of a
|
||||
<b>configure</b> run. If you are enabling JIT under SELinux you may also want to
|
||||
add
|
||||
<pre>
|
||||
--enable-jit-sealloc
|
||||
</pre>
|
||||
which enables the use of an execmem allocator in JIT that is compatible with
|
||||
SELinux. This has no effect if JIT is not enabled. See the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation for a discussion of JIT usage. When JIT support is enabled,
|
||||
<b>pcre2grep</b> automatically makes use of it, unless you add
|
||||
<pre>
|
||||
--disable-pcre2grep-jit
|
||||
</pre>
|
||||
to the <b>configure</b> command.
|
||||
</P>
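<P>
After installation, JIT availability can also be checked from a script. The
sketch below assumes that <b>pcre2test</b> is on the PATH and relies on its
<b>-C jit</b> option, which prints 1 or 0; the function name <i>jit_status</i>
is only illustrative.
</P>

```shell
# Hypothetical helper: report whether an installed pcre2test was built with JIT.
# Assumes pcre2test is on PATH; "pcre2test -C jit" prints 1 or 0.
jit_status() {
  if command -v pcre2test >/dev/null 2>&1; then
    # The exit code mirrors the printed value, so it is deliberately ignored.
    value=$(pcre2test -C jit) || true
    case "$value" in
      1) echo "JIT support: yes" ;;
      0) echo "JIT support: no" ;;
      *) echo "JIT support: unknown" ;;
    esac
  else
    echo "JIT support: unknown (pcre2test not found)"
  fi
}

jit_status
```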
|
||||
<br><a name="SEC8" href="#TOC1">NEWLINE RECOGNITION</a><br>
|
||||
<P>
|
||||
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
|
||||
of a line. This is the normal newline character on Unix-like systems. You can
|
||||
compile PCRE2 to use carriage return (CR) instead, by adding
|
||||
<pre>
|
||||
--enable-newline-is-cr
|
||||
</pre>
|
||||
to the <b>configure</b> command. There is also an --enable-newline-is-lf option,
|
||||
which explicitly specifies linefeed as the newline character.
|
||||
</P>
|
||||
<P>
|
||||
Alternatively, you can specify that line endings are to be indicated by the
|
||||
two-character sequence CRLF (CR immediately followed by LF). If you want this,
|
||||
add
|
||||
<pre>
|
||||
--enable-newline-is-crlf
|
||||
</pre>
|
||||
to the <b>configure</b> command. There is a fourth option, specified by
|
||||
<pre>
|
||||
--enable-newline-is-anycrlf
|
||||
</pre>
|
||||
which causes PCRE2 to recognize any of the three sequences CR, LF, or CRLF as
|
||||
indicating a line ending. A fifth option, specified by
|
||||
<pre>
|
||||
--enable-newline-is-any
|
||||
</pre>
|
||||
causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
|
||||
sequences are the three just mentioned, plus the single characters VT (vertical
|
||||
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
|
||||
separator, U+2028), and PS (paragraph separator, U+2029). The final option is
|
||||
<pre>
|
||||
--enable-newline-is-nul
|
||||
</pre>
|
||||
which causes NUL (binary zero) to be set as the default line-ending character.
|
||||
</P>
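<P>
The six conventions above map directly onto option names. As a minimal sketch
(the function name <i>newline_flag</i> is an assumption for illustration), a
build script might validate its input like this:
</P>

```shell
# Illustrative helper: map a newline convention name to its configure option.
newline_flag() {
  case "$1" in
    cr|lf|crlf|anycrlf|any|nul)
      echo "--enable-newline-is-$1" ;;
    *)
      echo "error: unknown newline convention: $1" >&2
      return 1 ;;
  esac
}

# Dry run: show the command rather than executing it.
echo "./configure $(newline_flag anycrlf)"
```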
|
||||
<P>
|
||||
Whatever default line ending convention is selected when PCRE2 is built can be
|
||||
overridden by applications that use the library. At build time it is
|
||||
recommended to use the standard for your operating system.
|
||||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">WHAT \R MATCHES</a><br>
|
||||
<P>
|
||||
By default, the sequence \R in a pattern matches any Unicode newline sequence,
|
||||
independently of what has been selected as the line ending sequence. If you
|
||||
specify
|
||||
<pre>
|
||||
--enable-bsr-anycrlf
|
||||
</pre>
|
||||
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
|
||||
selected when PCRE2 is built can be overridden by applications that use the
|
||||
library.
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
|
||||
<P>
|
||||
Within a compiled pattern, offset values are used to point from one part to
|
||||
another (for example, from an opening parenthesis to an alternation
|
||||
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
|
||||
are used for these offsets, leading to a maximum size for a compiled pattern of
|
||||
around 64 thousand code units. This is sufficient to handle all but the most
|
||||
gigantic patterns. Nevertheless, some people do want to process truly enormous
|
||||
patterns, so it is possible to compile PCRE2 to use three-byte or four-byte
|
||||
offsets by adding a setting such as
|
||||
<pre>
|
||||
--with-link-size=3
|
||||
</pre>
|
||||
to the <b>configure</b> command. The value given must be 2, 3, or 4. For the
|
||||
16-bit library, a value of 3 is rounded up to 4. In these libraries, using
|
||||
longer offsets slows down the operation of PCRE2 because it has to load
|
||||
additional data when handling them. For the 32-bit library the value is always
|
||||
4 and cannot be overridden; the value of --with-link-size is ignored.
|
||||
</P>
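<P>
The way the requested value is adjusted for each library can be summarized in a
few lines of shell. This is only a model of the rules stated above, not PCRE2's
own code, and the function name <i>effective_link_size</i> is hypothetical.
</P>

```shell
# Illustrative only: models the rules stated above.
# $1 = code unit width (8, 16 or 32), $2 = value given to --with-link-size.
effective_link_size() {
  width=$1; requested=$2
  case "$requested" in
    2|3|4) ;;
    *) echo "error: link size must be 2, 3, or 4" >&2; return 1 ;;
  esac
  case "$width" in
    8)  echo "$requested" ;;
    # A value of 3 is rounded up to 4 for the 16-bit library.
    16) if [ "$requested" -eq 3 ]; then echo 4; else echo "$requested"; fi ;;
    # The 32-bit library always uses 4, whatever was requested.
    32) echo 4 ;;
    *)  echo "error: width must be 8, 16, or 32" >&2; return 1 ;;
  esac
}

for w in 8 16 32; do
  echo "width $w, --with-link-size=3 -> $(effective_link_size "$w" 3)"
done
```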
|
||||
<br><a name="SEC11" href="#TOC1">LIMITING PCRE2 RESOURCE USAGE</a><br>
|
||||
<P>
|
||||
The <b>pcre2_match()</b> function increments a counter each time it goes round
|
||||
its main loop. Putting a limit on this counter controls the amount of computing
|
||||
resource used by a single call to <b>pcre2_match()</b>. The limit can be changed
|
||||
at run time, as described in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation. The default is 10 million, but this can be changed by adding a
|
||||
setting such as
|
||||
<pre>
|
||||
--with-match-limit=500000
|
||||
</pre>
|
||||
to the <b>configure</b> command. This setting also applies to the
|
||||
<b>pcre2_dfa_match()</b> matching function, and to JIT matching (though the
|
||||
counting is done differently).
|
||||
</P>
|
||||
<P>
|
||||
The <b>pcre2_match()</b> function uses heap memory to record backtracking
|
||||
points. The more nested backtracking points there are (that is, the deeper the
|
||||
search tree), the more memory is needed. There is an upper limit, specified in
|
||||
kibibytes (units of 1024 bytes). This limit can be changed at run time, as
|
||||
described in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation. The default limit is 20 million KiB (in effect unlimited). You can
|
||||
change this by a setting such as
|
||||
<pre>
|
||||
--with-heap-limit=500
|
||||
</pre>
|
||||
which limits the amount of heap to 500 KiB. This limit applies only to
|
||||
interpretive matching in <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, which
|
||||
may also use the heap for internal workspace when processing complicated
|
||||
patterns. This limit does not apply when JIT (which has its own memory
|
||||
arrangements) is used.
|
||||
</P>
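<P>
Because the heap limit is given in kibibytes, it can help to convert a setting
to bytes when comparing it with other memory figures. The helper name
<i>heap_limit_bytes</i> below is an assumption for illustration, and the second
call assumes a shell with 64-bit arithmetic.
</P>

```shell
# Convert a --with-heap-limit value (kibibytes) to bytes.
heap_limit_bytes() {
  echo $(( $1 * 1024 ))
}

heap_limit_bytes 500        # the example setting above: prints 512000
heap_limit_bytes 20000000   # the default: 20480000000 with 64-bit arithmetic
```

<P>
The second figure is roughly 19 GiB, which is why the default is described as
unlimited in practice.
</P>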
|
||||
<P>
|
||||
You can also explicitly limit the depth of nested backtracking in the
|
||||
<b>pcre2_match()</b> interpreter. This limit defaults to the value that is set
|
||||
for --with-match-limit. You can set a lower default limit by adding, for
|
||||
example,
|
||||
<pre>
|
||||
--with-match-limit-depth=10000
|
||||
</pre>
|
||||
to the <b>configure</b> command. This value can be overridden at run time. This
|
||||
depth limit indirectly limits the amount of heap memory that is used, but
|
||||
because the size of each backtracking "frame" depends on the number of
|
||||
capturing parentheses in a pattern, the amount of heap that is used before the
|
||||
limit is reached varies from pattern to pattern. This limit was more useful in
|
||||
versions before 10.30, where function recursion was used for backtracking.
|
||||
</P>
|
||||
<P>
|
||||
As well as applying to <b>pcre2_match()</b>, the depth limit also controls
|
||||
the depth of recursive function calls in <b>pcre2_dfa_match()</b>. These are
|
||||
used for lookaround assertions, atomic groups, and recursion within patterns.
|
||||
The limit does not apply to JIT matching.
|
||||
</P>
|
||||
<br><a name="SEC12" href="#TOC1">LIMITING VARIABLE-LENGTH LOOKBEHIND ASSERTIONS</a><br>
|
||||
<P>
|
||||
Lookbehind assertions in which one or more branches can match a variable number
|
||||
of characters are supported only if there is a maximum matching length for each
|
||||
top-level branch. There is a limit to this maximum that defaults to 255
|
||||
characters. You can alter this default by a setting such as
|
||||
<pre>
|
||||
--with-max-varlookbehind=100
|
||||
</pre>
|
||||
The limit can be changed at runtime by calling
|
||||
<b>pcre2_set_max_varlookbehind()</b>. Lookbehind assertions in which every
|
||||
branch matches a fixed number of characters (not necessarily all the same) are
|
||||
not constrained by this limit.
|
||||
<a name="createtables"></a></P>
|
||||
<br><a name="SEC13" href="#TOC1">CREATING CHARACTER TABLES AT BUILD TIME</a><br>
|
||||
<P>
|
||||
PCRE2 uses fixed tables for processing characters whose code points are less
|
||||
than 256. By default, PCRE2 is built with a set of tables that are distributed
|
||||
in the file <i>src/pcre2_chartables.c.dist</i>. These tables are for ASCII codes
|
||||
only. If you add
|
||||
<pre>
|
||||
--enable-rebuild-chartables
|
||||
</pre>
|
||||
to the <b>configure</b> command, the distributed tables are no longer used.
|
||||
Instead, a program called <b>pcre2_dftables</b> is compiled and run. This
|
||||
outputs the source for a new set of tables, created in the default locale of your
|
||||
C run-time system. This method of replacing the tables does not work if you are
|
||||
cross compiling, because <b>pcre2_dftables</b> needs to be run on the local
|
||||
host and therefore must not be compiled with the cross compiler.
|
||||
</P>
|
||||
<P>
|
||||
If you need to create alternative tables when cross compiling, you will have to
|
||||
do so "by hand". There may also be other reasons for creating tables manually.
|
||||
To cause <b>pcre2_dftables</b> to be built on the local host, run a normal
|
||||
compiling command, and then run the program with the output file as its
|
||||
argument, for example:
|
||||
<pre>
|
||||
cc src/pcre2_dftables.c -o pcre2_dftables
|
||||
./pcre2_dftables src/pcre2_chartables.c
|
||||
</pre>
|
||||
This builds the tables in the default locale of the local host. If you want to
|
||||
specify a locale, you must use the -L option:
|
||||
<pre>
|
||||
LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c
|
||||
</pre>
|
||||
You can also specify -b (with or without -L). This causes the tables to be
|
||||
written in binary instead of as source code. A set of binary tables can be
|
||||
loaded into memory by an application and passed to <b>pcre2_compile()</b> in the
|
||||
same way as tables created by calling <b>pcre2_maketables()</b>. The tables are
|
||||
just a string of bytes, independent of hardware characteristics such as
|
||||
endianness. This means they can be bundled with an application that runs in
|
||||
different environments, to ensure consistent behaviour.
|
||||
</P>
|
||||
<br><a name="SEC14" href="#TOC1">USING EBCDIC CODE</a><br>
|
||||
<P>
|
||||
PCRE2 assumes by default that it will run in an environment where the character
|
||||
code is ASCII or Unicode, which is a superset of ASCII. This is the case for
|
||||
most computer operating systems. PCRE2 can, however, be compiled to run in an
|
||||
8-bit EBCDIC environment by adding
|
||||
<pre>
|
||||
--enable-ebcdic --disable-unicode
|
||||
</pre>
|
||||
to the <b>configure</b> command. This setting implies
|
||||
--enable-rebuild-chartables. You should only use it if you know that you are in
|
||||
an EBCDIC environment (for example, an IBM mainframe operating system).
|
||||
</P>
|
||||
<P>
|
||||
It is not possible to support both EBCDIC and UTF-8 codes in the same version
|
||||
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
|
||||
exclusive.
|
||||
</P>
|
||||
<P>
|
||||
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
|
||||
value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In
|
||||
such an environment you should use
|
||||
<pre>
|
||||
--enable-ebcdic-nl25
|
||||
</pre>
|
||||
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR has the
|
||||
same value as in ASCII, namely, 0x0d. Whichever of 0x15 and 0x25 is <i>not</i>
|
||||
chosen as LF is made to correspond to the Unicode NEL character (which, in
|
||||
Unicode, is 0x85).
|
||||
</P>
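<P>
The resulting assignment of code values can be written down as a small lookup.
This is a sketch of the rule just described; the function name
<i>ebcdic_newline_codes</i> and its yes/no argument are hypothetical.
</P>

```shell
# Illustrative lookup: $1 is "yes" when --enable-ebcdic-nl25 is used.
# Whichever of 0x15 and 0x25 is not LF stands in for NEL; CR is always 0x0d.
ebcdic_newline_codes() {
  if [ "$1" = yes ]; then
    echo "LF=0x25 NEL=0x15 CR=0x0d"
  else
    echo "LF=0x15 NEL=0x25 CR=0x0d"
  fi
}

echo "default: $(ebcdic_newline_codes no)"
echo "nl25:    $(ebcdic_newline_codes yes)"
```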
|
||||
<P>
|
||||
The options that select newline behaviour, such as --enable-newline-is-cr,
|
||||
and equivalent run-time options, refer to these character values in an EBCDIC
|
||||
environment.
|
||||
</P>
|
||||
<br><a name="SEC15" href="#TOC1">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a><br>
|
||||
<P>
|
||||
By default <b>pcre2grep</b> supports the use of callouts with string arguments
|
||||
within the patterns it is matching. There are two kinds: one that generates
|
||||
output using local code, and another that calls an external program or script.
|
||||
If --disable-pcre2grep-callout-fork is added to the <b>configure</b> command,
|
||||
only the first kind of callout is supported; if --disable-pcre2grep-callout is
|
||||
used, all callouts are completely ignored. For more details of <b>pcre2grep</b>
|
||||
callouts, see the
|
||||
<a href="pcre2grep.html"><b>pcre2grep</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<br><a name="SEC16" href="#TOC1">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a><br>
|
||||
<P>
|
||||
By default, <b>pcre2grep</b> reads all files as plain text. You can build it so
|
||||
that it recognizes files whose names end in <b>.gz</b> or <b>.bz2</b>, and reads
|
||||
them with <b>libz</b> or <b>libbz2</b>, respectively, by adding one or both of
|
||||
<pre>
|
||||
--enable-pcre2grep-libz
|
||||
--enable-pcre2grep-libbz2
|
||||
</pre>
|
||||
to the <b>configure</b> command. These options naturally require that the
|
||||
relevant libraries are installed on your system. Configuration will fail if
|
||||
they are not.
|
||||
</P>
|
||||
<br><a name="SEC17" href="#TOC1">PCRE2GREP BUFFER SIZE</a><br>
|
||||
<P>
|
||||
<b>pcre2grep</b> uses an internal buffer to hold a "window" on the file it is
|
||||
scanning, in order to be able to output "before" and "after" lines when it
|
||||
finds a match. The default starting size of the buffer is 20KiB. The buffer
|
||||
itself is three times this size, but because of the way it is used for holding
|
||||
"before" lines, the longest line that is guaranteed to be processable is the
|
||||
notional buffer size. If a longer line is encountered, <b>pcre2grep</b>
|
||||
automatically expands the buffer, up to a specified maximum size, whose default
|
||||
is 1MiB or the starting size, whichever is the larger. You can change the
|
||||
default parameter values by adding, for example,
|
||||
<pre>
|
||||
--with-pcre2grep-bufsize=51200
|
||||
--with-pcre2grep-max-bufsize=2097152
|
||||
</pre>
|
||||
to the <b>configure</b> command. The caller of <b>pcre2grep</b> can override
|
||||
these values by using --buffer-size and --max-buffer-size on the command line.
|
||||
</P>
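<P>
The sizes involved follow from simple arithmetic. The sketch below models only
the relationships stated above (three times the starting size is allocated, and
the default maximum is the larger of 1MiB and the starting size); the function
name <i>pcre2grep_buffer_plan</i> is an assumption for illustration.
</P>

```shell
# Illustrative arithmetic: $1 = starting size in bytes, as for
# --with-pcre2grep-bufsize.
pcre2grep_buffer_plan() {
  start=$1
  allocated=$(( start * 3 ))        # the buffer itself is three times this size
  one_mib=$(( 1024 * 1024 ))
  # Default maximum: 1MiB or the starting size, whichever is larger.
  if [ "$start" -gt "$one_mib" ]; then max=$start; else max=$one_mib; fi
  echo "start=$start allocated=$allocated default_max=$max"
}

pcre2grep_buffer_plan 20480    # the default starting size (20KiB)
pcre2grep_buffer_plan 51200    # the example setting above
```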
|
||||
<br><a name="SEC18" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
|
||||
<P>
|
||||
If you add one of
|
||||
<pre>
|
||||
--enable-pcre2test-libreadline
|
||||
--enable-pcre2test-libedit
|
||||
</pre>
|
||||
to the <b>configure</b> command, <b>pcre2test</b> is linked with the
|
||||
<b>libreadline</b> or <b>libedit</b> library, respectively, and when its input is
|
||||
from a terminal, it reads it using the <b>readline()</b> function. This provides
|
||||
line-editing and history facilities. Note that <b>libreadline</b> is
|
||||
GPL-licensed, so if you distribute a binary of <b>pcre2test</b> linked in this
|
||||
way, there may be licensing issues. These can be avoided by linking instead
|
||||
with <b>libedit</b>, which has a BSD licence.
|
||||
</P>
|
||||
<P>
|
||||
Setting --enable-pcre2test-libreadline causes the <b>-lreadline</b> option to be
|
||||
added to the <b>pcre2test</b> build. In many operating environments with a
|
||||
system-installed readline library this is sufficient. However, in some
|
||||
environments (e.g. if an unmodified distribution version of readline is in
|
||||
use), some extra configuration may be necessary. The INSTALL file for
|
||||
<b>libreadline</b> says this:
|
||||
<pre>
|
||||
"Readline uses the termcap functions, but does not link with
|
||||
the termcap or curses library itself, allowing applications
|
||||
which link with readline the to choose an appropriate library."
|
||||
</pre>
|
||||
If your environment has not been set up so that an appropriate library is
|
||||
automatically included, you may need to add something like
|
||||
<pre>
|
||||
LIBS="-lncurses"
|
||||
</pre>
|
||||
immediately before the <b>configure</b> command.
|
||||
</P>
|
||||
<br><a name="SEC19" href="#TOC1">INCLUDING DEBUGGING CODE</a><br>
|
||||
<P>
|
||||
If you add
|
||||
<pre>
|
||||
--enable-debug
|
||||
</pre>
|
||||
to the <b>configure</b> command, additional debugging code is included in the
|
||||
build. This feature is intended for use by the PCRE2 maintainers.
|
||||
</P>
|
||||
<br><a name="SEC20" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
|
||||
<P>
|
||||
If you add
|
||||
<pre>
|
||||
--enable-valgrind
|
||||
</pre>
|
||||
to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
|
||||
certain memory regions as unaddressable. This allows it to detect invalid
|
||||
memory accesses, and is mostly useful for debugging PCRE2 itself.
|
||||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">CODE COVERAGE REPORTING</a><br>
|
||||
<P>
|
||||
If your C compiler is gcc, you can build a version of PCRE2 that can generate a
|
||||
code coverage report for its test suite. To enable this, you must install
|
||||
<b>lcov</b> version 1.6 or above. Then specify
|
||||
<pre>
|
||||
--enable-coverage
|
||||
</pre>
|
||||
to the <b>configure</b> command and build PCRE2 in the usual way.
|
||||
</P>
|
||||
<P>
|
||||
Note that using <b>ccache</b> (a caching C compiler) is incompatible with code
|
||||
coverage reporting. If you have configured <b>ccache</b> to run automatically
|
||||
on your system, you must set the environment variable
|
||||
<pre>
|
||||
CCACHE_DISABLE=1
|
||||
</pre>
|
||||
before running <b>make</b> to build PCRE2, so that <b>ccache</b> is not used.
|
||||
</P>
|
||||
<P>
|
||||
When --enable-coverage is used, the following additional targets are added to the
|
||||
<i>Makefile</i>:
|
||||
<pre>
|
||||
make coverage
|
||||
</pre>
|
||||
This creates a fresh coverage report for the PCRE2 test suite. It is equivalent
|
||||
to running "make coverage-reset", "make coverage-baseline", "make check", and
|
||||
then "make coverage-report".
|
||||
<pre>
|
||||
make coverage-reset
|
||||
</pre>
|
||||
This zeroes the coverage counters, but does nothing else.
|
||||
<pre>
|
||||
make coverage-baseline
|
||||
</pre>
|
||||
This captures baseline coverage information.
|
||||
<pre>
|
||||
make coverage-report
|
||||
</pre>
|
||||
This creates the coverage report.
|
||||
<pre>
|
||||
make coverage-clean-report
|
||||
</pre>
|
||||
This removes the generated coverage report without cleaning the coverage data
|
||||
itself.
|
||||
<pre>
|
||||
make coverage-clean-data
|
||||
</pre>
|
||||
This removes the captured coverage data without removing the coverage files
|
||||
created at compile time (*.gcno).
|
||||
<pre>
|
||||
make coverage-clean
|
||||
</pre>
|
||||
This cleans all coverage data including the generated coverage report. For more
|
||||
information about code coverage, see the <b>gcov</b> and <b>lcov</b>
|
||||
documentation.
|
||||
</P>
|
||||
<br><a name="SEC22" href="#TOC1">DISABLING THE Z AND T FORMATTING MODIFIERS</a><br>
|
||||
<P>
|
||||
The C99 standard defines formatting modifiers z and t for size_t and
|
||||
ptrdiff_t values, respectively. By default, PCRE2 uses these modifiers in
|
||||
environments other than old versions of Microsoft Visual Studio when
|
||||
__STDC_VERSION__ is defined and has a value greater than or equal to 199901L
|
||||
(indicating support for C99).
|
||||
However, there is at least one environment that claims to be C99 but does not
|
||||
support these modifiers. If
|
||||
<pre>
|
||||
--disable-percent-zt
|
||||
</pre>
|
||||
is specified, no use is made of the z or t modifiers. Instead of %td or %zu,
|
||||
a suitable format is used depending on the size of long for the platform.
|
||||
</P>
|
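<P>
As an illustration of the fallback described above, the following C sketch
formats size_t and ptrdiff_t values either with the C99 z and t modifiers or,
when a hypothetical macro SKETCH_NO_PERCENT_ZT is defined, by widening the
values to long types. The macro and function names are assumptions made for
this example only; they are not the names PCRE2 uses internally, and the exact
fallback format that PCRE2 selects may differ.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Formats a size_t into buf. SKETCH_NO_PERCENT_ZT is a hypothetical macro
   for this sketch; it stands for "the z and t modifiers are unavailable". */
static int format_size(char *buf, size_t buflen, size_t value)
{
    assert(buf != NULL && buflen > 0);
#ifdef SKETCH_NO_PERCENT_ZT
    /* Fallback: widen to unsigned long and use a pre-C99 format. */
    return snprintf(buf, buflen, "%lu", (unsigned long)value);
#else
    return snprintf(buf, buflen, "%zu", value);
#endif
}

/* Formats a ptrdiff_t into buf, with the same compile-time switch. */
static int format_ptrdiff(char *buf, size_t buflen, ptrdiff_t value)
{
    assert(buf != NULL && buflen > 0);
#ifdef SKETCH_NO_PERCENT_ZT
    return snprintf(buf, buflen, "%ld", (long)value);
#else
    return snprintf(buf, buflen, "%td", value);
#endif
}
```

<P>
The cast in the fallback branch is only safe where long is at least as wide as
size_t, which is why the choice of format has to depend on the size of long.
</P>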
||||
<br><a name="SEC23" href="#TOC1">SUPPORT FOR FUZZERS</a><br>
|
||||
<P>
|
||||
There is a special option for use by people who want to run fuzzing tests on
|
||||
PCRE2:
|
||||
<pre>
|
||||
--enable-fuzz-support
|
||||
</pre>
|
||||
At present this applies only to the 8-bit library. If set, it causes an extra
|
||||
library called libpcre2-fuzzsupport.a to be built, but not installed. This
|
||||
contains a single function called LLVMFuzzerTestOneInput() whose arguments are
|
||||
a pointer to a string and the length of the string. When called, this function
|
||||
tries to compile the string as a pattern, and if that succeeds, to match it.
|
||||
This is done both with no options and with some random option bits that are
|
||||
generated from the string.
|
||||
</P>
|
||||
<P>
|
||||
Setting --enable-fuzz-support also causes a binary called <b>pcre2fuzzcheck</b>
|
||||
to be created. This is normally run under valgrind or used when PCRE2 is
|
||||
compiled with address sanitizing enabled. It calls the fuzzing function and
|
||||
outputs information about what it is doing. The input strings are specified by
|
||||
arguments: if an argument starts with "=" the rest of it is a literal input
|
||||
string. Otherwise, it is assumed to be a file name, and the contents of the
|
||||
file are the test string.
|
||||
</P>
|
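<P>
The argument convention described above can be sketched in C as follows. This
is a minimal illustration of the documented rule (a leading "=" marks a literal
input string, anything else is a file name); the type and function names are
invented for this example and are not taken from the <b>pcre2fuzzcheck</b>
source.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    int is_literal;     /* 1: text is the input string itself; 0: a file name */
    const char *text;   /* literal input (without the '=') or the file name */
} fuzz_input;

/* Classifies one command-line argument according to the documented rule. */
static fuzz_input classify_arg(const char *arg)
{
    assert(arg != NULL);
    fuzz_input in;
    if (arg[0] == '=') {
        in.is_literal = 1;
        in.text = arg + 1;   /* the rest of the argument is the input string */
    } else {
        in.is_literal = 0;
        in.text = arg;       /* the contents of this file are the input */
    }
    return in;
}
```

<P>
A driver built on this would pass either the literal text or the file contents,
together with its length, to LLVMFuzzerTestOneInput().
</P>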
||||
<br><a name="SEC24" href="#TOC1">OBSOLETE OPTION</a><br>
|
||||
<P>
|
||||
In versions of PCRE2 prior to 10.30, there were two ways of handling
|
||||
backtracking in the <b>pcre2_match()</b> function. The default was to use the
|
||||
system stack, but if
|
||||
<pre>
|
||||
--disable-stack-for-recursion
|
||||
</pre>
|
||||
was set, memory on the heap was used. From release 10.30 onwards this has
|
||||
changed (the stack is no longer used) and this option now does nothing except
|
||||
give a warning.
|
||||
</P>
|
||||
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2api</b>(3), <b>pcre2-config</b>(1).
|
||||
</P>
|
||||
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 16 April 2024
|
||||
<br>
|
||||
Copyright © 1997-2024 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
480
3rd/pcre2/doc/html/pcre2callout.html
Normal file
@@ -0,0 +1,480 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2callout specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2callout man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
|
||||
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
|
||||
<li><a name="TOC3" href="#SEC3">MISSING CALLOUTS</a>
|
||||
<li><a name="TOC4" href="#SEC4">THE CALLOUT INTERFACE</a>
|
||||
<li><a name="TOC5" href="#SEC5">RETURN VALUES FROM CALLOUTS</a>
|
||||
<li><a name="TOC6" href="#SEC6">CALLOUT ENUMERATION</a>
|
||||
<li><a name="TOC7" href="#SEC7">AUTHOR</a>
|
||||
<li><a name="TOC8" href="#SEC8">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
|
||||
<P>
|
||||
<b>#include &lt;pcre2.h&gt;</b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int (*pcre2_callout)(pcre2_callout_block *, void *);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
|
||||
<b> int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
|
||||
<b> void *<i>user_data</i>);</b>
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
|
||||
<P>
|
||||
PCRE2 provides a feature called "callout", which is a means of temporarily
|
||||
passing control to the caller of PCRE2 in the middle of pattern matching. The
|
||||
caller of PCRE2 provides an external function by putting its entry point in
|
||||
a match context (see <b>pcre2_set_callout()</b> in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation).
|
||||
</P>
|
||||
<P>
|
||||
When using the <b>pcre2_substitute()</b> function, an additional callout feature
|
||||
is available. This does a callout after each change to the subject string and
|
||||
is described in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation; the rest of this document is concerned with callouts during
|
||||
pattern matching.
|
||||
</P>
|
||||
<P>
|
||||
Within a regular expression, (?C&lt;arg&gt;) indicates a point at which the external
|
||||
function is to be called. Different callout points can be identified by putting
|
||||
a number less than 256 after the letter C. The default value is zero.
|
||||
Alternatively, the argument may be a delimited string. The starting delimiter
|
||||
must be one of ` ' " ^ % # $ { and the ending delimiter is the same as the
|
||||
start, except for {, where the ending delimiter is }. If the ending delimiter
|
||||
is needed within the string, it must be doubled. For example, this pattern has
|
||||
two callout points:
|
||||
<pre>
|
||||
(?C1)abc(?C"some ""arbitrary"" text")def
|
||||
</pre>
|
||||
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
|
||||
automatically inserts callouts, all with number 255, before each item in the
|
||||
pattern except for immediately before or after an explicit callout. For
|
||||
example, if PCRE2_AUTO_CALLOUT is used with the pattern
|
||||
<pre>
|
||||
A(?C3)B
|
||||
</pre>
|
||||
it is processed as if it were
|
||||
<pre>
|
||||
(?C255)A(?C3)B(?C255)
|
||||
</pre>
|
||||
Here is a more complicated example:
|
||||
<pre>
|
||||
A(\d{2}|--)
|
||||
</pre>
|
||||
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
|
||||
<pre>
|
||||
(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
|
||||
</pre>
|
||||
Notice that there is a callout before and after each parenthesis and
|
||||
alternation bar. If the pattern contains a conditional group whose condition is
|
||||
an assertion, an automatic callout is inserted immediately before the
|
||||
condition. Such a callout may also be inserted explicitly, for example:
|
||||
<pre>
|
||||
(?(?C9)(?=a)ab|de) (?(?C%text%)(?!=d)ab|de)
|
||||
</pre>
|
||||
This applies only to assertion conditions (because they are themselves
|
||||
independent groups).
|
||||
</P>
|
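<P>
The delimiter rules for string callout arguments given above can be sketched in
C as follows. This is an illustration of the documented syntax only, not
PCRE2's own parser; the function names are hypothetical. The input is the
argument including its delimiters, for example "some ""arbitrary"" text" with
the surrounding quotes, and the output is the string with doubled ending
delimiters reduced to single characters.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Returns the ending delimiter for a starting one, or 0 if the starting
   character is not one of the permitted delimiters. */
static char closing_delimiter(char open)
{
    switch (open) {
    case '`': case '\'': case '"': case '^':
    case '%': case '#': case '$':
        return open;          /* same character ends the string */
    case '{':
        return '}';           /* braces are the one asymmetric pair */
    default:
        return 0;
    }
}

/* Decodes a delimited callout argument into out as a NUL-terminated string.
   Returns the decoded length, or -1 if the delimiter is not permitted, the
   string is unterminated, or out is too small. */
static long decode_callout_string(const char *arg, char *out, size_t outsize)
{
    assert(arg != NULL && out != NULL);
    char close = closing_delimiter(arg[0]);
    if (close == 0) return -1;

    size_t n = 0;
    for (const char *p = arg + 1; *p != '\0'; p++) {
        if (*p == close) {
            if (p[1] != close) {          /* single closer: end of string */
                if (n >= outsize) return -1;
                out[n] = '\0';
                return (long)n;
            }
            p++;                          /* doubled closer: keep one copy */
        }
        if (n + 1 >= outsize) return -1;  /* keep room for the final NUL */
        out[n++] = *p;
    }
    return -1;                            /* no ending delimiter was found */
}
```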
||||
<P>
|
||||
Callouts can be useful for tracking the progress of pattern matching. The
|
||||
<a href="pcre2test.html"><b>pcre2test</b></a>
|
||||
program has a pattern qualifier (/auto_callout) that sets automatic callouts.
|
||||
When any callouts are present, the output from <b>pcre2test</b> indicates how
|
||||
the pattern is being matched. This is useful information when you are trying to
|
||||
optimize the performance of a particular pattern.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">MISSING CALLOUTS</a><br>
|
||||
<P>
|
||||
You should be aware that, because of optimizations in the way PCRE2 compiles
|
||||
and matches patterns, callouts sometimes do not happen exactly as you might
|
||||
expect.
|
||||
</P>
|
||||
<br><b>
|
||||
Auto-possessification
|
||||
</b><br>
|
||||
<P>
|
||||
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
|
||||
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
|
||||
if it were a++[bc]. The <b>pcre2test</b> output when this pattern is compiled
|
||||
with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
|
||||
"aaaa" is:
|
||||
<pre>
|
||||
--->aaaa
|
||||
+0 ^ a+
|
||||
+2 ^ ^ [bc]
|
||||
No match
|
||||
</pre>
|
||||
This indicates that when matching [bc] fails, there is no backtracking into a+
|
||||
(because it is being treated as a++) and therefore the callouts that would be
|
||||
taken for the backtracks do not occur. You can disable the auto-possessify
|
||||
feature by passing PCRE2_NO_AUTO_POSSESS to <b>pcre2_compile()</b>, or starting
|
||||
the pattern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
|
||||
<pre>
|
||||
--->aaaa
|
||||
+0 ^ a+
|
||||
+2 ^ ^ [bc]
|
||||
+2 ^ ^ [bc]
|
||||
+2 ^ ^ [bc]
|
||||
+2 ^^ [bc]
|
||||
No match
|
||||
</pre>
|
||||
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
|
||||
again, repeatedly, until a+ itself fails.
|
||||
</P>
|
||||
<br><b>
|
||||
Automatic .* anchoring
|
||||
</b><br>
|
||||
<P>
|
||||
By default, an optimization is applied when .* is the first significant item in
|
||||
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
|
||||
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
|
||||
start only after an internal newline or at the beginning of the subject, and
|
||||
<b>pcre2_compile()</b> remembers this. If a pattern has more than one top-level
|
||||
branch, automatic anchoring occurs if all branches are anchorable.
|
||||
</P>
|
||||
<P>
|
||||
This optimization is disabled, however, if .* is in an atomic group or if there
|
||||
is a backreference to the capture group in which it appears. It is also
|
||||
disabled if the pattern contains (*PRUNE) or (*SKIP). However, the presence of
|
||||
callouts does not affect it.
|
||||
</P>
|
||||
<P>
|
||||
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT and
|
||||
applied to the string "aa", the <b>pcre2test</b> output is:
|
||||
<pre>
|
||||
--->aa
|
||||
+0 ^ .*
|
||||
+2 ^ ^ \d
|
||||
+2 ^^ \d
|
||||
+2 ^ \d
|
||||
No match
|
||||
</pre>
|
||||
This shows that all match attempts start at the beginning of the subject. In
|
||||
other words, the pattern is anchored. You can disable this optimization by
|
||||
passing PCRE2_NO_DOTSTAR_ANCHOR to <b>pcre2_compile()</b>, or starting the
|
||||
pattern with (*NO_DOTSTAR_ANCHOR). In this case, the output changes to:
|
||||
<pre>
|
||||
--->aa
|
||||
+0 ^ .*
|
||||
+2 ^ ^ \d
|
||||
+2 ^^ \d
|
||||
+2 ^ \d
|
||||
+0 ^ .*
|
||||
+2 ^^ \d
|
||||
+2 ^ \d
|
||||
No match
|
||||
</pre>
|
||||
This shows more match attempts, starting at the second subject character.
|
||||
Another optimization, described in the next section, means that there is no
|
||||
subsequent attempt to match with an empty subject.
|
||||
</P>
|
||||
<br><b>
|
||||
Other optimizations
|
||||
</b><br>
|
||||
<P>
|
||||
Other optimizations that provide fast "no match" results also affect callouts.
|
||||
For example, if the pattern is
|
||||
<pre>
|
||||
ab(?C4)cd
|
||||
</pre>
|
||||
PCRE2 knows that any matching string must contain the letter "d". If the
|
||||
subject string is "abyz", the lack of "d" means that matching doesn't ever
|
||||
start, and the callout is never reached. However, with "abyd", though the
|
||||
result is still no match, the callout is obeyed.
|
||||
</P>
|
||||
<P>
|
||||
For most patterns PCRE2 also knows the minimum length of a matching string, and
|
||||
will immediately give a "no match" return without actually running a match if
|
||||
the subject is not long enough, or, for unanchored patterns, if it has been
|
||||
scanned far enough.
|
||||
</P>
|
||||
<P>
|
||||
You can disable these optimizations by passing the PCRE2_NO_START_OPTIMIZE
|
||||
option to <b>pcre2_compile()</b>, or by starting the pattern with
|
||||
(*NO_START_OPT). This slows down the matching process, but does ensure that
|
||||
callouts such as the example above are obeyed.
|
||||
<a name="calloutinterface"></a></P>
|
||||
<br><a name="SEC4" href="#TOC1">THE CALLOUT INTERFACE</a><br>
|
||||
<P>
|
||||
During matching, when PCRE2 reaches a callout point, if an external function is
|
||||
provided in the match context, it is called. This applies to normal,
|
||||
DFA, and JIT matching. The first argument to the callout function is a pointer
|
||||
to a <b>pcre2_callout</b> block. The second argument is the void * callout data
|
||||
that was supplied when the callout was set up by calling
|
||||
<b>pcre2_set_callout()</b> (see the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation). The callout block structure contains the following fields, not
|
||||
necessarily in this order:
|
||||
<pre>
|
||||
uint32_t <i>version</i>;
|
||||
uint32_t <i>callout_number</i>;
|
||||
uint32_t <i>capture_top</i>;
|
||||
uint32_t <i>capture_last</i>;
|
||||
uint32_t <i>callout_flags</i>;
|
||||
PCRE2_SIZE *<i>offset_vector</i>;
|
||||
PCRE2_SPTR <i>mark</i>;
|
||||
PCRE2_SPTR <i>subject</i>;
|
||||
PCRE2_SIZE <i>subject_length</i>;
|
||||
PCRE2_SIZE <i>start_match</i>;
|
||||
PCRE2_SIZE <i>current_position</i>;
|
||||
PCRE2_SIZE <i>pattern_position</i>;
|
||||
PCRE2_SIZE <i>next_item_length</i>;
|
||||
PCRE2_SIZE <i>callout_string_offset</i>;
|
||||
PCRE2_SIZE <i>callout_string_length</i>;
|
||||
PCRE2_SPTR <i>callout_string</i>;
|
||||
</pre>
|
||||
The <i>version</i> field contains the version number of the block format. The
|
||||
current version is 2; the three callout string fields were added for version 1,
|
||||
and the <i>callout_flags</i> field for version 2. If you are writing an
|
||||
application that might use an earlier release of PCRE2, you should check the
|
||||
version number before accessing any of these fields. The version number will
|
||||
increase in future if more fields are added, but the intention is never to
|
||||
remove any of the existing fields.
|
||||
</P>
|
||||
<br><b>
|
||||
Fields for numerical callouts
|
||||
</b><br>
|
||||
<P>
|
||||
For a numerical callout, <i>callout_string</i> is NULL, and <i>callout_number</i>
|
||||
contains the number of the callout, in the range 0-255. This is the number
|
||||
that follows (?C for callouts that are part of the pattern; it is 255 for
|
||||
automatically generated callouts.
|
||||
</P>
|
||||
<br><b>
|
||||
Fields for string callouts
|
||||
</b><br>
|
||||
<P>
|
||||
For callouts with string arguments, <i>callout_number</i> is always zero, and
|
||||
<i>callout_string</i> points to the string that is contained within the compiled
|
||||
pattern. Its length is given by <i>callout_string_length</i>. Duplicated ending
|
||||
delimiters that were present in the original pattern string have been turned
|
||||
into single characters, but there is no other processing of the callout string
|
||||
argument. An additional code unit containing binary zero is present after the
|
||||
string, but is not included in the length. The delimiter that was used to start
|
||||
the string is also stored within the pattern, immediately before the string
|
||||
itself. You can access this delimiter as <i>callout_string</i>[-1] if you need
|
||||
it.
|
||||
</P>
|
||||
<P>
|
||||
The <i>callout_string_offset</i> field is the code unit offset to the start of
|
||||
the callout argument string within the original pattern string. This is
|
||||
provided for the benefit of applications such as script languages that might
|
||||
need to report errors in the callout string within the pattern.
|
||||
</P>
|
||||
<br><b>
|
||||
Fields for all callouts
|
||||
</b><br>
|
||||
<P>
|
||||
The remaining fields in the callout block are the same for both kinds of
|
||||
callout.
|
||||
</P>
|
||||
<P>
|
||||
The <i>offset_vector</i> field is a pointer to a vector of capturing offsets
|
||||
(the "ovector"). You may read the elements in this vector, but you must not
|
||||
change any of them.
|
||||
</P>
|
||||
<P>
|
||||
For calls to <b>pcre2_match()</b>, the <i>offset_vector</i> field is not (since
|
||||
release 10.30) a pointer to the actual ovector that was passed to the matching
|
||||
function in the match data block. Instead it points to an internal ovector of a
|
||||
size large enough to hold all possible captured substrings in the pattern. Note
|
||||
that whenever a recursion or subroutine call within a pattern completes, the
|
||||
capturing state is reset to what it was before.
|
||||
</P>
|
||||
<P>
|
||||
The <i>capture_last</i> field contains the number of the most recently captured
|
||||
substring, and the <i>capture_top</i> field contains one more than the number of
|
||||
the highest numbered captured substring so far. If no substrings have yet been
|
||||
captured, the value of <i>capture_last</i> is 0 and the value of
|
||||
<i>capture_top</i> is 1. The values of these fields do not always differ by one;
|
||||
for example, when the callout in the pattern ((a)(b))(?C2) is taken,
|
||||
<i>capture_last</i> is 1 but <i>capture_top</i> is 4.
|
||||
</P>
|
||||
<P>
|
||||
The contents of ovector[2] to ovector[&lt;capture_top&gt;*2-1] can be inspected in
|
||||
order to extract substrings that have been matched so far, in the same way as
|
||||
extracting substrings after a match has completed. The values in ovector[0] and
|
||||
ovector[1] are always PCRE2_UNSET because the match is by definition not
|
||||
complete. Substrings that have not been captured but whose numbers are less
|
||||
than <i>capture_top</i> also have both of their ovector slots set to
|
||||
PCRE2_UNSET.
|
||||
</P>
|
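<P>
The indexing rule above can be sketched in C as follows. To keep the example
self-contained it uses size_t and a local SKETCH_UNSET constant as stand-ins
for PCRE2_SIZE and PCRE2_UNSET; real code would take <i>offset_vector</i> and
<i>capture_top</i> from the callout block and use the definitions in pcre2.h.
The function name is hypothetical.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PCRE2_UNSET in this sketch; use the real macro in real code. */
#define SKETCH_UNSET (~(size_t)0)

/* Reports whether capture group `group` holds a value at callout time.
   On success stores the start offset and length and returns 1, else 0. */
static int captured_so_far(const size_t *ovector, uint32_t capture_top,
                           uint32_t group, size_t *start, size_t *length)
{
    assert(ovector != NULL && start != NULL && length != NULL);

    /* Slots 0 and 1 are always unset during a callout, and only groups
       numbered below capture_top have meaningful slots. */
    if (group == 0 || group >= capture_top) return 0;

    size_t s = ovector[2 * group];
    size_t e = ovector[2 * group + 1];
    if (s == SKETCH_UNSET || e == SKETCH_UNSET) return 0;
    if (e < s) return 0;   /* defensive check for this sketch */

    *start = s;
    *length = e - s;
    return 1;
}
```

<P>
For instance, when the callout in ((a)(b))(?C2) is taken on the subject "ab",
<i>capture_top</i> is 4, so groups 1 to 3 can be inspected through slots
ovector[2] to ovector[7].
</P>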
||||
<P>
|
||||
For DFA matching, the <i>offset_vector</i> field points to the ovector that was
|
||||
passed to the matching function in the match data block for callouts at the top
|
||||
level, but to an internal ovector during the processing of pattern recursions,
|
||||
lookarounds, and atomic groups. However, these ovectors hold no useful
|
||||
information because <b>pcre2_dfa_match()</b> does not support substring
|
||||
capturing. The value of <i>capture_top</i> is always 1 and the value of
|
||||
<i>capture_last</i> is always 0 for DFA matching.
|
||||
</P>
|
||||
<P>
|
||||
The <i>subject</i> and <i>subject_length</i> fields contain copies of the values
|
||||
that were passed to the matching function.
|
||||
</P>
|
||||
<P>
|
||||
The <i>start_match</i> field normally contains the offset within the subject at
|
||||
which the current match attempt started. However, if the escape sequence \K
|
||||
has been encountered, this value is changed to reflect the modified starting
|
||||
point. If the pattern is not anchored, the callout function may be called
|
||||
several times from the same point in the pattern for different starting points
|
||||
in the subject.
|
||||
</P>
|
||||
<P>
|
||||
The <i>current_position</i> field contains the offset within the subject of the
|
||||
current match pointer.
|
||||
</P>
|
||||
<P>
|
||||
The <i>pattern_position</i> field contains the offset in the pattern string to
|
||||
the next item to be matched.
|
||||
</P>
|
||||
<P>
|
||||
The <i>next_item_length</i> field contains the length of the next item to be
|
||||
processed in the pattern string. When the callout is at the end of the pattern,
|
||||
the length is zero. When the callout precedes an opening parenthesis, the
|
||||
length includes meta characters that follow the parenthesis. For example, in a
|
||||
callout before an assertion such as (?=ab) the length is 3. For an alternation
|
||||
bar or a closing parenthesis, the length is one, unless a closing parenthesis
|
||||
is followed by a quantifier, in which case its length is included. (This
|
||||
changed in release 10.23. In earlier releases, before an opening parenthesis
|
||||
the length was that of the entire group, and before an alternation bar or a
|
||||
closing parenthesis the length was zero.)
|
||||
</P>
|
||||
<P>
|
||||
The <i>pattern_position</i> and <i>next_item_length</i> fields are intended to
|
||||
help in distinguishing between different automatic callouts, which all have the
|
||||
same callout number. However, they are set for all callouts, and are used by
|
||||
<b>pcre2test</b> to show the next item to be matched when displaying callout
|
||||
information.
|
||||
</P>
|
||||
<P>
|
||||
In callouts from <b>pcre2_match()</b> the <i>mark</i> field contains a pointer to
|
||||
the zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
|
||||
(*THEN) item in the match, or NULL if no such items have been passed. Instances
|
||||
of (*PRUNE) or (*THEN) without a name do not obliterate a previous (*MARK). In
|
||||
callouts from the DFA matching function this field always contains NULL.
|
||||
</P>
|
||||
<P>
|
||||
The <i>callout_flags</i> field is always zero in callouts from
|
||||
<b>pcre2_dfa_match()</b> or when JIT is being used. When <b>pcre2_match()</b>
|
||||
without JIT is used, the following bits may be set:
|
||||
<pre>
|
||||
PCRE2_CALLOUT_STARTMATCH
|
||||
</pre>
|
||||
This is set for the first callout after the start of matching for each new
|
||||
starting position in the subject.
|
||||
<pre>
|
||||
PCRE2_CALLOUT_BACKTRACK
|
||||
</pre>
|
||||
This is set if there has been a matching backtrack since the previous callout,
|
||||
or since the start of matching if this is the first callout from a
|
||||
<b>pcre2_match()</b> run.
|
||||
</P>
|
||||
<P>
|
||||
Both bits are set when a backtrack has caused a "bumpalong" to a new starting
|
||||
position in the subject. Output from <b>pcre2test</b> does not indicate the
|
||||
presence of these bits unless the <b>callout_extra</b> modifier is set.
|
||||
</P>
|
||||
<P>
|
||||
The information in the <b>callout_flags</b> field is provided so that
|
||||
applications can track and tell their users how matching with backtracking is
|
||||
done. This can be useful when trying to optimize patterns, or just to
|
||||
understand how PCRE2 works. There is no support in <b>pcre2_dfa_match()</b>
|
||||
because there is no backtracking in DFA matching, and there is no support in
|
||||
JIT because JIT is all about maximizing matching performance. In both these
|
||||
cases the <b>callout_flags</b> field is always zero.
|
||||
</P>
|
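<P>
The way the two flag bits combine can be sketched in C as follows. The struct
and the SKETCH_ constants are stand-ins invented for this example so that it
compiles on its own; real code would read <i>version</i> and
<i>callout_flags</i> from the callout block and test the
PCRE2_CALLOUT_STARTMATCH and PCRE2_CALLOUT_BACKTRACK macros from pcre2.h
rather than literal values.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for this sketch only; use the macros from pcre2.h in real code. */
#define SKETCH_STARTMATCH 0x1u
#define SKETCH_BACKTRACK  0x2u

/* Cut-down stand-in for the two callout block fields used here. */
typedef struct {
    uint32_t version;
    uint32_t callout_flags;
} sketch_block;

typedef enum {
    EV_UNKNOWN,     /* block too old to carry callout_flags */
    EV_CONTINUE,    /* neither bit set */
    EV_NEW_START,   /* first callout for this starting position */
    EV_BACKTRACK,   /* a backtrack happened since the previous callout */
    EV_BUMPALONG    /* both bits: a backtrack moved the starting position */
} sketch_event;

static sketch_event classify_callout(const sketch_block *cb)
{
    assert(cb != NULL);
    if (cb->version < 2) return EV_UNKNOWN;  /* field was added in version 2 */

    int start = (cb->callout_flags & SKETCH_STARTMATCH) != 0;
    int back  = (cb->callout_flags & SKETCH_BACKTRACK) != 0;
    if (start && back) return EV_BUMPALONG;
    if (start) return EV_NEW_START;
    if (back) return EV_BACKTRACK;
    return EV_CONTINUE;
}
```

<P>
Because the field is always zero for DFA and JIT matching, EV_CONTINUE in this
sketch carries information only for <b>pcre2_match()</b> without JIT.
</P>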
||||
<br><a name="SEC5" href="#TOC1">RETURN VALUES FROM CALLOUTS</a><br>
|
||||
<P>
|
||||
The external callout function returns an integer to PCRE2. If the value is
|
||||
zero, matching proceeds as normal. If the value is greater than zero, matching
|
||||
fails at the current point, but the testing of other matching possibilities
|
||||
goes ahead, just as if a lookahead assertion had failed. If the value is less
|
||||
than zero, the match is abandoned, and the matching function returns the
|
||||
negative value.
|
||||
</P>
|
||||
<P>
|
||||
Negative values should normally be chosen from the set of PCRE2_ERROR_xxx
|
||||
values. In particular, PCRE2_ERROR_NOMATCH forces a standard "no match"
|
||||
failure. The error number PCRE2_ERROR_CALLOUT is reserved for use by callout
|
||||
functions; it will never be used by PCRE2 itself.
|
||||
</P>
|
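<P>
The three kinds of return value can be illustrated by a small callout policy in
C. This sketch is self-contained, so the block argument is an untyped pointer
and SKETCH_ERROR_CALLOUT is a placeholder; a real callout function takes a
pcre2_callout_block pointer and would return PCRE2_ERROR_CALLOUT from pcre2.h.
The sketch_policy type and its fields are assumptions made for this example.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder negative code for this sketch; real code should return
   PCRE2_ERROR_CALLOUT (or another PCRE2_ERROR_xxx value) from pcre2.h. */
#define SKETCH_ERROR_CALLOUT (-37)

/* Hypothetical per-match state passed as the callout data pointer. */
typedef struct {
    unsigned remaining;  /* callouts still allowed before giving up */
    int veto;            /* nonzero: reject the current matching path */
} sketch_policy;

/* Same shape as a callout function: block pointer, then user data. */
static int policy_callout(const void *block, void *callout_data)
{
    sketch_policy *p = callout_data;
    assert(p != NULL);
    (void)block;                      /* not inspected in this sketch */

    if (p->remaining == 0)
        return SKETCH_ERROR_CALLOUT;  /* negative: abandon the whole match */
    p->remaining--;

    if (p->veto)
        return 1;   /* positive: fail here, other possibilities are tried */
    return 0;       /* zero: matching proceeds as normal */
}
```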
||||
<br><a name="SEC6" href="#TOC1">CALLOUT ENUMERATION</a><br>
|
||||
<P>
|
||||
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
|
||||
<b> int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
|
||||
<b> void *<i>user_data</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
A script language that supports the use of string arguments in callouts might
|
||||
like to scan all the callouts in a pattern before running the match. This can
|
||||
be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
|
||||
pointer to a compiled pattern, the second points to a callback function, and
|
||||
the third is arbitrary user data. The callback function is called for every
|
||||
callout in the pattern in the order in which they appear. Its first argument is
|
||||
a pointer to a callout enumeration block, and its second argument is the
|
||||
<i>user_data</i> value that was passed to <b>pcre2_callout_enumerate()</b>. The
|
||||
data block contains the following fields:
|
||||
<pre>
|
||||
<i>version</i> Block version number
|
||||
<i>pattern_position</i> Offset to next item in pattern
|
||||
<i>next_item_length</i> Length of next item in pattern
|
||||
<i>callout_number</i> Number for numbered callouts
|
||||
<i>callout_string_offset</i> Offset to string within pattern
|
||||
<i>callout_string_length</i> Length of callout string
|
||||
<i>callout_string</i> Points to callout string or is NULL
|
||||
</pre>
|
||||
The version number is currently 0. It will increase if new fields are ever
|
||||
added to the block. The remaining fields are the same as their namesakes in the
|
||||
<b>pcre2_callout</b> block that is used for callouts during matching, as
|
||||
described
|
||||
<a href="#calloutinterface">above.</a>
|
||||
</P>
|
||||
<P>
|
||||
Note that the value of <i>pattern_position</i> is unique for each callout.
|
||||
However, if a callout occurs inside a group that is quantified with a non-zero
|
||||
minimum or a fixed maximum, the group is replicated inside the compiled
|
||||
pattern. For example, a pattern such as /(a){2}/ is compiled as if it were
|
||||
/(a)(a)/. This means that the callout will be enumerated more than once, but
|
||||
with the same value for <i>pattern_position</i> in each case.
|
||||
</P>
|
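<P>
Since a replicated group causes the same callout to be enumerated more than
once, an application that wants one entry per callout in the pattern text can
collapse entries by <i>pattern_position</i>. The following C sketch assumes
that the enumeration callback has already stored each <i>pattern_position</i>
value in an array; the function name and the array are hypothetical.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Counts the distinct pattern_position values among n enumerated callouts.
   Quadratic, which is adequate for the handful of callouts in a pattern. */
static size_t count_distinct_callouts(const size_t *positions, size_t n)
{
    assert(positions != NULL || n == 0);
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (positions[j] == positions[i]) {  /* same callout, replicated */
                seen = 1;
                break;
            }
        }
        if (!seen) distinct++;
    }
    return distinct;
}
```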
||||
<P>
|
||||
The callback function should normally return zero. If it returns a non-zero
|
||||
value, scanning the pattern stops, and that value is returned from
|
||||
<b>pcre2_callout_enumerate()</b>.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 19 January 2024
|
||||
<br>
|
||||
Copyright © 1997-2024 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
299
3rd/pcre2/doc/html/pcre2compat.html
Normal file
@@ -0,0 +1,299 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2compat specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2compat man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
DIFFERENCES BETWEEN PCRE2 AND PERL
|
||||
</b><br>
|
||||
<P>
|
||||
This document describes some of the known differences in the ways that PCRE2
|
||||
and Perl handle regular expressions. The differences described here are with
|
||||
respect to Perl version 5.38.0, but as both Perl and PCRE2 are continually
|
||||
changing, the information may at times be out of date.
|
||||
</P>
|
||||
<P>
|
||||
1. When PCRE2_DOTALL (equivalent to Perl's /s qualifier) is not set, the
|
||||
behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.' matches the
|
||||
next character unless it is the start of a newline sequence. This means that,
|
||||
if the newline setting is CR, CRLF, or NUL, '.' will match the code point LF
|
||||
(0x0A) in ASCII/Unicode environments, and NL (either 0x15 or 0x25) when using
|
||||
EBCDIC. In Perl, '.' appears never to match LF, even when 0x0A is not a newline
|
||||
indicator.
|
||||
</P>
|
||||
<P>
|
||||
2. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
|
||||
have are given in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
page.
|
||||
</P>
|
||||
<P>
|
||||
3. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
|
||||
they do not mean what you might think. For example, (?!a){3} does not assert
|
||||
that the next three characters are not "a". It just asserts that the next
|
||||
character is not "a" three times (in principle; PCRE2 optimizes this to run the
|
||||
assertion just once). Perl allows some repeat quantifiers on other assertions,
|
||||
for example, \b*, but these do not seem to have any use. PCRE2 does not allow
|
||||
any kind of quantifier on non-lookaround assertions.
|
||||
</P>
|
||||
<P>
|
||||
4. If a braced quantifier such as {1,2} appears where there is nothing to
|
||||
repeat (for example, at the start of a branch), PCRE2 raises an error whereas
|
||||
Perl treats the quantifier characters as literal.
|
||||
</P>
|
||||
<P>
|
||||
5. Capture groups that occur inside negative lookaround assertions are counted,
|
||||
but their entries in the offsets vector are set only when a negative assertion
|
||||
is a condition that has a matching branch (that is, the condition is false).
|
||||
Perl may set such capture groups in other circumstances.
|
||||
</P>
|
||||
<P>
|
||||
6. The following Perl escape sequences are not supported: \F, \l, \L, \u,
|
||||
\U, and \N when followed by a character name. \N on its own, matching a
|
||||
non-newline character, and \N{U+dd..}, matching a Unicode code point, are
|
||||
supported. The escapes that modify the case of following letters are
|
||||
implemented by Perl's general string-handling and are not part of its pattern
|
||||
matching engine. If any of these are encountered by PCRE2, an error is
|
||||
generated by default. However, if either of the PCRE2_ALT_BSUX or
|
||||
PCRE2_EXTRA_ALT_BSUX options is set, \U and \u are interpreted as ECMAScript
|
||||
interprets them.
|
||||
</P>
|
||||
<P>
|
||||
7. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
|
||||
built with Unicode support (the default). The properties that can be tested
|
||||
with \p and \P are limited to the general category properties such as Lu and
|
||||
Nd, the derived properties Any and Lc (synonym L&), script names such as Greek
|
||||
or Han, Bidi_Class, Bidi_Control, and a few binary properties. Both PCRE2 and
|
||||
Perl support the Cs (surrogate) property, but in PCRE2 its use is limited. See
|
||||
the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation for details. The long synonyms for property names that Perl
|
||||
supports (such as \p{Letter}) are not supported by PCRE2, nor is it permitted
|
||||
to prefix any of these properties with "Is".
|
||||
</P>
|
||||
<P>
|
||||
8. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
|
||||
in between are treated as literals. However, this is slightly different from
|
||||
Perl in that $ and @ are also handled as literals inside the quotes. In Perl,
|
||||
they cause variable interpolation (PCRE2 does not have variables). Also, Perl
|
||||
does "double-quotish backslash interpolation" on any backslashes between \Q
|
||||
and \E which, its documentation says, "may lead to confusing results". PCRE2
|
||||
treats a backslash between \Q and \E just like any other character. Note the
|
||||
following examples:
|
||||
<pre>
|
||||
Pattern PCRE2 matches Perl matches
|
||||
|
||||
\Qabc$xyz\E abc$xyz abc followed by the contents of $xyz
|
||||
\Qabc\$xyz\E abc\$xyz abc\$xyz
|
||||
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
|
||||
\QA\B\E A\B A\B
|
||||
\Q\\E \ \\E
|
||||
</pre>
|
||||
The \Q...\E sequence is recognized both inside and outside character classes
|
||||
by both PCRE2 and Perl. Another difference from Perl is that any appearance of
|
||||
\Q or \E inside what might otherwise be a quantifier causes PCRE2 not to
|
||||
recognize the sequence as a quantifier. Perl recognizes a quantifier if
|
||||
(redundantly) either of the numbers is inside \Q...\E, but not if the
|
||||
separating comma is. When not recognized as a quantifier a sequence such as
|
||||
{\Q1\E,2} is treated as the literal string "{1,2}".
|
||||
</P>
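<P>
The quoting rules in item 8 suggest a way to embed arbitrary literal text in a
PCRE2 pattern. The following is a minimal sketch, not part of PCRE2: the helper
name quote_literal() is hypothetical. It wraps the text in \Q...\E and, because
a \E inside the text would end the quoting early, emits each embedded \E as
"close the quote, match a literal backslash and E, reopen the quote".
</P>

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a PCRE2 function): wrap `text` in \Q...\E so that
   every character is matched literally. Each "\E" found in the text is emitted
   as the 7 characters  \E \\E \Q  (close, literal backslash + E, reopen).
   Returns a malloc()ed string that the caller must free, or NULL. */
char *quote_literal(const char *text)
{
  size_t len = strlen(text);
  /* Worst case is 7 output characters for 2 input characters, so 4 per input
     character is enough; add 4 for the outer \Q and \E and 1 for the NUL. */
  char *out = malloc(len * 4 + 5);
  char *p = out;

  if (out == NULL) return NULL;

  *p++ = '\\'; *p++ = 'Q';
  while (*text != '\0')
    {
    if (text[0] == '\\' && text[1] == 'E')
      {
      memcpy(p, "\\E\\\\E\\Q", 7);
      p += 7;
      text += 2;
      }
    else *p++ = *text++;
    }
  *p++ = '\\'; *p++ = 'E';
  *p = '\0';
  return out;
}
```

<P>
Text that ends in a backslash produces a pattern ending in \\E, which relies on
the PCRE2 behaviour shown in the last row of the table above. Perl reads that
sequence differently, so this sketch should be treated as PCRE2-specific.
</P>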
|
||||
<P>
|
||||
9. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
|
||||
constructions. However, PCRE2 does have a "callout" feature, which allows an
|
||||
external function to be called during pattern matching. See the
|
||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||
documentation for details.
|
||||
</P>
|
||||
<P>
|
||||
10. Subroutine calls (whether recursive or not) were treated as atomic groups
|
||||
up to PCRE2 release 10.23, but from release 10.30 this changed, and
|
||||
backtracking into subroutine calls is now supported, as in Perl.
|
||||
</P>
|
||||
<P>
|
||||
11. In PCRE2, if any of the backtracking control verbs are used in a group that
|
||||
is called as a subroutine (whether or not recursively), their effect is
|
||||
confined to that group; it does not extend to the surrounding pattern. This is
|
||||
not always the case in Perl. In particular, if (*THEN) is present in a group
|
||||
that is called as a subroutine, its action is limited to that group, even if
|
||||
the group does not contain any | characters. Note that such groups are
|
||||
processed as anchored at the point where they are tested. PCRE2 also confines
|
||||
all control verbs within atomic assertions, again including (*THEN) in
|
||||
assertions with only one branch.
|
||||
</P>
|
||||
<P>
|
||||
12. If a pattern contains more than one backtracking control verb, the first
|
||||
one that is backtracked onto acts. For example, in the pattern
|
||||
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
|
||||
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
|
||||
same as PCRE2, but there are cases where it differs.
|
||||
</P>
|
||||
<P>
|
||||
13. There are some differences that are concerned with the settings of captured
|
||||
strings when part of a pattern is repeated. For example, matching "aba" against
|
||||
the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
|
||||
"b".
|
||||
</P>
|
||||
<P>
|
||||
14. PCRE2's handling of duplicate capture group numbers and names is not as
|
||||
general as Perl's. This is a consequence of the fact that PCRE2 works internally
|
||||
just with numbers, using an external table to translate between numbers and
|
||||
names. In particular, a pattern such as (?|(?<a>A)|(?<b>B)), where the two
|
||||
capture groups have the same number but different names, is not supported, and
|
||||
causes an error at compile time. If it were allowed, it would not be possible
|
||||
to distinguish which group matched, because both names map to capture group
|
||||
number 1. To avoid this confusing situation, an error is given at compile time.
|
||||
</P>
|
||||
<P>
|
||||
15. Perl used to recognize comments in some places that PCRE2 does not, for
|
||||
example, between the ( and ? at the start of a group. If the /x modifier is
|
||||
set, Perl allowed white space between ( and ? though the latest Perls give an
|
||||
error (for a while it was just deprecated). There may still be some cases where
|
||||
Perl behaves differently.
|
||||
</P>
|
||||
<P>
|
||||
16. Perl, when in warning mode, gives warnings for character classes such as
|
||||
[A-\d] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
|
||||
warning features, so it gives an error in these cases because they are almost
|
||||
certainly user mistakes.
|
||||
</P>
|
||||
<P>
|
||||
17. In PCRE2, until release 10.45, the upper/lower case character properties Lu
|
||||
and Ll were not affected when case-independent matching was specified. Perl has
|
||||
changed in this respect, and PCRE2 has now changed to match. When caseless
|
||||
matching is in force, Lu, Ll, and Lt (title case) are all treated as Lc (cased
|
||||
letter).
|
||||
</P>
|
||||
<P>
|
||||
18. From release 5.32.0, Perl locks out the use of \K in lookaround
|
||||
assertions. From release 10.38 PCRE2 does the same by default. However, there
|
||||
is an option for re-enabling the previous behaviour. When this option is set,
|
||||
\K is acted on when it occurs in positive assertions, but is ignored in
|
||||
negative assertions.
|
||||
</P>
|
||||
<P>
|
||||
19. PCRE2 provides some extensions to the Perl regular expression facilities.
|
||||
Perl 5.10 included new features that were not in earlier versions of Perl, some
|
||||
of which (such as named parentheses) were in PCRE2 for some time before. This
|
||||
list is with respect to Perl 5.38:
|
||||
<br>
|
||||
<br>
|
||||
(a) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
|
||||
meta-character matches only at the very end of the string.
|
||||
<br>
|
||||
<br>
|
||||
(b) A backslash followed by a letter with no special meaning is faulted. (Perl
|
||||
can be made to issue a warning.)
|
||||
<br>
|
||||
<br>
|
||||
(c) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
|
||||
inverted, that is, by default they are not greedy, but if followed by a
|
||||
question mark they are.
|
||||
<br>
|
||||
<br>
|
||||
(d) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
|
||||
only at the first matching position in the subject string.
|
||||
<br>
|
||||
<br>
|
||||
(e) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY and PCRE2_NOTEMPTY_ATSTART
|
||||
options have no Perl equivalents.
|
||||
<br>
|
||||
<br>
|
||||
(f) The \R escape sequence can be restricted to match only CR, LF, or CRLF
|
||||
by the PCRE2_BSR_ANYCRLF option.
|
||||
<br>
|
||||
<br>
|
||||
(g) The callout facility is PCRE2-specific. Perl supports codeblocks and
|
||||
variable interpolation, but not general hooks on every match.
|
||||
<br>
|
||||
<br>
|
||||
(h) The partial matching facility is PCRE2-specific.
|
||||
<br>
|
||||
<br>
|
||||
(i) The alternative matching function (<b>pcre2_dfa_match()</b>) matches in a
|
||||
different way and is not Perl-compatible.
|
||||
<br>
|
||||
<br>
|
||||
(j) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
|
||||
the start of a pattern. These set overall options that cannot be changed within
|
||||
the pattern.
|
||||
<br>
|
||||
<br>
|
||||
(k) PCRE2 supports non-atomic positive lookaround assertions. This is an
|
||||
extension to the lookaround facilities. The default, Perl-compatible
|
||||
lookarounds are atomic.
|
||||
<br>
|
||||
<br>
|
||||
(l) There are three syntactical items in patterns that can refer to a capturing
|
||||
group by number: back references such as \g{2}, subroutine calls such as (?3),
|
||||
and condition references such as (?(4)...). PCRE2 supports relative group
|
||||
numbers such as +2 and -4 in all three cases. Perl supports both plus and minus
|
||||
for subroutine calls, but only minus for back references, and no relative
|
||||
numbering at all for conditions.
|
||||
<br>
|
||||
<br>
|
||||
(m) The scan substring assertion (syntax (*scs:(n)...)) is a PCRE2 extension
|
||||
that is not available in Perl.
|
||||
</P>
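<P>
The relative group numbers mentioned in item (l) can be turned into absolute
numbers with simple arithmetic. The sketch below assumes the meaning given in
the pcre2pattern documentation: -1 is the most recently opened capture group
before the reference, and +1 is the next group to be opened. The function name
is hypothetical and this is not how PCRE2 is implemented internally.
</P>

```c
/* Hypothetical helper: resolve a relative group reference.
   `opened` is the number of capture groups whose opening parenthesis appears
   before the reference; `rel` is the signed relative number (e.g. -1 or +2).
   Returns the absolute group number, or 0 if the reference is invalid.
   A forward reference is not checked against the total number of groups,
   because that total is not known here. */
int resolve_relative_group(int opened, int rel)
{
  int abs_num;

  if (rel == 0) return 0;                 /* +0 and -0 are not meaningful */
  abs_num = (rel < 0) ? opened + rel + 1  /* count back from the last opened */
                      : opened + rel;     /* count forward from the last opened */
  return (abs_num >= 1) ? abs_num : 0;
}
```

<P>
For example, in (a)(b)\g{-2} two groups are open at the reference, so -2
resolves to group 1.
</P>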
|
||||
<P>
|
||||
20. Perl has different limits than PCRE2. See the
|
||||
<a href="pcre2limit.html"><b>pcre2limit</b></a>
|
||||
documentation for details. In release 5.10, Perl switched from recursion to iteration,
|
||||
keeping the intermediate matches on the heap, which is ~10% slower but does not
|
||||
fall into any stack-overflow limit. PCRE2 made a similar change at release
|
||||
10.30, and also has many build-time and run-time customizable limits.
|
||||
</P>
|
||||
<P>
|
||||
21. Unlike Perl, PCRE2 doesn't have character set modifiers and, in particular, no way
|
||||
to set characters by context just like Perl's "/d". A regular expression using
|
||||
PCRE2_UTF and PCRE2_UCP will use similar rules to Perl's "/u"; something closer
|
||||
to "/a" could be selected by adding other PCRE2_EXTRA_ASCII* options on top.
|
||||
</P>
|
||||
<P>
|
||||
22. Some recursive patterns that Perl diagnoses as infinite recursions can be
|
||||
handled by PCRE2, either by the interpreter or the JIT. An example is
|
||||
/(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any number of repeated
|
||||
"abcd" substrings at the end of the subject.
|
||||
</P>
|
||||
<P>
|
||||
23. Both PCRE2 and Perl error when \x{ escapes are invalid, but Perl tries to
|
||||
recover and prints a warning if the problem was that an invalid hexadecimal
|
||||
digit was found. Since PCRE2 doesn't have warnings, it returns an error instead.
|
||||
Additionally, Perl accepts \x{} and generates NUL unlike PCRE2.
|
||||
</P>
|
||||
<P>
|
||||
24. From release 10.45, PCRE2 gives an error if \x is not followed by a
|
||||
hexadecimal digit or a curly bracket. It used to interpret this as the NUL
|
||||
character. Perl still generates NUL, but warns when in warning mode in most
|
||||
cases.
|
||||
</P>
|
||||
<br><b>
|
||||
AUTHOR
|
||||
</b><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><b>
|
||||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 02 October 2024
|
||||
<br>
|
||||
Copyright © 1997-2024 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
191
3rd/pcre2/doc/html/pcre2convert.html
Normal file
@@ -0,0 +1,191 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2convert specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2convert man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">EXPERIMENTAL PATTERN CONVERSION FUNCTIONS</a>
|
||||
<li><a name="TOC2" href="#SEC2">THE CONVERT CONTEXT</a>
|
||||
<li><a name="TOC3" href="#SEC3">THE CONVERSION FUNCTION</a>
|
||||
<li><a name="TOC4" href="#SEC4">CONVERTING GLOBS</a>
|
||||
<li><a name="TOC5" href="#SEC5">CONVERTING POSIX PATTERNS</a>
|
||||
<li><a name="TOC6" href="#SEC6">AUTHOR</a>
|
||||
<li><a name="TOC7" href="#SEC7">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">EXPERIMENTAL PATTERN CONVERSION FUNCTIONS</a><br>
|
||||
<P>
|
||||
This document describes a set of functions that can be used to convert
|
||||
"foreign" patterns into PCRE2 regular expressions. This facility is currently
|
||||
experimental, and may be changed in future releases. Two kinds of pattern,
|
||||
globs and POSIX patterns, are supported.
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">THE CONVERT CONTEXT</a><br>
|
||||
<P>
|
||||
<b>pcre2_convert_context *pcre2_convert_context_create(</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>pcre2_convert_context *pcre2_convert_context_copy(</b>
|
||||
<b> pcre2_convert_context *<i>cvcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_convert_context_free(pcre2_convert_context *<i>cvcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_glob_escape(pcre2_convert_context *<i>cvcontext</i>,</b>
|
||||
<b> uint32_t <i>escape_char</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_set_glob_separator(pcre2_convert_context *<i>cvcontext</i>,</b>
|
||||
<b> uint32_t <i>separator_char</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
A convert context is used to hold parameters that affect the way that pattern
|
||||
conversion works. Like all PCRE2 contexts, you need to use a context only if
|
||||
you want to override the defaults. There are the usual create, copy, and free
|
||||
functions. If custom memory management functions are set in a general context
|
||||
that is passed to <b>pcre2_convert_context_create()</b>, they are used for all
|
||||
memory management within the conversion functions.
|
||||
</P>
|
||||
<P>
|
||||
There are only two parameters in the convert context at present. Both apply
|
||||
only to glob conversions. The escape character defaults to grave accent under
|
||||
Windows, otherwise backslash. It can be set to zero, meaning no escape
|
||||
character, or to any punctuation character with a code point less than 256.
|
||||
The separator character defaults to backslash under Windows, otherwise forward
|
||||
slash. It can be set to forward slash, backslash, or dot.
|
||||
</P>
|
||||
<P>
|
||||
The two setting functions return zero on success, or PCRE2_ERROR_BADDATA if
|
||||
their second argument is invalid.
|
||||
</P>
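<P>
An application that accepts these characters from user input may want to check
them before calling the setting functions. The following sketch restates the
documented rules in plain C. The two function names are hypothetical, and the
result of ispunct() depends on the current C locale, so this is an
approximation of the library's own check rather than a copy of it.
</P>

```c
#include <ctype.h>
#include <stdint.h>

/* Hypothetical pre-check for a glob escape character: zero (meaning "no
   escape character") or a punctuation character with a code point < 256. */
int glob_escape_is_valid(uint32_t c)
{
  if (c == 0) return 1;
  return c < 256 && ispunct((int)c) != 0;
}

/* Hypothetical pre-check for a glob separator: only forward slash,
   backslash, and dot are documented as acceptable. */
int glob_separator_is_valid(uint32_t c)
{
  return c == '/' || c == '\\' || c == '.';
}
```

<P>
A value rejected by these checks is one for which the real setting functions
would be expected to return PCRE2_ERROR_BADDATA.
</P>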
|
||||
<br><a name="SEC3" href="#TOC1">THE CONVERSION FUNCTION</a><br>
|
||||
<P>
|
||||
<b>int pcre2_pattern_convert(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
|
||||
<b> uint32_t <i>options</i>, PCRE2_UCHAR **<i>buffer</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>blength</i>, pcre2_convert_context *<i>cvcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_converted_pattern_free(PCRE2_UCHAR *<i>converted_pattern</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
The first two arguments of <b>pcre2_pattern_convert()</b> define the foreign
|
||||
pattern that is to be converted. The length may be given as
|
||||
PCRE2_ZERO_TERMINATED. The <b>options</b> argument defines how the pattern is to
|
||||
be processed. If the input is UTF, the PCRE2_CONVERT_UTF option should be set.
|
||||
PCRE2_CONVERT_NO_UTF_CHECK may also be set if you are sure the input is valid.
|
||||
One or more of the glob options, or one of the following POSIX options must be
|
||||
set to define the type of conversion that is required:
|
||||
<pre>
|
||||
PCRE2_CONVERT_GLOB
|
||||
PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR
|
||||
PCRE2_CONVERT_GLOB_NO_STARSTAR
|
||||
PCRE2_CONVERT_POSIX_BASIC
|
||||
PCRE2_CONVERT_POSIX_EXTENDED
|
||||
</pre>
|
||||
Details of the conversions are given below. The <b>buffer</b> and <b>blength</b>
|
||||
arguments define how the output is handled:
|
||||
</P>
|
||||
<P>
|
||||
If <b>buffer</b> is NULL, the function just returns the length of the converted
|
||||
pattern via <b>blength</b>. This is one less than the length of buffer needed,
|
||||
because a terminating zero is always added to the output.
|
||||
</P>
|
||||
<P>
|
||||
If <b>buffer</b> points to a NULL pointer, an output buffer is obtained using
|
||||
the allocator in the context or <b>malloc()</b> if no context is supplied. A
|
||||
pointer to this buffer is placed in the variable to which <b>buffer</b> points.
|
||||
When no longer needed the output buffer must be freed by calling
|
||||
<b>pcre2_converted_pattern_free()</b>. If this function is called with a NULL
|
||||
argument, it returns immediately without doing anything.
|
||||
</P>
|
||||
<P>
|
||||
If <b>buffer</b> points to a non-NULL pointer, <b>blength</b> must be set to the
|
||||
actual length of the buffer provided (in code units).
|
||||
</P>
|
||||
<P>
|
||||
In all cases, after successful conversion, the variable pointed to by
|
||||
<b>blength</b> is updated to the length actually used (in code units), excluding
|
||||
the terminating zero that is always added.
|
||||
</P>
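<P>
The three buffer modes described above can be modelled in plain C. The function
below is a toy stand-in, not PCRE2 code: emit_converted() is a hypothetical
name, the "converted pattern" is simply a string passed in by the caller, and
the -1 return value is a placeholder for a real PCRE2 error code. It is meant
only to make the length and terminating-zero rules concrete.
</P>

```c
#include <stdlib.h>
#include <string.h>

/* Toy model of the output-buffer rules (NOT the PCRE2 implementation):
     buffer == NULL   -> only report the length via *blength
     *buffer == NULL  -> allocate the output with malloc()
     *buffer != NULL  -> use the caller's buffer of *blength units
   Returns 0 on success, -1 if allocation fails or the buffer is too small. */
int emit_converted(const char *result, char **buffer, size_t *blength)
{
  size_t needed = strlen(result);     /* excludes the terminating zero */

  if (buffer == NULL)
    {
    *blength = needed;                /* caller needs needed + 1 units */
    return 0;
    }

  if (*buffer == NULL)
    {
    *buffer = malloc(needed + 1);
    if (*buffer == NULL) return -1;
    }
  else if (*blength < needed + 1) return -1;

  memcpy(*buffer, result, needed + 1);  /* copies the terminating zero too */
  *blength = needed;                    /* length used, excluding the zero */
  return 0;
}
```

<P>
With the real library, a buffer obtained in the second mode is released with
<b>pcre2_converted_pattern_free()</b>; in this toy model it is released with
free().
</P>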
|
||||
<P>
|
||||
If an error occurs, the length (via <b>blength</b>) is set to the offset
|
||||
within the input pattern where the error was detected. Only gross syntax errors
|
||||
are caught; there are plenty of errors that will get passed on for
|
||||
<b>pcre2_compile()</b> to discover.
|
||||
</P>
|
||||
<P>
|
||||
The return from <b>pcre2_pattern_convert()</b> is zero on success or a non-zero
|
||||
PCRE2 error code. Note that PCRE2 error codes may be positive or negative:
|
||||
<b>pcre2_compile()</b> uses mostly positive codes and <b>pcre2_match()</b>
|
||||
negative ones; <b>pcre2_pattern_convert()</b> uses existing codes of both kinds. A
|
||||
textual error message can be obtained by calling
|
||||
<b>pcre2_get_error_message()</b>.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">CONVERTING GLOBS</a><br>
|
||||
<P>
|
||||
Globs are used to match file names, and consequently have the concept of a
|
||||
"path separator", which defaults to backslash under Windows and forward slash
|
||||
otherwise. If PCRE2_CONVERT_GLOB is set, the wildcards * and ? are not
|
||||
permitted to match separator characters, but the double-star (**) feature
|
||||
(which does match separators) is supported.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR matches globs with wildcards allowed to
|
||||
match separator characters. PCRE2_CONVERT_GLOB_NO_STARSTAR matches globs with
|
||||
the double-star feature disabled. These options may be given together.
|
||||
</P>
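<P>
To make the wildcard rules concrete, here is a deliberately simplified
translation of a glob into a regular expression. It is an illustration under
stated assumptions, not the algorithm used by <b>pcre2_pattern_convert()</b>,
whose output differs: the function name is hypothetical, the separator is fixed
at forward slash, anchoring uses plain ^ and $, and character classes and
escape characters are not handled.
</P>

```c
#include <stdlib.h>
#include <string.h>

/* Simplified glob-to-regex sketch (illustration only):
     **  ->  .*       may cross separators
     *   ->  [^/]*    stays within one path component
     ?   ->  [^/]     one non-separator character
   Regex metacharacters in the glob are escaped with a backslash.
   Returns a malloc()ed string that the caller must free, or NULL. */
char *glob_to_regex_sketch(const char *glob)
{
  size_t len = strlen(glob);
  /* "[^/]*" (5 characters) is the longest expansion of one input character;
     add 3 for '^', '$' and the terminating NUL. */
  char *out = malloc(len * 5 + 3);
  char *p = out;

  if (out == NULL) return NULL;

  *p++ = '^';
  while (*glob != '\0')
    {
    if (glob[0] == '*' && glob[1] == '*')
      { memcpy(p, ".*", 2); p += 2; glob += 2; }
    else if (*glob == '*')
      { memcpy(p, "[^/]*", 5); p += 5; glob++; }
    else if (*glob == '?')
      { memcpy(p, "[^/]", 4); p += 4; glob++; }
    else
      {
      if (strchr("\\^$.|+()[]{}", *glob) != NULL) *p++ = '\\';
      *p++ = *glob++;
      }
    }
  *p++ = '$';
  *p = '\0';
  return out;
}
```

<P>
Replacing [^/] with a plain dot for * and ? would correspond to the
"wildcards may match separators" behaviour, and treating ** like a single *
would correspond to disabling the double-star feature.
</P>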
|
||||
<br><a name="SEC5" href="#TOC1">CONVERTING POSIX PATTERNS</a><br>
|
||||
<P>
|
||||
POSIX defines two kinds of regular expression pattern: basic and extended.
|
||||
These can be processed by setting PCRE2_CONVERT_POSIX_BASIC or
|
||||
PCRE2_CONVERT_POSIX_EXTENDED, respectively.
|
||||
</P>
|
||||
<P>
|
||||
In POSIX patterns, backslash is not special in a character class. Unmatched
|
||||
closing parentheses are treated as literals.
|
||||
</P>
|
||||
<P>
|
||||
In basic patterns, ? + | {} and () must be escaped to be recognized
|
||||
as metacharacters outside a character class. If the first character in the
|
||||
pattern is * it is treated as a literal. ^ is a metacharacter only at the start
|
||||
of a branch.
|
||||
</P>
|
||||
<P>
|
||||
In extended patterns, a backslash not in a character class always
|
||||
makes the next character literal, whatever it is. There are no backreferences.
|
||||
</P>
|
||||
<P>
|
||||
Note: POSIX mandates that the longest possible match at the first matching
|
||||
position must be found. This is not what <b>pcre2_match()</b> does; it yields
|
||||
the first match that is found. An application can use <b>pcre2_dfa_match()</b>
|
||||
to find the longest match, but that does not support backreferences (but then
|
||||
neither do POSIX extended patterns).
|
||||
</P>
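<P>
The difference between "first match found" and "longest match" shows up with
alternation. The toy functions below do not use PCRE2 at all; they compare a
subject against a list of literal alternatives, which is enough to show why a
pattern such as a|ab can give different results under the two rules. All names
are hypothetical.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Returns non-zero if `alt` is a prefix of `subject`. */
static int is_prefix(const char *subject, const char *alt)
{
  return strncmp(subject, alt, strlen(alt)) == 0;
}

/* Length matched when alternatives are tried in order and the first one that
   works is accepted (the behaviour of a backtracking matcher). -1 if none. */
long first_alternative(const char *subject, const char *const *alts, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    if (is_prefix(subject, alts[i])) return (long)strlen(alts[i]);
  return -1;
}

/* Length matched when the longest working alternative is required
   (the POSIX rule). -1 if none matches. */
long longest_alternative(const char *subject, const char *const *alts, size_t n)
{
  long best = -1;
  size_t i;
  for (i = 0; i < n; i++)
    if (is_prefix(subject, alts[i]) && (long)strlen(alts[i]) > best)
      best = (long)strlen(alts[i]);
  return best;
}
```

<P>
For the subject "abc" and the alternatives "a" and "ab", the first function
reports a match of length 1 and the second a match of length 2.
</P>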
|
||||
<br><a name="SEC6" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 14 November 2023
|
||||
<br>
|
||||
Copyright © 1997-2018 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
518
3rd/pcre2/doc/html/pcre2demo.html
Normal file
@@ -0,0 +1,518 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2demo specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2demo man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SOURCE CODE
|
||||
</b><br>
|
||||
<PRE>
|
||||
/*************************************************
|
||||
* PCRE2 DEMONSTRATION PROGRAM *
|
||||
*************************************************/
|
||||
|
||||
/* This is a demonstration program to illustrate a straightforward way of
|
||||
using the PCRE2 regular expression library from a C program. See the
|
||||
pcre2sample documentation for a short discussion ("man pcre2sample" if you have
|
||||
the PCRE2 man pages installed). PCRE2 is a revised API for the library, and is
|
||||
incompatible with the original PCRE API.
|
||||
|
||||
There are actually three libraries, each supporting a different code unit
|
||||
width. This demonstration program uses the 8-bit library. The default is to
|
||||
process each code unit as a separate character, but if the pattern begins with
|
||||
"(*UTF)", both it and the subject are treated as UTF-8 strings, where
|
||||
characters may occupy multiple code units.
|
||||
|
||||
In Unix-like environments, if PCRE2 is installed in your standard system
|
||||
libraries, you should be able to compile this program using this command:
|
||||
|
||||
cc -Wall pcre2demo.c -lpcre2-8 -o pcre2demo
|
||||
|
||||
If PCRE2 is not installed in a standard place, it is likely to be installed
|
||||
with support for the pkg-config mechanism. If you have pkg-config, you can
|
||||
compile this program using this command:
|
||||
|
||||
cc -Wall pcre2demo.c `pkg-config --cflags --libs libpcre2-8` -o pcre2demo
|
||||
|
||||
If you do not have pkg-config, you may have to use something like this:
|
||||
|
||||
cc -Wall pcre2demo.c -I/usr/local/include -L/usr/local/lib \
|
||||
-R/usr/local/lib -lpcre2-8 -o pcre2demo
|
||||
|
||||
Replace "/usr/local/include" and "/usr/local/lib" with wherever the include and
|
||||
library files for PCRE2 are installed on your system. Only some operating
|
||||
systems (Solaris is one) use the -R option.
|
||||
|
||||
Building under Windows:
|
||||
|
||||
If you want to statically link this program against a non-dll .a file, you must
|
||||
define PCRE2_STATIC before including pcre2.h, so in this environment, uncomment
|
||||
the following line. */
|
||||
|
||||
/* #define PCRE2_STATIC */
|
||||
|
||||
/* The PCRE2_CODE_UNIT_WIDTH macro must be defined before including pcre2.h.
|
||||
For a program that uses only one code unit width, setting it to 8, 16, or 32
|
||||
makes it possible to use generic function names such as pcre2_compile(). Note
|
||||
that just changing 8 to 16 (for example) is not sufficient to convert this
|
||||
program to process 16-bit characters. Even in a fully 16-bit environment, where
|
||||
string-handling functions such as strcmp() and printf() work with 16-bit
|
||||
characters, the code for handling the table of named substrings will still need
|
||||
to be modified. */
|
||||
|
||||
#define PCRE2_CODE_UNIT_WIDTH 8
|
||||
|
||||
#include <stdio.h>
|
||||
#include <string.h>
|
||||
#include <pcre2.h>
|
||||
|
||||
|
||||
/**************************************************************************
|
||||
* Here is the program. The API includes the concept of "contexts" for *
|
||||
* setting up unusual interface requirements for compiling and matching, *
|
||||
* such as custom memory managers and non-standard newline definitions. *
|
||||
* This program does not do any of this, so it makes no use of contexts, *
|
||||
* always passing NULL where a context could be given. *
|
||||
**************************************************************************/
|
||||
|
||||
int main(int argc, char **argv)
|
||||
{
|
||||
pcre2_code *re;
|
||||
PCRE2_SPTR pattern; /* PCRE2_SPTR is a pointer to unsigned code units of */
|
||||
PCRE2_SPTR subject; /* the appropriate width (in this case, 8 bits). */
|
||||
PCRE2_SPTR name_table;
|
||||
|
||||
int crlf_is_newline;
|
||||
int errornumber;
|
||||
int find_all;
|
||||
int i;
|
||||
int rc;
|
||||
int utf8;
|
||||
|
||||
uint32_t option_bits;
|
||||
uint32_t namecount;
|
||||
uint32_t name_entry_size;
|
||||
uint32_t newline;
|
||||
|
||||
PCRE2_SIZE erroroffset;
|
||||
PCRE2_SIZE *ovector;
|
||||
PCRE2_SIZE subject_length;
|
||||
|
||||
pcre2_match_data *match_data;
|
||||
|
||||
|
||||
/**************************************************************************
|
||||
* First, sort out the command line. There is only one possible option at *
|
||||
* the moment, "-g" to request repeated matching to find all occurrences, *
|
||||
* like Perl's /g option. We set the variable find_all to a non-zero value *
|
||||
* if the -g option is present. *
|
||||
**************************************************************************/
|
||||
|
||||
find_all = 0;
|
||||
for (i = 1; i < argc; i++)
|
||||
{
|
||||
if (strcmp(argv[i], "-g") == 0) find_all = 1;
|
||||
else if (argv[i][0] == '-')
|
||||
{
|
||||
printf("Unrecognised option %s\n", argv[i]);
|
||||
return 1;
|
||||
}
|
||||
else break;
|
||||
}
|
||||
|
||||
/* After the options, we require exactly two arguments, which are the pattern,
|
||||
and the subject string. */
|
||||
|
||||
if (argc - i != 2)
|
||||
{
|
||||
printf("Exactly two arguments required: a regex and a subject string\n");
|
||||
return 1;
|
||||
}
|
||||
|
||||
/* Pattern and subject are char arguments, so they can be straightforwardly
|
||||
cast to PCRE2_SPTR because we are working in 8-bit code units. The subject
|
||||
length is cast to PCRE2_SIZE for completeness, though PCRE2_SIZE is in fact
|
||||
defined to be size_t. */
|
||||
|
||||
pattern = (PCRE2_SPTR)argv[i];
|
||||
subject = (PCRE2_SPTR)argv[i+1];
|
||||
subject_length = (PCRE2_SIZE)strlen((char *)subject);
|
||||
|
||||
|
||||
/*************************************************************************
|
||||
* Now we are going to compile the regular expression pattern, and handle *
|
||||
* any errors that are detected. *
|
||||
*************************************************************************/
|
||||
|
||||
re = pcre2_compile(
|
||||
pattern, /* the pattern */
|
||||
PCRE2_ZERO_TERMINATED, /* indicates pattern is zero-terminated */
|
||||
0, /* default options */
|
||||
&errornumber, /* for error number */
|
||||
&erroroffset, /* for error offset */
|
||||
NULL); /* use default compile context */
|
||||
|
||||
/* Compilation failed: print the error message and exit. */
|
||||
|
||||
if (re == NULL)
|
||||
{
|
||||
PCRE2_UCHAR buffer[256];
|
||||
pcre2_get_error_message(errornumber, buffer, sizeof(buffer));
|
||||
printf("PCRE2 compilation failed at offset %d: %s\n", (int)erroroffset,
|
||||
buffer);
|
||||
return 1;
|
||||
}
|
||||
|
||||
|
||||
/*************************************************************************
|
||||
* If the compilation succeeded, we call PCRE2 again, in order to do a *
|
||||
* pattern match against the subject string. This does just ONE match. If *
|
||||
* further matching is needed, it will be done below. Before running the *
|
||||
* match we must set up a match_data block for holding the result. Using *
|
||||
* pcre2_match_data_create_from_pattern() ensures that the block is *
|
||||
* exactly the right size for the number of capturing parentheses in the *
|
||||
* pattern. If you need to know the actual size of a match_data block as *
|
||||
* a number of bytes, you can find it like this: *
|
||||
* *
|
||||
* PCRE2_SIZE match_data_size = pcre2_get_match_data_size(match_data); *
|
||||
*************************************************************************/
|
||||
|
||||
match_data = pcre2_match_data_create_from_pattern(re, NULL);
|
||||
|
||||
/* Now run the match. */
|
||||
|
||||
rc = pcre2_match(
|
||||
re, /* the compiled pattern */
|
||||
subject, /* the subject string */
|
||||
subject_length, /* the length of the subject */
|
||||
0, /* start at offset 0 in the subject */
|
||||
0, /* default options */
|
||||
match_data, /* block for storing the result */
|
||||
NULL); /* use default match context */
|
||||
|
||||
/* Matching failed: handle error cases */
|
||||
|
||||
if (rc < 0)
|
||||
{
|
||||
switch(rc)
|
||||
{
|
||||
case PCRE2_ERROR_NOMATCH: printf("No match\n"); break;
|
||||
/*
|
||||
Handle other special cases if you like
|
||||
*/
|
||||
default: printf("Matching error %d\n", rc); break;
|
||||
}
|
||||
pcre2_match_data_free(match_data); /* Release memory used for the match */
|
||||
pcre2_code_free(re); /* data and the compiled pattern. */
|
||||
return 1;
|
||||
}
|
||||
|
||||
/* Match succeeded. Get a pointer to the output vector, where string offsets
|
||||
are stored. */
|
||||
|
||||
ovector = pcre2_get_ovector_pointer(match_data);
|
||||
printf("Match succeeded at offset %d\n", (int)ovector[0]);
|
||||
|
||||
|
||||
/*************************************************************************
|
||||
* We have found the first match within the subject string. If the output *
|
||||
* vector wasn't big enough, say so. Then output any substrings that were *
|
||||
* captured. *
|
||||
*************************************************************************/
|
||||
|
||||
/* The output vector wasn't big enough. This should not happen, because we used
|
||||
pcre2_match_data_create_from_pattern() above. */
|
||||
|
||||
if (rc == 0)
|
||||
printf("ovector was not big enough for all the captured substrings\n");
|
||||
|
||||
/* Since release 10.38 PCRE2 has locked out the use of \K in lookaround
|
||||
assertions. However, there is an option to re-enable the old behaviour. If that
|
||||
is set, it is possible to run patterns such as /(?=.\K)/ that use \K in an
|
||||
assertion to set the start of a match later than its end. In this demonstration
|
||||
program, we show how to detect this case, but it shouldn't arise because the
|
||||
option is never set. */
|
||||
|
||||
if (ovector[0] > ovector[1])
|
||||
{
|
||||
printf("\\K was used in an assertion to set the match start after its end.\n"
|
||||
"From end to start the match was: %.*s\n", (int)(ovector[0] - ovector[1]),
|
||||
(char *)(subject + ovector[1]));
|
||||
printf("Run abandoned\n");
|
||||
pcre2_match_data_free(match_data);
|
||||
pcre2_code_free(re);
|
||||
return 1;
|
||||
}
|
||||
|
||||
/* Show substrings stored in the output vector by number. Obviously, in a real
|
||||
application you might want to do things other than print them. */
|
||||
|
||||
for (i = 0; i < rc; i++)
|
||||
{
|
||||
PCRE2_SPTR substring_start = subject + ovector[2*i];
|
||||
PCRE2_SIZE substring_length = ovector[2*i+1] - ovector[2*i];
|
||||
printf("%2d: %.*s\n", i, (int)substring_length, (char *)substring_start);
|
||||
}
|
||||
|
||||
|
||||
/**************************************************************************
|
||||
* That concludes the basic part of this demonstration program. We have *
|
||||
* compiled a pattern, and performed a single match. The code that follows *
|
||||
* shows first how to access named substrings, and then how to code for *
|
||||
* repeated matches on the same subject. *
|
||||
**************************************************************************/
|
||||
|
||||
/* See if there are any named substrings, and if so, show them by name. First
|
||||
we have to extract the count of named parentheses from the pattern. */
|
||||
|
||||
(void)pcre2_pattern_info(
|
||||
re, /* the compiled pattern */
|
||||
PCRE2_INFO_NAMECOUNT, /* get the number of named substrings */
|
||||
&namecount); /* where to put the answer */
|
||||
|
||||
if (namecount == 0) printf("No named substrings\n"); else
|
||||
{
|
||||
PCRE2_SPTR tabptr;
|
||||
printf("Named substrings\n");
|
||||
|
||||
/* Before we can access the substrings, we must extract the table for
|
||||
translating names to numbers, and the size of each entry in the table. */
|
||||
|
||||
(void)pcre2_pattern_info(
|
||||
re, /* the compiled pattern */
|
||||
PCRE2_INFO_NAMETABLE, /* address of the table */
|
||||
&name_table); /* where to put the answer */
|
||||
|
||||
(void)pcre2_pattern_info(
|
||||
re, /* the compiled pattern */
|
||||
PCRE2_INFO_NAMEENTRYSIZE, /* size of each entry in the table */
|
||||
&name_entry_size); /* where to put the answer */
|
||||
|
||||
/* Now we can scan the table and, for each entry, print the number, the name,
|
||||
and the substring itself. In the 8-bit library the number is held in two
|
||||
bytes, most significant first. */
|
||||
|
||||
tabptr = name_table;
|
||||
for (i = 0; i < namecount; i++)
|
||||
{
|
||||
int n = (tabptr[0] << 8) | tabptr[1];
|
||||
printf("(%d) %*s: %.*s\n", n, name_entry_size - 3, tabptr + 2,
|
||||
(int)(ovector[2*n+1] - ovector[2*n]), subject + ovector[2*n]);
|
||||
tabptr += name_entry_size;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
/*************************************************************************
|
||||
* If the "-g" option was given on the command line, we want to continue *
|
||||
* to search for additional matches in the subject string, in a similar *
|
||||
* way to the /g option in Perl. This turns out to be trickier than you *
|
||||
* might think because of the possibility of matching an empty string. *
|
||||
* What happens is as follows: *
|
||||
* *
|
||||
* If the previous match was NOT for an empty string, we can just start *
|
||||
* the next match at the end of the previous one. *
|
||||
* *
|
||||
* If the previous match WAS for an empty string, we can't do that, as it *
|
||||
* would lead to an infinite loop. Instead, a call of pcre2_match() is *
|
||||
* made with the PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set. The *
|
||||
* first of these tells PCRE2 that an empty string at the start of the *
|
||||
* subject is not a valid match; other possibilities must be tried. The *
|
||||
* second flag restricts PCRE2 to one match attempt at the initial string *
|
||||
* position. If this match succeeds, an alternative to the empty string *
|
||||
* match has been found, and we can print it and proceed round the loop, *
|
||||
* advancing by the length of whatever was found. If this match does not *
|
||||
* succeed, we still stay in the loop, advancing by just one character. *
|
||||
* In UTF-8 mode, which can be set by (*UTF) in the pattern, this may be *
|
||||
* more than one byte. *
|
||||
* *
|
||||
* However, there is a complication concerned with newlines. When the *
|
||||
* newline convention is such that CRLF is a valid newline, we must *
|
||||
* advance by two characters rather than one. The newline convention can *
|
||||
* be set in the regex by (*CR), etc.; if not, we must find the default. *
|
||||
*************************************************************************/
|
||||
|
||||
if (!find_all) /* Check for -g */
|
||||
{
|
||||
pcre2_match_data_free(match_data); /* Release the memory that was used */
|
||||
pcre2_code_free(re); /* for the match data and the pattern. */
|
||||
return 0; /* Exit the program. */
|
||||
}
|
||||
|
||||
/* Before running the loop, check for UTF-8 and whether CRLF is a valid newline
|
||||
sequence. First, find the options with which the regex was compiled and extract
|
||||
the UTF state. */
|
||||
|
||||
(void)pcre2_pattern_info(re, PCRE2_INFO_ALLOPTIONS, &option_bits);
|
||||
utf8 = (option_bits & PCRE2_UTF) != 0;
|
||||
|
||||
/* Now find the newline convention and see whether CRLF is a valid newline
|
||||
sequence. */
|
||||
|
||||
(void)pcre2_pattern_info(re, PCRE2_INFO_NEWLINE, &newline);
|
||||
crlf_is_newline = newline == PCRE2_NEWLINE_ANY ||
|
||||
newline == PCRE2_NEWLINE_CRLF ||
|
||||
newline == PCRE2_NEWLINE_ANYCRLF;
|
||||
|
||||
/* Loop for second and subsequent matches */
|
||||
|
||||
for (;;)
|
||||
{
|
||||
uint32_t options = 0; /* Normally no options */
|
||||
PCRE2_SIZE start_offset = ovector[1]; /* Start at end of previous match */
|
||||
|
||||
/* If the previous match was for an empty string, we are finished if we are
|
||||
at the end of the subject. Otherwise, arrange to run another match at the
|
||||
same point to see if a non-empty match can be found. */
|
||||
|
||||
if (ovector[0] == ovector[1])
|
||||
{
|
||||
if (ovector[0] == subject_length) break;
|
||||
options = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED;
|
||||
}
|
||||
|
||||
/* If the previous match was not an empty string, there is one tricky case to
|
||||
consider. If a pattern contains \K within a lookbehind assertion at the
|
||||
start, the end of the matched string can be at the offset where the match
|
||||
started. Without special action, this leads to a loop that keeps on matching
|
||||
the same substring. We must detect this case and arrange to move the start on
|
||||
by one character. The pcre2_get_startchar() function returns the starting
|
||||
offset that was passed to pcre2_match(). */
|
||||
|
||||
else
|
||||
{
|
||||
PCRE2_SIZE startchar = pcre2_get_startchar(match_data);
|
||||
if (start_offset <= startchar)
|
||||
{
|
||||
if (startchar >= subject_length) break; /* Reached end of subject. */
|
||||
start_offset = startchar + 1; /* Advance by one character. */
|
||||
if (utf8) /* If UTF-8, it may be more */
|
||||
{ /* than one code unit. */
|
||||
for (; start_offset < subject_length; start_offset++)
|
||||
if ((subject[start_offset] & 0xc0) != 0x80) break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/* Run the next matching operation */
|
||||
|
||||
rc = pcre2_match(
|
||||
re, /* the compiled pattern */
|
||||
subject, /* the subject string */
|
||||
subject_length, /* the length of the subject */
|
||||
start_offset, /* starting offset in the subject */
|
||||
options, /* options */
|
||||
match_data, /* block for storing the result */
|
||||
NULL); /* use default match context */
|
||||
|
||||
/* This time, a result of NOMATCH isn't an error. If the value in "options"
|
||||
is zero, it just means we have found all possible matches, so the loop ends.
|
||||
Otherwise, it means we have failed to find a non-empty-string match at a
|
||||
point where there was a previous empty-string match. In this case, we do what
|
||||
Perl does: advance the matching position by one character, and continue. We
|
||||
do this by setting the "end of previous match" offset, because that is picked
|
||||
up at the top of the loop as the point at which to start again.
|
||||
|
||||
There are two complications: (a) When CRLF is a valid newline sequence, and
|
||||
the current position is just before it, advance by an extra byte. (b)
|
||||
Otherwise we must ensure that we skip an entire UTF character if we are in
|
||||
UTF mode. */
|
||||
|
||||
if (rc == PCRE2_ERROR_NOMATCH)
|
||||
{
|
||||
if (options == 0) break; /* All matches found */
|
||||
ovector[1] = start_offset + 1; /* Advance one code unit */
|
||||
if (crlf_is_newline && /* If CRLF is a newline & */
|
||||
start_offset < subject_length - 1 && /* we are at CRLF, */
|
||||
subject[start_offset] == '\r' &&
|
||||
subject[start_offset + 1] == '\n')
|
||||
ovector[1] += 1; /* Advance by one more. */
|
||||
else if (utf8) /* Otherwise, ensure we */
|
||||
{ /* advance a whole UTF-8 */
|
||||
while (ovector[1] < subject_length) /* character. */
|
||||
{
|
||||
if ((subject[ovector[1]] & 0xc0) != 0x80) break;
|
||||
ovector[1] += 1;
|
||||
}
|
||||
}
|
||||
continue; /* Go round the loop again */
|
||||
}
|
||||
|
||||
/* Other matching errors are not recoverable. */
|
||||
|
||||
if (rc < 0)
|
||||
{
|
||||
printf("Matching error %d\n", rc);
|
||||
pcre2_match_data_free(match_data);
|
||||
pcre2_code_free(re);
|
||||
return 1;
|
||||
}
|
||||
|
||||
/* Match succeeded */
|
||||
|
||||
printf("\nMatch succeeded again at offset %d\n", (int)ovector[0]);
|
||||
|
||||
/* The match succeeded, but the output vector wasn't big enough. This
|
||||
should not happen. */
|
||||
|
||||
if (rc == 0)
|
||||
printf("ovector was not big enough for all the captured substrings\n");
|
||||
|
||||
/* We must guard against patterns such as /(?=.\K)/ that use \K in an
|
||||
assertion to set the start of a match later than its end. In this
|
||||
demonstration program, we just detect this case and give up. */
|
||||
|
||||
if (ovector[0] > ovector[1])
|
||||
{
|
||||
printf("\\K was used in an assertion to set the match start after its end.\n"
|
||||
"From end to start the match was: %.*s\n", (int)(ovector[0] - ovector[1]),
|
||||
(char *)(subject + ovector[1]));
|
||||
printf("Run abandoned\n");
|
||||
pcre2_match_data_free(match_data);
|
||||
pcre2_code_free(re);
|
||||
return 1;
|
||||
}
|
||||
|
||||
/* As before, show substrings stored in the output vector by number, and then
|
||||
also any named substrings. */
|
||||
|
||||
for (i = 0; i < rc; i++)
|
||||
{
|
||||
PCRE2_SPTR substring_start = subject + ovector[2*i];
|
||||
PCRE2_SIZE substring_length = ovector[2*i+1] - ovector[2*i];
|
||||
printf("%2d: %.*s\n", i, (int)substring_length, (char *)substring_start);
|
||||
}
|
||||
|
||||
if (namecount == 0) printf("No named substrings\n"); else
|
||||
{
|
||||
PCRE2_SPTR tabptr = name_table;
|
||||
printf("Named substrings\n");
|
||||
for (i = 0; i < namecount; i++)
|
||||
{
|
||||
int n = (tabptr[0] << 8) | tabptr[1];
|
||||
printf("(%d) %*s: %.*s\n", n, name_entry_size - 3, tabptr + 2,
|
||||
(int)(ovector[2*n+1] - ovector[2*n]), subject + ovector[2*n]);
|
||||
tabptr += name_entry_size;
|
||||
}
|
||||
}
|
||||
} /* End of loop to find second and subsequent matches */
|
||||
|
||||
printf("\n");
|
||||
pcre2_match_data_free(match_data);
|
||||
pcre2_code_free(re);
|
||||
return 0;
|
||||
}
|
||||
|
||||
/* End of pcre2demo.c */
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
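<p>
The advance rule that the demonstration program applies when a retry at the
position of an empty match fails can be isolated as a small helper. The
following is a minimal sketch, not part of PCRE2: the name
next_retry_offset and its parameters are hypothetical, and it assumes an
8-bit subject with pos less than length.
</p>

```c
#include <stddef.h>

/* Hypothetical helper (not a PCRE2 function): return the offset at which to
   restart matching after an empty match at "pos" had no non-empty alternative.
   Assumes pos < length. */
size_t next_retry_offset(const unsigned char *subject, size_t length,
                         size_t pos, int crlf_is_newline, int utf8)
{
  size_t next = pos + 1;                       /* Advance one code unit */

  if (crlf_is_newline && pos + 1 < length &&
      subject[pos] == '\r' && subject[pos + 1] == '\n')
    next += 1;                                 /* Step over the whole CRLF */
  else if (utf8)
    while (next < length && (subject[next] & 0xc0) == 0x80)
      next += 1;                               /* Skip continuation bytes */

  return next;
}
```

<p>
Continuation bytes in UTF-8 always have the top two bits set to 10, which is
why the test against 0xc0 and 0x80 is enough to find the start of the next
character.
</p>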
|
||||
1135
3rd/pcre2/doc/html/pcre2grep.html
Normal file
@@ -0,0 +1,1135 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2grep specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2grep man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
|
||||
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
|
||||
<li><a name="TOC3" href="#SEC3">SUPPORT FOR COMPRESSED FILES</a>
|
||||
<li><a name="TOC4" href="#SEC4">BINARY FILES</a>
|
||||
<li><a name="TOC5" href="#SEC5">BINARY ZEROS IN PATTERNS</a>
|
||||
<li><a name="TOC6" href="#SEC6">OPTIONS</a>
|
||||
<li><a name="TOC7" href="#SEC7">ENVIRONMENT VARIABLES</a>
|
||||
<li><a name="TOC8" href="#SEC8">NEWLINES</a>
|
||||
<li><a name="TOC9" href="#SEC9">OPTIONS COMPATIBILITY WITH GNU GREP</a>
|
||||
<li><a name="TOC10" href="#SEC10">OPTIONS WITH DATA</a>
|
||||
<li><a name="TOC11" href="#SEC11">USING PCRE2'S CALLOUT FACILITY</a>
|
||||
<li><a name="TOC12" href="#SEC12">MATCHING ERRORS</a>
|
||||
<li><a name="TOC13" href="#SEC13">DIAGNOSTICS</a>
|
||||
<li><a name="TOC14" href="#SEC14">SEE ALSO</a>
|
||||
<li><a name="TOC15" href="#SEC15">AUTHOR</a>
|
||||
<li><a name="TOC16" href="#SEC16">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
|
||||
<P>
|
||||
<b>pcre2grep [options] [long options] [pattern] [path1 path2 ...]</b>
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
|
||||
<P>
|
||||
<b>pcre2grep</b> searches files for character patterns, in the same way as other
|
||||
grep commands do, but it uses the PCRE2 regular expression library to support
|
||||
patterns that are compatible with the regular expressions of Perl 5. See
|
||||
<a href="pcre2syntax.html"><b>pcre2syntax</b>(3)</a>
|
||||
for a quick-reference summary of pattern syntax, or
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b>(3)</a>
|
||||
for a full description of the syntax and semantics of the regular expressions
|
||||
that PCRE2 supports.
|
||||
</P>
|
||||
<P>
|
||||
Patterns, whether supplied on the command line or in a separate file, are given
|
||||
without delimiters. For example:
|
||||
<pre>
|
||||
pcre2grep Thursday /etc/motd
|
||||
</pre>
|
||||
If you attempt to use delimiters (for example, by surrounding a pattern with
|
||||
slashes, as is common in Perl scripts), they are interpreted as part of the
|
||||
pattern. Quotes can of course be used to delimit patterns on the command line
|
||||
because they are interpreted by the shell, and indeed quotes are required if a
|
||||
pattern contains white space or shell metacharacters.
|
||||
</P>
|
||||
<P>
|
||||
The first argument that follows any option settings is treated as the single
|
||||
pattern to be matched when neither <b>-e</b> nor <b>-f</b> is present.
|
||||
Conversely, when one or both of these options are used to specify patterns, all
|
||||
arguments are treated as path names. At least one of <b>-e</b>, <b>-f</b>, or an
|
||||
argument pattern must be provided.
|
||||
</P>
|
||||
<P>
|
||||
If no files are specified, <b>pcre2grep</b> reads the standard input. The
|
||||
standard input can also be referenced by a name consisting of a single hyphen.
|
||||
For example:
|
||||
<pre>
|
||||
pcre2grep some-pattern file1 - file3
|
||||
</pre>
|
||||
By default, input files are searched line by line, so pattern assertions about
|
||||
the beginning and end of a subject string (^, $, \A, \Z, and \z) match at
|
||||
the beginning and end of each line. When a line matches a pattern, it is copied
|
||||
to the standard output, and if there is more than one file, the file name is
|
||||
output at the start of each line, followed by a colon. However, there are
|
||||
options that can change how <b>pcre2grep</b> behaves. For example, the <b>-M</b>
|
||||
option makes it possible to search for strings that span line boundaries. What
|
||||
defines a line boundary is controlled by the <b>-N</b> (<b>--newline</b>) option.
|
||||
The <b>-h</b> and <b>-H</b> options control whether or not file names are shown,
|
||||
and the <b>-Z</b> option changes the file name terminator to a zero byte.
|
||||
</P>
|
||||
<P>
|
||||
The amount of memory used for buffering files that are being scanned is
|
||||
controlled by parameters that can be set by the <b>--buffer-size</b> and
|
||||
<b>--max-buffer-size</b> options. The first of these sets the size of buffer
|
||||
that is obtained at the start of processing. If an input file contains very
|
||||
long lines, a larger buffer may be needed; this is handled by automatically
|
||||
extending the buffer, up to the limit specified by <b>--max-buffer-size</b>. The
|
||||
default values for these parameters can be set when <b>pcre2grep</b> is
|
||||
built; if nothing is specified, the defaults are set to 20KiB and 1MiB
|
||||
respectively. An error occurs if a line is too long and the buffer can no
|
||||
longer be expanded.
|
||||
</P>
|
||||
<P>
|
||||
The block of memory that is actually used is three times the "buffer size", to
|
||||
allow for buffering "before" and "after" lines. If the buffer size is too
|
||||
small, fewer than requested "before" and "after" lines may be output.
|
||||
</P>
|
||||
<P>
|
||||
When matching with a multiline pattern, the size of the buffer must be at least
|
||||
half of the maximum match expected or the pattern might fail to match.
|
||||
</P>
|
||||
<P>
|
||||
Patterns can be no longer than 8KiB or BUFSIZ bytes, whichever is the greater.
|
||||
BUFSIZ is defined in <b>&#60;stdio.h&#62;</b>. When there is more than one pattern
|
||||
(specified by the use of <b>-e</b> and/or <b>-f</b>), each pattern is applied to
|
||||
each line in the order in which they are defined, except that all the <b>-e</b>
|
||||
patterns are tried before the <b>-f</b> patterns.
|
||||
</P>
|
||||
<P>
|
||||
By default, as soon as one pattern matches a line, no further patterns are
|
||||
considered. However, if <b>--colour</b> (or <b>--color</b>) is used to colour the
|
||||
matching substrings, or if <b>--only-matching</b>, <b>--file-offsets</b>,
|
||||
<b>--line-offsets</b>, or <b>--output</b> is used to output only the part of the
|
||||
line that matched (either shown literally, or as an offset), the behaviour is
|
||||
different. In this situation, all the patterns are applied to the line. If
|
||||
there is more than one match, the one that begins nearest to the start of the
|
||||
subject is processed; if there is more than one match at that position, the one
|
||||
with the longest matching substring is processed; if the matching substrings
|
||||
are of equal length, the first match found is processed.
|
||||
</P>
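<P>
The selection rule described above (earliest start, then longest substring,
then first found) can be sketched as a comparison over candidate matches. This
is an illustration only, not the <b>pcre2grep</b> source: the span type and the
pick_match function are hypothetical names, and each candidate is assumed to be
one pattern's match on the current line.
</P>

```c
#include <stddef.h>

/* Hypothetical type: start and end offsets of one pattern's match. */
typedef struct { size_t start; size_t end; } span;

/* Return the index of the match to process, or -1 if there are none.
   Candidates are assumed to be listed in the order the patterns were tried. */
int pick_match(const span *m, int count)
{
  int best = -1;
  for (int i = 0; i < count; i++)
  {
    if (best < 0 ||
        m[i].start < m[best].start ||               /* Begins earlier */
        (m[i].start == m[best].start &&             /* Same start, but */
         m[i].end - m[i].start >
           m[best].end - m[best].start))            /* a longer substring */
      best = i;
  }
  return best;                /* Full ties keep the first match found */
}
```

<P>
Using strict comparisons means that a later candidate replaces the current
choice only when it is strictly better, which preserves the "first match
found" tie-break.
</P>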
|
||||
<P>
|
||||
Scanning with all the patterns resumes immediately following the match, so that
|
||||
later matches on the same line can be found. Note, however, that an overlapping
|
||||
match that starts in the middle of another match will not be processed.
|
||||
</P>
|
||||
<P>
|
||||
The above behaviour was changed at release 10.41 to be more compatible with GNU
|
||||
grep. In earlier releases, <b>pcre2grep</b> did not recognize matches from
|
||||
later patterns that were earlier in the subject.
|
||||
</P>
|
||||
<P>
|
||||
Patterns that can match an empty string are accepted, but empty string
|
||||
matches are never recognized. An example is the pattern "(super)?(man)?", in
|
||||
which all components are optional. This pattern finds all occurrences of both
|
||||
"super" and "man"; the output differs from matching with "super|man" when only
|
||||
the matching substrings are being shown.
|
||||
</P>
|
||||
<P>
|
||||
If the <b>LC_ALL</b> or <b>LC_CTYPE</b> environment variable is set,
|
||||
<b>pcre2grep</b> uses the value to set a locale when calling the PCRE2 library.
|
||||
The <b>--locale</b> option can be used to override this.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">SUPPORT FOR COMPRESSED FILES</a><br>
|
||||
<P>
|
||||
Compile-time options for <b>pcre2grep</b> can set it up to use <b>libz</b> or
|
||||
<b>libbz2</b> for reading compressed files whose names end in <b>.gz</b> or
|
||||
<b>.bz2</b>, respectively. You can find out whether your <b>pcre2grep</b> binary
|
||||
has support for one or both of these file types by running it with the
|
||||
<b>--help</b> option. If the appropriate support is not present, all files are
|
||||
treated as plain text. The standard input is always so treated. If a file with
|
||||
a <b>.gz</b> or <b>.bz2</b> extension is not in fact compressed, it is read as a
|
||||
plain text file. When input is from a compressed .gz or .bz2 file, the
|
||||
<b>--line-buffered</b> option is ignored.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">BINARY FILES</a><br>
|
||||
<P>
|
||||
By default, a file that contains a binary zero byte within the first 1024 bytes
|
||||
is identified as a binary file, and is processed specially. However, if the
|
||||
newline type is specified as NUL, that is, the line terminator is a binary
|
||||
zero, the test for a binary file is not applied. See the <b>--binary-files</b>
|
||||
option for a means of changing the way binary files are handled.
|
||||
</P>
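<P>
The default binary-file test described above amounts to looking for a zero
byte near the start of the data. A minimal sketch follows; the function name
looks_binary and its parameters are hypothetical, and the real program may
organise this check differently.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: non-zero if the buffer would be classed as binary.
   "newline_is_nul" is set when the line terminator is a binary zero, in
   which case the test is not applied. */
int looks_binary(const unsigned char *buf, size_t len, int newline_is_nul)
{
  size_t limit = len < 1024 ? len : 1024;    /* Only the first 1024 bytes */
  if (newline_is_nul) return 0;
  return memchr(buf, 0, limit) != NULL;
}
```

<P>
A zero byte that first appears after the initial 1024 bytes does not cause
the file to be identified as binary under this rule.
</P>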
|
||||
<br><a name="SEC5" href="#TOC1">BINARY ZEROS IN PATTERNS</a><br>
|
||||
<P>
|
||||
Patterns passed from the command line are strings that are terminated by a
|
||||
binary zero, so cannot contain internal zeros. However, patterns that are read
|
||||
from a file via the <b>-f</b> option may contain binary zeros.
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">OPTIONS</a><br>
|
||||
<P>
|
||||
The order in which some of the options appear can affect the output. For
|
||||
example, both the <b>-H</b> and <b>-l</b> options affect the printing of file
|
||||
names. Whichever comes later in the command line will be the one that takes
|
||||
effect. Similarly, except where noted below, if an option is given twice, the
|
||||
later setting is used. Numerical values for options may be followed by K or M,
|
||||
to signify multiplication by 1024 or 1024*1024 respectively.
|
||||
</P>
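<P>
The K and M multipliers can be illustrated with a short parsing sketch. The
function parse_size is a hypothetical name, not taken from <b>pcre2grep</b>;
it accepts only the upper case suffixes named above and does not check for
arithmetic overflow.
</P>

```c
#include <stdlib.h>

/* Hypothetical helper: parse "number", "numberK", or "numberM".
   Returns 0 and stores the value on success, -1 on a malformed string. */
int parse_size(const char *text, unsigned long *out)
{
  char *end;
  unsigned long n;

  if (*text < '0' || *text > '9') return -1;    /* Must start with a digit */
  n = strtoul(text, &end, 10);

  if (*end == 'K') { n *= 1024UL; end++; }               /* Kibibytes */
  else if (*end == 'M') { n *= 1024UL * 1024UL; end++; } /* Mebibytes */

  if (*end != '\0') return -1;                  /* Trailing junk */
  *out = n;
  return 0;
}
```

<P>
For example, a setting of 20K corresponds to 20480 and 1M to 1048576.
</P>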
|
||||
<P>
|
||||
<b>--</b>
|
||||
This terminates the list of options. It is useful if the next item on the
|
||||
command line starts with a hyphen but is not an option. This allows for the
|
||||
processing of patterns and file names that start with hyphens.
|
||||
</P>
|
||||
<P>
|
||||
<b>-A</b> <i>number</i>, <b>--after-context=</b><i>number</i>
|
||||
Output up to <i>number</i> lines of context after each matching line. Fewer
|
||||
lines are output if the next match or the end of the file is reached, or if the
|
||||
processing buffer size has been set too small. If file names and/or line
|
||||
numbers are being output, a hyphen separator is used instead of a colon for the
|
||||
context lines (the <b>-Z</b> option can be used to change the file name
|
||||
terminator to a zero byte). A line containing "--" is output between each group
|
||||
of lines, unless they are in fact contiguous in the input file. The value of
|
||||
<i>number</i> is expected to be relatively small. When <b>-c</b> is used,
|
||||
<b>-A</b> is ignored.
|
||||
</P>
|
||||
<P>
|
||||
<b>-a</b>, <b>--text</b>
|
||||
Treat binary files as text. This is equivalent to
|
||||
<b>--binary-files</b>=<i>text</i>.
|
||||
</P>
|
||||
<P>
|
||||
<b>--allow-lookaround-bsk</b>
|
||||
PCRE2 now forbids the use of \K in lookarounds by default, in line with Perl.
|
||||
This option causes <b>pcre2grep</b> to set the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
|
||||
option, which enables this somewhat dangerous usage.
|
||||
</P>
|
||||
<P>
|
||||
<b>-B</b> <i>number</i>, <b>--before-context=</b><i>number</i>
|
||||
Output up to <i>number</i> lines of context before each matching line. Fewer
|
||||
lines are output if the previous match or the start of the file is within
|
||||
<i>number</i> lines, or if the processing buffer size has been set too small. If
|
||||
file names and/or line numbers are being output, a hyphen separator is used
|
||||
instead of a colon for the context lines (the <b>-Z</b> option can be used to
|
||||
change the file name terminator to a zero byte). A line containing "--" is
|
||||
output between each group of lines, unless they are in fact contiguous in the
|
||||
input file. The value of <i>number</i> is expected to be relatively small. When
|
||||
<b>-c</b> is used, <b>-B</b> is ignored.
|
||||
</P>
|
||||
<P>
|
||||
<b>--binary-files=</b><i>word</i>
|
||||
Specify how binary files are to be processed. If the word is "binary" (the
|
||||
default), pattern matching is performed on binary files, but the only output is
|
||||
"Binary file &#60;name&#62; matches" when a match succeeds. If the word is "text",
|
||||
which is equivalent to the <b>-a</b> or <b>--text</b> option, binary files are
|
||||
processed in the same way as any other file. In this case, when a match
|
||||
succeeds, the output may be binary garbage, which can have nasty effects if
|
||||
sent to a terminal. If the word is "without-match", which is equivalent to the
|
||||
<b>-I</b> option, binary files are not processed at all; they are assumed not to
|
||||
be of interest and are skipped without causing any output or affecting the
|
||||
return code.
|
||||
</P>
|
||||
<P>
|
||||
<b>--buffer-size=</b><i>number</i>
|
||||
Set the parameter that controls how much memory is obtained at the start of
|
||||
processing for buffering files that are being scanned. See also
|
||||
<b>--max-buffer-size</b> below.
|
||||
</P>
|
||||
<P>
|
||||
<b>-C</b> <i>number</i>, <b>--context=</b><i>number</i>
|
||||
Output <i>number</i> lines of context both before and after each matching line.
|
||||
This is equivalent to setting both <b>-A</b> and <b>-B</b> to the same value.
|
||||
</P>
|
||||
<P>
|
||||
<b>-c</b>, <b>--count</b>
|
||||
Do not output lines from the files that are being scanned; instead output the
|
||||
number of lines that would have been shown, either because they matched, or, if
|
||||
<b>-v</b> is set, because they failed to match. By default, this count is
|
||||
exactly the same as the number of lines that would have been output, but if the
|
||||
<b>-M</b> (multiline) option is used (without <b>-v</b>), there may be more
|
||||
suppressed lines than the count (that is, the number of matches).
|
||||
<br>
|
||||
<br>
|
||||
If no lines are selected, the number zero is output. If several files are
|
||||
being scanned, a count is output for each of them and the <b>-t</b> option can
|
||||
be used to cause a total to be output at the end. However, if the
|
||||
<b>--files-with-matches</b> option is also used, only those files whose counts
|
||||
are greater than zero are listed. When <b>-c</b> is used, the <b>-A</b>,
|
||||
<b>-B</b>, and <b>-C</b> options are ignored.
|
||||
</P>
|
||||
<P>
|
||||
<b>--colour</b>, <b>--color</b>
|
||||
If this option is given without any data, it is equivalent to "--colour=auto".
|
||||
If data is required, it must be given in the same shell item, separated by an
|
||||
equals sign.
|
||||
</P>
|
||||
<P>
|
||||
<b>--colour=</b><i>value</i>, <b>--color=</b><i>value</i>
|
||||
This option specifies under what circumstances the parts of a line that matched
|
||||
a pattern should be coloured in the output. It is ignored if
|
||||
<b>--file-offsets</b>, <b>--line-offsets</b>, or <b>--output</b> is set. By
|
||||
default, output is not coloured. The value for the <b>--colour</b> option (which
|
||||
is optional, see above) may be "never", "always", or "auto". In the last
|
||||
case, colouring happens only if the standard output is connected to a terminal.
|
||||
More resources are used when colouring is enabled, because <b>pcre2grep</b> has
|
||||
to search for all possible matches in a line, not just one, in order to colour
|
||||
them all.
|
||||
<br>
|
||||
<br>
|
||||
The colour that is used can be specified by setting one of the environment
|
||||
variables PCRE2GREP_COLOUR, PCRE2GREP_COLOR, PCREGREP_COLOUR, or
|
||||
PCREGREP_COLOR, which are checked in that order. If none of these are set,
|
||||
<b>pcre2grep</b> looks for GREP_COLORS or GREP_COLOR (in that order). The value
|
||||
of the variable should be a string of two numbers, separated by a semicolon,
|
||||
except in the case of GREP_COLORS, which must start with "ms=" or "mt="
|
||||
followed by two semicolon-separated colours, terminated by the end of the
|
||||
string or by a colon. If GREP_COLORS does not start with "ms=" or "mt=" it is
|
||||
ignored, and GREP_COLOR is checked.
|
||||
<br>
|
||||
<br>
|
||||
If the string obtained from one of the above variables contains any characters
|
||||
other than semicolon or digits, the setting is ignored and the default colour
|
||||
is used. The string is copied directly into the control string for setting
|
||||
colour on a terminal, so it is your responsibility to ensure that the values
|
||||
make sense. If no relevant environment variable is set, the default is "1;31",
|
||||
which gives red.
|
||||
</P>
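<P>
The validation rule for colour strings (digits and semicolons only, with
"1;31" as the fallback) can be sketched as follows. The function name
effective_colour is hypothetical, and the sketch leaves out the lookup order
of the environment variables and the special "ms=" and "mt=" handling for
GREP_COLORS.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the colour string to use, given the value
   obtained from the environment (NULL if no relevant variable is set). */
const char *effective_colour(const char *env_value)
{
  static const char default_colour[] = "1;31";   /* Red */

  if (env_value == NULL) return default_colour;

  /* strspn() gives the length of the leading run of permitted characters;
     if that run does not reach the end, the setting is ignored. */
  if (env_value[strspn(env_value, "0123456789;")] != '\0')
    return default_colour;

  return env_value;
}
```

<P>
Because the accepted string is copied into the terminal control sequence
unchanged, this check only filters out unexpected characters; it does not
verify that the numbers form a meaningful colour setting.
</P>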
|
||||
<P>
|
||||
<b>-D</b> <i>action</i>, <b>--devices=</b><i>action</i>
|
||||
If an input path is not a regular file or a directory, "action" specifies how
|
||||
it is to be processed. Valid values are "read" (the default) or "skip"
|
||||
(silently skip the path).
|
||||
</P>
|
||||
<P>
|
||||
<b>-d</b> <i>action</i>, <b>--directories=</b><i>action</i>
|
||||
If an input path is a directory, "action" specifies how it is to be processed.
|
||||
Valid values are "read" (the default in non-Windows environments, for
|
||||
compatibility with GNU grep), "recurse" (equivalent to the <b>-r</b> option), or
|
||||
"skip" (silently skip the path, the default in Windows environments). In the
|
||||
"read" case, directories are read as if they were ordinary files. In some
|
||||
operating systems the effect of reading a directory like this is an immediate
|
||||
end-of-file; in others it may provoke an error.
|
||||
</P>
|
||||
<P>
|
||||
<b>--depth-limit</b>=<i>number</i>
|
||||
See <b>--match-limit</b> below.
|
||||
</P>
|
||||
<P>
|
||||
<b>-E</b>, <b>--case-restrict</b>
|
||||
When case distinctions are being ignored in Unicode mode, two ASCII letters (K
|
||||
and S) will by default match Unicode characters U+212A (Kelvin sign) and U+017F
|
||||
(long S) respectively, as well as their lower case ASCII counterparts. When
|
||||
this option is set, case equivalences are restricted such that no ASCII
|
||||
character matches a non-ASCII character, and vice versa.
|
||||
</P>
|
||||
<P>
|
||||
<b>-e</b> <i>pattern</i>, <b>--regex=</b><i>pattern</i>, <b>--regexp=</b><i>pattern</i>
|
||||
Specify a pattern to be matched. This option can be used multiple times in
|
||||
order to specify several patterns. It can also be used as a way of specifying a
|
||||
single pattern that starts with a hyphen. When <b>-e</b> is used, no argument
|
||||
pattern is taken from the command line; all arguments are treated as file
|
||||
names. There is no limit to the number of patterns. They are applied to each
|
||||
line in the order in which they are defined.
|
||||
<br>
|
||||
<br>
|
||||
If <b>-f</b> is used with <b>-e</b>, the command line patterns are matched first,
|
||||
followed by the patterns from the file(s), independent of the order in which
|
||||
these options are specified.
|
||||
</P>
|
||||
<P>
|
||||
<b>--exclude</b>=<i>pattern</i>
|
||||
Files (but not directories) whose names match the pattern are skipped without
|
||||
being processed. This applies to all files, whether listed on the command line,
|
||||
obtained from <b>--file-list</b>, or by scanning a directory. The pattern is a
|
||||
PCRE2 regular expression, and is matched against the final component of the
|
||||
file name, not the entire path. The <b>-F</b>, <b>-w</b>, and <b>-x</b> options do
|
||||
not apply to this pattern. The option may be given any number of times in order
|
||||
to specify multiple patterns. If a file name matches both an <b>--include</b>
|
||||
and an <b>--exclude</b> pattern, it is excluded. There is no short form for this
|
||||
option.
|
||||
</P>
|
||||
<P>
|
||||
<b>--exclude-from=</b><i>filename</i>
|
||||
Treat each non-empty line of the file as the data for an <b>--exclude</b>
|
||||
option. What constitutes a newline when reading the file is the operating
|
||||
system's default. The <b>--newline</b> option has no effect on this option. This
|
||||
option may be given more than once in order to specify a number of files to
|
||||
read.
|
||||
</P>
|
||||
<P>
|
||||
<b>--exclude-dir</b>=<i>pattern</i>
|
||||
Directories whose names match the pattern are skipped without being processed,
|
||||
whatever the setting of the <b>--recursive</b> option. This applies to all
|
||||
directories, whether listed on the command line, obtained from
|
||||
<b>--file-list</b>, or by scanning a parent directory. The pattern is a PCRE2
|
||||
regular expression, and is matched against the final component of the directory
|
||||
name, not the entire path. The <b>-F</b>, <b>-w</b>, and <b>-x</b> options do not
|
||||
apply to this pattern. The option may be given any number of times in order to
|
||||
specify more than one pattern. If a directory matches both <b>--include-dir</b>
|
||||
and <b>--exclude-dir</b>, it is excluded. There is no short form for this
|
||||
option.
|
||||
</P>
|
||||
<P>
|
||||
<b>-F</b>, <b>--fixed-strings</b>
|
||||
Interpret each data-matching pattern as a list of fixed strings, separated by
|
||||
newlines, instead of as a regular expression. What constitutes a newline for
|
||||
this purpose is controlled by the <b>--newline</b> option. The <b>-w</b> (match
|
||||
as a word) and <b>-x</b> (match whole line) options can be used with <b>-F</b>.
|
||||
They apply to each of the fixed strings. A line is selected if any of the fixed
|
||||
strings are found in it (subject to <b>-w</b> or <b>-x</b>, if present). This
|
||||
option applies only to the patterns that are matched against the contents of
|
||||
files; it does not apply to patterns specified by any of the <b>--include</b> or
|
||||
<b>--exclude</b> options.
|
||||
</P>
|
||||
<P>
|
||||
<b>-f</b> <i>filename</i>, <b>--file=</b><i>filename</i>
|
||||
Read patterns from the file, one per line. As is the case with patterns on the
|
||||
command line, no delimiters should be used. What constitutes a newline when
|
||||
reading the file is the operating system's default interpretation of \n. The
|
||||
<b>--newline</b> option has no effect on this option. Trailing white space is
|
||||
removed from each line, and blank lines are ignored unless the
|
||||
<b>--posix-pattern-file</b> option is also provided. An empty file contains no
|
||||
patterns and therefore matches nothing. Patterns read from a file in this way
|
||||
may contain binary zeros, which are treated as ordinary character literals.
|
||||
<br>
|
||||
<br>
|
||||
If this option is given more than once, all the specified files are read. A
|
||||
data line is output if any of the patterns match it. A file name can be given
|
||||
as "-" to refer to the standard input. When <b>-f</b> is used, patterns
|
||||
specified on the command line using <b>-e</b> may also be present; they are
|
||||
matched before the file's patterns. However, no pattern is taken from the
|
||||
command line; all arguments are treated as the names of paths to be searched.
|
||||
</P>
|
||||
<P>
|
||||
<b>--file-list</b>=<i>filename</i>
|
||||
Read a list of files and/or directories that are to be scanned from the given
|
||||
file, one per line. What constitutes a newline when reading the file is the
|
||||
operating system's default. Trailing white space is removed from each line, and
|
||||
blank lines are ignored. These paths are processed before any that are listed
|
||||
on the command line. The file name can be given as "-" to refer to the standard
|
||||
input. If <b>--file</b> and <b>--file-list</b> are both specified as "-",
|
||||
patterns are read first. This is useful only when the standard input is a
|
||||
terminal, from which further lines (the list of files) can be read after an
|
||||
end-of-file indication. If this option is given more than once, all the
|
||||
specified files are read.
|
||||
</P>
|
||||
<P>
|
||||
<b>--file-offsets</b>
|
||||
Instead of showing lines or parts of lines that match, show each match as an
|
||||
offset from the start of the file and a length, separated by a comma. In this
|
||||
mode, <b>--colour</b> has no effect, and no context is shown. That is, the
|
||||
<b>-A</b>, <b>-B</b>, and <b>-C</b> options are ignored. If there is more than one
|
||||
match in a line, each of them is shown separately. This option is mutually
|
||||
exclusive with <b>--output</b>, <b>--line-offsets</b>, and <b>--only-matching</b>.
|
||||
</P>
|
||||
<P>
|
||||
<b>--group-separator</b>=<i>text</i>
|
||||
Output this text string instead of two hyphens between groups of lines when
|
||||
<b>-A</b>, <b>-B</b>, or <b>-C</b> is in use. See also <b>--no-group-separator</b>.
|
||||
</P>
|
||||
<P>
|
||||
<b>-H</b>, <b>--with-filename</b>
|
||||
Force the inclusion of the file name at the start of output lines when
|
||||
searching a single file. The file name is not normally shown in this case.
|
||||
By default, for matching lines, the file name is followed by a colon; for
|
||||
context lines, a hyphen separator is used. The <b>-Z</b> option can be used to
|
||||
change the terminator to a zero byte. If a line number is also being output,
|
||||
it follows the file name. When the <b>-M</b> option causes a pattern to match
|
||||
more than one line, only the first is preceded by the file name. This option
|
||||
overrides any previous <b>-h</b>, <b>-l</b>, or <b>-L</b> options.
|
||||
</P>
|
||||
<P>
|
||||
<b>-h</b>, <b>--no-filename</b>
|
||||
Suppress the output of file names when searching multiple files. File names are
|
||||
normally shown when multiple files are searched. By default, for matching
|
||||
lines, the file name is followed by a colon; for context lines, a hyphen
|
||||
separator is used. The <b>-Z</b> option can be used to change the terminator to
|
||||
a zero byte. If a line number is also being output, it follows the file name.
|
||||
This option overrides any previous <b>-H</b>, <b>-L</b>, or <b>-l</b> options.
|
||||
</P>
|
||||
<P>
|
||||
<b>--heap-limit</b>=<i>number</i>
|
||||
See <b>--match-limit</b> below.
|
||||
</P>
|
||||
<P>
|
||||
<b>--help</b>
|
||||
Output a help message, giving brief details of the command options and file
|
||||
type support, and then exit. Anything else on the command line is
|
||||
ignored.
|
||||
</P>
|
||||
<P>
|
||||
<b>-I</b>
|
||||
Ignore binary files. This is equivalent to
|
||||
<b>--binary-files</b>=<i>without-match</i>.
|
||||
</P>
|
||||
<P>
|
||||
<b>-i</b>, <b>--ignore-case</b>
|
||||
Ignore upper/lower case distinctions when pattern matching. This applies when
|
||||
matching path names for inclusion or exclusion as well as when matching lines
|
||||
in files.
|
||||
</P>
|
||||
<P>
|
||||
<b>--include</b>=<i>pattern</i>
|
||||
If any <b>--include</b> patterns are specified, the only files that are
|
||||
processed are those whose names match one of the patterns and do not match an
|
||||
<b>--exclude</b> pattern. This option does not affect directories, but it
|
||||
applies to all files, whether listed on the command line, obtained from
|
||||
<b>--file-list</b>, or by scanning a directory. The pattern is a PCRE2 regular
|
||||
expression, and is matched against the final component of the file name, not
|
||||
the entire path. The <b>-F</b>, <b>-w</b>, and <b>-x</b> options do not apply to
|
||||
this pattern. The option may be given any number of times. If a file name
|
||||
matches both an <b>--include</b> and an <b>--exclude</b> pattern, it is excluded.
|
||||
There is no short form for this option.
|
||||
</P>
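<P>
The way <b>--include</b> and <b>--exclude</b> combine can be sketched in Python as follows. This is an illustrative model only, not <b>pcre2grep</b>'s implementation: the helper name should_process is hypothetical, Python's <b>re</b> module stands in for PCRE2, "/" is assumed to be the path separator, and the use of an unanchored search is an assumption.
</P>

```python
import posixpath
import re


def should_process(path, includes=(), excludes=()):
    """Hypothetical model of the --include/--exclude rule for one file path."""
    # Only the final component of the path is matched, not the entire path.
    name = posixpath.basename(path.rstrip("/"))
    # A name that matches any exclude pattern is always skipped.
    if any(re.search(p, name) for p in excludes):
        return False
    # If include patterns exist, the name must match at least one of them.
    if includes:
        return any(re.search(p, name) for p in includes)
    return True
```

<P>
For example, with an include pattern of \.c$ and an exclude pattern of ^test_, this sketch accepts "src/main.c" but rejects both "src/main.h" and "src/test_main.c". A directory component such as "test_dir/" has no effect, because only the final component is examined.
</P>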
|
||||
<P>
|
||||
<b>--include-from=</b><i>filename</i>
|
||||
Treat each non-empty line of the file as the data for an <b>--include</b>
|
||||
option. What constitutes a newline for this purpose is the operating system's
|
||||
default. The <b>--newline</b> option has no effect on this option. This option
|
||||
may be given any number of times; all the files are read.
|
||||
</P>
|
||||
<P>
|
||||
<b>--include-dir</b>=<i>pattern</i>
|
||||
If any <b>--include-dir</b> patterns are specified, the only directories that
|
||||
are processed are those whose names match one of the patterns and do not match
|
||||
an <b>--exclude-dir</b> pattern. This applies to all directories, whether listed
|
||||
on the command line, obtained from <b>--file-list</b>, or by scanning a parent
|
||||
directory. The pattern is a PCRE2 regular expression, and is matched against
|
||||
the final component of the directory name, not the entire path. The <b>-F</b>,
|
||||
<b>-w</b>, and <b>-x</b> options do not apply to this pattern. The option may be
|
||||
given any number of times. If a directory matches both <b>--include-dir</b> and
|
||||
<b>--exclude-dir</b>, it is excluded. There is no short form for this option.
|
||||
</P>
|
||||
<P>
|
||||
<b>-L</b>, <b>--files-without-match</b>
|
||||
Instead of outputting lines from the files, just output the names of the files
|
||||
that do not contain any lines that would have been output. Each file name is
|
||||
output once, on a separate line by default, but if the <b>-Z</b> option is set,
|
||||
they are separated by zero bytes instead of newlines. This option overrides any
|
||||
previous <b>-H</b>, <b>-h</b>, or <b>-l</b> options.
|
||||
</P>
|
||||
<P>
|
||||
<b>-l</b>, <b>--files-with-matches</b>
|
||||
Instead of outputting lines from the files, just output the names of the files
|
||||
containing lines that would have been output. Each file name is output once, on
|
||||
a separate line, but if the <b>-Z</b> option is set, they are separated by zero
|
||||
bytes instead of newlines. Searching normally stops as soon as a matching line
|
||||
is found in a file. However, if the <b>-c</b> (count) option is also used,
|
||||
matching continues in order to obtain the correct count, and those files that
|
||||
have at least one match are listed along with their counts. Using this option
|
||||
with <b>-c</b> is a way of suppressing the listing of files with no matches that
|
||||
occurs with <b>-c</b> on its own. This option overrides any previous <b>-H</b>,
|
||||
<b>-h</b>, or <b>-L</b> options.
|
||||
</P>
|
||||
<P>
|
||||
<b>--label</b>=<i>name</i>
|
||||
This option supplies a name to be used for the standard input when file names
|
||||
are being output. If not supplied, "(standard input)" is used. There is no
|
||||
short form for this option.
|
||||
</P>
|
||||
<P>
|
||||
<b>--line-buffered</b>
|
||||
When this option is given, non-compressed input is read and processed line by
|
||||
line, and the output is flushed after each write. By default, input is read in
|
||||
large chunks, unless <b>pcre2grep</b> can determine that it is reading from a
|
||||
terminal, which is currently possible only in Unix-like environments or
|
||||
Windows. Output to a terminal is normally automatically flushed by the operating
|
||||
system. This option can be useful when the input or output is attached to a
|
||||
pipe and you do not want <b>pcre2grep</b> to buffer up large amounts of data.
|
||||
However, its use will affect performance, and the <b>-M</b> (multiline) option
|
||||
ceases to work. When input is from a compressed .gz or .bz2 file,
|
||||
<b>--line-buffered</b> is ignored.
|
||||
</P>
|
||||
<P>
|
||||
<b>--line-offsets</b>
|
||||
Instead of showing lines or parts of lines that match, show each match as a
|
||||
line number, the offset from the start of the line, and a length. The line
|
||||
number is terminated by a colon (as usual; see the <b>-n</b> option), and the
|
||||
offset and length are separated by a comma. In this mode, <b>--colour</b> has no
|
||||
effect, and no context is shown. That is, the <b>-A</b>, <b>-B</b>, and <b>-C</b>
|
||||
options are ignored. If there is more than one match in a line, each of them is
|
||||
shown separately. This option is mutually exclusive with <b>--output</b>,
|
||||
<b>--file-offsets</b>, and <b>--only-matching</b>.
|
||||
</P>
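<P>
The difference between the <b>--file-offsets</b> and <b>--line-offsets</b> output formats can be illustrated with a short Python sketch. The function names are hypothetical, Python's <b>re</b> module stands in for PCRE2, lines are assumed to end with LF, and any file name prefix is omitted; it models only the shape of the output.
</P>

```python
import re


def file_offsets(data, pattern):
    """--file-offsets style: "offset,length", offset counted from the file start."""
    return [f"{m.start()},{m.end() - m.start()}" for m in re.finditer(pattern, data)]


def line_offsets(data, pattern):
    """--line-offsets style: "line:offset,length", offset counted from the line start."""
    out = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for m in re.finditer(pattern, line):
            out.append(f"{lineno}:{m.start()},{m.end() - m.start()}")
    return out


data = b"one two\nthree two\n"
print(file_offsets(data, rb"two"))
print(line_offsets(data, rb"two"))
```

<P>
Byte strings are used so that the offsets count bytes, not decoded characters.
</P>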
|
||||
<P>
|
||||
<b>--locale</b>=<i>locale-name</i>
|
||||
This option specifies a locale to be used for pattern matching. It overrides
|
||||
the value in the <b>LC_ALL</b> or <b>LC_CTYPE</b> environment variables. If no
|
||||
locale is specified, the PCRE2 library's default (usually the "C" locale) is
|
||||
used. There is no short form for this option.
|
||||
</P>
|
||||
<P>
|
||||
<b>-M</b>, <b>--multiline</b>
|
||||
Allow patterns to match more than one line. When this option is set, the PCRE2
|
||||
library is called in "multiline" mode, and a match is allowed to continue past
|
||||
the end of the initial line and onto one or more subsequent lines.
|
||||
<br>
|
||||
<br>
|
||||
Patterns used with <b>-M</b> may usefully contain literal newline characters and
|
||||
internal occurrences of ^ and $ characters, because in multiline mode these can
|
||||
match at internal newlines. Because <b>pcre2grep</b> is scanning multiple lines,
|
||||
the \Z and \z assertions match only at the end of the last line in the file.
|
||||
The \A assertion matches at the start of the first line of a match. This can
|
||||
be any line in the file; it is not anchored to the first line.
|
||||
<br>
|
||||
<br>
|
||||
The output for a successful match may consist of more than one line. The first
|
||||
line is the line in which the match started, and the last line is the line in
|
||||
which the match ended. If the matched string ends with a newline sequence, the
|
||||
output ends at the end of that line. If <b>-v</b> is set, none of the lines in a
|
||||
multi-line match are output. Once a match has been handled, scanning restarts
|
||||
at the beginning of the line after the one in which the match ended.
|
||||
<br>
|
||||
<br>
|
||||
The newline sequence that separates multiple lines must be matched as part of
|
||||
the pattern. For example, to find the phrase "regular expression" in a file
|
||||
where "regular" might be at the end of a line and "expression" at the start of
|
||||
the next line, you could use this command:
|
||||
<pre>
|
||||
pcre2grep -M 'regular\s+expression' &lt;file&gt;
|
||||
</pre>
|
||||
The \s escape sequence matches any white space character, including newlines,
|
||||
and is followed by + so as to match trailing white space on the first line as
|
||||
well as possibly handling a two-character newline sequence.
|
||||
<br>
|
||||
<br>
|
||||
There is a limit to the number of lines that can be matched, imposed by the way
|
||||
that <b>pcre2grep</b> buffers the input file as it scans it. With a sufficiently
|
||||
large processing buffer, this should not be a problem.
|
||||
<br>
|
||||
<br>
|
||||
The <b>-M</b> option does not work when input is read line by line (see
|
||||
<b>--line-buffered</b>).
|
||||
</P>
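<P>
The rule for which lines are output after a multiline match can be sketched in Python. This is a model under stated assumptions, not <b>pcre2grep</b>'s code: the function name multiline_output is hypothetical, Python's <b>re</b> module stands in for PCRE2, LF is assumed to be the newline, and only the first match is handled.
</P>

```python
import re


def multiline_output(text, pattern):
    """Return the lines spanned by the first match, or None if nothing matches."""
    m = re.search(pattern, text, re.MULTILINE)
    if m is None:
        return None
    # Output starts at the beginning of the line in which the match started.
    start = text.rfind("\n", 0, m.start()) + 1
    # If the matched string ends with a newline, output ends at the end of
    # that line; otherwise it ends at the end of the line holding the match end.
    end_pos = m.end() - 1 if m.group().endswith("\n") else m.end()
    end = text.find("\n", end_pos)
    if end == -1:
        end = len(text)
    return text[start:end]
```

<P>
Applied to the text "a regular  \nexpression here\nlast\n" with the pattern regular\s+expression, the sketch returns the first two lines, because the match starts in the first line and ends in the second.
</P>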
|
||||
<P>
|
||||
<b>-m</b> <i>number</i>, <b>--max-count</b>=<i>number</i>
|
||||
Stop processing after finding <i>number</i> matching lines, or non-matching
|
||||
lines if <b>-v</b> is also set. Any trailing context lines are output after the
|
||||
final match. In multiline mode, each multiline match counts as just one line
|
||||
for this purpose. If this limit is reached when reading the standard input from
|
||||
a regular file, the file is left positioned just after the last matching line.
|
||||
If <b>-c</b> is also set, the count that is output is never greater than
|
||||
<i>number</i>. This option has no effect if used with <b>-L</b>, <b>-l</b>, or
|
||||
<b>-q</b>, or when just checking for a match in a binary file.
|
||||
</P>
|
||||
<P>
|
||||
<b>--match-limit</b>=<i>number</i>
|
||||
Processing some regular expression patterns may take a very long time to search
|
||||
for all possible matching strings. Others may require a very large amount of
|
||||
memory. There are three options that set resource limits for matching.
|
||||
<br>
|
||||
<br>
|
||||
The <b>--match-limit</b> option provides a means of limiting computing resource
|
||||
usage when processing patterns that are not going to match, but which have a
|
||||
very large number of possibilities in their search trees. The classic example
|
||||
is a pattern that uses nested unlimited repeats. Internally, PCRE2 has a
|
||||
counter that is incremented each time around its main processing loop. If the
|
||||
value set by <b>--match-limit</b> is reached, an error occurs.
|
||||
<br>
|
||||
<br>
|
||||
The <b>--heap-limit</b> option specifies, as a number of kibibytes (units of
|
||||
1024 bytes), the maximum amount of heap memory that may be used for matching.
|
||||
<br>
|
||||
<br>
|
||||
The <b>--depth-limit</b> option limits the depth of nested backtracking points,
|
||||
which indirectly limits the amount of memory that is used. The amount of memory
|
||||
needed for each backtracking point depends on the number of capturing
|
||||
parentheses in the pattern, so the amount of memory that is used before this
|
||||
limit acts varies from pattern to pattern. This limit is of use only if it is
|
||||
set smaller than <b>--match-limit</b>.
|
||||
<br>
|
||||
<br>
|
||||
There are no short forms for these options. The default limits can be set
|
||||
when the PCRE2 library is compiled; if they are not specified, the defaults
|
||||
are very large and so effectively unlimited.
|
||||
</P>
|
||||
<P>
|
||||
<b>--max-buffer-size</b>=<i>number</i>
|
||||
This limits the expansion of the processing buffer, whose initial size can be
|
||||
set by <b>--buffer-size</b>. The maximum buffer size is silently forced to be no
|
||||
smaller than the starting buffer size.
|
||||
</P>
|
||||
<P>
|
||||
<b>-N</b> <i>newline-type</i>, <b>--newline</b>=<i>newline-type</i>
|
||||
Six different conventions for indicating the ends of lines in scanned files are
|
||||
supported. For example:
|
||||
<pre>
|
||||
pcre2grep -N CRLF 'some pattern' &lt;file&gt;
|
||||
</pre>
|
||||
The newline type may be specified in upper, lower, or mixed case. If the
|
||||
newline type is NUL, lines are separated by binary zero characters. The other
|
||||
types are the single-character sequences CR (carriage return) and LF
|
||||
(linefeed), the two-character sequence CRLF, an "anycrlf" type, which
|
||||
recognizes any of the preceding three types, and an "any" type, for which any
|
||||
Unicode line ending sequence is assumed to end a line. The Unicode sequences
|
||||
are the three just mentioned, plus VT (vertical tab, U+000B), FF (form feed,
|
||||
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
|
||||
(paragraph separator, U+2029).
|
||||
<br>
|
||||
<br>
|
||||
When the PCRE2 library is built, a default line-ending sequence is specified.
|
||||
This is normally the standard sequence for the operating system. Unless
|
||||
otherwise specified by this option, <b>pcre2grep</b> uses the library's default.
|
||||
<br>
|
||||
<br>
|
||||
This option makes it possible to use <b>pcre2grep</b> to scan files that have
|
||||
come from other environments without having to modify their line endings. If
|
||||
the data that is being scanned does not agree with the convention set by this
|
||||
option, <b>pcre2grep</b> may behave in strange ways. Note that this option does
|
||||
not apply to files specified by the <b>-f</b>, <b>--exclude-from</b>, or
|
||||
<b>--include-from</b> options, which are expected to use the operating system's
|
||||
standard newline sequence.
|
||||
</P>
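<P>
The effect of the newline type on how input is divided into lines can be illustrated in Python. The table and function names below are hypothetical, and the "any" type is left out because it depends on decoding the Unicode line endings; the sketch covers only the five byte-oriented types.
</P>

```python
import re

# Byte patterns for five of the six newline types (the "any" type is omitted).
NEWLINE_PATTERNS = {
    "CR": rb"\r",
    "LF": rb"\n",
    "CRLF": rb"\r\n",
    "ANYCRLF": rb"\r\n|\r|\n",  # CRLF is tried first so it counts as one ending
    "NUL": rb"\x00",
}


def split_lines(data, newline_type):
    """Split data into lines under the given newline convention."""
    # The type may be given in upper, lower, or mixed case.
    parts = re.split(NEWLINE_PATTERNS[newline_type.upper()], data)
    if parts and parts[-1] == b"":
        parts.pop()  # a terminator at the very end does not start a new line
    return parts
```

<P>
With the "LF" type, the data "a\r\nb\n" yields the lines "a\r" and "b": the carriage return stays in the first line, which is the kind of mismatch that can make matching behave unexpectedly when the option disagrees with the data.
</P>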
|
||||
<P>
|
||||
<b>-n</b>, <b>--line-number</b>
|
||||
Precede each output line by its line number in the file, followed by a colon
|
||||
for matching lines or a hyphen for context lines. If the file name is also
|
||||
being output, it precedes the line number. When the <b>-M</b> option causes a
|
||||
pattern to match more than one line, only the first is preceded by its line
|
||||
number. This option is forced if <b>--line-offsets</b> is used.
|
||||
</P>
|
||||
<P>
|
||||
<b>--no-group-separator</b>
|
||||
Do not output a separator between groups of lines when <b>-A</b>, <b>-B</b>, or
|
||||
<b>-C</b> is in use. The default is to output a line containing two hyphens. See
|
||||
also <b>--group-separator</b>.
|
||||
</P>
|
||||
<P>
|
||||
<b>--no-jit</b>
|
||||
If the PCRE2 library is built with support for just-in-time compiling (which
|
||||
speeds up matching), <b>pcre2grep</b> automatically makes use of this, unless it
|
||||
was explicitly disabled at build time. This option can be used to disable the
|
||||
use of JIT at run time. It is provided for testing and working around problems.
|
||||
It should never be needed in normal use.
|
||||
</P>
|
||||
<P>
|
||||
<b>-O</b> <i>text</i>, <b>--output</b>=<i>text</i>
|
||||
When there is a match, instead of outputting the line that matched, output just
|
||||
the text specified in this option, followed by an operating-system standard
|
||||
newline. In this mode, <b>--colour</b> has no effect, and no context is shown.
|
||||
That is, the <b>-A</b>, <b>-B</b>, and <b>-C</b> options are ignored. The
|
||||
<b>--newline</b> option has no effect on this option, which is mutually
|
||||
exclusive with <b>--only-matching</b>, <b>--file-offsets</b>, and
|
||||
<b>--line-offsets</b>. However, like <b>--only-matching</b>, if there is more
|
||||
than one match in a line, each of them causes a line of output.
|
||||
<br>
|
||||
<br>
|
||||
Escape sequences starting with a dollar character may be used to insert the
|
||||
contents of the matched part of the line and/or captured substrings into the
|
||||
text.
|
||||
<br>
|
||||
<br>
|
||||
$&lt;digits&gt; or ${&lt;digits&gt;} is replaced by the captured substring of the given
|
||||
decimal number; $&amp; (or the legacy $0) substitutes the whole match. If the
|
||||
number is greater than the number of capturing substrings, or if the capture
|
||||
is unset, the replacement is empty.
|
||||
<br>
|
||||
<br>
|
||||
$a is replaced by bell; $b by backspace; $e by escape; $f by form feed; $n by
|
||||
newline; $r by carriage return; $t by tab; $v by vertical tab.
|
||||
<br>
|
||||
<br>
|
||||
$o&lt;digits&gt; or $o{&lt;digits&gt;} is replaced by the character whose code point is the
|
||||
given octal number. In the first form, up to three octal digits are processed.
|
||||
When more digits are needed in Unicode mode to specify a wide character, the
|
||||
second form must be used.
|
||||
<br>
|
||||
<br>
|
||||
$x&lt;digits&gt; or $x{&lt;digits&gt;} is replaced by the character represented by the
|
||||
given hexadecimal number. In the first form, up to two hexadecimal digits are
|
||||
processed. When more digits are needed in Unicode mode to specify a wide
|
||||
character, the second form must be used.
|
||||
<br>
|
||||
<br>
|
||||
Any other character is substituted by itself. In particular, $$ is replaced by
|
||||
a single dollar.
|
||||
</P>
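<P>
The dollar escape sequences listed above can be modelled with a small Python expander. This is a sketch under assumptions, not the code used by <b>pcre2grep</b>: the name expand is hypothetical, Python's <b>re</b> module stands in for PCRE2, an unbraced capture number is assumed to take all following digits, and malformed sequences are not diagnosed.
</P>

```python
import re

# Single-letter escapes: bell, backspace, escape, form feed, newline, CR, tab, VT.
_SIMPLE = {"a": "\a", "b": "\b", "e": "\x1b", "f": "\f",
           "n": "\n", "r": "\r", "t": "\t", "v": "\v"}

_TOKEN = re.compile(
    r"\$(?:"
    r"(?P<num>\d+)|\{(?P<bnum>\d+)\}"                      # $1 or ${1}
    r"|(?P<whole>&)"                                        # $& is the whole match
    r"|o(?:(?P<oct>[0-7]{1,3})|\{(?P<boct>[0-7]+)\})"       # $o101 or $o{101}
    r"|x(?:(?P<hex>[0-9A-Fa-f]{1,2})|\{(?P<bhex>[0-9A-Fa-f]+)\})"  # $x41 or $x{41}
    r"|(?P<other>.)"                                        # $a, $n, $$, ...
    r")",
    re.DOTALL,
)


def expand(template, match):
    """Expand an --output style template against a Python match object."""
    def repl(t):
        num = t.group("num") or t.group("bnum")
        if num is not None:
            n = int(num)
            # Out-of-range or unset captures expand to the empty string.
            if n > match.re.groups:
                return ""
            return match.group(n) or ""
        if t.group("whole"):
            return match.group(0)
        octal = t.group("oct") or t.group("boct")
        if octal:
            return chr(int(octal, 8))
        hexa = t.group("hex") or t.group("bhex")
        if hexa:
            return chr(int(hexa, 16))
        ch = t.group("other")
        return _SIMPLE.get(ch, ch)  # any other character stands for itself
    return _TOKEN.sub(repl, template)
```

<P>
For a match of (\w+)@(\w+) against "bob@example", the template "$2:$1" expands to "example:bob", and "$$" yields a single dollar.
</P>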
|
||||
<P>
|
||||
<b>-o</b>, <b>--only-matching</b>
|
||||
Show only the part of the line that matched a pattern instead of the whole
|
||||
line. In this mode, no context is shown. That is, the <b>-A</b>, <b>-B</b>, and
|
||||
<b>-C</b> options are ignored. If there is more than one match in a line, each
|
||||
of them is shown separately, on a separate line of output. If <b>-o</b> is
|
||||
combined with <b>-v</b> (invert the sense of the match to find non-matching
|
||||
lines), no output is generated, but the return code is set appropriately. If
|
||||
the matched portion of the line is empty, nothing is output unless the file
|
||||
name or line number are being printed, in which case they are shown on an
|
||||
otherwise empty line. This option is mutually exclusive with <b>--output</b>,
|
||||
<b>--file-offsets</b> and <b>--line-offsets</b>.
|
||||
</P>
|
||||
<P>
|
||||
<b>-o</b><i>number</i>, <b>--only-matching</b>=<i>number</i>
|
||||
Show only the part of the line that matched the capturing parentheses of the
|
||||
given number. Up to 50 capturing parentheses are supported by default. This
|
||||
limit can be changed via the <b>--om-capture</b> option. A pattern may contain
|
||||
any number of capturing parentheses, but only those whose number is within the
|
||||
limit can be accessed by <b>-o</b>. An error occurs if the number specified by
|
||||
<b>-o</b> is greater than the limit.
|
||||
<br>
|
||||
<br>
|
||||
-o0 is the same as <b>-o</b> without a number. Because these options can be
|
||||
given without an argument (see above), if an argument is present, it must be
|
||||
given in the same shell item, for example, -o3 or --only-matching=2. The
|
||||
comments given for the non-argument case above also apply to this option. If
|
||||
the specified capturing parentheses do not exist in the pattern, or were not
|
||||
set in the match, nothing is output unless the file name or line number are
|
||||
being output.
|
||||
<br>
|
||||
<br>
|
||||
If this option is given multiple times, multiple substrings are output for each
|
||||
match, in the order the options are given, and all on one line. For example,
|
||||
-o3 -o1 -o3 causes the substrings matched by capturing parentheses 3 and 1 and
|
||||
then 3 again to be output. By default, there is no separator (but see the next
|
||||
but one option, <b>--om-separator</b>).
|
||||
</P>
|
||||
<P>
|
||||
<b>--om-capture</b>=<i>number</i>
|
||||
Set the number of capturing parentheses that can be accessed by <b>-o</b>. The
|
||||
default is 50.
|
||||
</P>
|
||||
<P>
|
||||
<b>--om-separator</b>=<i>text</i>
|
||||
Specify a separating string for multiple occurrences of <b>-o</b>. The default
|
||||
is an empty string. Separating strings are never coloured.
|
||||
</P>
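<P>
How repeated <b>-o</b> options and <b>--om-separator</b> combine can be sketched in Python. The function name only_matching is hypothetical and Python's <b>re</b> module stands in for PCRE2. The text above does not say how unset or empty groups interact with the separator; this sketch simply skips them, which is an assumption.
</P>

```python
import re


def only_matching(match, numbers, separator=""):
    """Join the requested capture groups, in the order given, on one line."""
    parts = []
    for n in numbers:
        # Groups that do not exist or were not set contribute nothing.
        if n <= match.re.groups and match.group(n):
            parts.append(match.group(n))
    return separator.join(parts)


m = re.search(r"(\d+)-(\d+)-(\d+)", "id 12-34-56")
print(only_matching(m, [3, 1, 3]))                  # like -o3 -o1 -o3
print(only_matching(m, [3, 1, 3], separator=","))   # with --om-separator=,
```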
|
||||
<P>
|
||||
<b>-P</b>, <b>--no-ucp</b>
|
||||
Starting from release 10.43, when UTF/Unicode mode is specified with <b>-u</b>
|
||||
or <b>-U</b>, the PCRE2_UCP option is used by default. This means that the
|
||||
POSIX classes in patterns match more than just ASCII characters. For example,
|
||||
[:digit:] matches any Unicode decimal digit. The <b>--no-ucp</b> option
|
||||
suppresses PCRE2_UCP, thus restricting the POSIX classes to ASCII characters,
|
||||
as was the case in earlier releases. Note that there are now more fine-grained
|
||||
option settings within patterns that affect individual classes. For example,
|
||||
when in UCP mode, the sequence (?aP) restricts [:word:] to ASCII letters, while
|
||||
allowing \w to match Unicode letters and digits.
|
||||
</P>
|
||||
<P>
|
||||
<b>--posix-pattern-file</b>
|
||||
When patterns are provided with the <b>-f</b> option, do not trim trailing
spaces or ignore empty lines, in the same way as other grep tools. To keep
the behaviour consistent with older versions, if a pattern line is terminated
with CRLF (as literal characters), neither character is included in the
pattern. If you really need a pattern that ends in '\r', use an escape
sequence or provide it by a different method.
|
||||
</P>
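<P>
The difference this option makes to the way a <b>-f</b> pattern file is read can be sketched in Python. The function name read_patterns is hypothetical, LF is assumed to be the operating system's newline, and the sketch models only the trimming rules described for <b>-f</b> and <b>--posix-pattern-file</b>.
</P>

```python
def read_patterns(text, posix_pattern_file=False):
    """Return the list of patterns that a pattern file would supply."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # the final newline ends the last line; it adds no pattern
    patterns = []
    for line in lines:
        if posix_pattern_file:
            # Trailing spaces and empty lines are kept; only the CR of a
            # CRLF line ending is dropped.
            patterns.append(line[:-1] if line.endswith("\r") else line)
        else:
            # Default: trailing white space is removed, blank lines ignored.
            line = line.rstrip()
            if line:
                patterns.append(line)
    return patterns
```

<P>
Given the file contents "foo  \n\nbar\n", the default mode in this sketch yields two patterns, "foo" and "bar", whereas the POSIX mode yields three: "foo  " with its trailing spaces, an empty pattern, and "bar".
</P>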
|
||||
<P>
|
||||
<b>-q</b>, <b>--quiet</b>
|
||||
Work quietly, that is, display nothing except error messages. The exit
|
||||
status indicates whether or not any matches were found.
|
||||
</P>
|
||||
<P>
|
||||
<b>-r</b>, <b>--recursive</b>
|
||||
If any given path is a directory, recursively scan the files it contains,
|
||||
taking note of any <b>--include</b> and <b>--exclude</b> settings. By default, a
|
||||
directory is read as a normal file; in some operating systems this gives an
|
||||
immediate end-of-file. This option is a shorthand for setting the <b>-d</b>
|
||||
option to "recurse".
|
||||
</P>
|
||||
<P>
|
||||
<b>--recursion-limit</b>=<i>number</i>
|
||||
This is an obsolete synonym for <b>--depth-limit</b>. See <b>--match-limit</b>
|
||||
above for details.
|
||||
</P>
|
||||
<P>
|
||||
<b>-s</b>, <b>--no-messages</b>
|
||||
Suppress error messages about non-existent or unreadable files. Such files are
|
||||
quietly skipped. However, the return code is still 2, even if matches were
|
||||
found in other files.
|
||||
</P>
|
||||
<P>
|
||||
<b>-t</b>, <b>--total-count</b>
|
||||
This option is useful when scanning more than one file. If used on its own,
|
||||
<b>-t</b> suppresses all output except for a grand total number of matching
|
||||
lines (or non-matching lines if <b>-v</b> is used) in all the files. If <b>-t</b>
|
||||
is used with <b>-c</b>, a grand total is output except when the previous output
|
||||
is just one line. In other words, it is not output when just one file's count
|
||||
is listed. If file names are being output, the grand total is preceded by
|
||||
"TOTAL:". Otherwise, it appears as just another number. The <b>-t</b> option is
|
||||
ignored when used with <b>-L</b> (list files without matches), because the grand
|
||||
total would always be zero.
|
||||
</P>
|
||||
<P>
|
||||
<b>-u</b>, <b>--utf</b>
|
||||
Operate in UTF/Unicode mode. This option is available only if PCRE2 has been
|
||||
compiled with UTF-8 support. All patterns (including those for any
|
||||
<b>--exclude</b> and <b>--include</b> options) and all lines that are scanned
|
||||
must be valid strings of UTF-8 characters. If an invalid UTF-8 string is
|
||||
encountered, an error occurs.
|
||||
</P>
|
||||
<P>
|
||||
<b>-U</b>, <b>--utf-allow-invalid</b>
|
||||
As <b>--utf</b>, but in addition subject lines may contain invalid UTF-8 code
|
||||
unit sequences. These can never form part of any pattern match. Patterns
|
||||
themselves, however, must still be valid UTF-8 strings. This facility allows
|
||||
valid UTF-8 strings to be sought within arbitrary byte sequences in executable
|
||||
or other binary files. For more details about matching in non-valid UTF-8
|
||||
strings, see the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b>(3)</a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
<b>-V</b>, <b>--version</b>
|
||||
Write the version numbers of <b>pcre2grep</b> and the PCRE2 library to the
|
||||
standard output and then exit. Anything else on the command line is
|
||||
ignored.
|
||||
</P>
|
||||
<P>
|
||||
<b>-v</b>, <b>--invert-match</b>
|
||||
Invert the sense of the match, so that lines which do <i>not</i> match any of
|
||||
the patterns are the ones that are found. When this option is set, options such
|
||||
as <b>--only-matching</b> and <b>--output</b>, which specify parts of a match
|
||||
that are to be output, are ignored.
|
||||
</P>
|
||||
<P>
|
||||
<b>-w</b>, <b>--word-regex</b>, <b>--word-regexp</b>
|
||||
Force the patterns only to match "words". That is, there must be a word
|
||||
boundary at the start and end of each matched string. This is equivalent to
|
||||
having "\b(?:" at the start of each pattern, and ")\b" at the end. This
|
||||
option applies only to the patterns that are matched against the contents of
|
||||
files; it does not apply to patterns specified by any of the <b>--include</b> or
|
||||
<b>--exclude</b> options.
|
||||
</P>
|
||||
<P>
|
||||
<b>-x</b>, <b>--line-regex</b>, <b>--line-regexp</b>
|
||||
Force the patterns to start matching only at the beginnings of lines, and in
|
||||
addition, require them to match entire lines. In multiline mode the match may
|
||||
be more than one line. This is equivalent to having "^(?:" at the start of each
|
||||
pattern and ")$" at the end. This option applies only to the patterns that are
|
||||
matched against the contents of files; it does not apply to patterns specified
|
||||
by any of the <b>--include</b> or <b>--exclude</b> options.
|
||||
</P>
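<P>
The wrapping that <b>-w</b> and <b>-x</b> are described as being equivalent to can be tried out in Python, whose <b>re</b> module is used here as a stand-in for PCRE2. The helper names are hypothetical. The sketch also shows why the non-capturing group matters when a pattern contains alternatives.
</P>

```python
import re


def wrap_word(pattern):
    """Equivalent form for -w: a word boundary at each end of the match."""
    return r"\b(?:" + pattern + r")\b"


def wrap_line(pattern):
    """Equivalent form for -x: the pattern must match the entire line."""
    return "^(?:" + pattern + ")$"


# Without the group, "^ab|abc$" means "^ab" OR "abc$", so it matches "abx".
print(bool(re.search("^ab|abc$", "abx")))
# With the group, both anchors apply to every alternative.
print(bool(re.search(wrap_line("ab|abc"), "abx")))
```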
|
||||
<P>
|
||||
<b>-Z</b>, <b>--null</b>
|
||||
Terminate file names in the regular output with a zero byte (the NUL
|
||||
character) instead of what would normally appear. This is useful when file
|
||||
names contain unusual characters such as colons, hyphens, or even newlines. The
|
||||
option does not apply to file names in error messages.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">ENVIRONMENT VARIABLES</a><br>
|
||||
<P>
|
||||
The environment variables <b>LC_ALL</b> and <b>LC_CTYPE</b> are examined, in that
|
||||
order, for a locale. The first one that is set is used. This can be overridden
|
||||
by the <b>--locale</b> option. If no locale is set, the PCRE2 library's default
|
||||
(usually the "C" locale) is used.
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">NEWLINES</a><br>
|
||||
<P>
|
||||
The <b>-N</b> (<b>--newline</b>) option allows <b>pcre2grep</b> to scan files with
|
||||
newline conventions that differ from the default. This option affects only the
|
||||
way scanned files are processed. It does not affect the interpretation of files
|
||||
specified by the <b>-f</b>, <b>--file-list</b>, <b>--exclude-from</b>, or
|
||||
<b>--include-from</b> options.
|
||||
</P>
|
||||
<P>
|
||||
Any parts of the scanned input files that are written to the standard output
|
||||
are copied with whatever newline sequences they have in the input. However, if
|
||||
the final line of a file is output, and it does not end with a newline
|
||||
sequence, a newline sequence is added. If the newline setting is CR, LF, CRLF
|
||||
or NUL, that line ending is output; for the other settings (ANYCRLF or ANY) a
|
||||
single NL is used.
|
||||
</P>
|
||||
<P>
|
||||
The newline setting does not affect the way in which <b>pcre2grep</b> writes
|
||||
newlines in informational messages to the standard output and error streams.
|
||||
Under Windows, the standard output is set to be binary, so that "\r\n" at the
|
||||
ends of output lines that are copied from the input is not converted to
|
||||
"\r\r\n" by the C I/O library. This means that any messages written to the
|
||||
standard output must end with "\r\n". For all other operating systems, and
|
||||
for all messages to the standard error stream, "\n" is used.
|
||||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">OPTIONS COMPATIBILITY WITH GNU GREP</a><br>
|
||||
<P>
|
||||
Many of the short and long forms of <b>pcre2grep</b>'s options are the same as
|
||||
in the GNU <b>grep</b> program. Any long option of the form <b>--xxx-regexp</b>
|
||||
(GNU terminology) is also available as <b>--xxx-regex</b> (PCRE2 terminology).
|
||||
However, the <b>--case-restrict</b>, <b>--depth-limit</b>, <b>-E</b>,
|
||||
<b>--file-list</b>, <b>--file-offsets</b>, <b>--heap-limit</b>,
|
||||
<b>--include-dir</b>, <b>--line-offsets</b>, <b>--locale</b>, <b>--match-limit</b>,
|
||||
<b>-M</b>, <b>--multiline</b>, <b>-N</b>, <b>--newline</b>, <b>--no-ucp</b>,
|
||||
<b>--om-separator</b>, <b>--output</b>, <b>-P</b>, <b>-u</b>, <b>--utf</b>,
|
||||
<b>-U</b>, and <b>--utf-allow-invalid</b> options are specific to
|
||||
<b>pcre2grep</b>, as is the use of the <b>--only-matching</b> option with a
|
||||
capturing parentheses number.
|
||||
</P>
|
||||
<P>
|
||||
Although most of the common options work the same way, a few are different in
|
||||
<b>pcre2grep</b>. For example, the <b>--include</b> option's argument is a glob
|
||||
for GNU <b>grep</b>, but in <b>pcre2grep</b> it is a regular expression to which
|
||||
the <b>-i</b> option applies. If both the <b>-c</b> and <b>-l</b> options are
|
||||
given, GNU grep lists only file names, without counts, but <b>pcre2grep</b>
|
||||
gives the counts as well.
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">OPTIONS WITH DATA</a><br>
|
||||
<P>
|
||||
There are four different ways in which an option with data can be specified.
|
||||
If a short form option is used, the data may follow immediately, or (with one
|
||||
exception) in the next command line item. For example:
|
||||
<pre>
|
||||
-f/some/file
|
||||
-f /some/file
|
||||
</pre>
|
||||
The exception is the <b>-o</b> option, which may appear with or without data.
|
||||
Because of this, if data is present, it must follow immediately in the same
|
||||
item, for example -o3.
|
||||
</P>
|
||||
<P>
|
||||
If a long form option is used, the data may appear in the same command line
|
||||
item, separated by an equals character, or (with two exceptions) it may appear
|
||||
in the next command line item. For example:
|
||||
<pre>
|
||||
--file=/some/file
|
||||
--file /some/file
|
||||
</pre>
|
||||
Note, however, that if you want to supply a file name beginning with ~ as data
|
||||
in a shell command, and have the shell expand ~ to a home directory, you must
|
||||
separate the file name from the option, because the shell does not treat ~
|
||||
specially unless it is at the start of an item.
|
||||
</P>
|
||||
<P>
|
||||
The exceptions to the above are the <b>--colour</b> (or <b>--color</b>) and
|
||||
<b>--only-matching</b> options, for which the data is optional. If one of these
|
||||
options does have data, it must be given in the first form, using an equals
|
||||
character. Otherwise <b>pcre2grep</b> will assume that it has no data.
|
||||
</P>
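<P>
The "same item" form of a long option can be illustrated with a small C sketch.
This is not <b>pcre2grep</b>'s own parser; <b>split_long_option()</b> is a
hypothetical helper that only shows how an item such as --file=/some/file
divides into a name and its data. For options whose data is optional, a NULL
data pointer would mean "no data", and the next command line item would not be
consumed.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of pcre2grep. The caller guarantees that
   item starts with "--". Returns the length of the option name (without the
   leading "--") and sets *data to the text after the first '=', or to NULL
   when the item carries no data. */
static size_t split_long_option(const char *item, const char **data)
{
  const char *name = item + 2;
  const char *eq = strchr(name, '=');

  *data = (eq == NULL) ? NULL : eq + 1;
  return (eq == NULL) ? strlen(name) : (size_t)(eq - name);
}
```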
|
||||
<br><a name="SEC11" href="#TOC1">USING PCRE2'S CALLOUT FACILITY</a><br>
|
||||
<P>
|
||||
<b>pcre2grep</b> has, by default, support for calling external programs or
|
||||
scripts or echoing specific strings during matching by making use of PCRE2's
|
||||
callout facility. However, this support can be completely or partially disabled
|
||||
when <b>pcre2grep</b> is built. You can find out whether your binary has support
|
||||
for callouts by running it with the <b>--help</b> option. If callout support is
|
||||
completely disabled, callouts in patterns are forbidden by <b>pcre2grep</b>.
|
||||
If the facility is partially disabled, calling external programs is not
|
||||
supported, and callouts that request it are ignored.
|
||||
</P>
|
||||
<P>
|
||||
A callout in a PCRE2 pattern is of the form (?C<arg>) where the argument is
|
||||
either a number or a quoted string (see the
|
||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||
documentation for details). Numbered callouts are ignored by <b>pcre2grep</b>;
|
||||
only callouts with string arguments are useful.
|
||||
</P>
|
||||
<br><b>
|
||||
Echoing a specific string
|
||||
</b><br>
|
||||
<P>
|
||||
Starting the callout string with a pipe character invokes an echoing facility
|
||||
that avoids calling an external program or script. This facility is always
|
||||
available, provided that callouts were not completely disabled when
|
||||
<b>pcre2grep</b> was built. The rest of the callout string is processed as a
|
||||
zero-terminated string, which means it should not contain any internal binary
|
||||
zeros. It is written to the output, having first been passed through the same
|
||||
escape processing as text from the <b>--output</b> (<b>-O</b>) option (see
|
||||
above). However, $0 or $& cannot be used to insert a matched substring because
|
||||
the match is still in progress. Instead, the single character '0' is inserted.
|
||||
Any syntax errors in the string (for example, a dollar not followed by another
|
||||
character) cause the callout to be ignored. No terminator is added to the
|
||||
output string, so if you want a newline, you must include it explicitly using
|
||||
the escape $n. For example:
|
||||
<pre>
|
||||
pcre2grep '(.)(..(.))(?C"|[$1] [$2] [$3]$n")' <some file>
|
||||
</pre>
|
||||
Matching continues normally after the string is output. If you want to see only
|
||||
the callout output but not any output from an actual match, you should end the
|
||||
pattern with (*FAIL).
|
||||
</P>
|
||||
<br><b>
|
||||
Calling external programs or scripts
|
||||
</b><br>
|
||||
<P>
|
||||
This facility can be independently disabled when <b>pcre2grep</b> is built. It
|
||||
is supported for Windows, where a call to <b>_spawnvp()</b> is used, for VMS,
|
||||
where <b>lib$spawn()</b> is used, and for any Unix-like environment where
|
||||
<b>fork()</b> and <b>execv()</b> are available.
|
||||
</P>
|
||||
<P>
|
||||
If the callout string does not start with a pipe (vertical bar) character, it
|
||||
is parsed into a list of substrings separated by pipe characters. The first
|
||||
substring must be an executable name, with the following substrings specifying
|
||||
arguments:
|
||||
<pre>
|
||||
executable_name|arg1|arg2|...
|
||||
</pre>
|
||||
Any substring (including the executable name) may contain escape sequences
|
||||
started by a dollar character. These are the same as for the <b>--output</b>
|
||||
(<b>-O</b>) option documented above, except that $0 or $& cannot insert the
|
||||
matched string because the match is still in progress. Instead, the character
|
||||
'0' is inserted. If you need a literal dollar or pipe character in any
|
||||
substring, use $$ or $| respectively. Here is an example:
|
||||
<pre>
|
||||
echo -e "abcde\n12345" | pcre2grep \
|
||||
'(?x)(.)(..(.))
|
||||
(?C"/bin/echo|Arg1: [$1] [$2] [$3]|Arg2: $|${1}$| ($4)")()' -
|
||||
|
||||
Output:
|
||||
|
||||
Arg1: [a] [bcd] [d] Arg2: |a| ()
|
||||
abcde
|
||||
Arg1: [1] [234] [4] Arg2: |1| ()
|
||||
12345
|
||||
</pre>
|
||||
The parameters for the system call that is used to run the program or script
|
||||
are zero-terminated strings. This means that binary zero characters in the
|
||||
callout argument will cause premature termination of their substrings, and
|
||||
therefore should not be present. Any syntax errors in the string (for example,
|
||||
a dollar not followed by another character) cause the callout to be ignored.
|
||||
If running the program fails for any reason (including the non-existence of the
|
||||
executable), a local matching failure occurs and the matcher backtracks in the
|
||||
normal way.
|
||||
</P>
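<P>
The splitting rule described above can be sketched in C. This is a simplified
illustration, not the code used by <b>pcre2grep</b>: the hypothetical function
<b>split_callout()</b> only separates substrings at unescaped pipe characters
and turns $$ and $| into literal characters. Other dollar escapes, such as $1,
are copied through unchanged, and a string that starts with a pipe (the echoing
facility) is assumed to have been handled elsewhere.
</P>

```c
#include <stddef.h>

/* Hypothetical, simplified splitter (not pcre2grep's implementation).
   Copies s into out as consecutive zero-terminated substrings and stores a
   pointer to each one in argv. out must be at least as large as s including
   its terminator. Returns the number of substrings, or -1 for a trailing
   lone dollar or when more than max_args substrings are present. */
static int split_callout(const char *s, char *out, char **argv, int max_args)
{
  int argc = 0;

  if (max_args < 1) return -1;
  argv[argc++] = out;

  while (*s != '\0')
    {
    if (*s == '$')
      {
      if (s[1] == '\0') return -1;      /* dollar not followed by a character */
      if (s[1] == '$' || s[1] == '|')   /* $$ and $| give a literal character */
        {
        *out++ = s[1];
        s += 2;
        continue;
        }
      *out++ = *s++;                    /* other escapes are left untouched */
      *out++ = *s++;
      continue;
      }

    if (*s == '|')                      /* unescaped pipe ends a substring */
      {
      if (argc >= max_args) return -1;
      *out++ = '\0';
      argv[argc++] = out;
      s++;
      continue;
      }

    *out++ = *s++;
    }

  *out = '\0';
  return argc;
}
```

<P>
With the callout string from the example above, shortened to three substrings,
the first substring is the executable name and the remaining two are its
arguments, still containing their $1-style escapes.
</P>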
|
||||
<br><a name="SEC12" href="#TOC1">MATCHING ERRORS</a><br>
|
||||
<P>
|
||||
It is possible to supply a regular expression that takes a very long time to
|
||||
fail to match certain lines. Such patterns normally involve nested indefinite
|
||||
repeats, for example: (a+)*\d when matched against a line of a's with no final
|
||||
digit. The PCRE2 matching function has a resource limit that causes it to abort
|
||||
in these circumstances. If this happens, <b>pcre2grep</b> outputs an error
|
||||
message and the line that caused the problem to the standard error stream. If
|
||||
there are more than 20 such errors, <b>pcre2grep</b> gives up.
|
||||
</P>
|
||||
<P>
|
||||
The <b>--match-limit</b> option of <b>pcre2grep</b> can be used to set the
|
||||
overall resource limit. There are also other limits that affect the amount of
|
||||
memory used during matching; see the discussion of <b>--heap-limit</b> and
|
||||
<b>--depth-limit</b> above.
|
||||
</P>
|
||||
<br><a name="SEC13" href="#TOC1">DIAGNOSTICS</a><br>
|
||||
<P>
|
||||
Exit status is 0 if any matches were found, 1 if no matches were found, and 2
|
||||
for syntax errors, overlong lines, non-existent or inaccessible files (even if
|
||||
matches were found in other files) or too many matching errors. Using the
|
||||
<b>-s</b> option to suppress error messages about inaccessible files does not
|
||||
affect the return code.
|
||||
</P>
|
||||
<P>
|
||||
When run under VMS, the return code is placed in the symbol PCRE2GREP_RC
|
||||
because VMS does not distinguish between exit(0) and exit(1).
|
||||
</P>
|
||||
<br><a name="SEC14" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2pattern</b>(3), <b>pcre2syntax</b>(3), <b>pcre2callout</b>(3),
|
||||
<b>pcre2unicode</b>(3).
|
||||
</P>
|
||||
<br><a name="SEC15" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC16" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 04 February 2025
|
||||
<br>
|
||||
Copyright © 1997-2023 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
505
3rd/pcre2/doc/html/pcre2jit.html
Normal file
@@ -0,0 +1,505 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2jit specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2jit man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a>
|
||||
<li><a name="TOC2" href="#SEC2">AVAILABILITY OF JIT SUPPORT</a>
|
||||
<li><a name="TOC3" href="#SEC3">SIMPLE USE OF JIT</a>
|
||||
<li><a name="TOC4" href="#SEC4">MATCHING SUBJECTS CONTAINING INVALID UTF</a>
|
||||
<li><a name="TOC5" href="#SEC5">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a>
|
||||
<li><a name="TOC6" href="#SEC6">RETURN VALUES FROM JIT MATCHING</a>
|
||||
<li><a name="TOC7" href="#SEC7">CONTROLLING THE JIT STACK</a>
|
||||
<li><a name="TOC8" href="#SEC8">JIT STACK FAQ</a>
|
||||
<li><a name="TOC9" href="#SEC9">FREEING JIT SPECULATIVE MEMORY</a>
|
||||
<li><a name="TOC10" href="#SEC10">EXAMPLE CODE</a>
|
||||
<li><a name="TOC11" href="#SEC11">JIT FAST PATH API</a>
|
||||
<li><a name="TOC12" href="#SEC12">SEE ALSO</a>
|
||||
<li><a name="TOC13" href="#SEC13">AUTHOR</a>
|
||||
<li><a name="TOC14" href="#SEC14">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a><br>
|
||||
<P>
|
||||
Just-in-time compiling is a heavyweight optimization that can greatly speed up
|
||||
pattern matching. However, it comes at the cost of extra processing before the
|
||||
match is performed, so it is of most benefit when the same pattern is going to
|
||||
be matched many times. This does not necessarily mean many calls of a matching
|
||||
function; if the pattern is not anchored, matching attempts may take place many
|
||||
times at various positions in the subject, even for a single call. Therefore,
|
||||
if the subject string is very long, it may still pay to use JIT even for
|
||||
one-off matches. JIT support is available for all of the 8-bit, 16-bit and
|
||||
32-bit PCRE2 libraries.
|
||||
</P>
|
||||
<P>
|
||||
JIT support applies only to the traditional Perl-compatible matching function.
|
||||
It does not apply when the DFA matching function is being used. The code for
|
||||
JIT support was written by Zoltan Herczeg.
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">AVAILABILITY OF JIT SUPPORT</a><br>
|
||||
<P>
|
||||
JIT support is an optional feature of PCRE2. The "configure" option
|
||||
--enable-jit (or equivalent CMake option) must be set when PCRE2 is built if
|
||||
you want to use JIT. The support is limited to the following hardware
|
||||
platforms:
|
||||
<pre>
|
||||
ARM 32-bit (v7, and Thumb2)
|
||||
ARM 64-bit
|
||||
IBM s390x 64-bit
|
||||
Intel x86 32-bit and 64-bit
|
||||
LoongArch 64-bit
|
||||
MIPS 32-bit and 64-bit
|
||||
Power PC 32-bit and 64-bit
|
||||
RISC-V 32-bit and 64-bit
|
||||
</pre>
|
||||
If --enable-jit is set on an unsupported platform, compilation fails.
|
||||
</P>
|
||||
<P>
|
||||
A client program can tell if JIT support has been compiled by calling
|
||||
<b>pcre2_config()</b> with the PCRE2_CONFIG_JIT option. The result is one if
|
||||
PCRE2 was built with JIT support, and zero otherwise. However, having the JIT
|
||||
code available does not guarantee that it will be used for any particular
|
||||
match. One reason for this is that there are a number of options and pattern
|
||||
items that are
|
||||
<a href="#unsupported">not supported by JIT</a>
|
||||
(see below). Another reason is that in some environments JIT is unable to get
|
||||
executable memory in which to build its compiled code. The only guarantee from
|
||||
<b>pcre2_config()</b> is that if it returns zero, JIT will definitely <i>not</i>
|
||||
be used.
|
||||
</P>
|
||||
<P>
|
||||
As of release 10.45 there is a more informative way to test for JIT support. If
|
||||
<b>pcre2_jit_compile()</b> is called with the single option PCRE2_JIT_TEST_ALLOC
|
||||
it returns zero if JIT is available and has a working allocator. Otherwise it
|
||||
returns PCRE2_ERROR_NOMEMORY if JIT is available but cannot allocate executable
|
||||
memory, or PCRE2_ERROR_JIT_UNSUPPORTED if JIT support is not compiled. The
|
||||
code argument is ignored, so it can be a NULL value.
|
||||
</P>
|
||||
<P>
|
||||
A simple program does not need to check availability in order to use JIT when
|
||||
possible. The API is implemented in a way that falls back to the interpretive
|
||||
code if JIT is not available or cannot be used for a given match. For programs
|
||||
that need the best possible performance, there is a
|
||||
<a href="#fastpath">"fast path"</a>
|
||||
API that is JIT-specific.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">SIMPLE USE OF JIT</a><br>
|
||||
<P>
|
||||
To make use of the JIT support in the simplest way, all you have to do is to
|
||||
call <b>pcre2_jit_compile()</b> after successfully compiling a pattern with
|
||||
<b>pcre2_compile()</b>. This function has two arguments: the first is the
|
||||
compiled pattern pointer that was returned by <b>pcre2_compile()</b>, and the
|
||||
second is zero or more of the following option bits: PCRE2_JIT_COMPLETE,
|
||||
PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
|
||||
</P>
|
||||
<P>
|
||||
If JIT support is not available, a call to <b>pcre2_jit_compile()</b> does
|
||||
nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled pattern
|
||||
is passed to the JIT compiler, which turns it into machine code that executes
|
||||
much faster than the normal interpretive code, but yields exactly the same
|
||||
results. The returned value from <b>pcre2_jit_compile()</b> is zero on success,
|
||||
or a negative error code.
|
||||
</P>
|
||||
<P>
|
||||
There is a limit to the size of pattern that JIT supports, imposed by the size
|
||||
of machine stack that it uses. The exact rules are not documented because they
|
||||
may change at any time, in particular, when new optimizations are introduced.
|
||||
If a pattern is too big, a call to <b>pcre2_jit_compile()</b> returns
|
||||
PCRE2_ERROR_NOMEMORY.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for complete
|
||||
matches. If you want to run partial matches using the PCRE2_PARTIAL_HARD or
|
||||
PCRE2_PARTIAL_SOFT options of <b>pcre2_match()</b>, you should set one or both
|
||||
of the other options as well as, or instead of PCRE2_JIT_COMPLETE. The JIT
|
||||
compiler generates different optimized code for each of the three modes
|
||||
(normal, soft partial, hard partial). When <b>pcre2_match()</b> is called, the
|
||||
appropriate code is run if it is available. Otherwise, the pattern is matched
|
||||
using interpretive code.
|
||||
</P>
|
||||
<P>
|
||||
You can call <b>pcre2_jit_compile()</b> multiple times for the same compiled
|
||||
pattern. It does nothing if it has previously compiled code for any of the
|
||||
option bits. For example, you can call it once with PCRE2_JIT_COMPLETE and
|
||||
(perhaps later, when you find you need partial matching) again with
|
||||
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
|
||||
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
|
||||
<b>pcre2_jit_compile()</b> is called with no option bits set, it immediately
|
||||
returns zero. This is an alternative way of testing whether JIT support has
|
||||
been compiled.
|
||||
</P>
|
||||
<P>
|
||||
At present, it is not possible to free JIT compiled code except when the entire
|
||||
compiled pattern is freed by calling <b>pcre2_code_free()</b>.
|
||||
</P>
|
||||
<P>
|
||||
In some circumstances you may need to call additional functions. These are
|
||||
described in the section entitled
|
||||
<a href="#stackcontrol">"Controlling the JIT stack"</a>
|
||||
below.
|
||||
</P>
|
||||
<P>
|
||||
There are some <b>pcre2_match()</b> options that are not supported by JIT, and
|
||||
there are also some pattern items that JIT cannot handle. Details are given
|
||||
<a href="#unsupported">below.</a>
|
||||
In both cases, matching automatically falls back to the interpretive code. If
|
||||
you want to know whether JIT was actually used for a particular match, you
|
||||
should arrange for a JIT callback function to be set up as described in the
|
||||
section entitled
|
||||
<a href="#stackcontrol">"Controlling the JIT stack"</a>
|
||||
below, even if you do not need to supply a non-default JIT stack. Such a
|
||||
callback function is called whenever JIT code is about to be obeyed. If the
|
||||
match-time options are not right for JIT execution, the callback function is
|
||||
not obeyed.
|
||||
</P>
|
||||
<P>
|
||||
If the JIT compiler finds an unsupported item, no JIT data is generated. You
|
||||
can find out if JIT compilation was successful for a compiled pattern by
|
||||
calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_JITSIZE option. A
|
||||
non-zero result means that JIT compilation was successful. A result of 0 means
|
||||
that JIT support is not available, or the pattern was not processed by
|
||||
<b>pcre2_jit_compile()</b>, or the JIT compiler was not able to handle the
|
||||
pattern. Successful JIT compilation does not, however, guarantee the use of JIT
|
||||
at match time because there are some match time options that are not supported
|
||||
by JIT.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">MATCHING SUBJECTS CONTAINING INVALID UTF</a><br>
|
||||
<P>
|
||||
When a pattern is compiled with the PCRE2_UTF option, subject strings are
|
||||
normally expected to be a valid sequence of UTF code units. By default, this is
|
||||
checked at the start of matching and an error is generated if invalid UTF is
|
||||
detected. The PCRE2_NO_UTF_CHECK option can be passed to <b>pcre2_match()</b> to
|
||||
skip the check (for improved performance) if you are sure that a subject string
|
||||
is valid. If this option is used with an invalid string, the result is
|
||||
undefined. The calling program may crash or loop or otherwise misbehave.
|
||||
</P>
|
||||
<P>
|
||||
However, a way of running matches on strings that may contain invalid UTF
|
||||
sequences is available. Calling <b>pcre2_compile()</b> with the
|
||||
PCRE2_MATCH_INVALID_UTF option has two effects: it tells the interpreter in
|
||||
<b>pcre2_match()</b> to support invalid UTF, and, if <b>pcre2_jit_compile()</b>
|
||||
is subsequently called, the compiled JIT code also supports invalid UTF.
|
||||
Details of how this support works, in both the JIT and the interpretive cases,
|
||||
are given in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
There is also an obsolete option for <b>pcre2_jit_compile()</b> called
|
||||
PCRE2_JIT_INVALID_UTF, which currently exists only for backward compatibility.
|
||||
It is superseded by the <b>pcre2_compile()</b> option PCRE2_MATCH_INVALID_UTF
|
||||
and should no longer be used. It may be removed in future.
|
||||
<a name="unsupported"></a></P>
|
||||
<br><a name="SEC5" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
|
||||
<P>
|
||||
The <b>pcre2_match()</b> options that are supported for JIT matching are
|
||||
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
|
||||
PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and
|
||||
PCRE2_PARTIAL_SOFT. The PCRE2_ANCHORED and PCRE2_ENDANCHORED options are not
|
||||
supported at match time.
|
||||
</P>
|
||||
<P>
|
||||
If the PCRE2_NO_JIT option is passed to <b>pcre2_match()</b> it disables the
|
||||
use of JIT, forcing matching by the interpreter code.
|
||||
</P>
|
||||
<P>
|
||||
The only unsupported pattern items are \C (match a single data unit) when
|
||||
running in a UTF mode, and a callout immediately before an assertion condition
|
||||
in a conditional group.
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">RETURN VALUES FROM JIT MATCHING</a><br>
|
||||
<P>
|
||||
When a pattern is matched using JIT, the return values are the same as those
|
||||
given by the interpretive <b>pcre2_match()</b> code, with the addition of one
|
||||
new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means that the memory used for
|
||||
the JIT stack was insufficient. See
|
||||
<a href="#stackcontrol">"Controlling the JIT stack"</a>
|
||||
below for a discussion of JIT stack usage.
|
||||
</P>
|
||||
<P>
|
||||
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if searching
|
||||
a very large pattern tree goes on for too long, as it is in the same
|
||||
circumstance when JIT is not used, but the details of exactly what is counted
|
||||
are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
|
||||
when JIT matching is used.
|
||||
<a name="stackcontrol"></a></P>
|
||||
<br><a name="SEC7" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
|
||||
<P>
|
||||
When the compiled JIT code runs, it needs a block of memory to use as a stack.
|
||||
By default, it uses 32KiB on the machine stack. However, some large or
|
||||
complicated patterns need more than this. The error PCRE2_ERROR_JIT_STACKLIMIT
|
||||
is given when there is not enough stack. Three functions are provided for
|
||||
managing blocks of memory for use as JIT stacks. There is further discussion
|
||||
about the use of JIT stacks in the section entitled
|
||||
<a href="#stackfaq">"JIT stack FAQ"</a>
|
||||
below.
|
||||
</P>
|
||||
<P>
|
||||
The <b>pcre2_jit_stack_create()</b> function creates a JIT stack. Its arguments
|
||||
are a starting size, a maximum size, and a general context (for memory
|
||||
allocation functions, or NULL for standard memory allocation). It returns a
|
||||
pointer to an opaque structure of type <b>pcre2_jit_stack</b>, or NULL if there
|
||||
is an error. The <b>pcre2_jit_stack_free()</b> function is used to free a stack
|
||||
that is no longer needed. If its argument is NULL, this function returns
|
||||
immediately, without doing anything. (For the technically minded: the address
|
||||
space is allocated by mmap or VirtualAlloc.) A maximum stack size of 512KiB to
|
||||
1MiB should be more than enough for any pattern.
|
||||
</P>
|
||||
<P>
|
||||
The <b>pcre2_jit_stack_assign()</b> function specifies which stack JIT code
|
||||
should use. Its arguments are as follows:
|
||||
<pre>
|
||||
pcre2_match_context *mcontext
|
||||
pcre2_jit_callback callback
|
||||
void *data
|
||||
</pre>
|
||||
The first argument is a pointer to a match context. When this is subsequently
|
||||
passed to a matching function, its information determines which JIT stack is
|
||||
used. If this argument is NULL, the function returns immediately, without doing
|
||||
anything. There are three cases for the values of the other two arguments:
|
||||
<pre>
|
||||
(1) If <i>callback</i> is NULL and <i>data</i> is NULL, an internal 32KiB block
|
||||
on the machine stack is used. This is the default when a match
|
||||
context is created.
|
||||
|
||||
(2) If <i>callback</i> is NULL and <i>data</i> is not NULL, <i>data</i> must be
|
||||
a pointer to a valid JIT stack, the result of calling
|
||||
<b>pcre2_jit_stack_create()</b>.
|
||||
|
||||
(3) If <i>callback</i> is not NULL, it must point to a function that is
|
||||
called with <i>data</i> as an argument at the start of matching, in
|
||||
order to set up a JIT stack. If the return from the callback
|
||||
function is NULL, the internal 32KiB stack is used; otherwise the
|
||||
return value must be a valid JIT stack, the result of calling
|
||||
<b>pcre2_jit_stack_create()</b>.
|
||||
</pre>
|
||||
A callback function is obeyed whenever JIT code is about to be run; it is not
|
||||
obeyed when <b>pcre2_match()</b> is called with options that are incompatible
|
||||
for JIT matching. A callback function can therefore be used to determine
|
||||
whether a match operation was executed by JIT or by the interpreter.
|
||||
</P>
|
||||
<P>
|
||||
You may safely use the same JIT stack for more than one pattern (either by
|
||||
assigning directly or by callback), as long as the patterns are matched
|
||||
sequentially in the same thread. Currently, the only way to set up
|
||||
non-sequential matches in one thread is to use callouts: if a callout function
|
||||
starts another match, that match must use a different JIT stack to the one used
|
||||
for currently suspended match(es).
|
||||
</P>
|
||||
<P>
|
||||
In a multithread application, if you do not specify a JIT stack, or if you
|
||||
assign or pass back NULL from a callback, that is thread-safe, because each
|
||||
thread has its own machine stack. However, if you assign or pass back a
|
||||
non-NULL JIT stack, this must be a different stack for each thread so that the
|
||||
application is thread-safe.
|
||||
</P>
|
||||
<P>
|
||||
Strictly speaking, even more is allowed. You can assign the same non-NULL stack
|
||||
to a match context that is used by any number of patterns, as long as they are
|
||||
not used for matching by multiple threads at the same time. For example, you
|
||||
could use the same stack in all compiled patterns, with a global mutex in the
|
||||
callback to wait until the stack is available for use. However, this is an
|
||||
inefficient solution, and not recommended.
|
||||
</P>
|
||||
<P>
|
||||
This is a suggestion for how a multithreaded program that needs to set up
|
||||
non-default JIT stacks might operate:
|
||||
<pre>
|
||||
During thread initialization
|
||||
thread_local_var = pcre2_jit_stack_create(...)
|
||||
|
||||
During thread exit
|
||||
pcre2_jit_stack_free(thread_local_var)
|
||||
|
||||
Use a one-line callback function
|
||||
return thread_local_var
|
||||
</pre>
|
||||
All the functions described in this section do nothing if JIT is not available.
|
||||
<a name="stackfaq"></a></P>
|
||||
<br><a name="SEC8" href="#TOC1">JIT STACK FAQ</a><br>
|
||||
<P>
|
||||
(1) Why do we need JIT stacks?
|
||||
<br>
|
||||
<br>
|
||||
PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack where
|
||||
the local data of the current node is pushed before checking its child nodes.
|
||||
Allocating real machine stack on some platforms is difficult. For example, the
|
||||
stack chain needs to be updated every time if we extend the stack on PowerPC.
|
||||
Although it is possible, its updating time overhead decreases performance. So
|
||||
we do the recursion in memory.
|
||||
</P>
|
||||
<P>
|
||||
(2) Why don't we simply allocate blocks of memory with <b>malloc()</b>?
|
||||
<br>
|
||||
<br>
|
||||
Modern operating systems have a nice feature: they can reserve an address space
|
||||
instead of allocating memory. We can safely allocate memory pages inside this
|
||||
address space, so the stack could grow without moving memory data (this is
|
||||
important because of pointers). Thus we can allocate 1MiB address space, and
|
||||
use only a single memory page (usually 4KiB) if that is enough. However, we can
|
||||
still grow up to 1MiB anytime if needed.
|
||||
</P>
|
||||
<P>
|
||||
(3) Who "owns" a JIT stack?
|
||||
<br>
|
||||
<br>
|
||||
The owner of the stack is the user program, not the JIT-compiled pattern or
|
||||
anything else. The user program must ensure that if a stack is being used by
|
||||
<b>pcre2_match()</b>, (that is, it is assigned to a match context that is passed
|
||||
to the pattern currently running), that stack must not be used by any other
|
||||
threads (to avoid overwriting the same memory area). The best practice for
|
||||
multithreaded programs is to allocate a stack for each thread, and return this
|
||||
stack through the JIT callback function.
|
||||
</P>
|
||||
<P>
|
||||
(4) When should a JIT stack be freed?
|
||||
<br>
|
||||
<br>
|
||||
You can free a JIT stack at any time, as long as it will not be used by
|
||||
<b>pcre2_match()</b> again. When you assign the stack to a match context, only a
|
||||
pointer is set. There is no reference counting or any other magic. You can free
|
||||
compiled patterns, contexts, and stacks in any order, anytime.
|
||||
Just <i>do not</i> call <b>pcre2_match()</b> with a match context pointing to an
|
||||
already freed stack, as that will cause a segmentation fault. (Also, do not free a stack
|
||||
currently used by <b>pcre2_match()</b> in another thread). You can also replace
|
||||
the stack in a context at any time when it is not in use. You should free the
|
||||
previous stack before assigning a replacement.
|
||||
</P>
|
||||
<P>
|
||||
(5) Should I allocate/free a stack every time before/after calling
|
||||
<b>pcre2_match()</b>?
|
||||
<br>
|
||||
<br>
|
||||
No, because this is too costly in terms of resources. However, you could
|
||||
implement some clever idea which releases the stack if it is not used in let's
|
||||
say two minutes. The JIT callback can help to achieve this without keeping a
|
||||
list of patterns.
|
||||
</P>
|
||||
<P>
|
||||
(6) OK, the stack is for long term memory allocation. But what happens if a
|
||||
pattern causes stack overflow with a stack of 1MiB? Is that 1MiB kept until the
|
||||
stack is freed?
|
||||
<br>
|
||||
<br>
|
||||
Especially on embedded systems, it might be a good idea to release memory
|
||||
sometimes without freeing the stack. There is no API for this at the moment.
|
||||
A function that reports the memory currently allocated for a stack, and another
|
||||
that releases memory (shrinks the stack), would probably be a good idea if
|
||||
someone needs this.
|
||||
</P>
|
||||
<P>
|
||||
(7) This is too much of a headache. Isn't there any better solution for JIT
|
||||
stack handling?
|
||||
<br>
|
||||
<br>
|
||||
No, thanks to Windows. If POSIX threads were used everywhere, we could throw
|
||||
out this complicated API.
|
||||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">FREEING JIT SPECULATIVE MEMORY</a><br>
|
||||
<P>
|
||||
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
|
||||
</P>
|
||||
<P>
|
||||
The JIT executable allocator does not always free memory as soon as it becomes unused. It
|
||||
expects new allocations, and keeps some free memory around to improve
|
||||
allocation speed. However, in low memory conditions, it might be better to free
|
||||
all possible memory. You can cause this to happen by calling
|
||||
<b>pcre2_jit_free_unused_memory()</b>. Its argument is a general context, for custom
|
||||
memory management, or NULL for standard memory management.
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">EXAMPLE CODE</a><br>
|
||||
<P>
|
||||
This is a single-threaded example that specifies a JIT stack without using a
|
||||
callback. A real program should include error checking after all the function
|
||||
calls.
|
||||
<pre>
|
||||
int rc;
|
||||
pcre2_code *re;
|
||||
pcre2_match_data *match_data;
|
||||
pcre2_match_context *mcontext;
|
||||
  pcre2_jit_stack *jit_stack;
|
||||
  int errornumber;
|
||||
  PCRE2_SIZE erroffset;
|
||||
|
||||
re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
|
||||
&errornumber, &erroffset, NULL);
|
||||
rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
|
||||
mcontext = pcre2_match_context_create(NULL);
|
||||
jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
|
||||
pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
|
||||
match_data = pcre2_match_data_create(re, 10);
|
||||
rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
|
||||
/* Process result */
|
||||
|
||||
pcre2_code_free(re);
|
||||
pcre2_match_data_free(match_data);
|
||||
pcre2_match_context_free(mcontext);
|
||||
pcre2_jit_stack_free(jit_stack);
|
||||
|
||||
<a name="fastpath"></a></PRE>
|
||||
</P>
|
||||
<br><a name="SEC11" href="#TOC1">JIT FAST PATH API</a><br>
|
||||
<P>
|
||||
Because the API described above falls back to interpreted matching when JIT is
|
||||
not available, it is convenient for programs that are written for general use
|
||||
in many environments. However, calling JIT via <b>pcre2_match()</b> does have a
|
||||
performance impact. Programs that are written for use where JIT is known to be
|
||||
available, and which need the best possible performance, can instead use a
|
||||
"fast path" API to call JIT matching directly instead of calling
|
||||
<b>pcre2_match()</b> (obviously only for patterns that have been successfully
|
||||
processed by <b>pcre2_jit_compile()</b>).
|
||||
</P>
|
||||
<P>
|
||||
The fast path function is called <b>pcre2_jit_match()</b>, and it takes exactly
|
||||
the same arguments as <b>pcre2_match()</b>. However, the subject string must be
|
||||
specified with a length; PCRE2_ZERO_TERMINATED is not supported. Unsupported
|
||||
option bits (for example, PCRE2_ANCHORED and PCRE2_ENDANCHORED) are ignored, as
|
||||
is the PCRE2_NO_JIT option. The return values are also the same as for
|
||||
<b>pcre2_match()</b>, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial
|
||||
or complete) is requested that was not compiled.
|
||||
</P>
|
||||
<P>
|
||||
When you call <b>pcre2_match()</b>, as well as testing for invalid options, a
|
||||
number of other sanity checks are performed on the arguments. For example, if
|
||||
the subject pointer is NULL but the length is non-zero, an immediate error is
|
||||
given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested
|
||||
for validity. In the interests of speed, these checks do not happen on the JIT
|
||||
fast path. If invalid UTF data is passed when PCRE2_MATCH_INVALID_UTF was not
|
||||
set for <b>pcre2_compile()</b>, the result is undefined. The program may crash
|
||||
or loop or give wrong results. In the absence of PCRE2_MATCH_INVALID_UTF you
|
||||
should call <b>pcre2_jit_match()</b> in UTF mode only if you are sure the
|
||||
subject is valid.
|
||||
</P>
|
||||
<P>
|
||||
Bypassing the sanity checks and the <b>pcre2_match()</b> wrapping can give
|
||||
speedups of more than 10%.
|
||||
</P>
|
||||
<br><a name="SEC12" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2api</b>(3), <b>pcre2unicode</b>(3)
|
||||
</P>
|
||||
<br><a name="SEC13" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel (FAQ by Zoltan Herczeg)
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 22 August 2024
|
||||
<br>
|
||||
Copyright © 1997-2024 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
105
3rd/pcre2/doc/html/pcre2limits.html
Normal file
@@ -0,0 +1,105 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2limits specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2limits man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
SIZE AND OTHER LIMITATIONS
|
||||
</b><br>
|
||||
<P>
|
||||
There are some size limitations in PCRE2 but it is hoped that they will never
|
||||
in practice be relevant.
|
||||
</P>
|
||||
<P>
|
||||
The maximum size of a compiled pattern is approximately 64 thousand code units
|
||||
for the 8-bit and 16-bit libraries if PCRE2 is compiled with the default
|
||||
internal linkage size, which is 2 bytes for these libraries. If you want to
|
||||
process regular expressions that are truly enormous, you can compile PCRE2 with
|
||||
an internal linkage size of 3 or 4 (when building the 16-bit library, 3 is
|
||||
rounded up to 4). See the <b>README</b> file in the source distribution and the
|
||||
<a href="pcre2build.html"><b>pcre2build</b></a>
|
||||
documentation for details. In these cases the limit is substantially larger.
|
||||
However, the speed of execution is slower. In the 32-bit library, the internal
|
||||
linkage size is always 4.
|
||||
</P>
|
||||
<P>
|
||||
The maximum length of a source pattern string is essentially unlimited; it is
|
||||
the largest number a PCRE2_SIZE variable can hold. However, the program that
|
||||
calls <b>pcre2_compile()</b> can specify a smaller limit.
|
||||
</P>
|
||||
<P>
|
||||
The maximum length (in code units) of a subject string is one less than the
|
||||
largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an unsigned
|
||||
integer type, usually defined as size_t. Its maximum value (that is
|
||||
~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated strings
|
||||
and unset offsets.
|
||||
</P>
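<P>
The reserved value can be illustrated with a small sketch. It assumes the usual
case in which PCRE2_SIZE is defined as size_t; the names demo_size,
DEMO_RESERVED, demo_max_subject_length(), and demo_is_reserved() are
hypothetical stand-ins so that the fragment compiles without the PCRE2 headers.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PCRE2_SIZE, assuming the usual definition as size_t. */
typedef size_t demo_size;

/* All bits set: the value reserved for "zero-terminated" and "unset". */
#define DEMO_RESERVED (~(demo_size)0)

/* Largest subject length (in code units) that is still an ordinary length. */
static demo_size demo_max_subject_length(void)
{
  return DEMO_RESERVED - 1;
}

/* True if n is the reserved indicator rather than a real length or offset. */
static int demo_is_reserved(demo_size n)
{
  return n == DEMO_RESERVED;
}
```

<P>
Under this assumption the reserved value equals SIZE_MAX, which is why a real
subject can be at most one code unit shorter than that.
</P>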
|
||||
<P>
|
||||
All values in repeating quantifiers must be less than 65536.
|
||||
</P>
|
||||
<P>
|
||||
There are two different limits that apply to branches of lookbehind assertions.
|
||||
If every branch in such an assertion matches a fixed number of characters,
|
||||
the maximum length of any branch is 65535 characters. If any branch matches a
|
||||
variable number of characters, then the maximum matching length for every
|
||||
branch is limited. The default limit is set at compile time, defaulting to 255,
|
||||
but can be changed by the calling program.
|
||||
</P>
|
||||
<P>
|
||||
There is no limit to the number of parenthesized groups, but there can be no
|
||||
more than 65535 capture groups, and there is a limit to the depth of nesting of
|
||||
parenthesized subpatterns of all kinds. This is imposed in order to limit the
|
||||
amount of system stack used at compile time. The default limit can be specified
|
||||
when PCRE2 is built; if not, the default is set to 250. An application can
|
||||
change this limit by calling <b>pcre2_set_parens_nest_limit()</b> to set the limit in
|
||||
a compile context.
|
||||
</P>
|
||||
<P>
|
||||
The maximum length of a name for a named capture group is 32 code units, and the
|
||||
maximum number of such groups is 10000.
|
||||
</P>
|
||||
<P>
|
||||
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb
|
||||
is 255 code units for the 8-bit library and 65535 code units for the 16-bit and
|
||||
32-bit libraries.
|
||||
</P>
|
||||
<P>
|
||||
The maximum length of a string argument to a callout is the largest number a
|
||||
32-bit unsigned integer can hold.
|
||||
</P>
|
||||
<P>
|
||||
The maximum amount of heap memory used for matching is controlled by the heap
|
||||
limit, which can be set in a pattern or in a match context. The default is a
|
||||
very large number, effectively unlimited.
|
||||
</P>
|
||||
<br><b>
|
||||
AUTHOR
|
||||
</b><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><b>
|
||||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 16 August 2023
|
||||
<br>
|
||||
Copyright © 1997-2023 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
262
3rd/pcre2/doc/html/pcre2matching.html
Normal file
@@ -0,0 +1,262 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2matching specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2matching man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">PCRE2 MATCHING ALGORITHMS</a>
|
||||
<li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a>
|
||||
<li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a>
|
||||
<li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a>
|
||||
<li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
|
||||
<li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
|
||||
<li><a name="TOC7" href="#SEC7">AUTHOR</a>
|
||||
<li><a name="TOC8" href="#SEC8">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE2 MATCHING ALGORITHMS</a><br>
|
||||
<P>
|
||||
This document describes the two different algorithms that are available in
|
||||
PCRE2 for matching a compiled regular expression against a given subject
|
||||
string. The "standard" algorithm is the one provided by the <b>pcre2_match()</b>
|
||||
function. This works in the same way as Perl's matching function, and provides a
|
||||
Perl-compatible matching operation. The just-in-time (JIT) optimization that is
|
||||
described in the
|
||||
<a href="pcre2jit.html"><b>pcre2jit</b></a>
|
||||
documentation is compatible with this function.
|
||||
</P>
|
||||
<P>
|
||||
An alternative algorithm is provided by the <b>pcre2_dfa_match()</b> function;
|
||||
it operates in a different way, and is not Perl-compatible. This alternative
|
||||
has advantages and disadvantages compared with the standard algorithm, and
|
||||
these are described below.
|
||||
</P>
|
||||
<P>
|
||||
When there is only one possible way in which a given subject string can match a
|
||||
pattern, the two algorithms give the same answer. A difference arises, however,
|
||||
when there are multiple possibilities. For example, if the anchored pattern
|
||||
<pre>
|
||||
^<.*>
|
||||
</pre>
|
||||
is matched against the string
|
||||
<pre>
|
||||
<something> <something else> <something further>
|
||||
</pre>
|
||||
there are three possible answers. The standard algorithm finds only one of
|
||||
them, whereas the alternative algorithm finds all three.
|
||||
</P>
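<P>
The three answers in this example can be enumerated with a toy model. The
sketch below is hard-wired to this one anchored pattern and says nothing about
how PCRE2 is implemented; the function name nth_longest_match() is
hypothetical. It lists the candidate matches longest first. With the greedy
quantifier shown, the standard algorithm would report only the k = 0 entry,
whereas the alternative algorithm would report all of them.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Toy model of the anchored pattern ^<.*> only. Returns the length of the
   k-th longest match at offset 0 (k = 0 is the longest), or 0 if there are
   fewer than k + 1 matches. */
static size_t nth_longest_match(const char *subject, size_t k)
{
  /* By default '.' does not match a newline, so no match extends past one. */
  size_t limit = strcspn(subject, "\n");
  size_t seen = 0;

  if (limit == 0 || subject[0] != '<') return 0;  /* '^<' must match first */

  /* Every '>' after the opening '<' is a possible end point. Scanning from
     the right means that longer matches are counted first. */
  for (size_t i = limit; i-- > 1; )
    {
    if (subject[i] == '>' && seen++ == k) return i + 1;
    }
  return 0;
}
```

<P>
For the subject string shown above, k = 0, 1, and 2 give lengths 48, 28, and
11, and k = 3 gives 0 because there are only three possibilities.
</P>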
|
||||
<br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br>
|
||||
<P>
|
||||
The set of strings that are matched by a regular expression can be represented
|
||||
as a tree structure. An unlimited repetition in the pattern makes the tree of
|
||||
infinite size, but it is still a tree. Matching the pattern to a given subject
|
||||
string (from a given starting point) can be thought of as a search of the tree.
|
||||
There are two ways to search a tree: depth-first and breadth-first, and these
|
||||
correspond to the two matching algorithms provided by PCRE2.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br>
|
||||
<P>
|
||||
In the terminology of Jeffrey Friedl's book "Mastering Regular Expressions",
|
||||
the standard algorithm is an "NFA algorithm". It conducts a depth-first search
|
||||
of the pattern tree. That is, it proceeds along a single path through the tree,
|
||||
checking that the subject matches what is required. When there is a mismatch,
|
||||
the algorithm tries any alternatives at the current point, and if they all
|
||||
fail, it backs up to the previous branch point in the tree, and tries the next
|
||||
alternative branch at that level. This often involves backing up (moving to the
|
||||
left) in the subject string as well. The order in which repetition branches are
|
||||
tried is controlled by the greedy or ungreedy nature of the quantifier.
|
||||
</P>
|
||||
<P>
|
||||
If a leaf node is reached, a matching string has been found, and at that point
|
||||
the algorithm stops. Thus, if there is more than one possible match, this
|
||||
algorithm returns the first one that it finds. Whether this is the shortest,
|
||||
the longest, or some intermediate length depends on the way the alternations
|
||||
and the greedy or ungreedy repetition quantifiers are specified in the
|
||||
pattern.
|
||||
</P>
|
||||
<P>
|
||||
Because it ends up with a single path through the tree, it is relatively
|
||||
straightforward for this algorithm to keep track of the substrings that are
|
||||
matched by portions of the pattern in parentheses. This provides support for
|
||||
capturing parentheses and backreferences.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br>
|
||||
<P>
|
||||
This algorithm conducts a breadth-first search of the tree. Starting from the
|
||||
first matching point in the subject, it scans the subject string from left to
|
||||
right, once, character by character, and as it does this, it remembers all the
|
||||
paths through the tree that represent valid matches. In Friedl's terminology,
|
||||
this is a kind of "DFA algorithm", though it is not implemented as a
|
||||
traditional finite state machine (it keeps multiple states active
|
||||
simultaneously).
|
||||
</P>
|
||||
<P>
|
||||
Although the general principle of this matching algorithm is that it scans the
|
||||
subject string only once, without backtracking, there is one exception: when a
|
||||
lookaround assertion is encountered, the characters following or preceding the
|
||||
current point have to be independently inspected.
|
||||
</P>
|
||||
<P>
|
||||
The scan continues until either the end of the subject is reached, or there are
|
||||
no more unterminated paths. At this point, terminated paths represent the
|
||||
different matching possibilities (if there are none, the match has failed).
|
||||
Thus, if there is more than one possible match, this algorithm finds all of
|
||||
them, and in particular, it finds the longest. The matches are returned in
|
||||
the output vector in decreasing order of length. There is an option to stop the
|
||||
algorithm after the first match (which is necessarily the shortest) is found.
|
||||
</P>
|
||||
<P>
|
||||
Note that the size of vector needed to contain all the results depends on the
|
||||
number of simultaneous matches, not on the number of capturing parentheses in
|
||||
the pattern. Using <b>pcre2_match_data_create_from_pattern()</b> to create the
|
||||
match data block is therefore not advisable when doing DFA matching.
|
||||
</P>
|
||||
<P>
|
||||
Note also that all the matches that are found start at the same point in the
|
||||
subject. If the pattern
|
||||
<pre>
|
||||
cat(er(pillar)?)?
|
||||
</pre>
|
||||
is matched against the string "the caterpillar catchment", the result is the
|
||||
three strings "caterpillar", "cater", and "cat" that start at the fifth
|
||||
character of the subject. The algorithm does not automatically move on to find
|
||||
matches that start at later positions.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2's "auto-possessification" optimization usually applies to character
|
||||
repeats at the end of a pattern (as well as internally). For example, the
|
||||
pattern "a\d+" is compiled as if it were "a\d++" because there is no point
|
||||
even considering the possibility of backtracking into the repeated digits. For
|
||||
DFA matching, this means that only one possible match is found. If you really
|
||||
do want multiple matches in such cases, either use an ungreedy repeat
|
||||
("a\d+?") or set the PCRE2_NO_AUTO_POSSESS option when compiling.
|
||||
</P>
|
||||
<P>
|
||||
There are a number of features of PCRE2 regular expressions that are not
|
||||
supported or behave differently in the alternative matching function. Those
|
||||
that are not supported cause an error if encountered.
|
||||
</P>
|
||||
<P>
|
||||
1. Because the algorithm finds all possible matches, the greedy or ungreedy
|
||||
nature of repetition quantifiers is not relevant (though it may affect
|
||||
auto-possessification, as just described). During matching, greedy and ungreedy
|
||||
quantifiers are treated in exactly the same way. However, possessive
|
||||
quantifiers can make a difference when what follows could also match what is
|
||||
quantified, for example in a pattern like this:
|
||||
<pre>
|
||||
^a++\w!
|
||||
</pre>
|
||||
This pattern matches "aaab!" but not "aaa!", which would be matched by a
|
||||
non-possessive quantifier. Similarly, if an atomic group is present, it is
|
||||
matched as if it were a standalone pattern at the current point, and the
|
||||
longest match is then "locked in" for the rest of the overall pattern.
|
||||
</P>
|
||||
<P>
|
||||
2. When dealing with multiple paths through the tree simultaneously, it is not
|
||||
straightforward to keep track of captured substrings for the different matching
|
||||
possibilities, and PCRE2's implementation of this algorithm does not attempt to
|
||||
do this. This means that no captured substrings are available.
|
||||
</P>
|
||||
<P>
|
||||
3. Because no substrings are captured, a number of related features are not
|
||||
available:
|
||||
<br>
|
||||
<br>
|
||||
(a) Backreferences;
|
||||
<br>
|
||||
<br>
|
||||
(b) Conditional expressions that use a backreference as the condition or test
|
||||
for a specific group recursion;
|
||||
<br>
|
||||
<br>
|
||||
(c) Script runs;
|
||||
<br>
|
||||
<br>
|
||||
(d) Scan substring assertions.
|
||||
</P>
|
||||
<P>
|
||||
4. Because many paths through the tree may be active, the \K escape sequence,
|
||||
which resets the start of the match when encountered (but may be on some paths
|
||||
and not on others), is not supported.
|
||||
</P>
|
||||
<P>
|
||||
5. Callouts are supported, but the value of the <i>capture_top</i> field is
|
||||
always 1, and the value of the <i>capture_last</i> field is always 0.
|
||||
</P>
|
||||
<P>
|
||||
6. The \C escape sequence, which (in the standard algorithm) always matches a
|
||||
single code unit, even in a UTF mode, is not supported in UTF modes because
|
||||
the alternative algorithm moves through the subject string one character (not
|
||||
code unit) at a time, for all active paths through the tree.
|
||||
</P>
|
||||
<P>
|
||||
7. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
|
||||
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
|
||||
</P>
|
||||
<P>
|
||||
8. The PCRE2_MATCH_INVALID_UTF option for <b>pcre2_compile()</b> is not
|
||||
supported by <b>pcre2_dfa_match()</b>.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
|
||||
<P>
|
||||
The main advantage of the alternative algorithm is that all possible matches
|
||||
(at a single point in the subject) are automatically found, and in particular,
|
||||
the longest match is found. To find more than one match at the same point using
|
||||
the standard algorithm, you have to do kludgy things with callouts.
|
||||
</P>
|
||||
<P>
|
||||
Partial matching is possible with this algorithm, though it has some
|
||||
limitations. The
|
||||
<a href="pcre2partial.html"><b>pcre2partial</b></a>
|
||||
documentation gives details of partial matching and discusses multi-segment
|
||||
matching.
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
|
||||
<P>
|
||||
The alternative algorithm suffers from a number of disadvantages:
|
||||
</P>
|
||||
<P>
|
||||
1. It is substantially slower than the standard algorithm. This is partly
|
||||
because it has to search for all possible matches, but is also because it is
|
||||
less susceptible to optimization.
|
||||
</P>
|
||||
<P>
|
||||
2. Capturing parentheses and other features such as backreferences that rely on
|
||||
them are not supported.
|
||||
</P>
|
||||
<P>
|
||||
3. Matching within invalid UTF strings is not supported.
|
||||
</P>
|
||||
<P>
|
||||
4. Although atomic groups are supported, their use does not provide the
|
||||
performance advantage that it does for the standard algorithm.
|
||||
</P>
|
||||
<P>
|
||||
5. JIT optimization is not supported.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 30 August 2024
|
||||
<br>
|
||||
Copyright © 1997-2024 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
408
3rd/pcre2/doc/html/pcre2partial.html
Normal file
@@ -0,0 +1,408 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2partial specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2partial man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE2</a>
|
||||
<li><a name="TOC2" href="#SEC2">REQUIREMENTS FOR A PARTIAL MATCH</a>
|
||||
<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre2_match()</a>
|
||||
<li><a name="TOC4" href="#SEC4">MULTI-SEGMENT MATCHING WITH pcre2_match()</a>
|
||||
<li><a name="TOC5" href="#SEC5">PARTIAL MATCHING USING pcre2_dfa_match()</a>
|
||||
<li><a name="TOC6" href="#SEC6">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a>
|
||||
<li><a name="TOC7" href="#SEC7">AUTHOR</a>
|
||||
<li><a name="TOC8" href="#SEC8">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE2</a><br>
|
||||
<P>
|
||||
In normal use of PCRE2, if there is a match up to the end of a subject string,
|
||||
but more characters are needed to match the entire pattern, PCRE2_ERROR_NOMATCH
|
||||
is returned, just like any other failing match. There are circumstances where
|
||||
it might be helpful to distinguish this "partial match" case.
|
||||
</P>
|
||||
<P>
|
||||
One example is an application where the subject string is very long, and not
|
||||
all available at once. The requirement here is to be able to do the matching
|
||||
segment by segment, but special action is needed when a matched substring spans
|
||||
the boundary between two segments.
|
||||
</P>
|
||||
<P>
|
||||
Another example is checking a user input string as it is typed, to ensure that
|
||||
it conforms to a required format. Invalid characters can be immediately
|
||||
diagnosed and rejected, giving instant feedback.
|
||||
</P>
|
||||
<P>
|
||||
Partial matching is a PCRE2-specific feature; it is not Perl-compatible. It is
|
||||
requested by setting one of the PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT
|
||||
options when calling a matching function. The difference between the two
|
||||
options is whether or not a partial match is preferred to an alternative
|
||||
complete match, though the details differ between the two types of matching
|
||||
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
|
||||
</P>
|
||||
<P>
|
||||
If you want to use partial matching with just-in-time optimized code, as well
|
||||
as setting a partial match option for the matching function, you must also call
|
||||
<b>pcre2_jit_compile()</b> with one or both of these options:
|
||||
<pre>
|
||||
PCRE2_JIT_PARTIAL_HARD
|
||||
PCRE2_JIT_PARTIAL_SOFT
|
||||
</pre>
|
||||
PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial
|
||||
matches on the same pattern. Separate code is compiled for each mode. If the
|
||||
appropriate JIT mode has not been compiled, interpretive matching code is used.
|
||||
</P>
|
||||
<P>
|
||||
Setting a partial matching option disables two of PCRE2's standard
|
||||
optimization hints. PCRE2 remembers the last literal code unit in a pattern,
|
||||
and abandons matching immediately if it is not present in the subject string.
|
||||
This optimization cannot be used for a subject string that might match only
|
||||
partially. PCRE2 also remembers a minimum length of a matching string, and does
|
||||
not bother to run the matching function on shorter strings. This optimization
|
||||
is also disabled for partial matching.
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">REQUIREMENTS FOR A PARTIAL MATCH</a><br>
|
||||
<P>
|
||||
A possible partial match occurs during matching when the end of the subject
|
||||
string is reached successfully, but either more characters are needed to
|
||||
complete the match, or the addition of more characters might change what is
|
||||
matched.
|
||||
</P>
|
||||
<P>
|
||||
Example 1: if the pattern is /abc/ and the subject is "ab", more characters are
|
||||
definitely needed to complete a match. In this case both hard and soft matching
|
||||
options yield a partial match.
|
||||
</P>
|
||||
<P>
|
||||
Example 2: if the pattern is /ab+/ and the subject is "ab", a complete match
|
||||
can be found, but the addition of more characters might change what is
|
||||
matched. In this case, only PCRE2_PARTIAL_HARD returns a partial match;
|
||||
PCRE2_PARTIAL_SOFT returns the complete match.
|
||||
</P>
|
||||
<P>
|
||||
On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if the next
|
||||
pattern item is \z, \Z, \b, \B, or $ there is always a partial match.
|
||||
Otherwise, for both options, the next pattern item must be one that inspects a
|
||||
character, and at least one of the following must be true:
|
||||
</P>
|
||||
<P>
|
||||
(1) At least one character has already been inspected. An inspected character
|
||||
need not form part of the final matched string; lookbehind assertions and the
|
||||
\K escape sequence provide ways of inspecting characters before the start of a
|
||||
matched string.
|
||||
</P>
|
||||
<P>
|
||||
(2) The pattern contains one or more lookbehind assertions. This condition
|
||||
exists in case there is a lookbehind that inspects characters before the start
|
||||
of the match.
|
||||
</P>
|
||||
<P>
|
||||
(3) There is a special case when the whole pattern can match an empty string.
|
||||
When the starting point is at the end of the subject, the empty string match is
|
||||
a possibility, and if PCRE2_PARTIAL_SOFT is set and neither of the above
|
||||
conditions is true, it is returned. However, because adding more characters
|
||||
might result in a non-empty match, PCRE2_PARTIAL_HARD returns a partial match,
|
||||
which in this case means "there is going to be a match at this point, but until
|
||||
some more characters are added, we do not know if it will be an empty string or
|
||||
something longer".
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre2_match()</a><br>
|
||||
<P>
|
||||
When a partial matching option is set, the result of calling
|
||||
<b>pcre2_match()</b> can be one of the following:
|
||||
</P>
|
||||
<P>
|
||||
<b>A successful match</b>
|
||||
A complete match has been found, starting and ending within this subject.
|
||||
</P>
|
||||
<P>
|
||||
<b>PCRE2_ERROR_NOMATCH</b>
|
||||
No match can start anywhere in this subject.
|
||||
</P>
|
||||
<P>
|
||||
<b>PCRE2_ERROR_PARTIAL</b>
|
||||
Adding more characters may result in a complete match that uses one or more
|
||||
characters from the end of this subject.
|
||||
</P>
|
||||
<P>
|
||||
When a partial match is returned, the first two elements in the ovector point
|
||||
to the portion of the subject that was matched, but the values in the rest of
|
||||
the ovector are undefined. The appearance of \K in the pattern has no effect
|
||||
for a partial match. Consider this pattern:
|
||||
<pre>
|
||||
/abc\K123/
|
||||
</pre>
|
||||
If it is matched against "456abc123xyz" the result is a complete match, and the
|
||||
ovector defines the matched string as "123", because \K resets the "start of
|
||||
match" point. However, if a partial match is requested and the subject string
|
||||
is "456abc12", a partial match is found for the string "abc12", because all
|
||||
these characters are needed for a subsequent re-match with additional
|
||||
characters.
|
||||
</P>
|
||||
<P>
|
||||
If there is more than one partial match, the first one that was found provides
|
||||
the data that is returned. Consider this pattern:
|
||||
<pre>
|
||||
/123\w+X|dogY/
|
||||
</pre>
|
||||
If this is matched against the subject string "abc123dog", both alternatives
|
||||
fail to match, but the end of the subject is reached during matching, so
|
||||
PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
|
||||
"123dog" as the first partial match. (In this example, there are two partial
|
||||
matches, because "dog" on its own partially matches the second alternative.)
|
||||
</P>
|
||||
<br><b>
|
||||
How a partial match is processed by pcre2_match()
|
||||
</b><br>
|
||||
<P>
|
||||
What happens when a partial match is identified depends on which of the two
|
||||
partial matching options is set.
|
||||
</P>
|
||||
<P>
|
||||
If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
|
||||
partial match is found, without continuing to search for possible complete
|
||||
matches. This option is "hard" because it prefers an earlier partial match over
|
||||
a later complete match. For this reason, the assumption is made that the end of
|
||||
the supplied subject string is not the true end of the available data, which is
|
||||
why \z, \Z, \b, \B, and $ always give a partial match.
|
||||
</P>
|
||||
<P>
|
||||
If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
|
||||
continues as normal, and other alternatives in the pattern are tried. If no
|
||||
complete match can be found, PCRE2_ERROR_PARTIAL is returned instead of
|
||||
PCRE2_ERROR_NOMATCH. This option is "soft" because it prefers a complete match
|
||||
over a partial match. All the various matching items in a pattern behave as if
|
||||
the subject string is potentially complete; \z, \Z, and $ match at the end of
|
||||
the subject, as normal, and for \b and \B the end of the subject is treated
|
||||
as a non-alphanumeric.
|
||||
</P>
|
||||
<P>
|
||||
The difference between the two partial matching options can be illustrated by a
|
||||
pattern such as:
|
||||
<pre>
|
||||
/dog(sbody)?/
|
||||
</pre>
|
||||
This matches either "dog" or "dogsbody", greedily (that is, it prefers the
|
||||
longer string if possible). If it is matched against the string "dog" with
|
||||
PCRE2_PARTIAL_SOFT, it yields a complete match for "dog". However, if
|
||||
PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PARTIAL. On the other
|
||||
hand, if the pattern is made ungreedy the result is different:
|
||||
<pre>
|
||||
/dog(sbody)??/
|
||||
</pre>
|
||||
In this case the result is always a complete match because that is found first,
|
||||
and matching never continues after finding a complete match. It might be easier
|
||||
to follow this explanation by thinking of the two patterns like this:
|
||||
<pre>
|
||||
/dog(sbody)?/ is the same as /dogsbody|dog/
|
||||
/dog(sbody)??/ is the same as /dog|dogsbody/
|
||||
</pre>
|
||||
The second pattern will never match "dogsbody", because it will always find the
|
||||
shorter match first.
|
||||
</P>
|
||||
<br><b>
|
||||
Example of partial matching using pcre2test
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>pcre2test</b> data modifiers <b>partial_hard</b> (or <b>ph</b>) and
|
||||
<b>partial_soft</b> (or <b>ps</b>) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT,
|
||||
respectively, when calling <b>pcre2_match()</b>. Here is a run of
|
||||
<b>pcre2test</b> using a pattern that matches the whole subject in the form of a
|
||||
date:
|
||||
<pre>
|
||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||
data> 25dec3\=ph
|
||||
Partial match: 25dec3
|
||||
data> 3ju\=ph
|
||||
Partial match: 3ju
|
||||
data> 3juj\=ph
|
||||
No match
|
||||
</pre>
|
||||
This example gives the same results for both hard and soft partial matching
|
||||
options. Here is an example where there is a difference:
|
||||
<pre>
|
||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||
data> 25jun04\=ps
|
||||
0: 25jun04
|
||||
1: jun
|
||||
data> 25jun04\=ph
|
||||
Partial match: 25jun04
|
||||
</pre>
|
||||
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
|
||||
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
|
||||
there is only a partial match.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_match()</a><br>
|
||||
<P>
|
||||
PCRE was not originally designed with multi-segment matching in mind. However,
|
||||
over time, features (including partial matching) that make multi-segment
|
||||
matching possible have been added. A very long string can be searched segment
|
||||
by segment by calling <b>pcre2_match()</b> repeatedly, with the aim of achieving
|
||||
the same results that would happen if the entire string was available for
|
||||
searching all the time. Normally, the strings that are being sought are much
|
||||
shorter than each individual segment, and are in the middle of very long
|
||||
strings, so the pattern is normally not anchored.
|
||||
</P>
|
||||
<P>
|
||||
Special logic must be implemented to handle a matched substring that spans a
|
||||
segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
|
||||
partial match at the end of a segment whenever there is the possibility of
|
||||
changing the match by adding more characters. The PCRE2_NOTBOL option should
|
||||
also be set for all but the first segment.
|
||||
</P>
|
||||
<P>
|
||||
When a partial match occurs, the next segment must be added to the current
|
||||
subject and the match re-run, using the <i>startoffset</i> argument of
|
||||
<b>pcre2_match()</b> to begin at the point where the partial match started.
|
||||
For example:
|
||||
<pre>
|
||||
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
|
||||
data> ...the date is 23ja\=ph
|
||||
Partial match: 23ja
|
||||
data> ...the date is 23jan19 and on that day...\=offset=15
|
||||
0: 23jan19
|
||||
1: jan
|
||||
</pre>
|
||||
Note the use of the <b>offset</b> modifier to start the new match where the
|
||||
partial match was found. In this example, the next segment was added to the one
|
||||
in which the partial match was found. This is the most straightforward
|
||||
approach, typically using a memory buffer that is twice the size of each
|
||||
segment. After a partial match, the first half of the buffer is discarded, the
|
||||
second half is moved to the start of the buffer, and a new segment is added
|
||||
before repeating the match as in the example above. After a no match, the
|
||||
entire buffer can be discarded.
|
||||
</P>
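<P>
The double-size buffer scheme might be sketched in C as follows. This is only
an illustration under stated assumptions: the seg_buffer type and the
seg_buffer_slide() function are hypothetical names, not part of PCRE2, and the
sketch assumes that the partial match starts in the second half of a full
buffer (that is, matched strings are shorter than one segment). The caller
would then re-run <b>pcre2_match()</b> on the buffer using the adjusted start
offset.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer descriptor; not part of the PCRE2 API. */
typedef struct {
    char  *data;      /* storage with room for 2 * seg_size bytes */
    size_t seg_size;  /* maximum size of one segment */
    size_t len;       /* number of bytes currently in use */
} seg_buffer;

/* After a partial match starting at *match_start, make room for the next
   segment. If the buffer already holds more than one segment, the first
   seg_size bytes are discarded and the remainder is moved to the front.
   The new segment is then appended and *match_start is adjusted so that it
   still identifies the start of the partial match.
   Returns 0 on success, or -1 if the partial match starts in the part that
   would be discarded or the new segment is too large. */
int seg_buffer_slide(seg_buffer *b, const char *next, size_t next_len,
                     size_t *match_start)
{
    size_t discard;

    assert(b != NULL && next != NULL && match_start != NULL);

    discard = (b->len > b->seg_size) ? b->seg_size : 0;
    if (next_len > b->seg_size || *match_start < discard) return -1;

    /* Move the retained tail to the start of the buffer. */
    memmove(b->data, b->data + discard, b->len - discard);
    b->len -= discard;

    /* Append the new segment after the retained data. */
    memcpy(b->data + b->len, next, next_len);
    b->len += next_len;

    *match_start -= discard;
    return 0;
}
```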
|
||||
<P>
|
||||
If there are memory constraints, you may want to discard text that precedes a
|
||||
partial match before adding the next segment. Unfortunately, this is not at
|
||||
present straightforward. In cases such as the above, where the pattern does not
|
||||
contain any lookbehinds, it is sufficient to retain only the partially matched
|
||||
substring. However, if the pattern contains a lookbehind assertion, characters
|
||||
that precede the start of the partial match may have been inspected during the
|
||||
matching process. When <b>pcre2test</b> displays a partial match, it indicates
|
||||
these characters with '<' if the <b>allusedtext</b> modifier is set:
|
||||
<pre>
|
||||
re> "(?<=123)abc"
|
||||
data> xx123ab\=ph,allusedtext
|
||||
Partial match: 123ab
|
||||
<<<
|
||||
</pre>
|
||||
However, the <b>allusedtext</b> modifier is not available for JIT matching,
|
||||
because JIT matching does not record the first (or last) consulted characters.
|
||||
For this reason, this information is not available via the API. It is therefore
|
||||
not possible in general to obtain the exact number of characters that must be
|
||||
retained in order to get the right match result. If you cannot retain the
|
||||
entire segment, you must find some heuristic way of choosing.
|
||||
</P>
|
||||
<P>
|
||||
If you know the approximate length of the matching substrings, you can use that
|
||||
to decide how much text to retain. The only lookbehind information that is
|
||||
currently available via the API is the length of the longest individual
|
||||
lookbehind in a pattern, but this can be misleading if there are nested
|
||||
lookbehinds. The value returned by calling <b>pcre2_pattern_info()</b> with the
|
||||
PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
|
||||
units) that any individual lookbehind moves back when it is processed. A
|
||||
pattern such as "(?<=(?<!b)a)" has a maximum lookbehind value of one, but
|
||||
inspects two characters before its starting point.
|
||||
</P>
|
||||
<P>
|
||||
In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
|
||||
UTF-8 or UTF-16 you have to count characters while moving back through the code
|
||||
units.
|
||||
</P>
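<P>
For the UTF-8 case, counting characters while moving back could look like the
following C sketch. The function name utf8_step_back is hypothetical (it is not
part of PCRE2), and the sketch assumes that the subject is valid UTF-8, so that
every continuation byte has the bit pattern 10xxxxxx.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE2): starting at code unit offset
   `pos` in the valid UTF-8 string `s`, move back by up to `nchars`
   characters and return the resulting code unit offset. The result is 0 if
   the start of the string is reached first. */
size_t utf8_step_back(const unsigned char *s, size_t pos, size_t nchars)
{
    assert(s != NULL);

    while (nchars > 0 && pos > 0) {
        pos--;
        /* Skip continuation bytes (10xxxxxx) to reach the lead byte. */
        while (pos > 0 && (s[pos] & 0xC0) == 0x80) pos--;
        nchars--;
    }
    return pos;
}
```

<P>
A caller might pass the start of a partial match as <i>pos</i> and the
PCRE2_INFO_MAXLOOKBEHIND value as <i>nchars</i> to obtain a first estimate of
how much preceding text to retain, bearing in mind the caveat about nested
lookbehinds.
</P>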
|
||||
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
|
||||
<P>
|
||||
The DFA function moves along the subject string character by character, without
|
||||
backtracking, searching for all possible matches simultaneously. If the end of
|
||||
the subject is reached before the end of the pattern, there is the possibility
|
||||
of a partial match.
|
||||
</P>
|
||||
<P>
|
||||
When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
|
||||
have been no complete matches. Otherwise, the complete matches are returned.
|
||||
If PCRE2_PARTIAL_HARD is set, a partial match takes precedence over any
|
||||
complete matches. The portion of the string that was matched when the longest
|
||||
partial match was found is set as the first matching string.
|
||||
</P>
|
||||
<P>
|
||||
Because the DFA function always searches for all possible matches, and there is
|
||||
no difference between greedy and ungreedy repetition, its behaviour is
|
||||
different from that of <b>pcre2_match()</b>. Consider the string "dog" matched
|
||||
against this ungreedy pattern:
|
||||
<pre>
|
||||
/dog(sbody)??/
|
||||
</pre>
|
||||
Whereas the standard function stops as soon as it finds the complete match for
|
||||
"dog", the DFA function also finds the partial match for "dogsbody", and so
|
||||
returns that when PCRE2_PARTIAL_HARD is set.
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a><br>
|
||||
<P>
|
||||
When a partial match has been found using the DFA matching function, it is
|
||||
possible to continue the match by providing additional subject data and calling
|
||||
the function again with the same compiled regular expression, this time setting
|
||||
the PCRE2_DFA_RESTART option. You must pass the same working space as before,
|
||||
because this is where details of the previous partial match are stored. You can
|
||||
set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART
|
||||
to continue partial matching over multiple segments. Here is an example using
|
||||
<b>pcre2test</b>:
|
||||
<pre>
|
||||
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
|
||||
data> 23ja\=dfa,ps
|
||||
Partial match: 23ja
|
||||
data> n05\=dfa,dfa_restart
|
||||
0: n05
|
||||
</pre>
|
||||
The first call has "23ja" as the subject, and requests partial matching; the
|
||||
second call has "n05" as the subject for the continued (restarted) match.
|
||||
Notice that when the match is complete, only the last part is shown; PCRE2 does
|
||||
not retain the previously partially-matched string. It is up to the calling
|
||||
program to do that if it needs to. This means that, for an unanchored pattern,
|
||||
if a continued match fails, it is not possible to try again at a new starting
|
||||
point. All this facility is capable of doing is continuing with the previous
|
||||
match attempt. For example, consider this pattern:
|
||||
<pre>
|
||||
1234|3789
|
||||
</pre>
|
||||
If the first part of the subject is "ABC123", a partial match of the first
|
||||
alternative is found at offset 3. There is no partial match for the second
|
||||
alternative, because such a match does not start at the same point in the
|
||||
subject string. Attempting to continue with the string "7890" does not yield a
|
||||
match because only those alternatives that match at one point in the subject
|
||||
are remembered. Depending on the application, this may or may not be what you
|
||||
want.
|
||||
</P>
|
||||
<P>
|
||||
If you do want to allow for starting again at the next character, one way of
|
||||
doing it is to retain some or all of the segment and try a new complete match,
|
||||
as described for <b>pcre2_match()</b> above. Another possibility is to work with
|
||||
two buffers. If a partial match at offset <i>n</i> in the first buffer is
|
||||
followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
|
||||
can then try a new match starting at offset <i>n+1</i> in the first buffer.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 27 November 2024
|
||||
<br>
|
||||
Copyright © 1997-2019 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
4140
3rd/pcre2/doc/html/pcre2pattern.html
Normal file
@@ -0,0 +1,4140 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2pattern specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2pattern man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION DETAILS</a>
|
||||
<li><a name="TOC2" href="#SEC2">EBCDIC CHARACTER CODES</a>
|
||||
<li><a name="TOC3" href="#SEC3">SPECIAL START-OF-PATTERN ITEMS</a>
|
||||
<li><a name="TOC4" href="#SEC4">CHARACTERS AND METACHARACTERS</a>
|
||||
<li><a name="TOC5" href="#SEC5">BACKSLASH</a>
|
||||
<li><a name="TOC6" href="#SEC6">CIRCUMFLEX AND DOLLAR</a>
|
||||
<li><a name="TOC7" href="#SEC7">FULL STOP (PERIOD, DOT) AND \N</a>
|
||||
<li><a name="TOC8" href="#SEC8">MATCHING A SINGLE CODE UNIT</a>
|
||||
<li><a name="TOC9" href="#SEC9">SQUARE BRACKETS AND CHARACTER CLASSES</a>
|
||||
<li><a name="TOC10" href="#SEC10">PERL EXTENDED CHARACTER CLASSES</a>
|
||||
<li><a name="TOC11" href="#SEC11">UTS#18 EXTENDED CHARACTER CLASSES</a>
|
||||
<li><a name="TOC12" href="#SEC12">POSIX CHARACTER CLASSES</a>
|
||||
<li><a name="TOC13" href="#SEC13">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a>
|
||||
<li><a name="TOC14" href="#SEC14">VERTICAL BAR</a>
|
||||
<li><a name="TOC15" href="#SEC15">INTERNAL OPTION SETTING</a>
|
||||
<li><a name="TOC16" href="#SEC16">GROUPS</a>
|
||||
<li><a name="TOC17" href="#SEC17">DUPLICATE GROUP NUMBERS</a>
|
||||
<li><a name="TOC18" href="#SEC18">NAMED CAPTURE GROUPS</a>
|
||||
<li><a name="TOC19" href="#SEC19">REPETITION</a>
|
||||
<li><a name="TOC20" href="#SEC20">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
|
||||
<li><a name="TOC21" href="#SEC21">BACKREFERENCES</a>
|
||||
<li><a name="TOC22" href="#SEC22">ASSERTIONS</a>
|
||||
<li><a name="TOC23" href="#SEC23">NON-ATOMIC ASSERTIONS</a>
|
||||
<li><a name="TOC24" href="#SEC24">SCAN SUBSTRING ASSERTIONS</a>
|
||||
<li><a name="TOC25" href="#SEC25">SCRIPT RUNS</a>
|
||||
<li><a name="TOC26" href="#SEC26">CONDITIONAL GROUPS</a>
|
||||
<li><a name="TOC27" href="#SEC27">COMMENTS</a>
|
||||
<li><a name="TOC28" href="#SEC28">RECURSIVE PATTERNS</a>
|
||||
<li><a name="TOC29" href="#SEC29">GROUPS AS SUBROUTINES</a>
|
||||
<li><a name="TOC30" href="#SEC30">ONIGURUMA SUBROUTINE SYNTAX</a>
|
||||
<li><a name="TOC31" href="#SEC31">CALLOUTS</a>
|
||||
<li><a name="TOC32" href="#SEC32">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC33" href="#SEC33">EBCDIC ENVIRONMENTS</a>
|
||||
<li><a name="TOC34" href="#SEC34">SEE ALSO</a>
|
||||
<li><a name="TOC35" href="#SEC35">AUTHOR</a>
|
||||
<li><a name="TOC36" href="#SEC36">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br>
|
||||
<P>
|
||||
The syntax and semantics of the regular expressions that are supported by PCRE2
|
||||
are described in detail below. There is a quick-reference syntax summary in the
|
||||
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
|
||||
page. PCRE2 tries to match Perl syntax and semantics as closely as it can.
|
||||
PCRE2 also supports some alternative regular expression syntax that does not
|
||||
conflict with the Perl syntax in order to provide some compatibility with
|
||||
regular expressions in Python, .NET, and Oniguruma. There are in addition some
|
||||
options that enable alternative syntax and semantics that are not the same as
|
||||
in Perl.
|
||||
</P>
|
||||
<P>
|
||||
Perl's regular expressions are described in its own documentation, and regular
|
||||
expressions in general are covered in a number of books, some of which have
|
||||
copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published
|
||||
by O'Reilly, covers regular expressions in great detail. This description of
|
||||
PCRE2's regular expressions is intended as reference material.
|
||||
</P>
|
||||
<P>
|
||||
This document discusses the regular expression patterns that are supported by
|
||||
PCRE2 when its main matching function, <b>pcre2_match()</b>, is used. PCRE2 also
|
||||
has an alternative matching function, <b>pcre2_dfa_match()</b>, which matches
|
||||
using a different algorithm that is not Perl-compatible. Some of the features
|
||||
discussed below are not available when DFA matching is used. The advantages and
|
||||
disadvantages of the alternative function, and how it differs from the normal
|
||||
function, are discussed in the
|
||||
<a href="pcre2matching.html"><b>pcre2matching</b></a>
|
||||
page.
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
|
||||
<P>
|
||||
Most computers use ASCII or Unicode for encoding characters, and PCRE2 assumes
|
||||
this by default. However, it can be compiled to run in an environment that uses
|
||||
the EBCDIC code, which is the case for some IBM mainframe operating systems. In
|
||||
the sections below, character code values are ASCII or Unicode; in an EBCDIC
|
||||
environment these characters may have different code values, and there are no
|
||||
code points greater than 255. Differences in behaviour when PCRE2 is running in
|
||||
an EBCDIC environment are described in the section
|
||||
<a href="#ebcdicenvironments">"EBCDIC environments"</a>
|
||||
below, which you can ignore unless you really are in an EBCDIC environment.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">SPECIAL START-OF-PATTERN ITEMS</a><br>
|
||||
<P>
|
||||
A number of options that can be passed to <b>pcre2_compile()</b> can also be set
|
||||
by special items at the start of a pattern. These are not Perl-compatible, but
|
||||
are provided to make these options accessible to pattern writers who are not
|
||||
able to change the program that processes the pattern. Any number of these
|
||||
items may appear, but they must all be together right at the start of the
|
||||
pattern string, and the letters must be in upper case.
|
||||
</P>
|
||||
<br><b>
|
||||
UTF support
|
||||
</b><br>
|
||||
<P>
|
||||
In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as
|
||||
single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
|
||||
specified for the 32-bit library, in which case it constrains the character
|
||||
values to valid Unicode code points. To process UTF strings, PCRE2 must be
|
||||
built to include Unicode support (which is the default). When using UTF strings
|
||||
you must either call the compiling function with one or both of the PCRE2_UTF
|
||||
or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
|
||||
sequence (*UTF), which is equivalent to setting the PCRE2_UTF option. How
|
||||
setting a UTF mode affects pattern matching is mentioned in several places
|
||||
below. There is also a summary of features in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
page.
|
||||
</P>
|
||||
<P>
|
||||
Some applications that allow their users to supply patterns may wish to
|
||||
restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
|
||||
option is passed to <b>pcre2_compile()</b>, (*UTF) is not allowed, and its
|
||||
appearance in a pattern causes an error.
|
||||
</P>
|
||||
<br><b>
|
||||
Unicode property support
|
||||
</b><br>
|
||||
<P>
|
||||
Another special sequence that may appear at the start of a pattern is (*UCP).
|
||||
This has the same effect as setting the PCRE2_UCP option: it causes sequences
|
||||
such as \d and \w to use Unicode properties to determine character types,
|
||||
instead of recognizing only characters with codes less than 256 via a lookup
|
||||
table. It also causes upper/lower casing operations to use Unicode properties
|
||||
for characters with code points greater than 127, even when UTF is not set.
|
||||
These behaviours can be changed within the pattern; see the section entitled
|
||||
<a href="#internaloptions">"Internal Option Setting"</a>
|
||||
below.
|
||||
</P>
|
||||
<P>
|
||||
Some applications that allow their users to supply patterns may wish to
|
||||
restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
|
||||
<b>pcre2_compile()</b>, (*UCP) is not allowed, and its appearance in a pattern
|
||||
causes an error.
|
||||
</P>
|
||||
<br><b>
|
||||
Locking out empty string matching
|
||||
</b><br>
|
||||
<P>
|
||||
Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
|
||||
as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
|
||||
matching function is subsequently called to match the pattern. These options
|
||||
lock out the matching of empty strings, either entirely, or only at the start
|
||||
of the subject.
|
||||
</P>
|
||||
<br><b>
|
||||
Disabling auto-possessification
|
||||
</b><br>
|
||||
<P>
|
||||
If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
|
||||
the PCRE2_NO_AUTO_POSSESS option, or calling <b>pcre2_set_optimize()</b> with
|
||||
a PCRE2_AUTO_POSSESS_OFF directive. This stops PCRE2 from making quantifiers
|
||||
possessive when what follows cannot match the repeated item. For example, by
|
||||
default a+b is treated as a++b. For more details, see the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<br><b>
|
||||
Disabling start-up optimizations
|
||||
</b><br>
|
||||
<P>
|
||||
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
|
||||
PCRE2_NO_START_OPTIMIZE option, or calling <b>pcre2_set_optimize()</b> with
|
||||
a PCRE2_START_OPTIMIZE_OFF directive. This disables several optimizations for
|
||||
quickly reaching "no match" results. For more details, see the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<br><b>
|
||||
Disabling automatic anchoring
|
||||
</b><br>
|
||||
<P>
|
||||
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
|
||||
setting the PCRE2_NO_DOTSTAR_ANCHOR option, or calling <b>pcre2_set_optimize()</b>
|
||||
with a PCRE2_DOTSTAR_ANCHOR_OFF directive. This disables optimizations that
|
||||
apply to patterns whose top-level branches all start with .* (match any number
|
||||
of arbitrary characters). For more details, see the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<br><b>
|
||||
Disabling JIT compilation
|
||||
</b><br>
|
||||
<P>
|
||||
If a pattern that starts with (*NO_JIT) is successfully compiled, an attempt by
|
||||
the application to apply the JIT optimization by calling
|
||||
<b>pcre2_jit_compile()</b> is ignored.
|
||||
</P>
|
||||
<br><b>
|
||||
Setting match resource limits
|
||||
</b><br>
|
||||
<P>
|
||||
The <b>pcre2_match()</b> function contains a counter that is incremented every
|
||||
time it goes round its main loop. The caller of <b>pcre2_match()</b> can set a
|
||||
limit on this counter, which therefore limits the amount of computing resource
|
||||
used for a match. The maximum depth of nested backtracking can also be limited;
|
||||
this indirectly restricts the amount of heap memory that is used, but there is
|
||||
also an explicit memory limit that can be set.
|
||||
</P>
|
||||
<P>
|
||||
These facilities are provided to catch runaway matches that are provoked by
|
||||
patterns with huge matching trees. A common example is a pattern with nested
|
||||
unlimited repeats applied to a long string that does not match. When one of
|
||||
these limits is reached, <b>pcre2_match()</b> gives an error return. The limits
|
||||
can also be set by items at the start of the pattern of the form
|
||||
<pre>
|
||||
(*LIMIT_HEAP=d)
|
||||
(*LIMIT_MATCH=d)
|
||||
(*LIMIT_DEPTH=d)
|
||||
</pre>
|
||||
where d is any number of decimal digits. However, the value of the setting must
|
||||
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
|
||||
for it to have any effect. In other words, the pattern writer can lower the
|
||||
limits set by the programmer, but not raise them. If there is more than one
|
||||
setting of one of these limits, the lower value is used. The heap limit is
|
||||
specified in kibibytes (units of 1024 bytes).
|
||||
</P>
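<P>
The rule that a pattern can lower a limit but not raise it amounts to taking a
minimum. The following C sketch shows that arithmetic only; effective_limit is
a hypothetical name used for illustration and does not describe how PCRE2
implements the check internally.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration (not a PCRE2 function): given the limit set or
   defaulted by the caller and the values of any (*LIMIT_...=d) items found
   at the start of a pattern, return the limit that takes effect. A pattern
   setting has an effect only if it is lower than the caller's limit, and
   when there are several settings the lowest one is used. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_limits, size_t n)
{
    uint32_t limit = caller_limit;
    size_t i;

    assert(n == 0 || pattern_limits != NULL);

    for (i = 0; i < n; i++) {
        if (pattern_limits[i] < limit) limit = pattern_limits[i];
    }
    return limit;
}
```

<P>
For the heap limit, the values compared in this way are in kibibytes, so a
pattern item of (*LIMIT_HEAP=64) asks for a limit of 64 * 1024 bytes.
</P>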
|
||||
<P>
|
||||
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
|
||||
still recognized for backwards compatibility.
|
||||
</P>
|
||||
<P>
|
||||
The heap limit applies only when the <b>pcre2_match()</b> or
|
||||
<b>pcre2_dfa_match()</b> interpreters are used for matching. It does not apply
|
||||
to JIT. The match limit is used (but in a different way) when JIT is being
|
||||
used, or when <b>pcre2_dfa_match()</b> is called, to limit computing resource
|
||||
usage by those matching functions. The depth limit is ignored by JIT but is
|
||||
relevant for DFA matching, which uses function recursion for recursions within
|
||||
the pattern and for lookaround assertions and atomic groups. In this case, the
|
||||
depth limit controls the depth of such recursion.
|
||||
<a name="newlines"></a></P>
|
||||
<br><b>
|
||||
Newline conventions
|
||||
</b><br>
|
||||
<P>
|
||||
PCRE2 supports six different conventions for indicating line breaks in
|
||||
strings: a single CR (carriage return) character, a single LF (linefeed)
|
||||
character, the two-character sequence CRLF, any of the three preceding, any
|
||||
Unicode newline sequence, or the NUL character (binary zero). The
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page has
|
||||
<a href="pcre2api.html#newlines">further discussion</a>
|
||||
about newlines, and shows how to set the newline convention when calling
|
||||
<b>pcre2_compile()</b>.
|
||||
</P>
|
||||
<P>
|
||||
It is also possible to specify a newline convention by starting a pattern
|
||||
string with one of the following sequences:
|
||||
<pre>
|
||||
(*CR) carriage return
|
||||
(*LF) linefeed
|
||||
(*CRLF) carriage return, followed by linefeed
|
||||
(*ANYCRLF) any of the three above
|
||||
(*ANY) all Unicode newline sequences
|
||||
(*NUL) the NUL character (binary zero)
|
||||
</pre>
|
||||
These override the default and the options given to the compiling function. For
|
||||
example, on a Unix system where LF is the default newline sequence, the pattern
|
||||
<pre>
|
||||
(*CR)a.b
|
||||
</pre>
|
||||
changes the convention to CR. That pattern matches "a\nb" because LF is no
|
||||
longer a newline. If more than one of these settings is present, the last one
|
||||
is used.
|
||||
</P>
|
||||
<P>
|
||||
The newline convention affects where the circumflex and dollar assertions are
|
||||
true. It also affects the interpretation of the dot metacharacter when
|
||||
PCRE2_DOTALL is not set, and the behaviour of \N when not followed by an
|
||||
opening brace. However, it does not affect what the \R escape sequence
|
||||
matches. By default, this is any Unicode newline sequence, for Perl
|
||||
compatibility. However, this can be changed; see the next section and the
|
||||
description of \R in the section entitled
|
||||
<a href="#newlineseq">"Newline sequences"</a>
|
||||
below. A change of \R setting can be combined with a change of newline
|
||||
convention.
|
||||
</P>
|
||||
<br><b>
|
||||
Specifying what \R matches
|
||||
</b><br>
|
||||
<P>
|
||||
It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
|
||||
complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
|
||||
at compile time. This effect can also be achieved by starting a pattern with
|
||||
(*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
|
||||
corresponding to PCRE2_BSR_UNICODE.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br>
|
||||
<P>
|
||||
A regular expression is a pattern that is matched against a subject string from
|
||||
left to right. Most characters stand for themselves in a pattern, and match the
|
||||
corresponding characters in the subject. As a trivial example, the pattern
|
||||
<pre>
|
||||
The quick brown fox
|
||||
</pre>
|
||||
matches a portion of a subject string that is identical to itself. When
|
||||
caseless matching is specified (the PCRE2_CASELESS option or (?i) within the
|
||||
pattern), letters are matched independently of case. Note that there are two
|
||||
ASCII characters, K and S, that, in addition to their lower case ASCII
|
||||
equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
|
||||
(long S) respectively when either PCRE2_UTF or PCRE2_UCP is set, unless the
|
||||
PCRE2_EXTRA_CASELESS_RESTRICT option is in force (either passed to
|
||||
<b>pcre2_compile()</b> or set by (*CASELESS_RESTRICT) or (?r) within the
|
||||
pattern). If the PCRE2_EXTRA_TURKISH_CASING option is in force (either passed
|
||||
to <b>pcre2_compile()</b> or set by (*TURKISH_CASING) within the pattern), then
|
||||
the 'i' letters are matched according to the casing rules of the Turkish and
Azeri languages.
|
||||
</P>
|
||||
<P>
|
||||
The power of regular expressions comes from the ability to include wild cards,
|
||||
character classes, alternatives, and repetitions in the pattern. These are
|
||||
encoded in the pattern by the use of <i>metacharacters</i>, which do not stand
|
||||
for themselves but instead are interpreted in some special way.
|
||||
</P>
|
||||
<P>
|
||||
There are two different sets of metacharacters: those that are recognized
|
||||
anywhere in the pattern except within square brackets, and those that are
|
||||
recognized within square brackets. Outside square brackets, the metacharacters
|
||||
are as follows:
|
||||
<pre>
|
||||
\ general escape character with several uses
|
||||
^ assert start of string (or line, in multiline mode)
|
||||
$ assert end of string (or line, in multiline mode)
|
||||
. match any character except newline (by default)
|
||||
[ start character class definition
|
||||
| start of alternative branch
|
||||
( start group or control verb
|
||||
) end group or control verb
|
||||
* 0 or more quantifier
|
||||
+ 1 or more quantifier; also "possessive quantifier"
|
||||
? 0 or 1 quantifier; also quantifier minimizer
|
||||
{ potential start of min/max quantifier
|
||||
</pre>
|
||||
Brace characters { and } are also used to enclose data for constructions such
|
||||
as \g{2} or \k{name}. In almost all uses of braces, space and/or horizontal
|
||||
tab characters that follow { or precede } are allowed and are ignored. In the
|
||||
case of quantifiers, they may also appear before or after the comma. The
|
||||
exception to this is \u{...} which is an ECMAScript compatibility feature
|
||||
that is recognized only when the PCRE2_EXTRA_ALT_BSUX option is set. ECMAScript
|
||||
does not ignore such white space; it causes the item to be interpreted as
|
||||
literal.
|
||||
</P>
|
||||
<P>
|
||||
Part of a pattern that is in square brackets is called a "character class". In
|
||||
a character class the only metacharacters are:
|
||||
<pre>
|
||||
\ general escape character
|
||||
^ negate the class, but only if the first character
|
||||
- indicates character range
|
||||
[ POSIX character class (if followed by POSIX syntax)
|
||||
] terminates the character class
|
||||
</pre>
|
||||
If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
|
||||
the pattern, other than in a character class, within a \Q...\E sequence, or
|
||||
between a # outside a character class and the next newline, inclusive, is
|
||||
ignored. An escaping backslash can be used to include a white space or a #
|
||||
character as part of the pattern. If the PCRE2_EXTENDED_MORE option is set, the
|
||||
same applies, but in addition unescaped space and horizontal tab characters are
|
||||
ignored inside a character class. Note: only these two characters are ignored,
|
||||
not the full set of pattern white space characters that are ignored outside a
|
||||
character class. Option settings can be changed within a pattern; see the
|
||||
section entitled
|
||||
<a href="#internaloptions">"Internal Option Setting"</a>
|
||||
below.
|
||||
</P>
|
||||
<P>
|
||||
The following sections describe the use of each of the metacharacters.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">BACKSLASH</a><br>
|
||||
<P>
|
||||
The backslash character has several uses. Firstly, if it is followed by a
|
||||
character that is not a digit or a letter, it takes away any special meaning
|
||||
that character may have. This use of backslash as an escape character applies
|
||||
both inside and outside character classes.
|
||||
</P>
|
||||
<P>
|
||||
For example, if you want to match a * character, you must write \* in the
|
||||
pattern. This escaping action applies whether or not the following character
|
||||
would otherwise be interpreted as a metacharacter, so it is always safe to
|
||||
precede a non-alphanumeric with backslash to specify that it stands for itself.
|
||||
In particular, if you want to match a backslash, you write \\.
|
||||
</P>
|
||||
<P>
|
||||
Only ASCII digits and letters have any special meaning after a backslash. All
|
||||
other characters (in particular, those whose code points are greater than 127)
|
||||
are treated as literals.
|
||||
</P>
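<P>
The escaping rule in the paragraphs above can be sketched as a small helper.
This is an illustration in Python, not part of PCRE2, and the name
<b>quote_literal</b> is hypothetical. It puts a backslash before every ASCII
character that is not a letter or digit and leaves everything else alone.
</P>

```python
def quote_literal(text: str) -> str:
    """Backslash-escape every ASCII non-alphanumeric character.

    ASCII letters and digits are left alone because a backslash before
    them may start an escape sequence. Characters above 127 are left
    alone because they are literals with or without a backslash.
    """
    out = []
    for ch in text:
        is_ascii_alnum = ch.isascii() and ch.isalnum()
        if ch.isascii() and not is_ascii_alnum:
            out.append("\\")
        out.append(ch)
    return "".join(out)


print(quote_literal("a*b"))      # a\*b
print(quote_literal("1+1=2?"))   # 1\+1\=2\?
print(quote_literal("\\"))       # \\
```

<P>
The helper escapes more than is strictly necessary (for example "="), which
is harmless because a backslash before a non-alphanumeric character is
always safe.
</P>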
|
||||
<P>
|
||||
If you want to treat all characters in a sequence as literals, you can do so by
|
||||
putting them between \Q and \E. Note that this includes white space even when
|
||||
the PCRE2_EXTENDED option is set so that most other white space is ignored. The
|
||||
behaviour is different from Perl in that $ and @ are handled as literals in
|
||||
\Q...\E sequences in PCRE2, whereas in Perl, $ and @ cause variable
|
||||
interpolation. Also, Perl does "double-quotish backslash interpolation" on any
|
||||
backslashes between \Q and \E which, its documentation says, "may lead to
|
||||
confusing results". PCRE2 treats a backslash between \Q and \E just like any
|
||||
other character. Note the following examples:
|
||||
<pre>
|
||||
Pattern PCRE2 matches Perl matches
|
||||
|
||||
\Qabc$xyz\E abc$xyz abc followed by the contents of $xyz
|
||||
\Qabc\$xyz\E abc\$xyz abc\$xyz
|
||||
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
|
||||
\QA\B\E A\B A\B
|
||||
\Q\\E \ \\E
|
||||
</pre>
|
||||
The \Q...\E sequence is recognized both inside and outside character classes.
|
||||
An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
|
||||
by \E later in the pattern, the literal interpretation continues to the end of
|
||||
the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
|
||||
a character class, this causes an error, because the character class is then
|
||||
not terminated by a closing square bracket.
|
||||
</P>
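<P>
As a rough model of the \Q...\E rules above, the following Python sketch
rewrites a pattern into an equivalent one that contains no \Q or \E. It is
not how PCRE2 is implemented, and the names <b>escape_literal</b> and
<b>expand_quoted</b> are hypothetical. It covers three rules: text between
\Q and \E is literal, an isolated \E is ignored, and an unterminated \Q runs
to the end of the pattern. It does not model character classes or
quantifiers.
</P>

```python
def escape_literal(text: str) -> str:
    # Backslash every ASCII non-alphanumeric character so it stands for itself.
    return "".join(
        "\\" + ch if ch.isascii() and not ch.isalnum() else ch for ch in text
    )


def expand_quoted(pattern: str) -> str:
    """Return a pattern with every quoted run replaced by escaped literals."""
    out = []
    i = 0
    n = len(pattern)
    while i < n:
        two = pattern[i:i + 2]
        if two == "\\Q":
            end = pattern.find("\\E", i + 2)
            if end == -1:
                # No closing \E: the literal run continues to the end.
                literal, i = pattern[i + 2:], n
            else:
                literal, i = pattern[i + 2:end], end + 2
            out.append(escape_literal(literal))
        elif two == "\\E":
            i += 2  # an isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < n:
            out.append(two)  # ordinary escape: copy both characters
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


print(expand_quoted(r"\Qabc$xyz\E"))       # abc\$xyz
print(expand_quoted(r"\Qabc\$xyz\E"))      # abc\\\$xyz
print(expand_quoted(r"\Qabc\E\$\Qxyz\E"))  # abc\$xyz
print(expand_quoted(r"\QA\B\E"))           # A\\B
print(expand_quoted(r"\Q\\E"))             # \\
```

<P>
Each output is an ordinary pattern that matches the string shown in the
"PCRE2 matches" column of the table above.
</P>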
|
||||
<P>
|
||||
Another difference from Perl is that any appearance of \Q or \E inside what
|
||||
might otherwise be a quantifier causes PCRE2 not to recognize the sequence as a
|
||||
quantifier. Perl recognizes a quantifier if (redundantly) either of the numbers
|
||||
is inside \Q...\E, but not if the separating comma is. When not recognized as
|
||||
a quantifier a sequence such as {\Q1\E,2} is treated as the literal string
|
||||
"{1,2}".
|
||||
<a name="digitsafterbackslash"></a></P>
|
||||
<br><b>
|
||||
Non-printing characters
|
||||
</b><br>
|
||||
<P>
|
||||
A second use of backslash provides a way of encoding non-printing characters
|
||||
in patterns in a visible manner. There is no restriction on the appearance of
|
||||
non-printing characters in a pattern, but when a pattern is being prepared by
|
||||
text editing, it is often easier to use one of the following escape sequences
|
||||
instead of the binary character it represents. In an ASCII or Unicode
|
||||
environment, these escapes are as follows:
|
||||
<pre>
|
||||
\a alarm, that is, the BEL character (hex 07)
|
||||
\cx "control-x", where x is a non-control ASCII character
|
||||
\e escape (hex 1B)
|
||||
\f form feed (hex 0C)
|
||||
\n linefeed (hex 0A)
|
||||
\r carriage return (hex 0D) (but see below)
|
||||
\t tab (hex 09)
|
||||
\0dd character with octal code 0dd
|
||||
\ddd character with octal code ddd, or back reference
|
||||
\o{ddd..} character with octal code ddd..
|
||||
\xhh character with hex code hh
|
||||
\x{hhh..} character with hex code hhh..
|
||||
\N{U+hhh..} character with Unicode hex code point hhh..
|
||||
</pre>
|
||||
A description of how back references work is given
|
||||
<a href="#backreferences">later,</a>
|
||||
following the discussion of
|
||||
<a href="#group">parenthesized groups.</a>
|
||||
</P>
|
||||
<P>
|
||||
By default, after \x that is not followed by {, one or two hexadecimal
|
||||
digits are read (letters can be in upper or lower case). If the character that
|
||||
follows \x is neither { nor a hexadecimal digit, an error occurs. This is
|
||||
different from Perl's default behaviour, which generates a NUL character, but
|
||||
is in line with the behaviour of Perl's 'strict' mode in re.
|
||||
</P>
|
||||
<P>
|
||||
Any number of hexadecimal digits may appear between \x{ and }. If a character
|
||||
other than a hexadecimal digit appears between \x{ and }, or if there is no
|
||||
terminating }, an error occurs.
|
||||
</P>
|
||||
<P>
|
||||
Characters whose code points are less than 256 can be defined by either of the
|
||||
two syntaxes for \x or by an octal sequence. There is no difference in the way
|
||||
they are handled. For example, \xdc is exactly the same as \x{dc} or \334.
|
||||
However, using the braced versions does make such sequences easier to read.
|
||||
</P>
|
||||
<P>
|
||||
Support is available for some ECMAScript (aka JavaScript) escape sequences via
|
||||
two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \x followed
|
||||
by { is not recognized. Only if \x is followed by two hexadecimal digits is it
|
||||
recognized as a character escape. Otherwise it is interpreted as a literal "x"
|
||||
character. In this mode, support for code points greater than 255 is provided
|
||||
by \u, which must be followed by four hexadecimal digits; otherwise it is
|
||||
interpreted as a literal "u" character.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
|
||||
\u{hhh..} is recognized as the character specified by hexadecimal code point.
|
||||
There may be any number of hexadecimal digits, but unlike other places that
|
||||
also use curly brackets, spaces are not allowed and would result in the string
|
||||
being interpreted as a literal. This syntax is from ECMAScript 6.
|
||||
</P>
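<P>
The decision rules for \x and \u under PCRE2_ALT_BSUX and
PCRE2_EXTRA_ALT_BSUX can be sketched as follows. This Python model is only
an illustration and <b>decode_bsux</b> is a hypothetical name. It assumes
that an empty \u{} is treated as a literal, and it does not model the limits
on code point values.
</P>

```python
import string

HEX = set(string.hexdigits)


def decode_bsux(letter: str, rest: str, extra: bool = False):
    """Decode backslash-x or backslash-u under the ECMAScript-style options.

    letter is "x" or "u"; rest is the pattern text after that letter.
    Returns (code_point, consumed). A code_point of None means the letter
    is taken as a literal and nothing further is consumed.
    """
    width = {"x": 2, "u": 4}[letter]
    fixed = rest[:width]
    if len(fixed) == width and all(c in HEX for c in fixed):
        return int(fixed, 16), width
    if extra and letter == "u" and rest.startswith("{"):
        close = rest.find("}")
        digits = rest[1:close] if close != -1 else ""
        # Spaces are not allowed here, so any non-hex character (or an
        # empty body, which this sketch assumes is also rejected) falls
        # through to the literal case.
        if digits and all(c in HEX for c in digits):
            return int(digits, 16), close + 1
    return None, 0


print(decode_bsux("x", "41z"))                  # (65, 2)
print(decode_bsux("x", "{41}"))                 # (None, 0)
print(decode_bsux("u", "00e9"))                 # (233, 4)
print(decode_bsux("u", "{1F600}", extra=True))  # (128512, 7)
print(decode_bsux("u", "{ 41 }", extra=True))   # (None, 0)
```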
|
||||
<P>
|
||||
The \N{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
|
||||
UTF mode. Perl also uses \N{name} to specify characters by Unicode name; PCRE2
|
||||
does not support this. Note that when \N is not followed by an opening brace
|
||||
(curly bracket) it has an entirely different meaning, matching any character
|
||||
that is not a newline.
|
||||
</P>
|
||||
<P>
|
||||
There are some legacy applications where the escape sequence \r is expected to
|
||||
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \r in a
|
||||
pattern is converted to \n so that it matches a LF (linefeed) instead of a CR
|
||||
(carriage return) character.
|
||||
</P>
|
||||
<P>
|
||||
A compile-time error occurs if \c is not followed by a character whose ASCII
code point is in the range 32 to 126. The precise effect of \cx is as follows:
if x is a lower case letter, it is converted to upper case. Then bit 6 of the
character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is
41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is
3B).
|
||||
</P>
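<P>
The arithmetic for \cx described above is small enough to write out. The
following Python sketch is illustrative only; <b>control_escape</b> is a
hypothetical name, and the exception stands in for the compile-time error.
</P>

```python
def control_escape(x: str) -> int:
    """Return the code point produced by backslash-c followed by x."""
    code = ord(x)
    if not 32 <= code <= 126:
        raise ValueError("character after \\c must be in the range 32 to 126")
    if "a" <= x <= "z":
        code = ord(x.upper())  # lower case letters are converted first
    return code ^ 0x40         # invert bit 6 (hex 40)


print(hex(control_escape("A")))  # 0x1
print(hex(control_escape("z")))  # 0x1a
print(hex(control_escape("{")))  # 0x3b
print(hex(control_escape(";")))  # 0x7b
```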
|
||||
<P>
|
||||
For differences in the way some escapes behave in EBCDIC environments,
|
||||
see section
|
||||
<a href="#ebcdicenvironments">"EBCDIC environments"</a>
|
||||
below.
|
||||
</P>
|
||||
<br><b>
|
||||
Octal escapes and back references
|
||||
</b><br>
|
||||
<P>
|
||||
The escape \o must be followed by a sequence of octal digits, enclosed in
|
||||
braces. An error occurs if this is not the case. This escape provides a way of
|
||||
specifying character code points as octal numbers greater than 0777, and it
|
||||
also allows octal numbers and backreferences to be unambiguously distinguished.
|
||||
</P>
|
||||
<P>
|
||||
If braces are not used, after \0 up to two further octal digits are read.
|
||||
However, if the PCRE2_EXTRA_NO_BS0 option is set, at least one more octal digit
|
||||
must follow \0 (use \00 to generate a NUL character). Make sure you supply
|
||||
two digits after the initial zero if the pattern character that follows is
|
||||
itself an octal digit.
|
||||
</P>
|
||||
<P>
|
||||
Inside a character class, when a backslash is followed by any octal digit, up
|
||||
to three octal digits are read to generate a code point. Any subsequent digits
|
||||
stand for themselves. The sequences \8 and \9 are treated as the literal
|
||||
characters "8" and "9".
|
||||
</P>
|
||||
<P>
|
||||
Outside a character class, Perl's handling of a backslash followed by a digit
|
||||
other than 0 is complicated by ambiguity, and Perl has changed over time,
|
||||
causing PCRE2 also to change. From PCRE2 release 10.45 there is an option
|
||||
called PCRE2_EXTRA_PYTHON_OCTAL that causes PCRE2 to use Python's unambiguous
|
||||
rules. The next two subsections describe the two sets of rules.
|
||||
</P>
|
||||
<P>
|
||||
For greater clarity and unambiguity, it is best to avoid following \ by a
|
||||
digit greater than zero. Instead, use \o{...} or \x{...} to specify numerical
|
||||
character code points, and \g{...} to specify backreferences.
|
||||
</P>
|
||||
<br><b>
|
||||
Perl rules for non-class backslash 1-9
|
||||
</b><br>
|
||||
<P>
|
||||
All the digits that follow the backslash are read as a decimal number. If the
|
||||
number is less than 10, begins with the digit 8 or 9, or if there are at least
|
||||
that many previous capture groups in the expression, the entire sequence is
|
||||
taken as a back reference. Otherwise, up to three octal digits are read to form
|
||||
a character code. For example:
|
||||
<pre>
|
||||
\040 is another way of writing an ASCII space
|
||||
\40 is the same, provided there are fewer than 40 previous capture groups
|
||||
\7 is always a backreference
|
||||
\11 might be a backreference, or another way of writing a tab
|
||||
\011 is always a tab
|
||||
\0113 is a tab followed by the character "3"
|
||||
\113 might be a backreference, otherwise the character with octal code 113
|
||||
\377 might be a backreference, otherwise the value 255 (decimal)
|
||||
\81 is always a backreference
|
||||
</pre>
|
||||
Note that octal values of 100 or greater that are specified using this syntax
|
||||
must not be introduced by a leading zero, because no more than three octal
|
||||
digits are ever read.
|
||||
</P>
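<P>
The Perl rules above, together with the earlier rule for \0, can be modelled
by a short classifier. This Python sketch is an illustration, not PCRE2's
parser; the name <b>classify_backslash_digits</b> and its return format are
hypothetical. It takes the run of decimal digits that follows the backslash
and the number of capture groups opened so far, and assumes the default
behaviour with PCRE2_EXTRA_NO_BS0 unset.
</P>

```python
OCTAL = "01234567"


def _octal_prefix(digits: str, limit: int = 3) -> int:
    # Length of the leading run of octal digits, capped at `limit`.
    n = 0
    while n < limit and n < len(digits) and digits[n] in OCTAL:
        n += 1
    return n


def classify_backslash_digits(digits: str, previous_groups: int):
    """Return (kind, value, rest) for a backslash followed by `digits`.

    kind is "backref" or "char"; rest holds digits that were not consumed
    and therefore stand for themselves.
    """
    if digits[0] == "0":
        # A zero plus up to two further octal digits is always a character.
        n = _octal_prefix(digits)
        return "char", int(digits[:n], 8), digits[n:]
    number = int(digits)
    if number < 10 or digits[0] in "89" or number <= previous_groups:
        return "backref", number, ""
    n = _octal_prefix(digits)
    return "char", int(digits[:n], 8), digits[n:]


print(classify_backslash_digits("40", 0))    # ('char', 32, '')
print(classify_backslash_digits("40", 40))   # ('backref', 40, '')
print(classify_backslash_digits("7", 0))     # ('backref', 7, '')
print(classify_backslash_digits("0113", 0))  # ('char', 9, '3')
print(classify_backslash_digits("81", 0))    # ('backref', 81, '')
```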
|
||||
<br><b>
|
||||
Python rules for non-class backslash 1-9
|
||||
</b><br>
|
||||
<P>
|
||||
If there are at least three octal digits after the backslash, exactly three are
|
||||
read as an octal code point number, but the value must be no greater than
|
||||
\377, even in modes where higher code point values are supported. Any
|
||||
subsequent digits stand for themselves. If there are fewer than three octal
|
||||
digits, the sequence is taken as a decimal back reference. Thus, for example,
|
||||
\12 is always a back reference, independent of how many captures there are in
|
||||
the pattern. An error is generated for a reference to a non-existent capturing
|
||||
group.
|
||||
</P>
|
||||
<br><b>
|
||||
Constraints on character values
|
||||
</b><br>
|
||||
<P>
|
||||
Characters that are specified using octal or hexadecimal numbers are
|
||||
limited to certain values, as follows:
|
||||
<pre>
|
||||
8-bit non-UTF mode no greater than 0xff
|
||||
16-bit non-UTF mode no greater than 0xffff
|
||||
32-bit non-UTF mode no greater than 0xffffffff
|
||||
All UTF modes no greater than 0x10ffff and a valid code point
|
||||
</pre>
|
||||
Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
|
||||
so-called "surrogate" code points). The check for these can be disabled by the
|
||||
caller of <b>pcre2_compile()</b> by setting the option
|
||||
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8
|
||||
and UTF-32 modes, because these values are not representable in UTF-16.
|
||||
</P>
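<P>
The limits in the table above, together with the surrogate rule, amount to a
simple range check. The following Python sketch restates them; the function
<b>escape_value_allowed</b> and its parameters are hypothetical and do not
correspond to anything in the PCRE2 API.
</P>

```python
def escape_value_allowed(value: int, code_unit_width: int, utf: bool,
                         allow_surrogates: bool = False) -> bool:
    """Check a numeric escape value against the limits for the given mode.

    code_unit_width is 8, 16, or 32. allow_surrogates models the
    PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option.
    """
    if utf:
        if value > 0x10FFFF:
            return False
        if 0xD800 <= value <= 0xDFFF:
            # Surrogates are representable only in UTF-8 and UTF-32.
            return allow_surrogates and code_unit_width != 16
        return True
    limit = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}[code_unit_width]
    return value <= limit


print(escape_value_allowed(0x100, 8, utf=False))     # False
print(escape_value_allowed(0xD800, 16, utf=False))   # True
print(escape_value_allowed(0xD800, 8, utf=True))     # False
print(escape_value_allowed(0xD800, 8, utf=True, allow_surrogates=True))  # True
print(escape_value_allowed(0x110000, 32, utf=True))  # False
```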
|
||||
<br><b>
|
||||
Escape sequences in character classes
|
||||
</b><br>
|
||||
<P>
|
||||
All the sequences that define a single character value can be used both inside
|
||||
and outside character classes. In addition, inside a character class, \b is
|
||||
interpreted as the backspace character (hex 08).
|
||||
</P>
|
||||
<P>
|
||||
When not followed by an opening brace, \N is not allowed in a character class.
|
||||
\B, \R, and \X are not special inside a character class. Like other
|
||||
unrecognized alphabetic escape sequences, they cause an error. Outside a
|
||||
character class, these sequences have different meanings.
|
||||
</P>
|
||||
<br><b>
|
||||
Unsupported escape sequences
|
||||
</b><br>
|
||||
<P>
|
||||
In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its string
|
||||
handler and used to modify the case of following characters. By default, PCRE2
|
||||
does not support these escape sequences in patterns. However, if either of the
|
||||
PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U matches a "U"
|
||||
character, and \u can be used to define a character by code point, as
|
||||
described above.
|
||||
</P>
|
||||
<br><b>
|
||||
Absolute and relative backreferences
|
||||
</b><br>
|
||||
<P>
|
||||
The sequence \g followed by a signed or unsigned number, optionally enclosed
|
||||
in braces, is an absolute or relative backreference. A named backreference
|
||||
can be coded as \g{name}. Backreferences are discussed
|
||||
<a href="#backreferences">later,</a>
|
||||
following the discussion of
|
||||
<a href="#group">parenthesized groups.</a>
|
||||
</P>
|
||||
<br><b>
|
||||
Absolute and relative subroutine calls
|
||||
</b><br>
|
||||
<P>
|
||||
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
|
||||
a number enclosed either in angle brackets or single quotes, is an alternative
|
||||
syntax for referencing a capture group as a subroutine. Details are discussed
|
||||
<a href="#onigurumasubroutines">later.</a>
|
||||
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i>
|
||||
synonymous. The former is a backreference; the latter is a
|
||||
<a href="#groupsassubroutines">subroutine</a>
|
||||
call.
|
||||
<a name="genericchartypes"></a></P>
|
||||
<br><b>
|
||||
Generic character types
|
||||
</b><br>
|
||||
<P>
|
||||
Another use of backslash is for specifying generic character types:
|
||||
<pre>
|
||||
\d any decimal digit
|
||||
\D any character that is not a decimal digit
|
||||
\h any horizontal white space character
|
||||
\H any character that is not a horizontal white space character
|
||||
\N any character that is not a newline
|
||||
\s any white space character
|
||||
\S any character that is not a white space character
|
||||
\v any vertical white space character
|
||||
\V any character that is not a vertical white space character
|
||||
\w any "word" character
|
||||
\W any "non-word" character
|
||||
</pre>
|
||||
The \N escape sequence has the same meaning as
|
||||
<a href="#fullstopdot">the "." metacharacter</a>
|
||||
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
|
||||
meaning of \N. Note that when \N is followed by an opening brace it has a
|
||||
different meaning. See the section entitled
|
||||
<a href="#digitsafterbackslash">"Non-printing characters"</a>
|
||||
above for details. Perl also uses \N{name} to specify characters by Unicode
|
||||
name; PCRE2 does not support this.
|
||||
</P>
|
||||
<P>
|
||||
Each pair of lower and upper case escape sequences partitions the complete set
|
||||
of characters into two disjoint sets. Any given character matches one, and only
|
||||
one, of each pair. The sequences can appear both inside and outside character
|
||||
classes. They each match one character of the appropriate type. If the current
|
||||
matching point is at the end of the subject string, all of them fail, because
|
||||
there is no character to match.
|
||||
</P>
|
||||
<P>
|
||||
The default \s characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
|
||||
space (32), which are defined as white space in the "C" locale. This list may
|
||||
vary if locale-specific matching is taking place. For example, in some locales
|
||||
the "non-breaking space" character (\xA0) is recognized as white space, and in
|
||||
others the VT character is not.
|
||||
</P>
|
||||
<P>
|
||||
A "word" character is an underscore or any character that is a letter or digit.
|
||||
By default, the definition of letters and digits is controlled by PCRE2's
|
||||
low-valued character tables, and may vary if locale-specific matching is taking
|
||||
place (see
|
||||
<a href="pcre2api.html#localesupport">"Locale support"</a>
|
||||
in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page). For example, in a French locale such as "fr_FR" in Unix-like systems,
|
||||
or "french" in Windows, some character codes greater than 127 are used for
|
||||
accented letters, and these are then matched by \w. The use of locales with
|
||||
Unicode is discouraged.
|
||||
</P>
|
||||
<P>
|
||||
By default, characters whose code points are greater than 127 never match \d,
|
||||
\s, or \w, and always match \D, \S, and \W, although this may be different
|
||||
for characters in the range 128-255 when locale-specific matching is happening.
|
||||
These escape sequences retain their original meanings from before Unicode
|
||||
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
|
||||
is set, the behaviour is changed so that Unicode properties are used to
|
||||
determine character types, as follows:
|
||||
<pre>
|
||||
\d any character that matches \p{Nd} (decimal digit)
|
||||
\s any character that matches \p{Z} or \h or \v
|
||||
\w any character that matches \p{L}, \p{N}, \p{Mn}, or \p{Pc}
|
||||
</pre>
|
||||
The addition of \p{Mn} (non-spacing mark) and the replacement of an explicit
|
||||
test for underscore with a test for \p{Pc} (connector punctuation) happened in
|
||||
PCRE2 release 10.43. This brings PCRE2 into line with Perl.
|
||||
</P>
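<P>
The PCRE2_UCP definitions of \d and \w above can be approximated with
Python's <b>unicodedata</b> module. This is only a sketch: the function
names are hypothetical, and Python's Unicode tables may be a different
version from those built into PCRE2, so results can differ for recently
added characters.
</P>

```python
import unicodedata


def ucp_digit(ch: str) -> bool:
    # Decimal digits only: general category Nd.
    return unicodedata.category(ch) == "Nd"


def ucp_word(ch: str) -> bool:
    # Any letter (L*), any number (N*), non-spacing mark, or connector
    # punctuation (which includes the underscore).
    cat = unicodedata.category(ch)
    return cat[0] in "LN" or cat in ("Mn", "Pc")


print(ucp_word("_"))        # True  (Pc)
print(ucp_word("\u0301"))   # True  (combining acute accent, Mn)
print(ucp_word("\u00b2"))   # True  (superscript two, No)
print(ucp_digit("\u00b2"))  # False (not a decimal digit)
print(ucp_digit("\u0663"))  # True  (Arabic-Indic digit three, Nd)
```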
|
||||
<P>
|
||||
The upper case escapes match the inverse sets of characters. Note that \d
|
||||
matches only decimal digits, whereas \w matches any Unicode digit, as well as
|
||||
other character categories. Note also that PCRE2_UCP affects \b and
|
||||
\B because they are defined in terms of \w and \W. Matching these sequences
|
||||
is noticeably slower when PCRE2_UCP is set.
|
||||
</P>
|
||||
<P>
|
||||
The effect of PCRE2_UCP on any one of these escape sequences can be negated by
|
||||
the options PCRE2_EXTRA_ASCII_BSD, PCRE2_EXTRA_ASCII_BSS, and
|
||||
PCRE2_EXTRA_ASCII_BSW, respectively. These options can be set and reset within
|
||||
a pattern by means of an internal option setting
|
||||
<a href="#internaloptions">(see below).</a>
|
||||
</P>
|
||||
<P>
|
||||
The sequences \h, \H, \v, and \V, in contrast to the other sequences, which
|
||||
match only ASCII characters by default, always match a specific list of code
|
||||
points, whether or not PCRE2_UCP is set. The horizontal space characters are:
|
||||
<pre>
|
||||
U+0009 Horizontal tab (HT)
|
||||
U+0020 Space
|
||||
U+00A0 Non-break space
|
||||
U+1680 Ogham space mark
|
||||
U+180E Mongolian vowel separator
|
||||
U+2000 En quad
|
||||
U+2001 Em quad
|
||||
U+2002 En space
|
||||
U+2003 Em space
|
||||
U+2004 Three-per-em space
|
||||
U+2005 Four-per-em space
|
||||
U+2006 Six-per-em space
|
||||
U+2007 Figure space
|
||||
U+2008 Punctuation space
|
||||
U+2009 Thin space
|
||||
U+200A Hair space
|
||||
U+202F Narrow no-break space
|
||||
U+205F Medium mathematical space
|
||||
U+3000 Ideographic space
|
||||
</pre>
|
||||
The vertical space characters are:
|
||||
<pre>
|
||||
U+000A Linefeed (LF)
|
||||
U+000B Vertical tab (VT)
|
||||
U+000C Form feed (FF)
|
||||
U+000D Carriage return (CR)
|
||||
U+0085 Next line (NEL)
|
||||
U+2028 Line separator
|
||||
U+2029 Paragraph separator
|
||||
</pre>
|
||||
In 8-bit, non-UTF-8 mode, only the characters with code points less than 256
|
||||
are relevant.
|
||||
<a name="newlineseq"></a></P>
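<P>
For reference, the two lists above can be written as fixed sets of code
points. The following Python sketch does so; the names are hypothetical.
The last lines show which members remain relevant in 8-bit non-UTF mode.
</P>

```python
# Code points matched by \h, copied from the list above.
HORIZONTAL = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
     *range(0x2000, 0x200B),          # U+2000 to U+200A inclusive
     0x202F, 0x205F, 0x3000]
)

# Code points matched by \v, copied from the list above.
VERTICAL = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])


def is_horizontal_space(ch: str) -> bool:
    return ord(ch) in HORIZONTAL


def is_vertical_space(ch: str) -> bool:
    return ord(ch) in VERTICAL


print(is_horizontal_space("\u00a0"))              # True
print(is_vertical_space("\u2028"))                # True
print(sorted(c for c in HORIZONTAL if c < 256))   # [9, 32, 160]
print(sorted(c for c in VERTICAL if c < 256))     # [10, 11, 12, 13, 133]
```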
|
||||
<br><b>
|
||||
Newline sequences
|
||||
</b><br>
|
||||
<P>
|
||||
Outside a character class, by default, the escape sequence \R matches any
|
||||
Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the
|
||||
following:
|
||||
<pre>
|
||||
(?>\r\n|\n|\x0b|\f|\r|\x85)
|
||||
</pre>
|
||||
This is an example of an "atomic group", details of which are given
|
||||
<a href="#atomicgroup">below.</a>
|
||||
This particular group matches either the two-character sequence CR followed by
|
||||
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
|
||||
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
|
||||
line, U+0085). Because this is an atomic group, the two-character sequence is
|
||||
treated as a single unit that cannot be split.
|
||||
</P>
|
||||
<P>
|
||||
In other modes, two additional characters whose code points are greater than 255
|
||||
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
|
||||
Unicode support is not needed for these characters to be recognized.
|
||||
</P>
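<P>
The default behaviour of \R described above can be modelled without a regex
engine. In this Python sketch (hypothetical name <b>match_newline_seq</b>)
the function returns a single answer for a given position, so a CRLF pair is
never split; that is the effect the atomic group provides. The
<b>eight_bit</b> flag stands for 8-bit non-UTF-8 mode.
</P>

```python
def match_newline_seq(subject: str, pos: int, eight_bit: bool = False) -> int:
    """Return the length of the newline sequence at pos, or 0 if none."""
    if subject.startswith("\r\n", pos):
        return 2  # CRLF is taken as one unit
    if pos >= len(subject):
        return 0
    singles = "\n\x0b\f\r\x85"
    if not eight_bit:
        singles += "\u2028\u2029"  # LS and PS exist only in wider modes
    return 1 if subject[pos] in singles else 0


print(match_newline_seq("a\r\nb", 1))                # 2
print(match_newline_seq("a\rb", 1))                  # 1
print(match_newline_seq("\u2028", 0))                # 1
print(match_newline_seq("\u2028", 0, eight_bit=True))  # 0
print(match_newline_seq("abc", 1))                   # 0
```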
|
||||
<P>
|
||||
It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
|
||||
complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
|
||||
at compile time. (BSR is an abbreviation for "backslash R".) This can be made
|
||||
the default when PCRE2 is built; if this is the case, the other behaviour can
|
||||
be requested via the PCRE2_BSR_UNICODE option. It is also possible to specify
|
||||
these settings by starting a pattern string with one of the following
|
||||
sequences:
|
||||
<pre>
|
||||
(*BSR_ANYCRLF) CR, LF, or CRLF only
|
||||
(*BSR_UNICODE) any Unicode newline sequence
|
||||
</pre>
|
||||
These override the default and the options given to the compiling function.
|
||||
Note that these special settings, which are not Perl-compatible, are recognized
|
||||
only at the very start of a pattern, and that they must be in upper case. If
|
||||
more than one of them is present, the last one is used. They can be combined
|
||||
with a change of newline convention; for example, a pattern can start with:
|
||||
<pre>
|
||||
(*ANY)(*BSR_ANYCRLF)
|
||||
</pre>
|
||||
They can also be combined with the (*UTF) or (*UCP) special sequences. Inside a
|
||||
character class, \R is treated as an unrecognized escape sequence, and causes
|
||||
an error.
|
||||
<a name="uniextseq"></a></P>
|
||||
<br><b>
|
||||
Unicode character properties
|
||||
</b><br>
|
||||
<P>
|
||||
When PCRE2 is built with Unicode support (the default), three additional escape
|
||||
sequences that match characters with specific properties are available. They
|
||||
can be used in any mode, though in 8-bit and 16-bit non-UTF modes these
|
||||
sequences are of course limited to testing characters whose code points are
|
||||
less than U+0100 or U+10000, respectively. In 32-bit non-UTF mode, code points
|
||||
greater than 0x10ffff (the Unicode limit) may be encountered. These are all
|
||||
treated as being in the Unknown script and with an unassigned type.
|
||||
</P>
|
||||
<P>
|
||||
Matching characters by Unicode property is not fast, because PCRE2 has to do a
|
||||
multistage table lookup in order to find a character's property. That is why
|
||||
the traditional escape sequences such as \d and \w do not use Unicode
|
||||
properties in PCRE2 by default, though you can make them do so by setting the
|
||||
PCRE2_UCP option or by starting the pattern with (*UCP).
|
||||
</P>
|
||||
<P>
|
||||
The extra escape sequences that provide property support are:
|
||||
<pre>
|
||||
\p{<i>xx</i>} a character with the <i>xx</i> property
|
||||
\P{<i>xx</i>} a character without the <i>xx</i> property
|
||||
\X a Unicode extended grapheme cluster
|
||||
</pre>
|
||||
For compatibility with Perl, negation can be specified by including a
|
||||
circumflex between the opening brace and the property. For example, \p{^Lu} is
|
||||
the same as \P{Lu}.
|
||||
</P>
|
||||
<P>
|
||||
In accordance with Unicode's "loose matching" rules, ASCII white space
|
||||
characters, hyphens, and underscores are ignored in the properties represented
|
||||
by <i>xx</i> above. As well as the space character, ASCII white space can be
|
||||
tab, linefeed, vertical tab, formfeed, or carriage return.
|
||||
</P>
|
||||
<P>
|
||||
Some properties are specified as a name only; others as a name and a value,
|
||||
separated by a colon or an equals sign. The names and values consist of ASCII
|
||||
letters and digits (with one Perl-specific exception, see below). They are not
|
||||
case sensitive. Note, however, that the escapes themselves, \p and \P,
|
||||
<i>are</i> case sensitive. There are abbreviations for many names. The following
|
||||
examples are all equivalent:
|
||||
<pre>
|
||||
\p{bidiclass=al}
|
||||
\p{BC=al}
|
||||
\p{ Bidi_Class : AL }
|
||||
\p{ Bi-di class = Al }
|
||||
\P{ ^ Bi-di class = Al }
|
||||
</pre>
|
||||
There is support for Unicode script names, Unicode general category properties,
|
||||
"Any", which matches any character (including newline), Bidi_Class, a number of
|
||||
binary (yes/no) properties, and some special PCRE2 properties (described
|
||||
<a href="#extraprops">below).</a>
|
||||
Certain other Perl properties such as "InMusicalSymbols" are not supported by
|
||||
PCRE2. Note that \P{Any} does not match any characters, so always causes a
|
||||
match failure.
|
||||
</P>
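<P>
The loose matching rules and the five equivalent examples given earlier in
this section can be checked with a small normalizer. This Python sketch is
illustrative only: <b>parse_property</b> is a hypothetical name, and the
one-entry <b>ALIASES</b> table is a stand-in for PCRE2's real abbreviation
tables, which are not reproduced here.
</P>

```python
import re

# Hypothetical abbreviation table with a single entry for this example.
ALIASES = {"bc": "bidiclass"}


def parse_property(escape: str, body: str):
    """Normalize the text between the braces of a property escape.

    escape is "p" or "P". Returns (negated, name, value); value is None
    for properties that are given as a name only.
    """
    # Ignore ASCII white space, hyphens, and underscores; fold case.
    squeezed = "".join(c for c in body if c not in " \t\n\x0b\f\r-_").lower()
    negated = escape == "P"
    if squeezed.startswith("^"):
        negated = not negated  # a circumflex flips the sense
        squeezed = squeezed[1:]
    parts = re.split("[:=]", squeezed, maxsplit=1)
    name = ALIASES.get(parts[0], parts[0])
    value = parts[1] if len(parts) == 2 else None
    return negated, name, value


examples = [
    ("p", "bidiclass=al"),
    ("p", "BC=al"),
    ("p", " Bidi_Class : AL "),
    ("p", " Bi-di class = Al "),
    ("P", " ^ Bi-di class = Al "),
]
for escape, body in examples:
    print(parse_property(escape, body))  # (False, 'bidiclass', 'al') each time
```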
|
||||
<br><b>
|
||||
Script properties for \p and \P
|
||||
</b><br>
|
||||
<P>
|
||||
There are three different syntax forms for matching a script. Each Unicode
|
||||
character has a basic script and, optionally, a list of other scripts ("Script
|
||||
Extensions") with which it is commonly used. Using the Adlam script as an
|
||||
example, \p{sc:Adlam} matches characters whose basic script is Adlam, whereas
|
||||
\p{scx:Adlam} matches, in addition, characters that have Adlam in their
|
||||
extensions list. The full names "script" and "script extensions" for the
|
||||
property types are recognized and, as for all property specifications, an
|
||||
equals sign is an alternative to the colon. If a script name is given without a
|
||||
property type, for example, \p{Adlam}, it is treated as \p{scx:Adlam}. Perl
|
||||
changed to this interpretation at release 5.26 and PCRE2 changed at release
|
||||
10.40.
|
||||
</P>
|
||||
<P>
|
||||
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
|
||||
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
|
||||
part of an identified script are lumped together as "Common". The current list
|
||||
of recognized script names and their 4-character abbreviations can be obtained
|
||||
by running this command:
|
||||
<pre>
|
||||
pcre2test -LS
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<br><b>
|
||||
The general category property for \p and \P
|
||||
</b><br>
|
||||
<P>
|
||||
Each character has exactly one Unicode general category property, specified by
|
||||
a two-letter abbreviation. If only one letter is specified with \p or \P, it
|
||||
includes all the general category properties that start with that letter. In
|
||||
this case, in the absence of negation, the curly brackets in the escape
|
||||
sequence are optional; these two examples have the same effect:
|
||||
<pre>
|
||||
\p{L}
|
||||
\pL
|
||||
</pre>
|
||||
The following general category property codes are supported:
|
||||
<pre>
|
||||
C Other
|
||||
Cc Control
|
||||
Cf Format
|
||||
Cn Unassigned
|
||||
Co Private use
|
||||
Cs Surrogate
|
||||
|
||||
L Letter
|
||||
Lc Cased letter
|
||||
Ll Lower case letter
|
||||
Lm Modifier letter
|
||||
Lo Other letter
|
||||
Lt Title case letter
|
||||
Lu Upper case letter
|
||||
|
||||
M Mark
|
||||
Mc Spacing mark
|
||||
Me Enclosing mark
|
||||
Mn Non-spacing mark
|
||||
|
||||
N Number
|
||||
Nd Decimal number
|
||||
Nl Letter number
|
||||
No Other number
|
||||
|
||||
P Punctuation
|
||||
Pc Connector punctuation
|
||||
Pd Dash punctuation
|
||||
Pe Close punctuation
|
||||
Pf Final punctuation
|
||||
Pi Initial punctuation
|
||||
Po Other punctuation
|
||||
Ps Open punctuation
|
||||
|
||||
S Symbol
|
||||
Sc Currency symbol
|
||||
Sk Modifier symbol
|
||||
Sm Mathematical symbol
|
||||
So Other symbol
|
||||
|
||||
Z Separator
|
||||
Zl Line separator
|
||||
Zp Paragraph separator
|
||||
Zs Space separator
|
||||
</pre>
|
||||
Perl originally used the name L& for the Lc property. This is still supported
|
||||
by Perl, but discouraged. PCRE2 also still supports it. This property matches
|
||||
any character that has the Lu, Ll, or Lt property, in other words, any letter
|
||||
that is not classified as a modifier or "other". From release 10.45 of PCRE2
|
||||
the properties Lu, Ll, and Lt are all treated as Lc when case-independent
|
||||
matching is set by the PCRE2_CASELESS option or (?i) within the pattern. The
|
||||
other properties are not affected by caseless matching.
|
||||
</P>
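<P>
The general category rules above (one-letter prefixes, the Lc group, and the
caseless treatment of Lu, Ll, and Lt) can be sketched with Python's
<b>unicodedata</b> module. The function name is hypothetical, and Python's
Unicode data may not be the same version as PCRE2's.
</P>

```python
import unicodedata

CASED = {"Lu", "Ll", "Lt"}


def has_general_category(ch: str, prop: str, caseless: bool = False) -> bool:
    """Test one character against a general category property code."""
    prop = prop.capitalize()  # property names are not case sensitive
    cat = unicodedata.category(ch)
    if prop in ("Lc", "L&"):
        return cat in CASED
    if caseless and prop in CASED:
        return cat in CASED  # Lu, Ll, and Lt behave as Lc when caseless
    if len(prop) == 1:
        return cat.startswith(prop)  # one letter covers the whole group
    return cat == prop


print(has_general_category("A", "L"))                    # True
print(has_general_category("A", "Ll"))                   # False
print(has_general_category("A", "Ll", caseless=True))    # True
print(has_general_category("\u02b0", "Lc"))              # False (modifier letter)
print(has_general_category("$", "Sc"))                   # True
```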
|
||||
<P>
|
||||
The Cs (Surrogate) property applies only to characters whose code points are in
|
||||
the range U+D800 to U+DFFF. These characters are no different to any other
|
||||
character when PCRE2 is not in UTF mode (using the 16-bit or 32-bit library).
|
||||
However, they are not valid in Unicode strings and so cannot be tested by PCRE2
|
||||
in UTF mode, unless UTF validity checking has been turned off (see the
|
||||
discussion of PCRE2_NO_UTF_CHECK in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
page).
|
||||
</P>
|
||||
<P>
|
||||
The long synonyms for property names that Perl supports (such as \p{Letter})
|
||||
are not supported by PCRE2, nor is it permitted to prefix any of these
|
||||
properties with "Is".
|
||||
</P>
|
||||
<P>
|
||||
No character that is in the Unicode table has the Cn (unassigned) property.
|
||||
Instead, this property is assumed for any code point that is not in the
|
||||
Unicode table.
|
||||
</P>
|
||||
<br><b>
|
||||
Binary (yes/no) properties for \p and \P
|
||||
</b><br>
|
||||
<P>
|
||||
Unicode defines a number of binary properties, that is, properties whose only
|
||||
values are true or false. You can obtain a list of those that are recognized by
|
||||
\p and \P, along with their abbreviations, by running this command:
|
||||
<pre>
|
||||
pcre2test -LP
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<br><b>
|
||||
The Bidi_Class property for \p and \P
|
||||
</b><br>
|
||||
<P>
|
||||
<pre>
|
||||
\p{Bidi_Class:<class>} matches a character with the given class
|
||||
\p{BC:<class>} matches a character with the given class
|
||||
</pre>
|
||||
The recognized classes are:
|
||||
<pre>
|
||||
AL Arabic letter
|
||||
AN Arabic number
|
||||
B paragraph separator
|
||||
BN boundary neutral
|
||||
CS common separator
|
||||
EN European number
|
||||
ES European separator
|
||||
ET European terminator
|
||||
FSI first strong isolate
|
||||
L left-to-right
|
||||
LRE left-to-right embedding
|
||||
LRI left-to-right isolate
|
||||
LRO left-to-right override
|
||||
NSM non-spacing mark
|
||||
ON other neutral
|
||||
PDF pop directional format
|
||||
PDI pop directional isolate
|
||||
R right-to-left
|
||||
RLE right-to-left embedding
|
||||
RLI right-to-left isolate
|
||||
RLO right-to-left override
|
||||
S segment separator
|
||||
WS white space
|
||||
</pre>
|
||||
As in all property specifications, an equals sign may be used instead of a
|
||||
colon and the class names are case-insensitive. Only the short names listed
|
||||
above are recognized; PCRE2 does not at present support any long alternatives.
|
||||
</P>
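<P>
To see which class a given character belongs to, the short names above can be
looked up with Python's <b>unicodedata.bidirectional()</b> function. This is a
sketch for exploration only; it does not call PCRE2, and the helper name
bidi_class is invented here.
</P>

```python
import unicodedata


def bidi_class(ch: str) -> str:
    """Return the short Bidi_Class name (for example 'L' or 'AL') of ch."""
    return unicodedata.bidirectional(ch)


samples = {
    "Latin a": "a",
    "Hebrew alef": "\u05d0",
    "Arabic alef": "\u0627",
    "ASCII digit": "1",
    "Arabic-Indic digit": "\u0661",
    "space": " ",
    "comma": ",",
    "tab": "\t",
    "combining acute": "\u0301",
    "right-to-left override": "\u202e",
}
for label, ch in samples.items():
    print(f"{label:24} {bidi_class(ch)}")
```

<P>
A character reported here as, say, AL would be expected to match
\p{Bidi_Class:AL} or \p{BC=al} in a PCRE2 pattern, subject to the two Unicode
versions agreeing.
</P>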
|
||||
<br><b>
|
||||
Extended grapheme clusters
|
||||
</b><br>
|
||||
<P>
|
||||
The \X escape matches any number of Unicode characters that form an "extended
|
||||
grapheme cluster", and treats the sequence as an atomic group
|
||||
<a href="#atomicgroup">(see below).</a>
|
||||
Unicode supports various kinds of composite character by giving each character
|
||||
a grapheme breaking property, and having rules that use these properties to
|
||||
define the boundaries of extended grapheme clusters. The rules are defined in
|
||||
Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
|
||||
abandoned the use of some previous properties that had been used for emojis.
|
||||
Instead it introduced various emoji-specific properties. PCRE2 uses only the
|
||||
Extended Pictographic property.
|
||||
</P>
|
||||
<P>
|
||||
\X always matches at least one character. Then it decides whether to add
|
||||
additional characters according to the following rules for ending a cluster:
|
||||
</P>
|
||||
<P>
|
||||
1. End at the end of the subject string.
|
||||
</P>
|
||||
<P>
|
||||
2. Do not end between CR and LF; otherwise end after any control character.
|
||||
</P>
|
||||
<P>
|
||||
3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters
|
||||
are of five types: L, V, T, LV, and LVT. An L character may be followed by an
|
||||
L, V, LV, or LVT character; an LV or V character may be followed by a V or T
|
||||
character; an LVT or T character may be followed only by a T character.
|
||||
</P>
|
||||
<P>
|
||||
4. Do not end before extending characters or spacing marks or the zero-width
|
||||
joiner (ZWJ) character. Characters with the "mark" property always have the
|
||||
"extend" grapheme breaking property.
|
||||
</P>
|
||||
<P>
|
||||
5. Do not end after prepend characters.
|
||||
</P>
|
||||
<P>
|
||||
6. Do not end within emoji modifier sequences or emoji ZWJ (zero-width
|
||||
joiner) sequences. An emoji ZWJ sequence consists of a character with the
|
||||
Extended_Pictographic property, optionally followed by one or more characters
|
||||
with the Extend property, followed by the ZWJ character, followed by another
|
||||
Extended_Pictographic character.
|
||||
</P>
|
||||
<P>
|
||||
7. Do not break within emoji flag sequences. That is, do not break between
|
||||
regional indicator (RI) characters if there are an odd number of RI characters
|
||||
before the break point.
|
||||
</P>
|
||||
<P>
|
||||
8. Otherwise, end the cluster.
|
||||
<a name="extraprops"></a></P>
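<P>
Rule 7 above is the least obvious of the rules, so here is a minimal sketch of
it in Python. It is not PCRE2's implementation: it applies only rule 7 and
otherwise breaks between every pair of characters (rule 8), ignoring rules 2
to 6. The function names are invented for this illustration.
</P>

```python
def is_ri(ch: str) -> bool:
    """True for the 26 regional indicator symbols U+1F1E6..U+1F1FF."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def may_break_before(text: str, i: int) -> bool:
    """Apply rule 7 only: may a cluster end just before text[i]?"""
    if not (0 < i < len(text) and is_ri(text[i - 1]) and is_ri(text[i])):
        return True  # not between two RI characters, rule 7 does not apply
    # Count the consecutive RI characters that precede the break point.
    count = 0
    j = i - 1
    while j >= 0 and is_ri(text[j]):
        count += 1
        j -= 1
    return count % 2 == 0  # an odd count means we are inside a flag pair


def split_flags(text: str) -> list[str]:
    """Split text into pieces, keeping regional indicator pairs together."""
    clusters, start = [], 0
    for i in range(1, len(text)):
        if may_break_before(text, i):
            clusters.append(text[start:i])
            start = i
    if text:
        clusters.append(text[start:])
    return clusters


france_germany = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA"
print([len(c) for c in split_flags(france_germany)])
```

<P>
Four regional indicators in a row therefore form two clusters of two
characters each, and a trailing unpaired indicator forms a cluster on its own.
</P>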
|
||||
<br><b>
|
||||
PCRE2's additional properties
|
||||
</b><br>
|
||||
<P>
|
||||
As well as the standard Unicode properties described above, PCRE2 supports four
|
||||
more that make it possible to convert traditional escape sequences such as \w
|
||||
and \s to use Unicode properties. PCRE2 uses these non-standard, non-Perl
|
||||
properties internally when PCRE2_UCP is set. However, they may also be used
|
||||
explicitly. These properties are:
|
||||
<pre>
|
||||
Xan Any alphanumeric character
|
||||
Xps Any POSIX space character
|
||||
Xsp Any Perl space character
|
||||
Xwd Any Perl "word" character
|
||||
</pre>
|
||||
Xan matches characters that have either the L (letter) or the N (number)
|
||||
property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
|
||||
carriage return, and any other character that has the Z (separator) property
|
||||
(this includes the space character). Xsp is the same as Xps; in PCRE1 it used
|
||||
to exclude vertical tab, for Perl compatibility, but Perl changed. Xwd matches
|
||||
the same characters as Xan, plus those that match Mn (non-spacing mark) or Pc
|
||||
(connector punctuation, which includes underscore).
|
||||
</P>
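<P>
The definitions of Xan, Xps, Xsp, and Xwd translate directly into tests on the
general category. The sketch below expresses them with Python's
<b>unicodedata</b> module; it mirrors the description above rather than
PCRE2's internal tables, and the function names are chosen for this example
only.
</P>

```python
import unicodedata


def xan(ch: str) -> bool:
    """Alphanumeric: any L (letter) or N (number) category."""
    return unicodedata.category(ch)[0] in "LN"


def xps(ch: str) -> bool:
    """POSIX space: tab, LF, VT, FF, CR, or any Z (separator) category."""
    return ch in "\t\n\x0b\x0c\r" or unicodedata.category(ch)[0] == "Z"


# In PCRE2, Xsp is the same set as Xps.
xsp = xps


def xwd(ch: str) -> bool:
    """Perl "word" character: Xan plus Mn and Pc."""
    return xan(ch) or unicodedata.category(ch) in {"Mn", "Pc"}


for ch in ["a", "5", "_", "\u0301", " ", "\x0b", "-"]:
    print(f"U+{ord(ch):04X} Xan={xan(ch)} Xps={xps(ch)} Xwd={xwd(ch)}")
```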
|
||||
<P>
|
||||
There is another non-standard property, Xuc, which matches any character that
|
||||
can be represented by a Universal Character Name in C++ and other programming
|
||||
languages. These are the characters $, @, ` (grave accent), and all characters
|
||||
with Unicode code points greater than or equal to U+00A0, except for the
|
||||
surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are
|
||||
excluded. (Universal Character Names are of the form \uHHHH or \UHHHHHHHH
|
||||
where H is a hexadecimal digit. Note that the Xuc property does not match these
|
||||
sequences but the characters that they represent.)
|
||||
<a name="resetmatchstart"></a></P>
|
||||
<br><b>
|
||||
Resetting the match start
|
||||
</b><br>
|
||||
<P>
|
||||
In normal use, the escape sequence \K causes any previously matched characters
|
||||
not to be included in the final matched sequence that is returned. For example,
|
||||
the pattern:
|
||||
<pre>
|
||||
foo\Kbar
|
||||
</pre>
|
||||
matches "foobar", but reports that it has matched "bar". \K does not interact
|
||||
with anchoring in any way. The pattern:
|
||||
<pre>
|
||||
^foo\Kbar
|
||||
</pre>
|
||||
matches only when the subject begins with "foobar" (in single line mode),
|
||||
though it again reports the matched string as "bar". This feature is similar to
|
||||
a lookbehind assertion
|
||||
<a href="#lookbehind">(described below),</a>
|
||||
but the part of the pattern that precedes \K is not constrained to match a
|
||||
limited number of characters, as is required for a lookbehind assertion. The
|
||||
use of \K does not interfere with the setting of
|
||||
<a href="#group">captured substrings.</a>
|
||||
For example, when the pattern
|
||||
<pre>
|
||||
(foo)\Kbar
|
||||
</pre>
|
||||
matches "foobar", the first substring is still set to "foo".
|
||||
</P>
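<P>
Python's standard <b>re</b> module has no \K, so the sketch below only
approximates the effect described above: it matches the prefix and the wanted
part together, then reports just the wanted part through a capture group. The
helper name search_after is hypothetical. Unlike a Python lookbehind, the
prefix may have variable length, which is the property that makes \K useful.
</P>

```python
import re


def search_after(prefix: str, wanted: str, subject: str):
    """Approximate prefix\\Kwanted: match both, report only `wanted`."""
    match = re.search(f"(?:{prefix})({wanted})", subject)
    if match is None:
        return None
    return match.group(1), match.span(1)


# Fixed-length prefix, comparable to foo\Kbar.
print(search_after(r"foo", r"bar", "xxfoobar"))

# Variable-length prefix, comparable to fo+\Kbar; Python's re would reject
# this prefix inside a lookbehind because its length is not fixed.
print(search_after(r"fo+", r"bar", "fooooobar"))
```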
|
||||
<P>
|
||||
From version 5.32.0 Perl forbids the use of \K in lookaround assertions. From
|
||||
release 10.38 PCRE2 also forbids this by default. However, the
|
||||
PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option can be used when calling
|
||||
<b>pcre2_compile()</b> to re-enable the previous behaviour. When this option is
|
||||
set, \K is acted upon when it occurs inside positive assertions, but is
|
||||
ignored in negative assertions. Note that when a pattern such as (?=ab\K)
|
||||
matches, the reported start of the match can be greater than the end of the
|
||||
match. Using \K in a lookbehind assertion at the start of a pattern can also
|
||||
lead to odd effects. For example, consider this pattern:
|
||||
<pre>
|
||||
(?<=\Kfoo)bar
|
||||
</pre>
|
||||
If the subject is "foobar", a call to <b>pcre2_match()</b> with a starting
|
||||
offset of 3 succeeds and reports the matching string as "foobar", that is, the
|
||||
start of the reported match is earlier than where the match started.
|
||||
<a name="smallassertions"></a></P>
|
||||
<br><b>
|
||||
Simple assertions
|
||||
</b><br>
|
||||
<P>
|
||||
The final use of backslash is for certain simple assertions. An assertion
|
||||
specifies a condition that has to be met at a particular point in a match,
|
||||
without consuming any characters from the subject string. The use of
|
||||
groups for more complicated assertions is described
|
||||
<a href="#bigassertions">below.</a>
|
||||
The backslashed assertions are:
|
||||
<pre>
|
||||
\b matches at a word boundary
|
||||
\B matches when not at a word boundary
|
||||
\A matches at the start of the subject
|
||||
\Z matches at the end of the subject
|
||||
also matches before a newline at the end of the subject
|
||||
\z matches only at the end of the subject
|
||||
\G matches at the first matching position in the subject
|
||||
</pre>
|
||||
Inside a character class, \b has a different meaning; it matches the backspace
|
||||
character. If any other of these assertions appears in a character class, an
|
||||
"invalid escape sequence" error is generated.
|
||||
</P>
|
||||
<P>
|
||||
A word boundary is a position in the subject string where the current character
|
||||
and the previous character do not both match \w or \W (i.e. one matches
|
||||
\w and the other matches \W), or the start or end of the string if the
|
||||
first or last character matches \w, respectively. When PCRE2 is built with
|
||||
Unicode support, the meanings of \w and \W can be changed by setting the
|
||||
PCRE2_UCP option. When this is done, it also affects \b and \B. Neither PCRE2
|
||||
nor Perl has a separate "start of word" or "end of word" metasequence. However,
|
||||
whatever follows \b normally determines which it is. For example, the fragment
|
||||
\ba matches "a" at the start of a word.
|
||||
</P>
|
||||
<P>
|
||||
The \A, \Z, and \z assertions differ from the traditional circumflex and
|
||||
dollar (described in the next section) in that they only ever match at the very
|
||||
start and end of the subject string, whatever options are set. Thus, they are
|
||||
independent of multiline mode. These three assertions are not affected by the
|
||||
PCRE2_NOTBOL or PCRE2_NOTEOL options, which affect only the behaviour of the
|
||||
circumflex and dollar metacharacters. However, if the <i>startoffset</i>
|
||||
argument of <b>pcre2_match()</b> is non-zero, indicating that matching is to
|
||||
start at a point other than the beginning of the subject, \A can never match.
|
||||
The difference between \Z and \z is that \Z matches before a newline at the
|
||||
end of the string as well as at the very end, whereas \z matches only at the
|
||||
end.
|
||||
</P>
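<P>
The difference between \Z and \z can be demonstrated with Python's <b>re</b>
module, with one important caveat: Python's own \Z matches only at the very end
of the string, so it corresponds to PCRE2's \z. The sketch below therefore
emulates PCRE2's \Z with a lookahead. The constant and function names are
invented for this illustration.
</P>

```python
import re

# Emulates PCRE2's \Z: the end of the subject, or just before a final newline.
PCRE2_BIG_Z = r"(?=\n?\Z)"
# Python's \Z is strict, so it plays the role of PCRE2's \z here.
PCRE2_SMALL_Z = r"\Z"


def ends_with(word: str, subject: str, end_assertion: str) -> bool:
    """True if `word` is found immediately before the given end assertion."""
    return re.search(re.escape(word) + end_assertion, subject) is not None


for subject in ("abc", "abc\n", "abc\n\n"):
    print(
        repr(subject),
        ends_with("abc", subject, PCRE2_BIG_Z),
        ends_with("abc", subject, PCRE2_SMALL_Z),
    )
```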
|
||||
<P>
|
||||
The \G assertion is true only when the current matching position is at the
|
||||
start point of the matching process, as specified by the <i>startoffset</i>
|
||||
argument of <b>pcre2_match()</b>. It differs from \A when the value of
|
||||
<i>startoffset</i> is non-zero. By calling <b>pcre2_match()</b> multiple times
|
||||
with appropriate arguments, you can mimic Perl's /g option, and it is in this
|
||||
kind of implementation where \G can be useful.
|
||||
</P>
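<P>
The repeated-call technique can be sketched in Python, whose <b>re</b> module
has no \G but whose compiled patterns offer a match(string, pos) method that
only succeeds at the given offset. That gives the same "must match exactly
where the previous match ended" behaviour. The tokenizer below is an
illustrative assumption, not part of PCRE2.
</P>

```python
import re

WHITESPACE = re.compile(r"\s*")
TOKEN = re.compile(r"\d+|[-+*/()]")


def tokenize(text: str) -> list[str]:
    """Split text into tokens, failing on anything that is not a token."""
    tokens = []
    pos = WHITESPACE.match(text, 0).end()
    while pos < len(text):
        # match() is anchored at pos, which is the role \G plays in PCRE2
        # when pos is passed as the startoffset argument.
        found = TOKEN.match(text, pos)
        if found is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(found.group())
        pos = WHITESPACE.match(text, found.end()).end()
    return tokens


print(tokenize("12 + (3*4)"))
```

<P>
Without the anchoring, a search from the current offset would silently skip
over unrecognized input instead of reporting it.
</P>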
|
||||
<P>
|
||||
Note, however, that PCRE2's implementation of \G, being true at the starting
|
||||
character of the matching process, is subtly different from Perl's, which
|
||||
defines it as true at the end of the previous match. In Perl, these can be
|
||||
different when the previously matched string was empty. Because PCRE2 does just
|
||||
one match at a time, it cannot reproduce this behaviour.
|
||||
</P>
|
||||
<P>
|
||||
If all the alternatives of a pattern begin with \G, the expression is anchored
|
||||
to the starting match position, and the "anchored" flag is set in the compiled
|
||||
regular expression.
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br>
|
||||
<P>
|
||||
The circumflex and dollar metacharacters are zero-width assertions. That is,
|
||||
they test for a particular condition being true without consuming any
|
||||
characters from the subject string. These two metacharacters are concerned with
|
||||
matching the starts and ends of lines. If the newline convention is set so that
|
||||
only the two-character sequence CRLF is recognized as a newline, isolated CR
|
||||
and LF characters are treated as ordinary data characters, and are not
|
||||
recognized as newlines.
|
||||
</P>
|
||||
<P>
|
||||
Outside a character class, in the default matching mode, the circumflex
|
||||
character is an assertion that is true only if the current matching point is at
|
||||
the start of the subject string. If the <i>startoffset</i> argument of
|
||||
<b>pcre2_match()</b> is non-zero, or if PCRE2_NOTBOL is set, circumflex can
|
||||
never match if the PCRE2_MULTILINE option is unset. Inside a character class,
|
||||
circumflex has an entirely different meaning
|
||||
<a href="#characterclass">(see below).</a>
|
||||
</P>
|
||||
<P>
|
||||
Circumflex need not be the first character of the pattern if a number of
|
||||
alternatives are involved, but it should be the first thing in each alternative
|
||||
in which it appears if the pattern is ever to match that branch. If all
|
||||
possible alternatives start with a circumflex, that is, if the pattern is
|
||||
constrained to match only at the start of the subject, it is said to be an
|
||||
"anchored" pattern. (There are also other constructs that can cause a pattern
|
||||
to be anchored.)
|
||||
</P>
|
||||
<P>
|
||||
The dollar character is an assertion that is true only if the current matching
|
||||
point is at the end of the subject string, or immediately before a newline at
|
||||
the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however,
|
||||
that it does not actually match the newline. Dollar need not be the last
|
||||
character of the pattern if a number of alternatives are involved, but it
|
||||
should be the last item in any branch in which it appears. Dollar has no
|
||||
special meaning in a character class.
|
||||
</P>
|
||||
<P>
|
||||
The meaning of dollar can be changed so that it matches only at the very end of
|
||||
the string, by setting the PCRE2_DOLLAR_ENDONLY option at compile time. This
|
||||
does not affect the \Z assertion.
|
||||
</P>
|
||||
<P>
|
||||
The meanings of the circumflex and dollar metacharacters are changed if the
|
||||
PCRE2_MULTILINE option is set. When this is the case, a dollar character
|
||||
matches before any newlines in the string, as well as at the very end, and a
|
||||
circumflex matches immediately after internal newlines as well as at the start
|
||||
of the subject string. It does not match after a newline that ends the string,
|
||||
for compatibility with Perl. However, this can be changed by setting the
|
||||
PCRE2_ALT_CIRCUMFLEX option.
|
||||
</P>
|
||||
<P>
|
||||
For example, the pattern /^abc$/ matches the subject string "def\nabc" (where
|
||||
\n represents a newline) in multiline mode, but not otherwise. Consequently,
|
||||
patterns that are anchored in single line mode because all branches start with
|
||||
^ are not anchored in multiline mode, and a match for circumflex is possible
|
||||
when the <i>startoffset</i> argument of <b>pcre2_match()</b> is non-zero. The
|
||||
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
|
||||
</P>
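<P>
The /^abc$/ example can be tried in any engine that follows the Perl
conventions for these two metacharacters. The sketch below uses Python's
<b>re</b> module, which agrees with the behaviour described above for these
particular cases; it is not a test of PCRE2 itself.
</P>

```python
import re

subject = "def\nabc"

# Single line mode: ^ can match only at the start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the subject,
# but the newline is not part of the match.
before_final_newline = re.search(r"abc$", "abc\n")

print(single, multi.span(), before_final_newline.span())
```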
|
||||
<P>
|
||||
When the newline convention (see
|
||||
<a href="#newlines">"Newline conventions"</a>
|
||||
below) recognizes the two-character sequence CRLF as a newline, this is
|
||||
preferred, even if the single characters CR and LF are also recognized as
|
||||
newlines. For example, if the newline convention is "any", a multiline mode
|
||||
circumflex matches before "xyz" in the string "abc\r\nxyz" rather than after
|
||||
CR, even though CR on its own is a valid newline. (It also matches at the very
|
||||
start of the string, of course.)
|
||||
</P>
|
||||
<P>
|
||||
Note that the sequences \A, \Z, and \z can be used to match the start and
|
||||
end of the subject in both modes, and if all branches of a pattern start with
|
||||
\A it is always anchored, whether or not PCRE2_MULTILINE is set.
|
||||
<a name="fullstopdot"></a></P>
|
||||
<br><a name="SEC7" href="#TOC1">FULL STOP (PERIOD, DOT) AND \N</a><br>
|
||||
<P>
|
||||
Outside a character class, a dot in the pattern matches any one character in
|
||||
the subject string except (by default) a character that signifies the end of a
|
||||
line. One or more characters may be specified as line terminators (see
|
||||
<a href="#newlines">"Newline conventions"</a>
|
||||
above).
|
||||
</P>
|
||||
<P>
|
||||
Dot never matches a single line-ending character. When the two-character
|
||||
sequence CRLF is the only line ending, dot does not match CR if it is
|
||||
immediately followed by LF, but otherwise it matches all characters (including
|
||||
isolated CRs and LFs). When ANYCRLF is selected for line endings, no occurrences
|
||||
of CR or LF match dot. When all Unicode line endings are being recognized, dot
|
||||
does not match CR or LF or any of the other line ending characters.
|
||||
</P>
|
||||
<P>
|
||||
The behaviour of dot with regard to newlines can be changed. If the
|
||||
PCRE2_DOTALL option is set, a dot matches any one character, without exception.
|
||||
If the two-character sequence CRLF is present in the subject string, it takes
|
||||
two dots to match it.
|
||||
</P>
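<P>
The effect of the dotall setting can be seen in a small experiment. The sketch
below uses Python's <b>re</b> module, where re.DOTALL plays the role of
PCRE2_DOTALL. Note one assumption that differs from PCRE2: Python recognizes
only LF as a line ending for this purpose, so the newline convention cannot be
changed.
</P>

```python
import re

# Without dotall, a dot does not match the newline character.
plain = re.fullmatch(r".", "\n")

# With dotall, a dot matches any one character.
dotall = re.fullmatch(r".", "\n", re.DOTALL)

# CRLF is two characters, so one dot is not enough but two dots are.
crlf_one_dot = re.fullmatch(r".", "\r\n", re.DOTALL)
crlf_two_dots = re.fullmatch(r"..", "\r\n", re.DOTALL)

print(plain, dotall is not None, crlf_one_dot, crlf_two_dots is not None)
```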
|
||||
<P>
|
||||
The handling of dot is entirely independent of the handling of circumflex and
|
||||
dollar, the only relationship being that they both involve newlines. Dot has no
|
||||
special meaning in a character class.
|
||||
</P>
|
||||
<P>
|
||||
The escape sequence \N when not followed by an opening brace behaves like a
|
||||
dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
|
||||
it matches any character except one that signifies the end of a line.
|
||||
</P>
|
||||
<P>
|
||||
When \N is followed by an opening brace it has a different meaning. See the
|
||||
section entitled
|
||||
<a href="#digitsafterbackslash">"Non-printing characters"</a>
|
||||
above for details. Perl also uses \N{name} to specify characters by Unicode
|
||||
name; PCRE2 does not support this.
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">MATCHING A SINGLE CODE UNIT</a><br>
|
||||
<P>
|
||||
Outside a character class, the escape sequence \C matches any one code unit,
|
||||
whether or not a UTF mode is set. In the 8-bit library, one code unit is one
|
||||
byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
|
||||
32-bit unit. Unlike a dot, \C always matches line-ending characters. The
|
||||
feature is provided in Perl in order to match individual bytes in UTF-8 mode,
|
||||
but it is unclear how it can usefully be used.
|
||||
</P>
|
||||
<P>
|
||||
Because \C breaks up characters into individual code units, matching one unit
|
||||
with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
|
||||
with a malformed UTF character. This has undefined results, because PCRE2
|
||||
assumes that it is matching character by character in a valid UTF string (by
|
||||
default it checks the subject string's validity at the start of processing
|
||||
unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
|
||||
</P>
|
||||
<P>
|
||||
An application can lock out the use of \C by setting the
|
||||
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
|
||||
build PCRE2 with the use of \C permanently disabled.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2 does not allow \C to appear in lookbehind assertions
|
||||
<a href="#lookbehind">(described below)</a>
|
||||
in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
|
||||
the length of the lookbehind. Neither the alternative matching function
|
||||
<b>pcre2_dfa_match()</b> nor the JIT optimizer support \C in these UTF modes.
|
||||
The former gives a match-time error; the latter fails to optimize and so the
|
||||
match is always run using the interpreter.
|
||||
</P>
|
||||
<P>
|
||||
In the 32-bit library, however, \C is always supported (when not explicitly
|
||||
locked out) because it always matches a single code unit, whether or not UTF-32
|
||||
is specified.
|
||||
</P>
|
||||
<P>
|
||||
In general, the \C escape sequence is best avoided. However, one way of using
|
||||
it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
|
||||
lookahead to check the length of the next character, as in this pattern, which
|
||||
could be used with a UTF-8 string (ignore white space and line breaks):
|
||||
<pre>
|
||||
(?| (?=[\x00-\x7f])(\C) |
|
||||
(?=[\x80-\x{7ff}])(\C)(\C) |
|
||||
(?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
|
||||
(?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
|
||||
</pre>
|
||||
In this example, a group that starts with (?| resets the capturing parentheses
|
||||
numbers in each alternative (see
|
||||
<a href="#dupgroupnumber">"Duplicate Group Numbers"</a>
|
||||
below). The assertions at the start of each branch check the next UTF-8
|
||||
character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
|
||||
character's individual bytes are then captured by the appropriate number of
|
||||
\C groups.
|
||||
<a name="characterclass"></a></P>
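<P>
The four code point ranges tested by the lookaheads correspond to the four
possible UTF-8 encoded lengths. The sketch below checks that correspondence in
Python and lists the bytes that the \C groups would capture. It does not run
the pattern itself, and the function names are invented for this example.
</P>

```python
def utf8_length(code_point: int) -> int:
    """Number of UTF-8 code units, using the same ranges as the pattern."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def code_units(ch: str) -> list[str]:
    """The individual UTF-8 bytes of ch, as two-digit hexadecimal strings."""
    return [f"{byte:02x}" for byte in ch.encode("utf-8")]


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    units = code_units(ch)
    # The range-based prediction must agree with the real encoding.
    assert len(units) == utf8_length(ord(ch))
    print(f"U+{ord(ch):04X} {units}")
```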
|
||||
<br><a name="SEC9" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
|
||||
<P>
|
||||
An opening square bracket introduces a character class, terminated by a closing
|
||||
square bracket. A closing square bracket on its own is not special by default.
|
||||
If a closing square bracket is required as a member of the class, it should be
|
||||
the first data character in the class (after an initial circumflex, if present)
|
||||
or escaped with a backslash. This means that, by default, an empty class cannot
|
||||
be defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing
|
||||
square bracket at the start does end the (empty) class.
|
||||
</P>
|
||||
<P>
|
||||
A character class matches a single character in the subject. A matched
|
||||
character must be in the set of characters defined by the class, unless the
|
||||
first character in the class definition is a circumflex, in which case the
|
||||
subject character must not be in the set defined by the class. If a circumflex
|
||||
is actually required as a member of the class, ensure it is not the first
|
||||
character, or escape it with a backslash.
|
||||
</P>
|
||||
<P>
|
||||
For example, the character class [aeiou] matches any lower case English vowel,
|
||||
whereas [^aeiou] matches all other characters. Note that a circumflex is just a
|
||||
convenient notation for specifying the characters that are in the class by
|
||||
enumerating those that are not. A class that starts with a circumflex is not an
|
||||
assertion; it still consumes a character from the subject string, and therefore
|
||||
it fails to match if the current pointer is at the end of the string.
|
||||
</P>
|
||||
<P>
|
||||
Characters in a class may be specified by their code points using \o, \x, or
|
||||
\N{U+hh..} in the usual way. When caseless matching is set, any letters in a
|
||||
class represent both their upper case and lower case versions, so for example,
|
||||
a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
|
||||
match "A", whereas a caseful version would. Note that there are two ASCII
|
||||
characters, K and S, that, in addition to their lower case ASCII equivalents,
|
||||
are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S)
|
||||
respectively when either PCRE2_UTF or PCRE2_UCP is set. If you do not want
|
||||
these ASCII/non-ASCII case equivalences, you can suppress them by setting
|
||||
PCRE2_EXTRA_CASELESS_RESTRICT, either as an option in a compile context, or by
|
||||
including (*CASELESS_RESTRICT) or (?r) within a pattern.
|
||||
</P>
|
||||
<P>
|
||||
Characters that might indicate line breaks are never treated in any special way
|
||||
when matching character classes, whatever line-ending sequence is in use, and
|
||||
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
|
||||
class such as [^a] always matches one of these characters.
|
||||
</P>
|
||||
<P>
|
||||
The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
|
||||
\S, \v, \V, \w, and \W may appear in a character class, and add the
|
||||
characters that they match to the class. For example, [\dABCDEF] matches any
|
||||
hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
|
||||
\d, \s, \w and their upper case partners, just as it does when they appear
|
||||
outside a character class, as described in the section entitled
|
||||
<a href="#genericchartypes">"Generic character types"</a>
|
||||
above. The escape sequence \b has a different meaning inside a character
|
||||
class; it matches the backspace character. The sequences \B, \R, and \X are
|
||||
not special inside a character class. Like any other unrecognized escape
|
||||
sequences, they cause an error. The same is true for \N when not followed by
|
||||
an opening brace.
|
||||
</P>
|
||||
<P>
|
||||
The minus (hyphen) character can be used to specify a range of characters in a
|
||||
character class. For example, [d-m] matches any letter between d and m,
|
||||
inclusive. If a minus character is required in a class, it must be escaped with
|
||||
a backslash or appear in a position where it cannot be interpreted as
|
||||
indicating a range, typically as the first or last character in the class,
|
||||
or immediately after a range. For example, [b-d-z] matches letters in the range
|
||||
b to d, a hyphen character, or z.
|
||||
</P>
|
||||
<P>
|
||||
There is some special treatment for alphabetic ranges in EBCDIC environments;
|
||||
see the section
|
||||
<a href="#ebcdicenvironments">"EBCDIC environments"</a>
|
||||
below.
|
||||
</P>
|
||||
<P>
|
||||
Perl treats a hyphen as a literal if it appears before or after a POSIX class
|
||||
(see below) or before or after a character type escape such as \d or \H.
|
||||
However, unless the hyphen is the last character in the class, Perl outputs a
|
||||
warning in its warning mode, as this is most likely a user error. As PCRE2 has
|
||||
no facility for warning, an error is given in these cases.
|
||||
</P>
|
||||
<P>
|
||||
It is not possible to have the literal character "]" as the end character of a
|
||||
range. A pattern such as [W-]46] is interpreted as a class of two characters
|
||||
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
|
||||
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
|
||||
the end of a range, so [W-\]46] is interpreted as a class containing a range
|
||||
and two other characters. The octal or hexadecimal representation of "]" can
|
||||
also be used to end a range.
|
||||
</P>
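<P>
The hyphen and closing bracket rules described above can be tried out quickly.
The sketch below uses Python's <b>re</b> module, which treats these particular
classes in the same way; this is an observation about Python, offered as an
illustration, and not a guarantee that the two engines agree on every class.
</P>

```python
import re

# A hyphen immediately after a range is a literal: b to d, hyphen, or z.
after_range = [ch for ch in "abcd-ez" if re.fullmatch(r"[b-d-z]", ch)]

# "]" cannot end a range unescaped: [W-] is a two-character class and
# "46]" is a literal string that follows it.
literal_tail = [s for s in ("W46]", "-46]", "X46]") if re.fullmatch(r"[W-]46]", s)]

# Escaped, "]" does end the range: W to ], plus the characters 4 and 6.
escaped_end = [ch for ch in "WZ]46-" if re.fullmatch(r"[W-\]46]", ch)]

print(after_range, literal_tail, escaped_end)
```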
|
||||
<P>
|
||||
Ranges normally include all code points between the start and end characters,
|
||||
inclusive. They can also be used for code points specified numerically, for
|
||||
example [\000-\037]. Ranges can include any characters that are valid for the
|
||||
current mode. In any UTF mode, the so-called "surrogate" characters (those
|
||||
whose code points lie between 0xd800 and 0xdfff inclusive) may not be specified
|
||||
explicitly by default (the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables
|
||||
this check). However, ranges such as [\x{d7ff}-\x{e000}], which include the
|
||||
surrogates, are always permitted.
|
||||
</P>
|
||||
<P>
|
||||
If a range that includes letters is used when caseless matching is set, it
|
||||
matches the letters in either case. For example, [W-c] is equivalent to
|
||||
[][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
|
||||
tables for a French locale are in use, [\xc8-\xcb] matches accented E
|
||||
characters in both cases.
|
||||
</P>
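<P>
The [W-c] equivalence can be verified by enumerating the range and adding the
other case of each letter. The sketch below does this for ASCII only, using
Python's str.lower() and str.upper(); it ignores the Unicode special cases
mentioned earlier (such as the Kelvin sign), and the function name is
hypothetical.
</P>

```python
def caseless_members(first: str, last: str) -> set[str]:
    """All characters matched by a caseless [first-last] range (ASCII only)."""
    chars = {chr(cp) for cp in range(ord(first), ord(last) + 1)}
    # Add the other-case partner of every letter in the range.
    return chars | {c.lower() for c in chars} | {c.upper() for c in chars}


members = caseless_members("W", "c")

# The class given in the text, with both cases of each letter written out.
expected = set("][\\^_`") | set("wxyzabc") | set("WXYZABC")

print(sorted(members) == sorted(expected))
```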
|
||||
<P>
|
||||
A circumflex can conveniently be used with the upper case character types to
|
||||
specify a more restricted set of characters than the matching lower case type.
|
||||
For example, the class [^\W_] matches any letter or digit, but not underscore,
|
||||
whereas [\w] includes underscore. A positive character class should be read as
|
||||
"something OR something OR ..." and a negative class as "NOT something AND NOT
|
||||
something AND NOT ...".
|
||||
</P>
|
||||
<P>
|
||||
The metacharacters that are recognized in character classes are backslash,
|
||||
hyphen (when it can be interpreted as specifying a range), circumflex
|
||||
(only at the start), and the terminating closing square bracket. An opening
|
||||
square bracket is also special when it can be interpreted as introducing a
|
||||
POSIX class (see
|
||||
<a href="#posixclasses">"Posix character classes"</a>
|
||||
below), or a special compatibility feature (see
|
||||
<a href="#wordboundcompat">"Compatibility feature for word boundaries"</a>
|
||||
below). Escaping any non-alphanumeric character in a class turns it into a
|
||||
literal, whether or not it would otherwise be a metacharacter.
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">PERL EXTENDED CHARACTER CLASSES</a><br>
|
||||
<P>
|
||||
From release 10.45 PCRE2 supports Perl's (?[...]) extended character class
|
||||
syntax. This can be used to perform set operations such as intersection on
|
||||
character classes.
|
||||
</P>
|
||||
<P>
|
||||
The syntax permitted within (?[...]) is quite different to ordinary character
|
||||
classes. Inside the extended class, there is an expression syntax consisting of
|
||||
"atoms", operators, and ordinary parentheses "()" used for grouping. Such
|
||||
classes always have the Perl /xx modifier (PCRE2 option PCRE2_EXTENDED_MORE)
|
||||
turned on within them. This means that literal space and tab characters are
|
||||
ignored everywhere in the class.
|
||||
</P>
|
||||
<P>
|
||||
The allowed atoms are individual characters specified by escape sequences such
|
||||
as \n or \x{123}, character types such as \d, POSIX classes such as
|
||||
[:alpha:], and nested ordinary (non-extended) character classes. For example,
|
||||
in (?[\d & [...]]) the nested class [...] follows the usual rules for ordinary
|
||||
character classes, in which parentheses are not metacharacters, and character
|
||||
literals and ranges are permitted.
|
||||
</P>
|
||||
<P>
|
||||
Character literals and ranges may not appear outside a nested ordinary
|
||||
character class because they are not atoms in the extended syntax. The extended
|
||||
syntax does not introduce any additional escape sequences, so (?[\y]) is an
|
||||
unknown escape, as it would be in [\y].
|
||||
</P>
|
||||
<P>
|
||||
In the extended syntax, ^ does not negate a class (except within an
|
||||
ordinary class nested inside an extended class); it is instead a binary
|
||||
operator.
|
||||
</P>
|
||||
<P>
|
||||
The binary operators are "&" (intersection), "|" or "+" (union), "-"
|
||||
(subtraction) and "^" (symmetric difference). These are left-associative and
|
||||
"&" has higher (tighter) precedence, while the others have equal lower
|
||||
precedence. The one prefix unary operator is "!" (complement), with highest
|
||||
precedence.
|
||||
</P>
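<P>
The precedence rules are easiest to see with a small model. The sketch below
is a toy evaluator written in Python: it works on named ASCII sets instead of
real PCRE2 atoms, and the atom names (digit, alpha, xdigit, lower) and the
function names are assumptions made for this example. It implements the rules
as stated above: "!" binds tightest, then "&", then the remaining operators
with equal precedence, all left-associative.
</P>

```python
import string

UNIVERSE = set(map(chr, range(128)))
ATOMS = {
    "digit": set(string.digits),
    "alpha": set(string.ascii_letters),
    "xdigit": set(string.hexdigits),
    "lower": set(string.ascii_lowercase),
}


def tokenize(expr: str) -> list[str]:
    tokens, i = [], 0
    while i < len(expr):
        ch = expr[i]
        if ch in " \t":  # white space is ignored, as in an extended class
            i += 1
        elif ch in "&|+-^!()":
            tokens.append(ch)
            i += 1
        elif ch.isalpha():
            j = i
            while j < len(expr) and expr[j].isalpha():
                j += 1
            tokens.append(expr[i:j])
            i = j
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens


def evaluate(expr: str) -> set[str]:
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        return tok

    def unary():  # "!" (complement), parentheses, and atoms
        tok = take()
        if tok == "!":
            return UNIVERSE - unary()
        if tok == "(":
            value = low()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if tok in ATOMS:
            return ATOMS[tok]
        raise ValueError(f"unknown atom {tok!r}")

    def high():  # "&" binds tighter than the other binary operators
        value = unary()
        while peek() == "&":
            take()
            value = value & unary()
        return value

    def low():  # "|", "+", "-", "^": equal precedence, left-associative
        value = high()
        while peek() in ("|", "+", "-", "^"):
            op = take()
            rhs = high()
            if op in ("|", "+"):
                value = value | rhs
            elif op == "-":
                value = value - rhs
            else:
                value = value ^ rhs
        return value

    result = low()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


print(sorted(evaluate("xdigit - digit")))
```

<P>
For instance, "alpha - lower | digit" is read as "(alpha - lower) | digit"
because the two operators have equal precedence, whereas in
"alpha | digit & lower" the intersection is evaluated first.
</P>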
|
||||
<br><a name="SEC11" href="#TOC1">UTS#18 EXTENDED CHARACTER CLASSES</a><br>
|
||||
<P>
|
||||
The PCRE2_ALT_EXTENDED_CLASS option enables an alternative to Perl's (?[...])
|
||||
syntax, allowing instead extended class behaviour inside ordinary [...]
|
||||
character classes. This altered syntax for [...] classes is loosely described
|
||||
by the Unicode standard UTS#18. The PCRE2_ALT_EXTENDED_CLASS option does not
|
||||
prevent use of (?[...]) classes; it just changes the meaning of all
|
||||
[...] classes that are not nested inside a Perl (?[...]) class.
|
||||
</P>
|
||||
<P>
|
||||
Firstly, in ordinary Perl [...] syntax, an expression such as "[a[]" is a
|
||||
character class with two literal characters "a" and "[", but in UTS#18 extended
|
||||
classes the "[" character becomes an additional metacharacter within classes,
|
||||
denoting the start of a nested class, so a literal "[" must be escaped as "\[".
|
||||
</P>
|
||||
<P>
|
||||
Secondly, within the UTS#18 extended syntax, there are operators "||", "&&",
|
||||
"--" and "~~" which denote character class union, intersection, subtraction,
|
||||
and symmetric difference respectively. In standard Perl syntax, these would
|
||||
simply be needlessly-repeated literals (except for "--" which could be the
|
||||
start or end of a range). In UTS#18 extended classes these operators can be used
|
||||
in constructs such as [\p{L}--[QW]] for "Unicode letters, other than Q and W".
|
||||
A literal "-" at the start or end of a range must be escaped, so while "[--1]"
|
||||
in Perl syntax is the range from hyphen to "1", it must be escaped as "[\--1]"
|
||||
in UTS#18 extended classes.
|
||||
</P>
|
||||
<P>
|
||||
Unlike Perl's (?[...]) extended classes, the PCRE2_EXTENDED_MORE option to
|
||||
ignore space and tab characters is not automatically enabled for UTS#18
|
||||
extended classes, but it is honoured if set.
|
||||
</P>
|
||||
<P>
|
||||
Extended UTS#18 classes can be nested, and nested classes are themselves
|
||||
extended classes (unlike Perl, where nested classes must be simple classes).
|
||||
For example, [\p{L}&&[\p{Thai}||\p{Greek}]] matches any letter that is in
|
||||
the Thai or Greek scripts. Note that this means that no special grouping
|
||||
characters (such as the parentheses used in Perl's (?[...]) class syntax) are
|
||||
needed.
|
||||
</P>
|
||||
<P>
|
||||
Individual class items (literal characters, literal ranges, properties such as
|
||||
\d or \p{...}, and nested classes) can be combined by juxtaposition or by an
|
||||
operator. Juxtaposition is the implicit union operator, and binds more tightly
|
||||
than any explicit operator. Thus a sequence of literals and/or ranges behaves
|
||||
as if it is enclosed in square brackets. For example, [A-Z0-9&&[^E8]] is the
|
||||
same as [[A-Z0-9]&&[^E8]], which matches any upper case alphanumeric character
|
||||
except "E" or "8".
|
||||
</P>
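<P>
As a rough illustration of the example above, the class [A-Z0-9&&[^E8]] can be
modelled with Python sets. The ASCII-only universe used for the negated class
is an assumption made to keep the sketch small; it is not how PCRE2 represents
classes internally.
</P>

```python
import string

# Assumption: restrict the universe to ASCII so the negated class is finite.
universe = {chr(cp) for cp in range(128)}

# Juxtaposition is an implicit union: A-Z0-9 behaves like [A-Z0-9].
juxtaposed = set(string.ascii_uppercase) | set(string.digits)

# The nested class [^E8] is everything except "E" and "8".
negated = universe - {"E", "8"}

# "&&" is intersection.
result = juxtaposed & negated
```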
|
||||
<P>
|
||||
Precedence between the explicit operators is not defined, so mixing operators
|
||||
is a syntax error. For example, [A&&B--C] is an error, but [A&&[B--C]] is
|
||||
valid.
|
||||
</P>
|
||||
<P>
|
||||
This is an emerging syntax which is being adopted gradually across the regex
|
||||
ecosystem: for example JavaScript adopted the "/v" flag in ECMAScript 2024;
|
||||
Python's "re" module reserves the syntax for future use with a FutureWarning
|
||||
for unescaped use of "[" as a literal within character classes. Due to UTS#18
|
||||
providing insufficient guidance, engines interpret the syntax differently.
|
||||
Rust's "regex" crate and Python's "regex" PyPI module both implement UTS#18
|
||||
extended classes, but with slight incompatibilities ([A||B&&C] is parsed as
|
||||
[A||[B&&C]] in Python's "regex" but as [[A||B]&&C] in Rust's "regex").
|
||||
</P>
|
||||
<P>
|
||||
PCRE2's syntax adds syntax restrictions similar to ECMAScript's /v flag, so
|
||||
that all the UTS#18 extended classes accepted as valid by PCRE2 have the
|
||||
property that they are interpreted either with the same behaviour, or as
|
||||
invalid, by all other major engines. Please file an issue if you are aware of
|
||||
cross-engine differences in behaviour between PCRE2 and another major engine.
|
||||
<a name="posixclasses"></a></P>
|
||||
<br><a name="SEC12" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
|
||||
<P>
|
||||
Perl supports the POSIX notation for character classes. This uses names
|
||||
enclosed by [: and :] within the enclosing square brackets. PCRE2 also supports
|
||||
this notation, in both ordinary and extended classes. For example,
|
||||
<pre>
|
||||
[01[:alpha:]%]
|
||||
</pre>
|
||||
matches "0", "1", any alphabetic character, or "%". The supported class names
|
||||
are:
|
||||
<pre>
|
||||
alnum letters and digits
|
||||
alpha letters
|
||||
ascii character codes 0 - 127
|
||||
blank space or tab only
|
||||
cntrl control characters
|
||||
digit decimal digits (same as \d)
|
||||
graph printing characters, excluding space
|
||||
lower lower case letters
|
||||
print printing characters, including space
|
||||
punct printing characters, excluding letters and digits and space
|
||||
space white space (the same as \s from PCRE 8.34)
|
||||
upper upper case letters
|
||||
word "word" characters (same as \w)
|
||||
xdigit hexadecimal digits
|
||||
</pre>
|
||||
The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
|
||||
and space (32). If locale-specific matching is taking place, the list of space
|
||||
characters may be different; there may be fewer or more of them. "Space" and
|
||||
\s match the same set of characters, as do "word" and \w.
|
||||
</P>
|
||||
<P>
|
||||
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
|
||||
5.8. Another Perl extension is negation, which is indicated by a ^ character
|
||||
after the colon. For example,
|
||||
<pre>
|
||||
[12[:^digit:]]
|
||||
</pre>
|
||||
matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
|
||||
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
|
||||
supported, and an error is given if they are encountered.
|
||||
</P>
|
||||
<P>
|
||||
By default, characters with values greater than 127 do not match any of the
|
||||
POSIX character classes, although this may be different for characters in the
|
||||
range 128-255 when locale-specific matching is happening. However, in UCP mode,
|
||||
unless certain options are set (see below), some of the classes are changed so
|
||||
that Unicode character properties are used. This is achieved by replacing
|
||||
POSIX classes with other sequences, as follows:
|
||||
<pre>
|
||||
[:alnum:] becomes \p{Xan}
|
||||
[:alpha:] becomes \p{L}
|
||||
[:blank:] becomes \h
|
||||
[:cntrl:] becomes \p{Cc}
|
||||
[:digit:] becomes \p{Nd}
|
||||
[:lower:] becomes \p{Ll}
|
||||
[:space:] becomes \p{Xps}
|
||||
[:upper:] becomes \p{Lu}
|
||||
[:word:] becomes \p{Xwd}
|
||||
</pre>
|
||||
Negated versions, such as [:^alpha:] use \P instead of \p. Four other POSIX
|
||||
classes are handled specially in UCP mode:
|
||||
</P>
|
||||
<P>
|
||||
[:graph:]
|
||||
This matches characters that have glyphs that mark the page when printed. In
|
||||
Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
|
||||
properties, except for:
|
||||
<pre>
|
||||
U+061C Arabic Letter Mark
|
||||
U+180E Mongolian Vowel Separator
|
||||
U+2066 - U+2069 Various "isolate"s
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
[:print:]
|
||||
This matches the same characters as [:graph:] plus space characters that are
|
||||
not controls, that is, characters with the Zs property.
|
||||
</P>
|
||||
<P>
|
||||
[:punct:]
|
||||
This matches all characters that have the Unicode P (punctuation) property,
|
||||
plus those characters with code points less than 256 that have the S (Symbol)
|
||||
property.
|
||||
</P>
|
||||
<P>
|
||||
[:xdigit:]
|
||||
In addition to the ASCII hexadecimal digits, this also matches the "fullwidth"
|
||||
versions of those characters, whose Unicode code points start at U+FF10. This
|
||||
is a change that was made in PCRE2 release 10.43 for Perl compatibility.
|
||||
</P>
|
||||
<P>
|
||||
The other POSIX classes are unchanged by PCRE2_UCP, and match only characters
|
||||
with code points less than 256.
|
||||
</P>
|
||||
<P>
|
||||
There are two options that can be used to restrict the POSIX classes to ASCII
|
||||
characters when PCRE2_UCP is set. The option PCRE2_EXTRA_ASCII_DIGIT affects
|
||||
just [:digit:] and [:xdigit:]. Within a pattern, this can be set and unset by
|
||||
(?aT) and (?-aT). The PCRE2_EXTRA_ASCII_POSIX option disables UCP processing
|
||||
for all POSIX classes, including [:digit:] and [:xdigit:]. Within a pattern,
|
||||
(?aP) and (?-aP) set and unset both these options for consistency.
|
||||
<a name="wordboundcompat"></a></P>
|
||||
<br><a name="SEC13" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
|
||||
<P>
|
||||
In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly
|
||||
syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of
|
||||
word". PCRE2 treats these items as follows:
|
||||
<pre>
|
||||
[[:<:]] is converted to \b(?=\w)
|
||||
[[:>:]] is converted to \b(?<=\w)
|
||||
</pre>
|
||||
Only these exact character sequences are recognized. A sequence such as
|
||||
[a[:<:]b] provokes an error for an unrecognized POSIX class name. This support is
|
||||
not compatible with Perl. It is provided to help migrations from other
|
||||
environments, and is best not used in any new patterns. Note that \b matches
|
||||
at the start and the end of a word (see
|
||||
<a href="#smallassertions">"Simple assertions"</a>
|
||||
above), and in a Perl-style pattern the preceding or following character
|
||||
normally shows which is wanted, without the need for the assertions that are
|
||||
used above in order to give exactly the POSIX behaviour. Note also that the
|
||||
PCRE2_UCP option changes the meaning of \w (and therefore \b) by default, so
|
||||
it also affects these POSIX sequences.
|
||||
</P>
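<P>
The two expansions shown above use only \b, \w and lookaround assertions, so
their effect can be sketched with Python's "re" module as a stand-in. This is
not PCRE2 itself, and Python does not recognize the [[:<:]] and [[:>:]]
spellings; only the converted forms are used here.
</P>

```python
import re

subject = "the cat"

# \b(?=\w): a word boundary that is followed by a word character,
# that is, the start of a word.
starts = [m.start() for m in re.finditer(r"\b(?=\w)", subject)]

# \b(?<=\w): a word boundary that is preceded by a word character,
# that is, the end of a word.
ends = [m.start() for m in re.finditer(r"\b(?<=\w)", subject)]

print(starts, ends)
```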
|
||||
<br><a name="SEC14" href="#TOC1">VERTICAL BAR</a><br>
|
||||
<P>
|
||||
Vertical bar characters are used to separate alternative patterns. For example,
|
||||
the pattern
|
||||
<pre>
|
||||
gilbert|sullivan
|
||||
</pre>
|
||||
matches either "gilbert" or "sullivan". Any number of alternatives may appear,
|
||||
and an empty alternative is permitted (matching the empty string). The matching
|
||||
process tries each alternative in turn, from left to right, and the first one
|
||||
that succeeds is used. If the alternatives are within a group
|
||||
<a href="#group">(defined below),</a>
|
||||
"succeeds" means matching the rest of the main pattern as well as the
|
||||
alternative in the group.
|
||||
<a name="internaloptions"></a></P>
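<P>
The left-to-right rule can be observed in any backtracking engine. The sketch
below uses Python's "re" module as a stand-in for PCRE2 (an assumption of this
example); for these simple patterns the two are expected to behave alike.
</P>

```python
import re

# The first alternative that succeeds is used.
first = re.search(r"gilbert|sullivan", "gilbert and sullivan").group()

# Order matters: "a" wins even though "ab" would give a longer match.
ordered = re.search(r"a|ab", "ab").group()

# An empty alternative is permitted and matches the empty string.
empty = re.fullmatch(r"gilbert|", "")

# Inside a group, "succeeds" includes the rest of the pattern: "a" alone
# cannot be followed by "c" here, so the second alternative is used.
grouped = re.search(r"(a|ab)c", "abc")
```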
|
||||
<br><a name="SEC15" href="#TOC1">INTERNAL OPTION SETTING</a><br>
|
||||
<P>
|
||||
The settings of several options can be changed within a pattern by a sequence
|
||||
of letters enclosed between "(?" and ")". The following are Perl-compatible,
|
||||
and are described in detail in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation. The option letters are:
|
||||
<pre>
|
||||
i for PCRE2_CASELESS
|
||||
m for PCRE2_MULTILINE
|
||||
n for PCRE2_NO_AUTO_CAPTURE
|
||||
s for PCRE2_DOTALL
|
||||
x for PCRE2_EXTENDED
|
||||
xx for PCRE2_EXTENDED_MORE
|
||||
</pre>
|
||||
For example, (?im) sets caseless, multiline matching. It is also possible to
|
||||
unset these options by preceding the relevant letters with a hyphen, for
|
||||
example (?-im). The two "extended" options are not independent; unsetting
|
||||
either one cancels the effects of both of them.
|
||||
</P>
|
||||
<P>
|
||||
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
|
||||
and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
|
||||
permitted. Only one hyphen may appear in the options string. If a letter
|
||||
appears both before and after the hyphen, the option is unset. An empty options
|
||||
setting "(?)" is allowed. Needless to say, it has no effect.
|
||||
</P>
|
||||
<P>
|
||||
If the first character following (? is a circumflex, it causes all of the above
|
||||
options to be unset. Letters may follow the circumflex to cause some options to
|
||||
be re-instated, but a hyphen may not appear.
|
||||
</P>
|
||||
<P>
|
||||
Some PCRE2-specific options can be changed by the same mechanism using these
|
||||
pairs or individual letters:
|
||||
<pre>
|
||||
aD for PCRE2_EXTRA_ASCII_BSD
|
||||
aS for PCRE2_EXTRA_ASCII_BSS
|
||||
aW for PCRE2_EXTRA_ASCII_BSW
|
||||
aP for PCRE2_EXTRA_ASCII_POSIX and PCRE2_EXTRA_ASCII_DIGIT
|
||||
aT for PCRE2_EXTRA_ASCII_DIGIT
|
||||
r for PCRE2_EXTRA_CASELESS_RESTRICT
|
||||
J for PCRE2_DUPNAMES
|
||||
U for PCRE2_UNGREEDY
|
||||
</pre>
|
||||
However, except for 'r', these are not unset by (?^), which is equivalent to
|
||||
(?-imnrsx). If 'a' is not followed by any of the upper case letters shown
|
||||
above, it sets (or unsets) all the ASCII options.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2_EXTRA_ASCII_DIGIT has no additional effect when PCRE2_EXTRA_ASCII_POSIX
|
||||
is set, but including it in (?aP) means that (?-aP) suppresses all ASCII
|
||||
restrictions for POSIX classes.
|
||||
</P>
|
||||
<P>
|
||||
When one of these option changes occurs at top level (that is, not inside group
|
||||
parentheses), the change applies until a subsequent change, or the end of the
|
||||
pattern. An option change within a group (see below for a description of
|
||||
groups) affects only that part of the group that follows it. At the end of the
|
||||
group these options are reset to the state they were before the group. For
|
||||
example,
|
||||
<pre>
|
||||
(a(?i)b)c
|
||||
</pre>
|
||||
matches abc and aBc and no other strings (assuming PCRE2_CASELESS is not set
|
||||
externally). Any changes made in one alternative do carry on into subsequent
|
||||
branches within the same group. For example,
|
||||
<pre>
|
||||
(a(?i)b|c)
|
||||
</pre>
|
||||
matches "ab", "aB", "c", and "C", even though when matching "C" the first
|
||||
branch is abandoned before the option setting. This is because the effects of
|
||||
option settings happen at compile time. There would be some very weird
|
||||
behaviour otherwise.
|
||||
</P>
|
||||
<P>
|
||||
As a convenient shorthand, if any option settings are required at the start of
|
||||
a non-capturing group (see the next section), the option letters may
|
||||
appear between the "?" and the ":". Thus the two patterns
|
||||
<pre>
|
||||
(?i:saturday|sunday)
|
||||
(?:(?i)saturday|sunday)
|
||||
</pre>
|
||||
match exactly the same set of strings.
|
||||
</P>
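<P>
The scoped shorthand is also accepted by Python's "re" module, which is used
below as a stand-in for PCRE2. Python only accepts the unscoped (?i) form at
the very start of a pattern, so only the (?i:...) form is sketched here.
</P>

```python
import re

# Options between "?" and ":" apply to the whole non-capturing group.
scoped = re.compile(r"(?i:saturday|sunday)")
upper = scoped.fullmatch("SUNDAY")
mixed = scoped.fullmatch("Saturday")

# The option is reset at the end of the group, so "b" stays case-sensitive.
limited = re.compile(r"(?i:a)b")
inside = limited.fullmatch("Ab")
outside = limited.fullmatch("AB")
```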
|
||||
<P>
|
||||
<b>Note:</b> There are other PCRE2-specific options, applying to the whole
|
||||
pattern, which can be set by the application when the compiling function is
|
||||
called. In addition, the pattern can contain special leading sequences such as
|
||||
(*CRLF) to override what the application has set or what has been defaulted.
|
||||
Details are given in the section entitled
|
||||
<a href="#newlineseq">"Newline sequences"</a>
|
||||
above. There are also the (*UTF) and (*UCP) leading sequences that can be used
|
||||
to set UTF and Unicode property modes; they are equivalent to setting the
|
||||
PCRE2_UTF and PCRE2_UCP options, respectively. However, the application can set
|
||||
the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, which lock out the use of the
|
||||
(*UTF) and (*UCP) sequences.
|
||||
<a name="group"></a></P>
|
||||
<br><a name="SEC16" href="#TOC1">GROUPS</a><br>
|
||||
<P>
|
||||
Groups are delimited by parentheses (round brackets), which can be nested.
|
||||
Turning part of a pattern into a group does two things:
|
||||
<br>
|
||||
<br>
|
||||
1. It localizes a set of alternatives. For example, the pattern
|
||||
<pre>
|
||||
cat(aract|erpillar|)
|
||||
</pre>
|
||||
matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
|
||||
match "cataract", "erpillar" or an empty string.
|
||||
<br>
|
||||
<br>
|
||||
2. It creates a "capture group". This means that, when the whole pattern
|
||||
matches, the portion of the subject string that matched the group is passed
|
||||
back to the caller, separately from the portion that matched the whole pattern.
|
||||
(This applies only to the traditional matching function; the DFA matching
|
||||
function does not support capturing.)
|
||||
</P>
|
||||
<P>
|
||||
Opening parentheses are counted from left to right (starting from 1) to obtain
|
||||
numbers for capture groups. For example, if the string "the red king" is
|
||||
matched against the pattern
|
||||
<pre>
|
||||
the ((red|white) (king|queen))
|
||||
</pre>
|
||||
the captured substrings are "red king", "red", and "king", and are numbered 1,
|
||||
2, and 3, respectively.
|
||||
</P>
|
||||
<P>
|
||||
The fact that plain parentheses fulfil two functions is not always helpful.
|
||||
There are often times when grouping is required without capturing. If an
|
||||
opening parenthesis is followed by a question mark and a colon, the group
|
||||
does not do any capturing, and is not counted when computing the number of any
|
||||
subsequent capture groups. For example, if the string "the white queen"
|
||||
is matched against the pattern
|
||||
<pre>
|
||||
the ((?:red|white) (king|queen))
|
||||
</pre>
|
||||
the captured substrings are "white queen" and "queen", and are numbered 1 and
|
||||
2. The maximum number of capture groups is 65535.
|
||||
</P>
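<P>
The numbering rule can be checked with the two patterns above. Python's "re"
module is used here as a stand-in for PCRE2 (an assumption of this sketch); it
numbers capture groups in the same left-to-right way.
</P>

```python
import re

# Three capture groups, numbered by the position of their opening parenthesis.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups without capturing, so later groups move down by one.
non_capturing = re.fullmatch(
    r"the ((?:red|white) (king|queen))", "the white queen"
)

print(capturing.groups())
print(non_capturing.groups())
```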
|
||||
<P>
|
||||
As a convenient shorthand, if any option settings are required at the start of
|
||||
a non-capturing group, the option letters may appear between the "?" and the
|
||||
":". Thus the two patterns
|
||||
<pre>
|
||||
(?i:saturday|sunday)
|
||||
(?:(?i)saturday|sunday)
|
||||
</pre>
|
||||
match exactly the same set of strings. Because alternative branches are tried
|
||||
from left to right, and options are not reset until the end of the group is
|
||||
reached, an option setting in one branch does affect subsequent branches, so
|
||||
the above patterns match "SUNDAY" as well as "Saturday".
|
||||
<a name="dupgroupnumber"></a></P>
|
||||
<br><a name="SEC17" href="#TOC1">DUPLICATE GROUP NUMBERS</a><br>
|
||||
<P>
|
||||
Perl 5.10 introduced a feature whereby each alternative in a group uses the
|
||||
same numbers for its capturing parentheses. Such a group starts with (?| and is
|
||||
itself a non-capturing group. For example, consider this pattern:
|
||||
<pre>
|
||||
(?|(Sat)ur|(Sun))day
|
||||
</pre>
|
||||
Because the two alternatives are inside a (?| group, both sets of capturing
|
||||
parentheses are numbered one. Thus, when the pattern matches, you can look
|
||||
at captured substring number one, whichever alternative matched. This construct
|
||||
is useful when you want to capture part, but not all, of one of a number of
|
||||
alternatives. Inside a (?| group, parentheses are numbered as usual, but the
|
||||
number is reset at the start of each branch. The numbers of any capturing
|
||||
parentheses that follow the whole group start after the highest number used in
|
||||
any branch. The following example is taken from the Perl documentation. The
|
||||
numbers underneath show in which buffer the captured content will be stored.
|
||||
<pre>
|
||||
# before  ---------------branch-reset----------- after
|
||||
/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
|
||||
# 1            2         2  3        2     3     4
|
||||
</pre>
|
||||
A backreference to a capture group uses the most recent value that is set for
|
||||
the group. The following pattern matches "abcabc" or "defdef":
|
||||
<pre>
|
||||
/(?|(abc)|(def))\1/
|
||||
</pre>
|
||||
In contrast, a subroutine call to a capture group always refers to the
|
||||
first one in the pattern with the given number. The following pattern matches
|
||||
"abcabc" or "defabc":
|
||||
<pre>
|
||||
/(?|(abc)|(def))(?1)/
|
||||
</pre>
|
||||
A relative reference such as (?-1) is no different: it is just a convenient way
|
||||
of computing an absolute group number.
|
||||
</P>
|
||||
<P>
|
||||
If a
|
||||
<a href="#conditions">condition test</a>
|
||||
for a group's having matched refers to a non-unique number, the test is
|
||||
true if any group with that number has matched.
|
||||
</P>
|
||||
<P>
|
||||
An alternative approach to using this "branch reset" feature is to use
|
||||
duplicate named groups, as described in the next section.
|
||||
</P>
|
||||
<br><a name="SEC18" href="#TOC1">NAMED CAPTURE GROUPS</a><br>
|
||||
<P>
|
||||
Identifying capture groups by number is simple, but it can be very hard to keep
|
||||
track of the numbers in complicated patterns. Furthermore, if an expression is
|
||||
modified, the numbers may change. To help with this difficulty, PCRE2 supports
|
||||
the naming of capture groups. This feature was not added to Perl until release
|
||||
5.10. Python had the feature earlier, and PCRE1 introduced it at release 4.0,
|
||||
using the Python syntax. PCRE2 supports both the Perl and the Python syntax.
|
||||
</P>
|
||||
<P>
|
||||
In PCRE2, a capture group can be named in one of three ways: (?<name>...) or
|
||||
(?'name'...) as in Perl, or (?P<name>...) as in Python. Names may be up to 128
|
||||
code units long. When PCRE2_UTF is not set, they may contain only ASCII
|
||||
alphanumeric characters and underscores, but must start with a non-digit. When
|
||||
PCRE2_UTF is set, the syntax of group names is extended to allow any Unicode
|
||||
letter or Unicode decimal digit. In other words, group names must match one of
|
||||
these patterns:
|
||||
<pre>
|
||||
^[_A-Za-z][_A-Za-z0-9]*\z when PCRE2_UTF is not set
|
||||
^[_\p{L}][_\p{L}\p{Nd}]*\z when PCRE2_UTF is set
|
||||
</pre>
|
||||
References to capture groups from other parts of the pattern, such as
|
||||
<a href="#backreferences">backreferences,</a>
|
||||
<a href="#recursion">recursion,</a>
|
||||
and
|
||||
<a href="#conditions">conditions,</a>
|
||||
can all be made by name as well as by number.
|
||||
</P>
|
||||
<P>
|
||||
Named capture groups are allocated numbers as well as names, exactly as
|
||||
if the names were not present. In both PCRE2 and Perl, capture groups
|
||||
are primarily identified by numbers; any names are just aliases for these
|
||||
numbers. The PCRE2 API provides function calls for extracting the complete
|
||||
name-to-number translation table from a compiled pattern, as well as
|
||||
convenience functions for extracting captured substrings by name.
|
||||
</P>
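<P>
The idea that names are aliases for numbers can be sketched with Python's "re"
module, which accepts the (?P<name>...) syntax. Python is only a stand-in
here: its "groupindex" attribute plays a role similar to the name-to-number
table mentioned above, but it is not the PCRE2 API.
</P>

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2024-05")

# Named groups are still numbered as if the names were absent.
by_name = m.group("year")
by_number = m.group(1)

# The name-to-number table of the compiled pattern.
table = dict(date.groupindex)
```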
|
||||
<P>
|
||||
<b>Warning:</b> When more than one capture group has the same number, as
|
||||
described in the previous section, a name given to one of them applies to all
|
||||
of them. Perl allows identically numbered groups to have different names.
|
||||
Consider this pattern, where there are two capture groups, both numbered 1:
|
||||
<pre>
|
||||
(?|(?<AA>aa)|(?<BB>bb))
|
||||
</pre>
|
||||
Perl allows this, with both names AA and BB as aliases of group 1. Thus, after
|
||||
a successful match, both names yield the same value (either "aa" or "bb").
|
||||
</P>
|
||||
<P>
|
||||
In an attempt to reduce confusion, PCRE2 does not allow the same group number
|
||||
to be associated with more than one name. The example above provokes a
|
||||
compile-time error. However, there is still scope for confusion. Consider this
|
||||
pattern:
|
||||
<pre>
|
||||
(?|(?<AA>aa)|(bb))
|
||||
</pre>
|
||||
Although the second group number 1 is not explicitly named, the name AA is
|
||||
still an alias for any group 1. Whether the pattern matches "aa" or "bb", a
|
||||
reference by name to group AA yields the matched string.
|
||||
</P>
|
||||
<P>
|
||||
By default, a name must be unique within a pattern, except that duplicate names
|
||||
are permitted for groups with the same number, for example:
|
||||
<pre>
|
||||
(?|(?<AA>aa)|(?<AA>bb))
|
||||
</pre>
|
||||
The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES
|
||||
option at compile time, or by the use of (?J) within the pattern, as described
|
||||
in the section entitled
|
||||
<a href="#internaloptions">"Internal Option Setting"</a>
|
||||
above.
|
||||
</P>
|
||||
<P>
|
||||
Duplicate names can be useful for patterns where only one instance of the named
|
||||
capture group can match. Suppose you want to match the name of a weekday,
|
||||
either as a 3-letter abbreviation or as the full name, and in both cases you
|
||||
want to extract the abbreviation. This pattern (ignoring the line breaks) does
|
||||
the job:
|
||||
<pre>
|
||||
(?J)
|
||||
(?<DN>Mon|Fri|Sun)(?:day)?|
|
||||
(?<DN>Tue)(?:sday)?|
|
||||
(?<DN>Wed)(?:nesday)?|
|
||||
(?<DN>Thu)(?:rsday)?|
|
||||
(?<DN>Sat)(?:urday)?
|
||||
</pre>
|
||||
There are five capture groups, but only one is ever set after a match. The
|
||||
convenience functions for extracting the data by name return the substring for
|
||||
the first (and in this example, the only) group of that name that matched. This
|
||||
saves searching to find which numbered group it was. (An alternative way of
|
||||
solving this problem is to use a "branch reset" group, as described in the
|
||||
previous section.)
|
||||
</P>
|
||||
<P>
|
||||
If you make a backreference to a non-unique named group from elsewhere in the
|
||||
pattern, the groups to which the name refers are checked in the order in which
|
||||
they appear in the overall pattern. The first one that is set is used for the
|
||||
reference. For example, this pattern matches both "foofoo" and "barbar" but not
|
||||
"foobar" or "barfoo":
|
||||
<pre>
|
||||
(?J)(?:(?<n>foo)|(?<n>bar))\k<n>
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
If you make a subroutine call to a non-unique named group, the one that
|
||||
corresponds to the first occurrence of the name is used. In the absence of
|
||||
duplicate numbers this is the one with the lowest number.
|
||||
</P>
|
||||
<P>
|
||||
If you use a named reference in a condition
|
||||
test (see the
|
||||
<a href="#conditions">section about conditions</a>
|
||||
below), either to check whether a capture group has matched, or to check for
|
||||
recursion, all groups with the same name are tested. If the condition is true
|
||||
for any one of them, the overall condition is true. This is the same behaviour
|
||||
as testing by number. For further details of the interfaces for handling named
|
||||
capture groups, see the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<br><a name="SEC19" href="#TOC1">REPETITION</a><br>
|
||||
<P>
|
||||
Repetition is specified by quantifiers, which may follow any one of these
|
||||
items:
|
||||
<pre>
|
||||
a literal data character
|
||||
the dot metacharacter
|
||||
the \C escape sequence
|
||||
the \R escape sequence
|
||||
the \X escape sequence
|
||||
any escape sequence that matches a single character
|
||||
a character class
|
||||
a backreference
|
||||
a parenthesized group (including lookaround assertions)
|
||||
a subroutine call (recursive or otherwise)
|
||||
</pre>
|
||||
If a quantifier does not follow a repeatable item, an error occurs. The
|
||||
general repetition quantifier specifies a minimum and maximum number of
|
||||
permitted matches by giving two numbers in curly brackets (braces), separated
|
||||
by a comma. The numbers must be less than 65536, and the first must be less
|
||||
than or equal to the second. For example,
|
||||
<pre>
|
||||
z{2,4}
|
||||
</pre>
|
||||
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
|
||||
character. If the second number is omitted, but the comma is present, there is
|
||||
no upper limit; if the second number and the comma are both omitted, the
|
||||
quantifier specifies an exact number of required matches. Thus
|
||||
<pre>
|
||||
[aeiou]{3,}
|
||||
</pre>
|
||||
matches at least 3 successive vowels, but may match many more, whereas
|
||||
<pre>
|
||||
\d{8}
|
||||
</pre>
|
||||
matches exactly 8 digits. If the first number is omitted, the lower limit is
|
||||
taken as zero; in this case the upper limit must be present.
|
||||
<pre>
|
||||
X{,4} is interpreted as X{0,4}
|
||||
</pre>
|
||||
This is a change in behaviour that happened in Perl 5.34.0 and PCRE2 10.43. In
|
||||
earlier versions such a sequence was not interpreted as a quantifier. Other
|
||||
regular expression engines may behave either way.
|
||||
</P>
|
||||
<P>
|
||||
If the characters that follow an opening brace do not match the syntax of a
|
||||
quantifier, the brace is taken as a literal character. In particular, this
|
||||
means that {,} is a literal string of three characters.
|
||||
</P>
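<P>
The brace rules described above can be summarized in a small classifier. This
is a sketch of the rules as stated in this section, not PCRE2's parser: the
names "parse_quantifier" and "MAX_REPEAT" and the error messages are
assumptions made for the example.
</P>

```python
import re

MAX_REPEAT = 65535  # both numbers must be less than 65536

_BRACES = re.compile(r"\{([0-9]*)(,?)([0-9]*)\}")


def parse_quantifier(text):
    """Return (minimum, maximum) for a brace quantifier, or None when the
    text is a literal string. A maximum of None means "no upper limit".
    (Hypothetical helper, for illustration only.)"""
    m = _BRACES.fullmatch(text)
    if m is None:
        return None                      # not quantifier syntax: literal
    low, comma, high = m.groups()
    if not low and not high:
        return None                      # "{}" and "{,}" are literals
    minimum = int(low) if low else 0     # "{,4}" is read as "{0,4}"
    if not comma:
        maximum = minimum                # "{8}" is an exact count
    elif high:
        maximum = int(high)
    else:
        maximum = None                   # "{3,}" has no upper limit
    if minimum > MAX_REPEAT or (maximum is not None and maximum > MAX_REPEAT):
        raise ValueError("number too big in quantifier")
    if maximum is not None and minimum > maximum:
        raise ValueError("numbers out of order in quantifier")
    return minimum, maximum
```

<P>
For instance, parse_quantifier("{,4}") gives (0, 4), whereas
parse_quantifier("{,}") gives None because the text is a literal.
</P>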
|
||||
<P>
|
||||
Note that not every opening brace is potentially the start of a quantifier
|
||||
because braces are used in other items such as \N{U+345} or \k{name}.
|
||||
</P>
|
||||
<P>
|
||||
In UTF modes, quantifiers apply to characters rather than to individual code
|
||||
units. Thus, for example, \x{100}{2} matches two characters, each of
|
||||
which is represented by a two-byte sequence in a UTF-8 string. Similarly,
|
||||
\X{3} matches three Unicode extended grapheme clusters, each of which may be
|
||||
several code units long (and they may be of different lengths).
|
||||
</P>
|
||||
<P>
|
||||
The quantifier {0} is permitted, causing the expression to behave as if the
|
||||
previous item and the quantifier were not present. This may be useful for
|
||||
capture groups that are referenced as
|
||||
<a href="#groupsassubroutines">subroutines</a>
|
||||
from elsewhere in the pattern (but see also the section entitled
|
||||
<a href="#subdefine">"Defining capture groups for use by reference only"</a>
|
||||
below). Except for parenthesized groups, items that have a {0} quantifier are
|
||||
omitted from the compiled pattern.
|
||||
</P>
|
||||
<P>
|
||||
For convenience, the three most common quantifiers have single-character
|
||||
abbreviations:
|
||||
<pre>
|
||||
* is equivalent to {0,}
|
||||
+ is equivalent to {1,}
|
||||
? is equivalent to {0,1}
|
||||
</pre>
|
||||
It is possible to construct infinite loops by following a group that can match
|
||||
no characters with a quantifier that has no upper limit, for example:
|
||||
<pre>
|
||||
(a?)*
|
||||
</pre>
|
||||
Earlier versions of Perl and PCRE1 used to give an error at compile time for
|
||||
such patterns. However, because there are cases where this can be useful, such
|
||||
patterns are now accepted, but whenever an iteration of such a group matches no
|
||||
characters, matching moves on to the next item in the pattern instead of
|
||||
repeatedly matching an empty string. This does not prevent backtracking into
|
||||
any of the iterations if a subsequent item fails to match.
|
||||
</P>
|
||||
<P>
|
||||
By default, quantifiers are "greedy", that is, they match as much as possible
|
||||
(up to the maximum number of permitted repetitions), without causing the rest
|
||||
of the pattern to fail. The classic example of where this gives problems is in
|
||||
trying to match comments in C programs. These appear between /* and */ and
|
||||
within the comment, individual * and / characters may appear. An attempt to
|
||||
match C comments by applying the pattern
|
||||
<pre>
|
||||
/\*.*\*/
|
||||
</pre>
|
||||
to the string
|
||||
<pre>
|
||||
/* first comment */ not comment /* second comment */
|
||||
</pre>
|
||||
fails, because it matches the entire string owing to the greediness of the .*
|
||||
item. However, if a quantifier is followed by a question mark, it ceases to be
|
||||
greedy, and instead matches the minimum number of times possible, so the
|
||||
pattern
|
||||
<pre>
|
||||
/\*.*?\*/
|
||||
</pre>
|
||||
does the right thing with C comments. The meaning of the various quantifiers is
|
||||
not otherwise changed, just the preferred number of matches. Do not confuse
|
||||
this use of question mark with its use as a quantifier in its own right.
|
||||
Because it has two uses, it can sometimes appear doubled, as in
|
||||
<pre>
|
||||
\d??\d
|
||||
</pre>
|
||||
which matches one digit by preference, but can match two if that is the only
|
||||
way the rest of the pattern matches.
|
||||
</P>
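<P>
The difference between the greedy and the minimizing forms can be checked with a few lines of code. The sketch below uses Python's standard <b>re</b> module rather than PCRE2 itself (an assumption made purely for convenience; the two engines agree on these particular patterns), and the variable names are illustrative only.
</P>

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs on to the last */ in the subject, giving one oversized match.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first */ it can reach, so each comment is separate.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers a single digit, but takes two when the whole subject must match.
one_digit = re.match(r"\d??\d", "12").group(0)
two_digits = re.fullmatch(r"\d??\d", "12").group(0)

print(greedy)
print(lazy)
print(one_digit, two_digits)
```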
|
||||
<P>
|
||||
If the PCRE2_UNGREEDY option is set (an option that is not available in Perl),
|
||||
the quantifiers are not greedy by default, but individual ones can be made
|
||||
greedy by following them with a question mark. In other words, it inverts the
|
||||
default behaviour.
|
||||
</P>
|
||||
<P>
|
||||
When a parenthesized group is quantified with a minimum repeat count that
|
||||
is greater than 1 or with a limited maximum, more memory is required for the
|
||||
compiled pattern, in proportion to the size of the minimum or maximum.
|
||||
</P>
|
||||
<P>
|
||||
If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option (equivalent
|
||||
to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
|
||||
implicitly anchored, because whatever follows will be tried against every
|
||||
character position in the subject string, so there is no point in retrying the
|
||||
overall match at any position after the first. PCRE2 normally treats such a
|
||||
pattern as though it were preceded by \A.
|
||||
</P>
|
||||
<P>
|
||||
In cases where it is known that the subject string contains no newlines, it is
|
||||
worth setting PCRE2_DOTALL in order to obtain this optimization, or
|
||||
alternatively, using ^ to indicate anchoring explicitly.
|
||||
</P>
|
||||
<P>
|
||||
However, there are some cases where the optimization cannot be used. When .*
|
||||
is inside capturing parentheses that are the subject of a backreference
|
||||
elsewhere in the pattern, a match at the start may fail where a later one
|
||||
succeeds. Consider, for example:
|
||||
<pre>
|
||||
(.*)abc\1
|
||||
</pre>
|
||||
If the subject is "xyz123abc123" the match point is the fourth character. For
|
||||
this reason, such a pattern is not implicitly anchored.
|
||||
</P>
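<P>
The backreference case can be reproduced in a short sketch. It uses Python's <b>re</b> module as a stand-in for PCRE2 (an assumption; Python does not perform the implicit anchoring optimization, so it only shows where the match is found, which is what makes anchoring unsafe here). Note that Python reports offsets from zero, so the fourth character is offset 3.
</P>

```python
import re

subject = "xyz123abc123"

# The match cannot start at offset 0: whatever (.*) captures must reappear
# after "abc", and that only works once the match starts at "123".
m = re.search(r"(.*)abc\1", subject)

match_start = m.start()   # 0-based offset of the match point
matched = m.group(0)      # the whole matched text
captured = m.group(1)     # what (.*) captured

print(match_start, matched, captured)
```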
|
||||
<P>
|
||||
Another case where implicit anchoring is not applied is when the leading .* is
|
||||
inside an atomic group. Once again, a match at the start may fail where a later
|
||||
one succeeds. Consider this pattern:
|
||||
<pre>
|
||||
(?>.*?a)b
|
||||
</pre>
|
||||
It matches "ab" in the subject "aab". The use of the backtracking control verbs
|
||||
(*PRUNE) and (*SKIP) also disables this optimization. To disable it explicitly,
|
||||
either pass the compile option PCRE2_NO_DOTSTAR_ANCHOR, or call
|
||||
<b>pcre2_set_optimize()</b> with a PCRE2_DOTSTAR_ANCHOR_OFF directive.
|
||||
</P>
|
||||
<P>
|
||||
When a capture group is repeated, the value captured is the substring that
|
||||
matched the final iteration. For example, after
|
||||
<pre>
|
||||
(tweedle[dume]{3}\s*)+
|
||||
</pre>
|
||||
has matched "tweedledum tweedledee" the value of the captured substring is
|
||||
"tweedledee". However, if there are nested capture groups, the corresponding
|
||||
captured values may have been set in previous iterations. For example, after
|
||||
<pre>
|
||||
(a|(b))+
|
||||
</pre>
|
||||
matches "aba" the value of the second captured substring is "b".
|
||||
<a name="atomicgroup"></a></P>
|
||||
<br><a name="SEC20" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
|
||||
<P>
|
||||
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
|
||||
repetition, failure of what follows normally causes the repeated item to be
|
||||
re-evaluated to see if a different number of repeats allows the rest of the
|
||||
pattern to match. Sometimes it is useful to prevent this, either to change the
|
||||
nature of the match, or to cause it to fail earlier than it otherwise might, when
|
||||
the author of the pattern knows there is no point in carrying on.
|
||||
</P>
|
||||
<P>
|
||||
Consider, for example, the pattern \d+foo when applied to the subject line
|
||||
<pre>
|
||||
123456bar
|
||||
</pre>
|
||||
After matching all 6 digits and then failing to match "foo", the normal
|
||||
action of the matcher is to try again with only 5 digits matching the \d+
|
||||
item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
|
||||
(a term taken from Jeffrey Friedl's book) provides the means for specifying
|
||||
that once a group has matched, it is not to be re-evaluated in this way.
|
||||
</P>
|
||||
<P>
|
||||
If we use atomic grouping for the previous example, the matcher gives up
|
||||
immediately on failing to match "foo" the first time. The notation is a kind of
|
||||
special parenthesis, starting with (?> as in this example:
|
||||
<pre>
|
||||
(?>\d+)foo
|
||||
</pre>
|
||||
Perl 5.28 introduced an experimental alphabetic form starting with (* which may
|
||||
be easier to remember:
|
||||
<pre>
|
||||
(*atomic:\d+)foo
|
||||
</pre>
|
||||
This kind of parenthesized group "locks up" the part of the pattern it contains
|
||||
once it has matched, and a failure further into the pattern is prevented from
|
||||
backtracking into it. Backtracking past it to previous items, however, works as
|
||||
normal.
|
||||
</P>
|
||||
<P>
|
||||
An alternative description is that a group of this type matches exactly the
|
||||
string of characters that an identical standalone pattern would match, if
|
||||
anchored at the current point in the subject string.
|
||||
</P>
|
||||
<P>
|
||||
Atomic groups are not capture groups. Simple cases such as the above example
|
||||
can be thought of as a maximizing repeat that must swallow everything it can.
|
||||
So, while both \d+ and \d+? are prepared to adjust the number of digits they
|
||||
match in order to make the rest of the pattern match, (?>\d+) can only match
|
||||
an entire sequence of digits.
|
||||
</P>
|
||||
<P>
|
||||
Atomic groups in general can of course contain arbitrarily complicated
|
||||
expressions, and can be nested. However, when the contents of an atomic
|
||||
group is just a single repeated item, as in the example above, a simpler
|
||||
notation, called a "possessive quantifier" can be used. This consists of an
|
||||
additional + character following a quantifier. Using this notation, the
|
||||
previous example can be rewritten as
|
||||
<pre>
|
||||
\d++foo
|
||||
</pre>
|
||||
Note that a possessive quantifier can be used with an entire group, for
|
||||
example:
|
||||
<pre>
|
||||
(abc|xyz){2,3}+
|
||||
</pre>
|
||||
Possessive quantifiers are always greedy; the setting of the PCRE2_UNGREEDY
|
||||
option is ignored. They are a convenient notation for the simpler forms of
|
||||
atomic group. However, there is no difference in the meaning of a possessive
|
||||
quantifier and the equivalent atomic group, though there may be a performance
|
||||
difference; possessive quantifiers should be slightly faster.
|
||||
</P>
|
||||
<P>
|
||||
The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
|
||||
Jeffrey Friedl originated the idea (and the name) in the first edition of his
|
||||
book. Mike McCloskey liked it, so implemented it when he built Sun's Java
|
||||
package, and PCRE1 copied it from there. It found its way into Perl at release
|
||||
5.10.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2 has an optimization that automatically "possessifies" certain simple
|
||||
pattern constructs. For example, the sequence A+B is treated as A++B because
|
||||
there is no point in backtracking into a sequence of A's when B must follow.
|
||||
This feature can be disabled by the PCRE2_NO_AUTO_POSSESS option, by calling
|
||||
<b>pcre2_set_optimize()</b> with a PCRE2_AUTO_POSSESS_OFF directive, or by
|
||||
starting the pattern with (*NO_AUTO_POSSESS).
|
||||
</P>
|
||||
<P>
|
||||
When a pattern contains an unlimited repeat inside a group that can itself be
|
||||
repeated an unlimited number of times, the use of an atomic group is the only
|
||||
way to avoid some failing matches taking a very long time indeed. The pattern
|
||||
<pre>
|
||||
(\D+|<\d+>)*[!?]
|
||||
</pre>
|
||||
matches an unlimited number of substrings that either consist of non-digits, or
|
||||
digits enclosed in <>, followed by either ! or ?. When it matches, it runs
|
||||
quickly. However, if it is applied to
|
||||
<pre>
|
||||
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
|
||||
</pre>
|
||||
it takes a long time before reporting failure. This is because the string can
|
||||
be divided between the internal \D+ repeat and the external * repeat in a
|
||||
large number of ways, and all have to be tried. (The example uses [!?] rather
|
||||
than a single character at the end, because both PCRE2 and Perl have an
|
||||
optimization that allows for fast failure when a single character is used. They
|
||||
remember the last single character that is required for a match, and fail early
|
||||
if it is not present in the string.) If the pattern is changed so that it uses
|
||||
an atomic group, like this:
|
||||
<pre>
|
||||
((?>\D+)|<\d+>)*[!?]
|
||||
</pre>
|
||||
sequences of non-digits cannot be broken, and failure happens quickly.
|
||||
<a name="backreferences"></a></P>
|
||||
<br><a name="SEC21" href="#TOC1">BACKREFERENCES</a><br>
|
||||
<P>
|
||||
Outside a character class, a backslash followed by a digit greater than 0 (and
|
||||
possibly further digits) is a backreference to a capture group earlier (that
|
||||
is, to its left) in the pattern, provided there have been that many previous
|
||||
capture groups.
|
||||
</P>
|
||||
<P>
|
||||
However, if the decimal number following the backslash is less than 8, it is
|
||||
always taken as a backreference, and causes an error only if there are not that
|
||||
many capture groups in the entire pattern. In other words, the group that is
|
||||
referenced need not be to the left of the reference for numbers less than 8. A
|
||||
"forward backreference" of this type can make sense when a repetition is
|
||||
involved and the group to the right has participated in an earlier iteration.
|
||||
</P>
|
||||
<P>
|
||||
It is not possible to have a numerical "forward backreference" to a group whose
|
||||
number is 8 or more using this syntax because a sequence such as \50 is
|
||||
interpreted as a character defined in octal. See the subsection entitled
|
||||
"Non-printing characters"
|
||||
<a href="#digitsafterbackslash">above</a>
|
||||
for further details of the handling of digits following a backslash. Other
|
||||
forms of backreferencing do not suffer from this restriction. In particular,
|
||||
there is no problem when named capture groups are used (see below).
|
||||
</P>
|
||||
<P>
|
||||
Another way of avoiding the ambiguity inherent in the use of digits following a
|
||||
backslash is to use the \g escape sequence. This escape must be followed by a
|
||||
signed or unsigned number, optionally enclosed in braces. These examples are
|
||||
all identical:
|
||||
<pre>
|
||||
(ring), \1
|
||||
(ring), \g1
|
||||
(ring), \g{1}
|
||||
</pre>
|
||||
An unsigned number specifies an absolute reference without the ambiguity that
|
||||
is present in the older syntax. It is also useful when literal digits follow
|
||||
the reference. A signed number is a relative reference. Consider this example:
|
||||
<pre>
|
||||
(abc(def)ghi)\g{-1}
|
||||
</pre>
|
||||
The sequence \g{-1} is a reference to the capture group whose number is one
|
||||
less than the number of the next group to be started, so in this example (where
|
||||
the next group would be numbered 3) it is equivalent to \2, and \g{-2} would
|
||||
be equivalent to \1. Note that if this construct is inside a capture group,
|
||||
that group is included in the count, so in this example \g{-2} also refers to
|
||||
group 1:
|
||||
<pre>
|
||||
(A)(\g{-2}B)
|
||||
</pre>
|
||||
The use of relative references can be helpful in long patterns, and also in
|
||||
patterns that are created by joining together fragments that contain references
|
||||
within themselves.
|
||||
</P>
|
||||
<P>
|
||||
The sequence \g{+1} is a reference to the next capture group that is started
|
||||
after this item, and \g{+2} refers to the one after that, and so on. This kind
|
||||
of forward reference can be useful in patterns that repeat. Perl does not
|
||||
support the use of + in this way.
|
||||
</P>
|
||||
<P>
|
||||
A backreference matches whatever actually most recently matched the capture
|
||||
group in the current subject string, rather than anything at all that matches
|
||||
the group (see
|
||||
<a href="#groupsassubroutines">"Groups as subroutines"</a>
|
||||
below for a way of doing that). So the pattern
|
||||
<pre>
|
||||
(sens|respons)e and \1ibility
|
||||
</pre>
|
||||
matches "sense and sensibility" and "response and responsibility", but not
|
||||
"sense and responsibility". If caseful matching is in force at the time of the
|
||||
backreference, the case of letters is relevant. For example,
|
||||
<pre>
|
||||
((?i)rah)\s+\1
|
||||
</pre>
|
||||
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
|
||||
capture group is matched caselessly.
|
||||
</P>
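<P>
Both points can be checked with a small sketch in Python's <b>re</b> module, used here as a stand-in for PCRE2. One adaptation is an assumption of this example: recent Python versions reject an unscoped (?i) in the middle of a pattern, so the caseless part is written with the scoped form (?i:rah), which leaves the backreference caseful just as in the pattern above.
</P>

```python
import re

# The backreference repeats the text that was actually captured,
# not "anything the group could match".
novel = re.compile(r"(sens|respons)e and \1ibility")
titles = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
novel_hits = [t for t in titles if novel.fullmatch(t)]

# The group matches caselessly, but \1 is compared casefully.
rah = re.compile(r"((?i:rah))\s+\1")
cheers = ["rah rah", "RAH RAH", "RAH rah"]
rah_hits = [c for c in cheers if rah.fullmatch(c)]

print(novel_hits)
print(rah_hits)
```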
|
||||
<P>
|
||||
There are several different ways of writing backreferences to named capture
|
||||
groups. The .NET syntax is \k{name}, the Python syntax is (?P=name), and the
|
||||
original Perl syntax is \k<name> or \k'name'. All of these are now supported
|
||||
by both Perl and PCRE2. Perl 5.10's unified backreference syntax, in which \g
|
||||
can be used for both numeric and named references, is also supported by PCRE2.
|
||||
We could rewrite the above example in any of the following ways:
|
||||
<pre>
|
||||
(?<p1>(?i)rah)\s+\k<p1>
|
||||
(?'p1'(?i)rah)\s+\k{p1}
|
||||
(?P<p1>(?i)rah)\s+(?P=p1)
|
||||
(?<p1>(?i)rah)\s+\g{p1}
|
||||
</pre>
|
||||
A capture group that is referenced by name may appear in the pattern before or
|
||||
after the reference.
|
||||
</P>
|
||||
<P>
|
||||
There may be more than one backreference to the same group. If a group has not
|
||||
actually been used in a particular match, backreferences to it always fail by
|
||||
default. For example, the pattern
|
||||
<pre>
|
||||
(a|(bc))\2
|
||||
</pre>
|
||||
always fails if it starts to match "a" rather than "bc". However, if the
|
||||
PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backreference to an
|
||||
unset value matches an empty string.
|
||||
</P>
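<P>
The default behaviour for an unset group can be seen in the following sketch, which uses Python's <b>re</b> module as a stand-in for PCRE2. Python has no equivalent of PCRE2_MATCH_UNSET_BACKREF, so only the default case is shown; the variable names are illustrative.
</P>

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Taking the "a" branch leaves group 2 unset, so \2 cannot match.
via_a = pattern.search("a")
via_a_then_bc = pattern.search("abc")

# Taking the "bc" branch sets group 2, so \2 must repeat "bc".
via_bc = pattern.search("bcbc")

print(via_a, via_a_then_bc, via_bc.group(0))
```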
|
||||
<P>
|
||||
Because there may be many capture groups in a pattern, all digits following a
|
||||
backslash are taken as part of a potential backreference number. If the pattern
|
||||
continues with a digit character, some delimiter must be used to terminate the
|
||||
backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, this
|
||||
can be white space. Otherwise, the \g{} syntax or an empty comment (see
|
||||
<a href="#comments">"Comments"</a>
|
||||
below) can be used.
|
||||
</P>
|
||||
<br><b>
|
||||
Recursive backreferences
|
||||
</b><br>
|
||||
<P>
|
||||
A backreference that occurs inside the group to which it refers fails when the
|
||||
group is first used, so, for example, (a\1) never matches. However, such
|
||||
references can be useful inside repeated groups. For example, the pattern
|
||||
<pre>
|
||||
(a|b\1)+
|
||||
</pre>
|
||||
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
|
||||
the group, the backreference matches the character string corresponding to the
|
||||
previous iteration. In order for this to work, the pattern must be such that
|
||||
the first iteration does not need to match the backreference. This can be done
|
||||
using alternation, as in the example above, or by a quantifier with a minimum
|
||||
of zero.
|
||||
</P>
|
||||
<P>
|
||||
For versions of PCRE2 less than 10.25, backreferences of this type used to
|
||||
cause the group that they reference to be treated as an
|
||||
<a href="#atomicgroup">atomic group.</a>
|
||||
This restriction no longer applies, and backtracking into such groups can occur
|
||||
as normal.
|
||||
<a name="bigassertions"></a></P>
|
||||
<br><a name="SEC22" href="#TOC1">ASSERTIONS</a><br>
|
||||
<P>
|
||||
An assertion is a test that does not consume any characters. The test must
|
||||
succeed for the match to continue. The simple assertions coded as \b, \B,
|
||||
\A, \G, \Z, \z, ^ and $ are described
|
||||
<a href="#smallassertions">above.</a>
|
||||
</P>
|
||||
<P>
|
||||
More complicated assertions are coded as parenthesized groups. If matching such
|
||||
a group succeeds, matching continues after it, but with the matching position
|
||||
in the subject string reset to what it was before the assertion was processed.
|
||||
</P>
|
||||
<P>
|
||||
A special kind of assertion, called a "scan substring" assertion, matches a
|
||||
subpattern against a previously captured substring. This is described in the
|
||||
section entitled
|
||||
<a href="#scansubstringassertions">"Scan substring assertions"</a>
|
||||
below. It is a PCRE2 extension, not compatible with Perl.
|
||||
</P>
|
||||
<P>
|
||||
The other group-based assertions are of two kinds: those that look ahead of the
|
||||
current position in the subject string, and those that look behind it, and in
|
||||
each case an assertion may be positive (must match for the assertion to be
|
||||
true) or negative (must not match for the assertion to be true).
|
||||
</P>
|
||||
<P>
|
||||
The Perl-compatible lookaround assertions are atomic. If an assertion is true,
|
||||
but there is a subsequent matching failure, there is no backtracking into the
|
||||
assertion. However, there are some cases where non-atomic assertions can be
|
||||
useful. PCRE2 has some support for these, described in the section entitled
|
||||
<a href="#nonatomicassertions">"Non-atomic assertions"</a>
|
||||
below, but they are not Perl-compatible.
|
||||
</P>
|
||||
<P>
|
||||
A lookaround assertion may appear as the condition in a
|
||||
<a href="#conditions">conditional group</a>
|
||||
(see below). In this case, the result of matching the assertion determines
|
||||
which branch of the condition is followed.
|
||||
</P>
|
||||
<P>
|
||||
Assertion groups are not capture groups. If an assertion contains capture
|
||||
groups within it, these are counted for the purposes of numbering the capture
|
||||
groups in the whole pattern. Within each branch of an assertion, locally
|
||||
captured substrings may be referenced in the usual way. For example, a sequence
|
||||
such as (.)\g{-1} can be used to check that two adjacent characters are the
|
||||
same.
|
||||
</P>
|
||||
<P>
|
||||
When a branch within an assertion fails to match, any substrings that were
|
||||
captured are discarded (as happens with any pattern branch that fails to
|
||||
match). A negative assertion is true only when all its branches fail to match;
|
||||
this means that no captured substrings are ever retained after a successful
|
||||
negative assertion. When an assertion contains a matching branch, what happens
|
||||
depends on the type of assertion.
|
||||
</P>
|
||||
<P>
|
||||
For a positive assertion, internally captured substrings in the successful
|
||||
branch are retained, and matching continues with the next pattern item after
|
||||
the assertion. For a negative assertion, a matching branch means that the
|
||||
assertion is not true. If such an assertion is being used as a condition in a
|
||||
<a href="#conditions">conditional group</a>
|
||||
(see below), captured substrings are retained, because matching continues with
|
||||
the "no" branch of the condition. For other failing negative assertions,
|
||||
control passes to the previous backtracking point, thus discarding any captured
|
||||
strings within the assertion.
|
||||
</P>
|
||||
<P>
|
||||
Most assertion groups may be repeated; though it makes no sense to assert the
|
||||
same thing several times, the side effect of capturing in positive assertions
|
||||
may occasionally be useful. However, an assertion that forms the condition for
|
||||
a conditional group may not be quantified. PCRE2 used to restrict the
|
||||
repetition of assertions, but from release 10.35 the only restriction is that
|
||||
an unlimited maximum repetition is changed to be one more than the minimum. For
|
||||
example, {3,} is treated as {3,4}.
|
||||
</P>
|
||||
<br><b>
|
||||
Alphabetic assertion names
|
||||
</b><br>
|
||||
<P>
|
||||
Traditionally, symbolic sequences such as (?= and (?<= have been used to
|
||||
specify lookaround assertions. Perl 5.28 introduced some experimental
|
||||
alphabetic alternatives which might be easier to remember. They all start with
|
||||
(* instead of (? and must be written using lower case letters. PCRE2 supports
|
||||
the following synonyms:
|
||||
<pre>
|
||||
(*positive_lookahead: or (*pla: is the same as (?=
|
||||
(*negative_lookahead: or (*nla: is the same as (?!
|
||||
(*positive_lookbehind: or (*plb: is the same as (?<=
|
||||
(*negative_lookbehind: or (*nlb: is the same as (?<!
|
||||
</pre>
|
||||
For example, (*pla:foo) is the same assertion as (?=foo). In the following
|
||||
sections, the various assertions are described using the original symbolic
|
||||
forms.
|
||||
</P>
|
||||
<br><b>
|
||||
Lookahead assertions
|
||||
</b><br>
|
||||
<P>
|
||||
Lookahead assertions start with (?= for positive assertions and (?! for
|
||||
negative assertions. For example,
|
||||
<pre>
|
||||
\w+(?=;)
|
||||
</pre>
|
||||
matches a word followed by a semicolon, but does not include the semicolon in
|
||||
the match, and
|
||||
<pre>
|
||||
foo(?!bar)
|
||||
</pre>
|
||||
matches any occurrence of "foo" that is not followed by "bar". Note that the
|
||||
apparently similar pattern
|
||||
<pre>
|
||||
(?!foo)bar
|
||||
</pre>
|
||||
does not find an occurrence of "bar" that is preceded by something other than
|
||||
"foo"; it finds any occurrence of "bar" whatsoever, because the assertion
|
||||
(?!foo) is always true when the next three characters are "bar". A
|
||||
lookbehind assertion is needed to achieve the other effect.
|
||||
</P>
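<P>
The three lookahead patterns above can be tried out with a short sketch. It uses Python's <b>re</b> module as a stand-in for PCRE2 (the two agree on these patterns); the subject strings are made up for illustration, and offsets are counted from zero.
</P>

```python
import re

# The semicolon must follow the word, but it is not part of the match.
word_before_semicolon = re.search(r"\w+(?=;)", "total; rest").group(0)

# Only the "foo" that is not followed by "bar" is found (offset 7).
foo_offsets = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar does not exclude a preceding "foo": at offset 3 the next three
# characters are "bar", not "foo", so the assertion is true and "bar" matches.
bar_after_foo = re.search(r"(?!foo)bar", "foobar")

print(word_before_semicolon, foo_offsets, bar_after_foo.start())
```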
|
||||
<P>
|
||||
If you want to force a matching failure at some point in a pattern, the most
|
||||
convenient way to do it is with (?!) because an empty string always matches, so
|
||||
an assertion that requires there not to be an empty string must always fail.
|
||||
The backtracking control verb (*FAIL) or (*F) is a synonym for (?!).
|
||||
<a name="lookbehind"></a></P>
|
||||
<br><b>
|
||||
Lookbehind assertions
|
||||
</b><br>
|
||||
<P>
|
||||
Lookbehind assertions start with (?<= for positive assertions and (?<! for
|
||||
negative assertions. For example,
|
||||
<pre>
|
||||
(?<!foo)bar
|
||||
</pre>
|
||||
does find an occurrence of "bar" that is not preceded by "foo". The contents of
|
||||
a lookbehind assertion are restricted such that there must be a known maximum
|
||||
to the lengths of all the strings it matches. There are two cases:
|
||||
</P>
|
||||
<P>
|
||||
If every top-level alternative matches a fixed length, for example
|
||||
<pre>
|
||||
(?<=colour|color)
|
||||
</pre>
|
||||
there is a limit of 65535 characters to the lengths, which do not have to be
|
||||
the same, as this example demonstrates. This is the only kind of lookbehind
|
||||
supported by PCRE2 versions earlier than 10.43 and by the alternative matching
|
||||
function <b>pcre2_dfa_match()</b>.
|
||||
</P>
|
||||
<P>
|
||||
In PCRE2 10.43 and later, <b>pcre2_match()</b> supports lookbehind assertions in
|
||||
which one or more top-level alternatives can match more than one string length,
|
||||
for example
|
||||
<pre>
|
||||
(?<=colou?r)
|
||||
</pre>
|
||||
The maximum matching length for any branch of the lookbehind is limited to a
|
||||
value set by the calling program (default 255 characters). Unlimited repetition
|
||||
(for example \d*) is not supported. In some cases, the escape sequence \K
|
||||
<a href="#resetmatchstart">(see above)</a>
|
||||
can be used instead of a lookbehind assertion at the start of a pattern to get
|
||||
round the length limit restriction.
|
||||
</P>
|
||||
<P>
|
||||
In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which matches a
|
||||
single code unit even in a UTF mode) to appear in lookbehind assertions,
|
||||
because it makes it impossible to calculate the length of the lookbehind. The
|
||||
\X and \R escapes, which can match different numbers of code units, are never
|
||||
permitted in lookbehinds.
|
||||
</P>
|
||||
<P>
|
||||
<a href="#groupsassubroutines">"Subroutine"</a>
|
||||
calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
|
||||
as the called capture group matches a limited-length string. However,
|
||||
<a href="#recursion">recursion,</a>
|
||||
that is, a "subroutine" call into a group that is already active,
|
||||
is not supported.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2 supports backreferences in lookbehinds, but only if certain conditions
|
||||
are met. The PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no
|
||||
use of (?| in the pattern (it creates duplicate group numbers), and if the
|
||||
backreference is by name, the name must be unique. Of course, the referenced
|
||||
group must itself match a limited length substring. The following pattern
|
||||
matches words containing at least two characters that begin and end with the
|
||||
same character:
|
||||
<pre>
|
||||
\b(\w)\w++(?<=\1)
|
||||
</pre>
|
||||
</P>
|
||||
<P>
|
||||
Possessive quantifiers can be used in conjunction with lookbehind assertions to
|
||||
specify efficient matching at the end of subject strings. Consider a simple
|
||||
pattern such as
|
||||
<pre>
|
||||
abcd$
|
||||
</pre>
|
||||
when applied to a long string that does not match. Because matching proceeds
|
||||
from left to right, PCRE2 will look for each "a" in the subject and then see if
|
||||
what follows matches the rest of the pattern. If the pattern is specified as
|
||||
<pre>
|
||||
^.*abcd$
|
||||
</pre>
|
||||
the initial .* matches the entire string at first, but when this fails (because
|
||||
there is no following "a"), it backtracks to match all but the last character,
|
||||
then all but the last two characters, and so on. Once again the search for "a"
|
||||
covers the entire string, from right to left, so we are no better off. However,
|
||||
if the pattern is written as
|
||||
<pre>
|
||||
^.*+(?<=abcd)
|
||||
</pre>
|
||||
there can be no backtracking for the .*+ item because of the possessive
|
||||
quantifier; it can match only the entire string. The subsequent lookbehind
|
||||
assertion does a single test on the last four characters. If it fails, the
|
||||
match fails immediately. For long strings, this approach makes a significant
|
||||
difference to the processing time.
|
||||
</P>
|
||||
<br><b>
|
||||
Using multiple assertions
|
||||
</b><br>
|
||||
<P>
|
||||
Several assertions (of any sort) may occur in succession. For example,
|
||||
<pre>
|
||||
(?<=\d{3})(?<!999)foo
|
||||
</pre>
|
||||
matches "foo" preceded by three digits that are not "999". Notice that each of
|
||||
the assertions is applied independently at the same point in the subject
|
||||
string. First there is a check that the previous three characters are all
|
||||
digits, and then there is a check that the same three characters are not "999".
|
||||
This pattern does <i>not</i> match "foo" preceded by six characters, the first three
|
||||
of which are digits and the last three of which are not "999". For example, it
|
||||
doesn't match "123abcfoo". A pattern to do that is
|
||||
<pre>
|
||||
(?<=\d{3}...)(?<!999)foo
|
||||
</pre>
|
||||
This time the first assertion looks at the preceding six characters, checking
|
||||
that the first three are digits, and then the second assertion checks that the
|
||||
preceding three characters are not "999".
|
||||
</P>
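<P>
The difference between the two patterns can be confirmed with a small sketch. It uses Python's <b>re</b> module as a stand-in for PCRE2; both lookbehinds here have a fixed length, which Python requires. The subject strings are made up for illustration.
</P>

```python
import re

# Both assertions inspect the same three characters before "foo".
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]
same_three_hits = [s for s in subjects if same_three.search(s)]
six_back_hits = [s for s in subjects if six_back.search(s)]

print(same_three_hits)
print(six_back_hits)
```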
|
||||
<P>
|
||||
Assertions can be nested in any combination. For example,
|
||||
<pre>
|
||||
(?<=(?<!foo)bar)baz
|
||||
</pre>
|
||||
matches an occurrence of "baz" that is preceded by "bar" which in turn is not
|
||||
preceded by "foo", while
|
||||
<pre>
|
||||
(?<=\d{3}(?!999)...)foo
|
||||
</pre>
|
||||
is another pattern that matches "foo" preceded by three digits and any three
|
||||
characters that are not "999".
|
||||
<a name="nonatomicassertions"></a></P>
|
||||
<br><a name="SEC23" href="#TOC1">NON-ATOMIC ASSERTIONS</a><br>
|
||||
<P>
|
||||
Traditional lookaround assertions are atomic. That is, if an assertion is true,
|
||||
but there is a subsequent matching failure, there is no backtracking into the
|
||||
assertion. However, there are some cases where non-atomic positive assertions
|
||||
can be useful. PCRE2 provides these using the following syntax:
|
||||
<pre>
|
||||
(*non_atomic_positive_lookahead: or (*napla: or (?*
|
||||
(*non_atomic_positive_lookbehind: or (*naplb: or (?<*
|
||||
</pre>
|
||||
Consider the problem of finding the right-most word in a string that also
|
||||
appears earlier in the string, that is, it must appear at least twice in total.
|
||||
This pattern returns the required result as captured substring 1:
|
||||
<pre>
|
||||
^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
|
||||
</pre>
|
||||
For a subject such as "word1 word2 word3 word2 word3 word4" the result is
|
||||
"word3". How does it work? At the start, ^(?x) anchors the pattern and sets the
|
||||
"x" option, which causes white space (introduced for readability) to be
|
||||
ignored. Inside the assertion, the greedy .* at first consumes the entire
|
||||
string, but then has to backtrack until the rest of the assertion can match a
|
||||
word, which is captured by group 1. In other words, when the assertion first
|
||||
succeeds, it captures the right-most word in the string.
|
||||
</P>
|
||||
<P>
|
||||
The current matching point is then reset to the start of the subject, and the
|
||||
rest of the pattern match checks for two occurrences of the captured word,
|
||||
using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
|
||||
if the last word in the string does not occur twice, this part of the pattern
|
||||
fails. If a traditional atomic lookahead (?= or (*pla: had been used, the
|
||||
assertion could not be re-entered, and the whole match would fail. The pattern
|
||||
would succeed only if the very last word in the subject was found twice.
|
||||
</P>
|
||||
<P>
|
||||
Using a non-atomic lookahead, however, means that when the last word does not
|
||||
occur twice in the string, the lookahead can backtrack and find the second-last
|
||||
word, and so on, until either the match succeeds, or all words have been
|
||||
tested.
|
||||
</P>
|
||||
<P>
|
||||
Two conditions must be met for a non-atomic assertion to be useful: the
|
||||
contents of one or more capturing groups must change after a backtrack into the
|
||||
assertion, and there must be a backreference to a changed group later in the
|
||||
pattern. If this is not the case, the rest of the pattern match fails exactly
|
||||
as before because nothing has changed, so using a non-atomic assertion just
|
||||
wastes resources.
|
||||
</P>
|
||||
<P>
|
||||
There is one exception to backtracking into a non-atomic assertion. If an
|
||||
(*ACCEPT) control verb is triggered, the assertion succeeds atomically. That
|
||||
is, a subsequent match failure cannot backtrack into the assertion.
|
||||
</P>
|
||||
<P>
|
||||
Non-atomic assertions are not supported by the alternative matching function
|
||||
<b>pcre2_dfa_match()</b>. They are supported by JIT, but only if they do not
|
||||
contain any control verbs such as (*ACCEPT). (This may change in future). Note
|
||||
that assertions that appear as conditions for
|
||||
<a href="#conditions">conditional groups</a>
|
||||
(see below) must be atomic.
|
||||
<a name="scansubstringassertions"></a></P>
|
||||
<br><a name="SEC24" href="#TOC1">SCAN SUBSTRING ASSERTIONS</a><br>
|
||||
<P>
|
||||
A special kind of assertion, not compatible with Perl, makes it possible to
|
||||
check the contents of a captured substring by matching it with a subpattern.
|
||||
Because this involves capturing, this feature is not supported by
|
||||
<b>pcre2_dfa_match()</b>.
|
||||
</P>
|
||||
<P>
|
||||
A scan substring assertion starts with the sequence (*scan_substring: or
|
||||
(*scs: which is followed by a list of substring numbers (absolute or relative)
|
||||
and/or substring names enclosed in single quotes or angle brackets, all within
|
||||
parentheses. The rest of the item is the subpattern that is applied to the
|
||||
substring, as shown in these examples:
|
||||
<pre>
|
||||
(*scan_substring:(1)...)
|
||||
(*scs:(-2)...)
|
||||
(*scs:('AB')...)
|
||||
(*scs:(1,'AB',-2)...)
|
||||
</pre>
|
||||
The list of groups is checked in the order they are given, and it is the
|
||||
contents of the first one that is found to be set that are scanned. When
|
||||
PCRE2_DUPNAMES is set and there are ambiguous group names, all groups with the
|
||||
same name are checked in numerical order. A scan substring assertion fails if
|
||||
none of the groups it references have been set.
|
||||
</P>
|
||||
<P>
|
||||
The pattern match on the substring is always anchored, that is, it must match
|
||||
from the start of the substring. There is no "bumpalong" if it does not match
|
||||
at the start. The end of the subject is temporarily reset to be the end of the
|
||||
substring, so \Z, \z, and $ will match there. However, the start of the
|
||||
subject is <i>not</i> reset. This means that ^ matches only if the substring is
|
||||
actually at the start of the main subject, but it also means that lookbehind
|
||||
assertions into what precedes the substring are possible.
|
||||
</P>
|
||||
<P>
|
||||
Here is a very simple example: find a word that contains the rare (in English)
|
||||
sequence of letters "rh" not at the start:
|
||||
<pre>
|
||||
\b(\w++)(*scs:(1).+rh)
|
||||
</pre>
|
||||
The first group captures a word which is then scanned by the second group.
|
||||
This example does not actually need this heavyweight feature; the same match
|
||||
can be achieved with:
|
||||
<pre>
|
||||
\b\w+?rh\w*\b
|
||||
</pre>
|
||||
When things are more complicated, however, scanning a captured substring can be
|
||||
a useful way to describe the required match. For example, there is a rather
|
||||
complicated pattern in the PCRE2 test data that checks an entire subject string
|
||||
for a palindrome, that is, the sequence of letters is the same in both
|
||||
directions. Suppose you want to search for individual words of two or more
|
||||
characters such as "level" that are palindromes:
|
||||
<pre>
|
||||
(\b\w{2,}+\b)(*scs:(1)...palindrome-matching-pattern...)
|
||||
</pre>
|
||||
Within a substring scanning subpattern, references to other groups work as
|
||||
normal. Capturing groups may appear, and will retain their values during
|
||||
ongoing matching if the assertion succeeds.
|
||||
</P>
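<P>
The two-step nature of a scan substring assertion (capture first, then scan the
capture) can be imitated in an engine that lacks the feature. The following
Python sketch uses the standard <b>re</b> module, not PCRE2, and its function
and variable names are invented for illustration. Because the word is extracted
before it is scanned, a lookbehind in the second step could not see the text
that precedes the word, which is a difference from the PCRE2 behaviour
described above.
</P>

```python
import re

WORD = re.compile(r"\b\w+")     # step 1: capture a whole word (stands in for group 1)
INNER_RH = re.compile(r".+rh")  # step 2: subpattern applied to the captured word


def words_with_inner_rh(text):
    """Return the words in text that contain "rh" after their first character."""
    found = []
    for m in WORD.finditer(text):
        word = m.group(0)
        # re.match is anchored at the start of the word, and the end of the
        # word acts as the end of the subject for this second match.
        if INNER_RH.match(word):
            found.append(word)
    return found


print(words_with_inner_rh("rhythm, myrrh and catarrh"))
```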
|
||||
<br><a name="SEC25" href="#TOC1">SCRIPT RUNS</a><br>
|
||||
<P>
|
||||
In concept, a script run is a sequence of characters that are all from the same
|
||||
Unicode script such as Latin or Greek. However, because some scripts are
|
||||
commonly used together, and because some diacritical and other marks are used
|
||||
with multiple scripts, it is not that simple. There is a full description of
|
||||
the rules that PCRE2 uses in the section entitled
|
||||
<a href="pcre2unicode.html#scriptruns">"Script Runs"</a>
|
||||
in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
If part of a pattern is enclosed between (*script_run: or (*sr: and a closing
|
||||
parenthesis, it fails if the sequence of characters that it matches is not a
|
||||
script run. After a failure, normal backtracking occurs. Script runs can be
|
||||
used to detect spoofing attacks using characters that look the same, but are
|
||||
from different scripts. The string "paypal.com" is an infamous example, where
|
||||
the letters could be a mixture of Latin and Cyrillic. This pattern ensures that
|
||||
the matched characters in a sequence of non-spaces that follow white space are
|
||||
a script run:
|
||||
<pre>
|
||||
\s+(*sr:\S+)
|
||||
</pre>
|
||||
To be sure that they are all from the Latin script (for example), a lookahead
|
||||
can be used:
|
||||
<pre>
|
||||
\s+(?=\p{Latin})(*sr:\S+)
|
||||
</pre>
|
||||
This works as long as the first character is expected to be a character in that
|
||||
script, and not (for example) punctuation, which is allowed with any script. If
|
||||
this is not the case, a more creative lookahead is needed. For example, if
|
||||
digits, underscore, and dots are permitted at the start:
|
||||
<pre>
|
||||
\s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
|
||||
|
||||
</PRE>
|
||||
</P>
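<P>
To make the spoofing scenario concrete, here is a rough Python sketch that
flags a token whose letters come from more than one script. It is not an
implementation of PCRE2's script run rules: it assumes that the first word of a
character's Unicode name (for example "LATIN" or "CYRILLIC") is an adequate
stand-in for its script, and the function names are hypothetical.
</P>

```python
import unicodedata


def scripts_used(token):
    """Return the set of script-like name prefixes of the letters in token."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            # Crude approximation: take the first word of the character name.
            # Digits and punctuation are skipped, since they are shared by
            # many scripts.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed(token):
    """True if the letters of token appear to come from more than one script."""
    return len(scripts_used(token)) > 1


print(looks_mixed("paypal.com"))       # Latin letters only
print(looks_mixed("p\u0430ypal.com"))  # U+0430 is CYRILLIC SMALL LETTER A
```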
|
||||
<P>
|
||||
In many cases, backtracking into a script run pattern fragment is not
|
||||
desirable. The script run can employ an atomic group to prevent this. Because
|
||||
this is a common requirement, a shorthand notation is provided by
|
||||
(*atomic_script_run: or (*asr:
|
||||
<pre>
|
||||
(*asr:...) is the same as (*sr:(?>...))
|
||||
</pre>
|
||||
Note that the atomic group is inside the script run. Putting it outside would
|
||||
not prevent backtracking into the script run pattern.
|
||||
</P>
|
||||
<P>
|
||||
Support for script runs is not available if PCRE2 is compiled without Unicode
|
||||
support. A compile-time error is given if any of the above constructs is
|
||||
encountered. Script runs are not supported by the alternate matching function,
|
||||
<b>pcre2_dfa_match()</b> because they use the same mechanism as capturing
|
||||
parentheses.
|
||||
</P>
|
||||
<P>
|
||||
<b>Warning:</b> The (*ACCEPT) control verb
|
||||
<a href="#acceptverb">(see below)</a>
|
||||
should not be used within a script run group, because it causes an immediate
|
||||
exit from the group, bypassing the script run checking.
|
||||
<a name="conditions"></a></P>
|
||||
<br><a name="SEC26" href="#TOC1">CONDITIONAL GROUPS</a><br>
|
||||
<P>
|
||||
It is possible to cause the matching process to obey a pattern fragment
|
||||
conditionally or to choose between two alternative fragments, depending on
|
||||
the result of an assertion, or whether a specific capture group has
|
||||
already been matched. The two possible forms of conditional group are:
|
||||
<pre>
|
||||
(?(condition)yes-pattern)
|
||||
(?(condition)yes-pattern|no-pattern)
|
||||
</pre>
|
||||
If the condition is satisfied, the yes-pattern is used; otherwise the
|
||||
no-pattern (if present) is used. An absent no-pattern is equivalent to an empty
|
||||
string (it always matches). If there are more than two alternatives in the
|
||||
group, a compile-time error occurs. Each of the two alternatives may itself
|
||||
contain nested groups of any form, including conditional groups; the
|
||||
restriction to two alternatives applies only at the level of the condition
|
||||
itself. This pattern fragment is an example where the alternatives are complex:
|
||||
<pre>
|
||||
(?(1) (A|B|C) | (D | (?(2)E|F) | E) )
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
There are five kinds of condition: references to capture groups, references to
|
||||
recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
|
||||
</P>
|
||||
<br><b>
|
||||
Checking for a used capture group by number
|
||||
</b><br>
|
||||
<P>
|
||||
If the text between the parentheses consists of a sequence of digits, the
|
||||
condition is true if a capture group of that number has previously matched. If
|
||||
there is more than one capture group with the same number (see the earlier
|
||||
<a href="#dupgroupnumber">section about duplicate group numbers</a>),
|
||||
the condition is true if any of them have matched. An alternative notation,
|
||||
which is a PCRE2 extension, not supported by Perl, is to precede the digits
|
||||
with a plus or minus sign. In this case, the group number is relative rather
|
||||
than absolute. The most recently opened capture group (which could be enclosing
|
||||
this condition) can be referenced by (?(-1), the next most recent by (?(-2),
|
||||
and so on. Inside loops it can also make sense to refer to subsequent groups.
|
||||
The next capture group to be opened can be referenced as (?(+1), and so on. The
|
||||
value zero in any of these forms is not used; it provokes a compile-time error.
|
||||
</P>
|
||||
<P>
|
||||
Consider the following pattern, which contains non-significant white space to
|
||||
make it more readable (assume the PCRE2_EXTENDED option) and to divide it into
|
||||
three parts for ease of discussion:
|
||||
<pre>
|
||||
( \( )? [^()]+ (?(1) \) )
|
||||
</pre>
|
||||
The first part matches an optional opening parenthesis, and if that
|
||||
character is present, sets it as the first captured substring. The second part
|
||||
matches one or more characters that are not parentheses. The third part is a
|
||||
conditional group that tests whether or not the first capture group
|
||||
matched. If it did, that is, if the subject started with an opening parenthesis,
|
||||
the condition is true, and so the yes-pattern is executed and a closing
|
||||
parenthesis is required. Otherwise, since no-pattern is not present, the
|
||||
conditional group matches nothing. In other words, this pattern matches a
|
||||
sequence of non-parentheses, optionally enclosed in parentheses.
|
||||
</P>
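<P>
The behaviour of this pattern can be tried out quickly. The sketch below uses
Python's standard <b>re</b> module, which happens to accept the same
conditional syntax; it is a stand-in for PCRE2, with re.VERBOSE playing the
role of PCRE2_EXTENDED, and the constant name is invented for illustration.
</P>

```python
import re

# The same three parts as the pattern above, with white space ignored.
OPTIONALLY_PARENTHESIZED = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

for subject in ["abc", "(abc)", "(abc", "abc)"]:
    # fullmatch requires the whole subject to match, so unbalanced
    # parentheses are rejected.
    print(repr(subject), bool(OPTIONALLY_PARENTHESIZED.fullmatch(subject)))
```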
|
||||
<P>
|
||||
If you were embedding this pattern in a larger one, you could use a relative
|
||||
reference:
|
||||
<pre>
|
||||
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
|
||||
</pre>
|
||||
This makes the fragment independent of the parentheses in the larger pattern.
|
||||
</P>
|
||||
<br><b>
|
||||
Checking for a used capture group by name
|
||||
</b><br>
|
||||
<P>
|
||||
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
|
||||
capture group by name. For compatibility with earlier versions of PCRE1, which
|
||||
had this facility before Perl, the syntax (?(name)...) is also recognized.
|
||||
Note, however, that undelimited names consisting of the letter R followed by
|
||||
digits are ambiguous (see the following section). Rewriting the above example
|
||||
to use a named group gives this:
|
||||
<pre>
|
||||
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
|
||||
</pre>
|
||||
If the name used in a condition of this kind is a duplicate, the test is
|
||||
applied to all groups of the same name, and is true if any one of them has
|
||||
matched.
|
||||
</P>
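<P>
As an illustration of the undelimited (?(name)...) form, the following Python
sketch uses the standard <b>re</b> module rather than PCRE2. Python spells the
group definition (?P&#60;OPEN&#62;...) and tests it with (?(OPEN)...); the
constant name is an assumption made for this example.
</P>

```python
import re

# Named version of the optional-parentheses pattern, with white space ignored.
NAMED = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)

m = NAMED.fullmatch("(abc)")
print(m.group("OPEN"))            # the group is set, so ")" was required

m = NAMED.fullmatch("abc")
print(m.group("OPEN"))            # the group is unset, so nothing more was needed

print(NAMED.fullmatch("(abc"))    # no match: the closing parenthesis is missing
```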
|
||||
<br><b>
|
||||
Checking for pattern recursion
|
||||
</b><br>
|
||||
<P>
|
||||
"Recursion" in this sense refers to any subroutine-like call from one part of
|
||||
the pattern to another, whether or not it is actually recursive. See the
|
||||
sections entitled
|
||||
<a href="#recursion">"Recursive patterns"</a>
|
||||
and
|
||||
<a href="#groupsassubroutines">"Groups as subroutines"</a>
|
||||
below for details of recursion and subroutine calls.
|
||||
</P>
|
||||
<P>
|
||||
If a condition is the string (R), and there is no capture group with the name
|
||||
R, the condition is true if matching is currently in a recursion or subroutine
|
||||
call to the whole pattern or any capture group. If digits follow the letter R,
|
||||
and there is no group with that name, the condition is true if the most recent
|
||||
call is into a group with the given number, which must exist somewhere in the
|
||||
overall pattern. This is a contrived example that is equivalent to a+b:
|
||||
<pre>
|
||||
((?(R1)a+|(?1)b))
|
||||
</pre>
|
||||
However, in both cases, if there is a capture group with a matching name, the
|
||||
condition tests for its being set, as described in the section above, instead
|
||||
of testing for recursion. For example, creating a group with the name R1 by
|
||||
adding (?<R1>) to the above pattern completely changes its meaning.
|
||||
</P>
|
||||
<P>
|
||||
If a name preceded by ampersand follows the letter R, for example:
|
||||
<pre>
|
||||
(?(R&name)...)
|
||||
</pre>
|
||||
the condition is true if the most recent recursion is into a group of that name
|
||||
(which must exist within the pattern).
|
||||
</P>
|
||||
<P>
|
||||
This condition does not check the entire recursion stack. It tests only the
|
||||
current level. If the name used in a condition of this kind is a duplicate, the
|
||||
test is applied to all groups of the same name, and is true if any one of
|
||||
them is the most recent recursion.
|
||||
</P>
|
||||
<P>
|
||||
At "top level", all these recursion test conditions are false.
|
||||
<a name="subdefine"></a></P>
|
||||
<br><b>
|
||||
Defining capture groups for use by reference only
|
||||
</b><br>
|
||||
<P>
|
||||
If the condition is the string (DEFINE), the condition is always false, even if
|
||||
there is a group with the name DEFINE. In this case, there may be only one
|
||||
alternative in the rest of the conditional group. It is always skipped if
|
||||
control reaches this point in the pattern; the idea of DEFINE is that it can be
|
||||
used to define subroutines that can be referenced from elsewhere. (The use of
|
||||
<a href="#groupsassubroutines">subroutines</a>
|
||||
is described below.) For example, a pattern to match an IPv4 address such as
|
||||
"192.168.23.245" could be written like this (ignore white space and line
|
||||
breaks):
|
||||
<pre>
|
||||
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
|
||||
\b (?&byte) (\.(?&byte)){3} \b
|
||||
</pre>
|
||||
The first part of the pattern is a DEFINE group inside which another group
|
||||
named "byte" is defined. This matches an individual component of an IPv4
|
||||
address (a number less than 256). When matching takes place, this part of the
|
||||
pattern is skipped because DEFINE acts like a false condition. The rest of the
|
||||
pattern uses references to the named group to match the four dot-separated
|
||||
components of an IPv4 address, insisting on a word boundary at each end.
|
||||
</P>
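<P>
In an engine without DEFINE or subroutine calls, a similar "write it once, use
it several times" effect can be approximated by building the pattern from a
string. The Python sketch below does this with the standard <b>re</b> module;
it is textual reuse, not PCRE2's subroutine mechanism, and the names BYTE and
IPV4 are invented for illustration.
</P>

```python
import re

# One IPv4 component (0 to 255), written once and reused below.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Interpolating BYTE stands in for the (?&byte) calls in the pattern above.
# re.ASCII restricts \d to the digits 0-9.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)

for subject in ["192.168.23.245", "192.168.23.256", "1.2.3"]:
    print(subject, bool(IPV4.fullmatch(subject)))
```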
|
||||
<br><b>
|
||||
Checking the PCRE2 version
|
||||
</b><br>
|
||||
<P>
|
||||
Programs that link with a PCRE2 library can check the version by calling
|
||||
<b>pcre2_config()</b> with appropriate arguments. Users of applications that do
|
||||
not have access to the underlying code cannot do this. A special "condition"
|
||||
called VERSION exists to allow such users to discover which version of PCRE2
|
||||
they are dealing with by using this condition to match a string such as
|
||||
"yesno". VERSION must be followed either by "=" or ">=" and a version number.
|
||||
For example:
|
||||
<pre>
|
||||
(?(VERSION>=10.4)yes|no)
|
||||
</pre>
|
||||
This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
|
||||
"no" otherwise. The fractional part of the version number may not contain more
|
||||
than two digits.
|
||||
</P>
|
||||
<br><b>
|
||||
Assertion conditions
|
||||
</b><br>
|
||||
<P>
|
||||
If the condition is not in any of the above formats, it must be a parenthesized
|
||||
assertion. This may be a positive or negative lookahead or lookbehind
|
||||
assertion. However, it must be a traditional atomic assertion, not one of the
|
||||
<a href="#nonatomicassertions">non-atomic assertions.</a>
|
||||
</P>
|
||||
<P>
|
||||
Consider this pattern, again containing non-significant white space, and with
|
||||
the two alternatives on the second line:
|
||||
<pre>
|
||||
(?(?=[^a-z]*[a-z])
|
||||
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
|
||||
</pre>
|
||||
The condition is a positive lookahead assertion that matches an optional
|
||||
sequence of non-letters followed by a letter. In other words, it tests for the
|
||||
presence of at least one letter in the subject. If a letter is found, the
|
||||
subject is matched against the first alternative; otherwise it is matched
|
||||
against the second. This pattern matches strings in one of the two forms
|
||||
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
|
||||
</P>
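<P>
When the condition is an atomic lookahead that contains no capture groups, a
conditional group of this kind can be rewritten as two alternatives, one
guarded by the lookahead and the other by its negation. The Python sketch below
shows this rewriting with the standard <b>re</b> module, which does not support
assertion conditions directly. The names are hypothetical, and the equivalence
is an assumption that holds only for capture-free conditions such as this one.
</P>

```python
import re

CONDITION = r"[^a-z]*[a-z]"  # "there is at least one letter ahead"

DATE = re.compile(
    rf"(?={CONDITION})\d{{2}}-[a-z]{{3}}-\d{{2}}"  # condition true: yes-pattern
    rf"|(?!{CONDITION})\d{{2}}-\d{{2}}-\d{{2}}"     # condition false: no-pattern
)

print(bool(DATE.match("12-abc-34")))
print(bool(DATE.match("12-34-56")))
# A letter later in the subject selects the first alternative, which then
# fails; the second alternative is not tried as a fallback.
print(bool(DATE.match("12-34-56 x")))
```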
|
||||
<P>
|
||||
When an assertion that is a condition contains capture groups, any
|
||||
capturing that occurs in a matching branch is retained afterwards, for both
|
||||
positive and negative assertions, because matching always continues after the
|
||||
assertion, whether it succeeds or fails. (Compare non-conditional assertions,
|
||||
for which captures are retained only for positive assertions that succeed.)
|
||||
<a name="comments"></a></P>
|
||||
<br><a name="SEC27" href="#TOC1">COMMENTS</a><br>
|
||||
<P>
|
||||
There are two ways of including comments in patterns that are processed by
|
||||
PCRE2. In both cases, the start of the comment must not be in a character
|
||||
class, nor in the middle of any other sequence of related characters such as
|
||||
(?: or a group name or number or a Unicode property name. The characters that
|
||||
make up a comment play no part in the pattern matching.
|
||||
</P>
|
||||
<P>
|
||||
The sequence (?# marks the start of a comment that continues up to the next
|
||||
closing parenthesis. Nested parentheses are not permitted. If the
|
||||
PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
|
||||
also introduces a comment, which in this case continues to immediately after
|
||||
the next newline character or character sequence in the pattern. Which
|
||||
characters are interpreted as newlines is controlled by an option passed to the
|
||||
compiling function or by a special sequence at the start of the pattern, as
|
||||
described in the section entitled
|
||||
<a href="#newlines">"Newline conventions"</a>
|
||||
above. Note that the end of this type of comment is a literal newline sequence
|
||||
in the pattern; escape sequences that happen to represent a newline do not
|
||||
count. For example, consider this pattern when PCRE2_EXTENDED is set, and the
|
||||
default newline convention (a single linefeed character) is in force:
|
||||
<pre>
|
||||
abc #comment \n still comment
|
||||
</pre>
|
||||
On encountering the # character, <b>pcre2_compile()</b> skips along, looking for
|
||||
a newline in the pattern. The sequence \n is still literal at this stage, so
|
||||
it does not terminate the comment. Only an actual character with the code value
|
||||
0x0a (the default newline) does so.
|
||||
<a name="recursion"></a></P>
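<P>
The distinction between an escape sequence and a real newline inside a comment
can be observed in Python's standard <b>re</b> module, whose re.VERBOSE mode
treats # comments in a comparable way. This is a sketch with a different
engine, not a demonstration of <b>pcre2_compile()</b>, and the variable names
are invented for illustration.
</P>

```python
import re

# Raw string: the two characters "\" and "n" are part of the comment and do
# not end it, so the pattern is just "abc".
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real linefeed (0x0a), so the comment ends there
# and "def" becomes part of the pattern.
literal = re.compile("abc #comment \n def", re.VERBOSE)

print(bool(escaped.fullmatch("abc")))
print(bool(literal.fullmatch("abcdef")))
print(bool(literal.fullmatch("abc")))
```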
|
||||
<br><a name="SEC28" href="#TOC1">RECURSIVE PATTERNS</a><br>
|
||||
<P>
|
||||
Consider the problem of matching a string in parentheses, allowing for
|
||||
unlimited nested parentheses. Without the use of recursion, the best that can
|
||||
be done is to use a pattern that matches up to some fixed depth of nesting. It
|
||||
is not possible to handle an arbitrary nesting depth.
|
||||
</P>
|
||||
<P>
|
||||
For some time, Perl has provided a facility that allows regular expressions to
|
||||
recurse (amongst other things). It does this by interpolating Perl code in the
|
||||
expression at run time, and the code can refer to the expression itself. A Perl
|
||||
pattern using code interpolation to solve the parentheses problem can be
|
||||
created like this:
|
||||
<pre>
|
||||
  $re = qr{\( (?: (?>[^()]+) | (??{$re}) )* \)}x;
</pre>
The (??{...}) item interpolates Perl code at run time, and in this case refers
|
||||
recursively to the pattern in which it appears.
|
||||
</P>
|
||||
<P>
|
||||
Obviously, PCRE2 cannot support the interpolation of Perl code. Instead, it
|
||||
supports special syntax for recursion of the entire pattern, and also for
|
||||
individual capture group recursion. After its introduction in PCRE1 and Python,
|
||||
this kind of recursion was subsequently introduced into Perl at release 5.10.
|
||||
</P>
|
||||
<P>
|
||||
A special item that consists of (? followed by a number greater than zero and a
|
||||
closing parenthesis is a recursive subroutine call of the capture group of the
|
||||
given number, provided that it occurs inside that group. (If not, it is a
|
||||
<a href="#groupsassubroutines">non-recursive subroutine</a>
|
||||
call, which is described in the next section.) The special item (?R) or (?0) is
|
||||
a recursive call of the entire regular expression.
|
||||
</P>
|
||||
<P>
|
||||
This PCRE2 pattern solves the nested parentheses problem (assume the
|
||||
PCRE2_EXTENDED option is set so that white space is ignored):
|
||||
<pre>
|
||||
\( ( [^()]++ | (?R) )* \)
|
||||
</pre>
|
||||
First it matches an opening parenthesis. Then it matches any number of
|
||||
substrings which can either be a sequence of non-parentheses, or a recursive
|
||||
match of the pattern itself (that is, a correctly parenthesized substring).
|
||||
Finally there is a closing parenthesis. Note the use of a possessive quantifier
|
||||
to avoid backtracking into sequences of non-parentheses.
|
||||
</P>
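<P>
The structure of this pattern corresponds closely to a small recursive
function. The Python sketch below is a hand-written matcher, not a regular
expression (Python's standard <b>re</b> module has no recursion), and its name
and return convention are assumptions made for illustration. It tries a match
at one starting position only, whereas an unanchored pattern would also be
tried at later positions in the subject.
</P>

```python
def match_nested(subject, pos=0):
    """Return the index just past a balanced "(...)" starting at pos, or -1."""
    if pos >= len(subject) or subject[pos] != "(":
        return -1                      # the opening parenthesis is required
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1             # the closing parenthesis ends the match
        if ch == "(":
            # Corresponds to (?R): match a nested parenthesized substring.
            pos = match_nested(subject, pos)
            if pos == -1:
                return -1
        else:
            pos += 1                   # corresponds to [^()]++, never given back
    return -1                          # ran out of subject before ")"


print(match_nested("(ab(cd)ef)"))
print(match_nested("(aaaa()"))
```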
|
||||
<P>
|
||||
If this were part of a larger pattern, you would not want to recurse the entire
|
||||
pattern, so instead you could use this:
|
||||
<pre>
|
||||
( \( ( [^()]++ | (?1) )* \) )
|
||||
</pre>
|
||||
We have put the pattern into parentheses, and caused the recursion to refer to
|
||||
them instead of the whole pattern.
|
||||
</P>
|
||||
<P>
|
||||
In a larger pattern, keeping track of parenthesis numbers can be tricky. This
|
||||
is made easier by the use of relative references. Instead of (?1) in the
|
||||
pattern above you can write (?-2) to refer to the second most recently opened
|
||||
parentheses preceding the recursion. In other words, a negative number counts
|
||||
capturing parentheses leftwards from the point at which it is encountered.
|
||||
</P>
|
||||
<P>
|
||||
Be aware, however, that if
|
||||
<a href="#dupgroupnumber">duplicate capture group numbers</a>
|
||||
are in use, relative references refer to the earliest group with the
|
||||
appropriate number. Consider, for example:
|
||||
<pre>
|
||||
(?|(a)|(b)) (c) (?-2)
|
||||
</pre>
|
||||
The first two capture groups (a) and (b) are both numbered 1, and group (c)
|
||||
is number 2. When the reference (?-2) is encountered, the second most recently
|
||||
opened group has the number 1, but it is the first such group (the (a)
|
||||
group) to which the recursion refers. This would be the same if an absolute
|
||||
reference (?1) was used. In other words, relative references are just a
|
||||
shorthand for computing a group number.
|
||||
</P>
|
||||
<P>
|
||||
It is also possible to refer to subsequent capture groups, by writing
|
||||
references such as (?+2). However, these cannot be recursive because the
|
||||
reference is not inside the parentheses that are referenced. They are always
|
||||
<a href="#groupsassubroutines">non-recursive subroutine</a>
|
||||
calls, as described in the next section.
|
||||
</P>
|
||||
<P>
|
||||
An alternative approach is to use named parentheses. The Perl syntax for this
|
||||
is (?&name); PCRE1's earlier syntax (?P>name) is also supported. We could
|
||||
rewrite the above example as follows:
|
||||
<pre>
|
||||
(?<pn> \( ( [^()]++ | (?&pn) )* \) )
|
||||
</pre>
|
||||
If there is more than one group with the same name, the earliest one is
|
||||
used.
|
||||
</P>
|
||||
<P>
|
||||
The example pattern that we have been looking at contains nested unlimited
|
||||
repeats, and so the use of a possessive quantifier for matching strings of
|
||||
non-parentheses is important when applying the pattern to strings that do not
|
||||
match. For example, when this pattern is applied to
|
||||
<pre>
|
||||
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
|
||||
</pre>
|
||||
it yields "no match" quickly. However, if a possessive quantifier is not used,
|
||||
the match runs for a very long time indeed because there are so many different
|
||||
ways the + and * repeats can carve up the subject, and all have to be tested
|
||||
before failure can be reported.
|
||||
</P>
|
||||
<P>
|
||||
At the end of a match, the values of capturing parentheses are those from
|
||||
the outermost level. If you want to obtain intermediate values, a callout
|
||||
function can be used (see below and the
|
||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||
documentation). If the pattern above is matched against
|
||||
<pre>
|
||||
(ab(cd)ef)
|
||||
</pre>
|
||||
the value for the inner capturing parentheses (numbered 2) is "ef", which is
|
||||
the last value taken on at the top level. If a capture group is not matched at
|
||||
the top level, its final captured value is unset, even if it was (temporarily)
|
||||
set at a deeper level during the matching process.
|
||||
</P>
|
||||
<P>
|
||||
Do not confuse the (?R) item with the condition (R), which tests for recursion.
|
||||
Consider this pattern, which matches text in angle brackets, allowing for
|
||||
arbitrary nesting. Only digits are allowed in nested brackets (that is, when
|
||||
recursing), whereas any characters are permitted at the outer level.
|
||||
<pre>
|
||||
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
|
||||
</pre>
|
||||
In this pattern, (?(R) is the start of a conditional group, with two different
|
||||
alternatives for the recursive and non-recursive cases. The (?R) item is the
|
||||
actual recursive call.
|
||||
<a name="recursiondifference"></a></P>
|
||||
<br><b>
|
||||
Differences in recursion processing between PCRE2 and Perl
|
||||
</b><br>
|
||||
<P>
|
||||
Some former differences between PCRE2 and Perl no longer exist.
|
||||
</P>
|
||||
<P>
|
||||
Before release 10.30, recursion processing in PCRE2 differed from Perl in that
|
||||
a recursive subroutine call was always treated as an atomic group. That is,
|
||||
once it had matched some of the subject string, it was never re-entered, even
|
||||
if it contained untried alternatives and there was a subsequent matching
|
||||
failure. (Historical note: PCRE implemented recursion before Perl did.)
|
||||
</P>
|
||||
<P>
|
||||
Starting with release 10.30, recursive subroutine calls are no longer treated
|
||||
as atomic. That is, they can be re-entered to try unused alternatives if there
|
||||
is a matching failure later in the pattern. This is now compatible with the way
|
||||
Perl works. If you want a subroutine call to be atomic, you must explicitly
|
||||
enclose it in an atomic group.
|
||||
</P>
|
||||
<P>
|
||||
Supporting backtracking into recursions simplifies certain types of recursive
|
||||
pattern. For example, this pattern matches palindromic strings:
|
||||
<pre>
|
||||
^((.)(?1)\2|.?)$
|
||||
</pre>
|
||||
The second branch in the group matches a single central character in the
|
||||
palindrome when there are an odd number of characters, or nothing when there
|
||||
are an even number of characters, but in order to work it has to be able to try
|
||||
the second case when the rest of the pattern match fails. If you want to match
|
||||
typical palindromic phrases, the pattern has to ignore all non-word characters,
|
||||
which can be done like this:
|
||||
<pre>
|
||||
^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
|
||||
</pre>
|
||||
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
|
||||
man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
|
||||
avoid backtracking into sequences of non-word characters. Without this, PCRE2
|
||||
takes a great deal longer (ten times or more) to match typical phrases, and
|
||||
Perl takes so long that you think it has gone into a loop.
|
||||
</P>
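<P>
The recursive shape of these two palindrome patterns can also be written as
ordinary code. The following Python sketch is an approximation for
illustration, not a translation that PCRE2 performs: the function names are
invented, and the phrase version simply removes non-word characters and folds
case before testing, which is assumed to have the same effect as the \W*+ items
and PCRE2_CASELESS.
</P>

```python
import re


def is_palindrome(s):
    # Second branch of the pattern: a single central character, or nothing.
    if len(s) <= 1:
        return True
    # First branch: a character, a recursive match, then the same character.
    return s[0] == s[-1] and is_palindrome(s[1:-1])


def is_palindromic_phrase(phrase):
    # Drop all non-word characters and ignore case before testing.
    letters = re.sub(r"\W+", "", phrase).lower()
    return is_palindrome(letters)


print(is_palindrome("level"))
print(is_palindromic_phrase("A man, a plan, a canal: Panama!"))
```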
|
||||
<P>
|
||||
Another way in which PCRE2 and Perl used to differ in their recursion
|
||||
processing is in the handling of captured values. Formerly in Perl, when a
|
||||
group was called recursively or as a subroutine (see the next section), it
|
||||
had no access to any values that were captured outside the recursion, whereas
|
||||
in PCRE2 these values can be referenced. Consider this pattern:
|
||||
<pre>
|
||||
^(.)(\1|a(?2))
|
||||
</pre>
|
||||
This pattern matches "bab". The first capturing parentheses match "b", then in
|
||||
the second group, when the backreference \1 fails to match "b", the second
|
||||
alternative matches "a" and then recurses. In the recursion, \1 does now match
|
||||
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
||||
later versions (I tried 5.024) it now works.
|
||||
<a name="groupsassubroutines"></a></P>
|
||||
<br><a name="SEC29" href="#TOC1">GROUPS AS SUBROUTINES</a><br>
|
||||
<P>
|
||||
If the syntax for a recursive group call (either by number or by name) is used
|
||||
outside the parentheses to which it refers, it operates a bit like a subroutine
|
||||
in a programming language. More accurately, PCRE2 treats the referenced group
|
||||
as an independent subpattern which it tries to match at the current matching
|
||||
position. The called group may be defined before or after the reference. A
|
||||
numbered reference can be absolute or relative, as in these examples:
|
||||
<pre>
|
||||
(...(absolute)...)...(?2)...
|
||||
(...(relative)...)...(?-1)...
|
||||
(...(?+1)...(relative)...
|
||||
</pre>
|
||||
An earlier example pointed out that the pattern
|
||||
<pre>
|
||||
(sens|respons)e and \1ibility
|
||||
</pre>
|
||||
matches "sense and sensibility" and "response and responsibility", but not
|
||||
"sense and responsibility". If instead the pattern
|
||||
<pre>
|
||||
(sens|respons)e and (?1)ibility
|
||||
</pre>
|
||||
is used, it does match "sense and responsibility" as well as the other two
|
||||
strings. Another example is given in the discussion of DEFINE above.
|
||||
</P>
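<P>
The difference between a backreference and a subroutine call can be seen by
comparing the two patterns on the same subjects. Python's standard <b>re</b>
module supports backreferences but not subroutine calls, so in the sketch below
the call is imitated by repeating the text of the group. That is an assumption
made for illustration, and the constant names are hypothetical.
</P>

```python
import re

# Backreference: the second word must repeat what group 1 actually matched.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# Stand-in for (?1): the group's subpattern is matched afresh, so either
# alternative may be chosen the second time.
SUBROUTINE_LIKE = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

for subject in ["sense and sensibility", "sense and responsibility"]:
    print(subject,
          bool(BACKREF.fullmatch(subject)),
          bool(SUBROUTINE_LIKE.fullmatch(subject)))
```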
|
||||
<P>
|
||||
Like recursions, subroutine calls used to be treated as atomic, but this
|
||||
changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
|
||||
occur. However, any capturing parentheses that are set during the subroutine
|
||||
call revert to their previous values afterwards.
|
||||
</P>
|
||||
<P>
|
||||
Processing options such as case-independence are fixed when a group is
|
||||
defined, so if it is used as a subroutine, such options cannot be changed for
|
||||
different calls. For example, consider this pattern:
|
||||
<pre>
|
||||
(abc)(?i:(?-1))
|
||||
</pre>
|
||||
It matches "abcabc". It does not match "abcABC" because the change of
|
||||
processing option does not affect the called group.
|
||||
</P>
|
||||
<P>
|
||||
The behaviour of
|
||||
<a href="#backtrackcontrol">backtracking control verbs</a>
|
||||
in groups when called as subroutines is described in the section entitled
|
||||
<a href="#btsub">"Backtracking verbs in subroutines"</a>
|
||||
below.
|
||||
<a name="onigurumasubroutines"></a></P>
|
||||
<br><a name="SEC30" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
|
||||
<P>
|
||||
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
|
||||
a number enclosed either in angle brackets or single quotes is an alternative
|
||||
syntax for calling a group as a subroutine, possibly recursively. Here are two
|
||||
of the examples used above, rewritten using this syntax:
|
||||
<pre>
|
||||
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
|
||||
(sens|respons)e and \g'1'ibility
|
||||
</pre>
|
||||
PCRE2 supports an extension to Oniguruma: if a number is preceded by a
|
||||
plus or a minus sign it is taken as a relative reference. For example:
|
||||
<pre>
|
||||
(abc)(?i:\g<-1>)
|
||||
</pre>
|
||||
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i>
|
||||
synonymous. The former is a backreference; the latter is a subroutine call.
|
||||
</P>
|
||||
<br><a name="SEC31" href="#TOC1">CALLOUTS</a><br>
|
||||
<P>
|
||||
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
|
||||
code to be obeyed in the middle of matching a regular expression. This makes it
|
||||
possible, amongst other things, to extract different substrings that match the
|
||||
same pair of parentheses when there is a repetition.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2 provides a similar feature, but of course it cannot obey arbitrary Perl
|
||||
code. The feature is called "callout". The caller of PCRE2 provides an external
|
||||
function by putting its entry point in a match context using the function
|
||||
<b>pcre2_set_callout()</b>, and then passing that context to <b>pcre2_match()</b>
|
||||
or <b>pcre2_dfa_match()</b>. If no match context is passed, or if the callout
|
||||
entry point is set to NULL, callout points will be passed over silently during
|
||||
matching. To disallow callouts in the pattern syntax, you may use the
|
||||
PCRE2_EXTRA_NEVER_CALLOUT option.
|
||||
</P>
|
||||
<P>
|
||||
Within a regular expression, (?C<arg>) indicates a point at which the external
|
||||
function is to be called. There are two kinds of callout: those with a
|
||||
numerical argument and those with a string argument. (?C) on its own with no
|
||||
argument is treated as (?C0). A numerical argument allows the application to
|
||||
distinguish between different callouts. String arguments were added for release
|
||||
10.20 to make it possible for script languages that use PCRE2 to embed short
|
||||
scripts within patterns in a similar way to Perl.
|
||||
</P>
|
||||
<P>
|
||||
During matching, when PCRE2 reaches a callout point, the external function is
|
||||
called. It is provided with the number or string argument of the callout, the
|
||||
position in the pattern, and one item of data that is also set in the match
|
||||
block. The callout function may cause matching to proceed, to backtrack, or to
|
||||
fail.
|
||||
</P>
|
||||
<P>
|
||||
By default, PCRE2 implements a number of optimizations at matching time, and
|
||||
one side-effect is that sometimes callouts are skipped. If you need all
|
||||
possible callouts to happen, you need to set options that disable the relevant
|
||||
optimizations. More details, including a complete description of the
|
||||
programming interface to the callout function, are given in the
|
||||
<a href="pcre2callout.html"><b>pcre2callout</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<br><b>
|
||||
Callouts with numerical arguments
|
||||
</b><br>
|
||||
<P>
|
||||
If you just want to have a means of identifying different callout points, put a
|
||||
number less than 256 after the letter C. For example, this pattern has two
|
||||
callout points:
|
||||
<pre>
|
||||
(?C1)abc(?C2)def
|
||||
</pre>
|
||||
If the PCRE2_AUTO_CALLOUT flag is passed to <b>pcre2_compile()</b>, numerical
|
||||
callouts are automatically installed before each item in the pattern. They are
|
||||
all numbered 255. If there is a conditional group in the pattern whose
|
||||
condition is an assertion, an additional callout is inserted just before the
|
||||
condition. An explicit callout may also be set at this position, as in this
|
||||
example:
|
||||
<pre>
|
||||
(?(?C9)(?=a)abc|def)
|
||||
</pre>
|
||||
Note that this applies only to assertion conditions, not to other types of
|
||||
condition.
|
||||
</P>
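<P>
The numbering rules above can be sketched in a few lines of Python. This is
only an illustration: the function name numeric_callouts is hypothetical, and
the scan is deliberately naive (it does not skip escaped text or character
classes as a real pattern compiler would).
</P>

```python
import re

# Hypothetical helper: finds (?C) and (?Cn) items with a purely numerical argument.
_NUMERIC_CALLOUT = re.compile(r"\(\?C([0-9]*)\)")


def numeric_callouts(pattern):
    """Return the callout numbers found in pattern, in order of appearance."""
    numbers = []
    for m in _NUMERIC_CALLOUT.finditer(pattern):
        number = int(m.group(1) or "0")      # (?C) on its own is treated as (?C0)
        if number > 255:
            raise ValueError(f"callout number {number} is not less than 256")
        numbers.append(number)
    return numbers
```

<P>
With this sketch, (?C1)abc(?C2)def yields the numbers 1 and 2, and the explicit
callout in (?(?C9)(?=a)abc|def) yields 9.
</P>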
|
||||
<br><b>
|
||||
Callouts with string arguments
|
||||
</b><br>
|
||||
<P>
|
||||
A delimited string may be used instead of a number as a callout argument. The
|
||||
starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
|
||||
the same as the start, except for {, where the ending delimiter is }. If the
|
||||
ending delimiter is needed within the string, it must be doubled. For
|
||||
example:
|
||||
<pre>
|
||||
(?C'ab ''c'' d')xyz(?C{any text})pqr
|
||||
</pre>
|
||||
The doubling is removed before the string is passed to the callout function.
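The delimiter rules can be sketched in Python as follows. This is an
illustration only, not PCRE2's code: the function name parse_callout_string and
its return convention (the decoded string plus the index just past the closing
delimiter) are assumptions made for this example.

```python
# Opening delimiters and their matching closers, as listed in the text above.
_CLOSING = {"`": "`", "'": "'", '"': '"', "^": "^",
            "%": "%", "#": "#", "$": "$", "{": "}"}


def parse_callout_string(text, pos=0):
    """Decode a delimited callout string that starts at text[pos].

    Returns (decoded_string, index_after_closing_delimiter).
    """
    opener = text[pos]
    if opener not in _CLOSING:
        raise ValueError(f"invalid callout string delimiter: {opener!r}")
    closer = _CLOSING[opener]
    decoded = []
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == closer:
            if text[i + 1:i + 2] == closer:
                decoded.append(closer)       # doubled closer stands for one closer
                i += 2
                continue
            return "".join(decoded), i + 1   # a single closer ends the string
        decoded.append(ch)
        i += 1
    raise ValueError("unterminated callout string")
```

Applied to the text that follows (?C in the example above, the first argument
decodes to ab 'c' d and the second to any text.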
|
||||
<a name="backtrackcontrol"></a></P>
|
||||
<br><a name="SEC32" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
<P>
|
||||
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
||||
terminology) that modify the behaviour of backtracking during matching. They
|
||||
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
|
||||
and may behave differently depending on whether or not a name argument is
|
||||
present. The names are not required to be unique within the pattern.
|
||||
</P>
|
||||
<P>
|
||||
By default, for compatibility with Perl, a name is any sequence of characters
|
||||
that does not include a closing parenthesis. The name is not processed in
|
||||
any way, and it is not possible to include a closing parenthesis in the name.
|
||||
This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
|
||||
is no longer Perl-compatible.
|
||||
</P>
|
||||
<P>
|
||||
When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
|
||||
and only an unescaped closing parenthesis terminates the name. However, the
|
||||
only backslash items that are permitted are \Q, \E, and sequences such as
|
||||
\x{100} that define character code points. Character type escapes such as \d
|
||||
are faulted.
|
||||
</P>
|
||||
<P>
|
||||
A closing parenthesis can be included in a name either as \) or between \Q
|
||||
and \E. In addition to backslash processing, if the PCRE2_EXTENDED or
|
||||
PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
|
||||
skipped, and #-comments are recognized, exactly as in the rest of the pattern.
|
||||
PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
|
||||
PCRE2_ALT_VERBNAMES is also set.
|
||||
</P>
|
||||
<P>
|
||||
The maximum length of a name is 255 in the 8-bit library and 65535 in the
|
||||
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
|
||||
parenthesis immediately follows the colon, the effect is as if the colon were
|
||||
not there. Any number of these verbs may occur in a pattern. Except for
|
||||
(*ACCEPT), they may not be quantified.
|
||||
</P>
|
||||
<P>
|
||||
Since these verbs are specifically related to backtracking, most of them can be
|
||||
used only when the pattern is to be matched using the traditional matching
|
||||
function or JIT, because they use backtracking algorithms. With the exception
|
||||
of (*FAIL), which behaves like a failing negative assertion, the backtracking
|
||||
control verbs cause an error if encountered by the DFA matching function.
|
||||
</P>
|
||||
<P>
|
||||
The behaviour of these verbs in
|
||||
<a href="#btrepeat">repeated groups,</a>
|
||||
<a href="#btassert">assertions,</a>
|
||||
and in
|
||||
<a href="#btsub">capture groups called as subroutines</a>
|
||||
(whether or not recursively) is documented below.
|
||||
<a name="nooptimize"></a></P>
|
||||
<br><b>
|
||||
Optimizations that affect backtracking verbs
|
||||
</b><br>
|
||||
<P>
|
||||
PCRE2 contains some optimizations that are used to speed up matching by running
|
||||
some checks at the start of each match attempt. For example, it may know the
|
||||
minimum length of matching subject, or that a particular character must be
|
||||
present. When one of these optimizations bypasses the running of a match, any
|
||||
included backtracking verbs will not, of course, be processed. You can suppress
|
||||
the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option
|
||||
when calling <b>pcre2_compile()</b>, by calling <b>pcre2_set_optimize()</b> with a
|
||||
PCRE2_START_OPTIMIZE_OFF directive, or by starting the pattern with
|
||||
(*NO_START_OPT). There is more discussion of this option in the section
|
||||
entitled
|
||||
<a href="pcre2api.html#compiling">"Compiling a pattern"</a>
|
||||
in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
Experiments with Perl suggest that it too has similar optimizations, and like
|
||||
PCRE2, turning them off can change the result of a match.
|
||||
<a name="acceptverb"></a></P>
|
||||
<br><b>
|
||||
Verbs that act immediately
|
||||
</b><br>
|
||||
<P>
|
||||
The following verbs act as soon as they are encountered.
|
||||
<pre>
|
||||
(*ACCEPT) or (*ACCEPT:NAME)
|
||||
</pre>
|
||||
This verb causes the match to end successfully, skipping the remainder of the
|
||||
pattern. However, when it is inside a capture group that is called as a
|
||||
subroutine, only that group is ended successfully. Matching then continues
|
||||
at the outer level. If (*ACCEPT) is triggered in a positive assertion, the
|
||||
assertion succeeds; in a negative assertion, the assertion fails.
|
||||
</P>
|
||||
<P>
|
||||
If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For
|
||||
example:
|
||||
<pre>
|
||||
A((?:A|B(*ACCEPT)|C)D)
|
||||
</pre>
|
||||
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
|
||||
the outer parentheses.
|
||||
</P>
|
||||
<P>
|
||||
(*ACCEPT) is the only backtracking verb that is allowed to be quantified
|
||||
because an ungreedy quantification with a minimum of zero acts only when a
|
||||
backtrack happens. Consider, for example,
|
||||
<pre>
|
||||
(A(*ACCEPT)??B)C
|
||||
</pre>
|
||||
where A, B, and C may be complex expressions. After matching "A", the matcher
|
||||
processes "BC"; if that fails, causing a backtrack, (*ACCEPT) is triggered and
|
||||
the match succeeds. In both cases, all but C is captured. Whereas (*COMMIT)
|
||||
(see below) means "fail on backtrack", a repeated (*ACCEPT) of this type means
|
||||
"succeed on backtrack".
|
||||
</P>
|
||||
<P>
|
||||
<b>Warning:</b> (*ACCEPT) should not be used within a script run group, because
|
||||
it causes an immediate exit from the group, bypassing the script run checking.
|
||||
<pre>
|
||||
(*FAIL) or (*FAIL:NAME)
|
||||
</pre>
|
||||
This verb causes a matching failure, forcing backtracking to occur. It may be
|
||||
abbreviated to (*F). It is equivalent to (?!) but easier to read. The Perl
|
||||
documentation notes that it is probably useful only when combined with (?{}) or
|
||||
(??{}). Those are, of course, Perl features that are not present in PCRE2. The
|
||||
nearest equivalent is the callout feature, as for example in this pattern:
|
||||
<pre>
|
||||
a+(?C)(*FAIL)
|
||||
</pre>
|
||||
A match with the string "aaaa" always fails, but the callout is taken before
|
||||
each backtrack happens (in this example, 10 times).
|
||||
</P>
|
||||
<P>
|
||||
(*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*ACCEPT) and
|
||||
(*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is recorded just before
|
||||
the verb acts.
|
||||
</P>
|
||||
<br><b>
|
||||
Recording which path was taken
|
||||
</b><br>
|
||||
<P>
|
||||
There is one verb whose main purpose is to track how a match was arrived at,
|
||||
though it also has a secondary use in conjunction with advancing the match
|
||||
starting point (see (*SKIP) below).
|
||||
<pre>
|
||||
(*MARK:NAME) or (*:NAME)
|
||||
</pre>
|
||||
A name is always required with this verb. For all the other backtracking
|
||||
control verbs, a NAME argument is optional.
|
||||
</P>
|
||||
<P>
|
||||
When a match succeeds, the last-encountered mark name on the
|
||||
matching path is passed back to the caller as described in the section entitled
|
||||
<a href="pcre2api.html#matchotherdata">"Other information about the match"</a>
|
||||
in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation. This applies to all instances of (*MARK) and other verbs,
|
||||
including those inside assertions and atomic groups. However, there are
|
||||
differences in those cases when (*MARK) is used in conjunction with (*SKIP) as
|
||||
described below.
|
||||
</P>
|
||||
<P>
|
||||
The mark name that was last encountered on the matching path is passed back. A
|
||||
verb without a NAME argument is ignored for this purpose. Here is an example of
|
||||
<b>pcre2test</b> output, where the "mark" modifier requests the retrieval and
|
||||
outputting of (*MARK) data:
|
||||
<pre>
|
||||
re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
|
||||
data> XY
|
||||
0: XY
|
||||
MK: A
|
||||
XZ
|
||||
0: XZ
|
||||
MK: B
|
||||
</pre>
|
||||
The (*MARK) name is tagged with "MK:" in this output, and in this example it
|
||||
indicates which of the two alternatives matched. This is a more efficient way
|
||||
of obtaining this information than putting each alternative in its own
|
||||
capturing parentheses.
|
||||
</P>
|
||||
<P>
|
||||
If a verb with a name is encountered in a positive assertion that is true, the
|
||||
name is recorded and passed back if it is the last-encountered. This does not
|
||||
happen for negative assertions or failing positive assertions.
|
||||
</P>
|
||||
<P>
|
||||
After a partial match or a failed match, the last encountered name in the
|
||||
entire match process is returned. For example:
|
||||
<pre>
|
||||
re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
|
||||
data> XP
|
||||
No match, mark = B
|
||||
</pre>
|
||||
Note that in this unanchored example the mark is retained from the match
|
||||
attempt that started at the letter "X" in the subject. Subsequent match
|
||||
attempts starting at "P" and then with an empty string do not get as far as the
|
||||
(*MARK) item, but nevertheless do not reset it.
|
||||
</P>
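<P>
The way the mark survives a failed match can be imitated with a small Python
model of the single pattern X(*MARK:A)Y|X(*MARK:B)Z. It is a toy, not PCRE2:
the function name match_marked_alternatives is hypothetical, and the model
assumes that no start-of-match optimizations are applied.
</P>

```python
def match_marked_alternatives(subject):
    """Toy model of /X(*MARK:A)Y|X(*MARK:B)Z/ matched without optimizations.

    Returns (matched_text_or_None, last_mark_or_None).
    """
    alternatives = (("A", "Y"), ("B", "Z"))  # (mark name, character after X)
    last_mark = None
    # Try every starting position, including the empty string at the end.
    for start in range(len(subject) + 1):
        for mark, tail in alternatives:
            if subject[start:start + 1] != "X":
                break                        # (*MARK) is never reached here
            last_mark = mark                 # passing (*MARK:NAME) records the name
            if subject[start + 1:start + 2] == tail:
                return subject[start:start + 2], mark
    return None, last_mark                   # failed: the last mark seen is kept
```

<P>
For the subject "XP" this model reports no match with mark B, as in the
<b>pcre2test</b> output above: the attempts at "P" and at the empty string
never reach a (*MARK) and so leave the earlier name in place.
</P>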
|
||||
<P>
|
||||
If you are interested in (*MARK) values after failed matches, you should
|
||||
probably either set the PCRE2_NO_START_OPTIMIZE option or call
|
||||
<b>pcre2_set_optimize()</b> with a PCRE2_START_OPTIMIZE_OFF directive
|
||||
<a href="#nooptimize">(see above)</a>
|
||||
to ensure that the match is always attempted.
|
||||
</P>
|
||||
<br><b>
|
||||
Verbs that act after backtracking
|
||||
</b><br>
|
||||
<P>
|
||||
The following verbs do nothing when they are encountered. Matching continues
|
||||
with what follows, but if there is a subsequent match failure, causing a
|
||||
backtrack to the verb, a failure is forced. That is, backtracking cannot pass
|
||||
to the left of the verb. However, when one of these verbs appears inside an
|
||||
atomic group or in an atomic lookaround assertion that is true, its effect is
|
||||
confined to that group, because once the group has been matched, there is never
|
||||
any backtracking into it. Backtracking from beyond an atomic assertion or group
|
||||
ignores the entire group, and seeks a preceding backtracking point.
|
||||
</P>
|
||||
<P>
|
||||
These verbs differ in exactly what kind of failure occurs when backtracking
|
||||
reaches them. The behaviour described below is what happens when the verb is
|
||||
not in a subroutine or an assertion. Subsequent sections cover these special
|
||||
cases.
|
||||
<pre>
|
||||
(*COMMIT) or (*COMMIT:NAME)
|
||||
</pre>
|
||||
This verb causes the whole match to fail outright if there is a later matching
|
||||
failure that causes backtracking to reach it. Even if the pattern is
|
||||
unanchored, no further attempts to find a match by advancing the starting point
|
||||
take place. If (*COMMIT) is the only backtracking verb that is encountered,
|
||||
once it has been passed <b>pcre2_match()</b> is committed to finding a match at
|
||||
the current starting point, or not at all. For example:
|
||||
<pre>
|
||||
a+(*COMMIT)b
|
||||
</pre>
|
||||
This matches "xxaab" but not "aacaab". It can be thought of as a kind of
|
||||
dynamic anchor, or "I've started, so I must finish."
|
||||
</P>
|
||||
<P>
|
||||
The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COMMIT). It is
|
||||
like (*MARK:NAME) in that the name is remembered for passing back to the
|
||||
caller. However, (*SKIP:NAME) searches only for names that are set with
|
||||
(*MARK), ignoring those set by any of the other backtracking verbs.
|
||||
</P>
|
||||
<P>
|
||||
If there is more than one backtracking verb in a pattern, a different one that
|
||||
follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a
|
||||
match does not always guarantee that a match must be at this starting point.
|
||||
</P>
|
||||
<P>
|
||||
Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
|
||||
unless PCRE2's start-of-match optimizations are turned off, as shown in this
|
||||
output from <b>pcre2test</b>:
|
||||
<pre>
|
||||
re> /(*COMMIT)abc/
|
||||
data> xyzabc
|
||||
0: abc
|
||||
data>
|
||||
re> /(*COMMIT)abc/no_start_optimize
|
||||
data> xyzabc
|
||||
No match
|
||||
</pre>
|
||||
For the first pattern, PCRE2 knows that any match must start with "a", so the
|
||||
optimization skips along the subject to "a" before applying the pattern to the
|
||||
first set of data. The match attempt then succeeds. The second pattern disables
|
||||
the optimization that skips along to the first character. The pattern is now
|
||||
applied starting at "x", and so the (*COMMIT) causes the match to fail without
|
||||
trying any other starting points.
|
||||
<pre>
|
||||
(*PRUNE) or (*PRUNE:NAME)
|
||||
</pre>
|
||||
This verb causes the match to fail at the current starting position in the
|
||||
subject if there is a later matching failure that causes backtracking to reach
|
||||
it. If the pattern is unanchored, the normal "bumpalong" advance to the next
|
||||
starting character then happens. Backtracking can occur as usual to the left of
|
||||
(*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but
|
||||
if there is no match to the right, backtracking cannot cross (*PRUNE). In
|
||||
simple cases, the use of (*PRUNE) is just an alternative to an atomic group or
|
||||
possessive quantifier, but there are some uses of (*PRUNE) that cannot be
|
||||
expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
|
||||
as (*COMMIT).
|
||||
</P>
|
||||
<P>
|
||||
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
|
||||
like (*MARK:NAME) in that the name is remembered for passing back to the
|
||||
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
|
||||
ignoring those set by other backtracking verbs.
|
||||
<pre>
|
||||
(*SKIP)
|
||||
</pre>
|
||||
This verb, when given without a name, is like (*PRUNE), except that if the
|
||||
pattern is unanchored, the "bumpalong" advance is not to the next character,
|
||||
but to the position in the subject where (*SKIP) was encountered. (*SKIP)
|
||||
signifies that whatever text was matched leading up to it cannot be part of a
|
||||
successful match if there is a later mismatch. Consider:
|
||||
<pre>
|
||||
a+(*SKIP)b
|
||||
</pre>
|
||||
If the subject is "aaaac...", after the first match attempt fails (starting at
|
||||
the first character in the string), the starting point skips on to start the
|
||||
next attempt at "c". Note that a possessive quantifier does not have the same
|
||||
effect as this example; although it would suppress backtracking during the
|
||||
first match attempt, the second attempt would start at the second character
|
||||
instead of skipping on to "c".
|
||||
</P>
|
||||
<P>
|
||||
If (*SKIP) is used to specify a new starting position that is the same as the
|
||||
starting position of the current match, or (by being inside a lookbehind)
|
||||
earlier, the position specified by (*SKIP) is ignored, and instead the normal
|
||||
"bumpalong" occurs.
|
||||
<pre>
|
||||
(*SKIP:NAME)
|
||||
</pre>
|
||||
When (*SKIP) has an associated name, its behaviour is modified. When such a
|
||||
(*SKIP) is triggered, the previous path through the pattern is searched for the
|
||||
most recent (*MARK) that has the same name. If one is found, the "bumpalong"
|
||||
advance is to the subject position that corresponds to that (*MARK) instead of
|
||||
to where (*SKIP) was encountered. If no (*MARK) with a matching name is found,
|
||||
the (*SKIP) is ignored.
|
||||
</P>
|
||||
<P>
|
||||
The search for a (*MARK) name uses the normal backtracking mechanism, which
|
||||
means that it does not see (*MARK) settings that are inside atomic groups or
|
||||
assertions, because they are never re-entered by backtracking. Compare the
|
||||
following <b>pcre2test</b> examples:
|
||||
<pre>
|
||||
re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
|
||||
data> abc
0: a
1: a
data>
re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
data> abc
|
||||
0: b
|
||||
1: b
|
||||
</pre>
|
||||
In the first example, the (*MARK) setting is in an atomic group, so it is not
|
||||
seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored. This allows
|
||||
the second branch of the pattern to be tried at the first character position.
|
||||
In the second example, the (*MARK) setting is not in an atomic group. This
|
||||
allows (*SKIP:X) to find the (*MARK) when it backtracks, and this causes a new
|
||||
matching attempt to start at the second character. This time, the (*MARK) is
|
||||
never seen because "a" does not match "b", so the matcher immediately jumps to
|
||||
the second branch of the pattern.
|
||||
</P>
|
||||
<P>
|
||||
Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores
|
||||
names that are set by other backtracking verbs.
|
||||
<pre>
|
||||
(*THEN) or (*THEN:NAME)
|
||||
</pre>
|
||||
This verb causes a skip to the next innermost alternative when backtracking
|
||||
reaches it. That is, it cancels any further backtracking within the current
|
||||
alternative. Its name comes from the observation that it can be used for a
|
||||
pattern-based if-then-else block:
|
||||
<pre>
|
||||
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
|
||||
</pre>
|
||||
If the COND1 pattern matches, FOO is tried (and possibly further items after
|
||||
the end of the group if FOO succeeds); on failure, the matcher skips to the
|
||||
second alternative and tries COND2, without backtracking into COND1. If that
|
||||
succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no
|
||||
more alternatives, so there is a backtrack to whatever came before the entire
|
||||
group. If (*THEN) is not inside an alternation, it acts like (*PRUNE).
|
||||
</P>
|
||||
<P>
|
||||
The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). It is
|
||||
like (*MARK:NAME) in that the name is remembered for passing back to the
|
||||
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
|
||||
ignoring those set by other backtracking verbs.
|
||||
</P>
|
||||
<P>
|
||||
A group that does not contain a | character is just a part of the enclosing
|
||||
alternative; it is not a nested alternation with only one alternative. The
|
||||
effect of (*THEN) extends beyond such a group to the enclosing alternative.
|
||||
Consider this pattern, where A, B, etc. are complex pattern fragments that do
|
||||
not contain any | characters at this level:
|
||||
<pre>
|
||||
A (B(*THEN)C) | D
|
||||
</pre>
|
||||
If A and B are matched, but there is a failure in C, matching does not
|
||||
backtrack into A; instead it moves to the next alternative, that is, D.
|
||||
However, if the group containing (*THEN) is given an alternative, it
|
||||
behaves differently:
|
||||
<pre>
|
||||
A (B(*THEN)C | (*FAIL)) | D
|
||||
</pre>
|
||||
The effect of (*THEN) is now confined to the inner group. After a failure in C,
|
||||
matching moves to (*FAIL), which causes the whole group to fail because there
|
||||
are no more alternatives to try. In this case, matching does backtrack into A.
|
||||
</P>
|
||||
<P>
|
||||
Note that a conditional group is not considered as having two alternatives,
|
||||
because only one is ever used. In other words, the | character in a conditional
|
||||
group has a different meaning. Ignoring white space, consider:
|
||||
<pre>
|
||||
^.*? (?(?=a) a | b(*THEN)c )
|
||||
</pre>
|
||||
If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
|
||||
it initially matches zero characters. The condition (?=a) then fails, the
|
||||
character "b" is matched, but "c" is not. At this point, matching does not
|
||||
backtrack to .*? as might perhaps be expected from the presence of the |
|
||||
character. The conditional group is part of the single alternative that
|
||||
comprises the whole pattern, and so the match fails. (If there were a backtrack
|
||||
into .*?, allowing it to match "b", the match would succeed.)
|
||||
</P>
|
||||
<P>
|
||||
The verbs just described provide four different "strengths" of control when
|
||||
subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
|
||||
next alternative. (*PRUNE) comes next, failing the match at the current
|
||||
starting position, but allowing an advance to the next character (for an
|
||||
unanchored pattern). (*SKIP) is similar, except that the advance may be more
|
||||
than one character. (*COMMIT) is the strongest, causing the entire match to
|
||||
fail.
|
||||
</P>
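<P>
The difference in where the next match attempt starts can be made concrete
with a toy Python model of the single pattern a+(*VERB)b. This is a sketch
under stated assumptions, not PCRE2's matcher: the function name
match_a_plus_verb_b is hypothetical, start-of-match optimizations are assumed
to be off, and (*THEN) is left out because outside an alternation it acts like
(*PRUNE).
</P>

```python
def match_a_plus_verb_b(subject, verb=None):
    """Toy model of a+(*VERB)b; verb is None, "PRUNE", "SKIP", or "COMMIT".

    Returns (starting_positions_tried, match_span_or_None).
    """
    starts_tried = []
    start = 0
    while start <= len(subject):
        starts_tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1                         # greedy a+
        if end == start:
            start += 1                       # a+ failed before the verb: bumpalong
            continue
        if subject[end:end + 1] == "b":
            return starts_tried, (start, end + 1)
        # "b" failed, so backtracking reaches the verb before a+ can give back.
        if verb == "COMMIT":
            return starts_tried, None        # no further starting points at all
        if verb == "SKIP":
            start = end                      # restart where (*SKIP) was passed
        else:
            # (*PRUNE) fails this start only. With no verb, giving back "a"s
            # cannot help either, because the next character is still "a".
            start += 1
    return starts_tried, None
```

<P>
For the subject "aaaac" this model tries starting positions 0 to 5 with
(*PRUNE) or with no verb, positions 0, 4, and 5 with (*SKIP), and only
position 0 with (*COMMIT).
</P>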
|
||||
<br><b>
|
||||
More than one backtracking verb
|
||||
</b><br>
|
||||
<P>
|
||||
If more than one backtracking verb is present in a pattern, the one that is
|
||||
backtracked onto first acts. For example, consider this pattern, where A, B,
|
||||
etc. are complex pattern fragments:
|
||||
<pre>
|
||||
(A(*COMMIT)B(*THEN)C|ABD)
|
||||
</pre>
|
||||
If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to
|
||||
fail. However, if A and B match, but C fails, the backtrack to (*THEN) causes
|
||||
the next alternative (ABD) to be tried. This behaviour is consistent, but is
|
||||
not always the same as Perl's. It means that if two or more backtracking verbs
|
||||
appear in succession, all but the last of them have no effect. Consider this
|
||||
example:
|
||||
<pre>
|
||||
...(*COMMIT)(*PRUNE)...
|
||||
</pre>
|
||||
If there is a matching failure to the right, backtracking onto (*PRUNE) causes
|
||||
it to be triggered, and its action is taken. There can never be a backtrack
|
||||
onto (*COMMIT).
|
||||
<a name="btrepeat"></a></P>
|
||||
<br><b>
|
||||
Backtracking verbs in repeated groups
|
||||
</b><br>
|
||||
<P>
|
||||
PCRE2 sometimes differs from Perl in its handling of backtracking verbs in
|
||||
repeated groups. For example, consider:
|
||||
<pre>
|
||||
/(a(*COMMIT)b)+ac/
|
||||
</pre>
|
||||
If the subject is "abac", Perl matches unless its optimizations are disabled,
|
||||
but PCRE2 always fails because the (*COMMIT) in the second repeat of the group
|
||||
acts.
|
||||
<a name="btassert"></a></P>
|
||||
<br><b>
|
||||
Backtracking verbs in assertions
|
||||
</b><br>
|
||||
<P>
|
||||
(*FAIL) in any assertion has its normal effect: it forces an immediate
|
||||
backtrack. The behaviour of the other backtracking verbs depends on whether the
assertion is standalone or acting as the condition in a conditional
|
||||
group.
|
||||
</P>
|
||||
<P>
|
||||
(*ACCEPT) in a standalone positive assertion causes the assertion to succeed
|
||||
without any further processing; captured strings and a mark name (if set) are
|
||||
retained. In a standalone negative assertion, (*ACCEPT) causes the assertion to
|
||||
fail without any further processing; captured substrings and any mark name are
|
||||
discarded.
|
||||
</P>
|
||||
<P>
|
||||
If the assertion is a condition, (*ACCEPT) causes the condition to be true for
|
||||
a positive assertion and false for a negative one; captured substrings are
|
||||
retained in both cases.
|
||||
</P>
|
||||
<P>
|
||||
The remaining verbs act only when a later failure causes a backtrack to
|
||||
reach them. This means that, for the Perl-compatible assertions, their effect
|
||||
is confined to the assertion, because Perl lookaround assertions are atomic. A
|
||||
backtrack that occurs after such an assertion is complete does not jump back
|
||||
into the assertion. Note in particular that a (*MARK) name that is set in an
|
||||
assertion is not "seen" by an instance of (*SKIP:NAME) later in the pattern.
|
||||
</P>
|
||||
<P>
|
||||
PCRE2 now supports non-atomic positive assertions and also "scan substring"
|
||||
assertions, as described in the sections entitled
|
||||
<a href="#nonatomicassertions">"Non-atomic assertions"</a>
|
||||
and
|
||||
<a href="#scansubstringassertions">"Scan substring assertions"</a>
|
||||
above. These assertions must be standalone (not used as conditions). They are
|
||||
not Perl-compatible. For these assertions, a later backtrack does jump back
|
||||
into the assertion, and therefore verbs such as (*COMMIT) can be triggered by
|
||||
backtracks from later in the pattern.
|
||||
</P>
|
||||
<P>
|
||||
The effect of (*THEN) is not allowed to escape beyond an assertion. If there
|
||||
are no more branches to try, (*THEN) causes a positive assertion to be false,
|
||||
and a negative assertion to be true. This behaviour differs from Perl when the
|
||||
assertion has only one branch.
|
||||
</P>
|
||||
<P>
|
||||
The other backtracking verbs are not treated specially if they appear in a
|
||||
standalone positive assertion. In a conditional positive assertion,
|
||||
backtracking (from within the assertion) into (*COMMIT), (*SKIP), or (*PRUNE)
|
||||
causes the condition to be false. However, for both standalone and conditional
|
||||
negative assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes
|
||||
the assertion to be true, without considering any further alternative branches.
|
||||
<a name="btsub"></a></P>
|
||||
<br><b>
|
||||
Backtracking verbs in subroutines
|
||||
</b><br>
|
||||
<P>
|
||||
These behaviours occur whether or not the group is called recursively.
|
||||
</P>
|
||||
<P>
|
||||
(*ACCEPT) in a group called as a subroutine causes the subroutine match to
|
||||
succeed without any further processing. Matching then continues after the
|
||||
subroutine call. Perl documents this behaviour. Perl's treatment of the other
|
||||
verbs in subroutines is different in some cases.
|
||||
</P>
|
||||
<P>
|
||||
(*FAIL) in a group called as a subroutine has its normal effect: it forces
|
||||
an immediate backtrack.
|
||||
</P>
|
||||
<P>
|
||||
(*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail when
|
||||
triggered by being backtracked to in a group called as a subroutine. There is
|
||||
then a backtrack at the outer level.
|
||||
</P>
|
||||
<P>
|
||||
(*THEN), when triggered, skips to the next alternative in the innermost
|
||||
enclosing group that has alternatives (its normal behaviour). However, if there
|
||||
is no such group within the subroutine's group, the subroutine match fails and
|
||||
there is a backtrack at the outer level.
|
||||
<a name="ebcdicenvironments"></a></P>
|
||||
<br><a name="SEC33" href="#TOC1">EBCDIC ENVIRONMENTS</a><br>
|
||||
<P>
|
||||
Differences in the way PCRE2 behaves when it is running in an EBCDIC environment
|
||||
are covered in this section.
|
||||
</P>
|
||||
<br><b>
|
||||
Escape sequences
|
||||
</b><br>
|
||||
<P>
|
||||
When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported. \a, \e,
|
||||
\f, \n, \r, and \t generate the appropriate EBCDIC code values. The \c
|
||||
escape is processed as specified for Perl in the <b>perlebcdic</b> document. The
|
||||
only characters that are allowed after \c are A-Z, a-z, or one of @, [, \, ],
|
||||
^, _, or ?. Any other character provokes a compile-time error. The sequence
|
||||
\c@ encodes character code 0; after \c the letters (in either case) encode
|
||||
characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
|
||||
(hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
</P>
|
||||
<P>
|
||||
Thus, apart from \c?, these escapes generate the same character code values as
|
||||
they do in an ASCII or Unicode environment, though the meanings of the values
|
||||
mostly differ. For example, \cG always generates code value 7, which is BEL in
|
||||
ASCII but DEL in EBCDIC.
|
||||
</P>
|
||||
<P>
|
||||
The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but
|
||||
because 127 is not a control character in EBCDIC, Perl makes it generate the
|
||||
APC character. Unfortunately, there are several variants of EBCDIC. In most of
|
||||
them the APC character has the value 255 (hex FF), but in the one Perl calls
|
||||
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
|
||||
values, PCRE2 makes \c? generate 95; otherwise it generates 255.
|
||||
</P>
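<P>
The \c rules above can be sketched as a small lookup. This is only an
illustration of the mapping described in this section, not PCRE2's own code:
the function name <b>ebcdic_ctrl_value()</b> and its <i>apc</i> parameter
(255 for most EBCDIC variants, 95 for POSIX-BC) are hypothetical.
</P>

```c
#include <assert.h>
#include <string.h>

/* Illustrative only: return the code value that \c followed by `ch`
   generates in EBCDIC mode, or -1 if `ch` is not allowed after \c.
   `apc` (hypothetical parameter) is the APC value of the EBCDIC variant:
   255 for most variants, 95 for POSIX-BC. */
int ebcdic_ctrl_value(char ch, int apc)
{
    /* Alphabet strings keep the lookup independent of the host character set. */
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char punct[] = "[\\]^_";
    const char *p;

    assert(apc == 255 || apc == 95);
    if (ch == '\0') return -1;      /* strchr() would match the terminator */
    if (ch == '@') return 0;        /* \c@ encodes character code 0 */
    if (ch == '?') return apc;      /* \c? becomes the APC character */
    if ((p = strchr(upper, ch)) != NULL) return (int)(p - upper) + 1;   /* 1-26 */
    if ((p = strchr(lower, ch)) != NULL) return (int)(p - lower) + 1;   /* 1-26 */
    if ((p = strchr(punct, ch)) != NULL) return (int)(p - punct) + 27;  /* 27-31 */
    return -1;                      /* any other character is a compile-time error */
}
```

<P>
For instance, <b>ebcdic_ctrl_value('G', 255)</b> yields 7, the value that the
text above describes as BEL in ASCII but DEL in EBCDIC.
</P>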
|
||||
<br><b>
|
||||
Character classes
|
||||
</b><br>
|
||||
<P>
|
||||
In character classes there is a special case in EBCDIC environments for ranges
|
||||
whose end points are both specified as literal letters in the same case. For
|
||||
compatibility with Perl, EBCDIC code points within the range that are not
|
||||
letters are omitted. For example, [h-k] matches only four characters, even
|
||||
though the EBCDIC codes for h and k are 0x88 and 0x92, a range of 11 code
|
||||
points. However, if the range is specified numerically, for example,
|
||||
[\x88-\x92] or [h-\x92], all code points are included.
|
||||
</P>
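<P>
The arithmetic behind the [h-k] example can be checked with a short sketch. It
assumes the usual EBCDIC layout of lowercase letters (a-i from 0x81, j-r from
0x91, s-z from 0xA2, as in code pages such as 037 and 1047); the function names
are hypothetical and nothing here is taken from the PCRE2 sources.
</P>

```c
#include <assert.h>

/* EBCDIC code of the lowercase letter at alphabet index 0..25 (a..z).
   Assumed layout: three separate runs, a-i at 0x81, j-r at 0x91, s-z at 0xA2. */
unsigned ebcdic_lower_code(unsigned index)
{
    assert(index < 26);
    if (index < 9)  return 0x81 + index;
    if (index < 18) return 0x91 + (index - 9);
    return 0xA2 + (index - 18);
}

/* Number of code points in the numeric range lo..hi inclusive,
   which is what a class such as [\x88-\x92] covers. */
unsigned code_points_in_range(unsigned lo, unsigned hi)
{
    assert(lo <= hi);
    return hi - lo + 1;
}

/* Number of lowercase letters whose EBCDIC code lies in lo..hi,
   which is what a literal-letter range such as [h-k] covers. */
unsigned letters_in_range(unsigned lo, unsigned hi)
{
    unsigned count = 0;
    for (unsigned i = 0; i < 26; i++) {
        unsigned code = ebcdic_lower_code(i);
        if (code >= lo && code <= hi) count++;
    }
    return count;
}
```

<P>
With h at 0x88 and k at 0x92, the numeric range holds 11 code points, but only
four of them (h, i, j, and k) are letters.
</P>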
|
||||
<br><a name="SEC34" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
|
||||
<b>pcre2syntax</b>(3), <b>pcre2</b>(3).
|
||||
</P>
|
||||
<br><a name="SEC35" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC36" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 27 November 2024
|
||||
<br>
|
||||
Copyright © 1997-2024 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
280
3rd/pcre2/doc/html/pcre2perform.html
Normal file
@@ -0,0 +1,280 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2perform specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2perform man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">PCRE2 PERFORMANCE</a>
|
||||
<li><a name="TOC2" href="#SEC2">COMPILED PATTERN MEMORY USAGE</a>
|
||||
<li><a name="TOC3" href="#SEC3">STACK AND HEAP USAGE AT RUN TIME</a>
|
||||
<li><a name="TOC4" href="#SEC4">PROCESSING TIME</a>
|
||||
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
|
||||
<li><a name="TOC6" href="#SEC6">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE2 PERFORMANCE</a><br>
|
||||
<P>
|
||||
Two aspects of performance are discussed below: memory usage and processing
|
||||
time. The way you express your pattern as a regular expression can affect both
|
||||
of them.
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">COMPILED PATTERN MEMORY USAGE</a><br>
|
||||
<P>
|
||||
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
|
||||
so that most simple patterns do not use much memory for storing the compiled
|
||||
version. However, there is one case where the memory usage of a compiled
|
||||
pattern can be unexpectedly large. If a parenthesized group has a quantifier
|
||||
with a minimum greater than 1 and/or a limited maximum, the whole group is
|
||||
repeated in the compiled code. For example, the pattern
|
||||
<pre>
|
||||
(abc|def){2,4}
|
||||
</pre>
|
||||
is compiled as if it were
|
||||
<pre>
|
||||
(abc|def)(abc|def)((abc|def)(abc|def)?)?
|
||||
</pre>
|
||||
(Technical aside: It is done this way so that backtrack points within each of
|
||||
the repetitions can be independently maintained.)
|
||||
</P>
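<P>
To make the expansion concrete, the following sketch builds the "as if" form of
a quantified group as a string. It only mimics the textual rewriting shown
above; PCRE2 works on compiled code, not on pattern text, and the function
<b>expand_repeat()</b> with its fixed-size static buffer is a hypothetical
helper.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative only: return the textual expansion of group{min,max},
   e.g. "(abc|def)", 2, 4 gives "(abc|def)(abc|def)((abc|def)(abc|def)?)?".
   The result lives in a static buffer (not thread-safe); NULL is returned
   if the expansion does not fit. */
const char *expand_repeat(const char *group, unsigned min, unsigned max)
{
    static char out[1024];
    size_t glen = strlen(group);
    unsigned optional;
    size_t need;

    assert(min <= max && max >= 1);
    optional = max - min;

    /* max copies of the group; each optional copy except the innermost is
       wrapped in "(" ... ")?" (3 characters), the innermost gets just "?". */
    need = (size_t)max * glen
         + (optional > 0 ? 3 * (size_t)(optional - 1) + 1 : 0)
         + 1;                                   /* terminating zero */
    if (need > sizeof out) return NULL;

    out[0] = '\0';
    for (unsigned i = 0; i < min; i++)          /* the mandatory copies */
        strcat(out, group);
    for (unsigned i = 0; i + 1 < optional; i++) {   /* nested optional copies */
        strcat(out, "(");
        strcat(out, group);
    }
    if (optional > 0) {                         /* innermost optional copy */
        strcat(out, group);
        strcat(out, "?");
    }
    for (unsigned i = 0; i + 1 < optional; i++) /* close the nesting */
        strcat(out, ")?");
    return out;
}
```

<P>
The length of the result grows with the maximum repeat count, which is why
large or nested quantifiers on groups can inflate the compiled pattern.
</P>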
|
||||
<P>
|
||||
For regular expressions whose quantifiers use only small numbers, this is not
|
||||
usually a problem. However, if the numbers are large, and particularly if such
|
||||
repetitions are nested, the memory usage can become an embarrassment. For
|
||||
example, the very simple pattern
|
||||
<pre>
|
||||
((ab){1,1000}c){1,3}
|
||||
</pre>
|
||||
uses over 50KiB when compiled using the 8-bit library. When PCRE2 is
|
||||
compiled with its default internal pointer size of two bytes, the size limit on
|
||||
a compiled pattern is 65535 code units in the 8-bit and 16-bit libraries, and
|
||||
this is reached with the above pattern if the outer repetition is increased
|
||||
from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
|
||||
handle larger compiled patterns, but it is better to try to rewrite your
|
||||
pattern to use less memory if you can.
|
||||
</P>
|
||||
<P>
|
||||
One way of reducing the memory usage for such patterns is to make use of
|
||||
PCRE2's
|
||||
<a href="pcre2pattern.html#subpatternsassubroutines">"subroutine"</a>
|
||||
facility. Re-writing the above pattern as
|
||||
<pre>
|
||||
((ab)(?2){0,999}c)(?1){0,2}
|
||||
</pre>
|
||||
reduces the memory requirements to around 16KiB, and indeed it remains under
|
||||
20KiB even with the outer repetition increased to 100. However, this kind of
|
||||
pattern is not always exactly equivalent, because any captures within
|
||||
subroutine calls are lost when the subroutine completes. If this is not a
|
||||
problem, this kind of rewriting will allow you to process patterns that PCRE2
|
||||
cannot otherwise handle. The matching performance of the two different versions
|
||||
of the pattern is roughly the same. (This applies from release 10.30 - things
|
||||
were different in earlier releases.)
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">STACK AND HEAP USAGE AT RUN TIME</a><br>
|
||||
<P>
|
||||
From release 10.30, the interpretive (non-JIT) version of <b>pcre2_match()</b>
|
||||
uses very little system stack at run time. In earlier releases recursive
|
||||
function calls could use a great deal of stack, and this could cause problems,
|
||||
but this usage has been eliminated. Backtracking positions are now explicitly
|
||||
remembered in memory frames controlled by the code.
|
||||
</P>
|
||||
<P>
|
||||
The size of each frame depends on the size of pointer variables and the number
|
||||
of capturing parenthesized groups in the pattern being matched. On a 64-bit
|
||||
system the frame size for a pattern with no captures is 128 bytes. For each
|
||||
capturing group the size increases by 16 bytes.
|
||||
</P>
|
||||
<P>
|
||||
Until release 10.41, an initial 20KiB frames vector was allocated on the system
|
||||
stack, but this still caused some issues for multi-thread applications where
|
||||
each thread has a very small stack. From release 10.41 backtracking memory
|
||||
frames are always held in heap memory. An initial heap allocation is obtained
|
||||
the first time any match data block is passed to <b>pcre2_match()</b>. This is
|
||||
remembered with the match data block and re-used if that block is used for
|
||||
another match. It is freed when the match data block itself is freed.
|
||||
</P>
|
||||
<P>
|
||||
The size of the initial block is the larger of 20KiB or ten times the pattern's
|
||||
frame size, unless the heap limit is less than this, in which case the heap
|
||||
limit is used. If the initial block proves to be too small during matching, it
|
||||
is replaced by a larger block, subject to the heap limit. The heap limit is
|
||||
checked only when a new block is to be allocated. Reducing the heap limit
|
||||
between calls to <b>pcre2_match()</b> with the same match data block does not
|
||||
affect the saved block.
|
||||
</P>
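<P>
The sizing rules in the preceding paragraphs amount to a little arithmetic,
sketched here. The function names are hypothetical, the frame-size formula is
the 64-bit figure quoted above, and the heap limit is taken in bytes purely to
keep the example simple.
</P>

```c
#include <assert.h>
#include <stddef.h>

#define INITIAL_MINIMUM ((size_t)20 * 1024)   /* 20KiB */

/* Frame size on a 64-bit system, per the figures above:
   128 bytes plus 16 bytes for each capturing group. */
size_t frame_size_64bit(size_t capture_groups)
{
    return 128 + 16 * capture_groups;
}

/* Size of the initial heap block: the larger of 20KiB and ten frames,
   unless the heap limit (in bytes here, an assumption) is smaller. */
size_t initial_frames_block(size_t frame_size, size_t heap_limit_bytes)
{
    size_t size;

    assert(frame_size > 0);
    size = 10 * frame_size;
    if (size < INITIAL_MINIMUM) size = INITIAL_MINIMUM;
    if (size > heap_limit_bytes) size = heap_limit_bytes;
    return size;
}
```

<P>
Under these assumptions a pattern needs more than 120 capturing groups before
ten frames exceed 20KiB, so for most patterns the initial block is simply
20KiB.
</P>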
|
||||
<P>
|
||||
In contrast to <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b> does use recursive
|
||||
function calls, but only for processing atomic groups, lookaround assertions,
|
||||
and recursion within the pattern. The original version of the code used to
|
||||
allocate quite large internal workspace vectors on the stack, which caused some
|
||||
problems for some patterns in environments with small stacks. From release
|
||||
10.32 the code for <b>pcre2_dfa_match()</b> has been re-factored to use heap
|
||||
memory when necessary for internal workspace when recursing, though recursive
|
||||
function calls are still used.
|
||||
</P>
|
||||
<P>
|
||||
The "match depth" parameter can be used to limit the depth of function
|
||||
recursion, and the "match heap" parameter to limit heap memory in
|
||||
<b>pcre2_dfa_match()</b>.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br>
|
||||
<P>
|
||||
Certain items in regular expression patterns are processed more efficiently
|
||||
than others. It is more efficient to use a character class like [aeiou] than a
|
||||
set of single-character alternatives such as (a|e|i|o|u). In general, the
|
||||
simplest construction that provides the required behaviour is usually the most
|
||||
efficient. Jeffrey Friedl's book contains a lot of useful general discussion
|
||||
about optimizing regular expressions for efficient performance. This document
|
||||
contains a few observations about PCRE2.
|
||||
</P>
|
||||
<P>
|
||||
Using Unicode character properties (the \p, \P, and \X escapes) is slow,
|
||||
because PCRE2 has to use a multi-stage table lookup whenever it needs a
|
||||
character's property. If you can find an alternative pattern that does not use
|
||||
character properties, it will probably be faster.
|
||||
</P>
|
||||
<P>
|
||||
By default, the escape sequences \b, \d, \s, and \w, and the POSIX
|
||||
character classes such as [:alpha:] do not use Unicode properties, partly for
|
||||
backwards compatibility, and partly for performance reasons. However, you can
|
||||
set the PCRE2_UCP option or start the pattern with (*UCP) if you want Unicode
|
||||
character properties to be used. This can double the matching time for items
|
||||
such as \d, when matched with <b>pcre2_match()</b>; the performance loss is
|
||||
less with a DFA matching function, and in both cases there is not much
|
||||
difference for \b.
|
||||
</P>
|
||||
<P>
|
||||
When a pattern begins with .* not in atomic parentheses, nor in parentheses
|
||||
that are the subject of a backreference, and the PCRE2_DOTALL option is set,
|
||||
the pattern is implicitly anchored by PCRE2, since it can match only at the
|
||||
start of a subject string. If the pattern has multiple top-level branches, they
|
||||
must all be anchorable. The optimization can be disabled by the
|
||||
PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the pattern
|
||||
contains (*PRUNE) or (*SKIP).
|
||||
</P>
|
||||
<P>
|
||||
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, because the
|
||||
dot metacharacter does not then match a newline, and if the subject string
|
||||
contains newlines, the pattern may match from the character immediately
|
||||
following one of them instead of from the very start. For example, the pattern
|
||||
<pre>
|
||||
.*second
|
||||
</pre>
|
||||
matches the subject "first\nand second" (where \n stands for a newline
|
||||
character), with the match starting at the seventh character. In order to do
|
||||
this, PCRE2 has to retry the match starting after every newline in the subject.
|
||||
</P>
|
||||
<P>
|
||||
If you are using such a pattern with subject strings that do not contain
|
||||
newlines, the best performance is obtained by setting PCRE2_DOTALL, or starting
|
||||
the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE2
|
||||
from having to scan along the subject looking for a newline to restart at.
|
||||
</P>
|
||||
<P>
|
||||
Beware of patterns that contain nested indefinite repeats. These can take a
|
||||
long time to run when applied to a string that does not match. Consider the
|
||||
pattern fragment
|
||||
<pre>
|
||||
^(a+)*
|
||||
</pre>
|
||||
This can match "aaaa" in 16 different ways, and this number increases very
|
||||
rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
|
||||
times, and for each of those cases other than 0 or 4, the + repeats can match
|
||||
different numbers of times.) When the remainder of the pattern is such that the
|
||||
entire match is going to fail, PCRE2 has in principle to try every possible
|
||||
variation, and this can take an extremely long time, even for relatively short
|
||||
strings.
|
||||
</P>
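<P>
The figure of 16 can be reproduced by brute-force counting. The sketch below
assumes that a "way" is any split of some leading part of the subject into
non-empty runs of "a", one run per iteration of the group; the function
<b>count_ways()</b> is a hypothetical illustration and has nothing to do with
how PCRE2 itself matches.
</P>

```c
#include <assert.h>

/* Illustrative only: count the ways ^(a+)* can match at the start of a
   string of n "a" characters. Either the * repeat stops here (one way),
   or the next group iteration takes `first` characters and the rest of
   the string is handled the same way. */
unsigned long long count_ways(unsigned n)
{
    unsigned long long ways = 1;    /* stop: no further iterations */

    assert(n < 64);                 /* the result is 2^n; avoid overflow */
    for (unsigned first = 1; first <= n; first++)
        ways += count_ways(n - first);
    return ways;
}
```

<P>
The count doubles with every extra character, and the function itself takes
time proportional to its result, which is the same exponential blow-up that a
failing match has to work through.
</P>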
|
||||
<P>
|
||||
An optimization catches some of the simpler cases such as
|
||||
<pre>
|
||||
(a+)*b
|
||||
</pre>
|
||||
where a literal character follows. Before embarking on the standard matching
|
||||
procedure, PCRE2 checks that there is a "b" later in the subject string, and if
|
||||
there is not, it fails the match immediately. However, when there is no
|
||||
following literal this optimization cannot be used. You can see the difference
|
||||
by comparing the behaviour of
|
||||
<pre>
|
||||
(a+)*\d
|
||||
</pre>
|
||||
with the pattern above. The pattern above, which ends in a literal "b", fails almost
instantly when applied to a whole line of "a" characters, whereas the \d version takes an
|
||||
appreciable time with strings longer than about 20 characters.
|
||||
</P>
|
||||
<P>
|
||||
In many cases, the solution to this kind of performance issue is to use an
|
||||
atomic group or a possessive quantifier. This can often reduce memory
|
||||
requirements as well. As another example, consider this pattern:
|
||||
<pre>
|
||||
([^<]|<(?!inet))+
|
||||
</pre>
|
||||
It matches from wherever it starts until it encounters "<inet" or the end of
|
||||
the data, and is the kind of pattern that might be used when processing an XML
|
||||
file. Each iteration of the outer parentheses matches either one character that
|
||||
is not "<" or a "<" that is not followed by "inet". However, each time a
|
||||
parenthesis is processed, a backtracking position is passed, so this
|
||||
formulation uses a memory frame for each matched character. For a long string,
|
||||
a lot of memory is required. Consider now this rewritten pattern, which matches
|
||||
exactly the same strings:
|
||||
<pre>
|
||||
([^<]++|<(?!inet))+
|
||||
</pre>
|
||||
This runs much faster, because sequences of characters that do not contain "<"
|
||||
are "swallowed" in one item inside the parentheses, and a possessive quantifier
|
||||
is used to stop any backtracking into the runs of non-"<" characters. This
|
||||
version also uses a lot less memory because entry to a new set of parentheses
|
||||
happens only when a "<" character that is not followed by "inet" is encountered
|
||||
(and we assume this is relatively rare).
|
||||
</P>
|
||||
<P>
|
||||
This example shows that one way of optimizing performance when matching long
|
||||
subject strings is to write repeated parenthesized subpatterns to match more
|
||||
than one character whenever possible.
|
||||
</P>
|
||||
<br><b>
|
||||
SETTING RESOURCE LIMITS
|
||||
</b><br>
|
||||
<P>
|
||||
You can set limits on the amount of processing that takes place when matching,
|
||||
and on the amount of heap memory that is used. The default values of the limits
|
||||
are very large, and unlikely ever to operate. They can be changed when PCRE2 is
|
||||
built, and they can also be set when <b>pcre2_match()</b> or
|
||||
<b>pcre2_dfa_match()</b> is called. For details of these interfaces, see the
|
||||
<a href="pcre2build.html"><b>pcre2build</b></a>
|
||||
documentation and the section entitled
|
||||
<a href="pcre2api.html#matchcontext">"The match context"</a>
|
||||
in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
The <b>pcre2test</b> test program has a modifier called "find_limits" which, if
|
||||
applied to a subject line, causes it to find the smallest limits that allow a
|
||||
pattern to match. This is done by repeatedly matching with different limits.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 06 December 2022
|
||||
<br>
|
||||
Copyright © 1997-2022 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
379
3rd/pcre2/doc/html/pcre2posix.html
Normal file
@@ -0,0 +1,379 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2posix specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2posix man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
|
||||
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
|
||||
<li><a name="TOC3" href="#SEC3">USING THE POSIX FUNCTIONS</a>
|
||||
<li><a name="TOC4" href="#SEC4">COMPILING A PATTERN</a>
|
||||
<li><a name="TOC5" href="#SEC5">MATCHING NEWLINE CHARACTERS</a>
|
||||
<li><a name="TOC6" href="#SEC6">MATCHING A PATTERN</a>
|
||||
<li><a name="TOC7" href="#SEC7">ERROR MESSAGES</a>
|
||||
<li><a name="TOC8" href="#SEC8">MEMORY USAGE</a>
|
||||
<li><a name="TOC9" href="#SEC9">AUTHOR</a>
|
||||
<li><a name="TOC10" href="#SEC10">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
|
||||
<P>
|
||||
<b>#include <pcre2posix.h></b>
|
||||
</P>
|
||||
<P>
|
||||
<b>int pcre2_regcomp(regex_t *<i>preg</i>, const char *<i>pattern</i>,</b>
|
||||
<b> int <i>cflags</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int pcre2_regexec(const regex_t *<i>preg</i>, const char *<i>string</i>,</b>
|
||||
<b> size_t <i>nmatch</i>, regmatch_t <i>pmatch</i>[], int <i>eflags</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>size_t pcre2_regerror(int <i>errcode</i>, const regex_t *<i>preg</i>,</b>
|
||||
<b> char *<i>errbuf</i>, size_t <i>errbuf_size</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_regfree(regex_t *<i>preg</i>);</b>
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
|
||||
<P>
|
||||
This set of functions provides a POSIX-style API for the PCRE2 regular
|
||||
expression 8-bit library. There are no POSIX-style wrappers for PCRE2's 16-bit
|
||||
and 32-bit libraries. See the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation for a description of PCRE2's native API, which contains much
|
||||
additional functionality.
|
||||
</P>
|
||||
<P>
|
||||
<b>IMPORTANT NOTE</b>: The functions described here are NOT thread-safe, and
|
||||
should not be used in multi-threaded applications. They are also limited to
|
||||
processing subjects that are not bigger than 2GB. Use the native API instead.
|
||||
</P>
|
||||
<P>
|
||||
These functions are wrapper functions that ultimately call the PCRE2 native
|
||||
API. Their prototypes are defined in the <b>pcre2posix.h</b> header file, and
|
||||
they all have unique names starting with <b>pcre2_</b>. However, the
|
||||
<b>pcre2posix.h</b> header also contains macro definitions that convert the
|
||||
standard POSIX names such as <b>regcomp()</b> into <b>pcre2_regcomp()</b> etc. This
|
||||
means that a program can use the usual POSIX names without running the risk of
|
||||
accidentally linking with POSIX functions from a different library.
|
||||
</P>
|
||||
<P>
|
||||
On Unix-like systems the PCRE2 POSIX library is called <b>libpcre2-posix</b>, so
|
||||
can be accessed by adding <b>-lpcre2-posix</b> to the command for linking an
|
||||
application. Because the POSIX functions call the native ones, it is also
|
||||
necessary to add <b>-lpcre2-8</b>.
|
||||
</P>
|
||||
<P>
|
||||
On Windows systems, if you are linking to a DLL version of the library, it is
|
||||
recommended that <b>PCRE2POSIX_SHARED</b> is defined before including the
|
||||
<b>pcre2posix.h</b> header, as it will allow for a more efficient way to
|
||||
invoke the functions by adding the <b>__declspec(dllimport)</b> decorator.
|
||||
</P>
|
||||
<P>
|
||||
Although they were not defined as prototypes in <b>pcre2posix.h</b>, releases
|
||||
10.33 to 10.36 of the library contained functions with the POSIX names
|
||||
<b>regcomp()</b> etc. These simply passed their arguments to the PCRE2
|
||||
functions. These functions were provided for backwards compatibility with
|
||||
earlier versions of PCRE2, which had only POSIX names. However, this has proved
|
||||
troublesome in situations where a program links with several libraries, some of
|
||||
which use PCRE2's POSIX interface while others use the real POSIX functions.
|
||||
For this reason, the POSIX names have been removed since release 10.37.
|
||||
</P>
|
||||
<P>
|
||||
Calling the header file <b>pcre2posix.h</b> avoids any conflict with other POSIX
|
||||
libraries. It can, of course, be renamed or aliased as <b>regex.h</b>, which is
|
||||
the "correct" name, if there is no clash. It provides two structure types,
|
||||
<i>regex_t</i> for compiled internal forms, and <i>regmatch_t</i> for returning
|
||||
captured substrings. It also defines some constants whose names start with
|
||||
"REG_"; these are used for setting options and identifying error codes.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">USING THE POSIX FUNCTIONS</a><br>
|
||||
<P>
|
||||
Note that these functions are just POSIX-style wrappers for PCRE2's native API.
|
||||
They do not give POSIX regular expression behaviour, and they are not
|
||||
thread-safe or even POSIX compatible.
|
||||
</P>
|
||||
<P>
|
||||
Those POSIX option bits that can reasonably be mapped to PCRE2 native options
|
||||
have been implemented. In addition, the option REG_EXTENDED is defined with the
|
||||
value zero. This has no effect, but since programs that are written to the
|
||||
POSIX interface often use it, this makes it easier to slot in PCRE2 as a
|
||||
replacement library. Other POSIX options are not even defined.
|
||||
</P>
|
||||
<P>
|
||||
There are also some options that are not defined by POSIX. These have been
|
||||
added at the request of users who want to make use of certain PCRE2-specific
|
||||
features via the POSIX calling interface or to add BSD or GNU functionality.
|
||||
</P>
|
||||
<P>
|
||||
When PCRE2 is called via these functions, it is only the API that is POSIX-like
|
||||
in style. The syntax and semantics of the regular expressions themselves are
|
||||
still those of Perl, subject to the setting of various PCRE2 options, as
|
||||
described below. "POSIX-like in style" means that the API approximates to the
|
||||
POSIX definition; it is not fully POSIX-compatible, and in multi-unit encoding
|
||||
domains it is probably even less compatible.
|
||||
</P>
|
||||
<P>
|
||||
The descriptions below use the actual names of the functions, but, as described
|
||||
above, the standard POSIX names (without the <b>pcre2_</b> prefix) may also be
|
||||
used.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">COMPILING A PATTERN</a><br>
|
||||
<P>
|
||||
The function <b>pcre2_regcomp()</b> is called to compile a pattern into an
|
||||
internal form. By default, the pattern is a C string terminated by a binary
|
||||
zero (but see REG_PEND below). The <i>preg</i> argument is a pointer to a
|
||||
<b>regex_t</b> structure that is used as a base for storing information about
|
||||
the compiled regular expression. It is also used for input when REG_PEND is
|
||||
set. The <b>regex_t</b> structure used by <b>pcre2_regcomp()</b> is defined in
|
||||
<b>pcre2posix.h</b> and is not the same as the structure used by other libraries
|
||||
that provide POSIX-style matching.
|
||||
</P>
|
||||
<P>
|
||||
The argument <i>cflags</i> is either zero, or contains one or more of the bits
|
||||
defined by the following macros:
|
||||
<pre>
|
||||
REG_DOTALL
|
||||
</pre>
|
||||
The PCRE2_DOTALL option is set when the regular expression is passed for
|
||||
compilation to the native function. Note that REG_DOTALL is not part of the
|
||||
POSIX standard.
|
||||
<pre>
|
||||
REG_ICASE
|
||||
</pre>
|
||||
The PCRE2_CASELESS option is set when the regular expression is passed for
|
||||
compilation to the native function.
|
||||
<pre>
|
||||
REG_NEWLINE
|
||||
</pre>
|
||||
The PCRE2_MULTILINE option is set when the regular expression is passed for
|
||||
compilation to the native function. Note that this does <i>not</i> mimic the
|
||||
defined POSIX behaviour for REG_NEWLINE (see the following section).
|
||||
<pre>
|
||||
REG_NOSPEC
|
||||
</pre>
|
||||
The PCRE2_LITERAL option is set when the regular expression is passed for
|
||||
compilation to the native function. This disables all meta characters in the
|
||||
pattern, causing it to be treated as a literal string. The only other options
|
||||
that are allowed with REG_NOSPEC are REG_ICASE, REG_NOSUB, REG_PEND, and
|
||||
REG_UTF. Note that REG_NOSPEC is not part of the POSIX standard.
|
||||
<pre>
|
||||
REG_NOSUB
|
||||
</pre>
|
||||
When a pattern that is compiled with this flag is passed to
|
||||
<b>pcre2_regexec()</b> for matching, the <i>nmatch</i> and <i>pmatch</i> arguments
|
||||
are ignored, and no captured strings are returned. Versions of the PCRE2 library
|
||||
prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this
|
||||
no longer happens because it disables the use of backreferences.
|
||||
<pre>
|
||||
REG_PEND
|
||||
</pre>
|
||||
If this option is set, the <b>re_endp</b> field in the <i>preg</i> structure
|
||||
(which has the type const char *) must be set to point to the character beyond
|
||||
the end of the pattern before calling <b>pcre2_regcomp()</b>. The pattern itself
|
||||
may now contain binary zeros, which are treated as data characters. Without
|
||||
REG_PEND, a binary zero terminates the pattern and the <b>re_endp</b> field is
|
||||
ignored. This is a GNU extension to the POSIX standard and should be used with
|
||||
caution in software intended to be portable to other systems.
|
||||
<pre>
|
||||
REG_UCP
|
||||
</pre>
|
||||
The PCRE2_UCP option is set when the regular expression is passed for
|
||||
compilation to the native function. This causes PCRE2 to use Unicode properties
|
||||
when matching \d, \w, etc., instead of just recognizing ASCII values. Note
|
||||
that REG_UCP is not part of the POSIX standard.
|
||||
<pre>
|
||||
REG_UNGREEDY
|
||||
</pre>
|
||||
The PCRE2_UNGREEDY option is set when the regular expression is passed for
|
||||
compilation to the native function. Note that REG_UNGREEDY is not part of the
|
||||
POSIX standard.
|
||||
<pre>
|
||||
REG_UTF
|
||||
</pre>
|
||||
The PCRE2_UTF option is set when the regular expression is passed for
|
||||
compilation to the native function. This causes the pattern itself and all data
|
||||
strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF
|
||||
is not part of the POSIX standard.
|
||||
</P>
|
||||
<P>
|
||||
In the absence of these flags, no options are passed to the native function.
|
||||
This means that the regex is compiled with PCRE2 default semantics. In
|
||||
particular, the way it handles newline characters in the subject string is the
|
||||
Perl way, not the POSIX way. Note that setting PCRE2_MULTILINE has only
|
||||
<i>some</i> of the effects specified for REG_NEWLINE. It does not affect the way
|
||||
newlines are matched by the dot metacharacter (they are not) or by a negative
|
||||
class such as [^a] (they are).
|
||||
</P>
|
||||
<P>
|
||||
The yield of <b>pcre2_regcomp()</b> is zero on success, and non-zero otherwise.
|
||||
The <i>preg</i> structure is filled in on success, and one other member of the
|
||||
structure (as well as <i>re_endp</i>) is public: <i>re_nsub</i> contains the
|
||||
number of capturing subpatterns in the regular expression. Various error codes
|
||||
are defined in the header file.
|
||||
</P>
|
||||
<P>
|
||||
NOTE: If the yield of <b>pcre2_regcomp()</b> is non-zero, you must not attempt
|
||||
to use the contents of the <i>preg</i> structure. If, for example, you pass it
|
||||
to <b>pcre2_regexec()</b>, the result is undefined and your program is likely to
|
||||
crash.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">MATCHING NEWLINE CHARACTERS</a><br>
|
||||
<P>
|
||||
This area is not simple, because POSIX and Perl take different views of things.
|
||||
It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was
|
||||
never intended to be a POSIX engine. The following table lists the different
|
||||
possibilities for matching newline characters in Perl and PCRE2:
|
||||
<pre>
|
||||
Default Change with
|
||||
|
||||
. matches newline no PCRE2_DOTALL
|
||||
newline matches [^a] yes not changeable
|
||||
$ matches \n at end yes PCRE2_DOLLAR_ENDONLY
|
||||
$ matches \n in middle no PCRE2_MULTILINE
|
||||
^ matches \n in middle no PCRE2_MULTILINE
|
||||
</pre>
|
||||
This is the equivalent table for a POSIX-compatible pattern matcher:
|
||||
<pre>
|
||||
Default Change with
|
||||
|
||||
. matches newline yes REG_NEWLINE
|
||||
newline matches [^a] yes REG_NEWLINE
|
||||
$ matches \n at end no REG_NEWLINE
|
||||
$ matches \n in middle no REG_NEWLINE
|
||||
^ matches \n in middle no REG_NEWLINE
|
||||
</pre>
|
||||
This behaviour is not what happens when PCRE2 is called via its POSIX
|
||||
API. By default, PCRE2's behaviour is the same as Perl's, except that there is
|
||||
no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there
|
||||
is no way to stop newline from matching [^a].
|
||||
</P>
|
||||
<P>
|
||||
Default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
|
||||
PCRE2_DOLLAR_ENDONLY when calling <b>pcre2_compile()</b> directly, but there is
|
||||
no way to make PCRE2 behave exactly as for the REG_NEWLINE action. When using
|
||||
the POSIX API, passing REG_NEWLINE to PCRE2's <b>pcre2_regcomp()</b> function
|
||||
causes PCRE2_MULTILINE to be passed to <b>pcre2_compile()</b>, and REG_DOTALL
|
||||
passes PCRE2_DOTALL. There is no way to pass PCRE2_DOLLAR_ENDONLY.
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">MATCHING A PATTERN</a><br>
|
||||
<P>
|
||||
The function <b>pcre2_regexec()</b> is called to match a compiled pattern
|
||||
<i>preg</i> against a given <i>string</i>, which is by default terminated by a
|
||||
zero byte (but see REG_STARTEND below), subject to the options in <i>eflags</i>.
|
||||
These can be:
|
||||
<pre>
|
||||
REG_NOTBOL
|
||||
</pre>
|
||||
The PCRE2_NOTBOL option is set when calling the underlying PCRE2 matching
|
||||
function.
|
||||
<pre>
|
||||
REG_NOTEMPTY
|
||||
</pre>
|
||||
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 matching
|
||||
function. Note that REG_NOTEMPTY is not part of the POSIX standard. However,
|
||||
setting this option can give more POSIX-like behaviour in some situations.
|
||||
<pre>
|
||||
REG_NOTEOL
|
||||
</pre>
|
||||
The PCRE2_NOTEOL option is set when calling the underlying PCRE2 matching
|
||||
function.
|
||||
<pre>
|
||||
REG_STARTEND
|
||||
</pre>
|
||||
When this option is set, the subject string starts at <i>string</i> +
|
||||
<i>pmatch[0].rm_so</i> and ends at <i>string</i> + <i>pmatch[0].rm_eo</i>, which
|
||||
should point to the first character beyond the string. There may be binary
|
||||
zeros within the subject string, and indeed, using REG_STARTEND is the only
|
||||
way to pass a subject string that contains a binary zero.
|
||||
</P>
|
||||
<P>
|
||||
Whatever the value of <i>pmatch[0].rm_so</i>, the offsets of the matched string
|
||||
and any captured substrings are still given relative to the start of
|
||||
<i>string</i> itself. (Before PCRE2 release 10.30 these were given relative to
|
||||
<i>string</i> + <i>pmatch[0].rm_so</i>, but this differs from other
|
||||
implementations.)
|
||||
</P>
|
||||
<P>
|
||||
This is a BSD extension, compatible with but not specified by IEEE Standard
|
||||
1003.2 (POSIX.2), and should be used with caution in software intended to be
|
||||
portable to other systems. Note that a non-zero <i>rm_so</i> does not imply
|
||||
REG_NOTBOL; REG_STARTEND affects only the location and length of the string,
|
||||
not how it is matched. Setting REG_STARTEND and passing <i>pmatch</i> as NULL
|
||||
are mutually exclusive; if both are done, <b>pcre2_regexec()</b> returns the error REG_INVARG.
|
||||
</P>
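<P>
As an illustration of the REG_STARTEND rules above, here is a minimal sketch of
a helper that searches only part of a subject string. The helper name
<b>search_window()</b> is hypothetical. The sketch includes the system
<b>regex.h</b> and calls <b>regexec()</b> so that it builds without PCRE2, and it
assumes a C library that defines REG_STARTEND. With PCRE2's wrapper you would
include <b>pcre2posix.h</b> instead and call <b>pcre2_regexec()</b> with the same
arguments.
</P>

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: search subject[start..end) for a compiled pattern.
   Assumes REG_STARTEND is defined by the regex header in use.
   On success (return value 0) the offsets in *m are relative to the start
   of subject itself, not to subject + start. */
int search_window(const regex_t *re, const char *subject,
                  size_t start, size_t end, regmatch_t *m)
{
    /* With REG_STARTEND, pmatch[0] is an input as well as an output. */
    m->rm_so = (regoff_t)start;
    m->rm_eo = (regoff_t)end;   /* first character beyond the window */
    return regexec(re, subject, 1, m, REG_STARTEND);
}
```

<P>
Because the window is given by offsets rather than by a terminating zero, a
helper of this shape is also the natural place to pass a subject that contains
binary zeros.
</P>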
|
||||
<P>
|
||||
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
|
||||
strings is returned. The <i>nmatch</i> and <i>pmatch</i> arguments of
|
||||
<b>pcre2_regexec()</b> are ignored (except possibly as input for REG_STARTEND).
|
||||
</P>
|
||||
<P>
|
||||
The value of <i>nmatch</i> may be zero, and the value of <i>pmatch</i> may be NULL
|
||||
(unless REG_STARTEND is set); in both these cases no data about any matched
|
||||
strings is returned.
|
||||
</P>
|
||||
<P>
|
||||
Otherwise, the portion of the string that was matched, and also any captured
|
||||
substrings, are returned via the <i>pmatch</i> argument, which points to an
|
||||
array of <i>nmatch</i> structures of type <i>regmatch_t</i>, containing the
|
||||
members <i>rm_so</i> and <i>rm_eo</i>. These contain the byte offset to the first
|
||||
character of each substring and the offset to the first character after the end
|
||||
of each substring, respectively. The 0th element of the vector relates to the
|
||||
entire portion of <i>string</i> that was matched; subsequent elements relate to
|
||||
the capturing subpatterns of the regular expression. Unused entries in the
|
||||
array have both structure members set to -1.
|
||||
</P>
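<P>
The offsets described above can be turned into strings with a small helper. The
following is a sketch only: <b>copy_match()</b> is a hypothetical name, and the
code includes the system <b>regex.h</b> so that it builds without PCRE2. It
relies only on the <i>rm_so</i> and <i>rm_eo</i> members being signed offsets
that are set to -1 for unused entries.
</P>

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copy the substring described by one regmatch_t
   entry into out as a zero-terminated string.
   Returns the substring length, or -1 if the entry is unused
   (offsets are -1) or out is too small to hold it. */
int copy_match(const char *subject, const regmatch_t *m,
               char *out, size_t out_size)
{
    if (m->rm_so < 0 || m->rm_eo < m->rm_so)
        return -1;                          /* group did not take part */

    size_t len = (size_t)(m->rm_eo - m->rm_so);
    if (len + 1 > out_size)
        return -1;                          /* no room for text plus zero */

    memcpy(out, subject + m->rm_so, len);
    out[len] = '\0';
    return (int)len;
}
```

<P>
Element 0 of the vector gives the whole match; a capture group that did not
participate in the match is reported by the helper as -1 rather than as an
empty string.
</P>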
|
||||
<P>
|
||||
<i>regmatch_t</i> as well as the <i>regoff_t</i> typedef it uses are defined in
|
||||
<b>pcre2posix.h</b> and are not warranted to have the same size or layout as other
|
||||
similarly named types from other libraries that provide POSIX-style matching.
|
||||
</P>
|
||||
<P>
|
||||
A successful match yields a zero return; various error codes are defined in the
|
||||
header file, of which REG_NOMATCH is the "expected" failure code.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">ERROR MESSAGES</a><br>
|
||||
<P>
|
||||
The <b>pcre2_regerror()</b> function maps a non-zero errorcode from either
|
||||
<b>pcre2_regcomp()</b> or <b>pcre2_regexec()</b> to a printable message. If
|
||||
<i>preg</i> is not NULL, the error should have arisen from the use of that
|
||||
structure. A message terminated by a binary zero is placed in <i>errbuf</i>. If
|
||||
the buffer is too short, only the first <i>errbuf_size</i> - 1 characters of the
|
||||
error message are used. The yield of the function is the size of buffer needed
|
||||
to hold the whole message, including the terminating zero. This value is
|
||||
greater than <i>errbuf_size</i> if the message was truncated.
|
||||
</P>
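<P>
Since the return value is the buffer size needed for the whole message, one
possible pattern is to ask for the size first and then allocate. The sketch
below uses the system <b>regex.h</b> and <b>regerror()</b> so that it builds
without PCRE2; <b>full_error_message()</b> is a hypothetical name. It assumes
that a call with a zero-length buffer is accepted, as it is for the standard
POSIX function; check this for the PCRE2 version in use before relying on it
with <b>pcre2_regerror()</b>.
</P>

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc'd copy of the complete, untruncated
   message for an error code, or NULL if memory cannot be obtained.
   The caller frees the result. */
char *full_error_message(int code, const regex_t *preg)
{
    /* First call: zero-length buffer, so only the required size
       (including the terminating zero) is reported. */
    size_t needed = regerror(code, preg, NULL, 0);
    if (needed == 0)
        return NULL;

    char *msg = malloc(needed);
    if (msg != NULL)
        regerror(code, preg, msg, needed);  /* second call fills the buffer */
    return msg;
}
```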
|
||||
<br><a name="SEC8" href="#TOC1">MEMORY USAGE</a><br>
|
||||
<P>
|
||||
Compiling a regular expression causes memory to be allocated and associated
|
||||
with the <i>preg</i> structure. The function <b>pcre2_regfree()</b> frees all
|
||||
such memory, after which <i>preg</i> may no longer be used as a compiled
|
||||
expression.
|
||||
</P>
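<P>
The compile, match, and free sequence can be sketched as a single helper. This
is an illustration under stated assumptions, not part of PCRE2:
<b>text_matches()</b> is a hypothetical name, and the code uses the system
<b>regex.h</b> names so that it builds without PCRE2. With the PCRE2 wrapper the
corresponding calls are <b>pcre2_regcomp()</b>, <b>pcre2_regexec()</b>, and
<b>pcre2_regfree()</b>.
</P>

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if text matches pattern, 0 if it does
   not, and -1 if the pattern fails to compile or matching fails for any
   reason other than "no match". */
int text_matches(const char *pattern, const char *text)
{
    regex_t re;

    /* REG_NOSUB: only a yes/no answer is wanted, so no offsets are needed. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;      /* nothing to free after a failed compile here */

    int rc = regexec(&re, text, 0, NULL, 0);

    regfree(&re);       /* re must not be used as a compiled pattern now */

    if (rc == 0)
        return 1;
    return (rc == REG_NOMATCH) ? 0 : -1;
}
```

<P>
Compiling on every call is wasteful if the same pattern is used repeatedly; in
that case keep the compiled structure and free it once at the end.
</P>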
|
||||
<br><a name="SEC9" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 27 November 2024
|
||||
<br>
|
||||
Copyright © 1997-2024 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
110
3rd/pcre2/doc/html/pcre2sample.html
Normal file
@@ -0,0 +1,110 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2sample specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2sample man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<br><b>
|
||||
PCRE2 SAMPLE PROGRAM
|
||||
</b><br>
|
||||
<P>
|
||||
A simple, complete demonstration program to get you started with using PCRE2 is
|
||||
supplied in the file <i>pcre2demo.c</i> in the <b>src</b> directory in the PCRE2
|
||||
distribution. A listing of this program is given in the
|
||||
<a href="pcre2demo.html"><b>pcre2demo</b></a>
|
||||
documentation. If you do not have a copy of the PCRE2 distribution, you can
|
||||
save this listing to re-create the contents of <i>pcre2demo.c</i>.
|
||||
</P>
|
||||
<P>
|
||||
The demonstration program compiles the regular expression that is its
|
||||
first argument, and matches it against the subject string in its second
|
||||
argument. No PCRE2 options are set, and default character tables are used. If
|
||||
matching succeeds, the program outputs the portion of the subject that matched,
|
||||
together with the contents of any captured substrings.
|
||||
</P>
|
||||
<P>
|
||||
If the -g option is given on the command line, the program then goes on to
|
||||
check for further matches of the same regular expression in the same subject
|
||||
string. The logic is a little bit tricky because of the possibility of matching
|
||||
an empty string. Comments in the code explain what is going on.
|
||||
</P>
|
||||
<P>
|
||||
The code in <b>pcre2demo.c</b> is an 8-bit program that uses the PCRE2 8-bit
|
||||
library. It handles strings and characters that are stored in 8-bit code units.
|
||||
By default, one character corresponds to one code unit, but if the pattern
|
||||
starts with "(*UTF)", both it and the subject are treated as UTF-8 strings,
|
||||
where characters may occupy multiple code units.
|
||||
</P>
|
||||
<P>
|
||||
If PCRE2 is installed in the standard include and library directories for your
|
||||
operating system, you should be able to compile the demonstration program using
|
||||
a command like this:
|
||||
<pre>
|
||||
cc -o pcre2demo pcre2demo.c -lpcre2-8
|
||||
</pre>
|
||||
If PCRE2 is installed elsewhere, you may need to add additional options to the
|
||||
command line. For example, on a Unix-like system that has PCRE2 installed in
|
||||
<i>/usr/local</i>, you can compile the demonstration program using a command
|
||||
like this:
|
||||
<pre>
|
||||
cc -o pcre2demo -I/usr/local/include pcre2demo.c -L/usr/local/lib -lpcre2-8
|
||||
</pre>
|
||||
Once you have built the demonstration program, you can run simple tests like
|
||||
this:
|
||||
<pre>
|
||||
./pcre2demo 'cat|dog' 'the cat sat on the mat'
|
||||
./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
|
||||
</pre>
|
||||
Note that there is a much more comprehensive test program, called
|
||||
<a href="pcre2test.html"><b>pcre2test</b>,</a>
|
||||
which supports many more facilities for testing regular expressions using all
|
||||
three PCRE2 libraries (8-bit, 16-bit, and 32-bit, though not all three need be
|
||||
installed). The
|
||||
<a href="pcre2demo.html"><b>pcre2demo</b></a>
|
||||
program is provided as a relatively simple coding example.
|
||||
</P>
|
||||
<P>
|
||||
If you try to run
|
||||
<a href="pcre2demo.html"><b>pcre2demo</b></a>
|
||||
when PCRE2 is not installed in the standard library directory, you may get an
|
||||
error like this on some operating systems (e.g. Solaris):
|
||||
<pre>
|
||||
ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file or directory
|
||||
</pre>
|
||||
This is caused by the way shared library support works on those systems. You
|
||||
need to add
|
||||
<pre>
|
||||
-R/usr/local/lib
|
||||
</pre>
|
||||
(for example) to the compile command to get round this problem.
|
||||
</P>
|
||||
<br><b>
|
||||
AUTHOR
|
||||
</b><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><b>
|
||||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 14 November 2023
|
||||
<br>
|
||||
Copyright © 1997-2016 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
212
3rd/pcre2/doc/html/pcre2serialize.html
Normal file
@@ -0,0 +1,212 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2serialize specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2serialize man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS</a>
|
||||
<li><a name="TOC2" href="#SEC2">SECURITY CONCERNS</a>
|
||||
<li><a name="TOC3" href="#SEC3">SAVING COMPILED PATTERNS</a>
|
||||
<li><a name="TOC4" href="#SEC4">RE-USING PRECOMPILED PATTERNS</a>
|
||||
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
|
||||
<li><a name="TOC6" href="#SEC6">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS</a><br>
|
||||
<P>
|
||||
<b>int32_t pcre2_serialize_decode(pcre2_code **<i>codes</i>,</b>
|
||||
<b> int32_t <i>number_of_codes</i>, const uint8_t *<i>bytes</i>,</b>
|
||||
<b> pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int32_t pcre2_serialize_encode(const pcre2_code **<i>codes</i>,</b>
|
||||
<b> int32_t <i>number_of_codes</i>, uint8_t **<i>serialized_bytes</i>,</b>
|
||||
<b> PCRE2_SIZE *<i>serialized_size</i>, pcre2_general_context *<i>gcontext</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>void pcre2_serialize_free(uint8_t *<i>bytes</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
<b>int32_t pcre2_serialize_get_number_of_codes(const uint8_t *<i>bytes</i>);</b>
|
||||
<br>
|
||||
<br>
|
||||
If you are running an application that uses a large number of regular
|
||||
expression patterns, it may be useful to store them in a precompiled form
|
||||
instead of having to compile them every time the application is run. However,
|
||||
if you are using the just-in-time optimization feature, it is not possible to
|
||||
save and reload the JIT data, because it is position-dependent. The host on
|
||||
which the patterns are reloaded must be running the same version of PCRE2, with
|
||||
the same code unit width, and must also have the same endianness, pointer width
|
||||
and PCRE2_SIZE type. For example, patterns compiled on a 32-bit system using
|
||||
PCRE2's 16-bit library cannot be reloaded on a 64-bit system, nor can they be
|
||||
reloaded using the 8-bit library.
|
||||
</P>
|
||||
<P>
|
||||
Note that "serialization" in PCRE2 does not convert compiled patterns to an
|
||||
abstract format like Java or .NET serialization. The serialized output is
|
||||
really just a bytecode dump, which is why it can only be reloaded in the same
|
||||
environment as the one that created it. Hence the restrictions mentioned above.
|
||||
Applications that are not statically linked with a fixed version of PCRE2 must
|
||||
be prepared to recompile patterns from their sources, in order to be immune to
|
||||
PCRE2 upgrades.
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">SECURITY CONCERNS</a><br>
|
||||
<P>
|
||||
The facility for saving and restoring compiled patterns is intended for use
|
||||
within individual applications. As such, the data supplied to
|
||||
<b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from
|
||||
arbitrary external sources. There is only some simple consistency checking, not
|
||||
complete validation of what is being re-loaded. Corrupted data may cause
|
||||
undefined results. For example, if the length field of a pattern in the
|
||||
serialized data is corrupted, the deserializing code may read beyond the end of
|
||||
the byte stream that is passed to it.
|
||||
</P>
|
||||
<br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
|
||||
<P>
|
||||
Before compiled patterns can be saved they must be serialized, which in PCRE2
|
||||
means converting the pattern to a stream of bytes. A single byte stream may
|
||||
contain any number of compiled patterns, but they must all use the same
|
||||
character tables. A single copy of the tables is included in the byte stream
|
||||
(its size is 1088 bytes). For more details of character tables, see the
|
||||
<a href="pcre2api.html#localesupport">section on locale support</a>
|
||||
in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
The function <b>pcre2_serialize_encode()</b> creates a serialized byte stream
|
||||
from a list of compiled patterns. Its first two arguments specify the list,
|
||||
being a pointer to a vector of pointers to compiled patterns, and the length of
|
||||
the vector. The third and fourth arguments point to variables which are set to
|
||||
point to the created byte stream and its length, respectively. The final
|
||||
argument is a pointer to a general context, which can be used to specify custom
|
||||
memory management functions. If this argument is NULL, <b>malloc()</b> is used
|
||||
to obtain memory for the byte stream. The yield of the function is the number
|
||||
of serialized patterns, or one of the following negative error codes:
|
||||
<pre>
|
||||
PCRE2_ERROR_BADDATA the number of patterns is zero or less
|
||||
PCRE2_ERROR_BADMAGIC mismatch of id bytes in one of the patterns
|
||||
PCRE2_ERROR_NOMEMORY memory allocation failed
|
||||
PCRE2_ERROR_MIXEDTABLES the patterns do not all use the same tables
|
||||
PCRE2_ERROR_NULL the 1st, 3rd, or 4th argument is NULL
|
||||
</pre>
|
||||
PCRE2_ERROR_BADMAGIC means either that a pattern's code has been corrupted, or
|
||||
that a slot in the vector does not point to a compiled pattern.
|
||||
</P>
|
||||
<P>
|
||||
Once a set of patterns has been serialized you can save the data in any
|
||||
appropriate manner. Here is sample code that compiles two patterns and writes
|
||||
them to a file. It assumes that the variable <i>fd</i> refers to a file that is
|
||||
open for output. The error checking that should be present in a real
|
||||
application has been omitted for simplicity.
|
||||
<pre>
|
||||
int errorcode;
|
||||
uint8_t *bytes;
|
||||
PCRE2_SIZE erroroffset;
|
||||
PCRE2_SIZE bytescount;
|
||||
pcre2_code *list_of_codes[2];
|
||||
list_of_codes[0] = pcre2_compile("first pattern",
|
||||
PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
|
||||
list_of_codes[1] = pcre2_compile("second pattern",
|
||||
PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
|
||||
errorcode = pcre2_serialize_encode(list_of_codes, 2, &bytes,
|
||||
&bytescount, NULL);
|
||||
errorcode = fwrite(bytes, 1, bytescount, fd);
|
||||
</pre>
|
||||
Note that the serialized data is binary data that may contain any of the 256
|
||||
possible byte values. On systems that make a distinction between binary and
|
||||
non-binary data, be sure that the file is opened for binary output.
|
||||
</P>
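<P>
A minimal sketch of writing such a byte stream with the checks that the sample
above leaves out. The name <b>save_bytes()</b> is hypothetical, and the code
uses only the standard C library; the <i>bytes</i> and <i>size</i> arguments
stand for the values produced by <b>pcre2_serialize_encode()</b>.
</P>

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write size bytes to the named file.
   Returns 0 on success and -1 on any failure. */
int save_bytes(const char *path, const uint8_t *bytes, size_t size)
{
    /* "wb": binary mode, so no byte value is translated on output. */
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;

    size_t written = fwrite(bytes, 1, size, f);

    /* fclose() can report a delayed write error, so check it as well. */
    int close_rc = fclose(f);

    return (written == size && close_rc == 0) ? 0 : -1;
}
```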
|
||||
<P>
|
||||
Serializing a set of patterns leaves the original data untouched, so they can
|
||||
still be used for matching. Their memory must eventually be freed in the usual
|
||||
way by calling <b>pcre2_code_free()</b>. When you have finished with the byte
|
||||
stream, it too must be freed by calling <b>pcre2_serialize_free()</b>. If this
|
||||
function is called with a NULL argument, it returns immediately without doing
|
||||
anything.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">RE-USING PRECOMPILED PATTERNS</a><br>
|
||||
<P>
|
||||
In order to re-use a set of saved patterns you must first make the serialized
|
||||
byte stream available in main memory (for example, by reading from a file). The
|
||||
management of this memory block is up to the application. You can use the
|
||||
<b>pcre2_serialize_get_number_of_codes()</b> function to find out how many
|
||||
compiled patterns are in the serialized data without actually decoding the
|
||||
patterns:
|
||||
<pre>
|
||||
uint8_t *bytes = <serialized data>;
|
||||
int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);
|
||||
</pre>
|
||||
The <b>pcre2_serialize_decode()</b> function reads a byte stream and recreates
|
||||
the compiled patterns in new memory blocks, setting pointers to them in a
|
||||
vector. The first two arguments are a pointer to a suitable vector and its
|
||||
length, and the third argument points to a byte stream. The final argument is a
|
||||
pointer to a general context, which can be used to specify custom memory
|
||||
management functions for the decoded patterns. If this argument is NULL,
|
||||
<b>malloc()</b> and <b>free()</b> are used. After deserialization, the byte
|
||||
stream is no longer needed and can be discarded.
|
||||
<pre>
|
||||
pcre2_code *list_of_codes[2];
|
||||
uint8_t *bytes = <serialized data>;
|
||||
int32_t number_of_codes =
|
||||
pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);
|
||||
</pre>
|
||||
If the vector is not large enough for all the patterns in the byte stream, it
|
||||
is filled with those that fit, and the remainder are ignored. The yield of the
|
||||
function is the number of decoded patterns, or one of the following negative
|
||||
error codes:
|
||||
<pre>
|
||||
PCRE2_ERROR_BADDATA second argument is zero or less
|
||||
PCRE2_ERROR_BADMAGIC mismatch of id bytes in the data
|
||||
PCRE2_ERROR_BADMODE mismatch of code unit size or PCRE2 version
|
||||
PCRE2_ERROR_BADSERIALIZEDDATA other sanity check failure
|
||||
PCRE2_ERROR_NOMEMORY memory allocation failed
|
||||
PCRE2_ERROR_NULL first or third argument is NULL
|
||||
</pre>
|
||||
PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was compiled
|
||||
on a system with different endianness.
|
||||
</P>
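<P>
Making the byte stream available in main memory can be as simple as reading a
file in binary mode. The following sketch uses only the standard C library;
<b>load_bytes()</b> is a hypothetical name. The returned buffer is what would be
passed as the <i>bytes</i> argument of
<b>pcre2_serialize_get_number_of_codes()</b> or <b>pcre2_serialize_decode()</b>.
Bear in mind that the file must come from a trusted source, because the decoder
does only simple consistency checking.
</P>

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical helper: read a whole file into a malloc'd buffer.
   Returns NULL on failure. On success the byte count is stored in
   *size_out and the caller frees the buffer. */
uint8_t *load_bytes(const char *path, size_t *size_out)
{
    FILE *f = fopen(path, "rb");            /* binary mode for input too */
    if (f == NULL)
        return NULL;

    uint8_t *buf = NULL;
    size_t used = 0, cap = 0;

    for (;;) {
        if (used == cap) {                  /* buffer full: grow it */
            size_t new_cap = cap ? cap * 2 : 4096;
            uint8_t *tmp = realloc(buf, new_cap);
            if (tmp == NULL) {
                free(buf);
                fclose(f);
                return NULL;
            }
            buf = tmp;
            cap = new_cap;
        }
        size_t n = fread(buf + used, 1, cap - used, f);
        used += n;
        if (n == 0)                         /* end of file or read error */
            break;
    }

    int failed = ferror(f);
    fclose(f);
    if (failed) {
        free(buf);
        return NULL;
    }
    *size_out = used;
    return buf;
}
```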
|
||||
<P>
|
||||
Decoded patterns can be used for matching in the usual way, and must be freed
|
||||
by calling <b>pcre2_code_free()</b>. However, be aware that there is a potential
|
||||
race issue if you are using multiple patterns that were decoded from a single
|
||||
byte stream in a multithreaded application. A single copy of the character
|
||||
tables is used by all the decoded patterns and a reference count is used to
|
||||
arrange for its memory to be automatically freed when the last pattern is
|
||||
freed, but there is no locking on this reference count. Therefore, if you want
|
||||
to call <b>pcre2_code_free()</b> for these patterns in different threads, you
|
||||
must arrange your own locking, and ensure that <b>pcre2_code_free()</b> cannot
|
||||
be called by two threads at the same time.
|
||||
</P>
|
||||
<P>
|
||||
If a pattern was processed by <b>pcre2_jit_compile()</b> before being
|
||||
serialized, the JIT data is discarded and so is no longer available after a
|
||||
save/restore cycle. You can, however, process a restored pattern with
|
||||
<b>pcre2_jit_compile()</b> if you wish.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
Retired from University Computing Service
|
||||
<br>
|
||||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 19 January 2024
|
||||
<br>
|
||||
Copyright © 1997-2018 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
754
3rd/pcre2/doc/html/pcre2syntax.html
Normal file
@@ -0,0 +1,754 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>pcre2syntax specification</title>
|
||||
</head>
|
||||
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
|
||||
<h1>pcre2syntax man page</h1>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
</p>
|
||||
<p>
|
||||
This page is part of the PCRE2 HTML documentation. It was generated
|
||||
automatically from the original man page. If there is any nonsense in it,
|
||||
please consult the man page, in case the conversion went wrong.
|
||||
<br>
|
||||
<ul>
|
||||
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
|
||||
<li><a name="TOC2" href="#SEC2">QUOTING</a>
|
||||
<li><a name="TOC3" href="#SEC3">BRACED ITEMS</a>
|
||||
<li><a name="TOC4" href="#SEC4">ESCAPED CHARACTERS</a>
|
||||
<li><a name="TOC5" href="#SEC5">CHARACTER TYPES</a>
|
||||
<li><a name="TOC6" href="#SEC6">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||
<li><a name="TOC7" href="#SEC7">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
|
||||
<li><a name="TOC8" href="#SEC8">BINARY PROPERTIES FOR \p AND \P</a>
|
||||
<li><a name="TOC9" href="#SEC9">SCRIPT MATCHING WITH \p AND \P</a>
|
||||
<li><a name="TOC10" href="#SEC10">THE BIDI_CLASS PROPERTY FOR \p AND \P</a>
|
||||
<li><a name="TOC11" href="#SEC11">CHARACTER CLASSES</a>
|
||||
<li><a name="TOC12" href="#SEC12">PERL EXTENDED CHARACTER CLASSES</a>
|
||||
<li><a name="TOC13" href="#SEC13">QUANTIFIERS</a>
|
||||
<li><a name="TOC14" href="#SEC14">ANCHORS AND SIMPLE ASSERTIONS</a>
|
||||
<li><a name="TOC15" href="#SEC15">REPORTED MATCH POINT SETTING</a>
|
||||
<li><a name="TOC16" href="#SEC16">ALTERNATION</a>
|
||||
<li><a name="TOC17" href="#SEC17">CAPTURING</a>
|
||||
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPS</a>
|
||||
<li><a name="TOC19" href="#SEC19">COMMENT</a>
|
||||
<li><a name="TOC20" href="#SEC20">OPTION SETTING</a>
|
||||
<li><a name="TOC21" href="#SEC21">NEWLINE CONVENTION</a>
|
||||
<li><a name="TOC22" href="#SEC22">WHAT \R MATCHES</a>
|
||||
<li><a name="TOC23" href="#SEC23">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
|
||||
<li><a name="TOC24" href="#SEC24">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
|
||||
<li><a name="TOC25" href="#SEC25">SUBSTRING SCAN ASSERTION</a>
|
||||
<li><a name="TOC26" href="#SEC26">SCRIPT RUNS</a>
|
||||
<li><a name="TOC27" href="#SEC27">BACKREFERENCES</a>
|
||||
<li><a name="TOC28" href="#SEC28">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
||||
<li><a name="TOC29" href="#SEC29">CONDITIONAL PATTERNS</a>
|
||||
<li><a name="TOC30" href="#SEC30">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC31" href="#SEC31">CALLOUTS</a>
|
||||
<li><a name="TOC32" href="#SEC32">REPLACEMENT STRINGS</a>
|
||||
<li><a name="TOC33" href="#SEC33">SEE ALSO</a>
|
||||
<li><a name="TOC34" href="#SEC34">AUTHOR</a>
|
||||
<li><a name="TOC35" href="#SEC35">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
|
||||
<P>
|
||||
The full syntax and semantics of the regular expression patterns that are
|
||||
supported by PCRE2 are described in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation. This document contains a quick-reference summary of the pattern
|
||||
syntax followed by the syntax of replacement strings in the substitution function.
|
||||
The full description of the latter is in the
|
||||
<a href="pcre2api.html"><b>pcre2api</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
\x where x is non-alphanumeric is a literal x
|
||||
\Q...\E treat enclosed characters as literal
|
||||
</pre>
|
||||
Note that white space inside \Q...\E is always treated as literal, even if
|
||||
PCRE2_EXTENDED is set, causing most other white space to be ignored. Note also
|
||||
that PCRE2's handling of \Q...\E has some differences from Perl's. See the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation for details.
|
||||
</P>
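<P>
When arbitrary text has to be inserted into a pattern as a literal, it can be
wrapped in \Q...\E. The sketch below is one common way of doing this and is an
assumption of this example, not something PCRE2 provides: the helper name
<b>quote_literal()</b> is hypothetical, and any \E inside the text is handled by
closing the quotation, emitting an escaped backslash and an E, and reopening the
quotation.
</P>

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'd pattern fragment that matches
   text literally, or NULL if memory cannot be obtained.
   The caller frees the result. */
char *quote_literal(const char *text)
{
    /* Replacement for an embedded \E: the seven characters \E\\E\Q */
    static const char reopen[] = "\\E\\\\E\\Q";
    const size_t reopen_len = sizeof reopen - 1;

    /* Each embedded \E (2 characters) grows to reopen_len characters. */
    size_t extra = 0;
    for (const char *p = strstr(text, "\\E"); p != NULL;
         p = strstr(p + 2, "\\E"))
        extra += reopen_len - 2;

    /* 2 for \Q, 2 for the final \E, 1 for the terminating zero. */
    char *out = malloc(strlen(text) + extra + 5);
    if (out == NULL)
        return NULL;

    char *w = out;
    memcpy(w, "\\Q", 2);
    w += 2;

    const char *p = text;
    const char *hit;
    while ((hit = strstr(p, "\\E")) != NULL) {
        size_t chunk = (size_t)(hit - p);
        memcpy(w, p, chunk);                /* text before the \E */
        w += chunk;
        memcpy(w, reopen, reopen_len);      /* close, literal \E, reopen */
        w += reopen_len;
        p = hit + 2;
    }

    size_t rest = strlen(p);
    memcpy(w, p, rest);
    w += rest;
    memcpy(w, "\\E", 2);
    w += 2;
    *w = '\0';
    return out;
}
```

<P>
The result is meant to be concatenated into a larger pattern string before
compilation.
</P>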
|
||||
<br><a name="SEC3" href="#TOC1">BRACED ITEMS</a><br>
|
||||
<P>
|
||||
With one exception, wherever brace characters { and } are required to enclose
|
||||
data for constructions such as \g{2} or \k{name}, space and/or horizontal tab
|
||||
characters that follow { or precede } are allowed and are ignored. In the case
|
||||
of quantifiers, they may also appear before or after the comma. The exception
|
||||
is \u{...} which is not Perl-compatible and is recognized only when
|
||||
PCRE2_EXTRA_ALT_BSUX is set. This is an ECMAScript compatibility feature, and
|
||||
follows ECMAScript's behaviour.
|
||||
</P>
|
||||
<br><a name="SEC4" href="#TOC1">ESCAPED CHARACTERS</a><br>
|
||||
<P>
|
||||
This table applies to ASCII and Unicode environments. An unrecognized escape
|
||||
sequence causes an error.
|
||||
<pre>
|
||||
\a alarm, that is, the BEL character (hex 07)
|
||||
\cx "control-x", where x is a non-control ASCII character
|
||||
\e escape (hex 1B)
|
||||
\f form feed (hex 0C)
|
||||
\n newline (hex 0A)
|
||||
\r carriage return (hex 0D)
|
||||
\t tab (hex 09)
|
||||
\0dd character with octal code 0dd
|
||||
\ddd character with octal code ddd, or backreference
|
||||
\o{ddd..} character with octal code ddd..
|
||||
\N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
|
||||
\xhh character with hex code hh
|
||||
\x{hh..} character with hex code hh..
|
||||
</pre>
|
||||
\N{U+hh..} is synonymous with \x{hh..} but is not supported in environments
|
||||
that use EBCDIC code (mainly IBM mainframes). Note that \N not followed by an
|
||||
opening curly bracket has a different meaning (see below).
|
||||
</P>
|
||||
<P>
|
||||
If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
|
||||
following are also recognized:
|
||||
<pre>
|
||||
\U the character "U"
|
||||
\uhhhh character with hex code hhhh
|
||||
\u{hh..} character with hex code hh.. but only for EXTRA_ALT_BSUX
|
||||
</pre>
|
||||
When \x is not followed by {, one or two hexadecimal digits are read,
|
||||
but in ALT_BSUX mode \x must be followed by two hexadecimal digits to be
|
||||
recognized as a hexadecimal escape; otherwise it matches a literal "x".
|
||||
Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits
|
||||
or (in EXTRA_ALT_BSUX mode) a sequence of hex digits in curly brackets, it
|
||||
matches a literal "u".
|
||||
</P>
|
||||
<P>
|
||||
Note that \0dd is always an octal code. The treatment of backslash followed by
|
||||
a non-zero digit is complicated; for details see the section
|
||||
<a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
|
||||
in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation, where details of escape processing in EBCDIC environments are
|
||||
also given.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">CHARACTER TYPES</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
. any character except newline;
|
||||
in dotall mode, any character whatsoever
|
||||
\C one code unit, even in UTF mode (best avoided)
|
||||
\d a decimal digit
|
||||
\D a character that is not a decimal digit
|
||||
\h a horizontal white space character
|
||||
\H a character that is not a horizontal white space character
|
||||
\N a character that is not a newline
|
||||
\p{<i>xx</i>} a character with the <i>xx</i> property
|
||||
\P{<i>xx</i>} a character without the <i>xx</i> property
|
||||
\R a newline sequence
|
||||
\s a white space character
|
||||
\S a character that is not a white space character
|
||||
\v a vertical white space character
|
||||
\V a character that is not a vertical white space character
|
||||
\w a "word" character
|
||||
\W a "non-word" character
|
||||
\X a Unicode extended grapheme cluster
|
||||
</pre>
|
||||
\C is dangerous because it may leave the current matching point in the middle
|
||||
of a UTF-8 or UTF-16 character. The application can lock out the use of \C by
|
||||
setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
|
||||
with the use of \C permanently disabled.
|
||||
</P>
|
||||
<P>
|
||||
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
|
||||
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
|
||||
happening, \s and \w may also match characters with code points in the range
|
||||
128-255. If the PCRE2_UCP option is set, the behaviour of these escape
|
||||
sequences is changed to use Unicode properties and they match many more
|
||||
characters, but there are some option settings that can restrict individual
|
||||
sequences to matching only ASCII characters.
|
||||
</P>
|
||||
<P>
|
||||
Property descriptions in \p and \P are matched caselessly; hyphens,
|
||||
underscores, and ASCII white space characters are ignored, in accordance with
|
||||
Unicode's "loose matching" rules. For example, \p{Bidi_Class=al} is the same
|
||||
as \p{ bidi class = AL }.
|
||||
</P>
|
||||
<br><a name="SEC6" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
C Other
|
||||
Cc Control
|
||||
Cf Format
|
||||
Cn Unassigned
|
||||
Co Private use
|
||||
Cs Surrogate
|
||||
|
||||
L Letter
|
||||
Lc Cased letter, the union of Ll, Lu, and Lt
|
||||
L& Synonym of Lc
|
||||
Ll Lower case letter
|
||||
Lm Modifier letter
|
||||
Lo Other letter
|
||||
Lt Title case letter
|
||||
Lu Upper case letter
|
||||
|
||||
M Mark
|
||||
Mc Spacing mark
|
||||
Me Enclosing mark
|
||||
Mn Non-spacing mark
|
||||
|
||||
N Number
|
||||
Nd Decimal number
|
||||
Nl Letter number
|
||||
No Other number
|
||||
|
||||
P Punctuation
|
||||
Pc Connector punctuation
|
||||
Pd Dash punctuation
|
||||
Pe Close punctuation
|
||||
Pf Final punctuation
|
||||
Pi Initial punctuation
|
||||
Po Other punctuation
|
||||
Ps Open punctuation
|
||||
|
||||
S Symbol
|
||||
Sc Currency symbol
|
||||
Sk Modifier symbol
|
||||
Sm Mathematical symbol
|
||||
So Other symbol
|
||||
|
||||
Z Separator
|
||||
Zl Line separator
|
||||
Zp Paragraph separator
|
||||
Zs Space separator
|
||||
</pre>
|
||||
From release 10.45, when caseless matching is set, Ll, Lu, and Lt are all
|
||||
equivalent to Lc.
|
||||
</P>
|
||||
<br><a name="SEC7" href="#TOC1">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
Xan Alphanumeric: union of properties L and N
|
||||
Xps POSIX space: property Z or tab, NL, VT, FF, CR
|
||||
Xsp Perl space: property Z or tab, NL, VT, FF, CR
|
||||
Xuc Universally-named character: one that can be
|
||||
represented by a Universal Character Name
|
||||
Xwd Perl word: property Xan or underscore
|
||||
</pre>
|
||||
Perl and POSIX space are now the same. Perl added VT to its space character set
|
||||
at release 5.18.
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">BINARY PROPERTIES FOR \p AND \P</a><br>
|
||||
<P>
|
||||
Unicode defines a number of binary properties, that is, properties whose only
|
||||
values are true or false. You can obtain a list of those that are recognized by
|
||||
\p and \P, along with their abbreviations, by running this command:
|
||||
<pre>
|
||||
pcre2test -LP
|
||||
</pre>
|
||||
</P>
|
||||
<br><a name="SEC9" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a><br>
|
||||
<P>
|
||||
Many script names and their 4-letter abbreviations are recognized in
|
||||
\p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P of
|
||||
course). You can obtain a list of these scripts by running this command:
|
||||
<pre>
|
||||
pcre2test -LS
|
||||
</pre>
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">THE BIDI_CLASS PROPERTY FOR \p AND \P</a><br>
<P>
<pre>
  \p{Bidi_Class:<class>}   matches a character with the given class
  \p{BC:<class>}           matches a character with the given class
</pre>
The recognized classes are:
<pre>
  AL          Arabic letter
  AN          Arabic number
  B           paragraph separator
  BN          boundary neutral
  CS          common separator
  EN          European number
  ES          European separator
  ET          European terminator
  FSI         first strong isolate
  L           left-to-right
  LRE         left-to-right embedding
  LRI         left-to-right isolate
  LRO         left-to-right override
  NSM         non-spacing mark
  ON          other neutral
  PDF         pop directional format
  PDI         pop directional isolate
  R           right-to-left
  RLE         right-to-left embedding
  RLI         right-to-left isolate
  RLO         right-to-left override
  S           segment separator
  WS          white space
</PRE>
</P>
<br><a name="SEC11" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE2, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE2_UCP is set. You can use
\Q...\E inside a character class.
</P>
<P>
When PCRE2_ALT_EXTENDED_CLASS is set, UTS#18 extended character classes may be
used, allowing nested character classes, combined using set operators.
<pre>
  [x&&[^y]]   UTS#18 extended character class

  x||y        set union (OR)
  x&&y        set intersection (AND)
  x--y        set difference (AND NOT)
  x~~y        set symmetric difference (XOR)

</PRE>
</P>
<br><a name="SEC12" href="#TOC1">PERL EXTENDED CHARACTER CLASSES</a><br>
<P>
<pre>
  (?[...])                Perl extended character class
  (?[\p{Thai} & \p{Nd}])  operators; whitespace ignored
  (?[(x - y) & z])        parentheses for grouping

  (?[ [^3] & \p{Nd} ])    [...] is a nested ordinary class
  (?[ [:alpha:] - [z] ])  POSIX set is allowed outside [...]
  (?[ \d - [3] ])         backslash-escaped set is allowed outside [...]
  (?[ !\n & [:ascii:] ])  backslash-escaped character is allowed outside [...]
                          all other characters or ranges must be enclosed in [...]

  x|y, x+y                set union (OR)
  x&y                     set intersection (AND)
  x-y                     set difference (AND NOT)
  x^y                     set symmetric difference (XOR)
  !x                      set complement (NOT)
</pre>
Inside a Perl extended character class, [...] switches mode to be interpreted
as an ordinary character class. Outside of a nested [...], the only items
permitted are backslash-escapes, POSIX sets, operators, and parentheses. Inside
a nested ordinary class, ^ has its usual meaning (inverts the class when used
as the first character); outside of a nested class, ^ is the XOR operator.
</P>
<br><a name="SEC13" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
  {,m}        zero up to m, greedy
  {,m}+       zero up to m, possessive
  {,m}?       zero up to m, lazy
</PRE>
</P>
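<P>
The difference between greedy and lazy quantifiers in the table above can be
sketched as follows. This is an illustration only: it uses Python's built-in
re module, which is a different engine from PCRE2 but gives these particular
quantifiers the same meaning. The subject strings are made up for the example.
</P>

```python
import re

subject = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ takes as much as possible, then backs off to the last ">".
greedy = re.search(r"<.+>", subject).group(0)

# Lazy: .+? takes as little as possible, stopping at the first ">".
lazy = re.search(r"<.+?>", subject).group(0)

# {n,m}: between 2 and 3 digits per match, preferring the longer run.
bounded = re.findall(r"\d{2,3}", "7 42 12345")

print(greedy)   # <b>bold</b> and <i>italic</i>
print(lazy)     # <b>
print(bounded)  # ['42', '123', '45']
```

<P>
Possessive forms such as ++ are left out of the sketch because older Python
releases do not accept them.
</P>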
<br><a name="SEC14" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
                also after an internal newline in multiline mode
                (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
  \A          start of subject
  $           end of subject
                also before newline at end of subject
                also before internal newline in multiline mode
  \Z          end of subject
                also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
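<P>
A minimal sketch of how ^ and $ change in multiline mode, again using Python's
re module as a stand-in rather than PCRE2 itself. \Z is deliberately not used:
Python's \Z behaves like PCRE2's \z, so the two engines differ there. The
subject text is invented for the example.
</P>

```python
import re

text = "one\ntwo\nthree\n"

# Without multiline mode, ^ matches only at the start of the subject.
single = re.findall(r"^\w+", text)

# With (?m), ^ also matches after each internal newline.
multi = re.findall(r"(?m)^\w+", text)

# $ matches at the very end, and also just before a final newline.
before_final_newline = re.search(r"three$", text) is not None

# \A matches only at the start of the subject, even in multiline mode.
start_only = re.findall(r"(?m)\A\w+", text)

print(single)                # ['one']
print(multi)                 # ['one', 'two', 'three']
print(before_final_newline)  # True
print(start_only)            # ['one']
```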
<br><a name="SEC15" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
<P>
<pre>
  \K          set reported start of match
</pre>
From release 10.38 \K is not permitted by default in lookaround assertions,
for compatibility with Perl. However, if the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
option is set, the previous behaviour is re-enabled. When this option is set,
\K is honoured in positive assertions, but ignored in negative ones.
</P>
<br><a name="SEC16" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
<br><a name="SEC17" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capture group
  (?<name>...)    named capture group (Perl)
  (?'name'...)    named capture group (Perl)
  (?P<name>...)   named capture group (Python)
  (?:...)         non-capture group
  (?|...)         non-capture group; reset group numbers for
                   capture groups in each alternative
</pre>
In non-UTF modes, names may contain underscores and ASCII letters and digits;
in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In
both cases, a name must not start with a digit.
</P>
<br><a name="SEC18" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?>...)         atomic non-capture group
  (*atomic:...)   atomic non-capture group
</PRE>
</P>
<br><a name="SEC19" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC20" href="#TOC1">OPTION SETTING</a><br>
<P>
Changes of these options within a group are automatically cancelled at the end
of the group.
<pre>
  (?a)            all ASCII options
  (?aD)           restrict \d to ASCII in UCP mode
  (?aS)           restrict \s to ASCII in UCP mode
  (?aW)           restrict \w to ASCII in UCP mode
  (?aP)           restrict all POSIX classes to ASCII in UCP mode
  (?aT)           restrict POSIX digit classes to ASCII in UCP mode
  (?i)            caseless
  (?J)            allow duplicate named groups
  (?m)            multiline
  (?n)            no auto capture
  (?r)            restrict caseless to either ASCII or non-ASCII
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            ignore white space except in classes or \Q...\E
  (?xx)           as (?x) but also ignore space and tab in classes
  (?-...)         unset the given option(s)
  (?^)            unset imnrsx options
</pre>
(?aP) implies (?aT) as well, though this has no additional effect. However, it
means that (?-aP) also implies (?-aT) and disables all ASCII restrictions for
POSIX classes.
</P>
<P>
Unsetting x or xx unsets both. Several options may be set at once, and a
mixture of setting and unsetting such as (?i-x) is allowed, but there may be
only one hyphen. Setting (but not unsetting) is allowed after (?^, for example
(?^in). An option setting may appear at the start of a non-capture group, for
example (?i:...).
</P>
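<P>
A short sketch of a scoped option setting, using Python's re module as a
stand-in for PCRE2. Only (?i:...) and a leading (?x) are shown, because those
forms mean the same thing in both engines; most of the other options in the
table above are PCRE2-specific. The patterns and subjects are made up.
</P>

```python
import re

# (?i:...) applies caseless matching only inside the non-capture group.
scoped = re.compile(r"(?i:hello) World")
matches_upper = scoped.fullmatch("HELLO World") is not None
matches_lower_tail = scoped.fullmatch("hello world") is not None

# A leading (?x) makes white space in the pattern insignificant.
extended = re.compile(r"(?x) \d+ \s* kg")
matches_weight = extended.fullmatch("12 kg") is not None

print(matches_upper)       # True
print(matches_lower_tail)  # False: "World" is outside the caseless group
print(matches_weight)      # True
```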
<P>
The following are recognized only at the very start of a pattern or after one
of the newline or \R sequences or options with similar syntax. More than one
of them may appear. For the first three, d is a decimal number.
<pre>
  (*LIMIT_DEPTH=d)     set the backtracking limit to d
  (*LIMIT_HEAP=d)      set the heap size limit to d * 1024 bytes
  (*LIMIT_MATCH=d)     set the match limit to d
  (*CASELESS_RESTRICT) set PCRE2_EXTRA_CASELESS_RESTRICT when matching
  (*NOTEMPTY)          set PCRE2_NOTEMPTY when matching
  (*NOTEMPTY_ATSTART)  set PCRE2_NOTEMPTY_ATSTART when matching
  (*NO_AUTO_POSSESS)   no auto-possessification (PCRE2_NO_AUTO_POSSESS)
  (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
  (*NO_JIT)            disable JIT optimization
  (*NO_START_OPT)      no start-match optimization (PCRE2_NO_START_OPTIMIZE)
  (*TURKISH_CASING)    set PCRE2_EXTRA_TURKISH_CASING when matching
  (*UTF)               set appropriate UTF mode for the library in use
  (*UCP)               set PCRE2_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of
the limits set by the caller of <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>,
not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
application can lock out the use of (*UTF) and (*UCP) by setting the
PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
</P>
<br><a name="SEC21" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
  (*NUL)          the NUL character (binary zero)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)                     )
  (*pla:...)                  ) positive lookahead
  (*positive_lookahead:...)   )

  (?!...)                     )
  (*nla:...)                  ) negative lookahead
  (*negative_lookahead:...)   )

  (?<=...)                    )
  (*plb:...)                  ) positive lookbehind
  (*positive_lookbehind:...)  )

  (?<!...)                    )
  (*nlb:...)                  ) negative lookbehind
  (*negative_lookbehind:...)  )
</pre>
Each top-level branch of a lookbehind must have a limit for the number of
characters it matches. If any branch can match a variable number of characters,
the maximum for each branch is limited to a value set by the caller of
<b>pcre2_compile()</b> or defaulted. The default is set when PCRE2 is built
(ultimate default 255). If every branch matches a fixed number of characters,
the limit for each branch is 65535 characters.
</P>
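<P>
The short forms (?=...), (?!...) and (?<=...) listed above can be tried out
with the sketch below. It uses Python's re module rather than PCRE2; Python
accepts only fixed-length lookbehinds, so the example stays within that limit.
The *pla-style synonyms are PCRE2-specific and are not shown. The subject text
is invented for illustration.
</P>

```python
import re

text = "cost: 30 USD, discount: 5 EUR, id: 77"

# Positive lookahead: digits that are followed by " USD".
usd = re.findall(r"\d+(?= USD)", text)

# Negative lookahead: whole numbers that are NOT followed by " USD".
not_usd = re.findall(r"\b\d+\b(?! USD)", text)

# Positive lookbehind: digits that are preceded by "id: ".
after_id = re.findall(r"(?<=id: )\d+", text)

print(usd)       # ['30']
print(not_usd)   # ['5', '77']
print(after_id)  # ['77']
```

<P>
The assertions consume no characters, which is why only the digits appear in
the results.
</P>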
<br><a name="SEC24" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
<P>
These assertions are specific to PCRE2 and are not Perl-compatible.
<pre>
  (?*...)                                )
  (*napla:...)                           ) synonyms
  (*non_atomic_positive_lookahead:...)   )

  (?<*...)                               )
  (*naplb:...)                           ) synonyms
  (*non_atomic_positive_lookbehind:...)  )
</PRE>
</P>
<br><a name="SEC25" href="#TOC1">SUBSTRING SCAN ASSERTION</a><br>
<P>
This feature is not Perl-compatible.
<pre>
  (*scan_substring:(grouplist)...)  scan captured substring
  (*scs:(grouplist)...)             scan captured substring
</pre>
The comma-separated list may identify groups in any of the following ways:
<pre>
  n       absolute reference
  +n      relative reference
  -n      relative reference
  <name>  name
  'name'  name

</PRE>
</P>
<br><a name="SEC26" href="#TOC1">SCRIPT RUNS</a><br>
<P>
<pre>
  (*script_run:...)           ) script run, can be backtracked into
  (*sr:...)                   )

  (*atomic_script_run:...)    ) atomic script run
  (*asr:...)                  )
</PRE>
</P>
<br><a name="SEC27" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g+n            relative reference by number (PCRE2 extension)
  \g-n            relative reference by number
  \g{+n}          relative reference by number (PCRE2 extension)
  \g{-n}          relative reference by number
  \k<name>        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
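<P>
Two of the forms above, \n and the Python-style (?P=name), can be sketched
with Python's re module, where they originate. This is an illustration under
the assumption that the reader only needs the general idea; the \g and \k
forms are not accepted by Python inside a pattern, so they are not shown.
</P>

```python
import re

# Named backreference: (?P=word) must repeat whatever (?P<word>...) captured.
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")
match = doubled.search("this is is a test")
repeated_word = match.group("word")
whole_match = match.group(0)

# Numbered backreference: \1 repeats the first capture group.
double_letters = re.findall(r"(\w)\1", "bookkeeper")

print(repeated_word)   # is
print(whole_match)     # is is
print(double_letters)  # ['o', 'k', 'e']
```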
<br><a name="SEC28" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subroutine by absolute number
  (?+n)           call subroutine by relative number
  (?-n)           call subroutine by relative number
  (?&name)        call subroutine by name (Perl)
  (?P>name)       call subroutine by name (Python)
  \g<name>        call subroutine by name (Oniguruma)
  \g'name'        call subroutine by name (Oniguruma)
  \g<n>           call subroutine by absolute number (Oniguruma)
  \g'n'           call subroutine by absolute number (Oniguruma)
  \g<+n>          call subroutine by relative number (PCRE2 extension)
  \g'+n'          call subroutine by relative number (PCRE2 extension)
  \g<-n>          call subroutine by relative number (PCRE2 extension)
  \g'-n'          call subroutine by relative number (PCRE2 extension)
</PRE>
</P>
<br><a name="SEC29" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)               absolute reference condition
  (?(+n)              relative reference condition (PCRE2 extension)
  (?(-n)              relative reference condition (PCRE2 extension)
  (?(<name>)          named reference condition (Perl)
  (?('name')          named reference condition (Perl)
  (?(name)            named reference condition (PCRE2, deprecated)
  (?(R)               overall recursion condition
  (?(Rn)              specific numbered group recursion condition
  (?(R&name)          specific named group recursion condition
  (?(DEFINE)          define groups for reference
  (?(VERSION[>]=n.m)  test PCRE2 version
  (?(assert)          assertion condition
</pre>
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a reference
condition if the relevant named group exists.
</P>
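<P>
The absolute reference condition (?(n)...) can be sketched as follows, using
Python's re module, which supports this one form with the same meaning. The
other conditions in the table (recursion, DEFINE, VERSION, assertions) are not
available there and are not shown. The pattern is a made-up example that
accepts a number with either both parentheses or neither.
</P>

```python
import re

# Group 1 optionally matches "("; the condition (?(1)\)) then requires
# a closing ")" only if group 1 took part in the match.
balanced = re.compile(r"(\()?\d+(?(1)\))")

results = {
    subject: balanced.fullmatch(subject) is not None
    for subject in ["123", "(123)", "(123", "123)"]
}

print(results)
# {'123': True, '(123)': True, '(123': False, '123)': False}
```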
<br><a name="SEC30" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
if :NAME is present. The others just set a name for passing back to the caller,
but this is not a name that (*SKIP) can see. The following act immediately they
are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
</pre>
The effect of one of these verbs in a group called as a subroutine is confined
to the subroutine call.
</P>
<br><a name="SEC31" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)            callout (assumed number 0)
  (?Cn)           callout with numerical data n
  (?C"text")      callout with string data
</pre>
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
</P>
<br><a name="SEC32" href="#TOC1">REPLACEMENT STRINGS</a><br>
<P>
If the PCRE2_SUBSTITUTE_LITERAL option is set, a replacement string for
<b>pcre2_substitute()</b> is not interpreted. Otherwise, by default, the only
special character is the dollar character in one of the following forms:
<pre>
  $$                  insert a dollar character
  $n or ${n}          insert the contents of group <i>n</i>
  $<name>             insert the contents of named group
  $0 or $&            insert the entire matched substring
  $`                  insert the substring that precedes the match
  $'                  insert the substring that follows the match
  $_                  insert the entire input string
  $*MARK or ${*MARK}  insert a control verb name
</pre>
For ${n}, n can be a name or a number. If PCRE2_SUBSTITUTE_EXTENDED is set,
there is additional interpretation:
</P>
<P>
1. Backslash is an escape character, and the forms described in "ESCAPED
CHARACTERS" above are recognized. Also:
<pre>
  \Q...\E   can be used to suppress interpretation
  \l        force the next character to lower case
  \u        force the next character to upper case
  \L        force subsequent characters to lower case
  \U        force subsequent characters to upper case
  \u\L      force next character to upper case, then all lower
  \l\U      force next character to lower case, then all upper
  \E        end \L or \U case forcing
  \b        backspace character (note: as in character class in pattern)
  \v        vertical tab character (note: not the same as in a pattern)
</pre>
2. The Python form \g<n>, where the angle brackets are part of the syntax and
<i>n</i> is either a group name or a number, is recognized as an alternative way
of inserting the contents of a group, for example \g<3>.
</P>
<P>
3. Capture substitution supports the following additional forms:
<pre>
  ${n:-string}             default for unset group
  ${n:+string1:string2}    values for set/unset group
</pre>
The substitution strings themselves are expanded. Backslash can be used to
escape colons and closing curly brackets.
</P>
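<P>
Item 2 above refers to the Python form \g<n>. The sketch below shows that form
in Python's own re.sub(), where it comes from; it is not a demonstration of
<b>pcre2_substitute()</b>, and the dollar forms in the first table are not
understood by Python. The pattern and input are made up for the example.
</P>

```python
import re

pattern = re.compile(r"(?P<user>\w+)@(?P<host>\w+)")

# \g<host> inserts a group by name, \g<1> inserts a group by number.
swapped = pattern.sub(r"\g<host>:\g<1>", "alice@example")

print(swapped)  # example:alice
```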
<br><a name="SEC33" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
</P>
<br><a name="SEC34" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC35" href="#TOC1">REVISION</a><br>
<P>
Last updated: 27 November 2024
<br>
Copyright © 1997-2024 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>