add PIRegularExpression

2025-08-11 14:23:29 +03:00
parent 91955d44fa
commit 654c0847b2
481 changed files with 434858 additions and 0 deletions

@@ -0,0 +1,442 @@
Building PCRE2 without using autotools
--------------------------------------
This document contains the following sections:
General
Generic instructions for the PCRE2 C libraries
Stack size in Windows environments
Linking programs in Windows environments
Calling conventions in Windows environments
Comments about Win32 builds
Building PCRE2 on Windows with CMake
Building PCRE2 on Windows with Visual Studio
Testing with RunTest.bat
Building PCRE2 on native z/OS and z/VM
Building PCRE2 under VMS
GENERAL
The source of the PCRE2 libraries consists entirely of code written in Standard
C, and so should compile successfully on any system that has a Standard C
compiler and library.
The PCRE2 distribution includes a "configure" file for use by the
configure/make (autotools) build system, as found in many Unix-like
environments. The README file contains information about the options for
"configure".
There is also support for CMake, which some users prefer, especially in Windows
environments, though it can also be run in Unix-like environments. See the
section entitled "Building PCRE2 on Windows with CMake" below.
Versions of src/config.h and src/pcre2.h are distributed in the PCRE2 tarballs
under the names src/config.h.generic and src/pcre2.h.generic. These are
provided for those who build PCRE2 without using "configure" or CMake. If you
use "configure" or CMake, the .generic versions are not used.
GENERIC INSTRUCTIONS FOR THE PCRE2 C LIBRARIES
There are three possible PCRE2 libraries, each handling data with a specific
code unit width: 8, 16, or 32 bits. You can build any combination of them. The
following are generic instructions for building a PCRE2 C library "by hand". If
you are going to use CMake, this section does not apply to you; you can skip
ahead to the CMake section. Note that the settings concerned with 8-bit,
16-bit, and 32-bit code units relate to the type of data string that PCRE2
processes. They are NOT referring to the underlying operating system bit width.
You do not have to do anything special to compile in a 64-bit environment, for
example.
(1) Copy or rename the file src/config.h.generic as src/config.h, and edit the
macro settings that it contains to whatever is appropriate for your
environment. In particular, you can alter the definition of the NEWLINE
macro to specify what character(s) you want to be interpreted as line
terminators by default. You need to #define at least one of
SUPPORT_PCRE2_8, SUPPORT_PCRE2_16, or SUPPORT_PCRE2_32, depending on which
libraries you are going to build. You must set all that apply.
When you subsequently compile any of the PCRE2 modules, you must specify
-DHAVE_CONFIG_H to your compiler so that src/config.h is included in the
sources.
An alternative approach is not to edit src/config.h, but to use -D on the
compiler command line to make any changes that you need to the
configuration options. In this case -DHAVE_CONFIG_H must not be set.
NOTE: There have been occasions when the way in which certain parameters
in src/config.h are used has changed between releases. (In the
configure/make world, this is handled automatically.) When upgrading to a
new release, you are strongly advised to review src/config.h.generic
before re-using what you had previously.
Note also that the src/config.h.generic file is created from a config.h
that was generated by Autotools, which automatically includes settings of
a number of macros that are not actually used by PCRE2 (for example,
HAVE_DLFCN_H).
(2) Copy or rename the file src/pcre2.h.generic as src/pcre2.h.
(3) EITHER:
Copy or rename file src/pcre2_chartables.c.dist as
src/pcre2_chartables.c.
OR:
Compile src/pcre2_dftables.c as a stand-alone program (using
-DHAVE_CONFIG_H if you have set up src/config.h), and then run it with
the single argument "src/pcre2_chartables.c". This generates a set of
standard character tables and writes them to that file. The tables are
generated using the default C locale for your system. If you want to use
a locale that is specified by LC_xxx environment variables, add the -L
option to the pcre2_dftables command. You must use this method if you
are building on a system that uses EBCDIC code.
The tables in src/pcre2_chartables.c are defaults. The caller of PCRE2 can
specify alternative tables at run time.
(4) For a library that supports 8-bit code units in the character strings that
it processes, compile the following source files from the src directory,
setting -DPCRE2_CODE_UNIT_WIDTH=8 as a compiler option. Also set
-DHAVE_CONFIG_H if you have set up src/config.h with your configuration,
or else use other -D settings to change the configuration as required.
pcre2_auto_possess.c
pcre2_chkdint.c
pcre2_chartables.c
pcre2_compile.c
pcre2_compile_class.c
pcre2_config.c
pcre2_context.c
pcre2_convert.c
pcre2_dfa_match.c
pcre2_error.c
pcre2_extuni.c
pcre2_find_bracket.c
pcre2_jit_compile.c
pcre2_maketables.c
pcre2_match.c
pcre2_match_data.c
pcre2_newline.c
pcre2_ord2utf.c
pcre2_pattern_info.c
pcre2_script_run.c
pcre2_serialize.c
pcre2_string_utils.c
pcre2_study.c
pcre2_substitute.c
pcre2_substring.c
pcre2_tables.c
pcre2_ucd.c
pcre2_valid_utf.c
pcre2_xclass.c
Make sure that you include -I. in the compiler command (or equivalent for
an unusual compiler) so that all included PCRE2 header files are first
sought in the src directory under the current directory. Otherwise you run
the risk of picking up a previously-installed file from somewhere else.
Note that you must compile pcre2_jit_compile.c, even if you have not
defined SUPPORT_JIT in src/config.h, because when JIT support is not
configured, dummy functions are compiled. When JIT support IS configured,
pcre2_jit_compile.c #includes other files from the sljit dependency,
all of whose names begin with "sljit". It also #includes
src/pcre2_jit_match.c and src/pcre2_jit_misc.c, so you should not compile
those yourself.
Note also that the pcre2_fuzzsupport.c file contains special code that is
useful to those who want to run fuzzing tests on the PCRE2 library. Unless
you are doing that, you can ignore it.
(5) Now link all the compiled code into an object library in whichever form
your system keeps such libraries. This is the PCRE2 C 8-bit library,
typically called something like libpcre2-8. If your system has static and
shared libraries, you may have to do this once for each type.
(6) If you want to build a library that supports 16-bit or 32-bit code units,
set 16 or 32 as the value of -DPCRE2_CODE_UNIT_WIDTH when obeying step 4
above. If you want to build more than one PCRE2 library, repeat steps 4
and 5 as necessary.
(7) If you want to build the POSIX wrapper functions (which apply only to the
8-bit library), ensure that you have the src/pcre2posix.h file and then
compile src/pcre2posix.c. Link the result (on its own) as the pcre2posix
library. If targeting a DLL in Windows, make sure to include
-DPCRE2POSIX_SHARED with your compiler flags.
(8) The pcre2test program can be linked with any combination of the 8-bit,
16-bit and 32-bit libraries (depending on what you specified in
src/config.h). Compile src/pcre2test.c; don't forget -DHAVE_CONFIG_H if
necessary, but do NOT define PCRE2_CODE_UNIT_WIDTH. Then link with the
appropriate library/ies. If you compiled an 8-bit library, pcre2test also
needs the pcre2posix wrapper library.
(9) Run pcre2test on the testinput files in the testdata directory, and check
that the output matches the corresponding testoutput files. There are
comments about what each test does in the section entitled "Testing PCRE2"
in the README file. If you compiled more than one of the 8-bit, 16-bit and
32-bit libraries, you need to run pcre2test with the -16 option to do
16-bit tests and with the -32 option to do 32-bit tests.
Some tests are relevant only when certain build-time options are selected.
For example, test 4 is for Unicode support, and will not run if you have
built PCRE2 without it. See the comments at the start of each testinput
file. If you have a suitable Unix-like shell, the RunTest script will run
the appropriate tests for you. The command "RunTest list" will output a
list of all the tests.
Note that the supplied files are in Unix format, with just LF characters
as line terminators. You may need to edit them to change this if your
system uses a different convention.
(10) If you have built PCRE2 with SUPPORT_JIT, the JIT features can be tested
by running pcre2test with the -jit option. This is done automatically by
the RunTest script. You might also like to build and run the freestanding
JIT test program, src/pcre2_jit_test.c.
(11) The pcre2test program tests the POSIX wrapper library, but there is also a
freestanding test program in src/pcre2posix_test.c. It must be linked with
both the pcre2posix library and the 8-bit PCRE2 library.
(12) If you want to use the pcre2grep command, compile and link
src/pcre2grep.c; it uses only the 8-bit PCRE2 library (it does not need
the pcre2posix library). If you have built the PCRE2 library with JIT
support by defining SUPPORT_JIT in src/config.h, you can also define
SUPPORT_PCRE2GREP_JIT, which causes pcre2grep to make use of JIT (unless
it is run with --no-jit). If you define SUPPORT_PCRE2GREP_JIT without
defining SUPPORT_JIT, pcre2grep does not try to make use of JIT.
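Steps (1) to (5) above can be sketched as a shell script for the 8-bit
library. This is an illustration under stated assumptions, not part of the
PCRE2 distribution: it assumes a cc-compatible compiler, "ar" as the archiver,
that it is run from the top of the source tree with -I src, and that the
unedited generic files suit your system. The "run" helper, the RUN variable and
the output name libpcre2-8.a are made up for this sketch. By default the
commands are only printed.

```shell
#!/bin/sh
# Sketch of steps (1)-(5) for the 8-bit library. Commands are printed by
# default; set RUN=1 to execute them from the top of the PCRE2 source tree.
CC=${CC:-cc}
RUN=${RUN:-0}

# Hypothetical helper: print a command and, if RUN=1, execute it.
run() {
    echo "$*"
    if [ "$RUN" = 1 ]; then "$@" || exit 1; fi
}

# The modules listed in step (4), without the "pcre2_" prefix and ".c" suffix.
SOURCES="auto_possess chkdint chartables compile compile_class config context
convert dfa_match error extuni find_bracket jit_compile maketables match
match_data newline ord2utf pattern_info script_run serialize string_utils
study substitute substring tables ucd valid_utf xclass"

# Steps (1)-(3): start from the distributed files. Edit src/config.h after
# this point if the generic settings (for example, which SUPPORT_PCRE2_*
# macros are defined) do not suit your build.
run cp src/config.h.generic src/config.h
run cp src/pcre2.h.generic src/pcre2.h
run cp src/pcre2_chartables.c.dist src/pcre2_chartables.c

# Step (4): compile each module for 8-bit code units.
OBJECTS=""
for name in $SOURCES; do
    run "$CC" -c -I src -DHAVE_CONFIG_H -DPCRE2_CODE_UNIT_WIDTH=8 \
        "src/pcre2_$name.c" -o "pcre2_$name.o"
    OBJECTS="$OBJECTS pcre2_$name.o"
done

# Step (5): collect the objects into a static library.
run ar rcs libpcre2-8.a $OBJECTS
```

For the 16-bit or 32-bit library (step 6), the same loop would be repeated
with a different value of -DPCRE2_CODE_UNIT_WIDTH and different object and
library names.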
STACK SIZE IN WINDOWS ENVIRONMENTS
Prior to release 10.30 the default system stack size of 1MiB in some Windows
environments caused issues with some tests. This should no longer be the case
for 10.30 and later releases.
LINKING PROGRAMS IN WINDOWS ENVIRONMENTS
If you want to statically link a program against a PCRE2 library in the form of
a non-dll .a file, you must define PCRE2_STATIC before including src/pcre2.h.
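For example, the macro can be supplied on the compiler command line, which has
the same effect as a #define placed before the #include. The sketch below only
prints such a command. The file names myprog.c and libpcre2-8.a, the use of gcc
and the static_link_command helper are assumptions made for illustration.

```shell
#!/bin/sh
# Print a compile-and-link command for a program that uses a static 8-bit
# PCRE2 library. Hypothetical names: myprog.c and libpcre2-8.a.
static_link_command() {
    src=$1
    lib=$2
    # -DPCRE2_STATIC is the definition required above. An application must
    # also define PCRE2_CODE_UNIT_WIDTH before pcre2.h is included.
    echo "gcc -DPCRE2_STATIC -DPCRE2_CODE_UNIT_WIDTH=8 -I src $src $lib -o ${src%.c}.exe"
}

static_link_command myprog.c libpcre2-8.a
```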
CALLING CONVENTIONS IN WINDOWS ENVIRONMENTS
It is possible to compile programs to use different calling conventions using
MSVC. Search the web for "calling conventions" for more information. To make it
easier to change the calling convention for the exported functions in a
PCRE2 library, the macro PCRE2_CALL_CONVENTION is present in all the external
definitions. It can be set externally when compiling (e.g. in CFLAGS). If it is
not set, it defaults to empty; the default calling convention is then used
(which is what is wanted most of the time).
COMMENTS ABOUT WIN32 BUILDS (see also "BUILDING PCRE2 ON WINDOWS WITH CMAKE")
There are two ways of building PCRE2 using the "configure, make, make install"
paradigm on Windows systems: using MinGW or using Cygwin. These are not at all
the same thing; they are completely different from each other. There is also
support for building using CMake, which some users find a more straightforward
way of building PCRE2 under Windows.
The MinGW home page (http://www.mingw.org/) says this:
MinGW: A collection of freely available and freely distributable Windows
specific header files and import libraries combined with GNU toolsets that
allow one to produce native Windows programs that do not rely on any
3rd-party C runtime DLLs.
The Cygwin home page (http://www.cygwin.com/) says this:
Cygwin is a Linux-like environment for Windows. It consists of two parts:
. A DLL (cygwin1.dll) which acts as a Linux API emulation layer providing
substantial Linux API functionality
. A collection of tools which provide Linux look and feel.
On both MinGW and Cygwin, PCRE2 should build correctly using:
./configure && make && make install
This should create two libraries called libpcre2-8 and libpcre2-posix. These
are independent libraries: when you link with libpcre2-posix you must also link
with libpcre2-8, which contains the basic functions.
Using Cygwin's compiler generates libraries and executables that depend on
cygwin1.dll. If a library that is generated this way is distributed,
cygwin1.dll has to be distributed as well. Since cygwin1.dll is under the GPL
licence, this forces not only PCRE2 to be under the GPL, but also the entire
application. A distributor who wants to keep their own code proprietary must
purchase an appropriate Cygwin licence.
MinGW has no such restrictions. The MinGW compiler generates a library or
executable that can run standalone on Windows without any third party dll or
licensing issues.
There is a further complication:
If a Cygwin user uses the -mno-cygwin Cygwin gcc flag, what that really does is
to tell Cygwin's gcc to use the MinGW gcc. Cygwin's gcc is only acting as a
front end to MinGW's gcc (if you install Cygwin's gcc, you get both Cygwin's
gcc and MinGW's gcc). So, a user can:
. Build native binaries by using MinGW or by getting Cygwin and using
-mno-cygwin.
. Build binaries that depend on cygwin1.dll by using Cygwin with the normal
compiler flags.
The test files that are supplied with PCRE2 are in UNIX format, with LF
characters as line terminators. Unless your PCRE2 library uses a default
newline option that includes LF as a valid newline, it may be necessary to
change the line terminators in the test files to get some of the tests to work.
BUILDING PCRE2 ON WINDOWS WITH CMAKE
CMake is an alternative configuration facility that can be used instead of
"configure". CMake creates project files (make files, solution files, etc.)
tailored to numerous development environments, including Visual Studio,
Borland, Msys, MinGW, NMake, and Unix. If possible, use short paths with no
spaces in the names for your CMake installation and your PCRE2 source and build
directories.
If you are using CMake and encounter errors, deleting the CMake cache and
restarting from a fresh build may fix the error. In the CMake GUI, the cache can
be deleted by selecting "File > Delete Cache"; alternatively, the file
CMakeCache.txt in the build directory can be deleted.
1. Install the latest CMake version available from http://www.cmake.org/, and
ensure that cmake\bin is on your path.
2. Unzip (retaining folder structure) the PCRE2 source tree into a source
directory such as C:\pcre2. You should ensure your local date and time
is not earlier than the file dates in your source dir if the release is
very new.
3. Create a new, empty build directory, preferably a subdirectory of the
source dir. For example, C:\pcre2\pcre2-xx\build.
4. Run CMake.
- Using the CLI, simply run `cmake ..` inside the `build/` directory. You can
use the `ccmake` ncurses GUI to select and configure PCRE2 features.
- Using the CMake GUI:
a) Run cmake-gui from the Shell environment of your build tool, for
example, Msys for Msys/MinGW or Visual Studio Command Prompt for
VC/VC++.
b) Enter C:\pcre2\pcre2-xx and C:\pcre2\pcre2-xx\build for the source and
build directories, respectively.
c) Press the "Configure" button.
d) Select the particular IDE / build tool that you are using (Visual
Studio, MSYS makefiles, MinGW makefiles, etc.)
e) The GUI will then list several configuration options. This is where
you can disable Unicode support or select other PCRE2 optional features.
f) Press "Configure" again. The adjacent "Generate" button should now be
active.
g) Press "Generate".
5. The build directory should now contain a usable build system, be it a
solution file for Visual Studio, makefiles for MinGW, etc. Exit from
cmake-gui and use the generated build system with your compiler or IDE.
E.g., for MinGW you can run "make", or for Visual Studio, open the PCRE2
solution, select the desired configuration (Debug, or Release, etc.) and
build the ALL_BUILD project.
Regardless of build system used, `cmake --build .` will build it.
6. If during configuration with cmake-gui you've elected to build the test
programs, you can execute them by building the test project. E.g., for
MinGW: "make test"; for Visual Studio build the RUN_TESTS project. The
most recent build configuration is targeted by the tests. A summary of
test results is presented. Complete test output is subsequently
available for review in Testing\Temporary under your build dir.
Regardless of build system used, `ctest` will run the tests.
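The command-line route through steps 3 to 6 can be sketched as follows. This
is a minimal illustration, assuming it is started from the top of the PCRE2
source tree and that the default CMake generator is acceptable. The "run"
helper and the RUN variable are made up for this sketch; by default the
commands are only printed.

```shell
#!/bin/sh
# Command-line version of steps 3 to 6. Commands are printed by default;
# set RUN=1 to execute them from the top of the PCRE2 source tree.
RUN=${RUN:-0}

# Hypothetical helper: print a command and, if RUN=1, execute it.
run() {
    echo "$*"
    if [ "$RUN" = 1 ]; then "$@" || exit 1; fi
}

run mkdir build          # step 3: new, empty build directory
run cd build
run cmake ..             # step 4: configure with the default generator
run cmake --build .      # step 5: build with whatever was generated
run ctest                # step 6: run the tests
```

With a multi-configuration generator such as Visual Studio, the build and test
commands normally also need the configuration to be named, for example
`cmake --build . --config Release` and `ctest -C Release`.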
BUILDING PCRE2 ON WINDOWS WITH VISUAL STUDIO
The code currently cannot be compiled without an inttypes.h header, which is
available only with Visual Studio 2013 or newer. However, this portable and
permissively-licensed implementation of the stdint.h header could be used as an
alternative:
http://www.azillionmonkeys.com/qed/pstdint.h
Just rename it and drop it into the top level of the build tree.
TESTING WITH RUNTEST.BAT
If configured with CMake, building the test project ("make test" or building
RUN_TESTS in Visual Studio) creates (and runs) pcre2_test.bat (and depending
on your configuration options, possibly other test programs) in the build
directory. The pcre2_test.bat script runs RunTest.bat with correct source and
exe paths.
For manual testing with RunTest.bat, provided the build dir is a subdirectory
of the source directory: open a command shell window, chdir to the location
of your pcre2test.exe and pcre2grep.exe programs, and call RunTest.bat with
"..\RunTest.bat" or "..\..\RunTest.bat" as appropriate.
To run only a particular test with RunTest.bat, provide a test number argument.
Otherwise:
1. Copy RunTest.bat into the directory where pcre2test.exe and pcre2grep.exe
have been created.
2. Edit RunTest.bat to identify the full or relative location of
the pcre2 source (in which the testdata folder resides), e.g.:
set srcdir=C:\pcre2\pcre2-10.00
3. In a Windows command environment, chdir to the location of your bat and
exe programs.
4. Run RunTest.bat. Test outputs will automatically be compared to expected
results, and discrepancies will be identified in the console output.
To independently test the just-in-time compiler, run pcre2_jit_test.exe.
BUILDING PCRE2 ON NATIVE Z/OS AND Z/VM
z/OS and z/VM are operating systems for mainframe computers, produced by IBM.
The character code used is EBCDIC, not ASCII or Unicode. In z/OS, UNIX APIs and
applications can be supported through UNIX System Services, and in such an
environment it should be possible to build PCRE2 in the same way as in other
systems, with the EBCDIC related configuration settings, but it is not known if
anybody has tried this.
In native z/OS (without UNIX System Services) and in z/VM, special ports are
required. For details, please see file 939 on this web site:
http://www.cbttape.org
Everything in that location, source and executable, is in EBCDIC and native
z/OS file formats. The port provides an API for LE languages such as COBOL and
for the z/OS and z/VM versions of the Rexx languages.
BUILDING PCRE2 UNDER VMS
Alexey Chuphin has contributed some auxiliary files for building PCRE2 under
OpenVMS. They are in the "vms" directory in the distribution tarball. Please
read the file called vms/openvms_readme.txt. The pcre2test and pcre2grep
programs contain some VMS-specific code.
==============================
Last updated: 26 December 2024
==============================

@@ -0,0 +1,970 @@
README file for PCRE2 (Perl-compatible regular expression library)
------------------------------------------------------------------
PCRE2 is a re-working of the original PCRE1 library to provide an entirely new
API. Since its initial release in 2015, there has been further development of
the code and it now differs from PCRE1 in more than just the API. There are new
features, and the internals have been improved. The original PCRE1 library is
now obsolete and no longer maintained. The latest release of PCRE2 is available
in .tar.gz, .tar.bz2, or .zip form from this GitHub repository:
https://github.com/PCRE2Project/pcre2/releases
There is a mailing list for discussion about the development of PCRE2 at
pcre2-dev@googlegroups.com. You can subscribe by sending an email to
pcre2-dev+subscribe@googlegroups.com.
You can access the archives and also subscribe or manage your subscription
here:
https://groups.google.com/g/pcre2-dev
Please read the NEWS file if you are upgrading from a previous release. The
contents of this README file are:
The PCRE2 APIs
Documentation for PCRE2
Building PCRE2 on non-Unix-like systems
Building PCRE2 without using autotools
Building PCRE2 using autotools
Retrieving configuration information
Shared libraries
Cross-compiling using autotools
Making new tarballs
Testing PCRE2
Character tables
File manifest
The PCRE2 APIs
--------------
PCRE2 is written in C, and it has its own API. There are three sets of
functions, one for the 8-bit library, which processes strings of bytes, one for
the 16-bit library, which processes strings of 16-bit values, and one for the
32-bit library, which processes strings of 32-bit values. Unlike PCRE1, there
are no C++ wrappers.
The distribution does contain a set of C wrapper functions for the 8-bit
library that are based on the POSIX regular expression API (see the pcre2posix
man page). These are built into a library called libpcre2-posix. Note that this
just provides a POSIX calling interface to PCRE2; the regular expressions
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
and does not give full access to all of PCRE2's facilities.
The header file for the POSIX-style functions is called pcre2posix.h. The
official POSIX name is regex.h, but I did not want to risk possible problems
with existing files of that name by distributing it that way. To use PCRE2 with
an existing program that uses the POSIX API, pcre2posix.h will have to be
renamed or pointed at by a link (or the program modified, of course). See the
pcre2posix documentation for more details.
Documentation for PCRE2
-----------------------
If you install PCRE2 in the normal way on a Unix-like system, you will end up
with a set of man pages whose names all start with "pcre2". The one that is
just called "pcre2" lists all the others. In addition to these man pages, the
PCRE2 documentation is supplied in two other forms:
1. There are files called doc/pcre2.txt, doc/pcre2grep.txt, and
doc/pcre2test.txt in the source distribution. The first of these is a
concatenation of the text forms of all the section 3 man pages except the
listing of pcre2demo.c and those that summarize individual functions. The
other two are the text forms of the section 1 man pages for the pcre2grep
and pcre2test commands. These text forms are provided for ease of scanning
with text editors or similar tools. They are installed in
<prefix>/share/doc/pcre2, where <prefix> is the installation prefix
(defaulting to /usr/local).
2. A set of files containing all the documentation in HTML form, hyperlinked
in various ways, and rooted in a file called index.html, is distributed in
doc/html and installed in <prefix>/share/doc/pcre2/html.
Building PCRE2 on non-Unix-like systems
---------------------------------------
For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
your system supports the use of "configure" and "make" you may be able to build
PCRE2 using autotools in the same way as for many Unix-like systems.
PCRE2 can also be configured using CMake, which can be run in various ways
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
NON-AUTOTOOLS-BUILD has information about CMake.
PCRE2 has been compiled on many different operating systems. It should be
straightforward to build PCRE2 on any system that has a Standard C compiler and
library, because it uses only Standard C functions.
Building PCRE2 without using autotools
--------------------------------------
The use of autotools (in particular, libtool) is problematic in some
environments, even some that are Unix or Unix-like. See the NON-AUTOTOOLS-BUILD
file for ways of building PCRE2 without using autotools.
Building PCRE2 using autotools
------------------------------
The following instructions assume the use of the widely used "configure; make;
make install" (autotools) process.
If you have downloaded and unpacked a PCRE2 release tarball, run the
"configure" command from the PCRE2 directory, with your current directory set
to the directory where you want the files to be created. This command is a
standard GNU "autoconf" configuration script, for which generic instructions
are supplied in the file INSTALL.
The files in the GitHub repository do not contain "configure". If you have
downloaded the PCRE2 source files from GitHub, before you can run "configure"
you must run the shell script called autogen.sh. This runs a number of
autotools to create a "configure" script (you must of course have the autotools
commands installed in order to do this).
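The difference between a release tarball and a GitHub checkout can be sketched
as a small shell function that prints the commands needed before "make". The
function name configure_steps is made up for this illustration, and the check
for a file called "configure" is only an assumed way of telling the two cases
apart.

```shell
#!/bin/sh
# Print the commands needed before "make" for the source tree in $1.
# configure_steps is a hypothetical helper written for this sketch.
configure_steps() {
    if [ ! -f "$1/configure" ]; then
        # GitHub checkout: "configure" has to be generated first.
        echo "./autogen.sh"
    fi
    echo "./configure"
}

configure_steps .
```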
Most commonly, people build PCRE2 within its own distribution directory, and in
this case, on many systems, just running "./configure" is sufficient. However,
the usual methods of changing standard defaults are available. For example:
CFLAGS='-O2 -Wall' ./configure --prefix=/opt/local
This command specifies that the C compiler should be run with the flags '-O2
-Wall' instead of the default, and that "make install" should install PCRE2
under /opt/local instead of the default /usr/local.
If you want to build in a different directory, just run "configure" with that
directory as current. For example, suppose you have unpacked the PCRE2 source
into /source/pcre2/pcre2-xxx, but you want to build it in
/build/pcre2/pcre2-xxx:
cd /build/pcre2/pcre2-xxx
/source/pcre2/pcre2-xxx/configure
PCRE2 is written in C and is normally compiled as a C library. However, it is
possible to build it as a C++ library, though the provided building apparatus
does not have any features to support this.
There are some optional features that can be included or omitted from the PCRE2
library. They are also documented in the pcre2build man page.
. By default, both shared and static libraries are built. You can change this
by adding one of these options to the "configure" command:
--disable-shared
--disable-static
Setting --disable-shared ensures that PCRE2 libraries are built as static
libraries. The binaries that are then created as part of the build process
(for example, pcre2test and pcre2grep) are linked statically with one or more
PCRE2 libraries, but may also be dynamically linked with other libraries such
as libc. If you want these binaries to be fully statically linked, you can
set LDFLAGS like this:
LDFLAGS=--static ./configure --disable-shared
Note the two hyphens in --static. Of course, this works only if static
versions of all the relevant libraries are available for linking. See also
"Shared libraries" below.
. By default, only the 8-bit library is built. If you add --enable-pcre2-16 to
the "configure" command, the 16-bit library is also built. If you add
--enable-pcre2-32 to the "configure" command, the 32-bit library is also
built. If you want only the 16-bit or 32-bit library, use --disable-pcre2-8
to disable building the 8-bit library.
. If you want to include support for just-in-time (JIT) compiling, which can
give large performance improvements on certain platforms, add --enable-jit to
the "configure" command. This support is available only for certain hardware
architectures. If you try to enable it on an unsupported architecture, there
will be a compile time error. If in doubt, use --enable-jit=auto, which
enables JIT only if the current hardware is supported.
. If you are enabling JIT under an SELinux environment you may also want to add
--enable-jit-sealloc, which enables the use of an executable memory allocator
that is compatible with SELinux. Warning: this allocator is experimental!
It does not support the fork() operation and may crash when no disk space is
available. This option has no effect if JIT is disabled.
. If you do not want to make use of the default support for UTF-8 Unicode
character strings in the 8-bit library, UTF-16 Unicode character strings in
the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
library, you can add --disable-unicode to the "configure" command. This
reduces the size of the libraries. It is not possible to configure one
library with Unicode support, and another without, in the same configuration.
It is also not possible to use --enable-ebcdic (see below) with Unicode
support, so if this option is set, you must also use --disable-unicode.
When Unicode support is available, the use of a UTF encoding still has to be
enabled by setting the PCRE2_UTF option at run time or starting a pattern
with (*UTF). When PCRE2 is compiled with Unicode support, its input can be
only ASCII or UTF-8/16/32, even when running on EBCDIC platforms.
As well as supporting UTF strings, Unicode support includes support for the
\P, \p, and \X sequences that recognize Unicode character properties.
However, only a subset of Unicode properties are supported; see the
pcre2pattern man page for details. Escape sequences such as \d and \w in
patterns do not by default make use of Unicode properties, but can be made to
do so by setting the PCRE2_UCP option or starting a pattern with (*UCP).
. You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
of the preceding, or any of the Unicode newline sequences, or the NUL (zero)
character as indicating the end of a line. Whatever you specify at build time
is the default; the caller of PCRE2 can change the selection at run time. The
default newline indicator is a single LF character (the Unix standard). You
can specify the default newline indicator by adding --enable-newline-is-cr,
--enable-newline-is-lf, --enable-newline-is-crlf,
--enable-newline-is-anycrlf, --enable-newline-is-any, or
--enable-newline-is-nul to the "configure" command, respectively.
. By default, the sequence \R in a pattern matches any Unicode line ending
sequence. This is independent of the option specifying what PCRE2 considers
to be the end of a line (see above). However, the caller of PCRE2 can
restrict \R to match only CR, LF, or CRLF. You can make this the default by
adding --enable-bsr-anycrlf to the "configure" command (bsr = "backslash R").
. In a pattern, the escape sequence \C matches a single code unit, even in a
UTF mode. This can be dangerous because it breaks up multi-code-unit
characters. You can build PCRE2 with the use of \C permanently locked out by
adding --enable-never-backslash-C (note the upper case C) to the "configure"
command. When \C is allowed by the library, individual applications can lock
it out by calling pcre2_compile() with the PCRE2_NEVER_BACKSLASH_C option.
. PCRE2 has a counter that limits the depth of nesting of parentheses in a
pattern. This limits the amount of system stack that a pattern uses when it
is compiled. The default is 250, but you can change it by setting, for
example,
--with-parens-nest-limit=500
. PCRE2 has a counter that can be set to limit the amount of computing resource
it uses when matching a pattern. If the limit is exceeded during a match, the
match fails. The default is ten million. You can change the default by
setting, for example,
--with-match-limit=500000
on the "configure" command. This is just the default; individual calls to
pcre2_match() or pcre2_dfa_match() can supply their own value. There is more
discussion in the pcre2api man page (search for pcre2_set_match_limit).
. There is a separate counter that limits the depth of nested backtracking
(pcre2_match()) or nested function calls (pcre2_dfa_match()) during a
matching process, which indirectly limits the amount of heap memory that is
used, and in the case of pcre2_dfa_match() the amount of stack as well. This
counter also has a default of ten million, which is essentially "unlimited".
You can change the default by setting, for example,
--with-match-limit-depth=5000
There is more discussion in the pcre2api man page (search for
pcre2_set_depth_limit).
. You can also set an explicit limit on the amount of heap memory used by
the pcre2_match() and pcre2_dfa_match() interpreters:
--with-heap-limit=500
The units are kibibytes (units of 1024 bytes). This limit does not apply when
the JIT optimization (which has its own memory control features) is used.
There is more discussion on the pcre2api man page (search for
pcre2_set_heap_limit).
. In the 8-bit library, the default maximum compiled pattern size is around
64 kibibytes. You can increase this by adding --with-link-size=3 to the
"configure" command. PCRE2 then uses three bytes instead of two for offsets
to different parts of the compiled pattern. In the 16-bit library,
--with-link-size=3 is the same as --with-link-size=4, which (in both
libraries) uses four-byte offsets. Increasing the internal link size reduces
performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
link size setting is ignored, as 4-byte offsets are always used.
. Lookbehind assertions in which one or more branches can match a variable
number of characters are supported only if there is a maximum matching length
for each top-level branch. There is a limit to this maximum that defaults to
255 characters. You can alter this default by a setting such as
--with-max-varlookbehind=100
The limit can be changed at runtime by calling pcre2_set_max_varlookbehind().
Lookbehind assertions in which every branch matches a fixed number of
characters (not necessarily all the same) are not constrained by this limit.
. For speed, PCRE2 uses four tables for manipulating and identifying characters
whose code point values are less than 256. By default, it uses a set of
tables for ASCII encoding that is part of the distribution. If you specify
--enable-rebuild-chartables
a program called pcre2_dftables is compiled and run in the default C locale
when you run "make". It builds a source file called pcre2_chartables.c. If
you do not specify this option, pcre2_chartables.c is created as a copy of
pcre2_chartables.c.dist. See "Character tables" below for further
information.
. It is possible to compile PCRE2 for use on systems that use EBCDIC as their
character code (as opposed to ASCII/Unicode) by specifying
--enable-ebcdic --disable-unicode
This automatically implies --enable-rebuild-chartables (see above). However,
when PCRE2 is built this way, it always operates in EBCDIC. It cannot support
both EBCDIC and UTF-8/16/32. There is a second option, --enable-ebcdic-nl25,
which specifies that the code value for the EBCDIC NL character is 0x25
instead of the default 0x15.
. If you specify --enable-debug, additional debugging code is included in the
build. This option is intended for use by the PCRE2 maintainers.
. In environments where valgrind is installed, if you specify
--enable-valgrind
PCRE2 will use valgrind annotations to mark certain memory regions as
unaddressable. This allows it to detect invalid memory accesses, and is
mostly useful for debugging PCRE2 itself.
. In environments where the gcc compiler is used and lcov is installed, if you
specify
--enable-coverage
the build is set up to produce a code coverage report for the test suite. The
report is generated by running "make coverage". If ccache is installed on
your system, it must be disabled when building PCRE2 for coverage reporting.
You can do this by setting the environment variable CCACHE_DISABLE=1 before
running "make" to build PCRE2. There is more information about coverage
reporting in the "pcre2build" documentation.
. When JIT support is enabled, pcre2grep automatically makes use of it, unless
you add --disable-pcre2grep-jit to the "configure" command.
. There is support for calling external programs during matching in the
pcre2grep command, using PCRE2's callout facility with string arguments. This
support can be disabled by adding --disable-pcre2grep-callout to the
"configure" command. There are two kinds of callout: one that generates
output from inbuilt code, and another that calls an external program. The
latter has special support for Windows and VMS; otherwise it assumes the
existence of the fork() function. This facility can be disabled by adding
--disable-pcre2grep-callout-fork to the "configure" command.
. The pcre2grep program currently supports only 8-bit data files, and so
requires the 8-bit PCRE2 library. It is possible to compile pcre2grep to use
libz and/or libbz2, in order to read .gz and .bz2 files (respectively), by
specifying one or both of
--enable-pcre2grep-libz
--enable-pcre2grep-libbz2
Of course, the relevant libraries must be installed on your system.
. The default starting size (in bytes) of the internal buffer used by pcre2grep
can be set by, for example:
--with-pcre2grep-bufsize=51200
The value must be a plain integer. The default is 20480. The amount of memory
used by pcre2grep is actually three times this number, to allow for "before"
and "after" lines. If very long lines are encountered, the buffer is
automatically enlarged, up to a fixed maximum size.
. The default maximum size of pcre2grep's internal buffer can be set by, for
example:
--with-pcre2grep-max-bufsize=2097152
The default is either 1048576 or the value of --with-pcre2grep-bufsize,
whichever is the larger.
. It is possible to compile pcre2test so that it links with the libreadline
or libedit libraries, by specifying, respectively,
--enable-pcre2test-libreadline or --enable-pcre2test-libedit
If this is done, when pcre2test's input is from a terminal, it reads it using
the readline() function. This provides line-editing and history facilities.
Note that libreadline is GPL-licensed, so if you distribute a binary of
pcre2test linked in this way, there may be licensing issues. These can be
avoided by linking with libedit (which has a BSD licence) instead.
Enabling libreadline causes the -lreadline option to be added to the
pcre2test build. In many operating environments with a system-installed
readline library this is sufficient. However, in some environments (e.g. if
an unmodified distribution version of readline is in use), it may be
necessary to specify something like LIBS="-lncurses" as well. This is
because, to quote the readline INSTALL, "Readline uses the termcap functions,
but does not link with the termcap or curses library itself, allowing
applications which link with readline the option to choose an appropriate
library." If you get error messages about missing functions tgetstr, tgetent,
tputs, tgetflag, or tgoto, this is the problem, and linking with the ncurses
library should fix it.
. The C99 standard defines formatting modifiers z and t for size_t and
ptrdiff_t values, respectively. By default, PCRE2 uses these modifiers in
environments other than Microsoft Visual Studio versions earlier than 2013
when __STDC_VERSION__ is defined and has a value greater than or equal to
199901L (indicating C99). However, there is at least one environment that
claims to be C99 but does not support these modifiers. If
--disable-percent-zt is specified, no use is made of the z or t modifiers.
Instead of %td or %zu, %lu is used, with a cast for size_t values.
. There is a special option called --enable-fuzz-support for use by people who
want to run fuzzing tests on PCRE2. If set, it causes an extra library
called libpcre2-fuzzsupport.a to be built, but not installed. This contains
a single function called LLVMFuzzerTestOneInput() whose arguments are a
pointer to a string and the length of the string. When called, this function
tries to compile the string as a pattern, and if that succeeds, to match
it. This is done both with no options and with some random options bits that
are generated from the string. Setting --enable-fuzz-support also causes an
executable called pcre2fuzzcheck-{8,16,32} to be created. This is normally
run under valgrind or used when PCRE2 is compiled with address sanitizing
enabled. It calls the fuzzing function and outputs information about what it
is doing. The input strings are specified by arguments: if an argument
starts with "=" the rest of it is a literal input string. Otherwise, it is
assumed to be a file name, and the contents of the file are the test string.
. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
which caused pcre2_match() to use individual blocks on the heap for
backtracking instead of recursive function calls (which use the stack). This
is now obsolete because pcre2_match() was refactored always to use the heap
(in a much more efficient way than before). This option is retained for
backwards compatibility, but has no effect other than to output a warning.
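As an illustration of the --with-link-size option described above, the "around
64 kibibytes" figure follows from the width of the offsets: a two-byte offset
can distinguish 2^16 positions. The following C fragment is a minimal sketch of
that arithmetic. The function name link_offset_range is hypothetical (it is not
part of PCRE2), and the result is only the range that an offset of the given
width can represent; the limit that PCRE2 actually enforces may be lower.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE2 function: returns the number of distinct
   byte offsets that can be stored in link_size bytes of 8 bits each. */
uint64_t link_offset_range(unsigned link_size)
{
  /* The text above mentions link sizes 2 (the default), 3 and 4. */
  assert(link_size >= 2 && link_size <= 4);
  return (uint64_t)1 << (8 * link_size);
}
```

For example, link_offset_range(2) is 65536 (64 kibibytes) and
link_offset_range(3) is 16777216 (16 mebibytes).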
The "configure" script builds the following files for the basic C library:
. Makefile             the makefile that builds the library
. src/config.h         build-time configuration options for the library
. src/pcre2.h          the public PCRE2 header file
. pcre2-config         script that shows the building settings such as CFLAGS
                         that were set for "configure"
. libpcre2-8.pc        )
. libpcre2-16.pc       ) data for the pkg-config command
. libpcre2-32.pc       )
. libpcre2-posix.pc    )
. libtool              script that builds shared and/or static libraries
Versions of config.h and pcre2.h are distributed in the src directory of PCRE2
tarballs under the names config.h.generic and pcre2.h.generic. These are
provided for those who have to build PCRE2 without using "configure" or CMake.
If you use "configure" or CMake, the .generic versions are not used.
The "configure" script also creates config.status, which is an executable
script that can be run to recreate the configuration, and config.log, which
contains compiler output from tests that "configure" runs.
Once "configure" has run, you can run "make". This builds whichever of the
libraries libpcre2-8, libpcre2-16 and libpcre2-32 are configured, and a test
program called pcre2test. If you enabled JIT support with --enable-jit, another
test program called pcre2_jit_test is built as well. If the 8-bit library is
built, libpcre2-posix, pcre2posix_test, and the pcre2grep command are also
built. Running "make" with the -j option may speed up compilation on
multiprocessor systems.
The command "make check" runs all the appropriate tests. Details of the PCRE2
tests are given below in a separate section of this document. The -j option of
"make" can also be used when running the tests.
You can use "make install" to install PCRE2 into live directories on your
system. The following are installed (file names are all relative to the
<prefix> that is set when "configure" is run):
Commands (bin):
pcre2test
pcre2grep (if 8-bit support is enabled)
pcre2-config
Libraries (lib):
libpcre2-8 (if 8-bit support is enabled)
libpcre2-16 (if 16-bit support is enabled)
libpcre2-32 (if 32-bit support is enabled)
libpcre2-posix (if 8-bit support is enabled)
Configuration information (lib/pkgconfig):
libpcre2-8.pc
libpcre2-16.pc
libpcre2-32.pc
libpcre2-posix.pc
Header files (include):
pcre2.h
pcre2posix.h
Man pages (share/man/man{1,3}):
pcre2grep.1
pcre2test.1
pcre2-config.1
pcre2.3
pcre2*.3 (lots more pages, all starting "pcre2")
HTML documentation (share/doc/pcre2/html):
index.html
*.html (lots more pages, hyperlinked from index.html)
Text file documentation (share/doc/pcre2):
AUTHORS
COPYING
ChangeLog
LICENCE
NEWS
README
SECURITY
pcre2.txt          (a concatenation of the man(3) pages)
pcre2test.txt      the pcre2test man page
pcre2grep.txt      the pcre2grep man page
pcre2-config.txt   the pcre2-config man page
If you want to remove PCRE2 from your system, you can run "make uninstall".
This removes all the files that "make install" installed. However, it does not
remove any directories, because these are often shared with other programs.
Retrieving configuration information
------------------------------------
Running "make install" installs the command pcre2-config, which can be used to
recall information about the PCRE2 configuration and installation. For example:
pcre2-config --version
prints the version number, and
pcre2-config --libs8
outputs information about where the 8-bit library is installed. This command
can be included in makefiles for programs that use PCRE2, saving the programmer
from having to remember too many details. Run pcre2-config with no arguments to
obtain a list of possible arguments.
The pkg-config command is another system for saving and retrieving information
about installed libraries. Instead of separate commands for each library, a
single command is used. For example:
pkg-config --libs libpcre2-16
The data is held in *.pc files that are installed in a directory called
<prefix>/lib/pkgconfig.
Shared libraries
----------------
The default distribution builds PCRE2 as shared libraries and static libraries,
as long as the operating system supports shared libraries. Shared library
support relies on the "libtool" script which is built as part of the
"configure" process.
The libtool script is used to compile and link both shared and static
libraries. They are placed in a subdirectory called .libs when they are newly
built. The programs pcre2test and pcre2grep are built to use these uninstalled
libraries (by means of wrapper scripts in the case of shared libraries). When
you use "make install" to install shared libraries, pcre2grep and pcre2test are
automatically re-built to use the newly installed shared libraries before being
installed themselves. However, the versions left in the build directory still
use the uninstalled libraries.
To build PCRE2 using static libraries only you must use --disable-shared when
configuring it. For example:
./configure --prefix=/usr/gnu --disable-shared
Then run "make" in the usual way. Similarly, you can use --disable-static to
build only shared libraries. Note, however, that when you build only static
libraries, binary programs such as pcre2test and pcre2grep may still be
dynamically linked with other libraries (for example, libc) unless you set
LDFLAGS to --static when running "configure".
Cross-compiling using autotools
-------------------------------
You can specify CC and CFLAGS in the normal way to the "configure" command, in
order to cross-compile PCRE2 for some other host. However, you should NOT
specify --enable-rebuild-chartables, because if you do, the pcre2_dftables.c
source file is compiled and run on the local host, in order to generate the
inbuilt character tables (the pcre2_chartables.c file). This will probably not
work, because pcre2_dftables.c needs to be compiled with the local compiler,
not the cross compiler.
When --enable-rebuild-chartables is not specified, pcre2_chartables.c is
created by making a copy of pcre2_chartables.c.dist, which is a default set of
tables that assumes ASCII code. Cross-compiling with the default tables should
not be a problem.
If you need to modify the character tables when cross-compiling, you should
move pcre2_chartables.c.dist out of the way, then compile pcre2_dftables.c by
hand and run it on the local host to make a new version of
pcre2_chartables.c.dist. See the pcre2build section "Creating character tables
at build time" for more details.
Making new tarballs
-------------------
The command "make dist" creates three PCRE2 tarballs, in tar.gz, tar.bz2, and
zip formats. The command "make distcheck" does the same, but then does a trial
build of the new distribution to ensure that it works.
If you have modified any of the man page sources in the doc directory, you
should first run the maint/PrepareRelease script before making a distribution.
This script creates the .txt and HTML forms of the documentation from the man
pages.
Testing PCRE2
-------------
To test the basic PCRE2 library on a Unix-like system, run the RunTest script.
There is another script called RunGrepTest that tests the pcre2grep command.
When the 8-bit library is built, a test program for the POSIX wrapper, called
pcre2posix_test, is compiled, and when JIT support is enabled, a test program
called pcre2_jit_test is built. The scripts and the program tests are all run
when you run "make check". For other environments, see the instructions in
NON-AUTOTOOLS-BUILD.
The RunTest script runs the pcre2test test program (which is documented in its
own man page) on each of the relevant testinput files in the testdata
directory, and compares the output with the contents of the corresponding
testoutput files. RunTest uses a file called testtry to hold the main output
from pcre2test. Other files whose names begin with "test" are used as working
files in some tests.
Some tests are relevant only when certain build-time options were selected. For
example, the tests for UTF-8/16/32 features are run only when Unicode support
is available. RunTest outputs a comment when it skips a test.
Many (but not all) of the tests that are not skipped are run twice if JIT
support is available. On the second run, JIT compilation is forced. This
testing can be suppressed by putting "-nojit" on the RunTest command line.
The entire set of tests is run once for each of the 8-bit, 16-bit and 32-bit
libraries that are enabled. If you want to run just one set of tests, call
RunTest with either the -8, -16 or -32 option.
If valgrind is installed, you can run the tests under it by putting "-valgrind"
on the RunTest command line. To run pcre2test on just one or more specific test
files, give their numbers as arguments to RunTest, for example:
RunTest 2 7 11
You can also specify ranges of tests such as 3-6 or 3- (meaning 3 to the
end), or a number preceded by ~ to exclude a test. For example:
RunTest 3-15 ~10
This runs tests 3 to 15, excluding test 10. Giving just ~13 as the argument
runs all the tests except test 13. Whatever order the arguments are in, the
tests are always run in numerical order.
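The selection rules just described can be summarised in a few lines of C. This
is only a sketch of the rules as stated above, not the code of the RunTest
script itself; the function name select_tests and the last_test parameter (the
highest test number) are assumptions made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper, not RunTest code: sets selected[0..last_test] from
   arguments such as "2", "3-6", "3-" and "~10". Returns false if an argument
   is malformed or names a test beyond last_test. */
bool select_tests(int argc, const char *const argv[], int last_test,
                  bool selected[])
{
  bool any_positive = false;
  int i, t;

  assert(selected != NULL && last_test >= 0);
  for (t = 0; t <= last_test; t++) selected[t] = false;

  /* Pass 1: plain numbers and ranges add tests to the selection. */
  for (i = 0; i < argc; i++) {
    const char *arg = argv[i];
    char *end;
    long first, last;

    if (arg[0] == '~') continue;                 /* handled in pass 2 */
    if (!isdigit((unsigned char)arg[0])) return false;
    first = last = strtol(arg, &end, 10);
    if (*end == '-') {
      if (end[1] == '\0') {
        last = last_test;                        /* "3-" means 3 to the end */
      } else {
        if (!isdigit((unsigned char)end[1])) return false;
        last = strtol(end + 1, &end, 10);
        if (*end != '\0') return false;
      }
    } else if (*end != '\0') {
      return false;
    }
    if (first > last || last > last_test) return false;
    for (t = (int)first; t <= (int)last; t++) selected[t] = true;
    any_positive = true;
  }

  /* With exclusions only (or no arguments), start from "all tests". */
  if (!any_positive)
    for (t = 0; t <= last_test; t++) selected[t] = true;

  /* Pass 2: "~N" removes test N, wherever it appears on the command line. */
  for (i = 0; i < argc; i++) {
    const char *arg = argv[i];
    char *end;
    long n;

    if (arg[0] != '~') continue;
    if (!isdigit((unsigned char)arg[1])) return false;
    n = strtol(arg + 1, &end, 10);
    if (*end != '\0' || n > last_test) return false;
    selected[n] = false;
  }
  return true;
}
```

Because the result is an array indexed by test number, walking it from 0
upwards gives the numerical order mentioned above, whatever order the arguments
were in.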
You can also call RunTest with the single argument "list" to cause it to output
a list of tests.
The test sequence starts with "test 0", which is a special test that has no
input file, and whose output is not checked. This is because it will be
different on different hardware and with different configurations. The test
exists in order to exercise some of pcre2test's code that would not otherwise
be run.
Tests 1 and 2 can always be run, as they expect only plain text strings (not
UTF) and make no use of Unicode properties. The first test file can be fed
directly into the perltest.sh script to check that Perl gives the same results.
The only difference you should see is in the first few lines, where the Perl
version is given instead of the PCRE2 version. The second set of tests checks
auxiliary functions, error detection, and run-time flags that are specific to
PCRE2. It also uses the debugging flags to check some of the internals of
pcre2_compile().
If you build PCRE2 with a locale setting that is not the standard C locale, the
character tables may be different (see next paragraph). In some cases, this may
cause failures in the second set of tests. For example, in a locale where the
isprint() function yields TRUE for characters in the range 128-255, the use of
[:isascii:] inside a character class defines a different set of characters, and
this shows up in this test as a difference in the compiled code, which is being
listed for checking. For example, where the comparison test output contains
[\x00-\x7f] the test might contain [\x00-\xff], and similarly in some other
cases. This is not a bug in PCRE2.
Test 3 checks pcre2_maketables(), the facility for building a set of character
tables for a specific locale and using them instead of the default tables. The
script uses the "locale" command to check for the availability of the "fr_FR",
"french", or "fr" locale, and uses the first one that it finds. If the "locale"
command fails, or if its output doesn't include "fr_FR", "french", or "fr" in
the list of available locales, the third test cannot be run, and a comment is
output to say why. If running this test produces an error like this:
** Failed to set locale "fr_FR"
it means that the given locale is not available on your system, despite being
listed by "locale". This does not mean that PCRE2 is broken. There are three
alternative output files for the third test, because three different versions
of the French locale have been encountered. The test passes if its output
matches any one of them.
Tests 4 and 5 check UTF and Unicode property support, test 4 being compatible
with the perltest.sh script, and test 5 checking PCRE2-specific things.
Tests 6 and 7 check the pcre2_dfa_match() alternative matching function, in
non-UTF mode and UTF-mode with Unicode property support, respectively.
Test 8 checks some internal offsets and code size features, but it is run only
when Unicode support is enabled. The output is different in 8-bit, 16-bit, and
32-bit modes and for different link sizes, so there are different output files
for each mode and link size.
Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
16-bit and 32-bit modes. These are tests that generate different output in
8-bit mode. In each pair, the first test covers general cases and the second
covers Unicode support.
Test 13 checks the handling of non-UTF characters greater than 255 by
pcre2_dfa_match() in 16-bit and 32-bit modes.
Test 14 contains some special UTF and UCP tests that give different output for
different code unit widths.
Test 15 contains a number of tests that must not be run with JIT. They check,
among other non-JIT things, the match-limiting features of the interpretive
matcher.
Test 16 is run only when JIT support is not available. It checks that an
attempt to use JIT has the expected behaviour.
Test 17 is run only when JIT support is available. It checks JIT complete and
partial modes, match-limiting under JIT, and other JIT-specific features.
Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
the 8-bit library, without and with Unicode support, respectively.
Test 20 checks the serialization functions by writing a set of compiled
patterns to a file, and then reloading and checking them.
Tests 21 and 22 test \C support when the use of \C is not locked out, without
and with UTF support, respectively. Test 23 tests \C when it is locked out.
Tests 24 and 25 test the experimental pattern conversion functions, without and
with UTF support, respectively.
Test 26 checks Unicode property support using tests that are generated
automatically from the Unicode data tables.
Character tables
----------------
For speed, PCRE2 uses four tables for manipulating and identifying characters
whose code point values are less than 256. By default, a set of tables that is
built into the library is used. The pcre2_maketables() function can be called
by an application to create a new set of tables in the current locale. These are
passed to PCRE2 by calling pcre2_set_character_tables() to put a pointer into a
compile context.
The source file called pcre2_chartables.c contains the default set of tables.
By default, this is created as a copy of pcre2_chartables.c.dist, which
contains tables for ASCII coding. However, if --enable-rebuild-chartables is
specified for ./configure, a new version of pcre2_chartables.c is built by the
program pcre2_dftables (compiled from pcre2_dftables.c), which uses the ANSI C
character handling functions such as isalnum(), isalpha(), isupper(),
islower(), etc. to build the table sources. This means that the default C
locale that is set for your system will control the contents of these default
tables. You can change the default tables by editing pcre2_chartables.c and
then re-building PCRE2. If you do this, you should take care to ensure that the
file does not get automatically re-generated. The best way to do this is to
move pcre2_chartables.c.dist out of the way and replace it with your customized
tables.
When the pcre2_dftables program is run as a result of specifying
--enable-rebuild-chartables, it uses the default C locale that is set on your
system. It does not pay attention to the LC_xxx environment variables. In other
words, it uses the system's default locale rather than whatever the compiling
user happens to have set. If you really do want to build a source set of
character tables in a locale that is specified by the LC_xxx variables, you can
run the pcre2_dftables program by hand with the -L option. For example:
./pcre2_dftables -L pcre2_chartables.c.special
The second argument names the file where the source code for the tables is
written. The first two 256-byte tables provide lower casing and case flipping
functions, respectively. The next table consists of a number of 32-byte bit
maps which identify certain character classes such as digits, "word"
characters, white space, etc. These are used when building 32-byte bit maps
that represent character classes for code points less than 256. The final
256-byte table has bits indicating various character types, as follows:
 1   white space character
 2   letter
 4   lower case letter
 8   decimal digit
16   alphanumeric or '_'
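The way such tables can be derived from the C character handling functions is
sketched below. This is not the pcre2_dftables source: the function and
constant names are made up for the example, only the digit bit map is shown,
and the bit order within each bit map byte (least significant bit first) is an
assumption. The flag values are the ones listed above.

```c
#include <ctype.h>
#include <string.h>

/* Flag values from the list above; the names are hypothetical. */
enum { CT_SPACE = 1, CT_LETTER = 2, CT_LOWER = 4, CT_DIGIT = 8, CT_WORD = 16 };

/* Lower casing and case flipping tables, one entry per code point < 256. */
void build_case_tables(unsigned char lower[256], unsigned char flip[256])
{
  for (int c = 0; c < 256; c++) {
    lower[c] = (unsigned char)tolower(c);
    flip[c] = (unsigned char)(islower(c) ? toupper(c) : tolower(c));
  }
}

/* A 32-byte bit map with one bit per code point; set bits mark digits.
   Assumption: code point c lives in byte c/8, bit c%8. */
void build_digit_bitmap(unsigned char map[32])
{
  memset(map, 0, 32);
  for (int c = 0; c < 256; c++)
    if (isdigit(c)) map[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* The final 256-byte table of character type bits. */
void build_ctypes(unsigned char ctypes[256])
{
  for (int c = 0; c < 256; c++) {
    unsigned char bits = 0;
    if (isspace(c)) bits |= CT_SPACE;
    if (isalpha(c)) bits |= CT_LETTER;
    if (islower(c)) bits |= CT_LOWER;
    if (isdigit(c)) bits |= CT_DIGIT;
    if (isalnum(c) || c == '_') bits |= CT_WORD;
    ctypes[c] = bits;
  }
}
```

In the default C locale, the entry for 'a' comes out as 22 (letter, lower case
letter, and alphanumeric) and the entry for '_' as 16. Calling setlocale()
before building the tables would make the results follow that locale instead.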
You can also specify -b (with or without -L) when running pcre2_dftables. This
causes the tables to be written in binary instead of as source code. A set of
binary tables can be loaded into memory by an application and passed to
pcre2_compile() in the same way as tables created dynamically by calling
pcre2_maketables(). The tables are just a string of bytes, independent of
hardware characteristics such as endianness. This means they can be bundled
with an application that runs in different environments, to ensure consistent
behaviour.
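Because binary tables are just a string of bytes, loading them needs nothing
beyond standard file I/O. The following is a minimal sketch; load_tables() is a
hypothetical helper name, not part of PCRE2, and it makes no attempt to check
that the file really contains a set of tables.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads a whole binary table file into a malloc'ed
   buffer. On success returns the buffer and stores its size in *length; on
   any failure (including an empty file) returns NULL. The caller frees it. */
unsigned char *load_tables(const char *path, size_t *length)
{
  FILE *f;
  unsigned char *buf = NULL;
  long size = 0;

  assert(path != NULL && length != NULL);
  f = fopen(path, "rb");
  if (f == NULL) return NULL;

  /* Find the file size, then read the whole file in one go. */
  if (fseek(f, 0, SEEK_END) == 0 && (size = ftell(f)) > 0 &&
      fseek(f, 0, SEEK_SET) == 0) {
    buf = malloc((size_t)size);
    if (buf != NULL && fread(buf, 1, (size_t)size, f) != (size_t)size) {
      free(buf);
      buf = NULL;
    }
    if (buf != NULL) *length = (size_t)size;
  }
  fclose(f);
  return buf;
}
```

The returned pointer can then be given to pcre2_set_character_tables(), as
described at the start of this section, in place of a pointer obtained from
pcre2_maketables().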
See also the pcre2build section "Creating character tables at build time".
File manifest
-------------
The distribution should contain the files listed below.
(A) Source files for the PCRE2 library functions and their headers are found in
the src directory:
src/pcre2_dftables.c        auxiliary program for building pcre2_chartables.c
                            when --enable-rebuild-chartables is specified
src/pcre2_chartables.c.dist a default set of character tables that assume
                            ASCII coding; unless --enable-rebuild-chartables is
                            specified, used by copying to pcre2_chartables.c
src/pcre2posix.c            )
src/pcre2_auto_possess.c    )
src/pcre2_chkdint.c         )
src/pcre2_compile.c         )
src/pcre2_compile_class.c   )
src/pcre2_config.c          )
src/pcre2_context.c         )
src/pcre2_convert.c         )
src/pcre2_dfa_match.c       )
src/pcre2_error.c           )
src/pcre2_extuni.c          )
src/pcre2_find_bracket.c    )
src/pcre2_jit_compile.c     )
src/pcre2_jit_match.c       ) sources for the functions in the library,
src/pcre2_jit_misc.c        )   and some internal functions that they use
src/pcre2_maketables.c      )
src/pcre2_match.c           )
src/pcre2_match_data.c      )
src/pcre2_newline.c         )
src/pcre2_ord2utf.c         )
src/pcre2_pattern_info.c    )
src/pcre2_script_run.c      )
src/pcre2_serialize.c       )
src/pcre2_string_utils.c    )
src/pcre2_study.c           )
src/pcre2_substitute.c      )
src/pcre2_substring.c       )
src/pcre2_tables.c          )
src/pcre2_ucd.c             )
src/pcre2_ucptables.c       )
src/pcre2_valid_utf.c       )
src/pcre2_xclass.c          )
src/pcre2_printint.c        debugging function that is used by pcre2test
src/pcre2_fuzzsupport.c     function for (optional) fuzzing support
src/config.h.in             template for config.h, when built by "configure"
src/pcre2.h.in              template for pcre2.h when built by "configure"
src/pcre2posix.h            header for the external POSIX wrapper API
src/pcre2_compile.h         header for internal use
src/pcre2_internal.h        header for internal use
src/pcre2_intmodedep.h      a mode-specific internal header
src/pcre2_jit_char_inc.h    header used by JIT
src/pcre2_jit_neon_inc.h    header used by JIT
src/pcre2_jit_simd_inc.h    header used by JIT
src/pcre2_ucp.h             header for Unicode property handling
src/pcre2_util.h            header for internal utils
deps/sljit/sljit_src/*      source files for the JIT compiler
(B) Source files for programs that use PCRE2:
src/pcre2demo.c             simple demonstration of coding calls to PCRE2
src/pcre2grep.c             source of a grep utility that uses PCRE2
src/pcre2test.c             comprehensive test program
src/pcre2_jit_test.c        JIT test program
src/pcre2posix_test.c       POSIX wrapper API test program
(C) Auxiliary files:
AUTHORS.md              information about the authors of PCRE2
ChangeLog               log of changes to the code
HACKING                 some notes about the internals of PCRE2
INSTALL                 generic installation instructions
LICENCE.md              conditions for the use of PCRE2
COPYING                 the same, using GNU's standard name
SECURITY.md             information on reporting vulnerabilities
Makefile.in             ) template for Unix Makefile, which is built by
                        )   "configure"
Makefile.am             ) the automake input that was used to create
                        )   Makefile.in
NEWS                    important changes in this release
NON-AUTOTOOLS-BUILD     notes on building PCRE2 without using autotools
README                  this file
RunTest                 a Unix shell script for running tests
RunGrepTest             a Unix shell script for pcre2grep tests
RunTest.bat             a Windows batch file for running tests
RunGrepTest.bat         a Windows batch file for pcre2grep tests
aclocal.m4              m4 macros (generated by "aclocal")
m4/*                    m4 macros (used by autoconf)
configure               a configuring shell script (built by autoconf)
configure.ac            ) the autoconf input that was used to build
                        )   "configure" and config.h
doc/*.3                 man page sources for PCRE2
doc/*.1                 man page sources for pcre2grep and pcre2test
doc/html/*              HTML documentation
doc/pcre2.txt           plain text version of the man pages
doc/pcre2-config.txt    plain text documentation of pcre2-config script
doc/pcre2grep.txt       plain text documentation of grep utility program
doc/pcre2test.txt       plain text documentation of test program
libpcre2-8.pc.in        template for libpcre2-8.pc for pkg-config
libpcre2-16.pc.in       template for libpcre2-16.pc for pkg-config
libpcre2-32.pc.in       template for libpcre2-32.pc for pkg-config
libpcre2-posix.pc.in    template for libpcre2-posix.pc for pkg-config
ar-lib                  )
config.guess            )
config.sub              )
depcomp                 ) helper tools generated by libtool and
compile                 )   automake, used internally by ./configure
install-sh              )
ltmain.sh               )
missing                 )
test-driver             )
perltest.sh             Script for running a Perl test program
pcre2-config.in         source of script which retains PCRE2 information
testdata/testinput*     test data for main library tests
testdata/testoutput*    expected test results
testdata/grep*          input and output for pcre2grep tests
testdata/*              other supporting test files
(D) Auxiliary files for CMake support
cmake/COPYING-CMAKE-SCRIPTS
cmake/FindEditline.cmake
cmake/FindReadline.cmake
cmake/pcre2-config-version.cmake.in
cmake/pcre2-config.cmake.in
CMakeLists.txt
config-cmake.h.in
(E) Auxiliary files for building PCRE2 "by hand"
src/pcre2.h.generic     ) a version of the public PCRE2 header file
                        )   for use in non-"configure" environments
src/config.h.generic    ) a version of config.h for use in non-"configure"
                        )   environments
(F) Auxiliary files for building PCRE2 using other build systems
BUILD.bazel             )
MODULE.bazel            ) files used by the Bazel build system
WORKSPACE.bazel         )
build.zig               file used by zig's build system
(G) Auxiliary files for building PCRE2 under OpenVMS
vms/configure.com       )
vms/openvms_readme.txt  ) These files were contributed by a PCRE2 user.
vms/pcre2.h_patch       )
vms/stdint.h            )
==============================
Last updated: 18 December 2024
==============================


@@ -0,0 +1,327 @@
<html>
<!-- This is a manually maintained file that is the root of the HTML version of
the PCRE2 documentation. When the HTML documents are built from the man
page versions, the entire doc/html directory is emptied, this file is then
copied into doc/html/index.html, and the remaining files therein are
created by the 132html script.
-->
<head>
<title>PCRE2 specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>Perl-compatible Regular Expressions (revised API: PCRE2)</h1>
<p>
The HTML documentation for PCRE2 consists of a number of pages that are listed
below in alphabetical order. If you are new to PCRE2, please read the first one
first.
</p>
<table>
<tr><td><a href="pcre2.html">pcre2</a></td>
<td>&nbsp;&nbsp;Introductory page</td></tr>
<tr><td><a href="pcre2-config.html">pcre2-config</a></td>
<td>&nbsp;&nbsp;Information about the installation configuration</td></tr>
<tr><td><a href="pcre2api.html">pcre2api</a></td>
<td>&nbsp;&nbsp;PCRE2's native API</td></tr>
<tr><td><a href="pcre2build.html">pcre2build</a></td>
<td>&nbsp;&nbsp;Building PCRE2</td></tr>
<tr><td><a href="pcre2callout.html">pcre2callout</a></td>
<td>&nbsp;&nbsp;The <i>callout</i> facility</td></tr>
<tr><td><a href="pcre2compat.html">pcre2compat</a></td>
<td>&nbsp;&nbsp;Compatibility with Perl</td></tr>
<tr><td><a href="pcre2convert.html">pcre2convert</a></td>
<td>&nbsp;&nbsp;Experimental foreign pattern conversion functions</td></tr>
<tr><td><a href="pcre2demo.html">pcre2demo</a></td>
<td>&nbsp;&nbsp;A demonstration C program that uses the PCRE2 library</td></tr>
<tr><td><a href="pcre2grep.html">pcre2grep</a></td>
<td>&nbsp;&nbsp;The <b>pcre2grep</b> command</td></tr>
<tr><td><a href="pcre2jit.html">pcre2jit</a></td>
<td>&nbsp;&nbsp;Discussion of the just-in-time optimization support</td></tr>
<tr><td><a href="pcre2limits.html">pcre2limits</a></td>
<td>&nbsp;&nbsp;Details of size and other limits</td></tr>
<tr><td><a href="pcre2matching.html">pcre2matching</a></td>
<td>&nbsp;&nbsp;Discussion of the two matching algorithms</td></tr>
<tr><td><a href="pcre2partial.html">pcre2partial</a></td>
<td>&nbsp;&nbsp;Using PCRE2 for partial matching</td></tr>
<tr><td><a href="pcre2pattern.html">pcre2pattern</a></td>
<td>&nbsp;&nbsp;Specification of the regular expressions supported by PCRE2</td></tr>
<tr><td><a href="pcre2perform.html">pcre2perform</a></td>
<td>&nbsp;&nbsp;Some comments on performance</td></tr>
<tr><td><a href="pcre2posix.html">pcre2posix</a></td>
<td>&nbsp;&nbsp;The POSIX API to the PCRE2 8-bit library</td></tr>
<tr><td><a href="pcre2sample.html">pcre2sample</a></td>
<td>&nbsp;&nbsp;Discussion of the pcre2demo program</td></tr>
<tr><td><a href="pcre2serialize.html">pcre2serialize</a></td>
<td>&nbsp;&nbsp;Serializing functions for saving precompiled patterns</td></tr>
<tr><td><a href="pcre2syntax.html">pcre2syntax</a></td>
<td>&nbsp;&nbsp;Syntax quick-reference summary</td></tr>
<tr><td><a href="pcre2test.html">pcre2test</a></td>
<td>&nbsp;&nbsp;The <b>pcre2test</b> command for testing PCRE2</td></tr>
<tr><td><a href="pcre2unicode.html">pcre2unicode</a></td>
<td>&nbsp;&nbsp;Discussion of Unicode and UTF-8/UTF-16/UTF-32 support</td></tr>
</table>
<p>
There are also individual pages that summarize the interface for each function
in the library.
</p>
<table>
<tr><td><a href="pcre2_callout_enumerate.html">pcre2_callout_enumerate</a></td>
<td>&nbsp;&nbsp;Enumerate callouts in a compiled pattern</td></tr>
<tr><td><a href="pcre2_code_copy.html">pcre2_code_copy</a></td>
<td>&nbsp;&nbsp;Copy a compiled pattern</td></tr>
<tr><td><a href="pcre2_code_copy_with_tables.html">pcre2_code_copy_with_tables</a></td>
<td>&nbsp;&nbsp;Copy a compiled pattern and its character tables</td></tr>
<tr><td><a href="pcre2_code_free.html">pcre2_code_free</a></td>
<td>&nbsp;&nbsp;Free a compiled pattern</td></tr>
<tr><td><a href="pcre2_compile.html">pcre2_compile</a></td>
<td>&nbsp;&nbsp;Compile a regular expression pattern</td></tr>
<tr><td><a href="pcre2_compile_context_copy.html">pcre2_compile_context_copy</a></td>
<td>&nbsp;&nbsp;Copy a compile context</td></tr>
<tr><td><a href="pcre2_compile_context_create.html">pcre2_compile_context_create</a></td>
<td>&nbsp;&nbsp;Create a compile context</td></tr>
<tr><td><a href="pcre2_compile_context_free.html">pcre2_compile_context_free</a></td>
<td>&nbsp;&nbsp;Free a compile context</td></tr>
<tr><td><a href="pcre2_config.html">pcre2_config</a></td>
<td>&nbsp;&nbsp;Show build-time configuration options</td></tr>
<tr><td><a href="pcre2_convert_context_copy.html">pcre2_convert_context_copy</a></td>
<td>&nbsp;&nbsp;Copy a convert context</td></tr>
<tr><td><a href="pcre2_convert_context_create.html">pcre2_convert_context_create</a></td>
<td>&nbsp;&nbsp;Create a convert context</td></tr>
<tr><td><a href="pcre2_convert_context_free.html">pcre2_convert_context_free</a></td>
<td>&nbsp;&nbsp;Free a convert context</td></tr>
<tr><td><a href="pcre2_converted_pattern_free.html">pcre2_converted_pattern_free</a></td>
<td>&nbsp;&nbsp;Free converted foreign pattern</td></tr>
<tr><td><a href="pcre2_dfa_match.html">pcre2_dfa_match</a></td>
<td>&nbsp;&nbsp;Match a compiled pattern to a subject string
(DFA algorithm; <i>not</i> Perl compatible)</td></tr>
<tr><td><a href="pcre2_general_context_copy.html">pcre2_general_context_copy</a></td>
<td>&nbsp;&nbsp;Copy a general context</td></tr>
<tr><td><a href="pcre2_general_context_create.html">pcre2_general_context_create</a></td>
<td>&nbsp;&nbsp;Create a general context</td></tr>
<tr><td><a href="pcre2_general_context_free.html">pcre2_general_context_free</a></td>
<td>&nbsp;&nbsp;Free a general context</td></tr>
<tr><td><a href="pcre2_get_error_message.html">pcre2_get_error_message</a></td>
<td>&nbsp;&nbsp;Get textual error message for error number</td></tr>
<tr><td><a href="pcre2_get_mark.html">pcre2_get_mark</a></td>
<td>&nbsp;&nbsp;Get a (*MARK) name</td></tr>
<tr><td><a href="pcre2_get_match_data_size.html">pcre2_get_match_data_size</a></td>
<td>&nbsp;&nbsp;Get the size of a match data block</td></tr>
<tr><td><a href="pcre2_get_ovector_count.html">pcre2_get_ovector_count</a></td>
<td>&nbsp;&nbsp;Get the ovector count</td></tr>
<tr><td><a href="pcre2_get_ovector_pointer.html">pcre2_get_ovector_pointer</a></td>
<td>&nbsp;&nbsp;Get a pointer to the ovector</td></tr>
<tr><td><a href="pcre2_get_startchar.html">pcre2_get_startchar</a></td>
<td>&nbsp;&nbsp;Get the starting character offset</td></tr>
<tr><td><a href="pcre2_jit_compile.html">pcre2_jit_compile</a></td>
<td>&nbsp;&nbsp;Process a compiled pattern with the JIT compiler</td></tr>
<tr><td><a href="pcre2_jit_free_unused_memory.html">pcre2_jit_free_unused_memory</a></td>
<td>&nbsp;&nbsp;Free unused JIT memory</td></tr>
<tr><td><a href="pcre2_jit_match.html">pcre2_jit_match</a></td>
<td>&nbsp;&nbsp;Fast path interface to JIT matching</td></tr>
<tr><td><a href="pcre2_jit_stack_assign.html">pcre2_jit_stack_assign</a></td>
<td>&nbsp;&nbsp;Assign stack for JIT matching</td></tr>
<tr><td><a href="pcre2_jit_stack_create.html">pcre2_jit_stack_create</a></td>
<td>&nbsp;&nbsp;Create a stack for JIT matching</td></tr>
<tr><td><a href="pcre2_jit_stack_free.html">pcre2_jit_stack_free</a></td>
<td>&nbsp;&nbsp;Free a JIT matching stack</td></tr>
<tr><td><a href="pcre2_maketables.html">pcre2_maketables</a></td>
<td>&nbsp;&nbsp;Build character tables in current locale</td></tr>
<tr><td><a href="pcre2_maketables_free.html">pcre2_maketables_free</a></td>
<td>&nbsp;&nbsp;Free character tables</td></tr>
<tr><td><a href="pcre2_match.html">pcre2_match</a></td>
<td>&nbsp;&nbsp;Match a compiled pattern to a subject string
(Perl compatible)</td></tr>
<tr><td><a href="pcre2_match_context_copy.html">pcre2_match_context_copy</a></td>
<td>&nbsp;&nbsp;Copy a match context</td></tr>
<tr><td><a href="pcre2_match_context_create.html">pcre2_match_context_create</a></td>
<td>&nbsp;&nbsp;Create a match context</td></tr>
<tr><td><a href="pcre2_match_context_free.html">pcre2_match_context_free</a></td>
<td>&nbsp;&nbsp;Free a match context</td></tr>
<tr><td><a href="pcre2_match_data_create.html">pcre2_match_data_create</a></td>
<td>&nbsp;&nbsp;Create a match data block</td></tr>
<tr><td><a href="pcre2_match_data_create_from_pattern.html">pcre2_match_data_create_from_pattern</a></td>
<td>&nbsp;&nbsp;Create a match data block getting size from pattern</td></tr>
<tr><td><a href="pcre2_match_data_free.html">pcre2_match_data_free</a></td>
<td>&nbsp;&nbsp;Free a match data block</td></tr>
<tr><td><a href="pcre2_pattern_convert.html">pcre2_pattern_convert</a></td>
<td>&nbsp;&nbsp;Experimental foreign pattern converter</td></tr>
<tr><td><a href="pcre2_pattern_info.html">pcre2_pattern_info</a></td>
<td>&nbsp;&nbsp;Extract information about a pattern</td></tr>
<tr><td><a href="pcre2_serialize_decode.html">pcre2_serialize_decode</a></td>
<td>&nbsp;&nbsp;Decode serialized compiled patterns</td></tr>
<tr><td><a href="pcre2_serialize_encode.html">pcre2_serialize_encode</a></td>
<td>&nbsp;&nbsp;Serialize compiled patterns for save/restore</td></tr>
<tr><td><a href="pcre2_serialize_free.html">pcre2_serialize_free</a></td>
<td>&nbsp;&nbsp;Free serialized compiled patterns</td></tr>
<tr><td><a href="pcre2_serialize_get_number_of_codes.html">pcre2_serialize_get_number_of_codes</a></td>
<td>&nbsp;&nbsp;Get number of serialized compiled patterns</td></tr>
<tr><td><a href="pcre2_set_bsr.html">pcre2_set_bsr</a></td>
<td>&nbsp;&nbsp;Set \R convention</td></tr>
<tr><td><a href="pcre2_set_callout.html">pcre2_set_callout</a></td>
<td>&nbsp;&nbsp;Set up a callout function</td></tr>
<tr><td><a href="pcre2_set_character_tables.html">pcre2_set_character_tables</a></td>
<td>&nbsp;&nbsp;Set character tables</td></tr>
<tr><td><a href="pcre2_set_compile_extra_options.html">pcre2_set_compile_extra_options</a></td>
<td>&nbsp;&nbsp;Set compile time extra options</td></tr>
<tr><td><a href="pcre2_set_compile_recursion_guard.html">pcre2_set_compile_recursion_guard</a></td>
<td>&nbsp;&nbsp;Set up a compile recursion guard function</td></tr>
<tr><td><a href="pcre2_set_depth_limit.html">pcre2_set_depth_limit</a></td>
<td>&nbsp;&nbsp;Set the match backtracking depth limit</td></tr>
<tr><td><a href="pcre2_set_glob_escape.html">pcre2_set_glob_escape</a></td>
<td>&nbsp;&nbsp;Set glob escape character</td></tr>
<tr><td><a href="pcre2_set_glob_separator.html">pcre2_set_glob_separator</a></td>
<td>&nbsp;&nbsp;Set glob separator character</td></tr>
<tr><td><a href="pcre2_set_heap_limit.html">pcre2_set_heap_limit</a></td>
<td>&nbsp;&nbsp;Set the match backtracking heap limit</td></tr>
<tr><td><a href="pcre2_set_match_limit.html">pcre2_set_match_limit</a></td>
<td>&nbsp;&nbsp;Set the match limit</td></tr>
<tr><td><a href="pcre2_set_max_pattern_compiled_length.html">pcre2_set_max_pattern_compiled_length</a></td>
<td>&nbsp;&nbsp;Set the maximum length of a compiled pattern</td></tr>
<tr><td><a href="pcre2_set_max_pattern_length.html">pcre2_set_max_pattern_length</a></td>
<td>&nbsp;&nbsp;Set the maximum length of a pattern</td></tr>
<tr><td><a href="pcre2_set_max_varlookbehind.html">pcre2_set_max_varlookbehind</a></td>
<td>&nbsp;&nbsp;Set the maximum match length for a variable-length lookbehind</td></tr>
<tr><td><a href="pcre2_set_newline.html">pcre2_set_newline</a></td>
<td>&nbsp;&nbsp;Set the newline convention</td></tr>
<tr><td><a href="pcre2_set_offset_limit.html">pcre2_set_offset_limit</a></td>
<td>&nbsp;&nbsp;Set the offset limit</td></tr>
<tr><td><a href="pcre2_set_optimize.html">pcre2_set_optimize</a></td>
<td>&nbsp;&nbsp;Set an optimization directive</td></tr>
<tr><td><a href="pcre2_set_parens_nest_limit.html">pcre2_set_parens_nest_limit</a></td>
<td>&nbsp;&nbsp;Set the parentheses nesting limit</td></tr>
<tr><td><a href="pcre2_set_recursion_limit.html">pcre2_set_recursion_limit</a></td>
<td>&nbsp;&nbsp;Obsolete: use pcre2_set_depth_limit</td></tr>
<tr><td><a href="pcre2_set_recursion_memory_management.html">pcre2_set_recursion_memory_management</a></td>
<td>&nbsp;&nbsp;Obsolete function that (from 10.30 onwards) does nothing</td></tr>
<tr><td><a href="pcre2_set_substitute_callout.html">pcre2_set_substitute_callout</a></td>
<td>&nbsp;&nbsp;Set a substitution callout function</td></tr>
<tr><td><a href="pcre2_set_substitute_case_callout.html">pcre2_set_substitute_case_callout</a></td>
<td>&nbsp;&nbsp;Set a substitution case callout function</td></tr>
<tr><td><a href="pcre2_substitute.html">pcre2_substitute</a></td>
<td>&nbsp;&nbsp;Match a compiled pattern to a subject string and do
substitutions</td></tr>
<tr><td><a href="pcre2_substring_copy_byname.html">pcre2_substring_copy_byname</a></td>
<td>&nbsp;&nbsp;Extract named substring into given buffer</td></tr>
<tr><td><a href="pcre2_substring_copy_bynumber.html">pcre2_substring_copy_bynumber</a></td>
<td>&nbsp;&nbsp;Extract numbered substring into given buffer</td></tr>
<tr><td><a href="pcre2_substring_free.html">pcre2_substring_free</a></td>
<td>&nbsp;&nbsp;Free extracted substring</td></tr>
<tr><td><a href="pcre2_substring_get_byname.html">pcre2_substring_get_byname</a></td>
<td>&nbsp;&nbsp;Extract named substring into new memory</td></tr>
<tr><td><a href="pcre2_substring_get_bynumber.html">pcre2_substring_get_bynumber</a></td>
<td>&nbsp;&nbsp;Extract numbered substring into new memory</td></tr>
<tr><td><a href="pcre2_substring_length_byname.html">pcre2_substring_length_byname</a></td>
<td>&nbsp;&nbsp;Find length of named substring</td></tr>
<tr><td><a href="pcre2_substring_length_bynumber.html">pcre2_substring_length_bynumber</a></td>
<td>&nbsp;&nbsp;Find length of numbered substring</td></tr>
<tr><td><a href="pcre2_substring_list_free.html">pcre2_substring_list_free</a></td>
<td>&nbsp;&nbsp;Free list of extracted substrings</td></tr>
<tr><td><a href="pcre2_substring_list_get.html">pcre2_substring_list_get</a></td>
<td>&nbsp;&nbsp;Extract all substrings into new memory</td></tr>
<tr><td><a href="pcre2_substring_nametable_scan.html">pcre2_substring_nametable_scan</a></td>
<td>&nbsp;&nbsp;Find table entries for given string name</td></tr>
<tr><td><a href="pcre2_substring_number_from_name.html">pcre2_substring_number_from_name</a></td>
<td>&nbsp;&nbsp;Convert captured string name to number</td></tr>
</table>
</html>


@@ -0,0 +1,102 @@
<html>
<head>
<title>pcre2-config specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2-config man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">OPTIONS</a>
<li><a name="TOC4" href="#SEC4">SEE ALSO</a>
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
<li><a name="TOC6" href="#SEC6">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
<P>
<b>pcre2-config [--prefix] [--exec-prefix] [--version]</b>
<b> [--libs8] [--libs16] [--libs32] [--libs-posix]</b>
<b> [--cflags] [--cflags-posix]</b>
</P>
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
<P>
<b>pcre2-config</b> returns the configuration of the installed PCRE2 libraries
and the options required to compile a program to use them. Some of the options
apply only to the 8-bit, or 16-bit, or 32-bit libraries, respectively, and are
not available for libraries that have not been built. If an unavailable option
is encountered, the "usage" information is output.
</P>
<br><a name="SEC3" href="#TOC1">OPTIONS</a><br>
<P>
<b>--prefix</b>
Writes the directory prefix used in the PCRE2 installation for architecture
independent files (<i>/usr</i> on many systems, <i>/usr/local</i> on some
systems) to the standard output.
</P>
<P>
<b>--exec-prefix</b>
Writes the directory prefix used in the PCRE2 installation for architecture
dependent files (normally the same as <b>--prefix</b>) to the standard output.
</P>
<P>
<b>--version</b>
Writes the version number of the installed PCRE2 libraries to the standard
output.
</P>
<P>
<b>--libs8</b>
Writes to the standard output the command line options required to link
with the 8-bit PCRE2 library (<b>-lpcre2-8</b> on many systems).
</P>
<P>
<b>--libs16</b>
Writes to the standard output the command line options required to link
with the 16-bit PCRE2 library (<b>-lpcre2-16</b> on many systems).
</P>
<P>
<b>--libs32</b>
Writes to the standard output the command line options required to link
with the 32-bit PCRE2 library (<b>-lpcre2-32</b> on many systems).
</P>
<P>
<b>--libs-posix</b>
Writes to the standard output the command line options required to link with
PCRE2's POSIX API wrapper library (<b>-lpcre2-posix</b> <b>-lpcre2-8</b> on many
systems).
</P>
<P>
<b>--cflags</b>
Writes to the standard output the command line options required to compile
files that use PCRE2 (this may include some <b>-I</b> options, but is blank on
many systems).
</P>
<P>
<b>--cflags-posix</b>
Writes to the standard output the command line options required to compile
files that use PCRE2's POSIX API wrapper library (this may include some
<b>-I</b> options, but is blank on many systems).
</P>
<br><a name="SEC4" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2(3)</b>
</P>
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
<P>
This manual page was originally written by Mark Baker for the Debian GNU/Linux
system. It has been subsequently revised as a generic PCRE2 man page.
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
Last updated: 28 September 2014
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,214 @@
<html>
<head>
<title>pcre2 specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2 man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">INTRODUCTION</a>
<li><a name="TOC2" href="#SEC2">SECURITY CONSIDERATIONS</a>
<li><a name="TOC3" href="#SEC3">USER DOCUMENTATION</a>
<li><a name="TOC4" href="#SEC4">AUTHORS</a>
<li><a name="TOC5" href="#SEC5">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">INTRODUCTION</a><br>
<P>
PCRE2 is the name used for a revised API for the PCRE library, which is a set
of functions, written in C, that implement regular expression pattern matching
using the same syntax and semantics as Perl, with just a few differences. After
nearly two decades, the limitations of the original API were making development
increasingly difficult. The new API is more extensible, and it was simplified
by abolishing the separate "study" optimizing function; in PCRE2, patterns are
automatically optimized where possible. Since forking from PCRE1, the code has
been extensively refactored and new features introduced. The old library is now
obsolete and is no longer maintained.
</P>
<P>
As well as Perl-style regular expression patterns, some features that appeared
in Python and the original PCRE before they appeared in Perl are available
using the Python syntax. There is also some support for one or two .NET and
Oniguruma syntax items, and there are options for requesting some minor changes
that give better ECMAScript (aka JavaScript) compatibility.
</P>
<P>
The source code for PCRE2 can be compiled to support strings of 8-bit, 16-bit,
or 32-bit code units, which means that up to three separate libraries may be
installed, one for each code unit size. The size of code unit is not related to
the bit size of the underlying hardware. In a 64-bit environment that also
supports 32-bit applications, versions of PCRE2 that are compiled in both
64-bit and 32-bit modes may be needed.
</P>
<P>
The original work to extend PCRE to 16-bit and 32-bit code units was done by
Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings
can be interpreted either as one character per code unit, or as UTF-encoded
Unicode, with support for Unicode general category properties. Unicode support
is optional at build time (but is the default). However, processing strings as
UTF code units must be enabled explicitly at run time. The version of Unicode
in use can be discovered by running
<pre>
pcre2test -C
</PRE>
</P>
<P>
The three libraries contain identical sets of functions, with names ending in
_8, _16, or _32, respectively (for example, <b>pcre2_compile_8()</b>). However,
by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or 32, a program that uses just
one code unit width can be written using generic names such as
<b>pcre2_compile()</b>, and the documentation is written assuming that this is
the case.
</P>
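<P>
As a rough illustration of how one width setting can select among suffixed
function names, the following sketch uses preprocessor token pasting. It is not
copied from the PCRE2 headers: every <b>demo_</b> and <b>DEMO_</b> name is
hypothetical, and the stand-in functions merely report a code unit width.
</P>

```c
/* Hypothetical stand-ins for the width-specific functions of three libraries.
   Each one just reports the code unit width it represents. */
int demo_unit_bits_8(void)  { return 8; }
int demo_unit_bits_16(void) { return 16; }
int demo_unit_bits_32(void) { return 32; }

/* The application chooses one width before using the generic name. */
#define DEMO_CODE_UNIT_WIDTH 16

/* Two-level pasting, so that DEMO_CODE_UNIT_WIDTH is replaced by its value
   before it is glued onto the function name. */
#define DEMO_JOIN(a, b) a##b
#define DEMO_GLUE(a, b) DEMO_JOIN(a, b)
#define DEMO_SUFFIX(name) DEMO_GLUE(name, DEMO_CODE_UNIT_WIDTH)

/* Generic name: with the setting above it expands to demo_unit_bits_16. */
#define demo_unit_bits DEMO_SUFFIX(demo_unit_bits_)
```

<P>
With this arrangement, changing the single width definition is enough to
redirect every generic name in the program to a different set of functions.
</P>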
<P>
In addition to the Perl-compatible matching function, PCRE2 contains an
alternative function that matches the same compiled patterns in a different
way. In certain circumstances, the alternative function has some advantages.
For a discussion of the two matching algorithms, see the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
page.
</P>
<P>
Details of exactly which Perl regular expression features are and are not
supported by PCRE2 are given in separate documents. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
and
<a href="pcre2compat.html"><b>pcre2compat</b></a>
pages. There is a syntax summary in the
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
page.
</P>
<P>
Some features of PCRE2 can be included, excluded, or changed when the library
is built. The
<a href="pcre2_config.html"><b>pcre2_config()</b></a>
function makes it possible for a client to discover which features are
available. The features themselves are described in the
<a href="pcre2build.html"><b>pcre2build</b></a>
page. Documentation about building PCRE2 for various operating systems can be
found in the
<a href="README.txt"><b>README</b></a>
and
<a href="NON-AUTOTOOLS-BUILD.txt"><b>NON-AUTOTOOLS-BUILD</b></a>
files in the source distribution.
</P>
<P>
The libraries contain a number of undocumented internal functions and data
tables that are used by more than one of the exported external functions, but
which are not intended for use by external callers. Their names all begin with
"_pcre2", which hopefully will not provoke any name clashes. In some
environments, it is possible to control which external symbols are exported
when a shared library is built, and in these cases the undocumented symbols are
not exported.
</P>
<br><a name="SEC2" href="#TOC1">SECURITY CONSIDERATIONS</a><br>
<P>
If you are using PCRE2 in a non-UTF application that permits users to supply
arbitrary patterns for compilation, you should be aware of a feature that
allows users to turn on UTF support from within a pattern. For example, an
8-bit pattern that begins with "(*UTF)" turns on UTF-8 mode, which interprets
patterns and subjects as strings of UTF-8 code units instead of individual
8-bit characters. This causes both the pattern and any data against which it is
matched to be checked for UTF-8 validity. If the data string is very long, such
a check might consume enough resources to degrade your application's
performance.
</P>
<P>
One way of guarding against this possibility is to use the
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
<b>pcre2_compile()</b>. This causes a compile time error if the pattern contains
a UTF-setting sequence.
</P>
<P>
The use of Unicode properties for character types such as \d can also be
enabled from within the pattern, by specifying "(*UCP)". This feature can be
disallowed by setting the PCRE2_NEVER_UCP option.
</P>
<P>
If your application is one that supports UTF, be aware that validity checking
can take time. If the same data string is to be matched many times, you can use
the PCRE2_NO_UTF_CHECK option for the second and subsequent matches to avoid
running redundant checks.
</P>
<P>
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead to
problems, because it may leave the current matching point in the middle of a
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used by an
application to lock out the use of \C, causing a compile-time error if it is
encountered. It is also possible to build PCRE2 with the use of \C permanently
disabled.
</P>
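<P>
To see why stopping in the middle of a character is a problem, recall that in
UTF-8 the second and later code units of a character all have the bit pattern
10xxxxxx. The following sketch, which is independent of PCRE2 and uses
hypothetical <b>demo_</b> names, tests whether an offset in a subject falls
inside a multi-code-unit character.
</P>

```c
#include <stddef.h>

/* A UTF-8 continuation code unit has the bit pattern 10xxxxxx. */
int demo_is_continuation(unsigned char unit)
{
    return (unit & 0xC0) == 0x80;
}

/* Returns non-zero if offset points into the middle of a multi-code-unit
   character, that is, at a continuation code unit. An offset at or beyond
   the end of the subject is never in the middle of a character. */
int demo_splits_character(const unsigned char *subject, size_t length,
                          size_t offset)
{
    return offset < length && demo_is_continuation(subject[offset]);
}
```

<P>
For the two-unit sequence 0xC3 0xA9 (U+00E9), offset 1 is such a position: it
is where matching a single code unit would leave the current matching point.
</P>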
<P>
Performance can also suffer when a pattern that has a very large search tree
is run against a string that will never match. Nested unlimited
repeats in a pattern are a common example. PCRE2 provides some protection
against this: see the <b>pcre2_set_match_limit()</b> function in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page. There is a similar function called <b>pcre2_set_depth_limit()</b> that can
be used to restrict the amount of memory that is used.
</P>
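<P>
The effect of such a limit can be sketched with a toy backtracking matcher.
This is not PCRE2's algorithm or API: the <b>demo_</b> names are hypothetical,
the pattern syntax is restricted to literal characters, "." and "x*", and the
whole subject must match. The matcher counts its own internal calls and gives
up with a distinct result once the count exceeds the caller's limit.
</P>

```c
#define DEMO_NOMATCH     0
#define DEMO_MATCH       1
#define DEMO_LIMIT_HIT (-1)

typedef struct {
    unsigned long calls;   /* internal calls made so far */
    unsigned long limit;   /* maximum number of calls allowed */
} demo_limit;

static int demo_match_here(const char *re, const char *s, demo_limit *lim);

/* Match zero or more occurrences of c, then the rest of the pattern. */
static int demo_match_star(char c, const char *re, const char *s,
                           demo_limit *lim)
{
    for (;;) {
        int rc = demo_match_here(re, s, lim);
        if (rc != DEMO_NOMATCH) return rc;          /* match, or limit hit */
        if (*s == '\0' || (c != '.' && *s != c)) return DEMO_NOMATCH;
        s++;                                        /* consume one more c */
    }
}

/* Match the pattern re against the whole of s, counting every call. */
static int demo_match_here(const char *re, const char *s, demo_limit *lim)
{
    if (++lim->calls > lim->limit) return DEMO_LIMIT_HIT;
    if (re[0] == '\0') return *s == '\0' ? DEMO_MATCH : DEMO_NOMATCH;
    if (re[1] == '*') return demo_match_star(re[0], re + 2, s, lim);
    if (*s != '\0' && (re[0] == '.' || re[0] == *s))
        return demo_match_here(re + 1, s + 1, lim);
    return DEMO_NOMATCH;
}

/* Entry point: returns DEMO_MATCH, DEMO_NOMATCH, or DEMO_LIMIT_HIT. */
int demo_match(const char *re, const char *s, unsigned long limit)
{
    demo_limit lim = { 0, limit };
    return demo_match_here(re, s, &lim);
}
```

<P>
A pattern such as "a*a*a*a*a*b" applied to a run of "a" characters with no "b"
makes this matcher try every way of dividing the run among the repeats before
it can report failure. With a small limit it stops early and reports that the
limit was reached, so the caller can treat the pattern as too expensive.
</P>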
<br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br>
<P>
The user documentation for PCRE2 comprises a number of different sections. In
the "man" format, each of these is a separate "man page". In the HTML format,
each is a separate page, linked from the index page. In the plain text format,
the descriptions of the <b>pcre2grep</b> and <b>pcre2test</b> programs are in
files called <b>pcre2grep.txt</b> and <b>pcre2test.txt</b>, respectively. The
remaining sections, except for the <b>pcre2demo</b> section (which is a program
listing), and the short pages for individual functions, are concatenated in
<b>pcre2.txt</b>, for ease of searching. The sections are as follows:
<pre>
pcre2 this document
pcre2-config show PCRE2 installation configuration information
pcre2api details of PCRE2's native C API
pcre2build building PCRE2
pcre2callout details of the pattern callout feature
pcre2compat discussion of Perl compatibility
pcre2convert details of pattern conversion functions
pcre2demo a demonstration C program that uses PCRE2
pcre2grep description of the <b>pcre2grep</b> command (8-bit only)
pcre2jit discussion of just-in-time optimization support
pcre2limits details of size and other limits
pcre2matching discussion of the two matching algorithms
pcre2partial details of the partial matching facility
pcre2pattern syntax and semantics of supported regular expression patterns
pcre2perform discussion of performance issues
pcre2posix the POSIX-compatible C API for the 8-bit library
pcre2sample discussion of the pcre2demo program
pcre2serialize details of pattern serialization
pcre2syntax quick syntax reference
pcre2test description of the <b>pcre2test</b> command
pcre2unicode discussion of Unicode and UTF support
</pre>
In the "man" and HTML formats, there is also a short page for each C library
function, listing its arguments and results.
</P>
<br><a name="SEC4" href="#TOC1">AUTHORS</a><br>
<P>
The current maintainers of PCRE2 are Nicholas Wilson and Zoltan Herczeg.
</P>
<P>
PCRE2 was written by Philip Hazel, of the University Computing Service,
Cambridge, England. Many others have also contributed.
</P>
<P>
To contact the maintainers, please use the GitHub issues tracker or PCRE2
mailing list, as described at the project page:
<a href="https://github.com/PCRE2Project/pcre2">https://github.com/PCRE2Project/pcre2</a>
</P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P>
Last updated: 18 December 2024
<br>
Copyright &copy; 1997-2021 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,63 @@
<html>
<head>
<title>pcre2_callout_enumerate specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_callout_enumerate man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
<b> int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
<b> void *<i>callout_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function scans a compiled regular expression and calls the <i>callback()</i>
function for each callout within the pattern. The yield of the function is zero
for success and non-zero otherwise. The arguments are:
<pre>
<i>code</i> Points to the compiled pattern
<i>callback</i> The callback function
<i>callout_data</i> User data that is passed to the callback
</pre>
The <i>callback()</i> function is passed a pointer to a data block containing
the following fields (not necessarily in this order):
<pre>
uint32_t <i>version</i> Block version number
uint32_t <i>callout_number</i> Number for numbered callouts
PCRE2_SIZE <i>pattern_position</i> Offset to next item in pattern
PCRE2_SIZE <i>next_item_length</i> Length of next item in pattern
PCRE2_SIZE <i>callout_string_offset</i> Offset to string within pattern
PCRE2_SIZE <i>callout_string_length</i> Length of callout string
PCRE2_SPTR <i>callout_string</i> Points to callout string or is NULL
</pre>
The second argument passed to the <b>callback()</b> function is the callout data
that was passed to <b>pcre2_callout_enumerate()</b>. The <b>callback()</b>
function must return zero for success. Any other value causes the pattern scan
to stop, with the value being passed back as the result of
<b>pcre2_callout_enumerate()</b>.
</P>
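<P>
The return-value convention described above can be sketched without the
library. In the following, all <b>demo_</b> names are hypothetical and the data
block is reduced to a single field; the point is only that the scan stops at
the first non-zero return from the callback and passes that value back.
</P>

```c
#include <stddef.h>

/* Hypothetical, much reduced stand-in for the block given to the callback. */
typedef struct {
    unsigned int callout_number;
} demo_callout_block;

typedef int (*demo_callback)(demo_callout_block *, void *);

/* Calls callback once per block. Stops at the first non-zero return and
   yields that value; yields zero if every call returned zero. */
int demo_enumerate(demo_callout_block *blocks, size_t count,
                   demo_callback callback, void *callout_data)
{
    for (size_t i = 0; i < count; i++) {
        int rc = callback(&blocks[i], callout_data);
        if (rc != 0) return rc;
    }
    return 0;
}

/* Callback that counts callouts in the user data and never stops the scan. */
int demo_count_all(demo_callout_block *block, void *callout_data)
{
    (void)block;
    ++*(size_t *)callout_data;
    return 0;
}

/* Callback that counts callouts and stops the scan at callout number 2. */
int demo_stop_at_two(demo_callout_block *block, void *callout_data)
{
    ++*(size_t *)callout_data;
    return block->callout_number == 2 ? 7 : 0;
}
```

<P>
The user data pointer is the only channel between the caller and the callback,
which is why the counter in this sketch is passed through it rather than held
in a global variable.
</P>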
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,43 @@
<html>
<head>
<title>pcre2_code_copy specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_code_copy man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function makes a copy of the memory used for a compiled pattern, excluding
any memory used by the JIT compiler. Without a subsequent call to
<b>pcre2_jit_compile()</b>, the copy can be used only for non-JIT matching. The
pointer to the character tables is copied, not the tables themselves (see
<b>pcre2_code_copy_with_tables()</b>). The yield of the function is NULL if
<i>code</i> is NULL or if sufficient memory cannot be obtained.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,44 @@
<html>
<head>
<title>pcre2_code_copy_with_tables specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_code_copy_with_tables man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *<i>code</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function makes a copy of the memory used for a compiled pattern, excluding
any memory used by the JIT compiler. Without a subsequent call to
<b>pcre2_jit_compile()</b>, the copy can be used only for non-JIT matching.
Unlike <b>pcre2_code_copy()</b>, a separate copy of the character tables is also
made, with the new code pointing to it. This memory will be automatically freed
when <b>pcre2_code_free()</b> is called. The yield of the function is NULL if
<i>code</i> is NULL or if sufficient memory cannot be obtained.
</P>
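<P>
The difference between copying the pointer to the tables and copying the tables
themselves can be sketched as follows. This is a simplified model, not PCRE2's
implementation: the <b>demo_</b> names, the structure layout, and the ownership
flag are all assumptions made for illustration.
</P>

```c
#include <stdlib.h>
#include <string.h>

#define DEMO_TABLES_SIZE 16

typedef struct {
    unsigned char *tables;   /* character tables used by the pattern */
    int owns_tables;         /* non-zero if freeing the code frees the tables */
    int pattern_data;        /* stand-in for the compiled pattern itself */
} demo_code;

/* Copies the code; the copy shares the original's tables. */
demo_code *demo_code_copy(const demo_code *code)
{
    if (code == NULL) return NULL;
    demo_code *copy = malloc(sizeof *copy);
    if (copy == NULL) return NULL;
    *copy = *code;
    copy->owns_tables = 0;               /* pointer copied, tables not owned */
    return copy;
}

/* Copies the code and gives the copy its own private tables. */
demo_code *demo_code_copy_with_tables(const demo_code *code)
{
    if (code == NULL) return NULL;
    demo_code *copy = malloc(sizeof *copy);
    if (copy == NULL) return NULL;
    *copy = *code;
    copy->tables = malloc(DEMO_TABLES_SIZE);
    if (copy->tables == NULL) { free(copy); return NULL; }
    memcpy(copy->tables, code->tables, DEMO_TABLES_SIZE);
    copy->owns_tables = 1;               /* freed together with the code */
    return copy;
}

/* Does nothing for NULL; otherwise frees the code and any owned tables. */
void demo_code_free(demo_code *code)
{
    if (code == NULL) return;
    if (code->owns_tables) free(code->tables);
    free(code);
}
```

<P>
In this model a copy that shares tables stays valid only while the original
tables exist, whereas a copy with its own tables can outlive them.
</P>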
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,42 @@
<html>
<head>
<title>pcre2_code_free specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_code_free man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>void pcre2_code_free(pcre2_code *<i>code</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
If <i>code</i> is NULL, this function does nothing. Otherwise, <i>code</i> must
point to a compiled pattern. This function frees its memory, including any
memory used by the JIT compiler. If the compiled pattern was created by a call
to <b>pcre2_code_copy_with_tables()</b>, the memory for the character tables is
also freed.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,120 @@
<html>
<head>
<title>pcre2_compile specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_compile man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_code *pcre2_compile(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
<b> uint32_t <i>options</i>, int *<i>errorcode</i>, PCRE2_SIZE *<i>erroroffset,</i></b>
<b> pcre2_compile_context *<i>ccontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function compiles a regular expression pattern into an internal form. Its
arguments are:
<pre>
  <i>pattern</i>      A string containing the expression to be compiled
  <i>length</i>       The length of the string or PCRE2_ZERO_TERMINATED
  <i>options</i>      Primary option bits
  <i>errorcode</i>    Where to put an error code
  <i>erroroffset</i>  Where to put an error offset
  <i>ccontext</i>     Pointer to a compile context or NULL
</pre>
The length of the pattern and any error offset that is returned are in code
units, not characters. A NULL pattern with zero length is treated as an empty
string. A compile context is needed only if you want to provide custom memory
allocation functions, or to provide an external function for system stack size
checking (see <b>pcre2_set_compile_recursion_guard()</b>), or to change one or
more of these parameters:
<pre>
What \R matches (Unicode newlines, or CR, LF, CRLF only);
PCRE2's character tables;
The newline character sequence;
The compile time nested parentheses limit;
The maximum pattern length (in code units) that is allowed;
The additional options bits.
</pre>
The primary option bits are:
<pre>
  PCRE2_ANCHORED             Force pattern anchoring
  PCRE2_ALLOW_EMPTY_CLASS    Allow empty classes
  PCRE2_ALT_BSUX             Alternative handling of \u, \U, and \x
  PCRE2_ALT_CIRCUMFLEX       Alternative handling of ^ in multiline mode
  PCRE2_ALT_EXTENDED_CLASS   Alternative extended character class syntax
  PCRE2_ALT_VERBNAMES        Process backslashes in verb names
  PCRE2_AUTO_CALLOUT         Compile automatic callouts
  PCRE2_CASELESS             Do caseless matching
  PCRE2_DOLLAR_ENDONLY       $ not to match newline at end
  PCRE2_DOTALL               . matches anything including NL
  PCRE2_DUPNAMES             Allow duplicate names for subpatterns
  PCRE2_ENDANCHORED          Pattern can match only at end of subject
  PCRE2_EXTENDED             Ignore white space and # comments
  PCRE2_FIRSTLINE            Force matching to be before newline
  PCRE2_LITERAL              Pattern characters are all literal
  PCRE2_MATCH_INVALID_UTF    Enable support for matching invalid UTF
  PCRE2_MATCH_UNSET_BACKREF  Match unset backreferences
  PCRE2_MULTILINE            ^ and $ match newlines within data
  PCRE2_NEVER_BACKSLASH_C    Lock out the use of \C in patterns
  PCRE2_NEVER_UCP            Lock out PCRE2_UCP, e.g. via (*UCP)
  PCRE2_NEVER_UTF            Lock out PCRE2_UTF, e.g. via (*UTF)
  PCRE2_NO_AUTO_CAPTURE      Disable numbered capturing parentheses
                               (named ones available)
  PCRE2_NO_AUTO_POSSESS      Disable auto-possessification
  PCRE2_NO_DOTSTAR_ANCHOR    Disable automatic anchoring for .*
  PCRE2_NO_START_OPTIMIZE    Disable match-time start optimizations
  PCRE2_NO_UTF_CHECK         Do not check the pattern for UTF validity
                               (only relevant if PCRE2_UTF is set)
  PCRE2_UCP                  Use Unicode properties for \d, \w, etc.
  PCRE2_UNGREEDY             Invert greediness of quantifiers
  PCRE2_USE_OFFSET_LIMIT     Enable offset limit for unanchored matching
  PCRE2_UTF                  Treat pattern and subjects as UTF strings
</pre>
PCRE2 must be built with Unicode support (the default) in order to use
PCRE2_UTF, PCRE2_UCP and related options.
</P>
<P>
Additional options may be set in the compile context via the
<a href="pcre2_set_compile_extra_options.html"><b>pcre2_set_compile_extra_options</b></a>
function.
</P>
<P>
If either of <i>errorcode</i> or <i>erroroffset</i> is NULL, the function returns
NULL immediately. Otherwise, the yield of this function is a pointer to a
private data structure that contains the compiled pattern, or NULL if an error
was detected. In the error case, a text error message can be obtained by
passing the value returned via the <i>errorcode</i> argument to the
<b>pcre2_get_error_message()</b> function. The offset (in code units) where the
error was encountered is returned via the <i>erroroffset</i> argument.
</P>
<P>
If there is no error, the value returned via <i>errorcode</i> produces the
message "no error" when passed to <b>pcre2_get_error_message()</b>, and the
value returned via <i>erroroffset</i> is zero.
</P>
<P>
There is a complete description of the PCRE2 native API, with more detail on
each option, in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page, and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,41 @@
<html>
<head>
<title>pcre2_compile_context_copy specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_compile_context_copy man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_compile_context *pcre2_compile_context_copy(</b>
<b> pcre2_compile_context *<i>ccontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function makes a new copy of a compile context, using the memory
allocation function that was used for the original context. The result is NULL
if the memory cannot be obtained.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,42 @@
<html>
<head>
<title>pcre2_compile_context_create specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_compile_context_create man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_compile_context *pcre2_compile_context_create(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function creates and initializes a new compile context. If its argument is
NULL, <b>malloc()</b> is used to get the necessary memory; otherwise the memory
allocation function within the general context is used. The result is NULL if
the memory could not be obtained.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,41 @@
<html>
<head>
<title>pcre2_compile_context_free specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_compile_context_free man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>void pcre2_compile_context_free(pcre2_compile_context *<i>ccontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function frees the memory occupied by a compile context, using the memory
freeing function from the general context with which it was created, or
<b>free()</b> if that was not set. If the argument is NULL, the function returns
immediately without doing anything.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,84 @@
<html>
<head>
<title>pcre2_config specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_config man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_config(uint32_t <i>what</i>, void *<i>where</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function makes it possible for a client program to find out which optional
features are available in the version of the PCRE2 library it is using. The
arguments are as follows:
<pre>
  <i>what</i>   A code specifying what information is required
  <i>where</i>  Points to where to put the information
</pre>
If <i>where</i> is NULL, the function returns the amount of memory needed for
the requested information. When the information is a string, the value is in
code units; for other types of data it is in bytes.
</P>
<P>
If <i>where</i> is not NULL, for PCRE2_CONFIG_JITTARGET,
PCRE2_CONFIG_UNICODE_VERSION, and PCRE2_CONFIG_VERSION it must point to a
buffer that is large enough to hold the string. For all other codes it must
point to a uint32_t integer variable. The available codes are:
<pre>
  PCRE2_CONFIG_BSR                Indicates what \R matches by default:
                                    PCRE2_BSR_UNICODE
                                    PCRE2_BSR_ANYCRLF
  PCRE2_CONFIG_COMPILED_WIDTHS    Which of 8/16/32 support was compiled
  PCRE2_CONFIG_DEPTHLIMIT         Default backtracking depth limit
  PCRE2_CONFIG_HEAPLIMIT          Default heap memory limit
  PCRE2_CONFIG_JIT                Availability of just-in-time compiler
                                    support (1=yes 0=no)
  PCRE2_CONFIG_JITTARGET          Information (a string) about the target
                                    architecture for the JIT compiler
  PCRE2_CONFIG_LINKSIZE           Configured internal link size (2, 3, 4)
  PCRE2_CONFIG_MATCHLIMIT         Default internal resource limit
  PCRE2_CONFIG_NEVER_BACKSLASH_C  Whether or not \C is disabled
  PCRE2_CONFIG_NEWLINE            Code for the default newline sequence:
                                    PCRE2_NEWLINE_CR
                                    PCRE2_NEWLINE_LF
                                    PCRE2_NEWLINE_CRLF
                                    PCRE2_NEWLINE_ANY
                                    PCRE2_NEWLINE_ANYCRLF
                                    PCRE2_NEWLINE_NUL
  PCRE2_CONFIG_PARENSLIMIT        Default parentheses nesting limit
  PCRE2_CONFIG_RECURSIONLIMIT     Obsolete: use PCRE2_CONFIG_DEPTHLIMIT
  PCRE2_CONFIG_STACKRECURSE       Obsolete: always returns 0
  PCRE2_CONFIG_UNICODE            Availability of Unicode support (1=yes 0=no)
  PCRE2_CONFIG_UNICODE_VERSION    The Unicode version (a string)
  PCRE2_CONFIG_VERSION            The PCRE2 version (a string)
</pre>
The function yields a non-negative value on success or the negative value
PCRE2_ERROR_BADOPTION otherwise. This is also the result for the
PCRE2_CONFIG_JITTARGET code if JIT support is not available. When a string is
requested, the function returns the number of code units used, including the
terminating zero.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,40 @@
<html>
<head>
<title>pcre2_convert_context_copy specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_convert_context_copy man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_convert_context *pcre2_convert_context_copy(</b>
<b> pcre2_convert_context *<i>cvcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function is part of an experimental set of pattern conversion functions.
It makes a new copy of a convert context, using the memory allocation function
that was used for the original context. The result is NULL if the memory cannot
be obtained.
</P>
<P>
The pattern conversion functions are described in the
<a href="pcre2convert.html"><b>pcre2convert</b></a>
documentation.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,41 @@
<html>
<head>
<title>pcre2_convert_context_create specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_convert_context_create man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_convert_context *pcre2_convert_context_create(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function is part of an experimental set of pattern conversion functions.
It creates and initializes a new convert context. If its argument is
NULL, <b>malloc()</b> is used to get the necessary memory; otherwise the memory
allocation function within the general context is used. The result is NULL if
the memory could not be obtained.
</P>
<P>
The pattern conversion functions are described in the
<a href="pcre2convert.html"><b>pcre2convert</b></a>
documentation.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,40 @@
<html>
<head>
<title>pcre2_convert_context_free specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_convert_context_free man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>void pcre2_convert_context_free(pcre2_convert_context *<i>cvcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function is part of an experimental set of pattern conversion functions.
It frees the memory occupied by a convert context, using the memory
freeing function from the general context with which it was created, or
<b>free()</b> if that was not set. If the argument is NULL, the function returns
immediately without doing anything.
</P>
<P>
The pattern conversion functions are described in the
<a href="pcre2convert.html"><b>pcre2convert</b></a>
documentation.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,40 @@
<html>
<head>
<title>pcre2_converted_pattern_free specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_converted_pattern_free man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>void pcre2_converted_pattern_free(PCRE2_UCHAR *<i>converted_pattern</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function is part of an experimental set of pattern conversion functions.
It frees the memory occupied by a converted pattern that was obtained by
calling <b>pcre2_pattern_convert()</b> with arguments that caused it to place
the converted pattern into newly obtained heap memory. If the argument is NULL,
the function returns immediately without doing anything.
</P>
<P>
The pattern conversion functions are described in the
<a href="pcre2convert.html"><b>pcre2convert</b></a>
documentation.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,86 @@
<html>
<head>
<title>pcre2_dfa_match specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_dfa_match man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>,</b>
<b> int *<i>workspace</i>, PCRE2_SIZE <i>wscount</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function matches a compiled regular expression against a given subject
string, using an alternative matching algorithm that scans the subject string
just once (except when processing lookaround assertions). This function is
<i>not</i> Perl-compatible (the Perl-compatible matching function is
<b>pcre2_match()</b>). The arguments for this function are:
<pre>
  <i>code</i>         Points to the compiled pattern
  <i>subject</i>      Points to the subject string
  <i>length</i>       Length of the subject string
  <i>startoffset</i>  Offset in the subject at which to start matching
  <i>options</i>      Option bits
  <i>match_data</i>   Points to a match data block, for results
  <i>mcontext</i>     Points to a match context, or is NULL
  <i>workspace</i>    Points to a vector of ints used as working space
  <i>wscount</i>      Number of elements in the vector
</pre>
The size of the output vector needed to contain all the results depends on the
number of simultaneous matches, not on the number of parentheses in the
pattern. Using <b>pcre2_match_data_create_from_pattern()</b> to create the match
data block is therefore not advisable when using this function.
</P>
<P>
A match context is needed only if you want to set up a callout function or to
specify the heap limit, the match limit, or the recursion depth limit. The
<i>length</i> and <i>startoffset</i> values are code units, not characters. The
options are:
<pre>
  PCRE2_ANCHORED          Match only at the first position
  PCRE2_COPY_MATCHED_SUBJECT
                          On success, make a private subject copy
  PCRE2_ENDANCHORED       Pattern can match only at end of subject
  PCRE2_NOTBOL            Subject is not the beginning of a line
  PCRE2_NOTEOL            Subject is not the end of a line
  PCRE2_NOTEMPTY          An empty string is not a valid match
  PCRE2_NOTEMPTY_ATSTART  An empty string at the start of the subject
                            is not a valid match
  PCRE2_NO_UTF_CHECK      Do not check the subject for UTF validity
                            (only relevant if PCRE2_UTF was set at
                            compile time)
  PCRE2_PARTIAL_HARD      Return PCRE2_ERROR_PARTIAL for a partial match
                            even if there is a full match
  PCRE2_PARTIAL_SOFT      Return PCRE2_ERROR_PARTIAL for a partial match
                            if no full matches are found
  PCRE2_DFA_RESTART       Restart after a partial match
  PCRE2_DFA_SHORTEST      Return only the shortest match
</pre>
There are restrictions on what may appear in a pattern when using this matching
function. Details are given in the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
documentation. For details of partial matching, see the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
page. There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,42 @@
<html>
<head>
<title>pcre2_general_context_copy specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_general_context_copy man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_general_context *pcre2_general_context_copy(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function makes a new copy of a general context, using the memory
allocation functions in the context, if set, to get the necessary memory.
Otherwise <b>malloc()</b> is used. The result is NULL if the memory cannot be
obtained.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,44 @@
<html>
<head>
<title>pcre2_general_context_create specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_general_context_create man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_general_context *pcre2_general_context_create(</b>
<b> void *(*<i>private_malloc</i>)(size_t, void *),</b>
<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function creates and initializes a general context. The arguments define
custom memory management functions and a data value that is passed to them when
they are called. The <b>private_malloc()</b> function is used to get memory for
the context. If either of the first two arguments is NULL, the system memory
management function is used. The result is NULL if no memory could be obtained.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,40 @@
<html>
<head>
<title>pcre2_general_context_free specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_general_context_free man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>void pcre2_general_context_free(pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function frees the memory occupied by a general context, using the memory
freeing function within the context, if set. If the argument is NULL, the
function returns immediately without doing anything.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,51 @@
<html>
<head>
<title>pcre2_get_error_message specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_get_error_message man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_get_error_message(int <i>errorcode</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
<b> PCRE2_SIZE <i>bufflen</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function provides a textual error message for each PCRE2 error code.
Compilation errors are positive numbers; UTF formatting errors and matching
errors are negative numbers. The arguments are:
<pre>
  <i>errorcode</i>  an error code (positive or negative)
  <i>buffer</i>     where to put the message
  <i>bufflen</i>    the length of the buffer (code units)
</pre>
The function returns the length of the message in code units, excluding the
trailing zero, or the negative error code PCRE2_ERROR_NOMEMORY if the buffer is
too small. In this case, the returned message is truncated (but still with a
trailing zero). If <i>errorcode</i> does not contain a recognized error code
number, the negative value PCRE2_ERROR_BADDATA is returned.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,47 @@
<html>
<head>
<title>pcre2_get_mark specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_get_mark man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
After a call of <b>pcre2_match()</b> that was passed the match block that is
this function's argument, this function returns a pointer to the last (*MARK),
(*PRUNE), or (*THEN) name that was encountered during the matching process. The
name is zero-terminated, and is within the compiled pattern. The length of the
name is in the preceding code unit. If no name is available, NULL is returned.
</P>
<P>
After a successful match, the name that is returned is the last one on the
matching path. After a failed match or a partial match, the last encountered
name is returned.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,40 @@
<html>
<head>
<title>pcre2_get_match_data_heapframes_size specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_get_match_data_heapframes_size man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>PCRE2_SIZE pcre2_get_match_data_heapframes_size(</b>
<b> pcre2_match_data *<i>match_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function returns the size, in bytes, of the heapframes data block that is
owned by its argument.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,39 @@
<html>
<head>
<title>pcre2_get_match_data_size specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_get_match_data_size man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *<i>match_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function returns the size, in bytes, of the match data block that is its
argument.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,39 @@
<html>
<head>
<title>pcre2_get_ovector_count specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_get_ovector_count man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>uint32_t pcre2_get_ovector_count(pcre2_match_data *<i>match_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function returns the number of pairs of offsets in the ovector that forms
part of the given match data block.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,40 @@
<html>
<head>
<title>pcre2_get_ovector_pointer specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_get_ovector_pointer man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *<i>match_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function returns a pointer to the vector of offsets that forms part of the
given match data block. The number of pairs can be found by calling
<b>pcre2_get_ovector_count()</b>.
</P>
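<P>
As an illustration of how the pairs are laid out, the following is a minimal
sketch, not part of the PCRE2 distribution. It assumes that the pointer and
the pair count were obtained from <b>pcre2_get_ovector_pointer()</b> and
<b>pcre2_get_ovector_count()</b>, and it uses plain <b>size_t</b> (which is
what PCRE2_SIZE is defined as) so that it compiles without pcre2.h. The names
<b>copy_group()</b> and UNSET_OFFSET are hypothetical; UNSET_OFFSET stands in
for PCRE2_UNSET.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for PCRE2_UNSET, which pcre2.h defines as ~(PCRE2_SIZE)0. */
#define UNSET_OFFSET (~(size_t)0)

/*
 * Copy capture group n out of subject into out, adding a terminating zero.
 * The ovector holds pair_count pairs: ovector[2*n] is the start offset of
 * group n and ovector[2*n + 1] is the offset just past its end, both in
 * code units (bytes for the 8-bit library).
 * Returns the number of bytes copied, or -1 if the group is out of range,
 * unset, or does not fit in out.
 */
static long copy_group(const char *subject, const size_t *ovector,
                       uint32_t pair_count, uint32_t n,
                       char *out, size_t out_size)
{
    assert(subject != NULL && ovector != NULL && out != NULL);
    if (n >= pair_count) return -1;

    size_t start = ovector[2 * (size_t)n];
    size_t end = ovector[2 * (size_t)n + 1];
    if (start == UNSET_OFFSET || end == UNSET_OFFSET) return -1;
    if (end < start) return -1;          /* defensive check */

    size_t len = end - start;
    if (len + 1 > out_size) return -1;   /* need room for the zero */
    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (long)len;
}
```

<P>
Pair 0 always describes the whole match, so calling the sketch with <i>n</i>
set to zero yields the matched string, and higher values yield captured
substrings.
</P>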
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,44 @@
<html>
<head>
<title>pcre2_get_startchar specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_get_startchar man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
After a successful call of <b>pcre2_match()</b> that was passed the match block
that is this function's argument, this function returns the code unit offset of
the character at which the successful match started. For a non-partial match,
this can be different to the value of <i>ovector[0]</i> if the pattern contains
the \K escape sequence. After a partial match, however, this value is always
the same as <i>ovector[0]</i> because \K does not affect the result of a
partial match.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,74 @@
<html>
<head>
<title>pcre2_jit_compile specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_jit_compile man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_jit_compile(pcre2_code *<i>code</i>, uint32_t <i>options</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function requests JIT compilation, which, if the just-in-time compiler is
available, further processes a compiled pattern into machine code that executes
much faster than the <b>pcre2_match()</b> interpretive matching function. Full
details are given in the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation.
</P>
<P>
The availability of JIT support can be tested by calling
<b>pcre2_jit_compile()</b> with a single option PCRE2_JIT_TEST_ALLOC (the
code argument is ignored, so a NULL value is accepted). Such a call
returns zero if JIT is available and has a working allocator. Otherwise
it returns PCRE2_ERROR_NOMEMORY if JIT is available but cannot allocate
executable memory, or PCRE2_ERROR_JIT_UNSUPPORTED if JIT support is not
compiled in.
</P>
<P>
Otherwise, the first argument must be a pointer that was returned by a
successful call to <b>pcre2_compile()</b>, and the second must contain one or
more of the following bits:
<pre>
PCRE2_JIT_COMPLETE compile code for full matching
PCRE2_JIT_PARTIAL_SOFT compile code for soft partial matching
PCRE2_JIT_PARTIAL_HARD compile code for hard partial matching
</pre>
There is also an obsolete option called PCRE2_JIT_INVALID_UTF, which has been
superseded by the <b>pcre2_compile()</b> option PCRE2_MATCH_INVALID_UTF. The old
option is deprecated and may be removed in the future.
</P>
<P>
The yield of the function when called with any of the three options above is 0
for success, or a negative error code otherwise. In particular,
PCRE2_ERROR_JIT_BADOPTION is returned if JIT is not supported or if an unknown
bit is set in <i>options</i>. The function can also return PCRE2_ERROR_NOMEMORY
if JIT is unable to allocate executable memory for the compiler, even if it was
because of a system security restriction. In a few cases, the function may
return with PCRE2_ERROR_JIT_UNSUPPORTED for unsupported features.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,43 @@
<html>
<head>
<title>pcre2_jit_free_unused_memory specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_jit_free_unused_memory man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function frees unused JIT executable memory. The argument is a general
context, for custom memory management, or NULL for standard memory management.
JIT memory allocation retains some memory in order to improve future JIT
compilation speed. In low memory conditions,
<b>pcre2_jit_free_unused_memory()</b> can be used to cause this memory to be
freed.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,70 @@
<html>
<head>
<title>pcre2_jit_match specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_jit_match man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_jit_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function matches a compiled regular expression that has been successfully
processed by the JIT compiler against a given subject string, using a matching
algorithm that is similar to Perl's. It is a "fast path" interface to JIT, and
it bypasses some of the sanity checks that <b>pcre2_match()</b> applies.
</P>
<P>
In UTF mode, the subject string is not checked for UTF validity. Unless
PCRE2_MATCH_INVALID_UTF was set when the pattern was compiled, passing an
invalid UTF string results in undefined behaviour. Your program may crash or
loop or give wrong results. In the absence of PCRE2_MATCH_INVALID_UTF you
should only call <b>pcre2_jit_match()</b> in UTF mode if you are sure the
subject is valid.
</P>
<P>
The arguments for <b>pcre2_jit_match()</b> are exactly the same as for
<a href="pcre2_match.html"><b>pcre2_match()</b>,</a>
except that the subject string must be specified with a length;
PCRE2_ZERO_TERMINATED is not supported.
</P>
<P>
The supported options are PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
PCRE2_NOTEMPTY_ATSTART, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Unsupported
options are ignored.
</P>
<P>
The return values are the same as for <b>pcre2_match()</b> plus
PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or complete) is requested
that was not compiled. For details of partial matching, see the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
page.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the JIT API in the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,75 @@
<html>
<head>
<title>pcre2_jit_stack_assign specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_jit_stack_assign man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>void pcre2_jit_stack_assign(pcre2_match_context *<i>mcontext</i>,</b>
<b> pcre2_jit_callback <i>callback_function</i>, void *<i>callback_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function provides control over the memory used by JIT as a run-time stack
when <b>pcre2_match()</b> or <b>pcre2_jit_match()</b> is called with a pattern
that has been successfully processed by the JIT compiler. The information that
determines which stack is used is put into a match context that is subsequently
passed to a matching function. The arguments of this function are:
<pre>
mcontext a pointer to a match context
callback a callback function
callback_data a JIT stack or a value to be passed to the callback
</PRE>
</P>
<P>
If <i>mcontext</i> is NULL, the function returns immediately, without doing
anything.
</P>
<P>
If <i>callback</i> is NULL and <i>callback_data</i> is NULL, an internal 32KiB
block on the machine stack is used.
</P>
<P>
If <i>callback</i> is NULL and <i>callback_data</i> is not NULL,
<i>callback_data</i> must be a valid JIT stack, the result of calling
<b>pcre2_jit_stack_create()</b>.
</P>
<P>
If <i>callback</i> is not NULL, it is called with <i>callback_data</i> as an
argument at the start of matching, in order to set up a JIT stack. If the
result is NULL, the internal 32KiB stack is used; otherwise the return value
must be a valid JIT stack, the result of calling
<b>pcre2_jit_stack_create()</b>.
</P>
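<P>
The three cases above can be summarized as a small decision function. The
following is only a toy model of the documented rules, written so that it
compiles without pcre2.h; it is not how the library is implemented. The names
<b>toy_stack</b>, <b>toy_callback</b>, <b>internal_block</b>,
<b>choose_stack()</b>, <b>pass_through()</b> and <b>demo()</b> are all
hypothetical, with <b>toy_stack</b> standing in for the opaque
<b>pcre2_jit_stack</b> type.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for pcre2_jit_stack; the real type is opaque. */
typedef struct toy_stack { size_t max_size; } toy_stack;

/* Same shape as a JIT stack callback: takes callback_data, returns a
   stack or NULL. */
typedef toy_stack *(*toy_callback)(void *);

/* Returned by choose_stack() to mean "use the internal 32KiB block". */
static toy_stack internal_block = { 32 * 1024 };

/* Model of which stack is used at the start of matching. */
static toy_stack *choose_stack(toy_callback callback, void *callback_data)
{
    if (callback == NULL) {
        /* No callback: callback_data is either NULL or a JIT stack. */
        return callback_data != NULL ? (toy_stack *)callback_data
                                     : &internal_block;
    }
    /* Callback set: it decides; a NULL result means the internal block. */
    toy_stack *chosen = callback(callback_data);
    return chosen != NULL ? chosen : &internal_block;
}

/* Example callback: hands back the stack it was given as callback_data. */
static toy_stack *pass_through(void *data)
{
    return (toy_stack *)data;
}

/* Exercise each documented case once. */
static void demo(void)
{
    toy_stack mine = { 64 * 1024 };
    assert(choose_stack(NULL, NULL) == &internal_block);
    assert(choose_stack(NULL, &mine) == &mine);
    assert(choose_stack(pass_through, &mine) == &mine);
    assert(choose_stack(pass_through, NULL) == &internal_block);
}
```

<P>
In a real program the callback form is mainly useful when the stack to use
is only known at match time, for example when each thread keeps its own
stack.
</P>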
<P>
You may safely use the same JIT stack for multiple patterns, as long as they
are all matched in the same thread. In a multithread application, each thread
must use its own JIT stack. For more details, see the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
page.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,50 @@
<html>
<head>
<title>pcre2_jit_stack_create specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_jit_stack_create man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_jit_stack *pcre2_jit_stack_create(size_t <i>startsize</i>,</b>
<b> size_t <i>maxsize</i>, pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function is used to create a stack for use by the code compiled by the JIT
compiler. The first two arguments are a starting size for the stack, and a
maximum size to which it is allowed to grow. The final argument is a general
context, for memory allocation functions, or NULL for standard memory
allocation. The result can be passed to the JIT run-time code by calling
<b>pcre2_jit_stack_assign()</b> to associate the stack with a compiled pattern,
which can then be processed by <b>pcre2_match()</b> or <b>pcre2_jit_match()</b>.
A maximum stack size of 512KiB to 1MiB should be more than enough for any
pattern. If the stack couldn't be allocated or the values passed were not
reasonable, NULL will be returned. For more details, see the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
page.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,43 @@
<html>
<head>
<title>pcre2_jit_stack_free specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_jit_stack_free man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>void pcre2_jit_stack_free(pcre2_jit_stack *<i>jit_stack</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function is used to free a JIT stack that was created by
<b>pcre2_jit_stack_create()</b> when it is no longer needed. If the argument is
NULL, the function returns immediately without doing anything. For more
details, see the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
page.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,48 @@
<html>
<head>
<title>pcre2_maketables specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_maketables man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>const uint8_t *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function builds a set of character tables for character code points that
are less than 256. These can be passed to <b>pcre2_compile()</b> in a compile
context in order to override the internal, built-in tables (which were either
defaulted or made by <b>pcre2_maketables()</b> when PCRE2 was compiled). See the
<a href="pcre2_set_character_tables.html"><b>pcre2_set_character_tables()</b></a>
page. You might want to do this if you are using a non-standard locale.
</P>
<P>
If the argument is NULL, <b>malloc()</b> is used to get memory for the tables.
Otherwise it must point to a general context, which can supply pointers to a
custom memory manager. The function yields a pointer to the tables.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,44 @@
<html>
<head>
<title>pcre2_maketables_free specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_maketables_free man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>void pcre2_maketables_free(pcre2_general_context *<i>gcontext</i>,</b>
<b> const uint8_t *<i>tables</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function discards a set of character tables that were created by a call
to
<a href="pcre2_maketables.html"><b>pcre2_maketables()</b>.</a>
</P>
<P>
The <i>gcontext</i> parameter should match what was used in that call to
account for any custom allocators that might be in use; if it is NULL
the system <b>free()</b> is used.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,87 @@
<html>
<head>
<title>pcre2_match specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_match man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function matches a compiled regular expression against a given subject
string, using a matching algorithm that is similar to Perl's. It returns
offsets to what it has matched and to captured substrings via the
<b>match_data</b> block, which can be processed by functions with names that
start with <b>pcre2_get_ovector_...()</b> or <b>pcre2_substring_...()</b>. The
return from <b>pcre2_match()</b> is one more than the highest numbered capturing
pair that has been set (for example, 1 if there are no captures), zero if the
vector of offsets is too small, or a negative error code for no match and other
errors. The function arguments are:
<pre>
<i>code</i> Points to the compiled pattern
<i>subject</i> Points to the subject string
<i>length</i> Length of the subject string
<i>startoffset</i> Offset in the subject at which to start matching
<i>options</i> Option bits
<i>match_data</i> Points to a match data block, for results
<i>mcontext</i> Points to a match context, or is NULL
</pre>
A match context is needed only if you want to:
<pre>
Set up a callout function
Set a matching offset limit
Change the heap memory limit
Change the backtracking match limit
Change the backtracking depth limit
Set custom memory management specifically for the match
</pre>
The <i>length</i> and <i>startoffset</i> values are code units, not characters.
The length may be given as PCRE2_ZERO_TERMINATED for a subject that is
terminated by a binary zero code unit. The options are:
<pre>
PCRE2_ANCHORED Match only at the first position
PCRE2_COPY_MATCHED_SUBJECT
On success, make a private subject copy
PCRE2_DISABLE_RECURSELOOP_CHECK
Only useful in rare cases; use with care
PCRE2_ENDANCHORED Pattern can match only at end of subject
PCRE2_NOTBOL Subject string is not the beginning of a line
PCRE2_NOTEOL Subject string is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject is not a valid match
PCRE2_NO_JIT Do not use JIT matching
PCRE2_NO_UTF_CHECK Do not check the subject for UTF validity (only relevant if PCRE2_UTF
was set at compile time)
PCRE2_PARTIAL_HARD Return PCRE2_ERROR_PARTIAL for a partial match even if there is a full match
PCRE2_PARTIAL_SOFT Return PCRE2_ERROR_PARTIAL for a partial match if no full matches are found
</pre>
For details of partial matching, see the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
page. There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
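<P>
The three kinds of return value can be turned into "how many offset pairs may
I read" with a few lines of code. The following is a minimal sketch under
stated assumptions, not part of the PCRE2 distribution: the name
<b>usable_pairs()</b> is hypothetical, and the zero case relies on the
behaviour described in the <b>pcre2api</b> page, namely that when the ovector
is too small it is filled as far as possible. Negative returns are treated
here simply as "nothing to read".
</P>

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of offset pairs that hold valid data after pcre2_match() has
 * returned rc, for a match data block whose ovector has ovector_pairs
 * pairs (the value given by pcre2_get_ovector_count()).
 */
static uint32_t usable_pairs(int rc, uint32_t ovector_pairs)
{
    if (rc < 0) return 0;               /* no match, or an error */
    if (rc == 0) return ovector_pairs;  /* ovector too small: all of it
                                           was filled */
    assert((uint32_t)rc <= ovector_pairs);
    return (uint32_t)rc;                /* pairs 0 .. rc-1 */
}
```

<P>
For example, a return of 1 means that only pair 0 (the whole match) was set,
which is what happens when the pattern has no capture groups.
</P>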
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,41 @@
<html>
<head>
<title>pcre2_match_context_copy specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_match_context_copy man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_match_context *pcre2_match_context_copy(</b>
<b> pcre2_match_context *<i>mcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function makes a new copy of a match context, using the memory
allocation function that was used for the original context. The result is NULL
if the memory cannot be obtained.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,42 @@
<html>
<head>
<title>pcre2_match_context_create specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_match_context_create man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_match_context *pcre2_match_context_create(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function creates and initializes a new match context. If its argument is
NULL, <b>malloc()</b> is used to get the necessary memory; otherwise the memory
allocation function within the general context is used. The result is NULL if
the memory could not be obtained.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,41 @@
<html>
<head>
<title>pcre2_match_context_free specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_match_context_free man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>void pcre2_match_context_free(pcre2_match_context *<i>mcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function frees the memory occupied by a match context, using the memory
freeing function from the general context with which it was created, or
<b>free()</b> if that was not set. If the argument is NULL, the function returns
immediately without doing anything.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,50 @@
<html>
<head>
<title>pcre2_match_data_create specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_match_data_create man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function creates a new match data block, which is used for holding the
result of a match. The first argument specifies the number of pairs of offsets
that are required. These form the "output vector" (ovector) within the match
data block, and are used to identify the matched string and any captured
substrings when matching with <b>pcre2_match()</b>, or a number of different
matches at the same point when used with <b>pcre2_dfa_match()</b>. There is
always one pair of offsets; if <b>ovecsize</b> is zero, it is treated as one.
</P>
<P>
The second argument points to a general context, for custom memory management,
or is NULL for system memory management. The result of the function is NULL if
the memory for the block could not be obtained.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,53 @@
<html>
<head>
<title>pcre2_match_data_create_from_pattern specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_match_data_create_from_pattern man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>pcre2_match_data *pcre2_match_data_create_from_pattern(</b>
<b> const pcre2_code *<i>code</i>, pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function creates a new match data block for holding the result of a match.
The first argument points to a compiled pattern. The number of capturing
parentheses within the pattern is used to compute the number of pairs of
offsets that are required in the match data block. These form the "output
vector" (ovector) within the match data block, and are used to identify the
matched string and any captured substrings when matching with
<b>pcre2_match()</b>. If you are using <b>pcre2_dfa_match()</b>, which uses the
output vector in a different way, you should use <b>pcre2_match_data_create()</b>
instead of this function.
</P>
<P>
The second argument points to a general context, for custom memory management,
or is NULL to use the same memory allocator as was used for the compiled
pattern. The result of the function is NULL if the memory for the block could
not be obtained.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,48 @@
<html>
<head>
<title>pcre2_match_data_free specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_match_data_free man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>void pcre2_match_data_free(pcre2_match_data *<i>match_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
If <i>match_data</i> is NULL, this function does nothing. Otherwise,
<i>match_data</i> must point to a match data block, which this function frees,
using the memory freeing function from the general context or compiled pattern
with which it was created, or <b>free()</b> if that was not set. If the match
data block was previously passed to <b>pcre2_match()</b>, it will have an
attached heapframe vector; this is also freed.
</P>
<P>
If the PCRE2_COPY_MATCHED_SUBJECT option was used for a successful match using this
match data block, the copy of the subject that was referenced within the block
is also freed.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,70 @@
<html>
<head>
<title>pcre2_pattern_convert specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_pattern_convert man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_pattern_convert(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
<b> uint32_t <i>options</i>, PCRE2_UCHAR **<i>buffer</i>,</b>
<b> PCRE2_SIZE *<i>blength</i>, pcre2_convert_context *<i>cvcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function is part of an experimental set of pattern conversion functions.
It converts a foreign pattern (for example, a glob) into a PCRE2 regular
expression pattern. Its arguments are:
<pre>
<i>pattern</i> The foreign pattern
<i>length</i> The length of the input pattern or PCRE2_ZERO_TERMINATED
<i>options</i> Option bits
<i>buffer</i> Pointer to pointer to output buffer, or NULL
<i>blength</i> Pointer to output length field
<i>cvcontext</i> Pointer to a convert context or NULL
</pre>
The length of the converted pattern (excluding the terminating zero) is
returned via <i>blength</i>. If <i>buffer</i> is NULL, the function just returns
the output length. If <i>buffer</i> points to a NULL pointer, heap memory is
obtained for the converted pattern, using the allocator in the context if
present (or else <b>malloc()</b>), and the field pointed to by <i>buffer</i> is
updated. If <i>buffer</i> points to a non-NULL field, that must point to a
buffer whose size is in the variable pointed to by <i>blength</i>. This value is
updated.
</P>
<P>
The option bits are:
<pre>
  PCRE2_CONVERT_UTF                     Input is UTF
  PCRE2_CONVERT_NO_UTF_CHECK            Do not check UTF validity
  PCRE2_CONVERT_POSIX_BASIC             Convert POSIX basic pattern
  PCRE2_CONVERT_POSIX_EXTENDED          Convert POSIX extended pattern
  PCRE2_CONVERT_GLOB                    ) Convert
  PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR  ) various types
  PCRE2_CONVERT_GLOB_NO_STARSTAR        ) of glob
</pre>
The return value from <b>pcre2_pattern_convert()</b> is zero on success or a
non-zero PCRE2 error code.
</P>
<P>
The pattern conversion functions are described in the
<a href="pcre2convert.html"><b>pcre2convert</b></a>
documentation.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,109 @@
<html>
<head>
<title>pcre2_pattern_info specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_pattern_info man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>,</b>
<b> void *<i>where</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function returns information about a compiled pattern. Its arguments are:
<pre>
<i>code</i> Pointer to a compiled regular expression pattern
<i>what</i> What information is required
<i>where</i> Where to put the information
</pre>
The recognized values for the <i>what</i> argument, and the information they
request are as follows:
<pre>
  PCRE2_INFO_ALLOPTIONS      Final options after compiling
  PCRE2_INFO_ARGOPTIONS      Options passed to <b>pcre2_compile()</b>
  PCRE2_INFO_BACKREFMAX      Number of highest backreference
  PCRE2_INFO_BSR             What \R matches:
                               PCRE2_BSR_UNICODE: Unicode line endings
                               PCRE2_BSR_ANYCRLF: CR, LF, or CRLF only
  PCRE2_INFO_CAPTURECOUNT    Number of capturing subpatterns
  PCRE2_INFO_DEPTHLIMIT      Backtracking depth limit if set, otherwise PCRE2_ERROR_UNSET
  PCRE2_INFO_EXTRAOPTIONS    Extra options that were passed in the
                               compile context
  PCRE2_INFO_FIRSTBITMAP     Bitmap of first code units, or NULL
  PCRE2_INFO_FIRSTCODETYPE   Type of start-of-match information
                               0 nothing set
                               1 first code unit is set
                               2 start of string or after newline
  PCRE2_INFO_FIRSTCODEUNIT   First code unit when type is 1
  PCRE2_INFO_FRAMESIZE       Size of backtracking frame
  PCRE2_INFO_HASBACKSLASHC   Return 1 if pattern contains \C
  PCRE2_INFO_HASCRORLF       Return 1 if explicit CR or LF matches exist in the pattern
  PCRE2_INFO_HEAPLIMIT       Heap memory limit if set, otherwise PCRE2_ERROR_UNSET
  PCRE2_INFO_JCHANGED        Return 1 if (?J) or (?-J) was used
  PCRE2_INFO_JITSIZE         Size of JIT compiled code, or 0
  PCRE2_INFO_LASTCODETYPE    Type of must-be-present information
                               0 nothing set
                               1 code unit is set
  PCRE2_INFO_LASTCODEUNIT    Last code unit when type is 1
  PCRE2_INFO_MATCHEMPTY      1 if the pattern can match an empty string, 0 otherwise
  PCRE2_INFO_MATCHLIMIT      Match limit if set, otherwise PCRE2_ERROR_UNSET
  PCRE2_INFO_MAXLOOKBEHIND   Length (in characters) of the longest lookbehind assertion
  PCRE2_INFO_MINLENGTH       Lower bound length of matching strings
  PCRE2_INFO_NAMECOUNT       Number of named subpatterns
  PCRE2_INFO_NAMEENTRYSIZE   Size of name table entries
  PCRE2_INFO_NAMETABLE       Pointer to name table
  PCRE2_INFO_NEWLINE         Code for the newline sequence:
                               PCRE2_NEWLINE_CR
                               PCRE2_NEWLINE_LF
                               PCRE2_NEWLINE_CRLF
                               PCRE2_NEWLINE_ANY
                               PCRE2_NEWLINE_ANYCRLF
                               PCRE2_NEWLINE_NUL
  PCRE2_INFO_RECURSIONLIMIT  Obsolete synonym for PCRE2_INFO_DEPTHLIMIT
  PCRE2_INFO_SIZE            Size of compiled pattern
</pre>
If <i>where</i> is NULL, the function returns the amount of memory needed for
the requested information, in bytes. Otherwise, the <i>where</i> argument must
point to an unsigned 32-bit integer (uint32_t variable), except for the
following <i>what</i> values, when it must point to a variable of the type
shown:
<pre>
PCRE2_INFO_FIRSTBITMAP const uint8_t *
PCRE2_INFO_JITSIZE size_t
PCRE2_INFO_NAMETABLE PCRE2_SPTR
PCRE2_INFO_SIZE size_t
</pre>
The yield of the function is zero on success or:
<pre>
PCRE2_ERROR_NULL the argument <i>code</i> is NULL
PCRE2_ERROR_BADMAGIC the "magic number" was not found
PCRE2_ERROR_BADOPTION the value of <i>what</i> is invalid
PCRE2_ERROR_BADMODE the pattern was compiled in the wrong mode
PCRE2_ERROR_UNSET the requested information is not set
</PRE>
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,65 @@
<html>
<head>
<title>pcre2_serialize_decode specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_serialize_decode man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int32_t pcre2_serialize_decode(pcre2_code **<i>codes</i>,</b>
<b> int32_t <i>number_of_codes</i>, const uint8_t *<i>bytes</i>,</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function decodes a serialized set of compiled patterns back into a list of
individual patterns. This is possible only on a host that is running the same
version of PCRE2, with the same code unit width, and the host must also have
the same endianness, pointer width and PCRE2_SIZE type. The arguments for
<b>pcre2_serialize_decode()</b> are:
<pre>
<i>codes</i> pointer to a vector in which to build the list
<i>number_of_codes</i> number of slots in the vector
<i>bytes</i> the serialized byte stream
<i>gcontext</i> pointer to a general context or NULL
</pre>
The <i>bytes</i> argument must point to a block of data that was originally
created by <b>pcre2_serialize_encode()</b>, though it may have been saved on
disc or elsewhere in the meantime. If there are more codes in the serialized
data than slots in the list, only those compiled patterns that will fit are
decoded. The yield of the function is the number of decoded patterns, or one of
the following negative error codes:
<pre>
PCRE2_ERROR_BADDATA <i>number_of_codes</i> is zero or less
PCRE2_ERROR_BADMAGIC mismatch of id bytes in <i>bytes</i>
  PCRE2_ERROR_BADMODE   mismatch of variable unit size or PCRE2 version
PCRE2_ERROR_NOMEMORY memory allocation failed
PCRE2_ERROR_NULL <i>codes</i> or <i>bytes</i> is NULL
</pre>
PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was compiled
on a system with different endianness.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the serialization functions in the
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,66 @@
<html>
<head>
<title>pcre2_serialize_encode specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_serialize_encode man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int32_t pcre2_serialize_encode(const pcre2_code **<i>codes</i>,</b>
<b> int32_t <i>number_of_codes</i>, uint8_t **<i>serialized_bytes</i>,</b>
<b> PCRE2_SIZE *<i>serialized_size</i>, pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function encodes a list of compiled patterns into a byte stream that can
be saved on disc or elsewhere. Note that this is not an abstract format like
Java or .NET serialization. Conversion of the byte stream back into usable compiled patterns
can only happen on a host that is running the same version of PCRE2, with the
same code unit width, and the host must also have the same endianness, pointer
width and PCRE2_SIZE type. The arguments for <b>pcre2_serialize_encode()</b>
are:
<pre>
<i>codes</i> pointer to a vector containing the list
<i>number_of_codes</i> number of slots in the vector
<i>serialized_bytes</i> set to point to the serialized byte stream
<i>serialized_size</i> set to the number of bytes in the byte stream
<i>gcontext</i> pointer to a general context or NULL
</pre>
The context argument is used to obtain memory for the byte stream. When the
serialized data is no longer needed, it must be freed by calling
<b>pcre2_serialize_free()</b>. The yield of the function is the number of
serialized patterns, or one of the following negative error codes:
<pre>
PCRE2_ERROR_BADDATA <i>number_of_codes</i> is zero or less
PCRE2_ERROR_BADMAGIC mismatch of id bytes in one of the patterns
  PCRE2_ERROR_NOMEMORY     memory allocation failed
PCRE2_ERROR_MIXEDTABLES the patterns do not all use the same tables
PCRE2_ERROR_NULL an argument other than <i>gcontext</i> is NULL
</pre>
PCRE2_ERROR_BADMAGIC means either that a pattern's code has been corrupted, or
that a slot in the vector does not point to a compiled pattern.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the serialization functions in the
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,41 @@
<html>
<head>
<title>pcre2_serialize_free specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_serialize_free man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>void pcre2_serialize_free(uint8_t *<i>bytes</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function frees the memory that was obtained by
<b>pcre2_serialize_encode()</b> to hold a serialized byte stream. The argument
must point to such a byte stream or be NULL, in which case the function returns
without doing anything.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the serialization functions in the
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,49 @@
<html>
<head>
<title>pcre2_serialize_get_number_of_codes specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_serialize_get_number_of_codes man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int32_t pcre2_serialize_get_number_of_codes(const uint8_t *<i>bytes</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
The <i>bytes</i> argument must point to a serialized byte stream that was
originally created by <b>pcre2_serialize_encode()</b> (though it may have been
saved on disc or elsewhere in the meantime). The function returns the number of
serialized patterns in the byte stream, or one of the following negative error
codes:
<pre>
PCRE2_ERROR_BADMAGIC mismatch of id bytes in <i>bytes</i>
  PCRE2_ERROR_BADMODE   mismatch of variable unit size or PCRE2 version
PCRE2_ERROR_NULL the argument is NULL
</pre>
PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was compiled
on a system with different endianness.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the serialization functions in the
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,42 @@
<html>
<head>
<title>pcre2_set_bsr specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_bsr man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_bsr(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets the convention for processing \R within a compile context.
The second argument must be one of PCRE2_BSR_ANYCRLF or PCRE2_BSR_UNICODE. The
result is zero for success or PCRE2_ERROR_BADDATA if the second argument is
invalid.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,43 @@
<html>
<head>
<title>pcre2_set_callout specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_callout man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b>  int (*<i>callout_function</i>)(pcre2_callout_block *, void *),</b>
<b> void *<i>callout_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets the callout fields in a match context (the first argument).
The second argument specifies a callout function, and the third argument is an
opaque data item that is passed to it. The result of this function is always
zero.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,45 @@
<html>
<head>
<title>pcre2_set_character_tables specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_character_tables man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
<b> const uint8_t *<i>tables</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets a pointer to custom character tables within a compile
context. The second argument must point to a set of PCRE2 character tables or
be NULL to request the default tables. The result is always zero. Character
tables can be created by calling <b>pcre2_maketables()</b> or by running the
<b>pcre2_dftables</b> maintenance command in binary mode (see the
<a href="pcre2build.html"><b>pcre2build</b></a>
documentation).
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,58 @@
<html>
<head>
<title>pcre2_set_compile_extra_options specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_compile_extra_options man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_compile_extra_options(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>extra_options</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets additional option bits for <b>pcre2_compile()</b> that are
housed in a compile context. It completely replaces all the bits. The extra
options are:
<pre>
PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK Allow \K in lookarounds
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES Allow \x{d800} to \x{dfff} in UTF-8 and UTF-32 modes
PCRE2_EXTRA_ALT_BSUX Extended alternate \u, \U, and \x handling
PCRE2_EXTRA_ASCII_BSD \d remains ASCII in UCP mode
PCRE2_EXTRA_ASCII_BSS \s remains ASCII in UCP mode
PCRE2_EXTRA_ASCII_BSW \w remains ASCII in UCP mode
PCRE2_EXTRA_ASCII_DIGIT [:digit:] and [:xdigit:] POSIX classes remain ASCII in UCP mode
PCRE2_EXTRA_ASCII_POSIX POSIX classes remain ASCII in UCP mode
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as a literal following character
PCRE2_EXTRA_CASELESS_RESTRICT Disable mixed ASCII/non-ASCII case folding
PCRE2_EXTRA_ESCAPED_CR_IS_LF Interpret \r as \n
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
PCRE2_EXTRA_NEVER_CALLOUT Disallow callouts in pattern
PCRE2_EXTRA_NO_BS0 Disallow \0 (but not \00 or \000)
PCRE2_EXTRA_PYTHON_OCTAL Use Python rules for octal
PCRE2_EXTRA_TURKISH_CASING Use Turkish I case folding
</pre>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,46 @@
<html>
<head>
<title>pcre2_set_compile_recursion_guard specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_compile_recursion_guard man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b>
<b> int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function defines, within a compile context, a function that is called
whenever <b>pcre2_compile()</b> starts to compile a parenthesized part of a
pattern. The first argument to the function gives the current depth of
parenthesis nesting, and the second is user data that is supplied when the
function is set up. The guard function should return zero if all is well, or
non-zero to force an error. This feature is provided so that applications can
check the available system stack space, in order to avoid running out. The
result of <b>pcre2_set_compile_recursion_guard()</b> is always zero.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,40 @@
<html>
<head>
<title>pcre2_set_depth_limit specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_depth_limit man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_depth_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets the backtracking depth limit field in a match context. The
result is always zero.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,43 @@
<html>
<head>
<title>pcre2_set_glob_escape specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_glob_escape man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_glob_escape(pcre2_convert_context *<i>cvcontext</i>,</b>
<b> uint32_t <i>escape_char</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function is part of an experimental set of pattern conversion functions.
It sets the escape character that is used when converting globs. The second
argument must either be zero (meaning there is no escape character) or a
punctuation character whose code point is less than 256. The default is grave
accent if running under Windows, otherwise backslash. The result of the
function is zero for success or PCRE2_ERROR_BADDATA if the second argument is
invalid.
</P>
<P>
The pattern conversion functions are described in the
<a href="pcre2convert.html"><b>pcre2convert</b></a>
documentation.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,42 @@
<html>
<head>
<title>pcre2_set_glob_separator specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_glob_separator man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_glob_separator(pcre2_convert_context *<i>cvcontext</i>,</b>
<b> uint32_t <i>separator_char</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function is part of an experimental set of pattern conversion functions.
It sets the component separator character that is used when converting globs.
The second argument must be one of the characters forward slash, backslash, or
dot. The default is backslash when running under Windows, otherwise forward
slash. The result of the function is zero for success or PCRE2_ERROR_BADDATA if
the second argument is invalid.
</P>
<P>
The pattern conversion functions are described in the
<a href="pcre2convert.html"><b>pcre2convert</b></a>
documentation.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,40 @@
<html>
<head>
<title>pcre2_set_heap_limit specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_heap_limit man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_heap_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets the backtracking heap limit field in a match context. The
result is always zero.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,40 @@
<html>
<head>
<title>pcre2_set_match_limit specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_match_limit man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets the match limit field in a match context. The result is
always zero.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,44 @@
<html>
<head>
<title>pcre2_set_max_pattern_compiled_length specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_max_pattern_compiled_length man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_max_pattern_compiled_length(</b>
<b> pcre2_compile_context *<i>ccontext</i>, PCRE2_SIZE <i>value</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets, in a compile context, the maximum size (in bytes) for the
memory needed to hold the compiled version of a pattern that is using this
context. The result is always zero. If a pattern that is passed to
<b>pcre2_compile()</b> referencing this context needs more memory, an error is
generated. The default is the largest number that a PCRE2_SIZE variable can
hold, which is effectively unlimited.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,43 @@
<html>
<head>
<title>pcre2_set_max_pattern_length specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_max_pattern_length man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
<b> PCRE2_SIZE <i>value</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets, in a compile context, the maximum text length (in code
units) of the pattern that can be compiled. The result is always zero. If a
longer pattern is passed to <b>pcre2_compile()</b> there is an immediate error
return. The default is effectively unlimited, being the largest value a
PCRE2_SIZE variable can hold.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,42 @@
<html>
<head>
<title>pcre2_set_max_varlookbehind specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_max_varlookbehind man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_max_varlookbehind(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets, in a compile context, the maximum number of characters
that a variable-length lookbehind assertion may match. The default is set when
PCRE2 is built, with the ultimate default being 255, the same as Perl.
Lookbehind assertions without a bounding length are not supported. The result
is always zero.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,51 @@
<html>
<head>
<title>pcre2_set_newline specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_newline man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_newline(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets the newline convention within a compile context. This
specifies which character(s) are recognized as newlines when compiling and
matching patterns. The second argument must be one of:
<pre>
PCRE2_NEWLINE_CR Carriage return only
PCRE2_NEWLINE_LF Linefeed only
PCRE2_NEWLINE_CRLF CR followed by LF only
PCRE2_NEWLINE_ANYCRLF Any of the above
PCRE2_NEWLINE_ANY Any Unicode newline sequence
PCRE2_NEWLINE_NUL The NUL character (binary zero)
</pre>
The result is zero for success or PCRE2_ERROR_BADDATA if the second argument is
invalid.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,40 @@
<html>
<head>
<title>pcre2_set_offset_limit specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_offset_limit man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> PCRE2_SIZE <i>value</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets the offset limit field in a match context. The result is
always zero.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,57 @@
<html>
<head>
<title>pcre2_set_optimize specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_optimize man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_optimize(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>directive</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function controls which performance optimizations will be applied
by <b>pcre2_compile()</b>. It can be called multiple times with the same compile
context; the effects are cumulative, with the effects of later calls taking
precedence over earlier ones.
</P>
<P>
The result is zero for success, PCRE2_ERROR_NULL if <i>ccontext</i> is NULL,
or PCRE2_ERROR_BADOPTION if <i>directive</i> is unknown. The latter could be
useful to detect if a certain optimization is available.
</P>
<P>
The possible values for the <i>directive</i> parameter are:
<pre>
PCRE2_OPTIMIZATION_FULL Enable all optimizations (default)
PCRE2_OPTIMIZATION_NONE Disable all optimizations
PCRE2_AUTO_POSSESS Enable auto-possessification
PCRE2_AUTO_POSSESS_OFF Disable auto-possessification
PCRE2_DOTSTAR_ANCHOR Enable implicit dotstar anchoring
PCRE2_DOTSTAR_ANCHOR_OFF Disable implicit dotstar anchoring
PCRE2_START_OPTIMIZE Enable start-up optimizations at match time
PCRE2_START_OPTIMIZE_OFF Disable start-up optimizations at match time
</pre>
There is a complete description of the PCRE2 native API, including detailed
descriptions of the <i>directive</i> parameter values, in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,40 @@
<html>
<head>
<title>pcre2_set_parens_nest_limit specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_parens_nest_limit man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets, in a compile context, the maximum depth of nested
parentheses in a pattern. The result is always zero.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,40 @@
<html>
<head>
<title>pcre2_set_recursion_limit specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_recursion_limit man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_recursion_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function is obsolete and should not be used in new code. Use
<b>pcre2_set_depth_limit()</b> instead.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,42 @@
<html>
<head>
<title>pcre2_set_recursion_memory_management specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_recursion_memory_management man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_recursion_memory_management(</b>
<b> pcre2_match_context *<i>mcontext</i>,</b>
<b> void *(*<i>private_malloc</i>)(size_t, void *),</b>
<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
From release 10.30 onwards, this function is obsolete and does nothing. The
result is always zero.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,43 @@
<html>
<head>
<title>pcre2_set_substitute_callout specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_substitute_callout man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_substitute_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b> int (*<i>callout_function</i>)(pcre2_substitute_callout_block *, void *),</b>
<b> void *<i>callout_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets the substitute callout fields in a match context (the first
argument). The second argument specifies a callout function, and the third
argument is an opaque data item that is passed to it. The result of this
function is always zero.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,45 @@
<html>
<head>
<title>pcre2_set_substitute_case_callout specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_substitute_case_callout man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_substitute_case_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b> PCRE2_SIZE (*<i>callout_function</i>)(PCRE2_SPTR, PCRE2_SIZE,</b>
<b> PCRE2_UCHAR *, PCRE2_SIZE,</b>
<b> int, void *),</b>
<b> void *<i>callout_data</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function sets the substitute case callout fields in a match context (the
first argument). The second argument specifies a callout function, and the third
argument is an opaque data item that is passed to it. The result of this
function is always zero.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,111 @@
<html>
<head>
<title>pcre2_substitute specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_substitute man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>, PCRE2_SPTR <i>replacement</i>,</b>
<b> PCRE2_SIZE <i>rlength</i>, PCRE2_UCHAR *<i>outputbuffer</i>,</b>
<b> PCRE2_SIZE *<i>outlengthptr</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function matches a compiled regular expression against a given subject
string, using a matching algorithm that is similar to Perl's. It then makes a
copy of the subject, substituting a replacement string for what was matched.
Its arguments are:
<pre>
<i>code</i> Points to the compiled pattern
<i>subject</i> Points to the subject string
<i>length</i> Length of the subject string
<i>startoffset</i> Offset in the subject at which to start matching
<i>options</i> Option bits
<i>match_data</i> Points to a match data block, or is NULL
<i>mcontext</i> Points to a match context, or is NULL
<i>replacement</i> Points to the replacement string
<i>rlength</i> Length of the replacement string
<i>outputbuffer</i> Points to the output buffer
<i>outlengthptr</i> Points to the length of the output buffer
</pre>
A match data block is needed only if you want to inspect the data from the
final match that is returned in that block or if PCRE2_SUBSTITUTE_MATCHED is
set. A match context is needed only if you want to:
<pre>
Set up a callout function
Set a matching offset limit
Change the backtracking match limit
Change the backtracking depth limit
Set custom memory management in the match context
</pre>
The <i>length</i>, <i>startoffset</i> and <i>rlength</i> values are in code units,
not characters, as is the value in the variable pointed at by
<i>outlengthptr</i>. This variable must contain the length of the output buffer
when the function is called. If the function is successful, the value is
changed to the length of the new string, excluding the trailing zero that is
automatically added.
</P>
<P>
The subject and replacement lengths can be given as PCRE2_ZERO_TERMINATED for
zero-terminated strings. The options are:
<pre>
PCRE2_ANCHORED Match only at the first position
PCRE2_ENDANCHORED Match only at end of subject
PCRE2_NOTBOL Subject is not the beginning of a line
PCRE2_NOTEOL Subject is not the end of a line
PCRE2_NOTEMPTY An empty string is not a valid match
PCRE2_NOTEMPTY_ATSTART An empty string at the start of the subject is not a valid match
PCRE2_NO_JIT Do not use JIT matching
PCRE2_NO_UTF_CHECK Do not check for UTF validity in the subject or replacement
(only relevant if PCRE2_UTF was set at compile time)
PCRE2_SUBSTITUTE_EXTENDED Do extended replacement processing
PCRE2_SUBSTITUTE_GLOBAL Replace all occurrences in the subject
PCRE2_SUBSTITUTE_LITERAL The replacement string is literal
PCRE2_SUBSTITUTE_MATCHED Use pre-existing match data for first match
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH If overflow, compute needed length
PCRE2_SUBSTITUTE_REPLACEMENT_ONLY Return only replacement string(s)
PCRE2_SUBSTITUTE_UNKNOWN_UNSET Treat unknown group as unset
PCRE2_SUBSTITUTE_UNSET_EMPTY Simple unset insert = empty string
</pre>
If PCRE2_SUBSTITUTE_LITERAL is set, PCRE2_SUBSTITUTE_EXTENDED,
PCRE2_SUBSTITUTE_UNKNOWN_UNSET, and PCRE2_SUBSTITUTE_UNSET_EMPTY are ignored.
</P>
<P>
If PCRE2_SUBSTITUTE_MATCHED is set, <i>match_data</i> must be non-NULL; its
contents must be the result of a call to <b>pcre2_match()</b> using the same
pattern and subject.
</P>
<P>
The function returns the number of substitutions, which may be zero if there
are no matches. The result may be greater than one only when
PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a negative error code
is returned.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,58 @@
<html>
<head>
<title>pcre2_substring_copy_byname specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_substring_copy_byname man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_substring_copy_byname(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_UCHAR *<i>buffer</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This is a convenience function for extracting a captured substring, identified
by name, into a given buffer. The arguments are:
<pre>
<i>match_data</i> The match data block for the match
<i>name</i> Name of the required substring
<i>buffer</i> Buffer to receive the string
<i>bufflen</i> Length of buffer (code units)
</pre>
The <i>bufflen</i> variable is updated to contain the length of the extracted
string, excluding the trailing zero. The yield of the function is zero for
success or one of the following error numbers:
<pre>
PCRE2_ERROR_NOSUBSTRING there are no groups of that name
PCRE2_ERROR_UNAVAILABLE the ovector was too small for that group
PCRE2_ERROR_UNSET the group did not participate in the match
PCRE2_ERROR_NOMEMORY the buffer is not big enough
</pre>
If there is more than one group with the given name, the first one that is set
is returned. In this situation PCRE2_ERROR_UNSET means that no group with the
given name was set.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,57 @@
<html>
<head>
<title>pcre2_substring_copy_bynumber specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_substring_copy_bynumber man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_substring_copy_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> uint32_t <i>number</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
<b> PCRE2_SIZE *<i>bufflen</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This is a convenience function for extracting a captured substring into a given
buffer. The arguments are:
<pre>
<i>match_data</i> The match data block for the match
<i>number</i> Number of the required substring
<i>buffer</i> Buffer to receive the string
<i>bufflen</i> Length of buffer
</pre>
The <i>bufflen</i> variable is updated with the length of the extracted string,
excluding the terminating zero. The yield of the function is zero for success
or one of the following error numbers:
<pre>
PCRE2_ERROR_NOSUBSTRING there are no groups of that number
PCRE2_ERROR_UNAVAILABLE the ovector was too small for that group
PCRE2_ERROR_UNSET the group did not participate in the match
PCRE2_ERROR_NOMEMORY the buffer is too small
</pre>
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,41 @@
<html>
<head>
<title>pcre2_substring_free specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_substring_free man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This is a convenience function for freeing the memory obtained by a previous
call to <b>pcre2_substring_get_byname()</b> or
<b>pcre2_substring_get_bynumber()</b>. Its only argument is a pointer to the
string. If the argument is NULL, the function does nothing.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,60 @@
<html>
<head>
<title>pcre2_substring_get_byname specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_substring_get_byname man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_substring_get_byname(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_UCHAR **<i>bufferptr</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This is a convenience function for extracting a captured substring by name into
newly acquired memory. The arguments are:
<pre>
<i>match_data</i> The match data for the match
<i>name</i> Name of the required substring
<i>bufferptr</i> Where to put the string pointer
<i>bufflen</i> Where to put the string length
</pre>
The memory in which the substring is placed is obtained by calling the same
memory allocation function that was used for the match data block. The
convenience function <b>pcre2_substring_free()</b> can be used to free it when
it is no longer needed. The yield of the function is zero for success or one of
the following error numbers:
<pre>
PCRE2_ERROR_NOSUBSTRING there are no groups of that name
PCRE2_ERROR_UNAVAILABLE the ovector was too small for that group
PCRE2_ERROR_UNSET the group did not participate in the match
PCRE2_ERROR_NOMEMORY memory could not be obtained
</pre>
If there is more than one group with the given name, the first one that is set
is returned. In this situation PCRE2_ERROR_UNSET means that no group with the
given name was set.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,58 @@
<html>
<head>
<title>pcre2_substring_get_bynumber specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_substring_get_bynumber man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_substring_get_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> uint32_t <i>number</i>, PCRE2_UCHAR **<i>bufferptr</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This is a convenience function for extracting a captured substring by number
into newly acquired memory. The arguments are:
<pre>
<i>match_data</i> The match data for the match
<i>number</i> Number of the required substring
<i>bufferptr</i> Where to put the string pointer
<i>bufflen</i> Where to put the string length
</pre>
The memory in which the substring is placed is obtained by calling the same
memory allocation function that was used for the match data block. The
convenience function <b>pcre2_substring_free()</b> can be used to free it when
it is no longer needed. The yield of the function is zero for success or one of
the following error numbers:
<pre>
PCRE2_ERROR_NOSUBSTRING there are no groups of that number
PCRE2_ERROR_UNAVAILABLE the ovector was too small for that group
PCRE2_ERROR_UNSET the group did not participate in the match
PCRE2_ERROR_NOMEMORY memory could not be obtained
</pre>
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,46 @@
<html>
<head>
<title>pcre2_substring_length_byname specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_substring_length_byname man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_substring_length_byname(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_SIZE *<i>length</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function returns the length of a matched substring, identified by name.
The arguments are:
<pre>
<i>match_data</i> The match data block for the match
<i>name</i> The substring name
<i>length</i> Where to return the length
</pre>
The yield is zero on success, or an error code if the substring is not found.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,48 @@
<html>
<head>
<title>pcre2_substring_length_bynumber specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_substring_length_bynumber man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> uint32_t <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function returns the length of a matched substring, identified by number.
The arguments are:
<pre>
<i>match_data</i> The match data block for the match
<i>number</i> The substring number
<i>length</i> Where to return the length, or NULL
</pre>
The third argument may be NULL if all you want to know is whether or not a
substring is set. The yield is zero on success, or a negative error code
otherwise. After a partial match, only substring 0 is available.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,41 @@
<html>
<head>
<title>pcre2_substring_list_free specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_substring_list_free man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>void pcre2_substring_list_free(PCRE2_UCHAR **<i>list</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This is a convenience function for freeing the store obtained by a previous
call to <b>pcre2_substring_list_get()</b>. Its only argument is a pointer to
the list of string pointers. If the argument is NULL, the function returns
immediately, without doing anything.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,56 @@
<html>
<head>
<title>pcre2_substring_list_get specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_substring_list_get man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This is a convenience function for extracting all the captured substrings after
a pattern match. It builds a list of pointers to the strings, and (optionally)
a second list that contains their lengths (in code units), excluding a
terminating zero that is added to each of them. All this is done in a single
block of memory that is obtained using the same memory allocation function that
was used to get the match data block. The convenience function
<b>pcre2_substring_list_free()</b> can be used to free it when it is no longer
needed. The arguments are:
<pre>
<i>match_data</i> The match data block
<i>listptr</i> Where to put a pointer to the list
<i>lengthsptr</i> Where to put a pointer to the lengths, or NULL
</pre>
A pointer to a list of pointers is put in the variable whose address is in
<i>listptr</i>. The list is terminated by a NULL pointer. If <i>lengthsptr</i> is
not NULL, a matching list of lengths is created, and its address is placed in
<i>lengthsptr</i>. The yield of the function is zero on success or
PCRE2_ERROR_NOMEMORY if sufficient memory could not be obtained.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,53 @@
<html>
<head>
<title>pcre2_substring_nametable_scan specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_substring_nametable_scan man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This convenience function finds, for a compiled pattern, the first and last
entries for a given name in the table that translates capture group names into
numbers.
<pre>
<i>code</i> Compiled regular expression
<i>name</i> Name whose entries are required
<i>first</i> Where to return a pointer to the first entry
<i>last</i> Where to return a pointer to the last entry
</pre>
When the name is found in the table, if <i>first</i> is NULL, the function
returns a group number, but if there is more than one matching entry, it is not
defined which one. Otherwise, when both pointers have been set, the yield of
the function is the length of each entry in code units. If the name is not
found, PCRE2_ERROR_NOSUBSTRING is returned.
</P>
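<P>
As a minimal sketch of how the <i>first</i>/<i>last</i> range might be walked
in the 8-bit library: the code below assumes the entry layout that
<b>pcre2api</b> describes for PCRE2_INFO_NAMETABLE (a two-byte group number,
most significant byte first, followed by the zero-terminated name). The helper
names are hypothetical and are not part of PCRE2; <i>entry_size</i> stands for
the value returned by <b>pcre2_substring_nametable_scan()</b>.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the group number held in the first two bytes of an 8-bit
   name-table entry (most significant byte first). */
unsigned demo_entry_group_number(const uint8_t *entry)
{
    return ((unsigned)entry[0] << 8) | (unsigned)entry[1];
}

/* Collect the group numbers of every entry from `first` to `last`
   inclusive. `entry_size` is the length of one entry in code units.
   At most `max_out` numbers are stored; the count stored is returned. */
size_t demo_collect_group_numbers(const uint8_t *first, const uint8_t *last,
                                  size_t entry_size,
                                  unsigned *out, size_t max_out)
{
    size_t n = 0;
    if (entry_size == 0) return 0;   /* defensive: avoid an endless loop */
    for (const uint8_t *e = first; e <= last && n < max_out; e += entry_size)
        out[n++] = demo_entry_group_number(e);
    return n;
}
```

<P>
In real use, <i>first</i> and <i>last</i> would be the PCRE2_SPTR values filled
in by the scan, and the loop would visit every group that shares the name.
</P>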
<P>
There is a complete description of the PCRE2 native API, including the format of
the table entries, in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page, and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,50 @@
<html>
<head>
<title>pcre2_substring_number_from_name specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_substring_number_from_name man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
<b> PCRE2_SPTR <i>name</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This convenience function finds the number of a named capture group in a
compiled pattern, provided that the name is unique. The function arguments
are:
<pre>
  <i>code</i>    Compiled regular expression
  <i>name</i>    Name whose number is required
</pre>
The yield of the function is the number of the parenthesis if the name is
found, or PCRE2_ERROR_NOSUBSTRING if it is not found. When duplicate names are
allowed (PCRE2_DUPNAMES is set), if the name is not unique,
PCRE2_ERROR_NOUNIQUESUBSTRING is returned. You can obtain the list of numbers
with the same name by calling <b>pcre2_substring_nametable_scan()</b>.
</P>
<P>
There is a complete description of the PCRE2 native API in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page and a description of the POSIX API in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,4496 @@
<html>
<head>
<title>pcre2api specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2api man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 NATIVE API BASIC FUNCTIONS</a>
<li><a name="TOC2" href="#SEC2">PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS</a>
<li><a name="TOC3" href="#SEC3">PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS</a>
<li><a name="TOC4" href="#SEC4">PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS</a>
<li><a name="TOC5" href="#SEC5">PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS</a>
<li><a name="TOC6" href="#SEC6">PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS</a>
<li><a name="TOC7" href="#SEC7">PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION</a>
<li><a name="TOC8" href="#SEC8">PCRE2 NATIVE API JIT FUNCTIONS</a>
<li><a name="TOC9" href="#SEC9">PCRE2 NATIVE API SERIALIZATION FUNCTIONS</a>
<li><a name="TOC10" href="#SEC10">PCRE2 NATIVE API AUXILIARY FUNCTIONS</a>
<li><a name="TOC11" href="#SEC11">PCRE2 NATIVE API OBSOLETE FUNCTIONS</a>
<li><a name="TOC12" href="#SEC12">PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS</a>
<li><a name="TOC13" href="#SEC13">PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a>
<li><a name="TOC14" href="#SEC14">PCRE2 API OVERVIEW</a>
<li><a name="TOC15" href="#SEC15">STRING LENGTHS AND OFFSETS</a>
<li><a name="TOC16" href="#SEC16">NEWLINES</a>
<li><a name="TOC17" href="#SEC17">MULTITHREADING</a>
<li><a name="TOC18" href="#SEC18">PCRE2 CONTEXTS</a>
<li><a name="TOC19" href="#SEC19">CHECKING BUILD-TIME OPTIONS</a>
<li><a name="TOC20" href="#SEC20">COMPILING A PATTERN</a>
<li><a name="TOC21" href="#SEC21">JUST-IN-TIME (JIT) COMPILATION</a>
<li><a name="TOC22" href="#SEC22">LOCALE SUPPORT</a>
<li><a name="TOC23" href="#SEC23">INFORMATION ABOUT A COMPILED PATTERN</a>
<li><a name="TOC24" href="#SEC24">INFORMATION ABOUT A PATTERN'S CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SERIALIZATION AND PRECOMPILING</a>
<li><a name="TOC26" href="#SEC26">THE MATCH DATA BLOCK</a>
<li><a name="TOC27" href="#SEC27">MEMORY USE FOR MATCH DATA BLOCKS</a>
<li><a name="TOC28" href="#SEC28">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a>
<li><a name="TOC29" href="#SEC29">NEWLINE HANDLING WHEN MATCHING</a>
<li><a name="TOC30" href="#SEC30">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a>
<li><a name="TOC31" href="#SEC31">OTHER INFORMATION ABOUT A MATCH</a>
<li><a name="TOC32" href="#SEC32">ERROR RETURNS FROM <b>pcre2_match()</b></a>
<li><a name="TOC33" href="#SEC33">OBTAINING A TEXTUAL ERROR MESSAGE</a>
<li><a name="TOC34" href="#SEC34">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a>
<li><a name="TOC35" href="#SEC35">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a>
<li><a name="TOC36" href="#SEC36">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a>
<li><a name="TOC37" href="#SEC37">CREATING A NEW STRING WITH SUBSTITUTIONS</a>
<li><a name="TOC38" href="#SEC38">DUPLICATE CAPTURE GROUP NAMES</a>
<li><a name="TOC39" href="#SEC39">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a>
<li><a name="TOC40" href="#SEC40">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a>
<li><a name="TOC41" href="#SEC41">SEE ALSO</a>
<li><a name="TOC42" href="#SEC42">AUTHOR</a>
<li><a name="TOC43" href="#SEC43">REVISION</a>
</ul>
<P>
<b>#include &#60;pcre2.h&#62;</b>
<br>
<br>
PCRE2 is a new API for PCRE, starting at release 10.0. This document contains a
description of all its native functions. See the
<a href="pcre2.html"><b>pcre2</b></a>
document for an overview of all the PCRE2 documentation.
</P>
<br><a name="SEC1" href="#TOC1">PCRE2 NATIVE API BASIC FUNCTIONS</a><br>
<P>
<b>pcre2_code *pcre2_compile(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
<b> uint32_t <i>options</i>, int *<i>errorcode</i>, PCRE2_SIZE *<i>erroroffset,</i></b>
<b> pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
<b>void pcre2_code_free(pcre2_code *<i>code</i>);</b>
<br>
<br>
<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_match_data *pcre2_match_data_create_from_pattern(</b>
<b> const pcre2_code *<i>code</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>int pcre2_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>,</b>
<b> int *<i>workspace</i>, PCRE2_SIZE <i>wscount</i>);</b>
<br>
<br>
<b>void pcre2_match_data_free(pcre2_match_data *<i>match_data</i>);</b>
</P>
<br><a name="SEC2" href="#TOC1">PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS</a><br>
<P>
<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>PCRE2_SIZE pcre2_get_match_data_heapframes_size(</b>
<b> pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>uint32_t pcre2_get_ovector_count(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
</P>
<br><a name="SEC3" href="#TOC1">PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS</a><br>
<P>
<b>pcre2_general_context *pcre2_general_context_create(</b>
<b> void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
<br>
<br>
<b>pcre2_general_context *pcre2_general_context_copy(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_general_context_free(pcre2_general_context *<i>gcontext</i>);</b>
</P>
<br><a name="SEC4" href="#TOC1">PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS</a><br>
<P>
<b>pcre2_compile_context *pcre2_compile_context_create(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_compile_context *pcre2_compile_context_copy(</b>
<b> pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
<b>void pcre2_compile_context_free(pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
<b>int pcre2_set_bsr(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
<b> const uint8_t *<i>tables</i>);</b>
<br>
<br>
<b>int pcre2_set_compile_extra_options(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>extra_options</i>);</b>
<br>
<br>
<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
<b> PCRE2_SIZE <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_max_pattern_compiled_length(</b>
<b> pcre2_compile_context *<i>ccontext</i>, PCRE2_SIZE <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_max_varlookbehind(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_newline(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b>
<b> int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b>
<br>
<br>
<b>int pcre2_set_optimize(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>directive</i>);</b>
</P>
<br><a name="SEC5" href="#TOC1">PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS</a><br>
<P>
<b>pcre2_match_context *pcre2_match_context_create(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_match_context *pcre2_match_context_copy(</b>
<b> pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>void pcre2_match_context_free(pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>int pcre2_set_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b> int (*<i>callout_function</i>)(pcre2_callout_block *, void *),</b>
<b> void *<i>callout_data</i>);</b>
<br>
<br>
<b>int pcre2_set_substitute_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b> int (*<i>callout_function</i>)(pcre2_substitute_callout_block *, void *),</b>
<b> void *<i>callout_data</i>);</b>
<br>
<br>
<b>int pcre2_set_substitute_case_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b> PCRE2_SIZE (*<i>callout_function</i>)(PCRE2_SPTR, PCRE2_SIZE,</b>
<b> PCRE2_UCHAR *, PCRE2_SIZE,</b>
<b> int, void *),</b>
<b> void *<i>callout_data</i>);</b>
<br>
<br>
<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> PCRE2_SIZE <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_heap_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_depth_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
</P>
<br><a name="SEC6" href="#TOC1">PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS</a><br>
<P>
<b>int pcre2_substring_copy_byname(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_UCHAR *<i>buffer</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>int pcre2_substring_copy_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> uint32_t <i>number</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
<b> PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
<br>
<br>
<b>int pcre2_substring_get_byname(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_UCHAR **<i>bufferptr</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>int pcre2_substring_get_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> uint32_t <i>number</i>, PCRE2_UCHAR **<i>bufferptr</i>,</b>
<b> PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>int pcre2_substring_length_byname(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_SIZE *<i>length</i>);</b>
<br>
<br>
<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> uint32_t <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
<br>
<br>
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
<br>
<br>
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
<b> PCRE2_SPTR <i>name</i>);</b>
<br>
<br>
<b>void pcre2_substring_list_free(PCRE2_UCHAR **<i>list</i>);</b>
<br>
<br>
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
</P>
<br><a name="SEC7" href="#TOC1">PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION</a><br>
<P>
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>, PCRE2_SPTR <i>replacement</i>,</b>
<b> PCRE2_SIZE <i>rlength</i>, PCRE2_UCHAR *<i>outputbuffer</i>,</b>
<b> PCRE2_SIZE *<i>outlengthptr</i>);</b>
</P>
<br><a name="SEC8" href="#TOC1">PCRE2 NATIVE API JIT FUNCTIONS</a><br>
<P>
<b>int pcre2_jit_compile(pcre2_code *<i>code</i>, uint32_t <i>options</i>);</b>
<br>
<br>
<b>int pcre2_jit_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_jit_stack *pcre2_jit_stack_create(size_t <i>startsize</i>,</b>
<b> size_t <i>maxsize</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_jit_stack_assign(pcre2_match_context *<i>mcontext</i>,</b>
<b> pcre2_jit_callback <i>callback_function</i>, void *<i>callback_data</i>);</b>
<br>
<br>
<b>void pcre2_jit_stack_free(pcre2_jit_stack *<i>jit_stack</i>);</b>
</P>
<br><a name="SEC9" href="#TOC1">PCRE2 NATIVE API SERIALIZATION FUNCTIONS</a><br>
<P>
<b>int32_t pcre2_serialize_decode(pcre2_code **<i>codes</i>,</b>
<b> int32_t <i>number_of_codes</i>, const uint8_t *<i>bytes</i>,</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>int32_t pcre2_serialize_encode(const pcre2_code **<i>codes</i>,</b>
<b> int32_t <i>number_of_codes</i>, uint8_t **<i>serialized_bytes</i>,</b>
<b> PCRE2_SIZE *<i>serialized_size</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_serialize_free(uint8_t *<i>bytes</i>);</b>
<br>
<br>
<b>int32_t pcre2_serialize_get_number_of_codes(const uint8_t *<i>bytes</i>);</b>
</P>
<br><a name="SEC10" href="#TOC1">PCRE2 NATIVE API AUXILIARY FUNCTIONS</a><br>
<P>
<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b>
<br>
<br>
<b>pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *<i>code</i>);</b>
<br>
<br>
<b>int pcre2_get_error_message(int <i>errorcode</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
<b> PCRE2_SIZE <i>bufflen</i>);</b>
<br>
<br>
<b>const uint8_t *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_maketables_free(pcre2_general_context *<i>gcontext</i>,</b>
<b> const uint8_t *<i>tables</i>);</b>
<br>
<br>
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>,</b>
<b> void *<i>where</i>);</b>
<br>
<br>
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
<b> int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
<b> void *<i>user_data</i>);</b>
<br>
<br>
<b>int pcre2_config(uint32_t <i>what</i>, void *<i>where</i>);</b>
</P>
<br><a name="SEC11" href="#TOC1">PCRE2 NATIVE API OBSOLETE FUNCTIONS</a><br>
<P>
<b>int pcre2_set_recursion_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
<b>int pcre2_set_recursion_memory_management(</b>
<b> pcre2_match_context *<i>mcontext</i>,</b>
<b> void *(*<i>private_malloc</i>)(size_t, void *),</b>
<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
<br>
<br>
These functions became obsolete at release 10.30 and are retained only for
backward compatibility. They should not be used in new code. The first is
replaced by <b>pcre2_set_depth_limit()</b>; the second is no longer needed and
has no effect (it always returns zero).
</P>
<br><a name="SEC12" href="#TOC1">PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS</a><br>
<P>
<b>pcre2_convert_context *pcre2_convert_context_create(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_convert_context *pcre2_convert_context_copy(</b>
<b> pcre2_convert_context *<i>cvcontext</i>);</b>
<br>
<br>
<b>void pcre2_convert_context_free(pcre2_convert_context *<i>cvcontext</i>);</b>
<br>
<br>
<b>int pcre2_set_glob_escape(pcre2_convert_context *<i>cvcontext</i>,</b>
<b> uint32_t <i>escape_char</i>);</b>
<br>
<br>
<b>int pcre2_set_glob_separator(pcre2_convert_context *<i>cvcontext</i>,</b>
<b> uint32_t <i>separator_char</i>);</b>
<br>
<br>
<b>int pcre2_pattern_convert(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
<b> uint32_t <i>options</i>, PCRE2_UCHAR **<i>buffer</i>,</b>
<b> PCRE2_SIZE *<i>blength</i>, pcre2_convert_context *<i>cvcontext</i>);</b>
<br>
<br>
<b>void pcre2_converted_pattern_free(PCRE2_UCHAR *<i>converted_pattern</i>);</b>
<br>
<br>
These functions provide a way of converting non-PCRE2 patterns into
patterns that can be processed by <b>pcre2_compile()</b>. This facility is
experimental and may be changed in future releases. At present, "globs" and
POSIX basic and extended patterns can be converted. Details are given in the
<a href="pcre2convert.html"><b>pcre2convert</b></a>
documentation.
</P>
<br><a name="SEC13" href="#TOC1">PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a><br>
<P>
There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit code
units, respectively. However, there is just one header file, <b>pcre2.h</b>.
This contains the function prototypes and other definitions for all three
libraries. One, two, or all three can be installed simultaneously. On Unix-like
systems the libraries are called <b>libpcre2-8</b>, <b>libpcre2-16</b>, and
<b>libpcre2-32</b>, and they can also co-exist with the original PCRE libraries.
Every PCRE2 function comes in three different forms, one for each library, for
example:
<pre>
<b>pcre2_compile_8()</b>
<b>pcre2_compile_16()</b>
<b>pcre2_compile_32()</b>
</pre>
There are also three different sets of data types:
<pre>
<b>PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32</b>
<b>PCRE2_SPTR8, PCRE2_SPTR16, PCRE2_SPTR32</b>
</pre>
The UCHAR types define unsigned code units of the appropriate widths.
For example, PCRE2_UCHAR16 is usually defined as <i>uint16_t</i>.
The SPTR types are pointers to constants of the equivalent UCHAR types,
that is, they are pointers to vectors of unsigned code units.
</P>
<P>
Character strings are passed to a PCRE2 library as sequences of unsigned
integers in code units of the appropriate width. The length of a string may
be given as a number of code units, or the string may be specified as
zero-terminated.
</P>
<P>
Many applications use only one code unit width. For their convenience, macros
are defined whose names are the generic forms such as <b>pcre2_compile()</b> and
PCRE2_SPTR. These macros use the value of the macro PCRE2_CODE_UNIT_WIDTH to
generate the appropriate width-specific function and macro names.
PCRE2_CODE_UNIT_WIDTH is not defined by default. An application must define it
to be 8, 16, or 32 before including <b>pcre2.h</b> in order to make use of the
generic names.
</P>
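<P>
The way a width macro can select width-specific names may be easier to see in
a small stand-alone sketch. It uses the usual two-level token-pasting idiom;
every <b>demo_</b> and <b>DEMO_</b> name is hypothetical and merely mimics the
arrangement described above, without reproducing the contents of
<b>pcre2.h</b>.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PCRE2_CODE_UNIT_WIDTH: chosen before the generic names
   are used. */
#define DEMO_CODE_UNIT_WIDTH 8

/* Width-specific types and functions, one set per code unit width. */
typedef uint8_t  demo_uchar_8;
typedef uint16_t demo_uchar_16;

size_t demo_length_8(const demo_uchar_8 *s)
{
    size_t n = 0;
    while (s[n] != 0) n++;
    return n;
}

size_t demo_length_16(const demo_uchar_16 *s)
{
    size_t n = 0;
    while (s[n] != 0) n++;
    return n;
}

/* Two levels are needed so that DEMO_CODE_UNIT_WIDTH is expanded to its
   value before the ## operator pastes the tokens together. */
#define DEMO_JOIN(a, b) a##b
#define DEMO_GLUE(a, b) DEMO_JOIN(a, b)

/* Generic names that resolve to the selected width. */
#define demo_uchar  DEMO_GLUE(demo_uchar_,  DEMO_CODE_UNIT_WIDTH)
#define demo_length DEMO_GLUE(demo_length_, DEMO_CODE_UNIT_WIDTH)
```

<P>
With the width set to 8, <b>demo_length</b> expands to <b>demo_length_8</b>;
the suffixed names remain available for code that must not depend on the
macro.
</P>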
<P>
Applications that use more than one code unit width can be linked with more
than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to be 0 before
including <b>pcre2.h</b>, and then use the real function names. Any code that is
to be included in an environment where the value of PCRE2_CODE_UNIT_WIDTH is
unknown should also use the real function names. (Unfortunately, it is not
possible in C code to save and restore the value of a macro.)
</P>
<P>
If PCRE2_CODE_UNIT_WIDTH is not defined before including <b>pcre2.h</b>, a
compiler error occurs.
</P>
<P>
When using multiple libraries in an application, you must take care when
processing any particular pattern to use only functions from a single library.
For example, if you want to run a match using a pattern that was compiled with
<b>pcre2_compile_16()</b>, you must do so with <b>pcre2_match_16()</b>, not
<b>pcre2_match_8()</b> or <b>pcre2_match_32()</b>.
</P>
<P>
In the function summaries above, and in the rest of this document and other
PCRE2 documents, functions and data types are described using their generic
names, without the _8, _16, or _32 suffix.
</P>
<br><a name="SEC14" href="#TOC1">PCRE2 API OVERVIEW</a><br>
<P>
PCRE2 has its own native API, which is described in this document. There are
also some wrapper functions for the 8-bit library that correspond to the
POSIX regular expression API, but they do not give access to all the
functionality of PCRE2 and they are not thread-safe. They are described in the
<a href="pcre2posix.html"><b>pcre2posix</b></a>
documentation. Both these APIs define a set of C function calls.
</P>
<P>
The native API C data types, function prototypes, option values, and error
codes are defined in the header file <b>pcre2.h</b>, which also contains
definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release numbers
for the library. Applications can use these to include support for different
releases of PCRE2.
</P>
<P>
In a Windows environment, if you want to statically link an application program
against a non-dll PCRE2 library, you must define PCRE2_STATIC before including
<b>pcre2.h</b>.
</P>
<P>
The functions <b>pcre2_compile()</b> and <b>pcre2_match()</b> are used for
compiling and matching regular expressions in a Perl-compatible manner. A
sample program that demonstrates the simplest way of using them is provided in
the file called <i>pcre2demo.c</i> in the PCRE2 source distribution. A listing
of this program is given in the
<a href="pcre2demo.html"><b>pcre2demo</b></a>
documentation, and the
<a href="pcre2sample.html"><b>pcre2sample</b></a>
documentation describes how to compile and run it.
</P>
<P>
The compiling and matching functions recognize various options that are passed
as bits in an options argument. There are also some more complicated parameters
such as custom memory management functions and resource limits that are passed
in "contexts" (which are just memory blocks, described below). Simple
applications do not need to make use of contexts.
</P>
<P>
Just-in-time (JIT) compiler support is an optional feature of PCRE2 that can be
built in appropriate hardware environments. It greatly speeds up the matching
performance of many patterns. Programs can request that it be used if
available by calling <b>pcre2_jit_compile()</b> after a pattern has been
successfully compiled by <b>pcre2_compile()</b>. This does nothing if JIT
support is not available.
</P>
<P>
More complicated programs might need to make use of the specialist functions
<b>pcre2_jit_stack_create()</b>, <b>pcre2_jit_stack_free()</b>, and
<b>pcre2_jit_stack_assign()</b> in order to control the JIT code's memory usage.
</P>
<P>
JIT matching is automatically used by <b>pcre2_match()</b> if it is available,
unless the PCRE2_NO_JIT option is set. There is also a direct interface for JIT
matching, which gives improved performance at the expense of less sanity
checking. The JIT-specific functions are discussed in the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation.
</P>
<P>
A second matching function, <b>pcre2_dfa_match()</b>, which is not
Perl-compatible, is also provided. This uses a different algorithm for the
matching. The alternative algorithm finds all possible matches (at a given
point in the subject), and scans the subject just once (unless there are
lookaround assertions). However, this algorithm does not return captured
substrings. A description of the two matching algorithms and their advantages
and disadvantages is given in the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
documentation. There is no JIT support for <b>pcre2_dfa_match()</b>.
</P>
<P>
In addition to the main compiling and matching functions, there are convenience
functions for extracting captured substrings from a subject string that has
been matched by <b>pcre2_match()</b>. They are:
<pre>
<b>pcre2_substring_copy_byname()</b>
<b>pcre2_substring_copy_bynumber()</b>
<b>pcre2_substring_get_byname()</b>
<b>pcre2_substring_get_bynumber()</b>
<b>pcre2_substring_list_get()</b>
<b>pcre2_substring_length_byname()</b>
<b>pcre2_substring_length_bynumber()</b>
<b>pcre2_substring_nametable_scan()</b>
<b>pcre2_substring_number_from_name()</b>
</pre>
<b>pcre2_substring_free()</b> and <b>pcre2_substring_list_free()</b> are also
provided, to free memory used for extracted strings. If either of these
functions is called with a NULL argument, the function returns immediately
without doing anything.
</P>
<P>
The function <b>pcre2_substitute()</b> can be called to match a pattern and
return a copy of the subject string with substitutions for parts that were
matched.
</P>
<P>
Functions whose names begin with <b>pcre2_serialize_</b> are used for saving
compiled patterns on disc or elsewhere, and reloading them later.
</P>
<P>
Finally, there are functions for finding out information about a compiled
pattern (<b>pcre2_pattern_info()</b>) and about the configuration with which
PCRE2 was built (<b>pcre2_config()</b>).
</P>
<P>
Functions with names ending with <b>_free()</b> are used for freeing memory
blocks of various sorts. In all cases, if one of these functions is called with
a NULL argument, it does nothing.
</P>
<br><a name="SEC15" href="#TOC1">STRING LENGTHS AND OFFSETS</a><br>
<P>
The PCRE2 API uses string lengths and offsets into strings of code units in
several places. These values are always of type PCRE2_SIZE, which is an
unsigned integer type, currently always defined as <i>size_t</i>. The largest
value that can be stored in such a type (that is ~(PCRE2_SIZE)0) is reserved
as a special indicator for zero-terminated strings and unset offsets.
Therefore, the longest string that can be handled is one less than this
maximum. Note that string lengths are always given in code units. Only in the
8-bit library is such a length the same as the number of bytes in the string.
<a name="newlines"></a></P>
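<P>
A small stand-alone sketch may make the two points above concrete: the
all-ones value acts as a sentinel rather than a length, and a length in code
units differs from a length in bytes for wider code units. The <b>demo_</b>
names are hypothetical stand-ins and do not come from <b>pcre2.h</b>.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the reserved value ~(PCRE2_SIZE)0. */
#define DEMO_SIZE_SENTINEL (~(size_t)0)

/* Length in code units of a zero-terminated 16-bit string. */
size_t demo_units16(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0) n++;
    return n;
}

/* Interpret a length argument: the sentinel means "zero-terminated",
   any other value is taken as an explicit count of code units. */
size_t demo_resolve_length16(const uint16_t *s, size_t length)
{
    return (length == DEMO_SIZE_SENTINEL) ? demo_units16(s) : length;
}

/* Convert a count of 16-bit code units into a count of bytes. */
size_t demo_bytes16(size_t units)
{
    return units * sizeof(uint16_t);
}
```

<P>
Because the sentinel occupies the largest representable value, no real string
can have that length, which is why the usable maximum is one less.
</P>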
<br><a name="SEC16" href="#TOC1">NEWLINES</a><br>
<P>
PCRE2 supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
character, the two-character sequence CRLF, any of the three preceding, or any
Unicode newline sequence. The Unicode newline sequences are the three just
mentioned, plus the single characters VT (vertical tab, U+000B), FF (form feed,
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
(paragraph separator, U+2029).
</P>
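<P>
The "any Unicode newline sequence" convention listed above can be sketched as
a simple classifier over code points. This is only an illustration of the set
of sequences, with a hypothetical function name; it is not how PCRE2
implements newline recognition.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Return the length, in code points, of the Unicode newline sequence that
   starts at s[pos], or 0 if no newline sequence starts there. */
size_t demo_unicode_newline_length(const uint32_t *s, size_t len, size_t pos)
{
    if (pos >= len) return 0;
    switch (s[pos]) {
    case 0x000D:                       /* CR, possibly the start of CRLF */
        return (pos + 1 < len && s[pos + 1] == 0x000A) ? 2 : 1;
    case 0x000A:                       /* LF  */
    case 0x000B:                       /* VT  */
    case 0x000C:                       /* FF  */
    case 0x0085:                       /* NEL */
    case 0x2028:                       /* LS  */
    case 0x2029:                       /* PS  */
        return 1;
    default:
        return 0;
    }
}
```

<P>
Treating CRLF as one two-character sequence, rather than as CR followed by LF,
is what distinguishes this convention from simply testing single characters.
</P>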
<P>
Each of the first three conventions is used by at least one operating system as
its standard newline sequence. When PCRE2 is built, a default can be specified.
If it is not, the default is set to LF, which is the Unix standard. However,
the newline convention can be changed by an application when calling
<b>pcre2_compile()</b>, or it can be specified by special text at the start of
the pattern itself; this overrides any other settings. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page for details of the special character sequences.
</P>
<P>
In the PCRE2 documentation the word "newline" is used to mean "the character or
pair of characters that indicate a line break". The choice of newline
convention affects the handling of the dot, circumflex, and dollar
metacharacters, the handling of #-comments in /x mode, and, when CRLF is a
recognized line ending sequence, the match position advancement for a
non-anchored pattern. There is more detail about this in the
<a href="#matchoptions">section on <b>pcre2_match()</b> options</a>
below.
</P>
<P>
The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches; this has
its own separate convention.
</P>
<br><a name="SEC17" href="#TOC1">MULTITHREADING</a><br>
<P>
In a multithreaded application it is important to keep thread-specific data
separate from data that can be shared between threads. The PCRE2 library code
itself is thread-safe: it contains no static or global variables. The API is
designed to be fairly simple for non-threaded applications while at the same
time ensuring that multithreaded applications can use it.
</P>
<P>
There are several different blocks of data that are used to pass information
between the application and the PCRE2 libraries.
</P>
<br><b>
The compiled pattern
</b><br>
<P>
A pointer to the compiled form of a pattern is returned to the user when
<b>pcre2_compile()</b> is successful. The data in the compiled pattern is fixed,
and does not change when the pattern is matched. Therefore, it is thread-safe,
that is, the same compiled pattern can be used by more than one thread
simultaneously. For example, an application can compile all its patterns at the
start, before forking off multiple threads that use them. However, if the
just-in-time (JIT) optimization feature is being used, it needs separate memory
stack areas for each thread. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for more details.
</P>
<P>
In a more complicated situation, where patterns are compiled only when they are
first needed, but are still shared between threads, pointers to compiled
patterns must be protected from simultaneous writing by multiple threads. This
is somewhat tricky to do correctly. If you know that writing to a pointer is
atomic in your environment, you can use logic like this:
<pre>
  Get a read-only (shared) lock (mutex) for pointer
  if (pointer == NULL)
    {
    Get a write (unique) lock for pointer
    if (pointer == NULL) pointer = pcre2_compile(...)
    }
  Release the lock
  Use pointer in pcre2_match()
</pre>
Of course, testing for compilation errors should also be included in the code.
</P>
<P>
The reason for checking the pointer a second time is as follows: Several
threads may have acquired the shared lock and tested the pointer for being
NULL, but only one of them will be given the write lock, with the rest kept
waiting. The winning thread will compile the pattern and store the result.
After this thread releases the write lock, another thread will get it, and if
it does not retest pointer for being NULL, will recompile the pattern and
overwrite the pointer, creating a memory leak and possibly causing other
issues.
</P>
<P>
In an environment where writing to a pointer may not be atomic, the above logic
is not sufficient. The thread that is doing the compiling may be descheduled
after writing only part of the pointer, which could cause other threads to use
an invalid value. Instead of checking the pointer itself, a separate "pointer
is valid" flag (that can be updated atomically) must be used:
<pre>
  Get a read-only (shared) lock (mutex) for pointer
  if (!pointer_is_valid)
    {
    Get a write (unique) lock for pointer
    if (!pointer_is_valid)
      {
      pointer = pcre2_compile(...)
      pointer_is_valid = TRUE
      }
    }
  Release the lock
  Use pointer in pcre2_match()
</pre>
If JIT is being used, but the JIT compilation is not being done immediately
(perhaps waiting to see if the pattern is used often enough), similar logic is
required. JIT compilation updates a value within the compiled code block, so a
thread must gain unique write access to the pointer before calling
<b>pcre2_jit_compile()</b>. Alternatively, <b>pcre2_code_copy()</b> or
<b>pcre2_code_copy_with_tables()</b> can be used to obtain a private copy of the
compiled code before calling the JIT compiler.
</P>
<br><b>
Context blocks
</b><br>
<P>
The next main section below introduces the idea of "contexts" in which PCRE2
functions are called. A context is nothing more than a collection of parameters
that control the way PCRE2 operates. Grouping a number of parameters together
in a context is a convenient way of passing them to a PCRE2 function without
using lots of arguments. The parameters that are stored in contexts are in some
sense "advanced features" of the API. Many straightforward applications will
not need to use contexts.
</P>
<P>
In a multithreaded application, if the parameters in a context are values that
are never changed, the same context can be used by all the threads. However, if
any thread needs to change any value in a context, it must make its own
thread-specific copy.
</P>
<br><b>
Match blocks
</b><br>
<P>
The matching functions need a block of memory for storing the results of a
match. This includes details of what was matched, as well as additional
information such as the name of a (*MARK) setting. Each thread must provide its
own copy of this memory.
</P>
<br><a name="SEC18" href="#TOC1">PCRE2 CONTEXTS</a><br>
<P>
Some PCRE2 functions have a lot of parameters, many of which are used only by
specialist applications, for example, those that use custom memory management
or non-standard character tables. To keep function argument lists at a
reasonable size, and at the same time to keep the API extensible, "uncommon"
parameters are passed to certain functions in a <b>context</b> instead of
directly. A context is just a block of memory that holds the parameter values.
Applications that do not need to adjust any of the context parameters can pass
NULL when a context pointer is required.
</P>
<P>
There are three different types of context: a general context that is relevant
for several PCRE2 operations, a compile-time context, and a match-time context.
</P>
<br><b>
The general context
</b><br>
<P>
At present, this context just contains pointers to (and data for) external
memory management functions that are called from several places in the PCRE2
library. The context is named "general" rather than specifically "memory"
because in future other fields may be added. If you do not want to supply your
own custom memory management functions, you do not need to bother with a
general context. A general context is created by:
<br>
<br>
<b>pcre2_general_context *pcre2_general_context_create(</b>
<b> void *(*<i>private_malloc</i>)(PCRE2_SIZE, void *),</b>
<b> void (*<i>private_free</i>)(void *, void *), void *<i>memory_data</i>);</b>
<br>
<br>
The two function pointers specify custom memory management functions, whose
prototypes are:
<pre>
<b>void *private_malloc(PCRE2_SIZE, void *);</b>
<b>void private_free(void *, void *);</b>
</pre>
Whenever code in PCRE2 calls these functions, the final argument is the value
of <i>memory_data</i>. Either of the first two arguments of the creation
function may be NULL, in which case the system memory management functions
<i>malloc()</i> and <i>free()</i> are used. (This is not currently useful, as
there are no other fields in a general context, but in future there might be.)
The <i>private_malloc()</i> function is used (if supplied) to obtain memory for
storing the context, and all three values are saved as part of the context.
</P>
<P>
Whenever PCRE2 creates a data block of any kind, the block contains a pointer
to the <i>free()</i> function that matches the <i>malloc()</i> function that was
used. When the time comes to free the block, this function is called.
</P>
<P>
A general context can be copied by calling:
<br>
<br>
<b>pcre2_general_context *pcre2_general_context_copy(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
The memory used for a general context should be freed by calling:
<br>
<br>
<b>void pcre2_general_context_free(pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
If this function is passed a NULL argument, it returns immediately without
doing anything.
<a name="compilecontext"></a></P>
<br><b>
The compile context
</b><br>
<P>
A compile context is required if you want to provide an external function for
stack checking during compilation or to change the default values of any of the
following compile-time parameters:
<pre>
What \R matches (Unicode newlines or CR, LF, CRLF only)
PCRE2's character tables
The newline character sequence
The compile time nested parentheses limit
The maximum length of the pattern string
The extra options bits (none set by default)
Which performance optimizations the compiler should apply
</pre>
A compile context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of
<i>pcre2_compile()</i>.
</P>
<P>
A compile context is created, copied, and freed by the following functions:
<br>
<br>
<b>pcre2_compile_context *pcre2_compile_context_create(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_compile_context *pcre2_compile_context_copy(</b>
<b> pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
<b>void pcre2_compile_context_free(pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
A compile context is created with default values for its parameters. These can
be changed by calling the following functions, which return 0 on success, or
PCRE2_ERROR_BADDATA if invalid data is detected.
<br>
<br>
<b>int pcre2_set_bsr(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only CR, LF,
or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any Unicode line
ending sequence. The value is used by the JIT compiler and by the two
interpreted matching functions, <i>pcre2_match()</i> and
<i>pcre2_dfa_match()</i>.
<br>
<br>
<b>int pcre2_set_character_tables(pcre2_compile_context *<i>ccontext</i>,</b>
<b> const uint8_t *<i>tables</i>);</b>
<br>
<br>
The value must be the result of a call to <b>pcre2_maketables()</b>, whose only
argument is a general context. This function builds a set of character tables
in the current locale.
<br>
<br>
<b>int pcre2_set_compile_extra_options(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>extra_options</i>);</b>
<br>
<br>
As PCRE2 has developed, almost all the 32 option bits that are available in
the <i>options</i> argument of <b>pcre2_compile()</b> have been used up. To avoid
running out, the compile context contains a set of extra option bits which are
used for some newer, assumed rarer, options. This function sets those bits. It
always sets all the bits (either on or off): the new value replaces the
existing setting rather than being merged with it. The available options are
defined in the section entitled "Extra
compile options"
<a href="#extracompileoptions">below.</a>
<br>
<br>
<b>int pcre2_set_max_pattern_length(pcre2_compile_context *<i>ccontext</i>,</b>
<b> PCRE2_SIZE <i>value</i>);</b>
<br>
<br>
This sets a maximum length, in code units, for any pattern string that is
compiled with this context. If the pattern is longer, an error is generated.
This facility is provided so that applications that accept patterns from
external sources can limit their size. The default is the largest number that a
PCRE2_SIZE variable can hold, which is effectively unlimited.
<br>
<br>
<b>int pcre2_set_max_pattern_compiled_length(</b>
<b> pcre2_compile_context *<i>ccontext</i>, PCRE2_SIZE <i>value</i>);</b>
<br>
<br>
This sets a maximum size, in bytes, for the memory needed to hold the compiled
version of a pattern that is compiled with this context. If the pattern needs
more memory, an error is generated. This facility is provided so that
applications that accept patterns from external sources can limit the amount of
memory they use. The default is the largest number that a PCRE2_SIZE variable
can hold, which is effectively unlimited.
<br>
<br>
<b>int pcre2_set_max_varlookbehind(pcre2_compile_context *<i>ccontext</i>,</b>
<b>  uint32_t <i>value</i>);</b>
<br>
<br>
This sets the maximum number of characters that a variable-length lookbehind
assertion may match. The default is set when PCRE2 is built,
with the ultimate default being 255, the same as Perl. Lookbehind assertions
without a bounding length are not supported.
<br>
<br>
<b>int pcre2_set_newline(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
This specifies which characters or character sequences are to be recognized as
newlines. The value must be one of PCRE2_NEWLINE_CR (carriage return only),
PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the two-character
sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any of the above),
PCRE2_NEWLINE_ANY (any Unicode newline sequence), or PCRE2_NEWLINE_NUL (the
NUL character, that is a binary zero).
</P>
<P>
A pattern can override the value set in the compile context by starting with a
sequence such as (*CRLF). See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page for details.
</P>
<P>
When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
option, the newline convention affects the recognition of the end of internal
comments starting with #. The value is saved with the compiled pattern for
subsequent use by the JIT compiler and by the two interpreted matching
functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
<br>
<br>
<b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
This parameter adjusts the limit, set when PCRE2 is built (default 250), on the
depth of parenthesis nesting in a pattern. This limit stops rogue patterns
using up too much system stack when being compiled. The limit applies to
parentheses of all kinds, not just capturing parentheses.
<br>
<br>
<b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b>
<b> int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b>
<br>
<br>
There is at least one application that runs PCRE2 in threads with very limited
system stack, where running out of stack is to be avoided at all costs. The
parenthesis limit above cannot take account of how much stack is actually
available during compilation. For a finer control, you can supply a function
that is called whenever <b>pcre2_compile()</b> starts to compile a parenthesized
part of a pattern. This function can check the actual stack size (or anything
else that it wants to, of course).
</P>
<P>
The first argument to the guard function gives the current depth of
nesting, and the second is user data that is set up by the last argument of
<b>pcre2_set_compile_recursion_guard()</b>. The guard function should return
zero if all is well, or non-zero to force an error.
<br>
<br>
<b>int pcre2_set_optimize(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>directive</i>);</b>
<br>
<br>
PCRE2 can apply various performance optimizations during compilation, in order
to make matching faster. For example, the compiler might convert some regex
constructs into an equivalent construct which <b>pcre2_match()</b> can execute
faster. By default, all available optimizations are enabled. However, in rare
cases, one might wish to disable specific optimizations. For example, if it is
known that some optimizations cannot benefit a certain regex, it might be
desirable to disable them, in order to speed up compilation.
</P>
<P>
The permitted values of <i>directive</i> are as follows:
<pre>
PCRE2_OPTIMIZATION_FULL
</pre>
Enable all optional performance optimizations. This is the default value.
<pre>
PCRE2_OPTIMIZATION_NONE
</pre>
Disable all optional performance optimizations.
<pre>
PCRE2_AUTO_POSSESS
PCRE2_AUTO_POSSESS_OFF
</pre>
Enable/disable "auto-possessification" of variable quantifiers such as * and +.
This optimization, for example, turns a+b into a++b in order to avoid
backtracks into a+ that can never be successful. However, if callouts are in
use, auto-possessification means that some callouts are never taken. You can
disable this optimization if you want the matching functions to do a full,
unoptimized search and run all the callouts.
<pre>
PCRE2_DOTSTAR_ANCHOR
PCRE2_DOTSTAR_ANCHOR_OFF
</pre>
Enable/disable an optimization that is applied when .* is the first significant
item in a top-level branch of a pattern, and all the other branches also start
with .* or with \A or \G or ^. Such a pattern is automatically anchored if
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set for any
^ items. Otherwise, the fact that any match must start either at the start of
the subject or following a newline is remembered. Like other optimizations,
this can cause callouts to be skipped.
</P>
<P>
Dotstar anchor optimization is automatically disabled for .* if it is inside an
atomic group or a capture group that is the subject of a backreference, or if
the pattern contains (*PRUNE) or (*SKIP).
<pre>
PCRE2_START_OPTIMIZE
PCRE2_START_OPTIMIZE_OFF
</pre>
Enable/disable optimizations which cause matching functions to scan the subject
string for specific code unit values before attempting a match. For example, if
it is known that an unanchored match must start with a specific value, the
matching code searches the subject for that value, and fails immediately if it
cannot find it, without actually running the main matching function. This means
that a special item such as (*COMMIT) at the start of a pattern is not
considered until after a suitable starting point for the match has been found.
Also, when callouts or (*MARK) items are in use, these "start-up" optimizations
can cause them to be skipped if the pattern is never actually used. The start-up
optimizations are in effect a pre-scan of the subject that takes place before
the pattern is run.
</P>
<P>
Disabling start-up optimizations ensures that in cases where the result is "no
match", the callouts do occur, and that items such as (*COMMIT) and (*MARK) are
considered at every possible starting position in the subject string.
</P>
<P>
Disabling start-up optimizations may change the outcome of a matching operation.
Consider the pattern
<pre>
(*COMMIT)ABC
</pre>
When this is compiled, PCRE2 records the fact that a match must start with the
character "A". Suppose the subject string is "DEFABC". The start-up
optimization scans along the subject, finds "A" and runs the first match
attempt from there. The (*COMMIT) item means that the pattern must match the
current starting position, which in this case, it does. However, if the same
match is run without start-up optimizations, the initial scan along the subject
string does not happen. The first match attempt is run starting from "D" and
when this fails, (*COMMIT) prevents any further matches being tried, so the
overall result is "no match".
</P>
<P>
Another start-up optimization makes use of a minimum length for a matching
subject, which is recorded when possible. Consider the pattern
<pre>
(*MARK:1)B(*MARK:2)(X|Y)
</pre>
The minimum length for a match is two characters. If the subject is "XXBB", the
"starting character" optimization skips "XX", then tries to match "BB", which
is long enough. In the process, (*MARK:2) is encountered and remembered. When
the match attempt fails, the next "B" is found, but there is only one character
left, so there are no more attempts, and "no match" is returned with the "last
mark seen" set to "2". Without start-up optimizations, however, matches are
tried at every possible starting position, including at the end of the subject,
where (*MARK:1) is encountered, but there is no "B", so the "last mark seen"
that is returned is "1". In this case, the optimizations do not affect the
overall match result, which is still "no match", but they do affect the
auxiliary information that is returned.
<a name="matchcontext"></a></P>
<br><b>
The match context
</b><br>
<P>
A match context is required if you want to:
<pre>
Set up a callout function
Set an offset limit for matching an unanchored pattern
Change the limit on the amount of heap used when matching
Change the backtracking match limit
Change the backtracking depth limit
Set custom memory management specifically for the match
</pre>
If none of these apply, just pass NULL as the context argument of
<b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or <b>pcre2_jit_match()</b>.
</P>
<P>
A match context is created, copied, and freed by the following functions:
<br>
<br>
<b>pcre2_match_context *pcre2_match_context_create(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_match_context *pcre2_match_context_copy(</b>
<b> pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>void pcre2_match_context_free(pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
A match context is created with default values for its parameters. These can
be changed by calling the following functions, which return 0 on success, or
PCRE2_ERROR_BADDATA if invalid data is detected.
<br>
<br>
<b>int pcre2_set_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b> int (*<i>callout_function</i>)(pcre2_callout_block *, void *),</b>
<b> void *<i>callout_data</i>);</b>
<br>
<br>
This sets up a callout function for PCRE2 to call at specified points
during a matching operation. Details are given in the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation.
<br>
<br>
<b>int pcre2_set_substitute_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b> int (*<i>callout_function</i>)(pcre2_substitute_callout_block *, void *),</b>
<b> void *<i>callout_data</i>);</b>
<br>
<br>
This sets up a callout function for PCRE2 to call after each substitution
made by <b>pcre2_substitute()</b>. Details are given in the section entitled
"Creating a new string with substitutions"
<a href="#substitutions">below.</a>
<br>
<br>
<b>int pcre2_set_substitute_case_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b> PCRE2_SIZE (*<i>callout_function</i>)(PCRE2_SPTR, PCRE2_SIZE,</b>
<b> PCRE2_UCHAR *, PCRE2_SIZE,</b>
<b> int, void *),</b>
<b> void *<i>callout_data</i>);</b>
<br>
<br>
This sets up a callout function for PCRE2 to call when performing case
transformations inside <b>pcre2_substitute()</b>. Details are given in the
section entitled "Creating a new string with substitutions"
<a href="#substitutions">below.</a>
<br>
<br>
<b>int pcre2_set_offset_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> PCRE2_SIZE <i>value</i>);</b>
<br>
<br>
The <i>offset_limit</i> parameter limits how far an unanchored search can
advance in the subject string. The default value is PCRE2_UNSET. The
<b>pcre2_match()</b> and <b>pcre2_dfa_match()</b> functions return
PCRE2_ERROR_NOMATCH if a match with a starting point before or at the given
offset is not found. The <b>pcre2_substitute()</b> function makes no more
substitutions.
</P>
<P>
For example, if the pattern /abc/ is matched against "123abc" with an offset
limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match can never be
found if the <i>startoffset</i> argument of <b>pcre2_match()</b>,
<b>pcre2_dfa_match()</b>, or <b>pcre2_substitute()</b> is greater than the offset
limit set in the match context.
</P>
<P>
When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT option when
calling <b>pcre2_compile()</b> so that when JIT is in use, different code can be
compiled. If a match is started with a non-default offset limit when
PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.
</P>
<P>
The offset limit facility can be used to track progress when searching large
subject strings or to limit the extent of global substitutions. See also the
PCRE2_FIRSTLINE option, which requires a match to start before or at the first
newline that follows the start of matching in the subject. If this is set with
an offset limit, a match must occur in the first line and also within the
offset limit. In other words, whichever limit comes first is used.
<br>
<br>
<b>int pcre2_set_heap_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
The <i>heap_limit</i> parameter specifies, in units of kibibytes (1024 bytes),
the maximum amount of heap memory that <b>pcre2_match()</b> may use to hold
backtracking information when running an interpretive match. This limit also
applies to <b>pcre2_dfa_match()</b>, which may use the heap when processing
patterns with a lot of nested pattern recursion or lookarounds or atomic
groups. This limit does not apply to matching with the JIT optimization, which
has its own memory control arrangements (see the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for more details). If the limit is reached, the negative error
code PCRE2_ERROR_HEAPLIMIT is returned. The default limit can be set when PCRE2
is built; if it is not, the default is set very large and is essentially
unlimited.
</P>
<P>
A value for the heap limit may also be supplied by an item at the start of a
pattern of the form
<pre>
(*LIMIT_HEAP=ddd)
</pre>
where ddd is a decimal number. However, such a setting is ignored unless ddd is
less than the limit set by the caller of <b>pcre2_match()</b> or, if no such
limit is set, less than the default.
</P>
<P>
The <b>pcre2_match()</b> function always needs some heap memory, so setting a
value of zero guarantees a "heap limit exceeded" error. Details of how
<b>pcre2_match()</b> uses the heap are given in the
<a href="pcre2perform.html"><b>pcre2perform</b></a>
documentation.
</P>
<P>
For <b>pcre2_dfa_match()</b>, a vector on the system stack is used when
processing pattern recursions, lookarounds, or atomic groups, and only if this
is not big enough is heap memory used. In this case, setting a value of zero
disables the use of the heap.
<br>
<br>
<b>int pcre2_set_match_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
The <i>match_limit</i> parameter provides a means of preventing PCRE2 from using
up too many computing resources when processing patterns that are not going to
match, but which have a very large number of possibilities in their search
trees. The classic example is a pattern that uses nested unlimited repeats.
</P>
<P>
There is an internal counter in <b>pcre2_match()</b> that is incremented each
time round its main matching loop. If this value reaches the match limit,
<b>pcre2_match()</b> returns the negative value PCRE2_ERROR_MATCHLIMIT. This has
the effect of limiting the amount of backtracking that can take place. For
patterns that are not anchored, the count restarts from zero for each position
in the subject string. This limit also applies to <b>pcre2_dfa_match()</b>,
though the counting is done in a different way.
</P>
<P>
When <b>pcre2_match()</b> is called with a pattern that was successfully
processed by <b>pcre2_jit_compile()</b>, the way in which matching is executed
is entirely different. However, there is still the possibility of runaway
matching that goes on for a very long time, and so the <i>match_limit</i> value
is also used in this case (but in a different way) to limit how long the
matching can continue.
</P>
<P>
The default value for the limit can be set when PCRE2 is built; the default is
10 million, which handles all but the most extreme cases. A value for the match
limit may also be supplied by an item at the start of a pattern of the form
<pre>
(*LIMIT_MATCH=ddd)
</pre>
where ddd is a decimal number. However, such a setting is ignored unless ddd is
less than the limit set by the caller of <b>pcre2_match()</b> or
<b>pcre2_dfa_match()</b> or, if no such limit is set, less than the default.
<br>
<br>
<b>int pcre2_set_depth_limit(pcre2_match_context *<i>mcontext</i>,</b>
<b> uint32_t <i>value</i>);</b>
<br>
<br>
This parameter limits the depth of nested backtracking in <b>pcre2_match()</b>.
Each time a nested backtracking point is passed, a new memory frame is used
to remember the state of matching at that point. Thus, this parameter
indirectly limits the amount of memory that is used in a match. However,
because the size of each memory frame depends on the number of capturing
parentheses, the actual memory limit varies from pattern to pattern. This limit
was more useful in versions before 10.30, where function recursion was used for
backtracking.
</P>
<P>
The depth limit is not relevant, and is ignored, when matching is done using
JIT compiled code. However, it is supported by <b>pcre2_dfa_match()</b>, which
uses it to limit the depth of nested internal recursive function calls that
implement atomic groups, lookaround assertions, and pattern recursions. This
limits, indirectly, the amount of system stack that is used. It was more useful
in versions before 10.32, when stack memory was used for local workspace
vectors for recursive function calls. From version 10.32, only local variables
are allocated on the stack and as each call uses only a few hundred bytes, even
a small stack can support quite a lot of recursion.
</P>
<P>
If the depth of internal recursive function calls is great enough, local
workspace vectors are allocated on the heap from version 10.32 onwards, so the
depth limit also indirectly limits the amount of heap memory that is used. A
recursive pattern such as /(.(?2))((?1)|)/, when matched to a very long string
using <b>pcre2_dfa_match()</b>, can use a great deal of memory. However, it is
probably better to limit heap usage directly by calling
<b>pcre2_set_heap_limit()</b>.
</P>
<P>
The default value for the depth limit can be set when PCRE2 is built; if it is
not, the default is set to the same value as the default for the match limit.
If the limit is exceeded, <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>
returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth limit may also be
supplied by an item at the start of a pattern of the form
<pre>
(*LIMIT_DEPTH=ddd)
</pre>
where ddd is a decimal number. However, such a setting is ignored unless ddd is
less than the limit set by the caller of <b>pcre2_match()</b> or
<b>pcre2_dfa_match()</b> or, if no such limit is set, less than the default.
</P>
<br><a name="SEC19" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br>
<P>
<b>int pcre2_config(uint32_t <i>what</i>, void *<i>where</i>);</b>
</P>
<P>
The function <b>pcre2_config()</b> makes it possible for a PCRE2 client to find
the value of certain configuration parameters and to discover which optional
features have been compiled into the PCRE2 library. The
<a href="pcre2build.html"><b>pcre2build</b></a>
documentation has more details about these features.
</P>
<P>
The first argument for <b>pcre2_config()</b> specifies which information is
required. The second argument is a pointer to memory into which the information
is placed. If NULL is passed, the function returns the amount of memory that is
needed for the requested information. For calls that return numerical values,
the value is in bytes; when requesting these values, <i>where</i> should point
to appropriately aligned memory. For calls that return strings, the required
length is given in code units, including the terminating zero.
</P>
<P>
When requesting information, the returned value from <b>pcre2_config()</b> is
non-negative on success, or the negative error code PCRE2_ERROR_BADOPTION if
the value in the first argument is not recognized. The following information is
available:
<pre>
PCRE2_CONFIG_BSR
</pre>
The output is a uint32_t integer whose value indicates what character
sequences the \R escape sequence matches by default. A value of
PCRE2_BSR_UNICODE means that \R matches any Unicode line ending sequence; a
value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF. The
default can be overridden when a pattern is compiled.
<pre>
PCRE2_CONFIG_COMPILED_WIDTHS
</pre>
The output is a uint32_t integer whose lower bits indicate which code unit
widths were selected when PCRE2 was built. The 1-bit indicates 8-bit support,
and the 2-bit and 4-bit indicate 16-bit and 32-bit support, respectively.
<pre>
PCRE2_CONFIG_DEPTHLIMIT
</pre>
The output is a uint32_t integer that gives the default limit for the depth of
nested backtracking in <b>pcre2_match()</b> or the depth of nested recursions,
lookarounds, and atomic groups in <b>pcre2_dfa_match()</b>. Further details are
given with <b>pcre2_set_depth_limit()</b> above.
<pre>
PCRE2_CONFIG_HEAPLIMIT
</pre>
The output is a uint32_t integer that gives, in kibibytes, the default limit
for the amount of heap memory used by <b>pcre2_match()</b> or
<b>pcre2_dfa_match()</b>. Further details are given with
<b>pcre2_set_heap_limit()</b> above.
<pre>
PCRE2_CONFIG_JIT
</pre>
The output is a uint32_t integer that is set to one if support for just-in-time
compiling is included in the library; otherwise it is set to zero. Note that
having the support in the library does not guarantee that JIT will be used for
any given match, and neither does it guarantee that JIT will actually be able
to function, because it may not be able to allocate executable memory in some
environments. There is a special call to <b>pcre2_jit_compile()</b> that can be
used to check this. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for more details.
<pre>
PCRE2_CONFIG_JITTARGET
</pre>
The <i>where</i> argument should point to a buffer that is at least 48 code
units long. (The exact length required can be found by calling
<b>pcre2_config()</b> with <i>where</i> set to NULL.) The buffer is filled with a
string that contains the name of the architecture for which the JIT compiler is
configured, for example "x86 32bit (little endian + unaligned)". If JIT support
is not available, PCRE2_ERROR_BADOPTION is returned, otherwise the number of
code units used is returned. This is the length of the string, plus one unit
for the terminating zero.
<pre>
PCRE2_CONFIG_LINKSIZE
</pre>
The output is a uint32_t integer that contains the number of bytes used for
internal linkage in compiled regular expressions. When PCRE2 is configured, the
value can be set to 2, 3, or 4, with the default being 2. This is the value
that is returned by <b>pcre2_config()</b>. However, when the 16-bit library is
compiled, a value of 3 is rounded up to 4, and when the 32-bit library is
compiled, internal linkages always use 4 bytes, so the configured value is not
relevant.
</P>
<P>
The default value of 2 for the 8-bit and 16-bit libraries is sufficient for all
but the most massive patterns, since it allows the size of the compiled pattern
to be up to 65535 code units. Larger values allow larger regular expressions to
be compiled by those two libraries, but at the expense of slower matching.
<pre>
PCRE2_CONFIG_MATCHLIMIT
</pre>
The output is a uint32_t integer that gives the default match limit for
<b>pcre2_match()</b>. Further details are given with
<b>pcre2_set_match_limit()</b> above.
<pre>
PCRE2_CONFIG_NEWLINE
</pre>
The output is a uint32_t integer whose value specifies the default character
sequence that is recognized as meaning "newline". The values are:
<pre>
PCRE2_NEWLINE_CR Carriage return (CR)
PCRE2_NEWLINE_LF Linefeed (LF)
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
PCRE2_NEWLINE_ANY Any Unicode line ending
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
PCRE2_NEWLINE_NUL The NUL character (binary zero)
</pre>
The default should normally correspond to the standard sequence for your
operating system.
<pre>
PCRE2_CONFIG_NEVER_BACKSLASH_C
</pre>
The output is a uint32_t integer that is set to one if the use of \C was
permanently disabled when PCRE2 was built; otherwise it is set to zero.
<pre>
PCRE2_CONFIG_PARENSLIMIT
</pre>
The output is a uint32_t integer that gives the maximum depth of nesting
of parentheses (of any kind) in a pattern. This limit is imposed to cap the
amount of system stack used when a pattern is compiled. It is specified when
PCRE2 is built; the default is 250. This limit does not take into account the
stack that may already be used by the calling application. For finer control
over compilation stack usage, see <b>pcre2_set_compile_recursion_guard()</b>.
<pre>
PCRE2_CONFIG_STACKRECURSE
</pre>
This parameter is obsolete and should not be used in new code. The output is a
uint32_t integer that is always set to zero.
<pre>
PCRE2_CONFIG_TABLES_LENGTH
</pre>
The output is a uint32_t integer that gives the length of PCRE2's character
processing tables in bytes. For details of these tables see the
<a href="#localesupport">section on locale support</a>
below.
<pre>
PCRE2_CONFIG_UNICODE_VERSION
</pre>
The <i>where</i> argument should point to a buffer that is at least 24 code
units long. (The exact length required can be found by calling
<b>pcre2_config()</b> with <i>where</i> set to NULL.) If PCRE2 has been compiled
without Unicode support, the buffer is filled with the text "Unicode not
supported". Otherwise, the Unicode version string (for example, "8.0.0") is
inserted. The number of code units used is returned. This is the length of the
string plus one unit for the terminating zero.
<pre>
PCRE2_CONFIG_UNICODE
</pre>
The output is a uint32_t integer that is set to one if Unicode support is
available; otherwise it is set to zero. Unicode support implies UTF support.
<pre>
PCRE2_CONFIG_VERSION
</pre>
The <i>where</i> argument should point to a buffer that is at least 24 code
units long. (The exact length required can be found by calling
<b>pcre2_config()</b> with <i>where</i> set to NULL.) The buffer is filled with
the PCRE2 version string, zero-terminated. The number of code units used is
returned. This is the length of the string plus one unit for the terminating
zero.
<a name="compiling"></a></P>
<br><a name="SEC20" href="#TOC1">COMPILING A PATTERN</a><br>
<P>
<b>pcre2_code *pcre2_compile(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
<b> uint32_t <i>options</i>, int *<i>errorcode</i>, PCRE2_SIZE *<i>erroroffset,</i></b>
<b> pcre2_compile_context *<i>ccontext</i>);</b>
<br>
<br>
<b>void pcre2_code_free(pcre2_code *<i>code</i>);</b>
<br>
<br>
<b>pcre2_code *pcre2_code_copy(const pcre2_code *<i>code</i>);</b>
<br>
<br>
<b>pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *<i>code</i>);</b>
</P>
<P>
The <b>pcre2_compile()</b> function compiles a pattern into an internal form.
The pattern is defined by a pointer to a string of code units and a length in
code units. If the pattern is zero-terminated, the length can be specified as
PCRE2_ZERO_TERMINATED. A NULL pattern pointer with a length of zero is treated
as an empty string (NULL with a non-zero length causes an error return). The
function returns a pointer to a block of memory that contains the compiled
pattern and related data, or NULL if an error occurred.
</P>
<P>
If the compile context argument <i>ccontext</i> is NULL, memory for the compiled
pattern is obtained by calling <b>malloc()</b>. Otherwise, it is obtained from
the same memory function that was used for the compile context. The caller must
free the memory by calling <b>pcre2_code_free()</b> when it is no longer needed.
If <b>pcre2_code_free()</b> is called with a NULL argument, it returns
immediately, without doing anything.
</P>
<P>
The function <b>pcre2_code_copy()</b> makes a copy of the compiled code in new
memory, using the same memory allocator as was used for the original. However,
if the code has been processed by the JIT compiler (see
<a href="#jitcompiling">below),</a>
the JIT information cannot be copied (because it is position-dependent).
The new copy can initially be used only for non-JIT matching, though it can be
passed to <b>pcre2_jit_compile()</b> if required. If <b>pcre2_code_copy()</b> is
called with a NULL argument, it returns NULL.
</P>
<P>
The <b>pcre2_code_copy()</b> function provides a way for individual threads in a
multithreaded application to acquire a private copy of shared compiled code.
However, it does not make a copy of the character tables used by the compiled
pattern; the new pattern code points to the same tables as the original code.
(See
<a href="#localesupport">"Locale Support"</a>
below for details of these character tables.) In many applications the same
tables are used throughout, so this behaviour is appropriate. Nevertheless,
there are occasions when a copy of a compiled pattern and the relevant tables
are needed. The <b>pcre2_code_copy_with_tables()</b> provides this facility.
Copies of both the code and the tables are made, with the new code pointing to
the new tables. The memory for the new tables is automatically freed when
<b>pcre2_code_free()</b> is called for the new copy of the compiled code. If
<b>pcre2_code_copy_with_tables()</b> is called with a NULL argument, it returns
NULL.
</P>
<P>
NOTE: When one of the matching functions is called, pointers to the compiled
pattern and the subject string are set in the match data block so that they can
be referenced by the substring extraction functions after a successful match.
After running a match, you must not free a compiled pattern or a subject string
until after all operations on the
<a href="#matchdatablock">match data block</a>
have taken place, unless, in the case of the subject string, you have used the
PCRE2_COPY_MATCHED_SUBJECT option, which is described in the section entitled
"Option bits for <b>pcre2_match()</b>"
<a href="#matchoptions">below.</a>
</P>
<P>
The <i>options</i> argument for <b>pcre2_compile()</b> contains various bit
settings that affect the compilation. It should be zero if none of them are
required. The available options are described below. Some of them (in
particular, those that are compatible with Perl, but some others as well) can
also be set and unset from within the pattern (see the detailed description in
the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation).
</P>
<P>
For those options that can be different in different parts of the pattern, the
contents of the <i>options</i> argument specifies their settings at the start of
compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and PCRE2_NO_UTF_CHECK
options can be set at the time of matching as well as at compile time.
</P>
<P>
Some additional options and less frequently required compile-time parameters
(for example, the newline setting) can be provided in a compile context (as
described
<a href="#compilecontext">above).</a>
</P>
<P>
If <i>errorcode</i> or <i>erroroffset</i> is NULL, <b>pcre2_compile()</b> returns
NULL immediately. Otherwise, the variables to which these point are set to an
error code and an offset (number of code units) within the pattern,
respectively, when <b>pcre2_compile()</b> returns NULL because a compilation
error has occurred.
</P>
<P>
There are over 100 positive error codes that <b>pcre2_compile()</b> may return
if it finds an error in the pattern. There are also some negative error codes
that are used for invalid UTF strings when validity checking is in force. These
are the same as given by <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, and
are described in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
documentation. There is no separate documentation for the positive error codes,
because the textual error messages that are obtained by calling the
<b>pcre2_get_error_message()</b> function (see "Obtaining a textual error
message"
<a href="#geterrormessage">below)</a>
should be self-explanatory. Macro names starting with PCRE2_ERROR_ are defined
for both positive and negative error codes in <b>pcre2.h</b>. When compilation
is successful <i>errorcode</i> is set to a value that returns the message "no
error" if passed to <b>pcre2_get_error_message()</b>.
</P>
<P>
The value returned in <i>erroroffset</i> is an indication of where in the
pattern an error occurred. When there is no error, zero is returned. A non-zero
value is not necessarily the furthest point in the pattern that was read. For
example, after the error "lookbehind assertion is not fixed length", the error
offset points to the start of the failing assertion. For an invalid UTF-8 or
UTF-16 string, the offset is that of the first code unit of the failing
character.
</P>
<P>
Some errors are not detected until the whole pattern has been scanned; in these
cases, the offset passed back is the length of the pattern. Note that the
offset is in code units, not characters, even in a UTF mode. It may sometimes
point into the middle of a UTF-8 or UTF-16 character.
</P>
<P>
This code fragment shows a typical straightforward call to
<b>pcre2_compile()</b>:
<pre>
pcre2_code *re;
PCRE2_SIZE erroffset;
int errorcode;
re = pcre2_compile(
(PCRE2_SPTR)"^A.*Z", /* the pattern, cast to the code unit pointer type */
PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */
0, /* default options */
&errorcode, /* for error code */
&erroffset, /* for error offset */
NULL); /* no compile context */
</PRE>
</P>
<br><b>
Main compile options
</b><br>
<P>
The following names for option bits are defined in the <b>pcre2.h</b> header
file:
<pre>
PCRE2_ANCHORED
</pre>
If this bit is set, the pattern is forced to be "anchored", that is, it is
constrained to match only at the first matching point in the string that is
being searched (the "subject string"). This effect can also be achieved by
appropriate constructs in the pattern itself, which is the only way to do it in
Perl.
<pre>
PCRE2_ALLOW_EMPTY_CLASS
</pre>
By default, for compatibility with Perl, a closing square bracket that
immediately follows an opening one is treated as a data character for the
class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the class, which
therefore contains no characters and so can never match.
<pre>
PCRE2_ALT_BSUX
</pre>
This option requests alternative handling of three escape sequences, which
makes PCRE2's behaviour more like ECMAscript (aka JavaScript). When it is set:
</P>
<P>
(1) \U matches an upper case "U" character; by default \U causes a compile
time error (Perl uses \U to upper case subsequent characters).
</P>
<P>
(2) \u matches a lower case "u" character unless it is followed by four
hexadecimal digits, in which case the hexadecimal number defines the code point
to match. By default, \u causes a compile time error (Perl uses it to upper
case the following character).
</P>
<P>
(3) \x matches a lower case "x" character unless it is followed by two
hexadecimal digits, in which case the hexadecimal number defines the code point
to match. By default, as in Perl, a hexadecimal number is always expected after
\x, but it may have zero, one, or two digits (so, for example, \xz matches a
binary zero character followed by z).
</P>
<P>
ECMAscript 6 added additional functionality to \u. This can be accessed using
the PCRE2_EXTRA_ALT_BSUX extra option (see "Extra compile options"
<a href="#extracompileoptions">below).</a>
Note that this alternative escape handling applies only to patterns. Neither of
these options affects the processing of replacement strings passed to
<b>pcre2_substitute()</b>.
<pre>
PCRE2_ALT_CIRCUMFLEX
</pre>
In multiline mode (when PCRE2_MULTILINE is set), the circumflex metacharacter
matches at the start of the subject (unless PCRE2_NOTBOL is set), and also
after any internal newline. However, it does not match after a newline at the
end of the subject, for compatibility with Perl. If you want a multiline
circumflex also to match after a terminating newline, you must set
PCRE2_ALT_CIRCUMFLEX.
<pre>
PCRE2_ALT_EXTENDED_CLASS
</pre>
Alters the parsing of character classes to follow the extended syntax
described by Unicode UTS#18. The PCRE2_ALT_EXTENDED_CLASS option has no impact
on the behaviour of the Perl-specific "(?[...])" syntax for extended classes,
but instead enables the alternative syntax of extended class behaviour inside
ordinary "[...]" character classes. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation for details of the character classes supported.
<pre>
PCRE2_ALT_VERBNAMES
</pre>
By default, for compatibility with Perl, the name in any verb sequence such as
(*MARK:NAME) is any sequence of characters that does not include a closing
parenthesis. The name is not processed in any way, and it is not possible to
include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
option is set, normal backslash processing is applied to verb names and only an
unescaped closing parenthesis terminates the name. A closing parenthesis can be
included in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED
or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
whitespace in verb names is skipped and #-comments are recognized, exactly as
in the rest of the pattern.
<pre>
PCRE2_AUTO_CALLOUT
</pre>
If this bit is set, <b>pcre2_compile()</b> automatically inserts callout items,
all with number 255, before each pattern item, except immediately before or
after an explicit callout in the pattern. For discussion of the callout
facility, see the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation.
<pre>
PCRE2_CASELESS
</pre>
If this bit is set, letters in the pattern match both upper and lower case
letters in the subject. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting. If either PCRE2_UTF or
PCRE2_UCP is set, Unicode properties are used for all characters with more than
one other case, and for all characters whose code points are greater than
U+007F.
</P>
<P>
Note that there are two ASCII characters, K and S, that, in addition to
their lower case ASCII equivalents, are case-equivalent with U+212A (Kelvin
sign) and U+017F (long S) respectively. If you do not want this case
equivalence, you can suppress it by setting PCRE2_EXTRA_CASELESS_RESTRICT.
</P>
<P>
One language family, Turkish and Azeri, has its own case-insensitivity rules,
which can be selected by setting PCRE2_EXTRA_TURKISH_CASING. This alters the
behaviour of the 'i', 'I', U+0130 (capital I with dot above), and U+0131
(small dotless i) characters.
</P>
<P>
For lower valued characters with only one other case, a lookup table is used
for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is used
for all code points less than 256, and higher code points (available only in
16-bit or 32-bit mode) are treated as not having another case.
</P>
<P>
From release 10.45 PCRE2_CASELESS also affects what some of the letter-related
Unicode property escapes (\p and \P) match. The properties Lu (upper case
letter), Ll (lower case letter), and Lt (title case letter) are all treated as
LC (cased letter) when PCRE2_CASELESS is set.
<pre>
PCRE2_DOLLAR_ENDONLY
</pre>
If this bit is set, a dollar metacharacter in the pattern matches only at the
end of the subject string. Without this option, a dollar also matches
immediately before a newline at the end of the string (but not before any other
newlines). The PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is
set. There is no equivalent to this option in Perl, and no way to set it within
a pattern.
<pre>
PCRE2_DOTALL
</pre>
If this bit is set, a dot metacharacter in the pattern matches any character,
including one that indicates a newline. However, it only ever matches one
character, even if newlines are coded as CRLF. Without this option, a dot does
not match when the current position in the subject is at a newline. This option
is equivalent to Perl's /s option, and it can be changed within a pattern by a
(?s) option setting. A negative class such as [^a] always matches newline
characters, and the \N escape sequence always matches a non-newline character,
independent of the setting of PCRE2_DOTALL.
<pre>
PCRE2_DUPNAMES
</pre>
If this bit is set, names used to identify capture groups need not be unique.
This can be helpful for certain types of pattern when it is known that only one
instance of the named group can ever be matched. There are more details of
named capture groups below; see also the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
<pre>
PCRE2_ENDANCHORED
</pre>
If this bit is set, the end of any pattern match must be right at the end of
the string being searched (the "subject string"). If the pattern match
succeeds by reaching (*ACCEPT), but does not reach the end of the subject, the
match fails at the current starting point. For unanchored patterns, a new match
is then tried at the next starting point. However, if the match succeeds by
reaching the end of the pattern, but not the end of the subject, backtracking
occurs and an alternative match may be found. Consider these two patterns:
<pre>
.(*ACCEPT)|..
.|..
</pre>
If matched against "abc" with PCRE2_ENDANCHORED set, the first matches "c"
whereas the second matches "bc". The effect of PCRE2_ENDANCHORED can also be
achieved by appropriate constructs in the pattern itself, which is the only way
to do it in Perl.
</P>
<P>
For DFA matching with <b>pcre2_dfa_match()</b>, PCRE2_ENDANCHORED applies only
to the first (that is, the longest) matched string. Other parallel matches,
which are necessarily substrings of the first one, must obviously end before
the end of the subject.
<pre>
PCRE2_EXTENDED
</pre>
If this bit is set, most white space characters in the pattern are totally
ignored except when escaped, inside a character class, or inside a \Q...\E
sequence. However, white space is not allowed within sequences such as (?&#62; that
introduce various parenthesized groups, nor within numerical quantifiers such
as {1,3}. Ignorable white space is permitted between an item and a following
quantifier and between a quantifier and a following + that indicates
possessiveness. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
changed within a pattern by a (?x) option setting.
</P>
<P>
When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as
white space only those characters with code points less than 256 that are
flagged as white space in its low-character table. The table is normally
created by
<a href="pcre2_maketables.html"><b>pcre2_maketables()</b>,</a>
which uses the <b>isspace()</b> function to identify space characters. In most
ASCII environments, the relevant characters are those with code points 0x0009
(tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D
(carriage return), and 0x0020 (space).
</P>
<P>
When PCRE2 is compiled with Unicode support, in addition to these characters,
five more Unicode "Pattern White Space" characters are recognized by
PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-right mark),
U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
separator). This set of characters is the same as recognized by Perl's /x
option. Note that the horizontal and vertical space characters that are matched
by the \h and \v escapes in patterns are a much bigger set.
</P>
<P>
As well as ignoring most white space, PCRE2_EXTENDED also causes characters
between an unescaped # outside a character class and the next newline,
inclusive, to be ignored, which makes it possible to include comments inside
complicated patterns. Note that the end of this type of comment is a literal
newline sequence in the pattern; escape sequences that happen to represent a
newline do not count.
</P>
<P>
Which characters are interpreted as newlines can be specified by a setting in
the compile context that is passed to <b>pcre2_compile()</b> or by a special
sequence at the start of the pattern, as described in the section entitled
<a href="pcre2pattern.html#newlines">"Newline conventions"</a>
in the <b>pcre2pattern</b> documentation. A default is defined when PCRE2 is
built.
<pre>
PCRE2_EXTENDED_MORE
</pre>
This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
and horizontal tab characters are ignored inside a character class. Note: only
these two characters are ignored, not the full set of pattern white space
characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
equivalent to Perl's /xx option, and it can be changed within a pattern by a
(?xx) option setting.
<pre>
PCRE2_FIRSTLINE
</pre>
If this option is set, the start of an unanchored pattern match must be before
or at the first newline in the subject string following the start of matching,
though the matched text may continue over the newline. If <i>startoffset</i> is
non-zero, the limiting newline is not necessarily the first newline in the
subject. For example, if the subject string is "abc\nxyz" (where \n
represents a single-character newline) a pattern match for "yz" succeeds with
PCRE2_FIRSTLINE if <i>startoffset</i> is greater than 3. See also
PCRE2_USE_OFFSET_LIMIT, which provides a more general limiting facility. If
PCRE2_FIRSTLINE is set with an offset limit, a match must occur in the first
line and also within the offset limit. In other words, whichever limit comes
first is used. This option has no effect for anchored patterns.
<pre>
PCRE2_LITERAL
</pre>
If this option is set, all meta-characters in the pattern are disabled, and it
is treated as a literal string. Matching literal strings with a regular
expression engine is not the most efficient way of doing it. If you are doing a
lot of literal matching and are worried about efficiency, you should consider
using other approaches. The only other main options that are allowed with
PCRE2_LITERAL are: PCRE2_ANCHORED, PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT,
PCRE2_CASELESS, PCRE2_FIRSTLINE, PCRE2_MATCH_INVALID_UTF,
PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an error.
<pre>
PCRE2_MATCH_INVALID_UTF
</pre>
This option forces PCRE2_UTF (see below) and also enables support for matching
by <b>pcre2_match()</b> in subject strings that contain invalid UTF sequences.
Note, however, that the 16-bit and 32-bit PCRE2 libraries process strings as
sequences of uint16_t or uint32_t code units. They cannot find valid UTF
sequences within an arbitrary string of bytes unless such sequences are
suitably aligned. This facility is not supported for DFA matching. For details,
see the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
documentation.
<pre>
PCRE2_MATCH_UNSET_BACKREF
</pre>
If this option is set, a backreference to an unset capture group matches an
empty string (by default this causes the current matching alternative to fail).
A pattern such as (\1)(a) succeeds when this option is set (assuming it can
find an "a" in the subject), whereas it fails by default, for Perl
compatibility. Setting this option makes PCRE2 behave more like ECMAscript (aka
JavaScript).
<pre>
PCRE2_MULTILINE
</pre>
By default, for the purposes of matching "start of line" and "end of line",
PCRE2 treats the subject string as consisting of a single line of characters,
even if it actually contains newlines. The "start of line" metacharacter (^)
matches only at the start of the string, and the "end of line" metacharacter
($) matches only at the end of the string, or before a terminating newline
(except when PCRE2_DOLLAR_ENDONLY is set). Note, however, that unless
PCRE2_DOTALL is set, the "any character" metacharacter (.) does not match at a
newline. This behaviour (for ^, $, and dot) is the same as Perl.
</P>
<P>
When PCRE2_MULTILINE is set, the "start of line" and "end of line"
constructs match immediately following or immediately before internal newlines
in the subject string, respectively, as well as at the very start and end. This
is equivalent to Perl's /m option, and it can be changed within a pattern by a
(?m) option setting. Note that the "start of line" metacharacter does not match
after a newline at the end of the subject, for compatibility with Perl.
However, you can change this by setting the PCRE2_ALT_CIRCUMFLEX option. If
there are no newlines in a subject string, or no occurrences of ^ or $ in a
pattern, setting PCRE2_MULTILINE has no effect.
<pre>
PCRE2_NEVER_BACKSLASH_C
</pre>
This option locks out the use of \C in the pattern that is being compiled.
This escape can cause unpredictable behaviour in UTF-8 or UTF-16 modes, because
it may leave the current matching point in the middle of a multi-code-unit
character. This option may be useful in applications that process patterns from
external sources. Note that there is also a build-time option that permanently
locks out the use of \C.
<pre>
PCRE2_NEVER_UCP
</pre>
This option locks out the use of Unicode properties for handling \B, \b, \D,
\d, \S, \s, \W, \w, and some of the POSIX character classes, as described
for the PCRE2_UCP option below. In particular, it prevents the creator of the
pattern from enabling this facility by starting the pattern with (*UCP). This
option may be useful in applications that process patterns from external
sources. The option combination PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
<pre>
PCRE2_NEVER_UTF
</pre>
This option locks out interpretation of the pattern as UTF-8, UTF-16, or
UTF-32, depending on which library is in use. In particular, it prevents the
creator of the pattern from switching to UTF interpretation by starting the
pattern with (*UTF). This option may be useful in applications that process
patterns from external sources. The combination of PCRE2_UTF and
PCRE2_NEVER_UTF causes an error.
<pre>
PCRE2_NO_AUTO_CAPTURE
</pre>
If this option is set, it disables the use of numbered capturing parentheses in
the pattern. Any opening parenthesis that is not followed by ? behaves as if it
were followed by ?: but named parentheses can still be used for capturing (and
they acquire numbers in the usual way). This is the same as Perl's /n option.
Note that, when this option is set, references to capture groups
(backreferences or recursion/subroutine calls) may only refer to named groups,
though the reference can be by name or by number.
<pre>
PCRE2_NO_AUTO_POSSESS
</pre>
If this (deprecated) option is set, it disables "auto-possessification", which
is an optimization that, for example, turns a+b into a++b in order to avoid
backtracks into a+ that can never be successful. However, if callouts are in
use, auto-possessification means that some callouts are never taken. You can
set this option if you want the matching functions to do a full unoptimized
search and run all the callouts, but it is mainly provided for testing
purposes.
</P>
<P>
If a compile context is available, it is recommended to use
<b>pcre2_set_optimize()</b> with the <i>directive</i> PCRE2_AUTO_POSSESS_OFF rather
than the compile option PCRE2_NO_AUTO_POSSESS. Note that PCRE2_NO_AUTO_POSSESS
takes precedence over the <b>pcre2_set_optimize()</b> optimization directives
PCRE2_AUTO_POSSESS and PCRE2_AUTO_POSSESS_OFF.
<pre>
PCRE2_NO_DOTSTAR_ANCHOR
</pre>
If this (deprecated) option is set, it disables an optimization that is applied
when .* is the first significant item in a top-level branch of a pattern, and
all the other branches also start with .* or with \A or \G or ^. The
optimization is automatically disabled for .* if it is inside an atomic group
or a capture group that is the subject of a backreference, or if the pattern
contains (*PRUNE) or (*SKIP). When the optimization is not disabled, such a
pattern is automatically anchored if PCRE2_DOTALL is set for all the .* items
and PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any
match must start either at the start of the subject or following a newline is
remembered. Like other optimizations, this can cause callouts to be skipped.
(If a compile context is available, it is recommended to use
<b>pcre2_set_optimize()</b> with the <i>directive</i> PCRE2_DOTSTAR_ANCHOR_OFF
instead.)
<pre>
PCRE2_NO_START_OPTIMIZE
</pre>
This is an option whose main effect is at matching time. It does not change
what <b>pcre2_compile()</b> generates, but it does affect the output of the JIT
compiler. Setting this option is equivalent to calling <b>pcre2_set_optimize()</b>
with the <i>directive</i> parameter set to PCRE2_START_OPTIMIZE_OFF.
</P>
<P>
There are a number of optimizations that may occur at the start of a match, in
order to speed up the process. For example, if it is known that an unanchored
match must start with a specific code unit value, the matching code searches
the subject for that value, and fails immediately if it cannot find it, without
actually running the main matching function. The start-up optimizations are
in effect a pre-scan of the subject that takes place before the pattern is run.
</P>
<P>
Disabling the start-up optimizations may cause performance to suffer. However,
this may be desirable for patterns which contain callouts or items such as
(*COMMIT) and (*MARK). See the above description of PCRE2_START_OPTIMIZE_OFF
for further details.
<pre>
PCRE2_NO_UTF_CHECK
</pre>
When PCRE2_UTF is set, the validity of the pattern as a UTF string is
automatically checked. There are discussions about the validity of
<a href="pcre2unicode.html#utf8strings">UTF-8 strings,</a>
<a href="pcre2unicode.html#utf16strings">UTF-16 strings,</a>
and
<a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
document. If an invalid UTF sequence is found, <b>pcre2_compile()</b> returns a
negative error code.
</P>
<P>
If you know that your pattern is a valid UTF string, and you want to skip this
check for performance reasons, you can set the PCRE2_NO_UTF_CHECK option. When
it is set, the effect of passing an invalid UTF string as a pattern is
undefined. It may cause your program to crash or loop.
</P>
<P>
Note that this option can also be passed to <b>pcre2_match()</b> and
<b>pcre2_dfa_match()</b>, to suppress UTF validity checking of the subject
string.
</P>
<P>
Note also that setting PCRE2_NO_UTF_CHECK at compile time does not disable the
error that is given if an escape sequence for an invalid Unicode code point is
encountered in the pattern. In particular, the so-called "surrogate" code
points (0xd800 to 0xdfff) are invalid. If you want to allow escape sequences
such as \x{d800} you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
option, as described in the section entitled "Extra compile options"
<a href="#extracompileoptions">below.</a>
However, this is possible only in UTF-8 and UTF-32 modes, because these values
are not representable in UTF-16.
<pre>
PCRE2_UCP
</pre>
This option has two effects. Firstly, it changes the way PCRE2 processes \B,
\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By
default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
properties are used to classify characters. There are some PCRE2_EXTRA
options (see below) that add finer control to this behaviour. More details are
given in the section on
<a href="pcre2pattern.html#genericchartypes">generic character types</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page.
</P>
<P>
The second effect of PCRE2_UCP is to force the use of Unicode properties for
upper/lower casing operations, even when PCRE2_UTF is not set. This makes it
possible to process strings in the 16-bit UCS-2 code. This option is available
only if PCRE2 has been compiled with Unicode support (which is the default).
</P>
<P>
The PCRE2_EXTRA_CASELESS_RESTRICT option (see above) restricts caseless
matching such that ASCII characters match only ASCII characters and non-ASCII
characters match only non-ASCII characters. The PCRE2_EXTRA_TURKISH_CASING option
(see above) alters the matching of the 'i' characters to follow their behaviour
in Turkish and Azeri languages. For further details on
PCRE2_EXTRA_CASELESS_RESTRICT and PCRE2_EXTRA_TURKISH_CASING, see the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
<pre>
PCRE2_UNGREEDY
</pre>
This option inverts the "greediness" of the quantifiers so that they are not
greedy by default, but become greedy if followed by "?". It is not compatible
with Perl. It can also be set by a (?U) option setting within the pattern.
<pre>
PCRE2_USE_OFFSET_LIMIT
</pre>
This option must be set for <b>pcre2_compile()</b> if
<b>pcre2_set_offset_limit()</b> is going to be used to set a non-default offset
limit in a match context for matches that use this pattern. An error is
generated if an offset limit is set without this option. For more details, see
the description of <b>pcre2_set_offset_limit()</b> in the
<a href="#matchcontext">section</a>
that describes match contexts. See also the PCRE2_FIRSTLINE
option above.
<pre>
PCRE2_UTF
</pre>
This option causes PCRE2 to regard both the pattern and the subject strings
that are subsequently processed as strings of UTF characters instead of
single-code-unit strings. It is available when PCRE2 is built to include
Unicode support (which is the default). If Unicode support is not available,
the use of this option provokes an error. Details of how PCRE2_UTF changes the
behaviour of PCRE2 are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page. In particular, note that it changes the way PCRE2_CASELESS works.
<a name="extracompileoptions"></a></P>
<br><b>
Extra compile options
</b><br>
<P>
The option bits that can be set in a compile context by calling the
<b>pcre2_set_compile_extra_options()</b> function are as follows:
<pre>
PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
</pre>
Since release 10.38 PCRE2 has forbidden the use of \K within lookaround
assertions, following Perl's lead. This option is provided to re-enable the
previous behaviour (act in positive lookarounds, ignore in negative ones) in
case anybody is relying on it.
<pre>
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
</pre>
This option applies when compiling a pattern in UTF-8 or UTF-32 mode. It is
forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode "surrogate"
code points in the range 0xd800 to 0xdfff are used in pairs in UTF-16 to encode
code points with values in the range 0x10000 to 0x10ffff. The surrogates cannot
therefore be represented in UTF-16. They can be represented in UTF-8 and
UTF-32, but are defined as invalid code points, and cause errors if encountered
in a UTF-8 or UTF-32 string that is being checked for validity by PCRE2.
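</P>
<P>
The relationship between surrogates and code points above 0xffff can be
sketched in a few lines of C. This is an illustration only and is not part of
the PCRE2 API; the function names is_surrogate() and utf16_encode_pair() are
invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if cp is a Unicode surrogate code point (0xd800 to 0xdfff). */
int is_surrogate(uint32_t cp)
{
  return cp >= 0xd800 && cp <= 0xdfff;
}

/* Encodes a code point in the range 0x10000 to 0x10ffff as a UTF-16
   surrogate pair: a high surrogate (0xd800 to 0xdbff) followed by a
   low surrogate (0xdc00 to 0xdfff). */
void utf16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
  assert(cp >= 0x10000 && cp <= 0x10ffff);
  cp -= 0x10000;                              /* 20 significant bits remain */
  *high = (uint16_t)(0xd800 + (cp >> 10));    /* top 10 bits */
  *low  = (uint16_t)(0xdc00 + (cp & 0x3ff));  /* bottom 10 bits */
}
```

For instance, U+1F600 is encoded as the pair 0xd83d, 0xde00. Both halves lie in
the surrogate range, which is why a surrogate value cannot stand for itself in
a UTF-16 string.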
</P>
<P>
These values also cause errors if encountered in escape sequences such as
\x{d912} within a pattern. However, it seems that some applications, when
using PCRE2 to check for unwanted characters in UTF-8 strings, explicitly test
for the surrogates using escape sequences. The PCRE2_NO_UTF_CHECK option does
not disable the error that occurs, because it applies only to the testing of
input strings for UTF validity.
</P>
<P>
If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surrogate code
point values in UTF-8 and UTF-32 patterns no longer provoke errors and are
incorporated in the compiled pattern. However, they can only match subject
characters if the matching function is called with PCRE2_NO_UTF_CHECK set.
<pre>
PCRE2_EXTRA_ALT_BSUX
</pre>
The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and \x in
the way that ECMAscript (aka JavaScript) does. Additional functionality was
defined by ECMAscript 6; setting PCRE2_EXTRA_ALT_BSUX has the effect of
PCRE2_ALT_BSUX, but in addition it recognizes \u{hhh..} as a hexadecimal
character code, where hhh.. is any number of hexadecimal digits.
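</P>
<P>
As a rough illustration of the \u{hhh..} syntax, the following C sketch reads
such a sequence from a string. It is not PCRE2's own parser: the function name
parse_u_braced() is invented here, and the rejection of values above 0x10ffff
is an assumption of the sketch, not a statement about PCRE2's behaviour.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Parses s, which is expected to begin with \u{hhh..}. On success the code
   point is stored in *cp and the number of chars consumed is returned;
   on failure 0 is returned and *cp is left unchanged. */
size_t parse_u_braced(const char *s, uint32_t *cp)
{
  size_t i = 3;
  uint32_t value = 0;

  if (s[0] != '\\' || s[1] != 'u' || s[2] != '{') return 0;
  if (!isxdigit((unsigned char)s[i])) return 0;  /* at least one digit */

  for (; isxdigit((unsigned char)s[i]); i++)
  {
    int c = tolower((unsigned char)s[i]);
    uint32_t digit = (uint32_t)(isdigit(c) ? c - '0' : c - 'a' + 10);
    value = value * 16 + digit;
    if (value > 0x10ffff) return 0;  /* limit chosen by this sketch */
  }

  if (s[i] != '}') return 0;         /* missing closing brace */
  *cp = value;
  return i + 1;
}
```

Called with the C string "\\u{1F600}", the function consumes nine characters
and stores 0x1f600.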
<pre>
PCRE2_EXTRA_ASCII_BSD
</pre>
This option forces \d to match only ASCII digits, even when PCRE2_UCP is set.
It can be changed within a pattern by means of the (?aD) option setting.
<pre>
PCRE2_EXTRA_ASCII_BSS
</pre>
This option forces \s to match only ASCII space characters, even when
PCRE2_UCP is set. It can be changed within a pattern by means of the (?aS)
option setting.
<pre>
PCRE2_EXTRA_ASCII_BSW
</pre>
This option forces \w to match only ASCII word characters, even when PCRE2_UCP
is set. It can be changed within a pattern by means of the (?aW) option
setting.
<pre>
PCRE2_EXTRA_ASCII_DIGIT
</pre>
This option forces the POSIX character classes [:digit:] and [:xdigit:] to
match only ASCII digits, even when PCRE2_UCP is set. It can be changed within
a pattern by means of the (?aT) option setting.
<pre>
PCRE2_EXTRA_ASCII_POSIX
</pre>
This option forces all the POSIX character classes, including [:digit:] and
[:xdigit:], to match only ASCII characters, even when PCRE2_UCP is set. It can
be changed within a pattern by means of the (?aP) option setting, but note that
this also sets PCRE2_EXTRA_ASCII_DIGIT in order to ensure that (?-aP) unsets
all ASCII restrictions for POSIX classes.
<pre>
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
</pre>
This is a dangerous option. Use with care. By default, an unrecognized escape
such as \j or a malformed one such as \x{2z} causes a compile-time error when
detected by <b>pcre2_compile()</b>. Perl is somewhat inconsistent in handling
such items: for example, \j is treated as a literal "j", and non-hexadecimal
digits in \x{} are just ignored, though warnings are given in both cases if
Perl's warning switch is enabled. However, a malformed octal number after \o{
always causes an error in Perl.
</P>
<P>
If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
<b>pcre2_compile()</b>, all unrecognized or malformed escape sequences are
treated as single-character escapes. For example, \j is a literal "j" and
\x{2z} is treated as the literal string "x{2z}". Setting this option means
that typos in patterns may go undetected and have unexpected results. Also note
that a sequence such as [\N{] is interpreted as a malformed attempt at
[\N{...}] and so is treated as [N{] whereas [\N] gives an error because an
unqualified \N is a valid escape sequence but is not supported in a character
class. To reiterate: this is a dangerous option. Use with great care.
<pre>
PCRE2_EXTRA_CASELESS_RESTRICT
</pre>
When either PCRE2_UCP or PCRE2_UTF is set, caseless matching follows Unicode
rules, which allow for more than two cases per character. There are two
case-equivalent character sets that contain both ASCII and non-ASCII
characters. The ASCII letter S is case-equivalent to U+017f (long S) and the
ASCII letter K is case-equivalent to U+212a (Kelvin sign). This option disables
recognition of case-equivalences that cross the ASCII/non-ASCII boundary. In a
caseless match, either both characters must be ASCII or both must be non-ASCII. The option
can be changed within a pattern by the (*CASELESS_RESTRICT) or (?r) option
settings.
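</P>
<P>
The boundary rule can be pictured as an extra filter applied on top of the
normal case-equivalence test. The sketch below shows only that filter; the
function name same_ascii_side() is invented for this example and does not
exist in PCRE2.

```c
#include <stdint.h>

/* Returns 1 if code points a and b lie on the same side of the
   ASCII/non-ASCII boundary, that is, both are below 128 or neither is. */
int same_ascii_side(uint32_t a, uint32_t b)
{
  return (a < 128) == (b < 128);
}
```

With this filter, 'k' and 'K' may still match each other caselessly, but 'K'
and U+212a (Kelvin sign) may not, and neither may 's' and U+017f (long S).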
<pre>
PCRE2_EXTRA_ESCAPED_CR_IS_LF
</pre>
There are some legacy applications where the escape sequence \r in a pattern
is expected to match a newline. If this option is set, \r in a pattern is
converted to \n so that it matches a LF (linefeed) instead of a CR (carriage
return) character. The option does not affect a literal CR in the pattern, nor
does it affect CR specified as an explicit code point such as \x{0D}.
<pre>
PCRE2_EXTRA_MATCH_LINE
</pre>
This option is provided for use by the <b>-x</b> option of <b>pcre2grep</b>. It
causes the pattern only to match complete lines. This is achieved by
automatically inserting the code for "^(?:" at the start of the compiled
pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set, the matched
line may be in the middle of the subject string. This option can be used with
PCRE2_LITERAL.
<pre>
PCRE2_EXTRA_MATCH_WORD
</pre>
This option is provided for use by the <b>-w</b> option of <b>pcre2grep</b>. It
causes the pattern only to match strings that have a word boundary at the start
and the end. This is achieved by automatically inserting the code for "\b(?:"
at the start of the compiled pattern and ")\b" at the end. The option may be
used with PCRE2_LITERAL. However, it is ignored if PCRE2_EXTRA_MATCH_LINE is
also set.
<pre>
PCRE2_EXTRA_NO_BS0
</pre>
If this option is set (note that its final character is the digit 0) it locks
out the use of the sequence \0 unless at least one more octal digit follows.
<pre>
PCRE2_EXTRA_PYTHON_OCTAL
</pre>
If this option is set, PCRE2 follows Python's rules for interpreting octal
escape sequences. The rules for handling sequences such as \14, which could
be an octal number or a backreference, differ from the default. Details are given in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
<pre>
PCRE2_EXTRA_NEVER_CALLOUT
</pre>
If this option is set, PCRE2 treats callouts in the pattern as a syntax error,
returning PCRE2_ERROR_CALLOUT_CALLER_DISABLED. This is useful if the application
knows that a callout will not be provided to <b>pcre2_match()</b>, so that
callouts in the pattern are not silently ignored.
<pre>
PCRE2_EXTRA_TURKISH_CASING
</pre>
This option alters case-equivalence of the 'i' letters to follow the
alphabet used by Turkish and Azeri languages. The option can be changed within
a pattern by the (*TURKISH_CASING) start-of-pattern setting. Either the UTF or
UCP options must be set. In the 8-bit library, UTF must be set. This option
cannot be combined with PCRE2_EXTRA_CASELESS_RESTRICT.
<a name="jitcompiling"></a></P>
<br><a name="SEC21" href="#TOC1">JUST-IN-TIME (JIT) COMPILATION</a><br>
<P>
<b>int pcre2_jit_compile(pcre2_code *<i>code</i>, uint32_t <i>options</i>);</b>
<br>
<br>
<b>int pcre2_jit_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>);</b>
<br>
<br>
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_jit_stack *pcre2_jit_stack_create(size_t <i>startsize</i>,</b>
<b> size_t <i>maxsize</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_jit_stack_assign(pcre2_match_context *<i>mcontext</i>,</b>
<b> pcre2_jit_callback <i>callback_function</i>, void *<i>callback_data</i>);</b>
<br>
<br>
<b>void pcre2_jit_stack_free(pcre2_jit_stack *<i>jit_stack</i>);</b>
</P>
<P>
These functions provide support for JIT compilation, which, if the just-in-time
compiler is available, further processes a compiled pattern into machine code
that executes much faster than the <b>pcre2_match()</b> interpretive matching
function. Full details are given in the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation.
</P>
<P>
JIT compilation is a heavyweight optimization. It can take some time for
patterns to be analyzed, and for one-off matches and simple patterns the
benefit of faster execution might be offset by a much slower compilation time.
Most (but not all) patterns can be optimized by the JIT compiler.
<a name="localesupport"></a></P>
<br><a name="SEC22" href="#TOC1">LOCALE SUPPORT</a><br>
<P>
<b>const uint8_t *pcre2_maketables(pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_maketables_free(pcre2_general_context *<i>gcontext</i>,</b>
<b> const uint8_t *<i>tables</i>);</b>
</P>
<P>
PCRE2 handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables, indexed by character code
point. However, this applies only to characters whose code points are less than
256. By default, higher-valued code points never match escapes such as \w or
\d.
</P>
<P>
When PCRE2 is built with Unicode support (the default), certain Unicode
character properties can be tested with \p and \P, or, alternatively, the
PCRE2_UCP option can be set when a pattern is compiled; this causes \w and
friends to use Unicode property support instead of the built-in tables.
PCRE2_UCP also causes upper/lower casing operations on characters with code
points greater than 127 to use Unicode properties. These effects apply even
when PCRE2_UTF is not set. There are, however, some PCRE2_EXTRA options (see
above) that can be used to modify or suppress them.
</P>
<P>
The use of locales with Unicode is discouraged. If you are handling characters
with code points greater than 127, you should either use Unicode support, or
use locales, but not try to mix the two.
</P>
<P>
PCRE2 contains a built-in set of character tables that are used by default.
These are sufficient for many applications. Normally, the internal tables
recognize only ASCII characters. However, when PCRE2 is built, it is possible
to cause the internal tables to be rebuilt in the default "C" locale of the
local system, which may cause them to be different.
</P>
<P>
The built-in tables can be overridden by tables supplied by the application
that calls PCRE2. These may be created in a different locale from the default.
As more and more applications change to using Unicode, the need for this locale
support is expected to die away.
</P>
<P>
External tables are built by calling the <b>pcre2_maketables()</b> function, in
the relevant locale. The only argument to this function is a general context,
which can be used to pass a custom memory allocator. If the argument is NULL,
the system <b>malloc()</b> is used. The result can be passed to
<b>pcre2_compile()</b> as often as necessary, by creating a compile context and
calling <b>pcre2_set_character_tables()</b> to set the tables pointer therein.
</P>
<P>
For example, to build and use tables that are appropriate for the French locale
(where accented characters with values greater than 127 are treated as
letters), the following code could be used:
<pre>
setlocale(LC_CTYPE, "fr_FR");
tables = pcre2_maketables(NULL);
ccontext = pcre2_compile_context_create(NULL);
pcre2_set_character_tables(ccontext, tables);
re = pcre2_compile(..., ccontext);
</pre>
The locale name "fr_FR" is used on Linux and other Unix-like systems; if you
are using Windows, the name for the French locale is "french".
</P>
<P>
The pointer that is passed (via the compile context) to <b>pcre2_compile()</b>
is saved with the compiled pattern, and the same tables are used by the
matching functions. Thus, for any single pattern, compilation and matching both
happen in the same locale, but different patterns can be processed in different
locales.
</P>
<P>
It is the caller's responsibility to ensure that the memory containing the
tables remains available while they are still in use. When they are no longer
needed, you can discard them using <b>pcre2_maketables_free()</b>, passing as
its first argument the same general context that was used to create the
tables.
</P>
<br><b>
Saving locale tables
</b><br>
<P>
The tables described above are just a sequence of binary bytes, which makes
them independent of hardware characteristics such as endianness or whether the
processor is 32-bit or 64-bit. A copy of the result of <b>pcre2_maketables()</b>
can therefore be saved in a file or elsewhere and re-used later, even in a
different program or on another computer. The size of the tables (number of
bytes) must be obtained by calling <b>pcre2_config()</b> with the
PCRE2_CONFIG_TABLES_LENGTH option because <b>pcre2_maketables()</b> does not
return this value. Note that the <b>pcre2_dftables</b> program, which is part of
the PCRE2 build system, can be used stand-alone to create a file that contains
a set of binary tables. See the
<a href="pcre2build.html#createtables"><b>pcre2build</b></a>
documentation for details.
<a name="infoaboutpattern"></a></P>
<br><a name="SEC23" href="#TOC1">INFORMATION ABOUT A COMPILED PATTERN</a><br>
<P>
<b>int pcre2_pattern_info(const pcre2_code *<i>code</i>, uint32_t <i>what</i>, void *<i>where</i>);</b>
</P>
<P>
The <b>pcre2_pattern_info()</b> function returns general information about a
compiled pattern. For information about callouts, see the
<a href="#infoaboutcallouts">next section.</a>
The first argument for <b>pcre2_pattern_info()</b> is a pointer to the compiled
pattern. The second argument specifies which piece of information is required,
and the third argument is a pointer to a variable to receive the data. If the
third argument is NULL, the first argument is ignored, and the function returns
the size in bytes of the variable that is required for the information
requested. Otherwise, the yield of the function is zero for success, or one of
the following negative numbers:
<pre>
PCRE2_ERROR_NULL the argument <i>code</i> was NULL
PCRE2_ERROR_BADMAGIC the "magic number" was not found
PCRE2_ERROR_BADOPTION the value of <i>what</i> was invalid
PCRE2_ERROR_UNSET the requested field is not set
</pre>
The "magic number" is placed at the start of each compiled pattern as a simple
check against passing an arbitrary memory pointer. Here is a typical call of
<b>pcre2_pattern_info()</b>, to obtain the length of the compiled pattern:
<pre>
int rc;
size_t length;
rc = pcre2_pattern_info(
re, /* result of pcre2_compile() */
PCRE2_INFO_SIZE, /* what is required */
&length); /* where to put the data */
</pre>
The possible values for the second argument are defined in <b>pcre2.h</b>, and
are as follows:
<pre>
PCRE2_INFO_ALLOPTIONS
PCRE2_INFO_ARGOPTIONS
PCRE2_INFO_EXTRAOPTIONS
</pre>
Return copies of the pattern's options. The third argument should point to a
<b>uint32_t</b> variable. PCRE2_INFO_ARGOPTIONS returns exactly the options that
were passed to <b>pcre2_compile()</b>, whereas PCRE2_INFO_ALLOPTIONS returns
the compile options as modified by any top-level (*XXX) option settings such as
(*UTF) at the start of the pattern itself. PCRE2_INFO_EXTRAOPTIONS returns the
extra options that were set in the compile context by calling the
<b>pcre2_set_compile_extra_options()</b> function.
</P>
<P>
For example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EXTENDED
option, the result for PCRE2_INFO_ALLOPTIONS is PCRE2_EXTENDED and PCRE2_UTF.
Option settings such as (?i) that can change within a pattern do not affect the
result of PCRE2_INFO_ALLOPTIONS, even if they appear right at the start of the
pattern. (This was different in some earlier releases.)
</P>
<P>
A pattern compiled without PCRE2_ANCHORED is automatically anchored by PCRE2 if
the first significant item in every top-level branch is one of the following:
<pre>
^ unless PCRE2_MULTILINE is set
\A always
\G always
.* sometimes - see below
</pre>
When .* is the first significant item, anchoring is possible only when all the
following are true:
<pre>
.* is not in an atomic group
.* is not in a capture group that is the subject of a backreference
PCRE2_DOTALL is in force for .*
Neither (*PRUNE) nor (*SKIP) appears in the pattern
PCRE2_NO_DOTSTAR_ANCHOR is not set
Dotstar anchoring has not been disabled with PCRE2_DOTSTAR_ANCHOR_OFF
</pre>
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
options returned for PCRE2_INFO_ALLOPTIONS.
<pre>
PCRE2_INFO_BACKREFMAX
</pre>
Return the number of the highest backreference in the pattern. The third
argument should point to a <b>uint32_t</b> variable. Named capture groups
acquire numbers as well as names, and these count towards the highest
backreference. Backreferences such as \4 or \g{12} match the captured
characters of the given group, but in addition, the check that a capture
group is set in a conditional group such as (?(3)a|b) is also a backreference.
Zero is returned if there are no backreferences.
<pre>
PCRE2_INFO_BSR
</pre>
The output is a uint32_t integer whose value indicates what character sequences
the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that \R
matches any Unicode line ending sequence; a value of PCRE2_BSR_ANYCRLF means
that \R matches only CR, LF, or CRLF.
<pre>
PCRE2_INFO_CAPTURECOUNT
</pre>
Return the highest capture group number in the pattern. In patterns where (?|
is not used, this is also the total number of capture groups. The third
argument should point to a <b>uint32_t</b> variable.
<pre>
PCRE2_INFO_DEPTHLIMIT
</pre>
If the pattern set a backtracking depth limit by including an item of the form
(*LIMIT_DEPTH=nnnn) at the start, the value is returned. The third argument
should point to a uint32_t integer. If no such value has been set, the call to
<b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note that this
limit will only be used during matching if it is less than the limit set or
defaulted by the caller of the match function.
<pre>
PCRE2_INFO_FIRSTBITMAP
</pre>
In the absence of a single first code unit for a non-anchored pattern,
<b>pcre2_compile()</b> may construct a 256-bit table that defines a fixed set of
values for the first code unit in any match. For example, a pattern that starts
with [abc] results in a table with three bits set. When code unit values
greater than 255 are supported, the flag bit for 255 means "any code unit of
value 255 or above". If such a table was constructed, a pointer to it is
returned. Otherwise NULL is returned. The third argument should point to a
<b>const uint8_t *</b> variable.
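</P>
<P>
A 256-bit table of this kind occupies 32 bytes. The following sketch shows how
such a table might be filled and tested. It assumes that the bit for code unit
value c is stored in byte c/8 at bit position c%8, counting from the least
significant bit; that layout, and the function names bitmap_set() and
bitmap_test(), are assumptions of this example.

```c
#include <stdint.h>

/* Sets the bit for code unit value c in a 256-bit (32-byte) table. */
void bitmap_set(uint8_t *bitmap, uint32_t c)
{
  if (c > 255) c = 255;             /* bit 255 means "255 or above" */
  bitmap[c / 8] |= (uint8_t)(1u << (c % 8));
}

/* Returns 1 if the table allows a match to start with code unit value c. */
int bitmap_test(const uint8_t *bitmap, uint32_t c)
{
  if (c > 255) c = 255;
  return (int)((bitmap[c / 8] >> (c % 8)) & 1u);
}
```

For a pattern that starts with [abc], setting the bits for 'a', 'b', and 'c' in
a zeroed 32-byte array gives a table in which bitmap_test() succeeds for those
three values and fails for every other one.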
<pre>
PCRE2_INFO_FIRSTCODETYPE
</pre>
Return information about the first code unit of any matched string, for a
non-anchored pattern. The third argument should point to a <b>uint32_t</b>
variable. If there is a fixed first value, for example, the letter "c" from a
pattern such as (cat|cow|coyote), 1 is returned, and the value can be retrieved
using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed first value, but it is
known that a match can occur only at the start of the subject or following a
newline in the subject, 2 is returned. Otherwise, and for anchored patterns, 0
is returned.
<pre>
PCRE2_INFO_FIRSTCODEUNIT
</pre>
Return the value of the first code unit of any matched string for a pattern
where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0. The third
argument should point to a <b>uint32_t</b> variable. In the 8-bit library, the
value is always less than 256. In the 16-bit library the value can be up to
0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff,
and up to 0xffffffff when not using UTF-32 mode.
<pre>
PCRE2_INFO_FRAMESIZE
</pre>
Return the size (in bytes) of the data frames that are used to remember
backtracking positions when the pattern is processed by <b>pcre2_match()</b>
without the use of JIT. The third argument should point to a <b>size_t</b>
variable. The frame size depends on the number of capturing parentheses in the
pattern. Each additional capture group adds two PCRE2_SIZE variables.
<pre>
PCRE2_INFO_HASBACKSLASHC
</pre>
Return 1 if the pattern contains any instances of \C, otherwise 0. The third
argument should point to a <b>uint32_t</b> variable.
<pre>
PCRE2_INFO_HASCRORLF
</pre>
Return 1 if the pattern contains any explicit matches for CR or LF characters,
otherwise 0. The third argument should point to a <b>uint32_t</b> variable. An
explicit match is either a literal CR or LF character, or \r or \n or one of
the equivalent hexadecimal or octal escape sequences.
<pre>
PCRE2_INFO_HEAPLIMIT
</pre>
If the pattern set a heap memory limit by including an item of the form
(*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argument
should point to a uint32_t integer. If no such value has been set, the call to
<b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note that this
limit will only be used during matching if it is less than the limit set or
defaulted by the caller of the match function.
<pre>
PCRE2_INFO_JCHANGED
</pre>
Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise
0. The third argument should point to a <b>uint32_t</b> variable. (?J) and
(?-J) set and unset the local PCRE2_DUPNAMES option, respectively.
<pre>
PCRE2_INFO_JITSIZE
</pre>
If the compiled pattern was successfully processed by
<b>pcre2_jit_compile()</b>, return the size of the JIT compiled code, otherwise
return zero. The third argument should point to a <b>size_t</b> variable.
<pre>
PCRE2_INFO_LASTCODETYPE
</pre>
Return 1 if there is a rightmost literal code unit that must exist in any
matched string, other than at its start. The third argument should point to a
<b>uint32_t</b> variable. If there is no such value, 0 is returned. When 1 is
returned, the code unit value itself can be retrieved using
PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is
recorded only if it follows something of variable length. For example, for the
pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned from
PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
<pre>
PCRE2_INFO_LASTCODEUNIT
</pre>
Return the value of the rightmost literal code unit that must exist in any
matched string, other than at its start, for a pattern where
PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argument
should point to a <b>uint32_t</b> variable.
<pre>
PCRE2_INFO_MATCHEMPTY
</pre>
Return 1 if the pattern might match an empty string, otherwise 0. The third
argument should point to a <b>uint32_t</b> variable. When a pattern contains
recursive subroutine calls it is not always possible to determine whether or
not it can match an empty string. PCRE2 takes a cautious approach and returns 1
in such cases.
<pre>
PCRE2_INFO_MATCHLIMIT
</pre>
If the pattern set a match limit by including an item of the form
(*LIMIT_MATCH=nnnn) at the start, the value is returned. The third argument
should point to a uint32_t integer. If no such value has been set, the call to
<b>pcre2_pattern_info()</b> returns the error PCRE2_ERROR_UNSET. Note that this
limit will only be used during matching if it is less than the limit set or
defaulted by the caller of the match function.
<pre>
PCRE2_INFO_MAXLOOKBEHIND
</pre>
A lookbehind assertion moves back a certain number of characters (not code
units) when it starts to process each of its branches. This request returns the
largest of these backward moves. The third argument should point to a uint32_t
integer. The simple assertions \b and \B require a one-character lookbehind
and cause PCRE2_INFO_MAXLOOKBEHIND to return 1 in the absence of anything
longer. \A also registers a one-character lookbehind, though it does not
actually inspect the previous character.
</P>
<P>
Note that this information is useful for multi-segment matching only
if the pattern contains no nested lookbehinds. For example, the pattern
(?&#60;=a(?&#60;=ba)c) returns a maximum lookbehind of 2, but when it is processed, the
first lookbehind moves back by two characters, matches one character, then the
nested lookbehind also moves back by two characters. This puts the matching
point three characters earlier than it was at the start.
PCRE2_INFO_MAXLOOKBEHIND is really only useful as a debugging tool. See the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation for a discussion of multi-segment matching.
<pre>
PCRE2_INFO_MINLENGTH
</pre>
If a minimum length for matching subject strings was computed, its value is
returned. Otherwise the returned value is 0. This value is not computed when
PCRE2_NO_START_OPTIMIZE is set. The value is a number of characters, which in
UTF mode may be different from the number of code units. The third argument
should point to a <b>uint32_t</b> variable. The value is a lower bound to the
length of any matching string. There may not be any strings of that length that
do actually match, but every string that does match is at least that long.
<pre>
PCRE2_INFO_NAMECOUNT
PCRE2_INFO_NAMEENTRYSIZE
PCRE2_INFO_NAMETABLE
</pre>
PCRE2 supports the use of named as well as numbered capturing parentheses. The
names are just an additional way of identifying the parentheses, which still
acquire numbers. Several convenience functions such as
<b>pcre2_substring_get_byname()</b> are provided for extracting captured
substrings by name. It is also possible to extract the data directly, by first
converting the name to a number in order to access the correct pointers in the
output vector (described with <b>pcre2_match()</b> below). To do the conversion,
you need to use the name-to-number map, which is described by these three
values.
</P>
<P>
The map consists of a number of fixed-size entries. PCRE2_INFO_NAMECOUNT gives
the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives the size of each
entry in code units; both of these return a <b>uint32_t</b> value. The entry
size depends on the length of the longest name.
</P>
<P>
PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. This is
a PCRE2_SPTR pointer to a block of code units. In the 8-bit library, the first
two bytes of each entry are the number of the capturing parenthesis, most
significant byte first. In the 16-bit library, the pointer points to 16-bit
code units, the first of which contains the parenthesis number. In the 32-bit
library, the pointer points to 32-bit code units, the first of which contains
the parenthesis number. The rest of the entry is the corresponding name, zero
terminated.
</P>
<P>
The names are in alphabetical order. If (?| is used to create multiple capture
groups with the same number, as described in the
<a href="pcre2pattern.html#dupgroupnumber">section on duplicate group numbers</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page, the groups may be given the same name, but there is only one entry in the
table. Different names for groups of the same number are not permitted.
</P>
<P>
Duplicate names for capture groups with different numbers are permitted, but
only if PCRE2_DUPNAMES is set. They appear in the table in the order in which
they were found in the pattern. In the absence of (?| this is the order of
increasing number; when (?| is used this is not necessarily the case because
later capture groups may have lower numbers.
</P>
<P>
As a simple example of the name/number table, consider the following pattern
after compilation by the 8-bit library (assume PCRE2_EXTENDED is set, so white
space - including newlines - is ignored):
<pre>
(?&#60;date&#62; (?&#60;year&#62;(\d\d)?\d\d) - (?&#60;month&#62;\d\d) - (?&#60;day&#62;\d\d) )
</pre>
There are four named capture groups, so the table has four entries, and each
entry in the table is eight bytes long. The table is as follows, with
non-printing bytes shown in hexadecimal, and undefined bytes shown as ??:
<pre>
00 01 d a t e 00 ??
00 05 d a y 00 ?? ??
00 04 m o n t h 00
00 02 y e a r 00 ??
</pre>
When writing code to extract data from named capture groups using the
name-to-number map, remember that the length of the entries is likely to be
different for each compiled pattern.
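</P>
<P>
A minimal sketch of a lookup in the 8-bit table layout described above is shown
here. The function name name_to_number() and the array example_table are
invented for this illustration. In a real program the table pointer, the entry
count, and the entry size would come from PCRE2_INFO_NAMETABLE,
PCRE2_INFO_NAMECOUNT, and PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Searches an 8-bit library name table for name. Each entry is entry_size
   bytes long: a two-byte group number, most significant byte first, followed
   by the zero-terminated name. Returns the group number of the first entry
   that matches, or 0 if the name is not in the table (capture group numbers
   start at 1). */
uint32_t name_to_number(const uint8_t *table, uint32_t count,
                        uint32_t entry_size, const char *name)
{
  assert(entry_size >= 3);
  for (uint32_t i = 0; i < count; i++)
  {
    const uint8_t *entry = table + (size_t)i * entry_size;
    if (strcmp((const char *)(entry + 2), name) == 0)
      return ((uint32_t)entry[0] << 8) | entry[1];
  }
  return 0;
}

/* The table from the example above. The bytes shown there as ?? are
   arbitrarily set to zero here. */
const uint8_t example_table[4 * 8] = {
  0, 1, 'd', 'a', 't', 'e', 0,   0,
  0, 5, 'd', 'a', 'y', 0,   0,   0,
  0, 4, 'm', 'o', 'n', 't', 'h', 0,
  0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

With this table, looking up "month" with a count of 4 and an entry size of 8
yields 4, and looking up a name that is not present yields 0. When
PCRE2_DUPNAMES is in use, the sketch reports only the first entry for a
duplicated name.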
<pre>
PCRE2_INFO_NEWLINE
</pre>
The output is one of the following <b>uint32_t</b> values:
<pre>
PCRE2_NEWLINE_CR Carriage return (CR)
PCRE2_NEWLINE_LF Linefeed (LF)
PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
PCRE2_NEWLINE_ANY Any Unicode line ending
PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
PCRE2_NEWLINE_NUL The NUL character (binary zero)
</pre>
This identifies the character sequence that will be recognized as meaning
"newline" while matching.
<pre>
PCRE2_INFO_SIZE
</pre>
Return the size of the compiled pattern in bytes (for all three libraries). The
third argument should point to a <b>size_t</b> variable. This value includes the
size of the general data block that precedes the code units of the compiled
pattern itself. The value that is used when <b>pcre2_compile()</b> is getting
memory in which to place the compiled pattern may be slightly larger than the
value returned by this option, because there are cases where the code that
calculates the size has to over-estimate. Processing a pattern with the JIT
compiler does not alter the value returned by this option.
<a name="infoaboutcallouts"></a></P>
<br><a name="SEC24" href="#TOC1">INFORMATION ABOUT A PATTERN'S CALLOUTS</a><br>
<P>
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
<b> int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
<b> void *<i>user_data</i>);</b>
<br>
<br>
A script language that supports the use of string arguments in callouts might
like to scan all the callouts in a pattern before running the match. This can
be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
pointer to a compiled pattern, the second points to a callback function, and
the third is arbitrary user data. The callback function is called for every
callout in the pattern in the order in which they appear. Its first argument is
a pointer to a callout enumeration block, and its second argument is the
<i>user_data</i> value that was passed to <b>pcre2_callout_enumerate()</b>. The
contents of the callout enumeration block are described in the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation, which also gives further details about callouts.
</P>
<br><a name="SEC25" href="#TOC1">SERIALIZATION AND PRECOMPILING</a><br>
<P>
It is possible to save compiled patterns on disc or elsewhere, and reload them
later, subject to a number of restrictions. The host on which the patterns are
reloaded must be running the same version of PCRE2, with the same code unit
width, and must also have the same endianness, pointer width, and PCRE2_SIZE
type. Before compiled patterns can be saved, they must be converted to a
"serialized" form, which in the case of PCRE2 is really just a bytecode dump.
The functions whose names begin with <b>pcre2_serialize_</b> are used for
converting to and from the serialized form. They are described in the
<a href="pcre2serialize.html"><b>pcre2serialize</b></a>
documentation. Note that PCRE2 serialization does not convert compiled patterns
to an abstract format like Java or .NET serialization.
<a name="matchdatablock"></a></P>
<br><a name="SEC26" href="#TOC1">THE MATCH DATA BLOCK</a><br>
<P>
<b>pcre2_match_data *pcre2_match_data_create(uint32_t <i>ovecsize</i>,</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_match_data *pcre2_match_data_create_from_pattern(</b>
<b> const pcre2_code *<i>code</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_match_data_free(pcre2_match_data *<i>match_data</i>);</b>
</P>
<P>
Information about a successful or unsuccessful match is placed in a match
data block, which is an opaque structure that is accessed by function calls. In
particular, the match data block contains a vector of offsets into the subject
string that define the matched parts of the subject. This is known as the
<i>ovector</i>.
</P>
<P>
Before calling <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b>, or
<b>pcre2_jit_match()</b> you must create a match data block by calling one of
the creation functions above. For <b>pcre2_match_data_create()</b>, the first
argument is the number of pairs of offsets in the <i>ovector</i>.
</P>
<P>
When using <b>pcre2_match()</b>, one pair of offsets is required to identify the
string that matched the whole pattern, with an additional pair for each
captured substring. For example, a value of 4 creates enough space to record
the matched portion of the subject plus three captured substrings.
</P>
<P>
When using <b>pcre2_dfa_match()</b> there may be multiple matched substrings of
different lengths at the same point in the subject. The ovector should be made
large enough to hold as many as are expected.
</P>
<P>
A minimum of one pair is imposed by <b>pcre2_match_data_create()</b>, so
it is always possible to return the overall matched string in the case of
<b>pcre2_match()</b> or the longest match in the case of
<b>pcre2_dfa_match()</b>. The maximum number of pairs is 65535; if the first
argument of <b>pcre2_match_data_create()</b> is greater than this, 65535 is
used.
</P>
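<P>
The sizing rules above can be modelled in a few lines of C. This is only a
sketch: <b>pairs_for_groups()</b> and <b>effective_ovector_pairs()</b> are
hypothetical helper names, not PCRE2 functions, and they merely restate the
limits described in the text.
</P>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE2 function: the number of offset pairs that
   pcre2_match() needs in order to report the overall match plus every
   capture group. */
static uint32_t pairs_for_groups(uint32_t capture_groups)
{
    assert(capture_groups < UINT32_MAX);  /* guard against wrap-around */
    return capture_groups + 1;            /* pair 0 is the overall match */
}

/* Hypothetical helper: model the limits that the text describes for the
   first argument of pcre2_match_data_create() (at least 1, at most 65535). */
static uint32_t effective_ovector_pairs(uint32_t requested)
{
    if (requested < 1) return 1;
    if (requested > 65535) return 65535;
    return requested;
}
```

<P>
For a pattern with three capture groups, <b>pairs_for_groups(3)</b> gives 4,
which is the value used in the example in the text.
</P>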
<P>
The second argument of <b>pcre2_match_data_create()</b> is a pointer to a
general context, which can specify custom memory management for obtaining the
memory for the match data block. If you are not using custom memory management,
pass NULL, which causes <b>malloc()</b> to be used.
</P>
<P>
For <b>pcre2_match_data_create_from_pattern()</b>, the first argument is a
pointer to a compiled pattern. The ovector is created to be exactly the right
size to hold all the substrings a pattern might capture when matched using
<b>pcre2_match()</b>. You should not use this call when matching with
<b>pcre2_dfa_match()</b>. The second argument is again a pointer to a general
context, but in this case if NULL is passed, the memory is obtained using the
same allocator that was used for the compiled pattern (custom or default).
</P>
<P>
A match data block can be used many times, with the same or different compiled
patterns. You can extract information from a match data block after a match
operation has finished, using functions that are described in the sections on
<a href="#matchedstrings">matched strings</a>
and
<a href="#matchotherdata">other match data</a>
below.
</P>
<P>
When a call of <b>pcre2_match()</b> fails, valid data is available in the match
block only when the error is PCRE2_ERROR_NOMATCH, PCRE2_ERROR_PARTIAL, or one
of the error codes for an invalid UTF string. Exactly what is available depends
on the error, and is detailed below.
</P>
<P>
When one of the matching functions is called, pointers to the compiled pattern
and the subject string are set in the match data block so that they can be
referenced by the extraction functions after a successful match. After running
a match, you must not free a compiled pattern or a subject string until after
all operations on the match data block (for that match) have taken place,
unless, in the case of the subject string, you have used the
PCRE2_COPY_MATCHED_SUBJECT option, which is described in the section entitled
"Option bits for <b>pcre2_match()</b>"
<a href="#matchoptions">below.</a>
</P>
<P>
When a match data block itself is no longer needed, it should be freed by
calling <b>pcre2_match_data_free()</b>. If this function is called with a NULL
argument, it returns immediately, without doing anything.
</P>
<br><a name="SEC27" href="#TOC1">MEMORY USE FOR MATCH DATA BLOCKS</a><br>
<P>
<b>PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>PCRE2_SIZE pcre2_get_match_data_heapframes_size(</b>
<b> pcre2_match_data *<i>match_data</i>);</b>
</P>
<P>
The size of a match data block depends on the size of the ovector that it
contains. The function <b>pcre2_get_match_data_size()</b> returns the size, in
bytes, of the block that is its argument.
</P>
<P>
When <b>pcre2_match()</b> runs interpretively (that is, without using JIT), it
makes use of a vector of data frames for remembering backtracking positions.
The size of each individual frame depends on the number of capturing
parentheses in the pattern and can be obtained by calling
<b>pcre2_pattern_info()</b> with the PCRE2_INFO_FRAMESIZE option (see the
section entitled "Information about a compiled pattern"
<a href="#infoaboutpattern">above).</a>
</P>
<P>
Heap memory is used for the frames vector; if the initial memory block turns
out to be too small during matching, it is automatically expanded. When
<b>pcre2_match()</b> returns, the memory is not freed, but remains attached to
the match data block, for use by any subsequent matches that use the same
block. It is automatically freed when the match data block itself is freed.
</P>
<P>
You can find the current size of the frames vector that a match data block owns
by calling <b>pcre2_get_match_data_heapframes_size()</b>. For a newly created
match data block the size will be zero. Some types of match may require a lot
of frames and thus a large vector; applications that run in environments where
memory is constrained can check this and free the match data block if the heap
frames vector has become too big.
</P>
<br><a name="SEC28" href="#TOC1">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a><br>
<P>
<b>int pcre2_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>);</b>
</P>
<P>
The function <b>pcre2_match()</b> is called to match a subject string against a
compiled pattern, which is passed in the <i>code</i> argument. You can call
<b>pcre2_match()</b> with the same <i>code</i> argument as many times as you
like, in order to find multiple matches in the subject string or to match
different subject strings with the same pattern.
</P>
<P>
This function is the main matching facility of the library, and it operates in
a Perl-like manner. For specialist use there is also an alternative matching
function, which is described
<a href="#dfamatch">below</a>
in the section about the <b>pcre2_dfa_match()</b> function.
</P>
<P>
Here is an example of a simple call to <b>pcre2_match()</b>:
<pre>
pcre2_match_data *md = pcre2_match_data_create(4, NULL);
int rc = pcre2_match(
re, /* result of pcre2_compile() */
"some string", /* the subject string */
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
md, /* the match data block */
NULL); /* a match context; NULL means use defaults */
</pre>
If the subject string is zero-terminated, the length can be given as
PCRE2_ZERO_TERMINATED. A match context must be provided if certain less common
matching parameters are to be changed. For details, see the section on
<a href="#matchcontext">the match context</a>
above.
</P>
<br><b>
The string to be matched by <b>pcre2_match()</b>
</b><br>
<P>
The subject string is passed to <b>pcre2_match()</b> as a pointer in
<i>subject</i>, a length in <i>length</i>, and a starting offset in
<i>startoffset</i>. The length and offset are in code units, not characters.
That is, they are in bytes for the 8-bit library, 16-bit code units for the
16-bit library, and 32-bit code units for the 32-bit library, whether or not
UTF processing is enabled. As a special case, if <i>subject</i> is NULL and
<i>length</i> is zero, the subject is assumed to be an empty string. If
<i>length</i> is non-zero, an error occurs if <i>subject</i> is NULL.
</P>
<P>
If <i>startoffset</i> is greater than the length of the subject,
<b>pcre2_match()</b> returns PCRE2_ERROR_BADOFFSET. When the starting offset is
zero, the search for a match starts at the beginning of the subject, and this
is by far the most common case. In UTF-8 or UTF-16 mode, the starting offset
must point to the start of a character, or to the end of the subject (in UTF-32
mode, one code unit equals one character, so all offsets are valid). Like the
pattern string, the subject may contain binary zeros.
</P>
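<P>
An application that computes its own starting offsets in UTF-8 mode may want
to check them before calling <b>pcre2_match()</b>. The following sketch assumes
the 8-bit library and a subject that is otherwise valid UTF-8;
<b>utf8_offset_is_char_start()</b> is a hypothetical helper name, not part of
the PCRE2 API.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if `offset` is acceptable as a starting
   offset for a UTF-8 subject of `length` code units (bytes), else 0.
   An offset is acceptable when it is the end of the subject or when the
   byte it points to is not a continuation byte (binary 10xxxxxx). */
static int utf8_offset_is_char_start(const unsigned char *subject,
                                     size_t length, size_t offset)
{
    assert(subject != NULL || length == 0);
    if (offset > length) return 0;    /* the PCRE2_ERROR_BADOFFSET case */
    if (offset == length) return 1;   /* end of subject is allowed */
    return (subject[offset] & 0xC0) != 0x80;
}
```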
<P>
A non-zero starting offset is useful when searching for another match in the
same subject by calling <b>pcre2_match()</b> again after a previous success.
Setting <i>startoffset</i> differs from passing over a shortened string and
setting PCRE2_NOTBOL in the case of a pattern that begins with any kind of
lookbehind. For example, consider the pattern
<pre>
\Biss\B
</pre>
which finds occurrences of "iss" in the middle of words. (\B matches only if
the current position in the subject is not a word boundary.) When applied to
the string "Mississippi" the first call to <b>pcre2_match()</b> finds the first
occurrence. If <b>pcre2_match()</b> is called again with just the remainder of
the subject, namely "issippi", it does not match, because \B is always false
at the start of the subject, which is deemed to be a word boundary. However, if
<b>pcre2_match()</b> is passed the entire string again, but with
<i>startoffset</i> set to 4, it finds the second occurrence of "iss" because it
is able to look behind the starting point to discover that it is preceded by a
letter.
</P>
<P>
Finding all the matches in a subject is tricky when the pattern can match an
empty string. It is possible to emulate Perl's /g behaviour by first trying the
match again at the same offset, with the PCRE2_NOTEMPTY_ATSTART and
PCRE2_ANCHORED options, and then if that fails, advancing the starting offset
and trying an ordinary match again. There is some code that demonstrates how to
do this in the
<a href="pcre2demo.html"><b>pcre2demo</b></a>
sample program. In the most general case, you have to check to see if the
newline convention recognizes CRLF as a newline, and if so, and the current
character is CR followed by LF, advance the starting offset by two characters
instead of one.
</P>
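<P>
The "advance the starting offset" step can be sketched as follows, for the
8-bit library. The function name and its two flag parameters are assumptions
made for illustration; the caller is expected to have worked out whether the
newline convention recognizes CRLF and whether the pattern was compiled in UTF
mode.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: after an empty-string match at `offset` could not be
   followed by a non-empty match at the same place, return the offset at
   which the next ordinary match attempt should start. */
static size_t advance_after_empty_match(const unsigned char *subject,
                                        size_t length, size_t offset,
                                        int crlf_is_newline, int utf8)
{
    assert(subject != NULL && offset < length);

    /* Step over CR LF as a unit when CRLF is a recognized newline. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    offset++;                          /* advance by one code unit */

    /* In UTF-8 mode, also skip the continuation bytes of the character. */
    if (utf8)
        while (offset < length && (subject[offset] & 0xC0) == 0x80)
            offset++;

    return offset;
}
```

<P>
If the current offset is already at the end of the subject, there is nothing
left to advance over and the search loop should stop instead of calling this
helper.
</P>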
<P>
If a non-zero starting offset is passed when the pattern is anchored, a single
attempt to match at the given offset is made. This can only succeed if the
pattern does not require the match to be at the start of the subject. In other
words, the anchoring must be the result of setting the PCRE2_ANCHORED option or
the use of .* with PCRE2_DOTALL, not by starting the pattern with ^ or \A.
<a name="matchoptions"></a></P>
<br><b>
Option bits for <b>pcre2_match()</b>
</b><br>
<P>
The unused bits of the <i>options</i> argument for <b>pcre2_match()</b> must be
zero. The only bits that may be set are PCRE2_ANCHORED,
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_DISABLE_RECURSELOOP_CHECK, PCRE2_ENDANCHORED,
PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT.
Their action is described below.
</P>
<P>
Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not supported by
the just-in-time (JIT) compiler. If it is set, JIT matching is disabled and the
interpretive code in <b>pcre2_match()</b> is run.
PCRE2_DISABLE_RECURSELOOP_CHECK is ignored by JIT, but apart from PCRE2_NO_JIT
(obviously), the remaining options are supported for JIT matching.
<pre>
PCRE2_ANCHORED
</pre>
The PCRE2_ANCHORED option limits <b>pcre2_match()</b> to matching at the first
matching position. If a pattern was compiled with PCRE2_ANCHORED, or turned out
to be anchored by virtue of its contents, it cannot be made unanchored at
matching time. Note that setting the option at match time disables JIT
matching.
<pre>
PCRE2_COPY_MATCHED_SUBJECT
</pre>
By default, a pointer to the subject is remembered in the match data block so
that, after a successful match, it can be referenced by the substring
extraction functions. This means that the subject's memory must not be freed
until all such operations are complete. For some applications where the
lifetime of the subject string is not guaranteed, it may be necessary to make a
copy of the subject string, but it is wasteful to do this unless the match is
successful. After a successful match, if PCRE2_COPY_MATCHED_SUBJECT is set, the
subject is copied and the new pointer is remembered in the match data block
instead of the original subject pointer. The memory allocator that was used for
the match block itself is used. The copy is automatically freed when
<b>pcre2_match_data_free()</b> is called to free the match data block. It is also
automatically freed if the match data block is re-used for another match
operation.
<pre>
PCRE2_DISABLE_RECURSELOOP_CHECK
</pre>
This option is relevant only to <b>pcre2_match()</b> for interpretive matching.
It is ignored when JIT is used, and is forbidden for <b>pcre2_dfa_match()</b>.
</P>
<P>
The use of recursion in patterns can lead to infinite loops. In the
interpretive matcher these would be eventually caught by the match or heap
limits, but this could take a long time and/or use a lot of memory if the
limits are large. There is therefore a check at the start of each recursion.
If the same group is still active from a previous call, and the current subject
pointer is the same as it was at the start of that group, and the furthest
inspected character of the subject has not changed, an error is generated.
</P>
<P>
There are rare cases of matches that would complete, but nevertheless trigger
this error. This option disables the check. It is provided mainly for testing
when comparing JIT and interpretive behaviour.
<pre>
PCRE2_ENDANCHORED
</pre>
If the PCRE2_ENDANCHORED option is set, any string that <b>pcre2_match()</b>
matches must be right at the end of the subject string. Note that setting the
option at match time disables JIT matching.
<pre>
PCRE2_NOTBOL
</pre>
This option specifies that the first character of the subject string is not the
beginning of a line, so the circumflex metacharacter should not match before
it. Setting this without having set PCRE2_MULTILINE at compile time causes
circumflex never to match. This option affects only the behaviour of the
circumflex metacharacter. It does not affect \A.
<pre>
PCRE2_NOTEOL
</pre>
This option specifies that the end of the subject string is not the end of a
line, so the dollar metacharacter should not match it nor (except in multiline
mode) a newline immediately before it. Setting this without having set
PCRE2_MULTILINE at compile time causes dollar never to match. This option
affects only the behaviour of the dollar metacharacter. It does not affect \Z
or \z.
<pre>
PCRE2_NOTEMPTY
</pre>
An empty string is not considered to be a valid match if this option is set. If
there are alternatives in the pattern, they are tried. If all the alternatives
match the empty string, the entire match fails. For example, if the pattern
<pre>
a?b?
</pre>
is applied to a string not beginning with "a" or "b", it matches an empty
string at the start of the subject. With PCRE2_NOTEMPTY set, this match is not
valid, so <b>pcre2_match()</b> searches further into the string for occurrences
of "a" or "b".
<pre>
PCRE2_NOTEMPTY_ATSTART
</pre>
This is like PCRE2_NOTEMPTY, except that it locks out an empty string match
only at the first matching position, that is, at the start of the subject plus
the starting offset. An empty string match later in the subject is permitted.
If the pattern is anchored, such a match can occur only if the pattern contains
\K.
<pre>
PCRE2_NO_JIT
</pre>
By default, if a pattern has been successfully processed by
<b>pcre2_jit_compile()</b>, JIT is automatically used when <b>pcre2_match()</b>
is called with options that JIT supports. Setting PCRE2_NO_JIT disables the use
of JIT; it forces matching to be done by the interpreter.
<pre>
PCRE2_NO_UTF_CHECK
</pre>
When PCRE2_UTF is set at compile time, the validity of the subject as a UTF
string is checked unless PCRE2_NO_UTF_CHECK is passed to <b>pcre2_match()</b> or
PCRE2_MATCH_INVALID_UTF was passed to <b>pcre2_compile()</b>. The latter special
case is discussed in detail in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
documentation.
</P>
<P>
In the default case, if a non-zero starting offset is given, the check is
applied only to that part of the subject that could be inspected during
matching, and there is a check that the starting offset points to the first
code unit of a character or to the end of the subject. If there are no
lookbehind assertions in the pattern, the check starts at the starting offset.
Otherwise, it starts at the length of the longest lookbehind before the
starting offset, or at the start of the subject if there are not that many
characters before the starting offset. Note that the sequences \b and \B are
one-character lookbehinds.
</P>
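<P>
As a rough model of where the check begins in UTF-8 mode, the function below
steps back from the starting offset by a given number of characters, stopping
at the start of the subject. It is a sketch, not PCRE2's own code:
<b>utf8_check_start()</b> and its <i>max_lookbehind</i> parameter are names
chosen for this illustration.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the offset from which a UTF-8 validity check
   would begin, given the starting offset and the length (in characters) of
   the longest lookbehind in the pattern. */
static size_t utf8_check_start(const unsigned char *subject,
                               size_t startoffset, size_t max_lookbehind)
{
    size_t pos = startoffset;
    assert(subject != NULL);

    while (max_lookbehind > 0 && pos > 0) {
        pos--;                                     /* back one code unit */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;                                 /* skip continuation bytes */
        max_lookbehind--;
    }
    return pos;
}
```

<P>
With no lookbehind the result is the starting offset itself; for a pattern that
uses \b or \B, a <i>max_lookbehind</i> of 1 moves the start back by one
character.
</P>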
<P>
The check is carried out before any other processing takes place, and a
negative error code is returned if the check fails. There are several UTF error
codes for each code unit width, corresponding to different problems with the
code unit sequence. There are discussions about the validity of
<a href="pcre2unicode.html#utf8strings">UTF-8 strings,</a>
<a href="pcre2unicode.html#utf16strings">UTF-16 strings,</a>
and
<a href="pcre2unicode.html#utf32strings">UTF-32 strings</a>
in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
documentation.
</P>
<P>
If you know that your subject is valid, and you want to skip this check for
performance reasons, you can set the PCRE2_NO_UTF_CHECK option when calling
<b>pcre2_match()</b>. You might want to do this for the second and subsequent
calls to <b>pcre2_match()</b> if you are making repeated calls to find multiple
matches in the same subject string.
</P>
<P>
<b>Warning:</b> Unless PCRE2_MATCH_INVALID_UTF was set at compile time, when
PCRE2_NO_UTF_CHECK is set at match time the effect of passing an invalid
string as a subject, or an invalid value of <i>startoffset</i>, is undefined.
Your program may crash or loop indefinitely or give wrong results.
<pre>
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
</pre>
These options turn on the partial matching feature. A partial match occurs if
the end of the subject string is reached successfully, but there are not enough
subject characters to complete the match. In addition, either at least one
character must have been inspected or the pattern must contain a lookbehind, or
the pattern must be one that could match an empty string.
</P>
<P>
If this situation arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD)
is set, matching continues by testing any remaining alternatives. Only if no
complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that the
caller is prepared to handle a partial match, but only if no complete match can
be found.
</P>
<P>
If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this case, if
a partial match is found, <b>pcre2_match()</b> immediately returns
PCRE2_ERROR_PARTIAL, without considering any other alternatives. In other
words, when PCRE2_PARTIAL_HARD is set, a partial match is considered to be more
important than an alternative complete match.
</P>
<P>
There is a more detailed discussion of partial and multi-segment matching, with
examples, in the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
</P>
<br><a name="SEC29" href="#TOC1">NEWLINE HANDLING WHEN MATCHING</a><br>
<P>
When PCRE2 is built, a default newline convention is set; this is usually the
standard convention for the operating system. The default can be overridden in
a
<a href="#compilecontext">compile context</a>
by calling <b>pcre2_set_newline()</b>. It can also be overridden by starting a
pattern string with, for example, (*CRLF), as described in the
<a href="pcre2pattern.html#newlines">section on newline conventions</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page. During matching, the newline choice affects the behaviour of the dot,
circumflex, and dollar metacharacters. It may also alter the way the match
starting position is advanced after a match failure for an unanchored pattern.
</P>
<P>
When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is set as
the newline convention, and a match attempt for an unanchored pattern fails
when the current starting position is at a CRLF sequence, and the pattern
contains no explicit matches for CR or LF characters, the match position is
advanced by two characters instead of one, in other words, to after the CRLF.
</P>
<P>
The above rule is a compromise that makes the most common cases work as
expected. For example, if the pattern is .+A (and the PCRE2_DOTALL option is
not set), it does not match the string "\r\nA" because, after failing at the
start, it skips both the CR and the LF before retrying. However, the pattern
[\r\n]A does match that string, because it contains an explicit CR or LF
reference, and so advances only by one character after the first failure.
</P>
<P>
An explicit match for CR or LF is either a literal appearance of one of those
characters in the pattern, or one of the \r or \n or equivalent octal or
hexadecimal escape sequences. Implicit matches such as [^X] do not count, nor
does \s, even though it includes CR and LF in the characters that it matches.
</P>
<P>
Notwithstanding the above, anomalous effects may still occur when CRLF is a
valid newline sequence and explicit \r or \n escapes appear in the pattern.
<a name="matchedstrings"></a></P>
<br><a name="SEC30" href="#TOC1">HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS</a><br>
<P>
<b>uint32_t pcre2_get_ovector_count(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *<i>match_data</i>);</b>
</P>
<P>
In general, a pattern matches a certain portion of the subject, and in
addition, further substrings from the subject may be picked out by
parenthesized parts of the pattern. Following the usage in Jeffrey Friedl's
book, this is called "capturing" in what follows, and the phrase "capture
group" (Perl terminology) is used for a fragment of a pattern that picks out a
substring. PCRE2 supports several other kinds of parenthesized group that do
not cause substrings to be captured. The <b>pcre2_pattern_info()</b> function
can be used to find out how many capture groups there are in a compiled
pattern.
</P>
<P>
You can use auxiliary functions for accessing captured substrings
<a href="#extractbynumber">by number</a>
or
<a href="#extractbyname">by name,</a>
as described in sections below.
</P>
<P>
Alternatively, you can make direct use of the vector of PCRE2_SIZE values,
called the <b>ovector</b>, which contains the offsets of captured strings. It is
part of the
<a href="#matchdatablock">match data block.</a>
The function <b>pcre2_get_ovector_pointer()</b> returns the address of the
ovector, and <b>pcre2_get_ovector_count()</b> returns the number of pairs of
values it contains.
</P>
<P>
Within the ovector, the first in each pair of values is set to the offset of
the first code unit of a substring, and the second is set to the offset of the
first code unit after the end of a substring. These values are always code unit
offsets, not character offsets. That is, they are byte offsets in the 8-bit
library, 16-bit offsets in the 16-bit library, and 32-bit offsets in the 32-bit
library.
</P>
<P>
After a partial match (error return PCRE2_ERROR_PARTIAL), only the first pair
of offsets (that is, <i>ovector[0]</i> and <i>ovector[1]</i>) are set. They
identify the part of the subject that was partially matched. See the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation for details of partial matching.
</P>
<P>
After a fully successful match, the first pair of offsets identifies the
portion of the subject string that was matched by the entire pattern. The next
pair is used for the first captured substring, and so on. The value returned by
<b>pcre2_match()</b> is one more than the highest numbered pair that has been
set. For example, if two substrings have been captured, the returned value is
3. If there are no captured substrings, the return value from a successful
match is 1, indicating that just the first pair of offsets has been set.
</P>
<P>
If a pattern uses the \K escape sequence within a positive assertion, the
reported start of a successful match can be greater than the end of the match.
For example, if the pattern (?=ab\K) is matched against "ab", the start and
end offset values for the match are 2 and 0.
</P>
<P>
If a capture group is matched repeatedly within a single match operation, it is
the last portion of the subject that it matched that is returned.
</P>
<P>
If the ovector is too small to hold all the captured substring offsets, as much
as possible is filled in, and the function returns a value of zero. If captured
substrings are not of interest, <b>pcre2_match()</b> may be called with a match
data block whose ovector is of minimum length (that is, one pair).
</P>
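<P>
The return value conventions described above can be folded into one small
function that tells the caller how many offset pairs are worth inspecting. This
is a sketch; <b>pairs_available()</b> is a hypothetical name, and partial
matches (which set only the first pair) are deliberately left out.
</P>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: given the value returned by pcre2_match() and the
   number of pairs in the ovector, return how many leading pairs may hold
   data from a complete match. */
static uint32_t pairs_available(int rc, uint32_t ovector_pairs)
{
    assert(ovector_pairs >= 1);
    if (rc < 0) return 0;                /* no match or an error */
    if (rc == 0) return ovector_pairs;   /* ovector too small: all pairs used */
    return (uint32_t)rc;                 /* highest numbered pair set, plus 1 */
}
```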
<P>
It is possible for capture group number <i>n+1</i> to match some part of the
subject when group <i>n</i> has not been used at all. For example, if the string
"abc" is matched against the pattern (a|(z))(bc) the return from the function
is 4, and groups 1 and 3 are matched, but 2 is not. When this happens, both
values in the offset pairs corresponding to unused groups are set to
PCRE2_UNSET.
</P>
<P>
Offset values that correspond to unused groups at the end of the expression are
also set to PCRE2_UNSET. For example, if the string "abc" is matched against
the pattern (abc)(x(yz)?)? groups 2 and 3 are not matched. The return from the
function is 2, because the highest used capture group number is 1. The offsets
for the second and third capture groups (assuming the vector is large enough,
of course) are set to PCRE2_UNSET.
</P>
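<P>
Code that reads the ovector directly has to allow for unset groups. The sketch
below uses a local macro, EXAMPLE_UNSET, as a stand-in for PCRE2_UNSET so that
it can be compiled without the PCRE2 header; <b>group_span()</b> is a
hypothetical helper name. A pair whose start is greater than its end is treated
here as an empty substring, which is a choice made for this example.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_UNSET (~(size_t)0)   /* stand-in for PCRE2_UNSET */

/* Hypothetical helper: fetch the start and length of group `n` from an
   ovector holding `pairs` pairs. Returns 1 if the group is set, 0 if it is
   unset or `n` is out of range. */
static int group_span(const size_t *ovector, uint32_t pairs, uint32_t n,
                      size_t *start, size_t *length)
{
    size_t left, right;
    assert(ovector != NULL && start != NULL && length != NULL);

    if (n >= pairs) return 0;
    left = ovector[(size_t)2 * n];
    right = ovector[(size_t)2 * n + 1];
    if (left == EXAMPLE_UNSET) return 0;      /* group did not participate */

    *start = left;
    *length = (left > right) ? 0 : right - left;
    return 1;
}
```

<P>
For the (a|(z))(bc) example above, the ovector after matching "abc" holds the
pairs (0,3), (0,1), (unset,unset), and (1,3), so <b>group_span()</b> reports
groups 1 and 3 as set and group 2 as unset.
</P>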
<P>
Elements in the ovector that do not correspond to capturing parentheses in the
pattern are never changed. That is, if a pattern contains <i>n</i> capturing
parentheses, no more than <i>ovector[0]</i> to <i>ovector[2n+1]</i> are set by
<b>pcre2_match()</b>. The other elements retain whatever values they previously
had. After a failed match attempt, the contents of the ovector are unchanged.
<a name="matchotherdata"></a></P>
<br><a name="SEC31" href="#TOC1">OTHER INFORMATION ABOUT A MATCH</a><br>
<P>
<b>PCRE2_SPTR pcre2_get_mark(pcre2_match_data *<i>match_data</i>);</b>
<br>
<br>
<b>PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *<i>match_data</i>);</b>
</P>
<P>
As well as the offsets in the ovector, other information about a match is
retained in the match data block and can be retrieved by the above functions in
appropriate circumstances. If they are called at other times, the result is
undefined.
</P>
<P>
After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a failure
to match (PCRE2_ERROR_NOMATCH), a mark name may be available. The function
<b>pcre2_get_mark()</b> can be called to access this name, which can be
specified in the pattern by any of the backtracking control verbs, not just
(*MARK). The same function applies to all the verbs. It returns a pointer to
the zero-terminated name, which is within the compiled pattern. If no name is
available, NULL is returned. The length of the name (excluding the terminating
zero) is stored in the code unit that precedes the name. You should use this
length instead of relying on the terminating zero if the name might contain a
binary zero.
</P>
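<P>
Reading the stored length might look like this for the 8-bit library, where
each code unit is one byte. The function name is an assumption made for this
sketch, and the pointer passed in must be one that was returned by
<b>pcre2_get_mark()</b> (or, as in a test, one that points just after a length
byte).
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for the 8-bit library: return the length of a mark
   name, taken from the code unit that precedes the name rather than from
   the terminating zero. */
static size_t mark_name_length_8(const uint8_t *mark)
{
    assert(mark != NULL);
    return (size_t)mark[-1];
}
```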
<P>
After a successful match, the name that is returned is the last mark name
encountered on the matching path through the pattern. Instances of backtracking
verbs without names do not count. Thus, for example, if the matching path
contains (*MARK:A)(*PRUNE), the name "A" is returned. After a "no match" or a
partial match, the last encountered name is returned. For example, consider
this pattern:
<pre>
^(*MARK:A)((*MARK:B)a|b)c
</pre>
When it matches "bc", the returned name is A. The B mark is "seen" in the first
branch of the group, but it is not on the matching path. On the other hand,
when this pattern fails to match "bx", the returned name is B.
</P>
<P>
<b>Warning:</b> By default, certain start-of-match optimizations are used to
give a fast "no match" result in some situations. For example, if the anchoring
is removed from the pattern above, there is an initial check for the presence
of "c" in the subject before running the matching engine. This check fails for
"bx", causing a match failure without seeing any marks. You can disable the
start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option for
<b>pcre2_compile()</b> or by starting the pattern with (*NO_START_OPT).
</P>
<P>
After a successful match, a partial match, or one of the invalid UTF errors
(for example, PCRE2_ERROR_UTF8_ERR5), <b>pcre2_get_startchar()</b> can be
called. After a successful or partial match it returns the code unit offset of
the character at which the match started. For a non-partial match, this can be
different to the value of <i>ovector[0]</i> if the pattern contains the \K
escape sequence. After a partial match, however, this value is always the same
as <i>ovector[0]</i> because \K does not affect the result of a partial match.
</P>
<P>
After a UTF check failure, <b>pcre2_get_startchar()</b> can be used to obtain
the code unit offset of the invalid UTF character. Details are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
<a name="errorlist"></a></P>
<br><a name="SEC32" href="#TOC1">ERROR RETURNS FROM <b>pcre2_match()</b></a><br>
<P>
If <b>pcre2_match()</b> fails, it returns a negative number. This can be
converted to a text string by calling the <b>pcre2_get_error_message()</b>
function (see "Obtaining a textual error message"
<a href="#geterrormessage">below).</a>
Negative error codes are also returned by other functions, and are documented
with them. The codes are given names in the header file. If UTF checking is in
force and an invalid UTF subject string is detected, one of a number of
UTF-specific negative error codes is returned. Details are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page. The following are the other errors that may be returned by
<b>pcre2_match()</b>:
<pre>
PCRE2_ERROR_NOMATCH
</pre>
The subject string did not match the pattern.
<pre>
PCRE2_ERROR_PARTIAL
</pre>
The subject string did not match, but it did match partially. See the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation for details of partial matching.
<pre>
PCRE2_ERROR_BADMAGIC
</pre>
PCRE2 stores a 4-byte "magic number" at the start of the compiled code, to
catch the case when it is passed a junk pointer. This is the error that is
returned when the magic number is not present.
<pre>
PCRE2_ERROR_BADMODE
</pre>
This error is given when a compiled pattern is passed to a function in a
library of a different code unit width, for example, a pattern compiled by
the 8-bit library is passed to a 16-bit or 32-bit library function.
<pre>
PCRE2_ERROR_BADOFFSET
</pre>
The value of <i>startoffset</i> was greater than the length of the subject.
<pre>
PCRE2_ERROR_BADOPTION
</pre>
An unrecognized bit was set in the <i>options</i> argument.
<pre>
PCRE2_ERROR_BADUTFOFFSET
</pre>
The UTF code unit sequence that was passed as a subject was checked and found
to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the value of
<i>startoffset</i> did not point to the beginning of a UTF character or the end
of the subject.
<pre>
PCRE2_ERROR_CALLOUT
</pre>
This error is never generated by <b>pcre2_match()</b> itself. It is provided for
use by callout functions that want to cause <b>pcre2_match()</b> or
<b>pcre2_callout_enumerate()</b> to return a distinctive error code. See the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation for details.
<pre>
PCRE2_ERROR_DEPTHLIMIT
</pre>
The nested backtracking depth limit was reached.
<pre>
PCRE2_ERROR_HEAPLIMIT
</pre>
The heap limit was reached.
<pre>
PCRE2_ERROR_INTERNAL
</pre>
An unexpected internal error has occurred. This error could be caused by a bug
in PCRE2 or by overwriting of the compiled pattern.
<pre>
PCRE2_ERROR_JIT_STACKLIMIT
</pre>
This error is returned when a pattern that was successfully studied using JIT
is being matched, but the memory available for the just-in-time processing
stack is not large enough. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for more details.
<pre>
PCRE2_ERROR_MATCHLIMIT
</pre>
The backtracking match limit was reached.
<pre>
PCRE2_ERROR_NOMEMORY
</pre>
Heap memory is used to remember backtracking points. This error is given when
the memory allocation function (default or custom) fails. Note that a different
error, PCRE2_ERROR_HEAPLIMIT, is given if the amount of memory needed exceeds
the heap limit. PCRE2_ERROR_NOMEMORY is also returned if
PCRE2_COPY_MATCHED_SUBJECT is set and memory allocation fails.
<pre>
PCRE2_ERROR_NULL
</pre>
Either the <i>code</i>, <i>subject</i>, or <i>match_data</i> argument was passed
as NULL.
<pre>
PCRE2_ERROR_RECURSELOOP
</pre>
This error is returned when <b>pcre2_match()</b> detects a recursion loop within
the pattern. Specifically, it means that either the whole pattern or a
capture group has been called recursively for the second time at the same
position in the subject string. Some simple patterns that might do this are
detected and faulted at compile time, but more complicated cases, in particular
mutual recursions between two different groups, cannot be detected until
matching is attempted.
<a name="geterrormessage"></a></P>
<br><a name="SEC33" href="#TOC1">OBTAINING A TEXTUAL ERROR MESSAGE</a><br>
<P>
<b>int pcre2_get_error_message(int <i>errorcode</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
<b> PCRE2_SIZE <i>bufflen</i>);</b>
</P>
<P>
A text message for an error code from any PCRE2 function (compile, match, or
auxiliary) can be obtained by calling <b>pcre2_get_error_message()</b>. The code
is passed as the first argument, with the remaining two arguments specifying a
code unit buffer and its length in code units, into which the text message is
placed. The message is returned in code units of the appropriate width for the
library that is being used.
</P>
<P>
The returned message is terminated with a trailing zero, and the function
returns the number of code units used, excluding the trailing zero. If the
error number is unknown, the negative error code PCRE2_ERROR_BADDATA is
returned. If the buffer is too small, the message is truncated (but still with
a trailing zero), and the negative error code PCRE2_ERROR_NOMEMORY is returned.
None of the messages are very long; a buffer size of 120 code units is ample.
<a name="extractbynumber"></a></P>
<br><a name="SEC34" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br>
<P>
<b>int pcre2_substring_length_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> uint32_t <i>number</i>, PCRE2_SIZE *<i>length</i>);</b>
<br>
<br>
<b>int pcre2_substring_copy_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> uint32_t <i>number</i>, PCRE2_UCHAR *<i>buffer</i>,</b>
<b> PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>int pcre2_substring_get_bynumber(pcre2_match_data *<i>match_data</i>,</b>
<b> uint32_t <i>number</i>, PCRE2_UCHAR **<i>bufferptr</i>,</b>
<b> PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
</P>
<P>
Captured substrings can be accessed directly by using the ovector as described
<a href="#matchedstrings">above.</a>
For convenience, auxiliary functions are provided for extracting captured
substrings as new, separate, zero-terminated strings. A substring that contains
a binary zero is correctly extracted and has a further zero added on the end,
but the result is not, of course, a C string.
</P>
<P>
The functions in this section identify substrings by number. The number zero
refers to the entire matched substring, with higher numbers referring to
substrings captured by parenthesized groups. After a partial match, only
substring zero is available. An attempt to extract any other substring gives
the error PCRE2_ERROR_PARTIAL. The next section describes similar functions for
extracting captured substrings by name.
</P>
<P>
If a pattern uses the \K escape sequence within a positive assertion, the
reported start of a successful match can be greater than the end of the match.
For example, if the pattern (?=ab\K) is matched against "ab", the start and
end offset values for the match are 2 and 0. In this situation, calling these
functions with a zero substring number extracts a zero-length empty string.
</P>
<P>
You can find the length in code units of a captured substring without
extracting it by calling <b>pcre2_substring_length_bynumber()</b>. The first
argument is a pointer to the match data block, the second is the group number,
and the third is a pointer to a variable into which the length is placed. If
you just want to know whether or not the substring has been captured, you can
pass the third argument as NULL.
</P>
<P>
The <b>pcre2_substring_copy_bynumber()</b> function copies a captured substring
into a supplied buffer, whereas <b>pcre2_substring_get_bynumber()</b> copies it
into new memory, obtained using the same memory allocation function that was
used for the match data block. The first two arguments of these functions are a
pointer to the match data block and a capture group number.
</P>
<P>
The final arguments of <b>pcre2_substring_copy_bynumber()</b> are a pointer to
the buffer and a pointer to a variable that contains its length in code units.
This is updated to contain the actual number of code units used for the
extracted substring, excluding the terminating zero.
</P>
<P>
For <b>pcre2_substring_get_bynumber()</b> the third and fourth arguments point
to variables that are updated with a pointer to the new memory and the number
of code units that comprise the substring, again excluding the terminating
zero. When the substring is no longer needed, the memory should be freed by
calling <b>pcre2_substring_free()</b>.
</P>
<P>
The return value from all these functions is zero for success, or a negative
error code. If the pattern match failed, the match failure code is returned.
If a substring number greater than zero is used after a partial match,
PCRE2_ERROR_PARTIAL is returned. Other possible error codes are:
<pre>
PCRE2_ERROR_NOMEMORY
</pre>
The buffer was too small for <b>pcre2_substring_copy_bynumber()</b>, or the
attempt to get memory failed for <b>pcre2_substring_get_bynumber()</b>.
<pre>
PCRE2_ERROR_NOSUBSTRING
</pre>
There is no substring with that number in the pattern, that is, the number is
greater than the number of capturing parentheses.
<pre>
PCRE2_ERROR_UNAVAILABLE
</pre>
The substring number, though not greater than the number of captures in the
pattern, is greater than the number of slots in the ovector, so the substring
could not be captured.
<pre>
PCRE2_ERROR_UNSET
</pre>
The substring did not participate in the match. For example, if the pattern is
(abc)|(def) and the subject is "def", and the ovector contains at least two
capturing slots, substring number 1 is unset.
</P>
<br><a name="SEC35" href="#TOC1">EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS</a><br>
<P>
<b>int pcre2_substring_list_get(pcre2_match_data *<i>match_data</i>,</b>
<b>  PCRE2_UCHAR ***<i>listptr</i>, PCRE2_SIZE **<i>lengthsptr</i>);</b>
<br>
<br>
<b>void pcre2_substring_list_free(PCRE2_UCHAR **<i>list</i>);</b>
</P>
<P>
The <b>pcre2_substring_list_get()</b> function extracts all available substrings
and builds a list of pointers to them. It also (optionally) builds a second
list that contains their lengths (in code units), excluding a terminating zero
that is added to each of them. All this is done in a single block of memory
that is obtained using the same memory allocation function that was used to get
the match data block.
</P>
<P>
This function must be called only after a successful match. If called after a
partial match, the error code PCRE2_ERROR_PARTIAL is returned.
</P>
<P>
The address of the memory block is returned via <i>listptr</i>, which is also
the start of the list of string pointers. The end of the list is marked by a
NULL pointer. The address of the list of lengths is returned via
<i>lengthsptr</i>. If your strings do not contain binary zeros and you do not
therefore need the lengths, you may supply NULL as the <i>lengthsptr</i>
argument to disable the creation of a list of lengths. The yield of the
function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the memory block
could not be obtained. When the list is no longer needed, it should be freed by
calling <b>pcre2_substring_list_free()</b>.
</P>
<P>
If this function encounters a substring that is unset, which can happen when
capture group number <i>n+1</i> matches some part of the subject, but group
<i>n</i> has not been used at all, it returns an empty string. This can be
distinguished from a genuine zero-length substring by inspecting the
appropriate offset in the ovector, which contains PCRE2_UNSET for unset
substrings, or by calling <b>pcre2_substring_length_bynumber()</b>.
<a name="extractbyname"></a></P>
<br><a name="SEC36" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
<P>
<b>int pcre2_substring_number_from_name(const pcre2_code *<i>code</i>,</b>
<b> PCRE2_SPTR <i>name</i>);</b>
<br>
<br>
<b>int pcre2_substring_length_byname(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_SIZE *<i>length</i>);</b>
<br>
<br>
<b>int pcre2_substring_copy_byname(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_UCHAR *<i>buffer</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>int pcre2_substring_get_byname(pcre2_match_data *<i>match_data</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_UCHAR **<i>bufferptr</i>, PCRE2_SIZE *<i>bufflen</i>);</b>
<br>
<br>
<b>void pcre2_substring_free(PCRE2_UCHAR *<i>buffer</i>);</b>
</P>
<P>
To extract a substring by name, you first have to find the associated number.
For example, for this pattern:
<pre>
(a+)b(?&#60;xxx&#62;\d+)...
</pre>
the number of the capture group called "xxx" is 2. If the name is known to be
unique (PCRE2_DUPNAMES was not set), you can find the number from the name by
calling <b>pcre2_substring_number_from_name()</b>. The first argument is the
compiled pattern, and the second is the name. The yield of the function is the
group number, PCRE2_ERROR_NOSUBSTRING if there is no group with that name, or
PCRE2_ERROR_NOUNIQUESUBSTRING if there is more than one group with that name.
Given the number, you can extract the substring directly from the ovector, or
use one of the "bynumber" functions described above.
</P>
<P>
For convenience, there are also "byname" functions that correspond to the
"bynumber" functions, the only difference being that the second argument is a
name instead of a number. If PCRE2_DUPNAMES is set and there are duplicate
names, these functions scan all the groups with the given name, and return the
captured substring from the first named group that is set.
</P>
<P>
If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
returned. If all groups with the name have numbers that are greater than the
number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is returned. If there
is at least one group with a slot in the ovector, but no group is found to be
set, PCRE2_ERROR_UNSET is returned.
</P>
<P>
<b>Warning:</b> If the pattern uses the (?| feature to set up multiple
capture groups with the same number, as described in the
<a href="pcre2pattern.html#dupgroupnumber">section on duplicate group numbers</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page, you cannot use names to distinguish the different capture groups, because
names are not included in the compiled code. The matching process uses only
numbers. For this reason, the use of different names for groups with the
same number causes an error at compile time.
<a name="substitutions"></a></P>
<br><a name="SEC37" href="#TOC1">CREATING A NEW STRING WITH SUBSTITUTIONS</a><br>
<P>
<b>int pcre2_substitute(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>, PCRE2_SPTR <i>replacement</i>,</b>
<b> PCRE2_SIZE <i>rlength</i>, PCRE2_UCHAR *<i>outputbuffer</i>,</b>
<b> PCRE2_SIZE *<i>outlengthptr</i>);</b>
</P>
<P>
This function optionally calls <b>pcre2_match()</b> and then makes a copy of the
subject string in <i>outputbuffer</i>, replacing parts that were matched with
the <i>replacement</i> string, whose length is supplied in <i>rlength</i>, which
can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string. As a
special case, if <i>replacement</i> is NULL and <i>rlength</i> is zero, the
replacement is assumed to be an empty string. If <i>rlength</i> is non-zero, an
error occurs if <i>replacement</i> is NULL.
</P>
<P>
There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to return just
the replacement string(s). The default action is to perform just one
replacement if the pattern matches, but there is an option that requests
multiple replacements (see PCRE2_SUBSTITUTE_GLOBAL below).
</P>
<P>
If successful, <b>pcre2_substitute()</b> returns the number of substitutions
that were carried out. This may be zero if no match was found, and is never
greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set. A negative value is
returned if an error is detected.
</P>
<P>
Matches in which a \K item in a lookahead in the pattern causes the match to
end before it starts are not supported, and give rise to an error return. For
global replacements, matches in which \K in a lookbehind causes the match to
start earlier than the point that was reached in the previous iteration are
also not supported.
</P>
<P>
The first seven arguments of <b>pcre2_substitute()</b> are the same as for
<b>pcre2_match()</b>, except that the partial matching options are not
permitted, and <i>match_data</i> may be passed as NULL, in which case a match
data block is obtained and freed within this function, using memory management
functions from the match context, if provided, or else those that were used to
allocate memory for the compiled code.
</P>
<P>
If <i>match_data</i> is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the
provided block is used for all calls to <b>pcre2_match()</b>, and its contents
afterwards are the result of the final call. For global changes, this will
always be a no-match error. The contents of the ovector within the match data
block may or may not have been changed.
</P>
<P>
As well as the usual options for <b>pcre2_match()</b>, a number of additional
options can be set in the <i>options</i> argument of <b>pcre2_substitute()</b>.
One such option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an external
<i>match_data</i> block must be provided, and it must have already been used for
an external call to <b>pcre2_match()</b> with the same pattern and subject
arguments. The data in the <i>match_data</i> block (return code, offset vector)
is then used for the first substitution instead of calling <b>pcre2_match()</b>
from within <b>pcre2_substitute()</b>. This allows an application to check for a
match before choosing to substitute, without having to repeat the match.
</P>
<P>
The contents of the externally supplied match data block are not changed when
PCRE2_SUBSTITUTE_MATCHED is set. If PCRE2_SUBSTITUTE_GLOBAL is also set,
<b>pcre2_match()</b> is called after the first substitution to check for further
matches, but this is done using an internally obtained match data block, thus
always leaving the external block unchanged.
</P>
<P>
The <i>code</i> argument is not used for matching before the first substitution
when PCRE2_SUBSTITUTE_MATCHED is set, but it must be provided, even when
PCRE2_SUBSTITUTE_GLOBAL is not set, because it contains information such as the
UTF setting and the number of capturing parentheses in the pattern.
</P>
<P>
The default action of <b>pcre2_substitute()</b> is to return a copy of the
subject string with matched substrings replaced. However, if
PCRE2_SUBSTITUTE_REPLACEMENT_ONLY is set, only the replacement substrings are
returned. In the global case, multiple replacements are concatenated in the
output buffer. Substitution callouts (see
<a href="#subcallouts">below)</a>
can be used to separate them if necessary.
</P>
<P>
The <i>outlengthptr</i> argument of <b>pcre2_substitute()</b> must point to a
variable that contains the length, in code units, of the output buffer. If the
function is successful, the value is updated to contain the length in code
units of the new string, excluding the trailing zero that is automatically
added.
</P>
<P>
If the function is not successful, the value set via <i>outlengthptr</i> depends
on the type of error. For syntax errors in the replacement string, the value is
the offset in the replacement string where the error was detected. For other
errors, the value is PCRE2_UNSET by default. This includes the case of the
output buffer being too small, unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set.
</P>
<P>
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output buffer is
too small. The default action is to return PCRE2_ERROR_NOMEMORY immediately. If
this option is set, however, <b>pcre2_substitute()</b> continues to go through
the motions of matching and substituting (without, of course, writing anything)
in order to compute the size of buffer that is needed, which will include the
extra space for the terminating NUL. This value is passed back via the
<i>outlengthptr</i> variable, with the result of the function still being
PCRE2_ERROR_NOMEMORY.
</P>
<P>
Passing a buffer size of zero is a permitted way of finding out how much memory
is needed for a given substitution. However, this does mean that the entire
operation is carried out twice. Depending on the application, it may be more
efficient to allocate a large buffer and free the excess afterwards, instead of
using PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.
</P>
<P>
The replacement string, which is interpreted as a UTF string in UTF mode, is
checked for UTF validity unless PCRE2_NO_UTF_CHECK is set. An invalid UTF
replacement string causes an immediate return with the relevant UTF error code.
</P>
<P>
If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is not interpreted
in any way. By default, however, a dollar character is an escape character that
can specify the insertion of characters from capture groups and names from
(*MARK) or other control verbs in the pattern. Dollar is the only escape
character (backslash is treated as literal). The following forms are
recognized:
<pre>
$$ insert a dollar character
$n or ${n} insert the contents of group <i>n</i>
$0 or $& insert the entire matched substring
$` insert the substring that precedes the match
$' insert the substring that follows the match
$_ insert the entire input string
$*MARK or ${*MARK} insert a control verb name
</pre>
Either a group number or a group name can be given for <i>n</i>, for example $2 or
$NAME. Curly brackets are required only if the following character would be
interpreted as part of the number or name. The number may be zero to include
the entire matched string. For example, if the pattern a(b)c is matched with
"=abc=" and the replacement string "+$1$0$1+", the result is "=+babcb+=".
</P>
<P>
The JavaScript form $&#60;name&#62;, where the angle brackets are part of the syntax,
is also recognized for group names, but not for group numbers or *MARK.
</P>
<P>
$*MARK inserts the name from the last encountered backtracking control verb on
the matching path that has a name. (*MARK) must always include a name, but the
other verbs need not. For example, in the case of (*MARK:A)(*PRUNE) the name
inserted is "A", but for (*MARK:A)(*PRUNE:B) the relevant name is "B". This
facility can be used to perform simple simultaneous substitutions, as this
<b>pcre2test</b> example shows:
<pre>
/(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
apple lemon
2: pear orange
</pre>
PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject string,
replacing every matching substring. If this option is not set, only the first
matching substring is replaced. The search for matches takes place in the
original subject string (that is, previous replacements do not affect it).
Iteration is implemented by advancing the <i>startoffset</i> value for each
search, which is always passed the entire subject string. If an offset limit is
set in the match context, searching stops when that limit is reached.
</P>
<P>
You can restrict the effect of a global substitution to a portion of the
subject string by setting either or both of <i>startoffset</i> and an offset
limit. Here is a <b>pcre2test</b> example:
<pre>
/B/g,replace=!,use_offset_limit
ABC ABC ABC ABC\=offset=3,offset_limit=12
2: ABC A!C A!C ABC
</pre>
When continuing with global substitutions after matching a substring with zero
length, an attempt to find a non-empty match at the same offset is performed.
If this is not successful, the offset is advanced by one character except when
CRLF is a valid newline sequence and the next two characters are CR, LF. In
this case, the offset is advanced by two characters.
</P>
<P>
PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that do
not appear in the pattern to be treated as unset groups. This option should be
used with care, because it means that a typo in a group name or number no
longer causes the PCRE2_ERROR_NOSUBSTRING error.
</P>
<P>
PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including unknown
groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated as empty
strings when inserted as described above. If this option is not set, an attempt
to insert an unset group causes the PCRE2_ERROR_UNSET error. This option does
not influence the extended substitution syntax described below.
</P>
<P>
PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
replacement string. Without this option, only the dollar character is special,
and only the group insertion forms listed above are valid. When
PCRE2_SUBSTITUTE_EXTENDED is set, several things change:
</P>
<P>
Firstly, backslash in a replacement string is interpreted as an escape
character. The usual forms such as \x{ddd} can be used to specify particular
character codes, and backslash followed by any non-alphanumeric character
quotes that character. Extended quoting can be coded using \Q...\E, exactly
as in pattern strings. The escapes \b and \v are interpreted as the
characters backspace and vertical tab, respectively.
</P>
<P>
The interpretation of backslash followed by one or more digits is the same as
in a pattern, which in Perl has some ambiguities. Details are given in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page.
</P>
<P>
The Python form \g&#60;n&#62;, where the angle brackets are part of the syntax and <i>n</i>
is either a group name or number, is recognized as an alternative way of
inserting the contents of a group, for example \g&#60;3&#62;.
</P>
<P>
There are also four escape sequences for forcing the case of inserted letters.
Case forcing applies to all inserted characters, including those from capture
groups and letters within \Q...\E quoted sequences. The insertion mechanism
has three states: no case forcing, force upper case, and force lower case. The
escape sequences change the current state: \U and \L change to upper or lower
case forcing, respectively, and \E (when not terminating a \Q quoted
sequence) reverts to no case forcing. The sequences \u and \l force the next
character (if it is a letter) to upper or lower case, respectively, and then
the state automatically reverts to no case forcing.
</P>
<P>
However, if \u is immediately followed by \L or \l is immediately followed
by \U, the next character's case is forced by the first escape sequence, and
subsequent characters by the second. This provides a "title casing" facility
that can be applied to group captures. For example, if group 1 has captured
"heLLo", the replacement string "\u\L$1" becomes "Hello".
</P>
<P>
If either PCRE2_UTF or PCRE2_UCP was set when the pattern was compiled, Unicode
properties are used for case forcing characters whose code points are greater
than 127. However, only simple case folding, as determined by the Unicode file
<b>CaseFolding.txt</b>, is supported. PCRE2 does not support language-specific
special casing rules such as using different lower case Greek sigmas in the
middle and ends of words (as defined in the Unicode file
<b>SpecialCasing.txt</b>).
</P>
<P>
Note that case forcing sequences such as \U...\E do not nest. For example,
the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final \E has no
effect. Note also that the PCRE2_ALT_BSUX and PCRE2_EXTRA_ALT_BSUX options do
not apply to replacement strings.
</P>
<P>
The final effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
flexibility to capture group substitution. The syntax is similar to that used
by Bash:
<pre>
${n:-string}
${n:+string1:string2}
</pre>
As in the simple case, <i>n</i> may be a group number or a name. The first form
specifies a default value. If group <i>n</i> is set, its value is inserted; if
not, the string is expanded and the result inserted. The second form specifies
strings that are expanded and inserted when group <i>n</i> is set or unset,
respectively. The first form is just a convenient shorthand for
<pre>
${n:+${n}:string}
</pre>
Backslash can be used to escape colons and closing curly brackets in the
replacement strings. A change of the case forcing state within a replacement
string remains in force afterwards, as shown in this <b>pcre2test</b> example:
<pre>
/(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
body
1: hello
somebody
1: HELLO
</pre>
The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause unknown
groups in the extended syntax forms to be treated as unset.
</P>
<P>
If PCRE2_SUBSTITUTE_LITERAL is set, PCRE2_SUBSTITUTE_UNKNOWN_UNSET,
PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrelevant and
are ignored.
</P>
<br><b>
Substitution errors
</b><br>
<P>
In the event of an error, <b>pcre2_substitute()</b> returns a negative error
code. Except for PCRE2_ERROR_NOMATCH (which is never returned), errors from
<b>pcre2_match()</b> are passed straight back.
</P>
<P>
PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring insertion,
unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
</P>
<P>
PCRE2_ERROR_UNSET is returned for an unset substring insertion (including an
unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) when the simple
(non-extended) syntax is used and PCRE2_SUBSTITUTE_UNSET_EMPTY is not set.
</P>
<P>
PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big enough. If the
PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size of buffer that is
needed is returned via <i>outlengthptr</i>. Note that this does not happen by
default.
</P>
<P>
PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the
<i>match_data</i> argument is NULL or if the <i>subject</i> or <i>replacement</i>
arguments are NULL. For backward compatibility reasons an exception is made for
the <i>replacement</i> argument if the <i>rlength</i> argument is also 0.
</P>
<P>
PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in the
replacement string, with more particular errors being PCRE2_ERROR_BADREPESCAPE
(invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE (closing curly bracket
not found), PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group
substitution), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before
it started or the match started earlier than the current position in the
subject, which can happen if \K is used in an assertion).
</P>
<P>
As for all PCRE2 errors, a text message that describes the error can be
obtained by calling the <b>pcre2_get_error_message()</b> function (see
"Obtaining a textual error message"
<a href="#geterrormessage">above).</a>
<a name="subcallouts"></a></P>
<br><b>
Substitution callouts
</b><br>
<P>
<b>int pcre2_set_substitute_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b> int (*<i>callout_function</i>)(pcre2_substitute_callout_block *, void *),</b>
<b> void *<i>callout_data</i>);</b>
<br>
<br>
The <b>pcre2_set_substitute_callout()</b> function can be used to specify a
callout function for <b>pcre2_substitute()</b>. This information is passed in
a match context. The callout function is called after each substitution has
been processed, but it can cause the replacement not to happen.
</P>
<P>
The callout function is not called for simulated substitutions that happen as a
result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option. In this mode, when
substitution processing exceeds the buffer space provided by the caller,
processing continues by counting code units. The simulation is unable to
populate the callout block, and so the simulation is pessimistic about the
required buffer size. The larger of the accepted and rejected substitution sizes
is reported as the required size. Therefore, the returned buffer length may be
an overestimate (without a substitution callout, it is normally an exact
measurement).
</P>
<P>
The first argument of the callout function is a pointer to a substitute callout
block structure, which contains the following fields, not necessarily in this
order:
<pre>
uint32_t <i>version</i>;
uint32_t <i>subscount</i>;
PCRE2_SPTR <i>input</i>;
PCRE2_SPTR <i>output</i>;
PCRE2_SIZE <i>*ovector</i>;
uint32_t <i>oveccount</i>;
PCRE2_SIZE <i>output_offsets[2]</i>;
</pre>
The <i>version</i> field contains the version number of the block format. The
current version is 0. The version number will increase in future if more fields
are added, but the intention is never to remove any of the existing fields.
</P>
<P>
The <i>subscount</i> field is the number of the current match. It is 1 for the
first callout, 2 for the second, and so on. The <i>input</i> and <i>output</i>
pointers are copies of the values passed to <b>pcre2_substitute()</b>.
</P>
<P>
The <i>ovector</i> field points to the ovector, which contains the result of the
most recent match. The <i>oveccount</i> field contains the number of pairs that
are set in the ovector, and is always greater than zero.
</P>
<P>
The <i>output_offsets</i> vector contains the offsets of the replacement in the
output string. This has already been processed for dollar and (if requested)
backslash substitutions as described above.
</P>
<P>
The second argument of the callout function is the value passed as
<i>callout_data</i> when the function was registered. The value returned by the
callout function is interpreted as follows:
</P>
<P>
If the value is zero, the replacement is accepted, and, if
PCRE2_SUBSTITUTE_GLOBAL is set, processing continues with a search for the next
match. If the value is not zero, the current replacement is not accepted. If
the value is greater than zero, processing continues when
PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than zero or
PCRE2_SUBSTITUTE_GLOBAL is not set), the rest of the input is copied to the
output and the call to <b>pcre2_substitute()</b> exits, returning the number of
matches so far.
</P>
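<P>
The return-value rules above can be sketched as a callout that accepts short
replacements, skips long ones, and stops after a fixed number of matches. So
that the sketch compiles without the PCRE2 headers, <i>sketch_callout_block</i>
is a hypothetical stand-in that copies only three of the documented fields; in
real code the first parameter would be a <b>pcre2_substitute_callout_block</b>
pointer. The <i>sketch_limits</i> structure and the function name are likewise
assumptions made for illustration.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for pcre2_substitute_callout_block: only the fields
   this sketch reads, using the names the documentation gives them. */
typedef struct {
  uint32_t subscount;          /* 1 for the first match, 2 for the second... */
  const uint8_t *output;       /* start of the output buffer */
  size_t output_offsets[2];    /* start and end of the replacement in output */
} sketch_callout_block;

/* Hypothetical user data, passed through the callout_data pointer. */
typedef struct {
  uint32_t max_matches;        /* stop once more than this many were seen */
  size_t max_replacement_len;  /* reject replacements longer than this */
} sketch_limits;

/* Returns 0 to accept, >0 to reject and continue, <0 to reject and stop. */
int sketch_substitute_callout(sketch_callout_block *block, void *data)
{
  const sketch_limits *limits = (const sketch_limits *)data;
  size_t length = block->output_offsets[1] - block->output_offsets[0];

  if (block->subscount > limits->max_matches) return -1;
  if (length > limits->max_replacement_len) return 1;
  return 0;
}
```

<P>
A positive return only has a "continue" effect when PCRE2_SUBSTITUTE_GLOBAL is
set; without it, any non-zero return ends the substitution.
</P>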
<br><b>
Substitution case callouts
</b><br>
<P>
<b>int pcre2_set_substitute_case_callout(pcre2_match_context *<i>mcontext</i>,</b>
<b> PCRE2_SIZE (*<i>callout_function</i>)(PCRE2_SPTR, PCRE2_SIZE,</b>
<b> PCRE2_UCHAR *, PCRE2_SIZE,</b>
<b> int, void *),</b>
<b> void *<i>callout_data</i>);</b>
<br>
<br>
The <b>pcre2_set_substitute_case_callout()</b> function can be used to specify
a callout function for <b>pcre2_substitute()</b> to use when performing case
transformations. This does not affect any case insensitivity behaviour when
performing a match, but only the user-visible transformations performed when
processing a substitution such as:
<pre>
pcre2_substitute(..., "\\U$1", ...)
</PRE>
</P>
<P>
The default case transformations applied by PCRE2 are reasonably complete, and,
in UTF or UCP mode, perform the simple locale-invariant case transformations as
specified by Unicode. This is suitable for the internal (invisible)
case-equivalence procedures used during pattern matching, but an application
may wish to use more sophisticated locale-aware processing for the user-visible
substitution transformations.
</P>
<P>
One example implementation of the <i>callout_function</i> using the ICU
library would be:
<br>
<br>
<pre>
PCRE2_SIZE
icu_case_callout(
PCRE2_SPTR input, PCRE2_SIZE input_len,
PCRE2_UCHAR *output, PCRE2_SIZE output_cap,
int to_case, void *data_ptr)
{
UErrorCode err = U_ZERO_ERROR;
int32_t r = to_case == PCRE2_SUBSTITUTE_CASE_LOWER
? u_strToLower(output, output_cap, input, input_len, NULL, &err)
: to_case == PCRE2_SUBSTITUTE_CASE_UPPER
? u_strToUpper(output, output_cap, input, input_len, NULL, &err)
: u_strToTitle(output, output_cap, input, input_len, &first_char_only,
NULL, &err);
if (U_FAILURE(err)) return (~(PCRE2_SIZE)0);
return r;
}
</PRE>
</P>
<P>
The first and second arguments of the case callout function are a pointer to
the Unicode string to transform and its length in code units.
</P>
<P>
The third and fourth arguments are the output buffer and its capacity.
</P>
<P>
The fifth is one of the constants PCRE2_SUBSTITUTE_CASE_LOWER,
PCRE2_SUBSTITUTE_CASE_UPPER, or PCRE2_SUBSTITUTE_CASE_TITLE_FIRST.
PCRE2_SUBSTITUTE_CASE_LOWER and PCRE2_SUBSTITUTE_CASE_UPPER are passed to the
callout to indicate that the case of the entire callout input should be
case-transformed. PCRE2_SUBSTITUTE_CASE_TITLE_FIRST is passed to indicate that
only the first character or glyph should be transformed to Unicode titlecase
and the rest to Unicode lowercase (note that titlecasing sometimes uses Unicode
properties to titlecase each word in a string; but PCRE2 is requesting that only
the single leading character is to be titlecased).
</P>
<P>
The sixth argument is the <i>callout_data</i> supplied to
<b>pcre2_set_substitute_case_callout()</b>.
</P>
<P>
The resulting string in the destination buffer may be larger or smaller than the
input, if the casing rules merge or split characters. The return value is the
length required for the output string. If a buffer of sufficient size was
provided to the callout, then the result must be written to the buffer and the
number of code units returned. If the result does not fit in the provided
buffer, then the required capacity must be returned and PCRE2 will not make use
of the output buffer. PCRE2 provides input and output buffers which overlap, so
the callout must support this by suitable internal buffering.
</P>
<P>
Alternatively, if the callout wishes to indicate an error, then it may return
(~(PCRE2_SIZE)0). In this case pcre2_substitute() will immediately fail with
error PCRE2_ERROR_REPLACECASE.
</P>
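<P>
The buffer contract described above (return the required length, write only
when the result fits, tolerate overlapping buffers, and signal an error with an
all-ones value) can be sketched with an ASCII-only callout. This is an
illustration rather than a locale-aware implementation. It assumes the 8-bit
library, so plain C types stand in for PCRE2_SPTR, PCRE2_UCHAR and PCRE2_SIZE,
and the SKETCH_CASE_* values are hypothetical stand-ins for the
PCRE2_SUBSTITUTE_CASE_* constants.
</P>

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the PCRE2_SUBSTITUTE_CASE_* constants. */
enum { SKETCH_CASE_LOWER, SKETCH_CASE_UPPER, SKETCH_CASE_TITLE_FIRST };

size_t sketch_ascii_case_callout(const uint8_t *input, size_t input_len,
                                 uint8_t *output, size_t output_cap,
                                 int to_case, void *data)
{
  size_t i;
  (void)data;                              /* unused in this sketch */

  if (to_case != SKETCH_CASE_LOWER && to_case != SKETCH_CASE_UPPER &&
      to_case != SKETCH_CASE_TITLE_FIRST)
    return ~(size_t)0;                     /* report an error to the caller */

  /* ASCII case mapping never changes the length, so the required size is
     known up front. If it does not fit, report it and leave output alone. */
  if (input_len > output_cap) return input_len;

  /* memmove() tolerates overlapping input and output buffers; the case
     conversion is then done in place on the copy. */
  if (input_len > 0) memmove(output, input, input_len);
  for (i = 0; i < input_len; i++)
    {
    int upper = to_case == SKETCH_CASE_UPPER ||
                (to_case == SKETCH_CASE_TITLE_FIRST && i == 0);
    output[i] = (uint8_t)(upper ? toupper(output[i]) : tolower(output[i]));
    }
  return input_len;
}
```

<P>
A transformation that can change the length of the text would need a scratch
buffer instead of the in-place conversion used here.
</P>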
<P>
When a case callout is combined with the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
option, there are situations when pcre2_substitute() will return an
underestimate of the required buffer size. If you call pcre2_substitute() once
with PCRE2_SUBSTITUTE_OVERFLOW_LENGTH, and the output buffer is too small for
the replacement string to be constructed, then instead of calling the case
callout, pcre2_substitute() will make an estimate of the required buffer size.
The second call should also pass PCRE2_SUBSTITUTE_OVERFLOW_LENGTH, because that
second call is not guaranteed to succeed either, if the case callout requires
more buffer space than expected. The caller must make repeated attempts in a
loop.
</P>
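<P>
The retry loop mentioned above might look like the following sketch. To stay
independent of the PCRE2 headers, the actual call is hidden behind a
hypothetical <i>sketch_attempt_fn</i> callback, which stands in for one call to
<b>pcre2_substitute()</b> with PCRE2_SUBSTITUTE_OVERFLOW_LENGTH set: it returns
0 on success, or SKETCH_NOMEM (a stand-in for PCRE2_ERROR_NOMEMORY) after
storing its size estimate in <i>*length</i>. The demo callback exists only to
exercise the loop with an estimate that is too low the first time.
</P>

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_NOMEM (-1)   /* hypothetical stand-in for PCRE2_ERROR_NOMEMORY */

/* Hypothetical wrapper around one substitution attempt. On entry *length is
   the buffer capacity; on SKETCH_NOMEM it is updated to the size estimate. */
typedef int (*sketch_attempt_fn)(char *buffer, size_t *length, void *ctx);

/* Calls attempt() until it stops asking for more space or max_tries is hit.
   Returns a malloc'd buffer (the caller frees it) or NULL on failure. */
char *sketch_substitute_with_retry(sketch_attempt_fn attempt, void *ctx,
                                   size_t initial_cap, int max_tries,
                                   size_t *out_length)
{
  size_t cap = initial_cap > 0 ? initial_cap : 1;
  char *buffer = NULL;
  int tries;

  for (tries = 0; tries < max_tries; tries++)
    {
    size_t length = cap;
    char *grown = realloc(buffer, cap);
    int rc;
    if (grown == NULL) break;
    buffer = grown;

    rc = attempt(buffer, &length, ctx);
    if (rc == 0) { *out_length = length; return buffer; }
    if (rc != SKETCH_NOMEM) break;          /* some other error: give up */

    /* The estimate may be too low when a case callout expands the text, so
       always grow by at least one unit to guarantee progress. */
    cap = (length > cap) ? length : cap + 1;
    }

  free(buffer);
  return NULL;
}

/* Demo attempt: needs `need` bytes, but its first estimate is 2 too low. */
typedef struct { size_t need; int calls; } sketch_demo_ctx;

int sketch_demo_attempt(char *buffer, size_t *length, void *ctx)
{
  sketch_demo_ctx *demo = (sketch_demo_ctx *)ctx;
  size_t i;

  demo->calls++;
  if (*length < demo->need)
    {
    *length = (demo->calls == 1 && demo->need > 2) ? demo->need - 2
                                                   : demo->need;
    return SKETCH_NOMEM;
    }
  for (i = 0; i < demo->need; i++) buffer[i] = 'x';
  *length = demo->need;
  return 0;
}
```

<P>
Capping the number of attempts is a design choice of this sketch; it keeps a
misbehaving callout from causing an unbounded loop.
</P>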
<br><a name="SEC38" href="#TOC1">DUPLICATE CAPTURE GROUP NAMES</a><br>
<P>
<b>int pcre2_substring_nametable_scan(const pcre2_code *<i>code</i>,</b>
<b> PCRE2_SPTR <i>name</i>, PCRE2_SPTR *<i>first</i>, PCRE2_SPTR *<i>last</i>);</b>
</P>
<P>
When a pattern is compiled with the PCRE2_DUPNAMES option, names for capture
groups are not required to be unique. Duplicate names are always allowed for
groups with the same number, created by using the (?| feature. Indeed, if such
groups are named, they are required to use the same names.
</P>
<P>
Normally, patterns that use duplicate names are such that in any one match,
only one of each set of identically-named groups participates. An example is
shown in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
</P>
<P>
When duplicates are present, <b>pcre2_substring_copy_byname()</b> and
<b>pcre2_substring_get_byname()</b> return the first substring corresponding to
the given name that is set. Only if none are set is PCRE2_ERROR_UNSET
returned. The <b>pcre2_substring_number_from_name()</b> function returns the
error PCRE2_ERROR_NOUNIQUESUBSTRING when there are duplicate names.
</P>
<P>
If you want to get full details of all captured substrings for a given name,
you must use the <b>pcre2_substring_nametable_scan()</b> function. The first
argument is the compiled pattern, and the second is the name. If the third and
fourth arguments are NULL, the function returns a group number for a unique
name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
</P>
<P>
When the third and fourth arguments are not NULL, they must be pointers to
variables that are updated by the function. After it has run, they point to the
first and last entries in the name-to-number table for the given name, and the
function returns the length of each entry in code units. In both cases,
PCRE2_ERROR_NOSUBSTRING is returned if there are no entries for the given name.
</P>
<P>
The format of the name table is described
<a href="#infoaboutpattern">above</a>
in the section entitled <i>Information about a pattern</i>. Given all the
relevant entries for the name, you can extract each of their numbers, and hence
the captured data.
</P>
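<P>
As an illustration of that last step, the sketch below walks the entries between
the <i>first</i> and <i>last</i> pointers and collects the group numbers. It
assumes the 8-bit library, where each entry starts with the group number in two
bytes, most significant first, followed by the zero-terminated name; in the
16-bit and 32-bit libraries the number occupies a single code unit instead. The
function name is hypothetical, and <i>entry_size</i> is the value returned by
<b>pcre2_substring_nametable_scan()</b>.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Walks name-table entries from first to last (inclusive), entry_size code
   units apart, and stores each group number. 8-bit layout assumed: two bytes
   of group number, most significant first, then the zero-terminated name.
   Returns the number of entries seen, which may exceed capacity. */
size_t sketch_group_numbers(const uint8_t *first, const uint8_t *last,
                            size_t entry_size, uint32_t *numbers,
                            size_t capacity)
{
  size_t count = 0;
  const uint8_t *entry;

  if (entry_size == 0) return 0;            /* avoid looping forever */

  for (entry = first; entry <= last; entry += entry_size)
    {
    if (count < capacity)
      numbers[count] = ((uint32_t)entry[0] << 8) | entry[1];
    count++;
    }
  return count;
}
```

<P>
Each collected number can then be passed to the by-number substring functions
to find out which of the identically-named groups were set.
</P>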
<br><a name="SEC39" href="#TOC1">FINDING ALL POSSIBLE MATCHES AT ONE POSITION</a><br>
<P>
The traditional matching function uses a similar algorithm to Perl, which stops
when it finds the first match at a given point in the subject. If you want to
find all possible matches, or the longest possible match at a given position,
consider using the alternative matching function (see below) instead. If you
cannot use the alternative function, you can kludge it up by making use of the
callout facility, which is described in the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation.
</P>
<P>
What you have to do is to insert a callout right at the end of the pattern.
When your callout function is called, extract and save the current matched
substring. Then return 1, which forces <b>pcre2_match()</b> to backtrack and try
other alternatives. Ultimately, when it runs out of matches,
<b>pcre2_match()</b> will yield PCRE2_ERROR_NOMATCH.
<a name="dfamatch"></a></P>
<br><a name="SEC40" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br>
<P>
<b>int pcre2_dfa_match(const pcre2_code *<i>code</i>, PCRE2_SPTR <i>subject</i>,</b>
<b> PCRE2_SIZE <i>length</i>, PCRE2_SIZE <i>startoffset</i>,</b>
<b> uint32_t <i>options</i>, pcre2_match_data *<i>match_data</i>,</b>
<b> pcre2_match_context *<i>mcontext</i>,</b>
<b> int *<i>workspace</i>, PCRE2_SIZE <i>wscount</i>);</b>
</P>
<P>
The function <b>pcre2_dfa_match()</b> is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the subject
string just once (not counting lookaround assertions), and does not backtrack
(except when processing lookaround assertions). This has different
characteristics to the normal algorithm, and is not compatible with Perl. Some
of the features of PCRE2 patterns are not supported. Nevertheless, there are
times when this kind of matching can be useful. For a discussion of the two
matching algorithms, and a list of features that <b>pcre2_dfa_match()</b> does
not support, see the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
documentation.
</P>
<P>
The arguments for the <b>pcre2_dfa_match()</b> function are the same as for
<b>pcre2_match()</b>, plus two extras. The ovector within the match data block
is used in a different way, and this is described below. The other common
arguments are used in the same way as for <b>pcre2_match()</b>, so their
description is not repeated here.
</P>
<P>
The two additional arguments provide workspace for the function. The workspace
vector should contain at least 20 elements. It is used for keeping track of
multiple paths through the pattern tree. More workspace is needed for patterns
and subjects where there are a lot of potential matches.
</P>
<P>
Here is an example of a simple call to <b>pcre2_dfa_match()</b>:
<pre>
int wspace[20];
pcre2_match_data *md = pcre2_match_data_create(4, NULL);
int rc = pcre2_dfa_match(
re, /* result of pcre2_compile() */
 (PCRE2_SPTR)"some string", /* the subject string */
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
md, /* the match data block */
NULL, /* a match context; NULL means use defaults */
wspace, /* working space vector */
20); /* number of elements (NOT size in bytes) */
</PRE>
</P>
<br><b>
Option bits for <b>pcre2_dfa_match()</b>
</b><br>
<P>
The unused bits of the <i>options</i> argument for <b>pcre2_dfa_match()</b> must
be zero. The only bits that may be set are PCRE2_ANCHORED,
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL, PCRE2_NOTEOL,
PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD,
PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but the last
four of these are exactly the same as for <b>pcre2_match()</b>, so their
description is not repeated here.
<pre>
PCRE2_PARTIAL_HARD
PCRE2_PARTIAL_SOFT
</pre>
These have the same general effect as they do for <b>pcre2_match()</b>, but the
details are slightly different. When PCRE2_PARTIAL_HARD is set for
<b>pcre2_dfa_match()</b>, it returns PCRE2_ERROR_PARTIAL if the end of the
subject is reached and there is still at least one matching possibility that
requires additional characters. This happens even if some complete matches have
already been found. When PCRE2_PARTIAL_SOFT is set, the return code
PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL if the end of the
subject is reached, there have been no complete matches, but there is still at
least one matching possibility. The portion of the string that was inspected
when the longest partial match was found is set as the first matching string in
both cases. There is a more detailed discussion of partial and multi-segment
matching, with examples, in the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
<pre>
PCRE2_DFA_SHORTEST
</pre>
Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to stop as
soon as it has found one match. Because of the way the alternative algorithm
works, this is necessarily the shortest possible match at the first possible
matching point in the subject string.
<pre>
PCRE2_DFA_RESTART
</pre>
When <b>pcre2_dfa_match()</b> returns a partial match, it is possible to call it
again, with additional subject characters, and have it continue with the same
match. The PCRE2_DFA_RESTART option requests this action; when it is set, the
<i>workspace</i> and <i>wscount</i> arguments must reference the same vector as
before because data about the match so far is left in them after a partial
match. There is more discussion of this facility in the
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation.
</P>
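<P>
The difference between the two partial-matching options described above can be
summarized as a small decision function. This is only a restatement of the
rules, not PCRE2 code: the function name, its parameters, and the SKETCH_*
constants (stand-ins for PCRE2_ERROR_NOMATCH and PCRE2_ERROR_PARTIAL) are
assumptions, and the case where the ovector overflows is ignored.
</P>

```c
/* Hypothetical stand-ins for PCRE2_ERROR_NOMATCH and PCRE2_ERROR_PARTIAL. */
enum { SKETCH_NOMATCH = -1, SKETCH_PARTIAL = -2 };

/* complete_matches: number of complete matches found so far.
   pending_at_end:   non-zero if, at the end of the subject, at least one
                     matching possibility still needs more characters. */
int sketch_dfa_outcome(int partial_hard, int partial_soft,
                       int complete_matches, int pending_at_end)
{
  /* Hard partial matching wins even when complete matches exist. */
  if (partial_hard && pending_at_end) return SKETCH_PARTIAL;

  /* Otherwise complete matches are reported as usual. */
  if (complete_matches > 0) return complete_matches;

  /* Soft partial matching only converts what would have been "no match". */
  if (partial_soft && pending_at_end) return SKETCH_PARTIAL;

  return SKETCH_NOMATCH;
}
```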
<br><b>
Successful returns from <b>pcre2_dfa_match()</b>
</b><br>
<P>
When <b>pcre2_dfa_match()</b> succeeds, it may have matched more than one
substring in the subject. Note, however, that all the matches from one run of
the function start at the same point in the subject. The shorter matches are
all initial substrings of the longer matches. For example, if the pattern
<pre>
&#60;.*&#62;
</pre>
is matched against the string
<pre>
This is &#60;something&#62; &#60;something else&#62; &#60;something further&#62; no more
</pre>
the three matched strings are
<pre>
&#60;something&#62; &#60;something else&#62; &#60;something further&#62;
&#60;something&#62; &#60;something else&#62;
&#60;something&#62;
</pre>
On success, the yield of the function is a number greater than zero, which is
the number of matched substrings. The offsets of the substrings are returned in
the ovector, and can be extracted by number in the same way as for
<b>pcre2_match()</b>, but the numbers bear no relation to any capture groups
that may exist in the pattern, because DFA matching does not support capturing.
</P>
<P>
Calls to the convenience functions that extract substrings by name
return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used after a
DFA match. The convenience functions that extract substrings by number never
return PCRE2_ERROR_NOSUBSTRING.
</P>
<P>
The matched strings are stored in the ovector in reverse order of length; that
is, the longest matching string is first. If there were too many matches to fit
into the ovector, the yield of the function is zero, and the vector is filled
with the longest matches.
</P>
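<P>
A caller might turn the return value and the ovector into a list of matches
along the following lines. The helper names are hypothetical, and plain
<i>size_t</i> is used on the assumption that PCRE2_SIZE is that type; in real
code the vector and its pair count would be obtained from the match data block,
for instance with <b>pcre2_get_ovector_pointer()</b> and
<b>pcre2_get_ovector_count()</b>.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Number of usable ovector pairs after pcre2_dfa_match() returned rc:
   rc > 0 gives the count directly, rc == 0 means there were too many matches
   and every pair holds one of the longest, rc < 0 is an error or no match. */
size_t sketch_dfa_pair_count(int rc, uint32_t oveccount)
{
  if (rc > 0) return (size_t)rc;
  if (rc == 0) return oveccount;
  return 0;
}

/* Length of match number n (0 is the longest) from start/end offset pairs. */
size_t sketch_dfa_match_length(const size_t *ovector, size_t n)
{
  return ovector[2*n + 1] - ovector[2*n];
}
```

<P>
For the example pattern and subject shown earlier, the three pairs all start at
offset 8 and have decreasing lengths, the first being the longest match.
</P>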
<P>
NOTE: PCRE2's "auto-possessification" optimization usually applies to character
repeats at the end of a pattern (as well as internally). For example, the
pattern "a\d+" is compiled as if it were "a\d++". For DFA matching, this
means that only one possible match is found. If you really do want multiple
matches in such cases, either use an ungreedy repeat such as "a\d+?" or set
the PCRE2_NO_AUTO_POSSESS option when compiling.
</P>
<br><b>
Error returns from <b>pcre2_dfa_match()</b>
</b><br>
<P>
The <b>pcre2_dfa_match()</b> function returns a negative number when it fails.
Many of the errors are the same as for <b>pcre2_match()</b>, as described
<a href="#errorlist">above.</a>
There are in addition the following errors that are specific to
<b>pcre2_dfa_match()</b>:
<pre>
PCRE2_ERROR_DFA_UITEM
</pre>
This return is given if <b>pcre2_dfa_match()</b> encounters an item in the
pattern that it does not support, for instance, the use of \C in a UTF mode or
a backreference.
<pre>
PCRE2_ERROR_DFA_UCOND
</pre>
This return is given if <b>pcre2_dfa_match()</b> encounters a condition item
that uses a backreference for the condition, or a test for recursion in a
specific capture group. These are not supported.
<pre>
PCRE2_ERROR_DFA_UINVALID_UTF
</pre>
This return is given if <b>pcre2_dfa_match()</b> is called for a pattern that
was compiled with PCRE2_MATCH_INVALID_UTF. This is not supported for DFA
matching.
<pre>
PCRE2_ERROR_DFA_WSSIZE
</pre>
This return is given if <b>pcre2_dfa_match()</b> runs out of space in the
<i>workspace</i> vector.
<pre>
PCRE2_ERROR_DFA_RECURSE
</pre>
When a recursion or subroutine call is processed, the matching function calls
itself recursively, using private memory for the ovector and <i>workspace</i>.
This error is given if the internal ovector is not large enough. This should be
extremely rare, as a vector of size 1000 is used.
<pre>
PCRE2_ERROR_DFA_BADRESTART
</pre>
When <b>pcre2_dfa_match()</b> is called with the <b>PCRE2_DFA_RESTART</b> option,
some plausibility checks are made on the contents of the workspace, which
should contain data about the previous partial match. If any of these checks
fail, this error is given.
</P>
<br><a name="SEC41" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2build</b>(3), <b>pcre2callout</b>(3), <b>pcre2demo</b>(3),
<b>pcre2matching</b>(3), <b>pcre2partial</b>(3), <b>pcre2posix</b>(3),
<b>pcre2sample</b>(3), <b>pcre2unicode</b>(3).
</P>
<br><a name="SEC42" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC43" href="#TOC1">REVISION</a><br>
<P>
Last updated: 26 December 2024
<br>
Copyright &copy; 1997-2024 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,652 @@
<html>
<head>
<title>pcre2build specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2build man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">BUILDING PCRE2</a>
<li><a name="TOC2" href="#SEC2">PCRE2 BUILD-TIME OPTIONS</a>
<li><a name="TOC3" href="#SEC3">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a>
<li><a name="TOC4" href="#SEC4">BUILDING SHARED AND STATIC LIBRARIES</a>
<li><a name="TOC5" href="#SEC5">UNICODE AND UTF SUPPORT</a>
<li><a name="TOC6" href="#SEC6">DISABLING THE USE OF \C</a>
<li><a name="TOC7" href="#SEC7">JUST-IN-TIME COMPILER SUPPORT</a>
<li><a name="TOC8" href="#SEC8">NEWLINE RECOGNITION</a>
<li><a name="TOC9" href="#SEC9">WHAT \R MATCHES</a>
<li><a name="TOC10" href="#SEC10">HANDLING VERY LARGE PATTERNS</a>
<li><a name="TOC11" href="#SEC11">LIMITING PCRE2 RESOURCE USAGE</a>
<li><a name="TOC12" href="#SEC12">LIMITING VARIABLE-LENGTH LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC13" href="#SEC13">CREATING CHARACTER TABLES AT BUILD TIME</a>
<li><a name="TOC14" href="#SEC14">USING EBCDIC CODE</a>
<li><a name="TOC15" href="#SEC15">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a>
<li><a name="TOC16" href="#SEC16">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a>
<li><a name="TOC17" href="#SEC17">PCRE2GREP BUFFER SIZE</a>
<li><a name="TOC18" href="#SEC18">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a>
<li><a name="TOC19" href="#SEC19">INCLUDING DEBUGGING CODE</a>
<li><a name="TOC20" href="#SEC20">DEBUGGING WITH VALGRIND SUPPORT</a>
<li><a name="TOC21" href="#SEC21">CODE COVERAGE REPORTING</a>
<li><a name="TOC22" href="#SEC22">DISABLING THE Z AND T FORMATTING MODIFIERS</a>
<li><a name="TOC23" href="#SEC23">SUPPORT FOR FUZZERS</a>
<li><a name="TOC24" href="#SEC24">OBSOLETE OPTION</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">BUILDING PCRE2</a><br>
<P>
PCRE2 is distributed with a <b>configure</b> script that can be used to build
the library in Unix-like environments using the applications known as
Autotools. Also in the distribution are files to support building using
<b>CMake</b> instead of <b>configure</b>. The text file
<a href="README.txt"><b>README</b></a>
contains general information about building with Autotools (some of which is
repeated below), and also has some comments about building on various operating
systems. The files in the <b>vms</b> directory support building under OpenVMS.
There is a lot more information about building PCRE2 without using
Autotools (including information about using <b>CMake</b> and building "by
hand") in the text file called
<a href="NON-AUTOTOOLS-BUILD.txt"><b>NON-AUTOTOOLS-BUILD</b>.</a>
You should consult this file as well as the
<a href="README.txt"><b>README</b></a>
file if you are building in a non-Unix-like environment.
</P>
<br><a name="SEC2" href="#TOC1">PCRE2 BUILD-TIME OPTIONS</a><br>
<P>
The rest of this document describes the optional features of PCRE2 that can be
selected when the library is compiled. It assumes use of the <b>configure</b>
script, where the optional features are selected or deselected by providing
options to <b>configure</b> before running the <b>make</b> command. However, the
same options can be selected in both Unix-like and non-Unix-like environments
if you are using <b>CMake</b> instead of <b>configure</b> to build PCRE2.
</P>
<P>
If you are not using Autotools or <b>CMake</b>, option selection can be done by
editing the <b>config.h</b> file, or by passing parameter settings to the
compiler, as described in
<a href="NON-AUTOTOOLS-BUILD.txt"><b>NON-AUTOTOOLS-BUILD</b>.</a>
</P>
<P>
The complete list of options for <b>configure</b> (which includes the standard
ones such as the selection of the installation directory) can be obtained by
running
<pre>
./configure --help
</pre>
The following sections include descriptions of "on/off" options whose names
begin with --enable or --disable. Because of the way that <b>configure</b>
works, --enable and --disable always come in pairs, so the complementary option
always exists as well, but as it specifies the default, it is not described.
Options that specify values have names that start with --with. At the end of a
<b>configure</b> run, a summary of the configuration is output.
</P>
<br><a name="SEC3" href="#TOC1">BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES</a><br>
<P>
By default, a library called <b>libpcre2-8</b> is built, containing functions
that take string arguments contained in arrays of bytes, interpreted either as
single-byte characters, or UTF-8 strings. You can also build two other
libraries, called <b>libpcre2-16</b> and <b>libpcre2-32</b>, which process
strings that are contained in arrays of 16-bit and 32-bit code units,
respectively. These can be interpreted either as single-unit characters or
UTF-16/UTF-32 strings. To build these additional libraries, add one or both of
the following to the <b>configure</b> command:
<pre>
--enable-pcre2-16
--enable-pcre2-32
</pre>
If you do not want the 8-bit library, add
<pre>
--disable-pcre2-8
</pre>
as well. At least one of the three libraries must be built. Note that the POSIX
wrapper is for the 8-bit library only, and that <b>pcre2grep</b> is an 8-bit
program. Neither of these is built if you select only the 16-bit or 32-bit
libraries.
</P>
<br><a name="SEC4" href="#TOC1">BUILDING SHARED AND STATIC LIBRARIES</a><br>
<P>
The Autotools PCRE2 building process uses <b>libtool</b> to build both shared
and static libraries by default. You can suppress an unwanted library by adding
one of
<pre>
--disable-shared
--disable-static
</pre>
to the <b>configure</b> command. Setting --disable-shared ensures that PCRE2
libraries are built as static libraries. The binaries that are then created as
part of the build process (for example, <b>pcre2test</b> and <b>pcre2grep</b>)
are linked statically with one or more PCRE2 libraries, but may also be
dynamically linked with other libraries such as <b>libc</b>. If you want these
binaries to be fully statically linked, you can set LDFLAGS like this:
<br>
<br>
LDFLAGS=--static ./configure --disable-shared
<br>
<br>
Note the two hyphens in --static. Of course, this works only if static versions
of all the relevant libraries are available for linking.
</P>
<br><a name="SEC5" href="#TOC1">UNICODE AND UTF SUPPORT</a><br>
<P>
By default, PCRE2 is built with support for Unicode and UTF character strings.
To build it without Unicode support, add
<pre>
--disable-unicode
</pre>
to the <b>configure</b> command. This setting applies to all three libraries. It
is not possible to build one library with Unicode support and another without
in the same configuration.
</P>
<P>
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
or UTF-32. To do that, applications that use the library can set the PCRE2_UTF
option when they call <b>pcre2_compile()</b> to compile a pattern.
Alternatively, patterns may be started with (*UTF) unless the application has
locked this out by setting PCRE2_NEVER_UTF.
</P>
<P>
UTF support allows the libraries to process character code points up to
0x10ffff in the strings that they handle. Unicode support also gives access to
the Unicode properties of characters, using pattern escapes such as \P, \p,
and \X. Only the general category properties such as <i>Lu</i> and <i>Nd</i>,
script names, and some bi-directional properties are supported. Details are
given in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
</P>
<P>
Pattern escapes such as \d and \w do not by default make use of Unicode
properties. The application can request that they do by setting the PCRE2_UCP
option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
request this by starting with (*UCP).
</P>
<br><a name="SEC6" href="#TOC1">DISABLING THE USE OF \C</a><br>
<P>
The \C escape sequence, which matches a single code unit, even in a UTF mode,
can cause unpredictable behaviour because it may leave the current matching
point in the middle of a multi-code-unit character. The application can lock it
out by setting the PCRE2_NEVER_BACKSLASH_C option when calling
<b>pcre2_compile()</b>. There is also a build-time option
<pre>
--enable-never-backslash-C
</pre>
(note the upper case C) which locks out the use of \C entirely.
</P>
<br><a name="SEC7" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
Just-in-time (JIT) compiler support is included in the build by specifying
<pre>
--enable-jit
</pre>
This support is available only for certain hardware architectures. If this
option is set for an unsupported architecture, a build error occurs.
If in doubt, use
<pre>
--enable-jit=auto
</pre>
which enables JIT only if the current hardware is supported. You can check
if JIT is enabled in the configuration summary that is output at the end of a
<b>configure</b> run. If you are enabling JIT under SELinux you may also want to
add
<pre>
--enable-jit-sealloc
</pre>
which enables the use of an execmem allocator in JIT that is compatible with
SELinux. This has no effect if JIT is not enabled. See the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation for a discussion of JIT usage. When JIT support is enabled,
<b>pcre2grep</b> automatically makes use of it, unless you add
<pre>
--disable-pcre2grep-jit
</pre>
to the <b>configure</b> command.
</P>
<br><a name="SEC8" href="#TOC1">NEWLINE RECOGNITION</a><br>
<P>
By default, PCRE2 interprets the linefeed (LF) character as indicating the end
of a line. This is the normal newline character on Unix-like systems. You can
compile PCRE2 to use carriage return (CR) instead, by adding
<pre>
--enable-newline-is-cr
</pre>
to the <b>configure</b> command. There is also an --enable-newline-is-lf option,
which explicitly specifies linefeed as the newline character.
</P>
<P>
Alternatively, you can specify that line endings are to be indicated by the
two-character sequence CRLF (CR immediately followed by LF). If you want this,
add
<pre>
--enable-newline-is-crlf
</pre>
to the <b>configure</b> command. There is a fourth option, specified by
<pre>
--enable-newline-is-anycrlf
</pre>
which causes PCRE2 to recognize any of the three sequences CR, LF, or CRLF as
indicating a line ending. A fifth option, specified by
<pre>
--enable-newline-is-any
</pre>
causes PCRE2 to recognize any Unicode newline sequence. The Unicode newline
sequences are the three just mentioned, plus the single characters VT (vertical
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
separator, U+2028), and PS (paragraph separator, U+2029). The final option is
<pre>
--enable-newline-is-nul
</pre>
which causes NUL (binary zero) to be set as the default line-ending character.
</P>
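<P>
To make the six conventions concrete, the following sketch reports how long a
line ending is at a given position under each of them. It is an illustration of
the rules listed above, not PCRE2's own implementation: the enumeration and
function names are hypothetical, and the text is assumed to be an array of
Unicode code points.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the build-time newline conventions. */
typedef enum {
  SKETCH_NL_CR, SKETCH_NL_LF, SKETCH_NL_CRLF,
  SKETCH_NL_ANYCRLF, SKETCH_NL_ANY, SKETCH_NL_NUL
} sketch_newline;

/* Returns the length in code points of the newline that starts at text[pos],
   or 0 if there is none under the given convention. */
size_t sketch_newline_length(const uint32_t *text, size_t length, size_t pos,
                             sketch_newline convention)
{
  uint32_t c;
  int crlf;

  if (pos >= length) return 0;
  c = text[pos];
  crlf = c == 0x0D && pos + 1 < length && text[pos + 1] == 0x0A;

  switch (convention)
    {
    case SKETCH_NL_CR:   return c == 0x0D;
    case SKETCH_NL_LF:   return c == 0x0A;
    case SKETCH_NL_NUL:  return c == 0x00;
    case SKETCH_NL_CRLF: return crlf ? 2 : 0;
    case SKETCH_NL_ANYCRLF:
      if (crlf) return 2;
      return c == 0x0D || c == 0x0A;
    case SKETCH_NL_ANY:
      if (crlf) return 2;
      /* LF, VT, FF, CR, then NEL, LS and PS. */
      return (c >= 0x0A && c <= 0x0D) || c == 0x85 ||
             c == 0x2028 || c == 0x2029;
    }
  return 0;
}
```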
<P>
Whatever default line ending convention is selected when PCRE2 is built can be
overridden by applications that use the library. At build time it is
recommended to use the standard for your operating system.
</P>
<br><a name="SEC9" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
By default, the sequence \R in a pattern matches any Unicode newline sequence,
independently of what has been selected as the line ending sequence. If you
specify
<pre>
--enable-bsr-anycrlf
</pre>
the default is changed so that \R matches only CR, LF, or CRLF. Whatever is
selected when PCRE2 is built can be overridden by applications that use the
library.
</P>
<br><a name="SEC10" href="#TOC1">HANDLING VERY LARGE PATTERNS</a><br>
<P>
Within a compiled pattern, offset values are used to point from one part to
another (for example, from an opening parenthesis to an alternation
metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values
are used for these offsets, leading to a maximum size for a compiled pattern of
around 64 thousand code units. This is sufficient to handle all but the most
gigantic patterns. Nevertheless, some people do want to process truly enormous
patterns, so it is possible to compile PCRE2 to use three-byte or four-byte
offsets by adding a setting such as
<pre>
--with-link-size=3
</pre>
to the <b>configure</b> command. The value given must be 2, 3, or 4. For the
16-bit library, a value of 3 is rounded up to 4. In these libraries, using
longer offsets slows down the operation of PCRE2 because it has to load
additional data when handling them. For the 32-bit library the value is always
4 and cannot be overridden; the value of --with-link-size is ignored.
</P>
<br><a name="SEC11" href="#TOC1">LIMITING PCRE2 RESOURCE USAGE</a><br>
<P>
The <b>pcre2_match()</b> function increments a counter each time it goes round
its main loop. Putting a limit on this counter controls the amount of computing
resource used by a single call to <b>pcre2_match()</b>. The limit can be changed
at run time, as described in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation. The default is 10 million, but this can be changed by adding a
setting such as
<pre>
--with-match-limit=500000
</pre>
to the <b>configure</b> command. This setting also applies to the
<b>pcre2_dfa_match()</b> matching function, and to JIT matching (though the
counting is done differently).
</P>
<P>
The <b>pcre2_match()</b> function uses heap memory to record backtracking
points. The more nested backtracking points there are (that is, the deeper the
search tree), the more memory is needed. There is an upper limit, specified in
kibibytes (units of 1024 bytes). This limit can be changed at run time, as
described in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation. The default limit is 20 million KiB, which is in effect unlimited. You can
change this by a setting such as
<pre>
--with-heap-limit=500
</pre>
which limits the amount of heap to 500 KiB. This limit applies only to
interpretive matching in <b>pcre2_match()</b> and <b>pcre2_dfa_match()</b>, which
may also use the heap for internal workspace when processing complicated
patterns. This limit does not apply when JIT (which has its own memory
arrangements) is used.
</P>
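<P>
Because the heap limit is given in kibibytes, it may help to see the
conversion to bytes written out. The function name
<b>heap_limit_bytes()</b> is hypothetical and is used only for this sketch.
</P>

```c
#include <stdint.h>

/* The heap limit is expressed in kibibytes (units of 1024 bytes). A 64-bit
   result is used because the default of 20 million KiB does not fit in
   32 bits once it has been converted to bytes. */
uint64_t heap_limit_bytes(uint32_t limit_kib)
{
    return (uint64_t)limit_kib * 1024u;
}
```

<P>
With --with-heap-limit=500 this gives 512000 bytes.
</P>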
<P>
You can also explicitly limit the depth of nested backtracking in the
<b>pcre2_match()</b> interpreter. This limit defaults to the value that is set
for --with-match-limit. You can set a lower default limit by adding, for
example,
<pre>
--with-match-limit-depth=10000
</pre>
to the <b>configure</b> command. This value can be overridden at run time. This
depth limit indirectly limits the amount of heap memory that is used, but
because the size of each backtracking "frame" depends on the number of
capturing parentheses in a pattern, the amount of heap that is used before the
limit is reached varies from pattern to pattern. This limit was more useful in
versions before 10.30, where function recursion was used for backtracking.
</P>
<P>
As well as applying to <b>pcre2_match()</b>, the depth limit also controls
the depth of recursive function calls in <b>pcre2_dfa_match()</b>. These are
used for lookaround assertions, atomic groups, and recursion within patterns.
The limit does not apply to JIT matching.
</P>
<br><a name="SEC12" href="#TOC1">LIMITING VARIABLE-LENGTH LOOKBEHIND ASSERTIONS</a><br>
<P>
Lookbehind assertions in which one or more branches can match a variable number
of characters are supported only if there is a maximum matching length for each
top-level branch. There is a limit to this maximum that defaults to 255
characters. You can alter this default by a setting such as
<pre>
--with-max-varlookbehind=100
</pre>
The limit can be changed at runtime by calling
<b>pcre2_set_max_varlookbehind()</b>. Lookbehind assertions in which every
branch matches a fixed number of characters (not necessarily all the same) are
not constrained by this limit.
<a name="createtables"></a></P>
<br><a name="SEC13" href="#TOC1">CREATING CHARACTER TABLES AT BUILD TIME</a><br>
<P>
PCRE2 uses fixed tables for processing characters whose code points are less
than 256. By default, PCRE2 is built with a set of tables that are distributed
in the file <i>src/pcre2_chartables.c.dist</i>. These tables are for ASCII codes
only. If you add
<pre>
--enable-rebuild-chartables
</pre>
to the <b>configure</b> command, the distributed tables are no longer used.
Instead, a program called <b>pcre2_dftables</b> is compiled and run. This
outputs the source for a new set of tables, created in the default locale of
your C run-time system. This method of replacing the tables does not work if
you are cross compiling, because <b>pcre2_dftables</b> needs to be run on the
local host and therefore must not be compiled with the cross compiler.
</P>
<P>
If you need to create alternative tables when cross compiling, you will have to
do so "by hand". There may also be other reasons for creating tables manually.
To cause <b>pcre2_dftables</b> to be built on the local host, run a normal
compiling command, and then run the program with the output file as its
argument, for example:
<pre>
cc src/pcre2_dftables.c -o pcre2_dftables
./pcre2_dftables src/pcre2_chartables.c
</pre>
This builds the tables in the default locale of the local host. If you want to
specify a locale, you must use the -L option:
<pre>
LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c
</pre>
You can also specify -b (with or without -L). This causes the tables to be
written in binary instead of as source code. A set of binary tables can be
loaded into memory by an application and passed to <b>pcre2_compile()</b> in the
same way as tables created by calling <b>pcre2_maketables()</b>. The tables are
just a string of bytes, independent of hardware characteristics such as
endianness. This means they can be bundled with an application that runs in
different environments, to ensure consistent behaviour.
</P>
<br><a name="SEC14" href="#TOC1">USING EBCDIC CODE</a><br>
<P>
PCRE2 assumes by default that it will run in an environment where the character
code is ASCII or Unicode, which is a superset of ASCII. This is the case for
most computer operating systems. PCRE2 can, however, be compiled to run in an
8-bit EBCDIC environment by adding
<pre>
--enable-ebcdic --disable-unicode
</pre>
to the <b>configure</b> command. This setting implies
--enable-rebuild-chartables. You should only use it if you know that you are in
an EBCDIC environment (for example, an IBM mainframe operating system).
</P>
<P>
It is not possible to support both EBCDIC and UTF-8 codes in the same version
of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
exclusive.
</P>
<P>
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
value 0x15 by default. However, in some EBCDIC environments, 0x25 is used. In
such an environment you should use
<pre>
--enable-ebcdic-nl25
</pre>
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR has the
same value as in ASCII, namely, 0x0d. Whichever of 0x15 and 0x25 is <i>not</i>
chosen as LF is made to correspond to the Unicode NEL character (which, in
Unicode, is 0x85).
</P>
<P>
The options that select newline behaviour, such as --enable-newline-is-cr,
and equivalent run-time options, refer to these character values in an EBCDIC
environment.
</P>
<br><a name="SEC15" href="#TOC1">PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS</a><br>
<P>
By default <b>pcre2grep</b> supports the use of callouts with string arguments
within the patterns it is matching. There are two kinds: one that generates
output using local code, and another that calls an external program or script.
If --disable-pcre2grep-callout-fork is added to the <b>configure</b> command,
only the first kind of callout is supported; if --disable-pcre2grep-callout is
used, all callouts are completely ignored. For more details of <b>pcre2grep</b>
callouts, see the
<a href="pcre2grep.html"><b>pcre2grep</b></a>
documentation.
</P>
<br><a name="SEC16" href="#TOC1">PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT</a><br>
<P>
By default, <b>pcre2grep</b> reads all files as plain text. You can build it so
that it recognizes files whose names end in <b>.gz</b> or <b>.bz2</b>, and reads
them with <b>libz</b> or <b>libbz2</b>, respectively, by adding one or both of
<pre>
--enable-pcre2grep-libz
--enable-pcre2grep-libbz2
</pre>
to the <b>configure</b> command. These options naturally require that the
relevant libraries are installed on your system. Configuration will fail if
they are not.
</P>
<br><a name="SEC17" href="#TOC1">PCRE2GREP BUFFER SIZE</a><br>
<P>
<b>pcre2grep</b> uses an internal buffer to hold a "window" on the file it is
scanning, in order to be able to output "before" and "after" lines when it
finds a match. The default starting size of the buffer is 20KiB. The buffer
itself is three times this size, but because of the way it is used for holding
"before" lines, the longest line that is guaranteed to be processable is the
notional buffer size. If a longer line is encountered, <b>pcre2grep</b>
automatically expands the buffer, up to a specified maximum size, whose default
is 1MiB or the starting size, whichever is the larger. You can change the
default parameter values by adding, for example,
<pre>
--with-pcre2grep-bufsize=51200
--with-pcre2grep-max-bufsize=2097152
</pre>
to the <b>configure</b> command. The caller of <b>pcre2grep</b> can override
these values by using --buffer-size and --max-buffer-size on the command line.
</P>
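<P>
The sizing rules for the <b>pcre2grep</b> buffer can be sketched as follows.
Both functions are hypothetical restatements of the text and are not taken
from the <b>pcre2grep</b> source.
</P>

```c
#include <stddef.h>

/* The allocated buffer is three times the notional (starting) size. */
size_t grep_allocated_bytes(size_t bufsize)
{
    return 3 * bufsize;
}

/* The default maximum is 1 MiB or the starting size, whichever is larger. */
size_t grep_default_max_bufsize(size_t bufsize)
{
    const size_t one_mib = (size_t)1024 * 1024;
    return (bufsize > one_mib) ? bufsize : one_mib;
}
```

<P>
With the default starting size of 20 KiB (20480 bytes), the allocated buffer
is 61440 bytes and the default maximum is 1048576 bytes.
</P>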
<br><a name="SEC18" href="#TOC1">PCRE2TEST OPTION FOR LIBREADLINE SUPPORT</a><br>
<P>
If you add one of
<pre>
--enable-pcre2test-libreadline
--enable-pcre2test-libedit
</pre>
to the <b>configure</b> command, <b>pcre2test</b> is linked with the
<b>libreadline</b> or <b>libedit</b> library, respectively, and when its input is
from a terminal, it reads it using the <b>readline()</b> function. This provides
line-editing and history facilities. Note that <b>libreadline</b> is
GPL-licensed, so if you distribute a binary of <b>pcre2test</b> linked in this
way, there may be licensing issues. These can be avoided by linking instead
with <b>libedit</b>, which has a BSD licence.
</P>
<P>
Setting --enable-pcre2test-libreadline causes the <b>-lreadline</b> option to be
added to the <b>pcre2test</b> build. In many operating environments with a
system-installed readline library this is sufficient. However, in some
environments (e.g. if an unmodified distribution version of readline is in
use), some extra configuration may be necessary. The INSTALL file for
<b>libreadline</b> says this:
<pre>
"Readline uses the termcap functions, but does not link with
the termcap or curses library itself, allowing applications
which link with readline the to choose an appropriate library."
</pre>
If your environment has not been set up so that an appropriate library is
automatically included, you may need to add something like
<pre>
LIBS="-lncurses"
</pre>
immediately before the <b>configure</b> command.
</P>
<br><a name="SEC19" href="#TOC1">INCLUDING DEBUGGING CODE</a><br>
<P>
If you add
<pre>
--enable-debug
</pre>
to the <b>configure</b> command, additional debugging code is included in the
build. This feature is intended for use by the PCRE2 maintainers.
</P>
<br><a name="SEC20" href="#TOC1">DEBUGGING WITH VALGRIND SUPPORT</a><br>
<P>
If you add
<pre>
--enable-valgrind
</pre>
to the <b>configure</b> command, PCRE2 will use valgrind annotations to mark
certain memory regions as unaddressable. This allows it to detect invalid
memory accesses, and is mostly useful for debugging PCRE2 itself.
</P>
<br><a name="SEC21" href="#TOC1">CODE COVERAGE REPORTING</a><br>
<P>
If your C compiler is gcc, you can build a version of PCRE2 that can generate a
code coverage report for its test suite. To enable this, you must install
<b>lcov</b> version 1.6 or above. Then specify
<pre>
--enable-coverage
</pre>
to the <b>configure</b> command and build PCRE2 in the usual way.
</P>
<P>
Note that using <b>ccache</b> (a caching C compiler) is incompatible with code
coverage reporting. If you have configured <b>ccache</b> to run automatically
on your system, you must set the environment variable
<pre>
CCACHE_DISABLE=1
</pre>
before running <b>make</b> to build PCRE2, so that <b>ccache</b> is not used.
</P>
<P>
When --enable-coverage is used, the following additional targets are added to the
<i>Makefile</i>:
<pre>
make coverage
</pre>
This creates a fresh coverage report for the PCRE2 test suite. It is equivalent
to running "make coverage-reset", "make coverage-baseline", "make check", and
then "make coverage-report".
<pre>
make coverage-reset
</pre>
This zeroes the coverage counters, but does nothing else.
<pre>
make coverage-baseline
</pre>
This captures baseline coverage information.
<pre>
make coverage-report
</pre>
This creates the coverage report.
<pre>
make coverage-clean-report
</pre>
This removes the generated coverage report without cleaning the coverage data
itself.
<pre>
make coverage-clean-data
</pre>
This removes the captured coverage data without removing the coverage files
created at compile time (*.gcno).
<pre>
make coverage-clean
</pre>
This cleans all coverage data including the generated coverage report. For more
information about code coverage, see the <b>gcov</b> and <b>lcov</b>
documentation.
</P>
<br><a name="SEC22" href="#TOC1">DISABLING THE Z AND T FORMATTING MODIFIERS</a><br>
<P>
The C99 standard defines formatting modifiers z and t for size_t and
ptrdiff_t values, respectively. By default, PCRE2 uses these modifiers in
environments other than old versions of Microsoft Visual Studio when
__STDC_VERSION__ is defined and has a value greater than or equal to 199901L
(indicating support for C99).
However, there is at least one environment that claims to be C99 but does not
support these modifiers. If
<pre>
--disable-percent-zt
</pre>
is specified, no use is made of the z or t modifiers. Instead of %td or %zu,
a suitable format is used depending on the size of long for the platform.
</P>
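<P>
The following sketch shows the general shape of such a fallback. It is not
PCRE2's own code: the macro names (including DEMO_DISABLE_PERCENT_ZT) and
<b>format_size()</b> are hypothetical, and the fallback always casts to
unsigned long, which is a simplification that would truncate on a platform
where size_t is wider than long.
</P>

```c
#include <stddef.h>
#include <stdio.h>

/* Use the C99 z modifier when the compiler reports C99 support and the
   hypothetical DEMO_DISABLE_PERCENT_ZT macro has not been defined;
   otherwise fall back to unsigned long with a cast. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L && \
    !defined(DEMO_DISABLE_PERCENT_ZT)
#define DEMO_SIZE_FMT "%zu"
#define DEMO_SIZE_ARG(x) (x)
#else
#define DEMO_SIZE_FMT "%lu"
#define DEMO_SIZE_ARG(x) ((unsigned long)(x))
#endif

/* Writes the decimal form of 'value' into 'buf', which must hold at least
   32 bytes (enough for any 64-bit value plus the terminating zero).
   Returns the number of characters written, as reported by sprintf(). */
int format_size(char buf[32], size_t value)
{
    return sprintf(buf, DEMO_SIZE_FMT, DEMO_SIZE_ARG(value));
}
```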
<br><a name="SEC23" href="#TOC1">SUPPORT FOR FUZZERS</a><br>
<P>
There is a special option for use by people who want to run fuzzing tests on
PCRE2:
<pre>
--enable-fuzz-support
</pre>
At present this applies only to the 8-bit library. If set, it causes an extra
library called libpcre2-fuzzsupport.a to be built, but not installed. This
contains a single function called LLVMFuzzerTestOneInput() whose arguments are
a pointer to a string and the length of the string. When called, this function
tries to compile the string as a pattern, and if that succeeds, to match it.
This is done both with no options and with some random option bits that are
generated from the string.
</P>
<P>
Setting --enable-fuzz-support also causes a binary called <b>pcre2fuzzcheck</b>
to be created. This is normally run under valgrind or used when PCRE2 is
compiled with address sanitizing enabled. It calls the fuzzing function and
outputs information about what it is doing. The input strings are specified by
arguments: if an argument starts with "=" the rest of it is a literal input
string. Otherwise, it is assumed to be a file name, and the contents of the
file are the test string.
</P>
<br><a name="SEC24" href="#TOC1">OBSOLETE OPTION</a><br>
<P>
In versions of PCRE2 prior to 10.30, there were two ways of handling
backtracking in the <b>pcre2_match()</b> function. The default was to use the
system stack, but if
<pre>
--disable-stack-for-recursion
</pre>
was set, memory on the heap was used. From release 10.30 onwards this has
changed (the stack is no longer used) and this option now does nothing except
give a warning.
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2api</b>(3), <b>pcre2-config</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 16 April 2024
<br>
Copyright &copy; 1997-2024 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,480 @@
<html>
<head>
<title>pcre2callout specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2callout man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">MISSING CALLOUTS</a>
<li><a name="TOC4" href="#SEC4">THE CALLOUT INTERFACE</a>
<li><a name="TOC5" href="#SEC5">RETURN VALUES FROM CALLOUTS</a>
<li><a name="TOC6" href="#SEC6">CALLOUT ENUMERATION</a>
<li><a name="TOC7" href="#SEC7">AUTHOR</a>
<li><a name="TOC8" href="#SEC8">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int (*pcre2_callout)(pcre2_callout_block *, void *);</b>
<br>
<br>
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
<b> int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
<b> void *<i>user_data</i>);</b>
</P>
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
<P>
PCRE2 provides a feature called "callout", which is a means of temporarily
passing control to the caller of PCRE2 in the middle of pattern matching. The
caller of PCRE2 provides an external function by putting its entry point in
a match context (see <b>pcre2_set_callout()</b> in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation).
</P>
<P>
When using the <b>pcre2_substitute()</b> function, an additional callout feature
is available. This does a callout after each change to the subject string and
is described in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation; the rest of this document is concerned with callouts during
pattern matching.
</P>
<P>
Within a regular expression, (?C&#60;arg&#62;) indicates a point at which the external
function is to be called. Different callout points can be identified by putting
a number less than 256 after the letter C. The default value is zero.
Alternatively, the argument may be a delimited string. The starting delimiter
must be one of ` ' " ^ % # $ { and the ending delimiter is the same as the
start, except for {, where the ending delimiter is }. If the ending delimiter
is needed within the string, it must be doubled. For example, this pattern has
two callout points:
<pre>
(?C1)abc(?C"some ""arbitrary"" text")def
</pre>
If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, PCRE2
automatically inserts callouts, all with number 255, before each item in the
pattern except for immediately before or after an explicit callout. For
example, if PCRE2_AUTO_CALLOUT is used with the pattern
<pre>
A(?C3)B
</pre>
it is processed as if it were
<pre>
(?C255)A(?C3)B(?C255)
</pre>
Here is a more complicated example:
<pre>
A(\d{2}|--)
</pre>
With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
<pre>
(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
</pre>
Notice that there is a callout before and after each parenthesis and
alternation bar. If the pattern contains a conditional group whose condition is
an assertion, an automatic callout is inserted immediately before the
condition. Such a callout may also be inserted explicitly, for example:
<pre>
(?(?C9)(?=a)ab|de) (?(?C%text%)(?!=d)ab|de)
</pre>
This applies only to assertion conditions (because they are themselves
independent groups).
</P>
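<P>
The delimiter rules for string arguments described above can be illustrated
with a small decoder. This is a sketch of the rules only, not the code that
PCRE2 uses when compiling a pattern; <b>decode_callout_string()</b> and its
error convention (a return value of -1) are hypothetical.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE2): decodes a delimited callout
   argument. 'src' points at the opening delimiter, that is, the character
   immediately after "(?C". The decoded string is written to 'out', whose
   capacity 'outsize' includes the terminating zero. Returns the decoded
   length, or -1 for an invalid delimiter, a missing closing delimiter, or
   insufficient space. */
int decode_callout_string(const char *src, char *out, size_t outsize)
{
    const char *starters = "`'\"^%#${";
    char open = src[0];
    char close;
    size_t n = 0;
    const char *p;

    if (open == '\0' || strchr(starters, open) == NULL) return -1;
    close = (open == '{') ? '}' : open;   /* { is closed by }, others by self */

    for (p = src + 1; *p != '\0'; p++) {
        if (*p == close) {
            if (p[1] != close) {          /* a single closer ends the string */
                if (n >= outsize) return -1;
                out[n] = '\0';
                return (int)n;
            }
            p++;                          /* doubled closer: keep one copy */
        }
        if (n + 1 >= outsize) return -1;  /* leave room for the final zero */
        out[n++] = *p;
    }
    return -1;                            /* ran off the end: unterminated */
}
```

<P>
Applied to the text that follows (?C in the second callout of the example
pattern above, the decoder yields the string some "arbitrary" text.
</P>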
<P>
Callouts can be useful for tracking the progress of pattern matching. The
<a href="pcre2test.html"><b>pcre2test</b></a>
program has a pattern qualifier (/auto_callout) that sets automatic callouts.
When any callouts are present, the output from <b>pcre2test</b> indicates how
the pattern is being matched. This is useful information when you are trying to
optimize the performance of a particular pattern.
</P>
<br><a name="SEC3" href="#TOC1">MISSING CALLOUTS</a><br>
<P>
You should be aware that, because of optimizations in the way PCRE2 compiles
and matches patterns, callouts sometimes do not happen exactly as you might
expect.
</P>
<br><b>
Auto-possessification
</b><br>
<P>
At compile time, PCRE2 "auto-possessifies" repeated items when it knows that
what follows cannot be part of the repeat. For example, a+[bc] is compiled as
if it were a++[bc]. The <b>pcre2test</b> output when this pattern is compiled
with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied to the string
"aaaa" is:
<pre>
---&#62;aaaa
+0 ^ a+
+2 ^ ^ [bc]
No match
</pre>
This indicates that when matching [bc] fails, there is no backtracking into a+
(because it is being treated as a++) and therefore the callouts that would be
taken for the backtracks do not occur. You can disable the auto-possessify
feature by passing PCRE2_NO_AUTO_POSSESS to <b>pcre2_compile()</b>, or starting
the pattern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
<pre>
---&#62;aaaa
+0 ^ a+
+2 ^ ^ [bc]
+2 ^ ^ [bc]
+2 ^ ^ [bc]
+2 ^^ [bc]
No match
</pre>
This time, when matching [bc] fails, the matcher backtracks into a+ and tries
again, repeatedly, until a+ itself fails.
</P>
<br><b>
Automatic .* anchoring
</b><br>
<P>
By default, an optimization is applied when .* is the first significant item in
a pattern. If PCRE2_DOTALL is set, so that the dot can match any character, the
pattern is automatically anchored. If PCRE2_DOTALL is not set, a match can
start only after an internal newline or at the beginning of the subject, and
<b>pcre2_compile()</b> remembers this. If a pattern has more than one top-level
branch, automatic anchoring occurs if all branches are anchorable.
</P>
<P>
This optimization is disabled, however, if .* is in an atomic group or if there
is a backreference to the capture group in which it appears. It is also
disabled if the pattern contains (*PRUNE) or (*SKIP). However, the presence of
callouts does not affect it.
</P>
<P>
For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT and
applied to the string "aa", the <b>pcre2test</b> output is:
<pre>
---&#62;aa
+0 ^ .*
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
No match
</pre>
This shows that all match attempts start at the beginning of the subject. In
other words, the pattern is anchored. You can disable this optimization by
passing PCRE2_NO_DOTSTAR_ANCHOR to <b>pcre2_compile()</b>, or starting the
pattern with (*NO_DOTSTAR_ANCHOR). In this case, the output changes to:
<pre>
---&#62;aa
+0 ^ .*
+2 ^ ^ \d
+2 ^^ \d
+2 ^ \d
+0 ^ .*
+2 ^^ \d
+2 ^ \d
No match
</pre>
This shows more match attempts, starting at the second subject character.
Another optimization, described in the next section, means that there is no
subsequent attempt to match with an empty subject.
</P>
<br><b>
Other optimizations
</b><br>
<P>
Other optimizations that provide fast "no match" results also affect callouts.
For example, if the pattern is
<pre>
ab(?C4)cd
</pre>
PCRE2 knows that any matching string must contain the letter "d". If the
subject string is "abyz", the lack of "d" means that matching doesn't ever
start, and the callout is never reached. However, with "abyd", though the
result is still no match, the callout is obeyed.
</P>
<P>
For most patterns PCRE2 also knows the minimum length of a matching string, and
will immediately give a "no match" return without actually running a match if
the subject is not long enough, or, for unanchored patterns, if it has been
scanned far enough.
</P>
<P>
You can disable these optimizations by passing the PCRE2_NO_START_OPTIMIZE
option to <b>pcre2_compile()</b>, or by starting the pattern with
(*NO_START_OPT). This slows down the matching process, but does ensure that
callouts such as the example above are obeyed.
<a name="calloutinterface"></a></P>
<br><a name="SEC4" href="#TOC1">THE CALLOUT INTERFACE</a><br>
<P>
During matching, when PCRE2 reaches a callout point, if an external function is
provided in the match context, it is called. This applies to normal, DFA, and
JIT matching. The first argument to the callout function is a pointer to a
<b>pcre2_callout_block</b> structure. The second argument is the void * callout data
that was supplied when the callout was set up by calling
<b>pcre2_set_callout()</b> (see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation). The callout block structure contains the following fields, not
necessarily in this order:
<pre>
uint32_t <i>version</i>;
uint32_t <i>callout_number</i>;
uint32_t <i>capture_top</i>;
uint32_t <i>capture_last</i>;
uint32_t <i>callout_flags</i>;
PCRE2_SIZE *<i>offset_vector</i>;
PCRE2_SPTR <i>mark</i>;
PCRE2_SPTR <i>subject</i>;
PCRE2_SIZE <i>subject_length</i>;
PCRE2_SIZE <i>start_match</i>;
PCRE2_SIZE <i>current_position</i>;
PCRE2_SIZE <i>pattern_position</i>;
PCRE2_SIZE <i>next_item_length</i>;
PCRE2_SIZE <i>callout_string_offset</i>;
PCRE2_SIZE <i>callout_string_length</i>;
PCRE2_SPTR <i>callout_string</i>;
</pre>
The <i>version</i> field contains the version number of the block format. The
current version is 2; the three callout string fields were added for version 1,
and the <i>callout_flags</i> field for version 2. If you are writing an
application that might use an earlier release of PCRE2, you should check the
version number before accessing any of these fields. The version number will
increase in future if more fields are added, but the intention is never to
remove any of the existing fields.
</P>
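<P>
A version check of the kind recommended above might look like the following
sketch. To keep the example compilable without the PCRE2 headers, it uses a
hypothetical stand-in structure, <b>demo_callout_block</b>, containing only
the fields it needs; real code would include pcre2.h and apply the same
tests to the block that is passed to the callout function.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the parts of the callout block used here, so that the sketch
   compiles without the PCRE2 headers. */
typedef struct {
    uint32_t version;
    uint32_t callout_number;
    uint32_t callout_flags;        /* meaningful only when version >= 2 */
    size_t callout_string_length;  /* meaningful only when version >= 1 */
    const char *callout_string;    /* meaningful only when version >= 1 */
} demo_callout_block;

/* Returns the callout flags, or 0 when the block is too old to have them. */
uint32_t demo_flags(const demo_callout_block *cb)
{
    return (cb->version >= 2) ? cb->callout_flags : 0;
}

/* Returns 1 for a string callout and 0 for a numerical one. Blocks older
   than version 1 have no string fields, so they are treated as numerical. */
int demo_is_string_callout(const demo_callout_block *cb)
{
    return cb->version >= 1 && cb->callout_string != NULL;
}
```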
<br><b>
Fields for numerical callouts
</b><br>
<P>
For a numerical callout, <i>callout_string</i> is NULL, and <i>callout_number</i>
contains the number of the callout, in the range 0-255. This is the number
that follows (?C for callouts that are part of the pattern; it is 255 for
automatically generated callouts.
</P>
<br><b>
Fields for string callouts
</b><br>
<P>
For callouts with string arguments, <i>callout_number</i> is always zero, and
<i>callout_string</i> points to the string that is contained within the compiled
pattern. Its length is given by <i>callout_string_length</i>. Duplicated ending
delimiters that were present in the original pattern string have been turned
into single characters, but there is no other processing of the callout string
argument. An additional code unit containing binary zero is present after the
string, but is not included in the length. The delimiter that was used to start
the string is also stored within the pattern, immediately before the string
itself. You can access this delimiter as <i>callout_string</i>[-1] if you need
it.
</P>
<P>
The <i>callout_string_offset</i> field is the code unit offset to the start of
the callout argument string within the original pattern string. This is
provided for the benefit of applications such as script languages that might
need to report errors in the callout string within the pattern.
</P>
<br><b>
Fields for all callouts
</b><br>
<P>
The remaining fields in the callout block are the same for both kinds of
callout.
</P>
<P>
The <i>offset_vector</i> field is a pointer to a vector of capturing offsets
(the "ovector"). You may read the elements in this vector, but you must not
change any of them.
</P>
<P>
For calls to <b>pcre2_match()</b>, the <i>offset_vector</i> field is not (since
release 10.30) a pointer to the actual ovector that was passed to the matching
function in the match data block. Instead it points to an internal ovector of a
size large enough to hold all possible captured substrings in the pattern. Note
that whenever a recursion or subroutine call within a pattern completes, the
capturing state is reset to what it was before.
</P>
<P>
The <i>capture_last</i> field contains the number of the most recently captured
substring, and the <i>capture_top</i> field contains one more than the number of
the highest numbered captured substring so far. If no substrings have yet been
captured, the value of <i>capture_last</i> is 0 and the value of
<i>capture_top</i> is 1. The values of these fields do not always differ by one;
for example, when the callout in the pattern ((a)(b))(?C2) is taken,
<i>capture_last</i> is 1 but <i>capture_top</i> is 4.
</P>
<P>
The contents of ovector[2] to ovector[&#60;capture_top&#62;*2-1] can be inspected in
order to extract substrings that have been matched so far, in the same way as
extracting substrings after a match has completed. The values in ovector[0] and
ovector[1] are always PCRE2_UNSET because the match is by definition not
complete. Substrings that have not been captured but whose numbers are less
than <i>capture_top</i> also have both of their ovector slots set to
PCRE2_UNSET.
</P>
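<P>
As an illustration of how the ovector might be walked from inside a callout,
the sketch below counts the groups that are currently set. DEMO_UNSET is a
stand-in for PCRE2_UNSET so that the example compiles without pcre2.h, and
<b>count_set_captures()</b> is a hypothetical name.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PCRE2_UNSET so that the sketch compiles without pcre2.h. */
#define DEMO_UNSET (~(size_t)0)

/* Counts the groups that are set in a callout's offset vector. Pair 0
   (ovector[0] and ovector[1]) is skipped because it is always unset during
   a callout; groups 1 to capture_top-1 occupy ovector[2] to
   ovector[capture_top*2-1]. */
uint32_t count_set_captures(const size_t *ovector, uint32_t capture_top)
{
    uint32_t n, count = 0;
    for (n = 1; n < capture_top; n++) {
        if (ovector[2 * n] != DEMO_UNSET && ovector[2 * n + 1] != DEMO_UNSET)
            count++;
    }
    return count;
}
```

<P>
For the pattern ((a)(b))(?C2) matched against "ab", all three groups are set
when the callout is taken, so the function returns 3 for a
<i>capture_top</i> of 4.
</P>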
<P>
For DFA matching, the <i>offset_vector</i> field points to the ovector that was
passed to the matching function in the match data block for callouts at the top
level, but to an internal ovector during the processing of pattern recursions,
lookarounds, and atomic groups. However, these ovectors hold no useful
information because <b>pcre2_dfa_match()</b> does not support substring
capturing. The value of <i>capture_top</i> is always 1 and the value of
<i>capture_last</i> is always 0 for DFA matching.
</P>
<P>
The <i>subject</i> and <i>subject_length</i> fields contain copies of the values
that were passed to the matching function.
</P>
<P>
The <i>start_match</i> field normally contains the offset within the subject at
which the current match attempt started. However, if the escape sequence \K
has been encountered, this value is changed to reflect the modified starting
point. If the pattern is not anchored, the callout function may be called
several times from the same point in the pattern for different starting points
in the subject.
</P>
<P>
The <i>current_position</i> field contains the offset within the subject of the
current match pointer.
</P>
<P>
The <i>pattern_position</i> field contains the offset in the pattern string to
the next item to be matched.
</P>
<P>
The <i>next_item_length</i> field contains the length of the next item to be
processed in the pattern string. When the callout is at the end of the pattern,
the length is zero. When the callout precedes an opening parenthesis, the
length includes meta characters that follow the parenthesis. For example, in a
callout before an assertion such as (?=ab) the length is 3. For an alternation
bar or a closing parenthesis, the length is one, unless a closing parenthesis
is followed by a quantifier, in which case its length is included. (This
changed in release 10.23. In earlier releases, before an opening parenthesis
the length was that of the entire group, and before an alternation bar or a
closing parenthesis the length was zero.)
</P>
<P>
The <i>pattern_position</i> and <i>next_item_length</i> fields are intended to
help in distinguishing between different automatic callouts, which all have the
same callout number. However, they are set for all callouts, and are used by
<b>pcre2test</b> to show the next item to be matched when displaying callout
information.
</P>
<P>
In callouts from <b>pcre2_match()</b> the <i>mark</i> field contains a pointer to
the zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
(*THEN) item in the match, or NULL if no such items have been passed. Instances
of (*PRUNE) or (*THEN) without a name do not obliterate a previous (*MARK). In
callouts from the DFA matching function this field always contains NULL.
</P>
<P>
The <i>callout_flags</i> field is always zero in callouts from
<b>pcre2_dfa_match()</b> or when JIT is being used. When <b>pcre2_match()</b>
without JIT is used, the following bits may be set:
<pre>
PCRE2_CALLOUT_STARTMATCH
</pre>
This is set for the first callout after the start of matching for each new
starting position in the subject.
<pre>
PCRE2_CALLOUT_BACKTRACK
</pre>
This is set if there has been a matching backtrack since the previous callout,
or since the start of matching if this is the first callout from a
<b>pcre2_match()</b> run.
</P>
<P>
Both bits are set when a backtrack has caused a "bumpalong" to a new starting
position in the subject. Output from <b>pcre2test</b> does not indicate the
presence of these bits unless the <b>callout_extra</b> modifier is set.
</P>
<P>
The information in the <b>callout_flags</b> field is provided so that
applications can track and tell their users how matching with backtracking is
done. This can be useful when trying to optimize patterns, or just to
understand how PCRE2 works. There is no support in <b>pcre2_dfa_match()</b>
because there is no backtracking in DFA matching, and there is no support in
JIT because JIT is all about maximizing matching performance. In both these
cases the <b>callout_flags</b> field is always zero.
</P>
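<P>
An application that reports these flags to its users might decode them along
the following lines. The bit values below are placeholders chosen so that the
sketch compiles without pcre2.h; real code should test the
PCRE2_CALLOUT_STARTMATCH and PCRE2_CALLOUT_BACKTRACK macros instead. The
function name and the wording of the messages are hypothetical.
</P>

```c
#include <stdint.h>

/* Placeholder bit values for this sketch only. */
#define DEMO_CALLOUT_STARTMATCH 0x1u
#define DEMO_CALLOUT_BACKTRACK  0x2u

/* Describes what the flags say has happened since the previous callout. */
const char *describe_callout_flags(uint32_t flags)
{
    int start = (flags & DEMO_CALLOUT_STARTMATCH) != 0;
    int back = (flags & DEMO_CALLOUT_BACKTRACK) != 0;

    if (start && back)
        return "first callout for this starting position, after a backtrack";
    if (start)
        return "first callout for this starting position";
    if (back)
        return "backtrack since the previous callout";
    return "no backtrack";
}
```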
<br><a name="SEC5" href="#TOC1">RETURN VALUES FROM CALLOUTS</a><br>
<P>
The external callout function returns an integer to PCRE2. If the value is
zero, matching proceeds as normal. If the value is greater than zero, matching
fails at the current point, but the testing of other matching possibilities
goes ahead, just as if a lookahead assertion had failed. If the value is less
than zero, the match is abandoned, and the matching function returns the
negative value.
</P>
<P>
Negative values should normally be chosen from the set of PCRE2_ERROR_xxx
values. In particular, PCRE2_ERROR_NOMATCH forces a standard "no match"
failure. The error number PCRE2_ERROR_CALLOUT is reserved for use by callout
functions; it will never be used by PCRE2 itself.
</P>
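<P>
The three cases can be summarized in code. This sketch only restates the
rules above; the enumeration and <b>classify_callout_return()</b> are
hypothetical names and do not exist in PCRE2.
</P>

```c
/* What the matcher does with a callout function's return value. */
typedef enum {
    CALLOUT_CONTINUE,    /* zero: matching proceeds as normal */
    CALLOUT_FAIL_HERE,   /* positive: fail at this point, try alternatives */
    CALLOUT_ABANDON      /* negative: give up; the value goes to the caller */
} callout_action;

callout_action classify_callout_return(int rc)
{
    if (rc == 0) return CALLOUT_CONTINUE;
    return (rc > 0) ? CALLOUT_FAIL_HERE : CALLOUT_ABANDON;
}
```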
<br><a name="SEC6" href="#TOC1">CALLOUT ENUMERATION</a><br>
<P>
<b>int pcre2_callout_enumerate(const pcre2_code *<i>code</i>,</b>
<b> int (*<i>callback</i>)(pcre2_callout_enumerate_block *, void *),</b>
<b> void *<i>user_data</i>);</b>
<br>
<br>
A script language that supports the use of string arguments in callouts might
like to scan all the callouts in a pattern before running the match. This can
be done by calling <b>pcre2_callout_enumerate()</b>. The first argument is a
pointer to a compiled pattern, the second points to a callback function, and
the third is arbitrary user data. The callback function is called for every
callout in the pattern in the order in which they appear. Its first argument is
a pointer to a callout enumeration block, and its second argument is the
<i>user_data</i> value that was passed to <b>pcre2_callout_enumerate()</b>. The
data block contains the following fields:
<pre>
<i>version</i> Block version number
<i>pattern_position</i> Offset to next item in pattern
<i>next_item_length</i> Length of next item in pattern
<i>callout_number</i> Number for numbered callouts
<i>callout_string_offset</i> Offset to string within pattern
<i>callout_string_length</i> Length of callout string
<i>callout_string</i> Points to callout string or is NULL
</pre>
The version number is currently 0. It will increase if new fields are ever
added to the block. The remaining fields are the same as their namesakes in the
<b>pcre2_callout</b> block that is used for callouts during matching, as
described
<a href="#calloutinterface">above.</a>
</P>
<P>
Note that the value of <i>pattern_position</i> is unique for each callout.
However, if a callout occurs inside a group that is quantified with a non-zero
minimum or a fixed maximum, the group is replicated inside the compiled
pattern. For example, a pattern such as /(a){2}/ is compiled as if it were
/(a)(a)/. This means that the callout will be enumerated more than once, but
with the same value for <i>pattern_position</i> in each case.
</P>
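<P>
Because a replicated group causes the same callout to be enumerated more than
once, an application that wants one entry per callout can use
<i>pattern_position</i> as a key. The following C sketch shows one way of doing
this. The <b>callout_set</b> type, the <b>callout_set_add()</b> function, and
the fixed capacity are assumptions made for the example; an enumeration
callback would call the function with the <i>pattern_position</i> value from
each block.
</P>

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLOUTS 64  /* arbitrary capacity chosen for this sketch */

/* Hypothetical set of pattern positions already seen during enumeration. */
typedef struct {
  size_t positions[MAX_CALLOUTS];
  size_t count;
} callout_set;

/* Record a pattern_position value. Returns 1 if it is new, 0 if the same
   callout was already enumerated (for example, from a replicated group),
   and -1 if the set is full. */
int callout_set_add(callout_set *set, size_t pattern_position)
{
  size_t i;
  assert(set != NULL);
  for (i = 0; i < set->count; i++)
    if (set->positions[i] == pattern_position) return 0;
  if (set->count == MAX_CALLOUTS) return -1;
  set->positions[set->count++] = pattern_position;
  return 1;
}
```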
<P>
The callback function should normally return zero. If it returns a non-zero
value, scanning the pattern stops, and that value is returned from
<b>pcre2_callout_enumerate()</b>.
</P>
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
Last updated: 19 January 2024
<br>
Copyright &copy; 1997-2024 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,299 @@
<html>
<head>
<title>pcre2compat specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2compat man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
DIFFERENCES BETWEEN PCRE2 AND PERL
</b><br>
<P>
This document describes some of the known differences in the ways that PCRE2
and Perl handle regular expressions. The differences described here are with
respect to Perl version 5.38.0, but as both Perl and PCRE2 are continually
changing, the information may at times be out of date.
</P>
<P>
1. When PCRE2_DOTALL (equivalent to Perl's /s qualifier) is not set, the
behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.' matches the
next character unless it is the start of a newline sequence. This means that,
if the newline setting is CR, CRLF, or NUL, '.' will match the code point LF
(0x0A) in ASCII/Unicode environments, and NL (either 0x15 or 0x25) when using
EBCDIC. In Perl, '.' appears never to match LF, even when 0x0A is not a newline
indicator.
</P>
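<P>
The rule for '.' in item 1 can be modelled in a few lines of C. This is a
simplified sketch, not PCRE2 code: it assumes an ASCII environment, covers only
the CR, LF, CRLF, and NUL newline settings, and uses hypothetical names
(<b>newline_kind</b>, <b>newline_starts_at()</b>, <b>dot_matches_at()</b>).
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for four of the newline settings (ASCII assumed). */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_NUL } newline_kind;

/* Nonzero if a newline sequence starts at subject[pos]. */
int newline_starts_at(const char *subject, size_t length, size_t pos,
                      newline_kind nl)
{
  assert(subject != NULL);
  if (pos >= length) return 0;
  switch (nl) {
    case NL_CR:   return subject[pos] == '\r';
    case NL_LF:   return subject[pos] == '\n';
    case NL_NUL:  return subject[pos] == '\0';
    case NL_CRLF: return subject[pos] == '\r' && pos + 1 < length &&
                         subject[pos + 1] == '\n';
  }
  return 0;
}

/* Model of '.' without PCRE2_DOTALL: it matches the next character unless
   that character is the start of a newline sequence. */
int dot_matches_at(const char *subject, size_t length, size_t pos,
                   newline_kind nl)
{
  return pos < length && !newline_starts_at(subject, length, pos, nl);
}
```

<P>
With the newline setting CR, this model lets '.' match LF, which is the
difference from Perl described above.
</P>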
<P>
2. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
have are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
</P>
<P>
3. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
they do not mean what you might think. For example, (?!a){3} does not assert
that the next three characters are not "a". It just asserts that the next
character is not "a" three times (in principle; PCRE2 optimizes this to run the
assertion just once). Perl allows some repeat quantifiers on other assertions,
for example, \b* , but these do not seem to have any use. PCRE2 does not allow
any kind of quantifier on non-lookaround assertions.
</P>
<P>
4. If a braced quantifier such as {1,2} appears where there is nothing to
repeat (for example, at the start of a branch), PCRE2 raises an error whereas
Perl treats the quantifier characters as literal.
</P>
<P>
5. Capture groups that occur inside negative lookaround assertions are counted,
but their entries in the offsets vector are set only when a negative assertion
is a condition that has a matching branch (that is, the condition is false).
Perl may set such capture groups in other circumstances.
</P>
<P>
6. The following Perl escape sequences are not supported: \F, \l, \L, \u,
\U, and \N when followed by a character name. \N on its own, matching a
non-newline character, and \N{U+dd..}, matching a Unicode code point, are
supported. The escapes that modify the case of following letters are
implemented by Perl's general string-handling and are not part of its pattern
matching engine. If any of these are encountered by PCRE2, an error is
generated by default. However, if either of the PCRE2_ALT_BSUX or
PCRE2_EXTRA_ALT_BSUX options is set, \U and \u are interpreted as ECMAScript
interprets them.
</P>
<P>
7. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
built with Unicode support (the default). The properties that can be tested
with \p and \P are limited to the general category properties such as Lu and
Nd, the derived properties Any and Lc (synonym L&), script names such as Greek
or Han, Bidi_Class, Bidi_Control, and a few binary properties. Both PCRE2 and
Perl support the Cs (surrogate) property, but in PCRE2 its use is limited. See
the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation for details. The long synonyms for property names that Perl
supports (such as \p{Letter}) are not supported by PCRE2, nor is it permitted
to prefix any of these properties with "Is".
</P>
<P>
8. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
in between are treated as literals. However, this is slightly different from
Perl in that $ and @ are also handled as literals inside the quotes. In Perl,
they cause variable interpolation (PCRE2 does not have variables). Also, Perl
does "double-quotish backslash interpolation" on any backslashes between \Q
and \E which, its documentation says, "may lead to confusing results". PCRE2
treats a backslash between \Q and \E just like any other character. Note the
following examples:
<pre>
Pattern PCRE2 matches Perl matches
\Qabc$xyz\E abc$xyz abc followed by the contents of $xyz
\Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
\QA\B\E A\B A\B
\Q\\E \ \\E
</pre>
The \Q...\E sequence is recognized both inside and outside character classes
by both PCRE2 and Perl. Another difference from Perl is that any appearance of
\Q or \E inside what might otherwise be a quantifier causes PCRE2 not to
recognize the sequence as a quantifier. Perl recognizes a quantifier if
(redundantly) either of the numbers is inside \Q...\E, but not if the
separating comma is. When not recognized as a quantifier a sequence such as
{\Q1\E,2} is treated as the literal string "{1,2}".
</P>
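<P>
The "PCRE2 matches" column of the table above can be reproduced with a small
model. The C sketch below is not taken from PCRE2; the function name
<b>qe_literal_text()</b> is hypothetical, and it handles only patterns made of
literal characters, backslash-escaped punctuation, and \Q...\E runs. Other
escapes and all metacharacters are outside the model.
</P>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Write to out the literal text that such a pattern matches under the PCRE2
   rules: between \Q and \E every character, including backslash, $ and @,
   stands for itself. Returns the length written (excluding the terminating
   zero), or (size_t)-1 if out is too small. */
size_t qe_literal_text(const char *pattern, char *out, size_t out_size)
{
  size_t n = 0;
  int quoting = 0;
  const char *p = pattern;

  assert(pattern != NULL && out != NULL);
  if (out_size == 0) return (size_t)-1;

  while (*p != '\0') {
    char c;
    if (p[0] == '\\' && p[1] == 'E') {              /* \E ends quoting */
      quoting = 0;
      p += 2;
      continue;
    }
    if (!quoting && p[0] == '\\' && p[1] == 'Q') {  /* \Q starts quoting */
      quoting = 1;
      p += 2;
      continue;
    }
    if (!quoting && p[0] == '\\' && p[1] != '\0' &&
        !isalnum((unsigned char)p[1])) {
      c = p[1];                                     /* escaped punctuation */
      p += 2;
    } else {
      c = *p++;                                     /* ordinary or quoted */
    }
    if (n + 1 >= out_size) return (size_t)-1;       /* keep room for the zero */
    out[n++] = c;
  }
  out[n] = '\0';
  return n;
}
```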
<P>
9. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
constructions. However, PCRE2 does have a "callout" feature, which allows an
external function to be called during pattern matching. See the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation for details.
</P>
<P>
10. Subroutine calls (whether recursive or not) were treated as atomic groups
up to PCRE2 release 10.23, but from release 10.30 this changed, and
backtracking into subroutine calls is now supported, as in Perl.
</P>
<P>
11. In PCRE2, if any of the backtracking control verbs are used in a group that
is called as a subroutine (whether or not recursively), their effect is
confined to that group; it does not extend to the surrounding pattern. This is
not always the case in Perl. In particular, if (*THEN) is present in a group
that is called as a subroutine, its action is limited to that group, even if
the group does not contain any | characters. Note that such groups are
processed as anchored at the point where they are tested. PCRE2 also confines
all control verbs within atomic assertions, again including (*THEN) in
assertions with only one branch.
</P>
<P>
12. If a pattern contains more than one backtracking control verb, the first
one that is backtracked onto acts. For example, in the pattern
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
same as PCRE2, but there are cases where it differs.
</P>
<P>
13. There are some differences that are concerned with the settings of captured
strings when part of a pattern is repeated. For example, matching "aba" against
the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
"b".
</P>
<P>
14. PCRE2's handling of duplicate capture group numbers and names is not as
general as Perl's. This is a consequence of the fact that PCRE2 works internally
just with numbers, using an external table to translate between numbers and
names. In particular, a pattern such as (?|(?&#60;a&#62;A)|(?&#60;b&#62;B)), where the two
capture groups have the same number but different names, is not supported. If
it were allowed, it would not be possible to distinguish which group matched,
because both names map to capture group number 1. To avoid this confusing
situation, an error is given at compile time.
</P>
<P>
15. Perl used to recognize comments in some places that PCRE2 does not, for
example, between the ( and ? at the start of a group. If the /x modifier is
set, Perl allowed white space between ( and ? though the latest Perls give an
error (for a while it was just deprecated). There may still be some cases where
Perl behaves differently.
</P>
<P>
16. Perl, when in warning mode, gives warnings for character classes such as
[A-\d] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
warning features, so it gives an error in these cases because they are almost
certainly user mistakes.
</P>
<P>
17. In PCRE2, until release 10.45, the upper/lower case character properties Lu
and Ll were not affected when case-independent matching was specified. Perl has
changed in this respect, and PCRE2 has now changed to match. When caseless
matching is in force, Lu, Ll, and Lt (title case) are all treated as Lc (cased
letter).
</P>
<P>
18. From release 5.32.0, Perl locks out the use of \K in lookaround
assertions. From release 10.38 PCRE2 does the same by default. However, there
is an option for re-enabling the previous behaviour. When this option is set,
\K is acted on when it occurs in positive assertions, but is ignored in
negative assertions.
</P>
<P>
19. PCRE2 provides some extensions to the Perl regular expression facilities.
Perl 5.10 included new features that were not in earlier versions of Perl, some
of which (such as named parentheses) were in PCRE2 for some time before. This
list is with respect to Perl 5.38:
<br>
<br>
(a) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
<br>
<br>
(b) A backslash followed by a letter with no special meaning is faulted. (Perl
can be made to issue a warning.)
<br>
<br>
(c) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
inverted, that is, by default they are not greedy, but if followed by a
question mark they are.
<br>
<br>
(d) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
only at the first matching position in the subject string.
<br>
<br>
(e) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY and PCRE2_NOTEMPTY_ATSTART
options have no Perl equivalents.
<br>
<br>
(f) The \R escape sequence can be restricted to match only CR, LF, or CRLF
by the PCRE2_BSR_ANYCRLF option.
<br>
<br>
(g) The callout facility is PCRE2-specific. Perl supports codeblocks and
variable interpolation, but not general hooks on every match.
<br>
<br>
(h) The partial matching facility is PCRE2-specific.
<br>
<br>
(i) The alternative matching function (<b>pcre2_dfa_match()</b>) matches in a
different way and is not Perl-compatible.
<br>
<br>
(j) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
the start of a pattern. These set overall options that cannot be changed within
the pattern.
<br>
<br>
(k) PCRE2 supports non-atomic positive lookaround assertions. This is an
extension to the lookaround facilities. The default, Perl-compatible
lookarounds are atomic.
<br>
<br>
(l) There are three syntactical items in patterns that can refer to a capturing
group by number: back references such as \g{2}, subroutine calls such as (?3),
and condition references such as (?(4)...). PCRE2 supports relative group
numbers such as +2 and -4 in all three cases. Perl supports both plus and minus
for subroutine calls, but only minus for back references, and no relative
numbering at all for conditions.
<br>
<br>
(m) The scan substring assertion (syntax (*scs:(n)...)) is a PCRE2 extension
that is not available in Perl.
</P>
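<P>
To make the relative numbering in item (l) concrete, the following C sketch
converts a relative group number into an absolute one. It is only a model
under stated assumptions: groups are numbered in the order of their opening
parentheses, no duplicate-number (?| groups are involved, and the function name
<b>resolve_relative_group()</b> is hypothetical.
</P>

```c
#include <assert.h>

/* groups_opened: number of capture groups whose opening parenthesis comes
                  before the reference in the pattern.
   relative:      the signed number as written, such as -4 or +2 (not 0).
   Returns the absolute group number, or 0 if a negative reference reaches
   back before group 1. */
unsigned int resolve_relative_group(unsigned int groups_opened, int relative)
{
  unsigned int back;
  assert(relative != 0);
  if (relative > 0) return groups_opened + (unsigned int)relative;
  back = (unsigned int)(-relative);
  if (back > groups_opened) return 0;
  return groups_opened - back + 1;
}
```

<P>
For example, after two groups have been opened, a reference written as -1
resolves to group 2, and one written as +1 resolves to group 3.
</P>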
<P>
20. Perl has different limits than PCRE2. See the
<a href="pcre2limit.html"><b>pcre2limit</b></a>
documentation for details. With release 5.10, Perl switched from recursion to
iteration, keeping the intermediate matches on the heap, which is ~10% slower
but does not run into any stack-overflow limit. PCRE2 made a similar change at
release 10.30, and also has many build-time and run-time customizable limits.
</P>
<P>
21. Unlike Perl, PCRE2 does not have character set modifiers, and in particular
has no way to select character rules by context as Perl's "/d" does. A regular
expression using PCRE2_UTF and PCRE2_UCP will use similar rules to Perl's "/u";
something closer to "/a" could be selected by adding other PCRE2_EXTRA_ASCII*
options on top.
</P>
<P>
22. Some recursive patterns that Perl diagnoses as infinite recursions can be
handled by PCRE2, either by the interpreter or the JIT. An example is
/(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any number of repeated
"abcd" substrings at the end of the subject.
</P>
<P>
23. Both PCRE2 and Perl give an error when \x{ escapes are invalid, but Perl
tries to recover and prints a warning if the problem was that an invalid
hexadecimal digit was found. Since PCRE2 does not have warnings, it returns an
error instead. Additionally, Perl accepts \x{} and generates NUL, unlike PCRE2.
</P>
<P>
24. From release 10.45, PCRE2 gives an error if \x is not followed by a
hexadecimal digit or a curly bracket. It used to interpret this as the NUL
character. Perl still generates NUL, but warns when in warning mode in most
cases.
</P>
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 02 October 2024
<br>
Copyright &copy; 1997-2024 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,191 @@
<html>
<head>
<title>pcre2convert specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2convert man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">EXPERIMENTAL PATTERN CONVERSION FUNCTIONS</a>
<li><a name="TOC2" href="#SEC2">THE CONVERT CONTEXT</a>
<li><a name="TOC3" href="#SEC3">THE CONVERSION FUNCTION</a>
<li><a name="TOC4" href="#SEC4">CONVERTING GLOBS</a>
<li><a name="TOC5" href="#SEC5">CONVERTING POSIX PATTERNS</a>
<li><a name="TOC6" href="#SEC6">AUTHOR</a>
<li><a name="TOC7" href="#SEC7">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">EXPERIMENTAL PATTERN CONVERSION FUNCTIONS</a><br>
<P>
This document describes a set of functions that can be used to convert
"foreign" patterns into PCRE2 regular expressions. This facility is currently
experimental, and may be changed in future releases. Two kinds of pattern,
globs and POSIX patterns, are supported.
</P>
<br><a name="SEC2" href="#TOC1">THE CONVERT CONTEXT</a><br>
<P>
<b>pcre2_convert_context *pcre2_convert_context_create(</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>pcre2_convert_context *pcre2_convert_context_copy(</b>
<b> pcre2_convert_context *<i>cvcontext</i>);</b>
<br>
<br>
<b>void pcre2_convert_context_free(pcre2_convert_context *<i>cvcontext</i>);</b>
<br>
<br>
<b>int pcre2_set_glob_escape(pcre2_convert_context *<i>cvcontext</i>,</b>
<b> uint32_t <i>escape_char</i>);</b>
<br>
<br>
<b>int pcre2_set_glob_separator(pcre2_convert_context *<i>cvcontext</i>,</b>
<b> uint32_t <i>separator_char</i>);</b>
<br>
<br>
A convert context is used to hold parameters that affect the way that pattern
conversion works. Like all PCRE2 contexts, you need to use a context only if
you want to override the defaults. There are the usual create, copy, and free
functions. If custom memory management functions are set in a general context
that is passed to <b>pcre2_convert_context_create()</b>, they are used for all
memory management within the conversion functions.
</P>
<P>
There are only two parameters in the convert context at present. Both apply
only to glob conversions. The escape character defaults to grave accent under
Windows, otherwise backslash. It can be set to zero, meaning no escape
character, or to any punctuation character with a code point less than 256.
The separator character defaults to backslash under Windows, otherwise forward
slash. It can be set to forward slash, backslash, or dot.
</P>
<P>
The two setting functions return zero on success, or PCRE2_ERROR_BADDATA if
their second argument is invalid.
</P>
<br><a name="SEC3" href="#TOC1">THE CONVERSION FUNCTION</a><br>
<P>
<b>int pcre2_pattern_convert(PCRE2_SPTR <i>pattern</i>, PCRE2_SIZE <i>length</i>,</b>
<b> uint32_t <i>options</i>, PCRE2_UCHAR **<i>buffer</i>,</b>
<b> PCRE2_SIZE *<i>blength</i>, pcre2_convert_context *<i>cvcontext</i>);</b>
<br>
<br>
<b>void pcre2_converted_pattern_free(PCRE2_UCHAR *<i>converted_pattern</i>);</b>
<br>
<br>
The first two arguments of <b>pcre2_pattern_convert()</b> define the foreign
pattern that is to be converted. The length may be given as
PCRE2_ZERO_TERMINATED. The <b>options</b> argument defines how the pattern is to
be processed. If the input is UTF, the PCRE2_CONVERT_UTF option should be set.
PCRE2_CONVERT_NO_UTF_CHECK may also be set if you are sure the input is valid.
One or more of the glob options, or one of the following POSIX options must be
set to define the type of conversion that is required:
<pre>
PCRE2_CONVERT_GLOB
PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR
PCRE2_CONVERT_GLOB_NO_STARSTAR
PCRE2_CONVERT_POSIX_BASIC
PCRE2_CONVERT_POSIX_EXTENDED
</pre>
Details of the conversions are given below. The <b>buffer</b> and <b>blength</b>
arguments define how the output is handled:
</P>
<P>
If <b>buffer</b> is NULL, the function just returns the length of the converted
pattern via <b>blength</b>. This is one less than the length of buffer needed,
because a terminating zero is always added to the output.
</P>
<P>
If <b>buffer</b> points to a NULL pointer, an output buffer is obtained using
the allocator in the context or <b>malloc()</b> if no context is supplied. A
pointer to this buffer is placed in the variable to which <b>buffer</b> points.
When no longer needed the output buffer must be freed by calling
<b>pcre2_converted_pattern_free()</b>. If this function is called with a NULL
argument, it returns immediately without doing anything.
</P>
<P>
If <b>buffer</b> points to a non-NULL pointer, <b>blength</b> must be set to the
actual length of the buffer provided (in code units).
</P>
<P>
In all cases, after successful conversion, the variable pointed to by
<b>blength</b> is updated to the length actually used (in code units), excluding
the terminating zero that is always added.
</P>
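<P>
When the length is obtained first by passing a NULL <b>buffer</b>, the caller
must remember to add one code unit for the terminating zero before allocating.
The C sketch below shows that arithmetic. The helper names
<b>converted_buffer_units()</b> and <b>converted_buffer_bytes()</b> are
hypothetical, and the byte calculation assumes that a code unit occupies
exactly 1, 2, or 4 bytes for the 8-bit, 16-bit, and 32-bit libraries.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Code units to allocate for a converted pattern whose reported length
   (via blength) excludes the terminating zero. */
size_t converted_buffer_units(size_t blength)
{
  return blength + 1;
}

/* The same size in bytes for a given code unit width (8, 16, or 32). */
size_t converted_buffer_bytes(size_t blength, unsigned int code_unit_width)
{
  assert(code_unit_width == 8 || code_unit_width == 16 ||
         code_unit_width == 32);
  return converted_buffer_units(blength) * (code_unit_width / 8);
}
```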
<P>
If an error occurs, the length (via <b>blength</b>) is set to the offset
within the input pattern where the error was detected. Only gross syntax errors
are caught; there are plenty of errors that will get passed on for
<b>pcre2_compile()</b> to discover.
</P>
<P>
The return from <b>pcre2_pattern_convert()</b> is zero on success or a non-zero
PCRE2 error code. Note that PCRE2 error codes may be positive or negative:
<b>pcre2_compile()</b> uses mostly positive codes and <b>pcre2_match()</b>
negative ones; <b>pcre2_pattern_convert()</b> uses existing codes of both kinds. A
textual error message can be obtained by calling
<b>pcre2_get_error_message()</b>.
</P>
<br><a name="SEC4" href="#TOC1">CONVERTING GLOBS</a><br>
<P>
Globs are used to match file names, and consequently have the concept of a
"path separator", which defaults to backslash under Windows and forward slash
otherwise. If PCRE2_CONVERT_GLOB is set, the wildcards * and ? are not
permitted to match separator characters, but the double-star (**) feature
(which does match separators) is supported.
</P>
<P>
PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR matches globs with wildcards allowed to
match separator characters. PCRE2_CONVERT_GLOB_NO_STARSTAR matches globs with
the double-star feature disabled. These options may be given together.
</P>
<br><a name="SEC5" href="#TOC1">CONVERTING POSIX PATTERNS</a><br>
<P>
POSIX defines two kinds of regular expression pattern: basic and extended.
These can be processed by setting PCRE2_CONVERT_POSIX_BASIC or
PCRE2_CONVERT_POSIX_EXTENDED, respectively.
</P>
<P>
In POSIX patterns, backslash is not special in a character class. Unmatched
closing parentheses are treated as literals.
</P>
<P>
In basic patterns, ? + | {} and () must be escaped to be recognized
as metacharacters outside a character class. If the first character in the
pattern is * it is treated as a literal. ^ is a metacharacter only at the start
of a branch.
</P>
<P>
In extended patterns, a backslash not in a character class always
makes the next character literal, whatever it is. There are no backreferences.
</P>
<P>
Note: POSIX mandates that the longest possible match at the first matching
position must be found. This is not what <b>pcre2_match()</b> does; it yields
the first match that is found. An application can use <b>pcre2_dfa_match()</b>
to find the longest match, but that does not support backreferences (but then
neither do POSIX extended patterns).
</P>
<br><a name="SEC6" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC7" href="#TOC1">REVISION</a><br>
<P>
Last updated: 14 November 2023
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>


@@ -0,0 +1,518 @@
<html>
<head>
<title>pcre2demo specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2demo man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SOURCE CODE
</b><br>
<PRE>
/*************************************************
* PCRE2 DEMONSTRATION PROGRAM *
*************************************************/
/* This is a demonstration program to illustrate a straightforward way of
using the PCRE2 regular expression library from a C program. See the
pcre2sample documentation for a short discussion ("man pcre2sample" if you have
the PCRE2 man pages installed). PCRE2 is a revised API for the library, and is
incompatible with the original PCRE API.
There are actually three libraries, each supporting a different code unit
width. This demonstration program uses the 8-bit library. The default is to
process each code unit as a separate character, but if the pattern begins with
"(*UTF)", both it and the subject are treated as UTF-8 strings, where
characters may occupy multiple code units.
In Unix-like environments, if PCRE2 is installed in your standard system
libraries, you should be able to compile this program using this command:
cc -Wall pcre2demo.c -lpcre2-8 -o pcre2demo
If PCRE2 is not installed in a standard place, it is likely to be installed
with support for the pkg-config mechanism. If you have pkg-config, you can
compile this program using this command:
cc -Wall pcre2demo.c `pkg-config --cflags --libs libpcre2-8` -o pcre2demo
If you do not have pkg-config, you may have to use something like this:
cc -Wall pcre2demo.c -I/usr/local/include -L/usr/local/lib \
-R/usr/local/lib -lpcre2-8 -o pcre2demo
Replace "/usr/local/include" and "/usr/local/lib" with wherever the include and
library files for PCRE2 are installed on your system. Only some operating
systems (Solaris is one) use the -R option.
Building under Windows:
If you want to statically link this program against a non-dll .a file, you must
define PCRE2_STATIC before including pcre2.h, so in this environment, uncomment
the following line. */
/* #define PCRE2_STATIC */
/* The PCRE2_CODE_UNIT_WIDTH macro must be defined before including pcre2.h.
For a program that uses only one code unit width, setting it to 8, 16, or 32
makes it possible to use generic function names such as pcre2_compile(). Note
that just changing 8 to 16 (for example) is not sufficient to convert this
program to process 16-bit characters. Even in a fully 16-bit environment, where
string-handling functions such as strcmp() and printf() work with 16-bit
characters, the code for handling the table of named substrings will still need
to be modified. */
#define PCRE2_CODE_UNIT_WIDTH 8
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;pcre2.h&gt;
/**************************************************************************
* Here is the program. The API includes the concept of "contexts" for *
* setting up unusual interface requirements for compiling and matching, *
* such as custom memory managers and non-standard newline definitions. *
* This program does not do any of this, so it makes no use of contexts, *
* always passing NULL where a context could be given. *
**************************************************************************/
int main(int argc, char **argv)
{
pcre2_code *re;
PCRE2_SPTR pattern; /* PCRE2_SPTR is a pointer to unsigned code units of */
PCRE2_SPTR subject; /* the appropriate width (in this case, 8 bits). */
PCRE2_SPTR name_table;
int crlf_is_newline;
int errornumber;
int find_all;
int i;
int rc;
int utf8;
uint32_t option_bits;
uint32_t namecount;
uint32_t name_entry_size;
uint32_t newline;
PCRE2_SIZE erroroffset;
PCRE2_SIZE *ovector;
PCRE2_SIZE subject_length;
pcre2_match_data *match_data;
/**************************************************************************
* First, sort out the command line. There is only one possible option at *
* the moment, "-g" to request repeated matching to find all occurrences, *
* like Perl's /g option. We set the variable find_all to a non-zero value *
* if the -g option is present. *
**************************************************************************/
find_all = 0;
for (i = 1; i &lt; argc; i++)
{
if (strcmp(argv[i], "-g") == 0) find_all = 1;
else if (argv[i][0] == '-')
{
printf("Unrecognised option %s\n", argv[i]);
return 1;
}
else break;
}
/* After the options, we require exactly two arguments, which are the pattern,
and the subject string. */
if (argc - i != 2)
{
printf("Exactly two arguments required: a regex and a subject string\n");
return 1;
}
/* Pattern and subject are char arguments, so they can be straightforwardly
cast to PCRE2_SPTR because we are working in 8-bit code units. The subject
length is cast to PCRE2_SIZE for completeness, though PCRE2_SIZE is in fact
defined to be size_t. */
pattern = (PCRE2_SPTR)argv[i];
subject = (PCRE2_SPTR)argv[i+1];
subject_length = (PCRE2_SIZE)strlen((char *)subject);
/*************************************************************************
* Now we are going to compile the regular expression pattern, and handle *
* any errors that are detected. *
*************************************************************************/
re = pcre2_compile(
pattern, /* the pattern */
PCRE2_ZERO_TERMINATED, /* indicates pattern is zero-terminated */
0, /* default options */
&amp;errornumber, /* for error number */
&amp;erroroffset, /* for error offset */
NULL); /* use default compile context */
/* Compilation failed: print the error message and exit. */
if (re == NULL)
{
PCRE2_UCHAR buffer[256];
pcre2_get_error_message(errornumber, buffer, sizeof(buffer));
printf("PCRE2 compilation failed at offset %d: %s\n", (int)erroroffset,
buffer);
return 1;
}
/*************************************************************************
* If the compilation succeeded, we call PCRE2 again, in order to do a *
* pattern match against the subject string. This does just ONE match. If *
* further matching is needed, it will be done below. Before running the *
* match we must set up a match_data block for holding the result. Using *
* pcre2_match_data_create_from_pattern() ensures that the block is *
* exactly the right size for the number of capturing parentheses in the *
* pattern. If you need to know the actual size of a match_data block as *
* a number of bytes, you can find it like this: *
* *
* PCRE2_SIZE match_data_size = pcre2_get_match_data_size(match_data); *
*************************************************************************/
match_data = pcre2_match_data_create_from_pattern(re, NULL);
/* Now run the match. */
rc = pcre2_match(
re, /* the compiled pattern */
subject, /* the subject string */
subject_length, /* the length of the subject */
0, /* start at offset 0 in the subject */
0, /* default options */
match_data, /* block for storing the result */
NULL); /* use default match context */
/* Matching failed: handle error cases */
if (rc &lt; 0)
{
switch(rc)
{
case PCRE2_ERROR_NOMATCH: printf("No match\n"); break;
/*
Handle other special cases if you like
*/
default: printf("Matching error %d\n", rc); break;
}
pcre2_match_data_free(match_data); /* Release memory used for the match */
pcre2_code_free(re); /* data and the compiled pattern. */
return 1;
}
/* Match succeeded. Get a pointer to the output vector, where string offsets
are stored. */
ovector = pcre2_get_ovector_pointer(match_data);
printf("Match succeeded at offset %d\n", (int)ovector[0]);
/*************************************************************************
* We have found the first match within the subject string. If the output *
* vector wasn't big enough, say so. Then output any substrings that were *
* captured. *
*************************************************************************/
/* The output vector wasn't big enough. This should not happen, because we used
pcre2_match_data_create_from_pattern() above. */
if (rc == 0)
printf("ovector was not big enough for all the captured substrings\n");
/* Since release 10.38 PCRE2 has locked out the use of \K in lookaround
assertions. However, there is an option to re-enable the old behaviour. If that
is set, it is possible to run patterns such as /(?=.\K)/ that use \K in an
assertion to set the start of a match later than its end. In this demonstration
program, we show how to detect this case, but it shouldn't arise because the
option is never set. */
if (ovector[0] &gt; ovector[1])
{
printf("\\K was used in an assertion to set the match start after its end.\n"
"From end to start the match was: %.*s\n", (int)(ovector[0] - ovector[1]),
(char *)(subject + ovector[1]));
printf("Run abandoned\n");
pcre2_match_data_free(match_data);
pcre2_code_free(re);
return 1;
}
/* Show substrings stored in the output vector by number. Obviously, in a real
application you might want to do things other than print them. */
for (i = 0; i &lt; rc; i++)
{
PCRE2_SPTR substring_start = subject + ovector[2*i];
PCRE2_SIZE substring_length = ovector[2*i+1] - ovector[2*i];
printf("%2d: %.*s\n", i, (int)substring_length, (char *)substring_start);
}
/**************************************************************************
* That concludes the basic part of this demonstration program. We have *
* compiled a pattern, and performed a single match. The code that follows *
* shows first how to access named substrings, and then how to code for *
* repeated matches on the same subject. *
**************************************************************************/
/* See if there are any named substrings, and if so, show them by name. First
we have to extract the count of named parentheses from the pattern. */
(void)pcre2_pattern_info(
re, /* the compiled pattern */
PCRE2_INFO_NAMECOUNT, /* get the number of named substrings */
&amp;namecount); /* where to put the answer */
if (namecount == 0) printf("No named substrings\n"); else
{
PCRE2_SPTR tabptr;
printf("Named substrings\n");
/* Before we can access the substrings, we must extract the table for
translating names to numbers, and the size of each entry in the table. */
(void)pcre2_pattern_info(
re, /* the compiled pattern */
PCRE2_INFO_NAMETABLE, /* address of the table */
&amp;name_table); /* where to put the answer */
(void)pcre2_pattern_info(
re, /* the compiled pattern */
PCRE2_INFO_NAMEENTRYSIZE, /* size of each entry in the table */
&amp;name_entry_size); /* where to put the answer */
/* Now we can scan the table and, for each entry, print the number, the name,
and the substring itself. In the 8-bit library the number is held in two
bytes, most significant first. */
tabptr = name_table;
for (i = 0; i &lt; namecount; i++)
{
int n = (tabptr[0] &lt;&lt; 8) | tabptr[1];
printf("(%d) %*s: %.*s\n", n, (int)name_entry_size - 3, tabptr + 2,
(int)(ovector[2*n+1] - ovector[2*n]), subject + ovector[2*n]);
tabptr += name_entry_size;
}
}
/*************************************************************************
* If the "-g" option was given on the command line, we want to continue *
* to search for additional matches in the subject string, in a similar *
* way to the /g option in Perl. This turns out to be trickier than you *
* might think because of the possibility of matching an empty string. *
* What happens is as follows: *
* *
* If the previous match was NOT for an empty string, we can just start *
* the next match at the end of the previous one. *
* *
* If the previous match WAS for an empty string, we can't do that, as it *
* would lead to an infinite loop. Instead, a call of pcre2_match() is *
* made with the PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set. The *
* first of these tells PCRE2 that an empty string at the start of the *
* subject is not a valid match; other possibilities must be tried. The *
* second flag restricts PCRE2 to one match attempt at the initial string *
* position. If this match succeeds, an alternative to the empty string *
* match has been found, and we can print it and proceed round the loop, *
* advancing by the length of whatever was found. If this match does not *
* succeed, we still stay in the loop, advancing by just one character. *
* In UTF-8 mode, which can be set by (*UTF) in the pattern, this may be *
* more than one byte. *
* *
* However, there is a complication concerned with newlines. When the *
* newline convention is such that CRLF is a valid newline, we must *
* advance by two characters rather than one. The newline convention can *
* be set in the regex by (*CR), etc.; if not, we must find the default. *
*************************************************************************/
if (!find_all) /* Check for -g */
{
pcre2_match_data_free(match_data); /* Release the memory that was used */
pcre2_code_free(re); /* for the match data and the pattern. */
return 0; /* Exit the program. */
}
/* Before running the loop, check for UTF-8 and whether CRLF is a valid newline
sequence. First, find the options with which the regex was compiled and extract
the UTF state. */
(void)pcre2_pattern_info(re, PCRE2_INFO_ALLOPTIONS, &amp;option_bits);
utf8 = (option_bits &amp; PCRE2_UTF) != 0;
/* Now find the newline convention and see whether CRLF is a valid newline
sequence. */
(void)pcre2_pattern_info(re, PCRE2_INFO_NEWLINE, &amp;newline);
crlf_is_newline = newline == PCRE2_NEWLINE_ANY ||
newline == PCRE2_NEWLINE_CRLF ||
newline == PCRE2_NEWLINE_ANYCRLF;
/* Loop for second and subsequent matches */
for (;;)
{
uint32_t options = 0; /* Normally no options */
PCRE2_SIZE start_offset = ovector[1]; /* Start at end of previous match */
/* If the previous match was for an empty string, we are finished if we are
at the end of the subject. Otherwise, arrange to run another match at the
same point to see if a non-empty match can be found. */
if (ovector[0] == ovector[1])
{
if (ovector[0] == subject_length) break;
options = PCRE2_NOTEMPTY_ATSTART | PCRE2_ANCHORED;
}
/* If the previous match was not an empty string, there is one tricky case to
consider. If a pattern contains \K within a lookbehind assertion at the
start, the end of the matched string can be at the offset where the match
started. Without special action, this leads to a loop that keeps on matching
the same substring. We must detect this case and arrange to move the start on
by one character. The pcre2_get_startchar() function returns the offset of the
character at which the successful match started, which can differ from
ovector[0] when \K is used. */
else
{
PCRE2_SIZE startchar = pcre2_get_startchar(match_data);
if (start_offset &lt;= startchar)
{
if (startchar &gt;= subject_length) break; /* Reached end of subject. */
start_offset = startchar + 1; /* Advance by one character. */
if (utf8) /* If UTF-8, it may be more */
{ /* than one code unit. */
for (; start_offset &lt; subject_length; start_offset++)
if ((subject[start_offset] &amp; 0xc0) != 0x80) break;
}
}
}
/* Run the next matching operation */
rc = pcre2_match(
re, /* the compiled pattern */
subject, /* the subject string */
subject_length, /* the length of the subject */
start_offset, /* starting offset in the subject */
options, /* options */
match_data, /* block for storing the result */
NULL); /* use default match context */
/* This time, a result of NOMATCH isn't an error. If the value in "options"
is zero, it just means we have found all possible matches, so the loop ends.
Otherwise, it means we have failed to find a non-empty-string match at a
point where there was a previous empty-string match. In this case, we do what
Perl does: advance the matching position by one character, and continue. We
do this by setting the "end of previous match" offset, because that is picked
up at the top of the loop as the point at which to start again.
There are two complications: (a) When CRLF is a valid newline sequence, and
the current position is just before it, advance by an extra byte. (b)
Otherwise we must ensure that we skip an entire UTF character if we are in
UTF mode. */
if (rc == PCRE2_ERROR_NOMATCH)
{
if (options == 0) break; /* All matches found */
ovector[1] = start_offset + 1; /* Advance one code unit */
if (crlf_is_newline &amp;&amp; /* If CRLF is a newline &amp; */
start_offset &lt; subject_length - 1 &amp;&amp; /* we are at CRLF, */
subject[start_offset] == '\r' &amp;&amp;
subject[start_offset + 1] == '\n')
ovector[1] += 1; /* Advance by one more. */
else if (utf8) /* Otherwise, ensure we */
{ /* advance a whole UTF-8 */
while (ovector[1] &lt; subject_length) /* character. */
{
if ((subject[ovector[1]] &amp; 0xc0) != 0x80) break;
ovector[1] += 1;
}
}
continue; /* Go round the loop again */
}
/* Other matching errors are not recoverable. */
if (rc &lt; 0)
{
printf("Matching error %d\n", rc);
pcre2_match_data_free(match_data);
pcre2_code_free(re);
return 1;
}
/* Match succeeded */
printf("\nMatch succeeded again at offset %d\n", (int)ovector[0]);
/* The match succeeded, but the output vector wasn't big enough. This
should not happen. */
if (rc == 0)
printf("ovector was not big enough for all the captured substrings\n");
/* We must guard against patterns such as /(?=.\K)/ that use \K in an
assertion to set the start of a match later than its end. In this
demonstration program, we just detect this case and give up. */
if (ovector[0] &gt; ovector[1])
{
printf("\\K was used in an assertion to set the match start after its end.\n"
"From end to start the match was: %.*s\n", (int)(ovector[0] - ovector[1]),
(char *)(subject + ovector[1]));
printf("Run abandoned\n");
pcre2_match_data_free(match_data);
pcre2_code_free(re);
return 1;
}
/* As before, show substrings stored in the output vector by number, and then
also any named substrings. */
for (i = 0; i &lt; rc; i++)
{
PCRE2_SPTR substring_start = subject + ovector[2*i];
size_t substring_length = ovector[2*i+1] - ovector[2*i];
printf("%2d: %.*s\n", i, (int)substring_length, (char *)substring_start);
}
if (namecount == 0) printf("No named substrings\n"); else
{
PCRE2_SPTR tabptr = name_table;
printf("Named substrings\n");
for (i = 0; i &lt; namecount; i++)
{
int n = (tabptr[0] &lt;&lt; 8) | tabptr[1];
printf("(%d) %*s: %.*s\n", n, (int)name_entry_size - 3, tabptr + 2,
(int)(ovector[2*n+1] - ovector[2*n]), subject + ovector[2*n]);
tabptr += name_entry_size;
}
}
} /* End of loop to find second and subsequent matches */
printf("\n");
pcre2_match_data_free(match_data);
pcre2_code_free(re);
return 0;
}
/* End of pcre2demo.c */
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,1135 @@
<html>
<head>
<title>pcre2grep specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2grep man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">SUPPORT FOR COMPRESSED FILES</a>
<li><a name="TOC4" href="#SEC4">BINARY FILES</a>
<li><a name="TOC5" href="#SEC5">BINARY ZEROS IN PATTERNS</a>
<li><a name="TOC6" href="#SEC6">OPTIONS</a>
<li><a name="TOC7" href="#SEC7">ENVIRONMENT VARIABLES</a>
<li><a name="TOC8" href="#SEC8">NEWLINES</a>
<li><a name="TOC9" href="#SEC9">OPTIONS COMPATIBILITY WITH GNU GREP</a>
<li><a name="TOC10" href="#SEC10">OPTIONS WITH DATA</a>
<li><a name="TOC11" href="#SEC11">USING PCRE2'S CALLOUT FACILITY</a>
<li><a name="TOC12" href="#SEC12">MATCHING ERRORS</a>
<li><a name="TOC13" href="#SEC13">DIAGNOSTICS</a>
<li><a name="TOC14" href="#SEC14">SEE ALSO</a>
<li><a name="TOC15" href="#SEC15">AUTHOR</a>
<li><a name="TOC16" href="#SEC16">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
<P>
<b>pcre2grep [options] [long options] [pattern] [path1 path2 ...]</b>
</P>
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
<P>
<b>pcre2grep</b> searches files for character patterns, in the same way as other
grep commands do, but it uses the PCRE2 regular expression library to support
patterns that are compatible with the regular expressions of Perl 5. See
<a href="pcre2syntax.html"><b>pcre2syntax</b>(3)</a>
for a quick-reference summary of pattern syntax, or
<a href="pcre2pattern.html"><b>pcre2pattern</b>(3)</a>
for a full description of the syntax and semantics of the regular expressions
that PCRE2 supports.
</P>
<P>
Patterns, whether supplied on the command line or in a separate file, are given
without delimiters. For example:
<pre>
pcre2grep Thursday /etc/motd
</pre>
If you attempt to use delimiters (for example, by surrounding a pattern with
slashes, as is common in Perl scripts), they are interpreted as part of the
pattern. Quotes can of course be used to delimit patterns on the command line
because they are interpreted by the shell, and indeed quotes are required if a
pattern contains white space or shell metacharacters.
</P>
<P>
The first argument that follows any option settings is treated as the single
pattern to be matched when neither <b>-e</b> nor <b>-f</b> is present.
Conversely, when one or both of these options are used to specify patterns, all
arguments are treated as path names. At least one of <b>-e</b>, <b>-f</b>, or an
argument pattern must be provided.
</P>
<P>
If no files are specified, <b>pcre2grep</b> reads the standard input. The
standard input can also be referenced by a name consisting of a single hyphen.
For example:
<pre>
pcre2grep some-pattern file1 - file3
</pre>
By default, input files are searched line by line, so pattern assertions about
the beginning and end of a subject string (^, $, \A, \Z, and \z) match at
the beginning and end of each line. When a line matches a pattern, it is copied
to the standard output, and if there is more than one file, the file name is
output at the start of each line, followed by a colon. However, there are
options that can change how <b>pcre2grep</b> behaves. For example, the <b>-M</b>
option makes it possible to search for strings that span line boundaries. What
defines a line boundary is controlled by the <b>-N</b> (<b>--newline</b>) option.
The <b>-h</b> and <b>-H</b> options control whether or not file names are shown,
and the <b>-Z</b> option changes the file name terminator to a zero byte.
</P>
<P>
The amount of memory used for buffering files that are being scanned is
controlled by parameters that can be set by the <b>--buffer-size</b> and
<b>--max-buffer-size</b> options. The first of these sets the size of buffer
that is obtained at the start of processing. If an input file contains very
long lines, a larger buffer may be needed; this is handled by automatically
extending the buffer, up to the limit specified by <b>--max-buffer-size</b>. The
default values for these parameters can be set when <b>pcre2grep</b> is
built; if nothing is specified, the defaults are set to 20KiB and 1MiB
respectively. An error occurs if a line is too long and the buffer can no
longer be expanded.
</P>
<P>
The block of memory that is actually used is three times the "buffer size", to
allow for buffering "before" and "after" lines. If the buffer size is too
small, fewer than requested "before" and "after" lines may be output.
</P>
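<P>
The relationship between the two buffer parameters can be pictured with a small
C sketch. The helper names <i>next_buffer_size</i> and <i>allocated_bytes</i>
are hypothetical, and growing the buffer by doubling is an assumption made for
illustration; the text above says only that the buffer is extended
automatically up to the <b>--max-buffer-size</b> limit.
</P>

```c
#include <stddef.h>

/* Hypothetical helper: choose the next buffer size when a line does not fit.
   Returns 0 when the buffer is already at the limit, which corresponds to the
   "line is too long" error case. Doubling is an assumption of this sketch. */
size_t next_buffer_size(size_t current, size_t max_size)
{
  if (current >= max_size) return 0;            /* cannot grow any further */
  if (current > max_size / 2) return max_size;  /* clamp to the limit */
  return current * 2;
}

/* The block of memory actually obtained is three times the buffer size, to
   leave room for "before" and "after" lines. */
size_t allocated_bytes(size_t buffer_size)
{
  return 3 * buffer_size;
}
```

<P>
With the default values quoted above (20KiB and 1MiB), the first extension in
this sketch would move from 20480 to 40960 bytes, and the initial allocation
would be 61440 bytes.
</P>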
<P>
When matching with a multiline pattern, the size of the buffer must be at least
half of the maximum match expected or the pattern might fail to match.
</P>
<P>
Patterns can be no longer than 8KiB or BUFSIZ bytes, whichever is the greater.
BUFSIZ is defined in <b>&#60;stdio.h&#62;</b>. When there is more than one pattern
(specified by the use of <b>-e</b> and/or <b>-f</b>), each pattern is applied to
each line in the order in which they are defined, except that all the <b>-e</b>
patterns are tried before the <b>-f</b> patterns.
</P>
<P>
By default, as soon as one pattern matches a line, no further patterns are
considered. However, if <b>--colour</b> (or <b>--color</b>) is used to colour the
matching substrings, or if <b>--only-matching</b>, <b>--file-offsets</b>,
<b>--line-offsets</b>, or <b>--output</b> is used to output only the part of the
line that matched (either shown literally, or as an offset), the behaviour is
different. In this situation, all the patterns are applied to the line. If
there is more than one match, the one that begins nearest to the start of the
subject is processed; if there is more than one match at that position, the one
with the longest matching substring is processed; if the matching substrings
are equal, the first match found is processed.
</P>
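<P>
The selection rule in the preceding paragraph (earliest start, then longest
substring, then first found) might be sketched in C as follows. The
<i>candidate</i> type and the <i>choose_match</i> function are hypothetical
names invented for this illustration; they are not part of <b>pcre2grep</b>.
</P>

```c
#include <stddef.h>

/* One candidate match within a line: where it starts and how long it is. */
typedef struct {
  size_t start;
  size_t length;
} candidate;

/* Return the index of the candidate to process, or -1 if there are none.
   Candidates are assumed to be listed in the order in which the patterns were
   tried, so keeping the earlier entry on a tie means "first match found". */
int choose_match(const candidate *c, int count)
{
  int best = -1;
  for (int i = 0; i < count; i++)
  {
    if (best < 0 ||
        c[i].start < c[best].start ||                  /* nearer the start */
        (c[i].start == c[best].start &&
         c[i].length > c[best].length))                /* same start, longer */
      best = i;
  }
  return best;
}
```

<P>
Because the comparisons are strict, a later candidate replaces the current
choice only when it starts earlier, or starts at the same place and is longer.
</P>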
<P>
Scanning with all the patterns resumes immediately following the match, so that
later matches on the same line can be found. Note, however, that an overlapping
match that starts in the middle of another match will not be processed.
</P>
<P>
The above behaviour was changed at release 10.41 to be more compatible with GNU
grep. In earlier releases, <b>pcre2grep</b> did not recognize matches from
later patterns that were earlier in the subject.
</P>
<P>
Patterns that can match an empty string are accepted, but empty string
matches are never recognized. An example is the pattern "(super)?(man)?", in
which all components are optional. This pattern finds all occurrences of both
"super" and "man"; the output differs from matching with "super|man" when only
the matching substrings are being shown.
</P>
<P>
If the <b>LC_ALL</b> or <b>LC_CTYPE</b> environment variable is set,
<b>pcre2grep</b> uses the value to set a locale when calling the PCRE2 library.
The <b>--locale</b> option can be used to override this.
</P>
<br><a name="SEC3" href="#TOC1">SUPPORT FOR COMPRESSED FILES</a><br>
<P>
Compile-time options for <b>pcre2grep</b> can set it up to use <b>libz</b> or
<b>libbz2</b> for reading compressed files whose names end in <b>.gz</b> or
<b>.bz2</b>, respectively. You can find out whether your <b>pcre2grep</b> binary
has support for one or both of these file types by running it with the
<b>--help</b> option. If the appropriate support is not present, all files are
treated as plain text. The standard input is always so treated. If a file with
a <b>.gz</b> or <b>.bz2</b> extension is not in fact compressed, it is read as a
plain text file. When input is from a compressed .gz or .bz2 file, the
<b>--line-buffered</b> option is ignored.
</P>
<br><a name="SEC4" href="#TOC1">BINARY FILES</a><br>
<P>
By default, a file that contains a binary zero byte within the first 1024 bytes
is identified as a binary file, and is processed specially. However, if the
newline type is specified as NUL, that is, the line terminator is a binary
zero, the test for a binary file is not applied. See the <b>--binary-files</b>
option for a means of changing the way binary files are handled.
</P>
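<P>
As a rough illustration of the test described above, the following C sketch
checks the first 1024 bytes of a buffer for a zero byte. The function name
<i>looks_binary</i> and its <i>nul_is_newline</i> parameter are hypothetical;
the sketch assumes the caller has already read the start of the file into
memory.
</P>

```c
#include <stddef.h>
#include <string.h>

#define BINARY_CHECK_BYTES 1024

/* Return 1 if a zero byte occurs within the first 1024 bytes of buf, 0
   otherwise. When nul_is_newline is non-zero (the newline type is NUL), the
   test is not applied and the data is never classed as binary. */
int looks_binary(const unsigned char *buf, size_t len, int nul_is_newline)
{
  size_t n = len < BINARY_CHECK_BYTES ? len : BINARY_CHECK_BYTES;
  if (nul_is_newline) return 0;
  return memchr(buf, 0, n) != NULL;
}
```

<P>
A zero byte beyond the first 1024 bytes does not affect the result in this
sketch, which matches the wording of the paragraph above.
</P>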
<br><a name="SEC5" href="#TOC1">BINARY ZEROS IN PATTERNS</a><br>
<P>
Patterns passed from the command line are strings that are terminated by a
binary zero, so cannot contain internal zeros. However, patterns that are read
from a file via the <b>-f</b> option may contain binary zeros.
</P>
<br><a name="SEC6" href="#TOC1">OPTIONS</a><br>
<P>
The order in which some of the options appear can affect the output. For
example, both the <b>-H</b> and <b>-l</b> options affect the printing of file
names. Whichever comes later in the command line will be the one that takes
effect. Similarly, except where noted below, if an option is given twice, the
later setting is used. Numerical values for options may be followed by K or M,
to signify multiplication by 1024 or 1024*1024 respectively.
</P>
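<P>
The K and M multipliers can be illustrated with a short C sketch. The function
<i>parse_size</i> is a hypothetical helper, not the parser that
<b>pcre2grep</b> uses; it handles only the upper case suffixes named above and
omits overflow checking for brevity.
</P>

```c
#include <stdlib.h>

/* Hypothetical parser for a numerical option value such as "20K" or "1M".
   Returns 0 and stores the result in *out on success, or -1 if the text is
   malformed. Overflow is not checked in this sketch. */
int parse_size(const char *text, unsigned long *out)
{
  char *end;
  unsigned long value;

  if (*text < '0' || *text > '9') return -1;      /* must start with a digit */
  value = strtoul(text, &end, 10);

  if (*end == 'K') { value *= 1024UL; end++; }
  else if (*end == 'M') { value *= 1024UL * 1024UL; end++; }

  if (*end != '\0') return -1;                    /* unexpected trailing text */
  *out = value;
  return 0;
}
```

<P>
Under these assumptions, "20K" yields 20480 and "1M" yields 1048576, while a
value such as "5X" is rejected.
</P>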
<P>
<b>--</b>
This terminates the list of options. It is useful if the next item on the
command line starts with a hyphen but is not an option. This allows for the
processing of patterns and file names that start with hyphens.
</P>
<P>
<b>-A</b> <i>number</i>, <b>--after-context=</b><i>number</i>
Output up to <i>number</i> lines of context after each matching line. Fewer
lines are output if the next match or the end of the file is reached, or if the
processing buffer size has been set too small. If file names and/or line
numbers are being output, a hyphen separator is used instead of a colon for the
context lines (the <b>-Z</b> option can be used to change the file name
terminator to a zero byte). A line containing "--" is output between each group
of lines, unless they are in fact contiguous in the input file. The value of
<i>number</i> is expected to be relatively small. When <b>-c</b> is used,
<b>-A</b> is ignored.
</P>
<P>
<b>-a</b>, <b>--text</b>
Treat binary files as text. This is equivalent to
<b>--binary-files</b>=<i>text</i>.
</P>
<P>
<b>--allow-lookaround-bsk</b>
PCRE2 now forbids the use of \K in lookarounds by default, in line with Perl.
This option causes <b>pcre2grep</b> to set the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
option, which enables this somewhat dangerous usage.
</P>
<P>
<b>-B</b> <i>number</i>, <b>--before-context=</b><i>number</i>
Output up to <i>number</i> lines of context before each matching line. Fewer
lines are output if the previous match or the start of the file is within
<i>number</i> lines, or if the processing buffer size has been set too small. If
file names and/or line numbers are being output, a hyphen separator is used
instead of a colon for the context lines (the <b>-Z</b> option can be used to
change the file name terminator to a zero byte). A line containing "--" is
output between each group of lines, unless they are in fact contiguous in the
input file. The value of <i>number</i> is expected to be relatively small. When
<b>-c</b> is used, <b>-B</b> is ignored.
</P>
<P>
<b>--binary-files=</b><i>word</i>
Specify how binary files are to be processed. If the word is "binary" (the
default), pattern matching is performed on binary files, but the only output is
"Binary file &#60;name&#62; matches" when a match succeeds. If the word is "text",
which is equivalent to the <b>-a</b> or <b>--text</b> option, binary files are
processed in the same way as any other file. In this case, when a match
succeeds, the output may be binary garbage, which can have nasty effects if
sent to a terminal. If the word is "without-match", which is equivalent to the
<b>-I</b> option, binary files are not processed at all; they are assumed not to
be of interest and are skipped without causing any output or affecting the
return code.
</P>
<P>
<b>--buffer-size=</b><i>number</i>
Set the parameter that controls how much memory is obtained at the start of
processing for buffering files that are being scanned. See also
<b>--max-buffer-size</b> below.
</P>
<P>
<b>-C</b> <i>number</i>, <b>--context=</b><i>number</i>
Output <i>number</i> lines of context both before and after each matching line.
This is equivalent to setting both <b>-A</b> and <b>-B</b> to the same value.
</P>
<P>
<b>-c</b>, <b>--count</b>
Do not output lines from the files that are being scanned; instead output the
number of lines that would have been shown, either because they matched, or, if
<b>-v</b> is set, because they failed to match. By default, this count is
exactly the same as the number of lines that would have been output, but if the
<b>-M</b> (multiline) option is used (without <b>-v</b>), there may be more
suppressed lines than the count (that is, the number of matches).
<br>
<br>
If no lines are selected, the number zero is output. If several files are
being scanned, a count is output for each of them and the <b>-t</b> option can
be used to cause a total to be output at the end. However, if the
<b>--files-with-matches</b> option is also used, only those files whose counts
are greater than zero are listed. When <b>-c</b> is used, the <b>-A</b>,
<b>-B</b>, and <b>-C</b> options are ignored.
</P>
<P>
<b>--colour</b>, <b>--color</b>
If this option is given without any data, it is equivalent to "--colour=auto".
If data is required, it must be given in the same shell item, separated by an
equals sign.
</P>
<P>
<b>--colour=</b><i>value</i>, <b>--color=</b><i>value</i>
This option specifies under what circumstances the parts of a line that matched
a pattern should be coloured in the output. It is ignored if
<b>--file-offsets</b>, <b>--line-offsets</b>, or <b>--output</b> is set. By
default, output is not coloured. The value for the <b>--colour</b> option (which
is optional, see above) may be "never", "always", or "auto". In the last
case, colouring happens only if the standard output is connected to a terminal.
More resources are used when colouring is enabled, because <b>pcre2grep</b> has
to search for all possible matches in a line, not just one, in order to colour
them all.
<br>
<br>
The colour that is used can be specified by setting one of the environment
variables PCRE2GREP_COLOUR, PCRE2GREP_COLOR, PCREGREP_COLOUR, or
PCREGREP_COLOR, which are checked in that order. If none of these are set,
<b>pcre2grep</b> looks for GREP_COLORS or GREP_COLOR (in that order). The value
of the variable should be a string of two numbers, separated by a semicolon,
except in the case of GREP_COLORS, which must start with "ms=" or "mt="
followed by two semicolon-separated colours, terminated by the end of the
string or by a colon. If GREP_COLORS does not start with "ms=" or "mt=" it is
ignored, and GREP_COLOR is checked.
<br>
<br>
If the string obtained from one of the above variables contains any characters
other than semicolon or digits, the setting is ignored and the default colour
is used. The string is copied directly into the control string for setting
colour on a terminal, so it is your responsibility to ensure that the values
make sense. If no relevant environment variable is set, the default is "1;31",
which gives red.
</P>
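<P>
The validation rule for colour strings might look like the following C sketch.
The function name <i>effective_colour</i> is hypothetical, and the sketch
covers only the final check on the string: choosing among the environment
variables and extracting the value from GREP_COLORS are left out.
</P>

```c
#include <stddef.h>
#include <string.h>

#define DEFAULT_COLOUR "1;31"   /* the default named in the text (red) */

/* Return the colour string to use. A NULL value stands for "no relevant
   environment variable is set". Any character other than a digit or a
   semicolon causes the setting to be ignored in favour of the default. */
const char *effective_colour(const char *env_value)
{
  if (env_value == NULL) return DEFAULT_COLOUR;
  if (strspn(env_value, "0123456789;") != strlen(env_value))
    return DEFAULT_COLOUR;
  return env_value;
}
```

<P>
The sketch checks only the character set; as the text above notes, whether the
numbers make sense to the terminal remains the user's responsibility.
</P>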
<P>
<b>-D</b> <i>action</i>, <b>--devices=</b><i>action</i>
If an input path is not a regular file or a directory, "action" specifies how
it is to be processed. Valid values are "read" (the default) or "skip"
(silently skip the path).
</P>
<P>
<b>-d</b> <i>action</i>, <b>--directories=</b><i>action</i>
If an input path is a directory, "action" specifies how it is to be processed.
Valid values are "read" (the default in non-Windows environments, for
compatibility with GNU grep), "recurse" (equivalent to the <b>-r</b> option), or
"skip" (silently skip the path, the default in Windows environments). In the
"read" case, directories are read as if they were ordinary files. In some
operating systems the effect of reading a directory like this is an immediate
end-of-file; in others it may provoke an error.
</P>
<P>
<b>--depth-limit</b>=<i>number</i>
See <b>--match-limit</b> below.
</P>
<P>
<b>-E</b>, <b>--case-restrict</b>
When case distinctions are being ignored in Unicode mode, two ASCII letters (K
and S) will by default match Unicode characters U+212A (Kelvin sign) and U+017F
(long S) respectively, as well as their lower case ASCII counterparts. When
this option is set, case equivalences are restricted such that no ASCII
character matches a non-ASCII character, and vice versa.
</P>
<P>
<b>-e</b> <i>pattern</i>, <b>--regex=</b><i>pattern</i>, <b>--regexp=</b><i>pattern</i>
Specify a pattern to be matched. This option can be used multiple times in
order to specify several patterns. It can also be used as a way of specifying a
single pattern that starts with a hyphen. When <b>-e</b> is used, no argument
pattern is taken from the command line; all arguments are treated as file
names. There is no limit to the number of patterns. They are applied to each
line in the order in which they are defined.
<br>
<br>
If <b>-f</b> is used with <b>-e</b>, the command line patterns are matched first,
followed by the patterns from the file(s), independent of the order in which
these options are specified.
</P>
<P>
<b>--exclude</b>=<i>pattern</i>
Files (but not directories) whose names match the pattern are skipped without
being processed. This applies to all files, whether listed on the command line,
obtained from <b>--file-list</b>, or by scanning a directory. The pattern is a
PCRE2 regular expression, and is matched against the final component of the
file name, not the entire path. The <b>-F</b>, <b>-w</b>, and <b>-x</b> options do
not apply to this pattern. The option may be given any number of times in order
to specify multiple patterns. If a file name matches both an <b>--include</b>
and an <b>--exclude</b> pattern, it is excluded. There is no short form for this
option.
</P>
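<P>
Two points from the description above can be sketched in C: the pattern is
applied to the final component of the file name, and an <b>--exclude</b> match
takes precedence over an <b>--include</b> match. Both function names are
hypothetical, and treating "/" as the only path separator is an assumption of
this sketch.
</P>

```c
#include <string.h>

/* Return a pointer to the final component of a path, that is, the text after
   the last '/' (or the whole string if there is no '/'). */
const char *final_component(const char *path)
{
  const char *slash = strrchr(path, '/');
  return slash != NULL ? slash + 1 : path;
}

/* Decide whether a file is processed, given the results of matching its final
   component. include_count is the number of --include patterns supplied. */
int file_is_processed(int include_count, int matches_include,
                      int matches_exclude)
{
  if (matches_exclude) return 0;                         /* exclusion wins */
  if (include_count > 0 && !matches_include) return 0;   /* not included */
  return 1;
}
```

<P>
For a path such as "src/lib/foo.c", only "foo.c" would be passed to the
pattern match in this sketch.
</P>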
<P>
<b>--exclude-from=</b><i>filename</i>
Treat each non-empty line of the file as the data for an <b>--exclude</b>
option. What constitutes a newline when reading the file is the operating
system's default. The <b>--newline</b> option has no effect on this option. This
option may be given more than once in order to specify a number of files to
read.
</P>
<P>
<b>--exclude-dir</b>=<i>pattern</i>
Directories whose names match the pattern are skipped without being processed,
whatever the setting of the <b>--recursive</b> option. This applies to all
directories, whether listed on the command line, obtained from
<b>--file-list</b>, or by scanning a parent directory. The pattern is a PCRE2
regular expression, and is matched against the final component of the directory
name, not the entire path. The <b>-F</b>, <b>-w</b>, and <b>-x</b> options do not
apply to this pattern. The option may be given any number of times in order to
specify more than one pattern. If a directory matches both <b>--include-dir</b>
and <b>--exclude-dir</b>, it is excluded. There is no short form for this
option.
</P>
<P>
<b>-F</b>, <b>--fixed-strings</b>
Interpret each data-matching pattern as a list of fixed strings, separated by
newlines, instead of as a regular expression. What constitutes a newline for
this purpose is controlled by the <b>--newline</b> option. The <b>-w</b> (match
as a word) and <b>-x</b> (match whole line) options can be used with <b>-F</b>.
They apply to each of the fixed strings. A line is selected if any of the fixed
strings are found in it (subject to <b>-w</b> or <b>-x</b>, if present). This
option applies only to the patterns that are matched against the contents of
files; it does not apply to patterns specified by any of the <b>--include</b> or
<b>--exclude</b> options.
</P>
<P>
<b>-f</b> <i>filename</i>, <b>--file=</b><i>filename</i>
Read patterns from the file, one per line. As is the case with patterns on the
command line, no delimiters should be used. What constitutes a newline when
reading the file is the operating system's default interpretation of \n. The
<b>--newline</b> option has no effect on this option. Trailing white space is
removed from each line, and blank lines are ignored unless the
<b>--posix-pattern-file</b> option is also provided. An empty file contains no
patterns and therefore matches nothing. Patterns read from a file in this way
may contain binary zeros, which are treated as ordinary character literals.
<br>
<br>
If this option is given more than once, all the specified files are read. A
data line is output if any of the patterns match it. A file name can be given
as "-" to refer to the standard input. When <b>-f</b> is used, patterns
specified on the command line using <b>-e</b> may also be present; they are
matched before the file's patterns. However, no pattern is taken from the
command line; all arguments are treated as the names of paths to be searched.
</P>
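<P>
The default treatment of lines in a pattern file (trailing white space removed,
blank lines ignored) can be sketched in C as below. The function
<i>prepare_pattern_line</i> is a hypothetical helper. It works on
zero-terminated strings, so unlike <b>pcre2grep</b> itself it cannot handle
patterns that contain binary zeros, and it does not model the
<b>--posix-pattern-file</b> option.
</P>

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Strip trailing white space from line in place. Returns 1 if a pattern
   remains, or 0 if the line was blank and should be ignored. */
int prepare_pattern_line(char *line)
{
  size_t len = strlen(line);
  while (len > 0 && isspace((unsigned char)line[len - 1])) len--;
  line[len] = '\0';
  return len > 0;
}
```

<P>
Leading white space is left alone in this sketch, because the text above
mentions only trailing white space.
</P>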
<P>
<b>--file-list</b>=<i>filename</i>
Read a list of files and/or directories that are to be scanned from the given
file, one per line. What constitutes a newline when reading the file is the
operating system's default. Trailing white space is removed from each line, and
blank lines are ignored. These paths are processed before any that are listed
on the command line. The file name can be given as "-" to refer to the standard
input. If <b>--file</b> and <b>--file-list</b> are both specified as "-",
patterns are read first. This is useful only when the standard input is a
terminal, from which further lines (the list of files) can be read after an
end-of-file indication. If this option is given more than once, all the
specified files are read.
</P>
<P>
<b>--file-offsets</b>
Instead of showing lines or parts of lines that match, show each match as an
offset from the start of the file and a length, separated by a comma. In this
mode, <b>--colour</b> has no effect, and no context is shown. That is, the
<b>-A</b>, <b>-B</b>, and <b>-C</b> options are ignored. If there is more than one
match in a line, each of them is shown separately. This option is mutually
exclusive with <b>--output</b>, <b>--line-offsets</b>, and <b>--only-matching</b>.
</P>
<P>
<b>--group-separator</b>=<i>text</i>
Output this text string instead of two hyphens between groups of lines when
<b>-A</b>, <b>-B</b>, or <b>-C</b> is in use. See also <b>--no-group-separator</b>.
</P>
<P>
<b>-H</b>, <b>--with-filename</b>
Force the inclusion of the file name at the start of output lines when
searching a single file. The file name is not normally shown in this case.
By default, for matching lines, the file name is followed by a colon; for
context lines, a hyphen separator is used. The <b>-Z</b> option can be used to
change the terminator to a zero byte. If a line number is also being output,
it follows the file name. When the <b>-M</b> option causes a pattern to match
more than one line, only the first is preceded by the file name. This option
overrides any previous <b>-h</b>, <b>-l</b>, or <b>-L</b> options.
</P>
<P>
<b>-h</b>, <b>--no-filename</b>
Suppress the output of file names when searching multiple files. File names are
normally shown when multiple files are searched. By default, for matching
lines, the file name is followed by a colon; for context lines, a hyphen
separator is used. The <b>-Z</b> option can be used to change the terminator to
a zero byte. If a line number is also being output, it follows the file name.
This option overrides any previous <b>-H</b>, <b>-L</b>, or <b>-l</b> options.
</P>
<P>
<b>--heap-limit</b>=<i>number</i>
See <b>--match-limit</b> below.
</P>
<P>
<b>--help</b>
Output a help message, giving brief details of the command options and file
type support, and then exit. Anything else on the command line is
ignored.
</P>
<P>
<b>-I</b>
Ignore binary files. This is equivalent to
<b>--binary-files</b>=<i>without-match</i>.
</P>
<P>
<b>-i</b>, <b>--ignore-case</b>
Ignore upper/lower case distinctions when pattern matching. This applies when
matching path names for inclusion or exclusion as well as when matching lines
in files.
</P>
<P>
<b>--include</b>=<i>pattern</i>
If any <b>--include</b> patterns are specified, the only files that are
processed are those whose names match one of the patterns and do not match an
<b>--exclude</b> pattern. This option does not affect directories, but it
applies to all files, whether listed on the command line, obtained from
<b>--file-list</b>, or by scanning a directory. The pattern is a PCRE2 regular
expression, and is matched against the final component of the file name, not
the entire path. The <b>-F</b>, <b>-w</b>, and <b>-x</b> options do not apply to
this pattern. The option may be given any number of times. If a file name
matches both an <b>--include</b> and an <b>--exclude</b> pattern, it is excluded.
There is no short form for this option.
</P>
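<P>
The way <b>--include</b> and <b>--exclude</b> interact can be summarised in a
small decision function. The following is a sketch under stated assumptions:
Python's <b>re</b> module stands in for PCRE2, patterns are assumed to be
unanchored searches, and the name <i>should_process</i> is hypothetical.
</P>

```python
import os
import re

def should_process(path, includes, excludes):
    """Decide whether a file is scanned, following the rules described above.

    Assumption: re.search (unanchored) models how the patterns are applied.
    """
    name = os.path.basename(path)            # only the final component is tested
    if any(re.search(p, name) for p in excludes):
        return False                         # exclusion wins over inclusion
    if includes:
        return any(re.search(p, name) for p in includes)
    return True                              # no --include given: nothing is filtered out

print(should_process("src/main.c", [r"\.c$"], []))                # True
print(should_process("src/test_main.c", [r"\.c$"], [r"^test_"]))  # False
```

Note that a directory name such as "docs.c" plays no part in the decision,
because only the last path component is examined.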
<P>
<b>--include-from=</b><i>filename</i>
Treat each non-empty line of the file as the data for an <b>--include</b>
option. What constitutes a newline for this purpose is the operating system's
default. The <b>--newline</b> option has no effect on this option. This option
may be given any number of times; all the files are read.
</P>
<P>
<b>--include-dir</b>=<i>pattern</i>
If any <b>--include-dir</b> patterns are specified, the only directories that
are processed are those whose names match one of the patterns and do not match
an <b>--exclude-dir</b> pattern. This applies to all directories, whether listed
on the command line, obtained from <b>--file-list</b>, or by scanning a parent
directory. The pattern is a PCRE2 regular expression, and is matched against
the final component of the directory name, not the entire path. The <b>-F</b>,
<b>-w</b>, and <b>-x</b> options do not apply to this pattern. The option may be
given any number of times. If a directory matches both <b>--include-dir</b> and
<b>--exclude-dir</b>, it is excluded. There is no short form for this option.
</P>
<P>
<b>-L</b>, <b>--files-without-match</b>
Instead of outputting lines from the files, just output the names of the files
that do not contain any lines that would have been output. Each file name is
output once, on a separate line by default, but if the <b>-Z</b> option is set,
they are separated by zero bytes instead of newlines. This option overrides any
previous <b>-H</b>, <b>-h</b>, or <b>-l</b> options.
</P>
<P>
<b>-l</b>, <b>--files-with-matches</b>
Instead of outputting lines from the files, just output the names of the files
containing lines that would have been output. Each file name is output once, on
a separate line, but if the <b>-Z</b> option is set, they are separated by zero
bytes instead of newlines. Searching normally stops as soon as a matching line
is found in a file. However, if the <b>-c</b> (count) option is also used,
matching continues in order to obtain the correct count, and those files that
have at least one match are listed along with their counts. Using this option
with <b>-c</b> is a way of suppressing the listing of files with no matches that
occurs with <b>-c</b> on its own. This option overrides any previous <b>-H</b>,
<b>-h</b>, or <b>-L</b> options.
</P>
<P>
<b>--label</b>=<i>name</i>
This option supplies a name to be used for the standard input when file names
are being output. If not supplied, "(standard input)" is used. There is no
short form for this option.
</P>
<P>
<b>--line-buffered</b>
When this option is given, non-compressed input is read and processed line by
line, and the output is flushed after each write. By default, input is read in
large chunks, unless <b>pcre2grep</b> can determine that it is reading from a
terminal, which is currently possible only in Unix-like environments or
Windows. Output to terminal is normally automatically flushed by the operating
system. This option can be useful when the input or output is attached to a
pipe and you do not want <b>pcre2grep</b> to buffer up large amounts of data.
However, its use will affect performance, and the <b>-M</b> (multiline) option
ceases to work. When input is from a compressed .gz or .bz2 file,
<b>--line-buffered</b> is ignored.
</P>
<P>
<b>--line-offsets</b>
Instead of showing lines or parts of lines that match, show each match as a
line number, the offset from the start of the line, and a length. The line
number is terminated by a colon (as usual; see the <b>-n</b> option), and the
offset and length are separated by a comma. In this mode, <b>--colour</b> has no
effect, and no context is shown. That is, the <b>-A</b>, <b>-B</b>, and <b>-C</b>
options are ignored. If there is more than one match in a line, each of them is
shown separately. This option is mutually exclusive with <b>--output</b>,
<b>--file-offsets</b>, and <b>--only-matching</b>.
</P>
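<P>
As an illustration of the <b>--line-offsets</b> output format, the sketch below
produces "line:offset,length" items for each match. It is a model rather than
the actual implementation; Python's <b>re</b> module stands in for PCRE2 and
the name <i>line_offsets</i> is hypothetical.
</P>

```python
import re

def line_offsets(text: str, pattern: str) -> list:
    """Return 'line:offset,length' strings, one per match.

    Assumption: Python's re module is used as a stand-in for PCRE2.
    """
    regex = re.compile(pattern)
    results = []
    for number, line in enumerate(text.splitlines(), start=1):
        for m in regex.finditer(line):   # each match in a line is shown separately
            results.append(f"{number}:{m.start()},{m.end() - m.start()}")
    return results

print(line_offsets("abc\nxbcx bc\n", "bc"))   # ['1:1,2', '2:1,2', '2:5,2']
```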
<P>
<b>--locale</b>=<i>locale-name</i>
This option specifies a locale to be used for pattern matching. It overrides
the value in the <b>LC_ALL</b> or <b>LC_CTYPE</b> environment variables. If no
locale is specified, the PCRE2 library's default (usually the "C" locale) is
used. There is no short form for this option.
</P>
<P>
<b>-M</b>, <b>--multiline</b>
Allow patterns to match more than one line. When this option is set, the PCRE2
library is called in "multiline" mode, and a match is allowed to continue past
the end of the initial line and onto one or more subsequent lines.
<br>
<br>
Patterns used with <b>-M</b> may usefully contain literal newline characters and
internal occurrences of ^ and $ characters, because in multiline mode these can
match at internal newlines. Because <b>pcre2grep</b> is scanning multiple lines,
the \Z and \z assertions match only at the end of the last line in the file.
The \A assertion matches at the start of the first line of a match. This can
be any line in the file; it is not anchored to the first line.
<br>
<br>
The output for a successful match may consist of more than one line. The first
line is the line in which the match started, and the last line is the line in
which the match ended. If the matched string ends with a newline sequence, the
output ends at the end of that line. If <b>-v</b> is set, none of the lines in a
multi-line match are output. Once a match has been handled, scanning restarts
at the beginning of the line after the one in which the match ended.
<br>
<br>
The newline sequence that separates multiple lines must be matched as part of
the pattern. For example, to find the phrase "regular expression" in a file
where "regular" might be at the end of a line and "expression" at the start of
the next line, you could use this command:
<pre>
pcre2grep -M 'regular\s+expression' &#60;file&#62;
</pre>
The \s escape sequence matches any white space character, including newlines,
and is followed by + so as to match trailing white space on the first line as
well as possibly handling a two-character newline sequence.
<br>
<br>
There is a limit to the number of lines that can be matched, imposed by the way
that <b>pcre2grep</b> buffers the input file as it scans it. With a sufficiently
large processing buffer, this should not be a problem.
<br>
<br>
The <b>-M</b> option does not work when input is read line by line (see
<b>--line-buffered</b>).
</P>
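<P>
The output rules for <b>-M</b> (whole lines from the start of the match to the
end of the match, then a restart on the following line) can be sketched as
follows. This is a simplified model, assuming LF line endings and using
Python's <b>re</b> module in place of PCRE2; the name <i>multiline_blocks</i>
is hypothetical.
</P>

```python
import re

def multiline_blocks(text, pattern):
    """Return the block of whole lines covered by each successive match.

    Assumptions: LF newlines only; Python's re stands in for PCRE2.
    """
    regex = re.compile(pattern, re.MULTILINE)   # ^ and $ may match at internal newlines
    blocks, pos = [], 0
    while pos < len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        start = text.rfind("\n", 0, m.start()) + 1   # start of the line where the match began
        last = max(m.end() - 1, m.start())           # last matched character (if any)
        end = text.find("\n", last)                  # end of the line where the match ended
        end = len(text) if end == -1 else end + 1
        blocks.append(text[start:end])
        pos = end                                    # restart on the line after the match
    return blocks

sample = "a regular\nexpression here\nother\n"
print(multiline_blocks(sample, r"regular\s+expression"))
# ['a regular\nexpression here\n']
```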
<P>
<b>-m</b> <i>number</i>, <b>--max-count</b>=<i>number</i>
Stop processing after finding <i>number</i> matching lines, or non-matching
lines if <b>-v</b> is also set. Any trailing context lines are output after the
final match. In multiline mode, each multiline match counts as just one line
for this purpose. If this limit is reached when reading the standard input from
a regular file, the file is left positioned just after the last matching line.
If <b>-c</b> is also set, the count that is output is never greater than
<i>number</i>. This option has no effect if used with <b>-L</b>, <b>-l</b>, or
<b>-q</b>, or when just checking for a match in a binary file.
</P>
<P>
<b>--match-limit</b>=<i>number</i>
Processing some regular expression patterns may take a very long time to search
for all possible matching strings. Others may require a very large amount of
memory. There are three options that set resource limits for matching.
<br>
<br>
The <b>--match-limit</b> option provides a means of limiting computing resource
usage when processing patterns that are not going to match, but which have a
very large number of possibilities in their search trees. The classic example
is a pattern that uses nested unlimited repeats. Internally, PCRE2 has a
counter that is incremented each time around its main processing loop. If the
value set by <b>--match-limit</b> is reached, an error occurs.
<br>
<br>
The <b>--heap-limit</b> option specifies, as a number of kibibytes (units of
1024 bytes), the maximum amount of heap memory that may be used for matching.
<br>
<br>
The <b>--depth-limit</b> option limits the depth of nested backtracking points,
which indirectly limits the amount of memory that is used. The amount of memory
needed for each backtracking point depends on the number of capturing
parentheses in the pattern, so the amount of memory that is used before this
limit acts varies from pattern to pattern. This limit is of use only if it is
set smaller than <b>--match-limit</b>.
<br>
<br>
There are no short forms for these options. The default limits can be set
when the PCRE2 library is compiled; if they are not specified, the defaults
are very large and so effectively unlimited.
</P>
<P>
<b>--max-buffer-size</b>=<i>number</i>
This limits the expansion of the processing buffer, whose initial size can be
set by <b>--buffer-size</b>. The maximum buffer size is silently forced to be no
smaller than the starting buffer size.
</P>
<P>
<b>-N</b> <i>newline-type</i>, <b>--newline</b>=<i>newline-type</i>
Six different conventions for indicating the ends of lines in scanned files are
supported. For example:
<pre>
pcre2grep -N CRLF 'some pattern' &#60;file&#62;
</pre>
The newline type may be specified in upper, lower, or mixed case. If the
newline type is NUL, lines are separated by binary zero characters. The other
types are the single-character sequences CR (carriage return) and LF
(linefeed), the two-character sequence CRLF, an "anycrlf" type, which
recognizes any of the preceding three types, and an "any" type, for which any
Unicode line ending sequence is assumed to end a line. The Unicode sequences
are the three just mentioned, plus VT (vertical tab, U+000B), FF (form feed,
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
(paragraph separator, U+2029).
<br>
<br>
When the PCRE2 library is built, a default line-ending sequence is specified.
This is normally the standard sequence for the operating system. Unless
otherwise specified by this option, <b>pcre2grep</b> uses the library's default.
<br>
<br>
This option makes it possible to use <b>pcre2grep</b> to scan files that have
come from other environments without having to modify their line endings. If
the data that is being scanned does not agree with the convention set by this
option, <b>pcre2grep</b> may behave in strange ways. Note that this option does
not apply to files specified by the <b>-f</b>, <b>--exclude-from</b>, or
<b>--include-from</b> options, which are expected to use the operating system's
standard newline sequence.
</P>
<P>
<b>-n</b>, <b>--line-number</b>
Precede each output line by its line number in the file, followed by a colon
for matching lines or a hyphen for context lines. If the file name is also
being output, it precedes the line number. When the <b>-M</b> option causes a
pattern to match more than one line, only the first is preceded by its line
number. This option is forced if <b>--line-offsets</b> is used.
</P>
<P>
<b>--no-group-separator</b>
Do not output a separator between groups of lines when <b>-A</b>, <b>-B</b>, or
<b>-C</b> is in use. The default is to output a line containing two hyphens. See
also <b>--group-separator</b>.
</P>
<P>
<b>--no-jit</b>
If the PCRE2 library is built with support for just-in-time compiling (which
speeds up matching), <b>pcre2grep</b> automatically makes use of this, unless it
was explicitly disabled at build time. This option can be used to disable the
use of JIT at run time. It is provided for testing and working around problems.
It should never be needed in normal use.
</P>
<P>
<b>-O</b> <i>text</i>, <b>--output</b>=<i>text</i>
When there is a match, instead of outputting the line that matched, output just
the text specified in this option, followed by an operating-system standard
newline. In this mode, <b>--colour</b> has no effect, and no context is shown.
That is, the <b>-A</b>, <b>-B</b>, and <b>-C</b> options are ignored. The
<b>--newline</b> option has no effect on this option, which is mutually
exclusive with <b>--only-matching</b>, <b>--file-offsets</b>, and
<b>--line-offsets</b>. However, like <b>--only-matching</b>, if there is more
than one match in a line, each of them causes a line of output.
<br>
<br>
Escape sequences starting with a dollar character may be used to insert the
contents of the matched part of the line and/or captured substrings into the
text.
<br>
<br>
$&#60;digits&#62; or ${&#60;digits&#62;} is replaced by the captured substring of the given
decimal number; $& (or the legacy $0) substitutes the whole match. If the
number is greater than the number of capturing substrings, or if the capture
is unset, the replacement is empty.
<br>
<br>
$a is replaced by bell; $b by backspace; $e by escape; $f by form feed; $n by
newline; $r by carriage return; $t by tab; $v by vertical tab.
<br>
<br>
$o&#60;digits&#62; or $o{&#60;digits&#62;} is replaced by the character whose code point is the
given octal number. In the first form, up to three octal digits are processed.
When more digits are needed in Unicode mode to specify a wide character, the
second form must be used.
<br>
<br>
$x&#60;digits&#62; or $x{&#60;digits&#62;} is replaced by the character represented by the
given hexadecimal number. In the first form, up to two hexadecimal digits are
processed. When more digits are needed in Unicode mode to specify a wide
character, the second form must be used.
<br>
<br>
Any other character is substituted by itself. In particular, $$ is replaced by
a single dollar.
</P>
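<P>
The escape processing for <b>--output</b> text can be modelled roughly as shown
below. This sketch covers the forms listed above but omits error handling (for
example, a trailing dollar is simply left in place here). Python's <b>re</b>
module stands in for PCRE2, and the names <i>TOKEN</i>, <i>SIMPLE</i>, and
<i>expand_output</i> are hypothetical.
</P>

```python
import re

# Single-letter escapes described above.
SIMPLE = {"a": "\a", "b": "\b", "e": "\x1b", "f": "\f",
          "n": "\n", "r": "\r", "t": "\t", "v": "\v"}

TOKEN = re.compile(
    r"\$(?:\{(?P<braced>\d+)\}"          # ${digits}
    r"|(?P<num>\d+)"                     # $digits (includes the legacy $0)
    r"|(?P<amp>&)"                       # $& is the whole match
    r"|o\{(?P<obr>[0-7]+)\}"             # $o{digits}
    r"|o(?P<oct>[0-7]{1,3})"             # $o with up to three octal digits
    r"|x\{(?P<xbr>[0-9A-Fa-f]+)\}"       # $x{digits}
    r"|x(?P<hex>[0-9A-Fa-f]{1,2})"       # $x with up to two hex digits
    r"|(?P<other>.))",                   # any other character, including $$
    re.DOTALL,
)

def expand_output(template, match):
    """Expand dollar escapes in template using a Python match object."""
    def repl(tok):
        g = tok.groupdict()
        if g["amp"]:
            return match.group(0)
        number = g["braced"] or g["num"]
        if number is not None:
            n = int(number)
            if n > match.re.groups:
                return ""                    # no such capture group: empty
            return match.group(n) or ""      # unset capture: empty
        octal = g["obr"] or g["oct"]
        if octal:
            return chr(int(octal, 8))
        hexa = g["xbr"] or g["hex"]
        if hexa:
            return chr(int(hexa, 16))
        return SIMPLE.get(g["other"], g["other"])
    return TOKEN.sub(repl, template)

m = re.search(r"(\w+)@(\w+)", "mail bob@example now")
print(repr(expand_output("user=$1 host=${2}$t[$&] $$ $x41 $o{101}", m)))
# 'user=bob host=example\t[bob@example] $ A A'
```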
<P>
<b>-o</b>, <b>--only-matching</b>
Show only the part of the line that matched a pattern instead of the whole
line. In this mode, no context is shown. That is, the <b>-A</b>, <b>-B</b>, and
<b>-C</b> options are ignored. If there is more than one match in a line, each
of them is shown separately, on a separate line of output. If <b>-o</b> is
combined with <b>-v</b> (invert the sense of the match to find non-matching
lines), no output is generated, but the return code is set appropriately. If
the matched portion of the line is empty, nothing is output unless the file
name or line number are being printed, in which case they are shown on an
otherwise empty line. This option is mutually exclusive with <b>--output</b>,
<b>--file-offsets</b> and <b>--line-offsets</b>.
</P>
<P>
<b>-o</b><i>number</i>, <b>--only-matching</b>=<i>number</i>
Show only the part of the line that matched the capturing parentheses of the
given number. Up to 50 capturing parentheses are supported by default. This
limit can be changed via the <b>--om-capture</b> option. A pattern may contain
any number of capturing parentheses, but only those whose number is within the
limit can be accessed by <b>-o</b>. An error occurs if the number specified by
<b>-o</b> is greater than the limit.
<br>
<br>
-o0 is the same as <b>-o</b> without a number. Because these options can be
given without an argument (see above), if an argument is present, it must be
given in the same shell item, for example, -o3 or --only-matching=2. The
comments given for the non-argument case above also apply to this option. If
the specified capturing parentheses do not exist in the pattern, or were not
set in the match, nothing is output unless the file name or line number are
being output.
<br>
<br>
If this option is given multiple times, multiple substrings are output for each
match, in the order the options are given, and all on one line. For example,
-o3 -o1 -o3 causes the substrings matched by capturing parentheses 3 and 1 and
then 3 again to be output. By default, there is no separator (but see the next
but one option).
</P>
<P>
<b>--om-capture</b>=<i>number</i>
Set the number of capturing parentheses that can be accessed by <b>-o</b>. The
default is 50.
</P>
<P>
<b>--om-separator</b>=<i>text</i>
Specify a separating string for multiple occurrences of <b>-o</b>. The default
is an empty string. Separating strings are never coloured.
</P>
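<P>
The effect of repeating <b>-o</b> with different numbers, optionally combined
with <b>--om-separator</b>, can be sketched as follows. This is a model only;
Python's <b>re</b> module stands in for PCRE2 and the name
<i>only_matching</i> is hypothetical.
</P>

```python
import re

def only_matching(line, pattern, groups, separator=""):
    """Return one output string per match, built from the requested groups.

    Assumption: Python's re module is used as a stand-in for PCRE2.
    """
    out = []
    for m in re.finditer(pattern, line):
        parts = []
        for g in groups:                       # groups are emitted in the order given
            if g <= m.re.groups:
                parts.append(m.group(g) or "") # unset group contributes nothing
            else:
                parts.append("")               # group does not exist in the pattern
        out.append(separator.join(parts))
    return out

date = r"(\d+)-(\d+)-(\d+)"
print(only_matching("2025-08-11", date, [3, 1, 3]))        # ['11202511']
print(only_matching("2025-08-11", date, [3, 1, 3], "/"))   # ['11/2025/11']
```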
<P>
<b>-P</b>, <b>--no-ucp</b>
Starting from release 10.43, when UTF/Unicode mode is specified with <b>-u</b>
or <b>-U</b>, the PCRE2_UCP option is used by default. This means that the
POSIX classes in patterns match more than just ASCII characters. For example,
[:digit:] matches any Unicode decimal digit. The <b>--no-ucp</b> option
suppresses PCRE2_UCP, thus restricting the POSIX classes to ASCII characters,
as was the case in earlier releases. Note that there are now more fine-grained
option settings within patterns that affect individual classes. For example,
when in UCP mode, the sequence (?aP) restricts [:word:] to ASCII letters, while
allowing \w to match Unicode letters and digits.
</P>
<P>
<b>--posix-pattern-file</b>
When patterns are provided with the <b>-f</b> option, do not trim trailing
spaces or ignore empty lines, which matches the behaviour of other grep tools.
To keep the behaviour consistent with older versions, if a pattern that is read
is terminated with CRLF (as character literals), neither character is included
as part of it. If you really need a pattern that ends in '\r', use an escape
sequence or provide it by a different method.
</P>
<P>
<b>-q</b>, <b>--quiet</b>
Work quietly, that is, display nothing except error messages. The exit
status indicates whether or not any matches were found.
</P>
<P>
<b>-r</b>, <b>--recursive</b>
If any given path is a directory, recursively scan the files it contains,
taking note of any <b>--include</b> and <b>--exclude</b> settings. By default, a
directory is read as a normal file; in some operating systems this gives an
immediate end-of-file. This option is a shorthand for setting the <b>-d</b>
option to "recurse".
</P>
<P>
<b>--recursion-limit</b>=<i>number</i>
This is an obsolete synonym for <b>--depth-limit</b>. See <b>--match-limit</b>
above for details.
</P>
<P>
<b>-s</b>, <b>--no-messages</b>
Suppress error messages about non-existent or unreadable files. Such files are
quietly skipped. However, the return code is still 2, even if matches were
found in other files.
</P>
<P>
<b>-t</b>, <b>--total-count</b>
This option is useful when scanning more than one file. If used on its own,
<b>-t</b> suppresses all output except for a grand total number of matching
lines (or non-matching lines if <b>-v</b> is used) in all the files. If <b>-t</b>
is used with <b>-c</b>, a grand total is output except when the previous output
is just one line. In other words, it is not output when just one file's count
is listed. If file names are being output, the grand total is preceded by
"TOTAL:". Otherwise, it appears as just another number. The <b>-t</b> option is
ignored when used with <b>-L</b> (list files without matches), because the grand
total would always be zero.
</P>
<P>
<b>-u</b>, <b>--utf</b>
Operate in UTF/Unicode mode. This option is available only if PCRE2 has been
compiled with UTF-8 support. All patterns (including those for any
<b>--exclude</b> and <b>--include</b> options) and all lines that are scanned
must be valid strings of UTF-8 characters. If an invalid UTF-8 string is
encountered, an error occurs.
</P>
<P>
<b>-U</b>, <b>--utf-allow-invalid</b>
As <b>--utf</b>, but in addition subject lines may contain invalid UTF-8 code
unit sequences. These can never form part of any pattern match. Patterns
themselves, however, must still be valid UTF-8 strings. This facility allows
valid UTF-8 strings to be sought within arbitrary byte sequences in executable
or other binary files. For more details about matching in non-valid UTF-8
strings, see the
<a href="pcre2unicode.html"><b>pcre2unicode</b>(3)</a>
documentation.
</P>
<P>
<b>-V</b>, <b>--version</b>
Write the version numbers of <b>pcre2grep</b> and the PCRE2 library to the
standard output and then exit. Anything else on the command line is
ignored.
</P>
<P>
<b>-v</b>, <b>--invert-match</b>
Invert the sense of the match, so that lines which do <i>not</i> match any of
the patterns are the ones that are found. When this option is set, options such
as <b>--only-matching</b> and <b>--output</b>, which specify parts of a match
that are to be output, are ignored.
</P>
<P>
<b>-w</b>, <b>--word-regex</b>, <b>--word-regexp</b>
Force the patterns only to match "words". That is, there must be a word
boundary at the start and end of each matched string. This is equivalent to
having "\b(?:" at the start of each pattern, and ")\b" at the end. This
option applies only to the patterns that are matched against the contents of
files; it does not apply to patterns specified by any of the <b>--include</b> or
<b>--exclude</b> options.
</P>
<P>
<b>-x</b>, <b>--line-regex</b>, <b>--line-regexp</b>
Force the patterns to start matching only at the beginnings of lines, and in
addition, require them to match entire lines. In multiline mode the match may
be more than one line. This is equivalent to having "^(?:" at the start of each
pattern and ")$" at the end. This option applies only to the patterns that are
matched against the contents of files; it does not apply to patterns specified
by any of the <b>--include</b> or <b>--exclude</b> options.
</P>
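<P>
The wrapping described for <b>-w</b> and <b>-x</b> matters most when a pattern
contains top-level alternation. The sketch below shows why the non-capturing
group is needed; it uses Python's <b>re</b> module as a stand-in for PCRE2, and
the names <i>wrap_word</i> and <i>wrap_line</i> are hypothetical.
</P>

```python
import re

def wrap_word(pattern):
    """Model of -w: require a word boundary on both sides of the whole pattern."""
    return r"\b(?:" + pattern + r")\b"

def wrap_line(pattern):
    """Model of -x: require the whole pattern to match an entire line."""
    return "^(?:" + pattern + ")$"

print(bool(re.search(wrap_word("cat|dog"), "concatenate")))  # False
print(bool(re.search(wrap_word("cat|dog"), "hot dog!")))     # True
print(bool(re.search(wrap_line("a|ab"), "abc")))             # False
# Without the group, the anchors bind to single alternatives only:
print(bool(re.search("^a|ab$", "abc")))                      # True
```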
<P>
<b>-Z</b>, <b>--null</b>
Terminate file names in the regular output with a zero byte (the NUL
character) instead of what would normally appear. This is useful when file
names contain unusual characters such as colons, hyphens, or even newlines. The
option does not apply to file names in error messages.
</P>
<br><a name="SEC7" href="#TOC1">ENVIRONMENT VARIABLES</a><br>
<P>
The environment variables <b>LC_ALL</b> and <b>LC_CTYPE</b> are examined, in that
order, for a locale. The first one that is set is used. This can be overridden
by the <b>--locale</b> option. If no locale is set, the PCRE2 library's default
(usually the "C" locale) is used.
</P>
<br><a name="SEC8" href="#TOC1">NEWLINES</a><br>
<P>
The <b>-N</b> (<b>--newline</b>) option allows <b>pcre2grep</b> to scan files with
newline conventions that differ from the default. This option affects only the
way scanned files are processed. It does not affect the interpretation of files
specified by the <b>-f</b>, <b>--file-list</b>, <b>--exclude-from</b>, or
<b>--include-from</b> options.
</P>
<P>
Any parts of the scanned input files that are written to the standard output
are copied with whatever newline sequences they have in the input. However, if
the final line of a file is output, and it does not end with a newline
sequence, a newline sequence is added. If the newline setting is CR, LF, CRLF
or NUL, that line ending is output; for the other settings (ANYCRLF or ANY) a
single NL is used.
</P>
<P>
The newline setting does not affect the way in which <b>pcre2grep</b> writes
newlines in informational messages to the standard output and error streams.
Under Windows, the standard output is set to be binary, so that "\r\n" at the
ends of output lines that are copied from the input is not converted to
"\r\r\n" by the C I/O library. This means that any messages written to the
standard output must end with "\r\n". For all other operating systems, and
for all messages to the standard error stream, "\n" is used.
</P>
<br><a name="SEC9" href="#TOC1">OPTIONS COMPATIBILITY WITH GNU GREP</a><br>
<P>
Many of the short and long forms of <b>pcre2grep</b>'s options are the same as
in the GNU <b>grep</b> program. Any long option of the form <b>--xxx-regexp</b>
(GNU terminology) is also available as <b>--xxx-regex</b> (PCRE2 terminology).
However, the <b>--case-restrict</b>, <b>--depth-limit</b>, <b>-E</b>,
<b>--file-list</b>, <b>--file-offsets</b>, <b>--heap-limit</b>,
<b>--include-dir</b>, <b>--line-offsets</b>, <b>--locale</b>, <b>--match-limit</b>,
<b>-M</b>, <b>--multiline</b>, <b>-N</b>, <b>--newline</b>, <b>--no-ucp</b>,
<b>--om-separator</b>, <b>--output</b>, <b>-P</b>, <b>-u</b>, <b>--utf</b>,
<b>-U</b>, and <b>--utf-allow-invalid</b> options are specific to
<b>pcre2grep</b>, as is the use of the <b>--only-matching</b> option with a
capturing parentheses number.
</P>
<P>
Although most of the common options work the same way, a few are different in
<b>pcre2grep</b>. For example, the <b>--include</b> option's argument is a glob
for GNU <b>grep</b>, but in <b>pcre2grep</b> it is a regular expression to which
the <b>-i</b> option applies. If both the <b>-c</b> and <b>-l</b> options are
given, GNU grep lists only file names, without counts, but <b>pcre2grep</b>
gives the counts as well.
</P>
<br><a name="SEC10" href="#TOC1">OPTIONS WITH DATA</a><br>
<P>
There are four different ways in which an option with data can be specified.
If a short form option is used, the data may follow immediately, or (with one
exception) in the next command line item. For example:
<pre>
-f/some/file
-f /some/file
</pre>
The exception is the <b>-o</b> option, which may appear with or without data.
Because of this, if data is present, it must follow immediately in the same
item, for example -o3.
</P>
<P>
If a long form option is used, the data may appear in the same command line
item, separated by an equals character, or (with two exceptions) it may appear
in the next command line item. For example:
<pre>
--file=/some/file
--file /some/file
</pre>
Note, however, that if you want to supply a file name beginning with ~ as data
in a shell command, and have the shell expand ~ to a home directory, you must
separate the file name from the option, because the shell does not treat ~
specially unless it is at the start of an item.
</P>
<P>
The exceptions to the above are the <b>--colour</b> (or <b>--color</b>) and
<b>--only-matching</b> options, for which the data is optional. If one of these
options does have data, it must be given in the first form, using an equals
character. Otherwise <b>pcre2grep</b> will assume that it has no data.
</P>
<br><a name="SEC11" href="#TOC1">USING PCRE2'S CALLOUT FACILITY</a><br>
<P>
<b>pcre2grep</b> has, by default, support for calling external programs or
scripts or echoing specific strings during matching by making use of PCRE2's
callout facility. However, this support can be completely or partially disabled
when <b>pcre2grep</b> is built. You can find out whether your binary has support
for callouts by running it with the <b>--help</b> option. If callout support is
completely disabled, callouts in patterns are forbidden by <b>pcre2grep</b>.
If the facility is partially disabled, calling external programs is not
supported, and callouts that request it are ignored.
</P>
<P>
A callout in a PCRE2 pattern is of the form (?C&#60;arg&#62;) where the argument is
either a number or a quoted string (see the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation for details). Numbered callouts are ignored by <b>pcre2grep</b>;
only callouts with string arguments are useful.
</P>
<br><b>
Echoing a specific string
</b><br>
<P>
Starting the callout string with a pipe character invokes an echoing facility
that avoids calling an external program or script. This facility is always
available, provided that callouts were not completely disabled when
<b>pcre2grep</b> was built. The rest of the callout string is processed as a
zero-terminated string, which means it should not contain any internal binary
zeros. It is written to the output, having first been passed through the same
escape processing as text from the <b>--output</b> (<b>-O</b>) option (see
above). However, $0 or $& cannot be used to insert a matched substring because
the match is still in progress. Instead, the single character '0' is inserted.
Any syntax error in the string (for example, a dollar not followed by another
character) causes the callout to be ignored. No terminator is added to the
output string, so if you want a newline, you must include it explicitly using
the escape $n. For example:
<pre>
pcre2grep '(.)(..(.))(?C"|[$1] [$2] [$3]$n")' &#60;some file&#62;
</pre>
Matching continues normally after the string is output. If you want to see only
the callout output but not any output from an actual match, you should end the
pattern with (*FAIL).
</P>
<br><b>
Calling external programs or scripts
</b><br>
<P>
This facility can be independently disabled when <b>pcre2grep</b> is built. It
is supported for Windows, where a call to <b>_spawnvp()</b> is used, for VMS,
where <b>lib$spawn()</b> is used, and for any Unix-like environment where
<b>fork()</b> and <b>execv()</b> are available.
</P>
<P>
If the callout string does not start with a pipe (vertical bar) character, it
is parsed into a list of substrings separated by pipe characters. The first
substring must be an executable name, with the following substrings specifying
arguments:
<pre>
executable_name|arg1|arg2|...
</pre>
Any substring (including the executable name) may contain escape sequences
started by a dollar character. These are the same as for the <b>--output</b>
(<b>-O</b>) option documented above, except that $0 or $& cannot insert the
matched string because the match is still in progress. Instead, the character
'0' is inserted. If you need a literal dollar or pipe character in any
substring, use $$ or $| respectively. Here is an example:
<pre>
echo -e "abcde\n12345" | pcre2grep \
'(?x)(.)(..(.))
(?C"/bin/echo|Arg1: [$1] [$2] [$3]|Arg2: $|${1}$| ($4)")()' -
Output:
Arg1: [a] [bcd] [d] Arg2: |a| ()
abcde
Arg1: [1] [234] [4] Arg2: |1| ()
12345
</pre>
The parameters for the system call that is used to run the program or script
are zero-terminated strings. This means that binary zero characters in the
callout argument will cause premature termination of their substrings, and
therefore should not be present. Any syntax error in the string (for example,
a dollar not followed by another character) causes the callout to be ignored.
If running the program fails for any reason (including the non-existence of the
executable), a local matching failure occurs and the matcher backtracks in the
normal way.
</P>
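<P>
The splitting rule described above can be sketched in C as follows. This is an
illustration only, not the code that <b>pcre2grep</b> uses: the function name
<b>split_callout()</b> and the fixed buffer sizes are assumptions made for the
example, capture references such as $1 are copied through unexpanded, and the
caller is assumed to have checked already that the string does not start with a
pipe.
</P>

```c
#include <stddef.h>
#include <string.h>

#define MAX_PARTS 8        /* hypothetical limits chosen for this sketch */
#define MAX_PART_LEN 64

/* Split a callout string of the form "executable|arg1|arg2" into parts.
 * "$$" yields a literal dollar and "$|" a literal pipe; every other dollar
 * sequence is copied through unchanged (capture expansion is deliberately
 * left out). Returns the number of parts, or -1 for a dollar that is not
 * followed by another character, or if a buffer would overflow. */
int split_callout(const char *s, char parts[MAX_PARTS][MAX_PART_LEN])
{
    int n = 0;
    size_t len = 0;

    for (;; s++) {
        if (*s == '|' || *s == '\0') {      /* unescaped separator, or end */
            parts[n][len] = '\0';
            n++;
            if (*s == '\0') return n;
            if (n == MAX_PARTS) return -1;  /* no room for another part */
            len = 0;
            continue;
        }
        char c = *s;
        if (c == '$') {
            char next = s[1];
            if (next == '\0') return -1;    /* syntax error: trailing dollar */
            if (next == '$' || next == '|') {
                c = next;                   /* escaped literal character */
                s++;
            }
        }
        if (len + 1 >= MAX_PART_LEN) return -1;
        parts[n][len++] = c;
    }
}
```

<P>
A caller would treat a return value of -1 as "ignore this callout", and
otherwise pass the first part as the program name and the rest as its
arguments.
</P>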
<br><a name="SEC12" href="#TOC1">MATCHING ERRORS</a><br>
<P>
It is possible to supply a regular expression that takes a very long time to
fail to match certain lines. Such patterns normally involve nested indefinite
repeats, for example: (a+)*\d when matched against a line of a's with no final
digit. The PCRE2 matching function has a resource limit that causes it to abort
in these circumstances. If this happens, <b>pcre2grep</b> outputs an error
message and the line that caused the problem to the standard error stream. If
there are more than 20 such errors, <b>pcre2grep</b> gives up.
</P>
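<P>
To see why a pattern such as (a+)*\d is so expensive, and how a resource limit
cuts the search short, consider the following toy matcher. It is a sketch
written for this one pattern only, anchored at the start of the subject; the
names <b>toy_match()</b> and <b>toy_search()</b>, the result codes, and the way
steps are counted are all assumptions of the example and do not reflect how
PCRE2 counts against its own limit.
</P>

```c
#include <ctype.h>
#include <stddef.h>

/* Result codes for this sketch (hypothetical, not PCRE2 values). */
enum { TOY_NOMATCH = 0, TOY_MATCH = 1, TOY_LIMIT = -1 };

/* Try to match (a+)*\d at p. Every call counts as one step. */
static int toy_match(const char *p, unsigned long *steps, unsigned long limit)
{
    if (++*steps > limit) return TOY_LIMIT;

    /* One more iteration of the group: a+ is greedy, so take the longest
     * run of 'a' first and then give characters back one at a time. */
    size_t run = 0;
    while (p[run] == 'a') run++;
    for (size_t take = run; take >= 1; take--) {
        int rc = toy_match(p + take, steps, limit);
        if (rc != TOY_NOMATCH) return rc;   /* matched, or limit reached */
    }

    /* No further iterations: the next character must be a digit. */
    return isdigit((unsigned char)*p) ? TOY_MATCH : TOY_NOMATCH;
}

/* Run the toy matcher and report how many steps were used. */
int toy_search(const char *subject, unsigned long limit,
               unsigned long *steps_used)
{
    unsigned long steps = 0;
    int rc = toy_match(subject, &steps, limit);
    *steps_used = steps;
    return rc;
}
```

<P>
In this sketch a subject of n a's with no final digit needs 2 to the power n
steps before failure can be reported, so thirty a's exhaust a limit of ten
thousand steps almost immediately, whereas "aaa7" matches in two steps.
</P>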
<P>
The <b>--match-limit</b> option of <b>pcre2grep</b> can be used to set the
overall resource limit. There are also other limits that affect the amount of
memory used during matching; see the discussion of <b>--heap-limit</b> and
<b>--depth-limit</b> above.
</P>
<br><a name="SEC13" href="#TOC1">DIAGNOSTICS</a><br>
<P>
Exit status is 0 if any matches were found, 1 if no matches were found, and 2
for syntax errors, overlong lines, non-existent or inaccessible files (even if
matches were found in other files) or too many matching errors. Using the
<b>-s</b> option to suppress error messages about inaccessible files does not
affect the return code.
</P>
<P>
When run under VMS, the return code is placed in the symbol PCRE2GREP_RC
because VMS does not distinguish between exit(0) and exit(1).
</P>
<br><a name="SEC14" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2pattern</b>(3), <b>pcre2syntax</b>(3), <b>pcre2callout</b>(3),
<b>pcre2unicode</b>(3).
</P>
<br><a name="SEC15" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC16" href="#TOC1">REVISION</a><br>
<P>
Last updated: 04 February 2025
<br>
Copyright &copy; 1997-2023 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,505 @@
<html>
<head>
<title>pcre2jit specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2jit man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a>
<li><a name="TOC2" href="#SEC2">AVAILABILITY OF JIT SUPPORT</a>
<li><a name="TOC3" href="#SEC3">SIMPLE USE OF JIT</a>
<li><a name="TOC4" href="#SEC4">MATCHING SUBJECTS CONTAINING INVALID UTF</a>
<li><a name="TOC5" href="#SEC5">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a>
<li><a name="TOC6" href="#SEC6">RETURN VALUES FROM JIT MATCHING</a>
<li><a name="TOC7" href="#SEC7">CONTROLLING THE JIT STACK</a>
<li><a name="TOC8" href="#SEC8">JIT STACK FAQ</a>
<li><a name="TOC9" href="#SEC9">FREEING JIT SPECULATIVE MEMORY</a>
<li><a name="TOC10" href="#SEC10">EXAMPLE CODE</a>
<li><a name="TOC11" href="#SEC11">JIT FAST PATH API</a>
<li><a name="TOC12" href="#SEC12">SEE ALSO</a>
<li><a name="TOC13" href="#SEC13">AUTHOR</a>
<li><a name="TOC14" href="#SEC14">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
Just-in-time compiling is a heavyweight optimization that can greatly speed up
pattern matching. However, it comes at the cost of extra processing before the
match is performed, so it is of most benefit when the same pattern is going to
be matched many times. This does not necessarily mean many calls of a matching
function; if the pattern is not anchored, matching attempts may take place many
times at various positions in the subject, even for a single call. Therefore,
if the subject string is very long, it may still pay to use JIT even for
one-off matches. JIT support is available for all of the 8-bit, 16-bit and
32-bit PCRE2 libraries.
</P>
<P>
JIT support applies only to the traditional Perl-compatible matching function.
It does not apply when the DFA matching function is being used. The code for
JIT support was written by Zoltan Herczeg.
</P>
<br><a name="SEC2" href="#TOC1">AVAILABILITY OF JIT SUPPORT</a><br>
<P>
JIT support is an optional feature of PCRE2. The "configure" option
--enable-jit (or equivalent CMake option) must be set when PCRE2 is built if
you want to use JIT. The support is limited to the following hardware
platforms:
<pre>
ARM 32-bit (v7, and Thumb2)
ARM 64-bit
IBM s390x 64-bit
Intel x86 32-bit and 64-bit
LoongArch 64-bit
MIPS 32-bit and 64-bit
Power PC 32-bit and 64-bit
RISC-V 32-bit and 64-bit
</pre>
If --enable-jit is set on an unsupported platform, compilation fails.
</P>
<P>
A client program can tell if JIT support has been compiled by calling
<b>pcre2_config()</b> with the PCRE2_CONFIG_JIT option. The result is one if
PCRE2 was built with JIT support, and zero otherwise. However, having the JIT
code available does not guarantee that it will be used for any particular
match. One reason for this is that there are a number of options and pattern
items that are
<a href="#unsupported">not supported by JIT</a>
(see below). Another reason is that in some environments JIT is unable to get
executable memory in which to build its compiled code. The only guarantee from
<b>pcre2_config()</b> is that if it returns zero, JIT will definitely <i>not</i>
be used.
</P>
<P>
As of release 10.45 there is a more informative way to test for JIT support. If
<b>pcre2_jit_compile()</b> is called with the single option PCRE2_JIT_TEST_ALLOC
it returns zero if JIT is available and has a working allocator. Otherwise it
returns PCRE2_ERROR_NOMEMORY if JIT is available but cannot allocate executable
memory, or PCRE2_ERROR_JIT_UNSUPPORTED if JIT support is not compiled. The
code argument is ignored, so it can be a NULL value.
</P>
<P>
A simple program does not need to check availability in order to use JIT when
possible. The API is implemented in a way that falls back to the interpretive
code if JIT is not available or cannot be used for a given match. For programs
that need the best possible performance, there is a
<a href="#fastpath">"fast path"</a>
API that is JIT-specific.
</P>
<br><a name="SEC3" href="#TOC1">SIMPLE USE OF JIT</a><br>
<P>
To make use of the JIT support in the simplest way, all you have to do is to
call <b>pcre2_jit_compile()</b> after successfully compiling a pattern with
<b>pcre2_compile()</b>. This function has two arguments: the first is the
compiled pattern pointer that was returned by <b>pcre2_compile()</b>, and the
second is zero or more of the following option bits: PCRE2_JIT_COMPLETE,
PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
</P>
<P>
If JIT support is not available, a call to <b>pcre2_jit_compile()</b> does
nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled pattern
is passed to the JIT compiler, which turns it into machine code that executes
much faster than the normal interpretive code, but yields exactly the same
results. The returned value from <b>pcre2_jit_compile()</b> is zero on success,
or a negative error code.
</P>
<P>
There is a limit to the size of pattern that JIT supports, imposed by the size
of machine stack that it uses. The exact rules are not documented because they
may change at any time, in particular, when new optimizations are introduced.
If a pattern is too big, a call to <b>pcre2_jit_compile()</b> returns
PCRE2_ERROR_NOMEMORY.
</P>
<P>
PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for complete
matches. If you want to run partial matches using the PCRE2_PARTIAL_HARD or
PCRE2_PARTIAL_SOFT options of <b>pcre2_match()</b>, you should set one or both
of the other options as well as, or instead of PCRE2_JIT_COMPLETE. The JIT
compiler generates different optimized code for each of the three modes
(normal, soft partial, hard partial). When <b>pcre2_match()</b> is called, the
appropriate code is run if it is available. Otherwise, the pattern is matched
using interpretive code.
</P>
<P>
You can call <b>pcre2_jit_compile()</b> multiple times for the same compiled
pattern. It does nothing if it has previously compiled code for any of the
option bits. For example, you can call it once with PCRE2_JIT_COMPLETE and
(perhaps later, when you find you need partial matching) again with
PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it will ignore
PCRE2_JIT_COMPLETE and just compile code for partial matching. If
<b>pcre2_jit_compile()</b> is called with no option bits set, it immediately
returns zero. This is an alternative way of testing whether JIT support has
been compiled.
</P>
<P>
At present, it is not possible to free JIT compiled code except when the entire
compiled pattern is freed by calling <b>pcre2_code_free()</b>.
</P>
<P>
In some circumstances you may need to call additional functions. These are
described in the section entitled
<a href="#stackcontrol">"Controlling the JIT stack"</a>
below.
</P>
<P>
There are some <b>pcre2_match()</b> options that are not supported by JIT, and
there are also some pattern items that JIT cannot handle. Details are given
<a href="#unsupported">below.</a>
In both cases, matching automatically falls back to the interpretive code. If
you want to know whether JIT was actually used for a particular match, you
should arrange for a JIT callback function to be set up as described in the
section entitled
<a href="#stackcontrol">"Controlling the JIT stack"</a>
below, even if you do not need to supply a non-default JIT stack. Such a
callback function is called whenever JIT code is about to be obeyed. If the
match-time options are not right for JIT execution, the callback function is
not obeyed.
</P>
<P>
If the JIT compiler finds an unsupported item, no JIT data is generated. You
can find out if JIT compilation was successful for a compiled pattern by
calling <b>pcre2_pattern_info()</b> with the PCRE2_INFO_JITSIZE option. A
non-zero result means that JIT compilation was successful. A result of 0 means
that JIT support is not available, or the pattern was not processed by
<b>pcre2_jit_compile()</b>, or the JIT compiler was not able to handle the
pattern. Successful JIT compilation does not, however, guarantee the use of JIT
at match time because there are some match time options that are not supported
by JIT.
</P>
<br><a name="SEC4" href="#TOC1">MATCHING SUBJECTS CONTAINING INVALID UTF</a><br>
<P>
When a pattern is compiled with the PCRE2_UTF option, subject strings are
normally expected to be a valid sequence of UTF code units. By default, this is
checked at the start of matching and an error is generated if invalid UTF is
detected. The PCRE2_NO_UTF_CHECK option can be passed to <b>pcre2_match()</b> to
skip the check (for improved performance) if you are sure that a subject string
is valid. If this option is used with an invalid string, the result is
undefined. The calling program may crash or loop or otherwise misbehave.
</P>
<P>
However, a way of running matches on strings that may contain invalid UTF
sequences is available. Calling <b>pcre2_compile()</b> with the
PCRE2_MATCH_INVALID_UTF option has two effects: it tells the interpreter in
<b>pcre2_match()</b> to support invalid UTF, and, if <b>pcre2_jit_compile()</b>
is subsequently called, the compiled JIT code also supports invalid UTF.
Details of how this support works, in both the JIT and the interpretive cases,
are given in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
documentation.
</P>
<P>
There is also an obsolete option for <b>pcre2_jit_compile()</b> called
PCRE2_JIT_INVALID_UTF, which currently exists only for backward compatibility.
It is superseded by the <b>pcre2_compile()</b> option PCRE2_MATCH_INVALID_UTF
and should no longer be used. It may be removed in future.
<a name="unsupported"></a></P>
<br><a name="SEC5" href="#TOC1">UNSUPPORTED OPTIONS AND PATTERN ITEMS</a><br>
<P>
The <b>pcre2_match()</b> options that are supported for JIT matching are
PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and
PCRE2_PARTIAL_SOFT. The PCRE2_ANCHORED and PCRE2_ENDANCHORED options are not
supported at match time.
</P>
<P>
If the PCRE2_NO_JIT option is passed to <b>pcre2_match()</b> it disables the
use of JIT, forcing matching by the interpreter code.
</P>
<P>
The only unsupported pattern items are \C (match a single data unit) when
running in a UTF mode, and a callout immediately before an assertion condition
in a conditional group.
</P>
<br><a name="SEC6" href="#TOC1">RETURN VALUES FROM JIT MATCHING</a><br>
<P>
When a pattern is matched using JIT, the return values are the same as those
given by the interpretive <b>pcre2_match()</b> code, with the addition of one
new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means that the memory used for
the JIT stack was insufficient. See
<a href="#stackcontrol">"Controlling the JIT stack"</a>
below for a discussion of JIT stack usage.
</P>
<P>
The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if searching
a very large pattern tree goes on for too long, as it is in the same
circumstance when JIT is not used, but the details of exactly what is counted
are not the same. The PCRE2_ERROR_DEPTHLIMIT error code is never returned
when JIT matching is used.
<a name="stackcontrol"></a></P>
<br><a name="SEC7" href="#TOC1">CONTROLLING THE JIT STACK</a><br>
<P>
When the compiled JIT code runs, it needs a block of memory to use as a stack.
By default, it uses 32KiB on the machine stack. However, some large or
complicated patterns need more than this. The error PCRE2_ERROR_JIT_STACKLIMIT
is given when there is not enough stack. Three functions are provided for
managing blocks of memory for use as JIT stacks. There is further discussion
about the use of JIT stacks in the section entitled
<a href="#stackfaq">"JIT stack FAQ"</a>
below.
</P>
<P>
The <b>pcre2_jit_stack_create()</b> function creates a JIT stack. Its arguments
are a starting size, a maximum size, and a general context (for memory
allocation functions, or NULL for standard memory allocation). It returns a
pointer to an opaque structure of type <b>pcre2_jit_stack</b>, or NULL if there
is an error. The <b>pcre2_jit_stack_free()</b> function is used to free a stack
that is no longer needed. If its argument is NULL, this function returns
immediately, without doing anything. (For the technically minded: the address
space is allocated by mmap or VirtualAlloc.) A maximum stack size of 512KiB to
1MiB should be more than enough for any pattern.
</P>
<P>
The <b>pcre2_jit_stack_assign()</b> function specifies which stack JIT code
should use. Its arguments are as follows:
<pre>
pcre2_match_context *mcontext
pcre2_jit_callback callback
void *data
</pre>
The first argument is a pointer to a match context. When this is subsequently
passed to a matching function, its information determines which JIT stack is
used. If this argument is NULL, the function returns immediately, without doing
anything. There are three cases for the values of the other two arguments:
<pre>
(1) If <i>callback</i> is NULL and <i>data</i> is NULL, an internal 32KiB block
on the machine stack is used. This is the default when a match
context is created.
(2) If <i>callback</i> is NULL and <i>data</i> is not NULL, <i>data</i> must be
a pointer to a valid JIT stack, the result of calling
<b>pcre2_jit_stack_create()</b>.
(3) If <i>callback</i> is not NULL, it must point to a function that is
called with <i>data</i> as an argument at the start of matching, in
order to set up a JIT stack. If the return from the callback
function is NULL, the internal 32KiB stack is used; otherwise the
return value must be a valid JIT stack, the result of calling
<b>pcre2_jit_stack_create()</b>.
</pre>
A callback function is obeyed whenever JIT code is about to be run; it is not
obeyed when <b>pcre2_match()</b> is called with options that are incompatible
for JIT matching. A callback function can therefore be used to determine
whether a match operation was executed by JIT or by the interpreter.
</P>
<P>
You may safely use the same JIT stack for more than one pattern (either by
assigning directly or by callback), as long as the patterns are matched
sequentially in the same thread. Currently, the only way to set up
non-sequential matches in one thread is to use callouts: if a callout function
starts another match, that match must use a different JIT stack to the one used
for currently suspended match(es).
</P>
<P>
In a multithread application, if you do not specify a JIT stack, or if you
assign or pass back NULL from a callback, that is thread-safe, because each
thread has its own machine stack. However, if you assign or pass back a
non-NULL JIT stack, this must be a different stack for each thread so that the
application is thread-safe.
</P>
<P>
Strictly speaking, even more is allowed. You can assign the same non-NULL stack
to a match context that is used by any number of patterns, as long as they are
not used for matching by multiple threads at the same time. For example, you
could use the same stack in all compiled patterns, with a global mutex in the
callback to wait until the stack is available for use. However, this is an
inefficient solution, and not recommended.
</P>
<P>
This is a suggestion for how a multithreaded program that needs to set up
non-default JIT stacks might operate:
<pre>
During thread initialization
thread_local_var = pcre2_jit_stack_create(...)
During thread exit
pcre2_jit_stack_free(thread_local_var)
Use a one-line callback function
return thread_local_var
</pre>
All the functions described in this section do nothing if JIT is not available.
<a name="stackfaq"></a></P>
<br><a name="SEC8" href="#TOC1">JIT STACK FAQ</a><br>
<P>
(1) Why do we need JIT stacks?
<br>
<br>
PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack where
the local data of the current node is pushed before checking its child nodes.
Allocating real machine stack on some platforms is difficult. For example, on
PowerPC the stack chain needs to be updated every time the stack is extended.
Although this is possible, the overhead of updating it decreases performance.
So we do the recursion in memory.
</P>
<P>
(2) Why don't we simply allocate blocks of memory with <b>malloc()</b>?
<br>
<br>
Modern operating systems have a nice feature: they can reserve an address space
instead of allocating memory. We can safely allocate memory pages inside this
address space, so the stack could grow without moving memory data (this is
important because of pointers). Thus we can allocate 1MiB address space, and
use only a single memory page (usually 4KiB) if that is enough. However, we can
still grow up to 1MiB anytime if needed.
</P>
<P>
(3) Who "owns" a JIT stack?
<br>
<br>
The owner of the stack is the user program, not the JIT-compiled pattern or
anything else. The user program must ensure that while a stack is being used by
<b>pcre2_match()</b> (that is, it is assigned to a match context that is passed
to the pattern currently running), it is not used by any other thread
(to avoid overwriting the same memory area). The best practice for
multithreaded programs is to allocate a stack for each thread, and return this
stack through the JIT callback function.
</P>
<P>
(4) When should a JIT stack be freed?
<br>
<br>
You can free a JIT stack at any time, as long as it will not be used by
<b>pcre2_match()</b> again. When you assign the stack to a match context, only a
pointer is set. There is no reference counting or any other magic. You can free
compiled patterns, contexts, and stacks in any order, anytime.
Just <i>do not</i> call <b>pcre2_match()</b> with a match context pointing to an
already freed stack, as that will cause a SEGFAULT. (Also, do not free a stack
currently used by <b>pcre2_match()</b> in another thread.) You can also replace
the stack in a context at any time when it is not in use. You should free the
previous stack before assigning a replacement.
</P>
<P>
(5) Should I allocate/free a stack every time before/after calling
<b>pcre2_match()</b>?
<br>
<br>
No, because this is too costly in terms of resources. However, you could
implement a scheme that releases the stack if it has not been used for, say,
two minutes. The JIT callback can help to achieve this without keeping a
list of patterns.
</P>
<P>
(6) OK, the stack is for long term memory allocation. But what happens if a
pattern causes stack overflow with a stack of 1MiB? Is that 1MiB kept until the
stack is freed?
<br>
<br>
Especially on embedded systems, it might be a good idea to release memory
sometimes without freeing the stack. There is no API for this at the moment.
Probably a function call which returns with the currently allocated memory for
any stack and another which allows releasing memory (shrinking the stack) would
be a good idea if someone needs this.
</P>
<P>
(7) This is too much of a headache. Isn't there any better solution for JIT
stack handling?
<br>
<br>
No, thanks to Windows. If POSIX threads were used everywhere, we could throw
out this complicated API.
</P>
<br><a name="SEC9" href="#TOC1">FREEING JIT SPECULATIVE MEMORY</a><br>
<P>
<b>void pcre2_jit_free_unused_memory(pcre2_general_context *<i>gcontext</i>);</b>
</P>
<P>
The JIT executable allocator does not free all the memory that it could. It
expects new allocations, and keeps some free memory around to improve
allocation speed. However, in low memory conditions, it might be better to free
all possible memory. You can cause this to happen by calling
pcre2_jit_free_unused_memory(). Its argument is a general context, for custom
memory management, or NULL for standard memory management.
</P>
<br><a name="SEC10" href="#TOC1">EXAMPLE CODE</a><br>
<P>
This is a single-threaded example that specifies a JIT stack without using a
callback. A real program should include error checking after all the function
calls.
<pre>
int rc;
int errornumber;
PCRE2_SIZE erroffset;
pcre2_code *re;
pcre2_match_data *match_data;
pcre2_match_context *mcontext;
pcre2_jit_stack *jit_stack;
re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
&errornumber, &erroffset, NULL);
rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
mcontext = pcre2_match_context_create(NULL);
jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
match_data = pcre2_match_data_create(re, 10);
rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
/* Process result */
pcre2_code_free(re);
pcre2_match_data_free(match_data);
pcre2_match_context_free(mcontext);
pcre2_jit_stack_free(jit_stack);
<a name="fastpath"></a></PRE>
</P>
<br><a name="SEC11" href="#TOC1">JIT FAST PATH API</a><br>
<P>
Because the API described above falls back to interpreted matching when JIT is
not available, it is convenient for programs that are written for general use
in many environments. However, calling JIT via <b>pcre2_match()</b> does have a
performance impact. Programs that are written for use where JIT is known to be
available, and which need the best possible performance, can instead use a
"fast path" API to call JIT matching directly instead of calling
<b>pcre2_match()</b> (obviously only for patterns that have been successfully
processed by <b>pcre2_jit_compile()</b>).
</P>
<P>
The fast path function is called <b>pcre2_jit_match()</b>, and it takes exactly
the same arguments as <b>pcre2_match()</b>. However, the subject string must be
specified with a length; PCRE2_ZERO_TERMINATED is not supported. Unsupported
option bits (for example, PCRE2_ANCHORED and PCRE2_ENDANCHORED) are ignored, as
is the PCRE2_NO_JIT option. The return values are also the same as for
<b>pcre2_match()</b>, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial
or complete) is requested that was not compiled.
</P>
<P>
When you call <b>pcre2_match()</b>, as well as testing for invalid options, a
number of other sanity checks are performed on the arguments. For example, if
the subject pointer is NULL but the length is non-zero, an immediate error is
given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested
for validity. In the interests of speed, these checks do not happen on the JIT
fast path. If invalid UTF data is passed when PCRE2_MATCH_INVALID_UTF was not
set for <b>pcre2_compile()</b>, the result is undefined. The program may crash
or loop or give wrong results. In the absence of PCRE2_MATCH_INVALID_UTF you
should call <b>pcre2_jit_match()</b> in UTF mode only if you are sure the
subject is valid.
</P>
<P>
Bypassing the sanity checks and the <b>pcre2_match()</b> wrapping can give
speedups of more than 10%.
</P>
<br><a name="SEC12" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2api</b>(3), <b>pcre2unicode</b>(3)
</P>
<br><a name="SEC13" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel (FAQ by Zoltan Herczeg)
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
<P>
Last updated: 22 August 2024
<br>
Copyright &copy; 1997-2024 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,105 @@
<html>
<head>
<title>pcre2limits specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2limits man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SIZE AND OTHER LIMITATIONS
</b><br>
<P>
There are some size limitations in PCRE2 but it is hoped that they will never
in practice be relevant.
</P>
<P>
The maximum size of a compiled pattern is approximately 64 thousand code units
for the 8-bit and 16-bit libraries if PCRE2 is compiled with the default
internal linkage size, which is 2 bytes for these libraries. If you want to
process regular expressions that are truly enormous, you can compile PCRE2 with
an internal linkage size of 3 or 4 (when building the 16-bit library, 3 is
rounded up to 4). See the <b>README</b> file in the source distribution and the
<a href="pcre2build.html"><b>pcre2build</b></a>
documentation for details. In these cases the limit is substantially larger.
However, the speed of execution is slower. In the 32-bit library, the internal
linkage size is always 4.
</P>
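<P>
The figure of "approximately 64 thousand" follows from the width of the
internal links: an offset stored in two bytes cannot exceed 65535. The
arithmetic can be sketched as below; the function name
<b>max_link_offset()</b> is invented for the example, and the result is only
the largest offset that fits in a link, not an exact statement of the maximum
pattern size.
</P>

```c
#include <stdint.h>

/* Largest offset that fits in a link of the given size in bytes.
 * Only the documented link sizes 2, 3 and 4 are accepted; any other value
 * yields 0. */
uint64_t max_link_offset(int link_size)
{
    if (link_size < 2 || link_size > 4) return 0;
    return (UINT64_C(1) << (8 * link_size)) - 1;
}
```

<P>
For link sizes of 2, 3 and 4 this gives 65535, 16777215 and 4294967295
respectively.
</P>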
<P>
The maximum length of a source pattern string is essentially unlimited; it is
the largest number a PCRE2_SIZE variable can hold. However, the program that
calls <b>pcre2_compile()</b> can specify a smaller limit.
</P>
<P>
The maximum length (in code units) of a subject string is one less than the
largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an unsigned
integer type, usually defined as size_t. Its maximum value (that is
~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated strings
and unset offsets.
</P>
<P>
All values in repeating quantifiers must be less than 65536.
</P>
<P>
There are two different limits that apply to branches of lookbehind assertions.
If every branch in such an assertion matches a fixed number of characters,
the maximum length of any branch is 65535 characters. If any branch matches a
variable number of characters, then the maximum matching length for every
branch is limited. The default limit is fixed when PCRE2 is built (255 unless
set otherwise), but can be changed by the calling program.
</P>
<P>
There is no limit to the number of parenthesized groups, but there can be no
more than 65535 capture groups, and there is a limit to the depth of nesting of
parenthesized subpatterns of all kinds. This is imposed in order to limit the
amount of system stack used at compile time. The default limit can be specified
when PCRE2 is built; if not, the default is set to 250. An application can
change this limit by calling pcre2_set_parens_nest_limit() to set the limit in
a compile context.
</P>
<P>
The maximum length of a name for a named capture group is 32 code units, and the
maximum number of such groups is 10000.
</P>
<P>
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or (*THEN) verb
is 255 code units for the 8-bit library and 65535 code units for the 16-bit and
32-bit libraries.
</P>
<P>
The maximum length of a string argument to a callout is the largest number a
32-bit unsigned integer can hold.
</P>
<P>
The maximum amount of heap memory used for matching is controlled by the heap
limit, which can be set in a pattern or in a match context. The default is a
very large number, effectively unlimited.
</P>
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 16 August 2023
<br>
Copyright &copy; 1997-2023 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,262 @@
<html>
<head>
<title>pcre2matching specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2matching man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 MATCHING ALGORITHMS</a>
<li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a>
<li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a>
<li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a>
<li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
<li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
<li><a name="TOC7" href="#SEC7">AUTHOR</a>
<li><a name="TOC8" href="#SEC8">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 MATCHING ALGORITHMS</a><br>
<P>
This document describes the two different algorithms that are available in
PCRE2 for matching a compiled regular expression against a given subject
string. The "standard" algorithm is the one provided by the <b>pcre2_match()</b>
function. This works in the same way as Perl's matching function, and provides a
Perl-compatible matching operation. The just-in-time (JIT) optimization that is
described in the
<a href="pcre2jit.html"><b>pcre2jit</b></a>
documentation is compatible with this function.
</P>
<P>
An alternative algorithm is provided by the <b>pcre2_dfa_match()</b> function;
it operates in a different way, and is not Perl-compatible. This alternative
has advantages and disadvantages compared with the standard algorithm, and
these are described below.
</P>
<P>
When there is only one possible way in which a given subject string can match a
pattern, the two algorithms give the same answer. A difference arises, however,
when there are multiple possibilities. For example, if the anchored pattern
<pre>
^&#60;.*&#62;
</pre>
is matched against the string
<pre>
&#60;something&#62; &#60;something else&#62; &#60;something further&#62;
</pre>
there are three possible answers. The standard algorithm finds only one of
them, whereas the alternative algorithm finds all three.
</P>
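<P>
The three possible answers in the example above can be enumerated by hand. The
sketch below is hard-wired to the anchored pattern ^&#60;.*&#62; and does not
use PCRE2 at all; the function name <b>all_match_ends()</b> is invented for the
example. It records every offset at which a match could end, which is the set
of answers that the alternative algorithm reports, whereas the standard
algorithm with a greedy quantifier returns only the longest of them.
</P>

```c
#include <stddef.h>

/* Record every offset at which the anchored pattern ^<.*> can end in the
 * subject. A dot does not match a newline by default, so scanning stops
 * there. Returns the number of end offsets stored (at most max_ends). */
size_t all_match_ends(const char *subject, size_t *ends, size_t max_ends)
{
    size_t count = 0;

    if (subject[0] != '<') return 0;        /* the pattern is anchored */

    for (size_t i = 1; subject[i] != '\0' && subject[i] != '\n'; i++) {
        if (subject[i] == '>' && count < max_ends)
            ends[count++] = i + 1;          /* offset just past the '>' */
    }
    return count;
}
```

<P>
For the subject string shown above, the function stores the end offsets 11, 28
and 48, one for each closing angle bracket.
</P>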
<br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br>
<P>
The set of strings that are matched by a regular expression can be represented
as a tree structure. An unlimited repetition in the pattern makes the tree of
infinite size, but it is still a tree. Matching the pattern to a given subject
string (from a given starting point) can be thought of as a search of the tree.
There are two ways to search a tree: depth-first and breadth-first, and these
correspond to the two matching algorithms provided by PCRE2.
</P>
<br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br>
<P>
In the terminology of Jeffrey Friedl's book "Mastering Regular Expressions",
the standard algorithm is an "NFA algorithm". It conducts a depth-first search
of the pattern tree. That is, it proceeds along a single path through the tree,
checking that the subject matches what is required. When there is a mismatch,
the algorithm tries any alternatives at the current point, and if they all
fail, it backs up to the previous branch point in the tree, and tries the next
alternative branch at that level. This often involves backing up (moving to the
left) in the subject string as well. The order in which repetition branches are
tried is controlled by the greedy or ungreedy nature of the quantifier.
</P>
<P>
If a leaf node is reached, a matching string has been found, and at that point
the algorithm stops. Thus, if there is more than one possible match, this
algorithm returns the first one that it finds. Whether this is the shortest,
the longest, or some intermediate length depends on the way the alternations
and the greedy or ungreedy repetition quantifiers are specified in the
pattern.
</P>
<P>
Because it ends up with a single path through the tree, it is relatively
straightforward for this algorithm to keep track of the substrings that are
matched by portions of the pattern in parentheses. This provides support for
capturing parentheses and backreferences.
</P>
<br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br>
<P>
This algorithm conducts a breadth-first search of the tree. Starting from the
first matching point in the subject, it scans the subject string from left to
right, once, character by character, and as it does this, it remembers all the
paths through the tree that represent valid matches. In Friedl's terminology,
this is a kind of "DFA algorithm", though it is not implemented as a
traditional finite state machine (it keeps multiple states active
simultaneously).
</P>
<P>
Although the general principle of this matching algorithm is that it scans the
subject string only once, without backtracking, there is one exception: when a
lookaround assertion is encountered, the characters following or preceding the
current point have to be independently inspected.
</P>
<P>
The scan continues until either the end of the subject is reached, or there are
no more unterminated paths. At this point, terminated paths represent the
different matching possibilities (if there are none, the match has failed).
Thus, if there is more than one possible match, this algorithm finds all of
them, and in particular, it finds the longest. The matches are returned in
the output vector in decreasing order of length. There is an option to stop the
algorithm after the first match (which is necessarily the shortest) is found.
</P>
<P>
Note that the size of vector needed to contain all the results depends on the
number of simultaneous matches, not on the number of capturing parentheses in
the pattern. Using <b>pcre2_match_data_create_from_pattern()</b> to create the
match data block is therefore not advisable when doing DFA matching.
</P>
<P>
Note also that all the matches that are found start at the same point in the
subject. If the pattern
<pre>
cat(er(pillar)?)?
</pre>
is matched against the string "the caterpillar catchment", the result is the
three strings "caterpillar", "cater", and "cat" that start at the fifth
character of the subject. The algorithm does not automatically move on to find
matches that start at later positions.
</P>
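<P>
The single left-to-right scan can be sketched with a toy model. This is not
how PCRE2 is implemented; it assumes the pattern has been expanded by hand into
the three literal strings it can match, and the function name
<b>matches_at</b> is hypothetical. Every candidate string is treated as a path
that stays active while it agrees with the subject.
</P>

```python
def matches_at(words, subject, start):
    """Scan once from `start`, keeping every path that is still alive."""
    active = set(words)      # every path is live before the scan begins
    found = []
    consumed = 0             # characters scanned so far
    while active:
        # A path whose text is fully consumed has terminated: it is a match.
        done = {w for w in active if len(w) == consumed}
        found.extend(done)
        active -= done
        if start + consumed >= len(subject):
            break            # end of subject: remaining paths are unterminated
        ch = subject[start + consumed]
        # Keep only the paths whose next character agrees with the subject.
        active = {w for w in active if w[consumed] == ch}
        consumed += 1
    # Report the matches in decreasing order of length.
    return sorted(found, key=len, reverse=True)


words = {"cat", "cater", "caterpillar"}   # hand expansion of cat(er(pillar)?)?
subject = "the caterpillar catchment"
print(matches_at(words, subject, 4))      # offset 4 is the fifth character
print(matches_at(words, subject, 16))     # a separate call for a later start
```

<P>
The first call reports all three strings from one starting point. The match at
"catchment" appears only because the caller asks again at a later offset.
</P>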
<P>
PCRE2's "auto-possessification" optimization usually applies to character
repeats at the end of a pattern (as well as internally). For example, the
pattern "a\d+" is compiled as if it were "a\d++" because there is no point
even considering the possibility of backtracking into the repeated digits. For
DFA matching, this means that only one possible match is found. If you really
do want multiple matches in such cases, either use an ungreedy repeat
("a\d+?") or set the PCRE2_NO_AUTO_POSSESS option when compiling.
</P>
<P>
There are a number of features of PCRE2 regular expressions that are not
supported or behave differently in the alternative matching function. Those
that are not supported cause an error if encountered.
</P>
<P>
1. Because the algorithm finds all possible matches, the greedy or ungreedy
nature of repetition quantifiers is not relevant (though it may affect
auto-possessification, as just described). During matching, greedy and ungreedy
quantifiers are treated in exactly the same way. However, possessive
quantifiers can make a difference when what follows could also match what is
quantified, for example in a pattern like this:
<pre>
^a++\w!
</pre>
This pattern matches "aaab!" but not "aaa!", which would be matched by a
non-possessive quantifier. Similarly, if an atomic group is present, it is
matched as if it were a standalone pattern at the current point, and the
longest match is then "locked in" for the rest of the overall pattern.
</P>
<P>
2. When dealing with multiple paths through the tree simultaneously, it is not
straightforward to keep track of captured substrings for the different matching
possibilities, and PCRE2's implementation of this algorithm does not attempt to
do this. This means that no captured substrings are available.
</P>
<P>
3. Because no substrings are captured, a number of related features are not
available:
<br>
<br>
(a) Backreferences;
<br>
<br>
(b) Conditional expressions that use a backreference as the condition or test
for a specific group recursion;
<br>
<br>
(c) Script runs;
<br>
<br>
(d) Scan substring assertions.
</P>
<P>
4. Because many paths through the tree may be active, the \K escape sequence,
which resets the start of the match when encountered (but may be on some paths
and not on others), is not supported.
</P>
<P>
5. Callouts are supported, but the value of the <i>capture_top</i> field is
always 1, and the value of the <i>capture_last</i> field is always 0.
</P>
<P>
6. The \C escape sequence, which (in the standard algorithm) always matches a
single code unit, even in a UTF mode, is not supported in UTF modes because
the alternative algorithm moves through the subject string one character (not
code unit) at a time, for all active paths through the tree.
</P>
<P>
7. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
</P>
<P>
8. The PCRE2_MATCH_INVALID_UTF option for <b>pcre2_compile()</b> is not
supported by <b>pcre2_dfa_match()</b>.
</P>
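<P>
The effect of the possessive quantifier described in item 1 can be reproduced
in a backtracking engine. The sketch below uses Python's standard <b>re</b>
module as a stand-in (it is not PCRE2). Because possessive syntax is not
available in every Python version, it is emulated here with the well-known
lookahead-plus-backreference idiom; that substitution is an assumption of this
example, not something PCRE2 does.
</P>

```python
import re

# (?=(a+))\1 behaves like the possessive a++: the lookahead captures the
# longest run of "a", and the backreference then consumes exactly that run,
# so no shorter run is ever tried.
possessive_like = re.compile(r"^(?=(a+))\1\w!")
ordinary = re.compile(r"^a+\w!")


def accepts(pattern, text):
    """True if `pattern` matches at the start of `text`."""
    return pattern.match(text) is not None


for text in ("aaab!", "aaa!"):
    print(text, accepts(possessive_like, text), accepts(ordinary, text))
```

<P>
Both patterns accept "aaab!". Only the ordinary quantifier accepts "aaa!",
because it can give back one "a" for \w to match.
</P>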
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
<P>
The main advantage of the alternative algorithm is that all possible matches
(at a single point in the subject) are automatically found, and in particular,
the longest match is found. To find more than one match at the same point using
the standard algorithm, you have to do kludgy things with callouts.
</P>
<P>
Partial matching is possible with this algorithm, though it has some
limitations. The
<a href="pcre2partial.html"><b>pcre2partial</b></a>
documentation gives details of partial matching and discusses multi-segment
matching.
</P>
<br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
<P>
The alternative algorithm suffers from a number of disadvantages:
</P>
<P>
1. It is substantially slower than the standard algorithm. This is partly
because it has to search for all possible matches, but is also because it is
less susceptible to optimization.
</P>
<P>
2. Capturing parentheses and other features such as backreferences that rely on
them are not supported.
</P>
<P>
3. Matching within invalid UTF strings is not supported.
</P>
<P>
4. Although atomic groups are supported, their use does not provide the
performance advantage that it does for the standard algorithm.
</P>
<P>
5. JIT optimization is not supported.
</P>
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
Last updated: 30 August 2024
<br>
Copyright &copy; 1997-2024 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,408 @@
<html>
<head>
<title>pcre2partial specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2partial man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE2</a>
<li><a name="TOC2" href="#SEC2">REQUIREMENTS FOR A PARTIAL MATCH</a>
<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre2_match()</a>
<li><a name="TOC4" href="#SEC4">MULTI-SEGMENT MATCHING WITH pcre2_match()</a>
<li><a name="TOC5" href="#SEC5">PARTIAL MATCHING USING pcre2_dfa_match()</a>
<li><a name="TOC6" href="#SEC6">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a>
<li><a name="TOC7" href="#SEC7">AUTHOR</a>
<li><a name="TOC8" href="#SEC8">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE2</a><br>
<P>
In normal use of PCRE2, if there is a match up to the end of a subject string,
but more characters are needed to match the entire pattern, PCRE2_ERROR_NOMATCH
is returned, just like any other failing match. There are circumstances where
it might be helpful to distinguish this "partial match" case.
</P>
<P>
One example is an application where the subject string is very long, and not
all available at once. The requirement here is to be able to do the matching
segment by segment, but special action is needed when a matched substring spans
the boundary between two segments.
</P>
<P>
Another example is checking a user input string as it is typed, to ensure that
it conforms to a required format. Invalid characters can be immediately
diagnosed and rejected, giving instant feedback.
</P>
<P>
Partial matching is a PCRE2-specific feature; it is not Perl-compatible. It is
requested by setting one of the PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT
options when calling a matching function. The difference between the two
options is whether or not a partial match is preferred to an alternative
complete match, though the details differ between the two types of matching
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
</P>
<P>
If you want to use partial matching with just-in-time optimized code, as well
as setting a partial match option for the matching function, you must also call
<b>pcre2_jit_compile()</b> with one or both of these options:
<pre>
PCRE2_JIT_PARTIAL_HARD
PCRE2_JIT_PARTIAL_SOFT
</pre>
PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial
matches on the same pattern. Separate code is compiled for each mode. If the
appropriate JIT mode has not been compiled, interpretive matching code is used.
</P>
<P>
Setting a partial matching option disables two of PCRE2's standard
optimization hints. PCRE2 remembers the last literal code unit in a pattern,
and abandons matching immediately if it is not present in the subject string.
This optimization cannot be used for a subject string that might match only
partially. PCRE2 also remembers a minimum length of a matching string, and does
not bother to run the matching function on shorter strings. This optimization
is also disabled for partial matching.
</P>
<br><a name="SEC2" href="#TOC1">REQUIREMENTS FOR A PARTIAL MATCH</a><br>
<P>
A possible partial match occurs during matching when the end of the subject
string is reached successfully, but either more characters are needed to
complete the match, or the addition of more characters might change what is
matched.
</P>
<P>
Example 1: if the pattern is /abc/ and the subject is "ab", more characters are
definitely needed to complete a match. In this case both hard and soft matching
options yield a partial match.
</P>
<P>
Example 2: if the pattern is /ab+/ and the subject is "ab", a complete match
can be found, but the addition of more characters might change what is
matched. In this case, only PCRE2_PARTIAL_HARD returns a partial match;
PCRE2_PARTIAL_SOFT returns the complete match.
</P>
<P>
On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if the next
pattern item is \z, \Z, \b, \B, or $ there is always a partial match.
Otherwise, for both options, the next pattern item must be one that inspects a
character, and at least one of the following must be true:
</P>
<P>
(1) At least one character has already been inspected. An inspected character
need not form part of the final matched string; lookbehind assertions and the
\K escape sequence provide ways of inspecting characters before the start of a
matched string.
</P>
<P>
(2) The pattern contains one or more lookbehind assertions. This condition
exists in case there is a lookbehind that inspects characters before the start
of the match.
</P>
<P>
(3) There is a special case when the whole pattern can match an empty string.
When the starting point is at the end of the subject, the empty string match is
a possibility, and if PCRE2_PARTIAL_SOFT is set and neither of the above
conditions is true, it is returned. However, because adding more characters
might result in a non-empty match, PCRE2_PARTIAL_HARD returns a partial match,
which in this case means "there is going to be a match at this point, but until
some more characters are added, we do not know if it will be an empty string or
something longer".
</P>
<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre2_match()</a><br>
<P>
When a partial matching option is set, the result of calling
<b>pcre2_match()</b> can be one of the following:
</P>
<P>
<b>A successful match</b>
A complete match has been found, starting and ending within this subject.
</P>
<P>
<b>PCRE2_ERROR_NOMATCH</b>
No match can start anywhere in this subject.
</P>
<P>
<b>PCRE2_ERROR_PARTIAL</b>
Adding more characters may result in a complete match that uses one or more
characters from the end of this subject.
</P>
<P>
When a partial match is returned, the first two elements in the ovector point
to the portion of the subject that was matched, but the values in the rest of
the ovector are undefined. The appearance of \K in the pattern has no effect
for a partial match. Consider this pattern:
<pre>
/abc\K123/
</pre>
If it is matched against "456abc123xyz" the result is a complete match, and the
ovector defines the matched string as "123", because \K resets the "start of
match" point. However, if a partial match is requested and the subject string
is "456abc12", a partial match is found for the string "abc12", because all
these characters are needed for a subsequent re-match with additional
characters.
</P>
<P>
If there is more than one partial match, the first one that was found provides
the data that is returned. Consider this pattern:
<pre>
/123\w+X|dogY/
</pre>
If this is matched against the subject string "abc123dog", both alternatives
fail to match, but the end of the subject is reached during matching, so
PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
"123dog" as the first partial match. (In this example, there are two partial
matches, because "dog" on its own partially matches the second alternative.)
</P>
<br><b>
How a partial match is processed by pcre2_match()
</b><br>
<P>
What happens when a partial match is identified depends on which of the two
partial matching options is set.
</P>
<P>
If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
partial match is found, without continuing to search for possible complete
matches. This option is "hard" because it prefers an earlier partial match over
a later complete match. For this reason, the assumption is made that the end of
the supplied subject string is not the true end of the available data, which is
why \z, \Z, \b, \B, and $ always give a partial match.
</P>
<P>
If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
continues as normal, and other alternatives in the pattern are tried. If no
complete match can be found, PCRE2_ERROR_PARTIAL is returned instead of
PCRE2_ERROR_NOMATCH. This option is "soft" because it prefers a complete match
over a partial match. All the various matching items in a pattern behave as if
the subject string is potentially complete; \z, \Z, and $ match at the end of
the subject, as normal, and for \b and \B the end of the subject is treated
as a non-alphanumeric.
</P>
<P>
The difference between the two partial matching options can be illustrated by a
pattern such as:
<pre>
/dog(sbody)?/
</pre>
This matches either "dog" or "dogsbody", greedily (that is, it prefers the
longer string if possible). If it is matched against the string "dog" with
PCRE2_PARTIAL_SOFT, it yields a complete match for "dog". However, if
PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PARTIAL. On the other
hand, if the pattern is made ungreedy the result is different:
<pre>
/dog(sbody)??/
</pre>
In this case the result is always a complete match because that is found first,
and matching never continues after finding a complete match. It might be easier
to follow this explanation by thinking of the two patterns like this:
<pre>
/dog(sbody)?/ is the same as /dogsbody|dog/
/dog(sbody)??/ is the same as /dog|dogsbody/
</pre>
The second pattern will never match "dogsbody", because it will always find the
shorter match first.
</P>
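<P>
The equivalence with the alternation forms can be checked in any backtracking
engine. The sketch below uses Python's standard <b>re</b> module as a stand-in;
it is not PCRE2 and has no partial matching, so it shows only which complete
match is found first. The helper name <b>first_match</b> is illustrative.
</P>

```python
import re


def first_match(pattern, subject):
    """Return the text of the first match found at the start of `subject`."""
    return re.match(pattern, subject).group()


pairs = [
    (r"dog(sbody)?", r"dogsbody|dog"),     # greedy: longer string preferred
    (r"dog(sbody)??", r"dog|dogsbody"),    # ungreedy: shorter string preferred
]
for quantified, alternation in pairs:
    print(first_match(quantified, "dogsbody"),
          first_match(alternation, "dogsbody"))
```

<P>
Each pair gives the same answer for the subject "dogsbody": the full string
for the greedy form, and just "dog" for the ungreedy form.
</P>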
<br><b>
Example of partial matching using pcre2test
</b><br>
<P>
The <b>pcre2test</b> data modifiers <b>partial_hard</b> (or <b>ph</b>) and
<b>partial_soft</b> (or <b>ps</b>) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT,
respectively, when calling <b>pcre2_match()</b>. Here is a run of
<b>pcre2test</b> using a pattern that matches the whole subject in the form of a
date:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 25dec3\=ph
Partial match: 25dec3
data&#62; 3ju\=ph
Partial match: 3ju
data&#62; 3juj\=ph
No match
</pre>
This example gives the same results for both hard and soft partial matching
options. Here is an example where there is a difference:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 25jun04\=ps
0: 25jun04
1: jun
data&#62; 25jun04\=ph
Partial match: 25jun04
</pre>
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
there is only a partial match.
</P>
<br><a name="SEC4" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_match()</a><br>
<P>
PCRE was not originally designed with multi-segment matching in mind. However,
over time, features (including partial matching) that make multi-segment
matching possible have been added. A very long string can be searched segment
by segment by calling <b>pcre2_match()</b> repeatedly, with the aim of achieving
the same results that would happen if the entire string was available for
searching all the time. Normally, the strings that are being sought are much
shorter than each individual segment, and are in the middle of very long
strings, so the pattern is normally not anchored.
</P>
<P>
Special logic must be implemented to handle a matched substring that spans a
segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
partial match at the end of a segment whenever there is the possibility of
changing the match by adding more characters. The PCRE2_NOTBOL option should
also be set for all but the first segment.
</P>
<P>
When a partial match occurs, the next segment must be added to the current
subject and the match re-run, using the <i>startoffset</i> argument of
<b>pcre2_match()</b> to begin at the point where the partial match started.
For example:
<pre>
re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
data&#62; ...the date is 23ja\=ph
Partial match: 23ja
data&#62; ...the date is 23jan19 and on that day...\=offset=15
0: 23jan19
1: jan
</pre>
Note the use of the <b>offset</b> modifier to start the new match where the
partial match was found. In this example, the next segment was added to the one
in which the partial match was found. This is the most straightforward
approach, typically using a memory buffer that is twice the size of each
segment. After a partial match, the first half of the buffer is discarded, the
second half is moved to the start of the buffer, and a new segment is added
before repeating the match as in the example above. After a no match, the
entire buffer can be discarded.
</P>
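<P>
The retain-and-retry loop can be sketched as follows. Python's <b>re</b> module
has no partial matching, so this example substitutes a toy matcher,
<b>find_literal</b>, that searches for one literal string and reports a
"hard" partial match when the subject ends in the middle of it. Both function
names are hypothetical. For brevity the sketch keeps only the text from the
start of the partial match onwards, rather than a whole segment, which is safe
here only because a literal search never looks behind its starting point.
</P>

```python
def find_literal(needle, text, start=0):
    """Toy stand-in for a matcher called with hard partial matching.

    Returns (kind, begin, end), where kind is "match", "partial" or "nomatch".
    """
    for i in range(start, len(text)):
        tail = text[i:i + len(needle)]
        if tail == needle:
            return ("match", i, i + len(needle))
        # The subject ran out while everything so far agreed with the needle.
        if i + len(needle) > len(text) and needle.startswith(tail):
            return ("partial", i, len(text))
    return ("nomatch", -1, -1)


def search_segments(needle, segments):
    """Collect every occurrence of `needle` in a stream of segments."""
    results = []
    buffer = ""
    for segment in segments:
        buffer += segment            # add the next segment to what was kept
        start = 0
        while True:
            kind, begin, end = find_literal(needle, buffer, start)
            if kind != "match":
                break
            results.append(buffer[begin:end])
            start = end              # keep looking after this match
        if kind == "partial":
            buffer = buffer[begin:]  # retain the partially matched text
        else:
            buffer = ""              # no match: nothing needs to be kept
    return results


print(search_segments("23jan19", ["the date is 23ja", "n19 and on that day"]))
```

<P>
With a real pattern the amount of text to retain is harder to choose, as the
following paragraphs explain.
</P>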
<P>
If there are memory constraints, you may want to discard text that precedes a
partial match before adding the next segment. Unfortunately, this is not at
present straightforward. In cases such as the above, where the pattern does not
contain any lookbehinds, it is sufficient to retain only the partially matched
substring. However, if the pattern contains a lookbehind assertion, characters
that precede the start of the partial match may have been inspected during the
matching process. When <b>pcre2test</b> displays a partial match, it indicates
these characters with '&#60;' if the <b>allusedtext</b> modifier is set:
<pre>
re&#62; "(?&#60;=123)abc"
data&#62; xx123ab\=ph,allusedtext
Partial match: 123ab
&#60;&#60;&#60;
</pre>
However, the <b>allusedtext</b> modifier is not available for JIT matching,
because JIT matching does not record the first (or last) consulted characters.
For this reason, this information is not available via the API. It is therefore
not possible in general to obtain the exact number of characters that must be
retained in order to get the right match result. If you cannot retain the
entire segment, you must find some heuristic way of choosing.
</P>
<P>
If you know the approximate length of the matching substrings, you can use that
to decide how much text to retain. The only lookbehind information that is
currently available via the API is the length of the longest individual
lookbehind in a pattern, but this can be misleading if there are nested
lookbehinds. The value returned by calling <b>pcre2_pattern_info()</b> with the
PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
units) that any individual lookbehind moves back when it is processed. A
pattern such as "(?&#60;=(?&#60;!b)a)" has a maximum lookbehind value of one, but
inspects two characters before its starting point.
</P>
<P>
In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
UTF-8 or UTF-16 you have to count characters while moving back through the code
units.
</P>
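<P>
For UTF-8, counting characters while moving back can be sketched as below. The
function name <b>step_back_utf8</b> is hypothetical, and the code assumes the
buffer holds valid UTF-8; it is an illustration of the counting, not code
taken from PCRE2.
</P>

```python
def step_back_utf8(data: bytes, pos: int, nchars: int) -> int:
    """Return the byte offset that lies `nchars` characters before `pos`.

    Stops at offset 0 if the buffer holds fewer characters than requested.
    """
    while nchars > 0 and pos > 0:
        pos -= 1
        # Continuation bytes have the form 10xxxxxx; skip them to reach the
        # first byte of the character.
        while pos > 0 and (data[pos] & 0xC0) == 0x80:
            pos -= 1
        nchars -= 1
    return pos


# Four characters occupying 1, 2, 3 and 1 bytes respectively.
data = "a\u00e9\u20acb".encode("utf-8")
for n in range(5):
    print(n, step_back_utf8(data, len(data), n))
```

<P>
In a non-UTF buffer the same result is simply <i>pos</i> minus <i>nchars</i>,
limited to zero.
</P>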
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
<P>
The DFA function moves along the subject string character by character, without
backtracking, searching for all possible matches simultaneously. If the end of
the subject is reached before the end of the pattern, there is the possibility
of a partial match.
</P>
<P>
When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
have been no complete matches. Otherwise, the complete matches are returned.
If PCRE2_PARTIAL_HARD is set, a partial match takes precedence over any
complete matches. The portion of the string that was matched when the longest
partial match was found is set as the first matching string.
</P>
<P>
Because the DFA function always searches for all possible matches, and there is
no difference between greedy and ungreedy repetition, its behaviour is
different from that of <b>pcre2_match()</b>. Consider the string "dog" matched
against this ungreedy pattern:
<pre>
/dog(sbody)??/
</pre>
Whereas the standard function stops as soon as it finds the complete match for
"dog", the DFA function also finds the partial match for "dogsbody", and so
returns that when PCRE2_PARTIAL_HARD is set.
</P>
<br><a name="SEC6" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a><br>
<P>
When a partial match has been found using the DFA matching function, it is
possible to continue the match by providing additional subject data and calling
the function again with the same compiled regular expression, this time setting
the PCRE2_DFA_RESTART option. You must pass the same working space as before,
because this is where details of the previous partial match are stored. You can
set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART
to continue partial matching over multiple segments. Here is an example using
<b>pcre2test</b>:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 23ja\=dfa,ps
Partial match: 23ja
data&#62; n05\=dfa,dfa_restart
0: n05
</pre>
The first call has "23ja" as the subject, and requests partial matching; the
second call has "n05" as the subject for the continued (restarted) match.
Notice that when the match is complete, only the last part is shown; PCRE2 does
not retain the previously partially-matched string. It is up to the calling
program to do that if it needs to. This means that, for an unanchored pattern,
if a continued match fails, it is not possible to try again at a new starting
point. All this facility is capable of doing is continuing with the previous
match attempt. For example, consider this pattern:
<pre>
1234|3789
</pre>
If the first part of the subject is "ABC123", a partial match of the first
alternative is found at offset 3. There is no partial match for the second
alternative, because such a match does not start at the same point in the
subject string. Attempting to continue with the string "7890" does not yield a
match because only those alternatives that match at one point in the subject
are remembered. Depending on the application, this may or may not be what you
want.
</P>
<P>
If you do want to allow for starting again at the next character, one way of
doing it is to retain some or all of the segment and try a new complete match,
as described for <b>pcre2_match()</b> above. Another possibility is to work with
two buffers. If a partial match at offset <i>n</i> in the first buffer is
followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
can then try a new match starting at offset <i>n+1</i> in the first buffer.
</P>
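<P>
The "1234|3789" case and the retry at offset <i>n+1</i> can be modelled with a
toy. The sketch below is not the PCRE2 API: <b>first_partial</b> and
<b>continue_paths</b> are hypothetical helpers that treat the pattern as a list
of literal alternatives, and the final retry uses Python's standard <b>re</b>
module on the two segments joined together, which is a simplification of
working with two separate buffers.
</P>

```python
import re


def first_partial(words, segment):
    """Find the leftmost offset where the end of `segment` cuts a word short.

    Returns (offset, live), where live lists (word, characters_consumed) for
    the paths to remember, or None if there is no partial match. Complete
    matches are ignored to keep the toy small.
    """
    for start in range(len(segment)):
        tail = segment[start:]
        live = [(w, len(tail)) for w in words
                if len(w) > len(tail) and w.startswith(tail)]
        if live:
            return start, live
    return None


def continue_paths(live, segment):
    """Continue only the remembered paths; no new starting point is tried."""
    for word, used in live:
        rest = word[used:]
        if segment.startswith(rest):
            return rest
    return None


words = ["1234", "3789"]
first, second = "ABC123", "7890"

n, live = first_partial(words, first)       # partial match of "1234"
continued = continue_paths(live, second)    # fails: "7890" does not supply "4"

retry = None
if continued is None:
    # Start a fresh match one character after the failed partial match.
    pattern = re.compile("|".join(re.escape(w) for w in words))
    retry = pattern.search(first + second, n + 1)

print(n, live, continued, retry.group() if retry else None)
```

<P>
The continuation fails because only the "1234" path was remembered. The fresh
attempt from offset <i>n+1</i> is what finds "3789" spanning the boundary.
</P>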
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
Last updated: 27 November 2024
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,4140 @@
<html>
<head>
<title>pcre2pattern specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2pattern man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION DETAILS</a>
<li><a name="TOC2" href="#SEC2">EBCDIC CHARACTER CODES</a>
<li><a name="TOC3" href="#SEC3">SPECIAL START-OF-PATTERN ITEMS</a>
<li><a name="TOC4" href="#SEC4">CHARACTERS AND METACHARACTERS</a>
<li><a name="TOC5" href="#SEC5">BACKSLASH</a>
<li><a name="TOC6" href="#SEC6">CIRCUMFLEX AND DOLLAR</a>
<li><a name="TOC7" href="#SEC7">FULL STOP (PERIOD, DOT) AND \N</a>
<li><a name="TOC8" href="#SEC8">MATCHING A SINGLE CODE UNIT</a>
<li><a name="TOC9" href="#SEC9">SQUARE BRACKETS AND CHARACTER CLASSES</a>
<li><a name="TOC10" href="#SEC10">PERL EXTENDED CHARACTER CLASSES</a>
<li><a name="TOC11" href="#SEC11">UTS#18 EXTENDED CHARACTER CLASSES</a>
<li><a name="TOC12" href="#SEC12">POSIX CHARACTER CLASSES</a>
<li><a name="TOC13" href="#SEC13">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a>
<li><a name="TOC14" href="#SEC14">VERTICAL BAR</a>
<li><a name="TOC15" href="#SEC15">INTERNAL OPTION SETTING</a>
<li><a name="TOC16" href="#SEC16">GROUPS</a>
<li><a name="TOC17" href="#SEC17">DUPLICATE GROUP NUMBERS</a>
<li><a name="TOC18" href="#SEC18">NAMED CAPTURE GROUPS</a>
<li><a name="TOC19" href="#SEC19">REPETITION</a>
<li><a name="TOC20" href="#SEC20">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
<li><a name="TOC21" href="#SEC21">BACKREFERENCES</a>
<li><a name="TOC22" href="#SEC22">ASSERTIONS</a>
<li><a name="TOC23" href="#SEC23">NON-ATOMIC ASSERTIONS</a>
<li><a name="TOC24" href="#SEC24">SCAN SUBSTRING ASSERTIONS</a>
<li><a name="TOC25" href="#SEC25">SCRIPT RUNS</a>
<li><a name="TOC26" href="#SEC26">CONDITIONAL GROUPS</a>
<li><a name="TOC27" href="#SEC27">COMMENTS</a>
<li><a name="TOC28" href="#SEC28">RECURSIVE PATTERNS</a>
<li><a name="TOC29" href="#SEC29">GROUPS AS SUBROUTINES</a>
<li><a name="TOC30" href="#SEC30">ONIGURUMA SUBROUTINE SYNTAX</a>
<li><a name="TOC31" href="#SEC31">CALLOUTS</a>
<li><a name="TOC32" href="#SEC32">BACKTRACKING CONTROL</a>
<li><a name="TOC33" href="#SEC33">EBCDIC ENVIRONMENTS</a>
<li><a name="TOC34" href="#SEC34">SEE ALSO</a>
<li><a name="TOC35" href="#SEC35">AUTHOR</a>
<li><a name="TOC36" href="#SEC36">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br>
<P>
The syntax and semantics of the regular expressions that are supported by PCRE2
are described in detail below. There is a quick-reference syntax summary in the
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
page. PCRE2 tries to match Perl syntax and semantics as closely as it can.
PCRE2 also supports some alternative regular expression syntax that does not
conflict with the Perl syntax in order to provide some compatibility with
regular expressions in Python, .NET, and Oniguruma. There are in addition some
options that enable alternative syntax and semantics that are not the same as
in Perl.
</P>
<P>
Perl's regular expressions are described in its own documentation, and regular
expressions in general are covered in a number of books, some of which have
copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published
by O'Reilly, covers regular expressions in great detail. This description of
PCRE2's regular expressions is intended as reference material.
</P>
<P>
This document discusses the regular expression patterns that are supported by
PCRE2 when its main matching function, <b>pcre2_match()</b>, is used. PCRE2 also
has an alternative matching function, <b>pcre2_dfa_match()</b>, which matches
using a different algorithm that is not Perl-compatible. Some of the features
discussed below are not available when DFA matching is used. The advantages and
disadvantages of the alternative function, and how it differs from the normal
function, are discussed in the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
page.
</P>
<br><a name="SEC2" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
<P>
Most computers use ASCII or Unicode for encoding characters, and PCRE2 assumes
this by default. However, it can be compiled to run in an environment that uses
the EBCDIC code, which is the case for some IBM mainframe operating systems. In
the sections below, character code values are ASCII or Unicode; in an EBCDIC
environment these characters may have different code values, and there are no
code points greater than 255. Differences in behaviour when PCRE2 is running in
an EBCDIC environment are described in the section
<a href="#ebcdicenvironments">"EBCDIC environments"</a>
below, which you can ignore unless you really are in an EBCDIC environment.
</P>
<br><a name="SEC3" href="#TOC1">SPECIAL START-OF-PATTERN ITEMS</a><br>
<P>
A number of options that can be passed to <b>pcre2_compile()</b> can also be set
by special items at the start of a pattern. These are not Perl-compatible, but
are provided to make these options accessible to pattern writers who are not
able to change the program that processes the pattern. Any number of these
items may appear, but they must all be together right at the start of the
pattern string, and the letters must be in upper case.
</P>
<br><b>
UTF support
</b><br>
<P>
In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as
single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
specified for the 32-bit library, in which case it constrains the character
values to valid Unicode code points. To process UTF strings, PCRE2 must be
built to include Unicode support (which is the default). When using UTF strings
you must either call the compiling function with one or both of the PCRE2_UTF
or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
sequence (*UTF), which is equivalent to setting the relevant PCRE2_UTF. How
setting a UTF mode affects pattern matching is mentioned in several places
below. There is also a summary of features in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
</P>
<P>
Some applications that allow their users to supply patterns may wish to
restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
option is passed to <b>pcre2_compile()</b>, (*UTF) is not allowed, and its
appearance in a pattern causes an error.
</P>
<br><b>
Unicode property support
</b><br>
<P>
Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \d and \w to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 256 via a lookup
table. It also causes upper/lower casing operations to use Unicode properties
for characters with code points greater than 127, even when UTF is not set.
These behaviours can be changed within the pattern; see the section entitled
<a href="#internaloptions">"Internal Option Setting"</a>
below.
</P>
<P>
Some applications that allow their users to supply patterns may wish to
restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
<b>pcre2_compile()</b>, (*UCP) is not allowed, and its appearance in a pattern
causes an error.
</P>
<br><b>
Locking out empty string matching
</b><br>
<P>
Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
matching function is subsequently called to match the pattern. These options
lock out the matching of empty strings, either entirely, or only at the start
of the subject.
</P>
<br><b>
Disabling auto-possessification
</b><br>
<P>
If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
the PCRE2_NO_AUTO_POSSESS option, or calling <b>pcre2_set_optimize()</b> with
a PCRE2_AUTO_POSSESS_OFF directive. This stops PCRE2 from making quantifiers
possessive when what follows cannot match the repeated item. For example, by
default a+b is treated as a++b. For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<br><b>
Disabling start-up optimizations
</b><br>
<P>
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
PCRE2_NO_START_OPTIMIZE option, or calling <b>pcre2_set_optimize()</b> with
a PCRE2_START_OPTIMIZE_OFF directive. This disables several optimizations for
quickly reaching "no match" results. For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<br><b>
Disabling automatic anchoring
</b><br>
<P>
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
setting the PCRE2_NO_DOTSTAR_ANCHOR option, or calling <b>pcre2_set_optimize()</b>
with a PCRE2_DOTSTAR_ANCHOR_OFF directive. This disables optimizations that
apply to patterns whose top-level branches all start with .* (match any number
of arbitrary characters). For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<br><b>
Disabling JIT compilation
</b><br>
<P>
If a pattern that starts with (*NO_JIT) is successfully compiled, an attempt by
the application to apply the JIT optimization by calling
<b>pcre2_jit_compile()</b> is ignored.
</P>
<br><b>
Setting match resource limits
</b><br>
<P>
The <b>pcre2_match()</b> function contains a counter that is incremented every
time it goes round its main loop. The caller of <b>pcre2_match()</b> can set a
limit on this counter, which therefore limits the amount of computing resource
used for a match. The maximum depth of nested backtracking can also be limited;
this indirectly restricts the amount of heap memory that is used, but there is
also an explicit memory limit that can be set.
</P>
<P>
These facilities are provided to catch runaway matches that are provoked by
patterns with huge matching trees. A common example is a pattern with nested
unlimited repeats applied to a long string that does not match. When one of
these limits is reached, <b>pcre2_match()</b> gives an error return. The limits
can also be set by items at the start of the pattern of the form
<pre>
(*LIMIT_HEAP=d)
(*LIMIT_MATCH=d)
(*LIMIT_DEPTH=d)
</pre>
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used. The heap limit is
specified in kibibytes (units of 1024 bytes).
</P>
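<P>
The rule that a pattern can lower a limit but never raise it amounts to taking
a minimum. The following C fragment is a minimal sketch of that arithmetic
only. The names <b>effective_limit()</b> and <b>heap_limit_bytes()</b> are
hypothetical and are not part of the PCRE2 API.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combines the limit set (or defaulted) by the caller
   with any number of (*LIMIT_...=d) values found at the start of a pattern.
   A pattern value takes effect only if it is lower than the current limit,
   so the result is the minimum of all the values. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_limits, size_t count)
{
    uint32_t limit = caller_limit;
    for (size_t i = 0; i < count; i++) {
        if (pattern_limits[i] < limit) limit = pattern_limits[i];
    }
    return limit;
}

/* The heap limit is given in kibibytes, that is, units of 1024 bytes. */
uint64_t heap_limit_bytes(uint32_t kibibytes)
{
    return (uint64_t)kibibytes * 1024u;
}
```

<P>
For example, with a caller limit of 10000 and a pattern that contains both
(*LIMIT_MATCH=5000) and (*LIMIT_MATCH=200000), the sketch yields 5000.
</P>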
<P>
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
still recognized for backwards compatibility.
</P>
<P>
The heap limit applies only when the <b>pcre2_match()</b> or
<b>pcre2_dfa_match()</b> interpreters are used for matching. It does not apply
to JIT. The match limit is used (but in a different way) when JIT is being
used, or when <b>pcre2_dfa_match()</b> is called, to limit computing resource
usage by those matching functions. The depth limit is ignored by JIT but is
relevant for DFA matching, which uses function recursion for recursions within
the pattern and for lookaround assertions and atomic groups. In this case, the
depth limit controls the depth of such recursion.
<a name="newlines"></a></P>
<br><b>
Newline conventions
</b><br>
<P>
PCRE2 supports six different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
character, the two-character sequence CRLF, any of the three preceding, any
Unicode newline sequence, or the NUL character (binary zero). The
<a href="pcre2api.html"><b>pcre2api</b></a>
page has
<a href="pcre2api.html#newlines">further discussion</a>
about newlines, and shows how to set the newline convention when calling
<b>pcre2_compile()</b>.
</P>
<P>
It is also possible to specify a newline convention by starting a pattern
string with one of the following sequences:
<pre>
(*CR) carriage return
(*LF) linefeed
(*CRLF) carriage return, followed by linefeed
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
(*NUL) the NUL character (binary zero)
</pre>
These override the default and the options given to the compiling function. For
example, on a Unix system where LF is the default newline sequence, the pattern
<pre>
(*CR)a.b
</pre>
changes the convention to CR. That pattern matches "a\nb" because LF is no
longer a newline. If more than one of these settings is present, the last one
is used.
</P>
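<P>
The "last one is used" rule can be illustrated by scanning the items at the
start of a pattern string. The C fragment below is only a sketch under stated
assumptions: the function <b>leading_newline_setting()</b> and its types are
hypothetical, the table lists only start-of-pattern items named in this
document and may be incomplete, and it does not reproduce the errors that
<b>pcre2_compile()</b> gives for malformed items.
</P>

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    NL_DEFAULT, NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY, NL_NUL
} newline_setting;

/* Start-of-pattern items named in this document. Items that do not set a
   newline convention map to NL_DEFAULT and are simply skipped. */
static const struct { const char *name; newline_setting value; } start_items[] = {
    { "CR", NL_CR }, { "LF", NL_LF }, { "CRLF", NL_CRLF },
    { "ANYCRLF", NL_ANYCRLF }, { "ANY", NL_ANY }, { "NUL", NL_NUL },
    { "UTF", NL_DEFAULT }, { "UCP", NL_DEFAULT },
    { "NOTEMPTY", NL_DEFAULT }, { "NOTEMPTY_ATSTART", NL_DEFAULT },
    { "NO_AUTO_POSSESS", NL_DEFAULT }, { "NO_START_OPT", NL_DEFAULT },
    { "NO_DOTSTAR_ANCHOR", NL_DEFAULT }, { "NO_JIT", NL_DEFAULT },
    { "BSR_ANYCRLF", NL_DEFAULT }, { "BSR_UNICODE", NL_DEFAULT },
    { "CASELESS_RESTRICT", NL_DEFAULT }, { "TURKISH_CASING", NL_DEFAULT }
};

/* Returns the newline convention chosen by the leading (*...) items of the
   pattern, or NL_DEFAULT if none of them sets one. Names are compared
   exactly, so lower case spellings are not recognized. */
newline_setting leading_newline_setting(const char *pattern)
{
    newline_setting result = NL_DEFAULT;
    const size_t count = sizeof start_items / sizeof start_items[0];
    const char *p = pattern;

    while (p[0] == '(' && p[1] == '*') {
        const char *name = p + 2;
        const char *end = strchr(name, ')');
        size_t len, i;

        if (end == NULL) break;
        len = (size_t)(end - name);

        if (len > 6 && strncmp(name, "LIMIT_", 6) == 0) {
            p = end + 1;        /* (*LIMIT_...=d) does not affect newlines */
            continue;
        }
        for (i = 0; i < count; i++) {
            if (strlen(start_items[i].name) == len &&
                memcmp(name, start_items[i].name, len) == 0) break;
        }
        if (i == count) break;  /* not a start item: the pattern proper begins */
        if (start_items[i].value != NL_DEFAULT) result = start_items[i].value;
        p = end + 1;            /* a later setting overrides an earlier one */
    }
    return result;
}
```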
<P>
The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \N when not followed by an
opening brace. However, it does not affect what the \R escape sequence
matches. By default, this is any Unicode newline sequence, for Perl
compatibility. However, this can be changed; see the next section and the
description of \R in the section entitled
<a href="#newlineseq">"Newline sequences"</a>
below. A change of \R setting can be combined with a change of newline
convention.
</P>
<br><b>
Specifying what \R matches
</b><br>
<P>
It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
at compile time. This effect can also be achieved by starting a pattern with
(*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
corresponding to PCRE2_BSR_UNICODE.
</P>
<br><a name="SEC4" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br>
<P>
A regular expression is a pattern that is matched against a subject string from
left to right. Most characters stand for themselves in a pattern, and match the
corresponding characters in the subject. As a trivial example, the pattern
<pre>
The quick brown fox
</pre>
matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE2_CASELESS option or (?i) within the
pattern), letters are matched independently of case. Note that there are two
ASCII characters, K and S, that, in addition to their lower case ASCII
equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
(long S) respectively when either PCRE2_UTF or PCRE2_UCP is set, unless the
PCRE2_EXTRA_CASELESS_RESTRICT option is in force (either passed to
<b>pcre2_compile()</b> or set by (*CASELESS_RESTRICT) or (?r) within the
pattern). If the PCRE2_EXTRA_TURKISH_CASING option is in force (either passed
to <b>pcre2_compile()</b> or set by (*TURKISH_CASING) within the pattern), then
the 'i' letters are matched according to the casing rules of the Turkish and
Azeri languages.
</P>
<P>
The power of regular expressions comes from the ability to include wild cards,
character classes, alternatives, and repetitions in the pattern. These are
encoded in the pattern by the use of <i>metacharacters</i>, which do not stand
for themselves but instead are interpreted in some special way.
</P>
<P>
There are two different sets of metacharacters: those that are recognized
anywhere in the pattern except within square brackets, and those that are
recognized within square brackets. Outside square brackets, the metacharacters
are as follows:
<pre>
\ general escape character with several uses
^ assert start of string (or line, in multiline mode)
$ assert end of string (or line, in multiline mode)
. match any character except newline (by default)
[ start character class definition
| start of alternative branch
( start group or control verb
) end group or control verb
* 0 or more quantifier
+ 1 or more quantifier; also "possessive quantifier"
? 0 or 1 quantifier; also quantifier minimizer
{ potential start of min/max quantifier
</pre>
Brace characters { and } are also used to enclose data for constructions such
as \g{2} or \k{name}. In almost all uses of braces, space and/or horizontal
tab characters that follow { or precede } are allowed and are ignored. In the
case of quantifiers, they may also appear before or after the comma. The
exception to this is \u{...} which is an ECMAScript compatibility feature
that is recognized only when the PCRE2_EXTRA_ALT_BSUX option is set. ECMAScript
does not ignore such white space; it causes the item to be interpreted as
literal.
</P>
<P>
Part of a pattern that is in square brackets is called a "character class". In
a character class the only metacharacters are:
<pre>
\ general escape character
^ negate the class, but only if the first character
- indicates character range
[ POSIX character class (if followed by POSIX syntax)
] terminates the character class
</pre>
If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
the pattern, other than in a character class, within a \Q...\E sequence, or
between a # outside a character class and the next newline, inclusive, is
ignored. An escaping backslash can be used to include a white space or a #
character as part of the pattern. If the PCRE2_EXTENDED_MORE option is set, the
same applies, but in addition unescaped space and horizontal tab characters are
ignored inside a character class. Note: only these two characters are ignored,
not the full set of pattern white space characters that are ignored outside a
character class. Option settings can be changed within a pattern; see the
section entitled
<a href="#internaloptions">"Internal Option Setting"</a>
below.
</P>
<P>
The following sections describe the use of each of the metacharacters.
</P>
<br><a name="SEC5" href="#TOC1">BACKSLASH</a><br>
<P>
The backslash character has several uses. Firstly, if it is followed by a
character that is not a digit or a letter, it takes away any special meaning
that character may have. This use of backslash as an escape character applies
both inside and outside character classes.
</P>
<P>
For example, if you want to match a * character, you must write \* in the
pattern. This escaping action applies whether or not the following character
would otherwise be interpreted as a metacharacter, so it is always safe to
precede a non-alphanumeric with backslash to specify that it stands for itself.
In particular, if you want to match a backslash, you write \\.
</P>
<P>
Only ASCII digits and letters have any special meaning after a backslash. All
other characters (in particular, those whose code points are greater than 127)
are treated as literals.
</P>
<P>
If you want to treat all characters in a sequence as literals, you can do so by
putting them between \Q and \E. Note that this includes white space even when
the PCRE2_EXTENDED option is set so that most other white space is ignored. The
behaviour is different from Perl in that $ and @ are handled as literals in
\Q...\E sequences in PCRE2, whereas in Perl, $ and @ cause variable
interpolation. Also, Perl does "double-quotish backslash interpolation" on any
backslashes between \Q and \E which, its documentation says, "may lead to
confusing results". PCRE2 treats a backslash between \Q and \E just like any
other character. Note the following examples:
<pre>
Pattern PCRE2 matches Perl matches
\Qabc$xyz\E abc$xyz abc followed by the contents of $xyz
\Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
\QA\B\E A\B A\B
\Q\\E \ \\E
</pre>
The \Q...\E sequence is recognized both inside and outside character classes.
An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
by \E later in the pattern, the literal interpretation continues to the end of
the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
a character class, this causes an error, because the character class is then
not terminated by a closing square bracket.
</P>
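<P>
The PCRE2 column of the table above can be reproduced by rewriting each
\Q...\E run as individually escaped characters. The C fragment below is a
sketch of that idea, not PCRE2's own code, and the function
<b>expand_quoted()</b> is hypothetical. It escapes only ASCII characters that
are not letters or digits, and it deliberately does not model two cases
described in this section: the error for an unterminated \Q inside a character
class, and the rule that \Q or \E inside a would-be quantifier stops it being
recognized as a quantifier.
</P>

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copies `pattern` to `out`, replacing every \Q...\E
   run by its characters, each ASCII non-alphanumeric one preceded by a
   backslash. Returns 0 on success, or -1 if `out` is too small. */
int expand_quoted(const char *pattern, char *out, size_t out_size)
{
    const char *p = pattern;
    size_t n = 0;
    int quoting = 0;

    while (*p != '\0') {
        if (quoting) {
            unsigned char c;
            if (p[0] == '\\' && p[1] == 'E') {   /* end of quoted run */
                quoting = 0;
                p += 2;
                continue;
            }
            c = (unsigned char)*p++;
            if (c < 128 && !isalnum(c)) {        /* backslash is literal here */
                if (n + 1 >= out_size) return -1;
                out[n++] = '\\';
            }
            if (n + 1 >= out_size) return -1;
            out[n++] = (char)c;
        } else if (p[0] == '\\' && p[1] == 'Q') {
            quoting = 1;                         /* runs to \E or end of pattern */
            p += 2;
        } else if (p[0] == '\\' && p[1] == 'E') {
            p += 2;                              /* isolated \E is ignored */
        } else if (p[0] == '\\' && p[1] != '\0') {
            if (n + 2 >= out_size) return -1;    /* keep other escapes intact */
            out[n++] = *p++;
            out[n++] = *p++;
        } else {
            if (n + 1 >= out_size) return -1;
            out[n++] = *p++;
        }
    }
    if (n >= out_size) return -1;
    out[n] = '\0';
    return 0;
}
```

<P>
With this sketch, \Qabc$xyz\E becomes abc\$xyz, and \Q\\E becomes \\, which
matches a single backslash.
</P>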
<P>
Another difference from Perl is that any appearance of \Q or \E inside what
might otherwise be a quantifier causes PCRE2 not to recognize the sequence as a
quantifier. Perl recognizes a quantifier if (redundantly) either of the numbers
is inside \Q...\E, but not if the separating comma is. When not recognized as
a quantifier a sequence such as {\Q1\E,2} is treated as the literal string
"{1,2}".
<a name="digitsafterbackslash"></a></P>
<br><b>
Non-printing characters
</b><br>
<P>
A second use of backslash provides a way of encoding non-printing characters
in patterns in a visible manner. There is no restriction on the appearance of
non-printing characters in a pattern, but when a pattern is being prepared by
text editing, it is often easier to use one of the following escape sequences
instead of the binary character it represents. In an ASCII or Unicode
environment, these escapes are as follows:
<pre>
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is a non-control ASCII character
\e escape (hex 1B)
\f form feed (hex 0C)
\n linefeed (hex 0A)
\r carriage return (hex 0D) (but see below)
\t tab (hex 09)
\0dd character with octal code 0dd
\ddd character with octal code ddd, or back reference
\o{ddd..} character with octal code ddd..
\xhh character with hex code hh
\x{hhh..} character with hex code hhh..
\N{U+hhh..} character with Unicode hex code point hhh..
</pre>
A description of how back references work is given
<a href="#backreferences">later,</a>
following the discussion of
<a href="#group">parenthesized groups.</a>
</P>
<P>
By default, after \x that is not followed by {, one or two hexadecimal
digits are read (letters can be in upper or lower case). If the character that
follows \x is neither { nor a hexadecimal digit, an error occurs. This is
different from Perl's default behaviour, which generates a NUL character, but
is in line with the behaviour of Perl's 'strict' mode in re.
</P>
<P>
Any number of hexadecimal digits may appear between \x{ and }. If a character
other than a hexadecimal digit appears between \x{ and }, or if there is no
terminating }, an error occurs.
</P>
<P>
Characters whose code points are less than 256 can be defined by either of the
two syntaxes for \x or by an octal sequence. There is no difference in the way
they are handled. For example, \xdc is exactly the same as \x{dc} or \334.
However, using the braced versions does make such sequences easier to read.
</P>
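<P>
The default reading of the digits after \x can be sketched as follows. This is
an illustration of the rules stated above rather than PCRE2's parser, and
<b>parse_hex_escape()</b> is a hypothetical name. The sketch does not ignore
space or tab characters inside the braces, and it assumes that an empty \x{}
is an error.
</P>

```c
#include <stddef.h>

static int hex_digit_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: `s` points just after "\x". On success the code point
   is stored in *value and the number of characters consumed is returned.
   A return of 0 means the escape is in error. */
size_t parse_hex_escape(const char *s, unsigned long *value)
{
    unsigned long v = 0;
    size_t n = 0;

    if (s[0] == '{') {
        n = 1;
        while (hex_digit_value(s[n]) >= 0) {     /* any number of digits */
            if (v > 0x0ffffffful) return 0;      /* would exceed 32 bits */
            v = v * 16 + (unsigned long)hex_digit_value(s[n]);
            n++;
        }
        if (n == 1 || s[n] != '}') return 0;     /* no digits, bad digit, or no } */
        *value = v;
        return n + 1;
    }

    while (n < 2 && hex_digit_value(s[n]) >= 0) { /* one or two digits */
        v = v * 16 + (unsigned long)hex_digit_value(s[n]);
        n++;
    }
    if (n == 0) return 0;                        /* neither { nor a hex digit */
    *value = v;
    return n;
}
```

<P>
Both "dc" and "{dc}" give the value 220, which is octal 334. For "1F4", only
"1F" is consumed, leaving a literal "4".
</P>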
<P>
Support is available for some ECMAScript (aka JavaScript) escape sequences via
two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \x followed
by { is not recognized. Only if \x is followed by two hexadecimal digits is it
recognized as a character escape. Otherwise it is interpreted as a literal "x"
character. In this mode, support for code points greater than 255 is provided
by \u, which must be followed by four hexadecimal digits; otherwise it is
interpreted as a literal "u" character.
</P>
<P>
PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
\u{hhh..} is recognized as the character specified by hexadecimal code point.
There may be any number of hexadecimal digits, but unlike other places that
also use curly brackets, spaces are not allowed and would result in the string
being interpreted as a literal. This syntax is from ECMAScript 6.
</P>
<P>
The \N{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
UTF mode. Perl also uses \N{name} to specify characters by Unicode name; PCRE2
does not support this. Note that when \N is not followed by an opening brace
(curly bracket) it has an entirely different meaning, matching any character
that is not a newline.
</P>
<P>
There are some legacy applications where the escape sequence \r is expected to
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \r in a
pattern is converted to \n so that it matches a LF (linefeed) instead of a CR
(carriage return) character.
</P>
<P>
The precise effect of \cx is as follows: if x is a lower case letter, it is
converted to upper case. Then bit 6 of the character (hex 40) is inverted. Thus
\cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A), but \c{ becomes hex 3B
({ is 7B), and \c; becomes hex 7B (; is 3B). If the code unit following \c has
a code point less than 32 or greater than 126, a compile-time error occurs.
</P>
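<P>
The \cx calculation is small enough to write out. The following C function is
a sketch of the rule as described above; the name <b>control_escape()</b> is
hypothetical, and the argument is assumed to be an ASCII code point.
</P>

```c
/* Hypothetical helper: returns the code point generated by \cx for the ASCII
   code point x, or -1 if x is outside the permitted range 32 to 126. */
int control_escape(int x)
{
    if (x < 32 || x > 126) return -1;       /* compile-time error in PCRE2 */
    if (x >= 0x61 && x <= 0x7a) x -= 0x20;  /* lower case letter to upper case */
    return x ^ 0x40;                        /* invert bit 6 (hex 40) */
}
```

<P>
For example, control_escape(0x41) gives hex 01 for \cA, and
control_escape(0x3b) gives hex 7B for \c;.
</P>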
<P>
For differences in the way some escapes behave in EBCDIC environments,
see section
<a href="#ebcdicenvironments">"EBCDIC environments"</a>
below.
</P>
<br><b>
Octal escapes and back references
</b><br>
<P>
The escape \o must be followed by a sequence of octal digits, enclosed in
braces. An error occurs if this is not the case. This escape provides a way of
specifying character code points as octal numbers greater than 0777, and it
also allows octal numbers and backreferences to be unambiguously distinguished.
</P>
<P>
If braces are not used, after \0 up to two further octal digits are read.
However, if the PCRE2_EXTRA_NO_BS0 option is set, at least one more octal digit
must follow \0 (use \00 to generate a NUL character). Make sure you supply
two digits after the initial zero if the pattern character that follows is
itself an octal digit.
</P>
<P>
Inside a character class, when a backslash is followed by any octal digit, up
to three octal digits are read to generate a code point. Any subsequent digits
stand for themselves. The sequences \8 and \9 are treated as the literal
characters "8" and "9".
</P>
<P>
Outside a character class, Perl's handling of a backslash followed by a digit
other than 0 is complicated by ambiguity, and Perl has changed over time,
causing PCRE2 also to change. From PCRE2 release 10.45 there is an option
called PCRE2_EXTRA_PYTHON_OCTAL that causes PCRE2 to use Python's unambiguous
rules. The next two subsections describe the two sets of rules.
</P>
<P>
For greater clarity and unambiguity, it is best to avoid following \ by a
digit greater than zero. Instead, use \o{...} or \x{...} to specify numerical
character code points, and \g{...} to specify backreferences.
</P>
<br><b>
Perl rules for non-class backslash 1-9
</b><br>
<P>
All the digits that follow the backslash are read as a decimal number. If the
number is less than 10, begins with the digit 8 or 9, or if there are at least
that many previous capture groups in the expression, the entire sequence is
taken as a back reference. Otherwise, up to three octal digits are read to form
a character code. For example:
<pre>
\040 is another way of writing an ASCII space
\40 is the same, provided there are fewer than 40 previous capture groups
\7 is always a backreference
\11 might be a backreference, or another way of writing a tab
\011 is always a tab
\0113 is a tab followed by the character "3"
\113 might be a backreference, otherwise the character with octal code 113
\377 might be a backreference, otherwise the value 255 (decimal)
\81 is always a backreference
</pre>
Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal
digits are ever read.
</P>
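<P>
The decision between a backreference and an octal code can be sketched in C as
follows. This models only the rules stated in this subsection; it is not
PCRE2's parser, the names <b>digit_escape</b> and <b>perl_digit_escape()</b>
are hypothetical, and the range checks on the resulting character value are
left out.
</P>

```c
#include <stddef.h>

typedef struct {
    int is_backreference;   /* 1 for a backreference, 0 for a character code */
    unsigned long value;    /* group number or character code */
    size_t consumed;        /* number of digits used after the backslash */
} digit_escape;

/* Hypothetical helper: `digits` points just after the backslash and must start
   with a digit from 1 to 9. `group_count` is the number of capture groups
   that precede this point in the pattern. */
digit_escape perl_digit_escape(const char *digits, unsigned long group_count)
{
    digit_escape r = { 0, 0, 0 };
    unsigned long number = 0;
    size_t n = 0;

    /* Read all the following digits as a decimal number. */
    while (digits[n] >= '0' && digits[n] <= '9') {
        if (number <= 99999999ul)           /* stop growing: avoids overflow */
            number = number * 10 + (unsigned long)(digits[n] - '0');
        n++;
    }

    if (number < 10 || digits[0] == '8' || digits[0] == '9' ||
        number <= group_count) {
        r.is_backreference = 1;
        r.value = number;
        r.consumed = n;
        return r;
    }

    /* Otherwise, read up to three octal digits as a character code. */
    number = 0;
    n = 0;
    while (n < 3 && digits[n] >= '0' && digits[n] <= '7') {
        number = number * 8 + (unsigned long)(digits[n] - '0');
        n++;
    }
    r.value = number;
    r.consumed = n;
    return r;
}
```

<P>
With no previous capture groups, "11" yields character code 9 (a tab); with
eleven or more previous groups it yields backreference 11.
</P>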
<br><b>
Python rules for non-class backslash 1-9
</b><br>
<P>
If there are at least three octal digits after the backslash, exactly three are
read as an octal code point number, but the value must be no greater than
\377, even in modes where higher code point values are supported. Any
subsequent digits stand for themselves. If there are fewer than three octal
digits, the sequence is taken as a decimal back reference. Thus, for example,
\12 is always a back reference, independent of how many captures there are in
the pattern. An error is generated for a reference to a non-existent capturing
group.
</P>
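<P>
Under the Python rules the choice depends only on the digits themselves. The C
function below is a sketch of that test; <b>python_octal_escape()</b> is a
hypothetical name, and treating a value above octal 377 as an error is an
assumption drawn from the wording above.
</P>

```c
/* Hypothetical helper: `digits` points just after the backslash.
   Returns the character code (0 to 255) if the sequence starts with three
   octal digits, -1 if it is to be taken as a decimal backreference, or -2 if
   the three octal digits give a value greater than octal 377. */
int python_octal_escape(const char *digits)
{
    int value = 0;

    for (int i = 0; i < 3; i++) {
        if (digits[i] < '0' || digits[i] > '7') return -1;
        value = value * 8 + (digits[i] - '0');
    }
    return (value > 0377) ? -2 : value;
}
```

<P>
Thus "12" gives -1 (a backreference) whatever the number of capture groups,
whereas "123" gives 83, and any digits after the third stand for themselves.
</P>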
<br><b>
Constraints on character values
</b><br>
<P>
Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:
<pre>
8-bit non-UTF mode no greater than 0xff
16-bit non-UTF mode no greater than 0xffff
32-bit non-UTF mode no greater than 0xffffffff
All UTF modes no greater than 0x10ffff and a valid code point
</pre>
Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
so-called "surrogate" code points). The check for these can be disabled by the
caller of <b>pcre2_compile()</b> by setting the option
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8
and UTF-32 modes, because these values are not representable in UTF-16.
</P>
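<P>
The table and the surrogate rule above can be combined into one check. The C
function below is a sketch only; <b>escape_value_allowed()</b> and its
parameters are hypothetical and do not correspond to any PCRE2 function.
</P>

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if `value` may be specified by an octal or
   hexadecimal escape, 0 otherwise.
     code_unit_width   8, 16, or 32
     utf               non-zero in a UTF mode
     allow_surrogates  non-zero to model PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES */
int escape_value_allowed(uint64_t value, int code_unit_width, int utf,
                         int allow_surrogates)
{
    if (utf) {
        if (value > 0x10ffffu) return 0;
        if (value >= 0xd800u && value <= 0xdfffu) {
            /* Surrogates can be permitted in UTF-8 and UTF-32 only. */
            return allow_surrogates && code_unit_width != 16;
        }
        return 1;
    }
    switch (code_unit_width) {
        case 8:  return value <= 0xffu;
        case 16: return value <= 0xffffu;
        case 32: return value <= 0xffffffffu;
        default: return 0;      /* not a PCRE2 code unit width */
    }
}
```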
<br><b>
Escape sequences in character classes
</b><br>
<P>
All the sequences that define a single character value can be used both inside
and outside character classes. In addition, inside a character class, \b is
interpreted as the backspace character (hex 08).
</P>
<P>
When not followed by an opening brace, \N is not allowed in a character class.
\B, \R, and \X are not special inside a character class. Like other
unrecognized alphabetic escape sequences, they cause an error. Outside a
character class, these sequences have different meanings.
</P>
<br><b>
Unsupported escape sequences
</b><br>
<P>
In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences in patterns. However, if either of the
PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U matches a "U"
character, and \u can be used to define a character by code point, as
described above.
</P>
<br><b>
Absolute and relative backreferences
</b><br>
<P>
The sequence \g followed by a signed or unsigned number, optionally enclosed
in braces, is an absolute or relative backreference. A named backreference
can be coded as \g{name}. Backreferences are discussed
<a href="#backreferences">later,</a>
following the discussion of
<a href="#group">parenthesized groups.</a>
</P>
<br><b>
Absolute and relative subroutine calls
</b><br>
<P>
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
a number enclosed either in angle brackets or single quotes, is an alternative
syntax for referencing a capture group as a subroutine. Details are discussed
<a href="#onigurumasubroutines">later.</a>
Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
synonymous. The former is a backreference; the latter is a
<a href="#groupsassubroutines">subroutine</a>
call.
<a name="genericchartypes"></a></P>
<br><b>
Generic character types
</b><br>
<P>
Another use of backslash is for specifying generic character types:
<pre>
\d any decimal digit
\D any character that is not a decimal digit
\h any horizontal white space character
\H any character that is not a horizontal white space character
\N any character that is not a newline
\s any white space character
\S any character that is not a white space character
\v any vertical white space character
\V any character that is not a vertical white space character
\w any "word" character
\W any "non-word" character
</pre>
The \N escape sequence has the same meaning as
<a href="#fullstopdot">the "." metacharacter</a>
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
meaning of \N. Note that when \N is followed by an opening brace it has a
different meaning. See the section entitled
<a href="#digitsafterbackslash">"Non-printing characters"</a>
above for details. Perl also uses \N{name} to specify characters by Unicode
name; PCRE2 does not support this.
</P>
<P>
Each pair of lower and upper case escape sequences partitions the complete set
of characters into two disjoint sets. Any given character matches one, and only
one, of each pair. The sequences can appear both inside and outside character
classes. They each match one character of the appropriate type. If the current
matching point is at the end of the subject string, all of them fail, because
there is no character to match.
</P>
<P>
The default \s characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
space (32), which are defined as white space in the "C" locale. This list may
vary if locale-specific matching is taking place. For example, in some locales
the "non-breaking space" character (\xA0) is recognized as white space, and in
others the VT character is not.
</P>
<P>
A "word" character is an underscore or any character that is a letter or digit.
By default, the definition of letters and digits is controlled by PCRE2's
low-valued character tables, and may vary if locale-specific matching is taking
place (see
<a href="pcre2api.html#localesupport">"Locale support"</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page). For example, in a French locale such as "fr_FR" in Unix-like systems,
or "french" in Windows, some character codes greater than 127 are used for
accented letters, and these are then matched by \w. The use of locales with
Unicode is discouraged.
</P>
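<P>
The default classification can be written out explicitly. The C functions
below are a sketch that assumes ASCII code points, the "C" locale, and no
PCRE2_UCP; the function names are hypothetical, and PCRE2 itself uses lookup
tables rather than comparisons like these.
</P>

```c
/* Hypothetical helpers: default \d, \s, and \w for a code point c. */

int is_default_digit(unsigned long c)
{
    return c >= 0x30 && c <= 0x39;              /* 0 to 9 */
}

int is_default_space(unsigned long c)
{
    /* HT (9), LF (10), VT (11), FF (12), CR (13), and space (32) */
    return (c >= 9 && c <= 13) || c == 32;
}

int is_default_word(unsigned long c)
{
    return c == 0x5f                            /* underscore */
        || is_default_digit(c)
        || (c >= 0x41 && c <= 0x5a)             /* A to Z */
        || (c >= 0x61 && c <= 0x7a);            /* a to z */
}
```

<P>
In this sketch no code point greater than 127 matches any of the three types,
so, for example, the non-breaking space (hex A0) is not white space.
</P>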
<P>
By default, characters whose code points are greater than 127 never match \d,
\s, or \w, and always match \D, \S, and \W, although this may be different
for characters in the range 128-255 when locale-specific matching is happening.
These escape sequences retain their original meanings from before Unicode
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
is set, the behaviour is changed so that Unicode properties are used to
determine character types, as follows:
<pre>
\d any character that matches \p{Nd} (decimal digit)
\s any character that matches \p{Z} or \h or \v
\w any character that matches \p{L}, \p{N}, \p{Mn}, or \p{Pc}
</pre>
The addition of \p{Mn} (non-spacing mark) and the replacement of an explicit
test for underscore with a test for \p{Pc} (connector punctuation) happened in
PCRE2 release 10.43. This brings PCRE2 into line with Perl.
</P>
<P>
The upper case escapes match the inverse sets of characters. Note that \d
matches only decimal digits, whereas \w matches any Unicode digit, as well as
other character categories. Note also that PCRE2_UCP affects \b and \B
because they are defined in terms of \w and \W. Matching these sequences
is noticeably slower when PCRE2_UCP is set.
</P>
<P>
The effect of PCRE2_UCP on \d, \s, and \w can be negated individually by
the options PCRE2_EXTRA_ASCII_BSD, PCRE2_EXTRA_ASCII_BSS, and
PCRE2_EXTRA_ASCII_BSW, respectively. These options can be set and reset within
a pattern by means of an internal option setting
<a href="#internaloptions">(see below).</a>
</P>
<P>
The sequences \h, \H, \v, and \V, in contrast to the other sequences, which
match only ASCII characters by default, always match a specific list of code
points, whether or not PCRE2_UCP is set. The horizontal space characters are:
<pre>
U+0009 Horizontal tab (HT)
U+0020 Space
U+00A0 Non-break space
U+1680 Ogham space mark
U+180E Mongolian vowel separator
U+2000 En quad
U+2001 Em quad
U+2002 En space
U+2003 Em space
U+2004 Three-per-em space
U+2005 Four-per-em space
U+2006 Six-per-em space
U+2007 Figure space
U+2008 Punctuation space
U+2009 Thin space
U+200A Hair space
U+202F Narrow no-break space
U+205F Medium mathematical space
U+3000 Ideographic space
</pre>
The vertical space characters are:
<pre>
U+000A Linefeed (LF)
U+000B Vertical tab (VT)
U+000C Form feed (FF)
U+000D Carriage return (CR)
U+0085 Next line (NEL)
U+2028 Line separator
U+2029 Paragraph separator
</pre>
In 8-bit, non-UTF-8 mode, only the characters with code points less than 256
are relevant.
<a name="newlineseq"></a></P>
<br><b>
Newline sequences
</b><br>
<P>
Outside a character class, by default, the escape sequence \R matches any
Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the
following:
<pre>
(?&#62;\r\n|\n|\x0b|\f|\r|\x85)
</pre>
This is an example of an "atomic group", details of which are given
<a href="#atomicgroup">below.</a>
This particular group matches either the two-character sequence CR followed by
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
line, U+0085). Because this is an atomic group, the two-character sequence is
treated as a single unit that cannot be split.
</P>
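<P>
The 8-bit non-UTF-8 behaviour of \R described above can be sketched in C as
follows. The function <b>match_bsr_8bit()</b> is hypothetical and is not how
PCRE2 implements \R; it only shows that CRLF is tried first and is consumed as
one unit.
</P>

```c
#include <stddef.h>

/* Hypothetical helper: returns the number of code units that \R would match
   at `offset` in an 8-bit non-UTF subject (2 for CRLF, 1 for a single
   newline character), or 0 if there is no match. */
size_t match_bsr_8bit(const unsigned char *subject, size_t length,
                      size_t offset)
{
    if (offset >= length) return 0;

    /* CR followed by LF is taken as a single two-character unit. */
    if (subject[offset] == 0x0d && offset + 1 < length &&
        subject[offset + 1] == 0x0a) {
        return 2;
    }

    switch (subject[offset]) {
        case 0x0a:  /* LF */
        case 0x0b:  /* VT */
        case 0x0c:  /* FF */
        case 0x0d:  /* CR */
        case 0x85:  /* NEL */
            return 1;
        default:
            return 0;
    }
}
```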
<P>
In other modes, two additional characters whose code points are greater than 255
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
Unicode support is not needed for these characters to be recognized.
</P>
<P>
It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
at compile time. (BSR is an abbreviation for "backslash R".) This can be made
the default when PCRE2 is built; if this is the case, the other behaviour can
be requested via the PCRE2_BSR_UNICODE option. It is also possible to specify
these settings by starting a pattern string with one of the following
sequences:
<pre>
(*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence
</pre>
These override the default and the options given to the compiling function.
Note that these special settings, which are not Perl-compatible, are recognized
only at the very start of a pattern, and that they must be in upper case. If
more than one of them is present, the last one is used. They can be combined
with a change of newline convention; for example, a pattern can start with:
<pre>
(*ANY)(*BSR_ANYCRLF)
</pre>
They can also be combined with the (*UTF) or (*UCP) special sequences. Inside a
character class, \R is treated as an unrecognized escape sequence, and causes
an error.
<a name="uniextseq"></a></P>
<br><b>
Unicode character properties
</b><br>
<P>
When PCRE2 is built with Unicode support (the default), three additional escape
sequences that match characters with specific properties are available. They
can be used in any mode, though in 8-bit and 16-bit non-UTF modes these
sequences are of course limited to testing characters whose code points are
less than U+0100 or U+10000, respectively. In 32-bit non-UTF mode, code points
greater than 0x10ffff (the Unicode limit) may be encountered. These are all
treated as being in the Unknown script and with an unassigned type.
</P>
<P>
Matching characters by Unicode property is not fast, because PCRE2 has to do a
multistage table lookup in order to find a character's property. That is why
the traditional escape sequences such as \d and \w do not use Unicode
properties in PCRE2 by default, though you can make them do so by setting the
PCRE2_UCP option or by starting the pattern with (*UCP).
</P>
<P>
The extra escape sequences that provide property support are:
<pre>
\p{<i>xx</i>} a character with the <i>xx</i> property
\P{<i>xx</i>} a character without the <i>xx</i> property
\X a Unicode extended grapheme cluster
</pre>
For compatibility with Perl, negation can be specified by including a
circumflex between the opening brace and the property. For example, \p{^Lu} is
the same as \P{Lu}.
</P>
<P>
In accordance with Unicode's "loose matching" rules, ASCII white space
characters, hyphens, and underscores are ignored in the properties represented
by <i>xx</i> above. As well as the space character, ASCII white space can be
tab, linefeed, vertical tab, form feed, or carriage return.
</P>
<P>
Some properties are specified as a name only; others as a name and a value,
separated by a colon or an equals sign. The names and values consist of ASCII
letters and digits (with one Perl-specific exception, see below). They are not
case sensitive. Note, however, that the escapes themselves, \p and \P,
<i>are</i> case sensitive. There are abbreviations for many names. The following
examples are all equivalent:
<pre>
\p{bidiclass=al}
\p{BC=al}
\p{ Bidi_Class : AL }
\p{ Bi-di class = Al }
\P{ ^ Bi-di class = Al }
</pre>
There is support for Unicode script names, Unicode general category properties,
"Any", which matches any character (including newline), Bidi_Class, a number of
binary (yes/no) properties, and some special PCRE2 properties (described
<a href="#extraprops">below).</a>
Certain other Perl properties such as "InMusicalSymbols" are not supported by
PCRE2. Note that \P{Any} does not match any characters, so always causes a
match failure.
</P>
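<P>
The loose matching rule for property names described above can be illustrated
with a short comparison routine. This is a sketch under stated assumptions, not
PCRE2's code: the function names are hypothetical, only ASCII is handled, and
abbreviations such as "BC" and the colon/equals separators are not interpreted.
</P>

```c
#include <ctype.h>
#include <stddef.h>

/* Characters that loose matching skips: ASCII white space, '-', '_'. */
static int is_ignorable(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\v' ||
           c == '\f' || c == '\r' || c == '-' || c == '_';
}

/* Hypothetical helper: returns 1 if two property names are the same
   after dropping ignorable characters and folding ASCII case. */
int loose_name_equal(const char *a, const char *b)
{
    for (;;) {
        while (*a && is_ignorable((unsigned char)*a)) a++;
        while (*b && is_ignorable((unsigned char)*b)) b++;
        if (*a == '\0' || *b == '\0') return *a == *b; /* both must end */
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
        a++;
        b++;
    }
}
```

<P>
With this helper, "bidiclass", "Bidi_Class" and " Bi-di class " all compare
equal, which mirrors the equivalent examples listed above.
</P>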
<br><b>
Script properties for \p and \P
</b><br>
<P>
There are three different syntax forms for matching a script. Each Unicode
character has a basic script and, optionally, a list of other scripts ("Script
Extensions") with which it is commonly used. Using the Adlam script as an
example, \p{sc:Adlam} matches characters whose basic script is Adlam, whereas
\p{scx:Adlam} matches, in addition, characters that have Adlam in their
extensions list. The full names "script" and "script extensions" for the
property types are recognized and, as for all property specifications, an
equals sign is an alternative to the colon. If a script name is given without a
property type, for example, \p{Adlam}, it is treated as \p{scx:Adlam}. Perl
changed to this interpretation at release 5.26 and PCRE2 changed at release
10.40.
</P>
<P>
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
part of an identified script are lumped together as "Common". The current list
of recognized script names and their 4-character abbreviations can be obtained
by running this command:
<pre>
pcre2test -LS
</PRE>
</P>
<br><b>
The general category property for \p and \P
</b><br>
<P>
Each character has exactly one Unicode general category property, specified by
a two-letter abbreviation. If only one letter is specified with \p or \P, it
includes all the general category properties that start with that letter. In
this case, in the absence of negation, the curly brackets in the escape
sequence are optional; these two examples have the same effect:
<pre>
\p{L}
\pL
</pre>
The following general category property codes are supported:
<pre>
C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate
L Letter
Lc Cased letter
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter
M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark
N Number
Nd Decimal number
Nl Letter number
No Other number
P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation
S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol
Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator
</pre>
Perl originally used the name L& for the Lc property. This is still supported
by Perl, but discouraged. PCRE2 also still supports it. This property matches
any character that has the Lu, Ll, or Lt property, in other words, any letter
that is not classified as a modifier or "other". From release 10.45 of PCRE2
the properties Lu, Ll, and Lt are all treated as Lc when case-independent
matching is set by the PCRE2_CASELESS option or (?i) within the pattern. The
other properties are not affected by caseless matching.
</P>
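<P>
The relationship between one-letter and two-letter general category codes, and
the special case of Lc, can be sketched as follows. The function is a
hypothetical illustration, not a PCRE2 interface; for brevity it assumes
canonical spelling (for example "Lu", not "lu") and ignores the caseless
behaviour introduced in release 10.45.
</P>

```c
#include <string.h>

/* Hypothetical helper: does the \p specification 'spec' accept a
   character whose two-letter general category is 'category'? */
int gc_spec_matches(const char *spec, const char *category)
{
    if (strlen(category) != 2) return 0;

    /* A single letter covers every category that starts with it. */
    if (strlen(spec) == 1) return spec[0] == category[0];

    /* Lc (also written L&) means Lu, Ll, or Lt. */
    if (strcmp(spec, "Lc") == 0 || strcmp(spec, "L&") == 0)
        return strcmp(category, "Lu") == 0 ||
               strcmp(category, "Ll") == 0 ||
               strcmp(category, "Lt") == 0;

    /* Otherwise the two-letter codes must be identical. */
    return strcmp(spec, category) == 0;
}
```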
<P>
The Cs (Surrogate) property applies only to characters whose code points are in
the range U+D800 to U+DFFF. These characters are no different to any other
character when PCRE2 is not in UTF mode (using the 16-bit or 32-bit library).
However, they are not valid in Unicode strings and so cannot be tested by PCRE2
in UTF mode, unless UTF validity checking has been turned off (see the
discussion of PCRE2_NO_UTF_CHECK in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page).
</P>
<P>
The long synonyms for property names that Perl supports (such as \p{Letter})
are not supported by PCRE2, nor is it permitted to prefix any of these
properties with "Is".
</P>
<P>
No character that is in the Unicode table has the Cn (unassigned) property.
Instead, this property is assumed for any code point that is not in the
Unicode table.
</P>
<br><b>
Binary (yes/no) properties for \p and \P
</b><br>
<P>
Unicode defines a number of binary properties, that is, properties whose only
values are true or false. You can obtain a list of those that are recognized by
\p and \P, along with their abbreviations, by running this command:
<pre>
pcre2test -LP
</PRE>
</P>
<br><b>
The Bidi_Class property for \p and \P
</b><br>
<P>
<pre>
\p{Bidi_Class:&#60;class&#62;} matches a character with the given class
\p{BC:&#60;class&#62;} matches a character with the given class
</pre>
The recognized classes are:
<pre>
AL Arabic letter
AN Arabic number
B paragraph separator
BN boundary neutral
CS common separator
EN European number
ES European separator
ET European terminator
FSI first strong isolate
L left-to-right
LRE left-to-right embedding
LRI left-to-right isolate
LRO left-to-right override
NSM non-spacing mark
ON other neutral
PDF pop directional format
PDI pop directional isolate
R right-to-left
RLE right-to-left embedding
RLI right-to-left isolate
RLO right-to-left override
S segment separator
WS white space
</pre>
As in all property specifications, an equals sign may be used instead of a
colon and the class names are case-insensitive. Only the short names listed
above are recognized; PCRE2 does not at present support any long alternatives.
</P>
<br><b>
Extended grapheme clusters
</b><br>
<P>
The \X escape matches any number of Unicode characters that form an "extended
grapheme cluster", and treats the sequence as an atomic group
<a href="#atomicgroup">(see below).</a>
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
define the boundaries of extended grapheme clusters. The rules are defined in
Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
abandoned the use of some previous properties that had been used for emojis.
Instead it introduced various emoji-specific properties. PCRE2 uses only the
Extended Pictographic property.
</P>
<P>
\X always matches at least one character. Then it decides whether to add
additional characters according to the following rules for ending a cluster:
</P>
<P>
1. End at the end of the subject string.
</P>
<P>
2. Do not end between CR and LF; otherwise end after any control character.
</P>
<P>
3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters
are of five types: L, V, T, LV, and LVT. An L character may be followed by an
L, V, LV, or LVT character; an LV or V character may be followed by a V or T
character; an LVT or T character may be followed only by a T character.
</P>
<P>
4. Do not end before extending characters or spacing marks or the zero-width
joiner (ZWJ) character. Characters with the "mark" property always have the
"extend" grapheme breaking property.
</P>
<P>
5. Do not end after prepend characters.
</P>
<P>
6. Do not end within emoji modifier sequences or emoji ZWJ (zero-width
joiner) sequences. An emoji ZWJ sequence consists of a character with the
Extended_Pictographic property, optionally followed by one or more characters
with the Extend property, followed by the ZWJ character, followed by another
Extended_Pictographic character.
</P>
<P>
7. Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) characters if there are an odd number of RI characters
before the break point.
</P>
<P>
8. Otherwise, end the cluster.
<a name="extraprops"></a></P>
<br><b>
PCRE2's additional properties
</b><br>
<P>
As well as the standard Unicode properties described above, PCRE2 supports four
more that make it possible to convert traditional escape sequences such as \w
and \s to use Unicode properties. PCRE2 uses these non-standard, non-Perl
properties internally when PCRE2_UCP is set. However, they may also be used
explicitly. These properties are:
<pre>
Xan Any alphanumeric character
Xps Any POSIX space character
Xsp Any Perl space character
Xwd Any Perl "word" character
</pre>
Xan matches characters that have either the L (letter) or the N (number)
property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
carriage return, and any other character that has the Z (separator) property
(this includes the space character). Xsp is the same as Xps; in PCRE1 it used
to exclude vertical tab, for Perl compatibility, but Perl changed. Xwd matches
the same characters as Xan, plus those that match Mn (non-spacing mark) or Pc
(connector punctuation, which includes underscore).
</P>
<P>
There is another non-standard property, Xuc, which matches any character that
can be represented by a Universal Character Name in C++ and other programming
languages. These are the characters $, @, ` (grave accent), and all characters
with Unicode code points greater than or equal to U+00A0, except for the
surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are
excluded. (Universal Character Names are of the form \uHHHH or \UHHHHHHHH
where H is a hexadecimal digit. Note that the Xuc property does not match these
sequences but the characters that they represent.)
<a name="resetmatchstart"></a></P>
<br><b>
Resetting the match start
</b><br>
<P>
In normal use, the escape sequence \K causes any previously matched characters
not to be included in the final matched sequence that is returned. For example,
the pattern:
<pre>
foo\Kbar
</pre>
matches "foobar", but reports that it has matched "bar". \K does not interact
with anchoring in any way. The pattern:
<pre>
^foo\Kbar
</pre>
matches only when the subject begins with "foobar" (in single line mode),
though it again reports the matched string as "bar". This feature is similar to
a lookbehind assertion
<a href="#lookbehind">(described below),</a>
but the part of the pattern that precedes \K is not constrained to match a
limited number of characters, as is required for a lookbehind assertion. The
use of \K does not interfere with the setting of
<a href="#group">captured substrings.</a>
For example, when the pattern
<pre>
(foo)\Kbar
</pre>
matches "foobar", the first substring is still set to "foo".
</P>
<P>
From version 5.32.0 Perl forbids the use of \K in lookaround assertions. From
release 10.38 PCRE2 also forbids this by default. However, the
PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option can be used when calling
<b>pcre2_compile()</b> to re-enable the previous behaviour. When this option is
set, \K is acted upon when it occurs inside positive assertions, but is
ignored in negative assertions. Note that when a pattern such as (?=ab\K)
matches, the reported start of the match can be greater than the end of the
match. Using \K in a lookbehind assertion at the start of a pattern can also
lead to odd effects. For example, consider this pattern:
<pre>
(?&#60;=\Kfoo)bar
</pre>
If the subject is "foobar", a call to <b>pcre2_match()</b> with a starting
offset of 3 succeeds and reports the matching string as "foobar", that is, the
start of the reported match is earlier than where the match started.
<a name="smallassertions"></a></P>
<br><b>
Simple assertions
</b><br>
<P>
The final use of backslash is for certain simple assertions. An assertion
specifies a condition that has to be met at a particular point in a match,
without consuming any characters from the subject string. The use of
groups for more complicated assertions is described
<a href="#bigassertions">below.</a>
The backslashed assertions are:
<pre>
\b matches at a word boundary
\B matches when not at a word boundary
\A matches at the start of the subject
\Z matches at the end of the subject
also matches before a newline at the end of the subject
\z matches only at the end of the subject
\G matches at the first matching position in the subject
</pre>
Inside a character class, \b has a different meaning; it matches the backspace
character. If any other of these assertions appears in a character class, an
"invalid escape sequence" error is generated.
</P>
<P>
A word boundary is a position in the subject string where the current character
and the previous character do not both match \w or \W (i.e. one matches
\w and the other matches \W), or the start or end of the string if the
first or last character matches \w, respectively. When PCRE2 is built with
Unicode support, the meanings of \w and \W can be changed by setting the
PCRE2_UCP option. When this is done, it also affects \b and \B. Neither PCRE2
nor Perl has a separate "start of word" or "end of word" metasequence. However,
whatever follows \b normally determines which it is. For example, the fragment
\ba matches "a" at the start of a word.
</P>
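<P>
The word boundary definition above can be expressed as a small test on the
characters either side of a position. This sketch assumes the default
(non-UCP) meaning of \w as ASCII letters, digits, and underscore in the "C"
locale; the function names are hypothetical.
</P>

```c
#include <ctype.h>
#include <stddef.h>

/* \w under the stated ASCII assumption. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Hypothetical helper: is position pos (0..len) a \b position?
   Outside the string counts as "not a word character". */
int at_word_boundary(const char *s, size_t len, size_t pos)
{
    int prev = pos > 0 && is_word_char((unsigned char)s[pos - 1]);
    int cur = pos < len && is_word_char((unsigned char)s[pos]);
    return prev != cur; /* exactly one side is a word character */
}
```

<P>
\B is simply the negation of this test.
</P>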
<P>
The \A, \Z, and \z assertions differ from the traditional circumflex and
dollar (described in the next section) in that they only ever match at the very
start and end of the subject string, whatever options are set. Thus, they are
independent of multiline mode. These three assertions are not affected by the
PCRE2_NOTBOL or PCRE2_NOTEOL options, which affect only the behaviour of the
circumflex and dollar metacharacters. However, if the <i>startoffset</i>
argument of <b>pcre2_match()</b> is non-zero, indicating that matching is to
start at a point other than the beginning of the subject, \A can never match.
The difference between \Z and \z is that \Z matches before a newline at the
end of the string as well as at the very end, whereas \z matches only at the
end.
</P>
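<P>
The difference between \Z and \z can be shown with two position tests. This is
a minimal sketch that assumes the newline convention is a single LF; the
function names are hypothetical.
</P>

```c
#include <stddef.h>

/* \z: true only at the very end of the subject. */
int at_small_z(size_t len, size_t pos)
{
    return pos == len;
}

/* \Z: true at the very end, or just before a newline that is the
   last character of the subject (LF assumed as the newline). */
int at_big_z(const char *s, size_t len, size_t pos)
{
    if (pos == len) return 1;
    return pos + 1 == len && s[pos] == '\n';
}
```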
<P>
The \G assertion is true only when the current matching position is at the
start point of the matching process, as specified by the <i>startoffset</i>
argument of <b>pcre2_match()</b>. It differs from \A when the value of
<i>startoffset</i> is non-zero. By calling <b>pcre2_match()</b> multiple times
with appropriate arguments, you can mimic Perl's /g option, and it is in this
kind of implementation where \G can be useful.
</P>
<P>
Note, however, that PCRE2's implementation of \G, being true at the starting
character of the matching process, is subtly different from Perl's, which
defines it as true at the end of the previous match. In Perl, these can be
different when the previously matched string was empty. Because PCRE2 does just
one match at a time, it cannot reproduce this behaviour.
</P>
<P>
If all the alternatives of a pattern begin with \G, the expression is anchored
to the starting match position, and the "anchored" flag is set in the compiled
regular expression.
</P>
<br><a name="SEC6" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br>
<P>
The circumflex and dollar metacharacters are zero-width assertions. That is,
they test for a particular condition being true without consuming any
characters from the subject string. These two metacharacters are concerned with
matching the starts and ends of lines. If the newline convention is set so that
only the two-character sequence CRLF is recognized as a newline, isolated CR
and LF characters are treated as ordinary data characters, and are not
recognized as newlines.
</P>
<P>
Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching point is at
the start of the subject string. If the <i>startoffset</i> argument of
<b>pcre2_match()</b> is non-zero, or if PCRE2_NOTBOL is set, circumflex can
never match if the PCRE2_MULTILINE option is unset. Inside a character class,
circumflex has an entirely different meaning
<a href="#characterclass">(see below).</a>
</P>
<P>
Circumflex need not be the first character of the pattern if a number of
alternatives are involved, but it should be the first thing in each alternative
in which it appears if the pattern is ever to match that branch. If all
possible alternatives start with a circumflex, that is, if the pattern is
constrained to match only at the start of the subject, it is said to be an
"anchored" pattern. (There are also other constructs that can cause a pattern
to be anchored.)
</P>
<P>
The dollar character is an assertion that is true only if the current matching
point is at the end of the subject string, or immediately before a newline at
the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however,
that it does not actually match the newline. Dollar need not be the last
character of the pattern if a number of alternatives are involved, but it
should be the last item in any branch in which it appears. Dollar has no
special meaning in a character class.
</P>
<P>
The meaning of dollar can be changed so that it matches only at the very end of
the string, by setting the PCRE2_DOLLAR_ENDONLY option at compile time. This
does not affect the \Z assertion.
</P>
<P>
The meanings of the circumflex and dollar metacharacters are changed if the
PCRE2_MULTILINE option is set. When this is the case, a dollar character
matches before any newlines in the string, as well as at the very end, and a
circumflex matches immediately after internal newlines as well as at the start
of the subject string. It does not match after a newline that ends the string,
for compatibility with Perl. However, this can be changed by setting the
PCRE2_ALT_CIRCUMFLEX option.
</P>
<P>
For example, the pattern /^abc$/ matches the subject string "def\nabc" (where
\n represents a newline) in multiline mode, but not otherwise. Consequently,
patterns that are anchored in single line mode because all branches start with
^ are not anchored in multiline mode, and a match for circumflex is possible
when the <i>startoffset</i> argument of <b>pcre2_match()</b> is non-zero. The
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
</P>
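<P>
The effect of PCRE2_MULTILINE on circumflex and dollar can be sketched as two
position tests. The code below is a hand-written illustration with hypothetical
names. It assumes LF as the only newline and a zero <i>startoffset</i>, and it
does not model PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_DOLLAR_ENDONLY, or
PCRE2_ALT_CIRCUMFLEX.
</P>

```c
#include <stddef.h>

/* Can ^ match at position pos (0..len)? */
int circumflex_matches(const char *s, size_t len, size_t pos, int multiline)
{
    if (pos == 0) return 1;      /* start of subject, in either mode */
    if (!multiline) return 0;
    /* After an internal newline, but not after one that ends the string. */
    return pos < len && s[pos - 1] == '\n';
}

/* Can $ match at position pos (0..len)? */
int dollar_matches(const char *s, size_t len, size_t pos, int multiline)
{
    if (pos > len) return 0;
    if (pos == len) return 1;    /* very end of subject, in either mode */
    if (s[pos] != '\n') return 0;
    /* Before any newline in multiline mode; otherwise only before a
       newline that is the last character of the subject. */
    return multiline || pos + 1 == len;
}
```

<P>
For the subject "def\nabc", circumflex is accepted at offset 4 only when the
multiline flag is set, which is why /^abc$/ matches there in multiline mode but
not otherwise.
</P>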
<P>
When the newline convention (see
<a href="#newlines">"Newline conventions"</a>
above) recognizes the two-character sequence CRLF as a newline, this is
preferred, even if the single characters CR and LF are also recognized as
newlines. For example, if the newline convention is "any", a multiline mode
circumflex matches before "xyz" in the string "abc\r\nxyz" rather than after
CR, even though CR on its own is a valid newline. (It also matches at the very
start of the string, of course.)
</P>
<P>
Note that the sequences \A, \Z, and \z can be used to match the start and
end of the subject in both modes, and if all branches of a pattern start with
\A it is always anchored, whether or not PCRE2_MULTILINE is set.
<a name="fullstopdot"></a></P>
<br><a name="SEC7" href="#TOC1">FULL STOP (PERIOD, DOT) AND \N</a><br>
<P>
Outside a character class, a dot in the pattern matches any one character in
the subject string except (by default) a character that signifies the end of a
line. One or more characters may be specified as line terminators (see
<a href="#newlines">"Newline conventions"</a>
above).
</P>
<P>
Dot never matches a single line-ending character. When the two-character
sequence CRLF is the only line ending, dot does not match CR if it is
immediately followed by LF, but otherwise it matches all characters (including
isolated CRs and LFs). When ANYCRLF is selected for line endings, no occurrences
of CR or LF match dot. When all Unicode line endings are being recognized, dot
does not match CR or LF or any of the other line ending characters.
</P>
<P>
The behaviour of dot with regard to newlines can be changed. If the
PCRE2_DOTALL option is set, a dot matches any one character, without exception.
If the two-character sequence CRLF is present in the subject string, it takes
two dots to match it.
</P>
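<P>
The interaction between dot, the newline convention, and PCRE2_DOTALL can be
sketched as a single predicate. This is an illustration with hypothetical names,
not PCRE2's implementation, and it covers only the CR, LF, CRLF, and ANYCRLF
conventions on 8-bit subjects.
</P>

```c
#include <stddef.h>

typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF } newline_kind;

/* Hypothetical helper: would a dot match the character at s[pos]? */
int dot_matches(const char *s, size_t len, size_t pos,
                newline_kind nl, int dotall)
{
    if (pos >= len) return 0;    /* dot always needs a character */
    if (dotall) return 1;        /* DOTALL: any character, no exception */

    char c = s[pos];
    switch (nl) {
    case NL_CR:
        return c != '\r';
    case NL_LF:
        return c != '\n';
    case NL_CRLF:
        /* Only a CR immediately followed by LF is refused. */
        return !(c == '\r' && pos + 1 < len && s[pos + 1] == '\n');
    case NL_ANYCRLF:
        return c != '\r' && c != '\n';
    }
    return 0;
}
```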
<P>
The handling of dot is entirely independent of the handling of circumflex and
dollar, the only relationship being that they both involve newlines. Dot has no
special meaning in a character class.
</P>
<P>
The escape sequence \N when not followed by an opening brace behaves like a
dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
it matches any character except one that signifies the end of a line.
</P>
<P>
When \N is followed by an opening brace it has a different meaning. See the
section entitled
<a href="#digitsafterbackslash">"Non-printing characters"</a>
above for details. Perl also uses \N{name} to specify characters by Unicode
name; PCRE2 does not support this.
</P>
<br><a name="SEC8" href="#TOC1">MATCHING A SINGLE CODE UNIT</a><br>
<P>
Outside a character class, the escape sequence \C matches any one code unit,
whether or not a UTF mode is set. In the 8-bit library, one code unit is one
byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
32-bit unit. Unlike a dot, \C always matches line-ending characters. The
feature is provided in Perl in order to match individual bytes in UTF-8 mode,
but it is unclear how it can usefully be used.
</P>
<P>
Because \C breaks up characters into individual code units, matching one unit
with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
with a malformed UTF character. This has undefined results, because PCRE2
assumes that it is matching character by character in a valid UTF string (by
default it checks the subject string's validity at the start of processing
unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
</P>
<P>
An application can lock out the use of \C by setting the
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
build PCRE2 with the use of \C permanently disabled.
</P>
<P>
PCRE2 does not allow \C to appear in lookbehind assertions
<a href="#lookbehind">(described below)</a>
in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
the length of the lookbehind. Neither the alternative matching function
<b>pcre2_dfa_match()</b> nor the JIT optimizer supports \C in these UTF modes.
The former gives a match-time error; the latter fails to optimize and so the
match is always run using the interpreter.
</P>
<P>
In the 32-bit library, however, \C is always supported (when not explicitly
locked out) because it always matches a single code unit, whether or not UTF-32
is specified.
</P>
<P>
In general, the \C escape sequence is best avoided. However, one way of using
it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
lookahead to check the length of the next character, as in this pattern, which
could be used with a UTF-8 string (ignore white space and line breaks):
<pre>
(?| (?=[\x00-\x7f])(\C) |
(?=[\x80-\x{7ff}])(\C)(\C) |
(?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
(?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
</pre>
In this example, a group that starts with (?| resets the capturing parentheses
numbers in each alternative (see
<a href="#dupgroupnumber">"Duplicate Group Numbers"</a>
below). The assertions at the start of each branch check the next UTF-8
character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
character's individual bytes are then captured by the appropriate number of
\C groups.
<a name="characterclass"></a></P>
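<P>
The code point ranges tested by the four lookaheads in the pattern above
correspond to the number of UTF-8 code units needed to encode a character. The
following sketch shows that mapping directly; the function name is
hypothetical. The upper limit of 0x1fffff mirrors the pattern, although valid
Unicode stops at 0x10ffff.
</P>

```c
#include <stdint.h>

/* Hypothetical helper: number of UTF-8 code units (bytes) used to
   encode code point cp, following the ranges in the pattern above.
   Returns 0 for values beyond the last range. */
int utf8_code_units(uint32_t cp)
{
    if (cp <= 0x7F) return 1;       /* [\x00-\x7f]          */
    if (cp <= 0x7FF) return 2;      /* [\x80-\x{7ff}]       */
    if (cp <= 0xFFFF) return 3;     /* [\x{800}-\x{ffff}]   */
    if (cp <= 0x1FFFFF) return 4;   /* [\x{10000}-\x{1fffff}] */
    return 0;
}
```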
<br><a name="SEC9" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
<P>
An opening square bracket introduces a character class, terminated by a closing
square bracket. A closing square bracket on its own is not special by default.
If a closing square bracket is required as a member of the class, it should be
the first data character in the class (after an initial circumflex, if present)
or escaped with a backslash. This means that, by default, an empty class cannot
be defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing
square bracket at the start does end the (empty) class.
</P>
<P>
A character class matches a single character in the subject. A matched
character must be in the set of characters defined by the class, unless the
first character in the class definition is a circumflex, in which case the
subject character must not be in the set defined by the class. If a circumflex
is actually required as a member of the class, ensure it is not the first
character, or escape it with a backslash.
</P>
<P>
For example, the character class [aeiou] matches any lower case English vowel,
whereas [^aeiou] matches all other characters. Note that a circumflex is just a
convenient notation for specifying the characters that are in the class by
enumerating those that are not. A class that starts with a circumflex is not an
assertion; it still consumes a character from the subject string, and therefore
it fails to match if the current pointer is at the end of the string.
</P>
<P>
Characters in a class may be specified by their code points using \o, \x, or
\N{U+hh..} in the usual way. When caseless matching is set, any letters in a
class represent both their upper case and lower case versions, so for example,
a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. Note that there are two ASCII
characters, K and S, that, in addition to their lower case ASCII equivalents,
are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S)
respectively when either PCRE2_UTF or PCRE2_UCP is set. If you do not want
these ASCII/non-ASCII case equivalences, you can suppress them by setting
PCRE2_EXTRA_CASELESS_RESTRICT, either as an option in a compile context, or by
including (*CASELESS_RESTRICT) or (?r) within a pattern.
</P>
<P>
Characters that might indicate line breaks are never treated in any special way
when matching character classes, whatever line-ending sequence is in use, and
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
class such as [^a] always matches one of these characters.
</P>
<P>
The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
\S, \v, \V, \w, and \W may appear in a character class, and add the
characters that they match to the class. For example, [\dABCDEF] matches any
hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
\d, \s, \w and their upper case partners, just as it does when they appear
outside a character class, as described in the section entitled
<a href="#genericchartypes">"Generic character types"</a>
above. The escape sequence \b has a different meaning inside a character
class; it matches the backspace character. The sequences \B, \R, and \X are
not special inside a character class. Like any other unrecognized escape
sequences, they cause an error. The same is true for \N when not followed by
an opening brace.
</P>
<P>
The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m,
inclusive. If a minus character is required in a class, it must be escaped with
a backslash or appear in a position where it cannot be interpreted as
indicating a range, typically as the first or last character in the class,
or immediately after a range. For example, [b-d-z] matches letters in the range
b to d, a hyphen character, or z.
</P>
<P>
There is some special treatment for alphabetic ranges in EBCDIC environments;
see the section
<a href="#ebcdicenvironments">"EBCDIC environments"</a>
below.
</P>
<P>
Perl treats a hyphen as a literal if it appears before or after a POSIX class
(see below) or before or after a character type escape such as \d or \H.
However, unless the hyphen is the last character in the class, Perl outputs a
warning in its warning mode, as this is most likely a user error. As PCRE2 has
no facility for warning, an error is given in these cases.
</P>
<P>
It is not possible to have the literal character "]" as the end character of a
range. A pattern such as [W-]46] is interpreted as a class of two characters
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
the end of a range, so [W-\]46] is interpreted as a class containing a range
and two other characters. The octal or hexadecimal representation of "]" can
also be used to end a range.
</P>
<P>
Ranges normally include all code points between the start and end characters,
inclusive. They can also be used for code points specified numerically, for
example [\000-\037]. Ranges can include any characters that are valid for the
current mode. In any UTF mode, the so-called "surrogate" characters (those
whose code points lie between 0xd800 and 0xdfff inclusive) may not be specified
explicitly by default (the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables
this check). However, ranges such as [\x{d7ff}-\x{e000}], which include the
surrogates, are always permitted.
</P>
<P>
If a range that includes letters is used when caseless matching is set, it
matches the letters in either case. For example, [W-c] is equivalent to
[][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
tables for a French locale are in use, [\xc8-\xcb] matches accented E
characters in both cases.
</P>
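<P>
The behaviour of a caseless range can be sketched by testing a character and
both of its case variants against the range. This illustration assumes ASCII
in the "C" locale and uses a hypothetical function name; it does not model
Unicode case equivalences or locale-specific character tables.
</P>

```c
#include <ctype.h>

/* Hypothetical helper: does c match the range [lo-hi] when caseless
   matching is set? A character matches if it, or its other case,
   lies inside the range. */
int in_range_caseless(unsigned char c, unsigned char lo, unsigned char hi)
{
    int lower = tolower(c);
    int upper = toupper(c);
    return (c >= lo && c <= hi) ||
           (lower >= lo && lower <= hi) ||
           (upper >= lo && upper <= hi);
}
```

<P>
For the range W to c this accepts "w", "z", "a", "A", and "[", but rejects "d"
and "V", in line with the equivalent class given above.
</P>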
<P>
A circumflex can conveniently be used with the upper case character types to
specify a more restricted set of characters than the matching lower case type.
For example, the class [^\W_] matches any letter or digit, but not underscore,
whereas [\w] includes underscore. A positive character class should be read as
"something OR something OR ..." and a negative class as "NOT something AND NOT
something AND NOT ...".
</P>
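<P>
As an illustration of the "NOT something AND NOT something" reading, the sketch
below compares [^\W_] with [\w]. It uses Python's "re" module with its ASCII
flag as an approximation of PCRE2's default (non-UCP) behaviour; that mapping
is an assumption, and the variable names are hypothetical.
</P>

```python
import re

# NOT a non-word character AND NOT underscore: letters and digits only
alnum_no_underscore = re.compile(r"[^\W_]", re.ASCII)

# The positive class keeps the underscore
word_class = re.compile(r"[\w]", re.ASCII)

for ch in ("a", "7", "_", "-"):
    print(
        repr(ch),
        alnum_no_underscore.fullmatch(ch) is not None,
        word_class.fullmatch(ch) is not None,
    )
```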
<P>
The metacharacters that are recognized in character classes are backslash,
hyphen (when it can be interpreted as specifying a range), circumflex
(only at the start), and the terminating closing square bracket. An opening
square bracket is also special when it can be interpreted as introducing a
POSIX class (see
<a href="#posixclasses">"Posix character classes"</a>
below), or a special compatibility feature (see
<a href="#wordboundcompat">"Compatibility feature for word boundaries"</a>
below). Escaping any non-alphanumeric character in a class turns it into a
literal, whether or not it would otherwise be a metacharacter.
</P>
<br><a name="SEC10" href="#TOC1">PERL EXTENDED CHARACTER CLASSES</a><br>
<P>
From release 10.45 PCRE2 supports Perl's (?[...]) extended character class
syntax. This can be used to perform set operations such as intersection on
character classes.
</P>
<P>
The syntax permitted within (?[...]) is quite different to ordinary character
classes. Inside the extended class, there is an expression syntax consisting of
"atoms", operators, and ordinary parentheses "()" used for grouping. Such
classes always have the Perl /xx modifier (PCRE2 option PCRE2_EXTENDED_MORE)
turned on within them. This means that literal space and tab characters are
ignored everywhere in the class.
</P>
<P>
The allowed atoms are individual characters specified by escape sequences such
as \n or \x{123}, character types such as \d, POSIX classes such as
[:alpha:], and nested ordinary (non-extended) character classes. For example,
in (?[\d & [...]]) the nested class [...] follows the usual rules for ordinary
character classes, in which parentheses are not metacharacters, and character
literals and ranges are permitted.
</P>
<P>
Character literals and ranges may not appear outside a nested ordinary
character class because they are not atoms in the extended syntax. The extended
syntax does not introduce any additional escape sequences, so (?[\y]) is an
unknown escape, as it would be in [\y].
</P>
<P>
In the extended syntax, ^ does not negate a class (except within an
ordinary class nested inside an extended class); it is instead a binary
operator.
</P>
<P>
The binary operators are "&" (intersection), "|" or "+" (union), "-"
(subtraction) and "^" (symmetric difference). These are left-associative and
"&" has higher (tighter) precedence, while the others have equal lower
precedence. The one prefix unary operator is "!" (complement), with highest
precedence.
</P>
<br><a name="SEC11" href="#TOC1">UTS#18 EXTENDED CHARACTER CLASSES</a><br>
<P>
The PCRE2_ALT_EXTENDED_CLASS option enables an alternative to Perl's (?[...])
syntax, allowing instead extended class behaviour inside ordinary [...]
character classes. This altered syntax for [...] classes is loosely described
by the Unicode standard UTS#18. The PCRE2_ALT_EXTENDED_CLASS option does not
prevent use of (?[...]) classes; it just changes the meaning of all
[...] classes that are not nested inside a Perl (?[...]) class.
</P>
<P>
Firstly, in ordinary Perl [...] syntax, an expression such as "[a[]" is a
character class with two literal characters "a" and "[", but in UTS#18 extended
classes the "[" character becomes an additional metacharacter within classes,
denoting the start of a nested class, so a literal "[" must be escaped as "\[".
</P>
<P>
Secondly, within the UTS#18 extended syntax, there are operators "||", "&&",
"--" and "~~" which denote character class union, intersection, subtraction,
and symmetric difference respectively. In standard Perl syntax, these would
simply be needlessly-repeated literals (except for "--" which could be the
start or end of a range). In UTS#18 extended classes these operators can be used
in constructs such as [\p{L}--[QW]] for "Unicode letters, other than Q and W".
A literal "-" at the start or end of a range must be escaped, so while "[--1]"
in Perl syntax is the range from hyphen to "1", it must be escaped as "[\--1]"
in UTS#18 extended classes.
</P>
<P>
Unlike Perl's (?[...]) extended classes, the PCRE2_EXTENDED_MORE option to
ignore space and tab characters is not automatically enabled for UTS#18
extended classes, but it is honoured if set.
</P>
<P>
Extended UTS#18 classes can be nested, and nested classes are themselves
extended classes (unlike Perl, where nested classes must be simple classes).
For example, [\p{L}&&[\p{Thai}||\p{Greek}]] matches any letter that is in
the Thai or Greek scripts. Note that this means that no special grouping
characters (such as the parentheses used in Perl's (?[...]) class syntax) are
needed.
</P>
<P>
Individual class items (literal characters, literal ranges, properties such as
\d or \p{...}, and nested classes) can be combined by juxtaposition or by an
operator. Juxtaposition is the implicit union operator, and binds more tightly
than any explicit operator. Thus a sequence of literals and/or ranges behaves
as if it is enclosed in square brackets. For example, [A-Z0-9&&[^E8]] is the
same as [[A-Z0-9]&&[^E8]], which matches any upper case alphanumeric character
except "E" or "8".
</P>
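<P>
The meaning of [A-Z0-9&&[^E8]] can be modelled with ordinary set operations.
The sketch below is written in Python, whose standard "re" module does not
implement UTS#18 extended classes, so the intersection is computed with Python
sets and then compared with an emulation built from a negative lookahead. The
restriction to ASCII and all of the names are assumptions made for the
illustration.
</P>

```python
import re
import string

# [A-Z0-9]: juxtaposed items form an implicit union
upper_alnum = set(string.ascii_uppercase + string.digits)

# [^E8], restricted here to the ASCII range for the sake of the example
not_e8 = {chr(cp) for cp in range(128)} - {"E", "8"}

# && is intersection
expected = upper_alnum & not_e8

# The same set written for an engine without extended classes
emulated = re.compile(r"(?![E8])[A-Z0-9]")
matched = {chr(cp) for cp in range(128) if emulated.fullmatch(chr(cp))}

print(sorted(expected) == sorted(matched))
print("E" in matched, "8" in matched, "F" in matched)
```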
<P>
Precedence between the explicit operators is not defined, so mixing operators
is a syntax error. For example, [A&&B--C] is an error, but [A&&[B--C]] is
valid.
</P>
<P>
This is an emerging syntax which is being adopted gradually across the regex
ecosystem: for example JavaScript adopted the "/v" flag in ECMAScript 2024;
Python's "re" module reserves the syntax for future use with a FutureWarning
for unescaped use of "[" as a literal within character classes. Due to UTS#18
providing insufficient guidance, engines interpret the syntax differently.
Rust's "regex" crate and Python's "regex" PyPI module both implement UTS#18
extended classes, but with slight incompatibilities ([A||B&&C] is parsed as
[A||[B&&C]] in Python's "regex" but as [[A||B]&&C] in Rust's "regex").
</P>
<P>
PCRE2 adds syntax restrictions similar to those of ECMAScript's /v flag, so
that all the UTS#18 extended classes accepted as valid by PCRE2 have the
property that they are interpreted either with the same behaviour, or as
invalid, by all other major engines. Please file an issue if you are aware of
cross-engine differences in behaviour between PCRE2 and another major engine.
<a name="posixclasses"></a></P>
<br><a name="SEC12" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
<P>
Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE2 also supports
this notation, in both ordinary and extended classes. For example,
<pre>
[01[:alpha:]%]
</pre>
matches "0", "1", any alphabetic character, or "%". The supported class names
are:
<pre>
  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits and space
  space    white space (the same as \s from PCRE1 8.34)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits
</pre>
The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). If locale-specific matching is taking place, the list of space
characters may be different; there may be fewer or more of them. "Space" and
\s match the same set of characters, as do "word" and \w.
</P>
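<P>
The default list of "space" characters can be reproduced by enumerating the
ASCII code points that match \s. This sketch uses Python's "re" module with
the ASCII flag, whose \s happens to cover the same six characters; it is not a
PCRE2 call, and the name ascii_space is hypothetical.
</P>

```python
import re

# Collect every ASCII code point that matches \s in ASCII-only mode
ascii_space = [
    cp for cp in range(128) if re.fullmatch(r"\s", chr(cp), re.ASCII)
]

# HT, LF, VT, FF, CR, and space
print(ascii_space)
```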
<P>
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
5.8. Another Perl extension is negation, which is indicated by a ^ character
after the colon. For example,
<pre>
[12[:^digit:]]
</pre>
matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
</P>
<P>
By default, characters with values greater than 127 do not match any of the
POSIX character classes, although this may be different for characters in the
range 128-255 when locale-specific matching is happening. However, in UCP mode,
unless certain options are set (see below), some of the classes are changed so
that Unicode character properties are used. This is achieved by replacing
POSIX classes with other sequences, as follows:
<pre>
[:alnum:] becomes \p{Xan}
[:alpha:] becomes \p{L}
[:blank:] becomes \h
[:cntrl:] becomes \p{Cc}
[:digit:] becomes \p{Nd}
[:lower:] becomes \p{Ll}
[:space:] becomes \p{Xps}
[:upper:] becomes \p{Lu}
[:word:] becomes \p{Xwd}
</pre>
Negated versions, such as [:^alpha:], use \P instead of \p. Four other POSIX
classes are handled specially in UCP mode:
</P>
<P>
[:graph:]
This matches characters that have glyphs that mark the page when printed. In
Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
properties, except for:
<pre>
U+061C Arabic Letter Mark
U+180E Mongolian Vowel Separator
U+2066 - U+2069 Various "isolate"s
</PRE>
</P>
<P>
[:print:]
This matches the same characters as [:graph:] plus space characters that are
not controls, that is, characters with the Zs property.
</P>
<P>
[:punct:]
This matches all characters that have the Unicode P (punctuation) property,
plus those characters with code points less than 256 that have the S (Symbol)
property.
</P>
<P>
[:xdigit:]
In addition to the ASCII hexadecimal digits, this also matches the "fullwidth"
versions of those characters, whose Unicode code points start at U+FF10. This
is a change that was made in PCRE2 release 10.43 for Perl compatibility.
</P>
<P>
The other POSIX classes are unchanged by PCRE2_UCP, and match only characters
with code points less than 256.
</P>
<P>
There are two options that can be used to restrict the POSIX classes to ASCII
characters when PCRE2_UCP is set. The option PCRE2_EXTRA_ASCII_DIGIT affects
just [:digit:] and [:xdigit:]. Within a pattern, this can be set and unset by
(?aT) and (?-aT). The PCRE2_EXTRA_ASCII_POSIX option disables UCP processing
for all POSIX classes, including [:digit:] and [:xdigit:]. Within a pattern,
(?aP) and (?-aP) set and unset both these options for consistency.
<a name="wordboundcompat"></a></P>
<br><a name="SEC13" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
<P>
In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly
syntax [[:&#60;:]] and [[:&#62;:]] is used for matching "start of word" and "end of
word". PCRE2 treats these items as follows:
<pre>
[[:&#60;:]] is converted to \b(?=\w)
[[:&#62;:]] is converted to \b(?&#60;=\w)
</pre>
Only these exact character sequences are recognized. A sequence such as
[a[:&#60;:]b] provokes an error for an unrecognized POSIX class name. This support is
not compatible with Perl. It is provided to help migrations from other
environments, and is best not used in any new patterns. Note that \b matches
at the start and the end of a word (see
<a href="#smallassertions">"Simple assertions"</a>
above), and in a Perl-style pattern the preceding or following character
normally shows which is wanted, without the need for the assertions that are
used above in order to give exactly the POSIX behaviour. Note also that the
PCRE2_UCP option changes the meaning of \w (and therefore \b) by default, so
it also affects these POSIX sequences.
</P>
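<P>
The two replacement sequences shown above can be tried directly, because they
use only \b, \w, and lookaround assertions. The following sketch runs them in
Python's "re" module (an assumption: it is a stand-in for PCRE2, and it does
not accept the BSD [[:&#60;:]] spelling itself). The constant names are
illustrative.
</P>

```python
import re

START_OF_WORD = r"\b(?=\w)"   # what [[:<:]] is converted to
END_OF_WORD = r"\b(?<=\w)"    # what [[:>:]] is converted to

text = "hi there"

# Both patterns match the empty string, so only the offsets are of interest
starts = [m.start() for m in re.finditer(START_OF_WORD, text)]
ends = [m.start() for m in re.finditer(END_OF_WORD, text)]

print(starts)
print(ends)
```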
<br><a name="SEC14" href="#TOC1">VERTICAL BAR</a><br>
<P>
Vertical bar characters are used to separate alternative patterns. For example,
the pattern
<pre>
gilbert|sullivan
</pre>
matches either "gilbert" or "sullivan". Any number of alternatives may appear,
and an empty alternative is permitted (matching the empty string). The matching
process tries each alternative in turn, from left to right, and the first one
that succeeds is used. If the alternatives are within a group
<a href="#group">(defined below),</a>
"succeeds" means matching the rest of the main pattern as well as the
alternative in the group.
<a name="internaloptions"></a></P>
<br><a name="SEC15" href="#TOC1">INTERNAL OPTION SETTING</a><br>
<P>
The settings of several options can be changed within a pattern by a sequence
of letters enclosed between "(?" and ")". The following are Perl-compatible,
and are described in detail in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation. The option letters are:
<pre>
i for PCRE2_CASELESS
m for PCRE2_MULTILINE
n for PCRE2_NO_AUTO_CAPTURE
s for PCRE2_DOTALL
x for PCRE2_EXTENDED
xx for PCRE2_EXTENDED_MORE
</pre>
For example, (?im) sets caseless, multiline matching. It is also possible to
unset these options by preceding the relevant letters with a hyphen, for
example (?-im). The two "extended" options are not independent; unsetting
either one cancels the effects of both of them.
</P>
<P>
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
permitted. Only one hyphen may appear in the options string. If a letter
appears both before and after the hyphen, the option is unset. An empty options
setting "(?)" is allowed. Needless to say, it has no effect.
</P>
<P>
If the first character following (? is a circumflex, it causes all of the above
options to be unset. Letters may follow the circumflex to cause some options to
be re-instated, but a hyphen may not appear.
</P>
<P>
Some PCRE2-specific options can be changed by the same mechanism using these
pairs or individual letters:
<pre>
aD for PCRE2_EXTRA_ASCII_BSD
aS for PCRE2_EXTRA_ASCII_BSS
aW for PCRE2_EXTRA_ASCII_BSW
aP for PCRE2_EXTRA_ASCII_POSIX and PCRE2_EXTRA_ASCII_DIGIT
aT for PCRE2_EXTRA_ASCII_DIGIT
r for PCRE2_EXTRA_CASELESS_RESTRICT
J for PCRE2_DUPNAMES
U for PCRE2_UNGREEDY
</pre>
However, except for 'r', these are not unset by (?^), which is equivalent to
(?-imnrsx). If 'a' is not followed by any of the upper case letters shown
above, it sets (or unsets) all the ASCII options.
</P>
<P>
PCRE2_EXTRA_ASCII_DIGIT has no additional effect when PCRE2_EXTRA_ASCII_POSIX
is set, but including it in (?aP) means that (?-aP) suppresses all ASCII
restrictions for POSIX classes.
</P>
<P>
When one of these option changes occurs at top level (that is, not inside group
parentheses), the change applies until a subsequent change, or the end of the
pattern. An option change within a group (see below for a description of
groups) affects only that part of the group that follows it. At the end of the
group these options are reset to the state they were before the group. For
example,
<pre>
(a(?i)b)c
</pre>
matches abc and aBc and no other strings (assuming PCRE2_CASELESS is not set
externally). Any changes made in one alternative do carry on into subsequent
branches within the same group. For example,
<pre>
(a(?i)b|c)
</pre>
matches "ab", "aB", "c", and "C", even though when matching "C" the first
branch is abandoned before the option setting. This is because the effects of
option settings happen at compile time. There would be some very weird
behaviour otherwise.
</P>
<P>
As a convenient shorthand, if any option settings are required at the start of
a non-capturing group (see the next section), the option letters may
appear between the "?" and the ":". Thus the two patterns
<pre>
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
</pre>
match exactly the same set of strings.
</P>
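<P>
A short check of the (?i:...) shorthand follows. It is a sketch in Python's
"re" module, which supports this scoped form but, in recent versions, rejects
an option setting such as (?i) in the middle of a pattern, so only the first of
the two spellings is exercised. The variable names are hypothetical.
</P>

```python
import re

# Caseless matching applies only inside the non-capturing group
weekend = re.compile(r"(?i:saturday|sunday)")
partly_caseless = re.compile(r"(?i:sat)urday")

for text in ("Saturday", "SUNDAY", "monday"):
    print(repr(text), weekend.fullmatch(text) is not None)

for text in ("SATurday", "SATURDAY"):
    print(repr(text), partly_caseless.fullmatch(text) is not None)
```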
<P>
<b>Note:</b> There are other PCRE2-specific options, applying to the whole
pattern, which can be set by the application when the compiling function is
called. In addition, the pattern can contain special leading sequences such as
(*CRLF) to override what the application has set or what has been defaulted.
Details are given in the section entitled
<a href="#newlineseq">"Newline sequences"</a>
above. There are also the (*UTF) and (*UCP) leading sequences that can be used
to set UTF and Unicode property modes; they are equivalent to setting the
PCRE2_UTF and PCRE2_UCP options, respectively. However, the application can set
the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, which lock out the use of the
(*UTF) and (*UCP) sequences.
<a name="group"></a></P>
<br><a name="SEC16" href="#TOC1">GROUPS</a><br>
<P>
Groups are delimited by parentheses (round brackets), which can be nested.
Turning part of a pattern into a group does two things:
<br>
<br>
1. It localizes a set of alternatives. For example, the pattern
<pre>
cat(aract|erpillar|)
</pre>
matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
match "cataract", "erpillar" or an empty string.
<br>
<br>
2. It creates a "capture group". This means that, when the whole pattern
matches, the portion of the subject string that matched the group is passed
back to the caller, separately from the portion that matched the whole pattern.
(This applies only to the traditional matching function; the DFA matching
function does not support capturing.)
</P>
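<P>
The effect of the parentheses on the alternatives can be seen by testing both
forms against the same subjects. This is a minimal sketch using Python's "re"
module as a stand-in for PCRE2; the names grouped and ungrouped are
illustrative.
</P>

```python
import re

grouped = re.compile(r"cat(aract|erpillar|)")
ungrouped = re.compile(r"cataract|erpillar|")

subjects = ("cataract", "caterpillar", "cat", "erpillar", "")

for text in subjects:
    print(
        repr(text),
        grouped.fullmatch(text) is not None,
        ungrouped.fullmatch(text) is not None,
    )
```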
<P>
Opening parentheses are counted from left to right (starting from 1) to obtain
numbers for capture groups. For example, if the string "the red king" is
matched against the pattern
<pre>
the ((red|white) (king|queen))
</pre>
the captured substrings are "red king", "red", and "king", and are numbered 1,
2, and 3, respectively.
</P>
<P>
The fact that plain parentheses fulfil two functions is not always helpful.
There are often times when grouping is required without capturing. If an
opening parenthesis is followed by a question mark and a colon, the group
does not do any capturing, and is not counted when computing the number of any
subsequent capture groups. For example, if the string "the white queen"
is matched against the pattern
<pre>
the ((?:red|white) (king|queen))
</pre>
the captured substrings are "white queen" and "queen", and are numbered 1 and
2. The maximum number of capture groups is 65535.
</P>
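<P>
The numbering in the two examples above can be confirmed with a small
experiment. The sketch below uses Python's "re" module, which counts opening
parentheses in the same way for these patterns; it is not PCRE2 itself, and the
variable names are hypothetical.
</P>

```python
import re

# Three capture groups, numbered by the position of the opening parenthesis
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups without capturing, so later groups move down by one
non_capturing = re.fullmatch(
    r"the ((?:red|white) (king|queen))", "the white queen"
)

print(capturing.groups())
print(non_capturing.groups())
```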
<P>
As a convenient shorthand, if any option settings are required at the start of
a non-capturing group, the option letters may appear between the "?" and the
":". Thus the two patterns
<pre>
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
</pre>
match exactly the same set of strings. Because alternative branches are tried
from left to right, and options are not reset until the end of the group is
reached, an option setting in one branch does affect subsequent branches, so
the above patterns match "SUNDAY" as well as "Saturday".
<a name="dupgroupnumber"></a></P>
<br><a name="SEC17" href="#TOC1">DUPLICATE GROUP NUMBERS</a><br>
<P>
Perl 5.10 introduced a feature whereby each alternative in a group uses the
same numbers for its capturing parentheses. Such a group starts with (?| and is
itself a non-capturing group. For example, consider this pattern:
<pre>
(?|(Sat)ur|(Sun))day
</pre>
Because the two alternatives are inside a (?| group, both sets of capturing
parentheses are numbered one. Thus, when the pattern matches, you can look
at captured substring number one, whichever alternative matched. This construct
is useful when you want to capture part, but not all, of one of a number of
alternatives. Inside a (?| group, parentheses are numbered as usual, but the
number is reset at the start of each branch. The numbers of any capturing
parentheses that follow the whole group start after the highest number used in
any branch. The following example is taken from the Perl documentation. The
numbers underneath show in which buffer the captured content will be stored.
<pre>
  # before  ---------------branch-reset----------- after
  / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
  # 1            2         2  3        2     3     4
</pre>
A backreference to a capture group uses the most recent value that is set for
the group. The following pattern matches "abcabc" or "defdef":
<pre>
/(?|(abc)|(def))\1/
</pre>
In contrast, a subroutine call to a capture group always refers to the
first one in the pattern with the given number. The following pattern matches
"abcabc" or "defabc":
<pre>
/(?|(abc)|(def))(?1)/
</pre>
A relative reference such as (?-1) is no different: it is just a convenient way
of computing an absolute group number.
</P>
<P>
If a
<a href="#conditions">condition test</a>
for a group's having matched refers to a non-unique number, the test is
true if any group with that number has matched.
</P>
<P>
An alternative approach to using this "branch reset" feature is to use
duplicate named groups, as described in the next section.
</P>
<br><a name="SEC18" href="#TOC1">NAMED CAPTURE GROUPS</a><br>
<P>
Identifying capture groups by number is simple, but it can be very hard to keep
track of the numbers in complicated patterns. Furthermore, if an expression is
modified, the numbers may change. To help with this difficulty, PCRE2 supports
the naming of capture groups. This feature was not added to Perl until release
5.10. Python had the feature earlier, and PCRE1 introduced it at release 4.0,
using the Python syntax. PCRE2 supports both the Perl and the Python syntax.
</P>
<P>
In PCRE2, a capture group can be named in one of three ways: (?&#60;name&#62;...) or
(?'name'...) as in Perl, or (?P&#60;name&#62;...) as in Python. Names may be up to 128
code units long. When PCRE2_UTF is not set, they may contain only ASCII
alphanumeric characters and underscores, but must start with a non-digit. When
PCRE2_UTF is set, the syntax of group names is extended to allow any Unicode
letter or Unicode decimal digit. In other words, group names must match one of
these patterns:
<pre>
^[_A-Za-z][_A-Za-z0-9]*\z when PCRE2_UTF is not set
^[_\p{L}][_\p{L}\p{Nd}]*\z when PCRE2_UTF is set
</pre>
References to capture groups from other parts of the pattern, such as
<a href="#backreferences">backreferences,</a>
<a href="#recursion">recursion,</a>
and
<a href="#conditions">conditions,</a>
can all be made by name as well as by number.
</P>
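<P>
The naming rule for the case where PCRE2_UTF is not set can be expressed as a
small validation helper. This is a sketch only: the function
is_valid_ascii_group_name and the constant MAX_NAME_LENGTH are hypothetical
names, not part of the PCRE2 API, and the check is written in Python, where
fullmatch() takes the place of the ^ and \z anchors.
</P>

```python
import re

MAX_NAME_LENGTH = 128  # limit quoted in the text, in code units

# A non-digit first, then ASCII letters, digits, or underscores
ASCII_GROUP_NAME = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")


def is_valid_ascii_group_name(name: str) -> bool:
    """Apply the non-UTF group-name rule described above."""
    if len(name) > MAX_NAME_LENGTH:
        return False
    return ASCII_GROUP_NAME.fullmatch(name) is not None


for candidate in ("year", "_tmp1", "1st", "a-b", ""):
    print(repr(candidate), is_valid_ascii_group_name(candidate))
```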
<P>
Named capture groups are allocated numbers as well as names, exactly as
if the names were not present. In both PCRE2 and Perl, capture groups
are primarily identified by numbers; any names are just aliases for these
numbers. The PCRE2 API provides function calls for extracting the complete
name-to-number translation table from a compiled pattern, as well as
convenience functions for extracting captured substrings by name.
</P>
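<P>
The idea that names are aliases for numbers can be illustrated with Python's
"re" module, which uses the (?P&#60;name&#62;...) syntax and exposes a
name-to-number table as the groupindex attribute. This is an analogy, not the
PCRE2 API; the pattern and variable names are made up for the example.
</P>

```python
import re

# Two named groups and one unnamed group; all three receive numbers
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

# Name-to-number translation table for the compiled pattern
name_table = dict(date.groupindex)
print(name_table)

match = date.fullmatch("2025-08-11")

# A name and its number refer to the same captured substring
print(match.group("year"), match.group(1))
print(match.group("month"), match.group(2))
print(match.group(3))
```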
<P>
<b>Warning:</b> When more than one capture group has the same number, as
described in the previous section, a name given to one of them applies to all
of them. Perl allows identically numbered groups to have different names.
Consider this pattern, where there are two capture groups, both numbered 1:
<pre>
(?|(?&#60;AA&#62;aa)|(?&#60;BB&#62;bb))
</pre>
Perl allows this, with both names AA and BB as aliases of group 1. Thus, after
a successful match, both names yield the same value (either "aa" or "bb").
</P>
<P>
In an attempt to reduce confusion, PCRE2 does not allow the same group number
to be associated with more than one name. The example above provokes a
compile-time error. However, there is still scope for confusion. Consider this
pattern:
<pre>
(?|(?&#60;AA&#62;aa)|(bb))
</pre>
Although the second group number 1 is not explicitly named, the name AA is
still an alias for any group 1. Whether the pattern matches "aa" or "bb", a
reference by name to group AA yields the matched string.
</P>
<P>
By default, a name must be unique within a pattern, except that duplicate names
are permitted for groups with the same number, for example:
<pre>
(?|(?&#60;AA&#62;aa)|(?&#60;AA&#62;bb))
</pre>
The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES
option at compile time, or by the use of (?J) within the pattern, as described
in the section entitled
<a href="#internaloptions">"Internal Option Setting"</a>
above.
</P>
<P>
Duplicate names can be useful for patterns where only one instance of the named
capture group can match. Suppose you want to match the name of a weekday,
either as a 3-letter abbreviation or as the full name, and in both cases you
want to extract the abbreviation. This pattern (ignoring the line breaks) does
the job:
<pre>
(?J)
(?&#60;DN&#62;Mon|Fri|Sun)(?:day)?|
(?&#60;DN&#62;Tue)(?:sday)?|
(?&#60;DN&#62;Wed)(?:nesday)?|
(?&#60;DN&#62;Thu)(?:rsday)?|
(?&#60;DN&#62;Sat)(?:urday)?
</pre>
There are five capture groups, but only one is ever set after a match. The
convenience functions for extracting the data by name return the substring for
the first (and in this example, the only) group of that name that matched. This
saves searching to find which numbered group it was. (An alternative way of
solving this problem is to use a "branch reset" group, as described in the
previous section.)
</P>
<P>
If you make a backreference to a non-unique named group from elsewhere in the
pattern, the groups to which the name refers are checked in the order in which
they appear in the overall pattern. The first one that is set is used for the
reference. For example, this pattern matches both "foofoo" and "barbar" but not
"foobar" or "barfoo":
<pre>
(?J)(?:(?&#60;n&#62;foo)|(?&#60;n&#62;bar))\k&#60;n&#62;
</PRE>
</P>
<P>
If you make a subroutine call to a non-unique named group, the one that
corresponds to the first occurrence of the name is used. In the absence of
duplicate numbers this is the one with the lowest number.
</P>
<P>
If you use a named reference in a condition
test (see the
<a href="#conditions">section about conditions</a>
below), either to check whether a capture group has matched, or to check for
recursion, all groups with the same name are tested. If the condition is true
for any one of them, the overall condition is true. This is the same behaviour
as testing by number. For further details of the interfaces for handling named
capture groups, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<br><a name="SEC19" href="#TOC1">REPETITION</a><br>
<P>
Repetition is specified by quantifiers, which may follow any one of these
items:
<pre>
a literal data character
the dot metacharacter
the \C escape sequence
the \R escape sequence
the \X escape sequence
any escape sequence that matches a single character
a character class
a backreference
a parenthesized group (including lookaround assertions)
a subroutine call (recursive or otherwise)
</pre>
If a quantifier does not follow a repeatable item, an error occurs. The
general repetition quantifier specifies a minimum and maximum number of
permitted matches by giving two numbers in curly brackets (braces), separated
by a comma. The numbers must be less than 65536, and the first must be less
than or equal to the second. For example,
<pre>
z{2,4}
</pre>
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
character. If the second number is omitted, but the comma is present, there is
no upper limit; if the second number and the comma are both omitted, the
quantifier specifies an exact number of required matches. Thus
<pre>
[aeiou]{3,}
</pre>
matches at least 3 successive vowels, but may match many more, whereas
<pre>
\d{8}
</pre>
matches exactly 8 digits. If the first number is omitted, the lower limit is
taken as zero; in this case the upper limit must be present.
<pre>
X{,4} is interpreted as X{0,4}
</pre>
This is a change in behaviour that happened in Perl 5.34.0 and PCRE2 10.43. In
earlier versions such a sequence was not interpreted as a quantifier. Other
regular expression engines may behave either way.
</P>
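<P>
The quantifier forms above can be compared by listing how many repetitions
each one accepts. The sketch below uses Python's "re" module, which is one of
the engines that reads X{,4} as X{0,4}; other engines may differ, as noted
above. The helper matching_lengths is a hypothetical name.
</P>

```python
import re


def matching_lengths(pattern, char, max_len=10):
    """Return the repeat counts n (0..max_len) for which char * n matches."""
    return [
        n for n in range(max_len + 1) if re.fullmatch(pattern, char * n)
    ]


print(matching_lengths(r"z{2,4}", "z"))        # bounded range
print(matching_lengths(r"[aeiou]{3,}", "a"))   # no upper limit
print(matching_lengths(r"\d{8}", "7"))         # exact count
print(matching_lengths(r"X{,4}", "X"))         # omitted lower limit
```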
<P>
If the characters that follow an opening brace do not match the syntax of a
quantifier, the brace is taken as a literal character. In particular, this
means that {,} is a literal string of three characters.
</P>
<P>
Note that not every opening brace is potentially the start of a quantifier
because braces are used in other items such as \N{U+345} or \k{name}.
</P>
<P>
In UTF modes, quantifiers apply to characters rather than to individual code
units. Thus, for example, \x{100}{2} matches two characters, each of
which is represented by a two-byte sequence in a UTF-8 string. Similarly,
\X{3} matches three Unicode extended grapheme clusters, each of which may be
several code units long (and they may be of different lengths).
</P>
<P>
The quantifier {0} is permitted, causing the expression to behave as if the
previous item and the quantifier were not present. This may be useful for
capture groups that are referenced as
<a href="#groupsassubroutines">subroutines</a>
from elsewhere in the pattern (but see also the section entitled
<a href="#subdefine">"Defining capture groups for use by reference only"</a>
below). Except for parenthesized groups, items that have a {0} quantifier are
omitted from the compiled pattern.
</P>
<P>
For convenience, the three most common quantifiers have single-character
abbreviations:
<pre>
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
</pre>
It is possible to construct infinite loops by following a group that can match
no characters with a quantifier that has no upper limit, for example:
<pre>
(a?)*
</pre>
Earlier versions of Perl and PCRE1 used to give an error at compile time for
such patterns. However, because there are cases where this can be useful, such
patterns are now accepted, but whenever an iteration of such a group matches no
characters, matching moves on to the next item in the pattern instead of
repeatedly matching an empty string. This does not prevent backtracking into
any of the iterations if a subsequent item fails to match.
</P>
<P>
By default, quantifiers are "greedy", that is, they match as much as possible
(up to the maximum number of permitted repetitions), without causing the rest
of the pattern to fail. The classic example of where this gives problems is in
trying to match comments in C programs. These appear between /* and */ and
within the comment, individual * and / characters may appear. An attempt to
match C comments by applying the pattern
<pre>
/\*.*\*/
</pre>
to the string
<pre>
/* first comment */ not comment /* second comment */
</pre>
fails, because it matches the entire string owing to the greediness of the .*
item. However, if a quantifier is followed by a question mark, it ceases to be
greedy, and instead matches the minimum number of times possible, so the
pattern
<pre>
/\*.*?\*/
</pre>
does the right thing with C comments. The meaning of the various quantifiers is
not otherwise changed, just the preferred number of matches. Do not confuse
this use of question mark with its use as a quantifier in its own right.
Because it has two uses, it can sometimes appear doubled, as in
<pre>
\d??\d
</pre>
which matches one digit by preference, but can match two if that is the only
way the rest of the pattern matches.
</P>
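<P>
The difference between greedy and minimizing quantifiers is easy to observe
with the C comment example. This is a minimal sketch in Python's "re" module,
used here as a stand-in for PCRE2; the variable names are illustrative.
</P>

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ in the subject
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first */ after each /*
lazy = re.findall(r"/\*.*?\*/", subject)

print(greedy)
print(lazy)

# \d??\d prefers one digit, but takes two when the match requires it
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
print(one_digit, two_digits)
```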
<P>
If the PCRE2_UNGREEDY option is set (an option that is not available in Perl),
the quantifiers are not greedy by default, but individual ones can be made
greedy by following them with a question mark. In other words, it inverts the
default behaviour.
</P>
<P>
When a parenthesized group is quantified with a minimum repeat count that
is greater than 1 or with a limited maximum, more memory is required for the
compiled pattern, in proportion to the size of the minimum or maximum.
</P>
<P>
If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option (equivalent
to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
implicitly anchored, because whatever follows will be tried against every
character position in the subject string, so there is no point in retrying the
overall match at any position after the first. PCRE2 normally treats such a
pattern as though it were preceded by \A.
</P>
<P>
In cases where it is known that the subject string contains no newlines, it is
worth setting PCRE2_DOTALL in order to obtain this optimization, or
alternatively, using ^ to indicate anchoring explicitly.
</P>
<P>
However, there are some cases where the optimization cannot be used. When .*
is inside capturing parentheses that are the subject of a backreference
elsewhere in the pattern, a match at the start may fail where a later one
succeeds. Consider, for example:
<pre>
(.*)abc\1
</pre>
If the subject is "xyz123abc123" the match point is the fourth character. For
this reason, such a pattern is not implicitly anchored.
</P>
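<P>
The reason the backreference rules out implicit anchoring can be checked
directly. This sketch again assumes Python's <b>re</b> module as a stand-in
for PCRE2; it only shows where the match is found, not how either engine
optimizes the search.
</P>

```python
import re

subject = "xyz123abc123"
pattern = re.compile(r"(.*)abc\1", re.DOTALL)

# Anchored attempt at offset 0: group 1 would have to be "xyz123", which is
# not repeated after "abc", so the attempt fails.
at_start = pattern.match(subject)

# Unanchored search: succeeds at 0-based offset 3, the fourth character.
found = pattern.search(subject)
match_offset = found.start()
captured = found.group(1)
```

<P>
If the pattern were treated as though it began with \A, only the failing
attempt at offset 0 would ever be made.
</P>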
<P>
Another case where implicit anchoring is not applied is when the leading .* is
inside an atomic group. Once again, a match at the start may fail where a later
one succeeds. Consider this pattern:
<pre>
(?&#62;.*?a)b
</pre>
It matches "ab" in the subject "aab". The use of the backtracking control verbs
(*PRUNE) and (*SKIP) also disables this optimization. To disable it explicitly,
either pass the compile option PCRE2_NO_DOTSTAR_ANCHOR, or call
<b>pcre2_set_optimize()</b> with a PCRE2_DOTSTAR_ANCHOR_OFF directive.
</P>
<P>
When a capture group is repeated, the value captured is the substring that
matched the final iteration. For example, after
<pre>
(tweedle[dume]{3}\s*)+
</pre>
has matched "tweedledum tweedledee" the value of the captured substring is
"tweedledee". However, if there are nested capture groups, the corresponding
captured values may have been set in previous iterations. For example, after
<pre>
(a|(b))+
</pre>
matches "aba" the value of the second captured substring is "b".
<a name="atomicgroup"></a></P>
<br><a name="SEC20" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
<P>
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
repetition, failure of what follows normally causes the repeated item to be
re-evaluated to see if a different number of repeats allows the rest of the
pattern to match. Sometimes it is useful to prevent this, either to change the
nature of the match, or to cause it to fail earlier than it otherwise might, when
the author of the pattern knows there is no point in carrying on.
</P>
<P>
Consider, for example, the pattern \d+foo when applied to the subject line
<pre>
123456bar
</pre>
After matching all 6 digits and then failing to match "foo", the normal
action of the matcher is to try again with only 5 digits matching the \d+
item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
(a term taken from Jeffrey Friedl's book) provides the means for specifying
that once a group has matched, it is not to be re-evaluated in this way.
</P>
<P>
If we use atomic grouping for the previous example, the matcher gives up
immediately on failing to match "foo" the first time. The notation is a kind of
special parenthesis, starting with (?&#62; as in this example:
<pre>
(?&#62;\d+)foo
</pre>
Perl 5.28 introduced an experimental alphabetic form starting with (* which may
be easier to remember:
<pre>
(*atomic:\d+)foo
</pre>
This kind of parenthesized group "locks up" the part of the pattern it contains
once it has matched, and a failure further into the pattern is prevented from
backtracking into it. Backtracking past it to previous items, however, works as
normal.
</P>
<P>
An alternative description is that a group of this type matches exactly the
string of characters that an identical standalone pattern would match, if
anchored at the current point in the subject string.
</P>
<P>
Atomic groups are not capture groups. Simple cases such as the above example
can be thought of as a maximizing repeat that must swallow everything it can.
So, while both \d+ and \d+? are prepared to adjust the number of digits they
match in order to make the rest of the pattern match, (?&#62;\d+) can only match
an entire sequence of digits.
</P>
<P>
Atomic groups in general can of course contain arbitrarily complicated
expressions, and can be nested. However, when the contents of an atomic
group is just a single repeated item, as in the example above, a simpler
notation, called a "possessive quantifier" can be used. This consists of an
additional + character following a quantifier. Using this notation, the
previous example can be rewritten as
<pre>
\d++foo
</pre>
Note that a possessive quantifier can be used with an entire group, for
example:
<pre>
(abc|xyz){2,3}+
</pre>
Possessive quantifiers are always greedy; the setting of the PCRE2_UNGREEDY
option is ignored. They are a convenient notation for the simpler forms of
atomic group. However, there is no difference in the meaning of a possessive
quantifier and the equivalent atomic group, though there may be a performance
difference; possessive quantifiers should be slightly faster.
</P>
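<P>
The effect of refusing to give characters back can be seen by following the
repeat with an item that needs one of the characters it has taken. This is a
sketch under the assumption that Python's <b>re</b> module stands in for
PCRE2; Python accepts the (?&#62; and ++ notations only from version 3.11, so
the sketch skips those two checks on older interpreters.
</P>

```python
import re
import sys

subject = "123456"

# Ordinary greedy repeat: \d+ gives back one digit so the final \d can match.
backtracking = re.fullmatch(r"\d+\d", subject) is not None

# Atomic group and possessive quantifier: once \d+ has taken all six digits
# nothing is given back, so the trailing \d can never match.
SUPPORTED = sys.version_info >= (3, 11)  # syntax added to re in Python 3.11
atomic = (re.fullmatch(r"(?>\d+)\d", subject) is not None) if SUPPORTED else None
possessive = (re.fullmatch(r"\d++\d", subject) is not None) if SUPPORTED else None
```

<P>
Here <b>backtracking</b> is true, while <b>atomic</b> and <b>possessive</b> are
false (or None where the syntax is unavailable).
</P>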
<P>
The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
Jeffrey Friedl originated the idea (and the name) in the first edition of his
book. Mike McCloskey liked it, so implemented it when he built Sun's Java
package, and PCRE1 copied it from there. It found its way into Perl at release
5.10.
</P>
<P>
PCRE2 has an optimization that automatically "possessifies" certain simple
pattern constructs. For example, the sequence A+B is treated as A++B because
there is no point in backtracking into a sequence of A's when B must follow.
This feature can be disabled by the PCRE2_NO_AUTO_POSSESS option, by calling
<b>pcre2_set_optimize()</b> with a PCRE2_AUTO_POSSESS_OFF directive, or by
starting the pattern with (*NO_AUTO_POSSESS).
</P>
<P>
When a pattern contains an unlimited repeat inside a group that can itself be
repeated an unlimited number of times, the use of an atomic group is the only
way to avoid some failing matches taking a very long time indeed. The pattern
<pre>
(\D+|&#60;\d+&#62;)*[!?]
</pre>
matches an unlimited number of substrings that either consist of non-digits, or
digits enclosed in &#60;&#62;, followed by either ! or ?. When it matches, it runs
quickly. However, if it is applied to
<pre>
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
</pre>
it takes a long time before reporting failure. This is because the string can
be divided between the internal \D+ repeat and the external * repeat in a
large number of ways, and all have to be tried. (The example uses [!?] rather
than a single character at the end, because both PCRE2 and Perl have an
optimization that allows for fast failure when a single character is used. They
remember the last single character that is required for a match, and fail early
if it is not present in the string.) If the pattern is changed so that it uses
an atomic group, like this:
<pre>
((?&#62;\D+)|&#60;\d+&#62;)*[!?]
</pre>
sequences of non-digits cannot be broken, and failure happens quickly.
<a name="backreferences"></a></P>
<br><a name="SEC21" href="#TOC1">BACKREFERENCES</a><br>
<P>
Outside a character class, a backslash followed by a digit greater than 0 (and
possibly further digits) is a backreference to a capture group earlier (that
is, to its left) in the pattern, provided there have been that many previous
capture groups.
</P>
<P>
However, if the decimal number following the backslash is less than 8, it is
always taken as a backreference, and causes an error only if there are not that
many capture groups in the entire pattern. In other words, the group that is
referenced need not be to the left of the reference for numbers less than 8. A
"forward backreference" of this type can make sense when a repetition is
involved and the group to the right has participated in an earlier iteration.
</P>
<P>
It is not possible to have a numerical "forward backreference" to a group whose
number is 8 or more using this syntax because a sequence such as \50 is
interpreted as a character defined in octal. See the subsection entitled
"Non-printing characters"
<a href="#digitsafterbackslash">above</a>
for further details of the handling of digits following a backslash. Other
forms of backreferencing do not suffer from this restriction. In particular,
there is no problem when named capture groups are used (see below).
</P>
<P>
Another way of avoiding the ambiguity inherent in the use of digits following a
backslash is to use the \g escape sequence. This escape must be followed by a
signed or unsigned number, optionally enclosed in braces. These examples are
all equivalent:
<pre>
(ring), \1
(ring), \g1
(ring), \g{1}
</pre>
An unsigned number specifies an absolute reference without the ambiguity that
is present in the older syntax. It is also useful when literal digits follow
the reference. A signed number is a relative reference. Consider this example:
<pre>
(abc(def)ghi)\g{-1}
</pre>
The sequence \g{-1} is a reference to the capture group whose number is one
less than the number of the next group to be started, so in this example (where
the next group would be numbered 3) it is equivalent to \2, and \g{-2} would
be equivalent to \1. Note that if this construct is inside a capture group,
that group is included in the count, so in this example \g{-2} also refers to
group 1:
<pre>
(A)(\g{-2}B)
</pre>
The use of relative references can be helpful in long patterns, and also in
patterns that are created by joining together fragments that contain references
within themselves.
</P>
<P>
The sequence \g{+1} is a reference to the next capture group that is started
after this item, and \g{+2} refers to the one after that, and so on. This kind
of forward reference can be useful in patterns that repeat. Perl does not
support the use of + in this way.
</P>
<P>
A backreference matches whatever actually most recently matched the capture
group in the current subject string, rather than anything at all that matches
the group (see
<a href="#groupsassubroutines">"Groups as subroutines"</a>
below for a way of doing that). So the pattern
<pre>
(sens|respons)e and \1ibility
</pre>
matches "sense and sensibility" and "response and responsibility", but not
"sense and responsibility". If caseful matching is in force at the time of the
backreference, the case of letters is relevant. For example,
<pre>
((?i)rah)\s+\1
</pre>
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
capture group is matched caselessly.
</P>
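<P>
Both examples can be checked with a short script. This is a sketch that
assumes Python's <b>re</b> module in place of PCRE2. One difference is worth
noting: Python does not accept (?i) in the middle of a pattern, so the sketch
uses the scoped form (?i:rah), which limits caseless matching to the group in
the same way.
</P>

```python
import re

# The backreference must repeat the text that group 1 actually matched.
pairing = re.compile(r"(sens|respons)e and \1ibility")
titles = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
pairing_results = [pairing.fullmatch(t) is not None for t in titles]

# Group 1 is matched caselessly, but \1 lies outside the scoped option and
# is therefore compared with case taken into account.
cheer = re.compile(r"((?i:rah))\s+\1")
cheers = ["rah rah", "RAH RAH", "RAH rah"]
cheer_results = [cheer.fullmatch(c) is not None for c in cheers]
```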
<P>
There are several different ways of writing backreferences to named capture
groups. The .NET syntax is \k{name}, the Python syntax is (?P=name), and the
original Perl syntax is \k&#60;name&#62; or \k'name'. All of these are now supported
by both Perl and PCRE2. Perl 5.10's unified backreference syntax, in which \g
can be used for both numeric and named references, is also supported by PCRE2.
We could rewrite the above example in any of the following ways:
<pre>
(?&#60;p1&#62;(?i)rah)\s+\k&#60;p1&#62;
(?'p1'(?i)rah)\s+\k{p1}
(?P&#60;p1&#62;(?i)rah)\s+(?P=p1)
(?&#60;p1&#62;(?i)rah)\s+\g{p1}
</pre>
A capture group that is referenced by name may appear in the pattern before or
after the reference.
</P>
<P>
There may be more than one backreference to the same group. If a group has not
actually been used in a particular match, backreferences to it always fail by
default. For example, the pattern
<pre>
(a|(bc))\2
</pre>
always fails if it starts to match "a" rather than "bc". However, if the
PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backreference to an
unset value matches an empty string.
</P>
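<P>
The default treatment of an unset group can be observed with the pattern
above. This sketch assumes Python's <b>re</b> module, whose default agrees
with the PCRE2 default described here; Python has no counterpart to
PCRE2_MATCH_UNSET_BACKREF, so only the default case is shown.
</P>

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 takes part in the match, so \2 has a value to compare against.
with_group_set = pattern.search("bcbc")

# The "a" alternative leaves group 2 unset, so \2 fails, and no other
# starting position rescues the match.
with_group_unset = pattern.search("aa")
```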
<P>
Because there may be many capture groups in a pattern, all digits following a
backslash are taken as part of a potential backreference number. If the pattern
continues with a digit character, some delimiter must be used to terminate the
backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, this
can be white space. Otherwise, the \g{} syntax or an empty comment (see
<a href="#comments">"Comments"</a>
below) can be used.
</P>
<br><b>
Recursive backreferences
</b><br>
<P>
A backreference that occurs inside the group to which it refers fails when the
group is first used, so, for example, (a\1) never matches. However, such
references can be useful inside repeated groups. For example, the pattern
<pre>
(a|b\1)+
</pre>
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
the group, the backreference matches the character string corresponding to the
previous iteration. In order for this to work, the pattern must be such that
the first iteration does not need to match the backreference. This can be done
using alternation, as in the example above, or by a quantifier with a minimum
of zero.
</P>
<P>
For versions of PCRE2 less than 10.25, backreferences of this type used to
cause the group that they reference to be treated as an
<a href="#atomicgroup">atomic group.</a>
This restriction no longer applies, and backtracking into such groups can occur
as normal.
<a name="bigassertions"></a></P>
<br><a name="SEC22" href="#TOC1">ASSERTIONS</a><br>
<P>
An assertion is a test that does not consume any characters. The test must
succeed for the match to continue. The simple assertions coded as \b, \B,
\A, \G, \Z, \z, ^ and $ are described
<a href="#smallassertions">above.</a>
</P>
<P>
More complicated assertions are coded as parenthesized groups. If matching such
a group succeeds, matching continues after it, but with the matching position
in the subject string reset to what it was before the assertion was processed.
</P>
<P>
A special kind of assertion, called a "scan substring" assertion, matches a
subpattern against a previously captured substring. This is described in the
section entitled
<a href="#scansubstringassertions">"Scan substring assertions"</a>
below. It is a PCRE2 extension, not compatible with Perl.
</P>
<P>
The other group-based assertions are of two kinds: those that look ahead of the
current position in the subject string, and those that look behind it, and in
each case an assertion may be positive (must match for the assertion to be
true) or negative (must not match for the assertion to be true).
</P>
<P>
The Perl-compatible lookaround assertions are atomic. If an assertion is true,
but there is a subsequent matching failure, there is no backtracking into the
assertion. However, there are some cases where non-atomic assertions can be
useful. PCRE2 has some support for these, described in the section entitled
<a href="#nonatomicassertions">"Non-atomic assertions"</a>
below, but they are not Perl-compatible.
</P>
<P>
A lookaround assertion may appear as the condition in a
<a href="#conditions">conditional group</a>
(see below). In this case, the result of matching the assertion determines
which branch of the condition is followed.
</P>
<P>
Assertion groups are not capture groups. If an assertion contains capture
groups within it, these are counted for the purposes of numbering the capture
groups in the whole pattern. Within each branch of an assertion, locally
captured substrings may be referenced in the usual way. For example, a sequence
such as (.)\g{-1} can be used to check that two adjacent characters are the
same.
</P>
<P>
When a branch within an assertion fails to match, any substrings that were
captured are discarded (as happens with any pattern branch that fails to
match). A negative assertion is true only when all its branches fail to match;
this means that no captured substrings are ever retained after a successful
negative assertion. When an assertion contains a matching branch, what happens
depends on the type of assertion.
</P>
<P>
For a positive assertion, internally captured substrings in the successful
branch are retained, and matching continues with the next pattern item after
the assertion. For a negative assertion, a matching branch means that the
assertion is not true. If such an assertion is being used as a condition in a
<a href="#conditions">conditional group</a>
(see below), captured substrings are retained, because matching continues with
the "no" branch of the condition. For other failing negative assertions,
control passes to the previous backtracking point, thus discarding any captured
strings within the assertion.
</P>
<P>
Most assertion groups may be repeated; though it makes no sense to assert the
same thing several times, the side effect of capturing in positive assertions
may occasionally be useful. However, an assertion that forms the condition for
a conditional group may not be quantified. PCRE2 used to restrict the
repetition of assertions, but from release 10.35 the only restriction is that
an unlimited maximum repetition is changed to be one more than the minimum. For
example, {3,} is treated as {3,4}.
</P>
<br><b>
Alphabetic assertion names
</b><br>
<P>
Traditionally, symbolic sequences such as (?= and (?&#60;= have been used to
specify lookaround assertions. Perl 5.28 introduced some experimental
alphabetic alternatives which might be easier to remember. They all start with
(* instead of (? and must be written using lower case letters. PCRE2 supports
the following synonyms:
<pre>
(*positive_lookahead: or (*pla: is the same as (?=
(*negative_lookahead: or (*nla: is the same as (?!
(*positive_lookbehind: or (*plb: is the same as (?&#60;=
(*negative_lookbehind: or (*nlb: is the same as (?&#60;!
</pre>
For example, (*pla:foo) is the same assertion as (?=foo). In the following
sections, the various assertions are described using the original symbolic
forms.
</P>
<br><b>
Lookahead assertions
</b><br>
<P>
Lookahead assertions start with (?= for positive assertions and (?! for
negative assertions. For example,
<pre>
\w+(?=;)
</pre>
matches a word followed by a semicolon, but does not include the semicolon in
the match, and
<pre>
foo(?!bar)
</pre>
matches any occurrence of "foo" that is not followed by "bar". Note that the
apparently similar pattern
<pre>
(?!foo)bar
</pre>
does not find an occurrence of "bar" that is preceded by something other than
"foo"; it finds any occurrence of "bar" whatsoever, because the assertion
(?!foo) is always true when the next three characters are "bar". A
lookbehind assertion is needed to achieve the other effect.
</P>
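<P>
The three lookahead patterns above can be compared side by side. This is a
minimal sketch that assumes Python's <b>re</b> module as a stand-in for PCRE2;
the subject strings are invented for illustration.
</P>

```python
import re

# Positive lookahead: the semicolon is required but is not part of the match.
statement = "return value;"
word = re.search(r"\w+(?=;)", statement)

# Negative lookahead after "foo": the first "foo" is rejected because "bar"
# follows it; the second one is accepted.
second_foo = re.search(r"foo(?!bar)", "foobar foobaz")

# Negative lookahead before "bar": the assertion is tested at the position
# where "bar" starts, where the next three characters can never be "foo".
any_bar = re.search(r"(?!foo)bar", "foobar")
```

<P>
The last result shows why a lookbehind is needed to reject "bar" when it is
preceded by "foo".
</P>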
<P>
If you want to force a matching failure at some point in a pattern, the most
convenient way to do it is with (?!) because an empty string always matches, so
an assertion that requires there not to be an empty string must always fail.
The backtracking control verb (*FAIL) or (*F) is a synonym for (?!).
<a name="lookbehind"></a></P>
<br><b>
Lookbehind assertions
</b><br>
<P>
Lookbehind assertions start with (?&#60;= for positive assertions and (?&#60;! for
negative assertions. For example,
<pre>
(?&#60;!foo)bar
</pre>
does find an occurrence of "bar" that is not preceded by "foo". The contents of
a lookbehind assertion are restricted such that there must be a known maximum
to the lengths of all the strings it matches. There are two cases:
</P>
<P>
If every top-level alternative matches a fixed length, for example
<pre>
(?&#60;=colour|color)
</pre>
there is a limit of 65535 characters to the lengths, which do not have to be
the same, as this example demonstrates. This is the only kind of lookbehind
supported by PCRE2 versions earlier than 10.43 and by the alternative matching
function <b>pcre2_dfa_match()</b>.
</P>
<P>
In PCRE2 10.43 and later, <b>pcre2_match()</b> supports lookbehind assertions in
which one or more top-level alternatives can match more than one string length,
for example
<pre>
(?&#60;=colou?r)
</pre>
The maximum matching length for any branch of the lookbehind is limited to a
value set by the calling program (default 255 characters). Unlimited repetition
(for example \d*) is not supported. In some cases, the escape sequence \K
<a href="#resetmatchstart">(see above)</a>
can be used instead of a lookbehind assertion at the start of a pattern to get
round the length limit restriction.
</P>
<P>
In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which matches a
single code unit even in a UTF mode) to appear in lookbehind assertions,
because it makes it impossible to calculate the length of the lookbehind. The
\X and \R escapes, which can match different numbers of code units, are never
permitted in lookbehinds.
</P>
<P>
<a href="#groupsassubroutines">"Subroutine"</a>
calls (see below) such as (?2) or (?&#38;X) are permitted in lookbehinds, as long
as the called capture group matches a limited-length string. However,
<a href="#recursion">recursion,</a>
that is, a "subroutine" call into a group that is already active,
is not supported.
</P>
<P>
PCRE2 supports backreferences in lookbehinds, but only if certain conditions
are met. The PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no
use of (?| in the pattern (it creates duplicate group numbers), and if the
backreference is by name, the name must be unique. Of course, the referenced
group must itself match a limited length substring. The following pattern
matches words containing at least two characters that begin and end with the
same character:
<pre>
\b(\w)\w++(?&#60;=\1)
</PRE>
</P>
<P>
Possessive quantifiers can be used in conjunction with lookbehind assertions to
specify efficient matching at the end of subject strings. Consider a simple
pattern such as
<pre>
abcd$
</pre>
when applied to a long string that does not match. Because matching proceeds
from left to right, PCRE2 will look for each "a" in the subject and then see if
what follows matches the rest of the pattern. If the pattern is specified as
<pre>
^.*abcd$
</pre>
the initial .* matches the entire string at first, but when this fails (because
there is no following "a"), it backtracks to match all but the last character,
then all but the last two characters, and so on. Once again the search for "a"
covers the entire string, from right to left, so we are no better off. However,
if the pattern is written as
<pre>
^.*+(?&#60;=abcd)
</pre>
there can be no backtracking for the .*+ item because of the possessive
quantifier; it can match only the entire string. The subsequent lookbehind
assertion does a single test on the last four characters. If it fails, the
match fails immediately. For long strings, this approach makes a significant
difference to the processing time.
</P>
<br><b>
Using multiple assertions
</b><br>
<P>
Several assertions (of any sort) may occur in succession. For example,
<pre>
(?&#60;=\d{3})(?&#60;!999)foo
</pre>
matches "foo" preceded by three digits that are not "999". Notice that each of
the assertions is applied independently at the same point in the subject
string. First there is a check that the previous three characters are all
digits, and then there is a check that the same three characters are not "999".
This pattern does <i>not</i> match "foo" preceded by six characters, the first
three of which are digits and the last three of which are not "999". For example, it
doesn't match "123abcfoo". A pattern to do that is
<pre>
(?&#60;=\d{3}...)(?&#60;!999)foo
</pre>
This time the first assertion looks at the preceding six characters, checking
that the first three are digits, and then the second assertion checks that the
preceding three characters are not "999".
</P>
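<P>
The difference between the two patterns can be confirmed on a few subjects.
This sketch assumes Python's <b>re</b> module in place of PCRE2; both
lookbehinds here have a fixed length, which is the only kind Python accepts.
</P>

```python
import re

# Both assertions inspect the same three characters just before "foo".
adjacent = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters (three digits, then any
# three); the second still inspects only the three nearest to "foo".
spaced = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]
adjacent_hits = [s for s in subjects if adjacent.search(s)]
spaced_hits = [s for s in subjects if spaced.search(s)]
```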
<P>
Assertions can be nested in any combination. For example,
<pre>
(?&#60;=(?&#60;!foo)bar)baz
</pre>
matches an occurrence of "baz" that is preceded by "bar" which in turn is not
preceded by "foo", while
<pre>
(?&#60;=\d{3}(?!999)...)foo
</pre>
is another pattern that matches "foo" preceded by three digits and any three
characters that are not "999".
<a name="nonatomicassertions"></a></P>
<br><a name="SEC23" href="#TOC1">NON-ATOMIC ASSERTIONS</a><br>
<P>
Traditional lookaround assertions are atomic. That is, if an assertion is true,
but there is a subsequent matching failure, there is no backtracking into the
assertion. However, there are some cases where non-atomic positive assertions
can be useful. PCRE2 provides these using the following syntax:
<pre>
(*non_atomic_positive_lookahead: or (*napla: or (?*
(*non_atomic_positive_lookbehind: or (*naplb: or (?&#60;*
</pre>
Consider the problem of finding the right-most word in a string that also
appears earlier in the string, that is, it must appear at least twice in total.
This pattern returns the required result as captured substring 1:
<pre>
^(?x)(*napla: .* \b(\w++)) (?&#62; .*? \b\1\b ){2}
</pre>
For a subject such as "word1 word2 word3 word2 word3 word4" the result is
"word3". How does it work? At the start, ^(?x) anchors the pattern and sets the
"x" option, which causes white space (introduced for readability) to be
ignored. Inside the assertion, the greedy .* at first consumes the entire
string, but then has to backtrack until the rest of the assertion can match a
word, which is captured by group 1. In other words, when the assertion first
succeeds, it captures the right-most word in the string.
</P>
<P>
The current matching point is then reset to the start of the subject, and the
rest of the pattern match checks for two occurrences of the captured word,
using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
if the last word in the string does not occur twice, this part of the pattern
fails. If a traditional atomic lookahead (?= or (*pla: had been used, the
assertion could not be re-entered, and the whole match would fail. The pattern
would succeed only if the very last word in the subject was found twice.
</P>
<P>
Using a non-atomic lookahead, however, means that when the last word does not
occur twice in the string, the lookahead can backtrack and find the second-last
word, and so on, until either the match succeeds, or all words have been
tested.
</P>
<P>
Two conditions must be met for a non-atomic assertion to be useful: the
contents of one or more capturing groups must change after a backtrack into the
assertion, and there must be a backreference to a changed group later in the
pattern. If this is not the case, the rest of the pattern match fails exactly
as before because nothing has changed, so using a non-atomic assertion just
wastes resources.
</P>
<P>
There is one exception to backtracking into a non-atomic assertion. If an
(*ACCEPT) control verb is triggered, the assertion succeeds atomically. That
is, a subsequent match failure cannot backtrack into the assertion.
</P>
<P>
Non-atomic assertions are not supported by the alternative matching function
<b>pcre2_dfa_match()</b>. They are supported by JIT, but only if they do not
contain any control verbs such as (*ACCEPT). (This may change in future). Note
that assertions that appear as conditions for
<a href="#conditions">conditional groups</a>
(see below) must be atomic.
<a name="scansubstringassertions"></a></P>
<br><a name="SEC24" href="#TOC1">SCAN SUBSTRING ASSERTIONS</a><br>
<P>
A special kind of assertion, not compatible with Perl, makes it possible to
check the contents of a captured substring by matching it with a subpattern.
Because this involves capturing, this feature is not supported by
<b>pcre2_dfa_match()</b>.
</P>
<P>
A scan substring assertion starts with the sequence (*scan_substring: or
(*scs: which is followed by a list of substring numbers (absolute or relative)
and/or substring names enclosed in single quotes or angle brackets, all within
parentheses. The rest of the item is the subpattern that is applied to the
substring, as shown in these examples:
<pre>
(*scan_substring:(1)...)
(*scs:(-2)...)
(*scs:('AB')...)
(*scs:(1,'AB',-2)...)
</pre>
The list of groups is checked in the order they are given, and it is the
contents of the first one that is found to be set that are scanned. When
PCRE2_DUPNAMES is set and there are ambiguous group names, all groups with the
same name are checked in numerical order. A scan substring assertion fails if
none of the groups it references have been set.
</P>
<P>
The pattern match on the substring is always anchored, that is, it must match
from the start of the substring. There is no "bumpalong" if it does not match
at the start. The end of the subject is temporarily reset to be the end of the
substring, so \Z, \z, and $ will match there. However, the start of the
subject is <i>not</i> reset. This means that ^ matches only if the substring is
actually at the start of the main subject, but it also means that lookbehind
assertions into what precedes the substring are possible.
</P>
<P>
Here is a very simple example: find a word that contains the rare (in English)
sequence of letters "rh" not at the start:
<pre>
\b(\w++)(*scs:(1).+rh)
</pre>
The first group captures a word which is then scanned by the second group.
This example does not actually need this heavyweight feature; the same match
can be achieved with:
<pre>
\b\w+?rh\w*\b
</pre>
When things are more complicated, however, scanning a captured substring can be
a useful way to describe the required match. For example, there is a rather
complicated pattern in the PCRE2 test data that checks an entire subject string
for a palindrome, that is, the sequence of letters is the same in both
directions. Suppose you want to search for individual words of two or more
characters such as "level" that are palindromes:
<pre>
(\b\w{2,}+\b)(*scs:(1)...palindrome-matching-pattern...)
</pre>
Within a substring scanning subpattern, references to other groups work as
normal. Capturing groups may appear, and will retain their values during
ongoing matching if the assertion succeeds.
</P>
<br><a name="SEC25" href="#TOC1">SCRIPT RUNS</a><br>
<P>
In concept, a script run is a sequence of characters that are all from the same
Unicode script such as Latin or Greek. However, because some scripts are
commonly used together, and because some diacritical and other marks are used
with multiple scripts, it is not that simple. There is a full description of
the rules that PCRE2 uses in the section entitled
<a href="pcre2unicode.html#scriptruns">"Script Runs"</a>
in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
documentation.
</P>
<P>
If part of a pattern is enclosed between (*script_run: or (*sr: and a closing
parenthesis, it fails if the sequence of characters that it matches is not a
script run. After a failure, normal backtracking occurs. Script runs can be
used to detect spoofing attacks using characters that look the same, but are
from different scripts. The string "paypal.com" is an infamous example, where
the letters could be a mixture of Latin and Cyrillic. This pattern ensures that
the matched characters in a sequence of non-spaces that follow white space are
a script run:
<pre>
\s+(*sr:\S+)
</pre>
To be sure that they are all from the Latin script (for example), a lookahead
can be used:
<pre>
\s+(?=\p{Latin})(*sr:\S+)
</pre>
This works as long as the first character is expected to be a character in that
script, and not (for example) punctuation, which is allowed with any script. If
this is not the case, a more creative lookahead is needed. For example, if
digits, underscore, and dots are permitted at the start:
<pre>
\s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
</PRE>
</P>
<P>
In many cases, backtracking into a script run pattern fragment is not
desirable. The script run can employ an atomic group to prevent this. Because
this is a common requirement, a shorthand notation is provided by
(*atomic_script_run: or (*asr:
<pre>
(*asr:...) is the same as (*sr:(?&#62;...))
</pre>
Note that the atomic group is inside the script run. Putting it outside would
not prevent backtracking into the script run pattern.
</P>
<P>
Support for script runs is not available if PCRE2 is compiled without Unicode
support. A compile-time error is given if any of the above constructs is
encountered. Script runs are not supported by the alternative matching function,
<b>pcre2_dfa_match()</b> because they use the same mechanism as capturing
parentheses.
</P>
<P>
<b>Warning:</b> The (*ACCEPT) control verb
<a href="#acceptverb">(see below)</a>
should not be used within a script run group, because it causes an immediate
exit from the group, bypassing the script run checking.
<a name="conditions"></a></P>
<br><a name="SEC26" href="#TOC1">CONDITIONAL GROUPS</a><br>
<P>
It is possible to cause the matching process to obey a pattern fragment
conditionally or to choose between two alternative fragments, depending on
the result of an assertion, or whether a specific capture group has
already been matched. The two possible forms of conditional group are:
<pre>
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
</pre>
If the condition is satisfied, the yes-pattern is used; otherwise the
no-pattern (if present) is used. An absent no-pattern is equivalent to an empty
string (it always matches). If there are more than two alternatives in the
group, a compile-time error occurs. Each of the two alternatives may itself
contain nested groups of any form, including conditional groups; the
restriction to two alternatives applies only at the level of the condition
itself. This pattern fragment is an example where the alternatives are complex:
<pre>
(?(1) (A|B|C) | (D | (?(2)E|F) | E) )
</PRE>
</P>
<P>
There are five kinds of condition: references to capture groups, references to
recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
</P>
<br><b>
Checking for a used capture group by number
</b><br>
<P>
If the text between the parentheses consists of a sequence of digits, the
condition is true if a capture group of that number has previously matched. If
there is more than one capture group with the same number (see the earlier
<a href="#dupgroupnumber">section about duplicate group numbers),</a>
the condition is true if any of them have matched. An alternative notation,
which is a PCRE2 extension, not supported by Perl, is to precede the digits
with a plus or minus sign. In this case, the group number is relative rather
than absolute. The most recently opened capture group (which could be enclosing
this condition) can be referenced by (?(-1), the next most recent by (?(-2),
and so on. Inside loops it can also make sense to refer to subsequent groups.
The next capture group to be opened can be referenced as (?(+1), and so on. The
value zero in any of these forms is not used; it provokes a compile-time error.
</P>
<P>
Consider the following pattern, which contains non-significant white space to
make it more readable (assume the PCRE2_EXTENDED option) and to divide it into
three parts for ease of discussion:
<pre>
( \( )? [^()]+ (?(1) \) )
</pre>
The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The second part
matches one or more characters that are not parentheses. The third part is a
conditional group that tests whether or not the first capture group
matched. If it did, that is, if the subject started with an opening parenthesis,
the condition is true, and so the yes-pattern is executed and a closing
parenthesis is required. Otherwise, since no-pattern is not present, the
conditional group matches nothing. In other words, this pattern matches a
sequence of non-parentheses, optionally enclosed in parentheses.
</P>
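<P>
As a minimal sketch, the pattern above can be tried with Python's standard re
module rather than PCRE2 itself; re happens to accept the same syntax for a
condition that tests a numbered group, and its VERBOSE flag plays the part of
PCRE2_EXTENDED here. The names PAREN_OPT and is_optionally_parenthesized are
hypothetical, chosen only for this illustration.
</P>

```python
import re

# Python's re module (not PCRE2) is assumed here; it accepts (?(1)...) too.
# re.VERBOSE ignores the white space, much as PCRE2_EXTENDED does.
PAREN_OPT = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def is_optionally_parenthesized(text):
    """True if text is a run of non-parentheses, optionally wrapped in ( )."""
    return PAREN_OPT.fullmatch(text) is not None
```

<P>
With this sketch, "abc" and "(abc)" are accepted, whereas "(abc" and "abc)" are
rejected because the closing parenthesis is required exactly when the opening
one was captured.
</P>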
<P>
If you were embedding this pattern in a larger one, you could use a relative
reference:
<pre>
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
</pre>
This makes the fragment independent of the parentheses in the larger pattern.
</P>
<br><b>
Checking for a used capture group by name
</b><br>
<P>
Perl uses the syntax (?(&#60;name&#62;)...) or (?('name')...) to test for a used
capture group by name. For compatibility with earlier versions of PCRE1, which
had this facility before Perl, the syntax (?(name)...) is also recognized.
Note, however, that undelimited names consisting of the letter R followed by
digits are ambiguous (see the following section). Rewriting the above example
to use a named group gives this:
<pre>
(?&#60;OPEN&#62; \( )? [^()]+ (?(&#60;OPEN&#62;) \) )
</pre>
If the name used in a condition of this kind is a duplicate, the test is
applied to all groups of the same name, and is true if any one of them has
matched.
</P>
<br><b>
Checking for pattern recursion
</b><br>
<P>
"Recursion" in this sense refers to any subroutine-like call from one part of
the pattern to another, whether or not it is actually recursive. See the
sections entitled
<a href="#recursion">"Recursive patterns"</a>
and
<a href="#groupsassubroutines">"Groups as subroutines"</a>
below for details of recursion and subroutine calls.
</P>
<P>
If a condition is the string (R), and there is no capture group with the name
R, the condition is true if matching is currently in a recursion or subroutine
call to the whole pattern or any capture group. If digits follow the letter R,
and there is no group with that name, the condition is true if the most recent
call is into a group with the given number, which must exist somewhere in the
overall pattern. This is a contrived example that is equivalent to a+b:
<pre>
((?(R1)a+|(?1)b))
</pre>
However, in both cases, if there is a capture group with a matching name, the
condition tests for its being set, as described in the section above, instead
of testing for recursion. For example, creating a group with the name R1 by
adding (?&#60;R1&#62;) to the above pattern completely changes its meaning.
</P>
<P>
If a name preceded by ampersand follows the letter R, for example:
<pre>
(?(R&name)...)
</pre>
the condition is true if the most recent recursion is into a group of that name
(which must exist within the pattern).
</P>
<P>
This condition does not check the entire recursion stack. It tests only the
current level. If the name used in a condition of this kind is a duplicate, the
test is applied to all groups of the same name, and is true if any one of
them is the most recent recursion.
</P>
<P>
At "top level", all these recursion test conditions are false.
<a name="subdefine"></a></P>
<br><b>
Defining capture groups for use by reference only
</b><br>
<P>
If the condition is the string (DEFINE), the condition is always false, even if
there is a group with the name DEFINE. In this case, there may be only one
alternative in the rest of the conditional group. It is always skipped if
control reaches this point in the pattern; the idea of DEFINE is that it can be
used to define subroutines that can be referenced from elsewhere. (The use of
<a href="#groupsassubroutines">subroutines</a>
is described below.) For example, a pattern to match an IPv4 address such as
"192.168.23.245" could be written like this (ignore white space and line
breaks):
<pre>
(?(DEFINE) (?&#60;byte&#62; 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b
</pre>
The first part of the pattern is a DEFINE group inside which another group
named "byte" is defined. This matches an individual component of an IPv4
address (a number less than 256). When matching takes place, this part of the
pattern is skipped because DEFINE acts like a false condition. The rest of the
pattern uses references to the named group to match the four dot-separated
components of an IPv4 address, insisting on a word boundary at each end.
</P>
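<P>
Engines without DEFINE can approximate the same reuse by building the pattern
from a shared string. The following sketch uses Python's re module, not PCRE2,
and swaps the named subroutine for plain string interpolation; BYTE, IPV4 and
is_ipv4 are hypothetical names.
</P>

```python
import re

# One octet, written once and reused by interpolation. This stands in for
# the (?(DEFINE)...) group and its (?&byte) calls; it is not PCRE2 syntax.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def is_ipv4(text):
    """True if the whole of text is four dot-separated numbers below 256."""
    return IPV4.fullmatch(text) is not None
```

<P>
The design point is the same as for DEFINE: the definition of a component is
stated once, so a correction to it applies to all four positions.
</P>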
<br><b>
Checking the PCRE2 version
</b><br>
<P>
Programs that link with a PCRE2 library can check the version by calling
<b>pcre2_config()</b> with appropriate arguments. Users of applications that do
not have access to the underlying code cannot do this. A special "condition"
called VERSION exists to allow such users to discover which version of PCRE2
they are dealing with by using this condition to match a string such as
"yesno". VERSION must be followed either by "=" or "&#62;=" and a version number.
For example:
<pre>
(?(VERSION&#62;=10.4)yes|no)
</pre>
This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
"no" otherwise. The fractional part of the version number may not contain more
than two digits.
</P>
<br><b>
Assertion conditions
</b><br>
<P>
If the condition is not in any of the above formats, it must be a parenthesized
assertion. This may be a positive or negative lookahead or lookbehind
assertion. However, it must be a traditional atomic assertion, not one of the
<a href="#nonatomicassertions">non-atomic assertions.</a>
</P>
<P>
Consider this pattern, again containing non-significant white space, and with
the two alternatives on the second line:
<pre>
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
</pre>
The condition is a positive lookahead assertion that matches an optional
sequence of non-letters followed by a letter. In other words, it tests for the
presence of at least one letter in the subject. If a letter is found, the
subject is matched against the first alternative; otherwise it is matched
against the second. This pattern matches strings in one of the two forms
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
</P>
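<P>
An assertion condition can be rewritten for engines that lack it, because
(?(?=X)A|B) behaves like a choice between (?=X)A and (?!X)B. The sketch below
applies that rewriting to the pattern above using Python's re module, which
does not support assertion conditions; DATE and is_date_like are hypothetical
names.
</P>

```python
import re

# Rewritten form of the conditional group: each alternative carries its own
# lookahead, positive for the first and negative for the second.
DATE = re.compile(
    r"""(?: (?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
          | (?![^a-z]*[a-z]) \d{2}-\d{2}-\d{2} )    # no letter lies ahead
    """,
    re.VERBOSE,
)


def is_date_like(text):
    """True for the forms dd-aaa-dd and dd-dd-dd described above."""
    return DATE.fullmatch(text) is not None
```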
<P>
When an assertion that is a condition contains capture groups, any
capturing that occurs in a matching branch is retained afterwards, for both
positive and negative assertions, because matching always continues after the
assertion, whether it succeeds or fails. (Compare non-conditional assertions,
for which captures are retained only for positive assertions that succeed.)
<a name="comments"></a></P>
<br><a name="SEC27" href="#TOC1">COMMENTS</a><br>
<P>
There are two ways of including comments in patterns that are processed by
PCRE2. In both cases, the start of the comment must not be in a character
class, nor in the middle of any other sequence of related characters such as
(?: or a group name or number or a Unicode property name. The characters that
make up a comment play no part in the pattern matching.
</P>
<P>
The sequence (?# marks the start of a comment that continues up to the next
closing parenthesis. Nested parentheses are not permitted. If the
PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
also introduces a comment, which in this case continues to immediately after
the next newline character or character sequence in the pattern. Which
characters are interpreted as newlines is controlled by an option passed to the
compiling function or by a special sequence at the start of the pattern, as
described in the section entitled
<a href="#newlines">"Newline conventions"</a>
above. Note that the end of this type of comment is a literal newline sequence
in the pattern; escape sequences that happen to represent a newline do not
count. For example, consider this pattern when PCRE2_EXTENDED is set, and the
default newline convention (a single linefeed character) is in force:
<pre>
abc #comment \n still comment
</pre>
On encountering the # character, <b>pcre2_compile()</b> skips along, looking for
a newline in the pattern. The sequence \n is still literal at this stage, so
it does not terminate the comment. Only an actual character with the code value
0x0a (the default newline) does so.
<a name="recursion"></a></P>
<br><a name="SEC28" href="#TOC1">RECURSIVE PATTERNS</a><br>
<P>
Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best that can
be done is to use a pattern that matches up to some fixed depth of nesting. It
is not possible to handle an arbitrary nesting depth.
</P>
<P>
For some time, Perl has provided a facility that allows regular expressions to
recurse (amongst other things). It does this by interpolating Perl code in the
expression at run time, and the code can refer to the expression itself. A Perl
pattern using code interpolation to solve the parentheses problem can be
created like this:
<pre>
$re = qr{\( (?: (?&#62;[^()]+) | (?p{$re}) )* \)}x;
</pre>
The (?p{...}) item interpolates Perl code at run time, and in this case refers
recursively to the pattern in which it appears.
</P>
<P>
Obviously, PCRE2 cannot support the interpolation of Perl code. Instead, it
supports special syntax for recursion of the entire pattern, and also for
individual capture group recursion. After its introduction in PCRE1 and Python,
this kind of recursion was subsequently introduced into Perl at release 5.10.
</P>
<P>
A special item that consists of (? followed by a number greater than zero and a
closing parenthesis is a recursive subroutine call of the capture group of the
given number, provided that it occurs inside that group. (If not, it is a
<a href="#groupsassubroutines">non-recursive subroutine</a>
call, which is described in the next section.) The special item (?R) or (?0) is
a recursive call of the entire regular expression.
</P>
<P>
This PCRE2 pattern solves the nested parentheses problem (assume the
PCRE2_EXTENDED option is set so that white space is ignored):
<pre>
\( ( [^()]++ | (?R) )* \)
</pre>
First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a recursive
match of the pattern itself (that is, a correctly parenthesized substring).
Finally there is a closing parenthesis. Note the use of a possessive quantifier
to avoid backtracking into sequences of non-parentheses.
</P>
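<P>
What the recursive pattern computes can be sketched as an ordinary recursive
function. This is only an illustration in Python of the structure of the
pattern, not a description of how PCRE2 implements recursion; match_balanced
is a hypothetical name.
</P>

```python
def match_balanced(subject, start=0):
    """Return the offset just past a balanced group that begins at start.

    Returns -1 if no balanced group begins there.
    """
    end = len(subject)
    if start == end or subject[start] != "(":
        return -1                      # the leading \( did not match
    pos = start + 1
    while pos != end and subject[pos] != ")":
        if subject[pos] == "(":
            pos = match_balanced(subject, pos)   # plays the part of (?R)
            if pos == -1:
                return -1
        else:
            pos += 1                   # plays the part of [^()]++
    # Either the closing parenthesis was found or the subject ran out.
    return pos + 1 if pos != end else -1
```

<P>
Each nested opening parenthesis causes one nested call, which is why no fixed
depth limit is built into the pattern itself.
</P>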
<P>
If this were part of a larger pattern, you would not want to recurse the entire
pattern, so instead you could use this:
<pre>
( \( ( [^()]++ | (?1) )* \) )
</pre>
We have put the pattern into parentheses, and caused the recursion to refer to
them instead of the whole pattern.
</P>
<P>
In a larger pattern, keeping track of parenthesis numbers can be tricky. This
is made easier by the use of relative references. Instead of (?1) in the
pattern above you can write (?-2) to refer to the second most recently opened
parentheses preceding the recursion. In other words, a negative number counts
capturing parentheses leftwards from the point at which it is encountered.
</P>
<P>
Be aware, however, that if
<a href="#dupgroupnumber">duplicate capture group numbers</a>
are in use, relative references refer to the earliest group with the
appropriate number. Consider, for example:
<pre>
(?|(a)|(b)) (c) (?-2)
</pre>
The first two capture groups (a) and (b) are both numbered 1, and group (c)
is number 2. When the reference (?-2) is encountered, the second most recently
opened parentheses has the number 1, but it is the first such group (the (a)
group) to which the recursion refers. This would be the same if an absolute
reference (?1) was used. In other words, relative references are just a
shorthand for computing a group number.
</P>
<P>
It is also possible to refer to subsequent capture groups, by writing
references such as (?+2). However, these cannot be recursive because the
reference is not inside the parentheses that are referenced. They are always
<a href="#groupsassubroutines">non-recursive subroutine</a>
calls, as described in the next section.
</P>
<P>
An alternative approach is to use named parentheses. The Perl syntax for this
is (?&name); PCRE1's earlier syntax (?P&#62;name) is also supported. We could
rewrite the above example as follows:
<pre>
(?&#60;pn&#62; \( ( [^()]++ | (?&pn) )* \) )
</pre>
If there is more than one group with the same name, the earliest one is
used.
</P>
<P>
The example pattern that we have been looking at contains nested unlimited
repeats, and so the use of a possessive quantifier for matching strings of
non-parentheses is important when applying the pattern to strings that do not
match. For example, when this pattern is applied to
<pre>
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
</pre>
it yields "no match" quickly. However, if a possessive quantifier is not used,
the match runs for a very long time indeed because there are so many different
ways the + and * repeats can carve up the subject, and all have to be tested
before failure can be reported.
</P>
<P>
At the end of a match, the values of capturing parentheses are those from
the outermost level. If you want to obtain intermediate values, a callout
function can be used (see below and the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation). If the pattern above is matched against
<pre>
(ab(cd)ef)
</pre>
the value for the inner capturing parentheses (numbered 2) is "ef", which is
the last value taken on at the top level. If a capture group is not matched at
the top level, its final captured value is unset, even if it was (temporarily)
set at a deeper level during the matching process.
</P>
<P>
Do not confuse the (?R) item with the condition (R), which tests for recursion.
Consider this pattern, which matches text in angle brackets, allowing for
arbitrary nesting. Only digits are allowed in nested brackets (that is, when
recursing), whereas any characters are permitted at the outer level.
<pre>
&#60; (?: (?(R) \d++ | [^&#60;&#62;]*+) | (?R)) * &#62;
</pre>
In this pattern, (?(R) is the start of a conditional group, with two different
alternatives for the recursive and non-recursive cases. The (?R) item is the
actual recursive call.
<a name="recursiondifference"></a></P>
<br><b>
Differences in recursion processing between PCRE2 and Perl
</b><br>
<P>
Some former differences between PCRE2 and Perl no longer exist.
</P>
<P>
Before release 10.30, recursion processing in PCRE2 differed from Perl in that
a recursive subroutine call was always treated as an atomic group. That is,
once it had matched some of the subject string, it was never re-entered, even
if it contained untried alternatives and there was a subsequent matching
failure. (Historical note: PCRE implemented recursion before Perl did.)
</P>
<P>
Starting with release 10.30, recursive subroutine calls are no longer treated
as atomic. That is, they can be re-entered to try unused alternatives if there
is a matching failure later in the pattern. This is now compatible with the way
Perl works. If you want a subroutine call to be atomic, you must explicitly
enclose it in an atomic group.
</P>
<P>
Supporting backtracking into recursions simplifies certain types of recursive
pattern. For example, this pattern matches palindromic strings:
<pre>
^((.)(?1)\2|.?)$
</pre>
The second branch in the group matches a single central character in the
palindrome when there are an odd number of characters, or nothing when there
are an even number of characters, but in order to work it has to be able to try
the second case when the rest of the pattern match fails. If you want to match
typical palindromic phrases, the pattern has to ignore all non-word characters,
which can be done like this:
<pre>
^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
</pre>
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
avoid backtracking into sequences of non-word characters. Without this, PCRE2
takes a great deal longer (ten times or more) to match typical phrases, and
Perl takes so long that you think it has gone into a loop.
</P>
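<P>
For comparison, roughly the same property can be checked without a regular
expression engine that supports recursion. The following Python sketch is an
approximation of the caseless phrase pattern above, not an exact equivalent:
it removes non-word characters, folds case, and compares the result with its
reverse. The name is_palindromic_phrase is hypothetical.
</P>

```python
import re


def is_palindromic_phrase(text):
    """True if text reads the same both ways, ignoring case and non-word characters."""
    letters = re.sub(r"\W", "", text).lower()   # drop what \W*+ would skip
    return letters == letters[::-1]
```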
<P>
Another way in which PCRE2 and Perl used to differ in their recursion
processing is in the handling of captured values. Formerly in Perl, when a
group was called recursively or as a subroutine (see the next section), it
had no access to any values that were captured outside the recursion, whereas
in PCRE2 these values can be referenced. Consider this pattern:
<pre>
^(.)(\1|a(?2))
</pre>
This pattern matches "bab". The first capturing parentheses match "b", then in
the second group, when the backreference \1 fails to match "b", the second
alternative matches "a" and then recurses. In the recursion, \1 does now match
"b" and so the whole match succeeds. This match used to fail in Perl, but in
later versions (5.024 was tested) it now works.
<a name="groupsassubroutines"></a></P>
<br><a name="SEC29" href="#TOC1">GROUPS AS SUBROUTINES</a><br>
<P>
If the syntax for a recursive group call (either by number or by name) is used
outside the parentheses to which it refers, it operates a bit like a subroutine
in a programming language. More accurately, PCRE2 treats the referenced group
as an independent subpattern which it tries to match at the current matching
position. The called group may be defined before or after the reference. A
numbered reference can be absolute or relative, as in these examples:
<pre>
(...(absolute)...)...(?2)...
(...(relative)...)...(?-1)...
(...(?+1)...(relative)...
</pre>
An earlier example pointed out that the pattern
<pre>
(sens|respons)e and \1ibility
</pre>
matches "sense and sensibility" and "response and responsibility", but not
"sense and responsibility". If instead the pattern
<pre>
(sens|respons)e and (?1)ibility
</pre>
is used, it does match "sense and responsibility" as well as the other two
strings. Another example is given in the discussion of DEFINE above.
</P>
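<P>
The difference between the two patterns can be sketched with Python's re
module, which has backreferences but no subroutine calls. In this sketch the
call (?1) is imitated by repeating the text of the group, which is an
assumption made for illustration and ignores the handling of captured values;
BACKREF and SUBCALL are hypothetical names.
</P>

```python
import re

# A backreference must repeat the text that group 1 actually matched.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# Repeating the group's pattern stands in for the subroutine call (?1):
# the second occurrence is free to choose a different alternative.
SUBCALL = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")
```

<P>
Only the second pattern accepts "sense and responsibility", because it
re-applies the pattern of the group instead of the string the group matched.
</P>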
<P>
Like recursions, subroutine calls used to be treated as atomic, but this
changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
occur. However, any capturing parentheses that are set during the subroutine
call revert to their previous values afterwards.
</P>
<P>
Processing options such as case-independence are fixed when a group is
defined, so if it is used as a subroutine, such options cannot be changed for
different calls. For example, consider this pattern:
<pre>
(abc)(?i:(?-1))
</pre>
It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called group.
</P>
<P>
The behaviour of
<a href="#backtrackcontrol">backtracking control verbs</a>
in groups when called as subroutines is described in the section entitled
<a href="#btsub">"Backtracking verbs in subroutines"</a>
below.
<a name="onigurumasubroutines"></a></P>
<br><a name="SEC30" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
<P>
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
a number enclosed either in angle brackets or single quotes, is an alternative
syntax for calling a group as a subroutine, possibly recursively. Here are two
of the examples used above, rewritten using this syntax:
<pre>
(?&#60;pn&#62; \( ( (?&#62;[^()]+) | \g&#60;pn&#62; )* \) )
(sens|respons)e and \g'1'ibility
</pre>
PCRE2 supports an extension to Oniguruma: if a number is preceded by a
plus or a minus sign it is taken as a relative reference. For example:
<pre>
(abc)(?i:\g&#60;-1&#62;)
</pre>
Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
synonymous. The former is a backreference; the latter is a subroutine call.
</P>
<br><a name="SEC31" href="#TOC1">CALLOUTS</a><br>
<P>
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
code to be obeyed in the middle of matching a regular expression. This makes it
possible, amongst other things, to extract different substrings that match the
same pair of parentheses when there is a repetition.
</P>
<P>
PCRE2 provides a similar feature, but of course it cannot obey arbitrary Perl
code. The feature is called "callout". The caller of PCRE2 provides an external
function by putting its entry point in a match context using the function
<b>pcre2_set_callout()</b>, and then passing that context to <b>pcre2_match()</b>
or <b>pcre2_dfa_match()</b>. If no match context is passed, or if the callout
entry point is set to NULL, callout points will be passed over silently during
matching. To disallow callouts in the pattern syntax, you may use the
PCRE2_EXTRA_NEVER_CALLOUT option.
</P>
<P>
Within a regular expression, (?C&#60;arg&#62;) indicates a point at which the external
function is to be called. There are two kinds of callout: those with a
numerical argument and those with a string argument. (?C) on its own with no
argument is treated as (?C0). A numerical argument allows the application to
distinguish between different callouts. String arguments were added for release
10.20 to make it possible for script languages that use PCRE2 to embed short
scripts within patterns in a similar way to Perl.
</P>
<P>
During matching, when PCRE2 reaches a callout point, the external function is
called. It is provided with the number or string argument of the callout, the
position in the pattern, and one item of data that is also set in the match
block. The callout function may cause matching to proceed, to backtrack, or to
fail.
</P>
<P>
By default, PCRE2 implements a number of optimizations at matching time, and
one side-effect is that sometimes callouts are skipped. If you need all
possible callouts to happen, you need to set options that disable the relevant
optimizations. More details, including a complete description of the
programming interface to the callout function, are given in the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation.
</P>
<br><b>
Callouts with numerical arguments
</b><br>
<P>
If you just want to have a means of identifying different callout points, put a
number less than 256 after the letter C. For example, this pattern has two
callout points:
<pre>
(?C1)abc(?C2)def
</pre>
If the PCRE2_AUTO_CALLOUT flag is passed to <b>pcre2_compile()</b>, numerical
callouts are automatically installed before each item in the pattern. They are
all numbered 255. If there is a conditional group in the pattern whose
condition is an assertion, an additional callout is inserted just before the
condition. An explicit callout may also be set at this position, as in this
example:
<pre>
(?(?C9)(?=a)abc|def)
</pre>
Note that this applies only to assertion conditions, not to other types of
condition.
</P>
<br><b>
Callouts with string arguments
</b><br>
<P>
A delimited string may be used instead of a number as a callout argument. The
starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
the same as the start, except for {, where the ending delimiter is }. If the
ending delimiter is needed within the string, it must be doubled. For
example:
<pre>
(?C'ab ''c'' d')xyz(?C{any text})pqr
</pre>
The doubling is removed before the string is passed to the callout function.
<a name="backtrackcontrol"></a></P>
<br><a name="SEC32" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
There are a number of special "Backtracking Control Verbs" (to use Perl's
terminology) that modify the behaviour of backtracking during matching. They
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
and may behave differently depending on whether or not a name argument is
present. The names are not required to be unique within the pattern.
</P>
<P>
By default, for compatibility with Perl, a name is any sequence of characters
that does not include a closing parenthesis. The name is not processed in
any way, and it is not possible to include a closing parenthesis in the name.
This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
is no longer Perl-compatible.
</P>
<P>
When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
and only an unescaped closing parenthesis terminates the name. However, the
only backslash items that are permitted are \Q, \E, and sequences such as
\x{100} that define character code points. Character type escapes such as \d
are faulted.
</P>
<P>
A closing parenthesis can be included in a name either as \) or between \Q
and \E. In addition to backslash processing, if the PCRE2_EXTENDED or
PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
skipped, and #-comments are recognized, exactly as in the rest of the pattern.
PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
PCRE2_ALT_VERBNAMES is also set.
</P>
<P>
The maximum length of a name is 255 in the 8-bit library and 65535 in the
16-bit and 32-bit libraries. If the name is empty, that is, if the closing
parenthesis immediately follows the colon, the effect is as if the colon were
not there. Any number of these verbs may occur in a pattern. Except for
(*ACCEPT), they may not be quantified.
</P>
<P>
Since these verbs are specifically related to backtracking, most of them can be
used only when the pattern is to be matched using the traditional matching
function or JIT, because they use backtracking algorithms. With the exception
of (*FAIL), which behaves like a failing negative assertion, the backtracking
control verbs cause an error if encountered by the DFA matching function.
</P>
<P>
The behaviour of these verbs in
<a href="#btrepeat">repeated groups,</a>
<a href="#btassert">assertions,</a>
and in
<a href="#btsub">capture groups called as subroutines</a>
(whether or not recursively) is documented below.
<a name="nooptimize"></a></P>
<br><b>
Optimizations that affect backtracking verbs
</b><br>
<P>
PCRE2 contains some optimizations that are used to speed up matching by running
some checks at the start of each match attempt. For example, it may know the
minimum length of matching subject, or that a particular character must be
present. When one of these optimizations bypasses the running of a match, any
included backtracking verbs will not, of course, be processed. You can suppress
the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option
when calling <b>pcre2_compile()</b>, by calling <b>pcre2_set_optimize()</b> with a
PCRE2_START_OPTIMIZE_OFF directive, or by starting the pattern with
(*NO_START_OPT). There is more discussion of this option in the section
entitled
<a href="pcre2api.html#compiling">"Compiling a pattern"</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<P>
Experiments with Perl suggest that it too has similar optimizations, and like
PCRE2, turning them off can change the result of a match.
<a name="acceptverb"></a></P>
<br><b>
Verbs that act immediately
</b><br>
<P>
The following verbs act as soon as they are encountered.
<pre>
(*ACCEPT) or (*ACCEPT:NAME)
</pre>
This verb causes the match to end successfully, skipping the remainder of the
pattern. However, when it is inside a capture group that is called as a
subroutine, only that group is ended successfully. Matching then continues
at the outer level. If (*ACCEPT) is triggered in a positive assertion, the
assertion succeeds; in a negative assertion, the assertion fails.
</P>
<P>
If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For
example:
<pre>
A((?:A|B(*ACCEPT)|C)D)
</pre>
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
the outer parentheses.
</P>
<P>
(*ACCEPT) is the only backtracking verb that is allowed to be quantified
because an ungreedy quantification with a minimum of zero acts only when a
backtrack happens. Consider, for example,
<pre>
(A(*ACCEPT)??B)C
</pre>
where A, B, and C may be complex expressions. After matching "A", the matcher
processes "BC"; if that fails, causing a backtrack, (*ACCEPT) is triggered and
the match succeeds. In both cases, all but C is captured. Whereas (*COMMIT)
(see below) means "fail on backtrack", a repeated (*ACCEPT) of this type means
"succeed on backtrack".
</P>
<P>
<b>Warning:</b> (*ACCEPT) should not be used within a script run group, because
it causes an immediate exit from the group, bypassing the script run checking.
<pre>
(*FAIL) or (*FAIL:NAME)
</pre>
This verb causes a matching failure, forcing backtracking to occur. It may be
abbreviated to (*F). It is equivalent to (?!) but easier to read. The Perl
documentation notes that it is probably useful only when combined with (?{}) or
(??{}). Those are, of course, Perl features that are not present in PCRE2. The
nearest equivalent is the callout feature, as for example in this pattern:
<pre>
a+(?C)(*FAIL)
</pre>
A match with the string "aaaa" always fails, but the callout is taken before
each backtrack happens (in this example, 10 times).
</P>
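<P>
The figure of 10 callouts can be reproduced by counting. The Python sketch
below assumes that the match is unanchored, that a match attempt is made at
every starting position, and that no start-of-match optimization skips an
attempt; callouts_before_failure is a hypothetical name.
</P>

```python
def callouts_before_failure(subject_length):
    """Count (?C) callouts for a+(?C)(*FAIL) on a subject made only of 'a'."""
    total = 0
    for start in range(subject_length):
        # From this start, a+ can match 1 to (subject_length - start)
        # characters. Each length reaches the callout once before (*FAIL)
        # forces a backtrack to the next shorter length.
        total += subject_length - start
    return total
```

<P>
For the four-character subject "aaaa" the sum is 4 + 3 + 2 + 1, which gives
the 10 callouts mentioned above.
</P>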
<P>
(*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*ACCEPT) and
(*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is recorded just before
the verb acts.
</P>
<br><b>
Recording which path was taken
</b><br>
<P>
There is one verb whose main purpose is to track how a match was arrived at,
though it also has a secondary use in conjunction with advancing the match
starting point (see (*SKIP) below).
<pre>
(*MARK:NAME) or (*:NAME)
</pre>
A name is always required with this verb. For all the other backtracking
control verbs, a NAME argument is optional.
</P>
<P>
When a match succeeds, the last-encountered mark name on the
matching path is passed back to the caller as described in the section entitled
<a href="pcre2api.html#matchotherdata">"Other information about the match"</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation. This applies to all instances of (*MARK) and other verbs,
including those inside assertions and atomic groups. However, there are
differences in those cases when (*MARK) is used in conjunction with (*SKIP) as
described below.
</P>
<P>
The mark name that was last encountered on the matching path is passed back. A
verb without a NAME argument is ignored for this purpose. Here is an example of
<b>pcre2test</b> output, where the "mark" modifier requests the retrieval and
outputting of (*MARK) data:
<pre>
re&#62; /X(*MARK:A)Y|X(*MARK:B)Z/mark
data&#62; XY
0: XY
MK: A
data&#62; XZ
0: XZ
MK: B
</pre>
The (*MARK) name is tagged with "MK:" in this output, and in this example it
indicates which of the two alternatives matched. This is a more efficient way
of obtaining this information than putting each alternative in its own
capturing parentheses.
</P>
<P>
If a verb with a name is encountered in a positive assertion that is true, the
name is recorded and passed back if it is the last-encountered. This does not
happen for negative assertions or failing positive assertions.
</P>
<P>
After a partial match or a failed match, the last encountered name in the
entire match process is returned. For example:
<pre>
re&#62; /X(*MARK:A)Y|X(*MARK:B)Z/mark
data&#62; XP
No match, mark = B
</pre>
Note that in this unanchored example the mark is retained from the match
attempt that started at the letter "X" in the subject. Subsequent match
attempts starting at "P" and then with an empty string do not get as far as the
(*MARK) item, but nevertheless do not reset it.
</P>
<P>
If you are interested in (*MARK) values after failed matches, you should
probably either set the PCRE2_NO_START_OPTIMIZE option or call
<b>pcre2_set_optimize()</b> with a PCRE2_START_OPTIMIZE_OFF directive
<a href="#nooptimize">(see above)</a>
to ensure that the match is always attempted.
</P>
<br><b>
Verbs that act after backtracking
</b><br>
<P>
The following verbs do nothing when they are encountered. Matching continues
with what follows, but if there is a subsequent match failure, causing a
backtrack to the verb, a failure is forced. That is, backtracking cannot pass
to the left of the verb. However, when one of these verbs appears inside an
atomic group or in an atomic lookaround assertion that is true, its effect is
confined to that group, because once the group has been matched, there is never
any backtracking into it. Backtracking from beyond an atomic assertion or group
ignores the entire group, and seeks a preceding backtracking point.
</P>
<P>
These verbs differ in exactly what kind of failure occurs when backtracking
reaches them. The behaviour described below is what happens when the verb is
not in a subroutine or an assertion. Subsequent sections cover these special
cases.
<pre>
(*COMMIT) or (*COMMIT:NAME)
</pre>
This verb causes the whole match to fail outright if there is a later matching
failure that causes backtracking to reach it. Even if the pattern is
unanchored, no further attempts to find a match by advancing the starting point
take place. If (*COMMIT) is the only backtracking verb that is encountered,
once it has been passed <b>pcre2_match()</b> is committed to finding a match at
the current starting point, or not at all. For example:
<pre>
a+(*COMMIT)b
</pre>
This matches "xxaab" but not "aacaab". It can be thought of as a kind of
dynamic anchor, or "I've started, so I must finish."
</P>
<P>
The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COMMIT). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names that are set with
(*MARK), ignoring those set by any of the other backtracking verbs.
</P>
<P>
If there is more than one backtracking verb in a pattern, a different one that
follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a
match does not always guarantee that a match must be at this starting point.
</P>
<P>
Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
unless PCRE2's start-of-match optimizations are turned off, as shown in this
output from <b>pcre2test</b>:
<pre>
re&#62; /(*COMMIT)abc/
data&#62; xyzabc
0: abc
data&#62;
re&#62; /(*COMMIT)abc/no_start_optimize
data&#62; xyzabc
No match
</pre>
For the first pattern, PCRE2 knows that any match must start with "a", so the
optimization skips along the subject to "a" before applying the pattern to the
first set of data. The match attempt then succeeds. The second pattern disables
the optimization that skips along to the first character. The pattern is now
applied starting at "x", and so the (*COMMIT) causes the match to fail without
trying any other starting points.
<pre>
(*PRUNE) or (*PRUNE:NAME)
</pre>
This verb causes the match to fail at the current starting position in the
subject if there is a later matching failure that causes backtracking to reach
it. If the pattern is unanchored, the normal "bumpalong" advance to the next
starting character then happens. Backtracking can occur as usual to the left of
(*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but
if there is no match to the right, backtracking cannot cross (*PRUNE). In
simple cases, the use of (*PRUNE) is just an alternative to an atomic group or
possessive quantifier, but there are some uses of (*PRUNE) that cannot be
expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
as (*COMMIT).
</P>
<P>
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
ignoring those set by other backtracking verbs.
<pre>
(*SKIP)
</pre>
This verb, when given without a name, is like (*PRUNE), except that if the
pattern is unanchored, the "bumpalong" advance is not to the next character,
but to the position in the subject where (*SKIP) was encountered. (*SKIP)
signifies that whatever text was matched leading up to it cannot be part of a
successful match if there is a later mismatch. Consider:
<pre>
a+(*SKIP)b
</pre>
If the subject is "aaaac...", after the first match attempt fails (starting at
the first character in the string), the starting point skips on to start the
next attempt at "c". Note that a possessive quantifier does not have the same
effect as this example; although it would suppress backtracking during the
first match attempt, the second attempt would start at the second character
instead of skipping on to "c".
</P>
<P>
If (*SKIP) is used to specify a new starting position that is the same as the
starting position of the current match, or (by being inside a lookbehind)
earlier, the position specified by (*SKIP) is ignored, and instead the normal
"bumpalong" occurs.
<pre>
(*SKIP:NAME)
</pre>
When (*SKIP) has an associated name, its behaviour is modified. When such a
(*SKIP) is triggered, the previous path through the pattern is searched for the
most recent (*MARK) that has the same name. If one is found, the "bumpalong"
advance is to the subject position that corresponds to that (*MARK) instead of
to where (*SKIP) was encountered. If no (*MARK) with a matching name is found,
the (*SKIP) is ignored.
</P>
<P>
The search for a (*MARK) name uses the normal backtracking mechanism, which
means that it does not see (*MARK) settings that are inside atomic groups or
assertions, because they are never re-entered by backtracking. Compare the
following <b>pcre2test</b> examples:
<pre>
re&#62; /a(?&#62;(*MARK:X))(*SKIP:X)(*F)|(.)/
data&#62; abc
0: a
1: a
data&#62;
re&#62; /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
data&#62; abc
0: b
1: b
</pre>
In the first example, the (*MARK) setting is in an atomic group, so it is not
seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored. This allows
the second branch of the pattern to be tried at the first character position.
In the second example, the (*MARK) setting is not in an atomic group. This
allows (*SKIP:X) to find the (*MARK) when it backtracks, and this causes a new
matching attempt to start at the second character. This time, the (*MARK) is
never seen because "a" does not match "b", so the matcher immediately jumps to
the second branch of the pattern.
</P>
<P>
Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores
names that are set by other backtracking verbs.
<pre>
(*THEN) or (*THEN:NAME)
</pre>
This verb causes a skip to the next innermost alternative when backtracking
reaches it. That is, it cancels any further backtracking within the current
alternative. Its name comes from the observation that it can be used for a
pattern-based if-then-else block:
<pre>
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
</pre>
If the COND1 pattern matches, FOO is tried (and possibly further items after
the end of the group if FOO succeeds); on failure, the matcher skips to the
second alternative and tries COND2, without backtracking into COND1. If that
succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no
more alternatives, so there is a backtrack to whatever came before the entire
group. If (*THEN) is not inside an alternation, it acts like (*PRUNE).
</P>
<P>
The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
ignoring those set by other backtracking verbs.
</P>
<P>
A group that does not contain a | character is just a part of the enclosing
alternative; it is not a nested alternation with only one alternative. The
effect of (*THEN) extends beyond such a group to the enclosing alternative.
Consider this pattern, where A, B, etc. are complex pattern fragments that do
not contain any | characters at this level:
<pre>
A (B(*THEN)C) | D
</pre>
If A and B are matched, but there is a failure in C, matching does not
backtrack into A; instead it moves to the next alternative, that is, D.
However, if the group containing (*THEN) is given an alternative, it
behaves differently:
<pre>
A (B(*THEN)C | (*FAIL)) | D
</pre>
The effect of (*THEN) is now confined to the inner group. After a failure in C,
matching moves to (*FAIL), which causes the whole group to fail because there
are no more alternatives to try. In this case, matching does backtrack into A.
</P>
<P>
Note that a conditional group is not considered as having two alternatives,
because only one is ever used. In other words, the | character in a conditional
group has a different meaning. Ignoring white space, consider:
<pre>
^.*? (?(?=a) a | b(*THEN)c )
</pre>
If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
it initially matches zero characters. The condition (?=a) then fails, the
character "b" is matched, but "c" is not. At this point, matching does not
backtrack to .*? as might perhaps be expected from the presence of the |
character. The conditional group is part of the single alternative that
comprises the whole pattern, and so the match fails. (If there was a backtrack
into .*?, allowing it to match "b", the match would succeed.)
</P>
<P>
The verbs just described provide four different "strengths" of control when
subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
next alternative. (*PRUNE) comes next, failing the match at the current
starting position, but allowing an advance to the next character (for an
unanchored pattern). (*SKIP) is similar, except that the advance may be more
than one character. (*COMMIT) is the strongest, causing the entire match to
fail.
</P>
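<P>
The difference between these strengths can be sketched with a small model. The
following Python code is only a hand-written simulation of the single pattern
a+(*VERB)b, not a call into PCRE2; the function name <b>simulate</b> and its
return format are assumptions made for illustration, and PCRE2's start-of-match
optimizations are ignored. It records which starting positions are tried.
(*THEN) is omitted because, outside an alternation, it acts like (*PRUNE).
</P>

```python
def simulate(subject, verb=None):
    """Toy model of matching a+(*VERB)b against subject.

    verb is None, "PRUNE", "SKIP" or "COMMIT". Returns a pair:
    (match span or None, list of starting positions that were tried).
    """
    starts_tried = []
    start = 0
    while start <= len(subject):
        starts_tried.append(start)
        # Greedy a+: consume as many "a" characters as possible.
        end_of_as = start
        while end_of_as < len(subject) and subject[end_of_as] == "a":
            end_of_as += 1
        if end_of_as == start:
            # a+ failed before the verb was reached: ordinary bumpalong.
            start += 1
            continue
        # The verb is passed here, at subject position end_of_as.
        if subject.startswith("b", end_of_as):
            return (start, end_of_as + 1), starts_tried
        # "b" did not match, so backtracking reaches the verb.
        if verb == "COMMIT":
            # No further starting points are tried.
            return None, starts_tried
        if verb == "SKIP":
            # Restart where (*SKIP) was encountered.
            start = end_of_as
        else:
            # (*PRUNE), or no verb: for this particular pattern a shorter
            # run of "a" is still followed by "a", so the attempt fails
            # either way and the start advances by one character.
            start += 1
    return None, starts_tried


for verb in (None, "PRUNE", "SKIP", "COMMIT"):
    print(verb, simulate("aaaac", verb))
```

<P>
In this model, (*SKIP) tries only positions 0, 4, and 5 for "aaaac", (*PRUNE)
tries every position, and (*COMMIT) gives up after position 0. With (*COMMIT),
"xxaab" still matches because the first two attempts fail before the verb is
reached.
</P>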
<br><b>
More than one backtracking verb
</b><br>
<P>
If more than one backtracking verb is present in a pattern, the one that is
backtracked onto first acts. For example, consider this pattern, where A, B,
etc. are complex pattern fragments:
<pre>
(A(*COMMIT)B(*THEN)C|ABD)
</pre>
If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to
fail. However, if A and B match, but C fails, the backtrack to (*THEN) causes
the next alternative (ABD) to be tried. This behaviour is consistent, but is
not always the same as Perl's. It means that if two or more backtracking verbs
appear in succession, all but the last of them has no effect. Consider this
example:
<pre>
...(*COMMIT)(*PRUNE)...
</pre>
If there is a matching failure to the right, backtracking onto (*PRUNE) causes
it to be triggered, and its action is taken. There can never be a backtrack
onto (*COMMIT).
<a name="btrepeat"></a></P>
<br><b>
Backtracking verbs in repeated groups
</b><br>
<P>
PCRE2 sometimes differs from Perl in its handling of backtracking verbs in
repeated groups. For example, consider:
<pre>
/(a(*COMMIT)b)+ac/
</pre>
If the subject is "abac", Perl matches unless its optimizations are disabled,
but PCRE2 always fails because the (*COMMIT) in the second repeat of the group
acts.
<a name="btassert"></a></P>
<br><b>
Backtracking verbs in assertions
</b><br>
<P>
(*FAIL) in any assertion has its normal effect: it forces an immediate
backtrack. The behaviour of the other backtracking verbs depends on whether the
assertion is standalone or is acting as the condition in a conditional
group.
</P>
<P>
(*ACCEPT) in a standalone positive assertion causes the assertion to succeed
without any further processing; captured strings and a mark name (if set) are
retained. In a standalone negative assertion, (*ACCEPT) causes the assertion to
fail without any further processing; captured substrings and any mark name are
discarded.
</P>
<P>
If the assertion is a condition, (*ACCEPT) causes the condition to be true for
a positive assertion and false for a negative one; captured substrings are
retained in both cases.
</P>
<P>
The remaining verbs act only when a later failure causes a backtrack to
reach them. This means that, for the Perl-compatible assertions, their effect
is confined to the assertion, because Perl lookaround assertions are atomic. A
backtrack that occurs after such an assertion is complete does not jump back
into the assertion. Note in particular that a (*MARK) name that is set in an
assertion is not "seen" by an instance of (*SKIP:NAME) later in the pattern.
</P>
<P>
PCRE2 now supports non-atomic positive assertions and also "scan substring"
assertions, as described in the sections entitled
<a href="#nonatomicassertions">"Non-atomic assertions"</a>
and
<a href="#scansubstringassertions">"Scan substring assertions"</a>
above. These assertions must be standalone (not used as conditions). They are
not Perl-compatible. For these assertions, a later backtrack does jump back
into the assertion, and therefore verbs such as (*COMMIT) can be triggered by
backtracks from later in the pattern.
</P>
<P>
The effect of (*THEN) is not allowed to escape beyond an assertion. If there
are no more branches to try, (*THEN) causes a positive assertion to be false,
and a negative assertion to be true. This behaviour differs from Perl when the
assertion has only one branch.
</P>
<P>
The other backtracking verbs are not treated specially if they appear in a
standalone positive assertion. In a conditional positive assertion,
backtracking (from within the assertion) into (*COMMIT), (*SKIP), or (*PRUNE)
causes the condition to be false. However, for both standalone and conditional
negative assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes
the assertion to be true, without considering any further alternative branches.
<a name="btsub"></a></P>
<br><b>
Backtracking verbs in subroutines
</b><br>
<P>
These behaviours occur whether or not the group is called recursively.
</P>
<P>
(*ACCEPT) in a group called as a subroutine causes the subroutine match to
succeed without any further processing. Matching then continues after the
subroutine call. Perl documents this behaviour. Perl's treatment of the other
verbs in subroutines is different in some cases.
</P>
<P>
(*FAIL) in a group called as a subroutine has its normal effect: it forces
an immediate backtrack.
</P>
<P>
(*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail when
triggered by being backtracked to in a group called as a subroutine. There is
then a backtrack at the outer level.
</P>
<P>
(*THEN), when triggered, skips to the next alternative in the innermost
enclosing group that has alternatives (its normal behaviour). However, if there
is no such group within the subroutine's group, the subroutine match fails and
there is a backtrack at the outer level.
<a name="ebcdicenvironments"></a></P>
<br><a name="SEC33" href="#TOC1">EBCDIC ENVIRONMENTS</a><br>
<P>
Differences in the way PCRE2 behaves when it is running in an EBCDIC environment
are covered in this section.
</P>
<br><b>
Escape sequences
</b><br>
<P>
When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported. \a, \e,
\f, \n, \r, and \t generate the appropriate EBCDIC code values. The \c
escape is processed as specified for Perl in the <b>perlebcdic</b> document. The
only characters that are allowed after \c are A-Z, a-z, or one of @, [, \, ],
^, _, or ?. Any other character provokes a compile-time error. The sequence
\c@ encodes character code 0; after \c the letters (in either case) encode
characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
(hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
</P>
<P>
Thus, apart from \c?, these escapes generate the same character code values as
they do in an ASCII or Unicode environment, though the meanings of the values
mostly differ. For example, \cG always generates code value 7, which is BEL in
ASCII but DEL in EBCDIC.
</P>
<P>
The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but
because 127 is not a control character in EBCDIC, Perl makes it generate the
APC character. Unfortunately, there are several variants of EBCDIC. In most of
them the APC character has the value 255 (hex FF), but in the one Perl calls
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
values, PCRE2 makes \c? generate 95; otherwise it generates 255.
</P>
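<P>
The \c mapping described above can be written out as a small table lookup. The
Python sketch below is illustrative only: the function name
<b>ebcdic_control_code</b> and the <b>posix_bc</b> flag are hypothetical, and
the code does not reproduce how PCRE2 detects a POSIX-BC code page.
</P>

```python
def ebcdic_control_code(ch, posix_bc=False):
    """Return the code value that \\c followed by ch generates in EBCDIC mode.

    posix_bc is a hypothetical flag standing in for PCRE2's own detection
    of the POSIX-BC variant; it only affects the result for "?".
    """
    if len(ch) != 1:
        raise ValueError("exactly one character must follow \\c")
    if ch == "@":
        return 0
    if ch.isascii() and ch.isalpha():
        # Letters in either case encode 1-26 (hex 01 to hex 1A).
        return ord(ch.upper()) - ord("A") + 1
    if ch in "[\\]^_":
        # [, \, ], ^ and _ encode 27-31 (hex 1B to hex 1F).
        return 27 + "[\\]^_".index(ch)
    if ch == "?":
        # The APC character: 95 in POSIX-BC, 255 in other variants.
        return 95 if posix_bc else 255
    raise ValueError("character not allowed after \\c: %r" % ch)


print(ebcdic_control_code("G"))                 # code value 7
print(ebcdic_control_code("?"))                 # 255
print(ebcdic_control_code("?", posix_bc=True))  # 95
```

<P>
Apart from "?", the results agree with the usual ASCII rule of converting to
upper case and flipping bit 6 (hex 40) of the character code.
</P>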
<br><b>
Character classes
</b><br>
<P>
In character classes there is a special case in EBCDIC environments for ranges
whose end points are both specified as literal letters in the same case. For
compatibility with Perl, EBCDIC code points within the range that are not
letters are omitted. For example, [h-k] matches only four characters, even
though the EBCDIC codes for h and k are 0x88 and 0x92, a range of 11 code
points. However, if the range is specified numerically, for example,
[\x88-\x92] or [h-\x92], all code points are included.
</P>
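<P>
The gap between the letters can be seen by decoding the code points in the
range. The following Python sketch uses the standard "cp037" codec as a
stand-in for an EBCDIC code page; that choice is an assumption, as PCRE2 may be
built for a different variant, though the positions of h to k are the same in
the common ones. The function name <b>ebcdic_letter_range</b> is hypothetical.
</P>

```python
import string


def ebcdic_letter_range(first, last, codec="cp037"):
    """Return (all code points, lower-case letters) between two letters."""
    low = first.encode(codec)[0]
    high = last.encode(codec)[0]
    code_points = list(range(low, high + 1))
    letters = []
    for code_point in code_points:
        ch = bytes([code_point]).decode(codec)
        # Keep only the basic lower-case letters a-z; other code points
        # in the gap (punctuation, accented letters) are dropped.
        if ch in string.ascii_lowercase:
            letters.append(ch)
    return code_points, letters


code_points, letters = ebcdic_letter_range("h", "k")
print([hex(cp) for cp in code_points])  # every code point in [\x88-\x92]
print(letters)                          # what [h-k] is treated as matching
```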
<br><a name="SEC34" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
<b>pcre2syntax</b>(3), <b>pcre2</b>(3).
</P>
<br><a name="SEC35" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC36" href="#TOC1">REVISION</a><br>
<P>
Last updated: 27 November 2024
<br>
Copyright &copy; 1997-2024 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,280 @@
<html>
<head>
<title>pcre2perform specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2perform man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 PERFORMANCE</a>
<li><a name="TOC2" href="#SEC2">COMPILED PATTERN MEMORY USAGE</a>
<li><a name="TOC3" href="#SEC3">STACK AND HEAP USAGE AT RUN TIME</a>
<li><a name="TOC4" href="#SEC4">PROCESSING TIME</a>
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
<li><a name="TOC6" href="#SEC6">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 PERFORMANCE</a><br>
<P>
Two aspects of performance are discussed below: memory usage and processing
time. The way you express your pattern as a regular expression can affect both
of them.
</P>
<br><a name="SEC2" href="#TOC1">COMPILED PATTERN MEMORY USAGE</a><br>
<P>
Patterns are compiled by PCRE2 into a reasonably efficient interpretive code,
so that most simple patterns do not use much memory for storing the compiled
version. However, there is one case where the memory usage of a compiled
pattern can be unexpectedly large. If a parenthesized group has a quantifier
with a minimum greater than 1 and/or a limited maximum, the whole group is
repeated in the compiled code. For example, the pattern
<pre>
(abc|def){2,4}
</pre>
is compiled as if it were
<pre>
(abc|def)(abc|def)((abc|def)(abc|def)?)?
</pre>
(Technical aside: It is done this way so that backtrack points within each of
the repetitions can be independently maintained.)
</P>
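<P>
The expansion can be reproduced textually. The Python sketch below builds the
expanded form of a bounded repeat and uses Python's own <b>re</b> module, not
PCRE2, to check that both forms accept the same sample strings. The function
names are hypothetical, and the code says nothing about PCRE2's actual compiled
representation; it only shows why the compiled size grows with the bounds.
</P>

```python
import re


def expand_bounded(group, minimum, maximum):
    """Write group{minimum,maximum} as explicit copies of group.

    group must already be a parenthesized group such as "(abc|def)".
    """
    required = group * minimum
    optional = ""
    for i in range(maximum - minimum):
        # Nest the optional copies from the inside out.
        if i == 0:
            optional = group + "?"
        else:
            optional = "(" + group + optional + ")?"
    return required + optional


def agree_on(samples, original, expanded):
    """True if both patterns accept or reject each sample identically."""
    return all(
        bool(re.fullmatch(original, s)) == bool(re.fullmatch(expanded, s))
        for s in samples
    )


expanded = expand_bounded("(abc|def)", 2, 4)
print(expanded)
print(len("(abc|def){2,4}"), "characters versus", len(expanded))
```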
<P>
For regular expressions whose quantifiers use only small numbers, this is not
usually a problem. However, if the numbers are large, and particularly if such
repetitions are nested, the memory usage can become an embarrassment. For
example, the very simple pattern
<pre>
((ab){1,1000}c){1,3}
</pre>
uses over 50KiB when compiled using the 8-bit library. When PCRE2 is
compiled with its default internal pointer size of two bytes, the size limit on
a compiled pattern is 65535 code units in the 8-bit and 16-bit libraries, and
this is reached with the above pattern if the outer repetition is increased
from 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus
handle larger compiled patterns, but it is better to try to rewrite your
pattern to use less memory if you can.
</P>
<P>
One way of reducing the memory usage for such patterns is to make use of
PCRE2's
<a href="pcre2pattern.html#subpatternsassubroutines">"subroutine"</a>
facility. Re-writing the above pattern as
<pre>
((ab)(?2){0,999}c)(?1){0,2}
</pre>
reduces the memory requirements to around 16KiB, and indeed it remains under
20KiB even with the outer repetition increased to 100. However, this kind of
pattern is not always exactly equivalent, because any captures within
subroutine calls are lost when the subroutine completes. If this is not a
problem, this kind of rewriting will allow you to process patterns that PCRE2
cannot otherwise handle. The matching performance of the two different versions
of the pattern is roughly the same. (This applies from release 10.30 - things
were different in earlier releases.)
</P>
<br><a name="SEC3" href="#TOC1">STACK AND HEAP USAGE AT RUN TIME</a><br>
<P>
From release 10.30, the interpretive (non-JIT) version of <b>pcre2_match()</b>
uses very little system stack at run time. In earlier releases recursive
function calls could use a great deal of stack, and this could cause problems,
but this usage has been eliminated. Backtracking positions are now explicitly
remembered in memory frames controlled by the code.
</P>
<P>
The size of each frame depends on the size of pointer variables and the number
of capturing parenthesized groups in the pattern being matched. On a 64-bit
system the frame size for a pattern with no captures is 128 bytes. For each
capturing group the size increases by 16 bytes.
</P>
<P>
Until release 10.41, an initial 20KiB frames vector was allocated on the system
stack, but this still caused some issues for multi-thread applications where
each thread has a very small stack. From release 10.41 backtracking memory
frames are always held in heap memory. An initial heap allocation is obtained
the first time any match data block is passed to <b>pcre2_match()</b>. This is
remembered with the match data block and re-used if that block is used for
another match. It is freed when the match data block itself is freed.
</P>
<P>
The size of the initial block is the larger of 20KiB or ten times the pattern's
frame size, unless the heap limit is less than this, in which case the heap
limit is used. If the initial block proves to be too small during matching, it
is replaced by a larger block, subject to the heap limit. The heap limit is
checked only when a new block is to be allocated. Reducing the heap limit
between calls to <b>pcre2_match()</b> with the same match data block does not
affect the saved block.
</P>
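<P>
As a worked example of the arithmetic, the Python sketch below computes the
frame size and the initial block size from the figures quoted above for a
64-bit system. The function names are hypothetical, the figures may differ
between releases and platforms, and the heap limit is assumed to be given in
kibibytes, the unit used by <b>pcre2_set_heap_limit()</b>.
</P>

```python
KIB = 1024


def frame_size(capture_groups, base=128, per_group=16):
    """Frame size in bytes, using the 64-bit figures quoted in the text."""
    return base + per_group * capture_groups


def initial_block_size(frame_bytes, heap_limit_kib=None):
    """Initial heap block in bytes: the larger of 20KiB or ten frames,
    capped by the heap limit if that is smaller."""
    size = max(20 * KIB, 10 * frame_bytes)
    if heap_limit_kib is not None:
        size = min(size, heap_limit_kib * KIB)
    return size


for groups in (0, 3, 200):
    frame = frame_size(groups)
    print(groups, "groups:", frame, "byte frame,",
          initial_block_size(frame), "byte initial block")
```

<P>
With these figures, ten frames exceed 20KiB only when a frame is larger than
2048 bytes, that is, for patterns with more than 120 capturing groups.
</P>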
<P>
In contrast to <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b> does use recursive
function calls, but only for processing atomic groups, lookaround assertions,
and recursion within the pattern. The original version of the code used to
allocate quite large internal workspace vectors on the stack, which caused some
problems for some patterns in environments with small stacks. From release
10.32 the code for <b>pcre2_dfa_match()</b> has been re-factored to use heap
memory when necessary for internal workspace when recursing, though recursive
function calls are still used.
</P>
<P>
The "match depth" parameter can be used to limit the depth of function
recursion, and the "match heap" parameter to limit heap memory in
<b>pcre2_dfa_match()</b>.
</P>
<br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br>
<P>
Certain items in regular expression patterns are processed more efficiently
than others. It is more efficient to use a character class like [aeiou] than a
set of single-character alternatives such as (a|e|i|o|u). In general, the
simplest construction that provides the required behaviour is usually the most
efficient. Jeffrey Friedl's book contains a lot of useful general discussion
about optimizing regular expressions for efficient performance. This document
contains a few observations about PCRE2.
</P>
<P>
Using Unicode character properties (the \p, \P, and \X escapes) is slow,
because PCRE2 has to use a multi-stage table lookup whenever it needs a
character's property. If you can find an alternative pattern that does not use
character properties, it will probably be faster.
</P>
<P>
By default, the escape sequences \b, \d, \s, and \w, and the POSIX
character classes such as [:alpha:] do not use Unicode properties, partly for
backwards compatibility, and partly for performance reasons. However, you can
set the PCRE2_UCP option or start the pattern with (*UCP) if you want Unicode
character properties to be used. This can double the matching time for items
such as \d, when matched with <b>pcre2_match()</b>; the performance loss is
less with a DFA matching function, and in both cases there is not much
difference for \b.
</P>
<P>
When a pattern begins with .* not in atomic parentheses, nor in parentheses
that are the subject of a backreference, and the PCRE2_DOTALL option is set,
the pattern is implicitly anchored by PCRE2, since it can match only at the
start of a subject string. If the pattern has multiple top-level branches, they
must all be anchorable. The optimization can be disabled by the
PCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the pattern
contains (*PRUNE) or (*SKIP).
</P>
<P>
If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, because the
dot metacharacter does not then match a newline, and if the subject string
contains newlines, the pattern may match from the character immediately
following one of them instead of from the very start. For example, the pattern
<pre>
.*second
</pre>
matches the subject "first\nand second" (where \n stands for a newline
character), with the match starting at the seventh character. In order to do
this, PCRE2 has to retry the match starting after every newline in the subject.
</P>
<P>
If you are using such a pattern with subject strings that do not contain
newlines, the best performance is obtained by setting PCRE2_DOTALL, or starting
the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE2
from having to scan along the subject looking for a newline to restart at.
</P>
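<P>
The effect of the dot not matching a newline can be observed with any
Perl-style matcher. The following sketch uses Python's <b>re</b> module as a
stand-in for PCRE2 (an assumption made so that the example is self-contained);
<b>re.DOTALL</b> plays the role of PCRE2_DOTALL here, and offsets are counted
from zero.
</P>

```python
import re

subject = "first\nand second"

# Without DOTALL the dot stops at the newline, so the match can only
# begin after it: offset 6, which is the seventh character.
without_dotall = re.search(r".*second", subject)
print(without_dotall.start(), repr(without_dotall.group()))

# With DOTALL the leading .* can swallow the newline, so a match can
# only begin at the start of the subject.
with_dotall = re.search(r".*second", subject, re.DOTALL)
print(with_dotall.start(), repr(with_dotall.group()))
```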
<P>
Beware of patterns that contain nested indefinite repeats. These can take a
long time to run when applied to a string that does not match. Consider the
pattern fragment
<pre>
^(a+)*
</pre>
This can match "aaaa" in 16 different ways, and this number increases very
rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
times, and for each of those cases other than 0 or 4, the + repeats can match
different numbers of times.) When the remainder of the pattern is such that the
entire match is going to fail, PCRE2 has in principle to try every possible
variation, and this can take an extremely long time, even for relatively short
strings.
</P>
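<P>
The figure of 16 can be checked by enumeration. The Python sketch below is a
counting model only, not PCRE2 code, and its function names are hypothetical.
It lists every way of splitting a prefix of the run of "a" characters into
successive non-empty pieces, one piece per iteration of the outer group.
</P>

```python
def compositions(total):
    """Yield every ordered way to split total into positive parts."""
    if total == 0:
        yield []
        return
    for first in range(1, total + 1):
        for rest in compositions(total - first):
            yield [first] + rest


def ways_to_match(run_length):
    """All ways ^(a+)* can match some prefix of run_length "a" characters.

    Each way is a list giving how many characters each iteration of the
    group consumed; the empty list is the zero-iteration match.
    """
    ways = []
    for matched in range(run_length + 1):
        ways.extend(compositions(matched))
    return ways


for n in (4, 8, 16):
    print(n, "characters:", len(ways_to_match(n)), "ways")
```

<P>
The count doubles with every additional character, which is why a failing
match can take so long even for quite short subjects.
</P>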
<P>
An optimization catches some of the simpler cases such as
<pre>
(a+)*b
</pre>
where a literal character follows. Before embarking on the standard matching
procedure, PCRE2 checks that there is a "b" later in the subject string, and if
there is not, it fails the match immediately. However, when there is no
following literal this optimization cannot be used. You can see the difference
by comparing the behaviour of
<pre>
(a+)*\d
</pre>
with the pattern above. The pattern above, (a+)*b, gives a failure almost
instantly when applied to a whole line of "a" characters, whereas (a+)*\d takes
an appreciable time with strings longer than about 20 characters.
</P>
<P>
In many cases, the solution to this kind of performance issue is to use an
atomic group or a possessive quantifier. This can often reduce memory
requirements as well. As another example, consider this pattern:
<pre>
([^&#60;]|&#60;(?!inet))+
</pre>
It matches from wherever it starts until it encounters "&#60;inet" or the end of
the data, and is the kind of pattern that might be used when processing an XML
file. Each iteration of the outer parentheses matches either one character that
is not "&#60;" or a "&#60;" that is not followed by "inet". However, each time a
parenthesis is processed, a backtracking position is passed, so this
formulation uses a memory frame for each matched character. For a long string,
a lot of memory is required. Consider now this rewritten pattern, which matches
exactly the same strings:
<pre>
([^&#60;]++|&#60;(?!inet))+
</pre>
This runs much faster, because sequences of characters that do not contain "&#60;"
are "swallowed" in one item inside the parentheses, and a possessive quantifier
is used to stop any backtracking into the runs of non-"&#60;" characters. This
version also uses a lot less memory because entry to a new set of parentheses
happens only when a "&#60;" character that is not followed by "inet" is encountered
(and we assume this is relatively rare).
</P>
<P>
This example shows that one way of optimizing performance when matching long
subject strings is to write repeated parenthesized subpatterns to match more
than one character whenever possible.
</P>
<br><b>
SETTING RESOURCE LIMITS
</b><br>
<P>
You can set limits on the amount of processing that takes place when matching,
and on the amount of heap memory that is used. The default values of the limits
are very large, and unlikely ever to operate. They can be changed when PCRE2 is
built, and they can also be set when <b>pcre2_match()</b> or
<b>pcre2_dfa_match()</b> is called. For details of these interfaces, see the
<a href="pcre2build.html"><b>pcre2build</b></a>
documentation and the section entitled
<a href="pcre2api.html#matchcontext">"The match context"</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<P>
The <b>pcre2test</b> test program has a modifier called "find_limits" which, if
applied to a subject line, causes it to find the smallest limits that allow a
pattern to match. This is done by repeatedly matching with different limits.
</P>
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
Last updated: 06 December 2022
<br>
Copyright &copy; 1997-2022 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,379 @@
<html>
<head>
<title>pcre2posix specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2posix man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">USING THE POSIX FUNCTIONS</a>
<li><a name="TOC4" href="#SEC4">COMPILING A PATTERN</a>
<li><a name="TOC5" href="#SEC5">MATCHING NEWLINE CHARACTERS</a>
<li><a name="TOC6" href="#SEC6">MATCHING A PATTERN</a>
<li><a name="TOC7" href="#SEC7">ERROR MESSAGES</a>
<li><a name="TOC8" href="#SEC8">MEMORY USAGE</a>
<li><a name="TOC9" href="#SEC9">AUTHOR</a>
<li><a name="TOC10" href="#SEC10">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS</a><br>
<P>
<b>#include &#60;pcre2posix.h&#62;</b>
</P>
<P>
<b>int pcre2_regcomp(regex_t *<i>preg</i>, const char *<i>pattern</i>,</b>
<b> int <i>cflags</i>);</b>
<br>
<br>
<b>int pcre2_regexec(const regex_t *<i>preg</i>, const char *<i>string</i>,</b>
<b> size_t <i>nmatch</i>, regmatch_t <i>pmatch</i>[], int <i>eflags</i>);</b>
<br>
<br>
<b>size_t pcre2_regerror(int <i>errcode</i>, const regex_t *<i>preg</i>,</b>
<b> char *<i>errbuf</i>, size_t <i>errbuf_size</i>);</b>
<br>
<br>
<b>void pcre2_regfree(regex_t *<i>preg</i>);</b>
</P>
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
<P>
This set of functions provides a POSIX-style API for the PCRE2 regular
expression 8-bit library. There are no POSIX-style wrappers for PCRE2's 16-bit
and 32-bit libraries. See the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation for a description of PCRE2's native API, which contains much
additional functionality.
</P>
<P>
<b>IMPORTANT NOTE</b>: The functions described here are NOT thread-safe, and
should not be used in multi-threaded applications. They are also limited to
processing subjects that are not bigger than 2GB. Use the native API instead.
</P>
<P>
These functions are wrapper functions that ultimately call the PCRE2 native
API. Their prototypes are defined in the <b>pcre2posix.h</b> header file, and
they all have unique names starting with <b>pcre2_</b>. However, the
<b>pcre2posix.h</b> header also contains macro definitions that convert the
standard POSIX names such as <b>regcomp()</b> into <b>pcre2_regcomp()</b> etc. This
means that a program can use the usual POSIX names without running the risk of
accidentally linking with POSIX functions from a different library.
</P>
<P>
On Unix-like systems the PCRE2 POSIX library is called <b>libpcre2-posix</b>, so
can be accessed by adding <b>-lpcre2-posix</b> to the command for linking an
application. Because the POSIX functions call the native ones, it is also
necessary to add <b>-lpcre2-8</b>.
</P>
<P>
On Windows systems, if you are linking to a DLL version of the library, it is
recommended that <b>PCRE2POSIX_SHARED</b> is defined before including the
<b>pcre2posix.h</b> header, as it will allow for a more efficient way to
invoke the functions by adding the <b>__declspec(dllimport)</b> decorator.
</P>
<P>
Although they were not defined as prototypes in <b>pcre2posix.h</b>, releases
10.33 to 10.36 of the library contained functions with the POSIX names
<b>regcomp()</b> etc. These simply passed their arguments to the PCRE2
functions. These functions were provided for backwards compatibility with
earlier versions of PCRE2, which had only POSIX names. However, this has proved
troublesome in situations where a program links with several libraries, some of
which use PCRE2's POSIX interface while others use the real POSIX functions.
For this reason, the POSIX names have been removed since release 10.37.
</P>
<P>
Calling the header file <b>pcre2posix.h</b> avoids any conflict with other POSIX
libraries. It can, of course, be renamed or aliased as <b>regex.h</b>, which is
the "correct" name, if there is no clash. It provides two structure types,
<i>regex_t</i> for compiled internal forms, and <i>regmatch_t</i> for returning
captured substrings. It also defines some constants whose names start with
"REG_"; these are used for setting options and identifying error codes.
</P>
<br><a name="SEC3" href="#TOC1">USING THE POSIX FUNCTIONS</a><br>
<P>
Note that these functions are just POSIX-style wrappers for PCRE2's native API.
They do not give POSIX regular expression behaviour, and they are not
thread-safe or even POSIX compatible.
</P>
<P>
Those POSIX option bits that can reasonably be mapped to PCRE2 native options
have been implemented. In addition, the option REG_EXTENDED is defined with the
value zero. This has no effect, but since programs that are written to the
POSIX interface often use it, this makes it easier to slot in PCRE2 as a
replacement library. Other POSIX options are not even defined.
</P>
<P>
There are also some options that are not defined by POSIX. These have been
added at the request of users who want to make use of certain PCRE2-specific
features via the POSIX calling interface or to add BSD or GNU functionality.
</P>
<P>
When PCRE2 is called via these functions, it is only the API that is POSIX-like
in style. The syntax and semantics of the regular expressions themselves are
still those of Perl, subject to the setting of various PCRE2 options, as
described below. "POSIX-like in style" means that the API approximates to the
POSIX definition; it is not fully POSIX-compatible, and in multi-unit encoding
domains it is probably even less compatible.
</P>
<P>
The descriptions below use the actual names of the functions, but, as described
above, the standard POSIX names (without the <b>pcre2_</b> prefix) may also be
used.
</P>
<br><a name="SEC4" href="#TOC1">COMPILING A PATTERN</a><br>
<P>
The function <b>pcre2_regcomp()</b> is called to compile a pattern into an
internal form. By default, the pattern is a C string terminated by a binary
zero (but see REG_PEND below). The <i>preg</i> argument is a pointer to a
<b>regex_t</b> structure that is used as a base for storing information about
the compiled regular expression. It is also used for input when REG_PEND is
set. The <b>regex_t</b> structure used by <b>pcre2_regcomp()</b> is defined in
<b>pcre2posix.h</b> and is not the same as the structure used by other libraries
that provide POSIX-style matching.
</P>
<P>
The argument <i>cflags</i> is either zero, or contains one or more of the bits
defined by the following macros:
<pre>
REG_DOTALL
</pre>
The PCRE2_DOTALL option is set when the regular expression is passed for
compilation to the native function. Note that REG_DOTALL is not part of the
POSIX standard.
<pre>
REG_ICASE
</pre>
The PCRE2_CASELESS option is set when the regular expression is passed for
compilation to the native function.
<pre>
REG_NEWLINE
</pre>
The PCRE2_MULTILINE option is set when the regular expression is passed for
compilation to the native function. Note that this does <i>not</i> mimic the
defined POSIX behaviour for REG_NEWLINE (see the following section).
<pre>
REG_NOSPEC
</pre>
The PCRE2_LITERAL option is set when the regular expression is passed for
compilation to the native function. This disables all meta characters in the
pattern, causing it to be treated as a literal string. The only other options
that are allowed with REG_NOSPEC are REG_ICASE, REG_NOSUB, REG_PEND, and
REG_UTF. Note that REG_NOSPEC is not part of the POSIX standard.
<pre>
REG_NOSUB
</pre>
When a pattern that is compiled with this flag is passed to
<b>pcre2_regexec()</b> for matching, the <i>nmatch</i> and <i>pmatch</i> arguments
are ignored, and no captured strings are returned. Versions of the PCRE2 library
prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this
no longer happens because it disables the use of backreferences.
<pre>
REG_PEND
</pre>
If this option is set, the <b>re_endp</b> field in the <i>preg</i> structure
(which has the type const char *) must be set to point to the character beyond
the end of the pattern before calling <b>pcre2_regcomp()</b>. The pattern itself
may now contain binary zeros, which are treated as data characters. Without
REG_PEND, a binary zero terminates the pattern and the <b>re_endp</b> field is
ignored. This is a GNU extension to the POSIX standard and should be used with
caution in software intended to be portable to other systems.
<pre>
REG_UCP
</pre>
The PCRE2_UCP option is set when the regular expression is passed for
compilation to the native function. This causes PCRE2 to use Unicode properties
when matching \d, \w, etc., instead of just recognizing ASCII values. Note
that REG_UCP is not part of the POSIX standard.
<pre>
REG_UNGREEDY
</pre>
The PCRE2_UNGREEDY option is set when the regular expression is passed for
compilation to the native function. Note that REG_UNGREEDY is not part of the
POSIX standard.
<pre>
REG_UTF
</pre>
The PCRE2_UTF option is set when the regular expression is passed for
compilation to the native function. This causes the pattern itself and all data
strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF
is not part of the POSIX standard.
</P>
<P>
In the absence of these flags, no options are passed to the native function.
This means that the regex is compiled with PCRE2 default semantics. In
particular, the way it handles newline characters in the subject string is the
Perl way, not the POSIX way. Note that setting PCRE2_MULTILINE has only
<i>some</i> of the effects specified for REG_NEWLINE. It does not affect the way
newlines are matched by the dot metacharacter (they are not) or by a negative
class such as [^a] (they are).
</P>
<P>
The yield of <b>pcre2_regcomp()</b> is zero on success, and non-zero otherwise.
The <i>preg</i> structure is filled in on success, and one other member of the
structure (as well as <i>re_endp</i>) is public: <i>re_nsub</i> contains the
number of capturing subpatterns in the regular expression. Various error codes
are defined in the header file.
</P>
<P>
NOTE: If the yield of <b>pcre2_regcomp()</b> is non-zero, you must not attempt
to use the contents of the <i>preg</i> structure. If, for example, you pass it
to <b>pcre2_regexec()</b>, the result is undefined and your program is likely to
crash.
</P>
<br><a name="SEC5" href="#TOC1">MATCHING NEWLINE CHARACTERS</a><br>
<P>
This area is not simple, because POSIX and Perl take different views of things.
It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was
never intended to be a POSIX engine. The following table lists the different
possibilities for matching newline characters in Perl and PCRE2:
<pre>
                          Default   Change with
  . matches newline          no     PCRE2_DOTALL
  newline matches [^a]       yes    not changeable
  $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
  $ matches \n in middle     no     PCRE2_MULTILINE
  ^ matches \n in middle     no     PCRE2_MULTILINE
</pre>
This is the equivalent table for a POSIX-compatible pattern matcher:
<pre>
                          Default   Change with
  . matches newline          yes    REG_NEWLINE
  newline matches [^a]       yes    REG_NEWLINE
  $ matches \n at end        no     REG_NEWLINE
  $ matches \n in middle     no     REG_NEWLINE
  ^ matches \n in middle     no     REG_NEWLINE
</pre>
This behaviour is not what happens when PCRE2 is called via its POSIX
API. By default, PCRE2's behaviour is the same as Perl's, except that there is
no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there
is no way to stop newline from matching [^a].
</P>
<P>
Default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
PCRE2_DOLLAR_ENDONLY when calling <b>pcre2_compile()</b> directly, but there is
no way to make PCRE2 behave exactly as for the REG_NEWLINE action. When using
the POSIX API, passing REG_NEWLINE to PCRE2's <b>pcre2_regcomp()</b> function
causes PCRE2_MULTILINE to be passed to <b>pcre2_compile()</b>, and REG_DOTALL
passes PCRE2_DOTALL. There is no way to pass PCRE2_DOLLAR_ENDONLY.
</P>
<br><a name="SEC6" href="#TOC1">MATCHING A PATTERN</a><br>
<P>
The function <b>pcre2_regexec()</b> is called to match a compiled pattern
<i>preg</i> against a given <i>string</i>, which is by default terminated by a
zero byte (but see REG_STARTEND below), subject to the options in <i>eflags</i>.
These can be:
<pre>
REG_NOTBOL
</pre>
The PCRE2_NOTBOL option is set when calling the underlying PCRE2 matching
function.
<pre>
REG_NOTEMPTY
</pre>
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 matching
function. Note that REG_NOTEMPTY is not part of the POSIX standard. However,
setting this option can give more POSIX-like behaviour in some situations.
<pre>
REG_NOTEOL
</pre>
The PCRE2_NOTEOL option is set when calling the underlying PCRE2 matching
function.
<pre>
REG_STARTEND
</pre>
When this option is set, the subject string starts at <i>string</i> +
<i>pmatch[0].rm_so</i> and ends at <i>string</i> + <i>pmatch[0].rm_eo</i>, which
should point to the first character beyond the string. There may be binary
zeros within the subject string, and indeed, using REG_STARTEND is the only
way to pass a subject string that contains a binary zero.
</P>
<P>
Whatever the value of <i>pmatch[0].rm_so</i>, the offsets of the matched string
and any captured substrings are still given relative to the start of
<i>string</i> itself. (Before PCRE2 release 10.30 these were given relative to
<i>string</i> + <i>pmatch[0].rm_so</i>, but this differs from other
implementations.)
</P>
<P>
This is a BSD extension, compatible with but not specified by IEEE Standard
1003.2 (POSIX.2), and should be used with caution in software intended to be
portable to other systems. Note that a non-zero <i>rm_so</i> does not imply
REG_NOTBOL; REG_STARTEND affects only the location and length of the string,
not how it is matched. REG_STARTEND cannot be combined with a NULL
<i>pmatch</i>; if both are given, the error REG_INVARG is returned.
</P>
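<P>
The following sketch shows one way REG_STARTEND might be used to search only
part of a buffer. It is written with the standard POSIX names, which
<b>pcre2posix.h</b> maps onto the <b>pcre2_</b> functions. The macro
USE_PCRE2_POSIX and the helper name <i>find_in_range</i> are inventions of this
example: without the macro the sketch falls back to a system <b>regex.h</b>
that is assumed to provide REG_STARTEND as well, so that it can be tried
without PCRE2 installed.
</P>

```c
#ifdef USE_PCRE2_POSIX
#include <pcre2posix.h>   /* link with -lpcre2-posix -lpcre2-8 */
#else
#include <regex.h>        /* assumption: the system header also has REG_STARTEND */
#endif
#include <stddef.h>

/* Hypothetical helper: search only bytes [start, end) of buf.
 * On success (return 0) *so and *eo receive the match offsets, which are
 * relative to buf itself, not to buf + start. Other return values are the
 * regcomp() or regexec() error code, for example REG_NOMATCH. */
int find_in_range(const char *pattern, const char *buf,
                  size_t start, size_t end, size_t *so, size_t *eo)
{
    regex_t re;
    regmatch_t pm[1];
    int rc = regcomp(&re, pattern, REG_EXTENDED);  /* REG_EXTENDED is zero in pcre2posix.h */
    if (rc != 0) return rc;                        /* re must not be used after a failure */

    pm[0].rm_so = (regoff_t)start;   /* input: first byte of the subject */
    pm[0].rm_eo = (regoff_t)end;     /* input: first byte beyond the subject */
    rc = regexec(&re, buf, 1, pm, REG_STARTEND);
    if (rc == 0) {
        *so = (size_t)pm[0].rm_so;   /* output: overwritten with the match span */
        *eo = (size_t)pm[0].rm_eo;
    }
    regfree(&re);
    return rc;
}
```

<P>
Because the subject length comes from <i>rm_eo</i> and not from a terminating
zero, the same call shape can be used for buffers that contain binary zeros.
</P>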
<P>
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
strings is returned. The <i>nmatch</i> and <i>pmatch</i> arguments of
<b>pcre2_regexec()</b> are ignored (except possibly as input for REG_STARTEND).
</P>
<P>
The value of <i>nmatch</i> may be zero, and the value of <i>pmatch</i> may be NULL
(unless REG_STARTEND is set); in both these cases no data about any matched
strings is returned.
</P>
<P>
Otherwise, the portion of the string that was matched, and also any captured
substrings, are returned via the <i>pmatch</i> argument, which points to an
array of <i>nmatch</i> structures of type <i>regmatch_t</i>, containing the
members <i>rm_so</i> and <i>rm_eo</i>. These contain the byte offset to the first
character of each substring and the offset to the first character after the end
of each substring, respectively. The 0th element of the vector relates to the
entire portion of <i>string</i> that was matched; subsequent elements relate to
the capturing subpatterns of the regular expression. Unused entries in the
array have both structure members set to -1.
</P>
<P>
<i>regmatch_t</i> as well as the <i>regoff_t</i> typedef it uses are defined in
<b>pcre2posix.h</b> and are not warranted to have the same size or layout as other
similarly named types from other libraries that provide POSIX-style matching.
</P>
<P>
A successful match yields a zero return; various error codes are defined in the
header file, of which REG_NOMATCH is the "expected" failure code.
</P>
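<P>
As a minimal sketch of reading the <i>pmatch</i> vector, the helper below copies
one captured substring out of the subject and treats an <i>rm_so</i> of -1 as
"group not set". The names <i>copy_group</i>, MAX_GROUPS and USE_PCRE2_POSIX are
assumptions of this example; without USE_PCRE2_POSIX it includes the system
<b>regex.h</b>, which behaves the same way for the simple patterns used here.
</P>

```c
#ifdef USE_PCRE2_POSIX
#include <pcre2posix.h>   /* link with -lpcre2-posix -lpcre2-8 */
#else
#include <regex.h>        /* fallback so the sketch builds without PCRE2 */
#endif
#include <string.h>

#define MAX_GROUPS 10     /* arbitrary size chosen for this sketch */

/* Hypothetical helper: copy capture group `group` (0 = whole match) of the
 * first match into out, truncating if necessary.
 * Returns the number of bytes copied, or -1 if the pattern does not compile,
 * the subject does not match, or the requested group did not participate. */
int copy_group(const char *pattern, const char *subject, size_t group,
               char *out, size_t out_size)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int len = -1;

    if (group >= MAX_GROUPS || out_size == 0) return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return -1;

    if (group <= re.re_nsub &&
        regexec(&re, subject, MAX_GROUPS, pm, 0) == 0 &&
        pm[group].rm_so != -1) {                 /* -1 marks an unused entry */
        size_t n = (size_t)(pm[group].rm_eo - pm[group].rm_so);
        if (n >= out_size) n = out_size - 1;     /* leave room for the zero */
        memcpy(out, subject + pm[group].rm_so, n);
        out[n] = '\0';
        len = (int)n;
    }
    regfree(&re);
    return len;
}
```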
<br><a name="SEC7" href="#TOC1">ERROR MESSAGES</a><br>
<P>
The <b>pcre2_regerror()</b> function maps a non-zero errorcode from either
<b>pcre2_regcomp()</b> or <b>pcre2_regexec()</b> to a printable message. If
<i>preg</i> is not NULL, the error should have arisen from the use of that
structure. A message terminated by a binary zero is placed in <i>errbuf</i>. If
the buffer is too short, only the first <i>errbuf_size</i> - 1 characters of the
error message are used. The yield of the function is the size of buffer needed
to hold the whole message, including the terminating zero. This value is
greater than <i>errbuf_size</i> if the message was truncated.
</P>
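<P>
The size returned by the error function can be used to allocate a buffer that
holds the whole message. The sketch below does this in two calls: a first call
with a one-byte buffer to learn the required size, and a second call into a
buffer of that size. The helper name <i>compile_checked</i> and the macro
USE_PCRE2_POSIX are assumptions of this example; without the macro the system
<b>regex.h</b> is used, which follows the same size convention.
</P>

```c
#ifdef USE_PCRE2_POSIX
#include <pcre2posix.h>   /* link with -lpcre2-posix -lpcre2-8 */
#else
#include <regex.h>        /* fallback so the sketch builds without PCRE2 */
#endif
#include <stdlib.h>

/* Hypothetical helper: compile a pattern and, on failure, hand back the
 * complete error text in *msg (allocated with malloc; the caller frees it).
 * Returns the regcomp() result. On success *msg is NULL and re must later
 * be released with regfree(). */
int compile_checked(regex_t *re, const char *pattern, int cflags, char **msg)
{
    char probe[1];
    int rc = regcomp(re, pattern, cflags);

    *msg = NULL;
    if (rc != 0) {
        /* First call: the buffer is too short, so only the size matters. */
        size_t needed = regerror(rc, re, probe, sizeof probe);
        *msg = malloc(needed);
        /* Second call: the buffer is now large enough for the full text. */
        if (*msg != NULL) regerror(rc, re, *msg, needed);
    }
    return rc;
}
```

<P>
After a failed compilation the structure is passed only to the error function,
never to the matching function.
</P>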
<br><a name="SEC8" href="#TOC1">MEMORY USAGE</a><br>
<P>
Compiling a regular expression causes memory to be allocated and associated
with the <i>preg</i> structure. The function <b>pcre2_regfree()</b> frees all
such memory, after which <i>preg</i> may no longer be used as a compiled
expression.
</P>
<br><a name="SEC9" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC10" href="#TOC1">REVISION</a><br>
<P>
Last updated: 27 November 2024
<br>
Copyright &copy; 1997-2024 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,110 @@
<html>
<head>
<title>pcre2sample specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2sample man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
PCRE2 SAMPLE PROGRAM
</b><br>
<P>
A simple, complete demonstration program to get you started with using PCRE2 is
supplied in the file <i>pcre2demo.c</i> in the <b>src</b> directory in the PCRE2
distribution. A listing of this program is given in the
<a href="pcre2demo.html"><b>pcre2demo</b></a>
documentation. If you do not have a copy of the PCRE2 distribution, you can
save this listing to re-create the contents of <i>pcre2demo.c</i>.
</P>
<P>
The demonstration program compiles the regular expression that is its
first argument, and matches it against the subject string in its second
argument. No PCRE2 options are set, and default character tables are used. If
matching succeeds, the program outputs the portion of the subject that matched,
together with the contents of any captured substrings.
</P>
<P>
If the -g option is given on the command line, the program then goes on to
check for further matches of the same regular expression in the same subject
string. The logic is a little bit tricky because of the possibility of matching
an empty string. Comments in the code explain what is going on.
</P>
<P>
The code in <b>pcre2demo.c</b> is an 8-bit program that uses the PCRE2 8-bit
library. It handles strings and characters that are stored in 8-bit code units.
By default, one character corresponds to one code unit, but if the pattern
starts with "(*UTF)", both it and the subject are treated as UTF-8 strings,
where characters may occupy multiple code units.
</P>
<P>
If PCRE2 is installed in the standard include and library directories for your
operating system, you should be able to compile the demonstration program using
a command like this:
<pre>
cc -o pcre2demo pcre2demo.c -lpcre2-8
</pre>
If PCRE2 is installed elsewhere, you may need to add additional options to the
command line. For example, on a Unix-like system that has PCRE2 installed in
<i>/usr/local</i>, you can compile the demonstration program using a command
like this:
<pre>
cc -o pcre2demo -I/usr/local/include pcre2demo.c -L/usr/local/lib -lpcre2-8
</pre>
Once you have built the demonstration program, you can run simple tests like
this:
<pre>
./pcre2demo 'cat|dog' 'the cat sat on the mat'
./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
</pre>
Note that there is a much more comprehensive test program, called
<a href="pcre2test.html"><b>pcre2test</b>,</a>
which supports many more facilities for testing regular expressions using all
three PCRE2 libraries (8-bit, 16-bit, and 32-bit, though not all three need be
installed). The
<a href="pcre2demo.html"><b>pcre2demo</b></a>
program is provided as a relatively simple coding example.
</P>
<P>
If you try to run
<a href="pcre2demo.html"><b>pcre2demo</b></a>
when PCRE2 is not installed in the standard library directory, you may get an
error like this on some operating systems (e.g. Solaris):
<pre>
ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file or directory
</pre>
This is caused by the way shared library support works on those systems. You
need to add
<pre>
-R/usr/local/lib
</pre>
(for example) to the compile command to get round this problem.
</P>
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 14 November 2023
<br>
Copyright &copy; 1997-2016 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,212 @@
<html>
<head>
<title>pcre2serialize specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2serialize man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS</a>
<li><a name="TOC2" href="#SEC2">SECURITY CONCERNS</a>
<li><a name="TOC3" href="#SEC3">SAVING COMPILED PATTERNS</a>
<li><a name="TOC4" href="#SEC4">RE-USING PRECOMPILED PATTERNS</a>
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
<li><a name="TOC6" href="#SEC6">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS</a><br>
<P>
<b>int32_t pcre2_serialize_decode(pcre2_code **<i>codes</i>,</b>
<b> int32_t <i>number_of_codes</i>, const uint8_t *<i>bytes</i>,</b>
<b> pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>int32_t pcre2_serialize_encode(const pcre2_code **<i>codes</i>,</b>
<b> int32_t <i>number_of_codes</i>, uint8_t **<i>serialized_bytes</i>,</b>
<b> PCRE2_SIZE *<i>serialized_size</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_serialize_free(uint8_t *<i>bytes</i>);</b>
<br>
<br>
<b>int32_t pcre2_serialize_get_number_of_codes(const uint8_t *<i>bytes</i>);</b>
<br>
<br>
If you are running an application that uses a large number of regular
expression patterns, it may be useful to store them in a precompiled form
instead of having to compile them every time the application is run. However,
if you are using the just-in-time optimization feature, it is not possible to
save and reload the JIT data, because it is position-dependent. The host on
which the patterns are reloaded must be running the same version of PCRE2, with
the same code unit width, and must also have the same endianness, pointer width
and PCRE2_SIZE type. For example, patterns compiled on a 32-bit system using
PCRE2's 16-bit library cannot be reloaded on a 64-bit system, nor can they be
reloaded using the 8-bit library.
</P>
<P>
Note that "serialization" in PCRE2 does not convert compiled patterns to an
abstract format like Java or .NET serialization. The serialized output is
really just a bytecode dump, which is why it can only be reloaded in the same
environment as the one that created it. Hence the restrictions mentioned above.
Applications that are not statically linked with a fixed version of PCRE2 must
be prepared to recompile patterns from their sources, in order to be immune to
PCRE2 upgrades.
</P>
<br><a name="SEC2" href="#TOC1">SECURITY CONCERNS</a><br>
<P>
The facility for saving and restoring compiled patterns is intended for use
within individual applications. As such, the data supplied to
<b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from
arbitrary external sources. There is only some simple consistency checking, not
complete validation of what is being re-loaded. Corrupted data may cause
undefined results. For example, if the length field of a pattern in the
serialized data is corrupted, the deserializing code may read beyond the end of
the byte stream that is passed to it.
</P>
<br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
<P>
Before compiled patterns can be saved they must be serialized, which in PCRE2
means converting the pattern to a stream of bytes. A single byte stream may
contain any number of compiled patterns, but they must all use the same
character tables. A single copy of the tables is included in the byte stream
(its size is 1088 bytes). For more details of character tables, see the
<a href="pcre2api.html#localesupport">section on locale support</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<P>
The function <b>pcre2_serialize_encode()</b> creates a serialized byte stream
from a list of compiled patterns. Its first two arguments specify the list,
being a pointer to a vector of pointers to compiled patterns, and the length of
the vector. The third and fourth arguments point to variables which are set to
point to the created byte stream and its length, respectively. The final
argument is a pointer to a general context, which can be used to specify custom
memory management functions. If this argument is NULL, <b>malloc()</b> is used
to obtain memory for the byte stream. The yield of the function is the number
of serialized patterns, or one of the following negative error codes:
<pre>
  PCRE2_ERROR_BADDATA      the number of patterns is zero or less
  PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns
  PCRE2_ERROR_NOMEMORY     memory allocation failed
  PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables
  PCRE2_ERROR_NULL         the 1st, 3rd, or 4th argument is NULL
</pre>
PCRE2_ERROR_BADMAGIC means either that a pattern's code has been corrupted, or
that a slot in the vector does not point to a compiled pattern.
</P>
<P>
Once a set of patterns has been serialized you can save the data in any
appropriate manner. Here is sample code that compiles two patterns and writes
them to a file. It assumes that the variable <i>fd</i> refers to a file that is
open for output. The error checking that should be present in a real
application has been omitted for simplicity.
<pre>
  int errorcode;
  uint8_t *bytes;
  PCRE2_SIZE erroroffset;
  PCRE2_SIZE bytescount;
  pcre2_code *list_of_codes[2];
  list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern",
    PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
  list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern",
    PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
  errorcode = pcre2_serialize_encode((const pcre2_code **)list_of_codes, 2,
    &bytes, &bytescount, NULL);
  errorcode = fwrite(bytes, 1, bytescount, fd);
</pre>
Note that the serialized data is binary data that may contain any of the 256
possible byte values. On systems that make a distinction between binary and
non-binary data, be sure that the file is opened for binary output.
</P>
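<P>
The file handling itself needs nothing from PCRE2. As a sketch, the two helpers
below write a byte stream to an open stream and read one back into memory,
preserving every byte value including zeros. The names <i>write_bytes</i> and
<i>read_bytes</i> are assumptions of this example; the caller is expected to
have opened the file in binary mode, for example with "wb" or "rb".
</P>

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write count bytes to a stream opened for binary
 * output. Returns 0 on success, -1 if fewer bytes were written. */
int write_bytes(FILE *out, const uint8_t *bytes, size_t count)
{
    return fwrite(bytes, 1, count, out) == count ? 0 : -1;
}

/* Hypothetical helper: read the rest of a binary stream into a block
 * obtained from malloc(). Returns the block and stores its length in
 * *count, or returns NULL on a read or allocation failure. */
uint8_t *read_bytes(FILE *in, size_t *count)
{
    size_t cap = 4096, len = 0, got;
    uint8_t *buf = malloc(cap);

    if (buf == NULL) return NULL;
    while ((got = fread(buf + len, 1, cap - len, in)) > 0) {
        len += got;
        if (len == cap) {                       /* buffer full: double it */
            uint8_t *grown = realloc(buf, cap * 2);
            if (grown == NULL) { free(buf); return NULL; }
            buf = grown;
            cap *= 2;
        }
    }
    if (ferror(in)) { free(buf); return NULL; }
    *count = len;
    return buf;
}
```

<P>
The block returned by <i>read_bytes</i> belongs to the application and is
released with <b>free()</b>, whereas a stream created by the serializing
function is released with <b>pcre2_serialize_free()</b>.
</P>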
<P>
Serializing a set of patterns leaves the original data untouched, so they can
still be used for matching. Their memory must eventually be freed in the usual
way by calling <b>pcre2_code_free()</b>. When you have finished with the byte
stream, it too must be freed by calling <b>pcre2_serialize_free()</b>. If this
function is called with a NULL argument, it returns immediately without doing
anything.
</P>
<br><a name="SEC4" href="#TOC1">RE-USING PRECOMPILED PATTERNS</a><br>
<P>
In order to re-use a set of saved patterns you must first make the serialized
byte stream available in main memory (for example, by reading from a file). The
management of this memory block is up to the application. You can use the
<b>pcre2_serialize_get_number_of_codes()</b> function to find out how many
compiled patterns are in the serialized data without actually decoding the
patterns:
<pre>
uint8_t *bytes = &#60;serialized data&#62;;
int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);
</pre>
The <b>pcre2_serialize_decode()</b> function reads a byte stream and recreates
the compiled patterns in new memory blocks, setting pointers to them in a
vector. The first two arguments are a pointer to a suitable vector and its
length, and the third argument points to a byte stream. The final argument is a
pointer to a general context, which can be used to specify custom memory
management functions for the decoded patterns. If this argument is NULL,
<b>malloc()</b> and <b>free()</b> are used. After deserialization, the byte
stream is no longer needed and can be discarded.
<pre>
  pcre2_code *list_of_codes[2];
  uint8_t *bytes = &#60;serialized data&#62;;
  int32_t number_of_codes =
    pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);
</pre>
If the vector is not large enough for all the patterns in the byte stream, it
is filled with those that fit, and the remainder are ignored. The yield of the
function is the number of decoded patterns, or one of the following negative
error codes:
<pre>
  PCRE2_ERROR_BADDATA            second argument is zero or less
  PCRE2_ERROR_BADMAGIC           mismatch of id bytes in the data
  PCRE2_ERROR_BADMODE            mismatch of code unit size or PCRE2 version
  PCRE2_ERROR_BADSERIALIZEDDATA  other sanity check failure
  PCRE2_ERROR_NOMEMORY           memory allocation failed
  PCRE2_ERROR_NULL               first or third argument is NULL
</pre>
PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was compiled
on a system with different endianness.
</P>
<P>
Decoded patterns can be used for matching in the usual way, and must be freed
by calling <b>pcre2_code_free()</b>. However, be aware that there is a potential
race issue if you are using multiple patterns that were decoded from a single
byte stream in a multithreaded application. A single copy of the character
tables is used by all the decoded patterns and a reference count is used to
arrange for its memory to be automatically freed when the last pattern is
freed, but there is no locking on this reference count. Therefore, if you want
to call <b>pcre2_code_free()</b> for these patterns in different threads, you
must arrange your own locking, and ensure that <b>pcre2_code_free()</b> cannot
be called by two threads at the same time.
</P>
<P>
If a pattern was processed by <b>pcre2_jit_compile()</b> before being
serialized, the JIT data is discarded and so is no longer available after a
save/restore cycle. You can, however, process a restored pattern with
<b>pcre2_jit_compile()</b> if you wish.
</P>
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
Last updated: 19 January 2024
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>

@@ -0,0 +1,754 @@
<html>
<head>
<title>pcre2syntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2syntax man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">BRACED ITEMS</a>
<li><a name="TOC4" href="#SEC4">ESCAPED CHARACTERS</a>
<li><a name="TOC5" href="#SEC5">CHARACTER TYPES</a>
<li><a name="TOC6" href="#SEC6">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC8" href="#SEC8">BINARY PROPERTIES FOR \p AND \P</a>
<li><a name="TOC9" href="#SEC9">SCRIPT MATCHING WITH \p AND \P</a>
<li><a name="TOC10" href="#SEC10">THE BIDI_CLASS PROPERTY FOR \p AND \P</a>
<li><a name="TOC11" href="#SEC11">CHARACTER CLASSES</a>
<li><a name="TOC12" href="#SEC12">PERL EXTENDED CHARACTER CLASSES</a>
<li><a name="TOC13" href="#SEC13">QUANTIFIERS</a>
<li><a name="TOC14" href="#SEC14">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC15" href="#SEC15">REPORTED MATCH POINT SETTING</a>
<li><a name="TOC16" href="#SEC16">ALTERNATION</a>
<li><a name="TOC17" href="#SEC17">CAPTURING</a>
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPS</a>
<li><a name="TOC19" href="#SEC19">COMMENT</a>
<li><a name="TOC20" href="#SEC20">OPTION SETTING</a>
<li><a name="TOC21" href="#SEC21">NEWLINE CONVENTION</a>
<li><a name="TOC22" href="#SEC22">WHAT \R MATCHES</a>
<li><a name="TOC23" href="#SEC23">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC24" href="#SEC24">NON-ATOMIC LOOKAROUND ASSERTIONS</a>
<li><a name="TOC25" href="#SEC25">SUBSTRING SCAN ASSERTION</a>
<li><a name="TOC26" href="#SEC26">SCRIPT RUNS</a>
<li><a name="TOC27" href="#SEC27">BACKREFERENCES</a>
<li><a name="TOC28" href="#SEC28">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC29" href="#SEC29">CONDITIONAL PATTERNS</a>
<li><a name="TOC30" href="#SEC30">BACKTRACKING CONTROL</a>
<li><a name="TOC31" href="#SEC31">CALLOUTS</a>
<li><a name="TOC32" href="#SEC32">REPLACEMENT STRINGS</a>
<li><a name="TOC33" href="#SEC33">SEE ALSO</a>
<li><a name="TOC34" href="#SEC34">AUTHOR</a>
<li><a name="TOC35" href="#SEC35">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expression patterns that are
supported by PCRE2 are described in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation. This document contains a quick-reference summary of the pattern
syntax, followed by the syntax of replacement strings for the substitution function.
The full description of the latter is in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</pre>
Note that white space inside \Q...\E is always treated as literal, even if
PCRE2_EXTENDED is set, causing most other white space to be ignored. Note also
that PCRE2's handling of \Q...\E has some differences from Perl's. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation for details.
</P>
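<P>
As a rough illustration of what \Q...\E quoting means, the following Python
sketch rewrites a quoted span into individually escaped characters. The helper
name <b>expand_quoted</b> is hypothetical, Python's <b>re</b> module stands in
for a regex engine, and the sketch assumes that a \Q with no closing \E runs to
the end of the pattern. It does not model the Perl differences mentioned above.
</P>

```python
import re


def expand_quoted(pattern: str) -> str:
    """Rewrite \\Q...\\E spans as escaped literals (hypothetical helper)."""
    out = []
    i = 0
    n = len(pattern)
    while i < n:
        if pattern.startswith("\\Q", i):
            # Find the closing \E; assume an unterminated \Q runs to the end.
            end = pattern.find("\\E", i + 2)
            if end == -1:
                end = n
            out.append(re.escape(pattern[i + 2:end]))
            i = end + 2
        elif pattern[i] == "\\" and i + 1 < n:
            # Copy any other backslash escape through unchanged.
            out.append(pattern[i:i + 2])
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


expanded = expand_quoted(r"a\Q.*\Eb")
```

<P>
With this sketch, the expanded pattern matches the literal text "a.*b" but not
"axxb", because the dot and star inside the quoted span lose their special
meaning.
</P>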
<br><a name="SEC3" href="#TOC1">BRACED ITEMS</a><br>
<P>
With one exception, wherever brace characters { and } are required to enclose
data for constructions such as \g{2} or \k{name}, space and/or horizontal tab
characters that follow { or precede } are allowed and are ignored. In the case
of quantifiers, they may also appear before or after the comma. The exception
is \u{...} which is not Perl-compatible and is recognized only when
PCRE2_EXTRA_ALT_BSUX is set. This is an ECMAScript compatibility feature, and
follows ECMAScript's behaviour.
</P>
<br><a name="SEC4" href="#TOC1">ESCAPED CHARACTERS</a><br>
<P>
This table applies to ASCII and Unicode environments. An unrecognized escape
sequence causes an error.
<pre>
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is a non-control ASCII character
\e escape (hex 1B)
\f form feed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\0dd character with octal code 0dd
\ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd..
\N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
\xhh character with hex code hh
\x{hh..} character with hex code hh..
</pre>
\N{U+hh..} is synonymous with \x{hh..} but is not supported in environments
that use EBCDIC code (mainly IBM mainframes). Note that \N not followed by an
opening curly bracket has a different meaning (see below).
</P>
<P>
If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the
following are also recognized:
<pre>
\U the character "U"
\uhhhh character with hex code hhhh
\u{hh..} character with hex code hh.. but only for EXTRA_ALT_BSUX
</pre>
When \x is not followed by {, one or two hexadecimal digits are read,
but in ALT_BSUX mode \x must be followed by two hexadecimal digits to be
recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits
or (in EXTRA_ALT_BSUX mode) a sequence of hex digits in curly brackets, it
matches a literal "u".
</P>
<P>
Note that \0dd is always an octal code. The treatment of backslash followed by
a non-zero digit is complicated; for details see the section
<a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation, where details of escape processing in EBCDIC environments are
also given.
</P>
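<P>
A few of the escapes in the table can be checked with a short Python sketch.
Python's <b>re</b> module is used here only as a stand-in that happens to share
these particular escapes with PCRE2; it does not accept \e, \o{...},
\x{...}, or \N{U+...}, so those are left out.
</P>

```python
import re

# Pattern text on the left, the single character it should match on the right.
ESCAPES = {
    r"\a": "\x07",    # alarm (BEL)
    r"\f": "\x0c",    # form feed
    r"\n": "\x0a",    # newline
    r"\r": "\x0d",    # carriage return
    r"\t": "\x09",    # tab
    r"\011": "\x09",  # \0dd: octal code with a leading zero
    r"\101": "A",     # \ddd: three octal digits (no group 101 exists)
    r"\x41": "A",     # \xhh: two hexadecimal digits
}

results = {p: re.fullmatch(p, c) is not None for p, c in ESCAPES.items()}
```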
<br><a name="SEC5" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one code unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
  \H         a character that is not a horizontal white space character
  \N         a character that is not a newline
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a white space character
  \S         a character that is not a white space character
  \v         a vertical white space character
  \V         a character that is not a vertical white space character
  \w         a "word" character
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster
</pre>
\C is dangerous because it may leave the current matching point in the middle
of a UTF-8 or UTF-16 character. The application can lock out the use of \C by
setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
with the use of \C permanently disabled.
</P>
<P>
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \s and \w may also match characters with code points in the range
128-255. If the PCRE2_UCP option is set, the behaviour of these escape
sequences is changed to use Unicode properties and they match many more
characters, but there are some option settings that can restrict individual
sequences to matching only ASCII characters.
</P>
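<P>
The difference between ASCII-only and Unicode-aware \d can be sketched in
Python. This is an analogy rather than PCRE2 itself: Python's <b>re.ASCII</b>
flag is assumed to play the role of the PCRE2 default, and Python's default
Unicode matching is assumed to play the role of PCRE2_UCP.
</P>

```python
import re

ARABIC_INDIC_THREE = "\u0663"  # a decimal digit (Nd) outside ASCII

# ASCII-only \d, comparable to the PCRE2 default.
ascii_only = re.fullmatch(r"\d", ARABIC_INDIC_THREE, re.ASCII)

# Unicode-aware \d, comparable to matching with PCRE2_UCP set.
unicode_aware = re.fullmatch(r"\d", ARABIC_INDIC_THREE)
```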
<P>
Property descriptions in \p and \P are matched caselessly; hyphens,
underscores, and ASCII white space characters are ignored, in accordance with
Unicode's "loose matching" rules. For example, \p{Bidi_Class=al} is the same
as \p{ bidi class = AL }.
</P>
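<P>
The loose matching rule can be pictured as a normalization step applied before
names are compared. The Python function below is a hypothetical sketch of that
idea, not PCRE2's implementation: it drops hyphens, underscores, and ASCII
white space, then lower-cases the result.
</P>

```python
def loose_key(name: str) -> str:
    """Normalize a property description for loose comparison (sketch)."""
    ignored = "-_ \t\n\v\f\r"  # hyphen, underscore, ASCII white space
    return "".join(ch for ch in name if ch not in ignored).lower()


same = loose_key("Bidi_Class=al") == loose_key(" bidi class = AL ")
```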
<br><a name="SEC6" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate
L Letter
Lc Cased letter, the union of Ll, Lu, and Lt
L& Synonym of Lc
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter
M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark
N Number
Nd Decimal number
Nl Letter number
No Other number
P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation
S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol
Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator
</pre>
From release 10.45, when caseless matching is set, Ll, Lu, and Lt are all
equivalent to Lc.
</P>
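<P>
To see which general category a character belongs to, Python's standard
<b>unicodedata</b> module reports the same two-letter codes as the table above.
The sketch below is only a lookup aid; the helper <b>is_cased_letter</b> is a
hypothetical name that mirrors the definition of Lc as the union of Ll, Lu, and
Lt.
</P>

```python
import unicodedata


def is_cased_letter(ch: str) -> bool:
    """True if ch is in Ll, Lu, or Lt, i.e. the Lc union (sketch)."""
    return unicodedata.category(ch) in {"Ll", "Lu", "Lt"}


categories = {ch: unicodedata.category(ch) for ch in "Aa5_$ "}
```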
<br><a name="SEC7" href="#TOC1">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
</pre>
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18.
</P>
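<P>
The definitions of Xan, Xsp, and Xwd translate directly into small predicates.
The following Python sketch uses the standard <b>unicodedata</b> module; the
function names are hypothetical and the code only restates the table, it is not
how PCRE2 implements these properties.
</P>

```python
import unicodedata


def is_xan(ch: str) -> bool:
    """Alphanumeric: general category L or N."""
    return unicodedata.category(ch)[0] in "LN"


def is_xsp(ch: str) -> bool:
    """Perl/POSIX space: category Z, or tab, NL, VT, FF, CR."""
    return unicodedata.category(ch)[0] == "Z" or ch in "\t\n\v\f\r"


def is_xwd(ch: str) -> bool:
    """Perl word: Xan or underscore."""
    return is_xan(ch) or ch == "_"
```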
<br><a name="SEC8" href="#TOC1">BINARY PROPERTIES FOR \p AND \P</a><br>
<P>
Unicode defines a number of binary properties, that is, properties whose only
values are true or false. You can obtain a list of those that are recognized by
\p and \P, along with their abbreviations, by running this command:
<pre>
pcre2test -LP
</PRE>
</P>
<br><a name="SEC9" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a><br>
<P>
Many script names and their 4-letter abbreviations are recognized in
\p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P of
course). You can obtain a list of these scripts by running this command:
<pre>
pcre2test -LS
</PRE>
</P>
<br><a name="SEC10" href="#TOC1">THE BIDI_CLASS PROPERTY FOR \p AND \P</a><br>
<P>
<pre>
\p{Bidi_Class:&#60;class&#62;} matches a character with the given class
\p{BC:&#60;class&#62;} matches a character with the given class
</pre>
The recognized classes are:
<pre>
AL Arabic letter
AN Arabic number
B paragraph separator
BN boundary neutral
CS common separator
EN European number
ES European separator
ET European terminator
FSI first strong isolate
L left-to-right
LRE left-to-right embedding
LRI left-to-right isolate
LRO left-to-right override
NSM non-spacing mark
ON other neutral
PDF pop directional format
PDI pop directional isolate
R right-to-left
RLE right-to-left embedding
RLI right-to-left isolate
RLO right-to-left override
S segment separator
WS white space
</PRE>
</P>
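<P>
The class codes listed above are the standard Unicode Bidi_Class values, so
they can be inspected with Python's <b>unicodedata.bidirectional()</b>. This
sketch is only a way to look up the class of a character when writing a
\p{BC:...} item.
</P>

```python
import unicodedata

samples = {
    "A": "Latin capital letter",
    "\u05d0": "Hebrew letter alef",
    "\u0627": "Arabic letter alef",
    "1": "ASCII digit",
    "\u0663": "Arabic-Indic digit three",
    " ": "space",
}

bidi_classes = {ch: unicodedata.bidirectional(ch) for ch in samples}
```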
<br><a name="SEC11" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE2, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE2_UCP is set. You can use
\Q...\E inside a character class.
</P>
<P>
When PCRE2_ALT_EXTENDED_CLASS is set, UTS#18 extended character classes may be
used, allowing nested character classes, combined using set operators.
<pre>
[x&&[^y]] UTS#18 extended character class
x||y set union (OR)
x&&y set intersection (AND)
x--y set difference (AND NOT)
x~~y set symmetric difference (XOR)
</PRE>
</P>
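<P>
The four set operators behave like ordinary set algebra over characters. The
Python sketch below models character classes as sets to show what each operator
selects; it is an analogy only and does not call PCRE2. Note that x&&[^y]
selects the same characters as x--y.
</P>

```python
import string

lower = set(string.ascii_lowercase)   # stands for [a-z]
vowels = set("aeiou")                 # stands for [aeiou]
digits = set(string.digits)           # stands for [0-9]

union = vowels | digits               # [aeiou]||[0-9]
intersection = lower & set("xyz123")  # [a-z]&&[xyz123]
difference = lower - vowels           # [a-z]--[aeiou], same as [a-z]&&[^aeiou]
symmetric = set("abc") ^ set("bcd")   # [abc]~~[bcd]
```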
<br><a name="SEC12" href="#TOC1">PERL EXTENDED CHARACTER CLASSES</a><br>
<P>
<pre>
  (?[...])                Perl extended character class
  (?[\p{Thai} & \p{Nd}])  operators; whitespace ignored
  (?[(x - y) & z])        parentheses for grouping

  (?[ [^3] & \p{Nd} ])    [...] is a nested ordinary class
  (?[ [:alpha:] - [z] ])  POSIX set is allowed outside [...]
  (?[ \d - [3] ])         backslash-escaped set is allowed outside [...]
  (?[ !\n & [:ascii:] ])  backslash-escaped character is allowed outside [...]
                      all other characters or ranges must be enclosed in [...]

  x|y, x+y                set union (OR)
  x&y                     set intersection (AND)
  x-y                     set difference (AND NOT)
  x^y                     set symmetric difference (XOR)
  !x                      set complement (NOT)
</pre>
Inside a Perl extended character class, [...] switches mode to be interpreted
as an ordinary character class. Outside of a nested [...], the only items
permitted are backslash-escapes, POSIX sets, operators, and parentheses. Inside
a nested ordinary class, ^ has its usual meaning (inverts the class when used
as the first character); outside of a nested class, ^ is the XOR operator.
</P>
<br><a name="SEC13" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
? 0 or 1, greedy
?+ 0 or 1, possessive
?? 0 or 1, lazy
* 0 or more, greedy
*+ 0 or more, possessive
*? 0 or more, lazy
+ 1 or more, greedy
++ 1 or more, possessive
+? 1 or more, lazy
{n} exactly n
{n,m} at least n, no more than m, greedy
{n,m}+ at least n, no more than m, possessive
{n,m}? at least n, no more than m, lazy
{n,} n or more, greedy
{n,}+ n or more, possessive
{n,}? n or more, lazy
{,m} zero up to m, greedy
{,m}+ zero up to m, possessive
{,m}? zero up to m, lazy
</PRE>
</P>
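<P>
The difference between greedy and lazy quantifiers is easiest to see on a
concrete subject. The sketch below uses Python's <b>re</b> module, which shares
the greedy and lazy forms with PCRE2; possessive quantifiers are omitted
because they are not available in every Python version.
</P>

```python
import re

subject = "<a><b>"

# Greedy: take as much as possible, then give back only if needed.
greedy = re.search(r"<.+>", subject).group()

# Lazy: take as little as possible, then extend only if needed.
lazy = re.search(r"<.+?>", subject).group()

# The same contrast with a bounded repeat.
bounded_greedy = re.match(r"a{1,3}", "aaaa").group()
bounded_lazy = re.match(r"a{1,3}?", "aaaa").group()
```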
<br><a name="SEC14" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
                also after an internal newline in multiline mode
                (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
  \A          start of subject
  $           end of subject
                also before newline at end of subject
                also before internal newline in multiline mode
  \Z          end of subject
                also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
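<P>
The effect of multiline mode on ^ and $ can be sketched with Python's
<b>re</b> module, which treats ^, $, \A, and \b the same way for these cases.
\Z and \z are deliberately not used here, because Python's \Z corresponds to
PCRE2's \z and has no equivalent of PCRE2's \Z.
</P>

```python
import re

subject = "one\ntwo"

# Without multiline mode, ^ matches only at the start of the subject.
single_line = re.findall(r"^\w+", subject)

# In multiline mode, ^ also matches after an internal newline.
multi_line = re.findall(r"^\w+", subject, re.MULTILINE)

# \A is unaffected by multiline mode.
start_only = re.findall(r"\A\w+", subject, re.MULTILINE)

# $ also matches before a newline at the very end of the subject.
before_final_newline = re.search(r"two$", "one\ntwo\n") is not None

# \b matches only where a word character meets a non-word character.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
```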
<br><a name="SEC15" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
<P>
<pre>
\K set reported start of match
</pre>
From release 10.38 \K is not permitted by default in lookaround assertions,
for compatibility with Perl. However, if the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
option is set, the previous behaviour is re-enabled. When this option is set,
\K is honoured in positive assertions, but ignored in negative ones.
</P>
<br><a name="SEC16" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
expr|expr|expr...
</PRE>
</P>
<br><a name="SEC17" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capture group
  (?&#60;name&#62;...)    named capture group (Perl)
  (?'name'...)    named capture group (Perl)
  (?P&#60;name&#62;...)   named capture group (Python)
  (?:...)         non-capture group
  (?|...)         non-capture group; reset group numbers for
                   capture groups in each alternative
</pre>
In non-UTF modes, names may contain underscores and ASCII letters and digits;
in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In
both cases, a name must not start with a digit.
</P>
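<P>
As a small illustration of named and non-capture groups, the following sketch
uses the Python form of named groups, which PCRE2 also accepts according to the
table above. Python's <b>re</b> module is the engine here, so the Perl forms
and (?|...) are not shown.
</P>

```python
import re

# Two named capture groups and one non-capture group in between.
pattern = r"(?P<year>\d{4})-(?:\d{2})-(?P<day>\d{2})"
match = re.match(pattern, "2024-11-27")

numbered = match.groups()    # only the capture groups are numbered
named = match.groupdict()    # names map to the captured text
```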
<br><a name="SEC18" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
(?&#62;...) atomic non-capture group
(*atomic:...) atomic non-capture group
</PRE>
</P>
<br><a name="SEC19" href="#TOC1">COMMENT</a><br>
<P>
<pre>
(?#....) comment (not nestable)
</PRE>
</P>
<br><a name="SEC20" href="#TOC1">OPTION SETTING</a><br>
<P>
Changes of these options within a group are automatically cancelled at the end
of the group.
<pre>
(?a) all ASCII options
(?aD) restrict \d to ASCII in UCP mode
(?aS) restrict \s to ASCII in UCP mode
(?aW) restrict \w to ASCII in UCP mode
(?aP) restrict all POSIX classes to ASCII in UCP mode
(?aT) restrict POSIX digit classes to ASCII in UCP mode
(?i) caseless
(?J) allow duplicate named groups
(?m) multiline
(?n) no auto capture
(?r) restrict caseless to either ASCII or non-ASCII
(?s) single line (dotall)
(?U) default ungreedy (lazy)
(?x) ignore white space except in classes or \Q...\E
(?xx) as (?x) but also ignore space and tab in classes
(?-...) unset the given option(s)
(?^) unset imnrsx options
</pre>
(?aP) implies (?aT) as well, though this has no additional effect. However, it
means that (?-aP) also implies (?-aT) and disables all ASCII restrictions for
POSIX classes.
</P>
<P>
Unsetting x or xx unsets both. Several options may be set at once, and a
mixture of setting and unsetting such as (?i-x) is allowed, but there may be
only one hyphen. Setting (but not unsetting) is allowed after (?^, for example
(?^in). An option setting may appear at the start of a non-capture group, for
example (?i:...).
</P>
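<P>
The scoping of an option set at the start of a non-capture group can be
sketched as follows. Python's <b>re</b> module accepts the same (?i:...) form,
so it is used as a stand-in; the other option letters in the table differ
between engines and are not exercised here.
</P>

```python
import re

pattern = r"(?i:abc)def"  # caseless applies only inside the group

# "ABC" may be in any case, but "def" must match exactly.
inside_scope = re.fullmatch(pattern, "ABCdef") is not None
outside_scope = re.fullmatch(pattern, "ABCDEF") is not None
```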
<P>
The following are recognized only at the very start of a pattern or after one
of the newline or \R sequences or options with similar syntax. More than one
of them may appear. For the first three, d is a decimal number.
<pre>
(*LIMIT_DEPTH=d) set the backtracking limit to d
(*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
(*LIMIT_MATCH=d) set the match limit to d
(*CASELESS_RESTRICT) set PCRE2_EXTRA_CASELESS_RESTRICT when matching
(*NOTEMPTY) set PCRE2_NOTEMPTY when matching
(*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
(*NO_JIT) disable JIT optimization
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
(*TURKISH_CASING) set PCRE2_EXTRA_TURKISH_CASING when matching
(*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of
the limits set by the caller of <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>,
not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
application can lock out the use of (*UTF) and (*UCP) by setting the
PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
</P>
<br><a name="SEC21" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
(*CR) carriage return only
(*LF) linefeed only
(*CRLF) carriage return followed by linefeed
(*ANYCRLF) all three of the above
(*ANY) any Unicode newline sequence
(*NUL) the NUL character (binary zero)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
(*BSR_ANYCRLF) CR, LF, or CRLF
(*BSR_UNICODE) any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)                     )
  (*pla:...)                  ) positive lookahead
  (*positive_lookahead:...)   )

  (?!...)                     )
  (*nla:...)                  ) negative lookahead
  (*negative_lookahead:...)   )

  (?&#60;=...)                    )
  (*plb:...)                  ) positive lookbehind
  (*positive_lookbehind:...)  )

  (?&#60;!...)                    )
  (*nlb:...)                  ) negative lookbehind
  (*negative_lookbehind:...)  )
</pre>
Each top-level branch of a lookbehind must have a limit for the number of
characters it matches. If any branch can match a variable number of characters,
the maximum for each branch is limited to a value set by the caller of
<b>pcre2_compile()</b> or defaulted. The default is set when PCRE2 is built
(ultimate default 255). If every branch matches a fixed number of characters,
the limit for each branch is 65535 characters.
</P>
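<P>
The four basic assertions can be compared side by side in a short sketch. It
uses Python's <b>re</b> module with the punctuation forms only; the alphabetic
synonyms such as (*pla:...) are PCRE2-specific and are not accepted by Python.
Python also requires fixed-length lookbehinds, which is stricter than the
limits described above.
</P>

```python
import re

prices = "10 USD, 20 EUR"
amounts = "$5 and 7"

# Positive lookahead: digits that are followed by " USD".
followed_by_usd = re.findall(r"\d+(?= USD)", prices)

# Negative lookahead: whole numbers that are not followed by " USD".
not_followed_by_usd = re.findall(r"\b\d+\b(?! USD)", prices)

# Positive lookbehind: digits that are preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", amounts)

# Negative lookbehind: numbers that are not preceded by a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", amounts)
```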
<br><a name="SEC24" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a><br>
<P>
These assertions are specific to PCRE2 and are not Perl-compatible.
<pre>
  (?*...)                                )
  (*napla:...)                           ) synonyms
  (*non_atomic_positive_lookahead:...)   )

  (?&#60;*...)                               )
  (*naplb:...)                           ) synonyms
  (*non_atomic_positive_lookbehind:...)  )
</PRE>
</P>
<br><a name="SEC25" href="#TOC1">SUBSTRING SCAN ASSERTION</a><br>
<P>
This feature is not Perl-compatible.
<pre>
(*scan_substring:(grouplist)...) scan captured substring
(*scs:(grouplist)...) scan captured substring
</pre>
The comma-separated list may identify groups in any of the following ways:
<pre>
n absolute reference
+n relative reference
-n relative reference
&#60;name&#62; name
'name' name
</PRE>
</P>
<br><a name="SEC26" href="#TOC1">SCRIPT RUNS</a><br>
<P>
<pre>
  (*script_run:...)           ) script run, can be backtracked into
  (*sr:...)                   )

  (*atomic_script_run:...)    ) atomic script run
  (*asr:...)                  )
</PRE>
</P>
<br><a name="SEC27" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
\n reference by number (can be ambiguous)
\gn reference by number
\g{n} reference by number
\g+n relative reference by number (PCRE2 extension)
\g-n relative reference by number
\g{+n} relative reference by number (PCRE2 extension)
\g{-n} relative reference by number
\k&#60;name&#62; reference by name (Perl)
\k'name' reference by name (Perl)
\g{name} reference by name (Perl)
\k{name} reference by name (.NET)
(?P=name) reference by name (Python)
</PRE>
</P>
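<P>
Two of the forms above, \n and (?P=name), are shared with Python's
<b>re</b> module, so they can be tried in a short sketch. The \g and \k forms
are not accepted inside Python patterns and are left out.
</P>

```python
import re

# \1 refers back to whatever group 1 captured: find a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# (?P=q) refers back to the named group q: match a quoted string whose
# closing quote is the same character as its opening quote.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group()
```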
<br><a name="SEC28" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
(?R) recurse whole pattern
(?n) call subroutine by absolute number
(?+n) call subroutine by relative number
(?-n) call subroutine by relative number
(?&name) call subroutine by name (Perl)
(?P&#62;name) call subroutine by name (Python)
\g&#60;name&#62; call subroutine by name (Oniguruma)
\g'name' call subroutine by name (Oniguruma)
\g&#60;n&#62; call subroutine by absolute number (Oniguruma)
\g'n' call subroutine by absolute number (Oniguruma)
\g&#60;+n&#62; call subroutine by relative number (PCRE2 extension)
\g'+n' call subroutine by relative number (PCRE2 extension)
\g&#60;-n&#62; call subroutine by relative number (PCRE2 extension)
\g'-n' call subroutine by relative number (PCRE2 extension)
</PRE>
</P>
<br><a name="SEC29" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
(?(n) absolute reference condition
(?(+n) relative reference condition (PCRE2 extension)
(?(-n) relative reference condition (PCRE2 extension)
(?(&#60;name&#62;) named reference condition (Perl)
(?('name') named reference condition (Perl)
(?(name) named reference condition (PCRE2, deprecated)
(?(R) overall recursion condition
(?(Rn) specific numbered group recursion condition
(?(R&name) specific named group recursion condition
(?(DEFINE) define groups for reference
(?(VERSION[&#62;]=n.m) test PCRE2 version
(?(assert) assertion condition
</pre>
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a reference
condition if the relevant named group exists.
</P>
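<P>
An absolute reference condition can be sketched with Python's <b>re</b>
module, which supports the (?(n)yes-pattern) form. The pattern below requires a
closing angle bracket only when an opening one was captured by group 1. The
other condition types in the table are PCRE2 features that Python does not
provide.
</P>

```python
import re

# Group 1 optionally captures "<"; the condition demands ">" only if it did.
pattern = r"(<)?\w+(?(1)>)"

bracketed = re.fullmatch(pattern, "<a>") is not None
bare = re.fullmatch(pattern, "a") is not None
unclosed = re.fullmatch(pattern, "<a") is not None
unopened = re.fullmatch(pattern, "a>") is not None
```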
<br><a name="SEC30" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
if :NAME is present. The others just set a name for passing back to the caller,
but this is not a name that (*SKIP) can see. The following act as soon as they
are reached:
<pre>
(*ACCEPT) force successful match
(*FAIL) force backtrack; synonym (*F)
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
</pre>
The effect of one of these verbs in a group called as a subroutine is confined
to the subroutine call.
</P>
<br><a name="SEC31" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
(?C) callout (assumed number 0)
(?Cn) callout with numerical data n
(?C"text") callout with string data
</pre>
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
</P>
<br><a name="SEC32" href="#TOC1">REPLACEMENT STRINGS</a><br>
<P>
If the PCRE2_SUBSTITUTE_LITERAL option is set, a replacement string for
<b>pcre2_substitute()</b> is not interpreted. Otherwise, by default, the only
special character is the dollar character in one of the following forms:
<pre>
$$ insert a dollar character
$n or ${n} insert the contents of group <i>n</i>
$&#60;name&#62; insert the contents of named group
$0 or $& insert the entire matched substring
$` insert the substring that precedes the match
$' insert the substring that follows the match
$_ insert the entire input string
$*MARK or ${*MARK} insert a control verb name
</pre>
For ${n}, n can be a name or a number. If PCRE2_SUBSTITUTE_EXTENDED is set,
there is additional interpretation:
</P>
<P>
1. Backslash is an escape character, and the forms described in "ESCAPED
CHARACTERS" above are recognized. Also:
<pre>
\Q...\E can be used to suppress interpretation
\l force the next character to lower case
\u force the next character to upper case
\L force subsequent characters to lower case
\U force subsequent characters to upper case
\u\L force next character to upper case, then all lower
\l\U force next character to lower case, then all upper
\E end \L or \U case forcing
\b backspace character (note: as in character class in pattern)
\v vertical tab character (note: not the same as in a pattern)
</pre>
2. The Python form \g&#60;n&#62;, where the angle brackets are part of the syntax and
<i>n</i> is either a group name or a number, is recognized as an alternative way
of inserting the contents of a group, for example \g&#60;3&#62;.
</P>
<P>
3. Capture substitution supports the following additional forms:
<pre>
${n:-string} default for unset group
${n:+string1:string2} values for set/unset group
</pre>
The substitution strings themselves are expanded. Backslash can be used to
escape colons and closing curly brackets.
</P>
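<P>
The default dollar forms can be pictured as a small template expansion over a
completed match. The Python sketch below is a hypothetical model of a subset of
them ($$, $n, ${n}, $&#60;name&#62;, and $&amp;), written against Python's
<b>re</b> match objects rather than <b>pcre2_substitute()</b>. It assumes that
an unset group expands to an empty string, which is a simplification, and it
ignores the extended forms.
</P>

```python
import re

# One alternative per supported dollar form (a subset, for illustration).
_TOKEN = re.compile(r"\$(?:(\$)|(&)|(\d+)|\{(\w+)\}|<(\w+)>)", re.ASCII)


def expand(template: str, match: re.Match) -> str:
    """Expand a subset of the dollar forms against a match (sketch)."""
    def repl(tok):
        dollar, amp, num, braced, named = tok.groups()
        if dollar:
            return "$"                    # $$ inserts a dollar character
        if amp:
            return match.group(0)         # $& inserts the whole match
        if num is not None:
            return match.group(int(num)) or ""
        key = braced if braced is not None else named
        ref = int(key) if key.isdigit() else key  # ${n}: name or number
        return match.group(ref) or ""
    return _TOKEN.sub(repl, template)


m = re.search(r"(?P<k>\w+)=(\d+)", "x=42")
expanded_text = expand("$2:$<k> costs $$ [${1}] $&", m)
```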
<br><a name="SEC33" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
</P>
<br><a name="SEC34" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC35" href="#TOC1">REVISION</a><br>
<P>
Last updated: 27 November 2024
<br>
Copyright &copy; 1997-2024 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
