Skip to content

dregex is a Java library that implements a regular expression engine using deterministic finite automata (DFA). It supports some Perl-style features and yet retains linear matching time, and also offers set operations.

License

Notifications You must be signed in to change notification settings

marianobarrios/dregex

Repository files navigation

dregex - Deterministic Regular Expression Engine

dregex is a Java library that implements a regular expression engine using deterministic finite automata (DFA). It supports some Perl-style features yet retains linear matching time. Additionally, it can perform set operations (union, intersection, and difference).

Build Status

Maven Central javadoc

Rationale

History

"Regular expressions" are a mathematically defined concept, invented by Stephen Kleene in 1956. In their most minimalistic (and original) version, these expressions define languages using just literal characters, alternation ("|"), and repetition ("*"). They have well-understood properties; most importantly, they can be matched against text using another well-understood abstract device: a Deterministic Finite Automaton (DFA). DFAs have the important property of running on arbitrary text of length n in O(n) time and using O(1) space. Presented with a regular expression and a candidate text, a DFA decides whether the text matches the expression.

Some years later, regular expressions entered actual usage when Ken Thompson included them as a feature in the editor QED. After that, they became a standard in the UNIX world, appearing in a variety of programs, notably the still-used grep tool.

Around 1994, Henry Spencer wrote a grep implementation that used a Non-Deterministic Finite Automaton (NFA). Using a NFA simplifies the compilation of expressions (building the automata) at the cost of complicating the execution. More importantly, NFA-based engines (specifically, those using recursive backtracking) make it easy to add new features, many of which are impossible to do with a state-machine (DFA) based implementation. That's precisely what started to happen when the library was used to implement regular expressions in Perl 5 in 1994.

After the release of Perl 5, the extended feature set was quickly embraced by users, and Perl-style regular expression implementations became part of the standard libraries of virtually all popular programming languages and also standard infrastructure components, like the Apache and Nginx web servers.

As mentioned, Perl-style expressions don't give any execution time guarantee. On the other hand, there are some features of Perl regular expressions that are impossible to express in a DFA, most notable backreferences (i.e., forcing a match of same text more than once). Other features, while not infeasible, significantly complicate a DFA solution, such as search and capturing groups.

Proposal

Regular expressions were born as a very specific tool and, almost by accident, grew to be one of the most versatile (and sometimes abused) tools in the world of software. However, there is a fundamental trade-off between the two prototypical implementations that is often ignored. Unbounded execution time is undesirable for many (if not all) interactive uses, as problems can happen when either the regular expression or the text are supplied by the user. For examples, see:

For cases when advanced Perl-style features are not needed and predictable performance is desired, using DFA-based matching is a compelling alternative, unfortunately made difficult by the scarcity of proper implementations in most languages. dregex offers an alternative for the Java ecosystem.

Supported regex flavor

Unless specified, the supported regular expression flavor attempts to be compatible with the Java flavor.

Supported features

  • Literals
  • Escaped characters
  • Standard regular language constructs: |, ?, * and +, and parenthesis grouping.
  • General quantifiers: {n}, {n,m}, {n,}
  • Dot wildcards
  • Simple character classes: [abc]
  • Simple negated character classes: [^abc]
  • Ranges in character classes: [a-z]
  • Special character classes, including negated versions: \w, \s, \d, \W, \S, \D
  • POSIX character classes (using the syntax and definition of java.util.regex.Pattern), e.g., \p{Lower}, \p{Punct}, \P{Lower}
  • Unicode character classes (using the syntax and definition of java.util.regex.Pattern):
  • Java-defined character classes (using the syntax and definition of java.util.regex.Pattern), e.g., \p{javaLowerCase}, \p{javaWhitespace}
  • Special character classes inside regular character classes: [\d\s], [\D]
  • Unicode line breaks: \R
  • Block quotes: \Q...\E
  • Lookaround (lookahead and lookbehind; both positive and negative) (see note below)

The definition of Unicode character classes (Unicode blocks, scripts, general categories and binary properties) is tied to the library and not the JVM). The version currently supported is Unicode 16.0.0.

Compile flags

With one exception, all compile flags defined by java.util.regex.Pattern are supported, including in embedded form:

  • LITERAL
  • COMMENTS (also in embedded form: (?x)). Note: The flag intentionally behaves ignoring exactly the same set of white space characters as the standard Java implementation, that is, only ASCII white space, not Unicode.
  • DOTALL (also in embedded form: (?s))
  • UNIX_LINES (also in embedded form: (?d))
  • UNICODE_CHARACTER_CLASS (also in embedded form: (?U))
  • CASE_INSENSITIVE (also in embedded form: (?i))
  • UNICODE_CASE (also in embedded form: (?u))
  • CANON_EQ

Not supported

  • Searching (the engine matches only against the full input string)
  • Capturing groups
  • Backreferences
  • Anchors (ˆ and $), as they are redundant because the expressions only operate over the complete text.
  • Reluctant (+?, *?, ??, {...}?) and possessive (++, *+, ?+, {...}+) quantifiers , because they are meaningless for a pure-matching engine. By definition, they only affect capturing groups, not whether an expression matches.
  • Compile flag MULTILINE, because it is meaningless for a pure-matching engine, that works always in multi-line mode.

Note: for the safety, the presence of unsupported features in a regular expression will cause it to fail the compilation (except for unnamed capturing groups, as they have no syntax: they are just a pair of parenthesis).

Set operations

In addition to regular matching, dregex fully supports set operations on regular expressions. These operations work on regular expressions themselves, not on input strings. It possible to do union, intersection and difference:

List<Regex> regexes = Regex.compile(List.of("[a-z]+", "[A-Z]+", "[a-z]+|[A-Z]+"));
Regex lower = regexes.get(0);
Regex upper = regexes.get(1);
Regex both = regexes.get(2);
System.out.println(lower.doIntersect(upper)); // false
System.out.println(both.equiv(lower.union(upper))); // true
List<Regex> regexes = Regex.compile(List.of("[a-z]+|[A-Z]+", "[A-Z]+"));
Regex all = regexes.get(0);
Regex upper = regexes.get(1);
Regex lower = all.diff(upper);
System.out.println(lower.matches("aaa")); // true
System.out.println(lower.matches("Aaa")); // false

The motivating use case was detecting non-intersecting expressions. Once it can be established that a set of expressions do not intersect (that they are disjoint) it becomes possible to short-circuit evaluations. Moreover, they can be tested in any order, allowing for reordering based on matching statistics. This is especially important in performance-critical paths where multiple expressions are matched, such as in load balancers.

Note on lookaround

Lookaround constructs are transformed into an equivalent DFA operation, and the result is then trivially converted back into an NFA for insertion into the outer expression:

Lookahead:

  • (?=B)CC ∩ B.*
  • (?!B)CC - B.*

Lookbehind:

  • A(?<=B)A ∩ .*B
  • A(?<!B)A - .*B

If there is more than one lookaround, the transformation is applied recursively.

Lookaround expressions are supported anywhere in a regex, including inside repetition constructs. Nevertheless, their meaning differs from those in Perl-like engines. Consider:

((?!aa)a)+

This expression matches, in this engine, a, aa, aaa and so on, being effectively equivalent to a+, because the negative condition (aa) can never occur inside a. The fact that the expression is repeated does not change its inner logic. However, in Perl-like engines, the aforementioned regex does not match any a longer than one character, because those engines treat lookarounds specially, effectively running a sub-regex at the point of occurrence, irrespective of the context.

The different behavior of this engine is a direct consequence of the way lookarounds are implemented. Nevertheless, it can be argued that this definition is conceptually simpler and, more importantly, easier to reason about in the context of a complex expression. Looped lookarounds like the one in the example are in practice quite rare anyway.

Internals

DFA construction

The library parses the regular expressions and builds a NFA (Nondeterministic Finite Automaton) using a variation of the Thompson algorithm. It then uses the "powerset construction" to build a DFA (Deterministic Finite Automaton). One the DFA is built, the matching algorithm is straightforward.

Wildcards and character classes

Character classes are expanded as disjunctions before NFA creation. However, because of the number of possible Unicode code points, non-overlapping code point intervals are used internally to avoid disjunctions with too many alternatives.

Example:

  • [abc][a-c]
  • [^efg][0-d]|[h-MAX]
  • mno[^efg]mno(0-d|h-l|m|n|o|p-MAX)
  • .[0-MAX]

Set operations

Intersections, unions and differences between regex are done using the "product construction". The following pages include graphical explanations of this technique:

This is a relatively straightforward algorithm implemented using the already generated DFA.

Requirements

dregex requires Java 11.

Logging

The library uses SLF4J for logging, which is the most widely used pluggable logging framework in Java. As a policy, all logging event emitted are at TRACE level, which is below the default threshold in most logging implementations and thus completely silent by default.

Dependencies

There are two small dependencies: SLF4J and jparsec. The main jar file is about 200 KB.

Similar efforts

  • RE2 is an efficient (linear) C++ library that implements a subset of Perl features, written by Russ Cox. The author has written a set of articles explaining the problem.
  • TRE is an efficient C library and command-line tool that implements POSIX-compliant and approximate (fuzzy) regex matching, using "tagged DFA". It is written by Ville Laurikari.
  • Plan 9 grep is an efficient DFA implementation that supports egrep syntax. It was written by Ken Thompson.
  • regex-tdfa is a "tagged DFA" implementation written in Haskell by Chris Kuklewicz.
  • regex is a Rust crate implemented with similar principles.

About

dregex is a Java library that implements a regular expression engine using deterministic finite automata (DFA). It supports some Perl-style features and yet retains linear matching time, and also offers set operations.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 5

Languages