Skip to content
RY edited this page Jan 10, 2022 · 24 revisions

The Library main purpose is to provide an easy, straightforward and flexible way to map a source text to a desired form with variety of options and customizations available. Basic components which are involved in the workflow and their relationships can be demonstrated by the following diagram:

  • Each terminal is defined by a regular expression and represents a basic source recognition unit(lexeme).
  • Non-terminals are used to reduce a list of terminals and non-terminals(production) to a higher level entity.
  • Both terminals and non-terminals with additional configuration defines a target grammar.
  • The grammar and custom reduction code are then used to compile a final parser.

Regular Expressions

Regular Expressions are constructed from nodes of the following types:

  • Text nodes to match a single position within a source which supports any combination of Unicode character codes, ranges and Unicode categories:
Rex.Char('r');
Rex.Char(@"a-z0-9");
Rex.Char(@"\u{0|10-ff|ccc|10000-10FFFF}");
Rex.Char(@"\p{Cc|Cf|Cn|Co|Cs|C|Ll|Lm|Lo|Lt|Lu|L|Mc|Me|Mn|M|Nd|Nl|No|N|Pc|Pd|Pe|Po|Ps|Pf|Pi|P|Sc|Sk|Sm|So|S|Zl|Zp|Zs|Z}");
Rex.Char(@"\a\b\t\r\n\v\f\\");

There is also a convenient way to configure a match for any character (or any set in general) excluding specific characters, ranges or categories:

Rex.Char(@"0-10ffff-[\p{L}]");
Rex.Char(@"0-10ffff-[a]");
Rex.Except(@"\p{L}");
Rex.Except('a');
  • Text nodes are later combined in the expression by OR, AND(concatenation) and REPEAT nodes:
Rex.Char('a').Then(Rex.Char('b')).Then(Rex.Char('b'));
Rex.Text("abb");
Rex.Or(Rex.Char('a'), Rex.Char('b')).NoneOrMore().Then("abb"); // (a|b)*abb
  • Special kind of a conditional expressions are also available. The idea is to define a sub-expression, which is supposed to be checked at some position in the pattern and an evaluation is transferred to the state which follows only if the sub-expression has succeed(positive lookahead) or failed(negative lookahead). Every time a conditional expression evaluation is completed current text position is reset to the state where it's started:
Rex.IfNot("-->").Then(Rex.AnyChar).NoneOrMore();
Rex.Char('a').NoneOrMore().FollowedBy('b');

Compiled Expression

An individual regular expression can be compiled in a delegate of int RexEvaluator(string content, int offset, int length) type. Accepting a text input and boundaries the dynamic method will find and return a length of the longest sub-string has been matched or -1 if none succeed:

var digit  = Rex.Char("0-9");
var number = digit.OneOrMore();
var eval   = Rex.Compile(number);
var input  = "x123y";

Console.WriteLine(eval(input, 0, input.Length));     // outputs -1
Console.WriteLine(eval(input, 1, input.Length - 1)); // outputs  3
Console.WriteLine(eval(input, 1, 2));                // outputs  2

Lexical Analysis

Multiple regular expressions(each expression is associated with an appropriate token ID) then are converted into a DFA by a lexical builder. The DFA is represented by lexical states and transitions between them. Each lexical state corresponds one or more nodes from the source expressions. When some state contains a regular expression accept node(added automatically) it's considered as a final state. If there are several acceptance nodes, the one with the lowest token ID is selected as final for the state. When a source text (represented by a sequence of UTF-16 code points) is processed the following workflow to choose the next state is applied:

  • Each state contains ordered disjoint list of Unicode ranges.
  • If some range contains a pending char code then Unicode category for the char code is considered.
  • If there is a mapping between some category and the char code then the transition which is defined by this category is selected.
  • If no category is matched then the default transition for the range is selected.
  • If no range contains the char code then the default categories are tested.
  • If neither range nor default category corresponds to the char then the default state transition is selected.

Surrogates

Lexical analyzer provides natural support for the code points from the extended Unicode range(0x10000-0x10FFFF). To achieve this a supplementary surrogate state is generated when necessary. Then all codes which does not fit UTF-16 range are moved to this "extra" state. Finally, the surrogate state is connected to the root by introducing a special 'high-surrogate'(0xD800--0xDBFF) range transition.

Lookaheads

To support conditional regular expression nodes a lexical state is extended to have a tree-like structure. Each state contains False and True child state nodes. When both are null means we're dealing with a regular lexical state. But when any or both are set identifies a lookahead initial state case. If so, the lookahead state is processed individually and if a final state is encountered the current position is reset and True-state is considered as a current state. Otherwise False-state is chosen.

Grammar

Grammar allows to configure terminals, whitespaces and non-terminals which are used later for parser states generation. There are also several properties specified when the grammar is created:

  • Ignore Case: when specified each char-set in the grammar is extended with to-lower and to-upper variants. Also updates letter Unicode categories accordingly. False by default.

  • Conflict Resolver: when specified overrides default conflict resolver with a custom one.

Terminals

Terminals are defined by a name, a regular expression and an incrementally generated token ID(so terminal defined earlier in the grammar has a priority over the later ones). The name then can be used to reference the terminal in a production body or in a reducer. Examples:

grammar.CreateTerminals("+", "-", "(", ")");
grammar.CreateTerminal("num", Rex.Char("0-9").OneOrMore());
grammar.CreateTerminal("expr:str", Rex.Char('"').Then(Rex.AnyText).Then('"'), lazy: true);
  • CreateTermials defines a list of static terminals where expression for each items is generated from its name.

  • expr:str-like construct defines a complex name where the terminal is automatically reduced to the 'expr' non-terminal(should be defined before this line). Equivalent to:

grammar.CreateTerminal("str", Rex.Char('"').Then(Rex.AnyText).Then('"'), lazy: true);
grammar.AddRule("expr:str", "str");
  • lazy:true prevents any transition from an expression final state once it's reached.

Whitespaces

Whitespaces define a special kind of terminals which can appear at any position within a production between other symbols and generally does not affect a parser state:

grammar.CreateWhitespace("ws", Rex.Char(' '));
grammar.CreateWhitespace("lb", Rex.Char('\n'), isLineBreak: true);

isLineBreak: true tags a whitespace with a LineBreak attribute. In that case the whitespace when recognized will deny/allow productions constrained by [NoLB]/[LB] modifiers.

Non-Terminals

Conflicts Resolution

Clone this wiki locally