-
Notifications
You must be signed in to change notification settings - Fork 1
Home
The Library main purpose is to provide an easy, straightforward and flexible way to map a source text to a desired form with variety of options and customizations available. Basic components which are involved in the workflow and their relationships can be demonstrated by the following diagram:
- Each terminal is defined by a regular expression and represents a basic source recognition unit(lexeme).
- Non-terminals are used to reduce a list of terminals and non-terminals(production) to a higher level entity.
- Both terminals and non-terminals with additional configuration defines a target grammar.
- The grammar and custom reduction code are then used to compile a final parser.
Regular Expressions are constructed from nodes of the following types:
- Text nodes to match a single position within a source which supports any combination of Unicode character codes, ranges and Unicode categories:
Rex.Char('r');
Rex.Char(@"a-z0-9");
Rex.Char(@"\u{0|10-ff|ccc|10000-10FFFF}");
Rex.Char(@"\p{Cc|Cf|Cn|Co|Cs|C|Ll|Lm|Lo|Lt|Lu|L|Mc|Me|Mn|M|Nd|Nl|No|N|Pc|Pd|Pe|Po|Ps|Pf|Pi|P|Sc|Sk|Sm|So|S|Zl|Zp|Zs|Z}");
Rex.Char(@"\a\b\t\r\n\v\f\\");
There is also a convenient way to configure a match for any character (or any set in general) excluding specific characters, ranges or categories:
Rex.Char(@"0-10ffff-[\p{L}]");
Rex.Char(@"0-10ffff-[a]");
Rex.Except(@"\p{L}");
Rex.Except('a');
- Text nodes are later combined in the expression by OR, AND(concatenation) and REPEAT nodes:
Rex.Char('a').Then(Rex.Char('b')).Then(Rex.Char('b'));
Rex.Text("abb");
Rex.Or(Rex.Char('a'), Rex.Char('b')).NoneOrMore().Then("abb"); // (a|b)*abb
- Special kind of a conditional expressions are also available. The idea is to define a sub-expression, which is supposed to be checked at some position in the pattern and an evaluation is transferred to the state which follows only if the sub-expression has succeed(positive lookahead) or failed(negative lookahead). Every time a conditional expression evaluation is completed current text position is reset to the state where it's started:
Rex.IfNot("-->").Then(Rex.AnyChar).NoneOrMore();
Rex.Char('a').NoneOrMore().FollowedBy('b');
An individual regular expression can be compiled in a delegate of int RexEvaluator(string content, int offset, int length)
type. Accepting a text input and boundaries the dynamic method will find and return a length of the longest sub-string has been matched or -1
if none succeed:
var digit = Rex.Char("0-9");
var number = digit.OneOrMore();
var eval = Rex.Compile(number);
var input = "x123y";
Console.WriteLine(eval(input, 0, input.Length)); // outputs -1
Console.WriteLine(eval(input, 1, input.Length - 1)); // outputs 3
Console.WriteLine(eval(input, 1, 2)); // outputs 2
Multiple regular expressions(each expression is associated with an appropriate token ID) then are converted into a DFA by a lexical builder. The DFA is represented by lexical states and transitions between them. Each lexical state corresponds one or more nodes from the source expressions. When some state contains a regular expression accept node(added automatically) it's considered as a final state. If there are several acceptance nodes, the one with the lowest token ID is selected as final for the state. When a source text (represented by a sequence of UTF-16 code points) is processed the following workflow to choose the next state is applied:
- Each state contains ordered disjoint list of Unicode ranges.
- If some range contains a pending char code then Unicode category for the char code is considered.
- If there is a mapping between some category and the char code then the transition which is defined by this category is selected.
- If no category is matched then the default transition for the range is selected.
- If no range contains the char code then the default categories are tested.
- If neither range nor default category corresponds to the char then the default state transition is selected.
Lexical analyzer provides natural support for the code points from the extended Unicode range(0x10000-0x10FFFF). To achieve this a supplementary surrogate state is generated when necessary. Then all codes which does not fit UTF-16 range are moved to this "extra" state. Finally, the surrogate state is connected to the root by introducing a special 'high-surrogate'(0xD800--0xDBFF) range transition.
To support conditional regular expression nodes a lexical state is extended to have a tree-like structure. Each state contains False
and True
child state nodes. When both are null means we're dealing with a regular lexical state. But when any or both are set identifies a lookahead initial state case. If so, the lookahead state is processed individually and if a final state is encountered the current position is reset and True
-state is considered as a current state. Otherwise False
-state is chosen.
Grammar allows to configure terminals, whitespaces and non-terminals which are used later for parser states generation. There are also several properties specified when the grammar is created:
-
Ignore Case: when specified each char-set in the grammar is extended with to-lower and to-upper variants. Also updates letter Unicode categories accordingly. False by default.
-
Conflict Resolver: when specified overrides default conflict resolver with a custom one.
Terminals are defined by a name, a regular expression and an incrementally generated token ID(so terminal defined earlier in the grammar has a priority over the later ones). The name then can be used to reference the terminal in a production body or in a reducer. Examples:
grammar.CreateTerminals("+", "-", "(", ")");
grammar.CreateTerminal("num", Rex.Char("0-9").OneOrMore());
grammar.CreateTerminal("expr:str", Rex.Char('"').Then(Rex.AnyText).Then('"'), lazy: true);
-
CreateTermials
defines a list of static terminals where expression for each items is generated from its name. -
expr:str
-like construct defines a complex name where the terminal is automatically reduced to the 'expr' non-terminal(should be defined before this line). Equivalent to:
grammar.CreateTerminal("str", Rex.Char('"').Then(Rex.AnyText).Then('"'), lazy: true);
grammar.AddRule("expr:str", "str");
-
lazy:true
prevents any transition from an expression final state once it's reached.
Whitespaces define a special kind of terminals which can appear at any position within a production between other symbols and generally does not affect a parser state:
grammar.CreateWhitespace("ws", Rex.Char(' '));
grammar.CreateWhitespace("lb", Rex.Char('\n'), isLineBreak: true);
isLineBreak: true
tags a whitespace with a LineBreak
attribute. In that case the whitespace when recognized will deny/allow productions constrained by [NoLB]/[LB] modifiers.