GSTX is a parsing framework for Go that makes it possible to describe a grammar in an EBNF-like manner. By simply composing a declarative set of rules, a parser can be generated that produces an abstract syntax tree (AST) if a text is processed successfully. In case of a syntax error, it returns as much information as possible and creates a nice, informative error message.
As an example, the EBNF of JSON is given below (this is one possible implementation of the grammar). Higher level rules are written in lower-case, low-level rules (terminal symbols) are written in upper-case.
json = value;
value = NUMBER | STRING | "null" | "true" | "false" | object | array;
object = "{" (key_value ("," key_value)* )? "}";
key_value = STRING ":" value;
array = "[" (value ("," value)* )? "]";
As Go doesn't support operator overloading, the grammar looks quite different when described using GSTX, but it "reads" almost the same:
var (
	json     = NewRule("json")
	value    = NewRule("value")
	object   = NewRule("object")
	keyValue = NewRule("keyValue")
	array    = NewRule("array")
)
json.Set(W(value).Cat(W(E())))
value.Set(
	Float().
		Or(Str()).
		Or(Keyword("null")).
		Or(Keyword("true")).
		Or(Keyword("false")).
		Or(object).
		Or(array),
)
object.Set(
	W(OneOf("{")).
		Cat(Opt(
			W(keyValue).Cat(
				Star(W(OneOf(",")).Cat(W(keyValue)))),
		)).
		Cat(W(OneOf("}"))),
)
keyValue.Set(W(Str()).Cat(W(OneOf(":"))).Cat(W(value)))
array.Set(
	W(OneOf("[")).
		Cat(Opt(
			W(value).
				Cat(Star(W(OneOf(",")).Cat(W(value)))),
		)).
		Cat(W(OneOf("]"))),
)

Please note the similarities between the two listings. The grammatical rules
are the same: json, value, object, keyValue, array. In the Go
version, each rule is given a name in the NewRule() creator function and then
they are defined using the Set() method.
The methods and creator functions of built-in types provide a domain-specific
language that brings the rule definitions very close to the spoken language.
For example, the value rule reads as: a value is a float (floating-point
number) or a string or the keyword null or the keyword true or the keyword
false or an object or an array.
If we need to match two rules in a sequence, we can use the Cat() method,
where Cat stands for concatenation. There are built-in rules for the typical
tasks of parsing: Identifier() matches any character sequence that could be
an identifier in a typical programming language (starts with a character or
underscore and continues with characters, underscores or digits). Keyword
will match a given keyword. It checks for an identifier that is equal to the
given string literal. So Keyword("null") will not match the text nullable
but it will match the first four letters of the literal null-1. Str()
matches strings and the StrDelim() version also allows the user to set the
string delimiters (starting and ending characters) and the escape character, so
e.g. StrDelim('[', ']', '#') will match the following string literal:
[app#[le].
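To make the Keyword() and StrDelim() behaviours concrete, here is a small, GSTX-independent sketch. The helper names (keywordMatches, scanDelimited) are mine, not part of the GSTX API; they only mirror the semantics described above: a keyword matches when it is not followed by another identifier character, and a delimited string honours a custom escape character.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// keywordMatches reports whether text starts with kw as a whole word,
// i.e. kw is not followed by an identifier character. This is why
// "null" matches the start of "null-1" but not of "nullable".
func keywordMatches(text, kw string) bool {
	if !strings.HasPrefix(text, kw) {
		return false
	}
	rest := text[len(kw):]
	if rest == "" {
		return true
	}
	r := rune(rest[0])
	return !(r == '_' || unicode.IsLetter(r) || unicode.IsDigit(r))
}

// scanDelimited extracts the content of a string literal delimited by
// opening and closing, where esc escapes the next character.
func scanDelimited(s string, opening, closing, esc rune) (string, bool) {
	r := []rune(s)
	if len(r) == 0 || r[0] != opening {
		return "", false
	}
	var b strings.Builder
	for i := 1; i < len(r); i++ {
		switch r[i] {
		case esc:
			if i++; i < len(r) {
				b.WriteRune(r[i])
			}
		case closing:
			return b.String(), true
		default:
			b.WriteRune(r[i])
		}
	}
	return "", false // unterminated literal
}

func main() {
	fmt.Println(keywordMatches("nullable", "null")) // false
	fmt.Println(keywordMatches("null-1", "null"))   // true
	fmt.Println(scanDelimited("[app#[le]", '[', ']', '#'))
}
```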
The OneOf() rule gets a string literal as its input, and will try to match
one character from the given literal. Any of the characters might match but
only one character can get digested by it.
For numbers there are two built-in types: Int() matches integers (signed or
unsigned) and Float() matches real numbers. The FloatSep(',') function
makes it possible to set the decimal separator, so this example matches:
23,45.
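A decimal-separator-aware parse can be sketched without GSTX by normalising the separator before handing the text to the standard library; parseFloatSep below is a hypothetical helper, not GSTX code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFloatSep parses a real number that uses sep as its decimal
// separator by replacing it with '.' before calling strconv.ParseFloat.
func parseFloatSep(s string, sep rune) (float64, error) {
	return strconv.ParseFloat(strings.Replace(s, string(sep), ".", 1), 64)
}

func main() {
	f, err := parseFloatSep("23,45", ',')
	fmt.Println(f, err) // 23.45 <nil>
}
```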
Sometimes rules are optional or can be repeated an arbitrary number of times. There are three built-in rules to deal with these cases, all taking an inner rule that they will try to match as many times as their name suggests:
- Opt(<rule>): the inner rule is optional; it will match either once or it won't match at all
- Star(<rule>): the inner rule is optional but it can also match any number of times
- Plus(<rule>): the inner rule has to match at least once but can match many more times
For example, Star(OneOf("123")) will match 1, 333221 and the empty string
as well while Plus(OneOf("123")) will not match the empty string.
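The repetition semantics can be reproduced in a few lines of plain Go. This is a sketch with my own names (matcher, oneOf, star, plus), not the GSTX implementation; it only demonstrates the zero-or-more versus one-or-more distinction:

```go
package main

import (
	"fmt"
	"strings"
)

// matcher tries to digest input starting at pos and returns the new
// position together with a success flag.
type matcher func(s string, pos int) (int, bool)

// oneOf matches exactly one character from the given set.
func oneOf(set string) matcher {
	return func(s string, pos int) (int, bool) {
		if pos < len(s) && strings.ContainsRune(set, rune(s[pos])) {
			return pos + 1, true
		}
		return pos, false
	}
}

// star matches the inner rule zero or more times, so it never fails.
func star(inner matcher) matcher {
	return func(s string, pos int) (int, bool) {
		for {
			next, ok := inner(s, pos)
			if !ok {
				return pos, true
			}
			pos = next
		}
	}
}

// plus matches the inner rule one or more times.
func plus(inner matcher) matcher {
	return func(s string, pos int) (int, bool) {
		next, ok := inner(s, pos)
		if !ok {
			return pos, false
		}
		end, _ := star(inner)(s, next)
		return end, true
	}
}

func main() {
	end, ok := star(oneOf("123"))("333221", 0)
	fmt.Println(end, ok) // 6 true
	_, ok = plus(oneOf("123"))("", 0)
	fmt.Println(ok) // false
}
```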
Many parsing frameworks separate lexers (simple matchers that work at the
character level) and parsers, high-level rules that work on lexers. GSTX
doesn't do this separation so the rules go all the way down to the character
level. This means that we need to deal with whitespaces (something that is
usually taken care of by lexers). To make this as easy as possible, there is a
single-letter function, W(<rule>), that takes a rule as an argument and
digests all the whitespace before trying to match the inner rule. So, for
example, W(Int()) will match 2321 even when it is preceded by whitespace.
Using W(), we can get rid of all the unwanted whitespace in our input text,
except for the trailing whitespace. We could say that we're not interested in
it, as by the time we get there, we have already parsed the gist of our input,
but one of the returned types, which contains the indices of the matched
substring, also has a method called FullMatch() which only returns true if our
grammar digested the entire input text. This can be very useful to check that our
parsing was complete. In order to be able to use this method even when there
are trailing whitespaces, we have an epsilon function (E()) that digests
nothing but always returns with success. By adding the following structure to
the end of our main rule, we can make sure that any trailing whitespaces are
digested by the grammar: .Cat(W(E())). This is how the json rule ends.
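The W()/E() interplay can be illustrated independently of GSTX. In this sketch (w, epsilon, and matcher are my names, not the GSTX API), epsilon digests nothing and always succeeds, so wrapping it with a whitespace-skipping combinator digests exactly the trailing whitespace:

```go
package main

import (
	"fmt"
	"unicode"
)

// matcher tries to digest input starting at pos.
type matcher func(s string, pos int) (int, bool)

// epsilon digests nothing and always succeeds.
func epsilon(s string, pos int) (int, bool) { return pos, true }

// w skips any leading whitespace before trying the inner matcher,
// mirroring the described behaviour of W(<rule>).
func w(inner matcher) matcher {
	return func(s string, pos int) (int, bool) {
		for pos < len(s) && unicode.IsSpace(rune(s[pos])) {
			pos++
		}
		return inner(s, pos)
	}
}

func main() {
	// Starting after "123", w(epsilon) digests the trailing spaces only.
	end, ok := w(epsilon)("123   ", 3)
	fmt.Println(end, ok) // 6 true
}
```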
When we need to process a given text input, there are a few things to set up:
apart from defining our grammar using a set of rules, we also need to call the
Parse() function that takes an input string and the top-level rule as inputs.
It returns three values:
- a value of type *MatchedResult, which is a struct with two public methods: String(), which returns the substring that was matched by the top-level rule, and FullMatch(), which returns true if the match was full, i.e. the entire input string was digested by the top-level rule
- a value of type *Node, the root of the Abstract Syntax Tree (AST) of the parsed text; this is a bit special, more on this later
- an instance of *SyntaxError (or nil) if the parsing failed
So, here's a full example showing how a single integer can be parsed:
var (
	text    = "123"
	rule    = Int()
	matched *MatchedResult
	node    *Node
	err     *SyntaxError
)
if matched, node, err = Parse(text, rule); err != nil {
	log.Fatalf("error parsing the input: %s", err)
}

The SyntaxError type is an error so it has an Error() method and can be
used wherever a built-in error can be used.
As can be seen in the example above, any rule can be passed to the Parse()
function, not just ones of type Rule created by the NewRule() function. All
the rules implement the Parser interface, and the Parse() function expects a
Parser.
The Rule is required for two reasons: the first is that there are recursive
grammars where the rules need to be referenced before they are defined and for
this reason they need to be identifiable before they have a definition. A
typical example is the set of rules that define arithmetic expressions, where
it is possible to nest parentheses to an arbitrary depth:
var (
	text       = "2*(3+4)"
	expression = NewRule("expression")
	term       = NewRule("term")
	factor     = NewRule("factor")
)
expression.Set(term.Cat(Opt(OneOf("+-").Cat(term))))
term.Set(factor.Cat(Opt(OneOf("*/").Cat(factor))))
factor.Set(Int().Or(OneOf("(").Cat(expression).Cat(OneOf(")"))))

The other reason for the existence of rules is that GSTX uses them for AST generation. Typical ASTs contain a lot of nodes that are logical parts of the tree based on the grammar but have no importance when the results are processed, and thus they just bloat the tree and make processing harder.
GSTX has a different approach: it only includes the rules in the tree. Each
rule has a name, and the name appears together with the matched substring in the
corresponding node. This way the programmers defining a grammar have a very fine-grained way of influencing what gets into the AST and what doesn't. Whenever something is needed in the tree, it is always possible to extract part of an expression into a separate rule, and when something is not needed, the corresponding rule can be inlined into another rule or replaced with a simple Parser type instead of a Rule.
For example, the library comes with a main function, and thus an application,
which is basically a full example showing how JSON documents can
be parsed using GSTX. In that example, the JSON grammar is a little different
from the one shown at the beginning of this document. One reason for this
difference is that the generated tree is easier to handle this way. For example,
keys have been extracted to a rule called key, so that keys and values both
have their own leaf nodes in the tree and can thus be easily processed.
This is the grammar in main.go:
var (
	json     = NewRule("json")
	value    = NewRule("value")
	object   = NewRule("object")
	key      = NewRule("key")
	keyValue = NewRule("keyValue")
	array    = NewRule("array")
)
json.Set(W(value).Cat(W(E())))
value.Set(
	Float().
		Or(Str()).
		Or(object).
		Or(array).
		Or(Keyword("null")).
		Or(Keyword("true")).
		Or(Keyword("false")),
)
object.Set(
	Grp(
		W(OneOf("{")).
			Cat(
				W(keyValue).Cat(
					Star(W(OneOf(",")).Cat(W(keyValue)))),
			).
			Cat(W(OneOf("}"))),
	).
		Or(
			W(OneOf("{")).Cat(W(OneOf("}"))),
		),
)
key.Set(Str())
keyValue.Set(W(key).Cat(W(OneOf(":"))).Cat(W(value)))
array.Set(
	W(OneOf("[")).
		Cat(Opt(
			W(value).
				Cat(Star(W(OneOf(",")).Cat(W(value)))),
		)).
		Cat(W(OneOf("]"))),
)

And if the input is the following JSON:
{
  "Name": {
    "Forename": "Jane",
    "Surname": "Smith"
  },
  "Age": 24,
  "Emails": ["[email protected]", "[email protected]"]
}

Then the generated AST is the following:
Name: [ROOT] (matched: "...")
	Name: json (matched: "...")
		Name: value (matched: "{...")
			Name: object (matched: "{...")
				Name: keyValue (matched: ""Name": {...")
					Name: key (matched: ""Name"")
					Name: value (matched: "{...")
						Name: object (matched: "{...")
							Name: keyValue (matched: ""Forename": "Jane"")
								Name: key (matched: ""Forename"")
								Name: value (matched: ""Jane"")
							Name: keyValue (matched: ""Surname": "Smith"")
								Name: key (matched: ""Surname"")
								Name: value (matched: ""Smith"")
				Name: keyValue (matched: ""Age": 24")
					Name: key (matched: ""Age"")
					Name: value (matched: "24")
				Name: keyValue (matched: ""Emails": ["[email protected]", "[email protected]"]")
					Name: key (matched: ""Emails"")
					Name: value (matched: "["[email protected]", "[email protected]"]")
						Name: array (matched: "["[email protected]", "[email protected]"]")
							Name: value (matched: ""[email protected]"")
							Name: value (matched: ""[email protected]"")
Please note that there is always a root node called "[ROOT]" and that the node corresponding to the top-level rule is its only child. This is a result of how the AST is built in the library.
The Node type returned by the Parse() function provides a Walk() method
which makes it possible to traverse the AST with a for ... range loop. The
tree listing above was generated by the code below:
for level, node := range root.Walk() {
	for range level {
		fmt.Printf("\t")
	}
	fmt.Printf("Name: %s (matched: \"%s\")\n", node.Name, cutMultilineStrings(node.String()))
}

where cutMultilineStrings() merely replaces multiline strings with three dots.
When there is a parsing failure, the rules return an error. What makes the
generation of good error messages hard is that a failure might not mean that the
parsing itself fails. In complex grammars there are many levels and many
alternatives (Or operators). Whenever a given branch of an Or fails, an
error message is generated, but whether the entire Parse() call fails depends
on how the other branches perform. Another complicating factor is that when
dealing with errors, usually two types of information are needed:
- the exact problem that caused parsing to fail (coming from the character-level rule that failed)
- the higher-level element that we were trying to parse at that given point (usually provided by the lowest level named rule above the character-level rule that failed)
GSTX solves the above problems with a heuristic: it returns the error
message from the built-in character-level rule that got furthest during
parsing, adds the name of the named rule that is closest to the failure and provides a method
(ErrorLine()) that extracts the line where the error happened from the input
text and shows the exact character where parsing failed. Together with the
compound error message depicted above this should give enough information to
find the syntax error.
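Extracting the offending line and pointing at the failing character can be sketched independently of GSTX. The helper below (errorLine is my name; GSTX's ErrorLine() presumably works along similar lines) slices out the line containing a byte offset and appends a caret line under the failing position:

```go
package main

import (
	"fmt"
	"strings"
)

// errorLine returns the line of text containing byte offset pos,
// followed by a second line with a caret under the failing character.
func errorLine(text string, pos int) string {
	start := strings.LastIndexByte(text[:pos], '\n') + 1
	end := strings.IndexByte(text[pos:], '\n')
	if end < 0 {
		end = len(text)
	} else {
		end += pos
	}
	line := text[start:end]
	return line + "\n" + strings.Repeat(" ", pos-start) + "^"
}

func main() {
	text := "{\n  \"Forename\" \"Jane\"\n}"
	// Offset 15 is the opening quote of "Jane", where a colon was expected.
	fmt.Println(errorLine(text, 15))
}
```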
For example, if we modify the example JSON in main.go by removing the colon
from between "Forename" and "Jane":
{
  "Name": {
    "Forename" "Jane",
    "Surname": "Smith"
  },
  "Age": 24,
  "Emails": ["[email protected]", "[email protected]"]
}

Then we get the following error message:
2025/02/24 17:04:25 in rule "keyValue" expected a character from this set: ":" at 27
The error:
    "Forename" "Jane",
               ^
This output is generated by the code below:
if result, root, err = Parse(text, json); err != nil {
	log.Fatalf("%s\nThe error:\n%s\n", err, err.ErrorLine(text))
}

If the reader takes a closer look at the grammar listing above, it is apparent
that there is a difference in how the object rule was defined when compared to
the original listing. The reason for this is that in the simplest definition of
the grammar, the key-value list within the object is optional (an empty object
is valid in JSON). The problem with this is that if the parsing fails due to
the missing colon, the parser will opt for no match in the optional part and
thus the object rule will fail at the start of "Forename" as it is not a
closing curly brace. This is not very helpful for a user who is trying to find
out what the problem is with the JSON. It is therefore a better solution to add
the empty object as a separate option with an Or() operator. This way, the
keyValue rule will fail, and as it gets the furthest in parsing the text, this
is what will bubble back to the top and appear as the error message.
There can be many ways of processing the AST. One suggested method is shown in
pkg/parser/node_test.go in TestNodeBuildStructure(). The grammar is very
simple here: it defines a function call in a small toy language:
var (
	text             = "myFun(23, \"apple\", fruits)"
	functionCallRule = NewRule("functionCall")
	functionNameRule = NewRule("functionName")
	argumentListRule = NewRule("argumentList")
	argumentRule     = NewRule("argument")
	root             *Node
	err              *SyntaxError
)
functionCallRule.Set(
	W(functionNameRule).Cat(W(OneOf("("))).Cat(argumentListRule).Cat(W(OneOf(")"))),
)
functionNameRule.Set(Identifier())
argumentListRule.Set(
	W(argumentRule).Cat(
		Star(W(OneOf(",")).Cat(W(argumentRule))),
	),
)
argumentRule.Set(
	Int().Or(Str()).Or(Identifier()),
)

Please note that every important syntactic element has a corresponding rule so that they appear in the tree.
The method we suggest is that for every rule, a corresponding struct is
created that receives the node containing the information related to
the struct. The character-level structs actually parse the matched
substrings in the nodes; the higher-level ones simply check whether the node
names match and pass on the parsing tasks to structs and nodes at a lower
level. This provides a recursive algorithm that traverses the tree and builds
up a data structure that corresponds to the information that is stored in the
AST.
Here are the corresponding structs. All of them implement two methods:
fromNode(), which processes a Node, and String(), which returns the string
representation of the struct.
type argument struct {
	value string
}

func (a *argument) fromNode(n *Node) error {
	if n.Name != "argument" {
		return fmt.Errorf("expected an argument node, got: %s", n.Name)
	}
	a.value = n.String()
	return nil
}

func (a *argument) String() string {
	return a.value
}

type argumentList struct {
	arguments []*argument
}

func (a *argumentList) fromNode(n *Node) error {
	if n.Name != "argumentList" {
		return fmt.Errorf("expected an argumentList node, got: %s", n.Name)
	}
	for _, n := range n.Children {
		var arg = &argument{}
		if err := arg.fromNode(n); err != nil {
			return err
		}
		a.arguments = append(a.arguments, arg)
	}
	return nil
}

func (a *argumentList) String() string {
	var sb strings.Builder
	for _, arg := range a.arguments {
		sb.WriteString(fmt.Sprintf("%s,", arg))
	}
	return sb.String()
}

type functionName struct {
	value string
}

func (f *functionName) fromNode(n *Node) error {
	if n.Name != "functionName" {
		return fmt.Errorf("expected a functionName node, got: %s", n.Name)
	}
	f.value = n.String()
	return nil
}

func (f *functionName) String() string {
	return f.value
}

type functionCall struct {
	name         *functionName
	argumentList *argumentList
}

func (f *functionCall) fromNode(n *Node) error {
	if n.Name != "functionCall" {
		return fmt.Errorf("expected a functionCall node, got: %s", n.Name)
	}
	if len(n.Children) < 1 {
		return fmt.Errorf("missing functionName node")
	}
	var (
		fName = &functionName{}
		err   error
	)
	if err = fName.fromNode(n.Children[0]); err != nil {
		return err
	}
	f.name = fName
	if len(n.Children) > 1 {
		var args = &argumentList{}
		if err = args.fromNode(n.Children[1]); err != nil {
			return err
		}
		f.argumentList = args
	}
	return nil
}

func (f *functionCall) String() string {
	return fmt.Sprintf("%s (%s)", f.name, f.argumentList)
}

The processing is done with a simple call which can be seen at the bottom of the listing below:
if matched, root, err = Parse(text, functionCallRule); err != nil {
	t.Fatalf("expected to pass, got: %s", err)
}
if !matched.FullMatch() {
	t.Fatalf("expected full match, got: %s", matched)
}
if len(root.Children) != 1 {
	t.Fatalf("expected to get a functionCall node, but tree is different")
}
var fn = &functionCall{}
if err := fn.fromNode(root.Children[0]); err != nil {
	t.Fatalf("expected to build a tree structure, got: %s", err)
}

We create an instance of the top-level struct and make a call to fromNode(), passing the root of the AST (which is the child of the node called "[ROOT]").