Skip to content

nagygr/gstx

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GSTX (Syntx for Go)

Introduction

GSTX is a parsing framework for Go that makes it possible to describe a grammar in an EBNF-like manner. By simply composing a declarative set of rules a parser can be generated that produces an abstract syntax tree (AST) if a text is processed successfully. In case of a syntax error, it returns as much information as it is possible to extract and creates a nice, informative error message.

As an example, the EBNF of JSON is given below (this is one possible implementation of the grammar). Higher level rules are written in lower-case, low-level rules (terminal symbols) are written in upper-case.

json = value
value = NUMBER | STRING | "null" | "true" | "false" | object | array;
object = "{" (key_value ("," key_value)* )? "}"
key_value = STRING ":" value
array = "[" (value ("," value)* )? "]"

As Go doesn't have the feature of overloading operators, the grammar looks quite different when described using GSTX but it "reads" almost the same:

var(
    json     = NewRule("json")
    value    = NewRule("value")
    object   = NewRule("object")
    keyValue = NewRule("keyValue")
    array    = NewRule("array")
)

json.Set(W(value).Cat(W(E())))

value.Set(
    Float().
        Or(Str()).
        Or(Keyword("null")).
        Or(Keyword("true")).
        Or(Keyword("false")).
        Or(object).
        Or(array),
)
object.Set(
    W(OneOf("{")).
        Cat(Opt(
            W(keyValue).Cat(
                Star(W(OneOf(",")).Cat(W(keyValue)))),
        )).
        Cat(W(OneOf("}"))),
    ),
)

keyValue.Set(W(Str()).Cat(W(OneOf(":"))).Cat(W(value)))

array.Set(
    W(OneOf("[")).
        Cat(Opt(
            W(value).
                Cat(Star(W(OneOf(",")).Cat(W(value)))),
        )).
        Cat(W(OneOf("]"))),
)

Please note the similarities between the two listings. The grammatical rules are the same: json, value, object, keyValue, array. In the Go version, each rule is given a name in the NewRule() creator function and then they are defined using the Set() method.

The methods and creator functions of built-in types provide a domain-specific language that brings the rule definitions very close to the spoken language. For example, the value rule reads as: a value is a float (floating-point number) or a string or the keyword null or the keyword true or the keyword false or an object or an array.

If we need to match two rules in a sequence, we can use the Cat() function, where Cat stands for concatenation. There are built-in rules for the typical tasks of parsing: Identifier() matches any character sequence that could be an identifier in a typical programming language (starts with a character or underscore and continues with characters, underscores or digits). Keyword will match a given keyword. It checks for an identifier that is equal to the given string literal. So Keyword("null") will not match the text nullable but it will match the first four letters of the literal null-1. Str() matches strings and the StrDelim() version also allows the user to set the string delimiters (starting and ending characters) and the escape character, so e.g. StrDelim('[', ']', '#') will match the following sting literal: [app#[le].

The OneOf() rule gets a string literal as its input, and will try to match one character from the given literal. Any of the characters might match but only one character can get digested by it.

For numbers there are two built-in types: Int() matches integers (signed or unsigned) and Float() matches real numbers. The FloatSep(',') function makes it possible to set the decimal separator, so this example matches: 23,45.

Sometimes rule are optional or can be repeated arbitrary times. There are three built-in rules to deal with these cases, all taking an inner rule that they will try to match as many times as their name suggests:

  • Opt(<rule>): this means that the inner rule is optional, it will match either once or it won't match at all
  • Star(<rule>): the inner rule is optional but it can also match any number of times
  • Plus(<rule>): the inner rule has to match at least once but can match many more times

For example, Star(OneOf("123")) will match 1, 333221 and the empty string as well while Plus(OneOf("123")) will not match the empty string.

Many parsing frameworks separate lexers (simple matchers that work at the character level) and parsers, high-level rules that work on lexers. GSTX doesn't do this separation so the rules go all the way down to the character level. This means that we need to deal with whitespaces (something that is usually taken care of by lexers). To make this as easy as possible, there is a single letter function: W(<rule>), that takes a rule as an argument and digests all the white-spaces before trying to match the inner rule. So, for example W(Int()) will match: 2321.

Using W(), we can get rid of all the unwanted whitespace from our input text, except for the trailing ones. Wecould say that we're not interested in those as by the time we get there, we have already parsed the gist of our input but the of the returned types that contains the indices of the matched substring also has a method called FullMatch() which only returns true if our grammar digested the entire input text. This can be very useful to check that our parsing was complete. In order to be able to use this method even when there are trailing whitespaces, we have an epsilon function (E()) that digests nothing but always returns with success. By adding the following structure to the end of our main rule, we can make sure that any trialing whitespaces are digested by the grammar: .Cat(W(E())). This is how the json rule ends.

Using GSTX

When we need to process a given text input, there are a few things to setup: apart from defining our grammar using a set of rules, we also need to call the Parse() function that takes an input string and the top-level rule as inputs. It returns three values:

  1. a value of type *MatchedResult which is a struct with two public methods:
    • String() which returns the substring that was matched by the top-level rule
    • FullMatch() which returns true if the match was full, i.e. the entire input string was digested by the top-level rule
  2. a value of type *Node, the root of Abstract Syntax Tree (AST) of the parsed text; this is a bit special, more on this later
  3. an instance of *SyntaxError (or nil) if the parsing failed

So, here's a full example showing how a single integer can be parsed:

var (
    text    = "123"
    rule    = Int()
    matched *MatchedResult
    node    *Node
    err     *SyntaxError
)

if matched, node, err = Parse(text, rule); err != nil {
    log.Fatalf("error parsing the input: %s", err)
}

The SyntaxError type is an error so it has an Error() method and can be used wherever a built-in error can be used.

As it can be seen in the example above, any rule can be passed to the Parse() method, not just ones that are of type Rule that is created by the NewRule() function. All the rules implement the Parser interface and the Parse() function expects a Parser.

The Rule is required for two reasons: the first is that there are recursive grammars where the rules need to be referenced before they are defined and for this reason they need to be identifiable before they have a definition. A typical example is the set of rules that define arithmetic expressions, where it is possible to have parantheses at an arbitrary level:

var (
    text       = "2*(3+4)"
    expression = NewRule("expression")
    term       = NewRule("term")
    factor     = NewRule("factor")
)

expression.Set(term.Cat(Opt(OneOf("+-").Cat(term))))
term.Set(factor.Cat(Opt(OneOf("*/").Cat(factor))))
factor.Set(Int().Or(OneOf("(").Cat(expression).Cat(OneOf(")"))))

AST generation

The other reason for the existence of rules is that GSTX uses them for AST generation. Typical AST contain a lot of nodes that are logical parts of the tree based on the grammar but have no importance when the results are processed and thus they just bloat the tree and make parsing harder.

GSTX has a different approach: it only includes the rules in the tree: each rule has a name and they appear together with the matched substring in their corresponding node. This way the programmers defining a grammar have a very fine-grained way of influencing what gets into the AST and what doesn't. Whenever something is needed in the tree, it is always possible to extract part of an expression to a separate rule and when something is not needed, the corresponding rule can be inlined into another rule or can be placed into a simple Parser type and not a Rule.

For example, there is a main function and thus an application that comes with the library, which is basically a full example showing how JSON documents can be parsed using GSTX. In that example, the JSON grammar is a little different than the one shown at the beginning of this document. One reason for this difference is that the generate tree is easier to handle this way. For example, keys have been extracted to a rule called key, so that keys and values both have their own leaf nodes in the tree and can thus be easily processed.

This is the grammar in main.go:

    var (
		json     = NewRule("json")
		value    = NewRule("value")
		object   = NewRule("object")
		key      = NewRule("key")
		keyValue = NewRule("keyValue")
		array    = NewRule("array")
	)

	json.Set(W(value).Cat(W(E())))
	value.Set(
		Float().
			Or(Str()).
			Or(object).
			Or(array).
			Or(Keyword("null")).
			Or(Keyword("true")).
			Or(Keyword("false")),
	)
	object.Set(
		Grp(
			W(OneOf("{")).
				Cat(
					W(keyValue).Cat(
						Star(W(OneOf(",")).Cat(W(keyValue)))),
				).
				Cat(W(OneOf("}"))),
		).
			Or(
				W(OneOf("{")).Cat(W(OneOf("}"))),
			),
	)
	key.Set(Str())
	keyValue.Set(W(key).Cat(W(OneOf(":"))).Cat(W(value)))
	array.Set(
		W(OneOf("[")).
			Cat(Opt(
				W(value).
					Cat(Star(W(OneOf(",")).Cat(W(value)))),
			)).
			Cat(W(OneOf("]"))),
	)

And if the input is the following JSON:

{
	"Name": {
		"Forename": "Jane",
		"Surname": "Smith"
	},
	"Age": 24,
	"Emails": ["[email protected]", "[email protected]"]
}

Then the generated AST is the following:

Name: [ROOT]   (matched: "...")
	Name: json   (matched: "...")
		Name: value   (matched: "{...")
			Name: object   (matched: "{...")
				Name: keyValue   (matched: ""Name": {...")
					Name: key   (matched: ""Name"")
					Name: value   (matched: "{...")
						Name: object   (matched: "{...")
							Name: keyValue   (matched: ""Forename": "Jane"")
								Name: key   (matched: ""Forename"")
								Name: value   (matched: ""Jane"")
							Name: keyValue   (matched: ""Surname": "Smith"")
								Name: key   (matched: ""Surname"")
								Name: value   (matched: ""Smith"")
				Name: keyValue   (matched: ""Age": 24")
					Name: key   (matched: ""Age"")
					Name: value   (matched: "24")
				Name: keyValue   (matched: ""Emails": ["[email protected]", "[email protected]"]")
					Name: key   (matched: ""Emails"")
					Name: value   (matched: "["[email protected]", "[email protected]"]")
						Name: array   (matched: "["[email protected]", "[email protected]"]")
							Name: value   (matched: ""[email protected]"")
							Name: value   (matched: ""[email protected]"")

Please note, that there is always a root node called "[ROOT]" and the node corresponding to the top-level rule is its only child. This is a result of how the AST is built in the library.

The Node type returned by the Parse() function provides a Walk() method which makes it possible to traverse the AST using a range-based for loop. The tree listing above was generated by the code below:

for level, node := range root.Walk() {
    for range level {
        fmt.Printf("\t")
    }

    fmt.Printf("Name: %s   (matched: \"%s\")\n", node.Name, cutMultilineStrings(node.String()))
}

where cutMultilineStrings() merely exchanges multiline strings with three asterisks.

Error messages

When there is a parsing failure, the rules return an error. What makes the generation of good error message hard is that a failure might now mean that the parsing itself fails. In complex grammars there many levels and many alternatives (Or operators). Whenever a given branch of an Or fails, an error message is generated but whether the entire Parse() call fails, depend on how the other branches perform. Another complicating factor is that when dealing with error usually two types of information are needed:

  1. the exact problem that caused parsing to fail (coming from the character-level rule that failed)
  2. the higher-level element that we were trying to parse at that given point (usually provided by the lowest level named rule above the character-level rule that failed)

GSTX solves the above problems with a heuristic method: it returns an error message from the character built-in rule that got furthest during parser, adds the name of the named rule that is closest to the failure and provides a method (ErrorLine()) that extracts the line where the error happened from the input text and shows the exact character where parsing failed. Together with the compound error message depicted above this should give enough information to find the syntax error.

For example, if we modify the example JSON in main.go by removing the colon from between "Forename" and "Jane":

{
	"Name": {
		"Forename" "Jane",
		"Surname": "Smith"
	},
	"Age": 24,
	"Emails": ["[email protected]", "[email protected]"]
}

Then we get the following error message:

2025/02/24 17:04:25 in rule "keyValue" expected a character from this set: ":" at 27
The error:
		"Forename" "Jane",
		          ^

This output is generated by the code below:

if result, root, err = Parse(text, json); err != nil {
    log.Fatalf("%s\nThe error:\n%s\n", err, err.ErrorLine(text))
}

If the reader takes a closer look at the grammar listing above, it is apparent that there is difference in how the object rule was defined when compared to the original listing. The reason for this is that inthe simplest definition of the grammar, the key-value list within the object is optional (an empty object is valid in JSON). The problem with this is that if the parsing fails due to the missing colon, the parser will opt for a no match in the optional part and thus the object rule will fail at the start of "Forename" as it is not a closing curly brace. This is not very helpful for a user who is trying to find out what the problem is with the JSON. It is therefore a better solution to add the empty object as a separate option with an Or() operator. This way, the keyValue rule will fail and as it gets the furthest in parsing the text, this is what will get bubbled back to the top and appears as an error message.

Processing the AST

There can be many ways of processing the AST. One suggested method is shown in pkg/parser/node_test.go in TestNodeBuildStructure(). The grammar is very simple here, it defines a function call in a very simple language:

var (
    text             = "myFun(23, \"apple\", fruits)"
    functionCallRule = NewRule("functionCall")
    functionNameRule = NewRule("functionName")
    argumentListRule = NewRule("argumentList")
    argumentRule     = NewRule("argument")
    root             *Node
    err              *SyntaxError
)

functionCallRule.Set(
    W(functionNameRule).Cat(W(OneOf("("))).Cat(argumentListRule).Cat(W(OneOf(")"))),
)

functionNameRule.Set(Identifier())

argumentListRule.Set(
    W(argumentRule).Cat(
        Star(W(OneOf(",")).Cat(W(argumentRule))),
    ),
)

argumentRule.Set(
    Int().Or(Str()).Or(Identifier()),
)

Please note, that that every important syntactic element has a corresponding rule so that they appear in the tree.

The method we suggest is that for every rule a corresponding struct is created that receives the node that should contain the information related to the struct. The character-level structs actually parse the matched substrings in the nodes, the higher-level ones simple check whether the node names match and pass on the parsing tasks to structs and nodes at a lower level. This provides a recursive algorithm that traverses the tree and builds up a data structure that corresponds to the information that is stored in the AST.

Here are the corresponding structs. All of them implement two methods: fromNode(), which processes a Node and String() that returns the string representation of the struct.

type argument struct {
	value string
}

func (a *argument) fromNode(n *Node) error {
	if n.Name != "argument" {
		return fmt.Errorf("expected an argument node, got: %s", n.Name)
	}

	a.value = n.String()
	return nil
}

func (a *argument) String() string {
	return a.value
}

type argumentList struct {
	arguments []*argument
}

func (a *argumentList) fromNode(n *Node) error {
	if n.Name != "argumentList" {
		return fmt.Errorf("expected an argumentList node, got: %s", n.Name)
	}

	for _, n := range n.Children {
		var arg = &argument{}

		if err := arg.fromNode(n); err != nil {
			return err
		}

		a.arguments = append(a.arguments, arg)
	}

	return nil
}

func (a *argumentList) String() string {
	var sb strings.Builder

	for _, arg := range a.arguments {
		sb.WriteString(fmt.Sprintf("%s,", arg))
	}

	return sb.String()
}

type functionName struct {
	value string
}

func (f *functionName) fromNode(n *Node) error {
	if n.Name != "functionName" {
		return fmt.Errorf("expected a functionName node, got: %s", n.Name)
	}

	f.value = n.String()
	return nil
}

func (f *functionName) String() string {
	return f.value
}

type functionCall struct {
	name         *functionName
	argumentList *argumentList
}

func (f *functionCall) fromNode(n *Node) error {
	if n.Name != "functionCall" {
		return fmt.Errorf("expected a functionCall node, got: %s", n.Name)
	}

	if len(n.Children) < 1 {
		return fmt.Errorf("missing functionName node")
	}

	var (
		fName = &functionName{}
		err   error
	)

	if err = fName.fromNode(n.Children[0]); err != nil {
		return err
	}

	f.name = fName

	if len(n.Children) > 1 {
		var args = &argumentList{}
		if err = args.fromNode(n.Children[1]); err != nil {
			return err
		}

		f.argumentList = args
	}

	return nil
}

func (f *functionCall) String() string {
	return fmt.Sprintf("%s (%s)", f.name, f.argumentList)
}

The processing is done with a simple call which can be seen at the bottom of the listing below:

if matched, root, err = Parse(text, functionCallRule); err != nil {
    t.Fatalf("expected to pass, got: %s", err)
}

if !matched.FullMatch() {
    t.Fatalf("expected full match, got: %s", matched)
}

if len(root.Children) != 1 {
    t.Fatalf("expected to get a functionCall node, but tree is different")
}

var fn = &functionCall{}
if err := fn.fromNode(root.Children[0]); err != nil {
    t.Fatalf("expected to build a tree structure, got: %s", err)
}

We create an instance of the top-level struct and make a call to fromNode() passing the root of the AST (which is the child of the node called "[ROOT]").

About

Simple, generic parsing framework for Go (Go Syntx)

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages