-
-
Notifications
You must be signed in to change notification settings - Fork 199
Description
After #275 my lexer test suite was still failing. It looks like it's because the Quest
regex operator (and related) doesn't support "giving back". For example, imagine you have the pattern for matching a 3 or 4 capital letter time zone: [A-Z][A-Z][A-Z]?T
-- the generated lexer currently will not match EST
, because the current quest operator implementation is fully greedy (never gives back). The generated code will match EST
for the first three operations, but that advances the position past the ending T
and thus the last literal T
does not match. It doesn't backtrack to try the case where the [A-Z]?
term was never matched, which would have allowed it to match.
The current generated code looks something like:
// [A-Z][A-Z][A-Z]?T (Concat)
l3 := func(s string, p int) int {
if p = l0(s, p); p == -1 { return -1 }
if p = l0(s, p); p == -1 { return -1 }
if p = l1(s, p); p == -1 { return -1 }
if p = l2(s, p); p == -1 { return -1 }
return p
}
It seems the implementation to support "give back" could be pretty complicated, so I wanted to discuss about possible good directions.
Ideas:
- The least efficient but simplest would be to have it rewrite the regexp to remove the quest operator and instead replace the entire expression it was concatenated within with the two (or more variants). e.g.,
[A-Z][A-Z][A-Z]?T
would be rewritten internally to[A-Z][A-Z][A-Z]T|[A-Z][A-Z]T
before going through codegen. This could quickly lead to exponential explosion of the number of alternates for any reasonably complex regex, but the code itself would never have to "backtrack". It's simple to implement, but could be horribly slow, doesn't seem like a good option. - Perhaps change the generated functions representing expression (e.g.,
l3
func above) to return a list of positions, rather than a single position (or an empty list instead of -1). Then the caller would explore all of the possibilities starting from that point (or if there was no match because it got an empty list, would stop and return an empty list right there). It feels like this direction might work, but would require some memory allocation with the returned lists (maybe there is a way to optimize by passing a shared list or using an allocation pool). A sketch of this might look like:
// [A-Z][A-Z][A-Z]?T (Concat)
l3 := func(s string, p int) []int {
matchedPositions := make([]int, 0)
for _, nextP := range l0(s, nextP) {
for _, nextP := range l0(s, nextP) {
for _, nextP := range l1(s, nextP) {
for _, nextP := range l2(s, nextP) {
matchedPositions = append(matchedPositions, nextP)
}
}
}
}
return matchedPositions
}
Probably this method could be optimized where it could recognize if a given function can only ever return a single position, and if so avoid the overhead of allocation. e.g., it could look more like:
l3 := func(s string, p int) []int {
matchedPositions := make([]int, 0)
if p = l0(s, p); p != -1 {
if p = l0(s, p); p != -1 {
for _, p := range l1(s, p) {
if p = l2(s, p); p != -1 {
matchedPositions = append(matchedPositions, p)
}
}
}
}
return matchedPositions
}
This variant is nice because it can still use p
regardless of whether it's in a loop (previous function had multiple possibilities) or just an if statement (previous function had just one possibility). With an allocation pool for the matchedPositions array, maybe it wouldn't actually be that bad (memory complexity would be related to the number of possible open "branches" at any given time)
- Perhaps some technique where the caller recognizes the function it's calling represents a quest, and generates the different "paths" in the code (duplicating code for the entire "branch" for the remainder of that concatenation) -- this seems both hard to implement and code size could explode too for complex pressions.
What do you think? Any ideas on better directions?