TLDR: It is an extension of the regular expressions for text editing and a grep-like command line tool.
- Examples
- Language specification
- Why it works
- Precedence
- Modes and greediness
- Determinization
- Performance
- Installation
- TODO
- References
Regular expressions is a great tool for searching patterns in text. But I always found it unnatural for text editing. The group logic works as a post-processor and can be complicated. Here I propose an extension to the regular expression language for pattern matching and text modification. I call it transductive regular expressions or simply trre.
It introduces the : symbol to define transformations. The simplest form is a:b, which replaces a with b. I call this a transductive pair or transduction.
To demonstrate the concept, I have created a command line tool trre. It feels similar to the grep -E command.
To run it first build the program:
makeTo change cat to dog we use the following expression:
echo 'cat' | ./trre 'cat:dog'dog
A more cryptic token-by-token version:
echo 'cat' | ./trre '(c:d)(a:o)(t:g)'dog
It can be used like sed to replace all matches in a string:
echo 'Mary had a little lamb.' | ./trre 'lamb:cat'Mary had a little cat.
Deletion:
To delete a string we can use the pattern string_to_delete:
echo 'xor' | ./trre '(x:)or'or
The expression (x:) could be interpreted as of translation of x to an empty string.
This expression can be used to delete all the occurances in the default scan mode. E.g. expression below removes all occurances of 'a' from a string/
echo 'Mary had a little lamb.' | ./trre 'a:'Mry hd little lmb.
Technically, here we replace character a with empty symbol.
To remove several characters from this string we can use square brackets construction similar to normal regex:
echo 'Mary had a little lamb.' | ./trre '[aie]:'Mry hd lttl lmb.
Insertion:
To insert a string we can use the pattern :string_to_insert
echo 'or' | ./trre '(:x)or'xor
We could think of the expression (:x) as of translation of an empty string into x.
We can insert a word in a context:
echo 'Mary had a lamb.' | ./trre 'had a (:little )lamb'Mary had a little lamb.
As for normal regular expression we could use alternations with | symbol:
echo 'cat dog' | ./trre '(c:b)at|(d:h)og'bat hog
Or use the star over trre to repeat the transformation:
echo 'catcatcat' | ./trre '(cat:dog)*'dogdogdog
In the default scan mode, star can be omitted:
echo 'catcatcat' | ./trre 'cat:dog'dogdogdog
You can also use the star in the left part to "consume" a pattern infinitely:
echo 'catcatcat' | ./trre '(cat)*:dog'dog
Avoid using * or + in the right part, as it can cause infinite loops:
echo '' | ./trre ':a*' # <- do NOT do this...
Instead, use finite iterations:
echo '' | ./trre ':(repeat-10-times){10}'
repeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesrepeat-10-timesTransform ranges of characters:
echo "regular expressions" | ./trre "[a:A-z:Z]"REGULAR EXPRESSIONS
As more complex example, lets create a toy cipher. Below is the Caesar cipher(1) implementation:
echo 'caesar cipher' | ./trre '[a:b-y:zz:a]'dbftbs djqifs
And decrypt it back:
echo 'dbftbs djqifs' | ./trre '[a:zb:a-z:y]'caesar cipher
trre can generate multiple output strings for a single input. By default, it uses the first possible match. You can also generate all possible outputs.
Binary sequences:
echo '' | ./trre -ma ':(0|1){3}'
000
001
010
011
100
101
110
111
Subsets:
echo '' | ./trre -ma ':(0|1){,3}?'
0
00
000
001
01
010
011
1
10
100
101
11
110
111
Informally, we define a trre as a pair pattern-to-match:pattern-to-generate. The pattern-to-match can be a string or regexp. The pattern-to-generate normally is a string. But it can be a regex as well. Moreover, we can do normal regular expression over these pairs.
We can specify a smiplified grammar of trre as following:
TRRE <- TRRE* TRRE|TRRE TRRE.TRRE
TRRE <- REGEX REGEX:REGEX
Where REGEX is a usual regular expression, and . stands for concatenation.
For now I consider the operator : as non-associative and the expression TRRE:TRRE as unsyntactical. There is a natural meaning for this expression as a composition of relations defined by **trre**s. But it can make things too complex. Maybe I change this later.
Under the hood, trre constructs a special automaton called a Finite State Transducer (FST), which is similar to a Finite State Automaton (FSA) used in normal regular expressions but handles input-output pairs instead of plain strings.
Key differences:
-
trredefines a binary relation between two regular languages. -
It uses FSTs instead of FSAs for inference.
-
It supports on-the-fly determinization for performance (experimental and under construction!).
To justify the laguage of trunsductive regular expression we need to prove the correspondence between trre expressions and the corresponding FSTs. There is my sketch of a the proof: docs/theory.pdf.
Below is the table of precedence (priority) of the trre operators from high to low:
| Operator | Expression |
|---|---|
| Escaped characters | \ |
| Bracket expression | [] |
| Grouping | () |
| Iteration | * + ? {m,n} |
| Concatenation | |
| Transduction | : |
| Alternation | | |
trre supports two modes:
-
Scan Mode (default): Applies transformations sequentially.
-
Match Mode: Checks the entire string against the expression (use
-mflag).
Use -a to generate all possible outputs.
The ? modifier makes *, +, and {,} operators non-greedy:
echo '<cat><dog>' | ./trre '<(.:)*>'<>
echo '<cat><dog>' | ./trre '<(.:)*?>'<><>
Another common example is to change something within tags or brackets. E.g. to change everything within "<>" we can use the following expression:
echo '<dog> <mouse>' | ./trre '<(.*?:cat)>'<cat> <cat>
The important part of the modern regex engines is determinization. This routine converts the non-deterministic automata to the deterministic one. Once converted it has linear time inference on the input string length. It is handy but the convertion is exponential in the worst case.
For trre the similar approach is possible. The bad news is that not all the non-deterministic transducers (NFT) can be converted to a deterministic (DFT). In case of two "bad" cycles with same input labels the algorithm is trapped in the infinite loop of a state creation. There is a way to detect such loops but it is expensive (see more in Allauzen, Mohri, Efficient Algorithms for testing the twins property).
The default non-deterministic version is a bit slower then sed:
wget https://www.gutenberg.org/cache/epub/57333/pg57333.txt -O chekhov.txt
time cat chekhov.txt | ./trre '(vodka):(VODKA)' > /dev/nullreal 0m0.046s
user 0m0.043s
sys 0m0.007s
time cat chekhov.txt | sed 's/vodka/VODKA/' > /dev/nullreal 0m0.024s
user 0m0.020s
sys 0m0.010s
For complex tasks, trre_dft (deterministic version) can outperform sed:
time cat chekhov.txt | sed -e 's/\(.*\)/\U\1/' > /dev/nullreal 0m0.508s
user 0m0.504s
sys 0m0.015s
time cat chekhov.txt | ./trre_dft '[a:A-z:Z]' > /dev/nullreal 0m0.131s
user 0m0.127s
sys 0m0.009s
No pre-built binaries are available yet. Clone the repository and compile:
git clone [email protected]:c0stya/trre.git trre
cd trre
make && sh test.shThen move the binary to a directory in your $PATH.
- Stable DFT version
- Full unicode support
- Complete the ERE feature set:
- negation
^within[] - character classes
- '$^' anchoring symbols
- negation
- Efficient range processing
- The approach is strongly inspired by Russ Cox article: Regular Expression Matching Can Be Simple And Fast
- The idea of transducer determinization comes from this paper: Cyril Allauzen, Mehryar Mohri, "Finitely Subsequential Transducers"
- Parsing approach comes from Double-E algorithm by Erik Eidt which is close to the classical Shunting Yard algorithm.