Text Walker

Documentation

Getting Started

textwalker is a simple utility to incrementally parse (un|semi)structured text.

The textwalker API emulates how a complex regular expression is iteratively constructed. Typically, when constructing a regex, I'll construct a part of it; test it and build the next part.

Consider trying to parse an SQL table definition:

>>> text = """CREATE TABLE dbo.car_inventory
(
    cp_car_sk        integer               not null,
    cp_car_make_id   char(16)              not null,
)
WITH (OPTION (STATS = ON))"""

>>> from text_walker import TextWalker
>>> tw = TextWalker(text)

>>> tw.walk('CREATE')
>>> tw.walk('TABLE')

The TextWalker class is initialized with the text to parse. The walk(pattern) method consumes and returns the pattern. Here, the return value is the literal matched. This pattern can be a string representing a:

literal, e.g. foo
character set, with character ranges and individual characters e.g. [a-z9]
grouping, e.g. (foo)+

See supported grammar here.

Internally, when walk is invoked the TextWalker tracks how much of the input text has been matched.

This is essentially, the key thought behind the design: by making the text parsing stateful, it can be done incrementally, and this reduces the complexity of the expression for matching text and allows combining with python text processing capabilities.

>>> table_name_match = tw.walk('dbo.[a-z0-9_]+')
>>> tablename = table_ame_match.replace('dbo.', '')
>>> print(f'table name is {tablename}')

table name is car_inventory

>>> tw.walk('\(')

# now print column names
>>> cols_text, _ = tw.walk_until('WITH')
>>> for col_def in cols_text.split(','):
        col_name = col_def.strip().split(' ')[0]
        print(f'column name is: {}')

column name is cp_car_sk
column name is cp_car_make_id

Or trying to parse a phone number, e.g.

>>> from textwalker import TextWalker
>>> text = "(+1)123-456-7890"
>>> tw = TextWalker(text)
>>> area_code = tw.walk('(\\(\\+[0-9]+\\))?')
>>> print(f'area code is {area_code}')

Note, special characters need to be escaped in all contexts.

>>> steps = tw.walk_many(['[0-9]{3,3}', '\\-', '[0-9]{3,3}', '\\-', '[0-9]{4,4}'])
>>> print(f'first 3 digits are {steps[0]}; next 3 digits are {steps[2]}; last 3 digits are {steps[4]}')
first 3 digits are 123; next 3 digits are 456; last 3 digits are 7890

More Examples

See more examples in .\examples

Installation

Textwalker is available on PyPI:

python -m pip install textwalker

Grammar

Literals

Can be any literal string

foo
bar 
123
x?

Can have quantifiers

Character Sets

A character set is defined within a pair of left and right square brackets, [...]
Can contain ranges, specified via a dash, [a-z] or individual chars [a-z8]
Support quantifiers, [0-9]{1,3}
NOTE: There are no predefined ranges!

Groups

A group is defined with a pair of parentheses (...)
A group can contain Literals, Character Sets and arbitrarily nested Groups, (hello[a-zA-z]+)*

Quantifiers

zero or more *
zero or one ?
one or more +
range {1,3}

Special Characters

Special characters (below) need to be escaped in all contexts.

"(", ")", "[", "]", "{", "}", "-", "+", "*", "?"

To escape a character it must be escaped with a double backslash, e.g. left parentheses \\(
This need two backslashes, because a single \ is treated by the python interpreter as an escape on the following character.
Even in cases, where a special character is unambiguously non-special, e.g. [*], can only mean match the literal * character, it must still be escaped. [*] is an invalid expression.

Limitations/Gotchas/Notes

The matching semantics are such that a pattern must fully match to be considered a match. For the walk methods None means not a match. This is different from a match of zero length, e.g. (foo)?
If a quantifier is not specified it must have exactly one match.
charset ranges match depend on how lexical comparison is implemented in python
only supports case-sensitive search
all operators are greedy. This is noteworthy, because in some cases, a non-greedy match on a sub-group would lead to match on the entire e.g. if matching (ab)*ab, the text abab will be a non match, since the subexpression (ab)* will consume the entire text. This can be avoided by, e.g. (ab){1,1}ab would match abab

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
textwalker		textwalker
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
dodo.py		dodo.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Text Walker

Documentation

Getting Started

More Examples

Installation

Grammar

Literals

Character Sets

Groups

Quantifiers

Special Characters

Limitations/Gotchas/Notes

About

Uh oh!

Releases 2

Packages

Uh oh!

Languages

License

spandanb/textwalker

Folders and files

Latest commit

History

Repository files navigation

Text Walker

Documentation

Getting Started

More Examples

Installation

Grammar

Literals

Character Sets

Groups

Quantifiers

Special Characters

Limitations/Gotchas/Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Languages

Packages