Lexicalis is a Python-based lexical analyzer designed to tokenize source code written in an imaginary programming language. The project is robust, flexible, and features detailed inline documentation to help users understand and extend its functionality. This documentation provides a structured guide to the design, functionality, and usage of Lexicalis.
- Symbol Table: Stores predefined tokens such as keywords, operators, and delimiters, and is filled out with newly found tokens along the way.
- State Diagrams: Used to validate and classify tokens like variables, numbers, strings, and classes.
- Comprehensive Error Handling: Ensures invalid tokens are flagged appropriately.
- Extendability: Designed to support the addition of new token types and patterns.
The analyzer follows a structured flow:
- Source Input: A source code string is provided.
- Line Splitting: The source is split into lines and then into individual words (potential tokens).
- Word Scanning: Each word is either a single token or a group of tokens and is fed into `scan_word()` or `scan_group()`, respectively.
- Token Validation:
- Each word is checked against the Symbol Table (ST) to see whether it is already known.
- If not found, the token is validated using State Diagrams.
- Accepted Tokens: Valid tokens are categorized and stored in the symbol table.
- Error Reporting: If a token cannot reach a valid acceptance state, an error is raised.
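As a rough, standalone illustration of the splitting steps (ordinary Python string handling, not the project's code):

```python
source = 'while x < 10 {\n    print("Hello")\n}'
for line in source.splitlines():    # Line Splitting
    for word in line.split():       # potential tokens
        print(word)
# Note: 'print("Hello")' comes out as a single word, i.e. a group of
# tokens that scan_group() would need to split further.
```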
For a visual representation, refer to the flowchart below:
```
       +-------------------+
       |    Source Code    |
       +-------------------+
                 |
                 v
       +-------------------+
       |  Split to Lines   |
       +-------------------+
                 |
                 v
       +-------------------+
       |  Split to Words   |
       +-------------------+
                 |
                 v
       +-------------------+
       |  Tokenize Words   |
       +-------------------+
                 |
         +-------+---------------------+
         |                             |
         v                             v
+------------------+        +----------------------+
|   Single Token   |        |   Group of Tokens    |
+------------------+        +----------------------+
         |                             |
         v                             v
+------------------+        +----------------------+
|   scan_word()    | <---+  |     scan_group()     |
+------------------+     |  +----------------------+
         |               |             |
         v               |             v
+----------------------+ |  +-------------------------+
|  Check Symbol Table  | +--| Split Group into Tokens |
+----------------------+    +-------------------------+
         |          \
         v           \
+-----------------+    \    +-------------------+
|  Found Already  |     +-->|   Not Found Yet   |
+-----------------+         +-------------------+
         |                            |
         v                            v
+------------+              +-------------------+
|    Done    | <------------|   State Diagram   |
+------------+              +-------------------+
```
The `Scanner` class orchestrates the tokenization process. It relies on the `SymbolTable` and `StateDiagram` classes to identify and validate tokens.
Key Methods:
- `tokenize(source: str)`: Splits the source code into lines and words, then passes each word through `scan_word` or `scan_group`.
- `scan_word(word: str)`: Checks if the word exists in the Symbol Table; if not found, delegates to a State Diagram for validation.
- `scan_group(group: list[str])`: Processes groups of adjoined symbols, splitting them into individual tokens (see the flowchart above).
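A condensed, runnable sketch of how these methods might fit together is shown below. It mirrors the documented flow, but the regexes and inline checks are stand-ins for the real Symbol Table and State Diagram logic, not the project's actual implementation:

```python
import re

class Scanner:
    def __init__(self):
        # Predefined entries; the table grows as new tokens are accepted.
        self.symbol_table = {
            "while": "kw", "for": "kw", "if": "kw", "else": "kw",
            "+": "opr", "-": "opr", "==": "opr", "div": "opr",
            "{": "del", "}": "del", ";": "del",
        }

    def tokenize(self, source):
        tokens = []
        for line in source.splitlines():            # Split to Lines
            for word in line.split():               # Split to Words
                if word in self.symbol_table or re.fullmatch(r"\$?\w+", word):
                    tokens.append(self.scan_word(word))    # single token
                else:
                    tokens.extend(self.scan_group(word))   # group of tokens
        return tokens

    def scan_word(self, word):
        if word in self.symbol_table:               # Check Symbol Table
            return (self.symbol_table[word], word)
        # Stand-ins for the State Diagram checks described below:
        if word.startswith("$"):
            kind = "var"
        elif word.replace(".", "", 1).isdigit():
            kind = "num"
        else:
            raise ValueError("Token did not reach acceptance state.")
        self.symbol_table[word] = kind              # accepted tokens are stored
        return (kind, word)

    def scan_group(self, group):
        # Split adjoined symbols (e.g. "$y;") and scan each piece.
        parts = re.findall(r"\$?\w+|\S", group)
        return [self.scan_word(p) for p in parts]

print(Scanner().tokenize("while $x == 10 { $y; }"))
```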
The `SymbolTable` acts as a central repository for all token types. It maintains token sets like:
- Keywords: `while`, `for`, `if`, `else`
- Operators: `+`, `-`, `==`, `div`
- Delimiters: `{`, `}`, `;`
Users can add or modify tokens for customization.
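For instance, if the token sets are exposed as plain Python collections, customization might look like the following (the attribute names here are hypothetical, not the project's confirmed API):

```python
from SymbolTable import SymbolTable  # assumed import path, mirroring `from Scanner import Scanner`

st = SymbolTable()
# Hypothetical attribute names; adapt them to the actual class layout.
st.keywords.add("until")    # register a new keyword
st.operators.add("mod")     # register a new operator
st.delimiters.add(",")      # register an extra delimiter
```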
The `StateDiagram` class implements finite automata to validate complex tokens. It defines a transition map (products) for each token type.
Example Diagrams:
- Variable: Recognizes identifiers starting with `$`.
- Number: Validates integers and floating-point numbers.
- String: Detects strings enclosed in quotes.
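To make the automaton idea concrete, here is a generic, self-contained sketch of what the Number diagram could encode (illustrative only; the project's own `StateDiagram` transition-map format is shown in the extension example near the end of this document):

```python
# Illustrative finite automaton for integers and simple floats.
# States: '0' start, '1' integer part, '2' after the dot, '3' fraction.
transitions = {
    ('0', 'digit'): '1',
    ('1', 'digit'): '1',
    ('1', '.'): '2',
    ('2', 'digit'): '3',
    ('3', 'digit'): '3',
}
accepting = {'1', '3'}  # '1' accepts integers, '3' accepts floats

def classify(ch):
    return 'digit' if ch.isdigit() else ch

def accepts(word):
    state = '0'
    for ch in word:
        state = transitions.get((state, classify(ch)))
        if state is None:
            return False        # no transition defined: reject
    return state in accepting

print(accepts("10"), accepts("3.14"), accepts("3."))  # True True False
```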
Here’s a simple example demonstrating how to use Lexicalis:
```python
from Scanner import Scanner

source_code = """
while x < 10 {
    print("Hello, world!")
}
"""

scanner = Scanner()
tokens = scanner.tokenize(source_code)
for token_type, token_value in tokens:
    print(f"{token_type}: {token_value}")
```

Expected Output:

```
kw: while
var: x
opr: <
num: 10
del: {
func: print
str: "Hello, world!"
del: }
```
Invalid tokens or unrecognized patterns raise a `ValueError` with a descriptive message, aiding debugging and error resolution.
Example:

```python
scanner.scan_word("$Invalid#Token")
# Output: ValueError: Token did not reach acceptance state.
```
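In practice, callers can catch these errors during scanning (plain Python, continuing the usage example above):

```python
try:
    tokens = scanner.tokenize(source_code)
except ValueError as err:
    # The message points at the offending token, which aids debugging.
    print(f"Lexical error: {err}")
```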
To extend Lexicalis:
- Adding New Tokens:
  - Update the `SymbolTable` with the new token type and values.
- Custom State Diagrams:
  - Create a new `StateDiagram` instance with the required transition map.
Example:

```python
custom_diagram = StateDiagram(
    (('0', 'C'), '1'),
    (('1', 'U'), 'A'),
    token_type="custom",
)
```
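To decode that transition map: each pair maps a (state, input) tuple to a next state, and reaching `'A'` presumably denotes the acceptance state (an assumption based on the error message above). A tiny standalone runner over the same pairs illustrates this:

```python
# Same transition pairs as in the example above.
transitions = {('0', 'C'): '1', ('1', 'U'): 'A'}

def run(word, start='0', accept='A'):
    state = start
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:
            return False        # no transition defined: reject
    return state == accept

print(run("CU"))  # True: '0' -C-> '1' -U-> 'A'
print(run("CC"))  # False: no transition from '1' on 'C'
```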
Planned features:
- Support for Multi-line Strings.
- Integration with Syntactic Analyzers.
- Interactive Visualizations of Tokenization.
Feel free to reach out or contribute to Lexicalis.