A fast, lightweight search engine written in Rust that can index and search through various document formats using TF-IDF scoring.
- Multiple Format Support: CSV, HTML, PDF, XML, TXT, Markdown
- Stemming: English Porter2 stemming algorithm
- Stop Words: Automatic filtering of common English stop words
- Parallel Processing: Multi-threaded indexing for performance
- Web Interface: HTTP server with search API
- Incremental Updates: Skip unchanged files during re-indexing
- TF-IDF Scoring: Relevance-based search results
```bash
git clone https://github.com/juanmilkah/indexer
cd indexer
bash build.sh
```

Index all files in the current directory:

```bash
indexer index
```

Index a specific directory:

```bash
indexer index --path /path/to/documents
```

Index with a custom output directory:

```bash
indexer index --path ./docs --output ./my_index
```

Include hidden files and directories:

```bash
indexer index --path ./docs --hidden
```

Skip specific directories or files:

```bash
indexer index --path ./project --skip-paths target node_modules .git
```

Search the default index:

```bash
indexer search --query "machine learning"
```

Search with a specific index directory:

```bash
indexer search --index ./my_index --query "rust programming"
```

Limit the number of results:

```bash
indexer search --query "database" --count 10
```

Save results to a file:

```bash
indexer search --query "algorithm" --output results.txt
```

Start the web server on the default port (8765):

```bash
indexer serve
```

Start on a custom port with a specific index:

```bash
indexer serve --index ./my_index --port 3000
```

The web interface will be available at http://localhost:8765.
The main index manages the inverted index structure:
- DocumentStore: Maps file paths to document IDs
- InMemorySegment: Temporary storage before flushing to disk
- Segments: Persistent storage units containing term dictionaries and postings lists
Tokenizes text content:
- Handles numeric, alphabetic, and special characters
- Applies English stemming using Porter2 algorithm
- Filters stop words
Document-specific parsers for different file formats:
- CSV: Extracts text from all fields
- HTML: Parses and extracts visible text content
- PDF: Extracts text from all pages
- XML: Extracts character data from elements
- Text/Markdown: Direct text processing
HTTP server providing search functionality:
- GET /: Serves HTML search interface
- POST /query: Processes search queries and returns results
- Indexing: Files → Parser → Lexer → Tokens → InMemorySegment → Disk Segments
- Searching: Query → Lexer → Tokens → Segment Lookup → TF-IDF Calculation → Ranked Results
```
~/.indexer/              # Default index directory
├── docstore.bin         # Document metadata
├── segment_0/           # First segment
│   ├── term.dict        # Term dictionary
│   └── postings.bin     # Postings lists
├── segment_1/           # Additional segments...
│   ├── term.dict
│   └── postings.bin
└── logs                 # Application logs
```
The indexer uses ~/.indexer as the default storage directory. This can be
overridden using the --output flag for indexing or --index flag for
searching.
- Text: `.txt`, `.md`
- Web: `.html`, `.xml`, `.xhtml`
- Data: `.csv`
- Documents: `.pdf`
- Segment Size: Default 100 documents per segment (configurable in code)
- Parallel Processing: Uses all available CPU cores for indexing
- Memory Usage: Segments are flushed to disk when full
- -l, --log <FILE>: Redirect logs to a specific file
```bash
indexer index [OPTIONS]
```

Options:
- -p, --path <PATH>: Directory or file to index
- -o, --output <DIR>: Index output directory
- -z, --hidden: Include hidden files and directories
- -s, --skip-paths <PATHS>: Skip specific paths (space-separated)

```bash
indexer search [OPTIONS] --query <QUERY>
```

Options:
- -i, --index <DIR>: Index directory to search
- -q, --query <QUERY>: Search terms
- -o, --output <FILE>: Save results to file
- -c, --count <NUMBER>: Maximum number of results

```bash
indexer serve [OPTIONS]
```

Options:
- -i, --index <DIR>: Index directory to serve
- -p, --port <PORT>: Port number (default: 8765)
Returns the HTML search interface.
Accepts search query in request body and returns matching documents.
Response Format:

```
/path/to/document1.txt
/path/to/document2.pdf
/path/to/document3.html
```
The search engine uses Term Frequency-Inverse Document Frequency scoring:
- TF (Term Frequency): Number of times a term appears in a document
- IDF (Inverse Document Frequency): `ln(total_docs / docs_containing_term)`
- Score: `TF × IDF`, summed across all query terms
Uses the rust-stemmers crate with the English Porter2 algorithm to reduce
words to their root forms (e.g., "running" → "run").
Common English words (the, and, or, etc.) are filtered out during indexing
and searching using the stop-words crate.
- Document Store: Binary serialization using `bincode2`
- Postings Lists: Binary serialization for efficient storage and retrieval
- Term Dictionaries: HashMap serialization for fast term lookups
Permission denied:

```bash
# Check file permissions or run with appropriate privileges
chmod +r /path/to/documents/*
```

Application logs are stored in ~/.indexer/logs by default. Use the --log
flag to specify a different location.
```bash
git clone <repository>
cd indexer
bash build.sh
```

- Fork the repository
- Create a feature branch
- Make changes with appropriate tests
- Submit a pull request
Check the version with:

```bash
indexer --version
```