Adaptive, multi‑strategy web scraper that extracts clean text and metadata for RAG pipelines and local LLM workflows.
- About
- Features
- Project Structure
- Roadmap
- Quick Start
- Usage
- Contributing
- Hacktoberfest
- Submitting a Pull Request
- Guidelines for Pull Request
- Authors
This project proposes a robust web scraper designed for maximum adaptability. It automatically adjusts to various website structures by employing a multi‑strategy extraction approach. The scraper attempts to extract clean, structured text and metadata (including title, author, and date) using methods such as newspaper3k, readability‑lxml, and BeautifulSoup heuristics. In cases where these methods are insufficient, it can optionally fall back to headless browser rendering to capture content from more complex, dynamically loaded websites. The output is specifically formatted for integration into Retrieval‑Augmented Generation (RAG) pipelines or for use with local Large Language Models (LLMs).
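A minimal sketch of how such a strategy cascade can work, using the libraries named above. This is illustrative only, not scrag's actual implementation; the length threshold and the shape of the returned record are assumptions:

```python
# Illustrative sketch of a multi-strategy extraction cascade; not scrag's actual code.
# The MIN_LENGTH threshold and the returned record shape are assumptions.
import requests
from newspaper import Article        # newspaper3k
from readability import Document     # readability-lxml
from bs4 import BeautifulSoup

MIN_LENGTH = 200  # characters of body text considered "good enough"

def extract(url: str) -> dict:
    # Strategy 1: newspaper3k handles most article-style pages and pulls metadata.
    article = Article(url)
    article.download()
    article.parse()
    if len(article.text) >= MIN_LENGTH:
        return {"title": article.title, "authors": article.authors,
                "date": article.publish_date, "text": article.text}

    html = requests.get(url, timeout=10).text

    # Strategy 2: readability-lxml isolates the main content block from raw HTML.
    doc = Document(html)
    text = BeautifulSoup(doc.summary(), "html.parser").get_text(" ", strip=True)
    if len(text) >= MIN_LENGTH:
        return {"title": doc.title(), "authors": [], "date": None, "text": text}

    # Strategy 3: plain BeautifulSoup heuristics (paragraph tags) as a last resort,
    # before an optional headless-browser fallback for JS-heavy pages.
    soup = BeautifulSoup(html, "html.parser")
    text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return {"title": soup.title.string if soup.title else None,
            "authors": [], "date": None, "text": text}
```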
An ambitious, optional extension to this project is the Universal RAG Builder. This layer would automatically identify and scrape top‑ranked websites relevant to a user's query, then build a RAG index from the collected data. The feature addresses a key limitation of local LLMs: their inability to browse the internet. It provides automated knowledge aggregation and up‑to‑date information retrieval without requiring manual data collection. The project's user interface will initially be a Command Line Interface (CLI), with a web‑based version planned for users who prefer a more visual workflow.
- Multi‑strategy extraction: newspaper3k, readability‑lxml, and BeautifulSoup‑based heuristics.
- Metadata capture: title, author, and date when available.
- Optional headless rendering fallback for dynamic, JS‑heavy pages.
- RAG‑ready output: clean, structured content suitable for chunking and indexing (see the sketch after this list).
- CLI first: simple commands to fetch and export content; web UI planned.
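To make the RAG‑ready output idea concrete, here is a minimal chunking sketch. The record fields and chunk parameters are assumptions for illustration, not scrag's fixed schema or API:

```python
# Illustrative only: split an extracted record into overlapping passages for indexing.
# The record fields and chunk sizes are assumptions, not scrag's fixed schema.

def chunk_record(record: dict, size: int = 800, overlap: int = 100) -> list[dict]:
    """Cut record["text"] into overlapping character windows, carrying metadata along."""
    text = record["text"]
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({
            "text": text[start:end],
            "title": record.get("title"),
            "author": record.get("author"),
            "date": record.get("date"),
            "source_url": record.get("url"),
        })
        if end == len(text):
            break
        start = end - overlap
    return chunks
```

Keeping title, author, date, and source URL on every chunk lets a vector store cite the original page alongside each indexed passage.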
scrag/
├── src/scrag/ # Main source code
│ ├── extractors/ # Content extraction strategies
│ ├── processors/ # Text processing and cleaning
│ ├── storage/ # Storage backends and adapters
│ ├── rag/ # RAG pipeline components
│ ├── cli/ # Command-line interface
│ ├── web/ # Web interface (planned)
│ └── utils/ # Utility functions
├── tests/ # Comprehensive test suite
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ ├── performance/ # Performance benchmarks
│ └── fixtures/ # Test data and mocks
├── docs/ # Documentation
│ ├── api/ # API reference
│ ├── guides/ # User guides
│ └── tutorials/ # Step-by-step tutorials
├── config/ # Configuration files
│ ├── extractors/ # Extractor configurations
│ └── rag/ # RAG pipeline configurations
├── deployment/ # Deployment configurations
│ ├── docker/ # Docker configurations
│ ├── kubernetes/ # Kubernetes manifests
│ └── aws/ # AWS deployment files
├── scripts/ # Development and build scripts
└── ARCHITECTURE.md # Detailed architecture documentation
For detailed architecture information, see ARCHITECTURE.md.
- Universal RAG Builder: auto‑discover top results for a query, scrape them, and build a ready‑to‑use RAG index.
- Web UI: a lightweight interface for users who prefer a visual workflow.
- Export adapters: convenient formats for popular vector DBs and RAG frameworks.
# 1) Fork and clone
# Click Fork on GitHub, then:
git clone https://github.com/ACM-VIT/scrag.git
cd scrag
# 2) Create a branch
git checkout -b feat/your-feature
# 3) Install dependencies
uv sync
uv pip install -e src/scrag
> **Note:** This project uses `uv` as the canonical dependency manager. Dependencies are defined in `src/scrag/pyproject.toml` and managed via `uv.lock`. Do not use `pip install -r requirements.txt` as the root `requirements.txt` has been removed to avoid conflicts.
# 4) Verify the CLI
uv run scrag info
Run the Typer-powered CLI after syncing dependencies (as shown in Quick Start).
# Extract a single page using the default strategy cascade
uv run scrag extract https://example.com/article
# Choose a custom output location and persist as plain text
uv run scrag extract https://example.com/article --output data/custom --format txt
# Relax the minimum content length requirement for sparse pages
uv run scrag extract https://example.com/article --min-length 50
We welcome contributions of all kinds! Please read our Contributing Guidelines to get started quickly and make your PRs count.
Join us for Hacktoberfest! Quality > quantity.
- Aim for meaningful, well‑scoped PR/MRs that solve real issues.
- Non‑code contributions (docs, design, tutorials) are welcome via PR.
- Full participation details: https://hacktoberfest.com/participation
- Fork the repository (top‑right on GitHub)
- Clone your fork locally:
git clone <HTTPS-ADDRESS>
cd <NAME-OF-REPO>
- Create a new branch:
git checkout -b <your-branch-name>
- Make your changes and stage them:
git add .
- Commit your changes:
git commit -m "feat: your message"
- Push to your fork:
git push origin <your-branch-name>
- Open a Pull Request and clearly describe what you changed and why. Link related issues (e.g., “Fixes #123”).
- Avoid PRs that are automated/scripted or plagiarized from someone else’s work.
- Don’t spam; keep each PR focused and meaningful.
- The project maintainer’s decision on PR validity is final.
- For more, see our Contributing Guidelines and the Hacktoberfest participation rules.
Authors:
Contributors:
By participating in this project, you agree to abide by our Code of Conduct.
Made with ❤️ by ACM‑VIT