A comprehensive, modular web scraping framework for extracting data from various e-commerce and classified advertisement websites. ICU-scraper provides specialized scrapers for different platforms with unified data export capabilities and a clean, maintainable architecture.
- Modular Architecture: Clean separation of scrapers and utilities
- Multi-platform Support: Specialized scrapers for different websites
- Flexible Field Selection: Choose which data fields to extract
- Multiple Export Formats: CSV and Excel output with comprehensive metadata
- Rate Limiting: Built-in delays to respect website policies
- Robust Error Handling: Graceful handling of network and parsing errors
- Data Validation: Clean and structured data output
- Extensible Design: Easy to add new scrapers following established patterns
- Python 3.7 or higher
- pip package manager
# Clone the repository
git clone <repository-url>
cd ICU-scraper
# Install required packages
pip install -r requirements.txt
- `requests>=2.31.0` - HTTP requests
- `beautifulsoup4>=4.12.0` - HTML parsing
- `pandas>=2.0.0` - Data manipulation
- `openpyxl>=3.1.0` - Excel file handling
- `lxml>=4.9.0` - Enhanced XML/HTML parsing
ICU-scraper/
├── scrapers/              # Specialized scrapers
│   ├── __init__.py        # Package initialization
│   ├── avito_scraper.py   # Avito.ma specialized scraper
│   ├── general_scraper.py # General-purpose scraper
│   └── 1moment_scraper.py # 1moment.ma e-commerce scraper
├── utils/                 # Shared utilities
│   ├── __init__.py        # Package initialization
│   ├── export.py          # Data export utilities
│   └── helpers.py         # Common helper functions
├── main.py                # Main entry point with demos
├── README.md              # This documentation
├── requirements.txt       # Python dependencies
└── .gitignore             # Git ignore rules
Specialized scraper for Avito.ma, Morocco's leading classified ads platform.
- ✅ Extracts comprehensive listing data
- ✅ Supports pagination
- ✅ Configurable field selection
- ✅ Automatic data export with metadata
- ✅ Rate limiting and error handling
| Field | Description | Example |
|---|---|---|
| `title` | Listing title | "Appartement 3 pièces" |
| `price` | Item price | "850,000 DH" |
| `location` | Geographic location | "Casablanca" |
| `details` | Additional item details | "3 chambres, 2 salles de bain" |
| `link` | Direct link to listing | "https://..." |
| `seller` | Seller information | "Agence X" |
| `date` | Posting date | "Il y a 2 jours" |
| `image` | Main image URL | "https://..." |
from scrapers.avito_scraper import scrape_avito
from utils.export import export_data
# Basic usage - scrape apartments in Casablanca
url = "https://example.com/category/apartments"
results, fields = scrape_avito(url, pages=3)
# Export to Excel (default)
export_data(results, list(fields), url, 3, "xlsx", site_name="avito_ma")
# Custom field selection
custom_fields = {"title", "price", "location", "link"}
results, fields = scrape_avito(url, pages=2, fields=custom_fields)
# Export to CSV
export_data(results, list(fields), url, 2, "csv", site_name="avito_ma")
# Full configuration example
results, fields = scrape_avito(
url="https://example.com/category/cars",
pages=5, # Number of pages to scrape
delay=2, # Delay between requests (seconds)
fields={"title", "price", "location", "link"} # Custom fields
)
A flexible, configuration-driven scraper that can adapt to different websites using JSON configuration files or dictionaries.
- ✅ Configuration-driven scraping
- ✅ Support for various pagination methods
- ✅ Flexible field extraction with transformations
- ✅ Multiple data types (text, attributes, lists)
- ✅ Built-in error handling and validation
from scrapers.general_scraper import GeneralScraper
# Configuration for a generic e-commerce site
config = {
"site_name": "example_site",
"base_url": "https://example.com",
"headers": {"User-Agent": "Mozilla/5.0"},
"timeout": 10,
"container_selector": "div.product-item",
"pagination": {
"type": "parameter", # or "next_link"
"parameter": "page"
},
"fields": {
"title": {
"selector": "h2.product-title",
"default": "N/A"
},
"price": {
"selector": "span.price",
"replace": {"$": "", ",": ""},
"default": "N/A"
},
"link": {
"selector": "a.product-link",
"type": "attribute",
"attribute": "href",
"default": "N/A"
},
"images": {
"selector": "img.product-image",
"type": "attribute",
"attribute": "src",
"multiple": True,
"default": []
}
}
}
# Create scraper and use it
scraper = GeneralScraper(config)
results = scraper.scrape("https://example.com/products", max_pages=3, delay=1)
scraper.export_data(results, format="xlsx")
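Configurations can also live in JSON files, as mentioned above. A minimal sketch of loading one before constructing the scraper (the file path is only an example, not part of the project):

```python
import json

from scrapers.general_scraper import GeneralScraper

# Load the same kind of configuration from a JSON file
# (the path below is a placeholder)
with open("configs/example_site.json", encoding="utf-8") as f:
    config = json.load(f)

scraper = GeneralScraper(config)
results = scraper.scrape(config["base_url"] + "/products", max_pages=3, delay=1)
scraper.export_data(results, format="xlsx")
```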
Specialized scraper for 1moment.ma, a Moroccan e-commerce platform.
- ✅ Category-based scraping
- ✅ Product image downloading
- ✅ Price comparison (original vs. discounted)
- ✅ Product descriptions and details
- ✅ Automatic category discovery
- `make-up` - Cosmetics and beauty products
- `maman-bebe` - Baby and mother products
- `parfum` - Perfumes and fragrances
- `idees-cadeaux` - Gift ideas
- `complement-alimentaire-et-force` - Supplements
- `accessoires` - Accessories
- `parapharmacie` - Pharmacy products
# The module name starts with a digit, so a plain "from ... import" statement
# would be a SyntaxError; load the module with importlib instead.
import importlib

_onemoment = importlib.import_module("scrapers.1moment_scraper")
scrape_category = _onemoment.scrape_category
scrape_all_categories = _onemoment.scrape_all_categories
# Scrape a specific category
products = scrape_category("make-up", save_images=True, delay=1)
# Scrape all categories
scrape_all_categories(save_images=True, delay=1, output_format="xlsx")
# Scrape specific categories only
categories = ["make-up", "parfum"]
scrape_all_categories(categories, save_images=False, delay=2)
Centralized data export functionality supporting multiple formats with metadata.
- `export_data()` - Standard export with metadata
- `export_general_data()` - Export for the general scraper
- `export_category_data()` - Category-specific export

Excel Output:
- `Listings` sheet: Main data with all extracted fields
- `Documentation` sheet: Scraping metadata and summary
CSV Output:
- Commented header with scraping metadata
- UTF-8 encoding for international characters
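To read an exported CSV back into pandas, the commented metadata lines can be skipped. A small sketch, assuming the metadata lines are prefixed with `#` (the file name is a placeholder):

```python
import pandas as pd

# Skip the commented metadata header while reading the exported CSV
df = pd.read_csv("avito_ma_export.csv", comment="#", encoding="utf-8")
print(df.head())
```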
Common helper functions used across all scrapers.
- `fetch_soup()` - Safe webpage fetching and parsing
- `safe_text_extract()` - Safe text extraction from elements
- `safe_attribute_extract()` - Safe attribute extraction
- `make_absolute_url()` - Convert relative URLs to absolute
- `slugify()` - Create URL-friendly slugs
- `download_image()` - Download images with error handling
- `rate_limit()` - Implement rate limiting
- `clean_price()` - Clean price formatting
- `validate_url()` - Basic URL validation
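The exact signatures live in `utils/helpers.py`; as an illustration only, helpers in this style commonly look like the sketch below (parameter names and defaults are assumptions, not the project's actual code):

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, headers=None, timeout=10):
    """Fetch a page and return a BeautifulSoup object, or None on failure."""
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
        return BeautifulSoup(response.text, "lxml")
    except requests.RequestException as exc:
        print(f"Failed to fetch {url}: {exc}")
        return None


def safe_text_extract(element, default="N/A"):
    """Return stripped text from a BeautifulSoup element, or a default value."""
    return element.get_text(strip=True) if element else default
```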
# Run the demo
python main.py
# Use Avito scraper directly
python -m scrapers.avito_scraper
# Use 1moment scraper
python -m scrapers.1moment_scraper
from scrapers.avito_scraper import scrape_avito
from utils.export import export_data
# Scrape apartments
url = "https://example.com/category/apartments"
results, fields = scrape_avito(url, pages=2)
# Export results
export_data(results, list(fields), url, 2, "xlsx")
from scrapers.general_scraper import GeneralScraper
# Define your configuration
config = {
"site_name": "my_site",
"base_url": "https://example.com",
"container_selector": "div.item",
"fields": {
"title": {"selector": "h1.title"},
"price": {"selector": "span.price"}
}
}
# Create and use scraper
scraper = GeneralScraper(config)
results = scraper.scrape("https://example.com/items", max_pages=3)
scraper.export_data(results)
| title | price | location | details | link | seller | date | image |
|---|---|---|---|---|---|---|---|
| Appartement 3 pièces | 850,000 DH | Casablanca | 3 chambres, 2 salles de bain | https://... | Agence X | Il y a 2 jours | https://... |
| Villa 4 pièces | 1,200,000 DH | Rabat | 4 chambres, jardin | https://... | Particulier | Il y a 1 jour | https://... |
- Site: example_site
- URL: https://example.com/category/apartments
- Pages Scraped: 3
- Fields: title, price, location, details, link, seller, date, image
- Total Items: 74
- Scraped: 2025-07-10 23:08:45
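Both sheets can be read back with pandas for post-processing; the sheet names match the ones listed above, while the file name here is a placeholder:

```python
import pandas as pd

listings = pd.read_excel("avito_ma_export.xlsx", sheet_name="Listings")
metadata = pd.read_excel("avito_ma_export.xlsx", sheet_name="Documentation")

print(listings.shape)  # (rows, columns) of the scraped data
print(metadata)        # scraping metadata and summary
```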
# Respectful scraping delays
delay = 2 # seconds between requests
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
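A minimal sketch of how these settings are typically applied per request (the URLs are placeholders):

```python
import time

import requests

urls = ["https://example.com/page/1", "https://example.com/page/2"]

for url in urls:
    response = requests.get(url, headers=HEADERS, timeout=10)
    print(url, response.status_code)
    time.sleep(delay)  # pause between requests to respect the target site
```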
- Network errors: Automatic retry with proper error logging
- Parsing errors: Graceful handling of missing or malformed data
- Rate limiting: Built-in delays between requests
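The retry behaviour is implemented inside the scrapers; as an illustration of the pattern only (not the project's exact code), a simple retry with increasing back-off looks like this:

```python
import time

import requests


def fetch_with_retry(url, headers=None, retries=3, backoff=2):
    """Try a request a few times, waiting a bit longer after each failure."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            print(f"Attempt {attempt}/{retries} failed for {url}: {exc}")
            if attempt < retries:
                time.sleep(backoff * attempt)
    return None
```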
- Respect robots.txt: Always check the website's robots.txt file (see the check sketched after this list)
- Rate limiting: Use appropriate delays between requests (1-3 seconds)
- Data validation: Verify scraped data before processing
- Error handling: Implement proper exception handling
- Monitoring: Log scraping activities for debugging
- Legal compliance: Always respect website terms of service
...
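One way to honour the robots.txt rule above is Python's built-in `urllib.robotparser`; a minimal check (the URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/category/apartments"):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt - skip this path")
```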
- Database Integration: Direct export to PostgreSQL/MySQL
- Real-time Monitoring: Live data tracking and alerts
- Advanced Filtering: Complex search criteria and filters
- Parallel Processing: Multi-threaded scraping for better performance
- Data Enrichment: Integration with external data sources
- Web Interface: Simple web UI for configuration and monitoring
- API Endpoints: REST API for programmatic access
- Scheduled Scraping: Automated scraping with cron jobs
- Terms of Service: Always review and comply with website terms
- Rate Limiting: Respect website server capacity and policies
- Data Usage: Use scraped data responsibly and legally
- Privacy: Respect user privacy and data protection laws
- Educational Use: This tool is designed for educational and research purposes
- Connection timeouts
  - Solution: Increase delay between requests
  - Example: `delay=3` instead of `delay=1`
- Parsing errors
  - Cause: Website structure may have changed
  - Solution: Update selectors in scraper configuration
- Missing data
  - Cause: Field selectors may be incorrect
  - Solution: Verify selectors using browser developer tools
- Export errors
  - Cause: Insufficient write permissions
  - Solution: Check file permissions and disk space
Enable verbose logging for debugging:
import logging
logging.basicConfig(level=logging.DEBUG)
# Or use the built-in print statements
# They provide detailed information about the scraping process
# For faster scraping (use responsibly)
delay = 0.5 # Minimum recommended delay
# For large datasets
pages = 10 # Limit pages to avoid overwhelming servers
We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch: `git checkout -b feature/new-scraper`
- Add your scraper following the existing pattern:
  - Place it in the `scrapers/` directory
  - Use utility functions from `utils/`
  - Include proper error handling
  - Add comprehensive documentation
- Test your changes: Ensure all scrapers still work
- Submit a pull request
- Create `scrapers/your_site_scraper.py`
- Import utilities: `from utils.helpers import *`
- Import export: `from utils.export import export_data`
- Follow the pattern of existing scrapers
- Add it to `scrapers/__init__.py`
- Update this README
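As an illustration of that pattern, a minimal skeleton might look like the sketch below; the selectors are placeholders and the helper signatures are the same assumptions as in the helpers sketch above, not the project's actual API:

```python
# scrapers/your_site_scraper.py - minimal skeleton (placeholder selectors)
import time

from utils.helpers import fetch_soup, safe_text_extract
from utils.export import export_data


def scrape_your_site(url, pages=1, delay=1, fields=None):
    """Scrape listings page by page and return (results, fields)."""
    fields = fields or {"title", "price"}
    results = []
    for page in range(1, pages + 1):
        soup = fetch_soup(f"{url}?page={page}")
        if soup is None:
            continue
        for item in soup.select("div.listing"):  # placeholder container selector
            results.append({
                "title": safe_text_extract(item.select_one("h2.title")),
                "price": safe_text_extract(item.select_one("span.price")),
            })
        time.sleep(delay)
    return results, fields


if __name__ == "__main__":
    url = "https://example.com/listings"
    results, fields = scrape_your_site(url, pages=2, delay=1)
    export_data(results, list(fields), url, 2, "xlsx", site_name="your_site")
```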
[Add your license information here]
For issues, questions, and contributions:
- Issues: Create an issue in the repository
- Documentation: Check this README and inline code comments
- Examples: See `main.py` for usage examples
- Troubleshooting: Review the troubleshooting section above
- ✅ Avito Scraper: Fully functional
- ✅ General Scraper: Fully functional
- ✅ 1moment Scraper: Fully functional
- ✅ Export Utilities: Complete
- ✅ Helper Functions: Complete
- 📝 Documentation: Updated and comprehensive
- 🔧 Future Scrapers: In development
Note: This scraper is designed for educational and research purposes. Always ensure compliance with website terms of service and applicable laws when scraping web data. Use responsibly and respect website policies.