# Procurement Categorizer

A Python web application to categorize procurement transaction datasets with UNSPSC codes.
This application helps procurement professionals clean, validate, and categorize procurement data by:
- Cleaning and standardizing supplier names and item descriptions
- Detecting and correcting erroneous UNSPSC codes
- Predicting missing UNSPSC codes using machine learning
- Providing insights on suppliers and validation metrics
- Offering a user-friendly web interface for data processing
## Features

- Data Cleaning: Remove duplicates, standardize text, and identify missing values
- Supplier Name Processing: Group similar supplier names using fuzzy matching
- Error Detection: Identify incorrect UNSPSC codes using clustering and supplier patterns
- Code Prediction: Fill in missing codes with a machine learning model
- Supplier Insights: Analyze supplier specialization and diversity
- Validation Metrics: Evaluate the accuracy of corrections and predictions
- Interactive UI: Upload, process, and download data through a Streamlit interface
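The fuzzy supplier-name grouping can be illustrated with a small standalone sketch. The shipped logic lives in `app/utils/supplier_utils.py`; this version uses only the stdlib `difflib` rather than a dedicated fuzzy-matching library, and the 0.85 threshold is an assumed value:

```python
from difflib import SequenceMatcher

def group_suppliers(names, threshold=0.85):
    """Greedily map each supplier name to the first existing canonical
    name whose similarity ratio meets the threshold."""
    canonical = {}  # raw name -> canonical (lowercased) name
    groups = []     # canonical names seen so far
    for name in names:
        key = name.strip().lower()
        match = next(
            (g for g in groups
             if SequenceMatcher(None, key, g).ratio() >= threshold),
            None,
        )
        if match is None:
            groups.append(key)
            match = key
        canonical[name] = match
    return canonical

mapping = group_suppliers(["Acme Corp", "ACME Corp.", "Widget Co"])
# "Acme Corp" and "ACME Corp." collapse to one group; "Widget Co" stays separate.
```

A greedy single pass like this is order-dependent, which is usually acceptable for deduplicating near-identical vendor spellings.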
## Tech Stack

- Backend: FastAPI for the RESTful API
- Frontend: Streamlit for interactive UI
- Data Processing: Pandas for data manipulation
- Machine Learning: Scikit-learn for clustering and classification
- Visualization: Plotly for interactive charts
- Deployment: Docker and Docker Compose for containerization
## Project Structure

```
procurement_categorizer/
├── app/
│   ├── main.py                # FastAPI app
│   ├── frontend.py            # Streamlit app
│   ├── models/
│   │   ├── categorizer.py     # Pipeline logic
│   │   └── schemas.py         # Pydantic models
│   └── utils/
│       ├── data_cleaner.py    # Cleaning functions
│       ├── error_corrector.py # Error correction
│       ├── predictor.py       # Prediction logic
│       └── supplier_utils.py  # Supplier preprocessing
├── tests/
│   ├── test_categorizer.py    # Unit tests
│   └── test_endpoints.py      # API tests
├── README.md                  # This file
├── requirements.txt           # Dependencies
├── config.yaml                # Configuration
├── docker-compose.yml         # Docker configuration
├── Dockerfile.backend         # Backend Dockerfile
├── Dockerfile.frontend        # Frontend Dockerfile
├── HOW_TO_RUN.md              # Setup and execution steps
├── sample_data.csv            # Sample dataset
└── docs/
    └── process.md             # Detailed documentation
```
## Getting Started

See HOW_TO_RUN.md for detailed instructions on setting up and running the application.
## How It Works

The categorization process runs as a sequence of steps:

1. Data Cleaning: Standardizing text and handling missing values
2. Supplier Preprocessing: Using fuzzy matching to group similar supplier names
3. Item Clustering: Vectorizing descriptions with TF-IDF and clustering using cosine similarity
4. Error Correction: Using cluster-based and supplier-based heuristics
5. Code Prediction: Training a Random Forest or Logistic Regression model
6. Validation: Comparing models with and without supplier context
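The TF-IDF-plus-cosine clustering step can be sketched as follows. This is a simplified, self-contained illustration, not the logic in `categorizer.py`: the greedy single-pass grouping and the 0.3 threshold are assumptions, while the vectorization and similarity calls are the standard scikit-learn ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "laptop computer 15 inch",
    "notebook computer 14 inch",
    "office chair ergonomic",
    "ergonomic desk chair",
]

# Vectorize item descriptions and compute pairwise cosine similarity.
tfidf = TfidfVectorizer().fit_transform(descriptions)
sim = cosine_similarity(tfidf)

# Greedy single-pass clustering: attach an item to the first earlier
# item that is similar enough, otherwise start a new cluster.
threshold = 0.3  # assumed value; the real one would come from config.yaml
labels = [-1] * len(descriptions)
next_label = 0
for i in range(len(descriptions)):
    for j in range(i):
        if sim[i, j] >= threshold:
            labels[i] = labels[j]
            break
    if labels[i] == -1:
        labels[i] = next_label
        next_label += 1
# The two computer items share one cluster, the two chair items another.
```

Clusters built this way give each item a peer group whose dominant UNSPSC code can be used to flag outliers in the error-correction step.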
The use of supplier names as a feature is a key aspect of this approach. By encoding supplier names based on their typical UNSPSC codes, the model can leverage the pattern that specialized suppliers tend to provide consistent categories of items.
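One simple way to encode that pattern, shown here as an illustrative stdlib sketch rather than the shipped feature engineering, is to summarize each supplier by its most common historical code and how dominant that code is (the supplier names and codes below are toy data):

```python
from collections import Counter, defaultdict

# (supplier, unspsc_code) pairs from historical transactions (toy data).
history = [
    ("Acme Office Supply", "44121700"),
    ("Acme Office Supply", "44121700"),
    ("Acme Office Supply", "44103100"),
    ("General Trading Co", "43211500"),
    ("General Trading Co", "56101500"),
]

codes_by_supplier = defaultdict(list)
for supplier, code in history:
    codes_by_supplier[supplier].append(code)

def supplier_features(supplier):
    """Return (modal UNSPSC code, specialization share) for a supplier."""
    counts = Counter(codes_by_supplier[supplier])
    code, n = counts.most_common(1)[0]
    return code, n / sum(counts.values())

modal_code, share = supplier_features("Acme Office Supply")
# Acme's modal code covers 2 of its 3 transactions (share ≈ 0.67).
```

A supplier with a share near 1.0 is highly specialized, so its modal code is strong evidence when predicting or correcting an item's UNSPSC code; a diverse supplier contributes much weaker evidence.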
## Configuration

The application is configurable through `config.yaml`:
- Use of supplier names for categorization
- Similarity thresholds for clustering and fuzzy matching
- Classifier type and parameters
- Train/test split ratio
- Performance settings for large datasets
These options can also be adjusted through the Streamlit UI.
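A configuration file covering those options might look like the fragment below; the key names and defaults here are assumptions for illustration, so treat the repository's `config.yaml` as authoritative:

```yaml
# Illustrative layout only; see config.yaml in the repository
# for the actual keys and defaults.
use_supplier_names: true
clustering:
  similarity_threshold: 0.3
fuzzy_matching:
  threshold: 0.85
classifier:
  type: random_forest   # or logistic_regression
  n_estimators: 100
train_test_split: 0.8
```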
## Sample Data

A sample dataset (`sample_data.csv`) is provided for testing. It contains 100 rows of synthetic procurement data with a mix of specialized and diverse suppliers, as well as some missing and erroneous UNSPSC codes.
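Rows in the same spirit can be generated with the stdlib `csv` module; the column names used here are assumptions about the file's layout, so check the header of `sample_data.csv` for the actual schema:

```python
import csv
import io

# Hypothetical column layout; the real sample_data.csv may differ.
rows = [
    {"supplier_name": "Acme Office Supply",
     "item_description": "ballpoint pens, box of 12",
     "unspsc_code": "44121704"},
    {"supplier_name": "General Trading Co",
     "item_description": "laptop computer 15 inch",
     "unspsc_code": ""},  # missing code, to be predicted
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["supplier_name", "item_description", "unspsc_code"]
)
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```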
## Limitations and Future Work

Current limitations:
- Focuses on single-stage categorization (does not predict hierarchical UNSPSC levels)
- Limited visualization options for large numbers of unique UNSPSC codes
Future enhancements could include:
- Deep learning models for improved accuracy
- Hierarchical UNSPSC prediction (segment, family, class, commodity)
- API integration with external UNSPSC databases
- Clustering of suppliers by typical product categories
- Advanced data visualization for large datasets
- Incremental learning for continuous model improvement
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- This project was created to address the challenges of procurement data categorization
- Thanks to the open-source community for the excellent libraries that made this possible