# Procurement Categorizer

A Python web application to categorize procurement transaction datasets with UNSPSC codes.
This application helps procurement professionals clean, validate, and categorize procurement data by:
- Cleaning and standardizing supplier names and item descriptions
- Detecting and correcting erroneous UNSPSC codes
- Predicting missing UNSPSC codes using machine learning
- Providing insights on suppliers and validation metrics
- Offering a user-friendly web interface for data processing
## Features

- Data Cleaning: Remove duplicates, standardize text, and identify missing values
- Supplier Name Processing: Group similar supplier names using fuzzy matching
- Error Detection: Identify incorrect UNSPSC codes using clustering and supplier patterns
- Code Prediction: Fill in missing codes with a machine learning model
- Supplier Insights: Analyze supplier specialization and diversity
- Validation Metrics: Evaluate the accuracy of corrections and predictions
- Interactive UI: Upload, process, and download data through a Streamlit interface
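The fuzzy supplier-name grouping can be illustrated with a small standalone sketch. The shipped logic lives in `app/utils/supplier_utils.py`; this version uses only the stdlib `difflib` rather than a dedicated fuzzy-matching library, and the 0.85 threshold is an assumed value:

```python
from difflib import SequenceMatcher

def group_suppliers(names, threshold=0.85):
    """Greedily map each supplier name to the first existing canonical
    name whose similarity ratio meets the threshold."""
    canonical = {}  # raw name -> canonical (lowercased) name
    groups = []     # canonical names seen so far
    for name in names:
        key = name.strip().lower()
        match = next(
            (g for g in groups
             if SequenceMatcher(None, key, g).ratio() >= threshold),
            None,
        )
        if match is None:
            groups.append(key)
            match = key
        canonical[name] = match
    return canonical

mapping = group_suppliers(["Acme Corp", "ACME Corp.", "Widget Co"])
# "Acme Corp" and "ACME Corp." collapse to one group; "Widget Co" stays separate.
```

A greedy single pass like this is order-dependent, which is usually acceptable for deduplicating near-identical vendor spellings.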
## Tech Stack

- Backend: FastAPI for the RESTful API
- Frontend: Streamlit for interactive UI
- Data Processing: Pandas for data manipulation
- Machine Learning: Scikit-learn for clustering and classification
- Visualization: Plotly for interactive charts
- Deployment: Docker and Docker Compose for containerization
## Project Structure

```
procurement_categorizer/
├── app/
│   ├── main.py                # FastAPI app
│   ├── frontend.py            # Streamlit app
│   ├── models/
│   │   ├── categorizer.py     # Pipeline logic
│   │   └── schemas.py         # Pydantic models
│   └── utils/
│       ├── data_cleaner.py    # Cleaning functions
│       ├── error_corrector.py # Error correction
│       ├── predictor.py       # Prediction logic
│       └── supplier_utils.py  # Supplier preprocessing
├── tests/
│   ├── test_categorizer.py    # Unit tests
│   └── test_endpoints.py      # API tests
├── README.md                  # This file
├── requirements.txt           # Dependencies
├── config.yaml                # Configuration
├── docker-compose.yml         # Docker configuration
├── Dockerfile.backend         # Backend Dockerfile
├── Dockerfile.frontend        # Frontend Dockerfile
├── HOW_TO_RUN.md              # Setup and execution steps
├── sample_data.csv            # Sample dataset
└── docs/
    └── process.md             # Detailed documentation
```
## Getting Started

See HOW_TO_RUN.md for detailed instructions on setting up and running the application.
## How It Works

The categorization process runs as a sequence of steps:

1. Data Cleaning: Standardizing text and handling missing values
2. Supplier Preprocessing: Using fuzzy matching to group similar supplier names
3. Item Clustering: Vectorizing descriptions with TF-IDF and clustering using cosine similarity
4. Error Correction: Using cluster-based and supplier-based heuristics
5. Code Prediction: Training a Random Forest or Logistic Regression model
6. Validation: Comparing models with and without supplier context
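The TF-IDF-plus-cosine clustering step can be sketched as follows. This is a simplified, self-contained illustration, not the logic in `categorizer.py`: the greedy single-pass grouping and the 0.3 threshold are assumptions, while the vectorization and similarity calls are the standard scikit-learn ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "laptop computer 15 inch",
    "notebook computer 14 inch",
    "office chair ergonomic",
    "ergonomic desk chair",
]

# Vectorize item descriptions and compute pairwise cosine similarity.
tfidf = TfidfVectorizer().fit_transform(descriptions)
sim = cosine_similarity(tfidf)

# Greedy single-pass clustering: attach an item to the first earlier
# item that is similar enough, otherwise start a new cluster.
threshold = 0.3  # assumed value; the real one would come from config.yaml
labels = [-1] * len(descriptions)
next_label = 0
for i in range(len(descriptions)):
    for j in range(i):
        if sim[i, j] >= threshold:
            labels[i] = labels[j]
            break
    if labels[i] == -1:
        labels[i] = next_label
        next_label += 1
# The two computer items share one cluster, the two chair items another.
```

Clusters built this way give each item a peer group whose dominant UNSPSC code can be used to flag outliers in the error-correction step.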
The use of supplier names as a feature is a key aspect of this approach. By encoding supplier names based on their typical UNSPSC codes, the model can leverage the pattern that specialized suppliers tend to provide consistent categories of items.
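One simple way to encode that pattern, shown here as an illustrative stdlib sketch rather than the shipped feature engineering, is to summarize each supplier by its most common historical code and how dominant that code is (the supplier names and codes below are toy data):

```python
from collections import Counter, defaultdict

# (supplier, unspsc_code) pairs from historical transactions (toy data).
history = [
    ("Acme Office Supply", "44121700"),
    ("Acme Office Supply", "44121700"),
    ("Acme Office Supply", "44103100"),
    ("General Trading Co", "43211500"),
    ("General Trading Co", "56101500"),
]

codes_by_supplier = defaultdict(list)
for supplier, code in history:
    codes_by_supplier[supplier].append(code)

def supplier_features(supplier):
    """Return (modal UNSPSC code, specialization share) for a supplier."""
    counts = Counter(codes_by_supplier[supplier])
    code, n = counts.most_common(1)[0]
    return code, n / sum(counts.values())

modal_code, share = supplier_features("Acme Office Supply")
# Acme's modal code covers 2 of its 3 transactions (share ≈ 0.67).
```

A supplier with a share near 1.0 is highly specialized, so its modal code is strong evidence when predicting or correcting an item's UNSPSC code; a diverse supplier contributes much weaker evidence.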
## Configuration

The application is configurable through `config.yaml`:
- Use of supplier names for categorization
- Similarity thresholds for clustering and fuzzy matching
- Classifier type and parameters
- Train/test split ratio
- Performance settings for large datasets
These options can also be adjusted through the Streamlit UI.
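A configuration file covering those options might look like the fragment below; the key names and defaults here are assumptions for illustration, so treat the repository's `config.yaml` as authoritative:

```yaml
# Illustrative layout only; see config.yaml in the repository
# for the actual keys and defaults.
use_supplier_names: true
clustering:
  similarity_threshold: 0.3
fuzzy_matching:
  threshold: 0.85
classifier:
  type: random_forest   # or logistic_regression
  n_estimators: 100
train_test_split: 0.8
```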
## Sample Data

A sample dataset (`sample_data.csv`) is provided for testing. It contains 100 rows of synthetic procurement data with a mix of specialized and diverse suppliers, as well as some missing and erroneous UNSPSC codes.
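Rows in the same spirit can be generated with the stdlib `csv` module; the column names used here are assumptions about the file's layout, so check the header of `sample_data.csv` for the actual schema:

```python
import csv
import io

# Hypothetical column layout; the real sample_data.csv may differ.
rows = [
    {"supplier_name": "Acme Office Supply",
     "item_description": "ballpoint pens, box of 12",
     "unspsc_code": "44121704"},
    {"supplier_name": "General Trading Co",
     "item_description": "laptop computer 15 inch",
     "unspsc_code": ""},  # missing code, to be predicted
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["supplier_name", "item_description", "unspsc_code"]
)
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```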
## Limitations and Future Work

Current limitations:
- Focuses on single-stage categorization (does not predict hierarchical UNSPSC levels)
- Limited visualization options for large numbers of unique UNSPSC codes
Future enhancements could include:
- Deep learning models for improved accuracy
- Hierarchical UNSPSC prediction (segment, family, class, commodity)
- API integration with external UNSPSC databases
- Clustering of suppliers by typical product categories
- Advanced data visualization for large datasets
- Incremental learning for continuous model improvement
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- This project was created to address the challenges of procurement data categorization
- Thanks to the open-source community for the excellent libraries that made this possible