This project implements a data pipeline using Terraform, AWS, and Snowflake to process and analyse pre-operating expense financial data. The solution provides automated infrastructure provisioning, PySpark-based data processing, and cloud data warehousing in Snowflake, with SQL queries used to extract financial analytics.
This is a sample dashboard built using the data in the `sample-data` folder.
- Overview
- Features
- Sample Data Analysis
- Technology Stack
- Quick Start Guide
- Implementation Details
- Key Insights
- Challenges and Solutions
- Impact
This project establishes a complete data infrastructure for processing pre-operating expenses data using Infrastructure as Code (IaC) principles. The solution automates the provisioning of AWS S3 storage, processes CSV data using PySpark, and loads structured data into Snowflake for advanced analytics and reporting.
The pipeline handles financial data from CSV files, transforms it into normalised database tables (expenses, categories, vendors), and provides SQL-based analytics capabilities. This enables organisations to track, analyse, and optimise their pre-operating expenses with automated data processing and cloud-based storage.
- Infrastructure as Code: Automated AWS S3 bucket provisioning using Terraform
- Data Processing: PySpark-based ETL pipeline for CSV data transformation
- Cloud Storage: Secure S3 bucket for raw data storage with automated file uploads
- Data Warehousing: Snowflake integration for structured data storage and analytics
- Multi-Provider Support: AWS and Databricks provider configurations
- Environment Management: Virtual environment setup with dependency management
- SQL Analytics: Pre-built queries for expense analysis and reporting
The pipeline processes CSV data with the following structure (a read sketch follows this list):
- Pre Operating Expense: Description of the expense
- Date of Expense: Date when expense occurred
- Cost: Amount in currency
- Vendor: Company providing the service
- Category: Expense category (Marketing, Technology, Legal, etc.)
- Contact Info: Vendor contact information
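
A minimal PySpark sketch of reading a CSV with this structure; the file path and the renamed column names are illustrative assumptions rather than the project's actual configuration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("pre-operating-expenses").getOrCreate()

# Read the raw CSV (path and header handling are assumptions for illustration)
raw_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("sample-data/pre_operating_expenses.csv")
)

# Normalise column names and types so downstream joins are predictable
expenses_df = (
    raw_df
    .withColumnRenamed("Pre Operating Expense", "expense_description")
    .withColumnRenamed("Date of Expense", "date_of_expense")
    .withColumnRenamed("Cost", "cost")
    .withColumn("date_of_expense", to_date(col("date_of_expense")))
    .withColumn("cost", col("cost").cast("decimal(10,2)"))
)
```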
The pipeline creates three normalised tables in Snowflake (a transformation sketch follows this list):
- EXPENSES: Core expense records with IDs and amounts
- CATEGORIES: Expense categories and descriptions
- VENDORS: Vendor information and contact details
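
A hedged sketch of deriving these three tables from the flat CSV with PySpark; the surrogate-key generation and column names are assumptions and may differ from the project's actual transformation code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, monotonically_increasing_id

spark = SparkSession.builder.appName("expense-normalisation").getOrCreate()

# Re-read the flat CSV (file name is an assumption for illustration)
raw_df = spark.read.option("header", True).csv("sample-data/pre_operating_expenses.csv")

# CATEGORIES: distinct category values with a generated surrogate key
categories_df = (
    raw_df.select(col("Category").alias("category_type"))
    .distinct()
    .withColumn("category_id", monotonically_increasing_id())
)

# VENDORS: distinct vendor/contact pairs with a generated surrogate key
vendors_df = (
    raw_df.select(
        col("Vendor").alias("vendor_name"),
        col("Contact Info").alias("contact_info"),
    )
    .distinct()
    .withColumn("vendor_id", monotonically_increasing_id())
)

# EXPENSES: expense records that reference the two lookup tables by ID
expenses_df = (
    raw_df
    .join(categories_df, raw_df["Category"] == categories_df["category_type"])
    .join(vendors_df, raw_df["Vendor"] == vendors_df["vendor_name"])
    .select(
        col("Pre Operating Expense").alias("expense_description"),
        col("Date of Expense").alias("date_of_expense"),
        col("Cost").cast("decimal(10,2)").alias("cost"),
        "category_id",
        "vendor_id",
    )
)
```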
Connect to Snowflake and run the following queries for insights:
Total expenses by category:

```sql
SELECT
    SUM(e.cost) AS TOTAL_AMOUNT,
    c.category_type
FROM expenses e
JOIN categories c ON e.category_id = c.category_id
GROUP BY c.category_type
ORDER BY TOTAL_AMOUNT DESC;
```

Example Output:

| TOTAL_AMOUNT | CATEGORY_TYPE |
|---|---|
| 5,234.56 | Marketing |
| 3,456.78 | Technology |
| 2,890.12 | Legal |
| 1,567.89 | Office Expenses |
| 987.65 | Insurance |
Monthly expenses by category:

```sql
SELECT
    MONTHNAME(e.date_of_expense) AS MONTH,
    YEAR(e.date_of_expense) AS YEAR,
    SUM(e.cost) AS TOTAL_AMOUNT,
    c.category_type
FROM expenses e
JOIN categories c ON e.category_id = c.category_id
GROUP BY c.category_type, MONTHNAME(e.date_of_expense), YEAR(e.date_of_expense)
ORDER BY YEAR, MONTH, TOTAL_AMOUNT DESC;
```

Example Output:

| MONTH | YEAR | TOTAL_AMOUNT | CATEGORY_TYPE |
|---|---|---|---|
| May | 2025 | 57.29 | Intellectual Property |
| June | 2025 | 1,234.56 | Technology |
| July | 2025 | 2,345.67 | Marketing |
| August | 2025 | 3,456.78 | Marketing |
Monthly expense totals:

```sql
SELECT
    MONTHNAME(e.date_of_expense) AS MONTH,
    YEAR(e.date_of_expense) AS YEAR,
    SUM(e.cost) AS TOTAL_AMOUNT
FROM expenses e
GROUP BY MONTHNAME(e.date_of_expense), YEAR(e.date_of_expense)
ORDER BY YEAR, MONTH;
```

Example Output:

| MONTH | YEAR | TOTAL_AMOUNT |
|---|---|---|
| May | 2025 | 57.29 |
| June | 2025 | 1,234.56 |
| July | 2025 | 2,345.67 |
| August | 2025 | 3,456.78 |
Top spending category per year:

```sql
WITH yearly_totals AS (
    SELECT
        YEAR(e.date_of_expense) AS YEAR,
        SUM(e.cost) AS TOTAL_AMOUNT,
        c.category_type
    FROM expenses e
    JOIN categories c ON e.category_id = c.category_id
    GROUP BY c.category_type, YEAR(e.date_of_expense)
),
ranked AS (
    SELECT
        year,
        total_amount,
        category_type,
        ROW_NUMBER() OVER (
            PARTITION BY year
            ORDER BY total_amount DESC
        ) AS rn
    FROM yearly_totals
)
SELECT
    year,
    total_amount,
    category_type
FROM ranked
WHERE rn = 1
ORDER BY year;
```

Example Output:

| YEAR | TOTAL_AMOUNT | CATEGORY_TYPE |
|---|---|---|
| 2025 | 4,567.89 | Marketing |
| 2026 | 6,789.12 | Technology |
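
The queries above can also be executed from Python. A minimal sketch using the `snowflake-connector-python` package; the connection parameters mirror the `.env` variables shown in the Quick Start Guide, and the overall invocation is an assumption rather than the project's own script:

```python
import os

import snowflake.connector  # pip install snowflake-connector-python

# Credentials come from the environment (see the .env example in the Quick Start Guide);
# depending on the account setup a database parameter may also be required
conn = snowflake.connector.connect(
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
    schema=os.environ["SNOWFLAKE_SCHEMA"],
)

try:
    cur = conn.cursor()
    cur.execute(
        """
        SELECT SUM(e.cost) AS TOTAL_AMOUNT, c.category_type
        FROM expenses e
        JOIN categories c ON e.category_id = c.category_id
        GROUP BY c.category_type
        ORDER BY TOTAL_AMOUNT DESC
        """
    )
    for total_amount, category_type in cur.fetchall():
        print(f"{category_type}: {total_amount}")
finally:
    conn.close()
```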
| Component | Tools and Libraries |
|---|---|
| Infrastructure | Terraform, AWS S3, Databricks |
| Data Processing | PySpark, Pandas, Python |
| Cloud Services | AWS S3, Snowflake |
| Data Storage | CSV files, Snowflake Data Warehouse |
| Development | Python virtual environment, requirements.txt |
| Version Control | Git |
- Python 3.8+
- Terraform
- AWS CLI configured
- Snowflake account
- Git
```bash
git clone <your-repository-url>
cd terraform-setup
```

Create and activate a virtual environment:

```bash
python -m venv venv

# On Windows
venv\Scripts\activate

# On Linux/Mac
source venv/bin/activate
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

Create a `.env` file in the root directory with your credentials:
```
# AWS Credentials
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION=your_aws_region
AWS_S3_BUCKET=pre-operating-expenses-data

# Snowflake Credentials
SNOWFLAKE_USER=your_snowflake_username
SNOWFLAKE_PASSWORD=your_snowflake_password
SNOWFLAKE_ACCOUNT=your_snowflake_account
SNOWFLAKE_WAREHOUSE=data_warehouse_2
SNOWFLAKE_SCHEMA=expense_data_sample
```
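
The Python code loads these credentials with `python-dotenv` (see Challenges and Solutions). A minimal sketch of reading the `.env` file and creating a boto3 S3 client; the bucket listing at the end is purely illustrative:

```python
import os

import boto3
from dotenv import load_dotenv  # pip install python-dotenv boto3

# Load the .env file from the project root into the process environment
load_dotenv()

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    region_name=os.environ["AWS_REGION"],
)

# List the CSV objects in the raw-data bucket (bucket name comes from the .env file)
response = s3.list_objects_v2(Bucket=os.environ["AWS_S3_BUCKET"])
for obj in response.get("Contents", []):
    print(obj["Key"])
```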
Provision the AWS infrastructure with Terraform:

```bash
cd terraform
terraform init
terraform plan
terraform apply
```
Run the data pipeline:

```bash
cd python
python app.py
```

Key components:

- Terraform Configuration:
  - AWS S3 bucket creation for data storage
  - Automated CSV file upload to S3 with MD5 validation
  - Multi-provider setup for AWS and Databricks integration
- Data Processing Pipeline (see the sketch after this list):
  - PySpark session initialisation with custom Python environment
  - CSV data extraction from S3 using boto3
  - Data transformation into three normalised tables: expenses, categories, vendors
  - Automated data type conversion and column mapping
- Database Schema:
  - Snowflake warehouse and database creation
  - Normalised schema design for expense tracking
  - Pre-built SQL queries for expense analysis and reporting
- Environment Management:
  - Virtual environment with isolated dependencies
  - Environment variable configuration for AWS credentials
  - Cross-platform compatibility (Windows/Linux)
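
A hedged sketch of the data processing step described above: pointing PySpark at the virtual environment's interpreter (the same variables mentioned under Challenges and Solutions) and pulling a CSV from S3 with boto3 before reading it with Spark. The object key and local download path are illustrative assumptions:

```python
import os
import sys

import boto3
from dotenv import load_dotenv
from pyspark.sql import SparkSession

# Load AWS credentials from .env and point PySpark at the virtual environment's
# interpreter to avoid cross-platform path issues
load_dotenv()
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

spark = SparkSession.builder.appName("expense-etl").getOrCreate()

# Download the raw CSV from S3 with boto3, then read it with Spark
# (object key and local path are assumptions for illustration)
s3 = boto3.client("s3")
local_path = "/tmp/pre_operating_expenses.csv"
s3.download_file(os.environ["AWS_S3_BUCKET"], "pre_operating_expenses.csv", local_path)

raw_df = spark.read.option("header", True).csv(local_path)
raw_df.printSchema()
```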
| Component | Implementation Details |
|---|---|
| S3 Data Storage | Automated bucket provisioning with file upload validation |
| PySpark Processing | PySpark 3.2.1 with a custom Python environment configuration |
| Data Normalisation | Three-table schema: expenses, categories, vendors |
| Snowflake Integration | XSMALL warehouse with auto-suspend/resume capabilities |
| Infrastructure Automation | Complete IaC setup with Terraform state management |
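
For the Snowflake integration row above, a hedged sketch of creating an XSMALL warehouse with auto-suspend/resume via `snowflake-connector-python`; the warehouse name matches the `.env` example, while the 60-second suspend timeout is an assumed value:

```python
import os

import snowflake.connector

conn = snowflake.connector.connect(
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    account=os.environ["SNOWFLAKE_ACCOUNT"],
)

# XSMALL warehouse that suspends when idle and resumes on the next query
# (the 60-second timeout is an assumption, not the project's configuration)
conn.cursor().execute(
    """
    CREATE WAREHOUSE IF NOT EXISTS data_warehouse_2
        WAREHOUSE_SIZE = 'XSMALL'
        AUTO_SUSPEND = 60
        AUTO_RESUME = TRUE
    """
)
conn.close()
```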
| Challenge | Solution |
|---|---|
| Cross-platform Python paths | Used absolute paths and environment variable configuration |
| PySpark environment setup | Configured PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON variables |
| AWS credential management | Implemented dotenv for secure credential handling |
| Data type consistency | Applied explicit type conversions for dates and numeric fields |
| Infrastructure state management | Used Terraform state files for deployment tracking |
This project demonstrates modern data engineering practices by combining Infrastructure as Code with cloud-native data processing. It provides a scalable foundation for financial data analytics, enabling organisations to:
- Automate infrastructure provisioning and reduce manual setup time
- Process large volumes of financial data efficiently with PySpark
- Maintain data integrity through normalised database design
- Enable real-time analytics with cloud-based data warehousing
- Scale processing capabilities based on business needs
The solution reduces operational overhead while providing a robust platform for financial data analysis and reporting.
- PySpark Environment: Ensure Python paths are correctly set in your virtual environment
- AWS Credentials: Verify your `.env` file contains valid AWS credentials
- Snowflake Connection: Check network connectivity and credential validity
- Dependencies: Ensure all packages are installed in your virtual environment
- Check the logs in the `airflow-docker/logs/` directory for detailed error information
- Verify your environment variables are correctly set
- Ensure your Snowflake warehouse is running and accessible
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.