A comprehensive machine learning project for predicting heart disease using the UCI Heart Disease dataset. This project includes data preprocessing, feature engineering, multiple ML models, hyperparameter tuning, and a Streamlit web interface.
This project implements a complete machine learning pipeline for heart disease prediction, including:
- Data Preprocessing & Cleaning: Handling missing values, feature scaling, and exploratory data analysis
- Dimensionality Reduction: PCA analysis for feature compression
- Feature Selection: Using Random Forest, RFE, and Chi-Square tests
- Supervised Learning: Logistic Regression, Decision Trees, Random Forest, and SVM
- Unsupervised Learning: K-Means and Hierarchical Clustering
- Hyperparameter Tuning: GridSearchCV and RandomizedSearchCV optimization
- Web Interface: Interactive Streamlit application for predictions
```
Heart_Disease_Project/
│
├── data/
│   ├── preprocessed_data.csv          # Cleaned and scaled data
│   ├── pca_transformed_data.csv       # PCA-transformed data
│   └── selected_features_data.csv     # Data with selected features
│
├── models/
│   ├── final_model.pkl                # Best trained model
│   └── model_metadata.json            # Model information
│
├── results/
│   ├── *.png                          # All visualization outputs
│   ├── model_performance.csv          # Model metrics
│   ├── feature_selection_scores.csv   # Feature importance
│   ├── clustering_comparison.csv      # Clustering results
│   └── hyperparameter_tuning_results.csv
│
├── 01_data_preprocessing.py           # Data cleaning and EDA
├── 02_pca_analysis.py                 # PCA dimensionality reduction
├── 03_feature_selection.py            # Feature importance and selection
├── 04_supervised_learning.py          # Classification models
├── 05_unsupervised_learning.py        # Clustering analysis
├── 06_hyperparameter_tuning.py        # Model optimization
├── app.py                             # Streamlit web application
├── main.py                            # Main runner script
├── requirements.txt                   # Python dependencies
└── README.md                          # This file
```
- Python 3.8 or higher
- pip package manager
1. Clone the repository or download the project files

2. Install required packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Create necessary directories (auto-created by scripts):

   ```bash
   mkdir data models results
   ```

Run the complete pipeline:

```bash
python main.py
```

This will execute all pipeline steps in sequence.
Alternatively, run the scripts individually, in order:

```bash
python 01_data_preprocessing.py
python 02_pca_analysis.py
python 03_feature_selection.py
python 04_supervised_learning.py
python 05_unsupervised_learning.py
python 06_hyperparameter_tuning.py
```

After training the models, launch the web interface:

```bash
streamlit run app.py
```

The application will open in your default browser at http://localhost:8501.
- Loads Heart Disease UCI dataset
- Handles missing values using median imputation
- Converts target to binary classification
- Performs feature scaling with StandardScaler
- Generates EDA visualizations (distributions, correlations, boxplots)
Outputs: `preprocessed_data.csv`, visualization files (01-04)
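The preprocessing steps above can be sketched as follows. This is a minimal illustration on a toy frame; the actual script operates on the UCI CSV with its full column set:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the UCI data (values are illustrative)
df = pd.DataFrame({
    "age": [63, 54, None, 41],
    "chol": [233, None, 250, 204],
    "target": [0, 2, 1, 0],   # raw UCI target ranges 0-4
})

# Median imputation for missing values
df = df.fillna(df.median(numeric_only=True))

# Collapse the 0-4 target to binary: 0 = no disease, 1 = disease present
df["target"] = (df["target"] > 0).astype(int)

# Scale features to zero mean / unit variance
features = df.drop(columns="target")
scaled = pd.DataFrame(StandardScaler().fit_transform(features),
                      columns=features.columns)
```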
- Applies Principal Component Analysis
- Determines optimal number of components (95% variance)
- Creates scree plots and cumulative variance plots
- Generates 2D and 3D scatter plots
- Shows component loadings (feature contributions)
Outputs: `pca_transformed_data.csv`, visualization files (05-09)
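A minimal sketch of the 95%-variance threshold approach, with random data standing in for the scaled features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))   # stand-in for 13 scaled features

# A float n_components keeps just enough components
# to explain that fraction of the total variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)

print(X_pca.shape[1], "components retained")
```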
- Method 1: Random Forest feature importance
- Method 2: Recursive Feature Elimination (RFE)
- Method 3: Chi-Square statistical test
- Combines all methods for robust feature selection
- Selects top 8 most important features
Outputs: `selected_features_data.csv`, `feature_selection_scores.csv`, visualization files (10-13)
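The three selection methods can be sketched with scikit-learn. Synthetic data stands in for the heart disease features; note that `chi2` requires non-negative inputs, hence the min-max rescaling:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=200, n_features=13, random_state=42)

# Method 1: Random Forest impurity-based importance
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
rf_top = np.argsort(rf.feature_importances_)[::-1][:8]

# Method 2: Recursive Feature Elimination down to 8 features
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=42),
          n_features_to_select=8).fit(X, y)

# Method 3: Chi-square test (needs non-negative values)
X_pos = MinMaxScaler().fit_transform(X)
chi = SelectKBest(chi2, k=8).fit(X_pos, y)

print(sorted(rf_top), rfe.support_.sum(), chi.get_support().sum())
```

In the project, the rankings from all three methods are then combined to pick the final top-8 set.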
- Trains four classification models:
- Logistic Regression
- Decision Tree
- Random Forest
- Support Vector Machine (SVM)
- Evaluates with accuracy, precision, recall, F1-score
- Performs 5-fold cross-validation
- Generates ROC curves and confusion matrices
Outputs: `model_performance.csv`, visualization files (14-18)
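A condensed sketch of the four-model comparison with 5-fold cross-validation. Synthetic data is used here; the actual script also computes precision, recall, F1, ROC curves, and confusion matrices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(random_state=0),
}

# 5-fold cross-validated accuracy for each model
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```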
- K-Means Clustering:
- Elbow method for optimal K
- Silhouette score analysis
- Hierarchical Clustering:
- Multiple linkage methods (ward, complete, average, single)
- Dendrogram visualization
- Compares clusters with actual disease labels
Outputs: `clustering_comparison.csv`, visualization files (19-24)
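The K-Means elbow/silhouette sweep and a Ward-linkage hierarchical clustering can be sketched as follows (synthetic data in place of the selected features):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import silhouette_score

X, _ = make_classification(n_samples=200, n_features=8, random_state=1)

# Elbow data: inertia and silhouette score for K = 2..6
inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)

# Hierarchical clustering with Ward linkage
ward = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
print(max(silhouettes, key=silhouettes.get), "clusters by silhouette")
```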
- GridSearchCV: Exhaustive parameter search
- RandomizedSearchCV: Faster random sampling
- Optimizes all four classification models
- Compares tuning methods
- Saves best model as `final_model.pkl`

Outputs: `final_model.pkl`, `model_metadata.json`, `hyperparameter_tuning_results.csv`, visualization files (25-27)
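A minimal GridSearchCV sketch, including persisting the best estimator the way the project saves `final_model.pkl`. The data is synthetic and the grid and file path are illustrative stand-ins:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Exhaustive search over a small illustrative grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=5,
)
grid.fit(X, y)

# Persist the best estimator (temp-dir path used here for the sketch)
path = os.path.join(tempfile.gettempdir(), "final_model.pkl")
joblib.dump(grid.best_estimator_, path)
print(grid.best_params_)
```

RandomizedSearchCV follows the same pattern but samples a fixed number of parameter combinations (`n_iter`) instead of trying them all.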
- Input patient health data through intuitive forms
- Real-time heart disease risk prediction
- Visual risk gauge with color-coded indicators
- Confidence scores and detailed recommendations
- Dataset statistics and distribution
- Interactive feature visualizations
- Correlation heatmaps
- Raw data viewer
- Model performance metrics
- List of features used
- Dataset description
- Training methodology
The final model achieves the following performance (approximate):
- Accuracy: ~85%
- Precision: ~83%
- Recall: ~87%
- F1-Score: ~85%
Actual metrics depend on the dataset and random seed.
- Edit the line with `top_n_features = 8` in `03_feature_selection.py` to change the number of selected features.
- Modify the `param_grids` dictionary in `06_hyperparameter_tuning.py` to test different hyperparameters.
- Customize plot colors and styles in individual scripts using matplotlib/seaborn parameters.
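For reference, the `param_grids` dictionary maps model names to scikit-learn parameter grids. The keys and values below are illustrative, not the script's actual contents:

```python
# Illustrative shape of param_grids in 06_hyperparameter_tuning.py
# (model names and parameter values here are examples only)
param_grids = {
    "Random Forest": {
        "n_estimators": [100, 200, 300],
        "max_depth": [None, 5, 10],
    },
    "SVM": {
        "C": [0.1, 1, 10],
        "kernel": ["rbf", "linear"],
    },
}
```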
Source: UCI Heart Disease Dataset
Features:
- `age`: Age in years
- `sex`: Sex (1 = male, 0 = female)
- `cp`: Chest pain type (0-3)
- `trestbps`: Resting blood pressure (mm Hg)
- `chol`: Serum cholesterol (mg/dl)
- `fbs`: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
- `restecg`: Resting electrocardiographic results (0-2)
- `thalach`: Maximum heart rate achieved
- `exang`: Exercise induced angina (1 = yes, 0 = no)
- `oldpeak`: ST depression induced by exercise
- `slope`: Slope of peak exercise ST segment (0-2)
- `ca`: Number of major vessels colored by fluoroscopy (0-4)
- `thal`: Thalassemia (0-3)
Target:
- `0`: No heart disease
- `1`: Heart disease present
- Medical Disclaimer: This is an educational project. Always consult healthcare professionals for medical advice.
- Data Privacy: Do not use real patient data without proper authorization and HIPAA compliance.
- Model Limitations: Machine learning models are not perfect and should be used as decision support tools, not definitive diagnoses.
Issue: `FileNotFoundError` when running scripts

- Solution: Run the scripts in order, starting with `01_data_preprocessing.py`

Issue: Module import errors

- Solution: Install all requirements: `pip install -r requirements.txt`

Issue: Streamlit app shows "Model not found"

- Solution: Run the hyperparameter tuning script first to create the model file

Issue: Memory errors with large datasets

- Solution: Reduce `n_iter` in RandomizedSearchCV or use smaller parameter grids
Option 1: Automated Deployment Script

```bash
python deploy_with_ngrok.py
```

This script will:

- Check all prerequisites
- Start Streamlit automatically
- Start the Ngrok tunnel
- Display the public URL
- Handle cleanup on exit

Option 2: Manual Deployment

Terminal 1 - Start Streamlit:

```bash
streamlit run app.py --server.port 8501
```

Terminal 2 - Start Ngrok:

```bash
ngrok http 8501
```

Copy the forwarding URL (https://xxxxx.ngrok-free.app) and share it!
For comprehensive deployment instructions, including:
- Ngrok installation and setup
- Authentication configuration
- Troubleshooting tips
- Security considerations
- Alternative deployment options
See: deployment/ngrok_setup.txt
- Instant public URL - Share your app with anyone
- HTTPS encryption - Secure by default
- No server setup - Works from your laptop
- Web interface - Monitor requests at http://localhost:4040
- Free tier available - Perfect for demos and testing
- URL changes each time you restart Ngrok
- 8-hour session limit
- 40 connections per minute
Upgrade to Pro for:
- Static URLs (reserved domains)
- Unlimited sessions
- Higher bandwidth
- Custom domains
- pandas: Data manipulation
- numpy: Numerical operations