
Heart Disease Machine Learning Pipeline

A comprehensive machine learning project for predicting heart disease using the UCI Heart Disease dataset. This project includes data preprocessing, feature engineering, multiple ML models, hyperparameter tuning, and a Streamlit web interface.

📋 Project Overview

This project implements a complete machine learning pipeline for heart disease prediction, including:

  • Data Preprocessing & Cleaning: Handling missing values, feature scaling, and exploratory data analysis
  • Dimensionality Reduction: PCA analysis for feature compression
  • Feature Selection: Using Random Forest, RFE, and Chi-Square tests
  • Supervised Learning: Logistic Regression, Decision Trees, Random Forest, and SVM
  • Unsupervised Learning: K-Means and Hierarchical Clustering
  • Hyperparameter Tuning: GridSearchCV and RandomizedSearchCV optimization
  • Web Interface: Interactive Streamlit application for predictions

🗂️ Project Structure

Heart_Disease_Project/
│
├── data/
│   ├── preprocessed_data.csv          # Cleaned and scaled data
│   ├── pca_transformed_data.csv        # PCA-transformed data
│   └── selected_features_data.csv      # Data with selected features
│
├── models/
│   ├── final_model.pkl                 # Best trained model
│   └── model_metadata.json             # Model information
│
├── results/
│   ├── *.png                           # All visualization outputs
│   ├── model_performance.csv           # Model metrics
│   ├── feature_selection_scores.csv    # Feature importance
│   ├── clustering_comparison.csv       # Clustering results
│   └── hyperparameter_tuning_results.csv
│
├── 01_data_preprocessing.py            # Data cleaning and EDA
├── 02_pca_analysis.py                  # PCA dimensionality reduction
├── 03_feature_selection.py             # Feature importance and selection
├── 04_supervised_learning.py           # Classification models
├── 05_unsupervised_learning.py         # Clustering analysis
├── 06_hyperparameter_tuning.py         # Model optimization
├── app.py                              # Streamlit web application
├── main.py                             # Main runner script
├── requirements.txt                    # Python dependencies
└── README.md                           # This file

🚀 Getting Started

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Installation

  1. Clone the repository or download the project files

  2. Install required packages:

pip install -r requirements.txt

  3. Create the necessary directories (optional, since the scripts create them automatically):

mkdir data models results

Running the Pipeline

Option 1: Run All Steps at Once

python main.py

This will execute all pipeline steps in sequence.
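As a rough illustration of what "in sequence" can mean here, the sketch below runs the six numbered scripts one after another and stops at the first failure. The script names come from the project structure above; the `run_pipeline` function and its stop-on-failure behavior are assumptions for illustration, not the actual contents of main.py.

```python
import subprocess
import sys

# Script names are taken from the project structure in this README.
PIPELINE_SCRIPTS = [
    "01_data_preprocessing.py",
    "02_pca_analysis.py",
    "03_feature_selection.py",
    "04_supervised_learning.py",
    "05_unsupervised_learning.py",
    "06_hyperparameter_tuning.py",
]


def run_pipeline(scripts=PIPELINE_SCRIPTS, runner=subprocess.run):
    """Run each script in order; stop at the first non-zero exit code.

    Hypothetical helper: returns the list of scripts that completed.
    """
    completed = []
    for script in scripts:
        print(f"Running {script} ...")
        # sys.executable keeps the run inside the active virtual environment.
        result = runner([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} exited with code {result.returncode}; stopping.")
            break
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the project root would execute the steps. Stopping early matters because each step reads files written by an earlier one.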

Option 2: Run Individual Scripts

python 01_data_preprocessing.py
python 02_pca_analysis.py
python 03_feature_selection.py
python 04_supervised_learning.py
python 05_unsupervised_learning.py
python 06_hyperparameter_tuning.py

Running the Streamlit Application

After training the models, launch the web interface:

streamlit run app.py

The application opens in your default browser at http://localhost:8501.

📊 Pipeline Steps Explained

1. Data Preprocessing (01_data_preprocessing.py)

  • Loads Heart Disease UCI dataset
  • Handles missing values using median imputation
  • Converts target to binary classification
  • Performs feature scaling with StandardScaler
  • Generates EDA visualizations (distributions, correlations, boxplots)

Outputs:

  • preprocessed_data.csv
  • Visualization files (01-04)
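The core of this step can be sketched on a tiny made-up table. The column name `num` (a 0-4 diagnosis grade, as in the original UCI file) and the `preprocess` helper are assumptions; the actual script may name and order things differently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows standing in for the real dataset.
raw = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 45],
    "chol": [233, np.nan, 204, 236, np.nan, 199],
    "ca": [0, 0, np.nan, 1, 2, 0],
    "num": [0, 2, 0, 1, 3, 0],  # assumed 0-4 diagnosis column
})


def preprocess(df, target_col="num"):
    """Hypothetical helper: binarize target, impute medians, standardize."""
    df = df.copy()
    # Binary target: 0 = no disease, 1 = any disease grade.
    y = (df.pop(target_col) > 0).astype(int)
    # Median imputation, column by column.
    df = df.fillna(df.median(numeric_only=True))
    # Standardize every feature to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(df)
    out = pd.DataFrame(scaled, columns=df.columns, index=df.index)
    out["target"] = y
    return out


processed = preprocess(raw)
print(processed.round(2))
```

In a stricter setup the imputer and scaler would be fit on the training split only, so that no information from the test rows leaks into the statistics.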

2. PCA Analysis (02_pca_analysis.py)

  • Applies Principal Component Analysis
  • Determines optimal number of components (95% variance)
  • Creates scree plots and cumulative variance plots
  • Generates 2D and 3D scatter plots
  • Shows component loadings (feature contributions)

Outputs:

  • pca_transformed_data.csv
  • Visualization files (05-09)
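One way to pick the number of components for a 95% variance target is to pass a float to scikit-learn's `PCA`. The sketch below uses synthetic data with 13 features (the feature count listed in this README); whether the project script uses this exact mechanism is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the heart disease dataset.
X, _ = make_classification(
    n_samples=300, n_features=13, n_informative=6, n_redundant=4, random_state=42
)
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components whose cumulative
# explained variance exceeds that fraction (requires the full SVD solver).
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(f"Components kept: {pca.n_components_}")
print(f"Cumulative explained variance: {cumulative[-1]:.3f}")

# Loadings: one row per component, one column per original feature.
loadings = pca.components_
```

The `cumulative` array is what a cumulative variance plot would show, and `loadings` holds the per-feature contributions.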

3. Feature Selection (03_feature_selection.py)

  • Method 1: Random Forest feature importance
  • Method 2: Recursive Feature Elimination (RFE)
  • Method 3: Chi-Square statistical test
  • Combines all methods for robust feature selection
  • Selects top 8 most important features

Outputs:

  • selected_features_data.csv
  • feature_selection_scores.csv
  • Visualization files (10-13)
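The README does not say how the three methods are combined. A minimal sketch, assuming a simple average of per-method ranks, could look like this; the data is synthetic, the feature names are placeholders, and the rank-averaging rule is an illustrative choice.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=42)
features = [f"f{i}" for i in range(X.shape[1])]  # placeholder names

# chi2 requires non-negative inputs, so rescale to [0, 1] first.
X_pos = MinMaxScaler().fit_transform(X)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_pos, y)
# Eliminating down to one feature yields a complete ranking (1 = kept longest).
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1).fit(X_pos, y)
chi_scores, _ = chi2(X_pos, y)

scores = pd.DataFrame(
    {"rf_importance": rf.feature_importances_, "rfe_rank": rfe.ranking_, "chi2": chi_scores},
    index=features,
)

# Convert each method to a rank (1 = best) and average the three ranks.
ranks = pd.DataFrame({
    "rf": scores["rf_importance"].rank(ascending=False),
    "rfe": scores["rfe_rank"],
    "chi2": scores["chi2"].rank(ascending=False),
})
scores["mean_rank"] = ranks.mean(axis=1)

top_n_features = 8
selected = scores.sort_values("mean_rank").head(top_n_features).index.tolist()
print(selected)
```

Averaging ranks rather than raw scores avoids having to put importances, elimination order, and chi-square statistics on a common scale.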

4. Supervised Learning (04_supervised_learning.py)

  • Trains four classification models:
    • Logistic Regression
    • Decision Tree
    • Random Forest
    • Support Vector Machine (SVM)
  • Evaluates with accuracy, precision, recall, F1-score
  • Performs 5-fold cross-validation
  • Generates ROC curves and confusion matrices

Outputs:

  • model_performance.csv
  • Visualization files (14-18)
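A compact way to evaluate the four model families with 5-fold cross-validation is scikit-learn's `cross_validate`. The sketch below runs on synthetic data with default hyperparameters, so the printed numbers say nothing about this project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the selected-features data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(),
}
metrics = ["accuracy", "precision", "recall", "f1"]

results = {}
for name, model in models.items():
    # cv=5 uses stratified 5-fold splits for classifiers.
    cv = cross_validate(model, X, y, cv=5, scoring=metrics)
    results[name] = {m: cv[f"test_{m}"].mean() for m in metrics}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```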

5. Unsupervised Learning (05_unsupervised_learning.py)

  • K-Means Clustering:
    • Elbow method for optimal K
    • Silhouette score analysis
  • Hierarchical Clustering:
    • Multiple linkage methods (ward, complete, average, single)
    • Dendrogram visualization
  • Compares clusters with actual disease labels

Outputs:

  • clustering_comparison.csv
  • Visualization files (19-24)
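The clustering analysis can be sketched as follows on synthetic two-group data. Using the adjusted Rand index to compare clusters with the known labels is an assumption; the README does not name the comparison metric, and dendrogram plotting is left out.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data; y plays the role of the disease labels.
X, y = make_blobs(n_samples=200, centers=2, n_features=5, random_state=42)

# K-Means: record inertia (for an elbow plot) and silhouette score per k.
inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)
best_k = max(silhouettes, key=silhouettes.get)
print(f"Best k by silhouette: {best_k}")

# Hierarchical clustering: compare linkage methods against the true labels.
ari = {}
for linkage in ["ward", "complete", "average", "single"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    ari[linkage] = adjusted_rand_score(y, labels)
    print(f"{linkage}: ARI = {ari[linkage]:.3f}")
```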

6. Hyperparameter Tuning (06_hyperparameter_tuning.py)

  • GridSearchCV: Exhaustive parameter search
  • RandomizedSearchCV: Faster random sampling
  • Optimizes all four classification models
  • Compares tuning methods
  • Saves best model as final_model.pkl

Outputs:

  • final_model.pkl
  • model_metadata.json
  • hyperparameter_tuning_results.csv
  • Visualization files (25-27)
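To make the difference between the two search strategies concrete, the sketch below tunes a single Random Forest both ways on synthetic data and pickles the better estimator. The grid values, the F1 scoring choice, and the use of `pickle` are assumptions; the file is written to a temporary directory rather than models/.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=8, n_informative=5, random_state=42)

# Illustrative grid: 2 x 3 = 6 candidate settings.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, 5, None]}

# Exhaustive search: evaluates all 6 candidates.
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="f1")
grid.fit(X, y)

# Random search: evaluates only n_iter=4 sampled candidates.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=42), param_grid,
    n_iter=4, cv=5, scoring="f1", random_state=42,
)
rand.fit(X, y)

print(f"Grid:   {grid.best_score_:.3f} with {grid.best_params_}")
print(f"Random: {rand.best_score_:.3f} with {rand.best_params_}")

# Persist whichever search found the better cross-validated score.
best = grid if grid.best_score_ >= rand.best_score_ else rand
model_path = Path(tempfile.mkdtemp()) / "final_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(best.best_estimator_, f)
```

Random search trades completeness for speed: it fits fewer candidates, so it can miss the best grid point but scales better as grids grow.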

🖥️ Streamlit Application Features

1. Prediction Page

  • Input patient health data through intuitive forms
  • Real-time heart disease risk prediction
  • Visual risk gauge with color-coded indicators
  • Confidence scores and detailed recommendations

2. Data Exploration

  • Dataset statistics and distribution
  • Interactive feature visualizations
  • Correlation heatmaps
  • Raw data viewer

3. Model Information

  • Model performance metrics
  • List of features used
  • Dataset description
  • Training methodology

📈 Model Performance

The final model achieves the following performance (approximate):

  • Accuracy: ~85%
  • Precision: ~83%
  • Recall: ~87%
  • F1-Score: ~85%

Actual metrics depend on the dataset and random seed.
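As a quick consistency check, the F1-score is the harmonic mean of precision and recall, so the approximate figures quoted above can be verified against each other:

```python
precision, recall = 0.83, 0.87  # approximate values quoted in this README

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.850"
```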

🔧 Customization

Adjusting Feature Selection

In 03_feature_selection.py, change the value in the line top_n_features = 8 to adjust the number of selected features.

Model Parameters

Modify the param_grids dictionary in 06_hyperparameter_tuning.py to test different hyperparameters.
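The exact contents of param_grids are not shown in this README. The sketch below assumes a dictionary keyed by model name, with hypothetical parameter values, and uses `ParameterGrid` to show how quickly the number of fits grows when a grid is extended.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical shape; keys and values in the actual script may differ.
param_grids = {
    "Logistic Regression": {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]},
    "Decision Tree": {"max_depth": [3, 5, 10, None], "min_samples_split": [2, 5, 10]},
    "Random Forest": {"n_estimators": [100, 200], "max_depth": [5, 10, None]},
    "SVM": {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
}

cv_folds = 5  # assumed number of folds
candidate_counts = {}
for name, grid in param_grids.items():
    # ParameterGrid enumerates every combination, so its length is the
    # number of candidates an exhaustive grid search would evaluate.
    candidate_counts[name] = len(ParameterGrid(grid))
    total_fits = candidate_counts[name] * cv_folds
    print(f"{name}: {candidate_counts[name]} candidates, {total_fits} fits")
```

Adding one more value to a parameter multiplies the candidate count for that model, which is worth checking before launching a long search.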

Visualization Style

Customize plot colors and styles in individual scripts using matplotlib/seaborn parameters.

📝 Dataset Information

Source: UCI Heart Disease Dataset

Features:

  • age: Age in years
  • sex: Sex (1 = male, 0 = female)
  • cp: Chest pain type (0-3)
  • trestbps: Resting blood pressure (mm Hg)
  • chol: Serum cholesterol (mg/dl)
  • fbs: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
  • restecg: Resting electrocardiographic results (0-2)
  • thalach: Maximum heart rate achieved
  • exang: Exercise induced angina (1 = yes, 0 = no)
  • oldpeak: ST depression induced by exercise
  • slope: Slope of peak exercise ST segment (0-2)
  • ca: Number of major vessels colored by fluoroscopy (0-4)
  • thal: Thalassemia (0-3)

Target:

  • 0: No heart disease
  • 1: Heart disease present
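The coded ranges listed above lend themselves to a simple input check, for example before a record is passed to a model. The `validate_record` helper below is hypothetical and only covers the integer-coded features.

```python
# Allowed (min, max) codes, copied from the feature list in this README.
CODED_RANGES = {
    "sex": (0, 1),
    "cp": (0, 3),
    "fbs": (0, 1),
    "restecg": (0, 2),
    "exang": (0, 1),
    "slope": (0, 2),
    "ca": (0, 4),
    "thal": (0, 3),
}


def validate_record(record):
    """Hypothetical helper: return a list of problems found in one record."""
    errors = []
    for name, (low, high) in CODED_RANGES.items():
        if name not in record:
            errors.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, int) or not low <= value <= high:
            errors.append(f"{name}: expected an integer in [{low}, {high}], got {value!r}")
    return errors


sample = {"sex": 1, "cp": 2, "fbs": 0, "restecg": 1, "exang": 0, "slope": 1, "ca": 0, "thal": 2}
print(validate_record(sample))  # prints []
```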

⚠️ Important Notes

  1. Medical Disclaimer: This is an educational project. Always consult healthcare professionals for medical advice.

  2. Data Privacy: Do not use real patient data without proper authorization and HIPAA compliance.

  3. Model Limitations: Machine learning models are not perfect and should be used as decision support tools, not definitive diagnoses.

🐛 Troubleshooting

Common Issues

Issue: FileNotFoundError when running scripts

  • Solution: Run scripts in order, starting with 01_data_preprocessing.py

Issue: Module import errors

  • Solution: Install all requirements: pip install -r requirements.txt

Issue: Streamlit app shows "Model not found"

  • Solution: Run hyperparameter tuning script first to create the model file

Issue: Memory errors with large datasets

  • Solution: Reduce n_iter in RandomizedSearchCV or use smaller parameter grids
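Several of these issues come down to a missing intermediate file. A small check like the one below can report which artifact is absent and which script produces it. The file-to-script mapping follows the pipeline description in this README; the `missing_artifacts` helper itself is hypothetical.

```python
from pathlib import Path

# Artifact -> script that writes it, per the pipeline steps described above.
ARTIFACT_SOURCES = {
    "data/preprocessed_data.csv": "01_data_preprocessing.py",
    "data/pca_transformed_data.csv": "02_pca_analysis.py",
    "data/selected_features_data.csv": "03_feature_selection.py",
    "models/final_model.pkl": "06_hyperparameter_tuning.py",
}


def missing_artifacts(project_root="."):
    """Return (path, producing_script) pairs for artifacts that do not exist."""
    root = Path(project_root)
    return [
        (path, script)
        for path, script in ARTIFACT_SOURCES.items()
        if not (root / path).exists()
    ]


for path, script in missing_artifacts():
    print(f"Missing {path}: run {script}")
```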

🌐 Deployment with Ngrok

Quick Deployment

Option 1: Automated Deployment Script

python deploy_with_ngrok.py

This script will:

  • Check all prerequisites
  • Start Streamlit automatically
  • Start Ngrok tunnel
  • Display the public URL
  • Handle cleanup on exit
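The contents of deploy_with_ngrok.py are not reproduced here. A minimal sketch of the same idea, assuming the `ngrok` binary is on the PATH and that its local agent API at http://localhost:4040/api/tunnels returns a JSON object with a `tunnels` list containing `public_url` fields, might look like this. The function names and the fixed startup wait are illustrative choices.

```python
import json
import subprocess
import sys
import time
import urllib.request

NGROK_API = "http://localhost:4040/api/tunnels"  # assumed local agent endpoint


def extract_public_url(payload):
    """Return the first HTTPS public URL from a tunnels payload, or None."""
    for tunnel in payload.get("tunnels", []):
        url = tunnel.get("public_url", "")
        if url.startswith("https://"):
            return url
    return None


def deploy(port=8501, wait_seconds=5):
    """Start Streamlit and Ngrok, print the public URL, clean up on exit."""
    streamlit = subprocess.Popen(
        [sys.executable, "-m", "streamlit", "run", "app.py", "--server.port", str(port)]
    )
    ngrok = subprocess.Popen(["ngrok", "http", str(port)])
    try:
        time.sleep(wait_seconds)  # crude wait for both processes to start
        with urllib.request.urlopen(NGROK_API) as response:
            print("Public URL:", extract_public_url(json.load(response)))
        streamlit.wait()  # block until Streamlit exits or Ctrl+C is pressed
    except KeyboardInterrupt:
        pass
    finally:
        # Terminate both child processes so no tunnel is left running.
        for proc in (ngrok, streamlit):
            proc.terminate()
```

Calling `deploy()` from the project root would start both processes. The prerequisite checks mentioned in the list above are omitted from this sketch.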

Option 2: Manual Deployment

Terminal 1 - Start Streamlit:

streamlit run app.py --server.port 8501

Terminal 2 - Start Ngrok:

ngrok http 8501

Copy the forwarding URL (https://xxxxx.ngrok-free.app) and share it!

Detailed Deployment Guide

For comprehensive deployment instructions, including:

  • Ngrok installation and setup
  • Authentication configuration
  • Troubleshooting tips
  • Security considerations
  • Alternative deployment options

See: deployment/ngrok_setup.txt

Ngrok Features

✅ Instant public URL - Share your app with anyone
✅ HTTPS encryption - Secure by default
✅ No server setup - Works from your laptop
✅ Web interface - Monitor requests at http://localhost:4040
✅ Free tier available - Perfect for demos and testing

Important Notes

⚠️ Free Tier Limitations:

  • URL changes each time you restart Ngrok
  • 8-hour session limit
  • 40 connections per minute

💡 Upgrade to Pro for:

  • Static URLs (reserved domains)
  • Unlimited sessions
  • Higher bandwidth
  • Custom domains

📚 Dependencies

  • pandas: Data manipulation
  • numpy: Numerical operations

Repo Link:

https://github.com/EngPeterAtef/Heart_Disease_Project.git
