A comprehensive machine learning project for predicting heart disease using the UCI Heart Disease dataset. This project includes data preprocessing, feature engineering, multiple ML models, hyperparameter tuning, and a Streamlit web interface.
This project implements a complete machine learning pipeline for heart disease prediction, including:
- Data Preprocessing & Cleaning: Handling missing values, feature scaling, and exploratory data analysis
- Dimensionality Reduction: PCA analysis for feature compression
- Feature Selection: Using Random Forest, RFE, and Chi-Square tests
- Supervised Learning: Logistic Regression, Decision Trees, Random Forest, and SVM
- Unsupervised Learning: K-Means and Hierarchical Clustering
- Hyperparameter Tuning: GridSearchCV and RandomizedSearchCV optimization
- Web Interface: Interactive Streamlit application for predictions
```
Heart_Disease_Project/
│
├── data/
│   ├── preprocessed_data.csv          # Cleaned and scaled data
│   ├── pca_transformed_data.csv       # PCA-transformed data
│   └── selected_features_data.csv     # Data with selected features
│
├── models/
│   ├── final_model.pkl                # Best trained model
│   └── model_metadata.json            # Model information
│
├── results/
│   ├── *.png                          # All visualization outputs
│   ├── model_performance.csv          # Model metrics
│   ├── feature_selection_scores.csv   # Feature importance
│   ├── clustering_comparison.csv      # Clustering results
│   └── hyperparameter_tuning_results.csv
│
├── 01_data_preprocessing.py           # Data cleaning and EDA
├── 02_pca_analysis.py                 # PCA dimensionality reduction
├── 03_feature_selection.py            # Feature importance and selection
├── 04_supervised_learning.py          # Classification models
├── 05_unsupervised_learning.py        # Clustering analysis
├── 06_hyperparameter_tuning.py        # Model optimization
├── app.py                             # Streamlit web application
├── main.py                            # Main runner script
├── requirements.txt                   # Python dependencies
└── README.md                          # This file
```
- Python 3.8 or higher
- pip package manager
1. Clone the repository or download the project files

2. Install required packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Create necessary directories (auto-created by scripts):

   ```bash
   mkdir data models results
   ```

Run the complete pipeline:

```bash
python main.py
```

This will execute all pipeline steps in sequence.
Alternatively, run the scripts individually, in order:

```bash
python 01_data_preprocessing.py
python 02_pca_analysis.py
python 03_feature_selection.py
python 04_supervised_learning.py
python 05_unsupervised_learning.py
python 06_hyperparameter_tuning.py
```

After training the models, launch the web interface:

```bash
streamlit run app.py
```

The application will open in your default browser at http://localhost:8501.
- Loads Heart Disease UCI dataset
- Handles missing values using median imputation
- Converts target to binary classification
- Performs feature scaling with StandardScaler
- Generates EDA visualizations (distributions, correlations, boxplots)
Outputs: `preprocessed_data.csv`, visualization files (01-04)
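The preprocessing steps above can be sketched as follows. This is a minimal illustration on a toy frame; the actual script operates on the UCI CSV with its full column set:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the UCI data (values are illustrative)
df = pd.DataFrame({
    "age": [63, 54, None, 41],
    "chol": [233, None, 250, 204],
    "target": [0, 2, 1, 0],   # raw UCI target ranges 0-4
})

# Median imputation for missing values
df = df.fillna(df.median(numeric_only=True))

# Collapse the 0-4 target to binary: 0 = no disease, 1 = disease present
df["target"] = (df["target"] > 0).astype(int)

# Scale features to zero mean / unit variance
features = df.drop(columns="target")
scaled = pd.DataFrame(StandardScaler().fit_transform(features),
                      columns=features.columns)
```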
- Applies Principal Component Analysis
- Determines optimal number of components (95% variance)
- Creates scree plots and cumulative variance plots
- Generates 2D and 3D scatter plots
- Shows component loadings (feature contributions)
Outputs: `pca_transformed_data.csv`, visualization files (05-09)
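A minimal sketch of the 95%-variance threshold approach, with random data standing in for the scaled features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))   # stand-in for 13 scaled features

# A float n_components keeps just enough components
# to explain that fraction of the total variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)

print(X_pca.shape[1], "components retained")
```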
- Method 1: Random Forest feature importance
- Method 2: Recursive Feature Elimination (RFE)
- Method 3: Chi-Square statistical test
- Combines all methods for robust feature selection
- Selects top 8 most important features
Outputs: `selected_features_data.csv`, `feature_selection_scores.csv`, visualization files (10-13)
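The three selection methods can be sketched with scikit-learn. Synthetic data stands in for the heart disease features; note that `chi2` requires non-negative inputs, hence the min-max rescaling:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=200, n_features=13, random_state=42)

# Method 1: Random Forest impurity-based importance
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
rf_top = np.argsort(rf.feature_importances_)[::-1][:8]

# Method 2: Recursive Feature Elimination down to 8 features
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=42),
          n_features_to_select=8).fit(X, y)

# Method 3: Chi-square test (needs non-negative values)
X_pos = MinMaxScaler().fit_transform(X)
chi = SelectKBest(chi2, k=8).fit(X_pos, y)

print(sorted(rf_top), rfe.support_.sum(), chi.get_support().sum())
```

In the project, the rankings from all three methods are then combined to pick the final top-8 set.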
- Trains four classification models:
- Logistic Regression
- Decision Tree
- Random Forest
- Support Vector Machine (SVM)
- Evaluates with accuracy, precision, recall, F1-score
- Performs 5-fold cross-validation
- Generates ROC curves and confusion matrices
Outputs: `model_performance.csv`, visualization files (14-18)
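A condensed sketch of the four-model comparison with 5-fold cross-validation. Synthetic data is used here; the actual script also computes precision, recall, F1, ROC curves, and confusion matrices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(random_state=0),
}

# 5-fold cross-validated accuracy for each model
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```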
- K-Means Clustering:
- Elbow method for optimal K
- Silhouette score analysis
- Hierarchical Clustering:
- Multiple linkage methods (ward, complete, average, single)
- Dendrogram visualization
- Compares clusters with actual disease labels
Outputs: `clustering_comparison.csv`, visualization files (19-24)
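The K-Means elbow/silhouette sweep and a Ward-linkage hierarchical clustering can be sketched as follows (synthetic data in place of the selected features):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import silhouette_score

X, _ = make_classification(n_samples=200, n_features=8, random_state=1)

# Elbow data: inertia and silhouette score for K = 2..6
inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)

# Hierarchical clustering with Ward linkage
ward = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
print(max(silhouettes, key=silhouettes.get), "clusters by silhouette")
```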
- GridSearchCV: Exhaustive parameter search
- RandomizedSearchCV: Faster random sampling
- Optimizes all four classification models
- Compares tuning methods
- Saves best model as `final_model.pkl`

Outputs: `final_model.pkl`, `model_metadata.json`, `hyperparameter_tuning_results.csv`, visualization files (25-27)
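A minimal GridSearchCV sketch, including persisting the best estimator the way the project saves `final_model.pkl`. The data is synthetic and the grid and file path are illustrative stand-ins:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Exhaustive search over a small illustrative grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=5,
)
grid.fit(X, y)

# Persist the best estimator (temp-dir path used here for the sketch)
path = os.path.join(tempfile.gettempdir(), "final_model.pkl")
joblib.dump(grid.best_estimator_, path)
print(grid.best_params_)
```

RandomizedSearchCV follows the same pattern but samples a fixed number of parameter combinations (`n_iter`) instead of trying them all.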
- Input patient health data through intuitive forms
- Real-time heart disease risk prediction
- Visual risk gauge with color-coded indicators
- Confidence scores and detailed recommendations
- Dataset statistics and distribution
- Interactive feature visualizations
- Correlation heatmaps
- Raw data viewer
- Model performance metrics
- List of features used
- Dataset description
- Training methodology
The final model achieves the following performance (approximate):
- Accuracy: ~85%
- Precision: ~83%
- Recall: ~87%
- F1-Score: ~85%
Actual metrics depend on the dataset and random seed.
- Edit the line with `top_n_features = 8` in `03_feature_selection.py` to change the number of selected features.
- Modify the `param_grids` dictionary in `06_hyperparameter_tuning.py` to test different hyperparameters.
- Customize plot colors and styles in individual scripts using matplotlib/seaborn parameters.
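For reference, the `param_grids` dictionary maps model names to scikit-learn parameter grids. The keys and values below are illustrative, not the script's actual contents:

```python
# Illustrative shape of param_grids in 06_hyperparameter_tuning.py
# (model names and parameter values here are examples only)
param_grids = {
    "Random Forest": {
        "n_estimators": [100, 200, 300],
        "max_depth": [None, 5, 10],
    },
    "SVM": {
        "C": [0.1, 1, 10],
        "kernel": ["rbf", "linear"],
    },
}
```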
Source: UCI Heart Disease Dataset
Features:
- `age`: Age in years
- `sex`: Sex (1 = male, 0 = female)
- `cp`: Chest pain type (0-3)
- `trestbps`: Resting blood pressure (mm Hg)
- `chol`: Serum cholesterol (mg/dl)
- `fbs`: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
- `restecg`: Resting electrocardiographic results (0-2)
- `thalach`: Maximum heart rate achieved
- `exang`: Exercise induced angina (1 = yes, 0 = no)
- `oldpeak`: ST depression induced by exercise
- `slope`: Slope of peak exercise ST segment (0-2)
- `ca`: Number of major vessels colored by fluoroscopy (0-4)
- `thal`: Thalassemia (0-3)
Target:
- `0`: No heart disease
- `1`: Heart disease present
- Medical Disclaimer: This is an educational project. Always consult healthcare professionals for medical advice.
- Data Privacy: Do not use real patient data without proper authorization and HIPAA compliance.
- Model Limitations: Machine learning models are not perfect and should be used as decision support tools, not definitive diagnoses.
Issue: `FileNotFoundError` when running scripts

- Solution: Run the scripts in order, starting with `01_data_preprocessing.py`

Issue: Module import errors

- Solution: Install all requirements: `pip install -r requirements.txt`

Issue: Streamlit app shows "Model not found"

- Solution: Run the hyperparameter tuning script first to create the model file

Issue: Memory errors with large datasets

- Solution: Reduce `n_iter` in RandomizedSearchCV or use smaller parameter grids
Option 1: Automated Deployment Script

```bash
python deploy_with_ngrok.py
```

This script will:

- Check all prerequisites
- Start Streamlit automatically
- Start the Ngrok tunnel
- Display the public URL
- Handle cleanup on exit

Option 2: Manual Deployment

Terminal 1 - Start Streamlit:

```bash
streamlit run app.py --server.port 8501
```

Terminal 2 - Start Ngrok:

```bash
ngrok http 8501
```

Copy the forwarding URL (https://xxxxx.ngrok-free.app) and share it!
For comprehensive deployment instructions, including:
- Ngrok installation and setup
- Authentication configuration
- Troubleshooting tips
- Security considerations
- Alternative deployment options
See: deployment/ngrok_setup.txt
- Instant public URL - Share your app with anyone
- HTTPS encryption - Secure by default
- No server setup - Works from your laptop
- Web interface - Monitor requests at http://localhost:4040
- Free tier available - Perfect for demos and testing
- URL changes each time you restart Ngrok
- 8-hour session limit
- 40 connections per minute
Upgrade to Pro for:
- Static URLs (reserved domains)
- Unlimited sessions
- Higher bandwidth
- Custom domains
- pandas: Data manipulation
- numpy: Numerical operations