RBHC

Recursive Binary Hierarchical Clustering

This package performs recursive binary hierarchical clustering of data.
The K-Means algorithm is applied to the initial dataset to create a binary partition, and the chi-square score statistic is then used to find the feature (event) responsible for that partition. The remaining clusters are divided recursively in the same way until the cluster size reaches 1 or the silhouette score reaches the threshold value.
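
The procedure described above can be sketched in plain Python. This is an illustrative sketch only, not the package's implementation: `two_means` is a small deterministic stand-in for K-Means, the chi-square score compares observed and expected feature totals per cluster, and every function name and the node layout (`ids`, `feature`, `children`) are assumptions made for this example. It also assumes that a split is kept only when its silhouette score is at least the threshold, and that both halves of a split are divided further.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def two_means(rows, iters=20):
    """Deterministic 2-means (stand-in for K-Means): one 0/1 label per row."""
    # Seed with the first row and the row farthest from it.
    centers = [list(rows[0]), list(max(rows, key=lambda r: dist(r, rows[0])))]
    labels = None
    for _ in range(iters):
        new = [0 if dist(r, centers[0]) <= dist(r, centers[1]) else 1 for r in rows]
        if new == labels:
            break
        labels = new
        for k in (0, 1):
            members = [r for r, l in zip(rows, labels) if l == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(rows, labels):
    """Mean silhouette coefficient of a two-cluster labelling."""
    scores = []
    for i, r in enumerate(rows):
        same = [dist(r, o) for j, o in enumerate(rows) if j != i and labels[j] == labels[i]]
        other = [dist(r, o) for j, o in enumerate(rows) if labels[j] != labels[i]]
        if not same or not other:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a, b = sum(same) / len(same), sum(other) / len(other)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def chi_square_scores(rows, labels):
    """Chi-square score per feature (non-negative values) against the split."""
    totals = [sum(col) for col in zip(*rows)]
    scores = [0.0] * len(totals)
    for k in (0, 1):
        members = [r for r, l in zip(rows, labels) if l == k]
        share = len(members) / len(rows)
        for f, total in enumerate(totals):
            expected = share * total
            if expected > 0:
                observed = sum(r[f] for r in members)
                scores[f] += (observed - expected) ** 2 / expected
    return scores

def rbhc(ids, rows, names, threshold=0.65):
    """Recursively split rows in two; the node layout is hypothetical."""
    node = {"ids": ids, "feature": None, "children": []}
    if len(rows) < 2:
        return node
    labels = two_means(rows)
    # Assumption: keep a split only if its silhouette score meets the threshold.
    if len(set(labels)) < 2 or silhouette(rows, labels) < threshold:
        return node
    scores = chi_square_scores(rows, labels)
    node["feature"] = names[scores.index(max(scores))]
    for k in (0, 1):
        idx = [i for i, l in enumerate(labels) if l == k]
        node["children"].append(
            rbhc([ids[i] for i in idx], [rows[i] for i in idx], names, threshold))
    return node

# Made-up event counts for six instances and three features.
tree = rbhc(["a", "b", "c", "d", "e", "f"],
            [[2, 0, 1], [2, 0, 0], [2, 0, 1], [0, 1, 1], [0, 1, 0], [0, 1, 1]],
            ["login", "purchase", "search"])
print(tree["feature"], [child["ids"] for child in tree["children"]])
```

The seeding in `two_means` is deterministic so that the example is reproducible; a real K-Means run would normally use randomised initialisation.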

Installation

Prerequisites: Python 3

pip install RBHC

Usage

from RBHC import clustering
clustering(dataFilePath,thresholdValue)

dataFilePath = Path to the data file (see Data File Structure)
thresholdValue = Silhouette score threshold (optional; the default in the program is 0.65)

The function returns JSON with a tree structure. Its important fields are:

name = Name of the cluster node (string)
parent = Name of its parent node (string)
size = Size of the cluster (integer)
children = Tree structure of each subtree (List)
clusterCreated = Whether clustering was successful (Boolean)

To see a sample of this return value, run clustering on the sample dataset provided and print the output, or check visualisation/sampleData.json.
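
As a rough illustration of how such a tree can be consumed, the sketch below walks a nested structure that uses the field names listed above. The sample tree, its node names, and the helper names `iter_nodes` and `leaf_sizes` are invented for this example; if the function hands back a JSON string rather than a dict, it would first need to be parsed with `json.loads`, as done here.

```python
import json

def iter_nodes(node, depth=0):
    """Yield (depth, node) for every node in the tree, parents first."""
    yield depth, node
    for child in node.get("children", []):
        yield from iter_nodes(child, depth + 1)

def leaf_sizes(tree):
    """Map each leaf node's name to its size."""
    return {n["name"]: n["size"] for _, n in iter_nodes(tree) if not n.get("children")}

# Hypothetical return value; the node names are made up.
sample = json.loads("""
{"name": "root", "parent": "null", "size": 5, "clusterCreated": true,
 "children": [
   {"name": "A", "parent": "root", "size": 3, "clusterCreated": false, "children": []},
   {"name": "B", "parent": "root", "size": 2, "clusterCreated": false, "children": []}
 ]}
""")

for depth, node in iter_nodes(sample):
    print("  " * depth + f'{node["name"]} (size={node["size"]})')
```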

To run this program interactively in a Jupyter notebook, run jupyter notebook in the root directory; it then opens on localhost.

Statistics

After the program runs, clustering statistics are stored in statistics/hierarchical/nameOfDataFile/. For each sub-cluster created, the stats are stored in a .json file with the following attributes:

ClusterId = Identifier of a sub-cluster (L = level, G = number of the cluster within that level, counted left to right)
Size = Size of the cluster
Primary feature cluster created by = Name of the feature primarily responsible for forming this cluster
Features chi score = Chi score of every feature in this cluster
Stats on cluster by each feature = Stats of each feature in this cluster
Ids = All instances that are part of the cluster; names are taken from column[0] of the data file
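
One possible way to get an overview of these files is sketched below. It assumes, without confirmation from the package, that each .json file holds a single JSON object whose keys match the attribute names listed above exactly; the helper name `summarize_clusters` is hypothetical.

```python
import json
from pathlib import Path

def summarize_clusters(stats_dir):
    """Return (ClusterId, Size, primary feature) for each .json file in stats_dir."""
    summary = []
    for path in sorted(Path(stats_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            stats = json.load(handle)  # assumed to be one JSON object per file
        summary.append((
            stats.get("ClusterId", path.stem),  # fall back to the file name
            stats.get("Size"),
            stats.get("Primary feature cluster created by"),
        ))
    return summary
```

Calling `summarize_clusters("statistics/hierarchical/nameOfDataFile")` would then list one tuple per sub-cluster, with `None` in place of any key that is missing.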

Visualisation

Copy the visualisation folder to the directory where clustering is being used.
nameOfDataFile.json will be created in the visualisation folder for the clustering visualisation.
In the visualisation folder, run python -m http.server 8888, then open http://localhost:8888/ in a web browser.

Data File Structure

IDS         | feature1    | ...                 | featureN
------------|-------------|---------------------|-----------------
ID1         | value1      | ...                 | valueN
...         | ...         | ...                 | ...

All data files should be stored in the data folder; see that folder for a sample .csv data file.
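
A file with this layout can be read with the standard csv module, as in the minimal sketch below. It assumes a comma-delimited file with a header row and numeric feature values; the function name `load_data_file` is hypothetical and is not part of the package.

```python
import csv

def load_data_file(path):
    """Split a data file into instance IDs, feature names, and numeric rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        feature_names = header[1:]  # column 0 holds the instance IDs
        ids, rows = [], []
        for record in reader:
            if not record:
                continue  # skip blank lines
            ids.append(record[0])
            rows.append([float(value) for value in record[1:]])
    return ids, feature_names, rows
```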
