My solution to the Kaggle Quora Question Pairs competition (Top 2%, Private LB log loss 0.13497).
The solution uses a mixture of purely statistical features, classical NLP features, and deep learning. Almost 200 handcrafted features are combined with out-of-fold predictions from 4 neural networks having different architectures.
The final model is a GBM (LightGBM), trained with early stopping and a very small learning rate, using stratified K-fold cross validation.
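The out-of-fold scheme described above can be sketched in a few lines. This is a minimal, standard-library-only illustration, not the code from the notebooks: the helper names (`stratified_folds`, `out_of_fold_predictions`, `fit_prior`, `predict_prior`) are hypothetical, and a trivial "predict the training positive rate" model stands in for the neural networks and LightGBM.

```python
import math
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Assign every sample to a fold while keeping the class ratio roughly equal per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_splits
    return fold_of


def log_loss(labels, probs, eps=1e-15):
    """Binary log loss, the metric reported on the leaderboard."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def out_of_fold_predictions(features, labels, fit, predict, n_splits=5, seed=42):
    """Predict each sample with a model that never saw that sample during training."""
    fold_of = stratified_folds(labels, n_splits, seed)
    oof = [None] * len(labels)
    for fold in range(n_splits):
        train_idx = [i for i, f in enumerate(fold_of) if f != fold]
        valid_idx = [i for i, f in enumerate(fold_of) if f == fold]
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        preds = predict(model, [features[i] for i in valid_idx])
        for i, p in zip(valid_idx, preds):
            oof[i] = p
    return oof


# Hypothetical stand-in model: always predicts the positive rate of its training split.
def fit_prior(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_prior(model, xs):
    return [model for _ in xs]


labels = [1] * 20 + [0] * 30
features = [[float(i)] for i in range(len(labels))]
oof = out_of_fold_predictions(features, labels, fit_prior, predict_prior)
cv_score = log_loss(labels, oof)
print(f"CV log loss: {cv_score:.5f}")
```

Because every out-of-fold prediction comes from a model that did not train on that row, the predictions can be fed to a second-level model as ordinary features without leaking the target.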
Almost all code (with the exception of some 3rd-party scripts) can efficiently utilize multi-core machines.
At the same time, some of it might be memory-hungry.
All code has been tested on a machine with 64 GB RAM.
For all non-neural notebooks, a c4.8xlarge AWS instance should work well.
For neural networks, a GPU is highly recommended. On a GTX 1080 Ti, it takes about 8-9 hours to complete all 4 "neural" notebooks.
You'll need about 30 GB of free disk space to store the pre-trained word embeddings and the extracted features.
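A quick pre-flight check for the disk requirement might look like the sketch below. It is not part of the repository: the function names are hypothetical, and the 30 GB threshold is simply the figure quoted above.

```python
import shutil

REQUIRED_GB = 30  # approximate figure quoted in this README


def free_disk_gb(path="."):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024 ** 3


def has_enough_disk(path=".", required_gb=REQUIRED_GB):
    """True if the filesystem containing `path` has at least `required_gb` GiB free."""
    return free_disk_gb(path) >= required_gb


if not has_enough_disk("."):
    print(f"Only {free_disk_gb('.'):.1f} GiB free; about {REQUIRED_GB} GiB is recommended.")
```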
- Python >= 3.6.
- LightGBM (compiled from sources).
- FastText (compiled from sources).
- Python packages from requirements.txt.
- (Recommended) NVIDIA CUDA and a GPU version of TensorFlow.
You can spin up a fresh Ubuntu 16.04 AWS instance and use Ansible to install and configure all the necessary software (except the GPU-related stuff).
- Make sure to open the ports 22 and 8888 on the target machine.
- Navigate to the provisioning directory.
- Edit config.yml:
  - jupyter_plaintext_password: the password to set for the Jupyter server on the target machine.
  - kaggle_username, kaggle_password: your Kaggle credentials (required to download the competition datasets). Otherwise, download them to the data folder manually.
- Edit inventory.ini and specify your instance DNS and the private key file (*.pem) to access it.
- Run:
    $ ansible-galaxy install -r requirements.yml
    $ ansible-playbook playbook.yml -i inventory.ini
To run the whole pipeline automatically, run run-all.sh from the repository root. Check notebooks/output for execution progress and data/submissions for the final results.
To run the steps manually instead, start a Jupyter server in the notebooks directory. If you used the Ansible playbook, the server will already be running on port 8888.
Run the notebooks in the following order:
- Preprocessing.
  1) preproc-tokenize-spellcheck.ipynb
  2) preproc-extract-unique-questions.ipynb
  3) preproc-embeddings-fasttext.ipynb
  4) preproc-nn-sequences-fasttext.ipynb
- Feature extraction.
  Run all feature-*.ipynb notebooks in arbitrary order.
  Note: for faster execution, run all feature-oofp-nn-*.ipynb notebooks on a machine with a GPU and NVIDIA CUDA.
- Prediction.
  Run classify-lightgbm-cv-pred.ipynb. The output file will be saved as DATETIME-submission-draft-CVSCORE.csv.
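The order above can also be scripted. The sketch below is an assumption about how one might do it with `jupyter nbconvert`, and it is not necessarily what run-all.sh does. The helper names are hypothetical, and feature notebooks are sorted alphabetically only to make runs reproducible, since their order does not matter.

```python
import glob
import os
import subprocess

# Preprocessing notebooks must run in this exact order.
PREPROCESSING = [
    "preproc-tokenize-spellcheck.ipynb",
    "preproc-extract-unique-questions.ipynb",
    "preproc-embeddings-fasttext.ipynb",
    "preproc-nn-sequences-fasttext.ipynb",
]
PREDICTION = ["classify-lightgbm-cv-pred.ipynb"]


def ordered_notebooks(notebook_dir="."):
    """Preprocessing first, then every feature-*.ipynb, then the final prediction notebook."""
    features = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(notebook_dir, "feature-*.ipynb"))
    )
    return PREPROCESSING + features + PREDICTION


def nbconvert_command(notebook):
    """Build a command that executes one notebook in place, with the cell timeout disabled."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        "--ExecutePreprocessor.timeout=-1",
        notebook,
    ]


def run_all(notebook_dir="."):
    """Execute the notebooks one by one and stop at the first failure."""
    for notebook in ordered_notebooks(notebook_dir):
        subprocess.run(nbconvert_command(notebook), cwd=notebook_dir, check=True)
```

Calling `run_all("notebooks")` from the repository root would then execute everything sequentially. Running the feature-oofp-nn-*.ipynb notebooks on a separate GPU machine would require splitting the list by hand.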