Parallelizing word2vec in shared and distributed memory

vasupsu/pWord2Vec

pSGNScc

This is the C++ implementation of the Context Combining optimization of Word2Vec described in the paper

Optimizing Word2Vec Performance on Multicore Systems, accepted at IA^3 2017, the Seventh Workshop on Irregular Applications: Architectures & Algorithms, co-located with SC17.

The code builds on the original pWord2Vec implementation described in the paper Parallelizing Word2Vec in Shared and Distributed Memory, arXiv, 2016.

License

All source code files in the package are under Apache License 2.0.

Prerequisites

The code is developed and tested on UNIX-based systems with the following software dependencies:

  • Intel Compiler (the code is optimized for Intel CPUs)
  • OpenMP (no separate installation is needed once the Intel compiler is installed)
  • MKL (version 16.0.0 or higher is preferred, as the library has improved significantly in recent years)
  • HyperWords (for model accuracy evaluation). This package is included in this repository.
  • Numactl package (for multi-socket NUMA systems)

Environment Setup

  • Install the Intel C++ development environment (Intel compiler, OpenMP, and MKL 16.0.0 or higher; free copies are available for some users).
  • Enable the Intel C++ development environment (adjust the paths to match your installation):
    source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
    source /opt/intel/impi/latest/compilers_and_libraries/linux/bin/compilervars.sh intel64
  • Install the numactl package:
    sudo yum install numactl        # on RedHat/CentOS
    sudo apt-get install numactl    # on Ubuntu
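
Before building, it can help to confirm that the tools set up above are actually visible in the current shell. The following is a minimal sketch, not part of the package: check_tools is a hypothetical helper name, and the command names in the usage comment (icpc for the Intel C++ compiler, make, numactl) are assumptions about what your installation puts on PATH.

```shell
# Hypothetical helper (not shipped with this repository):
# report which of the given commands are visible on PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example call (command names are assumptions, adjust to your setup):
#   check_tools icpc make numactl
```

If the compiler is reported as missing, the compilervars.sh scripts were most likely not sourced in the current shell session.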

Quick Start

  1. Download the code: git clone git@github.com:vasupsu/pWord2Vec.git
  2. Run make to build the package.
    The build produces three binaries: word2vec, pWord2Vec and pSGNScc. These correspond, respectively, to the original implementation of Word2Vec found in this Git repository, the original pWord2Vec, and our pSGNScc context combining approach. The first two are included for performance comparison and verification.
  3. Download the data: cd data; ./getText8.sh or ./getBillion.sh
  4. The directory IA3_AE_test_cases contains Bash test scripts for validating the results in our IA^3 submission. Each test script validates one Figure or Table in the Experiments and Results section of the paper, and is named after the Figure or Table number it validates.
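
Steps 1 to 3 above can be collected into one function. This is only a sketch under stated assumptions: quick_start, run and the DRY_RUN variable are hypothetical names introduced here, the clone URL assumes the repository is hosted on GitHub over SSH, and the function expects to be started from the directory where the clone should land. With DRY_RUN=1 it prints the commands instead of executing them.

```shell
# Hypothetical wrapper around Quick Start steps 1-3 (not part of the package).
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# quick_start [text8|billion]: clone, build, and fetch one dataset.
quick_start() {
  dataset="${1:-text8}"
  case "$dataset" in
    text8)   fetch=./getText8.sh ;;
    billion) fetch=./getBillion.sh ;;
    *) echo "unknown dataset: $dataset" >&2; return 1 ;;
  esac
  # Assumed clone URL (GitHub over SSH); replace it if you use HTTPS or a fork.
  run git clone git@github.com:vasupsu/pWord2Vec.git || return 1
  run cd pWord2Vec || return 1
  run make || return 1          # builds word2vec, pWord2Vec and pSGNScc
  run cd data || return 1
  run "$fetch"
}

# Example: preview the commands without running anything
#   DRY_RUN=1; quick_start text8
```

A dry run first is a cheap way to check the dataset choice before starting the download, since the larger corpus can take a while to fetch.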

Reference

  1. Optimizing Word2Vec Performance on Multicore Systems, accepted at IA^3 2017.
  2. Parallelizing Word2Vec in Shared and Distributed Memory, arXiv, 2016.
  3. Parallelizing Word2Vec in Multi-Core and Many-Core Architectures, in NIPS workshop on Efficient Methods for Deep Neural Networks, Dec. 2016.

For questions, please contact us at [email protected]
