tok is a Byte Pair Encoding (BPE) tokenizer that splits text into tokens, which can then be encoded into ids.
- Custom string normalization
- Easy-to-use API for text tokenization
- Serializable vocabulary and merge rules
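
To give a rough idea of what "merge rules" mean in Byte Pair Encoding, here is a minimal sketch of a single merge step: count adjacent symbol pairs, pick the most frequent one, and fuse it into a new symbol. The names `Symbols`, `Pair`, `most_frequent_pair` and `merge_pair` are hypothetical and written only for this illustration; they are not part of tok's API, and tok's actual training code may work differently.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative only: these helpers are hypothetical and are not tok's API.
using Symbols = std::vector<std::string>;
using Pair = std::pair<std::string, std::string>;

// Return the most frequent adjacent pair of symbols.
// Ties go to the lexicographically smallest pair (std::map iteration order).
// Returns a pair of empty strings if there are fewer than two symbols.
Pair most_frequent_pair(const Symbols& symbols) {
    std::map<Pair, int> counts;
    for (std::size_t i = 0; i + 1 < symbols.size(); ++i) {
        ++counts[Pair(symbols[i], symbols[i + 1])];
    }
    Pair best;
    int best_count = 0;
    for (const auto& entry : counts) {
        if (entry.second > best_count) {
            best = entry.first;
            best_count = entry.second;
        }
    }
    return best;
}

// Replace every non-overlapping occurrence of `pair` with the merged symbol.
Symbols merge_pair(const Symbols& symbols, const Pair& pair) {
    Symbols merged;
    for (std::size_t i = 0; i < symbols.size(); ++i) {
        if (i + 1 < symbols.size() &&
            symbols[i] == pair.first && symbols[i + 1] == pair.second) {
            merged.push_back(pair.first + pair.second);
            ++i;  // skip the second half of the pair
        } else {
            merged.push_back(symbols[i]);
        }
    }
    return merged;
}
```

Repeating these two steps until a target vocabulary size is reached yields an ordered list of merges; applying that list in order to new text is what turns characters into subword tokens.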
 
tok uses utf8proc for string normalization and cereal for serialization; you can install both through your package manager.
Using a Debian-based distro:

    apt install libutf8proc-dev libcereal-dev

- Clone the repository with

      git clone https://github.com/M3nny/tok

- Run `make` inside the cloned repository; it will create a `build` directory with the static library
- Include it in your project (you also have to link utf8proc)
 
      g++ -std=c++11 -c program.cpp -o program.o
      g++ -std=c++11 program.o -o program -L path_to/tok/build -l tok -l utf8proc

Example usage:

    #include <string>
    #include <vector>
    #include "tok.hpp"

    int main() {
        tok tokenizer;
        // Load a pretrained vocabulary and its merge rules
        tokenizer.load("pretrained/eng_adjectives_adverbs_30k.bin");
        std::string str = "i've just bought a melon!";
        std::vector<std::string> tokenized_str = tokenizer.tokenize(str);
        // ["i", "'", "ve", "Ķjust", "Ķbought", "Ķa", "Ķmel", "on", "!", "<|eot|>"]
        return 0;
    }

Important:
The API documentation can be found in `tok.hpp`, and some examples are listed inside the `examples` folder.
You can find pretrained vocabularies inside the `pretrained` folder.
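
As a generic illustration of what "encoding tokens into ids" means, the sketch below looks each token up in a vocabulary map. The function `tokens_to_ids`, the `vocab` map and the `unk_id` fallback are assumptions made for this example and are not tok's API; refer to `tok.hpp` for the functions the library actually provides.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not tok's API: map each token to its id using a
// caller-supplied vocabulary; tokens missing from it map to `unk_id`.
std::vector<int> tokens_to_ids(const std::vector<std::string>& tokens,
                               const std::unordered_map<std::string, int>& vocab,
                               int unk_id) {
    std::vector<int> ids;
    ids.reserve(tokens.size());
    for (const std::string& token : tokens) {
        auto it = vocab.find(token);
        ids.push_back(it != vocab.end() ? it->second : unk_id);
    }
    return ids;
}
```

With a toy vocabulary such as `{"mel": 0, "on": 1, "!": 2}`, the tokens `"mel", "on", "!"` would map to `0, 1, 2`, and any token outside the vocabulary would map to the chosen `unk_id`.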