
Random Scripts

Details on each script can be found at the beginning of that script.

This script normalizes Arabic text and performs some Twitter-related normalization, e.g. of hashtags and numbers.

perl normalize_Arabic.pl input_file output_file
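
For readers unfamiliar with this kind of preprocessing, here is a rough Python sketch of what Arabic and Twitter normalization often involves. It is not a port of normalize_Arabic.pl: the specific steps (stripping diacritics and tatweel, unifying alef variants, unpacking hashtags, replacing numbers) and the `NUM` placeholder are assumptions chosen for illustration, and the actual script may do more, less, or something different.

```python
import re

# All patterns below are illustrative assumptions, not taken from normalize_Arabic.pl.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")   # tashkeel marks and tatweel
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda / hamza
HASHTAG = re.compile(r"#(\w+)")
NUMBER = re.compile(r"[0-9\u0660-\u0669]+")          # ASCII and Arabic-Indic digits


def normalize(text, num_token="NUM"):
    """Apply a few common normalization steps to one line of text."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)  # map variants to bare alef
    # Drop the '#' and turn underscores inside the hashtag into spaces.
    text = HASHTAG.sub(lambda m: m.group(1).replace("_", " "), text)
    text = NUMBER.sub(num_token, text)
    return " ".join(text.split())  # collapse repeated whitespace


if __name__ == "__main__":
    print(normalize("\u0623\u064E\u062D\u0645\u062F #hello_world 123"))
```

Hashtags are unpacked before numbers are replaced, so digits inside a hashtag are also mapped to the placeholder.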

Given a text file, this script calculates the frequency of each word in it and replaces the least frequent ones (occurring <= 3 times) with a tag.

python pruneVocab.py input_file > output

You can also specify your own threshold value. Note that the check is inclusive: words occurring exactly the threshold number of times are replaced as well.

python pruneVocab.py input_file 10 > output
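
The inclusive threshold is easy to get wrong, so here is a minimal sketch of the pruning logic. It is an illustration under assumptions, not the code in pruneVocab.py: the function name `prune_vocab`, whitespace tokenization, and the `<unk>` replacement tag are all hypothetical.

```python
from collections import Counter


def prune_vocab(lines, threshold=3, tag="<unk>"):
    """Replace every word whose corpus frequency is <= threshold with `tag`.

    `tag` is a hypothetical placeholder; the real script may use another one.
    """
    # Count whitespace-separated tokens over the whole input.
    counts = Counter(tok for line in lines for tok in line.split())
    # Inclusive check: a word seen exactly `threshold` times is also replaced.
    return [
        " ".join(tok if counts[tok] > threshold else tag for tok in line.split())
        for line in lines
    ]


if __name__ == "__main__":
    sample = ["a b a c", "a b a d", "a b"]  # a: 5, b: 3, c: 1, d: 1
    for line in prune_vocab(sample, threshold=3):
        print(line)
```

In the sample, `b` occurs exactly three times, so it is replaced with the default threshold of 3 but kept when the threshold is lowered to 2.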

python candidatePairs.py file_with_nearest_neighbor
