Repo for the DDL research lab project. The codebase handles all ingestion, training, and evaluation for ontology-assisted word2vec training. A Drakefile drives the entire pipeline, from appending ontologies (after ingestion/corpus cleaning) through evaluation. There are essentially two main components to run for an experiment: ingestion and the Drake workflow. Ingestion is a simple command line wrapper that grabs the data we want to use as our corpus and ontology and performs basic munging to standardize the corpus. The Drake workflow appends ontologies to the corpus, trains the word2vec model, and performs the evaluation. So once you have run ingestion, you can keep tweaking the number of times the ontologies are appended and the word2vec parameters and re-run the Drake workflow to conduct a new experiment (no need to ingest and clean the corpus multiple times).
# Usage
- Ingestion
- Model Building
- Full manual pipeline example using Wikipedia data
  - Step 1: Retrieve wikipedia page content
  - Step 2: Create a Word2Vec model
  - Step 3: Explore the model
  - Step 4: Evaluation
- Running a model building pipeline with Drake
  - Step 1: Configuration
  - Step 2: Execution
- Contributing Workflow
Optional
To build a fun_3000 model off of a base corpus and ontologies, you must first ingest both types of data for a given run. The ingestion module pulls ontologies and text from sources (defined in ingestion/ingestion_config.conf) based on keywords (defined by a provided text file or, by default, our keyword selection in data/eval_words). There is a controller script, get_corpus.py, which pulls ontologies and text from all sources based on a set of search terms submitted via csv.
python fun_3000/get_corpus.py -s path_to_search_file -d run_1
With the options:
- `-s` is the name of a txt file with a list of terms you want to search
- `-d` is the data directory for this test run, for example 'run1'
- `-r` is the number of search results to be returned per search term
By default, the script fetches the top result from each source ('wikipedia', 'arxiv', 'pub med' and 'medline') for each term in the search file. This can be changed by setting the -r option. Both the search file and the directory name are required.
Data can be pulled from each source individually by importing the ingestion module and running the individual commands. get_corpus is simply a wrapper that grabs everything. You can also just run the ingestion scripts individually from the command line.
The scripts that can be run from the command line are below:
- From wikipedia with `ingestion/wikipedia_ingest.py`
- From medical abstracts with `ingestion/med_abstract_ingest.py`
- From medical textbooks with `ingestion/med_textbook_ingest.py`
- From ontologies on the web with `ingestion/ingest_ontologies.py`
Grabs up to the desired number of results (defined by the results parameter, -r) for the specified search term (term) and puts them in the specified directory (data_dir). (TODO: update this once the wikipedia ingest is changed to exclude the reference/notes sections.)
import ingestion
wiki_search = ingestion.wikipedia_ingest
wiki_search.get_wikipedia_pages(term={some_term}, data_dir={data/some_run}, results= {some int})
Grabs up to the desired number of results (defined by results parameter) for the specified search term (term) and puts them in the specified directory (data_dir).
This will search three journal sites with STEM articles:
- Medline (via the biopython package)
import ingestion
med_search = ingestion.med_abstract_ingest
med_search.get_medical_abstracts(term={some_term}, data_dir={data/some_run}, results= {some int})
This imports two texts into the specified directory:
- Gray's Anatomy
import ingestion
book_grab = ingestion.med_textbook_ingest
book_grab.get_books(directory)  # directory: the output directory for this run, e.g. 'data/some_run'
Ontologies are formal specifications of linguistic relationships designed by domain experts and linguists, usually described on the web using the XML-based syntaxes RDF (w3c) or its superset OWL and OWL derivatives (w3c). Ontologies are intended to build on one another to enforce a common vocabulary. For example, a popular base ontology is FOAF (aka friend-of-a-friend (Wikipedia)), which provides a common vocabulary for describing relationships and attributes between and about people. Ontologies are also intended to be accessible over the Semantic Web/web of data and thus should live, and refer to each other, on the web using URIs accessible over http.
Ontologies can be separated into what we here refer to as 'base ontologies' and 'instance ontologies'. Base ontologies represent common vocabularies for describing relationships and attributes of entities within a specific domain (such as FOAF, or in our case for fun_3000, OGMS (Ontology for General Medical Science, ref)). Instance ontologies use the structure provided by the base ontology to link language instances hierarchically back to the base ontology so that certain logical conclusions can be made across instances using the base ontology as a backbone. For example, knowing that A is a type of B and B is a type of C, you can logically conclude that A is a type of C; specifying the relationships between A, B, and C in instance ontologies using the relations and attributes of a base ontology allows such logical conclusions to be made against the web of data.
For fun_3000, URLs to RDF/XML MIME-type ontology files are specified in the [Ontologies] section of the configuration file fun_3000/ingestion/ingestion_config.conf. In this conf file you must describe (a sketch follows this list):
- `source_ontology`: in our terminology, this is the base ontology our instance ontologies will derive meaning from
- `source_ontology_fallback`: a private host of the source ontology in case of internet problems
- any number of instance ontologies that utilize structure from the `source_ontology`, each specified with a variable name that uniquely identifies it and the URL to its location
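For illustration only, the [Ontologies] section might look something like the sketch below; the variable names and URLs are placeholders, not the project's actual configuration:

```ini
[Ontologies]
; base ontology that the instance ontologies derive meaning from (placeholder URL)
source_ontology = http://example.org/ontologies/base_ontology.owl
; privately hosted copy used if the primary URL is unreachable (placeholder URL)
source_ontology_fallback = http://example.org/mirror/base_ontology.owl
; any number of instance ontologies, each under a unique variable name (placeholders)
instance_ontology_1 = http://example.org/ontologies/instance_one.owl
instance_ontology_2 = http://example.org/ontologies/instance_two.owl
```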
Ontologies are ingested and parsed into a natural language form via the module fun_3000/ingestion/ingest_ontologies.py. This script pulls the source ontology and the instance ontologies into a single graph and parses out sentences based on the human labels of entities that are joined by the rdf:type relationship (w3c). This script makes the assumption that a valid English human language equivalent for the rdf:type relationship is the verb "is". For example, the OWL snippet:
<owl:NamedIndividual rdf:about="http://purl.obolibrary.org/obo/OBI_0000759"><!-- Illumina -->
    <rdf:type rdf:resource="http://purl.obolibrary.org/obo/OBI_0000245"/><!-- organization -->
    <rdf:type>
        <owl:Restriction>
            <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/RO_0000087"/><!-- has role -->
            <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/OBI_0000571"/><!-- manufacturer role -->
        </owl:Restriction>
    </rdf:type>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Illumina</rdfs:label>
    <obo:IAO_0000111 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Illumina</obo:IAO_0000111>
    <obo:IAO_0000117 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Philippe Rocca-Serra</obo:IAO_0000117>
    <obo:IAO_0000114 rdf:resource="http://purl.obolibrary.org/obo/IAO_0000123"/><!-- metadata incomplete -->
</owl:NamedIndividual>
would convert to the sentence "Illumina is organization".
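For illustration, here is a minimal sketch of this labels-joined-by-rdf:type idea using rdflib (this is not the project's ingest_ontologies.py, and the ontology URL is a placeholder):

```python
# Sketch only: turn rdf:type links between labeled entities into "X is Y" sentences.
from rdflib import Graph
from rdflib.namespace import RDF, RDFS

g = Graph()
# placeholder URL; the real script loads the source and instance ontologies
# listed in ingestion_config.conf into a single graph
g.parse('http://example.org/ontologies/instance_one.owl', format='xml')

sentences = []
for subj, obj in g.subject_objects(RDF.type):
    subj_label = g.value(subj, RDFS.label)
    obj_label = g.value(obj, RDFS.label)
    if subj_label and obj_label:  # skip blank nodes and unlabeled terms
        sentences.append('{} is {}'.format(subj_label, obj_label))
```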
To run the ontology ingestion in a script, use the ingest_and_wrangle_owls function from the ingest_ontologies module with the following syntax:
import ingestion
ontology_grab = ingestion.ingest_ontologies
ontology_grab.ingest_and_wrangle_owls(directory)  # directory: the output directory for this run
You can also run the ontology ingestion module directly as a script; see usage notes in the script itself.
To clean the corpus, use the wrangling/clean_corpus.py script. This script cleans a single file, or a directory's worth of files, into a single newline-delimited text file, stripping certain types of characters and phrases and splitting the data into sentences of a minimum length.
The common use is to clean a directory's worth of files. In that case, the cleaned and concatenated file (by default output.txt) will be located in the same directory the individual files were in.
There are several functions called during the generate folds process before and after the sentences are tokenized to remove HTML and Latex code, formatting, headers, and other potentially bothersome elements from the text.
The full list of what is removed by default (see the sketch after this list):
- Remove all HTML tags (`<...>`)
- Remove all LaTeX markup ({...} and ${...})
- Remove headers from wikipedia articles
- Remove new lines and carriage returns (these break the tokenize script)
- Remove all non-ascii characters (like copyright symbols)
- Remove extraneous spaces (these also break the tokenize script)
- Remove sentences fewer than 10 words long (or a length defined by parameter), as well as sentences that don't end with a period, don't start with a capital letter, or start with a number
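As an illustration only (this is not the project's clean_corpus.py, and the exact patterns are assumptions), the default rules above amount to something like:

```python
# Sketch of the default cleaning rules described above; patterns are illustrative.
import re

def clean_sentence(sentence, min_words=10):
    sentence = re.sub(r'<[^>]+>', '', sentence)                # strip HTML tags
    sentence = re.sub(r'\$\{[^}]*\}|\{[^}]*\}', '', sentence)  # strip LaTeX-style markup
    sentence = sentence.replace('\n', ' ').replace('\r', ' ')  # drop new lines / carriage returns
    sentence = sentence.encode('ascii', 'ignore').decode()     # drop non-ascii characters
    sentence = re.sub(r'\s+', ' ', sentence).strip()           # collapse extraneous spaces

    keep = (
        len(sentence.split()) >= min_words   # minimum sentence length
        and sentence.endswith('.')           # must end with a period
        and sentence[:1].isupper()           # must start with a capital letter, not a number
    )
    return sentence if keep else None
```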
//TODO: Cover the new file structure and boosting process here.//
You can use the wrangling/generate_folds.py script to generate a folder structure containing prepared training and test sets for k folds.
The folder structure follows this pattern under the data directory:
.
+-- {SOME_RUN}
|   +-- corpus_filename_1.txt
|   +-- corpus_filename_2.txt
|   +-- 1
|   |   +-- train
|   |   |   +-- train.txt
|   |   +-- test
|   |   |   +-- test.txt
|   +-- 2
|   |   +-- train
|   |   |   +-- train.txt
|   |   +-- test
|   |   |   +-- test.txt
The script also expects ontology-generated files to exist in a sister 'ontology' folder (alongside the data folder), like the example below:
+-- {SOME_RUN}
|   +-- ontology_filename_1.txt
|   +-- ontology_filename_2.txt
In the example above only 2 folds were generated.
To generate the proper files and folder structure do the following:
python fun_3000/wrangling/generate_folds.py -d '{SOME_RUN}' -k 3 -o True -s 10
where:
- `-k` is the number of folds you want to generate
- `-o` is a boolean flag indicating whether we are including an ontology in this run
- `-d` is the data directory for this test run, for example 'run1'
- `-s` is the random seed
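For illustration only (this is not the actual generate_folds.py; the file names and split logic are assumptions, and the ontology handling behind -o is omitted), generating the fold structure amounts to something like:

```python
# Sketch: split a cleaned corpus into k train/test folds under data/{SOME_RUN}/.
import os
import random

def generate_folds(data_dir, k=3, seed=10):
    # assume the cleaned, concatenated corpus lives at data_dir/output.txt
    with open(os.path.join(data_dir, 'output.txt')) as f:
        sentences = [line for line in f if line.strip()]
    random.Random(seed).shuffle(sentences)

    for fold in range(k):
        test = sentences[fold::k]  # hold out every k-th sentence for this fold
        train = [s for i, s in enumerate(sentences) if i % k != fold]
        for split, lines in (('train', train), ('test', test)):
            split_dir = os.path.join(data_dir, str(fold + 1), split)
            os.makedirs(split_dir, exist_ok=True)
            with open(os.path.join(split_dir, split + '.txt'), 'w') as out:
                out.writelines(lines)

generate_folds('data/run_1', k=3, seed=10)
```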
python fun_3000/word2vec.py -i data_dir
The script will use all data files within data/data_dir/ and build a Word2Vec model from them. In the example above, data_dir might be '<data_dir>/1/train'.
The model will be saved under models/data_dir/ for future use.
You can specify additional options, such as the number of parallel execution threads, the size of the hidden layer and the output model name. For script usage information, run:
python fun_3000/word2vec.py -h
Note: if no model name is specified, output name will be <data_dir>_1_train.model (using the example above).
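As a rough sketch of what word2vec.py does under the hood with gensim (the parameter values and paths below are illustrative, not the script's defaults):

```python
# Sketch only: build a Word2Vec model from the newline-delimited sentence files of one fold.
import os
import gensim

data_dir = 'data/run_1/1/train'  # hypothetical fold directory
sentences = []
for fname in os.listdir(data_dir):
    with open(os.path.join(data_dir, fname)) as f:
        # one sentence per line, whitespace-tokenized
        sentences.extend(line.split() for line in f if line.strip())

# 'size' is the hidden layer size in gensim < 4.0 (renamed 'vector_size' later);
# 'workers' is the number of parallel training threads.
model = gensim.models.Word2Vec(sentences, size=100, workers=4)

os.makedirs('models/run_1', exist_ok=True)
model.save('models/run_1/run_1_1_train.model')
```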
Evaluation returns a single score for all folds of an individual run; it is the average of the scores across the folds. Our scores are stored in scores.csv in the base directory. Every time you run the workflow, this csv is appended to. The csv includes a column with the run name, as well as the date and time.
python fun_3000/evaluation/similarity_evaluation.py -r 'run_1' -f 3 -o scores.csv
where:
- `-r` is the name of the run, e.g. 'run1'
- `-f` is the number of folds that run's corpus was split into; defaults to 3
- `-o` is the output file; defaults to scores.csv
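As an illustration only (this is not the project's similarity_evaluation.py; the column order and scores are hypothetical), appending a run's averaged score to scores.csv amounts to something like:

```python
# Sketch: average the per-fold scores and append one row per run to scores.csv.
import csv
import datetime

def append_run_score(run_name, fold_scores, output_file='scores.csv'):
    average = sum(fold_scores) / len(fold_scores)
    with open(output_file, 'a', newline='') as f:
        csv.writer(f).writerow([run_name, datetime.datetime.now().isoformat(), average])

append_run_score('run_1', [0.41, 0.39, 0.44])  # hypothetical fold scores
```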
Let's say you wanted to train a Word2Vec model with the "Jazz" wikipedia page as your corpus:
python fun_3000/ingestion/wikipedia_ingest.py -s '{SOME_RUN}'
Confirm that the text content was downloaded and stored under data/{SOME_RUN}/model_data.txt
(Alternatively, you can manually create a directory under data/ and place all corpus files within it.)
python fun_3000/word2vec.py -i {SOME_RUN}
Confirm that the model was created and saved under models/{SOME_RUN}/{SOME_RUN}.model
Within a python REPL:
>>> import gensim
>>> model = gensim.models.Word2Vec.load('models/{SOME_RUN}/{SOME_RUN}.model')
>>> model.most_similar('jazz')
[('sound', 0.9113765358924866), ('well', 0.9058974981307983), ('had', 0.9046300649642944), ('bass', 0.9037381410598755), ('In', 0.9003950953483582), ('blues', 0.9001777768135071), ('on', 0.8995728492736816), ('at', 0.8993135690689087), ('rather', 0.8992522954940796), ('such', 0.8990519046783447)]
TODO: Write this...
You can run the model building pipeline with Drake instead of calling each module by hand. The model building pipeline assumes you have already ingested your corpus and ontologies per the structures defined above.
Requirement: Make sure Drake is installed. See here for installation instructions.
Open the file named Drakefile and change any of the configuration settings at the top of the file. They correspond to the same options that the word2vec.py script supports.
Drake can be smart about what to (re)run based on the presence and/or timestamps of the files generated as artifacts by each of the Drakefile steps, which are stored in the workflow/ directory. See the Drake README for more information on how to specify reruns from the Drake CLI.
All you need to do is run the following from the main directory:
drake
or
drake +...
to force a rerun (+) of all steps (...)
or
drake =workflow/03.evaluation.complete
to run without dependency checking (=) a specified target (i.e. workflow/03.evaluation.complete).
Review the steps and enter 'y' to accept them.
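As a rough illustration of how those workflow/ artifact files tie the steps together (this is not the project's actual Drakefile; the step names and command are placeholders):

```
; placeholder Drake step: training depends on the fold-generation artifact and
; touches its own artifact file so Drake can track what is up to date
workflow/02.word2vec.complete <- workflow/01.generate_folds.complete
  python fun_3000/word2vec.py -i run_1/1/train
  touch $OUTPUT
```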
We maintain both a master and a develop branch. All features are built on a branch off of develop, and pull requests (PRs) are made into develop. Only major releases are merged into the master branch.