Before anything, ensure that dspy is properly set-up.
DSPY_CACHE_DIR="/tmp/dspy"
mkdir -p $DSPY_CACHE_DIR
export DSPY_CACHEDIR=$DSPY_CACHE_DIRPrepare a project directory
from biotopg import Biotopg
Biotopg.initialize("outputs", "PubMed")In the config file
excluded_ontology_ids: null
include_supplementary_entities: Trueexcluded_ontology_idsgives the path a file containing a json list of all ontology ids, to not consider as entities.include_supplementary_entitiesdiscard all supplementary entities extracted by the LLM.
Warnings:
- By default, no
extractor_demonstration_fileis specified. Specify one if you want to add demonstrations for the extraction process (recommended). There is an example indemonstrations/demo_hyperproposition_with_ner.json.json - If you are using OpenAI models, specify your API_KEY in the env variable
OPENAI_API_KEY(recommended), or, add it to the yaml file.
Initialize the system by passing the path to the config dict
import yaml
config_path = "outputs/PubMed/config.yaml"
with open(config_path, "r") as file:
system_config = yaml.safe_load(file)
system = Biotopg(config=system_config)Insert some pubmed article using their pmids
batch_pmids = ["11299965", "18218922", "16245277", "14633596", "9560222", "10797941", "23194061", "12049630", "7758945"]
system.insert_pmids(pmids=batch_pmids)You can export the passages and propositions using :
import os
outdir = "outputs"
system.store.export_all_hyperpropositions(
os.path.join(outdir, "hyperpropositions.json")
)
system.store.export_all_passages(os.path.join(outdir, "passages.json"))There are 2 modes: local for fact-oriented queries and global for more abstract queries.
# if new documents have been added since, it is recommended to reload the graph. When reloading the graph.
system.retriever.load_graphs(force_reload=True)
retriever_args = {
"initial_retriever_args": {
"e_syn_k": 5,
"e_syn_threshold": 0.80,
"lambda_mmr": 1.0,
"p_k": 20
},
"q": 0.5,
"damping": 0.5,
"cosine_threshold": 0.4,
"horizon_threshold": 1e-4,
"temperature": 0.1,
"top_k": 20,
"use_passage_links": True
}
question = "What is Tastin ?"
qa_mode = "local"
max_iter = 1
predicted_answer, documents, _memory = system.query(
question,
mode=qa_mode,
max_iter=max_iter,
retriever_args=retriever_args,
)
print("Answer:", predicted_answer)
print("Documents:")
for pmid, data in documents.items():
print(f"PMID: {pmid}")
print(f"Passage ID: {data['passage_id']}")
print(f"Text: {data['text']}")
print("Facts:")
for fact in data["facts"]:
print(f"- {fact}")
print()question = "What is Tastin ?"
qa_mode = "global"
max_iter = 3
m = 3
predicted_answer, documents, _memory = system.query(
question,
mode=qa_mode,
max_iter=max_iter,
retriever_args=retriever_args,
m=3
)
print("Answer:", predicted_answer)
print("Documents:")
for pmid, data in documents.items():
print(f"PMID: {pmid}")
print(f"Passage ID: {data['passage_id']}")
print(f"Text: {data['text']}")
print("Facts:")
for fact in data["facts"]:
print(f"- {fact}")
print()When loading the graph at inference, you can specie a list of ontology nodes to discard
system.retriever.load_graphs(force_reload=True, masks_ontology_ids=["Species|9606"])You can also check the cost
from biotopg.utils.llm import get_cost
cost = get_cost(system.lm)