Strange bug, though it might be more of a user error.
I've created "broken.fa.gz" by downloading 10 public genomes into a single FASTA file and compressing it with bgzip.
I've created "works.fa.gz" by uncompressing that file and recompressing it with gzip.
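For reference, that second step can also be done in Python; a minimal sketch, assuming the file names above (Python's gzip module can read BGZF because a bgzip file is just a series of concatenated gzip blocks):

import gzip

# Decompress the bgzip (BGZF) file and recompress it as a single gzip stream.
with gzip.open("broken.fa.gz", "rb") as src:
    data = src.read()
with gzip.open("works.fa.gz", "wb") as dst:
    dst.write(data)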
If I build a COBS index from the broken file and then query the index with each of the 10 sequences, it finds all the k-mers of the first 6, ~95% for the 7th, and only ~60% for the remaining 3.
With the second file everything works as expected: the index finds all the k-mers for each of the 10 reference sequences.
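The percentages are the fraction of each record's 31-mers reported as present: a record of length L contains L - k + 1 = L - 30 k-mers for k = 31, and the script below prints score / (L - 30) per record. A quick illustration with made-up numbers:

k = 31
L = 4_000_000             # hypothetical record length
num_kmers = L - k + 1     # 3_999_970 31-mers in the record
score = 3_799_971         # hypothetical COBS score for this record
print(score / num_kmers)  # ~0.95, i.e. "95% of the k-mers found"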
Minimal script to reproduce:
#!/usr/bin/env python
import cobs_index as cobs
from Bio import SeqIO
import gzip
p = cobs.CompactIndexParameters()
p.term_size = 31 # k-mer size
p.clobber = True # overwrite output and temporary files
p.false_positive_rate = 0.4 # higher false positive rate -> smaller index
broken = "broken.fa.gz"
cobs.compact_construct(broken, "broken_index.cobs_compact", index_params=p)
s = cobs.Search("broken_index.cobs_compact")
with gzip.open(broken, "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        l = len(record) - 30
        print(record.id, l)
        r = s.search(str(record.seq), num_results=2)
        for result in r:
            print(result.score, result.doc_name, float(result.score)/l)
works = "works.fa.gz"
cobs.compact_construct(works, "works_index.cobs_compact", index_params=p)
s = cobs.Search("works_index.cobs_compact")
with gzip.open(works, "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        l = len(record) - 30
        print(record.id, l)
        r = s.search(str(record.seq), num_results=2)
        for result in r:
            print(result.score, result.doc_name, float(result.score)/l)
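To help settle the "user error" question, a hedged sanity check (not part of the report above) can confirm that the two files decode to identical FASTA records, so any difference in the results comes from index construction rather than from the file contents:

import gzip
from Bio import SeqIO

def records(path):
    # Python's gzip module reads BGZF files too, since BGZF is a series of
    # concatenated gzip blocks.
    with gzip.open(path, "rt") as handle:
        return {r.id: str(r.seq) for r in SeqIO.parse(handle, "fasta")}

broken_records = records("broken.fa.gz")
works_records = records("works.fa.gz")

assert broken_records.keys() == works_records.keys()
for rid in broken_records:
    assert broken_records[rid] == works_records[rid], f"sequence mismatch for {rid}"
print("both files decode to identical records")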