Skip to content

False negatives - some sort of interaction with gzip / bgzip #26

@rmcolq

Description

@rmcolq

Strange bug, and might be more of a user error

I've created "broken.fa.gz" by downloading 10 public genomes and compressing the file with bgzip.
I've created "works.fa.gz" by uncompressing this file and recompressing with gzip.

If I build a COBS index with the broken file then query the index with each of the 10 sequences, it finds the first 6, 95% of the 7th and ~60% of the remaining 3.

All works as expected with the second file and it finds all the kmers for each of the reference sequences.

broken.fa.gz
works.fa.gz

minimal script:

#!/usr/bin/env python

import cobs_index as cobs
from Bio import SeqIO
import gzip

p = cobs.CompactIndexParameters()
p.term_size = 31               # k-mer size
p.clobber = True               # overwrite output and temporary files
p.false_positive_rate = 0.4    # higher false positive rate -> smaller index

broken = "broken.fa.gz"
cobs.compact_construct(broken, "broken_index.cobs_compact", index_params=p)
s = cobs.Search("broken_index.cobs_compact")

with gzip.open(broken, "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        l = len(record) - 30
        print(record.id, l)
        r = s.search(str(record.seq), num_results=2)
        for result in r:
            print(result.score, result.doc_name, float(result.score)/l)

works = "works.fa.gz"
cobs.compact_construct(works, "works_index.cobs_compact", index_params=p)
s = cobs.Search("works_index.cobs_compact")

with gzip.open(works, "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        l = len(record) - 30
        print(record.id, l)
        r = s.search(str(record.seq), num_results=2)
        for result in r:
            print(result.score, result.doc_name, float(result.score)/l)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions