-*-outline-*-

* Goal
Build a single database of transcripts and related tools that:
- contains all transcripts, all sources (RefSeq, UCSC, E!, LRG), all versions
- contains non-canonical and custom transcripts structures (e.g., BIC)
- contains computed alignments of transcripts against multiple references
- makes refagree data easily accssible (e.g., sequence descrepancies with GRCh37)
- identifies equivalence with Ensembl transcripts (ENSTs, and perhaps LRGs later)
- enables comparison of transcript alignments from UCSC and NCBI (not the same!)
- enables liftover of variants and other features between sequences
- provides an API and REST interface


* Requirements
- use NCBI RefSeq alignements specifically (ie., not UCSC, which use BLAT)
- include historical transcripts
- support transcripts that have significant exon structure anomalies
  between the transcript and gene records


* Design
The requirement to include historical transcripts (i.e., no longer
available at NCBI) means that we need persistent storage.  Relying on NCBI
E-Utilities exclusively won't work.

The requirement to allow significant anomalies means that we can't use the
current transcript schema.

CIGARs -- store in *transcript* order to facilitate tx-based


* Components
sqlalchemy
flask
flask-restless




* Output
** refagree

** tx-table
One row per RefSeq transcript, current or obsolete.
Columns:
gene -- HGNC gene name
maploc -- [strand][NCBI maploc], eg -3q21.3
ac -- RefSeq NM
coords -- overall transcript coords
status -- current or obsolete, computed as "most recent version with same AC base"
ref agree
  CDS indels -- exon:posI/D
  UTR indels -- exon:posI/D
  substitutions -- exon:posS
  comments
ensembl
  ENST equiv
  comments



* Status
left off at:
PYTHONPATH=lib/python ./bin/uta -C etc/uta.conf load-transcripts-seqgene tests/data/seq_gene10k.md.gz
