andreyhgl/rosalind

This project is an effort to learn Python by solving bioinformatics challenges.

Environment setup

[!IMPORTANT] Assumes the package manager Conda is already installed on the system.

Python runs inside a Conda environment that contains all the necessary dependencies. The environment can be set up in several ways; here it is built from a single file, environment.yml:

name: bioinformatics
channels:
  - conda-forge
dependencies:
  - python
  - marimo
  - pandas
  - biopython

Create the environment and "jump" into it

conda env create -f environment.yml -n bioinformatics
conda activate bioinformatics

# In case new dependencies are needed:

# 1. add them to environment.yml
# 2. remove the environment
#conda env remove -n bioinformatics

# 3. install from file again
#conda env create -f environment.yml -n bioinformatics
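
To confirm that the activated environment actually provides the listed dependencies, a small check like the following could be run inside it. This is a minimal sketch and not part of the repository; the helper name `missing_packages` is hypothetical. Note that biopython is imported under the module name `Bio`.

```python
import importlib.util

def missing_packages(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

# import names of the dependencies listed in environment.yml
# (biopython is imported as "Bio")
print(missing_packages(["marimo", "pandas", "Bio"]))
```

An empty list means every module was found; any name that is printed still needs to be installed.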

Start a python notebook (marimo)

marimo edit

Count the number of each nucleotide in a given string, reading the string from a file.

Code

To read a file, use the function open().
Use the with statement so the file is closed automatically after reading.
Use the read() method of the file object to read the content.
Wrap it all in a neat function.

def read_file(file):
  with open(file, "r") as f:

    # .strip() removes leading and trailing whitespace (e.g. the final newline)
    content = f.read().strip()
  return content

Instead of hard-coding the nucleotides, extract the unique characters with the function set().
Use the string method count() to count each nucleotide.
Join the counts into a single string, separated by spaces.

def count_character(content):
  # extract the unique characters from the string, keep in alphabetic order
  chars = "".join(sorted(set(content)))

  # assign the counted chars to output
  output = ""

  # loop over each char and count, save as string w/ whitespace
  for char in chars:
    output += str(content.count(char)) + " "
  
  print(output.strip())
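
The same counting can also be sketched with `collections.Counter`, which tallies every character in one pass and lets the function return the result instead of printing it. This is an alternative technique, not the code in bin/ini.py; the name `count_characters` is hypothetical.

```python
from collections import Counter

def count_characters(content):
    """Return the count of each unique character, in alphabetical order."""
    counts = Counter(content)
    return " ".join(str(counts[char]) for char in sorted(counts))

print(count_characters("AGCTTA"))  # A=2, C=1, G=1, T=2 -> prints "2 1 1 2"
```

Returning the string rather than printing it makes the function easier to reuse and to test.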

Finally, let the script take in an argument for the sequence file, instead of hard-coding the path.

import sys

# get first argument
file = sys.argv[1]
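
Reading `sys.argv[1]` directly raises an IndexError when the script is started without an argument. A minimal sketch of a friendlier check, assuming a hypothetical helper named `get_input_path`:

```python
import sys

def get_input_path(argv):
    """Return the first command-line argument, or exit with a usage message."""
    # argv[0] is the script name, argv[1] the first user-supplied argument
    if len(argv) < 2:
        raise SystemExit(f"usage: python {argv[0]} <sequence-file>")
    return argv[1]

# in the script itself this would be called as:
# file = get_input_path(sys.argv)
```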

Put it all together, see bin/ini.py

python bin/ini.py data/rosalind_ini.txt

Given an organism name and two (publication) dates, return the number of nucleotide entries in the GenBank database for that organism published between those dates.

NCBI GenBank is an annotated collection of publicly available DNA sequences, with their transcripts and proteins. To extract entries from this database, the NCBI search engine Entrez can be used. Biopython is a Python library of computational biology tools, including a search interface for Entrez.

Code

Keep two digits for day and month, padding with a zero if needed: 2007/2/9 => 2007/02/09
Build the query with the correct quotes

from datetime import datetime
from Bio import Entrez

def entrez_search(organism, start_date, end_date):
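  # pad both dates with zeros if needed: 2007/2/9 => 2007/02/09
  # strptime parses the string into a datetime object,
  # strftime formats it back into a zero-padded string
  start_date = datetime.strptime(start_date, "%Y/%m/%d").strftime("%Y/%m/%d")
  end_date = datetime.strptime(end_date, "%Y/%m/%d").strftime("%Y/%m/%d")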

  Entrez.email = "[email protected]"

  # build the query
  term = (
    f'"{organism}"[Organism] AND '
    f'"{start_date}"[Publication Date] : "{end_date}"[Publication Date]'
  )

  handle = Entrez.esearch(db="nucleotide", term=term)
  record = Entrez.read(handle)
  print(record["Count"])
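
The date padding and query construction can be checked without contacting NCBI by splitting them into small pure functions. The sketch below is illustrative only; `pad_date` and `build_term` are hypothetical names, and the organism and dates are example values.

```python
from datetime import datetime

def pad_date(date):
    """Zero-pad month and day: 2007/2/9 -> 2007/02/09."""
    return datetime.strptime(date, "%Y/%m/%d").strftime("%Y/%m/%d")

def build_term(organism, start_date, end_date):
    """Build the Entrez search term for an organism and a publication date range."""
    return (
        f'"{organism}"[Organism] AND '
        f'"{pad_date(start_date)}"[Publication Date] : '
        f'"{pad_date(end_date)}"[Publication Date]'
    )

print(build_term("Anthoxanthum", "2003/7/25", "2005/12/27"))
```

Printing the term before passing it to Entrez.esearch makes quoting mistakes easy to spot.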

Read the organism and the two dates from the file and unpack them into variables

import sys

def read_file(file):
  with open(file, "r") as f:
    content = f.read().strip()
  return content

# get arguments
file = sys.argv[1]

# parse file content
content = read_file(file).splitlines()
organism, start_date, end_date = content
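
Unpacking fails with a rather cryptic ValueError when the file does not hold exactly three lines. A minimal sketch of a more defensive parser, assuming the input has one value per line; the name `parse_input` is hypothetical:

```python
def parse_input(text):
    """Split the file content into organism, start date and end date."""
    # splitlines() handles both Unix and Windows line endings; blank lines are skipped
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 3:
        raise ValueError(
            f"expected 3 lines (organism, start date, end date), got {len(lines)}"
        )
    organism, start_date, end_date = lines
    return organism, start_date, end_date
```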

Put it all together, see bin/gbk.py

python bin/gbk.py data/rosalind_gbk.txt
