Additional "Real World Problem" Set
1. Create a fasta parser. Your fasta parser should be able to handle more than
one sequence and each sequence may have newlines. What if a sequence has a
description.
This can be used for getting the sequence in a usuable form for any downstream
processing. For example reverse complementing, or finding restriction sites in
many sequences.
>seq1
AAGAGCAGCTCGCGCTAATGTGATAGATGGCGGTAAAGTAAATGTCCTATGGGCC
ACCAATTATGGTGTATGAGTGAATCTCTGGTCCGAGATTCACTGAGTAACTGCTG
TACACAGTAGTAACACGTGGAGATCCCATAAGCTTCACGTGTGGTCCAATAAAAC
ACTCCGTTGGTCAAC
>seq2 some fancy description
GCCACAGAGCCTAGGACCCCAACCTAACCTAACCTAACCTAACCTACAGTTTGATC
TTAACCATGAGGCTGAGAAGCGATGTCCTGACCGGCCTGTCCTAACCGCCCTGACC
TAACCGGCTTGACCTAACCGCCCTGACCTAACCAGGCTAACCTAACCAAACCGTGA
AAAAAGGAATCT
>seq3
ATGAAAGTTACATAAAGACTATTCGATGCATAAATAGTTCAGTTTTGAAAACTTAC
ATTTTGTTAAAGTCAGGTACTTGTGTATAATATCAACTAAAT
>seq4 no newlines
ATGCTAACCAAAGTTTCAGTTCGGACGTGTCGATGAGCGACGCTCAAAAAGGAAACAACATGCCAAATAGAAACGATCAATTCGGCGATGGAAATCAGAACAACGATCAGTTTGGAAATCAAAATAGAAATAACGGGAACGATCAGTTTAATAACATGATGCAGAATAAAGGGAATAATCAATTTAATCCAGGTAATCAGAACAGAGGT
Hints:
-Separate the id from sequence.
-What do you do with a description?
-If the sequence has new lines, remove them.
-You need to be able to always know which id belongs to which sequence.
ideas : loops, hashes, others.
2. Longest ORF:
Start with a fasta file with multiple sequences.
Translate each sequence into the 3 reading frames
For each seqeunce find and return
the longest coding nucleotide fragments in one file: seq.nt
the longest translated fragment in another file: seq.aa
You can create your own translation hash, or use this:
my(%genetic_code) = (
'TCA' => 'S', # Serine
'TCC' => 'S', # Serine
'TCG' => 'S', # Serine
'TCT' => 'S', # Serine
'TTC' => 'F', # Phenylalanine
'TTT' => 'F', # Phenylalanine
'TTA' => 'L', # Leucine
'TTG' => 'L', # Leucine
'TAC' => 'Y', # Tyrosine
'TAT' => 'Y', # Tyrosine
'TAA' => '_', # Stop
'TAG' => '_', # Stop
'TGC' => 'C', # Cysteine
'TGT' => 'C', # Cysteine
'TGA' => '_', # Stop
'TGG' => 'W', # Tryptophan
'CTA' => 'L', # Leucine
'CTC' => 'L', # Leucine
'CTG' => 'L', # Leucine
'CTT' => 'L', # Leucine
'CCA' => 'P', # Proline
'CAT' => 'H', # Histidine
'CAA' => 'Q', # Glutamine
'CAG' => 'Q', # Glutamine
'CGA' => 'R', # Arginine
'CGC' => 'R', # Arginine
'CGG' => 'R', # Arginine
'CGT' => 'R', # Arginine
'ATA' => 'I', # Isoleucine
'ATC' => 'I', # Isoleucine
'ATT' => 'I', # Isoleucine
'ATG' => 'M', # Methionine
'ACA' => 'T', # Threonine
'ACC' => 'T', # Threonine
'ACG' => 'T', # Threonine
'ACT' => 'T', # Threonine
'AAC' => 'N', # Asparagine
'AAT' => 'N', # Asparagine
'AAA' => 'K', # Lysine
'AAG' => 'K', # Lysine
'AGC' => 'S', # Serine
'AGT' => 'S', # Serine
'AGA' => 'R', # Arginine
'AGG' => 'R', # Arginine
'CCC' => 'P', # Proline
'CCG' => 'P', # Proline
'CCT' => 'P', # Proline
'CAC' => 'H', # Histidine
'GTA' => 'V', # Valine
'GTC' => 'V', # Valine
'GTG' => 'V', # Valine
'GTT' => 'V', # Valine
'GCA' => 'A', # Alanine
'GCC' => 'A', # Alanine
'GCG' => 'A', # Alanine
'GCT' => 'A', # Alanine
'GAC' => 'D', # Aspartic Acid
'GAT' => 'D', # Aspartic Acid
'GAA' => 'E', # Glutamic Acid
'GAG' => 'E', # Glutamic Acid
'GGA' => 'G', # Glycine
'GGC' => 'G', # Glycine
'GGG' => 'G', # Glycine
'GGT' => 'G' # Glycine
);
3. Fasta Stats
Start with a fasta file with multiple sequences, for example your favorite sequences or sequences you download from a public database.
Write a script that will:
Count the total number of sequences
Calculate min, max and average sequence length.
Check for any illegal DNA characters (anything that is not A,T,G,C, or N) (for extra "bonus points" allow for all IUPAC characters)
Create an option that will print out per sequence stats.
For extra extra bonus points
Create a flag to handle all of the above for protein sequences
4. Determine the species of sequences in a uniprot file
Download uniprot/swissprot:
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
Use a regular expression to get the uniprot ID which includes GENEID_SPECIES (ex: HDAC1_HUMAN).
Produce a report that lists:
all of the species that are represented in the file
how many sequences for each the respective percentages of each
5. Reciprocal best blast matches
A reciprocal best blast match is when a query sequence A has as its best blast hit a sequence B,
and that sequence B has sequence A as its best blast hit.
Find the reciprocal best blast hits in the provided tab-delimited blast report.
Using Getopt::Long, make it possible for the user to filter the blast results to only use:
those that pass a certain E-value threshold
those that pass a certain alignment length.