Real World Problem Set

Additional "Real World Problem" Set
 
1. Create a fasta parser. Your fasta parser should be able to handle more than
one sequence and each sequence may have newlines. What if a sequence has a
description.
 
This can be used for getting the sequence in a usuable form for any downstream
processing. For example reverse complementing, or finding restriction sites in
many sequences.
 
>seq1
AAGAGCAGCTCGCGCTAATGTGATAGATGGCGGTAAAGTAAATGTCCTATGGGCC
ACCAATTATGGTGTATGAGTGAATCTCTGGTCCGAGATTCACTGAGTAACTGCTG
TACACAGTAGTAACACGTGGAGATCCCATAAGCTTCACGTGTGGTCCAATAAAAC
ACTCCGTTGGTCAAC
>seq2 some fancy description
GCCACAGAGCCTAGGACCCCAACCTAACCTAACCTAACCTAACCTACAGTTTGATC
TTAACCATGAGGCTGAGAAGCGATGTCCTGACCGGCCTGTCCTAACCGCCCTGACC
TAACCGGCTTGACCTAACCGCCCTGACCTAACCAGGCTAACCTAACCAAACCGTGA
AAAAAGGAATCT
>seq3
ATGAAAGTTACATAAAGACTATTCGATGCATAAATAGTTCAGTTTTGAAAACTTAC
ATTTTGTTAAAGTCAGGTACTTGTGTATAATATCAACTAAAT
>seq4 no newlines
ATGCTAACCAAAGTTTCAGTTCGGACGTGTCGATGAGCGACGCTCAAAAAGGAAACAACATGCCAAATAGAAACGATCAATTCGGCGATGGAAATCAGAACAACGATCAGTTTGGAAATCAAAATAGAAATAACGGGAACGATCAGTTTAATAACATGATGCAGAATAAAGGGAATAATCAATTTAATCCAGGTAATCAGAACAGAGGT
 
Hints:
-Separate the id from sequence.
-What do you do with a description?
-If the sequence has new lines, remove them.
-You need to be able to always know which id belongs to which sequence.
ideas : loops, hashes, others.
 
2. Fasta Stats
Start with a fasta file with multiple sequences, for example your favorite sequences or sequences you download from a public database.
Write a script that will:
-Count the total number of sequences
-Calculate min, max and average sequence length.
-Check for any illegal DNA characters (anything that is not A,T,G,C, or N) (for extra "bonus points" allow for all IUPAC characters)
-Create an option that will print out per sequence stats.
-Calculate N50.
For extra extra bonus points
Create a flag to handle all of the above for protein sequences
 
3. Longest ORF:
Start with a fasta file with multiple sequences.
Translate each sequence into the 3 reading frames
For each seqeunce find and return
 a) the longest coding nucleotide fragments in one file and print to a file: seq.nt
 b) the longest translated fragment in another file: seq.aa
 
You can create your own translation hash, or use this:
 
my(%genetic_code) = (
'TCA' => 'S', # Serine
'TCC' => 'S', # Serine
'TCG' => 'S', # Serine
'TCT' => 'S', # Serine
'TTC' => 'F', # Phenylalanine
'TTT' => 'F', # Phenylalanine
'TTA' => 'L', # Leucine
'TTG' => 'L', # Leucine
'TAC' => 'Y', # Tyrosine
'TAT' => 'Y', # Tyrosine
'TAA' => '_', # Stop
'TAG' => '_', # Stop
'TGC' => 'C', # Cysteine
'TGT' => 'C', # Cysteine
'TGA' => '_', # Stop
'TGG' => 'W', # Tryptophan
'CTA' => 'L', # Leucine
'CTC' => 'L', # Leucine
'CTG' => 'L', # Leucine
'CTT' => 'L', # Leucine
'CCA' => 'P', # Proline
'CAT' => 'H', # Histidine
'CAA' => 'Q', # Glutamine
'CAG' => 'Q', # Glutamine
'CGA' => 'R', # Arginine
'CGC' => 'R', # Arginine
'CGG' => 'R', # Arginine
'CGT' => 'R', # Arginine
'ATA' => 'I', # Isoleucine
'ATC' => 'I', # Isoleucine
'ATT' => 'I', # Isoleucine
'ATG' => 'M', # Methionine
'ACA' => 'T', # Threonine
'ACC' => 'T', # Threonine
'ACG' => 'T', # Threonine
'ACT' => 'T', # Threonine
'AAC' => 'N', # Asparagine
'AAT' => 'N', # Asparagine
'AAA' => 'K', # Lysine
'AAG' => 'K', # Lysine
'AGC' => 'S', # Serine
'AGT' => 'S', # Serine
'AGA' => 'R', # Arginine
'AGG' => 'R', # Arginine
'CCC' => 'P', # Proline
'CCG' => 'P', # Proline
'CCT' => 'P', # Proline
'CAC' => 'H', # Histidine
'GTA' => 'V', # Valine
'GTC' => 'V', # Valine
'GTG' => 'V', # Valine
'GTT' => 'V', # Valine
'GCA' => 'A', # Alanine
'GCC' => 'A', # Alanine
'GCG' => 'A', # Alanine
'GCT' => 'A', # Alanine
'GAC' => 'D', # Aspartic Acid
'GAT' => 'D', # Aspartic Acid
'GAA' => 'E', # Glutamic Acid
'GAG' => 'E', # Glutamic Acid
'GGA' => 'G', # Glycine
'GGC' => 'G', # Glycine
'GGG' => 'G', # Glycine
'GGT' => 'G' # Glycine
);
 
 
 
4. Determine the species of sequences in a uniprot file
Download uniprot/swissprot:
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
 
Use a regular expression to get the uniprot ID which includes GENEID_SPECIES (ex: HDAC1_HUMAN).
Produce a report that lists:
all of the species that are represented in the file
how many sequences for each the respective percentages of each
 
5. Reciprocal best blast matches
 
A reciprocal best blast match is when a query sequence A has as its best blast hit a sequence B,
and that sequence B has sequence A as its best blast hit.
 
Find the reciprocal best blast hits in the provided tab-delimited blast report.
 
Using Getopt::Long, make it possible for the user to filter the blast results to only use:
those that pass a certain E-value threshold
those that pass a certain alignment length.