Additional "Real World Problem" Set

1. Longest ORF:
Start with a fasta file with multiple sequences.
Translate each sequence into the 3 reading frames
For each seqeunce find and return
	the longest coding nucleotide fragments in one file: seq.nt
	the longest translated fragment in another file: seq.aa

You can create your own translation hash, or use this:

my(%genetic_code) = (
'TCA' => 'S', # Serine
'TCC' => 'S', # Serine
'TCG' => 'S', # Serine
'TCT' => 'S', # Serine
'TTC' => 'F', # Phenylalanine
'TTT' => 'F', # Phenylalanine
'TTA' => 'L', # Leucine
'TTG' => 'L', # Leucine
'TAC' => 'Y', # Tyrosine
'TAT' => 'Y', # Tyrosine
'TAA' => '_', # Stop
'TAG' => '_', # Stop
'TGC' => 'C', # Cysteine
'TGT' => 'C', # Cysteine
'TGA' => '_', # Stop
'TGG' => 'W', # Tryptophan
'CTA' => 'L', # Leucine
'CTC' => 'L', # Leucine
'CTG' => 'L', # Leucine
'CTT' => 'L', # Leucine
'CCA' => 'P', # Proline
'CAT' => 'H', # Histidine
'CAA' => 'Q', # Glutamine
'CAG' => 'Q', # Glutamine
'CGA' => 'R', # Arginine
'CGC' => 'R', # Arginine
'CGG' => 'R', # Arginine
'CGT' => 'R', # Arginine
'ATA' => 'I', # Isoleucine
'ATC' => 'I', # Isoleucine
'ATT' => 'I', # Isoleucine
'ATG' => 'M', # Methionine
'ACA' => 'T', # Threonine
'ACC' => 'T', # Threonine
'ACG' => 'T', # Threonine
'ACT' => 'T', # Threonine
'AAC' => 'N', # Asparagine
'AAT' => 'N', # Asparagine
'AAA' => 'K', # Lysine
'AAG' => 'K', # Lysine
'AGC' => 'S', # Serine
'AGT' => 'S', # Serine
'AGA' => 'R', # Arginine
'AGG' => 'R', # Arginine
'CCC' => 'P', # Proline
'CCG' => 'P', # Proline
'CCT' => 'P', # Proline
'CAC' => 'H', # Histidine
'GTA' => 'V', # Valine
'GTC' => 'V', # Valine
'GTG' => 'V', # Valine
'GTT' => 'V', # Valine
'GCA' => 'A', # Alanine
'GCC' => 'A', # Alanine
'GCG' => 'A', # Alanine
'GCT' => 'A', # Alanine
'GAC' => 'D', # Aspartic Acid
'GAT' => 'D', # Aspartic Acid
'GAA' => 'E', # Glutamic Acid
'GAG' => 'E', # Glutamic Acid
'GGA' => 'G', # Glycine
'GGC' => 'G', # Glycine
'GGG' => 'G', # Glycine
'GGT' => 'G'  # Glycine
);


2. Fasta Stats
Start with a fasta file with multiple sequences, for example your favorite sequences or sequences you download from a public database.
Write a script that will:
	Count the total number of sequences
	Calculate min, max and average sequence length.
	Check for any illegal DNA characters (anything that is not A,T,G,C, or N) (for extra "bonus points" allow for all IUPAC characters)
	Create an option that will print out per sequence stats.

	For extra extra bonus points
	Create a flag to handle all of the above for protein sequences

3. Determine the species of sequences in a uniprot file 
Download uniprot/swissprot:
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

Use a regular expression to get the uniprot ID which includes GENEID_SPECIES (ex: HDAC1_HUMAN).
Produce a report that lists:
	all of the species that are represented in the file
	how many sequences for each
	the respective percentages of each


4. Reciprocal best blast matches

A reciprocal best blast match is when a query sequence A has as its best blast hit a sequence B,
and that sequence B has sequence A as its best blast hit.

Find the reciprocal best blast hits in the provided tab-delimited blast report.


Using Getopt::Long, make it possible for the user to filter the blast results to only use:
      those that pass a certain E-value threshold
      those that pass a certain alignment length.