Additional "Real World Problem" Set 1. Create a fasta parser. Your fasta parser should be able to handle more than one sequence and each sequence may have newlines. What if a sequence has a description. This can be used for getting the sequence in a usuable form for any downstream processing. For example reverse complementing, or finding restriction sites in many sequences. >seq1 AAGAGCAGCTCGCGCTAATGTGATAGATGGCGGTAAAGTAAATGTCCTATGGGCC ACCAATTATGGTGTATGAGTGAATCTCTGGTCCGAGATTCACTGAGTAACTGCTG TACACAGTAGTAACACGTGGAGATCCCATAAGCTTCACGTGTGGTCCAATAAAAC ACTCCGTTGGTCAAC >seq2 some fancy description GCCACAGAGCCTAGGACCCCAACCTAACCTAACCTAACCTAACCTACAGTTTGATC TTAACCATGAGGCTGAGAAGCGATGTCCTGACCGGCCTGTCCTAACCGCCCTGACC TAACCGGCTTGACCTAACCGCCCTGACCTAACCAGGCTAACCTAACCAAACCGTGA AAAAAGGAATCT >seq3 ATGAAAGTTACATAAAGACTATTCGATGCATAAATAGTTCAGTTTTGAAAACTTAC ATTTTGTTAAAGTCAGGTACTTGTGTATAATATCAACTAAAT >seq4 no newlines ATGCTAACCAAAGTTTCAGTTCGGACGTGTCGATGAGCGACGCTCAAAAAGGAAACAACATGCCAAATAGAAACGATCAATTCGGCGATGGAAATCAGAACAACGATCAGTTTGGAAATCAAAATAGAAATAACGGGAACGATCAGTTTAATAACATGATGCAGAATAAAGGGAATAATCAATTTAATCCAGGTAATCAGAACAGAGGT Hints: -Separate the id from sequence. -What do you do with a description? -If the sequence has new lines, remove them. -You need to be able to always know which id belongs to which sequence. ideas : loops, hashes, others. 2. Fasta Stats Start with a fasta file with multiple sequences, for example your favorite sequences or sequences you download from a public database. Write a script that will: -Count the total number of sequences -Calculate min, max and average sequence length. -Check for any illegal DNA characters (anything that is not A,T,G,C, or N) (for extra "bonus points" allow for all IUPAC characters) -Create an option that will print out per sequence stats. -Calculate N50. For extra extra bonus points Create a flag to handle all of the above for protein sequences 3. Longest ORF: Start with a fasta file with multiple sequences. Translate each sequence into the 3 reading frames For each seqeunce find and return a) the longest coding nucleotide fragments in one file and print to a file: seq.nt b) the longest translated fragment in another file: seq.aa You can create your own translation hash, or use this: my(%genetic_code) = ( 'TCA' => 'S', # Serine 'TCC' => 'S', # Serine 'TCG' => 'S', # Serine 'TCT' => 'S', # Serine 'TTC' => 'F', # Phenylalanine 'TTT' => 'F', # Phenylalanine 'TTA' => 'L', # Leucine 'TTG' => 'L', # Leucine 'TAC' => 'Y', # Tyrosine 'TAT' => 'Y', # Tyrosine 'TAA' => '_', # Stop 'TAG' => '_', # Stop 'TGC' => 'C', # Cysteine 'TGT' => 'C', # Cysteine 'TGA' => '_', # Stop 'TGG' => 'W', # Tryptophan 'CTA' => 'L', # Leucine 'CTC' => 'L', # Leucine 'CTG' => 'L', # Leucine 'CTT' => 'L', # Leucine 'CCA' => 'P', # Proline 'CAT' => 'H', # Histidine 'CAA' => 'Q', # Glutamine 'CAG' => 'Q', # Glutamine 'CGA' => 'R', # Arginine 'CGC' => 'R', # Arginine 'CGG' => 'R', # Arginine 'CGT' => 'R', # Arginine 'ATA' => 'I', # Isoleucine 'ATC' => 'I', # Isoleucine 'ATT' => 'I', # Isoleucine 'ATG' => 'M', # Methionine 'ACA' => 'T', # Threonine 'ACC' => 'T', # Threonine 'ACG' => 'T', # Threonine 'ACT' => 'T', # Threonine 'AAC' => 'N', # Asparagine 'AAT' => 'N', # Asparagine 'AAA' => 'K', # Lysine 'AAG' => 'K', # Lysine 'AGC' => 'S', # Serine 'AGT' => 'S', # Serine 'AGA' => 'R', # Arginine 'AGG' => 'R', # Arginine 'CCC' => 'P', # Proline 'CCG' => 'P', # Proline 'CCT' => 'P', # Proline 'CAC' => 'H', # Histidine 'GTA' => 'V', # Valine 'GTC' => 'V', # Valine 'GTG' => 'V', # Valine 'GTT' => 'V', # Valine 'GCA' => 'A', # Alanine 'GCC' => 'A', # Alanine 'GCG' => 'A', # Alanine 'GCT' => 'A', # Alanine 'GAC' => 'D', # Aspartic Acid 'GAT' => 'D', # Aspartic Acid 'GAA' => 'E', # Glutamic Acid 'GAG' => 'E', # Glutamic Acid 'GGA' => 'G', # Glycine 'GGC' => 'G', # Glycine 'GGG' => 'G', # Glycine 'GGT' => 'G' # Glycine ); 4. Determine the species of sequences in a uniprot file Download uniprot/swissprot: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz Use a regular expression to get the uniprot ID which includes GENEID_SPECIES (ex: HDAC1_HUMAN). Produce a report that lists: all of the species that are represented in the file how many sequences for each the respective percentages of each 5. Reciprocal best blast matches A reciprocal best blast match is when a query sequence A has as its best blast hit a sequence B, and that sequence B has sequence A as its best blast hit. Find the reciprocal best blast hits in the provided tab-delimited blast report. Using Getopt::Long, make it possible for the user to filter the blast results to only use: those that pass a certain E-value threshold those that pass a certain alignment length. |