PFB2013PFB2013 | Programming for Biology @ CSHL

Problem Set: Perl VII: Solutions

Steven Ahrendt — Thu, 24 Oct 2013 18:56:51 +0000

Solutions to Perl VII Problem Sets
==================================
Question 1:
perl7_1.pl (Steven)

Question 2:
perl7_2.pl (Steven)

Problem Set: Perl VI: Solutions

Steven Ahrendt — Thu, 24 Oct 2013 18:50:55 +0000

Solutions to Perl VI Problem Sets
=================================
Question 1:
perl6_1.pl (Steven)

Questions 2 & 3:
perl6_2.3.pl (Steven)

Question 4:
perl6_4.pl (Steven)

Question 5:
perl6_5.pl (Steven)

Question 6:
perl6_6.pl (Steven)

Question 7:
perl6_7.pl (Steven)

Question 8:
perl6_8.pl (Steven)

RNA Seq with Trinity and Tuxedo

Steven Ahrendt — Tue, 22 Oct 2013 23:15:39 +0000

Trinity workshop

Trinity Publication

Tuxedo workshop

Tuxedo Publication

Genome Assembly with ALLPATHS-LG

Steven Ahrendt — Tue, 22 Oct 2013 01:14:24 +0000

Data set:
wget http://schatzlab.cshl.edu/dna60ifx/decode.tgz

Tutorial

Problem Set: BioPerl

Steven Ahrendt — Sun, 20 Oct 2013 13:05:51 +0000

Useful links for Problem Set: BioPerl

Preparation (from command line):
 
1.  Download uniprot_sprot:
 
  curl -O "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
 
2. Unzip the file.
 
  gunzip uniprot_sprot.fasta.gz
 
 
PROBLEM 1: Bio::DB::Fasta
 
1. write a script to retrieve all IDs from uniprot_sprot.fasta using the Bio::DB:Fasta method:
   $db->get_all_primary_ids
 
2. Search through the list of IDs for all IDs that contain the term "HDAC"  
 
3. Print the sequences for these proteins, in FASTA format.
 
 
PROBLEM 2: Bio::SeqIO
 
1. Write a script using Bio::SeqIO to retrieve the CDS translation from a genbank file. 
 
 
PROBLEM 3: Bio::SearchIO
 
1. On the command line, run the following command to format the file for using as a blast database:
  makeblastdb -in uniprot_sprot.fasta  -dbtype prot
 
2. Here is how you can blast your favorite seq against swissprot:
 
  blastp -query query.fasta -db uniprot_sprot.fasta -evalue 1e-10 -out query_v_sprot.blastout
 
   **You can find additional information on BLAST+ at: http://www.ncbi.nlm.nih.gov/books/NBK1763/
 
 
3. Run Blast with 3 of your protein sequences from Problem 1.
      * use the uniprot_sprot.fasta as your database
      * run blast with an e-value cut-off of 1e-10
 
4. Parse your Blast output. For Hits with "significance" less than or equal to 1e-50,
   retrieve every HSP and print in a tab delimited format:
      * QUERY Name
      * HIT Name
      * HSP Evalue

Problem Set: Perl X

Steven Ahrendt — Fri, 18 Oct 2013 18:42:49 +0000

Files, etc for Problem Set X:

Perl X Problem Set
====================
1. Take a look at the Statistics::Descriptive module on cpan.
2. Write a script that uses the methods in Statistics::Descriptive to calculate the standard deviation, median, min, and max of the following numbers:
 
                 12,-13,-12,7,11,-4,-12,9,6,7,-9
 
Optional questions:
 
3.Add a method to Microarray.pm called expression() which returns the expression value.
 
4. Currently, calling $a = $m->gene() gets the value of gene in the object $m. Modify the gene() method such that if you call gene() with an argument, it will set the value of gene to be that argument.
eg:
$m->gene('FOXP1');  # this should set gene name to 'FOXP1'
print $m->gene();   # this should print the value 'FOXP1'

Problem Set: Perl IX

Steven Ahrendt — Fri, 18 Oct 2013 12:41:42 +0000

Perl IX: References Problem Set
 
Pre Questions:
 
P1. Create an array, retrieve the reference(address), print the reference(address).
    my @array = ('a' , 123, 'book' , 'end');
P2. Use the reference to print out the 2nd and 3rd elements.
P3. Create a hash, retrieve the reference (address), print the reference.
P4. Use the reference to print out a single key/value pair.
 
1. Modify your FASTA parser to create a hash of genes and store
    a. The sequenceID as a hash key (geneName)
    b. The sequence description as a inner key to geneName.
    c. The sequence as a inner key to the same key (geneName).
    d. The sequence length as a inner key to the same key (geneName).
    Hint: This is now a hash of hashes. The value of gene is really an anonymous hash.
 
$VAR1 = {
          'gene1' => { 
                        'seq' => "ATTC",
                        'desc' => 'the is a great seq',
                        'len' => 4,
           },
 
           'gene2' => { 
                        'seq' => "TTTT",
                        'desc' => 'the is a better seq',
                        'len' => 4,
 
            },
    };
 
 
2. Use Data::Dumper and its function Dumper to print out the hash to the screen (STDOUT).
 
3. Print out each gene and its information in a tab-delimited file (FILEHANDLE).
	a. Using a foreach loop, iterate through each key (sequenceID).
	b. Use key to iterate thru each inner key (seq desc, seq, length) and retrieve the values.
	c. print out all data delimited by tabs (\t).
		ex: seq1\tmy favorite gene\tATGCTGGGG\t9\n" 
	d. What about the header line ("Name\tDesc\tSeq\tlen\n")? Where do you need to print this so that it appears first and only once in your output.
 
 
4. Calculate and store the count of each nucleotide for each seq.
 	**** hint: $genes{$seqName}{'nt_count'}{$nt}=$count;
	**** hint: %gene is now a hash of hashes of hashes.
	a. Nucleotide count {'nt_count'} will be another value of the gene key. It is not a variable, it is a string. $genes{$seqName}{'nt_count'}
 
$VAR1 = {
          'gene1' => { 
                        'seq' => "ATTC",
                        'len' => 4,
                        'desc' => 'the is a great seq',
                        'nt_count' => { 
                            'a' => '1',
                            't' => '2'
                            'C' => '1'
                          },
           },
 
           'gene2' => { 
                        'seq' => "TTTT",
                        'len' => 4,
                        'desc' => 'the is a better seq',
                        'nt_count' => { 
                            't' => '4',
                          },
 
 
            },
    };
 
 
5. Calculate and add %GC content for each sequence to the hash for each gene as an additional value to the gene name key.
	a. Create a GC_content subroutine. 
 	b. Call the subroutine on each sequence 
	   my $GC = GC_content($seq);
	c. Use your already stored seq length and 'G' and 'C' total counts to calculate GC content
	d. Store the calculated %GC content as an additional value to gene key.
	e. use Data::Dumper to visualize your hash.
 
6. Print out only high %GC content sequences to a file called highGC.fasta in FASTA format.
	a. Create a subroutine called GC_filter.
	b. GC_filter takes the hash of genes and a minimum %GC cutoff. 
	c. GC_filter returns the number of sequences that met your %GC cutoff criteria or have a higher GC%.
	d. GC_filter also returns an array of arrays made up of gene names and sequences ([geneName,seq],[geneName,seq],[geneName,seq]).
	e. Example usage:  
	   my ($count, $array_of_arrays) = GC_filter (\%genes , 55);
 
$VAR1 = {
          'gene10' => { 
                        'seq' => "GCAT",
                        'len' => 4,
                        'desc' => 'the is a super seq',
                        'nt_count' => { 
                            'a' => '1',
                            't' => '1'
                            'C' => '1'
                            'G' => '1'
                          },
                         'GC'   => '50',
           },
}
 
 
7. Print out each gene and ALL of its information in a new tab-delimited file (FILEHANDLE).
	a. Using foreach loops, iterate through each level of the hash of hashes of hashes.
	b. print output to a new file called all.txt in with this format:
	"seq1\tDesc\tGC%\tA:count,T:count,C:count,T:count\tsequence\n"

Problem Set: Perl VIII

Steven Ahrendt — Thu, 17 Oct 2013 13:53:50 +0000

Perl VIII - Modules - Problem Set
 
1. Download and install a module using CPAN
 
	$ cpan
 
	CPAN should ask if you want to configure automatically. Say YES.
	Once it's done configuring, you should get a cpan prompt like this:
 
	cpan>
 
	Type the following to install the module Math::Round:
 
	cpan> install Math::Round
 
	then, once it's installed, write a short script that uses the round() to round any numbers
	entered as arguments on the command line.
 
 
2. Take the subroutine that you created in the subroutines problem set, and make it into a module.
 
3. Modify your script from the subroutines problem set so that it now calls that subroutine from the module.
 
4. Modify the module to automatically export the subroutine by default.
 
5. (optional bonus question) Write a script to read FASTA files. Make a module with four subroutines.
 
	Each subroutine will take a FASTA sequence as input. Then, write one subroutine for each of the following:
 
	- getting the name (sequence ID) of the sequence
	- getting the description of the sequence
	- getting the sequence itself
	- reformatting the sequence to be a user-specified number of characters
	  per line

Problem Set: Perl VII

Steven Ahrendt — Thu, 17 Oct 2013 13:44:38 +0000

Subroutines and Scope, Perl VII Problem Set

Perl VII Problem Set
=====================
1. Create a subroutine that reverse complements a sequence.
This subroutine should take a nucleotide sequence as a parameter and return the reverse complement.
 
Here's the pseudo code:
 
-- BEGIN PSEUDOCODE --
 
subroutine reverse_complement {
 
  get the parameter nucleotide string
 
  reverse complement the nucleotide string
 
  return the complemented nucleotide string
 
}
 
-- END PSEUDOCODE --
 
Write a program that takes in a nucleotide string as an argument, calls the reverse_complement subroutine, and then prints the reverse complement sequence to STDOUT.
 
-- BEGIN SAMPLE RUN --
 
./reverse_complement.pl GAGAGAGAGAGTTTTTTTTT
AAAAAAAAACTCTCTCTCTC
 
-- END SAMPLE RUN --
 
2. Create a subroutine that reformats your FASTA file such that the sequence is reformatted so that it contains certain number of nucleotides per line. 
 
For example:
 
reformat_seq ($seq , 60 );
 
## sequence before
>seq 1
GAATTCAAGTTCTTGTGCGCACACAAATCCAATAAAAACTATTGTGCACACAGACGCGACTTCGCGGTCTCGCTTGTTCTTGTTGTATTCGTATTTTCATTTCTCGTTCTGTTTCTACTT
AACAATGTGGTGATAATATAAAAAATAAAGCAATTCAAAAGTGTATGACTTAATTAATGAGCGATTTTTTTTTTGAAATCAAATTTTTGGAACATTTTTTTTAAATTCAAATTTTGGCGA
AAATTCAATATCGGTTCTACTATCCATAATATAATTCATCAGGAATACATCTTCAAAGGCAAACGGTGACAACAAAATTCAGGCAATTCAGGCAAATACCGAATGACCAGCTTGGTTATC
 
## sequence after
>seq 1
GAATTCAAGTTCTTGTGCGCACACAAATCCAATAAAAACTATTGTGCACACAGACGCGAC
TTCGCGGTCTCGCTTGTTCTTGTTGTATTCGTATTTTCATTTCTCGTTCTGTTTCTACTT
AACAATGTGGTGATAATATAAAAAATAAAGCAATTCAAAAGTGTATGACTTAATTAATGA
GCGATTTTTTTTTTGAAATCAAATTTTTGGAACATTTTTTTTAAATTCAAATTTTGGCGA
AAATTCAATATCGGTTCTACTATCCATAATATAATTCATCAGGAATACATCTTCAAAGGC
AAACGGTGACAACAAAATTCAGGCAATTCAGGCAAATACCGAATGACCAGCTTGGTTATC

Review Code: Tab-delimited parsing

Steven Ahrendt — Thu, 17 Oct 2013 02:18:11 +0000

Input file for this code:
test_data

#!/usr/bin/perl
# Takes a three-column, tab-delimited file and creates two hashes.
# The keys for both hashes will be the items in the first column,
# and the values will be the second and third column
###############################################################
use warnings;
use strict;
 
my $infile = shift @ARGV;
my %institute; # Hash for the 2nd column data
my %mutation;  # Hash for the 3rd column data
 
open(IN,'<', $infile) or die "Can't open file $infile: $!\n";
while(my $line = )
{
  chomp $line;
 
  # The following line checks for a header line,
  #   which in this case begins with "Sample"
  # Header lines can start with anything though,
  #   so it's really important to verify your input data first
  next if($line =~ /Sample/);
 
  # The split() function returns an array, so this next line
  #   puts each item in the returned array into three variables
  my ($sample, $inst, $mut) = split(/\t/,$line);
 
  # Then just assign the values to their respective hashes
  $institute{$sample} = $inst;
  $mutation{$sample} = $mut;
}
close(IN);
 
##Output
# Print columns in reversed order
print "Mut\tInst\tSample\n"; # print out a header
foreach my $key (sort keys %institute)
{
  my $ins = $institute{$key}; 
  my $mu = $mutation{$key};
  print "$mu\t$ins\t$key\n";
}