PFB2013Problem Sets | PFB2013 | Programming for Biology @ CSHL

Problem Set: Perl X: Solutions

Jessen Bredeson — Thu, 24 Oct 2013 23:40:18 +0000

Question 1-2:
Perl_X.Q2.pl (Jessen)

Question 3:
Perl_X.Q3.pl (Jessen)

Question 4:
Perl_X.Q4.pl (Jessen)

Sequence Similarity Lab script

Deb Triant — Tue, 22 Oct 2013 23:19:29 +0000

#!/usr/bin/perl -w 
 
=pod
 
=head1 NAME
 
 merge_hits.pl
 
=head1 SYNOPSIS
 
 merge_hits.pl file1 file2 file3 ...
 
 merge_hits.pl gstt1_drome_hits.bp62 gstt1_drome_hits.vt80 gstt1_drome_hits.vt20
 
=head1 OPTIONS
 
 -h  	  short help
 --help   include description
 --expect show E()-value
 --all    show all scores (default)
 --bits   show bits
 --f_id   show fraction identical
 --a_len  show alignment length
 
=head1 DESCRIPTION
 
C takes a set of results files produced by the
C or C(bp_blastp.pl> scripts, and merges the results
in these files, so that they can be compared.  The C<*_hits.*> are
expected to contain lines containing an accession, E()-value, bit
score, fraction identical, and alignment length, ending with a
description: 
 
 #Hits:	ACC    	E()	bits	f_id	a_len	descr
 NP_000552	2e-145	409.0	1.000	218	glutathione S-transferase Mu 1 isoform 1 [Homo sapiens]
 
Results from each of the files are merged using the unique accession
(and ordered according to the results in the first file).
 
=head1 AUTHOR
 
William R. Pearson, wrp@virginia.edu
 
=cut
 
use strict;
use Pod::Usage;
use Getopt::Long;
use vars qw($file $fd %results @file_list $matrix @hit_list $shelp $help);
use vars qw($expect $a_len $f_id $bits $all);
 
($expect,$a_len,$f_id, $bits, $all) = ( 0, 0, 0, 0, 0);
 
pod2usage(1) unless @ARGV;
GetOptions("h|?" => \$shelp,
	   "help" => \$help,
	   "all" => \$all,
	   "expect" => \$expect,
	   "bits" => \$bits,
	   "a_len" => \$a_len,
	   "f_id" => \$f_id,
    );
 
pod2usage(1) if $shelp;
pod2usage(exitstatus => 0, verbose => 2) if $help;
 
my @res_fields = ();
if ($all) {
  @res_fields = qw( expect bits f_id a_len );
}
elsif ($expect || $bits || $f_id || $a_len) {
  if ($expect) { push @res_fields, "expect";}
  if ($bits) { push @res_fields, "bits";}
  if ($f_id) { push @res_fields, "f_id";}
  if ($a_len) { push @res_fields, "a_len";}
}
else {
  @res_fields = qw( expect bits f_id a_len );
}
 
for (my $i=0; $i < @ARGV; $i++) {
 
  $file = $ARGV[$i];
 
# open the file
  unless (open($fd,$file)) {
    warn "Cannot open $file\n";
    next;
  }
 
# extract the suffix
  ($matrix) = ($file =~ m/\w+\.(\w+)$/);
  push @file_list, $matrix;
 
  while (my $line = <$fd>) {
# skip over the comments
    next if ($line =~ /^#/);
    chomp($line);
 
# read in the accessions, other info
    my @data = split(/\t/,$line);
    my $acc = $data[0];
    if ($i == 0) { push @hit_list, $acc;}
    my %values = ();
    @values{qw(expect bits f_id a_len)} = @data[1..4];
    unless ($results{$acc}) {
      $results{$acc} = {descr => $data[-1]};
    }
    $results{$acc}{$matrix} = \%values;
  }
}
 
#print Dumper(%results);
 
my $tab_fill = "\t" x scalar(@res_fields);
print "#\t\t".join($tab_fill,@file_list) . "\n";
 
for my $acc ( @hit_list) {
  print $acc;
  for $matrix ( @file_list ) {
    if ($results{$acc}{$matrix}) {
      my $result_p = $results{$acc}{$matrix};
      print "\t",join("\t",@$result_p{@res_fields});
    }
    else {
      print $tab_fill;
    }
  }
  print "\t".$results{$acc}{descr} . "\n";
}
 
__END__

RNA Seq with Trinity and Tuxedo

Steven Ahrendt — Tue, 22 Oct 2013 23:15:39 +0000

Trinity workshop

Trinity Publication

Tuxedo workshop

Tuxedo Publication

Problem Set: BioPerl

Steven Ahrendt — Sun, 20 Oct 2013 13:05:51 +0000

Useful links for Problem Set: BioPerl

Preparation (from command line):
 
1.  Download uniprot_sprot:
 
  curl -O "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
 
2. Unzip the file.
 
  gunzip uniprot_sprot.fasta.gz
 
 
PROBLEM 1: Bio::DB::Fasta
 
1. write a script to retrieve all IDs from uniprot_sprot.fasta using the Bio::DB:Fasta method:
   $db->get_all_primary_ids
 
2. Search through the list of IDs for all IDs that contain the term "HDAC"  
 
3. Print the sequences for these proteins, in FASTA format.
 
 
PROBLEM 2: Bio::SeqIO
 
1. Write a script using Bio::SeqIO to retrieve the CDS translation from a genbank file. 
 
 
PROBLEM 3: Bio::SearchIO
 
1. On the command line, run the following command to format the file for using as a blast database:
  makeblastdb -in uniprot_sprot.fasta  -dbtype prot
 
2. Here is how you can blast your favorite seq against swissprot:
 
  blastp -query query.fasta -db uniprot_sprot.fasta -evalue 1e-10 -out query_v_sprot.blastout
 
   **You can find additional information on BLAST+ at: http://www.ncbi.nlm.nih.gov/books/NBK1763/
 
 
3. Run Blast with 3 of your protein sequences from Problem 1.
      * use the uniprot_sprot.fasta as your database
      * run blast with an e-value cut-off of 1e-10
 
4. Parse your Blast output. For Hits with "significance" less than or equal to 1e-50,
   retrieve every HSP and print in a tab delimited format:
      * QUERY Name
      * HIT Name
      * HSP Evalue

Problem Set: Perl X

Steven Ahrendt — Fri, 18 Oct 2013 18:42:49 +0000

Files, etc for Problem Set X:

Perl X Problem Set
====================
1. Take a look at the Statistics::Descriptive module on cpan.
2. Write a script that uses the methods in Statistics::Descriptive to calculate the standard deviation, median, min, and max of the following numbers:
 
                 12,-13,-12,7,11,-4,-12,9,6,7,-9
 
Optional questions:
 
3.Add a method to Microarray.pm called expression() which returns the expression value.
 
4. Currently, calling $a = $m->gene() gets the value of gene in the object $m. Modify the gene() method such that if you call gene() with an argument, it will set the value of gene to be that argument.
eg:
$m->gene('FOXP1');  # this should set gene name to 'FOXP1'
print $m->gene();   # this should print the value 'FOXP1'

Problem Set: Perl IX

Steven Ahrendt — Fri, 18 Oct 2013 12:41:42 +0000

Perl IX: References Problem Set
 
Pre Questions:
 
P1. Create an array, retrieve the reference(address), print the reference(address).
    my @array = ('a' , 123, 'book' , 'end');
P2. Use the reference to print out the 2nd and 3rd elements.
P3. Create a hash, retrieve the reference (address), print the reference.
P4. Use the reference to print out a single key/value pair.
 
1. Modify your FASTA parser to create a hash of genes and store
    a. The sequenceID as a hash key (geneName)
    b. The sequence description as a inner key to geneName.
    c. The sequence as a inner key to the same key (geneName).
    d. The sequence length as a inner key to the same key (geneName).
    Hint: This is now a hash of hashes. The value of gene is really an anonymous hash.
 
$VAR1 = {
          'gene1' => { 
                        'seq' => "ATTC",
                        'desc' => 'the is a great seq',
                        'len' => 4,
           },
 
           'gene2' => { 
                        'seq' => "TTTT",
                        'desc' => 'the is a better seq',
                        'len' => 4,
 
            },
    };
 
 
2. Use Data::Dumper and its function Dumper to print out the hash to the screen (STDOUT).
 
3. Print out each gene and its information in a tab-delimited file (FILEHANDLE).
	a. Using a foreach loop, iterate through each key (sequenceID).
	b. Use key to iterate thru each inner key (seq desc, seq, length) and retrieve the values.
	c. print out all data delimited by tabs (\t).
		ex: seq1\tmy favorite gene\tATGCTGGGG\t9\n" 
	d. What about the header line ("Name\tDesc\tSeq\tlen\n")? Where do you need to print this so that it appears first and only once in your output.
 
 
4. Calculate and store the count of each nucleotide for each seq.
 	**** hint: $genes{$seqName}{'nt_count'}{$nt}=$count;
	**** hint: %gene is now a hash of hashes of hashes.
	a. Nucleotide count {'nt_count'} will be another value of the gene key. It is not a variable, it is a string. $genes{$seqName}{'nt_count'}
 
$VAR1 = {
          'gene1' => { 
                        'seq' => "ATTC",
                        'len' => 4,
                        'desc' => 'the is a great seq',
                        'nt_count' => { 
                            'a' => '1',
                            't' => '2'
                            'C' => '1'
                          },
           },
 
           'gene2' => { 
                        'seq' => "TTTT",
                        'len' => 4,
                        'desc' => 'the is a better seq',
                        'nt_count' => { 
                            't' => '4',
                          },
 
 
            },
    };
 
 
5. Calculate and add %GC content for each sequence to the hash for each gene as an additional value to the gene name key.
	a. Create a GC_content subroutine. 
 	b. Call the subroutine on each sequence 
	   my $GC = GC_content($seq);
	c. Use your already stored seq length and 'G' and 'C' total counts to calculate GC content
	d. Store the calculated %GC content as an additional value to gene key.
	e. use Data::Dumper to visualize your hash.
 
6. Print out only high %GC content sequences to a file called highGC.fasta in FASTA format.
	a. Create a subroutine called GC_filter.
	b. GC_filter takes the hash of genes and a minimum %GC cutoff. 
	c. GC_filter returns the number of sequences that met your %GC cutoff criteria or have a higher GC%.
	d. GC_filter also returns an array of arrays made up of gene names and sequences ([geneName,seq],[geneName,seq],[geneName,seq]).
	e. Example usage:  
	   my ($count, $array_of_arrays) = GC_filter (\%genes , 55);
 
$VAR1 = {
          'gene10' => { 
                        'seq' => "GCAT",
                        'len' => 4,
                        'desc' => 'the is a super seq',
                        'nt_count' => { 
                            'a' => '1',
                            't' => '1'
                            'C' => '1'
                            'G' => '1'
                          },
                         'GC'   => '50',
           },
}
 
 
7. Print out each gene and ALL of its information in a new tab-delimited file (FILEHANDLE).
	a. Using foreach loops, iterate through each level of the hash of hashes of hashes.
	b. print output to a new file called all.txt in with this format:
	"seq1\tDesc\tGC%\tA:count,T:count,C:count,T:count\tsequence\n"

Problem Set: Perl III: Solutions

Jessen Bredeson — Fri, 18 Oct 2013 02:24:19 +0000

Solutions to Perl II Problem Sets
================================
Perl_III Problem 1 (Jessen)
Perl_III Problem 2 (Jessen)
Perl_III Problem 3 (Jessen)
Perl_III Problem 4 (Jessen)
Perl_III Problem 5 (Jessen)

Problem Set: Perl II: Solutions

Jessen Bredeson — Thu, 17 Oct 2013 18:46:58 +0000

Solutions to Perl II Problem Sets

================================
Perl_II Problem 1 (Jessen)
Perl_II Problem 2 (Jessen)
Perl_II Problem 3 (Jessen)
Perl_II Problem 4 (Jessen)
Perl_II Problem 5 (Jessen)
Perl_II Problem 6 (Jessen)
Perl_II Problem 7 (Jessen)
Perl_II Problem 8 (Jessen)
Perl_II Problem 9 (Jessen)
Perl_II Problem 10 (Jessen)
Perl_II Problem 11 (Jessen)
Perl_II Problem 12 (Jessen)

Problem Set: UNIX and Perl I: Solutions

Jessen Bredeson — Thu, 17 Oct 2013 15:44:15 +0000

Solutions to UNIX Problem Sets

================================
Unix_1 (Jessen)

Solutions to Perl I Problem Sets

================================

Perl_1 (Jessen)

Problem Set: Perl VIII

Steven Ahrendt — Thu, 17 Oct 2013 13:53:50 +0000

Perl VIII - Modules - Problem Set
 
1. Download and install a module using CPAN
 
	$ cpan
 
	CPAN should ask if you want to configure automatically. Say YES.
	Once it's done configuring, you should get a cpan prompt like this:
 
	cpan>
 
	Type the following to install the module Math::Round:
 
	cpan> install Math::Round
 
	then, once it's installed, write a short script that uses the round() to round any numbers
	entered as arguments on the command line.
 
 
2. Take the subroutine that you created in the subroutines problem set, and make it into a module.
 
3. Modify your script from the subroutines problem set so that it now calls that subroutine from the module.
 
4. Modify the module to automatically export the subroutine by default.
 
5. (optional bonus question) Write a script to read FASTA files. Make a module with four subroutines.
 
	Each subroutine will take a FASTA sequence as input. Then, write one subroutine for each of the following:
 
	- getting the name (sequence ID) of the sequence
	- getting the description of the sequence
	- getting the sequence itself
	- reformatting the sequence to be a user-specified number of characters
	  per line