*****************************
Unix Basics
*****************************

ls       -- list contents
cd       -- change directory
mkdir    -- make a directory
rm       -- remove files; use caution, it is easy to delete more than you would like
head     -- prints the top few lines to the terminal window
tail     -- prints the last few lines to the terminal window
sort     -- sorts the lines
uniq     -- prints the unique lines (only adjacent duplicates are collapsed, so sort first)
grep     -- finds the lines that contain a pattern
wc       -- counts the number of lines, characters and words
mv       -- move files
cp       -- copy files
date     -- returns the current date and time
pwd      -- return working directory name
ssh      -- remote login
scp      -- remote secure copy
~        -- represents your home directory

man [command] -- manual page for the command
man ls

try:

ls -l

ls -lt

you can string more than one command together with a pipe (|), so that the output of the first command becomes the input of the second command.

ls -lt | head

you can string more than one command together with a semi-colon (;), so that the commands run one after another, but the output of one command is not passed to the next.

date; some program command ; date

you can redirect the output of a command into a file

    grep PATTERN filename > PATTERN.txt


you can append the output of a command to a file
   
    grep PATTERN2 filename >> PATTERN.txt


you can redirect stderr to a file
   
    command 2> filename

you can redirect both the output (stdout) and stderr to the same file (&> is bash syntax; the portable form is: command > filename 2>&1)
   
    command &> filename

text editors:
   
    TextWrangler is a good app to start with.

============
Problem Set
============

    Using your text editor create a fasta file and name it sequences.fasta. Make sure it ends up in the proper directory, locally or remotely.

    This is fasta file format:
    >seqName description
    ATGGCGTCTTGGCCTTAAAAGCTC

    Log into your machine or account. What is the full path to your home directory?
        How many files does it contain?
        How many directories?
    Without using a text editor examine the contents of the file sequences.fasta.
        How many lines does this file contain?   
        How many characters?    (Hint: check out the options of wc)
        What is the first line of this file?    (Hint: read the man page of head)
        What are the last 3 lines?    (Hint: read the man page of tail)
        How many sequences are in the file?    (Hint: use grep)
    Rename sequences.fasta to something more informative of the sequences the file contains.    (Hint: read the man page for mv)

    Create a directory called fasta.     (Hint: use mkdir)
    Copy the fasta file that you renamed to the fasta directory. (Hint: use cp)
    Verify that the file is within the fasta directory.    (Hint: use ls fasta/)
    Delete the original file that you used for copying.    (Hint: use rm, be careful)
    Read the man page for rm and cp to find out how to remove and copy a directory.


*****************************
Perl Basics
*****************************

Quick Review


Why Perl for data processing & bioinformatics

    - Fast text processing
    - Regular expressions
    - Extensive module libraries for pre-written tools
    - Large community of users in bioinformatics
    - Scripts are often faster to write than full compiled programs


Every Perl script should start with the following lines:

    #!/usr/bin/perl
    use warnings;
    use strict;
 

Print "hello world". Type this into your text editor and save it. Give the script a .pl extension.
   
    #!/usr/bin/perl
    use warnings;
    use strict;
    print "hello world\n";

Run it:
    on the command line type
    perl your_script_name.pl

Now try rewriting your script so that it does not have a newline character
(\n). Run it.

Using variables:
    $scalar        A scalar contains a value, can be a string or a number.
    @array         An array contains a list of scalars.
    %hash          A hash contains paired values (key/value pairs)
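
A minimal sketch of the three variable types listed above. The names
($gene, @bases, %length_of) and their values are made up for illustration:

```perl
#!/usr/bin/perl
use warnings;
use strict;

# scalar: holds a single value
my $gene = "CDC2";

# array: an ordered list of scalars, indexed from 0
my @bases = ("A", "C", "G", "T");

# hash: key/value pairs, looked up by key
my %length_of = ("seq1" => 180, "seq2" => 98);

print "gene: $gene\n";
print "first base: $bases[0]\n";              # one array element is a scalar, so it uses $
print "number of bases: ", scalar(@bases), "\n";
print "length of seq2: $length_of{seq2}\n";   # one hash value is a scalar, so it uses $
```

Note that a single element of an array or hash is written with $, because
the element itself is a scalar.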

Using a scalar to print:
    #!/usr/bin/perl
    use warnings;
    use strict;
    my $phrase = "hello world";
    print "$phrase\n";
   
    # This is a comment
    # A print statement can combine variables and strings
    print "This is my phrase $phrase\n";

    # this is another way to print the same as above
    # the print function takes a list of arguments
    # in this example "This is my phrase $phrase" is the first argument
    # and "\n" is the second
    print "This is my phrase $phrase" , "\n";

---- note about scope ----

Scope:  (we will go over this concept in more detail at a later time)
but in summary:
use strict;    # this is important!

When you 'use strict', you must declare each variable with 'my' the first
time you use it. Using a variable that was never declared then becomes a
compile-time error, which catches typos in variable names.

---------------------------
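
A small sketch of what 'my' and 'use strict' do in practice. The variable
names are invented for this example, and the string eval is only there so
that the script can show the strict error instead of stopping on it:

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $count = 1;            # declared with 'my', visible to the rest of the file

{
    my $count = 99;       # a new, separate variable that exists only in this block
    print "inside block: $count\n";
}

print "outside block: $count\n";

# Under 'use strict', using an undeclared name is a compile-time error.
# The code is compiled from a string with eval, so the error ends up
# in $@ and the script keeps running.
my $ok = eval q{ $undeclared = 5; 1 };
my $error = $ok ? "" : $@;
print "strict says: $error";
```

The variable inside the braces does not change the outer $count; each 'my'
variable lives only until the end of the block it was declared in.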


Something Advanced but fun     tr///
   
    #!/usr/bin/perl
    use warnings;
    use strict;

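    my $pet = "cat";
    print $pet, "\n";      # prints: cat

    # tr/// translates character by character: c->d, a->o, t->g
    $pet =~ tr/cat/dog/;
    print $pet, "\n";      # prints: dog

    $pet =~ tr/o/O/;
    print $pet, "\n";      # prints: dOg

    $pet =~ tr/gd/wc/;     # g->w, d->c
    print $pet, "\n";      # prints: cOw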
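
A short sketch of two properties of tr/// that are easy to miss: it works
on single characters rather than whole words, and it returns the number of
characters it matched. The strings used here are made up for illustration:

```perl
#!/usr/bin/perl
use warnings;
use strict;

# tr/// works on single characters, not on whole words:
# every c becomes d, every a becomes o, every t becomes g
my $word = "tact";
$word =~ tr/cat/dog/;
print "$word\n";                 # t->g, a->o, c->d, t->g gives godg

# tr/// returns how many characters it matched, so it can also count.
# Listing the same characters on both sides leaves the string unchanged.
my $seq = "ATGGCGTCTTGGCC";
my $gc_count = ($seq =~ tr/GC/GC/);
print "G or C occurs $gc_count times in $seq\n";
```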


---- getting help -------

Documentation of perl scripts and modules

    Perldoc online http://perldoc.perl.org/
    Or on your computer - type 'perldoc'
    For functions use -f 'perldoc -f sprintf'
    For modules just the name 'perldoc List::Util'
    http://www.cpan.org


============
Problem Set
============

1.    Create a script called "add.pl" to sum two $scalar variables:
      (hint:     use + )

        
2.    Try other mathematical operators. Create a new script that adds,
      subtracts, multiplies and divides values.

3.    Test your knowledge of precedence. Create a script that uses many
      operators together. Do you get the answer that you expect?

4.    Create a script to produce the reverse complement of a sequence (hint, use the reverse and tr/// functions)

    % reversec.pl GAGAGAGAGAGTTTTTTTTT
          output: AAAAAAAAACTCTCTCTCTC
    

*****************************
Perl II Problem Set:
*****************************

1.        Create a script that takes two numbers from the command line and
adds them.
                   % add.pl 2 3

                   5

2.        Modify the "add" script from the previous problem set so that it checks that both arguments are defined (hint: use the defined function. A plain truth test such as if ($ARGV[0]) would reject 0, so make sure 0 is still accepted):

                   % add.pl 2 3

                   5

                   % add.pl 2

                   Please provide two numbers.    

3.        Modify the script again so that it checks that both arguments are positive numbers. Zero is allowed, but -1 is not:

                      % add.pl 2 -3

                      Please provide two positive numbers.    

 
4.   Write a script to compare two strings given as command line arguments and print "right order" if they are in alphabetic order, and "wrong order" if they are not:

                      % order.pl Fred Lucy

                      right order

                      % order.pl Lucy Fred

                      wrong order

                 
5.  Write a script to compare two strings given on the command line and print them out in correct alphabetic order:

                  % reorder Fred Lucy

                  Fred Lucy

                  % reorder Lucy Fred

                  Fred Lucy

                 
6.        Write a script named "same.pl" to read two strings from the terminal. Compare them in a case-sensitive manner and print "same" if they are the same, "different" if they are different:

                  % same.pl

                  Enter string 1: lucy

                  Enter string 2: Lucy

                  different

                 
7.        Modify this script to compare the strings in a case-INsensitive manner (hint: use the "lc" or "uc" function to change a string to lowercase or uppercase).

8.  Write a script named "percent.pl" to calculate percentages, where the percentage is $i/($i+$j) * 100. Make sure that the script does not crash when given two numbers that add up to zero:

                  % percent.pl 50 150

                  25%

                 
                  % percent.pl 50 -50

                  You are trying to trick me! at line 4.

                 
9.   Modify this script to use the printf() function to produce nicely formatted floating point numbers (hint:  try "perldoc -f sprintf" and "perldoc -f printf" or look it up online to learn about this wonderful function).

                  % percent.pl 50 150

                  25.00 %
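
A minimal sketch of how printf and sprintf format strings behave, using
made-up values rather than the percent.pl calculation itself:

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $value = 3.14159;

# %.2f  -> floating point number rounded to 2 decimal places
# %5d   -> integer, right-aligned in a field 5 characters wide
# %%    -> a literal percent sign
printf("%.2f\n", $value);
printf("[%5d]\n", 42);

# sprintf formats the same way but returns the string instead of printing it
my $text = sprintf("%.1f %%", 12.345);
print "$text\n";
```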

                 
10.   Run this code. Is its output what you expect? Why?

                  for (my $i = 0; $i < 10; $i++) {
                      if ($i = 2) {
                          print "\$i = $i\n";
                      }
                  }

                     
11.  Write a program named "pali.pl" to detect palindromes. It must be able to handle changes in case.

                  % pali.pl "Madam in Eden Im Adam"

                  yes!

                  % pali.pl gatcctag

                  yes!

                  % pali.pl "cold spring harbor laboratory"

                  no!

                 
12.        Modify the program to work even if there is extraneous punctuation:

                  % pali.pl "A man, a plan, a canal... Panama"

                  yes!

                 
        (Hint: Look up the s/// pattern matching & substitution function in the Perl reference guide. We will cover this formally in a few days, so you can save this one for later if you like.)


13.  Create a file of numbers called numbers.txt with the following content:

22
45
1
2
31
32
72
24


14.  Here is pseudo-code for a program which uses numbers.txt as input:

        create file myresult.txt and open it for writing output
        open numbers.txt for reading
        while (each line of the file numbers.txt) {
            if (the number is even) {
                if (the number is less than 24) {
                    print the line to STDOUT
                }
            }
            else {
                compute the factorial of the number
                print the factorial to the file myresult.txt (one per line)
            }
        }


        a. What will be printed to STDOUT?
        b. What will be the contents of myresult.txt?

        c. Convert the pseudocode above into a real program.


*****************************
Perl III Problem Set:
*****************************

1.  Create a script that divides two numbers provided on the command line.
	Two numbers are required.
	Numbers have to be positive.
	Divisor cannot be zero.

        This part you do in Perl
        ========================
	Write the quotient to STDOUT
	Write any errors to STDERR

        This part you do on the command line in UNIX
        ============================================
	Redirect STDOUT to an output file (out.txt)
	Redirect STDERR to an error file (err.txt)


2. Open a file using the open function.
	Make all the letters in each line uppercase. (There's a built-in
    Perl function which will do this.)
	Write out to a new file that was created using the open function.

3. Open the provided fasta file. Print the reverse complement of each
    sequence. Make sure to print the output in fasta format including
    the sequence name and a note in the description that this is the
    reverse complement. Print to STDOUT and capture the output with a
    command line redirect '>'.

4. Open the provided fastq file. Go through each line of the file. Count
    the number of lines and the number of characters per line.
    Report the:
         a. total number of lines
         b. total number of characters
         c. the average line length 

5. Create a script that uses index() to:
      a. find the first position of 'Nobody' on every line
      b. find the first position of 'somebody' on every line
    Use the warn() function to warn the user that 'somebody is here'
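
    A short sketch of how index() and warn() behave. The example line and
    search words are made up; they are not taken from the poem:

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $line = "There is a cat in the hat";

# index(STRING, SUBSTRING) returns the position of the first match,
# counting from 0, or -1 when the substring is not found
my $pos_cat = index($line, "cat");
my $pos_dog = index($line, "dog");

print "cat found at position $pos_cat\n";

if ($pos_dog == -1) {
    # warn() prints to STDERR, so the message stays out of redirected STDOUT
    warn "no dog on this line\n";
}
```

    Because a match at the very start of a line returns 0, compare the
    result with -1 rather than testing it for truth.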

>seq1
AAGAGCAGCTCGCGCTAATGTGATAGATGGCGGTAAAGTAAATGTCCTATGGGCCACCAATTATGGTGTATGAGTGAATCTCTGGTCCGAGATTCACTGAGTAACTGCTGTACACAGTAGTAACACGTGGAGATCCCATAAGCTTCACGTGTGGTCCAATAAAACACTCCGTTGGTCAAC
>seq2
GCCACAGAGCCTAGGACCCCAACCTAACCTAACCTAACCTAACCTACAGTTTGATCTTAACCATGAGGCTGAGAAGCGATGTCCTGACCGGCCTGTCCTAACCGCCCTGACCTAACCGGCTTGACCTAACCGCCCTGACCTAACCAGGCTAACCTAACCAAACCGTGAAAAAAGGAATCT
>seq3
ATGAAAGTTACATAAAGACTATTCGATGCATAAATAGTTCAGTTTTGAAAACTTACATTTTGTTAAAGTCAGGTACTTGTGTATAATATCAACTAAAT
>seq4
ATGCTAACCAAAGTTTCAGTTCGGACGTGTCGATGAGCGACGCTCAAAAAGGAAACAACATGCCAAATAGAAACGATCAATTCGGCGATGGAAATCAGAACAACGATCAGTTTGGAAATCAAAATAGAAATAACGGGAACGATCAGTTTAATAACATGATGCAGAATAAAGGGAATAATCAATTTAATCCAGGTAATCAGAACAGAGGT
@HWI-ST279:219:D0RJNACXX:6:2303:16038:171912 1:N:0:GGCTAC
TGAAGTAACACTAACAGAGAAAGTACATGTACTAAACAGTTCCTTAAGTGCAGTTGCTTCCTTGTGATAAACATTCTCTAAATCTCTAGTTGATGTTTGCC
+
CCCFFEFFHHHHGJJJJJJIJJJJJJJJJHIIIIJGHGHHHIJIJJJJHGIJJJJGJJJJJJJJJJJJJJJJJIJHIJJJJJJJHHHHHFHFFFFFFFDEE
@HWI-ST279:219:D0RJNACXX:6:2303:16145:171918 1:N:0:GGCTAC
GCTCTGAGATAGGTTCCAACTTCCTCCCGCGAACGCACCCGTACTTGCAGCCCAAAAACGAGAAGGGGAACAATAGAAAGCAAGTGAAAGGATGCTGCTGG
+
@@CFFFFEHHHHHFHHIJJJIJJDHIJIJJJIIJIIIJIJIEHHHHGHFFFFFDDEDD?@@B@BDDDDDDBDDCCCCDDDDDCC:CDDDDDDCCACDD>>>
@HWI-ST279:219:D0RJNACXX:6:2303:16023:171925 1:N:0:GGCTAC
GACCCATATAAATATGCGGTCACTACATCCATCAACTGTATTTCTAAGTTCAAATTGACTGCCATAGATATTAAGAACCGGAATGTAATTCCATCCATTAC
+
CCCFFFFFHHHHHJJJIJJHGJJJJJJJJJJJJJJIJJHIIJJIJIIJIHJJJJJJIJJJIJJJJIJIIJJIIIJHIHHHFFDDEEEEEEEDEEDDDDFDD
@HWI-ST279:219:D0RJNACXX:6:2303:16127:171927 1:N:0:GGCTAC
CACCTGTGGCAAGAAACTTAATGTTCATTCAGGCTCGATTTCAGGCTTCAGCATTATCAAATTTCTCATCAAGAAAGGATGAAAACAAATGTAGGTTACAG
+
CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJIIJIJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJIHHHFFFFFFFEEEDDDDDDDEDACDDCD
@HWI-ST279:219:D0RJNACXX:6:2303:16211:171937 1:N:0:GGCTAC
TTTAAATCTCTAACTATCTCTAACACTCAAATATGCTAAGTCTAATAATCTAATAATTTCTAACACTCACATATGCTAACTCTAATAATCTAATAATCTTA
+
<@@DFBBDDBFHDEABGECDDAC?DDF3CFH?H?CEFGGGC?DGFFD<FHHGBF@4BFDDGGG@BDGBCB=FGI@@@FD@G>EHCGHCHFGHABC;@;@DA
@HWI-ST279:219:D0RJNACXX:6:2303:16195:171950 1:N:0:GGCTAC
CCCTGAAAATTAGGCTGCTCCTAGATTAGCAGCATGTTGGTCTAATGGGCCATATTCAAGCTCAATCAGCAAATAAGTAGGGACTTGGTCGGGTTCCAAGG
+
CCCFFFFFGHHHHJGIJJJJJJJJIJJIJJJJIJJIGIIJIIJJJJIJJIJJJJJJIJIJJJJJJJIJJJJJJIHEFCHHGFFFFEEEEEDDDBDDDEDDB
@HWI-ST279:219:D0RJNACXX:6:2303:16100:171952 1:N:0:GGCTAC
ATGGAATGGGTTTTGGCATTAGTCTACGTTTAGTACTTCTAATTAGTGTCAAACATTCGATGTGATAGGGATTAAAATTTAGTCCCTAAACCAAACAGGGC
+
CCCFFFFFHHFHHJJJJJJJJJIJJJJJIJJJJGIJJJJJJJJJJJHIIJJJJJJJJJJJJJIJJJJIJJJJJJJIHHHHHHEFFFFFEEEEDDDDDDDDB
@HWI-ST279:219:D0RJNACXX:6:2303:16145:171961 1:N:0:GGCTAC
ATGCATCCATGCTAAGATATTTCCTCGTGTGGCACTGTTCAGTGTTCATCAGCAGTGGTTGGATACGTGAACCCCACACGCCAGAAATCATAACGTGCAGT
+
<?<DDFFFADHHHIGGGHEHIICHIF>EFBBHGGIGCGFGIIGGFGEGICDDEGEHEHGEDFGGEH@==@GCAEB;?=B>@BBBCCCA@CACC(,<AA@::
@HWI-ST279:219:D0RJNACXX:6:2303:16218:171973 1:N:0:GGCTAC
TTTATTGATTCTCTTTGTGTTTACAAAGTGCACCAAAAGTACTCCTTACAAAAATTTCGACTAAACTCAAAATCCTAACTAAACTATCAACTCAATTGCTC
+
CCCFFFFFHHHHGJJIIGGIJJJJJJJJHIJJJJJJJJJ?FHJJJJJIJJJJJJJJJJJJIJJJGIJJJJIIJHGHHHHFFFFFEEEEEEEDDDDDEDDDD
@HWI-ST279:219:D0RJNACXX:6:2303:16132:171978 1:N:0:GGCTAC
TGTATTCACTAGGCTGTTGCTTTCTGTAGTTAATACTAGGCTAAATAGTGCTTTTTATTGTCATTCTGTTATTTTCATTCATGTTAGAGGGATTATATATA
+
CCBFFFFFHHHHHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJJJJHIJJJJJJJJJJJJJJJJJJJJJJJJJHIJJJJJJJJIIGHHHFFFFFEEEFC
@HWI-ST279:219:D0RJNACXX:6:2303:16177:171984 1:Y:0:GGCTAC
ATACCTTCTCCCGGGAGTTTTAGACGTGCTTTGATGCTTCGGTCATGCGAAATGCAAAATGCAGTTAAGAGTGGAGGCGAAGGCCTCAAGGCCATGGAAAA
+
===A=AA<CCA+0AC)<<CBBBBBBA?<AABB<4?A:<AA6<0'7>A>A=6=;>AAAA@##########################################
@HWI-ST279:219:D0RJNACXX:6:2303:16255:171754 1:N:0:GGCTAC
TCCTAGTCTCAACCATAAACGATGCCGACCAGGGATCGGCGGATGTTGCTTATAGGACTCCGCCGGCACCTTATGAGAAATCAAAGTCTTTGGGTTCCGGG
+
CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJIJJIJIJHDDDCECDCDACCDD@CDDDDDDDDDDDDDDCDCCCDDDDDDDDDADDEDCDDBBDDDDD
@HWI-ST279:219:D0RJNACXX:6:2303:16383:171755 1:N:0:GGCTAC
TGGGGGATGTATATTTTTTCCTTTTTCCTAGGTATGACCCTCCAGGGGTGGCATCTTGGAATTTTTTCCTGATTTATCAATAGAATTTCGCCCCGCCTTGT
+
1::4AD8@D??<:AFGBB):E?4<*?C4*??*00*?*909)8??F@;-5@###################################################
@HWI-ST279:219:D0RJNACXX:6:2303:16283:171764 1:N:0:GGCTAC
ATTCCTAATCCAATATCTAGGCATGTTATATCTAAAGCTAACAAAGGAATAACAAATAATATGGCTGAGAACAGTAGGTAAATATCCCATACAGTATCATT
+
@@@DBDADFHHFDFAHEIGEEDHIEHGGGFDAHHDAHIIIIICFGGCDFHHEEGIGIGEGIGCHEIEHIGGHII77=@==?ACCEDE@DDDCCC>@BCDDE
@HWI-ST279:219:D0RJNACXX:6:2303:16473:171773 1:N:0:GGCTAC
TTGACGTTTTCCCAAAAGTCTTCGGAGACCTCTTTCCGATGAAAAATCTTTTTTTGTGTAATTCCAAATGTGAAGCAAGTCTCCGTATCTTATTGGTGAAA
+
CCCFFFFFHHHHHJJJJJFHIJJJJJJIIJJJJJJJJJJJJJJJJJJJJJJJJJHHDFCEFFFEEFEEEDDDEEDDDDDDDEDDDBDDDEEEDEDDDDDDD
@HWI-ST279:219:D0RJNACXX:6:2303:16387:171787 1:N:0:GGCTAC
CAAATGTTGGTGATTTTTAATTTTTATTTTACTATATTCTAAACCAACCAAACAAGCTTCCTTTCCAGATTTTTAGTGCTATGTTGAGTTTCATATGTTAC
+
BCCFFFDFHHDFHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJJIJJJJJJJJJJJJJJJJJJHHHHHHFFFFFFFEEEEEEEFEEEEC
@HWI-ST279:219:D0RJNACXX:6:2303:16445:171789 1:N:0:GGCTAC
TATTCGGAGATGACCTGAACTCCTGAAGCATATGCAGTCTGCTTAGATTAGAGCTCATACATCGCTAATGTCCCAGTAATGTCAGATGCCATCCACATATG
+
CCCFFFFFHHHHHJJJJJJJJJJJJJJJIJJJJJJJJIIIJIIJJJJJJJJJJIIJJJJJJJJJJJJJJJHHHHHHHFFFFFFFEEEEEEDDDDDDDDDED
@HWI-ST279:219:D0RJNACXX:6:2303:16340:171805 1:N:0:GGCTAC
CCTATTTTATTGCAAGCTTTTGTGATGATTTTTTAAAACATTTATTGCCAAGTTAGTACAGTGATATATCTATTGGGAAAAGTACTCCAGGTATCATGTCT
+
CCCFFFFFHHHHHJIJJJJJJJHIJJJJJJJJJJJJJJJJJJJJJJJJJJJJIIJJJJJJJHIJJIJJJJJJJJJJJJHHHHEDFFFFFEECEEEEEDEDD
@HWI-ST279:219:D0RJNACXX:6:2303:16446:171814 1:N:0:GGCTAC
ATGGGAAATACTTCAAGAAGTTATGAGGGGACATCCTGTACTGTTGAATAGAGCACCTACCCTGCATAGATTAGGCATACAGGCTTTCCAACCCACTTTAG
+
CCCFFFFFHHHHHJJJJIJJIJJJJJJJJJIJIJJJIJHGHIJIIJJJJJJJJIJJJJJJJJJJJJIJIHHHHHHDEFFFFEEDEEDDDEDDDDDDDDDDD
@HWI-ST279:219:D0RJNACXX:6:2303:16469:171833 1:N:0:GGCTAC
CAAAATAACATCCTAAAAGTAGAAACATTGAAAGCAATAAACAATAGGTCTTCTTATTCCCTTAAAAAAATACTGAGGGTTATGAAGGGGGGTTTTTCCAA
+
CCCFFFFFHHGHHIIJJIICHJHIGIJJJJIGEGGGGGIIGIIIGHGIBGHGIIIJJJJJJIIJIJJII################################
@HWI-ST279:219:D0RJNACXX:6:2303:16447:171839 1:N:0:GGCTAC
TTTTCTTGAAAAGAATCCCAATACTTCATTGGGTGGGATGGCGGAACAAACCAAAAAAATTGTCTTATTTGATAAGGTTATGAATTAACAAATAAGACAGG
+
CCCFFFFFHHHHHJIJJJJJJIJJJJJIJJJJJ?FHIGCHIGIIGIIIJHHHGFFFFDDBCDDDDDEDDEEEDDDDCAACCDDCDDDCDDDDDDDDDDCDB
@HWI-ST279:219:D0RJNACXX:6:2303:16422:171845 1:N:0:GGCTAC
AATGCTTTAGTTGGTGAGAAGAAGACTGGTGTTACTCACACACCTGGCAAAACAAAGCATTTCCAGACGCTGATAATCTCAGAGGAGCTCACTCTATGTGA
+
CCCFFFFFHHHHHJGIIJJJJJJJIJJJJCFHIGIIJEIBGHJJIJJJJIJIJJJIJGIJJJIJIJJJHHFFFEECEEEEDDDDDCDDDDDDDDDDDECDC
@HWI-ST279:219:D0RJNACXX:6:2303:16378:171848 1:N:0:GGCTAC
CATTAAGATTTTGAAGACTTGAATATATAAACTTATTTAAATTTTACATGTTAATGATTTGTTCTTACTTTTTTTAATAATTCCAAAATATATATGAAGAG
+
CCCFFFFFHHHHHJJJJIJIJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJIJJJJJHIJJJJJJJJJJJJJJJJJHHHHHHFFFFFFDEEFEFFEFEEDDD
@HWI-ST279:219:D0RJNACXX:6:2303:16339:171865 1:N:0:GGCTAC
TATTAAAATTTAATCAGTTTTTATTGCATTTTCAAATAATTTTGGACAAAAAGTATGAAAATTCTAAAAATTTTCAGTGCACCGAAATTTTAGTCCAAAAC
+
;?<DDDDBFF<FDAG>F<3C:C?<C:,<A+<4+<AF<E<CF99C@FGF=GFF:BF?B<F8?*.B)=BC8)@F787=77@@@E?B'9<;;>;5;;>;55;9?
@HWI-ST279:219:D0RJNACXX:6:2303:16496:171869 1:N:0:GGCTAC
AAATAATATGTAACATAGCTAGACAACAACTTACATAAGTTGATGTGGTTTATAATAATTTAAATTTGAACTACGATTCGTATGTAAAAATAAGGTGATGT
+
BCCFFFFFHHHHHJJJJJJJJJIJJJIIIJJJJJJJIJJHIIIJJFHIFHHJJJJJJJJJJJJJJJJJIJJJJJJJJJJHHHHHFFFFFFFEEEEACDDDD
@HWI-ST279:219:D0RJNACXX:6:2303:16268:171870 1:N:0:GGCTAC
GCAAATTAGTTAAATTAATTGTGTGCAATCAGAAAAATTCATCAATTAATTCTACCTATTGTTTTTTTCTGGGTATAGTATGACTGTAAACTGTAAGTAAA
+
@@@FFFFFAHHHDHIIIIIGHCHIICHCG>EEHHIGHIIDEIICIGIDFDFDHG<BD><DFB@?FGEG@C77==E@D@ADDB>CCEEDDCCDD>CC>@CC<
@HWI-ST279:219:D0RJNACXX:6:2303:16392:171873 1:N:0:GGCTAC
ATGTTTACTATAACACCACATTTTCAAATCATTGTGTAATTAGGCTTAAAAGATTTATCTCGCAATTTACACGTAATCTGTATAATTGGTTTTTATTTTTT
+
CCCDFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJIIHIJJJJJJJJJJJJJJHIIJJJJJJJJJJJJJJJIJJDHHHHHHDFFFFFFFEEDEDDDDDEEDD
@HWI-ST279:219:D0RJNACXX:6:2303:16412:171876 1:N:0:GGCTAC
ACATAGTTTGGAGTTTGGACTTTAGAGATGAATATGTTGTTTAACCGGGGACGGGTTCACCACAGGGAAAAATTCACCGCGTGGTGATCGGGGCTGATGAA
+
@@@DDDDDF8DADFGHIB9FGIGB>BHHC9F4?*:E*1:C=A*?DG:?DH>F6:=41;;CC3;2@B;(,;?9?>355:?3&)5>8?:>:<@##########
@HWI-ST279:219:D0RJNACXX:6:2303:16336:171880 1:N:0:GGCTAC
TGAAGAAACTAAGGTGAGGTATCCTAGTCAGGGGTCAATTTGGCCACAGAAAATAACCAGTGCTGGCTGCTTAAACTAATCCATCATCACAGCACTGATTT
+
@@@FFFFFGHHHHJFF4EFGBHGHIBHEHHHHHIFDFHIIGGIIHIIGGIIJIJJJGHIG@@@FHAC9@EEHEAC;B@DFCEDCEEDDDDCCBDDCCDDDD
@HWI-ST279:219:D0RJNACXX:6:2303:16300:171889 1:N:0:GGCTAC
TGAGCCAGCAGAAGTATGCTTCTAATGTTGTGGAAAAGTGTCTATCTTTTGGAACTCCTGATGAACGTGAAGGCCTTATAAGAGAGATTGTATCCTCTGGC
+
CCCFFFFFGHHHHJCGHJJJJIJJJJJJJJIIJJJJJJFHIGIJJJJJJJJJHIJJJJJIJJJJJJJIJIGIJHHHHHHFFFFFFDDEEDDEEFDEDDDDC


Nobody by Shel Silverstein

Nobody loves me,
Nobody cares,
Nobody picks me peaches and pears.
Nobody offers me candy and Cokes,
Nobody listens and laughs at me jokes.
Nobody helps when I get in a fight,
Nobody does all my homework at night.
Nobody misses me,
Nobody cries,
Nobody thinks I'm a wonderful guy.
So if you ask me who's my best friend, in a whiz,
I'll stand up and tell you that Nobody is.
But yesterday night I got quite a scare,
I woke up and Nobody just wasn't there.
I called out and reached out for Nobody's hand,
In the darkness where Nobody usually stands.
Then I poked through the house, in each cranny and nook,
But I found somebody each place that I looked.
I searched till I'm tired, and now with the dawn,
There's no doubt about it-
Nobody's gone!


*****************************
Array Problem Set
*****************************

1. Create a shuffled sequence
	Turn a DNA string into an array with split()
	Use a for loop to perform the following procedure N times (N = length of seq)
		Select a random position A with rand()
		Select a random position B with rand()
		Exchange the letters at array indices A and B
	Print the now shuffled sequence
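
A sketch of the building blocks named above (split, rand, exchanging two
array elements, join), not a full solution. The sequence and the choice of
positions 0 and 1 are made up for illustration:

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $dna = "ATGGCGTC";

# split with an empty pattern gives one array element per character
my @letters = split(//, $dna);

# rand(N) returns a fractional number from 0 up to (but not including) N;
# int() truncates it to a whole number, which is a valid array index
my $random_index = int(rand(scalar @letters));
print "random position $random_index holds $letters[$random_index]\n";

# exchange two elements in a single statement with an array slice
@letters[0, 1] = @letters[1, 0];

# join glues the array back into one string
my $changed = join("", @letters);
print "$changed\n";
```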

2a. Start with 2 very similar DNA sequences. 
	Align with ClustalW or some other web alignment application. 
	Output should be in fasta format.
	Store (copy and paste) the sequence, including dashes, from each ClustalW fasta output in a separate string variable inside your script.
	Turn each string into an array with split()
	Use a for loop to compare each index for nucleotide differences.
	Report the nucleotide position of each difference.

2b. Do the same as above, but instead of copying and pasting into string variables,
import the sequences from a file.

3. Calculate GC content
	Turn a DNA string into an array with split()
	Use a foreach loop to look at each nucleotide in turn
	Calculate total length of the sequence
	Keep a running total of C's and G's
	Print the calculated GC content as a percent.


4.  Run this code. Is its output what you expect? Why?
                  for (my $i = 0; $i < 10; $i++) {
                   if ($i = 2) {
                       print "\$i = $i\n";
                   }
                  }


*****************************
Hash Problem set
*****************************

1. Determine the codon usage for a DNA sequence

- Create a string containing a DNA sequence
- In a for loop
  - use the function substr ($seq, $offset, 3) to extract codons
  - store each codon and the number of times it has occurred in a hash
- Report the codon usage
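
A sketch of the two pieces this problem relies on: substr() to cut out a
piece of a string, and a hash used as a counter. The sequence and the
counted words are made up, and the loop over codons is left to you:

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $seq = "ATGGCGATG";

# substr(STRING, OFFSET, LENGTH): offsets count from 0
my $first  = substr($seq, 0, 3);
my $second = substr($seq, 3, 3);
print "$first $second\n";

# counting idiom: a missing key starts as undef, and ++ treats that as 0
my %seen;
foreach my $word ("liver", "brain", "liver") {
    $seen{$word}++;
}

foreach my $word (sort keys %seen) {
    print "$word\t$seen{$word}\n";
}
```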


2. Create a report of the expression levels and sequences of genes expressed in liver.

- There is a tab delimited file of expression data.
- This file contains the following information:
GeneID, tissue the gene is expressed in, expression level, and gene sequence.
- Open this file in your script using open().
- In a loop:
	- Read a line in the file 
	- remove the "\n" at the end of the line
	- split the line on tabs using split (/\t/ , $line)
	- Store the data on each line in 3 different hashes as described below

hash	  key	    	value
%tissue	  GeneID	tissue
%expr	  GeneID	expression level
%seq	  GeneID	sequence

- Now search the %tissue hash for genes that are expressed in liver. Make a list of the GeneIDs corresponding to these genes.
- create a report of the GeneID and expression level of these genes

* Expression Data *
CDC2	brain	34.5	AGCGCGGTGAGTTTGAAACTGCTCGCACTTGGCTTCAAAGCTGGCTCTTGGAAATTGAGCGGAGAGCGAC
ALT	liver	9.2	ATGTTCAGAAGAAGTTTAAAACTATTAAGTAAAGAAACCATTACTCGTGTTAAACCAAATACAACTATTG	
ARG1	liver	458.5	AGCCGATGCGTGGCGCCCCGGCGGCCACGCCGCCGCCCGCTACGGAATCGGCGGCCGAGCGGCTGCGCCG
TSHR	thyroid	2.8	CCTCCTCCACAGTGGTGAGGTCACAGCCCCTTGGAGCCCTCCCTCTTCCCACCCCTCCCGCTCCCGGGTC


*****************************
Regular Expression Problem Set
*****************************


1. The enzyme ApoI has a restriction site: R^AATTY where R and Y are degenerate
nucleotides. See the IUPAC table to identify the nucleotide possibilities
for R and Y.

    Write a regular expression that will match occurrences of the site in a sequence. (hint: what are you going to do about the actual cut site, represented by the '^'?)

2. Use the regular expression you just wrote to find all the restriction sites in the following sequence. Be sure to think about how to handle the newlines!

GAATTCAAGTTCTTGTGCGCACACAAATCCAATAAAAACTATTGTGCACACAGACGCGAC
TTCGCGGTCTCGCTTGTTCTTGTTGTATTCGTATTTTCATTTCTCGTTCTGTTTCTACTT
AACAATGTGGTGATAATATAAAAAATAAAGCAATTCAAAAGTGTATGACTTAATTAATGA
GCGATTTTTTTTTTGAAATCAAATTTTTGGAACATTTTTTTTAAATTCAAATTTTGGCGA
AAATTCAATATCGGTTCTACTATCCATAATATAATTCATCAGGAATACATCTTCAAAGGC
AAACGGTGACAACAAAATTCAGGCAATTCAGGCAAATACCGAATGACCAGCTTGGTTATC
AATTCTAGAATTTGTTTTTTGGTTTTTATTTATCATTGTAAATAAGACAAACATTTGTTC
CTAGTAAAGAATGTAACACCAGAAGTCACGTAAAATGGTGTCCCCATTGTTTAAACGGTT
GTTGGGACCAATGGAGTTCGTGGTAACAGTACATCTTTCCCCTTGAATTTGCCATTCAAA
ATTTGCGGTGGAATACCTAACAAATCCAGTGAATTTAAGAATTGCGATGGGTAATTGACA
TGAATTCCAAGGTCAAATGCTAAGAGATAGTTTAATTTATGTTTGAGACAATCAATTCCC
CAATTTTTCTAAGACTTCAATCAATCTCTTAGAATCCGCCTCTGGAGGTGCACTCAGCCG
CACGTCGGGCTCACCAAATATGTTGGGGTTGTCGGTGAACTCGAATAGAAATTATTGTCG
CCTCCATCTTCATGGCCGTGAAATCGGCTCGCTGACGGGCTTCTCGCGCTGGATTTTTTC
ACTATTTTTGAATACATCATTAACGCAATATATATATATATATATTTAT


3. Determine the site(s) of the cut in the above sequence. Print out the sequence with "^" at the cut site.

    Hints:
        Use subpatterns (parentheses and $1, $2) to find the cut site within the pattern.
        Use s///

    Example: if the pattern is GACGT^CT the following sequence

    AAAAAAAAGACGTCTTTTTTTAAAAAAAAGACGTCTTTTTTT

    would be cut like this:

    AAAAAAAAGACGT^CTTTTTTTAAAAAAAAGACGT^CTTTTTTT
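
    The GACGT^CT example above can be sketched with s/// and two
    subpatterns. This covers only the example pattern, not ApoI, and the
    variable names are made up:

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $seq = "AAAAAAAAGACGTCTTTTTTTAAAAAAAAGACGTCTTTTTTT";

# (GACGT) is captured as $1 and (CT) as $2; the replacement puts a "^" between them.
# The g modifier repeats the substitution for every match in the string.
my $digest = $seq;
my $cuts = ($digest =~ s/(GACGT)(CT)/$1^$2/g);

print "$digest\n";
print "number of cuts: $cuts\n";
```

    With the g modifier, s/// returns the number of substitutions it
    made, which is a quick way to count the cut sites.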


4. Now that you've done your restriction digest, determine the lengths of your fragments and sort them by length (in the same order they would separate on an electrophoresis gel).

    Hint: take a look at the split man page or think about storing your matches in an array. With one of these two approaches you should be able to convert this string:

       AAAAAAAAGACGT^CTTTTTTTAAAAAAAAGACGT^CTTTTTTT


    into this array:

    ("AAAAAAAAGACGT","CTTTTTTTAAAAAAAAGACGT","CTTTTTTT")
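
    A sketch of the split approach for the example string above. Sorting
    the fragments by length is left as the exercise:

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $digest = "AAAAAAAAGACGT^CTTTTTTTAAAAAAAAGACGT^CTTTTTTT";

# "^" normally means "start of string" in a pattern, so escape it with a backslash
my @fragments = split(/\^/, $digest);

foreach my $fragment (@fragments) {
    print length($fragment), "\t$fragment\n";
}
```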


*****************************
Subroutine Problem Set
*****************************


Create a subroutine that reverse complements a sequence.
This subroutine should take a nucleotide sequence as a parameter and return
the reverse complement.

Here's the pseudo code:

-- BEGIN PSEUDOCODE --

subroutine reverse_complement {
    get the parameter nucleotide string
    reverse complement the nucleotide string
    return the complemented nucleotide string
}

-- END PSEUDOCODE --

Write a program that takes in a nucleotide string as an argument, calls
the reverse_complement subroutine, and then prints the reverse
complement sequence to STDOUT.

-- BEGIN SAMPLE RUN --

./reverse_complement.pl GAGAGAGAGAGTTTTTTTTT
AAAAAAAAACTCTCTCTCTC

-- END SAMPLE RUN --
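
A sketch of the general shape of a Perl subroutine: arguments arrive in
@_, and return hands a value back to the caller. The subroutine count_base
is a made-up example, not the reverse_complement solution:

```perl
#!/usr/bin/perl
use warnings;
use strict;

# count_base is a made-up example: it counts how often one letter
# occurs in a sequence and returns that number
sub count_base {
    my ($sequence, $base) = @_;    # arguments arrive in the special array @_
    my $count = 0;
    foreach my $letter (split(//, $sequence)) {
        $count++ if $letter eq $base;
    }
    return $count;                 # the value handed back to the caller
}

my $total = count_base("GAGAGTTT", "G");
print "G occurs $total times\n";
```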


*****************************
Bioperl Problem Set
*****************************

Problem Set for Bioperl

Preparation:

	1.  Download uniprot_sprot:

		curl -O "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"

	2. Unzip the file.

		gunzip uniprot_sprot.fasta.gz


PROBLEM 1: Bio::DB::Fasta

	1. Write a script to retrieve all IDs from uniprot_sprot.fasta using the get_all_ids method from Bio::DB::Fasta.

	2. Search through the list of IDs for all IDs that contain the term "HDAC"  

	3. Print the sequences for these proteins, in FASTA format.


PROBLEM 2: Bio::SeqIO

	1. Write a script using Bio::SeqIO to retrieve the CDS translation from a GenBank file (sequence.gb).


PROBLEM 3: Bio::SearchIO


	* On the command line, run the following command to format the file for use as a blast database:

		makeblastdb -in uniprot_sprot.fasta -dbtype prot

	  Here is how you can blast your favorite seq against swissprot:

		blastp -query query.fasta -db uniprot_sprot.fasta -evalue 1e-10 -out query_v_sprot.blastout

	  You can find additional information on BLAST+ at: http://www.ncbi.nlm.nih.gov/books/NBK1763/


   4. Run Blast with your 3 protein sequences from earlier.
          * use the uniprot_sprot.fasta as your database
          * run blast with an e-value cut-off of 1e-10

   5. Parse your Blast output. For Hits with "significance" less than or equal to 1e-50 retrieve every HSP and print in a tab delimited format:
          * QUERY Name
          * HIT Name
          * HSP Evalue

*****************************
Database/DBI Problem Set
*****************************

Problem sets: Databases and Perl DBI

- For this problem set, we have installed a MySQL server on each of the machines.

- In each MySQL server, you'll find a database named progbio2012.

- This database has 5 tables: genes, genes_go, expression, snp, go_terms
	1. 'genes' table lists the location and evidence class of each gene
	2. 'genes_go' table contains the Gene Ontology terms for genes 
       that have a Gene Ontology annotation
	3. 'go_terms' table lists the Gene Ontology descriptions of the GO IDs
	4. 'expression' table contains the expression level of genes in 4 experiments.
	5. 'snps' table lists the SNPs in chr1 and chr10.

- You will need the data in these tables for the problem sets


--- Getting familiar with MySQL ---

1. From the unix command line, follow the steps below to enter the MySQL shell
   and use the database progbio2012:

	 $ mysql -u root
								
	 You're now in the MySQL shell.

	- To use the database progbio2012:

	 mysql> use progbio2012;
								
	- To list the tables in the database:

	 mysql> show tables;
			
			

2. Use the 'explain' command to see the schema of each table
in the MySQL shell:

	 mysql> explain genes; 
								

	- Try the SHOW command to see the CREATE TABLE statement for each table:

	 mysql> show create table genes;
	
	
--- SQL ---

3. Using SQL, perform the following queries:
	a. How many rows are there in the genes table?

	b. How many genes have GO annotations?
	HINT: count(distinct gene_id)

	c. List the number of genes in each evidence class in the genes table
	HINT: Use a COUNT ... GROUP BY query


	d. Using a SQL query that joins the genes_go and expression table, 
	select the day1 value of genes that have the go_term 'chromatin binding'
    ('GO:0003682')

	e. Try the query above again, but limit it to genes on chr1.				
	 
--- PERL DBI ---

4. Do the following using Perl-DBI

	a. Using a similar query to Q3.d above, write a Perl DBI script that produces 
       a tab-delimited text file for genes with go_term 'nucleic acid binding'
       ('GO:0003676')
	The text file should have the columns: gene_id, go_term, day1, day2, day3, day4

	b. Construct a query to find genes where the expression level in day4 is 
       greater than day1. 
	Print out this list of genes.


5. Advanced problems
	a. How many genes in the first 100Mb of chr1 contain snps?

	b. Compute the gene density on chr10 across 1Mb windows. 
	   Assume that Chr10 has a total length of 150Mb.

	c. Compute the average expression level in each experiment (day1, day2, day3, day4) 
       for the genes in each GO term.