*****************************
Unix Basics
*****************************

ls     -- list contents
cd     -- change directory
mkdir  -- make a directory
rm     -- remove files; use caution, it is easy to delete more than you would like
head   -- prints the first few lines to the terminal window
tail   -- prints the last few lines to the terminal window
sort   -- sorts the lines
uniq   -- prints the unique lines (only adjacent duplicates are collapsed, so sort first)
grep   -- finds the lines that contain a pattern
wc     -- counts the number of lines, words and characters
mv     -- move files
cp     -- copy files
date   -- returns the current date and time
pwd    -- return working directory name
ssh    -- remote login
scp    -- remote secure copy
~      -- represents your home directory

man [command] -- manual page for the command

    man ls

try:

    ls -l
    ls -lt

You can string more than one command together with a pipe (|), such that the output of the first command is received by the second command.

    ls -lt | head

You can string more than one command together with a semi-colon (;), such that the commands run sequentially, but the output of one command is not passed into the next command.

    date; some program command; date

You can redirect the output of a command into a file

    grep PATTERN FILE > PATTERN.txt

You can append the output of a command to a file

    grep PATTERN2 FILE >> PATTERN.txt

You can redirect stderr to a file

    command 2> filename

You can redirect the output (stdout) and stderr to a file (in bash)

    command &> filename

Text editors: TextWrangler is a good app to start with.

============
Problem Set
============

Using your text editor create a fasta file and name it sequences.fasta. Make sure it ends up in the proper directory, locally or remotely.

This is fasta file format:

    >seqName description
    ATGGCGTCTTGGCCTTAAAAGCTC

Log into your machine or account.

What is the full path to your home directory?

How many files does it contain? How many directories?

Without using a text editor examine the contents of the file sequences.fasta.

How many lines does this file contain? How many characters?
(Hint: check out the options of wc)

What is the first line of this file? (Hint: read the man page of head)

What are the last 3 lines? (Hint: read the man page of tail)

How many sequences are in the file? (Hint: use grep)

Rename sequences.fasta to something more informative of the sequences the file contains. (Hint: read the man page for mv)

Create a directory called fasta. (Hint: use mkdir)

Copy the fasta file that you renamed to the fasta directory. (Hint: use cp)

Verify that the file is within the fasta directory. (Hint: use ls fasta/)

Delete the original file that you used for copying. (Hint: use rm, be careful)

Read the man pages for rm and cp to find out how to remove and copy a directory.

*****************************
Perl Basics
*****************************

Quick Review

Why Perl for data processing & bioinformatics
 - Fast text processing
 - Regular expressions
 - Extensive module libraries for pre-written tools
 - Large number of users in community of bioinformatics
 - Scripts are often faster to write than full compiled programs

Every Perl script should start with the following lines:

    #!/usr/bin/perl
    use warnings;
    use strict;

Print "hello world". Type this into your text editor and save it. Give the script a .pl extension.

    #!/usr/bin/perl
    use warnings;
    use strict;
    print "hello world\n";

Run it: on the command line type

    perl your_script_name.pl

Now try rewriting your script so that it does not have a newline character (\n). Run it.

Using variables:

$scalar   A scalar contains a single value, which can be a string or a number.
@array    An array contains a list of scalars.
%hash     A hash contains paired values (key/value pairs)

Using a scalar to print:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $phrase = "hello world";
    print "$phrase\n";

    # This is a comment
    # A print statement can combine variables and strings
    print "This is my phrase $phrase\n";

    # this is another way to print the same as above
    # the print function takes a list of arguments
    # in this example "This is my phrase $phrase" is the first argument
    # and "\n" is the second
    print "This is my phrase $phrase" , "\n";

---- note about scope ----
Scope: (we will go over this concept in more detail at a later time) but in summary:

    use strict;   #!!!!! this is important

When you 'use strict' you have to use 'my' when declaring a variable. This means that the first time you use a variable, you put 'my' in front of it.
---------------------------

Something Advanced but fun: tr///

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $pet = "cat";
    print $pet, "\n";      # prints cat
    $pet =~ tr/cat/dog/;   # c becomes d, a becomes o, t becomes g
    print $pet, "\n";      # prints dog
    $pet =~ tr/o/O/;
    print $pet, "\n";      # prints dOg
    $pet =~ tr/gd/wc/;
    print $pet, "\n";      # prints cOw

---- getting help -------
Documentation of perl scripts and modules

Perldoc online
    http://perldoc.perl.org/

Or on your computer - type 'perldoc'
    For functions use -f:          'perldoc -f sprintf'
    For modules just the name:     'perldoc List::Util'

http://www.cpan.org

============
Problem Set
============

1. Create a script called "add.pl" to sum two scalar variables (hint: use +).

2. Try other mathematical operators. Create a new script that adds, subtracts, multiplies and divides values.

3. Test your knowledge of precedence. Create a script that uses many operators together. Do you get the answer that you expect?

4. Create a script to produce the reverse complement of a sequence (hint: use the reverse function and the tr/// operator).

    % reversec.pl GAGAGAGAGAGTTTTTTTTT
    output: AAAAAAAAACTCTCTCTCTC

*****************************
Perl II Problem Set:
*****************************

1.
Create a script that takes two numbers from the command line and adds them.

    % add.pl 2 3
    5

2. Modify the "add" script from the previous problem set so that it checks that both arguments are defined (hint: use the defined function. A plain truth test such as if ($x) would not allow 0, so do not rely on it).

    % add.pl 2 3
    5
    % add.pl 2
    Please provide two numbers.

3. Modify the script again so that it checks that both arguments are positive numbers. Zero is allowed, but -1 is not:

    % add.pl 2 -3
    Please provide two positive numbers.

4. Write a script to compare two strings given as command line arguments and print "right order" if they are in alphabetic order, and "wrong order" if they are not:

    % order.pl Fred Lucy
    right order
    % order.pl Lucy Fred
    wrong order

5. Write a script to compare two strings given on the command line and print them out in correct alphabetic order:

    % reorder Fred Lucy
    Fred Lucy
    % reorder Lucy Fred
    Fred Lucy

6. Write a script named "same.pl" to read two strings from the terminal. Compare them in a case-sensitive manner and print "same" if they are the same, "different" if they are different:

    % same.pl
    Enter string 1: lucy
    Enter string 2: Lucy
    different

7. Modify this script to compare the strings in a case-INsensitive manner (hint: use the "lc" or "uc" functions to change a string to lowercase or uppercase).

8. Write a script named "percent.pl" to calculate percentages, where the percentage is $i/($i+$j) * 100. Make sure that the script does not crash when given two numbers that add up to zero:

    % percent.pl 50 150
    25%
    % percent.pl 50 -50
    You are trying to trick me! at line 4.

9. Modify this script to use the printf() function to produce nicely formatted floating point numbers (hint: try "man sprintf" and "man printf", or "perldoc -f sprintf", or look it up online to learn about this wonderful function).

    % percent.pl 50 150
    25.00 %

10. Run this code. Is its output what you expect? Why?

    for (my $i = 0; $i < 10; $i++) {
      if ($i = 2) {
        print "\$i = $i\n";
      }
    }

11. Write a program named "pali.pl" to detect palindromes.
It must be able to handle changes in case.

    % pali.pl "Madam in Eden Im Adam"
    yes!
    % pali.pl gatcctag
    yes!
    % pali.pl "cold spring harbor laboratory"
    no!

12. Modify the program to work even if there is extraneous punctuation:

    % pali.pl "A man, a plan, a canal... Panama"
    yes!

(Hint: Look up the s/// pattern matching & substitution operator in the Perl reference guide. We will cover this formally in a few days, so you can save this one for later if you like.)

13. Create a file of numbers called numbers.txt with the following content:

    22
    45
    1
    2
    31
    32
    72
    24

14. Here is pseudo-code for a program which uses numbers.txt as input:

    create file myresult.txt and open it for writing output
    open numbers.txt for reading
    while (each line of the file numbers.txt) {
      if (the number is even) {
        if (the number is less than 24) {
          print the line to STDOUT
        }
      } else {
        compute the factorial of the number
        print the factorial to the file myresult.txt (one per line)
      }
    }

  a. What will be printed to STDOUT?
  b. What will be the contents of myresult.txt?
  c. Convert the pseudocode above into a real program.

*****************************
Perl III Problem Set:
*****************************

1. Create a script that divides two numbers provided on the command line.
   Two numbers are required.
   Numbers have to be positive.
   Divisor cannot be zero.

   This part you do in Perl
   ========================
   Write the quotient to STDOUT
   Write any errors to STDERR

   This part you do on the command line in UNIX
   ============================================
   Redirect STDOUT to an output file (out.txt)
   Redirect STDERR to an error file (err.txt)

2. Open a file using the open function. Make all the letters in each line uppercase. (There's a built-in Perl function which will do this.) Write the result out to a new file that was created using the open function.

3. Open the provided fasta file. Print the reverse complement of each sequence.
Make sure to print the output in fasta format including the sequence name and a note in the description that this is the reverse complement. Print to STDOUT and capture the ouput with a command line redirect '>'. 4. Open the provided fastq file. Go thru each line of the file. Count the number of lines and the number of characters per line. Report the: a. total number of lines b. total number of characters c. the average line length 5. Create a script that uses index() to: a. find the first position of 'Nobody' on every line b. find the first position of 'somebody' on every line Use the warn() function to warn the user that 'somebody is here' >seq1 AAGAGCAGCTCGCGCTAATGTGATAGATGGCGGTAAAGTAAATGTCCTATGGGCCACCAATTATGGTGTATGAGTGAATCTCTGGTCCGAGATTCACTGAGTAACTGCTGTACACAGTAGTAACACGTGGAGATCCCATAAGCTTCACGTGTGGTCCAATAAAACACTCCGTTGGTCAAC >seq2 GCCACAGAGCCTAGGACCCCAACCTAACCTAACCTAACCTAACCTACAGTTTGATCTTAACCATGAGGCTGAGAAGCGATGTCCTGACCGGCCTGTCCTAACCGCCCTGACCTAACCGGCTTGACCTAACCGCCCTGACCTAACCAGGCTAACCTAACCAAACCGTGAAAAAAGGAATCT >seq3 ATGAAAGTTACATAAAGACTATTCGATGCATAAATAGTTCAGTTTTGAAAACTTACATTTTGTTAAAGTCAGGTACTTGTGTATAATATCAACTAAAT >seq4 ATGCTAACCAAAGTTTCAGTTCGGACGTGTCGATGAGCGACGCTCAAAAAGGAAACAACATGCCAAATAGAAACGATCAATTCGGCGATGGAAATCAGAACAACGATCAGTTTGGAAATCAAAATAGAAATAACGGGAACGATCAGTTTAATAACATGATGCAGAATAAAGGGAATAATCAATTTAATCCAGGTAATCAGAACAGAGGT @HWI-ST279:219:D0RJNACXX:6:2303:16038:171912 1:N:0:GGCTAC TGAAGTAACACTAACAGAGAAAGTACATGTACTAAACAGTTCCTTAAGTGCAGTTGCTTCCTTGTGATAAACATTCTCTAAATCTCTAGTTGATGTTTGCC + CCCFFEFFHHHHGJJJJJJIJJJJJJJJJHIIIIJGHGHHHIJIJJJJHGIJJJJGJJJJJJJJJJJJJJJJJIJHIJJJJJJJHHHHHFHFFFFFFFDEE @HWI-ST279:219:D0RJNACXX:6:2303:16145:171918 1:N:0:GGCTAC GCTCTGAGATAGGTTCCAACTTCCTCCCGCGAACGCACCCGTACTTGCAGCCCAAAAACGAGAAGGGGAACAATAGAAAGCAAGTGAAAGGATGCTGCTGG + @@CFFFFEHHHHHFHHIJJJIJJDHIJIJJJIIJIIIJIJIEHHHHGHFFFFFDDEDD?@@B@BDDDDDDBDDCCCCDDDDDCC:CDDDDDDCCACDD>>> @HWI-ST279:219:D0RJNACXX:6:2303:16023:171925 1:N:0:GGCTAC 
GACCCATATAAATATGCGGTCACTACATCCATCAACTGTATTTCTAAGTTCAAATTGACTGCCATAGATATTAAGAACCGGAATGTAATTCCATCCATTAC + CCCFFFFFHHHHHJJJIJJHGJJJJJJJJJJJJJJIJJHIIJJIJIIJIHJJJJJJIJJJIJJJJIJIIJJIIIJHIHHHFFDDEEEEEEEDEEDDDDFDD @HWI-ST279:219:D0RJNACXX:6:2303:16127:171927 1:N:0:GGCTAC CACCTGTGGCAAGAAACTTAATGTTCATTCAGGCTCGATTTCAGGCTTCAGCATTATCAAATTTCTCATCAAGAAAGGATGAAAACAAATGTAGGTTACAG + CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJIIJIJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJIHHHFFFFFFFEEEDDDDDDDEDACDDCD @HWI-ST279:219:D0RJNACXX:6:2303:16211:171937 1:N:0:GGCTAC TTTAAATCTCTAACTATCTCTAACACTCAAATATGCTAAGTCTAATAATCTAATAATTTCTAACACTCACATATGCTAACTCTAATAATCTAATAATCTTA + <@@DFBBDDBFHDEABGECDDAC?DDF3CFH?H?CEFGGGC?DGFFDEHCGHCHFGHABC;@;@DA @HWI-ST279:219:D0RJNACXX:6:2303:16195:171950 1:N:0:GGCTAC CCCTGAAAATTAGGCTGCTCCTAGATTAGCAGCATGTTGGTCTAATGGGCCATATTCAAGCTCAATCAGCAAATAAGTAGGGACTTGGTCGGGTTCCAAGG + CCCFFFFFGHHHHJGIJJJJJJJJIJJIJJJJIJJIGIIJIIJJJJIJJIJJJJJJIJIJJJJJJJIJJJJJJIHEFCHHGFFFFEEEEEDDDBDDDEDDB @HWI-ST279:219:D0RJNACXX:6:2303:16100:171952 1:N:0:GGCTAC ATGGAATGGGTTTTGGCATTAGTCTACGTTTAGTACTTCTAATTAGTGTCAAACATTCGATGTGATAGGGATTAAAATTTAGTCCCTAAACCAAACAGGGC + CCCFFFFFHHFHHJJJJJJJJJIJJJJJIJJJJGIJJJJJJJJJJJHIIJJJJJJJJJJJJJIJJJJIJJJJJJJIHHHHHHEFFFFFEEEEDDDDDDDDB @HWI-ST279:219:D0RJNACXX:6:2303:16145:171961 1:N:0:GGCTAC ATGCATCCATGCTAAGATATTTCCTCGTGTGGCACTGTTCAGTGTTCATCAGCAGTGGTTGGATACGTGAACCCCACACGCCAGAAATCATAACGTGCAGT + EFBBHGGIGCGFGIIGGFGEGICDDEGEHEHGEDFGGEH@==@GCAEB;?=B>@BBBCCCA@CACC(,A>A=6=;>AAAA@########################################## @HWI-ST279:219:D0RJNACXX:6:2303:16255:171754 1:N:0:GGCTAC TCCTAGTCTCAACCATAAACGATGCCGACCAGGGATCGGCGGATGTTGCTTATAGGACTCCGCCGGCACCTTATGAGAAATCAAAGTCTTTGGGTTCCGGG + CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJIJJIJIJHDDDCECDCDACCDD@CDDDDDDDDDDDDDDCDCCCDDDDDDDDDADDEDCDDBBDDDDD @HWI-ST279:219:D0RJNACXX:6:2303:16383:171755 1:N:0:GGCTAC TGGGGGATGTATATTTTTTCCTTTTTCCTAGGTATGACCCTCCAGGGGTGGCATCTTGGAATTTTTTCCTGATTTATCAATAGAATTTCGCCCCGCCTTGT + 
1::4AD8@D??<:AFGBB):E?4<*?C4*??*00*?*909)8??F@;-5@################################################### @HWI-ST279:219:D0RJNACXX:6:2303:16283:171764 1:N:0:GGCTAC ATTCCTAATCCAATATCTAGGCATGTTATATCTAAAGCTAACAAAGGAATAACAAATAATATGGCTGAGAACAGTAGGTAAATATCCCATACAGTATCATT + @@@DBDADFHHFDFAHEIGEEDHIEHGGGFDAHHDAHIIIIICFGGCDFHHEEGIGIGEGIGCHEIEHIGGHII77=@==?ACCEDE@DDDCCC>@BCDDE @HWI-ST279:219:D0RJNACXX:6:2303:16473:171773 1:N:0:GGCTAC TTGACGTTTTCCCAAAAGTCTTCGGAGACCTCTTTCCGATGAAAAATCTTTTTTTGTGTAATTCCAAATGTGAAGCAAGTCTCCGTATCTTATTGGTGAAA + CCCFFFFFHHHHHJJJJJFHIJJJJJJIIJJJJJJJJJJJJJJJJJJJJJJJJJHHDFCEFFFEEFEEEDDDEEDDDDDDDEDDDBDDDEEEDEDDDDDDD @HWI-ST279:219:D0RJNACXX:6:2303:16387:171787 1:N:0:GGCTAC CAAATGTTGGTGATTTTTAATTTTTATTTTACTATATTCTAAACCAACCAAACAAGCTTCCTTTCCAGATTTTTAGTGCTATGTTGAGTTTCATATGTTAC + BCCFFFDFHHDFHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJJIJJJJJJJJJJJJJJJJJJHHHHHHFFFFFFFEEEEEEEFEEEEC @HWI-ST279:219:D0RJNACXX:6:2303:16445:171789 1:N:0:GGCTAC TATTCGGAGATGACCTGAACTCCTGAAGCATATGCAGTCTGCTTAGATTAGAGCTCATACATCGCTAATGTCCCAGTAATGTCAGATGCCATCCACATATG + CCCFFFFFHHHHHJJJJJJJJJJJJJJJIJJJJJJJJIIIJIIJJJJJJJJJJIIJJJJJJJJJJJJJJJHHHHHHHFFFFFFFEEEEEEDDDDDDDDDED @HWI-ST279:219:D0RJNACXX:6:2303:16340:171805 1:N:0:GGCTAC CCTATTTTATTGCAAGCTTTTGTGATGATTTTTTAAAACATTTATTGCCAAGTTAGTACAGTGATATATCTATTGGGAAAAGTACTCCAGGTATCATGTCT + CCCFFFFFHHHHHJIJJJJJJJHIJJJJJJJJJJJJJJJJJJJJJJJJJJJJIIJJJJJJJHIJJIJJJJJJJJJJJJHHHHEDFFFFFEECEEEEEDEDD @HWI-ST279:219:D0RJNACXX:6:2303:16446:171814 1:N:0:GGCTAC ATGGGAAATACTTCAAGAAGTTATGAGGGGACATCCTGTACTGTTGAATAGAGCACCTACCCTGCATAGATTAGGCATACAGGCTTTCCAACCCACTTTAG + CCCFFFFFHHHHHJJJJIJJIJJJJJJJJJIJIJJJIJHGHIJIIJJJJJJJJIJJJJJJJJJJJJIJIHHHHHHDEFFFFEEDEEDDDEDDDDDDDDDDD @HWI-ST279:219:D0RJNACXX:6:2303:16469:171833 1:N:0:GGCTAC CAAAATAACATCCTAAAAGTAGAAACATTGAAAGCAATAAACAATAGGTCTTCTTATTCCCTTAAAAAAATACTGAGGGTTATGAAGGGGGGTTTTTCCAA + CCCFFFFFHHGHHIIJJIICHJHIGIJJJJIGEGGGGGIIGIIIGHGIBGHGIIIJJJJJJIIJIJJII################################ @HWI-ST279:219:D0RJNACXX:6:2303:16447:171839 
1:N:0:GGCTAC TTTTCTTGAAAAGAATCCCAATACTTCATTGGGTGGGATGGCGGAACAAACCAAAAAAATTGTCTTATTTGATAAGGTTATGAATTAACAAATAAGACAGG + CCCFFFFFHHHHHJIJJJJJJIJJJJJIJJJJJ?FHIGCHIGIIGIIIJHHHGFFFFDDBCDDDDDEDDEEEDDDDCAACCDDCDDDCDDDDDDDDDDCDB @HWI-ST279:219:D0RJNACXX:6:2303:16422:171845 1:N:0:GGCTAC AATGCTTTAGTTGGTGAGAAGAAGACTGGTGTTACTCACACACCTGGCAAAACAAAGCATTTCCAGACGCTGATAATCTCAGAGGAGCTCACTCTATGTGA + CCCFFFFFHHHHHJGIIJJJJJJJIJJJJCFHIGIIJEIBGHJJIJJJJIJIJJJIJGIJJJIJIJJJHHFFFEECEEEEDDDDDCDDDDDDDDDDDECDC @HWI-ST279:219:D0RJNACXX:6:2303:16378:171848 1:N:0:GGCTAC CATTAAGATTTTGAAGACTTGAATATATAAACTTATTTAAATTTTACATGTTAATGATTTGTTCTTACTTTTTTTAATAATTCCAAAATATATATGAAGAG + CCCFFFFFHHHHHJJJJIJIJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJIJJJJJHIJJJJJJJJJJJJJJJJJHHHHHHFFFFFFDEEFEFFEFEEDDD @HWI-ST279:219:D0RJNACXX:6:2303:16339:171865 1:N:0:GGCTAC TATTAAAATTTAATCAGTTTTTATTGCATTTTCAAATAATTTTGGACAAAAAGTATGAAAATTCTAAAAATTTTCAGTGCACCGAAATTTTAGTCCAAAAC + ;?F<3C:C?;5;;>;55;9? @HWI-ST279:219:D0RJNACXX:6:2303:16496:171869 1:N:0:GGCTAC AAATAATATGTAACATAGCTAGACAACAACTTACATAAGTTGATGTGGTTTATAATAATTTAAATTTGAACTACGATTCGTATGTAAAAATAAGGTGATGT + BCCFFFFFHHHHHJJJJJJJJJIJJJIIIJJJJJJJIJJHIIIJJFHIFHHJJJJJJJJJJJJJJJJJIJJJJJJJJJJHHHHHFFFFFFFEEEEACDDDD @HWI-ST279:219:D0RJNACXX:6:2303:16268:171870 1:N:0:GGCTAC GCAAATTAGTTAAATTAATTGTGTGCAATCAGAAAAATTCATCAATTAATTCTACCTATTGTTTTTTTCTGGGTATAGTATGACTGTAAACTGTAAGTAAA + @@@FFFFFAHHHDHIIIIIGHCHIICHCG>EEHHIGHIIDEIICIGIDFDFDHGCCEEDDCCDD>CC>@CC< @HWI-ST279:219:D0RJNACXX:6:2303:16392:171873 1:N:0:GGCTAC ATGTTTACTATAACACCACATTTTCAAATCATTGTGTAATTAGGCTTAAAAGATTTATCTCGCAATTTACACGTAATCTGTATAATTGGTTTTTATTTTTT + CCCDFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJIIHIJJJJJJJJJJJJJJHIIJJJJJJJJJJJJJJJIJJDHHHHHHDFFFFFFFEEDEDDDDDEEDD @HWI-ST279:219:D0RJNACXX:6:2303:16412:171876 1:N:0:GGCTAC ACATAGTTTGGAGTTTGGACTTTAGAGATGAATATGTTGTTTAACCGGGGACGGGTTCACCACAGGGAAAAATTCACCGCGTGGTGATCGGGGCTGATGAA + @@@DDDDDF8DADFGHIB9FGIGB>BHHC9F4?*:E*1:C=A*?DG:?DH>F6:=41;;CC3;2@B;(,;?9?>355:?3&)5>8?:>:<@########## 
@HWI-ST279:219:D0RJNACXX:6:2303:16336:171880 1:N:0:GGCTAC TGAAGAAACTAAGGTGAGGTATCCTAGTCAGGGGTCAATTTGGCCACAGAAAATAACCAGTGCTGGCTGCTTAAACTAATCCATCATCACAGCACTGATTT + @@@FFFFFGHHHHJFF4EFGBHGHIBHEHHHHHIFDFHIIGGIIHIIGGIIJIJJJGHIG@@@FHAC9@EEHEAC;B@DFCEDCEEDDDDCCBDDCCDDDD @HWI-ST279:219:D0RJNACXX:6:2303:16300:171889 1:N:0:GGCTAC TGAGCCAGCAGAAGTATGCTTCTAATGTTGTGGAAAAGTGTCTATCTTTTGGAACTCCTGATGAACGTGAAGGCCTTATAAGAGAGATTGTATCCTCTGGC + CCCFFFFFGHHHHJCGHJJJJIJJJJJJJJIIJJJJJJFHIGIJJJJJJJJJHIJJJJJIJJJJJJJIJIGIJHHHHHHFFFFFFDDEEDDEEFDEDDDDC Nobody by Shel Silverstein Nobody loves me, Nobody cares, Nobody picks me peaches and pears. Nobody offers me candy and Cokes, Nobody listens and laughs at me jokes. Nobody helps when I get in a fight, Nobody does all my homework at night. Nobody misses me, Nobody cries, Nobody thinks I'm a wonderful guy. So if you ask me who's my best friend, in a whiz, I'll stand up and tell you that Nobody is. But yesterday night I got quite a scare, I woke up and Nobody just wasn't there. I called out and reached out for Nobody's hand, In the darkness where Nobody usually stands. Then I poked through the house, in each cranny and nook, But I found somebody each place that I looked. I searched till I'm tired, and now with the dawn, There's no doubt about it- Nobody's gone! ***************************** Array Problem Set ***************************** 1. Create a shuffled sequence Turn a DNA string into an array with split() Use a for loop to perform the following procedure N times (N = length of seq) Select a random position A with rand() Select a random position B with rand() Exchange the letters at array indices A and B Print the now shuffled sequence 2a. Start with 2 very similar DNA sequences. Align with ClustalW or some other web alignment application. Output should be in fasta format. Store (copy and paste) the sequence, including dashes, from each ClustalW fasta output in a separate string variable inside your script. 
Turn each string into an array with split()
Use a for loop to compare each index for nucleotide differences.
Report the nucleotide position of each difference.

2b. Do the same as above, but instead of copying and pasting into string variables, import the sequences from a file.

3. Calculate GC content
   Turn a DNA string into an array with split()
   Use a foreach loop to look at each nucleotide in turn
   Calculate total length of the sequence
   Keep a running total of C's and G's
   Print the calculated GC content as a percent.

4. Run this code. Is its output what you expect? Why?

    for (my $i = 0; $i < 10; $i++) {
      if ($i = 2) {
        print "\$i = $i\n";
      }
    }

*****************************
Hash Problem set
*****************************

1. Determine the codon usage for a DNA sequence
   - Create a string containing a DNA sequence
   - In a for loop
      - use the function substr ($seq, $offset, 3) to extract codons
      - store each codon and the number of times it has occurred in a hash
   - Report the codon usage

2. Create a report of the expression levels and sequences of genes expressed in liver.
   - There is a tab delimited file of expression data.
   - This file contains the following information: GeneID, tissue the gene is expressed in, expression level, and gene sequence.
   - Open this file in your script using open().
   - In a loop:
      - Read a line in the file
      - remove the "\n" at the end of the line
      - split the line on tabs using split (/\t/ , $line)
      - Store the data on each line in 3 different hashes as described below

           hash      key      value
           %tissue   GeneID   tissue
           %expr     GeneID   expression level
           %seq      GeneID   sequence

   - Now search the %tissue hash for genes that are expressed in liver. Make a list of the GeneIDs corresponding to these genes.
   - Create a report of the GeneID and expression level of these genes

* Expression Data *
CDC2 brain 34.5 AGCGCGGTGAGTTTGAAACTGCTCGCACTTGGCTTCAAAGCTGGCTCTTGGAAATTGAGCGGAGAGCGAC
ALT liver 9.2 ATGTTCAGAAGAAGTTTAAAACTATTAAGTAAAGAAACCATTACTCGTGTTAAACCAAATACAACTATTG
ARG1 liver 458.5 AGCCGATGCGTGGCGCCCCGGCGGCCACGCCGCCGCCCGCTACGGAATCGGCGGCCGAGCGGCTGCGCCG
TSHR thyroid 2.8 CCTCCTCCACAGTGGTGAGGTCACAGCCCCTTGGAGCCCTCCCTCTTCCCACCCCTCCCGCTCCCGGGTC

*****************************
Regular Expression Problem Set
*****************************

1. The enzyme ApoI has a restriction site: R^AATTY where R and Y are degenerate nucleotides. See the IUPAC table to identify the nucleotide possibilities for R and Y. Write a regular expression that will match occurrences of the site in a sequence. (hint: what are you going to do about the actual cut site, represented by the '^'?)

2. Use the regular expression you just wrote to find all the restriction sites in the following sequence. Be sure to think about how to handle the newlines!

GAATTCAAGTTCTTGTGCGCACACAAATCCAATAAAAACTATTGTGCACACAGACGCGAC
TTCGCGGTCTCGCTTGTTCTTGTTGTATTCGTATTTTCATTTCTCGTTCTGTTTCTACTT
AACAATGTGGTGATAATATAAAAAATAAAGCAATTCAAAAGTGTATGACTTAATTAATGA
GCGATTTTTTTTTTGAAATCAAATTTTTGGAACATTTTTTTTAAATTCAAATTTTGGCGA
AAATTCAATATCGGTTCTACTATCCATAATATAATTCATCAGGAATACATCTTCAAAGGC
AAACGGTGACAACAAAATTCAGGCAATTCAGGCAAATACCGAATGACCAGCTTGGTTATC
AATTCTAGAATTTGTTTTTTGGTTTTTATTTATCATTGTAAATAAGACAAACATTTGTTC
CTAGTAAAGAATGTAACACCAGAAGTCACGTAAAATGGTGTCCCCATTGTTTAAACGGTT
GTTGGGACCAATGGAGTTCGTGGTAACAGTACATCTTTCCCCTTGAATTTGCCATTCAAA
ATTTGCGGTGGAATACCTAACAAATCCAGTGAATTTAAGAATTGCGATGGGTAATTGACA
TGAATTCCAAGGTCAAATGCTAAGAGATAGTTTAATTTATGTTTGAGACAATCAATTCCC
CAATTTTTCTAAGACTTCAATCAATCTCTTAGAATCCGCCTCTGGAGGTGCACTCAGCCG
CACGTCGGGCTCACCAAATATGTTGGGGTTGTCGGTGAACTCGAATAGAAATTATTGTCG
CCTCCATCTTCATGGCCGTGAAATCGGCTCGCTGACGGGCTTCTCGCGCTGGATTTTTTC
ACTATTTTTGAATACATCATTAACGCAATATATATATATATATATTTAT

3. Determine the site(s) of the cut in the above sequence.
Print out the sequence with "^" at the cut site.

Hints:
   Use subpatterns (parentheses and $1, $2) to find the cut site within the pattern.
   Use s///

Example: if the pattern is GACGT^CT the following sequence

    AAAAAAAAGACGTCTTTTTTTAAAAAAAAGACGTCTTTTTTT

would be cut like this:

    AAAAAAAAGACGT^CTTTTTTTAAAAAAAAGACGT^CTTTTTTT

4. Now that you've done your restriction digest, determine the lengths of your fragments and sort them by length (in the same order they would separate on an electrophoresis gel).

Hint: take a look at the split man page or think about storing your matches in an array. With one of these two approaches you should be able to convert this string:

    AAAAAAAAGACGT^CTTTTTTTAAAAAAAAGACGT^CTTTTTTT

into this array:

    ("AAAAAAAAGACGT","CTTTTTTTAAAAAAAAGACGT","CTTTTTTT")

*****************************
Subroutine Problem Set
*****************************

Create a subroutine that reverse complements a sequence. This subroutine should take a nucleotide sequence as a parameter and return the reverse complement. Here's the pseudo code:

-- BEGIN PSEUDOCODE --
subroutine reverse_complement {
  get the parameter nucleotide string
  reverse complement the nucleotide string
  return the complemented nucleotide string
}
-- END PSEUDOCODE --

Write a program that takes in a nucleotide string as an argument, calls the reverse_complement subroutine, and then prints the reverse complement sequence to STDOUT.

-- BEGIN SAMPLE RUN --
./reverse_complement.pl GAGAGAGAGAGTTTTTTTTT
AAAAAAAAACTCTCTCTCTC
-- END SAMPLE RUN --

*****************************
Bioperl Problem Set
*****************************

Problem Set for Bioperl

Preparation:

1. Download uniprot_sprot:

    curl -O "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"

2. Unzip the file.

    gunzip uniprot_sprot.fasta.gz

PROBLEM 1: Bio::DB::Fasta

1. Write a script to retrieve all IDs from uniprot_sprot.fasta using the get_all_ids method from Bio::DB::Fasta.

2.
Search through the list of IDs for all IDs that contain the term "HDAC"

3. Print the sequences for these proteins, in FASTA format.

PROBLEM 2: Bio::SeqIO

1. Write a script using Bio::SeqIO to retrieve the CDS translation from a genbank file.

PROBLEM 3: Bio::SearchIO

* On the command line, run the following command to format the file for use as a blast database:

    makeblastdb -in uniprot_sprot.fasta -dbtype prot

Here is how you can blast your favorite seq against swissprot:

    blastp -query query.fasta -db uniprot_sprot.fasta -evalue 1e-10 -out query_v_sprot.blastout

You can find additional information on BLAST+ at: http://www.ncbi.nlm.nih.gov/books/NBK1763/

4. Run Blast with your 3 protein sequences from earlier.
   * use the uniprot_sprot.fasta as your database
   * run blast with an e-value cut-off of 1e-10

5. Parse your Blast output. For Hits with "significance" less than or equal to 1e-50 retrieve every HSP and print in a tab delimited format:
   * QUERY Name
   * HIT Name
   * HSP Evalue

*****************************
Database/DBI Problem Set
*****************************

Problem sets: Databases and Perl DBI

- For this problem set, we have installed MySQL servers on each of the machines.
- In each MySQL server, you'll find a database named progbio2012.
- This database has 5 tables: genes, genes_go, expression, snp, go_terms
   1. 'genes' table lists the location and evidence class of each gene
   2. 'genes_go' table contains the Gene Ontology terms for genes that have a Gene Ontology annotation
   3. 'go_terms' lists the Gene Ontology descriptions of the GO Ids
   4. 'expression' table contains the expression level of genes in 4 experiments.
   5. 'snps' lists the SNPs in chr1 and chr10.
- You will need the data in these tables for the problem sets

--- Getting familiar with mySQL ---

1. Follow the steps below to enter the MySQL shell and use the database progbio2012

   On the unix command line:

    $ mysql -u root

   You're now in the MySQL shell.
   - To use the database progbio2012:

    mysql> use progbio2012;

   - To list the tables in the database:

    mysql> show tables;

--- Getting familiar with mySQL ---

2. Use the 'explain' command to see the schema of each table

   In the MySQL shell:

    mysql> explain genes;

   - Try out the SHOW command to see the SQL CREATE syntax for each table:

    mysql> show create table genes;

--- SQL ---

3. Using SQL, perform the following queries:
   a. How many rows are there in the genes table?
   b. How many genes have GO annotations?
      HINT: count(distinct gene_id)
   c. List the number of genes in each evidence class in the genes table
      HINT: use a COUNT ... GROUP BY query
   d. Using a SQL query that joins the genes_go and expression tables, select the day1 value of genes that have the go_term 'chromatin binding' ('GO:0003682')
   e. Try the query above again, but limit it to genes on chr1.

--- PERL DBI ---

4. Do the following using Perl DBI
   a. Using a similar query to Q3.d above, write a Perl DBI script that produces a tab-delimited text file for genes with go_term 'nucleic acid binding' ('GO:0003676')
      The text file should have the columns: gene_id, go_term, day1, day2, day3, day4
   b. Construct a query to find genes where the expression level in day4 is greater than in day1. Print out this list of genes.

5. Advanced problems
   a. How many genes in the first 100Mb of chr1 contain SNPs?
   b. Compute the gene density on chr10 across 1Mb windows. Assume that chr10 has a total length of 150Mb.
   c. Compute the average expression level in each experiment (day1, day2, day3, day4) for the genes in each GO term.
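
A note on the windowing in problem 5b: the arithmetic for assigning a gene to a 1Mb window can be tried out before doing any database work. What follows is only a sketch under stated assumptions, not a model answer. The subroutine name window_counts and the start positions are made up for illustration, and a real solution would fetch the chr10 gene coordinates from the 'genes' table with DBI (the column names are not given here, so none are assumed) instead of hard-coding them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many gene start positions fall into each fixed-size window.
# $chr_length and $window are in base pairs; $starts is a reference to an
# array of 1-based start coordinates (hypothetical input, not course data).
sub window_counts {
    my ($chr_length, $window, $starts) = @_;

    # Number of windows, rounded up so that a partial last window is kept
    my $n_windows = int(($chr_length + $window - 1) / $window);
    my @counts = (0) x $n_windows;

    foreach my $start (@$starts) {
        # Skip coordinates that lie outside the chromosome
        next if $start < 1 || $start > $chr_length;
        # Positions 1..$window map to window 0, the next block to window 1, ...
        my $index = int(($start - 1) / $window);
        $counts[$index]++;
    }
    return @counts;
}

# Made-up start positions, for illustration only
my @gene_starts = (120_500, 870_000, 1_000_000, 1_000_001, 2_450_300, 149_999_999);
my @counts = window_counts(150_000_000, 1_000_000, \@gene_starts);

# Report only the windows that contain at least one gene
for my $i (0 .. $#counts) {
    next unless $counts[$i];
    printf "%d-%dMb\t%d\n", $i, $i + 1, $counts[$i];
}
```

Counting by start position is a simplification chosen for this sketch: a gene that spans a window boundary is counted once, in the window where it starts. Whether that is the right definition of density for the exercise is a decision left to you.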