Problem sets: Databases and Perl DBI

- For this problem set, we have installed MySQL servers in each of the machines.

- In each MySQL server, you'll find a database named progbio2012.

- This database has 5 tables: genes, genes_go, expression, snp, go_terms
	1. 'genes' table lists the location and evidence class of each gene
	2. 'genes_go' table contains the Gene Ontology terms for genes 
       that have a Gene Ontology annotation
	3. 'go_terms' list the Gene Ontology descriptions of the GO Ids
	4. 'expression' table contains the expression level of genes in 4 experiments.
	5. 'snps' list of SNPs in chr1 and chr10.

- You will need the data in these tables for the problem sets


--- Getting familiar with mySQL ---

1. Follow the steps below to enter the MySQL shell and use the 
   database progbio2012 on the unix command line:

	 $ mysql -u root
								
	 You're now in the MySQL shell.

	- To use the database progbio2012:

	 mysql> use progbio2012;
								
	- To list the tables in the database:

	 mysql> show tables;
			
			
--- Getting familiar with mySQL ---

2. Use the 'explain' command to see the schema of each table
on the unix command line:

	 mysql> explain genes; 
								

	- Try out SHOW command to see the SQL-CREATE syntax for each table:

	 mysql> show create table genes;
	
	
--- SQL ---

3. Using SQL, perform the following queries:
	a. How many rows are there in the gene table?

	b. How many genes have GO annotations?
	HINT: count(distinct gene_id)

	c. List the number of genes in each evidence class in the genes table
	HINT: Using a COUNT ... GROUP BY query


	d. Using a SQL query that joins the genes_go and expression table, 
	select the day1 value of genes that have the go_term 'chromatin binding'
    ('GO:0003682')

	e. Try the query above again, but limit it to genes on chr1.				
	 
--- PERL DBI ---

4. Do the following using Perl-DBI

	a. Using a similar query to Q3.d above, write a Perl DBI script that produces 
       a tab-delimited text file for genes with go_term 'nucleic acid binding'
       ('GO:0003676')
	The text file should have the columns: gene_id, go_term, day1, day2, day3, day4

	b. Construct a query to find genes where the expression level in day4 is 
       greater than day1. 
	Print out this list of genes.


5. Advanced problems
	a. How many genes in the first 100Mb of chr1 contain snps?

	b. Compute the gene density on chr10 across 1Mb windows. 
	   Assume that Chr10 has a total length of 150Mb.

	c. Compute the average expression level in each experiment (day1, day2, day3, day4) 
       for the genes in each GO term.