Open your terminal or Git Bash, and enter:
cd # changes to your home directory
curl http://giladlab.uchicago.edu/data/Metahit_index.txt > Metahit_index.txt
The curl command sends an HTTP request and dumps the result to standard out. Here, we redirect the output of curl to Metahit_index.txt
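Equivalently, curl's -o flag writes the download straight to a named file instead of standard out:

curl -o Metahit_index.txt http://giladlab.uchicago.edu/data/Metahit_index.txt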
The above command should have downloaded a data table. Let's check that the data are in the expected format: tab-delimited text.

head Metahit_index.txt

4475667.3 MH0001 Denmark female 49 25.55 N 20203603 human-associated habitat
4475668.3 MH0002 Denmark female 59 27.28 N 20203603 human-associated habitat
...

If this is what you see, you have the right table. If not, you may need to troubleshoot why curl didn't give you what you were expecting.
Let's move the data into the repo directory. This is what we would do if it were really our project: we'd have a folder named "Future Nature Paper" and place our data in it, rather than clogging up our home directory. The command mv moves the data:
mv Metahit_index.txt ~/2014-05-12-cshl
This is clearly very important data. Let's make a backup so that while we're writing and running Python scripts we don't accidentally destroy the original.

cp Metahit_index.txt Metahit_index.backup.txt

Now, if you happen to damage your file while we work through the Python examples in a bit, you have a backup you can revert to. Let's have a look at the data:
less Metahit_index.txt
You'll see the data has several columns:

1. identifier
2. another identifier
3. country of origin
4. sex
5. age
6. BMI
7. disease status
8. PMID of the Metahit paper
9. sampling location
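As an aside: since the columns are tab-separated, you can peek at a single column with cut (a command we'll use more later). For example, to see just the sex column (column 4):

cut -f4 Metahit_index.txt | head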
Let's look at the bottom of the file using tail:
tail Metahit_index.txt
How many individuals are in this study?
wc -l Metahit_index.txt
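wc on its own reports line, word, and byte counts, which can be a handy quick sanity check:

wc Metahit_index.txt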
One very useful command that you can run in the shell is grep. grep is a search tool that looks for the text you give it, line by line. For instance, our data include Danish individuals and Spanish individuals. If we want to see the data for just the Spanish individuals, we can type:
grep Spain Metahit_index.txt
If we wanted to see only females:
grep female Metahit_index.txt
What happens if we want to see only males?
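Hint: 'male' is a substring of 'female', so a plain grep male will match every individual. One way around this (assuming males are recorded as 'male' in the sex column) is grep's -w flag, which only matches whole words:

grep -w male Metahit_index.txt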
We didn't properly get to explore piping yesterday, so let's use pipes to get some preliminary summary statistics on this data. Piping chains commands together: the output of one command becomes the input of the next, and only the output of the final command is printed. So we can combine the pipe, grep, and wc to do some interesting things.
How many women were in the study?
grep female Metahit_index.txt | wc -l
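Pipes are not limited to two commands. As a sketch (assuming column 3 holds the country, as listed above), we could chain cut, sort, and uniq -c to count how many individuals come from each country:

cut -f3 Metahit_index.txt | sort | uniq -c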
Short Exercise
Find how many Spanish people were in the study.
Pull out all of the information for only the first five people who did not have a disease in the dataset and save it to a file in your home directory called 5-healthy-spanish-participants.txt (one possible solution is sketched below).
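One possible sketch of a solution to the second exercise (assuming, as the filename suggests, that we want Spanish participants, and that healthy individuals are marked 'N' in the disease-status column, as in the rows shown earlier):

grep Spain Metahit_index.txt | grep -w N | head -n 5 > ~/5-healthy-spanish-participants.txt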
Go to the data directory:

cd ~/2014-05-12-cshl/materials/data/cuffdiff
This is a list of a few commands that we will use to do a little data mining of a text file containing a comparison of gene expression data from an RNA-Seq experiment.
| Command | Description |
|---|---|
| \| (pipe) | strings together the inputs/outputs of a series of commands |
| cut | extracts sections from each line of text |
| grep | searches for patterns in text |
| sort | orders lines in text |
| head | prints the top N lines of text |
| wc | counts words, characters or lines |
You have a tab-delimited text file, gene_exp.txt, that contains data from a differential gene expression analysis. Each line describes a comparison of numerical expression levels for one gene in two samples.
What is the file structure? Without options, head will print the top 10 lines of the file:

$ head gene_exp.txt
gene sample_1 sample_2 status value_1 value_2 significant
AT1G01010 WT hy5 NOTEST 0 1.49367 no
AT1G01020 WT hy5 NOTEST 7.27837 10.7195 no
AT1G01030 WT hy5 NOTEST 1.18638 1.10483 no
AT1G01040 WT hy5 NOTEST 0.239843 2.24208 no
AT1G01046 WT hy5 NOTEST 0 0 no
AT1G01050 WT hy5 OK 9.06975 23.5089 yes
AT1G01060 WT hy5 NOTEST 4.04534 6.46964 no
AT1G01070 WT hy5 NOTEST 1.24918 2.41377 no
AT1G01073 WT hy5 NOTEST 0 0 no
This is a well-formatted, tab-delimited text file. The header line describes the columns for us. We can use this to help answer some questions. Note that cut assumes the columns in each row are tab-delimited by default, while sort splits fields on whitespace by default.
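If your data use a different delimiter, both tools let you say so explicitly: cut -d sets the field delimiter and sort -t sets the field separator. For example (genes.csv here is a hypothetical comma-separated file, just to illustrate the flags):

$ cut -d',' -f1 genes.csv
$ sort -t$'\t' -k5 -n gene_exp.txt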
We can use wc -l to count the lines:

$ wc -l gene_exp.txt
200 gene_exp.txt
We can search for OK and yes in the file, then count how many lines are returned by grep. Note the use of the pipe symbol '|': wc -l is acting on the text printed by grep, not on the input file.
$ grep OK gene_exp.txt | wc -l
32
$ grep yes gene_exp.txt | wc -l
12
The file has 200 lines, one of which is the header, so it describes 199 genes. Of these, 32 had enough data to do a comparison and 12 had significantly different expression.
There is not a lot of room for free text in this example, but it can't hurt to check; this sort of thing happens all the time in real life. We can use cut to drop the column where a value is supposed to appear (column 7 for 'yes'), then search the remaining columns. Remove column 7, then search for yes:
$ cut -f1-6 gene_exp.txt | grep yes
No results. That is good. For OK we need to search all columns except column 4:

$ cut -f1-3,5-7 gene_exp.txt | grep OK

No results. The file is good. Note that the -f argument could also have been written as -f1,2,3,5,6,7.
We can use grep to get the 'yes' lines, then use sort to order the lines based on the numeric values in column 5 (value_1). We pipe the output to head so we just look at the top 10 lines for now. The -k5 argument means sort on column (key) 5; -n means sort numerically.
$ grep 'yes' gene_exp.txt | sort -k5 -n | head
AT1G01320 WT hy5 OK 4.94764 20.8172 yes
AT1G01050 WT hy5 OK 9.06975 23.5089 yes
AT1G02120 WT hy5 OK 11.3678 23.8411 yes
AT1G01610 WT hy5 OK 14.4058 5.2457 yes
AT1G01120 WT hy5 OK 15.6414 7.70519 yes
AT1G01430 WT hy5 OK 19.7868 10.301 yes
AT1G01090 WT hy5 OK 55.3618 33.6315 yes
AT1G01170 WT hy5 OK 71.5683 21.235 yes
AT1G02500 WT hy5 OK 71.7299 32.6785 yes
AT1G02140 WT hy5 OK 105.425 65.1563 yes
You may have noticed that cut and sort use different arguments for the same thing (column number). The collection of tools in unix-like operating systems evolved over time from a variety of sources and authors, so their command line arguments are not always consistent. If in doubt:
man cut
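The same advice applies to sort: man sort explains that -k5 starts the sort key at field 5 and runs to the end of the line, while -k5,5 restricts it to field 5 alone. For this file either form should give the same ordering:

$ grep 'yes' gene_exp.txt | sort -k5,5 -n | head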
Note that the sort above is ascending, so column 5 starts with the smallest values; these are the least abundant genes, and we want the most abundant. We are not quite there yet. We can use the -r flag to sort in descending order.
$ grep 'yes' gene_exp.txt | sort -k5 -n -r | head
AT1G01100 WT hy5 OK 275.52 192.323 yes
AT1G01620 WT hy5 OK 178.441 80.266 yes
AT1G02140 WT hy5 OK 105.425 65.1563 yes
AT1G02500 WT hy5 OK 71.7299 32.6785 yes
AT1G01170 WT hy5 OK 71.5683 21.235 yes
AT1G01090 WT hy5 OK 55.3618 33.6315 yes
AT1G01430 WT hy5 OK 19.7868 10.301 yes
AT1G01120 WT hy5 OK 15.6414 7.70519 yes
AT1G01610 WT hy5 OK 14.4058 5.2457 yes
AT1G02120 WT hy5 OK 11.3678 23.8411 yes
OK, now we have the 10 most abundant genes in sample 1 (among those with significantly different expression). Since we were only asked to list the genes, we can use cut to get just the gene names:
$ grep 'yes' gene_exp.txt | sort -k5 -n -r | head | cut -f1
AT1G01100
AT1G01620
AT1G02140
AT1G02500
AT1G01170
AT1G01090
AT1G01430
AT1G01120
AT1G01610
AT1G02120
And we have our answer!
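As a small variation to try, ranking by expression in sample 2 instead only requires changing the sort column to 6 (value_2):

$ grep 'yes' gene_exp.txt | sort -k6 -n -r | head | cut -f1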