Chapter 13 Miscellanea
Tools like sort, head and tail, grep, awk, and sed represent a powerful toolbox, particularly when used with the standard input and standard output streams. There are many other useful command line utilities, and we’ll cover a few more, but we needn’t spend quite as much time with them as we have for awk and sed.
Manipulating Line Breaks
All of the features of the tools we’ve covered so far assume the line as the basic unit of processing; awk processes columns within each line, sed matches and replaces patterns in each line (but not easily across lines), and so on. Unfortunately, sometimes the way data break over lines isn’t convenient for these tools. The tool tr translates sets of characters in its input to another set of characters as specified: ... | tr '<set1>' '<set2>'32
As a brief example, tr 'TA' 'AT' pz_cDNAs.fasta would translate all T characters into A characters, and vice versa (this goes for every T and A in the file, including those in header lines, so this tool wouldn’t be too useful for manipulating FASTA files). In a way, tr is like a simple sed. The major benefit is that, unlike sed, tr does not break its input into a sequence of lines that are operated on individually, but the entire input is treated as a single stream. Thus tr can replace the special “newline” characters that encode the end of each line with some other character.
On the command line, such newline characters may be represented as \n, so a file with the following three lines
line 1
line 2
line 3
could alternatively be represented as line 1\nline 2\nline 3\n (most files end with a final newline character). Supposing this file was called lines.txt, we could replace all of the \n newlines with # characters.
[oneils@mbp ~/apcb/intro/fasta_stats]$ cat lines.txt | tr '\n' '#'
line 1#line 2#line 3#oneils@mbp ~/apcb/intro/fasta_stats$
Notice in the above that even the final newline has been replaced, and our command prompt printed on the same line as the output. Similarly, tr (and sed) can replace characters with newlines, so tr '#' '\n' would undo the above.
Using tr in combination with other utilities can be helpful, particularly for formats like FASTA, where a single “record” is split across multiple lines. Suppose we want to extract all sequences from pz_cDNAs.fasta with nReads greater than 5. The strategy would be something like:
Identify a character not present in the file, perhaps an
@or tab character\t(and check withgrepto ensure it is not present before proceeding).Use
trto replace all newlines with that character, for example,tr '\n' '@'.Because
>characters are used to indicate the start of each record in FASTA files, usesedto replace record start>characters with newlines followed by those characters:sed -r 's/>/\n>/g'.At this point, the stream would look like so, where each line represents a single sequence record (with extraneous
@characters inserted):>PZ7180000027934 nReads=5 cov=2.32231@TTTAATGATCAGTAAAGTTATAGTAGTTGTATGTACAATATT >PZ456916 nReads=1 cov=1@AAACTGTCTCTAATTAATTTATAAAATTTAATTTTTTAGTAAAAAAGCTAAAATA >PZ7180000037718 nReads=9 cov=6.26448@ACTTTTTTTTTAATTTATTTAATTATATTAACTAATAAATCC >PZ7180000000004_TY nReads=86 cov=36.4238@CCAAAGTCAACCATGGCGGCCGGGGTACTTTATACTTAUse
grep,sed,awk, and so on to select or modify just those lines of interest. (If needed, we could also usesedto remove the inserted@characters so that we can process on the sequence itself.) For our example, usesed -r 's/=/ /1' | awk '{if($3 > 5) {print $0}}'to print only lines where the nReads is greater than 5.Reformat the file back to FASTA by replacing the
@characters for newlines, withtrorsed.The resulting stream will have extra blank lines as a result of the extra newlines inserted before each > character. These can be removed in a variety of ways, including
awk '{if(NF > 0) print $0}'.
Counting Duplicate Lines
We saw that sort with the -u flag can be used to remove duplicates (defined by the key columns used). What about isolating duplicates, or otherwise counting or identifying them? Sadly, sort isn’t up to the task, but a tool called uniq can help. It collapses consecutive, identical lines. If the -c flag is used, it prepends each line with the number of lines collapsed: uniq <file> or ... | uniq.
Because uniq considers entire lines in its comparisons, it is somewhat more rigid than sort -u; there is no way to specify that only certain columns should be used in the comparison.34 The uniq utility will also only collapse identical lines if they are consecutive, meaning the input should already be sorted (unless the goal really is to merge only already-consecutive duplicate lines). Thus, to identify duplicates, the strategy is usually:
- Extract columns of interest using
awk. - Sort the result using
sort. - Use
uniq -cto count duplicates in the resulting lines.
Let’s again consider the output of ./fasta_stats.py pz_cDNAs.fasta, where column 4 lists the most common 5-mer for each sequence. Using this extract/sort/uniq pattern, we can quickly identify how many times each 5-mer was listed. We’ll also use the tool cut to extract the 4th column (we could also choose to use awk '{print $4}').
[oneils@mbp ~/apcb/intro/fasta_stats]$ ./fasta_stats.py pz_cDNAs.fasta | \
> grep -v '#' | \
> cut -f 4 | \
> sort | \
> uniq -c | \
> less -S
The result lists the counts for each 5-mer. We could continue by sorting the output by the new first column to identify the 5-mers with the largest counts.
45 AAAAA
1 AAAAG
8 AAAAT
18 AAATA
1 AAATG
4 AAATT
...
It is often useful to run uniq -c on lists of counts produced by uniq -c. Running the result above through awk '{print $1}' | sort -n | uniq -c reveals that 88 5-mers are listed once, 14 are listed twice, and so on.
88 1
14 2
8 3
2 4
3 5
3 6
...
Counting items with uniq -c is a powerful technique for “sanity checking” data. If we wish to check that a given column or combination of columns has no duplicated entries, for example, we could apply the extract/sort/uniq strategy followed by awk '{if($1 > 1) print $0}'. Similarly, if we want to ensure that all rows of a table have the same number of columns, we could run the data through awk '{print NF}' to print the number of columns in each row and then apply extract/sort/uniq, expecting all column counts to be collapsed into a single entry.
For-Loops in bash
Sometimes we want to run the same command or similar commands as a set. For example, we may have a directory full of files ending in .tmp, but we wished they ended in .txt.
[oneils@mbp ~/apcb/intro/temp]$ ls
file10.tmp file13.tmp file16.tmp file19.tmp file2.tmp file5.tmp file8.tmp
file11.tmp file14.tmp file17.tmp file1.tmp file3.tmp file6.tmp file9.tmp
file12.tmp file15.tmp file18.tmp file20.tmp file4.tmp file7.tmp
Because of the way command line wildcards work, we can’t use a command like mv *.tmp *.txt; the *.tmp would expand into a list of all the files, and *.txt would expand into nothing (as it matches no existing file names).
Fortunately, bash provides a looping construct, where elements reported by commands (like ls *.tmp) are associated with a variable (like $i), and other commands (like mv $i $i.txt) are executed for each element.
[oneils@mbp ~/apcb/intro/temp$ for i in $(ls *.tmp); do mv $i ]$i.txt; done
[oneils@mbp ~/apcb/intro/temp]$ ls
file10.tmp.txt file14.tmp.txt file18.tmp.txt file2.tmp.txt file6.tmp.txt
file11.tmp.txt file15.tmp.txt file19.tmp.txt file3.tmp.txt file7.tmp.txt
file12.tmp.txt file16.tmp.txt file1.tmp.txt file4.tmp.txt file8.tmp.txt
file13.tmp.txt file17.tmp.txt file20.tmp.txt file5.tmp.txt file9.tmp.txt
It’s more common to see such loops in executable scripts, with the control structure broken over several lines.
#!/bin/bash
for i in $(ls *.tmp); do
mv $i $i.txt;
done
This solution works, though often looping and similar programming techniques (like if-statements) in bash become cumbersome, and using a more robust language like Python may be the better choice. Nevertheless, bash does have one more interesting trick up its sleeve: the bash shell can read data on standard input, and when doing so attempts to execute each line. So, rather than using an explicit for-loop, we can use tools like awk and sed to “build” commands as lines. Let’s remove the .tmp from the middle of the files by building mv commands on the basis of a starting input of ls -1 *.tmp* (which lists all the files matching *.tmp* in a single column). First, we’ll build the structure of the commands.
[oneils@mbp ~/apcb/intro/temp]$ ls -1 *.tmp* | \
> awk '{print "mv "$1" "$1}'
mv file10.tmp.txt file10.tmp.txt
mv file11.tmp.txt file11.tmp.txt
mv file12.tmp.txt file12.tmp.txt
...
To this we will add a sed -r s/\.tmp//2 to replace the second instance of .tmp with nothing (remembering to escape the period in the regular expression), resulting in lines like
mv file10.tmp.txt file10.txt
mv file11.tmp.txt file11.txt
mv file12.tmp.txt file12.txt
...
After the sed, we’ll pipe this list of commands to bash, and our goal is accomplished.
[oneils@mbp ~/apcb/intro/temp]$ ls -1 *.tmp.* | \
> awk '{print "mv "$1" "$1}' | \
> sed -r 's/\.tmp//2' | \
> bash
[oneils@mbp ~/apcb/intro/temp]$ ls
file10.txt file13.txt file16.txt file19.txt file2.txt file5.txt file8.txt
file11.txt file14.txt file17.txt file1.txt file3.txt file6.txt file9.txt
file12.txt file15.txt file18.txt file20.txt file4.txt file7.txt
Exercises
In the file
pz_cDNAs.fasta, sequence IDs are grouped according to common suffixes like_TY,_ACT, and the like. Which group has the largest number of sequences, and how many are in that group?Using the various command line tools, extract all sequences composed of only one read (
nReads=1) frompz_cDNAs.fastato a FASTA formatted file calledpz_cDNAs_singles.fasta.In the annotation file
PZ.annot.txt, each sequence ID may be associated with multiple gene ontology (GO) “numbers” (column 2) and a number of different “terms” (column 3). Many IDs are associated with multiple GO numbers, and there is nothing to stop a particular number or term from being associated with multiple IDs.... PZ7180000023260_APN GO:0005515 btb poz domain containing protein PZ7180000035568_APN GO:0005515 btb poz domain containing protein PZ7180000020052_APQ GO:0055114 isocitrate dehydrogenase (nad+) PZ7180000020052_APQ GO:0006099 isocitrate dehydrogenase (nad+) PZ7180000020052_APQ GO:0004449 isocitrate dehydrogenase (nad+) ...Which GO number is associated with largest number of unique IDs? How many different IDs is it associated with? Next, answer the same questions using the GO term instead of GO number. For the latter, beware that the column separators in this file are tab characters (
\t) butawkby default uses any whitespace, including the spaces found in the terms column. In this file, though,isocitrateis not a term, butisocitrate dehydrogenase (nad+)is.
Unlike other tools,
trcan only read its input from stdin.↩︎While binary databases such as those used by
sqlite3and Postgres have their place (especially when large tables need to be joined or searched), storing data in simple text files makes for easy access and manipulation. A discussion of SQL syntax and relational databases is beyond the scope of this book; see Jay Kreibich’s Using SQLite (Sebastopol, CA: O’Reilly Media, Inc., 2010) for a friendly introduction tosqlite3and its syntax.↩︎This isn’t quite true: the
-f <n>flag for uniq removes the first<n>fields before performing the comparison.↩︎

