Chapter 9 The Standard Streams
In previous chapters, we learned that programs may produce output not only by writing to files, but also by printing them on the standard output stream. To further illustrate this feature, we’ve created a simple program called fasta_stats.py that, given a FASTA file name as its first parameter, produces statistics on each sequence. We’ll also look at the file pz_cDNAs.fasta, which contains a set of 471 de novo assembled transcript sequences from Papilio zelicaon, and pz_cDNAs_sample.fasta, which contains only the first two.
[oneils@mbp ~/apcb/intro]$ mkdir fasta_stats
[oneils@mbp ~/apcb/intro]$ cd fasta_stats
[oneils@mbp ~/apcb/intro/fasta_stats]$ wget https://raw.githubusercontent.com/oneilsh/apcb_second_edition/master/data/fasta_stats.py
--2021-09-28 09:23:59-- https://raw.githubusercontent.com/oneilsh/apcb_second_edition/master/data/fasta_stats.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)
...
[oneils@mbp ~/apcb/intro/fasta_stats]$ chmod +x fasta_stats.py
[oneils@mbp ~/apcb/intro/fasta_stats]$ wget https://raw.githubusercontent.com/oneilsh/apcb_second_edition/master/data/pz_cDNAs.fasta
...
[oneils@mbp ~/apcb/intro/fasta_stats]$ wget https://raw.githubusercontent.com/oneilsh/apcb_second_edition/master/data/pz_cDNAs_sample.fasta
...
[oneils@mbp ~/apcb/intro/fasta_stats]$ ls
fasta_stats.py pz_cDNAs.fasta pz_cDNAs_sample.fasta
We can run the fasta_stats.py program (after making it executable with chmod) with ./fasta_stats.py pz_cDNAs_sample.fasta.
[oneils@mbp ~/apcb/intro/fasta_stats]$ ./fasta_stats.py pz_cDNAs_sample.fasta
# Column 1: Sequence ID
# Column 2: GC content
# Column 3: Length
# Column 4: Most common 5mer
# Column 5: Count of most common 5mer
# Column 6: Repeat unit of longest simple perfect repeat (2 to 10 chars)
# Column 7: Length of repeat (in characters)
# Column 8: Repeat type (dinucleotide, trinucleotide, etc.)
Processing sequence ID PZ7180000000004_TX
PZ7180000000004_TX 0.279 1000 AAATA 12 unit:TAA 12 trinucleotide
Processing sequence ID PZ7180000031590
PZ7180000031590 0.379 486 ACAAA 5 unit:ATTTA 10 pentanucleotide
Based on the information printed, it appears that sequence PZ7180000031590 has a GC content (percentage of the sequence composed of G or C characters) of 37.9%, is 486 base pairs long, the most common five-base-pair sequence is ACAAA (occurring 5 times), and the longest perfect repeat is 10 base pairs long, caused by the pentanucleotide ATTTA, occurring twice.
Much like hmmsearch, this program writes its output to standard output. If we would like to save the results, we know that we can redirect the output of standard out with the > redirect.
[oneils@mbp ~/apcb/intro/fasta_stats]$ ./fasta_stats.py pz_cDNAs_sample.fasta > pz_sample_stats.txt
Processing sequence ID PZ7180000031590
Processing sequence ID PZ7180000000004_TX
[oneils@mbp ~/apcb/intro/fasta_stats]$ ls
fasta_stats.py pz_cDNAs.fasta pz_cDNAs_sample.fasta pz_sample_stats.txt
When we run this command, however, we see that even though the output file has been created, text is still printed to the terminal! If we use less -S to view the pz_sample_stats.txt file, we see that some of the output has gone to the file.
# Column 1: Sequence ID
# Column 2: GC content
# Column 3: Length
# Column 4: Most common 5mer
# Column 5: Count of most common 5mer
# Column 6: Repeat unit of longest simple perfect repeat (2 to 10 chars)
# Column 7: Length of repeat (in characters)
# Column 8: Repeat type (dinucleotide, trinucleotide, etc.)
PZ7180000000004_TX 0.279 1000 AAATA 12 unit:TAA 12 trinucleotide
PZ7180000031590 0.379 486 ACAAA 5 unit:ATTTA 10 pentanucleotide
pz_sample_stats.txt (END)
So what is going on? It turns out that programs can produce output (other than writing to files) on two streams. We are already familiar with the first, standard output, which is by default printed to the terminal but can be redirected to a file with >. The second, called standard error, is also by default printed to the terminal but is not redirected with >.
By default, like standard output, standard error (also known as “standard err” or “stderr”) is printed to the terminal.
Because standard error usually contains diagnostic information, we may not be interested in capturing it into its own file. Still, if we wish, bash can redirect the standard error to a file by using the 2> redirect.26
[oneils@mbp ~/apcb/intro/fasta_stats]$ ./fasta_stats.py pz_cDNAs_sample.fasta > pz_sample_stats.txt 2> pz_sample_stats.err.txt
[oneils@mbp ~/apcb/intro/fasta_stats]$ ls
fasta_stats.py pz_cDNAs_sample.fasta pz_sample_stats.txt
pz_cDNAs.fasta pz_sample_stats.err.txt
We might pictorially represent programs and their output as alternative information flows:

Filtering Lines, Standard Input
It can often be useful to extract lines from a file based on a pattern. For example, the pz_sample_stats.txt file contains information on what each column describes, as well as the data itself. If we want to extract all the lines that match a particular pattern, say, unit:, we can use the tool grep (for Global search for Regular Expression and Print), which prints to standard output lines that match a given pattern (or don’t match a given pattern, if using the -v flag): grep '<pattern>' <file>. To illustrate, we’ll first run fasta_stats on the full input file, redirecting the standard output to a file called pz_stats.txt.
[oneils@mbp ~/apcb/intro/fasta_stats]$ ./fasta_stats.py pz_cDNAs.fasta > pz_stats.txt
Looking at the file with less -S pz_stats.txt, we can see that informational lines as well as data-containing lines are stored in the file:
# Column 1: Sequence ID
# Column 2: GC content
# Column 3: Length
# Column 4: Most common 5mer
# Column 5: Count of most common 5mer
# Column 6: Repeat unit of longest simple perfect repeat (2 to 10 chars)
# Column 7: Length of repeat (in characters)
# Column 8: Repeat type (dinucleotide, trinucleotide, etc.)
PZ7180000000004_TX 0.279 1000 AAATA 12 unit:TAA 12 trinucleotide
PZ7180000031590 0.379 486 ACAAA 5 unit:ATTTA 10 pentanucleotide
PZ7180000027934 0.379 937 ATTTT 7 unit:GTTTGAC 14 heptanucleotide
PZ456916 0.167 162 TAATT 6 unit:TAAT 8 tetranucleotide
PZ7180000037718 0.299 518 TTTTT 15 unit:GTTGT 10 pentanucleotide
PZ7180000000004_TY 0.578 998 AGAAG 8 unit:GCGAG 10 pentanucleotide
...
To get rid of the informational lines, we can use grep to extract the other lines by searching for some pattern they have in common; in this case, the pattern unit: will do. Because grep prints its results to standard output, we will need to redirect the modified output to a file called, perhaps, pz_stats.table to indicate its tabular nature.
[oneils@mbp ~/apcb/intro/fasta_stats]$ grep 'unit:' pz_stats.txt > pz_stats.table
This time, less -S pz_stats.table reveals only the desired lines.
PZ7180000000004_TX 0.279 1000 AAATA 12 unit:TAA 12 trinu
PZ7180000031590 0.379 486 ACAAA 5 unit:ATTTA 10 pentanucleoti
PZ7180000027934 0.379 937 ATTTT 7 unit:GTTTGAC 14 heptanucleoti
PZ456916 0.167 162 TAATT 6 unit:TAAT 8 tetranucleoti
PZ7180000037718 0.299 518 TTTTT 15 unit:GTTGT 10 pentanucleoti
PZ7180000000004_TY 0.578 998 AGAAG 8 unit:GCGAG 10 penta
...
Rather than viewing the file with less, we can also count the number of lines present in the file with the wc tool, which counts the number of lines, words, and characters of input: wc <file>.
Working with the cleaned data table reveals that our program produced 27,723 characters broken into 3,768 words among 471 lines of data output.
[oneils@mbp ~/apcb/intro/fasta_stats]$ wc pz_stats.table
471 3768 27723 pz_stats.table
This sort of command-line-driven analysis can be quite powerful, particularly because many of these programs — like less, grep, and wc — can both print their results on standard output and read input from standard input rather than from a file. Standard input is the secondary input mechanism for command-line programs (other than reading from files directly). By default, standard input, or “stdin”, is unused.
How can we get input to a program on its standard input? It turns out that the easiest way to do so is to redirect the standard output of another program to it using the |, also known as the “pipe,” redirect (found above the Enter key on most keyboards). In this case, the data come in “on the left”:

To drive this home, we’ll first remove our pz_stats.table file, and then rerun our grep for unit: on the pz_stats.txt file, but rather than send the result of grep to a file with the > redirect, we’ll direct it straight to the standard input of wc with a | redirect.
[oneils@mbp ~/apcb/intro/fasta_stats]$ rm pz_stats.table
fasta_stats.py pz_cDNAs_sample.fasta pz_sample_stats.txt
pz_cDNAs.fasta pz_sample_stats.err.txt pz_stats.txt
[oneils@mbp ~/apcb/intro/fasta_stats]$ grep 'unit:' pz_stats.txt | wc
471 3768 27723
In this example, we’ve neither created a new file nor specified a file for wc to read from; the data are stored in a temporary buffer that is handled automatically by the shell and operating system. The less program can also read from standard input, so if we wanted to see the contents of the grep without creating a new file, we could run grep 'unit:' pz_stats.txt | less -S.
Recall that the fasta_stats program wrote its output to standard out, and because grep can read from standard input as well, we can process the entire FASTA file without needing to create any new files by using multiple such buffers:
[oneils@mbp ~/apcb/intro/fasta_stats]$ ./fasta_stats.py pz_cDNAs.fasta | grep 'unit:' | wc
When this command runs, the results printed by fasta_stats on standard error will still be printed to the terminal (as that is the default and we didn’t redirect standard error), but the standard output results will be filtered through grep and then filtered through wc, producing the eventual output of 471 lines.
At this point, the longish nature of the commands and the fact that our terminal window is only so wide are making it difficult to read the commands we are producing. So, we’ll start breaking the commands over multiple lines by ending partial commands with backslashes. Just as in the shell scripts we wrote at the end of chapter 7, “Using (Bioinformatics) Software”, using backslashes will let the shell know that we aren’t finished entering the command. However, the bash shell indicates that a command spans multiple lines by showing us a >, which shouldn’t be confused with the redirect character that we might type ourselves. The following example shows the exact same command in a more readable form broken over multiple lines, but the highlighted characters have not been typed.
[oneils@mbp ~/apcb/intro/fasta_stats]$ ./fasta_stats.py pz_cDNAs.fasta | \
> grep 'unit:' | \
> wc
A chain of commands like the above, separated by pipe characters, is often called a “pipeline.” More generally, though, a pipeline can describe any series of steps from input data to output data (as in the Muscle/HMMER series covered in chapter 7).
Counting Simple AT Repeats
Let’s expand on the small pipeline above to inspect just the “simple” AT repeats, that is, those that are the string “AT” repeated one or more times. We can start with what we have, but rather than just searching for unit:, we’ll modify the pattern to find unit:AT, and see what we get:
[oneils@mbp ~/apcb/intro/fasta_stats]$ ./fasta_stats.py pz_cDNAs.fasta | \
> grep 'unit:AT' | \
> less -S
The resulting output is close to what we hoped for, but not quite complete, as this pattern also matches things like unit:ATT and unit:ATG.
PZ22254_A 0.132 68 ATTAA 5 unit:AT 4 dinucleotide
PZ20930_A 0.229 375 ATTAA 10 unit:ATT 6 trinucleotide
PZ841572 0.285 411 TTTTT 7 unit:ATGTGA 12 hexanucleotid
PZ7180000019701 0.254 678 TTTTT 12 unit:ATAAA 10 pentanucleoti
PZ729309 0.269 535 ATATT 11 unit:ATTAA 10 pentanucleoti
PZ835452 0.238 311 AAAAA 6 unit:ATG 6 trinucleotide
...
We probably want to further filter the output, but based on what pattern? In this case, those lines that match not only unit:AT, but also the term dinucleotide. Rather than attempt to produce a single complicated pattern that does this job in a single grep, we can add another grep call into the pipeline.
[oneils@mbp ~/apcb/intro/fasta_stats]$ ./fasta_stats.py pz_cDNAs.fasta | \
> grep 'unit:AT' | \
> grep 'dinucleotide' | \
> less -S
This command results in the output we want:
PZ7180000031598 0.21 81 AATAT 5 unit:AT 6 dinucleotide
PZ23512_A 0.281 153 TTAAA 3 unit:AT 6 dinucleotide
PZ851952 0.399 338 TATAT 12 unit:AT 24 dinucleotide
PZ17053_A 0.21 119 ATTAA 6 unit:AT 4 dinucleotide
PZ463243 0.227 97 TTGTA 3 unit:AT 4 dinucleotide
PZ22254_A 0.132 68 ATTAA 5 unit:AT 4 dinucleotide
...
Rather than run the results through less -S, we could instead use wc to count the simple (dinucleotide) AT repeats. There is an important concept at play here, that of iterative development, the idea that as we get closer to a solution, we inspect the results and repeat as necessary. Iterative development is a good strategy for many areas in life, but it is essential and pervasive in computing.
Once we’ve decided that we like the small computational process we have created, we might decide to encapsulate it and make it repeatable as a shell script, perhaps called count_ATs.sh.

The above script will need to be made executable and placed in a location referenced by $PATH, as will the fasta_stats.py program.
Exercises
Use
grepandwcto determine how many sequences are in the fileorf_trans.fastawithout creating any temporary files.How many sequence headers in the file
orf_trans.fastahave the term “polymerase”?Some of the sequence headers in
orf_trans.fastahave the phrase “Verified ORF” to indicate that the open reading frame has been verified experimentally. Some also have the term “reverse complement” to indicate that the ORF (open reading frame) is present on the reverse complement sequence of the canonical genome description. How many sequences are verified ORFs and are not on the reverse complement?The sequence headers in
orf_trans.fastahave information on the chromosome from which they originate, such asChr IorChr II. How many sequences are present on chromosome I?
The
tcshandcshshells unfortunately cannot natively separately redirect stdout and stderr to files. A potential workaround looks like:( ./fasta_stats.py pz_cDNAs_sample.fasta > pz_sample_stats.txt ) > & pz_sample_stats.err.txt. This command runs two independent redirects; using parentheses causes the redirect of stdout to happen first, then the further redirect of stderr can occur next. Howbash-compatible shells handle standard output and standard error is one of the primary reasons they are preferred over the oldercsh-compatible shells.↩︎