Reverse complement FASTA benchmark

FASTA format is used in bioinformatics to describe peptide sequences or nucleic acid sequences.

One of the benchmarks at the computer languages game requires that a program parse a ‘stdin’ redirected input of these FASTA formated sequences and print the content of the sequence in reverse order. Each character in the sequence should also be complemented according to this translation table:

code  meaning   complement

A    A                   T
C    C                   G
G    G                   C
T/U  T                   A
M    A or C              K
R    A or G              Y
W    A or T              W
S    C or G              S
Y    C or T              R
K    G or T              M
V    A or C or G         B
H    A or C or T         D
D    A or G or T         H
B    C or G or T         V
N    G or A or T or C    N

Headers

Each sequence contains a header with an id and description. These lines are printed ‘as is’ .

Challenges

  • String reversal
  • Translation table substitution
  • Input handling and memory allocation

String Reversal

Sequences are reversed using a bitwise XOR and without using a temporary character or string buffers.

Operations are performed in a loop where forward is the first index and end is the last index in the sequence.

buffer[forward]^=buffer[end];
buffer[end]^=buffer[forward];
buffer[forward]^=buffer[end];

Translation table substitution

A small character array of complement values is created and accessed with the integer value of the source character matching to arrays index. For example:

buffer[end]=FtoCOMP[buffer[end]];

Buffer at index end will receive a character at index ‘buffer[end]‘ from translation table array FtoCOMP. Index ‘buffer[end]’ translated to an integer , will point to a desired complement . This speeds up the process of substitution and allows for very fast array index notation.

Input handling

The requirement to handle input one line at a time corresponds quite well to the translation procedure. In pseudo-code,

  • process one line at a time
  • for each sequence , print header line, reverse the sequence, translate it into its complement and print it out

Only one buffer is used as a destination for reading input. Beginning pointer where each line is placed is dictated by moving a pointer through a character array. Once a line is read in, we check if it is the last line in a sequence by peeking into the input stream. If it is , it is processed according to above instructions and beginning pointer where the next line is placed is reset to 0.

These functions are called for each line N in the input stream:

fgets_unlocked, fgetc_unlocked.

These functions are called for each line that’s not a header line:

ungetc, strlen.

Other functions are called only at the end of a sequence and do not approach N runtime. This version written in C language has been compiled and tested using kernel 2.6.24 with GCC 4.2.3 .

Compile command:

gcc -Wall -O3 -fomit-frame-pointer revc.c -o revc.gcc_run

Source code is located here: http://zenebo.com/revc.c.tar.gz

Atoi() or not to Atoi()

Update: May 13th, 2008

My solution has been accepted and placed in the interesting alternative category. A bitter-sweet victory considering my program as tested by their standards is the fastest C language entry . It is still a small honor to be recognized and to know that my entry shaved off almost 30% off the next fastest C program.

Let’s see how the FASTA benchmark program stacks up in the next few days.

Cheers.

see it here: http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all

The computer language benchmarks game at http://shootout.alioth.debian.org/gp4/index.php matches up a fundamental processing task or algorithm in a variety of programming languages. One benchmark called ‘sumfile’ has the following requirements:

  • read integers from stdin, one line at a time
  • print the sum of those integers

It is based on this icon program:

procedure main(argv)
 sum:=0
 while(sum+:=read())
 write (sum)
end

The C language entry topped at 6th and 7th place. Surprising as it was, a Java 6 version came in second with more than a second shaved off the total CPU running time.

The task for all of these is to read a flat file containing one integer per line, add the integer to the total and print out the sum at the end. Three versions of test files containing 1,000, 11,000 and 21,000 lines are used. While there could be more interesting ways to tackle this task, it’s trivial and simple demands warranted a quick look.

The fastest C program used these function calls per each line in the file:

fgets_unlocked() , atoi() and a += operator to sum up the total. It seems that either one of these functions could be improved on, but since line-by-line reading is a strict requirement thus leaving the atoi function open to critique.

This following function will return an integer given a string. It handles positive and negative number strings.

int matoi(char *c) {
int res = 0,n=1;
if(*c=='-'){n=-1;*c++;}
while (*c >= '0' && *c <= '9')
res = res * 10 + *c++ - '0';
return res*n;

}

When compared to the standard atoi() function and parsing 500,000 lines of integers, this version on a 2.4 GHz Core2Duo performed 60% better . It’s simplicity is also it’s weakness as the matoi() function is designed to handle proper input. Perfect for a situation where input is controlled.

Original benchmark entry was modified with timing capture and the addition of matoi() function. Header contains any previous entry owners and information.

Archive with source code : http://zenebo.com/sum1.c.tar.gz