Reverse complement FASTA benchmark

FASTA format is used in bioinformatics to describe peptide sequences or nucleic acid sequences.

One of the benchmarks at the computer languages game requires that a program parse a ‘stdin’ redirected input of these FASTA formated sequences and print the content of the sequence in reverse order. Each character in the sequence should also be complemented according to this translation table:

code  meaning   complement

A    A                   T
C    C                   G
G    G                   C
T/U  T                   A
M    A or C              K
R    A or G              Y
W    A or T              W
S    C or G              S
Y    C or T              R
K    G or T              M
V    A or C or G         B
H    A or C or T         D
D    A or G or T         H
B    C or G or T         V
N    G or A or T or C    N

Headers

Each sequence contains a header with an id and description. These lines are printed ‘as is’ .

Challenges

  • String reversal
  • Translation table substitution
  • Input handling and memory allocation

String Reversal

Sequences are reversed using a bitwise XOR and without using a temporary character or string buffers.

Operations are performed in a loop where forward is the first index and end is the last index in the sequence.

buffer[forward]^=buffer[end];
buffer[end]^=buffer[forward];
buffer[forward]^=buffer[end];

Translation table substitution

A small character array of complement values is created and accessed with the integer value of the source character matching to arrays index. For example:

buffer[end]=FtoCOMP[buffer[end]];

Buffer at index end will receive a character at index ‘buffer[end]‘ from translation table array FtoCOMP. Index ‘buffer[end]’ translated to an integer , will point to a desired complement . This speeds up the process of substitution and allows for very fast array index notation.

Input handling

The requirement to handle input one line at a time corresponds quite well to the translation procedure. In pseudo-code,

  • process one line at a time
  • for each sequence , print header line, reverse the sequence, translate it into its complement and print it out

Only one buffer is used as a destination for reading input. Beginning pointer where each line is placed is dictated by moving a pointer through a character array. Once a line is read in, we check if it is the last line in a sequence by peeking into the input stream. If it is , it is processed according to above instructions and beginning pointer where the next line is placed is reset to 0.

These functions are called for each line N in the input stream:

fgets_unlocked, fgetc_unlocked.

These functions are called for each line that’s not a header line:

ungetc, strlen.

Other functions are called only at the end of a sequence and do not approach N runtime. This version written in C language has been compiled and tested using kernel 2.6.24 with GCC 4.2.3 .

Compile command:

gcc -Wall -O3 -fomit-frame-pointer revc.c -o revc.gcc_run

Source code is located here: http://zenebo.com/revc.c.tar.gz

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s