PS3

Home Assignments Lecture Blog Resources Discussion Group

DNA Sequence Alignment

We'll be implementing the DNA sequence alignment assignment described at http://www.cs.princeton.edu/courses/archive/fall13/cos126/assignments/sequence.html.

This will help you confirm you're getting the right answers.

Some implementation notes:

  • Your code should accept the two strings to be matched from stdin; e.g., as piped in from a test file:
% EditDistance < example10.txt
  • Remember that the solution is in two parts:
    1. Filling out the NxM matrix per the min-of-three-options formula, bottom to top, right to left (this gives you the optimal edit distance in the upper-left, [0][0] cell of the matrix);
    2. Traversing the matrix from top-to-bottom, left-to-right (i.e., [0][0] to [N][M]) to recover the choices you made in filling it, and thereby also recovering the actual edit sequence.
  • You should create a class file EditDistance with the following methods:
    • A constructor that accepts the two strings to be compared, and allocates any data structures necessary into order to do the work (e.g., the NxM matrix)
    • A static method int penalty(char a, char b) that returns the penalty for aligning chars a and b (this will be a 0 or a 1)
    • A static method int min(int a, int b, int c) which returns the minimum of the three arguments.
    • A method int OptDistance() which populates the matrix based on having the two strings, and returns the optimal distance (from the [0][0] cell of the matrix when done)
    • A method string Alignment() which traces the matrix and returns a string that can be printed to display the actual alignment. In general, this will be a multi-line string—i.e., with embedded \n's.
Note: You may call the class ED (instead of EditDistance) if you'd rather type less.
  • You should have a main routine that accepts the two strings from stdin, uses your EditDistance class to do the work, and then prints the result to stdout. Remember that your final output should look like this:
% EditDistance < example10.txt
Edit distance = 7
A T 1
A A 0
C - 2
A A 0
G G 0
T G 1
T T 0
A - 2
C C 0
C A 1
  • You have to allocate the memory for the opt matrix dynamically, after you read in the two strings and figure out how long they are. There are a few different ways to do this; see http://stackoverflow.com/questions/936687/how-do-i-declare-a-2d-array-in-c-using-new for ideas. I am partial to the solution suggested by Dietrich Epp, which is to allocate a single NxM block, and then use the expression i*M+j as the index to the ith row and jth column.

    Make sure to de-allocate the memory in your class destructor; e.g., if your array pointer is named _opt, then delete [] _opt;
  • The dynamic programming solution we discussed requires filling the whole matrix with values (step 1 of the two-part solution), so it's N-squared in space (or more precisely, NxM).
So you shouldn't expect to have your code work for the test case with two 500,000 char strings. Assuming you use a 32-bit int to hold edit distance values in your matrix, that's 2 MB of data squared, or 4,000 GB.
Probably your computer can't allocate this much RAM.
The largest problem you should be able to handle is in ecoli28284.txt. This should cause you to allocate an array of approx. 800 million values. Assuming you're using 4-byte ints, that's 3.2 GB of data.
Don't worry about larger cases unless you want to explore alternate approaches. If so, there is a solution which computes the optimal alignment in linear space (and quadratic time). This is known as Hirschberg's algorithm (1975).
  • After you have things working, add code to calculate and print execution time. You may use SFML's sf::Clock and sf::Time classes, as follows:
    • To your main routine, #include <SFML/System.hpp> at the top.

      Then define the following two classes in your main:
        sf::Clock clock;
        sf::Time t;

      At the end of main, after printing out the solution, display the running time:
        t= clock.getElapsedTime();
        cout << "Execution time is " << t.asSeconds() << " seconds \n";
    • Remember to add -lsfml-system to your compile and link commands in your Makefile or project properties.
    • Try different compiler optimization flags to see how execution time is affected. E.g., -O1 or -O2 could halve running time over using no optimization.
  • Verify your algorithm's space usage with the Valgrind runtime analysis tool. Here are the steps:
    • Install Valgrind: sudo apt-get install valgrind (Note: Valgrind is already installed on the CS cluster machine. You can compile and run your code there if necessary.)
    • Make sure your code is compiled and linked with debugging information (-g flag).
    • Use Valgrind to run your code with the “massif” heap analysis tool; e.g.:
        valgrind --tool=massif ./ED < ../sequence/ecoli28284.txt
    • Then Valgrind will produce a log file named massif.out.XXXXX, where XXXXX is the process ID from your run. View with file with the ms_print utility that's part of the Valgrind distribution; e.g.:
        ms_print massif.out.11515 | less
    • Confirm that the amount of memory used matches your expectations.
    • More documentation on Valgrind and massif is here: http://valgrind.org/docs/manual/ms-manual.html.
    • If you are on Windows, use your own platform-specific tools to analyze memory utilization. Or just copy stuff to the CS cluster and finish your work there.
  • Download and fill in this Attach:ps3-readme.txt file, and include it with your code submission.
  • Archive all of your code files, any Makefile, and the completed ps3-readme.txt file, and submit to:
    submit fredm 204-ps3 your-archive

Grading rubric

FeatureValueComment
core implementation4 
Makefile1Makefile or explicit build/link instructions included
Running time table2 
Valgrind analysis2 
largest N mem calc1 
largest N time calc1 
readme2discussion is expected -- at least mentioning something per section.
Total13