DNA Sequence Alignment
We'll be implementing the DNA sequence alignment assignment described at http://www.cs.princeton.edu/courses/archive/fall13/cos126/assignments/sequence.html.
- Download the test files from here: ftp://ftp.cs.princeton.edu/pub/cs126/sequence.zip
- See the Princeton's Checklist page for suggestions and output from sample runs.
This will help you confirm you're getting the right answers.
Some implementation notes:
- Your code should accept the two strings to be matched from stdin; e.g., as piped in from a test file:
% EditDistance < example10.txt
- Remember that the solution is in two parts:
- Filling out the NxM matrix per the min-of-three-options formula, bottom to top, right to left (this gives you the optimal edit distance in the upper-left,  cell of the matrix);
- Traversing the matrix from top-to-bottom, left-to-right (i.e.,  to [N][M]) to recover the choices you made in filling it, and thereby also recovering the actual edit sequence.
- You should create a class file
EditDistancewith the following methods:
- A constructor that accepts the two strings to be compared, and allocates any data structures necessary into order to do the work (e.g., the NxM matrix)
- A static method
int penalty(char a, char b)that returns the penalty for aligning chars a and b (this will be a 0 or a 1)
- A static method
int min(int a, int b, int c)which returns the minimum of the three arguments.
- A method
int OptDistance()which populates the matrix based on having the two strings, and returns the optimal distance (from the  cell of the matrix when done)
- A method
string Alignment()which traces the matrix and returns a string that can be printed to display the actual alignment. In general, this will be a multi-line string—i.e., with embedded
Note: You may call the class
EditDistance) if you'd rather type less.
- You should have a main routine that accepts the two strings from stdin, uses your EditDistance class to do the work, and then prints the result to stdout. Remember that your final output should look like this:
% EditDistance < example10.txt Edit distance = 7 A T 1 A A 0 C - 2 A A 0 G G 0 T G 1 T T 0 A - 2 C C 0 C A 1
- You have to allocate the memory for the
optmatrix dynamically, after you read in the two strings and figure out how long they are. There are a few different ways to do this; see http://stackoverflow.com/questions/936687/how-do-i-declare-a-2d-array-in-c-using-new for ideas. I am partial to the solution suggested by Dietrich Epp, which is to allocate a single NxM block, and then use the expression
i*M+jas the index to the ith row and jth column.
Make sure to de-allocate the memory in your class destructor; e.g., if your array pointer is named
delete  _opt;
- The dynamic programming solution we discussed requires filling the whole matrix with values (step 1 of the two-part solution), so it's N-squared in space (or more precisely, NxM).
So you shouldn't expect to have your code work for the test case with two 500,000 char strings. Assuming you use a 32-bit int to hold edit distance values in your matrix, that's 2 MB of data squared, or 4,000 GB.
Probably your computer can't allocate this much RAM.
The largest problem you should be able to handle is in
ecoli28284.txt. This should cause you to allocate an array of approx. 800 million values. Assuming you're using 4-byte
ints, that's 3.2 GB of data.
Don't worry about larger cases unless you want to explore alternate approaches. If so, there is a solution which computes the optimal alignment in linear space (and quadratic time). This is known as Hirschberg's algorithm (1975).
- After you have things working, add code to calculate and print execution time. You may use SFML's
sf::Timeclasses, as follows:
- To your
#include <SFML/System.hpp>at the top.
Then define the following two classes in your
sf::Clock clock; sf::Time t;
At the end of
main, after printing out the solution, display the running time:
t= clock.getElapsedTime(); cout << "Execution time is " << t.asSeconds() << " seconds \n";
- Remember to add
-lsfml-systemto your compile and link commands in your Makefile or project properties.
- Try different compiler optimization flags to see how execution time is affected. E.g.,
-O2could halve running time over using no optimization.
- To your
- Verify your algorithm's space usage with the Valgrind runtime analysis tool. Here are the steps:
- Install Valgrind:
sudo apt-get install valgrind(Note: Valgrind is already installed on the CS cluster machine. You can compile and run your code there if necessary.)
- Make sure your code is compiled and linked with debugging information (
- Use Valgrind to run your code with the “massif” heap analysis tool; e.g.:
valgrind --tool=massif ./ED < ../sequence/ecoli28284.txt
- Then Valgrind will produce a log file named
XXXXXis the process ID from your run. View with file with the
ms_printutility that's part of the Valgrind distribution; e.g.:
ms_print massif.out.11515 | less
- Confirm that the amount of memory used matches your expectations.
- More documentation on Valgrind and massif is here: http://valgrind.org/docs/manual/ms-manual.html.
- If you are on Windows, use your own platform-specific tools to analyze memory utilization. Or just copy stuff to the CS cluster and finish your work there.
- Install Valgrind:
- Download and fill in this
Attach:ps3-readme.txtfile, and include it with your code submission.
- Archive all of your code files, any Makefile, and the completed
ps3-readme.txtfile, and submit to:
submit fredm 204-ps3 your-archive
|Makefile||1||Makefile or explicit build/link instructions included|
|Running time table||2|
|largest N mem calc||1|
|largest N time calc||1|
|readme||2||discussion is expected -- at least mentioning something per section.|