PS3
Home Assignments Lecture Blog Resources Discussion Group
DNA Sequence Alignment
We'll be implementing the DNA sequence alignment assignment described at http://www.cs.princeton.edu/courses/archive/fall13/cos126/assignments/sequence.html.
- Download the test files from here: ftp://ftp.cs.princeton.edu/pub/cs126/sequence.zip
- See the Princeton's Checklist page for suggestions and output from sample runs.
This will help you confirm you're getting the right answers.
Some implementation notes:
- Your code should accept the two strings to be matched from stdin; e.g., as piped in from a test file:
% EditDistance < example10.txt
- Remember that the solution is in two parts:
- Filling out the NxM matrix per the min-of-three-options formula, bottom to top, right to left (this gives you the optimal edit distance in the upper-left, [0][0] cell of the matrix);
- Traversing the matrix from top-to-bottom, left-to-right (i.e., [0][0] to [N][M]) to recover the choices you made in filling it, and thereby also recovering the actual edit sequence.
- You should create a class file
EditDistance
with the following methods:- A constructor that accepts the two strings to be compared, and allocates any data structures necessary into order to do the work (e.g., the NxM matrix)
- A static method
int penalty(char a, char b)
that returns the penalty for aligning chars a and b (this will be a 0 or a 1) - A static method
int min(int a, int b, int c)
which returns the minimum of the three arguments. - A method
int OptDistance()
which populates the matrix based on having the two strings, and returns the optimal distance (from the [0][0] cell of the matrix when done) - A method
string Alignment()
which traces the matrix and returns a string that can be printed to display the actual alignment. In general, this will be a multi-line string—i.e., with embedded\n
's.
Note: You may call the class
ED
(instead of EditDistance
) if you'd rather type less.
- You should have a main routine that accepts the two strings from stdin, uses your EditDistance class to do the work, and then prints the result to stdout. Remember that your final output should look like this:
% EditDistance < example10.txt Edit distance = 7 A T 1 A A 0 C - 2 A A 0 G G 0 T G 1 T T 0 A - 2 C C 0 C A 1
- You have to allocate the memory for the
opt
matrix dynamically, after you read in the two strings and figure out how long they are. There are a few different ways to do this; see http://stackoverflow.com/questions/936687/how-do-i-declare-a-2d-array-in-c-using-new for ideas. I am partial to the solution suggested by Dietrich Epp, which is to allocate a single NxM block, and then use the expressioni*M+j
as the index to the ith row and jth column.
Make sure to de-allocate the memory in your class destructor; e.g., if your array pointer is named_opt
, thendelete [] _opt;
- The dynamic programming solution we discussed requires filling the whole matrix with values (step 1 of the two-part solution), so it's N-squared in space (or more precisely, NxM).
So you shouldn't expect to have your code work for the test case with two 500,000 char strings. Assuming you use a 32-bit int to hold edit distance values in your matrix, that's 2 MB of data squared, or 4,000 GB.
Probably your computer can't allocate this much RAM.
The largest problem you should be able to handle is in
ecoli28284.txt
. This should cause you to allocate an array of approx. 800 million values. Assuming you're using 4-byte int
s, that's 3.2 GB of data.
Don't worry about larger cases unless you want to explore alternate approaches. If so, there is a solution which computes the optimal alignment in linear space (and quadratic time). This is known as Hirschberg's algorithm (1975).
- After you have things working, add code to calculate and print execution time. You may use SFML's
sf::Clock
andsf::Time
classes, as follows:- To your
main
routine,#include <SFML/System.hpp>
at the top.
Then define the following two classes in yourmain
:sf::Clock clock; sf::Time t;
At the end ofmain
, after printing out the solution, display the running time:t= clock.getElapsedTime(); cout << "Execution time is " << t.asSeconds() << " seconds \n";
- Remember to add
-lsfml-system
to your compile and link commands in your Makefile or project properties. - Try different compiler optimization flags to see how execution time is affected. E.g.,
-O1
or-O2
could halve running time over using no optimization.
- To your
- Verify your algorithm's space usage with the Valgrind runtime analysis tool. Here are the steps:
- Install Valgrind:
sudo apt-get install valgrind
(Note: Valgrind is already installed on the CS cluster machine. You can compile and run your code there if necessary.) - Make sure your code is compiled and linked with debugging information (
-g
flag). - Use Valgrind to run your code with the “massif” heap analysis tool; e.g.:
valgrind --tool=massif ./ED < ../sequence/ecoli28284.txt
- Then Valgrind will produce a log file named
massif.out.XXXXX
, whereXXXXX
is the process ID from your run. View with file with thems_print
utility that's part of the Valgrind distribution; e.g.:
ms_print massif.out.11515 | less
- Confirm that the amount of memory used matches your expectations.
- More documentation on Valgrind and massif is here: http://valgrind.org/docs/manual/ms-manual.html.
- If you are on Windows, use your own platform-specific tools to analyze memory utilization. Or just copy stuff to the CS cluster and finish your work there.
- Install Valgrind:
- Download and fill in this
Attach:ps3-readme.txt
file, and include it with your code submission. - Archive all of your code files, any Makefile, and the completed
ps3-readme.txt
file, and submit to:submit fredm 204-ps3 your-archive
Grading rubric
Feature | Value | Comment |
core implementation | 4 | |
Makefile | 1 | Makefile or explicit build/link instructions included |
Running time table | 2 | |
Valgrind analysis | 2 | |
largest N mem calc | 1 | |
largest N time calc | 1 | |
readme | 2 | discussion is expected -- at least mentioning something per section. |
Total | 13 |