Recent Changes - Search:
ECG Home

GitHub

People

Publications

Calendar

Projects

Fall 2017

Older Courses

Spring 2017

Fall 2016

Spring 2016

Fall 2015

Spring 2015

Fall 2014

Spring 2014

Fall 2013

Spring 2013

Fall 2012

Spring 2012

Fall 2011

Spring 2011

Fall 2010

Spring 2010

Fall 2009

Spring 2009

Fall 2008

Spring 2008

Fall 2007

HOWTOs

edit SideBar

PS4

Home Assignments Lecture Blog Resources Discussion Group

DNA Sequence Alignment

We will:

Some implementation notes:

This will help you confirm you're getting the right answers.

Build Set-up

  • Your Makefile should include the flag -g for all compiles and links. This will produce debug information that will be used by Valgrind.
  • Your executable file should be named ED.
  • Your code should accept the two strings to be matched from stdin; e.g., as piped in from a test file:
% ED < example10.txt

Implementation

  • You may elect to implement any of four solution approaches:
    • Recursive without memoization (too slow to be practical, but order N in space)
    • Recursive using memo-ization
    • Dynamic programming using an NxM matrix, per the Princeton problem set (aka, the Needleman-Wunsch method)
    • Using Hirschberg's algorithm, which is linear in space

We're going to use most of class time discussing the Needleman-Wunsch approach per the Princeton PS.

  • If you implement the dynamic programming with a matrix approach, remember that the solution is in two parts:
    1. Filling out the NxM matrix per the min-of-three-options formula, bottom to top, right to left (this gives you the optimal edit distance in the upper-left, [0][0] cell of the matrix);
    2. Traversing the matrix from top-to-bottom, left-to-right (i.e., [0][0] to [N][M]) to recover the choices you made in filling it, and thereby also recovering the actual edit sequence.
  • In any case, you should create a class file ED (for “Edit Distance”) with the following methods:
    • A constructor that accepts the two strings to be compared, and allocates any data structures necessary into order to do the work (e.g., the NxM matrix)
    • A static method int penalty(char a, char b) that returns the penalty for aligning chars a and b (this will be a 0 or a 1)
    • A static method int min(int a, int b, int c) which returns the minimum of the three arguments.
    • A method int OptDistance() which populates the matrix based on having the two strings, and returns the optimal distance (from the [0][0] cell of the matrix when done)
    • A method string Alignment() which traces the matrix and returns a string that can be printed to display the actual alignment. In general, this will be a multi-line string—i.e., with embedded \n's.
  • You should have a main routine that accepts the two strings from stdin, uses your EditDistance class to do the work, and then prints the result to stdout. Remember that your final output should look like this:
% ./ED < example10.txt
Edit distance = 7
A T 1
A A 0
C - 2
A A 0
G G 0
T G 1
T T 0
A - 2
C C 0
C A 1
  • You have to allocate the memory for the opt matrix dynamically, after you read in the two strings and figure out how long they are. There are a few different ways to do this:
    • vector of columns, each containing a row vector
    • one long vector, with internal calculations to treat it as a matrix
    • one big block of memory, similarly with calculations to treat it as a matrix
    • others?
See http://stackoverflow.com/questions/936687/how-do-i-declare-a-2d-array-in-c-using-new for some ideas.

If you use new, make sure to de-allocate memory in your class destructor; e.g., if your array pointer is named _opt, then delete [] _opt;
  • The dynamic programming solution we discussed requires filling the whole matrix with values (step 1 of the two-part solution), so it's N-squared in space (or more precisely, NxM).
So you shouldn't expect to have your code work for the test case with two 500,000 char strings. Assuming you use a 32-bit int to hold edit distance values in your matrix, that's 2 MB of data squared, or 4,000 GB.
Probably your computer can't allocate this much RAM.
The largest problem you should be able to handle is in ecoli28284.txt. This should cause you to allocate an array of approx. 800 million values. Assuming you're using 4-byte ints, that's 3.2 GB of data.
Don't worry about larger cases unless you want to explore alternate approaches. If so, there is a solution which computes the optimal alignment in linear space (and quadratic time). This is known as Hirschberg's algorithm (1975).
  • After you have things working, add code to calculate and print execution time. You may use SFML's sf::Clock and sf::Time classes, as follows:
    • To your main routine, #include <SFML/System.hpp> at the top.

      Then define the following two objects in your main:
        sf::Clock clock;
        sf::Time t;

      At the end of main, after computing the solution, capture the running time:

      t= clock.getElapsedTime();

      Then after printing out the solution, display the running time:

      cout << "Execution time is " << t.asSeconds() << " seconds \n";
    • Remember to add -lsfml-system to your compile and link commands in your Makefile.
    • Try different compiler optimization flags to see how execution time is affected. E.g., -O1 or -O2 could halve running time over using no optimization.
  • Verify your algorithm's space usage with the Valgrind runtime analysis tool. Here are the steps:
    • Install Valgrind: sudo apt-get install valgrind
    • Make sure your code is compiled and linked with debugging information (-g flag).
    • Use Valgrind to run your code with the “massif” heap analysis tool; e.g.:
        valgrind --tool=massif ./ED < ../sequence/ecoli28284.txt
    • Then Valgrind will produce a log file named massif.out.XXXXX, where XXXXX is the process ID from your run. View with file with the ms_print utility that's part of the Valgrind distribution; e.g.:
        ms_print massif.out.11515 | less
    • By examining the massif.out file, confirm that the amount of memory used matches your expectations.
    • Also, you can use the program massif-visualizer to view the logs (see http://milianw.de/tag/massif-visualizer) -- install with sudo apt-get install massif-visualizer.
    • More documentation on Valgrind and massif is here: http://valgrind.org/docs/manual/ms-manual.html.
  • Download and fill in this Attach:ps4-readme.txt file, and include it with your code submission.

Making the autograder happy

The autograder will compile your code and check it with the following input files:

bothgaps20.txt  ecoli2500.txt  ecoli7000.txt  example10.txt  fli8.txt
ecoli10000.txt  ecoli5000.txt  endgaps7.txt   fli10.txt      fli9.txt

To make sure this works:

  • Your Makefile must produce an executable named ED.
  • Your program must accept input from stdin (e.g., using the < redirect).
  • Your program must print out first the edit distance and then the edit path. For example:
% ./ED < example10.txt
Edit distance = 7
A T 1
A A 0
C - 2
A A 0
G G 0
T G 1
T T 0
A - 2
C C 0
C A 1
You must follow this format (and capitalization of “Edit distance”) exactly, and you can't have trailing spaces at the ends of the lines.
  • You should print the elapsed time after all this. These lines will be ignored.
  • Everything must be in a directory named ps4, and per usual, make sure to remove object files before tarring up your work.

Reporting your results

Please fill out your data in the ps4-readme.txt file and then copy them to iSENSE. You will be happier this way.

After you have a set of results (different times and memory usage for each problem size/input length), go to http://isenseproject.org/projects/871 and report the results.

Start by typing the contribution key 204 and your name into the box entitled Contribute Data:

Then, click on the icon for “manual entry”:

You'll be taken to a screen that looks like this:

Now, fill in your information from your data runs. Start by putting your name and some details about your machine in the dataset name at the top:

Then, click “Add Row” a few times to give yourself rows to enter data, and enter your results. Here's what you'll be entering:

CPU speedstring-lengthtime-to-solutionmemory-usedimplementation-methodarray-methodoperating-systemcpu-type
(MHz)5000, 7000, etc(seconds)(MB)   

Notes:

  • CPU speed in MHz
  • input-size is the length of the problem string, for these test problems:
    • ecoli2500.txt
    • ecoli5000.txt
    • ecoli7000.txt
    • ecoli10000.txt
    • ecoli20000.txt
    • ecoli28284.txt
  • time-to-solution is in seconds
  • memory-used is in MB
  • implementation method is one of: recursive, recursive-with-memoization, Needleman-Wunsch, Hirschberg, other
  • array-method is one of: vectors, c-arrays, hash-table, other
  • operating-system is one of: x86-unix-native, mac-os-x, lubuntu, other
  • cpu-type — e.g., core i3, core i7, AMD (and model), etc.

It'll look something like this:

When you've entered everything, click Save. Note: you won't be able to go back and change things after saving.

To view your data, select string-length for the X axis and either time-to-solution or memory-used for the Y axis.

Submitting

Put all of your work into a directory named ps4: all of your code files, any Makefile, and the completed ps4-readme.txt file. Make sure to run make clean before archiving.

Submit your work via Bottlenose:

Martin section (201): https://grader.cs.uml.edu/assignments/483
Grinberg section (202): https://grader.cs.uml.edu/assignments/484

The executable file that your Makefile builds should be called EditDistance (the grading script checks that this executable builds successfully.)

Grading rubric

FeatureValueComment
core implementation6 
  3 pts for filling the matrix and getting the edit distance
  3 pts for traversing the matrix to retrieve the path
Makefile2
Performance time & space in readme2 
Performance data on iSENSE2 
Valgrind analysis discussion2in readme
largest N mem calc1in readme
largest N time calc1in readme
Total16 
Edit - History - Print - Recent Changes - Search
Page last modified on August 10, 2015, at 11:10 PM