Search:

Projects

Fall 2017

Older Courses

Spring 2017

Fall 2016

Spring 2016

Fall 2015

Spring 2015

Fall 2014

Spring 2014

Fall 2013

Spring 2013

Fall 2012

Spring 2012

Fall 2011

Spring 2011

Fall 2010

Spring 2010

Fall 2009

Spring 2009

Fall 2008

Spring 2008

Fall 2007

HOWTOs

# PS4

## DNA Sequence Alignment

We will:

Some implementation notes:

## Build Set-up

• Your `Makefile` should include the flag `-g` for all compiles and links. This will produce debug information that will be used by Valgrind.
• Your executable file should be named `ED`.
• Your code should accept the two strings to be matched from stdin; e.g., as piped in from a test file:
`% ED < example10.txt`

## Implementation

• You may elect to implement any of four solution approaches:
• Recursive without memoization (too slow to be practical, but order N in space)
• Recursive using memo-ization
• Dynamic programming using an NxM matrix, per the Princeton problem set (aka, the Needleman-Wunsch method)
• Using Hirschberg's algorithm, which is linear in space

We're going to use most of class time discussing the Needleman-Wunsch approach per the Princeton PS.

• If you implement the dynamic programming with a matrix approach, remember that the solution is in two parts:
1. Filling out the NxM matrix per the min-of-three-options formula, bottom to top, right to left (this gives you the optimal edit distance in the upper-left, [0][0] cell of the matrix);
2. Traversing the matrix from top-to-bottom, left-to-right (i.e., [0][0] to [N][M]) to recover the choices you made in filling it, and thereby also recovering the actual edit sequence.
• In any case, you should create a class file `ED` (for “Edit Distance”) with the following methods:
• A constructor that accepts the two strings to be compared, and allocates any data structures necessary into order to do the work (e.g., the NxM matrix)
• A static method `int penalty(char a, char b)` that returns the penalty for aligning chars a and b (this will be a 0 or a 1)
• A static method `int min(int a, int b, int c)` which returns the minimum of the three arguments.
• A method `int OptDistance()` which populates the matrix based on having the two strings, and returns the optimal distance (from the [0][0] cell of the matrix when done)
• A method `string Alignment()` which traces the matrix and returns a string that can be printed to display the actual alignment. In general, this will be a multi-line string—i.e., with embedded `\n`'s.
• You should have a main routine that accepts the two strings from stdin, uses your EditDistance class to do the work, and then prints the result to stdout. Remember that your final output should look like this:
```% ./ED < example10.txt
Edit distance = 7
A T 1
A A 0
C - 2
A A 0
G G 0
T G 1
T T 0
A - 2
C C 0
C A 1```
• You have to allocate the memory for the `opt` matrix dynamically, after you read in the two strings and figure out how long they are. There are a few different ways to do this:
• vector of columns, each containing a row vector
• one long vector, with internal calculations to treat it as a matrix
• one big block of memory, similarly with calculations to treat it as a matrix
• others?
See http://stackoverflow.com/questions/936687/how-do-i-declare-a-2d-array-in-c-using-new for some ideas.

If you use `new`, make sure to de-allocate memory in your class destructor; e.g., if your array pointer is named `_opt`, then `delete [] _opt;`
• The dynamic programming solution we discussed requires filling the whole matrix with values (step 1 of the two-part solution), so it's N-squared in space (or more precisely, NxM).
So you shouldn't expect to have your code work for the test case with two 500,000 char strings. Assuming you use a 32-bit int to hold edit distance values in your matrix, that's 2 MB of data squared, or 4,000 GB.
Probably your computer can't allocate this much RAM.
The largest problem you should be able to handle is in `ecoli28284.txt`. This should cause you to allocate an array of approx. 800 million values. Assuming you're using 4-byte `int`s, that's 3.2 GB of data.
Don't worry about larger cases unless you want to explore alternate approaches. If so, there is a solution which computes the optimal alignment in linear space (and quadratic time). This is known as Hirschberg's algorithm (1975).
• After you have things working, add code to calculate and print execution time. You may use SFML's `sf::Clock` and `sf::Time` classes, as follows:
• To your `main` routine, `#include <SFML/System.hpp>` at the top.

Then define the following two objects in your `main`:
```  sf::Clock clock;
sf::Time t;```

At the end of `main`, after computing the solution, capture the running time:

` t= clock.getElapsedTime(); `

Then after printing out the solution, display the running time:

` cout << "Execution time is " << t.asSeconds() << " seconds \n";`
• Remember to add `-lsfml-system` to your compile and link commands in your Makefile.
• Try different compiler optimization flags to see how execution time is affected. E.g., `-O1` or `-O2` could halve running time over using no optimization.
• Verify your algorithm's space usage with the Valgrind runtime analysis tool. Here are the steps:
• Install Valgrind: `sudo apt-get install valgrind`
• Make sure your code is compiled and linked with debugging information (`-g` flag).
• Use Valgrind to run your code with the “massif” heap analysis tool; e.g.:
`valgrind --tool=massif ./ED < ../sequence/ecoli28284.txt`
• Then Valgrind will produce a log file named `massif.out.XXXXX`, where `XXXXX` is the process ID from your run. View with file with the `ms_print` utility that's part of the Valgrind distribution; e.g.:
`ms_print massif.out.11515 | less`
• By examining the massif.out file, confirm that the amount of memory used matches your expectations.
• Also, you can use the program `massif-visualizer` to view the logs (see http://milianw.de/tag/massif-visualizer) -- install with `sudo apt-get install massif-visualizer`.
• More documentation on Valgrind and massif is here: http://valgrind.org/docs/manual/ms-manual.html.
• Download and fill in this `Attach:ps4-readme.txt` file, and include it with your code submission.

The autograder will compile your code and check it with the following input files:

```bothgaps20.txt  ecoli2500.txt  ecoli7000.txt  example10.txt  fli8.txt
ecoli10000.txt  ecoli5000.txt  endgaps7.txt   fli10.txt      fli9.txt```

To make sure this works:

• Your `Makefile` must produce an executable named `ED`.
• Your program must accept input from stdin (e.g., using the `<` redirect).
• Your program must print out first the edit distance and then the edit path. For example:
```% ./ED < example10.txt
Edit distance = 7
A T 1
A A 0
C - 2
A A 0
G G 0
T G 1
T T 0
A - 2
C C 0
C A 1```
You must follow this format (and capitalization of “`Edit distance`”) exactly, and you can't have trailing spaces at the ends of the lines.
• You should print the elapsed time after all this. These lines will be ignored.
• Everything must be in a directory named `ps4`, and per usual, make sure to remove object files before tarring up your work.

Please fill out your data in the `ps4-readme.txt` file and then copy them to iSENSE. You will be happier this way.

After you have a set of results (different times and memory usage for each problem size/input length), go to http://isenseproject.org/projects/871 and report the results.

Start by typing the contribution key 204 and your name into the box entitled Contribute Data:

Then, click on the icon for “manual entry”:

You'll be taken to a screen that looks like this:

Then, click “Add Row” a few times to give yourself rows to enter data, and enter your results. Here's what you'll be entering:

 CPU speed string-length time-to-solution memory-used implementation-method array-method operating-system cpu-type (MHz) 5000, 7000, etc (seconds) (MB)

Notes:

• CPU speed in MHz
• input-size is the length of the problem string, for these test problems:
• ecoli2500.txt
• ecoli5000.txt
• ecoli7000.txt
• ecoli10000.txt
• ecoli20000.txt
• ecoli28284.txt
• time-to-solution is in seconds
• memory-used is in MB
• implementation method is one of: recursive, recursive-with-memoization, Needleman-Wunsch, Hirschberg, other
• array-method is one of: vectors, c-arrays, hash-table, other
• operating-system is one of: x86-unix-native, mac-os-x, lubuntu, other
• cpu-type — e.g., core i3, core i7, AMD (and model), etc.

It'll look something like this:

When you've entered everything, click Save. Note: you won't be able to go back and change things after saving.

To view your data, select string-length for the X axis and either time-to-solution or memory-used for the Y axis.

## Submitting

Put all of your work into a directory named `ps4`: all of your code files, any Makefile, and the completed `ps4-readme.txt` file. Make sure to run `make clean` before archiving.

 Feature Value Comment core implementation 6 3 pts for filling the matrix and getting the edit distance 3 pts for traversing the matrix to retrieve the path Makefile 2 Performance time & space in `readme` 2 Performance data on iSENSE 2 Valgrind analysis discussion 2 in `readme` largest N mem calc 1 in `readme` largest N time calc 1 in `readme` Total 16