91.406/91.553 Parallel Processing
Instructor: Dr. James Canning

Textbooks
                 
   Parallel Programming with MPI by Peter Pacheco
                     A Programmer's Guide to ZPL (can be obtained from the University of Washington website.)

Additional Readings:

                      ZPL's WYSIWIG Performance Model. This is authored by Chamberlain, Choi, and Lewis.
                      High-Level Progamming Language Abstractions for Advanced and Dynamic Parallel Computations.  This is Steven Deitz's thesis.
                      Multiresolution Languages for Portable yet Efficient Parallel Programming. This is authored by Bradford Chamberlain. 
 
Deliverables from the Programmer's Guide to ZPL
Task
Chapter
Title
Description
Due Date
Check
1
1
Jacobi Iteration
Page 3. This is ZPL 1.0 code. You need to make the adjustment so that it works with ZPL 2.0.



2

There are no programming deliverables from chapter 2 in the Programmer's Guide.



3

There are no programming deliverables from chapter 3 in the Programmer's Guide.


2
4
Sample Statistics
Page 42.


3
4
Correlation Coefficient
Page 44.


4
4
Histogram 1
Page 46


5
4
Uniform Convolution Shit/Add
Page 47


6
4
Uniform Convolution Scan
Page 49


7
4
Counting Connected Components
Page 51



5

There are no programming deliverables from chapter5 in the Programmer's Guide.


8
6
Cannon's Matrix Multiply
Page 84


9
6
Summa Matrix Multiply
Page 86


10
6
Tridiagonal Matrix Multiplication
Page 89


11
6
Ranking
Page 90


12
6
Histogram 2
Page 92


13
6
Lossy Data Compression
Page 94


14
6
Odd/Even Transposition Sort
Page 96



7

There are no programming deliverables from chapter 7 in the Programmer's Guide.



8

There are no programming deliverables from chapter 8 in the Programmer's Guide.


15
9
Inconsequential Code Fragments
Page 123. Create a short, but complete program that highlights the import of page 123.


16
9
Score
Page 126. Create a full example. I recall doing this. It took me awhile to get my head around it.


17
9
N-Body Computations 1
Page 133.


18
9
N-Body Computations 2
Page 134.


19
9
Random Number Generation
Page 137.



10

There are no programming deliverables from chapter 10 in the Programmer's Guide.



Deliverables from Pacheco's MPI Book.
Task
Chapter 1
Title
Description
Due Date
Check

1

There are no programming deliverables from chapter 1.



2




1
3
Greetings!
Pages 41/42.


2
4
Serial Trapazoidal Rule
Pages 55/56


3
4
Trapazoidal Rule 1
Pages 57,58,59. This defines a, b, and n at compile time. No broadcast required.


4
4
Trapazoidal Rule 2
Pages 61, 62. This uses the first  Get_Data. Embed Get_Data into your trapazoidal rule solution so that a, b, and n are input by the user and then broadcasted to the other processes via Get_Data's naive sequential message strategy.


5
5
Trapazoidal Rule 3
Section 5.1. Re do the  Trapazoidal Rule program using Get_data1. This solution relies upon a user defined tree broadcast.


6
5
Trapazoidal Rule 4
Section 5.2. Adjust your solution to the Trapazoidal Rule program so that it uses Get_data2. This solution relies upon MPI_Bcast.


7
5
Serial Dot Product
Build a C program that uses the function on page 75 to compute the dot product of two vectors.


8
5
Dot Product 1
Build a parallel C/MPI program that uses the function given on page 76. It relies upon the MPI_Reduce function call.


9
5
Dot Product 2
Build a parallel C/MPI program that uses the MPI_Allreduce function call as discussed in Section 5.6.


10
5
Serial Matrix Vector
Code up a C program that computes a serial matrix vector product as mentioned on page 78.


11
5
Mat Vec 1
Build a parallel C/MPI program that calls just MPI_Gather to compute a matrix vector product. It does not call MPI_Allgather. This is outlined in section 5.7 and section 5.8.


12
5
Mat Vec 2
Build a parallel C/MPI program that calls MPI_Scatter to compute a matrix vector product. This is outlined in section 5.7.


13
5
Mat Vec 3
Build a parallel C/MPI program that calls MPI_Allgather as shown on page 83.



6

At present we will not implement any of the ideas in chapter 6.


14
7
Serial Matrix Mult.
Implement the standard serial order n-cubed matrix multiply in C. This is given on page 112.


15
7
MM Naive
Implement the naive and costly parallel matrix multiply mentioned at the bottom of page 112.


16
7
Fox's MM
Implement Fox's MM given in chapter 7 of your textbook.



8

At present,  there are no programming assignments for chapter 8. However, the code given is useful as it simplies writing I/O routines for C/MPI programs.



9

At present,  there are no programming assignments for chapter 9. This chapter concerns debugging parallel programs.



10

At present,  there are no programming assignments for chapter 10. This may change.



11

At present,  there are no programming assignments for chapter 11.



12

At present,  there are no programming assignments for chapter 12.



13

At present,  there are no programming assignments for chapter 13. However, the hypercube strategy is used in parallel bitonic sort and can also be used in parallel 1D FFT.


17
14
Serial Bitonic Sort
Implement the serial bitonic sort given in the text book. Make sure that it works. Make sure that you understand it. This is located in section 14.3


18
14
Parallel Bitonic Sort
Implement the parallel bitonic sort given in the text book. Make sure that it works. Make srue that you understand it. This is located in section 14.4.



15

At present, there are no programming assignments for chapter 15.



16

At present, there are no programming assignments for chapter 16.



Other ZPL Deliverables
Task
Name
Description
Due Date
Check
1
Trapazoidal Rule



2
Dot Product



3
Square Deal
Traverse a matrix using a inside out counter clockwise spiral path. Deposit prime numbers when they exist.


4
Transpose
Write a ZPL program that will transpose a 2D matrix.


5
Fox's MM
Write a ZPL program that will multiply matrices using Fox's strategy.


6
Mandlebrot
Implement a program that will compute and display a Mandlebrot image as was described in class by Nat Tuck. You can use any display mechanism that you wish. It does not need to be the GTK/Glade/Perl/Perl version that Nat presented.


7
Eigenvalues 1
Implement a program that will compute the eigenvalues of a 2D symmetric matrix using the standard Jacobi  tranformational strategy.


8
NAS EP
Implement a ZPL program that will compute the NAS EP benchmark.


9
NAS IS
Implement a ZPL program that will compute the NAS IP benchmark.


10
Bitonic Sort
Implement a ZPL program that will compute a Bitonic Sort. It might be necessary to relies on ZPL 2.0. The enhancements given to us by Steven Dietz. It would be nice to see if we could do this without Dietiz's enhancements to get a feel of the limitations of ZPL 1.0.


11
FFT w/ ZPL1.0
Implement a ZPL program that will compute a 1D FFT without using the language extensions in Dietz's thesis. In so doing, we can get a feel for the issue.


12
FFT w/ ZPL2.0
Implement a ZPL program that will compute a 1D FFT using the language extensions offered to us by Dietz is his thesis. This is ZPL 2.0


13
Convex Hull
Divide and Conquer



14
Matrix
Inverse 1



15
Matrix
Inverse 2



16
Matrix
Inverse 3





Other Serial Deliverables in the C Programming Language
1
Sieve
Implement the Sieve of Erathanones.
Due Date
Check
2
NAS EP
Implement the NAS EP Benchmark.


3
Convolution 1
Implement a program that the compute the convolution of two vectors without using a fourier transform. Just use the definition of convolution.


4
Fourier Matrix
Implement a program that will input a value n from the command line and then it will create an n by n complex fourier matrix.


5
Convolution 2
Implement a program that will compute the convolution of two complex vectors by first taking the discrete order N-squared fourier transform of each vector. Then the two transformed vectors are elementwise multiplied to form a vector, say FV. This is followed by applying the inverse N-squared transform to the vector FV. You should verify that we get the same result as if we convolved two original vectors.


6
Convolution
3
Implement a program that will compute the convolution of two complex vectors by first taking the discrete order n log n fourier transform of each vector. This implementation must use the recursive, divide and conquer version discussed in class.


7
Convolution
4
Implement a program that will compute the convolution of two complex vectors by first taking the discrete order n log n fourier tranform of each vector. This implementation must use the iterative, divide and conquer version discussed in class.


8
Mandlebrot
Implement a program that will compute and display a Mandlebort image as was described in class. I gave a hand out on this.


9
2D Fourier Transform
Implement a program that will input a simple image. Say, for example, a white square on a black background. Using GTK/GDK or some other display software, compute the 2D fourier transform of the image using the pixel values as elements of the two dimensional input matrix. Now that the image has been moved up into the frequency domain, display the picture. For each point, you should take the real part and imaginary part and compute the magnitude of the complex. Remap the values of the magnitude into either a black and white picture or a grey scale picture.


10
NAS IS
Read the NAS IS (integer sort) benchmark specification and get it to work.


 11
 NAS FT
Read the NAS FT benchmark specification and get it to work.


12
Convex Hull
Divide & Conquer





Other C/MPI Programs
Task
Name
Description
Due Date
Check
1
Sieve
Using C/MPI write a parallel version of the Sieve of Erasthones.


2
NAS EP
Using C/MPI write a parallel version of the NAS EP benchmark. This differs just a bit from the serial version since the starting seed values for each processor needs to be computed first, So there is a little bit more work to be done here.


3
2D FFT
Using C/MPI write a program that will input an image and output its 2D FFT. You will need to display the magnitude of the real and imaginary part remapped to a scale.  You can do this simply have each process compute a local, serial 1D FFT on its own. 


4
1D
FFT
Using C/MPI write a program that will convolve two vectors using a distributed 1D FFT algorithm. I have given a handout for doing this in class. The hypercube strategy is probably the way to go for us. You should have a way for convincing yourself and me that your program is doing its job.


5
NAS IS
Using C/MPI write a program that will run the NAS IS (integer sort) benchmark. 


6
NAS
FT
Implement the NAS FT Benchmark.


7
Convex Hull
 

8




9




10




11




12




13




14




15






Deliverables related to the WYSIWIG paper
Task
Name
Description
Due Date
Check
1
ZPL Operations
Read section 5.1 thoroughly. Submit a small packet of code that includes your six little ZPL programs. These programs are given as statements in Figure 8 of the WYSIWIG paper. Attach to your code nice looking graphs that are similar to those presented in Figure 8. You are submitting a fourth row of the figure. For now, use 1x1, 4x4, and 6x6. However, we should have 64 cores shortly and when we do, you can use 8x8 too.  I would like you to attach a written paragraph or two as well. Your prose should discuss your findings. Do they match those in the paper or not? Technology has moved forward since the paper was written. You might need to increase the number of elements per processor 100 x 100 doubles to something much larger.



2
MM
Read section 5.2 thorougly. Submit a paper packet of code that includes your three versions of MM in ZPL: Summa, Fox, and Cannon. If you do not include Fox for whatever reason, at least submit Summa and Cannon. I want you to replicate the experiment discussed in the paper on our machine. Include Fox's timings as well. Does your result agree with the spirit of the paper? Any anomolies? Or, is it as you expected?



3
Table 3
I want you to extend Table 3 by adding a row. Label the row Fox. Fill out the table with the entries you think should be there.



4
Uniform Convolution Table
I want you to go back and review both the Shift/Add and the Scan solutions of the Uniform Convolution problem given in the ZPL Programmer's Guide.  Create a table similar to Table 3 in the Wysiwig paper. Fill in the entries with the values you believe are correct. Pass this table in to me. Make the table look reasonable. Be respectful of my of 56 year old eyes.



5
Uniform
Convolution
Figure
Please re-run the experiment discussed in section 5.2 of the paper. However, this time use the two competing solutions for the uniform convolution problem. Generate three bar graphs that are similar to a row of Figure 10. Ideally, we will have our 64 cores up and running and one of your bar charts is an 8 x 8.




Deliverables related to Sorting
Task
Name
Description
Due Date
Check
1
Sorting 1
I want you to create an experiment that will compare the performance of the MPI/C Parallel Bitonic Sort with the ZPL Odd/Even Transposition Sort. Have some fun doing it.  Initially take timings and create a graph that show how well each method scales. Try to sort a large number of integers. You can rig it so that each integer falls within a range, say 0 to 9999. Perhaps you can output the smallest value, the median value, and the largest value to gain some confidence. Vary the number of processors. Vary the number of integers. What are your findings? If you successfully coded up a ZPL Bitonic sort throw that into your analysis too. If you have a parallel integer sort (NAS IS) working, throw that one in as well.




Sorting
2
Sort N integers as fast as you can on our machine. Make N as large as you can and then go. What is the fastest sort you can build for our machine?






Day
What we did
Assignment
1
We talked about writing parallel programs using C with MPI calls and using the ZPL 2.0 language. We will run these programs on our local cluster. Students are to email Nat Tuck so that they can get an account on ermine.  We also discussed that we will explore the new programming language called Chapel. Chapel is the creation of Brad Chamberlain and others. Brad also is the main implementor of ZPL 1.0.  I doodled the sequential bitonic sort algorithm in class using 16 unique integers. Students could see what is going on. I then displayed the C code which implements the bitonic sort.  We noted that the C w/ MPI is a bit painful, while ZPL is easier to code. However, ZPL is limited.  We gave hand motions that simulated flood and reduce.  Most of the programs will be to program traditional algorithms. We hope to learn a bit about weather models and one of them will form the bedrock of a larger project.  Perhaps this project will occur in a later semester. Everybody went home happy, including me.
Students need to implement the sequential bitonic sort algoirthm in C.  Students need to email Nat Tuck and receive their account. Students should read chapter 1 and chapter 2 of the MPI book. Students should download the ZPL Programmer's Guide from the University of Washintgon website. They should also download some of the other papers listed there.
2
We discussed chapter 3 and chapter 4 of the course textbook. We illustrated the parallel greetings program in chapter 3. We walked through the three versions of the trapazoidal rule program in chapter 4. The first one is a serial version. The second one is a parallel version with hard wired input. The third one is a naive parallel version with a sequential broadcast of the inputs. It does illustrate the use of the tag parameter. Students were asked to paper copy hand in their sequential bitonic sort. I will accept this next time too. We also showed a ZPL version of the trapazoidal rule problem. This ZPL version is both a sequential and a parallel version. It is much more succinct. We reviewed the Jacobi smoothing program found in chapter 1 of the ZPL programmer's guide. We noted that this version worked for ZPL 1.0, but it will need to be altered for the ZPL 2.0 that we have. Memory must be explicitly allocated in ZPL 2.0. This meant that we needed to create a region called BigR as well. We started to run timings, but it was getting late.
Students must obtain an ermine account from Nat. They should verify that they can log on. They should enter the greetings.c, serial_trap.c, and trap1.c programs and get them to run. They should also enter the Jacobi.z and the trap.z programs too. This is good practice will it will enable you to ramp up quickly. Do these things. Students make sure to read chapter 3 and chapter 4 of the textbook closely. The students should read chapter 1 of the ZPL Programmer's guide.
3
We discussed section 5.1 in the textbook. It is a user-defined tree broadcast. Using a tree broadcast, this is an enhancement of the earlier trap.c programs.  We also discussed section 5.2. It is an MPI based tree broadcast. We discussed a serial version of the sieve of erasthothanes. We handed out the statement of the NAS Benchmark EP. We reviewed it.
Students should get the user defined tree broadcast to work.
Students should run the MPI based tree broadcast.
Students should read Sections 5.1, 5.2, 5.3, and 5.4. They should implement and get working all the versions of the trap rule program. This includes using reduce as described in section 5.4.
Students should get the sieve of erasthonanes to work in C.
Students should read the description of the NAS EP program with an eye towards getting it to work in C. Students should try to implement the serial version of the NAS EP benchmark in C. I can offer suggestions if needed.

4


5