PS6

Home Assignments Lecture Blog Resources Discussion Group

Markov Model of Natural Language

In this assignment, we analyze an input text for transitions between k-grams (a fixed number of characters) and the following letter.

Then we produce a probabilistic model of the text: given a particular k-gram series of letters, what letters follow at what probabilities?

Then we use the model to generate nonsense text that's surprisingly reasonable.

See the full assignment at the Princeton site: http://www.cs.princeton.edu/courses/archive/fall13/cos126/assignments/markov.html

Implementation

Begin by implementing the following class:

class MarkovModel
----------------------------------------------------------------------------------------
// Note: all of the below constructors/methods should be public.

       MarkovModel(string text, int k)// create a Markov model of order k from given text
                                      // Assume that text has length at least k.

   int order()                        // order k of Markov model

   int freq(string kgram)             // number of occurrences of kgram in text
                                      // (throw an exception if kgram is not of length k)

   int freq(string kgram, char c)     // number of times that character c follows kgram
                                      // if order=0, return num of times char c appears
                                      // (throw an exception if kgram is not of length k)


  char randk(string kgram)            // random character following given kgram
                                      // (Throw an exception if kgram is not of length k.
                                      //  Throw an exception if no such kgram.)

string gen(string kgram, int T)       // generate a string of length T characters
                                      // by simulating a trajectory through the corresponding
                                      // Markov chain.  The first k characters of the newly
                                      // generated string should be the argument kgram.
                                      // Throw an exception if kgram is not of length k.
                                      // Assume that T is at least k.

           <<                         // overload the stream insertion operator and display
                                      // the internal state of the Markov Model. Print out
                                      // the order, the alphabet, and the frequencies of
                                      // the k-grams and k+1-grams.

Details

  • This is the same as the Princeton spec except that the class method rand has been renamed randk.
  • Here is a header file to get you started: Attach:MarkovModel.hpp.
  • Make sure you throw std::runtime_error's in the methods per the spec. Review how to do this in PS5a.
  • Test your implementation against the following test file: Attach:mmtest.cpp.
  • Consider the behavior of a 0-order model. This model will generate new characters with a distribution proportional to the ratio they appear in the input text. This model is context-free; it does not use an input kgram (representing the current state of the model) when generating a new character. As such:
    • The freq(string kgram) method takes an empty string as its input kgram (length of kgram must equal the order of the model).
    • This method call should produce as a result the length of the original input text (given in the constructor).
    • The freq(string kgram, char c) also will take a null string as input. It should produce as output the number of times the character c appears in the original input text.
  • Per discussion in class, consider using a C++ map (http://www.cplusplus.com/reference/map/map/) store frequency counts of each kgram and ``k+1''-gram as it's encountered when you traverse the string in the constructor.
  • Make sure to wrap around the end of the string during traversal, as described in the Princeton assignment.
  • At construction time, it is highly recommended to build up a string that represents the active alphabet and store it as a private class member variable. This will be useful when implementing the randk method, so that you can produce all of the k+1-grams that will follow the provided kgram, test for the frequency of each k+1-gram, and then produce output characters with proportional frequencies.
  • See Willie Boag's suggestion in the discussion group for how to generate output characters that follow the probability distribution that's discovered in the input: https://groups.google.com/forum/#!topic/91-204-202-s14/16yuB54d5Fw. You don't have to do it this way; it's just a suggestion.
  • Overload stream insertion operator to display internal state of Markov model: See these sample instructions for how to do the overload. You should print out all of the k-gram and k+1-gram frequencies, the order of the model, and its alphabet. Here are sample instructions for how to build an iterator over the map of k-gram and k+1-gram keys.
  • Create a Makefile or include short, direct instructions for how to build your project.
  • Fill out the Attach:ps6-readme.txt.

Submit

Submit the following:

  • your code files MarkovModel.cpp and MarkovModel.hpp
  • your main.cpp that does the text generation task (it may be named TextGenerator.cpp)
  • the mmtest.cpp
  • your Makefile or make instructions
  • your ps6-readme.txt file
submit fredm 204-ps6 files

Grading Rubric

FeatureValueComment
MM implementation4full & correct implementation=4 pts; nearly complete=3pts; part way=2 pts; started=1 pt
Text Generation implementation2full & correct implementation=2 pts; part way=1 pt; not=0 pt
Makefile2Makefile or explicit build/link instructions included
Exceptions implementation2full & correct implementation=2 pts; part way=1 pt; not=0 pt
Readme2discussion is expected -- at least mentioning something per section.
Total12