PS6
Markov Model of Natural Language
In this assignment, we analyze an input text for transitions between k-grams (sequences of k characters, for a fixed k) and the letter that follows each one.
Then we produce a probabilistic model of the text: given a particular k-gram, which letters follow it, and with what probabilities?
Then we use the model to generate nonsense text that's surprisingly reasonable.
See the full assignment at the Princeton site: http://www.cs.princeton.edu/courses/archive/fall13/cos126/assignments/markov.html
Implementation
Begin by implementing the following class:
class MarkovModel
----------------------------------------------------------------------------------------
// Note: all of the below constructors/methods should be public.

MarkovModel(string text, int k)   // create a Markov model of order k from given text
                                  // Assume that text has length at least k.

int order()                       // order k of Markov model

int freq(string kgram)            // number of occurrences of kgram in text
                                  // (throw an exception if kgram is not of length k)

int freq(string kgram, char c)    // number of times that character c follows kgram
                                  // if order=0, return num of times char c appears
                                  // (throw an exception if kgram is not of length k)

char randk(string kgram)          // random character following given kgram
                                  // (Throw an exception if kgram is not of length k.
                                  //  Throw an exception if no such kgram.)

string gen(string kgram, int T)   // generate a string of length T characters
                                  // by simulating a trajectory through the corresponding
                                  // Markov chain. The first k characters of the newly
                                  // generated string should be the argument kgram.
                                  // Throw an exception if kgram is not of length k.
                                  // Assume that T is at least k.

<<                                // overload the stream insertion operator and display
                                  // the internal state of the Markov Model. Print out
                                  // the order, the alphabet, and the frequencies of
                                  // the k-grams and k+1-grams.
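As a concrete illustration of the intended behavior (not part of the spec), and assuming the wrap-around traversal described under Details below, the input "banana" with k = 2 would behave like this:

MarkovModel model("banana", 2);
model.order();          // 2
model.freq("na");       // 2   ("na" occurs starting at positions 2 and 4)
model.freq("na", 'n');  // 1   (the first "na" is followed by 'n')
model.freq("na", 'b');  // 1   (the last "na" wraps around to the leading 'b')
model.randk("na");      // 'n' or 'b', each with probability 1/2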
Details
- This is the same as the Princeton spec except that the class method rand has been renamed randk.
- Here is a header file to get you started: Attach:MarkovModel.hpp.
- Make sure you throw std::runtime_error exceptions in the methods per the spec. Review how to do this in PS5a.
- Test your implementation against the following test file: Attach:mmtest.cpp.
- Consider the behavior of a 0-order model. This model generates new characters with a distribution proportional to the frequency with which they appear in the input text. It is context-free; it does not use an input kgram (representing the current state of the model) when generating a new character. As such:
- The freq(string kgram) method takes an empty string as its input kgram (the length of the kgram must equal the order of the model). This method call should return the length of the original input text (given in the constructor).
- The freq(string kgram, char c) method also takes an empty string as its input kgram. It should return the number of times the character c appears in the original input text.
- Per discussion in class, consider using a C++ map (http://www.cplusplus.com/reference/map/map/) to store frequency counts of each k-gram and (k+1)-gram as it is encountered when you traverse the string in the constructor (see the constructor/freq sketch after this list).
- Make sure to wrap around the end of the string during traversal, as described in the Princeton assignment.
- At construction time, it is highly recommended to build up a string that represents the active alphabet and store it as a private class member variable. This will be useful when implementing the randk method: you can form each (k+1)-gram that could follow the provided kgram, look up its frequency, and then emit output characters with proportional probabilities (see the randk sketch after this list).
- See Willie Boag's suggestion in the discussion group for how to generate output characters that follow the probability distribution discovered in the input: https://groups.google.com/forum/#!topic/91-204-202-s14/16yuB54d5Fw. You don't have to do it this way; it's just a suggestion.
- Overload the stream insertion operator to display the internal state of the Markov model. See these sample instructions for how to do the overload. You should print out all of the k-gram and (k+1)-gram frequencies, the order of the model, and its alphabet. Here are sample instructions for how to build an iterator over the map of k-gram and (k+1)-gram keys. (See the operator<< sketch after this list.)
- Create a Makefile or include short, direct instructions for how to build your project.
- Fill out the Attach:ps6-readme.txt.
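To make the map-based counting and wrap-around concrete, here is one possible sketch of the constructor and the one-argument freq. This is only one way to do it; the private members m_order, m_text, m_alphabet, m_kgrams, and m_kgramChars are hypothetical names for illustration and are not dictated by the provided header:

// Assumed (hypothetical) members declared in MarkovModel.hpp:
//   int m_order;
//   std::string m_text, m_alphabet;
//   std::map<std::string, int> m_kgrams, m_kgramChars;
#include <map>
#include <string>
#include <stdexcept>
#include "MarkovModel.hpp"

MarkovModel::MarkovModel(std::string text, int k)
    : m_order(k), m_text(text)
{
    int n = (int) text.length();
    for (int i = 0; i < n; ++i) {
        std::string kgram;
        for (int j = 0; j < k; ++j)
            kgram += text[(i + j) % n];    // wrap around the end of the text
        char next = text[(i + k) % n];     // character that follows this kgram
        ++m_kgrams[kgram];                 // k-gram frequency
        ++m_kgramChars[kgram + next];      // (k+1)-gram frequency
        if (m_alphabet.find(next) == std::string::npos)
            m_alphabet += next;            // build up the active alphabet
    }
}

int MarkovModel::freq(std::string kgram)
{
    if ((int) kgram.length() != m_order)
        throw std::runtime_error("freq: kgram is not of length k");
    // For a 0-order model the empty kgram was counted once per character,
    // so this returns the length of the original input text.
    return m_kgrams[kgram];
}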
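A possible randk built on top of those counts: pick a number in [0, freq(kgram)) and walk the alphabet, subtracting each candidate character's count until the pick falls within one of them. Again a sketch only, continuing the assumed members above; using rand() is just one option (see Willie Boag's post for another approach):

#include <cstdlib>   // rand()

char MarkovModel::randk(std::string kgram)
{
    if ((int) kgram.length() != m_order)
        throw std::runtime_error("randk: kgram is not of length k");
    if (m_kgrams.find(kgram) == m_kgrams.end())
        throw std::runtime_error("randk: no such kgram in the model");
    int r = rand() % m_kgrams[kgram];      // uniform in [0, freq(kgram))
    for (size_t i = 0; i < m_alphabet.size(); ++i) {
        char c = m_alphabet[i];
        std::map<std::string, int>::iterator it = m_kgramChars.find(kgram + c);
        int count = (it == m_kgramChars.end()) ? 0 : it->second;
        if (r < count)
            return c;                      // chosen with probability count/freq(kgram)
        r -= count;
    }
    // The follower counts sum to freq(kgram), so this line should be unreachable.
    throw std::runtime_error("randk: inconsistent frequency counts");
}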
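And one possible shape for the stream insertion overload, iterating over the frequency maps. This assumes operator<< is declared a friend of MarkovModel in the header so it can read the (hypothetical) private members; the exact output format is up to you:

#include <ostream>

// Assumes in MarkovModel.hpp:
//   friend std::ostream& operator<<(std::ostream& out, const MarkovModel& mm);
std::ostream& operator<<(std::ostream& out, const MarkovModel& mm)
{
    out << "order: " << mm.m_order << "\n";
    out << "alphabet: " << mm.m_alphabet << "\n";
    std::map<std::string, int>::const_iterator it;
    for (it = mm.m_kgrams.begin(); it != mm.m_kgrams.end(); ++it)
        out << it->first << ": " << it->second << "\n";      // k-gram frequencies
    for (it = mm.m_kgramChars.begin(); it != mm.m_kgramChars.end(); ++it)
        out << it->first << ": " << it->second << "\n";      // (k+1)-gram frequencies
    return out;
}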
Submit
Submit the following:
- your code files MarkovModel.cpp and MarkovModel.hpp
- your main.cpp that does the text generation task (it may be named TextGenerator.cpp; see the sketch below)
- the mmtest.cpp file
- your Makefile or make instructions
- your ps6-readme.txt file
Use the submit command:
submit fredm 204-ps6 files
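For orientation only, here is a minimal sketch of what the text-generation main might look like. The command-line and standard-input conventions here are assumptions (check the Princeton assignment for the interface your program should actually follow):

#include <iostream>
#include <sstream>
#include <string>
#include <cstdlib>
#include "MarkovModel.hpp"

int main(int argc, char* argv[])
{
    if (argc < 3) {
        std::cerr << "usage: " << argv[0] << " k T < input.txt" << std::endl;
        return 1;
    }
    int k = std::atoi(argv[1]);           // order of the model
    int T = std::atoi(argv[2]);           // number of characters to generate

    std::stringstream buffer;
    buffer << std::cin.rdbuf();           // read the whole input text from stdin
    std::string text = buffer.str();

    MarkovModel model(text, k);
    std::string seed = text.substr(0, k); // seed with the first k characters
    std::cout << model.gen(seed, T) << std::endl;
    return 0;
}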
Grading Rubric
Feature | Value | Comment |
MM implementation | 4 | full & correct implementation = 4 pts; nearly complete = 3 pts; part way = 2 pts; started = 1 pt |
Text Generation implementation | 2 | full & correct implementation = 2 pts; part way = 1 pt; none = 0 pts |
Makefile | 2 | Makefile or explicit build/link instructions included |
Exceptions implementation | 2 | full & correct implementation = 2 pts; part way = 1 pt; none = 0 pts |
Readme | 2 | discussion is expected -- at least mentioning something per section |
Total | 12 |