A generative model that uses a Markov chain algorithm to analyse a corpus of text, learn statistical patterns of word sequences, and use those patterns to generate new, original text
The program is built around a Markov chain model. The model works in two phases (a code sketch of both phases follows the list below):
- Training (Build Phase):
  - The program reads a source text (the "corpus").
  - It scans the text and breaks it down into sequences of words called prefixes. The length of these prefixes is determined by the `prefixLen` constant (e.g., a length of 2 means it looks at pairs of words).
  - For each prefix, it records the word that immediately follows it (the suffix).
  - It builds a map where each key is a prefix and the value is a list of all the suffixes that have appeared after that prefix in the corpus, for example `{"A computer": ["is", "system"]}`.
- Generation (Generate Phase):
  - The program starts with a random prefix from the ones it learned.
  - It randomly selects one of the possible suffixes for that prefix to be the next word.
  - The prefix is then updated by "sliding" it one word forward (dropping the first word and adding the newly chosen word).
  - This process repeats until the desired number of words has been generated, creating a new block of text.
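
As a rough illustration of both phases, a minimal sketch of such a model is shown below. The `Model` type, the `Build` method, and the `chain` field are names invented for this sketch; only `prefixLen`, `corpus`, `main()`, and `model.Generate()` appear in `main.go`, and the actual implementation there may differ.

```go
package main

import (
	"math/rand"
	"strings"
)

// prefixLen mirrors the constant at the top of main.go: the number of words
// in each prefix (2 means the model looks at pairs of words).
const prefixLen = 2

// Model maps each prefix (prefixLen words joined by spaces) to the list of
// suffix words observed immediately after that prefix in the corpus.
// The type and field names are assumptions made for this sketch.
type Model struct {
	chain map[string][]string
}

// Build is the training phase: it scans the corpus and records, for every
// prefix, the word that immediately follows it.
func (m *Model) Build(corpus string) {
	m.chain = make(map[string][]string)
	words := strings.Fields(corpus)
	for i := 0; i+prefixLen < len(words); i++ {
		prefix := strings.Join(words[i:i+prefixLen], " ")
		suffix := words[i+prefixLen]
		m.chain[prefix] = append(m.chain[prefix], suffix)
	}
}

// Generate is the generation phase: it starts from a random learned prefix,
// repeatedly picks a random suffix for the current prefix, and slides the
// prefix forward one word until n words have been produced.
func (m *Model) Generate(n int) string {
	if len(m.chain) == 0 {
		return "" // nothing was learned during the build phase
	}

	// Start from a random prefix seen in the corpus.
	prefixes := make([]string, 0, len(m.chain))
	for p := range m.chain {
		prefixes = append(prefixes, p)
	}
	prefix := prefixes[rand.Intn(len(prefixes))]

	out := strings.Fields(prefix)
	for len(out) < n {
		suffixes, ok := m.chain[prefix]
		if !ok {
			break // dead end: this prefix never appeared with a suffix
		}
		next := suffixes[rand.Intn(len(suffixes))]
		out = append(out, next)
		// Slide the window: drop the first word of the prefix, append the new word.
		words := strings.Fields(prefix)
		prefix = strings.Join(append(words[1:], next), " ")
	}
	return strings.Join(out, " ")
}
```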
When the program runs, you will see output in your terminal: first a confirmation that the model has been trained, then the newly generated text.
You can easily customise the behaviour of the text generator by changing the constants and variables in `main.go` (a sketch of a matching `main()` follows this list):

- Change the corpus: modify the `corpus` constant in the `main()` function. You can paste any text you like; for larger texts, consider reading from an external file.
- Adjust the prefix length: change the `prefixLen` constant at the top of the file. A larger number (e.g., 3) produces text that is more coherent but less varied, as it relies on longer learned phrases. A smaller number (e.g., 1) produces more random output.
- Change the output length: in the `main()` function, change the number passed to `model.Generate()`. For example, `model.Generate(100)` generates 100 words.
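
Tying these knobs together, a `main()` along the following lines would complete the sketch shown earlier (the corpus text here is a placeholder, and `"fmt"` would need to be added to the sketch's import list):

```go
func main() {
	// The training text; paste any text you like, or read it from an
	// external file for larger corpora.
	const corpus = `A computer is a machine. A computer system processes data
according to a list of instructions.`

	model := &Model{}
	model.Build(corpus) // training phase: learn prefix -> suffix patterns
	fmt.Println("model trained")

	// Generation phase: produce 100 words of new text.
	fmt.Println(model.Generate(100))
}
```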