Diffstat (limited to 'silpa/modules/ngram/algorithm')
-rw-r--r-- | silpa/modules/ngram/algorithm | 23 |
1 files changed, 23 insertions, 0 deletions
diff --git a/silpa/modules/ngram/algorithm b/silpa/modules/ngram/algorithm
new file mode 100644
index 0000000..495b85a
--- /dev/null
+++ b/silpa/modules/ngram/algorithm
@@ -0,0 +1,23 @@
+We have a tree data structure. Each node in the tree is an instance of NgramNode.
+Each NgramNode object contains the string value of the node and a Rank.
+Rank is the frequency of occurrence of the corresponding string in the training corpus, incremented each time the string is seen.
+
+NgramNode is the superclass of SyllableNgramNode and WordNgramNode.
+That means each node in the tree can be either a syllable or a word.
+As of now, we have only one tree for both words and syllables.
+
+In the tree, the root node is an empty node with the label *. That indicates that all its children, either syllables or words,
+are the start of a word or a sentence respectively.
+
+Meaning of "child of a node":
+"Y is a child of X" means that Y can follow immediately after an occurrence of X in the text, where X and Y are either syllables or words (only one time in a tree route).
+X can have any number of children.
+The probability that a node in the list of children occurs in a given context is controlled by Rank(node).
+Rank is simply an integer value incremented based on frequency of occurrence.
+The higher the rank, the higher the probability that the node can follow immediately after X.
+
+Persistence of the populated tree is achieved by pickling the entire tree structure.
+
+Tree operations:
+a) Adding a syllable-ngram, n=2
+
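The tree described in the note can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from this commit: the attribute names `value`, `rank` and `children`, the methods `add_child` and `ranked_children`, and the helpers `add_ngram`, `save_tree` and `load_tree` are all hypothetical. It also assumes that a bigram (X, Y) is stored as the path root -> X -> Y, which the note does not spell out.

```python
import pickle


class NgramNode:
    """One tree node: a string value, a rank (frequency count) and its children."""

    def __init__(self, value="*"):
        self.value = value      # "*" marks the empty root node
        self.rank = 0           # incremented each time this node is seen
        self.children = {}      # child value -> child node (assumed layout)

    def add_child(self, value, node_cls=None):
        """Return the child holding `value`, creating it if needed, and bump its rank."""
        child = self.children.get(value)
        if child is None:
            child = (node_cls or type(self))(value)
            self.children[value] = child
        child.rank += 1
        return child

    def ranked_children(self):
        """Children ordered from most to least likely to follow this node."""
        return sorted(self.children.values(), key=lambda n: n.rank, reverse=True)


class SyllableNgramNode(NgramNode):
    """A node whose value is a syllable."""


class WordNgramNode(NgramNode):
    """A node whose value is a word."""


def add_ngram(root, units, node_cls):
    """Walk from the root, adding each unit as a child of the previous one."""
    node = root
    for unit in units:
        node = node.add_child(unit, node_cls)
    return node


def save_tree(root, path):
    """Persist the whole tree by pickling the root node."""
    with open(path, "wb") as handle:
        pickle.dump(root, handle)


def load_tree(path):
    """Load a tree previously written by save_tree."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Adding syllable bigrams (n=2): "la" follows "ma" twice, "ram" follows it once.
root = NgramNode()
add_ngram(root, ["ma", "la"], SyllableNgramNode)
add_ngram(root, ["ma", "la"], SyllableNgramNode)
add_ngram(root, ["ma", "ram"], SyllableNgramNode)
best = root.children["ma"].ranked_children()[0]  # most likely successor of "ma"
```

In this sketch the children live in a dictionary so that a repeated bigram increments an existing node instead of creating a duplicate. The rank is a raw count rather than a normalized probability, matching the note's description. Note that `pickle.load` should only be used on files from a trusted source.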