author | Santhosh Thottingal <santhosh.thottingal@gmail.com> | 2009-04-16 20:51:39 +0530 |
---|---|---|
committer | Santhosh Thottingal <santhosh.thottingal@gmail.com> | 2009-04-16 20:51:39 +0530 |
commit | b4c9aab679ee466431a64688226ed870380d5b29 (patch) | |
tree | 1c4755464f4de3ef50b164811309bf7d46cf57d2 /silpa | |
parent | 712efc3d8159aa22d8c51f8266c1a6813a4f9dba (diff) | |
Ngram model algorithm notes
Diffstat (limited to 'silpa')
-rw-r--r-- | silpa/modules/ngram/algorithm | 23 |
1 files changed, 23 insertions, 0 deletions
diff --git a/silpa/modules/ngram/algorithm b/silpa/modules/ngram/algorithm
new file mode 100644
index 0000000..495b85a
--- /dev/null
+++ b/silpa/modules/ngram/algorithm
@@ -0,0 +1,23 @@
+We have a TREE data structure. Each node in the tree is an instance of NgramNode.
+Each NgramNode object contains the string value of the node and a Rank.
+Rank is the frequency of occurrence of the corresponding string in the training corpus, incremented each time the string is seen.
+
+NgramNode is a superclass of SyllableNgramNode and WordNgramNode.
+That means each node in the tree can be either a syllable or a word.
+We have only one tree for both words and syllables as of now.
+
+In the tree, the root node is an empty node with the label *. This indicates that all its children, either syllables or words,
+are the start of a word or a sentence respectively.
+
+Meaning of "child of a node":
+Y is a child of X means Y can follow immediately after an occurrence of X in the text, where X and Y are either syllables or words (only one time in a tree route).
+X can have any number of children.
+The probability that a node in the list of children occurs in a given context is controlled by Rank(node).
+Rank is nothing but an integer value incremented based on frequency of occurrence.
+The higher the rank, the higher the probability that the node follows immediately after X.
+
+Persistence of the populated tree is achieved by pickling the entire tree structure.
+
+Tree operations:
+a) Adding a syllable-ngram, n=2
+
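The notes in this patch describe the tree only in prose, so here is a minimal Python sketch of how it might look. This is not the SILPA implementation. The class names `NgramNode`, `SyllableNgramNode` and `WordNgramNode` come from the notes; everything else (`children`, `child`, `ranked_children`, `probability`, `add_bigram`, `save_tree`, `load_tree`) is a hypothetical name chosen for illustration. Two assumptions are worth flagging: the bigram is attached directly under the `*` root, and the probability is taken as a child's rank divided by the sum of its siblings' ranks. The notes only say that the rank "controls" the probability.

```python
import pickle


class NgramNode:
    """A tree node holding a string value and a rank (frequency count)."""

    def __init__(self, value):
        self.value = value
        self.rank = 0
        self.children = {}  # hypothetical layout: child value -> node

    def child(self, value, node_cls):
        """Return the child for `value`, creating it if needed, and bump its rank."""
        node = self.children.get(value)
        if node is None:
            node = node_cls(value)
            self.children[value] = node
        node.rank += 1
        return node

    def ranked_children(self):
        """Children ordered from highest rank to lowest."""
        return sorted(self.children.values(), key=lambda n: n.rank, reverse=True)

    def probability(self, value):
        """Assumed estimate: the child's rank over the total rank of all children."""
        total = sum(n.rank for n in self.children.values())
        node = self.children.get(value)
        if node is None or total == 0:
            return 0.0
        return node.rank / total


class SyllableNgramNode(NgramNode):
    """Node whose value is a syllable."""


class WordNgramNode(NgramNode):
    """Node whose value is a word."""


def add_bigram(root, first, second, node_cls=SyllableNgramNode):
    """Record that `second` follows `first` (n=2), starting at the root.

    Assumption: the bigram begins at a word or sentence start, so `first`
    becomes a child of the '*' root and `second` a child of `first`.
    """
    root.child(first, node_cls).child(second, node_cls)


def save_tree(root, path):
    """Persist the whole tree by pickling the root node."""
    with open(path, "wb") as handle:
        pickle.dump(root, handle)


def load_tree(path):
    """Load a tree previously written by save_tree."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

With a root created as `NgramNode("*")`, calling `add_bigram(root, "ma", "la")` twice and `add_bigram(root, "ma", "ram")` once leaves the `ma` node with rank 3, and its `ranked_children()` list puts `la` ahead of `ram`. Keeping children in a dictionary keyed by value makes the "increment if present, create otherwise" step a single lookup. Sorting is deferred to query time.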