Diffstat (limited to 'silpa/modules/ngram/algorithm')
-rw-r--r-- silpa/modules/ngram/algorithm 23
1 files changed, 23 insertions, 0 deletions
diff --git a/silpa/modules/ngram/algorithm b/silpa/modules/ngram/algorithm
new file mode 100644
index 0000000..495b85a
--- /dev/null
+++ b/silpa/modules/ngram/algorithm
@@ -0,0 +1,23 @@
+We have a TREE data structure. Each node in the tree is an instance of NgramNode.
+Each NgramNode object contains the string value of the node and a Rank.
+Rank is a counter that is incremented each time the corresponding string occurs in the training corpus.
+
+NgramNode is the superclass of SyllableNgramNode and WordNgramNode.
+That means each node in the tree can be either a syllable or a word.
+As of now, we have only one tree for both words and syllables.
+
+In the tree, the root node is an empty node with the label *. This indicates that all its children, whether syllables or words,
+mark the start of a word or a sentence, respectively.
+
+Meaning of "child of a node":
+Y is a child of X means that Y can follow immediately after an occurrence of X in the text, where X and Y are either syllables or words (only one time in a tree route).
+X can have any number of children.
+The probability that a node in the list of children occurs in a given context is controlled by Rank(node).
+Rank is simply an integer value that is incremented according to the frequency of occurrence.
+The higher the rank, the higher the probability that the node follows immediately after X.
+
+Persistence of the populated tree is achieved by pickling the entire tree structure.
+
+Tree operations:
+a) Adding a syllable-ngram, n=2
+
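The note above can be illustrated with a minimal Python sketch. It is not the module's actual code. The class names NgramNode, SyllableNgramNode and WordNgramNode come from the note; the attribute names (value, rank, children), the helpers add_ngram, follow_probabilities, save_tree and load_tree, and the choice to turn ranks into probabilities by dividing by the sibling total are all assumptions made for illustration.

```python
import pickle


class NgramNode:
    """One tree node: a string value, a rank, and its children."""

    def __init__(self, value="*"):
        self.value = value    # "*" is the label of the empty root node
        self.rank = 0         # incremented once per occurrence in the corpus
        self.children = {}    # assumed layout: value -> child node

    def add_child(self, value, node_class):
        """Return the child for `value` (created if missing) and bump its rank."""
        child = self.children.get(value)
        if child is None:
            child = node_class(value)
            self.children[value] = child
        child.rank += 1
        return child


class SyllableNgramNode(NgramNode):
    """A node whose value is a syllable."""


class WordNgramNode(NgramNode):
    """A node whose value is a word."""


def add_ngram(root, units, node_class=SyllableNgramNode):
    """Add one n-gram as a path from the root; n=2 for a two-item list."""
    node = root
    for unit in units:
        node = node.add_child(unit, node_class)
    return node


def follow_probabilities(node):
    """Assumed scoring: each child's rank divided by the sum of sibling ranks."""
    total = sum(child.rank for child in node.children.values())
    if total == 0:
        return {}
    return {v: child.rank / total for v, child in node.children.items()}


def save_tree(root, path):
    """Persist the whole tree by pickling the root node."""
    with open(path, "wb") as handle:
        pickle.dump(root, handle)


def load_tree(path):
    """Load a tree previously written by save_tree."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

With this sketch, add_ngram(root, ["ka", "ma"]) adds a syllable bigram: "ka" becomes a child of the root (a word start) and "ma" a child of "ka", with both ranks incremented. Whether the real module increments every node along the path, or only the last one, is not stated in the note.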