Word Recognizer Implementation
== New Tools ==
* prepare.py - prepare the initial sqlite database;
* populate.py - convert the corpus file to n-gram sqlite format;
* partialword.py - recognize partial words;
* newword.py - filter out the new words;
* markpinyin.py - mark pinyin according to the n-gram sequence;
== Data Flow ==
prepare.py => populate.py => threshold.py => partialword.py => newword.py;
== Implementation ==
=== populate.py ===
word history = [ W1, W2, ... , Wn ]
store word history and freq into ngram table.
multi-pass processing (1...N)
steps:
  for each index file:
    for each pass for ngram table:
      for each word history from corpus file:
        UPDATE ngram SET freq = freq + 1 WHERE words = "word history";
        OR, if no row was updated, INSERT INTO ngram VALUES("word history", 1);
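The update-or-insert step above could be sketched with Python's sqlite3 module as follows. The table schema, the space-joined `words` key, and the function name `count_ngrams` are assumptions made for illustration; they are not taken from populate.py.

```python
import sqlite3

def count_ngrams(conn, sentences, n):
    """Count every n-word history of the segmented sentences into ngram."""
    cur = conn.cursor()
    # Assumed schema: one row per word history, keyed by the joined words.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS ngram (words TEXT PRIMARY KEY, freq INTEGER)"
    )
    for words in sentences:
        for i in range(len(words) - n + 1):
            key = " ".join(words[i:i + n])
            # Try the UPDATE first; fall back to INSERT when no row matched.
            cur.execute("UPDATE ngram SET freq = freq + 1 WHERE words = ?", (key,))
            if cur.rowcount == 0:
                cur.execute("INSERT INTO ngram VALUES (?, 1)", (key,))
    conn.commit()

conn = sqlite3.connect(":memory:")
count_ngrams(conn, [["a", "b", "a", "b"]], 2)
rows = dict(conn.execute("SELECT words, freq FROM ngram"))
```

Running one call per pass (n = 1...N) would mirror the multi-pass processing described above.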
=== partialword.py ===
get partial word threshold pass:
  for each word from libpinyin dictionaries:
    get the word uni-gram frequency from ngram table in 1-gram.db;
    store the word and freq pair into an array;
  sort the word array by freq;
  set the threshold to the freq of the word at the last 10% position of the sorted array;
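A minimal sketch of the threshold pass, assuming the array is sorted by descending freq so that the "last 10%" are the rarest dictionary words. The sort direction and the function name `freq_threshold` are assumptions; the notes do not state them.

```python
def freq_threshold(word_freqs, tail=0.10):
    """Return the highest freq found among the last `tail` fraction of words.

    Assumes descending order by freq, so every word in the rarest 10%
    has a freq less than or equal to the returned threshold.
    """
    if not word_freqs:
        raise ValueError("no words to derive a threshold from")
    ranked = sorted(word_freqs, key=lambda pair: pair[1], reverse=True)
    tail_count = max(1, int(len(ranked) * tail))
    return ranked[-tail_count][1]

# Ten hypothetical words with freqs 100, 90, ..., 10.
pairs = [("w%d" % i, freq) for i, freq in enumerate(range(100, 0, -10))]
threshold = freq_threshold(pairs)
```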
get partial words pass:
  words = set([])
  while True:
    get all partial word candidates from ngram of 2-gram.db;
    skip all existing or already merged words;
    if there are no new partial word candidates:
      break;
    save all partial word candidates to "partialword.txt" file;
    for each index file:
      for each pass for ngram table from N to 1:
        convert ngram table to sqlite fts table;
        combine merged words from higher-gram to lower-gram:
          for each new partial word:
            select matched word sequences from ngram fts table;
            update or insert merged word sequences into lower-gram;
            delete original word sequences (before merging) from higher-gram;
    remember all partial word candidates as merged words;
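The higher-gram to lower-gram merge can be illustrated in memory. This sketch swaps the sqlite fts tables for plain `Counter` objects keyed by word tuples, and the names `merge_in_sequence` and `merge_pair` are hypothetical; it only shows how a merged sequence moves to a lower-gram table.

```python
from collections import Counter

def merge_in_sequence(words, prefix, postfix):
    """Join every adjacent (prefix, postfix) pair in a word tuple."""
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and words[i] == prefix and words[i + 1] == postfix:
            out.append(prefix + postfix)
            i += 2
        else:
            out.append(words[i])
            i += 1
    return tuple(out)

def merge_pair(tables, prefix, postfix):
    """Move sequences containing the pair from higher-gram to lower-gram."""
    for n in sorted(tables, reverse=True):  # from N down to 1
        for words, freq in list(tables[n].items()):
            merged = merge_in_sequence(words, prefix, postfix)
            if merged == words:
                continue  # the pair does not occur in this sequence
            del tables[n][words]  # delete the original word sequence
            # Update or insert the merged sequence in the lower-gram table.
            tables.setdefault(len(merged), Counter())[merged] += freq

tables = {
    2: Counter({("A", "B"): 5, ("B", "C"): 2}),
    1: Counter({("A",): 9, ("B",): 7}),
}
merge_pair(tables, "A", "B")
```

After the call, the bigram `("A", "B")` is gone from the 2-gram table and the merged word `("AB",)` carries its freq in the 1-gram table.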
=== newword.py ===
get new word prefix entropy threshold pass:
  for each word from libpinyin dictionaries:
    get the prefix information entropy of the word from bigram table;
    store the word and entropy pair into an array;
  sort the word array by entropy;
  set the prefix entropy threshold to the entropy of the word at the last 50% position of the sorted array;
get new word postfix entropy threshold pass:
  for each word from libpinyin dictionaries:
    get the postfix information entropy of the word from bigram table;
    store the word and entropy pair into an array;
  sort the word array by entropy;
  set the postfix entropy threshold to the entropy of the word at the last 50% position of the sorted array;
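The notes do not spell out the entropy formula. A common reading, assumed here, is the Shannon entropy of the distribution of neighboring words in the bigram table. The in-memory `bigrams` dict and the function names are illustrative only.

```python
import math

def branching_entropy(neighbor_freqs):
    """Shannon entropy (in bits) of a neighbor-frequency distribution."""
    total = sum(neighbor_freqs)
    if total == 0:
        return 0.0
    return -sum(f / total * math.log2(f / total) for f in neighbor_freqs if f > 0)

def prefix_entropy(bigrams, word):
    """Entropy of the words seen immediately before `word`."""
    return branching_entropy([f for (w1, w2), f in bigrams.items() if w2 == word])

def postfix_entropy(bigrams, word):
    """Entropy of the words seen immediately after `word`."""
    return branching_entropy([f for (w1, w2), f in bigrams.items() if w1 == word])

# Hypothetical bigram counts: "x" follows four different words equally
# often, but is always followed by the same word.
bigrams = {("a", "x"): 1, ("b", "x"): 1, ("c", "x"): 1, ("d", "x"): 1, ("x", "e"): 4}
left = prefix_entropy(bigrams, "x")
right = postfix_entropy(bigrams, "x")
```

A word that appears in many different contexts gets a high entropy on that side, which is why a low value suggests the candidate is only a fragment of a longer word.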
filter out new words pass:
  for each new word candidate (partial word):
    compute the prefix information entropy of the word from bigram table;
    if entropy < prefix entropy threshold:
      continue
    compute the postfix information entropy of the word from bigram table;
    if entropy < postfix entropy threshold:
      continue
    save the new word candidate as new word; (newword.txt)
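The filter pass can be sketched as below, assuming the entropies have already been computed into two dicts. The function name, the dict inputs, and the sample values are hypothetical.

```python
def filter_new_words(candidates, prefix_entropy, postfix_entropy,
                     prefix_threshold, postfix_threshold):
    """Keep candidates whose entropies reach both thresholds."""
    new_words = []
    for word in candidates:
        # A candidate without bigram context is treated as entropy 0.
        if prefix_entropy.get(word, 0.0) < prefix_threshold:
            continue
        if postfix_entropy.get(word, 0.0) < postfix_threshold:
            continue
        new_words.append(word)
    return new_words

candidates = ["AB", "CD", "EF"]
prefix_h = {"AB": 2.0, "CD": 0.5, "EF": 2.5}
postfix_h = {"AB": 1.8, "CD": 2.0, "EF": 0.3}
accepted = filter_new_words(candidates, prefix_h, postfix_h, 1.0, 1.0)
```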
=== markpinyin.py ===
mark pinyin according to the merged word sequence;
an atomic word comes from the libpinyin dictionaries.
a merged word comes from the new words (in partialword.txt).
merge pinyin helper:
  merge all pairs with the same pinyin, and sum freq;
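A minimal sketch of such a helper, assuming the pairs are `(pinyin, freq)` tuples. The name `merge_pinyin` and the sample values are illustrative.

```python
from collections import defaultdict

def merge_pinyin(pairs):
    """Sum the freq of all (pinyin, freq) pairs that share the same pinyin."""
    totals = defaultdict(float)
    for pinyin, freq in pairs:
        totals[pinyin] += freq
    # Sorted by pinyin so the output order is deterministic.
    return sorted(totals.items())

merged = merge_pinyin([("chang'an", 30.0), ("zhang'an", 5.0), ("chang'an", 10.0)])
```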
steps:
  for each new word:
    if an atomic word:
      return all pinyin and freq pairs;
    if a merged word sequence:
      for each merged pair:
        for each prefix:
          for each postfix:
            pinyin = prefix pinyin + "'" + postfix pinyin;
            freq = default * merged poss * prefix poss * postfix poss
      return all pinyin and freq pairs;
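The innermost loops for one merged pair might look like the sketch below. It reads "poss" as a relative weight: each prefix or postfix pinyin freq divided by the total for that side. That reading, the function name `combine_pinyin`, and the default of 100 (taken from the notes on total pinyin freq) are assumptions.

```python
def combine_pinyin(prefix_pairs, postfix_pairs, merged_poss=1.0, default=100.0):
    """Build (pinyin, freq) pairs for a word merged from prefix + postfix.

    Both inputs are non-empty lists of (pinyin, freq) with positive freq.
    """
    prefix_total = sum(freq for _, freq in prefix_pairs)
    postfix_total = sum(freq for _, freq in postfix_pairs)
    results = []
    for pre_pinyin, pre_freq in prefix_pairs:
        for post_pinyin, post_freq in postfix_pairs:
            pinyin = pre_pinyin + "'" + post_pinyin
            freq = (default * merged_poss
                    * (pre_freq / prefix_total)
                    * (post_freq / postfix_total))
            results.append((pinyin, freq))
    return results

pairs = combine_pinyin([("chang", 75.0), ("zhang", 25.0)], [("an", 100.0)])
```

When a word can be formed from several merged pairs, the results of each call would still need to be passed through the merge pinyin helper so that identical pinyin are summed.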
==== notes ====
oldwords.txt: phrase ␣ pinyin without tone ␣ pinyin freq
partialwords.txt: prefix ␣ postfix ␣ phrase ␣ merge freq
newwords.txt: phrase
for new words, recursively divide the pinyin freq among atomic phrases according to the merge freq in partialwords.txt and the pinyin freq in oldwords.txt;
if an atomic phrase is in old words,
  then divide the pinyin freq by the old pinyin freq;
combine entries with the same pinyin and phrase into one, freq = sum freq / all freq;
the total pinyin freq defaults to 100;