author: Peng Wu <alexepico@gmail.com> 2011-07-14 11:16:42 +0800
committer: Peng Wu <alexepico@gmail.com> 2011-07-14 11:18:43 +0800
commit: 24cadd56e4a93daf93e47287fd929b4d25a7a948
tree: 79fe52dc797acee6d1589b98b33ba4453dae4a0b /docs
parent: d3ed4f9115c72d1dcefcb4e24d122955b48d4d7b
write file format in progress
Diffstat (limited to 'docs')
-rw-r--r-- docs/fileformat 47
1 files changed, 45 insertions, 2 deletions
diff --git a/docs/fileformat b/docs/fileformat
index d0945b9..e8bf5de 100644
--- a/docs/fileformat
+++ b/docs/fileformat
@@ -1,9 +1,52 @@
-The file format of libpinyin
+The File Format of libpinyin
-Input file format
+Input File Format
1. Index Files
* The raw corpus is classified into /index/<category>/<subsection>/<items>.index
* Every line consists of <item>#<item path name>
2. Content Files
* The content file is stored in <item path name>, such as <number>.text.
* Note: please add a prefix to the <item path name>, so the content files are easier to organize.
+
+Status File Format
+1. Introduction
+ As mentioned above, there are two kinds of input files:
+ 1. Index Files will be called <items>.index;
+ 2. Content Files will be called <number>.text.
+
+ The training process consists of 5 steps:
+ 1. Segment Raw Corpus
+ 2. Generate Models
+ 3. Estimate Models
+ 4. Prune Models
+ 5. Evaluate Models
+
+2. Status Files
+ Segment Status Files
+ 1. For <number>.text, <number>.text.status will be generated, like:
+ {'SegmentEpoch': 1}.
+ 2. For <items>.index, <items>.index.status will be generated, like:
+ {'SegmentEpoch': 1}.
+ Generate Status Files
+ 1. For <number>.text, if the <number>.text is qualified,
+ <number>.text.status will be generated, like:
+ {'GenerateEpoch': 2}.
+ 2. For <items>.index, <items>.index.status will be generated, like:
+ {'GenerateEpoch': 2}.
+ 3. The generated K Mixture Model files are placed in 'models':
+ 1. The model files are placed in the same sub-directory as <items>.index;
+ 2. Each model file is named like 'model-candidates-0.db', etc.
+ Estimate Status Files
+ 1. For model-candidates-<num>.db, model-candidates-<num>.db.status is generated, like:
+ {'EstimateEpoch': 3, 'EstimateScore': 0.7}
+ 2. The 'Estimate.index' file is generated, with content like:
+ <sub-directory>#model-candidates-<num>.db#<score>
+ The lines are sorted by <score>.
+ Prune Status Files
+ 1. 'merged.db', 'pruned.db', 'kmm.text', 'interpolation.text' are generated when running the prune tools in the 'finals/try<num>' sub-directory.
+ 2. A 'prune.status' file is also generated, like:
+ {'PruneEpoch': 4, 'PruneMergeNumber': 1000,
+ 'PruneK':2, 'PruneCDF': 0.6}
+ Evaluate Status Files
+ 1. An 'evaluate.status' file is generated, like:
+ {'EvaluateEpoch': 5, 'EvaluateCorrectionRate': 0.77}
\ No newline at end of file
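
The status and index formats described in this patch can be read with a few lines of Python. The following is only a minimal sketch, not the trainer's own code: it assumes the status files contain exactly one Python-style dict literal (as the single-quoted examples above suggest), and that '#' is the only field separator in index lines. The function names parse_status, format_status, parse_index_line and parse_estimate_line are hypothetical and do not come from the patch.

```python
import ast


def parse_status(text):
    """Parse the text of a *.status file into a dict.

    Assumption: the file holds one dict literal such as
    {'SegmentEpoch': 1}, possibly wrapped over several lines.
    """
    status = ast.literal_eval(text.strip())
    if not isinstance(status, dict):
        raise ValueError("status file must contain a dict literal")
    return status


def format_status(status):
    """Serialize a status dict back to the single-quoted form shown above."""
    return repr(status)


def parse_index_line(line):
    """Split an index line of the form <item>#<item path name>."""
    item, path = line.rstrip("\n").split("#", 1)
    return item, path


def parse_estimate_line(line):
    """Split an Estimate.index line of the form
    <sub-directory>#model-candidates-<num>.db#<score>.

    Splitting from the right keeps any '#' inside the sub-directory intact.
    """
    subdir, model, score = line.rstrip("\n").rsplit("#", 2)
    return subdir, model, float(score)


if __name__ == "__main__":
    print(parse_status("{'EstimateEpoch': 3, 'EstimateScore': 0.7}"))
    print(parse_estimate_line("sub/dir#model-candidates-0.db#0.7"))
```

ast.literal_eval is used here instead of eval so that a malformed or hostile status file cannot execute code; it accepts only literals such as dicts, numbers and strings.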