author | Peng Wu <alexepico@gmail.com> | 2011-07-14 11:16:42 +0800 |
---|---|---|
committer | Peng Wu <alexepico@gmail.com> | 2011-07-14 11:18:43 +0800 |
commit | 24cadd56e4a93daf93e47287fd929b4d25a7a948 | |
tree | 79fe52dc797acee6d1589b98b33ba4453dae4a0b /docs | |
parent | d3ed4f9115c72d1dcefcb4e24d122955b48d4d7b | |
write file format in progress
Diffstat (limited to 'docs')
-rw-r--r-- | docs/fileformat | 47 |
1 files changed, 45 insertions, 2 deletions
diff --git a/docs/fileformat b/docs/fileformat
index d0945b9..e8bf5de 100644
--- a/docs/fileformat
+++ b/docs/fileformat
@@ -1,9 +1,52 @@
-The file format of libpinyin
+The File Format of libpinyin
 
-Input file format
+Input File Format
 1. Index Files
   * raw corpus are classified into /index/<category>/<subsection>/<items>.index
   * Every line consists of <item>#<item path name>
 2. Content Files
   * The content file is stored in <item path name>, such as <number>.text.
   * Note: please add a prefix to the <item path name>, so the content files are easier to organize.
+
+Status File Format
+1. Introduction
+  As mentioned above, there are two kinds of input files:
+    1. Index Files will be called <items>.index;
+    2. Content Files will be called <number>.text.
+
+  The training process consists of 5 steps:
+    1. Segment Raw Corpus
+    2. Generate Models
+    3. Estimate Models
+    4. Prune Models
+    5. Evaluate Models
+
+2. Status Files
+  Segment Status Files
+    1. For <number>.text, <number>.text.status will be generated, like:
+       {'SegmentEpoch': 1}.
+    2. For <items>.index, <items>.index.status will be generated, like:
+       {'SegmentEpoch': 1}.
+  Generate Status Files
+    1. For <number>.text, if the <number>.text is qualified,
+       <number>.text.status will be generated, like:
+       {'GenerateEpoch': 2}.
+    2. For <items>.index, <items>.index.status will be generated, like:
+       {'GenerateEpoch': 2}.
+    3. The generated K Mixture Model files are placed in 'models':
+       1. The model files are placed in the same sub-directory as <items>.index;
+       2. Each model files are named as 'model-candidates-0.db' ,etc.
+  Estimate Status Files
+    1. For model-candidates-<num>.db, model-candidates-<num>.db.status are generated, like:
+       {'EstimateEpoch': 3, 'EstimateScore': 0.7}
+    2. The 'Estimate.index' file are generated, with content like:
+       <sub-directory>#model-candidates-<num>.db#<score>
+       The lines are sorted by <score>.
+  Prune Status Files
+    1. 'merged.db', 'pruned.db', 'kmm.text', 'interpolation.text' are generated when running prune tools in 'finals/try<num>' sub-directory.
+    2. 'prune.status' file are generated also, like:
+       {'PruneEpoch': 4, 'PruneMergeNumber': 1000,
+        'PruneK':2, 'PruneCDF': 0.6}
+  Evaluate Status Files
+    1. 'evaluate.status' file are generated, like:
+       {'EvaluateEpoch': 5, 'EvaluateCorrectionRate': 0.77}
\ No newline at end of file
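
The status files added in this diff look like Python dict literals, and each 'Estimate.index' line is a '#'-separated record. The sketch below shows one way such files could be read and written. It is not the trainer's actual code: the helper names `load_status`, `store_status` and `parse_estimate_line` are hypothetical, and it assumes that the trailing period in examples such as {'SegmentEpoch': 1}. is sentence punctuation rather than part of the file.

```python
import ast
import os


def load_status(path):
    """Read a status file; a missing or empty file yields an empty dict."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    # Assumption: tolerate a trailing period, as printed in the examples.
    if text.endswith("."):
        text = text[:-1]
    if not text:
        return {}
    # literal_eval accepts only Python literals, so no code is executed.
    status = ast.literal_eval(text)
    if not isinstance(status, dict):
        raise ValueError("status file does not hold a dict: " + path)
    return status


def store_status(path, status):
    """Write the status dict back in the same literal form."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(repr(status))


def parse_estimate_line(line):
    """Split '<sub-directory>#model-candidates-<num>.db#<score>'."""
    # Split from the right so a '#' inside the sub-directory is preserved.
    subdir, model, score = line.rstrip("\n").rsplit("#", 2)
    return subdir, model, float(score)
```

A step could, for instance, call `load_status("1.text.status")`, check whether 'SegmentEpoch' is already present, and call `store_status` after finishing its work. Whether the real tools skip completed steps this way is not stated in the diff.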