The File Format of libpinyin

Input File Format

1. Index Files
   * The raw corpus is classified into /index///.index
   * Every line consists of #
2. Content Files
   * The content file is stored in , such as .text.
   * Note: please add a prefix to the , so the content files are easier to organize.

Status File Format

1. Introduction

As mentioned above, there are two kinds of input files:
   1. Index Files will be called .index;
   2. Content Files will be called .text.

The training process consists of 5 steps:
   1. Segment Raw Corpus
   2. Generate Models
   3. Estimate Models
   4. Prune Models
   5. Evaluate Models

2. Status Files

Segment Status Files
   1. For .text, .text.status will be generated, like: {'SegmentEpoch': 1}.
   2. For .index, .index.status will be generated, like: {'SegmentEpoch': 1}.

Generate Status Files
   1. For .text, if the .text is qualified, .text.status will be generated, like: {'GenerateEpoch': 2}.
   2. For .index, .index.status will be generated, like: {'GenerateEpoch': 2, 'GenerateModelEnd': 10, 'GenerateTextEnd': 1000}.
   3. The generated K Mixture Model files are placed in 'models':
      1. The model files are placed in the same sub-directory as .index;
      2. Each model file is named like 'model-candidates-0.db', etc.
      3. The status files are named like 'model-candidates-0.db.status', with content like: {'GenerateEpoch': 2, 'GenerateStart': 100, 'GenerateEnd': 200}.

Estimate Status Files
   1. For model-candidates-.db, model-candidates-.db.status is generated, like: {'EstimateEpoch': 3, 'EstimateScore': 0.7}
   2. The 'Estimate.index' file is generated, with content like: #model-candidates-.db# The lines are sorted by .

Prune Status Files
   1. 'merged.db', 'kmm_merged.text', 'pruned.db', 'kmm_pruned.text' and 'interpolation.text' are generated when running the prune tools in the 'finals/try' sub-directory.
   2. A 'prune.status' file is also generated, like: {'PruneEpoch': 4, 'PruneMergeNumber': 1000, 'PruneK': 2, 'PruneCDF': 0.6}

Evaluate Status Files
   1. An 'evaluate.status' file is generated, like: {'EvaluateEpoch': 5, 'EvaluateCorrectionRate': 0.77}
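
The status file examples above look like Python dict literals (single-quoted keys), so one way to read and write them is sketched below. This is only an illustration under that assumption, not the code the training tools actually use: the helper names (status_path, load_status, store_status, has_epoch) and the EPOCHS table are hypothetical, and the epoch numbers are simply copied from the examples in this page.

```python
import ast
import os
import tempfile

# Assumption: epoch numbers as they appear in the examples on this page.
EPOCHS = {
    'SegmentEpoch': 1,
    'GenerateEpoch': 2,
    'EstimateEpoch': 3,
    'PruneEpoch': 4,
    'EvaluateEpoch': 5,
}


def status_path(path):
    """Status files sit next to the file they describe, with '.status' appended."""
    return path + '.status'


def load_status(path):
    """Return the status dict for `path`, or {} if no status file exists yet."""
    spath = status_path(path)
    if not os.path.exists(spath):
        return {}
    with open(spath, encoding='utf-8') as f:
        text = f.read().strip()
    if not text:
        return {}
    # literal_eval only accepts literals, so it will not run arbitrary code.
    status = ast.literal_eval(text)
    if not isinstance(status, dict):
        raise ValueError('unexpected status content in ' + spath)
    return status


def store_status(path, status):
    """Write `status` as a dict literal, e.g. {'SegmentEpoch': 1}."""
    with open(status_path(path), 'w', encoding='utf-8') as f:
        f.write(repr(status) + '\n')


def has_epoch(status, epoch_key):
    """True if `status` records the given step with the expected epoch number."""
    return status.get(epoch_key) == EPOCHS[epoch_key]


# Small round-trip demo in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    text_file = os.path.join(workdir, 'example.text')
    store_status(text_file, {'SegmentEpoch': 1})
    demo_status = load_status(text_file)
    demo_missing = load_status(os.path.join(workdir, 'absent.text'))
```

A check such as has_epoch(load_status(name), 'SegmentEpoch') would let a step skip files that an earlier run already processed. Whether the real tools skip work this way is an assumption.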