docs/fileformat


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56

The File Format of libpinyin

Input File Format
1. Index Files
   * raw corpus are classified into /index/<category>/<subsection>/<items>.index
     * Every line consists of <item>#<item path name>
2. Content Files
   * The content file is stored in <item path name>, such as <number>.text.
     * Note: please add a prefix to the <item path name>, so the content files are easier to organize.

Status File Format
1. Introduction
   As mentioned above, there are two kinds of input files:
   1. Index Files will be called <items>.index;
   2. Content Files will be called <number>.text.

   The training process consists of 5 steps:
       1. Segment Raw Corpus
       2. Generate Models
       3. Estimate Models
       4. Prune Models
       5. Evaluate Models

2. Status Files
   Segment Status Files
       1. For <number>.text, <number>.text.status will be generated, like:
          {'SegmentEpoch': 1}.
       2. For <items>.index, <items>.index.status will be generated, like:
          {'SegmentEpoch': 1}.
   Generate Status Files
       1. For <number>.text, if the <number>.text is qualified,
          <number>.text.status will be generated, like:
          {'GenerateEpoch': 2}.
       2. For <items>.index, <items>.index.status will be generated, like:
          {'GenerateEpoch': 2, 'GenerateModelEnd':10, 'GenerateTextEnd':1000}.
       3. The generated K Mixture Model files are placed in 'models':
          1. The model files are placed in the same sub-directory as <items>.index;
          2. Each model files are named as 'model-candidates-0.db', etc.
          3. The status files are named as 'model-candidates-0.db.status', like:
          {'GenerateEpoch': 2, 'GenerateStart': 100, 'GenerateEnd': 200}.
   Estimate Status Files
       1. For model-candidates-<num>.db, model-candidates-<num>.db.status are generated, like:
          {'EstimateEpoch': 3, 'EstimateScore': 0.7}
       2. The 'Estimate.index' file are generated, with content like:
          <sub-directory>#model-candidates-<num>.db#<score>
          The lines are sorted by <score>.
   Prune Status Files
       1. 'merged.db', 'kmm_merged.text' , 'pruned.db', 'kmm_pruned.text', 'interpolation2.text' are generated when running prune tools in 'finals/try<name>' sub-directory.
       2. 'cwd.status' file are generated also, like:
          {'PruneEpoch': 4, 'PruneMergeNumber': 1000,
          'PruneModelSize' : 10000000,
          'PruneK' : 2, 'PruneCDF' : 0.6}
   Evaluate Status Files
       1. 'cwd.status' file are generated, like:
          {'EvaluateEpoch': 5, 'EvaluateAverageLambda': 0.66,
          'EvaluateCorrectionRate': 0.77}