Changes between Version 11 and Version 12 of NgramPreparsing
- Timestamp: 07/19/09 15:01:36
NgramPreparsing
v11  v12
  4    4
  5    5  == ''n''-gram Statistics ==
  6       While processing some sections of the 78 Wikipedia split, the following statistics were observed for ''n''-grams using the hash function ` H(h, c)= c + (h << 6) + (h << 16) - h`.
       6  While processing some sections of the 78 Wikipedia split, the following statistics were observed for ''n''-grams using the hash function `h' = c + (h << 6) + (h << 16) - h`.
  7    7
  8    8  {{{

 26   26  distribution used = 776159 (8.41513e-12 %)
 27   27
      28  5-grams
      29  Distinct n-grams: 635601 628624 98.9022987691%
      30  nlines = 628624
      31  ncollisions = 0 (0 %)
      32  distribution used = 628624 (6.81556e-12 %)
 28   33
 29   34  pc-4e34-0.it.usyd.edu.au

 46   51  distribution used = 840554 (9.1133e-12 %)
 47   52
      53  5-grams
      54  Distinct n-grams: 681532 674745 99.0041553441%
      55  nlines = 674745
      56  ncollisions = 0 (0 %)
      57  distribution used = 674745 (7.3156e-12 %)
 48   58
 49   59  pc-4e33-0.it.usyd.edu.au

 66   76  distribution used = 844794 (9.15927e-12 %)
 67   77
      78  5-grams
      79  Distinct n-grams: 685291 678226 98.9690511038%
      80  nlines = 678226
      81  ncollisions = 0 (0 %)
      82  distribution used = 678226 (7.35334e-12 %)
 68   83
 69   84  pc-4e37-0.it.usyd.edu.au

 81   96
 82   97  4-grams
 83       Distinct n-grams: 561169 551415 98.2618426891%
 84       nlines = 551415
 85       ncollisions = 0 (0 %)
 86       distribution used = 551415 (5.97845e-12 %)
 87
      98  Distinct n-grams: 862822 846759 98.1383182162%
      99  nlines = 846759
     100  ncollisions = 0 (0 %)
     101  distribution used = 846759 (9.18058e-12 %)
     102
     103  5-grams
     104  Distinct n-grams: 686587 679651 98.9897857081%
     105  nlines = 679651
     106  ncollisions = 0 (0 %)
     107  distribution used = 679651 (7.36879e-12 %)
 88  108
 89  109  pc-4e75-0.it.usyd.edu.au

106  126  distribution used = 850902 (9.2255e-12 %)
107  127
     128  5-grams
     129  Distinct n-grams: 689045 682093 98.9910673468%
     130  nlines = 682093
     131  ncollisions = 0 (0 %)
     132  distribution used = 682093 (7.39527e-12 %)
108  133
109  134  pc-4e74-1.it.usyd.edu.au

126  151  distribution used = 842425 (9.13359e-12 %)
127  152
     153  5-grams
     154  Distinct n-grams: 685146 678148 98.9786118579%
     155  nlines = 678148
     156  ncollisions = 0 (0 %)
     157  distribution used = 678148 (7.3525e-12 %)
128  158
129  159  pc-4e41-1.it.usyd.edu.au

146  176  distribution used = 838168 (9.08744e-12 %)
147  177
     178  5-grams
     179  Distinct n-grams: 681426 674468 98.9789060000%
     180  nlines = 674468
     181  ncollisions = 0 (0 %)
     182  distribution used = 674468 (7.3126e-12 %)
148  183
149  184  pc-4e32-2.it.usyd.edu.au

166  201  distribution used = 832684 (9.02798e-12 %)
167  202
168  203  5-grams
     204  Distinct n-grams: 676379 669500 98.9829666503%
     205  nlines = 669500
     206  ncollisions = 0 (0 %)
     207  distribution used = 669500 (7.25873e-12 %)
     208
     209  pc-4e73-0.it.usyd.edu.au
169  210  pc-3w14-0.it.usyd.edu.au
170  211  2-grams

185  226  ncollisions = 0 (0 %)
186  227  distribution used = 858794 (9.31106e-12 %)
     228
     229  5-grams
     230  Distinct n-grams: 695111 688327 99.0240407647%
     231  nlines = 688327
     232  ncollisions = 0 (0 %)
     233  distribution used = 688327 (7.46286e-12 %)
187  234  }}}
188  235
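The recurrence quoted in the changed line, `h' = c + (h << 6) + (h << 16) - h`, is the same update used by the sdbm string hash and is equivalent to `h' = 65599 * h + c`. The following is a minimal sketch of how such a hash and the collision counts might be computed, not the code that produced the figures above. Several things here are assumptions: the 64-bit wraparound, the starting value of 0, joining the tokens of an ''n''-gram with a single space, and all function names. The `percent_used` helper divides by 2^63 only because that denominator reproduces the reported "distribution used" percentages (for example 776159 gives about 8.41513e-12 %); the page itself does not state it.

```python
MASK64 = (1 << 64) - 1  # assumption: hash values wrap at 64 bits


def step(h, c):
    """One update h' = c + (h << 6) + (h << 16) - h, reduced modulo 2**64."""
    return (c + (h << 6) + (h << 16) - h) & MASK64


def hash_bytes(data):
    """Fold the update over every byte, starting from 0 (assumed seed)."""
    h = 0
    for c in data:
        h = step(h, c)
    return h


def hash_ngram(tokens):
    """Hash an n-gram; joining tokens with one space is an assumption."""
    return hash_bytes(" ".join(tokens).encode("utf-8"))


def collision_stats(ngrams):
    """Return (n-grams seen, distinct hash values, collisions).

    A collision is counted for every occurrence whose hash value is
    already owned by a different n-gram.
    """
    owner = {}
    total = collisions = 0
    for tokens in ngrams:
        total += 1
        ngram = tuple(tokens)
        first = owner.setdefault(hash_ngram(ngram), ngram)
        if first != ngram:
            collisions += 1
    return total, len(owner), collisions


def percent_used(distinct_hashes, space=1 << 63):
    """Share of the hash space in use, as a percentage (space is inferred)."""
    return 100.0 * distinct_hashes / space
```

For example, `collision_stats([("a", "b"), ("a", "b"), ("b", "c")])` counts three n-grams mapped to two distinct hash values with no collisions.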