Changes between Version 11 and Version 12 of NgramPreparsing

Author:
tim (IP: 38.102.22.200)
Timestamp:
07/19/09 15:01:36 (4 months ago)
Comment:

--

  • NgramPreparsing

    v11  v12
      4    4
      5    5  == ''n''-gram Statistics ==
      6       While processing some sections of the 78 Wikipedia split, the following statistics were observed for ''n''-grams using the hash function `H(h, c) = c + (h << 6) + (h << 16) - h`.
           6  While processing some sections of the 78 Wikipedia split, the following statistics were observed for ''n''-grams using the hash function `h' = c + (h << 6) + (h << 16) - h`.
      7    7
      8    8  {{{
     26   26  distribution used = 776159 (8.41513e-12 %)
     27   27
          28  5-grams
          29  Distinct n-grams:       635601  628624  98.9022987691%
          30  nlines      = 628624
          31  ncollisions = 0 (0 %)
          32  distribution used = 628624 (6.81556e-12 %)
     28   33
     29   34  pc-4e34-0.it.usyd.edu.au
     46   51  distribution used = 840554 (9.1133e-12 %)
     47   52
          53  5-grams
          54  Distinct n-grams:       681532  674745  99.0041553441%
          55  nlines      = 674745
          56  ncollisions = 0 (0 %)
          57  distribution used = 674745 (7.3156e-12 %)
     48   58
     49   59  pc-4e33-0.it.usyd.edu.au
     66   76  distribution used = 844794 (9.15927e-12 %)
     67   77
          78  5-grams
          79  Distinct n-grams:       685291  678226  98.9690511038%
          80  nlines      = 678226
          81  ncollisions = 0 (0 %)
          82  distribution used = 678226 (7.35334e-12 %)
     68   83
     69   84  pc-4e37-0.it.usyd.edu.au
     81   96
     82   97  4-grams
     83       Distinct n-grams:       561169  551415  98.2618426891%
     84       nlines      = 551415
     85       ncollisions = 0 (0 %)
     86       distribution used = 551415 (5.97845e-12 %)
     87
          98  Distinct n-grams:       862822  846759  98.1383182162%
          99  nlines      = 846759
         100  ncollisions = 0 (0 %)
         101  distribution used = 846759 (9.18058e-12 %)
         102
         103  5-grams
         104  Distinct n-grams:       686587  679651  98.9897857081%
         105  nlines      = 679651
         106  ncollisions = 0 (0 %)
         107  distribution used = 679651 (7.36879e-12 %)
     88  108
     89  109  pc-4e75-0.it.usyd.edu.au
    106  126  distribution used = 850902 (9.2255e-12 %)
    107  127
         128  5-grams
         129  Distinct n-grams:       689045  682093  98.9910673468%
         130  nlines      = 682093
         131  ncollisions = 0 (0 %)
         132  distribution used = 682093 (7.39527e-12 %)
    108  133
    109  134  pc-4e74-1.it.usyd.edu.au
    126  151  distribution used = 842425 (9.13359e-12 %)
    127  152
         153  5-grams
         154  Distinct n-grams:       685146  678148  98.9786118579%
         155  nlines      = 678148
         156  ncollisions = 0 (0 %)
         157  distribution used = 678148 (7.3525e-12 %)
    128  158
    129  159  pc-4e41-1.it.usyd.edu.au
    146  176  distribution used = 838168 (9.08744e-12 %)
    147  177
         178  5-grams
         179  Distinct n-grams:       681426  674468  98.9789060000%
         180  nlines      = 674468
         181  ncollisions = 0 (0 %)
         182  distribution used = 674468 (7.3126e-12 %)
    148  183
    149  184  pc-4e32-2.it.usyd.edu.au
    166  201  distribution used = 832684 (9.02798e-12 %)
    167  202
    168
         203  5-grams
         204  Distinct n-grams:       676379  669500  98.9829666503%
         205  nlines      = 669500
         206  ncollisions = 0 (0 %)
         207  distribution used = 669500 (7.25873e-12 %)
         208
         209  pc-4e73-0.it.usyd.edu.au
    169  210  pc-3w14-0.it.usyd.edu.au
    170  211  2-grams
    185  226  ncollisions = 0 (0 %)
    186  227  distribution used = 858794 (9.31106e-12 %)
         228
         229  5-grams
         230  Distinct n-grams:       695111  688327  99.0240407647%
         231  nlines      = 688327
         232  ncollisions = 0 (0 %)
         233  distribution used = 688327 (7.46286e-12 %)
    187  234  }}}
    188  235
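
The hash update quoted in the changed paragraph, `h' = c + (h << 6) + (h << 16) - h`, is arithmetically the same as `h' = 65599 * h + c`, the recurrence used by the sdbm string hash. The following is a minimal Python sketch of how figures like those in the listing could be computed, not the tool that produced them. Several things here are assumptions the page does not state: the hash is wrapped to 64 bits, `c` is taken to be each UTF-8 byte of the n-gram text, `h` starts at 0, and the helper names are hypothetical. The `2**63` denominator for "distribution used" is inferred only from the reported numbers (for example, 628624 is about 6.81556e-12 % of 2**63).

```python
from collections import defaultdict

MASK64 = (1 << 64) - 1  # assumption: hash value kept in a 64-bit unsigned word


def hash_step(h: int, c: int) -> int:
    """One update h' = c + (h << 6) + (h << 16) - h, wrapped to 64 bits."""
    return (c + (h << 6) + (h << 16) - h) & MASK64


def hash_ngram(ngram: str) -> int:
    """Fold hash_step over the UTF-8 bytes of an n-gram (assumed input form)."""
    h = 0  # assumption: initial hash value is 0
    for byte in ngram.encode("utf-8"):
        h = hash_step(h, byte)
    return h


def collision_stats(ngrams):
    """Return (distinct_hashes, collisions) for an iterable of n-gram strings.

    A collision is counted whenever two different n-grams share a hash value.
    """
    buckets = defaultdict(set)
    for gram in ngrams:
        buckets[hash_ngram(gram)].add(gram)
    distinct_hashes = len(buckets)
    collisions = sum(len(grams) - 1 for grams in buckets.values())
    return distinct_hashes, collisions


def percent_distinct(total: int, distinct: int) -> float:
    """Share of distinct n-grams among all n-grams seen, as a percentage."""
    return 100.0 * distinct / total


def distribution_used_percent(used: int, space: int = 2**63) -> float:
    """Share of the hash space occupied; the 2**63 default is inferred."""
    return 100.0 * used / space
```

For the first 5-gram entry, `percent_distinct(635601, 628624)` reproduces the reported 98.9022987691% to the printed precision, which suggests the two columns after "Distinct n-grams:" are the total and distinct counts respectively.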