n-gram Preparsing

This task involves persisting specific parses for n-grams to disk and loading them straight back into the chart. When a preparse is loaded into the chart, the cells that it shadows must not be used by any combination rules.
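
The shadowing rule can be illustrated with a small sketch. This is not the parser's code: the `Chart` class, its `(start, length)` cell indexing, and the choice of which cells count as shadowed (every sub-span lying wholly inside the n-gram, excluding the cell that spans the whole n-gram) are assumptions made only for illustration.

```python
class Chart:
    """Toy chart: cells are keyed by (start, length). Hypothetical structure."""

    def __init__(self, nwords):
        self.nwords = nwords
        self.cells = {}        # (start, length) -> list of entries
        self.shadowed = set()  # cells that combination rules must skip

    def load_preparse(self, start, length, parse):
        """Place a stored parse for the n-gram covering words
        start .. start + length - 1 and shadow the cells beneath it."""
        if start < 0 or length < 1 or start + length > self.nwords:
            raise ValueError("n-gram does not fit in the sentence")
        # Assumption: shadow every sub-span strictly shorter than the
        # n-gram that lies wholly inside it.
        for sub_len in range(1, length):
            for sub_start in range(start, start + length - sub_len + 1):
                self.shadowed.add((sub_start, sub_len))
        # The cell spanning the whole n-gram holds the loaded parse and
        # stays available to the rest of the chart.
        self.cells[(start, length)] = [parse]

    def usable(self, start, length):
        """True if combination rules may read from this cell."""
        return (start, length) not in self.shadowed
```

For a five-word sentence, `Chart(5).load_preparse(1, 3, parse)` would shadow the five cells under words 1 to 3 while leaving the spanning cell `(1, 3)` and every cell outside the n-gram usable.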

n-gram Statistics

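While processing some sections of the 78 Wikipedia split, the following statistics were observed for n-grams using the hash function h' = c + (h << 6) + (h << 16) - h. The listings are grouped by the machine that produced them. In each "Distinct n-grams" line, the third column is the second count expressed as a percentage of the first, and nlines equals the second count.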
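
As a rough sketch of how such a hash and collision count might be computed: the recurrence is the one quoted above (it matches the widely used sdbm string hash, since (h << 6) + (h << 16) - h equals h * 65599). The 64-bit mask, the byte-wise UTF-8 iteration, and the names `ngram_hash` and `count_collisions` are assumptions, not details taken from the parser.

```python
MASK64 = (1 << 64) - 1  # assumption: hash kept in 64-bit unsigned arithmetic


def ngram_hash(text):
    """Apply h' = c + (h << 6) + (h << 16) - h over the bytes of text."""
    h = 0
    for c in text.encode("utf-8"):
        h = (c + (h << 6) + (h << 16) - h) & MASK64
    return h


def count_collisions(ngrams):
    """Count distinct n-grams whose hash equals that of an earlier,
    different n-gram."""
    seen = {}
    collisions = 0
    for ngram in set(ngrams):
        key = ngram_hash(ngram)
        if key in seen:
            collisions += 1
        else:
            seen[key] = ngram
    return collisions
```

Duplicates are removed before hashing, so only genuinely different n-grams that share a hash value are counted.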

pc-4e43-1.it.usyd.edu.au
2-grams
Distinct n-grams:       1126279 1070186 95.0196176968%
nlines      = 1070186
ncollisions = 0 (0 %)
distribution used = 1070186 (1.1603e-11 %)

3-grams
Distinct n-grams:       1078903 1048126 97.1473802556%
nlines      = 1048126
ncollisions = 0 (0 %)
distribution used = 1048126 (1.13638e-11 %)

4-grams
Distinct n-grams:       791545  776159  98.0562065327%
nlines      = 776159
ncollisions = 0 (0 %)
distribution used = 776159 (8.41513e-12 %)

5-grams
Distinct n-grams:       635601  628624  98.9022987691%
nlines      = 628624
ncollisions = 0 (0 %)
distribution used = 628624 (6.81556e-12 %)

pc-4e34-0.it.usyd.edu.au
2-grams
Distinct n-grams:       1189852 1129098 94.8939868151%
nlines      = 1129098
ncollisions = 0 (0 %)
distribution used = 1129098 (1.22417e-11 %)

3-grams
Distinct n-grams:       1168272 1134068 97.0722571455%
nlines      = 1134068
ncollisions = 0 (0 %)
distribution used = 1134068 (1.22956e-11 %)

4-grams
Distinct n-grams:       856732  840554  98.1116615230%
nlines      = 840554
ncollisions = 0 (0 %)
distribution used = 840554 (9.1133e-12 %)

5-grams
Distinct n-grams:       681532  674745  99.0041553441%
nlines      = 674745
ncollisions = 0 (0 %)
distribution used = 674745 (7.3156e-12 %)

pc-4e33-0.it.usyd.edu.au
2-grams
Distinct n-grams:       1194237 1132662 94.8439882535%
nlines      = 1132662
ncollisions = 0 (0 %)
distribution used = 1132662 (1.22803e-11 %)

3-grams
Distinct n-grams:       1175704 1140752 97.0271428863%
nlines      = 1140752
ncollisions = 0 (0 %)
distribution used = 1140752 (1.23681e-11 %)

4-grams
Distinct n-grams:       860982  844794  98.1198213203%
nlines      = 844794
ncollisions = 0 (0 %)
distribution used = 844794 (9.15927e-12 %)

5-grams
Distinct n-grams:       685291  678226  98.9690511038%
nlines      = 678226
ncollisions = 0 (0 %)
distribution used = 678226 (7.35334e-12 %)

pc-4e37-0.it.usyd.edu.au
2-grams
Distinct n-grams:       1195179 1133279 94.8208594695%
nlines      = 1133279
ncollisions = 0 (0 %)
distribution used = 1133279 (1.2287e-11 %)

3-grams
Distinct n-grams:       1177007 1142158 97.0391849836%
nlines      = 1142158
ncollisions = 0 (0 %)
distribution used = 1142158 (1.23833e-11 %)

4-grams
Distinct n-grams:       862822  846759  98.1383182162%
nlines      = 846759
ncollisions = 0 (0 %)
distribution used = 846759 (9.18058e-12 %)

5-grams
Distinct n-grams:       686587  679651  98.9897857081%
nlines      = 679651
ncollisions = 0 (0 %)
distribution used = 679651 (7.36879e-12 %)

pc-4e75-0.it.usyd.edu.au
2-grams
Distinct n-grams:       1194628 1133134 94.8524561620%
nlines      = 1133134
ncollisions = 0 (0 %)
distribution used = 1133134 (1.22855e-11 %)

3-grams
Distinct n-grams:       1179311 1144627 97.0589606982%
nlines      = 1144627
ncollisions = 0 (0 %)
distribution used = 1144627 (1.24101e-11 %)

4-grams
Distinct n-grams:       867101  850902  98.1318208605%
nlines      = 850902
ncollisions = 0 (0 %)
distribution used = 850902 (9.2255e-12 %)

5-grams
Distinct n-grams:       689045  682093  98.9910673468%
nlines      = 682093
ncollisions = 0 (0 %)
distribution used = 682093 (7.39527e-12 %)

pc-4e74-1.it.usyd.edu.au
2-grams
Distinct n-grams:       1182634 1121630 94.8416839022%
nlines      = 1121630
ncollisions = 0 (0 %)
distribution used = 1121630 (1.21607e-11 %)

3-grams
Distinct n-grams:       1165312 1130832 97.0411357644%
nlines      = 1130832
ncollisions = 0 (0 %)
distribution used = 1130832 (1.22605e-11 %)

4-grams
Distinct n-grams:       858185  842425  98.1635661308%
nlines      = 842425
ncollisions = 0 (0 %)
distribution used = 842425 (9.13359e-12 %)

5-grams
Distinct n-grams:       685146  678148  98.9786118579%
nlines      = 678148
ncollisions = 0 (0 %)
distribution used = 678148 (7.3525e-12 %)

pc-4e41-1.it.usyd.edu.au
2-grams
Distinct n-grams:       1180214 1119052 94.8177194983%
nlines      = 1119052
ncollisions = 0 (0 %)
distribution used = 1119052 (1.21328e-11 %)

3-grams
Distinct n-grams:       1162037 1127838 97.0569783922%
nlines      = 1127838
ncollisions = 0 (0 %)
distribution used = 1127838 (1.2228e-11 %)

4-grams
Distinct n-grams:       854114  838168  98.1330361052%
nlines      = 838168
ncollisions = 0 (0 %)
distribution used = 838168 (9.08744e-12 %)

5-grams
Distinct n-grams:       681426  674468  98.9789060000%
nlines      = 674468
ncollisions = 0 (0 %)
distribution used = 674468 (7.3126e-12 %)

pc-4e32-2.it.usyd.edu.au
2-grams
Distinct n-grams:       1178470 1118087 94.8761529780%
nlines      = 1118087
ncollisions = 0 (0 %)
distribution used = 1118087 (1.21223e-11 %)

3-grams
Distinct n-grams:       1154002 1120140 97.0656896608%
nlines      = 1120140
ncollisions = 0 (0 %)
distribution used = 1120140 (1.21446e-11 %)

4-grams
Distinct n-grams:       848489  832684  98.1372769711%
nlines      = 832684
ncollisions = 0 (0 %)
distribution used = 832684 (9.02798e-12 %)

5-grams
Distinct n-grams:       676379  669500  98.9829666503%
nlines      = 669500
ncollisions = 0 (0 %)
distribution used = 669500 (7.25873e-12 %)

pc-4e73-0.it.usyd.edu.au
pc-3w14-0.it.usyd.edu.au
2-grams
Distinct n-grams:       1203201 1140546 94.7926406311%
nlines      = 1140546
ncollisions = 0 (0 %)
distribution used = 1140546 (1.23658e-11 %)

3-grams
Distinct n-grams:       1193592 1158138 97.0296382683%
nlines      = 1158138
ncollisions = 0 (0 %)
distribution used = 1158138 (1.25566e-11 %)

4-grams
Distinct n-grams:       875241  858794  98.1208604258%
nlines      = 858794
ncollisions = 0 (0 %)
distribution used = 858794 (9.31106e-12 %)

5-grams
Distinct n-grams:       695111  688327  99.0240407647%
nlines      = 688327
ncollisions = 0 (0 %)
distribution used = 688327 (7.46286e-12 %)

Log

30/6/09: Got version 1 of the binary swizzling working. The parser can now take a set of input sentences and dump each sentence, together with its associated binary representation in the chart, straight to disk, where the sentence is represented as an n-gram. This is done by passing the --printer memdump argument to the parser. When parsing other sentences, you can then specify a database of previously parsed n-grams to use. If an n-gram in the current sentence is also in the database, the section of the chart covered by that n-gram is shadowed out and replaced by the database version. Shadowed cells cannot be used by any CCG rules, which effectively makes those cells const.

26/6/09: Spoke to James. Will stop working on the current implementation and instead work on a very different idea: use a binary format and, after doing some pointer swizzling, dump chunks of RAM straight to disk as the mapping file.
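
To make the pointer swizzling idea concrete, here is a minimal sketch. It is not the parser's format: the `Node` type, the 12-byte record layout, and the `dump`/`load` names are hypothetical. On the way out, in-memory references are replaced by record indices; on the way in, the indices are turned back into references.

```python
import struct
from dataclasses import dataclass
from typing import Optional

NULL = 0xFFFFFFFF                # index value standing in for "no child"
RECORD = struct.Struct("<III")   # label id, left index, right index


@dataclass
class Node:
    label: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def dump(root):
    """Flatten the structure reachable from root into bytes,
    replacing object references with record indices."""
    order, index = [], {}

    def visit(node):
        if id(node) in index:
            return               # already assigned a record (shared node)
        index[id(node)] = len(order)
        order.append(node)
        for child in (node.left, node.right):
            if child is not None:
                visit(child)

    visit(root)                  # root is visited first, so it is record 0

    def ref(node):
        return NULL if node is None else index[id(node)]

    return b"".join(RECORD.pack(n.label, ref(n.left), ref(n.right))
                    for n in order)


def load(data):
    """Rebuild the nodes, turning record indices back into references."""
    records = [RECORD.unpack_from(data, offset)
               for offset in range(0, len(data), RECORD.size)]
    nodes = [Node(label) for label, _, _ in records]
    for node, (_, left, right) in zip(nodes, records):
        node.left = None if left == NULL else nodes[left]
        node.right = None if right == NULL else nodes[right]
    return nodes[0]
```

In a C++ implementation the same idea would more likely rewrite raw pointers as offsets from the start of the dumped block; indices into a record array play that role here.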

25/6/09: Almost completed a basic naïve implementation in which a text file maps each n-gram to the CCGBank representation of that n-gram. When a parse is requested and such a map is provided, if an n-gram in the sentence to parse exists in the mapping, the chart is filled in appropriately using the data provided by the CCGBank representation.