n-gram Preparsing
This task involves persisting specific parses for n-grams to disk and loading them straight back into the chart. When a preparse is loaded back into the chart, the cells that it shadows must not be used by any combination rules.
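The shadowing idea can be illustrated with a small sketch. This is not the parser's code: the function names, the `(position, span length)` cell indexing, the 2 to 5 token n-gram range, and the choice to leave the cell spanning the whole n-gram usable (so the preparse can still combine with the rest of the sentence) are all assumptions made for illustration.

```python
def find_preparsed_spans(tokens, database, max_n=5):
    """Return (start, length) for every n-gram of the sentence found in the database.

    `database` is any container of space-joined n-gram strings (hypothetical format).
    """
    spans = []
    for n in range(2, max_n + 1):
        for start in range(len(tokens) - n + 1):
            if " ".join(tokens[start:start + n]) in database:
                spans.append((start, n))
    return spans


def shadowed_cells(spans):
    """Collect the chart cells lying strictly inside each preparsed span.

    A cell is (position, span_length). The cell covering the whole n-gram is
    NOT included: in this sketch it holds the loaded preparse (an assumption).
    """
    cells = set()
    for start, length in spans:
        for sub_len in range(1, length):  # proper sub-spans only
            for pos in range(start, start + length - sub_len + 1):
                cells.add((pos, sub_len))
    return cells


def may_combine(left, right, shadowed):
    """A combination rule may fire only if neither input cell is shadowed."""
    return left not in shadowed and right not in shadowed


tokens = "the cat sat on the mat".split()
spans = find_preparsed_spans(tokens, {"cat sat on"})
shadowed = shadowed_cells(spans)
```

Here `spans` contains the single span starting at token 1 with length 3, and the five cells beneath it are shadowed, so `may_combine((0, 1), (1, 1), shadowed)` is false while `may_combine((0, 1), (1, 3), shadowed)` is true.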
n-gram Statistics
While processing some sections of the 78 Wikipedia split, the following statistics were observed for n-grams using the hash function h' = c + (h << 6) + (h << 16) - h.
pc-4e43-1.it.usyd.edu.au
  2-grams  Distinct n-grams: 1126279 1070186 95.0196176968%  nlines = 1070186  ncollisions = 0 (0 %)  distribution used = 1070186 (1.1603e-11 %)
  3-grams  Distinct n-grams: 1078903 1048126 97.1473802556%  nlines = 1048126  ncollisions = 0 (0 %)  distribution used = 1048126 (1.13638e-11 %)
  4-grams  Distinct n-grams: 791545 776159 98.0562065327%  nlines = 776159  ncollisions = 0 (0 %)  distribution used = 776159 (8.41513e-12 %)
  5-grams  Distinct n-grams: 635601 628624 98.9022987691%  nlines = 628624  ncollisions = 0 (0 %)  distribution used = 628624 (6.81556e-12 %)

pc-4e34-0.it.usyd.edu.au
  2-grams  Distinct n-grams: 1189852 1129098 94.8939868151%  nlines = 1129098  ncollisions = 0 (0 %)  distribution used = 1129098 (1.22417e-11 %)
  3-grams  Distinct n-grams: 1168272 1134068 97.0722571455%  nlines = 1134068  ncollisions = 0 (0 %)  distribution used = 1134068 (1.22956e-11 %)
  4-grams  Distinct n-grams: 856732 840554 98.1116615230%  nlines = 840554  ncollisions = 0 (0 %)  distribution used = 840554 (9.1133e-12 %)
  5-grams  Distinct n-grams: 681532 674745 99.0041553441%  nlines = 674745  ncollisions = 0 (0 %)  distribution used = 674745 (7.3156e-12 %)

pc-4e33-0.it.usyd.edu.au
  2-grams  Distinct n-grams: 1194237 1132662 94.8439882535%  nlines = 1132662  ncollisions = 0 (0 %)  distribution used = 1132662 (1.22803e-11 %)
  3-grams  Distinct n-grams: 1175704 1140752 97.0271428863%  nlines = 1140752  ncollisions = 0 (0 %)  distribution used = 1140752 (1.23681e-11 %)
  4-grams  Distinct n-grams: 860982 844794 98.1198213203%  nlines = 844794  ncollisions = 0 (0 %)  distribution used = 844794 (9.15927e-12 %)
  5-grams  Distinct n-grams: 685291 678226 98.9690511038%  nlines = 678226  ncollisions = 0 (0 %)  distribution used = 678226 (7.35334e-12 %)

pc-4e37-0.it.usyd.edu.au
  2-grams  Distinct n-grams: 1195179 1133279 94.8208594695%  nlines = 1133279  ncollisions = 0 (0 %)  distribution used = 1133279 (1.2287e-11 %)
  3-grams  Distinct n-grams: 1177007 1142158 97.0391849836%  nlines = 1142158  ncollisions = 0 (0 %)  distribution used = 1142158 (1.23833e-11 %)
  4-grams  Distinct n-grams: 862822 846759 98.1383182162%  nlines = 846759  ncollisions = 0 (0 %)  distribution used = 846759 (9.18058e-12 %)
  5-grams  Distinct n-grams: 686587 679651 98.9897857081%  nlines = 679651  ncollisions = 0 (0 %)  distribution used = 679651 (7.36879e-12 %)

pc-4e75-0.it.usyd.edu.au
  2-grams  Distinct n-grams: 1194628 1133134 94.8524561620%  nlines = 1133134  ncollisions = 0 (0 %)  distribution used = 1133134 (1.22855e-11 %)
  3-grams  Distinct n-grams: 1179311 1144627 97.0589606982%  nlines = 1144627  ncollisions = 0 (0 %)  distribution used = 1144627 (1.24101e-11 %)
  4-grams  Distinct n-grams: 867101 850902 98.1318208605%  nlines = 850902  ncollisions = 0 (0 %)  distribution used = 850902 (9.2255e-12 %)
  5-grams  Distinct n-grams: 689045 682093 98.9910673468%  nlines = 682093  ncollisions = 0 (0 %)  distribution used = 682093 (7.39527e-12 %)

pc-4e74-1.it.usyd.edu.au
  2-grams  Distinct n-grams: 1182634 1121630 94.8416839022%  nlines = 1121630  ncollisions = 0 (0 %)  distribution used = 1121630 (1.21607e-11 %)
  3-grams  Distinct n-grams: 1165312 1130832 97.0411357644%  nlines = 1130832  ncollisions = 0 (0 %)  distribution used = 1130832 (1.22605e-11 %)
  4-grams  Distinct n-grams: 858185 842425 98.1635661308%  nlines = 842425  ncollisions = 0 (0 %)  distribution used = 842425 (9.13359e-12 %)
  5-grams  Distinct n-grams: 685146 678148 98.9786118579%  nlines = 678148  ncollisions = 0 (0 %)  distribution used = 678148 (7.3525e-12 %)

pc-4e41-1.it.usyd.edu.au
  2-grams  Distinct n-grams: 1180214 1119052 94.8177194983%  nlines = 1119052  ncollisions = 0 (0 %)  distribution used = 1119052 (1.21328e-11 %)
  3-grams  Distinct n-grams: 1162037 1127838 97.0569783922%  nlines = 1127838  ncollisions = 0 (0 %)  distribution used = 1127838 (1.2228e-11 %)
  4-grams  Distinct n-grams: 854114 838168 98.1330361052%  nlines = 838168  ncollisions = 0 (0 %)  distribution used = 838168 (9.08744e-12 %)
  5-grams  Distinct n-grams: 681426 674468 98.9789060000%  nlines = 674468  ncollisions = 0 (0 %)  distribution used = 674468 (7.3126e-12 %)

pc-4e32-2.it.usyd.edu.au
  2-grams  Distinct n-grams: 1178470 1118087 94.8761529780%  nlines = 1118087  ncollisions = 0 (0 %)  distribution used = 1118087 (1.21223e-11 %)
  3-grams  Distinct n-grams: 1154002 1120140 97.0656896608%  nlines = 1120140  ncollisions = 0 (0 %)  distribution used = 1120140 (1.21446e-11 %)
  4-grams  Distinct n-grams: 848489 832684 98.1372769711%  nlines = 832684  ncollisions = 0 (0 %)  distribution used = 832684 (9.02798e-12 %)
  5-grams  Distinct n-grams: 676379 669500 98.9829666503%  nlines = 669500  ncollisions = 0 (0 %)  distribution used = 669500 (7.25873e-12 %)

pc-4e73-0.it.usyd.edu.au

pc-3w14-0.it.usyd.edu.au
  2-grams  Distinct n-grams: 1203201 1140546 94.7926406311%  nlines = 1140546  ncollisions = 0 (0 %)  distribution used = 1140546 (1.23658e-11 %)
  3-grams  Distinct n-grams: 1193592 1158138 97.0296382683%  nlines = 1158138  ncollisions = 0 (0 %)  distribution used = 1158138 (1.25566e-11 %)
  4-grams  Distinct n-grams: 875241 858794 98.1208604258%  nlines = 858794  ncollisions = 0 (0 %)  distribution used = 858794 (9.31106e-12 %)
  5-grams  Distinct n-grams: 695111 688327 99.0240407647%  nlines = 688327  ncollisions = 0 (0 %)  distribution used = 688327 (7.46286e-12 %)
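The recurrence h' = c + (h << 6) + (h << 16) - h used for these statistics is the one commonly known as the sdbm hash; since 2^6 + 2^16 - 1 = 65599, each step is equivalent to h' = 65599 * h + c. A minimal Python sketch follows. The initial value of 0, the 64-bit wrap-around, hashing the UTF-8 bytes of the space-joined n-gram, and the `count_collisions` helper are assumptions for illustration, not details taken from the parser.

```python
MASK64 = (1 << 64) - 1  # assumption: the hash is kept in an unsigned 64-bit integer


def sdbm_hash(text: str) -> int:
    """Apply h' = c + (h << 6) + (h << 16) - h over the UTF-8 bytes of text."""
    h = 0  # assumption: the hash starts from zero
    for c in text.encode("utf-8"):
        h = (c + (h << 6) + (h << 16) - h) & MASK64
    return h


def count_collisions(ngrams):
    """Count n-grams whose hash equals that of a different, earlier n-gram."""
    seen = {}  # hash value -> first n-gram that produced it
    collisions = 0
    for ngram in ngrams:
        key = sdbm_hash(ngram)
        if key in seen and seen[key] != ngram:
            collisions += 1
        else:
            seen.setdefault(key, ngram)
    return collisions
```

For example, `count_collisions(["the cat", "the dog", "the cat"])` treats the repeated n-gram as a duplicate rather than a collision.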
Log
30/6/09: Got version 1 of the binary swizzling working. The parser can now take a bunch of input sentences and dump each sentence, together with its associated binary representation in the chart, straight to disk, where the sentence is represented as an n-gram. This is done by passing the --printer memdump argument to the parser. When parsing other sentences, you can then specify a database of previously parsed n-grams to use. If an n-gram in the current sentence is also in the database, the section of the chart covered by this n-gram is shadowed out and replaced by the database version. Shadowing means that those cells in the chart cannot be used by any CCG rules, which consequently makes those cells const.
26/6/09: Spoke to James. Will stop working on the current implementation and instead work on a very different idea: use a binary format and dump chunks of RAM straight to disk as the mapping file, after doing some pointer swizzling.
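As a rough illustration of what pointer swizzling involves (this is not the parser's actual on-disk format), the sketch below writes a small tree to a flat byte string by replacing object references with record indices, then restores real references when loading. The `Node` class, the fixed three-integer record layout, and the function names are hypothetical.

```python
import struct
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


RECORD = struct.Struct("<iii")  # label, left index, right index (hypothetical layout)
NULL = -1                       # index used in place of a null reference


def dump_tree(root: Node) -> bytes:
    """Flatten the tree, replacing in-memory references with record indices."""
    index = {}  # id(node) -> position of its record in the output
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None or id(node) in index:
            continue
        index[id(node)] = len(order)
        order.append(node)
        stack.extend((node.right, node.left))

    def ref(node):
        return NULL if node is None else index[id(node)]

    return b"".join(RECORD.pack(n.label, ref(n.left), ref(n.right)) for n in order)


def load_tree(data: bytes) -> Node:
    """Rebuild the tree, turning record indices back into object references."""
    records = list(RECORD.iter_unpack(data))
    nodes = [Node(label) for label, _, _ in records]
    for node, (_, left, right) in zip(nodes, records):
        node.left = None if left == NULL else nodes[left]
        node.right = None if right == NULL else nodes[right]
    return nodes[0]  # the root is always written first


tree = Node(1, Node(2), Node(3, Node(4)))
blob = dump_tree(tree)
restored = load_tree(blob)
```

Because the stored indices are relative to the start of the buffer rather than to absolute addresses, the same bytes can be loaded at any location, which is the property a raw memory dump needs.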
25/6/09: Almost completed a basic naïve implementation in which a text file maps each n-gram to the CCGBank representation of that n-gram. When a parse is requested and such a map is provided, if an n-gram in the sentence to parse exists in the mapping, the chart is filled in appropriately using the data provided by the CCGBank representation.