The CCG Parser
The grammar used by the parser is taken from CCGbank, a treebank of CCG derivations developed by Julia Hockenmaier and Mark Steedman. CCGbank was created by semi-automatically converting the phrase-structure trees in the Penn Treebank into CCG derivations. Because the grammar is derived from 'real text', it has wide coverage, which makes the parser robust. CCGbank consists primarily of newspaper text, so the parser is particularly good at analysing this kind of text. Lexical category data has also been manually created for questions, and the parser comes with a supertagger model trained on questions, making it well suited for use in an open-domain Question Answering system.
The CCG parser has the following features:
- efficient and robust parsing of real text: up to 35 newspaper sentences per second and close to 100% coverage on unseen sentences in CCGbank;
- accurate recovery of predicate-argument dependencies, including long-range dependencies: around 85% overall labelled F-score when evaluated on unseen text in CCGbank;
- a number of output options, including CCG derivations, CCG dependency structures, Briscoe and Carroll-style grammatical relations, and interpretable Discourse Representation Structures (if used with Boxer).
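The labelled F-score mentioned above is the harmonic mean of precision and recall over predicate-argument dependencies. The following is a minimal sketch of that arithmetic, not the parser's own evaluation code: the `Dependency` tuple, the `labelled_prf` function, and the toy sentence are all hypothetical names and data chosen for illustration, and the exact dependency format used in the CCGbank evaluation may differ.

```python
from typing import NamedTuple, Set, Tuple


class Dependency(NamedTuple):
    """One labelled dependency (hypothetical representation, for illustration)."""
    head: str       # head word with its position, e.g. "bought_2"
    category: str   # lexical category of the head, e.g. r"(S\NP)/NP"
    slot: int       # which argument slot of the category is filled
    argument: str   # word (with position) filling that slot


def labelled_prf(gold: Set[Dependency],
                 predicted: Set[Dependency]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) over labelled dependencies.

    A predicted dependency counts as correct only if all four fields
    match a gold dependency exactly.
    """
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy example for "John bought the book" (invented data, not CCGbank output).
GOLD = {
    Dependency("bought_2", r"(S\NP)/NP", 1, "John_1"),
    Dependency("bought_2", r"(S\NP)/NP", 2, "book_4"),
    Dependency("the_3", "NP/N", 1, "book_4"),
    Dependency("yesterday_5", r"(S\NP)\(S\NP)", 2, "bought_2"),
}

# The hypothetical parser output gets the modifier attachment wrong.
PREDICTED = {
    Dependency("bought_2", r"(S\NP)/NP", 1, "John_1"),
    Dependency("bought_2", r"(S\NP)/NP", 2, "book_4"),
    Dependency("the_3", "NP/N", 1, "book_4"),
    Dependency("yesterday_5", r"(S\NP)\(S\NP)", 2, "book_4"),
}

if __name__ == "__main__":
    p, r, f = labelled_prf(GOLD, PREDICTED)
    print(f"precision={p:.2f} recall={r:.2f} F-score={f:.2f}")
```

Representing dependencies as sets of tuples keeps the comparison strict: a dependency with the right head and argument but the wrong category or slot is counted as an error, which is what "labelled" implies here.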
The parser is described in detail in the following paper, to appear in Computational Linguistics: 'Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models' (http://web.comlab.ox.ac.uk/oucl/work/stephen.clark/papers/cl07parser.pdf).