The CCG Grammar

The grammar used in the CCG parser is described in the following paper to appear in Computational Linguistics: http://web.comlab.ox.ac.uk/oucl/work/stephen.clark/papers/cl07parser.pdf. Below is a brief description of the main components.

Lexical Category Set

The lexical category set used by the parser is stored in the supertagger model directory, because it is the supertagger that assigns lexical categories to the words in a sentence; the parser then combines those categories. The classes file in that directory lists the categories together with their frequencies in Sections 2-21 of CCGbank. The current version of the parser uses a set of 425 lexical categories. Julia Hockenmaier's thesis gives a detailed description of CCGbank, including the grammatical features used in the categories (for example, S[dcl] for a declarative sentence).

The markedup File

What we call the markedup file contains the head and dependency information for each lexical category. The annotation on each category determines which dependencies the parser outputs, including long-range dependencies. The file also contains a mapping from CCG dependencies to Briscoe & Carroll-style grammatical relations. It is located in the $CANDC/src/data/ccg/cats directory. The report describes the markedup file and the mapping to grammatical relations in more detail.

Combinatory Rules

The rules used to combine categories are forward and backward application, forward composition, generalised forward composition, backward composition, backward-crossed composition, generalised backward-crossed composition, type-raising, and finally a coordination schema which coordinates any two categories of the same type. The Syntactic Process (Steedman, 2000) describes the combinatory rules.

The rules are implemented as schemas, except for type-raising, which is implemented by adding one of three fixed sets of categories to the chart whenever an NP, PP or S[adj]\NP is present. The category sets (trNP, trPP, and trAP) are in the $CANDC/src/data/ccg/cats directory.
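The lookup-based approach to type-raising can be sketched as follows. This is only an illustration under stated assumptions: the TYPE_RAISED table and the add_type_raised function are hypothetical names, and the categories listed are plausible examples of type-raised categories, not the actual contents of the trNP, trPP and trAP files.

```python
# Hypothetical table: trigger category -> fixed set of type-raised categories.
# The entries are illustrative only; the real sets live in trNP, trPP and trAP.
TYPE_RAISED = {
    "NP": ["S/(S\\NP)", "(S\\NP)\\((S\\NP)/NP)"],
    "PP": ["(S\\NP)\\((S\\NP)/PP)"],
    "S[adj]\\NP": ["(S\\NP)\\((S\\NP)/(S[adj]\\NP))"],
}


def add_type_raised(cell):
    """Return the categories in a chart cell plus any fixed type-raised
    categories triggered by them, without duplicates and in a stable order."""
    out = list(cell)
    seen = set(cell)
    for cat in cell:
        for raised in TYPE_RAISED.get(cat, []):
            if raised not in seen:
                seen.add(raised)
                out.append(raised)
    return out
```

Because the sets are fixed, no general type-raising schema needs to be applied at parse time; a cell containing a trigger category simply gains the listed categories.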

Unary Rules

CCGbank contains a number of unary type-changing rules, which typically change a verb phrase into a modifier. The following examples, taken from Julia Hockenmaier's thesis, demonstrate the most common rules. The rules rewrite the category on the left as the category on the right. In the examples the bracketed expression has the type-changing rule applied to it.

  • S[pss]\NP -> NP\NP
    • workers [exposed to it]
  • S[adj]\NP -> NP\NP
    • a forum [likely to bring attention to the problem]
  • S[ng]\NP -> NP\NP
    • signboards [advertising imported cigarettes]
  • S[ng]\NP -> (S\NP)\(S\NP)
    • became chairman [succeeding Ian Butler]
  • S[dcl]/NP -> NP\NP
    • the millions of dollars [it generates]

Another common type-changing rule in CCGbank changes a noun category N into a noun phrase NP. Appendix A of the report lists all the unary type-changing rules used in the parser.
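The unary rules listed above can be viewed as a simple rewrite table. The following Python sketch encodes only the rules shown on this page (it is not the full list from Appendix A of the report), and the names UNARY_RULES and apply_unary are hypothetical.

```python
# (left-hand side, right-hand side) pairs, copied from the examples above,
# plus the N -> NP rule. Not the parser's complete rule list.
UNARY_RULES = [
    ("S[pss]\\NP", "NP\\NP"),
    ("S[adj]\\NP", "NP\\NP"),
    ("S[ng]\\NP", "NP\\NP"),
    ("S[ng]\\NP", "(S\\NP)\\(S\\NP)"),
    ("S[dcl]/NP", "NP\\NP"),
    ("N", "NP"),
]


def apply_unary(cat):
    """Return every category that `cat` can be rewritten as by a unary rule."""
    return [rhs for lhs, rhs in UNARY_RULES if lhs == cat]
```

Note that one category can have several rewrites: S[ng]\NP can become either a noun modifier (NP\NP) or a verb-phrase modifier ((S\NP)\(S\NP)).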

Punctuation Rules

There are a number of rules in CCGbank for absorbing punctuation. For example, the following rule takes a comma followed by a declarative sentence and returns a declarative sentence:

, S[dcl] -> S[dcl]

Similar comma rules exist for other categories, and there are analogous punctuation rules for semicolons, colons, and brackets. There is also a rule schema which treats a comma as a coordination (where X can be any category):

, X -> X\X

Appendix A of the report contains the complete list of punctuation rules used in the parser.
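The two comma rules shown above could be sketched in Python as follows. The names ABSORB_AFTER_COMMA, wrap and comma_rules are hypothetical, and only the single absorption rule given on this page is included; the full list is in Appendix A of the report.

```python
# Categories that absorb a preceding comma. Only the rule shown above is
# listed here; the parser uses more.
ABSORB_AFTER_COMMA = {"S[dcl]"}


def wrap(cat):
    """Bracket a complex category so it can be embedded in a larger one."""
    return f"({cat})" if ("/" in cat or "\\" in cat) else cat


def comma_rules(right):
    """Return the categories produced by a comma followed by `right`."""
    results = []
    if right in ABSORB_AFTER_COMMA:
        results.append(right)                        # , S[dcl] -> S[dcl]
    results.append(f"{wrap(right)}\\{wrap(right)}")  # , X -> X\X
    return results
```

For a verb phrase S\NP following a comma, the schema produces (S\NP)\(S\NP), which can then combine with a preceding S\NP by backward application.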

Other Rules

There are two rules for combining sequences of noun phrases and sequences of declarative sentences:

NP NP -> NP
S[dcl] S[dcl] -> S[dcl]

Finally, there are some coordination constructions in the original Penn Treebank which were difficult to convert into CCGbank analyses, for which the following rule is used:

conj N -> N

Normal-Form Constraints

The parser can use two types of normal-form constraints. These constraints increase the speed of the parser and reduce the size of the parse charts used for training the parser model. The first constraint only allows two categories to combine if they have been seen to combine in Sections 2-21 of CCGbank. The parser model directory includes a rules file listing all such category pairs. The second type of constraint, based on work by Jason Eisner (1996), aims to reduce the use of function composition and type-raising by only applying those rules when necessary. The report contains a more precise description of the Eisner constraints.
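Both kinds of constraint can be pictured as filters applied before two chart entries are combined. The Python sketch below is a simplified illustration, not the parser's code. It assumes the seen pairs have already been loaded into a set (the format of the rules file is not described here, so the example pairs are made up), and it uses a common paraphrase of the Eisner composition constraint: the output of forward composition may not act as the left functor in a further forward composition or forward application, and symmetrically for backward composition. The treatment of type-raising is omitted, and the rule labels "fa", "ba", "fc" and "bc" are hypothetical.

```python
# Hypothetical, hand-written examples of category pairs seen to combine.
SEEN_PAIRS = {
    ("NP", "S[dcl]\\NP"),
    ("(S[dcl]\\NP)/NP", "NP"),
}


def seen_rule_ok(left, right, seen=SEEN_PAIRS):
    """First constraint: allow a combination only if the pair was observed."""
    return (left, right) in seen


def eisner_ok(rule, left_origin, right_origin):
    """Second constraint (simplified paraphrase of the composition part).

    rule         -- rule about to be applied: "fa", "ba", "fc" or "bc"
    left_origin  -- rule that produced the left child, or None if lexical
    right_origin -- rule that produced the right child, or None if lexical
    """
    if rule in ("fa", "fc") and left_origin == "fc":
        return False  # forward-composed result used as the left functor
    if rule in ("ba", "bc") and right_origin == "bc":
        return False  # backward-composed result used as the right functor
    return True
```

A chart parser using this sketch would call both checks before creating a new entry and skip the combination if either returns False.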