Using the Parser

The parser binary contains both the supertagger and the parser, so two models must be given as command-line arguments: one for the parser itself (--parser) and one for the supertagger (--super).

% bin/parser --parser models/parser --super models/super

The parser can produce several output formats, selected with the --printer option:

Type      Description
deps      CCG predicate-argument dependencies
grs       Briscoe and Carroll style grammatical relations
prolog    CCG derivations rendered as Prolog trees

The default is grs.
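
As a minimal sketch, the helper below prints one invocation for each of the three formats in the table, reusing the model paths from the command above. The function name show_printer_commands is hypothetical, and the commands are only echoed, not executed; remove the echo to run them.

```shell
# Hypothetical helper: print one parser command per output format.
# The commands are echoed, not executed; drop "echo" to run them.
show_printer_commands() {
    for printer in deps grs prolog; do
        echo bin/parser --parser models/parser --super models/super \
            --printer "$printer"
    done
}

show_printer_commands
```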

The size of the packed chart produced by the parser can be limited by specifying the maximum number of chart entries with the --parser-maxsupercats option. The default value is 300,000, which we have found to give a reasonable compromise between speed and coverage. Increasing the value typically increases the number of sentences that receive an analysis, but also slows the parser down.

The maximum sentence length can also be limited with --parser-maxwords. By default this is set to 250 words.
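
The two limits can be combined on one command line. The sketch below only prints the command; the helper name parser_with_limits is hypothetical, and it assumes each option takes a plain integer value (300000 rather than 300,000) as a separate argument.

```shell
# Hypothetical helper: print a parser command with explicit chart and
# sentence-length limits. The fallbacks mirror the documented defaults.
# Assumption: each option takes a plain integer as its next argument.
parser_with_limits() {
    maxsupercats=${1:-300000}
    maxwords=${2:-250}
    echo bin/parser --parser models/parser --super models/super \
        --parser-maxsupercats "$maxsupercats" \
        --parser-maxwords "$maxwords"
}

# A larger chart (more coverage, less speed) and a shorter length limit.
parser_with_limits 1000000 100
```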

The parser and supertagger interact closely. If the parser cannot find an analysis at the current supertagger ambiguity level, the supertagger retags the sentence at another level of ambiguity and the parser tries again. The multiple levels of supertagging in the parser are controlled by three options (--betas, --dict_cutoffs and --start_level). The betas and dictionary cutoffs are space-separated lists of values, which must be the same length. The start level is a zero-based index indicating which level to try first. The default configuration has five levels, which were found empirically on CCGbank section 00 to provide a reasonable compromise between speed and accuracy.
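
Because the two lists must match in length and the start level must index into them, a small check before launching the parser can catch configuration mistakes. The sketch below is illustrative only: check_levels is a hypothetical helper, the values are the ones that appear in the example output in this section, and passing each list as a single quoted argument is an assumption about how the options are read.

```shell
# Hypothetical pre-flight check for the multi-level supertagging options.
# Succeeds only if both lists have the same number of entries and the
# zero-based start level points at an existing entry.
check_levels() {
    betas=$1
    cutoffs=$2
    start=$3
    set -- $betas;   n_betas=$#      # unquoted on purpose: split on spaces
    set -- $cutoffs; n_cutoffs=$#
    [ "$n_betas" -eq "$n_cutoffs" ] || return 1
    [ "$start" -ge 0 ] && [ "$start" -lt "$n_betas" ]
}

betas="0.075 0.03 0.01 0.005 0.0001"
cutoffs="20 20 20 20 150"

if check_levels "$betas" "$cutoffs" 0; then
    # Assumption: each list is passed as one quoted argument.
    # The command is printed rather than executed.
    printf '%s\n' "bin/parser --parser models/parser --super models/super --betas \"$betas\" --dict_cutoffs \"$cutoffs\" --start_level 0"
else
    echo "betas/dict_cutoffs/start_level are inconsistent" >&2
fi
```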

The parser also has a number of options for restricting which categories can be combined. The --parser-seen_rules option, which is on by default, restricts the category combinations to those seen in the training data (CCGbank sections 02-21 in the predefined models). This is very effective at increasing the speed of the parser without compromising the accuracy (on newspaper text).

The --parser-eisner_nf option, which is also on by default, eliminates many non-normal-form category combinations, following the constraints described in a 1996 ACL paper by Jason Eisner. This is effective at increasing the speed of the parser and can be used for any text type.

The --parser-question_rules option activates some additional unary type-changing rules which only apply to questions. When running the parser on questions, it is also important that the --parser-seen_rules option is set to false.
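
A question-parsing setup therefore changes two options together. The sketch below prints such a command; the helper name question_parser_cmd is hypothetical, and writing the boolean values as a literal true or false after the option name is an assumption about the option syntax.

```shell
# Hypothetical helper: print the parser command for question input.
# Assumption: boolean options take a literal true/false value.
question_parser_cmd() {
    echo bin/parser --parser models/parser --super models/super \
        --parser-question_rules true \
        --parser-seen_rules false
}

question_parser_cmd
```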

Example Parser Usage

The C&C tools are set up to take their input from STDIN by default, so you can run the POS tagger on raw text and pass it on to the parser as follows (here the environment variable $CANDC refers to where C&C lives on your system):

> echo "Pierre thinks that Mary persuaded Bill to eat apples" | pos --model $CANDC/models/pos/ | parser --parser $CANDC/models/parser/ --super $CANDC/models/super

This should produce the following output:

tagging total:    0.00s usr:    0.00s sys:    0.00s
total   total:    5.96s usr:    5.92s sys:    0.04s
# this file was generated by the following command(s):
#   parser --parser /usr/local/candc/models/parser/ --super /usr/local/candc/models/super

# this file was generated by the following command(s):
#   parser --parser /usr/local/candc/models/parser/ --super /usr/local/candc/models/super

1 attempt nospan at B=0.075, K=20
1 attempt nospan at B=0.03, K=20
1 attempt nospan at B=0.01, K=20
1 attempt nospan at B=0.005, K=20
1 parsed at B=0.0001, K=150
1 coverage 100%
(dobj persuaded_4 Bill_5)
(dobj eat_7 apples_8)
(ncsubj eat_7 Bill_5 _)
(xcomp to_6 persuaded_4 eat_7)
(ncsubj persuaded_4 Mary_3 _)
(ccomp that_2 thinks_1 persuaded_4)
(ncsubj thinks_1 Pierre_0 _)
<c> Pierre|NNP|N thinks|VBZ|(S[dcl]\NP)/S[em] that|IN|S[em]/S[dcl] Mary|NNP|N persuaded|VBD|((S[dcl]\NP)/(S[to]\NP))/NP Bill|NNP|N to|TO|(S[to]\NP)/(S[b]\NP) eat|VB|(S[b]\NP)/NP apples|NNS|N

1 stats 7.39449 773 1089
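
The output mixes status lines, grammatical relations and a line of supertagged words, so it is often convenient to separate them. The sketch below shows one way to do that with standard tools. The function names are hypothetical, the sample is an abridged copy of the output above, and the filters assume that relations start with "(" and that the tagged words sit on a line starting with "<c> " with three |-separated fields per word.

```shell
# Abridged copy of the parser output shown above, used as test input.
sample_output() {
    cat <<'EOF'
1 parsed at B=0.0001, K=150
1 coverage 100%
(ncsubj thinks_1 Pierre_0 _)
<c> Pierre|NNP|N thinks|VBZ|(S[dcl]\NP)/S[em]
EOF
}

# Keep only the grammatical relations: lines that start with "(".
extract_grs() {
    grep '^('
}

# Print one "word POS category" triple per line from the <c> line.
extract_supertags() {
    grep '^<c> ' | sed 's/^<c> //' | tr ' ' '\n' | tr '|' ' '
}

sample_output | extract_grs
sample_output | extract_supertags
```

On real output the same filters can be applied to the file written by the parser, for example extract_grs < candc.wsj24.parsed.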

To parse an input file that contains only words (no POS tags) and write the output to a particular file, the following can be used:

> pos --model $CANDC/models/pos --input wsj24.words | parser --parser $CANDC/models/parser --super $CANDC/models/super --output candc.wsj24.parsed

Here, wsj24.words is assumed to contain one tokenized sentence per line, for example:

The economy 's temperature will be taken from several vantage points this week , with readings on trade , output , housing and inflation .
The most troublesome report may be the August merchandise trade deficit due out tomorrow .
...
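
Since each line is one sentence, the input can be checked against the sentence-length limit before parsing. The sketch below reports lines with more tokens than a given limit; long_sentences is a hypothetical helper, and it assumes the limit is counted in whitespace-separated tokens.

```shell
# Hypothetical pre-check: read sentences (one per line) from stdin and
# report those with more than max_words whitespace-separated tokens.
# 250 is the documented default for --parser-maxwords.
long_sentences() {
    max_words=${1:-250}
    awk -v max="$max_words" 'NF > max { print NR ": " NF " words" }'
}

# Tiny demonstration with a limit of 2 tokens per line.
printf 'a b c\nd e\n' | long_sentences 2
```

For the file above this would be run as long_sentences < wsj24.words.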