C&C Pipeline
The bin/soap_server binary provides a pipeline of all of the C&C tools: the POS tagger, chunker, named entity recogniser and CCG parser, together with the morpha morphological analyser. It supports two modes: local file reading/writing and SOAP server mode.
soap_server supports all of the configuration options of the individual taggers, except that (at this point) the input is restricted to pre-tokenized text with one sentence per line. All of the options are prefixed with --candc, so, for example, setting the POS tagger model requires the --candc-pos argument.
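Because the input must already be tokenized with one sentence per line, a caller holding sentences as lists of tokens has to serialise them first. The following is a minimal Python sketch of that step, not part of the C&C tools: the function name format_input is hypothetical, and it assumes that tokens are separated by single spaces, as in the example sentence used on this page.

```python
from typing import Iterable, List


def format_input(sentences: Iterable[List[str]]) -> str:
    """Join pre-tokenized sentences into one-sentence-per-line text.

    Assumption: tokens are separated by single spaces (hypothetical helper,
    not part of the C&C distribution).
    """
    lines = []
    for tokens in sentences:
        if not tokens:
            raise ValueError("empty sentence")
        for token in tokens:
            # A token containing whitespace would be read back as several
            # tokens (or, with a newline, as several sentences).
            if not token or any(ch.isspace() for ch in token):
                raise ValueError(f"invalid token: {token!r}")
        lines.append(" ".join(tokens))
    # Terminate every sentence with a newline; no sentences gives "".
    return "".join(line + "\n" for line in lines)


print(format_input([["This", "is", "a", "test", "sentence", "."]]), end="")
```

The resulting string can be written to a file or piped to standard input.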
soap_server expects a directory containing models for each of the taggers, the parser and also the verbstem list used by the morphological analyser. The location of this directory must be specified with the --models (or alternatively --candc) option. The models.tgz files available from the download page contain all of the models required to run soap_server.
For example, to run the whole pipeline from the command line (if models.tgz was unpacked in the current working directory), you can use:
% bin/soap_server --models models
The parser output is sent to standard output by default:
# this file was generated by the following command(s):
# bin/soap_server --models models
(ncmod _ sentence_4 test_3)
(det sentence_4 a_2)
(xcomp _ is_1 sentence_4)
(ncsubj is_1 This_0 _)
<c> This|this|DT|I-NP|O|NP is|be|VBZ|I-VP|O|(S[dcl]\NP)/NP a|a|DT|I-NP|O|NP[nb]/N test|test|NN|I-NP|O|N/N sentence|sentence|NN|I-NP|O|N .|.|.|O|O|.
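For downstream processing it can help to split this output into its parts. The sketch below is a hedged Python illustration, not an official API: the names Token and parse_output are hypothetical, and the field order (word, lemma, POS tag, chunk tag, named entity tag, CCG category) is inferred from the example above rather than taken from a format specification. It also assumes no field contains a | character and pools all sentences together.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    # Field order inferred from the example output (an assumption).
    word: str
    lemma: str
    pos: str
    chunk: str
    entity: str
    supertag: str


def parse_output(text: str) -> Tuple[List[Tuple[str, ...]], List[Token]]:
    """Split pipeline output into relation tuples and tokens."""
    relations: List[Tuple[str, ...]] = []
    tokens: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        if line.startswith("<c>"):
            for item in line[len("<c>"):].split():
                fields = item.split("|")
                if len(fields) != 6:
                    raise ValueError(f"expected 6 fields in {item!r}")
                tokens.append(Token(*fields))
        elif line.startswith("(") and line.endswith(")"):
            # e.g. "(det sentence_4 a_2)" -> ("det", "sentence_4", "a_2")
            relations.append(tuple(line[1:-1].split()))
    return relations, tokens


# Raw string: the CCG categories contain backslashes.
SAMPLE = r"""# this file was generated by the following command(s):
# bin/soap_server --models models
(ncmod _ sentence_4 test_3)
(det sentence_4 a_2)
(xcomp _ is_1 sentence_4)
(ncsubj is_1 This_0 _)
<c> This|this|DT|I-NP|O|NP is|be|VBZ|I-VP|O|(S[dcl]\NP)/NP a|a|DT|I-NP|O|NP[nb]/N test|test|NN|I-NP|O|N/N sentence|sentence|NN|I-NP|O|N .|.|.|O|O|.
"""

relations, tokens = parse_output(SAMPLE)
for relation in relations:
    print(relation)
for token in tokens:
    print(token.word, token.lemma, token.supertag)
```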
and the log information goes to standard error by default:
# reading text from <stdin>
# writing to <stdout>
# writing log to <stderr>
# this file was generated by the following command(s):
# bin/soap_server --models models
1 parsed at B=0.075, K=20
1 coverage 100%
1 stats 1.38629 47 48
The log shows coverage statistics and information about the supertagger/parser interaction.
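If the log is captured to a file, the "parsed at" and "coverage" lines can be picked out with regular expressions. This Python sketch is an illustration only: the function name parse_log is hypothetical, the line patterns are inferred solely from the example log above, and the meaning of the leading number and of the "stats" line is not stated on this page, so the sketch does not interpret them.

```python
import re
from typing import Dict, List, Optional, Tuple, Union

# Patterns inferred from the example log (an assumption, not a format spec).
PARSED_RE = re.compile(r"^(\d+) parsed at B=([0-9.]+), K=(\d+)$")
COVERAGE_RE = re.compile(r"^(\d+) coverage ([0-9.]+)%$")

Record = Dict[str, Union[int, float]]


def parse_log(text: str) -> Tuple[List[Record], Optional[float]]:
    """Return the 'parsed at' records and the last coverage percentage seen."""
    parsed: List[Record] = []
    coverage: Optional[float] = None
    for line in text.splitlines():
        line = line.strip()
        match = PARSED_RE.match(line)
        if match:
            parsed.append({
                "number": int(match.group(1)),  # leading number, meaning not stated
                "B": float(match.group(2)),
                "K": int(match.group(3)),
            })
            continue
        match = COVERAGE_RE.match(line)
        if match:
            coverage = float(match.group(2))
    return parsed, coverage


LOG_SAMPLE = """# reading text from <stdin>
# writing to <stdout>
# writing log to <stderr>
1 parsed at B=0.075, K=20
1 coverage 100%
1 stats 1.38629 47 48
"""

print(parse_log(LOG_SAMPLE))
```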
You can specify the input, output and log files using the --input, --output and --log options respectively. You can also use the --prefix option to specify both the output and log file prefix. For example, --prefix test will create the output file test.out and the log file test.log.
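When the pipeline is driven from a script, the argument list can be assembled programmatically. The Python sketch below uses only the options named above (--models, --input, --output, --log and --prefix). The helper names build_pipeline_command and prefix_files are hypothetical, and how soap_server behaves when --prefix is combined with --output or --log is not described here, so the sketch simply passes through whatever is given.

```python
from typing import List, Optional, Tuple


def build_pipeline_command(
    models: str,
    input_path: Optional[str] = None,
    output_path: Optional[str] = None,
    log_path: Optional[str] = None,
    prefix: Optional[str] = None,
    binary: str = "bin/soap_server",
) -> List[str]:
    """Assemble an argument list for soap_server in local file mode."""
    command = [binary, "--models", models]
    options = (
        ("--input", input_path),
        ("--output", output_path),
        ("--log", log_path),
        ("--prefix", prefix),
    )
    for flag, value in options:
        if value is not None:
            command += [flag, value]
    return command


def prefix_files(prefix: str) -> Tuple[str, str]:
    """Output and log file names that --prefix creates, per the text above."""
    return f"{prefix}.out", f"{prefix}.log"


print(build_pipeline_command("models", input_path="input.txt", prefix="test"))
print(prefix_files("test"))
```

The returned list can be handed to subprocess.run(command, check=True) on a machine where the binary and models are installed; input.txt is a placeholder file name.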
C&C Web Service
Loading all of the statistical models for the different taggers and parser takes a considerable amount of time. For parsing lots of small files, e.g. in a Question Answering system, the loading time can become prohibitive.
Also, we currently only have an API for C++ and an experimental API for Python.
For these reasons we have developed a web service version of the pipeline using SOAP (Simple Object Access Protocol). We are using the gSOAP C++ library, which is very efficient and standards compliant. It works with almost all Web Service bindings for different languages.
The web service currently provides only one function: parsing the contents of a string (one or more tokenized sentences, one per line), just like the standard input/output mode. There is also a command line tool, soap_client, which passes the contents of standard input to the web service and prints the returned parses on standard output.
The web service is started by giving soap_server the --server option with a hostname and port:
% bin/soap_server --candc models --server localhost:9000
waiting for connections on localhost:9000
The port number (9000 in this example) must be larger than 1024 unless you are running this as root, which we very strongly discourage! If you specify localhost, the web service will only be accessible to processes on the same machine, whereas if you specify the machine's hostname, the web service will be accessible from other machines as well.
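A wrapper script can check the --server value against these rules before launching the server. The following Python sketch is illustrative only: the names parse_server_address and is_local_only are hypothetical, the "larger than 1024" threshold is taken directly from the text above, and soap_server itself may validate the address differently.

```python
from typing import Tuple


def parse_server_address(address: str, running_as_root: bool = False) -> Tuple[str, int]:
    """Split 'hostname:port' and apply the port rule described above."""
    host, separator, port_text = address.rpartition(":")
    if not separator or not host or not port_text.isdecimal():
        raise ValueError(f"expected hostname:port, got {address!r}")
    port = int(port_text)
    if port > 65535:
        raise ValueError(f"port {port} is out of range")
    # Threshold as stated in the text: larger than 1024 unless root.
    if port <= 1024 and not running_as_root:
        raise ValueError(f"port {port} must be larger than 1024 unless run as root")
    return host, port


def is_local_only(host: str) -> bool:
    """True when the service would only be reachable from the same machine."""
    return host == "localhost"


host, port = parse_server_address("localhost:9000")
print(host, port, is_local_only(host))
```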
Once the server is running, you can contact it with the soap_client client program:
% bin/soap_client --url http://localhost:9000
This is a test sentence .
(ncmod _ sentence_4 test_3)
(det sentence_4 a_2)
(xcomp _ is_1 sentence_4)
(ncsubj is_1 This_0 _)
<c> This|this|DT|I-NP|O|NP is|be|VBZ|I-VP|O|(S[dcl]\NP)/NP a|a|DT|I-NP|O|NP[nb]/N test|test|NN|I-NP|O|N/N sentence|sentence|NN|I-NP|O|N .|.|.|O|O|.
The URL must include the http:// prefix, and the hostname and port must match the values given to soap_server when it was started. The --url option has the default value http://localhost:9000. You can also specify input and output files explicitly using --input and --output rather than using stdin and stdout.
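From a language without a SOAP binding at hand, one simple way to use the running service is to call soap_client as a subprocess. The Python sketch below assumes the soap_client binary lives at bin/soap_client and uses only the --url, --input and --output options described above. The function names build_client_command and parse_text are hypothetical, and parse_text only works while a server is listening at the given URL.

```python
import subprocess
from typing import List, Optional

DEFAULT_URL = "http://localhost:9000"  # default value of --url, per the text


def build_client_command(
    url: str = DEFAULT_URL,
    input_path: Optional[str] = None,
    output_path: Optional[str] = None,
    binary: str = "bin/soap_client",
) -> List[str]:
    """Assemble an argument list for soap_client."""
    if not url.startswith("http://"):
        raise ValueError("the URL must start with http://")
    command = [binary, "--url", url]
    if input_path is not None:
        command += ["--input", input_path]
    if output_path is not None:
        command += ["--output", output_path]
    return command


def parse_text(text: str, url: str = DEFAULT_URL) -> str:
    """Send tokenized text on stdin and return what soap_client prints.

    Requires a running soap_server at `url`; raises CalledProcessError
    if the client exits with a non-zero status.
    """
    result = subprocess.run(
        build_client_command(url),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


print(build_client_command())
```

With a server running, parse_text("This is a test sentence .\n") would return the parser output as a string.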
The client
Morphological Analysis
The morphological analyser used in the C&C NLP tools pipeline is morpha, developed by Minnen, Carroll and Pearce. If you are using the information from morpha, please cite the following paper:
Minnen, G., J. Carroll and D. Pearce (2001). Applied morphological processing of English. Natural Language Engineering, 7(3), 207-223.