Using the Taggers

By default, the taggers require only one command line option, --model, which specifies where the corresponding model data is stored. For example, if the POS model is in models/pos, then the tagger can be run with:

% bin/pos --model models/pos

This will cause the tagger to read text from standard input and write the tagged text to standard output.

The taggers can also read from and write to files, specified using the --input and --output options. If either option is omitted, standard input or standard output is used in its place. The input and output formats (described above) can be changed using the --ifmt and --ofmt options.
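
When a tagger is driven from a script, the options above map directly onto an argument list. The following is a minimal Python sketch and not part of the C&C tools: the helper name build_tagger_command and the file names raw.txt and tagged.pos are hypothetical, and only the --model, --input and --output options described above are used.

```python
import shlex


def build_tagger_command(binary, model, input_path=None, output_path=None):
    """Build an argument list for a tagger invocation.

    Only --model is always present. --input and --output are added when
    given; otherwise the tagger uses standard input / standard output.
    """
    args = [binary, "--model", model]
    if input_path is not None:
        args += ["--input", input_path]
    if output_path is not None:
        args += ["--output", output_path]
    return args


# Hypothetical file names, for illustration only.
cmd = build_tagger_command(
    "bin/pos", "models/pos", input_path="raw.txt", output_path="tagged.pos"
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run once the tagger binary and the model are in place.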

The --opref and --overbose options add a comment block preface to the top of the output file with information about the tagging process. This preface can be read by the C&C programs to provide a complete history of how the file was created. The --opref option only shows the command line arguments, whereas the --overbose option lists all of the non-default configuration options and the version number of the tagger.

For the multi-taggers, the level of ambiguity can be controlled using the --beta option. For each word, the tagger outputs every tag whose probability is within beta of that of the highest-probability tag, so beta determines the average level of ambiguity. A smaller value of beta leads to a higher average level of ambiguity.
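
The effect of beta can be illustrated with a small Python sketch. It assumes that "within beta" is a ratio test, that is, a tag is kept when its probability is at least beta times the highest probability; this reading is an assumption, chosen because it matches the statement that a smaller beta gives more ambiguity. The function name select_tags and the probabilities are invented for illustration, and this is not the taggers' own implementation.

```python
def select_tags(tag_probs, beta):
    """Keep every tag whose probability is at least beta * (best probability).

    Assumption: beta acts as a multiplicative cutoff relative to the
    highest-probability tag. Returns (tag, prob) pairs, most probable first.
    """
    best = max(tag_probs.values())
    threshold = beta * best
    kept = [(tag, p) for tag, p in tag_probs.items() if p >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Made-up distribution over tags for a single word.
probs = {"NN": 0.70, "VB": 0.20, "JJ": 0.08, "RB": 0.02}

for beta in (0.5, 0.1, 0.01):
    tags = [tag for tag, _ in select_tags(probs, beta)]
    print(beta, tags)
```

Under this assumption, lowering beta from 0.5 to 0.01 widens the cutoff, so more tags survive for each word.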

Tagger Training Programs

The tagger training programs require three command line options: the directory where the new model will be saved (--model), the path to the tagged data (--input) and a comment field (--comment). The following example is for training the POS tagger:

% bin/train_pos --model models/pos --input wsj02-21.pos --comment "WSJ sections 02-21" 

The training program will create the directory models/pos or reuse the directory if it already exists.

The comment field allows you to record any information about the model; this text is added to the verbose output preface. The comment field is mandatory, but it can be an empty string.

The train_ner program, however, requires the model directory to be created beforehand and a gazetteers file to exist in that directory. This file should list the locations of any gazetteers that the named entity recogniser will use. Typically these gazetteers are also stored in the model directory, in a subdirectory called gazetteer. To see an example of this, look at the predefined muc model in the C&C tools distribution.
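
Because train_ner expects the directory and the gazetteers file to be in place already, a quick check before training can save a failed run. The following Python sketch is one way to do that. It assumes the file is literally named gazetteers inside the model directory, which the text above does not state; compare with the predefined muc model and adjust the name if it differs. The function name missing_ner_prerequisites is hypothetical, and the sketch does not inspect the contents of the file.

```python
import tempfile
from pathlib import Path


def missing_ner_prerequisites(model_dir, gazetteers_name="gazetteers"):
    """Return a list of problems with model_dir; an empty list means it looks ready.

    Assumption: the gazetteers file is called `gazetteers_name` and sits
    directly inside the model directory.
    """
    model_dir = Path(model_dir)
    problems = []
    if not model_dir.is_dir():
        problems.append(f"model directory does not exist: {model_dir}")
    elif not (model_dir / gazetteers_name).is_file():
        problems.append(f"gazetteers file not found: {model_dir / gazetteers_name}")
    return problems


# Demonstration in a throwaway directory, so no real model is touched.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "ner"
    no_directory = missing_ner_prerequisites(model_dir)
    model_dir.mkdir()
    no_gazetteers = missing_ner_prerequisites(model_dir)
    (model_dir / "gazetteers").touch()
    ready = missing_ner_prerequisites(model_dir)

print(no_directory, no_gazetteers, ready, sep="\n")
```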