Configurations and Command Line Arguments

The C&C tools use a configuration management system that lets the user override practically all of the default parameters for training and running the taggers and parser. As a result, most of the tools have a complicated-looking set of command line arguments, but the default values are suitable for most applications. Many of the options change the internal behaviour of the system and require knowledge of the implementation to be used correctly.

All of the C&C programs require command line arguments to run. If no arguments are provided, the program prints a help message describing all of its command line arguments, with the most important options listed first. The same help message is printed when the --help argument is used. For example, the POS tagger has these options:

% bin/pos --help
usage: pos [options]

main program options:
  --help: show the help message
  --config <arg>: load a configuration file
  --version: the version number

  --model <arg>: POS tagger config (alias for pos)

  --dict_cutoff <arg>: the frequency at which the tag dictionary is used (default = 5)
  --algorithm <arg>: the decoding algorithm to use [viterbi] (default = "viterbi")
  --input <arg>: the input file to read from (default = "<stdin>")
  --ifmt <arg>: the input file format (default = "%w \n")

  --output <arg>: the output file to write to (default = "<stdout>")
  --ofmt <arg>: the output file format (default = "%w%|%p \n")

  --opref: start the output with a preface (default = false)
  --overbose: start the output with a verbose preface (default = false)
  --oconfig <arg>: save the current configuration into a file (default = "")

POS tagger config options:
  --pos-help: show the help message
  --pos-config <arg>: load a configuration file
  --pos-dir <arg>: the pos directory (default = "//pos")
  --pos-tagdict <arg>: the tag dictionary file path (default = "//tagdict")
  --pos-unknowns <arg>: the set of tags for unknown words (default = "//unknowns")

  --pos-cutoff_default <arg>: the minimum frequency cutoff for features (default = 1)
  --pos-cutoff_words <arg>: the minimum frequency cutoff for word features (default = 1)
  --pos-rare_cutoff <arg>: the word frequency for which rare word features are used (default = 5)
  --pos-beam_width <arg>: the number of best tags to keep in the beam (default = 5)
  --pos-beam_ratio <arg>: the ratio of the worst:best tags in the beam (default = 0.005)
  --pos-forward_beam_ratio <arg>: the ratio of the worst:best tags in the forward step (default = 0.001)
  --pos-tagdict_min <arg>: the minimum frequency for adding a word-tag pair to the tag dict (default = 5)
  --pos-tagdict_ratio <arg>: the ratio of the min:max frequency of word-tag pairs in the tag dict (default = 500)  
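
As a rough illustration of overriding a few defaults, the Python sketch below builds a pos command line from a dictionary of options. The option names are taken from the help message above; the helper name build_command, the model directory models/pos and the file names are assumptions made for this example and are not part of the C&C tools. It also assumes that options listed without <arg> are given as bare flags.

```python
def build_command(binary, options):
    """Turn a dict of option names and values into an argument list.

    A value of True emits a bare flag (assumed to suit options that the
    help message lists without <arg>); False or None leaves the option
    out so the tool's default applies; any other value is passed as the
    option's argument.
    """
    command = [binary]
    for name, value in options.items():
        if value is None or value is False:
            continue
        command.append("--" + name)
        if value is not True:
            command.append(str(value))
    return command


# Hypothetical paths: adjust to wherever the model and data actually live.
command = build_command("bin/pos", {
    "model": "models/pos",
    "input": "input.txt",
    "output": "output.pos",
    "pos-beam_width": 10,   # override the default of 5
    "opref": False,         # left out, so the default (false) applies
})
print(" ".join(command))

# To actually run the tagger (requires a built bin/pos and a trained model):
#   import subprocess
#   subprocess.run(command, check=True)
```

Keeping the overrides in one dictionary makes it easy to see which parameters differ from the defaults; everything not listed is left to the tool.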

All of the tools now provide a --version argument, which prints the version number of the C&C Subversion repository from which the system was built, together with the date and time of compilation:

% bin/pos --version
pos 169 (built 12 September 2006, 01:26:20)

This makes results reproducible, since the exact version that produced them can be checked out of the repository and rerun on the same data.
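
To record which build produced a set of results, the version line can be parsed and stored alongside the output. The Python sketch below assumes the line always has the shape shown above (program name, a numeric version, then the build date and time in parentheses); the function name parse_version and the returned field names are made up for this example.

```python
import re
from datetime import datetime

# Pattern for lines such as: pos 169 (built 12 September 2006, 01:26:20)
VERSION_LINE = re.compile(
    r"^(?P<program>\S+)\s+(?P<version>\d+)\s+\(built\s+(?P<built>[^)]+)\)\s*$"
)


def parse_version(line):
    """Split a --version line into program name, version number and build time."""
    match = VERSION_LINE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised version line: %r" % line)
    # Month names are parsed as English, which is Python's default locale setting.
    built = datetime.strptime(match.group("built"), "%d %B %Y, %H:%M:%S")
    return {
        "program": match.group("program"),
        "version": int(match.group("version")),
        "built": built,
    }


info = parse_version("pos 169 (built 12 September 2006, 01:26:20)")
print(info["program"], info["version"], info["built"].isoformat())
```

Storing the parsed version number with each experiment's output gives the value needed to check the same version out of the repository later.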

All of the tools also provide a --licence argument, which prints the C&C NLP tools licence and exits.