C&C Taggers

The taggers are based on Maximum Entropy tagging methods, using log-linear probability distributions to model local decisions at each point in the tagging process. The taggers have been designed to be highly efficient; for example, the POS tagger runs at over 100,000 words per second.
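
As a rough illustration of what a log-linear local decision looks like, the sketch below computes a distribution over tags for one word from the weights of its active features. It is not the C&C implementation: the function name, the feature strings, the tag set, and the weights are all hypothetical values chosen for the example.

```python
import math

def tag_distribution(active_predicates, weights, tags):
    """Return p(tag | context) for a log-linear model.

    Each tag's score is the sum of the weights of the (predicate, tag)
    features that fire in this context; the scores are then exponentiated
    and normalised so they sum to one.
    """
    scores = {
        t: sum(weights.get((p, t), 0.0) for p in active_predicates)
        for t in tags
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# Hypothetical tag set and feature weights, for illustration only.
TAGS = ["NN", "VB"]
WEIGHTS = {
    ("word=run", "VB"): 1.2,
    ("word=run", "NN"): 0.4,
    ("prev_tag=TO", "VB"): 2.0,
}

dist = tag_distribution(["word=run", "prev_tag=TO"], WEIGHTS, TAGS)
print(dist)  # VB receives most of the probability mass in this context
```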

Each tagger can be run as a "multi-tagger", potentially assigning more than one tag to a word. The multi-tagger uses the forward-backward algorithm to calculate a distribution over tags for each word in the sentence, and a parameter beta determines how many tags are assigned to each word. The assignment is "dynamic" in the sense that the number of tags assigned to a word is determined by the ambiguity of the word in question, as measured by the distribution over the tags.
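
The multi-tagging idea can be sketched as follows. The code runs forward-backward over a small tag lattice to get a per-word distribution over tags, then keeps every tag whose probability is at least beta times that of the most probable tag for the word. That cutoff rule is an assumption made for this sketch (the text above only says that beta controls how many tags are assigned), and the function names, tag set, and scores are hypothetical.

```python
def forward_backward(init, trans):
    """Per-word marginal tag distributions for a first-order tag lattice.

    init[t]        : score of tag t at position 0
    trans[i][s][t] : score of tag t at position i+1 given tag s at position i
    """
    n = len(trans) + 1
    k = len(init)

    # Forward pass: total score of all paths ending in tag t at position i.
    fwd = [list(init)]
    for i in range(n - 1):
        fwd.append([
            sum(fwd[i][s] * trans[i][s][t] for s in range(k))
            for t in range(k)
        ])

    # Backward pass: total score of all paths leaving tag s at position i.
    bwd = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = [
            sum(trans[i][s][t] * bwd[i + 1][t] for t in range(k))
            for s in range(k)
        ]

    marginals = []
    for i in range(n):
        joint = [fwd[i][t] * bwd[i][t] for t in range(k)]
        z = sum(joint)
        marginals.append([p / z for p in joint])
    return marginals

def multi_tag(marginals, tags, beta):
    """Keep each tag with probability >= beta * (best tag's probability).

    Assumed cutoff rule: ambiguous words (flat distributions) keep several
    tags, while confidently tagged words keep only one.
    """
    result = []
    for dist in marginals:
        cutoff = beta * max(dist)
        result.append([tags[t] for t, p in enumerate(dist) if p >= cutoff])
    return result

# Hypothetical two-word sentence: the first word is ambiguous,
# the second is very likely a noun.
TAGS = ["NN", "VB"]
init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.9, 0.1]]]

marginals = forward_backward(init, trans)
print(multi_tag(marginals, TAGS, beta=0.5))   # [['NN', 'VB'], ['NN']]
print(multi_tag(marginals, TAGS, beta=0.05))  # [['NN', 'VB'], ['NN', 'VB']]
```

Under this assumed rule, lowering beta admits more tags per word, and beta = 1 keeps only the tags tied for the highest probability.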

The taggers can all be trained on new annotated data. The C&C tools include training code for both GIS (Generalised Iterative Scaling) and BFGS (a quasi-Newton optimisation method).
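
To give a feel for what GIS training does, here is a minimal textbook-style sketch on a toy data set. It is not the C&C training code: it omits smoothing and convergence checks, uses only features seen in the training data, and all names and data are hypothetical. Each iteration moves every weight by the log ratio of the feature's empirical count to its expected count under the current model, scaled by 1/C, where C is the largest number of active predicates in any training context.

```python
import math
from collections import defaultdict

def predict(weights, predicates, tags):
    """Log-linear distribution over tags for one context."""
    scores = [sum(weights.get((p, t), 0.0) for p in predicates) for t in tags]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def train_gis(data, tags, iterations=50):
    """Plain GIS on (predicates, gold_tag) pairs; no smoothing."""
    C = max(len(preds) for preds, _ in data)

    # Empirical feature counts from the annotated data.
    empirical = defaultdict(float)
    for preds, gold in data:
        for p in preds:
            empirical[(p, gold)] += 1.0

    weights = {feat: 0.0 for feat in empirical}
    for _ in range(iterations):
        # Expected feature counts under the current model.
        expected = defaultdict(float)
        for preds, _ in data:
            dist = predict(weights, preds, tags)
            for t, prob in zip(tags, dist):
                for p in preds:
                    if (p, t) in weights:
                        expected[(p, t)] += prob
        # GIS update: shift each weight towards matching the empirical count.
        for feat in weights:
            weights[feat] += math.log(empirical[feat] / expected[feat]) / C
    return weights

def log_likelihood(weights, data, tags):
    """Conditional log-likelihood of the gold tags."""
    return sum(
        math.log(predict(weights, preds, tags)[tags.index(gold)])
        for preds, gold in data
    )

# Hypothetical annotated data: (context predicates, gold tag).
TAGS = ["NN", "VB"]
DATA = [
    (["word=run", "prev=TO"], "VB"),
    (["word=run", "prev=DT"], "NN"),
    (["word=dog", "prev=DT"], "NN"),
    (["word=walk", "prev=TO"], "VB"),
]

weights = train_gis(DATA, TAGS)
print(log_likelihood({}, DATA, TAGS), log_likelihood(weights, DATA, TAGS))
```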