Input/Output Formats
Unlike previous versions of the C&C tools, the new taggers (and therefore the parser too) can accept many different input formats and produce many different output formats. These are described using a little language similar to C printf format strings.
For example, the input format %w|%p \n indicates that the program expects word (%w) and POS tag (%p) pairs as input. The word and POS tag within each pair are separated by a pipe character (vertical bar, |), the pairs are separated by single spaces, and sentences are separated by newlines (\n).
This is the default input format for the supertagger:
    Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS old|JJ ,|, will|MD join|VB ...
    Mr.|NNP Vinken|NNP is|VBZ chairman|NN of|IN Elsevier|NNP N.V.|NNP ...
    ...
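To make the layout concrete, here is a minimal Python sketch that reads text in the %w|%p \n layout into (word, POS tag) pairs. It is not part of the C&C tools; the function name parse_horizontal and its parameters are made up for illustration, and splitting on the last pipe of each token is a choice of this sketch rather than documented tagger behaviour.

```python
def parse_horizontal(text, field_sep="|", token_sep=" ", sentence_sep="\n"):
    """Parse text laid out as %w|%p \\n into a list of sentences.

    Each sentence is returned as a list of (word, pos) tuples.
    """
    sentences = []
    for line in text.split(sentence_sep):
        if not line:
            continue  # skip the empty piece after the final newline
        sentence = []
        for token in line.split(token_sep):
            if not token:
                continue  # tolerate repeated separators
            # Assumption of this sketch: split on the LAST field separator,
            # so a word that itself contains "|" keeps its text intact.
            word, pos = token.rsplit(field_sep, 1)
            sentence.append((word, pos))
        sentences.append(sentence)
    return sentences


sample = (
    "Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS old|JJ\n"
    "Mr.|NNP Vinken|NNP is|VBZ chairman|NN\n"
)
sentences = parse_horizontal(sample)
print(sentences[0][:3])
```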
The corresponding output format is %w|%p|%s \n which adds a single supertag to each word in the output:
    Pierre|NNP|N/N Vinken|NNP|N ,|,|, 61|CD|N/N years|NNS|N old|JJ|(S[adj]\NP)\NP ,|,|, will|MD|(S[dcl]\NP)/(S[b]\NP) join|VB|((S[b]\NP)/PP)/NP ...
    Mr.|NNP|N/N Vinken|NNP|N is|VBZ|(S[dcl]\NP)/NP chairman|NN|N of|IN|(NP\NP)/NP Elsevier|NNP|N/N N.V.|NNP|N ...
For the multi-tagger, it is more convenient to use a vertical output format, with one word, plus the multiple tags with their probabilities, on one line. The output format for this is %w\t%p\t%S\n\n\n which corresponds to the words, POS tags and the multiple supertags (note the capitalised %S) being separated by tab characters (\t), each word being separated by a single newline, and the sentences being separated by two newlines (which results in a blank line being printed between each sentence):
    Pierre 1 NNP 1
    Vinken 1 NNP 0.990279
    , 1 , 1
    61 1 CD 1
    years 1 NNS 1
    old 1 JJ 0.992836
    , 1 , 1
    will 1 MD 1
    join 1 VB 1
    ...

    Mr. 1 NNP 1
    Vinken 1 NNP 1
    is 1 VBZ 1
    chairman 1 NN 1
    of 1 IN 1
    Elsevier 1 NNP 0.994435
    N.V. 1 NNP 1
    ...
You can also add extra markers at the beginning and end of the sentence. For example, to add <s> and </s> as markers on separate lines before and after each sentence, use the following format: <s>\n%w\t%p\t%S\n\n</s>\n.
    <s>
    Pierre 1 NNP 1
    Vinken 1 NNP 0.990279
    , 1 , 1
    61 1 CD 1
    years 1 NNS 1
    old 1 JJ 0.992836
    , 1 , 1
    will 1 MD 1
    join 1 VB 1
    ...
    </s>
    <s>
    Mr. 1 NNP 1
    Vinken 1 NNP 1
    is 1 VBZ 1
    chairman 1 NN 1
    of 1 IN 1
    Elsevier 1 NNP 0.994435
    N.V. 1 NNP 1
    ...
    </s>
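One way to read these format strings is as four pieces: an optional sentence prefix, the fields joined by a field separator, a separator between words, and a sentence suffix. The Python sketch below writes sentences under that reading. It is an interpretation inferred from the examples above, not the C&C implementation; the function render and its parameter names are hypothetical, and each field is passed in as an already formatted string.

```python
def render(sentences, field_sep="\t", token_sep="\n", prefix="", suffix="\n\n"):
    """Write sentences using a prefix, field separator, word separator and suffix.

    sentences is a list of sentences; each sentence is a list of tuples of
    field strings, for example (word, pos, supertag).
    """
    pieces = []
    for sentence in sentences:
        tokens = [field_sep.join(fields) for fields in sentence]
        pieces.append(prefix + token_sep.join(tokens) + suffix)
    return "".join(pieces)


data = [
    [("Pierre", "NNP", "N/N"), ("Vinken", "NNP", "N")],
    [("Mr.", "NNP", "N/N")],
]

# Vertical layout: one word per line, blank line after each sentence.
vertical = render(data)

# Vertical layout with <s> and </s> on their own lines around each sentence.
marked = render(data, prefix="<s>\n", suffix="\n</s>\n")

# Horizontal layout with pipes: one sentence per line.
horizontal = render(data, field_sep="|", token_sep=" ", suffix="\n")

print(marked, end="")
```

Under this reading, sentence markers for the horizontal layout only need a different prefix and suffix, for example prefix="<s> " and suffix=" </s>\n".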
You can also add markers for the horizontal format described above.
The input and output formats do not need to be related to each other. For instance, you can read a horizontal input format that uses pipes and still write a vertical output format that uses tabs. You also don't need to output all of the fields you read in.
Using the C++ API it is also possible to read in different fields from different data as long as the words are aligned. However, this functionality isn't currently available for the command line tools.
Reading unused fields
Another feature of the new input/output system is that fields which are not used in the tagging process can still be read in and passed through to the output.
For example, if you want to replace the existing POS tags in some named-entity-tagged text, you can give the POS tagger an input format of %w|%p|%n \n and an output format of %w|%p|%n \n. The POS tagger will then replace just the POS tags and print the original named entity tags unchanged.
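The effect of this pass-through can be illustrated with a small Python sketch. It does not call the C&C POS tagger: toy_tagger is a stand-in lookup invented for the example, retag_pos is a hypothetical helper, and the sample words and named entity tags are made-up illustration data.

```python
def toy_tagger(word):
    """Stand-in for a real POS tagger: a tiny lookup with a default tag."""
    lookup = {"Pierre": "NNP", "Vinken": "NNP", "joined": "VBD"}
    return lookup.get(word, "NN")


def retag_pos(line, tagger):
    """Replace the POS field of each word|pos|ne token, keeping word and NE tag."""
    out = []
    for token in line.split(" "):
        word, _old_pos, ne = token.split("|")
        # Only the POS field is rewritten; the NE field is copied unchanged.
        out.append("|".join((word, tagger(word), ne)))
    return " ".join(out)


before = "Pierre|XX|I-PER Vinken|XX|I-PER joined|XX|O"
after = retag_pos(before, toy_tagger)
print(after)
```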
You can also do this for up to 10 additional fields (labelled %0 through to %9) which don't correspond to the predefined tag types.
The predefined tag types are:
| Type | Single Tag | Multiple Tags |
| --- | --- | --- |
| words (or tokens) | %w | %W |
| POS tag | %p | %P |
| chunk tag | %c | %C |
| named entity tag | %n | %N |
| supertag | %s | %S |
Finally, you can ignore any field in the input by using %?, and, just like printf, if you want to print a % then use %%.
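As a rough illustration of how such a format string breaks down, the following Python sketch splits one into field codes and literal text, treating %% as a literal percent sign. The function tokenize_format is hypothetical and is not how the C&C tools parse formats internally; it also does not decode escape sequences such as \n, so the example passes a real newline character.

```python
# Field codes described in this section: the predefined tag types,
# the ignore marker (?), and the ten extra fields (0 to 9).
FIELD_CODES = set("wWpPcCnNsS?0123456789")


def tokenize_format(fmt):
    """Split a format string into ("field", code) and ("literal", text) tokens."""
    tokens = []
    literal = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch != "%":
            literal.append(ch)
            i += 1
            continue
        if i + 1 >= len(fmt):
            raise ValueError("dangling % at end of format")
        code = fmt[i + 1]
        if code == "%":
            literal.append("%")  # %% stands for a literal percent sign
        elif code in FIELD_CODES:
            if literal:
                tokens.append(("literal", "".join(literal)))
                literal = []
            tokens.append(("field", code))
        else:
            raise ValueError("unknown field code: %" + code)
        i += 2
    if literal:
        tokens.append(("literal", "".join(literal)))
    return tokens


print(tokenize_format("%w|%p|%? \n"))
```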
Unix, Mac OS X and Windows newlines
The IO system automatically recognises both Unix/Mac OS X newlines (\n) and Windows CRLF newlines (\r\n) as line separators when reading files. Files from classic Mac OS (version 9 and earlier), which use \r as the separator, are not currently supported.
To produce Windows CRLF output with one sentence per line, use \r\n in the format; for example, with the POS tagger you can use %w|%p \r\n. However, we recommend using Unix newlines wherever possible.
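The reading behaviour described above can be mimicked in a few lines of Python: accept \n and \r\n as line separators, but leave a lone \r alone. This is only a sketch of the described behaviour, not the tools' own code, and split_lines is a name chosen for the example.

```python
import re


def split_lines(data):
    """Split text on Unix (\\n) or Windows (\\r\\n) newlines only.

    A lone \\r (the classic Mac OS separator) is NOT treated as a line break.
    """
    lines = re.split(r"\r?\n", data)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after a trailing newline
    return lines


mixed = "The|DT cat|NN\r\nThe|DT dog|NN\n"
print(split_lines(mixed))
```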