Training the parser
This is currently internal documentation only and is subject to change.
There are four major steps to training the parser:
- extracting training/development/test data from CCGbank
- training up the supertagger
- creating the data required for a model
- estimating the weights for the model
Extracting data from CCGbank
The first stage of the training process is to extract data from the LDC's CCGbank release. This is done using the script create_data which stores training, development and test files in the working/ccg directory. These files are a combination of gold standard data files (in working/ccg/gold), files generated from the output of the parser (in working/ccg/generated) and features and dependencies extracted for training the parser (in working/ccg/feats, working/ccg/deps and working/ccg/wsj02-21.feats).
create_data first converts CCGbank into the old pipe file format used by Julia's own earlier releases of CCGbank. Eventually we will remove this step.
create_data also splits the features and dependencies data into individual node files, so it needs to know how many MPI nodes will be used for training. Note: this is the number of nodes, i.e. CPUs, not the number of machines the nodes are running on.
create_data therefore takes three arguments: the location of the CCGbank data, the number of nodes in the MPI cluster, and the destination directory for the data, e.g.:
% src/scripts/ccg/create_data ../data/CCGbank1.2 18 working
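A minimal sketch of a wrapper that checks the three arguments before calling create_data. The function name run_create_data and the DRY_RUN variable are hypothetical and not part of the scripts; only the create_data path and argument order are taken from the example above.

```shell
# Hypothetical wrapper around create_data (not part of the distribution).
# Usage: run_create_data CCGBANK_DIR NUM_NODES DEST_DIR
# Set DRY_RUN to a non-empty value to print the command instead of running it.
run_create_data() {
    if [ $# -ne 3 ]; then
        echo "usage: run_create_data CCGBANK_DIR NUM_NODES DEST_DIR" >&2
        return 2
    fi
    ccgbank=$1
    nodes=$2
    dest=$3

    # The node count is the number of MPI nodes (CPUs), not machines,
    # and must be a positive integer.
    case $nodes in
        ''|*[!0-9]*)
            echo "node count must be a positive integer: $nodes" >&2
            return 2
            ;;
    esac
    if [ "$nodes" -lt 1 ]; then
        echo "node count must be at least 1: $nodes" >&2
        return 2
    fi

    if [ ! -d "$ccgbank" ]; then
        echo "CCGbank directory not found: $ccgbank" >&2
        return 1
    fi

    if [ -n "${DRY_RUN:-}" ]; then
        echo "src/scripts/ccg/create_data $ccgbank $nodes $dest"
    else
        src/scripts/ccg/create_data "$ccgbank" "$nodes" "$dest"
    fi
}
```

With DRY_RUN set, run_create_data ../data/CCGbank1.2 18 working only prints the command it would run, which is a cheap way to check the node count before starting a long extraction.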
Training the supertagger
The supertagger needs to be trained before the create_model scripts are run, since some models require the supertagger to create the necessary data. There is more detailed information on the Taggers page, but here is all you need to train the supertagger from CCGbank:
% bin/train_super --model working/ccg/super --input working/ccg/gold/wsj02-21.stagged --solver bfgs --comment "CCGbank 02-21" --verbose
The later scripts assume the model is stored in the working/ccg/super directory, but this location can be changed in those scripts. Training the supertagger takes quite a long time since there are around 500 classes in the maximum entropy model.
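Since the later scripts expect a trained supertagger, a quick check before running them can save a failed run. This is only a sketch: the function name super_model_ready is hypothetical, and it tests only that the directory exists and is non-empty, because the files train_super writes are not listed here.

```shell
# Hypothetical check (not part of the scripts): succeed if the supertagger
# model directory exists and contains at least one entry.
# Defaults to working/ccg/super, the location the later scripts assume.
super_model_ready() {
    dir=${1:-working/ccg/super}
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Usage:
#   super_model_ready || echo "train the supertagger first" >&2
```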
Creating the model data
There are currently three different models which can be created, in line with the Computational Linguistics journal submission. They are:
- create_model_derivs -- the normal-form model (beta = 0.01,0.05,0.1)
  TODO: these beta values aren't exactly what was in the CL paper, but they are the current settings in forests.cc
- create_model_deps -- the dependency model (beta = 0.1)
- create_model_hybrid -- the hybrid dependency/normal-form model (beta = 0.1)
The main steps in the create_model scripts are to:
- create a model directory (in working/ccg/model_derivs etc)
- copy the markedup category data from src/data/ccg/cats into the model directory
- filter the set of all features extracted in the create_data script, so that only the features used by the particular model are included
- collect a feature lexicon
- extract a list of valid rule instances from CCGbank (create_model_derivs only)
- create configuration files for the estimation code (bin/tree_gis)
Starting the MPI daemons
The create_model scripts for the last two models (dependency and hybrid) run the bin/count_rules program, which requires the MPI daemon to be running. The command and arguments will depend on your local distribution, but with MPICH we use:
% mpdboot -f ~ask/env/mpd.hosts -n 9
You can check the MPI daemon has started correctly with mpdtrace:
% mpdtrace -l
It should show you the hostnames and process IDs of the MPI daemons on each machine. Note that the -n option for mpdboot takes the number of machines, not the number of nodes.
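A small sketch for checking the daemon count automatically. The helper check_mpd_count is hypothetical, and it assumes mpdtrace prints one line per running daemon, so the expected value is the number of machines given to mpdboot.

```shell
# Hypothetical helper: succeed only if the daemon listing read from
# standard input has exactly the expected number of non-empty lines.
# Assumes mpdtrace prints one line per running daemon.
check_mpd_count() {
    expected=$1
    # grep -c exits non-zero when nothing matches; "|| true" keeps the
    # count (0) without aborting scripts that run under "set -e".
    actual=$(grep -c . || true)
    if [ "$actual" -ne "$expected" ]; then
        echo "expected $expected MPI daemons, found $actual" >&2
        return 1
    fi
}

# Usage (requires the MPI daemons to have been started):
#   mpdtrace -l | check_mpd_count 9
```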
Running the create_model scripts
create_model_derivs takes no arguments, since it doesn't use MPI, and can be run with:
% src/scripts/ccg/create_model_derivs
The other create_model scripts take two arguments: the working directory for the MPI processes and the number of MPI nodes.
We need to know the working directory explicitly since it may not be mounted with the same name in the local file space as it is on the other nodes in the cluster. For example, on our setup we must explicitly name the machine holding the drive, e.g. /n/nlp0 in:
% src/scripts/ccg/create_model_deps /n/nlp0/u1/repos/candc/trunk 18
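The difference in arguments between the three scripts can be sketched as a single helper that prints the right command line. The function create_model_cmd is hypothetical; the script names and argument order are the ones described above.

```shell
# Hypothetical helper: print the create_model command for a given model.
# Usage: create_model_cmd derivs
#        create_model_cmd deps|hybrid WORKDIR NUM_NODES
create_model_cmd() {
    model=${1:-}
    case $model in
        derivs)
            # The normal-form model does not use MPI and takes no arguments.
            echo "src/scripts/ccg/create_model_derivs"
            ;;
        deps|hybrid)
            # These need the working directory as seen by the MPI
            # processes, and the number of MPI nodes.
            if [ $# -ne 3 ]; then
                echo "usage: create_model_cmd $model WORKDIR NUM_NODES" >&2
                return 2
            fi
            echo "src/scripts/ccg/create_model_$model $2 $3"
            ;;
        *)
            echo "unknown model: $model (expected derivs, deps or hybrid)" >&2
            return 2
            ;;
    esac
}
```

For example, create_model_cmd deps /n/nlp0/u1/repos/candc/trunk 18 prints the command shown above, which can then be run from the top of the source tree.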
Estimating the weights
The final stage of training, using the train_model scripts, requires the MPI daemons to be running for all models since both forest creation (bin/forests) and the estimation process (bin/tree_gis) use MPI.
The train_model scripts run bin/forests with different arguments (the same ones available in the bin/parser and bin/soap_ccg programs) to turn normal-form and CCGbank rules on and off, and to set the beta values used by the supertagger/parser to create the forests.
By default the forests are created in the /tmp directory on each node since (at least on our machines) these directories are a lot faster to access. You will need up to 1.5 GB free per node in /tmp.
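A minimal sketch of a free-space check for the forest directory. The names REQUIRED_KB, free_kb and has_forest_space are hypothetical; the check relies only on the POSIX output format of df -Pk, and it has to be run on each node because /tmp is local to that node.

```shell
# Hypothetical helpers for checking free space before creating forests.
# 1.5 GB = 1.5 * 1024 * 1024 KB = 1572864 KB.
REQUIRED_KB=1572864

# Print the available space in KB for the filesystem holding directory $1.
# With "df -Pk" the 4th field of the 2nd output line is the available space.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the directory (default /tmp) has at least REQUIRED_KB free.
has_forest_space() {
    avail=$(free_kb "${1:-/tmp}")
    [ "$avail" -ge "$REQUIRED_KB" ]
}

# Usage:
#   has_forest_space /tmp || echo "less than 1.5 GB free in /tmp" >&2
```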
The train_model scripts take the same two command line arguments as above: the working directory for the MPI processes and the number of MPI nodes:
% src/scripts/ccg/train_model_derivs /n/nlp0/u1/repos/candc/trunk 18
Once this is completed the model is ready to be used with the parser.
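The whole sequence for the normal-form model can be summarised as a sketch that only prints the commands from this page in order. The function print_derivs_pipeline and its parameter names are hypothetical, and the destination directory and supertagger options are simply the example values used above; adjust them for your setup.

```shell
# Hypothetical summary helper: print, in order, the commands used on this
# page to train the normal-form model. Nothing is executed here.
# Usage: print_derivs_pipeline CCGBANK WORKDIR MACHINES NODES HOSTS_FILE
print_derivs_pipeline() {
    ccgbank=$1   # location of the CCGbank data
    workdir=$2   # working directory as seen by the MPI processes
    machines=$3  # number of machines (for mpdboot -n)
    nodes=$4     # number of MPI nodes, i.e. CPUs
    hosts=$5     # MPD hosts file (for mpdboot -f)

    cat <<EOF
src/scripts/ccg/create_data $ccgbank $nodes working
bin/train_super --model working/ccg/super --input working/ccg/gold/wsj02-21.stagged --solver bfgs --comment "CCGbank 02-21" --verbose
src/scripts/ccg/create_model_derivs
mpdboot -f $hosts -n $machines
src/scripts/ccg/train_model_derivs $workdir $nodes
EOF
}
```

Printing rather than running the commands keeps the sketch safe to try: review the output first, then run each step by hand, since the supertagger and estimation steps each take a long time.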
Note: the command line arguments to the parser must still match the model. For example, if you are using the dependency model (working/ccg/model_deps), then you must turn off the normal-form constraints with --parser-eisner_nf false. This will be fixed in the near future.
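Until that is fixed, the extra option can be selected from the model directory name. This is a sketch only: parser_flags_for is a hypothetical helper, and it covers just the dependency model case described above; it prints nothing for other models, which is not a claim that they need no extra options.

```shell
# Hypothetical helper: print the extra parser options required by a model
# directory. Only the dependency model case is documented on this page.
parser_flags_for() {
    case $1 in
        */model_deps|*/model_deps/|model_deps)
            # The dependency model needs the normal-form constraints off.
            echo "--parser-eisner_nf false"
            ;;
        *)
            # Nothing documented for other models.
            ;;
    esac
}

# Usage: append the result to your usual parser command line, e.g.
#   extra=$(parser_flags_for working/ccg/model_deps)
```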