Steve's thoughts on experiments

1st Thoughts

View the task as one of domain adaptation, with constraints providing higher quality training data than self-training alone. Of course the methods are more generally applicable than domain adaptation, but this is one useful and compelling application of the techniques.

The trick now is to find a domain which isn't too artificial but for which we're likely to get accuracy gains. The domain generated by the original queries is an obvious one to use, eg the "Alexander Graham Bell inventing the telephone" domain. In order to widen this domain a little, it is probably better to have the "Alexander Graham Bell telephone" domain, where the inventing is implicit; this also allows the creation, pioneering, etc of the telephone. In fact, one could even imagine an application of this domain. Suppose we have a QA system for which users often ask about Alexander Graham Bell; here it would be useful to have a parser which is accurate on these sentences. The ultimate application would be real-time learning of the parser when the user enters the query.

Suggested datasets

5 different domains, based on pairs of named entities. Eg (Alexander Graham Bell, telephone).

50 sentences for each domain, drawn from the relevant query and annotated with gold-standard GRs, for testing. The advantage in having these marked up is that we can also use this test data to measure the accuracy of the parses after the constraints have been applied.

300+ sentences for training for each domain. 300 might be enough given the domain is so constrained.
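
A minimal sketch of the per-domain split described above, assuming the sentences retrieved by a query are already available as a list of strings. The function name `split_domain`, the fixed seed and the placeholder sentences are illustrative assumptions, not part of the plan itself.

```python
import random

def split_domain(sentences, n_test=50, seed=0):
    """Hold out n_test sentences for gold-standard GR annotation;
    everything else becomes (self-)training data for the domain."""
    if len(sentences) <= n_test:
        raise ValueError("need more sentences than the test-set size")
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(sentences)     # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[:n_test], shuffled[n_test:]

# Hypothetical domain with 350 placeholder sentences.
sentences = [f"sentence {i}" for i in range(350)]
test, train = split_domain(sentences)
print(len(test), len(train))  # 50 300
```

Picking the test sentences at random, rather than by hand, keeps the test set representative of what the query returns.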

Experiments (for each domain)

  1. Measure the accuracy of the parser on the test sentences before and after applying the constraints.
  2. Train the supertagger on the 1-best parse from the constrained training sentences. Do the same on the unconstrained 1-best parse. Add the training sentences, say, 10 times to the CCGbank data. Measure performance of the supertagger and parser using:
    • wsj model
    • wsj+wiki model (no constraints)
    • wsj+wiki model (with constraints)
  3. Repeat the above for training the full parsing model. Measure accuracy of the wsj model, the wsj+wiki model without constraints, and the wsj+wiki model with constraints. The hope is that the performance will increase in each case.
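
The three training conditions above can be sketched as follows. This is only an illustration under stated assumptions: each training item is an opaque object (here a string), `build_training_mix` and `training_conditions` are hypothetical helper names, and the factor of 10 is the "say, 10 times" figure from step 2 rather than a tuned value.

```python
def build_training_mix(ccgbank, wiki_parsed, repeat=10):
    """CCGbank data plus the 1-best wiki parses, with the wiki parses
    repeated so they are not swamped by the much larger wsj data."""
    return list(ccgbank) + list(wiki_parsed) * repeat

def training_conditions(ccgbank, wiki_unconstrained, wiki_constrained, repeat=10):
    """One training set per model compared in the experiments."""
    return {
        "wsj": list(ccgbank),
        "wsj+wiki (no constraints)": build_training_mix(ccgbank, wiki_unconstrained, repeat),
        "wsj+wiki (with constraints)": build_training_mix(ccgbank, wiki_constrained, repeat),
    }

# Placeholder data: 1000 wsj items and 300 wiki items per condition.
ccgbank = [f"wsj-{i}" for i in range(1000)]
plain = [f"wiki-plain-{i}" for i in range(300)]
constrained = [f"wiki-constrained-{i}" for i in range(300)]

conditions = training_conditions(ccgbank, plain, constrained)
for name, data in conditions.items():
    print(name, len(data))
```

Keeping the two wsj+wiki sets the same size means any difference between them comes from the constraints rather than from the amount of data.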

2nd Thoughts

We want to demonstrate the potential utility of the constrained data. Problem: we're not going to improve on the full wsj parsing task, or indeed the performance of the parser on wikipedia with a full model, just because this sets the baseline too high at this stage. So we need an alternative experimental framework which isn't too noddy or artificial -- so that people will be convinced that the method might generalise to the full case and is worth pursuing -- but on the other hand is noddy enough that we can get some +ve results in the next 4 weeks.

Two proposals

  1. start out with a model trained on a subset of CCGbank (eg 5,000 sentences?) and try to improve performance on the wikipedia test data (we can also use the wsj test data, of course). This mirrors the situation where training data is sparse and where we're using the bootstrapping as a cheap way of getting more data. The big advantage with this one is that we already have all the data. It would also demonstrate that the constraints can provide broadly applicable data which is better because of a knock-on effect of the constraints applying in the training data, rather than just acquiring specific knowledge about inventor sentences, say.
  2. create around 3 new "domains" defined by some manually chosen queries (ones which appear to provide useful constraints). Eg the "inventor" domain. Extract N sentences using the query, where ideally N is around 1000; take 50 for manual annotation with GRs; use the rest for training. Then we can see this as a kind of domain adaptation exercise, albeit one with rather tightly constrained domains. Note that the constraints can generalise across the inventors, eg a constraint of the form INVENTOR pioneered INVENTION can be extracted from, and apply to, both Edison and Dyson. The big disadvantage with this one is the need to annotate test data.

Hence the proposal is to try 1) next week.

Methodology in training and testing

  • generate N sentences from some query, eg one getting invention sentences. Find constraints by finding high-frequency dependency paths, eg INVENTOR invented INVENTION (so notice generalisation across inventor and invention). Apply constraints to the training data, giving back 1-best parse.
  • Measure how good the constraints are: for this we will only check, by hand, whether the constraints have been applied correctly and in what % of cases. Getting some measure of GR accuracy on the training data would be nice, but requires annotated data; so we won't do this. Nor will we artificially pick sentences for the test data for which the constraints apply, since it will be better to have a representative set. (Of course, if it turns out that enough of the test sentences are ones where the constraints can be applied and which are changed by the constraints, then we can use them to measure the GR accuracy before and after the constraints, but there are unlikely to be many in 50 test sentences.)
  • Train 2 models: one on the new training data with the constraints applied, and one without. We'll need to think about how much wsj data to add into the mix. For proposal 2) it might be worth having no wsj data at all (or at least only a small amount). We can do this for the supertagger alone as well as the parser+supertagger. Test whether the supertagger and supertagger+parser improve when using constrained 1-best data compared to the straight self-training case.
  • Oracle expt: repeat but only use those training sentences where the constraints have been correctly applied. We'll know what these are because we've already done that manual test (this can be done relatively quickly).
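
The constraint-finding, coverage and oracle steps above can be sketched as follows. This is a toy illustration, not the actual pipeline: it assumes each sentence has already been reduced to a dependency path between the two named entities, given here as a hypothetical tuple of tokens, and the names `generalise`, `mine_constraints`, `coverage` and `oracle_subset` are invented for the example. Note that `coverage` only reports the % of sentences to which some constraint applies; whether it was applied correctly still needs the manual check.

```python
from collections import Counter

def generalise(path, inventor, invention):
    """Replace the concrete entities in a path with slot labels, so that
    Edison and Dyson sentences contribute to the same pattern."""
    return tuple(
        "INVENTOR" if tok == inventor else "INVENTION" if tok == invention else tok
        for tok in path
    )

def mine_constraints(examples, min_count=2):
    """Keep the generalised paths seen at least min_count times."""
    counts = Counter(generalise(p, inv, obj) for p, inv, obj in examples)
    return {pattern for pattern, c in counts.items() if c >= min_count}

def coverage(examples, constraints):
    """Fraction of sentences whose generalised path matches a constraint."""
    hits = sum(generalise(p, inv, obj) in constraints for p, inv, obj in examples)
    return hits / len(examples)

def oracle_subset(examples, checked_ok):
    """Oracle expt: keep only sentences manually judged correctly constrained."""
    return [ex for ex, ok in zip(examples, checked_ok) if ok]

# Hypothetical (path, inventor, invention) triples.
examples = [
    (("Edison", "invented", "phonograph"), "Edison", "phonograph"),
    (("Bell", "invented", "telephone"), "Bell", "telephone"),
    (("Dyson", "pioneered", "vacuum cleaner"), "Dyson", "vacuum cleaner"),
    (("Edison", "pioneered", "phonograph"), "Edison", "phonograph"),
    (("Bell", "demonstrated", "telephone"), "Bell", "telephone"),
]
constraints = mine_constraints(examples)
print(sorted(constraints))
print(f"{coverage(examples, constraints):.0%}")  # 80%
```

With these toy examples, INVENTOR invented INVENTION and INVENTOR pioneered INVENTION each occur twice and are kept, while the single "demonstrated" path falls below the threshold.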