JHU 2009

Large-Scale Syntactic Processing: Parsing the Web

Scalable syntactic processing will underpin the sophisticated language technology needed for next-generation information access. Companies are already using NLP tools to create web-scale question answering and semantic search engines. Massive amounts of parsed web data will also allow the automatic creation of semantic knowledge resources on an unprecedented scale. The web is a challenging arena for syntactic parsing because of its scale and its variety of styles, genres, and domains. Our proposal is to scale and adapt an existing wide-coverage parser to the web; evaluate and run this parser on Wikipedia, a large and semi-structured text collection; use the parsed wiki data for an innovative form of bootstrapping to make the parser both more efficient and more accurate; and finally use the parsed web data for a variety of NLP semantic tasks, including a novel combination of distributional and compositional semantics to improve performance on tasks which require fine-grained syntax/semantics integration.

The focus of the proposal will be the C&C parser, a state-of-the-art statistical parser based on Combinatory Categorial Grammar (CCG), a formalism which originated in the syntactic theory literature. A strength of the parser is that it is theoretically well-motivated at all levels, from the grammar formalism, which enables the parser to produce linguistically sophisticated output representing the underlying meaning of a sentence, to the machine learning techniques which underpin its robustness and accuracy. The parser has been evaluated on a number of standard test sets, achieving state-of-the-art accuracy. It has also recently been adapted successfully to the biomedical domain. The parser is surprisingly efficient given its detailed output, processing tens of sentences per second. For web-scale text processing, we aim to make the parser an order of magnitude faster still. The C&C parser is one of only very few parsers currently available which have the potential to produce detailed, accurate analyses at the scale we are considering.

Sydney Clusters

See the wiki page Sydney Clusters.

Team

Project Leaders

Graduate Students

Undergraduate Students

Tasks

Attachments

  • intro_slides.pdf (442.6 kB) - Stephen Clark's introductory presentation, added by james on 06/23/09 04:57:39.
  • ugrad-tim.pdf (332.5 kB) - Tim's slides for the ugrad meeting on 2/7/2009, added by tim on 07/03/09 03:31:57.
  • ugrad-jonathan.pdf (204.8 kB) - Jonathan's presentation at the third ugrad lunch during the JHU workshop, added by jonathan on 07/15/09 06:04:15.
  • july1.pdf (214.7 kB) - BE tagging, added by yue on 07/17/09 00:03:08.
  • jhu-undergrad-jessi.pdf (488.8 kB) - Jessi's slides from undergrad meeting 7/15/09, added by jessi on 07/17/09 01:33:47.
  • july15.pdf (159.8 kB) - added by yue on 07/17/09 02:39:15.