How to tokenise with NLTK

  • download (http://nltk.googlecode.com/files/nltk-2.0b3.zip)
  • install (run from the unzipped directory), then download the Punkt sentence models:
    python setup.py install
    python -m nltk.downloader punkt
    
  • python code, something like:
    import nltk

    def tokenise(data):
        # Punkt splits the text into sentences; word_tokenize splits each into words.
        sentbreaker = nltk.data.load('tokenizers/punkt/english.pickle').tokenize
        for sentence in sentbreaker(data):
            yield nltk.word_tokenize(sentence)
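
To try the same pipeline with other tokenisers, or without the Punkt data installed, the sentence and word splitters can be passed in as arguments. This is only a minimal sketch: `tokenise_with`, `sent_split` and `word_split` are hypothetical names, not part of NLTK, and the stand-in splitters at the bottom are crude placeholders for illustration. When no splitter is given it falls back to the NLTK calls shown above, which assumes the Punkt data from the install step is available.

```python
def tokenise_with(data, sent_split=None, word_split=None):
    """Yield one list of word tokens per sentence in `data`."""
    if sent_split is None or word_split is None:
        # Imported lazily so that stand-in splitters work without NLTK installed.
        import nltk
        if sent_split is None:
            sent_split = nltk.data.load('tokenizers/punkt/english.pickle').tokenize
        if word_split is None:
            word_split = nltk.word_tokenize
    for sentence in sent_split(data):
        yield word_split(sentence)


# Stand-in splitters (assumptions for illustration only):
# sentences are split on ". " and words on whitespace.
sentences = list(tokenise_with("a b. c d",
                               sent_split=lambda text: text.split(". "),
                               word_split=str.split))
```

Called with only the text, as in `list(tokenise_with(text))`, it uses the Punkt sentence breaker and `nltk.word_tokenize`.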