How to tokenise with NLTK
- download (http://nltk.googlecode.com/files/nltk-2.0b3.zip)
- install:
    python setup.py install
    python -m nltk.downloader punkt
- python code something like:
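    import nltk
    # punkt sentence splitter (the model fetched by nltk.downloader)
    sentbreaker = nltk.data.load('tokenizers/punkt/english.pickle').tokenize

    def tokenise(data):  # data: a string of English text
        for sentence in sentbreaker(data):
            yield nltk.word_tokenize(sentence)  # one list of tokens per sentence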
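
To see the shape of the result without NLTK or the punkt data installed, here is a rough standard-library sketch of the same two-stage pipeline (split into sentences, then split each sentence into words). The helpers `simple_sent_split` and `simple_word_tokenize` are hypothetical regex stand-ins, not NLTK functions, and they are much cruder than punkt (for example, they would wrongly split after an abbreviation such as "Dr.").

```python
import re

def simple_sent_split(text):
    # ASSUMPTION: naive stand-in for the punkt sentence splitter.
    # Splits on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def simple_word_tokenize(sentence):
    # ASSUMPTION: naive stand-in for nltk.word_tokenize.
    # Returns runs of word characters, and each punctuation mark on its own.
    return re.findall(r"\w+|[^\w\s]", sentence)

def tokenise(text, sent_split=simple_sent_split, word_tokenize=simple_word_tokenize):
    # Yield one list of word tokens per sentence.
    for sentence in sent_split(text):
        yield word_tokenize(sentence)

if __name__ == "__main__":
    for tokens in tokenise("Hello world. How are you?"):
        print(tokens)
```

With NLTK installed, the real functions can be passed in instead: `tokenise(text, sent_split=sentbreaker, word_tokenize=nltk.word_tokenize)`.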