enlp.processing.stdtools.tokenise

enlp.processing.stdtools.tokenise(model, text)

Return list of tokens for a piece of text.

A token is a string of contiguous characters between two spaces, or between a space and a punctuation mark. A token can also be an integer, a real number, or a number containing a colon (a time, for example 2:00). Every other symbol is a token in its own right, except apostrophes and quotation marks inside a word (with no surrounding space), which in many cases indicate acronyms or citations.
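
The rules above can be approximated with a small regular expression. This is only an illustrative sketch, not the implementation used by tokenise, which delegates to the spaCy language model passed in as model. The names TOKEN_PATTERN and sketch_tokenise are hypothetical, and treating only "." and ":" as separators inside numbers is an assumption.

```python
import re

# Hypothetical pattern approximating the token definition above.
# It is NOT the tokeniser used by enlp or spaCy.
TOKEN_PATTERN = re.compile(
    r"""
    \d+(?:[.:]\d+)*         # integers, reals (3.5) and colon numbers (2:00)
    | \w+(?:['"’]\w+)*      # words, keeping apostrophes/quotes inside the word
    | [^\w\s]               # any other non-space symbol is its own token
    """,
    re.VERBOSE,
)


def sketch_tokenise(text):
    """Return the tokens of text, in order, using the regex sketch above."""
    return TOKEN_PATTERN.findall(text)


print(sketch_tokenise("Meet at 2:00, it's 3.5 km."))
```

Because the pattern is rule-based, it will disagree with a trained spaCy model on harder cases such as abbreviations or tokens that mix digits and letters.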

Parameters

model (spacy.lang): spaCy language model
text (str): text string to tokenise

Returns

tokens (list): List of tokens, in the order they appear in the text.

Examples

>>> import spacy
>>> from enlp.processing.stdtools import tokenise
>>> lang_mod = spacy.load('nb_dep_ud_sm')
>>> text = 'Den raske brune reven hoppet over den late hunden.'
>>> print(tokenise(lang_mod, text))
['Den', 'raske', 'brune', 'reven', 'hoppet', 'over', 'den', 'late', 'hunden', '.']