enlp.processing.stdtools.tokenise

enlp.processing.stdtools.tokenise(model, text)

Return list of tokens for a piece of text.

A token is a run of contiguous characters delimited by spaces, or by a space and a punctuation mark. Numbers are also tokens, whether integers, reals, or colon-separated values such as the time 2:00. Every other symbol is a token in its own right, except apostrophes and quotation marks that occur inside a word (with no adjacent space), which in many cases mark acronyms or citations.
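The rules above can be approximated with a short regular expression. This is only a rough sketch for illustration and is not how tokenise itself works, since the actual function delegates to the spaCy language model. The name sketch_tokenise is hypothetical, and accepting a comma as a decimal separator is an assumption not stated in the description.

```python
import re

# Illustrative approximation of the token rules; NOT the library's implementation.
TOKEN_RE = re.compile(
    r"""
    \d+(?:[.,:]\d+)*      # integers, reals, and colon-separated numbers such as 2:00
                          # (treating ',' as a decimal separator is an assumption)
    | \w+(?:['"]\w+)*     # words, keeping apostrophes or quotes that sit inside a word
    | [^\w\s]             # any other non-space symbol is a token by itself
    """,
    re.VERBOSE,
)


def sketch_tokenise(text):
    """Hypothetical helper: split text into tokens using the rules sketched above."""
    return TOKEN_RE.findall(text)


print(sketch_tokenise("It's 2:00, and pi is 3.14."))
# ["It's", '2:00', ',', 'and', 'pi', 'is', '3.14', '.']
```

Alternatives are tried left to right, so a number is matched before the generic word pattern, and a trailing comma or full stop is only absorbed into a number when a digit follows it.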

Parameters
model : spacy.lang

spaCy language model

text : str

Text string to tokenise

Returns
tokens : list

List of tokens, in the order in which they appear in the text.

Examples

>>> import spacy
>>> from enlp.processing.stdtools import tokenise
>>> lang_mod = spacy.load('nb_dep_ud_sm')
>>> text = 'Den raske brune reven hoppet over den late hunden.'
>>> print(tokenise(lang_mod, text))
['Den', 'raske', 'brune', 'reven', 'hoppet', 'over', 'den', 'late', 'hunden', '.']