enlp.processing.stdtools.tokenise
enlp.processing.stdtools.tokenise(model, text)
Return a list of tokens for a piece of text.
A token is a string of contiguous characters between two spaces, or between a space and a punctuation mark. An integer, a real number, or a number containing a colon (for example a time such as 2:00) is also a single token. Every other symbol is a token in its own right, except apostrophes and quotation marks inside a word (with no surrounding space), which in many cases mark acronyms or citations.
- Parameters
- model
spacy.lang
spaCy language model
- text
str
Text string to tokenise
- Returns
- tokens
list
List of tokens, ordered as they appear in the text.
Examples
>>> import spacy
>>> from enlp.processing.stdtools import tokenise
>>> lang_mod = spacy.load('nb_dep_ud_sm')
>>> text = 'Den raske brune reven hoppet over den late hunden.'
>>> print(tokenise(lang_mod, text))
['Den', 'raske', 'brune', 'reven', 'hoppet', 'over', 'den', 'late', 'hunden', '.']
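The token rules described above can be approximated without a language model. The following is a minimal sketch only, not the implementation behind tokenise, which relies on the spaCy model passed in. The name approximate_tokenise and the regular expression are hypothetical, and treating a comma as a decimal separator is an assumption added for illustration.

```python
import re

# Hypothetical pattern approximating the documented token rules.
# Alternatives are tried left to right, so numbers are matched first.
TOKEN_PATTERN = re.compile(
    r"""
    \d+(?:[.,:]\d+)*       # integers, reals, and colon numbers such as 2:00
    | \w+(?:['’"]\w+)*     # words, keeping in-word apostrophes or quotation marks
    | [^\w\s]              # any other non-space symbol is a token by itself
    """,
    re.VERBOSE,
)


def approximate_tokenise(text):
    """Return tokens in the order they appear in text (rule-based sketch)."""
    return TOKEN_PATTERN.findall(text)


print(approximate_tokenise('Den raske brune reven hoppet over den late hunden.'))
# ['Den', 'raske', 'brune', 'reven', 'hoppet', 'over', 'den', 'late', 'hunden', '.']
```

A rule-based pattern like this has no notion of language-specific exceptions such as abbreviations, which is one reason a spaCy language model is passed to tokenise instead.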