NLP Pipeline

The following example takes a raw extract of Norwegian text, removes stopwords, and stems the remaining words before analysing the distribution of the words in the text.

import matplotlib.pyplot as plt
plt.close('all')  # important on Read the Docs: close open figures to avoid crashing due to memory use

from enlp.processing.stdtools import get_stopwords, tokenise
from enlp.pipeline import NLPPipeline
from enlp.visualisation.freq_distribution import wordcloud_plot

import spacy

Load spaCy's Norwegian language model and the example text.

langmodel = spacy.load('nb_dep_ud_sm')
with open("example_data/no_den_stygge_andungen.txt", "r") as file:
    text = file.read()
text = text.replace('\n', ' ')

Get a list of stopwords to be removed from the text.

# Get stopwords
all_stopwords, stopwords_nb, stopwords_en = get_stopwords()
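Stopword removal itself is straightforward once the text is tokenised; here is a minimal pure-Python sketch of the idea (the token list and stopword set are illustrative samples, not enlp's actual data):

```python
# Hypothetical sample tokens and Norwegian stopwords, for illustration only
tokens = ["det", "var", "en", "stygg", "andunge", "ute", "landet"]
stopwords = {"det", "var", "en", "ute"}

# Keep only tokens that are not stopwords (case-insensitive)
filtered = [t for t in tokens if t.lower() not in stopwords]
print(filtered)  # → ['stygg', 'andunge', 'landet']
```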

Use the NLPPipeline class to create a processing workflow. The pipeline will:

- remove punctuation
- remove stopwords
- stem the remaining words

# Initialise object
processed_text = NLPPipeline(langmodel, text)

# Run processing as a pipeline
processed_text.rm_punctuation().rm_stopwords(stopwords=all_stopwords).nltk_stem_no()

Out:

<enlp.pipeline.NLPPipeline object at 0x7fb82899f350>
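The chained call above works because each processing method returns the pipeline object itself, which is also why the cell's value is the object's repr rather than the text. A minimal sketch of that fluent pattern, using simplified stand-in methods rather than enlp's real implementations:

```python
import string

class MiniPipeline:
    """Toy pipeline: each step mutates self.text and returns self."""

    def __init__(self, text):
        self.text = text

    def rm_punctuation(self):
        # Strip all ASCII punctuation characters
        self.text = self.text.translate(str.maketrans('', '', string.punctuation))
        return self  # returning self is what enables chaining

    def rm_stopwords(self, stopwords):
        # Drop whitespace-delimited tokens found in the stopword set
        self.text = ' '.join(w for w in self.text.split()
                             if w.lower() not in stopwords)
        return self

p = MiniPipeline("Det var så deilig ute!")
p.rm_punctuation().rm_stopwords(stopwords={"det", "var", "ute"})
print(p.text)  # → 'så deilig'
```

Each method could equally return a new object instead of mutating in place; returning `self` keeps the chain cheap but means the original text is overwritten.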

Compare the first 80 characters of the original and processed text strings.

print('Original: ', text[:80], '...')
print('Processed: ', processed_text.text[:80], '...')

Out:

Original:  Den stygge andungen Det var så deilig ute på landet; det var sommer, kornet stod ...
Processed:  stygg andung deil ute somm korn stod gult havr grønn høyet reist stakk ned grønn ...

Wordcloud comparison of the most common words in the original and processed texts.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
wordcloud_plot(tokenise(langmodel,text), ax=ax1)
wordcloud_plot(processed_text.tokenise().tokens, ax=ax2)
plt.tight_layout()
[Figure: wordclouds of the original (left) and processed (right) texts]
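The wordclouds scale each word by how often it occurs; the underlying counts can be illustrated with `collections.Counter` (the token list below is a made-up sample, not the story's actual tokens):

```python
from collections import Counter

# Hypothetical processed tokens, for illustration
tokens = ["stygg", "andung", "grønn", "grønn", "somm", "grønn"]

# Counter maps each token to its frequency
freq = Counter(tokens)
print(freq.most_common(2))  # → [('grønn', 3), ('stygg', 1)]
```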

Total running time of the script: ( 0 minutes 8.942 seconds)

Gallery generated by Sphinx-Gallery