Topic Modelling
The following example shows how to generate a topic model from a corpus and then determine the topic of a new document.
NOTE: This example shows how to run the procedure; however, because the dataset is so small, the results are likely to be nonsensical.
import enlp.understanding.topics as tp
import enlp.processing.stdtools as stdt
import spacy
Load example text and get stopwords
with open("example_data/en_nlptexts.txt", "r") as file:
    text = file.read()
all_stopwords, stopwords_nb, stopwords_en = stdt.get_stopwords()
Preprocess the text. To let the documentation build, this example uses a very small corpus, so we split the single document into paragraphs to imitate multi-document input. Because the text is so small, we also remove stopwords and punctuation.
# Split text into paragraphs to imitate documents
docs = text.split('\n\n')
# Remove \n and replace with space
docs = [d.replace('\n',' ') for d in docs]
# Because example text is small, remove stopwords and punctuation
en = spacy.load('en_core_web_md')
# Reuse the English stopwords already loaded by stdt.get_stopwords() above
docs = [stdt.rm_punctuation(en, stdt.rm_stopwords(en, d, stopwords_en)) for d in docs]
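The preprocessing steps above can be approximated without enlp or spaCy. The following is a minimal standard-library sketch, not the library's implementation: `STOPWORDS`, `split_into_docs`, and `clean` are hypothetical names introduced for illustration, and the tiny stopword set stands in for the list returned by `stdt.get_stopwords()`.

```python
import string

# Hypothetical stand-in for the English stopword list from stdt.get_stopwords()
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def split_into_docs(text):
    """Split raw text on blank lines and flatten remaining newlines into spaces."""
    paragraphs = text.split("\n\n")
    return [p.replace("\n", " ").strip() for p in paragraphs if p.strip()]


def clean(doc, stopwords=STOPWORDS):
    """Drop ASCII punctuation characters and stopwords (case-insensitive match)."""
    no_punct = doc.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in no_punct.split() if w.lower() not in stopwords)


sample = "The parser reads text.\nIt is fast.\n\nSentiment analysis of reviews."
docs = [clean(d) for d in split_into_docs(sample)]
print(docs)
```

Unlike the spaCy-based helpers, this sketch splits on whitespace only, so it will behave differently on contractions and non-ASCII punctuation.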
Create topic model & visualise keywords per topic
tp_model, dictionary = tp.bow_topic_modelling(docs,no_topics=3)
# print topics
tp.print_topic_words(tp_model)
Out:
Topic: 0
Words: 0.008*"text" + 0.007*"systems" + 0.007*"question" + 0.006*"language" + 0.006*"system" + 0.005*"knowledge" + 0.005*"machine" + 0.005*"natural" + 0.005*"translation" + 0.004*"English"
Topic: 1
Words: 0.011*"language" + 0.009*"sentiment" + 0.006*"system" + 0.006*"text" + 0.005*"human" + 0.005*"data" + 0.005*"." + 0.005*"natural" + 0.005*"word" + 0.004*"based"
Topic: 2
Words: 0.012*"word" + 0.011*"words" + 0.008*"sense" + 0.008*"systems" + 0.007*"based" + 0.007*"text" + 0.006*"sentiment" + 0.006*"language" + 0.006*"learning" + 0.005*"analysis"
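Topic 1 above lists "." among its keywords, which suggests that some punctuation survived preprocessing. One way to check for this is to parse the printed keyword strings and flag tokens that contain no letters or digits. This is a rough sketch that assumes the `weight*"word" + ...` layout shown in the output; `parse_topic` is a hypothetical helper, not part of enlp.

```python
import re


def parse_topic(words_line):
    """Turn a 'weight*"word" + ...' string into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]*)"', words_line)
    return [(word, float(weight)) for weight, word in pairs]


# Shortened copy of the Topic 1 line from the output above
line = '0.011*"language" + 0.009*"sentiment" + 0.006*"system" + 0.005*"."'
parsed = parse_topic(line)

# Tokens with no alphanumeric character are probably leftover punctuation
suspicious = [word for word, _ in parsed if not any(ch.isalnum() for ch in word)]
print(parsed)
print(suspicious)
```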
Determine the topic of a new document
Out:
   topic_no     score
0         0  0.333333
1         1  0.333333
2         2  0.333333
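All three scores above equal 1/3, so no single topic stands out for the new document, which fits the earlier warning about the small dataset. The sketch below shows one way to detect this case before reporting a "best" topic. It assumes the scores are per-topic probabilities; `dominant_topic` and its tolerance are hypothetical choices for illustration.

```python
import math


def dominant_topic(scores, tol=1e-4):
    """Return the top topic number, or None if all scores are (nearly) equal."""
    uniform = 1.0 / len(scores)
    if all(math.isclose(s, uniform, abs_tol=tol) for _, s in scores):
        return None  # uniform distribution: the model expresses no preference
    return max(scores, key=lambda row: row[1])[0]


# (topic_no, score) rows copied from the output above
rows = [(0, 0.333333), (1, 0.333333), (2, 0.333333)]
result = dominant_topic(rows)
print(result)
```

With a larger corpus the scores would normally differ, and the function would return the topic number with the highest score.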
Total running time of the script: ( 0 minutes 13.257 seconds)