Text Mining on a CBC News Article with Natural Language Processing (NLP) - Automatically summarize the given text using spaCy & Python.
- spaCy
- spaCy Model
python -m spacy download en_core_web_sm
- Convert the input text to a list of sentences, then count the number of sentences in the text.
- Calculate the frequency of words in each sentence:
- The output is a dictionary where each key is a sentence and each value is a dictionary mapping the words in that sentence to their counts.
- Calculate the term frequency (TF) for each word in a sentence:
TF(word) = (Number of times term “word” appears in a sentence) / (Total number of terms in the sentence)
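A minimal sketch of the word-frequency and term-frequency steps, written with the standard library only. The function names `word_frequencies` and `term_frequencies` are illustrative, and the lowercase whitespace split is an assumption standing in for spaCy tokenization so the example runs without a model installed.

```python
from collections import Counter

def word_frequencies(sentences):
    """Map each sentence to a Counter of its words (lowercased)."""
    # Assumption: a plain whitespace split stands in for spaCy tokens.
    return {s: Counter(s.lower().split()) for s in sentences}

def term_frequencies(freqs):
    """TF(word) = count of word in the sentence / total words in the sentence."""
    tf = {}
    for sentence, counts in freqs.items():
        total = sum(counts.values())
        tf[sentence] = {w: c / total for w, c in counts.items()}
    return tf

sentences = ["the cat sat on the mat", "the dog barked"]
tf = term_frequencies(word_frequencies(sentences))
print(tf["the cat sat on the mat"]["the"])  # "the" is 2 of the 6 words
```

Both functions return a dictionary keyed by sentence, matching the structure described in the steps above.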
- Create a matrix termFrequency:
- The termFrequency matrix is a dictionary where each key is a sentence and each value is a dictionary mapping each word in that sentence to its term frequency.
- For each word compute how many sentences contain that word.
- Calculate IDF for each word in a sentence.
IDF(word) = log_e(Total number of sentences / Number of sentences containing the term “word”)
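The sentence-count and IDF steps could look roughly like the sketch below. It assumes the per-sentence word-frequency dictionary described earlier as input; `sentence_counts` and `inverse_document_frequencies` are hypothetical names chosen for this example.

```python
import math

def sentence_counts(freqs):
    """For each word, count how many sentences contain it."""
    counts = {}
    for words in freqs.values():
        for w in words:
            counts[w] = counts.get(w, 0) + 1
    return counts

def inverse_document_frequencies(freqs):
    """IDF(word) = ln(total sentences / sentences containing the word)."""
    n = len(freqs)
    df = sentence_counts(freqs)
    return {
        sentence: {w: math.log(n / df[w]) for w in words}
        for sentence, words in freqs.items()
    }

# Hand-built frequency dictionary: sentence -> {word: count}
freqs = {
    "the cat sat": {"the": 1, "cat": 1, "sat": 1},
    "the dog barked": {"the": 1, "dog": 1, "barked": 1},
}
idf = inverse_document_frequencies(freqs)
```

A word that appears in every sentence (here "the") gets an IDF of 0, so it contributes nothing to the later TF-IDF score.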
- Compute the TF-IDF for each word in each sentence.
- Use the TF-IDF scores from the previous step to assign a weight to each sentence.
- Threshold: compute the average sentence weight and use it as the cutoff.
- Generate the summary: select a sentence for the summary if its weight exceeds the threshold.
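The scoring and selection steps can be sketched end to end as follows. Two assumptions are made here that the steps above do not spell out: a sentence's weight is taken to be the average TF-IDF of its distinct words, and tokens come from a lowercase whitespace split rather than from spaCy. The function name `summarize` is illustrative.

```python
import math
from collections import Counter

def summarize(sentences):
    """Return the sentences whose weight exceeds the average sentence weight."""
    n = len(sentences)
    # Assumption: whitespace tokens stand in for spaCy tokens.
    tokenized = [s.lower().split() for s in sentences]
    # Number of sentences containing each word.
    df = Counter(w for words in tokenized for w in set(words))

    weights = []
    for words in tokenized:
        counts = Counter(words)
        # TF-IDF per distinct word: (count / sentence length) * ln(n / df)
        scores = [(c / len(words)) * math.log(n / df[w]) for w, c in counts.items()]
        # Assumption: sentence weight = mean TF-IDF of its distinct words.
        weights.append(sum(scores) / len(scores) if scores else 0.0)

    threshold = sum(weights) / n if n else 0.0
    return [s for s, wt in zip(sentences, weights) if wt > threshold]

sentences = ["the cat sat on the mat", "the dog barked", "the the the"]
print(summarize(sentences))
```

In this small example the third sentence contains only a word shared by every sentence, so its weight is 0 and it falls below the threshold. Multiplying the threshold by a factor greater than 1 would be one way to get a shorter summary.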