TF-IDF Python Implementation: Unigram, Bigram, and Trigram

Saurabh Singh
Dec 26, 2020 · 4 min read


This is my first post, so any suggestions are very welcome.

Photo by Kelly Sikkema on Unsplash

Let's get started. So the first question is: what is TF-IDF? ❓

What is TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure that evaluates how important a word is to a document in a collection of documents, and it is a very popular technique for finding the important words in a collection. Other techniques that do the same thing are RAKE (Rapid Automatic Keyword Extraction), YAKE (Yet Another Keyword Extractor), and KeyBERT. TF-IDF is a trendy topic for data scientists interested in Natural Language Processing. So what is NLP? NLP, in simple words, is the study of human language. TF-IDF is calculated by multiplying two metrics: term frequency and inverse document frequency.

Term frequency measures how often a term (word) occurs in a document. It is calculated by dividing the number of times a word appears in a document by the total number of words in that document.

TF(t, d) = (number of times term t appears in document d) / (total number of words in d)

Inverse document frequency tells us how common or rare a word is across the entire collection of documents. It is calculated by dividing the total number of documents by the number of documents that contain the word, and then taking the logarithm.

If a word is present in many documents, it is very common and its IDF approaches 0. If a word is rare, its IDF is large. For example, with 10 documents and a word that appears in only one of them, IDF = log(10/1) ≈ 2.3.

IDF(t) = log(total number of documents / number of documents containing t)

TF-IDF(t, d) = TF(t, d) * IDF(t)
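
To make the formulas concrete, here is a minimal pure-Python sketch that computes TF, IDF, and TF-IDF by hand on a toy two-document corpus. This illustrates the formulas above, not scikit-learn's exact implementation (scikit-learn uses a smoothed IDF, so its numbers will differ slightly):

import math

# toy corpus of two tiny "documents" (assumed already lowercased and cleaned)
docs = [
    "linear regression predicts continuous values",
    "logistic regression predicts class probabilities",
]

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)       # term count / total words in doc

def idf(term, docs):
    n_containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / n_containing)   # log(N / number of docs containing term)

for term in ["regression", "linear"]:
    for i, d in enumerate(docs):
        print(term, f"doc_{i}", round(tf(term, d) * idf(term, docs), 4))
# "regression" appears in both documents, so IDF = log(2/2) = 0 and its TF-IDF is 0;
# "linear" appears only in the first document, so it gets a positive score there.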

Implementation of TF-IDF in Python


Importing some libraries

from sklearn.feature_extraction.text import TfidfVectorizer    # for TF-IDF implementation
import re                          # for text cleaning
import pandas as pd                # to display the results as a DataFrame
from nltk.corpus import stopwords  # to remove stopwords (run nltk.download('stopwords') once if the list is missing)

To implement TF-IDF, I am using two documents containing the definitions of Linear Regression and Logistic Regression.

document_1 = "Linear Regression is a supervised machine machine learning algorithm where the predicted output is continuous and has a constant slope. It's used to predict values within a continuous range, rather than trying to classify them into categories"

document_2 = "Logistic Regression is a machine Machine Learning algorithm which is used for the classification problems, it is a predictive analysis algorithm and based on the concept of probability. The hypothesis of logistic regression tends it to limit the cost function between 0 and 1"

Text cleaning is a very important step in NLP. In text cleaning, we remove digits, punctuation, and stopwords, as they do not add much meaning and they slow down our models.

def text_preprocessing(text):
    text = text.lower()                              # lowercase the text
    text = re.sub(r"[^\w\s]", "", text)              # remove punctuation
    text = re.sub(r"[0-9]", "", text)                # remove digits
    s_word = stopwords.words('english')
    text = [tex for tex in text.split() if tex not in s_word]  # remove stopwords
    text = " ".join(text)
    while re.search(r'\b(.+)(\s+\1\b)+', text):      # collapse repeated words, e.g. "machine machine" -> "machine"
        text = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', text)
    return text
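
With the function defined, we can clean both documents before vectorizing them. A small usage sketch (the printed output below is indicative): note how the duplicated "machine machine" in document_1 and the lowercased "machine Machine" in document_2 collapse to a single "machine":

document_1 = text_preprocessing(document_1)   # cleaned strings are what we pass to the vectorizer next
document_2 = text_preprocessing(document_2)
print(document_1)   # e.g. "linear regression supervised machine learning algorithm predicted output continuous ..."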

Implement TF-IDF

tf_idf_vector = TfidfVectorizer(ngram_range=(2, 2), stop_words='english')    # create an instance of TfidfVectorizer
tf_idf_transform = tf_idf_vector.fit_transform([document_1, document_2])     # fit on the corpus and transform it
features_names = tf_idf_vector.get_feature_names()    # feature names from the collection of documents (use get_feature_names_out() on scikit-learn >= 1.0)
dense_list = tf_idf_transform.todense().tolist()      # TF-IDF values as a dense list
pd.set_option('display.max_columns', None)            # show all the columns
df = pd.DataFrame(dense_list, columns=features_names, index=['document_1', 'document_2'])
df
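
To read the result more easily, we can sort each row of the DataFrame and look at the highest-scoring bigrams per document (a small inspection sketch):

for doc in df.index:
    print(doc)
    print(df.loc[doc].sort_values(ascending=False).head(5))   # top 5 bigrams for this document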

Changing ngram_range changes the number of words in each feature (see the sketch after this list):

For ngram_range = (1, 1), only one-word features will be returned; these are called unigrams.

For (1, 2), both one-word and two-word features will be returned.

For (2, 2), only two-word features will be returned; these are called bigrams.

For (3, 3), only three-word features will be returned; these are called trigrams.
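
Here is a quick sketch that loops over these settings and prints the first few feature names for each, so the effect of ngram_range is visible (get_feature_names_out requires scikit-learn >= 1.0; on older versions use get_feature_names):

for ngram_range in [(1, 1), (1, 2), (2, 2), (3, 3)]:
    vec = TfidfVectorizer(ngram_range=ngram_range, stop_words='english')
    vec.fit([document_1, document_2])
    print(ngram_range, list(vec.get_feature_names_out())[:5])   # first 5 features per setting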

Definition from the official scikit-learn documentation:

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.

TF-IDF values of the features

TF-IDF Result

Thanks,

Best Regards,
Saurabh Singh
isaurabh2709@gmail.com

https://github.com/saurabhy27/TF-IDF/blob/main/TF-IDF.ipynb
