Multiclass Email Classification: A Real-Time Project from Scratch
This project is a part of my work as a data science intern at Yatra.com
Nearly 50k emails are sent to Yatra's support team every day by customers with various queries.
The support team manually resolves these queries and labels each one with a category such as Cancellation, Refund, Booking Information, E-ticket/Voucher, Promotions, Website Error, or Amendments.
This project removes the need for manual labeling by using a multi-class text classifier to assign each email to a suitable category accurately.
BeautifulSoup and html.unescape are used to clean the raw email body. Basic preprocessing is done with NLTK (tokenization), gensim (simple_preprocess and stop-word removal), and spaCy (lemmatization).
A TF-IDF vectorizer is used to vectorize the text data, chi2 feature selection keeps the 1,000 most useful features, and classification algorithms are then applied. The model is deployed using Flask.
We start here, ready to dive into text classification :)
So, the story starts at the very beginning: data cleaning. Cleaning text data can be tedious and time-consuming, and in my case the raw email body was coupled with tons of garbage text.
email_data.head()
Here’s a sample Email Body for you:
print(email_data['Email_Body'][5])
'<div dir="auto">I booked a flight ticket from Trivandrum to Mumbai<div dir="auto"><br></div><div dir="auto">Amount debited from my account.</div><div dir="auto"><br></div><div dir="auto">But ticket not received yet.</div><div dir="auto"><br></div><div dir="auto">Please check</div></div> '
Quite messy, right?
Next, I'll remove the rows with a null email body or label, because those rows aren't going to help us in any way.
# Removing rows with null mail content and request type
email_data = email_data[email_data['Email_Body'].notna()]
email_data = email_data[email_data['Request_Type'].notna()]
email_data = email_data.reset_index(drop=True)
Below is the function that removes HTML, links, and other garbage text from the email body and gives us a clean column named 'semi_clean_body'.
# Function to clean the email body
# Removes the garbage markup and extracts the required string

# Libraries used to clean the string
from bs4 import BeautifulSoup
from html import unescape

# Adding a column for clean text to the dataframe
def clean_html_string(df):
    df['semi_clean_body'] = ''
    for i in range(len(df)):
        toxic_string = df['Email_Body'][i]
        html_str = unescape(toxic_string)              # decode HTML entities
        soup = BeautifulSoup(html_str, 'html.parser')
        clean_str = soup.get_text()                    # strip the markup, keep the text
        df['semi_clean_body'][i] = clean_str
    return df

email_data = clean_html_string(email_data)
Output:
print(email_data['semi_clean_body'][5])
I booked a flight from Trivandrum to Mumbai Amount debited from my account. But ticket not received yet. Please check
After cleaning the email body, pre-processing follows these steps:
1. Tokenization of the string using NLTK (Natural Language Toolkit).
Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
2. Gensim is used for basic pre-processing of the string (removing special characters and numbers, stripping leading and trailing spaces, converting all characters to lower case, etc.). Gensim is also used to remove stop words.
3. spaCy is used for lemmatization.
Lemmatization usually refers to reducing the inflectional forms (and sometimes derivationally related forms) of a word to a common base form, properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and return the base or dictionary form of the word, known as the lemma. If confronted with the token saw, lemmatization would attempt to return either see or saw, depending on whether the token was used as a verb or a noun.
# Convert email body to list; data becomes a list of strings
data = email_data.semi_clean_body.values.tolist()

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.tokenize.treebank import TreebankWordDetokenizer
import gensim
import spacy

# Initialize spacy 'en' model
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Stop-word list (assumed to be gensim's built-in stop words, per the description above)
stop_words = STOPWORDS

# Tokenize: break down each sentence into a list of words
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True removes punctuation

# Remove stop words
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

# Lemmatize, keeping only the allowed parts of speech
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

# Tokenizing
data_words = list(sent_to_words(data))

# Removing stop words
data_words_nostops = remove_stopwords(data_words)

# Lemmatization keeping only noun, adj, verb, adv
data_lemmatized = lemmatization(data_words_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

# Detokenizing the data and adding a column for the cleaned text
email_data['cleaned_body'] = ''
for i in range(len(data_lemmatized)):
    email_data['cleaned_body'][i] = TreebankWordDetokenizer().detokenize(data_lemmatized[i])
Cleaned but unprocessed text:
email_data['semi_clean_body'][0]:
'I have made a flight booking on Yatra. My PNR No are S6ZF2Y . I had booked a return flight for both the PNR. The outbound flight i.e. "BOM - TRV" had been cancelled by the airways and the full amount has been credited back to my account. Now the inbound flight i.e. "TRV - BOM" has also been cancelled by the airways and the full amount has been processed to the yatra account on 24th Feb 2020. So I want you to refund me the amount ASAP. When would I get the refund back? Waiting for your reply'
Cleaned processed text:
email_data['cleaned_body'][0]:'flight book book flight outbound flight cancel credited account inbound flight cancel process account refund refund waiting reply'
Analyzing the different label classes:
The 5 label classes, sorted by frequency, are as follows (a quick way to check this distribution is shown after the list):
- Refund — the customer is asking for a flight ticket refund
- Others — irrelevant or miscellaneous queries from customers
- Amendments — changes to an already booked flight ticket
- Cancellation — the customer is asking to cancel a flight ticket
- Website Error — the customer is facing difficulty while booking through the website
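For reference, the class distribution can be checked directly on the dataframe (assuming, as in the earlier snippets, that the labels live in the Request_Type column):

# Frequency of each label class
print(email_data['Request_Type'].value_counts())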
Modeling
Steps Involved
Form a pipeline that performs the following steps:
1. Formation of TF-IDF vectors
What is TF-IDF and how does it work?
Machine learning algorithms cannot work with the raw text directly. Rather, the text must be converted into vectors of numbers.
TF-IDF is a strategy to score the relative importance of words.
Term frequency (TF): the number of times a word appears in a document divided by the total number of words in that document.
Inverse document frequency (IDF): the log of the total number of documents divided by the number of documents that contain the word w. Inverse document frequency determines the weight of rare words across all documents in the corpus.
TF-IDF is simply TF multiplied by IDF.
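To make this concrete, for a word w in a document d from a corpus of N documents, the scores described above are:

TF(w, d) = (number of times w appears in d) / (total number of words in d)
IDF(w) = log(N / number of documents containing w)
TF-IDF(w, d) = TF(w, d) * IDF(w)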
2. Chi2 (chi-squared test) for best feature selection
3. Random forest for classification
Here I'm using only random forest, but I tried several other models, and I'd recommend you do the same: keep playing with different models and tune the hyperparameters of TF-IDF and the classification algorithms accordingly. I hope I was able to provide some insights into multi-class text classification.
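The pipeline code isn't shown inline in this post, so here is a minimal sketch of the modeling step described above. The column names come from the earlier snippets; the train/test split, the vectorizer settings, and the random forest parameters are illustrative assumptions, not the exact values used in the project.

# Minimal sketch of the modeling pipeline: TF-IDF, then chi2 feature selection, then random forest
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier

# Encode the label text as integer codes; the reverse mapping is used later when making predictions
y_codes, y_categories = pd.factorize(email_data['Request_Type'])
prediction_category = dict(enumerate(y_categories))

# 80/20 split is an assumption, not from the article
X_train, X_test, y_train, y_test = train_test_split(
    email_data['cleaned_body'], y_codes, test_size=0.2, random_state=42)

clf = Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True, min_df=5, ngram_range=(1, 2))),  # illustrative settings
    ('chi2', SelectKBest(chi2, k=1000)),                                           # keep the 1000 best features
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),             # illustrative settings
])

clf.fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))

Pickling this fitted pipeline (for example with pickle.dump) is what allows it to be loaded later for predictions on new emails.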
New Test Data Pre-processing and Making Predictions
Loading our previously trained classifier from a pickle file
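The original post shows this step in a notebook; a minimal sketch, assuming the fitted pipeline was saved with pickle (the file name is a placeholder):

import pickle

# Load the previously trained classifier/pipeline (file name is an assumption)
with open('rf_email_classifier.pkl', 'rb') as f:
    clf2 = pickle.load(f)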
Pre-process new email body before making predictions
In the variable data, I've provided the email body whose category I'd like to predict. The text is then cleaned, pre-processed, and detokenized exactly as during training.
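A minimal sketch of that step, reusing the cleaning and pre-processing functions defined earlier (the example email body is made up):

# New email body to classify (example text, not from the dataset)
data = ['<div dir="auto">Flight cancelled by airline, amount not refunded yet. Please process my refund.</div>']

# Clean the HTML, then tokenize, remove stop words, lemmatize, and detokenize
clean = [BeautifulSoup(unescape(s), 'html.parser').get_text() for s in data]
words = list(sent_to_words(clean))
words_nostops = remove_stopwords(words)
lemmatized = lemmatization(words_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
new_text = [TreebankWordDetokenizer().detokenize(doc) for doc in lemmatized]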
Making Predictions
Our classifier clf2 is a trained random forest classifier. The variable my_prediction is an array with a single element; its 0th element contains the predicted category as a number, and the category name corresponding to that number is looked up in the dictionary prediction_category.
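As a sketch, assuming clf2 is the pickled pipeline and prediction_category is the label mapping built during training:

# Predict the category code for the pre-processed email body
my_prediction = clf2.predict(new_text)

# Map the predicted code back to its category name
my_predictions = prediction_category[my_prediction[0]]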
Output:
print(my_predictions)
'Refunds'
Later, I deployed the model using Flask.
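The Flask code isn't included in the post; a minimal sketch of how the pickled pipeline might be served looks like this (the route, file name, and label mapping are assumptions, and the same pre-processing used during training should be applied to incoming text):

from flask import Flask, request, jsonify
import pickle

app = Flask(__name__)

# Load the trained pipeline (file name is an assumption)
with open('rf_email_classifier.pkl', 'rb') as f:
    clf2 = pickle.load(f)

# Illustrative label mapping; in practice this is saved alongside the model during training
prediction_category = {0: 'Refunds', 1: 'Others', 2: 'Amendments', 3: 'Cancellation', 4: 'Website Error'}

@app.route('/predict', methods=['POST'])
def predict():
    body = request.json['email_body']
    # NOTE: the same cleaning/pre-processing used during training should be applied to `body` here
    pred = clf2.predict([body])
    return jsonify({'category': prediction_category[int(pred[0])]})

if __name__ == '__main__':
    app.run()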
For the complete code, refer to this.
Also, the Jupyter notebook for data cleaning and pre-processing is here.
The notebook for modelling is here.
The notebook for new text pre-processing and making predictions is here.
Feel free to ask questions or share any suggestions you have. Thanks for reading :)