Recently there was an internal work competition to see who could derive the best classification prediction for incoming messages to a website. This post explains the problem we are trying to solve, gives some background on the dataset, and covers some natural language processing steps to prepare the dataset.
The webform messages in the training set carry one of three labels: legitimate (valid) messages, trash messages, and misfire messages. Each message in the training set was scored by multiple human reviewers over the course of nine months, so the goal of the competition is to automate that classification. Because the webform messages are free-form text entry, techniques from Natural Language Processing (NLP) were chosen for processing the data.
A Bit about the Dataset
The pertinent component of the dataset is the label, or dependent variable, which is always one of the words misfire, legit, or trash. There is no missing data for the label, and the label is always in English regardless of the language of the message.
The webform message comes in two columns, the subject and the description. There is no missing data for these independent variables; however, the webform messages themselves are not always in English. For this analysis the non-English messages were removed using a Python library called langid: messages that were not classified as English with a confidence of at least 70% were dropped. The subject and description were then combined into a new column called total_text.
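The preparation step described above can be sketched as follows. This is a minimal, self-contained illustration assuming a pandas DataFrame named dataset with subject and description columns; the english_confidence helper here is a hypothetical stand-in for the real langid check (the actual analysis used langid's classifier, not this ASCII heuristic):

```python
import pandas as pd

def english_confidence(text):
    # Hypothetical stand-in for langid: the real analysis classified each
    # message with langid and kept those scored as English with >= 70%
    # confidence. Here we crudely treat pure-ASCII text as "English".
    return 1.0 if text and all(ord(c) < 128 for c in text) else 0.0

dataset = pd.DataFrame({
    'subject': ['Password reset', 'Hola'],
    'description': ['I cannot log in.', 'No puedo iniciar sesión.'],
})

# Combine subject and description into the new total_text column ...
dataset['total_text'] = dataset['subject'] + ' ' + dataset['description']

# ... and keep only rows classified as English with >= 70% confidence.
dataset = dataset[dataset['total_text'].map(english_confidence) >= 0.70]
```

With the real langid library, the confidence check would come from its classifier rather than the toy helper above.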
The rest of the data are account or user identifiers, which are irrelevant to the classification and were dropped.
One of the main decisions in the analysis was how to tokenize the total_text field. I tried three different functions for processing the data in that field.
```python
from textblob import TextBlob  # can do POS tagging with this library

def split_into_tokens(message):
    """
    This function converts the incoming message into proper Unicode,
    applies the TextBlob function to the converted message, and returns
    the individual words of the message.
    """
    message = unicode(message, 'utf8')  # convert bytes into proper unicode
    return TextBlob(message).words
```
```python
def split_into_lemmas(message):
    """
    This function converts the incoming message into proper Unicode,
    applies the TextBlob function to the converted message, and returns
    the lemmas, or roots/stems, of the words in the message.
    """
    message = unicode(message, 'utf8').lower()
    words = TextBlob(message).words
    return [word.lemma for word in words]
```
```python
import string

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

PUNCTUATION = set(string.punctuation)
STOPWORDS = set(stopwords.words('english') + ['nan'])
STEMMER = PorterStemmer()

def tokenize(text):
    """
    This function converts the incoming message into proper Unicode and
    processes the converted message by lowercasing the letters, removing
    punctuation, and removing stopwords. It returns the stemmed words of
    the message, according to the logic of the PorterStemmer. The
    PorterStemmer is an algorithmic approach to stripping word suffixes.
    """
    message = unicode(text, 'utf8')
    tokens = word_tokenize(message)
    lowercased = [t.lower() for t in tokens]
    no_punctuation = []
    for word in lowercased:
        punct_removed = ''.join([letter for letter in word
                                 if letter not in PUNCTUATION])
        no_punctuation.append(punct_removed)
    no_stopwords = [w for w in no_punctuation if w not in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    return [w for w in stemmed if w]  # drop empty strings
```
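To make the lowercase / punctuation-removal / stopword-removal steps concrete without requiring NLTK or TextBlob, here is the same pipeline in outline using only the standard library. Stemming is omitted, the split is a naive whitespace split rather than word_tokenize, and the stopword list is a toy stand-in for NLTK's:

```python
import string

PUNCT = set(string.punctuation)
TOY_STOPWORDS = {'i', 'in', 'to', 'my', 'the', 'a'}  # stand-in for NLTK's list

def toy_tokenize(text):
    tokens = text.split()                               # crude split
    lowered = [t.lower() for t in tokens]               # lowercase
    no_punct = [''.join(c for c in t if c not in PUNCT)
                for t in lowered]                       # strip punctuation
    no_stop = [t for t in no_punct if t not in TOY_STOPWORDS]
    return [t for t in no_stop if t]                    # drop empty strings

print(toy_tokenize("Help! I can't log in to my account."))
# → ['help', 'cant', 'log', 'account']
```

Note how stripping punctuation turns "can't" into "cant"; that kind of side effect is exactly why the sanity check below is worth doing.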
After defining a function like this, I always find it a good idea to sanity-check the results: put the tokenized output in a new column and inspect it to make sure there are no unintended consequences. The snippet below creates a new column called tokenized_sentences by running the tokenize function on each row of the data frame.
```python
dataset['tokenized_sentences'] = \
    dataset.apply(lambda row: tokenize(row['total_text']), axis=1)
```
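One concrete sanity check is counting rows that tokenized to nothing, since those would be invisible to any downstream model. A self-contained sketch, using a toy frame and a simplified tokenizer in place of the real one:

```python
import pandas as pd

dataset = pd.DataFrame({
    'total_text': ['Hello, world!', '...', 'Reset my password'],
})

def toy_tokenize(text):
    # Simplified stand-in for the real tokenize(): lowercase, strip
    # surrounding punctuation, and drop words that end up empty.
    return [w.strip('.,!?').lower() for w in text.split() if w.strip('.,!?')]

dataset['tokenized_sentences'] = dataset.apply(
    lambda row: toy_tokenize(row['total_text']), axis=1)

# Rows whose message was all punctuation tokenize to an empty list.
empty = dataset[dataset['tokenized_sentences'].map(len) == 0]
print(len(empty))  # → 1 (the '...' row)
```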
I tried several classification techniques (Support Vector Machine, Naive Bayes, Stochastic Gradient Descent, Decision Tree), which will be covered in future blog posts.
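As a preview of what those posts will cover, here is a minimal sketch of what one such pipeline can look like in scikit-learn, with Naive Bayes over bag-of-words counts. The texts and labels are toy data, not the competition dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data: two 'legit' and two 'trash' messages.
texts = ['please reset my password', 'cheap pills buy now',
         'reset password help', 'buy cheap now']
labels = ['legit', 'trash', 'legit', 'trash']

model = Pipeline([
    ('vectorizer', CountVectorizer()),  # bag-of-words counts
    ('classifier', MultinomialNB()),    # Naive Bayes over those counts
])
model.fit(texts, labels)

print(model.predict(['password reset please'])[0])  # → legit
```

In practice the vectorizer would take one of the tokenizers above via its tokenizer parameter, and the classifier slot is where the different techniques get swapped in and compared.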
- header image credit: Pittsburgh Post-Gazette