Classification of Webform Messages Project - Part 1: Explaining the Problem

Recently there was an internal work competition to see who could build the best classifier for incoming messages to a website. This post explains the problem we are trying to solve, gives some background on the dataset, and covers some natural language processing steps used to prepare the dataset.

Problem Statement

The webform messages in the training set fall into three classes: legitimate (valid) messages, trash messages, and misfire messages. Human reviewers labeled the training messages by hand over the course of nine months, and the purpose of the competition is to automate that classification. Because the webform messages are free-form text, Natural Language Processing (NLP) techniques were chosen for processing the data.

A Bit about the Dataset

The pertinent component of the dataset is the label, or dependent variable, which is one of the words misfire, legit, or trash. There is no missing data for the label, and the label is always in English regardless of the language of the message.

The webform message comes in two columns, the subject and the description. There is no missing data for these independent variables; however, the webform message is not always in English. For this analysis the non-English messages were removed using a Python library called langid. Messages that were not classified as English with a confidence of at least 70% were removed. The subject and description were then combined in a new column called total_text.
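The filtering and combining steps described above could look roughly like the sketch below. It is not the code used in the competition: filter_english, combine_text, toy_classify, and the dictionary-based rows are hypothetical names chosen for illustration, and classifying the combined text (rather than the subject and description separately) is an assumption. The classifier is passed in as a function so the example runs without langid installed. Note that langid returns an unnormalized score by default, so a 70% threshold only makes sense with normalized probabilities, as hinted in the comment.

```python
def combine_text(row):
    """Join subject and description into one string (the total_text column)."""
    return "{} {}".format(row["subject"], row["description"]).strip()


def filter_english(rows, classify, threshold=0.70):
    """Keep rows whose combined text is classified as English
    with a confidence of at least `threshold`.

    `classify` is any function mapping text -> (language_code, confidence).
    """
    kept = []
    for row in rows:
        text = combine_text(row)
        lang, confidence = classify(text)
        if lang == "en" and confidence >= threshold:
            # Copy the row and attach the combined text.
            kept.append(dict(row, total_text=text))
    return kept


# Stand-in classifier for demonstration only. With langid, an assumed
# equivalent that yields probabilities between 0 and 1 would be:
#   from langid.langid import LanguageIdentifier, model
#   classify = LanguageIdentifier.from_modelstring(model, norm_probs=True).classify
def toy_classify(text):
    return ("en", 0.95) if "the" in text.lower().split() else ("de", 0.80)


rows = [
    {"subject": "Question", "description": "Where is the invoice?"},
    {"subject": "Frage", "description": "Wo ist meine Rechnung?"},
]
english_rows = filter_english(rows, toy_classify)
```

Passing the classifier in as an argument also makes it easy to experiment with a different threshold or a different language detector without touching the filtering logic.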

The rest of the data are account or user identifiers, which are irrelevant to the classification and were dropped.

NLP Functions

One of the main decisions for the analysis was how to tokenize the total_text field. I tried three different functions to process the data in the total_text field.

  • split_into_tokens
from textblob import TextBlob  # TextBlob can also do POS tagging

def split_into_tokens(message):
    """Decode the incoming message to Unicode if needed and return
    its individual words as tokenized by TextBlob."""
    if isinstance(message, bytes):  # convert raw bytes into proper Unicode
        message = message.decode('utf8')
    return TextBlob(message).words
  • split_into_lemmas
def split_into_lemmas(message):
    """Decode the incoming message to Unicode if needed, lowercase it,
    and return the lemma (dictionary base form) of each word,
    as produced by TextBlob."""
    if isinstance(message, bytes):  # convert raw bytes into proper Unicode
        message = message.decode('utf8')
    words = TextBlob(message.lower()).words
    return [word.lemma for word in words]
  • tokenize
import string

# The NLTK stopwords corpus and Punkt tokenizer data must be downloaded first.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

PUNCTUATION = set(string.punctuation)
STOPWORDS = set(stopwords.words('english') + ['nan'])
STEMMER = PorterStemmer()

def tokenize(text):
    """Decode the incoming message to Unicode if needed, lowercase it,
    and remove punctuation and stopwords. Return the stemmed words of
    the message. The PorterStemmer is a rule-based algorithm that
    strips word suffixes."""
    if isinstance(text, bytes):  # convert raw bytes into proper Unicode
        text = text.decode('utf8')
    tokens = word_tokenize(text)
    lowercased = [t.lower() for t in tokens]
    # Strip punctuation characters from every token.
    no_punctuation = [
        ''.join(letter for letter in word if letter not in PUNCTUATION)
        for word in lowercased
    ]
    no_stopwords = [w for w in no_punctuation if w not in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    # Drop tokens that became empty once punctuation was removed.
    return [w for w in stemmed if w]

After defining a function, I always find it a good idea to run a sanity check: put the tokenized results in a new column and inspect them to make sure there are no unintended consequences. The snippet below creates a new column called tokenized_sentences by running the tokenize function on each row of the data frame.

dataset['tokenized_sentences'] = dataset['total_text'].apply(tokenize)
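One way to make this kind of sanity check repeatable is to scan the tokenized output for the things tokenize is supposed to remove. The sketch below is an illustration, not part of the original analysis: find_token_problems is a hypothetical helper, and the small SAMPLE_STOPWORDS set stands in for the NLTK stopword list.

```python
import string

PUNCTUATION = set(string.punctuation)
SAMPLE_STOPWORDS = {"the", "is", "a", "nan"}  # stand-in for the NLTK stopword list


def find_token_problems(tokens, stopwords=SAMPLE_STOPWORDS):
    """Return a list of (token, reason) pairs for tokens that should
    not have survived tokenization."""
    problems = []
    for token in tokens:
        if not token:
            problems.append((token, "empty"))
        elif token != token.lower():
            problems.append((token, "not lowercased"))
        elif any(ch in PUNCTUATION for ch in token):
            problems.append((token, "contains punctuation"))
        elif token in stopwords:
            problems.append((token, "stopword"))
    return problems


clean = find_token_problems(["invoic", "miss", "order"])
dirty = find_token_problems(["invoic", "the", "order!", ""])
```

Applied to each entry of the tokenized_sentences column, a helper like this would flag rows worth inspecting by hand. A stemmed token can coincidentally equal a stopword, so hits are prompts for a closer look rather than proof of a bug.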

I tried several classification techniques (Support Vector Machine, Naive Bayes, Stochastic Gradient Descent, Decision Tree), which will be covered in future blog posts.

  • header image credit Pittsburgh Post Gazette

Pitt Fagan

Greetings! I'm passionate about data; specifically the big data and data science ecosystems! It's such an exciting time to be working in these spaces. I run the BigDataMadison meetup where I live.