Analysis using SVM

This post explores Support Vector Machines, or SVM. Please see part 1 of this series for an explanation of the problem and the data preparation steps.

SVM is a supervised learning technique used for both regression and classification. It treats each data row as a point in n-dimensional space, where n is the number of features. The algorithm then seeks the hyperplane in that space that maximizes the margin between the classes defined by the label. How well the algorithm performs depends on the data and on how cleanly the classes can be separated by a hyperplane. For this post we are performing a multi-class classification with three classes.
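As a minimal illustration of the multi-class case (on synthetic data, not the webform dataset from part 1), scikit-learn's SVC handles a three-class problem directly:

```python
# Sketch: a three-class SVC on a small synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Generate 300 rows with 5 features and 3 classes.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3,
    n_classes=3, random_state=0,
)

clf = SVC(kernel='linear')  # separate the classes with linear hyperplanes
clf.fit(X, y)
predictions = clf.predict(X)  # one of the three class labels per row
```

Under the hood, SVC reduces the multi-class problem to a set of binary separations (the `decision_function_shape` parameter tuned later controls one-vs-one vs. one-vs-rest scoring).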

I did all of the analysis in Python using Jupyter notebooks, Pandas and Scikit-Learn.

First we randomize the order of the dataframe and split it into training and test sets, using an 80%/20% split. Some data science references recommend a 70%/20%/10% train/test/validation split, where 10% of the dataset is held completely separate and used for cross-validation purposes. If you are doing enough folds of cross-validation (I did 10), I think an 80%/20% split is safe, as it is highly unlikely that a systematically bad split would influence all of the validation folds.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Shuffle the rows, then hold out 20% of the data for testing
dataset = dataset.reindex(np.random.permutation(dataset.index))

msg_train, msg_test, label_train, label_test = train_test_split(
    dataset['total_text'], dataset['label'], test_size=0.2)
```
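As a side note, `train_test_split` shuffles by default, so a fixed `random_state` (plus optional stratification on the label) makes the split reproducible. A small self-contained sketch with made-up rows, using the same column names as the post:

```python
# Sketch: reproducible, stratified 80%/20% split (toy data, not the real dataset).
import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.DataFrame({
    'total_text': ['buy now', 'hello team', 'win prize', 'meeting notes'] * 25,
    'label': ['trash', 'legit', 'trash', 'legit'] * 25,
})

msg_train, msg_test, label_train, label_test = train_test_split(
    dataset['total_text'], dataset['label'],
    test_size=0.2, random_state=42, stratify=dataset['label'],
)
print(len(msg_train), len(msg_test))  # 80 20
```

Stratifying keeps the class proportions identical in the training and test sets, which matters when the classes are imbalanced.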

The code below is the guts of the analysis. I create a Pipeline object in Scikit-Learn. The pipeline has three major components: the vectorizer, the transformer, and the classifier. Once you specify the steps of the pipeline, the next section specifies the parameters for each component. The Support Vector Classifier can take many parameters, only some of which are listed below. Changing some of these parameters will have a small impact on the model results, while others can have a huge impact, especially the kernel function. In the example below it is set to linear, so the solution will use a linear function to separate the classes. There are other options like polynomial, rbf (radial basis function), etc.

The final section is where you outline the cross-validation steps. The grid search tries every combination of the parameter values above, identifying the optimal hyperparameter values across the runs through the pipeline.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

pipeline_svm = Pipeline([
    ('vect', CountVectorizer(analyzer=tokenize)),  # split_into_tokens(), split_into_lemmas(), tokenize()
    ('tfidf', TfidfTransformer()),
    ('clf', SVC()),  # SVC = Support Vector Classifier
])

param_svm = {
    'vect__max_df': (0.5, 0.75,),  # (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000,),  # (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1),),  # ((1, 1), (1, 2)), unigrams or bigrams
    'tfidf__use_idf': (True,),  # (True, False),
    'tfidf__norm': ('l1', 'l2',),
    'clf__C': (100, 1000, 10000,),
    #'clf__degree': (2, 3, 4,),
    'clf__gamma': ('auto',),  # ('auto', 0.001, 0.0001, 0.00001,),
    'clf__kernel': ('linear',),
    'clf__decision_function_shape': ('ovo', 'ovr',),
    #'clf__class_weight': ('balanced',),
}

grid_svm = GridSearchCV(
    pipeline_svm,  # pipeline from above
    param_grid=param_svm,  # parameters to tune via cross-validation
    refit=True,
    n_jobs=-1,
    verbose=0,
    scoring='accuracy',  # what score are we optimizing? accuracy, f1_weighted, etc.
    cv=StratifiedKFold(n_splits=10),
)
```

Below is the code to run the pipeline and print out the optimal hyperparameter values from the grid search. Please note that for each unique combination of parameters it will do a 10-fold cross-validation, so the model as constructed right now will iterate through hundreds of model fits (and thus take a long time). But it's a great technique to run overnight and see what the best hyperparameter results are in the morning.
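The total number of fits is the number of parameter combinations times the number of folds. Scikit-learn's `ParameterGrid` enumerates the combinations, so you can count them before kicking off a long run; the grid below mirrors the one above:

```python
# Sketch: count how many models the grid search will fit.
from sklearn.model_selection import ParameterGrid

param_svm = {
    'vect__max_df': (0.5, 0.75),
    'vect__max_features': (None, 5000),
    'vect__ngram_range': ((1, 1),),
    'tfidf__use_idf': (True,),
    'tfidf__norm': ('l1', 'l2'),
    'clf__C': (100, 1000, 10000),
    'clf__gamma': ('auto',),
    'clf__kernel': ('linear',),
    'clf__decision_function_shape': ('ovo', 'ovr'),
}

n_combinations = len(ParameterGrid(param_svm))
n_folds = 10
print(n_combinations, n_combinations * n_folds)  # 48 candidates, 480 fits
```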

```python
%time svm_detector = grid_svm.fit(msg_train, label_train)

print("Best score: %0.3f" % grid_svm.best_score_)
print("Best parameters set:")
for param_name in sorted(param_svm.keys()):
    print("%s: %r" % (param_name, grid_svm.best_params_[param_name]))
```

Here are the results of the printout:

```
Best score: 0.906
Best parameters set:
clf__C: 10000
clf__gamma: 'auto'
clf__kernel: 'rbf'
tfidf__norm: 'l2'
tfidf__use_idf: True
vect__max_df: 0.75
vect__max_features: None
vect__ngram_range: (1, 1)
```

For final validation of the model, you run it against the held-out test set (the 20% from the 80%/20% split) and create a confusion matrix of the results.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(label_test, svm_detector.predict(msg_test))

model_name = 'SVC_unbalanced'
class_names = ['legit', 'misfire', 'trash']
plt.figure()
fig = plt.gcf()
title = "Normalized Confusion matrix - " + model_name
# plot_confusion_matrix() is a custom plotting helper, e.g. the one from the scikit-learn docs
plot_confusion_matrix(cm, classes=class_names, normalize=True, title=title)
```

And here is the resulting plot. The results are a bit of a mixed bag: the model did a great job of identifying legit webform messages (97.7% true positive rate) and a reasonably good job of identifying trash messages, but misfire identification was poor, with more misfire messages classified as legit than as misfire (67.1% vs. 23.4%).
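For reference, the "normalized" in the plot title means each confusion-matrix row is divided by its row total, turning counts into per-class rates (the diagonal is each class's true positive rate). A small sketch with made-up labels:

```python
# Sketch: row-normalizing a confusion matrix (labels here are illustrative only).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = ['legit'] * 6 + ['misfire'] * 3 + ['trash'] * 3
y_pred = ['legit'] * 5 + ['trash'] + ['legit', 'misfire', 'legit'] + ['trash'] * 3

cm = confusion_matrix(y_true, y_pred, labels=['legit', 'misfire', 'trash'])
cm_normalized = cm.astype(float) / cm.sum(axis=1, keepdims=True)
print(np.round(cm_normalized, 2))  # diagonal = per-class true positive rate
```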

Potential Next Steps

The magnitude of the disparity in the misclassifications implies that tweaking hyperparameters alone will not improve the model enough, so trying a completely different classification approach could be warranted. The results for the legit class are very encouraging, though, so my inclination for a next step is to do some feature engineering and add variables to the model. Specifically, POS (Part Of Speech) tagging on the total_text field, or even the respective lengths of the strings that make up the subject and description fields, could be worth including to improve performance.
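One way to bolt an engineered feature onto the existing tf-idf pipeline is scikit-learn's FeatureUnion. The sketch below combines tf-idf features with a simple text-length feature; the `text_length` helper and the toy data are assumptions for illustration, not code from the original analysis:

```python
# Sketch: stack tf-idf features with an engineered text-length feature.
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def text_length(texts):
    # Return a 2-D column so it can be stacked next to the tf-idf matrix.
    return np.array([len(t) for t in texts]).reshape(-1, 1)

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer()),
        ('length', FunctionTransformer(text_length, validate=False)),
    ])),
    ('clf', SVC(kernel='linear')),
])

# Toy data for illustration only.
texts = ['free prize now', 'quarterly report attached',
         'win win win', 'agenda for monday']
labels = ['trash', 'legit', 'trash', 'legit']
pipeline.fit(texts, labels)
pred = pipeline.predict(['free win prize'])
```

The same `param_svm`-style grid would still work here; the engineered feature just rides along as an extra column, and the grid search would tune everything jointly.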