One Hot Encoding - Explained!

One hot encoding is a technique used in data science to turn categorical data into a form that numerical models can work with. For analytical techniques that require numeric input, categorical data presents a problem: how does one compare "apple" to "orange", if those are two possible values in a data column?

One hot encoding converts values like these to a numerical representation, which can then be used in models. Each distinct non-NULL value in the categorical column becomes a column of its own. For every row, the new column whose header matches the row's categorical value gets a 1, and all the other new columns get a 0. I think a graphical representation makes this easier to understand, so here is a simple one.

RowId  FruitType
0      Apple
1      Orange
2      Apple
3      (NULL)
4      Apple
5      Orange
6      Kiwi

After one hot encoding the FruitType column, you would expect to see something like this:

RowId  Apple  Orange  Kiwi  FruitTypeImpute
0      1.0    0.0     0.0   0
1      0.0    1.0     0.0   0
2      1.0    0.0     0.0   0
3      1.0*   0.0     0.0   1
4      1.0    0.0     0.0   0
5      0.0    1.0     0.0   0
6      0.0    0.0     1.0   0
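The conversion described above can be sketched in plain Scala, without Spark. This is only an illustration: `OneHotSketch` and `oneHot` are hypothetical names, not part of any library. Missing values are modelled as `None` and come out as all-zero rows here, rather than being filled in the way row 3 of the table is.

```scala
// Hypothetical helper for illustration only, not a library API.
object OneHotSketch {
  // Returns the generated column headers and one row of 0.0/1.0 values per input row.
  def oneHot(values: Seq[Option[String]]): (Seq[String], Seq[Seq[Double]]) = {
    // Every distinct non-missing value becomes a column, in order of first appearance.
    val categories = values.flatten.distinct
    val rows = values.map { value =>
      // 1.0 where the row's value matches the column header, 0.0 everywhere else.
      categories.map(c => if (value.contains(c)) 1.0 else 0.0)
    }
    (categories, rows)
  }

  def main(args: Array[String]): Unit = {
    val fruitType = Seq(Some("Apple"), Some("Orange"), Some("Apple"), None,
      Some("Apple"), Some("Orange"), Some("Kiwi"))
    val (header, rows) = oneHot(fruitType)
    println(header.mkString("\t"))
    rows.foreach(r => println(r.mkString("\t")))
  }
}
```

Keeping the headers alongside the rows makes it possible to tell which position in each row stands for which fruit.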

One of the most vexing problems in cleaning and processing data to get it ready for analysis is how to deal with NULL or missing data. Models like linear regression cannot handle missing data, so there are two general techniques:
1. Delete the rows where you have missing data. If you do not have many rows with missing data, perhaps this is not a problem. But if a significant percentage of your rows have missing data, removing all of them will harm the predictive power of your model.
2. Impute the data, which is what I have done in row 3. The fastest imputation uses the mode of the column, that is, its most frequent value. In this example, "Apple" is the most common value in FruitType, so row 3 gets a 1 in the Apple column. Frequently, data is missing for a reason (that is, it is not missing at random), so I sometimes find it insightful to add a column that tracks whether or not a record has been imputed, which is the purpose of the FruitTypeImpute column. This is optional, but if this variable proves to be significant to your model, it can be a signal to look more closely at the circumstances under which the data goes missing.
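Mode imputation with an indicator column can be sketched as follows. This is a minimal plain-Scala illustration under stated assumptions: `ImputeSketch` and `imputeWithMode` are hypothetical names, missing values are modelled as `None`, and ties for the most frequent value are broken arbitrarily.

```scala
// Hypothetical helper for illustration only, not a library API.
object ImputeSketch {
  // Returns the filled-in column plus a 0/1 flag per row (1 = value was imputed).
  def imputeWithMode(values: Seq[Option[String]]): (Seq[String], Seq[Int]) = {
    val known = values.flatten
    require(known.nonEmpty, "need at least one non-missing value")
    // The mode is the most frequent non-missing value.
    // If several values tie, whichever the map yields first wins.
    val mode = known.groupBy(identity).maxBy(_._2.size)._1
    val filled = values.map(_.getOrElse(mode))
    val imputedFlag = values.map(v => if (v.isEmpty) 1 else 0)
    (filled, imputedFlag)
  }

  def main(args: Array[String]): Unit = {
    val fruitType = Seq(Some("Apple"), Some("Orange"), Some("Apple"), None,
      Some("Apple"), Some("Orange"), Some("Kiwi"))
    val (filled, flag) = imputeWithMode(fruitType)
    filled.zip(flag).foreach { case (f, i) => println(s"$f\t$i") }
  }
}
```

The flag is computed from the original values rather than the filled ones, so it stays accurate even when a real value happens to equal the mode.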

Below is a simple example of one hot encoding in Apache Spark, using the built-in feature transformers StringIndexer and OneHotEncoder from the ml package.


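import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Sample data; the missing value in row 3 is represented as an empty string.
val fruits = sqlContext.createDataFrame(Seq(
  (0, "Apple"), (1, "Orange"), (2, "Apple"), (3, ""),
  (4, "Apple"), (5, "Orange"), (6, "Kiwi")
)).toDF("rowId", "fruitType")

// Map each distinct string to a numeric index.
val indexer = new StringIndexer().setInputCol("fruitType").setOutputCol("fruitTypeIndex").fit(fruits)
val indexed = indexer.transform(fruits)

// Turn each index into a sparse one hot vector.
// Note: in Spark 3.x, OneHotEncoder is an estimator, so there you would call
// encoder.fit(indexed).transform(indexed) instead.
val encoder = new OneHotEncoder().setInputCol("fruitTypeIndex").setOutputCol("fruitTypeVec")
val encoded = encoder.transform(indexed)

encoded.select("rowId", "fruitTypeVec").show()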

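This last command returns the following information. The first column is the original rowId. The second column is a sparse Vector, printed as (size, [indices], [values]): the first value is the number of elements in the vector, the second lists the index positions that hold non-zero values, and the third lists the values at those positions. The vector has three elements rather than four because OneHotEncoder drops the last category by default. Judging by the output, the empty string in row 3 was indexed last, so that row is encoded as all zeros.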

rowId  fruitTypeVec
0      (3,[0],[1.0])
1      (3,[1],[1.0])
2      (3,[0],[1.0])
3      (3,[],[])
4      (3,[0],[1.0])
5      (3,[1],[1.0])
6      (3,[2],[1.0])
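To make the three-part vector notation concrete, here is a small plain-Scala sketch that expands a size, a list of indices, and a list of values into the full row of numbers. `SparseSketch` and `toDense` are hypothetical names used only for illustration; on real Spark vectors you would normally reach for the vector's own `toArray` method instead.

```scala
// Hypothetical helper for illustration only; it does not use Spark.
object SparseSketch {
  // Expands (size, indices, values) into a list of length `size`,
  // with 0.0 at every position that is not listed in `indices`.
  def toDense(size: Int, indices: Seq[Int], values: Seq[Double]): List[Double] = {
    require(indices.length == values.length, "indices and values must line up")
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense.toList
  }

  def main(args: Array[String]): Unit = {
    println(toDense(3, Seq(1), Seq(1.0)))   // row 1 above: (3,[1],[1.0])
    println(toDense(3, Seq.empty, Seq.empty)) // row 3 above: (3,[],[])
  }
}
```

Read this way, row 1 has its single 1.0 in the second position, while row 3 has no hot position at all.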

NB: "One Hot" refers to a state in electrical engineering where all of the bits in a circuit are 0, except a single bit with a value of 1. The bit with a value of 1 is said to be "hot". Hopefully you can now understand why this term makes sense when applied to categorical data!

Pitt Fagan

Greetings! I'm passionate about data; specifically the big data and data science ecosystems! It's such an exciting time to be working in these spaces. I run the BigDataMadison meetup where I live.