Zeppelin: Data Analysis for the People

Madison Bus Trip Analysis


As data becomes more prevalent across many industries, there is a greater need for tools that facilitate the democratization and collaboration of data analysis projects. One such application is Apache Zeppelin a web-based, open-source data analytics notebook. Zeppelin supports using multiple languages within the same notebook, sits on top of Apache Spark to run calculations, and has considerable built-in data visualization tools. Notebooks can be exported and shared as JSON and multiple parties can work on them collaboratively through GitHub. Visuals can also be published as HTML to a website.

In the following example I demonstrate the range of functionality using Zeppelin to analyze bus trip data from Madison Metro, our local municipal bus service. Screenshots show the code and results from a simple analysis and calculation of basic statistics on Madison Metro Bus trips and associated bus stop and ridership data.

Using Zeppelin to Analyze Madison Metro Bus Data

The codeblock below demonstrates how I loaded the data as a text file into Spark using the Scala language. The code maps the data to fields in a dataframe and converts the dataframe into a table construct. By converting the data into a table in Spark, standard SQL can be run against the data, even though it has never been loaded into a database.

Zeppelin creates and injects sc (SparkContext) and sqlContext (HiveContext or SqlContext), so you don't need create them manually.

import org.apache.spark.sql.functions._

// load bus trip data
val busTrip = sc.textFile("/home/pitt/Documents/trips.txt")
val busTripHeader = busTrip.first() 
val busTripData = busTrip.filter( line => line !=busTripHeader)

case class Trip(routeId: Integer, routeShortName: String, serviceId: String, tripId: Integer, tripHeadsign: String, directionId: Integer, directionName: String, blockId: Integer, shapeId: Integer, shapeCode: String, tripType: String, tripSort: Integer, wheelchairAccessible: Integer, bikesAllowed: Integer )

// example of data row: 6838,01,70_WKD,723344,CAPITOL SQUARE,0,CapSq,170361,42380,1E,,35820,1,1

val trip = busTripData.map(s => s.split(","))
  .map(s => Trip(s(0).toInt, s(1), s(2), s(3).toInt, s(4), s(5).toInt,s(6),s(7).toInt, s(8).toInt, s(9), s(10) ,s(11).toInt, s(12).toInt, s(13).toInt)).toDF()

trip.registerTempTable("tripTable")

The image below shows the query and resulting graph of the number of trips for each bus route. Unsurprisingly, bus route 80, a free route that services the UW, has the highest number of trips in the analyzed resultset.

select routeShortName, count(tripId) value
from tripTable 
where wheelchairAccessible = 1 
group by routeShortName 
order by routeShortName

resultset1

Here I load another text file to show bus trip stops and ridership data, which is similar to the first example above:

val tripDist = sc.textFile("/home/pitt/Documents/stop_times.txt")
val tripDistHeader = tripDist.first() 
val tripDistData = tripDist.filter( line => line !=tripDistHeader)

case class stopDist(tripId: Integer, stopSequence: Integer, stopId: Integer, pickupType: Integer, dropoffType: Integer, arrivalTime: String ,departureTime: String, timePoint: String, distanceTraveled: Float )

// example of data row: 723344,18,1787,1,0,10:13:00,10:13:00,1,2.8175

val busStopDist = tripDistData.map(s => s.split(","))
  .map(s => stopDist(s(0).toInt, s(1).toInt, s(2).toInt, s(3).toInt, s(4).toInt, s(5), s(6), s(7), s(8).toFloat)).toDF()

busStopDist.registerTempTable("tripDistTable")

Finally, the codeblock below shows a SparkSQL query that joins the first data file (tripTable) together with the second data file (tripDistTable) to show the longest bus route (route 56), the number of stops along the route, the total distance covered in the route, and the average distance between stops in the route.

// example joining two datasets together
select tripTable.routeShortName, 
max(stopSequence) as maxStops, 
max(distanceTraveled) as maxDist, 
max(distanceTraveled)/max(stopSequence) as avgDistPerStop
from tripTable
join tripDistTable on (tripTable.tripId = tripDistTable.tripId)
group by tripTable.routeShortName order by maxDist desc limit 1

Here is the result of the SparkSQL query.

resultset2

This Apache Zeppelin notebook demonstration is drawn from a presentation I did at the BigDataMadison meetup. You can download the notebook and presentation here: https://github.com/Pshrub/bigdatamadison_spark

Try it for yourself! Any dataset will do, or replicate this example using Madison Metro Bus Data.

Pitt Fagan

Greetings! I'm passionate about data; specifically the big data and data science ecosystems! It's such an exciting time to be working in these spaces. I run the BigDataMadison meetup where I live.