Madison Bus Trip Analysis
As data becomes more prevalent across many industries, there is a greater need for tools that facilitate the democratization and collaboration of data analysis projects. One such application is Apache Zeppelin a web-based, open-source data analytics notebook. Zeppelin supports using multiple languages within the same notebook, sits on top of Apache Spark to run calculations, and has considerable built-in data visualization tools. Notebooks can be exported and shared as JSON and multiple parties can work on them collaboratively through GitHub. Visuals can also be published as HTML to a website.
In the following example I demonstrate the range of functionality using Zeppelin to analyze bus trip data from Madison Metro, our local municipal bus service. Screenshots show the code and results from a simple analysis and calculation of basic statistics on Madison Metro Bus trips and associated bus stop and ridership data.
Using Zeppelin to Analyze Madison Metro Bus Data
The codeblock below demonstrates how I loaded the data as a text file into Spark using the Scala language. The code maps the data to fields in a dataframe and converts the dataframe into a table construct. By converting the data into a table in Spark, standard SQL can be run against the data, even though it has never been loaded into a database.
Zeppelin creates and injects sc (SparkContext) and sqlContext (HiveContext or SqlContext), so you don't need create them manually.
import org.apache.spark.sql.functions._ // load bus trip data val busTrip = sc.textFile("/home/pitt/Documents/trips.txt") val busTripHeader = busTrip.first() val busTripData = busTrip.filter( line => line !=busTripHeader) case class Trip(routeId: Integer, routeShortName: String, serviceId: String, tripId: Integer, tripHeadsign: String, directionId: Integer, directionName: String, blockId: Integer, shapeId: Integer, shapeCode: String, tripType: String, tripSort: Integer, wheelchairAccessible: Integer, bikesAllowed: Integer ) // example of data row: 6838,01,70_WKD,723344,CAPITOL SQUARE,0,CapSq,170361,42380,1E,,35820,1,1 val trip = busTripData.map(s => s.split(",")) .map(s => Trip(s(0).toInt, s(1), s(2), s(3).toInt, s(4), s(5).toInt,s(6),s(7).toInt, s(8).toInt, s(9), s(10) ,s(11).toInt, s(12).toInt, s(13).toInt)).toDF() trip.registerTempTable("tripTable")
The image below shows the query and resulting graph of the number of trips for each bus route. Unsurprisingly, bus route 80, a free route that services the UW, has the highest number of trips in the analyzed resultset.
select routeShortName, count(tripId) value from tripTable where wheelchairAccessible = 1 group by routeShortName order by routeShortName
Here I load another text file to show bus trip stops and ridership data, which is similar to the first example above:
val tripDist = sc.textFile("/home/pitt/Documents/stop_times.txt") val tripDistHeader = tripDist.first() val tripDistData = tripDist.filter( line => line !=tripDistHeader) case class stopDist(tripId: Integer, stopSequence: Integer, stopId: Integer, pickupType: Integer, dropoffType: Integer, arrivalTime: String ,departureTime: String, timePoint: String, distanceTraveled: Float ) // example of data row: 723344,18,1787,1,0,10:13:00,10:13:00,1,2.8175 val busStopDist = tripDistData.map(s => s.split(",")) .map(s => stopDist(s(0).toInt, s(1).toInt, s(2).toInt, s(3).toInt, s(4).toInt, s(5), s(6), s(7), s(8).toFloat)).toDF() busStopDist.registerTempTable("tripDistTable")
Finally, the codeblock below shows a SparkSQL query that joins the first data file (tripTable) together with the second data file (tripDistTable) to show the longest bus route (route 56), the number of stops along the route, the total distance covered in the route, and the average distance between stops in the route.
// example joining two datasets together select tripTable.routeShortName, max(stopSequence) as maxStops, max(distanceTraveled) as maxDist, max(distanceTraveled)/max(stopSequence) as avgDistPerStop from tripTable join tripDistTable on (tripTable.tripId = tripDistTable.tripId) group by tripTable.routeShortName order by maxDist desc limit 1
Here is the result of the SparkSQL query.
This Apache Zeppelin notebook demonstration is drawn from a presentation I did at the BigDataMadison meetup. You can download the notebook and presentation here: https://github.com/Pshrub/bigdatamadison_spark
Try it for yourself! Any dataset will do, or replicate this example using Madison Metro Bus Data.