From June 13-15, 2017 I attended the DataWorks Summit 2017 in San Jose. I was very fortunate to attend, as I did so by winning a raffle prize run by the kind hosts of the Roaring Elephant Podcast. I want to thank Dave Russell and Jhon Masschelein, the hosts of the podcast, and Hortonworks for sponsoring the conference ticket. I would also like to thank my employer, Zendesk, for covering my travel expenses. It was my first time going to one of the signature Big Data conferences and I had an excellent time.
Here is the podcast covering the DataWorks Summit on which I was the featured guest!
There were nine tracks spread out over the three full days. I think I attended a talk from every track except the Crash Course track.
a) Apache Hadoop
b) Apache Spark and Data Science
d) Cloud and Operations
e) Crash Course
f) Data Processing and Warehousing
g) Enterprise Adoption
h) Governance and Security
i) IoT and Streaming
Here are my Top 3 presentations from the conference.
Whoops. The Numbers are Wrong! Scaling Data Quality at Netflix by Michelle Ufford, Netflix.
This presentation was very interesting, not least because of all the imagery of Netflix original programming that provided the backdrop to many of the slides. Luke Cage and Iron Fist, anyone? The general ethos at Netflix when it comes to data quality problems is to focus on finding them rather than worrying too much about why they occur. The talk focused on several in-house products that have been developed, like MetaCat (a federated data catalog available anywhere) and Quinto (a data quality service application). They make heavy use of the WAP ETL pattern. WAP stands for Write, Audit, Publish: new data is written to a staging location, audited with quality checks, and only published to consumers once the checks pass.
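The WAP flow can be sketched roughly as follows. This is a minimal illustration of the pattern, not Netflix's implementation; the in-memory "warehouse", table names, and audit rules are all invented for the example.

```python
# Minimal sketch of the Write-Audit-Publish (WAP) ETL pattern.
# The dict below stands in for whatever storage layer you actually use.
warehouse = {}  # table name -> list of row dicts

def write(table, rows):
    """Write new data to a staging table, never directly to the live one."""
    warehouse[f"{table}__staging"] = rows

def audit(table, checks):
    """Run data-quality checks against the staged data only."""
    staged = warehouse[f"{table}__staging"]
    return all(check(staged) for check in checks)

def publish(table):
    """Promote staged data to the live table in one step."""
    warehouse[table] = warehouse.pop(f"{table}__staging")

rows = [{"user_id": 1, "plays": 12}, {"user_id": 2, "plays": 7}]
checks = [
    lambda rs: len(rs) > 0,                       # non-empty load
    lambda rs: all(r["plays"] >= 0 for r in rs),  # no negative counts
]

write("daily_plays", rows)
if audit("daily_plays", checks):
    publish("daily_plays")  # consumers only ever see audited data
else:
    print("audit failed; staged data held back for investigation")

print(sorted(warehouse))  # ['daily_plays']
```

The key property is that a failed audit leaves the live table untouched, so bad data never reaches downstream consumers.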
Netflix has plans to open source a wrapper library called Jumpstarter, which defines individual rules for self-service data quality. The talk ended on a cautionary note, with advice that not every team or table warrants data quality checks. Their future efforts will be focused on robust anomaly detection and handling, and they are considering open sourcing more of their internal data quality products.
Data Ingest Self Service and Management using Apache NiFi and Kafka by Imran Amjad and Dave Torok, Comcast.
The presenters outlined the system they architected to let colleagues at all levels of the company self-serve the data governance logic for the information under their purview via a web portal. In the portal they set up a pipeline by defining source and target schemas, which in turn drives a number of downstream processes like Kafka topics and NiFi templates. All in all, a cool way to federate this process out to the teams that are most concerned with, and knowledgeable about, their own data.
They mentioned a couple of libraries for dealing with JSON, called JSONPath and, especially, JOLT, which I am interested in checking out.
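To get a feel for what JSONPath-style extraction does, here is a toy path walker. The real JSONPath and JOLT libraries are far richer (wildcards, filters, whole-document transforms); the `extract` helper and the sample event below are invented purely for illustration.

```python
# Toy illustration of JSONPath-style extraction: walk a dotted path
# (e.g. "user.address.city") through nested dicts and lists.
def extract(doc, path):
    node = doc
    for part in path.split("."):
        if isinstance(node, list):
            node = node[int(part)]  # numeric parts index into lists
        else:
            node = node[part]       # string parts key into dicts
    return node

event = {
    "user": {"name": "ada", "address": {"city": "San Jose"}},
    "devices": [{"kind": "tv"}, {"kind": "phone"}],
}

print(extract(event, "user.address.city"))  # San Jose
print(extract(event, "devices.1.kind"))     # phone
```

JOLT goes a step further: instead of pulling single values out, it declaratively maps one JSON shape to another, which is handy when source and target schemas differ.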
Large Scale Graph Processing and Machine Learning Algorithms for Payment Fraud Prevention by Venkatesh Ramanathan, PayPal.
A really in-depth, technical presentation which was great. Fraud is a massive problem for PayPal, and it was neat to see the immense operational and research efforts underway to combat it. Fraud prevention is a multi-tiered approach, occurring at the transaction, account, and network levels.
Dr. Ramanathan really likes Gradient Boosted Trees, a technique I saw mentioned in a few other presentations. He also mentioned Active Learning, which was new to me and which he defined as an ML algorithm that can achieve better accuracy if it is allowed to choose the data from which it learns. This is an under-reported technique with which they have gotten good results.
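The "choose the data from which it learns" idea is often implemented as pool-based uncertainty sampling: the model queries labels only for the examples it is least sure about. Here is a minimal sketch; the 1-D threshold "model", the synthetic pool of scores, and the oracle labeler are all invented for illustration, not anything from the talk.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
def fit_threshold(labeled):
    """Tiny 'model': decision threshold at the midpoint of the class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def most_uncertain(pool, threshold):
    """Uncertainty sampling: pick the point closest to the decision boundary."""
    return min(pool, key=lambda x: abs(x - threshold))

labeled = [(0.1, 0), (0.9, 1)]           # small seed set of labeled examples
pool = [0.2, 0.45, 0.55, 0.8]            # unlabeled examples (fraud scores, say)
oracle = lambda x: 1 if x > 0.5 else 0   # stand-in for a human labeler

for _ in range(2):
    t = fit_threshold(labeled)
    x = most_uncertain(pool, t)          # model chooses what to label next
    pool.remove(x)
    labeled.append((x, oracle(x)))       # pay for that one label only

print(sorted(x for x, _ in labeled))     # [0.1, 0.45, 0.55, 0.9]
```

The appeal for fraud work is obvious: labels are expensive (each one may mean a manual investigation), so spending them on the borderline cases rather than the easy ones stretches the labeling budget further.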
A special shoutout to the most distinctive presentation I heard, The Apache Way by Alan Gates, Hortonworks. Everyone who works in the Big Data/Data Science spaces owes a major debt to the Apache Software Foundation, which has fostered, and continues to incubate, numerous influential projects. I found the presentation quite inspirational, as he stressed the fundamental importance of community to these projects and encouraged attendees to get involved.
Some random thoughts on the Conference:
- The venue (San Jose McEnery Convention Center) was great. Rooms were large enough that I never saw anyone turned away from a presentation. Having everyone's badge obsessively scanned as we funneled into each room through a small doorway was a bit of a pain, though.
- The co-host of the conference, alongside Hortonworks, was Yahoo. Yahoo occupies a justifiably famous place in the annals of Hadoop history due to HDFS and MapReduce back in 2004-2006, but the day before the conference Yahoo was purchased by Verizon. So there was a huge Yahoo pavilion in the sponsor area, decked out in Yahoo branding but otherwise empty. It mainly served as a meeting place for ex-Yahoo-ers to chat.
- I quickly learned to pay attention to the company the presenters were from. If they were from Hortonworks, or from a company with no business relationship with Hortonworks, the talk had a greater likelihood of being good. If the company behind the presentation had a business relationship with Hortonworks, the talk was likely a sales demonstration of their technology, which you can always get in the sponsor hall if you want it. I'm looking at you, BMC!
- At the first day's keynote there was a brief, canned demonstration of an IBM product called the Data Science Experience, or DSX for short. The speaker touted the capability for employees to run a machine learning algorithm in three clicks, which is scary as hell for someone who actually does data science for a living! One point they did harp on, which in my industry experience is a legitimate problem, is making it easier to deploy your analytics solutions to production.
- There were two-hour keynotes on each of the three days, which seemed like overkill. On the first two days the keynote kicked off with a serious laser light show, techno music, and a fog machine. Just like a stadium rock concert! Impressive, but weird.