This year was the 10th edition of Berlin Buzzwords, the third I’ve attended, and the first time I’ve volunteered there. My role was as a “runner” on the Monday, which mostly meant being a friendly face and giving directions, plus a little carrying and fetching, and then handling registration for the evening AI Monday event.
As always, Berlin Buzzwords is one of the best data conferences in Europe, with talks focused mostly on the open source technology side. I missed some of the talks I wanted to see while carrying ice or books around, but they’ll all be online soon enough anyway. And as a volunteer, I had more of an opportunity to talk to people about their data journeys.
I was combining my trip to Berlin with some sightseeing, and so opted to skip the Sunday barcamp this year in favor of a trip to the German Technology Museum - which has some interesting exhibitions on computers as well as all sorts of other technologies.
I joined the conference on the Monday just in time for Band Aids don’t fix bullet holes: Repairing the broken promises of ubiquitous machine learning, which covered how we got to the point where everyone wants to solve problems with machine learning, and how to recover from that situation. I had to miss the middle of this talk, but there were some great takeaways and some great quotes. I particularly appreciated the description of deep learning neural networks, and how the first layers act as feature extraction while the later layers effectively act as individual machine learning algorithms - which works great for image recognition with cooperative inputs, but not so much elsewhere.
The next talk I managed to attend was Location Analytics - Real-Time Geofencing using Kafka. This talk went into great detail on how geofencing can be done on Kafka data using KSQL and some UDFs, highlighting some limitations whilst also showing that custom UDFs are quite easy to write. The talk finished with a suggestion that this could be done better using Tile38, a geospatial database and realtime geofencing server which speaks the Redis protocol and integrates nicely with Kafka.
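At the heart of any geofencing UDF is a point-in-polygon test. As a rough sketch of the kind of logic such a UDF would run - in plain Python rather than the KSQL/Java of the talk, and with a made-up function name and illustrative fence coordinates - a ray-casting check looks like this:

```python
# Toy sketch of the point-in-polygon test a geofencing UDF might run.
# Function name and fence coordinates are illustrative, not from the talk.

def point_in_fence(lon, lat, fence):
    """Ray casting: count how many polygon edges a ray heading right
    from (lon, lat) crosses; an odd count means the point is inside."""
    inside = False
    n = len(fence)
    for i in range(n):
        x1, y1 = fence[i]
        x2, y2 = fence[(i + 1) % n]
        # Does this edge straddle the horizontal line y = lat?
        if (y1 > lat) != (y2 > lat):
            # Longitude where the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# A rough square fence around central Berlin (illustrative coordinates)
berlin = [(13.2, 52.4), (13.6, 52.4), (13.6, 52.6), (13.2, 52.6)]
print(point_in_fence(13.4, 52.52, berlin))  # True  (inside)
print(point_in_fence(13.0, 52.52, berlin))  # False (outside)
```

In the Kafka setup from the talk this check would sit inside a UDF applied to a KSQL stream of location events; Tile38 instead gives you this (and much more) as server-side commands.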
Streaming Your Shared Ride was a talk on Apache Beam on top of Flink, and a good overview of Lyft’s data journey that took them to this solution.
Hops in the Cloud was an interesting talk on the unique Hops distribution of Hadoop, and how they have cloud support on the roadmap. Despite being a smaller distribution, there are a number of great features in Hops that make it attractive - including handling small files well, and a metadata layer which makes things like free-text search of the filesystem possible.
Taming the language border in data analytics and science with Apache Arrow was a great introduction to Apache Arrow, and well worth watching the video if sharing data between multiple programming languages on the same machine is of interest. It’s designed as a way of avoiding copying large amounts of data around in memory - for example when moving between Python, R and Scala in Spark.
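The core idea - consumers reading views over one shared buffer rather than each getting their own copy - can be illustrated with nothing but the Python stdlib. This is the concept only, not the Arrow API (Arrow additionally standardises the columnar layout so other languages can read the same buffer):

```python
# Toy illustration of the zero-copy idea behind Arrow, using only the
# Python stdlib: readers see slices of one shared buffer instead of
# receiving copies. (The concept only - not the Arrow API itself.)
from array import array

# A "column" of float64 values backed by one contiguous buffer
prices = array("d", [9.99, 14.50, 3.25, 7.00])

# A zero-copy view over the same memory
view = memoryview(prices)

# Mutating the underlying buffer is visible through the view - no copy
prices[0] = 11.99
print(view[0])  # 11.99

# Slicing the view is also zero-copy
first_two = view[:2]
print(first_two.tolist())  # [11.99, 14.5]
```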
Towards Flink 2.0: Rethinking the stack and APIs to unify Batch & Stream covered the upcoming changes to bring Flink’s batch and stream paradigms closer, as well as an interesting insight into how batch and stream differ in general. Flink 2.0 is going to be an interesting release, and will hopefully make Flink a lot easier to work with for less experienced Java developers.
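One way to picture the unification is “batch is just a bounded stream”: the same processing logic runs over a finite, materialised input and over a lazy, potentially unbounded one. A toy sketch in plain Python (Flink’s actual unified API is in Java/Scala; the names below are made up):

```python
# Toy sketch of "batch is just a bounded stream": one processing
# function handles both a finite list and a lazy generator.
# (Plain Python for illustration, not Flink's API.)

def running_sums(events):
    """Emit a running total after each incoming event."""
    total = 0
    for value in events:
        total += value
        yield total

# "Batch": a bounded, already-materialised input
batch = [1, 2, 3, 4]
print(list(running_sums(batch)))  # [1, 3, 6, 10]

# "Stream": the same logic over a lazy source that could be endless
def sensor_stream():
    for value in (5, 5, 5):  # stand-in for an unbounded feed
        yield value

print(list(running_sums(sensor_stream())))  # [5, 10, 15]
```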
Last but not least, Accelerate big data analytics with Apache Kylin was a great introduction to Apache Kylin covering how and why it was created. The likes of Impala, Spark SQL, and Hive on Spark have all failed to achieve the sub-second querying that users want, and Apache Kylin can help. Yahoo Japan switched from Impala to Kylin, and their average query latency went from 1 minute to < 1 second.
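Kylin gets its sub-second latency by precomputing OLAP cubes: aggregates over every combination of dimensions are built up front, so a query becomes a lookup rather than a scan. A toy sketch of that precomputation idea in plain Python (illustrative data and code - not Kylin’s implementation, which builds cuboids on Hadoop/Spark):

```python
# Toy sketch of the OLAP-cube precomputation behind Kylin's speed:
# aggregate a fact table over every subset of its dimensions up front,
# so queries become dictionary lookups. (Illustrative only.)
from itertools import combinations

rows = [
    {"country": "JP", "device": "mobile", "clicks": 10},
    {"country": "JP", "device": "desktop", "clicks": 5},
    {"country": "DE", "device": "mobile", "clicks": 7},
]
dimensions = ["country", "device"]

# Build one aggregate per (dimension subset, dimension values) pair
cube = {}
for r in range(len(dimensions) + 1):
    for dims in combinations(dimensions, r):
        for row in rows:
            key = (dims, tuple(row[d] for d in dims))
            cube[key] = cube.get(key, 0) + row["clicks"]

# Queries are now lookups instead of scans:
print(cube[((), ())])                                   # 22 (grand total)
print(cube[(("country",), ("JP",))])                    # 15
print(cube[(("country", "device"), ("JP", "mobile"))])  # 10
```

The trade-off is storage and build time for the cube versus query latency - which is exactly why it beats scan-based engines like Impala on repeated aggregate queries.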