Hazelcast + Big Data: Interview with Ben Evans
We had the chance to speak with Ben Evans, Java Champion, author for O’Reilly and InfoQ, as well as co-founder of JClarity, a startup which delivers performance tools and services to help development and ops teams. Together with his recent whitepaper on Hazelcast in combination with Spark, he also delivered a small application, BetLeopard, to demonstrate the integration between the two technologies.
Chris Engelbert: You’re around in the Java world for quite some time, when did you start with Java and why?
Ben Evans: Well, one of the things about me is that I am completely self-taught. I haven’t got any formal qualifications in computing of any kind – but I’ve been teaching myself to code since I was 8 years old, and my first encounter with Java was pretty much the same.
I was a Physics PhD student in 1998 and I’d been earning some extra money by tutoring a disabled student who couldn’t attend his Computer Science lectures in person. I’d learned Dijkstra’s algorithm and enough graph theory to stay ahead of his algorithms class, and at the end of term, he came to me and asked if I’d sit in on another class for him – some new programming language I’d never heard of, called “Java”, and the rest is history…
Chris Engelbert: Maybe a hard question but what do you think is your favorite Hazelcast feature?
Ben Evans: It’s not a single feature, but one of my favorite things about Hazelcast is that it’s scalable, in the sense that you can start with a developer who just knows core Java, and then introduce them to the concept of data caches that feel like Java collections, but which scale across machines. The deeper you go into the technology, you find it has a really nice learning curve, so you encounter the features you need as you need them, and it kind of grows with you as you need it to.
Chris Engelbert: Okay, let’s get into the topic. I’m pretty sure our readers are familiar with Hazelcast but can you give us a quick overview of Apache Spark?
Ben Evans: One of the ways I like to think about Spark is to think of it as Hadoop++. It’s a thing you see in a lot of technology markets – the dilemma about whether it’s better to be the first mover (& steal a march on the competition) or second mover (& learn from the first product’s mis-steps). Of course, it depends completely on the market conditions, and when you launch, and how hungry the market is for your product.
In the case of Hadoop, they launched at the right time, just as the market was getting warmed up and ready for Big Data, so they got great market share. However, the downside is that Hadoop has some architectural limitations and needed to build a whole ecosystem of supporting products around the core engine. This meant that the market was ripe for a second mover product to come in – especially one that was compatible with an already existing and maturing ecosystem.
So, the way I think about Spark is that it’s a general purpose, large scale data processing engine, that focuses on running computations and processing, and leverages existing data storage and clustering technologies. It is still a batch-focused technology though (like Hadoop) and doesn’t have a “true” streaming architecture.
Nevertheless, it’s a solid project – it has better performance than Hadoop, fewer rough edges and a better client programming experience (including a wider choice of client languages).
Chris Engelbert: You created a demo application to show how Hazelcast and Apache Spark work together. BetLeopard itself simulates a sport bet system. Why did you choose this use case?
Ben Evans: Pure nostalgia and mischief. My last job before joining the financial industry (about 15 years ago) was chief architect for a sports betting and gaming firm (Note for US readers: Legal sports betting has a long history in the UK). Betting systems do have some similarities with financial systems, but they aren’t as complex – so I wanted to see just how much of a full system I could write in a short amount of time, whilst still showcasing the features and producing something that would scale.
The name is kind of fun too – lots of sports betting firms in the UK have silly names, so I wanted something in that tradition – and I just had a picture in my head of a funny animal in a traditional bookmakers hat (usually a green trilby) and that was that – we had our name & logo for the demo.
Chris Engelbert: What do you see as the major advantage when integrating Hazelcast and Apache Spark?
Ben Evans: The two technologies complement each other – Java developers who may not be used to Big Data can start with Hazelcast IMap and partitions and JCache, whilst allowing Big Data developers on the same team, to work with Spark and the tools they’re familiar with. The connector between the two libraries also enables some cross-pollination and for each side to have an easy on-ramp to start to pick up some parts of the unfamiliar technology.
Spark’s performance is usually benchmarked against data sets stored on disc, so of course Hazelcast’s in-memory storage options are also going to look pretty attractive, if Big Data performance is an important point for your application.
Chris Engelbert: Almost done. You’ve probably heard that Hazelcast just released Hazelcast Jet. If you’re already familiar with it, what is your view on Hazelcast Jet versus Hazelcast and Apache Spark? Where do you see advantages or disadvantages of both systems?
Ben Evans: I’m just getting started with Jet – and it’s really interesting. To me, the big difference between Spark and Jet is that Jet has what I would call a “true streaming” or a “streaming-first” approach to processing. Spark is a great next iteration of the Hadoop style of batch architecture, but as I know from years of developing financial applications, there are fundamental differences between how you approach building each type of system.
It’s not that one is better or worse, they just have different architectural priorities that are going to fit to different use cases, and they represent different trade-offs – and it’s also easy to forget that sometimes our trade-offs are not purely at the code and technical level.
For example, if the team is developing an application that has a big need for analytics, but not in real time, and already has a lot of Spark or Hadoop experience, then Hazelcast + Spark might be the right decision. On the other hand, if the analytics are much more real-time and are directly driving crucial operational concerns, then the trade-off might be to invest in skilling the team up in Jet, and produce a Hazelcast + Jet application specifically architected for real-time analysis.
Ultimately, of course, the real win for developers is that there’s all this great free and open source software out there, which is solving so many application infrastructure & “plumbing” problems to allow developers to focus on what really matters for their applications.
Chris Engelbert: Quickly coming back to the use cases. As you’re living in London, the city of banks, can you see any other common use cases that would benefit from combining the best of both worlds?
Ben Evans: I can see a lot of “traditional” risk and operational analytic applications that I think could be taking an interest in Jet. Spark is of course already getting a foothold in banks (but the ones I deal with tend to have an existing Hadoop footprint, which I think will be around for a while yet). With banks’ emphasis on event-driven analytics as well as relentless performance requirements, I think it should be a great opportunity for both Jet and Spark to make some real footprint – although probably for slightly distinct use cases.
Chris Engelbert: Thank you very much for your time, is there anything else you like to share with the Hazelcast community?
Ben Evans: I’m hoping that we might release an update to BetLeopard soon, which brings in Jet so that people can see the two approaches side-by-side. Who knows – maybe we’ll call it “JetLeopard” – so watch this space.