Data and Middleware Technologies

What Is Apache Flink?

Apache Flink is an open source software technology for running stateful computations on both streaming data and batch data in a distributed system. It typically reads data from a source data repository, performs some action on the data, and then writes the outputs to a destination data repository (i.e., a “sink”). It provides APIs, including SQL, for building stream processing and batch processing jobs. It takes data one point at a time and performs operations on the individual data points or on groups of data points known as “windows.” It is similar to technologies like Hazelcast, Apache Storm, Apache Samza, Apache Apex, and AWS Kinesis Data Analytics.

Flink started as a fork of a research project called Stratosphere (which was a collaboration of the Technical University of Berlin, the Humboldt University of Berlin, and the Hasso Plattner Institute) and became an Apache Incubator project in March 2014. It was later accepted as an Apache top-level project in December 2014.

Why Use Apache Flink?

Flink is used as a framework for building data pipeline jobs that process large amounts of data. Rather than having the software developer worry about task coordination and allocation, Flink takes care of those low-level concerns. The data can be in the form of data streams (also known as data in motion) or in the form of batch data (also known as data at rest).

When used to process data streams (i.e., stream processing), it typically connects to a message queue (e.g., Apache Kafka, Apache Pulsar) and reads data from the queue to process it in an ordered manner. As a stream processing engine, it complements the message queue because it can efficiently process one data point at a time, which is ideal for use cases that require that level of granularity for identifying trends and patterns.

When used to process batch data, it can read data from a large data store in a parallelized way to boost throughput. It behaves similarly to processing streaming data, except that the batch data is bounded so there is an expected end to the processing. The processing is done on the entire collection of batch data to produce an output such as an analytics-ready data set.

Flink Use Cases

Flink can handle many types of use cases, and according to the official Flink website, the ideal use cases can be grouped into 3 categories:

  • Event-driven applications: These types of applications take “event data” (i.e., data points that reflect a change, such as updates in a database table) and react to those changes per the business logic in the application. Oftentimes the event data is used to propagate a data update from a source to many sinks.
  • Data analytics applications: These applications uncover insights from data in a more proactive way, versus the traditional way of preparing data for downstream use by human analysts using business intelligence (BI) tools. Examples of analytics include capturing max/min values, averages, and standard deviations, typically in aggregations defined by a sequence of data points in a specified window.
  • Data pipeline applications: Data pipelines are a series of processing steps that prepare the data for downstream use, typically for human-driven analytics. These applications typically take data from a source and transform it in a way to create more analytical value, including by enriching it with external data. Data pipelines are the construct for practices like Streaming ETL.

Who Uses Apache Flink?

Flink is used by a wide variety of companies across many industries. Some well-known companies that leverage Flink include Alibaba, Amazon, eBay, Netflix, and Uber. Here’s a comprehensive list of businesses that use Flink.

Level up with Hazelcast