Apache Flink
Overview
Apache Flink is a powerful open-source framework and distributed processing engine for performing stateful ingestion and computations (transformations) over unbounded (streaming) and bounded (batch) data. It excels at low-latency, high-throughput stream processing, making it ideal for real-time analytics, event-driven applications, and continuous data pipelines. When used with CrateDB, Flink can ingest, transform, and enrich streaming data before persisting it into CrateDB’s distributed SQL engine. Together, they provide a scalable, end-to-end solution for both data streaming and analytical workloads.
Benefits of CrateDB + Apache Flink
Real-time data ingestion – Stream data from Kafka, MQTT, or custom sources through Flink and insert or upsert directly into CrateDB for instant availability.
Scalable stream processing – Flink’s distributed engine and CrateDB’s shared-nothing architecture scale horizontally to handle growing data volumes and concurrent analytics.
Low-latency analytics – Flink processes events in milliseconds, while CrateDB’s columnar storage and distributed SQL queries provide sub-second response times.
Unified batch and stream handling – Use the same Flink jobs to process both historical (bounded) and live (unbounded) data, storing results in CrateDB for consistent querying.
Stateful transformations and enrichment – Apply windowed aggregations, joins, and enrichments in Flink before persisting processed results in CrateDB.
Simplified integration – Both systems offer native SQL interfaces and JDBC compatibility, making it easy to connect existing tools and pipelines.
Fault tolerance and reliability – Flink’s checkpointing and CrateDB’s data replication ensure resilience against node or network failures.
Operational visibility – CrateDB can serve as both a sink for processed data and a query layer for monitoring Flink job metrics, results, and derived analytics.
Flexible deployment options – Both run seamlessly on Kubernetes, Docker, and bare metal, supporting hybrid and cloud-native architectures.
Ingestion Options
Bounded/Batch
A bounded collection is one that has a defined size, i.e. a batch (run-once) job. This could be a file import or a saved database export. There is a defined number of records, and once they are processed the job completes.
Unbounded/Streaming
An unbounded collection is one that effectively never ends: there is a stream of new records, which may arrive continuously or episodically. The important difference from a bounded job is that the job never ends. It remains active, polling or waiting for new records.
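As an illustration, the distinction shows up directly in how a Flink SQL table is declared: a filesystem source is bounded, while a Kafka source is unbounded. All table names, paths, and connection details below are hypothetical; the connector options are from the standard Flink filesystem and Kafka connectors.

```sql
-- Bounded: a CSV file import; the job reads the file and completes.
CREATE TABLE sensor_import (
  sensor_id STRING,
  reading   DOUBLE,
  ts        TIMESTAMP(3)
) WITH (
  'connector' = 'filesystem',
  'path'      = 'file:///data/sensors.csv',
  'format'    = 'csv'
);

-- Unbounded: a Kafka topic; the job stays active, waiting for new records.
CREATE TABLE sensor_stream (
  sensor_id STRING,
  reading   DOUBLE,
  ts        TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic'     = 'sensors',
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format'    = 'json'
);
```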
Job & Task Managers
Apache Flink is highly scalable. In a running system there is one active job manager; in a high-availability cluster there can be additional standby job managers that become active if the primary fails.
Within a cluster, there is typically one task manager per Flink node. These task managers register with the job manager. When a new job comes in, the job manager assigns it to one or more task managers depending on the exact definition of the job. For instance, some operators of a job can run on certain nodes, while other operators within the same job run on other Flink task manager nodes.
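The capacity each task manager offers, and the high-availability behaviour described above, are driven by the Flink configuration. A minimal sketch (the values and the job manager hostname here are purely illustrative):

```yaml
# flink-conf.yaml (illustrative values)
jobmanager.rpc.address: flink-jobmanager  # job manager that task managers register with
taskmanager.numberOfTaskSlots: 4          # parallel task slots offered by each task manager
parallelism.default: 2                    # default parallelism for submitted jobs
high-availability.type: zookeeper         # optional: enables standby job managers
```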
How to Define Jobs
Apache Flink is built in Java, and all jobs are packaged as JARs. There are two main paths to building these JARs: using the Flink Table API & SQL, or writing custom code.
Flink SQL
This is the easier route to getting ingestion running with Apache Flink. It is not designed for building full streaming applications, but it is a fairly straightforward way of ingesting data from a source and writing it to a destination.
There are several approaches to use, the main two are listed here:
Use the SQL Client (https://nightlies.apache.org/flink/flink-docs-release-2.1/docs/dev/table/sqlclient/) with two .sql files: one containing initialisation code, and one containing the actual SQL statements for the ingestion. This works only if the required connector JARs are installed on the Apache Flink cluster.
Package as a single JAR file and upload to the Apache Flink cluster. This allows use of the Table API (https://nightlies.apache.org/flink/flink-docs-release-2.1/docs/dev/table/tableapi/) for querying source databases.
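As a sketch of the SQL Client approach (all table names, hosts, and credentials below are hypothetical), the initialisation file defines the source and sink tables, and the statements file runs the ingestion. CrateDB speaks the PostgreSQL wire protocol, so one way to write to it is the Flink JDBC connector with a PostgreSQL-compatible URL:

```sql
-- init.sql: table definitions
CREATE TABLE readings_source (
  sensor_id STRING,
  reading   DOUBLE,
  ts        TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic'     = 'readings',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format'    = 'json'
);

CREATE TABLE readings_sink (
  sensor_id STRING,
  reading   DOUBLE,
  ts        TIMESTAMP(3)
) WITH (
  'connector'  = 'jdbc',
  'url'        = 'jdbc:postgresql://cratedb:5432/doc',
  'table-name' = 'readings',
  'username'   = 'crate'
);

-- ingest.sql: the actual ingestion statement
INSERT INTO readings_sink
SELECT sensor_id, reading, ts FROM readings_source;
```

This pair of files would then be submitted with something like `./bin/sql-client.sh -i init.sql -f ingest.sql`.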
Using JARs for custom Java code
This is more involved, and beyond the scope of this document, but in essence it means using the Flink Java APIs directly to construct jobs and then submitting them to the stream execution environment for execution.
Apache Flink Concepts
Dashboard
The dashboard is the web-based UI for Apache Flink. Here is a high-level list of features and typical associated user actions.
Cluster Overview
Shows the health and resource usage of the cluster (JobManagers, TaskManagers, slots, task load, etc.)
Check cluster uptime, available slots, version, metrics summary
Job Management
Displays all submitted jobs and their state (running, finished, failed, cancelling)
Start, cancel, or rescale jobs; trigger savepoints
TaskManager View
Shows each worker node and its running subtasks
Monitor load, logs, metrics per TaskManager
Job Graph / DAG
Visual DAG of your job showing operators and data flow
Inspect operator state, parallelism, backpressure, data rates
Metrics & Backpressure
Fine-grained performance metrics and backpressure indicators
Identify bottlenecks, tune parallelism or checkpoint intervals
Checkpoints & Savepoints
Shows status of recent checkpoints and allows manual savepoints
Verify fault tolerance, trigger or cancel checkpoints
Exceptions & Logs
Surfaced exceptions, root causes, logs, and task restarts
Debug task failures, inspect stdout/stderr
Configuration & Environment
Shows Flink configuration, classpath, and version info
Verify environment settings for deployments
Process Functions (i.e. pipeline stages)
This is the core of how Apache Flink works. A job is constructed of processes connected as a pipeline: one or more source connectors feed the earlier stages, earlier stages feed into later stages, and results end up at one or more destination connectors.
This allows nearly unlimited flexibility in how ingestion functions: carrying out complex transformations of data, talking to multiple source and destination systems, and performing all of this in a highly performant way.
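For instance, a transformation stage between a source and a destination can be expressed in Flink SQL as a windowed aggregation. This is a sketch only: the tables sensor_events and sensor_minutely are hypothetical, sensor_events is assumed to carry a watermark on its ts column, and the one-minute window size is arbitrary.

```sql
-- A pipeline stage: 1-minute tumbling-window averages per sensor.
INSERT INTO sensor_minutely
SELECT
  sensor_id,
  AVG(reading) AS avg_reading,
  window_start AS ts
FROM TABLE(
  TUMBLE(TABLE sensor_events, DESCRIPTOR(ts), INTERVAL '1' MINUTE)
)
GROUP BY sensor_id, window_start, window_end;
```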
The relevant pages in the Apache Flink docs are recommended reading, rather than repeating them here.
Parallelism
Apache Flink has powerful support for parallelism at each stage of ingestion. These are essentially multiple threads of operation that can also be run across distributed nodes in a cluster.
Some examples:
Some source connectors support reading data in parallel. MongoDB, for example, offers a split-vector strategy where a source collection is split into even partitions and reads happen concurrently from each partition. This substantially speeds up reads if the collection allows it.
Process functions are ideal candidates for parallelism. For example, a process function that carries out some transformation of records can naturally run those transformations in parallel.
Parallelism can also make sense for certain destinations. But it is worth remembering that with CrateDB as a destination, the best ingestion performance often comes from batching records into multi-row INSERT statements, and having many concurrent writers may in fact reduce performance.
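In Flink SQL, both knobs can be set per session or per sink. The option names below are from the standard Flink JDBC connector; the values, table names, and connection details are illustrative only.

```sql
-- Session-level default parallelism for the job's operators.
SET 'parallelism.default' = '4';

-- JDBC sink batching: flush up to 5000 rows or every second,
-- whichever comes first, and keep a single writer to CrateDB.
CREATE TABLE cratedb_sink (
  sensor_id STRING,
  reading   DOUBLE
) WITH (
  'connector'  = 'jdbc',
  'url'        = 'jdbc:postgresql://cratedb:5432/doc',
  'table-name' = 'readings',
  'sink.buffer-flush.max-rows' = '5000',
  'sink.buffer-flush.interval' = '1s',
  'sink.parallelism' = '1'
);
```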
Backpressure
Backpressure occurs when a source or process function generates data faster than a downstream function or destination connector can process it. It is not a fatal error, just a warning, but it is a good indication that the balance of source, process, and destination functions needs adjusting, i.e. more downstream parallelism or capacity.
Checkpoints
Checkpoints are a key part of the mechanism for restartability and resumability. A checkpoint records the position in each source up to which records have been read and processed. In the case of failure (e.g. the destination has become unavailable), this allows the job to be resumed or restarted safely from a known "good point".
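A minimal way to turn checkpointing on for a Flink SQL session is shown below. The configuration keys are standard Flink options; the interval and checkpoint directory are illustrative, and the directory must be reachable from all nodes in the cluster.

```sql
-- Enable periodic checkpoints for the session (illustrative values).
SET 'execution.checkpointing.interval' = '30s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
-- Where checkpoint state is stored.
SET 'state.checkpoints.dir' = 'file:///flink/checkpoints';
```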
There are many dimensions to how this is configured; see the official docs.