joshita / dev

Published

- 2 min read

Apache Flink

  1. Apache Flink is a distributed stream processing framework designed for real-time and batch processing of data.
  2. Flink uses a stream-first architecture; even batch jobs are internally treated as bounded streams.
  3. Common use cases include real-time data pipelines, complex event processing, fraud detection, recommendation engines, and batch data processing.
  4. Supports exactly-once state consistency and fault tolerance, making it ideal for real-time applications.
  5. Offers event time semantics, handling late-arriving data via watermarks and windowing.
  6. Supports various windowing strategies: time-based, count-based, session-based, and custom windowing.
  7. Integrates seamlessly with Kafka, Kinesis, Hadoop, Elasticsearch, JDBC, and other data systems.

Watermarking Mechanism in Apache Flink

Watermarks let Flink process real-time data while accommodating out-of-order events: a watermark defines how late an event can arrive and still be processed within a bounded amount of time.

Why Watermarking?

  1. System latencies
  2. Network delays
  3. Delays in ingestion from Kafka or other ingestion sources

A watermark tracks progress through an event stream and decides when to close a window. For late-arriving data, Flink defines a threshold for how long to wait for delayed events. Watermarks advance monotonically and never move backward. Events that arrive after the watermark has passed their timestamp can be marked as late or discarded. All of this is based on event time.
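The watermark rule above can be sketched in plain Java (an illustrative simulation, not Flink's actual API). With an assumed out-of-orderness bound of 5 seconds, the watermark trails the maximum timestamp seen by that bound and never moves backward; events at or below the watermark are late.

```java
public class WatermarkDemo {
    static final long MAX_OUT_OF_ORDERNESS_MS = 5_000; // assumed 5-second bound

    long maxTimestampSeen = Long.MIN_VALUE;
    long watermark = Long.MIN_VALUE;

    /** Returns true if the event is late (its timestamp is at or before the current watermark). */
    boolean process(long eventTimestampMs) {
        boolean late = eventTimestampMs <= watermark;
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
        // Watermark = max timestamp seen minus the bound; Math.max keeps it from moving backward.
        watermark = Math.max(watermark, maxTimestampSeen - MAX_OUT_OF_ORDERNESS_MS);
        return late;
    }
}
```

Feeding it timestamps 10s, 7s, 4s shows the behavior: 7s is out of order but still inside the bound, while 4s falls behind the watermark (5s) and is flagged late.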

Windowing

Windowing is a feature of Flink's stream-based architecture that lets you group streaming data over time-based or count-based intervals. Why is windowing needed? Streaming data is infinite, so there must be some aggregation, grouping, or segmentation of the data that can be processed at a time before collecting new data. Examples: calculating the average temperature every minute, or counting website clicks every 5 seconds.

Types of windows in Flink

  1. Tumbling window: fixed-size, non-overlapping windows.
  2. Sliding window: fixed-size, overlapping windows that advance by a slide interval. Example: average stock price over the last 5 seconds, updated every 2 seconds.
  3. Session window: groups events into sessions based on gaps in activity. Example: user activity sessions in a web application.
  4. Global window: places all events in a single window; user-defined triggers decide when to process them.
  5. Custom window: allows developers to define their own windowing logic.
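The first three window types above can be illustrated with the assignment arithmetic behind them (a plain-Java sketch, not the Flink API): a tumbling window start is the timestamp rounded down to the window size, a sliding-window event belongs to every window whose start lies within one window size before it, and session windows split wherever the gap between consecutive events exceeds the session gap.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowDemo {
    /** Start of the tumbling window containing ts (windows are [start, start + sizeMs)). */
    static long tumblingStart(long ts, long sizeMs) {
        return ts - (ts % sizeMs);
    }

    /** Starts of every sliding window containing ts, newest first. */
    static List<Long> slidingStarts(long ts, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        for (long start = ts - (ts % slideMs); start > ts - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    /** Splits sorted timestamps into sessions wherever the gap between events exceeds gapMs. */
    static List<List<Long>> sessions(List<Long> sortedTs, long gapMs) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTs) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > gapMs) {
                result.add(current);      // gap too large: close the session
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) result.add(current);
        return result;
    }
}
```

For example, an event at t=6s with a 5-second sliding window and a 2-second slide lands in three windows (starting at 6s, 4s, and 2s), which is why sliding windows produce overlapping results.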

Handling late arrival of data

  1. Flink uses event time, not processing time, to decide whether data is late.
  2. Flink uses watermarks to track progress in the stream; events with a timestamp earlier than the current watermark are considered late.
  3. Late events can be routed to a side output (a separate stream) and handled there.
  4. By default, late events are discarded unless lateness is explicitly allowed.
      OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};

      SingleOutputStreamOperator<Event> result = input
          .keyBy(Event::getKey)
          .window(TumblingEventTimeWindows.of(Time.seconds(10))) // 10-second window
          .allowedLateness(Time.seconds(5))                      // 5-second grace period
          .sideOutputLateData(lateTag)                           // route very late events to a side output
          .sum("value");
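The allowedLateness behavior boils down to three regimes, sketched here in plain Java as a rule of thumb (an illustration, not Flink internals): before the watermark reaches the window end, an event goes into the window normally; between the window end and end-plus-lateness, the window re-fires with the late arrival; after that, the event is dropped, or diverted to a side output if one is configured.

```java
public class AllowedLateness {
    enum Disposition { IN_WINDOW, LATE_FIRING, DROPPED }

    /** Where an event for a window ending at windowEnd lands, given the current watermark. */
    static Disposition classify(long windowEnd, long watermark, long allowedLatenessMs) {
        if (watermark < windowEnd) return Disposition.IN_WINDOW;                       // window still open
        if (watermark < windowEnd + allowedLatenessMs) return Disposition.LATE_FIRING; // grace period: re-fire
        return Disposition.DROPPED;                                                    // past grace period
    }
}
```

With a 10-second window and 5 seconds of allowed lateness, an event arriving while the watermark is at 12s still triggers a late firing, but one arriving at watermark 16s is gone.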