Overview of Apache Kafka's Internal Components

In this post, we'll look inside Apache Kafka, an event streaming system designed to let applications act on new events as soon as they occur. We'll cover the core concepts and components that form the foundation of Kafka's architecture.

Kafka's Core Architecture

At the heart of Kafka's architecture is the storage layer, designed to store data efficiently as events. This storage layer is distributed, so it can scale out as needs grow. Above it sits a separate processing layer, where streams and tables live. Because the processing layer is separate from the storage layer, the two can scale independently, which provides both flexibility and efficiency.

Primitive APIs

Kafka provides two primitive APIs for accessing the data in the storage layer. The Producer API publishes events into the storage layer, and the Consumer API lets applications read events back out of it.
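
To make the write path and read path concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Kafka client API: the `storage` dictionary and the `produce` and `consume` functions are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the storage layer: topic name -> list of events.
storage = defaultdict(list)


def produce(topic, event):
    """Publish: append an event to the end of the named topic."""
    storage[topic].append(event)


def consume(topic, start=0):
    """Read: return the events from position `start` onward.

    Reading does not remove anything, so several readers can
    consume the same events independently.
    """
    return storage[topic][start:]


produce("orders", "order-created")
produce("orders", "order-paid")

first_read = consume("orders")            # every event so far
second_read = consume("orders", start=1)  # skip the first event
```

The point of the sketch is that publishing and reading are separate operations against the same stored data, and that reading leaves the data in place.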

High-Level APIs

To make Kafka more versatile and easier to integrate with other systems, two high-level APIs were built on top of the core. The first is the Connect API, designed for integration with external data sources and data sinks through source and sink connectors. The second is the processing API, which includes Kafka Streams for Java developers and ksqlDB (formerly KSQL), which takes a more declarative approach with SQL-like syntax for continuous event processing.

Core Concepts of Kafka

Events

At the core of Kafka are events, which represent occurrences in the world. Each event is modeled as a record with the following attributes:

Record =>
    timestamp
    key
    value
    headers

The payload usually goes in the value. The key plays several roles: it enforces ordering among events that share a key, co-locates those events in the same partition, and supports key-based retention, where log compaction keeps the latest event for each key.
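
The record layout above can be sketched as a small Python data class. This is an illustration under stated assumptions, not Kafka's wire format: the class name `Record` mirrors the listing, while the field types (bytes for key and value, a millisecond timestamp, headers as name/value pairs) and the sample data are choices made for this example.

```python
import time
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)  # frozen: a record cannot be changed once created
class Record:
    value: bytes                      # the payload
    key: Optional[bytes] = None       # optional; used for ordering and co-location
    # Assumption for this sketch: milliseconds since the Unix epoch.
    timestamp: int = field(default_factory=lambda: int(time.time() * 1000))
    # Assumption for this sketch: headers as (name, value) pairs.
    headers: Tuple[Tuple[str, bytes], ...] = ()


record = Record(
    key=b"user-42",
    value=b'{"action": "login"}',
    headers=(("source", b"web"),),
)
```

Making the class frozen reflects the idea that an event describes something that already happened, so it is written once and not edited afterwards.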

Topics

Topics in Kafka are analogous to database tables, organizing events of the same type together. When publishing events, you specify the target topic, and when consuming events, you subscribe to specific topics. All events within a topic are immutable and appended sequentially. To distribute data across the Kafka cluster, topics are divided into partitions.

Partitions

Partitions are the units of data distribution within topics. When creating a topic, you can specify one or more partitions. Each partition is typically stored on a single broker in the Kafka cluster, but the Tiered Storage feature from Confluent allows a partition's data to exceed the capacity of a single broker. Partitions enable parallelism: each one can be accessed independently, so reads and writes can proceed concurrently.
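
A common way to spread keyed events across partitions is to hash the key and take the result modulo the partition count. The sketch below shows that idea in Python. It is a simplified model: `partition_for` is a hypothetical helper, and `zlib.crc32` is used only as a convenient stable hash, which is an assumption of this example and not the hash function Kafka's own clients use.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition index; the same key always maps to the same index."""
    # crc32 is a stand-in hash for illustration only.
    return zlib.crc32(key) % num_partitions


NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]  # one append-only list per partition

events = [
    (b"alice", "login"),
    (b"bob", "login"),
    (b"alice", "purchase"),
    (b"alice", "logout"),
]

for key, value in events:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, value))

# All of alice's events sit in one partition, in the order they were published.
alice_partition = partition_for(b"alice", NUM_PARTITIONS)
alice_events = [v for k, v in partitions[alice_partition] if k == b"alice"]
```

Because every event with a given key lands in the same partition, the relative order of those events is preserved, while events with different keys can be written and read in parallel.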

Offsets

Each event within a Kafka topic partition has an identifier called an offset, which is unique within that partition. Offsets are monotonically increasing numbers and are never reused. They record the order of events within a partition and let consumers keep track of their progress when processing events.
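
The sketch below models a single partition as an append-only list whose indexes act as offsets, and shows how a reader can resume from a saved position. It is a toy model in Python: the `Partition` class, its `append` and `read` methods, and the `position` variable are hypothetical names for illustration, not part of any Kafka client library.

```python
class Partition:
    """Toy append-only log for one partition; the list index serves as the offset."""

    def __init__(self):
        self._log = []

    def append(self, event):
        """Append an event and return the offset assigned to it."""
        self._log.append(event)
        return len(self._log) - 1

    def read(self, offset, max_events=10):
        """Return up to `max_events` (offset, event) pairs starting at `offset`."""
        chunk = self._log[offset:offset + max_events]
        return list(enumerate(chunk, start=offset))


partition = Partition()
offsets = [partition.append(e) for e in ["a", "b", "c", "d"]]  # offsets 0, 1, 2, 3

# A reader tracks the offset of the next event it wants to read.
position = 0
batch = partition.read(position, max_events=2)  # events at offsets 0 and 1
position = batch[-1][0] + 1                     # next offset to read is 2

# After a restart, the reader picks up where it left off.
resumed = partition.read(position)              # events at offsets 2 and 3
```

Storing the next offset to read, instead of a copy of the processed events, is what lets a reader stop and resume without the log itself keeping any per-reader state.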

Conclusion

Apache Kafka is a powerful event streaming system designed for storing and processing data efficiently in a distributed environment. Its core components, including events, topics, partitions, and offsets, form the foundation of its architecture. Kafka's flexibility, scalability, and versatility make it a popular choice for various use cases, including real-time data processing, messaging systems, and event-driven architectures.

I hope this overview has given you valuable insights into the internal components of Apache Kafka.

Thank you for reading.
