Event streaming using Apache Kafka
by Amit Sahu
There are levels of event recording in accordance with the use cases for which we are tracking like application logging (Log4j levels). Now a mission-critical application (like transactions) will generate a huge number of events, this will enable to debug/analyse in case of any failure or audits. For low to medium priority apps, we can keep it simple and less frequent.
Before we get into details of Kafka-enabled event streaming, let us discuss event processing.
The concept of event processing came into existence with the advent of advanced tracking, logging and reporting of metrics in services where each event is considered as a micro milestone or simply a marker. When these microdata bits are put together, they form data sets to do wonders. These events can be processed (incremental reports), compiled (monitoring and enriching) or stored for historical analysis (predictive analysis).
Take a real-world scenario, an e-commerce website. The customer interactions will generate a massive log of events, real transactions, click logs, queries, etc. The utilisation of these data-sets does not provide stakeholders with any essential business insight. Businesses that effectively get the insights from these data-sets will always have an upper hand in the market or respective domain sourced directly from the customers. Fortunately, organisations have now realised the importance of event processing. Even though considered troublesome, it is worth the trouble. Some have even explored out data sources like social media extractions with advanced data mining tools/vendors.
Need of an event streaming platform
Receiving and storing events from multiple sources can be tricky, as these events may or may not have a uniform structure. Given their huge volume, any legacy platform such as a database can be overwhelming. Also, it is not necessary that each of these events, in general, is critical. That is why it is required to have a system/platform which can perform quick writes and effective reads.
On the other hand, this system also needs to be highly available, partitioned for parallel reads and configurable for a variety of sources in an enterprise. This is where Kafka comes into the picture.
Kafka: Some underlying concepts
Kafka provides both horizontal and vertical scalability. Kafka is optimised for fast writes. Data in Kafka is bifurcated as topics which are mutually exclusive. For different data requirements, one can use multiple Kafka topics. These topics are partitioned, or simply log portions which are stored separately in topic-<x> (where x is the partition number) directories in the broker disc (Storage). Partitioning enables multiple consumer threads to consume data from topics in parallel by saying this, Kafka consumer threads have a limitation, i.e. a single consumer thread can only consume to one partition of a topic in a given time.
Now, these partitions are replicated across brokers. In a nutshell, the topic is a logical unit of storage and partitions are physical units of storage. This will make sense as the overall Kafka cluster health is determined with the number of ISRs or offline partitions in the cluster. These replicas of partitions are present across brokers, in case any broker failure happens, automatically next copy takes its place. How does this happens? Each partition replicas have a leader and others are followers. These make-up the ISRs i.e. in-sync replicas. In case any replica unable to follow up with the leader for data replication, it is kicked out the ISR list and loses the ability to become the leader for that partition. Kicked out stale replicas are called OSR or out of sync replicas. When the leader partition fails, next partition in the ISR list takes place and serves producer and consumer requests.
The data stored in Kafka is converted into byte arrays or simply binary form, this helps faster replications within the cluster. Binary formats are easier to transport over the wire and use comparatively low network resources.
Partitioning and replication factor for a topic are specified during the creation itself and can be updated in the runtime. The partitions of a Kafka topic can only be increased.
Brokers act as an intermediator between the producers and the consumers. They provide an abstraction for memory management and distributed processing. Brokers follow a load-balanced operation and depend on the zookeeper ensemble for communication and state management.
Zookeeper in Kafka is a centralised system that maintains states, configuration and naming of Kafka components. Zookeeper is a top-level Apache project used in many distributed systems like Apache Nifi, Apache Hadoop. It is responsible for orderly updates in Kafka states and operation.
Kafka as an event streaming platform
As quoted by Apache Software Foundation, “Kafka® is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies,” which has revolutionized event streaming.
Let’s discuss some salient features of Kafka:
Kafka stores messages as immutable commit log, which can be subscribed by multiple systems at very high rates. Kafka is optimised for extremely fast writes.
It decouples stream producing and stream consuming systems, i.e. now we can have different producing and consuming rates and buffer or stage the balance data in the discs.
All Kafka components are multi-node configurable either in a load-balanced or a master-slave setup. This makes it preferable for production-grade use-cases as it eradicates single point of failures (SPOFs). (Kafka brokers are load-balanced, Zookeepers, schema registry follow master-slave architecture).
It has built-in replication, partitioning and fault-tolerance, hence enterprise architects don’t need to design it, they just need to configure it. (single partitioned topics like “_schemas” are highly durable but will not be available for multiple consumptions).
Kafka is not a memory-intensive application as it flushes or commits data directly to disc. So higher the read/write speed of the system, lower is the throughput.
Configurable durability guarantee i.e. you can specifically declare the pipelines as critical or non-critical. Strong durability guarantees impact latency of that pipeline (Acks/ idempotence).
A scalable system, you can always up-scale and down-scale brokers in the cluster for matching variable loads for busy seasons (adding/removing new brokers in the cluster and reassigning partitions using Kafka re-assign partitioning tool).
Multi-data centre support for disaster-recovery using tools like Mirror maker or confluent replicator.
Apache Kafka has emerged as a key platform for distributed streaming. It is growingly becoming the backbone of the IT architecture of enterprises that are focusing on real-time data streaming.
It is possible to process the data stream directly using an API. However, for complex data streams, Kafka provides fully integrated streams API which is built on the core primitives of Kafka: uses producer and consumer APIs for input, Kafka for stateful storage, and the same group mechanism for fault tolerance among the stream processor instances.
With Kafka as a solution for event stream processing, the need for an integrated platform for receiving, storing and processing events for modern enterprises can be well achieved. Kafka ships with a variety of configurable components which makes it a preferable choice for event processing.