Stream Processing Metamorphosis - A Kafka’s tale

João Vazao Vasques
Talkdesk Engineering
6 min read · Jun 29, 2017

Every new internet startup can be seen as an act of creation. You start small with a limited scope: a small web app (e.g. Rails) and a database. Assuming you're lucky, your little startup begins to grow and you need to develop additional systems to provide the functionality your customers demand. This growth, although exciting, brings additional complexity, mainly in the interactions and dependencies between the new systems being developed.

Does this story sound familiar to you? Let me tell you, you're not alone. Around 2011, LinkedIn was facing similar issues, as you can see in the figure below.

Figure 1: LinkedIn around 2011 [source]

As you can probably imagine, it’s almost impossible for anyone to understand the data flow and dependencies that exist in such an architecture diagram. Also, think about the cost of adding a new system that has dependencies on, let’s say, two others. Not practical. Something had to be done and that’s how Kafka was born.

The log abstraction and Kafka

At a high level, Kafka can be seen as a log. I find the definition from Jay Kreps, one of Kafka's creators, clear and concise.

“A log is an append-only sequence of records ordered by time” — Jay Kreps

Figure 2: a log

The log is a very popular abstraction in computing, since many things can be modeled with it. Some examples are:

  • a file is an array of bytes
  • a table is an array of records

In the context of these two examples, we can say that a log is a table or file where the entries are sorted by time. Logs are also used in distributed systems in areas such as state machine replication and distributed consensus.
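
To make the abstraction concrete, here is a minimal, in-memory sketch of a log in Python (the names are illustrative, not from Kafka's codebase); Kafka is, of course, a persistent and distributed version of this idea:

```python
# A minimal sketch of the log abstraction: an append-only sequence of
# records, each identified by its position in the sequence (its offset).
class Log:
    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Read all records from a given offset onwards."""
        return self._records[offset:]

log = Log()
log.append({"event": "call_started"})  # offset 0
log.append({"event": "call_ended"})    # offset 1
print(log.read(0))  # replay the whole log from the beginning
```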

As you’ll see shortly, the log abstraction and Kafka are tightly related.

Deep dive into Kafka

Okay, now we’re going to deep dive into Kafka! While presenting some of Kafka’s internals I will make a comparison with RabbitMQ (a popular messaging we use here at Talkdesk).

Kafka is a distributed system: there is no single master node through which all reads and writes must flow. Every node in a Kafka cluster is called a broker. Each broker stores one or more partitions of a topic.

Figure 3: Anatomy of a Kafka topic

A topic is the main abstraction in Kafka. It's a stream of messages of a particular type (e.g. calls, user clicks, likes, page views). A generic topic, T, is divided into P partitions. As you can see in Figure 3, messages in each partition are identified by a sequential id called the offset. A partition is the smallest unit of parallelism in Kafka. The direct implication of this is that Kafka only maintains the order of messages within each partition. It might be tempting to think of Kafka topics as RabbitMQ queues, but Kafka topics are global to the whole system, while RabbitMQ queues are tied to the consumer instance that creates them.

Topics should be replicated. Each partition has one leader and zero or more follower replicas. A replica is said to be in-sync if it's not far behind the leader. When a topic is created, a replication factor is chosen; changing it afterwards requires a manual partition reassignment.
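
As a sketch of how this looks in practice, here is how a topic with four partitions and a replication factor of three could be created with the kafka-python client (the client library, topic name, and broker address are my assumptions for illustration):

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Broker address and topic name are illustrative.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# A topic "calls" with P=4 partitions, each replicated across R=3 brokers.
admin.create_topics([NewTopic(name="calls", num_partitions=4, replication_factor=3)])
```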

Remember the similarities between the log abstraction and Kafka? A closer look at Figure 3 shows how a topic partition is very similar to a log (Figure 2).

What can I do with Kafka?

Clients interact with Kafka to produce and consume messages.

A Kafka Producer publishes messages to a partition in a topic. In a topic with P partitions, how does Kafka choose the partition for a given message? When publishing a message, one can specify a key. If a key is present, Kafka hashes it (modulo the number of partitions) to select one; if not, a round-robin policy is used. Because partitions are replicated, a producer can wait for acknowledgements when producing a message. The acknowledgement setting is configurable: 0 (don't wait at all), 1 (wait for the partition leader), or all (wait for every in-sync replica). Before selecting a value it is important to consider the trade-off between latency and durability.
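
Here is a minimal producer sketch using the kafka-python client, showing a keyed send, an unkeyed send, and the acks setting (the topic, key, and broker address are illustrative assumptions):

```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # illustrative broker address
    acks="all",  # 0 = don't wait, 1 = wait for the leader, "all" = wait for all in-sync replicas
)

# Keyed send: messages with the same key always land in the same partition,
# because the partition is chosen by hashing the key.
future = producer.send("calls", key=b"agent-42", value=b'{"event": "call_started"}')
metadata = future.get(timeout=10)  # block until the broker acknowledges
print(metadata.partition, metadata.offset)

# Unkeyed send: the client's default partitioner spreads messages across partitions.
producer.send("calls", value=b'{"event": "heartbeat"}')
producer.flush()
```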

Consumers in Kafka have some considerable differences when compared to other messaging systems such as RabbitMQ. Kafka's consumers are pull-based (RabbitMQ's are push-based), which means that consumers explicitly request messages from the cluster. Multiple consumers can read from the same topic, but each partition is read by a single consumer (because the partition is the smallest unit of parallelism). Messages stay in Kafka after they are consumed. This is a big difference compared with RabbitMQ and has major implications, as you will see next. One interesting concept that Kafka introduces is the notion of consumer groups.

Figure 4: An example of Kafka’s consumer groups

Consumer groups are associated with a topic and they allow Kafka to scale consumers horizontally. Imagine you have a topic with four partitions and two consumers in a group (Figure 4). Each of those consumers will own two partitions of that topic. If you add an additional consumer to that group, a rebalancing process will take place and in the end you will have two consumers owning one partition each and the other consumer owning two. This allows Kafka to scale consumers very easily without any changes in the code. Going back to RabbitMQ, scaling consumers is not that simple.
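
A minimal consumer sketch with kafka-python illustrates the idea: every process started with the same group_id joins the same consumer group, and the topic's partitions are rebalanced across its members (the group id, topic, and broker address are illustrative assumptions):

```python
from kafka import KafkaConsumer

# Start a second process with the same group_id and the four partitions
# of "calls" will be split between the two consumers after a rebalance.
consumer = KafkaConsumer(
    "calls",
    group_id="billing",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # where to start when the group has no committed offset
)

for message in consumer:
    print(message.partition, message.offset, message.value)
```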

A lot more could be said about Kafka (Confluent's blog is an amazing source of knowledge), but since you already have a good overview of it, I am instead going to show you the metamorphosis that Kafka brought to stream processing.

The metamorphosis

The way companies manage and interact with their data has changed dramatically over the years. This is a consequence of two facts:

  1. Companies are collecting very large volumes of data;
  2. The world is changing faster than ever.

Due to these two facts, the way companies make decisions also needs to be faster. Companies make decisions based on the data they collect and analyse, so a direct consequence is that they need to collect and process data faster. This data often comes in the form of events and requires that:

  1. Event processing needs to be fast (on the order of 100K events/sec) and fault-tolerant;
  2. Events need to be stored and reprocessed.

Kafka meets all of these crucial requirements.

First, it’s blazing fast as some impressive results show. Kafka has some design decisions that allow it to be that fast such not having complex message routing (e.g. as RabbitMQ does).

Second, messages are stored in Kafka (in each topic) for a configurable amount of time (the default is 7 days). In RabbitMQ and other systems, you have to develop a "logger consumer" and store all of your events somewhere else (S3, Cassandra, etc.). By storing messages for a given amount of time, Kafka allows anyone to replay events at any point in time without having to deploy any additional infrastructure. This brings benefits that range from debugging to training machine learning models. You don't have to take my word for it: all relevant processing frameworks, such as Flink or Spark, have Kafka as their recommended connector for ingesting streaming data.
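
As a sketch of what replay looks like, here is how one could rewind a partition to the beginning (or to any retained offset) with kafka-python; the topic name, partition number, and broker address are again illustrative:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
partition = TopicPartition("calls", 0)
consumer.assign([partition])  # manual assignment, no consumer group needed

# Rewind to the first retained offset (or use consumer.seek(partition, 1000)
# to start from an arbitrary offset instead).
consumer.seek_to_beginning(partition)

for message in consumer:
    print(message.offset, message.value)
```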

Kafka continues to appear in new data processing architectures. For example, if you work with data processing you have probably heard of the term Lambda Architecture. Recently, a simplification of this architecture that removes the batch layer has emerged, named the Kappa Architecture. As you might suspect, Kafka is the central piece of this new architecture.

Kafka has been a crucial milestone for stream processing, and the many ways it has been adopted are a testament to its power and flexibility. Personally, as part of the Data team here at Talkdesk, I am very excited about the cool things we can do with Kafka and would love to hear about your use cases as well.
