Apache Kafka FAQ

09 August, 2019 by Jack Vamvas

These are some common questions I've received while working with Apache Kafka. I'll keep adding more detail.

What is Apache Kafka?

Apache Kafka is a distributed, partitioned, replicated commit log service. It can also be described as a publish-subscribe messaging system for distributed applications.

What was the original purpose of Apache Kafka?

Apache Kafka was developed within LinkedIn, which invested in building a single, unified, distributed pub-sub data pipeline. Prior to Kafka, LinkedIn maintained multiple data pipelines for data such as InMail messaging, site events (e.g. site views) and other operational data. Kafka was developed to unify these pipelines, so that the scaling effort went into one platform rather than several.

Describe some common use cases for Apache Kafka

Activity tracking - think of a publish-subscribe feed for event processing or monitoring. This was the original use case.

Database updates and downstream processing - for example, a profile update on a web site may require other applications to be notified.

Aggregating metrics - pulling metrics from different source logs and placing them in a central repository.

Stream processing - this has great potential. For example, process multiple daily news feeds from different sources, then aggregate, enrich and publish them.

Messaging - think of it as an alternative to RabbitMQ or ActiveMQ.

What are some principles to consider when deciding whether data streaming, such as Apache Kafka, should be used?

Engineers should not stream all data simply because data streaming is available. There are significant considerations in deciding to change existing batch jobs to data streaming.

Questions to ask

1) Which stream processing engine should be chosen? Will the data be event-based streamed or batched? The term micro-batching refers to batch processing at a higher frequency. The key point is that although it batches more frequently than standard schedules, it is not event driven. If the aim is simply to have the data arrive more frequently and batch is already in place, continue to use batch.

2) Is lambda architecture required? Lambda architecture balances performance/SLA requirements and fault tolerance by creating a batch layer that provides a comprehensive and accurate “correct” view of batch data, while simultaneously implementing a speed layer for real-time stream processing that provides potentially incomplete but timely (in the context of the application) views of online data. This is also known as a hybrid approach.

It may be the case that an existing batch job simply needs to be augmented with a speed layer. If so, choosing a data-processing engine that supports both layers of the lambda architecture may facilitate code reuse.

3) How invested are other parts of the organisation in data streaming? Usually, if a technology platform is heavily utilised, then there tends to be deployment and ops knowledge surrounding the product.

4) How is ETL used within the organisation? Apache Kafka needs to be viewed in the context of the expected sources and sinks.

5) How much of a learning curve is required?

6) Teams supporting data streaming will have to work on critical issues, typically including on-call. Batch failures tend to be addressed urgently, but data streaming failures require an immediate response.

7) Is there a culture of measuring, responding to and fixing issues?
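
The lambda architecture described in question 2 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not a production design: the function names (batch_view, speed_view, merged_view) and the page-view events are hypothetical. The batch layer recomputes accurate counts from everything up to the last batch run, the speed layer counts only the events that arrived since, and a query merges the two.

```python
from collections import Counter

# Hypothetical page-view events: (sequence_number, page)
events = [(1, "home"), (2, "jobs"), (3, "home"), (4, "profile"), (5, "home")]

def batch_view(all_events, up_to):
    """Batch layer: accurate counts over everything up to the last batch run."""
    return Counter(page for seq, page in all_events if seq <= up_to)

def speed_view(all_events, after):
    """Speed layer: timely counts for events the batch layer has not seen yet."""
    return Counter(page for seq, page in all_events if seq > after)

def merged_view(all_events, last_batch_seq):
    """Serving step: combine the batch view with the speed view."""
    return batch_view(all_events, last_batch_seq) + speed_view(all_events, last_batch_seq)

# The batch job last ran after event 3; events 4 and 5 exist only in the speed layer.
view = merged_view(events, last_batch_seq=3)
print(view["home"], view["profile"])  # prints "3 1"
```

Because both layers apply the same counting logic, an engine that supports both layers could share that code, which is the reuse argument made above.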


What is some common terminology used when discussing Kafka?

Cluster - Kafka is run as a cluster on one or more servers

Broker - The cluster servers are known as brokers. Each broker has a unique ID

Topic - A Kafka cluster stores streams of records in categories called topics

Partitions - Each topic is split into partitions, which are hosted on the brokers

Producers - Sources writing to topics

Consumers - Read from topics

Connectors - Link Kafka to existing data systems

Stream processors - Transform an input stream into an output stream
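
To make the terms above concrete, here is a toy in-memory model of a topic with partitions, a producer and a consumer. It is a sketch only: ToyTopic and its methods are hypothetical names, there is no broker or network involved, and the crc32-based partition choice is an assumption for illustration (Kafka's own default partitioner uses a different hash).

```python
import zlib

class ToyTopic:
    """In-memory stand-in for a Kafka topic: a set of append-only partition logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        """Append a record; records with the same key land in the same partition."""
        # crc32 is an assumption for this sketch, not Kafka's real partitioner.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def consume(self, partition, offset):
        """Read all records in one partition starting at the given offset."""
        return self.partitions[partition][offset:]

topic = ToyTopic("site-events", num_partitions=3)
p1, o1 = topic.produce("user-42", "page_view")
p2, o2 = topic.produce("user-42", "profile_update")

# Same key -> same partition, and offsets increase in append order.
print(p1 == p2, o1, o2)             # prints "True 0 1"
print(topic.consume(p1, offset=1))  # prints "[('user-42', 'profile_update')]"
```

The point of the sketch is that a consumer tracks its own position (the offset) in each partition, and that ordering is only guaranteed within a partition.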

 

Is Apache Kafka an ETL tool? Streaming v Batch Processing

It's strange when you consider how to categorise Kafka. Is it a type of database system? Is it a type of filesystem? I've always viewed it as an event streaming platform.

Rather than defining (or not defining) Apache Kafka as an ETL tool, I try to consider the context. An example of the context would be differentiating between a microservices/messaging/streaming framework and batch processing.

An example of batch processing is wrapping a series of tasks, such as receive file > parse > validate > cleanse > organise > aggregate, in an SSIS job (or other type of ETL tool) and executing it on a schedule, i.e. non-continuous data.

An example of a microservices/messaging/streaming framework is continuous data. The key to continuous data is the ability to pass the data through a type of messaging server which manages the continuous data flow. Apache Kafka is an example of this type of messaging server.

 

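What protocols are supported by Kafka?

JSON and Avro are commonly used. Strictly speaking these are serialization formats rather than protocols: Kafka brokers store message keys and values as byte arrays, so the format is chosen by the producers and consumers.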
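
Whatever format is chosen, a record travels through Kafka as bytes. A minimal sketch of the JSON case, using only the Python standard library; the function names and the sample event are hypothetical, and Avro is not shown because it needs a schema and a third-party library.

```python
import json

def serialize(record):
    """Turn a dict into UTF-8 JSON bytes, the form a broker would store."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

def deserialize(payload):
    """Turn stored bytes back into a dict on the consumer side."""
    return json.loads(payload.decode("utf-8"))

event = {"user_id": 42, "action": "profile_update"}
payload = serialize(event)

print(type(payload).__name__)         # prints "bytes"
print(deserialize(payload) == event)  # prints "True"
```

Kafka client libraries generally accept functions or classes of this shape as the serializer and deserializer for keys and values.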

Do you have a diagram giving a very high level view of Kafka?

This diagram gives you a flavour of the possibilities.

[Diagram: Kafka]

What is ZooKeeper?

ZooKeeper manages distributed processes. At the time of writing it is a compulsory part of an Apache Kafka cluster. Kafka brokers use ZooKeeper for cluster controller election and cluster membership management.

What are Kafka Brokers?

Kafka brokers are the main messaging and storage parts of Apache Kafka. Apache Kafka has a concept called topics, which is another way of saying "message streams". Topics are separated into partitions, and the partitions are replicated for high availability. The Kafka cluster is managed by the Kafka broker servers.

What is a Kafka Connect worker?

The Kafka Connect worker allows Kafka to integrate with external systems. The Kafka Connect API permits configuring connectors that continually pull from some source data system into Kafka or push from Kafka into some sink data system. 
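
As a rough illustration of what configuring a connector looks like, the sketch below builds the JSON body that would be submitted to a Connect worker's REST API (POST /connectors). The connector name, file path and topic are hypothetical. FileStreamSourceConnector is the simple example connector shipped with Apache Kafka; whether it is available on a given worker depends on the distribution and version, so treat this as an assumption.

```python
import json

# Hypothetical connector name, file path and topic; adjust for a real deployment.
connector = {
    "name": "local-file-source",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/input.txt",
        "topic": "file-lines",
    },
}

# This JSON body is what would be sent to a Connect worker's REST API
# (POST /connectors), which listens on port 8083 by default.
body = json.dumps(connector, indent=2)
print(body)
```

A sink connector is configured the same way, with a connector class that reads from topics and writes to the target system.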

Are connections between Kafka and Producers encrypted by default?

By default, Apache Kafka communicates in plaintext, which means that all data is sent in the clear. To encrypt communication, it is recommended to configure the Apache Kafka components in your deployment to use SSL encryption. The term SSL can be confusing here, as TLS has replaced SSL; Kafka's configuration still uses the SSL name. Encrypting data in transit helps protect against man-in-the-middle attacks. Once the data is at rest on the broker, you'll need to consider whether some sort of encryption is required on disk.
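
A hedged sketch of the client side of that configuration: the property names below are standard Kafka client settings, while the truststore path and password are hypothetical placeholders. The broker needs a matching SSL listener and keystore, which is not shown here.

```python
# Hypothetical path and password; the property names are standard Kafka client settings.
ssl_client_settings = {
    "security.protocol": "SSL",
    "ssl.truststore.location": "/etc/kafka/secrets/client.truststore.jks",
    "ssl.truststore.password": "changeit",
}

def to_properties(settings):
    """Render a dict as the key=value lines of a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())

print(to_properties(ssl_client_settings))
```

In a real deployment the password would come from a secret store rather than being written into a file in the clear.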

Do you have links to some in-depth Kafka related blog posts?

A Brief History of Kafka, LinkedIn’s Messaging Platform

How We’re Improving and Advancing Kafka at LinkedIn

How does KSQL fit into the Kafka picture?

KSQL is a SQL-like interface for working with data streams in Kafka.

Considerations for using KSQL

Where do you want to maintain the business logic? Closer to the source, closer to the target or as part of the stream process?

How much of the data transformation do you want to maintain at the KSQL level?

KSQL is not a good fit for ad-hoc queries. Data is retained in Kafka for a limited time, so relying on the full data set would leave gaps in the result set.

There are no indexes in KSQL, which makes ad-hoc queries and BI reports difficult.

In the KSQL context, are streams and tables on the same level? For example, is there a requirement to create a JOIN between a table and a stream?
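
The difference between a stream and a table, and what a join between them means, can be sketched in plain Python. This is not KSQL syntax: the profiles table, the click events and the function names are all hypothetical. The table holds only the latest row per key, and each stream event is enriched with whatever the table holds at that moment.

```python
# Table: the latest known value per key (hypothetical user profiles).
profiles = {}

def apply_profile_update(user_id, profile):
    """A table keeps only the most recent row for each key."""
    profiles[user_id] = profile

def enrich(click):
    """Stream-table join: look up the current table row for each stream event."""
    profile = profiles.get(click["user_id"])
    if profile is None:
        return None  # inner-join behaviour: drop events with no matching row
    return {**click, "country": profile["country"]}

apply_profile_update(42, {"country": "UK"})
apply_profile_update(42, {"country": "FR"})  # overwrites the earlier row

clicks = [{"user_id": 42, "page": "home"}, {"user_id": 7, "page": "jobs"}]
enriched = [row for row in map(enrich, clicks) if row is not None]
print(enriched)  # prints "[{'user_id': 42, 'page': 'home', 'country': 'FR'}]"
```

Note that the result depends on when each event arrives relative to the table updates, which is one reason to decide early where this kind of logic should live.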

How do you monitor Apache Kafka?

When planning how to monitor Kafka, these are some considerations:

Do you need to track end-to-end latency?

Do you need to know about data loss?

Do you need to know about administrative operations, such as partition alignment?

Do you want to audit data?

1) Count all data in a Kafka cluster and create an audit event to compare against producer counts, aiming to detect any data loss.

2) Put some sort of monitor on data completeness. Record an alert in a database when data is delayed or lost, and use these alerts to focus on the areas and reasons for data loss.
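
The audit idea in points 1 and 2 can be sketched as a comparison of per-window record counts. Everything here is an assumption for illustration: the function name, the five-minute window labels and the counts are hypothetical, and in practice the counts would come from the producers and from an audit consumer reading the cluster.

```python
def find_data_loss(produced_counts, consumed_counts):
    """Compare per-window counts and report windows where records are missing."""
    alerts = []
    for window, produced in sorted(produced_counts.items()):
        consumed = consumed_counts.get(window, 0)
        if consumed < produced:
            alerts.append({"window": window, "produced": produced,
                           "consumed": consumed, "missing": produced - consumed})
    return alerts

# Hypothetical counts keyed by time window, as reported by producers and
# as counted by an audit consumer reading the cluster.
produced = {"10:00": 1000, "10:05": 1200, "10:10": 900}
consumed = {"10:00": 1000, "10:05": 1150}

for alert in find_data_loss(produced, consumed):
    print(alert)
```

Each alert could be written to a database, so that repeated gaps in the same window or source point to where the loss or delay is happening.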



Author: Jack Vamvas (http://www.dba-ninja.com)

