09 August 2019, by Rambler
These are some common questions I've received while working with Apache Kafka. I'll keep adding more detail.
What is Apache Kafka?
Apache Kafka is a distributed, partitioned, replicated commit log service. It can also be described as a publish-subscribe messaging system for distributed applications.
What was the original purpose of Apache Kafka?
Apache Kafka was developed within LinkedIn. LinkedIn invested in developing a single, unified, distributed pub-sub data pipeline. Prior to Kafka, LinkedIn maintained multiple data pipelines for InMail messaging, site events (e.g. page views) and other operational data. Kafka was developed to unify the scalability effort of maintaining those separate pipelines.
Describe some common use cases for Apache Kafka
Activity tracking - think of a publish-subscribe feed for event processing or monitoring. This was the original use case (a producer sketch follows this list).
Database updates - and downstream processing. For example, a profile update on a web site may require other applications to be notified.
Aggregating metrics - pulling metrics from different source logs and placing them in a central repository.
Stream processing - this has great potential. For example, processing multiple daily news feeds from different sources and aggregating, enriching and publishing them.
Messaging - think: an alternative to RabbitMQ or ActiveMQ.
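To make the activity-tracking case concrete, here is a minimal Java producer sketch. The broker address (localhost:9092) and the site-events topic name are assumptions for illustration, not part of any real deployment.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityTracker {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address for illustration only.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by user id so all events for a user land on the same partition.
            producer.send(new ProducerRecord<>("site-events", "user-42", "page_view:/home"));
        }
    }
}
```

Keying by user id means all of a user's events land on the same partition, which preserves per-user ordering for downstream consumers.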
What are some principles to consider when deciding whether data streaming, such as Apache Kafka, should be used?
Engineers should not stream all data simply because data streaming is available. There are significant considerations in deciding to change existing batch jobs to data streaming.
Questions to ask:
1) Decide which stream processing engine to choose. Will the data be event-based (streamed) or batched? The term micro-batching refers to batch processing at a higher frequency. The key point is that although it batches more frequently than the standard schedules, it is not event driven. If the aim is simply to have the data arrive more frequently and batch is already in place, continue to use batch.
2) Is lambda architecture required? Lambda architecture balances performance/SLA requirements and fault tolerance by creating a batch layer that provides a comprehensive and accurate "correct" view of batch data, while simultaneously implementing a speed layer for real-time stream processing that provides potentially incomplete, but timely (in the context of the application), views of online data. This is also known as a hybrid approach.
It may be the case that an existing batch job simply needs to be augmented with a speed layer, and if this is the case then choosing a data-processing engine that supports both layers of the lambda architecture may facilitate code reuse.
3) How invested are other parts of the organisation in data streaming? If a technology platform is heavily utilised, there tends to be deployment and operations knowledge surrounding the product.
4) How is ETL used within the organisation? Apache Kafka needs to be viewed in the context of expected sources/sinks.
5) How much of a learning curve is required?
6) Teams supporting data streaming will have to work on critical issues, typically including on-call duties. Batch failures tend to be addressed urgently, but data streaming failures require an immediate response.
7) Is there a culture of measuring, responding to and fixing issues?
What is some common terminology used when discussing Kafka?
Cluster - Kafka is run as a cluster on one or more servers
Broker - the cluster servers are known as brokers
Topic - a Kafka cluster stores streams of records called topics
Partitions - each Kafka broker has a unique ID and contains topic partitions
Producers - sources writing to topics
Consumers - read from topics
Connectors - link Kafka to existing data systems
Stream processors - assist in transforming an input stream into an output stream (a sketch follows this list)
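To give a flavour of the "stream processors" term, here is a minimal Kafka Streams sketch that reads one topic, transforms each value, and writes the result to another topic. The topic names, application id and broker address are assumptions for illustration.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read the (hypothetical) input topic, transform each value, write out.
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase()).to("output-topic");

        new KafkaStreams(builder.build(), props).start();
    }
}
```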
Is Apache Kafka an ETL tool? Streaming vs Batch Processing
It's strange when you consider how to categorise Kafka. Is it a type of database system? Is it a type of filesystem? I've always viewed it as an event streaming platform.
Rather than defining (or not defining) Apache Kafka as an ETL tool, I try to consider the context. An example of the context would be differentiating between the microservices/messaging/stream-processing framework and batch processing.
An example of batch processing is wrapping a bunch of tasks such as: receive file > parse > validate > cleanse > organise > aggregate in an SSIS job (or other type of ETL tool) and executing it on a schedule, i.e. non-continuous data.
An example of the microservices/messaging/stream-processing framework is continuous data. The key to continuous data is the ability to pass the data through a type of messaging server which manages the continuous data flow. Apache Kafka is an example of this type of messaging server.
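A hedged sketch of what "continuous data" looks like in client code: a Java consumer that polls the broker in a loop and processes records as they arrive, rather than running on a schedule. The broker address, group id and topic name are hypothetical.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ContinuousReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "continuous-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("site-events"));
            while (true) {
                // Poll continuously; records flow in as producers publish them.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```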
What protocols and data formats are supported by Kafka?
Kafka uses its own binary protocol over TCP; message payloads are just byte arrays, so the serialization format is the client's choice. JSON and Avro are the most commonly used formats.
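As a rough illustration of that client-side choice, the sketch below shows typical serializer settings for each format. The Avro serializer is Confluent's and requires a separate dependency plus a Schema Registry; the registry URL is an assumption.

```java
import java.util.Properties;

public class SerializerConfigs {
    public static void main(String[] args) {
        // JSON payloads are usually sent as UTF-8 strings with StringSerializer.
        Properties jsonProps = new Properties();
        jsonProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // Avro payloads typically use Confluent's serializer plus a Schema
        // Registry (separate dependency; the URL here is a placeholder).
        Properties avroProps = new Properties();
        avroProps.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        avroProps.put("schema.registry.url", "http://localhost:8081");
    }
}
```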
Do you have a diagram giving a very high-level view of Kafka?
This diagram gives you a flavour of the possibilities.
What is ZooKeeper?
ZooKeeper manages distributed processes. It is a compulsory part of an Apache Kafka cluster. Kafka brokers use ZooKeeper for cluster controller election and cluster membership management.
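By way of illustration, ZooKeeper is wired in through the broker configuration; a typical server.properties fragment might look like the following (the ensemble host names are placeholders):

```
# server.properties (broker); ZooKeeper ensemble hosts are hypothetical
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
zookeeper.connection.timeout.ms=6000
```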
What are Kafka Brokers?
Kafka brokers are the main messaging and storage components of Apache Kafka. Apache Kafka has a concept called topics; a topic is another way of saying "message stream". These topics are separated into partitions, and the partitions are replicated for high availability. The Kafka cluster is managed by the Kafka broker servers.
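A minimal sketch of these concepts using the Java AdminClient: creating a topic with six partitions, each replicated three ways for availability. The topic name and broker address are assumptions for illustration.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Topic with 6 partitions, replication factor 3 for availability.
            NewTopic topic = new NewTopic("message-stream", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```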
What is a Kafka Connect worker?
The Kafka Connect worker allows Kafka to integrate with external systems. The Kafka Connect API permits configuring connectors that continually pull from some source data system into Kafka or push from Kafka into some sink data system.
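As a sketch, the FileStreamSource connector that ships with Kafka can be run in standalone mode with a small properties file like the one below; the file path and topic name are hypothetical.

```
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/var/log/app/events.log
topic=connect-file-events
```

This would be started with bin/connect-standalone.sh alongside the worker's own configuration file.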
Are connections between Kafka and Producers encrypted by default?
Apache Kafka communicates in plaintext by default, which means that all data is sent in the clear. To encrypt communication, it is recommended to configure the Apache Kafka components in your deployment to use SSL encryption. The term SSL can be confusing, as TLS has replaced SSL; the configuration keys simply kept the old name. Encryption in transit addresses man-in-the-middle attacks. Once the data is at rest on the broker, you'll need to consider whether some form of disk encryption is required.
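For illustration, a client configured for TLS might carry settings like these; the ssl.* prefix is the real configuration naming, but the paths and passwords below are placeholders.

```
security.protocol=SSL
ssl.truststore.location=/etc/kafka/client.truststore.jks
ssl.truststore.password=changeit
# Only needed if brokers require client authentication (mutual TLS):
ssl.keystore.location=/etc/kafka/client.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
```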
Do you have links to some in-depth Kafka related blog posts?
A Brief History of Kafka, LinkedIn’s Messaging Platform
How We’re Improving and Advancing Kafka at LinkedIn
How does KSQL fit into the Kafka picture?
KSQL is a SQL-like interface for processing data in Kafka topics.
Considerations for using KSQL (a small example follows this list):
Where do you want to maintain the business logic? Closer to the source, closer to the target, or as part of the stream process?
How much of the data transformation do you want to maintain at the KSQL level?
If there are ad-hoc queries, KSQL is not a good fit. Data is maintained for a limited time in Kafka, so relying on the full data set would present gaps in the result set.
There are no indexes in KSQL, which makes ad-hoc queries and BI reports difficult.
In the KSQL context, are streams and tables on the same level? For example, is there a requirement to create a JOIN between a table and a stream?
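To give a flavour of the SQL-like syntax, here is a small KSQL sketch. It assumes a hypothetical pageviews topic with JSON-encoded values; the stream, table and column names are invented for illustration.

```sql
-- Declare a stream over an existing topic (the schema is an assumption).
CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

-- A continuous query materialised as a table: running view counts per page.
CREATE TABLE pageview_counts AS
  SELECT pageid, COUNT(*) AS views
  FROM pageviews
  GROUP BY pageid;
```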
How do you monitor Apache Kafka?
When planning how to monitor Kafka, these are some considerations:
Do you need to track end-to-end latency?
Do you need to know about data loss?
Do you need to know about administrative operations, such as partition alignment?
Do you want to audit data?
1) Count all data in a Kafka cluster and create an audit event to compare against producer counts, aiming to detect any data loss (a sketch follows this list).
2) Monitor data completeness in some way. Record these alerts in a database when data is delayed or lost, and use them to focus on the areas and reasons for data loss.
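A hedged sketch of the counting idea in 1): reading each partition's beginning and end offsets gives the number of records currently retained, which can then be compared against producer-side counts. The topic name and broker address are assumptions, and note that offsets can over-count slightly on compacted or transactional topics.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class TopicAudit {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Discover the partitions of the (hypothetical) topic.
            List<TopicPartition> partitions = consumer.partitionsFor("site-events").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());

            Map<TopicPartition, Long> begin = consumer.beginningOffsets(partitions);
            Map<TopicPartition, Long> end = consumer.endOffsets(partitions);

            // End minus beginning offset approximates the retained record count.
            long total = partitions.stream()
                    .mapToLong(tp -> end.get(tp) - begin.get(tp))
                    .sum();
            System.out.println("Records currently retained: " + total);
        }
    }
}
```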