Kafka
What is it
Event streaming platform used to broadcast event between systems.
Used for:
- Writing and reading events
- Storing those events for a period of time, allowing look-back
Kafka provides at-least-once semantics in most cases.
- Needs additional configuration to guarantee
What is it not
Not a queue
- Can mimic FIFOwith a singleproducerand singleconsumergroup, but lacks features to be a dedicatedqueue
Transport for RPC calls
- Kafkafire and forgets - should not care about the results of consumptions of events
Topics
Producers and Consumers exchange information on a named topic.
Topics are always multi-producer and multi-consumer.
- allows multiple systems to publish to one topicand consume the same events in multiple different applications.
Partitions
Each topic is divided into Partitions.
- Each partitionis a separate bucket of events within thetopic.
Each partition can be located on a different broker node, making this important for the performance of the Kafka cluster.
Each message is produced to a single partition.
- Consumersmust consume all- partitions, this is automatically handled via a- ConsumerGroup.
Order of events is guaranteed within a partition not within a topic.
- As a consequence, it is useful to use a fixed partition key, such as theaccount_id, so that message ordering is preserved within an account.
Producer => Brokers => All Partitions (to ConsumerGroup)
Publishing and Consuming
Producers connect to Kafka Brokers to publish their messages.
Each message consists of a Key and a Value field.
- The Valuefield is thepayloadof theevent, the Key is used to allocate theeventto a particularpartition.
- The partitionsof thetopicare spread across thebrokernodes allowing for spreading the load of a singletopicacross multiplebrokers.
Consumption is typically done via a ConsumerGroup.
The group can contain 1 or more consumers and these can be across different hosts, allowing for parallel processing.
- The Kafka brokersensure that eachpartitionis assigned to a singleconsumerin the group.
The maximum parallelism is limited to the number of partitions in the topic.
- Each consumer grouphas a unique id, so allmembersof that group would register with thesame idand thepartitionsof atopicwould be divided among them.
Deterministic Hashing
In order to ensure all events for a particular account are produced to a single partition, and thus retain ordering, one needs a deterministic way to go from some_id → partition #.
- It is important that all producersfollow the same convention.
Key is set to some_id.
Partition = murmur3(some_id) % numPartitions
murmur3 is a common non-cryptographic hashing function available in virtually any language, so it makes a good choice as a convention.
Idempotent Producers
When producing to Kafka, one way to reduce the possibility of duplicates is to enable idempotence on the producer.
Producer will do extra bookkeeping, such as including a sequence number, to ensure events are not unintentionally published twice.
Reference:
- https://stackoverflow.com/questions/58894281/difference-between-idempotence-and-exactly-once-in-kafka-stream
Replication Factor and Min In-Sync Replicas
Kafka Topics have two important settings which, when misconfigured, can cause the topics to become unusable.
The first is replication factor.
- the number of copies of the message written to the Kafka cluster.
The second is the minimum number of in sync replicas.
- indicates the number of replicas that have been published to before the message is considered durably stored.
If replicationFactor < min in sync replicas, the cluster will no longer accept publishing to that topic.
The solution is to increase the replication factor or decrease the number of in-sync replicas.