Published events are pending in the stream
Symptom
You publish events, but some of them are not received by the subscriber and stay pending in the stream.
Cause
When the NATS EventingBackend has more than one replica and the Clustering property on the NATS Server is enabled, a leader election takes place at the stream and consumer levels (see the NATS documentation).
Once a leader is elected, all messages are replicated across the replicas.
Sometimes a replica goes out of sync with the others. As a result, messages on some consumers can stop being acknowledged and start piling up in the stream.
Remedy
There are two ways to fix the "broken" consumers with pending messages: trigger a leader reelection either on the consumers that have pending messages or on the stream level.
Trigger consumer leader election
First, find out which consumers have pending messages. For that, you need the latest version of the NATS CLI installed on your machine. You can find the broken consumers in two ways: by using the Grafana dashboard or directly with the NATS CLI.
Find the broken consumers using Grafana dashboard
- Access and Expose Grafana
- Find the NATS JetStream Dashboard and check the pending messages
- Find the consumer with pending messages and encode it as an md5 hash:
echo -n "tunas-testing/test-noapp3/kyma.noapp.order.created.v1" | md5
This shell command results in ebcabfe5c902612f0ba3ebde7653f30b.
- Then, find the consumer's leader:
nats consumer info sap ebcabfe5c902612f0ba3ebde7653f30b
Information for Consumer sap > ebcabfe5c902612f0ba3ebde7653f30b created 2022-10-24T15:49:43+02:00
Configuration:
    Name: ebcabfe5c902612f0ba3ebde7653f30b
    Description: tunas-testing/test-noapp3/kyma.noapp.order.created.v1
...
Cluster Information:
    Name: eventing-nats
    Leader: eventing-nats-1 # that's what we need
    Replica: eventing-nats-0, current, seen 0.96s ago
    Replica: eventing-nats-2, current, seen 0.96s ago
The output shows that the consumer's leader is the eventing-nats-1 replica.
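The `md5` command used above ships with macOS and BSD systems; most Linux distributions provide `md5sum` instead. The following is a minimal sketch of a wrapper that works with either tool. The function name `consumer_hash` is a hypothetical helper introduced here for illustration and is not part of the NATS CLI.

```shell
#!/bin/sh
# consumer_hash is a hypothetical helper name, not a NATS CLI command.
# It prints the MD5 hex digest of its first argument.
consumer_hash() {
  if command -v md5sum >/dev/null 2>&1; then
    # GNU coreutils: the output is "<hash>  -", so keep only the first field.
    printf '%s' "$1" | md5sum | cut -d ' ' -f 1
  else
    # macOS / BSD: md5 prints only the hash when it reads from stdin.
    printf '%s' "$1" | md5
  fi
}

# Example call with the consumer description from the step above.
consumer_hash "tunas-testing/test-noapp3/kyma.noapp.order.created.v1"
```

`printf '%s'` is used instead of `echo -n` because it never appends a trailing newline, regardless of the shell, so both produce the same digest as the original command.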
Find the broken consumers using the NATS CLI
If you have the NATS CLI installed on your machine, you can run this shell script:
for consumer in $(nats consumer list -n sap) # sap is the stream name
do
  nats consumer info sap "$consumer" -j | jq -c '{name: .name, pending: .num_pending, leader: .cluster.leader}'
done
The output looks similar to the following:
{"name":"ebcabfe5c902612f0ba3ebde7653f30b","pending":25,"leader":"eventing-nats-1"}
{"name":"c74c20756af53b592f87edebff67bdf8","pending":0,"leader":"eventing-nats-0"}
Here you can see that the consumer ebcabfe5c902612f0ba3ebde7653f30b has pending messages. The other consumer has no pending messages and is successfully processing events.
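With many consumers, it can help to print only the ones that still have pending messages. The following sketch filters the JSON lines produced by the script above with the jq `select` function. The function name `broken_consumers` is hypothetical and chosen for illustration; the field names `name`, `pending`, and `leader` are the ones emitted by the script above.

```shell
#!/bin/sh
# broken_consumers is a hypothetical helper name.
# It reads JSON lines ({"name":..,"pending":..,"leader":..}) on stdin and
# prints "<consumer> <leader>" for every consumer with pending > 0.
broken_consumers() {
  jq -r 'select(.pending > 0) | "\(.name) \(.leader)"'
}

# Usage sketch (requires the NATS CLI and access to the NATS server):
#   for consumer in $(nats consumer list -n sap); do
#     nats consumer info sap "$consumer" -j \
#       | jq -c '{name: .name, pending: .num_pending, leader: .cluster.leader}'
#   done | broken_consumers
```

Each printed line gives the consumer name and the replica that currently leads it, which are the two values needed for the step-down.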
Trigger the consumer leader reelection
Now that you know the name of the broken consumer and its leader, you can trigger the reelection:
- Port-forward the leader replica:
kubectl port-forward -n kyma-system eventing-nats-1 4222
- Trigger the leader reelection:
nats consumer cluster step-down sap ebcabfe5c902612f0ba3ebde7653f30b
After execution, you see the following message:
New leader elected "eventing-nats-2"
Information for Consumer sap > ebcabfe5c902612f0ba3ebde7653f30b created 2022-10-24T15:49:43+02:00
You can now check the consumer and confirm that the pending messages are being dispatched.
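One way to confirm this is to sample the pending count twice and compare the values. The following is a minimal sketch under stated assumptions: `is_draining` and `pending_count` are hypothetical helper names, the 30-second interval is an arbitrary choice, and `pending_count` needs the port-forward from the previous step, the NATS CLI, and jq.

```shell
#!/bin/sh
# is_draining is a hypothetical helper: it succeeds when the second sample
# of the pending count is lower than the first one.
is_draining() {
  [ "$2" -lt "$1" ]
}

# pending_count is a hypothetical helper: it prints the number of pending
# messages of the given consumer on the stream "sap".
pending_count() {
  nats consumer info sap "$1" -j | jq '.num_pending'
}

# Usage sketch (not executed here; the interval is an assumption):
#   before=$(pending_count ebcabfe5c902612f0ba3ebde7653f30b)
#   sleep 30
#   after=$(pending_count ebcabfe5c902612f0ba3ebde7653f30b)
#   is_draining "$before" "$after" && echo "pending messages are being dispatched"
```

If both samples are equal and greater than zero, the consumer is probably still stuck. If both are zero, there is nothing left to dispatch.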
Restart the NATS Pods and trigger the stream leader reelection
Sometimes triggering the leader reelection on the broken consumers doesn't work. In that case, try triggering the leader reelection on the stream level:
nats stream cluster step-down sap
The result looks similar to the following:
11:08:22 Requesting leader step down of "eventing-nats-1" in a 3 peer RAFT group
11:08:23 New leader elected "eventing-nats-0"
Information for Stream sap created 2022-10-24 15:47:19
    Subjects: kyma.>
    Replicas: 3
    Storage: File