Issue submitter TODO list
Describe the bug (actual behavior)
Hi,
I had trouble listing topics from Confluent Cloud, and with the help of Kiro (Claude 4.6) I managed to get it working. Below are the details provided by Kiro.
When connecting to Confluent Cloud, topics are not listed in versions v1.4.x and v1.5.0. The cluster shows as INITIALIZING indefinitely. v1.3.0 works correctly with the same configuration.
Three separate failures occur in the cluster scrape pipeline introduced by the metrics refactor in #1208:
1. ClusterAuthorizationException from describeMetadataQuorum() in StatisticsService
   Confluent Cloud does not permit querying KRaft quorum info. The code only handles UnsupportedVersionException but not ClusterAuthorizationException, causing the entire stats update to fail.
2. ClusterAuthorizationException from listConsumerGroups() in ScrapedClusterState
   A restricted service account (common in production) cannot list all consumer groups cluster-wide. There is no error handling on this call, so it propagates and kills the scrape.
3. TimeoutException from listOffsets() in ScrapedClusterState
   The new scrape pipeline fetches offsets for every partition on every cycle. On Confluent Cloud, this times out due to network latency, which then invalidates the AdminClient (via #1468) and causes the cluster to never recover.
Expected behavior
Topics should be listed. Failures in non-essential scrape calls (quorum info, consumer group listing, offset fetching) should degrade gracefully rather than preventing topic listing entirely.
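To illustrate the expected behavior, here is a minimal, library-free sketch of a scrape in which only the topic listing is essential. It deliberately swaps Reactor for plain `CompletableFuture`, and every name in it (`GracefulScrapeSketch`, `Snapshot`, `optionalCall`, `scrape`) is hypothetical and not taken from the kafbat-ui code base.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical stand-in for the scrape pipeline; not kafbat-ui code.
final class GracefulScrapeSketch {

  // Result of one scrape cycle: topics are essential, the rest is optional.
  static final class Snapshot {
    final List<String> topics;
    final Optional<String> quorumInfo;
    final List<String> consumerGroups;

    Snapshot(List<String> topics, Optional<String> quorumInfo, List<String> consumerGroups) {
      this.topics = topics;
      this.quorumInfo = quorumInfo;
      this.consumerGroups = consumerGroups;
    }
  }

  // Runs a non-essential call and substitutes the fallback if it fails.
  static <T> CompletableFuture<T> optionalCall(Supplier<T> call, T fallback) {
    return CompletableFuture.supplyAsync(call).exceptionally(t -> fallback);
  }

  static Snapshot scrape(Supplier<List<String>> listTopics,
                         Supplier<Optional<String>> describeQuorum,
                         Supplier<List<String>> listGroups) {
    // Essential call: a failure here still fails the whole scrape.
    CompletableFuture<List<String>> topics = CompletableFuture.supplyAsync(listTopics);
    // Non-essential calls: failures degrade to empty values.
    CompletableFuture<Optional<String>> quorum = optionalCall(describeQuorum, Optional.empty());
    CompletableFuture<List<String>> groups = optionalCall(listGroups, List.of());
    return new Snapshot(topics.join(), quorum.join(), groups.join());
  }
}
```

With this shape, a denied quorum or consumer-group call yields an empty value, while the topic list is still returned to the caller.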
Proposed fix
In StatisticsService.loadQuorumInfo():
    .onErrorResume(t ->
        t instanceof UnsupportedVersionException || t instanceof ClusterAuthorizationException
            ? Mono.just(Optional.empty())
            : Mono.error(t)
    );
In ScrapedClusterState.scrape(), add to listConsumerGroups():
    .onErrorResume(ClusterAuthorizationException.class, e -> {
        log.warn("Not authorized to list consumer groups, skipping");
        return Mono.just(List.of());
    })
In ScrapedClusterState.scrape(), add to the phase 2 Mono.zip:
    .onErrorResume(TimeoutException.class, e -> {
        log.warn("Timeout during offset/consumer scrape, topics will show without offset counts");
        return Mono.just(Tuples.of(Map.of(), Map.of(), Map.of(), HashBasedTable.create()));
    })
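All three snippets share one idea: recover only from a known exception type and let everything else propagate. The sketch below shows that idea in plain synchronous Java so it can be run without Reactor or the Kafka client. The helper `callOrFallback` and the two local exception classes are hypothetical stand-ins, not part of kafbat-ui, Reactor, or Kafka.

```java
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical helper mirroring "recover from listed exception types,
// rethrow the rest"; it is an illustration, not the Reactor operator itself.
final class SelectiveRecoverySketch {

  // Local stand-ins for the Kafka exception classes, so the sketch compiles alone.
  static class AuthorizationDenied extends RuntimeException {}
  static class UnsupportedVersion extends RuntimeException {}

  static <T> T callOrFallback(Supplier<T> call, T fallback,
                              Set<Class<? extends RuntimeException>> recoverable) {
    try {
      return call.get();
    } catch (RuntimeException e) {
      // isInstance also matches subclasses, like instanceof does.
      for (Class<? extends RuntimeException> type : recoverable) {
        if (type.isInstance(e)) {
          return fallback;
        }
      }
      throw e; // unknown failure: propagate to the caller
    }
  }
}
```

Keeping the recoverable set explicit means an unexpected failure still surfaces instead of being silently replaced by an empty result.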
Your installation details
- Confluent Cloud (managed Kafka)
- SASL/SSL with PLAIN mechanism
- Restricted service account (no cluster-level ACLs)
- Versions affected: v1.4.1, v1.4.2, v1.5.0
- Version working: v1.3.0
Steps to reproduce
- Configure kafbat-ui against a Confluent Cloud cluster using a service account API key
- Start any version after v1.3.0
- Navigate to Topics — shows "No topics found" or cluster stays INITIALIZING
Screenshots
No response
Logs
ERROR --- [parallel-4] io.kafbat.ui.service.StatisticsService: Failed to collect cluster info
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: listOffsets(api=LIST_OFFSETS)
WARN --- [parallel-4] i.k.ui.service.AdminClientServiceImpl: AdminClient for the cluster is invalidated due to Timed out waiting for a node assignment.
Additional context
Note: issue #1672 covers point 1 partially (for Strimzi/limited accounts), but doesn't mention Confluent Cloud and doesn't cover points 2 and 3.