
More Signal, Less Guesswork: New Kafka Observability Updates in Confluent Cloud

Written by Nusair Haq

We’re introducing enhanced visibility for streaming workload performance on Confluent Cloud, making it easier for developers and operators to understand, troubleshoot, and optimize real-time applications. As Apache Kafka® has become the backbone of data streaming, many teams rely on Confluent Cloud for its scale, elasticity, and reduced operational burden. While fully managed infrastructure removes much of the heavy lifting, teams still need the ability to monitor, tune, and optimize performance when it matters most.

Over the last few months, we’ve shipped updates focused squarely on addressing that feedback. New metrics answer questions that customers used to chase through logs and support tickets, and a refreshed cluster monitoring experience in Confluent Cloud Console brings the most important signals together in one place. Here’s what’s new and upcoming:

  • `io.confluent.kafka.server/client_limit_milliseconds` metric to understand which users are breaching limits and why

  • `io.confluent.kafka.server/max_pending_rebalance_time_milliseconds` metric to evaluate rebalance duration and frequency

  • `io.confluent.kafka.server/connection_accept_count` metric to correlate connection attempts with elastic cluster scaling

  • `io.confluent.kafka.server/partition_count` metric with a new dimension for `cleanup.policy` to differentiate between compacted and non-compacted partitions

Know Exactly Who Is Breaching Your Limits – and Why 

Throughput quotas for Confluent Cloud prevent a single application from over-consuming resources on a cluster—but until now, figuring out which user breached a limit and why required correlating multiple metrics or using client-side monitoring to get the full picture.

The new `io.confluent.kafka.server/client_limit_milliseconds` metric closes that gap. It reports the duration of throttling events broken down by principal, the violated limit, and the reason that limit was violated. Teams that previously had to reach for client-side telemetry or ping a developer in Slack can now pull this directly from the Metrics API, with support for export formats as well. 

In practice, this collapses a debugging loop that could take hours into one that takes minutes. An operator investigating a throughput dip can identify the principal and whether the limit came from their own quota configuration or from the cluster itself, without leaving their existing monitoring context. In the future, we also plan to add a `client_id` label to make the response even more precise. 

Example request:

{
  "group_by": [
    "metric.principal_id",
    "metric.violated_limit",
    "metric.reason",
    "resource.kafka.id"
  ],
  "aggregations": [
    {
      "metric": "io.confluent.kafka.server/client_limit_milliseconds",
      "agg": "AVG"
    }
  ],
  "filter": {
    "op": "AND",
    "filters": [
      {
        "field": "resource.kafka.id",
        "op": "EQ",
        "value": "lkc-xxxxx"
      }
    ]
  },
  "granularity": "ALL",
  "intervals": ["2025-04-11T00:00:00.000Z/PT10M"]
}

Example response:

{
  "data": [
    {
      "timestamp": "2025-04-11T00:00:00Z",
      "value": 335,
      "resource.kafka.id": "lkc-xxxxx",
      "metric.principal_id": "u-23agh7",
      "metric.violated_limit": "produce_throughput_quota",
      "metric.reason": "cluster_quota_violation"
    }
  ]
}

Understand Consumer Group Rebalances at a Glance

Rebalances are a normal part of consumer group management, but frequent or long-running rebalance events are one of the more common causes of performance bottlenecks. Until now, diagnosing them meant stitching together client logs from multiple applications or using a proxy metric such as consumer lag.

To provide more direct visibility into rebalance events, we’ve added `io.confluent.kafka.server/max_pending_rebalance_time_milliseconds` to the Metrics API.
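As a rough sketch, a Metrics API query for this metric might look like the following; the `group_by` labels mirror the response shown further below, while the MAX aggregation and one-minute granularity are our assumptions rather than a documented recipe:

{
  "group_by": [
    "metric.consumer_group_id",
    "metric.group_protocol"
  ],
  "aggregations": [
    {
      "metric": "io.confluent.kafka.server/max_pending_rebalance_time_milliseconds",
      "agg": "MAX"
    }
  ],
  "filter": {
    "op": "AND",
    "filters": [
      {
        "field": "resource.kafka.id",
        "op": "EQ",
        "value": "lkc-xxxxx"
      }
    ]
  },
  "granularity": "PT1M",
  "intervals": ["2025-11-14T16:00:00Z/PT3M"]
}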

A non-zero value in a given time interval tells you the group had at least one rebalance event during that interval, and the value itself is the maximum time spent in a rebalance event. Across a time range, a return from non-zero to zero values indicates the group was stable between rebalance events. Plotted over time, the metric shows both rebalance frequency and duration, which is usually enough to gauge the severity and prevalence of any rebalance issues. Here's an example response from our documentation:

{
  "data": [
    {
      "metric.consumer_group_id": "test-group",
      "metric.group_protocol": "CLASSIC",
      "timestamp": "2025-11-14T16:00:00Z",
      "value": 0.0
    },
    {
      "metric.consumer_group_id": "test-group",
      "metric.group_protocol": "CLASSIC",
      "timestamp": "2025-11-14T16:01:00Z",
      "value": 14594.0
    },
    {
      "metric.consumer_group_id": "test-group",
      "metric.group_protocol": "CLASSIC",
      "timestamp": "2025-11-14T16:02:00Z",
      "value": 0.0
    }
  ]
}

In this example, we query over three minutes at one-minute granularity. We can see that the group goes from stable to rebalancing, with a maximum of about 14.5 seconds spent in the rebalancing state. The group then becomes stable afterward. 

We’ve updated our consumer lag page in Confluent Cloud Console to show rebalance information for a given consumer group. The user interface queries over the last six hours with one-minute granularity. 
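For the same view outside the Console, a rough Metrics API equivalent of that query might look like the following sketch, filtered to a single group; the MAX aggregation and the specific six-hour interval are our assumptions based on the behavior described above:

{
  "aggregations": [
    {
      "metric": "io.confluent.kafka.server/max_pending_rebalance_time_milliseconds",
      "agg": "MAX"
    }
  ],
  "filter": {
    "op": "AND",
    "filters": [
      {
        "field": "resource.kafka.id",
        "op": "EQ",
        "value": "lkc-xxxxx"
      },
      {
        "field": "metric.consumer_group_id",
        "op": "EQ",
        "value": "test-group"
      }
    ]
  },
  "granularity": "PT1M",
  "intervals": ["2025-11-14T10:00:00Z/PT6H"]
}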

If you find yourself facing frequent rebalances or long rebalance events, check whether you're still using the classic consumer group protocol. Apache Kafka 4.0 introduced the new "consumer" group protocol type (enabled by setting `group.protocol=consumer` in your client configuration), which meaningfully reduces the impact of rebalance events. Check out the deep dive blog post to learn more.

Better Scale Signals for Elastic Clusters

Confluent Cloud elastic cluster types use Elastic Confluent Units for Kafka (eCKUs) as the unit of elastic capacity. To help customers better understand when they're approaching capacity limits, we've updated our metrics as follows:

  •  New metric for connection attempts: `io.confluent.kafka.server/connection_accept_count` is a new metric that returns the total count of connection attempts in a time interval. Connection attempts are a common yet silent reason that your cluster may elastically scale, and they're often the first sign of a misbehaving client or a connection storm before it shows up as a throughput problem (see the example query after this list).

  •  (Upcoming) New dimension for compacted partitions: The existing `io.confluent.kafka.server/partition_count` metric is being updated with a new dimension for `cleanup.policy`. When marked with `compact`, this indicates the number of compacted partitions in the cluster, which is a direct scale dimension for elastic clusters (see the sketch after the metric list below).
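For example, here's a sketch of a Metrics API query for the new connection metric, counting connection attempts per minute over an hour; the SUM aggregation and the interval are our assumptions, following the query shape of the earlier examples:

{
  "aggregations": [
    {
      "metric": "io.confluent.kafka.server/connection_accept_count",
      "agg": "SUM"
    }
  ],
  "filter": {
    "op": "AND",
    "filters": [
      {
        "field": "resource.kafka.id",
        "op": "EQ",
        "value": "lkc-xxxxx"
      }
    ]
  },
  "granularity": "PT1M",
  "intervals": ["2025-04-11T00:00:00Z/PT1H"]
}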

Both metrics will be available through the Metrics API and fit alongside other eCKU utilization metrics that are already available. For a complete picture of how your elastic clusters scale, make sure the following metrics are being monitored:

  • `io.confluent.kafka.server/elastic_cku_count`

  • `io.confluent.kafka.server/active_connection_count`

  • `io.confluent.kafka.server/connection_accept_count`

  • `io.confluent.kafka.server/partition_count`

  • `io.confluent.kafka.server/request_count`

  • `io.confluent.kafka.server/request_bytes`

  • `io.confluent.kafka.server/response_bytes`
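Once the upcoming `cleanup.policy` dimension lands, a query along the following lines could break partition counts down by cleanup policy; note that the `metric.cleanup.policy` label name is our assumption based on the dimension described above, not a published label:

{
  "group_by": [
    "metric.cleanup.policy"
  ],
  "aggregations": [
    {
      "metric": "io.confluent.kafka.server/partition_count",
      "agg": "SUM"
    }
  ],
  "filter": {
    "op": "AND",
    "filters": [
      {
        "field": "resource.kafka.id",
        "op": "EQ",
        "value": "lkc-xxxxx"
      }
    ]
  },
  "granularity": "ALL",
  "intervals": ["2025-04-11T00:00:00Z/PT10M"]
}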

Monitoring Elastic Scale Out of the Box

We're also excited to introduce the new and improved monitoring experience that comes out of the box with Confluent Cloud Console. Most notably, the new Cluster Monitoring page displays metrics covering elastic cluster demand and Dedicated cluster load. Specifically for elastic clusters, the new eCKU usage chart gives a clear picture of your current eCKU consumption and, most importantly, lets you visualize which dimensions are contributing to that usage.

What's Next

We're committed to creating the best monitoring experience possible for our users. Upcoming work includes incorporating throttled client visibility into the Cloud Console, expanding metrics coverage and dimensionality, and supporting consumer group lag for empty consumer groups.

You can explore these updates in Confluent Cloud today. For a complete list of available metrics and example queries to get started, visit the Metrics API reference.

Apache®, Apache Kafka®, and Kafka® are registered trademarks of the Apache Software Foundation. No endorsement by the Apache Software Foundation is implied by the use of these marks.

  • Nusair Haq is a product manager for Confluent Cloud focusing on helping operators manage and understand their workloads through new APIs, metrics, and UI workflows. Outside of work, Nusair loves spending time with his family, exploring the Greater Toronto food scene, and watching motorsports, particularly MotoGP.
