prometheus metrics for autobahn/avail, autobahn/data and p2p/mux by pompon0 · Pull Request #3682 · sei-protocol/sei-chain

pompon0 · 2026-07-01T13:17:49Z

This gives us insight into consensus state and RPC performance.

cursor · 2026-07-01T13:18:36Z

PR Summary

Low Risk
Observability-only changes with no consensus or auth logic; main operational risk is renamed/changed metric series breaking existing alerts and dashboards.

Overview
Adds Prometheus observability for consensus availability and P2P RPC streams, and standardizes autobahn data metrics on the internal metricsgen + libs/utils/prometheus stack (dropping k8s.io/component-base).

Autobahn avail records highest AppQC/CommitQC road and global block gauges, plus proposal→commit and inter-commit latency histograms (commit spacing labeled by view timeout count). Observations run when QCs are applied in prune and when PushCommitQC accepts a new QC.

Autobahn data moves latency metrics into a dedicated data/metrics package with resource/stage labels; block histograms use plain Observe, tx histograms keep ObserveWithWeight. State no longer registers as a custom Prometheus collector—metrics register via init.

P2P mux tracks per-stream in-flight, message/byte counters, and stream lifetime latency, labeled by role (connect vs accept) and rpc_name. rpc.Register now requires a string name wired into mux StreamKindConfig.Name; all giga RPCs are named (e.g. stream_commit_qcs, get_block).

Dependency note: existing dashboards keyed on the old sei_data__latency name will need updating to the new tendermint_internal_autobahn_data_* series.

Reviewed by Cursor Bugbot for commit 045d98f. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-07-01T13:19:27Z

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`✅ passed`	`✅ passed`	`✅ passed`	Jul 3, 2026, 6:51 PM

codecov · 2026-07-01T13:20:49Z

Codecov Report

❌ Patch coverage is 98.70130% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.45%. Comparing base (fd69364) to head (045d98f).

Files with missing lines	Patch %	Lines
sei-tendermint/internal/p2p/mux/metrics/metrics.go	91.66%	1 Missing and 1 partial ⚠️
sei-tendermint/internal/autobahn/avail/state.go	66.66%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@                  Coverage Diff                   @@
##           gprusak-prometheus    #3682      +/-   ##
======================================================
- Coverage               58.75%   58.45%   -0.31%     
======================================================
  Files                    2188     2192       +4     
  Lines                  178842   178050     -792     
======================================================
- Hits                   105084   104078    -1006     
- Misses                  64510    64722     +212     
- Partials                 9248     9250       +2

Flag	Coverage Δ
sei-chain-pr	`79.02% <98.70%> (+5.07%)`	⬆️
sei-db	`70.41% <ø> (ø)`

Files with missing lines	Coverage Δ
sei-tendermint/internal/autobahn/avail/inner.go	`97.46% <100.00%> (+0.06%)`	⬆️
...int/internal/autobahn/avail/metrics/metrics.gen.go	`100.00% <100.00%> (ø)`
...dermint/internal/autobahn/avail/metrics/metrics.go	`100.00% <100.00%> (ø)`
...mint/internal/autobahn/data/metrics/metrics.gen.go	`100.00% <100.00%> (ø)`
...ndermint/internal/autobahn/data/metrics/metrics.go	`100.00% <100.00%> (ø)`
sei-tendermint/internal/autobahn/data/state.go	`79.89% <100.00%> (ø)`
sei-tendermint/internal/autobahn/data/testonly.go	`62.50% <ø> (+1.17%)`	⬆️
...tendermint/internal/p2p/mux/metrics/metrics.gen.go	`100.00% <100.00%> (ø)`
sei-tendermint/internal/p2p/mux/mux.go	`78.85% <100.00%> (+1.07%)`	⬆️
sei-tendermint/internal/p2p/mux/stream.go	`90.47% <100.00%> (+0.47%)`	⬆️
... and 4 more

... and 40 files with indirect coverage changes

seidroid

Adds OpenTelemetry metrics for autobahn/avail consensus QCs and p2p/mux stream I/O, plus minor refactors. The instrumentation is reasonable, but the new mux metrics file is not gofmt-clean (will fail the lint CI), and several monotonic totals are declared as UpDownCounters.

Findings: 2 blocking | 4 non-blocking | 2 posted inline

Blockers

sei-tendermint/internal/p2p/mux/metrics/metrics.go is not gofmt-clean, which violates the repo's formatting requirement and will fail golangci/make lint in CI (see inline comment). Run gofmt -s -w on the file.
1 blocking issue(s) flagged inline on specific lines.

Non-blocking

Repository review guidelines (REVIEW_GUIDELINES.md) were empty/missing, so no repo-specific standards were applied beyond AGENTS.md.
The Cursor second-opinion pass (cursor-review.md) produced no output.
In avail/metrics.go ObserveCommitQC, proposalToCommitLatency.Record(...) is called before the ordering guard last.val.Index() >= qc.Index(), so a re-observed/stale CommitQC still records a latency sample. Consider moving the record after the guard if only fresh QCs should be measured (minor, metrics-only).
1 suggestion(s)/nit(s) flagged inline on specific lines.

seidroid

Observability-only PR adding OpenTelemetry metrics for autobahn/avail consensus QCs and p2p/mux stream traffic. The changes are correct and well-structured (dedup guards, bounded label cardinality, safe mutex-by-value refactor); the only substantive concern is that the in-flight stream gauge can drift upward when streams are abandoned without a local Close.

Findings: 0 blocking | 4 non-blocking | 1 posted inline

Blockers

None at the file/PR level.

Non-blocking

No tests were added for the new metrics helpers (ObserveCommitQC/ObserveAppQC dedup logic and the Stream metric lifecycle). A small unit test around the monotonic dedup guards (out-of-order QCs are ignored) and Open/Close idempotency would guard against regressions; acceptable to skip for an observability-only change, but worth noting.
Cursor's second-opinion pass (cursor-review.md) produced no output, and REVIEW_GUIDELINES.md was empty on the base branch, so no repo-specific standards were applied. Codex's single medium finding (the inFlight counter leak) is incorporated as an inline comment.
Minor: the Send metric is recorded when a message is enqueued to the frame queue (stream.go:95) rather than when it is actually written to the connection, so send_bytes/send_msgs slightly lead the wire. This is a reasonable approximation but worth a comment if precise byte accounting is ever expected.
1 suggestion(s)/nit(s) flagged inline on specific lines.
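
For reference, the monotonic dedup guard mentioned above (out-of-order QCs are ignored) can be sketched and tested in isolation. This is a minimal sketch, not the PR's code: the `observed` type here is a hypothetical, non-generic stand-in keyed on a plain `uint64` index.

```go
package main

import (
	"fmt"
	"sync"
)

// observed remembers the highest index seen so far.
// Hypothetical stand-in for the PR's observed* state.
type observed struct {
	mu   sync.Mutex
	seen bool
	last uint64
}

// Observe reports whether idx is strictly newer than anything seen before
// and records it if so. Stale or repeated indices return false.
func (o *observed) Observe(idx uint64) bool {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.seen && o.last >= idx {
		return false
	}
	o.seen, o.last = true, idx
	return true
}

func main() {
	var o observed
	for _, idx := range []uint64{3, 5, 4, 5, 6} {
		fmt.Println(idx, o.Observe(idx))
	}
}
```

A unit test along these lines would only need to feed an out-of-order sequence and assert which calls are accepted.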

seidroid · 2026-07-01T13:46:00Z

 		return
 	}
 	inner.closed.local = true
+	inner.metrics.Close()


[suggestion] The inFlight up/down counter is only decremented here in the local close() path. Open() increments it once a stream's recv end is established, but the decrement never runs if a stream is abandoned without a local Close() — e.g. Mux.Run exits on a connection/read/write error, or a caller observes remote close (tryPrune in runRecv only fully prunes once both local and remote are closed) but never calls Close(). Because inFlight is a process-global counter shared across all connections, this drifts permanently upward across peer churn, making the gauge unreliable over a long-running node. Consider decrementing symmetrically on prune/teardown (e.g. from tryPrune or a runner shutdown hook) so an opened stream is always balanced by exactly one decrement. (Matches Codex's medium finding.)

A local Close is ALWAYS required by the contract on the Stream object: if mux.Connect/Accept succeeds, the caller is required to call Close eventually.
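
A minimal sketch of that contract, assuming a hypothetical `streamMetrics` helper and a package-level `inFlight` counter (neither is the PR's actual type): Close is idempotent, and a deferred Close on every successful open keeps the gauge balanced even when the handler returns an error.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// inFlight is a hypothetical process-wide gauge standing in for the real metric.
var inFlight atomic.Int64

// streamMetrics tracks whether one stream has contributed to inFlight.
type streamMetrics struct {
	mu     sync.Mutex
	opened bool
	closed bool
}

// Open increments the gauge at most once, and never after Close.
func (m *streamMetrics) Open() {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.opened || m.closed {
		return
	}
	m.opened = true
	inFlight.Add(1)
}

// Close is idempotent and decrements only if Open actually ran.
func (m *streamMetrics) Close() {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.closed {
		return
	}
	m.closed = true
	if m.opened {
		inFlight.Add(-1)
	}
}

// serve shows the caller side of the contract: Close is deferred right after Open.
func serve(m *streamMetrics, handle func() error) error {
	m.Open()
	defer m.Close()
	return handle()
}

func main() {
	var m streamMetrics
	err := serve(&m, func() error { return errors.New("connection reset") })
	m.Close() // an extra Close is a no-op
	fmt.Println(err, inFlight.Load())
}
```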

Superseded: latest AI review found no blocking issues.

bdchatham

xreview summary (opentelemetry / sei-networking / systems / idiom lenses). Nice, well-scoped instrumentation — most of it is clean. Three points worth resolving before merge, plus a few smaller ones inline.

🔴 MeterProvider wiring — where is the global MeterProvider set up? otel.Meter() is a no-op unless a provider with a Prometheus/OTLP reader and a service.name resource is registered in the binary, and I don't see it in this diff. If it isn't already wired in main, every instrument here silently records nothing — worth confirming before merge.

On your weighted-histogram question: you don't need to call Record more than once. You have exactly one latency value per commit, so it's a single Record into a Float64Histogram — the sum+count counter split was solving a problem the code doesn't have. Details in the inline comment on commitToCommitLatency.

bdchatham · 2026-07-01T15:38:39Z

+	"tendermint_internal_autobahn_avail__app_global_block_number",
+	metric.WithDescription("global block number of the highest observed appQC"),
+))
+var proposalToCommitLatency = utils.OrPanic1(meter.Float64Histogram(


🟡 Add metric.WithUnit("s") on the avail latency instruments — the mux-side latency sets it but these don't, so the exporter won't append _seconds and the two packages end up with inconsistently-named latency series.

bdchatham · 2026-07-01T15:38:39Z

-	return &RPC[API, Req, Resp]{kind, limit, req, resp}
+	service[kind] = &rpcConfig{
+		limit: limit,
+		name:  fmt.Sprintf("%T", utils.Zero[Req]()),


🟡 fmt.Sprintf("%T", ...) bakes the fully-qualified Go type into the rpc_name label, so renaming or moving the request type silently breaks dashboard continuity. Prefer an explicit, stable name string.
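
To illustrate the concern, here is a small sketch comparing a `%T`-derived label with an explicit one. `GetBlockReq`, `derivedName`, and `validRPCName` are hypothetical names used only for this example; the snake_case check is one possible convention, not something the repo is known to enforce.

```go
package main

import "fmt"

// GetBlockReq is a hypothetical request type used only for illustration.
type GetBlockReq struct{}

// derivedName mirrors the fmt.Sprintf("%T", ...) approach from the diff:
// the result embeds the Go package and type name.
func derivedName(req any) string {
	return fmt.Sprintf("%T", req)
}

// validRPCName reports whether name is a non-empty lowercase snake_case
// identifier, which keeps the label free of Go package paths.
func validRPCName(name string) bool {
	if name == "" {
		return false
	}
	for _, r := range name {
		isLower := r >= 'a' && r <= 'z'
		isDigit := r >= '0' && r <= '9'
		if !isLower && !isDigit && r != '_' {
			return false
		}
	}
	return true
}

func main() {
	derived := derivedName(GetBlockReq{})
	fmt.Println(derived, validRPCName(derived))
	fmt.Println("get_block", validRPCName("get_block"))
}
```

An explicit name survives renaming or moving the request type, whereas the derived one changes with it.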

bdchatham · 2026-07-01T15:41:04Z

Weighted histograms in OTel — summary of the thread

TL;DR on "how do I migrate the weighted histograms without calling Record N times":

There's no weighted Record in the OTel Go API — a histogram counts observations, so weighting a single latency by N blocks means either N Record calls or emitting a pre-aggregated point yourself. Two things worth separating: is looping actually a problem (no), and what's the right metric (probably not a weighted histogram at all).

1. Looping Record is cheap at our scale. 1–120 blocks at single-digit commits/sec is single-digit µs/commit — invisible next to the sequencing work. The only real footgun is allocating the attribute set inside the loop, so hoist it:

recordOpts := metric.WithAttributeSet(attrSet) // build once, outside the loop
for i := uint64(0); i < blocks; i++ {
    seqLatency.Record(ctx, latencySec, recordOpts)
}

2. But I'd reconsider the weighted histogram itself. Replaying one shared per-commit measurement into N identical points inflates the statistical support and quietly redefines _count (blocks, not commits). The honest signal keeps latency and load orthogonal:

seqLatency.Record(ctx, latencySec, opts)            // per-commit, unweighted → commit-health p99
blocksSequenced.Add(ctx, int64(blocks), opts)       // blocks_sequenced_total → throughput
// optional: blocksPerCommit.Record(ctx, float64(blocks), opts) // batch-size distribution

Now rate(blocks_sequenced_total) gives throughput, the histogram stays meaningful, and you can correlate "latency rose when blocks/commit rose" on one panel — strictly more information than one weighted blob.

3. If you genuinely need a pre-aggregated / weighted histogram in OTel, it's doable natively via a custom metric.Producer — the SDK analogue of Prometheus MustNewConstHistogram. You register it on the reader (metric.WithProducer(...), passed into prometheus.New(...) — verify the exact symbol on our pinned SDK) and hand back a point whose Count/Sum/BucketCounts you set directly, so weighting is "add N to the bucket, N to count":

func (p *seqLatencyProducer) Observe(latencySec float64, blocks uint64) {
    p.mu.Lock()
    defer p.mu.Unlock()
    // bucketCounts must have len(p.bounds)+1 entries; the last one is the +Inf bucket.
    i := sort.SearchFloat64s(p.bounds, latencySec) // first bound >= value, or len(p.bounds)
    p.bucketCounts[i] += blocks                    // weight by N, O(1)
    p.count += blocks
    p.sum += latencySec * float64(blocks)
}

func (p *seqLatencyProducer) Produce(ctx context.Context) ([]metricdata.ScopeMetrics, error) {
    p.mu.Lock()
    bc := append([]uint64(nil), p.bucketCounts...) // snapshot under lock
    dp := metricdata.HistogramDataPoint[float64]{
        Attributes: p.attrs, StartTime: p.start, Time: time.Now(),
        Count: p.count, Bounds: p.bounds, BucketCounts: bc, Sum: p.sum,
    }
    p.mu.Unlock()
    return []metricdata.ScopeMetrics{{
        Scope: instrumentation.Scope{Name: "consensus"},
        Metrics: []metricdata.Metrics{{
            Name: "consensus_block_sequencing_latency", Unit: "s",
            Data: metricdata.Histogram[float64]{
                Temporality: metricdata.CumulativeTemporality,
                DataPoints:  []metricdata.HistogramDataPoint[float64]{dp},
            },
        }},
    }}, nil
}

The catch: you now own the cumulative-temporality invariants (bucket counts monotonically non-decreasing + a stable StartTime, or Prometheus reads a reset and histogram_quantile breaks), your own concurrency, your own attribute sets, and you lose Views + exemplars (you've bypassed the SDK aggregator). That's worth it only when the bucket counts already exist upstream (a pre-aggregated subprocess/library/batch source) or the per-observation path is genuinely hot — i.e. folding in an aggregation, not synthesizing one. Not our case here, so I'd go with option 1 or 2.
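
The invariants listed above can be checked mechanically between two consecutive snapshots. The sketch below uses a hypothetical, simplified `snapshot` type rather than the SDK's `metricdata` structs, and assumes per-bucket (not cumulative-across-buckets) counts whose total equals `count`, as in the producer code above.

```go
package main

import (
	"errors"
	"fmt"
)

// snapshot is a hypothetical, simplified view of one cumulative histogram point.
type snapshot struct {
	startUnixNano int64
	count         uint64
	bucketCounts  []uint64
}

// checkCumulative verifies that next is a valid successor of prev:
// same start time, same bucket layout, no bucket going backwards,
// and a count equal to the sum of the buckets.
func checkCumulative(prev, next snapshot) error {
	if prev.startUnixNano != next.startUnixNano {
		return errors.New("start time changed between snapshots")
	}
	if len(prev.bucketCounts) != len(next.bucketCounts) {
		return errors.New("bucket layout changed between snapshots")
	}
	var total uint64
	for i, c := range next.bucketCounts {
		if c < prev.bucketCounts[i] {
			return fmt.Errorf("bucket %d decreased from %d to %d", i, prev.bucketCounts[i], c)
		}
		total += c
	}
	if total != next.count {
		return fmt.Errorf("count %d does not match bucket total %d", next.count, total)
	}
	return nil
}

func main() {
	a := snapshot{startUnixNano: 1, count: 3, bucketCounts: []uint64{1, 2}}
	b := snapshot{startUnixNano: 1, count: 5, bucketCounts: []uint64{2, 3}}
	fmt.Println(checkCumulative(a, b))
	fmt.Println(checkCumulative(b, a))
}
```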

On the API-ergonomics point: yes, client_golang's MustNewConstHistogram is more ergonomic for pre-aggregated data — that's the price of OTel being backend/temporality-agnostic (push+pull, delta+cumulative, Views, exemplars). If we ever want the raw ergonomics for a few metrics, we can register a plain client_golang Collector on the same Prometheus registry and it scrapes alongside the OTel series — but only under a pull/Prometheus exporter, not OTLP push.
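
To make the `_count` point from item 2 concrete, here is a stdlib-only sketch of a weighted histogram next to an unweighted one plus a separate block counter. The `histogram` type is a hypothetical stand-in written for this example; it is neither the OTel SDK aggregator nor the repo's custom Histogram.

```go
package main

import (
	"fmt"
	"sort"
)

// histogram is a minimal histogram; counts has len(bounds)+1 entries,
// the last one being the +Inf bucket.
type histogram struct {
	bounds []float64
	counts []uint64
	count  uint64
	sum    float64
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// ObserveWithWeight adds weight observations of value v in O(1).
func (h *histogram) ObserveWithWeight(v float64, weight uint64) {
	i := sort.SearchFloat64s(h.bounds, v) // first bound >= v, or len(bounds)
	h.counts[i] += weight
	h.count += weight
	h.sum += v * float64(weight)
}

// Observe adds a single unweighted observation.
func (h *histogram) Observe(v float64) { h.ObserveWithWeight(v, 1) }

func main() {
	bounds := []float64{0.1, 0.5, 1}
	perCommit, perBlock := newHistogram(bounds), newHistogram(bounds)
	var blocksSequenced uint64
	commits := []struct {
		latencySec float64
		blocks     uint64
	}{
		{0.05, 1},
		{0.30, 100},
	}
	for _, c := range commits {
		perCommit.Observe(c.latencySec)                    // latency per commit
		blocksSequenced += c.blocks                        // load as a separate counter
		perBlock.ObserveWithWeight(c.latencySec, c.blocks) // weighted alternative
	}
	fmt.Println("per-commit count:", perCommit.count)
	fmt.Println("per-block count:", perBlock.count, "blocks sequenced:", blocksSequenced)
	fmt.Println("per-commit mean:", perCommit.sum/float64(perCommit.count))
	fmt.Println("per-block mean:", perBlock.sum/float64(perBlock.count))
}
```

In the weighted variant the single 100-block commit dominates both the count and the mean, which is the redefinition described above. With an empty bounds slice the same type degenerates to a single +Inf bucket that carries only count and sum.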

wen-coding · 2026-07-01T20:05:32Z

+			}
+			// Constructed once per CommitQC, which we should afford.
+			attrs := metric.WithAttributeSet(attribute.NewSet(
+				// Timeouts capped: 20 means [20,inf)


I suppose we almost never have 20+ timeouts on one RoadIndex? What happens if the cluster gets stuck?

Then we don't care about this metric, I suppose.
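
A minimal sketch of the capping discussed here, assuming the label is derived from a plain counter. `timeoutsLabel` and `maxTimeoutsLabel` are hypothetical names; only the cap value of 20 comes from the diff comment.

```go
package main

import (
	"fmt"
	"strconv"
)

// maxTimeoutsLabel caps the label so that "20" stands for [20, inf).
const maxTimeoutsLabel = 20

// timeoutsLabel maps a timeout count to a label value with bounded cardinality:
// at most maxTimeoutsLabel+1 distinct values can ever be produced.
func timeoutsLabel(n uint64) string {
	if n > maxTimeoutsLabel {
		n = maxTimeoutsLabel
	}
	return strconv.FormatUint(n, 10)
}

func main() {
	for _, n := range []uint64{0, 19, 20, 1000} {
		fmt.Println(n, "->", timeoutsLabel(n))
	}
}
```

If the cluster gets stuck and timeouts keep accumulating, every further sample lands in the "20" series rather than creating new ones.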

wen-coding · 2026-07-01T20:22:57Z

+}
+
+var observedCommitQC = newObserved[*types.CommitQC]()
+var observedAppQC = newObserved[*types.AppQC]()


Do we need the full QC? We are only using RoadIndex, View Index, and timestamp I suppose?

This is a simplification; there is no need to narrow it down.

wen-coding · 2026-07-01T20:58:48Z

 		return false, nil
 	}
 	i.latestAppQC = utils.Some(appQC)
+	metrics.ObserveAppQC(appQC)


This could be much later than we actually observe AppQC for this RoadIndex right? Because the block proposal packing AppQC is optional.

True, these metrics describe the content of avail.State.

wen-coding · 2026-07-01T20:59:18Z

 	i.commitQCs.prune(idx)
 	if i.commitQCs.next == idx {
 		i.commitQCs.pushBack(commitQC)
+		metrics.ObserveCommitQC(c, commitQC)


Sorry I'm confused, why do we observe these QCs inside prune() instead of right after the QCs are verified?

The metrics track the newest commitQC: the commitQC from the anchor is the newest iff the anchor wipes the whole data state.

seidroid

This PR adds otel/Prometheus metrics for autobahn/avail (commit/app QC latency & progress gauges) and the p2p mux (per-RPC latency, in-flight, msg/byte counters), threads an RPC name through registration, and generalizes the metricsgen tool to support lowercase struct names. The changes are well-structured and correct; I found no blocking issues, only minor observability notes.

Findings: 0 blocking | 5 non-blocking | 1 posted inline

Blockers

None at the file/PR level.

Non-blocking

The Cursor second-opinion review (cursor-review.md) is empty — that pass produced no output.
The Codex review (codex-review.md) targets sei-tendermint/node/node.go (NoOpMetricsProvider / package-level Global collectors), which is NOT one of this PR's 13 changed files. Those specific findings are out of scope for this PR and appear to reference stale or mis-scoped code. The general observation still partially applies: the new mux/metrics package uses package-level Global collectors registered in init(), so multiple node instances in one process (e.g. tests) share and merge these series. This is consistent with the repo's existing metricsgen Global pattern and is not a regression introduced here, but is worth keeping in mind for any in-process multi-node setups.
inFlight gauge relies on Stream.Open()/Close() being paired. Close() is idempotent at the caller (guarded by inner.closed.local), so double-decrement is not possible, but if a stream is opened and its connection tears down without close() ever running, the gauge can drift upward. Worth confirming all stream lifecycles funnel through close().
No unit tests accompany the new metrics recording paths (ObserveCommitQC/ObserveAppQC dedup logic, mux Stream open/send/recv/close). The dedup guards (Index()/GlobalNumber() ordering) would benefit from a small test; metricsgen itself does get a new lowercase-struct test.
1 suggestion(s)/nit(s) flagged inline on specific lines.

Comments that couldn't be anchored to the diff

sei-tendermint/internal/autobahn/avail/metrics/metrics.go:106 -- [nit] qc.Proposal().View().Number is labeled timeouts. The view number is a reasonable proxy for the number of timed-out rounds, but the mapping isn't obvious from the code; a brief comment tying view number to "timeouts" would help future readers (the cap-at-20 comment is good).

seidroid

A well-structured observability PR (metricsgen tag unification + new autobahn/avail and p2p/mux metrics). No correctness or security issues found; one valid non-blocking metrics-quality concern (startup replay inflates a latency histogram) and a couple of cosmetic notes.

Findings: 0 blocking | 5 non-blocking | 1 posted inline

Blockers

None at the file/PR level.

Non-blocking

Startup metric skew (confirmed, Codex): newInner replays the persisted prune anchor via i.prune(...) (inner.go:92), and prune now calls metrics.ObserveCommitQC. On the first observation after restart the commit-to-commit guard is skipped (no prior last), but proposal_to_commit_latency is still observed as now - anchor.Proposal().Timestamp(), so every restart adds one spurious, downtime-inflated sample to that histogram. Consider skipping latency observations during replay (record only the state gauges) so dashboards/percentiles aren't distorted. Not a consensus-correctness issue.
Import-alias inconsistency (nit): generated metrics.gen.go files now alias the prometheus helpers as tmprometheus, while the hand-written metrics.go files (e.g. internal/state/metrics.go, internal/evidence/metrics.go, internal/eventlog/metrics.go) still use tmmetrics. It compiles fine (per-file aliases) but is slightly confusing; optionally align them.
sei-tendermint/internal/p2p/mux/metrics/metrics.go imports prometheus with an explicit alias identical to the package name (prometheus "..."); the alias is redundant and can be dropped.
Second-opinion passes: cursor-review.md was empty (Cursor produced no output). codex-review.md contained a single finding, which is incorporated above. REVIEW_GUIDELINES.md was empty, so no repo-specific standards were applied.
1 suggestion(s)/nit(s) flagged inline on specific lines.

seidroid · 2026-07-03T11:52:19Z

 	i.commitQCs.prune(idx)
 	if i.commitQCs.next == idx {
 		i.commitQCs.pushBack(commitQC)
+		metrics.ObserveCommitQC(c, commitQC)


[suggestion] prune is invoked during startup replay of the persisted prune anchor (newInner at inner.go:92), so ObserveCommitQC here records a proposal_to_commit_latency sample equal to now - anchor.Proposal().Timestamp(). After a restart that proposal timestamp is old, producing one downtime-inflated outlier per restart. (commit_to_commit_latency is safe on the first call since there is no prior observation.) Consider observing latency only for live commits and updating just the state gauges during replay.
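
One way to picture the suggestion, as a sketch under assumptions: the `recorder` type and the `replay` flag below are hypothetical and only stand in for the gauge and histogram, they are not the PR's API. During replay only the state gauge is updated; live commits also record latency.

```go
package main

import (
	"fmt"
	"time"
)

// recorder is a hypothetical stand-in for the real metrics:
// roadIndex plays the gauge, latenciesSec plays the histogram.
type recorder struct {
	roadIndex    uint64
	latenciesSec []float64
}

// observeCommit always updates the state gauge, but records a latency
// sample only for live commits, not for startup replay.
func (r *recorder) observeCommit(roadIndex uint64, latencySec float64, replay bool) {
	r.roadIndex = roadIndex
	if replay {
		return
	}
	r.latenciesSec = append(r.latenciesSec, latencySec)
}

func main() {
	now := time.Now()
	var r recorder

	// Startup replay of a persisted anchor whose proposal is an hour old.
	anchorProposedAt := now.Add(-time.Hour)
	r.observeCommit(41, now.Sub(anchorProposedAt).Seconds(), true)

	// A live commit proposed 200ms ago.
	r.observeCommit(42, (200 * time.Millisecond).Seconds(), false)

	fmt.Println(r.roadIndex, r.latenciesSec)
}
```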

seidroid

A large but largely mechanical metrics refactor (import alias rename, pointer→value wrapper vecs, new metrics_buckets tag + NoBuckets, new avail/data/mux metrics) that is logically sound and well-tested, but the hand-written libs/utils/prometheus/prometheus.go is not gofmt/goimports-clean and will fail the golangci-lint CI job the repo mandates.

Findings: 3 blocking | 3 non-blocking | 2 posted inline

Blockers

sei-tendermint/libs/utils/prometheus/prometheus.go is not gofmt/goimports-clean, so the golangci-lint job (which enables the gofmt and goimports formatters) will fail: (a) the import group at lines 12–13 orders k8s.io/... before github.com/sei-protocol/... (goimports sorts alphabetically, github.com < k8s.io); (b) struct { ... } single-line declarations at lines 17 and 27 must be struct{ ... } under gofmt. Run gofmt -s -w and goimports -w on this file (the only non-generated new source file that isn't formatter-clean).
2 blocking issue(s) flagged inline on specific lines.

Non-blocking

Cursor's second-opinion review (cursor-review.md) was empty — that pass produced no output. Codex's single finding (the gofmt/goimports issue above) is confirmed.
The new p2p/mux metrics (mux/metrics/metrics.gen.go) and several migrated histograms are registered with empty Help: "" strings; consider adding help text for operator readability, though this matches some pre-existing metrics.
mux/metrics.Stream increments inFlight in Open() but only decrements it in Close() when a start time is set; a stream that is opened but whose Close() is never reached (e.g. connection torn down without the normal close path) would leak the in_flight gauge. Worth confirming the mux always calls Close().

seidroid · 2026-07-03T13:54:37Z

 	dto "github.com/prometheus/client_model/go"
 	"google.golang.org/protobuf/proto"

+	"k8s.io/component-base/metrics/prometheusextension"


[blocker] Import ordering is not goimports-clean: within this group k8s.io/component-base/... is placed before github.com/sei-protocol/..., but goimports sorts alphabetically (github.com < k8s.io). This will fail the goimports linter in CI. Run goimports -w on the file.

seidroid · 2026-07-03T13:54:37Z

 )

+type HistogramOpts = prometheus.HistogramOpts
+type HistogramVec struct { *prometheusextension.WeightedHistogramVec }


[blocker] Not gofmt-clean: struct { *prometheusextension.WeightedHistogramVec } will be reformatted by gofmt to struct{ ... } (same for type Observer struct { ... } on line 27). The repo enables the gofmt formatter in golangci-lint, so this will fail CI. Run gofmt -s -w on this file.
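
A quick way to check a snippet like this locally is the standard library's `go/format` package, which produces canonical gofmt style (without the `-s` simplifications). The sketch below uses a short stand-in type name, `Inner`, as an assumption; for longer embedded types gofmt may instead break the declaration across several lines, so the exact output for the real file should be confirmed by running the tool.

```go
package main

import (
	"bytes"
	"fmt"
	"go/format"
)

// isGofmtClean reports whether src is already in canonical gofmt form.
func isGofmtClean(src string) (bool, error) {
	out, err := format.Source([]byte(src))
	if err != nil {
		return false, err
	}
	return bytes.Equal(out, []byte(src)), nil
}

func main() {
	flagged := "package p\n\ntype HistogramVec struct { *Inner }\n"
	fixed := "package p\n\ntype HistogramVec struct{ *Inner }\n"
	for _, src := range []string{flagged, fixed} {
		ok, err := isGofmtClean(src)
		fmt.Println(ok, err)
	}
}
```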

seidroid

A large, mostly-mechanical and well-tested refactor of the sei-tendermint Prometheus metrics layer (value-type metric vecs, a custom weighted Histogram, a reworked metricsgen bucket-tag syntax) plus new autobahn/avail, autobahn/data, and p2p/mux stream metrics. The core logic is sound with good test coverage; no correctness blockers, only observability/operational notes.

Findings: 0 blocking | 4 non-blocking | 1 posted inline

Blockers

None at the file/PR level.

Non-blocking

Second-opinion passes surfaced nothing actionable: Codex reported "No material findings" (and noted it could not run the Go 1.25.6 tests due to a Go 1.24 toolchain / no network), and the Cursor review file (cursor-review.md) is empty — that pass produced no output.
Operator/dashboard impact: several metrics are renamed or restructured (e.g. the data-package histogram sei_data__latency becomes tendermint_internal_autobahn_data_latency, and weighted-histogram wiring is replaced by the new custom Histogram). This is expected for a metrics refactor but will break existing Grafana dashboards/alerts referencing the old names — worth calling out in release notes.
Semantics of the new custom Histogram differ subtly from the standard prometheus client: an empty/none bucket set yields only a single +Inf bucket (count+sum only, no distribution), and there is intentionally no implicit DefBuckets at the library layer — the DefBuckets default is applied by metricsgen instead. Anyone hand-constructing a tmprometheus.HistogramVec outside the generator must remember to pass buckets explicitly.
1 suggestion(s)/nit(s) flagged inline on specific lines.

seidroid · 2026-07-03T18:37:58Z

+	// Latency from proposal being constructed to commit being observed.
+	proposalToCommitLatency prometheus.HistogramVec `metrics_buckets:"exp(0.01, 1.2, 35)"`
+	// Latency between consecutive commits being observed.
+	commitToCommitLatency prometheus.HistogramVec `metrics_labels:"timeouts" metrics_buckets:"none"`


[nit] commitToCommitLatency uses metrics_buckets:"none", so this "latency" histogram exports only the implicit +Inf bucket (i.e. count and sum, no distribution/quantiles). If the intent is just an average per timeouts label that's fine, but for a latency metric a finite bucket set (e.g. exp(...)) is usually more useful; please confirm this is intentional.
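
For a sense of what a spec like exp(0.01, 1.2, 35) covers, here is a sketch that expands it into upper bounds. It assumes the tag means (start, factor, count) in the same sense as client_golang's ExponentialBuckets; that mapping is an assumption about metricsgen, and `expBuckets` is a hypothetical helper.

```go
package main

import "fmt"

// expBuckets returns count upper bounds: start, start*factor, start*factor^2, ...
// Assumption: "exp(start, factor, count)" follows this convention.
// Invalid parameters yield nil instead of panicking.
func expBuckets(start, factor float64, count int) []float64 {
	if count < 1 || start <= 0 || factor <= 1 {
		return nil
	}
	bounds := make([]float64, count)
	for i := range bounds {
		bounds[i] = start
		start *= factor
	}
	return bounds
}

func main() {
	b := expBuckets(0.01, 1.2, 35)
	fmt.Printf("buckets=%d first=%.4fs last=%.2fs\n", len(b), b[0], b[len(b)-1])
}
```

Under that assumption the 35 bounds run from 10 ms up to a little under 5 s, with anything slower falling into the implicit +Inf bucket.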

Superseded: latest AI review found no blocking issues.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

cursor · 2026-07-03T18:52:31Z

+		}
+		Global.appRoadIndexAt().Set(int64(qc.Proposal().RoadIndex())) // nolint: gosec
+		// +1 is for consistency with commitGlobalBlockNumber
+		Global.appGlobalBlockNumberAt().Set(int64(qc.Proposal().GlobalNumber() + 1)) // nolint: gosec


App global gauge misaligned

Low Severity

The app global block gauge is set to GlobalNumber()+1, while the commit gauge uses GlobalRange(c).Next. For commit QCs that finalize multiple global blocks, the app proposal’s global number may be the range start, so the two gauges diverge even when the pair is valid.

seidroid

This PR swaps the heavy k8s.io/component-base metrics dependency for the repo's own metricsgen-based prometheus wrappers and adds observability for autobahn avail/data QCs and p2p mux streams. The refactor is clean and the changes are non-app-hash-breaking; the only substantive concern is a potential slow drift of the new in_flight gauge on abnormal stream teardown.

Findings: 0 blocking | 4 non-blocking | 1 posted inline

Blockers

None at the file/PR level.

Non-blocking

Second-opinion passes: cursor-review.md and REVIEW_GUIDELINES.md were both empty, so no Cursor findings or repo-specific guidelines were available to merge; only the Codex pass produced output (the in_flight observation, reflected inline).
mux metrics (latency, in_flight, send_msgs, recv_msgs, send_bytes, recv_bytes) are generated with empty Help strings. Adding doc comments on the struct fields in sei-tendermint/internal/p2p/mux/metrics/metrics.go would give the exported Prometheus metrics meaningful HELP text.
ObserveCommitQC/ObserveAppQC use package-global mutex-guarded observed* state and a process-global Global registry; in multi-node in-process test setups the cross-node dedup (last.val.Index() >= qc.Index()) will drop observations from other nodes. This is an inherent limitation of the global-metrics pattern (metrics are process-global anyway) and fine for production single-node use, but worth being aware of when interpreting test metrics.
1 suggestion(s)/nit(s) flagged inline on specific lines.

seidroid · 2026-07-03T18:56:29Z

 			ctrl.Updated()
 			return err
 		}
+		inner.metrics.Open()


[suggestion] in_flight is incremented here in Open() but only decremented in the local close() path (inner.metrics.Close()). The ordering is correct for the open() error path (the increment happens only after WaitUntil succeeds, so the early s.close(inner) above doesn't over-decrement). However, because the gauge lives in a process-global registry, any stream that is opened but never has Stream.Close() called on it — e.g. a stream abandoned when Mux.Run exits on a connection/read/write error and its owning goroutine returns without closing — leaks a permanent +1. Over many reconnects the gauge can drift upward and stop reflecting the true in-flight count, and the latency sample for those streams is never recorded. Consider tying the decrement/observe to stream teardown/pruning (tryPrune) rather than only to the application calling Close().

pompon0 added 4 commits July 1, 2026 13:01

rpc metrics

b19b54a

mux metrics

fa43d93

consensus metrics

f1d08a4

metrics for autobahn/avail and p2p/mux

eb45dad

pompon0 requested review from bdchatham and wen-coding July 1, 2026 13:18

pompon0 added the non-app-hash-breaking label Jul 1, 2026

cursor Bot reviewed Jul 1, 2026

Comment thread sei-tendermint/internal/autobahn/avail/inner.go

Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go Outdated

Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go Outdated

seidroid Bot previously requested changes Jul 1, 2026

Comment thread sei-tendermint/internal/p2p/mux/metrics/metrics.go Outdated

Comment thread sei-tendermint/internal/p2p/mux/metrics/metrics.go Outdated

pompon0 added 2 commits July 1, 2026 15:30

long names

1dc194a

removed updown

d6737d1

cursor Bot reviewed Jul 1, 2026

Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go

pompon0 added 2 commits July 1, 2026 15:37

cap on timeouts in metrics

df9c86a

moved latency recording

ba94963

cursor Bot reviewed Jul 1, 2026

Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go Outdated

Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go Outdated

seidroid Bot approved these changes Jul 1, 2026

bdchatham reviewed Jul 1, 2026

bdchatham approved these changes Jul 1, 2026

wen-coding reviewed Jul 1, 2026

pompon0 added 2 commits July 2, 2026 12:50

WIP

59edf04

WIP

a856a28

wen-coding approved these changes Jul 2, 2026

pompon0 added 2 commits July 3, 2026 12:31

Merge branch 'gprusak-prometheus' into gprusak-metrics

57f84b0

generated metrics

bcd5da5

pompon0 changed the base branch from main to gprusak-prometheus July 3, 2026 11:02

seidroid Bot approved these changes Jul 3, 2026

Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go Outdated

pompon0 added 2 commits July 3, 2026 13:30

avail metrics migrated

65c9d72

concise bucket specs

78561f6

seidroid Bot approved these changes Jul 3, 2026

data metrics plugged in

beaa7cf

pompon0 changed the title ~~otel metrics for autobahn/avail and p2p/mux~~ prometheus metrics for autobahn/avail and p2p/mux Jul 3, 2026

seidroid Bot previously requested changes Jul 3, 2026

pompon0 added 4 commits July 3, 2026 19:44

fmt

a8e619e

custom histogram

867234e

nobucket

d730fd2

adjustments

eab524c

cursor Bot reviewed Jul 3, 2026

Comment thread sei-tendermint/internal/autobahn/avail/inner.go

seidroid Bot approved these changes Jul 3, 2026

pompon0 added 2 commits July 3, 2026 20:49

Merge branch 'gprusak-prometheus' into gprusak-metrics

22dc3ca

tidy

045d98f

cursor Bot reviewed Jul 3, 2026

pompon0 changed the title ~~prometheus metrics for autobahn/avail and p2p/mux~~ prometheus metrics for autobahn/avail, autobahn/data and p2p/mux Jul 3, 2026

seidroid Bot approved these changes Jul 3, 2026
