prometheus metrics for autobahn/avail, autobahn/data and p2p/mux#3682
prometheus metrics for autobahn/avail, autobahn/data and p2p/mux#3682pompon0 wants to merge 21 commits into
Conversation
PR SummaryLow Risk Overview Autobahn avail records highest AppQC/CommitQC road and global block gauges, plus proposal→commit and inter-commit latency histograms (commit spacing labeled by view timeout count). Observations run when QCs are applied in Autobahn data moves latency metrics into a dedicated P2P mux tracks per-stream in-flight, message/byte counters, and stream lifetime latency, labeled by Dependency note: existing dashboards keyed on the old Reviewed by Cursor Bugbot for commit 045d98f. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## gprusak-prometheus #3682 +/- ##
======================================================
- Coverage 58.75% 58.45% -0.31%
======================================================
Files 2188 2192 +4
Lines 178842 178050 -792
======================================================
- Hits 105084 104078 -1006
- Misses 64510 64722 +212
- Partials 9248 9250 +2
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Adds OpenTelemetry metrics for autobahn/avail consensus QCs and p2p/mux stream I/O, plus minor refactors. The instrumentation is reasonable, but the new mux metrics file is not gofmt-clean (will fail the lint CI), and several monotonic totals are declared as UpDownCounters.
Findings: 2 blocking | 4 non-blocking | 2 posted inline
Blockers
sei-tendermint/internal/p2p/mux/metrics/metrics.gois not gofmt-clean, which violates the repo's formatting requirement and will fail golangci/make lintin CI (see inline comment). Rungofmt -s -won the file.- 1 blocking issue(s) flagged inline on specific lines.
Non-blocking
- Repository review guidelines (
REVIEW_GUIDELINES.md) were empty/missing, so no repo-specific standards were applied beyond AGENTS.md. - The Cursor second-opinion pass (
cursor-review.md) produced no output. - In
avail/metrics.goObserveCommitQC,proposalToCommitLatency.Record(...)is called before the ordering guardlast.val.Index() >= qc.Index(), so a re-observed/stale CommitQC still records a latency sample. Consider moving the record after the guard if only fresh QCs should be measured (minor, metrics-only). - 1 suggestion(s)/nit(s) flagged inline on specific lines.
There was a problem hiding this comment.
Observability-only PR adding OpenTelemetry metrics for autobahn/avail consensus QCs and p2p/mux stream traffic. The changes are correct and well-structured (dedup guards, bounded label cardinality, safe mutex-by-value refactor); the only substantive concern is that the in-flight stream gauge can drift upward when streams are abandoned without a local Close.
Findings: 0 blocking | 4 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- No tests were added for the new metrics helpers (ObserveCommitQC/ObserveAppQC dedup logic and the Stream metric lifecycle). A small unit test around the monotonic dedup guards (out-of-order QCs are ignored) and Open/Close idempotency would guard against regressions; acceptable to skip for an observability-only change, but worth noting.
- Cursor's second-opinion pass (cursor-review.md) produced no output, and REVIEW_GUIDELINES.md was empty on the base branch, so no repo-specific standards were applied. Codex's single medium finding (the inFlight counter leak) is incorporated as an inline comment.
- Minor: the
Sendmetric is recorded when a message is enqueued to the frame queue (stream.go:95) rather than when it is actually written to the connection, so send_bytes/send_msgs slightly lead the wire. This is a reasonable approximation but worth a comment if precise byte accounting is ever expected. - 1 suggestion(s)/nit(s) flagged inline on specific lines.
| return | ||
| } | ||
| inner.closed.local = true | ||
| inner.metrics.Close() |
There was a problem hiding this comment.
[suggestion] The inFlight up/down counter is only decremented here in the local close() path. Open() increments it once a stream's recv end is established, but the decrement never runs if a stream is abandoned without a local Close() — e.g. Mux.Run exits on a connection/read/write error, or a caller observes remote close (tryPrune in runRecv only fully prunes once both local and remote are closed) but never calls Close(). Because inFlight is a process-global counter shared across all connections, this drifts permanently upward across peer churn, making the gauge unreliable over a long-running node. Consider decrementing symmetrically on prune/teardown (e.g. from tryPrune or a runner shutdown hook) so an opened stream is always balanced by exactly one decrement. (Matches Codex's medium finding.)
There was a problem hiding this comment.
local Close is ALWAYS required by the contract on Stream object - if mux.Connect/Accept succeeds, then caller is required to call Close eventually.
Superseded: latest AI review found no blocking issues.
bdchatham
left a comment
There was a problem hiding this comment.
xreview summary (opentelemetry / sei-networking / systems / idiom lenses). Nice, well-scoped instrumentation — most of it is clean. Three points worth resolving before merge, plus a few smaller ones inline.
🔴 MeterProvider wiring — where is the global MeterProvider set up? otel.Meter() is a no-op unless a provider with a Prometheus/OTLP reader and a service.name resource is registered in the binary, and I don't see it in this diff. If it isn't already wired in main, every instrument here silently records nothing — worth confirming before merge.
On your weighted-histogram question: you don't need to call Record more than once. You have exactly one latency value per commit, so it's a single Record into a Float64Histogram — the sum+count counter split was solving a problem the code doesn't have. Details in the inline comment on commitToCommitLatency.
| "tendermint_internal_autobahn_avail__app_global_block_number", | ||
| metric.WithDescription("global block number of the highest observed appQC"), | ||
| )) | ||
| var proposalToCommitLatency = utils.OrPanic1(meter.Float64Histogram( |
There was a problem hiding this comment.
🟡 Add metric.WithUnit("s") on the avail latency instruments — the mux-side latency sets it but these don't, so the exporter won't append _seconds and the two packages end up with inconsistently-named latency series.
| return &RPC[API, Req, Resp]{kind, limit, req, resp} | ||
| service[kind] = &rpcConfig{ | ||
| limit: limit, | ||
| name: fmt.Sprintf("%T", utils.Zero[Req]()), |
There was a problem hiding this comment.
🟡 fmt.Sprintf("%T", ...) bakes the fully-qualified Go type into the rpc_name label, so renaming or moving the request type silently breaks dashboard continuity. Prefer an explicit, stable name string.
Weighted histograms in OTel — summary of the threadTL;DR on "how do I migrate the weighted histograms without calling There's no weighted 1. Looping recordOpts := metric.WithAttributeSet(attrSet) // build once, outside the loop
for i := uint64(0); i < blocks; i++ {
seqLatency.Record(ctx, latencySec, recordOpts)
}2. But I'd reconsider the weighted histogram itself. Replaying one shared per-commit measurement into N identical points inflates the statistical support and quietly redefines seqLatency.Record(ctx, latencySec, opts) // per-commit, unweighted → commit-health p99
blocksSequenced.Add(ctx, int64(blocks), opts) // blocks_sequenced_total → throughput
// optional: blocksPerCommit.Record(ctx, float64(blocks), opts) // batch-size distributionNow 3. If you genuinely need a pre-aggregated / weighted histogram in OTel, it's doable natively via a custom func (p *seqLatencyProducer) Observe(latencySec float64, blocks uint64) {
p.mu.Lock(); defer p.mu.Unlock()
i := sort.SearchFloat64s(p.bounds, latencySec) // first bound >= value
p.bucketCounts[i] += blocks // weight by N, O(1)
p.count += blocks
p.sum += latencySec * float64(blocks)
}
func (p *seqLatencyProducer) Produce(ctx context.Context) ([]metricdata.ScopeMetrics, error) {
p.mu.Lock()
bc := append([]uint64(nil), p.bucketCounts...) // snapshot under lock
dp := metricdata.HistogramDataPoint[float64]{
Attributes: p.attrs, StartTime: p.start, Time: time.Now(),
Count: p.count, Bounds: p.bounds, BucketCounts: bc, Sum: p.sum,
}
p.mu.Unlock()
return []metricdata.ScopeMetrics{{
Scope: instrumentation.Scope{Name: "consensus"},
Metrics: []metricdata.Metrics{{
Name: "consensus_block_sequencing_latency", Unit: "s",
Data: metricdata.Histogram[float64]{
Temporality: metricdata.CumulativeTemporality,
DataPoints: []metricdata.HistogramDataPoint[float64]{dp},
},
}},
}}, nil
}The catch: you now own the cumulative-temporality invariants (bucket counts monotonically non-decreasing + a stable On the API-ergonomics point: yes, client_golang's |
| } | ||
| // Constructed once per CommitQC, which we should afford. | ||
| attrs := metric.WithAttributeSet(attribute.NewSet( | ||
| // Timeouts capped: 20 means [20,inf) |
There was a problem hiding this comment.
I suppose we almost never have 20+ timeouts on one RoadIndex? What happens if the cluster gets stuck?
There was a problem hiding this comment.
then we don't care about this metric, I suppose.
| } | ||
|
|
||
| var observedCommitQC = newObserved[*types.CommitQC]() | ||
| var observedAppQC = newObserved[*types.AppQC]() |
There was a problem hiding this comment.
Do we need the full QC? We are only using RoadIndex, View Index, and timestamp I suppose?
There was a problem hiding this comment.
this is a simplification, there is no need to restrain it down.
| return false, nil | ||
| } | ||
| i.latestAppQC = utils.Some(appQC) | ||
| metrics.ObserveAppQC(appQC) |
There was a problem hiding this comment.
This could be much later than we actually observe AppQC for this RoadIndex right? Because the block proposal packing AppQC is optional.
There was a problem hiding this comment.
true, these metrics are describing the avail.State content.
| i.commitQCs.prune(idx) | ||
| if i.commitQCs.next == idx { | ||
| i.commitQCs.pushBack(commitQC) | ||
| metrics.ObserveCommitQC(c, commitQC) |
There was a problem hiding this comment.
Sorry I'm confused, why do we observe these QCs inside prune() instead of right after the QCs are verified?
There was a problem hiding this comment.
metrics track newest commitQC - the commitQC from the anchor is newest iff the anchor wipes the whole data state.
There was a problem hiding this comment.
This PR adds otel/Prometheus metrics for autobahn/avail (commit/app QC latency & progress gauges) and the p2p mux (per-RPC latency, in-flight, msg/byte counters), threads an RPC name through registration, and generalizes the metricsgen tool to support lowercase struct names. The changes are well-structured and correct; I found no blocking issues, only minor observability notes.
Findings: 0 blocking | 5 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- The Cursor second-opinion review (
cursor-review.md) is empty — that pass produced no output. - The Codex review (
codex-review.md) targetssei-tendermint/node/node.go(NoOpMetricsProvider / package-levelGlobalcollectors), which is NOT one of this PR's 13 changed files. Those specific findings are out of scope for this PR and appear to reference stale or mis-scoped code. The general observation still partially applies: the newmux/metricspackage uses package-levelGlobalcollectors registered ininit(), so multiple node instances in one process (e.g. tests) share and merge these series. This is consistent with the repo's existing metricsgenGlobalpattern and is not a regression introduced here, but is worth keeping in mind for any in-process multi-node setups. inFlightgauge relies onStream.Open()/Close()being paired.Close()is idempotent at the caller (guarded byinner.closed.local), so double-decrement is not possible, but if a stream is opened and its connection tears down withoutclose()ever running, the gauge can drift upward. Worth confirming all stream lifecycles funnel throughclose().- No unit tests accompany the new metrics recording paths (
ObserveCommitQC/ObserveAppQCdedup logic, muxStreamopen/send/recv/close). The dedup guards (Index()/GlobalNumber()ordering) would benefit from a small test; metricsgen itself does get a new lowercase-struct test. - 1 suggestion(s)/nit(s) flagged inline on specific lines.
Comments that couldn't be anchored to the diff
sei-tendermint/internal/autobahn/avail/metrics/metrics.go:106-- [nit] nit:qc.Proposal().View().Numberis labeledtimeouts. The view number is a reasonable proxy for the number of timed-out rounds, but the mapping isn't obvious from the code — a brief comment tying view number to "timeouts" would help future readers (the cap-at-20 comment is good).
There was a problem hiding this comment.
A well-structured observability PR (metricsgen tag unification + new autobahn/avail and p2p/mux metrics). No correctness or security issues found; one valid non-blocking metrics-quality concern (startup replay inflates a latency histogram) and a couple of cosmetic notes.
Findings: 0 blocking | 5 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Startup metric skew (confirmed, Codex):
newInnerreplays the persisted prune anchor viai.prune(...)(inner.go:92), andprunenow callsmetrics.ObserveCommitQC. On the first observation after restart the commit-to-commit guard is skipped (no priorlast), butproposal_to_commit_latencyis still observed asnow - anchor.Proposal().Timestamp(), so every restart adds one spurious, downtime-inflated sample to that histogram. Consider skipping latency observations during replay (record only the state gauges) so dashboards/percentiles aren't distorted. Not a consensus-correctness issue. - Import-alias inconsistency (nit): generated
metrics.gen.gofiles now alias the prometheus helpers astmprometheus, while the hand-writtenmetrics.gofiles (e.g.internal/state/metrics.go,internal/evidence/metrics.go,internal/eventlog/metrics.go) still usetmmetrics. It compiles fine (per-file aliases) but is slightly confusing; optionally align them. sei-tendermint/internal/p2p/mux/metrics/metrics.goimports prometheus with an explicit alias identical to the package name (prometheus "..."); the alias is redundant and can be dropped.- Second-opinion passes:
cursor-review.mdwas empty (Cursor produced no output).codex-review.mdcontained a single finding, which is incorporated above.REVIEW_GUIDELINES.mdwas empty, so no repo-specific standards were applied. - 1 suggestion(s)/nit(s) flagged inline on specific lines.
| i.commitQCs.prune(idx) | ||
| if i.commitQCs.next == idx { | ||
| i.commitQCs.pushBack(commitQC) | ||
| metrics.ObserveCommitQC(c, commitQC) |
There was a problem hiding this comment.
[suggestion] prune is invoked during startup replay of the persisted prune anchor (newInner at inner.go:92), so ObserveCommitQC here records a proposal_to_commit_latency sample equal to now - anchor.Proposal().Timestamp(). After a restart that proposal timestamp is old, producing one downtime-inflated outlier per restart. (commit_to_commit_latency is safe on the first call since there is no prior observation.) Consider observing latency only for live commits and updating just the state gauges during replay.
There was a problem hiding this comment.
A large but largely mechanical metrics refactor (import alias rename, pointer→value wrapper vecs, new metrics_buckets tag + NoBuckets, new avail/data/mux metrics) that is logically sound and well-tested, but the hand-written libs/utils/prometheus/prometheus.go is not gofmt/goimports-clean and will fail the golangci-lint CI job the repo mandates.
Findings: 3 blocking | 3 non-blocking | 2 posted inline
Blockers
sei-tendermint/libs/utils/prometheus/prometheus.gois notgofmt/goimports-clean, so the golangci-lint job (which enables thegofmtandgoimportsformatters) will fail: (a) the import group at lines 12–13 ordersk8s.io/...beforegithub.com/sei-protocol/...(goimports sorts alphabetically,github.com<k8s.io); (b)struct { ... }single-line declarations at lines 17 and 27 must bestruct{ ... }under gofmt. Rungofmt -s -wandgoimports -won this file (the only non-generated new source file that isn't formatter-clean).- 2 blocking issue(s) flagged inline on specific lines.
Non-blocking
- Cursor's second-opinion review (
cursor-review.md) was empty — that pass produced no output. Codex's single finding (the gofmt/goimports issue above) is confirmed. - The new p2p/mux metrics (
mux/metrics/metrics.gen.go) and several migrated histograms are registered with emptyHelp: ""strings; consider adding help text for operator readability, though this matches some pre-existing metrics. mux/metrics.StreamincrementsinFlightinOpen()but only decrements it inClose()when a start time is set; a stream that is opened but whoseClose()is never reached (e.g. connection torn down without the normal close path) would leak thein_flightgauge. Worth confirming the mux always callsClose().
| dto "github.com/prometheus/client_model/go" | ||
| "google.golang.org/protobuf/proto" | ||
|
|
||
| "k8s.io/component-base/metrics/prometheusextension" |
There was a problem hiding this comment.
[blocker] Import ordering is not goimports-clean: within this group k8s.io/component-base/... is placed before github.com/sei-protocol/..., but goimports sorts alphabetically (github.com < k8s.io). This will fail the goimports linter in CI. Run goimports -w on the file.
| ) | ||
|
|
||
| type HistogramOpts = prometheus.HistogramOpts | ||
| type HistogramVec struct { *prometheusextension.WeightedHistogramVec } |
There was a problem hiding this comment.
[blocker] Not gofmt-clean: struct { *prometheusextension.WeightedHistogramVec } will be reformatted by gofmt to struct{ ... } (same for type Observer struct { ... } on line 27). The repo enables the gofmt formatter in golangci-lint, so this will fail CI. Run gofmt -s -w on this file.
There was a problem hiding this comment.
A large, mostly-mechanical and well-tested refactor of the sei-tendermint Prometheus metrics layer (value-type metric vecs, a custom weighted Histogram, a reworked metricsgen bucket-tag syntax) plus new autobahn/avail, autobahn/data, and p2p/mux stream metrics. The core logic is sound with good test coverage; no correctness blockers, only observability/operational notes.
Findings: 0 blocking | 4 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Second-opinion passes surfaced nothing actionable: Codex reported "No material findings" (and noted it could not run the Go 1.25.6 tests due to a Go 1.24 toolchain / no network), and the Cursor review file (cursor-review.md) is empty — that pass produced no output.
- Operator/dashboard impact: several metrics are renamed or restructured (e.g. the data-package histogram
sei_data__latencybecomestendermint_internal_autobahn_data_latency, and weighted-histogram wiring is replaced by the new custom Histogram). This is expected for a metrics refactor but will break existing Grafana dashboards/alerts referencing the old names — worth calling out in release notes. - Semantics of the new custom Histogram differ subtly from the standard prometheus client: an empty/
nonebucket set yields only a single +Inf bucket (count+sum only, no distribution), and there is intentionally no implicit DefBuckets at the library layer — the DefBuckets default is applied by metricsgen instead. Anyone hand-constructing atmprometheus.HistogramVecoutside the generator must remember to pass buckets explicitly. - 1 suggestion(s)/nit(s) flagged inline on specific lines.
| // Latency from proposal being constructed to commit being observed. | ||
| proposalToCommitLatency prometheus.HistogramVec `metrics_buckets:"exp(0.01, 1.2, 35)"` | ||
| // Latency between consecutive commits being observed. | ||
| commitToCommitLatency prometheus.HistogramVec `metrics_labels:"timeouts" metrics_buckets:"none"` |
There was a problem hiding this comment.
[nit] nit: commitToCommitLatency uses metrics_buckets:"none", so this "latency" histogram exports only the implicit +Inf bucket (i.e. count and sum, no distribution/quantiles). If the intent is just an average per timeouts label that's fine, but for a latency metric a finite bucket set (e.g. exp(...)) is usually more useful — please confirm this is intentional.
Superseded: latest AI review found no blocking issues.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 045d98f. Configure here.
| } | ||
| Global.appRoadIndexAt().Set(int64(qc.Proposal().RoadIndex())) // nolint: gosec | ||
| // +1 is for consistency with commitGlobalBlockNumber | ||
| Global.appGlobalBlockNumberAt().Set(int64(qc.Proposal().GlobalNumber() + 1)) // nolint: gosec |
There was a problem hiding this comment.
App global gauge misaligned
Low Severity
The app global block gauge is set to GlobalNumber()+1, while the commit gauge uses GlobalRange(c).Next. For commit QCs that finalize multiple global blocks, the app proposal’s global number may be the range start, so the two gauges diverge even when the pair is valid.
Reviewed by Cursor Bugbot for commit 045d98f. Configure here.
There was a problem hiding this comment.
This PR swaps the heavy k8s.io/component-base metrics dependency for the repo's own metricsgen-based prometheus wrappers and adds observability for autobahn avail/data QCs and p2p mux streams. The refactor is clean and the changes are non-app-hash-breaking; the only substantive concern is a potential slow drift of the new in_flight gauge on abnormal stream teardown.
Findings: 0 blocking | 4 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Second-opinion passes:
cursor-review.mdandREVIEW_GUIDELINES.mdwere both empty, so no Cursor findings or repo-specific guidelines were available to merge; only the Codex pass produced output (the in_flight observation, reflected inline). - mux metrics (
latency,in_flight,send_msgs,recv_msgs,send_bytes,recv_bytes) are generated with emptyHelpstrings. Adding doc comments on the struct fields insei-tendermint/internal/p2p/mux/metrics/metrics.gowould give the exported Prometheus metrics meaningful HELP text. ObserveCommitQC/ObserveAppQCuse package-global mutex-guardedobserved*state and a process-globalGlobalregistry; in multi-node in-process test setups the cross-node dedup (last.val.Index() >= qc.Index()) will drop observations from other nodes. This is an inherent limitation of the global-metrics pattern (metrics are process-global anyway) and fine for production single-node use, but worth being aware of when interpreting test metrics.- 1 suggestion(s)/nit(s) flagged inline on specific lines.
| ctrl.Updated() | ||
| return err | ||
| } | ||
| inner.metrics.Open() |
There was a problem hiding this comment.
[suggestion] in_flight is incremented here in Open() but only decremented in the local close() path (inner.metrics.Close()). The ordering is correct for the open() error path (the increment happens only after WaitUntil succeeds, so the early s.close(inner) above doesn't over-decrement). However, because the gauge lives in a process-global registry, any stream that is opened but never has Stream.Close() called on it — e.g. a stream abandoned when Mux.Run exits on a connection/read/write error and its owning goroutine returns without closing — leaks a permanent +1. Over many reconnects the gauge can drift upward and stop reflecting the true in-flight count, and the latency sample for those streams is never recorded. Consider tying the decrement/observe to stream teardown/pruning (tryPrune) rather than only to the application calling Close().


It will give us insight into consensus state and rpc performance.