prometheus metrics for autobahn/avail, autobahn/data and p2p/mux #3682

Open
pompon0 wants to merge 21 commits into gprusak-prometheus from gprusak-metrics

@pompon0

@pompon0 pompon0 commented Jul 1, 2026

Contributor

These metrics will give us insight into consensus state and RPC performance.

@pompon0 pompon0 requested review from bdchatham and wen-coding July 1, 2026 13:18
@cursor

cursor Bot commented Jul 1, 2026

PR Summary

Low Risk
Observability-only changes with no consensus or auth logic; main operational risk is renamed/changed metric series breaking existing alerts and dashboards.

Overview
Adds Prometheus observability for consensus availability and P2P RPC streams, and standardizes autobahn data metrics on the internal metricsgen + libs/utils/prometheus stack (dropping k8s.io/component-base).

Autobahn avail records highest AppQC/CommitQC road and global block gauges, plus proposal→commit and inter-commit latency histograms (commit spacing labeled by view timeout count). Observations run when QCs are applied in prune and when PushCommitQC accepts a new QC.

Autobahn data moves latency metrics into a dedicated data/metrics package with resource/stage labels; block histograms use plain Observe, tx histograms keep ObserveWithWeight. State no longer registers as a custom Prometheus collector—metrics register via init.

P2P mux tracks per-stream in-flight, message/byte counters, and stream lifetime latency, labeled by role (connect vs accept) and rpc_name. rpc.Register now requires a string name wired into mux StreamKindConfig.Name; all giga RPCs are named (e.g. stream_commit_qcs, get_block).

Dependency note: existing dashboards keyed on the old sei_data__latency name will need updating to the new tendermint_internal_autobahn_data_* series.

Reviewed by Cursor Bugbot for commit 045d98f.

@github-actions

github-actions Bot commented Jul 1, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
|---|---|---|---|---|
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Jul 3, 2026, 6:51 PM |

@codecov

codecov Bot commented Jul 1, 2026

Codecov Report

❌ Patch coverage is 98.70130% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.45%. Comparing base (fd69364) to head (045d98f).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| sei-tendermint/internal/p2p/mux/metrics/metrics.go | 91.66% | 1 Missing and 1 partial ⚠️ |
| sei-tendermint/internal/autobahn/avail/state.go | 66.66% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files

@@                  Coverage Diff                   @@
##           gprusak-prometheus    #3682      +/-   ##
======================================================
- Coverage               58.75%   58.45%   -0.31%     
======================================================
  Files                    2188     2192       +4     
  Lines                  178842   178050     -792     
======================================================
- Hits                   105084   104078    -1006     
- Misses                  64510    64722     +212     
- Partials                 9248     9250       +2     
| Flag | Coverage Δ |
|---|---|
| sei-chain-pr | 79.02% <98.70%> (+5.07%) ⬆️ |
| sei-db | 70.41% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| sei-tendermint/internal/autobahn/avail/inner.go | 97.46% <100.00%> (+0.06%) ⬆️ |
| ...int/internal/autobahn/avail/metrics/metrics.gen.go | 100.00% <100.00%> (ø) |
| ...dermint/internal/autobahn/avail/metrics/metrics.go | 100.00% <100.00%> (ø) |
| ...mint/internal/autobahn/data/metrics/metrics.gen.go | 100.00% <100.00%> (ø) |
| ...ndermint/internal/autobahn/data/metrics/metrics.go | 100.00% <100.00%> (ø) |
| sei-tendermint/internal/autobahn/data/state.go | 79.89% <100.00%> (ø) |
| sei-tendermint/internal/autobahn/data/testonly.go | 62.50% <ø> (+1.17%) ⬆️ |
| ...tendermint/internal/p2p/mux/metrics/metrics.gen.go | 100.00% <100.00%> (ø) |
| sei-tendermint/internal/p2p/mux/mux.go | 78.85% <100.00%> (+1.07%) ⬆️ |
| sei-tendermint/internal/p2p/mux/stream.go | 90.47% <100.00%> (+0.47%) ⬆️ |

... and 4 more

... and 40 files with indirect coverage changes

Comment thread sei-tendermint/internal/autobahn/avail/inner.go
Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go Outdated
Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go Outdated
seidroid Bot previously requested changes Jul 1, 2026

@seidroid seidroid Bot left a comment

Adds OpenTelemetry metrics for autobahn/avail consensus QCs and p2p/mux stream I/O, plus minor refactors. The instrumentation is reasonable, but the new mux metrics file is not gofmt-clean (will fail the lint CI), and several monotonic totals are declared as UpDownCounters.

Findings: 2 blocking | 4 non-blocking | 2 posted inline

Blockers

  • sei-tendermint/internal/p2p/mux/metrics/metrics.go is not gofmt-clean, which violates the repo's formatting requirement and will fail golangci/make lint in CI (see inline comment). Run gofmt -s -w on the file.
  • 1 blocking issue(s) flagged inline on specific lines.

Non-blocking

  • Repository review guidelines (REVIEW_GUIDELINES.md) were empty/missing, so no repo-specific standards were applied beyond AGENTS.md.
  • The Cursor second-opinion pass (cursor-review.md) produced no output.
  • In avail/metrics.go ObserveCommitQC, proposalToCommitLatency.Record(...) is called before the ordering guard last.val.Index() >= qc.Index(), so a re-observed/stale CommitQC still records a latency sample. Consider moving the record after the guard if only fresh QCs should be measured (minor, metrics-only).
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

Comment thread sei-tendermint/internal/p2p/mux/metrics/metrics.go Outdated
Comment thread sei-tendermint/internal/p2p/mux/metrics/metrics.go Outdated
Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go
Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go Outdated
Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go Outdated

@seidroid seidroid Bot left a comment

Observability-only PR adding OpenTelemetry metrics for autobahn/avail consensus QCs and p2p/mux stream traffic. The changes are correct and well-structured (dedup guards, bounded label cardinality, safe mutex-by-value refactor); the only substantive concern is that the in-flight stream gauge can drift upward when streams are abandoned without a local Close.

Findings: 0 blocking | 4 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • No tests were added for the new metrics helpers (ObserveCommitQC/ObserveAppQC dedup logic and the Stream metric lifecycle). A small unit test around the monotonic dedup guards (out-of-order QCs are ignored) and Open/Close idempotency would guard against regressions; acceptable to skip for an observability-only change, but worth noting.
  • Cursor's second-opinion pass (cursor-review.md) produced no output, and REVIEW_GUIDELINES.md was empty on the base branch, so no repo-specific standards were applied. Codex's single medium finding (the inFlight counter leak) is incorporated as an inline comment.
  • Minor: the Send metric is recorded when a message is enqueued to the frame queue (stream.go:95) rather than when it is actually written to the connection, so send_bytes/send_msgs slightly lead the wire. This is a reasonable approximation but worth a comment if precise byte accounting is ever expected.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

return
}
inner.closed.local = true
inner.metrics.Close()

[suggestion] The inFlight up/down counter is only decremented here in the local close() path. Open() increments it once a stream's recv end is established, but the decrement never runs if a stream is abandoned without a local Close() — e.g. Mux.Run exits on a connection/read/write error, or a caller observes remote close (tryPrune in runRecv only fully prunes once both local and remote are closed) but never calls Close(). Because inFlight is a process-global counter shared across all connections, this drifts permanently upward across peer churn, making the gauge unreliable over a long-running node. Consider decrementing symmetrically on prune/teardown (e.g. from tryPrune or a runner shutdown hook) so an opened stream is always balanced by exactly one decrement. (Matches Codex's medium finding.)

Contributor Author

Local Close is ALWAYS required by the contract of the Stream object: if mux.Connect/Accept succeeds, then the caller is required to call Close eventually.
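
A minimal sketch of the contract described in this thread, assuming a process-wide counter and a stream handle whose Close is idempotent. The names `inFlight`, `stream`, `open`, and `handle` are hypothetical and only model the pairing; they are not the mux API.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// inFlight is a hypothetical process-wide gauge of open streams.
var inFlight atomic.Int64

// stream is an illustrative handle; the real mux Stream has more state.
type stream struct {
	closeOnce sync.Once
}

// open increments the gauge once the stream is established.
func open() *stream {
	inFlight.Add(1)
	return &stream{}
}

// Close decrements the gauge exactly once, however many times it is called.
func (s *stream) Close() {
	s.closeOnce.Do(func() { inFlight.Add(-1) })
}

// handle shows the caller-side contract: a successful open is always
// paired with a deferred Close, including on error returns.
func handle(fail bool) error {
	s := open()
	defer s.Close()
	if fail {
		return fmt.Errorf("stream aborted")
	}
	return nil
}

func main() {
	_ = handle(false)
	_ = handle(true)
	fmt.Println(inFlight.Load())
}
```

With the deferred Close in place the gauge returns to its starting value on both the success and the error path; the drift the reviewer describes only appears if some caller skips that pairing.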

@seidroid seidroid Bot dismissed their stale review July 1, 2026 13:46

Superseded: latest AI review found no blocking issues.

@bdchatham bdchatham left a comment

Contributor

xreview summary (opentelemetry / sei-networking / systems / idiom lenses). Nice, well-scoped instrumentation — most of it is clean. Three points worth resolving before merge, plus a few smaller ones inline.

🔴 MeterProvider wiring — where is the global MeterProvider set up? otel.Meter() is a no-op unless a provider with a Prometheus/OTLP reader and a service.name resource is registered in the binary, and I don't see it in this diff. If it isn't already wired in main, every instrument here silently records nothing — worth confirming before merge.

On your weighted-histogram question: you don't need to call Record more than once. You have exactly one latency value per commit, so it's a single Record into a Float64Histogram — the sum+count counter split was solving a problem the code doesn't have. Details in the inline comment on commitToCommitLatency.

Comment thread sei-tendermint/internal/p2p/mux/metrics/metrics.go Outdated
Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go Outdated
"tendermint_internal_autobahn_avail__app_global_block_number",
metric.WithDescription("global block number of the highest observed appQC"),
))
var proposalToCommitLatency = utils.OrPanic1(meter.Float64Histogram(

Contributor

🟡 Add metric.WithUnit("s") on the avail latency instruments — the mux-side latency sets it but these don't, so the exporter won't append _seconds and the two packages end up with inconsistently-named latency series.

Comment thread sei-tendermint/internal/p2p/rpc/rpc.go Outdated
return &RPC[API, Req, Resp]{kind, limit, req, resp}
service[kind] = &rpcConfig{
limit: limit,
name: fmt.Sprintf("%T", utils.Zero[Req]()),

Contributor

🟡 fmt.Sprintf("%T", ...) bakes the fully-qualified Go type into the rpc_name label, so renaming or moving the request type silently breaks dashboard continuity. Prefer an explicit, stable name string.
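
To make the concern concrete, here is a small stdlib-only sketch of what a `%T`-derived label looks like. `GetBlockReq` is a hypothetical request type declared only for this example; `get_block` is the explicit name mentioned in the PR summary.

```go
package main

import "fmt"

// GetBlockReq is a hypothetical request type used only for illustration.
type GetBlockReq struct{}

// typeDerivedName mirrors the fmt.Sprintf("%T", ...) approach: the label
// is whatever the Go type is currently called, package prefix included.
func typeDerivedName(req any) string {
	return fmt.Sprintf("%T", req)
}

func main() {
	fmt.Println(typeDerivedName(GetBlockReq{}))  // changes if the type is renamed or moved
	fmt.Println(typeDerivedName(&GetBlockReq{})) // pointer-ness leaks into the label too
	fmt.Println("get_block")                     // explicit name: stable across refactors
}
```

The derived label follows the type through every rename or package move, whereas the explicit string only changes when someone edits it deliberately.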

Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go Outdated
Comment thread sei-tendermint/internal/p2p/mux/stream.go
@bdchatham

Contributor

Weighted histograms in OTel — summary of the thread

TL;DR on "how do I migrate the weighted histograms without calling Record N times":

There's no weighted Record in the OTel Go API — a histogram counts observations, so weighting a single latency by N blocks means either N Record calls or emitting a pre-aggregated point yourself. Two things worth separating: is looping actually a problem (no), and what's the right metric (probably not a weighted histogram at all).

1. Looping Record is cheap at our scale. 1–120 blocks at single-digit commits/sec is single-digit µs/commit — invisible next to the sequencing work. The only real footgun is allocating the attribute set inside the loop, so hoist it:

recordOpts := metric.WithAttributeSet(attrSet) // build once, outside the loop
for i := uint64(0); i < blocks; i++ {
    seqLatency.Record(ctx, latencySec, recordOpts)
}

2. But I'd reconsider the weighted histogram itself. Replaying one shared per-commit measurement into N identical points inflates the statistical support and quietly redefines _count (blocks, not commits). The honest signal keeps latency and load orthogonal:

seqLatency.Record(ctx, latencySec, opts)            // per-commit, unweighted → commit-health p99
blocksSequenced.Add(ctx, int64(blocks), opts)       // blocks_sequenced_total → throughput
// optional: blocksPerCommit.Record(ctx, float64(blocks), opts) // batch-size distribution

Now rate(blocks_sequenced_total) gives throughput, the histogram stays meaningful, and you can correlate "latency rose when blocks/commit rose" on one panel — strictly more information than one weighted blob.

3. If you genuinely need a pre-aggregated / weighted histogram in OTel, it's doable natively via a custom metric.Producer — the SDK analogue of Prometheus MustNewConstHistogram. You register it on the reader (metric.WithProducer(...), passed into prometheus.New(...) — verify the exact symbol on our pinned SDK) and hand back a point whose Count/Sum/BucketCounts you set directly, so weighting is "add N to the bucket, N to count":

func (p *seqLatencyProducer) Observe(latencySec float64, blocks uint64) {
    p.mu.Lock()
    defer p.mu.Unlock()
    i := sort.SearchFloat64s(p.bounds, latencySec) // first bound >= value
    p.bucketCounts[i] += blocks                    // weight by N, O(1)
    p.count += blocks
    p.sum += latencySec * float64(blocks)
}

func (p *seqLatencyProducer) Produce(ctx context.Context) ([]metricdata.ScopeMetrics, error) {
    p.mu.Lock()
    bc := append([]uint64(nil), p.bucketCounts...) // snapshot under lock
    dp := metricdata.HistogramDataPoint[float64]{
        Attributes: p.attrs, StartTime: p.start, Time: time.Now(),
        Count: p.count, Bounds: p.bounds, BucketCounts: bc, Sum: p.sum,
    }
    p.mu.Unlock()
    return []metricdata.ScopeMetrics{{
        Scope: instrumentation.Scope{Name: "consensus"},
        Metrics: []metricdata.Metrics{{
            Name: "consensus_block_sequencing_latency", Unit: "s",
            Data: metricdata.Histogram[float64]{
                Temporality: metricdata.CumulativeTemporality,
                DataPoints:  []metricdata.HistogramDataPoint[float64]{dp},
            },
        }},
    }}, nil
}

The catch: you now own the cumulative-temporality invariants (bucket counts monotonically non-decreasing + a stable StartTime, or Prometheus reads a reset and histogram_quantile breaks), your own concurrency, your own attribute sets, and you lose Views + exemplars (you've bypassed the SDK aggregator). That's worth it only when the bucket counts already exist upstream (a pre-aggregated subprocess/library/batch source) or the per-observation path is genuinely hot — i.e. folding in an aggregation, not synthesizing one. Not our case here, so I'd go with option 1 or 2.
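
The "add N to the bucket, N to count" weighting can be sanity-checked without any SDK. The sketch below uses a hypothetical `hist` type with per-bucket counts (not the OTel or Prometheus types) and compares one weighted observation against N unit observations of the same value.

```go
package main

import (
	"fmt"
	"sort"
)

// hist is a minimal histogram with per-bucket counts, for illustration only.
type hist struct {
	bounds  []float64 // upper bounds, ascending; the last bucket is +Inf
	buckets []uint64  // len(bounds)+1 entries
	count   uint64
	sum     float64
}

func newHist(bounds []float64) *hist {
	return &hist{bounds: bounds, buckets: make([]uint64, len(bounds)+1)}
}

// observeWeighted adds one value with weight n using a single bucket lookup.
func (h *hist) observeWeighted(v float64, n uint64) {
	i := sort.SearchFloat64s(h.bounds, v) // first bound >= v
	h.buckets[i] += n
	h.count += n
	h.sum += v * float64(n)
}

func main() {
	bounds := []float64{0.1, 0.5, 1}
	weighted, looped := newHist(bounds), newHist(bounds)
	weighted.observeWeighted(0.25, 4)
	for i := 0; i < 4; i++ {
		looped.observeWeighted(0.25, 1)
	}
	fmt.Println(weighted.buckets, looped.buckets)
	fmt.Println(weighted.count == looped.count, weighted.sum == looped.sum)
}
```

Both histograms end up with the same bucket counts and total count, which is why the weighted form and the looped Record calls are interchangeable from the scraper's point of view. The sums match exactly here because 0.25 is exactly representable; for other values they can differ by floating-point rounding.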

On the API-ergonomics point: yes, client_golang's MustNewConstHistogram is more ergonomic for pre-aggregated data — that's the price of OTel being backend/temporality-agnostic (push+pull, delta+cumulative, Views, exemplars). If we ever want the raw ergonomics for a few metrics, we can register a plain client_golang Collector on the same Prometheus registry and it scrapes alongside the OTel series — but only under a pull/Prometheus exporter, not OTLP push.

}
// Constructed once per CommitQC, which we should afford.
attrs := metric.WithAttributeSet(attribute.NewSet(
// Timeouts capped: 20 means [20,inf)

Contributor

I suppose we almost never have 20+ timeouts on one RoadIndex? What happens if the cluster gets stuck?

@pompon0 pompon0 Jul 2, 2026

Contributor Author

then we don't care about this metric, I suppose.
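
For readers unfamiliar with the capping idea in the code comment ("20 means [20,inf)"), a minimal sketch follows. The function and constant names are hypothetical; only the cap value comes from the thread.

```go
package main

import (
	"fmt"
	"strconv"
)

// maxTimeoutsLabel is the cap from the code comment: 20 stands for [20, inf).
const maxTimeoutsLabel = 20

// timeoutsLabel maps a timeout count to a label value, folding everything
// at or above the cap into one value so the label set stays bounded.
func timeoutsLabel(timeouts uint64) string {
	if timeouts > maxTimeoutsLabel {
		timeouts = maxTimeoutsLabel
	}
	return strconv.FormatUint(timeouts, 10)
}

func main() {
	for _, n := range []uint64{0, 3, 20, 1000} {
		fmt.Println(n, timeoutsLabel(n))
	}
}
```

However long a stuck cluster keeps timing out, this scheme produces at most 21 distinct label values, which keeps the series count bounded.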

}

var observedCommitQC = newObserved[*types.CommitQC]()
var observedAppQC = newObserved[*types.AppQC]()

Contributor

Do we need the full QC? We are only using RoadIndex, View Index, and timestamp I suppose?

Contributor Author

This is a simplification; there is no need to narrow it down.

return false, nil
}
i.latestAppQC = utils.Some(appQC)
metrics.ObserveAppQC(appQC)

Contributor

This could be much later than when we actually observe the AppQC for this RoadIndex, right? Packing the AppQC into a block proposal is optional.

Contributor Author

True, these metrics describe the avail.State content.

i.commitQCs.prune(idx)
if i.commitQCs.next == idx {
i.commitQCs.pushBack(commitQC)
metrics.ObserveCommitQC(c, commitQC)

Contributor

Sorry I'm confused, why do we observe these QCs inside prune() instead of right after the QCs are verified?

Contributor Author

The metrics track the newest commitQC; the commitQC from the anchor is the newest iff the anchor wipes the whole data state.

@pompon0 pompon0 changed the base branch from main to gprusak-prometheus July 3, 2026 11:02

@seidroid seidroid Bot left a comment

This PR adds otel/Prometheus metrics for autobahn/avail (commit/app QC latency & progress gauges) and the p2p mux (per-RPC latency, in-flight, msg/byte counters), threads an RPC name through registration, and generalizes the metricsgen tool to support lowercase struct names. The changes are well-structured and correct; I found no blocking issues, only minor observability notes.

Findings: 0 blocking | 5 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • The Cursor second-opinion review (cursor-review.md) is empty — that pass produced no output.
  • The Codex review (codex-review.md) targets sei-tendermint/node/node.go (NoOpMetricsProvider / package-level Global collectors), which is NOT one of this PR's 13 changed files. Those specific findings are out of scope for this PR and appear to reference stale or mis-scoped code. The general observation still partially applies: the new mux/metrics package uses package-level Global collectors registered in init(), so multiple node instances in one process (e.g. tests) share and merge these series. This is consistent with the repo's existing metricsgen Global pattern and is not a regression introduced here, but is worth keeping in mind for any in-process multi-node setups.
  • inFlight gauge relies on Stream.Open()/Close() being paired. Close() is idempotent at the caller (guarded by inner.closed.local), so double-decrement is not possible, but if a stream is opened and its connection tears down without close() ever running, the gauge can drift upward. Worth confirming all stream lifecycles funnel through close().
  • No unit tests accompany the new metrics recording paths (ObserveCommitQC/ObserveAppQC dedup logic, mux Stream open/send/recv/close). The dedup guards (Index()/GlobalNumber() ordering) would benefit from a small test; metricsgen itself does get a new lowercase-struct test.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

Comments that couldn't be anchored to the diff

  • sei-tendermint/internal/autobahn/avail/metrics/metrics.go:106 -- [nit] qc.Proposal().View().Number is labeled timeouts. The view number is a reasonable proxy for the number of timed-out rounds, but the mapping isn't obvious from the code; a brief comment tying view number to "timeouts" would help future readers (the cap-at-20 comment is good).

Comment thread sei-tendermint/internal/autobahn/avail/metrics/metrics.go Outdated

@seidroid seidroid Bot left a comment

A well-structured observability PR (metricsgen tag unification + new autobahn/avail and p2p/mux metrics). No correctness or security issues found; one valid non-blocking metrics-quality concern (startup replay inflates a latency histogram) and a couple of cosmetic notes.

Findings: 0 blocking | 5 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Startup metric skew (confirmed, Codex): newInner replays the persisted prune anchor via i.prune(...) (inner.go:92), and prune now calls metrics.ObserveCommitQC. On the first observation after restart the commit-to-commit guard is skipped (no prior last), but proposal_to_commit_latency is still observed as now - anchor.Proposal().Timestamp(), so every restart adds one spurious, downtime-inflated sample to that histogram. Consider skipping latency observations during replay (record only the state gauges) so dashboards/percentiles aren't distorted. Not a consensus-correctness issue.
  • Import-alias inconsistency (nit): generated metrics.gen.go files now alias the prometheus helpers as tmprometheus, while the hand-written metrics.go files (e.g. internal/state/metrics.go, internal/evidence/metrics.go, internal/eventlog/metrics.go) still use tmmetrics. It compiles fine (per-file aliases) but is slightly confusing; optionally align them.
  • sei-tendermint/internal/p2p/mux/metrics/metrics.go imports prometheus with an explicit alias identical to the package name (prometheus "..."); the alias is redundant and can be dropped.
  • Second-opinion passes: cursor-review.md was empty (Cursor produced no output). codex-review.md contained a single finding, which is incorporated above. REVIEW_GUIDELINES.md was empty, so no repo-specific standards were applied.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

i.commitQCs.prune(idx)
if i.commitQCs.next == idx {
i.commitQCs.pushBack(commitQC)
metrics.ObserveCommitQC(c, commitQC)

[suggestion] prune is invoked during startup replay of the persisted prune anchor (newInner at inner.go:92), so ObserveCommitQC here records a proposal_to_commit_latency sample equal to now - anchor.Proposal().Timestamp(). After a restart that proposal timestamp is old, producing one downtime-inflated outlier per restart. (commit_to_commit_latency is safe on the first call since there is no prior observation.) Consider observing latency only for live commits and updating just the state gauges during replay.

Contributor Author

WAI

@pompon0 pompon0 changed the title otel metrics for autobahn/avail and p2p/mux prometheus metrics for autobahn/avail and p2p/mux Jul 3, 2026
seidroid Bot previously requested changes Jul 3, 2026

@seidroid seidroid Bot left a comment

A large but largely mechanical metrics refactor (import alias rename, pointer→value wrapper vecs, new metrics_buckets tag + NoBuckets, new avail/data/mux metrics) that is logically sound and well-tested, but the hand-written libs/utils/prometheus/prometheus.go is not gofmt/goimports-clean and will fail the golangci-lint CI job the repo mandates.

Findings: 3 blocking | 3 non-blocking | 2 posted inline

Blockers

  • sei-tendermint/libs/utils/prometheus/prometheus.go is not gofmt/goimports-clean, so the golangci-lint job (which enables the gofmt and goimports formatters) will fail: (a) the import group at lines 12–13 orders k8s.io/... before github.com/sei-protocol/... (goimports sorts alphabetically, github.com < k8s.io); (b) struct { ... } single-line declarations at lines 17 and 27 must be struct{ ... } under gofmt. Run gofmt -s -w and goimports -w on this file (the only non-generated new source file that isn't formatter-clean).
  • 2 blocking issue(s) flagged inline on specific lines.

Non-blocking

  • Cursor's second-opinion review (cursor-review.md) was empty — that pass produced no output. Codex's single finding (the gofmt/goimports issue above) is confirmed.
  • The new p2p/mux metrics (mux/metrics/metrics.gen.go) and several migrated histograms are registered with empty Help: "" strings; consider adding help text for operator readability, though this matches some pre-existing metrics.
  • mux/metrics.Stream increments inFlight in Open() but only decrements it in Close() when a start time is set; a stream that is opened but whose Close() is never reached (e.g. connection torn down without the normal close path) would leak the in_flight gauge. Worth confirming the mux always calls Close().

dto "github.com/prometheus/client_model/go"
"google.golang.org/protobuf/proto"

"k8s.io/component-base/metrics/prometheusextension"

[blocker] Import ordering is not goimports-clean: within this group k8s.io/component-base/... is placed before github.com/sei-protocol/..., but goimports sorts alphabetically (github.com < k8s.io). This will fail the goimports linter in CI. Run goimports -w on the file.

)

type HistogramOpts = prometheus.HistogramOpts
type HistogramVec struct { *prometheusextension.WeightedHistogramVec }

[blocker] Not gofmt-clean: struct { *prometheusextension.WeightedHistogramVec } will be reformatted by gofmt to struct{ ... } (same for type Observer struct { ... } on line 27). The repo enables the gofmt formatter in golangci-lint, so this will fail CI. Run gofmt -s -w on this file.

Comment thread sei-tendermint/internal/autobahn/avail/inner.go

@seidroid seidroid Bot left a comment

A large, mostly-mechanical and well-tested refactor of the sei-tendermint Prometheus metrics layer (value-type metric vecs, a custom weighted Histogram, a reworked metricsgen bucket-tag syntax) plus new autobahn/avail, autobahn/data, and p2p/mux stream metrics. The core logic is sound with good test coverage; no correctness blockers, only observability/operational notes.

Findings: 0 blocking | 4 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Second-opinion passes surfaced nothing actionable: Codex reported "No material findings" (and noted it could not run the Go 1.25.6 tests due to a Go 1.24 toolchain / no network), and the Cursor review file (cursor-review.md) is empty — that pass produced no output.
  • Operator/dashboard impact: several metrics are renamed or restructured (e.g. the data-package histogram sei_data__latency becomes tendermint_internal_autobahn_data_latency, and weighted-histogram wiring is replaced by the new custom Histogram). This is expected for a metrics refactor but will break existing Grafana dashboards/alerts referencing the old names — worth calling out in release notes.
  • Semantics of the new custom Histogram differ subtly from the standard prometheus client: an empty/none bucket set yields only a single +Inf bucket (count+sum only, no distribution), and there is intentionally no implicit DefBuckets at the library layer — the DefBuckets default is applied by metricsgen instead. Anyone hand-constructing a tmprometheus.HistogramVec outside the generator must remember to pass buckets explicitly.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

// Latency from proposal being constructed to commit being observed.
proposalToCommitLatency prometheus.HistogramVec `metrics_buckets:"exp(0.01, 1.2, 35)"`
// Latency between consecutive commits being observed.
commitToCommitLatency prometheus.HistogramVec `metrics_labels:"timeouts" metrics_buckets:"none"`

[nit] commitToCommitLatency uses metrics_buckets:"none", so this "latency" histogram exports only the implicit +Inf bucket (i.e. count and sum, no distribution/quantiles). If the intent is just an average per timeouts label that's fine, but for a latency metric a finite bucket set (e.g. exp(...)) is usually more useful; please confirm this is intentional.

Contributor Author

WAI
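
To give a feel for the `exp(0.01, 1.2, 35)` bucket spec on proposalToCommitLatency, here is a stdlib-only sketch. It assumes the tag follows the usual exponential-bucket convention (start, factor, count), as in client_golang's ExponentialBuckets; whether metricsgen interprets the arguments exactly this way is an assumption, and `expBuckets` is a hypothetical helper.

```go
package main

import "fmt"

// expBuckets returns count upper bounds: start, start*factor, start*factor^2, ...
// This mirrors the usual exponential-bucket convention; whether the
// metrics_buckets tag uses exactly these semantics is an assumption.
func expBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, 0, count)
	v := start
	for i := 0; i < count; i++ {
		buckets = append(buckets, v)
		v *= factor
	}
	return buckets
}

func main() {
	b := expBuckets(0.01, 1.2, 35)
	fmt.Printf("%d buckets from %.2f to %.2f\n", len(b), b[0], b[len(b)-1])
}
```

Under that assumption, and if the unit is seconds, the finite bounds run from 10 ms to roughly 4.9 s in 20% steps. A "none" spec has no finite bounds at all, which is why only count and sum are available for commitToCommitLatency.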

@seidroid seidroid Bot dismissed their stale review July 3, 2026 18:37

Superseded: latest AI review found no blocking issues.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 045d98f.

}
Global.appRoadIndexAt().Set(int64(qc.Proposal().RoadIndex())) // nolint: gosec
// +1 is for consistency with commitGlobalBlockNumber
Global.appGlobalBlockNumberAt().Set(int64(qc.Proposal().GlobalNumber() + 1)) // nolint: gosec

App global gauge misaligned

Low Severity

The app global block gauge is set to GlobalNumber()+1, while the commit gauge uses GlobalRange(c).Next. For commit QCs that finalize multiple global blocks, the app proposal’s global number may be the range start, so the two gauges diverge even when the pair is valid.

Reviewed by Cursor Bugbot for commit 045d98f.

@pompon0 pompon0 changed the title prometheus metrics for autobahn/avail and p2p/mux prometheus metrics for autobahn/avail, autobahn/data and p2p/mux Jul 3, 2026

@seidroid seidroid Bot left a comment

This PR swaps the heavy k8s.io/component-base metrics dependency for the repo's own metricsgen-based prometheus wrappers and adds observability for autobahn avail/data QCs and p2p mux streams. The refactor is clean and the changes are non-app-hash-breaking; the only substantive concern is a potential slow drift of the new in_flight gauge on abnormal stream teardown.

Findings: 0 blocking | 4 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Second-opinion passes: cursor-review.md and REVIEW_GUIDELINES.md were both empty, so no Cursor findings or repo-specific guidelines were available to merge; only the Codex pass produced output (the in_flight observation, reflected inline).
  • mux metrics (latency, in_flight, send_msgs, recv_msgs, send_bytes, recv_bytes) are generated with empty Help strings. Adding doc comments on the struct fields in sei-tendermint/internal/p2p/mux/metrics/metrics.go would give the exported Prometheus metrics meaningful HELP text.
  • ObserveCommitQC/ObserveAppQC use package-global mutex-guarded observed* state and a process-global Global registry; in multi-node in-process test setups the cross-node dedup (last.val.Index() >= qc.Index()) will drop observations from other nodes. This is an inherent limitation of the global-metrics pattern (metrics are process-global anyway) and fine for production single-node use, but worth being aware of when interpreting test metrics.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

ctrl.Updated()
return err
}
inner.metrics.Open()

[suggestion] in_flight is incremented here in Open() but only decremented in the local close() path (inner.metrics.Close()). The ordering is correct for the open() error path (the increment happens only after WaitUntil succeeds, so the early s.close(inner) above doesn't over-decrement). However, because the gauge lives in a process-global registry, any stream that is opened but never has Stream.Close() called on it — e.g. a stream abandoned when Mux.Run exits on a connection/read/write error and its owning goroutine returns without closing — leaks a permanent +1. Over many reconnects the gauge can drift upward and stop reflecting the true in-flight count, and the latency sample for those streams is never recorded. Consider tying the decrement/observe to stream teardown/pruning (tryPrune) rather than only to the application calling Close().

3 participants