Skip to content

Stamp Honeycomb SampleRate from tracestate rate hint and export 5xx error tail for unsampled traces#4562

Open
erik-the-implementer wants to merge 5 commits into
mainfrom
tracestate-sample-rate
Open

Stamp Honeycomb SampleRate from tracestate rate hint and export 5xx error tail for unsampled traces#4562
erik-the-implementer wants to merge 5 commits into
mainfrom
tracestate-sample-rate

Conversation

@erik-the-implementer

@erik-the-implementer erik-the-implementer commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Motivation

The Cloudflare worker in front of Electric Cloud head-samples successful requests at 1:20 and keeps all >= 500 responses at sampleRate 1, weighting its exported spans accordingly. Electric only ever received the sampled/not-sampled bit via traceparent, which causes two problems today:

  1. ~20x undercount: Electric spans carry no SampleRate attribute, so Honeycomb aggregates over electric-region spans under-report worker traffic ~20x relative to the worker's own (weighted) spans.
  2. Missing server-side error traces: ~95% of error traces have no Electric spans at all. The worker's keep-on-error decision happens at export time — after it already propagated traceparent with sampled=0 — and Electric's parent-based sampler then drops every span for the request.

The worker now sends its sampling rate alongside traceparent on every proxied request (regardless of the sampled flag) via the W3C tracestate header:

tracestate: electric=rate:<positive-integer>    # currently rate:20

Honeycomb's OTLP ingest natively reads an integer span attribute named SampleRate and weights aggregates by it.

Behavior matrix (per shape GET request)

Remote parent Rate hint Response status Behavior
none (direct traffic) any Unchanged — spans recorded, no SampleRate attribute
sampled=1 rate:N < 500 Spans exported as today plus SampleRate = N on the root span and the stream_chunk child spans
sampled=1 rate:N >= 500 Spans exported as today plus SampleRate = 1 (error responses ignore the hint — mirrors the worker's keep-all-errors-at-rate-1 semantics)
sampled=0 any < 500 Unchanged — no spans (this is the volume win, deliberately not regressed)
sampled=0 any >= 500 New: exactly one root Plug_shape_get span synthesized and exported with SampleRate = 1, in the same trace as the worker's spans
any unparseable / rate:<1 any Hint ignored — behaves as if no hint was sent

Implementation

  • Electric.Plug.TraceContextPlug — after the existing traceparent extraction, parses the electric=rate:N member out of the (already-decoded) tracestate carried by the extracted span context, and stores %{parent_span_ctx, parent_sampled?, sample_rate_hint} in conn.private. Exposes trace_context/1 and sample_rate_attrs/2 for downstream plugs.
  • Electric.Plug.ServeShapePlugemit_shape_telemetry/1 (which runs on success, halt and caught-error paths once the response status is known) now ends with stamp_sample_rate/1:
    • sampled remote parent → adds the SampleRate attribute to the recording root span (attributes may be set any time before span end);
    • unsampled remote parent + status >= 500 → synthesizes the error-tail span (below);
    • no remote parent → no-op.
  • Electric.Shapes.Api.Response.send_stream/2 — the per-request shape_get.plug.stream_chunk child spans (which survive the flattening in Flatten 1:1 shape GET child spans into Plug_shape_get root-span attributes #4561 because they are 1:n, not 1:1) get the same SampleRate attribute at creation, from the stashed hint and the response status.
  • Electric.Telemetry.OpenTelemetry.export_unsampled_remote_span/4 — the error-tail mechanism. Since the parent-based sampler means no recording span exists during a remote-unsampled request, the span is synthesized when the 5xx response is emitted: a copy of the extracted remote parent span context with the W3C sampled trace flag forced on is installed as the parent in a fresh OTel context, and :otel_tracer.start_span/4 is called under it. The parent-based sampler (default {parent_based, %{root: always_on}} — verified in otel_sampler_parent_based.erl) then takes its remote_parent_sampledalways_on branch and records the span.

Synthesized-span fidelity (no SDK compromises needed)

  • Same trace, real parent: the span keeps the inbound trace_id and is parented on the worker's span id — no span-link fallback was necessary. Tests assert trace_id/parent_span_id equal the inbound traceparent values.
  • Correct timing: start time is backdated to the request start using the monotonic timestamp ServeShapePlug.start_telemetry_span/1 already records in conn.private[:electric_telemetry_span] (the OTel SDK stores span start/end as native monotonic time, so the value is directly usable); end time is "now", so the duration matches the request.
  • Standard attributes: it carries the same root-span attributes open_telemetry_attrs/1 puts on a normal Plug_shape_get span (final request/response state), plus SampleRate = 1 and an error status (error_str when available, HTTP <status> otherwise).

Notes / scope decisions

  • The synthesized error tail covers the root span only; per-chunk child spans of remote-unsampled 5xx requests stay dropped (they don't exist retroactively). Root-only is the useful unit here — it carries the full request attributes.
  • The error tail triggers on the hint-independent condition "remote parent unsampled + status >= 500" (per the behavior matrix), so 5xx responses on traces head-sampled away by any W3C-compliant upstream become visible server-side.
  • Known pre-existing edge: when a crash happens mid-stream after the response has been committed, emit_shape_telemetry/1 is skipped (inherited Plug.ErrorHandler limitation documented in Flatten 1:1 shape GET child spans into Plug_shape_get root-span attributes #4561's base) — but in that case the committed status was non-5xx to begin with, so no error-tail span is lost.
  • Sampler's exclude_spans semantics and the flattening work from the base branch are untouched.
  • Embedded usage (no MIX_TARGET=application, no OTel SDK): the propagator no-ops, so no trace context is ever stored and every path is unchanged.

Tests

  • test/electric/plug/trace_context_plug_test.exs (new): tracestate parsing — valid/absent/malformed hints, rate:0/negative/non-integer rejection, multi-member tracestate, whitespace, sampled-flag extraction, sample_rate_attrs/2 status mapping (200/304/404/499 → N; 500/503 → 1; no hint → %{}).
  • test/electric/plug/serve_shape_plug_sample_rate_test.exs (new, async: false, exports real spans to the test process via otel_simple_processor + otel_exporter_pid):
    • sampled parent + 200 → root span has SampleRate = 20 (and stream_chunk child spans too);
    • sampled parent + 5xx → SampleRate = 1;
    • unsampled parent + 5xx → exactly one exported Plug_shape_get span with SampleRate = 1, trace_id and parent_span_id matching the inbound traceparent, start_time <= end_time, error status, standard root attributes;
    • unsampled parent + 200 → nothing exported;
    • no remote parent → spans exported without SampleRate (200 and 5xx).
    • Test-infra note: restarting the :opentelemetry app with a pid exporter requires erasing the OTel API's per-application tracer persistent-term cache, because cached tracers embed the old provider's (empty) processor pipeline and create_application_tracers keeps stale entries across restarts.
  • Full test/electric/plug, test/electric/telemetry and test/electric/shapes/api directories pass (235 tests, 0 failures), including the files touched on the base branch.
  • mix compile --warnings-as-errors clean on both the application target and the default (embedded) target; mix format --check-formatted clean.

🤖 Generated with Claude Code

@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.34%. Comparing base (0947230) to head (25b7688).
⚠️ Report is 6 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4562      +/-   ##
==========================================
+ Coverage   58.11%   58.34%   +0.23%     
==========================================
  Files         369      370       +1     
  Lines       40459    40674     +215     
  Branches    11471    11552      +81     
==========================================
+ Hits        23513    23732     +219     
+ Misses      16871    16867       -4     
  Partials       75       75              
Flag Coverage Δ
packages/agents 71.37% <ø> (ø)
packages/agents-mcp 77.70% <ø> (+0.15%) ⬆️
packages/agents-mobile 75.49% <ø> (ø)
packages/agents-runtime 82.47% <ø> (+0.31%) ⬆️
packages/agents-server 74.83% <ø> (-0.15%) ⬇️
packages/agents-server-ui 6.25% <ø> (+<0.01%) ⬆️
packages/electric-ax 46.42% <ø> (ø)
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 91.71% <ø> (ø)
packages/y-electric 56.05% <ø> (ø)
typescript 58.34% <ø> (+0.23%) ⬆️
unit-tests 58.34% <ø> (+0.23%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

This PR teaches the sync service to honor an upstream head-sampling rate hint (tracestate: electric=rate:N) by stamping the SampleRate span attribute on shape-GET spans, and to synthesize a single root span for 5xx responses whose remote parent was head-sampled away. The implementation is clean, well-scoped, and unusually well-tested. (Reviews the PR's own commits; the PR is stacked on #4561.)

What's Working Well

  • OTel SDK semantics are used correctly, not fought. export_unsampled_remote_span/4 forces the sampled bit on a copy of the remote parent's span_ctx and starts the span under a fresh :otel_ctx.new(), so it does not pollute the process's current-span stack. Relying on the default {parent_based, %{root: always_on}} sampler taking its remote_parent_sampled branch is the right mechanism.
  • Timestamp units are consistent. start_time flows from System.monotonic_time() (native) and the end time from :opentelemetry.timestamp() (also native) — same clock, so the synthesized span's duration is real.
  • Embedded / non-telemetry safety. Every new code path is gated behind a non-nil trace_context/1, which is only populated when the W3C propagator extracts a parent. In embedded mode the propagator no-ops, so the new paths short-circuit.
  • Defensive parsing. sample_rate_hint/1's with chain rejects rate:0, negatives, floats, trailing garbage, and the empty value, all exhaustively covered by the table-driven test.
  • Hot-path discipline. In Response.send_stream/2 the sample_rate_attrs map is computed once before the chunk reduction and only Map.put-merged per chunk — no per-chunk recompute.
  • Test coverage is excellent. The behavior matrix maps 1:1 onto real exported-span assertions, plus the unit-level trace_context_plug_test.exs.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

None.

Issue Conformance

No linked issue (flagged per project convention — PRs should reference the issue/incident they address). The PR description remains exceptionally thorough: motivation (the ~20x undercount and ~95% missing error traces), a full per-request behavior matrix, scope decisions, and the known pre-existing mid-stream-crash limitation inherited from #4561. The implementation matches the described matrix, and the changeset (@core/sync-service: patch) is present and accurately scoped. No scope creep.

Previous Review Status

Iteration 5 — change since last review (commit 25b7688dd, "Fold sample-rate stamping into add_span_attrs_from_conn"):

A pure refactor of serve_shape_plug.ex (+49/-62, no other files touched): the standalone stamp_sample_rate/1 is removed and its logic folded into add_span_attrs_from_conn/1. The combined function now computes the full attribute map once (open_telemetry_attrs |> Map.merge(sample_rate_attrs(conn, conn.status))) and branches a single time on the (request-lifetime-fixed) remote trace context. I traced all four cases against the pre-refactor behavior and the integration matrix in serve_shape_plug_sample_rate_test.exs:

  • Sampled / direct (no remote parent): add_span_attributes(attrs) — equivalent to old add_span_attrs_from_conn + the parent_sampled?: true / nil branches of stamp_sample_rate. Empty sample_rate_attrs (no hint) collapses to plain conn attrs, as before.
  • Unsampled remote parent + 5xx: export_unsampled_error_span(conn, parent_span_ctx, attrs) with Map.put(attrs, sample_rate_attr(), 1). Since effective_sample_rate(_, status>=500) = 1, the merged attrs already carries SampleRate=1; the explicit Map.put keeps it correct even when the hint is absent. Final attribute set is identical to the old open_telemetry_attrs |> Map.put(..., 1). Still exactly one span, no add_span_attributes duplicate.
  • Unsampled remote parent + <500: export guard (is_integer(conn.status) and conn.status >= 500) is false, so nothing is exported. The non-recording span's dropped add_span_attributes call from the old path was a no-op anyway, so no observable change.
  • Span-start call site (start_telemetry_span/1, status nil): new code now also runs the trace-context branch here. With nil status the export guard can't match (is_integer(nil) is false), so the rate hint is stamped on the recording span as-is and overwritten with the effective value at emit time — exactly as the new docstring states. This is a harmless (arguably beneficial) early-stamp; it cannot synthesize a premature error span.

Also verified: the moved get_handle / get_live_mode / get_root_table helpers and the security-rationale comment on get_root_table are carried over verbatim; no dangling stamp_sample_rate references remain; sample_rate_attr/0 is still used. Net result is fewer moving parts (one function, one branch) with the externally observable span behavior unchanged.

Earlier iterations (b6d48d767 and before): the vendor-neutral language pass (iteration 4) and the two iteration-2 suggestions (uniform string keys in the stream_chunk attributes map; documented global-state mutation in the integration test) all remain resolved.

Remaining: Nothing. The refactor preserves behavior and improves readability; nothing regressed. This PR continues to look ready to merge.


Review iteration: 5 | 2026-06-12

Base automatically changed from flatten-shape-get-spans to main June 11, 2026 13:10
alco and others added 3 commits June 11, 2026 15:15
…tail

The Cloudflare worker in front of Electric head-samples successful
requests at 1:N (currently 1:20) and keeps all >= 500 responses at
sampleRate 1, weighting its exported spans accordingly. Electric only
received the sampled/not-sampled bit via traceparent, so:

 1. Electric spans carried no SampleRate attribute and electric-region
    aggregates under-reported worker traffic ~20x;
 2. ~95% of error traces had no server-side spans at all, because the
    worker's keep-on-error decision happens at export time, after a
    traceparent with sampled=0 was already propagated.

The worker now sends its rate hint on every proxied request via the
W3C tracestate header (member: `electric=rate:<N>`). This commit makes
Electric honor it:

 * TraceContextPlug parses the hint (missing/unparseable/rate < 1 hints
   are ignored) and stashes it in the conn together with the remote
   parent span context and its sampled flag.

 * For sampled remote parents, ServeShapePlug stamps `SampleRate` = N
   (status < 500) or `SampleRate` = 1 (status >= 500) on the
   Plug_shape_get root span once the response status is known; the
   shape_get.plug.stream_chunk child spans get the same attribute at
   creation.

 * For unsampled remote parents, a 5xx response now synthesizes a
   single root request span at response time with `SampleRate` = 1,
   carrying the standard root-span attributes, backdated to the request
   start, and parented onto the remote span context with the sampled
   trace flag forced on — so the parent-based sampler records it and it
   lands in the same trace as the worker's kept-on-error spans.
   Unsampled successful requests still export nothing.

Direct traffic (no remote parent) is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alco alco force-pushed the tracestate-sample-rate branch from f1158ed to 7fe2939 Compare June 11, 2026 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants