Skip to content

IGNITE-28747 GridToStringBuilder#handleRecursion may cause NPE. Colle…#13262

Open
EgorBaranovEnjoysTyping wants to merge 15 commits into
apache:masterfrom
EgorBaranovEnjoysTyping:IGNITE-28747
Open

IGNITE-28747 GridToStringBuilder#handleRecursion may cause NPE. Colle…#13262
EgorBaranovEnjoysTyping wants to merge 15 commits into
apache:masterfrom
EgorBaranovEnjoysTyping:IGNITE-28747

Conversation

@EgorBaranovEnjoysTyping

Copy link
Copy Markdown
Contributor

Collect toString as tree, avoid extra allocations and problems in recursion resolution
Problems solved:
Any recursive call in toString method would be handled
Allocate less memory
Memory wouldn't be kept for a whole thread life

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Refactors Ignite’s toString infrastructure to a node-based implementation with improved recursion handling and enhanced bounded string building, plus adds/extends tests for the new behavior.

Changes:

  • Replaced the previous thread-local SBLimitedLength + recursion map approach with a GridToStringNode tree, recursion monitors, and marker recovery.
  • Extended SBLimitedLength/CircularStringBuilder capabilities (insert/substring, head/tail limits) to support the new toString flow.
  • Added new unit tests for SBLimitedLength and expanded existing toString/circular buffer tests.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
modules/core/src/test/java/org/apache/ignite/testsuites/IgniteUtilSelfTestSuite.java Registers the new SBLimitedLengthSelfTest in the util test suite.
modules/core/src/test/java/org/apache/ignite/internal/util/tostring/SBLimitedLengthSelfTest.java Adds focused tests for append/insert/toString and prohibited operations in SBLimitedLength.
modules/core/src/test/java/org/apache/ignite/internal/util/tostring/GridToStringBuilderSelfTest.java Adds regression tests around recursion and an NPE scenario for the new implementation.
modules/core/src/test/java/org/apache/ignite/internal/util/tostring/CircularStringBuilderSelfTest.java Adds tests for CircularStringBuilder.insert and substring.
modules/commons/src/main/java/org/apache/ignite/internal/util/tostring/SBLimitedLength.java Updates limited-length builder behavior, adds insert routing logic, forbids mutating reductions, and integrates marker recovery.
modules/commons/src/main/java/org/apache/ignite/internal/util/tostring/SBLengthLimit.java Adjusts overflow logic and tail creation behavior used by SBLimitedLength.
modules/commons/src/main/java/org/apache/ignite/internal/util/tostring/NodeRecursionMonitor.java Introduces recursion tracking via a thread-local identity registry.
modules/commons/src/main/java/org/apache/ignite/internal/util/tostring/Node.java Adds node types and a factory to build a stringification tree with recursion termination and collection/map/array handling.
modules/commons/src/main/java/org/apache/ignite/internal/util/tostring/GridToStringBuilder.java Refactors to the node-based flow and adds throwable rendering.
modules/commons/src/main/java/org/apache/ignite/internal/util/tostring/CircularStringBuilder.java Adds insert and substring operations; removes reset.
modules/commons/src/main/java/org/apache/ignite/internal/util/GridStringBuilder.java Integrates marker recovery into various append/insert/replace operations and updates javadocs.
Comments suppressed due to low confidence (1)

modules/commons/src/main/java/org/apache/ignite/internal/util/tostring/SBLimitedLength.java:1

  • These i(...) overloads bypass SBLimitedLength's custom insert routing by calling super.i(...), which inserts directly into the head buffer and ignores head/tail limiting. They should delegate to this class’s i(int, String) (or equivalent) so inserts beyond the head limit are correctly redirected/handled.
/*

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@anton-vinogradov anton-vinogradov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging into this — the NPE is real and the diagnosis is correct. My concern is that a small, local defect is being fixed with a full rewrite of the toString engine, and the rewrite introduces problems that are worse than the original bug.

Root cause (small and local). SBLimitedLength overrides every a(...) (append) but none of the i(...) (insert) methods. handleRecursion() calls buf.i(pos, hash), which writes straight into impl() and bypasses onWrite() — the only place that allocates tail. The insert pushes the head past HEAD_LEN, so overflowed() flips to true while tail is still null, and the next a(savedName) takes the overflow branch -> tail.append(...) -> NPE. The RecursivePayload test reproduces exactly this.

The minimal fix is already in this PR. The i(...) overrides + lazy-tail in SBLimitedLength are the correct fix and are enough to close the ticket on their own (~2 files + the new test). Everything else — the GridToStringNode tree, NodeRecursionMonitor, the marker/recover mechanism, gutting GridToStringBuilder by ~685 lines — is an unrelated refactor.

Blocking concerns with the refactor (details inline):

  1. Marker/recover breaks the toString() contract. A nested S.toString() returns a marker string (new String("GridToStringNode")) instead of its value, recovered later by String identity. Any wrapping — "Wrapper{" + S.toString(...) + "}", very common in our code — defeats the identity lookup and leaks the literal GridToStringNode into the log. recoverObject's own javadoc concedes the design can't guarantee correctness.
  2. Thread-keyed static map leaks. static ConcurrentHashMap<Thread, ...> CATCHED_NODES pins Thread keys (and captured nodes) whenever clear() is skipped on any path. The original ThreadLocal was leak-free by construction; this replaces it with a slower, leak-prone map of identical semantics.
  3. toString() can now throw. markNode() ends with identities().orElseThrow(); on a path where init() didn't run, an inner toString() throws — the exact failure mode the original blanket try/catch existed to prevent.

Also: the tree is built in full (reflection + recursion + each field's toString()) and only length-limited at render time, which defeats the whole point of SBLimitedLength (cap the cost of toString on large graphs). The PR claims fewer allocations but adds per-node objects, Optional/lambda chains, a per-call new SBLimitedLength, and a recoverObject lookup on every append — with no benchmark.

Proposal: reduce this PR to the SBLimitedLength insert overrides + lazy tail + the RecursivePayload test. If we genuinely want to cut toString allocations, let's do it as a separate ticket with JMH before/after and an explicit story for large graphs and wrapped toString().

* @return The unique marker string.
*/
static String markNode(GridToStringNode node) {
String result = new String(GridToStringNode.class.getSimpleName());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two problems on this method:

  1. Using new String(...) identity as a map key is extremely fragile (see the wrapped-toString case).
  2. orElseThrow() makes markNode throw if the thread state wasn't initialized (a nested call on a path where init() didn't run). toString() must never throw — the original guarded every path with try/catch; this reintroduces throwing from toString.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. new String will recover new Object, toString may return value from String pool
  2. It's assertion, that init run.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Agreed on new String(...) — you do need a fresh identity because toString() may hand back an interned/pooled instance. That part is fine.

  2. But Optional.orElseThrow() is not a Java assert: it throws NoSuchElementException unconditionally (it's not gated by -ea), and that throw escapes from inside toString() — exactly the "toString must never throw" failure mode the original blanket try/catch existed to prevent. If the intent is only to document the "init() ran" invariant, use assert identities().isPresent() (or orElseThrow(() -> new AssertionError(...))), so it's a no-op in production instead of being able to turn a logging call into a thrown exception.

* A thread-local cache for nodes, used to handle references of
* inner toString() calls by mapping temporary markers to actual nodes.
*/
static final ConcurrentHashMap<Thread, IdentityHashMap<String, GridToStringNode>> CATCHED_NODES

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static ConcurrentHashMap<Thread, ...> keyed by Thread is a classic leak: in a pooled-thread server the keys never die, and any path that skips clear() (exception in a user toString, reentrancy) accumulates entries that pin the Thread and all captured nodes/objects. The value is a per-thread map touched only by its owner, so the ConcurrentHashMap buys nothing but overhead. This should be a ThreadLocal — which is what the original used and why it didn't leak. (nit: CATCHED -> CACHED.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clear is always called in finaly statements for first call in sequence

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The finally-clear handles the leak (given perfectly balanced init/clear), but the design point stands: this map is only ever touched by its owning thread, so ConcurrentHashMap<Thread, …> is pure overhead over a ThreadLocal — and LAST_CONSTRUCTED_… / OBJECT_REGISTRY right next to it are already ThreadLocal. A ThreadLocal<IdentityHashMap<…>> is leak-free by construction and removes the per-call thread lookup. Also the CATCHEDCACHED rename (Copilot's nit) doesn't look applied yet.

}

/** {@inheritDoc} */
@Override public GridStringBuilder i(int offset, String str) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the correct fix for the NPE. My suggestion is to ship this (the insert overrides + lazy tail) plus the RecursivePayload test as the entire patch, drop recoverObject from it, and split the node-tree work into a separate ticket/RFC.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped recoverObject.
I don't want to artificially modify the tests by removing valid cases just to satisfy the 'Fix NPE' requirement. I'd rather view this task as 'The handleRecursion method is not working correctly'."

@anton-vinogradov

Copy link
Copy Markdown
Contributor

I pushed a minimal alternative targeted at this branch: EgorBaranovEnjoysTyping#1

The root cause is a one-method omission: SBLimitedLength overrides every append a(...) but none of the insert i(...) methods, so the identity-hash insert in handleRecursion() (buf.i(pos, hash)) bypasses onWrite() / the lazy tail allocation. The insert can push the head past the limit while tail is still null, and the next append then NPEs on a null tail — the exact stack trace from the ticket.

The alternative overrides SBLimitedLength.i(int, String) (~16 lines) to route inserts through the same overflow accounting as a(...), and reverts the tostring package to the master baseline. It also adds a deterministic regression test that sweeps payload sizes across the HEAD_LEN boundary: it reproduces the NPE without the fix (pad ≈ 7978–7980) and passes with it. GridToStringBuilderSelfTest is green (20 tests).

My reasoning for preferring a localized fix over the node-tree rewrite is in the inline review comments above (marker/recover changing the toString contract, the Thread-keyed static map vs the leak-free ThreadLocal, and toString becoming able to throw). If you merge that PR into this branch, it replaces the current approach here.

@anton-vinogradov

Copy link
Copy Markdown
Contributor

Thanks for the round of fixes — re-reviewed at 7c75b1a.

The main blocker is genuinely resolved. markNode now returns the real node.toString() (wrapped in new String(...) for identity) instead of the literal GridToStringNode placeholder, so the dangerous "Foo{" + S.toString(...) + "}" concatenation case now degrades to the correct content rather than corrupting the log. I verified parity with master:

wrapped  : Parent [child=WRAP[Child [v=42]], name=p]
unwrapped: Parent2 [child=PlainChild [v=7]]

Cycle handling and the original NPE repro (self-ref sweep across the HEAD_LEN boundary, pad 7975–7985) also pass with no NPE. Dropping recoverObject and the typo/placeholder fixes look good. 👍

A few things still open:

  1. CATCHED_NODES (GridToStringNode:38) is still static ConcurrentHashMap<Thread, …>. Your "clear() is always in finally" answer addresses the leak, but not the design: this map is only ever touched by its owning thread, and LAST_CONSTRUCTED_… / OBJECT_REGISTRY in the same class are already ThreadLocal. A ThreadLocal here is leak-free by construction and drops the per-call Thread-keyed lookup — please make it consistent.

  2. markNode eagerly materializes the whole child subtree into a String (new String(node.toString())) just to use it as a map key. That's exactly the eagerness that weakens the point of SBLimitedLength — bounding the cost of toString on large graphs, not just the output length. The streaming baseline stopped doing work once the head filled; this renders every nested S.toString in full first. I'd still like a JMH before/after on a large/deep graph (or at least an explicit acknowledgement of the trade-off in the ticket).

  3. SBLimitedLength.i(int,String) else-branch still dereferences tail with no guard (:277). The "offset ≥ head ⇒ tail already exists" invariant is plausible, but it's implicit. Cheapest fix is to document it with assert tail != null so a future change can't silently reintroduce the NPE class this PR is fixing.

On the bigger picture: I understand you'd rather frame this as "handleRecursion is incorrect" than "fix the NPE", and the rewrite is meaningfully safer now. But the core concern stands — this is a ~1700-line architectural rewrite of the toString engine (two parallel nested-call mechanisms, markNode vs appendInnerBuffer) riding on an NPE ticket, with an unmeasured perf change on exactly the large-graph path the limiter exists for. I still think the cleanest path is the localized SBLimitedLength insert fix + regression test for this ticket, and the node-tree as a separate change with a JMH story — but that's a maintainer call, so I'll defer to the others on scope.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants