Skip to content

IGNITE-28692 Add cache metrics for MDC-related checks#13238

Open
sergey-chugunov-1985 wants to merge 14 commits into
apache:masterfrom
sergey-chugunov-1985:ignite-28692
Open

IGNITE-28692 Add cache metrics for MDC-related checks#13238
sergey-chugunov-1985 wants to merge 14 commits into
apache:masterfrom
sergey-chugunov-1985:ignite-28692

Conversation

@sergey-chugunov-1985

Copy link
Copy Markdown
Contributor

Thank you for submitting the pull request to the Apache Ignite.

In order to streamline the review of the contribution
we ask you to ensure the following steps have been taken:

The Contribution Checklist

  • There is a single JIRA ticket related to the pull request.
  • The web-link to the pull request is attached to the JIRA ticket.
  • The JIRA ticket has the Patch Available state.
  • The pull request body describes changes that have been made.
    The description explains WHAT and WHY was made instead of HOW.
  • The pull request title is treated as the final commit message.
    The following pattern must be used: IGNITE-XXXX Change summary where XXXX - number of JIRA issue.
  • A reviewer has been mentioned through the JIRA comments
    (see the Maintainers list)
  • The pull request has been checked by the Teamcity Bot and
    the green visa attached to the JIRA ticket (see tab PR Check at TC.Bot - Instance 1 or TC.Bot - Instance 2)

Notes

If you need any help, please email dev@ignite.apache.org or ask anу advice on http://asf.slack.com #ignite channel.

@anton-vinogradov anton-vinogradov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed against IGNITE-28692. The two metrics are a reasonable direction, but a few things are worth addressing before merge — including one logic bug that the current tests don't catch and a per-exchange overhead that also hits non-MDC clusters. Details inline.

Higher level:

  1. The JIRA lists three deliverables — a warning on cache start plus the two metrics. The warning-on-cache-start isn't part of this PR. Is it intentionally split out? It's arguably the most visible signal for a user who simply forgot the cache-level configuration; a passive boolean metric is unlikely to reach exactly that user.

  2. IsCacheAffinityConfigurationMdcSafe is a heuristic over filter types, so it can't be a real guarantee in either direction: ClusterNodeAttributeColocatedBackupFilter is only MDC-safe when the cells are actually stretched across DCs (a deployment property, invisible from the filter), and a non-Rendezvous custom affinity is currently always reported as safe. Might be worth reflecting that in the metric description / Javadoc so Safe isn't read as a guarantee.

  3. For the distribution metric, a boolean loses diagnostic value — a count of partitions without full DC coverage would also serve as a severity signal. Just a thought, not blocking.

Comment on lines +660 to +661
if (!(filter instanceof MdcAffinityBackupFilter) && !(filter instanceof ClusterNodeAttributeColocatedBackupFilter))
return false;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClusterNodeAttributeAffinityBackupFilter places backups on nodes with a different value of the attribute, so with DATA_CENTER_ID it forces the backup into another DC — i.e. it is MDC-safe. But it's neither MdcAffinityBackupFilter nor ClusterNodeAttributeColocatedBackupFilter, so this if returns false for it. By the time we reach here such a filter is already guaranteed to contain DATA_CENTER_ID (otherwise we'd have returned at the check above), so it's safe to accept:

Suggested change
if (!(filter instanceof MdcAffinityBackupFilter) && !(filter instanceof ClusterNodeAttributeColocatedBackupFilter))
return false;
if (!(filter instanceof MdcAffinityBackupFilter)
&& !(filter instanceof ClusterNodeAttributeColocatedBackupFilter)
&& !(filter instanceof ClusterNodeAttributeAffinityBackupFilter))
return false;

This case isn't covered today: CACHE_WITH_MDC_SAFE_ATTRIBUTE_FILTER uses exactly this filter, but MdcCacheMetricsTest only checks its distribution metric, never IsCacheAffinityConfigurationMdcSafe — worth adding that assertion.

Minor follow-up for the multi-attribute case: getAttributeNames().contains(DATA_CENTER_ID) is necessary but not sufficient — with e.g. [DATA_CENTER_ID, RACK] the AND-semantics still allow a backup in the same DC but a different rack.

private void updateMdcMetrics(CacheGroupContext grp, AffinityAssignment assignment) {
BaselineTopology top = cctx.discovery().discoCache().state().baselineTopology();
if (top != null) {
int numberOfDataCenters = top.numberOfDatacenters();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numberOfDatacenters() returns -1 when DCs aren't configured, but that isn't handled here: for a non-MDC cluster we still iterate every partition of every non-replicated group on the exchange thread (allocating a stream per partition) and then discard the result — on every PME. An early return honours the -1 contract and skips the work for non-MDC clusters; it also avoids leaving the metric at its default true when the DC count is unknown:

Suggested change
int numberOfDataCenters = top.numberOfDatacenters();
int numberOfDataCenters = top.numberOfDatacenters();
if (numberOfDataCenters < 2)
return;

Comment on lines +250 to +253
private Boolean affCfgMdcSafe;

/** */
private Boolean mdcSafePartDistrib;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are written from the exchange thread (setMdcSafePartitionDistribution) and read by metric exporters / JMX from other threads. Without volatile there's no happens-before, so readers may observe a stale value — more likely to actually surface on weakly-ordered (ARM) CPUs. The other metric fields here are thread-safe (LongAdderMetric); these should be too:

Suggested change
private Boolean affCfgMdcSafe;
/** */
private Boolean mdcSafePartDistrib;
private volatile Boolean affCfgMdcSafe;
/** */
private volatile Boolean mdcSafePartDistrib;

Comment on lines +285 to +290
Collection<Map<String, Object>> allNodesAttrs = nodeMap.values();

if (!allNodesAttrs.isEmpty() && allNodesAttrs.iterator().next().get(ATTR_DATA_CENTER_ID) != null)
return (int)allNodesAttrs.stream().map(m -> m.get(ATTR_DATA_CENTER_ID)).distinct().count();

return -1;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inspecting only the first node's attributes makes this return -1 whenever the first baseline node happens to lack the DC attribute — even if every other node has it. That's exactly the partial-misconfiguration case this feature is meant to catch, and it would silently mark the whole cluster as safe. Counting across all nodes (ignoring nulls) is more robust:

Suggested change
Collection<Map<String, Object>> allNodesAttrs = nodeMap.values();
if (!allNodesAttrs.isEmpty() && allNodesAttrs.iterator().next().get(ATTR_DATA_CENTER_ID) != null)
return (int)allNodesAttrs.stream().map(m -> m.get(ATTR_DATA_CENTER_ID)).distinct().count();
return -1;
long dcs = nodeMap.values().stream()
.map(m -> m.get(ATTR_DATA_CENTER_ID))
.filter(dc -> dc != null)
.distinct()
.count();
return dcs == 0 ? -1 : (int)dcs;


assertNotNull(cacheWithMdcFilterDistributionSafeMetric);
assertNotNull(cacheWithColocatedFilterDistributionSafeMetric);
assertNotNull(cacheWithColocatedFilterDistributionSafeMetric);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy-paste: this repeats the colocated assertion. It should null-check the third metric (currently cacheWithMdcSafeAttrFilterDistributionSafeMetric is never asserted non-null):

Suggested change
assertNotNull(cacheWithColocatedFilterDistributionSafeMetric);
assertNotNull(cacheWithMdcSafeAttrFilterDistributionSafeMetric);

@anton-vinogradov anton-vinogradov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental double-submit — please disregard this review; see my other review on this PR (the one with the inline comments).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants