IGNITE-28692 Add cache metrics for MDC-related checks#13238
IGNITE-28692 Add cache metrics for MDC-related checks#13238sergey-chugunov-1985 wants to merge 14 commits into
Conversation
anton-vinogradov
left a comment
There was a problem hiding this comment.
Reviewed against IGNITE-28692. The two metrics are a reasonable direction, but a few things are worth addressing before merge — including one logic bug that the current tests don't catch and a per-exchange overhead that also hits non-MDC clusters. Details inline.
Higher level:
-
The JIRA lists three deliverables — a warning on cache start plus the two metrics. The warning-on-cache-start isn't part of this PR. Is it intentionally split out? It's arguably the most visible signal for a user who simply forgot the cache-level configuration; a passive boolean metric is unlikely to reach exactly that user.
-
IsCacheAffinityConfigurationMdcSafeis a heuristic over filter types, so it can't be a real guarantee in either direction:ClusterNodeAttributeColocatedBackupFilteris only MDC-safe when the cells are actually stretched across DCs (a deployment property, invisible from the filter), and a non-Rendezvous custom affinity is currently always reported as safe. Might be worth reflecting that in the metric description / Javadoc soSafeisn't read as a guarantee. -
For the distribution metric, a boolean loses diagnostic value — a count of partitions without full DC coverage would also serve as a severity signal. Just a thought, not blocking.
| if (!(filter instanceof MdcAffinityBackupFilter) && !(filter instanceof ClusterNodeAttributeColocatedBackupFilter)) | ||
| return false; |
There was a problem hiding this comment.
ClusterNodeAttributeAffinityBackupFilter places backups on nodes with a different value of the attribute, so with DATA_CENTER_ID it forces the backup into another DC — i.e. it is MDC-safe. But it's neither MdcAffinityBackupFilter nor ClusterNodeAttributeColocatedBackupFilter, so this if returns false for it. By the time we reach here such a filter is already guaranteed to contain DATA_CENTER_ID (otherwise we'd have returned at the check above), so it's safe to accept:
| if (!(filter instanceof MdcAffinityBackupFilter) && !(filter instanceof ClusterNodeAttributeColocatedBackupFilter)) | |
| return false; | |
| if (!(filter instanceof MdcAffinityBackupFilter) | |
| && !(filter instanceof ClusterNodeAttributeColocatedBackupFilter) | |
| && !(filter instanceof ClusterNodeAttributeAffinityBackupFilter)) | |
| return false; |
This case isn't covered today: CACHE_WITH_MDC_SAFE_ATTRIBUTE_FILTER uses exactly this filter, but MdcCacheMetricsTest only checks its distribution metric, never IsCacheAffinityConfigurationMdcSafe — worth adding that assertion.
Minor follow-up for the multi-attribute case: getAttributeNames().contains(DATA_CENTER_ID) is necessary but not sufficient — with e.g. [DATA_CENTER_ID, RACK] the AND-semantics still allow a backup in the same DC but a different rack.
| private void updateMdcMetrics(CacheGroupContext grp, AffinityAssignment assignment) { | ||
| BaselineTopology top = cctx.discovery().discoCache().state().baselineTopology(); | ||
| if (top != null) { | ||
| int numberOfDataCenters = top.numberOfDatacenters(); |
There was a problem hiding this comment.
numberOfDatacenters() returns -1 when DCs aren't configured, but that isn't handled here: for a non-MDC cluster we still iterate every partition of every non-replicated group on the exchange thread (allocating a stream per partition) and then discard the result — on every PME. An early return honours the -1 contract and skips the work for non-MDC clusters; it also avoids leaving the metric at its default true when the DC count is unknown:
| int numberOfDataCenters = top.numberOfDatacenters(); | |
| int numberOfDataCenters = top.numberOfDatacenters(); | |
| if (numberOfDataCenters < 2) | |
| return; |
| private Boolean affCfgMdcSafe; | ||
|
|
||
| /** */ | ||
| private Boolean mdcSafePartDistrib; |
There was a problem hiding this comment.
These are written from the exchange thread (setMdcSafePartitionDistribution) and read by metric exporters / JMX from other threads. Without volatile there's no happens-before, so readers may observe a stale value — more likely to actually surface on weakly-ordered (ARM) CPUs. The other metric fields here are thread-safe (LongAdderMetric); these should be too:
| private Boolean affCfgMdcSafe; | |
| /** */ | |
| private Boolean mdcSafePartDistrib; | |
| private volatile Boolean affCfgMdcSafe; | |
| /** */ | |
| private volatile Boolean mdcSafePartDistrib; |
| Collection<Map<String, Object>> allNodesAttrs = nodeMap.values(); | ||
|
|
||
| if (!allNodesAttrs.isEmpty() && allNodesAttrs.iterator().next().get(ATTR_DATA_CENTER_ID) != null) | ||
| return (int)allNodesAttrs.stream().map(m -> m.get(ATTR_DATA_CENTER_ID)).distinct().count(); | ||
|
|
||
| return -1; |
There was a problem hiding this comment.
Inspecting only the first node's attributes makes this return -1 whenever the first baseline node happens to lack the DC attribute — even if every other node has it. That's exactly the partial-misconfiguration case this feature is meant to catch, and it would silently mark the whole cluster as safe. Counting across all nodes (ignoring nulls) is more robust:
| Collection<Map<String, Object>> allNodesAttrs = nodeMap.values(); | |
| if (!allNodesAttrs.isEmpty() && allNodesAttrs.iterator().next().get(ATTR_DATA_CENTER_ID) != null) | |
| return (int)allNodesAttrs.stream().map(m -> m.get(ATTR_DATA_CENTER_ID)).distinct().count(); | |
| return -1; | |
| long dcs = nodeMap.values().stream() | |
| .map(m -> m.get(ATTR_DATA_CENTER_ID)) | |
| .filter(dc -> dc != null) | |
| .distinct() | |
| .count(); | |
| return dcs == 0 ? -1 : (int)dcs; |
|
|
||
| assertNotNull(cacheWithMdcFilterDistributionSafeMetric); | ||
| assertNotNull(cacheWithColocatedFilterDistributionSafeMetric); | ||
| assertNotNull(cacheWithColocatedFilterDistributionSafeMetric); |
There was a problem hiding this comment.
Copy-paste: this repeats the colocated assertion. It should null-check the third metric (currently cacheWithMdcSafeAttrFilterDistributionSafeMetric is never asserted non-null):
| assertNotNull(cacheWithColocatedFilterDistributionSafeMetric); | |
| assertNotNull(cacheWithMdcSafeAttrFilterDistributionSafeMetric); |
Thank you for submitting the pull request to the Apache Ignite.
In order to streamline the review of the contribution
we ask you to ensure the following steps have been taken:
The Contribution Checklist
The description explains WHAT and WHY was made instead of HOW.
The following pattern must be used:
IGNITE-XXXX Change summarywhereXXXX- number of JIRA issue.(see the Maintainers list)
the
green visaattached to the JIRA ticket (see tabPR Checkat TC.Bot - Instance 1 or TC.Bot - Instance 2)Notes
If you need any help, please email dev@ignite.apache.org or ask anу advice on http://asf.slack.com #ignite channel.