Skip to content

feat(os-index): implement OS getIndicesStats and getClusterHealth for Phase 1 portlet display#35853

Open
fabrizzio-dotCMS wants to merge 6 commits into
mainfrom
fix/issue-35820-debug-ghost-index
Open

feat(os-index): implement OS getIndicesStats and getClusterHealth for Phase 1 portlet display#35853
fabrizzio-dotCMS wants to merge 6 commits into
mainfrom
fix/issue-35820-debug-ghost-index

Conversation

@fabrizzio-dotCMS
Copy link
Copy Markdown
Member

@fabrizzio-dotCMS fabrizzio-dotCMS commented May 27, 2026

Summary

  • Root cause: In Phase 1 (dual-write), OSIndexAPIImpl.getIndicesStats() and OSIndexAPIImpl.getClusterHealth() were unimplemented stubs returning empty maps. OS shadow indices appeared in the admin indices portlet with all metrics as n/a (count, size, replicas, health).
  • IndexAPIImpl.getIndicesStats() routed exclusively to the read provider (ES in Phase 1) via router.read(), so OS stats were never consulted even if implemented.

Changes

OSIndexAPIImpl — implements two previously stubbed methods:

  • getIndicesStats(): queries OS cluster via indices().stats(cluster_xxx.*), extracts primaries.docs().count() and primaries.store().sizeInBytes() per index, returns vendor-neutral IndexStats map
  • getClusterHealth(): queries OS cluster via cluster().health(cluster_xxx.*) at Indices level, returns vendor-neutral ClusterIndexHealth map
  • Errors are caught and logged — degraded portlet beats a hard failure

IndexAPIImpl.getIndicesStats() — changed from router.read() (ES-only) to the same dual-write aggregation pattern used by getClusterHealth(): merges ES and OS maps in Phase 1/2, OS entry wins on key collision.

Tests

  • ContentletIndexAPIImplMigrationIT (integration, live ES + OS clusters):

    • getIndicesStats Phase 1: ES(T0) and OS(T1) both present in merged map with valid data
    • getIndicesStats Phase 0: ES only (auto-skipped on single-cluster profile)
    • getClusterHealth Phase 1: ES(T0) and OS(T1) both present with non-null status and positive shard count
  • OSIndexAPIImplIntegrationTest (integration, live OS cluster):

    • getIndicesStats: creates two indices, verifies doc count ≥ 0, raw size ≥ 0, size string non-empty
    • getClusterHealth: creates an index, verifies status non-empty, shards > 0, replicas ≥ 0

Test plan

  • Run ContentletIndexAPIImplMigrationIT with FEATURE_FLAG_OPEN_SEARCH_PHASE=1 — ES(T0) and OS(T1) indices should both appear in getIndicesStats and getClusterHealth results
  • Verify admin indices portlet (/dotAdmin/#/dot-layout-grid) shows real count/size/replicas/health for OS shadow indices in Phase 1 instead of n/a

Closes #35820

🤖 Generated with Claude Code

…ts.jsp (#35820)

Adds a temporary debug panel and per-row provider column to the indices
portlet to identify the origin of each listed index during an ES→OS Phase 1
migration bootstrap.

Console output (Logger.info, tag [DEBUG #35820]):
- Active migration phase ordinal
- Raw keys from listDotCMSIndices(), getIndicesStats(), getClusterHealth()
- ES store slots from indicies table
- OS store slots from versioned_indices table (os:: tag stripped)
- Per-index: origin label, hasStats, hasHealth, active, building flags

Page output:
- Dark debug panel above the indices table with all diagnostic data
- Extra "[DBG] Provider / Origin" column on each index row
  showing ES-store / OS-store / ORPHAN label, plus in-DB confirmation

To be removed once root cause is confirmed on the cloud instance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 27, 2026

Claude finished @fabrizzio-dotCMS's task in 3m 33s —— View job


Claude PR Review

  • Read changed files
  • Analyze diagnostic code in IndexAPIImpl.java
  • Analyze diagnostic code in OSIndexAPIImpl.java
  • Review test changes
  • Post review

Scope drift — PR metadata is stale

The title and body describe a temporary Phase 1 diagnostic ("debug code will be removed after root cause is confirmed"), but commit 28f79abd already reverted the JSP diagnostic and subsequent commits (e5e615d5, 0788346c, 7dbf16b6) turned this into a real implementation of OS getIndicesStats / getClusterHealth plus aggregation in IndexAPIImpl. This is now production code that affects the maintenance portlet. Title (debug(index-portlet): …), body, and any release-notes line should be updated to reflect that scope before merge — reviewers and a future bisect-author will be misled otherwise.

Findings

1. toHumanReadableSize — locale bug and missing unitsOSIndexAPIImpl.java:425-430

if (bytes < 1_048_576L) return String.format("%.1fkb", bytes / 1_024.0);
  • String.format without an explicit Locale follows the JVM default — on a server with LANG=es_* / de_* this produces "1,2kb" instead of "1.2kb", which then collides with any consumer that parses the string. Pass Locale.ROOT.
  • No tb / pb branches. A 2 TB index renders as "2048.0gb". ES's ByteSizeValue.toString() (used in ESIndexAPI.java:170) handles those. Since the two providers' results are merged into one map by IndexAPIImpl.getIndicesStats(), this creates inconsistent formatting between rows in the same portlet table. Either reuse ByteSizeValue/equivalent or extend this helper.

2. ExpandWildcard.Open excludes closed indicesOSIndexAPIImpl.java:369

The ES path (ESIndexAPI.java:144) hits /<prefix>*/_stats with no wildcard filter — defaults are open+hidden depending on cluster setting. The OS path explicitly restricts to Open, so any closed OS index will silently produce no row in the portlet. Confirm this is the intended behavior or align both. getClusterHealth doesn't set a wildcard at all — also inconsistent with the stats call right above it.

3. No allowNoIndices(true)OSIndexAPIImpl.java:367-369, 401-402

Fresh cluster / pre-bootstrap path: a call before any index matches <prefix>* can throw IndexNotFoundException from the OS client. It's caught by the broad Exception handler, but produces an Logger.error stack trace per portlet load until indices exist. getClosedIndexes() in the same class sets allowNoIndices(true) explicitly — worth doing here too.

4. Silent merge collisionIndexAPIImpl.java:146-148

final Map<String, IndexStats> merged = new HashMap<>(esImpl.getIndicesStats());
merged.putAll(osImpl.getIndicesStats());

The javadoc says "On key collision the OS entry wins." In Phase 1/2 ES holds T0 names and OS holds T1 names, so collisions shouldn't happen — but if naming ever overlaps (misconfig, shared prefix across clusters) ES data is dropped without any signal. A Logger.warn when esStats.keySet() ∩ osStats.keySet() ≠ ∅ would make this debuggable. Same point applies to getClusterHealth() which already uses the identical pattern.

5. Broad catch (Exception e) → return empty mapOSIndexAPIImpl.java:390-393, 419-422

Renders as "no data" in the portlet, which is what was happening before this PR for a different reason. A transient network blip + cached empty map gives the operator the same misleading screen the ghost-index investigation was chasing. Matches existing patterns in the file, so not a blocker — but worth narrowing to OS client exceptions, or at least logging at warn with a hint that the table will be incomplete.

6. Test assertion strengthOSIndexAPIImplIntegrationTest.java

assertTrue("Document count must be non-negative", liveStats.documentCount() >= 0) and sizeRaw() >= 0 — these are tautological for long-typed counters. They confirm the path runs without throwing, which is fine, but they don't verify the docs/bytes are actually being read from the response. After createIndex, the index is empty, so checking documentCount() == 0 and sizeRaw() > 0 (an empty index still has segment overhead) would be tighter. Not blocking.

7. Nit — fully-qualified IndexStatsOSIndexAPIImpl.java

The OS client also has a class named IndexStats (org.opensearch.client.opensearch.indices.stats.IndexStats), so the long com.dotcms.content.index.domain.IndexStats references are intentional disambiguation. Fine as-is; flagging only so future cleanup doesn't break it by adding the import.


Tests cover the right cases (Phase 0 ES-only, Phase 1 dual-merge for both stats and health). Routing aggregation in IndexAPIImpl follows the same shape as the existing listIndices / getClusterHealth / getIndices aggregating-read methods documented in the class header.
· Branch

fabrizzio-dotCMS and others added 3 commits May 27, 2026 15:49
Old reindex-leftover indices had hasHealth=true but hasStats=false (closed
indices are excluded from _stats). They incorrectly fell through to UNKNOWN.

Now:
- hasStats              → Stale ES index (orphan in cluster, not in DB)
- hasHealth + isClosed  → Stale CLOSED ES index
- hasHealth             → Stale ES index (possibly closed — no stats)
- neither               → UNKNOWN (truly unresolvable)

Also switched inEs/inOs lookups to use bare (stripped) name so the
comparison is correct regardless of cluster-prefix presence in the
indices list.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… portlet display

Both methods were stubs returning empty maps. OS indices appeared in the
admin portlet with all metrics as n/a in Phase 1/2.

OSIndexAPIImpl: query OS cluster via indices().stats() and cluster().health()
using the existing clusterPrefix wildcard and hasClusterPrefix/removeClusterIdFromName
helpers. Errors are caught and logged — degraded portlet beats a hard failure.

IndexAPIImpl.getIndicesStats: changed from router.read() (ES-only) to the same
dual-write aggregation pattern used by getClusterHealth() — merges ES and OS maps
so both providers contribute stats in Phase 1/2. OS entry wins on key collision.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause confirmed. OS stats/health now implemented in OSIndexAPIImpl
and IndexAPIImpl — debug scaffolding no longer needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…egation

Unit (IndexAPIImplStatsAggregationTest, 8 cases — no cluster needed):
- Phase 0/3: single-provider delegation for getIndicesStats and getClusterHealth
- Phase 1: dual-write merge — disjoint keys, OS wins on collision, OS degraded fallback

Integration (OSIndexAPIImplIntegrationTest — live OS cluster):
- getIndicesStats: verifies doc count >= 0, raw size >= 0, non-empty size string
- getClusterHealth: verifies non-null status, shards > 0, replicas >= 0
  (replaces the previous "returns non-null" smoke test with assertions on actual data)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ntegration)

Removes the Mockito-based unit test and adds three real-cluster cases to
ContentletIndexAPIImplMigrationIT — the existing suite that exercises
phase-aware routing against live ES and OS containers:

- getIndicesStats Phase 1: ES(T0) + OS(T1) both present in merged map
- getIndicesStats Phase 0: ES only (skipped on single-cluster profile)
- getClusterHealth Phase 1: ES(T0) + OS(T1) both present with valid status/shards

Also strengthens OSIndexAPIImplIntegrationTest: replaces smoke tests with
assertions on actual data (doc count >= 0, size non-blank, shards > 0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@fabrizzio-dotCMS fabrizzio-dotCMS changed the title debug(index-portlet): Phase 1 ghost-index diagnostic (#35820) feat(os-index): implement OS getIndicesStats and getClusterHealth for Phase 1 portlet display May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[DEFECT] Confusing empty OS index appears in indices portlet on Phase 1 migration bootstrap

1 participant