Fix: incorrect search result ranking - schema and indexer by mdorf · Pull Request #294 · ncbo/ontologies_linked_data

mdorf · 2026-06-02T02:08:47Z

Summary

Data-layer companion for fixing the term-search ranking regression on ncbo/ontologies_api. Ships the Solr schema and indexer changes that enable BioPortal ontology rank to participate in term-search scoring before pagination (ncbo/ontologies_api#230), and pairs them with several term-indexing improvements that this branch needed along the way.

What this PR does

Search ranking infrastructure (the core change)

Schema (lib/ontologies_linked_data/models/class.rb): adds the ontologyRank Solr field (pfloat, indexed, stored, default: 0.0).
Schema: switches prefLabelExact and synonymExact (plus their *_* dynamic language variants) from string to string_ci, restoring case-insensitive exact match that regressed during the schemaless migration.
Indexer: populates ontologyRank per class doc from LinkedData::Models::Ontology.rank[:normalizedScore], cached for the duration of an indexing run (class-level cache reset by OntologySubmissionIndexer).

Term-indexing improvements (paired but independently useful)

Skip unindex_by_acronym when indexing into a fresh collection.
Allow term-only indexing without CSV generation.
Honor disabled index commits during processing.
Use keyword options for the term indexer.
Restore requested_lang after indexing to prevent cross-ontology leakage.
Guard against nil submission after metadata extraction failure.
Bump goo pin to 684fbe06.

Dependencies

ncbo/goo PR Fix: Solr improvements - alias-shadow guard and commitWithin support goo#187 — required. Adds the alias-guard fix in SolrConnector#init that prevents CREATEALIAS from silently shadowing an existing collection.

Deploy coordination

This branch ships the schema for ontologyRank and the indexer that populates it. The downstream ontologies_api change that consumes the field in Solr scoring needs the term-search collection rebuilt under this schema before it can deploy. Coordinated cutover:

Build the new term-search collection under this schema (full reindex).
Flip the term_search alias to the new collection.
Deploy the ontologies_api change together with this branch.

Out of scope

The term-search ranking algorithm change (boost=sum(ontologyRank,1) in edismax params, removal of post-hoc Ruby sort) — lives in the corresponding ncbo/ontologies_api PR.
Property-search has a parallel pattern (separate Solr collection, separate indexer); tracked as Property search ranking has the same pre-pagination bug as #230 ontologies_api#231.

Thread a new unindex_existing flag (default true, preserving current behavior for all existing callers) through SubmissionProcessor, the index_terms concern, and the OntologySubmissionIndexer. When false, the indexer skips the per-ontology unindex_by_acronym delete-by-query, which is wasted work against a target collection that was already cleared by the caller. Add a test verifying that unindex_by_acronym is not invoked when unindex_existing: false.

@submission

SubmissionMetadataExtractor#extract_metadata returns nil when the submission fails Goo validation. The processor then reassigned @submission = ... and every subsequent call on @submission raised NoMethodError, with the rescue at notify_submission_processed adding a confusing "undefined method archived? for nil:NilClass" on top. Raise a clear error immediately so the caller's rescue logs one actionable message, and skip the email notification when @submission is nil to silence the trailing noise.

The class indexer sets RequestStore.store[:requested_lang] = :ALL so Goo returns language-tagged literals as Hash forms when building Solr documents. The pre-refactor multi-threaded indexer made this change inside per-page worker threads, so it was thread-local and vanished with the thread. The single-threaded rewrite runs on the main thread and never reset it, leaking :ALL into the submission's after_save Solr callback (causing 400s on single-valued metadata fields like status_t) and into the next ontology's bring_remaining (causing attributes to load as Hashes that fail Goo validation and turn extract_metadata into a nil return). Hoist the :ALL assignment out of the per-page loop, capture the previous value before indexing starts, and restore it in the existing ensure block. This matches the save/restore pattern already used in submission_missing_labels.rb.

The string_ci field type used for prefLabelExact/synonymExact (and their dynamic language variants) is an analyzed TextField, which has no docValues. With stored: false, those fields no longer appear in Solr responses, breaking ontologies_api/helpers/search_helper.rb's result sort (a[:prefLabelExact].downcase) and hl.fl highlighting on the Exact fields. Setting stored: true restores retrievability without changing the case-insensitive exact-match indexing behavior.

codecov · 2026-06-02T02:20:41Z

Codecov Report

❌ Patch coverage is 94.80519% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.79%. Comparing base (74da285) to head (7f1302f).
⚠️ Report is 2 commits behind head on develop.

Files with missing lines	Patch %	Lines
...ubmission_process/operations/submission_indexer.rb	94.33%	3 Missing ⚠️
...ervices/submission_process/submission_processor.rb	91.66%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #294      +/-   ##
===========================================
+ Coverage    80.21%   80.79%   +0.58%     
===========================================
  Files          100      100              
  Lines         6753     6791      +38     
===========================================
+ Hits          5417     5487      +70     
+ Misses        1336     1304      -32

Flag	Coverage Δ
ag	`80.63% <94.80%> (+0.49%)`	⬆️
fs	`80.73% <94.80%> (+0.55%)`	⬆️
gd	`80.73% <94.80%> (+0.58%)`	⬆️
unittests	`80.79% <94.80%> (+0.58%)`	⬆️
vo	`80.72% <94.80%> (+0.59%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Bulk reindexing is dominated by the per-page SPARQL fetch from the triplestore (~24 of 36 hours in the last full staging run, ~71% of indexing time). Doubling the page size halves the number of page queries and their per-query setup overhead, at the cost of higher peak memory per page. Performance change only; no behavior change.

The bulk term indexer was strictly sequential per page: fetch a page of classes from the triplestore, map attributes, write to Solr, then fetch the next. SPARQL retrieval dominated the last full reindex (~71% of indexing wall-clock), so the map + Solr-write of each page was idle time the next fetch could have used. Prime page 1 synchronously, then fetch page N+1 on a background thread while the main thread maps and indexes page N. Because Goo's Where#page mutates the shared query object, only one thread ever touches `paging` at a time: the main thread fetches page 1, and each later fetch runs on its own thread while the main thread is busy and never touches `paging`. Notes: - RequestStore is thread-local, so the :ALL language override is re-set inside the fetch thread; otherwise the background fetch would drop multilingual values. - equivalent_predicates is returned with each page so the consumer never reads it from `paging` concurrently with a background fetch. - Thread#value re-raises a fetch error on the main thread so a failed fetch surfaces instead of silently dropping the remaining pages. - Page size is now an injectable keyword (default 5000) so tests can force multi-page behavior on small fixtures. Adds unit coverage for multi-page ordering, the thread-local language override, and fetch-error propagation, plus a real end-to-end test that indexes an ontology single-page vs many-page and asserts an identical complete document set.

Reverts d901277 (prefetch next page while mapping/indexing current) and 596a61a (page size 2500 -> 5000). Both shipped together for staging validation and made indexing 2.4x (CADSRVS) to 4x (NCIT) slower than the proven 2500/sequential baseline. Per-class timings showed fetch and map each ~8-10x slower while the Solr index phase was unchanged -- the signature of in-process memory pressure (GC / likely OS swap), not triplestore or Solr load. The larger page plus the prefetch holding a second page resident raised peak RSS; prefetch's premise (hide an I/O-bound fetch) also inverted on warm runs where local mapping dominates. Restoring the proven 2500/sequential config. The reverted work remains in history if a genuinely fetch-bound run with RAM headroom ever justifies revisiting it.

Repoints prefLabelExact, synonymExact, and their per-language dynamic variants from string_ci to the new string_ci_exact type (goo) so multilingual duplicate-spelling labels no longer inflate term frequency. Tokenized prefLabel/synonym keep tf and positions for phrase queries. Requires the goo string_ci_exact type and a full reindex.

mdorf added 10 commits May 29, 2026 10:30

Fix search index ranking fields and indexing commits

cd56ec6

Test submission index paging optimization

eaf95a6

Honor disabled index commits during processing

b3aadc0

Allow term indexing without CSV generation

3cfde5f

Use keyword options for term indexer

b4ff72a

Gemfile.lock update

a1627e1

mdorf mentioned this pull request Jun 2, 2026

Fix: incorrect search result rankings ncbo/ontologies_api#232

Merged

3 tasks

mdorf mentioned this pull request Jun 2, 2026

Fix: faster bulk term indexing with deferred commits and CSV/optimize toggles ncbo/ncbo_cron#131

Merged

2 tasks

mdorf changed the title ~~Fix incorrect search result ranking — schema and indexer~~ Fix: incorrect search result ranking - schema and indexer Jun 2, 2026

mdorf mentioned this pull request Jun 3, 2026

Refresh Solr ontologyRank field regularly via dedicated job ncbo/ncbo_cron#132

Open

mdorf added 5 commits June 3, 2026 12:12

Bump goo to string_ci_exact commit

7f1302f

mdorf mentioned this pull request Jun 4, 2026

BioPortal search incorrectly ranks results ncbo/ontologies_api#230

Closed

mdorf merged commit a808416 into develop Jun 9, 2026
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: incorrect search result ranking - schema and indexer#294

Fix: incorrect search result ranking - schema and indexer#294
mdorf merged 15 commits into
developfrom
fix/incorrect-search-result-ranking

mdorf commented Jun 2, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mdorf commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What this PR does

Search ranking infrastructure (the core change)

Term-indexing improvements (paired but independently useful)

Dependencies

Deploy coordination

Out of scope

Uh oh!

codecov Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mdorf commented Jun 2, 2026 •

edited

Loading

codecov Bot commented Jun 2, 2026 •

edited

Loading