Skip to content

Fix: incorrect search result ranking - schema and indexer#294

Merged
mdorf merged 15 commits into
developfrom
fix/incorrect-search-result-ranking
Jun 9, 2026
Merged

Fix: incorrect search result ranking - schema and indexer#294
mdorf merged 15 commits into
developfrom
fix/incorrect-search-result-ranking

Conversation

@mdorf

@mdorf mdorf commented Jun 2, 2026

Copy link
Copy Markdown
Member

Summary

Data-layer companion for fixing the term-search ranking regression on ncbo/ontologies_api. Ships the Solr schema and indexer changes that enable BioPortal ontology rank to participate in term-search scoring before pagination (ncbo/ontologies_api#230), and pairs them with several term-indexing improvements that this branch needed along the way.

What this PR does

Search ranking infrastructure (the core change)

  • Schema (lib/ontologies_linked_data/models/class.rb): adds the ontologyRank Solr field (pfloat, indexed, stored, default: 0.0).
  • Schema: switches prefLabelExact and synonymExact (plus their *_* dynamic language variants) from string to string_ci, restoring case-insensitive exact match that regressed during the schemaless migration.
  • Indexer: populates ontologyRank per class doc from LinkedData::Models::Ontology.rank[:normalizedScore], cached for the duration of an indexing run (class-level cache reset by OntologySubmissionIndexer).

Term-indexing improvements (paired but independently useful)

  • Skip unindex_by_acronym when indexing into a fresh collection.
  • Allow term-only indexing without CSV generation.
  • Honor disabled index commits during processing.
  • Use keyword options for the term indexer.
  • Restore requested_lang after indexing to prevent cross-ontology leakage.
  • Guard against nil submission after metadata extraction failure.
  • Bump goo pin to 684fbe06.

Dependencies

Deploy coordination

This branch ships the schema for ontologyRank and the indexer that populates it. The downstream ontologies_api change that consumes the field in Solr scoring needs the term-search collection rebuilt under this schema before it can deploy. Coordinated cutover:

  1. Build the new term-search collection under this schema (full reindex).
  2. Flip the term_search alias to the new collection.
  3. Deploy the ontologies_api change together with this branch.

Out of scope

mdorf added 10 commits May 29, 2026 10:30
Thread a new unindex_existing flag (default true, preserving current
behavior for all existing callers) through SubmissionProcessor,
the index_terms concern, and the OntologySubmissionIndexer. When
false, the indexer skips the per-ontology unindex_by_acronym
delete-by-query, which is wasted work against a target collection
that was already cleared by the caller. Add a test verifying that
unindex_by_acronym is not invoked when unindex_existing: false.
SubmissionMetadataExtractor#extract_metadata returns nil when the
submission fails Goo validation. The processor then reassigned
@submission = ... and every subsequent call on @submission raised
NoMethodError, with the rescue at notify_submission_processed
adding a confusing "undefined method archived? for nil:NilClass"
on top. Raise a clear error immediately so the caller's rescue
logs one actionable message, and skip the email notification when
@submission is nil to silence the trailing noise.
The class indexer sets RequestStore.store[:requested_lang] = :ALL so
Goo returns language-tagged literals as Hash forms when building Solr
documents. The pre-refactor multi-threaded indexer made this change
inside per-page worker threads, so it was thread-local and vanished
with the thread. The single-threaded rewrite runs on the main thread
and never reset it, leaking :ALL into the submission's after_save
Solr callback (causing 400s on single-valued metadata fields like
status_t) and into the next ontology's bring_remaining (causing
attributes to load as Hashes that fail Goo validation and turn
extract_metadata into a nil return).

Hoist the :ALL assignment out of the per-page loop, capture the
previous value before indexing starts, and restore it in the
existing ensure block. This matches the save/restore pattern already
used in submission_missing_labels.rb.
The string_ci field type used for prefLabelExact/synonymExact (and their
dynamic language variants) is an analyzed TextField, which has no
docValues. With stored: false, those fields no longer appear in Solr
responses, breaking ontologies_api/helpers/search_helper.rb's result
sort (a[:prefLabelExact].downcase) and hl.fl highlighting on the Exact
fields. Setting stored: true restores retrievability without changing
the case-insensitive exact-match indexing behavior.
@codecov

codecov Bot commented Jun 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 94.80519% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.79%. Comparing base (74da285) to head (7f1302f).
⚠️ Report is 2 commits behind head on develop.

Files with missing lines Patch % Lines
...ubmission_process/operations/submission_indexer.rb 94.33% 3 Missing ⚠️
...ervices/submission_process/submission_processor.rb 91.66% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #294      +/-   ##
===========================================
+ Coverage    80.21%   80.79%   +0.58%     
===========================================
  Files          100      100              
  Lines         6753     6791      +38     
===========================================
+ Hits          5417     5487      +70     
+ Misses        1336     1304      -32     
Flag Coverage Δ
ag 80.63% <94.80%> (+0.49%) ⬆️
fs 80.73% <94.80%> (+0.55%) ⬆️
gd 80.73% <94.80%> (+0.58%) ⬆️
unittests 80.79% <94.80%> (+0.58%) ⬆️
vo 80.72% <94.80%> (+0.59%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mdorf mdorf changed the title Fix incorrect search result ranking — schema and indexer Fix: incorrect search result ranking - schema and indexer Jun 2, 2026
mdorf added 5 commits June 3, 2026 12:12
Bulk reindexing is dominated by the per-page SPARQL fetch from the
triplestore (~24 of 36 hours in the last full staging run, ~71% of
indexing time). Doubling the page size halves the number of page
queries and their per-query setup overhead, at the cost of higher
peak memory per page. Performance change only; no behavior change.
The bulk term indexer was strictly sequential per page: fetch a page
of classes from the triplestore, map attributes, write to Solr, then
fetch the next. SPARQL retrieval dominated the last full reindex
(~71% of indexing wall-clock), so the map + Solr-write of each page
was idle time the next fetch could have used.

Prime page 1 synchronously, then fetch page N+1 on a background thread
while the main thread maps and indexes page N. Because Goo's Where#page
mutates the shared query object, only one thread ever touches `paging`
at a time: the main thread fetches page 1, and each later fetch runs on
its own thread while the main thread is busy and never touches `paging`.

Notes:
- RequestStore is thread-local, so the :ALL language override is re-set
  inside the fetch thread; otherwise the background fetch would drop
  multilingual values.
- equivalent_predicates is returned with each page so the consumer never
  reads it from `paging` concurrently with a background fetch.
- Thread#value re-raises a fetch error on the main thread so a failed
  fetch surfaces instead of silently dropping the remaining pages.
- Page size is now an injectable keyword (default 5000) so tests can
  force multi-page behavior on small fixtures.

Adds unit coverage for multi-page ordering, the thread-local language
override, and fetch-error propagation, plus a real end-to-end test that
indexes an ontology single-page vs many-page and asserts an identical
complete document set.
Reverts d901277 (prefetch next page while mapping/indexing current)
and 596a61a (page size 2500 -> 5000).

Both shipped together for staging validation and made indexing 2.4x
(CADSRVS) to 4x (NCIT) slower than the proven 2500/sequential baseline.
Per-class timings showed fetch and map each ~8-10x slower while the Solr
index phase was unchanged -- the signature of in-process memory pressure
(GC / likely OS swap), not triplestore or Solr load. The larger page plus
the prefetch holding a second page resident raised peak RSS; prefetch's
premise (hide an I/O-bound fetch) also inverted on warm runs where local
mapping dominates. Restoring the proven 2500/sequential config.

The reverted work remains in history if a genuinely fetch-bound run with
RAM headroom ever justifies revisiting it.
Repoints prefLabelExact, synonymExact, and their per-language
dynamic variants from string_ci to the new string_ci_exact type
(goo) so multilingual duplicate-spelling labels no longer inflate
term frequency. Tokenized prefLabel/synonym keep tf and positions
for phrase queries. Requires the goo string_ci_exact type and a
full reindex.
@mdorf mdorf merged commit a808416 into develop Jun 9, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant