Fix: incorrect search result ranking - schema and indexer#294
Merged
Conversation
Thread a new unindex_existing flag (default true, preserving current behavior for all existing callers) through SubmissionProcessor, the index_terms concern, and the OntologySubmissionIndexer. When false, the indexer skips the per-ontology unindex_by_acronym delete-by-query, which is wasted work against a target collection that was already cleared by the caller. Add a test verifying that unindex_by_acronym is not invoked when unindex_existing: false.
SubmissionMetadataExtractor#extract_metadata returns nil when the submission fails Goo validation. The processor then reassigned @submission = ... and every subsequent call on @submission raised NoMethodError, with the rescue at notify_submission_processed adding a confusing "undefined method archived? for nil:NilClass" on top. Raise a clear error immediately so the caller's rescue logs one actionable message, and skip the email notification when @submission is nil to silence the trailing noise.
The class indexer sets RequestStore.store[:requested_lang] = :ALL so Goo returns language-tagged literals as Hash forms when building Solr documents. The pre-refactor multi-threaded indexer made this change inside per-page worker threads, so it was thread-local and vanished with the thread. The single-threaded rewrite runs on the main thread and never reset it, leaking :ALL into the submission's after_save Solr callback (causing 400s on single-valued metadata fields like status_t) and into the next ontology's bring_remaining (causing attributes to load as Hashes that fail Goo validation and turn extract_metadata into a nil return). Hoist the :ALL assignment out of the per-page loop, capture the previous value before indexing starts, and restore it in the existing ensure block. This matches the save/restore pattern already used in submission_missing_labels.rb.
The string_ci field type used for prefLabelExact/synonymExact (and their dynamic language variants) is an analyzed TextField, which has no docValues. With stored: false, those fields no longer appear in Solr responses, breaking ontologies_api/helpers/search_helper.rb's result sort (a[:prefLabelExact].downcase) and hl.fl highlighting on the Exact fields. Setting stored: true restores retrievability without changing the case-insensitive exact-match indexing behavior.
3 tasks
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #294 +/- ##
===========================================
+ Coverage 80.21% 80.79% +0.58%
===========================================
Files 100 100
Lines 6753 6791 +38
===========================================
+ Hits 5417 5487 +70
+ Misses 1336 1304 -32
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Merged
2 tasks
Bulk reindexing is dominated by the per-page SPARQL fetch from the triplestore (~24 of 36 hours in the last full staging run, ~71% of indexing time). Doubling the page size halves the number of page queries and their per-query setup overhead, at the cost of higher peak memory per page. Performance change only; no behavior change.
The bulk term indexer was strictly sequential per page: fetch a page of classes from the triplestore, map attributes, write to Solr, then fetch the next. SPARQL retrieval dominated the last full reindex (~71% of indexing wall-clock), so the map + Solr-write of each page was idle time the next fetch could have used. Prime page 1 synchronously, then fetch page N+1 on a background thread while the main thread maps and indexes page N. Because Goo's Where#page mutates the shared query object, only one thread ever touches `paging` at a time: the main thread fetches page 1, and each later fetch runs on its own thread while the main thread is busy and never touches `paging`. Notes: - RequestStore is thread-local, so the :ALL language override is re-set inside the fetch thread; otherwise the background fetch would drop multilingual values. - equivalent_predicates is returned with each page so the consumer never reads it from `paging` concurrently with a background fetch. - Thread#value re-raises a fetch error on the main thread so a failed fetch surfaces instead of silently dropping the remaining pages. - Page size is now an injectable keyword (default 5000) so tests can force multi-page behavior on small fixtures. Adds unit coverage for multi-page ordering, the thread-local language override, and fetch-error propagation, plus a real end-to-end test that indexes an ontology single-page vs many-page and asserts an identical complete document set.
Reverts d901277 (prefetch next page while mapping/indexing current) and 596a61a (page size 2500 -> 5000). Both shipped together for staging validation and made indexing 2.4x (CADSRVS) to 4x (NCIT) slower than the proven 2500/sequential baseline. Per-class timings showed fetch and map each ~8-10x slower while the Solr index phase was unchanged -- the signature of in-process memory pressure (GC / likely OS swap), not triplestore or Solr load. The larger page plus the prefetch holding a second page resident raised peak RSS; prefetch's premise (hide an I/O-bound fetch) also inverted on warm runs where local mapping dominates. Restoring the proven 2500/sequential config. The reverted work remains in history if a genuinely fetch-bound run with RAM headroom ever justifies revisiting it.
Repoints prefLabelExact, synonymExact, and their per-language dynamic variants from string_ci to the new string_ci_exact type (goo) so multilingual duplicate-spelling labels no longer inflate term frequency. Tokenized prefLabel/synonym keep tf and positions for phrase queries. Requires the goo string_ci_exact type and a full reindex.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Data-layer companion for fixing the term-search ranking regression on
ncbo/ontologies_api. Ships the Solr schema and indexer changes that enable BioPortal ontology rank to participate in term-search scoring before pagination (ncbo/ontologies_api#230), and pairs them with several term-indexing improvements that this branch needed along the way.What this PR does
Search ranking infrastructure (the core change)
lib/ontologies_linked_data/models/class.rb): adds theontologyRankSolr field (pfloat, indexed, stored,default: 0.0).prefLabelExactandsynonymExact(plus their*_*dynamic language variants) fromstringtostring_ci, restoring case-insensitive exact match that regressed during the schemaless migration.ontologyRankper class doc fromLinkedData::Models::Ontology.rank[:normalizedScore], cached for the duration of an indexing run (class-level cache reset byOntologySubmissionIndexer).Term-indexing improvements (paired but independently useful)
unindex_by_acronymwhen indexing into a fresh collection.requested_langafter indexing to prevent cross-ontology leakage.goopin to684fbe06.Dependencies
SolrConnector#initthat preventsCREATEALIASfrom silently shadowing an existing collection.Deploy coordination
This branch ships the schema for
ontologyRankand the indexer that populates it. The downstreamontologies_apichange that consumes the field in Solr scoring needs the term-search collection rebuilt under this schema before it can deploy. Coordinated cutover:term_searchalias to the new collection.ontologies_apichange together with this branch.Out of scope
boost=sum(ontologyRank,1)in edismax params, removal of post-hoc Ruby sort) — lives in the correspondingncbo/ontologies_apiPR.