Fix: incorrect search result rankings#232
Merged
Merged
Conversation
Apply boost=sum(ontologyRank,1) on all three edismax paths via the common parameter section. Multiplier is (1 + rank), bounded in [1, 2]; unranked ontologies are unaffected. Rank now participates in Solr scoring before pagination, addressing #230. Bump goo, ncbo_cron, and ontologies_linked_data gem pins to the matching fix/incorrect-search-result-ranking branches. The new boost references the ontologyRank Solr field which is only declared in the ontologies_linked_data branch's schema. The post-hoc Ruby tiebreaker remains in place for this commit and will be removed in the next.
Solr is now the single source of result ordering via the boost added in the previous commit. The in-page Ruby tiebreaker is redundant and would mask any Solr-side scoring regression. The doc[:ontology_rank] attachment and Ontology.rank fetch upstream become unused; cleaned up in the next commit.
The post-hoc Ruby sort was the sole consumer of these. The line-293 skip-list entry (filter_attrs_by_language) was a defensive exclusion to prevent ontology_rank from being treated as a language-suffixed attribute during response serialization, also dead now.
Seeds four ontologies with synthetic BioPortal ranks (1.0, 0.7, 0.4, 0.1) into a clean Solr index, queries /search?q=melanoma, and asserts the returned acronyms come back in descending rank order. With Solr's boost=sum(ontologyRank,1) participating before pagination, the ordering reflects ontology rank from the very first page returned by Solr. Adds two minimal OWL fixtures (search_rank_melanoma_upper.owl with "Melanoma", search_rank_melanoma_lower.owl with "melanoma") that exercise the string_ci prefLabelExact field type and guarantee a case-insensitive match for the lowercase query.
The previous test method bundled two distinct invariants: the Phase 1 schema fix (string_ci prefLabelExact / synonymExact for case-insensitive exact match) and the Phase 2 mechanism (Solr-side boost producing rank-ordered results before pagination). Splitting them gives more diagnostic information when something regresses and makes each test's intent explicit. - test_schema_uses_case_insensitive_string_ci_exact_fields seeds only RANKHIGH (one ontology with mixed-case "Melanoma" prefLabel) and asserts the schema field types plus a lowercase prefLabelExact match. - test_search_rank_orders_results_via_solr_side_boost seeds all four ranked ontologies and asserts pagesize=3 returns the three highest ranks in order, with RANKLOW counted in totalCount but excluded from page 1. Shared setup is factored into a private with_ranked_melanoma_ontologies helper. Each test method now has a docstring explaining the mechanism it exercises (Phase 1 vs Phase 2) and the coverage gap the existing fixtures don't address (the original #230 cross-Solr-page failure mode would need 50+ matching docs to reproduce). Both tests verified against the live setup; original combined test's asserted behavior is preserved.
Merged
2 tasks
…ch-result-ranking # Conflicts: # Gemfile.lock
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #232 +/- ##
===========================================
- Coverage 78.10% 78.09% -0.01%
===========================================
Files 66 66
Lines 3731 3726 -5
===========================================
- Hits 2914 2910 -4
+ Misses 817 816 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This was referenced Jun 3, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Moves BioPortal ontology rank into Solr's edismax scoring so it participates BEFORE pagination, fixing #230. With the previous post-hoc Ruby sort, high-rank ontologies stranded beyond Solr's first-page window couldn't be promoted; now they're ranked at the Solr layer and arrive on page 1 by score.
The change is small and surgical — single
boostparameter + removal of the now-redundant Ruby tiebreaker and its dead support code.Addresses
Dependencies
This PR depends on companion PRs in three other repos. All must merge together as a coordinated cutover.
SolrConnector#initthat preventsCREATEALIASfrom silently shadowing an existing collection (discovered and fixed during this PR's testing); also bundles unrelatedcommitWithinsupport.string_ciexact fields,ontologyRankfield), indexer population, plus paired term-indexing improvements.commitWithin, skips CSV /unindex_by_acronym/ forced optimize when appropriate) needed to rebuild the term-search collection efficiently under the new schema before this PR can deploy.Deploy coordination
This change requires the active Solr
term_searchalias to point to a collection that defines theontologyRankfield. Deploying against the oldterm_search_core1schema returnsHTTP 400 "undefined field: ontologyRank". The cutover is a single coordinated event:term_search_20260431at time of writing) completes its full reindex.term_searchalias is flipped fromterm_search_core1to the new collection.Commits
boost=sum(ontologyRank,1)to all three edismax paths; bumpgoo/ncbo_cron/ontologies_linked_datato the matching branch tipsdocs.sort!blockontology_rankattachment,Ontology.rankfetch, and skip-list entryTest plan
Two new tests are added by this PR; both pass:
test_schema_uses_case_insensitive_string_ci_exact_fields— asserts the Phase 1 schema invariant (case-insensitive exact match viastring_ci).test_search_rank_orders_results_via_solr_side_boost— asserts the Phase 2 mechanism (pagesize=3 returns the top 3 ontologies by rank, with the lowest-ranked correctly excluded from page 1 but still counted intotalCount).Empirical validation against the live
term_search_20260431collection on staging:q=melanomaq=diabetesq=DOID:1909q=C0025202cuifield appear afterward, rank-ordered.q=melanoma&ontologies=DDSSWhat does NOT change
qffield weights — preserves the OBO ID search enhancement from Search for short IDs fails in certain cases #134/Search index is missing short IDs for many ontologies #135.bq=idAcronymMatch:true^80— preserved.properties_search_helper.rb— different model, different semantics. Parallel pattern documented as follow-up: Property search ranking has the same pre-pagination bug as #230 #231.Out of scope (deferred)
q=diabetesresults as repeated LOINC entries) — separate, pre-existing concern surfaced but not introduced by this change.qfweight changes, autocomplete redesign — all deferred.Operational note
During testing of this PR we triggered a goo-side bug that shadowed the in-progress
term_search_20260431collection with an alias (see ncbo/goo#187). The collection's data on disk was recoverable by deleting the rogue alias; ~24 ontologies whose writes were diverted toterm_search_bootstrapduring the incident window will need re-indexing intoterm_search_20260431after the main reindex completes.