
Fix: incorrect search result rankings#232

Merged
mdorf merged 8 commits into develop from fix/incorrect-search-result-ranking on Jun 9, 2026

mdorf commented Jun 2, 2026

Member

Summary

Moves BioPortal ontology rank into Solr's edismax scoring so it participates BEFORE pagination, fixing #230. With the previous post-hoc Ruby sort, high-rank ontologies stranded beyond Solr's first-page window couldn't be promoted; now they're ranked at the Solr layer and arrive on page 1 by score.
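The difference between the two approaches can be sketched in plain Ruby, with no Solr involved. The `Doc` struct, the scores and ranks, and the helper names `post_hoc_page` and `boosted_page` are all hypothetical. Only the multiplier `1 + rank` mirrors the `boost=sum(ontologyRank,1)` described in this PR, and the post-hoc sort is a simplification of the old tiebreaker.

```ruby
# Hypothetical documents: a text-relevance score and an ontology rank in [0, 1].
Doc = Struct.new(:acronym, :score, :rank)

DOCS = [
  Doc.new("LOWA", 10.0, 0.1),
  Doc.new("LOWB", 9.9, 0.15),
  Doc.new("LOWC", 9.8, 0.2),
  Doc.new("HIGH", 9.0, 1.0)
].freeze

# Old behaviour (simplified): cut the page by raw score first,
# then re-sort only the documents that made it onto the page.
def post_hoc_page(docs, pagesize)
  page = docs.sort_by { |d| -d.score }.first(pagesize)
  page.sort_by { |d| -d.rank }.map(&:acronym)
end

# New behaviour (simplified): rank multiplies the score before the page is cut.
def boosted_page(docs, pagesize)
  docs.sort_by { |d| -(d.score * (1 + d.rank)) }.first(pagesize).map(&:acronym)
end

puts post_hoc_page(DOCS, 3).inspect
puts boosted_page(DOCS, 3).inspect
```

With these made-up numbers, the post-hoc version never sees `HIGH` because it falls outside the first three documents by raw score, while the boosted version places it first.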

The change is small and surgical — single boost parameter + removal of the now-redundant Ruby tiebreaker and its dead support code.

Addresses

Dependencies

This PR depends on companion PRs in three other repos. All must merge together as a coordinated cutover.

Deploy coordination

This change requires the active Solr term_search alias to point to a collection that defines the ontologyRank field. Deploying against the old term_search_core1 schema returns HTTP 400 "undefined field: ontologyRank". The cutover is a single coordinated event:

  1. The new collection (term_search_20260431 at time of writing) completes its full reindex.
  2. The term_search alias is flipped from term_search_core1 to the new collection.
  3. This PR + the three companion PRs all deploy together.
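A pre-deploy check for the precondition above could look roughly like the following. This is a sketch, not part of the PR: the helper name `cutover_problems` is hypothetical, and it assumes `term_search` is a SolrCloud alias visible through the Collections API `LISTALIASES` action and that the target collection's fields are readable through the Schema API. The function only inspects response bodies that the caller has already fetched.

```ruby
require "json"

# Sketch only: decides from two Solr admin responses whether the cutover
# precondition holds. Returns a list of problems (empty means OK).
#
# aliases_json: body of GET /solr/admin/collections?action=LISTALIASES
# fields_json:  body of GET /solr/<target collection>/schema/fields
def cutover_problems(aliases_json, fields_json, alias_name: "term_search", field: "ontologyRank")
  problems = []

  target = JSON.parse(aliases_json).fetch("aliases", {})[alias_name]
  if target.nil?
    problems << "alias #{alias_name} is not defined"
  elsif target == "term_search_core1"
    problems << "alias #{alias_name} still points to the old term_search_core1"
  end

  field_names = JSON.parse(fields_json).fetch("fields", []).map { |f| f["name"] }
  problems << "schema does not define #{field}" unless field_names.include?(field)

  problems
end

# Hand-written sample responses (not captured from a real cluster).
sample_aliases = '{"aliases": {"term_search": "term_search_20260431"}}'
sample_fields  = '{"fields": [{"name": "prefLabelExact"}, {"name": "ontologyRank"}]}'
puts cutover_problems(sample_aliases, sample_fields).inspect
```

An empty list means both conditions hold; any entry is a reason to hold the deploy.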

Commits

| SHA | Change |
| --- | --- |
| 2a5c809 | Add boost=sum(ontologyRank,1) to all three edismax paths; bump goo/ncbo_cron/ontologies_linked_data to the matching branch tips |
| 59ad6e8 | Remove the post-hoc Ruby docs.sort! block |
| f02884d | Remove the now-unused ontology_rank attachment, Ontology.rank fetch, and skip-list entry |
| 3de4ed2 | Bump goo pin to the alias-guard fix (684fbe06) |
| 65aa4a7 | Add ranking regression test + OWL fixtures |
| 61a1a59 | Split the test into schema-only and boost-only concerns for diagnostic clarity |
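For readers unfamiliar with the edismax `boost` parameter: it is a function query whose value is multiplied into each matching document's score. A minimal sketch of a request carrying it follows. The helper name `edismax_params` and the `rows`/`start` defaults are assumptions for illustration and do not reflect how this codebase assembles its three edismax paths.

```ruby
require "uri"

# Multiplicative function boost: evaluates to 1 + ontologyRank per document.
RANK_BOOST = "sum(ontologyRank,1)"

# Hypothetical helper: builds the query parameters for one edismax request.
def edismax_params(query, rows: 50, start: 0)
  {
    "defType" => "edismax",
    "q" => query,
    "boost" => RANK_BOOST,
    "rows" => rows,
    "start" => start
  }
end

# URL-encode the parameters as they would appear on a /select request.
query_string = URI.encode_www_form(edismax_params("melanoma", rows: 3))
puts query_string
```

Because `boost` multiplies rather than adds, a document with no rank (treated as 0) keeps its original score, and a top-ranked document at most doubles it.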

Test plan

Two new tests are added by this PR; both pass:

  • test_schema_uses_case_insensitive_string_ci_exact_fields — asserts the Phase 1 schema invariant (case-insensitive exact match via string_ci).
  • test_search_rank_orders_results_via_solr_side_boost — asserts the Phase 2 mechanism (pagesize=3 returns the top 3 ontologies by rank, with the lowest-ranked correctly excluded from page 1 but still counted in totalCount).
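The shape of the second assertion can be illustrated as below. This is not the test added by the PR: the helper `page_matches?`, the response keys other than `totalCount`, and the two middle acronyms are hypothetical stand-ins.

```ruby
# Hypothetical response shape: "collection" holds the current page and
# "totalCount" counts every match, including those beyond the page.
def page_matches?(response, expected_acronyms, expected_total)
  acronyms = response.fetch("collection").map { |doc| doc.fetch("acronym") }
  acronyms == expected_acronyms && response.fetch("totalCount") == expected_total
end

# Hand-written sample: four ontologies match, pagesize=3 returns the top three.
sample = {
  "totalCount" => 4,
  "collection" => [
    { "acronym" => "RANKHIGH" },
    { "acronym" => "RANKMID_A" },
    { "acronym" => "RANKMID_B" }
  ]
}

puts page_matches?(sample, %w[RANKHIGH RANKMID_A RANKMID_B], 4)
```

The lowest-ranked ontology (RANKLOW in the real fixtures) is absent from the page but still accounts for the fourth match in `totalCount`.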

Empirical validation against the live term_search_20260431 collection on staging:

| Query | Result |
| --- | --- |
| q=melanoma | Top 5: HRDO, MESH (rank 1.000), MEDDRA (0.959), LOINC (0.947), NIFSTD. Phase 0 predicted exactly this. |
| q=diabetes | Top 5: RADLEX, MEDDRA, LOINC×3. High-rank ontologies elevated as expected. |
| q=DOID:1909 | Previously-tied score-611 cluster now rank-ordered (NIFSTD, DDSS, ERO, CLO, ...). OBO ID work (#134/#135) preserved. |
| q=C0025202 | OCHV at #1 by notation match; cross-references from MESH/MEDDRA via cui field appear afterward, rank-ordered. |
| q=melanoma&ontologies=DDSS | All results scoped to DDSS as expected. |

What does NOT change

Out of scope (deferred)

  • Property-search ranking has the same architectural anti-pattern and a verifiable bug; see #231 ("Property search ranking has the same pre-pagination bug as #230").
  • Cross-Solr-page rank-ordering regression test: the new tests use 4 small fixtures, which don't force Solr-side pagination. Reproducing the original #230 failure mode ("BioPortal search incorrectly ranks results") would need 50+ matching docs across ontologies. This is worth a separate, heavier regression-test PR.
  • Multilingual prefLabel de-duplication (visible in some q=diabetes results as repeated LOINC entries) — separate, pre-existing concern surfaced but not introduced by this change.
  • Properties helper retuning, qf weight changes, autocomplete redesign — all deferred.

Operational note

During testing of this PR we triggered a goo-side bug that shadowed the in-progress term_search_20260431 collection with an alias (see ncbo/goo#187). The collection's data on disk was recoverable by deleting the rogue alias; ~24 ontologies whose writes were diverted to term_search_bootstrap during the incident window will need re-indexing into term_search_20260431 after the main reindex completes.

mdorf added 6 commits June 1, 2026 15:04
Apply boost=sum(ontologyRank,1) on all three edismax paths via the
common parameter section. Multiplier is (1 + rank), bounded in [1, 2];
unranked ontologies are unaffected. Rank now participates in Solr
scoring before pagination, addressing #230.

Bump goo, ncbo_cron, and ontologies_linked_data gem pins to the
matching fix/incorrect-search-result-ranking branches. The new boost
references the ontologyRank Solr field which is only declared in the
ontologies_linked_data branch's schema.

The post-hoc Ruby tiebreaker remains in place for this commit and
will be removed in the next.

Solr is now the single source of result ordering via the boost added
in the previous commit. The in-page Ruby tiebreaker is redundant and
would mask any Solr-side scoring regression.

The doc[:ontology_rank] attachment and Ontology.rank fetch upstream
become unused; cleaned up in the next commit.

The post-hoc Ruby sort was the sole consumer of these. The line-293
skip-list entry (filter_attrs_by_language) was a defensive exclusion
to prevent ontology_rank from being treated as a language-suffixed
attribute during response serialization, also dead now.

Seeds four ontologies with synthetic BioPortal ranks (1.0, 0.7, 0.4,
0.1) into a clean Solr index, queries /search?q=melanoma, and
asserts the returned acronyms come back in descending rank order.
With Solr's boost=sum(ontologyRank,1) participating before
pagination, the ordering reflects ontology rank from the very first
page returned by Solr.

Adds two minimal OWL fixtures (search_rank_melanoma_upper.owl with
"Melanoma", search_rank_melanoma_lower.owl with "melanoma") that
exercise the string_ci prefLabelExact field type and guarantee a
case-insensitive match for the lowercase query.

The previous test method bundled two distinct invariants: the Phase 1
schema fix (string_ci prefLabelExact / synonymExact for case-insensitive
exact match) and the Phase 2 mechanism (Solr-side boost producing
rank-ordered results before pagination). Splitting them gives more
diagnostic information when something regresses and makes each test's
intent explicit.

- test_schema_uses_case_insensitive_string_ci_exact_fields seeds only
  RANKHIGH (one ontology with mixed-case "Melanoma" prefLabel) and
  asserts the schema field types plus a lowercase prefLabelExact match.
- test_search_rank_orders_results_via_solr_side_boost seeds all four
  ranked ontologies and asserts pagesize=3 returns the three highest
  ranks in order, with RANKLOW counted in totalCount but excluded from
  page 1.

Shared setup is factored into a private with_ranked_melanoma_ontologies
helper. Each test method now has a docstring explaining the mechanism
it exercises (Phase 1 vs Phase 2) and the coverage gap the existing
fixtures don't address (the original #230 cross-Solr-page failure mode
would need 50+ matching docs to reproduce).

Both tests verified against the live setup; original combined test's
asserted behavior is preserved.
mdorf changed the title from "Fix incorrect search result ranking" to "Fix: incorrect search result rankings" on Jun 2, 2026
codecov-commenter commented Jun 2, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.09%. Comparing base (d6b28f9) to head (100df47).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #232      +/-   ##
===========================================
- Coverage    78.10%   78.09%   -0.01%     
===========================================
  Files           66       66              
  Lines         3731     3726       -5     
===========================================
- Hits          2914     2910       -4     
+ Misses         817      816       -1     
| Flag | Coverage Δ |
| --- | --- |
| ag | 78.09% <100.00%> (-0.01%) ⬇️ |
| fs | 78.09% <100.00%> (-0.01%) ⬇️ |
| gd | 78.09% <100.00%> (-0.01%) ⬇️ |
| unittests | 78.09% <100.00%> (-0.01%) ⬇️ |
| vo | 78.09% <100.00%> (-0.01%) ⬇️ |


mdorf merged commit 29564e0 into develop on Jun 9, 2026
10 checks passed