Fix: incorrect search result rankings by mdorf · Pull Request #232 · ncbo/ontologies_api

mdorf · 2026-06-02T02:09:20Z

Summary

Moves BioPortal ontology rank into Solr's edismax scoring so it participates BEFORE pagination, fixing #230. With the previous post-hoc Ruby sort, high-rank ontologies stranded beyond Solr's first-page window couldn't be promoted; now they're ranked at the Solr layer and arrive on page 1 by score.

The change is small and surgical — single boost parameter + removal of the now-redundant Ruby tiebreaker and its dead support code.
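The failure mode in #230 can be illustrated with a small simulation. This is a sketch under stated assumptions, not the code in this PR: the `Doc` struct, the acronyms, and the scores are made up, and the old Ruby sort is modelled as a score-then-rank tiebreaker applied to the page Solr returned.

```ruby
# Simulated matches. Acronyms, scores and ranks are invented for illustration.
Doc = Struct.new(:acronym, :score, :rank)

DOCS = [
  Doc.new("LOWA", 10.0, 0.10),
  Doc.new("LOWB", 10.0, 0.20),
  Doc.new("LOWC", 10.0, 0.30),
  Doc.new("HIGH", 9.5, 1.00) # high-rank ontology with a slightly lower raw score
].freeze

# Old behaviour (simplified): Solr paginates by raw score, then Ruby
# reorders only the documents that made it onto the page.
def post_hoc_page(docs, page_size)
  page = docs.sort_by { |d| -d.score }.first(page_size)
  page.sort_by { |d| [-d.score, -d.rank] }
end

# New behaviour (simplified): the score is multiplied by (1 + rank)
# before the page is cut, so rank can pull a document onto page 1.
def boosted_page(docs, page_size)
  docs.sort_by { |d| -(d.score * (1 + d.rank)) }.first(page_size)
end

p post_hoc_page(DOCS, 3).map(&:acronym) # => ["LOWC", "LOWB", "LOWA"]
p boosted_page(DOCS, 3).map(&:acronym)  # => ["HIGH", "LOWC", "LOWB"]
```

In the first call HIGH never reaches the Ruby sort because it was cut off by pagination; in the second it arrives first because rank is part of the score.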

Addresses

BioPortal search incorrectly ranks results #230

Dependencies

This PR depends on companion PRs in three other repos. All must merge together as a coordinated cutover.

- Fix: Solr improvements - alias-shadow guard and commitWithin support (goo#187): alias-guard fix in SolrConnector#init that prevents CREATEALIAS from silently shadowing an existing collection (discovered and fixed during this PR's testing); also bundles unrelated commitWithin support.
- Fix: incorrect search result ranking - schema and indexer (ontologies_linked_data#294): schema (string_ci exact fields, ontologyRank field), indexer population, plus paired term-indexing improvements.
- Fix: faster bulk term indexing with deferred commits and CSV/optimize toggles (ncbo_cron#131): bulk-indexer improvements (consumes commitWithin, skips CSV / unindex_by_acronym / forced optimize when appropriate) needed to rebuild the term-search collection efficiently under the new schema before this PR can deploy.

Deploy coordination

This change requires the active Solr term_search alias to point to a collection that defines the ontologyRank field. Deploying against the old term_search_core1 schema returns HTTP 400 "undefined field: ontologyRank". The cutover is a single coordinated event:

1. The new collection (term_search_20260431 at time of writing) completes its full reindex.
2. The term_search alias is flipped from term_search_core1 to the new collection.
3. This PR and the three companion PRs all deploy together.
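A pre-flight check before the cutover could look like the sketch below. It is not part of this PR. It assumes Solr's Schema API (`GET <base>/<collection>/schema/fields/<name>`) is reachable through the alias; the helper name `solr_field_defined?`, the `fetch:` parameter, and the base URL are hypothetical.

```ruby
require "json"
require "net/http"
require "uri"

# Returns true when the Schema API reports `field` for the given collection
# or alias. `fetch` is injectable so the check can run without a live Solr;
# by default it performs a real HTTP GET.
def solr_field_defined?(base_url, collection, field, fetch: Net::HTTP.method(:get_response))
  uri = URI("#{base_url}/#{collection}/schema/fields/#{field}")
  response = fetch.call(uri)
  return false unless response.code.to_s == "200"

  # A successful lookup carries the field definition under the "field" key.
  JSON.parse(response.body).key?("field")
end

# Example call against a hypothetical local Solr (commented out: needs a server).
# abort "ontologyRank missing" unless
#   solr_field_defined?("http://localhost:8983/solr", "term_search", "ontologyRank")
```

Running this against the term_search alias before deploying would surface the "undefined field: ontologyRank" condition ahead of the first search request.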

Commits

| SHA | Change |
| --- | --- |
| `2a5c809` | Add `boost=sum(ontologyRank,1)` to all three edismax paths; bump `goo`/`ncbo_cron`/`ontologies_linked_data` to the matching branch tips |
| `59ad6e8` | Remove the post-hoc Ruby `docs.sort!` block |
| `f02884d` | Remove the now-unused `ontology_rank` attachment, `Ontology.rank` fetch, and skip-list entry |
| `3de4ed2` | Bump goo pin to the alias-guard fix (684fbe06) |
| `65aa4a7` | Add ranking regression test + OWL fixtures |
| `61a1a59` | Split the test into schema-only and boost-only concerns for diagnostic clarity |

Test plan

Two new tests are added by this PR; both pass:

- test_schema_uses_case_insensitive_string_ci_exact_fields: asserts the Phase 1 schema invariant (case-insensitive exact match via string_ci).
- test_search_rank_orders_results_via_solr_side_boost: asserts the Phase 2 mechanism (pagesize=3 returns the top 3 ontologies by rank, with the lowest-ranked correctly excluded from page 1 but still counted in totalCount).

Empirical validation against the live term_search_20260431 collection on staging:

| Query | Result |
| --- | --- |
| `q=melanoma` | Top 5: HRDO, MESH (rank 1.000), MEDDRA (0.959), LOINC (0.947), NIFSTD. Phase 0 predicted exactly this. |
| `q=diabetes` | Top 5: RADLEX, MEDDRA, LOINC×3. High-rank ontologies elevated as expected. |
| `q=DOID:1909` | Previously-tied score-611 cluster now rank-ordered (NIFSTD, DDSS, ERO, CLO, ...). OBO ID work (#134/#135) preserved. |
| `q=C0025202` | OCHV at #1 by notation match; cross-references from MESH/MEDDRA via `cui` field appear afterward, rank-ordered. |
| `q=melanoma&ontologies=DDSS` | All results scoped to DDSS as expected. |

What does NOT change

- qf field weights: unchanged, which preserves the OBO ID search enhancement from #134 (Search for short IDs fails in certain cases) and #135 (Search index is missing short IDs for many ontologies).
- bq=idAcronymMatch:true^80: preserved.
- properties_search_helper.rb: different model, different semantics. The parallel pattern is documented as a follow-up in #231 (Property search ranking has the same pre-pagination bug as #230).
- Schema or index: Phase 1 already shipped those on the OLD (ontologies_linked_data) branch.

Out of scope (deferred)

- Property-search ranking has the same architectural anti-pattern and a verifiable bug; see #231 (Property search ranking has the same pre-pagination bug as #230).
- Cross-Solr-page rank-ordering regression test: the new tests use 4 small fixtures, which don't force Solr-side pagination. Reproducing the original #230 failure mode would need 50+ matching docs across ontologies. Worth a separate, heavier regression-test PR.
- Multilingual prefLabel de-duplication (visible in some q=diabetes results as repeated LOINC entries): a separate, pre-existing concern surfaced but not introduced by this change.
- Properties helper retuning, qf weight changes, autocomplete redesign: all deferred.

Operational note

During testing of this PR we triggered a goo-side bug that shadowed the in-progress term_search_20260431 collection with an alias (see ncbo/goo#187). The collection's data on disk was recoverable by deleting the rogue alias; ~24 ontologies whose writes were diverted to term_search_bootstrap during the incident window will need re-indexing into term_search_20260431 after the main reindex completes.

Apply boost=sum(ontologyRank,1) on all three edismax paths via the common parameter section. Multiplier is (1 + rank), bounded in [1, 2]; unranked ontologies are unaffected. Rank now participates in Solr scoring before pagination, addressing #230. Bump goo, ncbo_cron, and ontologies_linked_data gem pins to the matching fix/incorrect-search-result-ranking branches. The new boost references the ontologyRank Solr field which is only declared in the ontologies_linked_data branch's schema. The post-hoc Ruby tiebreaker remains in place for this commit and will be removed in the next.
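The arithmetic behind the boost can be sketched as follows. This is an illustration, not the helper code from the commit: `with_rank_boost` and `rank_multiplier` are hypothetical names, the example parameter hash is made up apart from values quoted in this PR, and an unranked ontology is assumed to contribute 0 to the sum.

```ruby
# Hypothetical helper: adds the rank boost to an edismax parameter hash
# without touching existing entries such as bq.
def with_rank_boost(params)
  params.merge("defType" => "edismax", "boost" => "sum(ontologyRank,1)")
end

# Multiplier applied to the relevance score: 1 + rank.
# A nil rank stands in for an unranked ontology (assumed to count as 0).
def rank_multiplier(rank)
  1.0 + (rank || 0.0)
end

params = with_rank_boost("q" => "melanoma", "bq" => "idAcronymMatch:true^80")
p params["boost"]       # => "sum(ontologyRank,1)"
p rank_multiplier(nil)  # => 1.0
p rank_multiplier(1.0)  # => 2.0
```

Because BioPortal ranks lie in [0, 1], the multiplier stays in [1, 2]: a top-ranked ontology can at most double a document's score, and an unranked one leaves it unchanged.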

Solr is now the single source of result ordering via the boost added in the previous commit. The in-page Ruby tiebreaker is redundant and would mask any Solr-side scoring regression. The doc[:ontology_rank] attachment and Ontology.rank fetch upstream become unused; cleaned up in the next commit.

The post-hoc Ruby sort was the sole consumer of the doc[:ontology_rank] attachment and the Ontology.rank fetch. The line-293 skip-list entry (filter_attrs_by_language) was a defensive exclusion that prevented ontology_rank from being treated as a language-suffixed attribute during response serialization; it is also dead now.

Seeds four ontologies with synthetic BioPortal ranks (1.0, 0.7, 0.4, 0.1) into a clean Solr index, queries /search?q=melanoma, and asserts the returned acronyms come back in descending rank order. With Solr's boost=sum(ontologyRank,1) participating before pagination, the ordering reflects ontology rank from the very first page returned by Solr. Adds two minimal OWL fixtures (search_rank_melanoma_upper.owl with "Melanoma", search_rank_melanoma_lower.owl with "melanoma") that exercise the string_ci prefLabelExact field type and guarantee a case-insensitive match for the lowercase query.

The previous test method bundled two distinct invariants: the Phase 1 schema fix (string_ci prefLabelExact / synonymExact for case-insensitive exact match) and the Phase 2 mechanism (Solr-side boost producing rank-ordered results before pagination). Splitting them gives more diagnostic information when something regresses and makes each test's intent explicit.

- test_schema_uses_case_insensitive_string_ci_exact_fields seeds only RANKHIGH (one ontology with mixed-case "Melanoma" prefLabel) and asserts the schema field types plus a lowercase prefLabelExact match.
- test_search_rank_orders_results_via_solr_side_boost seeds all four ranked ontologies and asserts pagesize=3 returns the three highest ranks in order, with RANKLOW counted in totalCount but excluded from page 1.

Shared setup is factored into a private with_ranked_melanoma_ontologies helper. Each test method now has a docstring explaining the mechanism it exercises (Phase 1 vs Phase 2) and the coverage gap the existing fixtures don't address (the original #230 cross-Solr-page failure mode would need 50+ matching docs to reproduce). Both tests verified against the live setup; original combined test's asserted behavior is preserved.
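The invariants checked by the boost test can be pictured with a stand-in for the search endpoint. This is only a sketch: `simulated_search` is hypothetical and does not call the API, the two middle acronyms are invented (only RANKHIGH and RANKLOW are named above), the "acronyms" key is a simplification of the real response, and all four documents are assumed to have equal base relevance.

```ruby
# Synthetic ranks from the regression test; RANKMID1/RANKMID2 are invented names.
FIXTURE_RANKS = {
  "RANKLOW" => 0.1, "RANKMID2" => 0.4, "RANKHIGH" => 1.0, "RANKMID1" => 0.7
}.freeze

# Stand-in for /search?q=melanoma&pagesize=N: with equal base scores the
# boosted order is simply descending (1 + rank), cut to one page.
def simulated_search(ranks, pagesize:)
  ordered = ranks.sort_by { |_acronym, rank| -(1.0 + rank) }.map(&:first)
  { "totalCount" => ranks.size, "acronyms" => ordered.first(pagesize) }
end

response = simulated_search(FIXTURE_RANKS, pagesize: 3)
p response["acronyms"]   # => ["RANKHIGH", "RANKMID1", "RANKMID2"]
p response["totalCount"] # => 4
```

The lowest-ranked ontology is absent from page 1 yet still counted in totalCount, which is the pair of conditions the real test asserts.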


codecov-commenter · 2026-06-02T03:10:58Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.09%. Comparing base (d6b28f9) to head (100df47).

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #232      +/-   ##
===========================================
- Coverage    78.10%   78.09%   -0.01%     
===========================================
  Files           66       66              
  Lines         3731     3726       -5     
===========================================
- Hits          2914     2910       -4     
+ Misses         817      816       -1

| Flag | Coverage Δ | |
| --- | --- | --- |
| ag | `78.09% <100.00%> (-0.01%)` | ⬇️ |
| fs | `78.09% <100.00%> (-0.01%)` | ⬇️ |
| gd | `78.09% <100.00%> (-0.01%)` | ⬇️ |
| unittests | `78.09% <100.00%> (-0.01%)` | ⬇️ |
| vo | `78.09% <100.00%> (-0.01%)` | ⬇️ |


mdorf added 6 commits June 1, 2026 15:04

Gemfile.lock update

3de4ed2

mdorf mentioned this pull request Jun 2, 2026

Fix: faster bulk term indexing with deferred commits and CSV/optimize toggles ncbo/ncbo_cron#131

Merged

2 tasks

mdorf changed the title ~~Fix incorrect search result ranking~~ Fix: incorrect search result rankings Jun 2, 2026

mdorf added 2 commits June 1, 2026 19:57

Gemfile.lock update

272d649

Merge remote-tracking branch 'origin/develop' into fix/incorrect-sear…

100df47

…ch-result-ranking # Conflicts: # Gemfile.lock

This was referenced Jun 3, 2026

Refresh Solr ontologyRank field regularly via dedicated job ncbo/ncbo_cron#132

Open

BioPortal search incorrectly ranks results #230

Closed

mdorf merged commit 29564e0 into develop Jun 9, 2026
10 checks passed

mdorf merged 8 commits into develop from fix/incorrect-search-result-ranking
