Skip to content

[FEATURE] Unified SQL language across OpenSearch, Flint and Spark SQL #5346

@dai-chen

Description

@dai-chen

Is your feature request related to a problem?

OpenSearch SQL is currently fragmented across two repositories and four engines. This fragmentation creates the same class of problems identified in the PPL unification effort (opensearch-project/opensearch-spark#1136): semantic divergence, engineering duplication, feature lag, and missed optimization opportunities.

Repository Engine Scope
opensearch-project/sql Legacy (NLPchina) JOINs, UNIONs, subqueries
opensearch-project/sql V2 Simple queries, improved correctness
opensearch-project/sql Calcite-based ANSI SQL (new) Analytics engine integration, ANSI SQL compliance
opensearch-project/opensearch-spark Spark SQL + Flint SQL Spark queries, secondary index DDL

A gap analysis that runs the full V2 + Legacy integration test suite (807 queries) against the new Calcite engine shows a 67.9% pass rate, with 259 failures across 13 categories. This quantifies the distance between the current engines and a unified SQL baseline. See #5346 (comment) below for the full breakdown.

What solution would you like?

Define a Unified SQL Language specification — an ANSI SQL core with internal language extensions that model capabilities such as custom identifier/parsing rules, relevance search and Flint DDL — and implement it once in the Calcite-based engine for reuse across both the OpenSearch and Spark.

Unified SQL Language
├── ANSI SQL Core
├── Search Extension (full-text search, vector search, identifier rules)
└── Secondary Index Extension (Flint DDL)

Phase 1: Analytics engine integration

  • Core DQL — SELECT, subqueries, CTEs, window functions
  • Explain support
  • Full-text search functions

Phase 2: Full language unification

See #5346 (comment) below for the full design and roadmap.

What alternatives have you considered?

  • Keep all engines permanently: Maintenance burden of 4 code paths, inconsistent behavior, no convergence.
  • Strict ANSI SQL only: Breaks existing users who depend on full-text search functions, backtick quoting, and OpenSearch identifier conventions
  • Enable V2 SQL: Since SQL V2 and PPL V3 share the same AST abstraction, it could be routed through the existing CalciteRelNodeVisitor as PPL V3. However, this bypasses Calcite built-in logical optimizations, ANSI SQL features like CTEs etc.

Do you have any additional context?

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions