Schema Support - Logical Separation for Stores by gregorian-09 · Pull Request #343 · deven96/ahnlich

gregorian-09 · 2026-06-16T09:10:34Z

Closes #319
Supersedes #341 (source branch feat/schema-support was deleted).

Rebased on latest upstream main with additional fixes:

Schema::try_new to propagate errors instead of panicking on empty string
Versioned persistence trait with auto-migration (V1 flat -> V2 nested)
Hardcoded public references replaced with Schema::DEFAULT_NAME
Clippy clean, format clean

gregorian-09 · 2026-06-16T19:08:56Z

tests::server_tests::test_mmap_persistence_performance is still failing. I’ll do a proper investigation tomorrow and see what’s causing it.

deven96 · 2026-06-17T22:19:40Z

There's been some merged changes from @jimezesinachi with respect to clustering mode for ahnlich so perhaps @jimezesinachi can help work through what would be needed here to make the modifications for logical separation fit with his changes

Introduce a `Schema` type wrapping a `String` to provide logical separation for stores in ahnlich. This is the foundation for issue deven96#319 (multi-tenant store isolation). Key design decisions: - Newtype over `String` (not protobuf-generated), hand-written in `types/src/schema.rs`. - Default schema name is "public" to preserve backward compatibility with all existing stores. - Derives `Hash`, `Eq`, `Ord`, `Clone`, `Debug`, `Serialize`, `Deserialize` for use as a map key. - `From<String>` and `From<&str>` for ergonomic construction. - Panics on empty name to prevent silent corruption. - Separate from the protobuf layer so we can evolve the schema model without regenerating proto types. This commit only adds the type; refactoring the store engine to use it comes next.

…ema map Address two compilation issues in the DB store handler refactoring: - Use `existing.current.clone()` instead of `existing.clone()` on papaya's `OccupiedError` (which does not implement Clone). - Resolve type inference for `inner.pin().len()` in `drop_schema`. Update operations.rs and handler.rs pass `Schema::default()` in all `StoreHandler` method calls, since the protobuf layer does not yet carry a schema field. This maintains full backward compatibility: every store operation implicitly targets the `"public"` schema.

Apply the same nested-map pattern from db StoreHandler to AIStoreHandler: - Change AIStores from `Map<StoreName, AIStore>` to `Map<Schema, Map<StoreName, AIStore>>`. - Add `AIInnerStores` type alias. - Add `get_or_create_schema()` and `get_schema()` helpers. - Update `create_store()`, `get()`, `list_stores()`, `get_store()`, `set()`, `validate_and_prepare_store_data()`, `get_ndarray_repr_for_store()`, `drop_store()`, `store_original()`, and `purge_stores()` to accept and use a schema parameter. - Add `drop_schema()` method (guarded: cannot drop "public"). - Add `AIProxyError::InvalidArgument(String)` variant for schema guard rejection. All existing callers pass `Schema::default()` for backward compatibility.

Update all AI operations and the AI server handler to pass `Schema::default()` in direct `AIStoreHandler` method calls. This completes the schema-aware refactoring of the AI proxy layer. All existing store operations continue to target the `"public"` schema until the protobuf layer carries an explicit schema field.

The types crate's build.rs deletes all files in `src/` except `utils/` before regenerating protobuf code. This caused our hand-written `schema.rs` to be silently deleted on every `cargo build/check`, and caused `pub mod schema;` to be omitted from `lib.rs`. Fix the deletion loop to also preserve `schema.rs`, and add `"schema"` to the module-name set written into `lib.rs`. Also fix type inference for `inner.pin().len()` in both `StoreHandler::drop_schema` and `AIStoreHandler::drop_schema` by binding the pin guard to a local variable first.

The schema parameter was added to all store handler methods in a previous commit, but threading &Schema through every method signature introduced too_many_arguments and leaked schema handling into every caller. Each handler already holds a default_schema field that methods use implicitly. - Remove schema param from StoreHandler: create_store, create_pred_index, create_non_linear_algorithm_index, del_key_in_store, del_pred_in_store, get_sim_in_store, get_pred_in_store, get_key_in_store, set_in_store, get_store, drop_pred_index_in_store, drop_non_linear_algorithm_index, drop_store. - Remove schema param from AIStoreHandler: get, create_store, get_store, validate_and_prepare_store_data, set, get_ndarray_repr_for_store, drop_store, store_original. - Add default_schema field to AIStoreHandler struct and new(). - Drop #[allow(clippy::too_many_arguments)] on set and get_ndarray_repr_for_store. - Update all callers to stop passing &Schema::default(). - Fix del_pred_in_store incorrectly named create_pred_index. - Remove unused Schema imports.

Add optional schema field to CreateStore, DropStore, ListStores, and GetStore protobuf messages. Add DropSchema message and RPC for dropping entire schemas. Proto changes: - Add optional string schema field to CreateStore, DropStore, ListStores, GetStore in both db/ and ai/ query protos - Add DropSchema message (string schema) to both db/ and ai/ protos - Add DropSchema to pipeline oneof and service RPC definitions Store handler changes: - get() searches all schemas (default_schema first, then fallback) - create_store, drop_store accept &Schema parameter - list_stores accepts Option<&Schema> for filtering - drop_schema already existed but is now wired through protobuf Operations layer: - Each operation extracts schema from request, resolves to Schema (defaults to "public" when None), passes to handler - Add drop_schema function for both DB and AI operations - Schema import added where needed All 58 DB tests pass. Both crates compile with clippy.

…tubs - Fix purge_stores counting schemas instead of stores (ai/src/engine/store.rs) - Add AI proxy schema tests (ai/src/tests/aiproxy_test.rs) - Add DB schema tests (db/src/tests/server_tests.rs) - Fix DSL missing schema fields (dsl/src/ai.rs, dsl/src/db.rs, tests) - Regenerate all SDK protobuf stubs (Go, Node, Python) - Add build_fixed.py for Python proto generation - Add multi-schema docs (docs/schema.md)

- Go: goimports + gofmt - Node: prettier - Python: isort import sorting

- Add read_snapshot_raw and load_snapshot_with_migration to Persistence - Add StoreHandler::load_and_migrate_snapshot for DB flat->nested migration - Add AIStoreHandler::load_and_migrate_snapshot for AI flat->nested migration - Wire migration into DB and AI server handler startup - Add migration tests (2 DB + 2 AI) with JSON fixture files - Fix benchmark create_store calls to pass schema parameter - Fix Go SDK lint: add package comments, fix revive warnings

- Replace redundant closures with tuple variant in map_err calls - Remove build_fixed.py from Python SDK as requested

…y compatibility

- Added schema field to AiStoreInfo proto (field 8) for all SDKs - list_stores(None) now defaults to public schema instead of all schemas - Added all_store_names_by_schema() for internal cross-schema iteration - purge_stores now drops DB stores across all schemas using correct schema per store - Regenerated all gRPC stubs (Rust, Go, Node, Python) - Populated schema in get_store from request params

…NodeId #[serde(transparent)] breaks JSON map key deserialization because serde_json's KeyDeserialize trait doesn't propagate through transparent newtypes. JSON map keys are always strings, and u64::deserialize chokes on strings. Restored custom serde that serializes as string and deserializes from both strings and numbers.

gregorian-09 · 2026-06-19T10:23:45Z

There's been some merged changes from @jimezesinachi with respect to clustering mode for ahnlich so perhaps @jimezesinachi can help work through what would be needed here to make the modifications for logical separation fit with his changes

@deven96 @jimezesinachi

So, I went through this diff by diff against everything Jim's clustering touches. writing it up because I want someone to actually check the one thing I'm not 100% sure about, not just rubber-stamp it.

store_runtime.rs / cluster_mutations.rs - clean, no diff. enum, submit_db_command!, all of Jim's stuff is exactly as he left it.

handler.rs - three additions, all additive:

drop_schema handler, copy-pasted the drop_store pattern:

StoreRuntime::Cluster(cluster) => {
    submit_db_command!(Some(cluster), query::DropSchema, params, DbCommand::DropSchema)
}
StoreRuntime::Standalone(store_handler) => {
    operations::drop_schema(store_handler, params)?
}

list_stores actually forwards the request body now instead of ignoring it (list_stores_response(&self.runtime, request.into_inner())). cluster dispatch inside is untouched, this just lets schema: None flow through to standalone.
get_store resolves schema before dispatch, same split as before.
new_with_config swapped Persistence::load_snapshot for load_snapshot_with_migration to handle the versioned format. standalone-only, cluster has its own snapshot mechanism so this doesn't touch it.

cluster_queries.rs - one new param:

pub(crate) async fn list_stores_response(
    runtime: &StoreRuntime,
    params: query::ListStores,
) -> Result<server::StoreList, tonic::Status> {
    if let Some(cluster) = runtime.cluster() {
        cluster.list_stores(params).await?;
    }
    read_store_handler(runtime, |store_handler| {
        Ok(operations::list_stores(store_handler, params))
    })
}

heads up - the cluster branch's return value isn't used, it always falls through to read_store_handler after. pretty sure that's pre-existing and not something I introduced, but flagging it because it reads weird on first pass. param only actually lands in operations::list_stores via the standalone path.

replication/mod.rs - two changes:

DropSchema command, same shape as DropStore, nothing surprising.
get_snapshot now does into_latest() to turn VersionedDbStores into Stores before handing it to the cluster, since snapshots are versioned internally but the cluster snapshot type is still plain Stores:

fn get_snapshot(&self) -> Self::Snapshot {
    self.store_handler.get_snapshot().into_latest()
        .expect("snapshot should be at latest version")
}

this is the one I want someone else's eyes on. it's an .expect() sitting on the replication snapshot path - if a node ever hands back a snapshot that isn't at the latest version, this panics instead of returning an error. probably can't happen given how migration is wired right now, but "probably can't happen" on a path that takes down a node is exactly the kind of thing I don't want to just wave through myself.

restore_snapshot - completely untouched.

operations.rs - all schema plumbing, nothing clever:

resolve_schema() helper, Option<String> to Schema, defaults to public
create_store/drop_store/list_stores take schema now, None behaves identically to before
new drop_schema, self-contained
threw in explicit type annotations on non_linear_indices (e.g. StdHashSet<non_linear_index::Index>) because the compiler asked nicely. no behavior change.

test files - cluster_tests.rs and replication_store_tests.rs, every diff is just schema: None added to existing calls. didn't touch any test logic.

so: nothing in Jim's clustering paths actually changed, and everything new degrades to old behavior when schema is None. I'm fairly confident in all of it except the .expect() in get_snapshot - would like one of you to sanity check whether that assumption actually holds before this gets accepted.

deven96 · 2026-06-21T22:32:20Z

There's been some merged changes from @jimezesinachi with respect to clustering mode for ahnlich so perhaps @jimezesinachi can help work through what would be needed here to make the modifications for logical separation fit with his changes

@deven96 @jimezesinachi

So, I went through this diff by diff against everything Jim's clustering touches. writing it up because I want someone to actually check the one thing I'm not 100% sure about, not just rubber-stamp it.

store_runtime.rs / cluster_mutations.rs - clean, no diff. enum, submit_db_command!, all of Jim's stuff is exactly as he left it.

handler.rs - three additions, all additive:

drop_schema handler, copy-pasted the drop_store pattern:
StoreRuntime::Cluster(cluster) => {
    submit_db_command!(Some(cluster), query::DropSchema, params, DbCommand::DropSchema)
}
StoreRuntime::Standalone(store_handler) => {
    operations::drop_schema(store_handler, params)?
}
list_stores actually forwards the request body now instead of ignoring it (list_stores_response(&self.runtime, request.into_inner())). cluster dispatch inside is untouched, this just lets schema: None flow through to standalone.

get_store resolves schema before dispatch, same split as before.

new_with_config swapped Persistence::load_snapshot for load_snapshot_with_migration to handle the versioned format. standalone-only, cluster has its own snapshot mechanism so this doesn't touch it.

cluster_queries.rs - one new param:
pub(crate) async fn list_stores_response(
    runtime: &StoreRuntime,
    params: query::ListStores,
) -> Result<server::StoreList, tonic::Status> {
    if let Some(cluster) = runtime.cluster() {
        cluster.list_stores(params).await?;
    }
    read_store_handler(runtime, |store_handler| {
        Ok(operations::list_stores(store_handler, params))
    })
}
heads up - the cluster branch's return value isn't used, it always falls through to read_store_handler after. pretty sure that's pre-existing and not something I introduced, but flagging it because it reads weird on first pass. param only actually lands in operations::list_stores via the standalone path.

replication/mod.rs - two changes:

DropSchema command, same shape as DropStore, nothing surprising.

get_snapshot now does into_latest() to turn VersionedDbStores into Stores before handing it to the cluster, since snapshots are versioned internally but the cluster snapshot type is still plain Stores:
fn get_snapshot(&self) -> Self::Snapshot {
    self.store_handler.get_snapshot().into_latest()
        .expect("snapshot should be at latest version")
}
this is the one I want someone else's eyes on. it's an .expect() sitting on the replication snapshot path - if a node ever hands back a snapshot that isn't at the latest version, this panics instead of returning an error. probably can't happen given how migration is wired right now, but "probably can't happen" on a path that takes down a node is exactly the kind of thing I don't want to just wave through myself.

restore_snapshot - completely untouched.

operations.rs - all schema plumbing, nothing clever:

resolve_schema() helper, Option<String> to Schema, defaults to public

create_store/drop_store/list_stores take schema now, None behaves identically to before

new drop_schema, self-contained

threw in explicit type annotations on non_linear_indices (e.g. StdHashSet<non_linear_index::Index>) because the compiler asked nicely. no behavior change.

test files - cluster_tests.rs and replication_store_tests.rs, every diff is just schema: None added to existing calls. didn't touch any test logic.

so: nothing in Jim's clustering paths actually changed, and everything new degrades to old behavior when schema is None. I'm fairly confident in all of it except the .expect() in get_snapshot - would like one of you to sanity check whether that assumption actually holds before this gets accepted.

the cluster branch's return value isn't used, it always falls through to read_store_handler after. pretty sure that's pre-existing and not something I introduced ... yes that's definitely a bug and not intentional cc @jimezesinachi

get_snapshot is primarily used by the current persistence task to serialize to disk asynchronously at intervals and also by the new RAFT mechanism in build_snapshot .... in both cases we are fine to capture the error while logging and propagating accordingly .... if we cannot convert to latest we shouldn't crash but also should not be serializing old formats as is

jimezesinachi · 2026-06-22T10:17:34Z

There's been some merged changes from @jimezesinachi with respect to clustering mode for ahnlich so perhaps @jimezesinachi can help work through what would be needed here to make the modifications for logical separation fit with his changes

@deven96 @jimezesinachi

So, I went through this diff by diff against everything Jim's clustering touches. writing it up because I want someone to actually check the one thing I'm not 100% sure about, not just rubber-stamp it.

store_runtime.rs / cluster_mutations.rs - clean, no diff. enum, submit_db_command!, all of Jim's stuff is exactly as he left it.

handler.rs - three additions, all additive:

drop_schema handler, copy-pasted the drop_store pattern:
StoreRuntime::Cluster(cluster) => {
    submit_db_command!(Some(cluster), query::DropSchema, params, DbCommand::DropSchema)
}
StoreRuntime::Standalone(store_handler) => {
    operations::drop_schema(store_handler, params)?
}
list_stores actually forwards the request body now instead of ignoring it (list_stores_response(&self.runtime, request.into_inner())). cluster dispatch inside is untouched, this just lets schema: None flow through to standalone.

get_store resolves schema before dispatch, same split as before.

new_with_config swapped Persistence::load_snapshot for load_snapshot_with_migration to handle the versioned format. standalone-only, cluster has its own snapshot mechanism so this doesn't touch it.

cluster_queries.rs - one new param:
pub(crate) async fn list_stores_response(
    runtime: &StoreRuntime,
    params: query::ListStores,
) -> Result<server::StoreList, tonic::Status> {
    if let Some(cluster) = runtime.cluster() {
        cluster.list_stores(params).await?;
    }
    read_store_handler(runtime, |store_handler| {
        Ok(operations::list_stores(store_handler, params))
    })
}
heads up - the cluster branch's return value isn't used, it always falls through to read_store_handler after. pretty sure that's pre-existing and not something I introduced, but flagging it because it reads weird on first pass. param only actually lands in operations::list_stores via the standalone path.

replication/mod.rs - two changes:

DropSchema command, same shape as DropStore, nothing surprising.

get_snapshot now does into_latest() to turn VersionedDbStores into Stores before handing it to the cluster, since snapshots are versioned internally but the cluster snapshot type is still plain Stores:
fn get_snapshot(&self) -> Self::Snapshot {
    self.store_handler.get_snapshot().into_latest()
        .expect("snapshot should be at latest version")
}
this is the one I want someone else's eyes on. it's an .expect() sitting on the replication snapshot path - if a node ever hands back a snapshot that isn't at the latest version, this panics instead of returning an error. probably can't happen given how migration is wired right now, but "probably can't happen" on a path that takes down a node is exactly the kind of thing I don't want to just wave through myself.

restore_snapshot - completely untouched.

operations.rs - all schema plumbing, nothing clever:

resolve_schema() helper, Option<String> to Schema, defaults to public

create_store/drop_store/list_stores take schema now, None behaves identically to before

new drop_schema, self-contained

threw in explicit type annotations on non_linear_indices (e.g. StdHashSet<non_linear_index::Index>) because the compiler asked nicely. no behavior change.

test files - cluster_tests.rs and replication_store_tests.rs, every diff is just schema: None added to existing calls. didn't touch any test logic.

so: nothing in Jim's clustering paths actually changed, and everything new degrades to old behavior when schema is None. I'm fairly confident in all of it except the .expect() in get_snapshot - would like one of you to sanity check whether that assumption actually holds before this gets accepted.

@gregorian-09, the version of list_stores_response in the actual code in the main branch does not look like this at all. Here is what it looks like:

pub(crate) async fn list_stores_response(
    runtime: &StoreRuntime,
) -> Result<server::StoreList, tonic::Status> {
    if let Some(cluster) = runtime.cluster() {
        cluster
            .raft
            .ensure_linearizable()
            .await
            .map_err(map_linearizable_error)?;
    }

    read_store_handler(runtime, |store_handler| {
        Ok(operations::list_stores(store_handler))
    })
}

gregorian-09 · 2026-06-22T21:19:05Z

There's been some merged changes from @jimezesinachi with respect to clustering mode for ahnlich so perhaps @jimezesinachi can help work through what would be needed here to make the modifications for logical separation fit with his changes

@deven96 @jimezesinachi
So, I went through this diff by diff against everything Jim's clustering touches. writing it up because I want someone to actually check the one thing I'm not 100% sure about, not just rubber-stamp it.
store_runtime.rs / cluster_mutations.rs - clean, no diff. enum, submit_db_command!, all of Jim's stuff is exactly as he left it.
handler.rs - three additions, all additive:

drop_schema handler, copy-pasted the drop_store pattern:
StoreRuntime::Cluster(cluster) => {
    submit_db_command!(Some(cluster), query::DropSchema, params, DbCommand::DropSchema)
}
StoreRuntime::Standalone(store_handler) => {
    operations::drop_schema(store_handler, params)?
}
list_stores actually forwards the request body now instead of ignoring it (list_stores_response(&self.runtime, request.into_inner())). cluster dispatch inside is untouched, this just lets schema: None flow through to standalone.

get_store resolves schema before dispatch, same split as before.

new_with_config swapped Persistence::load_snapshot for load_snapshot_with_migration to handle the versioned format. standalone-only, cluster has its own snapshot mechanism so this doesn't touch it.

cluster_queries.rs - one new param:
pub(crate) async fn list_stores_response(
    runtime: &StoreRuntime,
    params: query::ListStores,
) -> Result<server::StoreList, tonic::Status> {
    if let Some(cluster) = runtime.cluster() {
        cluster.list_stores(params).await?;
    }
    read_store_handler(runtime, |store_handler| {
        Ok(operations::list_stores(store_handler, params))
    })
}
heads up - the cluster branch's return value isn't used, it always falls through to read_store_handler after. pretty sure that's pre-existing and not something I introduced, but flagging it because it reads weird on first pass. param only actually lands in operations::list_stores via the standalone path.
replication/mod.rs - two changes:

DropSchema command, same shape as DropStore, nothing surprising.

get_snapshot now does into_latest() to turn VersionedDbStores into Stores before handing it to the cluster, since snapshots are versioned internally but the cluster snapshot type is still plain Stores:
fn get_snapshot(&self) -> Self::Snapshot {
    self.store_handler.get_snapshot().into_latest()
        .expect("snapshot should be at latest version")
}
this is the one I want someone else's eyes on. it's an .expect() sitting on the replication snapshot path - if a node ever hands back a snapshot that isn't at the latest version, this panics instead of returning an error. probably can't happen given how migration is wired right now, but "probably can't happen" on a path that takes down a node is exactly the kind of thing I don't want to just wave through myself.
restore_snapshot - completely untouched.
operations.rs - all schema plumbing, nothing clever:

resolve_schema() helper, Option<String> to Schema, defaults to public

create_store/drop_store/list_stores take schema now, None behaves identically to before

new drop_schema, self-contained

threw in explicit type annotations on non_linear_indices (e.g. StdHashSet<non_linear_index::Index>) because the compiler asked nicely. no behavior change.

test files - cluster_tests.rs and replication_store_tests.rs, every diff is just schema: None added to existing calls. didn't touch any test logic.
so: nothing in Jim's clustering paths actually changed, and everything new degrades to old behavior when schema is None. I'm fairly confident in all of it except the .expect() in get_snapshot - would like one of you to sanity check whether that assumption actually holds before this gets accepted.
@gregorian-09, the version of list_stores_response in the actual code in the main branch does not look like this at all. Here is what it looks like:
pub(crate) async fn list_stores_response(
    runtime: &StoreRuntime,
) -> Result<server::StoreList, tonic::Status> {
    if let Some(cluster) = runtime.cluster() {
        cluster
            .raft
            .ensure_linearizable()
            .await
            .map_err(map_linearizable_error)?;
    }

    read_store_handler(runtime, |store_handler| {
        Ok(operations::list_stores(store_handler))
    })
}

@deven96 @jimezesinachi I re-checked the current diff against main, and Jim is right about the list_stores_response path. I'm really sorry about that, Jim. The code on main does not call cluster.list_stores(params).await? and then ignore the result. It calls ensure_linearizable() in cluster mode, then reads from the local store handler. So there isn’t a fall-through bug in that specific path like I thought. I apologize once again boss.

The re-check did surface the behavior we discussed around omitted schemas. DB ListStores { schema: None } was still broader than AI, because it could list stores across schemas. I’ve changed that so an omitted schema defaults to public, matching the intended behavior. The DB schema test has been updated to assert that schema: None only returns public stores. I have also adjusted the AI list-store enrichment path to keep that behavior consistent. AI already filters its own stores by schema, but when it fetched DB metadata it was using the no-schema list call. With the DB behavior corrected, that would only look in public, even for a custom-schema AI request. The DB client now has a schema-aware list helper, and AI passes the requested schema through when enriching store metadata.

While, for the snapshot path, I also changed the RAFT snapshot hook so snapshot conversion errors are propagated instead of panicking. get_snapshot now returns a Result, build_snapshot forwards that error, and the DB implementation maps conversion failures into StorageError::IO. That keeps the node from crashing on snapshot conversion failure while still failing the snapshot operation clearly.

deven96 · 2026-06-24T08:58:31Z

I think we are almost there @gregorian-09

One unresolved comment left

deven96 · 2026-06-24T21:38:00Z

Approved but I'm giving the benchmarks some time to run as I'd want to be sure we didn't introduce some regrettable performance regression @gregorian-09

deven96 · 2026-06-24T23:06:35Z

Approved but I'm giving the benchmarks some time to run as I'd want to be sure we didn't introduce some regrettable performance regression @gregorian-09

The performances look largely the same so I will go ahead and merge this... Just realized the commenter for the benchmarks is broken though cc @Iamdavidonuh

deven96 · 2026-06-24T23:12:21Z

As a follow up @gregorian-09 , we would probably need to update the public facing documentation to show that we support logical separation via schemas in all the commands it was introduced plus the new commands

And then we finally shore up a few additive tests to the autogenerated libraries. Afterwards we should then be ready to cut a new release for the binaries and libraries @Iamdavidonuh

gregorian-09 force-pushed the main branch from 4201f92 to 28190a9 Compare June 16, 2026 11:12