Add ClickBench example and create-chkit scaffold #128
Draft
KeKs0r wants to merge 9 commits into
Adds examples/clickbench/ with the full ClickBench hits schema and a load migration that ingests the public ClickBench dataset via the ClickHouse url() table function. Targets ObsessionDB via the plugin-obsessiondb plugin. Includes a .gitignore exception so example clickhouse.config.ts files are tracked, and clarifies the --migration-id flag description (used while authoring the load migration).
create-chkit downloads a curated example from github:obsessiondb/chkit/examples/<name> via giget, rewrites the project name and repins chkit + @chkit/* deps to latest, then runs install with the auto-detected package manager. Default example is clickbench. Built with @clack/prompts for the UI. Restructures Getting Started docs into two pages — Start with an example (using create-chkit) and Add to an existing project (using chkit init) — and updates the docs link printed by chkit init.
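The manifest rewrite described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual create-chkit code: the `PackageJson` shape and the names `isChkitPackage`, `repin`, and `rewriteManifest` are assumptions made for the example.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not create-chkit's real code.
interface PackageJson {
  name?: string
  dependencies?: Record<string, string>
  devDependencies?: Record<string, string>
}

// True for the chkit package itself and for anything in the @chkit scope.
function isChkitPackage(dep: string): boolean {
  return dep === 'chkit' || dep.startsWith('@chkit/')
}

// Returns a copy of the dependency map with chkit packages pinned to "latest".
function repin(deps: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {}
  for (const [dep, range] of Object.entries(deps)) {
    out[dep] = isChkitPackage(dep) ? 'latest' : range
  }
  return out
}

// Sets the project name and repins chkit deps; all other fields are left untouched.
function rewriteManifest(pkg: PackageJson, projectName: string): PackageJson {
  const next: PackageJson = { ...pkg, name: projectName }
  if (pkg.dependencies) next.dependencies = repin(pkg.dependencies)
  if (pkg.devDependencies) next.devDependencies = repin(pkg.devDependencies)
  return next
}
```

Third-party dependency ranges pass through unchanged, so only the chkit packages float to the newest release at scaffold time.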
The clickbench load migration truncates default.hits before reloading the 70 GB ClickBench dataset. ClickHouse's default max_table_size_to_drop (50 GB) blocks the TRUNCATE once the table is already populated, leaving the migration stuck. Pass max_table_size_to_drop = 0 and max_partition_size_to_drop = 0 on the TRUNCATE so the migration is re-runnable against a partially or fully loaded table.
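One way to scope the two limits to the TRUNCATE alone is to send them as per-request settings. The sketch below assumes the statement is issued through `command()` from @clickhouse/client, which accepts a `query` and optional `clickhouse_settings`; whether the migration does this or uses a SQL `SETTINGS` clause is not stated here, and `CommandParams` and `truncateWithoutSizeLimit` are hypothetical names.

```typescript
// Local stand-in for the parameter shape taken by command() in @clickhouse/client.
interface CommandParams {
  query: string
  clickhouse_settings?: Record<string, string>
}

// Hypothetical helper: builds a TRUNCATE request with both drop-size limits disabled.
// A value of 0 removes the size restriction for this request only.
function truncateWithoutSizeLimit(table: string): CommandParams {
  return {
    query: `TRUNCATE TABLE ${table}`,
    clickhouse_settings: {
      max_table_size_to_drop: '0',
      max_partition_size_to_drop: '0',
    },
  }
}
```

Keeping the override on the single request leaves the server-wide safety limit in place for every other DROP or TRUNCATE.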
@@ -0,0 +1,16 @@
import { defineConfig } from '@chkit/core'
@@ -0,0 +1,122 @@
import { schema, table } from '@chkit/core'
The @clickhouse/client default request timeout is 30s, which kills an in-flight migrate on long-running DDL or INSERT statements. The ClickBench dataset load is the canonical example: the load query took longer than 30s, the client closed the socket, and the server cancelled the INSERT. Lift the default to 120s across the stateless, session, and DDL-fallback clients. Properly exposing a per-config timeout is a follow-up.
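A minimal sketch of applying one shared default to every client, assuming the clients are built with `createClient` from @clickhouse/client, whose `request_timeout` option is in milliseconds. `ClientOptions` is a local stand-in type and `withRequestTimeout` is a hypothetical helper; the explicit-override branch anticipates the per-config follow-up and is not part of this change.

```typescript
// Local stand-in for the subset of @clickhouse/client options used here.
interface ClientOptions {
  url: string
  request_timeout?: number // milliseconds
}

// 120s, replacing the library's 30s default.
const DEFAULT_REQUEST_TIMEOUT_MS = 120_000

// Hypothetical helper: fills in the shared default unless a timeout is already set.
function withRequestTimeout(options: ClientOptions): ClientOptions {
  return {
    ...options,
    request_timeout: options.request_timeout ?? DEFAULT_REQUEST_TIMEOUT_MS,
  }
}
```

Routing the stateless, session, and DDL-fallback clients through one helper keeps the three timeouts from drifting apart.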
The single `INSERT ... FROM url(hits_{0..99}.parquet)` exceeds the
hard request-duration limit on edge proxies in front of managed
ClickHouse deployments (the ObsessionDB customer-benchmark endpoint
504'd at ~10 min, ~55M of 100M rows). Split into five 20-file chunks
so each INSERT fits well under typical proxy budgets, and pin
max_execution_time = 0 on every chunk to keep the server-side query
timer from biting. Verified end-to-end against ObsessionDB: full
99,997,497 rows / 8.69 GiB loaded.
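The chunking can be sketched as a small statement generator. This is an illustration under assumptions: the base URL is a parameter, `SELECT *` stands in for whatever column list the real migration uses, and `splitIntoRanges` and `buildChunkedInserts` are hypothetical names.

```typescript
interface FileRange {
  first: number
  last: number
}

// Splits file indexes 0..fileCount-1 into consecutive inclusive ranges.
function splitIntoRanges(fileCount: number, chunkSize: number): FileRange[] {
  const ranges: FileRange[] = []
  for (let first = 0; first < fileCount; first += chunkSize) {
    ranges.push({ first, last: Math.min(first + chunkSize, fileCount) - 1 })
  }
  return ranges
}

// Hypothetical helper: one INSERT per range, each using a url() numeric glob
// and max_execution_time = 0 to disable the server-side query timer.
function buildChunkedInserts(
  baseUrl: string,
  table: string,
  fileCount: number,
  chunkSize: number,
): string[] {
  return splitIntoRanges(fileCount, chunkSize).map(
    ({ first, last }) =>
      `INSERT INTO ${table} SELECT * FROM url('${baseUrl}/hits_{${first}..${last}}.parquet') ` +
      `SETTINGS max_execution_time = 0`,
  )
}
```

With 100 files and a chunk size of 20 this yields five statements covering `{0..19}` through `{80..99}`, each of which has to finish within the proxy's request budget on its own.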
Replace the 5×20-file chunked url() load with a single s3() INSERT. s3() does native partitioned-Parquet parallelism that url() doesn't; combined with max_download_threads = 32 and max_insert_threads = 16 this is expected to drop wall time from ~13 min to ~3-5 min and bring the whole load comfortably under typical edge-proxy request budgets, removing the need for chunking. max_execution_time = 0 is still required to disable the server-side query timer. The dataset URL changes from datasets.clickhouse.com (CloudFront alias) to the underlying clickhouse-public-datasets S3 bucket.
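A rough sketch of the single-statement form, assuming the settings are attached with a `SETTINGS` clause on the INSERT. The object URL is left as a parameter, `SELECT *` stands in for the real column list, any format or credential arguments to s3() are omitted, and `buildS3Insert` is a hypothetical name.

```typescript
// Hypothetical helper: renders one INSERT ... SELECT FROM s3(...) statement with
// the given settings, emitted in the order they appear in the object.
function buildS3Insert(
  table: string,
  s3Url: string,
  settings: Record<string, number>,
): string {
  const settingsClause = Object.entries(settings)
    .map(([name, value]) => `${name} = ${value}`)
    .join(', ')
  return `INSERT INTO ${table} SELECT * FROM s3('${s3Url}') SETTINGS ${settingsClause}`
}

// The values named in the commit message above.
const loadSettings = {
  max_download_threads: 32,
  max_insert_threads: 16,
  max_execution_time: 0, // disables the server-side query timer
}
```

Because the whole load is one statement again, its total wall time has to stay under the proxy budget, which is why the thread settings matter here.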
Adds a ClickBench CHKit example with schema and a separate full dataset load migration from datasets.clickhouse.com. Adds the create-chkit scaffolding package so users can start from curated examples, and refreshes getting-started docs around example-first and existing-project flows. Clarifies that --migration-id is an escape hatch for overriding the default timestamp migration prefix. Validated with package CLI typecheck/lint and a docs build.