Update altinity-kb-monitoring.md#181

Open
realyota wants to merge 3 commits into Altinity:main from realyota:kb/altinity-kb-monitoring-review

@realyota realyota commented May 6, 2026

I have read the CLA Document and I hereby sign the CLA

@lesandie lesandie self-assigned this May 6, 2026
* Prometheus alerts [https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml](https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml)
* ClickHouse Server: enable the built-in [Prometheus endpoint](https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#server_configuration_parameters-prometheus) in `clickhouse-server` config. It can expose metrics from `system.metrics`, `system.asynchronous_metrics`, `system.events`, and `system.errors`; newer versions can also expose histograms and dimensional metrics through the [Prometheus protocol handler](https://clickhouse.com/docs/interfaces/prometheus). Common dashboards: [14192](https://grafana.com/grafana/dashboards/14192) and [13500](https://grafana.com/grafana/dashboards/13500).
* ClickHouse Keeper: starting with ClickHouse 22.12, Keeper has its own Prometheus endpoint. Configure `prometheus.port` and `prometheus.endpoint` in the Keeper config and scrape every Keeper node; the release example uses port `9369` and `/metrics`. These are Keeper server metrics, not the same thing as ClickHouse Server metrics about ZooKeeper / Keeper client activity. See the [22.12 release note](https://clickhouse.com/blog/clickhouse-release-22-12#clickhouse-keeper---prometheus-endpoint-antonio-andelic).
* Altinity Kubernetes Operator: if ClickHouse is deployed by the operator, use the operator-managed metrics exporter, dashboards, and alerts. See the operator [Prometheus setup](https://github.com/Altinity/clickhouse-operator/blob/eb3fc4e28514d0d6ea25a40698205b02949bcf9d/docs/prometheus_setup.md), [Grafana setup](https://github.com/Altinity/clickhouse-operator/blob/eb3fc4e28514d0d6ea25a40698205b02949bcf9d/docs/grafana_setup.md), [dashboard](https://github.com/Altinity/clickhouse-operator/tree/master/grafana-dashboard), and [alert rules](https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml).

Suggested change
* Altinity Kubernetes Operator: if ClickHouse is deployed by the operator, use the operator-managed metrics exporter, dashboards, and alerts. See the operator [Prometheus setup](https://github.com/Altinity/clickhouse-operator/blob/eb3fc4e28514d0d6ea25a40698205b02949bcf9d/docs/prometheus_setup.md), [Grafana setup](https://github.com/Altinity/clickhouse-operator/blob/eb3fc4e28514d0d6ea25a40698205b02949bcf9d/docs/grafana_setup.md), [dashboard](https://github.com/Altinity/clickhouse-operator/tree/master/grafana-dashboard), and [alert rules](https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml).
* Altinity Kubernetes Operator: if ClickHouse is deployed by the operator, use the operator-managed metrics exporter, dashboards, and alerts. See the operator [Prometheus setup](https://github.com/Altinity/clickhouse-operator/blob/master/docs/prometheus_setup.md), [Grafana setup](https://github.com/Altinity/clickhouse-operator/blob/master/docs/grafana_setup.md), [dashboard](https://github.com/Altinity/clickhouse-operator/tree/master/grafana-dashboard), and [alert rules](https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml).
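
To illustrate the "scrape every Keeper node" advice in the ClickHouse Keeper bullet above, here is a minimal sketch that builds a Prometheus scrape job with one target per Keeper node. The host names (`keeper-1` and so on) and the job name are hypothetical; the port `9369` and path `/metrics` are the values from the release example quoted in the diff, and they must match whatever `prometheus.port` and `prometheus.endpoint` are set to in your Keeper config. The result is printed as JSON for readability and would need to be carried over into your own `prometheus.yml`.

```python
import json


def keeper_scrape_job(hosts, port=9369, metrics_path="/metrics"):
    """Build one Prometheus scrape job that lists every Keeper node as a target."""
    return {
        "job_name": "clickhouse-keeper",  # hypothetical job name
        "metrics_path": metrics_path,
        "static_configs": [
            {"targets": [f"{host}:{port}" for host in hosts]},
        ],
    }


# Hypothetical three-node Keeper ensemble.
job = keeper_scrape_job(["keeper-1", "keeper-2", "keeper-3"])
print(json.dumps({"scrape_configs": [job]}, indent=2))
```

Listing every node explicitly means a missing or unreachable Keeper shows up as a failed scrape target instead of silently disappearing from the dashboard.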

| Keeper / ZooKeeper client state | `SELECT metric, value FROM system.metrics WHERE metric IN ('ZooKeeperSession', 'ZooKeeperSessionExpired', 'ZooKeeperConnectionLossStartedTimestampSeconds', 'ZooKeeperRequest', 'ZooKeeperWatch')`<br>`SELECT event, value FROM system.events WHERE event IN ('ZooKeeperHardwareExceptions', 'ZooKeeperUserExceptions', 'ZooKeeperOtherExceptions')` | Connection-loss timestamp or expired sessions appear, sessions / watches move unexpectedly, or error counters increase | High |
| Error counters | `SELECT name, value, last_error_time, last_error_message FROM system.errors WHERE value > 0 AND last_error_time >= now() - INTERVAL 5 MINUTE ORDER BY last_error_time DESC` | Unexpected error counters increase; do not alert on every non-zero value without a baseline because some counters can increase during successful queries | Medium |
| Detached parts | `SELECT count() FROM system.detached_parts` | Count is greater than your normal baseline; use this table instead of filesystem globs when available | Medium |
| Restart detection | `SELECT value FROM system.asynchronous_metrics WHERE metric = 'Uptime'` | Uptime drops below the expected value | Medium |

That part is very hard to review, and I'm not sure the new content is really better than the old one.

The old one was trivial, with no decisions left to the user.

For example, 'Kafka consumers' and 'S3 disk / object storage errors' are not health checks.

@filimonov filimonov Jun 3, 2026

For now, let's just convert the old table to Markdown and postpone the other changes in that section to the next PR. This one is already too big.
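
As an illustration of the "Error counters" row in the proposed table above (alert on unexpected increases against a baseline, not on every non-zero value), here is a rough sketch. It assumes you have already fetched `name` / `value` pairs from `system.errors` at two points in time; the function name, the sample counts, and the `tolerated` set are all hypothetical.

```python
def increased_errors(baseline, current, tolerated=frozenset()):
    """Return {error_name: increase} for counters that grew past the baseline.

    Names listed in `tolerated` are skipped, because some counters can
    increase during otherwise successful queries.
    """
    result = {}
    for name, value in current.items():
        increase = value - baseline.get(name, 0)
        if increase > 0 and name not in tolerated:
            result[name] = increase
    return result


# Hypothetical snapshots of system.errors taken a few minutes apart.
baseline = {"UNKNOWN_TABLE": 4, "TIMEOUT_EXCEEDED": 10}
current = {"UNKNOWN_TABLE": 4, "TIMEOUT_EXCEEDED": 13, "MEMORY_LIMIT_EXCEEDED": 2}

print(increased_errors(baseline, current, tolerated={"TIMEOUT_EXCEEDED"}))
```

Which names belong in the tolerated set is a per-cluster decision, which is the kind of choice the review comments above point out the old table did not require.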

* Graphite-compatible pipelines: ClickHouse can push `system.metrics`, `system.events`, and `system.asynchronous_metrics` to Graphite with `<graphite>` in `config.xml`; multiple `<graphite>` sections are supported for different intervals. See the ClickHouse [Graphite configuration](https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#server_configuration_parameters-graphite). Do not confuse this monitoring exporter with the [GraphiteMergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/graphitemergetree) table engine, which stores Graphite time-series data in ClickHouse.
* InfluxDB / Telegraf: for InfluxDB stacks, prefer the [Telegraf ClickHouse input plugin](https://docs.influxdata.com/telegraf/v1/input-plugins/clickhouse/) or scrape the ClickHouse Prometheus endpoint through Telegraf. The old InfluxDB v1 [Graphite protocol](https://docs.influxdata.com/influxdb/v1/supported_protocols/graphite/) path is mainly for legacy Graphite-compatible pipelines.
* Nagios / Icinga: keep these checks coarse: `/ping`, `/replicas_status`, host checks, and a small number of thresholded SQL checks. If you write custom plugins, follow the standard [Monitoring Plugins guidelines](https://www.monitoring-plugins.org/doc/guidelines.html) for return codes, thresholds, timeouts, and one-line output. Do not rely on unmaintained ClickHouse-specific plugins without reviewing them first.
* Commercial platforms: [Datadog](https://docs.datadoghq.com/integrations/clickhouse/?tab=host), [Sematext](https://sematext.com/docs/integration/clickhouse/), [IBM Instana](https://www.ibm.com/docs/en/instana-observability?topic=technologies-monitoring-clickhouse), [Site24x7](https://www.site24x7.com/plugins/clickhouse-monitoring.html), and [Acceldata Pulse](https://docs.acceldata.io/pulse/user-guide/clickhouse) have ClickHouse monitoring integrations or documented ClickHouse monitoring workflows. Validate exact metric coverage, ClickHouse version support, and ClickHouse Keeper coverage before relying on a vendor dashboard as the only monitoring source.

Put commercial platforms in a different section (not at the very end of "other solutions").
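
For the Nagios / Icinga bullet in the diff above, a custom check mostly comes down to mapping one value onto the standard plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and printing a single status line. The sketch below assumes the value has already been obtained from a thresholded SQL check; the check name `detached_parts` and the thresholds are hypothetical, and fetching the value and enforcing a timeout are left out.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(name, value, warn, crit):
    """Map a numeric check result to (return_code, one_line_output)."""
    if value is None:
        # The query failed or returned nothing: report UNKNOWN, not OK.
        return UNKNOWN, f"{name} UNKNOWN - no value returned"
    if value >= crit:
        state = CRITICAL
    elif value >= warn:
        state = WARNING
    else:
        state = OK
    # Text after the pipe is performance data: label=value;warn;crit
    return state, f"{name} {LABELS[state]} - value={value} | {name}={value};{warn};{crit}"


code, line = evaluate("detached_parts", 12, warn=10, crit=50)
print(line)
# A real plugin would finish with sys.exit(code) so the scheduler sees the state.
```

Returning UNKNOWN when no value is available keeps a broken check from being read as a healthy server.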

* `system.asynchronous_metrics` is also a snapshot, but values are calculated periodically in the background.
* `system.events` contains cumulative counters since server start. Alert on deltas or rates between scrapes, not on the raw value alone, except for rare counters where any increase is meaningful.
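
A minimal sketch of the "alert on deltas or rates" advice for `system.events`: given two scrapes of the same cumulative counter, compute a per-second rate. The function name and sample values are hypothetical. Treating a backwards jump as a server restart (so the whole current value counts as the delta) is an assumption borrowed from how counter resets are commonly handled, not something the PR text specifies.

```python
def counter_rate(prev_value, curr_value, elapsed_seconds):
    """Per-second rate of a cumulative counter between two scrapes."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards, so the server restarted and counting
        # began again from zero; everything in curr_value is new.
        delta = curr_value
    return delta / elapsed_seconds


# Two hypothetical scrapes of one system.events counter, 60 seconds apart.
print(counter_rate(1_200, 1_500, 60.0))  # normal case
print(counter_rate(1_500, 40, 60.0))     # restart between scrapes
```

Alerting on this rate instead of the raw value avoids a threshold that fires only because the server has been up for a long time.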

If you need a full picture of query volume, latency, errors, or short-lived query spikes, use [`system.query_log`](https://clickhouse.com/docs/operations/system-tables/query_log) / [`system.query_thread_log`](https://clickhouse.com/docs/operations/system-tables/query_thread_log) in addition to scraped metrics.

`query_thread_log` is excessive for normal monitoring (it's more for deeper debugging, together with the trace log and others).


* [https://tech.marksblogg.com/clickhouse-prometheus-grafana.html](https://tech.marksblogg.com/clickhouse-prometheus-grafana.html)
* [Key Metrics for Monitoring ClickHouse](https://sematext.com/blog/clickhouse-monitoring-key-metrics/)
* [OpenTelemetry support](https://clickhouse.com/docs/en/operations/opentelemetry/)

OK for now, but OTel deserves its own article / section.

3 participants