Update altinity-kb-monitoring.md by realyota · Pull Request #181 · Altinity/altinityknowledgebase

realyota · 2026-05-06T12:52:31Z

I have read the CLA Document and I hereby sign the CLA

filimonov · 2026-06-03T08:46:58Z

-* Prometheus alerts [https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml](https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml)
+* ClickHouse Server: enable the built-in [Prometheus endpoint](https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#server_configuration_parameters-prometheus) in `clickhouse-server` config. It can expose metrics from `system.metrics`, `system.asynchronous_metrics`, `system.events`, and `system.errors`; newer versions can also expose histograms and dimensional metrics through the [Prometheus protocol handler](https://clickhouse.com/docs/interfaces/prometheus). Common dashboards: [14192](https://grafana.com/grafana/dashboards/14192) and [13500](https://grafana.com/grafana/dashboards/13500).
+* ClickHouse Keeper: starting with ClickHouse 22.12, Keeper has its own Prometheus endpoint. Configure `prometheus.port` and `prometheus.endpoint` in the Keeper config and scrape every Keeper node; the release example uses port `9369` and `/metrics`. These are Keeper server metrics, not the same thing as ClickHouse Server metrics about ZooKeeper / Keeper client activity. See the [22.12 release note](https://clickhouse.com/blog/clickhouse-release-22-12#clickhouse-keeper---prometheus-endpoint-antonio-andelic).
+* Altinity Kubernetes Operator: if ClickHouse is deployed by the operator, use the operator-managed metrics exporter, dashboards, and alerts. See the operator [Prometheus setup](https://github.com/Altinity/clickhouse-operator/blob/eb3fc4e28514d0d6ea25a40698205b02949bcf9d/docs/prometheus_setup.md), [Grafana setup](https://github.com/Altinity/clickhouse-operator/blob/eb3fc4e28514d0d6ea25a40698205b02949bcf9d/docs/grafana_setup.md), [dashboard](https://github.com/Altinity/clickhouse-operator/tree/master/grafana-dashboard), and [alert rules](https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml).


Suggested change

* Altinity Kubernetes Operator: if ClickHouse is deployed by the operator, use the operator-managed metrics exporter, dashboards, and alerts. See the operator [Prometheus setup](https://github.com/Altinity/clickhouse-operator/blob/eb3fc4e28514d0d6ea25a40698205b02949bcf9d/docs/prometheus_setup.md), [Grafana setup](https://github.com/Altinity/clickhouse-operator/blob/eb3fc4e28514d0d6ea25a40698205b02949bcf9d/docs/grafana_setup.md), [dashboard](https://github.com/Altinity/clickhouse-operator/tree/master/grafana-dashboard), and [alert rules](https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml).

* Altinity Kubernetes Operator: if ClickHouse is deployed by the operator, use the operator-managed metrics exporter, dashboards, and alerts. See the operator [Prometheus setup](https://github.com/Altinity/clickhouse-operator/blob/master/docs/prometheus_setup.md), [Grafana setup](https://github.com/Altinity/clickhouse-operator/blob/master/docs/grafana_setup.md), [dashboard](https://github.com/Altinity/clickhouse-operator/tree/master/grafana-dashboard), and [alert rules](https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml).

filimonov · 2026-06-03T09:04:34Z

+| Keeper / ZooKeeper client state | `SELECT metric, value FROM system.metrics WHERE metric IN ('ZooKeeperSession', 'ZooKeeperSessionExpired', 'ZooKeeperConnectionLossStartedTimestampSeconds', 'ZooKeeperRequest', 'ZooKeeperWatch')`<br>`SELECT event, value FROM system.events WHERE event IN ('ZooKeeperHardwareExceptions', 'ZooKeeperUserExceptions', 'ZooKeeperOtherExceptions')` | Connection-loss timestamp or expired sessions appear, sessions / watches move unexpectedly, or error counters increase | High |
+| Error counters | `SELECT name, value, last_error_time, last_error_message FROM system.errors WHERE value > 0 AND last_error_time >= now() - INTERVAL 5 MINUTE ORDER BY last_error_time DESC` | Unexpected error counters increase; do not alert on every non-zero value without a baseline because some counters can increase during successful queries | Medium |
+| Detached parts | `SELECT count() FROM system.detached_parts` | Count is greater than your normal baseline; use this table instead of filesystem globs when available | Medium |
+| Restart detection | `SELECT value FROM system.asynchronous_metrics WHERE metric = 'Uptime'` | Uptime drops below the expected value | Medium |


That part is very hard to review, and I'm not sure the new content is really better than the old one.

The old version was trivial and left no decisions to the user.

For example, 'Kafka consumers' and 'S3 disk / object storage errors' are not health checks.

For now, let's just convert the old table to Markdown and postpone the other changes in that section to the next PR. This one is already too big.
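
For reference, the "Restart detection" row in the hunk above ("Uptime drops below the expected value") can be read as a comparison between consecutive samples. The following is only a minimal sketch under that assumption: the function name `detect_restarts` and the idea of feeding it a list of already-collected `Uptime` samples are hypothetical and are not part of the PR or of ClickHouse.

```python
from typing import List, Sequence


def detect_restarts(uptime_samples: Sequence[float]) -> List[int]:
    """Return the indexes of samples where Uptime is lower than the previous sample.

    Assumption: samples are ordered by collection time and come from the
    'Uptime' row of system.asynchronous_metrics on a single server.
    """
    restarts = []
    for i in range(1, len(uptime_samples)):
        # Uptime only grows while the server is running, so any decrease
        # between two consecutive samples indicates a restart in between.
        if uptime_samples[i] < uptime_samples[i - 1]:
            restarts.append(i)
    return restarts
```

Comparing against the previous sample rather than a fixed threshold avoids having to define what "expected" uptime is, at the cost of missing a restart that happens between two widely spaced samples.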

filimonov · 2026-06-03T09:09:37Z

+* Graphite-compatible pipelines: ClickHouse can push `system.metrics`, `system.events`, and `system.asynchronous_metrics` to Graphite with `<graphite>` in `config.xml`; multiple `<graphite>` sections are supported for different intervals. See the ClickHouse [Graphite configuration](https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#server_configuration_parameters-graphite). Do not confuse this monitoring exporter with the [GraphiteMergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/graphitemergetree) table engine, which stores Graphite time-series data in ClickHouse.
+* InfluxDB / Telegraf: for InfluxDB stacks, prefer the [Telegraf ClickHouse input plugin](https://docs.influxdata.com/telegraf/v1/input-plugins/clickhouse/) or scrape the ClickHouse Prometheus endpoint through Telegraf. The old InfluxDB v1 [Graphite protocol](https://docs.influxdata.com/influxdb/v1/supported_protocols/graphite/) path is mainly for legacy Graphite-compatible pipelines.
+* Nagios / Icinga: keep these checks coarse: `/ping`, `/replicas_status`, host checks, and a small number of thresholded SQL checks. If you write custom plugins, follow the standard [Monitoring Plugins guidelines](https://www.monitoring-plugins.org/doc/guidelines.html) for return codes, thresholds, timeouts, and one-line output. Do not rely on unmaintained ClickHouse-specific plugins without reviewing them first.
+* Commercial platforms: [Datadog](https://docs.datadoghq.com/integrations/clickhouse/?tab=host), [Sematext](https://sematext.com/docs/integration/clickhouse/), [IBM Instana](https://www.ibm.com/docs/en/instana-observability?topic=technologies-monitoring-clickhouse), [Site24x7](https://www.site24x7.com/plugins/clickhouse-monitoring.html), and [Acceldata Pulse](https://docs.acceldata.io/pulse/user-guide/clickhouse) have ClickHouse monitoring integrations or documented ClickHouse monitoring workflows. Validate exact metric coverage, ClickHouse version support, and ClickHouse Keeper coverage before relying on a vendor dashboard as the only monitoring source.


Put the commercial platforms in a separate section (not at the very end of "other solutions").
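
As an illustration of the Nagios / Icinga bullet in the hunk above (return codes, thresholds, one-line output), a thresholded check could be sketched roughly as follows. The exit codes 0 to 3 follow the Monitoring Plugins convention; the function names, the "higher is worse" threshold direction, and the output format are assumptions made for this example, not something the PR specifies.

```python
from typing import Optional

# Exit codes defined by the Monitoring Plugins guidelines.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value: Optional[float], warn: float, crit: float) -> int:
    """Map a metric value to a plugin state, assuming higher values are worse."""
    if value is None:
        # No value (query failed or timed out): report UNKNOWN, not OK.
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_line(check_name: str, state: int, value: Optional[float]) -> str:
    """Build the single output line (hypothetical format for this sketch)."""
    return f"{check_name} {STATE_NAMES[state]} - value={value}"
```

A real plugin would print the line and then call `sys.exit(state)`, with the value coming from one of the SQL checks and a query timeout enforced by the caller.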

filimonov · 2026-06-03T09:10:45Z

+* `system.asynchronous_metrics` is also a snapshot, but values are calculated periodically in the background.
+* `system.events` contains cumulative counters since server start. Alert on deltas or rates between scrapes, not on the raw value alone, except for rare counters where any increase is meaningful.
+
+If you need a full picture of query volume, latency, errors, or short-lived query spikes, use [`system.query_log`](https://clickhouse.com/docs/operations/system-tables/query_log) / [`system.query_thread_log`](https://clickhouse.com/docs/operations/system-tables/query_thread_log) in addition to scraped metrics.


`query_thread_log` is excessive for normal monitoring (it is more for deeper debugging, together with the trace log and others).
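
The `system.events` bullet in the hunk above says to alert on deltas or rates between scrapes because the counters are cumulative since server start. A minimal sketch of that calculation follows; the function name `counter_rate` and the reset handling (treating the current value as the increase since restart) are assumptions for illustration, similar in spirit to how Prometheus-style tooling handles counter resets.

```python
def counter_rate(prev_value: float, curr_value: float, interval_seconds: float) -> float:
    """Return the per-second increase of a cumulative counter between two scrapes."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if curr_value < prev_value:
        # The counter went backwards, so the server restarted between scrapes.
        # Assumption: count everything accumulated since the restart as the delta.
        delta = curr_value
    else:
        delta = curr_value - prev_value
    return delta / interval_seconds
```

For example, `counter_rate(100, 160, 60)` gives one event per second, while a raw value of 160 says nothing without knowing how long the server has been up.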

filimonov · 2026-06-03T09:12:10Z


-* [https://tech.marksblogg.com/clickhouse-prometheus-grafana.html](https://tech.marksblogg.com/clickhouse-prometheus-grafana.html)
-* [Key Metrics for Monitoring ClickHouse](https://sematext.com/blog/clickhouse-monitoring-key-metrics/)
+* [OpenTelemetry support](https://clickhouse.com/docs/en/operations/opentelemetry/)


OK for now, but OpenTelemetry deserves its own article / section.

realyota added 3 commits March 17, 2026 13:31

kb: tighten monitoring article checks

7296178

Merge branch 'Altinity:main' into kb/altinity-kb-monitoring-review

37be41b

Update altinity-kb-monitoring.md

438aae7

lesandie self-assigned this May 6, 2026

filimonov reviewed Jun 3, 2026

filimonov requested changes Jun 3, 2026

realyota wants to merge 3 commits into Altinity:main from realyota:kb/altinity-kb-monitoring-review

3 participants
