Update altinity-kb-monitoring.md#181
Conversation
| * Prometheus alerts [https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml](https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml) | ||
| * ClickHouse Server: enable the built-in [Prometheus endpoint](https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#server_configuration_parameters-prometheus) in `clickhouse-server` config. It can expose metrics from `system.metrics`, `system.asynchronous_metrics`, `system.events`, and `system.errors`; newer versions can also expose histograms and dimensional metrics through the [Prometheus protocol handler](https://clickhouse.com/docs/interfaces/prometheus). Common dashboards: [14192](https://grafana.com/grafana/dashboards/14192) and [13500](https://grafana.com/grafana/dashboards/13500). | ||
| * ClickHouse Keeper: starting with ClickHouse 22.12, Keeper has its own Prometheus endpoint. Configure `prometheus.port` and `prometheus.endpoint` in the Keeper config and scrape every Keeper node; the release example uses port `9369` and `/metrics`. These are Keeper server metrics, not the same thing as ClickHouse Server metrics about ZooKeeper / Keeper client activity. See the [22.12 release note](https://clickhouse.com/blog/clickhouse-release-22-12#clickhouse-keeper---prometheus-endpoint-antonio-andelic). | ||
| * Altinity Kubernetes Operator: if ClickHouse is deployed by the operator, use the operator-managed metrics exporter, dashboards, and alerts. See the operator [Prometheus setup](https://github.com/Altinity/clickhouse-operator/blob/eb3fc4e28514d0d6ea25a40698205b02949bcf9d/docs/prometheus_setup.md), [Grafana setup](https://github.com/Altinity/clickhouse-operator/blob/eb3fc4e28514d0d6ea25a40698205b02949bcf9d/docs/grafana_setup.md), [dashboard](https://github.com/Altinity/clickhouse-operator/tree/master/grafana-dashboard), and [alert rules](https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml). |
There was a problem hiding this comment.
| * Altinity Kubernetes Operator: if ClickHouse is deployed by the operator, use the operator-managed metrics exporter, dashboards, and alerts. See the operator [Prometheus setup](https://github.com/Altinity/clickhouse-operator/blob/eb3fc4e28514d0d6ea25a40698205b02949bcf9d/docs/prometheus_setup.md), [Grafana setup](https://github.com/Altinity/clickhouse-operator/blob/eb3fc4e28514d0d6ea25a40698205b02949bcf9d/docs/grafana_setup.md), [dashboard](https://github.com/Altinity/clickhouse-operator/tree/master/grafana-dashboard), and [alert rules](https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml). | |
| * Altinity Kubernetes Operator: if ClickHouse is deployed by the operator, use the operator-managed metrics exporter, dashboards, and alerts. See the operator [Prometheus setup](https://github.com/Altinity/clickhouse-operator/blob/master/docs/prometheus_setup.md), [Grafana setup](https://github.com/Altinity/clickhouse-operator/blob/master/docs/grafana_setup.md), [dashboard](https://github.com/Altinity/clickhouse-operator/tree/master/grafana-dashboard), and [alert rules](https://github.com/Altinity/clickhouse-operator/blob/master/deploy/prometheus/prometheus-alert-rules-clickhouse.yaml). |
| | Keeper / ZooKeeper client state | `SELECT metric, value FROM system.metrics WHERE metric IN ('ZooKeeperSession', 'ZooKeeperSessionExpired', 'ZooKeeperConnectionLossStartedTimestampSeconds', 'ZooKeeperRequest', 'ZooKeeperWatch')`<br>`SELECT event, value FROM system.events WHERE event IN ('ZooKeeperHardwareExceptions', 'ZooKeeperUserExceptions', 'ZooKeeperOtherExceptions')` | Connection-loss timestamp or expired sessions appear, sessions / watches move unexpectedly, or error counters increase | High | | ||
| | Error counters | `SELECT name, value, last_error_time, last_error_message FROM system.errors WHERE value > 0 AND last_error_time >= now() - INTERVAL 5 MINUTE ORDER BY last_error_time DESC` | Unexpected error counters increase; do not alert on every non-zero value without a baseline because some counters can increase during successful queries | Medium | | ||
| | Detached parts | `SELECT count() FROM system.detached_parts` | Count is greater than your normal baseline; use this table instead of filesystem globs when available | Medium | | ||
| | Restart detection | `SELECT value FROM system.asynchronous_metrics WHERE metric = 'Uptime'` | Uptime drops below the expected value | Medium | |
There was a problem hiding this comment.
That part - is very hard to review, and i'm not sure if new content is really better than the old one.
Old was trivial, no decision left to the user.
There was a problem hiding this comment.
For example 'Kafka consumers' and 'S3 disk / object storage errors' are not healthchecks.
There was a problem hiding this comment.
Let's for now just convert old table to markdown, and postpone the other changes in that section to the next PR. This one is already too big.
| * Graphite-compatible pipelines: ClickHouse can push `system.metrics`, `system.events`, and `system.asynchronous_metrics` to Graphite with `<graphite>` in `config.xml`; multiple `<graphite>` sections are supported for different intervals. See the ClickHouse [Graphite configuration](https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#server_configuration_parameters-graphite). Do not confuse this monitoring exporter with the [GraphiteMergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/graphitemergetree) table engine, which stores Graphite time-series data in ClickHouse. | ||
| * InfluxDB / Telegraf: for InfluxDB stacks, prefer the [Telegraf ClickHouse input plugin](https://docs.influxdata.com/telegraf/v1/input-plugins/clickhouse/) or scrape the ClickHouse Prometheus endpoint through Telegraf. The old InfluxDB v1 [Graphite protocol](https://docs.influxdata.com/influxdb/v1/supported_protocols/graphite/) path is mainly for legacy Graphite-compatible pipelines. | ||
| * Nagios / Icinga: keep these checks coarse: `/ping`, `/replicas_status`, host checks, and a small number of thresholded SQL checks. If you write custom plugins, follow the standard [Monitoring Plugins guidelines](https://www.monitoring-plugins.org/doc/guidelines.html) for return codes, thresholds, timeouts, and one-line output. Do not rely on unmaintained ClickHouse-specific plugins without reviewing them first. | ||
| * Commercial platforms: [Datadog](https://docs.datadoghq.com/integrations/clickhouse/?tab=host), [Sematext](https://sematext.com/docs/integration/clickhouse/), [IBM Instana](https://www.ibm.com/docs/en/instana-observability?topic=technologies-monitoring-clickhouse), [Site24x7](https://www.site24x7.com/plugins/clickhouse-monitoring.html), and [Acceldata Pulse](https://docs.acceldata.io/pulse/user-guide/clickhouse) have ClickHouse monitoring integrations or documented ClickHouse monitoring workflows. Validate exact metric coverage, ClickHouse version support, and ClickHouse Keeper coverage before relying on a vendor dashboard as the only monitoring source. |
There was a problem hiding this comment.
Put commercial in a different section (not at the very and of "other solutions")
| * `system.asynchronous_metrics` is also a snapshot, but values are calculated periodically in the background. | ||
| * `system.events` contains cumulative counters since server start. Alert on deltas or rates between scrapes, not on the raw value alone, except for rare counters where any increase is meaningful. | ||
|
|
||
| If you need a full picture of query volume, latency, errors, or short-lived query spikes, use [`system.query_log`](https://clickhouse.com/docs/operations/system-tables/query_log) / [`system.query_thread_log`](https://clickhouse.com/docs/operations/system-tables/query_thread_log) in addition to scraped metrics. |
There was a problem hiding this comment.
query_thread_log for normal monitoring is excessive (it's more for a deeper debug, together with trace log and others).
|
|
||
| * [https://tech.marksblogg.com/clickhouse-prometheus-grafana.html](https://tech.marksblogg.com/clickhouse-prometheus-grafana.html) | ||
| * [Key Metrics for Monitoring ClickHouse](https://sematext.com/blog/clickhouse-monitoring-key-metrics/) | ||
| * [OpenTelemetry support](https://clickhouse.com/docs/en/operations/opentelemetry/) |
There was a problem hiding this comment.
ok for now. But otel deserves own article / section
I have read the CLA Document and I hereby sign the CLA