Elasticsearch vs. OpenSearch vs. Loki vs. Quickwit vs. ClickHouse: Operations

Cluster setup, ingest pipelines, and backup & disaster recovery differ significantly across Elasticsearch, OpenSearch, Grafana Loki, Quickwit, and ClickHouse; these differences often matter more than raw performance when a log archive must run reliably for 7+ years.

This is Part 2 of a three-part series. Part 1 covers the technical comparison: storage model, object storage tiering, compression, resource consumption, query languages, and SaaS options. This part covers cluster setup and HA requirements, ingest pipeline configuration, Day-2 operations on Kubernetes and bare metal, backup & disaster recovery, hidden object storage costs, observability, and alerting integration. Part 3 covers encryption, access control, and WORM immutability for compliance archives.

Setup Requirements

Running a log archive in production means maintaining the system for years. The setup complexity and HA requirements vary significantly across the five candidates.

| | Elasticsearch | OpenSearch | Loki | Quickwit | ClickHouse |
|---|---|---|---|---|---|
| Min. nodes for HA | 3 master-eligible + 2 data (5+ total) | same as ES | 3+ (write / read / backend roles) | 2+ indexers/searchers + HA PostgreSQL | 3 nodes (ClickHouse server + Keeper co-located) |
| External coordination | none (built-in Raft since 7.x) | none (built-in Raft) | none | PostgreSQL (for HA metastore) | ClickHouse Keeper (built-in, 3-node quorum) |
| Object storage dependency | optional (required for Frozen Tier) | optional (required for Frozen Tier) | required (primary store) | required (primary store) | optional (required for cold tier) |
| Kubernetes operator | ECK (official, very mature) | OpenSearch Operator (less mature) | Helm chart (grafana/loki, well-maintained) | Helm chart (community) | clickhouse-operator (Altinity, mature) |
| Setup complexity | High (many node roles at scale) | High | Medium | Medium | Medium |

Elasticsearch / OpenSearch

Minimum HA requires 3 master-eligible nodes for quorum to prevent split-brain. At scale, dedicated master, data, coordinating, and ingest node roles are common, growing the cluster to 7+ nodes. No external coordination service is needed: since 7.x both ship a built-in Raft-based coordination layer, with no ZooKeeper dependency.

ECK (Elastic Cloud on Kubernetes) is the de-facto Kubernetes operator for Elasticsearch: official, very mature, handles rolling upgrades and keystore management. The OpenSearch Operator is functional but less battle-tested.

Loki

Loki runs in three modes: monolithic (development only), Simple Scalable (3 roles: write / read / backend), and full microservices. Simple Scalable with 2+ replicas per role is the recommended production configuration. Object storage is required from day one; there is no local-only production mode. The Compactor must run as a singleton; leader-election HA is possible but adds operational complexity. No external coordination service is needed.

Quickwit

Stateless Indexers and Searchers make horizontal scaling straightforward. The operational complexity concentrates in the Metastore: in development mode a file on S3 is sufficient, but production HA requires PostgreSQL (e.g. RDS, CloudSQL, or self-hosted Patroni). By contrast, ClickHouse is itself a database and brings its own cluster requirements (Keeper + ReplicatedMergeTree) but nothing external; Quickwit additionally requires PostgreSQL, a second, separate database to deploy and maintain alongside the log store itself.
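
A minimal sketch of that switch, assuming the standard Quickwit node configuration file; hostname, credentials, database name, and bucket are placeholders:

# Hedged sketch: point Quickwit at a PostgreSQL metastore instead of the
# file-backed development default. All connection details are placeholders.
cat >> quickwit.yaml <<'EOF'
metastore_uri: postgres://quickwit:CHANGE_ME@postgres.internal:5432/quickwit_metastore
default_index_root_uri: s3://log-archive/quickwit-indexes
EOF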

ClickHouse

HA in ClickHouse requires two mechanisms working together:

  • ReplicatedMergeTree table engine: data is replicated between ClickHouse server nodes.
  • ClickHouse Keeper (built-in ZooKeeper replacement, stable since 22.x): coordinates replicated table state and requires a 3-node quorum.

The minimum production HA setup is 3 nodes, each running ClickHouse server and ClickHouse Keeper co-located. For larger clusters Keeper is moved to 3 dedicated nodes. The clickhouse-operator (Altinity) handles this topology well on Kubernetes. ZooKeeper is no longer required; Keeper is fully self-contained in the ClickHouse binary.
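
For orientation, creating a replicated log table on such a cluster might look like the sketch below; the cluster name, schema, and partitioning are illustrative, and the {shard}/{replica} macros are assumed to be defined in each node's configuration:

# Hedged sketch: a ReplicatedMergeTree table coordinated through Keeper.
clickhouse-client --query "
CREATE TABLE logs ON CLUSTER logs_cluster
(
    timestamp DateTime64(3),
    level     LowCardinality(String),
    message   String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/logs', '{replica}')
PARTITION BY toYYYYMM(timestamp)
ORDER BY (level, timestamp)"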

Ingest: Getting Data In

Choosing a log store also means choosing how data arrives there. The two variables are the shipping agent (runs on the source node or as a Kubernetes DaemonSet) and whether an intermediate buffer like Kafka is used.

Log ingest pipeline: shipping agents (Vector, Fluent Bit, OTel Collector, Filebeat, Promtail) → optional Kafka buffer → log stores (Elasticsearch, OpenSearch, Loki, Quickwit, ClickHouse)

Agent support matrix

| Agent | Elasticsearch / OpenSearch | Loki | Quickwit | ClickHouse |
|---|---|---|---|---|
| Vector | ✓ native sink | ✓ native sink | ✓ HTTP sink | ✓ native sink |
| Fluent Bit | ✓ native | ✓ native | ✓ HTTP output | ✓ HTTP output |
| Fluentd | ✓ native | ✓ plugin | ✓ HTTP plugin | ✓ plugin |
| Filebeat / Elastic Agent | ✓ native | ✗ | ✗ | ✗ |
| Promtail | ✗ | ✓ native | ✗ | ✗ |
| Grafana Alloy | ✓ | ✓ native | ✓ OTLP | ✓ OTLP |
| OpenTelemetry Collector | ✓ exporter | ✓ exporter | ✓ OTLP native | ✓ community exporter |
| Kafka (as source) | ✓ Kafka Connector | ✓ consumer | ✓ native Kafka source | ✓ Kafka table engine |

Key agents

Vector is the most versatile choice: it supports all five systems, is written in Rust (low CPU and memory overhead), and transforms events in-flight via VRL (Vector Remap Language). For a new product without legacy constraints, Vector as a Kubernetes DaemonSet is the most straightforward path.

Fluent Bit is lighter still (C binary, ~1 MB) and is the default DaemonSet agent in many Kubernetes distributions. It covers ES/OpenSearch, Loki, and ClickHouse well; its transformation capabilities are less expressive than VRL.

Filebeat and Elastic Agent are tightly coupled to the Elastic ecosystem. They are not usable with Loki, Quickwit, or ClickHouse. If you are already committed to Elasticsearch or OpenSearch they are a natural fit; otherwise, they constrain future optionality.

OpenTelemetry Collector is increasingly the standard for unified observability pipelines: logs, traces, and metrics through a single agent. Quickwit has native OTLP support (gRPC and HTTP). ClickHouse has a community OTLP exporter. For greenfield deployments where structured logs, traces, and metrics all need to land in the same store, OTel Collector is worth evaluating seriously.

Kafka as a decoupling buffer

For high ingest rates or when multiple consumers need the same log stream (archiving to ClickHouse and real-time alerting via another system), Kafka as an intermediate buffer is a common pattern:

Agent → Kafka ──→ ClickHouse  (Kafka table engine, no separate consumer needed)
              ├──→ Loki        (Kafka consumer)
              └──→ ES/OS       (Kafka Connect)

ClickHouse has an advantage here: the Kafka table engine is built into the binary, so ClickHouse pulls directly from Kafka topics via a Materialized View, without a separate consumer service.
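
A rough sketch of that wiring; broker, topic, and table names are placeholders, and the target logs table is assumed to already exist:

# Hedged sketch: ClickHouse consuming a Kafka topic without a separate consumer.
clickhouse-client --multiquery <<'SQL'
-- Kafka engine table: a live view onto the topic, nothing is stored here
CREATE TABLE logs_kafka (raw String) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'logs',
         kafka_group_name  = 'clickhouse-logs',
         kafka_format      = 'JSONAsString';

-- Materialized view: continuously moves consumed rows into the MergeTree table
CREATE MATERIALIZED VIEW logs_kafka_mv TO logs AS
SELECT now64() AS timestamp, '' AS level, raw AS message
FROM logs_kafka;
SQL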

Kubernetes deployment pattern

The standard approach on Kubernetes is a DaemonSet (one agent pod per node) reading container logs from /var/log/pods/ as written by containerd. This requires no changes to application pods. A sidecar agent (one per application pod) is only warranted when pods need independent pipelines or different credentials.

Suggestion for a new product: Vector as DaemonSet → ClickHouse. Vector's clickhouse sink writes directly via the HTTP interface, supports schema mapping in VRL, and requires no Kafka layer for moderate ingest rates.
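
A minimal sketch of that path, assuming the stock kubernetes_logs source and Vector's clickhouse sink; endpoint, database, and table names are placeholders, and authentication and batch tuning are omitted:

# Hedged sketch: DaemonSet Vector shipping container logs straight to ClickHouse.
cat > vector.yaml <<'EOF'
sources:
  k8s_logs:
    type: kubernetes_logs             # reads /var/log/pods on each node

sinks:
  clickhouse:
    type: clickhouse
    inputs: [k8s_logs]
    endpoint: http://clickhouse:8123  # ClickHouse HTTP interface
    database: logs
    table: otel_logs
    skip_unknown_fields: true
EOF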

Log Format Standards: ECS vs. OpenTelemetry

Two field naming conventions dominate the log ecosystem. Committing to one at deployment time avoids silent query failures and broken dashboards years later when teams rotate and nobody remembers why log.level and severity_text coexist in the same pipeline.

Elastic Common Schema (ECS)

  • Key fields: @timestamp, message, log.level, host.name, service.name, trace.id, span.id
  • Native output of: Elastic Agent, Filebeat, Metricbeat, Auditbeat
  • Kibana and OpenSearch Dashboards built-in visualizations assume ECS field names

OpenTelemetry Semantic Conventions

  • Key fields: timestamp (not @timestamp), body (not message), severity_text (not log.level), resource.attributes.service.name (nested), trace_id, span_id (underscored, not dotted)
  • Native output of: OpenTelemetry Collector, Grafana Alloy
  • Quickwit OTLP ingestion and the official ClickHouse OTel table schema use OTel column names

The practical problem

Mixing conventions in the same pipeline causes invisible query failures: a dashboard filtered on log.level = "error" returns no results when data was shipped with severity_text. Vector VRL can rename fields in-flight, but someone must set this up and maintain it as agents and pipelines evolve over 7+ years.

| | ES / OpenSearch | Loki | Quickwit | ClickHouse |
|---|---|---|---|---|
| Native convention | ECS | Label-based (no fixed schema) | OTel Semantic Conventions | Flexible; official OTel schema uses OTel conventions |

For new deployments: choose one convention before ingesting the first log line.

  • Existing Elastic/OpenSearch stack → ECS: changing convention later requires reprocessing existing data or maintaining dual-path queries.
  • Greenfield OTel-first → OpenTelemetry Semantic Conventions: forward-compatible with the CNCF ecosystem, natively supported by Quickwit and ClickHouse's OTel schema.
  • Migrating from ES to ClickHouse: transform fields at ingest time via Vector VRL or the OTel Collector transform processor (a sketch follows this list). Define the target convention before migration; do not carry ECS field names into a new ClickHouse schema without an explicit mapping.
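
A minimal sketch of such a mapping as a Vector VRL transform, limited to the example fields from this section; the transform and source names are placeholders, and the sink's inputs would then point at this transform rather than the raw source:

# Hedged sketch: rename ECS fields to OTel conventions in-flight.
cat >> vector.yaml <<'EOF'
transforms:
  ecs_to_otel:
    type: remap
    inputs: [k8s_logs]
    source: |
      .severity_text = del(.log.level)   # ECS log.level -> OTel severity_text
      .body          = del(.message)     # ECS message   -> OTel body
      .trace_id      = del(.trace.id)    # ECS trace.id  -> OTel trace_id
EOF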

Operational Effort

Initial setup is a one-time cost. Day-2 operations (upgrades, scaling, backup, failure recovery) recur across a 7+ year lifetime. The table below compares the key dimensions for both Kubernetes and bare-metal deployments.

| | Elasticsearch | OpenSearch | Loki | Quickwit | ClickHouse |
|---|---|---|---|---|---|
| K8s operator | ECK (official, very mature) | OpenSearch Operator (community) | Helm chart (no CRDs needed) | Community Helm chart | Altinity clickhouse-operator (mature) |
| Rolling upgrades | ✓ ECK-managed | ✓ operator-managed | ✓ stateless roles | ✓ stateless (trivial) | ✓ operator-managed |
| Upgrade path constraint | One major at a time | One major at a time | Minor-safe; re-index on major | No migration on binary upgrade | One major at a time |
| Backup strategy | Snapshot API → S3 | Snapshot API → S3 | S3 is primary (bucket-level) | S3 is primary (bucket-level) | clickhouse-backup + S3 |
| Horizontal scale | Medium (shard rebalancing) | Medium (shard rebalancing) | Easy (stateless) | Easy (stateless) | Medium (topology change) |
| Bare-metal packaging | deb/rpm + systemd | deb/rpm + systemd | Single binary + systemd | Single binary + systemd | deb/rpm + systemd |
| Main Day-2 pain point | ILM policy tuning, shard sizing | Same as ES, less mature tooling | Compactor singleton, caching | PostgreSQL HA maintenance | Merge queue, mutation ops |

Elasticsearch

Kubernetes: ECK automates rolling upgrades, TLS certificate rotation, keystore secret injection, and node lifecycle management. Adding data nodes triggers automatic shard rebalancing; on multi-TB indices this can take hours and should be scheduled as a maintenance window. ECK is widely regarded as the most mature Kubernetes operator of the five systems compared here.

Bare metal: Official deb/rpm packages with a systemd unit. Upgrades follow a strict major-version ladder (7 → 8 → 9; no skipping). ILM policy tuning is an ongoing task: shard count mismatches between hot and cold tiers are the most common source of long-term over-allocation. Snapshots to S3/GCS/Azure Blob are the backup mechanism, mature and reliable, though full-cluster restores from snapshots are slow on large data sets.

OpenSearch

Kubernetes: Operationally nearly identical to Elasticsearch. The OpenSearch Operator is community-maintained and covers the standard lifecycle, but has fewer automated recovery paths than ECK. Helm-chart deployments without an operator are common in smaller setups.

Bare metal: Package management and upgrade procedures mirror Elasticsearch. Index State Management (ISM) is the OpenSearch equivalent of ILM and is entirely free; no subscription is required for any ILM-equivalent feature.

Loki

Kubernetes: Stateless write / read / backend roles make operations straightforward; a kubectl rollout restart on any role is safe without data loss. Chunk data lives in object storage rather than on node-local volumes, so persistent volume management stays minimal. Two recurring operational constraints:

  • Compactor singleton: must run as exactly one instance at a time. Simultaneous Compactor replicas corrupt the index. The Helm chart enforces this with replicas: 1; leader-election HA is available but adds complexity.
  • Caching: Memcached or Redis for chunk and query-result caches are optional but strongly recommended for production query performance, at the cost of another stateful component to maintain.

Bare metal: Single binary with a YAML config file, easy to run via systemd. The configuration surface is large, but the simple-scalable mode covers most production cases without deep parameter tuning.

Quickwit

Kubernetes: The stateless architecture makes this the operationally lightest of the five: Indexers and Searchers carry no local state, so scaling up or down requires no rebalancing. Node replacement is transparent.

The critical stateful dependency is PostgreSQL (metastore). It requires its own HA story (Patroni, RDS Multi-AZ, CloudSQL HA, etc.) and must be backed up and upgraded independently of Quickwit. On Kubernetes this means running and maintaining a second stateful workload alongside Quickwit itself.

Bare metal: Two services to manage: the Quickwit binary and PostgreSQL. The Quickwit binary itself is stateless and trivial to update; PostgreSQL major-version upgrades are the main planned-maintenance item.

ClickHouse

Kubernetes: The Altinity clickhouse-operator manages the full lifecycle: shard/replica topology, rolling upgrades, and Keeper configuration. For rolling upgrades the Keeper quorum must remain stable, so Keeper nodes are upgraded first and the ClickHouse server nodes follow.

Ongoing operational concerns:

  • Merge queue: MergeTree merges parts in the background. If the ingest rate exceeds merge throughput, the part count per partition grows, slowing queries and eventually throttling inserts. Monitor system.merges and system.part_log for backlog (a query sketch follows this list).
  • Data deletion: prefer ALTER TABLE DROP PARTITION or TTL rules over ALTER TABLE DELETE. Mutations are asynchronous, rewrite data, and are resource-intensive; they should be avoided on high-volume log tables.
  • Backup: clickhouse-backup is the standard open-source tool for snapshot + S3 upload. Unlike Elasticsearch, ClickHouse has no built-in snapshot API; backup scheduling and restore testing must be set up explicitly.
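
As referenced above, two quick backlog checks might look like this; the limit and the choice of tables are illustrative:

# Hedged sketch: spot-check merge backlog via the system tables.
clickhouse-client --query "
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10"

clickhouse-client --query "SELECT count() AS running_merges FROM system.merges"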

Bare metal: Official deb/rpm packages with a systemd unit. ClickHouse upgrades are generally safe one major at a time; Keeper state does not require migration between versions.

Backup and Disaster Recovery

For a 7+ year archive, backup is a first-class operational requirement. The five systems take fundamentally different approaches depending on whether their primary store is local disk or object storage.

| | Elasticsearch | OpenSearch | Loki | Quickwit | ClickHouse |
|---|---|---|---|---|---|
| Backup mechanism | Snapshot API → S3 (SLM, free) | Snapshot API → S3 (SM, free) | S3 bucket replication / versioning | S3 replication + PostgreSQL WAL archiving | clickhouse-backup → S3 |
| Automation | SLM policies (Kibana) | SM policies (OpenSearch) | S3 lifecycle / cross-region replication | pgBackRest / pg_basebackup + S3 policies | CronJob / operator CRD |
| RPO | Snapshot interval (typically hourly) | Same as ES | Near-zero (S3 durability) | Near-zero for data; PostgreSQL RPO for metadata | Backup interval (hourly incremental) |
| RTO | Hours (large index restore) | Hours | Seconds (redeploy, repoint to same bucket) | Seconds for data + PostgreSQL restore time | Hours (hot-tier re-download); fast for cold |
| Frozen / cold tier | Already in S3 (no separate backup needed) | Already in S3 | All data in S3 | All data in S3 | Cold parts already in S3 |
| Tooling maturity | Very high (Snapshot API stable 10+ years) | High | N/A (S3 native) | Medium | Medium (clickhouse-backup is 3rd party) |

Elasticsearch and OpenSearch

The Snapshot API is the standard backup mechanism for both. Snapshots are incremental: only new or changed Lucene segments are uploaded per run. Restore from a named snapshot re-opens the index without a data rebuild.

  • SLM (Snapshot Lifecycle Management) in ES and SM (Snapshot Management) in OpenSearch automate scheduled snapshots with configurable retention; both are free. A minimal policy sketch follows this list.
  • Frozen tier indices are stored in S3 (Searchable Snapshots) and are their own backup. Only hot and warm tier indices require explicit snapshot policies.
  • RTO caveat: restoring a large hot-tier index can take hours on multi-TB data. Restoration speed scales with the number of data nodes downloading in parallel.
  • ECK / OpenSearch Operator: configure snapshot repositories via CRD; pair with SLM/SM policies in Kibana/OpenSearch Dashboards for automated scheduling.
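
A minimal sketch of the SLM path via the Elasticsearch REST API; repository name, bucket, schedule, and retention are placeholders, and OpenSearch exposes the equivalent through its Snapshot Management API:

# Hedged sketch: register an S3 snapshot repository, then a daily SLM policy.
curl -X PUT "https://es.internal:9200/_snapshot/logs_s3" \
  -H 'Content-Type: application/json' \
  -d '{ "type": "s3", "settings": { "bucket": "log-archive-snapshots" } }'

curl -X PUT "https://es.internal:9200/_slm/policy/nightly-logs" \
  -H 'Content-Type: application/json' \
  -d '{
        "schedule": "0 30 1 * * ?",
        "name": "<nightly-logs-{now/d}>",
        "repository": "logs_s3",
        "config": { "indices": ["logs-*"] },
        "retention": { "expire_after": "90d", "min_count": 7 }
      }'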

Loki

Loki's architecture makes backup near-trivial: all data lives in object storage from the moment chunks are flushed (typically within minutes of ingestion). There is no local data to back up; the S3 bucket is simultaneously the primary store and the archive.

Recommended protections:

  • S3 versioning on the Loki bucket: recovers from accidental deletion or compaction bugs
  • Async Replication to a second bucket (ideally in a different region) for disaster recovery: the only copy of the data is in object storage, so a region-level failure without replication means data loss
  • Compactor persistent volume: holds small state files (marker files for compaction progress, a few MB). Back up with Velero on Kubernetes or a cron rsync on bare metal.

On OVH: Async Replication is available for cross-region DR. Key requirements and caveats:

  • Versioning must be enabled on both source and destination buckets
  • Source and destination must be in the same OVH Public Cloud project (cross-project replication is not supported)
  • Objects existing before the replication rule was configured require a separate Batch Replication job β€” replication only covers new objects by default
  • Delete markers are not replicated by default (prevents accidental deletion propagating to the replica); enable DeleteMarkerReplication: Enabled only if you want deletes to propagate
  • No latency SLA β€” replication is asynchronous with no guaranteed RPO

RTO is near-zero: deploy new Loki pods pointed at the same (or replica) S3 bucket and the cluster is immediately operational. No index rebuild, no data restore required.
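
The versioning prerequisite from the list above can be set with the standard aws CLI against an S3-compatible OVH endpoint; in the sketch below the bucket name, region, and endpoint URL are placeholders:

# Hedged sketch: enable versioning on the Loki bucket (repeat for the replica bucket).
aws s3api put-bucket-versioning \
  --bucket loki-chunks \
  --versioning-configuration Status=Enabled \
  --endpoint-url https://s3.gra.io.cloud.ovh.net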

Quickwit

Like Loki, Quickwit stores all split data in object storage, so S3 replication covers the data. The same OVH Async Replication caveats apply (versioning required, same project, Batch Replication job for existing objects). The additional dependency is the PostgreSQL metastore, which requires its own backup strategy:

  • pg_dump for simple setups (RPO = dump interval); a minimal sketch follows this list
  • WAL archiving (pgBackRest, pg_basebackup) for near-zero RPO in production
  • Metastore and S3 data must be backed up in a consistent state: a metastore snapshot that references splits not yet in S3 (or vice versa) causes index inconsistencies on restore. Test restore procedures end-to-end, not just individual component backups.
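
A minimal sketch of the pg_dump option from the list above; the database name and backup path are placeholders:

# Hedged sketch: nightly logical dump of the Quickwit metastore database.
pg_dump --format=custom \
  --file=/backups/quickwit-metastore-$(date +%Y%m%d).dump \
  quickwit_metastore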

ClickHouse

ClickHouse has no built-in snapshot API; backup requires an external tool.

clickhouse-backup (open-source, maintained by Altinity) is the de-facto standard:

clickhouse-backup create my-backup-$(date +%Y%m%d)
clickhouse-backup upload  my-backup-$(date +%Y%m%d)
  • Incremental mode: only uploads parts absent from the previous backup
  • Targets: S3, GCS, Azure Blob, local filesystem
  • Kubernetes: run as a CronJob or via the ClickHouseBackup CRD (Altinity operator)
  • Restore: clickhouse-backup download + restore re-attaches frozen parts to the table without rewriting data

S3 cold-tier parts (already moved by TTL) are inherently protected if the S3 bucket has versioning or cross-region replication; clickhouse-backup does not re-upload them.

Avoid filesystem snapshots for ClickHouse: parts in mid-merge at snapshot time produce an inconsistent state on restore.

Test your restores

ClickHouse restore tooling is less battle-tested than the Elasticsearch Snapshot API. Run restore drills on a separate cluster before you need to recover from an actual incident. A backup that has never been tested is not a backup.
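
A restore drill on a scratch cluster might look like the following sketch; command names are as in recent clickhouse-backup releases, and the backup name and table are placeholders:

# Hedged sketch: verify that a remote backup can actually be restored.
clickhouse-backup list remote                         # what exists in S3?
clickhouse-backup download my-backup-20250101
clickhouse-backup restore  my-backup-20250101
clickhouse-client --query "SELECT count() FROM logs"  # sanity-check row counts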

Hidden Cost: S3 API Request Fees

Object storage is priced on two axes: storage per GB/month and API requests per thousand operations. Storage costs are predictable and grow linearly. Request fees are invisible until a query pattern or a misconfigured cache triggers thousands of GET requests against cold data.

This is primarily relevant for AWS S3 ($0.0004 per 1,000 GET requests) and GCP Cloud Storage ($0.004 per 10,000 operations). OVH Object Storage has no per-request fees (storage cost only), which eliminates this concern entirely for OVH deployments.

| | Cold data scan: GET requests generated | Mitigation |
|---|---|---|
| Elasticsearch Frozen Tier | Many GETs per query (one per Lucene segment file) | Shard request cache; limit unscoped frozen queries |
| OpenSearch Frozen Tier | Same as ES | Same |
| Loki | One GET per chunk read | Chunk cache (Memcached / Redis) |
| Quickwit | Multiple GETs per split accessed | Split cache; tag_fields pruning skips splits entirely |
| ClickHouse cold tier | One GET per part file accessed | Mark cache + skip indexes prune parts before read |

At AWS S3 pricing, a query scanning 10,000 cold objects costs ~$0.004. At 1,000 such queries per day over a year, that is ~$1,500 in request fees: not dominant, but not zero.

Mitigations common to all systems:

  • Enable local read caches (ES shard cache, Quickwit split cache, ClickHouse mark cache) to avoid re-fetching unchanged cold objects
  • Always scope queries with a time range to prevent unbounded scans across the full 7-year archive (a query sketch follows this list)
  • Batch investigative queries: run a handful of broad queries sequentially rather than many interactive point queries against the same cold time window
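
As referenced in the list above, scoping by time is a single predicate; the table, column names, and interval below are placeholders:

# Hedged sketch: a time-scoped predicate lets partition pruning skip cold parts entirely.
clickhouse-client --query "
SELECT count() FROM logs
WHERE timestamp >= now() - INTERVAL 7 DAY
  AND message LIKE '%timeout%'"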

Observability

A system retained for 7+ years must be monitorable across software versions, operator rotations, and tooling changes. The five systems differ substantially here.

| | Elasticsearch | OpenSearch | Loki | Quickwit | ClickHouse |
|---|---|---|---|---|---|
| Metrics format | Prometheus (via exporter) or Metricbeat | Prometheus (official plugin) or Performance Analyzer | Prometheus (native /metrics) | Prometheus (native /metrics) | Prometheus (native /metrics since 20.1) |
| Built-in dashboards | Kibana Stack Monitoring | OpenSearch Dashboards monitoring | Grafana Loki Mixin (ships with kube-prometheus-stack) | Grafana dashboards (community) | Official Grafana plugin + community dashboards |
| Self-introspection | /_cluster/health, /_nodes/stats REST API | same as ES + Performance Analyzer REST | per-component HTTP endpoints | /api/v1/cluster REST, per-component /metrics | system.* tables (SQL-queryable) |
| Tracing support | APM integration | - | Jaeger / Tempo (components emit traces) | OTel-native (ingests and emits) | - |
| Maturity | Very high | High | High (within Grafana stack) | Medium (younger ecosystem) | Very high (database self-monitoring) |

Elasticsearch

Exposes a rich REST monitoring API (/_cluster/health, /_nodes/stats, /_cat/*). Kibana Stack Monitoring provides ready-made dashboards. Prometheus scraping requires the community prometheus-elasticsearch-exporter or Metricbeat. The key operational signals to track are JVM heap usage and GC pause times; heap pressure and GC storms are the most common root cause of ES degradation. Very mature; the Elastic stack has been operated in production for 15+ years.
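
A quick heap spot check without Kibana might look like the sketch below; the host is a placeholder and jq is assumed to be available:

# Hedged sketch: per-node JVM heap usage from the nodes stats API.
curl -s "https://es.internal:9200/_nodes/stats/jvm" \
  | jq '.nodes[] | {name, heap_used_percent: .jvm.mem.heap_used_percent}'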

OpenSearch

Ships the Performance Analyzer plugin (enabled by default), which exposes per-node metrics via a dedicated REST port. An official Prometheus exporter plugin is available. The monitoring story is similar to Elasticsearch, though some community tooling (exporters, Grafana dashboards) is less comprehensive due to the younger ecosystem.

Loki

All components natively expose Prometheus /metrics. The Grafana Loki Mixin (jsonnet-generated Grafana dashboards and Prometheus alerting rules) ships in the repository and is included in kube-prometheus-stack, giving pre-built production dashboards out of the box. Loki can be used to store its own component logs (dogfooding within the Grafana LGTM stack). All components have built-in tracing instrumentation for Jaeger or Grafana Tempo.

Quickwit

Native Prometheus /metrics on all components. Grafana dashboards are available but less comprehensive than the Loki Mixin. The REST API at /api/v1/cluster and /api/v1/indexing/plan provides cluster state inspection. As an OpenTelemetry-native system, Quickwit can both ingest and emit OTel signals, which makes for a coherent observability story if you are already running an OTel Collector pipeline.

ClickHouse

The standout feature is the system.* tables. ClickHouse records all operational data in persistent, SQL-queryable tables on the cluster itself:

  • system.query_log - every query: execution time, memory consumed, bytes read, query text
  • system.part_log - every merge, insert, and split of MergeTree data parts
  • system.metric_log - CPU, memory, and disk I/O sampled over time
  • system.asynchronous_metric_log, system.crash_log, and more

Questions like "which query consumed the most memory last week?" or "how many merges ran during yesterday's S3 upload spike?" are answerable with a SQL query; no external dashboard is required. Native Prometheus /metrics and the official Grafana ClickHouse datasource plugin provide standard dashboard integration on top of this.
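
The first of those questions, written out as a sketch; it assumes the default query_log settings, and the time window and limit are illustrative:

# Hedged sketch: top memory consumers over the last week.
clickhouse-client --query "
SELECT
    formatReadableSize(memory_usage) AS peak_memory,
    query_duration_ms,
    substring(query, 1, 80)          AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time >= now() - INTERVAL 7 DAY
ORDER BY memory_usage DESC
LIMIT 5"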

Alerting Integration

For a log archive, alerting on log content is often secondary to retention and search. But when it matters (detecting error patterns, rate anomalies, or security events), the five systems have very different built-in capabilities.

| | Elasticsearch | OpenSearch | Loki | Quickwit | ClickHouse |
|---|---|---|---|---|---|
| Native alerting | ✓ Watcher + Detection Rules (free) | ✓ Alerting plugin (free) | ✓ Ruler (Prometheus-compatible) | ✗ | ✗ |
| Prometheus Alertmanager | Via rules + ES Prometheus exporter | Via OpenSearch Prometheus plugin | ✓ native (LogQL → Alertmanager) | Via external Prometheus queries | Via Prometheus CH exporter |
| Grafana Alerting | ✓ | ✓ | ✓ native | ✓ community datasource | ✓ official plugin |
| Alert on log content | ✓ query-based rules | ✓ SQL / PPL monitor queries | ✓ LogQL count/rate expressions | External polling only | Grafana or scheduled SQL |

Elasticsearch Watcher supports query-based alert rules (e.g. "fire if error rate exceeds threshold in the last 5 minutes") with HTTP, email, Slack, and PagerDuty actions. Kibana Detection Rules provide a higher-level UI. Both are free in the basic tier.

OpenSearch Alerting plugin (free) adds SQL and PPL (Piped Processing Language) monitor queries in addition to ES-style query monitors. Destinations include Slack, webhook, and SNS.

Loki Ruler converts LogQL expressions into Prometheus recording and alerting rules sent to Alertmanager, which is seamless if you already run kube-prometheus-stack.

ClickHouse and Quickwit have no native alerting engine. The standard pattern is Grafana Alerting: attach an alert rule to a Grafana panel backed by a ClickHouse SQL query. For custom logic, a Kubernetes CronJob running clickhouse-client -q "SELECT ..." is lightweight and maintainable without additional infrastructure.
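
A sketch of that CronJob body; the threshold, table, and webhook URL are placeholders, and the payload shape matches Slack-compatible webhooks:

# Hedged sketch: poll ClickHouse on a schedule and notify a webhook on breach.
errors=$(clickhouse-client --query "
  SELECT count() FROM logs
  WHERE level = 'error' AND timestamp >= now() - INTERVAL 5 MINUTE")

if [ "$errors" -gt 100 ]; then
  curl -s -X POST "$ALERT_WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "{\"text\": \"Log error rate high: $errors errors in the last 5 minutes\"}"
fi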

FAQ

How do I back up ClickHouse reliably in Kubernetes?

The clickhouse-backup tool (open-source, maintained by Altinity) supports incremental backups to S3/GCS/Azure Blob. On Kubernetes it runs as a CronJob or via the ClickHouseBackup CRD when using the Altinity clickhouse-operator. Unlike Elasticsearch's well-tested Snapshot API, ClickHouse backup tooling requires explicit setup: schedule regular backups and test restores proactively rather than discovering gaps during an incident.

Can I migrate existing log data from Elasticsearch to ClickHouse?

There is no automated migration path; the data models are fundamentally different (document store with Lucene inverted index vs. columnar MergeTree). Migration requires: (1) defining a ClickHouse schema mapping ES fields to typed columns, (2) exporting data from Elasticsearch via the Scroll API or snapshot-to-Parquet tooling, (3) bulk-inserting into ClickHouse. For a 7+ year archive, a parallel-run approach (routing new data to ClickHouse while ES serves historical queries) is more practical than a full bulk migration.


Part 1: Comparison, storage model, compression, resource consumption, and SaaS options

Part 3: Security & Compliance - encryption, RBAC, and WORM compliance