
Elasticsearch vs. OpenSearch vs. Loki vs. Quickwit vs. ClickHouse: Operations
Table of contents
Cluster setup, ingest pipelines, and backup & disaster recovery differ significantly across Elasticsearch, OpenSearch, Grafana Loki, Quickwit, and ClickHouse β and these differences often matter more than raw performance when a log archive must run reliably for 7+ years.
This is Part 2 of a three-part series. Part 1 covers the technical comparison: storage model, object storage tiering, compression, resource consumption, query languages, and SaaS options. This part covers cluster setup and HA requirements, ingest pipeline configuration, Day-2 operations on Kubernetes and bare metal, backup & disaster recovery, hidden object storage costs, observability, and alerting integration. Part 3 covers encryption, access control, and WORM immutability for compliance archives.
Setup Requirementsπ
Running a log archive in production means maintaining the system for years. The setup complexity and HA requirements vary significantly across the five candidates.
| Elasticsearch | OpenSearch | Loki | Quickwit | ClickHouse | |
|---|---|---|---|---|---|
| Min. nodes for HA | 3 master-eligible + 2 data (5+ total) | same as ES | 3+ (write / read / backend roles) | 2+ indexers/searchers + HA PostgreSQL | 3 nodes (ClickHouse server + Keeper co-located) |
| External coordination | none (built-in Raft since 7.x) | none (built-in Raft) | none | PostgreSQL (for HA metastore) | ClickHouse Keeper (built-in, 3-node quorum) |
| Object storage dependency | optional (required for Frozen Tier) | optional (required for Frozen Tier) | required (primary store) | required (primary store) | optional (required for cold tier) |
| Kubernetes operator | ECK (official, very mature) | OpenSearch Operator (less mature) | Helm chart (grafana/loki, well-maintained) | Helm chart (community) | clickhouse-operator (Altinity, mature) |
| Setup complexity | High (many node roles at scale) | High | Medium | Medium | Medium |
Elasticsearch / OpenSearchπ
Minimum HA requires 3 master-eligible nodes for quorum to prevent split-brain. At scale, dedicated master, data, coordinating, and ingest node roles are common, growing the cluster to 7+ nodes. No external coordination service is needed β both dropped ZooKeeper in favour of a built-in Raft implementation in version 7.x.
ECK (Elastic Cloud on Kubernetes) is the de-facto Kubernetes operator for Elasticsearch: official, very mature, handles rolling upgrades and keystore management. The OpenSearch Operator is functional but less battle-tested.
Lokiπ
Loki runs in three modes: monolithic (development only), Simple Scalable (3 roles: write / read / backend), and full microservices. Simple Scalable with 2+ replicas per role is the recommended production configuration. Object storage is required from day one β there is no local-only production mode. The Compactor must run as a singleton; leader-election HA is possible but adds operational complexity. No external coordination service is needed.
Quickwitπ
Stateless Indexers and Searchers make horizontal scaling straightforward. The operational complexity concentrates in the Metastore: in development mode a file on S3 is sufficient, but for production HA it requires PostgreSQL (e.g. RDS, CloudSQL, or self-hosted Patroni). Note that ClickHouse itself is a database and brings its own cluster requirements (Keeper + ReplicatedMergeTree); Quickwit adds PostgreSQL on top of that β a second separate database to deploy and maintain alongside the log store itself.
ClickHouseπ
HA in ClickHouse requires two mechanisms working together:
- ReplicatedMergeTree table engine: data is replicated between ClickHouse server nodes.
- ClickHouse Keeper (built-in ZooKeeper replacement, stable since 22.x): coordinates replicated table state and requires a 3-node quorum.
The minimum production HA setup is 3 nodes, each running ClickHouse server and ClickHouse Keeper co-located. For larger clusters Keeper is moved to 3 dedicated nodes. The clickhouse-operator (Altinity) handles this topology well on Kubernetes. ZooKeeper is no longer required β Keeper is fully self-contained in the ClickHouse binary.
Ingest: Getting Data Inπ
Choosing a log store also means choosing how data arrives there. The two variables are the shipping agent (runs on the source node or as a Kubernetes DaemonSet) and whether an intermediate buffer like Kafka is used.
Agent support matrixπ
| Agent | Elasticsearch / OpenSearch | Loki | Quickwit | ClickHouse |
|---|---|---|---|---|
| Vector | β native sink | β native sink | β HTTP sink | β native sink |
| Fluent Bit | β native | β native | β HTTP output | β HTTP output |
| Fluentd | β native | β plugin | β HTTP plugin | β plugin |
| Filebeat / Elastic Agent | β native | β | β | β |
| Promtail | β | β native | β | β |
| Grafana Alloy | β | β native | β OTLP | β OTLP |
| OpenTelemetry Collector | β exporter | β exporter | β OTLP native | β community exporter |
| Kafka (as source) | β Kafka Connector | β consumer | β native Kafka source | β Kafka table engine |
Key agentsπ
Vector is the most versatile choice β it supports all five systems, is written in Rust (low CPU and memory overhead), and transforms events in-flight via VRL (Vector Remap Language). For a new product without legacy constraints, Vector as a Kubernetes DaemonSet is the most straightforward path.
Fluent Bit is lighter still (C binary, ~1 MB), and the default DaemonSet agent in many Kubernetes distributions. It covers ES/OpenSearch, Loki, and ClickHouse well; its transformation capabilities are less expressive than VRL.
Filebeat and Elastic Agent are tightly coupled to the Elastic ecosystem. They are not usable with Loki, Quickwit, or ClickHouse. If you are already committed to Elasticsearch or OpenSearch they are a natural fit; otherwise, they constrain future optionality.
OpenTelemetry Collector is increasingly the standard for unified observability pipelines β logs, traces, and metrics through a single agent. Quickwit has native OTLP support (gRPC and HTTP). ClickHouse has a community OTLP exporter. For greenfield deployments where structured logs, traces, and metrics all need to land in the same store, OTel Collector is worth evaluating seriously.
Kafka as a decoupling bufferπ
For high ingest rates or when multiple consumers need the same log stream (archiving to ClickHouse and real-time alerting via another system), Kafka as an intermediate buffer is a common pattern:
Agent β Kafka βββ ClickHouse (Kafka table engine, no separate consumer needed)
ββββ Loki (Kafka consumer)
ββββ ES/OS (Kafka Connect)ClickHouse has an advantage here: the Kafka table engine is built into the binary β ClickHouse pulls directly from Kafka topics via a Materialized View, without a separate consumer service.
Kubernetes deployment patternπ
The standard approach on Kubernetes is a DaemonSet (one agent pod per node) reading container logs from /var/log/pods/ as written by containerd. This requires no changes to application pods. A sidecar agent (one per application pod) is only warranted when pods need independent pipelines or different credentials.
Suggestion for a new product: Vector as DaemonSet β ClickHouse. Vectorβs clickhouse sink writes directly via the HTTP interface, supports schema mapping in VRL, and requires no Kafka layer for moderate ingest rates.
Log Format Standards: ECS vs. OpenTelemetryπ
Two field naming conventions dominate the log ecosystem. Committing to one at deployment time avoids silent query failures and broken dashboards years later when teams rotate and nobody remembers why log.level and severity_text coexist in the same pipeline.
Elastic Common Schema (ECS)π
- Key fields:
@timestamp,message,log.level,host.name,service.name,trace.id,span.id - Native output of: Elastic Agent, Filebeat, Metricbeat, Auditbeat
- Kibana and OpenSearch Dashboards built-in visualizations assume ECS field names
OpenTelemetry Semantic Conventionsπ
- Key fields:
timestamp(not@timestamp),body(notmessage),severity_text(notlog.level),resource.attributes.service.name(nested),trace_id,span_id(underscored, not dotted) - Native output of: OpenTelemetry Collector, Grafana Alloy
- Quickwit OTLP ingestion and the official ClickHouse OTel table schema use OTel column names
The practical problemπ
Mixing conventions in the same pipeline causes invisible query failures: a dashboard filtered on log.level = "error" returns no results when data was shipped with severity_text. Vector VRL can rename fields in-flight, but someone must set this up and maintain it as agents and pipelines evolve over 7+ years.
| ES / OpenSearch | Loki | Quickwit | ClickHouse | |
|---|---|---|---|---|
| Native convention | ECS | Label-based (no fixed schema) | OTel Semantic Conventions | Flexible; official OTel schema uses OTel conventions |
For new deployments: choose one convention before ingesting the first log line.
- Existing Elastic/OpenSearch stack β ECS: changing convention later requires reprocessing existing data or maintaining dual-path queries.
- Greenfield OTel-first β OpenTelemetry Semantic Conventions: forward-compatible with the CNCF ecosystem, natively supported by Quickwit and ClickHouseβs OTel schema.
- Migrating from ES to ClickHouse: transform fields at ingest time via Vector VRL or OTel Collector transform processor. Define the target convention before migration β do not carry ECS field names into a new ClickHouse schema without an explicit mapping.
Operational Effortπ
Initial setup is a one-time cost. Day-2 operations β upgrades, scaling, backup, failure recovery β recur across a 7+ year lifetime. The table below compares the key dimensions for both Kubernetes and bare-metal deployments.
| Elasticsearch | OpenSearch | Loki | Quickwit | ClickHouse | |
|---|---|---|---|---|---|
| K8s operator | ECK (official, very mature) | OpenSearch Operator (community) | Helm chart (no CRDs needed) | Community Helm chart | Altinity clickhouse-operator (mature) |
| Rolling upgrades | β ECK-managed | β operator-managed | β stateless roles | β stateless (trivial) | β operator-managed |
| Upgrade path constraint | One major at a time | One major at a time | Minor-safe; re-index on major | No migration on binary upgrade | One major at a time |
| Backup strategy | Snapshot API β S3 | Snapshot API β S3 | S3 is primary (bucket-level) | S3 is primary (bucket-level) | clickhouse-backup + S3 |
| Horizontal scale | Medium (shard rebalancing) | Medium (shard rebalancing) | Easy (stateless) | Easy (stateless) | Medium (topology change) |
| Bare-metal packaging | deb/rpm + systemd | deb/rpm + systemd | Single binary + systemd | Single binary + systemd | deb/rpm + systemd |
| Main Day-2 pain point | ILM policy tuning, shard sizing | Same as ES, less mature tooling | Compactor singleton, caching | PostgreSQL HA maintenance | Merge queue, mutation ops |
Elasticsearchπ
Kubernetes: ECK automates rolling upgrades, TLS certificate rotation, keystore secret injection, and node lifecycle management. Adding data nodes triggers automatic shard rebalancing β on multi-TB indices this can take hours and should be scheduled as a maintenance window. ECK is widely regarded as the most mature Kubernetes operator of the five systems compared here.
Bare metal: Official deb/rpm packages with a systemd unit. Upgrades follow a strict major-version ladder (7 β 8 β 9; no skipping). ILM policy tuning is an ongoing task: shard count mismatches between hot and cold tiers are the most common source of long-term over-allocation. Snapshots to S3/GCS/Azure Blob are the backup mechanism β mature and reliable, though full-cluster restores from snapshots are slow on large data sets.
OpenSearchπ
Kubernetes: Operationally nearly identical to Elasticsearch. The OpenSearch Operator is community-maintained and covers the standard lifecycle, but has fewer automated recovery paths than ECK. Helm-chart deployments without an operator are common in smaller setups.
Bare metal: Package management and upgrade procedures mirror Elasticsearch. Index State Management (ISM) is the OpenSearch equivalent of ILM and is entirely free β no subscription required for any ILM-equivalent feature.
Lokiπ
Kubernetes: Stateless write / read / backend roles make operations straightforward β kubectl rollout restart on any role is safe without data loss. No PVCs are needed for indexers; chunk data lives in object storage. Two recurring operational constraints:
- Compactor singleton: must run as exactly one instance at a time. Simultaneous Compactor replicas corrupt the index. The Helm chart enforces this with
replicas: 1; leader-election HA is available but adds complexity. - Caching: Memcached or Redis for chunk and query-result caches are optional but strongly recommended for production query performance β adding another stateful component to maintain.
Bare metal: Single binary with a YAML config file, easy to run via systemd. The configuration surface is large, but the simple-scalable mode covers most production cases without deep parameter tuning.
Quickwitπ
Kubernetes: The stateless architecture makes this the operationally lightest of the five: Indexers and Searchers carry no local state, so scaling up or down requires no rebalancing. Node replacement is transparent.
The critical stateful dependency is PostgreSQL (metastore). It requires its own HA story (Patroni, RDS Multi-AZ, CloudSQL HA, etc.) and must be backed up and upgraded independently of Quickwit. On Kubernetes this means running and maintaining a second stateful workload alongside Quickwit itself.
Bare metal: Two services to manage: the Quickwit binary and PostgreSQL. The Quickwit binary itself is stateless and trivial to update; PostgreSQL major-version upgrades are the main planned-maintenance item.
ClickHouseπ
Kubernetes: The Altinity clickhouse-operator manages the full lifecycle: shard/replica topology, rolling upgrades, and Keeper configuration. For rolling upgrades, the Keeper quorum must remain stable β Keeper nodes are upgraded first, ClickHouse server nodes follow.
Ongoing operational concerns:
- Merge queue: MergeTree merges parts in the background. If ingest rate exceeds merge throughput, the part count per partition grows, slowing queries and eventually throttling inserts. Monitor
system.mergesandsystem.part_logfor backlog. - Data deletion: prefer
ALTER TABLE DROP PARTITIONor TTL rules overALTER TABLE DELETE. Mutations are asynchronous, rewrite data, and are resource-intensive β they should be avoided on high-volume log tables. - Backup:
clickhouse-backupis the standard open-source tool for snapshot + S3 upload. Unlike Elasticsearch, ClickHouse has no built-in snapshot API; backup scheduling and restore testing must be set up explicitly.
Bare metal: Official deb/rpm packages with a systemd unit. ClickHouse upgrades are generally safe one major at a time; Keeper state does not require migration between versions.
Backup and Disaster Recoveryπ
For a 7+ year archive, backup is a first-class operational requirement. The five systems take fundamentally different approaches depending on whether their primary store is local disk or object storage.
| Elasticsearch | OpenSearch | Loki | Quickwit | ClickHouse | |
|---|---|---|---|---|---|
| Backup mechanism | Snapshot API β S3 (SLM, free) | Snapshot API β S3 (SM, free) | S3 bucket replication / versioning | S3 replication + PostgreSQL WAL archiving | clickhouse-backup β S3 |
| Automation | SLM policies (Kibana) | SM policies (OpenSearch) | S3 lifecycle / cross-region replication | pgBackRest / pg_basebackup + S3 policies | CronJob / operator CRD |
| RPO | Snapshot interval (typically hourly) | Same as ES | Near-zero (S3 durability) | Near-zero for data; PostgreSQL RPO for metadata | Backup interval (hourly incremental) |
| RTO | Hours (large index restore) | Hours | Seconds (redeploy, repoint to same bucket) | Seconds for data + PostgreSQL restore time | Hours (hot-tier re-download); fast for cold |
| Frozen / cold tier | Already in S3 (no separate backup needed) | Already in S3 | All data in S3 | All data in S3 | Cold parts already in S3 |
| Tooling maturity | Very high (Snapshot API stable 10+ years) | High | N/A (S3 native) | Medium | Medium (clickhouse-backup is 3rd party) |
Elasticsearch and OpenSearchπ
The Snapshot API is the standard backup mechanism for both. Snapshots are incremental β only new or changed Lucene segments are uploaded per run. Restore from a named snapshot re-opens the index without a data rebuild.
- SLM (Snapshot Lifecycle Management) in ES and SM (Snapshot Management) in OpenSearch automate scheduled snapshots with configurable retention β both free.
- Frozen tier indices are stored in S3 (Searchable Snapshots) and are their own backup. Only hot and warm tier indices require explicit snapshot policies.
- RTO caveat: restoring a large hot-tier index can take hours on multi-TB data. Restoration speed scales with the number of data nodes downloading in parallel.
- ECK / OpenSearch Operator: configure snapshot repositories via CRD; pair with SLM/SM policies in Kibana/OpenSearch Dashboards for automated scheduling.
Lokiπ
Lokiβs architecture makes backup near-trivial: all data lives in object storage from the moment chunks are flushed (typically within minutes of ingestion). There is no local data to back up β the S3 bucket is the primary store and the archive simultaneously.
Recommended protections:
- S3 versioning on the Loki bucket: recovers from accidental deletion or compaction bugs
- Async Replication to a second bucket (ideally in a different region) for disaster recovery β the only copy of the data is in object storage, so region-level failure without replication means data loss
- Compactor persistent volume: holds small state files (marker files for compaction progress, a few MB). Back up with Velero on Kubernetes or a cron rsync on bare metal.
On OVH: Async Replication is available for cross-region DR. Key requirements and caveats:
- Versioning must be enabled on both source and destination buckets
- Source and destination must be in the same OVH Public Cloud project (cross-project replication is not supported)
- Objects existing before the replication rule was configured require a separate Batch Replication job β replication only covers new objects by default
- Delete markers are not replicated by default (prevents accidental deletion propagating to the replica); enable
DeleteMarkerReplication: Enabledonly if you want deletes to propagate - No latency SLA β replication is asynchronous with no guaranteed RPO
RTO is near-zero: deploy new Loki pods pointed at the same (or replica) S3 bucket and the cluster is immediately operational. No index rebuild, no data restore required.
Quickwitπ
Like Loki, Quickwit stores all split data in object storage β S3 replication covers the data. The same OVH Async Replication caveats apply (versioning required, same project, Batch Replication job for existing objects). The additional dependency is the PostgreSQL metastore, which requires its own backup strategy:
- pg_dump for simple setups (RPO = dump interval)
- WAL archiving (pgBackRest, pg_basebackup) for near-zero RPO in production
- Metastore and S3 data must be backed up in a consistent state β a metastore snapshot that references splits not yet in S3 (or vice versa) causes index inconsistencies on restore. Test restore procedures end-to-end, not just individual component backups.
ClickHouseπ
ClickHouse has no built-in snapshot API β backup requires an external tool.
clickhouse-backup (open-source, maintained by Altinity) is the de-facto standard:
clickhouse-backup create my-backup-$(date +%Y%m%d)
clickhouse-backup upload my-backup-$(date +%Y%m%d)- Incremental mode: only uploads parts absent from the previous backup
- Targets: S3, GCS, Azure Blob, local filesystem
- Kubernetes: run as a CronJob or via the
ClickHouseBackupCRD (Altinity operator) - Restore:
clickhouse-backup download+restorere-attaches frozen parts to the table without rewriting data
S3 cold-tier parts (already moved by TTL) are inherently protected if the S3 bucket has versioning or cross-region replication β clickhouse-backup does not re-upload them.
Avoid filesystem snapshots for ClickHouse: parts in mid-merge at snapshot time produce an inconsistent state on restore.
ClickHouse restore tooling is less battle-tested than the Elasticsearch Snapshot API. Run restore drills on a separate cluster before you need to recover from an actual incident. A backup that has never been tested is not a backup.
Hidden Cost: S3 API Request Feesπ
Object storage is priced on two axes: storage per GB/month and API requests per thousand operations. Storage costs are predictable and grow linearly. Request fees are invisible until a query pattern or a misconfigured cache triggers thousands of GET requests against cold data.
This is primarily relevant for AWS S3 ($0.0004 per 1,000 GET requests) and GCP Cloud Storage ($0.004 per 10,000 operations). OVH Object Storage has no per-request fees β storage cost only, which eliminates this concern entirely for OVH deployments.
| Cold data scan β GET requests generated | Mitigation | |
|---|---|---|
| Elasticsearch Frozen Tier | Many GETs per query (one per Lucene segment file) | Shard request cache; limit unscoped frozen queries |
| OpenSearch Frozen Tier | Same as ES | Same |
| Loki | One GET per chunk read | Chunk cache (Memcached / Redis) |
| Quickwit | Multiple GETs per split accessed | Split cache; tag_fields pruning skips splits entirely |
| ClickHouse cold tier | One GET per part file accessed | Mark cache + skip indexes prune parts before read |
At AWS S3 pricing, a query scanning 10,000 cold objects costs ~$0.004. At 1,000 such queries per day over a year that is ~$1,500 in request fees β not dominant, but not zero.
Mitigations common to all systems:
- Enable local read caches (ES shard cache, Quickwit split cache, ClickHouse mark cache) to avoid re-fetching unchanged cold objects
- Always scope queries with a time range to prevent unbounded scans across the full 7-year archive
- Batch investigative queries: run a handful of broad queries sequentially rather than many interactive point queries against the same cold time window
Observabilityπ
A system retained for 7+ years must be monitorable across software versions, operator rotations, and tooling changes. The five systems differ substantially here.
| Elasticsearch | OpenSearch | Loki | Quickwit | ClickHouse | |
|---|---|---|---|---|---|
| Metrics format | Prometheus (via exporter) or Metricbeat | Prometheus (official plugin) or Performance Analyzer | Prometheus (native /metrics) | Prometheus (native /metrics) | Prometheus (native /metrics since 20.1) |
| Built-in dashboards | Kibana Stack Monitoring | OpenSearch Dashboards monitoring | Grafana Loki Mixin (ships with kube-prometheus-stack) | Grafana dashboards (community) | Official Grafana plugin + community dashboards |
| Self-introspection | /_cluster/health, /_nodes/stats REST API | same as ES + Performance Analyzer REST | per-component HTTP endpoints | /api/v1/cluster REST, per-component /metrics | system.* tables (SQL-queryable) |
| Tracing support | APM integration | β | Jaeger / Tempo (components emit traces) | OTel-native (ingests and emits) | β |
| Maturity | Very high | High | High (within Grafana stack) | Medium (younger ecosystem) | Very high (database self-monitoring) |
Elasticsearchπ
Exposes a rich REST monitoring API (/_cluster/health, /_nodes/stats, /_cat/*). Kibana Stack Monitoring provides ready-made dashboards. Prometheus scraping requires the community prometheus-elasticsearch-exporter or Metricbeat. The key operational signals to track are JVM heap usage and GC pause times β heap pressure and GC storms are the most common root cause of ES degradation. Very mature; the Elastic stack has been operated in production for 15+ years.
OpenSearchπ
Ships the Performance Analyzer plugin (enabled by default), which exposes per-node metrics via a dedicated REST port. An official Prometheus exporter plugin is available. The monitoring story is similar to Elasticsearch, though some community tooling (exporters, Grafana dashboards) is less comprehensive due to the younger ecosystem.
Lokiπ
All components natively expose Prometheus /metrics. The Grafana Loki Mixin β jsonnet-generated Grafana dashboards and Prometheus alerting rules β ships in the repository and is included in kube-prometheus-stack, giving pre-built production dashboards out of the box. Loki can be used to store its own component logs (dogfooding within the Grafana LGTM stack). All components have built-in tracing instrumentation for Jaeger or Grafana Tempo.
Quickwitπ
Native Prometheus /metrics on all components. Grafana dashboards are available but less comprehensive than the Loki Mixin. The REST API at /api/v1/cluster and /api/v1/indexing/plan provides cluster state inspection. As an OpenTelemetry-native system, Quickwit can both ingest and emit OTel signals β a coherent observability story if you are already running an OTel collector pipeline.
ClickHouseπ
The standout feature is system.* tables β ClickHouse records all operational data into persistent, SQL-queryable tables on the cluster itself:
system.query_logβ every query: execution time, memory consumed, bytes read, query textsystem.part_logβ every merge, insert, and split of MergeTree data partssystem.metric_logβ CPU, memory, and disk I/O sampled over timesystem.asynchronous_metric_log,system.crash_log, and more
Questions like βwhich query consumed the most memory last week?β or βhow many merges ran during yesterdayβs S3 upload spike?β are answerable with a SQL query β no external dashboard required. Native Prometheus /metrics and the official Grafana ClickHouse datasource plugin provide standard dashboard integration on top of this.
Alerting Integrationπ
For a log archive, alerting on log content is often secondary to retention and search. But when it matters β detecting error patterns, rate anomalies, or security events β the five systems have very different built-in capabilities.
| Elasticsearch | OpenSearch | Loki | Quickwit | ClickHouse | |
|---|---|---|---|---|---|
| Native alerting | β Watcher + Detection Rules (free) | β Alerting plugin (free) | β Ruler (Prometheus-compatible) | β | β |
| Prometheus Alertmanager | Via rules + ES Prometheus exporter | Via OpenSearch Prometheus plugin | β native (LogQL β Alertmanager) | Via external Prometheus queries | Via Prometheus CH exporter |
| Grafana Alerting | β | β | β native | β community datasource | β official plugin |
| Alert on log content | β query-based rules | β SQL / PPL monitor queries | β LogQL count/rate expressions | External polling only | Grafana or scheduled SQL |
Elasticsearch Watcher supports query-based alert rules (e.g. βfire if error rate exceeds threshold in the last 5 minutesβ) with HTTP, email, Slack, and PagerDuty actions. Kibana Detection Rules provide a higher-level UI. Both are free in the basic tier.
OpenSearch Alerting plugin (free) adds SQL and PPL (Piped Processing Language) monitor queries in addition to ES-style query monitors. Destinations include Slack, webhook, and SNS.
Loki Ruler converts LogQL expressions into Prometheus recording and alerting rules sent to Alertmanager β seamless if you already run kube-prometheus-stack.
ClickHouse and Quickwit have no native alerting engine. The standard pattern is Grafana Alerting: attach an alert rule to a Grafana panel backed by a ClickHouse SQL query. For custom logic, a Kubernetes CronJob running clickhouse-client -q "SELECT ..." is lightweight and maintainable without additional infrastructure.
FAQπ
How do I back up ClickHouse reliably in Kubernetes?π
The clickhouse-backup tool (open-source, maintained by Altinity) supports incremental backups to S3/GCS/Azure Blob. On Kubernetes it runs as a CronJob or via the ClickHouseBackup CRD when using the Altinity clickhouse-operator. Unlike Elasticsearchβs well-tested Snapshot API, ClickHouse backup tooling requires explicit setup β schedule regular backups and test restores proactively rather than discovering gaps during an incident.
Can I migrate existing log data from Elasticsearch to ClickHouse?π
There is no automated migration path β the data models are fundamentally different (document store with Lucene inverted index vs. columnar MergeTree). Migration requires: (1) defining a ClickHouse schema mapping ES fields to typed columns, (2) exporting data from Elasticsearch via the Scroll API or snapshot-to-Parquet tooling, (3) bulk-inserting into ClickHouse. For a 7+ year archive, a parallel-run approach β routing new data to ClickHouse while ES serves historical queries β is more practical than a full bulk migration.
Part 1: Comparison, storage model, compression, resource consumption, and SaaS options
Part 3: Security & Compliance β encryption, RBAC, and WORM compliance