How to Back Up and Restore ClickHouse with S3
Overview
This document describes how to back up and restore ClickHouse tables in the observability database by using S3 storage. This procedure applies to clusters that use the ReplicatedMergeTree table engine.
This document provides the following guidance:
- Create a full backup.
- Create an incremental backup based on a full backup.
- Store backup data in a specified S3 path.
- Restore data from an S3 backup.
- Validate backup and restore results.
This document uses the observability.audit table as an example. You can apply the same procedure to other tables.
The following tables are common examples:
- audit: stores audit data.
- event: stores event data.
- log_kubernetes: stores Kubernetes logs.
- log_platform: stores platform service logs.
- log_system: stores node-level system logs.
- log_workload: stores application and workload logs.
Prerequisites
Before you start, make sure the following conditions are met.
Environment Requirements
Access Requirements
All SQL statements in this document use the built-in ClickHouse administrator account default.
You can run the SQL statements on any healthy ClickHouse instance. For consistency, this document uses a single ClickHouse Pod as an example.
Before you run SQL statements, connect to the target Pod:
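For example, assuming the target Pod is named clickhouse-0 in the cpaas-system namespace (both are assumptions; adjust to your deployment):

```bash
# Open a shell in the target ClickHouse Pod.
kubectl exec -it clickhouse-0 -n cpaas-system -- bash
```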
Then connect to ClickHouse in the container:
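Inside the container, connect with clickhouse-client as the default user; add --password only if a password is configured:

```bash
# Connect to the local ClickHouse server as the built-in administrator.
clickhouse-client -u default
```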
The default user already has the required privileges for this procedure, including BACKUP, RESTORE, SELECT, and ALTER.
S3 Permissions
The S3 credentials must have the following permissions:
- PutObject
- GetObject
- ListBucket
- DeleteObject, if expired backups need to be cleaned up
Backup Strategy
Use the following backup strategy:
- Run the backup operation only once on any healthy replica.
- Use BACKUP TABLE to create a consistent snapshot without stopping the service.
- Use base_backup for file-level deduplication in incremental backups.
- Keep the base full backup accessible when you restore an incremental backup.
- Use one full backup per week and one incremental backup per day to balance restore complexity and storage cost.
Procedure
The following examples use an initial full backup and a daily incremental backup.
Create a Full Backup
A full backup creates the baseline for subsequent incremental backups.
Run the following command on any healthy ClickHouse instance:
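A sketch of the full backup statement, using placeholder values for the S3 endpoint, bucket, path, and credentials (the full_YYYYMMDD directory name follows the convention recommended at the end of this document):

```sql
-- Full backup of observability.audit to S3, compressed with zstd.
BACKUP TABLE observability.audit
TO S3('https://<s3-endpoint>/<bucket>/backup/full_YYYYMMDD', '<access-key>', '<secret-key>')
SETTINGS compression_method = 'zstd';
```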
Replace the variables as follows:
- <s3-endpoint> and <bucket>: the S3 endpoint and the bucket that stores the backup.
- backup/full_YYYYMMDD: the target directory for this full backup.
- <access-key> and <secret-key>: the S3 credentials described in S3 Permissions.
Notes:
- S3(...) writes the backup to the specified object storage path.
- compression_method = 'zstd' compresses the backup content with zstd to reduce storage usage and network transfer volume.
Validate Backup Success
After the backup is complete, validate the result.
Check Backup Task Status
Run the following query on the ClickHouse instance where the backup command was executed:
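A typical status query against the system.backups table:

```sql
-- Inspect recent backup tasks; the newest task appears first.
SELECT name, status, error, start_time, end_time, num_files, total_size
FROM system.backups
ORDER BY start_time DESC
LIMIT 10;
```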
Expected result:
Locate the backup record for observability.audit by its name field. The backup is successful if the following conditions are met:
- status = 'BACKUP_CREATED'
- error is empty
- end_time has a value
- num_files > 0
- total_size > 0
Check the S3 Path
Run the following command in an environment that can access the target S3 bucket:
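A sketch using the AWS CLI, assuming it is configured with credentials for the target bucket; add --endpoint-url for a non-AWS S3 service:

```bash
# List the backup files recursively and print the total object count and size.
aws s3 ls s3://<bucket>/backup/full_YYYYMMDD/ --recursive --summarize
```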
Replace the variables as follows:
- <bucket>: the S3 bucket that stores the backup.
- backup/full_YYYYMMDD: the backup directory that you want to check.
Expected result:
- The target path exists.
- The path contains files generated by this backup.
- The file count and total size are greater than 0.
Create an Incremental Backup
An incremental backup uses base_backup and uploads only new or changed data files.
Run the following command on any healthy ClickHouse instance:
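A sketch of the incremental backup statement; base_backup points at the existing full backup, and new data is written to a separate incr_YYYYMMDD directory (all paths and credentials are placeholders):

```sql
-- Incremental backup: only files that are new or changed since the base
-- full backup are uploaded to the incremental path.
BACKUP TABLE observability.audit
TO S3('https://<s3-endpoint>/<bucket>/backup/incr_YYYYMMDD', '<access-key>', '<secret-key>')
SETTINGS
    base_backup = S3('https://<s3-endpoint>/<bucket>/backup/full_YYYYMMDD', '<access-key>', '<secret-key>'),
    compression_method = 'zstd';
```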
Notes:
- The incremental backup depends on the backup specified by base_backup.
- Use the latest full backup as the base_backup for each daily incremental backup, instead of using the previous incremental backup as the base. This keeps the restore dependency simple: restoring a daily incremental backup requires only that incremental backup and its corresponding full backup.
- Keep the dependent full backup accessible when you restore the incremental backup.
- Create a new full backup periodically to avoid relying on the same baseline for too long.
Restore
Use this procedure when table data is corrupted, the data directory is deleted, or the table state is abnormal.
Restore Prerequisites
Before you start the restore procedure, make sure the following conditions are met:
- You have a valid full backup or incremental backup.
- If you restore from an incremental backup, the corresponding full backup is still accessible.
- You have confirmed the S3 endpoint, bucket, backup path, access key, and secret key.
- You have access to the Kubernetes cluster and the host nodes where ClickHouse instances are scheduled.
Restore Procedure
This procedure restores ClickHouse data table by table. Choose the preparation steps according to the failure scope, and then run the same table restore procedure for each required table.
Stop Writes
Stop razor first to prevent new data from being written during the restore.
Log in to the cluster master node and create a ResourcePatch to stop razor:
Prepare ClickHouse According to the Failure Scope
Choose one of the following preparation paths according to the failure scope. After the preparation is complete, continue with the same table-by-table restore procedure.
Case A: Table Data Is Corrupted but the Data Directory Is Healthy
Use this case when one or more tables are corrupted, accidentally deleted, or have abnormal data, while the ClickHouse data directory and Keeper state are still healthy.
In this case, do not stop ClickHouse and do not clean /cpaas/data/clickhouse/*. Continue with the table restore procedure directly.
Case B: Data Directory or Keeper Metadata Is Damaged
Use this case only when the ClickHouse data directory is damaged, the node is rebuilt, or the Keeper metadata is unavailable.
In this deployment, ClickHouse Keeper is integrated with ClickHouse and its data is also stored under the ClickHouse data directory. Therefore, cleaning /cpaas/data/clickhouse/* also removes the local ClickHouse Keeper data.
Warning: Cleaning the data directory deletes local ClickHouse data on the target nodes. Confirm that the backup is available before you continue.
Run the rm -rf /cpaas/data/clickhouse/* command on each ClickHouse host node, not inside the ClickHouse container.
Log in to the cluster master node and create a ResourcePatch to stop ClickHouse:
Confirm that all ClickHouse Pods have stopped:
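For example, assuming the ClickHouse Pods run in the cpaas-system namespace and carry an app=clickhouse label (both are assumptions; adjust to your deployment):

```bash
# No ClickHouse Pods should be listed (or all should be Terminating).
kubectl get pods -n cpaas-system -l app=clickhouse
```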
On each host node where a ClickHouse instance is deployed, clean the local ClickHouse data directory:
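As noted above, run this on the host node itself, not inside the container:

```bash
# Destructive: removes all local ClickHouse data, including the integrated
# ClickHouse Keeper state stored under the same directory.
rm -rf /cpaas/data/clickhouse/*
```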
Start only the ClickHouse components by deleting the ClickHouse ResourcePatch. Do not start razor yet.
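Assuming the ClickHouse ResourcePatch created in the previous step was named clickhouse-stop (the resource name and the lowercase resource kind are assumptions; adjust to your environment):

```bash
# Delete the ClickHouse ResourcePatch so that the ClickHouse Pods are
# scheduled again; razor stays stopped.
kubectl delete resourcepatch clickhouse-stop
```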
Confirm that all ClickHouse Pods are running:
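The same Pod listing as in the stop check applies; all Pods should now be Running:

```bash
# All ClickHouse Pods should report Running with all containers ready.
kubectl get pods -n cpaas-system -l app=clickhouse
```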
Check Cluster Readiness
Before you run the restore command, make sure the ClickHouse cluster configuration and macros are available.
Check the replicated cluster configuration:
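For example, to list the members of the replicated cluster:

```sql
-- Every expected replica should appear with its shard and replica numbers.
SELECT cluster, shard_num, replica_num, host_name
FROM system.clusters
WHERE cluster = 'replicated';
```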
Check the local ClickHouse macros:
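And to list the macros defined on the local instance:

```sql
-- The shard and replica macros must be present for ReplicatedMergeTree tables.
SELECT * FROM system.macros;
```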
Make sure the cluster contains the expected ClickHouse replicas and the macros such as shard and replica are available.
Restore Tables One by One
Run the following restore procedure for each table that needs to be restored.
Drop the target table on the ClickHouse cluster:
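Using observability.audit as the example table:

```sql
-- Drop the table on every replica; SYNC waits until the drop completes.
DROP TABLE IF EXISTS observability.audit ON CLUSTER 'replicated' SYNC;
```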
Restore the table from S3:
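A sketch of the restore statement, pointing at the backup created earlier (placeholders as in the backup examples); when the path is an incremental backup, ClickHouse resolves the base full backup from the backup metadata:

```sql
-- Restore the table to all replicas in the cluster from the S3 backup.
RESTORE TABLE observability.audit ON CLUSTER 'replicated'
FROM S3('https://<s3-endpoint>/<bucket>/backup/incr_YYYYMMDD', '<access-key>', '<secret-key>');
```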
Notes:
- For a ReplicatedMergeTree table in this 3-replica deployment, use ON CLUSTER 'replicated' so that the restore operation is distributed to all ClickHouse replicas in the cluster.
- If you restore from an incremental backup, ClickHouse reads the dependent base backup automatically.
- Keep the corresponding full backup path accessible during the restore.
- Replace observability.audit and the S3 path with the actual table and backup path that you want to restore.
- Repeat the same procedure for every table that needs to be restored. The table list is determined by the customer.
Check Restore Task Status
After each restore command is executed, check the restore task status before you start any related components:
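Restore tasks are reported in the same system.backups table as backup tasks:

```sql
-- Inspect recent restore tasks; the newest task appears first.
SELECT name, status, error, start_time, end_time
FROM system.backups
ORDER BY start_time DESC
LIMIT 10;
```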
Expected result:
The restore is successful if the following conditions are met:
- status indicates that the restore operation has completed successfully.
- error is empty.
- end_time has a value.
Validate Restore Success
After the restore is complete, validate the result from the data, partition, and replica perspectives.
Validate Total Row Count
Run the following query on each ClickHouse instance:
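For the example table:

```sql
-- Total row count of the restored table; compare the value across instances.
SELECT count() AS total_rows FROM observability.audit;
```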
Expected result:
The total_rows value is identical on all ClickHouse instances.
Validate Partition-Level Data
Run the following query on each ClickHouse instance:
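A sketch for the example table, querying system.parts:

```sql
-- Per-partition row counts and active part counts for the restored table.
SELECT
    partition,
    sum(rows) AS rows,
    count() AS active_parts
FROM system.parts
WHERE database = 'observability' AND table = 'audit' AND active
GROUP BY partition
ORDER BY partition;
```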
Expected result:
- The partition list is identical on all ClickHouse instances.
- The rows value for each partition is identical on all ClickHouse instances.
active_parts is only used to observe the physical part layout. A different number of parts does not necessarily indicate a restore failure.
Validate Replica Status
Run the following query on any ClickHouse instance:
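For the example table, querying system.replicas:

```sql
-- Replication health for the restored table.
SELECT total_replicas, active_replicas, queue_size, absolute_delay
FROM system.replicas
WHERE database = 'observability' AND table = 'audit';
```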
Expected result:
For a 3-replica cluster, the restore is successful if the following conditions are met:
- total_replicas = 3
- active_replicas = 3
- queue_size = 0
- absolute_delay is close to 0
Start the Related Components
After the restore has been validated successfully, start razor again by deleting the ResourcePatch:
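Assuming the ResourcePatch created in the Stop Writes step was named razor-stop (the resource name and the lowercase resource kind are assumptions; adjust to your environment):

```bash
# Delete the ResourcePatch so that razor is scheduled and starts writing again.
kubectl delete resourcepatch razor-stop
```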
Recommendations
Use the following recommendations in production:
- Maintain the backup chain with one full backup per week and one incremental backup per day.
- Use a consistent naming convention for S3 backup directories, such as full_YYYYMMDD and incr_YYYYMMDD.
- Run restore drills in a test environment on a regular basis.