Problem
During backup create, each table is frozen (via ALTER TABLE ... FREEZE), its data is copied to the shadow directory, and then unfrozen (via ALTER TABLE ... UNFREEZE). Currently, UNFREEZE happens inline within each table's goroutine, meaning the goroutine holds its concurrency slot while waiting for the UNFREEZE DDL to complete. Since upload_concurrency limits the number of concurrent table goroutines, each UNFREEZE blocks the start of the next table's processing.
Impact
Each UNFREEZE takes ~50-200 ms (a lightweight DDL, but one that involves ZooKeeper on replicated tables). With 1000 tables and upload_concurrency=8, that adds up to 50-200 seconds of concurrency-slot time spent waiting on UNFREEZE, or roughly 6-25 seconds of wall-clock time when spread evenly across the 8 slots.
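The estimate is plain arithmetic over the figures quoted in this section. A minimal sketch, where the helper name `slotTime` is hypothetical and the even spread across slots is a simplifying assumption:

```go
package main

import (
    "fmt"
    "time"
)

// slotTime returns the cumulative time that concurrency slots spend waiting
// on inline UNFREEZE calls, and the wall-clock share of that time when the
// waiting is spread evenly across `slots` goroutines (an idealization).
func slotTime(tables, slots int, perUnfreeze time.Duration) (cumulative, wallClock time.Duration) {
    cumulative = time.Duration(tables) * perUnfreeze
    wallClock = cumulative / time.Duration(slots)
    return cumulative, wallClock
}

func main() {
    // Lower and upper bound of the per-UNFREEZE latency quoted above.
    for _, d := range []time.Duration{50 * time.Millisecond, 200 * time.Millisecond} {
        cum, wall := slotTime(1000, 8, d)
        fmt.Printf("per-UNFREEZE=%v cumulative=%v wall-clock=%v\n", d, cum, wall)
    }
}
```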
Proposed Fix
Defer UNFREEZE operations and execute them in a parallel batch after all tables have been backed up.
1. Deferred UNFREEZE collection
// pkg/backup/backuper.go
type pendingUnfreeze struct {
    database string
    table    string
    uuid     string
}

type Backuper struct {
    // ...
    pendingUnfreezes   []pendingUnfreeze
    pendingUnfreezesMu sync.Mutex
}

func (b *Backuper) deferUnfreeze(database, table, uuid string) {
    b.pendingUnfreezesMu.Lock()
    b.pendingUnfreezes = append(b.pendingUnfreezes, pendingUnfreeze{
        database: database,
        table:    table,
        uuid:     uuid,
    })
    b.pendingUnfreezesMu.Unlock()
}
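To illustrate why the mutex matters, here is a stand-alone sketch of the same collection pattern exercised from many goroutines at once. The `unfreezeQueue` type and the `add`, `drain`, and `collect` helpers are hypothetical names for illustration only; in the proposal the slice and mutex live directly on `Backuper`.

```go
package main

import (
    "fmt"
    "sync"
)

// pendingUnfreeze mirrors the struct in the proposal: one deferred UNFREEZE.
type pendingUnfreeze struct {
    database, table, uuid string
}

// unfreezeQueue is a hypothetical stand-alone holder for the two fields
// that the proposal adds to Backuper.
type unfreezeQueue struct {
    mu      sync.Mutex
    pending []pendingUnfreeze
}

// add appends one entry while holding the mutex, so concurrent callers
// never race on the slice header.
func (q *unfreezeQueue) add(database, table, uuid string) {
    q.mu.Lock()
    defer q.mu.Unlock()
    q.pending = append(q.pending, pendingUnfreeze{database, table, uuid})
}

// drain hands back everything queued so far and leaves the queue empty.
func (q *unfreezeQueue) drain() []pendingUnfreeze {
    q.mu.Lock()
    defer q.mu.Unlock()
    out := q.pending
    q.pending = nil
    return out
}

// collect simulates `tables` table goroutines, each deferring one UNFREEZE.
func collect(tables int) []pendingUnfreeze {
    var q unfreezeQueue
    var wg sync.WaitGroup
    for i := 0; i < tables; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            q.add("db", fmt.Sprintf("t%d", i), "backup-uuid")
        }(i)
    }
    wg.Wait()
    return q.drain()
}

func main() {
    fmt.Println("queued:", len(collect(1000)))
}
```

Without the lock, concurrent `append` calls could lose entries, which here would mean a table that is never unfrozen.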
2. Parallel batch execution after backup completes
// pkg/backup/create.go
func (b *Backuper) executePendingUnfreezes(ctx context.Context) {
    // Snapshot the queue and reset it under the lock.
    b.pendingUnfreezesMu.Lock()
    pending := make([]pendingUnfreeze, len(b.pendingUnfreezes))
    copy(pending, b.pendingUnfreezes)
    b.pendingUnfreezes = b.pendingUnfreezes[:0]
    b.pendingUnfreezesMu.Unlock()

    if len(pending) == 0 {
        return
    }

    start := time.Now()
    g, gCtx := errgroup.WithContext(ctx)
    g.SetLimit(max(b.cfg.ClickHouse.MaxConnections, 1))

    for _, u := range pending {
        u := u
        g.Go(func() error {
            query := fmt.Sprintf("ALTER TABLE `%s`.`%s` UNFREEZE WITH NAME '%s'",
                u.database, u.table, u.uuid)
            if err := b.ch.QueryContext(gCtx, query); err != nil {
                if (strings.Contains(err.Error(), "code: 60") ||
                    strings.Contains(err.Error(), "code: 81") ||
                    strings.Contains(err.Error(), "code: 218")) &&
                    b.cfg.ClickHouse.IgnoreNotExistsErrorDuringFreeze {
                    log.Warn().Str("table", fmt.Sprintf("%s.%s", u.database, u.table)).
                        Msgf("can't unfreeze: %v", err)
                    return nil
                }
                log.Warn().Str("table", fmt.Sprintf("%s.%s", u.database, u.table)).
                    Err(err).Msg("UNFREEZE failed")
            }
            // Always return nil: a non-nil error would cancel gCtx and
            // abort the UNFREEZEs that have not run yet.
            return nil
        })
    }

    // Every goroutine returns nil, so Wait only blocks until the batch is
    // done; failures have already been logged per table above.
    _ = g.Wait()

    log.Info().Int("tables", len(pending)).
        Str("duration", utils.HumanizeDuration(time.Since(start))).
        Msg("parallel UNFREEZE complete")
}
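The bounded, log-and-continue batch can be tried out without ClickHouse. The sketch below is not the proposal's code: it swaps `errgroup` for a standard-library counting semaphore (a buffered channel) so that it runs with no external modules, and `runBounded` and `fakeUnfreeze` are hypothetical names.

```go
package main

import (
    "errors"
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

// fakeUnfreeze is a hypothetical stand-in for the UNFREEZE query: it sleeps
// briefly and fails for one table so the log-and-continue path is exercised.
func fakeUnfreeze(table string) error {
    time.Sleep(2 * time.Millisecond)
    if table == "t3" {
        return errors.New("simulated UNFREEZE failure")
    }
    return nil
}

// runBounded calls fn once per table with at most `limit` calls in flight.
// A failure is counted but never stops the remaining calls. It returns the
// number of failures and the peak concurrency that was observed.
func runBounded(tables []string, limit int, fn func(string) error) (failed, peak int64) {
    if limit < 1 {
        limit = 1 // same guard as max(MaxConnections, 1) in the proposal
    }
    sem := make(chan struct{}, limit) // counting semaphore
    var wg sync.WaitGroup
    var inFlight int64
    for _, table := range tables {
        sem <- struct{}{} // blocks while `limit` calls are already running
        wg.Add(1)
        go func(table string) {
            defer wg.Done()
            defer func() { <-sem }()
            n := atomic.AddInt64(&inFlight, 1)
            defer atomic.AddInt64(&inFlight, -1)
            // Record the highest in-flight count seen so far.
            for {
                p := atomic.LoadInt64(&peak)
                if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
                    break
                }
            }
            if err := fn(table); err != nil {
                atomic.AddInt64(&failed, 1)
            }
        }(table)
    }
    wg.Wait()
    return failed, peak
}

func main() {
    tables := make([]string, 20)
    for i := range tables {
        tables[i] = fmt.Sprintf("t%d", i)
    }
    failed, peak := runBounded(tables, 4, fakeUnfreeze)
    fmt.Printf("failed=%d peak concurrency=%d\n", failed, peak)
}
```

The point of the design is visible in the counters: one failing table does not prevent the others from being unfrozen, and the in-flight count never exceeds the limit.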
3. Call site in AddTableToLocalBackup
// pkg/backup/create.go, in AddTableToLocalBackup: replace the inline UNFREEZE.
// Before:
//     b.ch.QueryContext(ctx, fmt.Sprintf("ALTER TABLE ... UNFREEZE ..."))
// After:
if version > 21004000 {
    b.deferUnfreeze(table.Database, table.Name, shadowBackupUUID)
}
4. Call in CreateBackup (both success and error paths)
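// Always execute pending UNFREEZEs to release freeze locks, even on error.
// Caveat: if ctx has already been cancelled when this deferred call runs,
// the UNFREEZE queries are likely to fail immediately with a context error.
defer b.executePendingUnfreezes(ctx)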
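One thing to watch with a deferred cleanup is the state of the context on the error path. The sketch below is an illustration, not part of the proposal: `cleanup` and `createBackup` are hypothetical stand-ins, and it assumes Go 1.21 or later for `context.WithoutCancel`. It shows a cancelled backup whose cleanup only succeeds when it runs on a detached context.

```go
package main

import (
    "context"
    "errors"
    "fmt"
)

// cleanup stands in for executePendingUnfreezes: like a real query, it
// refuses to run when its context has already been cancelled.
func cleanup(ctx context.Context) error {
    if err := ctx.Err(); err != nil {
        return fmt.Errorf("cleanup skipped: %w", err)
    }
    return nil
}

// createBackup simulates a backup that is cancelled midway. The deferred
// cleanup runs on both the success and error paths; when detach is true it
// uses a context that keeps ctx's values but ignores its cancellation.
func createBackup(detach bool) (backupErr, cleanupErr error) {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    defer func() {
        cleanupCtx := ctx
        if detach {
            cleanupCtx = context.WithoutCancel(ctx)
        }
        cleanupErr = cleanup(cleanupCtx)
    }()

    cancel() // simulate an interrupted backup
    return errors.New("backup interrupted"), nil
}

func main() {
    for _, detach := range []bool{false, true} {
        backupErr, cleanupErr := createBackup(detach)
        fmt.Printf("detach=%v backupErr=%v cleanupErr=%v\n", detach, backupErr, cleanupErr)
    }
}
```

A detached context has no deadline either, so in practice it would probably be wrapped in `context.WithTimeout` to keep a stuck UNFREEZE from blocking shutdown indefinitely.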
Key design decisions
- Error handling: UNFREEZE failures are logged as warnings, not fatal errors. A failed UNFREEZE leaves a stale freeze lock that ClickHouse will clean up on restart. Failing the entire backup for a cleanup operation would be too aggressive.
- Concurrency limit: Uses MaxConnections to avoid overwhelming ClickHouse with DDL queries.
- Both paths: Called in both success and error defer paths, because leaving freeze locks behind is worse than a failed UNFREEZE attempt.
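On the error-handling decision, the substring checks in the batch function (`"code: 60"`, `"code: 81"`, `"code: 218"`) also match longer codes that merely start with the same digits, such as `code: 600`. A stricter classifier could look like the sketch below. It assumes the error text has the form `code: N, message: ...`, and `isIgnorableUnfreezeError` and `classify` are hypothetical helper names.

```go
package main

import (
    "errors"
    "fmt"
    "regexp"
    "strings"
)

// codeRe captures the full numeric code from text like "code: 60, message: ...".
// The message format is an assumption made for this sketch.
var codeRe = regexp.MustCompile(`code: (\d+)`)

// ignorableCodes lists the codes the proposal treats as "object is gone".
var ignorableCodes = map[string]bool{"60": true, "81": true, "218": true}

// isIgnorableUnfreezeError reports whether err carries exactly one of the
// ignorable codes. Because \d+ is greedy, "code: 600" yields "600", not "60".
func isIgnorableUnfreezeError(err error) bool {
    if err == nil {
        return false
    }
    m := codeRe.FindStringSubmatch(err.Error())
    return m != nil && ignorableCodes[m[1]]
}

// classify is a small convenience wrapper for trying out message strings.
func classify(msg string) bool {
    return isIgnorableUnfreezeError(errors.New(msg))
}

func main() {
    for _, msg := range []string{
        "code: 60, message: table does not exist",
        "code: 600, message: some unrelated failure",
    } {
        fmt.Printf("%q substring=%v exact=%v\n",
            msg, strings.Contains(msg, "code: 60"), classify(msg))
    }
}
```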
Benefits
- Table goroutines finish faster (don't block on UNFREEZE), allowing the next table to start sooner
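- UNFREEZEs run in parallel after the data copy, up to MaxConnections at a time, using ClickHouse's connection pool efficiently
- Total UNFREEZE time: about ceil(tables / MaxConnections) * individual_unfreeze_time, approaching max(individual_unfreeze_time) when MaxConnections is at least the number of tables, instead of sum(individual_unfreeze_time) spread across the table goroutines
- For 1000 tables: as little as ~200 ms if every UNFREEZE can run at once, vs 50-200 seconds of cumulative slot time when done inline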
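How long the batch takes depends on the concurrency limit. A rough estimate, assuming every UNFREEZE takes the same time and the calls proceed in lockstep rounds (both simplifications; `rounds` and `estimateBatch` are hypothetical helpers):

```go
package main

import (
    "fmt"
    "time"
)

// rounds returns how many waves of at most `limit` parallel calls are needed
// to cover `tables` calls, i.e. ceil(tables / limit).
func rounds(tables, limit int) int {
    if limit < 1 {
        limit = 1
    }
    return (tables + limit - 1) / limit
}

// estimateBatch approximates the wall-clock time of the deferred batch under
// the lockstep assumption described above.
func estimateBatch(tables, limit int, perUnfreeze time.Duration) time.Duration {
    return time.Duration(rounds(tables, limit)) * perUnfreeze
}

func main() {
    for _, limit := range []int{8, 16, 100, 1000} {
        fmt.Printf("limit=%d rounds=%d estimate=%v\n",
            limit, rounds(1000, limit), estimateBatch(1000, limit, 200*time.Millisecond))
    }
}
```

Even when the estimate is no smaller than the inline wall-clock cost, the batch runs after the data copy, so it no longer delays the start of the next table.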