Parallelize UNFREEZE operations after backup create #1381

@minguyen9988

Problem

During backup create, each table is frozen (via ALTER TABLE ... FREEZE), its data is copied to the shadow directory, and then unfrozen (via ALTER TABLE ... UNFREEZE). Currently, UNFREEZE happens inline within each table's goroutine, meaning the goroutine holds its concurrency slot while waiting for the UNFREEZE DDL to complete. Since upload_concurrency limits the number of concurrent table goroutines, each UNFREEZE blocks the start of the next table's processing.

Impact

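Each UNFREEZE takes ~50-200ms (lightweight DDL, but it involves ZooKeeper on replicated tables). With 1000 tables, that adds up to 50-200 seconds of cumulative concurrency-slot time spent waiting on UNFREEZE. Spread across upload_concurrency=8 slots, this is roughly 6-25 seconds of wall-clock time during which no new table can start.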

Proposed Fix

Defer UNFREEZE operations and execute them in a parallel batch after all tables have been backed up.

1. Deferred UNFREEZE collection

// pkg/backup/backuper.go
type pendingUnfreeze struct {
    database string
    table    string
    uuid     string
}

type Backuper struct {
    // ...
    pendingUnfreezes   []pendingUnfreeze
    pendingUnfreezesMu sync.Mutex
}

func (b *Backuper) deferUnfreeze(database, table, uuid string) {
    b.pendingUnfreezesMu.Lock()
    b.pendingUnfreezes = append(b.pendingUnfreezes, pendingUnfreeze{
        database: database,
        table:    table,
        uuid:     uuid,
    })
    b.pendingUnfreezesMu.Unlock()
}
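
To illustrate why the mutex matters, here is a minimal, self-contained sketch of many table goroutines registering their UNFREEZE at the same time. The `collector` type and `collectConcurrently` function are hypothetical stand-ins for the `Backuper` fields above, not code from the repository.

```go
package main

import (
	"fmt"
	"sync"
)

type pendingUnfreeze struct{ database, table, uuid string }

// collector is a hypothetical stand-in for the Backuper fields shown above.
type collector struct {
	mu      sync.Mutex
	pending []pendingUnfreeze
}

// deferUnfreeze appends under the mutex so concurrent callers cannot race
// on the slice header.
func (c *collector) deferUnfreeze(database, table, uuid string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending = append(c.pending, pendingUnfreeze{database, table, uuid})
}

// collectConcurrently simulates n table goroutines, each registering one
// UNFREEZE, and returns how many entries were recorded.
func collectConcurrently(n int) int {
	var c collector
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.deferUnfreeze("db", fmt.Sprintf("t%d", i), "uuid")
		}(i)
	}
	wg.Wait() // all appends happen before this returns
	return len(c.pending)
}

func main() {
	fmt.Println(collectConcurrently(1000))
}
```

Without the lock, concurrent appends could drop entries, which would leave some tables frozen after the backup.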

2. Parallel batch execution after backup completes

// pkg/backup/create.go
func (b *Backuper) executePendingUnfreezes(ctx context.Context) {
    b.pendingUnfreezesMu.Lock()
    pending := make([]pendingUnfreeze, len(b.pendingUnfreezes))
    copy(pending, b.pendingUnfreezes)
    b.pendingUnfreezes = b.pendingUnfreezes[:0]
    b.pendingUnfreezesMu.Unlock()

    if len(pending) == 0 {
        return
    }

    start := time.Now()
    g, gCtx := errgroup.WithContext(ctx)
    g.SetLimit(max(b.cfg.ClickHouse.MaxConnections, 1))

    for _, u := range pending {
        u := u
        g.Go(func() error {
            query := fmt.Sprintf("ALTER TABLE `%s`.`%s` UNFREEZE WITH NAME '%s'", 
                u.database, u.table, u.uuid)
            if err := b.ch.QueryContext(gCtx, query); err != nil {
                if (strings.Contains(err.Error(), "code: 60") || 
                    strings.Contains(err.Error(), "code: 81") || 
                    strings.Contains(err.Error(), "code: 218")) && 
                    b.cfg.ClickHouse.IgnoreNotExistsErrorDuringFreeze {
                    log.Warn().Str("table", fmt.Sprintf("%s.%s", u.database, u.table)).
                        Msgf("can't unfreeze: %v", err)
                    return nil
                }
                log.Warn().Str("table", fmt.Sprintf("%s.%s", u.database, u.table)).
                    Err(err).Msg("UNFREEZE failed")
            }
            return nil
        })
    }
    // Every goroutine returns nil (failures are only logged above), so Wait
    // never reports an error here; it just blocks until all UNFREEZEs finish.
    _ = g.Wait()
    log.Info().Int("tables", len(pending)).
        Str("duration", utils.HumanizeDuration(time.Since(start))).
        Msg("parallel UNFREEZE complete")
}
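
The bounded-parallel pattern used above can be tried in isolation. The sketch below swaps errgroup for a buffered-channel semaphore plus sync.WaitGroup so that it needs only the standard library (Go 1.19+ for atomic.Int64). It also counts failures, which is one way a summary warning could be emitted. `runBounded`, `demo`, and the fake `run` callback are hypothetical; the callback stands in for the real UNFREEZE query.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// runBounded calls run once per item with at most limit calls in flight and
// returns how many calls failed. A failure never aborts the remaining items.
func runBounded(ctx context.Context, items []string, limit int, run func(context.Context, string) error) int {
	if limit < 1 {
		limit = 1
	}
	var failed atomic.Int64
	var wg sync.WaitGroup
	sem := make(chan struct{}, limit)
	for _, item := range items {
		sem <- struct{}{} // blocks while limit calls are already in flight
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := run(ctx, item); err != nil {
				failed.Add(1)
			}
		}(item)
	}
	wg.Wait()
	return int(failed.Load())
}

// demo runs 10 fake UNFREEZE calls with limit 3. Tables with an even index
// fail. It returns the failure count and the peak concurrency observed.
func demo() (failed int, peak int64) {
	items := make([]string, 10)
	for i := range items {
		items[i] = fmt.Sprintf("db.t%d", i)
	}
	var inFlight, maxSeen atomic.Int64
	failed = runBounded(context.Background(), items, 3, func(_ context.Context, item string) error {
		n := inFlight.Add(1)
		defer inFlight.Add(-1)
		for { // record the highest concurrency seen so far
			m := maxSeen.Load()
			if n <= m || maxSeen.CompareAndSwap(m, n) {
				break
			}
		}
		if item[len(item)-1]%2 == 0 { // last digit is even
			return errors.New("fake UNFREEZE failure")
		}
		return nil
	})
	return failed, maxSeen.Load()
}

func main() {
	failed, peak := demo()
	fmt.Println("failed:", failed, "peak concurrency:", peak)
}
```

Counting failures instead of returning them keeps the "log and continue" behaviour while still letting the caller report how many tables may remain frozen.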

3. Call site in AddTableToLocalBackup

// pkg/backup/create.go — in AddTableToLocalBackup, replace inline UNFREEZE:
// Before:
//   b.ch.QueryContext(ctx, fmt.Sprintf("ALTER TABLE ... UNFREEZE ..."))
// After:
if version > 21004000 {
    b.deferUnfreeze(table.Database, table.Name, shadowBackupUUID)
}

4. Call in CreateBackup (both success and error paths)

// Always execute pending UNFREEZEs to release freeze locks, even on error
defer b.executePendingUnfreezes(ctx)
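
One caveat with reusing ctx in the deferred call: if CreateBackup fails because ctx was canceled, queries issued with that same ctx are likely to fail immediately, so no UNFREEZE would run. A possible mitigation, sketched here under the assumption that Go 1.21+ is available (for context.WithoutCancel), is to detach the cleanup from cancellation and bound it with its own timeout. `cleanupContext`, `usableAfterCancel`, and the one-minute timeout are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cleanupContext returns a context that keeps parent's values but ignores its
// cancellation, bounded by its own timeout so cleanup cannot hang forever.
func cleanupContext(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(parent), timeout)
}

// usableAfterCancel reports whether the cleanup context is still live after
// the parent context has been canceled.
func usableAfterCancel() bool {
	parent, cancel := context.WithCancel(context.Background())
	cancel() // simulate a canceled backup
	ctx, stop := cleanupContext(parent, time.Minute)
	defer stop()
	return parent.Err() != nil && ctx.Err() == nil
}

func main() {
	fmt.Println(usableAfterCancel())
}
```

In CreateBackup, the deferred function could build such a context first and pass it to executePendingUnfreezes instead of ctx.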

Key design decisions

  • Error handling: UNFREEZE failures are logged as warnings, not fatal errors. A failed UNFREEZE leaves a stale freeze lock that ClickHouse will clean up on restart. Failing the entire backup for a cleanup operation would be too aggressive.
  • Concurrency limit: Uses MaxConnections to avoid overwhelming ClickHouse with DDL queries.
  • Both paths: Called in both success and error defer paths, because leaving freeze locks behind is worse than a failed UNFREEZE attempt.

Benefits

  1. Table goroutines finish faster (don't block on UNFREEZE), allowing the next table to start sooner
  2. All UNFREEZEs run in parallel after data copy, using ClickHouse's connection pool efficiently
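  3. UNFREEZE wall-clock time: about ceil(tables / MaxConnections) × individual_unfreeze_time instead of sum(individual_unfreeze_time); it approaches max(individual_unfreeze_time) only when MaxConnections is at least the number of tables
  4. For 1000 tables at 50-200ms each: 50-200 seconds of cumulative UNFREEZE time moves out of the table goroutines; with a limit of 8, for example, the batch itself takes roughly 6-25 seconds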
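
Because the batch runs under a concurrency limit, its wall-clock time depends on that limit. The rough estimate below assumes every UNFREEZE takes the same time and ignores scheduling overhead. The function names and the limits 8 and 1000 are illustrative assumptions, not project defaults.

```go
package main

import "fmt"

// cumulativeMs is the UNFREEZE time summed over all tables: the slot time
// that inline UNFREEZE consumes inside the table goroutines.
func cumulativeMs(tables, perTableMs int) int {
	return tables * perTableMs
}

// batchWallClockMs estimates the wall-clock time of the deferred batch when
// at most limit UNFREEZE queries run at once: ceil(tables/limit) rounds.
func batchWallClockMs(tables, perTableMs, limit int) int {
	if tables <= 0 {
		return 0
	}
	if limit < 1 {
		limit = 1
	}
	rounds := (tables + limit - 1) / limit
	return rounds * perTableMs
}

func main() {
	for _, ms := range []int{50, 200} {
		fmt.Printf("per-table=%dms cumulative=%dms batch(limit=8)=%dms batch(limit=1000)=%dms\n",
			ms, cumulativeMs(1000, ms),
			batchWallClockMs(1000, ms, 8), batchWallClockMs(1000, ms, 1000))
	}
}
```

The batch only reaches the single-query duration when the limit is at least the number of tables; with smaller limits it scales with the number of rounds.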
