Skip missing metadata files gracefully instead of retrying forever #1379

@minguyen9988

Problem

When downloading a backup, downloadTableMetadata() keeps retrying when a metadata file is missing. If a table was dropped or renamed during upload, its .json metadata file may not exist on remote storage, but the retry loop still runs through every attempt with exponential backoff, wasting minutes before ultimately failing the entire download.

Current behavior

// pkg/backup/download.go — downloadTableMetadata()
retry := retrier.New(retrier.ExponentialBackoff(b.cfg.General.RetriesOnFailure, ...), b)
err := retry.RunCtx(ctx, func(ctx context.Context) error {
    tmReader, err := b.dst.GetFileReader(ctx, remoteMetadataFile)
    if err != nil {
        return errors.Wrapf(err, "can't GetFileReader(%s) error", remoteMetadataFile)
    }
    // ...
})

When GetFileReader returns "object doesn't exist" / "NoSuchKey" / "StatusCode 404", the retrier treats it like any transient failure and retries RetriesOnFailure times with exponential backoff. But a 404 is permanent: the object will not appear on a later attempt.
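To make the cost concrete, here is a minimal, self-contained sketch of a retry loop that does not classify errors. `runWithRetries` and `errNoSuchKey` are hypothetical stand-ins, not the retrier used in pkg/backup/download.go, and the sleep between attempts is left out so the example runs instantly. It only illustrates that a permanently failing call consumes every attempt.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSuchKey is a hypothetical stand-in for the storage backend's "not found" error.
var errNoSuchKey = errors.New("NoSuchKey")

// runWithRetries is a simplified stand-in for a retrier: it calls work until it
// succeeds or maxAttempts is reached, and reports how many attempts were made.
// A real retrier would also sleep between attempts.
func runWithRetries(maxAttempts int, work func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = work(); err == nil {
			return attempt, nil
		}
	}
	return maxAttempts, err
}

func main() {
	// The object is permanently missing, so every attempt fails the same way.
	alwaysMissing := func() error { return errNoSuchKey }
	attempts, err := runWithRetries(4, alwaysMissing)
	fmt.Println(attempts, err) // prints "4 NoSuchKey"
}
```

Every one of the four attempts is spent on an error that cannot change between calls.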

Proposed Fix

Detect permanent "not found" errors and break out of the retry loop immediately:

notFoundErr := false // set inside the closure when the file is permanently missing
retry := retrier.New(retrier.ExponentialBackoff(b.cfg.General.RetriesOnFailure, ...), b)
err := retry.RunCtx(ctx, func(ctx context.Context) error {
    tmReader, err := b.dst.GetFileReader(ctx, remoteMetadataFile)
    if err != nil {
        // "object doesn't exist" is permanent: flag it and stop retrying
        if strings.Contains(err.Error(), "doesn't exist") ||
           strings.Contains(err.Error(), "key not found") ||
           strings.Contains(err.Error(), "NoSuchKey") ||
           strings.Contains(err.Error(), "StatusCode 404") {
            notFoundErr = true
            return nil // returning nil ends the retry loop
        }
        return errors.Wrapf(err, "can't GetFileReader(%s) error", remoteMetadataFile)
    }
    // ...
})

// After retry loop:
if notFoundErr {
    log.Warn().Str("remoteMetadataFile", remoteMetadataFile).
        Msg("metadata file not found on remote, skipping table")
    continue
}

When this happens

  • Table dropped during backup upload (backup started, table dropped, metadata never uploaded)
  • Incremental backup where base backup's metadata was cleaned up
  • Partial upload failure where some tables' metadata was never written
  • Table renamed between backup create and upload

Impact

Without this fix, a single missing metadata file causes the entire download to hang for RetriesOnFailure × exponential_backoff duration (typically 5-10 minutes) before failing. With this fix, the missing table is skipped in milliseconds and the rest of the backup downloads successfully.
