[Refactor] Optimize CUDA Packaging: Fat Binaries & Dynamic Wheel Resolution #83

@zheyang0825

Problem

Currently, our CUDA packaging and release pipeline faces two major bottlenecks:

  1. Fragmented SM Architectures: We compile sm90 and sm100 targets separately. Users have to figure out which package matches their GPU.
  2. CUDA Version Discrepancies: We support multiple CUDA versions (cu12.9 and cu13.0). Wheel filenames have no tag for the CUDA version, and PyPI rejects local version identifiers such as +cu129, so these builds cannot be distributed through standard PyPI without polluting the namespace (one package name per CUDA version) or forcing users to pass -f (--find-links) with release links.

Proposed Solution

To provide a seamless pip install cula experience, we will refactor the packaging pipeline using the strategy popularized by FlashAttention:

1. Unify sm90 and sm100 into a single .so (Fatbinary)

  • Compile both architectures into a single extension by setting TORCH_CUDA_ARCH_LIST="9.0;10.0".
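  • Use the __CUDA_ARCH__ macro in device code to isolate architecture-specific instructions (to avoid compile-time errors), and implement runtime dispatching on the host by checking the device's compute capability, since __CUDA_ARCH__ is undefined in host code.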
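
A minimal sketch of the host-side half of this step, assuming the dispatch decision is made in Python before calling into the extension. The names KERNEL_VARIANTS, parse_arch_list, and select_kernel_variant are hypothetical and only illustrate the idea; the actual dispatch could equally live in the C++ launcher.

```python
import re

# Hypothetical mapping from compute-capability major version to the
# kernel variant compiled into the fat binary (names are illustrative).
KERNEL_VARIANTS = {9: "sm90", 10: "sm100"}


def parse_arch_list(arch_list):
    """Parse a TORCH_CUDA_ARCH_LIST-style string such as "9.0;10.0"."""
    archs = []
    for item in re.split(r"[;\s]+", arch_list.strip()):
        match = re.match(r"(\d+)\.(\d+)", item)
        if match:  # ignores suffixes such as "a" or "+PTX"
            archs.append((int(match.group(1)), int(match.group(2))))
    return archs


def select_kernel_variant(capability, built_archs):
    """Pick the kernel variant for a device with the given (major, minor)."""
    major, _minor = capability
    built_majors = {m for m, _ in built_archs}
    if major not in built_majors or major not in KERNEL_VARIANTS:
        raise RuntimeError(
            f"no kernel compiled for compute capability {capability}"
        )
    return KERNEL_VARIANTS[major]


built = parse_arch_list("9.0;10.0")
print(select_kernel_variant((9, 0), built))
print(select_kernel_variant((10, 0), built))
```

In practice the capability tuple would come from the running device, for example torch.cuda.get_device_capability(), rather than being hard-coded as it is here.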

2. Dynamic Wheel Resolution via setup.py Interception

  • CI/CD: Build wheel matrices for cu12.9 and cu13.0 (containing the fat binaries) and upload them to GitHub Releases. Publish only the source distribution to PyPI.
  • Smart Installation: Override the bdist_wheel command in setup.py. Because PyPI carries only the sdist, pip install cula has to build a wheel locally; setup.py intercepts that build, detects the local CUDA version and OS, and downloads the matching pre-compiled .whl from GitHub Releases instead of compiling.
  • Fallback: If no matching wheel is found or the download fails, setup.py falls back to compiling the extension locally.
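
The resolution and fallback logic could look roughly like the sketch below. It uses only the standard library and is not the final implementation: BASE_URL, the wheel naming scheme, and the helper names (cuda_tag, wheel_filename, wheel_url, fetch_or_build) are all assumptions made for illustration, since the release layout is not defined in this issue.

```python
import platform
import re
import sys
import urllib.error
import urllib.request

# Hypothetical release location; the real URL is not specified in this issue.
BASE_URL = "https://example.invalid/cula/releases/download"


def cuda_tag(cuda_version):
    """Turn a CUDA version string such as "12.9" into a tag such as "cu129"."""
    match = re.match(r"(\d+)\.(\d+)", cuda_version)
    if match is None:
        raise ValueError(f"unrecognized CUDA version: {cuda_version!r}")
    return f"cu{match.group(1)}{match.group(2)}"


def wheel_filename(version, cuda_version, python_tag=None, platform_tag=None):
    """Build the assumed wheel name for this interpreter, OS, and CUDA version."""
    if python_tag is None:
        python_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    if platform_tag is None:
        system = platform.system().lower()
        machine = platform.machine().lower()
        platform_tag = f"{system}_{machine}"
    local = cuda_tag(cuda_version)
    return f"cula-{version}+{local}-{python_tag}-{python_tag}-{platform_tag}.whl"


def wheel_url(version, cuda_version, **tags):
    """Assumed layout: one release tag per version, wheels as release assets."""
    name = wheel_filename(version, cuda_version, **tags)
    return f"{BASE_URL}/v{version}/{name}"


def fetch_or_build(url, dest, build_locally, download=urllib.request.urlretrieve):
    """Try the pre-built wheel first; compile locally if the download fails."""
    try:
        download(url, dest)
        return "downloaded"
    except OSError:  # urllib.error.URLError and HTTPError subclass OSError
        build_locally()
        return "built"


print(wheel_url("0.1.0", "12.9", python_tag="cp312", platform_tag="linux_x86_64"))
```

An overridden bdist_wheel.run() would call fetch_or_build with the parent class's run as build_locally, then move the downloaded file into the wheel output directory. The CUDA version could be read from torch.version.cuda, assuming PyTorch is importable at install time.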
