Problem
Currently, our CUDA packaging and release pipeline faces two major bottlenecks:
- Fragmented SM Architectures: We compile sm90 and sm100 targets separately, so users have to figure out which package matches their GPU.
- CUDA Version Discrepancies: We support multiple CUDA versions (cu12.9 and cu13.0). Since PyPI does not natively support CUDA version tags in wheel names, distributing these specific builds via standard PyPI is difficult without polluting the namespace or forcing users to use -f release links.
Proposed Solution
To provide a seamless pip install cula experience, we will refactor the packaging pipeline using the strategy popularized by FlashAttention:
1. Unify sm90 and sm100 into a single .so (Fatbinary)
- Compile both architectures into a single extension by setting TORCH_CUDA_ARCH_LIST="9.0;10.0".
- Guard architecture-specific instructions in the C++ code with the __CUDA_ARCH__ macro so that each target compiles without errors, and dispatch to the matching kernel at runtime based on the device's compute capability.
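As a rough illustration of step 1, the sketch below shows how an arch list such as "9.0;10.0" could expand into one nvcc -gencode flag per target, and how a host-side dispatcher might choose a kernel variant from the device's compute capability. Both gencode_flags and pick_variant are hypothetical helpers written for this proposal. They are not PyTorch APIs, and the flags PyTorch actually derives from TORCH_CUDA_ARCH_LIST may differ in detail.

```python
def gencode_flags(arch_list: str) -> list[str]:
    """Expand an arch list like "9.0;10.0" into one -gencode flag per target.

    Hypothetical helper: approximates what a build would pass to nvcc so that
    a single .so embeds device code for every listed architecture.
    """
    flags = []
    for entry in arch_list.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing or doubled separators
        num = entry.replace(".", "")  # "9.0" -> "90", "10.0" -> "100"
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
    return flags


def pick_variant(major: int, minor: int) -> str:
    """Choose a kernel variant from the device's compute capability.

    Hypothetical dispatcher: assumes one variant per major version, matching
    the two targets named in the proposal (sm90 and sm100).
    """
    variants = {9: "sm90", 10: "sm100"}
    if major not in variants:
        raise RuntimeError(f"unsupported compute capability {major}.{minor}")
    return variants[major]
```

In the real extension the equivalent of pick_variant would live on the host side of the C++ code, while __CUDA_ARCH__ guards keep each device code path out of the compilation pass for the other architecture.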
2. Dynamic Wheel Resolution via setup.py Interception
- CI/CD: Build wheel matrices for cu12.9 and cu13.0 (containing the fat binaries) and upload them to GitHub Releases. Publish only the source distribution to PyPI.
- Smart Installation: Override the bdist_wheel command in setup.py. When a user runs pip install cula, setup.py will intercept the build, detect the local CUDA version/OS, and dynamically download the matching pre-compiled .whl from GitHub Releases.
- Fallback: If the network download fails, setup.py will gracefully fall back to local C++ compilation.
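The resolution and fallback logic in step 2 could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the release URL (example-org is a placeholder), the wheel naming scheme with a +cuXYZ local version label, and the helper names cuda_tag, wheel_filename, wheel_url, and fetch_or_build are all hypothetical and not part of the existing codebase. How the CUDA version, Python tag, and platform tag are detected is left to the caller.

```python
import urllib.request
from pathlib import Path
from typing import Callable

# Placeholder release location; the real repository URL is an assumption.
BASE_URL = "https://github.com/example-org/cula/releases/download"


def cuda_tag(cuda_version: str) -> str:
    """Turn a CUDA version such as "12.9" into a tag such as "cu129"."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"


def wheel_filename(version: str, cuda_version: str,
                   python_tag: str, platform_tag: str) -> str:
    """Build the expected wheel name (hypothetical naming scheme)."""
    local = cuda_tag(cuda_version)
    return f"cula-{version}+{local}-{python_tag}-{python_tag}-{platform_tag}.whl"


def wheel_url(version: str, filename: str) -> str:
    """Locate the wheel under the release tagged v<version>."""
    return f"{BASE_URL}/v{version}/{filename}"


def fetch_or_build(url: str, dest: Path, build_locally: Callable[[], None],
                   download: Callable = urllib.request.urlretrieve) -> str:
    """Try to download the prebuilt wheel; compile locally if that fails."""
    try:
        download(url, str(dest))
        return "downloaded"
    except OSError:
        # urllib's URLError and HTTPError are OSError subclasses, so this
        # covers missing release assets as well as an unreachable network.
        build_locally()
        return "built"
```

In setup.py, the overridden bdist_wheel command's run() method would call something like fetch_or_build and pass the parent class's run() as build_locally, so a failed download falls through to the normal compilation path.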