[Refactor] Optimize CUDA Packaging: Fat Binaries & Dynamic Wheel Resolution #83

### Problem
Currently, our CUDA packaging and release pipeline faces two major bottlenecks:
1. **Fragmented SM Architectures:** We compile the `sm90` and `sm100` targets as separate packages, so users have to work out which one matches their GPU.
2. **CUDA Version Discrepancies:** We support multiple CUDA versions (`cu12.9` and `cu13.0`). Wheel filenames have no tag for the CUDA version, and PyPI rejects uploads that carry a local version identifier such as `+cu129`, so distributing these builds through PyPI means either publishing separate package names (polluting the namespace) or forcing users to pass `-f` links to release pages.

### Proposed Solution
To provide a seamless `pip install cula` experience, we will refactor the packaging pipeline using the strategy popularized by FlashAttention:

**1. Unify `sm90` and `sm100` into a single `.so` (Fatbinary)**
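* Compile both architectures into a single extension by setting `TORCH_CUDA_ARCH_LIST="9.0;10.0"`.
* Guard architecture-specific instructions with the `__CUDA_ARCH__` macro in device code so that each target compiles without errors. `__CUDA_ARCH__` is only defined while compiling device code, so runtime dispatching has to happen on the host by querying the compute capability of the current device.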
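
To make the two bullets above concrete, here is a minimal Python sketch of what the arch list expands to and how a host-side dispatcher could choose a target. Both `gencode_flags` and `pick_target` are hypothetical helper names, not part of PyTorch or this repo. The sketch assumes plain targets only (no `+PTX` entries and no architecture-specific `a` variants) and relies on the CUDA rule that a binary built for compute capability X.y runs on devices X.z with z >= y.

```python
def gencode_flags(arch_list: str) -> list[str]:
    """Expand an arch list such as "9.0;10.0" into nvcc -gencode flags."""
    flags = []
    # The list may be separated by semicolons or spaces.
    for entry in arch_list.replace(" ", ";").split(";"):
        entry = entry.strip()
        if not entry:
            continue
        sm = entry.replace(".", "")  # "9.0" -> "90", "10.0" -> "100"
        flags.append(f"-gencode=arch=compute_{sm},code=sm_{sm}")
    return flags


def pick_target(capability, built):
    """Pick the best compiled target for a device, or None if none fits.

    capability: (major, minor) of the current device.
    built: iterable of (major, minor) targets present in the fat binary.
    """
    major, minor = capability
    # Same major version, and a minor version no newer than the device.
    candidates = [t for t in built if t[0] == major and t[1] <= minor]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    print(gencode_flags("9.0;10.0"))
    print(pick_target((10, 0), [(9, 0), (10, 0)]))
```

In a real extension the capability tuple would come from `torch.cuda.get_device_capability()`, and a `None` result would be the place to raise a clear "unsupported GPU" error.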

**2. Dynamic Wheel Resolution via `setup.py` Interception**
* **CI/CD:** Build wheel matrices for `cu12.9` and `cu13.0` (containing the fat binaries) and upload them to **GitHub Releases**. Publish only the source distribution to PyPI.
* **Smart Installation:** Override the `bdist_wheel` command in `setup.py`. When a user runs `pip install cula`, pip fetches the source distribution from PyPI and invokes `bdist_wheel` to build it. The override intercepts that step, detects the local CUDA version and OS, and downloads the matching pre-compiled `.whl` from GitHub Releases instead of compiling.
* **Fallback:** If the download fails (for example, no matching wheel exists or the network is unavailable), `setup.py` falls back to compiling the C++/CUDA extension locally.
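
The resolution and fallback steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the final implementation: the helper names (`cuda_tag`, `wheel_url`, `fetch_or_build`), the wheel naming scheme, and the release URL layout are all hypothetical, and the base URL is supplied by the caller.

```python
import re
import urllib.request


def cuda_tag(nvcc_output: str) -> str:
    """Turn `nvcc --version` output into a tag such as "cu129"."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find a CUDA release in nvcc output")
    return f"cu{match.group(1)}{match.group(2)}"


def wheel_url(base_url, version, cuda, python_tag, platform_tag):
    """Build the download URL for a pre-compiled wheel.

    Assumed (hypothetical) layout: <base_url>/v<version>/<wheel name>,
    with the CUDA tag carried as a local version suffix.
    """
    name = f"cula-{version}+{cuda}-{python_tag}-{python_tag}-{platform_tag}.whl"
    return f"{base_url}/v{version}/{name}"


def fetch_or_build(url, dest, build, download=urllib.request.urlretrieve):
    """Try to download the wheel; on failure, run the local build instead.

    `build` is a zero-argument callable that performs the normal compile.
    `download` is injectable so the logic can be exercised offline.
    """
    try:
        download(url, dest)
        return "downloaded"
    except OSError:
        # urllib's URLError and HTTPError are subclasses of OSError, so this
        # covers missing wheels (HTTP 404) as well as network failures.
        build()
        return "built"


if __name__ == "__main__":
    tag = cuda_tag("Cuda compilation tools, release 12.9, V12.9.41")
    print(wheel_url("https://example.invalid/releases/download",
                    "0.1.0", tag, "cp312", "linux_x86_64"))
```

In `setup.py`, `fetch_or_build` would be called from the `run()` method of a `bdist_wheel` subclass registered through `cmdclass`, with `build` bound to the parent class's `run()` so the fallback is the ordinary compile path.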
