# [Feature] Integrate Lightning SM90 Implementation via CuTe DSL #76

## Background

FlashInfer has a production-ready Lightning Attention SM90 (Hopper) implementation. For API completeness, we could port it into cuLA using **CuTe DSL**, along with corresponding benchmarks and unit tests.

Upstream PR: https://github.com/flashinfer-ai/flashinfer/pull/2276 (authored by guangyunh-nv)

> **Note**: The kernel design and algorithm credits belong to the original authors (guangyunh-nv). This issue tracks a porting effort, not an original implementation.

## Upstream Implementation Overview

FlashInfer PR #2276 implements Lightning Attention prefill on the Hopper architecture:

- **SM90 Optimized**: TMA warp-specialized architecture with asynchronous copy and warp group scheduling
- **Optional gating** and final-state output
- **High-level Python API** with chunked prefill support
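
To make the "final-state output" and "chunked prefill" items concrete, here is a minimal pure-Python sketch of a decayed linear-attention recurrence for a single head. It is an illustration under stated assumptions, not the upstream kernel or its API: the scalar `decay`, the function names, and the exact recurrence are hypothetical stand-ins, and the upstream gating may take a different form. The point it shows is that returning the final state lets a long sequence be processed chunk by chunk with the same result as a single pass.

```python
# Hypothetical reference for illustration only; names and the exact
# recurrence are assumptions, not the FlashInfer or cuLA API.

def linear_attention_prefill(q, k, v, decay, state=None):
    """Run a decayed linear-attention recurrence over T tokens.

    q, k: lists of T vectors of length Dk; v: list of T vectors of length Dv.
    decay: scalar multiplier applied to the state at every step (assumed form).
    state: optional Dk x Dv matrix carried over from a previous chunk.
    Returns (outputs, final_state).
    """
    dk, dv = len(k[0]), len(v[0])
    if state is None:
        state = [[0.0] * dv for _ in range(dk)]
    else:
        state = [row[:] for row in state]  # do not mutate the caller's state
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        # S_t = decay * S_{t-1} + k_t^T v_t
        for i in range(dk):
            for j in range(dv):
                state[i][j] = decay * state[i][j] + k_t[i] * v_t[j]
        # o_t = q_t S_t
        outputs.append(
            [sum(q_t[i] * state[i][j] for i in range(dk)) for j in range(dv)]
        )
    return outputs, state


def chunked_prefill(q, k, v, decay, chunk_size):
    """Process the sequence in chunks, passing the final state forward."""
    outputs, state = [], None
    for start in range(0, len(q), chunk_size):
        end = start + chunk_size
        out, state = linear_attention_prefill(
            q[start:end], k[start:end], v[start:end], decay, state
        )
        outputs.extend(out)
    return outputs, state
```

A ported unit test could compare the kernel output against a reference of this shape, both for a single pass and for a chunked run that feeds each chunk's final state into the next.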

## Task Checklist

- [ ] **Port kernels via CuTe DSL**: Rewrite the upstream C++ Hopper lightning kernels using CuTe DSL, following cuLA's build system and coding conventions
- [ ] **Adapt Python interface**: Align with cuLA's existing API style and expose the Lightning SM90 Python bindings
- [ ] **Add benchmarks**: Add performance benchmarks under cuLA's framework, covering the same settings as the KDA benchmarks:
  - Fixed-length: `B={1,2}, T={512, 1024, 4096, 8192, 16384}`, `H=64, D=128, dtype=bf16`
  - Varlen: `num_seqs={10, 20}, total_len={4096, 8192, 16384}`, distributions: uniform / random / skewed
- [ ] **Performance validation**: Verify that the CuTe DSL version achieves performance comparable to the upstream C++ version across all benchmark settings above
- [ ] **Add unit tests**: Port the reference implementation and test cases from upstream `tests/gdn/` to ensure correctness
- [ ] **Update documentation**: Document the newly added Lightning SM90 support
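
As a starting point for the benchmark item, the settings listed above could be enumerated roughly as follows. This is a sketch only: the function names are hypothetical and not part of cuLA's benchmark framework, and the precise meaning of the uniform / random / skewed distributions is an assumption here (the KDA benchmarks remain the authority on how lengths are drawn).

```python
import itertools
import random

# Fixed-length settings, copied from the checklist.
FIXED_BATCH = (1, 2)
FIXED_SEQLEN = (512, 1024, 4096, 8192, 16384)
NUM_HEADS, HEAD_DIM, DTYPE = 64, 128, "bf16"

# Varlen settings, copied from the checklist.
VARLEN_NUM_SEQS = (10, 20)
VARLEN_TOTAL_LEN = (4096, 8192, 16384)
VARLEN_DISTS = ("uniform", "random", "skewed")


def fixed_configs():
    """All fixed-length (B, T) combinations with the shared H, D, dtype."""
    return [
        dict(B=b, T=t, H=NUM_HEADS, D=HEAD_DIM, dtype=DTYPE)
        for b, t in itertools.product(FIXED_BATCH, FIXED_SEQLEN)
    ]


def split_lengths(total_len, num_seqs, dist, seed=0):
    """Split total_len tokens into num_seqs positive lengths.

    The weighting used for each distribution is an assumption made for
    illustration; it may differ from the KDA benchmark definitions.
    """
    rng = random.Random(seed)
    if dist == "uniform":
        weights = [1.0] * num_seqs
    elif dist == "random":
        weights = [rng.random() + 1e-3 for _ in range(num_seqs)]
    elif dist == "skewed":
        # Assumed shape: one long sequence takes about half of the tokens.
        weights = [float(num_seqs - 1)] + [1.0] * (num_seqs - 1)
    else:
        raise ValueError(f"unknown distribution: {dist}")
    # Give every sequence one token, then share the rest by weight.
    spare = total_len - num_seqs
    scale = sum(weights)
    lengths = [1 + int(spare * w / scale) for w in weights]
    lengths[0] += total_len - sum(lengths)  # rounding remainder
    return lengths


def cu_seqlens(lengths):
    """Cumulative offsets in the usual varlen layout: [0, l0, l0+l1, ...]."""
    return [0] + list(itertools.accumulate(lengths))


def varlen_configs():
    """All (num_seqs, total_len, distribution) combinations with offsets."""
    configs = []
    for n, total, dist in itertools.product(
        VARLEN_NUM_SEQS, VARLEN_TOTAL_LEN, VARLEN_DISTS
    ):
        lengths = split_lengths(total, n, dist)
        configs.append(
            dict(num_seqs=n, total_len=total, dist=dist,
                 cu_seqlens=cu_seqlens(lengths))
        )
    return configs
```

Running the CuTe DSL port and the upstream C++ version over the same generated configurations would keep the performance comparison like-for-like.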

## References

- Upstream PR: [flashinfer-ai/flashinfer#2276](https://github.com/flashinfer-ai/flashinfer/pull/2276)
- Related Issue: [flashinfer-ai/flashinfer#1690](https://github.com/flashinfer-ai/flashinfer/issues/1690)

