[Feature] Integrate Lightning SM90 Implementation via CuTe DSL
Background
FlashInfer has a production-ready Lightning Attention SM90 (Hopper) implementation. For API completeness, we could port it into cuLA using CuTe DSL, along with corresponding benchmarks and unit tests.
Note: The kernel design and algorithm credits belong to the original authors (guangyunh-nv). This issue tracks a porting effort, not an original implementation.
Upstream Implementation Overview
FlashInfer PR #2276 implements Lightning Attention Prefill on the Hopper architecture:
SM90-optimized: TMA warp-specialized architecture with asynchronous copy and warp group scheduling
Optional gating and final-state output
High-level Python API with chunked prefill support
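As a rough illustration of what "optional gating", "final-state output", and "chunked prefill" mean for a correctness reference, here is a minimal NumPy sketch. It assumes the commonly used formulation of Lightning Attention as causal linear attention with a scalar per-head decay; this issue does not spell out the upstream kernel's exact semantics (gating convention, scaling, layout), so those may differ. The names `lightning_ref`, `lightning_chunked`, `decay`, `gate`, and `initial_state` are hypothetical and are not part of the FlashInfer or cuLA API.

```python
import numpy as np


def lightning_ref(q, k, v, decay, gate=None, initial_state=None):
    """Naive per-token recurrence for a single head (hypothetical reference).

    q, k: (T, D); v: (T, Dv); decay: scalar in (0, 1].
    gate: optional (T, Dv) multiplier applied to the output (assumed convention).
    Returns (output of shape (T, Dv), final state of shape (D, Dv)).
    """
    T, D = q.shape
    Dv = v.shape[1]
    state = np.zeros((D, Dv)) if initial_state is None else initial_state.copy()
    out = np.empty((T, Dv))
    for t in range(T):
        # Decay the running key-value state, then add the current outer product.
        state = decay * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state
    if gate is not None:
        out = out * gate
    return out, state


def lightning_chunked(q, k, v, decay, chunk=64, gate=None):
    """Chunked prefill: process `chunk` tokens at a time, carrying the state."""
    state, outs = None, []
    for start in range(0, q.shape[0], chunk):
        end = start + chunk
        g = None if gate is None else gate[start:end]
        o, state = lightning_ref(q[start:end], k[start:end], v[start:end],
                                 decay, gate=g, initial_state=state)
        outs.append(o)
    return np.concatenate(outs), state


# Usage: chunked prefill should reproduce the single-pass result.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((100, 8)) for _ in range(3))
full, s_full = lightning_ref(q, k, v, decay=0.9)
chunked, s_chunked = lightning_chunked(q, k, v, decay=0.9, chunk=32)
assert np.allclose(full, chunked) and np.allclose(s_full, s_chunked)
```

A reference like this is only useful for small shapes. The ported kernel would be compared against it with tolerances appropriate for bf16.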
Task Checklist
Port kernels via CuTe DSL: Rewrite the upstream C++ Hopper lightning kernels using CuTe DSL, following cuLA's build system and coding conventions
Adapt Python interface: Align with cuLA's existing API style and expose the Lightning SM90 Python bindings
Add benchmarks: Add performance benchmarks under cuLA's framework, covering the same settings as the KDA benchmarks:
B={1, 2}, T={512, 1024, 4096, 8192, 16384}, H=64, D=128, dtype=bf16
num_seqs={10, 20}, total_len={4096, 8192, 16384}, distributions: uniform / random / skewed
Performance validation: Verify that the CuTe DSL version achieves performance comparable to the upstream C++ version across all benchmark settings above
Add unit tests: Port the reference implementation and test cases from upstream tests/gdn/ to ensure correctness
Update documentation: Document the newly added Lightning SM90 support
References
Upstream PR: flashinfer-ai/flashinfer#2276 from guangyunh-nv
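The benchmark settings named in the checklist can be enumerated programmatically. The sketch below is one possible way to build the two grids and to split `total_len` tokens across `num_seqs` sequences. The issue only names the three distributions, so the meanings given to "uniform", "random", and "skewed" here are assumptions, and `grid` and `split_lengths` are hypothetical helpers rather than part of cuLA's benchmark framework.

```python
import itertools
import random

# Settings copied from the checklist.
FIXED = {"B": [1, 2], "T": [512, 1024, 4096, 8192, 16384],
         "H": [64], "D": [128], "dtype": ["bf16"]}
VARLEN = {"num_seqs": [10, 20], "total_len": [4096, 8192, 16384],
          "distribution": ["uniform", "random", "skewed"]}


def grid(space):
    """Return every combination of the given settings as a list of dicts."""
    keys = list(space)
    return [dict(zip(keys, vals)) for vals in itertools.product(*space.values())]


def split_lengths(num_seqs, total_len, distribution, seed=0):
    """Split total_len tokens into num_seqs positive lengths (assumed semantics)."""
    rng = random.Random(seed)
    if distribution == "uniform":
        weights = [1.0] * num_seqs                      # equal lengths
    elif distribution == "random":
        weights = [rng.random() + 1e-3 for _ in range(num_seqs)]
    elif distribution == "skewed":
        weights = [1.0 / (i + 1) for i in range(num_seqs)]  # a few long, many short
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    # Reserve one token per sequence, then share the rest by weight.
    scale = (total_len - num_seqs) / sum(weights)
    lengths = [1 + int(w * scale) for w in weights]
    lengths[0] += total_len - sum(lengths)  # rounding remainder goes to the first sequence
    return lengths


fixed_cases = grid(FIXED)
varlen_cases = grid(VARLEN)
print(len(fixed_cases), len(varlen_cases))
```

Fixing the seed keeps the variable-length cases reproducible between the CuTe DSL port and the upstream C++ baseline, so both are timed on identical sequence layouts.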