FlashRT NVFP4

Kernel card for FlashRT NVFP4 layout helpers and planned fused Blackwell low-bit GEMM epilogues.

The first buildable slice exposes NVFP4 scale-factor layout conversion.

Features

Scale-factor layout helpers: nvfp4_sf_linear_to_swizzled and nvfp4_sf_swizzled_bytes.
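To make the layout conversion concrete, here is a minimal NumPy reference sketch. It assumes the swizzled layout is the 128 x 4 tile arrangement that cuBLAS documents for block-scaled scale factors, with one byte per scale factor; this card does not confirm that FlashRT uses exactly this layout. The names linear_to_swizzled and swizzled_bytes are hypothetical stand-ins, not the signatures of the kernel's own functions.

```python
import numpy as np

TILE_ROWS = 128   # assumed tile height
TILE_COLS = 4     # assumed tile width
TILE_ELEMS = TILE_ROWS * TILE_COLS  # 512 scale factors per tile


def swizzled_bytes(rows, cols):
    """Size in bytes of the padded swizzled buffer (hypothetical helper).

    Assumes one byte per scale factor and padding up to whole 128 x 4 tiles.
    """
    row_tiles = -(-rows // TILE_ROWS)  # ceiling division
    col_tiles = -(-cols // TILE_COLS)
    return row_tiles * col_tiles * TILE_ELEMS


def linear_to_swizzled(sf):
    """Reorder a row-major (rows, cols) scale-factor matrix (hypothetical helper).

    Inside each tile, element (r, c) lands at offset
    (r % 32) * 16 + (r // 32) * 4 + c. Tiles are stored row-major.
    """
    rows, cols = sf.shape
    row_tiles = -(-rows // TILE_ROWS)
    col_tiles = -(-cols // TILE_COLS)

    # Zero-pad to whole tiles.
    padded = np.zeros((row_tiles * TILE_ROWS, col_tiles * TILE_COLS), dtype=sf.dtype)
    padded[:rows, :cols] = sf

    # Split into tiles: (row_tile, col_tile, 128, 4).
    tiles = padded.reshape(row_tiles, TILE_ROWS, col_tiles, TILE_COLS)
    tiles = tiles.transpose(0, 2, 1, 3)

    # Within a tile, view the 128 rows as 4 groups of 32 and interleave the groups.
    groups = tiles.reshape(-1, 4, 32, TILE_COLS).transpose(0, 2, 1, 3)
    return np.ascontiguousarray(groups).reshape(-1)
```

For a 256 x 8 scale-factor matrix this produces four tiles (2048 entries). A reference like this can be useful for checking a GPU conversion result element by element on small inputs, provided the layout assumption holds.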

Planned Features

CUDA 12.8+ SM120 NVFP4 GEMM with fused bias+GELU and BF16 output.
CUDA 12.8+ SM120 NVFP4 GEMM with fused bias+GELU and FP4 output quantization.
Stream-K down-projection GEMM with optional bias.
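As a rough picture of what the two planned fused epilogues would compute, the following NumPy sketch applies bias and GELU to a GEMM result and then quantizes it onto the FP4 (E2M1) value grid in blocks of 16. Several details are assumptions not stated in this card: the tanh form of GELU, FP32 block scales (NVFP4 stores block scales in FP8 E4M3 and also uses a per-tensor scale), and nearest-value rounding. All function names are hypothetical, and the sketch keeps values in float32 rather than BF16 or packed 4-bit storage.

```python
import math
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def gelu_tanh(x):
    """tanh approximation of GELU (assumed variant)."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x ** 3)))


def gemm_bias_gelu(a, b, bias):
    """Unfused reference: GELU(a @ b + bias)."""
    return gelu_tanh(a @ b + bias)


def quantize_fp4_blocks(x, block=16):
    """Quantize each run of `block` values along the last axis to the E2M1 grid.

    Returns the quantized values (still float32) and one FP32 scale per block.
    """
    rows, cols = x.shape
    assert cols % block == 0, "columns must be a multiple of the block size"
    blocks = x.reshape(rows, cols // block, block)

    # Map the largest magnitude in each block to 6.0, the largest E2M1 value.
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    scale = np.where(amax > 0, amax / 6.0, 1.0).astype(np.float32)
    scaled = blocks / scale

    # Snap each magnitude to the nearest grid value; ties go to the smaller one here.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q.reshape(rows, cols).astype(np.float32), scale.squeeze(-1)
```

Multiplying each quantized block by its scale gives the dequantized approximation, which is a convenient way to measure the error a fused FP4-output epilogue would introduce relative to the BF16-output variant.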

Status

The fused GEMM epilogues will follow once the CUTLASS dependency has been isolated.

Current validation status is recorded in VALIDATION.md.

Current v1 build scope is CUDA 12.8+ SM120.

See examples/nvfp4_scale_factor_layout.py for a minimal layout-conversion example.

Tags: flashrt, kernel, cuda, nvfp4, quantization

Supported hardware

CUDA 12.0

Category    Device                Memory
DGX Spark   GB10                  128GB
GPU         RTX PRO 6000 WS       96GB
GPU         RTX PRO 6000 Max-Q    96GB
GPU         RTX PRO 5000          48GB
GPU         RTX PRO 4500 WS       32GB
GPU         RTX PRO 4000          24GB
GPU         RTX PRO 4000 SFF      24GB
GPU         RTX PRO 2000          16GB
RTX         RTX 5090              32GB
RTX         RTX 5090 D            32GB
RTX         RTX 5090 Mobile       24GB
RTX         RTX 5080              16GB
RTX         RTX 5080 Mobile       16GB
RTX         RTX 5070              12GB
RTX         RTX 5070 Mobile       8GB
RTX         RTX 5070 Ti           16GB
RTX         RTX 5070 Ti Mobile    12GB
RTX         RTX 5060 Ti           16GB
RTX         RTX 5060              8GB
RTX         RTX 5060 Mobile       8GB

OS: linux

Arch: x86_64