FlashRT NVFP4

Kernel card for FlashRT NVFP4 layout helpers and planned fused Blackwell low-bit GEMM epilogues.

The first buildable slice exposes NVFP4 scale-factor layout conversion.

Features

  • NVFP4 scale-factor layout helpers: nvfp4_sf_linear_to_swizzled and nvfp4_sf_swizzled_bytes.
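
As a rough illustration of what a linear-to-swizzled scale-factor conversion involves, here is a pure-Python sketch. It assumes the 128x4 tiled layout commonly used for block-scaled GEMM operands on Blackwell (one scale byte per 16-element block, tiles of 128 rows by 4 scale columns, padded to whole tiles). The function names and signatures below are hypothetical and are not the FlashRT API. Whether FlashRT uses exactly this layout is also an assumption, so treat examples/nvfp4_scale_factor_layout.py as the authoritative reference.

```python
# Hypothetical reference sketch; NOT the FlashRT API.
# Assumed layout: scale factors are stored in tiles of 128 rows x 4 columns
# (512 bytes per tile), with rows and columns padded up to whole tiles.

SF_TILE_ROWS = 128
SF_TILE_COLS = 4
SF_TILE_BYTES = SF_TILE_ROWS * SF_TILE_COLS  # 512


def _round_up(value, multiple):
    return (value + multiple - 1) // multiple * multiple


def sf_swizzled_bytes(rows, sf_cols):
    """Size in bytes of the padded swizzled buffer (one byte per scale)."""
    return _round_up(rows, SF_TILE_ROWS) * _round_up(sf_cols, SF_TILE_COLS)


def sf_swizzled_offset(row, col, sf_cols):
    """Byte offset of scale (row, col) inside the swizzled buffer."""
    col_tiles = _round_up(sf_cols, SF_TILE_COLS) // SF_TILE_COLS
    tile = (row // SF_TILE_ROWS) * col_tiles + (col // SF_TILE_COLS)
    within = (row % 32) * 16 + ((row % SF_TILE_ROWS) // 32) * 4 + (col % SF_TILE_COLS)
    return tile * SF_TILE_BYTES + within


def sf_linear_to_swizzled(sf, rows, sf_cols):
    """Scatter a row-major bytes object of scales into the swizzled layout."""
    if len(sf) != rows * sf_cols:
        raise ValueError("sf must hold rows * sf_cols bytes")
    out = bytearray(sf_swizzled_bytes(rows, sf_cols))  # padding stays zero
    for r in range(rows):
        for c in range(sf_cols):
            out[sf_swizzled_offset(r, c, sf_cols)] = sf[r * sf_cols + c]
    return bytes(out)
```

In this sketch sf_cols is the number of scale factors per row, i.e. the K dimension divided by the 16-element NVFP4 block size. Even a single scale occupies one full 512-byte tile after padding.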

Planned Features

  • CUDA 12.8+ SM120 NVFP4 GEMM with fused bias+GELU and BF16 output.
  • CUDA 12.8+ SM120 NVFP4 GEMM with fused bias+GELU and FP4 output quantization.
  • Stream-K down-projection GEMM with optional bias.
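
To make the planned epilogues more concrete, the following pure-Python sketch shows the arithmetic a fused bias+GELU epilogue with FP4 output quantization would need to reproduce. It is only a numerical reference under stated assumptions: the tanh approximation of GELU, a 16-element quantization block, and a per-block scale kept as a plain float (real NVFP4 stores the block scale in FP8 and the kernels may use a different GELU variant or rounding mode). All names here are hypothetical.

```python
import math

# Magnitudes representable in FP4 E2M1 (sign is stored separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed NVFP4 block size: one scale per 16 elements


def gelu_tanh(x):
    """tanh approximation of GELU (assumed variant)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def bias_gelu(acc_row, bias):
    """Epilogue math for one output row: add bias, then apply GELU."""
    return [gelu_tanh(a + b) for a, b in zip(acc_row, bias)]


def quantize_block_e2m1(block):
    """Quantize one block to E2M1 values with a shared scale.

    Returns (scale, quantized) where value ~= scale * quantized[i].
    Ties go to the smaller magnitude here; hardware rounding may differ.
    """
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the block maximum to 6.0
    quantized = []
    for v in block:
        mag = min(E2M1_VALUES, key=lambda e: abs(abs(v) / scale - e))
        quantized.append(math.copysign(mag, v))
    return scale, quantized
```

For the BF16-output variant only bias_gelu applies. For the FP4-output variant each run of BLOCK outputs would additionally pass through quantize_block_e2m1.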

Status

The fused GEMM epilogues will follow once the CUTLASS dependency has been isolated.

Current validation status is recorded in VALIDATION.md.

Current v1 build scope is CUDA 12.8+ SM120.

See examples/nvfp4_scale_factor_layout.py for a minimal layout-conversion example.

Supported Hardware

CUDA 12.0

  • DGX Spark: GB10 (128GB).
  • GPU: RTX PRO 6000 WS (96GB), RTX PRO 6000 Max-Q (96GB), RTX PRO 5000 (48GB), RTX PRO 4500 WS (32GB), RTX PRO 4000 (24GB), RTX PRO 4000 SFF (24GB), RTX PRO 2000 (16GB).
  • RTX: RTX 5090 (32GB), RTX 5090 D (32GB), RTX 5090 Mobile (24GB), RTX 5080 (16GB), RTX 5080 Mobile (16GB), RTX 5070 (12GB), RTX 5070 Mobile (8GB), RTX 5070 Ti (16GB), RTX 5070 Ti Mobile (12GB), RTX 5060 Ti (16GB), RTX 5060 (8GB), RTX 5060 Mobile (8GB).
  • OS: linux.
  • Arch: x86_64.