cuBLASLt Grouped GEMM Documentation

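Grouped GEMM requires an algorithm that supports the grouped structure. Call cublasLtMatmulAlgoGetHeuristic to query candidate algorithms for the problem; results come back ordered by estimated performance, so the first entry is usually the one to try.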

💡 Create a cublasLtMatmulPreference_t, set its workspace limit with cublasLtMatmulPreferenceSetAttribute (CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES), then call cublasLtMatmulAlgoGetHeuristic. The grouped version reuses plans across problems for speed.

#CUDA #cuBLASLt #GPUComputing #GEMM #LLM #PerformanceOptimization

Working implementation samples can be found in the NVIDIA CUDALibrarySamples GitHub repository, specifically under the cuBLASLt directory.

Grouped GEMM vs. Batched GEMM

| | Batched GEMM (cublasGemmBatchedEx) | Grouped GEMM (cublasLtMatmul) |
|---|---|---|
| Dimensions | All GEMMs must have the same dimensions. | Each GEMM can have unique dimensions. |
| Overhead | Lower launch overhead than individual calls. | Optimized for disparate problem sizes in one kernel. |
| Flexibility | Rigid layout and data types. | High flexibility in layouts, epilogues, and precisions. |

How to Implement
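As a point of reference for the batched versus grouped comparison, the following is a minimal CPU sketch of what a grouped GEMM computes. It is not cuBLASLt code and makes no claim about how the library implements the operation. The `GemmProblem` struct and `groupedGemmReference` function are hypothetical names introduced for illustration, and row-major storage is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor for one GEMM in a group: D = A * B,
// with A (m x k), B (k x n) and D (m x n), all stored row-major.
struct GemmProblem {
    int m, n, k;
    std::vector<float> a, b, d;
};

// CPU reference: every problem carries its own (m, n, k).
// A batched GEMM would instead use one (m, n, k) shared by all entries.
void groupedGemmReference(std::vector<GemmProblem>& group) {
    for (GemmProblem& p : group) {
        // Check that the input buffers match the declared shapes.
        assert(p.a.size() == static_cast<std::size_t>(p.m) * p.k);
        assert(p.b.size() == static_cast<std::size_t>(p.k) * p.n);
        p.d.assign(static_cast<std::size_t>(p.m) * p.n, 0.0f);
        for (int i = 0; i < p.m; ++i) {
            for (int l = 0; l < p.k; ++l) {
                const float aval = p.a[i * p.k + l];
                for (int j = 0; j < p.n; ++j) {
                    p.d[i * p.n + j] += aval * p.b[l * p.n + j];
                }
            }
        }
    }
}
```

A reference like this can also serve as a correctness check: run the same problems through the GPU path and compare the outputs within a floating-point tolerance.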

Grouped GEMM executes multiple matrix multiplications, each with its own dimensions, in a single kernel launch. This is particularly useful for accelerating models like Mixture-of-Experts (MoE), where each "expert" may process a different number of tokens.
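To illustrate why MoE layers produce differently sized problems, here is a small sketch that derives one GEMM shape per expert from a token-to-expert assignment. It assumes top-1 routing and a single weight matrix per expert of size hiddenIn x hiddenOut. The names `ProblemShape` and `moeProblemShapes` are hypothetical and do not come from cuBLASLt.

```cpp
#include <cassert>
#include <vector>

// Hypothetical shape record for one expert's GEMM:
// (m x k) activations times (k x n) expert weights.
struct ProblemShape {
    int m, n, k;
};

// expertOfToken[t] is the expert chosen for token t (assumed top-1 routing).
// Returns one shape per expert; m varies with how many tokens were routed there.
std::vector<ProblemShape> moeProblemShapes(const std::vector<int>& expertOfToken,
                                           int numExperts,
                                           int hiddenIn,
                                           int hiddenOut) {
    std::vector<int> tokensPerExpert(numExperts, 0);
    for (int e : expertOfToken) {
        assert(e >= 0 && e < numExperts);
        ++tokensPerExpert[e];
    }
    std::vector<ProblemShape> shapes;
    shapes.reserve(numExperts);
    for (int e = 0; e < numExperts; ++e) {
        shapes.push_back({tokensPerExpert[e], hiddenOut, hiddenIn});
    }
    return shapes;
}
```

Only m differs between experts here, since n and k are fixed by the layer width. Because token counts change from batch to batch, a batched GEMM with one shared shape would need padding, whereas a grouped GEMM can take each expert's shape as it is.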