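Note: In recent CUDA versions, true variable-size grouped GEMM is usually reached through specific API paths, for example by calling cublasLtMatmulAlgoGetHeuristic and checking whether it returns an algorithm that supports the grouped configuration. If every problem has the same size, a regular batched GEMM is enough: set the batch count (CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT) on the matrix layout descriptors.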
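To make the difference between the two cases concrete, here is a minimal CPU-only reference sketch of what each one computes. It is not cuBLASLt code: GemmProblem, gemm_reference, grouped_gemm_reference and strided_batched_gemm_reference are hypothetical names invented for this illustration, and the row-major, tightly packed layout is an assumption chosen to keep the indexing short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-problem descriptor (NOT a cuBLASLt type).
// Assumes row-major, tightly packed matrices.
struct GemmProblem {
    int m, n, k;     // A is m x k, B is k x n, C is m x n
    const float* A;
    const float* B;
    float* C;
};

// Reference GEMM for a single problem: C = A * B.
void gemm_reference(const GemmProblem& p) {
    for (int i = 0; i < p.m; ++i) {
        for (int j = 0; j < p.n; ++j) {
            float acc = 0.0f;
            for (int l = 0; l < p.k; ++l) {
                acc += p.A[i * p.k + l] * p.B[l * p.n + j];
            }
            p.C[i * p.n + j] = acc;
        }
    }
}

// Variable-size grouped GEMM: every problem carries its own shape
// and its own pointers, so the caller passes an array of descriptors.
void grouped_gemm_reference(const std::vector<GemmProblem>& problems) {
    for (const GemmProblem& p : problems) {
        gemm_reference(p);
    }
}

// Uniform batched GEMM: one shape, one base pointer per operand,
// a batch count, and a fixed stride between consecutive matrices.
void strided_batched_gemm_reference(int m, int n, int k,
                                    const float* A, const float* B, float* C,
                                    int batch_count) {
    const std::size_t strideA = static_cast<std::size_t>(m) * k;
    const std::size_t strideB = static_cast<std::size_t>(k) * n;
    const std::size_t strideC = static_cast<std::size_t>(m) * n;
    for (int b = 0; b < batch_count; ++b) {
        GemmProblem p{m, n, k, A + b * strideA, B + b * strideB, C + b * strideC};
        gemm_reference(p);
    }
}
```

In the uniform case a single shape plus a batch count and strides describes the whole batch, which is why a batch count on the layout descriptor is sufficient there. The grouped case needs one shape and one set of pointers per problem, since no single stride can describe matrices of different sizes.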
Unlike standard GEMM APIs that take a single set of matrix pointers, the grouped GEMM interface typically requires arrays of metadata: