LLM Inference, AI Infra, CUDA
Tsinghua University
- https://www.zhihu.com/people/mu-zi-zhi-6-28
- https://bruce-lee-ly.medium.com
Pinned
- cuda_auto_tune: NCU-driven iterative optimization workflow for CUDA/CUTLASS/Triton/CuTe DSL kernels.
- decoding_attention: Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference.
- cuda_hgemm: Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions.
- cuda_hgemv: Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
- flash_attention_inference: Performance of the C++ interface of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios.
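Optimized GEMM kernels like those in cuda_hgemm are conventionally validated against a naive scalar reference. A minimal host-side sketch of such a reference (an illustration, not code from the repositories above; `float` stands in for `half`, and the function name is hypothetical) looks like:

```cpp
#include <vector>
#include <cstddef>

// Naive reference GEMM: C = A * B, with A (M x K), B (K x N), C (M x N),
// all row-major. Tensor-core HGEMM kernels are typically checked against
// a simple loop like this, accumulating in fp32 as WMMA/MMA usually does.
void gemm_ref(const std::vector<float>& A, const std::vector<float>& B,
              std::vector<float>& C,
              std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;  // fp32 accumulator
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];
            }
            C[m * N + n] = acc;
        }
    }
}
```

The optimized kernels trade this O(M*N*K) scalar loop for tiled, tensor-core matrix fragments, but must produce (numerically close to) the same result.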