NVIDIA cuDNN
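The NVIDIA® CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly optimized implementations of standard operations such as forward and backward convolution, attention, matrix multiplication (matmul), pooling, and normalization.
Download cuDNN
Download cuDNN Library
Download cuDNN Frontend (GitHub)
cuDNN can also be downloaded through one of the package managers below.
Quick install with conda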
conda install nvidia::cudnn cuda-version=12
Installs the cuDNN library
Quick pull with Docker
docker pull nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04
Installs the cuDNN library
Quick install with pip
pip install nvidia-cudnn
Installs the cuDNN library
pip install nvidia-cudnn-frontend
Installs the cuDNN Frontend API
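After installing with pip, a quick way to confirm that the frontend is visible to Python is to look for the `cudnn` module. The following is a minimal sketch, not an official verification step: the helper name `cudnn_frontend_status` is hypothetical, and the sketch assumes the frontend package exposes `cudnn.backend_version()`, which the frontend samples use to report the loaded cuDNN library version.

```python
import importlib.util


def cudnn_frontend_status():
    """Describe whether the `cudnn` Python module can be found and imported."""
    # find_spec returns None when no importable module named "cudnn" exists.
    if importlib.util.find_spec("cudnn") is None:
        return "cudnn frontend not installed"
    try:
        import cudnn

        # Assumption: backend_version() reports the cuDNN library version in use.
        return f"cudnn frontend found, backend version {cudnn.backend_version()}"
    except Exception as exc:  # for example, the cuDNN shared library is missing
        return f"cudnn frontend found but failed to load: {exc}"


print(cudnn_frontend_status())
```

On a machine without the frontend this prints the "not installed" message instead of raising, so it is safe to run before and after installation.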
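How cuDNN Works
Accelerated learning: cuDNN provides kernels optimized for Tensor Cores that deliver the best available performance on compute-bound operations, and it offers heuristics for choosing a suitable kernel for a given problem size.
Fusion support: cuDNN supports fusing compute-bound and memory-bound operations. Common generic fusion patterns are typically implemented through runtime kernel generation, while specialized fusion patterns use pre-written, optimized kernels.
Expressive op graph API: Users define a computation as a graph of operations on tensors. The cuDNN library offers both a direct C API and an open-source C++ frontend for convenience. Most users choose the frontend as their entry point to cuDNN.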
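To make the idea of fusion concrete, here is a small pure-Python sketch. It is not cuDNN code and says nothing about how cuDNN generates kernels. It only contrasts running matmul, bias add, and ReLU as three passes that each materialize a full intermediate matrix with a single pass that applies the bias and activation as each output element is produced. The function names `unfused` and `fused` and the per-row bias layout are illustrative assumptions.

```python
def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]


def unfused(a, b, bias):
    # Three separate passes; each one writes a full intermediate matrix.
    c = matmul(a, b)
    d = [[c[i][j] + bias[i] for j in range(len(c[0]))] for i in range(len(c))]
    return [[max(0.0, x) for x in row] for row in d]


def fused(a, b, bias):
    # One pass: bias and ReLU are applied to each element right after it is
    # accumulated, so no intermediate matrix is stored.
    out = []
    for i, row in enumerate(a):
        out_row = []
        for j in range(len(b[0])):
            acc = sum(row[p] * b[p][j] for p in range(len(b)))
            out_row.append(max(0.0, acc + bias[i]))
        out.append(out_row)
    return out


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[0.5, 1.0], [1.5, -1.0]]
bias = [0.25, -10.0]

assert fused(a, b, bias) == unfused(a, b, bias)
print(fused(a, b, bias))  # [[0.0, 3.25], [0.0, 0.0]]
```

Both versions return the same values; the difference is only in how much intermediate data is written and read back, which is the cost that fusing memory-bound operations into a compute-bound one avoids.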
cuDNN API Code Sample
This code uses the cuDNN integration with PyTorch to perform a batched matrix multiplication with bias.
import torch
import cudnn
# Prepare sample input data as PyTorch tensors on the GPU.
b, m, n, k = 1, 1024, 1024, 512
A = torch.randn(b, m, k, dtype=torch.float32, device="cuda")
B = torch.randn(b, k, n, dtype=torch.float32, device="cuda")
bias = torch.randn(b, m, 1, dtype=torch.float32, device="cuda")
result = torch.empty(b, m, n, dtype=torch.float32, device="cuda")
# Use the stateful Graph object in order to perform multiple matrix multiplications
# without replanning. The cudnn API allows us to fine-tune our operations by, for
# example, selecting a mixed-precision compute type.
graph = cudnn.pygraph(
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)
a_cudnn_tensor = graph.tensor_like(A)
b_cudnn_tensor = graph.tensor_like(B)
bias_cudnn_tensor = graph.tensor_like(bias)
c_cudnn_tensor = graph.matmul(name="matmul", A=a_cudnn_tensor, B=b_cudnn_tensor)
d_cudnn_tensor = graph.bias(name="bias", input=c_cudnn_tensor, bias=bias_cudnn_tensor)
# Mark the final tensor as a graph output so that it is written to memory instead of
# being treated as a virtual intermediate.
d_cudnn_tensor.set_output(True).set_data_type(cudnn.data_type.FLOAT)
# Build the graph. Heuristic mode A asks cuDNN to select execution plans for this
# problem.
graph.build([cudnn.heur_mode.A])
workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
# Execute the matrix multiplication.
graph.execute(
    {
        a_cudnn_tensor: A,
        b_cudnn_tensor: B,
        bias_cudnn_tensor: bias,
        d_cudnn_tensor: result,
    },
    workspace,
)
Example of an operation graph described by the cuDNN Graph API
The graph first runs ConvolutionFwd (forward convolution), then executes a directed acyclic graph (DAG) containing two operations.
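For readers who want to check what the matmul sample computes, the sketch below is a plain-Python reference for the same arithmetic on tiny inputs. It assumes that the bias of shape (b, m, 1) broadcasts across the n dimension, as the tensor shapes in the sample suggest. The function name `batched_matmul_bias` is hypothetical and is not part of cuDNN or PyTorch.

```python
def batched_matmul_bias(A, B, bias):
    """Reference: result[b][i][j] = sum_p A[b][i][p] * B[b][p][j] + bias[b][i][0]."""
    result = []
    for a_mat, b_mat, bias_mat in zip(A, B, bias):
        k, n = len(b_mat), len(b_mat[0])
        result.append([
            # bias_row holds a single value that is reused for every column j.
            [sum(a_row[p] * b_mat[p][j] for p in range(k)) + bias_row[0] for j in range(n)]
            for a_row, bias_row in zip(a_mat, bias_mat)
        ])
    return result


# Shapes: A is (1, 2, 3), B is (1, 3, 2), bias is (1, 2, 1).
A = [[[1, 2, 3], [4, 5, 6]]]
B = [[[1, 0], [0, 1], [1, 1]]]
bias = [[[10], [20]]]

print(batched_matmul_bias(A, B, bias))  # [[[14, 15], [30, 31]]]
```

With PyTorch available, the equivalent check on the real tensors would compare `result` against `torch.matmul(A, B) + bias` within a floating-point tolerance.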
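Documentation
The complete guide explains how to install and use the cuDNN frontend and backend.
Frontend Samples
The samples show how to use the Python and C++ frontend APIs.
Latest Release Blog
Learn how to use scaled dot product attention (SDPA) in cuDNN 9 to accelerate transformers.
cuDNN on NVIDIA Blackwell
Learn about the new and updated cuDNN APIs for NVIDIA Blackwell microscaling formats and how to program with them.
Key Features
Deep Neural Networks
Deep learning neural networks are widely used in computer vision, conversational AI, and recommendation systems, and have driven breakthroughs such as autonomous driving and intelligent voice assistants. NVIDIA's GPU-accelerated deep learning frameworks significantly shorten training time for these technologies, reducing multi-day training runs to a few hours.
cuDNN provides foundational libraries for high-performance, low-latency inference with deep neural networks in the cloud, on embedded devices, and in self-driving cars.
Accelerates compute-bound operations such as attention training/prefill, convolution, and matrix multiplication (matmul)
Optimizes memory-bound operations such as attention decode, pooling, softmax, normalization, activation, pointwise operations, and tensor transformations
Supports fusion of compute-bound and memory-bound operations
Provides a runtime fusion engine that generates kernels at runtime for common fusion patterns
Optimizes important specialized patterns such as fused attention
Applies heuristics to choose a suitable implementation for the given problem size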
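The distinction between compute-bound and memory-bound operations can be estimated with arithmetic intensity, the ratio of floating-point operations to bytes moved. The sketch below is a back-of-the-envelope calculation, not a cuDNN API. It assumes FP32 data (4 bytes per element), that each tensor is read or written exactly once, and uses the problem size from the matmul sample (m = n = 1024, k = 512).

```python
def matmul_intensity(m, n, k, bytes_per_elem=4):
    """FLOPs per byte for an (m, k) x (k, n) matmul, each tensor touched once."""
    flops = 2 * m * n * k  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def bias_add_intensity(m, n, bytes_per_elem=4):
    """FLOPs per byte for adding an (m, 1) bias to an (m, n) matrix."""
    flops = m * n  # one add per output element
    bytes_moved = bytes_per_elem * (m * n + m + m * n)  # read input and bias, write output
    return flops / bytes_moved


m, n, k = 1024, 1024, 512
print(matmul_intensity(m, n, k))            # 128.0
print(round(bias_add_intensity(m, n), 3))   # 0.125
```

Under these assumptions the matmul performs roughly a thousand times more arithmetic per byte than the bias add, which is why the former tends to be limited by compute throughput and the latter by memory bandwidth.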
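cuDNN Graph API and Fusion
The cuDNN Graph API is designed to express common computation patterns in deep learning. A cuDNN graph represents operations as nodes and tensors as edges, similar to a dataflow graph in a typical deep learning framework.
The cuDNN Graph API is conveniently accessible through the Python/C++ frontend API (recommended) and through the lower-level C backend API (for legacy use cases or special cases where Python/C++ is not suitable).
Supports flexible fusion of memory-bound operations into the inputs and outputs of matrix multiplication (matmul) and convolution
Provides specialized fusions for patterns such as attention and convolution with normalization
Supports forward and backward propagation
Provides heuristics that predict the best implementation for a given problem size
Open-source Python/C++ frontend API
Supports serialization and deserialization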
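The graph model described above, with operations as nodes and tensors as edges, can be illustrated with a toy example. This sketch is purely conceptual: the dictionary layout, the `OPS` table, and the `run` function are hypothetical, and the JSON round trip only illustrates the idea of serializing a graph description. It does not reflect cuDNN's actual serialization format or API.

```python
import json

# Each node is an operation; the names in "inputs" and "output" are the
# tensors (edges) that connect the nodes.
graph = [
    {"op": "matmul", "inputs": ["A", "B"], "output": "C"},
    {"op": "add", "inputs": ["C", "bias"], "output": "D"},
]

OPS = {
    "matmul": lambda a, b: [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a
    ],
    "add": lambda a, b: [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)],
}


def run(nodes, tensors):
    """Execute nodes in list order, assumed to be a valid topological order."""
    values = dict(tensors)
    for node in nodes:
        args = [values[name] for name in node["inputs"]]
        values[node["output"]] = OPS[node["op"]](*args)
    return values


inputs = {"A": [[1, 2], [3, 4]], "B": [[5, 6], [7, 8]], "bias": [[1, 1], [2, 2]]}

# Round-trip the graph description through JSON, then execute the restored copy.
restored = json.loads(json.dumps(graph))
assert restored == graph
print(run(restored, inputs)["D"])  # [[20, 23], [45, 52]]
```

Separating the graph description from its execution is what allows a library to inspect the whole pattern before running it, for example to decide which nodes can be fused.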
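cuDNN Accelerated Frameworks
cuDNN accelerates widely used deep learning frameworks, including PyTorch, JAX, Caffe2, Chainer, Keras, MATLAB, MxNet, PaddlePaddle, and TensorFlow.
Related Libraries and Software
NVIDIA NeMo™
NeMo is an end-to-end cloud-native framework that developers can use to build, customize, and deploy generative AI models with billions of parameters.
NVIDIA TensorRT™
TensorRT is a software development kit for high-performance deep learning inference.
NVIDIA Optimized Frameworks
Deep learning frameworks provide building blocks for designing, training, and validating deep neural networks through a high-level programming interface.
NVIDIA Collective Communication Library
NCCL is a communication library designed for high-bandwidth, low-latency, GPU-accelerated networking.
More Resources
Ethical AI
NVIDIA believes trustworthy AI is a shared responsibility and has established policies and practices to support the development of a wide range of AI applications. When downloading or using a model in accordance with our terms of service, developers should work with their supporting model team to ensure that the model meets the requirements of the relevant industry and use case, and to guard against the risk of product misuse.
To report security vulnerabilities or NVIDIA AI concerns, please visit the official channel.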