[Error] RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemmStridedBatched(handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`

도도걸만단 2026. 2. 26. 20:18

Fixing the PyTorch CUBLAS_STATUS_INVALID_VALUE error (Blackwell GPU)

1. The error

Error message

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemmStridedBatched(handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`

Where it occurs

- File: `decoder.py` (ViT decoder self-attention)
- Around line 368: `attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))`

When it occurs

- Right after training starts, on the first forward pass (e.g. immediately after "Saving checkpoint at epoch 0...")
- It fails when the decoder attention's Q·K^T matrix multiplication (matmul) runs on the GPU
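To check whether the failure is tied to the environment rather than to the model, the failing expression can be run on its own. The sketch below is a minimal reproduction attempt with made-up tensor sizes (the post does not give the real ones); it falls back to CPU when no GPU is visible, in which case cuBLAS is not exercised at all.

```python
import torch

# Hypothetical sizes for illustration; the real model's dimensions are not stated.
batch, heads, seq_len, head_dim = 2, 4, 16, 32

device = "cuda" if torch.cuda.is_available() else "cpu"

query_layer = torch.randn(batch, heads, seq_len, head_dim, device=device)
key_layer = torch.randn(batch, heads, seq_len, head_dim, device=device)

# Same expression as the failing line in decoder.py.
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

if device == "cuda":
    # CUDA errors can surface asynchronously; synchronizing makes them show up here.
    torch.cuda.synchronize()

print(attention_scores.shape)  # torch.Size([2, 4, 16, 16])
```

If this small script already raises `CUBLAS_STATUS_INVALID_VALUE` on the affected machine, the model code is unlikely to be at fault.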

Environment

- GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition
- PyTorch: 2.10.0+cu128
- CUDA: 12.8
- Driver: 580.95.05 (supports up to CUDA 13.0 according to nvidia-smi)
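Most of the values above can be read from PyTorch itself, which is handy when comparing a working machine with a failing one. This is a small sketch; the helper name `describe_environment` is made up for illustration, and the driver version is not covered here (it comes from `nvidia-smi`).

```python
import torch


def describe_environment() -> dict:
    """Collect the PyTorch/CUDA details relevant to this error."""
    info = {
        "torch": torch.__version__,           # e.g. "2.10.0+cu128"
        "cuda_build": torch.version.cuda,     # CUDA version the wheel was built with; None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```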

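2. Cause

- Not a shape problem: every dimension checked with assert was correct (batch, heads, seq, head_dim, etc.).

- What appears to be the cause:
1. With the PyTorch 2.10.0 + Blackwell + cu128 combination, the cuBLAS strided batched GEMM call returns `CUBLAS_STATUS_INVALID_VALUE`.

2. When **bf16/autocast** tensors enter the decoder, the same cuBLAS error can occur in some environments because of dtype or stride.
- In other words, this looks less like a logic error in the code and more like **an issue with the “PyTorch version + GPU architecture + cuBLAS” combination**, and it may not reproduce on other GPUs (Ampere, etc.) or other PyTorch versions.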
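The shape check mentioned above, plus a quick look at the dtype and stride that the second hypothesis points to, could look roughly like this. It is only a sketch: `check_attention_inputs` is a hypothetical helper, not something from the original `decoder.py`.

```python
import torch


def check_attention_inputs(query_layer: torch.Tensor, key_layer: torch.Tensor) -> dict:
    """Assert the expected (batch, heads, seq, head_dim) layout and report dtype/stride."""
    assert query_layer.dim() == 4 and key_layer.dim() == 4, "expected 4-D tensors"
    assert query_layer.shape[:2] == key_layer.shape[:2], "batch/heads mismatch"
    assert query_layer.shape[-1] == key_layer.shape[-1], "head_dim mismatch"
    assert query_layer.device == key_layer.device, "tensors are on different devices"
    return {
        "q_dtype": query_layer.dtype,
        "k_dtype": key_layer.dtype,
        "q_stride": query_layer.stride(),
        "k_stride": key_layer.stride(),
        "q_contiguous": query_layer.is_contiguous(),
        "k_contiguous": key_layer.is_contiguous(),
    }


# Example with hypothetical sizes: a permuted view is a typical non-contiguous input.
q = torch.randn(2, 16, 4, 32).permute(0, 2, 1, 3)  # (batch, heads, seq, head_dim)
k = torch.randn(2, 4, 16, 32)
report = check_attention_inputs(q, k)
print(report["q_contiguous"], report["k_contiguous"])  # False True
```

Calling it right before the `torch.matmul` line shows whether the tensors arrive as bf16 or with an unusual stride.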

3. Solutions

Method 1: Downgrade PyTorch (the fix actually applied)

Downgrade **PyTorch 2.10.0 → 2.8.0** and keep using the same **cu128** wheel:

pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

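- CUDA 12.8 was left as is; changing only PyTorch to 2.8.0 resolved the error.
- In short: the cause was that the version was too new (a compatibility issue or bug on the 2.10.0 side), and dropping to 2.8.0 fixed it.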
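After reinstalling, it is worth confirming that the interpreter really picks up the downgraded wheel. The sketch below parses a version string of the form PyTorch reports; both function names are made up for illustration.

```python
import re


def parse_torch_version(version: str):
    """Split a string like '2.8.0+cu128' into ((2, 8, 0), 'cu128')."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = tuple(int(group) for group in match.groups())
    local = version.partition("+")[2]  # empty string if there is no '+' suffix
    return release, local


def matches_downgrade_target(version: str) -> bool:
    """True only for the 2.8.0 + cu128 build used as the fix in this post."""
    release, local = parse_torch_version(version)
    return release == (2, 8, 0) and local == "cu128"


print(matches_downgrade_target("2.8.0+cu128"))   # True
print(matches_downgrade_target("2.10.0+cu128"))  # False
```

In a real session, pass `torch.__version__` instead of the literal strings.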

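Method 2: Pin only the decoder to float32 (alternative)

If changing the PyTorch version is difficult, work around the problem by running the **entire decoder forward** in float32:

- At the **very top** of `GeneralDecoder.forward` in `decoder.py` (or the `forward` of the ViTMAE decoder in question),
  wrap the body in `with torch.autocast(device_type='cuda', enabled=False):`.
- Inside that block, cast the decoder input once with
  `hidden_states = hidden_states.float().contiguous()`
  so that dtype and memory layout are consistent.

The decoder then always computes in float32, which avoids the cuBLAS issue coming from the bf16/autocast path.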
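The workaround described above can be sketched as follows. `TinyDecoderBlock` is a hypothetical stand-in for the real decoder, and the layer sizes are invented; only the autocast guard and the `.float().contiguous()` cast mirror the post. The device type is taken from the input tensor so the sketch also runs on CPU; on a GPU it evaluates to `'cuda'`.

```python
import torch
from torch import nn


class TinyDecoderBlock(nn.Module):
    """Hypothetical decoder stub that computes attention scores in float32 only."""

    def __init__(self, hidden_size: int = 64, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq, hidden) -> (batch, heads, seq, head_dim)
        batch, seq_len, _ = x.shape
        return x.view(batch, seq_len, self.num_heads, self.head_dim).permute(0, 2, 1, 3)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Disable autocast for the whole decoder body, even if the caller enabled it.
        with torch.autocast(device_type=hidden_states.device.type, enabled=False):
            # Cast bf16/fp16 inputs to float32 and make the layout contiguous.
            hidden_states = hidden_states.float().contiguous()
            query_layer = self._split_heads(self.query(hidden_states))
            key_layer = self._split_heads(self.key(hidden_states))
            return torch.matmul(query_layer, key_layer.transpose(-1, -2))


block = TinyDecoderBlock()
bf16_input = torch.randn(2, 16, 64, dtype=torch.bfloat16)
scores = block(bf16_input)
print(scores.dtype, scores.shape)  # torch.float32 torch.Size([2, 4, 16, 16])
```

The trade-off is that the decoder loses the speed and memory savings of bf16, so this is better suited as a stopgap than as a permanent setting.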

4. Summary

| **Error** | `CUBLAS_STATUS_INVALID_VALUE` in `cublasSgemmStridedBatched` (decoder attention matmul) |
| **Cause** | cuBLAS compatibility issue or bug with the PyTorch 2.10.0 + Blackwell + cu128 combination (possibly also the bf16 path) |
| **Fix** | Downgrade to PyTorch 2.8.0+cu128 (`pip install ... --index-url https://download.pytorch.org/whl/cu128`) |
| **Alternative** | Force float32 inside the decoder with `autocast(enabled=False)` + `hidden_states.float().contiguous()` |
