[pytorch] RuntimeError: "All tensors must be on devices[0]: 0"

2021. 8. 31. 16:48

I recently upgraded PyTorch from 1.4 to 1.7 and my development environment started to creak, so I want to write down the problem I ran into and how I fixed it.

Previous development environment:

docker + NGC (apex, torch 1.4, CUDA 10.1) + single node, multi-GPU

New development environment:

docker + NGC (torch.distributed, torch 1.7, CUDA 10.1) + single node, multi-GPU

The problem:

After switching the DDP module from apex to torch.nn.parallel.DistributedDataParallel, launching training with the command below produced an error message I had never seen before.

python -m torch.distributed.launch --nproc_per_node 4 train.py

The error output:

"Single-Process Multi-GPU is not the recommended mode for "
/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py:448: 
UserWarning: Single-Process Multi-GPU is not the recommended mode for DDP.
In this mode, each DDP instance operates on multiple devices and creates multiple
module replicas within one process. 
 The overhead of scatter/gather and GIL contention in every forward pass can slow down
 training. Please consider using one DDP instance per device or per module replica by
 explicitly setting device_ids or CUDA_VISIBLE_DEVICES.
 Traceback (most recent call last):
 ~
 ~
 RuntimeError: All tensors must be on devices[0]: 0

Code that caused the error:

The error first appeared right after I replaced the apex.parallel DDP module with the torch DDP module, as shown here:

#from apex.parallel import DistributedDataParallel as DDP
from torch.nn.parallel import DistributedDataParallel as DDP

The error was raised at the line that wraps the model in DDP.

# Original code that raised the error (worked fine with torch 1.4 and apex)
mymodel = DDP(mymodel)

Solution:

To fix the problem, change the line that wraps the model in DDP as follows.

mymodel = DDP(mymodel, find_unused_parameters=True, device_ids=[local_rank], output_device=local_rank)

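The cause appears to be that, from torch 1.7 on, the DDP module needs to be given more explicit information.

What I want is multi-process, multi-GPU training. When DDP was called, apex moved the model to the visible GPU on its own, whereas torch DDP makes you state explicitly which device the model should be replicated to. That is what device_ids=[local_rank] does.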
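The snippet above uses local_rank without showing where it comes from. Here is a minimal sketch of one way to get it, assuming the script is started with torch.distributed.launch, which passes a --local_rank argument to each process (or sets the LOCAL_RANK environment variable when --use_env is given). The helper names parse_local_rank and wrap_with_ddp are my own, made up for illustration; they are not part of PyTorch. The "nccl" backend is also an assumption that fits a single-node multi-GPU setup.

```python
import argparse
import os


def parse_local_rank(argv=None):
    """Read the per-process GPU index supplied by torch.distributed.launch."""
    parser = argparse.ArgumentParser()
    # The launcher appends --local_rank=N; fall back to LOCAL_RANK (set with --use_env).
    parser.add_argument("--local_rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)))
    # parse_known_args ignores the training script's other arguments.
    args, _ = parser.parse_known_args(argv)
    return args.local_rank


def wrap_with_ddp(model, local_rank, find_unused_parameters=False):
    """Hypothetical helper: pin this process to one GPU and wrap the model in DDP."""
    # torch is imported here so that the argument parsing above can be
    # checked on a machine without PyTorch installed.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    torch.cuda.set_device(local_rank)
    if not dist.is_initialized():
        # MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are set by the launcher.
        dist.init_process_group(backend="nccl", init_method="env://")
    model = model.cuda(local_rank)
    return DDP(model,
               device_ids=[local_rank],
               output_device=local_rank,
               find_unused_parameters=find_unused_parameters)
```

In train.py this would be used roughly as local_rank = parse_local_rank() followed by mymodel = wrap_with_ddp(mymodel, local_rank, find_unused_parameters=True). With one process per GPU and a single entry in device_ids, each DDP instance holds one replica, which is the mode the warning recommends.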

find_unused_parameters=True marks the parameters that take no part in the model's backward pass, so that DDP does not wait forever for their gradients.

For an explanation of this, see [here].
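The warning message also names a second remedy besides device_ids: setting CUDA_VISIBLE_DEVICES so that each process sees only one GPU. I did not use this route myself, so take the following as a rough sketch under assumptions. The helper name restrict_visible_gpu is hypothetical, and it has to run before anything initializes CUDA in the process.

```python
import os


def restrict_visible_gpu(local_rank, environ=os.environ):
    """Expose a single GPU to this process. Call before CUDA is initialized."""
    visible = environ.get("CUDA_VISIBLE_DEVICES")
    if visible:
        # Respect a mask that is already set, e.g. "4,5,6,7": pick the
        # entry that corresponds to this process's local rank.
        chosen = visible.split(",")[local_rank].strip()
    else:
        # No mask set: the physical GPU index equals the local rank.
        chosen = str(local_rank)
    environ["CUDA_VISIBLE_DEVICES"] = chosen
    return chosen
```

After this call the process should see its GPU as device 0, so the model would be moved with mymodel.cuda() and the only device DDP can pick up is that one. Passing device_ids explicitly, as in the fix above, is the more direct option and the one I actually verified.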
