CUDA and PyTorch

Random observations on CUDA and PyTorch

  • General Tips: Before moving on to more specialised knowledge, it is good practice to follow some general tips that make the life of a deep learning practitioner easier.

    • Use general tips for designing efficient architectures (such as the ones found in my Efficient DL blog post).
    • Trim RNN backpropagation (through time) lengths to lower the activation storage requirement.
    • Use with torch.no_grad() wherever you don’t need to backpropagate; otherwise the computational graph (and the activations it keeps alive) hogs memory. A short example follows this list.
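
A minimal illustration of the no_grad point (the layer and input sizes here are arbitrary placeholders): wrapping inference in torch.no_grad() stops autograd from building a graph, so intermediate activations are not retained.

    import torch

    model = torch.nn.Linear(1024, 1024).cuda()
    x = torch.randn(64, 1024, device='cuda')

    with torch.no_grad():   # no graph is built, activations are not kept alive
        preds = model(x)
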
  • Profiling your code:
    • Standard cProfile, interpreter-driven profiling does not work well for PyTorch programs that use CUDA, because CUDA calls are asynchronous: the Python call returns as soon as the kernel is queued, so cProfile records essentially no time against it. This can be worked around by placing torch.cuda.synchronize() calls after each PyTorch CUDA call and then profiling the code with line_profiler. The time spent on each synchronize step indicates how much time the GPU spent executing the CUDA kernels queued before it.
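
A sketch of this synchronize-then-profile pattern, assuming line_profiler is installed (the layer and batch are stand-ins; the same function could instead be decorated with @profile and run under kernprof):

    import torch
    from line_profiler import LineProfiler

    model = torch.nn.Linear(1024, 1024).cuda()
    batch = torch.randn(64, 1024, device='cuda')

    def forward_pass(model, batch):
        out = model(batch)          # kernels are only queued here (asynchronous)
        torch.cuda.synchronize()    # this line's time ~ GPU time of the kernels above
        return out

    lp = LineProfiler()
    lp(forward_pass)(model, batch)  # wrap the function, then call it
    lp.print_stats()                # per-line timings, including the synchronize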

    • Using line_profiler also helps profile program steps that are not function calls, e.g. accessing an element of a very large array.

    • torch.autograd.profiler.profile() is PyTorch’s built-in profiling tool. Passing use_cuda=True puts it into CUDA mode so that the CUDA portions of the code are profiled as well. Depending on whether the code is CPU-bound or GPU-bound (the restricting factor), set the profiler to the appropriate mode to identify the bottlenecks in the program.
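
A minimal example of the built-in profiler in CUDA mode (layer sizes are arbitrary; newer PyTorch releases expose the same functionality through torch.profiler):

    import torch

    model = torch.nn.Linear(1024, 1024).cuda()
    x = torch.randn(64, 1024, device='cuda')

    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        y = model(x)
        y.sum().backward()

    # sort by GPU time to surface the most expensive CUDA kernels
    print(prof.key_averages().table(sort_by="cuda_time_total"))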

  • CUDA Streams: PyTorch, through torch.cuda, provides support for CUDA streams and events for greater parallelism, but synchronization between the streams then has to be managed manually (see this StackOverflow post). E.g. using multiple CUDA streams on one GPU, kernel execution on one tensor can be overlapped with copying another tensor into global memory, or multiple kernels can run concurrently (if sufficient threads and blocks are available), as in the sketch below.
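
For example, two independent matrix multiplies can be queued on separate streams, with synchronization managed by hand (a toy sketch; whether the kernels actually overlap depends on the GPU’s free resources):

    import torch

    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()
    a = torch.randn(4096, 4096, device='cuda')
    b = torch.randn(4096, 4096, device='cuda')

    # the side streams must wait for the default stream that produced a and b
    s1.wait_stream(torch.cuda.current_stream())
    s2.wait_stream(torch.cuda.current_stream())

    with torch.cuda.stream(s1):   # work queued on stream s1
        c = a @ a
    with torch.cuda.stream(s2):   # work queued on s2, may overlap with s1
        d = b @ b

    # block the host until both streams have finished before using c and d
    torch.cuda.synchronize()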

  • Timing with Events: Use CUDA events to record GPU runtimes. A Python call to a PyTorch CUDA function just queues the work on a CUDA stream and the interpreter moves on to the next line, so the standard time.time() technique won’t work; elapsed_time() below reports the GPU time in milliseconds.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
# whatever you are timing goes here
end.record()

# Waits for everything to finish running
torch.cuda.synchronize()
print(start.elapsed_time(end))
  • Asynchronous Memory Transfers: PyTorch’s CPU-GPU transfer functions (copy_(), to(), cuda(), etc.) accept a non_blocking argument that lets the copy return immediately instead of blocking the host until it completes. CPU-GPU transfers can thus be made asynchronous with [variable_name].cuda(non_blocking=True) or .to('cuda', non_blocking=True) (the older async=True spelling is deprecated, since async is now a reserved keyword). Note that the copy only truly overlaps with other work when the source tensor is in pinned memory, as described in the next item; see the sketch after it.

  • Page-locked Memory: Allocate CPU tensors in pinned (page-locked) memory to make CPU-GPU transfers faster; otherwise each tensor has to be staged from pageable memory into a pinned buffer before the transfer to the GPU can happen.
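
A minimal sketch combining the last two points: pin the host tensor, then copy it with non_blocking=True so the transfer can overlap with other work (sizes are arbitrary; torch.utils.data.DataLoader(pin_memory=True) achieves the same pinning for batches automatically):

    import torch

    # allocate the host tensor in pinned (page-locked) memory
    x_cpu = torch.randn(64, 3, 224, 224).pin_memory()

    # the copy is queued and the host moves on immediately; it can only
    # overlap with other work because the source tensor is pinned
    x_gpu = x_cpu.to('cuda', non_blocking=True)

    # ops consuming x_gpu are queued on the same stream, so no explicit
    # synchronize is needed for correctness
    y = x_gpu * 2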

  • Using multithreading:
    • I/O latency can be hidden behind GPU computation time by having multiple workers (in the DataLoader) preprocess the data (augmentation etc.) in parallel with the GPU work, so that new batches are already prepared before the current batch finishes computing on the GPU. This is essentially parallel data loading on the CPU. Note that ideally there would also be a separate thread doing the CPU-GPU batch transfers. A sketch follows.
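
A sketch of the overlap pattern with DataLoader workers (the in-memory TensorDataset here stands in for a real, disk-backed dataset):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(10000, 3, 32, 32),
                            torch.randint(0, 10, (10000,)))

    # num_workers > 0: worker processes prefetch/augment upcoming batches while
    # the GPU is busy; pin_memory=True lets the .to() below be non-blocking
    loader = DataLoader(dataset, batch_size=128, num_workers=4, pin_memory=True)

    model = torch.nn.Conv2d(3, 16, 3).cuda()
    for images, labels in loader:
        images = images.to('cuda', non_blocking=True)
        out = model(images)   # GPU work overlaps with workers preparing the next batch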

    • Since the DataLoader’s workers run as separate worker processes (rather than Python threads), they can go about their data-fetching work in parallel without being restricted by the Global Interpreter Lock.

    • If the dataset is small enough, it may be best to transfer it to the GPU in its entirety up front and perform all preprocessing/dataloading steps on the GPU; this might require writing custom CUDA kernels to implement some of those ops. A sketch of keeping the data resident on the GPU follows.
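
A minimal sketch of keeping a small dataset resident on the GPU and slicing mini-batches directly there (names and sizes are illustrative):

    import torch

    # small enough to fit in GPU memory: move everything over once
    features = torch.randn(10000, 128).cuda()
    targets = torch.randint(0, 10, (10000,)).cuda()

    batch_size = 256
    for start in range(0, features.size(0), batch_size):
        xb = features[start:start + batch_size]   # slicing stays on the GPU
        yb = targets[start:start + batch_size]
        # ... forward/backward pass on xb, yb ...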

  • Memory Management:
    • Use torch.cuda.empty_cache() to make PyTorch release the cached memory it has reserved for future tensor allocations back to the GPU; note that it does not free tensors that are still referenced.

    • Use the del operator to manually drop references to tensors that you are certain are no longer needed.

    • If you encounter CUDA OOM exceptions, lower the batch size and retry; a snippet like the following catches the error, cleans up GPU memory, and retries a few times before giving up.

    import gc
    import torch

    max_retries = 3
    for retry_time in range(1, max_retries + 1):
        trainer = None
        try:
            trainer = Trainer(model, loss, metrics, optimizer,
                              resume=resume, config=config, data_loader=data_loader,
                              valid_data_loader=valid_data_loader,
                              lr_scheduler=lr_scheduler)

            trainer.train()
            break
        except RuntimeError as e:
            print('Runtime Error {}\n Run Again.....  {}/{}'.format(e, retry_time, max_retries))
            if retry_time == max_retries:
                print('Give up! Probably CUDA OOM -- lower batch size in config.json and try again')
                break
        finally:
            # drop the reference, collect garbage and return cached blocks to the
            # allocator so the next attempt (or another process) can reuse the memory
            del trainer
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

A GPU-monitoring tool such as nvidia-smi or the GPUtil package can help in monitoring GPU memory usage at various stages of the program. Note that the reported usage includes the additional cache PyTorch reserves to accommodate future tensors, not just the memory of live tensors.
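
To separate live-tensor memory from PyTorch’s cache from within the program itself, the built-in counters can be queried (a small sketch; in older PyTorch releases memory_reserved() is named memory_cached()):

    import torch

    x = torch.randn(1024, 1024, device='cuda')

    allocated = torch.cuda.memory_allocated()   # bytes occupied by live tensors
    reserved = torch.cuda.memory_reserved()     # bytes held by the caching allocator
    print('allocated: {:.1f} MB, reserved: {:.1f} MB'.format(
        allocated / 1024 ** 2, reserved / 1024 ** 2))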

  • Accelerate wherever possible: Prefer the analogous operations in PyTorch rather than their NumPy implementations, as the former can leverage hardware (GPU) acceleration, e.g. the comparison below.
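
For instance (a toy comparison, sizes arbitrary), a matrix multiply expressed on torch tensors can run as a CUDA kernel, whereas the NumPy version stays on the CPU:

    import numpy as np
    import torch

    a_np = np.random.rand(2048, 2048).astype(np.float32)
    b_np = np.random.rand(2048, 2048).astype(np.float32)

    c_np = a_np @ b_np                    # NumPy: runs on the CPU

    a = torch.from_numpy(a_np).cuda()     # same data, now on the GPU
    b = torch.from_numpy(b_np).cuda()
    c = a @ b                             # PyTorch: runs as a CUDA kernel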
