What is an efficient PyTorch training pipeline? Is it the one that produces a model with the best accuracy? Or the one that runs the fastest? Or the one that's easy to understand and extend? Or maybe the one that's easy to parallelize? It is all of the above.
PyTorch is a great instrument for both research and production, as shown by the adoption of this deep learning framework at Stanford University, Udacity, Salesforce, Tesla, and others. However, every tool requires an investment of time to master it and use it with maximum efficiency. After using PyTorch for more than two years, I decided to summarize my experience with this deep learning library.
Efficient — (of a system or machine) achieving maximum productivity with minimum wasted effort or expense. (Oxford Languages)
This part of the Efficient PyTorch series gives general tips for identifying and eliminating I/O and CPU bottlenecks. The second part will reveal some tips on efficient tensor operations, and the third part will cover efficient model debugging techniques.
Disclaimer: This post assumes you have at least some prior knowledge of PyTorch.
I'll start with probably the most obvious one:
Advice 0: Know where the bottlenecks in your code are
Command-line tools like nvidia-smi, htop, iotop, nvtop, py-spy, and strace should become your best friends. Is your training pipeline CPU-bound? I/O-bound? GPU-bound? These tools will help you find the answer.
You may not have even heard of these tools, or you may have heard of them but never used them. That's OK, and it's also fine if you don't start using them immediately. Just remember that someone else may be using them to train models 5%, 10%, or 15% faster than you, and that can eventually make the difference between winning or losing a market, or between an offer and a rejection for a job position.
Data pre-processing
Nearly every training pipeline starts with the Dataset class. It is responsible for providing data samples, and any necessary data transformation and augmentation may happen here. In a nutshell, a Dataset is an abstraction that reports its size and returns a data sample for a given index.
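To make this concrete, here is a minimal sketch of a map-style Dataset. The class name and the in-memory random tensors are mine, for illustration only; a real dataset would read samples from disk:

```python
import torch
from torch.utils.data import Dataset

class RandomImageDataset(Dataset):
    """A toy map-style dataset: reports its size, returns a sample by index."""

    def __init__(self, num_samples: int = 100, image_size: int = 32):
        # Pre-generate random "images" and labels in memory for illustration.
        self.images = torch.rand(num_samples, 3, image_size, image_size)
        self.labels = torch.randint(0, 10, (num_samples,))

    def __len__(self) -> int:
        # The dataset reports its own size.
        return len(self.images)

    def __getitem__(self, index):
        # Any per-sample transformation or augmentation could happen here.
        return self.images[index], self.labels[index]

dataset = RandomImageDataset(num_samples=100, image_size=32)
```

A DataLoader then pulls samples from this object by index to assemble batches.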
If you're working with image-like data (2D or 3D scans), disk I/O can become a bottleneck. To get the raw pixel data, your code needs to read the data from disk and decode the images into memory. Each individual task is fast, but when you need to process hundreds of thousands of them as fast as possible, this can become a challenge. Libraries like NVIDIA DALI offer GPU-accelerated JPEG decoding, which is definitely worth trying if you face I/O bottlenecks in your data processing pipeline.
There is one more option. SSDs have an access time of ~0.08–0.16 milliseconds, while RAM access times are measured in nanoseconds. We can put our data directly into memory!
Advice 1: If possible, move all or part of your data to RAM.
If you have enough RAM to load and keep all your training data in memory, this is the easiest way to exclude the slowest data retrieval step from the pipeline.
This advice is especially useful for cloud instances like Amazon's p3.8xlarge. That instance has an EBS disk whose performance is quite limited with default settings. However, it is equipped with an astonishing 248 GB of RAM, which is more than enough to keep the entire ImageNet dataset in memory! Here's how you can achieve this:
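One simple way to do it is a caching wrapper that eagerly reads every sample once and serves all later requests from memory. This is a sketch; the class names are mine, and the stand-in source dataset imitates a slow, disk-backed one:

```python
from torch.utils.data import Dataset

class InMemoryDataset(Dataset):
    """Wraps another dataset and caches all of its samples in RAM up front."""

    def __init__(self, source):
        # Pay the disk I/O cost once, at construction time.
        self.samples = [source[i] for i in range(len(source))]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        # Later epochs are served from RAM instead of hitting the disk.
        return self.samples[index]

class SquaresDataset(Dataset):
    """A hypothetical stand-in for a slow, disk-backed dataset."""

    def __len__(self):
        return 5

    def __getitem__(self, index):
        return index ** 2  # imagine an expensive read + decode here

cached = InMemoryDataset(SquaresDataset())
```

For datasets that don't fit in RAM uncompressed, the same idea applies to the raw encoded bytes: cache them in memory and decode on the fly in `__getitem__`.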
I faced this bottleneck issue personally. My home PC is equipped with 4x1080Ti GPUs. At some point I took a p3.8xlarge instance with four NVIDIA Tesla V100s and moved my training code there. Given that the V100 is newer and faster than my oldie 1080Ti, I expected 15–30% faster training. To my surprise, the training time per epoch increased! This was my lesson to pay attention to infrastructure and environment nuances, not only to CPU and GPU speeds.
Depending on your scenario, you can either keep the binary content of each file unchanged in RAM and decode it on the fly, or decompress the images and keep the raw pixels instead. Whichever path you choose, here's the second piece of advice:
Advice 2: Profile. Measure. Compare. Every time you introduce a change to the pipeline, thoughtfully evaluate its overall impact.
This advice focuses solely on training speed, assuming you introduce no changes to the model, hyper-parameters, dataset, etc. You can have a magic command-line argument (a magic switch) that, when specified, runs the training on some reasonable number of data samples. With this feature, you can quickly profile your pipeline at any time:
# Profile CPU bottlenecks
python -m cProfile training_script.py --profiling

# Profile GPU bottlenecks
nvprof --print-gpu-trace python train_mnist.py

# Profile system call bottlenecks
strace -fcT -e trace=open,close,read python training_script.py
Advice 3: Preprocess everything offline
If you're training on 512x512 images resized from 2048x2048 originals, resize them beforehand. If you're feeding grayscale images to your model, do the color conversion offline. If you're doing NLP, tokenize beforehand and save the result to disk. There is no point in repeating the same operation over and over during training. In the case of progressive learning, you can save multiple resolutions of your training data; that is still faster than resizing to the target resolution online.
For tabular data, consider converting pd.DataFrame objects to PyTorch tensors at Dataset creation time.
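A sketch of that idea, assuming pandas is available (the class name and columns are hypothetical): the DataFrame is converted to tensors exactly once, in the constructor, so `__getitem__` is a cheap tensor slice.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset

class TabularDataset(Dataset):
    """Converts a DataFrame to tensors once, at creation time."""

    def __init__(self, df: pd.DataFrame, target_column: str):
        # One-time conversion; doing per-row pandas indexing inside
        # __getitem__ would repeat this work on every sample access.
        self.targets = torch.as_tensor(df[target_column].to_numpy())
        self.features = torch.as_tensor(
            df.drop(columns=[target_column]).to_numpy(), dtype=torch.float32
        )

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        return self.features[index], self.targets[index]

frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "label": [0, 1]})
dataset = TabularDataset(frame, target_column="label")
```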
Advice 4: Tune the number of workers for DataLoader
PyTorch uses the DataLoader class to simplify batch creation for training your model. To speed things up, it can build batches in parallel using Python's multiprocessing. Most of the time it works just fine out of the box, but there are a few things to keep in mind:
Each worker process generates one batch of data, and these batches are handed over to the main process via inter-process communication. With N workers, your script will require roughly N times more RAM to store those batches in system memory. How much RAM exactly will you need?
Let’s calculate:
- Suppose we train an image segmentation model on Cityscapes with batch size 32 and RGB images of size 512x512x3 (height, width, channels). We do image normalization on the CPU side (I will explain later why this is important). In this case, our final image tensor will be 512 * 512 * 3 * sizeof(float32) = 3,145,728 bytes. Multiplying by the batch size gives 100,663,296 bytes, or roughly 100 MB.
- In addition to the images, we need to provide the ground-truth masks. Their respective size (by default, masks have type long, which is 8 bytes) is 512 * 512 * 1 * 8 * 32 = 67,108,864 bytes, or roughly 67 MB.
- Hence, the total memory required for one batch of data is 167 MB. With 8 workers, the total amount of memory required will be 167 MB * 8 = 1,336 MB.
It doesn't sound too bad, right? The problem arises when your hardware setup can consume batches faster than 8 workers can produce them. One could naively use 64 workers, but that would consume at least 11 GB of RAM.
Things get even worse if your data is 3D volumetric scans: a single sample of a single-channel 512x512x512 volume occupies 134 MB, so a batch of 32 takes 4.2 GB, and with 8 workers you need about 34 GB of RAM just to keep the intermediate data in memory.
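The arithmetic above is easy to wrap into a small helper for sanity-checking your own setup. This is a sketch, and the function name is mine:

```python
def dataloader_ram_bytes(sample_tensors, batch_size, num_workers):
    """Estimate the RAM needed to hold one prefetched batch per worker.

    `sample_tensors` is a list of (shape, bytes_per_element) pairs,
    one per tensor in a single data sample.
    """
    bytes_per_sample = 0
    for shape, itemsize in sample_tensors:
        numel = 1
        for dim in shape:
            numel *= dim
        bytes_per_sample += numel * itemsize
    return bytes_per_sample * batch_size * num_workers

# The Cityscapes example: float32 images plus int64 masks, batch 32, 8 workers.
total = dataloader_ram_bytes(
    [((512, 512, 3), 4), ((512, 512, 1), 8)], batch_size=32, num_workers=8
)  # 1,342,177,280 bytes, i.e. roughly the 1,336 MB estimated above
```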
There is a partial solution to this problem: reduce the per-channel bit depth of the input data as much as possible:
- Keep RGB images at 8 bits per channel. Image conversion to float and normalization can easily be done on the GPU.
- Use the uint8 or uint16 data type instead of long in the dataset.
By doing so, you can greatly reduce the RAM requirements. For the example above, the memory-efficient data representation uses 33 MB per batch instead of 167 MB, a 5x reduction! Of course, this requires extra steps in the model itself to normalize the data and cast it to the appropriate data type. However, the smaller the tensors, the faster the CPU-to-GPU transfer.
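A minimal sketch of this pattern (the function name is mine): the DataLoader ships compact uint8 batches, and the float conversion plus scaling happen only after the transfer to the target device.

```python
import torch

def normalize_on_device(images_uint8: torch.Tensor, device: str = "cpu"):
    """Move a compact uint8 batch to the target device, then convert there.

    Only 1 byte per pixel crosses the CPU-to-GPU boundary; the 4-byte float
    tensor is materialized on the device itself.
    """
    images = images_uint8.to(device, non_blocking=True)
    return images.float().div_(255.0)

batch = torch.randint(0, 256, (4, 3, 512, 512), dtype=torch.uint8)
normalized = normalize_on_device(batch)  # pass device="cuda" on a GPU box
```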
The number of workers for DataLoader should be chosen wisely. Check how fast your CPU and I/O systems are, how much memory you have, and how quickly your GPU(s) can consume the data.
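A quick way to pick the value is to simply time one full pass of the DataLoader for several candidates. This is a sketch with a toy in-memory dataset, so the measured differences here are meaningless; run it against your real dataset:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

def time_one_epoch(dataset, num_workers: int, batch_size: int = 32) -> float:
    """Return the seconds spent iterating over the whole dataset once."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    start = time.perf_counter()
    for _batch in loader:
        pass  # in a real benchmark you would also move the batch to the GPU
    return time.perf_counter() - start

dataset = TensorDataset(torch.rand(256, 3, 32, 32), torch.randint(0, 10, (256,)))
baseline = time_one_epoch(dataset, num_workers=0)
# On your real dataset, compare num_workers in {0, 2, 4, 8, ...} and keep the
# smallest value after which the epoch time stops improving.
```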
Multi-GPU training & inference
Neural network models keep getting bigger. Today's trend is to use multiple GPUs to speed up training, and the larger effective batch size often also improves model performance. PyTorch has all the features for going multi-GPU within a few lines of code. However, some caveats are not obvious at first glance.
model = nn.DataParallel(model) # Runs model on all available GPUs
The easiest way to go multi-GPU is to wrap the model in the nn.DataParallel class. In most cases it works just fine, unless you're training an image segmentation model (or any other model that produces large output tensors). At the end of the forward pass, nn.DataParallel gathers the outputs from all GPUs on the master GPU, runs the backward pass through them, and makes the gradient update.
There are two problems:
- GPU load is unbalanced
- Gathering on master GPU requires extra video memory.
First, only the master GPU does the loss computation, backward pass, and gradient step, while the other GPUs sit at 60°C waiting for the next batch of data.
Second, the extra memory required to gather all outputs on the master GPU usually forces you to reduce the batch size. nn.DataParallel splits the batch across the GPUs evenly: with 4 GPUs and a total batch size of 32, each GPU gets a block of 8 samples. While every non-master GPU easily fits its block in VRAM, the master GPU has to allocate additional space to hold the outputs for all 32 samples gathered from the other cards.
There are two solutions for this uneven GPU utilization:
- Keep using nn.DataParallel and compute the loss inside the forward pass during training. In this case, you don't return dense prediction masks to the master GPU, but only a single scalar loss value.
- Use distributed training, aka nn.DistributedDataParallel. It solves both problems above and lets you enjoy watching 100% load on all your GPUs.
If you want to learn more about multi-GPU training and get an in-depth understanding of the pros and cons of each approach, check out these great posts:
- https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
- https://medium.com/@theaccelerators/learn-pytorch-multi-gpu-properly-3eb976c030ee
- https://towardsdatascience.com/how-to-scale-training-on-multiple-gpus-dae1041f49d2
Advice 5: If you have more than 2 GPUs, consider using distributed training mode
How much time it saves depends heavily on your scenario, but I observed a ~20% reduction in training time for an image classification pipeline on 4x1080Ti.
It's also worth mentioning that you can use nn.DataParallel and nn.DistributedDataParallel for inference as well.
On custom loss functions
Writing custom loss functions is a fun and exciting exercise, and I recommend everyone try it from time to time. There is one thing to keep in mind while implementing a loss function with complex logic: it all runs on CUDA, and it's your duty to write CUDA-efficient code. Here, CUDA-efficient means "no Python control flow". Going back and forth between CPU and GPU and accessing individual values of a GPU tensor may get the job done, but the performance will be awful.
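To illustrate the difference, here is a hypothetical pairwise cosine loss written both ways (the function names are mine; this is not the loss from the paper below). Both compute the same value, but the loop version launches one tiny kernel per sample, while the batched version is a single vectorized expression:

```python
import torch
import torch.nn.functional as F

def pairwise_cosine_loss_slow(embeddings: torch.Tensor, target: torch.Tensor):
    """Anti-pattern: Python control flow over individual rows of a tensor."""
    total = 0.0
    for i in range(embeddings.shape[0]):
        # Each iteration launches tiny GPU kernels and stalls the pipeline.
        total = total + (1.0 - F.cosine_similarity(
            embeddings[i], target[i], dim=0))
    return total / embeddings.shape[0]

def pairwise_cosine_loss_fast(embeddings: torch.Tensor, target: torch.Tensor):
    """The same math expressed as one batched tensor operation."""
    return (1.0 - F.cosine_similarity(embeddings, target, dim=1)).mean()
```

On a GPU the batched version is dramatically faster; on large batches the gap grows with batch size.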
Some time ago I was implementing a custom cosine embedding loss function for instance segmentation from the paper "Segmenting and tracking cell instances with cosine embeddings and recurrent hourglass networks". It's quite simple in text form but has a somewhat complex implementation.
The first naive implementation that I wrote (bugs aside) took minutes (!) to compute the loss value for a single batch. To profile CUDA bottlenecks, PyTorch offers an extremely handy built-in profiler. It's very simple to use and gives you all the information you need to address bottlenecks in your code:
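A minimal sketch of using the autograd profiler (the toy model is mine; on a GPU machine you would also enable CUDA timing when creating the profiler context):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
inputs = torch.rand(64, 128)

# Record every operator executed inside the context.
with torch.autograd.profiler.profile() as prof:
    model(inputs)

# Per-operator totals, sorted so the worst offenders come first.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```

The table immediately shows which operators dominate the runtime, which is usually enough to spot a stray Python loop or an accidental CPU-GPU sync.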
Advice 9: If you're designing custom modules and losses, profile and test them
After profiling my initial implementation, I was able to speed it up by a factor of 100. More on writing efficient tensor expressions in PyTorch will be covered in Efficient PyTorch, Part 2.
Time vs Money
Last but not least: sometimes it's worth investing in more capable hardware rather than optimizing the code. Software optimization is always a high-risk journey with an uncertain outcome; it may be more effective to upgrade the CPU, RAM, GPU, or all of them. Money and engineering time are both resources, and proper utilization of both is the key to success.
Advice 10: Some bottlenecks can be solved more easily with a hardware upgrade
Conclusion
Getting the maximum out of your everyday tools is key to proficiency. Try not to take shortcuts, and dig deeper when something is unclear to you: there is always a chance to gain new knowledge. Ask yourself or your teammates, "How can my code be improved?" I truly believe that this sense of perfectionism is as important as any other skill for a computer engineer.