Learning
Sep 3, 2024
7 mins

Why do GPU Containers have long Cold Starts?

Author
Agam Jain

Learn how to minimize cold start times in GPU applications by understanding container runtimes, image loading, and lazy loading. Discover the limitations of a Kubernetes- and Docker-based approach for GPU images compared to CPU images.

Introduction

When setting up autoscaling for GPU applications, one of the biggest challenges developers face is cold starts—the delay when an application starts from an inactive state. Whether you’re scaling from 0 to 1 or 1 to n instances, it can take around 10-15 minutes to start a new instance if your image size is ~15GB.

The most common approach is to use a Kubernetes cluster (EKS, AKS, GKE, etc.) to spin up and down nodes while pulling the image from container registries (ECR, GCR, ACR, GHCR, etc.).

This setup works well for CPU workloads because CPU-based images are generally small, often around 70-200MB (e.g., python:3.9-slim, node:14-slim). However, GPU images, especially those that include models, are much larger. A CUDA base image is approximately 12GB, and including the model can increase the size to hundreds of gigabytes. This significantly increases the time required to pull the image from the registry, leading to long cold starts.

This blog will cover the steps to start containers, the time required for each step, and strategies to reduce cold starts.

How Containers Start

What is needed to start a container?

The most crucial component for starting a container is the root filesystem (rootfs). The rootfs includes all the files necessary to run your application, such as:

  • System files: Essential files required by the operating system.
  • Device files (/dev): Representations of hardware devices like GPU, CPU, or storage.
  • Drivers: Crucial for GPU containers.
  • Dependencies: Libraries and binaries required by your application.
  • Project files: Your application code and its resources.

The entire rootfs is packaged inside the container image. A container image is essentially a tar archive that stores all the files (the rootfs) the container needs, usually compressed to save space. The image is typically divided into layers, with each layer adding specific files or updates.
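
For illustration, here is a minimal sketch that lists the layers inside an image exported with docker save. The file name image.tar is a placeholder, and the code assumes the classic docker-save layout, where manifest.json points at each layer tarball:

import json
import tarfile

def list_layers(image_tar_path):
    # manifest.json maps each image to its ordered list of layer tarballs.
    with tarfile.open(image_tar_path) as tar:
        manifest = json.load(tar.extractfile("manifest.json"))
        for entry in manifest:
            print("Image:", entry.get("RepoTags"))
            for layer in entry["Layers"]:
                size_mb = tar.getmember(layer).size / 1e6
                print(f"  {layer}: {size_mb:.1f} MB")

list_layers("image.tar")  # placeholder path to a `docker save` export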

Container runtime

Starting a container involves several steps, managed by the container runtime and a snapshotter component.

A container runtime is software responsible for running containers. It manages container lifecycle operations such as starting, stopping, and deleting containers. Popular container runtimes include Docker, containerd, and CRI-O. Snapshotters are components within the runtime responsible for managing the container's filesystem.
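
If you are on containerd, you can see which snapshotters are available by listing its plugins. Here is a small sketch that shells out to the ctr CLI from Python (this assumes containerd is running and ctr is on your PATH; you may need root to reach the containerd socket):

import subprocess

# List containerd plugins and keep only the snapshotter entries
# (overlayfs is the usual default on Linux).
output = subprocess.run(
    ["ctr", "plugins", "ls"],
    capture_output=True, text=True, check=True,
).stdout

for line in output.splitlines():
    if "snapshotter" in line:
        print(line)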

Here’s how a container starts (a rough timing sketch follows these steps):

  • Download the Image: The snapshotter downloads the container image from a registry (like Docker Hub or ECR).
  • Decompress the Image: After downloading, the snapshotter decompresses the image to extract the root filesystem (rootfs).
  • Start the Container: Once the filesystem is ready, the container runtime starts the container, providing it with access to the necessary files through the snapshotter.
  • Load GPU Model (if applicable): If the container uses a GPU, the model is loaded from disk into GPU memory, making the application ready to run with GPU acceleration.
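
To get a feel for where the time goes, here is a rough timing sketch that measures the pull and run steps separately with the Docker CLI. The image tag is only an example, and note that docker pull lumps the download and decompress steps together:

import subprocess
import time

IMAGE = "nvidia/cuda:12.2.0-base-ubuntu22.04"  # example tag, substitute your own image

def timed(label, cmd):
    start = time.time()
    subprocess.run(cmd, check=True)
    print(f"{label}: {time.time() - start:.1f}s")

# Pull covers the download and decompress steps described above.
timed("pull", ["docker", "pull", IMAGE])

# Run covers snapshotter assembly plus container start for a trivial command.
timed("run", ["docker", "run", "--rm", IMAGE, "true"])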

Performance Comparison: GPU Image vs. CPU Image

To highlight the difference in performance, we compare the time taken to start a container with a GPU image (a CUDA base image plus a 5 GB model, for a total image size of 17 GB) versus a CPU image (100 MB). Below is a summary of the time taken at each step:

Step                      GPU Image      CPU Image
Download Image            8 min          ~5 s
Decompress Image          7 min          < 1 s
Snapshotter Assembly      ~3 s           < 1 s
Start Container           ~2 s           ~2 s
Load Model into Memory    8 s            -
Total Time (cold start)   15 min 13 s    ~10 s

As you can see, the cold start time for a GPU image is ~90x more than that of a CPU image. This becomes the key limitation of using this setup for autoscaling GPU workloads in production.
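
The download number is roughly what the image size alone would predict. Assuming a sustained registry throughput of around 35 MB/s (an assumed figure for illustration; real throughput depends on the registry, region, and instance type), a back-of-the-envelope estimate lands in the same range as the table:

# Rough estimate of the download step for a 17 GB image.
# The 35 MB/s registry throughput is an assumed figure.
image_size_gb = 17
throughput_mb_s = 35

download_seconds = image_size_gb * 1000 / throughput_mb_s
print(f"Estimated download time: {download_seconds / 60:.1f} minutes")  # ~8.1 minutes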

Lazy Loading: A Potential Solution to Explore

Lazy loading, or asynchronous loading, is an approach to improve container startup times. Instead of downloading and decompressing the entire image before starting the container, lazy loading:

  1. Starts the container with minimal essential files
  2. Downloads additional files on-demand as the application requires them

This approach can significantly reduce initial startup times, especially for large container images. Studies of Docker workloads (for example, the Slacker paper from FAST '16) found that containers spend 76% of their start time pulling the image, yet read only about 6% of the data they download.
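
The core idea can be sketched in a few lines: instead of materializing every file before the process starts, you record where each file lives and fetch it the first time it is opened. The snippet below is a toy illustration of that access pattern, where fetch_from_registry is a hypothetical callable standing in for the remote store; real lazy-pulling snapshotters such as eStargz or SOCI do this at the filesystem layer, so the application just sees an ordinary rootfs:

import os

class LazyRootfs:
    # Toy illustration of lazy loading: files are fetched on first open.

    def __init__(self, manifest, fetch_from_registry, local_dir="/tmp/lazy-rootfs"):
        # manifest maps paths inside the image to remote locations;
        # fetch_from_registry(remote_ref) -> bytes is a hypothetical callable.
        self.manifest = manifest
        self.fetch = fetch_from_registry
        self.local_dir = local_dir

    def open(self, path):
        local_path = os.path.join(self.local_dir, path.lstrip("/"))
        if not os.path.exists(local_path):
            # Pay the download cost only now, and only for this file.
            data = self.fetch(self.manifest[path])
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            with open(local_path, "wb") as f:
                f.write(data)
        return open(local_path, "rb")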


At Tensorfuse, we have built our own container runtime, which brings container start times down to ~3 seconds on warm nodes and ~24 seconds on new nodes.

If your current setup has slower start times, you can move to a faster container runtime with 10x better DevEx and improve your shipping velocity.

Get started with our docs


Get started with Tensorfuse today.

Deploy in minutes, scale in seconds.

import tensorkube


# Build the container image: CUDA base, Python 3.9, system and Python
# dependencies, environment variables, and a custom build step.
# download_and_quantize_model is defined elsewhere in your project.
image = (
    tensorkube.Image.from_registry("nvidia/cuda")
    .add_python(version='3.9')
    .apt_install(['git', 'git-lfs'])
    .pip_install(['transformers', 'torch', 'torchvision', 'tensorrt'])
    .env({'SOME-RANDOM-SECRET-KEY': 'xxx-xyz-1234-abc-5678'})
    .run_custom_function(download_and_quantize_model)
)


# Load the model onto the GPU once at startup and share a reference to it.
@tensorkube.entrypoint(image, gpu='A10G')
def load_model_on_gpu():
    import transformers

    model = transformers.BertModel.from_pretrained('bert-base-uncased')
    model.to('cuda')
    tensorkube.pass_reference(model, 'model')


# Serve inference requests using the already-loaded model.
@tensorkube.function(image)
def infer(input: str):
    model = tensorkube.get_reference('model')
    # Run the model on the input
    response = model(input)
    return response





