The Cold Start Problem
AI and ML container images like vLLM, SGLang, and TensorRT-LLM are large, typically exceeding 10 GB. With traditional Docker using overlayfs, pulling a 10 GB image from a registry to an instance takes 7-10 minutes. This delay creates three critical problems for production AI deployments:
- Overprovisioning of GPU infrastructure: To handle traffic spikes without user-facing latency, operators maintain 3-7x more GPU capacity than the base load requires. On AWS, an H100 GPU instance costs $7.57/hour, so a deployment provisioned for 100 requests/second at peak but averaging 30 requests/second wastes approximately $1,634 per day per 8-GPU instance in idle costs.
- High idle costs during scale-down: GPU instances cannot be terminated immediately after traffic decreases, because the next spike would trigger 7-10 minute cold starts. Operators therefore extend scale-down windows to 5-20 minutes, accumulating significant idle compute costs.
- Poor user experience during traffic spikes: When traffic exceeds provisioned capacity, new requests queue for minutes while containers start. For latency-sensitive inference workloads, this is unacceptable.
Cost Impact: H100 Example
Consider a typical AI inference service with variable traffic patterns:
Traffic Pattern Assumptions:
- Peak hours: 100 requests/second (8 hours/day during business hours)
- Base load: 30 requests/second (16 hours/day during off-peak)
- GPU capacity: Each H100 can handle ~12-15 requests/second for typical 8B parameter models at target latency
- Safety margin: 20% headroom to handle request bursts and maintain SLA
- GPUs needed: 100 req/s ÷ 15 req/s per GPU × 1.2 safety ≈ 8 H100 GPUs running 24/7 (using the upper end of the per-GPU throughput range)
- Monthly cost: 8 GPUs × $7.57/hour × 720 hours = $43,603/month
- Utilization breakdown:
- Peak hours (8h/day): 8 GPUs serving 100 req/s = 100% utilized
- Off-peak hours (16h/day): 8 GPUs serving 30 req/s = 31% utilized
- Wasted capacity: During 480 off-peak hours/month, 5.5 GPUs sit mostly idle
- Idle cost: 5.5 GPUs × 480 hours × $7.57/hour ≈ $19,994/month in wasted compute
- Base capacity: 30 req/s ÷ 12 req/s per GPU × 1.2 = 3 GPUs, but bump to 4 GPUs to handle the 3-minute scale-up lag during traffic ramps
- Peak scaling: Add 4 additional GPUs during peak hours
- Scale-down strategy: 15-minute delay after traffic drops (prevents thrashing, balances responsiveness with cost)
- Base tier (4 GPUs running 24/7): 4 GPUs × 720 hours × $7.57/hour ≈ $21,842/month
- Peak tier (4 additional GPUs for 8.25 hours/day): 4 GPUs × 247.5 hours × $7.57/hour ≈ $7,494/month
- 8 hours peak + 0.25 hours scale-down delay per day
- Total monthly cost: $29,336/month
- Savings: $43,603 - $29,336 = $14,267/month (33% reduction); the arithmetic is reproduced in the sketch below
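The cost model above can be reproduced with a short script. This is a minimal sketch using only the assumptions stated in this section (the $7.57/hour rate, the 12-15 req/s per-GPU throughput, and the 20% safety margin); its output matches the figures above to within a few tens of dollars of rounding.

```python
# Minimal sketch of the cost model above. All rates and throughput figures
# are the assumptions stated in this section, not measured values.
import math

GPU_HOURLY = 7.57            # $/GPU-hour (H100 figure from the example above)
HOURS_PER_MONTH = 720
PEAK_RPS, BASE_RPS = 100, 30
RPS_PER_GPU = 15             # upper end of the 12-15 req/s per-GPU range
SAFETY = 1.2                 # 20% headroom

def gpus_needed(rps: float) -> int:
    """GPUs required for a request rate with the safety margin applied."""
    return math.ceil(rps / RPS_PER_GPU * SAFETY)

# Static provisioning: peak capacity runs 24/7.
static_gpus = gpus_needed(PEAK_RPS)                      # 8
static_cost = static_gpus * GPU_HOURLY * HOURS_PER_MONTH

# Autoscaled: base tier runs 24/7 (+1 GPU for the scale-up lag);
# the extra peak tier runs 8 h/day plus a 15-minute scale-down delay.
base_gpus = gpus_needed(BASE_RPS) + 1                    # 3 needed, bump to 4
peak_extra_gpus = static_gpus - base_gpus
base_cost = base_gpus * GPU_HOURLY * HOURS_PER_MONTH
peak_cost = peak_extra_gpus * GPU_HOURLY * 8.25 * 30     # 247.5 h/month

autoscaled_cost = base_cost + peak_cost
print(f"static: ${static_cost:,.0f}/mo  autoscaled: ${autoscaled_cost:,.0f}/mo  "
      f"savings: ${static_cost - autoscaled_cost:,.0f}/mo")
```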
Cold Start Stages
Complete cold start time consists of three stages:
- Node provisioning: When autoscaling from zero or adding nodes, cloud providers take 80-120 seconds to provision and boot new GPU instances. This is entirely cloud-dependent and outside user control.
- Container start: Time to pull container image from registry, extract layers, and reach first application log. This stage is the focus of this analysis.
- Model download and load: Time to download model weights from repositories like Hugging Face and load them into GPU memory. Dependent on network bandwidth, disk I/O, and model size.
Container Startup Breakdown
We measure container startup using the following metrics:
- Time to First Log: Container reaches first application log output
- First Log to Model Download Start: Application initialization (library loading, CUDA setup)
- Total to Model Download: Sum of above two phases
- Model Download: Time downloading model weights from registry
- Weights to GPU: Time loading model weights into GPU memory
- Graph Capture: Framework-specific compilation (e.g., CUDA graph capture)
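These phases can be captured with a simple timing harness that records a timestamp at each boundary event and reports the deltas. The sketch below is illustrative only; the event names mirror the metric list above, and how each event is detected (a log line, a framework callback) is left to the integration.

```python
# Illustrative phase-timing harness: record a timestamp at each boundary
# event, then report per-phase durations. Event names follow the metric
# list above; they are not tied to any particular serving framework.
import time

class PhaseTimer:
    def __init__(self):
        self.events: list[tuple[str, float]] = [("container_created", time.monotonic())]

    def mark(self, name: str) -> None:
        self.events.append((name, time.monotonic()))

    def report(self) -> dict[str, float]:
        durations = {}
        for (prev_name, prev_t), (name, t) in zip(self.events, self.events[1:]):
            durations[f"{prev_name} -> {name}"] = t - prev_t
        return durations

# Example usage (events would be emitted by the serving stack):
timer = PhaseTimer()
timer.mark("first_log")             # Time to First Log
timer.mark("model_download_start")  # First Log to Model Download Start
timer.mark("model_download_done")   # Model Download
timer.mark("weights_on_gpu")        # Weights to GPU
timer.mark("graph_capture_done")    # Graph Capture
print(timer.report())
```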
The Lazy Loading Solution
Analysis shows that for typical AI workloads, 76% of startup time is spent downloading container images, yet only 6.4% of image data is accessed during application startup. This extreme inefficiency creates an optimization opportunity: defer downloading unused files until they are actually accessed.
Snapshotter Mechanisms
Snapshotters are containerd components that manage filesystem layers for containers. They determine when and how image data is retrieved from registries.
OverlayFS (Eager Loading - Default)
overlayfs downloads all compressed image layers before container start. It extracts each layer to local disk and stacks the layers to create a unified filesystem view. The container process begins only after the download and extraction are complete.
SOCI (Index-based Lazy Loading)
SOCI (Seekable OCI) operates without converting container images. It generates a separate index file mapping files to byte ranges within compressed layers. At container start, SOCI mounts the filesystem immediately. When the application requests a file that is not yet local, SOCI uses HTTP range requests to fetch only the compressed bytes needed for that specific file.
Nydus (Chunk-based Lazy Loading)
Nydus requires converting OCI images to RAFS (Registry Acceleration File System) format. During conversion, Nydus splits files into deduplicated chunks and builds a metadata tree. The container mounts a FUSE-based filesystem; file access requests traverse the metadata tree, fetching only the required data chunks on demand. Nydus supports two backend modes: FUSE (userspace) and EROFS+fscache (kernel-based). EROFS+fscache requires custom-built kernels with experimental features enabled and provides better performance at the cost of reduced compatibility.
eStargZ (Layer-based Lazy Loading)
eStargZ embeds a table of contents within image layers and requires full image conversion to the eStargZ format. Like SOCI, it enables on-demand file fetching via HTTP range requests.
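The core primitive behind SOCI and eStargZ lazy loading is the HTTP Range header on registry blob downloads, which registries generally honor for blob GETs. Below is a minimal sketch of fetching a single byte range, assuming blob_url points at a layer blob and the offset/length come from an index such as SOCI's; the fetch_range helper is hypothetical and not part of any snapshotter API.

```python
# Minimal sketch of the on-demand fetch that lazy-loading snapshotters
# perform: read only the byte range of a layer blob that backs the file
# being accessed, instead of pulling the whole layer.
# `blob_url`, `offset`, and `length` are hypothetical inputs; in a real
# snapshotter they come from the registry API and the image index.
import urllib.request

def fetch_range(blob_url: str, offset: int, length: int) -> bytes:
    """Fetch `length` bytes starting at `offset` via an HTTP range request."""
    req = urllib.request.Request(
        blob_url,
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    )
    with urllib.request.urlopen(req) as resp:
        # 206 Partial Content indicates the server honored the range.
        assert resp.status == 206, f"unexpected status {resp.status}"
        return resp.read()

# Example: fetch the first 4 KiB of a layer blob.
# chunk = fetch_range("https://registry.example.com/v2/.../blobs/sha256:...", 0, 4096)
```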
Benchmark Configuration
Infrastructure
- Instance: AWS EC2 g6e.xlarge (4 vCPUs, 32 GB RAM, 1 × NVIDIA L40S GPU with 48 GB memory)
- Storage: EBS gp3 volumes with two configurations tested:
- Low throughput: 125 MB/s, 3,000 IOPS (baseline gp3)
- High throughput: 1,000 MB/s, 4,000 IOPS (provisioned)
- AMI: Amazon Linux 2023 with kernel 6.1
- Container Registry: AWS Elastic Container Registry (ECR)
- Model Registry: Hugging Face with hf-xet for optimized download throughput
| Image Name | Size | Workload |
|---|---|---|
| vLLM Server | 10 GB | Serve Qwen2-8B model |
| SGLang Server | 15 GB | Serve Qwen2-8B model |
| TensorRT Server | 30 GB | Serve Qwen2-8B model |
| Triton Server | 11 GB | Run CSM 1B model |
| CUDA SAM2 Server | 13 GB | Serve SAM2 with FastAPI |
| Axolotl Finetuning | 10 GB | Finetune Qwen2.5-7B-Instruct |
Results
Low Throughput Disk (125 MB/s, 3K IOPS)
SOCI and Nydus demonstrate a 24-61% reduction in completion time compared to overlayfs across all workloads. The largest improvements occur with the largest images (TensorRT: 55% reduction with SOCI). eStargZ shows no improvement over overlayfs due to higher on-demand fetch overhead.
[GRAPH: Bar chart comparing completion times across snapshotters for 125 MB/s disk configuration, grouped by workload]
High Throughput Disk (1000 MB/s, 4K IOPS)
With high-throughput disks, Nydus and SOCI maintain 14-43% time reduction compared to overlayfs. The performance advantage narrows compared to low-throughput disks, indicating that network and application initialization become larger bottlenecks as disk I/O improves.
[GRAPH: Stacked bar chart showing snapshotter + disk configuration combinations on X-axis, completion time on Y-axis, with clear comparison of how disk throughput affects each snapshotter]
Detailed Analysis: vLLM Startup
Low Throughput Disk (125 MB/s)
Lazy loading reduces Time to First Log by 93-94% (200+ seconds). However, First Log to Model Download increases by 45-92 seconds due to on-demand fetching of libraries required for application initialization (CUDA, Python dependencies). Net reduction in Total to Model Download: 64% for SOCI, 47% for Nydus.
Model Download, Weights to GPU, and Graph Capture phases show similar durations across snapshotters. Graph Capture is compute-bound and unaffected by storage configuration.
High Throughput Disk (1000 MB/s)
Higher disk throughput reduces all phases except Graph Capture (compute-bound). Model Download time decreases by 74% (230s → 60s) due to faster write speeds to local storage. Lazy loading continues to provide 35-42% reduction in Total to Model Download even with high-performance storage.
[GRAPH: Side-by-side phase breakdown comparing overlayfs vs soci vs nydus, showing how lazy loading shifts time from “Time to First Log” to “First Log to Model Download”]
Disk Throughput Impact
Upgrading from 125 MB/s to 1,000 MB/s EBS gp3 storage yields:
- OverlayFS: 678s → 356s (47% reduction)
- Nydus: 583s → 306s (47% reduction)
- SOCI: 516s → 300s (42% reduction)
For most workloads, the gain from upgrading disk throughput exceeds the gain from switching to lazy loading on the slower disk. This indicates that disk I/O is the dominant bottleneck during the model download and load phases. On AWS EBS gp3, provisioning 1,000 MB/s of throughput costs an additional $35/month per volume compared to the 125 MB/s baseline.
[GRAPH: Line chart showing completion time vs disk throughput for each snapshotter, demonstrating how disk performance affects different snapshotters]
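For reference, the $35/month figure follows from gp3's pricing model, where the first 125 MB/s of throughput is included and additional throughput is billed per provisioned MB/s-month. The $0.04 rate in the sketch below reflects us-east-1 pricing at the time of writing and is an assumption to verify against current AWS pricing.

```python
# Back-of-the-envelope cost of provisioning extra gp3 throughput.
# The rate is the assumed us-east-1 price per provisioned MB/s-month
# above the 125 MB/s baseline; verify against current AWS pricing.
GP3_BASELINE_MBPS = 125
GP3_THROUGHPUT_RATE = 0.04  # $ per MB/s-month beyond the baseline

def gp3_throughput_cost(provisioned_mbps: int) -> float:
    """Monthly surcharge for throughput provisioned above the gp3 baseline."""
    extra = max(0, provisioned_mbps - GP3_BASELINE_MBPS)
    return extra * GP3_THROUGHPUT_RATE

print(gp3_throughput_cost(1000))  # 35.0 -> matches the ~$35/month figure
```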
Parallel Pull: Optimized Eager Loading
SOCI provides a “Parallel Pull” eager loading mode that uses multiple concurrent download and decompression streams. On g6e.xlarge (4 vCPUs), its performance was comparable to overlayfs. Testing on g6e.16xlarge (64 vCPUs, 256 GB RAM) with a 1,000 MB/s disk gave:
| Snapshotter | SGLang Startup Time |
|---|---|
| overlayfs | 270s |
| soci (lazy) | 231s |
| nydus (lazy) | 240s |
| soci (parallel pull) | 221s |
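Conceptually, parallel pull splits each layer blob into ranges and downloads them on several workers at once, overlapping decompression, which is why it benefits from high core counts. A rough sketch of the download side, reusing the hypothetical fetch_range helper from the earlier range-request sketch; the chunk size and worker count are illustrative, not SOCI's actual defaults.

```python
# Rough sketch of parallel-pull-style downloading: split a blob into
# fixed-size ranges and fetch them concurrently. Chunk size and worker
# count are illustrative; a real implementation also overlaps
# decompression and unpacking with the downloads.
from concurrent.futures import ThreadPoolExecutor

CHUNK = 8 * 1024 * 1024  # 8 MiB ranges (illustrative)

def parallel_pull(blob_url: str, blob_size: int, workers: int = 16) -> bytes:
    """Download a blob as concurrent range requests and reassemble it in order."""
    offsets = range(0, blob_size, CHUNK)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda off: fetch_range(blob_url, off, min(CHUNK, blob_size - off)),
            offsets,
        )
        return b"".join(parts)
```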
Implementation Considerations
Image Format Requirements
- SOCI: No image conversion required. Generates separate index files stored alongside images in the registry. Compatible with existing OCI images.
- Nydus: Requires converting OCI images to RAFS format, which adds a step to the image management lifecycle. Supports chunk-level deduplication, potentially reducing total storage for similar images.
- eStargZ: Requires full conversion to the eStargZ format. Adds conversion overhead without any performance benefit observed in our benchmarks.
Filesystem Backend Trade-offs
- FUSE (Userspace): All snapshotters support FUSE. Compatible with standard kernels, but incurs performance overhead from userspace-kernel context switches.
- EROFS+fscache (Kernel): Nydus supports EROFS with fscache for kernel-level I/O. Requires custom kernel builds with experimental features enabled; improved performance but reduced operational simplicity.
Operational Dependencies
Lazy loading introduces a runtime dependency on registry availability: a network disruption during container runtime can cause application failures if a not-yet-fetched file is subsequently requested. Overlayfs eliminates this risk by completing all downloads before container start.
Recommendations
For Large Container Images (>20 GB)
Lazy loading snapshotters provide a 40-55% startup time reduction. SOCI is recommended for production deployments because it requires no image conversion and remains OCI-compatible. For the 30 GB TensorRT image, SOCI reduced startup from 1430s to 638s on low-throughput disks (55% reduction).
For Large Models (>15 GB)
Disk and network throughput are the primary bottlenecks. Upgrading from baseline EBS gp3 (125 MB/s) to provisioned throughput (1,000 MB/s) reduced model download and load times by 70%, at a cost of $35/month per volume for throughput provisioning. This optimization is effective regardless of snapshotter choice.
For Multi-CPU Instances (>32 vCPUs)
SOCI parallel pull mode achieves the lowest startup times by leveraging high CPU core counts for concurrent decompression. On g6e.16xlarge, SOCI parallel pull outperformed the lazy loading modes by 4-8%.
Combined Strategy
For production AI inference deployments on H100 instances with large images and models:
- Use SOCI lazy loading for initial container start optimization (40-55% reduction)
- Provision EBS gp3 with 1,000 MB/s throughput for model download acceleration (70% reduction in model load time)
- Configure autoscaling with 60-120 second scale-down windows to balance cold start frequency against idle costs; a rough trade-off model is sketched below
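The scale-down window trades idle GPU cost against the chance that the next traffic spike arrives after shutdown and pays a full cold start. Below is a rough way to compare windows, under the simplifying assumption that gaps between spikes are exponentially distributed; every parameter is an illustrative assumption rather than a benchmark result.

```python
# Rough scale-down window trade-off: a longer window burns idle GPU time
# after each scale-down, a shorter one raises the chance the next spike
# arrives after shutdown and incurs the full cold start. All parameters
# here are illustrative assumptions, not measurements.
import math

GPU_HOURLY = 7.57        # $/GPU-hour, from the cost example above
COLD_START_S = 120       # optimized cold start (SOCI + high-throughput disk)

def idle_cost_per_day(window_s: float, scale_downs_per_day: int) -> float:
    """Dollars per day spent keeping one GPU warm for `window_s` after each scale-down."""
    return scale_downs_per_day * window_s / 3600 * GPU_HOURLY

def expected_extra_wait(window_s: float, mean_spike_gap_s: float) -> float:
    """Expected extra wait (s) on the next spike, assuming exponential spike gaps."""
    p_cold = math.exp(-window_s / mean_spike_gap_s)  # P(spike lands after shutdown)
    return p_cold * COLD_START_S

for window in (60, 120, 300, 900):
    print(f"window={window:>3}s  idle=${idle_cost_per_day(window, 10):5.2f}/day  "
          f"expected extra wait={expected_extra_wait(window, 600):5.1f}s")
```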
Conclusion
Cold start optimization directly impacts AI infrastructure economics. Reducing startup times from 10 minutes to under 2 minutes enables scaling strategies that cut H100 deployment costs by $20,000+/month through reduced overprovisioning. The optimization approach must match the bottleneck: lazy loading snapshotters for large container images, high-throughput storage for large models, or parallel pull for high-CPU instances. Fastpull provides scripts and configurations to replicate these benchmarks and implement lazy loading snapshotters on AWS, GCP, and Azure infrastructure. Implementation details and automation scripts are available at [repository link].
Note: All benchmark data was collected in October 2025 on AWS EC2 g6e instances in the us-east-1 region. Results may vary based on registry performance, network conditions, and specific workload characteristics.