Learning
Jun 20, 2024
7 mins

What is serverless GPU computing?

Author
Agam Jain

Lately, serverless GPUs have been gaining a lot of traction among machine learning engineers. In this blog, we'll dive into what serverless computing is all about and trace the journey that brought us here. We'll also explore the benefits of serverless GPUs and how they can speed up the time to market for your ML projects. Whether you're a seasoned developer or just starting out, this guide will help you understand why serverless GPUs are becoming a game-changer in the tech world.

Learning objectives

By the end of this article, you will be able to define serverless computing and explain how it applies to running machine learning models on serverless GPUs. You will also be able to outline the pros and cons of serverless GPUs and make an informed decision about their value for your business.

Serverless computing and a brief history

Serverless computing is a cloud computing model that lets developers build and run application code without worrying about back-end infrastructure. Companies using serverless services are billed based on actual usage rather than pre-allocated server capacity. The term “serverless” is a misnomer in the sense that servers are still used to run the app, but as a developer you are not responsible for provisioning and managing them. This lets you focus fully on writing code and business logic. Serverless GPUs extend this model to GPU hardware.

To better understand, let’s explore how the deployment models have evolved over the years to reach today’s serverless computing paradigm.

The early days of web

Back in the 90s, creating a web application required owning the necessary hardware. This involved acquiring a machine, installing an operating system, setting up network configurations, installing and configuring web server software like Apache HTTP Server, and installing a database server such as MySQL. You also had to secure the server, configure firewall rules, install SSL certificates, upload your site files, and test and troubleshoot your setup.

And if that weren’t enough, ongoing maintenance would further delay your app’s launch.

This issue was addressed with the introduction of virtual machines (VMs) and the onset of the cloud computing era!

Put it in a container

As web technology evolved, so did application deployment methods. The arrival of VMs allowed multiple operating systems to operate on a single physical server, isolated from one another. VMs are fundamental to cloud computing, abstracting physical resources and enabling cloud providers to offer diverse services and solutions.

Containerization was the next significant development. Containers encapsulate an application and its dependencies into a single, portable entity that operates consistently across different environments. Unlike VMs, containers utilize the host system’s kernel, making them more resource-efficient and quicker to launch. Docker popularized this technology by providing tools to efficiently create, deploy, and manage containers.
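To make the container idea concrete, here is a rough sketch using the Docker SDK for Python (this assumes Docker and the `docker` package are installed locally; the image and command are arbitrary examples):

import docker

# Connect to the local Docker daemon.
client = docker.from_env()

# Run a short-lived container from a public image. Because the container shares
# the host kernel, it starts in seconds instead of booting a full OS like a VM.
output = client.containers.run(
    "python:3.9-slim",
    ["python", "-c", "print('hello from inside a container')"],
    remove=True,  # delete the container once it exits
)
print(output.decode())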

The origins of serverless

Serverless computing takes the abstraction a step further. In this model, developers code and deploy without server management. In serverless architectures, you write functions activated by events, such as HTTP requests, database updates, or file uploads. These functions operate in ephemeral, stateless compute containers that exist only during function execution.
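As a concrete illustration, here is a minimal sketch of such an event-driven function written in the style of an AWS Lambda handler (the event shape below is an illustrative assumption):

import json

# The platform calls this handler once per event (HTTP request, file upload, etc.).
# There is no server for you to provision; the container exists only for the call.
def handler(event, context):
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }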

Serverless computing traces back to 2008, when Google released Google App Engine (GAE). In 2014, Amazon introduced AWS Lambda, widely regarded as the first major function-as-a-service (FaaS) platform, which helped serverless computing gain mass-market appeal and rapid adoption among software developers. The rise of serverless computing has significantly increased the time developers spend writing code and business logic rather than managing infrastructure.

Advantages of using serverless GPUs

The serverless model offers several advantages:

  • Cost Efficiency: You are billed only for the compute time you consume, not for idle server time (see the back-of-the-envelope sketch after this list).
  • Scalability: The provider automatically scales the infrastructure to handle varying loads.
  • Improved Productivity: No need to manage servers, patch operating systems, or handle scaling. Developers can focus on writing code and business logic rather than managing infrastructure.
  • Faster Time to Market: Rapid deployment and updates are possible because there’s no infrastructure to manage.
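The billing difference is easiest to see with back-of-the-envelope arithmetic (the $1.00/hour GPU rate and traffic numbers below are illustrative assumptions, not quotes from any provider):

# Illustrative comparison: an always-on GPU instance vs. pay-per-use serverless billing.
GPU_RATE_PER_HOUR = 1.00      # assumed on-demand price for one GPU
REQUESTS_PER_DAY = 10_000
SECONDS_PER_REQUEST = 0.5     # assumed GPU time consumed per inference

always_on_monthly = GPU_RATE_PER_HOUR * 24 * 30
busy_hours_per_month = REQUESTS_PER_DAY * SECONDS_PER_REQUEST * 30 / 3600
serverless_monthly = busy_hours_per_month * GPU_RATE_PER_HOUR

print(f"Always-on GPU:  ${always_on_monthly:.2f}/month")    # $720.00
print(f"Serverless GPU: ${serverless_monthly:.2f}/month")   # ~$41.67 for the same traffic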

Challenges with serverless

  • Cold Start: A cold start is the latency experienced when a serverless function is invoked for the first time or after a period of inactivity, while the platform spins up a fresh container. This startup delay can hurt the performance and responsiveness of serverless applications, particularly in latency-sensitive, real-time workloads (a common mitigation pattern is sketched after this list).
  • Complex testing and debugging: Debugging can be more complicated with a serverless computing model as developers lack visibility into back-end processes.
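A common way to soften cold starts is to perform expensive setup, such as loading model weights, once per container rather than once per request; a minimal sketch, assuming a generic Python function runtime:

import time

# Module-level code runs once per container ("cold start"); warm invocations skip it.
_start = time.perf_counter()
# Expensive one-time setup would go here, e.g.:
# model = load_model_weights("bert-base-uncased")   # hypothetical helper
print(f"cold-start setup took {time.perf_counter() - _start:.2f}s")

def handler(event, context):
    # Warm invocations reach this point immediately and reuse the already-loaded model.
    return {"statusCode": 200, "body": "ok"}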

Who should use serverless GPUs and how to get started?

If you’re a developer aiming to reduce your go-to-market time and create lightweight, adaptable applications that can be easily expanded or updated, serverless computing is a game-changer. Serverless GPUs are ideal for a variety of users, including:

  • Early stage startups: For startups focused on AI, serverless GPUs offer numerous advantages:
    • Speed to market: Shipping quickly is the top priority for most startups; going serverless saves time, enabling faster deployment and a shorter go-to-market cycle.
    • Cost efficiency: Companies can scale their GPU usage up or down based on demand, ensuring they do not overpay for unused capacity.
  • Mid-market companies: Businesses that are growing can also benefit greatly from serverless GPUs:
    • Scalability: As your company grows, easily increase your computing resources to handle more complex tasks without worrying about physical infrastructure.
    • Operational efficiency: Focus more on developing AI models instead of managing hardware.
    • Better performance: Implement advanced inference engines like vLLM and TensorRT, and computational enhancements such as quantization, FSDP (Fully Sharded Data Parallel), and DDP (Distributed Data Parallel), to boost training and inference speeds (a minimal quantization sketch follows this list).
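As one example of the enhancements listed above, here is a minimal sketch of post-training dynamic quantization with PyTorch (the model and layer choices are illustrative, and this particular API targets CPU inference):

import torch
from transformers import BertModel

# Load a float32 model and switch it to inference mode.
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Replace Linear layers with int8 equivalents to shrink the model and speed up inference.
quantized = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)
print(quantized)  # Linear layers now appear as dynamically quantized modules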

Going serverless can provide you a competitive edge by streamlining your development workflow and enabling faster time to market.

At Tensorfuse, we are building a serverless GPU runtime that runs on your own AWS or other cloud infrastructure. This setup lets you leverage the compute credits provided by AWS and other cloud providers.

How to get started

To get started with serverless GPU platforms, you have a choice between fully managed solutions and platforms that operate on your own cloud infrastructure. Here’s a list of all the available options in each category:

  • Serverless on your own cloud/infra (BYOC)
    • Tensorfuse
    • AWS SageMaker
    • BentoML
    • Mystic AI
    • Azure ML
    • GCP Vertex AI
  • Fully managed
    • Modal
    • Replicate
    • Runpod
    • Beam cloud
    • Inferless
    • BentoML managed service
    • Mystic AI managed service

We will soon publish a comprehensive analysis comparing fully managed and BYOC platforms.

What’s next for serverless?

Serverless GPU computing is evolving towards minimizing cold start times to nearly zero seconds. Although significant advancements have been made with serverless CPU computing, reducing cold start times for serverless GPUs remains a persistent challenge.

At Tensorfuse, we are addressing this issue. If you are a Rust developer interested in pioneering the next generation of serverless GPU computing, contact us at [email protected]

Subscribe to stay up to date with new posts from Tensorfuse.

Get started with Tensorfuse today.

Deploy in minutes, scale in seconds.

import tensorkube

# Build the container image: CUDA base, Python 3.9, system and Python dependencies,
# environment variables, and a custom setup step.
# download_and_quantize_model is assumed to be defined elsewhere in the project.
image = (
    tensorkube.Image.from_registry("nvidia/cuda")
    .add_python(version='3.9')
    .apt_install(['git', 'git-lfs'])
    .pip_install(['transformers', 'torch', 'torchvision', 'tensorrt'])
    .env({'SOME-RANDOM-SECRET-KEY': 'xxx-xyz-1234-abc-5678'})
    .run_custom_function(download_and_quantize_model)
)

# Runs once per container start: load the model onto the GPU and register a
# reference so that subsequent function invocations can reuse it.
@tensorkube.entrypoint(image, gpu='A10G')
def load_model_on_gpu():
    import transformers
    model = transformers.BertModel.from_pretrained('bert-base-uncased')
    model.to('cuda')
    tensorkube.pass_reference(model, 'model')

# Handles each inference request using the model loaded by the entrypoint.
@tensorkube.function(image)
def infer(input: str):
    model = tensorkube.get_reference('model')
    # run the model on the input
    response = model(input)
    return response


