Prerequisites
Before you begin, ensure you have configured Tensorkube on your AWS account. If you haven’t done that yet, follow the Getting Started guide.

Deploying Meta-Llama-3.2-11B-Instruct with Tensorfuse
Each Tensorkube deployment requires two things - your code and your environment (as a Dockerfile). When deploying machine learning models, it is beneficial if your model is also part of your container image, since this reduces cold-start times by a significant margin. To enable this, in addition to a FastAPI app and a Dockerfile, we will also write code to download the model and place it in our image. Learn more about Llama 3.2 and Gradio by visiting their docs. We are using a FastAPI server with a mounted Gradio app to serve model requests; for more information, refer to the Gradio docs here.

Download the model
We will write a small script that downloads the Llama-3.2-Vision-Instruct model from the Hugging Face model hub and saves it in the /models directory.
Note: Since Llama 3.2 is a gated repo, you will need to request access to the model from the repo authors here.
download_model.py
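A minimal sketch of the download script, assuming the huggingface_hub library is installed, that an HF_TOKEN environment variable carries your Hugging Face access token, and that meta-llama/Llama-3.2-11B-Vision-Instruct is the gated repo you were granted access to; adjust the model ID and target directory to match your setup.

```python
# download_model.py
# Downloads the gated Llama 3.2 Vision model into /models at image build time
# so the weights are baked into the container image (faster cold starts).
# Assumes HF_TOKEN is set in the environment and access to the repo was granted.
import os

from huggingface_hub import snapshot_download

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"
MODEL_DIR = "/models"

if __name__ == "__main__":
    os.makedirs(MODEL_DIR, exist_ok=True)
    # Pull all model files (weights, tokenizer, processor config) into MODEL_DIR
    snapshot_download(
        repo_id=MODEL_ID,
        local_dir=MODEL_DIR,
        token=os.environ.get("HF_TOKEN"),
    )
```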
Code files
We will write a small FastAPI app that loads the model and serves predictions by mounting the Gradio app. The FastAPI app will have three endpoints - /readiness, /, and /gradio. Remember that the /readiness endpoint is used by Tensorkube to check the health of your deployments.
main.py
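A minimal sketch of the FastAPI app, assuming the weights were saved to /models by download_model.py and that the transformers MllamaForConditionalGeneration and AutoProcessor classes are used to run the vision model; the generation settings and Gradio interface below are illustrative, not the definitive implementation.

```python
# main.py
# FastAPI app that loads the model baked into the image at /models and serves
# predictions through a Gradio UI mounted at /gradio.
import torch
import gradio as gr
from fastapi import FastAPI
from transformers import AutoProcessor, MllamaForConditionalGeneration

MODEL_DIR = "/models"

# Load the model and processor once at startup from the local directory
model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_DIR, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_DIR)

app = FastAPI()


@app.get("/")
def root():
    return {"message": "Llama 3.2 Vision server is running"}


@app.get("/readiness")
def readiness():
    # Tensorkube polls this endpoint to decide whether the deployment is healthy
    return {"status": "ready"}


def generate(image, prompt):
    # Pair the uploaded image with the user's text in a chat-style prompt
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    return processor.decode(output[0], skip_special_tokens=True)


# Build a simple Gradio interface and mount it on the FastAPI app at /gradio
demo = gr.Interface(
    fn=generate,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Prompt")],
    outputs=gr.Textbox(label="Response"),
)
app = gr.mount_gradio_app(app, demo, path="/gradio")
```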
Environment files (Dockerfile)
Next, create a Dockerfile for your FastAPI app. Given below is a simple Dockerfile that you can use:

Dockerfile
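A minimal sketch of such a Dockerfile, assuming a CUDA runtime base image, a BuildKit secret named HF_TOKEN for the gated repo, and uvicorn serving the app on port 80; pin versions and adjust the port to match your deployment configuration.

```dockerfile
# Sketch only: CUDA-enabled base image for GPU inference
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install the Python dependencies used by download_model.py and main.py
RUN pip3 install --no-cache-dir torch transformers accelerate huggingface_hub fastapi uvicorn gradio pillow

# Bake the model weights into the image to cut cold-start times.
# Assumes an HF_TOKEN BuildKit secret is provided at build time for the gated repo.
COPY download_model.py .
RUN --mount=type=secret,id=HF_TOKEN HF_TOKEN=$(cat /run/secrets/HF_TOKEN) python3 download_model.py

COPY main.py .

EXPOSE 80
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
```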