Job Queues
Deploy your jobs and queue them programmatically with different parameters
In ML, jobs represent discrete tasks such as model training, inference, or data processing. Efficient management of these tasks is crucial, especially in shared resource environments. Job queues play a vital role in this process by optimising resource allocation and preventing resource contention. This is particularly important in ML workflows, where tasks are often resource-intensive and time-consuming.
Getting started with queued jobs
To get started with jobs, you need to have the Tensorfuse CLI installed on your machine. You can install the CLI using the following command:
Configuration for AWS
You can run the following commands to set up AWS credentials on your machine:
Alternatively, you can export them manually as environment variables:
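If you take the environment-variable route, it can help to confirm that the credentials are visible to the process before running anything else. A minimal sketch, assuming the standard AWS variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; the helper missing_aws_vars is hypothetical and not part of the Tensorfuse CLI. It does not detect credentials stored by aws configure, which live in ~/.aws/credentials rather than in the environment.

```python
import os

# Standard AWS SDK/CLI variable names. AWS_SESSION_TOKEN is only needed for
# temporary credentials, so it is not treated as required here.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(env=None):
    """Return the required AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("Missing AWS environment variables: " + ", ".join(missing))
    else:
        print("AWS credentials found in the environment.")
```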
Deploying and Running Jobs
- Deploy a Job
This command deploys a job with the specified parameters.
If each queued job carries a different payload, see "Accessing your payload" below for how to read the payload inside your deployment.
--name <job-name>: The name of the job.
--gpus <number-of-gpus>: The number of GPUs required for the job. [Default 0]
--gpu-type <gpu-type>: The type of GPU required.
--max-scale <max-scale>: The maximum scale for the job. [Default 3]
--cpu <cpu-units>: The amount of CPU units required. Used only if GPUs are 0. Specified in milliCPUs [Default 100]
--memory <memory-size>: The amount of memory required. Specified in MB [Default 200]
--secret <secret-key>: The name of the secret required by the job. Can be used multiple times to attach multiple secrets.
- Queue a Job
This command queues a job by pushing data to the queue, which triggers the execution of the job. Make sure that the job-name matches the job you deployed.
--job-name <job-name>: The name of the job to be queued.
--job-id <job-id>: The unique identifier for the job.
--payload <payload>: The parameters or data to be passed to the job. Data Type: String.
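Because the payload must be a string, structured parameters need to be serialised before they are passed to --payload. A minimal sketch of one way to do this, assuming JSON as the encoding. build_queue_args is a hypothetical helper, and the job name and ID are placeholders. It returns only the flags documented above, so the command that precedes them is left to the caller.

```python
import json
import shlex


def build_queue_args(job_name, job_id, params):
    """Return the documented queue flags with `params` serialised to a JSON string."""
    payload = json.dumps(params, sort_keys=True)
    return ["--job-name", job_name, "--job-id", job_id, "--payload", payload]


args = build_queue_args("my-job", "job-001", {"prompt": "Hello", "max_tokens": 64})
# shlex.join quotes the payload so it reaches the command as a single argument.
print(shlex.join(args))
```

Passing the list form directly to a process launcher avoids shell quoting altogether; the quoted string is only needed when pasting into a terminal.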
- Accessing your payload
To access your payload string inside the deployment, install the tensorkube package in your Docker image and add the following snippet to your code.
If you are sending a JSON object as a string, remember to convert it back to a JSON object like so:
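The JSON conversion can be sketched as follows. This is an illustration only: raw_payload stands in for the string your job receives, since the tensorkube call that retrieves it is not reproduced here, and parse_payload is a hypothetical helper.

```python
import json


def parse_payload(raw_payload):
    """Decode a JSON payload string, raising a clear error if it is malformed."""
    try:
        return json.loads(raw_payload)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Payload is not valid JSON: {exc}") from exc


# Stand-in for the string passed with --payload when the job was queued.
raw_payload = '{"prompt": "Hello", "max_tokens": 64}'
params = parse_payload(raw_payload)
print(params["prompt"])
```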
- Poll for Job Status
This command returns the status of a particular job.
--job-name <job-name>: The name of the job to be polled.
--job-id <job-id>: The unique identifier for the job whose status you want to check.
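When a script needs to wait for a job to finish, the status check can be wrapped in a polling loop. A minimal sketch under stated assumptions: get_status is a caller-supplied function (for example, one that runs the status command and returns its result), wait_for_job is a hypothetical helper, and the state names are placeholders because the actual status values are not listed in this section.

```python
import time


def wait_for_job(get_status, terminal_states, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call get_status() until it returns a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still in state {status!r} after {timeout} seconds")
        sleep(interval)


# Usage with a fake status source; the state names here are hypothetical.
fake_states = iter(["QUEUED", "RUNNING", "DONE"])
final = wait_for_job(lambda: next(fake_states), {"DONE", "ERROR"}, interval=0.0)
print(final)
```

A fixed interval keeps the sketch short; for long-running jobs a longer interval or a backoff reduces the number of status calls.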
Example
Let’s say you have an inference job as follows, and your job payload is the prompt for the inference.
Create a Dockerfile for this as follows:
Here, the download.py file downloads the model from Hugging Face and stores it during the build. It looks as follows:
Deployment
Deploy this job definition:
Queue a job process:
Get its status:
Programmatic Access to Job Queues
You can also queue jobs programmatically from your Python code.
Prerequisites:
- Tensorkube: Install the tensorkube package using the command
- AWS CLI: This is used by the tensorkube package to access the EKS cluster. You can find the steps to install the AWS CLI here: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
- Configure AWS: Run aws configure and enter your ACCESS_KEY_ID, SECRET_ACCESS_KEY, SESSION_TOKEN (only for Identity Center users) and REGION values as you are prompted. You can also directly modify your ~/.aws/credentials file. Read more about configuring your AWS CLI here: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
Code Snippet
To programmatically queue a job, add the following snippet to your code:
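To queue the same job several times with different parameters, the queueing call can be driven from a loop. A sketch under stated assumptions: queue_fn is a placeholder for whatever function actually submits the job (its real name and signature in tensorkube are not reproduced here, so it is injected rather than imported), and queue_batch and the uuid-based job IDs are illustrative choices.

```python
import json
import uuid


def queue_batch(queue_fn, job_name, param_sets):
    """Queue one job per parameter set and return the generated job IDs."""
    job_ids = []
    for params in param_sets:
        job_id = f"{job_name}-{uuid.uuid4().hex[:8]}"  # unique ID per queued job
        queue_fn(job_name, job_id, json.dumps(params))  # payload must be a string
        job_ids.append(job_id)
    return job_ids


# Dry run with a stand-in that records calls instead of contacting a cluster.
submitted = []
ids = queue_batch(
    lambda name, job_id, payload: submitted.append((name, job_id, payload)),
    "my-job",
    [{"prompt": "Hello"}, {"prompt": "Goodbye"}],
)
print(ids)
```

Keeping the returned job IDs makes it possible to check the status of each queued job later.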