
Kubernetes Deployment

Getting your models running on Kubernetes is straightforward: you can deploy them with the provided Helm chart or manually with kubectl.

Make sure you have a Kubernetes cluster running, kubectl is configured to talk to it, and your model images are accessible from the cluster.

tip

You can use kind to create a local Kubernetes cluster for testing purposes.
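For instance, a disposable local cluster can be created with kind as follows (the cluster name here is an arbitrary choice):

```shell
# create a throwaway local cluster for testing (name is illustrative)
kind create cluster --name aikit-test

# kind switches the current kubectl context automatically; verify it works
kubectl cluster-info --context kind-aikit-test
```

When you are done, the cluster can be removed with `kind delete cluster --name aikit-test`.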

Helm Chart

🎬 Demo: https://www.youtube.com/watch?v=ws5AUtLkuuc

For advanced deployments or customization options, you can use the Helm chart provided in the charts directory.

Please make sure Helm is installed and configured. If you don't have Helm installed, you can follow the official Helm installation instructions.

Install the chart using the following command:

helm repo add aikit https://sozercan.github.io/aikit/charts
helm install aikit aikit/aikit --namespace aikit --create-namespace

tip

By default, the chart deploys the llama-3-8b-instruct model. You can customize the deployment by providing a pre-built image, your own model image, or other options. The available options are listed in the values section below.
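As a sketch of such a customization, values from the chart can be overridden at install time with `--set` (the repository, tag, and replica count below are illustrative, not defaults you must change):

```shell
# illustrative overrides; any key from the values section can be set this way
helm install aikit aikit/aikit \
  --namespace aikit --create-namespace \
  --set image.repository=ghcr.io/sozercan/llama3 \
  --set image.tag=8b \
  --set replicaCount=2
```

Alternatively, the same overrides can be collected in a YAML file and passed with `-f values.yaml`.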

The chart enforces the restricted Pod Security Admission profile at the namespace level, following security hardening best practices.

Output will be similar to:

NAME: aikit
LAST DEPLOYED: Sat Jun 8 07:53:13 2024
NAMESPACE: aikit
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Access AIKit WebUI or API by running the following commands:

- Port forward the service to your local machine:

kubectl --namespace aikit port-forward service/aikit 8080:8080 &

- Visit http://127.0.0.1:8080/chat to access the WebUI

- Access the OpenAI API compatible endpoint with:

# replace this with the model name you want to use
export MODEL_NAME="llama-3-8b-instruct"
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"${MODEL_NAME}\", \"messages\": [{\"role\": \"user\", \"content\": \"what is the meaning of life?\"}]}"

As mentioned in the notes, you can then port-forward and send requests to your model, or navigate to the provided URL to access the WebUI.

# port-forward for testing locally
kubectl port-forward -n aikit service/aikit 8080:8080 &

# send requests to your model
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llama-3-8b-instruct",
"messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]
}'
{"created":1716695271,"object":"chat.completion","id":"809d031e-d78a-4e3a-9719-04683d9e29f9","model":"llama-3-8b-instruct","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"Kubernetes is an open-source container orchestration system that automates the deployment, scaling, and management of applications and services in a cloud-native environment."}}],"usage":{"prompt_tokens":11,"completion_tokens":31,"total_tokens":42}}
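If the backslash-escaped JSON in the notes is hard to read, one alternative sketch is to assemble the request body with printf and pass it to curl (the model name should match whatever your deployment serves):

```shell
# model name should match the model your deployment serves
MODEL_NAME="llama-3-8b-instruct"

# assemble the request body without escaping every quote
PAYLOAD=$(printf '{"model": "%s", "messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]}' "$MODEL_NAME")
echo "$PAYLOAD"
```

The body can then be sent with `curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d "$PAYLOAD"`.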

Values

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| image.repository | String | ghcr.io/sozercan/llama3 | The image repository |
| image.tag | String | 8b | The image tag |
| image.pullPolicy | String | IfNotPresent | The image pull policy |
| replicaCount | Integer | 1 | The number of replicas |
| imagePullSecrets | Array | [] | Image pull secrets |
| nameOverride | String | "" | Override the name |
| fullnameOverride | String | "" | Override the fullname |
| podAnnotations | Object | {} | Pod annotations |
| podLabels | Object | {} | Pod labels |
| podSecurityContext | Object | {} | Pod security context |
| securityContext | Object | {} | Security context |
| service.type | String | ClusterIP | Service type |
| service.port | Integer | 8080 | Service port |
| resources.limits.cpu | String | 4 | CPU resource limits |
| resources.limits.memory | String | 4Gi | Memory resource limits |
| resources.requests.cpu | String | 100m | CPU resource requests |
| resources.requests.memory | String | 128Mi | Memory resource requests |
| livenessProbe.httpGet.path | String | / | Path for the liveness probe |
| livenessProbe.httpGet.port | String | http | Port for the liveness probe |
| readinessProbe.httpGet.path | String | / | Path for the readiness probe |
| readinessProbe.httpGet.port | String | http | Port for the readiness probe |
| autoscaling.enabled | Boolean | false | If autoscaling is enabled |
| autoscaling.minReplicas | Integer | 1 | Minimum number of replicas for autoscaling |
| autoscaling.maxReplicas | Integer | 100 | Maximum number of replicas for autoscaling |
| autoscaling.targetCPUUtilizationPercentage | Integer | 80 | Target CPU utilization percentage for autoscaling |
| nodeSelector | Object | {} | Node selector |
| affinity | Object | {} | Affinity settings |
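As an example of tuning these values on an existing release, autoscaling can be switched on with helm upgrade (the thresholds below are assumptions for illustration, and CPU-based scaling requires a metrics server in the cluster):

```shell
# illustrative values; the HPA needs metrics-server for CPU-based scaling
helm upgrade aikit aikit/aikit --namespace aikit \
  --set autoscaling.enabled=true \
  --set autoscaling.minReplicas=1 \
  --set autoscaling.maxReplicas=10 \
  --set autoscaling.targetCPUUtilizationPercentage=80
```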

Manual Deployment

You can also deploy your models manually using kubectl. Here is an example:

# create a deployment
# replace the image with your own if needed
kubectl create deployment aikit-llama3 --image=ghcr.io/sozercan/llama3:8b

# expose it as a service
kubectl expose deployment aikit-llama3 --port=8080 --target-port=8080 --name=aikit

# easy to scale up and down as needed
kubectl scale deployment aikit-llama3 --replicas=3

# port-forward for testing locally
kubectl port-forward service/aikit 8080:8080 &

# send requests to your model
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llama-3-8b-instruct",
"messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]
}'
{"created":1701236489,"object":"chat.completion","id":"dd1ff40b-31a7-4418-9e32-42151ab6875a","model":"llama-3-8b-instruct","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"\nKubernetes is a container orchestration system that automates the deployment, scaling, and management of containerized applications in a microservices architecture."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
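Note that kubectl create deployment does not set resource requests, limits, or probes. As a sketch, requests and limits matching the chart's defaults can be added afterwards (adjust the values for your model and hardware):

```shell
# mirror the chart's default resource settings (values are assumptions)
kubectl set resources deployment aikit-llama3 \
  --requests=cpu=100m,memory=128Mi \
  --limits=cpu=4,memory=4Gi

# wait until the updated pods are ready before sending traffic
kubectl rollout status deployment aikit-llama3
```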