Optimize Gemma 3 Inference: vLLM on GKE 🏎️💨

Press enter or click to view image in full size

The responsiveness of AI-powered tools like chatbots and code assistants often feels instantaneous. But have you considered the underlying mechanisms that make this possible? Conversely, the occasional delays when using large language models can leave you pondering the computational hurdles involved.

Press enter or click to view image in full size

Keep the Cost Down: A Review on Methods to Optimize LLM’s KV-Cache Consumption

Enter the realm of open source, where projects are actively working to make the process of inference far more efficient.

vLLM: a notable example, born at UC Berkeley, directly addresses the challenges of speed and memory consumption. The vLLM community has developed several technologies aiming at improved efficiency: Paged Attention, Continuous Batching, Speculative Decoding, Chunked Prefill, and Flash Attention. vLLM has wide LLM support including Google’s Gemma models, DeepSeek V3, V3.1, and R1, Meta Llama models (including Day 0 support for Llama 4), Mixtral, Qwen2, and many many others. Its impact is significant, enabling blazing fast performance such as serving Google’s Gemma 3 27B model at rates exceeding 20,000 tokens per second or milliseconds responsiveness, capabilities we’ll dissect in this analysis.

In this work, I’ll explore efficient Gemma 3 inference on vLLM on GKE Autopilot.

· Why vLLM on GKE
· Infrastructure Foundation: GKE Autopilot, vLLM, and Hardware Selection
· GKE Goodies
· LLM Inference Memory Requirements
· Sample YAML
· vLLM Benchmark
· A Matter Of Concurrency
∘ Interpreting the Graphs:
∘ concurrency=1
∘ concurrency=10
∘ concurrency=50
∘ concurrency=100
∘ concurrency=500
· Scaling This Further
· Conclusions
· Further Readings

Why vLLM on GKE

Efficient inference directly tackles significant industry challenges by optimizing the cost of deploying large language models like Gemma 3, ultimately making advanced AI more sustainable and accessible. This optimization also translates to improved latency, leading to quicker response times and a better user experience in AI-powered applications. Furthermore, the combination of efficient inference frameworks like vLLM with scalable platforms such as GKE Autopilot simplifies resource management and enables easier scaling of these applications. Benchmarking and performance analysis provide crucial quantitative data on the benefits of this approach, demonstrating tangible improvements in latency, throughput, and resource utilization compared to alternative configurations. Fundamentally, this work showcases the powerful synergy of combining cutting-edge open-source inference optimization with managed cloud infrastructure for the seamless and efficient deployment of advanced AI models.

Infrastructure Foundation: GKE Autopilot, vLLM, and Hardware Selection

The core infrastructure for this analysis leverages GKE in Autopilot mode, specifically utilizing cluster version v1.31.6 . GKE Autopilot was selected for its inherent operational efficiency, particularly advantageous when managing resource-intensive workloads like LLM serving. As a fully managed Kubernetes platform, Autopilot abstracts the provisioning, scaling, and management of worker nodes, allowing focus to shift entirely towards the application layer — in this instance, the vLLM inference server. The cost model aligns directly with consumed resources (CPU, memory, ephemeral storage, and crucially, the GPU accelerators), supplemented by the standard GKE Autopilot cluster management fee ($0.10/hour per cluster).

Within this managed GKE environment, we deployed and evaluated vLLM version v0.8.2. Performance characterization was conducted across a diverse range of NVIDIA GPU accelerators:

NVIDIA L4: Based on the Ada Lovelace architecture, providing 24GB of GDDR6 memory. Ideal for inference workloads requiring a balance of performance and cost-efficiency.
NVIDIA A100: Based on the Ampere architecture, tested in two configurations:
— 40GB HBM2 memory variant.
— 80GB HBM2e memory variant.
NVIDIA H100: Based on the Hopper architecture, featuring 80GB of HBM3 memory.

The target LLM for serving via vLLM was Google’s Gemma 3 model. The gemma-3–27b-it (instruct-tuned) variant is our choice, specifically the original, non-quantized bfloat16 precision version, imposing significant memory and computational demands well-suited for evaluating the platform and hardware capabilities.

The secret to cost-efficient AI inference by

Google also provides foundational Kubernetes manifests for deploying models like Gemma on GKE:

GKE Goodies

Talking about GKE goodies, we have recently announced the GA of GKE automatic application monitoring. This provides ‘1-click,’ best-practice observability for popular model servers on GKE with out-of-the-box dashboards in Cloud Monitoring for vLLM, NVIDIA Triton, TGI, TorchServe, Tensorflow Serving, Jet Stream, and many more.

Press enter or click to view image in full size

GKE automatic application monitoring in action

GKE also provides dashboard for NVIDIA Data Center GPU Manager (DCGM), industry-standard tool from NVIDIA that provides a comprehensive view of GPU utilization, performance, and health.

Press enter or click to view image in full size

GKE DCGM dashboard in action

The funny thing is, enabling such features on GKE Autopilot is just two arguments away:

gcloud container clusters update vllm-cluster \
    --region us-central1 \
    --monitoring=SYSTEM,DCGM \
    --auto-monitoring-scope=ALL

LLM Inference Memory Requirements

Selecting a GPU capable of handling Gemma 3 27B was a critical consideration, driven by its substantial memory footprint. The 27 billion parameter model necessitates a minimum of 46.4 GB of VRAM, not accounting for the KV cache, CUDA kernels, and other runtime overhead. For this initial investigation, I opted to benchmark the full bf16 precision model to ensure reproducibility. While quantization techniques, readily accessible via tools like vLLM-Compressor, offer significant memory reduction, their implementation involves algorithm selection and rigorous performance evaluation — a topic deserving of its own dedicated exploration. Anyone interested in vLLM’s memory optimization capabilities can find comprehensive information in the official Google Cloud documentation:

Alternatively, for those keen to explore the intricacies of quantization and compression techniques, I highly recommend Red Hat’s in-depth paper on the subject:

Press enter or click to view image in full size

LLM Compressor architecture

Finally, Google recently released an illustrative animation visualizing GPU memory utilization during LLM inference. You can find it here:

Press enter or click to view image in full size

Sample YAML

To ensure complete reproducibility of the platform, you can follow these steps. First, provision a Google Kubernetes Engine (GKE) Autopilot cluster, specifying your desired network and subnet. The gcloud command for this step is provided below:

gcloud container clusters create-auto vllm-cluster \
    --region "us-central1" \
    --release-channel "regular" \
    --tier "standard" \
    --enable-dns-access \
    --enable-ip-access \
    --enable-master-global-access \
    --no-enable-google-cloud-access \
    --network <NETWORK NAME> --subnetwork <SUBNET NAME> \
    --cluster-ipv4-cidr "/17" \
    --monitoring=SYSTEM,DCGM \
    --auto-monitoring-scope=ALL \
    --binauthz-evaluation-mode=DISABLED \
    --scopes=https://www.googleapis.com/auth/cloud-platform

Next, apply the following Kubernetes manifest. This manifest configures several essential components:

A StorageClass for Hyperdisk ML persistent volumes.
The vLLM namespace.
A ClusterIP service for the vLLM deployment.
A PersistentVolumeClaim to store the model data.
A secret containing your Hugging Face access token (replace placeholder with your actual token).

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
    name: hyperdisk-ml
parameters:
    type: hyperdisk-ml
    provisioned-throughput-on-create: "2000Mi"
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: false
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
mountOptions:
  - read_ahead_kb=4096
---
apiVersion: v1
kind: Namespace
metadata:
  name: vllm
---
apiVersion: v1
kind: Service
metadata:
  namespace: vllm
  name: llm-service
spec:
  selector:
    app: vllm-openai-api
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: vllm
  name: hdml-static-pvc
spec:
  storageClassName: "hyperdisk-ml"
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
  namespace: vllm
type: Opaque
stringData:
  token: "<YOUR HUGGING FACE TOKEN>"

With the infrastructure in place, deploying vLLM is straightforward. The following manifest deploys vLLM version 0.8.3 on a single NVIDIA H100 GPU.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: vllm
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-openai-api
  template:
    metadata:
      labels:
        app: vllm-openai-api
        ai.gke.io/model: gemma-3-27b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.8.3
        resources:
          requests:
            cpu: "8"
            memory: "128Gi"
            ephemeral-storage: "120Gi"
            nvidia.com/gpu : "1"
          limits:
            cpu: "8"
            memory: "128Gi"
            ephemeral-storage: "120Gi"
            nvidia.com/gpu : "1"
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 5
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=1
        - --host=0.0.0.0
        - --port=8000
        - --gpu-memory-utilization=0.90
        - --trust-remote-code
        - --enable-chunked-prefill
        - --dtype=bfloat16
        - --swap-space=16
        env:
        - name: MODEL_ID
          value: google/gemma-3-27b-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - name: hdml-static
          mountPath: /root/.cache/huggingface
      volumes:
      - name: dshm
        emptyDir:
            medium: Memory
      - name: hdml-static
        persistentVolumeClaim:
          claimName: hdml-static-pvc
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb
        cloud.google.com/gke-gpu-driver-version: latest
        topology.kubernetes.io/zone: "us-central1-a"
      terminationGracePeriodSeconds: 15

To further optimize costs, consider deploying vLLM on a Spot VM. Achieve this by adding the following annotation to the nodeSelector section of your vLLM deployment: cloud.google.com/gke-spot: “true”. Having worked extensively with Kubernetes for the past seven years, I continue to be amazed by the simplicity and reliability of GKE Autopilot. It consistently delivers a smooth and efficient operational experience and it works like a charm.

vLLM Benchmark

To evaluate performance, the vLLM benchmark was employed. This benchmark focuses on key metrics:

Throughput: Measured in tokens generated per second.
Latency: Encompassing Time to First Token (TTFT), Time per Output Token (TPOT), and Inter-token Latency.
Customizable Dataset Support: Enabling performance assessment across various conversational scenarios.

For this analysis, the ShareGPT dataset was utilized, a collection of real-world conversations between users and LLMs.

Sourced from users voluntarily sharing their chat logs via the ShareGPT browser extension, it offers several benefits:

Realistic Dialogues: Captures diverse user queries, prompts, and model responses, reflecting practical LLM usage.
Broad Topic Coverage: Spans numerous subjects, including coding, writing, creative brainstorming, and general knowledge.
Real-world Interactions: Provides insights into how individuals are actually interacting with LLMs.

Running the vLLM benchmark can be done through a Kubernetes manifest that we wrote for this work:

apiVersion: v1
kind: Pod
metadata:
  name: vllm-benchmark
  namespace: vllm
spec:
  containers:
  - name: vllm-benchmark
    image: python:3.11-bookworm
    resources:
      requests:
        cpu: "4"
        memory: "8Gi"
        ephemeral-storage: "10Gi"
      limits:
        cpu: "4"
        memory: "8Gi"
        ephemeral-storage: "10Gi"
    command: ["/bin/bash", "-c"]
    args:
      - |
        cd /data
        mkdir -p /data/tmp
        export TMPDIR=/data/tmp        [ ! -f "/data/${DATASET_FILE}" ] && curl -L -o /data/"${DATASET_FILE}" "${DATASET_URL}"
        python3 -m venv vllm-benchmark
        source vllm-benchmark/bin/activate
        git clone https://github.com/vllm-project/vllm.git || true
        cd vllm && git checkout v${VLLM_VERSION}
        pip install --no-cache-dir -r requirements/test.txt
        pip install --no-cache-dir vllm==${VLLM_VERSION}
        echo "Init Completed"
        echo "kubectl -n vllm exec -it vllm-benchmark -- bash -c \" \\"
        echo "  source /data/vllm-benchmark/bin/activate "
        echo "  python3 /data/vllm/benchmarks/benchmark_serving.py \\"
        echo "  --host=llm-service --port=8000 \\"
        echo "  --dataset-name=sharegpt \\"
        echo "  --dataset-path=/data/${DATASET_FILE} \\"
        echo "  --tokenizer=google/gemma-3-27b-it \\"
        echo "  --request-rate=500 --backend=openai \\"
        echo "  --num-prompts=1000 --sharegpt-output-len 256 \\"
        echo "  --save-result --model=google/gemma-3-27b-it\""
        sleep infinity
    env:
    - name: DATASET_FILE
      value: ShareGPT_V3_unfiltered_cleaned_split.json
    - name: DATASET_URL
      value: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
    - name: VLLM_VERSION
      value: 0.8.2
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-secret
          key: token
    volumeMounts:
    - name: data-volume
      mountPath: /data
  volumes:
  - name: data-volume
    persistentVolumeClaim:
      claimName: vllm-benchmark-pvc
  nodeSelector:
    topology.kubernetes.io/zone: "us-central1-a"
  terminationGracePeriodSeconds: 5
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-benchmark-pvc
  namespace: vllm
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: "premium-rwo"
  resources:
    requests:
      storage: 20Gi

After deployment, the benchmark is controlled via command-line arguments. The --request-rate parameter defines the number of parallel requests, while --num-prompts specifies the total number of prompts to process. For example, --request-rate=1 --num-prompts=10 executes 10 prompts sequentially, one at a time.

A Matter Of Concurrency

Fun-fact, I deeply love specialty coffee and one of my favorite roastery in Europe is A Matter Of Concrete based in Rotterdam.

This brings us to a crucial question: While GPUs excel at handling individual inference requests with minimal overhead, how do they perform under concurrent load?

Press enter or click to view image in full size

nvitop showing H100s put under use

Interpreting the Graphs:

Green Bar: Total Token Throughput (tokens/s)
Blue Line: 99th percentile Time to First Token (TTFT) latency.
Red Line: Cost per 1M tokens.

concurrency=1

Press enter or click to view image in full size

At a single concurrent request, the total token throughput is remarkably similar across the entire bunch. This suggests that, for single requests, the GPU itself isn’t the primary bottleneck as long as sufficient VRAM is available. However, it’s crucial to consider the cost per token and latency.

For instance, a GKE Autopilot cluster incurs a monthly fee of $73.00, while 100GB of Hyperdisk ML storage with 2GB/s provisioned throughput costs $82.40. Four NVIDIA L4 GPUs (in a g2-standard-48 configuration in us-central1) add $2921.21, bringing the total monthly cost to $3,076.62, or approximately $0.001171 per second. In this scenario, the four NVIDIA L4s achieve a Total Token Throughput of 411.81 tokens/s. Therefore, the cost per token is approximately $0.0000028 ($0.001171/411.81) this multiplied by 1M results in $2.84.

While the four NVIDIA L4s offer the best price/performance ratio for a single concurrent request, their relatively high inference latency results in a suboptimal user experience.

concurrency=10

Press enter or click to view image in full size

At 10 concurrent requests, the Total Token Throughput increases substantially, indicating that GPU throughput and memory bandwidth become more significant factors. The single NVIDIA A100 with 80GB of HBM2e exhibits a strong price/performance balance but suffers from elevated inference latency. Notably, two NVIDIA A100s with 80GB of HBM2e offer the best overall performance in this configuration. A comparison between H100 and A100 GPUs with the same memory capacity highlights the architectural advantages and higher memory bandwidth of the newer H100, delivering nearly double the token throughput and almost three times lower inference latency.

To validate the user experience aspect beyond raw metrics, I performed some practical tests using aider, an AI pair programmer operating in the terminal. The single H100, the dual A100 80GB, and the single A100 80GB configurations all felt notably responsive. Interactions, including code generation and file edits, were smooth and snappy, indicating these setups adequately support interactive developer assistant workloads.

For completeness, the vLLM CPU backend was also evaluated, utilizing both Intel Emerald Rapids (on c4-standard-192) and AMD EPYC Turin (on c4d-standard-384). The results were substantially poorer compared to GPU acceleration: Total Token Throughput barely reached 300 tokens/s, while the 99th percentile TTFT latency exceeded 50 seconds for AMD and 65 seconds for Intel. Consequently, the cost per million tokens was dramatically higher, at an estimated $15.30 for AMD (the C4D price is not available yet thus the monthly cost for c3d-standard-360 was taken as reference) and $9.04 for Intel, rendering them impractical for this workload compared to the GPU options.

Press enter or click to view image in full size

Google Axion was also tested but I have yet the receive a single answer … Given this significant performance and cost gap, CPU configurations were excluded from the main comparative analysis. It’s worth noting that while optimizations exist, particularly for the Intel CPU backend, general-purpose CPUs fundamentally lack the massive parallel processing capabilities required to perform LLM matrix multiplications efficiently.

concurrency=50

Press enter or click to view image in full size

Increasing the concurrency to 50 — sweet spot for a development team using Agents and Coding Assistant tools like Cline — reveals that two H100s exceed 10,000 tokens per second with the lowest latency. Despite offering the best price/performance, the peered A100 with 80GB of HBM2e delivers a poor user experience due to its approximately five-second TTFT latency.

As a practical example of the responsiveness achievable, I used Cline on VSCode with two H100s to ask several detailed questions about the codebase of the open-sourced NVIDIA PhysX 5.6.0 SDK, repository made of approximately 11.5M tokens. The assistant handled these queries with impressive speed, as the following animation demonstrates:

Press enter or click to view image in full size

VSCode with Cline on the vLLM endpoint running on two H100s — of course Gemma 3 model FTW!

It’s crucial to consider the impact of latency on the user experience. Google’s “User Preference and Search Engine Latency” research underscores the importance of low latency in web applications, as users are likely to abandon a website if the loading time exceeds just a few seconds (the “2-second rule”). Gemma 3 could potentially run even on a low-cost Raspberry Pi 5, but the resulting user experience would be unacceptable.

Fun-fact: At one point during testing under significant load, vLLM printed the following status message:

INFO 04-04 14:09:02 [loggers.py:80]
Avg prompt throughput: 5268.6 tokens/s
Avg generation throughput: 6004.3 tokens/s
Running: 408 reqs
Waiting: 467 reqs
GPU KV cache usage: 99.8%
Prefix cache hit rate: 23.6%

This snapshot, periodically reported by vLLM, highlights the engine operating at high intensity. It was processing prompt tokens at an average rate of over 5,200 tokens/s and generating response tokens at over 6,000 tokens/s simultaneously 🤯. This occurred while handling a large number of requests (408 running, 467 waiting) and pushing the GPU KV cache to its limit (99.8%). The relatively modest prefix cache hit rate of 23.6% likely reflects the diversity of prompts within that specific processing interval from the ShareGPT dataset.

concurrency=100

Press enter or click to view image in full size

At 100 concurrent requests, several GPUs were excluded due to excessive waiting times. The results indicate that the given hardware struggles to provide an acceptable user experience, with even two H100s requiring 15 seconds to generate the first token.

concurrency=500

Press enter or click to view image in full size

At the demanding concurrency level of 500 — representative of a heavily utilized backend service handling simultaneous requests for diverse AI tasks (such as powering customer service chatbots, summarizing lengthy reports, analyzing sentiment in user reviews, drafting code snippets, or creating imaginative descriptions based on user prompts like ‘describe this scene in a Studio Ghibli style’) — the four-H100 setup truly excelled. It achieved over 22,000 tokens per second 🤯 with a TTFT latency kept under 1.2 seconds. Remarkably, this high throughput and responsiveness came at a cost-effective $0.55 per million tokens, showcasing the H100’s ability to scale under extreme load and the optimized design of vLLM.

Scaling This Further

While the preceding analysis focused on a scale-up approach within a single node, I acknowledge that this may not fully represent modern, cloud-native deployment strategies. For those interested in exploring more advanced scaling techniques, I offer the following resources:

First, my recent work on DeepSeek R1 671B details how to leverage the power of GKE and LeaderWorkerSet to scale LLM inference beyond the limitations of a single node.

Second, my colleague and good friend

recently presented a Load Balancer for Inferencing, demonstrating how to scale beyond a single LLM inference engine using technologies such as Inference Traffic Management and Model-Aware Routing.

at MWC 2025 presenting Load Balancer For Inferencing

Finally, for further insights into vLLM’s specific scaling and optimization features, I recommend reviewing the following resources:

Conclusions

The core finding is that vLLM stands out as a highly effective engine for LLM inference, dramatically improving efficiency through its advanced techniques like Paged Attention and Continuous Batching, which directly address critical speed and memory challenges. Deploying this powerhouse on GKE Autopilot further streamlines operations; Autopilot’s managed nature removes the complexity of node provisioning and scaling, offering a resource-based cost model ideal for demanding AI workloads.

A crucial insight from the benchmarks is how performance, particularly throughput and latency, hinges directly on the level of concurrent requests. Simply choosing the fastest GPU for single requests isn’t enough. Instead, hardware selection must be carefully tailored to the anticipated user load. For instance:

NVIDIA L4s offer excellent cost-efficiency at very low concurrency but struggle with latency.
A100s present a reasonable balance, particularly the 80GB variant.
H100s truly shine under heavy load, delivering superior throughput and the low latency essential for a good user experience.

This highlights that TTFT latency is paramount for user satisfaction, often more so than raw throughput. Achieving high token rates is less valuable if users face frustrating initial delays. The analysis demonstrated that serving the demanding Gemma 3 27B model at over 20,000 tokens per second is achievable with vLLM on GKE using multiple H100s, showcasing impressive scalability. GKE’s integrated observability features for vLLM and GPU metrics (DCGM) also provide vital performance visibility with minimal setup. Ultimately, optimizing inference, as demonstrated here with reproducible configurations, is fundamental to making large models like Gemma 3 cost-effective, sustainable, and truly accessible for widespread use.

I hope you found this vLLM and GKE Autopilot adventure useful and interesting! Until next time, happy coding! 👋