Deep Dive

Multi-Provider LLM Routing with OpenFaaS:
Groq, Vertex, and Ollama Behind One Endpoint

Warble Cloud · 6 min read · OpenFaaS · KubeRay · Prometheus
OpenFaaS · Groq · Vertex AI · Ollama · Prometheus · Cost control

Every team with more than one LLM use case ends up with the same problem: some tasks need maximum reasoning quality, others need sub-200ms latency, and some need to run without any external API calls. Hardcoding a single provider into every service is the path to sprawl — when you want to switch, you're touching application code everywhere.

On FRAKMA we solve this with a single OpenFaaS function that acts as the LLM gateway. Every inference request goes through it. The function decides which provider to use based on the task type and latency budget declared by the caller.
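For example, a caller that wants a fast generation might invoke the router like this. A minimal sketch: the gateway URL matches the deploy step further down, the standard OpenFaaS /function/<name> route is assumed, and the request values are illustrative.

# Example caller (sketch): hitting the router through the OpenFaaS gateway
import requests

resp = requests.post(
    "https://faas.frakma.io/function/llm-router",
    json={
        "task": "generate",        # embed | classify | generate | reason
        "max_latency_ms": 250,     # tight budget, routes to Groq in the handler below
        "private": False,          # True forces in-cluster Ollama
        "prompt": "Summarise the last three incident reports.",
    },
    timeout=30,
)
print(resp.json())                 # {"result": "...", "provider": "groq"}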

Why OpenFaaS for the router?

A router function fits the serverless model perfectly: stateless, fast to invoke, scales to zero when idle, and the built-in Prometheus metrics give you per-function cost tracking without extra instrumentation. The alternative — a dedicated microservice — adds a persistent deployment with idle CPU and memory costs even when no inference is happening.

The routing logic

# functions/llm-router/handler.py
import json
from providers import groq_generate, vertex_generate, ollama_embed

def handle(req: str) -> str:
    body     = json.loads(req)
    task     = body.get("task", "generate")  # embed | classify | generate | reason
    budget   = body.get("max_latency_ms", 2000)
    private  = body.get("private", False)  # force local Ollama
    prompt   = body["prompt"]

    if private or task == "embed":
        # Embeddings always local — no token cost, no data leaves cluster
        result = ollama_embed(prompt)

    elif task in ("classify", "generate") and budget < 300:
        # Low latency budget → Groq Llama 3.1: ~120ms typical
        result = groq_generate(prompt, model="llama-3.1-8b-instant")

    else:
        # Complex reasoning → Vertex Gemini 2.5 Flash
        result = vertex_generate(prompt, model="gemini-2.5-flash")

    # Provider helpers return an object exposing .text / .provider / .estimated_cost
    # (see the providers sketch below)
    return json.dumps({"result": result.text, "provider": result.provider})
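
The handler leans on a small providers module that wraps each backend and returns a uniform result object carrying the output, the provider name, and an estimated cost (the cost feeds the Prometheus counter further down). A minimal sketch of that interface, with the Ollama call shown against its HTTP embeddings API; the ProviderResult shape, the helper internals, and the embedding model name are assumptions, not the exact production code.

# providers.py (sketch): uniform result type plus the local Ollama embed path
import os
from dataclasses import dataclass

import requests

@dataclass
class ProviderResult:
    text: object                 # generated text, or the embedding vector for embed tasks
    provider: str                # "ollama" | "groq" | "vertex"
    estimated_cost: float = 0.0  # USD estimate, used by the cost counter below

def ollama_embed(prompt: str, model: str = "nomic-embed-text") -> ProviderResult:
    # OLLAMA_HOST is set in stack.yml; the default and the model name are assumptions
    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
    resp = requests.post(
        f"{host}/api/embeddings",
        json={"model": model, "prompt": prompt},
        timeout=30,
    )
    resp.raise_for_status()
    return ProviderResult(
        text=resp.json()["embedding"],
        provider="ollama",
        estimated_cost=0.0,  # no per-token cost; the GPU node is paid per hour
    )

groq_generate and vertex_generate would follow the same pattern against the Groq and Vertex SDKs.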

Provider comparison on FRAKMA

Provider                 | p50 latency  | Cost / 1M tokens        | Best for                                             | Data leaves cluster
Ollama (in-cluster)      | 80–300 ms    | $0 (GPU spot ~$0.52/hr) | Embeddings, classification, privacy-sensitive tasks  | No
Groq Cloud               | 120–200 ms   | ~$0.05–0.08             | Low-latency generation, agentic Actor calls          | Yes
Vertex Gemini 2.5 Flash  | 800–1500 ms  | ~$0.30–0.75             | Complex reasoning, safety-critical Critic calls      | Yes (GCP)

Deploying the router function

# functions/llm-router/stack.yml
version: "1.0"
provider:
  name: openfaas
  gateway: https://faas.frakma.io
functions:
  llm-router:
    # lang and handler are assumptions: the classic python3 template matches the
    # handle(req) signature used above, and the path is relative to the repo root
    lang: python3
    handler: ./functions/llm-router
    image: warbleoss.azurecr.io/fn-llm-router:latest
    environment:
      GROQ_API_KEY_FILE:   /var/openfaas/secrets/groq-api-key
      OLLAMA_HOST:         http://ollama.warble-system.svc.cluster.local:11434
      GOOGLE_CLOUD_PROJECT: warble-cloud
    secrets:
      - groq-api-key
    limits:
      memory: 256Mi
      cpu: 200m
    labels:
      com.openfaas.scale.zero: "true"
      com.openfaas.scale.zero-duration: 5m
# Deploy
faas-cli build -f functions/llm-router/stack.yml
faas-cli push  -f functions/llm-router/stack.yml
faas-cli deploy -f functions/llm-router/stack.yml \
  --gateway https://faas.frakma.io
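
Note that stack.yml points GROQ_API_KEY_FILE at the file OpenFaaS mounts for the groq-api-key secret, so the key never appears as a plain environment variable; the secret has to exist on the cluster before the deploy step. A sketch of how providers.py might read it (the helper name is hypothetical):

# providers.py (sketch): reading the Groq key from the OpenFaaS secret mount
import os

def _groq_api_key() -> str:
    # /var/openfaas/secrets/<name> is where OpenFaaS mounts secrets listed under `secrets:`
    path = os.environ.get("GROQ_API_KEY_FILE", "/var/openfaas/secrets/groq-api-key")
    with open(path) as f:
        return f.read().strip()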

Per-function cost tracking with Prometheus

OpenFaaS exposes a per-function duration metric (gateway_functions_seconds) and an invocation counter (gateway_function_invocation_total) out of the box. We extend this with a custom counter that records token cost per provider, emitted from the handler:

from prometheus_client import Counter

token_cost = Counter(
    "llm_token_cost_usd_total",
    "Estimated LLM token cost in USD",
    ["provider", "task"]
)

# After each call:
token_cost.labels(provider=result.provider, task=task).inc(result.estimated_cost)
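
The estimated_cost figure has to come from somewhere. One straightforward option is a per-million-token price map based on the comparison table above; the exact prices and where the token count comes from (the provider's usage metadata) are assumptions in this sketch.

# Rough per-call cost estimate (sketch), using midpoint prices from the table above
PRICE_PER_M_TOKENS = {
    "ollama": 0.0,   # in-cluster, paid per GPU-hour instead of per token
    "groq":   0.06,  # ~$0.05–0.08 per 1M tokens
    "vertex": 0.50,  # ~$0.30–0.75 per 1M tokens
}

def estimate_cost(provider: str, total_tokens: int) -> float:
    return PRICE_PER_M_TOKENS.get(provider, 0.0) * total_tokens / 1_000_000

# e.g. 1,200 tokens through Groq:
token_cost.labels(provider="groq", task="generate").inc(estimate_cost("groq", 1_200))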

Cost results after 30 days

Routing embeddings to Ollama and classification to Groq reduced our monthly LLM spend by 67% compared to sending everything to Vertex AI, with no measurable change in downstream task quality on the Reflexion incident resolution benchmark.

Scale-to-zero on idle

The com.openfaas.scale.zero: "true" label opts the function into scale-to-zero, and com.openfaas.scale.zero-duration: 5m tells the autoscaler to terminate the router pod after 5 minutes of inactivity. Cold start is ~2s for the Python handler. Where a 2s cold start is unacceptable, pin the minimum replicas to 1 instead (the com.openfaas.scale.min: "1" label): you pay for one idle pod but eliminate cold starts entirely.

Lessons learned

Next: WarbleApp CRD Deep Dive →