Deep Dive

Multi-Provider LLM Routing with OpenFaaS:
Groq, Vertex, and Ollama Behind One Endpoint

Warble Cloud · 6 min read · OpenFaaS · KubeRay · Prometheus
OpenFaaS · Groq · Vertex AI · Ollama · Prometheus · Cost control

Every team with more than one LLM use case ends up with the same problem: some tasks need maximum reasoning quality, others need sub-200ms latency, and some need to run without any external API calls. Hardcoding a single provider into every service is the path to sprawl — when you want to switch, you're touching application code everywhere.

On FRAKMA we solve this with a single OpenFaaS function that acts as the LLM gateway. Every inference request goes through it. The function decides which provider to use based on the task type and latency budget declared by the caller.
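For example, a caller that wants a fast generation might invoke the router like this. A minimal sketch: the gateway URL matches the deploy step further down, the standard OpenFaaS /function/<name> route is assumed, and the request values are illustrative.

# Example caller (sketch): hitting the router through the OpenFaaS gateway
import requests

resp = requests.post(
    "https://faas.frakma.io/function/llm-router",
    json={
        "task": "generate",        # embed | classify | generate | reason
        "max_latency_ms": 250,     # tight budget, routes to Groq in the handler below
        "private": False,          # True forces in-cluster Ollama
        "prompt": "Summarise the last three incident reports.",
    },
    timeout=30,
)
print(resp.json())                 # {"result": "...", "provider": "groq"}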

Why OpenFaaS for the router?

A router function fits the serverless model perfectly: stateless, fast to invoke, scales to zero when idle, and the built-in Prometheus metrics give you per-function cost tracking without extra instrumentation. The alternative — a dedicated microservice — adds a persistent deployment with idle CPU and memory costs even when no inference is happening.

The routing logic

# functions/llm-router/handler.py
import json
from providers import groq_generate, vertex_generate, ollama_embed

def handle(req: str) -> str:
    body     = json.loads(req)
    task     = body.get("task", "generate")  # embed | classify | generate | reason
    budget   = body.get("max_latency_ms", 2000)
    private  = body.get("private", False)  # force local Ollama
    prompt   = body["prompt"]

    if private or task == "embed":
        # Embeddings always local — no token cost, no data leaves cluster
        result = ollama_embed(prompt)

    elif task in ("classify", "generate") and budget < 300:
        # Low latency budget → Groq Llama 3.1: ~120ms typical
        result = groq_generate(prompt, model="llama-3.1-8b-instant")

    else:
        # Complex reasoning → Vertex Gemini 2.5 Flash
        result = vertex_generate(prompt, model="gemini-2.5-flash")

    # Provider helpers return an object exposing .text / .provider / .estimated_cost
    # (see the providers sketch below)
    return json.dumps({"result": result.text, "provider": result.provider})
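
The handler leans on a small providers module that wraps each backend and returns a uniform result object carrying the output, the provider name, and an estimated cost (the cost feeds the Prometheus counter further down). A minimal sketch of that interface, with the Ollama call shown against its HTTP embeddings API; the ProviderResult shape, the helper internals, and the embedding model name are assumptions, not the exact production code.

# providers.py (sketch): uniform result type plus the local Ollama embed path
import os
from dataclasses import dataclass

import requests

@dataclass
class ProviderResult:
    text: object                 # generated text, or the embedding vector for embed tasks
    provider: str                # "ollama" | "groq" | "vertex"
    estimated_cost: float = 0.0  # USD estimate, used by the cost counter below

def ollama_embed(prompt: str, model: str = "nomic-embed-text") -> ProviderResult:
    # OLLAMA_HOST is set in stack.yml; the default and the model name are assumptions
    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
    resp = requests.post(
        f"{host}/api/embeddings",
        json={"model": model, "prompt": prompt},
        timeout=30,
    )
    resp.raise_for_status()
    return ProviderResult(
        text=resp.json()["embedding"],
        provider="ollama",
        estimated_cost=0.0,  # no per-token cost; the GPU node is paid per hour
    )

groq_generate and vertex_generate would follow the same pattern against the Groq and Vertex SDKs.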

Provider comparison on FRAKMA

Provider                 | p50 latency  | Cost / 1M tokens        | Best for                                             | Data leaves cluster
Ollama (in-cluster)      | 80–300 ms    | $0 (GPU spot ~$0.52/hr) | Embeddings, classification, privacy-sensitive tasks  | No
Groq Cloud               | 120–200 ms   | ~$0.05–0.08             | Low-latency generation, agentic Actor calls          | Yes
Vertex Gemini 2.5 Flash  | 800–1500 ms  | ~$0.30–0.75             | Complex reasoning, safety-critical Critic calls      | Yes (GCP)

Deploying the router function

# functions/llm-router/stack.yml
version: "1.0"
provider:
  name: openfaas
  gateway: https://faas.frakma.io
functions:
  llm-router:
    # lang and handler are assumptions: the classic python3 template matches the
    # handle(req) signature used above, and the path is relative to the repo root
    lang: python3
    handler: ./functions/llm-router
    image: warbleoss.azurecr.io/fn-llm-router:latest
    environment:
      GROQ_API_KEY_FILE:   /var/openfaas/secrets/groq-api-key
      OLLAMA_HOST:         http://ollama.warble-system.svc.cluster.local:11434
      GOOGLE_CLOUD_PROJECT: warble-cloud
    secrets:
      - groq-api-key
    limits:
      memory: 256Mi
      cpu: 200m
    labels:
      com.openfaas.scale.zero: "true"
      com.openfaas.scale.zero-duration: 5m
# Deploy
faas-cli build -f functions/llm-router/stack.yml
faas-cli push  -f functions/llm-router/stack.yml
faas-cli deploy -f functions/llm-router/stack.yml \
  --gateway https://faas.frakma.io
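
Note that stack.yml points GROQ_API_KEY_FILE at the file OpenFaaS mounts for the groq-api-key secret, so the key never appears as a plain environment variable; the secret has to exist on the cluster before the deploy step. A sketch of how providers.py might read it (the helper name is hypothetical):

# providers.py (sketch): reading the Groq key from the OpenFaaS secret mount
import os

def _groq_api_key() -> str:
    # /var/openfaas/secrets/<name> is where OpenFaaS mounts secrets listed under `secrets:`
    path = os.environ.get("GROQ_API_KEY_FILE", "/var/openfaas/secrets/groq-api-key")
    with open(path) as f:
        return f.read().strip()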

Per-function cost tracking with Prometheus

OpenFaaS exposes a per-function duration metric (gateway_functions_seconds) and an invocation counter (gateway_function_invocation_total) out of the box. We extend this with a custom counter that records token cost per provider, emitted from the handler:

from prometheus_client import Counter

token_cost = Counter(
    "llm_token_cost_usd_total",
    "Estimated LLM token cost in USD",
    ["provider", "task"]
)

# After each call:
token_cost.labels(provider=result.provider, task=task).inc(result.estimated_cost)
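
The estimated_cost figure has to come from somewhere. One straightforward option is a per-million-token price map based on the comparison table above; the exact prices and where the token count comes from (the provider's usage metadata) are assumptions in this sketch.

# Rough per-call cost estimate (sketch), using midpoint prices from the table above
PRICE_PER_M_TOKENS = {
    "ollama": 0.0,   # in-cluster, paid per GPU-hour instead of per token
    "groq":   0.06,  # ~$0.05–0.08 per 1M tokens
    "vertex": 0.50,  # ~$0.30–0.75 per 1M tokens
}

def estimate_cost(provider: str, total_tokens: int) -> float:
    return PRICE_PER_M_TOKENS.get(provider, 0.0) * total_tokens / 1_000_000

# e.g. 1,200 tokens through Groq:
token_cost.labels(provider="groq", task="generate").inc(estimate_cost("groq", 1_200))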

Cost results after 30 days

Routing embeddings to Ollama and classification to Groq reduced our monthly LLM spend by 67% compared to sending everything to Vertex AI, with no measurable change in downstream task quality on the Reflexion incident resolution benchmark.

Scale-to-zero on idle

The com.openfaas.scale.zero: "true" label opts the function into scale-to-zero, and com.openfaas.scale.zero-duration: 5m tells the autoscaler to terminate the router pod after 5 minutes of inactivity. Cold start is ~2s for the Python handler. Where a 2s cold start is unacceptable, pin the minimum replicas to 1 instead (the com.openfaas.scale.min: "1" label): you pay for one idle pod but eliminate cold starts entirely.

Lessons learned

Next: WarbleApp CRD Deep Dive →