Every team with more than one LLM use case ends up with the same problem: some tasks need maximum reasoning quality, others need sub-200ms latency, and some need to run without any external API calls. Hardcoding a single provider into every service is the path to sprawl — when you want to switch, you're touching application code everywhere.
On FRAKMA we solve this with a single OpenFaaS function that acts as the LLM gateway. Every inference request goes through it. The function decides which provider to use based on the task type and latency budget declared by the caller.
Why OpenFaaS for the router?
A router function fits the serverless model perfectly: stateless, fast to invoke, scales to zero when idle, and the built-in Prometheus metrics give you per-function cost tracking without extra instrumentation. The alternative — a dedicated microservice — adds a persistent deployment with idle CPU and memory costs even when no inference is happening.
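The stock gateway metrics alone already answer basic traffic questions with no code in the function. As a sketch, assuming the standard `gateway_function_invocation_total` counter exposed by the OpenFaaS gateway:

```promql
# Per-second invocations of the router over the last 5 minutes, split by HTTP status
sum by (code) (rate(gateway_function_invocation_total{function_name="llm-router"}[5m]))
```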
The routing logic
```python
# functions/llm-router/handler.py
import json

from providers import groq_generate, vertex_generate, ollama_embed


def handle(req: str) -> str:
    body = json.loads(req)
    task = body.get("task", "generate")       # embed | classify | generate | reason
    budget = body.get("max_latency_ms", 2000)
    private = body.get("private", False)      # force local Ollama
    prompt = body["prompt"]

    if private or task == "embed":
        # Embeddings always local — no token cost, no data leaves cluster
        result = ollama_embed(prompt)
    elif task in ("classify", "generate") and budget < 300:
        # Low latency budget → Groq Llama 3.1: ~120ms typical
        result = groq_generate(prompt, model="llama-3.1-8b-instant")
    else:
        # Complex reasoning → Vertex Gemini 2.5 Flash
        result = vertex_generate(prompt, model="gemini-2.5-flash")

    # Provider wrappers return a result object with .output (text or embedding),
    # .provider and .estimated_cost (used by the cost counter below)
    return json.dumps({"result": result.output, "provider": result.provider})
```
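Callers never name a provider; they declare the task and the latency budget and let the router decide. A minimal invocation sketch against the gateway used later in this post (under the rules above, the 250 ms budget below would land on Groq):

```bash
curl -s https://faas.frakma.io/function/llm-router \
  -H "Content-Type: application/json" \
  -d '{"task": "classify", "max_latency_ms": 250, "prompt": "Is this log line an error?"}'
```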
Provider comparison on FRAKMA
| Provider | p50 latency | Cost / 1M tokens | Best for | Data leaves cluster |
|---|---|---|---|---|
| Ollama (in-cluster) | 80–300ms | $0 (GPU spot ~$0.52/hr) | Embeddings, classification, privacy-sensitive tasks | No |
| Groq Cloud | 120–200ms | ~$0.05–0.08 | Low-latency generation, agentic Actor calls | Yes |
| Vertex Gemini 2.5 Flash | 800–1500ms | ~$0.30–0.75 | Complex reasoning, safety-critical Critic calls | Yes (GCP) |
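The `result.estimated_cost` field used by the cost counter later in the post has to be derived from figures like these. Here is a hypothetical sketch of a price table keyed by provider; `PRICE_PER_1M_TOKENS` and `estimate_cost` are illustrative names, not part of the FRAKMA codebase:

```python
# Illustrative only: rough per-provider pricing from the table above
PRICE_PER_1M_TOKENS = {
    "ollama": 0.0,    # in-cluster, no per-token cost
    "groq": 0.08,     # upper end of the ~$0.05–0.08 range
    "vertex": 0.75,   # upper end of the ~$0.30–0.75 range
}

def estimate_cost(provider: str, total_tokens: int) -> float:
    """Rough USD cost estimate for one call at the rates above."""
    return PRICE_PER_1M_TOKENS[provider] * total_tokens / 1_000_000
```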
Deploying the router function
```yaml
# functions/llm-router/stack.yml
version: "1.0"
functions:
  llm-router:
    image: warbleoss.azurecr.io/fn-llm-router:latest
    environment:
      GROQ_API_KEY_FILE: /var/openfaas/secrets/groq-api-key
      OLLAMA_HOST: http://ollama.warble-system.svc.cluster.local:11434
      GOOGLE_CLOUD_PROJECT: warble-cloud
    secrets:
      - groq-api-key
    limits:
      memory: 256Mi
      cpu: 200m
    labels:
      com.openfaas.scale.zero: "true"
      com.openfaas.scale.zero-duration: 5m
```
```bash
# Deploy
faas-cli build -f functions/llm-router/stack.yml
faas-cli push -f functions/llm-router/stack.yml
faas-cli deploy -f functions/llm-router/stack.yml \
  --gateway https://faas.frakma.io
```
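With the function live, a quick smoke test through the gateway confirms routing end to end (the `embed` task should stay on the in-cluster Ollama):

```bash
echo '{"task": "embed", "prompt": "hello frakma"}' | \
  faas-cli invoke llm-router --gateway https://faas.frakma.io
```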
Per-function cost tracking with Prometheus
OpenFaaS exposes `gateway_function_duration_seconds` and invocation counters per function out of the box. We extend this with a custom counter that records token cost per provider, emitted from the handler:
```python
from prometheus_client import Counter

token_cost = Counter(
    "llm_token_cost_usd_total",
    "Estimated LLM token cost in USD",
    ["provider", "task"],
)

# After each call:
token_cost.labels(provider=result.provider, task=task).inc(result.estimated_cost)
```
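Assuming the function's /metrics endpoint is scraped by the same Prometheus, monthly spend per provider becomes a single query; a PromQL sketch:

```promql
# Estimated LLM spend per provider over the last 30 days
sum by (provider) (increase(llm_token_cost_usd_total[30d]))
```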
Routing embeddings to Ollama and classification to Groq reduced our monthly LLM spend by 67% compared to sending everything to Vertex AI, with no measurable change in downstream task quality on the Reflexion incident resolution benchmark.
Scale-to-zero on idle
The `com.openfaas.scale.zero: "true"` label tells the OpenFaaS autoscaler to terminate the router pod after 5 minutes of inactivity. Cold start is ~2s for the Python handler. For use cases where a 2s cold start is unacceptable, set the minimum replicas to 1 — you pay for one idle pod but eliminate cold starts entirely.
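If you take that route, it is just another pair of labels on the function; a sketch using the standard `com.openfaas.scale.min` label:

```yaml
    labels:
      com.openfaas.scale.zero: "false"   # keep the pod warm
      com.openfaas.scale.min: "1"        # always at least one replica
```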
Lessons learned
- Declare intent at the call site, not the provider. Callers specify `task` and `max_latency_ms`, never a specific provider. This lets you swap routing logic without changing any caller.
- Embeddings are always free to run locally. Ollama with `nomic-embed-text` on a T4 spot node produces embeddings in under 100 ms. There is no reason to pay a cloud provider for this.
- Track cost at the function level. The Prometheus counter per provider + task gave us the data to justify the routing investment in the first place.