Reflexion AI is Warble Cloud's autonomous incident resolution platform. It uses an Actor/Critic LLM loop to detect, diagnose, and remediate production incidents without waking anyone up. This post documents how we deploy it on FRAKMA using the WarbleApp CRD — and how the three-variant agent design lets us swap LLM providers with a single kubectl apply, with no changes to application code.
The Actor/Critic architecture
Reflexion Engine runs a two-phase reasoning loop for every incident:
- Actor LLM: generates a remediation hypothesis — "scale the deployment to 5 replicas", "restart the pod", "roll back to v1.2". Fast, cheap, creative.
- Critic LLM: evaluates the hypothesis for SLO impact and blast radius. Conservative temperature (0.1). Safety-critical. Always on Vertex AI.
- Executor: if the Critic approves, the action is dry-run, guardrails-checked, and applied.
The key insight: only the Actor needs to change when you want to try a different model. The Critic is a fixed-point validator — it should never be cost-optimised.
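In code, the loop is roughly the shape below. This is a minimal sketch rather than the engine's actual source: the Incident type, the prompts, and the APPROVE convention are placeholder assumptions, and the Generate(ctx, prompt) call mirrors the provider-neutral internal/llm interface described in the lessons at the end of this post.

```go
package reflexion

import (
	"context"
	"fmt"
	"strings"
)

// LLM is the provider-neutral client; Generate is the only method the loop needs.
type LLM interface {
	Generate(ctx context.Context, prompt string) (string, error)
}

// Incident is a placeholder for however the real engine models an alert.
type Incident struct{ Summary string }

// Resolve runs one Actor -> Critic -> Executor pass for a single incident.
func Resolve(ctx context.Context, actor, critic LLM, inc Incident) error {
	// Phase 1: the Actor proposes a remediation hypothesis (fast, cheap, creative).
	hypothesis, err := actor.Generate(ctx, "Propose a remediation for: "+inc.Summary)
	if err != nil {
		return fmt.Errorf("actor: %w", err)
	}

	// Phase 2: the Critic scores the hypothesis for SLO impact and blast radius.
	verdict, err := critic.Generate(ctx, "Approve or reject, with reasons: "+hypothesis)
	if err != nil {
		return fmt.Errorf("critic: %w", err)
	}
	if !strings.HasPrefix(verdict, "APPROVE") {
		return fmt.Errorf("critic rejected hypothesis: %s", verdict)
	}

	// Executor: dry-run, guardrail checks, then apply.
	return apply(ctx, hypothesis)
}

func apply(ctx context.Context, action string) error {
	// The real executor dry-runs the action, checks guardrails, then applies it.
	return nil
}
```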
Three agent variants, one image
All three engine variants use the same container image. The LLM provider is selected at runtime via the LLM_PROVIDER environment variable. This means one build, three deployment manifests.
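A sketch of how that runtime selection can look. The stub client and constructor are placeholders (the real clients call Vertex, Groq, or Ollama), and the vertex/ollama values for LLM_PROVIDER are assumed by analogy with the groq manifest below.

```go
package reflexion

import (
	"context"
	"fmt"
	"os"
)

// LLM repeats the interface from the loop sketch so this snippet stands alone.
type LLM interface {
	Generate(ctx context.Context, prompt string) (string, error)
}

// stubClient stands in for the real Vertex/Groq/Ollama clients in this sketch.
type stubClient struct{ provider string }

func (s stubClient) Generate(ctx context.Context, prompt string) (string, error) {
	return "stub response from " + s.provider, nil
}

// NewActorFromEnv picks the Actor provider from LLM_PROVIDER at startup,
// so the same image can back all three variants.
func NewActorFromEnv() (LLM, error) {
	provider := os.Getenv("LLM_PROVIDER")
	switch provider {
	case "vertex", "groq", "ollama":
		return stubClient{provider: provider}, nil
	default:
		return nil, fmt.Errorf("unknown LLM_PROVIDER %q", provider)
	}
}
```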
engine-vertex
Gemini 2.5 Flash as both Actor and Critic. Best reasoning quality. Requires GCP Workload Identity or service account key.
engine-groq
Llama 3.1-8b via Groq Cloud as Actor. Vertex Critic stays. ~10× cheaper per incident, <200ms actor latency.
engine-ollama
Local Ollama on GPU spot nodes as Actor. No external API calls for inference. Vertex Critic still fires — one outbound call per validation.
The WarbleApp CRD manifests
Each variant is a WarbleApp custom resource. The reconciler creates a Deployment, ClusterIP Service, and nginx Ingress automatically. You never write those resources by hand.
```yaml
# k8s/reflexion/agents/engine-groq.yaml
apiVersion: warble.io/v1alpha1
kind: WarbleApp
metadata:
  name: reflexion-engine-groq
  namespace: warble-system
  labels:
    warble.io/agent: reflexion-engine
    warble.io/llm-provider: groq
spec:
  stack: reflexion
  image: warbleoss.azurecr.io/reflexion-engine:latest
  replicas: 2
  resources:
    requests: {cpu: 250m, memory: 256Mi}
    limits: {cpu: 500m, memory: 512Mi}
  ingress:
    enabled: true
    host: reflexion.frakma.io
    tlsEnabled: true
  env:
    - name: LLM_PROVIDER
      value: groq
    - name: GROQ_API_KEY
      valueFrom:
        secretKeyRef: {name: reflexion-secrets, key: GROQ_API_KEY}
    - name: DATABASE_URL
      valueFrom:
        secretKeyRef: {name: reflexion-secrets, key: DATABASE_URL}
```
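Applying the manifest and listing the generated children is enough to see the reconciler at work:

```bash
kubectl apply -f k8s/reflexion/agents/engine-groq.yaml
kubectl get deployment,service,ingress -n warble-system
```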
The Vertex and Ollama variants are structurally identical — only the LLM_PROVIDER value and the corresponding API key secret differ. Resource requests are smaller for the Groq variant because the Actor call returns in milliseconds rather than seconds.
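For comparison, the Vertex variant's delta looks roughly like the excerpt below. Only the Groq manifest is shown in full in this post, so the names here are illustrative:

```yaml
# k8s/reflexion/agents/engine-vertex.yaml (excerpt; values are illustrative)
metadata:
  name: reflexion-engine-vertex
  labels:
    warble.io/llm-provider: vertex
spec:
  env:
    - name: LLM_PROVIDER
      value: vertex
    # plus GCP credentials via a secret, or Workload Identity instead of a key
```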
Switching variants in production
All three variants share the same reflexion.frakma.io ingress host, so only one should be active at a time. Switching is a delete followed by an apply, plus an optional watch on the rollout:
```bash
# Remove current engine
kubectl delete wapp reflexion-engine-vertex -n warble-system

# Apply the new variant
kubectl apply -f k8s/reflexion/agents/engine-groq.yaml

# Watch the rollout
kubectl get wapp -n warble-system --watch
```
Or trigger the GitHub Actions deploy workflow with agent=engine-groq — it handles the image tag patch, kubectl diff, and the production approval gate for you.
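The same dispatch works from the CLI; the workflow file name here is an assumption about the repo layout:

```bash
gh workflow run deploy.yml -f agent=engine-groq
```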
Supporting components
| Component | Manifest | Why it exists |
|---|---|---|
| reflexion-ai-server | agents/ai-server.yaml | FastAPI service: hypothesis engine, healing API, metrics. port: 8000 |
| reflexion-executor | agents/executor.yaml | Validates and executes remediation actions with SLO + blast radius guardrails |
| reflexion-frontend | agents/frontend.yaml | Next.js dashboard. port: 3000, ingress: app.reflexion.frakma.io |
| qdrant | qdrant.yaml | Vector DB for semantic recipe search. Raw k8s — not a WarbleApp. |
The WarbleApp reconciler hardcoded port 8080 in its first version. ai-server runs on 8000 and frontend on 3000. Rather than forcing a PORT=8080 env var hack on every non-standard service, we added a port: field to the CRD spec so the reconciler wires the Service and Ingress backend to the right port natively.
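In practice that means a non-standard service just declares its port in the spec. A trimmed excerpt for ai-server:

```yaml
# agents/ai-server.yaml (excerpt)
spec:
  port: 8000   # the reconciler wires the Service and Ingress backend to this port
```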
Observability
Each component is scraped by Prometheus. Grafana dashboards track:
- Incident volume and resolution rate by severity
- Actor LLM latency (p50/p99) by provider
- Critic rejection rate — high rejection = Actor producing unsafe hypotheses
- Executor action success/failure + rollback rate
- Token cost per incident (from OpenFaaS function logs)
The cost dashboard is how we validated the Groq variant: same resolution rate, $0.003 per incident vs $0.031 on Vertex-only — a 10× cost reduction with no measurable quality difference on the incidents we tested.
Lessons learned
- Separate Actor from Critic early. The temptation is to use one model for everything. Separating them lets you cost-optimise the Actor without compromising the safety guarantees of the Critic.
- Provider-neutral LLM abstraction pays off immediately. The internal/llm Go interface (just Generate(ctx, prompt)) took one afternoon to build and made swapping providers trivial.
- WarbleApp CRD eliminates deployment boilerplate. Three variants, zero hand-written Deployment or Service YAMLs. The reconciler handles it all.
- GitOps means every provider switch is auditable. The Argo CD diff shows exactly what changed, who approved it, and when.