Reflexion AI is Warble Cloud's autonomous incident resolution platform. It uses an Actor/Critic LLM loop to detect, diagnose, and remediate production incidents without waking anyone up. This post documents how we deploy it on FRAKMA using the WarbleApp CRD — and how the three-variant agent design lets us swap LLM providers with a single kubectl apply, with no changes to application code.
The Actor/Critic architecture
Reflexion Engine runs a two-phase reasoning loop for every incident:
- Actor LLM: generates a remediation hypothesis — "scale the deployment to 5 replicas", "restart the pod", "roll back to v1.2". Fast, cheap, creative.
- Critic LLM: evaluates the hypothesis for SLO impact and blast radius. Conservative temperature (0.1). Safety-critical. Always on Vertex AI.
- Executor: if the Critic approves, the action is dry-run, guardrails-checked, and applied.
The key insight: only the Actor needs to change when you want to try a different model. The Critic is a fixed-point validator — it should never be cost-optimised.
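In code, the loop is roughly the shape below. This is a minimal sketch rather than the engine's actual source: the Incident type, the prompts, and the APPROVE convention are placeholder assumptions, and the Generate(ctx, prompt) call mirrors the provider-neutral internal/llm interface described in the lessons at the end of this post.

```go
package reflexion

import (
	"context"
	"fmt"
	"strings"
)

// LLM is the provider-neutral client; Generate is the only method the loop needs.
type LLM interface {
	Generate(ctx context.Context, prompt string) (string, error)
}

// Incident is a placeholder for however the real engine models an alert.
type Incident struct{ Summary string }

// Resolve runs one Actor -> Critic -> Executor pass for a single incident.
func Resolve(ctx context.Context, actor, critic LLM, inc Incident) error {
	// Phase 1: the Actor proposes a remediation hypothesis (fast, cheap, creative).
	hypothesis, err := actor.Generate(ctx, "Propose a remediation for: "+inc.Summary)
	if err != nil {
		return fmt.Errorf("actor: %w", err)
	}

	// Phase 2: the Critic scores the hypothesis for SLO impact and blast radius.
	verdict, err := critic.Generate(ctx, "Approve or reject, with reasons: "+hypothesis)
	if err != nil {
		return fmt.Errorf("critic: %w", err)
	}
	if !strings.HasPrefix(verdict, "APPROVE") {
		return fmt.Errorf("critic rejected hypothesis: %s", verdict)
	}

	// Executor: dry-run, guardrail checks, then apply.
	return apply(ctx, hypothesis)
}

func apply(ctx context.Context, action string) error {
	// The real executor dry-runs the action, checks guardrails, then applies it.
	return nil
}
```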
Three agent variants, one image
All three engine variants use the same container image. The LLM provider is selected at runtime via the LLM_PROVIDER environment variable. This means one build, three deployment manifests.
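A sketch of how that runtime selection can look. The stub client and constructor are placeholders (the real clients call Vertex, Groq, or Ollama), and the vertex/ollama values for LLM_PROVIDER are assumed by analogy with the groq manifest below.

```go
package reflexion

import (
	"context"
	"fmt"
	"os"
)

// LLM repeats the interface from the loop sketch so this snippet stands alone.
type LLM interface {
	Generate(ctx context.Context, prompt string) (string, error)
}

// stubClient stands in for the real Vertex/Groq/Ollama clients in this sketch.
type stubClient struct{ provider string }

func (s stubClient) Generate(ctx context.Context, prompt string) (string, error) {
	return "stub response from " + s.provider, nil
}

// NewActorFromEnv picks the Actor provider from LLM_PROVIDER at startup,
// so the same image can back all three variants.
func NewActorFromEnv() (LLM, error) {
	provider := os.Getenv("LLM_PROVIDER")
	switch provider {
	case "vertex", "groq", "ollama":
		return stubClient{provider: provider}, nil
	default:
		return nil, fmt.Errorf("unknown LLM_PROVIDER %q", provider)
	}
}
```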
engine-vertex
Gemini 2.5 Flash as both Actor and Critic. Best reasoning quality. Requires GCP Workload Identity or service account key.
engine-groq
Llama 3.1-8b via Groq Cloud as Actor. Vertex Critic stays. ~10× cheaper per incident, <200ms actor latency.
engine-ollama
Local Ollama on GPU spot nodes as Actor. No external API calls for inference. Vertex Critic still fires — one outbound call per validation.
The WarbleApp CRD manifests
Each variant is a WarbleApp custom resource. The reconciler creates a Deployment, ClusterIP Service, and nginx Ingress automatically. You never write those resources by hand.
```yaml
# k8s/reflexion/agents/engine-groq.yaml
apiVersion: warble.io/v1alpha1
kind: WarbleApp
metadata:
  name: reflexion-engine-groq
  namespace: warble-system
  labels:
    warble.io/agent: reflexion-engine
    warble.io/llm-provider: groq
spec:
  stack: reflexion
  image: warbleoss.azurecr.io/reflexion-engine:latest
  replicas: 2
  resources:
    requests: {cpu: 250m, memory: 256Mi}
    limits: {cpu: 500m, memory: 512Mi}
  ingress:
    enabled: true
    host: reflexion.frakma.io
    tlsEnabled: true
  env:
    - name: LLM_PROVIDER
      value: groq
    - name: GROQ_API_KEY
      valueFrom:
        secretKeyRef: {name: reflexion-secrets, key: GROQ_API_KEY}
    - name: DATABASE_URL
      valueFrom:
        secretKeyRef: {name: reflexion-secrets, key: DATABASE_URL}
```
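Applying the manifest and listing the generated children is enough to see the reconciler at work:

```bash
kubectl apply -f k8s/reflexion/agents/engine-groq.yaml
kubectl get deployment,service,ingress -n warble-system
```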
The Vertex and Ollama variants are structurally identical — only the LLM_PROVIDER value and the corresponding API key secret differ. Resource requests are smaller for the Groq variant because the Actor call returns in milliseconds rather than seconds.
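For comparison, the Vertex variant's delta looks roughly like the excerpt below. Only the Groq manifest is shown in full in this post, so the names here are illustrative:

```yaml
# k8s/reflexion/agents/engine-vertex.yaml (excerpt; values are illustrative)
metadata:
  name: reflexion-engine-vertex
  labels:
    warble.io/llm-provider: vertex
spec:
  env:
    - name: LLM_PROVIDER
      value: vertex
    # plus GCP credentials via a secret, or Workload Identity instead of a key
```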
Switching variants in production
All three variants share the same reflexion.frakma.io ingress host, so only one should be active at a time. Switching is a delete followed by an apply, plus an optional watch on the rollout:
```bash
# Remove current engine
kubectl delete wapp reflexion-engine-vertex -n warble-system

# Apply the new variant
kubectl apply -f k8s/reflexion/agents/engine-groq.yaml

# Watch the rollout
kubectl get wapp -n warble-system --watch
```
Or trigger the GitHub Actions deploy workflow with agent=engine-groq — it handles the image tag patch, kubectl diff, and the production approval gate for you.
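The same dispatch works from the CLI; the workflow file name here is an assumption about the repo layout:

```bash
gh workflow run deploy.yml -f agent=engine-groq
```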
Supporting components
| Component | Manifest | Why it exists |
|---|---|---|
| reflexion-ai-server | agents/ai-server.yaml | FastAPI service: hypothesis engine, healing API, metrics. port: 8000 |
| reflexion-executor | agents/executor.yaml | Validates and executes remediation actions with SLO + blast radius guardrails |
| reflexion-frontend | agents/frontend.yaml | Next.js dashboard. port: 3000, ingress: app.reflexion.frakma.io |
| qdrant | qdrant.yaml | Vector DB for semantic recipe search. Raw k8s — not a WarbleApp. |
The WarbleApp reconciler hardcoded port 8080 in its first version. ai-server runs on 8000 and frontend on 3000. Rather than forcing a PORT=8080 env var hack on every non-standard service, we added a port: field to the CRD spec so the reconciler wires the Service and Ingress backend to the right port natively.
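In practice that means a non-standard service just declares its port in the spec. A trimmed excerpt for ai-server:

```yaml
# agents/ai-server.yaml (excerpt)
spec:
  port: 8000   # the reconciler wires the Service and Ingress backend to this port
```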
Observability
Each component is scraped by Prometheus. Grafana dashboards track:
- Incident volume and resolution rate by severity
- Actor LLM latency (p50/p99) by provider
- Critic rejection rate — high rejection = Actor producing unsafe hypotheses
- Executor action success/failure + rollback rate
- Token cost per incident (from OpenFaaS function logs)
The cost dashboard is how we validated the Groq variant: same resolution rate, $0.003 per incident vs $0.031 on Vertex-only — a 10× cost reduction with no measurable quality difference on the incidents we tested.
Lessons learned
- Separate Actor from Critic early. The temptation is to use one model for everything. Separating them lets you cost-optimise the Actor without compromising the safety guarantees of the Critic.
- Provider-neutral LLM abstraction pays off immediately. The internal/llm Go interface (just Generate(ctx, prompt)) took one afternoon to build and made swapping providers trivial.
- WarbleApp CRD eliminates deployment boilerplate. Three variants, zero hand-written Deployment or Service YAMLs. The reconciler handles it all.
- GitOps means every provider switch is auditable. The Argo CD diff shows exactly what changed, who approved it, and when.