Case Study

Closed-Loop LLM Fine-Tuning:
From Drift to Production in One DAG

Warble Cloud · 8 min read · Airflow · KubeRay · MLflow · KServe

Most fine-tuning workflows look like this: a data scientist notices model quality has dropped, opens a notebook, kicks off a training run locally, uploads the checkpoint manually, and tells the platform team to "update the model". This process takes days, is not reproducible, and happens after customers have already seen degraded output.

This post documents how we replaced that process with a single Airflow DAG on the FRAKMA stack — one that runs automatically when embedding drift crosses a threshold and ends with a KServe canary rollout, with zero humans in the critical path.

The problem: drift is silent until it isn't

LLMs trained on a static corpus drift when the real-world distribution shifts. For a classification model, you'd measure accuracy on labelled holdout data. For an LLM, there's usually no ground truth to compute accuracy against — so drift goes undetected until users start complaining.

The practical proxy is embedding drift: compare the cosine similarity distribution of today's query embeddings against a baseline from when the model was last trained. When the distribution diverges beyond a threshold, it's time to retrain.

Why embeddings?

Embedding vectors encode semantic meaning. When the query distribution shifts — new topics, new vocabulary, seasonal patterns — the embedding distribution shifts with it. This is measurable without labels and correlates well with downstream quality degradation in our experience.

The DAG: seven steps, fully automated

1. detect_embedding_drift · Qdrant cosine similarity vs baseline
   └─ drift_score > 0.15 → branch: retrain
2. prepare_finetune_dataset · fetch last 30d queries + synthetic pairs
3. submit_rayjob · KubeRay QLoRA on GPU spot nodes
4. evaluate_model · ROUGE-L, BERTScore, task-specific eval
   └─ eval_score > baseline → branch: promote
5. register_model · MLflow registry: staging → production
6. update_kserve_canary · 10% traffic to new InferenceService
7. monitor_and_promote · watch error rate 1h → full promote

Step 1: Detecting drift with Airflow + Qdrant

The drift sensor runs daily. It pulls the last 24h of query embeddings from Qdrant (stored by the inference function as a side effect), computes the mean pairwise cosine similarity, and compares it to a rolling baseline stored in PostgreSQL.

# dags/llm_retrain_on_drift.py
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago

@dag(schedule_interval="0 6 * * *", start_date=days_ago(1), catchup=False)
def llm_retrain_on_drift():

    @task
    def detect_drift() -> float:
        from qdrant_client import QdrantClient
        client = QdrantClient(host="qdrant.warble-system.svc.cluster.local")
        # scroll returns (points, next_offset); vectors are omitted unless requested
        points, _ = client.scroll("query_embeddings", limit=1000, with_vectors=True)
        # mean pairwise cosine similarity vs the stored baseline (see sketch below)
        drift_score = compute_drift(points)
        return drift_score

    @task.branch
    def should_retrain(drift_score: float) -> str:
        return "prepare_dataset" if drift_score > 0.15 else "skip"

    drift = detect_drift()
    should_retrain(drift)

# instantiate at module level so the scheduler picks up the DAG
llm_retrain_on_drift()
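
The snippet calls compute_drift, which isn't defined above. A minimal sketch of the metric described earlier (mean pairwise cosine similarity, compared against the stored baseline) could look like the following; the baseline value, which the real DAG reads from PostgreSQL, is passed in explicitly here:

import numpy as np

def compute_drift(points, baseline_similarity: float) -> float:
    """Illustrative only: absolute shift in mean pairwise cosine similarity."""
    vectors = np.array([p.vector for p in points])
    # normalise rows so the matrix product gives cosine similarities
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(vectors)
    # mean of the off-diagonal entries = mean pairwise cosine similarity
    mean_similarity = (sims.sum() - n) / (n * (n - 1))
    return abs(mean_similarity - baseline_similarity)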

Step 2–3: QLoRA fine-tuning on GPU spot nodes (KubeRay)

Once drift is confirmed, the DAG submits a RayJob to the KubeRay cluster. The job runs QLoRA fine-tuning on Llama 3.1 8B using the last 30 days of production queries as preference data. Spot GPU nodes (Standard_NC4as_T4_v3 on AKS) cost ~$0.52/hr and are only active during the job.

@task
def submit_rayjob(dataset_uri: str) -> str:
    import re
    import subprocess
    from airflow.operators.python import get_current_context

    # derive a Kubernetes-safe job name from the Airflow run id
    run_id = re.sub(r"[^a-z0-9-]", "-", get_current_context()["run_id"].lower())
    manifest = f"""
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: lora-finetune-{run_id}
  namespace: kuberay
spec:
  entrypoint: python train/qlora_finetune.py --data {dataset_uri}
  rayClusterSpec:
    workerGroupSpecs:
      - replicas: 2
        minReplicas: 1
        maxReplicas: 4
        groupName: gpu-workers
        template:
          spec:
            nodeSelector:
              warble.io/pool: gpu
            containers:
              - resources:
                  limits:
                    nvidia.com/gpu: "1"
"""
    subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest.encode(), check=True)
    return f"lora-finetune-{run_id}"

The training script logs all hyperparameters, per-epoch loss, and the final adapter checkpoint to MLflow automatically via mlflow.autolog(). The artifact is uploaded to Azure Blob storage (warbleosstate/mlflow-artifacts).
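
For reference, the core of a QLoRA setup in a script like train/qlora_finetune.py typically looks something like this; it's a sketch assuming Hugging Face transformers, peft and bitsandbytes, and the model id and hyperparameters are illustrative rather than our production values:

import mlflow
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

mlflow.autolog()  # hyperparameters, per-epoch loss and artifacts go to MLflow

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit base weights: the "Q" in QLoRA
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 GPUs have no bfloat16 support
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",             # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the adapter weights are trainable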

Step 4: Automated evaluation as a gate

An improved training loss from fine-tuning doesn't guarantee better production quality. Before promoting, the DAG runs an eval suite against a held-out test set:

Metric          Threshold              Why
ROUGE-L         > baseline + 0.02      Surface-level quality vs reference outputs
BERTScore F1    > 0.88                 Semantic similarity to reference
Task accuracy   > current production   Domain-specific correctness metric
Latency p95     < 2 s                  No regression in serving speed

If any gate fails, the DAG ends without promoting. The failed run is logged in MLflow for comparison, and a Slack alert fires via an Airflow callback.
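
Expressed as a second branch in the DAG, the gate looks roughly like this; it's a sketch, and the eval_gate and notify_failure task names and the metrics dict keys are illustrative:

@task.branch
def eval_gate(metrics: dict, baseline: dict) -> str:
    """Route to promotion only when every threshold from the table above holds."""
    passed = (
        metrics["rouge_l"] > baseline["rouge_l"] + 0.02
        and metrics["bertscore_f1"] > 0.88
        and metrics["task_accuracy"] > baseline["task_accuracy"]
        and metrics["latency_p95_s"] < 2.0
    )
    return "register_model" if passed else "notify_failure"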

Step 5–6: Promote to MLflow → canary on KServe

On passing all eval gates, the model is transitioned from staging to production in the MLflow registry.
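
In MLflow client terms the promotion is a one-line stage transition; here's a sketch using the classic stage-based registry API, assuming the registered model is named llm-prod:

from mlflow.tracking import MlflowClient

client = MlflowClient()
latest = client.get_latest_versions("llm-prod", stages=["Staging"])[0]
client.transition_model_version_stage(
    name="llm-prod", version=latest.version, stage="Production"
)

The next task patches the KServe InferenceService to route 10% of traffic to the new model: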

@task
def update_kserve_canary(model_uri: str):
    from kubernetes import client, config

    config.load_incluster_config()  # the task runs inside the cluster
    patch = {
        "spec": {
            "predictor": {
                "canaryTrafficPercent": 10,
                "model": {
                    "storageUri": model_uri,
                    "modelFormat": {"name": "mlserver"}
                }
            }
        }
    }
    client.CustomObjectsApi().patch_namespaced_custom_object(
        group="serving.kserve.io", version="v1beta1",
        namespace="warble-models", plural="inferenceservices",
        name="llm-prod", body=patch,
    )

Step 7: Monitor and auto-promote

The final task is a sensor that polls the canary's error rate every 5 minutes for 1 hour. If the error rate stays below 1%, it promotes the canary to 100% traffic. If it exceeds 2%, it rolls back automatically.
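
In pseudocode, the logic is roughly the following; get_canary_error_rate and set_canary_traffic are placeholder names for the real Prometheus query and KServe patch, the loop stands in for the actual sensor, and the handling of error rates between 1% and 2% is simplified here:

import time

@task
def monitor_and_promote() -> str:
    """Watch the canary for one hour, then promote or roll back."""
    deadline = time.monotonic() + 3600
    worst_error_rate = 0.0
    while time.monotonic() < deadline:
        error_rate = get_canary_error_rate()   # placeholder: e.g. a Prometheus query
        worst_error_rate = max(worst_error_rate, error_rate)
        if error_rate > 0.02:                  # hard threshold: roll back immediately
            set_canary_traffic(0)              # placeholder: reset canaryTrafficPercent
            return "rolled_back"
        time.sleep(300)                        # poll every 5 minutes
    if worst_error_rate < 0.01:                # stayed healthy for the whole hour
        set_canary_traffic(100)
        return "promoted"
    return "held_for_review"                   # simplification: the 1-2% band is left alone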

Result

End-to-end: drift detected at 06:00 → retrain job submitted at 06:02 → training completes ~45 min on 2× T4 GPU spot nodes → eval passes → canary live at 07:10 → full promote at 08:10. Total cost: ~$0.80 in GPU time per retrain cycle.

Key takeaways

- Embedding drift is a usable, label-free retraining trigger: compare today's query embeddings against a stored baseline and branch on a threshold.
- Evaluation gates (ROUGE-L, BERTScore, task accuracy, latency) stop a lower training loss from being mistaken for a better production model.
- Canary rollout plus automated monitoring removes humans from the critical path without removing the safety net.
- Spot GPUs keep the marginal compute cost of a retrain cycle under a dollar.

What's next

The natural extension is connecting the monitor step back to Qdrant — failed canary queries become labelled examples for the next training run, creating a genuine data flywheel. We'll cover that in a follow-up post on RAG pipeline orchestration.
