The default MLflow Helm chart ships with a bundled PostgreSQL sidecar and no artifact backend — fine for local development, unusable in production. This post documents the three changes we made to run MLflow reliably on FRAKMA: shared cluster PostgreSQL, Azure Blob artifact storage, and a wildcard TLS certificate via cert-manager.
Why not the bundled PostgreSQL?
The Helm chart's postgresql.enabled: true creates a single-replica Postgres pod in the mlflow namespace with no backup, no HA, and no shared access. Every time we upgraded the MLflow chart, we risked the bundled Postgres getting recreated and losing metadata. Moving to the shared warble-system PostgreSQL instance — backed by Azure Managed Disks with daily snapshots — solved all of that in one move.
The complete values.yaml
image:
  repository: burakince/mlflow
  tag: "3.7.0"  # community image with Azure Blob support built-in

backendStore:
  databaseMigration: true
  databaseConnectionCheck: true
  postgres:
    enabled: true
    host: postgresql.warble-system.svc.cluster.local
    port: 5432
    database: mlflow
    user: warble
    password: warble  # override with secretKeyRef in prod

artifactRoot:
  proxiedArtifactStorage: true  # MLflow proxies artifact downloads
  azureBlob:
    enabled: true
    storageAccount: warbleosstate
    container: mlflow-artifacts

# Azure storage key injected from secret (env var name = AZURE_STORAGE_ACCESS_KEY)
extraSecretNamesForEnvFrom:
  - mlflow-azure-secret

postgresql:
  enabled: false  # disable bundled Postgres

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: mlflow-basic-auth
  hosts:
    - host: mlflow.frakma.io
      paths: [{path: /, pathType: Prefix}]
  tls:
    - secretName: frakma-io-wildcard-tls
      hosts: [mlflow.frakma.io]
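Under the hood, the chart assembles the postgres block into a single SQLAlchemy connection URI for the MLflow server. A minimal sketch of that assembly, using our own helper name (not part of the chart), shows why the password needs URL-escaping when you override it:

```python
from urllib.parse import quote

def backend_store_uri(user: str, password: str, host: str, port: int, database: str) -> str:
    """Build the postgresql:// URI MLflow uses as its backend store.

    Credentials are percent-encoded so passwords containing '@' or ':'
    do not break URI parsing.
    """
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )

uri = backend_store_uri(
    "warble", "warble",
    "postgresql.warble-system.svc.cluster.local", 5432, "mlflow",
)
print(uri)
```

With the values above this yields `postgresql://warble:warble@postgresql.warble-system.svc.cluster.local:5432/mlflow`.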
The wildcard TLS certificate
Rather than provisioning a separate cert for every subdomain, we issue one wildcard cert (*.frakma.io) via cert-manager's DNS-01 challenge against Cloudflare. Every platform component references the same frakma-io-wildcard-tls secret.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: frakma-io-wildcard
  namespace: warble-system
spec:
  secretName: frakma-io-wildcard-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - "*.frakma.io"
    - frakma.io
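For reference, the letsencrypt-prod ClusterIssuer behind this looks roughly like the following sketch. The email and secret names are placeholders for our setup; the Cloudflare API token lives in its own secret and needs only the Zone:DNS:Edit permission:

  apiVersion: cert-manager.io/v1
  kind: ClusterIssuer
  metadata:
    name: letsencrypt-prod
  spec:
    acme:
      server: https://acme-v02.api.letsencrypt.org/directory
      email: ops@frakma.io  # placeholder
      privateKeySecretRef:
        name: letsencrypt-prod-account-key
      solvers:
        - dns01:
            cloudflare:
              apiTokenSecretRef:
                name: cloudflare-api-token
                key: api-token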
Once issued, the secret is replicated into other namespaces with a simple reflector — necessary because Kubernetes Ingress TLS can only reference a secret in the Ingress's own namespace — or the wildcard can be made ingress-nginx's fallback via its --default-ssl-certificate flag. Either way: no per-service cert provisioning, no per-subdomain renewals.
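With the emberstack reflector, for example, the copy is driven by annotations on the secret, which cert-manager can stamp on at issuance through the Certificate's secretTemplate. A sketch, assuming that reflector is installed and the target namespace is mlflow:

  spec:
    secretTemplate:
      annotations:
        reflector.v1.k8s.emberstack.com/reflection-allowed: "true"
        reflector.v1.k8s.emberstack.com/reflection-allowed-namespaces: "mlflow"
        reflector.v1.k8s.emberstack.com/reflection-auto-enabled: "true"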
Basic auth for the tracker UI
# Create htpasswd secret
htpasswd -nb mlflow-admin your-password | kubectl create secret generic mlflow-basic-auth \
  --from-file=auth=/dev/stdin -n mlflow
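On the client side, the MLflow library picks up basic-auth credentials from two environment variables, so training jobs need no code changes. The values here mirror the placeholder credentials used in the htpasswd command above:

```shell
# MLflow clients send these as HTTP basic auth on every tracking request.
export MLFLOW_TRACKING_USERNAME=mlflow-admin
export MLFLOW_TRACKING_PASSWORD=your-password
```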
Setting proxiedArtifactStorage: true means the MLflow server proxies artifact downloads through itself rather than giving clients a direct Azure SAS URL. This keeps the storage access key server-side and avoids exposing Azure credentials to every MLflow client.
Verifying the setup
# Check MLflow can connect to Postgres
kubectl logs -n mlflow -l app=mlflow | grep "database migration"

# Test artifact upload from a training job
import mlflow

mlflow.set_tracking_uri("https://mlflow.frakma.io")
with mlflow.start_run():
    mlflow.log_param("test", "value")
    mlflow.log_artifact("model.pkl")  # → warbleosstate/mlflow-artifacts/