MLOps in 2025: Deploying ML Models to Production the Right Way

Learn the modern MLOps stack — FastAPI, Docker, MLflow, and CI/CD — to reliably ship ML models that actually stay working in production.

Shiv Shankar Sah · AI/ML Engineer
June 1, 2025
10 min read
#MLOps #Docker #FastAPI #MLflow #CI/CD

You've trained a model. Accuracy is great. You show it to stakeholders and everyone is excited. Then you try to deploy it and everything falls apart. The model works on your laptop but breaks in the cloud. Data drifts and accuracy silently degrades. There's no way to roll back to the previous version. Retraining takes a weekend of manual effort. This is the reality of ML without MLOps.

MLOps (Machine Learning Operations) is the set of practices, tools, and culture that bridges the gap between model development and reliable production deployment. In 2025, any serious ML team — whether at a Kathmandu startup or a global tech company — needs MLOps discipline. This guide walks you through the complete modern MLOps stack, with real code you can use today.

The MLOps Lifecycle

MLOps is not a one-time setup — it's a continuous cycle. Unlike traditional software, ML systems have an extra challenge: they degrade over time as the world changes (data drift), and improving them requires retraining on new data.
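
One rough way to make data drift visible is the Population Stability Index (PSI), which compares how a feature was distributed at training time with how it looks in live traffic. The sketch below is a minimal, standard-library illustration and not part of the stack covered in this guide; the function name `psi`, the 10 equal-width bins, and the epsilon used for empty bins are all assumptions chosen for the example.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread, cannot build bins")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        eps = 1e-6  # keeps log() defined when a bin is empty
        return [max(c / len(values), eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    # PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


reference = [i / 100 for i in range(100)]      # feature values seen at training time
live = [0.5 + i / 200 for i in range(100)]     # live values squeezed into the upper half
print(f"PSI vs itself: {psi(reference, reference):.3f}")
print(f"PSI vs shifted: {psi(reference, live):.3f}")
```

A commonly quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a shift worth investigating, but treat those cut-offs as starting points to tune on your own data.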

Figure: MLOps Lifecycle (continuous loop around the ML system): 📊 Data → 🏋️ Train → 📈 Evaluate → 🚀 Deploy → 👁️ Monitor → 🔄 Feedback → back to Data

⚠️ The Pain Points Without MLOps
  • Works on my machine: Model trained on Python 3.9 breaks on the server running Python 3.11
  • Silent accuracy degradation: Input data distribution shifts; no one notices until customers complain
  • No experiment tracking: You don't remember which hyperparameters gave the best result last month
  • Manual deployments: Every update requires SSHing into servers and running scripts by hand
  • No rollback: If the new model breaks something, there's no quick way to revert
  • Data versioning chaos: Different team members train on different data slices
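
The first pain point in this list can be caught at startup rather than at the first failed prediction. As a minimal sketch, assuming you record the training interpreter's version somewhere alongside the model (the `expected` string and the helper name `runtime_matches` are hypothetical), the serving process could compare it with its own:

```python
import sys
from typing import Optional, Tuple


def runtime_matches(expected: str, actual: Optional[Tuple[int, ...]] = None) -> bool:
    """Return True when the interpreter's major.minor equals `expected`, e.g. "3.11"."""
    if actual is None:
        actual = tuple(sys.version_info[:2])
    # Compare as integers: as strings, "3.9" would sort after "3.11"
    major, minor = (int(part) for part in expected.split(".")[:2])
    return (major, minor) == tuple(actual[:2])


# Hypothetical usage: the model was trained on 3.9, the server runs 3.11
print(runtime_matches("3.9", (3, 11)))
```

Failing fast on a mismatch turns a confusing deserialisation error into a clear message. Pinning the same Python version in the Docker image, as Step 3 does, removes the mismatch in the first place.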

The Modern MLOps Stack

| Category | Tool | Purpose | Free Tier? |
|---|---|---|---|
| Experiment Tracking | MLflow | Log metrics, params, models, artifacts | Yes (self-hosted) |
| Experiment Tracking | Weights & Biases | Cloud-based tracking with rich visualisations | Yes (100GB storage) |
| Data Versioning | DVC | Git for large datasets and model files | Yes (open source) |
| Model Serving | FastAPI + Uvicorn | Lightweight, async Python API server | Yes (open source) |
| Containerisation | Docker | Package app + dependencies into portable images | Yes (free for public) |
| Container Orchestration | Kubernetes | Scale and manage containerised ML services | GKE Autopilot free tier |
| Cloud Deployment | Railway / Render | Simple PaaS for deploying Docker containers | Yes (generous free tier) |
| CI/CD | GitHub Actions | Automate build, test, deploy on git push | Yes (2000 min/month) |
| Model Registry | MLflow Registry | Version and stage models (Staging/Production) | Yes (self-hosted) |
| Monitoring | Prometheus + Grafana | Metrics collection and dashboards | Yes (self-hosted) |
| Feature Store | Feast | Manage and serve ML features consistently | Yes (open source) |

Step-by-Step Guide

Step 1: Train and Track with MLflow

MLflow experiment tracking is the foundation of MLOps. Every training run should log its parameters, metrics, and artifacts so you can compare experiments and reproduce results.

python
# train.py — Model training with MLflow experiment tracking
import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import json
import os

# ──────────────────────────────────────────────────────────
# Configuration — all hyperparameters in one place
# ──────────────────────────────────────────────────────────
CONFIG = {
    "n_estimators": 200,
    "max_depth": 5,
    "learning_rate": 0.05,
    "subsample": 0.8,
    "min_samples_split": 10,
    "test_size": 0.2,
    "random_state": 42,
}


def load_data():
    """Load your dataset here. Replace with actual data loading."""
    from sklearn.datasets import make_classification
    X, y = make_classification(
        n_samples=5000, n_features=20, n_informative=15,
        n_redundant=5, random_state=42
    )
    return pd.DataFrame(X), pd.Series(y)


def train_model():
    # Set MLflow tracking URI — use a local server or remote
    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.set_experiment("fraud-detection-v2")

    with mlflow.start_run(run_name="GBM-experiment-1") as run:
        print(f"MLflow Run ID: {run.info.run_id}")

        # ── Load & Split Data ──
        X, y = load_data()
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=CONFIG["test_size"],
            random_state=CONFIG["random_state"], stratify=y
        )

        # ── Log configuration ──
        mlflow.log_params(CONFIG)
        mlflow.log_param("train_samples", len(X_train))
        mlflow.log_param("test_samples",  len(X_test))
        mlflow.log_param("n_features",    X.shape[1])

        # ── Build pipeline (scaler + model) ──
        pipeline = Pipeline([
            ("scaler", StandardScaler()),
            ("model", GradientBoostingClassifier(
                n_estimators   = CONFIG["n_estimators"],
                max_depth      = CONFIG["max_depth"],
                learning_rate  = CONFIG["learning_rate"],
                subsample      = CONFIG["subsample"],
                min_samples_split = CONFIG["min_samples_split"],
                random_state   = CONFIG["random_state"],
            ))
        ])

        # ── Train ──
        print("Training model...")
        pipeline.fit(X_train, y_train)

        # ── Evaluate ──
        y_pred  = pipeline.predict(X_test)
        y_proba = pipeline.predict_proba(X_test)[:, 1]

        metrics = {
            "accuracy":  accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred),
            "recall":    recall_score(y_test, y_pred),
            "f1":        f1_score(y_test, y_pred),
            "roc_auc":   roc_auc_score(y_test, y_proba),
        }

        # ── Log metrics ──
        mlflow.log_metrics(metrics)

        # ── Cross-validation ──
        cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
        mlflow.log_metric("cv_roc_auc_mean", cv_scores.mean())
        mlflow.log_metric("cv_roc_auc_std",  cv_scores.std())

        print("\nResults:")
        for k, v in metrics.items():
            print(f"  {k}: {v:.4f}")
        print(f"  CV ROC-AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

        # ── Save and log model ──
        mlflow.sklearn.log_model(
            pipeline,
            "model",
            registered_model_name="fraud-detector",
            input_example=X_test.iloc[:5],
            signature=mlflow.models.infer_signature(X_test, y_pred),
        )

        print(f"\nModel saved to MLflow. Run ID: {run.info.run_id}")
        return run.info.run_id


if __name__ == "__main__":
    run_id = train_model()

Step 2: Serve with FastAPI

python
# serve.py — Production FastAPI model server
from fastapi import FastAPI, HTTPException, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
import mlflow.sklearn
import numpy as np
import pandas as pd
import logging
import time
import os
from contextlib import asynccontextmanager
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from fastapi.responses import Response

# ── Logging ──
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# ── Prometheus metrics ──
REQUEST_COUNT    = Counter("predictions_total", "Total predictions", ["status"])
REQUEST_LATENCY  = Histogram("prediction_latency_seconds", "Prediction latency")
INPUT_ERRORS     = Counter("input_errors_total", "Invalid input count")

# ── Global model ──
model = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    global model
    model_uri = os.getenv("MODEL_URI", "models:/fraud-detector/Production")
    logger.info(f"Loading model from: {model_uri}")
    try:
        model = mlflow.sklearn.load_model(model_uri)
        logger.info("Model loaded successfully")
    except Exception as e:
        logger.error(f"Failed to load model: {e}")
        raise
    yield


app = FastAPI(
    title="Fraud Detection API",
    description="ML model serving with MLflow + FastAPI",
    version="2.0.0",
    lifespan=lifespan,
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)


class PredictionRequest(BaseModel):
    features: list[float] = Field(..., description="Model input features", min_length=1)


class PredictionResponse(BaseModel):
    prediction: int
    probability: float
    confidence: str
    latency_ms: float


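# Declared with plain `def` so FastAPI runs the blocking model call in its
# threadpool instead of stalling the event loop
@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):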
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    start = time.time()

    try:
        X = np.array(request.features).reshape(1, -1)
        pred  = int(model.predict(X)[0])
        proba = float(model.predict_proba(X)[0, 1])
    except Exception as e:
        INPUT_ERRORS.inc()
        REQUEST_COUNT.labels(status="error").inc()
        logger.error(f"Prediction error: {e}")
        raise HTTPException(status_code=422, detail=f"Prediction failed: {e}")

    latency = (time.time() - start) * 1000
    REQUEST_COUNT.labels(status="success").inc()
    REQUEST_LATENCY.observe(latency / 1000)

    confidence = "high" if proba > 0.8 or proba < 0.2 else "medium" if proba > 0.6 or proba < 0.4 else "low"

    return PredictionResponse(
        prediction=pred,
        probability=round(proba, 4),
        confidence=confidence,
        latency_ms=round(latency, 2),
    )


@app.get("/metrics")
def metrics():
    """Prometheus metrics endpoint."""
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)


@app.get("/health")
def health():
    return {"status": "healthy", "model_loaded": model is not None}
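
The one-line `confidence` expression in `predict` packs two thresholds into a nested conditional. Pulled out into a helper (the name `confidence_bucket` is an assumption, not part of `serve.py` as written), the same logic becomes easier to read and to unit test without loading a model:

```python
def confidence_bucket(proba: float) -> str:
    """Map a positive-class probability to the labels used by /predict."""
    # Far from 0.5 in either direction: the model is decisive
    if proba > 0.8 or proba < 0.2:
        return "high"
    # Moderately far from 0.5
    if proba > 0.6 or proba < 0.4:
        return "medium"
    # Between 0.4 and 0.6 inclusive: close to a coin flip
    return "low"


for p in (0.95, 0.70, 0.50, 0.05):
    print(p, confidence_bucket(p))
```

Note that the boundaries are exclusive, so a probability of exactly 0.8 is reported as "medium" and exactly 0.6 as "low", which matches the original expression.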

Step 3: Dockerise the Service

dockerfile
# Dockerfile — Multi-stage build for production ML API
# ──────────────────────────────────────────────────────────
# Stage 1: Build stage (installs all deps including build tools)
# ──────────────────────────────────────────────────────────
FROM python:3.11-slim AS builder

WORKDIR /build

# Install system build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ libgomp1 && \
    rm -rf /var/lib/apt/lists/*

# Install Python dependencies into a virtual env
COPY requirements.txt .
RUN python -m venv /opt/venv && \
    /opt/venv/bin/pip install --no-cache-dir --upgrade pip && \
    /opt/venv/bin/pip install --no-cache-dir -r requirements.txt


# ──────────────────────────────────────────────────────────
# Stage 2: Runtime stage (lean production image)
# ──────────────────────────────────────────────────────────
FROM python:3.11-slim AS runtime

# Create non-root user for security
RUN useradd -m -u 1000 appuser
WORKDIR /app

# Copy virtual env from builder
COPY --from=builder /opt/venv /opt/venv

# Activate venv
ENV PATH="/opt/venv/bin:$PATH"
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

# Copy application code
COPY --chown=appuser:appuser serve.py .

USER appuser

EXPOSE 8000

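# Health check: marks the container "unhealthy" after 3 failed probes.
# Plain Docker only reports this status; an orchestrator or hosting platform decides whether to restart.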
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

# Run Gunicorn as the process manager with Uvicorn (ASGI) workers
CMD ["gunicorn", "serve:app", "--workers", "2", "--worker-class", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000", "--timeout", "120"]
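
The inline `python -c` probe in the `HEALTHCHECK` works, but it is hard to read and has no timeout of its own beyond Docker's `--timeout`. A slightly stricter probe might look like the sketch below. The helper name `is_healthy` and the idea of shipping it as a separate `healthcheck.py` are assumptions; the file would also need its own `COPY` line in the Dockerfile.

```python
import http.client
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError, HTTPError (4xx/5xx), refused connections and timeouts;
        # ValueError covers malformed URLs
        return False


# Nothing listens on port 1, so this probe reports the service as down
print(is_healthy("http://127.0.0.1:1/health", timeout=1.0))
```

To use it as a container health check, the script would end by exiting with status 0 when `is_healthy("http://localhost:8000/health")` is true and 1 otherwise, since Docker reads the probe result from the exit code.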

Step 4: Deploy to Railway or Render

For small to medium ML APIs, Railway and Render are practical options for Nepali developers: there is no Kubernetes cluster to manage, entry-level plans are inexpensive, and HTTPS is set up automatically.

bash
# ──────────────────────────────────────────────────────────
# Option A: Deploy to Railway
# ──────────────────────────────────────────────────────────

# 1. Install Railway CLI
npm install -g @railway/cli

# 2. Login and initialise project
railway login
railway init  # Creates railway.json

# 3. Add environment variables
railway variables set OPENAI_API_KEY=sk-xxx
railway variables set MODEL_URI=models:/fraud-detector/Production
railway variables set MLFLOW_TRACKING_URI=https://your-mlflow-server.com

# 4. Deploy (Railway auto-detects Dockerfile)
railway up

# Your API is live at: https://your-app.railway.app


# ──────────────────────────────────────────────────────────
# Option B: Deploy to Render (via render.yaml)
# ──────────────────────────────────────────────────────────
cat > render.yaml << 'EOF'
services:
  - type: web
    name: fraud-detection-api
    runtime: docker
    dockerfilePath: ./Dockerfile
    region: singapore   # closest to Nepal
    plan: starter       # $7/month — 512MB RAM, 0.5 CPU
    healthCheckPath: /health
    envVars:
      - key: MODEL_URI
        value: models:/fraud-detector/Production
      - key: MLFLOW_TRACKING_URI
        sync: false    # Needed to resolve the models:/ URI; set in Render dashboard
      - key: OPENAI_API_KEY
        sync: false    # Set in Render dashboard (secret)
EOF

# Push to GitHub and connect repo in Render dashboard
git add render.yaml && git commit -m "Add Render config"
git push origin main
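
Once either platform reports the service as live, a quick smoke test against `/predict` confirms that the model actually loaded. The sketch below uses only the standard library. The base URL is a placeholder, the helper names are hypothetical, and the 20-value feature vector assumes the synthetic dataset from Step 1.

```python
import json
import urllib.request
from typing import Sequence


def build_predict_request(base_url: str, features: Sequence[float]) -> urllib.request.Request:
    """Build (but do not send) a POST to /predict matching PredictionRequest."""
    payload = json.dumps({"features": [float(x) for x in features]}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url: str, features: Sequence[float], timeout: float = 10.0) -> dict:
    """Send the request and return the decoded PredictionResponse as a dict."""
    req = build_predict_request(base_url, features)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Placeholder URL: replace with the address Railway or Render assigns to your service
req = build_predict_request("https://your-app.onrender.com", [0.1] * 20)
print(req.get_method(), req.full_url)
```

Calling `predict(...)` with the real URL should return a dictionary with the `prediction`, `probability`, `confidence`, and `latency_ms` fields defined in `serve.py`.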

Step 5: CI/CD with GitHub Actions

CI/CD pipeline (automated deployment flow): 📤 git push → GitHub Actions triggered → 🧪 Run tests → 🐳 Docker build → 📦 Push to GHCR → 🚀 Deploy to Render

yaml
# .github/workflows/deploy.yml
# Runs on every push to main — tests, builds, and deploys
name: MLOps CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}/fraud-detection-api

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-asyncio httpx

      - name: Run unit tests
        run: pytest tests/ -v --tb=short

      - name: Run model validation
        run: python scripts/validate_model.py
        env:
          MODEL_URI: ${{ secrets.MODEL_URI }}
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}

  build:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'

    permissions:
      contents: read
      packages: write

    steps:
      - uses: actions/checkout@v4

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push Docker image
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'

    steps:
      - name: Trigger Render deployment
        run: |
          curl -sf -X POST \
            -H "Authorization: Bearer ${{ secrets.RENDER_API_KEY }}" \
            "https://api.render.com/v1/services/${{ secrets.RENDER_SERVICE_ID }}/deploys" \
            -H "Content-Type: application/json" \
            -d '{"clearCache": false}'

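      - name: Wait for deployment
        run: |
          echo "Waiting for Render to deploy..."
          sleep 60
          # Retry the health check so one slow start does not fail the job
          curl -sf --retry 5 --retry-delay 15 --retry-all-errors https://your-app.onrender.com/health
          echo "Deployment successful!"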
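
The workflow calls `scripts/validate_model.py`, which this guide does not show. One plausible shape for its core is a metric gate: refuse to ship a candidate that falls below a floor or regresses against the current production model. The function name `passes_gate`, the 0.85 floor, and the 0.01 tolerance are illustrative assumptions, not values from the original script.

```python
from typing import Mapping, Optional


def passes_gate(
    candidate: Mapping[str, float],
    baseline: Optional[Mapping[str, float]] = None,
    metric: str = "roc_auc",
    min_value: float = 0.85,        # hypothetical absolute floor
    max_regression: float = 0.01,   # hypothetical tolerated drop vs. production
) -> bool:
    """Return True when the candidate model is good enough to deploy."""
    score = candidate[metric]
    if score < min_value:
        return False
    # With a production baseline available, also block noticeable regressions
    if baseline is not None and score < baseline[metric] - max_regression:
        return False
    return True


candidate = {"roc_auc": 0.90}
production = {"roc_auc": 0.93}
print(passes_gate(candidate))              # judged on the floor alone
print(passes_gate(candidate, production))  # judged against production as well
```

In a real script the two metric dictionaries would come from the MLflow runs behind the candidate and production model versions, and a failed gate would exit with a non-zero status so the `test` job stops the pipeline.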

💡 Use DVC for Data Versioning

DVC (Data Version Control) is Git for large files — datasets, model checkpoints, and feature stores. It stores file metadata in Git and the actual data in cheap object storage (S3, GCS, Azure Blob).

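bash
pip install "dvc[s3]"
dvc init
dvc remote add -d myremote s3://my-ml-bucket/dvc
dvc add data/train.csv   # tracks with checksum, writes data/train.csv.dvc
git add data/train.csv.dvc data/.gitignore .dvc/config
git commit -m "Track training data with DVC"   # Git stores only the small pointer file
dvc push                 # upload to S3
# teammates can now run:
git pull && dvc pull     # get exact same data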
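
The checksum that `dvc add` records is what makes "exact same data" verifiable: the pointer file committed to Git holds a hash of the file's contents (MD5 by default), so any change to the data changes the hash. The sketch below only illustrates that idea with the standard library; it is not DVC's implementation, and the helper name `file_md5` is an assumption.

```python
import hashlib
import os
import tempfile


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Two tiny stand-in "datasets" that differ by a single row
with tempfile.TemporaryDirectory() as tmp:
    v1 = os.path.join(tmp, "train_v1.csv")
    v2 = os.path.join(tmp, "train_v2.csv")
    with open(v1, "w") as fh:
        fh.write("id,label\n1,0\n2,1\n")
    with open(v2, "w") as fh:
        fh.write("id,label\n1,0\n2,1\n3,0\n")
    print(file_md5(v1) == file_md5(v1))  # same contents, same fingerprint
    print(file_md5(v1) == file_md5(v2))  # one extra row, different fingerprint
```

Because the hash lives in Git history, checking out an old commit and running `dvc pull` retrieves the dataset version that commit was trained on.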

Conclusion

Building a Reliable ML System

The stack we've covered — MLflow for experiment tracking, FastAPI for serving, Docker for packaging, GitHub Actions for CI/CD, and Railway/Render for deployment — gives you a production-grade MLOps foundation without requiring a dedicated DevOps team.

Start small: add MLflow tracking to your existing training scripts this week. Then containerise your best model. Then automate the deployment. Each step independently adds value and you can stop at any point.

In Nepal's growing AI ecosystem, the engineers who can both build models and reliably deploy them are the rarest and most valuable. MLOps is your competitive advantage.

Written by

Shiv Shankar Sah

AI/ML Engineer at HexCode Nepal

Passionate about making AI education accessible in Nepal. Writing tutorials, guides, and deep-dives on ML, LLMs, and production AI systems.
