Computer Vision with YOLO: Building a Real-time Object Detector

Object detection is one of the most visually compelling and practically useful applications of computer vision. Rather than just classifying what an image contains, object detection finds where objects are — drawing bounding boxes around every person, car, animal, or item in a scene, in real time. The applications are everywhere: security cameras, self-driving cars, medical imaging, manufacturing quality control, and agricultural monitoring.

YOLO (You Only Look Once) has been the dominant real-time object detection framework since 2016. Its key insight — treating detection as a single regression problem rather than a two-stage pipeline — made it fast enough for real-time video while remaining accurate. In this guide, you'll go from zero to running a custom-trained YOLO model that detects objects in real time using your webcam.

Real-World Uses

Where Object Detection Is Used

🚗 Autonomous Driving: detect pedestrians, vehicles, and signs at 60+ FPS

🏭 Quality Control: detect defects on manufacturing lines in real time

🏥 Medical Imaging: locate tumours and anomalies in X-rays and MRIs

🌾 Agriculture: detect crop diseases and pest infestations from drones

📦 Warehouse Automation: track inventory, guide robots, count packages

🚦 Smart Traffic: vehicle counting, speed estimation, incident detection

YOLO Architecture

YOLO Architecture Deep Dive

Understanding YOLO's architecture helps you tune it effectively for your use case. The modern YOLOv8/v9/v10/v11 architecture has three main components: a backbone for feature extraction, a neck for multi-scale feature aggregation, and a head for prediction.

YOLOv8 Architecture — Feature Pyramid Detection

📸 Input Image (640×640×3)

↓

🦴 Backbone: CSPDarknet

Conv + BN + SiLU: feature extraction

C2f modules: cross-stage partial blocks

SPPF: multi-scale pooling

↓

🔗 Neck: PANet (Path Aggregation)

P3 (80×80): small objects

P4 (40×40): medium objects

P5 (20×20): large objects

↓

🎯 Head: decoupled, anchor-free detection

Bounding boxes: [x, y, w, h]

Class scores: an independent sigmoid per class (N classes), not a softmax

Confidence: the top class score. YOLOv8 has no separate objectness branch, unlike earlier anchor-based YOLO versions
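
The grid sizes in the diagram follow directly from the input size and the downsampling stride of each pyramid level. The sketch below works that arithmetic through, assuming strides of 8, 16, and 32 for P3, P4, and P5 (implied by 640/80, 640/40, and 640/20) and one candidate box per grid cell; the function names are illustrative only.

```python
def grid_sizes(imgsz=640, strides=(8, 16, 32)):
    """Return {stride: (grid_h, grid_w)} for a square input image."""
    return {s: (imgsz // s, imgsz // s) for s in strides}


def total_predictions(imgsz=640, strides=(8, 16, 32)):
    """Number of candidate boxes, assuming one per grid cell per level."""
    return sum(h * w for h, w in grid_sizes(imgsz, strides).values())


if __name__ == "__main__":
    for stride, (h, w) in grid_sizes().items():
        print(f"stride {stride:>2}: {h}x{w} grid")
    print("candidate boxes before filtering:", total_predictions())
```

Under these assumptions a 640×640 input yields 6,400 + 1,600 + 400 candidates, which is why confidence filtering and NMS are needed afterwards.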

YOLO vs Other Detectors

Model	Approach	Speed (FPS)	mAP (COCO)	Real-time?	Best For
YOLOv8 (Nano)	One-stage	150+	37.3	Yes	Edge devices, mobile, embedded
YOLOv8 (Large)	One-stage	35	52.9	Yes	Server inference, high accuracy
YOLOv11 (M)	One-stage	60	51.5	Yes	Balanced speed/accuracy
Faster R-CNN	Two-stage	5–10	55+	No	High accuracy, batch processing
SSD (MobileNet)	One-stage	60	23	Yes	Lightweight mobile deployment
RT-DETR (Large)	Transformer	25	53.1	Borderline	When accuracy is critical
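
To read the speed column as a latency budget, it helps to convert FPS into milliseconds per frame. A small sketch, assuming a 30 FPS target for "real-time" (that target is an assumption, not something the table defines):

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def is_real_time(model_fps, target_fps=30.0):
    """True if the model keeps up with the (assumed) target frame rate."""
    return model_fps >= target_fps


if __name__ == "__main__":
    for name, fps in [("YOLOv8 (Nano)", 150), ("YOLOv8 (Large)", 35), ("RT-DETR (Large)", 25)]:
        print(f"{name}: {frame_budget_ms(fps):.1f} ms/frame, real-time={is_real_time(fps)}")
```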

The Mathematics

The Math Behind YOLO

Intersection over Union (IoU)

IoU is the primary metric for measuring how well a predicted bounding box matches the ground truth. It's used during both training (loss function) and evaluation (mAP calculation). A prediction is typically considered correct if IoU > 0.5.

IoU = Area of Overlap / Area of Union

Area of Overlap is the intersection of the predicted box and the ground truth box. Area of Union is the total area covered by both boxes combined. IoU = 1.0 means perfect overlap; IoU = 0 means no overlap. A threshold of 0.5 is standard for most benchmarks (COCO uses 0.5:0.95 averaging).
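
The formula can be sketched in a few lines of Python. This version assumes boxes are given as (x1, y1, x2, y2) corner coordinates, the same layout as the `xyxy` boxes used later in this guide; the function name `iou` is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter  # overlap is counted once, not twice

    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 10x10 boxes offset by 5 px horizontally: overlap 50, union 150
    print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```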

Non-Maximum Suppression (NMS)

YOLO typically predicts hundreds of overlapping bounding boxes for the same object. NMS selects the best box and suppresses redundant ones. The algorithm: (1) Sort all predictions by confidence score, (2) Select the highest-confidence box, (3) Remove all other boxes with IoU > threshold (typically 0.45) with the selected box, (4) Repeat until no boxes remain.
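
The four steps above can be sketched as a greedy loop. This is a minimal, class-agnostic illustration in pure Python, not the implementation any particular library uses; real pipelines usually run NMS per class and on tensors. The helper names are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Return indices of the boxes to keep, highest score first."""
    # (1) Sort predictions by confidence, best first
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # (2) Take the highest-confidence remaining box
        best = order.pop(0)
        keep.append(best)
        # (3) Drop every remaining box that overlaps it too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        # (4) Loop until nothing is left
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # the second box overlaps the first and is suppressed
```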

Confidence Threshold — Effect on Detections

conf=0.25 (Low)

More detections — includes low-confidence predictions. More false positives. Good for recall-sensitive tasks.

8 detections

conf=0.50 (Medium)

Balanced. Standard for most applications. Removes uncertain predictions while keeping confident ones.

5 detections

conf=0.80 (High)

Only very confident predictions. Few false positives but misses uncertain true positives. High precision.

2 detections
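
The effect of the threshold is simply a filter over per-box confidence scores. The sketch below uses a made-up list of nine scores, chosen only so that the counts match the 8 / 5 / 2 figures above; they are not real model output.

```python
def count_at_threshold(confidences, conf):
    """Number of detections whose confidence is at least `conf`."""
    return sum(1 for c in confidences if c >= conf)


# Hypothetical confidence scores for one image (illustration only)
EXAMPLE_SCORES = [0.93, 0.85, 0.72, 0.61, 0.55, 0.41, 0.33, 0.27, 0.12]

if __name__ == "__main__":
    for conf in (0.25, 0.50, 0.80):
        n = count_at_threshold(EXAMPLE_SCORES, conf)
        print(f"conf={conf:.2f}: {n} detections")
```

Raising the threshold trades recall for precision: the boxes dropped first are the least certain ones, which include both false positives and hard true positives.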

Code — Step by Step

Step 1: Installation

bash

# Install Ultralytics YOLO (includes YOLOv8, v9, v10, v11)
pip install ultralytics opencv-python numpy pillow

# Verify installation
python -c "from ultralytics import YOLO; print('YOLO ready!')"

# Download pretrained YOLOv8 weights (automatic on first use)
# Available: yolov8n (nano), yolov8s (small), yolov8m (medium),
#            yolov8l (large), yolov8x (extra-large)
# nano = fastest, extra-large = most accurate

Step 2: Basic Detection on an Image

python

# detect_image.py — Object detection on a single image
from ultralytics import YOLO
import cv2
import numpy as np
from pathlib import Path

# ── Load pretrained YOLOv8 model ──
# First run downloads the weights automatically
# (about 50 MB for the medium model; the nano model is about 6 MB)
model = YOLO("yolov8m.pt")  # medium model: speed/accuracy balance

# ── Run inference on an image ──
image_path = "test_image.jpg"  # replace with your image
results = model(
    source=image_path,
    conf=0.5,        # confidence threshold (0–1)
    iou=0.45,        # NMS IoU threshold
    imgsz=640,       # inference image size
    device="cpu",    # use "0" for GPU, "cpu" for CPU
    verbose=True,    # print results to console
)

# ── Process results ──
for r in results:
    print(f"\nDetected {len(r.boxes)} objects:")
    print(f"Classes: {r.names}")

    for i, box in enumerate(r.boxes):
        class_id   = int(box.cls[0])
        class_name = r.names[class_id]
        confidence = float(box.conf[0])
        x1, y1, x2, y2 = map(int, box.xyxy[0])

        print(f"  [{i+1}] {class_name}: {confidence:.2f} @ [{x1},{y1},{x2},{y2}]")

# ── Save annotated image ──
annotated = results[0].plot()  # BGR numpy array with boxes drawn
cv2.imwrite("output_detected.jpg", annotated)
print("\nSaved annotated image: output_detected.jpg")

# ── Batch inference on a folder ──
batch_results = model(
    source="./test_images/",   # process entire folder
    conf=0.5,
    save=True,                  # auto-save annotated images
    project="outputs",         # output root: results go to <project>/<name>/
    name="batch",
)
print(f"Processed {len(batch_results)} images")

Step 3: Real-Time Webcam Detection

python

# webcam_detection.py — Real-time detection from webcam
from ultralytics import YOLO
import cv2
import time
from collections import defaultdict

# ── Configuration ──
MODEL_PATH = "yolov8n.pt"   # nano = fastest for real-time
CONF_THRESHOLD = 0.5
IOU_THRESHOLD  = 0.45
CLASSES_TO_DETECT = None    # None = detect all, or [0, 2, 5] for specific classes
                             # COCO class 0=person, 2=car, 5=bus

# ── Load model ──
model = YOLO(MODEL_PATH)
model.to("cpu")  # or "cuda" if you have a GPU

# ── Open webcam ──
cap = cv2.VideoCapture(0)  # 0 = default webcam
if not cap.isOpened():
    raise RuntimeError("Cannot open webcam. Check connection.")

cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

# ── FPS tracking ──
fps_history = []
frame_count = 0

print("Starting real-time detection. Press 'q' to quit.")

while True:
    t0 = time.time()
    ret, frame = cap.read()
    if not ret:
        print("Failed to read frame")
        break

    # ── Run YOLO inference ──
    results = model(
        source=frame,
        conf=CONF_THRESHOLD,
        iou=IOU_THRESHOLD,
        classes=CLASSES_TO_DETECT,
        verbose=False,  # don't print per-frame
    )

    # ── Draw results ──
    annotated = results[0].plot(
        line_width=2,
        font_size=12,
        labels=True,
        conf=True,
    )

    # ── Calculate and display FPS ──
    fps = 1.0 / (time.time() - t0)
    fps_history.append(fps)
    if len(fps_history) > 30:
        fps_history.pop(0)
    avg_fps = sum(fps_history) / len(fps_history)

    # Count detections per class
    class_counts = defaultdict(int)
    for box in results[0].boxes:
        class_name = results[0].names[int(box.cls[0])]
        class_counts[class_name] += 1

    # Overlay FPS and detection info
    cv2.putText(annotated, f"FPS: {avg_fps:.1f}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.putText(annotated, f"Objects: {len(results[0].boxes)}", (10, 70),
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 200, 255), 2)

    # Display
    cv2.imshow("YOLOv8 Real-time Detection", annotated)
    frame_count += 1

    # Press 'q' to quit, 's' to save screenshot
    key = cv2.waitKey(1) & 0xFF
    if key == ord("q"):
        break
    elif key == ord("s"):
        cv2.imwrite(f"screenshot_{frame_count}.jpg", annotated)
        print(f"Screenshot saved: screenshot_{frame_count}.jpg")

cap.release()
cv2.destroyAllWindows()
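# Guard against an empty history (e.g. the first frame could not be read)
final_fps = sum(fps_history) / len(fps_history) if fps_history else 0.0
print(f"Session complete. Processed {frame_count} frames at avg {final_fps:.1f} FPS")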
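
The FPS smoothing in the loop is a rolling mean over the last 30 frames. One way to package that, sketched here with `collections.deque` so old samples drop off automatically (the class name `RollingFPS` is hypothetical):

```python
from collections import deque


class RollingFPS:
    """Rolling average of frames per second over the last `window` frames."""

    def __init__(self, window=30):
        self.samples = deque(maxlen=window)  # oldest sample is discarded when full

    def update(self, frame_seconds):
        """Record one frame's processing time and return the current average."""
        if frame_seconds > 0:  # skip zero-length timings to avoid division by zero
            self.samples.append(1.0 / frame_seconds)
        return self.average

    @property
    def average(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


if __name__ == "__main__":
    meter = RollingFPS(window=3)
    for seconds in (0.1, 0.05, 0.05, 0.05):
        print(f"{meter.update(seconds):.1f} FPS")
```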

Step 4: Custom Training — Dataset YAML

yaml

# dataset.yaml — Custom dataset configuration for YOLO training
# Place this file alongside your data directory

# Dataset root directory (relative or absolute)
path: /home/user/datasets/my_custom_dataset

# Image directories (relative to path)
train: images/train    # Training images
val:   images/val      # Validation images
test:  images/test     # Test images (optional)

# Number of classes
nc: 5

# Class names (must match label indices 0, 1, 2, ...)
names:
  0: motorcycle
  1: car
  2: bus
  3: truck
  4: pedestrian

# Dataset statistics (optional, for reference)
# Total images: 4,200
# Total annotations: 18,500
# Image sizes: 640×640 (resized during training)

# ──────────────────────────────────────────────────────────
# Expected directory structure:
# my_custom_dataset/
# ├── images/
# │   ├── train/          ← your .jpg or .png images
# │   │   ├── img_001.jpg
# │   │   └── ...
# │   ├── val/
# │   └── test/
# └── labels/             ← YOLO format labels (auto-detected)
#     ├── train/
#     │   ├── img_001.txt  ← one .txt per image
#     │   └── ...
#     ├── val/
#     └── test/
#
# YOLO label format (one line per object):
# <class_id> <x_center> <y_center> <width> <height>
# All values normalised 0–1 relative to image dimensions
# Example: 0 0.5234 0.3891 0.2104 0.4502
#          ↑ class  ↑ cx   ↑ cy   ↑ w    ↑ h
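
Annotation tools often export pixel-coordinate corner boxes, so it is useful to see how they map to the normalised label format described above. A minimal sketch; the function names are illustrative and the four-decimal precision is a choice, not a requirement of the format.

```python
def to_yolo_label(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space corner box to a YOLO label line."""
    cx = (x1 + x2) / 2 / img_w   # box centre, normalised to 0-1
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w        # box size, normalised to 0-1
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.4f} {cy:.4f} {w:.4f} {h:.4f}"


def from_yolo_label(line, img_w, img_h):
    """Convert a YOLO label line back to (class_id, x1, y1, x2, y2) in pixels."""
    parts = line.split()
    class_id = int(parts[0])
    cx, cy, w, h = map(float, parts[1:5])
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return class_id, x1, y1, x2, y2


if __name__ == "__main__":
    line = to_yolo_label(0, 100, 200, 300, 400, img_w=640, img_h=480)
    print(line)
    print(from_yolo_label(line, 640, 480))
```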

Step 5: Train a Custom Model

python

# train_custom.py — Fine-tune YOLOv8 on your custom dataset
from ultralytics import YOLO

# ── Load pretrained YOLOv8 (transfer learning) ──
# Starting from pretrained weights dramatically reduces training time
# and improves accuracy compared to training from scratch
model = YOLO("yolov8m.pt")   # start from medium pretrained model

# ── Training configuration ──
training_config = {
    "data":      "dataset.yaml",  # path to your dataset config
    "epochs":    100,             # number of training epochs
    "imgsz":     640,             # input image size
    "batch":     16,              # batch size (reduce if OOM)
    "device":    "0",             # GPU device (use "cpu" for CPU)
    "workers":   4,               # dataloader workers
    "project":   "runs/train",    # save directory
    "name":      "nepal_traffic_v1",

    # Optimiser
    "optimizer": "AdamW",
    "lr0":       0.001,           # initial learning rate
    "lrf":       0.01,            # final LR = lr0 * lrf
    "momentum":  0.937,
    "weight_decay": 0.0005,

    # Augmentation (enabled by default)
    "augment":   True,
    "mosaic":    1.0,             # mosaic augmentation probability
    "mixup":     0.1,             # mixup augmentation
    "flipud":    0.0,             # vertical flip
    "fliplr":    0.5,             # horizontal flip

    # Early stopping
    "patience":  20,              # stop if no improvement for 20 epochs

    # Pretrained weights
    "pretrained": True,
    "freeze":    0,               # 0 = train all layers
                                  # 10 = freeze first 10 layers (faster, less data needed)
}

print("Starting training...")
print(f"Model: YOLOv8m | Epochs: {training_config['epochs']} | Batch: {training_config['batch']}")

# ── Start training ──
results = model.train(**training_config)

# ── Print final metrics ──
print("\nTraining complete!")
print(f"Best mAP50:    {results.results_dict.get('metrics/mAP50(B)', 0):.4f}")
print(f"Best mAP50-95: {results.results_dict.get('metrics/mAP50-95(B)', 0):.4f}")
print(f"Weights saved: {model.trainer.best}")

# ── Validate on test set ──
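val_results = model.val(
    data="dataset.yaml",
    split="test",
    conf=0.001,       # keep this low for mAP: a high threshold truncates the precision-recall curve
    iou=0.6,          # NMS IoU threshold
    save_json=True,   # save COCO-format results
)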

print(f"\nTest mAP50: {val_results.box.map50:.4f}")
print(f"Test mAP50-95: {val_results.box.map:.4f}")

# ── Export for deployment ──
# Export to ONNX for fast CPU inference on any platform
model.export(format="onnx", imgsz=640, opset=12, dynamic=False)
print("Model exported to ONNX format")

# For edge devices (Raspberry Pi, Jetson Nano):
# model.export(format="ncnn")    # Tencent NCNN format
# model.export(format="tflite") # TensorFlow Lite for mobile
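
The `lr0` and `lrf` settings in the training configuration fix the start and end of the learning-rate schedule: the final rate is `lr0 * lrf`. The sketch below shows what that looks like under a plain linear decay. The linear shape is an assumption made for illustration; it ignores warm-up, and the trainer's actual scheduler may differ.

```python
def linear_lr(epoch, epochs=100, lr0=0.001, lrf=0.01):
    """Learning rate at `epoch`, decaying linearly from lr0 to lr0 * lrf."""
    # Scale factor goes from 1.0 at epoch 0 down to lrf at the last epoch
    factor = max(1 - epoch / epochs, 0.0) * (1.0 - lrf) + lrf
    return lr0 * factor


if __name__ == "__main__":
    for epoch in (0, 25, 50, 75, 100):
        print(f"epoch {epoch:>3}: lr = {linear_lr(epoch):.6f}")
```

With the values used above, the rate starts at 0.001 and ends at 0.00001.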

ℹ️YOLO Versions Comparison — Which to Use?

Version	Released	Best Feature	Recommendation
YOLOv5	2020	Widely adopted, huge community, great tooling	Legacy projects only
YOLOv8	2023	Best docs, most integrations, easiest to use	Default choice for most projects
YOLOv9	2024	Programmable Gradient Information, better accuracy	When accuracy matters more than speed
YOLOv10	2024	NMS-free, faster inference, dual assignment	Production API, latency-critical
YOLOv11	2024	Latest architecture, instance segmentation support	New projects in 2025

Nepal Use Case

Nepal Use Case: Traffic Monitoring in Kathmandu

Kathmandu's traffic is notoriously chaotic — mixed vehicle types, inconsistent lane discipline, and limited traffic monitoring infrastructure. A YOLOv8-based traffic monitoring system can provide real-time vehicle counting, traffic density estimation, and incident detection from existing CCTV cameras without requiring new hardware.

A real implementation would fine-tune YOLOv8 on a Nepal-specific vehicle dataset including classes: motorcycle, car, microbus, tempo, truck, bus, bicycle, and pedestrian. Unique to Kathmandu traffic: tempos (electric three-wheelers) and microbuses require custom classes not present in the COCO dataset.

python

# nepal_traffic_counter.py — Vehicle counting for Kathmandu intersections
from ultralytics import YOLO
import cv2
import numpy as np
from collections import defaultdict

# Kathmandu-specific vehicle classes
CLASSES = {
    0: "motorcycle", 1: "car", 2: "microbus",
    3: "tempo",      4: "truck", 5: "bus",
    6: "bicycle",    7: "pedestrian",
}

# Colours for each class
COLORS = {
    0: (255, 165, 0),   # orange - motorcycle
    1: (0, 200, 255),   # cyan - car
    2: (0, 255, 100),   # green - microbus
    3: (255, 0, 200),   # magenta - tempo
    4: (255, 50, 50),   # red - truck
    5: (50, 50, 255),   # blue - bus
    6: (255, 255, 0),   # yellow - bicycle
    7: (200, 200, 200), # grey - pedestrian
}

model = YOLO("nepal_traffic_v1.pt")  # your fine-tuned model

# Load traffic camera feed (replace with actual RTSP URL or video file)
cap = cv2.VideoCapture("traffic_camera_ratnapark.mp4")

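# Vehicle counters
total_counts = defaultdict(int)  # cumulative line crossings per class
frame_counts = defaultdict(int)  # per-frame counts for density
last_cy = {}                     # track ID -> centre y on the previous frame

# Define counting line (horizontal line at 60% of frame height)
COUNTING_LINE_Y = None  # will be set on first frame

while True:
    ret, frame = cap.read()
    if not ret:
        break

    if COUNTING_LINE_Y is None:
        COUNTING_LINE_Y = int(frame.shape[0] * 0.6)

    # Track rather than just detect, so each vehicle keeps an ID across frames.
    # A plain "is the centre near the line?" test would count a slow vehicle
    # on every frame it spends near the line.
    results = model.track(frame, persist=True, conf=0.4,
                          classes=list(CLASSES.keys()), verbose=False)

    frame_counts.clear()
    for box in results[0].boxes:
        cls_id = int(box.cls[0])
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        cy = (y1 + y2) // 2  # center y of box
        frame_counts[cls_id] += 1

        if box.id is None:  # the tracker has not assigned an ID yet
            continue
        track_id = int(box.id[0])
        prev_cy = last_cy.get(track_id)
        last_cy[track_id] = cy

        # Count once, on the frame where the centre changes side of the line
        if prev_cy is not None and (prev_cy < COUNTING_LINE_Y) != (cy < COUNTING_LINE_Y):
            total_counts[cls_id] += 1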

    # Draw annotated frame
    annotated = results[0].plot(line_width=1, font_size=8)

    # Draw counting line
    cv2.line(annotated, (0, COUNTING_LINE_Y),
             (frame.shape[1], COUNTING_LINE_Y), (0, 255, 255), 2)

    # Draw stats panel
    y_offset = 10
    for cls_id, name in CLASSES.items():
        count = total_counts[cls_id]
        cv2.putText(annotated, f"{name}: {count}",
                    (10, y_offset + 20), cv2.FONT_HERSHEY_SIMPLEX,
                    0.5, COLORS[cls_id], 1)
        y_offset += 20

    total = sum(frame_counts.values())
    density = "HIGH" if total > 20 else "MEDIUM" if total > 10 else "LOW"
    cv2.putText(annotated, f"Density: {density} ({total} vehicles)",
                (10, frame.shape[0] - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)

    cv2.imshow("Kathmandu Traffic Monitor", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()

print("\nSession Summary:")
for cls_id, name in CLASSES.items():
    print(f"  {name}: {total_counts[cls_id]} vehicles counted")
print(f"  TOTAL: {sum(total_counts.values())} vehicles")
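
Counting at a line comes down to detecting the frame on which an object's centre moves from one side of the line to the other. Below is a library-free sketch of that test, assuming each object carries a stable track ID across frames (for example from a tracker); `crossed_line` and `LineCounter` are hypothetical names used only for illustration.

```python
def crossed_line(prev_cy, cy, line_y):
    """True if the centre moved from one side of the line to the other."""
    if prev_cy is None:  # first time this object is seen
        return False
    return (prev_cy < line_y) != (cy < line_y)


class LineCounter:
    """Counts each tracked object once per crossing of a horizontal line."""

    def __init__(self, line_y):
        self.line_y = line_y
        self.last_cy = {}  # track ID -> centre y on the previous frame
        self.counts = {}   # class ID -> number of crossings

    def update(self, track_id, cls_id, cy):
        """Record one observation and return True if it completed a crossing."""
        prev = self.last_cy.get(track_id)
        self.last_cy[track_id] = cy
        if crossed_line(prev, cy, self.line_y):
            self.counts[cls_id] = self.counts.get(cls_id, 0) + 1
            return True
        return False


if __name__ == "__main__":
    counter = LineCounter(line_y=100)
    # One vehicle (track 1, class 0) creeping slowly across the line
    for cy in (90, 96, 99, 103, 108):
        counter.update(track_id=1, cls_id=0, cy=cy)
    print(counter.counts)
```

Because the test compares consecutive positions instead of distance to the line, a vehicle that lingers near the line is counted on one frame only, and a fast vehicle that jumps past it is not missed.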

Conclusion

From Detection to Production

You now have everything you need to build real-time object detection systems with YOLO. The Ultralytics library makes the workflow remarkably approachable — from a three-line detection script to a full custom training pipeline with just Python.

The key to making YOLO work well for your specific use case is data quality. A model trained on 500 high-quality, well-annotated images from your actual deployment environment will almost always outperform a model trained on 5,000 generic images. Use tools like Label Studio or Roboflow for annotation, and always validate on data from the same distribution as your production environment.

Nepal presents unique and interesting computer vision challenges: diverse lighting from Himalayan altitudes, distinctive vehicle types, multilingual signage, and varied terrain. If you build and open-source Nepal-specific datasets, you'll contribute something genuinely valuable to the global ML community.

Shiv Shankar Sah
