Computer Vision with YOLO: Building a Real-time Object Detector

Build a real-time object detection system using YOLOv8 and OpenCV in 30 minutes — from setup to custom model training.

Shiv Shankar Sah · AI Solutions Lead
June 7, 2025
9 min read
#YOLO #OpenCV #Object Detection #Computer Vision #Python

Object detection is one of the most visually compelling and practically useful applications of computer vision. Rather than just classifying what an image contains, it finds where objects are, drawing bounding boxes around every person, car, animal, or item in a scene, often in real time. Its applications range from security cameras and self-driving cars to medical imaging, manufacturing quality control, and agricultural monitoring.

YOLO (You Only Look Once) has been one of the most widely used real-time object detection frameworks since its introduction in 2016. Its key insight, treating detection as a single regression problem rather than a two-stage pipeline, made it fast enough for real-time video while remaining accurate. In this guide, you'll go from zero to running a custom-trained YOLO model that detects objects in real time using your webcam.

Real-World Uses

Where Object Detection Is Used

🚗 Autonomous Driving: Detect pedestrians, vehicles, signs at 60+ FPS
🏭 Quality Control: Detect defects on manufacturing lines in real time
🏥 Medical Imaging: Locate tumours, anomalies in X-rays and MRIs
🌾 Agriculture: Detect crop diseases, pest infestations from drones
📦 Warehouse Automation: Track inventory, guide robots, count packages
🚦 Smart Traffic: Vehicle counting, speed estimation, incident detection

YOLO Architecture

YOLO Architecture Deep Dive

Understanding YOLO's architecture helps you tune it effectively for your use case. The modern YOLOv8/v9/v10/v11 architecture has three main components: a backbone for feature extraction, a neck for multi-scale feature aggregation, and a head for prediction.

YOLOv8 Architecture: Feature Pyramid Detection

📸 Input Image (640×640×3)

🦴 Backbone: CSPDarknet
    Conv + BN + SiLU: feature extraction
    C2f modules: cross-stage partial blocks
    SPPF: multi-scale pooling

🔗 Neck: PANet (Path Aggregation)
    P3 (80×80): small objects
    P4 (40×40): medium objects
    P5 (20×20): large objects

🎯 Head: Decoupled Detection
    Bounding boxes: [x, y, w, h]
    Class scores: an independent sigmoid score for each of the N classes
    Confidence: the highest class score (YOLOv8's anchor-free head has no separate objectness branch)
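
To make the three grid sizes concrete, here is a small sketch of how a 640×640 input maps onto the P3, P4, and P5 levels. The strides of 8, 16, and 32 are the usual values for these levels and are an assumption here rather than something read from a model file; the function name is hypothetical. It also assumes one candidate box per grid cell.

```python
# Sketch: how a square input maps onto the three detection scales.
# The strides below are assumed typical values, not read from a model.
IMG_SIZE = 640
STRIDES = {"P3": 8, "P4": 16, "P5": 32}


def grid_sizes(img_size, strides):
    """Return {level: (cells_per_side, total_cells)} for a square input."""
    sizes = {}
    for level, stride in strides.items():
        side = img_size // stride
        sizes[level] = (side, side * side)
    return sizes


grids = grid_sizes(IMG_SIZE, STRIDES)
total_predictions = sum(cells for _, cells in grids.values())

for level, (side, cells) in grids.items():
    print(f"{level}: {side}x{side} grid -> {cells} candidate boxes")
print(f"Total candidates before filtering: {total_predictions}")
```

Under these assumptions the finest grid (P3) contributes most of the candidates, which is one way to see why small objects are handled at that level.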

YOLO vs Other Detectors

| Model | Approach | Speed (FPS, hardware-dependent) | mAP (COCO) | Real-time? | Best For |
| YOLOv8 (Nano) | One-stage | 150+ | 37.3 | Yes | Edge devices, mobile, embedded |
| YOLOv8 (Large) | One-stage | 35 | 52.9 | Yes | Server inference, high accuracy |
| YOLOv11 (M) | One-stage | 60 | 51.5 | Yes | Balanced speed/accuracy |
| Faster R-CNN | Two-stage | 5–10 | ≈37–42 (ResNet-FPN backbones) | No | Batch processing where speed is not critical |
| SSD (MobileNet) | One-stage | 60 | 23 | Yes | Lightweight mobile deployment |
| RT-DETR (Large) | Transformer | 25 | 53.1 | Borderline | When accuracy is critical |

The Mathematics

The Math Behind YOLO

Intersection over Union (IoU)

IoU is the primary metric for measuring how well a predicted bounding box matches the ground truth. It's used during both training (loss function) and evaluation (mAP calculation). A prediction is typically considered correct if IoU > 0.5.

IoU = Area of Overlap / Area of Union

Area of Overlap is the intersection of the predicted box and the ground truth box. Area of Union is the total area covered by both boxes combined. IoU = 1.0 means perfect overlap; IoU = 0 means no overlap. A threshold of 0.5 is the classic benchmark setting; COCO's primary metric instead averages precision over IoU thresholds from 0.50 to 0.95 in steps of 0.05.
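
The formula translates directly into a few lines of code. The sketch below assumes boxes are given as corner coordinates (x1, y1, x2, y2), the same layout as the `xyxy` values used later in this guide; the function name `iou` is just an illustrative choice.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # identical boxes
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))    # boxes sharing half their width
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes
```

In the second call the overlap is 50 square units and the union is 150, so the IoU is one third, below the 0.5 threshold even though the boxes share half their area.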

Non-Maximum Suppression (NMS)

YOLO typically predicts many overlapping bounding boxes for the same object. NMS keeps the best box and suppresses the redundant ones. The algorithm: (1) sort all predictions by confidence score, (2) select the highest-confidence box and keep it, (3) remove every other box whose IoU with the selected box exceeds a threshold (0.45 in this guide), (4) repeat from step 2 with the boxes that are left until none remain. In practice this is applied per class, so overlapping objects of different classes do not suppress each other.
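
The steps above can be sketched in plain Python. This is a minimal greedy version for a single class, assuming detections are tuples of (x1, y1, x2, y2, score); production libraries use vectorised implementations, and the names `iou` and `nms` are illustrative.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.45):
    """Greedy NMS over (x1, y1, x2, y2, score) tuples of a single class."""
    # (1) Sort by confidence, highest first
    remaining = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        # (2) Keep the most confident remaining box
        best = remaining.pop(0)
        kept.append(best)
        # (3) Drop boxes that overlap it too much; (4) loop on the rest
        remaining = [d for d in remaining if iou(best[:4], d[:4]) <= iou_threshold]
    return kept


# Made-up detections: two near-duplicates and one separate object
detections = [
    (10, 10, 110, 110, 0.90),
    (12, 12, 112, 112, 0.80),
    (200, 200, 300, 300, 0.70),
]
kept = nms(detections)
print(f"Kept {len(kept)} of {len(detections)} boxes")
```

Raising `iou_threshold` makes suppression more lenient, which helps in crowded scenes where genuinely distinct objects overlap heavily.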

Confidence Threshold: Effect on Detections

conf=0.25 (Low): 8 detections
More detections, including low-confidence predictions. More false positives. Good for recall-sensitive tasks.

conf=0.50 (Medium): 5 detections
Balanced. Standard for most applications. Removes uncertain predictions while keeping confident ones.

conf=0.80 (High): 2 detections
Only very confident predictions. Few false positives but misses uncertain true positives. High precision.

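
The effect of the threshold is simply a filter on the score of each prediction. The sketch below uses made-up confidence scores, chosen so that the counts match the illustrative 8, 5, and 2 detections above; they do not come from a real model run.

```python
# Hypothetical confidence scores for the predictions in one image
scores = [0.95, 0.88, 0.72, 0.61, 0.55, 0.41, 0.33, 0.27, 0.12]


def count_detections(scores, conf):
    """Number of predictions whose score is at least the threshold."""
    return sum(score >= conf for score in scores)


counts = {conf: count_detections(scores, conf) for conf in (0.25, 0.50, 0.80)}
for conf, n in counts.items():
    print(f"conf={conf:.2f}: {n} detections kept")
```

A reasonable way to pick the value is to sweep it on a validation set and choose the point where the precision and recall trade-off fits the application.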
Code — Step by Step

Step 1: Installation

bash
# Install Ultralytics YOLO (includes YOLOv8, v9, v10, v11)
pip install ultralytics opencv-python numpy pillow

# Verify installation
python -c "from ultralytics import YOLO; print('YOLO ready!')"

# Download pretrained YOLOv8 weights (automatic on first use)
# Available: yolov8n (nano), yolov8s (small), yolov8m (medium),
#            yolov8l (large), yolov8x (extra-large)
# nano = fastest, extra-large = most accurate

Step 2: Basic Detection on an Image

python
# detect_image.py — Object detection on a single image
from ultralytics import YOLO
import cv2
import numpy as np
from pathlib import Path

# ── Load pretrained YOLOv8 model ──
# First run downloads the weights automatically (roughly 50 MB for the medium model)
model = YOLO("yolov8m.pt")  # medium model: speed/accuracy balance

# ── Run inference on an image ──
image_path = "test_image.jpg"  # replace with your image
results = model(
    source=image_path,
    conf=0.5,        # confidence threshold (0–1)
    iou=0.45,        # NMS IoU threshold
    imgsz=640,       # inference image size
    device="cpu",    # use "0" for GPU, "cpu" for CPU
    verbose=True,    # print results to console
)

# ── Process results ──
for r in results:
    print(f"\nDetected {len(r.boxes)} objects:")
    print(f"Classes: {r.names}")

    for i, box in enumerate(r.boxes):
        class_id   = int(box.cls[0])
        class_name = r.names[class_id]
        confidence = float(box.conf[0])
        x1, y1, x2, y2 = map(int, box.xyxy[0])

        print(f"  [{i+1}] {class_name}: {confidence:.2f} @ [{x1},{y1},{x2},{y2}]")

# ── Save annotated image ──
annotated = results[0].plot()  # BGR numpy array with boxes drawn
cv2.imwrite("output_detected.jpg", annotated)
print("\nSaved annotated image: output_detected.jpg")

# ── Batch inference on a folder ──
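batch_results = model(
    source="./test_images/",   # process entire folder
    conf=0.5,
    save=True,                  # auto-save annotated images
    project="outputs",          # output location is set with project/name;
    name="batch",               # images go to outputs/batch/ (suffix added if it exists)
)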
print(f"Processed {len(batch_results)} images")
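
The `imgsz=640` argument means frames are resized before inference, and the box coordinates you get back refer to the original image. A common way to resize without distorting the aspect ratio is "letterboxing": scale the image to fit, then pad the remainder. The sketch below illustrates only that general idea; the function names are hypothetical and the library's exact padding rules may differ.

```python
def letterbox_params(width, height, target=640):
    """Scale factor, resized size, and per-side padding for a square target."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x = (target - new_w) / 2
    pad_y = (target - new_h) / 2
    return scale, (new_w, new_h), (pad_x, pad_y)


def to_original(x, y, scale, pad):
    """Map a point from the padded model input back to the source image."""
    return (x - pad[0]) / scale, (y - pad[1]) / scale


# A 1280x720 webcam frame squeezed into a 640x640 model input
scale, resized, pad = letterbox_params(1280, 720)
print(f"scale={scale}, resized={resized}, padding={pad}")
print(to_original(320, 320, scale, pad))  # centre of the input maps to the frame centre
```

This is why a larger `imgsz` tends to help with small objects: each object covers more pixels after resizing, at the cost of slower inference.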

Step 3: Real-Time Webcam Detection

python
# webcam_detection.py — Real-time detection from webcam
from ultralytics import YOLO
import cv2
import time
from collections import defaultdict

# ── Configuration ──
MODEL_PATH = "yolov8n.pt"   # nano = fastest for real-time
CONF_THRESHOLD = 0.5
IOU_THRESHOLD  = 0.45
CLASSES_TO_DETECT = None    # None = detect all, or [0, 2, 5] for specific classes
                             # COCO class 0=person, 2=car, 5=bus

# ── Load model ──
model = YOLO(MODEL_PATH)
model.to("cpu")  # or "cuda" if you have a GPU

# ── Open webcam ──
cap = cv2.VideoCapture(0)  # 0 = default webcam
if not cap.isOpened():
    raise RuntimeError("Cannot open webcam. Check connection.")

cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

# ── FPS tracking ──
fps_history = []
frame_count = 0

print("Starting real-time detection. Press 'q' to quit.")

while True:
    t0 = time.time()
    ret, frame = cap.read()
    if not ret:
        print("Failed to read frame")
        break

    # ── Run YOLO inference ──
    results = model(
        source=frame,
        conf=CONF_THRESHOLD,
        iou=IOU_THRESHOLD,
        classes=CLASSES_TO_DETECT,
        verbose=False,  # don't print per-frame
    )

    # ── Draw results ──
    annotated = results[0].plot(
        line_width=2,
        font_size=12,
        labels=True,
        conf=True,
    )

    # ── Calculate and display FPS ──
    fps = 1.0 / (time.time() - t0)
    fps_history.append(fps)
    if len(fps_history) > 30:
        fps_history.pop(0)
    avg_fps = sum(fps_history) / len(fps_history)

    # Count detections per class
    class_counts = defaultdict(int)
    for box in results[0].boxes:
        class_name = results[0].names[int(box.cls[0])]
        class_counts[class_name] += 1

    # Overlay FPS and detection info
    cv2.putText(annotated, f"FPS: {avg_fps:.1f}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.putText(annotated, f"Objects: {len(results[0].boxes)}", (10, 70),
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 200, 255), 2)

    # Display
    cv2.imshow("YOLOv8 Real-time Detection", annotated)
    frame_count += 1

    # Press 'q' to quit, 's' to save screenshot
    key = cv2.waitKey(1) & 0xFF
    if key == ord("q"):
        break
    elif key == ord("s"):
        cv2.imwrite(f"screenshot_{frame_count}.jpg", annotated)
        print(f"Screenshot saved: screenshot_{frame_count}.jpg")

cap.release()
cv2.destroyAllWindows()
if fps_history:
    print(f"Session complete. Processed {frame_count} frames at avg {sum(fps_history)/len(fps_history):.1f} FPS")
else:
    print("Session complete. No frames were processed.")
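
The FPS smoothing in the loop above can also be written with a fixed-length `deque`, which drops old samples automatically. The sketch below is one possible variant, not part of the original script: the class name `RollingFPS` is hypothetical, and it averages frame durations rather than instantaneous FPS values, so a single very fast frame cannot inflate the number.

```python
from collections import deque


class RollingFPS:
    """Frames per second over the most recent `window` frame durations."""

    def __init__(self, window=30):
        self.durations = deque(maxlen=window)  # oldest entries fall off automatically

    def update(self, seconds):
        """Record how long one frame took and return the smoothed FPS."""
        if seconds > 0:
            self.durations.append(seconds)
        return self.value

    @property
    def value(self):
        total = sum(self.durations)
        return len(self.durations) / total if total > 0 else 0.0


# Simulated frame times in seconds; only the last three stay in the window
meter = RollingFPS(window=3)
for frame_time in (0.05, 0.05, 0.10, 0.10):
    meter.update(frame_time)
print(f"Smoothed FPS: {meter.value:.1f}")
```

In the webcam loop you would call `meter.update(time.time() - t0)` once per frame and draw the returned value.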

Step 4: Custom Training — Dataset YAML

yaml
# dataset.yaml — Custom dataset configuration for YOLO training
# Place this file alongside your data directory

# Dataset root directory (relative or absolute)
path: /home/user/datasets/my_custom_dataset

# Image directories (relative to path)
train: images/train    # Training images
val:   images/val      # Validation images
test:  images/test     # Test images (optional)

# Number of classes
nc: 5

# Class names (must match label indices 0, 1, 2, ...)
names:
  0: motorcycle
  1: car
  2: bus
  3: truck
  4: pedestrian

# Dataset statistics (optional, for reference)
# Total images: 4,200
# Total annotations: 18,500
# Image sizes: 640×640 (resized during training)

# ──────────────────────────────────────────────────────────
# Expected directory structure:
# my_custom_dataset/
# ├── images/
# │   ├── train/          ← your .jpg or .png images
# │   │   ├── img_001.jpg
# │   │   └── ...
# │   ├── val/
# │   └── test/
# └── labels/             ← YOLO format labels (auto-detected)
#     ├── train/
#     │   ├── img_001.txt  ← one .txt per image
#     │   └── ...
#     ├── val/
#     └── test/
#
# YOLO label format (one line per object):
# <class_id> <x_center> <y_center> <width> <height>
# All values normalised 0–1 relative to image dimensions
# Example: 0 0.5234 0.3891 0.2104 0.4502
#          ↑ class  ↑ cx   ↑ cy   ↑ w    ↑ h
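
If your annotations are stored as pixel coordinates, they need converting to the normalised format described above. The sketch below shows the arithmetic in both directions, assuming boxes are given as (x1, y1, x2, y2) pixel corners; the function names are hypothetical helpers, not part of any library.

```python
def to_yolo_line(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space corner box into one YOLO label line."""
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def from_yolo_line(line, img_w, img_h):
    """Convert a YOLO label line back into (class_id, pixel corner box)."""
    class_id, cx, cy, w, h = line.split()
    cx, w = float(cx) * img_w, float(w) * img_w
    cy, h = float(cy) * img_h, float(h) * img_h
    return int(class_id), (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# A class-1 box on a 640x480 image
label_line = to_yolo_line(1, 100, 200, 300, 400, img_w=640, img_h=480)
print(label_line)
print(from_yolo_line(label_line, img_w=640, img_h=480))
```

Annotation tools usually export this format directly, so a converter like this is mainly useful for legacy datasets or for sanity-checking exported labels.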

Step 5: Train a Custom Model

python
# train_custom.py — Fine-tune YOLOv8 on your custom dataset
from ultralytics import YOLO
import yaml
import os

# ── Load pretrained YOLOv8 (transfer learning) ──
# Starting from pretrained weights dramatically reduces training time
# and improves accuracy compared to training from scratch
model = YOLO("yolov8m.pt")   # start from medium pretrained model

# ── Training configuration ──
training_config = {
    "data":      "dataset.yaml",  # path to your dataset config
    "epochs":    100,             # number of training epochs
    "imgsz":     640,             # input image size
    "batch":     16,              # batch size (reduce if OOM)
    "device":    "0",             # GPU device (use "cpu" for CPU)
    "workers":   4,               # dataloader workers
    "project":   "runs/train",    # save directory
    "name":      "nepal_traffic_v1",

    # Optimiser
    "optimizer": "AdamW",
    "lr0":       0.001,           # initial learning rate
    "lrf":       0.01,            # final LR = lr0 * lrf
    "momentum":  0.937,
    "weight_decay": 0.0005,

    # Augmentation (enabled by default)
    "augment":   True,
    "mosaic":    1.0,             # mosaic augmentation probability
    "mixup":     0.1,             # mixup augmentation
    "flipud":    0.0,             # vertical flip
    "fliplr":    0.5,             # horizontal flip

    # Early stopping
    "patience":  20,              # stop if no improvement for 20 epochs

    # Pretrained weights
    "pretrained": True,
    "freeze":    0,               # 0 = train all layers
                                  # 10 = freeze first 10 layers (faster, less data needed)
}

print("Starting training...")
print(f"Model: YOLOv8m | Epochs: {training_config['epochs']} | Batch: {training_config['batch']}")

# ── Start training ──
results = model.train(**training_config)

# ── Print final metrics ──
print("\nTraining complete!")
print(f"Best mAP50:    {results.results_dict.get('metrics/mAP50(B)', 0):.4f}")
print(f"Best mAP50-95: {results.results_dict.get('metrics/mAP50-95(B)', 0):.4f}")
print(f"Weights saved: {model.trainer.best}")

# ── Validate on test set ──
val_results = model.val(
    data="dataset.yaml",
    split="test",
    conf=0.5,
    iou=0.6,
    save_json=True,   # save COCO-format results
)

print(f"\nTest mAP50: {val_results.box.map50:.4f}")
print(f"Test mAP50-95: {val_results.box.map:.4f}")

# ── Export for deployment ──
# Export to ONNX for fast CPU inference on any platform
model.export(format="onnx", imgsz=640, opset=12, dynamic=False)
print("Model exported to ONNX format")

# For edge devices (Raspberry Pi, Jetson Nano):
# model.export(format="ncnn")    # Tencent NCNN format
# model.export(format="tflite") # TensorFlow Lite for mobile
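
To see what `lr0` and `lrf` mean in practice, the sketch below computes the learning rate per epoch under the assumption of a simple linear decay from `lr0` down to `lr0 * lrf`, ignoring any warmup phase. The actual schedule depends on the library version and settings (a cosine schedule is a common alternative), so treat this as an illustration of the two parameters rather than the exact training behaviour.

```python
def lr_at_epoch(epoch, epochs=100, lr0=0.001, lrf=0.01):
    """Learning rate at a given epoch, assuming linear decay to lr0 * lrf."""
    factor = max(1 - epoch / epochs, 0) * (1.0 - lrf) + lrf
    return lr0 * factor


for epoch in (0, 25, 50, 75, 100):
    print(f"epoch {epoch:3d}: lr = {lr_at_epoch(epoch):.6f}")
```

With the values from the configuration above, the rate starts at 0.001 and ends at 0.00001, a hundredfold reduction over the run.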

ℹ️ YOLO Versions Comparison: Which to Use?

| Version | Released | Best Feature | Recommendation |
| YOLOv5 | 2020 | Widely adopted, huge community, great tooling | Legacy projects only |
| YOLOv8 | 2023 | Best docs, most integrations, easiest to use | Default choice for most projects |
| YOLOv9 | 2024 | Programmable Gradient Information, better accuracy | When accuracy matters more than speed |
| YOLOv10 | 2024 | NMS-free, faster inference, dual assignment | Production API, latency-critical |
| YOLOv11 | 2024 | Latest architecture, instance segmentation support | New projects in 2025 |

Nepal Use Case

Nepal Use Case: Traffic Monitoring in Kathmandu

Kathmandu's traffic is notoriously chaotic — mixed vehicle types, inconsistent lane discipline, and limited traffic monitoring infrastructure. A YOLOv8-based traffic monitoring system can provide real-time vehicle counting, traffic density estimation, and incident detection from existing CCTV cameras without requiring new hardware.

A real implementation would fine-tune YOLOv8 on a Nepal-specific vehicle dataset with the classes motorcycle, car, microbus, tempo, truck, bus, bicycle, and pedestrian. Tempos (electric three-wheelers) and microbuses are characteristic of Kathmandu traffic and are not among the COCO classes, so they need custom labels.

python
# nepal_traffic_counter.py — Vehicle counting for Kathmandu intersections
from ultralytics import YOLO
import cv2
import numpy as np
from collections import defaultdict

# Kathmandu-specific vehicle classes
CLASSES = {
    0: "motorcycle", 1: "car", 2: "microbus",
    3: "tempo",      4: "truck", 5: "bus",
    6: "bicycle",    7: "pedestrian",
}

# Colours for each class
COLORS = {
    0: (255, 165, 0),   # orange - motorcycle
    1: (0, 200, 255),   # cyan - car
    2: (0, 255, 100),   # green - microbus
    3: (255, 0, 200),   # magenta - tempo
    4: (255, 50, 50),   # red - truck
    5: (50, 50, 255),   # blue - bus
    6: (255, 255, 0),   # yellow - bicycle
    7: (200, 200, 200), # grey - pedestrian
}

model = YOLO("nepal_traffic_v1.pt")  # your fine-tuned weights, e.g. a copy of
                                     # runs/train/nepal_traffic_v1/weights/best.pt

# Load traffic camera feed (replace with actual RTSP URL or video file)
cap = cv2.VideoCapture("traffic_camera_ratnapark.mp4")

# Vehicle counter
total_counts = defaultdict(int)
frame_counts = defaultdict(int)  # per-frame counts for density

# Define counting line (horizontal line at 60% of frame height)
COUNTING_LINE_Y = None  # will be set on first frame

while True:
    ret, frame = cap.read()
    if not ret:
        break

    if COUNTING_LINE_Y is None:
        COUNTING_LINE_Y = int(frame.shape[0] * 0.6)

    results = model(frame, conf=0.4, classes=list(CLASSES.keys()), verbose=False)

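    # Count vehicles near the line. NOTE: this proximity test fires on every
    # frame in which a box centre lies within 15 px of the line, so one vehicle
    # can be counted several times (or missed if it moves fast). Treat totals
    # as approximate; exact counts need per-object tracking.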
    frame_counts.clear()
    for box in results[0].boxes:
        cls_id = int(box.cls[0])
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        cy = (y1 + y2) // 2  # center y of box

        frame_counts[cls_id] += 1
        if abs(cy - COUNTING_LINE_Y) < 15:  # crossing the line
            total_counts[cls_id] += 1

    # Draw annotated frame
    annotated = results[0].plot(line_width=1, font_size=8)

    # Draw counting line
    cv2.line(annotated, (0, COUNTING_LINE_Y),
             (frame.shape[1], COUNTING_LINE_Y), (0, 255, 255), 2)

    # Draw stats panel
    y_offset = 10
    for cls_id, name in CLASSES.items():
        count = total_counts[cls_id]
        cv2.putText(annotated, f"{name}: {count}",
                    (10, y_offset + 20), cv2.FONT_HERSHEY_SIMPLEX,
                    0.5, COLORS[cls_id], 1)
        y_offset += 20

    total = sum(frame_counts.values())
    density = "HIGH" if total > 20 else "MEDIUM" if total > 10 else "LOW"
    cv2.putText(annotated, f"Density: {density} ({total} vehicles)",
                (10, frame.shape[0] - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)

    cv2.imshow("Kathmandu Traffic Monitor", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()

print("\nSession Summary:")
for cls_id, name in CLASSES.items():
    print(f"  {name}: {total_counts[cls_id]} vehicles counted")
print(f"  TOTAL: {sum(total_counts.values())} vehicles")
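
Counting by proximity to the line, as the script above does, can register the same vehicle on several consecutive frames. A more robust approach is to give each vehicle a persistent ID and count it once, when its centre moves from one side of the line to the other. The sketch below shows only that counting logic; it assumes the track IDs come from some tracker (not shown here), and the class name `LineCrossingCounter` is hypothetical.

```python
from collections import defaultdict


class LineCrossingCounter:
    """Count each tracked object once when its centre crosses a horizontal line."""

    def __init__(self, line_y):
        self.line_y = line_y
        self.last_side = {}             # track_id -> -1 (above line) or +1 (below)
        self.counted = set()            # track IDs that have already been counted
        self.counts = defaultdict(int)  # class_id -> number of crossings

    def update(self, track_id, cls_id, cy):
        """Feed one detection; return True if it was counted on this frame."""
        side = 1 if cy >= self.line_y else -1
        previous = self.last_side.get(track_id)
        self.last_side[track_id] = side
        crossed = previous is not None and previous != side
        if crossed and track_id not in self.counted:
            self.counted.add(track_id)
            self.counts[cls_id] += 1
            return True
        return False


# Simulated centre positions over five frames (made-up values)
counter = LineCrossingCounter(line_y=432)
for cy in (400, 420, 431, 440, 450):    # track 7 (class 1) moves down across the line
    counter.update(track_id=7, cls_id=1, cy=cy)
for cy in (500, 510):                   # track 8 (class 0) stays below the line
    counter.update(track_id=8, cls_id=0, cy=cy)
print(dict(counter.counts))
```

Because the counter stores only the last side per ID, it is independent of frame rate: a fast vehicle that jumps over the 15-pixel band between two frames is still counted.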
Conclusion

From Detection to Production

You now have everything you need to build real-time object detection systems with YOLO. The Ultralytics library makes the workflow remarkably approachable — from a three-line detection script to a full custom training pipeline with just Python.

The key to making YOLO work well for your specific use case is data quality. A model trained on 500 high-quality, well-annotated images from your actual deployment environment will often outperform a model trained on 5,000 generic images. Use tools like Label Studio or Roboflow for annotation, and always validate on data from the same distribution as your production environment.
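
One quick data-quality check is to count how many boxes each class has, since a badly under-represented class is a common cause of weak results. The sketch below assumes labels are in the YOLO text format with five values per line; the function name is hypothetical, and the demo writes two tiny made-up label files to a temporary folder so that it runs on its own.

```python
import tempfile
from collections import Counter
from pathlib import Path


def class_distribution(labels_dir):
    """Count annotated boxes per class ID across all .txt label files."""
    counts = Counter()
    for label_file in sorted(Path(labels_dir).glob("*.txt")):
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if len(parts) == 5:  # class_id, x_center, y_center, width, height
                counts[int(parts[0])] += 1
    return counts


# Demo with made-up labels; point this at labels/train in a real dataset
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "img_001.txt").write_text("0 0.5 0.5 0.2 0.2\n1 0.3 0.3 0.1 0.1\n")
    Path(tmp, "img_002.txt").write_text("1 0.6 0.4 0.2 0.3\n")
    distribution = class_distribution(tmp)

for class_id, count in sorted(distribution.items()):
    print(f"class {class_id}: {count} boxes")
```

Running the same count on the training and validation folders also shows whether the two splits have a similar class mix.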

Nepal presents unique and interesting computer vision challenges: diverse lighting from Himalayan altitudes, distinctive vehicle types, multilingual signage, and varied terrain. If you build and open-source Nepal-specific datasets, you'll contribute something genuinely valuable to the global ML community.

Written by

Shiv Shankar Sah

AI Solutions Lead at HexCode Nepal

Passionate about making AI education accessible in Nepal. Writing tutorials, guides, and deep-dives on ML, LLMs, and production AI systems.
