Object detection is one of the most visually compelling and practically useful applications of computer vision. Rather than just classifying what an image contains, object detection finds where objects are — drawing bounding boxes around every person, car, animal, or item in a scene, in real time. The applications are everywhere: security cameras, self-driving cars, medical imaging, manufacturing quality control, and agricultural monitoring.
YOLO (You Only Look Once) has been the dominant real-time object detection framework since 2016. Its key insight — treating detection as a single regression problem rather than a two-stage pipeline — made it fast enough for real-time video while remaining accurate. In this guide, you'll go from zero to running a custom-trained YOLO model that detects objects in real time using your webcam.
Where Object Detection Is Used
YOLO Architecture Deep Dive
Understanding YOLO's architecture helps you tune it effectively for your use case. The modern YOLOv8/v9/v10/v11 architecture has three main components: a backbone for feature extraction, a neck for multi-scale feature aggregation, and a head for prediction.
YOLO vs Other Detectors
| Model | Approach | Speed (FPS) | mAP (COCO) | Real-time? | Best For |
|---|---|---|---|---|---|
| YOLOv8 (Nano) | One-stage | 150+ | 37.3 | Yes | Edge devices, mobile, embedded |
| YOLOv8 (Large) | One-stage | 35 | 52.9 | Yes | Server inference, high accuracy |
| YOLOv11 (M) | One-stage | 60 | 51.5 | Yes | Balanced speed/accuracy |
| Faster R-CNN | Two-stage | 5–10 | 55+ | No | High accuracy, batch processing |
| SSD (MobileNet) | One-stage | 60 | 23 | Yes | Lightweight mobile deployment |
| RT-DETR (Large) | Transformer | 25 | 53.1 | Borderline | When accuracy is critical |
The Math Behind YOLO
Intersection over Union (IoU)
IoU is the primary metric for measuring how well a predicted bounding box matches the ground truth. It's used during both training (loss function) and evaluation (mAP calculation). A prediction is typically considered correct if IoU > 0.5.
IoU = Area of Overlap / Area of Union
Area of Overlap is the intersection of the predicted box and the ground truth box. Area of Union is the total area covered by both boxes combined. IoU = 1.0 means perfect overlap; IoU = 0 means no overlap. A threshold of 0.5 is standard for most benchmarks; COCO's headline mAP averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
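The IoU formula can be sketched in a few lines of Python. This is a minimal illustration rather than the Ultralytics implementation: it assumes boxes are given as (x1, y1, x2, y2) corner coordinates, and the function name `iou` is chosen here for the example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the overlap, which was counted twice
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 10x10 boxes, the second shifted 5 px right: overlap 50, union 150
    print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

Two identical boxes give 1.0 and two disjoint boxes give 0.0, matching the definition above.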
Non-Maximum Suppression (NMS)
YOLO typically predicts hundreds of overlapping bounding boxes for the same object. NMS selects the best box and suppresses redundant ones. The algorithm: (1) Sort all predictions by confidence score, (2) Select the highest-confidence box, (3) Remove all other boxes with IoU > threshold (typically 0.45) with the selected box, (4) Repeat until no boxes remain.
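The four steps above can be written as a short greedy loop. The sketch below is pure Python for readability and is not how Ultralytics implements NMS internally; it assumes (x1, y1, x2, y2) boxes, ignores class labels (real pipelines usually run NMS per class), and the names `iou` and `nms` are chosen for this example.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS. Returns indices of the kept boxes, highest score first."""
    # (1) Sort candidate indices by confidence, highest first
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # (2) Select the highest-confidence remaining box
        best = order.pop(0)
        keep.append(best)
        # (3) Drop every remaining box that overlaps it too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        # (4) Loop until no candidates remain
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # the second box overlaps the first and is suppressed
```

Lowering `iou_threshold` suppresses more aggressively, which helps with duplicate boxes but can merge genuinely adjacent objects in crowded scenes.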
Step 1: Installation
# Install Ultralytics YOLO (includes YOLOv8, v9, v10, v11)
pip install ultralytics opencv-python numpy pillow
# Verify installation
python -c "from ultralytics import YOLO; print('YOLO ready!')"
# Download pretrained YOLOv8 weights (automatic on first use)
# Available: yolov8n (nano), yolov8s (small), yolov8m (medium),
# yolov8l (large), yolov8x (extra-large)
# nano = fastest, extra-large = most accurate
Step 2: Basic Detection on an Image
# detect_image.py — Object detection on a single image
from ultralytics import YOLO
import cv2
import numpy as np
from pathlib import Path
# ── Load pretrained YOLOv8 model ──
# First run downloads the weights automatically
# (roughly 50 MB for yolov8m; the ~6 MB figure applies to yolov8n)
model = YOLO("yolov8m.pt") # medium model: speed/accuracy balance
# ── Run inference on an image ──
image_path = "test_image.jpg" # replace with your image
results = model(
    source=image_path,
    conf=0.5,       # confidence threshold (0–1)
    iou=0.45,       # NMS IoU threshold
    imgsz=640,      # inference image size
    device="cpu",   # use "0" for GPU, "cpu" for CPU
    verbose=True,   # print results to console
)
# ── Process results ──
for r in results:
    print(f"\nDetected {len(r.boxes)} objects:")
    print(f"Classes: {r.names}")
    for i, box in enumerate(r.boxes):
        class_id = int(box.cls[0])
        class_name = r.names[class_id]
        confidence = float(box.conf[0])
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        print(f"  [{i+1}] {class_name}: {confidence:.2f} @ [{x1},{y1},{x2},{y2}]")
# ── Save annotated image ──
annotated = results[0].plot() # BGR numpy array with boxes drawn
cv2.imwrite("output_detected.jpg", annotated)
print("\nSaved annotated image: output_detected.jpg")
# ── Batch inference on a folder ──
batch_results = model(
    source="./test_images/",  # process entire folder
    conf=0.5,
    save=True,                # auto-save annotated images
    project="outputs",        # output location is set with project/name;
    name="batch",             # results are written under <project>/<name>
)
print(f"Processed {len(batch_results)} images")
Step 3: Real-Time Webcam Detection
# webcam_detection.py — Real-time detection from webcam
from ultralytics import YOLO
import cv2
import time
from collections import defaultdict
# ── Configuration ──
MODEL_PATH = "yolov8n.pt" # nano = fastest for real-time
CONF_THRESHOLD = 0.5
IOU_THRESHOLD = 0.45
CLASSES_TO_DETECT = None # None = detect all, or [0, 2, 5] for specific classes
# COCO class 0=person, 2=car, 5=bus
# ── Load model ──
model = YOLO(MODEL_PATH)
model.to("cpu") # or "cuda" if you have a GPU
# ── Open webcam ──
cap = cv2.VideoCapture(0) # 0 = default webcam
if not cap.isOpened():
    raise RuntimeError("Cannot open webcam. Check connection.")
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
# ── FPS tracking ──
fps_history = []
frame_count = 0
print("Starting real-time detection. Press 'q' to quit.")
while True:
    t0 = time.time()
    ret, frame = cap.read()
    if not ret:
        print("Failed to read frame")
        break
    # ── Run YOLO inference ──
    results = model(
        source=frame,
        conf=CONF_THRESHOLD,
        iou=IOU_THRESHOLD,
        classes=CLASSES_TO_DETECT,
        verbose=False,  # don't print per-frame
    )
    # ── Draw results ──
    annotated = results[0].plot(
        line_width=2,
        font_size=12,
        labels=True,
        conf=True,
    )
    # ── Calculate and display FPS ──
    fps = 1.0 / (time.time() - t0)
    fps_history.append(fps)
    if len(fps_history) > 30:
        fps_history.pop(0)
    avg_fps = sum(fps_history) / len(fps_history)
    # Count detections per class
    class_counts = defaultdict(int)
    for box in results[0].boxes:
        class_name = results[0].names[int(box.cls[0])]
        class_counts[class_name] += 1
    # Overlay FPS and detection info
    cv2.putText(annotated, f"FPS: {avg_fps:.1f}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.putText(annotated, f"Objects: {len(results[0].boxes)}", (10, 70),
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 200, 255), 2)
    # Display
    cv2.imshow("YOLOv8 Real-time Detection", annotated)
    frame_count += 1
    # Press 'q' to quit, 's' to save screenshot
    key = cv2.waitKey(1) & 0xFF
    if key == ord("q"):
        break
    elif key == ord("s"):
        cv2.imwrite(f"screenshot_{frame_count}.jpg", annotated)
        print(f"Screenshot saved: screenshot_{frame_count}.jpg")
cap.release()
cv2.destroyAllWindows()
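# fps_history holds only the most recent 30 frames; guard against an empty list
recent_fps = sum(fps_history) / len(fps_history) if fps_history else 0.0
print(f"Session complete. Processed {frame_count} frames; recent average {recent_fps:.1f} FPS")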
Step 4: Custom Training — Dataset YAML
# dataset.yaml — Custom dataset configuration for YOLO training
# Place this file alongside your data directory
# Dataset root directory (relative or absolute)
path: /home/user/datasets/my_custom_dataset
# Image directories (relative to path)
train: images/train # Training images
val: images/val # Validation images
test: images/test # Test images (optional)
# Number of classes
nc: 5
# Class names (must match label indices 0, 1, 2, ...)
names:
  0: motorcycle
  1: car
  2: bus
  3: truck
  4: pedestrian
# Dataset statistics (optional, for reference)
# Total images: 4,200
# Total annotations: 18,500
# Image sizes: 640×640 (resized during training)
# ──────────────────────────────────────────────────────────
# Expected directory structure:
# my_custom_dataset/
# ├── images/
# │   ├── train/            ← your .jpg or .png images
# │   │   ├── img_001.jpg
# │   │   └── ...
# │   ├── val/
# │   └── test/
# └── labels/               ← YOLO format labels (auto-detected)
#     ├── train/
#     │   ├── img_001.txt   ← one .txt per image
#     │   └── ...
#     ├── val/
#     └── test/
#
# YOLO label format (one line per object):
# <class_id> <x_center> <y_center> <width> <height>
# All values normalised 0–1 relative to image dimensions
# Example: 0       0.5234  0.3891  0.2104  0.4502
#          ↑ class ↑ cx    ↑ cy    ↑ w     ↑ h
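If your annotations are stored as pixel corner coordinates, they need converting to the normalised centre format described above before training. The following is a minimal sketch, assuming boxes arrive as (x1, y1, x2, y2) in pixels; `xyxy_to_yolo` and `format_label` are hypothetical helper names for this example, not part of Ultralytics.

```python
def xyxy_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert pixel corner coordinates to normalised (cx, cy, w, h)."""
    cx = (x1 + x2) / 2 / img_w   # box centre x, as a fraction of image width
    cy = (y1 + y2) / 2 / img_h   # box centre y, as a fraction of image height
    w = (x2 - x1) / img_w        # box width, as a fraction of image width
    h = (y2 - y1) / img_h        # box height, as a fraction of image height
    return cx, cy, w, h


def format_label(class_id, box, img_w, img_h):
    """Build one line of a YOLO label file for a single object."""
    cx, cy, w, h = xyxy_to_yolo(*box, img_w, img_h)
    return f"{class_id} {cx:.4f} {cy:.4f} {w:.4f} {h:.4f}"


if __name__ == "__main__":
    # A 200x200 px box in a 640x480 image, labelled as class 0
    print(format_label(0, (100, 50, 300, 250), 640, 480))
```

Each image gets one .txt file containing one such line per object, with the same base name as the image.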
Step 5: Train a Custom Model
# train_custom.py — Fine-tune YOLOv8 on your custom dataset
from ultralytics import YOLO
import yaml
import os
# ── Load pretrained YOLOv8 (transfer learning) ──
# Starting from pretrained weights dramatically reduces training time
# and improves accuracy compared to training from scratch
model = YOLO("yolov8m.pt") # start from medium pretrained model
# ── Training configuration ──
training_config = {
    "data": "dataset.yaml",      # path to your dataset config
    "epochs": 100,               # number of training epochs
    "imgsz": 640,                # input image size
    "batch": 16,                 # batch size (reduce if OOM)
    "device": "0",               # GPU device (use "cpu" for CPU)
    "workers": 4,                # dataloader workers
    "project": "runs/train",     # save directory
    "name": "nepal_traffic_v1",
    # Optimiser
    "optimizer": "AdamW",
    "lr0": 0.001,                # initial learning rate
    "lrf": 0.01,                 # final LR = lr0 * lrf
    "momentum": 0.937,
    "weight_decay": 0.0005,
    # Augmentation (enabled by default)
    "augment": True,
    "mosaic": 1.0,               # mosaic augmentation probability
    "mixup": 0.1,                # mixup augmentation
    "flipud": 0.0,               # vertical flip
    "fliplr": 0.5,               # horizontal flip
    # Early stopping
    "patience": 20,              # stop if no improvement for 20 epochs
    # Pretrained weights
    "pretrained": True,
    "freeze": 0,                 # 0 = train all layers
                                 # 10 = freeze first 10 layers (faster, less data needed)
}
print("Starting training...")
print(f"Model: YOLOv8m | Epochs: {training_config['epochs']} | Batch: {training_config['batch']}")
# ── Start training ──
results = model.train(**training_config)
# ── Print final metrics ──
print("\nTraining complete!")
print(f"Best mAP50: {results.results_dict.get('metrics/mAP50(B)', 0):.4f}")
print(f"Best mAP50-95: {results.results_dict.get('metrics/mAP50-95(B)', 0):.4f}")
print(f"Weights saved: {model.trainer.best}")
# ── Validate on test set ──
val_results = model.val(
    data="dataset.yaml",
    split="test",
    conf=0.001,      # keep this low for mAP: a high threshold such as 0.5
                     # truncates the precision-recall curve and lowers the score
    iou=0.6,
    save_json=True,  # save COCO-format results
)
print(f"\nTest mAP50: {val_results.box.map50:.4f}")
print(f"Test mAP50-95: {val_results.box.map:.4f}")
# ── Export for deployment ──
# Export to ONNX for fast CPU inference on any platform
model.export(format="onnx", imgsz=640, opset=12, dynamic=False)
print("Model exported to ONNX format")
# For edge devices (Raspberry Pi, Jetson Nano):
# model.export(format="ncnn") # Tencent NCNN format
# model.export(format="tflite") # TensorFlow Lite for mobile
| Version | Released | Best Feature | Recommendation |
|---|---|---|---|
| YOLOv5 | 2020 | Widely adopted, huge community, great tooling | Legacy projects only |
| YOLOv8 | 2023 | Best docs, most integrations, easiest to use | Default choice for most projects |
| YOLOv9 | 2024 | Programmable Gradient Information, better accuracy | When accuracy matters more than speed |
| YOLOv10 | 2024 | NMS-free, faster inference, dual assignment | Production API, latency-critical |
| YOLOv11 | 2024 | Latest architecture, instance segmentation support | New projects in 2025 |
Nepal Use Case: Traffic Monitoring in Kathmandu
Kathmandu's traffic is notoriously chaotic — mixed vehicle types, inconsistent lane discipline, and limited traffic monitoring infrastructure. A YOLOv8-based traffic monitoring system can provide real-time vehicle counting, traffic density estimation, and incident detection from existing CCTV cameras without requiring new hardware.
A real implementation would fine-tune YOLOv8 on a Nepal-specific vehicle dataset including classes: motorcycle, car, microbus, tempo, truck, bus, bicycle, and pedestrian. Unique to Kathmandu traffic: tempos (electric three-wheelers) and microbuses require custom classes not present in the COCO dataset.
# nepal_traffic_counter.py — Vehicle counting for Kathmandu intersections
from ultralytics import YOLO
import cv2
import numpy as np
from collections import defaultdict
# Kathmandu-specific vehicle classes
CLASSES = {
0: "motorcycle", 1: "car", 2: "microbus",
3: "tempo", 4: "truck", 5: "bus",
6: "bicycle", 7: "pedestrian",
}
# Colours for each class (OpenCV expects BGR order, not RGB)
COLORS = {
    0: (0, 165, 255),    # orange - motorcycle
    1: (255, 200, 0),    # cyan - car
    2: (100, 255, 0),    # green - microbus
    3: (200, 0, 255),    # magenta - tempo
    4: (50, 50, 255),    # red - truck
    5: (255, 50, 50),    # blue - bus
    6: (0, 255, 255),    # yellow - bicycle
    7: (200, 200, 200),  # grey - pedestrian
}
model = YOLO("nepal_traffic_v1.pt") # your fine-tuned model
# Load traffic camera feed (replace with actual RTSP URL or video file)
cap = cv2.VideoCapture("traffic_camera_ratnapark.mp4")
# Vehicle counters
total_counts = defaultdict(int)  # vehicles that have crossed the line, per class
frame_counts = defaultdict(int)  # per-frame counts for density
last_cy = {}                     # track ID -> box centre y in the previous frame
counted_ids = set()              # track IDs that have already been counted
# Define counting line (horizontal line at 60% of frame height)
COUNTING_LINE_Y = None  # will be set on first frame
while True:
    ret, frame = cap.read()
    if not ret:
        break
    if COUNTING_LINE_Y is None:
        COUNTING_LINE_Y = int(frame.shape[0] * 0.6)
    # Track rather than just detect: persist=True keeps IDs stable across frames,
    # so a vehicle lingering near the line is counted once, not once per frame
    results = model.track(frame, persist=True, conf=0.4,
                          classes=list(CLASSES.keys()), verbose=False)
    # Count vehicles crossing the line
    frame_counts.clear()
    for box in results[0].boxes:
        cls_id = int(box.cls[0])
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        cy = (y1 + y2) // 2  # center y of box
        frame_counts[cls_id] += 1
        if box.id is None:  # the tracker has not assigned an ID yet
            continue
        track_id = int(box.id[0])
        prev_cy = last_cy.get(track_id)
        last_cy[track_id] = cy
        if prev_cy is None or track_id in counted_ids:
            continue
        # Crossed if the centre moved from one side of the line to the other
        if (prev_cy < COUNTING_LINE_Y) != (cy < COUNTING_LINE_Y):
            counted_ids.add(track_id)
            total_counts[cls_id] += 1
    # Draw annotated frame
    annotated = results[0].plot(line_width=1, font_size=8)
    # Draw counting line
    cv2.line(annotated, (0, COUNTING_LINE_Y),
             (frame.shape[1], COUNTING_LINE_Y), (0, 255, 255), 2)
    # Draw stats panel
    y_offset = 10
    for cls_id, name in CLASSES.items():
        count = total_counts[cls_id]
        cv2.putText(annotated, f"{name}: {count}",
                    (10, y_offset + 20), cv2.FONT_HERSHEY_SIMPLEX,
                    0.5, COLORS[cls_id], 1)
        y_offset += 20
    total = sum(frame_counts.values())
    density = "HIGH" if total > 20 else "MEDIUM" if total > 10 else "LOW"
    cv2.putText(annotated, f"Density: {density} ({total} vehicles)",
                (10, frame.shape[0] - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)
    cv2.imshow("Kathmandu Traffic Monitor", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
print("\nSession Summary:")
for cls_id, name in CLASSES.items():
    print(f"  {name}: {total_counts[cls_id]} vehicles counted")
print(f"  TOTAL: {sum(total_counts.values())} vehicles")
From Detection to Production
You now have everything you need to build real-time object detection systems with YOLO. The Ultralytics library makes the workflow remarkably approachable — from a three-line detection script to a full custom training pipeline with just Python.
The key to making YOLO work well for your specific use case is data quality. A model trained on 500 high-quality, well-annotated images from your actual deployment environment will almost always outperform a model trained on 5,000 generic images. Use tools like Label Studio or Roboflow for annotation, and always validate on data from the same distribution as your production environment.
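Part of data quality is simply making sure every label file is well-formed before you train. Here is a minimal sketch of such a check, assuming the five-field detection label format from Step 4; `validate_label_line` is a hypothetical helper written for this example, not an Ultralytics function.

```python
def validate_label_line(line, num_classes):
    """Return a list of problems found in one YOLO label line (empty if OK)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        class_id = int(parts[0])
        coords = [float(p) for p in parts[1:]]
    except ValueError:
        return ["non-numeric field"]

    problems = []
    # Class id must be a valid index into the names list in dataset.yaml
    if not 0 <= class_id < num_classes:
        problems.append(f"class id {class_id} outside 0..{num_classes - 1}")
    # All four box values are fractions of the image size
    if any(not 0.0 <= v <= 1.0 for v in coords):
        problems.append("coordinate outside 0-1 range")
    # Width and height are the last two values
    if coords[2] <= 0 or coords[3] <= 0:
        problems.append("non-positive width or height")
    return problems


if __name__ == "__main__":
    print(validate_label_line("0 0.5234 0.3891 0.2104 0.4502", num_classes=5))
    print(validate_label_line("7 0.5 0.5 0.2 0.2", num_classes=5))
```

Running a check like this over every file in labels/ before training catches annotation export mistakes that would otherwise only show up as poor metrics.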
Nepal presents unique and interesting computer vision challenges: diverse lighting from Himalayan altitudes, distinctive vehicle types, multilingual signage, and varied terrain. If you build and open-source Nepal-specific datasets, you'll contribute something genuinely valuable to the global ML community.