Building a Road Extraction Model with Convolutional Neural Networks and Satellite Data
Road networks are the circulatory system of modern civilization. Mapping them accurately — and keeping those maps current — is a challenge that traditional cartography struggles to meet at scale. Enter deep learning: specifically, Convolutional Neural Networks (CNNs) applied to satellite imagery. In this article, we walk through the full pipeline for building a road extraction model from scratch, covering data preparation, model architecture, training strategy, and evaluation.
Why Road Extraction Matters
Manual digitization of road networks is expensive, slow, and error-prone. In rapidly developing regions, newly built roads can lag OpenStreetMap (OSM) by months or years. Disaster response teams need updated road maps within hours, not weeks. Autonomous vehicle HD map providers need centimeter-level accuracy across millions of kilometers.
Automated road extraction from satellite imagery addresses all of these needs — and CNNs have emerged as the dominant approach, consistently outperforming classical computer vision methods on benchmark datasets.
Problem Framing: Semantic Segmentation
Road extraction is fundamentally a binary semantic segmentation task. Given a satellite image tile as input, the model must output a pixel-wise mask where each pixel is classified as either:
- 1 — Road surface
- 0 — Background (buildings, vegetation, water, bare soil, etc.)
This framing makes the problem well-suited to encoder-decoder architectures, where the encoder learns hierarchical spatial features and the decoder reconstructs the spatial resolution of the original image.
Dataset Preparation
Choosing a Dataset
Several high-quality public benchmarks exist for road extraction:
| Dataset | Resolution | Coverage | Labels |
|---|---|---|---|
| SpaceNet Roads | 30–50 cm/px | Global cities | Polylines + masks |
| DeepGlobe Road | 50 cm/px | South/Southeast Asia | Binary masks |
| Massachusetts Roads | 1 m/px | Massachusetts, USA | Binary masks |
| CVPR Road Detection | 15 cm/px | Urban Europe | Binary masks |
For this walkthrough, we’ll use the DeepGlobe Road Extraction Dataset, which contains 6,226 satellite image tiles at 1024×1024 pixels with corresponding binary masks.
Tiling and Patching
Raw satellite tiles are often too large to train on directly due to GPU memory constraints. A standard approach is to slice each 1024×1024 image into smaller patches:
import os
import numpy as np
from PIL import Image

def tile_image(img_path, mask_path, patch_size=512, stride=256, output_dir="patches/"):
    os.makedirs(output_dir, exist_ok=True)
    img = np.array(Image.open(img_path))
    mask = np.array(Image.open(mask_path).convert("L"))
    h, w = img.shape[:2]
    patch_id = 0
    # Slide a patch_size window across the tile; stride < patch_size gives overlap.
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            img_patch = img[y:y+patch_size, x:x+patch_size]
            mask_patch = mask[y:y+patch_size, x:x+patch_size]
            Image.fromarray(img_patch).save(f"{output_dir}/img_{patch_id}.png")
            Image.fromarray(mask_patch).save(f"{output_dir}/mask_{patch_id}.png")
            patch_id += 1
Using a stride smaller than the patch size (overlapping patches) helps prevent the model from learning hard boundary artifacts and significantly increases the effective training set size.
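The loop bounds in tile_image determine the patch count per tile; for the 1024×1024 DeepGlobe tiles with the defaults shown, that works out as:

```python
def patch_grid(size: int, patch_size: int = 512, stride: int = 256) -> int:
    # Patch positions along one axis, matching range(0, size - patch_size + 1, stride).
    return (size - patch_size) // stride + 1

per_axis = patch_grid(1024)    # 3 positions along each axis
print(per_axis * per_axis)     # 9 overlapping patches per 1024×1024 tile
```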
Class Imbalance
Road pixels typically occupy only 5–15% of any given satellite tile. This severe class imbalance is one of the primary challenges in road extraction. Leaving it unaddressed causes the model to converge toward predicting everything as background, achieving high pixel accuracy while being completely useless.
Strategies to address imbalance:
- Weighted loss functions (e.g., weighted binary cross-entropy, Focal Loss)
- Dice Loss, which directly optimizes the overlap metric
- Oversampling patches that contain a minimum road pixel density
- Hard Negative Mining during training
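The oversampling strategy can be sketched as a density filter applied while building the patch list; min_density here is an assumed threshold to tune per dataset:

```python
import numpy as np

def road_density(mask: np.ndarray) -> float:
    # Fraction of road pixels in a binary {0, 255} mask patch.
    return float((mask > 127).mean())

def keep_patch(mask: np.ndarray, min_density: float = 0.02) -> bool:
    # Drop (or sample less often) patches that are almost entirely background.
    return road_density(mask) >= min_density
```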
A combined Dice + BCE loss is the most commonly used approach in practice:
import torch
import torch.nn.functional as F

def dice_loss(pred, target, smooth=1.0):
    pred = torch.sigmoid(pred)
    pred_flat = pred.view(-1)
    target_flat = target.view(-1)
    intersection = (pred_flat * target_flat).sum()
    return 1 - (2. * intersection + smooth) / (pred_flat.sum() + target_flat.sum() + smooth)

def combined_loss(pred, target, bce_weight=0.5):
    bce = F.binary_cross_entropy_with_logits(pred, target)
    dice = dice_loss(pred, target)
    return bce_weight * bce + (1 - bce_weight) * dice
Data Augmentation
Geometric and photometric augmentations dramatically improve generalization across different geographies, seasons, and sensor conditions:
import albumentations as A

train_transform = A.Compose([
    A.RandomRotate90(p=0.5),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.4),
    A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=20, val_shift_limit=10, p=0.3),
    A.GaussNoise(var_limit=(10, 50), p=0.2),
    A.ElasticTransform(alpha=1, sigma=50, alpha_affine=50, p=0.2),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
Model Architecture: U-Net with a Pretrained Encoder
The U-Net architecture, originally developed for biomedical image segmentation, has become the de facto standard for geospatial segmentation tasks. Its key innovation is the skip connections between the encoder and decoder, which allow the decoder to access fine-grained spatial information lost during downsampling.
Architecture Overview
Input Image (512×512×3)
│
┌────▼────┐
│ Encoder │ (Feature extraction: ResNet / EfficientNet / VGG)
│ │ Progressively downsamples (stride-2 convolutions or max pool)
└────┬────┘
│ Bottleneck (32×32 feature map, 512 channels)
┌────▼────┐
│ Decoder │ Progressively upsamples (transposed conv / bilinear upsample)
│ │◄──── Skip connections from encoder at each resolution
└────┬────┘
│
┌────▼────┐
│ Head │ 1×1 conv → single channel sigmoid output
└─────────┘
Output Mask (512×512×1)
Using a Pretrained Encoder
Training from scratch is rarely optimal. A ResNet-34 encoder pretrained on ImageNet provides powerful low-level feature detectors (edges, textures, gradients) that transfer effectively to satellite imagery:
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,
    activation=None,  # raw logits; sigmoid applied in loss
)
The segmentation_models_pytorch library provides a clean interface for swapping encoders. Common alternatives:
- efficientnet-b4 — better accuracy, more parameters
- mit_b2 (Mix Transformer) — excellent for thin structures like roads
- resnext50_32x4d — strong general-purpose choice
Attention and Dense Skip Connections
For road extraction specifically, attention mechanisms in the decoder help the model focus on spatially relevant features, which is particularly useful for thin, elongated road structures that plain skip connections can easily suppress. In segmentation_models_pytorch, SCSE attention can be enabled on smp.Unet via decoder_attention_type="scse"; a related and often stronger option is UNet++, which replaces the plain skips with dense, nested ones:
model = smp.UnetPlusPlus(
    encoder_name="efficientnet-b4",
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,
    activation=None
)
UNet++ with dense skip connections often outperforms standard U-Net on road datasets by 1–3 IoU points.
Training Pipeline
DataLoader Setup
import os
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image

class RoadDataset(Dataset):
    def __init__(self, img_dir, mask_dir, transform=None):
        self.img_paths = sorted([os.path.join(img_dir, f) for f in os.listdir(img_dir)])
        self.mask_paths = sorted([os.path.join(mask_dir, f) for f in os.listdir(mask_dir)])
        self.transform = transform

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, idx):
        img = np.array(Image.open(self.img_paths[idx]).convert("RGB"))
        mask = np.array(Image.open(self.mask_paths[idx]).convert("L")) / 255.0
        if self.transform:
            augmented = self.transform(image=img, mask=mask)
            img = augmented["image"]
            mask = augmented["mask"]
        img = torch.from_numpy(img).permute(2, 0, 1).float()
        mask = torch.from_numpy(mask).unsqueeze(0).float()
        return img, mask

train_loader = DataLoader(
    RoadDataset("data/train/images", "data/train/masks", transform=train_transform),
    batch_size=8, shuffle=True, num_workers=4, pin_memory=True
)
Optimizer and Scheduler
import torch.optim as optim

optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=50, eta_min=1e-6
)
AdamW with cosine annealing is a robust default. An alternative approach uses a 1-cycle policy for faster convergence when training time is limited.
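As a sketch, the 1-cycle alternative can use PyTorch's built-in OneCycleLR; the epoch and batch counts here are placeholders, and the toy model stands in for the smp.Unet defined earlier:

```python
import torch
import torch.optim as optim

# Placeholder model; in the real pipeline this is the smp.Unet defined earlier.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)

optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
# Hypothetical budget: 50 epochs over a loader of 500 batches.
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-4, epochs=50, steps_per_epoch=500,
    pct_start=0.3,  # first 30% of steps warm up toward max_lr, the rest anneal
)
# Unlike CosineAnnealingLR (stepped once per epoch), OneCycleLR is stepped
# once per batch, immediately after optimizer.step().
```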
Training Loop
def train_epoch(model, loader, optimizer, device):
    model.train()
    total_loss = 0
    for imgs, masks in loader:
        imgs, masks = imgs.to(device), masks.to(device)
        optimizer.zero_grad()
        preds = model(imgs)
        loss = combined_loss(preds, masks)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)
Gradient clipping (max_norm=1.0) prevents occasional exploding gradients, especially during early training.
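A matching validation pass, with gradients disabled and IoU computed on thresholded predictions, might look like this (the 0.5 threshold is a starting point, not a tuned value):

```python
import torch

@torch.no_grad()
def validate_epoch(model, loader, device, threshold=0.5):
    model.eval()
    total_iou, n_batches = 0.0, 0
    for imgs, masks in loader:
        imgs, masks = imgs.to(device), masks.to(device)
        preds = (torch.sigmoid(model(imgs)) > threshold).float()
        intersection = (preds * masks).sum()
        union = preds.sum() + masks.sum() - intersection
        total_iou += (intersection / (union + 1e-8)).item()
        n_batches += 1
    return total_iou / max(n_batches, 1)
```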
Evaluation Metrics
Pixel accuracy is a misleading metric due to class imbalance. The field uses three standard metrics:
Intersection over Union (IoU / Jaccard Index)
The ratio of the intersection of predicted and ground truth road pixels to their union. Values range from 0 to 1; scores above 0.65 are considered strong for road extraction.
F1 Score (Dice Coefficient)
The harmonic mean of precision and recall, computed at the pixel level. More sensitive to false negatives than IoU, which is important since missed road segments are costly in navigation applications.
Relaxed Precision / Recall
Because road masks are often imprecisely labeled (e.g., a 1–2 pixel shift in ground truth), a “relaxed” metric counts a predicted pixel as correct if it falls within a small buffer of any ground truth road pixel (commonly ρ = 3 pixels, which at sub-meter resolutions corresponds to a few meters on the ground). This is particularly important when benchmarking against OSM-derived labels.
def compute_iou(pred_mask, true_mask, threshold=0.5):
    pred_binary = (pred_mask > threshold).float()
    intersection = (pred_binary * true_mask).sum()
    union = pred_binary.sum() + true_mask.sum() - intersection
    return (intersection / (union + 1e-8)).item()
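The relaxed variant can be sketched with a distance transform, assuming binary numpy masks and a pixel-unit buffer radius rho (convert meters to pixels via the imagery's ground sample distance):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def relaxed_precision(pred: np.ndarray, truth: np.ndarray, rho: int = 3) -> float:
    """pred, truth: binary {0, 1} masks. A predicted road pixel counts as
    correct if it lies within rho pixels of any true road pixel."""
    if pred.sum() == 0:
        return 0.0
    # Distance from every pixel to the nearest ground-truth road pixel.
    dist_to_truth = distance_transform_edt(1 - truth)
    hits = (pred == 1) & (dist_to_truth <= rho)
    return float(hits.sum() / pred.sum())
```

Relaxed recall is the same computation with pred and truth swapped.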
Post-Processing
Raw model output is a probability map. Converting it to a clean road network requires post-processing:
Thresholding and Morphological Cleanup
import cv2
import numpy as np

def postprocess_mask(prob_map, threshold=0.45):
    binary = (prob_map > threshold).astype(np.uint8) * 255
    # Remove small isolated blobs
    kernel = np.ones((3, 3), np.uint8)
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel, iterations=2)
    # Fill small gaps in road segments
    cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel, iterations=3)
    return cleaned
Skeletonization
For downstream vectorization (converting the raster mask to a road polyline graph), skeletonization reduces the thick road mask to single-pixel centerlines:
import numpy as np
from skimage.morphology import skeletonize

skeleton = skeletonize(binary_mask // 255).astype(np.uint8) * 255
Vectorization
The skeleton can then be converted to a vector road network using graph-based traversal algorithms or tools like osmnx, rdp (Ramer-Douglas-Peucker simplification), or GRASS GIS’s r.thin + r.to.vect.
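As a minimal sketch of the first graph-building step, endpoints and junctions can be found by counting each skeleton pixel's 8-connected neighbors: one neighbor marks an endpoint, three or more a junction:

```python
import numpy as np
from scipy.ndimage import convolve

def classify_skeleton_pixels(skeleton: np.ndarray):
    """skeleton: binary {0, 1} single-pixel-wide centerline mask.
    Returns boolean (endpoints, junctions) masks."""
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],   # center excluded: count neighbors only
                       [1, 1, 1]])
    neighbors = convolve(skeleton, kernel, mode="constant")
    endpoints = (skeleton == 1) & (neighbors == 1)
    junctions = (skeleton == 1) & (neighbors >= 3)
    return endpoints, junctions
```

Tracing the skeleton between these node pixels yields the edges of the road graph, which can then be simplified with rdp.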
Results and Benchmarks
On the DeepGlobe Road Extraction benchmark, a well-trained U-Net with an EfficientNet-B4 encoder typically achieves:
| Metric | Score |
|---|---|
| IoU | 0.67 – 0.72 |
| F1 Score | 0.80 – 0.85 |
| Precision | 0.82 – 0.87 |
| Recall | 0.78 – 0.84 |
State-of-the-art models incorporating Transformer-based encoders (e.g., SegFormer, Swin-T) push IoU above 0.75, particularly in dense urban areas where road intersections create complex topology.
Common Failure Modes and How to Address Them
Shadowed Roads
Tall buildings cast shadows that obscure road surfaces, causing the model to miss road segments. Mitigation: include near-infrared (NIR) bands if available; NIR is shadow-invariant. Alternatively, train with synthetic shadow augmentation.
Unpaved Roads
Gravel tracks and dirt roads have spectral signatures similar to bare soil, making them difficult to separate. Mitigation: use multi-temporal imagery — repeated observations help distinguish transient bare soil from persistent road surfaces.
Dense Tree Canopy
Roads under dense canopy are invisible in optical imagery. This is a fundamental sensor limitation. SAR (Synthetic Aperture Radar) data from Sentinel-1 can penetrate canopy and can be fused with optical bands as additional input channels.
Road–Parking Lot Confusion
Large paved parking lots share the same spectral signature as roads. Mitigation: include contextual features (building proximity, vegetation density) or use instance-aware architectures that model road topology explicitly.
Extending the Pipeline: Multi-Spectral Inputs
Most satellite platforms provide more than three spectral bands. Sentinel-2 provides 13 bands; WorldView-3 provides 8. Incorporating additional bands — particularly NIR, SWIR, and Red-Edge — improves discrimination between roads and spectrally similar surfaces.
Modifying the model to accept more input channels requires only a minor architectural change:
model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights=None,  # cannot use ImageNet weights directly with non-RGB input
    in_channels=8,         # 8-band multispectral input
    classes=1,
    activation=None
)
When encoder_weights="imagenet" is desired with non-RGB inputs, a common trick is to initialize the first convolutional layer by averaging the pretrained RGB weights and tiling them across the additional channels.
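That trick can be sketched as a small helper operating on the first conv layer's weight tensor; the 3 / in_channels rescaling is one common choice for roughly preserving activation magnitude:

```python
import torch

def inflate_conv_weight(w_rgb: torch.Tensor, in_channels: int) -> torch.Tensor:
    """w_rgb: (out_ch, 3, kH, kW) pretrained first-layer weights.
    Returns (out_ch, in_channels, kH, kW), rescaled so the layer's
    expected activation magnitude is roughly preserved."""
    w_mean = w_rgb.mean(dim=1, keepdim=True)     # average the RGB filters
    w_new = w_mean.repeat(1, in_channels, 1, 1)  # tile across all input bands
    return w_new * (3.0 / in_channels)           # magnitude correction
```

For ResNet-style encoders in segmentation_models_pytorch, the resulting tensor would typically be copied into model.encoder.conv1.weight before loading the remaining pretrained weights.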
Practical Deployment Considerations
Tiled Inference
At inference time, large images must be processed as overlapping tiles. Predictions from overlapping regions are averaged (or max-pooled) to produce a seamless output mask without edge artifacts.
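A minimal sketch of overlap-averaged tiled inference; the count array records how many tiles covered each pixel, and the function assumes the image extent is covered by the stride grid:

```python
import numpy as np
import torch

@torch.no_grad()
def predict_large_image(model, image: np.ndarray, patch=512, stride=256, device="cpu"):
    """image: (H, W, 3) normalized float array. Returns an (H, W) probability map,
    averaging predictions wherever overlapping tiles cover the same pixel."""
    h, w = image.shape[:2]
    probs = np.zeros((h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)
    model.eval()
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            tile = image[y:y + patch, x:x + patch]
            t = torch.from_numpy(tile).permute(2, 0, 1).unsqueeze(0).float().to(device)
            p = torch.sigmoid(model(t))[0, 0].cpu().numpy()
            probs[y:y + patch, x:x + patch] += p
            counts[y:y + patch, x:x + patch] += 1
    return probs / np.maximum(counts, 1)
```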
Geographic Projection
Model outputs must be georeferenced. Store the affine transform and CRS of each input tile and apply them to the output mask using rasterio:
import rasterio

with rasterio.open("input.tif") as src:
    transform = src.transform
    crs = src.crs

with rasterio.open("output_mask.tif", "w",
                   driver="GTiff", height=mask.shape[0], width=mask.shape[1],
                   count=1, dtype="uint8", crs=crs, transform=transform) as dst:
    dst.write(mask, 1)
Change Detection
By running the model on multi-temporal imagery and computing the difference between output masks, you get a simple road change detection pipeline — useful for identifying newly constructed or destroyed road segments.
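The mask-diff idea reduces to boolean set operations on two co-registered binary masks:

```python
import numpy as np

def road_changes(mask_t0: np.ndarray, mask_t1: np.ndarray):
    """Binary {0, 1} masks from two acquisition dates (same extent and CRS).
    Returns boolean (new_roads, removed_roads) masks."""
    new_roads = (mask_t1 == 1) & (mask_t0 == 0)      # built between t0 and t1
    removed_roads = (mask_t0 == 1) & (mask_t1 == 0)  # destroyed or occluded
    return new_roads, removed_roads
```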
Conclusion
Building a road extraction model is a rewarding end-to-end deep learning exercise that touches on geospatial data handling, semantic segmentation, loss function design, and practical deployment. The key takeaways:
- Frame the problem as binary semantic segmentation and address class imbalance early.
- Use a U-Net with a pretrained encoder — it reliably outperforms training from scratch.
- Combine Dice Loss and BCE to handle the pixel imbalance and optimize directly for overlap.
- Heavy augmentation is non-negotiable for cross-geography generalization.
- Post-process aggressively — raw model outputs need morphological cleanup and skeletonization before they are useful as road networks.
- Consider multi-spectral inputs when optical RGB alone is insufficient.
Road extraction is also a natural entry point into broader GeoAI workflows. Once the mask is produced, downstream tasks — change detection, route planning, population access analysis, infrastructure damage assessment — all become tractable. The road network is, in more ways than one, the foundation everything else is built on.
Want to go deeper? The SpaceNet 3 challenge provides one of the most rigorously benchmarked road extraction datasets, with solutions from top competitors publicly available on GitHub.
