Building a Road Extraction Model with Convolutional Neural Networks and Satellite Data
Road networks are the circulatory system of modern civilization. Mapping them accurately — and keeping those maps current — is a challenge that traditional cartography struggles to meet at scale. Enter deep learning: specifically, Convolutional Neural Networks (CNNs) applied to satellite imagery. In this article, we walk through the full pipeline for building a road extraction model from scratch, covering data preparation, model architecture, training strategy, and evaluation.
Why Road Extraction Matters
Manual digitization of road networks is expensive, slow, and error-prone. In rapidly developing regions, newly built roads can lag OpenStreetMap (OSM) by months or years. Disaster response teams need updated road maps within hours, not weeks. Autonomous vehicle HD map providers need centimeter-level accuracy across millions of kilometers.
Automated road extraction from satellite imagery addresses all of these needs — and CNNs have emerged as the dominant approach, consistently outperforming classical computer vision methods on benchmark datasets.
Problem Framing: Semantic Segmentation
Road extraction is fundamentally a binary semantic segmentation task. Given a satellite image tile as input, the model must output a pixel-wise mask where each pixel is classified as either:
- 1 — Road surface
- 0 — Background (buildings, vegetation, water, bare soil, etc.)
This framing makes the problem well-suited to encoder-decoder architectures, where the encoder learns hierarchical spatial features and the decoder reconstructs the spatial resolution of the original image.
Dataset Preparation
Choosing a Dataset
Several high-quality public benchmarks exist for road extraction:
| Dataset | Resolution | Coverage | Labels |
|---|---|---|---|
| SpaceNet Roads | 30–50 cm/px | Global cities | Polylines + masks |
| DeepGlobe Road | 50 cm/px | South/Southeast Asia | Binary masks |
| Massachusetts Roads | 1 m/px | Massachusetts, USA | Binary masks |
| CVPR Road Detection | 15 cm/px | Urban Europe | Binary masks |
For this walkthrough, we’ll use the DeepGlobe Road Extraction Dataset, which contains 6,226 satellite image tiles at 1024×1024 pixels with corresponding binary masks.
Tiling and Patching
Raw satellite tiles are often too large to train on directly due to GPU memory constraints. A standard approach is to slice each 1024×1024 image into smaller patches:
import os
import numpy as np
from PIL import Image

def tile_image(img_path, mask_path, patch_size=512, stride=256, output_dir="patches/"):
    os.makedirs(output_dir, exist_ok=True)
    img = np.array(Image.open(img_path))
    mask = np.array(Image.open(mask_path).convert("L"))
    h, w = img.shape[:2]
    patch_id = 0
    # Slide a patch_size window across the tile; stride < patch_size gives overlap.
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            img_patch = img[y:y+patch_size, x:x+patch_size]
            mask_patch = mask[y:y+patch_size, x:x+patch_size]
            Image.fromarray(img_patch).save(f"{output_dir}/img_{patch_id}.png")
            Image.fromarray(mask_patch).save(f"{output_dir}/mask_{patch_id}.png")
            patch_id += 1
Using a stride smaller than the patch size (overlapping patches) helps prevent the model from learning hard boundary artifacts and significantly increases the effective training set size.
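The loop bounds in tile_image determine the patch count per tile; for the 1024×1024 DeepGlobe tiles with the defaults shown, that works out as:

```python
def patch_grid(size: int, patch_size: int = 512, stride: int = 256) -> int:
    # Patch positions along one axis, matching range(0, size - patch_size + 1, stride).
    return (size - patch_size) // stride + 1

per_axis = patch_grid(1024)    # 3 positions along each axis
print(per_axis * per_axis)     # 9 overlapping patches per 1024×1024 tile
```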
Class Imbalance
Road pixels typically occupy only 5–15% of any given satellite tile. This severe class imbalance is one of the primary challenges in road extraction. Leaving it unaddressed causes the model to converge toward predicting everything as background, achieving high pixel accuracy while being completely useless.
Strategies to address imbalance:
- Weighted loss functions (e.g., weighted binary cross-entropy, Focal Loss)
- Dice Loss, which directly optimizes the overlap metric
- Oversampling patches that contain a minimum road pixel density
- Hard Negative Mining during training
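The oversampling strategy can be sketched as a density filter applied while building the patch list; min_density here is an assumed threshold to tune per dataset:

```python
import numpy as np

def road_density(mask: np.ndarray) -> float:
    # Fraction of road pixels in a binary {0, 255} mask patch.
    return float((mask > 127).mean())

def keep_patch(mask: np.ndarray, min_density: float = 0.02) -> bool:
    # Drop (or sample less often) patches that are almost entirely background.
    return road_density(mask) >= min_density
```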
A combined Dice + BCE loss is the most commonly used approach in practice:
import torch
import torch.nn.functional as F

def dice_loss(pred, target, smooth=1.0):
    pred = torch.sigmoid(pred)
    pred_flat = pred.view(-1)
    target_flat = target.view(-1)
    intersection = (pred_flat * target_flat).sum()
    return 1 - (2. * intersection + smooth) / (pred_flat.sum() + target_flat.sum() + smooth)

def combined_loss(pred, target, bce_weight=0.5):
    bce = F.binary_cross_entropy_with_logits(pred, target)
    dice = dice_loss(pred, target)
    return bce_weight * bce + (1 - bce_weight) * dice
Data Augmentation
Geometric and photometric augmentations dramatically improve generalization across different geographies, seasons, and sensor conditions:
import albumentations as A

train_transform = A.Compose([
    A.RandomRotate90(p=0.5),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.4),
    A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=20, val_shift_limit=10, p=0.3),
    A.GaussNoise(var_limit=(10, 50), p=0.2),
    A.ElasticTransform(alpha=1, sigma=50, alpha_affine=50, p=0.2),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
Model Architecture: U-Net with a Pretrained Encoder
The U-Net architecture, originally developed for biomedical image segmentation, has become the de facto standard for geospatial segmentation tasks. Its key innovation is the skip connections between the encoder and decoder, which allow the decoder to access fine-grained spatial information lost during downsampling.
Architecture Overview
Input Image (512×512×3)
│
┌────▼────┐
│ Encoder │ (Feature extraction: ResNet / EfficientNet / VGG)
│ │ Progressively downsamples (stride-2 convolutions or max pool)
└────┬────┘
│ Bottleneck (32×32 feature map, 512 channels)
┌────▼────┐
│ Decoder │ Progressively upsamples (transposed conv / bilinear upsample)
│ │◄──── Skip connections from encoder at each resolution
└────┬────┘
│
┌────▼────┐
│ Head │ 1×1 conv → single channel sigmoid output
└─────────┘
Output Mask (512×512×1)
Using a Pretrained Encoder
Training from scratch is rarely optimal. A ResNet-34 encoder pretrained on ImageNet provides powerful low-level feature detectors (edges, textures, gradients) that transfer effectively to satellite imagery:
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,
    activation=None,  # raw logits; sigmoid applied in loss
)
The segmentation_models_pytorch library provides a clean interface for swapping encoders. Common alternatives:
- efficientnet-b4 — better accuracy, more parameters
- mit_b2 (Mix Transformer) — excellent for thin structures like roads
- resnext50_32x4d — strong general-purpose choice
Attention and Dense Skip Connections
For road extraction specifically, attention mechanisms in the decoder help the model focus on spatially relevant features, which is particularly useful for thin, elongated road structures that plain skip connections can easily suppress. In segmentation_models_pytorch, SCSE attention can be enabled on smp.Unet via decoder_attention_type="scse"; a related and often stronger option is UNet++, which replaces the plain skips with dense, nested ones:
model = smp.UnetPlusPlus(
    encoder_name="efficientnet-b4",
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,
    activation=None
)
UNet++ with dense skip connections often outperforms standard U-Net on road datasets by 1–3 IoU points.
Training Pipeline
DataLoader Setup
import os
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image

class RoadDataset(Dataset):
    def __init__(self, img_dir, mask_dir, transform=None):
        self.img_paths = sorted([os.path.join(img_dir, f) for f in os.listdir(img_dir)])
        self.mask_paths = sorted([os.path.join(mask_dir, f) for f in os.listdir(mask_dir)])
        self.transform = transform

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, idx):
        img = np.array(Image.open(self.img_paths[idx]).convert("RGB"))
        mask = np.array(Image.open(self.mask_paths[idx]).convert("L")) / 255.0
        if self.transform:
            augmented = self.transform(image=img, mask=mask)
            img = augmented["image"]
            mask = augmented["mask"]
        img = torch.from_numpy(img).permute(2, 0, 1).float()
        mask = torch.from_numpy(mask).unsqueeze(0).float()
        return img, mask

train_loader = DataLoader(
    RoadDataset("data/train/images", "data/train/masks", transform=train_transform),
    batch_size=8, shuffle=True, num_workers=4, pin_memory=True
)
Optimizer and Scheduler
import torch.optim as optim

optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=50, eta_min=1e-6
)
AdamW with cosine annealing is a robust default. An alternative approach uses a 1-cycle policy for faster convergence when training time is limited.
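As a sketch, the 1-cycle alternative can use PyTorch's built-in OneCycleLR; the epoch and batch counts here are placeholders, and the toy model stands in for the smp.Unet defined earlier:

```python
import torch
import torch.optim as optim

# Placeholder model; in the real pipeline this is the smp.Unet defined earlier.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)

optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
# Hypothetical budget: 50 epochs over a loader of 500 batches.
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3e-4, epochs=50, steps_per_epoch=500,
    pct_start=0.3,  # first 30% of steps warm up toward max_lr, the rest anneal
)
# Unlike CosineAnnealingLR (stepped once per epoch), OneCycleLR is stepped
# once per batch, immediately after optimizer.step().
```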
Training Loop
def train_epoch(model, loader, optimizer, device):
    model.train()
    total_loss = 0
    for imgs, masks in loader:
        imgs, masks = imgs.to(device), masks.to(device)
        optimizer.zero_grad()
        preds = model(imgs)
        loss = combined_loss(preds, masks)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)
Gradient clipping (max_norm=1.0) prevents occasional exploding gradients, especially during early training.
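A matching validation pass, with gradients disabled and IoU computed on thresholded predictions, might look like this (the 0.5 threshold is a starting point, not a tuned value):

```python
import torch

@torch.no_grad()
def validate_epoch(model, loader, device, threshold=0.5):
    model.eval()
    total_iou, n_batches = 0.0, 0
    for imgs, masks in loader:
        imgs, masks = imgs.to(device), masks.to(device)
        preds = (torch.sigmoid(model(imgs)) > threshold).float()
        intersection = (preds * masks).sum()
        union = preds.sum() + masks.sum() - intersection
        total_iou += (intersection / (union + 1e-8)).item()
        n_batches += 1
    return total_iou / max(n_batches, 1)
```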
Evaluation Metrics
Pixel accuracy is a misleading metric due to class imbalance. The field uses three standard metrics:
Intersection over Union (IoU / Jaccard Index)
The ratio of the intersection of predicted and ground truth road pixels to their union. Values range from 0 to 1; scores above 0.65 are considered strong for road extraction.
F1 Score (Dice Coefficient)
The harmonic mean of precision and recall, computed at the pixel level. More sensitive to false negatives than IoU, which is important since missed road segments are costly in navigation applications.
Relaxed Precision / Recall
Because road masks are often imprecisely labeled (e.g., a 1–2 pixel shift in ground truth), a “relaxed” metric counts a predicted pixel as correct if it falls within a small buffer of any ground truth road pixel (commonly ρ = 3 pixels, which at sub-meter resolutions corresponds to a few meters on the ground). This is particularly important when benchmarking against OSM-derived labels.
def compute_iou(pred_mask, true_mask, threshold=0.5):
    pred_binary = (pred_mask > threshold).float()
    intersection = (pred_binary * true_mask).sum()
    union = pred_binary.sum() + true_mask.sum() - intersection
    return (intersection / (union + 1e-8)).item()
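The relaxed variant can be sketched with a distance transform, assuming binary numpy masks and a pixel-unit buffer radius rho (convert meters to pixels via the imagery's ground sample distance):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def relaxed_precision(pred: np.ndarray, truth: np.ndarray, rho: int = 3) -> float:
    """pred, truth: binary {0, 1} masks. A predicted road pixel counts as
    correct if it lies within rho pixels of any true road pixel."""
    if pred.sum() == 0:
        return 0.0
    # Distance from every pixel to the nearest ground-truth road pixel.
    dist_to_truth = distance_transform_edt(1 - truth)
    hits = (pred == 1) & (dist_to_truth <= rho)
    return float(hits.sum() / pred.sum())
```

Relaxed recall is the same computation with pred and truth swapped.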
Post-Processing
Raw model output is a probability map. Converting it to a clean road network requires post-processing:
Thresholding and Morphological Cleanup
import cv2
import numpy as np

def postprocess_mask(prob_map, threshold=0.45):
    binary = (prob_map > threshold).astype(np.uint8) * 255
    # Remove small isolated blobs
    kernel = np.ones((3, 3), np.uint8)
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel, iterations=2)
    # Fill small gaps in road segments
    cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel, iterations=3)
    return cleaned
Skeletonization
For downstream vectorization (converting the raster mask to a road polyline graph), skeletonization reduces the thick road mask to single-pixel centerlines:
import numpy as np
from skimage.morphology import skeletonize

skeleton = skeletonize(binary_mask // 255).astype(np.uint8) * 255
Vectorization
The skeleton can then be converted to a vector road network using graph-based traversal algorithms or tools like osmnx, rdp (Ramer-Douglas-Peucker simplification), or GRASS GIS’s r.thin + r.to.vect.
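As a minimal sketch of the first graph-building step, endpoints and junctions can be found by counting each skeleton pixel's 8-connected neighbors: one neighbor marks an endpoint, three or more a junction:

```python
import numpy as np
from scipy.ndimage import convolve

def classify_skeleton_pixels(skeleton: np.ndarray):
    """skeleton: binary {0, 1} single-pixel-wide centerline mask.
    Returns boolean (endpoints, junctions) masks."""
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],   # center excluded: count neighbors only
                       [1, 1, 1]])
    neighbors = convolve(skeleton, kernel, mode="constant")
    endpoints = (skeleton == 1) & (neighbors == 1)
    junctions = (skeleton == 1) & (neighbors >= 3)
    return endpoints, junctions
```

Tracing the skeleton between these node pixels yields the edges of the road graph, which can then be simplified with rdp.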
Results and Benchmarks
On the DeepGlobe Road Extraction benchmark, a well-trained U-Net with an EfficientNet-B4 encoder typically achieves:
| Metric | Score |
|---|---|
| IoU | 0.67 – 0.72 |
| F1 Score | 0.80 – 0.85 |
| Precision | 0.82 – 0.87 |
| Recall | 0.78 – 0.84 |
State-of-the-art models incorporating Transformer-based encoders (e.g., SegFormer, Swin-T) push IoU above 0.75, particularly in dense urban areas where road intersections create complex topology.
Common Failure Modes and How to Address Them
Shadowed Roads
Tall buildings cast shadows that obscure road surfaces, causing the model to miss road segments. Mitigation: include near-infrared (NIR) bands if available; NIR is shadow-invariant. Alternatively, train with synthetic shadow augmentation.
Unpaved Roads
Gravel tracks and dirt roads have spectral signatures similar to bare soil, making them difficult to separate. Mitigation: use multi-temporal imagery — repeated observations help distinguish transient bare soil from persistent road surfaces.
Dense Tree Canopy
Roads under dense canopy are invisible in optical imagery. This is a fundamental sensor limitation. SAR (Synthetic Aperture Radar) data from Sentinel-1 can penetrate canopy and can be fused with optical bands as additional input channels.
Road–Parking Lot Confusion
Large paved parking lots share the same spectral signature as roads. Mitigation: include contextual features (building proximity, vegetation density) or use instance-aware architectures that model road topology explicitly.
Extending the Pipeline: Multi-Spectral Inputs
Most satellite platforms provide more than three spectral bands. Sentinel-2 provides 13 bands; WorldView-3 provides 8. Incorporating additional bands — particularly NIR, SWIR, and Red-Edge — improves discrimination between roads and spectrally similar surfaces.
Modifying the model to accept more input channels requires only a minor architectural change:
model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights=None,  # cannot use ImageNet weights directly with non-RGB input
    in_channels=8,         # 8-band multispectral input
    classes=1,
    activation=None
)
When encoder_weights="imagenet" is desired with non-RGB inputs, a common trick is to initialize the first convolutional layer by averaging the pretrained RGB weights and tiling them across the additional channels.
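That trick can be sketched as a small helper operating on the first conv layer's weight tensor; the 3 / in_channels rescaling is one common choice for roughly preserving activation magnitude:

```python
import torch

def inflate_conv_weight(w_rgb: torch.Tensor, in_channels: int) -> torch.Tensor:
    """w_rgb: (out_ch, 3, kH, kW) pretrained first-layer weights.
    Returns (out_ch, in_channels, kH, kW), rescaled so the layer's
    expected activation magnitude is roughly preserved."""
    w_mean = w_rgb.mean(dim=1, keepdim=True)     # average the RGB filters
    w_new = w_mean.repeat(1, in_channels, 1, 1)  # tile across all input bands
    return w_new * (3.0 / in_channels)           # magnitude correction
```

For ResNet-style encoders in segmentation_models_pytorch, the resulting tensor would typically be copied into model.encoder.conv1.weight before loading the remaining pretrained weights.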
Practical Deployment Considerations
Tiled Inference
At inference time, large images must be processed as overlapping tiles. Predictions from overlapping regions are averaged (or max-pooled) to produce a seamless output mask without edge artifacts.
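A minimal sketch of overlap-averaged tiled inference; the count array records how many tiles covered each pixel, and the function assumes the image extent is covered by the stride grid:

```python
import numpy as np
import torch

@torch.no_grad()
def predict_large_image(model, image: np.ndarray, patch=512, stride=256, device="cpu"):
    """image: (H, W, 3) normalized float array. Returns an (H, W) probability map,
    averaging predictions wherever overlapping tiles cover the same pixel."""
    h, w = image.shape[:2]
    probs = np.zeros((h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)
    model.eval()
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            tile = image[y:y + patch, x:x + patch]
            t = torch.from_numpy(tile).permute(2, 0, 1).unsqueeze(0).float().to(device)
            p = torch.sigmoid(model(t))[0, 0].cpu().numpy()
            probs[y:y + patch, x:x + patch] += p
            counts[y:y + patch, x:x + patch] += 1
    return probs / np.maximum(counts, 1)
```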
Geographic Projection
Model outputs must be georeferenced. Store the affine transform and CRS of each input tile and apply them to the output mask using rasterio:
import rasterio

with rasterio.open("input.tif") as src:
    transform = src.transform
    crs = src.crs

with rasterio.open("output_mask.tif", "w",
                   driver="GTiff", height=mask.shape[0], width=mask.shape[1],
                   count=1, dtype="uint8", crs=crs, transform=transform) as dst:
    dst.write(mask, 1)
Change Detection
By running the model on multi-temporal imagery and computing the difference between output masks, you get a simple road change detection pipeline — useful for identifying newly constructed or destroyed road segments.
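The mask-diff idea reduces to boolean set operations on two co-registered binary masks:

```python
import numpy as np

def road_changes(mask_t0: np.ndarray, mask_t1: np.ndarray):
    """Binary {0, 1} masks from two acquisition dates (same extent and CRS).
    Returns boolean (new_roads, removed_roads) masks."""
    new_roads = (mask_t1 == 1) & (mask_t0 == 0)      # built between t0 and t1
    removed_roads = (mask_t0 == 1) & (mask_t1 == 0)  # destroyed or occluded
    return new_roads, removed_roads
```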
Conclusion
Building a road extraction model is a rewarding end-to-end deep learning exercise that touches on geospatial data handling, semantic segmentation, loss function design, and practical deployment. The key takeaways:
- Frame the problem as binary semantic segmentation and address class imbalance early.
- Use a U-Net with a pretrained encoder — it reliably outperforms training from scratch.
- Combine Dice Loss and BCE to handle the pixel imbalance and optimize directly for overlap.
- Heavy augmentation is non-negotiable for cross-geography generalization.
- Post-process aggressively — raw model outputs need morphological cleanup and skeletonization before they are useful as road networks.
- Consider multi-spectral inputs when optical RGB alone is insufficient.
Road extraction is also a natural entry point into broader GeoAI workflows. Once the mask is produced, downstream tasks — change detection, route planning, population access analysis, infrastructure damage assessment — all become tractable. The road network is, in more ways than one, the foundation everything else is built on.
Want to go deeper? The SpaceNet 3 challenge provides one of the most rigorously benchmarked road extraction datasets, with solutions from top competitors publicly available on GitHub.
