Semantic Segmentation for Urban Mapping: A GIS + AI Tutorial
Author: Anshul Mohan | Level: Intermediate–Advanced | Domain: GeoAI / Remote Sensing
Table of Contents
- Introduction
- What Is Semantic Segmentation?
- Why Urban Mapping Needs It
- The GIS + AI Stack
- Data Preparation
- Model Architecture: U-Net for Urban Scenes
- Training the Model
- Post-Processing and GIS Integration
- Evaluation Metrics
- Real-World Considerations
- Conclusion
Introduction
Cities are living systems — roads expand, buildings rise, green spaces shrink, and informal settlements emerge overnight. Keeping up with these changes using traditional GIS methods alone is slow and labour-intensive. Enter semantic segmentation, a deep learning technique that assigns a land-cover class to every single pixel in a satellite or aerial image.
When you pair semantic segmentation with a GIS workflow, you get something powerful: spatially accurate, machine-generated maps that can be updated continuously as new imagery arrives. This tutorial walks you through that entire pipeline — from raw satellite imagery to a classified shapefile you can load into QGIS or ArcGIS.
By the end, you will understand:
- How semantic segmentation differs from object detection and instance segmentation
- How to prepare satellite imagery tiles for training
- How to build and train a U-Net model on urban imagery
- How to convert model outputs back into GIS-ready vector layers
- How to evaluate segmentation quality using both ML and GIS metrics
What Is Semantic Segmentation?
Semantic segmentation is a pixel-wise classification task. Given an input image, the model outputs a label map of the same spatial dimensions, where each pixel is assigned to one of a predefined set of classes.
Input: RGB satellite image → [H × W × 3]
Output: Class label map → [H × W × 1] (integer class IDs)
The table below contrasts it with three related tasks:
| Task | Output | Example |
|---|---|---|
| Image Classification | Single label for whole image | “This tile contains a highway” |
| Object Detection | Bounding boxes around objects | Box drawn around each building |
| Semantic Segmentation | Per-pixel class label | Every road pixel labelled as “road” |
| Instance Segmentation | Per-pixel label + instance ID | Each individual building polygon |
For urban mapping, semantic segmentation is often the best trade-off: it gives you full spatial coverage without requiring the overhead of instance-level tracking.
Why Urban Mapping Needs It
Urban environments are among the most heterogeneous surfaces on Earth. A single 10 m resolution Sentinel-2 pixel may blend rooftop, asphalt, shadow, and vegetation — making traditional spectral classifiers unreliable.
Semantic segmentation addresses this in several ways:
Contextual understanding. Deep convolutional networks learn that a narrow dark strip adjacent to buildings is likely a road, not a river — something a pixel-by-pixel spectral classifier cannot infer.
Multi-class precision. Urban maps typically need at minimum six classes: buildings, roads, vegetation, bare soil, water, and impervious surface. Segmentation models handle all of these simultaneously.
Scalability. Once trained, a model can process thousands of image tiles in hours, enabling city-wide or national-scale mapping.
Change detection readiness. Running the same model over multi-temporal imagery produces consistent label maps that can be differenced to detect urban growth, flood extent, or post-disaster damage.
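The change-detection idea can be sketched in a few lines: given two co-registered label maps from different dates, differencing them highlights pixels that changed class. The class IDs below follow the six-class scheme used later in this tutorial; the toy arrays are illustrative only.

```python
import numpy as np

BUILDING = 1  # class ID from the label scheme used in this tutorial

# Two toy 4x4 label maps from different acquisition dates
labels_2021 = np.array([[0, 0, 1, 1],
                        [0, 0, 1, 1],
                        [3, 3, 0, 0],
                        [3, 3, 0, 0]], dtype=np.uint8)
labels_2023 = np.array([[0, 1, 1, 1],
                        [0, 1, 1, 1],
                        [3, 3, 1, 1],
                        [3, 3, 0, 0]], dtype=np.uint8)

# Pixels that became building between the two dates
new_buildings = (labels_2023 == BUILDING) & (labels_2021 != BUILDING)
print(int(new_buildings.sum()))  # 4 newly built-up pixels
```

Differenced this way, consistent multi-temporal label maps become a growth map; in practice you would also apply a minimum-area filter to suppress isolated misclassified pixels.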
The GIS + AI Stack
Here is the full software stack used in this tutorial:
Imagery Source → Sentinel-2 (ESA Copernicus) or aerial orthophotos
Preprocessing → GDAL, Rasterio, Shapely
Deep Learning → PyTorch + segmentation_models.pytorch (SMP)
Training Infra → NVIDIA CUDA GPU (or Google Colab)
Post-Processing → OpenCV, Rasterio, Fiona
GIS Output → GeoTIFF (raster) → Shapefile/GeoJSON (vector)
Visualisation → QGIS / ArcGIS Pro / Folium
Install the core Python dependencies:
pip install torch torchvision segmentation-models-pytorch \
rasterio fiona shapely geopandas albumentations \
opencv-python matplotlib tqdm
Data Preparation
Data preparation is where most urban segmentation projects succeed or fail, so spend time here.
Step 1: Acquire Imagery
For this tutorial, we use Sentinel-2 Level-2A (surface reflectance) imagery at 10 m resolution. Download from:
- Copernicus Open Access Hub
- Google Earth Engine (for large-scale downloads)
Select a cloud-free scene over your city of interest. We will use bands B02 (Blue), B03 (Green), B04 (Red), and B08 (NIR) — giving us a 4-channel input that includes near-infrared, which is highly informative for vegetation and impervious surface discrimination.
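The value of the NIR band is easiest to see through NDVI (Normalised Difference Vegetation Index): vegetation reflects strongly in NIR and absorbs red, so the index separates it cleanly from asphalt and rooftops. A minimal NumPy sketch on toy reflectance values:

```python
import numpy as np

# Toy surface reflectance values (Sentinel-2 L2A scale: reflectance * 10000)
red = np.array([800.0, 2500.0])   # [vegetation pixel, bare-soil pixel]
nir = np.array([4200.0, 2900.0])

# NDVI = (NIR - Red) / (NIR + Red); epsilon guards against division by zero
ndvi = (nir - red) / (nir + red + 1e-9)
print(ndvi.round(2))  # vegetation pixel ~0.68, bare-soil pixel ~0.07
```

The segmentation model learns this kind of band relationship implicitly, which is why feeding it the raw NIR channel is usually better than hand-computing indices.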
Step 2: Acquire Ground Truth Labels
You need pixel-accurate land cover masks. Sources include:
- OpenStreetMap (OSM) — rasterise road, building, and water polygons
- National land cover databases — NLCD (USA), Urban Atlas (EU), LULC (India)
- Manual annotation — tools like CVAT, Labelme, or QGIS with the Semi-Automatic Classification Plugin
For this tutorial, we rasterise OSM polygons into a six-class label raster aligned to our Sentinel-2 image:
import geopandas as gpd
import rasterio
from rasterio.features import rasterize
import numpy as np
CLASS_MAP = {
'building': 1,
'road': 2,
'vegetation': 3,
'water': 4,
'bare_soil': 5,
'other': 0,
}
def rasterize_labels(gdf: gpd.GeoDataFrame,
reference_raster: str,
output_path: str) -> None:
"""Burn vector labels to a raster matching the reference image."""
with rasterio.open(reference_raster) as src:
meta = src.meta.copy()
meta.update(count=1, dtype='uint8')
transform = src.transform
shape = (src.height, src.width)
label_raster = np.zeros(shape, dtype=np.uint8)
for class_name, class_id in CLASS_MAP.items():
subset = gdf[gdf['class'] == class_name]
if subset.empty:
continue
burned = rasterize(
[(geom, class_id) for geom in subset.geometry],
out_shape=shape,
transform=transform,
fill=0,
dtype='uint8'
)
label_raster = np.where(burned > 0, burned, label_raster)
with rasterio.open(output_path, 'w', **meta) as dst:
dst.write(label_raster, 1)
print(f"Label raster saved to {output_path}")
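Note that the np.where update inside rasterize_labels means classes burned later in CLASS_MAP order overwrite earlier ones wherever geometries overlap (a road crossing a building polygon wins, for example). A toy NumPy illustration of that overwrite rule:

```python
import numpy as np

label_raster = np.zeros((3, 3), dtype=np.uint8)

# First burn: 'building' (ID 1) covers the left two columns
building_burn = np.array([[1, 1, 0]] * 3, dtype=np.uint8)
label_raster = np.where(building_burn > 0, building_burn, label_raster)

# Second burn: 'road' (ID 2) covers the middle column, overlapping buildings
road_burn = np.array([[0, 2, 0]] * 3, dtype=np.uint8)
label_raster = np.where(road_burn > 0, road_burn, label_raster)

print(label_raster[0])  # [1 2 0] -- road overwrote building in the overlap
```

If you need a different precedence, reorder CLASS_MAP so the highest-priority class is burned last.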
Step 3: Tile the Image
Neural networks require fixed-size inputs. We cut both the imagery and its label raster into 256 x 256 pixel tiles with a 50% overlap to reduce edge artefacts.
import os
from pathlib import Path
def tile_raster(raster_path: str,
output_dir: str,
tile_size: int = 256,
overlap: int = 128) -> list:
"""Cut a large raster into fixed-size tiles with overlap."""
Path(output_dir).mkdir(parents=True, exist_ok=True)
tile_paths = []
stride = tile_size - overlap
with rasterio.open(raster_path) as src:
data = src.read() # [C, H, W]
meta = src.meta.copy()
H, W = src.height, src.width
transform = src.transform
tile_id = 0
        # NOTE: any border strip narrower than one tile at the right/bottom
        # edge is skipped; pad the raster first if full coverage is required
        for row_start in range(0, H - tile_size + 1, stride):
for col_start in range(0, W - tile_size + 1, stride):
tile = data[:, row_start:row_start + tile_size,
col_start:col_start + tile_size]
tile_transform = rasterio.transform.Affine(
transform.a, 0,
transform.c + col_start * transform.a,
0, transform.e,
transform.f + row_start * transform.e
)
tile_meta = meta.copy()
tile_meta.update(
height=tile_size, width=tile_size,
count=tile.shape[0],
transform=tile_transform
)
out_path = os.path.join(output_dir, f"tile_{tile_id:05d}.tif")
with rasterio.open(out_path, 'w', **tile_meta) as dst:
dst.write(tile)
tile_paths.append(out_path)
tile_id += 1
print(f"Generated {tile_id} tiles in {output_dir}")
return tile_paths
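The tile lists produced above still need to be split before training; the loop in the Training section assumes variables named train_images, train_labels, val_images, and val_labels. A simple random-split sketch (the 80/20 ratio and the helper name split_tiles are conventions chosen here, not requirements):

```python
import random

def split_tiles(image_tiles, label_tiles, val_fraction=0.2, seed=42):
    """Randomly split paired image/label tile lists into train and val sets."""
    assert len(image_tiles) == len(label_tiles)
    indices = list(range(len(image_tiles)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(indices) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train_images = [image_tiles[i] for i in train_idx]
    train_labels = [label_tiles[i] for i in train_idx]
    val_images = [image_tiles[i] for i in val_idx]
    val_labels = [label_tiles[i] for i in val_idx]
    return train_images, train_labels, val_images, val_labels

tr_i, tr_l, va_i, va_l = split_tiles([f"img_{i}.tif" for i in range(10)],
                                     [f"lbl_{i}.tif" for i in range(10)])
print(len(tr_i), len(va_i))  # 8 2
```

For geospatial data, a spatially blocked split (holding out one quadrant of the city, say) gives a more honest validation score than a random split, because neighbouring overlapping tiles leak information between sets.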
Step 4: Build a PyTorch Dataset
import torch
from torch.utils.data import Dataset
import numpy as np
import rasterio
import albumentations as A
from albumentations.pytorch import ToTensorV2
AUGMENTATIONS = A.Compose([
A.HorizontalFlip(p=0.5),
A.VerticalFlip(p=0.5),
A.RandomRotate90(p=0.5),
A.RandomBrightnessContrast(p=0.3),
    A.GaussNoise(p=0.2),  # check var_limit: older albumentations defaults assume 0-255 inputs, too strong for [0, 1] floats
ToTensorV2()
])
class UrbanSegDataset(Dataset):
def __init__(self, image_tiles: list, label_tiles: list,
augment: bool = True):
self.images = image_tiles
self.labels = label_tiles
self.augment = augment
def __len__(self):
return len(self.images)
def __getitem__(self, idx):
with rasterio.open(self.images[idx]) as src:
img = src.read().astype(np.float32) # [C, H, W]
with rasterio.open(self.labels[idx]) as src:
mask = src.read(1).astype(np.int64) # [H, W]
# Normalise to [0, 1] using Sentinel-2 max reflectance value
img = img / 10000.0
img = np.clip(img, 0, 1)
img = img.transpose(1, 2, 0) # [H, W, C] for albumentations
if self.augment:
augmented = AUGMENTATIONS(image=img, mask=mask)
img = augmented['image'].float()
mask = torch.from_numpy(augmented['mask']).long()
else:
img = torch.from_numpy(img.transpose(2, 0, 1)).float()
mask = torch.from_numpy(mask).long()
return img, mask
Model Architecture: U-Net for Urban Scenes
U-Net remains the workhorse of geospatial segmentation. Its encoder-decoder structure with skip connections preserves fine spatial detail while learning high-level semantic features — exactly what urban scenes demand.
Encoder (downsampling path)
Input [B, 4, 256, 256]
Conv Block → 64 feature maps ─────────────────────────────── skip
MaxPool → 128 x 128
Conv Block → 128 feature maps ──────────────────────────── skip
MaxPool → 64 x 64
Conv Block → 256 feature maps ───────────────────────── skip
MaxPool → 32 x 32
Bottleneck → 512 feature maps
Decoder (upsampling path)
Upsample → 64 x 64 + skip concat from 256-map block
Upsample → 128 x 128 + skip concat from 128-map block
Upsample → 256 x 256 + skip concat from 64-map block
1x1 Conv → [B, num_classes, 256, 256]
We use the segmentation_models.pytorch (SMP) library, which provides pretrained encoder backbones:
import segmentation_models_pytorch as smp
NUM_CLASSES = 6 # background + 5 land cover classes
IN_CHANNELS = 4 # R, G, B, NIR
model = smp.Unet(
encoder_name='resnet34', # ImageNet-pretrained encoder
encoder_weights='imagenet',
in_channels=IN_CHANNELS,
classes=NUM_CLASSES,
activation=None # Raw logits; apply softmax during inference
)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}") # ~24 million for ResNet-34 backbone
Note on encoder weights: ResNet-34 is pretrained on ImageNet (3-channel RGB), so the first convolution must be adapted for our fourth NIR channel. SMP adapts the first conv layer automatically when in_channels != 3; if you want explicit control over how the extra channel is initialised (a common strategy is the mean of the pretrained RGB channel weights), patch the weights yourself after constructing the model.
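The mean-of-RGB initialisation strategy can be sketched in NumPy, independent of SMP's internals (the random weights stand in for a real pretrained first-conv tensor):

```python
import numpy as np

# Stand-in for pretrained first-conv weights: [out_channels, 3, k, k]
rgb_weights = np.random.default_rng(0).normal(size=(64, 3, 7, 7)).astype(np.float32)

# New NIR channel initialised as the mean of the RGB channel weights,
# keeping first-layer activation magnitudes in a familiar range
nir_weights = rgb_weights.mean(axis=1, keepdims=True)              # [64, 1, 7, 7]
rgbn_weights = np.concatenate([rgb_weights, nir_weights], axis=1)  # [64, 4, 7, 7]

print(rgbn_weights.shape)  # (64, 4, 7, 7)
```

In PyTorch you would copy the result into model.encoder's first convolution under torch.no_grad() before training.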
Training the Model
Loss Function
Urban datasets are class-imbalanced — roads and buildings often cover 60-80% of pixels, while water or bare soil may cover less than 2%. We combine Cross-Entropy with Dice Loss to handle this:
import torch
import torch.nn as nn
import torch.nn.functional as F
class CombinedLoss(nn.Module):
def __init__(self, num_classes: int, dice_weight: float = 0.5):
super().__init__()
self.ce = nn.CrossEntropyLoss()
self.dice_weight = dice_weight
self.num_classes = num_classes
def dice_loss(self, logits: torch.Tensor,
targets: torch.Tensor) -> torch.Tensor:
probs = F.softmax(logits, dim=1) # [B, C, H, W]
targets_one_hot = F.one_hot(targets, self.num_classes) # [B, H, W, C]
targets_one_hot = targets_one_hot.permute(0, 3, 1, 2).float()
intersection = (probs * targets_one_hot).sum(dim=(2, 3))
union = probs.sum(dim=(2, 3)) + targets_one_hot.sum(dim=(2, 3))
dice = (2 * intersection + 1e-7) / (union + 1e-7)
return 1 - dice.mean()
def forward(self, logits, targets):
ce = self.ce(logits, targets)
dice = self.dice_loss(logits, targets)
return (1 - self.dice_weight) * ce + self.dice_weight * dice
Training Loop
from torch.utils.data import DataLoader
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from tqdm import tqdm
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
EPOCHS = 50
BATCH_SIZE = 8
LR = 3e-4
train_dataset = UrbanSegDataset(train_images, train_labels, augment=True)
val_dataset = UrbanSegDataset(val_images, val_labels, augment=False)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
shuffle=True, num_workers=4, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE,
shuffle=False, num_workers=4)
model = model.to(DEVICE)
criterion = CombinedLoss(num_classes=NUM_CLASSES).to(DEVICE)
optimizer = AdamW(model.parameters(), lr=LR, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=1e-6)
best_val_loss = float('inf')
for epoch in range(EPOCHS):
# Training
model.train()
train_loss = 0.0
for imgs, masks in tqdm(train_loader, desc=f"Epoch {epoch+1} [Train]"):
imgs, masks = imgs.to(DEVICE), masks.to(DEVICE)
optimizer.zero_grad()
logits = model(imgs)
loss = criterion(logits, masks)
loss.backward()
optimizer.step()
train_loss += loss.item()
# Validation
model.eval()
val_loss = 0.0
with torch.no_grad():
for imgs, masks in tqdm(val_loader, desc=f"Epoch {epoch+1} [Val]"):
imgs, masks = imgs.to(DEVICE), masks.to(DEVICE)
logits = model(imgs)
val_loss += criterion(logits, masks).item()
scheduler.step()
avg_train = train_loss / len(train_loader)
avg_val = val_loss / len(val_loader)
print(f"Epoch {epoch+1} | Train Loss: {avg_train:.4f} | Val Loss: {avg_val:.4f}")
if avg_val < best_val_loss:
best_val_loss = avg_val
torch.save(model.state_dict(), 'best_urban_seg_model.pth')
print(" Best model saved.")
Recommended Hyperparameters
| Parameter | Value | Notes |
|---|---|---|
| Backbone | ResNet-34 | Good balance of speed and accuracy |
| Input size | 256 x 256 | Increase to 512 on high-VRAM GPUs |
| Batch size | 8-16 | Adjust to available GPU memory |
| Learning rate | 3e-4 | With AdamW + cosine decay |
| Epochs | 50-100 | Use early stopping with patience = 10 |
| Dice weight | 0.5 | Tune upward if minority classes underperform |
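The table mentions early stopping with patience = 10; since the training loop above runs a fixed number of epochs, here is a small helper you could drop in, tracked on validation loss (a sketch, not part of the original loop):

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; returns True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.9, 0.85, 0.81]  # no improvement after the second epoch
print([stopper.step(l) for l in losses])  # [False, False, False, False, True]
```

In the training loop, call stopper.step(avg_val) after validation and break out of the epoch loop when it returns True.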
Post-Processing and GIS Integration
Step 1: Run Inference on the Full Scene
After training, run the model over all tiles and reconstruct the full prediction raster. Overlapping tiles are averaged before the final argmax:
import numpy as np
import rasterio
import torch
import torch.nn.functional as F
from tqdm import tqdm
def predict_full_scene(model, image_tiles: list, output_path: str,
reference_raster: str, num_classes: int = 6):
"""Aggregate tile predictions into a full-scene label raster."""
model.eval()
    with rasterio.open(reference_raster) as ref:
        H, W = ref.height, ref.width
        ref_transform = ref.transform
        meta = ref.meta.copy()
        meta.update(count=1, dtype='uint8')
    score_acc = np.zeros((num_classes, H, W), dtype=np.float32)
    count_acc = np.zeros((H, W), dtype=np.float32)
    for tile_path in tqdm(image_tiles, desc="Inference"):
        with rasterio.open(tile_path) as src:
            tile_data = src.read().astype(np.float32) / 10000.0
            # Recover the tile's pixel offsets from its geotransform relative
            # to the reference raster (assumes a north-up, axis-aligned grid)
            row_off = int(round((src.transform.f - ref_transform.f) / ref_transform.e))
            col_off = int(round((src.transform.c - ref_transform.c) / ref_transform.a))
            th, tw = src.height, src.width
tensor = torch.from_numpy(tile_data).unsqueeze(0).to(DEVICE)
with torch.no_grad():
logits = model(tensor)
probs = F.softmax(logits, dim=1).squeeze(0).cpu().numpy()
score_acc[:, row_off:row_off+th, col_off:col_off+tw] += probs
count_acc[row_off:row_off+th, col_off:col_off+tw] += 1
count_acc = np.where(count_acc == 0, 1, count_acc)
label_map = (score_acc / count_acc).argmax(axis=0).astype(np.uint8)
with rasterio.open(output_path, 'w', **meta) as dst:
dst.write(label_map, 1)
print(f"Prediction raster saved to {output_path}")
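The overlap-averaging step deserves a small illustration: where two tiles cover the same pixel, their class probabilities are summed and divided by the visit count before the argmax. A 1-D toy version with two classes:

```python
import numpy as np

num_classes, width = 2, 6
score_acc = np.zeros((num_classes, width), dtype=np.float32)
count_acc = np.zeros(width, dtype=np.float32)

# Tile A covers columns 0-3, tile B covers columns 2-5 (50% overlap)
tile_a = np.array([[0.9, 0.9, 0.6, 0.6],   # P(class 0)
                   [0.1, 0.1, 0.4, 0.4]])  # P(class 1)
tile_b = np.array([[0.2, 0.2, 0.1, 0.1],
                   [0.8, 0.8, 0.9, 0.9]])

score_acc[:, 0:4] += tile_a
count_acc[0:4] += 1
score_acc[:, 2:6] += tile_b
count_acc[2:6] += 1

label_map = (score_acc / count_acc).argmax(axis=0)
print(label_map)  # [0 0 1 1 1 1] -- overlap pixels use the averaged probabilities
```

Averaging suppresses the edge artefacts each tile produces near its borders, which is the payoff for tiling with 50% overlap in the first place.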
Step 2: Vectorise the Prediction Raster
Convert the classified raster to polygons using rasterio.features.shapes:
import fiona
import rasterio
from rasterio.features import shapes
from shapely.geometry import shape, mapping
CLASS_NAMES = {
0: 'other',
1: 'building',
2: 'road',
3: 'vegetation',
4: 'water',
5: 'bare_soil'
}
def vectorise_prediction(raster_path: str, output_path: str) -> None:
"""Convert a classified raster to a polygon shapefile."""
with rasterio.open(raster_path) as src:
label_array = src.read(1)
transform = src.transform
crs = src.crs
schema = {
'geometry': 'Polygon',
'properties': {'class_id': 'int', 'class_name': 'str'}
}
    with fiona.open(output_path, 'w',
                    driver='ESRI Shapefile',
                    crs_wkt=crs.to_wkt(),  # pass the CRS as WKT rather than a bare EPSG integer
                    schema=schema) as dst:
for geom, value in shapes(label_array, transform=transform):
if value == 0: # skip background
continue
polygon = shape(geom)
            if polygon.area < 100:  # drop sub-pixel slivers (area in m²; assumes a projected metric CRS such as UTM)
continue
dst.write({
'geometry': mapping(polygon),
'properties': {
'class_id': int(value),
'class_name': CLASS_NAMES.get(int(value), 'unknown')
}
})
print(f"Vector layer saved to {output_path}")
Step 3: Load into QGIS
Your output shapefile can be opened directly in QGIS. Apply a Categorized Renderer on the class_name field with the following suggested colour scheme:
| Class | Suggested Colour |
|---|---|
| building | #c0392b (dark red) |
| road | #7f8c8d (grey) |
| vegetation | #27ae60 (green) |
| water | #2980b9 (blue) |
| bare_soil | #e67e22 (orange) |
| other | #bdc3c7 (light grey) |
Evaluation Metrics
Pixel-Level ML Metrics
For each class, compute Intersection over Union (IoU), also called the Jaccard Index:
IoU (class c) = TP_c / (TP_c + FP_c + FN_c)
Average across all classes to get mean IoU (mIoU) — the standard benchmark metric for segmentation.
import numpy as np
def compute_miou(preds: np.ndarray, targets: np.ndarray,
num_classes: int) -> dict:
"""Compute per-class IoU and mean IoU."""
iou_per_class = {}
preds = preds.flatten()
targets = targets.flatten()
    for c in range(num_classes):
        pred_c = preds == c
        target_c = targets == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both prediction and reference; skip it
        intersection = np.logical_and(pred_c, target_c).sum()
        iou_per_class[CLASS_NAMES[c]] = round(float(intersection / union), 4)
    iou_per_class['mIoU'] = round(
        float(np.mean(list(iou_per_class.values()))), 4
    )
return iou_per_class
Typical benchmark targets for urban imagery (per-class IoU; the last row is overall mIoU):
| Class | Acceptable IoU | Good IoU |
|---|---|---|
| Building | 0.65 | 0.80+ |
| Road | 0.60 | 0.75+ |
| Vegetation | 0.70 | 0.85+ |
| Water | 0.75 | 0.90+ |
| Bare soil | 0.50 | 0.70+ |
| Overall mIoU | 0.60 | 0.75+ |
GIS-Specific Metrics
Beyond mIoU, urban mapping projects benefit from spatial accuracy metrics:
Boundary F1 Score measures how accurately class boundaries are localised — critical for road network extraction where alignment with centrelines matters.
Area Accuracy per Class compares predicted polygon areas against reference areas. A model with high mIoU can still systematically over- or under-predict building footprint area, which distorts any downstream area-based analysis.
Road Connectivity Score uses NetworkX to check whether extracted road segments form a connected graph. A fragmented road network is far less useful for routing applications even if per-pixel accuracy is high.
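The connectivity idea can also be sketched without a graph library using union-find: treat each road segment's endpoints as nodes, union the endpoints of every segment, and count the remaining components. The segment coordinates below are hypothetical:

```python
def count_components(segments):
    """Count connected components among road segments given as endpoint pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in segments:
        union(a, b)
    return len({find(n) for n in parent})

# Three segments: two joined at a shared node, one isolated
segments = [((0, 0), (0, 10)), ((0, 10), (10, 10)), ((50, 50), (60, 50))]
print(count_components(segments))  # 2 -- the extracted network is fragmented
```

A fully routable network would yield one component; in practice you would snap nearby endpoints within a tolerance before unioning, since rasterised roads rarely meet exactly.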
Real-World Considerations
Handling Temporal Gaps
Satellite imagery is almost always collected at a different time from your ground-truth labels. If your OSM labels reflect the city as of 2023 but your imagery is from 2021, you will introduce label noise for recently built structures. Mitigate this by using imagery from within six months of your label collection date, and by fine-tuning periodically as new imagery and OSM edits become available.
Domain Shift Between Cities
A model trained on Mumbai will not transfer cleanly to Helsinki. Building density, road geometry, and vegetation coverage differ dramatically. Apply transfer learning by fine-tuning the pretrained model on a small labelled sample (roughly 5-10% of tiles) from the new city before running full inference.
Resolution Sensitivity
Sentinel-2 at 10 m/pixel is good for district-scale mapping but cannot resolve individual buildings reliably. For building footprint extraction, use higher-resolution sources:
| Source | Resolution | Cost |
|---|---|---|
| Sentinel-2 | 10 m | Free |
| Planet Labs PlanetScope | 3-5 m | Commercial |
| Maxar WorldView-3 | 0.3 m | Commercial |
| NAIP / OpenAerialMap | 0.6-1 m | Free (USA / selective) |
Handling Shadows and Clouds
Urban scenes have pervasive building shadows that create ambiguity for road and vegetation classes. Use the Scene Classification Layer (SCL) provided in Sentinel-2 L2A products to mask clouds and cloud shadows before training. For shadowed areas, NIR band reflectance provides a partial signal even when visible bands are dark.
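A sketch of SCL-based masking, assuming a co-registered SCL band (the class values follow ESA's published SCL scheme, where 3 = cloud shadow, 8/9 = cloud medium/high probability, and 10 = thin cirrus):

```python
import numpy as np

INVALID_SCL = {3, 8, 9, 10}  # cloud shadow, clouds, thin cirrus

def valid_pixel_mask(scl: np.ndarray) -> np.ndarray:
    """Boolean mask of usable pixels given a Sentinel-2 SCL band."""
    return ~np.isin(scl, list(INVALID_SCL))

scl = np.array([[4, 4, 9],
                [3, 6, 5]], dtype=np.uint8)  # vegetation, water, bare soil, etc.
print(valid_pixel_mask(scl))
# [[ True  True False]
#  [False  True  True]]
```

During training, you can drop tiles whose valid fraction falls below a threshold, or map invalid pixels to an ignore index in the loss.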
Class Imbalance in Informal Settlements
Informal urban areas present unique challenges: building materials (corrugated iron, thatch) have irregular spectral signatures, road boundaries are unclear, and OSM coverage is sparse. Consider using active learning — iteratively labelling the tiles where model uncertainty is highest — to build targeted training data for these areas without exhaustive manual annotation.
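The active-learning loop needs an uncertainty score per tile. One simple choice is mean per-pixel entropy of the softmax probabilities, sketched here in NumPy (the tile names are hypothetical):

```python
import numpy as np

def tile_uncertainty(probs: np.ndarray, eps: float = 1e-9) -> float:
    """Mean per-pixel entropy of class probabilities shaped [C, H, W]."""
    entropy = -(probs * np.log(probs + eps)).sum(axis=0)  # [H, W]
    return float(entropy.mean())

# Two toy tiles with 2 classes on a 1x2 patch
confident = np.array([[[0.99, 0.98]], [[0.01, 0.02]]])
uncertain = np.array([[[0.55, 0.48]], [[0.45, 0.52]]])

scores = {'tile_a': tile_uncertainty(confident),
          'tile_b': tile_uncertainty(uncertain)}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # tile_b -- send the most uncertain tile for labelling first
```

Labelling the top-ranked tiles each round concentrates annotation effort exactly where the model is weakest, which is what makes active learning economical for informal settlements.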
Conclusion
Semantic segmentation has fundamentally changed what is possible in urban GIS. What once required months of manual digitising can now be accomplished in hours — with the model improving continuously as more labelled data is added.
The pipeline built here is production-ready in structure, though real deployments require additional engineering: model versioning, automated retraining triggers, QA sampling workflows, and integration with spatial databases like PostGIS. Each of those is its own tutorial.
The key takeaways:
- Tile your imagery with overlap and normalise carefully — bad preprocessing undermines good models
- U-Net with a pretrained encoder backbone is a strong baseline for most urban segmentation tasks
- Combined Cross-Entropy + Dice loss handles class imbalance better than either alone
- Always convert predictions back to georeferenced GIS formats — pixel accuracy is necessary but not sufficient
- mIoU is the primary benchmark, but spatial metrics like boundary F1 and connectivity score reflect real-world usability
- Domain shift is the biggest deployment risk — build in fine-tuning workflows from the start
The intersection of GIS and deep learning is one of the fastest-moving areas in geospatial technology. Semantic segmentation is your entry point into that space.
For questions or dataset recommendations, connect on LinkedIn or raise an issue in the project repository.
Tags: #GeoAI #RemoteSensing #SemanticSegmentation #GIS #DeepLearning #UrbanMapping #Sentinel2 #PyTorch #UNet
