Semantic Segmentation for Urban Mapping: A GIS + AI Tutorial

Author: Anshul Mohan | Level: Intermediate–Advanced | Domain: GeoAI / Remote Sensing


Table of Contents

  1. Introduction
  2. What Is Semantic Segmentation?
  3. Why Urban Mapping Needs It
  4. The GIS + AI Stack
  5. Data Preparation
  6. Model Architecture: U-Net for Urban Scenes
  7. Training the Model
  8. Post-Processing and GIS Integration
  9. Evaluation Metrics
  10. Real-World Considerations
  11. Conclusion

Introduction

Cities are living systems — roads expand, buildings rise, green spaces shrink, and informal settlements emerge overnight. Keeping up with these changes using traditional GIS methods alone is slow and labour-intensive. Enter semantic segmentation, a deep learning technique that assigns a land-cover class to every single pixel in a satellite or aerial image.

When you pair semantic segmentation with a GIS workflow, you get something powerful: spatially accurate, machine-generated maps that can be updated continuously as new imagery arrives. This tutorial walks you through that entire pipeline — from raw satellite imagery to a classified shapefile you can load into QGIS or ArcGIS.

By the end, you will understand:

  • How semantic segmentation differs from object detection and instance segmentation
  • How to prepare satellite imagery tiles for training
  • How to build and train a U-Net model on urban imagery
  • How to convert model outputs back into GIS-ready vector layers
  • How to evaluate segmentation quality using both ML and GIS metrics

What Is Semantic Segmentation?

Semantic segmentation is a pixel-wise classification task. Given an input image, the model outputs a label map of the same spatial dimensions, where each pixel is assigned to one of a predefined set of classes.

Input:  RGB satellite image  →  [H × W × 3]
Output: Class label map      →  [H × W × 1]  (integer class IDs)
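A minimal PyTorch sketch of those shapes, with random logits standing in for a real model's output:

```python
import torch

# Random logits stand in for a model's raw per-pixel class scores
logits = torch.randn(1, 6, 256, 256)      # [B, num_classes, H, W]
label_map = logits.argmax(dim=1)          # [B, H, W] integer class IDs

print(label_map.shape)                    # torch.Size([1, 256, 256])
```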

This is distinct from three related tasks:

Task                    Output                          Example
Image Classification    Single label for whole image    “This tile contains a highway”
Object Detection        Bounding boxes around objects   Box drawn around each building
Semantic Segmentation   Per-pixel class label           Every road pixel labelled as “road”
Instance Segmentation   Per-pixel label + instance ID   Each individual building polygon

For urban mapping, semantic segmentation is often the best trade-off: it gives you full spatial coverage without requiring the overhead of instance-level tracking.


Why Urban Mapping Needs It

Urban environments are among the most heterogeneous surfaces on Earth. A single 10 m resolution Sentinel-2 pixel may blend rooftop, asphalt, shadow, and vegetation — making traditional spectral classifiers unreliable.

Semantic segmentation addresses this in several ways:

Contextual understanding. Deep convolutional networks learn that a narrow dark strip adjacent to buildings is likely a road, not a river — something a pixel-by-pixel spectral classifier cannot infer.

Multi-class precision. Urban maps typically need at minimum six classes: buildings, roads, vegetation, bare soil, water, and a catch-all background class (the scheme used later in this tutorial). Segmentation models handle all of these simultaneously.

Scalability. Once trained, a model can process thousands of image tiles in hours, enabling city-wide or national-scale mapping.

Change detection readiness. Running the same model over multi-temporal imagery produces consistent label maps that can be differenced to detect urban growth, flood extent, or post-disaster damage.
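A sketch of that differencing step (arrays and class IDs are illustrative; 1 = building, matching the class map used later):

```python
import numpy as np

# Illustrative label maps from two acquisition dates (1 = building)
labels_2021 = np.array([[0, 1], [0, 0]])
labels_2023 = np.array([[0, 1], [1, 0]])

# Pixels that became 'building' between the two dates
new_buildings = (labels_2023 == 1) & (labels_2021 != 1)
print(new_buildings.sum())   # 1 newly built pixel
```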


The GIS + AI Stack

Here is the full software stack used in this tutorial:

Imagery Source    →  Sentinel-2 (ESA Copernicus) or aerial orthophotos
Preprocessing     →  GDAL, Rasterio, Shapely
Deep Learning     →  PyTorch + segmentation_models.pytorch (SMP)
Training Infra    →  NVIDIA CUDA GPU (or Google Colab)
Post-Processing   →  OpenCV, Rasterio, Fiona
GIS Output        →  GeoTIFF (raster) → Shapefile/GeoJSON (vector)
Visualisation     →  QGIS / ArcGIS Pro / Folium

Install the core Python dependencies:

pip install torch torchvision segmentation-models-pytorch \
            rasterio fiona shapely geopandas albumentations \
            opencv-python matplotlib tqdm

Data Preparation

Good data preparation is where most urban segmentation projects succeed or fail. Spend time here.

Step 1: Acquire Imagery

For this tutorial, we use Sentinel-2 Level-2A (surface reflectance) imagery at 10 m resolution, downloadable free of charge from the Copernicus Data Space Ecosystem (dataspace.copernicus.eu).

Select a cloud-free scene over your city of interest. We will use bands B02 (Blue), B03 (Green), B04 (Red), and B08 (NIR) — giving us a 4-channel input that includes near-infrared, which is highly informative for vegetation and impervious surface discrimination.
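To see why NIR matters, consider NDVI, a standard vegetation index (not part of the model input here, just an illustration): it separates vegetation from built surfaces using exactly the Red and NIR bands listed above. Toy reflectance values below are invented for illustration.

```python
import numpy as np

# Toy surface reflectance values (scaled to 0-1): a vegetation pixel and an asphalt pixel
red = np.array([0.05, 0.20])
nir = np.array([0.45, 0.22])

# NDVI = (NIR - Red) / (NIR + Red): high for vegetation, near zero for asphalt
ndvi = (nir - red) / (nir + red + 1e-9)
print(ndvi.round(2))   # [0.8  0.05]
```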

Step 2: Acquire Ground Truth Labels

You need pixel-accurate land cover masks. Sources include:

  • OpenStreetMap (OSM) — rasterise road, building, and water polygons
  • National land cover databases — NLCD (USA), Urban Atlas (EU), LULC (India)
  • Manual annotation — tools like CVAT, Labelme, or QGIS with the Semi-Automatic Classification Plugin

For this tutorial, we rasterise OSM polygons into a six-class label raster aligned to our Sentinel-2 image:

import geopandas as gpd
import rasterio
from rasterio.features import rasterize
import numpy as np

CLASS_MAP = {
    'building':   1,
    'road':       2,
    'vegetation': 3,
    'water':      4,
    'bare_soil':  5,
    'other':      0,
}

def rasterize_labels(gdf: gpd.GeoDataFrame,
                     reference_raster: str,
                     output_path: str) -> None:
    """Burn vector labels to a raster matching the reference image."""
    with rasterio.open(reference_raster) as src:
        meta = src.meta.copy()
        meta.update(count=1, dtype='uint8')
        transform = src.transform
        shape = (src.height, src.width)

    label_raster = np.zeros(shape, dtype=np.uint8)

    for class_name, class_id in CLASS_MAP.items():
        if class_id == 0:
            continue                      # background stays as the fill value
        subset = gdf[gdf['class'] == class_name]
        if subset.empty:
            continue
        burned = rasterize(
            [(geom, class_id) for geom in subset.geometry],
            out_shape=shape,
            transform=transform,
            fill=0,
            dtype='uint8'
        )
        label_raster = np.where(burned > 0, burned, label_raster)

    with rasterio.open(output_path, 'w', **meta) as dst:
        dst.write(label_raster, 1)

    print(f"Label raster saved to {output_path}")

Step 3: Tile the Image

Neural networks require fixed-size inputs. We cut both the imagery and its label raster into 256 x 256 pixel tiles with a 50% overlap to reduce edge artefacts.

import os
from pathlib import Path

def tile_raster(raster_path: str,
                output_dir: str,
                tile_size: int = 256,
                overlap: int = 128) -> list:
    """Cut a large raster into fixed-size tiles with overlap."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    tile_paths = []
    stride = tile_size - overlap

    with rasterio.open(raster_path) as src:
        data = src.read()           # [C, H, W]
        meta = src.meta.copy()
        H, W = src.height, src.width
        transform = src.transform

    tile_id = 0
    for row_start in range(0, H - tile_size + 1, stride):
        for col_start in range(0, W - tile_size + 1, stride):
            tile = data[:, row_start:row_start + tile_size,
                           col_start:col_start + tile_size]

            # Shift the origin by the tile's pixel offset
            tile_transform = transform * rasterio.transform.Affine.translation(
                col_start, row_start
            )

            tile_meta = meta.copy()
            tile_meta.update(
                height=tile_size, width=tile_size,
                count=tile.shape[0],
                transform=tile_transform
            )

            out_path = os.path.join(output_dir, f"tile_{tile_id:05d}.tif")
            with rasterio.open(out_path, 'w', **tile_meta) as dst:
                dst.write(tile)

            tile_paths.append(out_path)
            tile_id += 1

    print(f"Generated {tile_id} tiles in {output_dir}")
    return tile_paths

Step 4: Build a PyTorch Dataset

import torch
from torch.utils.data import Dataset
import numpy as np
import rasterio
import albumentations as A
from albumentations.pytorch import ToTensorV2

AUGMENTATIONS = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    # Default var_limit assumes 0-255 data; scale it down for [0, 1] float imagery
    A.GaussNoise(var_limit=(0.0005, 0.005), p=0.2),
    ToTensorV2()
])

class UrbanSegDataset(Dataset):
    def __init__(self, image_tiles: list, label_tiles: list,
                 augment: bool = True):
        self.images = image_tiles
        self.labels = label_tiles
        self.augment = augment

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        with rasterio.open(self.images[idx]) as src:
            img = src.read().astype(np.float32)        # [C, H, W]

        with rasterio.open(self.labels[idx]) as src:
            mask = src.read(1).astype(np.int64)        # [H, W]

        # Normalise to [0, 1] using Sentinel-2 max reflectance value
        img = img / 10000.0
        img = np.clip(img, 0, 1)
        img = img.transpose(1, 2, 0)                   # [H, W, C] for albumentations

        if self.augment:
            augmented = AUGMENTATIONS(image=img, mask=mask)
            img = augmented['image'].float()
            mask = augmented['mask'].long()   # ToTensorV2 already returns a tensor
        else:
            img = torch.from_numpy(img.transpose(2, 0, 1)).float()
            mask = torch.from_numpy(mask).long()

        return img, mask

Model Architecture: U-Net for Urban Scenes

U-Net remains the workhorse of geospatial segmentation. Its encoder-decoder structure with skip connections preserves fine spatial detail while learning high-level semantic features — exactly what urban scenes demand.

Encoder (downsampling path)
  Input [B, 4, 256, 256]
  Conv Block → 64 feature maps  ─────────────────────────────── skip
  MaxPool → 128 x 128
  Conv Block → 128 feature maps ──────────────────────────── skip
  MaxPool → 64 x 64
  Conv Block → 256 feature maps ───────────────────────── skip
  MaxPool → 32 x 32
  Bottleneck → 512 feature maps

Decoder (upsampling path)
  Upsample → 64 x 64   + skip concat from 256-map block
  Upsample → 128 x 128 + skip concat from 128-map block
  Upsample → 256 x 256 + skip concat from  64-map block
  1x1 Conv → [B, num_classes, 256, 256]

We use the segmentation_models.pytorch (SMP) library, which provides pretrained encoder backbones:

import segmentation_models_pytorch as smp

NUM_CLASSES = 6    # background + 5 land cover classes
IN_CHANNELS = 4    # R, G, B, NIR

model = smp.Unet(
    encoder_name='resnet34',        # ImageNet-pretrained encoder
    encoder_weights='imagenet',
    in_channels=IN_CHANNELS,
    classes=NUM_CLASSES,
    activation=None                 # Raw logits; apply softmax during inference
)

total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")   # ~24 million for ResNet-34 backbone

Note on encoder weights: ResNet-34 is pretrained on ImageNet (3-channel RGB). When in_channels != 3, SMP adapts the first convolution automatically; recent versions reuse the pretrained weights by cycling and rescaling them across the new channels. A common manual alternative is to initialise the extra NIR channel as the mean of the three pretrained RGB channel weights.
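If you ever need to do the channel expansion yourself (outside SMP), mean-initialisation of the extra channel is one common recipe. A minimal sketch with plain nn.Conv2d layers standing in for the pretrained first convolution:

```python
import torch
import torch.nn as nn

# Pretend conv3 holds ImageNet-pretrained weights for 3 input channels
conv3 = nn.Conv2d(3, 64, kernel_size=7)
conv4 = nn.Conv2d(4, 64, kernel_size=7)

with torch.no_grad():
    conv4.weight[:, :3] = conv3.weight                # copy the RGB weights
    conv4.weight[:, 3] = conv3.weight.mean(dim=1)     # NIR = mean of RGB weights
    conv4.bias.copy_(conv3.bias)

print(conv4.weight.shape)   # torch.Size([64, 4, 7, 7])
```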


Training the Model

Loss Function

Urban datasets are class-imbalanced — roads and buildings often cover 60-80% of pixels, while water or bare soil may cover less than 2%. We combine Cross-Entropy with Dice Loss to handle this:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedLoss(nn.Module):
    def __init__(self, num_classes: int, dice_weight: float = 0.5):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.dice_weight = dice_weight
        self.num_classes = num_classes

    def dice_loss(self, logits: torch.Tensor,
                  targets: torch.Tensor) -> torch.Tensor:
        probs = F.softmax(logits, dim=1)                          # [B, C, H, W]
        targets_one_hot = F.one_hot(targets, self.num_classes)    # [B, H, W, C]
        targets_one_hot = targets_one_hot.permute(0, 3, 1, 2).float()

        intersection = (probs * targets_one_hot).sum(dim=(2, 3))
        union = probs.sum(dim=(2, 3)) + targets_one_hot.sum(dim=(2, 3))

        dice = (2 * intersection + 1e-7) / (union + 1e-7)
        return 1 - dice.mean()

    def forward(self, logits, targets):
        ce   = self.ce(logits, targets)
        dice = self.dice_loss(logits, targets)
        return (1 - self.dice_weight) * ce + self.dice_weight * dice
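If minority classes still underperform, per-class weights can be passed to the Cross-Entropy term. The inverse-log-frequency formula below is one common heuristic, not part of the loss above, and the pixel counts are hypothetical:

```python
import numpy as np
import torch

# Hypothetical per-class pixel counts (order: other .. bare_soil)
pixel_counts = np.array([5e6, 3e6, 2e6, 1.5e6, 1e5, 5e4])
freq = pixel_counts / pixel_counts.sum()

# Inverse-log-frequency weighting: rare classes get larger weights
weights = 1.0 / np.log(1.02 + freq)
weights = weights / weights.sum() * len(weights)      # normalise to mean 1

ce = torch.nn.CrossEntropyLoss(
    weight=torch.tensor(weights, dtype=torch.float32)
)
print(weights.round(2))
```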

Training Loop

from torch.utils.data import DataLoader
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from tqdm import tqdm

DEVICE     = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
EPOCHS     = 50
BATCH_SIZE = 8
LR         = 3e-4

train_dataset = UrbanSegDataset(train_images, train_labels, augment=True)
val_dataset   = UrbanSegDataset(val_images,   val_labels,   augment=False)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                          shuffle=True,  num_workers=4, pin_memory=True)
val_loader   = DataLoader(val_dataset,   batch_size=BATCH_SIZE,
                          shuffle=False, num_workers=4)

model     = model.to(DEVICE)
criterion = CombinedLoss(num_classes=NUM_CLASSES).to(DEVICE)
optimizer = AdamW(model.parameters(), lr=LR, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=1e-6)

best_val_loss = float('inf')

for epoch in range(EPOCHS):
    # Training
    model.train()
    train_loss = 0.0
    for imgs, masks in tqdm(train_loader, desc=f"Epoch {epoch+1} [Train]"):
        imgs, masks = imgs.to(DEVICE), masks.to(DEVICE)
        optimizer.zero_grad()
        logits = model(imgs)
        loss   = criterion(logits, masks)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    # Validation
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for imgs, masks in tqdm(val_loader, desc=f"Epoch {epoch+1} [Val]"):
            imgs, masks = imgs.to(DEVICE), masks.to(DEVICE)
            logits    = model(imgs)
            val_loss += criterion(logits, masks).item()

    scheduler.step()

    avg_train = train_loss / len(train_loader)
    avg_val   = val_loss   / len(val_loader)
    print(f"Epoch {epoch+1} | Train Loss: {avg_train:.4f} | Val Loss: {avg_val:.4f}")

    if avg_val < best_val_loss:
        best_val_loss = avg_val
        torch.save(model.state_dict(), 'best_urban_seg_model.pth')
        print("  Best model saved.")

Recommended Hyperparameters

Parameter       Value       Notes
Backbone        ResNet-34   Good balance of speed and accuracy
Input size      256 x 256   Increase to 512 on high-VRAM GPUs
Batch size      8-16        Adjust to available GPU memory
Learning rate   3e-4        With AdamW + cosine decay
Epochs          50-100      Use early stopping with patience = 10
Dice weight     0.5         Tune upward if minority classes underperform
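The early stopping mentioned above is a few lines wrapped around the validation loop. A self-contained sketch with a synthetic loss curve (the real loop would feed in avg_val each epoch):

```python
def stop_epoch(val_losses, patience=10):
    """Return the epoch at which training would stop early."""
    best, wait = float('inf'), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0          # improvement: reset the counter
        else:
            wait += 1                     # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

# Loss improves twice, then plateaus for 12 epochs
curve = [0.90, 0.80] + [0.85] * 12
print(stop_epoch(curve))   # 11 -> stops after 10 epochs without improvement
```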

Post-Processing and GIS Integration

Step 1: Run Inference on the Full Scene

After training, run the model over all tiles and reconstruct the full prediction raster. Overlapping tiles are averaged before the final argmax:

import numpy as np
import rasterio
import torch
import torch.nn.functional as F
from tqdm import tqdm

def predict_full_scene(model, image_tiles: list, output_path: str,
                       reference_raster: str, num_classes: int = 6):
    """Aggregate tile predictions into a full-scene label raster."""
    model.eval()

    with rasterio.open(reference_raster) as ref:
        H, W = ref.height, ref.width
        ref_transform = ref.transform
        meta = ref.meta.copy()
        meta.update(count=1, dtype='uint8')

    score_acc = np.zeros((num_classes, H, W), dtype=np.float32)
    count_acc = np.zeros((H, W),              dtype=np.float32)

    for tile_path in tqdm(image_tiles, desc="Inference"):
        with rasterio.open(tile_path) as src:
            tile_data = src.read().astype(np.float32) / 10000.0
            # Derive the tile's pixel offsets from its georeferenced origin
            col_off = round((src.transform.c - ref_transform.c) / ref_transform.a)
            row_off = round((src.transform.f - ref_transform.f) / ref_transform.e)
            th, tw  = src.height, src.width

        tensor = torch.from_numpy(tile_data).unsqueeze(0).to(DEVICE)
        with torch.no_grad():
            logits = model(tensor)
            probs  = F.softmax(logits, dim=1).squeeze(0).cpu().numpy()

        score_acc[:, row_off:row_off+th, col_off:col_off+tw] += probs
        count_acc[row_off:row_off+th, col_off:col_off+tw]    += 1

    count_acc = np.where(count_acc == 0, 1, count_acc)
    label_map = (score_acc / count_acc).argmax(axis=0).astype(np.uint8)

    with rasterio.open(output_path, 'w', **meta) as dst:
        dst.write(label_map, 1)

    print(f"Prediction raster saved to {output_path}")

Step 2: Vectorise the Prediction Raster

Convert the classified raster to polygons using rasterio.features.shapes:

import fiona
import rasterio
from rasterio.features import shapes
from shapely.geometry import shape, mapping

CLASS_NAMES = {
    0: 'other',
    1: 'building',
    2: 'road',
    3: 'vegetation',
    4: 'water',
    5: 'bare_soil'
}

def vectorise_prediction(raster_path: str, output_path: str) -> None:
    """Convert a classified raster to a polygon shapefile."""
    with rasterio.open(raster_path) as src:
        label_array = src.read(1)
        transform   = src.transform
        crs         = src.crs

    schema = {
        'geometry': 'Polygon',
        'properties': {'class_id': 'int', 'class_name': 'str'}
    }

    with fiona.open(output_path, 'w',
                    driver='ESRI Shapefile',
                    crs_wkt=crs.to_wkt(),
                    schema=schema) as dst:

        for geom, value in shapes(label_array, transform=transform):
            if value == 0:              # skip background
                continue
            polygon = shape(geom)
            if polygon.area < 100:      # filter slivers smaller than 100 m²
                continue
            dst.write({
                'geometry': mapping(polygon),
                'properties': {
                    'class_id':   int(value),
                    'class_name': CLASS_NAMES.get(int(value), 'unknown')
                }
            })

    print(f"Vector layer saved to {output_path}")

Step 3: Load into QGIS

Your output shapefile can be opened directly in QGIS. Apply a Categorized Renderer on the class_name field with the following suggested colour scheme:

Class        Suggested Colour
building     #c0392b (dark red)
road         #7f8c8d (grey)
vegetation   #27ae60 (green)
water        #2980b9 (blue)
bare_soil    #e67e22 (orange)
other        #bdc3c7 (light grey)
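The same palette works for quick raster previews in matplotlib before opening QGIS. URBAN_CMAP is a name introduced here; the class order matches the CLASS_NAMES mapping from the vectorisation step:

```python
import numpy as np
from matplotlib.colors import ListedColormap

# Class order 0..5: other, building, road, vegetation, water, bare_soil
URBAN_CMAP = ListedColormap(
    ['#bdc3c7', '#c0392b', '#7f8c8d', '#27ae60', '#2980b9', '#e67e22']
)

# A random label map stands in for a real prediction raster
label_map = np.random.randint(0, 6, size=(64, 64))
rgba = URBAN_CMAP(label_map)       # [64, 64, 4] RGBA array, ready for plt.imshow
print(rgba.shape)
```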

Evaluation Metrics

Pixel-Level ML Metrics

For each class, compute Intersection over Union (IoU), also called the Jaccard Index:

IoU (class c) = TP_c / (TP_c + FP_c + FN_c)

Average across all classes to get mean IoU (mIoU) — the standard benchmark metric for segmentation.

import numpy as np

def compute_miou(preds: np.ndarray, targets: np.ndarray,
                 num_classes: int) -> dict:
    """Compute per-class IoU and mean IoU."""
    iou_per_class = {}
    preds   = preds.flatten()
    targets = targets.flatten()

    for c in range(num_classes):
        pred_c   = preds == c
        target_c = targets == c
        intersection = np.logical_and(pred_c, target_c).sum()
        union        = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue          # class absent from both; skip rather than score 0
        iou = intersection / union
        iou_per_class[CLASS_NAMES[c]] = round(float(iou), 4)

    iou_per_class['mIoU'] = round(
        float(np.mean(list(iou_per_class.values()))), 4
    )
    return iou_per_class
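As a sanity check, the per-class IoU for a 2 x 2 toy example can be worked by hand:

```python
import numpy as np

preds   = np.array([[1, 1], [2, 0]])
targets = np.array([[1, 2], [2, 0]])

for c in range(3):
    inter = np.logical_and(preds == c, targets == c).sum()
    union = np.logical_or(preds == c, targets == c).sum()
    print(f"class {c}: IoU = {inter / union:.2f}")

# class 0: IoU = 1.00  (one pixel, matched exactly)
# class 1: IoU = 0.50  (1 correct pixel out of 2 in the union)
# class 2: IoU = 0.50
```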

Typical benchmark targets for urban imagery:

Class          Acceptable IoU   Good IoU
Building       0.65             0.80+
Road           0.60             0.75+
Vegetation     0.70             0.85+
Water          0.75             0.90+
Bare soil      0.50             0.70+
Overall mIoU   0.60             0.75+

GIS-Specific Metrics

Beyond mIoU, urban mapping projects benefit from spatial accuracy metrics:

Boundary F1 Score measures how accurately class boundaries are localised — critical for road network extraction where alignment with centrelines matters.

Area Accuracy per Class compares predicted polygon areas against reference areas. A model with high mIoU can still systematically over- or under-predict building footprint area, which distorts any downstream area-based analysis.

Road Connectivity Score uses NetworkX to check whether extracted road segments form a connected graph. A fragmented road network is far less useful for routing applications even if per-pixel accuracy is high.
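A minimal sketch of that connectivity check. The segment endpoints below are toy coordinates; a real pipeline would build the graph from skeletonised road centrelines:

```python
import networkx as nx

# Toy road network: two segments join at (0, 1); a third is disconnected
segments = [((0, 0), (0, 1)), ((0, 1), (0, 2)), ((5, 5), (5, 6))]

G = nx.Graph()
G.add_edges_from(segments)

n_parts = nx.number_connected_components(G)
print(n_parts)   # 2 -> the extracted road network is fragmented
```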


Real-World Considerations

Handling Temporal Gaps

Satellite imagery is rarely collected at the same time as your ground truth labels. If your OSM labels reflect the city as of 2023 but your imagery is from 2021, you will introduce label noise for recently built structures. Mitigate this by using imagery from within six months of your label collection date, and by fine-tuning periodically as new imagery and OSM edits become available.

Domain Shift Between Cities

A model trained on Mumbai will not transfer cleanly to Helsinki. Building density, road geometry, and vegetation coverage differ dramatically. Apply transfer learning by fine-tuning your pretrained model on 5-10% labelled data from the new city before running full inference.
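A minimal fine-tuning sketch: freeze the encoder and retrain only the decoder head on the new city's labels. TinySeg is a stand-in with the same .encoder/.decoder layout as the SMP model; the pattern is identical:

```python
import torch.nn as nn
from torch.optim import AdamW

class TinySeg(nn.Module):
    """Stand-in with the same .encoder/.decoder layout as smp.Unet."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(4, 8, 3, padding=1)
        self.decoder = nn.Conv2d(8, 6, 1)
    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinySeg()

# Freeze the (pretrained) encoder; only decoder weights receive gradients
for param in model.encoder.parameters():
    param.requires_grad = False

optimizer = AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
n_trainable = sum(p.numel() for g in optimizer.param_groups for p in g['params'])
print(n_trainable)   # 54 -> decoder weights and bias only
```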

Resolution Sensitivity

Sentinel-2 at 10 m/pixel is good for district-scale mapping but cannot resolve individual buildings reliably. For building footprint extraction, use higher-resolution sources:

Source                    Resolution   Cost
Sentinel-2                10 m         Free
Planet Labs PlanetScope   3-5 m        Commercial
Maxar WorldView-3         0.3 m        Commercial
NAIP / OpenAerialMap      0.6-1 m      Free (USA / selective)

Handling Shadows and Clouds

Urban scenes have pervasive building shadows that create ambiguity for road and vegetation classes. Use the Scene Classification Layer (SCL) provided in Sentinel-2 L2A products to mask clouds and cloud shadows before training. For shadowed areas, NIR band reflectance provides a partial signal even when visible bands are dark.

Class Imbalance in Informal Settlements

Informal urban areas present unique challenges: building materials (corrugated iron, thatch) have irregular spectral signatures, road boundaries are unclear, and OSM coverage is sparse. Consider using active learning — iteratively labelling the tiles where model uncertainty is highest — to build targeted training data for these areas without exhaustive manual annotation.
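A sketch of that uncertainty ranking, using the entropy of the softmax output per tile. The probabilities below are synthetic; in practice they come from the model's per-tile predictions:

```python
import numpy as np

def mean_entropy(probs):
    """Mean per-pixel entropy of a [C, H, W] softmax output."""
    return float(-(probs * np.log(probs + 1e-9)).sum(axis=0).mean())

# Two synthetic 6-class tiles: one confident, one maximally uncertain
confident = np.zeros((6, 4, 4)); confident[1] = 1.0     # all 'building'
uncertain = np.full((6, 4, 4), 1 / 6)                   # uniform probabilities

tiles = {'tile_a': confident, 'tile_b': uncertain}
ranked = sorted(tiles, key=lambda k: mean_entropy(tiles[k]), reverse=True)
print(ranked)   # ['tile_b', 'tile_a'] -> label the uncertain tile first
```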


Conclusion

Semantic segmentation has fundamentally changed what is possible in urban GIS. What once required months of manual digitising can now be accomplished in hours — with the model improving continuously as more labelled data is added.

The pipeline built here is production-ready in structure, though real deployments require additional engineering: model versioning, automated retraining triggers, QA sampling workflows, and integration with spatial databases like PostGIS. Each of those is its own tutorial.

The key takeaways:

  • Tile your imagery with overlap and normalise carefully — bad preprocessing undermines good models
  • U-Net with a pretrained encoder backbone is a strong baseline for most urban segmentation tasks
  • Combined Cross-Entropy + Dice loss handles class imbalance better than either alone
  • Always convert predictions back to georeferenced GIS formats — pixel accuracy is necessary but not sufficient
  • mIoU is the primary benchmark, but spatial metrics like boundary F1 and connectivity score reflect real-world usability
  • Domain shift is the biggest deployment risk — build in fine-tuning workflows from the start

The intersection of GIS and deep learning is one of the fastest-moving areas in geospatial technology. Semantic segmentation is your entry point into that space.


For questions or dataset recommendations, connect on LinkedIn or raise an issue in the project repository.


Tags: #GeoAI #RemoteSensing #SemanticSegmentation #GIS #DeepLearning #UrbanMapping #Sentinel2 #PyTorch #UNet
