Training a Neural Network to Classify Land Cover from Aerial Photos

A practical guide for geospatial practitioners and remote sensing enthusiasts


Introduction

Land cover classification — the process of labeling every pixel in a satellite or aerial image with a category like forest, water, urban, or cropland — is one of the most impactful applications of deep learning in geospatial science. Accurate, up-to-date land cover maps underpin environmental monitoring, urban planning, disaster response, agricultural policy, and climate modeling.

Traditional classification workflows relied on handcrafted spectral indices and rule-based thresholds. Neural networks, particularly convolutional neural networks (CNNs) and their encoder-decoder variants, have largely supplanted these approaches by learning rich spatial-spectral features directly from labeled imagery. This article walks through the end-to-end process of training such a model — from data preparation to deployment-ready inference.


1. Understanding the Problem

Land cover classification is a semantic segmentation task. Unlike image-level classification (which assigns one label to an entire image), semantic segmentation assigns a class label to every pixel. The output of the model is a classification map — a raster of the same spatial extent and resolution as the input image, where each cell carries a predicted category.

Common Land Cover Classes

Depending on the use case and reference schema, typical classes include:

  • Impervious surfaces — roads, rooftops, parking lots
  • Vegetation — forests, shrublands, grasslands
  • Cropland — agricultural fields, orchards
  • Water bodies — rivers, lakes, reservoirs
  • Bare soil / exposed rock
  • Built-up / urban areas

Schemes like NLCD (National Land Cover Database), CORINE (Europe), or ESA WorldCover define standardized hierarchies that are widely used in training datasets.
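
When training against a product like NLCD, the raw legend codes are usually collapsed into a handful of training classes first. A minimal sketch — the grouping, the `NLCD_TO_SIMPLE` mapping, and the `fill` value are illustrative choices, not a standard (wetland codes 90/95 are omitted here for brevity):

```python
import numpy as np

# Hypothetical remapping of NLCD legend codes to the simplified scheme above.
# NLCD codes: 11 = open water, 21-24 = developed, 31 = barren,
# 41-43 = forest, 52 = shrub, 71 = grassland, 81 = pasture, 82 = crops.
NLCD_TO_SIMPLE = {
    11: 3,                                # water
    21: 5, 22: 5, 23: 5, 24: 5,          # built-up / impervious
    31: 4,                                # bare soil / exposed rock
    41: 1, 42: 1, 43: 1, 52: 1, 71: 1,   # vegetation
    81: 2, 82: 2,                         # cropland
}

def remap_labels(nlcd_array, mapping, fill=255):
    """Map raw NLCD codes to contiguous training class IDs.

    Codes missing from the mapping become `fill`, which the loss
    function can then treat as an ignore index.
    """
    out = np.full_like(nlcd_array, fill)
    for src_code, dst_class in mapping.items():
        out[nlcd_array == src_code] = dst_class
    return out
```
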


2. Data Collection and Preparation

The quality and consistency of your training data will determine the ceiling of your model’s performance. This stage deserves at least as much effort as model design.

2.1 Imagery Sources

Aerial and satellite imagery differ in resolution, revisit frequency, and cost. Common options include:

Source             Resolution   Bands           Notes
NAIP (USA)         0.6–1 m      RGB + NIR       Free, covers contiguous US
Sentinel-2         10 m         13 bands        Free, global, multispectral
Planet Basemaps    3–5 m        RGB + NIR       Commercial, high cadence
Maxar/WorldView    0.3–0.5 m    Multispectral   Commercial, very high resolution
OpenAerialMap      Varies       RGB             Community-contributed, free

For a first project, NAIP imagery paired with the NLCD product is an excellent starting point — both are freely available from USGS.

2.2 Ground Truth Labels

Labels must align spatially and temporally with your imagery. Sources include:

  • Existing land cover products — NLCD, ESA WorldCover, Dynamic World (Google)
  • Manual digitization — using QGIS, ArcGIS Pro, or Labelbox
  • Crowdsourced data — OpenStreetMap building/road layers for specific classes

When using pre-existing products as labels, verify temporal correspondence. A 2015 label map paired with 2023 imagery will introduce noise in areas that have changed.

2.3 Tiling the Data

Neural networks are trained on fixed-size image patches, not full-scene rasters. A standard pipeline:

  1. Reproject all imagery and labels to a common CRS (e.g., UTM zone for the AOI).
  2. Create a regular grid of patch extents across the study area (commonly 256×256 or 512×512 pixels).
  3. Export patches as GeoTIFFs — one image chip and one corresponding label chip per tile.
  4. Filter tiles with excessive NoData coverage (e.g., >20% cloud or null pixels).

Tools like rasterio, GDAL, and torchgeo streamline this process considerably.

import rasterio
from rasterio.windows import Window

def export_tiles(src_path, label_path, out_dir, tile_size=256, stride=256):
    with rasterio.open(src_path) as src, rasterio.open(label_path) as lbl:
        # +1 so the last full tile along each axis is included
        for row_off in range(0, src.height - tile_size + 1, stride):
            for col_off in range(0, src.width - tile_size + 1, stride):
                window = Window(col_off, row_off, tile_size, tile_size)
                img_chip = src.read(window=window)       # (bands, H, W)
                lbl_chip = lbl.read(1, window=window)    # (H, W)
                # Save chips to out_dir...

2.4 Train / Validation / Test Splits

Avoid random pixel-level splits — spatially proximate tiles share texture and context, causing data leakage. Instead, use a spatial block split: hold out contiguous geographic regions for validation and testing. This gives a fairer estimate of how the model will generalize to unseen areas.
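
A spatial block split can be sketched by bucketing each tile's origin coordinates into a coarse grid and assigning whole blocks to splits. The function name, block size, and split fractions below are illustrative assumptions:

```python
import numpy as np

def spatial_block_split(tile_origins, block_size=10_000, val_frac=0.15,
                        test_frac=0.15, seed=42):
    """Assign tiles to train/val/test by the spatial block they fall in.

    tile_origins: iterable of (x, y) map coordinates (e.g., tile upper-left
    corners in the projected CRS). block_size is the side length of each
    square block in the same units (meters, for a UTM CRS).
    """
    rng = np.random.default_rng(seed)
    origins = np.asarray(tile_origins, dtype=float)
    # Integer block index for each tile
    blocks = (origins // block_size).astype(int)
    unique_blocks = np.unique(blocks, axis=0)
    rng.shuffle(unique_blocks)

    n = len(unique_blocks)
    n_test = max(1, int(n * test_frac))
    n_val = max(1, int(n * val_frac))
    test_set = {tuple(b) for b in unique_blocks[:n_test]}
    val_set = {tuple(b) for b in unique_blocks[n_test:n_test + n_val]}

    split = []
    for b in blocks:
        key = tuple(b)
        split.append("test" if key in test_set
                     else "val" if key in val_set else "train")
    return split
```

Because assignment happens at the block level, two tiles that share a block — and therefore share texture and context — can never end up on opposite sides of the train/test boundary.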


3. Choosing a Model Architecture

3.1 U-Net: The Workhorse

Introduced in 2015 for biomedical image segmentation, U-Net has become the default starting point for land cover classification. Its key features:

  • Encoder (contracting path): A series of convolutional blocks that progressively downsample the input, building increasingly abstract feature representations.
  • Decoder (expansive path): Upsampling blocks that recover spatial resolution.
  • Skip connections: Direct connections from encoder to decoder at each resolution level, preserving fine spatial detail that would otherwise be lost during downsampling.

The result is a model that simultaneously captures global context and local texture — essential for distinguishing visually similar classes like grassland and cropland at field boundaries.

3.2 Beyond U-Net

Modern variants and alternatives worth considering:

  • ResNet / EfficientNet encoders — Replace the vanilla U-Net encoder with a pretrained ResNet or EfficientNet backbone for better feature extraction, especially with limited labeled data.
  • DeepLab v3+ — Uses atrous (dilated) convolutions and an Atrous Spatial Pyramid Pooling (ASPP) module to capture multi-scale context without losing resolution.
  • SegFormer — A transformer-based encoder that has shown strong performance on remote sensing benchmarks, particularly when large-scale pretraining is available.
  • Swin-UNet — Combines Swin Transformer blocks within a U-Net-style decoder, offering a good balance of accuracy and efficiency.

For most projects with moderate dataset sizes (tens of thousands of tiles), a U-Net with a pretrained ResNet-34 or EfficientNet-B4 encoder (available through segmentation-models-pytorch) is an excellent choice.

import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=4,          # RGB + NIR
    classes=7,              # Number of land cover classes
)

4. Training the Model

4.1 Loss Functions

Standard cross-entropy loss works but often struggles with class imbalance — urban scenes may be 80% impervious and 2% water, so a naive model learns to predict the majority class everywhere.

Better options:

  • Weighted cross-entropy — Assign higher weight to rare classes, inversely proportional to their frequency.
  • Focal Loss — Down-weights easy examples and focuses learning on hard, misclassified pixels.
  • Dice Loss / IoU Loss — Directly optimizes for overlap between predicted and ground truth masks; less sensitive to class imbalance.
  • Combo Loss — A weighted combination of cross-entropy and Dice loss, often the best empirical choice.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ComboLoss(nn.Module):
    def __init__(self, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.ce = nn.CrossEntropyLoss()

    def forward(self, preds, targets):
        # preds: (N, C, H, W) logits; targets: (N, H, W) class indices
        ce_loss = self.ce(preds, targets)
        preds_soft = torch.softmax(preds, dim=1)
        # One-hot encode targets to match the (N, C, H, W) prediction shape
        targets_onehot = F.one_hot(
            targets, num_classes=preds.shape[1]
        ).permute(0, 3, 1, 2).float()
        dice = 1 - (2 * (preds_soft * targets_onehot).sum()) / \
               (preds_soft.sum() + targets_onehot.sum() + 1e-6)
        return self.alpha * ce_loss + (1 - self.alpha) * dice

4.2 Data Augmentation

Remote sensing imagery demands thoughtful augmentation. Standard photographic augmentations apply, plus domain-specific ones:

Augmentation                    Rationale
Horizontal / vertical flips     Aerial imagery has no canonical orientation
90° rotations                   Buildings and fields appear at arbitrary orientations
Random crops                    Increases positional variety
Brightness / contrast jitter    Accounts for sensor and atmospheric variation
Random channel dropout          Simulates missing bands in multispectral data
Gaussian blur                   Simulates variation in imagery sharpness
Cutout / GridMask               Forces the model to not rely on a single region

Use albumentations for efficient, reproducible augmentation pipelines.
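
The geometric core of such a pipeline is small enough to sketch in plain NumPy. The function below applies a random flip/rotation jointly to an image chip and its label mask — the key requirement for segmentation augmentation — while libraries like albumentations add the photometric transforms and handle this bookkeeping for you:

```python
import numpy as np

def random_d4_augment(image, mask, rng):
    """Apply a random flip/rotation transform jointly to an image chip
    (C, H, W) and its label mask (H, W). A plain-NumPy sketch of the
    geometric augmentations listed above.
    """
    k = int(rng.integers(4))            # number of 90-degree rotations
    image = np.rot90(image, k, axes=(1, 2))
    mask = np.rot90(mask, k, axes=(0, 1))
    if rng.integers(2):                 # random horizontal flip
        image = image[:, :, ::-1]
        mask = mask[:, ::-1]
    if rng.integers(2):                 # random vertical flip
        image = image[:, ::-1, :]
        mask = mask[::-1, :]
    # Copy to drop the negative strides introduced by flipping
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)
```
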

4.3 Training Loop Best Practices

  • Learning rate scheduling: Use a cosine annealing schedule or OneCycleLR — both consistently outperform a fixed learning rate.
  • Mixed precision training: Use torch.cuda.amp for 16-bit training on compatible GPUs; typically 1.5–2× speedup with negligible accuracy loss.
  • Gradient clipping: Helps with training stability, especially in early epochs.
  • Early stopping: Monitor validation mean IoU and stop when it plateaus to prevent overfitting.
  • Model checkpointing: Save the best checkpoint, not just the last epoch.

from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

for epoch in range(max_epochs):
    for images, masks in train_loader:
        optimizer.zero_grad()
        with autocast():
            preds = model(images)
            loss = criterion(preds, masks)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()

5. Evaluation Metrics

Pixel accuracy alone is misleading under class imbalance. Use:

  • Mean Intersection over Union (mIoU) — The standard benchmark metric. Computes IoU per class and averages. A mIoU of 0.75+ is generally considered strong.
  • Per-class IoU — Critical for understanding where the model fails. Water and impervious surfaces are usually easiest; shrubland vs. grassland is notoriously difficult.
  • Precision / Recall per class — Helps diagnose whether errors are false positives (commission errors) or false negatives (omission errors).
  • Confusion matrix — Reveals systematic confusions between spectrally similar classes.
  • F1 score (Dice coefficient) — Useful for reporting results on imbalanced datasets.
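
The confusion matrix, per-class IoU, and mIoU all fall out of a single `bincount` over paired labels. A NumPy sketch — the function name and `ignore_index` convention are illustrative:

```python
import numpy as np

def confusion_and_iou(pred, target, num_classes, ignore_index=255):
    """Confusion matrix and per-class IoU from integer label arrays."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]

    # Confusion matrix: rows = reference class, cols = predicted class
    cm = np.bincount(target * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes,
                                                         num_classes)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp              # commission errors per class
    fn = cm.sum(axis=1) - tp              # omission errors per class
    denom = tp + fp + fn
    # Classes absent from both pred and target are excluded from the mean
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return cm, iou, float(np.nanmean(iou))
```
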

6. Common Failure Modes and Fixes

Problem                                          Likely Cause                         Fix
Model predicts majority class only               Class imbalance                      Use focal loss or weighted CE; oversample rare classes
Sharp boundary artifacts                         Tile edge effects during inference   Use overlapping tiles with soft blending in overlap zones
High validation loss despite good training loss  Spatial data leakage                 Use spatial block splits for train/val/test
Poor generalization to new areas                 Domain shift in imagery              Normalize by local statistics; use domain adaptation
Confusion between spectrally similar classes     Insufficient spectral depth          Add NIR / SWIR bands; use indices like NDVI, NDWI
Noisy predictions at parcel boundaries           Limited boundary supervision         Use boundary-aware loss terms or contour supervision

7. Inference at Scale

Once trained, deploying the model over large areas introduces its own challenges.

7.1 Sliding Window with Overlap

Do not simply tile and predict independently. Use a sliding window with 50% overlap and average predictions in the overlap zones to suppress tile-boundary artifacts.

import torch

def predict_with_overlap(model, image, num_classes, tile_size=512, overlap=0.5):
    """image: (C, H, W) tensor. Assumes model.eval() has been called."""
    stride = int(tile_size * (1 - overlap))
    _, H, W = image.shape
    prediction_map = torch.zeros((num_classes, H, W))
    count_map = torch.zeros((1, H, W))

    for y in range(0, H - tile_size + 1, stride):
        for x in range(0, W - tile_size + 1, stride):
            tile = image[:, y:y+tile_size, x:x+tile_size]
            with torch.no_grad():
                pred = torch.softmax(model(tile.unsqueeze(0)), dim=1).squeeze(0)
            prediction_map[:, y:y+tile_size, x:x+tile_size] += pred
            count_map[:, y:y+tile_size, x:x+tile_size] += 1

    # Clamp avoids division by zero at edges not covered by any tile
    return (prediction_map / count_map.clamp(min=1)).argmax(0)

7.2 Writing Georeferenced Output

Always write predictions back as georeferenced GeoTIFFs, preserving the CRS and transform of the source imagery:

import rasterio

with rasterio.open(source_path) as src:
    meta = src.meta.copy()
    meta.update({"count": 1, "dtype": "uint8"})

    with rasterio.open("prediction.tif", "w", **meta) as dst:
        dst.write(prediction_array.astype("uint8"), 1)

8. Going Further

Once you have a working baseline, several directions can meaningfully improve results:

  • Transfer learning from geospatial foundation models: Models like SatMAE, ScaleMAE, and GeoChat are pretrained on massive remote sensing datasets and often provide richer initializations than ImageNet weights.
  • Multi-temporal inputs: Stack imagery from multiple dates to leverage phenological signatures (e.g., crops change dramatically across seasons; forests do not).
  • Elevation and terrain features: Incorporate a DSM or DTM as additional input channels — elevation strongly constrains land cover type.
  • Active learning: Strategically select the most uncertain tiles for human annotation to improve label efficiency.
  • Post-processing with spatial context: Apply morphological operations, connected-component filtering, or a Conditional Random Field (CRF) as a post-processing step to enforce spatial coherence.
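
As a concrete example of the last point, a sliding majority (mode) filter suppresses isolated single-pixel predictions. This plain-NumPy sketch is the simplest member of that family; morphological opening/closing and CRFs are stronger alternatives:

```python
import numpy as np

def majority_filter(class_map, size=3):
    """Replace each pixel with the most common class in its size x size
    neighborhood, smoothing speckle in a predicted class map.
    """
    pad = size // 2
    padded = np.pad(class_map, pad, mode="edge")
    h, w = class_map.shape
    # Stack all neighborhood offsets: (size*size, H, W)
    stack = np.stack([padded[i:i + h, j:j + w]
                      for i in range(size) for j in range(size)])
    # Per-pixel vote count for each class, then majority vote
    num_classes = int(class_map.max()) + 1
    votes = np.stack([(stack == c).sum(axis=0) for c in range(num_classes)])
    return votes.argmax(axis=0)
```

Note that aggressive smoothing trades away exactly the thin linear features (roads, streams) the model worked hard to recover, so filter size deserves validation like any other hyperparameter.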

Recommended Tools and Libraries

Tool                           Purpose
torchgeo                       Geospatial datasets and samplers for PyTorch
segmentation-models-pytorch    Pretrained encoder-decoder architectures
rasterio / GDAL                Raster I/O and geospatial operations
albumentations                 Fast image augmentation
QGIS                           Visualization, label creation, and QA
pytorch-lightning              Training loop boilerplate reduction
wandb / MLflow                 Experiment tracking

Conclusion

Training a neural network for land cover classification is a mature, well-supported workflow — but the details matter enormously. Spatial data leakage, class imbalance, tile-boundary artifacts, and domain shift are all real failure modes that trip up practitioners relying on generic computer vision recipes. By treating the geospatial context seriously at every stage — from spatial splits to georeferenced output — you can build models that generalize reliably across landscapes and form the backbone of operational land monitoring systems.

The field is also moving fast. Foundation models pretrained on terabytes of Earth observation data are beginning to make the feature extraction stage largely plug-and-play, shifting the engineering challenge toward high-quality labeling, robust evaluation, and thoughtful deployment. For practitioners entering the space today, there has never been a better time to build.


Further reading: Ronneberger et al. (2015) — U-Net; He et al. (2016) — Deep Residual Learning; Chen et al. (2018) — DeepLab v3+; Cong et al. (2022) — SatMAE
