Training a Neural Network to Classify Land Cover from Aerial Photos

A practical guide for geospatial practitioners and remote sensing enthusiasts


Introduction

Land cover classification — the process of labeling every pixel in a satellite or aerial image with a category like forest, water, urban, or cropland — is one of the most impactful applications of deep learning in geospatial science. Accurate, up-to-date land cover maps underpin environmental monitoring, urban planning, disaster response, agricultural policy, and climate modeling.

Traditional classification workflows relied on handcrafted spectral indices and rule-based thresholds. Neural networks, particularly convolutional neural networks (CNNs) and their encoder-decoder variants, have largely supplanted these approaches by learning rich spatial-spectral features directly from labeled imagery. This article walks through the end-to-end process of training such a model — from data preparation to deployment-ready inference.


1. Understanding the Problem

Land cover classification is a semantic segmentation task. Unlike image-level classification (which assigns one label to an entire image), semantic segmentation assigns a class label to every pixel. The output of the model is a classification map — a raster of the same spatial extent and resolution as the input image, where each cell carries a predicted category.

Common Land Cover Classes

Depending on the use case and reference schema, typical classes include:

  • Impervious surfaces — roads, rooftops, parking lots
  • Vegetation — forests, shrublands, grasslands
  • Cropland — agricultural fields, orchards
  • Water bodies — rivers, lakes, reservoirs
  • Bare soil / exposed rock
  • Built-up / urban areas

Schemes like NLCD (National Land Cover Database), CORINE (Europe), or ESA WorldCover define standardized hierarchies that are widely used in training datasets.
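
When training against a product like NLCD, the raw legend codes are usually collapsed into a handful of training classes first. A minimal sketch — the grouping, the `NLCD_TO_SIMPLE` mapping, and the `fill` value are illustrative choices, not a standard (wetland codes 90/95 are omitted here for brevity):

```python
import numpy as np

# Hypothetical remapping of NLCD legend codes to the simplified scheme above.
# NLCD codes: 11 = open water, 21-24 = developed, 31 = barren,
# 41-43 = forest, 52 = shrub, 71 = grassland, 81 = pasture, 82 = crops.
NLCD_TO_SIMPLE = {
    11: 3,                                # water
    21: 5, 22: 5, 23: 5, 24: 5,          # built-up / impervious
    31: 4,                                # bare soil / exposed rock
    41: 1, 42: 1, 43: 1, 52: 1, 71: 1,   # vegetation
    81: 2, 82: 2,                         # cropland
}

def remap_labels(nlcd_array, mapping, fill=255):
    """Map raw NLCD codes to contiguous training class IDs.

    Codes missing from the mapping become `fill`, which the loss
    function can then treat as an ignore index.
    """
    out = np.full_like(nlcd_array, fill)
    for src_code, dst_class in mapping.items():
        out[nlcd_array == src_code] = dst_class
    return out
```
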


2. Data Collection and Preparation

The quality and consistency of your training data will determine the ceiling of your model’s performance. This stage deserves at least as much effort as model design.

2.1 Imagery Sources

Aerial and satellite imagery differ in resolution, revisit frequency, and cost. Common options include:

Source             Resolution   Bands           Notes
NAIP (USA)         0.6–1 m      RGB + NIR       Free, covers contiguous US
Sentinel-2         10 m         13 bands        Free, global, multispectral
Planet Basemaps    3–5 m        RGB + NIR       Commercial, high cadence
Maxar/WorldView    0.3–0.5 m    Multispectral   Commercial, very high resolution
OpenAerialMap      Varies       RGB             Community-contributed, free

For a first project, NAIP imagery paired with the NLCD product is an excellent starting point — both are freely available from USGS.

2.2 Ground Truth Labels

Labels must align spatially and temporally with your imagery. Sources include:

  • Existing land cover products — NLCD, ESA WorldCover, Dynamic World (Google)
  • Manual digitization — using QGIS, ArcGIS Pro, or Labelbox
  • Crowdsourced data — OpenStreetMap building/road layers for specific classes

When using pre-existing products as labels, verify temporal correspondence. A 2015 label map paired with 2023 imagery will introduce noise in areas that have changed.

2.3 Tiling the Data

Neural networks are trained on fixed-size image patches, not full-scene rasters. A standard pipeline:

  1. Reproject all imagery and labels to a common CRS (e.g., UTM zone for the AOI).
  2. Create a regular grid of patch extents across the study area (commonly 256×256 or 512×512 pixels).
  3. Export patches as GeoTIFFs — one image chip and one corresponding label chip per tile.
  4. Filter tiles with excessive NoData coverage (e.g., >20% cloud or null pixels).

Tools like rasterio, GDAL, and torchgeo streamline this process considerably.

import rasterio
from rasterio.windows import Window

def export_tiles(src_path, label_path, out_dir, tile_size=256, stride=256):
    with rasterio.open(src_path) as src, rasterio.open(label_path) as lbl:
        # +1 so the last full tile along each axis is included
        for row_off in range(0, src.height - tile_size + 1, stride):
            for col_off in range(0, src.width - tile_size + 1, stride):
                window = Window(col_off, row_off, tile_size, tile_size)
                img_chip = src.read(window=window)       # (bands, H, W)
                lbl_chip = lbl.read(1, window=window)    # (H, W)
                # Save chips to out_dir...

2.4 Train / Validation / Test Splits

Avoid random pixel-level splits — spatially proximate tiles share texture and context, causing data leakage. Instead, use a spatial block split: hold out contiguous geographic regions for validation and testing. This gives a fairer estimate of how the model will generalize to unseen areas.
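
A spatial block split can be sketched by bucketing each tile's origin coordinates into a coarse grid and assigning whole blocks to splits. The function name, block size, and split fractions below are illustrative assumptions:

```python
import numpy as np

def spatial_block_split(tile_origins, block_size=10_000, val_frac=0.15,
                        test_frac=0.15, seed=42):
    """Assign tiles to train/val/test by the spatial block they fall in.

    tile_origins: iterable of (x, y) map coordinates (e.g., tile upper-left
    corners in the projected CRS). block_size is the side length of each
    square block in the same units (meters, for a UTM CRS).
    """
    rng = np.random.default_rng(seed)
    origins = np.asarray(tile_origins, dtype=float)
    # Integer block index for each tile
    blocks = (origins // block_size).astype(int)
    unique_blocks = np.unique(blocks, axis=0)
    rng.shuffle(unique_blocks)

    n = len(unique_blocks)
    n_test = max(1, int(n * test_frac))
    n_val = max(1, int(n * val_frac))
    test_set = {tuple(b) for b in unique_blocks[:n_test]}
    val_set = {tuple(b) for b in unique_blocks[n_test:n_test + n_val]}

    split = []
    for b in blocks:
        key = tuple(b)
        split.append("test" if key in test_set
                     else "val" if key in val_set else "train")
    return split
```

Because assignment happens at the block level, two tiles that share a block — and therefore share texture and context — can never end up on opposite sides of the train/test boundary.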


3. Choosing a Model Architecture

3.1 U-Net: The Workhorse

Introduced in 2015 for biomedical image segmentation, U-Net has become the default starting point for land cover classification. Its key features:

  • Encoder (contracting path): A series of convolutional blocks that progressively downsample the input, building increasingly abstract feature representations.
  • Decoder (expansive path): Upsampling blocks that recover spatial resolution.
  • Skip connections: Direct connections from encoder to decoder at each resolution level, preserving fine spatial detail that would otherwise be lost during downsampling.

The result is a model that simultaneously captures global context and local texture — essential for distinguishing visually similar classes like grassland and cropland at field boundaries.

3.2 Beyond U-Net

Modern variants and alternatives worth considering:

  • ResNet / EfficientNet encoders — Replace the vanilla U-Net encoder with a pretrained ResNet or EfficientNet backbone for better feature extraction, especially with limited labeled data.
  • DeepLab v3+ — Uses atrous (dilated) convolutions and an Atrous Spatial Pyramid Pooling (ASPP) module to capture multi-scale context without losing resolution.
  • SegFormer — A transformer-based encoder that has shown strong performance on remote sensing benchmarks, particularly when large-scale pretraining is available.
  • Swin-UNet — Combines Swin Transformer blocks within a U-Net-style decoder, offering a good balance of accuracy and efficiency.

For most projects with moderate dataset sizes (tens of thousands of tiles), a U-Net with a pretrained ResNet-34 or EfficientNet-B4 encoder (available through segmentation-models-pytorch) is an excellent choice.

import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=4,          # RGB + NIR
    classes=7,              # Number of land cover classes
)

4. Training the Model

4.1 Loss Functions

Standard cross-entropy loss works but often struggles with class imbalance — urban scenes may be 80% impervious and 2% water, so a naive model learns to predict the majority class everywhere.

Better options:

  • Weighted cross-entropy — Assign higher weight to rare classes, inversely proportional to their frequency.
  • Focal Loss — Down-weights easy examples and focuses learning on hard, misclassified pixels.
  • Dice Loss / IoU Loss — Directly optimizes for overlap between predicted and ground truth masks; less sensitive to class imbalance.
  • Combo Loss — A weighted combination of cross-entropy and Dice loss, often the best empirical choice.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ComboLoss(nn.Module):
    def __init__(self, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.ce = nn.CrossEntropyLoss()

    def forward(self, preds, targets):
        # preds: (N, C, H, W) logits; targets: (N, H, W) class indices
        ce_loss = self.ce(preds, targets)
        preds_soft = torch.softmax(preds, dim=1)
        # One-hot encode targets to match the (N, C, H, W) prediction shape
        targets_onehot = F.one_hot(
            targets, num_classes=preds.shape[1]
        ).permute(0, 3, 1, 2).float()
        dice = 1 - (2 * (preds_soft * targets_onehot).sum()) / \
               (preds_soft.sum() + targets_onehot.sum() + 1e-6)
        return self.alpha * ce_loss + (1 - self.alpha) * dice

4.2 Data Augmentation

Remote sensing imagery demands thoughtful augmentation. Standard photographic augmentations apply, plus domain-specific ones:

Augmentation                    Rationale
Horizontal / vertical flips     Aerial imagery has no canonical orientation
90° rotations                   Buildings and fields appear at arbitrary orientations
Random crops                    Increases positional variety
Brightness / contrast jitter    Accounts for sensor and atmospheric variation
Random channel dropout          Simulates missing bands in multispectral data
Gaussian blur                   Simulates variation in imagery sharpness
Cutout / GridMask               Forces the model to not rely on a single region

Use albumentations for efficient, reproducible augmentation pipelines.
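
The geometric core of such a pipeline is small enough to sketch in plain NumPy. The function below applies a random flip/rotation jointly to an image chip and its label mask — the key requirement for segmentation augmentation — while libraries like albumentations add the photometric transforms and handle this bookkeeping for you:

```python
import numpy as np

def random_d4_augment(image, mask, rng):
    """Apply a random flip/rotation transform jointly to an image chip
    (C, H, W) and its label mask (H, W). A plain-NumPy sketch of the
    geometric augmentations listed above.
    """
    k = int(rng.integers(4))            # number of 90-degree rotations
    image = np.rot90(image, k, axes=(1, 2))
    mask = np.rot90(mask, k, axes=(0, 1))
    if rng.integers(2):                 # random horizontal flip
        image = image[:, :, ::-1]
        mask = mask[:, ::-1]
    if rng.integers(2):                 # random vertical flip
        image = image[:, ::-1, :]
        mask = mask[::-1, :]
    # Copy to drop the negative strides introduced by flipping
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)
```
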

4.3 Training Loop Best Practices

  • Learning rate scheduling: Use a cosine annealing schedule or OneCycleLR — both consistently outperform a fixed learning rate.
  • Mixed precision training: Use torch.cuda.amp for 16-bit training on compatible GPUs; typically 1.5–2× speedup with negligible accuracy loss.
  • Gradient clipping: Helps with training stability, especially in early epochs.
  • Early stopping: Monitor validation mean IoU and stop when it plateaus to prevent overfitting.
  • Model checkpointing: Save the best checkpoint, not just the last epoch.

from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

for epoch in range(max_epochs):
    for images, masks in train_loader:
        optimizer.zero_grad()
        with autocast():
            preds = model(images)
            loss = criterion(preds, masks)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()

5. Evaluation Metrics

Pixel accuracy alone is misleading under class imbalance. Use:

  • Mean Intersection over Union (mIoU) — The standard benchmark metric. Computes IoU per class and averages. A mIoU of 0.75+ is generally considered strong.
  • Per-class IoU — Critical for understanding where the model fails. Water and impervious surfaces are usually easiest; shrubland vs. grassland is notoriously difficult.
  • Precision / Recall per class — Helps diagnose whether errors are false positives (commission errors) or false negatives (omission errors).
  • Confusion matrix — Reveals systematic confusions between spectrally similar classes.
  • F1 score (Dice coefficient) — Useful for reporting results on imbalanced datasets.
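
The confusion matrix, per-class IoU, and mIoU all fall out of a single `bincount` over paired labels. A NumPy sketch — the function name and `ignore_index` convention are illustrative:

```python
import numpy as np

def confusion_and_iou(pred, target, num_classes, ignore_index=255):
    """Confusion matrix and per-class IoU from integer label arrays."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]

    # Confusion matrix: rows = reference class, cols = predicted class
    cm = np.bincount(target * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes,
                                                         num_classes)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp              # commission errors per class
    fn = cm.sum(axis=1) - tp              # omission errors per class
    denom = tp + fp + fn
    # Classes absent from both pred and target are excluded from the mean
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return cm, iou, float(np.nanmean(iou))
```
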

6. Common Failure Modes and Fixes

Problem                                          Likely Cause                         Fix
Model predicts majority class only               Class imbalance                      Use focal loss or weighted CE; oversample rare classes
Sharp boundary artifacts                         Tile edge effects during inference   Use overlapping tiles with soft blending in overlap zones
High validation loss despite good training loss  Spatial data leakage                 Use spatial block splits for train/val/test
Poor generalization to new areas                 Domain shift in imagery              Normalize by local statistics; use domain adaptation
Confusion between spectrally similar classes     Insufficient spectral depth          Add NIR / SWIR bands; use indices like NDVI, NDWI
Noisy predictions at parcel boundaries           Limited boundary supervision         Use boundary-aware loss terms or contour supervision

7. Inference at Scale

Once trained, deploying the model over large areas introduces its own challenges.

7.1 Sliding Window with Overlap

Do not simply tile and predict independently. Use a sliding window with 50% overlap and average predictions in the overlap zones to suppress tile-boundary artifacts.

import torch

def predict_with_overlap(model, image, num_classes, tile_size=512, overlap=0.5):
    """image: (C, H, W) tensor. Assumes model.eval() has been called."""
    stride = int(tile_size * (1 - overlap))
    _, H, W = image.shape
    prediction_map = torch.zeros((num_classes, H, W))
    count_map = torch.zeros((1, H, W))

    for y in range(0, H - tile_size + 1, stride):
        for x in range(0, W - tile_size + 1, stride):
            tile = image[:, y:y+tile_size, x:x+tile_size]
            with torch.no_grad():
                pred = torch.softmax(model(tile.unsqueeze(0)), dim=1).squeeze(0)
            prediction_map[:, y:y+tile_size, x:x+tile_size] += pred
            count_map[:, y:y+tile_size, x:x+tile_size] += 1

    # Clamp avoids division by zero at edges not covered by any tile
    return (prediction_map / count_map.clamp(min=1)).argmax(0)

7.2 Writing Georeferenced Output

Always write predictions back as georeferenced GeoTIFFs, preserving the CRS and transform of the source imagery:

import rasterio

with rasterio.open(source_path) as src:
    meta = src.meta.copy()
    meta.update({"count": 1, "dtype": "uint8"})

    with rasterio.open("prediction.tif", "w", **meta) as dst:
        dst.write(prediction_array.astype("uint8"), 1)

8. Going Further

Once you have a working baseline, several directions can meaningfully improve results:

  • Transfer learning from geospatial foundation models: Models like SatMAE, ScaleMAE, and GeoChat are pretrained on massive remote sensing datasets and often provide richer initializations than ImageNet weights.
  • Multi-temporal inputs: Stack imagery from multiple dates to leverage phenological signatures (e.g., crops change dramatically across seasons; forests do not).
  • Elevation and terrain features: Incorporate a DSM or DTM as additional input channels — elevation strongly constrains land cover type.
  • Active learning: Strategically select the most uncertain tiles for human annotation to improve label efficiency.
  • Post-processing with spatial context: Apply morphological operations, connected-component filtering, or a Conditional Random Field (CRF) as a post-processing step to enforce spatial coherence.
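
As a concrete example of the last point, a sliding majority (mode) filter suppresses isolated single-pixel predictions. This plain-NumPy sketch is the simplest member of that family; morphological opening/closing and CRFs are stronger alternatives:

```python
import numpy as np

def majority_filter(class_map, size=3):
    """Replace each pixel with the most common class in its size x size
    neighborhood, smoothing speckle in a predicted class map.
    """
    pad = size // 2
    padded = np.pad(class_map, pad, mode="edge")
    h, w = class_map.shape
    # Stack all neighborhood offsets: (size*size, H, W)
    stack = np.stack([padded[i:i + h, j:j + w]
                      for i in range(size) for j in range(size)])
    # Per-pixel vote count for each class, then majority vote
    num_classes = int(class_map.max()) + 1
    votes = np.stack([(stack == c).sum(axis=0) for c in range(num_classes)])
    return votes.argmax(axis=0)
```

Note that aggressive smoothing trades away exactly the thin linear features (roads, streams) the model worked hard to recover, so filter size deserves validation like any other hyperparameter.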

Recommended Tools and Libraries

Tool                           Purpose
torchgeo                       Geospatial datasets and samplers for PyTorch
segmentation-models-pytorch    Pretrained encoder-decoder architectures
rasterio / GDAL                Raster I/O and geospatial operations
albumentations                 Fast image augmentation
QGIS                           Visualization, label creation, and QA
pytorch-lightning              Training loop boilerplate reduction
wandb / MLflow                 Experiment tracking

Conclusion

Training a neural network for land cover classification is a mature, well-supported workflow — but the details matter enormously. Spatial data leakage, class imbalance, tile-boundary artifacts, and domain shift are all real failure modes that trip up practitioners relying on generic computer vision recipes. By treating the geospatial context seriously at every stage — from spatial splits to georeferenced output — you can build models that generalize reliably across landscapes and form the backbone of operational land monitoring systems.

The field is also moving fast. Foundation models pretrained on terabytes of Earth observation data are beginning to make the feature extraction stage largely plug-and-play, shifting the engineering challenge toward high-quality labeling, robust evaluation, and thoughtful deployment. For practitioners entering the space today, there has never been a better time to build.


Further reading: Ronneberger et al. (2015) — U-Net; He et al. (2016) — Deep Residual Learning; Chen et al. (2018) — DeepLab v3+; Cong et al. (2022) — SatMAE
