Training a Neural Network to Classify Land Cover from Aerial Photos
A practical guide for geospatial practitioners and remote sensing enthusiasts
Introduction
Land cover classification — the process of labeling every pixel in a satellite or aerial image with a category like forest, water, urban, or cropland — is one of the most impactful applications of deep learning in geospatial science. Accurate, up-to-date land cover maps underpin environmental monitoring, urban planning, disaster response, agricultural policy, and climate modeling.
Traditional classification workflows relied on handcrafted spectral indices and rule-based thresholds. Neural networks, particularly convolutional neural networks (CNNs) and their encoder-decoder variants, have largely supplanted these approaches by learning rich spatial-spectral features directly from labeled imagery. This article walks through the end-to-end process of training such a model — from data preparation to deployment-ready inference.
1. Understanding the Problem
Land cover classification is a semantic segmentation task. Unlike image-level classification (which assigns one label to an entire image), semantic segmentation assigns a class label to every pixel. The output of the model is a classification map — a raster of the same spatial extent and resolution as the input image, where each cell carries a predicted category.
Common Land Cover Classes
Depending on the use case and reference schema, typical classes include:
- Impervious surfaces — roads, rooftops, parking lots
- Vegetation — forests, shrublands, grasslands
- Cropland — agricultural fields, orchards
- Water bodies — rivers, lakes, reservoirs
- Bare soil / exposed rock
- Built-up / urban areas
Schemes like NLCD (National Land Cover Database), CORINE (Europe), or ESA WorldCover define standardized hierarchies that are widely used in training datasets.
2. Data Collection and Preparation
The quality and consistency of your training data will determine the ceiling of your model’s performance. This stage deserves at least as much effort as model design.
2.1 Imagery Sources
Aerial and satellite imagery differ in resolution, revisit frequency, and cost. Common options include:
| Source | Resolution | Bands | Notes |
|---|---|---|---|
| NAIP (USA) | 0.6–1 m | RGB + NIR | Free, covers contiguous US |
| Sentinel-2 | 10 m | 13 bands | Free, global, multispectral |
| Planet Basemaps | 3–5 m | RGB + NIR | Commercial, high cadence |
| Maxar/WorldView | 0.3–0.5 m | Multispectral | Commercial, very high resolution |
| OpenAerialMap | Varies | RGB | Community-contributed, free |
For a first project, NAIP imagery paired with the NLCD product is an excellent starting point — both are freely available from USGS.
2.2 Ground Truth Labels
Labels must align spatially and temporally with your imagery. Sources include:
- Existing land cover products — NLCD, ESA WorldCover, Dynamic World (Google)
- Manual digitization — using QGIS, ArcGIS Pro, or Labelbox
- Crowdsourced data — OpenStreetMap building/road layers for specific classes
When using pre-existing products as labels, verify temporal correspondence. A 2015 label map paired with 2023 imagery will introduce noise in areas that have changed.
2.3 Tiling the Data
Neural networks are trained on fixed-size image patches, not full-scene rasters. A standard pipeline:
- Reproject all imagery and labels to a common CRS (e.g., UTM zone for the AOI).
- Create a regular grid of patch extents across the study area (commonly 256×256 or 512×512 pixels).
- Export patches as GeoTIFFs — one image chip and one corresponding label chip per tile.
- Filter tiles with excessive NoData coverage (e.g., >20% cloud or null pixels).
Tools like rasterio, GDAL, and torchgeo streamline this process considerably.
```python
import rasterio
from rasterio.windows import Window

def export_tiles(src_path, label_path, out_dir, tile_size=256, stride=256):
    with rasterio.open(src_path) as src, rasterio.open(label_path) as lbl:
        # +1 so the final full tile row/column is not silently dropped
        for row_off in range(0, src.height - tile_size + 1, stride):
            for col_off in range(0, src.width - tile_size + 1, stride):
                window = Window(col_off, row_off, tile_size, tile_size)
                img_chip = src.read(window=window)
                lbl_chip = lbl.read(1, window=window)
                # Save chips to out_dir...
```
2.4 Train / Validation / Test Splits
Avoid random pixel-level splits — spatially proximate tiles share texture and context, causing data leakage. Instead, use a spatial block split: hold out contiguous geographic regions for validation and testing. This gives a fairer estimate of how the model will generalize to unseen areas.
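A spatial block split can be implemented by assigning each tile to a split based on the coarse grid block it falls in, so neighboring tiles never straddle train and validation. The sketch below is one illustrative way to do this; `assign_split`, the block size, and the split fractions are all assumptions, not a standard API:

```python
import hashlib

def assign_split(tile_x, tile_y, block_size=10, val_frac=0.15, test_frac=0.15):
    """Assign a tile to train/val/test by the coarse block it falls in.

    tile_x, tile_y: integer tile indices on the tiling grid. All tiles
    inside the same block_size x block_size block share a split, so
    spatially adjacent tiles stay in the same partition.
    """
    block = (tile_x // block_size, tile_y // block_size)
    # Hash the block id to a stable pseudo-random number in [0, 1)
    h = hashlib.md5(f"{block[0]}_{block[1]}".encode()).hexdigest()
    u = int(h[:8], 16) / 0xFFFFFFFF
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"

# Neighboring tiles land in the same split:
assert assign_split(3, 4) == assign_split(4, 4) == assign_split(3, 5)
```

Hashing the block id (rather than drawing random numbers) makes the split deterministic across runs, which keeps experiments comparable.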
3. Choosing a Model Architecture
3.1 U-Net: The Workhorse
Introduced in 2015 for biomedical image segmentation, U-Net has become the default starting point for land cover classification. Its key features:
- Encoder (contracting path): A series of convolutional blocks that progressively downsample the input, building increasingly abstract feature representations.
- Decoder (expansive path): Upsampling blocks that recover spatial resolution.
- Skip connections: Direct connections from encoder to decoder at each resolution level, preserving fine spatial detail that would otherwise be lost during downsampling.
The result is a model that simultaneously captures global context and local texture — essential for distinguishing visually similar classes like grassland and cropland at field boundaries.
3.2 Beyond U-Net
Modern variants and alternatives worth considering:
- ResNet / EfficientNet encoders — Replace the vanilla U-Net encoder with a pretrained ResNet or EfficientNet backbone for better feature extraction, especially with limited labeled data.
- DeepLab v3+ — Uses atrous (dilated) convolutions and an Atrous Spatial Pyramid Pooling (ASPP) module to capture multi-scale context without losing resolution.
- SegFormer — A transformer-based encoder that has shown strong performance on remote sensing benchmarks, particularly when large-scale pretraining is available.
- Swin-UNet — Uses Swin Transformer blocks in a U-Net-style encoder-decoder, offering a good balance of accuracy and efficiency.
For most projects with moderate dataset sizes (tens of thousands of tiles), a U-Net with a pretrained ResNet-34 or EfficientNet-B4 encoder (available through segmentation-models-pytorch) is an excellent choice.
```python
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=4,   # RGB + NIR
    classes=7,       # number of land cover classes
)
```
4. Training the Model
4.1 Loss Functions
Standard cross-entropy loss works but often struggles with class imbalance — urban scenes may be 80% impervious and 2% water, so a naive model learns to predict the majority class everywhere.
Better options:
- Weighted cross-entropy — Assign higher weight to rare classes, inversely proportional to their frequency.
- Focal Loss — Down-weights easy examples and focuses learning on hard, misclassified pixels.
- Dice Loss / IoU Loss — Directly optimizes for overlap between predicted and ground truth masks; less sensitive to class imbalance.
- Combo Loss — A weighted combination of cross-entropy and Dice loss, often the best empirical choice.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComboLoss(nn.Module):
    def __init__(self, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.ce = nn.CrossEntropyLoss()

    def forward(self, preds, targets):
        # preds: (N, C, H, W) logits; targets: (N, H, W) class indices
        ce_loss = self.ce(preds, targets)
        preds_soft = torch.softmax(preds, dim=1)
        targets_onehot = F.one_hot(
            targets, num_classes=preds.shape[1]
        ).permute(0, 3, 1, 2).float()
        dice = 1 - (2 * (preds_soft * targets_onehot).sum()) / \
               (preds_soft.sum() + targets_onehot.sum() + 1e-6)
        return self.alpha * ce_loss + (1 - self.alpha) * dice
```
4.2 Data Augmentation
Remote sensing imagery demands thoughtful augmentation. Standard photographic augmentations apply, plus domain-specific ones:
| Augmentation | Rationale |
|---|---|
| Horizontal / vertical flips | Aerial imagery has no canonical orientation |
| 90° rotations | Buildings and fields appear at arbitrary orientations |
| Random crops | Increases positional variety |
| Brightness / contrast jitter | Accounts for sensor and atmospheric variation |
| Random channel dropout | Simulates missing bands in multispectral data |
| Gaussian blur | Simulates variation in imagery sharpness |
| Cutout / GridMask | Forces the model to not rely on a single region |
Use albumentations for efficient, reproducible augmentation pipelines.
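The critical detail in segmentation augmentation is that every geometric transform must be applied identically to the image and its label mask, while photometric jitter touches the image only. albumentations handles this automatically via its `image=`/`mask=` targets; the numpy sketch below is an illustrative stand-in (`augment_pair` is a hypothetical name, not a library function) that makes the mechanics explicit:

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply the same random geometric transform to an image chip (C, H, W)
    and its label mask (H, W); brightness jitter hits the image only."""
    if rng.random() < 0.5:                     # horizontal flip
        image, mask = image[:, :, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                     # vertical flip
        image, mask = image[:, ::-1, :], mask[::-1, :]
    k = int(rng.integers(0, 4))                # random 90-degree rotation
    image = np.rot90(image, k, axes=(1, 2))
    mask = np.rot90(mask, k, axes=(0, 1))
    image = image * rng.uniform(0.9, 1.1)      # brightness jitter, image only
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)

rng = np.random.default_rng(0)
img = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
msk = np.arange(16).reshape(4, 4)
aug_img, aug_msk = augment_pair(img, msk, rng)
```

Note that brightness is never applied to the mask: label rasters hold class indices, and any interpolation or scaling of them corrupts the labels.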
4.3 Training Loop Best Practices
- Learning rate scheduling: Use a cosine annealing schedule or OneCycleLR — both consistently outperform a fixed learning rate.
- Mixed precision training: Use torch.cuda.amp for 16-bit training on compatible GPUs; typically 1.5–2× speedup with negligible accuracy loss.
- Gradient clipping: Helps with training stability, especially in early epochs.
- Early stopping: Monitor validation mean IoU and stop when it plateaus to prevent overfitting.
- Model checkpointing: Save the best checkpoint, not just the last epoch.
```python
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()
for epoch in range(max_epochs):
    for images, masks in train_loader:
        optimizer.zero_grad()
        with autocast():
            preds = model(images)
            loss = criterion(preds, masks)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()  # per epoch for CosineAnnealingLR; step per batch for OneCycleLR
```
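The early-stopping and checkpointing bullets above can be captured in a small framework-agnostic helper. This is an illustrative sketch (the `EarlyStopping` class and its parameters are assumptions, not part of any library mentioned here):

```python
class EarlyStopping:
    """Track validation mIoU; flag a stop when it fails to improve for
    `patience` epochs, and remember the best score (the epoch to checkpoint)."""

    def __init__(self, patience=10, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0
        self.should_stop = False

    def step(self, val_miou):
        improved = val_miou > self.best + self.min_delta
        if improved:
            self.best = val_miou
            self.bad_epochs = 0   # save the checkpoint at this point
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.should_stop = True
        return improved

stopper = EarlyStopping(patience=3)
for miou in [0.50, 0.55, 0.56, 0.56, 0.55, 0.56]:
    stopper.step(miou)
# stopper.should_stop is now True; stopper.best holds 0.56
```

Call `step()` once per epoch after validation; checkpoint only when it returns True, so the saved weights always correspond to the best validation mIoU rather than the last epoch.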
5. Evaluation Metrics
Pixel accuracy alone is misleading under class imbalance. Use:
- Mean Intersection over Union (mIoU) — The standard benchmark metric. Computes IoU per class and averages. An mIoU of 0.75+ is generally considered strong.
- Per-class IoU — Critical for understanding where the model fails. Water and impervious surfaces are usually easiest; shrubland vs. grassland is notoriously difficult.
- Precision / Recall per class — Helps diagnose whether errors are false positives (commission errors) or false negatives (omission errors).
- Confusion matrix — Reveals systematic confusions between spectrally similar classes.
- F1 score (Dice coefficient) — Useful for reporting results on imbalanced datasets.
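The confusion matrix, per-class IoU, and mIoU can all be derived from one pass over the rasters. A minimal numpy sketch (the function name is illustrative; libraries like torchmetrics provide equivalents):

```python
import numpy as np

def confusion_and_iou(pred, target, num_classes):
    """Confusion matrix from flattened prediction/target rasters, plus
    per-class IoU and mIoU. Rows = ground truth, columns = predictions."""
    pred, target = pred.ravel(), target.ravel()
    cm = np.bincount(target * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # commission errors: predicted c, actually other
    fn = cm.sum(axis=1) - tp   # omission errors: actually c, predicted other
    iou = tp / np.maximum(tp + fp + fn, 1)
    return cm, iou, iou.mean()

pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
cm, iou, miou = confusion_and_iou(pred, target, num_classes=2)
```

One caveat: a class absent from both prediction and ground truth gets IoU 0 here, which deflates mIoU; in practice such classes are usually masked out of the average.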
6. Common Failure Modes and Fixes
| Problem | Likely Cause | Fix |
|---|---|---|
| Model predicts majority class only | Class imbalance | Use focal loss or weighted CE; oversample rare classes |
| Sharp boundary artifacts | Tile edge effects during inference | Use overlapping tiles with soft blending in overlap zones |
| High validation loss despite good training loss | Spatial data leakage | Use spatial block splits for train/val/test |
| Poor generalization to new areas | Domain shift in imagery | Normalize by local statistics; use domain adaptation |
| Confusion between spectrally similar classes | Insufficient spectral depth | Add NIR / SWIR bands; use indices like NDVI, NDWI |
| Noisy predictions at parcel boundaries | Limited boundary supervision | Use boundary-aware loss terms or contour supervision |
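The last fix in the table above — adding spectral indices as input channels — is cheap to implement when NIR is available. A minimal sketch of the two most common indices (function names are illustrative):

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + eps)

def ndwi(green, nir, eps=1e-6):
    """Normalized Difference Water Index (McFeeters): (Green - NIR) / (Green + NIR)."""
    return (green - nir) / (green + nir + eps)

# Stack the indices as extra channels alongside the raw bands
red = np.array([[0.10, 0.40]])
nir = np.array([[0.50, 0.10]])
veg = ndvi(nir, red)   # high where vegetation reflects strongly in NIR
```

If you add indices, remember to bump `in_channels` on the model accordingly and to compute the same indices at inference time.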
7. Inference at Scale
Once trained, deploying the model over large areas introduces its own challenges.
7.1 Sliding Window with Overlap
Do not simply tile and predict independently. Use a sliding window with 50% overlap and average predictions in the overlap zones to suppress tile-boundary artifacts.
```python
def predict_with_overlap(model, image, num_classes, tile_size=512, overlap=0.5):
    # image: (C, H, W) tensor covering the full scene
    _, H, W = image.shape
    stride = int(tile_size * (1 - overlap))
    prediction_map = torch.zeros((num_classes, H, W))
    count_map = torch.zeros((1, H, W))
    for y in range(0, H - tile_size + 1, stride):
        for x in range(0, W - tile_size + 1, stride):
            tile = image[:, y:y+tile_size, x:x+tile_size]
            with torch.no_grad():
                pred = torch.softmax(model(tile.unsqueeze(0)), dim=1).squeeze(0)
            prediction_map[:, y:y+tile_size, x:x+tile_size] += pred
            count_map[:, y:y+tile_size, x:x+tile_size] += 1
    # clamp guards against division by zero in uncovered edge strips
    return (prediction_map / count_map.clamp(min=1)).argmax(0)
```
7.2 Writing Georeferenced Output
Always write predictions back as georeferenced GeoTIFFs, preserving the CRS and transform of the source imagery:
```python
import rasterio

with rasterio.open(source_path) as src:
    meta = src.meta.copy()
    meta.update({"count": 1, "dtype": "uint8"})

with rasterio.open("prediction.tif", "w", **meta) as dst:
    dst.write(prediction_array, 1)
```
8. Going Further
Once you have a working baseline, several directions can meaningfully improve results:
- Transfer learning from geospatial foundation models: Models like SatMAE, Scale-MAE, and Prithvi are pretrained on massive remote sensing datasets and provide far richer initializations than ImageNet weights.
- Multi-temporal inputs: Stack imagery from multiple dates to leverage phenological signatures (e.g., crops change dramatically across seasons; forests do not).
- Elevation and terrain features: Incorporate a DSM or DTM as additional input channels — elevation strongly constrains land cover type.
- Active learning: Strategically select the most uncertain tiles for human annotation to improve label efficiency.
- Post-processing with spatial context: Apply morphological operations, connected-component filtering, or a Conditional Random Field (CRF) as a post-processing step to enforce spatial coherence.
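The connected-component filtering mentioned in the last bullet is often implemented as a minimum mapping unit: patches smaller than a threshold are dissolved into their surroundings. A sketch using scipy.ndimage (the function name and threshold are illustrative, and this assumes scipy is available):

```python
import numpy as np
from scipy import ndimage

def remove_small_patches(classmap, min_pixels=4):
    """Dissolve connected patches of any class smaller than min_pixels,
    filling them with the value of the nearest surviving pixel."""
    cleaned = classmap.copy()
    mask_small = np.zeros(classmap.shape, dtype=bool)
    for c in np.unique(classmap):
        labels, n = ndimage.label(classmap == c)
        sizes = ndimage.sum(classmap == c, labels, range(1, n + 1))
        for i, size in enumerate(sizes, start=1):
            if size < min_pixels:
                mask_small |= labels == i
    if mask_small.any():
        # Indices of the nearest kept pixel for every removed pixel
        idx = ndimage.distance_transform_edt(
            mask_small, return_distances=False, return_indices=True)
        cleaned = cleaned[tuple(idx)]
    return cleaned

classmap = np.array([[1, 1, 1, 1],
                     [1, 2, 1, 1],
                     [1, 1, 1, 1],
                     [3, 3, 3, 3]])
cleaned = remove_small_patches(classmap, min_pixels=2)  # lone 2 dissolves into 1s
```

This enforces spatial coherence far more cheaply than a CRF, at the cost of erasing genuinely small features, so pick the threshold with the target map's minimum mapping unit in mind.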
Recommended Tools and Libraries
| Tool | Purpose |
|---|---|
| torchgeo | Geospatial datasets and samplers for PyTorch |
| segmentation-models-pytorch | Pretrained encoder-decoder architectures |
| rasterio / GDAL | Raster I/O and geospatial operations |
| albumentations | Fast image augmentation |
| QGIS | Visualization, label creation, and QA |
| pytorch-lightning | Training loop boilerplate reduction |
| wandb / MLflow | Experiment tracking |
Conclusion
Training a neural network for land cover classification is a mature, well-supported workflow — but the details matter enormously. Spatial data leakage, class imbalance, tile-boundary artifacts, and domain shift are all real failure modes that trip up practitioners relying on generic computer vision recipes. By treating the geospatial context seriously at every stage — from spatial splits to georeferenced output — you can build models that generalize reliably across landscapes and form the backbone of operational land monitoring systems.
The field is also moving fast. Foundation models pretrained on terabytes of Earth observation data are beginning to make the feature extraction stage largely plug-and-play, shifting the engineering challenge toward high-quality labeling, robust evaluation, and thoughtful deployment. For practitioners entering the space today, there has never been a better time to build.
Further reading: Ronneberger et al. (2015) — U-Net; He et al. (2016) — Deep Residual Learning; Chen et al. (2018) — DeepLab v3+; Cong et al. (2022) — SatMAE
