Using Random Forests for Flood Risk Prediction in GIS

Introduction

Floods are among the most destructive and frequently occurring natural disasters worldwide, affecting millions of people and causing billions of dollars in damage every year. Accurate spatial prediction of flood risk is not just an academic exercise — it directly informs land-use planning, emergency response, infrastructure investment, and climate adaptation policy.

Geographic Information Systems (GIS) have long been the backbone of flood hazard mapping. Traditionally, physics-based hydrological models — such as HEC-RAS or MIKE FLOOD — have dominated the field, simulating water flow using terrain, rainfall, and channel geometry data. While powerful, these models are computationally intensive and require extensive calibration data that may not be available in data-scarce regions.

Machine learning, and specifically Random Forests (RF), has emerged as a compelling complement — and sometimes alternative — to these physics-based approaches. Random Forests can learn complex, non-linear relationships between terrain and environmental variables and historical flood occurrence, producing spatially explicit risk maps at scale with remarkable efficiency.

This article walks through the core concepts, data requirements, workflow, and practical considerations for applying Random Forests to flood risk prediction in a GIS context.


What Is a Random Forest?

A Random Forest is an ensemble machine learning algorithm introduced by Leo Breiman in 2001. It builds a large number of decision trees during training and outputs the mode (for classification) or mean (for regression) of the individual trees’ predictions.

Two key sources of randomness give the model its name and its power:

  1. Bootstrap aggregating (bagging): Each tree is trained on a random subset of the training data, sampled with replacement.
  2. Feature randomness: At each split in a tree, only a random subset of input features is considered, reducing correlation between individual trees.
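The two sources of randomness can be sketched in a few lines of numpy — an illustrative toy, not scikit-learn's actual internals:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 100, 10

# 1. Bagging: each tree trains on a bootstrap sample (drawn with replacement)
bootstrap_idx = rng.choice(n_samples, size=n_samples, replace=True)

# 2. Feature randomness: each split considers only sqrt(n_features) candidates
n_candidates = int(np.sqrt(n_features))
split_features = rng.choice(n_features, size=n_candidates, replace=False)

# A bootstrap sample leaves out roughly 36.8% of rows ("out-of-bag"),
# which Random Forests exploit for internal error estimation
oob_fraction = 1 - len(np.unique(bootstrap_idx)) / n_samples
```

The out-of-bag rows give each tree a free validation set, which is why Random Forests can report an error estimate without a held-out split.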

The result is a model that is:

  • Robust to overfitting due to averaging across many diverse trees
  • Resistant to noise in input features
  • Capable of handling mixed data types — continuous variables like elevation, categorical ones like land use
  • Interpretable through feature importance metrics
  • Scalable to large spatial datasets without GPU requirements

These properties make Random Forests particularly well-suited for geospatial classification problems like flood susceptibility mapping.


The Flood Risk Prediction Framework

Flood susceptibility mapping using Random Forests is fundamentally a binary or multi-class spatial classification problem: for every pixel or polygon in a study area, predict whether it is flood-prone or not — or assign a relative risk class (Low / Medium / High / Very High).

The workflow can be broken into five stages:

  1. Inventory compilation (where have floods occurred?)
  2. Feature engineering (what terrain and environmental variables matter?)
  3. Model training and validation
  4. Spatial prediction and map generation
  5. Interpretation and uncertainty assessment

Stage 1: Flood Inventory Compilation

The quality of a flood susceptibility model is only as good as the historical flood data it learns from. Flood inventory datasets typically come from:

  • Remote sensing: Synthetic Aperture Radar (SAR) imagery (e.g., Sentinel-1) can detect flooded surfaces through cloud cover. Optical imagery (Landsat, Sentinel-2) is useful post-flood using indices like the Normalized Difference Water Index (NDWI).
  • Government records: National and regional disaster agencies often maintain georeferenced flood event databases.
  • News and crowd-sourced data: Platforms like Copernicus Emergency Management Service, GDACS, and even OpenStreetMap edits during disasters provide valuable point-level records.
  • Field surveys: GPS-referenced observations from ground-truth campaigns.

The output of this stage is a set of flood occurrence points (positive class) and an equal or greater number of non-flood points (negative class) randomly sampled from areas confirmed not to have flooded. The balance and spatial distribution of these samples are critical — spatial autocorrelation between training and validation samples must be carefully managed to avoid overly optimistic accuracy metrics.
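A minimal sketch of the sampling step, assuming the flood inventory has already been rasterized to a boolean mask (the mask shape and extent here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical flood mask: True where a flood was observed (rows x cols)
flood_mask = np.zeros((200, 200), dtype=bool)
flood_mask[50:80, 30:120] = True

# Positive samples: flooded pixel locations
pos_rows, pos_cols = np.nonzero(flood_mask)

# Negative samples: an equal number drawn at random from non-flood pixels
neg_rows, neg_cols = np.nonzero(~flood_mask)
keep = rng.choice(len(neg_rows), size=len(pos_rows), replace=False)
neg_rows, neg_cols = neg_rows[keep], neg_cols[keep]

# Combined label vector: 1 = flooded, 0 = non-flood
labels = np.concatenate([np.ones(len(pos_rows)), np.zeros(len(neg_rows))])
```

In practice the negative pool should also exclude water bodies and pixels with uncertain flood status, and samples are often thinned to reduce spatial clustering.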


Stage 2: Feature Engineering — The Conditioning Factors

The predictive power of the model depends on selecting geomorphological, hydrological, and environmental variables — often called flood conditioning factors — that physically relate to flood susceptibility. These are derived primarily from a Digital Elevation Model (DEM) and ancillary spatial datasets.

Terrain-Derived Variables (from DEM)

  • Elevation: height above sea level. Lower areas accumulate runoff.
  • Slope: rate of elevation change. Low slopes slow water movement.
  • Topographic Wetness Index (TWI): ln(upslope area / tan(slope)). High TWI indicates greater soil saturation potential.
  • Flow Accumulation: cumulative upstream drainage area. Large values indicate stream channels.
  • Distance to River: Euclidean or network distance. Proximity to channels increases risk.
  • Curvature: profile and plan curvature. Concave areas concentrate flow.
  • Terrain Ruggedness Index (TRI): variation in surrounding elevation. Smoother terrain allows larger flood extents.
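The TWI definition above translates directly into code. A minimal sketch, assuming a flow accumulation raster (in cells) and a slope raster (in degrees) already derived from the DEM:

```python
import numpy as np

def twi(flow_acc_cells: np.ndarray, slope_deg: np.ndarray,
        cell_size: float = 30.0) -> np.ndarray:
    """Topographic Wetness Index: ln(specific upslope area / tan(slope))."""
    # Specific catchment area: upslope area per unit contour width;
    # +1 counts the cell itself so headwater cells are not zero
    upslope_area = (flow_acc_cells + 1) * cell_size
    slope_rad = np.radians(slope_deg)
    # Clamp tan(slope) to avoid division by zero on perfectly flat cells
    tan_slope = np.maximum(np.tan(slope_rad), 1e-6)
    return np.log(upslope_area / tan_slope)
```

Flat, high-accumulation cells (valley bottoms) score highest, matching the table's intuition. Terrain tools like WhiteboxTools compute TWI directly; this sketch just shows the arithmetic.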

Hydrological Variables

  • Stream Power Index (SPI): Measures erosive power of surface flow. High SPI correlates with flood generation capacity.
  • Drainage density: Length of stream channels per unit area — higher density means faster concentration of runoff.
  • Rainfall intensity: Mean annual rainfall or return-period storm intensities from climate datasets (e.g., CHIRPS, ERA5).

Land Cover and Soil Variables

  • Land use / land cover (LULC): Urban impervious surfaces increase runoff; forests and wetlands buffer it.
  • Soil type / hydrologic soil group: Clay-dominated soils have low infiltration rates and generate more surface runoff.
  • Curve Number (CN): A composite index from the USDA SCS method combining LULC and soil data.
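The Curve Number feeds the standard SCS runoff equation, which is worth seeing worked out: retention S = 1000/CN − 10 (inches), initial abstraction Ia = 0.2·S, and direct runoff Q = (P − Ia)² / (P − Ia + S) for rainfall P exceeding Ia:

```python
def scs_runoff(p_inches: float, cn: float) -> float:
    """Direct runoff depth (inches) via the USDA SCS Curve Number method."""
    s = 1000.0 / cn - 10.0   # potential maximum retention
    ia = 0.2 * s             # initial abstraction (interception, infiltration)
    if p_inches <= ia:
        return 0.0           # all rainfall absorbed before runoff begins
    return (p_inches - ia) ** 2 / (p_inches - ia + s)
```

For a 4-inch storm, an urban catchment (CN ≈ 90) sheds far more runoff than a forested one (CN ≈ 60) — exactly the LULC effect the bullet above describes.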

Anthropogenic Factors

  • Distance to roads: Roads alter natural drainage patterns and can act as flow barriers.
  • Population density / built-up area: Useful for weighting risk in terms of exposure.

All raster layers must be resampled to a common spatial resolution and coordinate reference system (CRS) before model training.
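Reprojection and resampling are normally handled by GIS tooling (gdalwarp, or rasterio.warp.reproject in Python), but the nearest-neighbour idea at the core of grid alignment is simple enough to sketch in numpy, assuming the arrays already share a CRS and extent:

```python
import numpy as np

def resample_nearest(arr: np.ndarray, out_shape: tuple) -> np.ndarray:
    """Nearest-neighbour resample of a 2-D raster to a new grid shape."""
    rows_in, cols_in = arr.shape
    rows_out, cols_out = out_shape
    # Map each output cell back to the nearest input cell index
    r_idx = (np.arange(rows_out) * rows_in / rows_out).astype(int)
    c_idx = (np.arange(cols_out) * cols_in / cols_out).astype(int)
    return arr[np.ix_(r_idx, c_idx)]
```

Use nearest-neighbour for categorical layers (LULC, soil class) and bilinear or cubic resampling for continuous ones (elevation, rainfall) to avoid inventing in-between categories.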


Stage 3: Model Training and Validation

Data Preparation in GIS

In a GIS workflow (using QGIS, ArcGIS Pro, or Python with geopandas and rasterio):

  1. Stack all conditioning factor rasters into a multi-band raster.
  2. Extract pixel values at each flood and non-flood training point.
  3. Export the result as a tabular dataset (CSV or GeoDataFrame) with one row per sample.
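Step 2 — extracting pixel values at point locations — reduces to converting map coordinates into row/column indices via the raster's geotransform. A minimal sketch for a north-up raster with square cells (rasterio's dataset.sample() does the same job on real files):

```python
import numpy as np

def sample_raster(stack: np.ndarray, xs, ys,
                  x_min: float, y_max: float, cell: float) -> np.ndarray:
    """Extract band values at point coordinates from a (bands, rows, cols) stack.

    Assumes a north-up raster: x_min/y_max is the top-left corner and
    `cell` the pixel size in CRS units.
    """
    cols = ((np.asarray(xs) - x_min) / cell).astype(int)
    rows = ((y_max - np.asarray(ys)) / cell).astype(int)
    return stack[:, rows, cols].T   # one row per point, one column per band
```

The returned array drops straight into a DataFrame with one column per conditioning factor, which is the tabular dataset step 3 exports.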

Training the Model in Python

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Load training samples
df = pd.read_csv("flood_samples.csv")

features = ["elevation", "slope", "twi", "flow_acc", "dist_river",
            "spi", "rainfall", "lulc", "soil_cn", "curvature"]

X = df[features]
y = df["flood_label"]  # 1 = flooded, 0 = non-flood

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rf = RandomForestClassifier(
    n_estimators=300,
    max_features="sqrt",
    min_samples_leaf=5,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")

Validation Metrics

For flood susceptibility models, the most relevant performance indicators are:

  • AUC-ROC (Area Under the Receiver Operating Characteristic Curve): Values above 0.85 are generally considered excellent for spatial hazard models.
  • True Skill Statistic (TSS): Accounts for both sensitivity and specificity — more robust than overall accuracy when classes are imbalanced.
  • F1-Score: Harmonic mean of precision and recall, useful when false negatives (missed flood areas) are costly.
  • Spatial cross-validation: Standard random splits violate the independence assumption in spatial data. Use spatial k-fold cross-validation (e.g., via the scikit-learn GroupKFold with spatial blocks) to get honest generalization estimates.
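A minimal sketch of the spatial blocking step, assuming each training sample carries x/y coordinates; the resulting integer ids can be passed as `groups` to scikit-learn's GroupKFold so that no block is split between training and validation:

```python
import numpy as np

def spatial_blocks(xs, ys, block_size: float) -> np.ndarray:
    """Assign each sample to a square spatial block for grouped CV."""
    bx = (np.asarray(xs) // block_size).astype(int)
    by = (np.asarray(ys) // block_size).astype(int)
    # Encode (bx, by) pairs as a single integer group id per block
    return bx * (by.max() + 1) + by
```

Typical usage: `groups = spatial_blocks(x_coords, y_coords, 5000)` (5 km blocks, a plausible but study-dependent choice), then `GroupKFold(n_splits=5).split(X, y, groups)`. Block size should exceed the range of spatial autocorrelation in the flood inventory.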

Stage 4: Spatial Prediction and Map Generation

Once trained and validated, the model is applied to every pixel across the study area to produce a continuous flood susceptibility index (FSI) between 0 and 1, or a classified risk map.

import numpy as np
import rasterio

# Open the stacked conditioning-factor raster (filename assumed; one
# band per feature, built in Stage 3)
with rasterio.open("conditioning_stack.tif") as src:
    stack = src.read()                     # shape: (n_bands, rows, cols)
    profile = src.profile                  # carries CRS, transform, extent

n_bands, rows, cols = stack.shape
X_full = stack.reshape(n_bands, -1).T      # shape: (rows*cols, n_features)

# Predict probability for the flood class
prob_flat = rf.predict_proba(X_full)[:, 1]
prob_map = prob_flat.reshape(rows, cols)

# Write to GeoTIFF, reusing the source raster's CRS and transform
profile.update(driver="GTiff", count=1, dtype="float32")
with rasterio.open("flood_susceptibility.tif", "w", **profile) as dst:
    dst.write(prob_map.astype("float32"), 1)

The resulting raster can then be:

  • Classified into susceptibility zones (e.g., using Natural Breaks / Jenks classification)
  • Overlaid with infrastructure, population, and land use data in QGIS or ArcGIS Pro
  • Integrated into web GIS platforms for public-facing flood risk dashboards
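Jenks/Natural Breaks requires an external implementation (e.g., the jenkspy library or the classifiers built into QGIS); quantile breaks make a simple stand-in for sketching the classification step:

```python
import numpy as np

def classify_susceptibility(prob_map: np.ndarray, n_classes: int = 4) -> np.ndarray:
    """Classify a 0-1 susceptibility raster into ordinal risk zones (1..n)."""
    # Interior quantile breaks; swap in Jenks breaks if preferred
    breaks = np.quantile(prob_map, np.linspace(0, 1, n_classes + 1)[1:-1])
    return np.digitize(prob_map, breaks) + 1   # 1 = Low ... n = Very High
```

Quantile classes contain equal pixel counts by construction, which reads well on a map legend but can mask skew in the underlying probabilities — worth noting in the map metadata.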

Stage 5: Feature Importance and Interpretability

One of the key advantages of Random Forests over black-box deep learning models is the availability of built-in feature importance metrics. The two most commonly used are:

  • Mean Decrease in Impurity (MDI): Averages the reduction in Gini impurity contributed by each feature across all trees. Fast but can be biased toward high-cardinality features.
  • Permutation Importance: Measures the drop in model accuracy when a feature’s values are randomly shuffled. More reliable and model-agnostic.

MDI importances come for free from the fitted model:

import matplotlib.pyplot as plt

importances = pd.Series(rf.feature_importances_, index=features)
importances.sort_values().plot(kind="barh", figsize=(8, 5))
plt.title("Random Forest Feature Importances (MDI)")
plt.tight_layout()
plt.savefig("feature_importance.png", dpi=150)

Typical findings in flood susceptibility studies rank TWI, distance to river, elevation, and flow accumulation among the most influential variables — consistent with hydrological theory and a useful sanity check on model behavior.

For deeper interpretability, SHAP (SHapley Additive exPlanations) values can be computed to understand how individual features contribute to predictions for specific locations:

import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

# Binary classifiers: older shap versions return a list of per-class
# arrays, newer ones a 3-D array; select the flood (positive) class
sv = shap_values[1] if isinstance(shap_values, list) else shap_values[:, :, 1]
shap.summary_plot(sv, X_test)

Practical Considerations and Limitations

Strengths

  • Handles non-linear relationships and interactions between terrain variables without explicit specification
  • Requires no assumptions about the distribution of input data
  • Computationally efficient — a full study area prediction typically runs in minutes on a standard workstation
  • Produces probabilistic outputs useful for uncertainty communication

Limitations

  • Extrapolation risk: Random Forests do not extrapolate beyond the range of training data. If the study region contains terrain types not represented in the training samples, predictions will be unreliable.
  • Spatial autocorrelation: Flood occurrence is spatially clustered. Standard cross-validation inflates performance metrics. Always use spatial cross-validation.
  • Static model: A trained model captures the flood regime at the time of the training inventory. Changes in land use, drainage infrastructure, or climate may reduce its validity over time.
  • No dynamic simulation: Random Forests predict susceptibility (where), not when or how deeply an area will flood. They are not substitutes for hydraulic models when stage or discharge predictions are needed.
  • Sample quality dependency: The model is only as good as the flood inventory. Biased or incomplete historical records produce biased susceptibility maps.

Integrating Random Forests with Traditional GIS Workflows

Random Forest outputs are most useful when embedded within a broader GIS analytical framework:

  • Multi-hazard risk assessment: Combine flood susceptibility with landslide, earthquake, or drought layers to produce composite risk surfaces.
  • Exposure analysis: Overlay the susceptibility raster with building footprints, road networks, and population grids to quantify elements at risk.
  • Climate scenario modelling: Retrain or recalibrate models using projected rainfall intensity data from CMIP6 climate scenarios to produce future flood risk estimates.
  • Change detection: Compare susceptibility maps across time periods to identify areas of increasing or decreasing risk driven by land use change.

Tools like Google Earth Engine enable scaling this workflow to continental or global extents by combining cloud-based raster processing with machine learning APIs, eliminating the bottleneck of local data download and processing.


Conclusion

Random Forests occupy a valuable niche in the flood risk prediction toolkit — offering a scalable, data-driven approach that complements physics-based hydrological models rather than replacing them. When paired with robust spatial data, careful sample design, and rigorous validation using spatial cross-validation, they can produce flood susceptibility maps with accuracy competitive with far more complex approaches.

As geospatial data availability continues to grow — driven by open satellite archives, global DEM products like Copernicus GLO-30, and expanding climate datasets — the accessibility and utility of machine learning-based flood mapping will only increase. For GIS practitioners, understanding when and how to deploy Random Forests as part of a multi-method flood risk framework is becoming an essential professional competency.


Further Reading and Tools

  • scikit-learn: sklearn.ensemble.RandomForestClassifier — primary Python implementation
  • SHAP library: Model interpretability and feature contribution analysis
  • QGIS / WhiteboxTools: Open-source terrain analysis and DEM-derived variable computation
  • Google Earth Engine: Cloud-scale geospatial ML pipeline deployment
  • Copernicus Emergency Management Service (CEMS): Historical flood extent data for inventory compilation
  • CHIRPS / ERA5: Global gridded rainfall datasets for hydroclimatic conditioning factors

Keywords: flood susceptibility mapping, random forest, GIS, machine learning, geospatial hazard assessment, terrain analysis, TWI, spatial cross-validation, remote sensing
