Using Random Forests for Flood Risk Prediction in GIS
Introduction
Floods are among the most destructive and frequently occurring natural disasters worldwide, affecting millions of people and causing billions of dollars in damage every year. Accurate spatial prediction of flood risk is not just an academic exercise — it directly informs land-use planning, emergency response, infrastructure investment, and climate adaptation policy.
Geographic Information Systems (GIS) have long been the backbone of flood hazard mapping. Traditionally, physics-based hydrological models — such as HEC-RAS or MIKE FLOOD — have dominated the field, simulating water flow using terrain, rainfall, and channel geometry data. While powerful, these models are computationally intensive and require extensive calibration data that may not be available in data-scarce regions.
Machine learning, and specifically Random Forests (RF), has emerged as a compelling complement — and sometimes alternative — to these physics-based approaches. Random Forests can learn complex, non-linear relationships between terrain and environmental variables and historical flood occurrence, producing spatially explicit risk maps at scale with remarkable efficiency.
This article walks through the core concepts, data requirements, workflow, and practical considerations for applying Random Forests to flood risk prediction in a GIS context.
What Is a Random Forest?
A Random Forest is an ensemble machine learning algorithm introduced by Leo Breiman in 2001. It builds a large number of decision trees during training and outputs the mode (for classification) or mean (for regression) of the individual trees’ predictions.
Two key sources of randomness give the model its name and its power:
- Bootstrap aggregating (bagging): Each tree is trained on a random subset of the training data, sampled with replacement.
- Feature randomness: At each split in a tree, only a random subset of input features is considered, reducing correlation between individual trees.
The result is a model that is:
- Robust to overfitting due to averaging across many diverse trees
- Resistant to noise in input features
- Capable of handling mixed data types — continuous variables like elevation, categorical ones like land use
- Interpretable through feature importance metrics
- Scalable to large spatial datasets without GPU requirements
These properties make Random Forests particularly well-suited for geospatial classification problems like flood susceptibility mapping.
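As a minimal illustration (on synthetic data, not a real flood inventory), the two sources of randomness map directly onto scikit-learn constructor arguments: `bootstrap=True` enables bagging over rows, and `max_features` controls the random feature subset considered at each split:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                  # six synthetic "conditioning factors"
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic flood / non-flood label

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    bootstrap=True,       # each tree trains on a bootstrap sample of the rows
    max_features="sqrt",  # each split considers a random subset of features
    random_state=0,
).fit(X, y)

# Individual trees disagree; the ensemble aggregates their votes
print(len(rf.estimators_))   # 100 fitted decision trees
print(rf.score(X, y))        # training accuracy of the ensemble
```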
The Flood Risk Prediction Framework
Flood susceptibility mapping using Random Forests is fundamentally a binary or multi-class spatial classification problem: for every pixel or polygon in a study area, predict whether it is flood-prone or not — or assign a relative risk class (Low / Medium / High / Very High).
The workflow can be broken into five stages:
- Inventory compilation (where have floods occurred?)
- Feature engineering (what terrain and environmental variables matter?)
- Model training and validation
- Spatial prediction and map generation
- Interpretation and uncertainty assessment
Stage 1: Flood Inventory Compilation
The quality of a flood susceptibility model is only as good as the historical flood data it learns from. Flood inventory datasets typically come from:
- Remote sensing: Synthetic Aperture Radar (SAR) imagery (e.g., Sentinel-1) can detect flooded surfaces through cloud cover. Optical imagery (Landsat, Sentinel-2) is useful post-flood using indices like the Normalized Difference Water Index (NDWI).
- Government records: National and regional disaster agencies often maintain georeferenced flood event databases.
- News and crowd-sourced data: Platforms like Copernicus Emergency Management Service, GDACS, and even OpenStreetMap edits during disasters provide valuable point-level records.
- Field surveys: GPS-referenced observations from ground-truth campaigns.
The output of this stage is a set of flood occurrence points (positive class) and an equal or greater number of non-flood points (negative class) randomly sampled from areas confirmed not to have flooded. The balance and spatial distribution of these samples is critical — spatial autocorrelation between training and validation samples must be carefully managed to avoid overly optimistic accuracy metrics.
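For the optical case, NDWI is straightforward to compute from the green and near-infrared bands. A minimal numpy sketch (band arrays are assumed already loaded, e.g. via rasterio; the toy reflectance values are illustrative):

```python
import numpy as np

def ndwi(green, nir, eps=1e-10):
    """Normalized Difference Water Index: (Green - NIR) / (Green + NIR).
    Values well above zero typically indicate open water."""
    green = green.astype("float32")
    nir = nir.astype("float32")
    return (green - nir) / (green + nir + eps)

# Toy 2x2 reflectances: water pixels reflect green strongly, NIR weakly
green = np.array([[0.30, 0.05], [0.28, 0.06]])
nir   = np.array([[0.05, 0.40], [0.06, 0.35]])
water_mask = ndwi(green, nir) > 0   # candidate flooded-surface mask
print(water_mask)
```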
Stage 2: Feature Engineering — The Conditioning Factors
The predictive power of the model depends on selecting geomorphological, hydrological, and environmental variables — often called flood conditioning factors — that physically relate to flood susceptibility. These are derived primarily from a Digital Elevation Model (DEM) and ancillary spatial datasets.
Terrain-Derived Variables (from DEM)
| Feature | Description | Flood Relevance |
|---|---|---|
| Elevation | Height above sea level | Lower areas accumulate runoff |
| Slope | Rate of elevation change | Low slopes slow water movement |
| Topographic Wetness Index (TWI) | ln(upslope area / tan(slope)) | High TWI = greater soil saturation potential |
| Flow Accumulation | Cumulative upstream drainage area | Large values indicate stream channels |
| Distance to River | Euclidean or network distance | Proximity to channels increases risk |
| Curvature | Profile and plan curvature | Concave areas concentrate flow |
| Terrain Ruggedness Index (TRI) | Variation in surrounding elevation | Smoother terrain = larger flood extents |
Hydrological Variables
- Stream Power Index (SPI): Measures erosive power of surface flow. High SPI correlates with flood generation capacity.
- Drainage density: Length of stream channels per unit area — higher density means faster concentration of runoff.
- Rainfall intensity: Mean annual rainfall or return-period storm intensities from climate datasets (e.g., CHIRPS, ERA5).
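Both TWI and SPI reduce to a few lines of numpy once slope and specific catchment area rasters are available. A sketch using one common formulation (the small epsilon constants guard against division by zero and log of zero; they are an implementation choice, not part of the canonical definitions):

```python
import numpy as np

def twi(catchment_area, slope_rad, eps=1e-6):
    """Topographic Wetness Index: ln(a / tan(beta))."""
    return np.log((catchment_area + eps) / (np.tan(slope_rad) + eps))

def spi(catchment_area, slope_rad):
    """Stream Power Index (one common form): a * tan(beta)."""
    return catchment_area * np.tan(slope_rad)

# Toy values: a flat cell draining a large area vs. a steep cell draining little
area  = np.array([5000.0, 50.0])   # specific catchment area (m^2/m)
slope = np.radians([1.0, 30.0])    # slope, degrees converted to radians
print(twi(area, slope))            # flat, convergent cell gets much higher TWI
print(spi(area, slope))
```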
Land Cover and Soil Variables
- Land use / land cover (LULC): Urban impervious surfaces increase runoff; forests and wetlands buffer it.
- Soil type / hydrologic soil group: Clay-dominated soils have low infiltration rates and generate more surface runoff.
- Curve Number (CN): A composite index from the USDA SCS method combining LULC and soil data.
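The Curve Number feeds the standard SCS runoff equation, Q = (P − 0.2S)² / (P + 0.8S) for P > 0.2S, with S = 1000/CN − 10. A sketch in the method's original inch units:

```python
import numpy as np

def scs_runoff(P, CN):
    """SCS Curve Number direct runoff (inches). P: storm rainfall depth (inches)."""
    S = 1000.0 / CN - 10.0   # potential maximum retention
    Ia = 0.2 * S             # initial abstraction (standard SCS assumption)
    return np.where(P > Ia, (P - Ia) ** 2 / (P - Ia + S), 0.0)

# Same 3-inch storm on permeable soil (CN=60) vs. an urban surface (CN=95)
print(scs_runoff(3.0, 60))   # ~0.33 in: most rainfall infiltrates
print(scs_runoff(3.0, 95))   # ~2.45 in: most rainfall becomes runoff
```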
Anthropogenic Factors
- Distance to roads: Roads alter natural drainage patterns and can act as flow barriers.
- Population density / built-up area: Useful for weighting risk in terms of exposure.
All raster layers must be resampled to a common spatial resolution and coordinate reference system (CRS) before model training.
Stage 3: Model Training and Validation
Data Preparation in GIS
In a GIS workflow (using QGIS, ArcGIS Pro, or Python with geopandas and rasterio):
- Stack all conditioning factor rasters into a multi-band raster.
- Extract pixel values at each flood and non-flood training point.
- Export the result as a tabular dataset (CSV or GeoDataFrame) with one row per sample.
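The extraction step amounts to indexing the stacked array at each sample's row/column position. A toy numpy sketch (in practice `rasterio`'s `src.sample()` or QGIS's Point Sampling Tool does the same against files on disk; the feature subset and point coordinates here are illustrative):

```python
import numpy as np
import pandas as pd

feature_names = ["elevation", "slope", "twi"]   # illustrative subset
stack = np.random.default_rng(1).normal(size=(3, 100, 100))  # (bands, rows, cols)

# Training points already converted from map coordinates to row/col indices
sample_rows = np.array([10, 50, 80])
sample_cols = np.array([20, 60, 30])
labels = np.array([1, 0, 1])   # 1 = flooded, 0 = non-flood

# Fancy indexing pulls every band's value at each point in one shot
values = stack[:, sample_rows, sample_cols].T   # shape: (n_points, n_bands)
samples = pd.DataFrame(values, columns=feature_names)
samples["flood_label"] = labels
print(samples.shape)
```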
Training the Model in Python
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Load training samples
df = pd.read_csv("flood_samples.csv")

features = ["elevation", "slope", "twi", "flow_acc", "dist_river",
            "spi", "rainfall", "lulc", "soil_cn", "curvature"]
X = df[features]
y = df["flood_label"]  # 1 = flooded, 0 = non-flood

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rf = RandomForestClassifier(
    n_estimators=300,
    max_features="sqrt",
    min_samples_leaf=5,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
```
Validation Metrics
For flood susceptibility models, the most relevant performance indicators are:
- AUC-ROC (Area Under the Receiver Operating Characteristic Curve): Values above 0.85 are generally considered excellent for spatial hazard models.
- True Skill Statistic (TSS): Accounts for both sensitivity and specificity — more robust than overall accuracy when classes are imbalanced.
- F1-Score: Harmonic mean of precision and recall, useful when false negatives (missed flood areas) are costly.
- Spatial cross-validation: Standard random splits violate the independence assumption in spatial data. Use spatial k-fold cross-validation (e.g., scikit-learn's `GroupKFold` with spatial blocks as groups) to get honest generalization estimates.
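A minimal sketch of block-based spatial cross-validation: each sample is assigned to a coarse grid cell, and `GroupKFold` keeps all samples from a cell in the same fold, so a fold's validation points are never immediate neighbours of its training points. (The 20 km block size and synthetic data here are arbitrary; in practice the block size should exceed the autocorrelation range of the flood inventory.)

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n = 600
x_coord = rng.uniform(0, 100_000, n)   # sample easting (m)
y_coord = rng.uniform(0, 100_000, n)   # sample northing (m)
X = rng.normal(size=(n, 5))            # stand-in conditioning factors
y = (X[:, 0] > 0).astype(int)          # stand-in flood label

# Assign each sample to a 20 km x 20 km spatial block
block_size = 20_000
groups = (x_coord // block_size).astype(int) * 1000 + (y_coord // block_size).astype(int)

cv = GroupKFold(n_splits=5)
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, groups=groups, cv=cv, scoring="roc_auc")
print(scores.mean())
```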
Stage 4: Spatial Prediction and Map Generation
Once trained and validated, the model is applied to every pixel across the study area to produce a continuous flood susceptibility index (FSI) between 0 and 1, or a classified risk map.
```python
import numpy as np
import rasterio

# Open the multi-band conditioning-factor stack (filename is illustrative)
with rasterio.open("conditioning_stack.tif") as src:
    stack = src.read()        # shape: (n_bands, rows, cols)
    profile = src.profile     # keeps CRS, transform, dimensions

n_bands, rows, cols = stack.shape
X_full = stack.reshape(n_bands, -1).T   # shape: (rows*cols, n_features)

# Predict probability for the flood class
prob_flat = rf.predict_proba(X_full)[:, 1]
prob_map = prob_flat.reshape(rows, cols)

# Write to GeoTIFF, reusing the input raster's georeferencing
profile.update(count=1, dtype="float32")
with rasterio.open("flood_susceptibility.tif", "w", **profile) as dst:
    dst.write(prob_map.astype("float32"), 1)
```
The resulting raster can then be:
- Classified into susceptibility zones (e.g., using Natural Breaks / Jenks classification)
- Overlaid with infrastructure, population, and land use data in QGIS or ArcGIS Pro
- Integrated into web GIS platforms for public-facing flood risk dashboards
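A simple way to classify the continuous probability surface into zones is shown below, using quantile breaks as a stand-in (Natural Breaks / Jenks, e.g. via the `mapclassify` package, is the more common choice in the literature; the synthetic raster is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
prob_map = rng.uniform(0, 1, size=(200, 200))   # stand-in susceptibility raster

# Four classes (Low / Medium / High / Very High) at the 25/50/75% quantiles
breaks = np.quantile(prob_map, [0.25, 0.50, 0.75])
risk_class = np.digitize(prob_map, breaks)      # integer classes 0..3
print(np.bincount(risk_class.ravel()))          # roughly equal class counts
```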
Stage 5: Feature Importance and Interpretability
One of the key advantages of Random Forests over black-box deep learning models is the availability of built-in feature importance metrics. The two most commonly used are:
- Mean Decrease in Impurity (MDI): Averages the reduction in Gini impurity contributed by each feature across all trees. Fast but can be biased toward high-cardinality features.
- Permutation Importance: Measures the drop in model accuracy when a feature’s values are randomly shuffled. More reliable and model-agnostic.
```python
import matplotlib.pyplot as plt

importances = pd.Series(rf.feature_importances_, index=features)
importances.sort_values().plot(kind="barh", figsize=(8, 5))
plt.title("Random Forest Feature Importances (MDI)")
plt.tight_layout()
plt.savefig("feature_importance.png", dpi=150)
```
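Permutation importance is available directly in scikit-learn. A self-contained sketch on held-out data (rebuilt here on synthetic features rather than the real conditioning factors, so the names and label rule are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(400, 4)),
                 columns=["elevation", "slope", "twi", "noise"])
y = ((X["twi"] > 0) & (X["elevation"] < 0.5)).astype(int)  # twi drives the label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each feature on the test set and measure the resulting accuracy drop
result = permutation_importance(rf, X_test, y_test, n_repeats=20, random_state=0)
ranking = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(ranking)   # twi should dominate, the pure-noise feature should sit near zero
```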
Typical findings in flood susceptibility studies rank TWI, distance to river, elevation, and flow accumulation among the most influential variables — consistent with hydrological theory and a useful sanity check on model behavior.
For deeper interpretability, SHAP (SHapley Additive exPlanations) values can be computed to understand how individual features contribute to predictions for specific locations:
```python
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
# For binary classifiers, many SHAP versions return one array per class;
# index [1] selects the flood class
shap.summary_plot(shap_values[1], X_test, feature_names=features)
```
Practical Considerations and Limitations
Strengths
- Handles non-linear relationships and interactions between terrain variables without explicit specification
- Requires no assumptions about the distribution of input data
- Computationally efficient — a full study area prediction typically runs in minutes on a standard workstation
- Produces probabilistic outputs useful for uncertainty communication
Limitations
- Extrapolation risk: Random Forests do not extrapolate beyond the range of training data. If the study region contains terrain types not represented in the training samples, predictions will be unreliable.
- Spatial autocorrelation: Flood occurrence is spatially clustered. Standard cross-validation inflates performance metrics. Always use spatial cross-validation.
- Static model: A trained model captures the flood regime at the time of the training inventory. Changes in land use, drainage infrastructure, or climate may reduce its validity over time.
- No dynamic simulation: Random Forests predict susceptibility (where), not when or how deeply an area will flood. They are not substitutes for hydraulic models when stage or discharge predictions are needed.
- Sample quality dependency: The model is only as good as the flood inventory. Biased or incomplete historical records produce biased susceptibility maps.
Integrating Random Forests with Traditional GIS Workflows
Random Forest outputs are most useful when embedded within a broader GIS analytical framework:
- Multi-hazard risk assessment: Combine flood susceptibility with landslide, earthquake, or drought layers to produce composite risk surfaces.
- Exposure analysis: Overlay the susceptibility raster with building footprints, road networks, and population grids to quantify elements at risk.
- Climate scenario modelling: Retrain or recalibrate models using projected rainfall intensity data from CMIP6 climate scenarios to produce future flood risk estimates.
- Change detection: Compare susceptibility maps across time periods to identify areas of increasing or decreasing risk driven by land use change.
Tools like Google Earth Engine enable scaling this workflow to continental or global extents by combining cloud-based raster processing with machine learning APIs, eliminating the bottleneck of local data download and processing.
Conclusion
Random Forests occupy a valuable niche in the flood risk prediction toolkit — offering a scalable, data-driven approach that complements physics-based hydrological models rather than replacing them. When paired with robust spatial data, careful sample design, and rigorous validation using spatial cross-validation, they can produce flood susceptibility maps with accuracy competitive with far more complex approaches.
As geospatial data availability continues to grow — driven by open satellite archives, global DEM products like Copernicus GLO-30, and expanding climate datasets — the accessibility and utility of machine learning-based flood mapping will only increase. For GIS practitioners, understanding when and how to deploy Random Forests as part of a multi-method flood risk framework is becoming an essential professional competency.
Further Reading and Tools
- scikit-learn: `sklearn.ensemble.RandomForestClassifier` — primary Python implementation
- SHAP library: Model interpretability and feature contribution analysis
- QGIS / WhiteboxTools: Open-source terrain analysis and DEM-derived variable computation
- Google Earth Engine: Cloud-scale geospatial ML pipeline deployment
- Copernicus Emergency Management Service (CEMS): Historical flood extent data for inventory compilation
- CHIRPS / ERA5: Global gridded rainfall datasets for hydroclimatic conditioning factors
Keywords: flood susceptibility mapping, random forest, GIS, machine learning, geospatial hazard assessment, terrain analysis, TWI, spatial cross-validation, remote sensing
