normet is an R package designed for air quality time-series analysis. It provides a powerful and user-friendly suite of tools for air quality research, causal inference, and policy evaluation.
- Automated & Intelligent: Powered by an H2O AutoML backend, it searches for the best-performing model automatically, reducing tedious manual tuning.
- All-in-One Solution: Offers high-level functions that cover the entire workflow: data preprocessing, model training, weather normalisation, decomposition, and counterfactual modelling.
- Robust Causal Inference: Integrates both classic (SCM) and machine-learning-based (ML-SCM) Synthetic Control Methods.
- Uncertainty Quantification: Provides comprehensive tools for uncertainty estimation, including Bootstrap, Jackknife, and Placebo Tests (in-space).
- High Performance: Built-in memory management and parallel processing for handling large datasets.
The analysis typically follows this sequence:
- Initialize: Start the H2O backend (nm_init_h2o).
- Prepare: Process raw data and create time-based features (nm_prepare_data).
- Train: Build predictive models (nm_train_model).
- Analyse:
  - Importance: Rank predictors by influence (nm_feature_importance).
  - Explain: Visualize marginal effects (nm_pdp).
  - Normalise: Remove weather effects (nm_normalise/nm_normalise_auto).
  - Decompose: Isolate emission vs. meteorological contributions (nm_decompose).
- Evaluate Policy: Estimate causal effects (nm_run_scm).
Install the latest development version from GitHub:
# install.packages("devtools")
devtools::install_github("normet-dev/normet-r")
normet relies on H2O. Ensure Java is installed, then install the h2o package:
install.packages("h2o")
For standard weather normalisation, use nm_do_all to handle data preparation, training, and normalisation in one step.
library(normet)
library(dplyr)
# 1. Initialize Backend
nm_init_h2o()
# 2. Load Data
data("MY1")
# 3. Define Features
# 'predictors' includes weather + time variables for training
predictors <- c("u10", "v10", "d2m", "t2m", "blh", "sp", "ssrd", "tcc", "tp", "rh2m",
                "date_unix", "day_julian", "weekday", "hour")
# 'weather_vars' are the variables to be resampled (shuffled) during normalisation
weather_vars <- c("u10", "v10", "d2m", "t2m", "blh", "sp", "ssrd", "tcc", "tp", "rh2m")
# 4. Run Pipeline
results <- nm_do_all(
  df = my1,
  value = "PM2.5",
  predictors = predictors,
  resample_vars = weather_vars,
  n_samples = 300, # Number of resampling iterations
  model_config = list(include_algos = c("GBM"), max_runtime_secs = 60, sort_metric = "AUTO")
)
# 5. Inspect Results
head(results$out) # Normalised Data (Date, Observed, Normalised)
print(results$model) # Trained Model
For greater control, you can execute each stage manually.
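The first manual stage adds time features to the raw data. As a rough illustration of what such features look like, the base-R sketch below derives the four time predictors named earlier (date_unix, day_julian, weekday, hour) from a timestamp column. The helper add_time_features, the column name date, and the exact encodings (for example, weekday starting at 0 for Sunday) are assumptions made for this example; they are not a description of what nm_prepare_data does internally.

```r
# Hypothetical helper: derive time features from a POSIXct `date` column.
# This is an illustration, not the code nm_prepare_data() runs.
add_time_features <- function(df, date_col = "date") {
  d <- df[[date_col]]
  stopifnot(inherits(d, "POSIXct"))
  lt <- as.POSIXlt(d, tz = "UTC")
  df$date_unix  <- as.numeric(d)  # seconds since 1970-01-01 UTC (long-term trend)
  df$day_julian <- lt$yday + 1    # day of year, 1..366 (seasonal cycle)
  df$weekday    <- lt$wday        # 0 = Sunday .. 6 = Saturday (weekly cycle)
  df$hour       <- lt$hour        # 0..23 (diurnal cycle)
  df
}

demo <- data.frame(
  date = as.POSIXct(c("2020-01-01 00:00:00", "2020-03-01 13:00:00"), tz = "UTC")
)
demo <- add_time_features(demo)
print(demo)
```

Features of this kind let the model separate slow trends and regular cycles from the weather signal, which is why they appear in predictors but not in weather_vars.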
# Prepare data with time features and train/test splits
df_prep <- nm_prepare_data(
  df = my1,
  value = 'PM2.5',
  predictors = weather_vars, # Will automatically add time features
  split_method = 'random',
  fraction = 0.75
)
# Configure H2O AutoML
h2o_cfg <- list(
  include_algos = c("GBM"),
  max_runtime_secs = 60,
  sort_metric = "AUTO"
)
# Train the model
model <- nm_train_model(
  df = df_prep,
  value = 'value',
  backend = "h2o",
  variables = predictors,
  model_config = h2o_cfg
)
# Evaluate Performance
nm_modStats(df_prep, model)
# (Optional) Save and Load Model
nm_save_model(model, path = "./", filename = "my_automl")
model <- nm_load_model(path = "./", filename = "my_automl")
Identify which variables have the strongest influence on the model's predictions.
# Extract feature importance table
importance_table <- nm_feature_importance(model)
print(head(importance_table))
Understand the specific relationship between variables and the pollutant (e.g., linear, non-linear) using Partial Dependence Plots.
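For readers unfamiliar with partial dependence, the general recipe is: fix one variable at a grid value for every row, predict, and average the predictions; repeating this across the grid traces the marginal effect. The base-R sketch below applies that recipe to a toy linear model. The helper partial_dependence and the toy columns x1, x2, and y are hypothetical, and nm_pdp may compute or return its results differently.

```r
# Toy data: y depends linearly on x1 and quadratically on x2.
set.seed(1)
n  <- 200
df <- data.frame(x1 = runif(n), x2 = runif(n, -1, 1))
df$y <- 2 * df$x1 + df$x2^2 + rnorm(n, sd = 0.05)

# Any fitted model with a predict() method would do; lm() keeps the sketch small.
fit <- lm(y ~ x1 + I(x2^2), data = df)

# Hypothetical helper: partial dependence of the prediction on one variable.
partial_dependence <- function(model, data, var, grid_size = 20) {
  grid <- seq(min(data[[var]]), max(data[[var]]), length.out = grid_size)
  pd <- vapply(grid, function(g) {
    tmp <- data
    tmp[[var]] <- g                      # force the variable to the grid value
    mean(predict(model, newdata = tmp))  # average over all other variables
  }, numeric(1))
  data.frame(variable = var, value = grid, pd = pd)
}

pd_x1 <- partial_dependence(fit, df, "x1")  # roughly a straight line
pd_x2 <- partial_dependence(fit, df, "x2")  # roughly a U-shaped curve
head(pd_x1)
```

Plotting pd against value for each variable shows whether the fitted relationship is linear (x1 here) or non-linear (x2 here).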
# Compute PDP for all variables
pdp_all <- nm_pdp(df_prep, model)
print(head(pdp_all))
# Compute PDP for specific variables
pdp_data <- nm_pdp(df_prep, model, var_list = c('blh', 'rh2m'))
print(head(pdp_data))
Use the trained model to generate the weather-normalised time-series. The function now auto-detects features from the model.
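Conceptually, resampling-based weather normalisation replaces the observed weather with weather drawn at random from the data set, predicts the pollutant for each draw, and averages the predictions so that mainly the non-weather signal remains. The sketch below shows this idea with a linear model on synthetic data. The helper normalise_sketch, the toy columns t and wind, and the choice to sample whole rows with replacement are assumptions for illustration and may differ from what nm_normalise does.

```r
set.seed(42)
n  <- 300
df <- data.frame(
  t    = seq_len(n),      # stands in for the time features
  wind = runif(n, 0, 10)  # stands in for the weather variables
)
# Synthetic truth: a declining "emission" trend plus a wind-driven dilution effect.
df$value <- 50 - 0.05 * df$t - 2 * df$wind + rnorm(n, sd = 1)

fit <- lm(value ~ t + wind, data = df)

# Hypothetical helper illustrating resampling-based normalisation.
normalise_sketch <- function(model, data, resample_vars, n_samples = 100) {
  preds <- replicate(n_samples, {
    shuffled <- data
    idx <- sample(nrow(data), replace = TRUE)
    # Sample whole rows of weather so co-varying variables stay together
    shuffled[resample_vars] <- data[idx, resample_vars, drop = FALSE]
    predict(model, newdata = shuffled)
  })
  rowMeans(preds)  # average prediction per time step
}

df$normalised <- normalise_sketch(fit, df, resample_vars = "wind", n_samples = 200)
head(df[, c("t", "value", "normalised")])
```

In this toy case the normalised series follows the underlying trend closely, while the observed series is dominated by wind-driven scatter.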
# aggregate = TRUE returns the mean normalised value
df_dew <- nm_normalise(
  df = df_prep,
  model = model,
  resample_vars = weather_vars,
  n_samples = 600,
  aggregate = TRUE
)
Instead of guessing n_samples, let the algorithm determine the optimal number of resampling iterations required for the result to stabilize.
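One plausible way to decide when a resampled average has stabilised is to add resampling passes in batches and stop once the running mean barely changes between checks. The sketch below implements that rule on synthetic data. The function find_n_samples, its batch, max_n, and tol parameters, and the relative-change criterion are all assumptions for illustration; the stopping rule used by nm_normalise_auto is not documented here and may be different. The returned names best_n and res simply mirror the fields used in the example that follows.

```r
set.seed(7)
n  <- 200
df <- data.frame(t = seq_len(n), wind = runif(n, 0, 10))
df$value <- 50 - 0.05 * df$t - 2 * df$wind + rnorm(n)
fit <- lm(value ~ t + wind, data = df)

# One resampling pass: shuffle the weather variable, then predict.
one_pass <- function() {
  tmp <- df
  tmp$wind <- sample(df$wind, replace = TRUE)
  predict(fit, newdata = tmp)
}

# Hypothetical stopping rule: add batches until the running mean moves by
# less than `tol` (relative to its mean level) between successive checks.
find_n_samples <- function(batch = 50, max_n = 2000, tol = 0.005) {
  total <- numeric(n)
  used <- 0
  previous <- NULL
  while (used < max_n) {
    for (i in seq_len(batch)) total <- total + one_pass()
    used    <- used + batch
    current <- total / used
    if (!is.null(previous)) {
      change <- mean(abs(current - previous)) / mean(abs(current))
      if (change < tol) break
    }
    previous <- current
  }
  list(best_n = used, res = current)
}

auto <- find_n_samples()
cat("Samples used:", auto$best_n, "\n")
```

A tighter tol makes the result more stable at the cost of more passes; max_n caps the run time if the tolerance is never reached.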
# Automatically find best n_samples
auto_result <- nm_normalise_auto(
  df = df_prep,
  model = model,
  resample_vars = weather_vars
)
# Check the optimal N found
cat("Optimal samples used:", auto_result$best_n, "\n")
# Access the normalised result
head(auto_result$res)
Use a specific historical period (e.g., a specific year or season) as the weather baseline.
# Create a custom pool (e.g., first 100 observations)
resample_pool <- df_prep %>% dplyr::slice(1:100)
df_dew_custom <- nm_normalise(
  df = df_prep,
  model = model,
  resample_df = resample_pool, # <--- Use custom pool
  resample_vars = weather_vars,
  n_samples = 600
)
Perform normalisation in a moving window to capture changing trends.
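To picture how a moving window covers the record, the sketch below enumerates windows under the assumption that window_days is the window length in days and rolling_every is the number of days by which the window start advances. That reading of the two parameters, and the helper rolling_windows itself, are assumptions for illustration rather than a statement of how nm_rolling is implemented.

```r
# Hypothetical helper: enumerate window start/end dates for a moving window.
rolling_windows <- function(dates, window_days = 14, rolling_every = 7) {
  first <- min(dates)
  last  <- max(dates)
  # The last admissible start must leave room for one full window
  last_start <- last - (window_days - 1)
  if (last_start < first) stop("date range is shorter than one window")
  starts <- seq(first, last_start, by = rolling_every)
  data.frame(start = starts, end = starts + (window_days - 1))
}

dates   <- seq(as.Date("2020-01-01"), as.Date("2020-02-11"), by = "day")
windows <- rolling_windows(dates)
print(windows)
```

With a 14-day window advancing 7 days at a time, consecutive windows overlap by half, so most days fall into two windows and receive two estimates.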
df_rolling <- nm_rolling(
  df = df_prep,
  value = 'value',
  model = model,
  resample_vars = weather_vars,
  n_samples = 300,
  window_days = 14,
  rolling_every = 7
)
Decompose the signal into Emission (human activity) and Meteorology (weather) drivers.
# Isolate Emission contribution
df_emi <- nm_decompose(method = "emission", df = df_prep, value = "value", model = model, n_samples = 300)
# Isolate Meteorology contribution
df_met <- nm_decompose(method = "meteorology", df = df_prep, value = "value", model = model, n_samples = 300)
Run an ensemble of models to estimate confidence intervals for the normalised trend.
unc_results <- nm_do_all_unc(
  df = my1,
  value = 'PM2.5',
  predictors = predictors,
  resample_vars = weather_vars,
  n_models = 5, # Train 5 models with different seeds
  n_samples = 300
)
Evaluate the effectiveness of policy interventions using SCM or Machine Learning SCM (ML-SCM).
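The core of a classic synthetic control is a set of donor weights that are non-negative, sum to one, and make the weighted donors track the treated unit before the intervention; the post-intervention gap is then read as the effect. The base-R sketch below fits such weights on simulated data. It enforces the constraints through a softmax reparameterisation solved with optim, which is a stand-in chosen to avoid extra dependencies and is not necessarily how nm_run_scm solves the problem. All object names and the simulated effect of -2 are hypothetical.

```r
set.seed(123)
T0 <- 40; T1 <- 20  # pre- and post-intervention periods
donors <- matrix(rnorm((T0 + T1) * 3, mean = 10), ncol = 3,
                 dimnames = list(NULL, c("A", "B", "C")))
true_w  <- c(A = 0.5, B = 0.3, C = 0.2)
treated <- as.numeric(donors %*% true_w)
post    <- (T0 + 1):(T0 + T1)
treated[post] <- treated[post] - 2  # simulated intervention effect of -2

pre <- seq_len(T0)

# Map K-1 free parameters to K weights that are >= 0 and sum to 1 (softmax).
to_weights <- function(theta) {
  e <- exp(c(theta, 0))
  e / sum(e)
}

# Pre-period sum of squared errors between treated unit and synthetic control.
loss <- function(theta) {
  sum((treated[pre] - donors[pre, ] %*% to_weights(theta))^2)
}

opt <- optim(c(0, 0), loss, method = "BFGS", control = list(maxit = 500))
w   <- setNames(to_weights(opt$par), colnames(donors))

synthetic <- as.numeric(donors %*% w)
effect    <- treated - synthetic  # gap: observed minus counterfactual
print(round(w, 3))
print(mean(effect[post]))
```

Because the weights are fitted on the pre-intervention period only, a persistent post-intervention gap is attributed to the intervention rather than to the fit.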
data("SCM")
df_scm <- scm
# Define the intervention date
intervention_date <- as.Date("2015-10-23")
# Identify the Target Unit and the Donor Pool
target_unit <- unique(scm$ID[scm$group == "target"])
control_pool <- unique(scm$ID[scm$group == "control"])
cat("Target Unit:", target_unit, "\n")
cat("Donor Pool size:", length(control_pool), "\n")
# Classic SCM
scm_res <- nm_run_scm(
  df = df_scm, date_col = "date", outcome_col = "SO2wn", unit_col = "ID",
  treated_unit = target_unit, donors = control_pool, cutoff_date = intervention_date,
  scm_backend = "scm" # or "mlscm" for the ML-SCM backend
)
Validate results by running "Placebo in Space" tests (treating control units as if they were treated).
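The logic of a placebo-in-space test can be sketched without fitting any model: if each donor's placebo "effect" is only noise, the spread of those placebo gaps at each time step gives a reference band, and a real effect should fall outside it after the intervention. The example below simulates the gaps directly and builds a pointwise quantile band. The simulated numbers, the object names, and the pointwise-quantile construction are assumptions for illustration; nm_effect_bands_space may build its bands differently.

```r
set.seed(2024)
n_time <- 30; cutoff <- 20; n_donors <- 15

# Placebo gaps: each column is the estimated "effect" for one untreated donor,
# which should be pure noise because no donor received the intervention.
placebo_gaps <- matrix(rnorm(n_time * n_donors, sd = 0.5), nrow = n_time)

# Treated-unit gap: noise before the cutoff, a shift of -3 afterwards.
treated_gap <- rnorm(n_time, sd = 0.5)
post <- (cutoff + 1):n_time
treated_gap[post] <- treated_gap[post] - 3

# Pointwise 95% band from the placebo distribution at each time step.
band <- t(apply(placebo_gaps, 1, quantile, probs = c(0.025, 0.975)))
colnames(band) <- c("lower", "upper")

outside <- treated_gap < band[, "lower"] | treated_gap > band[, "upper"]

# Share of time steps outside the band, before vs. after the intervention.
shares <- c(pre = mean(outside[1:cutoff]), post = mean(outside[post]))
print(shares)
```

With a small donor pool the band is itself noisy, so occasional pre-intervention excursions are expected; the evidence for an effect is a sustained departure after the cutoff.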
# Run Placebo Test
placebo_out <- nm_placebo_in_space(
  df = df_scm, date_col = "date", outcome_col = "SO2wn", unit_col = "ID",
  treated_unit = target_unit, donors = control_pool, cutoff_date = intervention_date,
  scm_backend = "scm", # or "mlscm" for the ML-SCM backend
  verbose = FALSE
)
# Calculate and Plot 95% Confidence Bands
bands <- nm_effect_bands_space(placebo_out, level = 0.95, method = "quantile")
nm_plot_effect_with_bands(bands, cutoff_date = intervention_date, title = "SCM Effect (95% Placebo)")
Alternative uncertainty estimation methods.
# Jackknife (Leave-One-Out)
jack_res <- nm_uncertainty_bands(
  df = df_scm, date_col = "date", outcome_col = "SO2wn", unit_col = "ID",
  scm_backend = "scm", # or "mlscm" for the ML-SCM backend
  treated_unit = target_unit, donors = control_pool, cutoff_date = intervention_date,
  method = "jackknife", # Or "bootstrap"
  verbose = FALSE
)
nm_plot_uncertainty_bands(jack_res, cutoff_date = intervention_date, title = "SCM Effect (Jackknife)")
- R (>= 4.0)
- Core: h2o, dplyr, data.table, lubridate, foreach, doSNOW
- SCM: glmnet, quadprog
- Visualization: ggplot2
@Manual{normet-pkg,
title = {normet: Normalisation, Decomposition, and Counterfactual Modelling for Air Quality Time-Series},
author = {Congbo Song and Other Contributors},
year = {2025},
note = {R package version 0.0.1},
organization = {University of Manchester},
url = {https://github.com/normet-dev/normet-r},
}
GNU GENERAL PUBLIC LICENSE.
Contributions are welcome! Please submit issues and pull requests via GitHub.