normet: Normalisation, Decomposition, and Counterfactual Modelling for Air Quality Time-series

normet is an R package for air quality time-series analysis. It provides tools for weather normalisation, decomposition, and counterfactual modelling in support of air quality research, causal inference, and policy evaluation.


✨ Core Strengths

  • Automated & Intelligent: Powered by an H2O AutoML backend, it searches candidate models and selects the best performer on the chosen metric, reducing manual tuning.
  • All-in-One Solution: Offers high-level functions that cover the entire workflow: data preprocessing, model training, weather normalisation, decomposition, and counterfactual modelling.
  • Robust Causal Inference: Integrates both classic (SCM) and machine-learning-based (ML-SCM) Synthetic Control Methods.
  • Uncertainty Quantification: Provides comprehensive tools for uncertainty estimation, including Bootstrap, Jackknife, and Placebo Tests (in-space).
  • High Performance: Built-in memory management and parallel processing for handling large datasets.

🚀 Workflow Overview

The analysis typically follows this sequence:

  1. Initialize: Start the H2O backend (nm_init_h2o).
  2. Prepare: Process raw data and create time-based features (nm_prepare_data).
  3. Train: Build predictive models (nm_train_model).
  4. Analyse:
    • Importance: Rank predictors by influence (nm_feature_importance).
    • Explain: Visualize marginal effects (nm_pdp).
    • Normalise: Remove weather effects (nm_normalise / nm_normalise_auto).
    • Decompose: Isolate emission vs. meteorological contributions (nm_decompose).
    • Evaluate Policy: Estimate causal effects (nm_run_scm).

🔧 Installation

Install the latest development version from GitHub:

# install.packages("devtools")
devtools::install_github("normet-dev/normet-r")

Backend Setup

normet relies on H2O. Ensure Java is installed, then install the h2o package:

install.packages("h2o")
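
Before starting the backend, it can help to confirm that a Java executable is visible to R. The check below is a minimal base-R sketch, not part of normet; it only looks for `java` on the PATH and does not verify that the Java version is one H2O supports.

```r
# Look for a `java` executable on the PATH; Sys.which() returns "" when none is found.
java_path <- Sys.which("java")

if (nzchar(java_path)) {
  message("Java found at: ", java_path)
} else {
  message("Java not found on PATH; install Java before calling nm_init_h2o().")
}
```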

💡 Quick Start: The "Do-All" Pipeline

For standard weather normalisation, use nm_do_all to handle data preparation, training, and normalisation in one step.

library(normet)
library(dplyr)

# 1. Initialize Backend
nm_init_h2o()

# 2. Load Data
data("MY1")

# 3. Define Features
# 'predictors' includes weather + time variables for training
predictors <- c("u10", "v10", "d2m", "t2m", "blh", "sp", "ssrd", "tcc", "tp", "rh2m",
                "date_unix", "day_julian", "weekday", "hour")

# 'weather_vars' are the variables to be resampled (shuffled) during normalisation
weather_vars <- c("u10", "v10", "d2m", "t2m", "blh", "sp", "ssrd", "tcc", "tp", "rh2m")

# 4. Run Pipeline
results <- nm_do_all(
  df = my1,
  value = "PM2.5",
  predictors = predictors,
  resample_vars = weather_vars,
  n_samples = 300,  # Number of resampling iterations
  model_config = list(include_algos = c("GBM"), max_runtime_secs = 60, sort_metric = "AUTO")
)

# 5. Inspect Results
head(results$out)       # Normalised Data (Date, Observed, Normalised)
print(results$model)    # Trained Model
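
The time-based predictors named above (date_unix, day_julian, weekday, hour) can be illustrated in base R. This is a sketch under assumed definitions; the encodings normet actually uses (for example, how weekdays are numbered) may differ, and the timestamps here are hypothetical.

```r
# Hypothetical hourly timestamps; the real input would be the date column of the data.
dates <- seq(
  from = as.POSIXct("2020-01-01 00:00:00", tz = "UTC"),
  by = "hour",
  length.out = 48
)

lt <- as.POSIXlt(dates, tz = "UTC")

time_features <- data.frame(
  date       = dates,
  date_unix  = as.numeric(dates),  # seconds since 1970-01-01 UTC (captures the long-term trend)
  day_julian = lt$yday + 1,        # day of year, 1-based (captures seasonality)
  weekday    = lt$wday,            # 0 = Sunday ... 6 = Saturday (base R convention)
  hour       = lt$hour             # hour of day, 0-23 (captures the diurnal cycle)
)

head(time_features, 3)
```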

🛠️ Step-by-Step Advanced Workflow

For greater control, you can execute each stage manually.

1. Data Preparation & Model Training

# Prepare data with time features and train/test splits
df_prep <- nm_prepare_data(
  df = my1,
  value = 'PM2.5',
  predictors = weather_vars, # Will automatically add time features
  split_method = 'random',
  fraction = 0.75
)

# Configure H2O AutoML
h2o_cfg <- list(
  include_algos = c("GBM"),
  max_runtime_secs = 60,
  sort_metric = "AUTO"
)

# Train the model
model <- nm_train_model(
  df = df_prep,
  value = 'value',
  backend = "h2o",
  variables = predictors,
  model_config = h2o_cfg
)

# Evaluate Performance
nm_modStats(df_prep, model)

# (Optional) Save and Load Model
nm_save_model(model, path = "./", filename = "my_automl")
model <- nm_load_model(path = "./", filename = "my_automl")
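
To make the split_method = 'random' and fraction = 0.75 arguments concrete, the sketch below performs a random 75/25 split in base R. The data and the `set` column name are hypothetical; how nm_prepare_data labels the split internally is not shown in this README.

```r
set.seed(42)  # make the split reproducible

# Hypothetical data standing in for the prepared data set
df <- data.frame(value = rnorm(200), t2m = rnorm(200))

fraction <- 0.75
n_train <- floor(fraction * nrow(df))
train_idx <- sample(seq_len(nrow(df)), size = n_train)

# Label each row; assumption: a column named `set` marks the split
df$set <- "testing"
df$set[train_idx] <- "training"

table(df$set)
```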

2. Model Explainability

Feature Importance

Identify which variables have the strongest influence on the model's predictions.

# Extract feature importance table
importance_table <- nm_feature_importance(model)
print(head(importance_table))

Partial Dependence Plots (PDP)

Partial Dependence Plots show how the predicted pollutant concentration responds to each variable (e.g., linearly or non-linearly), averaged over the other predictors.

# Compute PDP for all variables
pdp_all <- nm_pdp(df_prep, model)
print(head(pdp_all))

# Compute PDP for specific variables
pdp_data <- nm_pdp(df_prep, model, var_list = c('blh', 'rh2m'))
print(head(pdp_data))
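
The idea behind a partial dependence curve can be sketched without H2O. The example below uses a linear model as a stand-in for the trained model and synthetic data; the `partial_dependence` helper and the output column names are hypothetical and not part of normet.

```r
set.seed(1)

# Hypothetical data: the pollutant falls as boundary-layer height (blh) rises
n <- 300
df <- data.frame(blh = runif(n, 100, 2000), rh2m = runif(n, 20, 100))
df$value <- 50 - 0.01 * df$blh + 0.1 * df$rh2m + rnorm(n)

# Stand-in model; normet would use the trained H2O model instead
fit <- lm(value ~ blh + rh2m, data = df)

# Partial dependence of `var`: fix it at each grid value for every row,
# predict, and average the predictions.
partial_dependence <- function(model, data, var, grid_size = 20) {
  grid <- seq(min(data[[var]]), max(data[[var]]), length.out = grid_size)
  pd <- vapply(grid, function(g) {
    tmp <- data
    tmp[[var]] <- g
    mean(predict(model, newdata = tmp))
  }, numeric(1))
  data.frame(variable = var, value = grid, pd_mean = pd)
}

pdp_blh <- partial_dependence(fit, df, "blh")
head(pdp_blh)
```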

3. Weather Normalisation

Standard Normalisation

Use the trained model to generate the weather-normalised time-series. The function detects the model's features automatically, so they do not need to be passed again.

# aggregate = TRUE returns the mean normalised value across all resampling iterations
df_dew <- nm_normalise(
  df = df_prep,
  model = model,
  resample_vars = weather_vars,
  n_samples = 600,
  aggregate = TRUE
)
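
The resampling idea behind weather normalisation can be sketched in base R. The example uses synthetic data and a linear model as a stand-in for the trained model; it illustrates the general technique (resample the weather columns, predict, average) and is not the internals of nm_normalise.

```r
set.seed(1)

# Hypothetical data: an emission trend plus a temperature effect
n <- 200
df <- data.frame(trend = seq_len(n), t2m = rnorm(n, mean = 10, sd = 5))
df$value <- 20 + 0.05 * df$trend - 0.8 * df$t2m + rnorm(n)

# Stand-in model; normet would use the trained H2O model instead
fit <- lm(value ~ trend + t2m, data = df)

resample_vars <- "t2m"
n_samples <- 100

# Each iteration replaces the weather columns with rows drawn at random
# from the whole record, then predicts with the other columns untouched.
preds <- replicate(n_samples, {
  shuffled <- df
  idx <- sample(nrow(df), replace = TRUE)
  shuffled[resample_vars] <- df[idx, resample_vars, drop = FALSE]
  predict(fit, newdata = shuffled)
})

# Averaging over iterations gives the weather-normalised series
df$normalised <- rowMeans(preds)

head(df[c("trend", "value", "normalised")])
```

Because the weather inputs are averaged out, the normalised series varies less than the observed one and follows the underlying trend more closely.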

Automatic Normalisation (Auto-Convergence)

Instead of guessing n_samples, let the algorithm determine the optimal number of resampling iterations required for the result to stabilize.

# Automatically find best n_samples
auto_result <- nm_normalise_auto(
  df = df_prep,
  model = model,
  resample_vars = weather_vars
)

# Check the optimal N found
cat("Optimal samples used:", auto_result$best_n, "\n")

# Access the normalised result
head(auto_result$res)
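
One way to decide when the result has stabilised is to keep adding resampling iterations until the averaged series stops moving. The sketch below uses an assumed stopping rule (mean absolute change below a tolerance) and a hypothetical stand-in for a resampling iteration; the criterion nm_normalise_auto applies is not described in this README.

```r
set.seed(1)

# Hypothetical stand-in for one resampling iteration: returns one noisy
# prediction per time step (true normalised level is 10 everywhere).
one_iteration <- function(n_time = 50) 10 + rnorm(n_time, sd = 2)

tol <- 0.05      # assumed tolerance on the change of the running mean
batch <- 50      # iterations added per step
max_n <- 5000    # hard cap so the loop always terminates

running_sum <- numeric(50)
n_used <- 0
previous <- NULL

repeat {
  for (i in seq_len(batch)) running_sum <- running_sum + one_iteration()
  n_used <- n_used + batch
  current <- running_sum / n_used
  # Stop when adding a batch barely moves the averaged series
  if (!is.null(previous) && mean(abs(current - previous)) < tol) break
  if (n_used >= max_n) break
  previous <- current
}

cat("Samples used:", n_used, "\n")
```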

Custom Resampling Pool

Use a specific historical period (e.g., a particular year or season) as the weather baseline. Resampled weather is then drawn only from that pool.

# Create a custom pool (e.g., first 100 observations)
resample_pool <- df_prep %>% dplyr::slice(1:100)

df_dew_custom <- nm_normalise(
  df = df_prep,
  model = model,
  resample_df = resample_pool, # <--- Use custom pool
  resample_vars = weather_vars,
  n_samples = 600
)

Rolling Normalisation

Perform normalisation in a moving window to capture changing trends.

df_rolling <- nm_rolling(
  df = df_prep,
  value = 'value',
  model = model,
  resample_vars = weather_vars,
  n_samples = 300,
  window_days = 14,
  rolling_every = 7
)
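
To show how window_days and rolling_every interact, the sketch below lays out 14-day windows that start every 7 days over a hypothetical date range. The alignment and edge handling are assumptions; nm_rolling may place its windows differently.

```r
# Hypothetical daily date range
dates <- seq(as.Date("2020-01-01"), as.Date("2020-03-01"), by = "day")

window_days <- 14
rolling_every <- 7

# Window start dates, spaced `rolling_every` days apart, keeping only
# windows that fit entirely inside the record
last_start <- max(dates) - (window_days - 1)
starts <- seq(min(dates), last_start, by = rolling_every)

windows <- data.frame(start = starts, end = starts + (window_days - 1))
windows
```

With these settings consecutive windows overlap by 7 days, so each date is normalised in up to two windows.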

4. Time-Series Decomposition

Decompose the signal into Emission (human activity) and Meteorology (weather) drivers.

# Isolate Emission contribution
df_emi <- nm_decompose(method = "emission", df = df_prep, value = "value", model = model, n_samples = 300)

# Isolate Meteorology contribution
df_met <- nm_decompose(method = "meteorology", df = df_prep, value = "value", model = model, n_samples = 300)

5. Uncertainty Quantification (Ensemble)

Run an ensemble of models to estimate confidence intervals for the normalised trend.

unc_results <- nm_do_all_unc(
  df = my1,
  value = 'PM2.5',
  predictors = predictors,
  resample_vars = weather_vars,
  n_models = 5, # Train 5 models with different seeds
  n_samples = 300
)
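
Once several models have each produced a normalised series, the spread across models can be summarised per time step. The sketch below uses synthetic series and pointwise quantiles; this is an assumed summary, and the interval construction used by nm_do_all_unc is not described in this README.

```r
set.seed(1)

n_models <- 5
n_time <- 100

# Hypothetical normalised series from each model: one column per model
runs <- sapply(seq_len(n_models), function(m) {
  10 + 0.02 * seq_len(n_time) + rnorm(n_time, sd = 0.3)
})

# Summarise across models at each time step
summary_df <- data.frame(
  mean  = rowMeans(runs),
  lower = apply(runs, 1, quantile, probs = 0.025),
  upper = apply(runs, 1, quantile, probs = 0.975)
)

head(summary_df)
```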

⚖️ Causal Inference: Synthetic Control Methods (SCM)

Evaluate the effectiveness of policy interventions using SCM or Machine Learning SCM (ML-SCM).

1. Setup Data

data("SCM")
df_scm <- scm

# Define the intervention date
intervention_date <- as.Date("2015-10-23")

# Identify the Target Unit and the Donor Pool
target_unit <- unique(df_scm$ID[df_scm$group == "target"])
control_pool <- unique(df_scm$ID[df_scm$group == "control"])

cat("Target Unit:", target_unit, "\n")
cat("Donor Pool size:", length(control_pool), "\n")

2. Run SCM / ML-SCM

# Classic SCM
scm_res <- nm_run_scm(
  df = df_scm, date_col = "date", outcome_col = "SO2wn", unit_col = "ID",
  treated_unit = target_unit, donors = control_pool, cutoff_date = intervention_date,
  scm_backend = "scm"  # or "mlscm" for the ML-SCM backend
)
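
The core of a classic synthetic control is a set of non-negative donor weights that sum to one and reproduce the treated unit before the intervention. The sketch below fits such weights on synthetic data with base R's optim and a softmax reparameterisation; it is an illustration of the constraint, not the solver nm_run_scm uses, and all data and names are hypothetical.

```r
set.seed(1)

# Hypothetical pre-intervention outcomes: rows = time, columns = donors
n_pre <- 60
donors <- matrix(rnorm(n_pre * 4, mean = 10), nrow = n_pre,
                 dimnames = list(NULL, paste0("donor", 1:4)))

# Treated unit built as a known mix of donors 1 and 2 plus small noise
treated <- 0.7 * donors[, 1] + 0.3 * donors[, 2] + rnorm(n_pre, sd = 0.05)

# Softmax maps unconstrained parameters to non-negative weights summing to 1
softmax <- function(theta) {
  z <- exp(theta - max(theta))
  z / sum(z)
}

# Pre-intervention mean squared error of the synthetic control
loss <- function(theta) mean((treated - donors %*% softmax(theta))^2)

fit <- optim(par = rep(0, ncol(donors)), fn = loss, method = "BFGS")
weights <- setNames(softmax(fit$par), colnames(donors))

round(weights, 2)
```

The estimated effect after the intervention is then the treated outcome minus the weighted donor average.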

3. Placebo Tests & Confidence Bands

Validate results by running "Placebo in Space" tests (treating control units as if they were treated).

# Run Placebo Test
placebo_out <- nm_placebo_in_space(
  df = df_scm, date_col = "date", outcome_col = "SO2wn", unit_col = "ID",
  treated_unit = target_unit, donors = control_pool, cutoff_date = intervention_date,
  scm_backend = "scm", # "scm" (classic) or "mlscm" (ML-SCM)
  verbose = FALSE
)

# Calculate and Plot 95% Confidence Bands
bands <- nm_effect_bands_space(placebo_out, level = 0.95, method = "quantile")
nm_plot_effect_with_bands(bands, cutoff_date = intervention_date, title = "SCM Effect (95% Placebo)")
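
A quantile-based placebo band can be illustrated with synthetic numbers: the effects estimated for the control units form a reference distribution at each time step, and the treated effect is compared against it. This is a sketch of the general idea under assumed inputs, not the internals of nm_effect_bands_space.

```r
set.seed(1)

n_time <- 40       # time steps (hypothetical)
n_placebo <- 20    # control units treated as placebos

# Hypothetical placebo effects: observed minus synthetic for each control unit
placebo_effects <- matrix(rnorm(n_time * n_placebo, sd = 1), nrow = n_time)

# Hypothetical effect for the treated unit: a drop of 3 after step 20
treated_effect <- c(rnorm(20, sd = 1), rnorm(20, mean = -3, sd = 1))

level <- 0.95
alpha <- (1 - level) / 2

bands <- data.frame(
  time   = seq_len(n_time),
  effect = treated_effect,
  lower  = apply(placebo_effects, 1, quantile, probs = alpha),
  upper  = apply(placebo_effects, 1, quantile, probs = 1 - alpha)
)

# Flag time steps where the treated effect falls outside the placebo band
bands$outside <- bands$effect < bands$lower | bands$effect > bands$upper
sum(bands$outside)
```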

4. Uncertainty (Bootstrap / Jackknife)

Alternative uncertainty estimation methods.

# Jackknife (Leave-One-Out)
jack_res <- nm_uncertainty_bands(
  df = df_scm, date_col = "date", outcome_col = "SO2wn", unit_col = "ID",
  scm_backend = "scm", # "scm" (classic) or "mlscm" (ML-SCM)
  treated_unit = target_unit, donors = control_pool, cutoff_date = intervention_date,
  method = "jackknife", # Or "bootstrap"
  verbose = FALSE
)
nm_plot_uncertainty_bands(jack_res, cutoff_date = intervention_date, title = "SCM Effect (Jackknife)")
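
The leave-one-out idea can be shown with a deliberately simplified synthetic control: the plain average of the donors. Each donor is dropped in turn, the effect is recomputed, and the range of results indicates how sensitive the estimate is to any single donor. The min/max envelope and all data here are assumptions for illustration; nm_uncertainty_bands may construct its bands differently.

```r
set.seed(1)

# Hypothetical outcomes: rows = time, columns = donors
n_time <- 30
donors <- matrix(rnorm(n_time * 5, mean = 10), nrow = n_time)
treated <- rowMeans(donors) + c(rep(0, 15), rep(-2, 15))  # drop of 2 after step 15

# Simplified synthetic control: the plain average of the donors
# (a stand-in for the weighted fit a real SCM would produce)
effect_with <- function(d) treated - rowMeans(d)

# Leave one donor out at a time and recompute the effect
loo <- sapply(seq_len(ncol(donors)), function(j) {
  effect_with(donors[, -j, drop = FALSE])
})

jack <- data.frame(
  effect = effect_with(donors),
  lower  = apply(loo, 1, min),
  upper  = apply(loo, 1, max)
)

tail(jack, 3)
```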

📦 Dependencies

  • R (>= 4.0)
  • Core: h2o, dplyr, data.table, lubridate, foreach, doSNOW
  • SCM: glmnet, quadprog
  • Visualization: ggplot2

📜 How to Cite

@Manual{normet-pkg,
  title = {normet: Normalisation, Decomposition, and Counterfactual Modelling for Air Quality Time-Series},
  author = {Congbo Song and Other Contributors},
  year = {2025},
  note = {R package version 0.0.1},
  organization = {University of Manchester},
  url = {https://github.com/normet-dev/normet-r},
}

📄 License

GNU General Public License.


🤝 Contributing

Contributions are welcome! Please submit issues and pull requests via GitHub.
