28 changes: 28 additions & 0 deletions changelog/622.breaking.md
@@ -0,0 +1,28 @@
Renamed the "ignore datasets" configuration to "grey list"
and decoupled fetching from configuration loading.

The `default_ignore_datasets.yaml` file at the repository root is now `config/default_grey_list.yaml`,
the `Config.ignore_datasets_file` field is now `Config.grey_list_file`,
and the `REF_IGNORE_DATASETS_FILE` environment variable is now `REF_GREY_LIST_FILE`.

A new `Config.grey_list_url` field (also overridable via `REF_GREY_LIST_URL`) controls where the grey list is fetched from.
Set it to an empty value to disable fetching,
which is useful for offline or air-gapped HPC environments.
The grey list location is now independent of the fetch lifecycle,
so users can pin it to a writable location (e.g. a Kubernetes volume mount)
and still have the solver refresh it.

`Config.default()` no longer performs network I/O as a side effect.
A missing grey list file is treated as an empty list rather than raising an error,
so disabling fetching does not require pre-seeding the file.

Loading a configuration that still uses the deprecated `ignore_datasets_file` key,
or running with the `REF_IGNORE_DATASETS_FILE` environment variable set,
now raises a hard error with migration instructions instead of silently falling back to defaults.
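
The hard-error behaviour described above can be sketched as follows. This is a minimal illustration only: `DeprecatedConfigError` and `check_deprecated_grey_list_settings` are hypothetical names, and the exception type and message wording used by the package may differ.

```python
from typing import Any, Mapping


class DeprecatedConfigError(ValueError):
    """Hypothetical error type; the package's real exception may be named differently."""


def check_deprecated_grey_list_settings(config_data: Mapping[str, Any], env: Mapping[str, str]) -> None:
    """Raise with migration hints if a pre-rename key or environment variable is present."""
    problems = []
    if "ignore_datasets_file" in config_data:
        problems.append("rename `ignore_datasets_file` to `grey_list_file` in the configuration file")
    if "REF_IGNORE_DATASETS_FILE" in env:
        problems.append("set `REF_GREY_LIST_FILE` instead of `REF_IGNORE_DATASETS_FILE`")
    if problems:
        # Fail loudly rather than silently falling back to defaults.
        raise DeprecatedConfigError("Deprecated grey list settings found: " + "; ".join(problems))
```

Collecting every problem before raising lets a user fix the configuration file and the environment in one pass.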

Grey list refresh is now fail-safe:
a network failure with no cached file raises `GreyListRefreshError`
rather than creating an empty placeholder that would silently disable grey list protections for the next 6 hours.
If a cached file exists it is reused.

See the new "Grey list" section in `docs/configuration.md` for full details.
11 changes: 6 additions & 5 deletions default_ignore_datasets.yaml → config/default_grey_list.yaml
@@ -1,6 +1,6 @@
# This file is used to specify datasets that should be ignored by default.
# Grey list: datasets that should be excluded from diagnostics by default.
#
# It can be used to ignore datasets that are known to have issues and keep
# It can be used to skip datasets that are known to have issues and keep
# diagnostics that use multiple datasets from running.
#
# The format used in this configuration file is:
@@ -14,9 +14,10 @@
# instances that will be prepended to the constraints of the data requirements
# of the diagnostics that the facets are listed under.
#
# If no `ignore_datasets_file` is specified in the REF configuration, this file
# will be downloaded from GitHub and used. If the local copy of this file is
# older than 6 hours, it will be updated.
# `Config.grey_list_file` controls the location this file lives at and
# `Config.grey_list_url` controls where the solver refreshes it from
# (set the URL to None to disable fetching). The default cache is refreshed
# at most every `DEFAULT_GREY_LIST_MAX_AGE` (6 hours).
#
esmvaltool:
sea-ice-sensitivity:
65 changes: 65 additions & 0 deletions docs/configuration.md
@@ -95,6 +95,71 @@ If this is set, then the sample data won't be updated.
Path where the test output is stored.
This is used to store the output of the tests that are run in the test suite for later inspection.

## Grey list

The *grey list* is a YAML file that lists facet values identifying datasets to exclude from specific diagnostics,
for example datasets that are known to crash a diagnostic or produce invalid output.
Datasets matching the grey list are filtered out before the relevant diagnostic is solved.

The file format is:

```yaml
provider:
  diagnostic:
    source_type:
      - facet: value
      - other_facet: [other_value1, other_value2]
```
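
As a rough illustration of how this nesting is read, the sketch below looks up the facet filters for one provider, diagnostic and source type in an already-parsed grey list. The dictionary contents (`example_provider`, `example_diagnostic`, `cmip6`, `MODEL-A`) and the helper `facets_to_ignore` are made up for the example and are not part of the package.

```python
from typing import Any

# Hypothetical parsed grey list, shaped like the YAML format above.
grey_list_all: dict[str, Any] = {
    "example_provider": {
        "example_diagnostic": {
            "cmip6": [
                {"source_id": "MODEL-A"},
                {"variable_id": ["tas", "pr"]},
            ]
        }
    }
}


def facets_to_ignore(
    grey_list: dict[str, Any], provider: str, diagnostic: str, source_type: str
) -> list[dict[str, Any]]:
    """Return the facet filters for one provider/diagnostic/source type, or [] if none are listed."""
    return grey_list.get(provider, {}).get(diagnostic, {}).get(source_type, [])


filters = facets_to_ignore(grey_list_all, "example_provider", "example_diagnostic", "cmip6")
```

Each entry in the returned list is one facet filter. A lookup for a provider or diagnostic that is not listed returns an empty list, so nothing is excluded.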

Two configuration values control how the grey list is loaded;
both can be set in your `ref.toml` or via environment variables.

### `grey_list_file` / `REF_GREY_LIST_FILE`

Path to the grey list file on disk.
Defaults to `grey_list.yaml` inside your `REF_CONFIGURATION` directory
(alongside `ref.toml`, the database, etc.).
This location must be writable by the user running the REF,
because the solver may refresh the grey list periodically.

### `grey_list_url` / `REF_GREY_LIST_URL`

URL the solver fetches the grey list from.
Defaults to `config/default_grey_list.yaml` on the `main` branch
of the `Climate-REF/climate-ref` GitHub repository.
Override this to point at a fork or internal mirror.

The download is **lazy and explicit**:
it only runs once at the start of a solve (`ExecutionSolver.build_from_db`),
and only when the on-disk copy is missing or older than 6 hours.
Read-only commands like `ref providers list` or `ref datasets list` never touch the network.

If a refresh fails and no cached file exists,
the solver raises `GreyListRefreshError` rather than running with an empty grey list.
If a cached file exists it is reused.
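
The refresh rules above can be sketched as a single function. This is an illustration under stated assumptions, not the package's implementation: `refresh_grey_list`, the injected `download` callable and the local `GreyListRefreshError` class are stand-ins, and treating a network failure as an `OSError` is an assumption made for the example.

```python
import time
from pathlib import Path
from typing import Callable, Optional

MAX_AGE_SECONDS = 6 * 60 * 60  # 6 hours, as described above


class GreyListRefreshError(RuntimeError):
    """Stand-in for the error named above; the real class is defined in the package."""


def refresh_grey_list(
    path: Path,
    url: Optional[str],
    download: Callable[[str], str],
    now: Optional[float] = None,
) -> bool:
    """Refresh `path` from `url` if needed. Returns True only if new content was written."""
    if not url:
        return False  # fetching disabled: use whatever is on disk
    now = time.time() if now is None else now
    if path.exists() and now - path.stat().st_mtime < MAX_AGE_SECONDS:
        return False  # cached copy is fresh enough
    try:
        content = download(url)
    except OSError as exc:  # assumption: network failures surface as OSError
        if path.exists():
            return False  # reuse the stale cached copy
        raise GreyListRefreshError(f"Could not fetch {url} and no cached file exists at {path}") from exc
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True
```

Passing the downloader in as a callable keeps the sketch free of any particular HTTP library and makes the failure paths easy to exercise in a test.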

### Offline / air-gapped use (HPC)

To run completely offline — for example on an HPC compute node with no outbound network —
set the URL to an empty value:

```bash
export REF_GREY_LIST_URL=
```

or in `ref.toml`:

```toml
grey_list_url = ""
```

When fetching is disabled, the solver simply uses whatever file is at `grey_list_file`.
A missing file is treated as an empty grey list,
so you do not have to seed the file by hand.
If you want to apply a specific grey list,
either copy `config/default_grey_list.yaml` from the repository to your `grey_list_file` location ahead of time,
or fetch it once before disabling the URL on the compute nodes.
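
One way to picture the "empty value disables fetching" rule is the sketch below, which resolves the effective URL from the environment. The helper `resolve_grey_list_url` and the placeholder `DEFAULT_GREY_LIST_URL` are assumptions for illustration; how the package actually parses the variable may differ.

```python
import os
from typing import Mapping, Optional

# Placeholder only; the real default points at the file in the GitHub repository.
DEFAULT_GREY_LIST_URL = "https://example.invalid/config/default_grey_list.yaml"


def resolve_grey_list_url(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the URL to fetch from, or None when fetching is disabled."""
    raw = env.get("REF_GREY_LIST_URL")
    if raw is None:
        return DEFAULT_GREY_LIST_URL  # variable unset: use the default
    return raw.strip() or None  # set but empty: fetching disabled
```

The sketch distinguishes an unset variable, which falls back to the default, from a variable set to the empty string, which disables fetching.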

## Configuration Options

<!-- This file is appended to by gen_config_stubs.py -->
28 changes: 16 additions & 12 deletions packages/climate-ref-core/src/climate_ref_core/providers.py
@@ -79,29 +79,33 @@ def configure(self, config: Any) -> None:
config :
A configuration.
"""
logger.debug(
f"Configuring provider {self.slug} using ignore_datasets_file {config.ignore_datasets_file}"
)
# The format of the configuration file is:
logger.debug(f"Configuring provider {self.slug} using grey_list_file {config.grey_list_file}")
# The format of the grey list file is:
# provider:
#   diagnostic:
#     source_type:
#       - facet: value
#       - other_facet: [other_value1, other_value2]
ignore_datasets_all = yaml.safe_load(config.ignore_datasets_file.read_text(encoding="utf-8")) or {}
ignore_datasets = ignore_datasets_all.get(self.slug, {})
if unknown_slugs := {slug for slug in ignore_datasets} - {d.slug for d in self.diagnostics()}:
# A missing file is treated as an empty grey list
# so offline/air-gapped users that disable fetching with `grey_list_url=""`
# can run without having to seed the file themselves.
if not config.grey_list_file.exists():
grey_list_all: dict[str, Any] = {}
else:
grey_list_all = yaml.safe_load(config.grey_list_file.read_text(encoding="utf-8")) or {}
grey_list = grey_list_all.get(self.slug, {})
if unknown_slugs := {slug for slug in grey_list} - {d.slug for d in self.diagnostics()}:
logger.warning(
f"Unknown diagnostics found in {config.ignore_datasets_file} "
f"Unknown diagnostics found in {config.grey_list_file} "
f"for provider {self.slug}: {', '.join(sorted(unknown_slugs))}"
)

known_source_types = {s.value for s in iter(SourceDatasetType)}
for diagnostic in self.diagnostics():
if diagnostic.slug in ignore_datasets:
if unknown_source_types := set(ignore_datasets[diagnostic.slug]) - known_source_types:
if diagnostic.slug in grey_list:
if unknown_source_types := set(grey_list[diagnostic.slug]) - known_source_types:
logger.warning(
f"Unknown source types found in {config.ignore_datasets_file} for "
f"Unknown source types found in {config.grey_list_file} for "
f"diagnostic '{diagnostic.slug}' by provider {self.slug}: "
f"{', '.join(sorted(unknown_source_types))}"
)
@@ -114,7 +118,7 @@ def configure(self, config: Any) -> None:
data_requirement,
constraints=tuple(
IgnoreFacets(facets)
for facets in ignore_datasets[diagnostic.slug].get(
for facets in grey_list[diagnostic.slug].get(
data_requirement.source_type.value, []
)
)
21 changes: 14 additions & 7 deletions packages/climate-ref-core/tests/unit/test_providers.py
@@ -26,8 +26,8 @@ def mock_config(tmp_path, mocker):
"""Use a mock config to avoid depending on `climate_ref.config.Config`."""
config = mocker.Mock()
config.paths.software = tmp_path / "software"
config.ignore_datasets_file = tmp_path / "ignore_datasets.yaml"
config.ignore_datasets_file.touch()
config.grey_list_file = tmp_path / "grey_list.yaml"
config.grey_list_file.touch()
return config


@@ -70,7 +70,7 @@ def test_provider_fixture(self, provider):
assert isinstance(result, Diagnostic)

def test_configure(self, provider, mock_config):
mock_config.ignore_datasets_file.write_text(
mock_config.grey_list_file.write_text(
textwrap.dedent(
"""
mock_provider:
@@ -85,8 +85,15 @@ def test_configure(self, provider, mock_config):
expected_constraint = IgnoreFacets(facets={"source_id": ("A",)})
assert provider.diagnostics()[0].data_requirements[0][0].constraints[0] == expected_constraint

def test_configure_missing_grey_list_file(self, provider, mock_config):
# Offline/air-gapped users may run with grey_list_url="" and no file
# seeded yet; missing file should be treated as an empty grey list,
# not raise FileNotFoundError.
mock_config.grey_list_file.unlink()
provider.configure(mock_config)

def test_configure_unknown_diagnostic(self, provider, mock_config, caplog):
mock_config.ignore_datasets_file.write_text(
mock_config.grey_list_file.write_text(
textwrap.dedent(
"""
mock_provider:
Expand All @@ -100,13 +107,13 @@ def test_configure_unknown_diagnostic(self, provider, mock_config, caplog):
with caplog.at_level(logging.WARNING):
provider.configure(mock_config)
expected_msg = (
f"Unknown diagnostics found in {mock_config.ignore_datasets_file} "
f"Unknown diagnostics found in {mock_config.grey_list_file} "
"for provider mock_provider: invalid_diagnostic"
)
assert expected_msg in caplog.text

def test_configure_unknown_source_type(self, provider, mock_config, caplog):
mock_config.ignore_datasets_file.write_text(
mock_config.grey_list_file.write_text(
textwrap.dedent(
"""
mock_provider:
Expand All @@ -120,7 +127,7 @@ def test_configure_unknown_source_type(self, provider, mock_config, caplog):
with caplog.at_level(logging.WARNING):
provider.configure(mock_config)
expected_msg = (
f"Unknown source types found in {mock_config.ignore_datasets_file} "
f"Unknown source types found in {mock_config.grey_list_file} "
"for diagnostic 'mock' by provider mock_provider: invalid_source_type"
)
assert expected_msg in caplog.text
4 changes: 2 additions & 2 deletions packages/climate-ref-pmp/tests/unit/test_provider.py
@@ -90,8 +90,8 @@ def test_configure_sets_env_vars(self, mocker, tmp_path):
test_provider = PMPDiagnosticProvider("PMP-Test", "1.0")
mock_config = mocker.Mock()
mock_config.paths.software = tmp_path / "software"
mock_config.ignore_datasets_file = tmp_path / "ignore.yaml"
mock_config.ignore_datasets_file.touch()
mock_config.grey_list_file = tmp_path / "ignore.yaml"
mock_config.grey_list_file.touch()

mocker.patch.object(test_provider, "get_conda_exe", return_value=Path("/path/to/conda"))
