Merged
14 changes: 7 additions & 7 deletions .github/workflows/publish.yml
@@ -14,12 +14,12 @@ jobs:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
with:
fetch-depth: 0

- name: Set up Python 3.12
uses: actions/setup-python@v5
uses: actions/setup-python@v6
with:
python-version: 3.12

@@ -35,7 +35,7 @@ jobs:
python -m build

- name: Upload artifact
uses: actions/upload-artifact@v4
uses: actions/upload-artifact@v7
with:
name: python-package-distributions
path: dist/
@@ -56,7 +56,7 @@ jobs:

steps:
- name: Download all the dists
uses: actions/download-artifact@v4
uses: actions/download-artifact@v8
with:
name: python-package-distributions
path: dist/
@@ -83,7 +83,7 @@ jobs:

steps:
- name: Download artifact
uses: actions/download-artifact@v4
uses: actions/download-artifact@v8
with:
name: python-package-distributions
path: dist/
@@ -107,13 +107,13 @@ jobs:

steps:
- name: Download artifact
uses: actions/download-artifact@v4
uses: actions/download-artifact@v8
with:
name: python-package-distributions
path: dist/

- name: Sign with Sigstore
uses: sigstore/gh-action-sigstore-python@v3.0.0
uses: sigstore/gh-action-sigstore-python@v3.4.0
with:
inputs: >-
./dist/*.tar.gz
18 changes: 8 additions & 10 deletions .github/workflows/test.yml
@@ -7,24 +7,23 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.8', '3.9', '3.10', '3.11', '3.12']
mongodb-version: ['4.2', '4.4', '5.0', '6.0']
python-version: ['3.10', '3.11', '3.12', '3.13']
mongodb-version: ['6.0', '7.0', '8.0']
dservercore-version: [main]
dserver-search-plugin-mongo-version: [main]
dserver-retrieve-plugin-mongo-version: [main]
dserver-direct-mongo-plugin-version: [main]

steps:
- name: Git checkout
uses: actions/checkout@v4
uses: actions/checkout@v6

- name: Set up MongoDB ${{ matrix.mongodb-version }}
uses: supercharge/mongodb-github-action@1.11.0
uses: supercharge/mongodb-github-action@1.12.1
with:
mongodb-version: ${{ matrix.mongodb-version }}

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
uses: actions/setup-python@v6
with:
python-version: ${{ matrix.python-version }}

@@ -36,10 +35,9 @@ jobs:

- name: Install server, search and retrieve plugins
run: |
pip install git+https://github.com/livMatS/dservercore.git@${{ matrix.dservercore-version }}
pip install git+https://github.com/livMatS/dserver-search-plugin-mongo.git@${{ matrix.dserver-search-plugin-mongo-version }}
pip install git+https://github.com/livMatS/dserver-retrieve-plugin-mongo.git@${{ matrix.dserver-retrieve-plugin-mongo-version }}
pip install git+https://github.com/livMatS/dserver-direct-mongo-plugin.git@${{ matrix.dserver-direct-mongo-plugin-version }}
pip install git+https://github.com/jic-dtool/dservercore.git@${{ matrix.dservercore-version }}
pip install git+https://github.com/jic-dtool/dserver-search-plugin-mongo.git@${{ matrix.dserver-search-plugin-mongo-version }}
pip install git+https://github.com/jic-dtool/dserver-retrieve-plugin-mongo.git@${{ matrix.dserver-retrieve-plugin-mongo-version }}

- name: Remaining requirements
run: |
3 changes: 3 additions & 0 deletions .gitignore
@@ -20,3 +20,6 @@ dist/*
venv/*
old-provision/*
jwt-spike/*

# Auto-generated by setuptools_scm (write_to target)
dserver_dependency_graph_plugin/version.py
121 changes: 121 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,121 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What this is

A `dservercore` extension plugin that lets a [dserver](https://github.com/livMatS/dservercore)
instance answer "give me every dataset in the same dependency graph as this UUID."
It registers a Flask blueprint at `/graph` and is wired into the host server through the
`dservercore.extension` entry point (`pyproject.toml`):
`DependencyGraphExtension = "dserver_dependency_graph_plugin:DependencyGraphExtension"`.

The plugin does **not** register or own dataset metadata. It reads the dataset collection
that the search/retrieve mongo plugins populate, building MongoDB *views* on top of it.
`register_dataset` is intentionally a no-op.

## Commands

```bash
# Install with test deps (needs the sibling dserver plugins; see test.yml for the git installs)
pip install .[test]

# Run the full test suite (requires a running MongoDB)
pytest -sv

# Point tests at a non-default mongo
TEST_MONGO_URI=mongodb://localhost:27017/ pytest -sv

# Single test
pytest tests/test_graph_routes.py::test_query_dependency_graph_by_default_keys -sv

# Lint (matches CI)
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
```

Tests spin up a real dserver app via `create_app` and a temporary mongo database per run
(`conftest.py`). **A MongoDB server must be reachable**; there is no mocking layer. CI
(`.github/workflows/test.yml`) runs a matrix of Python 3.10–3.13 × MongoDB 6.0, 7.0, and 8.0
and first installs `dservercore`, `dserver-search-plugin-mongo`, and
`dserver-retrieve-plugin-mongo` from their `main` branches.

Build backend is `flit_scm`; version is derived from git tags via `setuptools_scm` and
written to `dserver_dependency_graph_plugin/version.py` (do not edit by hand).

## Routes (actual)

Defined in `__init__.py` on `graph_bp` (`url_prefix="/graph"`):

- `GET /graph/uuids/<uuid>` — graph using server-default `DEPENDENCY_KEYS`.
- `POST /graph/uuids/<uuid>` — body is a `DependencyKeysSchema` (`{"dependency_keys": [...]}`);
only honored when `DYNAMIC_DEPENDENCY_KEYS` is enabled, otherwise it silently falls back
to the defaults.

Note: older revisions of `README.rst` documented a `/graph/lookup/<uuid>` path; the code uses
`/graph/uuids/<uuid>`. If the two ever disagree, trust the code.
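
As a rough sketch of how a client might address the two routes, the snippet below only constructs the requests and never sends them. The base URL, the token, the bearer-token header, and the helper name `build_graph_request` are assumptions made for illustration; none of them are defined by the plugin.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: local development server


def build_graph_request(uuid, token, dependency_keys=None):
    """Build (but do not send) a request for /graph/uuids/<uuid>.

    Hypothetical helper. Without dependency_keys this is a GET that relies on
    the server-default DEPENDENCY_KEYS; with them it is a POST whose JSON body
    follows the {"dependency_keys": [...]} shape described above.
    """
    url = f"{BASE_URL}/graph/uuids/{uuid}"
    headers = {"Authorization": f"Bearer {token}"}  # assumption: bearer-token auth
    if dependency_keys is None:
        return urllib.request.Request(url, headers=headers, method="GET")
    body = json.dumps({"dependency_keys": list(dependency_keys)}).encode("utf-8")
    headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


get_req = build_graph_request("41a2e3e2-0c01-444f-bd7d-f9bb45512373", "TOKEN")
post_req = build_graph_request(
    "41a2e3e2-0c01-444f-bd7d-f9bb45512373",
    "TOKEN",
    dependency_keys=["annotations.source_dataset_uuid"],
)
```

Passing either object to `urllib.request.urlopen` would send it. Keep in mind that the POST body is only honored when `DYNAMIC_DEPENDENCY_KEYS` is enabled on the server.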

## Architecture / request flow

The hard part is all MongoDB aggregation-pipeline construction. Two files carry it:

- **`graph.py`** — pure pipeline builders (no DB calls), two distinct concerns:
1. `build_undirected_adjecency_lists(keys)` (spelled this way in the code) builds the **view** definition: it unwinds
each configured dependency key into directed `(uuid → derived_from)` edges, then
emits *both* directions (`group_dependencies` + `group_inverse_dependencies`) so the
graph can be traversed forward and backward. Invalid edges are dropped by a
`UUID_v4_REGEX` `$match`.
2. `query_dependency_graph(...)` builds the **query** pipeline: a `$graphLookup` over that
view starting from the requested uuid, re-joined (`$lookup`) against the real dataset
collection, with `pre_query`/`post_query` privilege filters and a final `$project` that
strips `readme`, `manifest`, `annotations`.

- **`__init__.py`** — the stateful half. `DependencyGraphExtension.init_app` opens the
mongo client and stashes `client`/`db`/`collection` as **class variables** (so the
module-level route functions can reach the DB — see the NOTE comment there).
`dependency_graph_by_user_and_uuid` is the orchestrator: gates on `ENABLE_DEPENDENCY_VIEW`,
resolves a cached view via `_get_dependency_view_from_keys`, applies privilege filtering
through `dservercore`'s `_preprocess_privileges` + the local `_dict_to_mongo_query`, runs
the aggregation, and converts datetimes to float timestamps for the response.
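
To make the two concerns in `graph.py` easier to picture, here is a minimal in-memory analogue in plain Python. It is not the aggregation pipeline the plugin builds: `undirected_adjacency` stands in for the view (both edge directions, invalid UUIDs dropped) and `connected_component` stands in for the `$graphLookup` traversal. Both function names, the example UUIDs, and the use of `fullmatch` for validation are assumptions of this sketch.

```python
import re
from collections import defaultdict, deque

# Same pattern as UUID_v4_REGEX in graph.py.
UUID_V4 = re.compile(
    "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[4][0-9a-fA-F]{3}-"
    "[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}"
)


def undirected_adjacency(edges):
    """Turn directed (uuid, derived_from) pairs into undirected adjacency sets."""
    adjacency = defaultdict(set)
    for uuid, source in edges:
        # Drop edges whose endpoints are not valid v4 UUIDs
        # (assumption: whole-string match; the plugin's anchoring may differ).
        if not (UUID_V4.fullmatch(uuid) and UUID_V4.fullmatch(source)):
            continue
        adjacency[uuid].add(source)  # forward direction
        adjacency[source].add(uuid)  # inverse direction
    return adjacency


def connected_component(adjacency, start):
    """Breadth-first traversal collecting every uuid reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


A = "11111111-1111-4111-8111-111111111111"
B = "22222222-2222-4222-8222-222222222222"
C = "33333333-3333-4333-8333-333333333333"
D = "44444444-4444-4444-8444-444444444444"
E = "55555555-5555-4555-8555-555555555555"

# B derives from A, C derives from B, E derives from D; one edge is invalid.
adjacency = undirected_adjacency([(B, A), (C, B), (E, D), (C, "not-a-uuid")])
component = connected_component(adjacency, B)
```

Starting from `B`, the traversal reaches both its source `A` and its derivative `C`, while the unrelated `D`/`E` pair stays out. That is the reason the view needs both edge directions.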

### View caching / bookkeeping

Each distinct *set* of dependency keys gets its own MongoDB view named
`<PREFIX><utc-iso-timestamp>` (e.g. `dep:2020-10-05T01:22:39.581592`). A bookkeeping
collection (`dep_views`) maps `keys → view name` with an `accessed_on` timestamp and acts
as an LRU: when count exceeds `MONGO_DEPENDENCY_VIEW_CACHE_SIZE`, the least-recently-accessed
view is dropped. `FORCE_REBUILD_DEPENDENCY_VIEW=True` drops and recreates the view on every
query (needed to pick up changes to `DEPENDENCY_KEYS`). All bookkeeping helpers are wrapped
by `@assert_dependency_view_bookkeeping_collection` which lazily creates that collection.
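
The bookkeeping described above behaves like a small least-recently-used cache. The class below is a hypothetical in-memory stand-in, not the plugin's code: the name `ViewBookkeeping`, the `dropped` list, and the use of a `frozenset` to make key order irrelevant are assumptions of this sketch. The real implementation stores its records in MongoDB.

```python
import datetime


class ViewBookkeeping:
    """In-memory stand-in for the bookkeeping collection (hypothetical)."""

    def __init__(self, cache_size, prefix="dep:"):
        self.cache_size = cache_size
        self.prefix = prefix
        self.records = {}  # frozenset(keys) -> {"name": ..., "accessed_on": ...}
        self.dropped = []  # names of evicted views, oldest eviction first

    def view_for(self, dependency_keys, now=None):
        """Return the view name for a key set, creating a record if needed."""
        now = now or datetime.datetime.now(datetime.timezone.utc)
        key = frozenset(dependency_keys)
        record = self.records.get(key)
        if record is None:
            # Name the view after the (naive) UTC ISO timestamp.
            name = self.prefix + now.replace(tzinfo=None).isoformat()
            self.records[key] = {"name": name, "accessed_on": now}
            self._evict()
            return name
        record["accessed_on"] = now  # refresh on every access
        return record["name"]

    def _evict(self):
        """Drop least-recently-accessed records until the cache fits."""
        while len(self.records) > self.cache_size:
            oldest = min(self.records, key=lambda k: self.records[k]["accessed_on"])
            self.dropped.append(self.records.pop(oldest)["name"])


utc = datetime.timezone.utc
t0 = datetime.datetime(2020, 10, 5, 1, 22, 39, 581592, tzinfo=utc)
second = datetime.timedelta(seconds=1)

book = ViewBookkeeping(cache_size=2)
view_a = book.view_for(["a"], now=t0)
view_b = book.view_for(["b"], now=t0 + second)
book.view_for(["a"], now=t0 + 2 * second)           # refreshes "a"
view_c = book.view_for(["c"], now=t0 + 3 * second)  # evicts "b"
```

Because the record for `["a"]` was touched after `["b"]` was created, `["b"]` is the one dropped when the third key set arrives.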

### Security boundary in `utils.py`

`utils.py` is a vendored copy of query-building helpers from `dserver-direct-mongo-plugin`
(deliberately copied to drop the runtime dependency). `_dict_to_mongo_query` can merge a
caller-supplied raw mongo `query`; `_assert_no_forbidden_operators` **recursively rejects**
`$where`, `$function`, `$accumulator` to block server-side JavaScript execution. Tests in
`test_raw_query_hardening.py` lock this down — keep that guarantee when editing.
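
The recursive rejection can be sketched as follows. This is an illustrative reimplementation, not the vendored `_assert_no_forbidden_operators` itself; the public function name and the choice of `ValueError` are assumptions, and the real helper may raise a different exception.

```python
FORBIDDEN_OPERATORS = frozenset({"$where", "$function", "$accumulator"})


def assert_no_forbidden_operators(query):
    """Raise ValueError if a forbidden operator appears anywhere in the query.

    Dicts are checked key by key and lists/tuples element by element, so an
    operator hidden inside e.g. {"$or": [...]} is still found.
    """
    if isinstance(query, dict):
        for key, value in query.items():
            if key in FORBIDDEN_OPERATORS:
                raise ValueError(f"Forbidden operator in query: {key}")
            assert_no_forbidden_operators(value)
    elif isinstance(query, (list, tuple)):
        for item in query:
            assert_no_forbidden_operators(item)


# A harmless query passes silently.
assert_no_forbidden_operators({"name": {"$regex": "^a"}, "tags": ["x", "y"]})

# A nested $where is rejected.
try:
    assert_no_forbidden_operators({"$or": [{"name": "x"}, {"$where": "sleep(1000)"}]})
    rejected = False
except ValueError:
    rejected = True
```

Checking only top-level keys would miss the second query, which is why the walk has to descend into every nested dict and list.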

## Configuration (`config.py`)

`Config` reads env vars at import time. The MongoDB connection settings are read from
unprefixed env vars; the behavioral switches use `DSERVER_`-prefixed env vars, with boolean
flags parsed against `AFFIRMATIVE_EXPRESSIONS`:

- `MONGO_URI` / `MONGO_DB` / `MONGO_COLLECTION` — **required**; `init_app` raises if absent.
Listed in `CONFIG_SECRETS_TO_OBFUSCATE` so the `/config/info` route never returns them clear-text.
- `DSERVER_ENABLE_DEPENDENCY_VIEW` (default True), `DSERVER_DYNAMIC_DEPENDENCY_KEYS` (default True),
`DSERVER_FORCE_REBUILD_DEPENDENCY_VIEW` (default False).
- `DSERVER_DEPENDENCY_KEYS` — JSON list (or bare string) of dotted paths to source UUIDs.
Default: `["readme_parsed.derived_from.uuid", "annotations.source_dataset_uuid"]`. Nesting
hierarchy is irrelevant; the dot-path is just unwound. Note it traverses `readme_parsed`
(the server-parsed README), **not** raw `readme` — a string README breaks traversal.
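
The parsing rules above might look roughly like this. The helpers `parse_flag` and `parse_dependency_keys` are hypothetical, and the case-insensitive comparison and the fallback for non-list JSON are assumptions; only `AFFIRMATIVE_EXPRESSIONS` and the default key list are taken from `config.py`.

```python
import json

AFFIRMATIVE_EXPRESSIONS = ['true', '1', 'y', 'yes', 'on']

DEFAULT_DEPENDENCY_KEYS = [
    'readme_parsed.derived_from.uuid',
    'annotations.source_dataset_uuid',
]


def parse_flag(environ, name, default):
    """Interpret an env var as a boolean switch (assumption: case-insensitive)."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in AFFIRMATIVE_EXPRESSIONS


def parse_dependency_keys(environ, name, default):
    """Accept either a JSON list of dotted paths or a single bare string."""
    raw = environ.get(name)
    if raw is None:
        return list(default)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return [raw]  # bare string such as annotations.source_dataset_uuid
    if isinstance(parsed, list):
        return parsed
    return [raw]  # valid JSON but not a list: treat the raw value as one key


env = {
    "DSERVER_FORCE_REBUILD_DEPENDENCY_VIEW": "Yes",
    "DSERVER_DEPENDENCY_KEYS": '["a.b", "c"]',
}
force_rebuild = parse_flag(env, "DSERVER_FORCE_REBUILD_DEPENDENCY_VIEW", False)
enabled = parse_flag(env, "DSERVER_ENABLE_DEPENDENCY_VIEW", True)
keys = parse_dependency_keys(env, "DSERVER_DEPENDENCY_KEYS", DEFAULT_DEPENDENCY_KEYS)
```

Passing a plain dict instead of `os.environ` keeps the sketch testable without touching the process environment.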

## Conventions / gotchas

- A dataset is truly identified by `(uuid, base_uri)`, but the graph is keyed on `uuid`
alone — duplicate registrations of one uuid across base URIs yield one arbitrary hit
(see the `TODO` in `graph.py`).
- Privilege filtering happens **twice** (pre- and post-graph-traversal) so a user who lacks
access to part of a graph gets a truncated/disconnected result rather than a leak.
- When changing the response shape, update the field-exclusion markers in
`tests/test_graph_routes.py` (server-stamped fields like `created_at`, `frozen_at`,
`uploaded_at`, `uploaded_by` are excluded from comparison).
6 changes: 3 additions & 3 deletions README.rst
@@ -151,7 +151,7 @@ graph by UUID is possible, i.e.
.. code-block:: bash

$ UUID=41a2e3e2-0c01-444f-bd7d-f9bb45512373
$ curl -H "$HEADER" http://localhost:5000/graph/lookup/$UUID
$ curl -H "$HEADER" http://localhost:5000/graph/uuids/$UUID

Looking up a dependency graph by UUID will result in unique per-UUID hits.
As it is possible for a dataset to be registered in more than one base
@@ -172,7 +172,7 @@ of desired dependency keys attached
$ curl -H "$HEADER" -H "Content-Type: application/json" \
-X POST -d \
'["annotations.source_dataset_uuid","readme.derived_from.uuid"]' \
http://localhost:5000/graph/lookup/$UUID
http://localhost:5000/graph/uuids/$UUID

If a view for this particular set of keys does not exist yet, the server will
generate and cache it on-the-fly. This can be observed in the mongo shell
@@ -217,7 +217,7 @@ and querying with a specific set of keys for the first time
$ curl -H "$HEADER" -H "Content-Type: application/json" \
-X POST -d \
'["another.possibly_nested.dependency_key"]' \
http://localhost:5000/graph/lookup/$UUID
http://localhost:5000/graph/uuids/$UUID

will result in an additional view named uniquely by the current UTC time::

14 changes: 9 additions & 5 deletions dserver_dependency_graph_plugin/__init__.py
@@ -33,7 +33,7 @@
from dservercore import AuthenticationError, ExtensionABC
from dservercore.sql_models import DatasetSchema
from dservercore.utils import _preprocess_privileges
from dserver_direct_mongo_plugin.utils import _dict_to_mongo_query
from .utils import _dict_to_mongo_query

from .schemas import DependencyKeysSchema

@@ -116,7 +116,8 @@ def _get_dependency_view_bookkeeping_record(dependency_keys):
@assert_dependency_view_bookkeeping_collection
def _create_dependency_view_bookkeeping_record(name, dependency_keys):
ret = DependencyGraphExtension.db[Config.MONGO_DEPENDENCY_VIEW_BOOKKEEPING].insert_one(
{'name': name, 'keys': dependency_keys, 'accessed_on': datetime.datetime.utcnow()})
{'name': name, 'keys': dependency_keys,
'accessed_on': datetime.datetime.now(datetime.timezone.utc)})
# drop oldest entry if number of documents exceeds allowed maximum
count = DependencyGraphExtension.db[Config.MONGO_DEPENDENCY_VIEW_BOOKKEEPING].count_documents({})
if count > Config.MONGO_DEPENDENCY_VIEW_CACHE_SIZE:
@@ -135,7 +136,8 @@ def _create_dependency_view_bookkeeping_record(name, dependency_keys):
def _update_dependency_view_bookkeeping_record(name):
"""Updated record to dependency view bookkeeping collection or add if new."""
return DependencyGraphExtension.db[Config.MONGO_DEPENDENCY_VIEW_BOOKKEEPING].update_one(
{'name': name}, {'$set': {'accessed_on': datetime.datetime.utcnow()}})
{'name': name},
{'$set': {'accessed_on': datetime.datetime.now(datetime.timezone.utc)}})


# mid-level dependency view helpers
@@ -146,7 +148,8 @@ def _create_dependency_view(dependency_keys):
:returns: str"""

# generate unique, valid name for view from prefix and ISO date string
datestring = datetime.datetime.utcnow().isoformat()
datestring = datetime.datetime.now(
datetime.timezone.utc).replace(tzinfo=None).isoformat()
name = Config.MONGO_DEPENDENCY_VIEW_PREFIX + datestring

if name in DependencyGraphExtension.db.list_collection_names():
@@ -246,7 +249,8 @@ def dependency_graph_by_user_and_uuid(username, uuid, dependency_keys=Config.DEP
mongo_aggregation = query_dependency_graph(pre_query=pre_query,
post_query=post_query,
dependency_keys=dependency_keys,
mongo_dependency_view=dependency_view)
mongo_dependency_view=dependency_view,
mongo_collection=current_app.config['MONGO_COLLECTION'])
logger.debug("Constructed mongo aggregation: {}".format(mongo_aggregation))
cx = DependencyGraphExtension.db[current_app.config['MONGO_COLLECTION']].aggregate(mongo_aggregation)

13 changes: 11 additions & 2 deletions dserver_dependency_graph_plugin/config.py
@@ -4,10 +4,19 @@
AFFIRMATIVE_EXPRESSIONS = ['true', '1', 'y', 'yes', 'on']


CONFIG_SECRETS_TO_OBFUSCATE = []
CONFIG_SECRETS_TO_OBFUSCATE = [
"MONGO_URI",
"MONGO_DB",
"MONGO_COLLECTION",
]


class Config(object):
# MongoDB connection settings
# These are required for the dependency graph plugin to connect to MongoDB
MONGO_URI = os.environ.get("MONGO_URI")
MONGO_DB = os.environ.get("MONGO_DB")
MONGO_COLLECTION = os.environ.get("MONGO_COLLECTION")
# If enabled, the underlying database will offer dependency graph views on
# the server's default collection. Those views offer on-the-fly-generated
# collections of undirected per-dataset adjacency lists in order to
@@ -34,7 +43,7 @@ class Config(object):
# a single key or a JSON-formatted list of keys.
# Nested fields are separated by a dot (.)
DEPENDENCY_KEYS = [
'readme.derived_from.uuid',
'readme_parsed.derived_from.uuid',
'annotations.source_dataset_uuid'
]

13 changes: 6 additions & 7 deletions dserver_dependency_graph_plugin/graph.py
@@ -1,15 +1,14 @@
"""Aggregation pipelines for graph operations."""

from dserver_dependency_graph_plugin.config import Config as dependency_graph_plugin_config
from dserver_direct_mongo_plugin.config import Config as direct_mongo_plugin_config
from .config import Config

# a regular expression to filter valid v4 UUIDs
UUID_v4_REGEX = '[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[4][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}'


# most of those 'functions' are pretty static and just wrapped in function
# definitions for convenience.
def unwind_dependencies(dependency_keys=dependency_graph_plugin_config.DEPENDENCY_KEYS):
def unwind_dependencies(dependency_keys=Config.DEPENDENCY_KEYS):
"""Create parallel aggregation pipelines for unwinding all configured dependency keys."""

parallel_aggregations = []
@@ -41,7 +40,7 @@ def unwind_dependencies(dependency_keys=dependency_graph_plugin_config.DEPENDENC
return parallel_aggregations


def merge_dependencies(dependency_keys=dependency_graph_plugin_config.DEPENDENCY_KEYS):
def merge_dependencies(dependency_keys=Config.DEPENDENCY_KEYS):
"""Aggregate (directed) dependency graph edges.

All configured dependency keys are merged in a key-agnostic 'dependencies'
@@ -117,7 +116,7 @@ def group_inverse_dependencies():
return aggregation


def build_undirected_adjecency_lists(dependency_keys=dependency_graph_plugin_config.DEPENDENCY_KEYS):
def build_undirected_adjecency_lists(dependency_keys=Config.DEPENDENCY_KEYS):
"""Aggregate undirected adjacency lists."""
aggregation = [
*merge_dependencies(dependency_keys),
@@ -200,8 +199,8 @@ def build_undirected_adjecency_lists(dependency_keys=dependency_graph_plugin_con
# behavior would be to yield all redundant dataset entries for a uuid.
def query_dependency_graph(mongo_dependency_view,
pre_query, post_query=None,
dependency_keys=dependency_graph_plugin_config.DEPENDENCY_KEYS,
mongo_collection=direct_mongo_plugin_config.MONGO_COLLECTION):
dependency_keys=Config.DEPENDENCY_KEYS,
mongo_collection=None):
"""Aggregation pipeline for querying dependency view on datasets collection.

:param pre_query: selects all documents for which to query the dependency graph.