Skip to content

feat(catalog): seed unauthenticated public APIs (arXiv, OpenAlex, Crossref)#592

Open
AlyciaBHZ wants to merge 2 commits into
ChronoAIProject:mainfrom
AlyciaBHZ:add-public-academic-catalog
Open

feat(catalog): seed unauthenticated public APIs (arXiv, OpenAlex, Crossref)#592
AlyciaBHZ wants to merge 2 commits into
ChronoAIProject:mainfrom
AlyciaBHZ:add-public-academic-catalog

Conversation

@AlyciaBHZ

@AlyciaBHZ AlyciaBHZ commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds three first-class catalog entries for unauthenticated public academic APIs:

  • arxiv-api (https://export.arxiv.org/api) -- Atom feed search/metadata for arXiv papers.
  • api-openalex (https://api.openalex.org) -- OpenAlex scholarly works, authors, institutions, concepts, and citations.
  • api-crossref (https://api.crossref.org) -- Crossref DOI metadata and citation graph.

These have no ProviderConfig to bind to. The implementation introduces a parallel DEFAULT_PUBLIC_SERVICE_SEEDS table and a second seed loop in seed_default_services that produces DownstreamService rows with provider_config_id: None, auth_method: "none", requires_user_credential: false, and no ServiceProviderRequirement.

build_catalog_entry already tolerates provider: None and returns requires_credential: false, so these services are represented as no-auth catalog entries rather than credential setup flows.

Why route public APIs through NyxID?

The proxy injects nothing on these calls. The benefit is operational:

  • Ergonomics: users and agents get a stable catalog slug such as arxiv-api instead of repeating --custom --base-url ... setup on each machine.
  • Single source of truth: catalog-level URL, description, category, and future metadata are managed once instead of drifting across local custom service definitions.
  • Admin-managed default headers: polite-pool / contact / User-Agent defaults can be populated once via catalog default_request_headers and inherited by every agent that enables the service.
  • Future rate-limit / circuit-breaker hooks: if NyxID later adds per-service rate limiting, public APIs can use the same service-level control point as credentialed APIs.
  • Discoverability: nyxid catalog show arxiv-api can explain the no-auth policy and official API docs from inside NyxID.

This is not an audit-log distinction between catalog and custom services: custom services also route through NyxID and are audited. The PR is about reducing per-user boilerplate and avoiding drift for common public academic sources.

Motivation

I am using NyxID to broker external APIs for an outreach/research pipeline that scans scholarly sources while working on open mathematical problems. arXiv, OpenAlex, and Crossref are common enough in agent literature workflows that they are better represented as shared catalog entries than as repeated local custom definitions.

The same argument extends to citation mining, paper deduplication, related-work search, and research-board refresh tasks.

Implementation notes

  • New seed table is distinct from DEFAULT_SERVICE_SEEDS rather than threading Option<&str> through the existing provider_slug field. This isolates the no-provider path from the existing provider-backed seeds, so the SPR / token-exchange logic stays unchanged for credentialed cases.
  • list_catalog now includes no-auth public API catalog rows by accepting the explicit auth_method: "none", service_category: "internal", provider_config_id: null case.
  • Slug uniqueness is enforced by a unit test that also asserts no collision with DEFAULT_SERVICE_SEEDS.
  • arXiv uses https://export.arxiv.org/api.

Test plan

  • cargo fmt --check
  • cargo test -p nyxid provider_service::tests --no-fail-fast
  • cargo test -p nyxid catalog_service::tests --no-fail-fast

Follow-up ideas

  • Populate catalog default_request_headers for polite-pool conventions where appropriate.
  • Add documentation_url to DownstreamService and move documentation links out of descriptions.
  • Add Semantic Scholar as a credentialed API seed if desired.
  • Add ORCID public API under the same no-auth path.

…ssref)

Adds DEFAULT_PUBLIC_SERVICE_SEEDS + a parallel seed loop in
seed_default_services for catalog entries that don't bind to any
ProviderConfig. Resulting DownstreamService rows have:

  - provider_config_id: None
  - auth_method: "none"
  - requires_user_credential: false
  - no ServiceProviderRequirement

build_catalog_entry already tolerates `provider: None` and emits
`requires_credential: false`, so these surface in the AI Services
dialog as one-click no-auth services. The proxy injects nothing — the
benefit is centralised audit logging and a single place to manage
polite-pool / rate-limit headers across agents that hit the same
public source.

Three initial seeds:
- `arxiv-api` (http://export.arxiv.org/api): Atom feed search/metadata
- `api-openalex` (https://api.openalex.org): 240M+ scholarly works graph
- `api-crossref` (https://api.crossref.org): DOI metadata + citations

Each description includes the polite-pool convention so agents can
discover it from `nyxid catalog show <slug>` without leaving NyxID.

Tests:
- `public_service_seeds_have_unique_slugs_and_no_collision_with_default_seeds`
- `arxiv_public_seed_is_present_and_unauthenticated`

Motivation: agents working on academic / open-problem domains (e.g.
literature staleness checks against erdosproblems / RESEARCH_BOARD
targets, citation graph mining) need these sources first-class. Today
they have to use `service add --custom` per machine and lose the
audit trail. Seeding them in catalog gives one-line `nyxid service add
arxiv-api` everywhere.
@AlyciaBHZ AlyciaBHZ force-pushed the add-public-academic-catalog branch from edb9f9d to d1934e0 Compare April 30, 2026 13:21
@AlyciaBHZ

Copy link
Copy Markdown
Contributor Author

Thanks for the careful read. Taking all three.

1. arXiv → https

Fixing. One-liner. Will land in the next push.

2. AI Services dialog filter — taking option (a)

Agreed (a) is the right call: the description's framing should be true, and nyxid service add arxiv-api is the only path I actually exercised, so the dialog-visibility claim was wishful thinking on my part rather than verified.

Plan:

  • extend the list_catalog $or in backend/src/services/catalog_service.rs:217 with the fourth clause you proposed: { auth_method: "none", service_category: "internal", provider_config_id: null };
  • add a unit test that seeds the catalog and asserts arxiv-api (and the other two) appear in list_catalog output, plus a regression assertion that the existing 28 provider-backed seeds still match exactly once (no double-counting via the new clause);
  • I'll re-verify your claim that all 28 existing seeds match via provider_config_id != null before relying on it, but on first read that lines up with what's in the seed table.

3. Audit-trail framing — fixing the description

You're right, the "loses audit trail" line is wrong as written. --custom services route through the same /api/v1/proxy/{service_id}/{path} handler (handlers/proxy.rs:179, 217) and produce identical audit_personal_routing / audit_org_routing entries. The audit-vs-raw-curl distinction is real but it's not the distinction this PR is for.

Will rewrite the motivation as:

  1. Ergonomics: no per-user --base-url boilerplate; one slug per service across all agents and machines.
  2. Single source of truth: catalog-level URL, description, and category; no drift between local --custom definitions on different machines.
  3. Admin-managed default_request_headers: future polite-pool / User-Agent / contact-email propagation lands once at the catalog and reaches every agent without per-machine re-registration. (Note: current PR sets default_request_headers: None, so the polite-pool benefit is latent until an admin populates it. I'll either add a // TODO: populate polite-pool headers comment in the seed table or open a follow-up issue — preference?)

Smaller / optional

Adding the comment line on service_category: "internal" / created_by: "system" / visibility: "public" rationale in the seed struct — agreed, future-maintainer-readability win.

The parallel-seed-table vs. threading-Option<&str> tradeoff: noted that future capability/header backfill mechanisms will need a parallel path. Will keep that in mind when the polite-pool follow-up lands.

Push order

I'll batch (1) https, (2a) list_catalog $or extension + unit test, and (3) PR description rewrite into a single push, then convert (3)-followups (TODO comment vs. follow-up issue) per your preference. Re-requesting review once that's in.

@AlyciaBHZ

Copy link
Copy Markdown
Contributor Author

Pushed the requested follow-up fixes in 64b6534.

What changed:

  • arXiv now uses https://export.arxiv.org/api.
  • list_catalog now includes the no-auth public API case explicitly:
    { auth_method: "none", service_category: "internal", provider_config_id: null }.
  • Added a catalog-service unit test pinning that new filter arm, alongside the existing provider-service seed tests.
  • Rewrote the PR description to remove the incorrect custom-service audit-trail framing. The motivation is now ergonomics, single source of truth, and catalog-level default-header management.

Validation:

  • cargo fmt --check
  • cargo test -p nyxid provider_service::tests --no-fail-fast
  • cargo test -p nyxid catalog_service::tests --no-fail-fast

For the polite-pool / default_request_headers question: I left actual header population as a follow-up rather than seeding placeholder headers in this PR, since the right contact identity is deployment-specific.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant