Skip to content

feat: restart RAG container on config change#343

Open
tsivaprasad wants to merge 1 commit intomainfrom
PLAT-497-rag-service-live-configuration-hot-reload
Open

feat: restart RAG container on config change#343
tsivaprasad wants to merge 1 commit intomainfrom
PLAT-497-rag-service-live-configuration-hot-reload

Conversation

@tsivaprasad
Copy link
Copy Markdown
Contributor

Summary

This PR restart the RAG container when config.pipelines changes via a Control
Plane update, so new pipeline settings and API keys take effect without
manual intervention.

Changes

  • Embed PGEDGE_CONFIG_VERSION=<hash> as an env var in the RAG container
    spec. Docker Swarm only restarts a service when its TaskTemplate
    changes; a config-only update writes a new YAML to the bind mount but
    the container would otherwise keep running with the old config in memory.
    The hash (SHA256 of config.pipelines, truncated to 8 bytes) changes
    whenever pipelines or API keys change, making the update visible to Swarm.

Testing

Verification:

  1. Created a cluster
  2. Created DB using rag_create_db.json

Verified hash

docker service inspect rag-storefront-rag-689qacsi \
  --format '{{range .Spec.TaskTemplate.ContainerSpec.Env}}{{println .}}{{end}}' \
  | grep PGEDGE_CONFIG_VERSION

PGEDGE_CONFIG_VERSION=0f5c8aee2d69bf90

  1. Updated DB using rag_update_db.json
    Verified hash
 docker service inspect rag-storefront-rag-689qacsi \                                   
  --format '{{range .Spec.TaskTemplate.ContainerSpec.Env}}{{println .}}{{end}}' \
  | grep PGEDGE_CONFIG_VERSION

PGEDGE_CONFIG_VERSION=72685a12bfb356af

  1. Confirmed that the rag service restarted successfully when there is a change in config
docker service ps rag-storefront-rag-689qacsi 2>/dev/null


ID             NAME                                IMAGE                              NODE             DESIRED STATE   CURRENT STATE                ERROR     PORTS
8z7r3d52noy6   rag-storefront-rag-689qacsi.1       ghcr.io/pgedge/rag-server:latest   docker-desktop   Running         Running about an hour ago              *:55020->8080/tcp
nyhyhx6vqjze    \_ rag-storefront-rag-689qacsi.1   ghcr.io/pgedge/rag-server:latest   docker-desktop   Shutdown        Shutdown about an hour ago 

Checklist

  • Tests added

Notes for Reviewers

The hot-reload acceptance criteria (in-flight query continuity, no downtime) are for the RAG server repo in a parallel ticket. This PR covers the control-plane side: ensuring the container is restarted with the updated config file on disk.

@codacy-production
Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 duplication

Metric Results
Duplication 0

View in Codacy

TIP This summary will be updated as you push new changes. Give us feedback

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 13, 2026

📝 Walkthrough

Walkthrough

The pull request refines Docker service failure handling to prevent premature exit when scaling services to zero, and adds SHA-256-based config versioning to RAG services by injecting a PGEDGE_CONFIG_VERSION environment variable derived from service configuration, with corresponding test coverage.

Changes

Cohort / File(s) Summary
Docker Service Failure Handling
server/internal/docker/docker.go
Updated WaitForService early-exit condition for task failures to additionally require desired > 0, preventing service-start failure when scaling replicated services to zero.
RAG Configuration Versioning
server/internal/orchestrator/swarm/service_spec.go
Added ragConfigHash function to compute SHA-256-based hex digests from service config, and modified ServiceContainerSpec to inject PGEDGE_CONFIG_VERSION environment variable into RAG service containers.
RAG Configuration Versioning Tests
server/internal/orchestrator/swarm/service_spec_test.go
Added TestRagConfigHash and TestServiceContainerSpec_RAGHasConfigVersionEnv test cases to verify deterministic hash generation, fixed-length output, and environment variable injection matching config changes.

Poem

🐰 When Docker scales services down to nil,
Task failures now respect the zero will!
And RAG configs sport a hashed-hash glow,
Their versions tracked where config changes flow!
Bugs caught swift by tests, both wise and true,
Config fidelity renewed anew! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the main objective: restarting the RAG container when configuration changes, which aligns with all the changes made across multiple files in the PR.
Description check ✅ Passed The PR description includes all required template sections: Summary, Changes, Testing, Checklist, and Notes for Reviewers. It provides sufficient detail about the implementation and verification steps.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch PLAT-497-rag-service-live-configuration-hot-reload

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/internal/orchestrator/swarm/service_spec.go`:
- Around line 43-51: The current ragConfigHash(config map[string]any) marshals
the entire config causing unrelated changes to trigger restarts; change it to
marshal only the pipelines entry: extract config["pipelines"] (safely handle
missing or nil by using an empty value), json.Marshal that pipelines value (not
the whole map) to produce b, then compute the sha256 on b and return the short
hex as before; ensure deterministic JSON by relying on json.Marshal of the
pipelines sub-structure so only pipeline changes affect the resultant hash.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6b3c3124-c4fb-4854-9193-a604644fe59b

📥 Commits

Reviewing files that changed from the base of the PR and between 1376a3b and ed864cd.

📒 Files selected for processing (3)
  • server/internal/docker/docker.go
  • server/internal/orchestrator/swarm/service_spec.go
  • server/internal/orchestrator/swarm/service_spec_test.go

Comment on lines +43 to +51
// ragConfigHash returns a short hex digest of the RAG service configuration.
// It is embedded in the container spec as PGEDGE_CONFIG_VERSION so that Docker
// Swarm detects a TaskTemplate change and restarts the container whenever the
// pipeline configuration or API keys change.
func ragConfigHash(config map[string]any) string {
b, _ := json.Marshal(config)
sum := sha256.Sum256(b)
return fmt.Sprintf("%x", sum[:8])
}
Copy link
Copy Markdown

@coderabbitai coderabbitai bot Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Hash input is broader than the pipelines-only requirement.

On Line 48, json.Marshal(config) hashes the entire service config, so unrelated config edits will also force RAG restarts. If the intended behavior is “restart on config.pipelines changes,” scope the hash input to config["pipelines"].

♻️ Proposed fix
 func ragConfigHash(config map[string]any) string {
-	b, _ := json.Marshal(config)
+	b, _ := json.Marshal(config["pipelines"])
 	sum := sha256.Sum256(b)
 	return fmt.Sprintf("%x", sum[:8])
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/internal/orchestrator/swarm/service_spec.go` around lines 43 - 51, The
current ragConfigHash(config map[string]any) marshals the entire config causing
unrelated changes to trigger restarts; change it to marshal only the pipelines
entry: extract config["pipelines"] (safely handle missing or nil by using an
empty value), json.Marshal that pipelines value (not the whole map) to produce
b, then compute the sha256 on b and return the short hex as before; ensure
deterministic JSON by relying on json.Marshal of the pipelines sub-structure so
only pipeline changes affect the resultant hash.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not applying the suggestion — the full config hash is intentional. defaults (token budget, top_n, search weights) also lands in the generated YAML

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsivaprasad, understood! Since all config fields — including defaults like token budget, top_n, and search weights — feed into the generated YAML, hashing the full config is the correct approach to ensure the container restarts on any meaningful change. I'll note this for future reviews.


✏️ Learnings added
Learnt from: tsivaprasad
Repo: pgEdge/control-plane PR: 343
File: server/internal/orchestrator/swarm/service_spec.go:43-51
Timestamp: 2026-04-13T15:46:51.867Z
Learning: In `server/internal/orchestrator/swarm/service_spec.go` (Go), `ragConfigHash` intentionally marshals the entire `config map[string]any` (not just `config["pipelines"]`) because all config fields — including defaults such as token budget, top_n, and search weights — are written into the generated YAML. Hashing the full config ensures the RAG container restarts whenever any config value changes. Do not suggest scoping the hash to `config["pipelines"]` only.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: tsivaprasad
Repo: pgEdge/control-plane PR: 302
File: server/internal/orchestrator/swarm/orchestrator.go:686-704
Timestamp: 2026-04-04T13:55:26.501Z
Learning: In `server/internal/orchestrator/swarm/orchestrator.go` (Go), the `generateRAGInstanceResources` function intentionally omits the `ServiceInstanceSpecResource`/`ServiceInstanceResource` pair (and the corresponding `KeysPath` mount wiring). The deployable service spec/instance resources for RAG are deferred to a subsequent ticket. Do not flag the absence of `ServiceInstanceSpecResource`/`ServiceInstanceResource` in `generateRAGInstanceResources` as a missing implementation during code review.

Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 280
File: server/internal/orchestrator/swarm/mcp_config_resource.go:90-94
Timestamp: 2026-03-02T20:15:14.804Z
Learning: In MCP config resources (e.g., server/internal/orchestrator/swarm/mcp_config_resource.go and similar tokens.yaml/users.yaml), Refresh should only verify the existence of config.yaml. Tokens.yaml and users.yaml are owned and managed by the MCP server at runtime and should not be rewritten on Update. Do not trigger recreation that would overwrite runtime state. This ownership model is consistent with the comment indicating Do NOT touch tokens.yaml or users.yaml. Apply this guidance to all MCP config resource files in the swarm directory.

Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 287
File: server/internal/orchestrator/swarm/service_user_role.go:59-60
Timestamp: 2026-03-11T16:05:27.239Z
Learning: In server/internal/orchestrator/swarm/service_user_role.go, treat a ResourceVersion bump (e.g., 2 -> 3) as a documentation/safety marker only. When the resource Identifier key changes (e.g., ServiceInstanceID to ServiceID), the system will automatically recreate resources due to the Type/ID key mismatch. Do not flag a ResourceVersion bump as a missing recreate trigger if an Identifier change is already causing a full recreate for that resource. This guidance is file-specific and should be applied when reviewing changes to service_user_role.go.

Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 266
File: server/internal/workflows/update_database.go:122-153
Timestamp: 2026-02-09T21:47:46.904Z
Learning: In server/internal/workflows/update_database.go and related Go workflow files (e.g., provision_services.go), implement idempotent provisioning semantics: re-running ProvisionServices on already-provisioned service instances should be safe and non-destructive. Upsert operations (e.g., StoreServiceInstance) may temporarily set state to 'creating' but should recover back to 'running' via monitoring. Use SQL-level idempotence for user/service roles (e.g., CREATE ROLE IF NOT EXISTS / ALTER ROLE) and ensure deployment diffs compare desired vs. current state. These patterns support recovery when a user retries with corrected specs after a failure, and should be considered in similar Go workflow files across the repository.

Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 280
File: server/internal/orchestrator/swarm/mcp_config.go:95-97
Timestamp: 2026-03-02T20:17:36.231Z
Learning: In the pgEdge/control-plane repository, internal (unexported) functions that are called through controlled paths by the resource framework should assume validated input from their invocation boundaries. Do not add defensive nil guards inside such internal helpers; rely on validation at the resource construction/invocation boundaries (e.g., MCPConfigResource.Create/Update) to guarantee data integrity. This guideline applies to all Go files under server/internal where internal helpers serve as implementation details behind controlled APIs.

Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 293
File: server/internal/orchestrator/common/paths.go:79-157
Timestamp: 2026-03-24T13:24:39.229Z
Learning: In pgEdge/control-plane, when reviewing Go packages under server/ that use the samber/do injector, apply the `Provide()` requirement only to packages that expose singleton components which are consumed as dependencies by other singleton components via the injector. Do NOT require `Provide()` for pure utility/value-type packages (e.g., path helpers, data structures) that are not injected as singleton dependencies.

Learnt from: jason-lynch
Repo: pgEdge/control-plane PR: 312
File: server/internal/workflows/backend/etcd/store.go:52-58
Timestamp: 2026-03-24T19:00:21.971Z
Learning: In pgEdge/control-plane, treat errors returned from application startup/initialization functions as fatal: a startup failure should trigger the normal application shutdown path, which then calls each component’s `Shutdown`/`Stop` for cleanup. Therefore, when reviewing code in startup/initialization paths (e.g., cache/etcd initialization like `StartCaches`), do not require explicit rollback/cleanup of partially-initialized resources inside the startup function based on its error return. Avoid review suggestions that push intermediate rollback patterns for these fatal startup errors, since the existing shutdown sequence is responsible for teardown.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant