
feat: scheduled database snapshots#37

Open
pthmas wants to merge 4 commits into main from issue-23

Conversation


@pthmas pthmas commented Mar 19, 2026

Summary

  • Adds opt-in daily pg_dump snapshots as a background tokio task, following the same spawn pattern as the indexer and metadata fetcher
  • Configurable via SNAPSHOT_ENABLED, SNAPSHOT_TIME (HH:MM UTC), SNAPSHOT_RETENTION (max files kept), and SNAPSHOT_DIR
  • Automatic cleanup of old snapshots beyond the retention count
  • Docker image updated with postgresql16-client for pg_dump; compose file wired with env vars and volume mount

Closes #23

Summary by CodeRabbit

  • New Features

    • Optional daily PostgreSQL snapshotting with UTC schedule, configurable retention count, and custom storage path.
  • Chores

    • Container and compose defaults updated to support snapshot storage and non-root bind-mount permissions.
  • Tests

    • Added unit and integration tests covering snapshot config parsing, scheduler behavior, retention cleanup, and a dump/restore round-trip (ignored by default).

pthmas added 3 commits March 19, 2026 17:11
Add opt-in daily pg_dump snapshots to protect against data loss.
Runs as a background tokio task with configurable time, retention,
and output directory. Includes integration test for dump/restore
round-trip.

coderabbitai bot commented Mar 19, 2026

📝 Walkthrough

Adds an automated, time-based PostgreSQL snapshot scheduler to the Atlas server with configurable enablement, schedule, retention, storage directory, retry/backoff behavior, and cleanup of old snapshots; includes CLI tests and Docker/compose configuration for persistent host mounts.
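The fixed retry backoff mentioned here can be sketched as a tiny helper. This is an illustrative, std-only Rust snippet — the function name `backoff_secs` is invented and is not the actual code in snapshot.rs:

```rust
/// Fixed retry backoff schedule described in this PR: 5s, 10s, 20s, 30s,
/// 60s, with every later attempt capped at the final 60s interval.
fn backoff_secs(attempt: usize) -> u64 {
    const SCHEDULE: [u64; 5] = [5, 10, 20, 30, 60];
    SCHEDULE[attempt.min(SCHEDULE.len() - 1)]
}

fn main() {
    assert_eq!(backoff_secs(0), 5);
    assert_eq!(backoff_secs(3), 30);
    assert_eq!(backoff_secs(99), 60); // later attempts stay capped at 60s
    println!("ok");
}
```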

Changes

Cohort / File(s) — Summary

  • Environment & Compose (.env.example, docker-compose.yml): Added snapshot-related env vars SNAPSHOT_ENABLED, SNAPSHOT_TIME (UTC HH:MM), SNAPSHOT_RETENTION, SNAPSHOT_DIR, and a SNAPSHOT_HOST_DIR mount; atlas-server runs under a configurable UID:GID and mounts the host snapshot dir.
  • Docker image (backend/Dockerfile): Installs postgresql16-client, creates the /snapshots dir, and sets its ownership to the atlas user before USER atlas.
  • Server config (backend/crates/atlas-server/src/config.rs): New SnapshotConfig with from_env(database_url) parsing SNAPSHOT_* vars, validation (time format, non-zero retention, non-empty dir), Debug redaction for the DB URL, and unit tests.
  • Snapshot module (backend/crates/atlas-server/src/snapshot.rs): New async scheduler run_snapshot_loop that computes the next run, runs pg_dump to a temp file, atomically renames it on success, retries with fixed backoff (5s, 10s, 20s, 30s, 60s), and prunes old dumps while ignoring .tmp files; includes unit tests.
  • Server integration (backend/crates/atlas-server/src/main.rs): Loads SnapshotConfig and conditionally spawns the background snapshot scheduler task wrapped in the existing retry/backoff helper.
  • Tests & tooling (backend/crates/atlas-server/Cargo.toml, backend/crates/atlas-server/tests/integration/common.rs, .../tests/integration/main.rs, .../tests/integration/snapshots.rs): Adds a tempfile dev-dependency; extends TestEnv to expose database_url(); adds the integration test snapshot_dump_and_restore_round_trip (ignored by default; skips if pg_dump/pg_restore are missing).
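To make the SNAPSHOT_TIME and SNAPSHOT_RETENTION validation rules above concrete, here is a hedged std-only sketch. The function names are hypothetical — the real SnapshotConfig::from_env additionally handles SNAPSHOT_ENABLED, SNAPSHOT_DIR, and the database URL:

```rust
/// Parse an "HH:MM" snapshot time (interpreted as UTC); hour must be
/// 0-23 and minute 0-59, mirroring the validation described in this PR.
fn parse_snapshot_time(s: &str) -> Result<(u32, u32), String> {
    let (h, m) = s
        .split_once(':')
        .ok_or_else(|| format!("expected HH:MM, got {s:?}"))?;
    let h: u32 = h.parse().map_err(|_| format!("bad hour in {s:?}"))?;
    let m: u32 = m.parse().map_err(|_| format!("bad minute in {s:?}"))?;
    if h > 23 || m > 59 {
        return Err(format!("time out of range: {s:?}"));
    }
    Ok((h, m))
}

/// Parse a retention count; zero is rejected so the scheduler never
/// deletes every snapshot it just produced.
fn parse_retention(s: &str) -> Result<usize, String> {
    let n: usize = s.parse().map_err(|_| format!("bad retention {s:?}"))?;
    if n == 0 {
        return Err("SNAPSHOT_RETENTION must be non-zero".to_string());
    }
    Ok(n)
}

fn main() {
    assert_eq!(parse_snapshot_time("03:30"), Ok((3, 30)));
    assert!(parse_snapshot_time("24:00").is_err());
    assert!(parse_snapshot_time("0330").is_err());
    assert_eq!(parse_retention("7"), Ok(7));
    assert!(parse_retention("0").is_err());
    println!("ok");
}
```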

Sequence Diagram

```mermaid
sequenceDiagram
    participant Timer as Scheduler Timer (tokio sleep)
    participant SnapLoop as run_snapshot_loop
    participant Backoff as Retry Backoff
    participant PGDump as pg_dump process
    participant FS as File System
    participant Cleanup as cleanup_old_snapshots

    loop daily schedule
        Timer->>SnapLoop: Wake at scheduled UTC time
        SnapLoop->>FS: Ensure snapshot dir exists
        SnapLoop->>PGDump: Spawn pg_dump -> write to `.tmp`
        PGDump->>FS: Write dump data to temp file
        alt pg_dump succeeds
            FS->>FS: Rename `.tmp` -> `.dump` (atomic)
            SnapLoop->>Cleanup: Remove old dumps beyond retention
            Cleanup->>FS: List/sort/remove old files (ignore `.tmp`)
            SnapLoop->>Backoff: Reset retry state
        else pg_dump fails
            SnapLoop->>FS: Remove `.tmp` if present
            SnapLoop->>Backoff: Wait next backoff interval (5s,10s,20s,30s,60s)
            Backoff->>SnapLoop: Retry attempt
        end
    end
```
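The scheduling and retention steps in the diagram can be approximated with std-only Rust. These helpers are hypothetical simplifications — the real module uses chrono and tokio — but they make the day-wrap and pruning logic concrete:

```rust
/// Seconds to sleep until the next scheduled run, given the current
/// seconds-since-UTC-midnight and a target HH:MM; if the target time has
/// already passed today, the run wraps to the next day.
fn secs_until_next_run(now_secs: u64, hour: u64, minute: u64) -> u64 {
    let target = hour * 3600 + minute * 60;
    if target > now_secs {
        target - now_secs
    } else {
        86_400 - now_secs + target // wrap to tomorrow
    }
}

/// Retention pruning over timestamp-named dumps: skip in-progress `.tmp`
/// files, sort descending so the newest names come first (lexicographic
/// timestamps sort chronologically), and return everything past `retain`.
fn files_to_remove(mut files: Vec<String>, retain: usize) -> Vec<String> {
    files.retain(|f| !f.ends_with(".tmp"));
    files.sort_by(|a, b| b.cmp(a));
    files.split_off(retain.min(files.len()))
}

fn main() {
    assert_eq!(secs_until_next_run(0, 3, 30), 12_600);
    assert_eq!(secs_until_next_run(13_000, 3, 30), 86_000);
    let dumps = vec![
        "2026-03-17.dump".to_string(),
        "2026-03-19.dump".to_string(),
        "2026-03-18.dump".to_string(),
        "2026-03-20.dump.tmp".to_string(),
    ];
    assert_eq!(files_to_remove(dumps, 2), vec!["2026-03-17.dump".to_string()]);
    println!("ok");
}
```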

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • tac0turtle

Poem

🐰 I nibble logs at break of dawn,
Dumps hop out before they're gone.
Timed and tucked in folder deep,
Old crumbs trimmed while watchers sleep.
Hoppity backups—safe and neat!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 62.50%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check — skipped; CodeRabbit's high-level summary is enabled.
  • Title check — The title 'feat: scheduled database snapshots' clearly and concisely summarizes the main change: adding automated daily database snapshots.
  • Linked Issues check — Changes fully implement the #23 requirements: automated snapshot generation via a background task, database dump/restore with custom format, configurable retention policy, and storage path options.
  • Out of Scope Changes check — All changes align with the snapshot feature scope: configuration infrastructure, scheduled task implementation, snapshot storage/retention logic, and testing.


@pthmas pthmas marked this pull request as ready for review March 19, 2026 16:35

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (5)
backend/crates/atlas-server/Cargo.toml (1)

48-48: Use workspace-managed versioning for tempfile.

Please avoid pinning tempfile in the crate manifest and reference it via workspace instead, so dependency versions stay centralized.

Suggested change

```diff
 [dev-dependencies]
 testcontainers = { workspace = true }
 testcontainers-modules = { workspace = true }
 tokio = { workspace = true }
 tower = { workspace = true, features = ["util"] }
 serde_json = { workspace = true }
 sqlx = { workspace = true }
-tempfile = "3"
+tempfile = { workspace = true }
```

Based on learnings: Keep all Rust workspace dependency versions in the root Cargo.toml.

backend/crates/atlas-server/tests/integration/snapshots.rs (2)

57-57: Fragile database URL substitution.

The replace("/postgres", "/test_restore") pattern assumes the source database is always named postgres and that /postgres doesn't appear elsewhere in the URL (e.g., in a password or hostname). Consider parsing the URL properly or using a more targeted replacement at the path end.

♻️ Suggested fix using URL parsing

```diff
-        let restore_url = db_url.replace("/postgres", "/test_restore");
+        let mut parsed = url::Url::parse(db_url).expect("parse DATABASE_URL");
+        parsed.set_path("/test_restore");
+        let restore_url = parsed.to_string();
```
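If pulling in the url crate is undesirable, the same idea can be done with a std-only helper that replaces just the path (database name) component of a postgres:// URL, leaving credentials, host, and query string untouched. The name `set_db_name` is invented for illustration:

```rust
/// Replace only the database-name path segment of a postgres:// URL,
/// preserving scheme, userinfo, host, port, and any ?query suffix.
/// Assumes the userinfo contains no raw '/' (disallowed in URLs anyway).
fn set_db_name(url: &str, db: &str) -> String {
    // Skip past "scheme://" so a '/' in the scheme separator is not matched.
    let after_scheme = url.find("://").map(|i| i + 3).unwrap_or(0);
    let (head, rest) = match url[after_scheme..].find('/') {
        Some(i) => url.split_at(after_scheme + i),
        None => (url, ""), // URL had no path component at all
    };
    let query = rest.find('?').map(|i| &rest[i..]).unwrap_or("");
    format!("{head}/{db}{query}")
}

fn main() {
    assert_eq!(
        set_db_name("postgres://u:p@h:5432/postgres?sslmode=require", "test_restore"),
        "postgres://u:p@h:5432/test_restore?sslmode=require"
    );
    assert_eq!(set_db_name("postgres://h/postgres", "x"), "postgres://h/x");
    println!("ok");
}
```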

88-93: Cleanup may be skipped if assertions fail.

If any assertion fails before line 90, the test_restore database won't be dropped, potentially leaving artifacts that could cause subsequent test runs to fail (e.g., CREATE DATABASE test_restore would error). Consider wrapping cleanup in a guard or using a drop-based cleanup pattern.

♻️ Consider a cleanup guard or defer-style pattern

One approach is to use defer semantics via scopeguard or a custom struct that drops the database in Drop. Alternatively, add a DROP DATABASE IF EXISTS test_restore at the start of the test to handle leftover state from previous failed runs:

```diff
+        // Clean up any leftover database from previous failed runs
+        let _ = sqlx::query("DROP DATABASE IF EXISTS test_restore")
+            .execute(pool)
+            .await;
+
         // Create a separate database for restore
         sqlx::query("CREATE DATABASE test_restore")
```
backend/crates/atlas-server/src/snapshot.rs (2)

127-129: Use idiomatic descending sort.

Per coding guidelines, prefer idiomatic Rust patterns. The two-step sort() then reverse() can be combined into a single descending sort.

♻️ Suggested fix

```diff
-    files.sort();
-    files.reverse();
+    files.sort_by(|a, b| b.cmp(a));
```

As per coding guidelines: "Use idiomatic Rust patterns: prefer .min(), .max(), |=, += over manual if/assign statements"


47-54: Consider adding a timeout to pg_dump execution.

If pg_dump hangs (e.g., database unresponsive, network issues), the snapshot loop will block indefinitely, preventing future snapshots. Consider wrapping the command with tokio::time::timeout.

♻️ Suggested fix
+    let pg_dump_timeout = Duration::from_secs(3600); // 1 hour max
+    let status = tokio::time::timeout(
+        pg_dump_timeout,
+        tokio::process::Command::new("pg_dump")
             .arg("--dbname")
             .arg(&config.database_url)
             .arg("-Fc")
             .arg("-f")
             .arg(&tmp_path)
             .status()
-            .await?;
+    )
+    .await
+    .map_err(|_| anyhow::anyhow!("pg_dump timed out after {} seconds", pg_dump_timeout.as_secs()))??;
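For intuition about what a timeout around a child process involves, here is a blocking std-only analogue of the poll-and-kill pattern (the async version above uses tokio; `run_with_timeout` is an invented name, not part of this PR):

```rust
use std::process::{Command, ExitStatus};
use std::time::{Duration, Instant};

/// Run a command, polling for completion; if it exceeds `timeout`, kill it,
/// reap the process, and return Ok(None) to signal a timeout.
fn run_with_timeout(
    cmd: &mut Command,
    timeout: Duration,
) -> std::io::Result<Option<ExitStatus>> {
    let mut child = cmd.spawn()?;
    let start = Instant::now();
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status)); // finished on its own
        }
        if start.elapsed() >= timeout {
            child.kill()?;
            child.wait()?; // reap the killed process to avoid a zombie
            return Ok(None); // timed out
        }
        std::thread::sleep(Duration::from_millis(50));
    }
}

fn main() {
    // A long-running command is killed once the timeout elapses.
    let timed_out = run_with_timeout(
        Command::new("sleep").arg("5"),
        Duration::from_millis(200),
    )
    .unwrap();
    assert!(timed_out.is_none());

    // A quick command completes normally within the timeout.
    let ok = run_with_timeout(&mut Command::new("true"), Duration::from_secs(5)).unwrap();
    assert!(ok.unwrap().success());
    println!("ok");
}
```

The tokio suggestion is preferable inside an async scheduler since it never blocks the runtime; this sketch only shows why the kill-and-reap step matters on timeout.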

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f400324c-4e91-4be0-b747-26e65caf32eb

📥 Commits

Reviewing files that changed from the base of the PR and between 2fb2426 and 5255005.

📒 Files selected for processing (10)
  • .env.example
  • backend/Dockerfile
  • backend/crates/atlas-server/Cargo.toml
  • backend/crates/atlas-server/src/config.rs
  • backend/crates/atlas-server/src/main.rs
  • backend/crates/atlas-server/src/snapshot.rs
  • backend/crates/atlas-server/tests/integration/common.rs
  • backend/crates/atlas-server/tests/integration/main.rs
  • backend/crates/atlas-server/tests/integration/snapshots.rs
  • docker-compose.yml

Comment on lines +53 to +54

```yaml
    volumes:
      - ${SNAPSHOT_HOST_DIR:-./snapshots}:${SNAPSHOT_DIR:-/snapshots}
```

⚠️ Potential issue | 🟠 Major

Snapshot bind mount can fail writes under non-root atlas user.

This mount may be unwritable in default setups because host-directory permissions override the Dockerfile’s /snapshots ownership. Snapshot jobs can then fail continuously.

Suggested mitigation (one practical option)

```diff
   atlas-server:
+    user: "${UID:-1000}:${GID:-1000}"
     volumes:
       - ${SNAPSHOT_HOST_DIR:-./snapshots}:${SNAPSHOT_DIR:-/snapshots}
```
If you prefer to keep fixed container UID/GID, ensure deployment docs require pre-creating SNAPSHOT_HOST_DIR with matching ownership/permissions.


@pthmas pthmas self-assigned this Mar 20, 2026

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
backend/crates/atlas-server/src/config.rs (1)

293-295: Consider documenting that SNAPSHOT_TIME is interpreted as UTC.

Based on context from snapshot.rs, the NaiveTime value is interpreted as UTC (via and_utc()). Operators in non-UTC timezones may expect local time. A brief comment here or in .env.example would clarify.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ae8d11ee-2bc2-47c8-9c61-44b90aed9b3f

📥 Commits

Reviewing files that changed from the base of the PR and between 5255005 and 79529b4.

📒 Files selected for processing (3)
  • .env.example
  • backend/crates/atlas-server/src/config.rs
  • docker-compose.yml
✅ Files skipped from review due to trivial changes (1)
  • .env.example
🚧 Files skipped from review as they are similar to previous changes (1)
  • docker-compose.yml



Development

Successfully merging this pull request may close these issues.

Feature: snapshots
