Conversation
Add opt-in daily pg_dump snapshots to protect against data loss. Runs as a background tokio task with configurable time, retention, and output directory. Includes integration test for dump/restore round-trip.
📝 Walkthrough

Adds an automated, time-based PostgreSQL snapshot scheduler to the Atlas server with configurable enablement, schedule, retention, storage directory, retry/backoff behavior, and cleanup of old snapshots; includes CLI tests and Docker/compose configuration for persistent host mounts.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Timer as Scheduler Timer (tokio sleep)
    participant SnapLoop as run_snapshot_loop
    participant Backoff as Retry Backoff
    participant PGDump as pg_dump process
    participant FS as File System
    participant Cleanup as cleanup_old_snapshots
    loop daily schedule
        Timer->>SnapLoop: Wake at scheduled UTC time
        SnapLoop->>FS: Ensure snapshot dir exists
        SnapLoop->>PGDump: Spawn pg_dump -> write to .tmp
        PGDump->>FS: Write dump data to temp file
        alt pg_dump succeeds
            FS->>FS: Rename .tmp -> .dump (atomic)
            SnapLoop->>Cleanup: Remove old dumps beyond retention
            Cleanup->>FS: List/sort/remove old files (ignore .tmp)
            SnapLoop->>Backoff: Reset retry state
        else pg_dump fails
            SnapLoop->>FS: Remove .tmp if present
            SnapLoop->>Backoff: Wait next backoff interval (5s, 10s, 20s, 30s, 60s)
            Backoff->>SnapLoop: Retry attempt
        end
    end
```
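The retry intervals in the failure branch can be captured as a small capped-schedule helper. This is a sketch, not the PR's code: the function name is illustrative, and whether later attempts stay capped at 60s is an assumption from the listed intervals.

```rust
use std::time::Duration;

// Backoff schedule from the snapshot loop diagram: 5s, 10s, 20s, 30s, 60s,
// then (assumed) 60s for every subsequent attempt rather than growing
// without bound. `backoff_for` is a hypothetical helper name.
fn backoff_for(attempt: usize) -> Duration {
    const SCHEDULE: [u64; 5] = [5, 10, 20, 30, 60];
    let idx = attempt.min(SCHEDULE.len() - 1);
    Duration::from_secs(SCHEDULE[idx])
}
```

On success the loop resets the attempt counter, so the next failure starts again at 5s.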
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: 4 passed, 1 warning
Actionable comments posted: 3
🧹 Nitpick comments (5)
backend/crates/atlas-server/Cargo.toml (1)
48-48: Use workspace-managed versioning for `tempfile`.

Please avoid pinning `tempfile` in the crate manifest and reference it via the workspace instead, so dependency versions stay centralized.

Suggested change

```diff
 [dev-dependencies]
 testcontainers = { workspace = true }
 testcontainers-modules = { workspace = true }
 tokio = { workspace = true }
 tower = { workspace = true, features = ["util"] }
 serde_json = { workspace = true }
 sqlx = { workspace = true }
-tempfile = "3"
+tempfile = { workspace = true }
```

Based on learnings: Keep all Rust workspace dependency versions in the root
`Cargo.toml`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/crates/atlas-server/Cargo.toml` at line 48: replace the hard-pinned `tempfile = "3"` entry with a workspace reference (remove the version string and use the workspace-managed dependency), so the crate relies on the version declared in the workspace root.

backend/crates/atlas-server/tests/integration/snapshots.rs (2)
57-57: Fragile database URL substitution.

The `replace("/postgres", "/test_restore")` pattern assumes the source database is always named `postgres` and that `/postgres` doesn't appear elsewhere in the URL (e.g., in a password or hostname). Consider parsing the URL properly or using a more targeted replacement at the path end.

♻️ Suggested fix using URL parsing

```diff
-let restore_url = db_url.replace("/postgres", "/test_restore");
+let mut parsed = url::Url::parse(db_url).expect("parse DATABASE_URL");
+parsed.set_path("/test_restore");
+let restore_url = parsed.to_string();
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/crates/atlas-server/tests/integration/snapshots.rs` at line 57, The db URL substitution using db_url.replace("/postgres", "/test_restore") is fragile; instead parse db_url (e.g., with the Url type from the url crate or a PostgreSQL connection string parser) and replace only the path component to "test_restore" before rebuilding the string so you don't accidentally modify passwords/hosts; update the code that constructs restore_url (reference variable restore_url and source db_url) to parse, set url.path_segments_mut() or equivalent to the single segment "test_restore", and then serialize the URL back to a string for use in the test.
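As a crate-free alternative to the `url`-crate fix above, the path swap can be sketched with plain string handling. The helper name is hypothetical, and this sketch deliberately ignores query strings (e.g. `?sslmode=require`), which the real `url` crate handles correctly.

```rust
// Swap only the database name (the path component) in a
// postgres://user:pass@host:port/dbname URL, leaving credentials and
// host untouched. Query strings after the dbname are NOT preserved here.
fn with_database(db_url: &str, new_db: &str) -> String {
    // The path starts at the first '/' after the "scheme://" authority part.
    let after_scheme = db_url.find("://").map(|i| i + 3).unwrap_or(0);
    match db_url[after_scheme..].find('/') {
        Some(slash) => format!("{}/{}", &db_url[..after_scheme + slash], new_db),
        None => format!("{}/{}", db_url, new_db),
    }
}
```

Unlike a blanket `replace`, this leaves a password or host that happens to contain `postgres` alone.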
88-93: Cleanup may be skipped if assertions fail.

If any assertion fails before line 90, the `test_restore` database won't be dropped, potentially leaving artifacts that could cause subsequent test runs to fail (e.g., `CREATE DATABASE test_restore` would error). Consider wrapping cleanup in a guard or using a drop-based cleanup pattern.

♻️ Consider a cleanup guard or defer-style pattern
One approach is to use defer semantics via `scopeguard` or a custom struct that drops the database in `Drop`. Alternatively, add a `DROP DATABASE IF EXISTS test_restore` at the start of the test to handle leftover state from previous failed runs:

```diff
+// Clean up any leftover database from previous failed runs
+let _ = sqlx::query("DROP DATABASE IF EXISTS test_restore")
+    .execute(pool)
+    .await;
+
 // Create a separate database for restore
 sqlx::query("CREATE DATABASE test_restore")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/crates/atlas-server/tests/integration/snapshots.rs` around lines 88 - 93: the test drops `test_restore` only after the assertions, so a panic skips cleanup. Guarantee cleanup with a guard (e.g., a small struct with a `Drop` impl, or `scopeguard`) that runs `DROP DATABASE IF EXISTS test_restore`, or run that statement at the start of the test so the cleanup is idempotent even after a failed run; route `restore_pool.close().await` and the existing `sqlx::query(...)` call through that guard so the database is removed even on test failure.

backend/crates/atlas-server/src/snapshot.rs (2)
127-129: Use idiomatic descending sort.

Per coding guidelines, prefer idiomatic Rust patterns. The two-step `sort()` then `reverse()` can be combined into a single descending sort.

♻️ Suggested fix

```diff
-files.sort();
-files.reverse();
+files.sort_by(|a, b| b.cmp(a));
```

As per coding guidelines: "Use idiomatic Rust patterns: prefer .min(), .max(), |=, += over manual if/assign statements"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/crates/atlas-server/src/snapshot.rs` around lines 127 - 129, Replace the two-step ascending-then-reverse sequence operating on the `files` collection (calls to `files.sort()` followed by `files.reverse()`) with a single idiomatic descending sort; update the code to call a single sort variant that orders descending (e.g., use `sort_by`/`sort_unstable_by` with `b.cmp(a)` or an appropriate key-based descending comparator) so the collection is sorted newest-first in one pass.
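A minimal demonstration of the one-pass descending sort; the helper name and the timestamped filenames are illustrative, not taken from the PR.

```rust
// One-pass descending sort, replacing sort() followed by reverse().
// Snapshot filenames embed timestamps, so descending lexicographic
// order is newest-first.
fn sort_newest_first(files: &mut [String]) {
    files.sort_unstable_by(|a, b| b.cmp(a));
}
```

After this, retention cleanup only needs to delete everything past the first `retention` entries.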
47-54: Consider adding a timeout to pg_dump execution.

If `pg_dump` hangs (e.g., database unresponsive, network issues), the snapshot loop will block indefinitely, preventing future snapshots. Consider wrapping the command with `tokio::time::timeout`.

♻️ Suggested fix

```diff
+let pg_dump_timeout = Duration::from_secs(3600); // 1 hour max
+let status = tokio::time::timeout(
+    pg_dump_timeout,
+    tokio::process::Command::new("pg_dump")
         .arg("--dbname")
         .arg(&config.database_url)
         .arg("-Fc")
         .arg("-f")
         .arg(&tmp_path)
         .status()
-    .await?;
+)
+.await
+.map_err(|_| anyhow::anyhow!("pg_dump timed out after {} seconds", pg_dump_timeout.as_secs()))??;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/crates/atlas-server/src/snapshot.rs` around lines 47 - 54, Wrap the blocking pg_dump invocation with tokio::time::timeout to avoid hanging the snapshot loop: instead of awaiting .status() directly on tokio::process::Command, spawn the child with Command::spawn() (reference Command::spawn and the tmp_path/status variables), then await child.wait() inside tokio::time::timeout(Duration::from_secs(...)). If the timeout returns Err(elapsed) call child.kill().await (or child.kill() and then child.wait().await) and return or log a timeout error; on Ok, inspect the ExitStatus as before. Make the timeout duration configurable or a constant and handle both timeout and command errors consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/crates/atlas-server/src/config.rs`:
- Around line 255-262: SnapshotConfig currently derives Debug and may log
sensitive DB credentials via the database_url field; remove #[derive(Debug)] (or
stop deriving Debug) for SnapshotConfig and implement a manual Debug impl for
the SnapshotConfig struct that formats all fields except database_url (or prints
database_url as a redacted placeholder like "<redacted>") so logs cannot leak
secrets; update any uses that relied on the derived Debug (e.g., fmt::Debug
formatting or debug! calls referencing SnapshotConfig) to continue working with
the new manual impl.
- Around line 293-294: The current env::var call that sets let dir =
env::var("SNAPSHOT_DIR").unwrap_or_else(|_| "/snapshots".to_string()) can
produce an empty string; update the code that reads SNAPSHOT_DIR (the let dir
variable in config.rs) to validate that the resulting string is non-empty when
snapshots are enabled: after obtaining the value (either from env::var or the
default), check if dir.trim().is_empty() and return/configuration-error (or fall
back to "/snapshots") so the configuration fails fast instead of allowing an
empty SNAPSHOT_DIR to propagate into the scheduler loop; ensure the check is
applied in the same function that constructs the snapshot configuration so
callers that enable snapshots receive an explicit error or valid path.
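The fail-fast check described in the `SNAPSHOT_DIR` item above can be sketched std-only; the function name and error type are hypothetical, and the real code would wire this into the function that constructs the snapshot configuration.

```rust
// An explicitly set but empty SNAPSHOT_DIR becomes a configuration error
// instead of propagating into the scheduler loop; an unset variable still
// falls back to the default.
fn resolve_snapshot_dir(raw: Option<String>) -> Result<String, String> {
    let dir = raw.unwrap_or_else(|| "/snapshots".to_string());
    if dir.trim().is_empty() {
        return Err("SNAPSHOT_DIR must not be empty when snapshots are enabled".to_string());
    }
    Ok(dir)
}
```

The caller would pass `env::var("SNAPSHOT_DIR").ok()` and surface the `Err` only when snapshots are enabled.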
In `@docker-compose.yml`:
- Around line 53-54: The snapshot bind mount using the volume entry "-
${SNAPSHOT_HOST_DIR:-./snapshots}:${SNAPSHOT_DIR:-/snapshots}" can make
/snapshots unwritable for the non-root atlas user because host directory
permissions override container ownership; to fix, either switch to a named
Docker volume (remove SNAPSHOT_HOST_DIR bind and use a named volume) or update
deployment docs and startup to ensure SNAPSHOT_HOST_DIR is pre-created with
correct owner/permissions (matching the atlas UID/GID), or add an init step that
chowns the mounted path to the atlas user before snapshot jobs run; reference
the volume line using SNAPSHOT_HOST_DIR and SNAPSHOT_DIR and update
docker-compose and docs accordingly.
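The redacting `Debug` impl suggested for `SnapshotConfig` in the config.rs item above can be sketched std-only. The field names here are assumptions for illustration; the real struct lives in `config.rs`.

```rust
use std::fmt;

// Illustrative stand-in for the real SnapshotConfig.
struct SnapshotConfig {
    database_url: String,
    dir: String,
    retention: usize,
}

// Manual Debug impl: every field is printed except the connection string,
// which is replaced with a placeholder so `{:?}` logging cannot leak
// embedded credentials.
impl fmt::Debug for SnapshotConfig {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("SnapshotConfig")
            .field("database_url", &"<redacted>")
            .field("dir", &self.dir)
            .field("retention", &self.retention)
            .finish()
    }
}
```

Existing `debug!`/`{:?}` call sites keep working unchanged, since the manual impl satisfies the same `fmt::Debug` trait the derive provided.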
---
Nitpick comments:
In `@backend/crates/atlas-server/Cargo.toml`:
- Line 48: Replace the hard-pinned tempfile = "3" entry in this crate's
Cargo.toml with a workspace reference by removing the version and using the
workspace-managed dependency (i.e., change the tempfile entry to reference the
workspace dependency without a version string), so the crate relies on the
version declared in the workspace root; look for the tempfile entry in
backend/crates/atlas-server/Cargo.toml and remove the pinned version to defer to
workspace-managed versions.
In `@backend/crates/atlas-server/src/snapshot.rs`:
- Around line 127-129: Replace the two-step ascending-then-reverse sequence
operating on the `files` collection (calls to `files.sort()` followed by
`files.reverse()`) with a single idiomatic descending sort; update the code to
call a single sort variant that orders descending (e.g., use
`sort_by`/`sort_unstable_by` with `b.cmp(a)` or an appropriate key-based
descending comparator) so the collection is sorted newest-first in one pass.
- Around line 47-54: Wrap the blocking pg_dump invocation with
tokio::time::timeout to avoid hanging the snapshot loop: instead of awaiting
.status() directly on tokio::process::Command, spawn the child with
Command::spawn() (reference Command::spawn and the tmp_path/status variables),
then await child.wait() inside tokio::time::timeout(Duration::from_secs(...)).
If the timeout returns Err(elapsed) call child.kill().await (or child.kill() and
then child.wait().await) and return or log a timeout error; on Ok, inspect the
ExitStatus as before. Make the timeout duration configurable or a constant and
handle both timeout and command errors consistently.
In `@backend/crates/atlas-server/tests/integration/snapshots.rs`:
- Line 57: The db URL substitution using db_url.replace("/postgres",
"/test_restore") is fragile; instead parse db_url (e.g., with the Url type from
the url crate or a PostgreSQL connection string parser) and replace only the
path component to "test_restore" before rebuilding the string so you don't
accidentally modify passwords/hosts; update the code that constructs restore_url
(reference variable restore_url and source db_url) to parse, set
url.path_segments_mut() or equivalent to the single segment "test_restore", and
then serialize the URL back to a string for use in the test.
- Around line 88-93: The test currently drops the test database with
sqlx::query("DROP DATABASE test_restore").execute(pool).await after calling
restore_pool.close().await, but this cleanup is skipped if earlier assertions
panic; refactor to guarantee cleanup by introducing a guard (e.g., a small
struct with Drop impl or using scopeguard) that holds the pool/connection and
runs sqlx::query("DROP DATABASE IF EXISTS test_restore").execute(...) in Drop,
or alternatively run sqlx::query("DROP DATABASE IF EXISTS
test_restore").execute(pool).await at the start of the test to ensure idempotent
cleanup; update references to restore_pool.close().await and the existing
sqlx::query(...) invocation to be performed via that guard so the database is
removed even on test failure.
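The drop-based guard described above can be sketched std-only. The struct and closure are illustrative: a real test would issue the `DROP DATABASE IF EXISTS` query inside the closure via some blocking bridge (e.g., a runtime handle), since `Drop` is synchronous and cannot `.await`.

```rust
// Minimal defer-style guard: runs its closure when dropped, even if the
// test panics partway through. In the snapshot test, the closure would
// drop the test_restore database.
struct Cleanup<F: FnOnce()>(Option<F>);

impl<F: FnOnce()> Drop for Cleanup<F> {
    fn drop(&mut self) {
        if let Some(f) = self.0.take() {
            f();
        }
    }
}
```

Constructing `let _guard = Cleanup(Some(|| { /* drop database */ }));` at the top of the test guarantees the cleanup runs on every exit path.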
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f400324c-4e91-4be0-b747-26e65caf32eb
📒 Files selected for processing (10)
- .env.example
- backend/Dockerfile
- backend/crates/atlas-server/Cargo.toml
- backend/crates/atlas-server/src/config.rs
- backend/crates/atlas-server/src/main.rs
- backend/crates/atlas-server/src/snapshot.rs
- backend/crates/atlas-server/tests/integration/common.rs
- backend/crates/atlas-server/tests/integration/main.rs
- backend/crates/atlas-server/tests/integration/snapshots.rs
- docker-compose.yml
```yaml
    volumes:
      - ${SNAPSHOT_HOST_DIR:-./snapshots}:${SNAPSHOT_DIR:-/snapshots}
```
Snapshot bind mount can fail writes under non-root atlas user.
This mount may be unwritable in default setups because host-directory permissions override the Dockerfile’s /snapshots ownership. Snapshot jobs can then fail continuously.
Suggested mitigation (one practical option)

```diff
 atlas-server:
+    user: "${UID:-1000}:${GID:-1000}"
     volumes:
       - ${SNAPSHOT_HOST_DIR:-./snapshots}:${SNAPSHOT_DIR:-/snapshots}
```

If you prefer to keep fixed container UID/GID, ensure deployment docs require pre-creating SNAPSHOT_HOST_DIR with matching ownership/permissions.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
    user: "${UID:-1000}:${GID:-1000}"
    volumes:
      - ${SNAPSHOT_HOST_DIR:-./snapshots}:${SNAPSHOT_DIR:-/snapshots}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker-compose.yml` around lines 53 - 54, The snapshot bind mount using the
volume entry "- ${SNAPSHOT_HOST_DIR:-./snapshots}:${SNAPSHOT_DIR:-/snapshots}"
can make /snapshots unwritable for the non-root atlas user because host
directory permissions override container ownership; to fix, either switch to a
named Docker volume (remove SNAPSHOT_HOST_DIR bind and use a named volume) or
update deployment docs and startup to ensure SNAPSHOT_HOST_DIR is pre-created
with correct owner/permissions (matching the atlas UID/GID), or add an init step
that chowns the mounted path to the atlas user before snapshot jobs run;
reference the volume line using SNAPSHOT_HOST_DIR and SNAPSHOT_DIR and update
docker-compose and docs accordingly.
🧹 Nitpick comments (1)
backend/crates/atlas-server/src/config.rs (1)
293-295: Consider documenting that `SNAPSHOT_TIME` is interpreted as UTC.

Based on context from `snapshot.rs`, the `NaiveTime` value is interpreted as UTC (via `and_utc()`). Operators in non-UTC timezones may expect local time. A brief comment here or in `.env.example` would clarify.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/crates/atlas-server/src/config.rs` around lines 293 - 295, Document that SNAPSHOT_TIME is interpreted as UTC: update the comment near the NaiveTime parsing in config.rs (around the time_str/NaiveTime::parse_from_str code) to state explicitly that the parsed NaiveTime will be treated as UTC (snapshot.rs uses .and_utc()), and add a matching note in .env.example or README to warn operators in non-UTC timezones that times are UTC. Ensure the comment references SNAPSHOT_TIME, NaiveTime::parse_from_str, and snapshot.rs's .and_utc() behavior so readers can find the related code paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/crates/atlas-server/src/config.rs`:
- Around line 293-295: Document that SNAPSHOT_TIME is interpreted as UTC: update
the comment near the NaiveTime parsing in config.rs (around the
time_str/NaiveTime::parse_from_str code) to state explicitly that the parsed
NaiveTime will be treated as UTC (snapshot.rs uses .and_utc()), and add a
matching note in .env.example or README to warn operators in non-UTC timezones
that times are UTC. Ensure the comment references SNAPSHOT_TIME,
NaiveTime::parse_from_str, and snapshot.rs's .and_utc() behavior so readers can
find the related code paths.
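The UTC interpretation matters for the scheduling math. A std-only illustration of "seconds until the next HH:MM in UTC": Unix timestamps are already UTC-based, so the modulo arithmetic below mirrors what the chrono-based code (NaiveTime plus `.and_utc()`) computes. The function name is illustrative.

```rust
// Seconds until the next occurrence of HH:MM UTC, given "now" as a Unix
// timestamp. An exact match schedules a full day ahead, matching a
// once-daily snapshot loop.
fn secs_until_next(hh: u64, mm: u64, now_unix: u64) -> u64 {
    let target = hh * 3600 + mm * 60; // offset of HH:MM within a UTC day
    let today = now_unix % 86_400;    // seconds elapsed since UTC midnight
    if target > today {
        target - today
    } else {
        86_400 - today + target // wrap around to tomorrow
    }
}
```

A server in UTC+2 whose operator writes `SNAPSHOT_TIME=02:00` expecting 2 a.m. local time will actually snapshot at 4 a.m. local, which is exactly why the documentation note above is worth adding.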
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ae8d11ee-2bc2-47c8-9c61-44b90aed9b3f
📒 Files selected for processing (3)
- .env.example
- backend/crates/atlas-server/src/config.rs
- docker-compose.yml
✅ Files skipped from review due to trivial changes (1)
- .env.example
🚧 Files skipped from review as they are similar to previous changes (1)
- docker-compose.yml
Summary

- Adds opt-in daily `pg_dump` snapshots as a background tokio task, following the same spawn pattern as the indexer and metadata fetcher
- Configurable via `SNAPSHOT_ENABLED`, `SNAPSHOT_TIME` (HH:MM UTC), `SNAPSHOT_RETENTION` (max files kept), and `SNAPSHOT_DIR`
- Backend Dockerfile installs `postgresql16-client` for `pg_dump`; compose file wired with env vars and volume mount

Closes #23
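For reference, a `.env` fragment wiring the variables named in the summary; the values shown are illustrative, not the PR's actual defaults.

```shell
# Opt-in daily pg_dump snapshots (values are example settings)
SNAPSHOT_ENABLED=true
SNAPSHOT_TIME=03:30        # HH:MM, interpreted as UTC
SNAPSHOT_RETENTION=7       # keep at most the 7 newest .dump files
SNAPSHOT_DIR=/snapshots    # path inside the container
SNAPSHOT_HOST_DIR=./snapshots  # host side of the compose bind mount
```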