18: Storage, Backups & Disaster Recovery¶
This section explains how to protect LocalCloudLab’s critical data systems:
• PostgreSQL (primary system of record)
• Redis (cache + ephemeral data)
• RabbitMQ (message queue)
• Persistent Volume Claims (PVCs)
• Backup strategies
• Restore strategies
• Disaster Recovery design
• Automated backup jobs
• Data integrity practices
This section is vital for enterprise‑grade stability and preventing catastrophic data loss.
18.1 Understanding Data Roles in LocalCloudLab¶
You have three different classes of data in your platform:
1. System of Record (critical data)¶
• PostgreSQL
• Contains all transactional data (search history, checkins…)
If lost → business impact is HIGH.
2. Ephemeral but important¶
• Redis
• Cache, tokens, session context, temporary search results
If lost → system slows down but does NOT lose customer data.
3. Messaging¶
• RabbitMQ
• Queue messages; durable queues ensure persistence between restarts
If lost → may lose pending messages, but system recovers.
18.2 Persistent Volume Claims (PVCs)¶
Each component that stores data uses PVCs. Example:
PostgreSQL: /var/lib/postgresql/data
Redis: /data
RabbitMQ: /var/lib/rabbitmq/mnesia
These PVCs live on host storage provisioned by the k3s local-path provisioner.
Check PVCs:
kubectl get pvc -A
Typical output:
NAMESPACE   NAME            STATUS   CAPACITY   ACCESS MODES   STORAGECLASS
db          pg-data         Bound    20Gi       RWO            local-path
caching     redis-data      Bound    5Gi        RWO            local-path
messaging   rabbitmq-data   Bound    10Gi       RWO            local-path
18.3 PostgreSQL Backup Strategy¶
PostgreSQL is the MOST important data store.
You must implement:
✔ Logical backups (pg_dump)
✔ Physical backups (pg_basebackup) — optional
✔ Point-in-time recovery (WAL files)
✔ Automated scheduled backups
18.3.1 Logical Backups with pg_dump¶
Command:
pg_dump -U postgres -Fc -f /backup/pg_backup.dump mydatabase
Inside Kubernetes:
kubectl exec -n db postgres-0 -- pg_dump -U postgres -Fc -f /var/lib/postgresql/data/backup.dump mydatabase
18.3.2 Restore¶
pg_restore -U postgres -d mydatabase /path/backup.dump
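If the restore has to happen inside the cluster, the same commands can be wrapped in kubectl exec. A minimal sketch, assuming the pod (postgres-0), namespace (db) and database (mydatabase) names used in the examples above:

```
# 1. Copy the dump into the pod (kubectl cp also works in reverse, for off-node safekeeping):
kubectl cp ./backup.dump db/postgres-0:/tmp/backup.dump

# 2. Recreate the database if needed, then restore:
kubectl exec -n db postgres-0 -- createdb -U postgres mydatabase
kubectl exec -n db postgres-0 -- pg_restore -U postgres -d mydatabase /tmp/backup.dump
```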
18.3.3 Automated Backup Job (CronJob)¶
Create:
k8s/postgres/backup-cronjob.yaml
Example:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: db
spec:
  schedule: "0 */6 * * *"  # every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: postgres:16
            # NOTE: supply connection details for your setup
            # (e.g. -h <postgres-service> and a PGPASSWORD env var)
            command:
            - /bin/sh
            - -c
            - >
              pg_dump -U postgres -Fc
              -f /backup/pg_backup_$(date +%Y%m%d_%H%M).dump
              mydatabase
            volumeMounts:
            - name: backup-storage
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: pg-backup-pvc
This protects against accidental deletion, corruption, operator errors, etc.
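The CronJob above will accumulate dumps until the backup PVC fills up, so pair it with a retention step. A minimal sketch that could run after each backup; the directory and retention window are illustrative defaults, not project settings:

```shell
#!/bin/sh
# Hypothetical retention helper for the dumps written by the backup CronJob.
set -eu
BACKUP_DIR="${BACKUP_DIR:-/tmp/lcl-pg-backups}"   # illustrative default
RETENTION_DAYS="${RETENTION_DAYS:-7}"
mkdir -p "$BACKUP_DIR"
# Remove dump files older than the retention window.
find "$BACKUP_DIR" -name '*.dump' -type f -mtime +"$RETENTION_DAYS" -delete
echo "dumps kept: $(find "$BACKUP_DIR" -name '*.dump' | wc -l)"
```

Seven days of 6-hourly dumps keeps roughly 28 restore points while bounding disk usage.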
18.3.4 WAL Archiving for Point-in-Time Recovery (Optional)¶
Enable in postgresql.conf:
wal_level = replica
archive_mode = on
archive_command = 'cp %p /wal_archive/%f'
Allows restoring the database to any point in time covered by the WAL archive.
This is advanced and not required for LocalCloudLab initial deployment.
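For completeness, the restore side of WAL archiving looks roughly like this (PostgreSQL 12+ style; the target timestamp is a hypothetical example):

```
# postgresql.conf (recovery settings)
restore_command = 'cp /wal_archive/%f %p'      # mirror of archive_command above
recovery_target_time = '2025-01-01 12:00:00'   # hypothetical recovery target
# Then create an empty recovery.signal file in the data directory and start the server.
```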
18.4 Redis Backup & Persistence Strategy¶
Redis is used mostly as cache, but you should still enable persistence.
Redis supports:
• RDB snapshots
• AOF (Append Only File)
• Hybrid persistence (best)
18.4.1 RDB Snapshots¶
Automatically saves memory to disk:
save 900 1      # every 15 min if ≥1 key changed
save 300 10     # every 5 min if ≥10 keys changed
save 60 10000   # every 1 min if ≥10,000 keys changed
18.4.2 AOF Persistence¶
Logs each write operation.
appendonly yes
appendfsync everysec
Best durability with minimal performance impact.
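Putting both together, a hybrid-persistence redis.conf fragment might look like this (all directives are standard Redis options; aof-use-rdb-preamble requires Redis 4+):

```
appendonly yes               # AOF for durability
appendfsync everysec         # fsync once per second
aof-use-rdb-preamble yes     # hybrid: RDB snapshot as the AOF preamble
save 900 1                   # keep RDB snapshots as a second layer
dir /data                    # matches the Redis PVC mount path
```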
18.4.3 Backup Redis PVC¶
Trigger a fresh snapshot first, then copy the data directory:
kubectl exec -n caching redis-master-0 -- redis-cli BGSAVE
kubectl cp caching/redis-master-0:/data redis_backup/
If Redis PVC is lost → data is lost → but system continues working.
This is acceptable because Redis is NOT a system of record.
18.5 RabbitMQ Backup Strategy¶
RabbitMQ contains:
• Queues
• Exchange bindings
• Durable messages
Durability must be enabled on both sides:
• queues declared with durable: true
• messages published as persistent (delivery_mode: 2)
18.5.1 Backing up RabbitMQ definitions¶
rabbitmqctl export_definitions /backup/rabbit_def.json
Inside pod:
kubectl exec -n messaging rabbitmq-0 -- rabbitmqctl export_definitions /var/lib/rabbitmq/mnesia/definitions.json
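The export can also be automated, mirroring the PostgreSQL CronJob. A sketch using the RabbitMQ management API (assumes the management plugin is enabled; the Secret name rabbitmq-admin, its RABBIT_USER/RABBIT_PASS keys, and the backup PVC are hypothetical):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rabbitmq-def-backup
  namespace: messaging
spec:
  schedule: "0 3 * * *"  # daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: export
            image: curlimages/curl:latest
            command:
            - /bin/sh
            - -c
            - >
              curl -fsS -u "$RABBIT_USER:$RABBIT_PASS"
              http://rabbitmq.messaging.svc:15672/api/definitions
              -o /backup/rabbit_def_$(date +%Y%m%d).json
            envFrom:
            - secretRef:
                name: rabbitmq-admin
            volumeMounts:
            - name: backup-storage
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: rabbitmq-backup-pvc
```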
18.5.2 Restore¶
rabbitmqctl import_definitions /path/definitions.json
Note: definitions cover topology (queues, exchanges, bindings, users), not message payloads.
18.6 Backup Storage Options¶
Recommended:
• Local PVC backup for fast recovery
• Off-cluster backup: rsync to external host
• Cloud backup (S3/Spaces) — ideal for critical data
Example rsync:
rsync -avz /var/lib/rancher/k3s/storage root@backup-server:/localcloudlab/
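The rsync push can be wrapped in a small dated-archive script. A sketch with placeholder paths (the SRC/DEST defaults are illustrative; point SRC at the k3s storage directory in production):

```shell
#!/bin/sh
# Hypothetical off-cluster archive helper: tar, then ship.
set -eu
SRC="${SRC:-/tmp/lcl-demo-data}"                  # illustrative default
DEST="${DEST:-/tmp/localcloudlab-backups}"        # illustrative default
mkdir -p "$SRC" "$DEST"
STAMP="$(date +%Y%m%d_%H%M)"
# Create a dated, compressed archive of the storage directory.
tar -czf "$DEST/storage_$STAMP.tar.gz" -C "$(dirname "$SRC")" "$(basename "$SRC")"
# Ship it to the backup host (placeholder hostname):
# rsync -avz "$DEST/" root@backup-server:/localcloudlab/
ls "$DEST"
```

For PostgreSQL, prefer shipping the pg_dump output over raw data-directory copies: copying files under a running database can produce an inconsistent snapshot.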
18.7 Disaster Recovery Plans¶
A good DR plan answers:
• What happens if the node dies?
• How fast can you recover PostgreSQL?
• What about Redis?
• What about RabbitMQ?
• Can you recreate the cluster with Git + backups?
18.7.1 Disaster Scenario: Node Disk Failure¶
Impact:
• PostgreSQL lost → restore from backup
• Redis lost → cache wiped → system slows but recovers
• RabbitMQ lost → import definitions; messages lost
Recovery:
1. Deploy new server
2. Install k3s
3. Run GitOps (apply all manifests)
4. Restore PostgreSQL backup
5. Restart APIs
Total downtime depends on PostgreSQL backup size.
18.7.2 Disaster Scenario: Kubernetes Corruption¶
• Reinstall k3s
• Redeploy everything from Git
• Restore PVC backups
Your Git repository becomes the single source of truth.
18.8 Verification & Backup Testing¶
A backup you never tested is a backup that does not exist.
Monthly tasks:
✓ Restore PostgreSQL backup into a separate namespace
✓ Verify Redis persistence (RDB/AOF files exist and load cleanly on restart)
✓ Validate RabbitMQ import/export
✓ Run k3s recovery simulation on second server (optional)
Regularly exercising these steps is what makes the DR plan trustworthy.
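One cheap, automatable integrity check is to checksum every dump at backup time and verify before restoring. A sketch with illustrative paths (the demo file stands in for a real dump):

```shell
#!/bin/sh
# Hypothetical integrity check: record SHA-256 checksums when dumps are written,
# verify them before trusting a restore.
set -eu
BACKUP_DIR="${BACKUP_DIR:-/tmp/lcl-verify-demo}"   # illustrative default
mkdir -p "$BACKUP_DIR"
[ -e "$BACKUP_DIR/pg_backup.dump" ] || echo "demo dump" > "$BACKUP_DIR/pg_backup.dump"
cd "$BACKUP_DIR"
sha256sum ./*.dump > SHA256SUMS   # record at backup time
sha256sum -c SHA256SUMS           # verify later; exits non-zero on corruption
```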
18.9 Summary of Section 18¶
You now have:
✔ PostgreSQL dump + restore strategy
✔ Automated CronJob backups
✔ Redis persistence strategy
✔ RabbitMQ durability & definitions export
✔ PVC storage layout
✔ Disaster recovery plans
✔ Full recovery procedure from Git + backups
Your data layer is now fully protected and enterprise-ready.
Next section:
Section 19 — Security, Secrets Management & Hardening