18: Storage, Backups & Disaster Recovery¶
This section explains how to protect LocalCloudLab’s critical data systems:
• PostgreSQL (primary system of record)
• Redis (cache + ephemeral data)
• RabbitMQ (message queue)
• Persistent Volume Claims (PVCs)
• Backup strategies
• Restore strategies
• Disaster Recovery design
• Automated backup jobs
• Data integrity practices
This section is vital for enterprise‑grade stability and preventing catastrophic data loss.
18.1 Understanding Data Roles in LocalCloudLab¶
You have three different classes of data in your platform:
1. System of Record (critical data)¶
• PostgreSQL
• Contains all transactional data (search history, checkins…)
If lost → business impact is HIGH.
2. Ephemeral but important¶
• Redis
• Cache, tokens, session context, temporary search results
If lost → system slows down but does NOT lose customer data.
3. Messaging¶
• RabbitMQ
• Queue messages; durable queues ensure persistence between restarts
If lost → may lose pending messages, but system recovers.
18.2 Persistent Volume Claims (PVCs)¶
Each component that stores data uses PVCs. Example:
PostgreSQL: /var/lib/postgresql/data
Redis: /data
RabbitMQ: /var/lib/rabbitmq/mnesia
These PVCs live on host storage provisioned by the k3s local-path provisioner.
Check PVCs:
kubectl get pvc -A
Typical output:
NAMESPACE   NAME            STATUS   CAPACITY   ACCESS MODES   STORAGECLASS
db          pg-data         Bound    20Gi       RWO            local-path
caching     redis-data      Bound    5Gi        RWO            local-path
messaging   rabbitmq-data   Bound    10Gi       RWO            local-path
18.3 PostgreSQL Backup Strategy¶
PostgreSQL is the MOST important data store.
You must implement:
✔ Logical backups (pg_dump)
✔ Physical backups (pg_basebackup) — optional
✔ Point-in-time recovery (WAL files)
✔ Automated scheduled backups
18.3.1 Logical Backups with pg_dump¶
Command:
pg_dump -U postgres -Fc -f /backup/pg_backup.dump mydatabase
Inside Kubernetes:
kubectl exec -n db postgres-0 -- pg_dump -U postgres -Fc -f /var/lib/postgresql/data/backup.dump mydatabase
18.3.2 Restore¶
pg_restore -U postgres -d mydatabase /path/backup.dump
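If the restore has to happen inside the cluster, the same commands can be wrapped in kubectl exec. A minimal sketch, assuming the pod (postgres-0), namespace (db) and database (mydatabase) names used in the examples above:

```
# 1. Copy the dump into the pod (kubectl cp also works in reverse, for off-node safekeeping):
kubectl cp ./backup.dump db/postgres-0:/tmp/backup.dump

# 2. Recreate the database if needed, then restore:
kubectl exec -n db postgres-0 -- createdb -U postgres mydatabase
kubectl exec -n db postgres-0 -- pg_restore -U postgres -d mydatabase /tmp/backup.dump
```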
18.3.3 Automated Backup Job (CronJob)¶
Create:
k8s/postgres/backup-cronjob.yaml
Example:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: db
spec:
  schedule: "0 */6 * * *"  # every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: postgres:16
            # NOTE: supply connection details for your setup
            # (e.g. -h <postgres-service> and a PGPASSWORD env var)
            command:
            - /bin/sh
            - -c
            - >
              pg_dump -U postgres -Fc
              -f /backup/pg_backup_$(date +%Y%m%d_%H%M).dump
              mydatabase
            volumeMounts:
            - name: backup-storage
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: pg-backup-pvc
This protects against accidental deletion, corruption, operator errors, etc.
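The CronJob above will accumulate dumps until the backup PVC fills up, so pair it with a retention step. A minimal sketch that could run after each backup; the directory and retention window are illustrative defaults, not project settings:

```shell
#!/bin/sh
# Hypothetical retention helper for the dumps written by the backup CronJob.
set -eu
BACKUP_DIR="${BACKUP_DIR:-/tmp/lcl-pg-backups}"   # illustrative default
RETENTION_DAYS="${RETENTION_DAYS:-7}"
mkdir -p "$BACKUP_DIR"
# Remove dump files older than the retention window.
find "$BACKUP_DIR" -name '*.dump' -type f -mtime +"$RETENTION_DAYS" -delete
echo "dumps kept: $(find "$BACKUP_DIR" -name '*.dump' | wc -l)"
```

Seven days of 6-hourly dumps keeps roughly 28 restore points while bounding disk usage.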
18.3.4 WAL Archiving for Point-in-Time Recovery (Optional)¶
Enable in postgresql.conf:
wal_level = replica
archive_mode = on
archive_command = 'cp %p /wal_archive/%f'
Allows restoring the database to any point in time covered by the WAL archive.
This is advanced and not required for LocalCloudLab initial deployment.
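For completeness, the restore side of WAL archiving looks roughly like this (PostgreSQL 12+ style; the target timestamp is a hypothetical example):

```
# postgresql.conf (recovery settings)
restore_command = 'cp /wal_archive/%f %p'      # mirror of archive_command above
recovery_target_time = '2025-01-01 12:00:00'   # hypothetical recovery target
# Then create an empty recovery.signal file in the data directory and start the server.
```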
18.4 Redis Backup & Persistence Strategy¶
Redis is used mostly as cache, but you should still enable persistence.
Redis supports:
• RDB snapshots
• AOF (Append Only File)
• Hybrid persistence (best)
18.4.1 RDB Snapshots¶
Automatically saves memory to disk:
save 900 1      # every 15 min if ≥1 key changed
save 300 10     # every 5 min if ≥10 keys changed
save 60 10000   # every 1 min if ≥10,000 keys changed
18.4.2 AOF Persistence¶
Logs each write operation.
appendonly yes
appendfsync everysec
Best durability with minimal performance impact.
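Putting both together, a hybrid-persistence redis.conf fragment might look like this (all directives are standard Redis options; aof-use-rdb-preamble requires Redis 4+):

```
appendonly yes               # AOF for durability
appendfsync everysec         # fsync once per second
aof-use-rdb-preamble yes     # hybrid: RDB snapshot as the AOF preamble
save 900 1                   # keep RDB snapshots as a second layer
dir /data                    # matches the Redis PVC mount path
```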
18.4.3 Backup Redis PVC¶
Trigger a fresh snapshot first, then copy the data directory:
kubectl exec -n caching redis-master-0 -- redis-cli BGSAVE
kubectl cp caching/redis-master-0:/data redis_backup/
If Redis PVC is lost → data is lost → but system continues working.
This is acceptable because Redis is NOT a system of record.
18.5 RabbitMQ Backup Strategy¶
RabbitMQ contains:
• Queues
• Exchange bindings
• Durable messages
Durability must be enabled on both sides:
• queues declared with durable: true
• messages published as persistent (delivery_mode: 2)
18.5.1 Backing up RabbitMQ definitions¶
rabbitmqctl export_definitions /backup/rabbit_def.json
Inside pod:
kubectl exec -n messaging rabbitmq-0 -- rabbitmqctl export_definitions /var/lib/rabbitmq/mnesia/definitions.json
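The export can also be automated, mirroring the PostgreSQL CronJob. A sketch using the RabbitMQ management API (assumes the management plugin is enabled; the Secret name rabbitmq-admin, its RABBIT_USER/RABBIT_PASS keys, and the backup PVC are hypothetical):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rabbitmq-def-backup
  namespace: messaging
spec:
  schedule: "0 3 * * *"  # daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: export
            image: curlimages/curl:latest
            command:
            - /bin/sh
            - -c
            - >
              curl -fsS -u "$RABBIT_USER:$RABBIT_PASS"
              http://rabbitmq.messaging.svc:15672/api/definitions
              -o /backup/rabbit_def_$(date +%Y%m%d).json
            envFrom:
            - secretRef:
                name: rabbitmq-admin
            volumeMounts:
            - name: backup-storage
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: rabbitmq-backup-pvc
```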
18.5.2 Restore¶
rabbitmqctl import_definitions /path/definitions.json
Note: definitions cover topology (queues, exchanges, bindings, users), not message payloads.
18.6 Backup Storage Options¶
Recommended:
• Local PVC backup for fast recovery
• Off-cluster backup: rsync to external host
• Cloud backup (S3/Spaces) — ideal for critical data
Example rsync:
rsync -avz /var/lib/rancher/k3s/storage root@backup-server:/localcloudlab/
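The rsync push can be wrapped in a small dated-archive script. A sketch with placeholder paths (the SRC/DEST defaults are illustrative; point SRC at the k3s storage directory in production):

```shell
#!/bin/sh
# Hypothetical off-cluster archive helper: tar, then ship.
set -eu
SRC="${SRC:-/tmp/lcl-demo-data}"                  # illustrative default
DEST="${DEST:-/tmp/localcloudlab-backups}"        # illustrative default
mkdir -p "$SRC" "$DEST"
STAMP="$(date +%Y%m%d_%H%M)"
# Create a dated, compressed archive of the storage directory.
tar -czf "$DEST/storage_$STAMP.tar.gz" -C "$(dirname "$SRC")" "$(basename "$SRC")"
# Ship it to the backup host (placeholder hostname):
# rsync -avz "$DEST/" root@backup-server:/localcloudlab/
ls "$DEST"
```

For PostgreSQL, prefer shipping the pg_dump output over raw data-directory copies: copying files under a running database can produce an inconsistent snapshot.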
18.7 Disaster Recovery Plans¶
A good DR plan answers:
• What happens if the node dies?
• How fast can you recover PostgreSQL?
• What about Redis?
• What about RabbitMQ?
• Can you recreate the cluster with Git + backups?
18.7.1 Disaster Scenario: Node Disk Failure¶
Impact:
• PostgreSQL lost → restore from backup
• Redis lost → cache wiped → system slows but recovers
• RabbitMQ lost → import definitions; messages lost
Recovery:
1. Deploy new server
2. Install k3s
3. Run GitOps (apply all manifests)
4. Restore PostgreSQL backup
5. Restart APIs
Total downtime depends on PostgreSQL backup size.
18.7.2 Disaster Scenario: Kubernetes Corruption¶
• Reinstall k3s
• Redeploy everything from Git
• Restore PVC backups
Your Git repository becomes the single source of truth.
18.8 Verification & Backup Testing¶
A backup you never tested is a backup that does not exist.
Monthly tasks:
✓ Restore PostgreSQL backup into a separate namespace
✓ Verify Redis persistence (RDB/AOF files exist and load cleanly on restart)
✓ Validate RabbitMQ import/export
✓ Run k3s recovery simulation on second server (optional)
Regularly exercising these steps is what makes the DR plan trustworthy.
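One cheap, automatable integrity check is to checksum every dump at backup time and verify before restoring. A sketch with illustrative paths (the demo file stands in for a real dump):

```shell
#!/bin/sh
# Hypothetical integrity check: record SHA-256 checksums when dumps are written,
# verify them before trusting a restore.
set -eu
BACKUP_DIR="${BACKUP_DIR:-/tmp/lcl-verify-demo}"   # illustrative default
mkdir -p "$BACKUP_DIR"
[ -e "$BACKUP_DIR/pg_backup.dump" ] || echo "demo dump" > "$BACKUP_DIR/pg_backup.dump"
cd "$BACKUP_DIR"
sha256sum ./*.dump > SHA256SUMS   # record at backup time
sha256sum -c SHA256SUMS           # verify later; exits non-zero on corruption
```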
18.9 Summary of Section 18¶
You now have:
✔ PostgreSQL dump + restore strategy
✔ Automated CronJob backups
✔ Redis persistence strategy
✔ RabbitMQ durability & definitions export
✔ PVC storage layout
✔ Disaster recovery plans
✔ Full recovery procedure from Git + backups
Your data layer is now fully protected and enterprise-ready.
Next section:
Section 19 — Security, Secrets Management & Hardening