---
title: "Restoring from Longhorn Backups: A Complete Guide"
date: 2025-09-11 14:30:00 +0300
categories: [kubernetes, disaster-recovery]
#tags: [kubernetes,longhorn,backup,restore,disaster-recovery,k8s,storage,minio,prometheus,grafana,jellyfin,flux,gitops]
description: Complete step-by-step guide to restoring Kubernetes applications from Longhorn backups stored in MinIO, including scaling strategies, PVC management, and real-world lessons learned.
image:
  path: /assets/img/posts/k8s-longhorn-restore.webp
  alt: Kubernetes Longhorn backup restoration guide
draft: true
---


# Restoring Your Kubernetes Applications from Longhorn Backups: A Complete Guide

When disaster strikes your Kubernetes cluster, having a solid backup strategy isn't enough—you need to know how to restore your applications quickly and reliably. Recently, I had to rebuild my entire K8S cluster from scratch and restore all my applications from 3-month-old Longhorn backups stored in MinIO. Here's the complete step-by-step process that got my media stack and observability tools back online.

## The Situation

After redeploying my K8S cluster with Flux GitOps, I found myself with:
- ✅ Fresh cluster with all applications deployed via Flux
- ✅ Longhorn storage configured and connected to MinIO backend
- ✅ All backup data visible in Longhorn UI
- ❌ Empty volumes for all applications
- ❌ Lost configurations, dashboards, and media metadata

The challenge? Restore 6 critical applications to their backup state without losing the current Flux-managed infrastructure.

## Applications to Restore

Here's what needed restoration:
- **Prometheus** (45GB) - Monitoring metrics and configuration
- **Loki** (20GB) - Log aggregation and retention
- **Jellyfin** (10GB) - Media library and metadata
- **Grafana** (10GB) - Dashboards and data sources
- **QBittorrent** (5GB) - Torrent client configuration
- **Sonarr** (5GB) - TV show management settings

## Prerequisites

Before starting, ensure you have:
- Kubernetes cluster with kubectl access
- Longhorn installed and configured
- Backup storage backend accessible (MinIO/S3)
- Applications already deployed (their current scale doesn't matter; we'll scale everything down before restoring)
- Longhorn UI access for backup management

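Before going further, it's worth a quick CLI sanity check that Longhorn can actually reach the backup store. A minimal sketch, assuming a default install in the `longhorn-system` namespace (setting resource names can vary slightly across Longhorn versions):

```bash
# Confirm the backup target (MinIO/S3 endpoint) Longhorn is configured with
kubectl -n longhorn-system get settings.longhorn.io backup-target

# The credential secret Longhorn uses to authenticate against MinIO/S3
kubectl -n longhorn-system get settings.longhorn.io backup-target-credential-secret

# Longhorn manager pods should all be Running, or the UI won't list backups
kubectl -n longhorn-system get pods -l app=longhorn-manager
```
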
## Step 1: Assess Current State

First, let's understand what we're working with:

```bash
# Check current deployments and statefulsets
kubectl get deployments -A
kubectl get statefulsets -A

# Review current PVCs
kubectl get pvc -A -o wide

# Verify Longhorn storage class
kubectl get storageclass
```

This gives you a complete picture of your current infrastructure and identifies which PVCs need replacement.

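You can also confirm what's sitting in the backup store without opening the UI. Longhorn (v1.2+) exposes backups as CRDs; treat the exact resource names below as an assumption to verify against your installed version:

```bash
# Each Longhorn volume that has backups shows up as a BackupVolume
kubectl -n longhorn-system get backupvolumes.longhorn.io

# Individual point-in-time backups in the backup store
kubectl -n longhorn-system get backups.longhorn.io
```
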
## Step 2: Scale Down Applications

**Critical:** Before touching any storage, scale down applications to prevent data corruption:

```bash
# Scale down deployments
kubectl scale deployment jellyfin --replicas=0 -n default
kubectl scale deployment qbittorrent --replicas=0 -n default
kubectl scale deployment sonarr --replicas=0 -n default
kubectl scale deployment grafana --replicas=0 -n observability

# Scale down statefulsets
kubectl scale statefulset loki --replicas=0 -n observability
kubectl scale statefulset prometheus-kube-prometheus-stack --replicas=0 -n observability
kubectl scale statefulset alertmanager-kube-prometheus-stack --replicas=0 -n observability
```

Wait for all pods to terminate before proceeding.

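Rather than polling `kubectl get pods`, you can block until the pods are actually gone. A sketch, assuming the workloads label their pods with the usual `app` label (adjust the selectors to whatever labels your charts actually use):

```bash
# Block until the media-stack pods are deleted (times out after 2 minutes)
kubectl wait --for=delete pod -l app=jellyfin -n default --timeout=120s
kubectl wait --for=delete pod -l app=qbittorrent -n default --timeout=120s

# StatefulSet pods have predictable names, so you can wait on them directly
kubectl wait --for=delete pod/loki-0 -n observability --timeout=120s
```
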
## Step 3: Remove Current Empty PVCs

Since the freshly provisioned PVCs are empty, we can safely delete them to make room for the restored volumes:

```bash
# Delete PVCs in default namespace
kubectl delete pvc jellyfin -n default
kubectl delete pvc qbittorrent -n default
kubectl delete pvc sonarr -n default

# Delete PVCs in observability namespace
kubectl delete pvc grafana -n observability
kubectl delete pvc storage-loki-0 -n observability
kubectl delete pvc prometheus-kube-prometheus-stack-db-prometheus-kube-prometheus-stack-0 -n observability
```

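A quick check that the deletions actually went through before restoring anything. A PVC can hang in `Terminating` if a pod still mounts it, which is exactly why we scaled everything down first:

```bash
# These should print nothing once the PVCs are gone
kubectl get pvc -n default | grep -E "(jellyfin|qbittorrent|sonarr)" || true
kubectl get pvc -n observability | grep -E "(grafana|loki|prometheus)" || true

# Any leftover PVs stuck in Released state can be cleaned up too
kubectl get pv | grep Released || true
```
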
## Step 4: Restore Backups via Longhorn UI

This is where the magic happens. Access your Longhorn UI and navigate to the **Backup** tab.

For each backup, click the **⟲ (restore)** button and configure:

### Prometheus Backup
- **Name**: `prometheus-restored`
- **Storage Class**: `longhorn`
- **Access Mode**: `ReadWriteOnce`

### Loki Backup
- **Name**: `loki-restored`
- **Storage Class**: `longhorn`
- **Access Mode**: `ReadWriteOnce`

### Jellyfin Backup
- **Name**: `jellyfin-restored`
- **Storage Class**: `longhorn`
- **Access Mode**: `ReadWriteOnce`

### Continue for all other backups...

**Important**: Wait for all restore operations to complete before proceeding. You can monitor progress in the Longhorn UI.

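If you prefer a terminal over the UI for monitoring, the restore state is also visible on Longhorn's Volume CRs. A hedged sketch (the exact columns depend on your Longhorn version):

```bash
# Watch the restored volumes until they settle: robustness should end up
# "healthy" and the restore should no longer be in progress
kubectl -n longhorn-system get volumes.longhorn.io -w | grep restored
```
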
## Step 5: Create PersistentVolumes

Once restoration completes, the restored Longhorn volumes need PersistentVolumes to be accessible by Kubernetes:

```yaml
# Example for Jellyfin - repeat for all applications
apiVersion: v1
kind: PersistentVolume
metadata:
  name: jellyfin-restored-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeAttributes:
      numberOfReplicas: "3"
      staleReplicaTimeout: "30"
    volumeHandle: jellyfin-restored
```

Apply this pattern for all restored volumes, adjusting the `storage` capacity and `volumeHandle` to match your backups.

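Writing six nearly identical manifests by hand gets tedious, so here's a small loop that stamps them out. It's a sketch under this post's naming scheme (`<app>-restored` volume names) and the sizes listed earlier; adjust both to your environment:

```bash
#!/usr/bin/env bash
set -euo pipefail

# name:size pairs matching the restored Longhorn volume names and backup sizes
for entry in prometheus:45Gi loki:20Gi jellyfin:10Gi grafana:10Gi qbittorrent:5Gi sonarr:5Gi; do
  app="${entry%%:*}"
  size="${entry##*:}"
  kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ${app}-restored-pv
spec:
  capacity:
    storage: ${size}
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeAttributes:
      numberOfReplicas: "3"
      staleReplicaTimeout: "30"
    volumeHandle: ${app}-restored
EOF
done
```
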
## Step 6: Create PersistentVolumeClaims

Now create PVCs that bind to the restored PersistentVolumes:

```yaml
# Example for Jellyfin
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jellyfin
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: longhorn
  volumeName: jellyfin-restored-pv
```

The key here is using `volumeName` to bind the PVC to the specific PV we created. The PVC name and namespace must match exactly what the Deployment or StatefulSet expects (e.g. `storage-loki-0` for the Loki StatefulSet), otherwise the workload will provision a fresh empty volume instead.

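If you're worried about another pending PVC grabbing one of the freshly created PVs before yours binds, you can reserve each PV for its intended claim by pre-setting `spec.claimRef`, a standard Kubernetes technique for pre-binding. A sketch for the Jellyfin pair (names as used throughout this post):

```bash
# Reserve the PV for the "jellyfin" PVC in the "default" namespace
kubectl patch pv jellyfin-restored-pv -p \
  '{"spec":{"claimRef":{"apiVersion":"v1","kind":"PersistentVolumeClaim","name":"jellyfin","namespace":"default"}}}'
```
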
## Step 7: Verify Binding

Check that all PVCs are properly bound:

```bash
# Check binding status
kubectl get pvc -n default | grep -E "(jellyfin|qbittorrent|sonarr)"
kubectl get pvc -n observability | grep -E "(grafana|storage-loki|prometheus)"

# Verify Longhorn volume status
kubectl get volumes -n longhorn-system | grep "restored"
```

You should see all PVCs in `Bound` status and the Longhorn volumes as `healthy` (they'll show as `attached` once the pods are running again).

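For a cluster-wide pass that catches anything you missed, a small one-liner (assuming the default `kubectl` column layout, where STATUS is the third column of `get pvc -A`):

```bash
# Print any PVC that is not Bound; no output means everything is bound
kubectl get pvc -A --no-headers | awk '$3 != "Bound" {print $1 "/" $2 " is " $3}'
```
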
## Step 8: Scale Applications Back Up

With storage properly restored and connected, bring your applications back online:

```bash
# Scale deployments back up
kubectl scale deployment jellyfin --replicas=1 -n default
kubectl scale deployment qbittorrent --replicas=1 -n default
kubectl scale deployment sonarr --replicas=1 -n default
kubectl scale deployment grafana --replicas=1 -n observability

# Scale statefulsets back up
kubectl scale statefulset loki --replicas=1 -n observability
kubectl scale statefulset prometheus-kube-prometheus-stack --replicas=1 -n observability
kubectl scale statefulset alertmanager-kube-prometheus-stack --replicas=1 -n observability
```

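Since this cluster is Flux-managed, manual `kubectl scale` gets overruled eventually anyway: the next reconciliation resets replicas to whatever Git declares. You can lean on that instead of scaling by hand. A sketch with hypothetical Kustomization names (`apps` and `observability` are placeholders for however your Flux repo is laid out):

```bash
# Force Flux to re-apply the desired state, restoring declared replica counts
flux reconcile kustomization apps --with-source
flux reconcile kustomization observability
```
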
## Step 9: Final Verification

Confirm everything is working correctly:

```bash
# Check pod status
kubectl get pods -A | grep -v Running | grep -v Completed

# Verify Longhorn volumes are healthy
kubectl get volumes -n longhorn-system | grep "restored"

# Test application functionality
kubectl get pods -n default -o wide
kubectl get pods -n observability -o wide
```

## Results and Key Lessons

### What Was Restored Successfully ✅
- **Jellyfin**: Complete media library, metadata, and user settings
- **Grafana**: All dashboards, data sources, and alerting rules
- **Prometheus**: Historical metrics and configuration
- **Loki**: Log retention policies and stored logs
- **QBittorrent**: Torrent configurations and download states
- **Sonarr**: TV show monitoring and quality profiles

### Important Considerations

1. **Data Age**: My backups were 3 months old, so any data created after that point was lost. Plan backup frequency accordingly.

2. **Storage Sizes**: Pay attention to backup sizes vs. current PVC sizes. My Prometheus backup was 45GB while the freshly provisioned PVC was only 15GB, so the restored PV and PVC had to be sized up to match the backup (see the size-check sketch after this list).

3. **Volume Naming**: Longhorn creates restored volumes with specific names. The PV `volumeHandle` must match exactly.

4. **Application Dependencies**: Some applications have interdependencies. Restore core infrastructure (Prometheus, Grafana) before application-specific services.

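One way to avoid size mismatches up front is to read the actual size off the restored Longhorn volume before writing the PV manifest. A sketch, assuming Longhorn's Volume CR exposes `spec.size` in bytes (true in current versions, but worth verifying on yours):

```bash
# Size of the restored volume in bytes; the PV capacity must be at least this
kubectl -n longhorn-system get volumes.longhorn.io prometheus-restored \
  -o jsonpath='{.spec.size}'
```
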
## Alternative: CLI-Based Restoration

For automation, or when UI access isn't available, you can restore by creating Longhorn's Volume CRD directly. Field names vary between Longhorn API versions, so treat this as a template to check against your installed CRDs:

```yaml
apiVersion: longhorn.io/v1beta1
kind: Volume
metadata:
  name: jellyfin-restored
  namespace: longhorn-system
spec:
  size: "10737418240" # Size in bytes
  restoreVolumeRecurringJob: false
  fromBackup: "s3://your-minio-bucket/backups/backup-name"
```

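To fill in `fromBackup`, you need the real backup URL rather than the placeholder above. Assuming Longhorn's backup CRDs (v1.2+), something like the following surfaces it; the field paths here are assumptions worth checking with `kubectl explain backups.longhorn.io.status`:

```bash
# List backups with the URL that fromBackup expects
kubectl -n longhorn-system get backups.longhorn.io \
  -o custom-columns=NAME:.metadata.name,VOLUME:.status.volumeName,URL:.status.url
```
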
## Conclusion

Restoring Kubernetes applications from Longhorn backups requires careful orchestration of scaling, PVC management, and volume binding. The process took about 45 minutes for 6 applications, but the result was a complete restoration to the previous backup state.

Key takeaways:
- **Always scale down applications first** to prevent corruption
- **Understand the relationship** between Longhorn volumes, PVs, and PVCs
- **Test your backup restoration process** before you need it
- **Document your PVC naming conventions** for faster recovery
- **Monitor backup age** vs. acceptable data loss

Having a solid backup strategy is crucial, but knowing how to restore efficiently under pressure is what separates good infrastructure management from great infrastructure management.

## Next Steps

Consider implementing:
- **Automated backup validation** to ensure restorability
- **Backup age monitoring** with alerts
- **Documentation of critical PVC mappings**
- **Regular disaster recovery drills**

Your future self will thank you when disaster strikes again.