Commit 87e3080

docs: add disaster recovery solution documentation for Harbor (#71)
* feat: add disaster recovery solution documentation for Harbor
  - Introduced a comprehensive guide on performing disaster recovery for Harbor using Object Storage and PostgreSQL.
  - Included detailed architecture, terminology, setup procedures for primary and secondary Harbor instances, and failover procedures.
  - Added a visual representation of the disaster recovery architecture in SVG format.
* chore: update front matter for disaster recovery documentation
  - Changed the ProductsVersion from 4.1.0 to 4.x for broader applicability.
  - Added an ID for the documentation entry.
  - Adjusted the kind field formatting for consistency.
1 parent 90cf1ee commit 87e3080

File tree

2 files changed, +743 −0 lines changed

Lines changed: 355 additions & 0 deletions
---
kind:
  - Solution
products:
  - Alauda DevOps
ProductsVersion:
  - 4.x
id: KB251000012
---

# How to Perform Disaster Recovery for Harbor

## Issue

This solution describes how to build a Harbor disaster recovery solution based on the disaster recovery capabilities of Object Storage and PostgreSQL. It focuses on data-level disaster recovery; users need to implement their own mechanism for switching the Harbor access address.

## Environment

Harbor CE Operator: >=v2.12.4
## Terminology

| Term | Description |
|------|-------------|
| **Primary Harbor** | The active Harbor instance that serves normal business operations and user requests. This instance is fully operational with all components running. |
| **Secondary Harbor** | The standby Harbor instance deployed in a different cluster/region with zero replicas. It remains dormant until activated during disaster recovery scenarios. |
| **Primary PostgreSQL** | The active PostgreSQL database cluster that handles all data transactions and serves as the source for data replication to the secondary database. |
| **Secondary PostgreSQL** | The hot standby PostgreSQL database that receives real-time data replication from the primary database. It can be promoted to the primary role during failover. |
| **Primary Object Storage** | The active S3-compatible object storage system that stores all Harbor registry data and serves as the source for storage replication. |
| **Secondary Object Storage** | The synchronized backup object storage system that receives data replication from the primary storage. It ensures data availability during disaster recovery. |
| **Recovery Point Objective (RPO)** | The maximum acceptable amount of data loss measured in time (e.g., 5 minutes, 1 hour). It defines how much data can be lost during a disaster before it becomes unacceptable. |
| **Recovery Time Objective (RTO)** | The maximum acceptable downtime measured in time (e.g., 15 minutes, 2 hours). It defines how quickly the system must be restored after a disaster. |
| **Failover** | The process of switching from the primary system to the secondary system when the primary system becomes unavailable or fails. |
| **Data Synchronization** | The continuous process of replicating data from primary systems to secondary systems to maintain consistency and enable disaster recovery. |
| **Cold Standby** | A standby system that is not continuously synchronized with the primary system and requires manual activation with potential data loss during disaster recovery. |
## Architecture

![harbor](/harbor-disaster-recovery.drawio.svg)

### Architecture Overview

The Harbor disaster recovery solution implements a **cold-standby architecture** for Harbor services with **hot-standby database replication**. This hybrid approach provides disaster recovery through real-time database synchronization combined with a manual Harbor service failover procedure. The architecture consists of two Harbor instances deployed across different clusters or regions: the secondary Harbor instance remains dormant until activated during a disaster, while the database layer stays continuously synchronized.

#### Core Components

- **Primary Harbor**: Active instance serving normal business operations and user requests
- **Secondary Harbor**: Standby instance with zero replicas, ready for failover scenarios
- **Primary PostgreSQL**: Active database handling all data transactions
- **Secondary PostgreSQL**: Hot standby database with real-time data replication
- **Primary Object Storage**: Active S3-compatible storage for registry data
- **Secondary Object Storage**: Synchronized backup storage with data replication

#### Data Synchronization Strategy

The solution leverages two independent data synchronization mechanisms:

1. **Database Layer**: PostgreSQL streaming replication ensures real-time transaction log synchronization between the primary and secondary databases
2. **Storage Layer**: Object storage replication maintains data consistency across the primary and secondary storage systems

#### Disaster Recovery Configuration

1. **Deploy Primary Harbor**: Configure the primary instance to connect to the primary PostgreSQL database and use primary object storage as the registry backend
2. **Deploy Secondary Harbor**: Configure the secondary instance to connect to the secondary PostgreSQL database and use secondary object storage as the registry backend
3. **Initialize Standby State**: Set the replica count of all secondary Harbor components to 0 to prevent unnecessary background operations and resource consumption
#### Failover Procedure

When a disaster occurs, the following steps ensure a transition to the secondary environment:

1. **Verify Primary Failure**: Confirm that all primary Harbor components are non-functional
2. **Promote Database**: Elevate the secondary PostgreSQL to the primary role using database failover procedures (no data loss due to hot standby)
3. **Promote Storage**: Activate the secondary object storage as the primary storage system
4. **Activate Harbor**: Scale up the secondary Harbor components by setting their replica counts to greater than 0
5. **Update Routing**: Switch external access addresses to point to the secondary Harbor instance
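Step 4 above can be sketched as a small helper that renders the scale-up patch for the Harbor resource. The resource name `dr-harbor` and the component list follow the configuration examples later in this document; the `kubectl patch` usage is an assumption to adapt to your environment.

```bash
#!/usr/bin/env bash
# Sketch only: render a merge patch that raises the replica count of all
# Harbor components. The component names mirror the Harbor CR examples
# in this document; verify them against your own instance.
set -euo pipefail

render_scale_up_patch() {
  local replicas="${1:-1}"
  cat <<EOF
spec:
  helmValues:
    core:
      replicas: ${replicas}
    portal:
      replicas: ${replicas}
    jobservice:
      replicas: ${replicas}
    registry:
      replicas: ${replicas}
    trivy:
      replicas: ${replicas}
EOF
}
```

A possible invocation (names are assumptions): `kubectl -n harbor patch harbor dr-harbor --type merge --patch "$(render_scale_up_patch 1)"`.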
## Harbor Disaster Recovery Setup Procedure with `Alauda Build of Rook-Ceph` and `Alauda support for PostgreSQL`

### Prerequisites

1. Prepare a primary cluster and a disaster recovery cluster (or clusters in different regions) in advance.
2. Complete the deployment of `Alauda Build of Rook-Ceph` and `Alauda support for PostgreSQL`.
3. Refer to `Alauda Build of Rook-Ceph`, `Alauda support for PostgreSQL`, and the [Harbor Instance Deployment guide](https://docs.alauda.io/alauda-build-of-harbor/2.12/install/03_harbor_deploy.html) to plan the required system resources in advance.

### Building a PostgreSQL Disaster Recovery Cluster with `Alauda support for PostgreSQL`

Refer to the `PostgreSQL Hot Standby Cluster Configuration Guide` to build a disaster recovery cluster using `Alauda support for PostgreSQL`.

Ensure that the Primary PostgreSQL and Secondary PostgreSQL are in different clusters (or different regions).

You can search for `PostgreSQL Hot Standby Cluster Configuration Guide` on [Alauda Knowledge](https://cloud.alauda.io/knowledges#/) to obtain it.

:::warning

The `PostgreSQL Hot Standby Cluster Configuration Guide` describes how to build a disaster recovery cluster using `Alauda support for PostgreSQL`. Please ensure compatibility with the appropriate ACP version when using this configuration.

:::
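The authoritative manifest comes from the `PostgreSQL Hot Standby Cluster Configuration Guide`. Purely as an illustration of the idea (not the guide's actual content), a Zalando-operator-style standby cluster streaming from the primary might look like the sketch below; every field name and value here is an assumption to verify against the guide and your operator version.

```yaml
# Illustrative sketch only -- follow the PostgreSQL Hot Standby Cluster
# Configuration Guide for the authoritative manifest.
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: acid-secondary-pg   # assumption: runs in the disaster recovery cluster
  namespace: harbor
spec:
  teamId: acid
  numberOfInstances: 1
  standby:
    # Stream WAL from the primary cluster's externally reachable address
    standby_host: acid-primary-pg.harbor.svc
    standby_port: "5432"
  volume:
    size: 10Gi
```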
### Building an Object Storage Disaster Recovery Cluster with `Alauda Build of Rook-Ceph`

Refer to [Object Storage Disaster Recovery](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html) to build a disaster recovery cluster using `Alauda Build of Rook-Ceph`.

You need to create a CephObjectStoreUser in advance to obtain the access credentials for Object Storage, and prepare a Harbor registry bucket on the Primary Object Storage:

1. Create a CephObjectStoreUser on the Primary Object Storage to obtain access credentials: [Create CephObjectStoreUser](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/create_object_user.html).

:::info
You only need to create the CephObjectStoreUser on the Primary Object Storage. The user information is automatically synchronized to the Secondary Object Storage through the disaster recovery replication mechanism.
:::
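The linked guide covers the details; for orientation only, a minimal `CephObjectStoreUser` manifest typically looks like the sketch below, where the user name, namespace, and store name are assumptions for your environment. Rook writes the generated access keys into a secret named `rook-ceph-object-user-<store>-<user>` in the same namespace.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: harbor-user          # assumption: pick your own user name
  namespace: rook-ceph       # assumption: namespace of the CephObjectStore
spec:
  store: primary-store       # assumption: name of your primary CephObjectStore
  displayName: Harbor registry user
```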
2. Obtain `<PRIMARY_OBJECT_STORAGE_ADDRESS>`, the access address of the Object Storage, from the [Configure External Access for Primary Zone](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html#configure-external-access-for-primary-zone) step of `Object Storage Disaster Recovery`.

3. Create a Harbor registry bucket on the Primary Object Storage using `mc`; in this example, the bucket name is `harbor-registry`:

```bash
$ mc alias set primary-s3 <PRIMARY_OBJECT_STORAGE_ADDRESS> <PRIMARY_OBJECT_STORAGE_ACCESS_KEY> <PRIMARY_OBJECT_STORAGE_SECRET_KEY>
Added `primary-s3` successfully.
$ mc alias list
primary-s3
  URL       : <PRIMARY_OBJECT_STORAGE_ADDRESS>
  AccessKey : <PRIMARY_OBJECT_STORAGE_ACCESS_KEY>
  SecretKey : <PRIMARY_OBJECT_STORAGE_SECRET_KEY>
  API       : s3v4
  Path      : auto
  Src       : /home/demo/.mc/config.json
$ mc mb primary-s3/harbor-registry
Bucket created successfully `primary-s3/harbor-registry`
$ mc ls primary-s3/harbor-registry
```
### Set Up Primary Harbor

Deploy the Primary Harbor instance by following the [Harbor Instance Deployment](https://docs.alauda.io/alauda-build-of-harbor/2.12/install/03_harbor_deploy.html) guide. Configure it to connect to the Primary PostgreSQL database and use the Primary Object Storage as the [Registry storage backend](https://docs.alauda.io/alauda-build-of-harbor/2.12/install/03_harbor_deploy.html#storage-yaml-snippets).

Configuration example:

```yaml
apiVersion: operator.alaudadevops.io/v1alpha1
kind: Harbor
metadata:
  name: dr-harbor
spec:
  externalURL: http://dr-harbor.example.com
  helmValues:
    core:
      replicas: 1
      resources:
        limits:
          cpu: 400m
          memory: 512Mi
        requests:
          cpu: 200m
          memory: 256Mi
    database:
      external:
        coreDatabase: harbor
        existingSecret: primary-pg
        existingSecretKey: password
        host: acid-primary-pg.harbor.svc
        port: 5432
        sslmode: require
        username: postgres
      type: external
    existingSecretAdminPassword: harbor-account
    existingSecretAdminPasswordKey: password
    expose:
      ingress:
        hosts:
          core: dr-harbor.example.com
      tls:
        enabled: false
      type: ingress
    jobservice:
      replicas: 1
      resources:
        limits:
          cpu: 400m
          memory: 512Mi
        requests:
          cpu: 200m
          memory: 256Mi
    persistence:
      enabled: true
      imageChartStorage:
        disableredirect: true
        s3:
          existingSecret: object-storage-secret
          bucket: harbor-registry
          regionendpoint: <PRIMARY_OBJECT_STORAGE_ADDRESS>
          v4auth: true
        type: s3
      persistentVolumeClaim:
        jobservice:
          jobLog:
            accessMode: ReadWriteMany
            size: 10Gi
            storageClass: nfs
        trivy:
          accessMode: ReadWriteMany
          size: 10Gi
          storageClass: nfs
    portal:
      replicas: 1
      resources:
        limits:
          cpu: 400m
          memory: 512Mi
        requests:
          cpu: 200m
          memory: 256Mi
    redis:
      external:
        addr: primary-redis-0.primary-redis-hl.harbor.svc:26379
        existingSecret: redis-redis-s3-default-credential
        existingSecretKey: password
        sentinelMasterSet: mymaster
      type: external
    registry:
      controller:
        resources:
          limits:
            cpu: 200m
            memory: 410Mi
          requests:
            cpu: 100m
            memory: 200Mi
      registry:
        resources:
          limits:
            cpu: 600m
            memory: 1638Mi
          requests:
            cpu: 300m
            memory: 419Mi
      replicas: 1
    trivy:
      offlineScan: true
      replicas: 1
      resources:
        limits:
          cpu: 800m
          memory: 2Gi
        requests:
          cpu: 400m
          memory: 200Mi
  skipUpdate: true
  version: 2.12.4
```
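The `existingSecret: object-storage-secret` referenced under `imageChartStorage.s3` must exist before deployment. With the upstream Harbor Helm chart, that secret is expected to carry the S3 credentials under the `REGISTRY_STORAGE_S3_ACCESSKEY` and `REGISTRY_STORAGE_S3_SECRETKEY` keys; verify the expected key names against your chart version. A sketch:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: object-storage-secret
  namespace: harbor            # assumption: the namespace Harbor is deployed in
type: Opaque
stringData:
  # Credentials of the CephObjectStoreUser created earlier
  REGISTRY_STORAGE_S3_ACCESSKEY: <PRIMARY_OBJECT_STORAGE_ACCESS_KEY>
  REGISTRY_STORAGE_S3_SECRETKEY: <PRIMARY_OBJECT_STORAGE_SECRET_KEY>
```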
### Set Up Secondary Harbor

Deploy the Secondary Harbor instance by following the [Harbor Instance Deployment](https://docs.alauda.io/alauda-build-of-harbor/2.12/install/03_harbor_deploy.html) guide. Configure it to connect to the Secondary PostgreSQL database and use the Secondary Object Storage as the [Registry storage backend](https://docs.alauda.io/alauda-build-of-harbor/2.12/install/03_harbor_deploy.html#storage-yaml-snippets).

:::info
The instance names for both Primary Harbor and Secondary Harbor must be identical.
:::

Set the replica count of all Secondary Harbor components to 0 to prevent the Secondary Harbor from performing unnecessary background operations.

Configuration YAML snippet example:

```yaml
apiVersion: operator.alaudadevops.io/v1alpha1
kind: Harbor
metadata:
  name: dr-harbor
spec:
  helmValues:
    core:
      replicas: 0
    portal:
      replicas: 0
    jobservice:
      replicas: 0
    registry:
      replicas: 0
    trivy:
      replicas: 0
```
### Primary-Standby Switchover Procedure in Disaster Scenarios

1. First confirm that no Primary Harbor component is still running; if any are, stop all Primary Harbor components first.
2. Promote the Secondary PostgreSQL to Primary PostgreSQL. Refer to the switchover procedure in the `PostgreSQL Hot Standby Cluster Configuration Guide`.
3. Promote the Secondary Object Storage to Primary Object Storage. Refer to the switchover procedure in [Alauda Build of Rook-Ceph Failover](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html#procedures-1).

4. Scale up all Secondary Harbor components by setting their replica counts to greater than 0.

Configuration YAML snippet example:

```yaml
apiVersion: operator.alaudadevops.io/v1alpha1
kind: Harbor
metadata:
  name: dr-harbor
spec:
  helmValues:
    core:
      replicas: 1
    portal:
      replicas: 1
    jobservice:
      replicas: 1
    registry:
      replicas: 1
    trivy:
      replicas: 1
```

5. Test image push and pull to verify that Harbor is working properly.
6. Switch external access addresses to the Secondary Harbor.
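A push/pull verification (step 5 of the switchover procedure) can be scripted as a list of commands to run against the promoted instance. The helper below only renders the commands; the registry host, the `library` project, and the use of `docker` are assumptions for your environment.

```bash
#!/usr/bin/env bash
# Sketch only: print a docker push/pull smoke test for a Harbor registry.
# Run the printed commands (after `docker login`) against the promoted
# Secondary Harbor to verify it is serving images.
smoke_test_cmds() {
  local registry="$1" image="busybox:latest"
  cat <<EOF
docker pull ${image}
docker tag ${image} ${registry}/library/${image}
docker push ${registry}/library/${image}
docker pull ${registry}/library/${image}
EOF
}
```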
### Disaster Recovery Data Check

Check the synchronization status of Object Storage and PostgreSQL to confirm that disaster recovery has succeeded.

- Check Ceph Object Storage synchronization status: [Object Storage Disaster Recovery](https://docs.alauda.io/container_platform/4.1/storage/storagesystem_ceph/how_to/disaster_recovery/dr_object.html#check-ceph-object-storage-synchronization-status)
- Check PostgreSQL synchronization status: refer to the status check section of the `PostgreSQL Hot Standby Cluster Configuration Guide`.
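As a trivial illustration of turning a measured replication lag into a pass/fail check against your RPO budget (the lag value itself must come from your own monitoring, e.g. `pg_stat_replication` or the Ceph sync status; this helper is an assumption, not part of any product CLI):

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare an observed replication lag (in bytes)
# against the maximum loss you are willing to accept. How you obtain
# the lag depends on your PostgreSQL/Ceph tooling.
lag_within_rpo() {
  local lag_bytes="$1" max_bytes="$2"
  if [ "$lag_bytes" -le "$max_bytes" ]; then
    echo "OK: lag ${lag_bytes} bytes is within budget ${max_bytes}"
  else
    echo "WARN: lag ${lag_bytes} bytes exceeds budget ${max_bytes}"
    return 1
  fi
}
```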
### Recovery Objectives
322+
323+
#### Recovery Point Objective (RPO)
324+
325+
The RPO represents the maximum acceptable data loss during a disaster recovery scenario. In this Harbor disaster recovery solution:
326+
327+
- **Database Layer**: Near-zero data loss due to PostgreSQL hot standby with streaming replication
328+
- **Storage Layer**: Near-zero data loss due to synchronous object storage replication
329+
- **Overall RPO**: Near-zero data loss due to synchronous replication of both database and object storage layers
330+
331+
**Factors affecting RPO:**
332+
333+
- Network latency between primary and secondary clusters
334+
- Object storage synchronous replication and consistency model
335+
- Database replication lag and commit acknowledgment settings
336+
337+
#### Recovery Time Objective (RTO)
338+
339+
The RTO represents the maximum acceptable downtime during disaster recovery. This solution provides:
340+
341+
- **Manual Components**: Harbor service activation and external routing updates require manual intervention
342+
- **Typical RTO**: 5-15 minutes for complete service restoration
343+
344+
**RTO Breakdown:**
345+
346+
- Database failover: 1-2 minutes (manual)
347+
- Storage failover: 1-2 minutes (manual)
348+
- Harbor service activation: 2-5 minutes (manual, cold standby requires startup time)
349+
- External routing updates: 1-5 minutes (manual, depends on DNS propagation)
350+
351+
## Building a Harbor Disaster Recovery Solution with Other Object Storage and PostgreSQL

The operational steps are similar to those for `Alauda Build of Rook-Ceph` and `Alauda support for PostgreSQL`: simply replace Object Storage and PostgreSQL with your chosen object storage and PostgreSQL solutions.

Ensure that the chosen Object Storage and PostgreSQL solutions support disaster recovery capabilities.
