feat(meta): support meta store recovery #6737
Conversation
Force-pushed from 153b9dc to 413e909.
Force-pushed from 413e909 to 1afeefa.
Codecov Report

@@           Coverage Diff            @@
##             main    #6737    +/-   ##
==========================================
+ Coverage   73.22%   73.25%   +0.03%
==========================================
  Files        1026     1032       +6
  Lines      164174   164627     +453
==========================================
+ Hits       120215   120603     +388
- Misses      43959    44024      +65
# Conflicts:
#   risedev.yml
Self {
    id,
    hummock_version_id: v.id,
    ssts: v.get_sst_ids(),
Do SSTs in uncleaned versions before v need to be stored?
From the perspective of the newly restored cluster, SSTs in uncleaned versions of the previous cluster are treated as orphan SSTs and can be deleted via full GC, as the new cluster is completely unaware of these uncleaned versions.
Compared with handling these SSTs in the backup/recovery procedure, I think the current implementation is simpler.
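To make this reasoning concrete, here is a minimal sketch of the classification a full GC pass performs, under the assumption that GC simply diffs object-store contents against the SST ids tracked in current metadata; the function and its inputs are illustrative, not the actual GC code in this PR.

use std::collections::HashSet;

/// Sketch: after restore, any SST object found in object storage that is not
/// referenced by the restored hummock metadata is an orphan. SSTs that only
/// belonged to uncleaned versions of the previous cluster never enter
/// `tracked`, so they are picked up here and deleted.
fn orphan_sst_ids(objects_in_store: &[u64], tracked: &HashSet<u64>) -> Vec<u64> {
    objects_in_store
        .iter()
        .copied()
        .filter(|id| !tracked.contains(id))
        .collect()
}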
hzxa21 left a comment
Overall LGTM. Many tricky corner cases are covered. Great job!
I left some comments, but they are not big issues, so it is okay to merge this PR first.
let job_status = meta_client.get_backup_job_status(job_id).await?;
match job_status {
    BackupJobStatus::Running => {
        tracing::info!("backup job is still running: job {}", job_id);
Should we print out the result instead?
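If the terminal result were printed rather than only logged, the surrounding polling loop might look roughly like the sketch below; the handling of non-Running statuses is an assumption about the enum, not necessarily what this PR defines.

// Sketch only: variant handling and Debug formatting are assumed.
loop {
    match meta_client.get_backup_job_status(job_id).await? {
        BackupJobStatus::Running => {
            tracing::info!("backup job is still running: job {}", job_id);
        }
        other => {
            // Print the terminal status so the user sees the result directly.
            println!("backup job {} finished: {:?}", job_id, other);
            break;
        }
    }
    tokio::time::sleep(std::time::Duration::from_secs(1)).await;
}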
.enable_all()
.build()
.unwrap()
.block_on(risingwave_meta::backup_restore::restore(opts))
Just a note: I am wondering whether we should have a mechanism to check whether the cluster functions as expected after restore. Maybe we should try to query the pg_catalog tables. This can also be done externally.
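As a sketch of such an external check, the following queries a pg_catalog table through the restored cluster's frontend using tokio_postgres; the connection string and the choice of table are assumptions, and this is not part of the PR.

use tokio_postgres::NoTls;

// Hypothetical post-restore smoke test: if pg_catalog is queryable, the
// restored catalog metadata is at least readable end to end.
async fn check_restored_cluster() -> Result<(), tokio_postgres::Error> {
    let (client, connection) =
        tokio_postgres::connect("host=127.0.0.1 port=4566 user=root dbname=dev", NoTls).await?;
    // Drive the connection in the background.
    tokio::spawn(connection);
    let rows = client
        .query("SELECT relname FROM pg_catalog.pg_class LIMIT 10", &[])
        .await?;
    println!("restored cluster returned {} catalog rows", rows.len());
    Ok(())
}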
    .finish_backup_job(job_id, BackupJobResult::Failed(e))
    .await;
}
let job_result = job.await.unwrap_or_else(BackupJobResult::Failed);
Do we need a way to GC snapshot objects for the failed jobs? Given that the objects are small, to keep it simple, I think exposing a metric and doing the GC manually seems okay to me.
pub enum BackupJobResult {
    Finished,
    Succeeded,
We need to expose some metrics for backup and restore (e.g. success/fail rate, elapsed time, ...).
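For illustration, a minimal sketch of such metrics with the prometheus crate; the metric names here are made up and need not match what the follow-up PR eventually adds.

use prometheus::{register_histogram, register_int_counter, Histogram, IntCounter};

lazy_static::lazy_static! {
    // Hypothetical metric names.
    static ref BACKUP_JOB_SUCCEEDED: IntCounter = register_int_counter!(
        "backup_job_succeeded_count", "Number of succeeded backup jobs").unwrap();
    static ref BACKUP_JOB_FAILED: IntCounter = register_int_counter!(
        "backup_job_failed_count", "Number of failed backup jobs").unwrap();
    static ref BACKUP_JOB_LATENCY: Histogram = register_histogram!(
        "backup_job_latency_seconds", "Elapsed time of backup jobs in seconds").unwrap();
}

fn record_backup_result(succeeded: bool, elapsed_secs: f64) {
    if succeeded {
        BACKUP_JOB_SUCCEEDED.inc();
    } else {
        BACKUP_JOB_FAILED.inc();
    }
    BACKUP_JOB_LATENCY.observe(elapsed_secs);
}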
}
async fn restore_metadata<S: MetaStore>(meta_store: S, snapshot: MetaSnapshot) -> BackupResult<()> {
    restore_default_cf(&meta_store, &snapshot).await?;
Do we need to guarantee the atomicity of restoring different parts of the metadata? In other words, do we need to wrap restore_metadata_model in a meta transaction?
Why? I think that while restore_metadata is called, there will not be other tasks running.
I am mainly thinking about simplifying metadata cleanup when restore fails after some of the metadata has been written to etcd. Doing it manually also looks acceptable to me.
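To make the atomicity question concrete, here is a rough sketch of the all-or-nothing approach, written against a hypothetical AtomicKvStore trait rather than the real MetaStore interface in this codebase, which may differ.

use std::collections::BTreeMap;

/// Hypothetical single-batch write interface, standing in for a meta transaction.
trait AtomicKvStore {
    fn apply_batch(&self, batch: BTreeMap<Vec<u8>, Vec<u8>>) -> Result<(), String>;
}

/// Decode every key-value pair from the snapshot first, then commit them in one
/// call, so a failed restore leaves the (empty) meta store untouched and no
/// manual cleanup is needed.
fn restore_atomically<S: AtomicKvStore>(
    store: &S,
    restored_kvs: Vec<(Vec<u8>, Vec<u8>)>,
) -> Result<(), String> {
    let batch: BTreeMap<_, _> = restored_kvs.into_iter().collect();
    store.apply_batch(batch) // all-or-nothing: nothing is written on error
}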
/// `HummockVersionSafePoint` prevents hummock versions greater than or equal to it from being GCed.
/// It's used by the meta node itself to temporarily pin versions.
#[derive(Clone, PartialEq, Eq)]
pub struct HummockVersionSafePoint {
Better to add a metric for the min safe point, so that if the min pinned id is stuck we know whether it is caused by backup or by a CN version pin.
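A tiny sketch of such a metric; the gauge name is hypothetical. Note that prometheus integer metrics are i64-backed, which is why the follow-up below reports the min safe point watermark displayed as -1.

use prometheus::{register_int_gauge, IntGauge};

lazy_static::lazy_static! {
    // Hypothetical gauge name.
    static ref MIN_SAFE_POINT: IntGauge = register_int_gauge!(
        "backup_min_version_safe_point", "Min hummock version safe point held by backups").unwrap();
}

fn report_min_safe_point(safe_point: u64) {
    // u64 watermarks beyond i64::MAX (or sentinel values) show up negative.
    MIN_SAFE_POINT.set(safe_point as i64);
}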
Add metrics for backup manager, as a follow-up to #6737. Also refactor the backup integration test to avoid unnecessary compilation of risectl.
[Screenshot: Grafana panels for the new backup metrics.]
A u64 watermark like the min safe point is displayed as -1, as the prometheus IntCounter is i64.
[Screenshot: the min safe point panel showing -1.]
Approved-By: hzxa21
Co-Authored-By: zwang28 <[email protected]>
I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.
What's changed and what's your intention?
This is the restoring part of meta backup & restore. The PR looks a bit huge, but a lot of the line changes are tests or generated files:
- backup-restore, which is used to restore a meta snapshot to an empty meta store.
- backup-meta and delete-meta-snapshots, which are used to create and delete meta snapshots respectively, via the meta service.
- ./risedev meta-backup-restore-test, included in the CI daily test.

Checklist

- ./risedev check (or alias, ./risedev c)

Documentation
If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.
Types of user-facing changes
Please keep the types that apply to your changes, and remove those that do not apply.
Release note
Add support for meta store backup and recovery.
Refer to a related PR or issue link (optional)
#6482