
Conversation


@zwang28 zwang28 commented Dec 5, 2022

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

This is the restoring part of meta backup & restore. The PR looks large, but many of the line changes are in tests or generated files:

  • Extract a new crate risingwave_backup, which contains common SDK components such as backup storage (a hypothetical sketch of such an abstraction follows this list).
  • Add a new standalone command backup-restore, which restores a meta snapshot into an empty meta store.
  • Add new risectl commands backup-meta and delete-meta-snapshots, which create and delete meta snapshots respectively via the meta service.
  • Add an integration test suite, run via ./risedev meta-backup-restore-test and included in the daily CI test.
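
For illustration, a hypothetical sketch of what a backup-storage abstraction in risingwave_backup might look like; the trait and method names below are assumptions, not the crate's actual API:

```rust
use async_trait::async_trait;

/// Hypothetical sketch only: the real trait in `risingwave_backup` may differ in
/// names, signatures, and error handling.
#[async_trait]
pub trait BackupStorage: Send + Sync {
    type Error;

    /// Persist a serialized meta snapshot under the given id.
    async fn create(&self, snapshot_id: u64, payload: &[u8]) -> Result<(), Self::Error>;
    /// Load a previously stored meta snapshot.
    async fn get(&self, snapshot_id: u64) -> Result<Vec<u8>, Self::Error>;
    /// List ids of all stored meta snapshots.
    async fn list(&self) -> Result<Vec<u64>, Self::Error>;
    /// Delete the given meta snapshots, e.g. for `risectl delete-meta-snapshots`.
    async fn delete(&self, snapshot_ids: &[u64]) -> Result<(), Self::Error>;
}
```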

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.

Types of user-facing changes

Please keep the types that apply to your changes, and remove those that do not apply.

  • Installation and deployment

Release note

Add support for meta store backup and recovery.

Refer to a related PR or issue link (optional)

#6482

@github-actions github-actions bot added the type/feature Type: New feature. label Dec 5, 2022
@zwang28 zwang28 force-pushed the wangzheng/br_restore_impl branch 2 times, most recently from 153b9dc to 413e909 Compare December 6, 2022 09:24
@zwang28 zwang28 force-pushed the wangzheng/br_restore_impl branch from 413e909 to 1afeefa Compare December 6, 2022 09:26

codecov bot commented Dec 6, 2022

Codecov Report

Merging #6737 (8b243f0) into main (40bcdae) will increase coverage by 0.03%.
The diff coverage is 74.01%.

@@            Coverage Diff             @@
##             main    #6737      +/-   ##
==========================================
+ Coverage   73.22%   73.25%   +0.03%     
==========================================
  Files        1026     1032       +6     
  Lines      164174   164627     +453     
==========================================
+ Hits       120215   120603     +388     
- Misses      43959    44024      +65     
Flag    Coverage Δ
rust    73.25% <74.01%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                              Coverage Δ
src/cmd_all/src/playground.rs               0.00% <0.00%> (ø)
src/ctl/src/cmd_impl/meta/backup_meta.rs    0.00% <0.00%> (ø)
src/ctl/src/lib.rs                          0.86% <0.00%> (-0.04%) ⬇️
src/meta/src/backup_restore/error.rs        0.00% <ø> (ø)
src/meta/src/lib.rs                         2.40% <ø> (+1.20%) ⬆️
src/meta/src/rpc/service/mod.rs             0.00% <ø> (ø)
src/prost/src/lib.rs                        89.74% <ø> (ø)
src/rpc_client/src/meta_client.rs           0.00% <0.00%> (ø)
src/storage/backup/src/error.rs             0.00% <0.00%> (ø)
src/storage/backup/src/meta_snapshot.rs     100.00% <ø> (ø)
... and 24 more


@zwang28 zwang28 marked this pull request as ready for review December 6, 2022 10:09
@zwang28 zwang28 changed the title from "feat(meta): support meta restore" to "feat(meta): support meta store recovery" Dec 10, 2022
@zwang28 zwang28 requested review from Li0k and hzxa21 December 11, 2022 05:46
Self {
id,
hummock_version_id: v.id,
ssts: v.get_sst_ids(),
Contributor


Do SSTs in uncleaned versions before v need to be stored?

Contributor Author

@zwang28 zwang28 Dec 12, 2022


From the perspective of the newly restored cluster, SSTs in uncleaned versions of the previous cluster are treated as orphan SSTs and can be deleted via full GC, since the new cluster is completely unaware of these uncleaned versions.
Compared with handling these SSTs in the backup/recovery procedure, I think the current implementation is simpler.

Collaborator

@hzxa21 hzxa21 left a comment


Overall LGTM. Many tricky corner cases are covered. Great job!

I left some comments but they are not big issues so it is okay to merge this PR first.

let job_status = meta_client.get_backup_job_status(job_id).await?;
match job_status {
BackupJobStatus::Running => {
tracing::info!("backup job is still running: job {}", job_id);
Collaborator


should we print out the result instead?
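
A minimal sketch of what the suggested change could look like: keep logging while the job runs, and print the terminal status once it leaves Running. MetaClient::get_backup_job_status is taken from the diff above; the catch-all arm, Debug formatting, job id type, error type, and 1s poll interval are assumptions:

```rust
// Print the final result to the risectl user instead of only logging progress.
async fn wait_backup_job(meta_client: &MetaClient, job_id: u64) -> anyhow::Result<()> {
    loop {
        let job_status = meta_client.get_backup_job_status(job_id).await?;
        match job_status {
            BackupJobStatus::Running => {
                tracing::info!("backup job is still running: job {}", job_id);
            }
            // Any non-Running status is terminal in this sketch; surface it directly.
            other => {
                println!("backup job {} finished with status {:?}", job_id, other);
                return Ok(());
            }
        }
        tokio::time::sleep(std::time::Duration::from_secs(1)).await;
    }
}
```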

.enable_all()
.build()
.unwrap()
.block_on(risingwave_meta::backup_restore::restore(opts))
Collaborator


Just a note: I am wondering whether we should have a mechanism to check whether the cluster functions as expected after restore. Maybe we should try to query the pg_catalog tables. This can also be done externally.
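
A hedged sketch of the kind of external post-restore sanity check suggested here, using tokio-postgres against the restored cluster's frontend over the Postgres wire protocol; the host, port, user, and database are placeholder defaults, not values taken from this PR:

```rust
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    // Connection parameters are placeholders for a locally running frontend.
    let (client, connection) =
        tokio_postgres::connect("host=127.0.0.1 port=4566 user=root dbname=dev", NoTls).await?;
    // The connection object drives the actual I/O; run it on a separate task.
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {}", e);
        }
    });
    // If the restored metadata is usable, catalog queries should succeed.
    let rows = client
        .query("SELECT count(*) FROM pg_catalog.pg_tables", &[])
        .await?;
    let n: i64 = rows[0].get(0);
    println!("pg_catalog.pg_tables reports {} tables after restore", n);
    Ok(())
}
```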

.finish_backup_job(job_id, BackupJobResult::Failed(e))
.await;
}
let job_result = job.await.unwrap_or_else(BackupJobResult::Failed);
Collaborator


Do we need a way to GC snapshot objects for the failed jobs? Given that the objects are small, to keep it simple, I think exposing a metric and doing the GC manually seems okay to me.


pub enum BackupJobResult {
Finished,
Succeeded,
Collaborator


We need to expose some metrics for backup and restore (e.g. success/fail rate, elapsed time, ...)
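
For illustration, a hedged sketch of the kind of backup metrics being asked for, using the prometheus crate's default registry; the metric names are illustrative, not the ones RisingWave actually registers:

```rust
use prometheus::{register_histogram, register_int_counter, Histogram, IntCounter};

/// Illustrative metrics container; names and granularity are assumptions.
pub struct BackupMetrics {
    pub job_succeeded_count: IntCounter,
    pub job_failed_count: IntCounter,
    pub job_latency_seconds: Histogram,
}

impl BackupMetrics {
    pub fn new() -> Self {
        Self {
            job_succeeded_count: register_int_counter!(
                "backup_job_succeeded_count",
                "Total number of succeeded backup jobs"
            )
            .unwrap(),
            job_failed_count: register_int_counter!(
                "backup_job_failed_count",
                "Total number of failed backup jobs"
            )
            .unwrap(),
            job_latency_seconds: register_histogram!(
                "backup_job_latency_seconds",
                "Time spent on a backup job in seconds"
            )
            .unwrap(),
        }
    }
}
```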

}

async fn restore_metadata<S: MetaStore>(meta_store: S, snapshot: MetaSnapshot) -> BackupResult<()> {
restore_default_cf(&meta_store, &snapshot).await?;
Collaborator


Do we need to guarantee the atomicity of restoring different parts of the metadata? In other words, do we need to wrap restore_metadata_model in a meta transaction?

Contributor


> Do we need to guarantee the atomicity of restoring different parts of the metadata? In other words, do we need to wrap restore_metadata_model in a meta transaction?

Why? I think that while restore_metadata is called, there will not be other tasks running.

Collaborator


I am mainly thinking about simplifying metadata cleanup when restore fails after some of the metadata has been written to etcd. Doing it manually also looks acceptable to me.
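
To make the discussion concrete, a hedged sketch of the restore path in question: models are written back sequentially rather than inside one meta transaction, so a failure midway leaves partially written metadata. The field names on snapshot.metadata are placeholders, not the actual MetaSnapshot layout:

```rust
// Sketch only: `restore_default_cf` and `restore_metadata_model` appear in the PR;
// the specific model fields accessed below are illustrative.
async fn restore_metadata<S: MetaStore>(meta_store: S, snapshot: MetaSnapshot) -> BackupResult<()> {
    restore_default_cf(&meta_store, &snapshot).await?;
    // Each call writes to the meta store independently; there is no rollback if a
    // later step fails, hence the manual-cleanup discussion above.
    restore_metadata_model(&meta_store, &[snapshot.metadata.hummock_version]).await?;
    restore_metadata_model(&meta_store, &snapshot.metadata.compaction_groups).await?;
    // ...remaining metadata models restored one by one...
    Ok(())
}
```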

/// `HummockVersionSafePoint` prevents hummock versions greater than or equal to it from being GCed.
/// It's used by meta node itself to temporarily pin versions.
#[derive(Clone, PartialEq, Eq)]
pub struct HummockVersionSafePoint {
Collaborator


Better to add a metric for the min safe point, so that if the min pinned id is stuck we know whether it is caused by backup or by a CN version pin.
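
A hedged sketch of the suggested gauge; the metric name and wiring are illustrative only, not this PR's code:

```rust
use prometheus::{register_int_gauge, IntGauge};

/// Register a gauge for the minimum backup safe point (name is an assumption).
pub fn min_safe_point_gauge() -> IntGauge {
    register_int_gauge!(
        "backup_min_hummock_version_safe_point",
        "Minimum hummock version id pinned by backup safe points"
    )
    .unwrap()
}

pub fn report_min_safe_point(gauge: &IntGauge, min_safe_point: u64) {
    // IntGauge is i64-backed, so a u64 watermark near u64::MAX shows up negative
    // (the follow-up PR below observes exactly this: a -1 reading).
    gauge.set(min_safe_point as i64);
}
```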

@zwang28 zwang28 added this to the release-0.1.15 milestone Dec 12, 2022
@zwang28 zwang28 added user-facing-changes Contains changes that are visible to users A-meta Area: Meta node. mergify/can-merge labels Dec 12, 2022
@mergify mergify bot merged commit e3c37a8 into main Dec 12, 2022
@mergify mergify bot deleted the wangzheng/br_restore_impl branch December 12, 2022 07:27
mergify bot pushed a commit that referenced this pull request Dec 14, 2022
Add metrics for the backup manager, as a follow-up to #6737.
Also refactor the backup integration test to avoid unnecessary compilation of risectl.

[Screenshot: https://user-images.githubusercontent.com/70626450/207508448-73f9ad49-f1af-4995-9c17-48a6bb52b36a.png]

A u64 watermark such as the min safe point is displayed as -1, because the prometheus IntCounter is backed by i64.
[Screenshot: https://user-images.githubusercontent.com/70626450/207508456-1a5dbbd4-216e-437b-8e72-5fd2710463a1.png]


Approved-By: hzxa21

Co-Authored-By: zwang28 <[email protected]>
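
For context on the -1 rendering mentioned in the follow-up above, a minimal demonstration of the u64-to-i64 wraparound, assuming the unset watermark is u64::MAX:

```rust
fn main() {
    // Prometheus integer metrics are i64-backed; if the "no safe point" watermark
    // is u64::MAX, casting it for the metric yields -1.
    assert_eq!(u64::MAX as i64, -1);
}
```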

Labels

A-meta Area: Meta node. type/feature Type: New feature. user-facing-changes Contains changes that are visible to users
