fix!: change how client heads work to fix ChunkNotFound #465

grgbkr · 2023-04-07T19:04:03Z

Instead of a headHash and tempRefreshHash, we now have: refreshHashes and persistHash.

refreshHashes
refreshHashes is a set of hashes that replaces headHash and tempRefreshHash. This change addresses a problem that occurs if the second perdag write tx in refresh fails, failing to update headHash and cleanup tempRefreshHash.
refreshHashes contains:

The hash of the last commit this client refreshed from its client group (this is the commit it bootstrapped from until it completes its first refresh).
One or more hashes that were added to retain chunks of a commit while it was being refreshed into this client's memdag. (This can be one or more because refresh's cleanup step is a separate transaction and can fail).
Upon refresh completing and successfully running its clean up step, this set will contain a single hash: the hash of the last commit this client refreshed.

persistHash
Is new and most directly resolves the observed ChunkNotFound issue. persistHash is the hash of the last snapshot commit persisted by this client to this client's client group, or null if has never persisted a snapshot.

In persist, when we write the memdag's basesnapshot to the perdag we tell the memdag via call to chunksPersisted that it can move these chunks from its _memOnlyChunks (where chunks can not be evicted) to its _sourceChunksCache (where chunks can be evicted). However, the only thing retaining these chunks in the perdag is the ClientGroup's head hash, which can be updated by another client. This new persistHash will retain these chunks in the case that another client updates the ClientGroup's head hash.

Fixes #434 #30

BREAKING CHANGE: Updates format version from 5 to 6 because the structure of persisted Client(s) in the ClientMap is changed.

vercel · 2023-04-07T19:04:07Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
replicache-docs-trunk	✅ Ready (Inspect)	Visit Preview	💬 1 unresolved	Apr 11, 2023 11:45pm

arv · 2023-04-08T09:17:20Z

General approach looks good. Keeping more things alive prevents things from getting garbage collected behind our backs.

Isn't there still a problem with memdag.chunksPersisted? Maybe not because we keep everything alive?

I'm also worried about us clobbering the Client/ClientGroup write in case there are overlapping refreshes/persist. Let's add asserts that the invariants we had at the start didn't change before we do any writes.

arv · 2023-04-08T09:18:20Z

Also, BREAKING CHANGE so update the commit title.

grgbkr · 2023-04-10T16:54:57Z

General approach looks good. Keeping more things alive prevents things from getting garbage collected behind our backs.

Isn't there still a problem with memdag.chunksPersisted? Maybe not because we keep everything alive?

I believe the call to memdag.chunksPersisted (i.e. moving these chunks to a cache where they can be evicted) is now safe because we retain them in the perdag with persistHash.

I'm also worried about us clobbering the Client/ClientGroup write in case there are overlapping refreshes/persist. Let's add asserts that the invariants we had at the start didn't change before we do any writes.

Can you be more specific about which invariants you think we should assert?

grgbkr · 2023-04-10T19:39:53Z

Also, BREAKING CHANGE so update the commit title.

Yes, I think this actually requires a format version change because the client group gc code asserts that all the clients in the client map are of the version it expects (so the old code asserts V5). I was thinking about making this code laxer so that it just checks the the client has a client group id (all that it needs), but then decided its just safer to do format version changes when we change things like this instead of reasoning about the backwards compat all over the place. I think its best if we can minimize the places where we have to consider old db formats to: 1. idb collect and 2. mutation recovery.

packages/replicache/src/persist/clients.ts

arv · 2023-04-08T08:22:58Z

packages/replicache/src/persist/refresh.ts

+  const result: RefreshResult =
+    await memdag.withSuspendedSourceCacheEvictsAndDeletes(async () => {
      const perdagWriteResult:
        | [


Maybe use a named type here too?

arv · 2023-04-08T08:23:46Z

packages/replicache/src/persist/refresh.ts

+        const refreshHashesSet = new Set<Hash>();
+        client.refreshHashes.forEach(h => refreshHashesSet.add(h));


Suggested change

const refreshHashesSet = new Set<Hash>();

client.refreshHashes.forEach(h => refreshHashesSet.add(h));

const refreshHashesSet = new Set(client.refreshHashes);

arv · 2023-04-08T08:43:08Z

packages/replicache/src/persist/client-group-gc.test.ts

    clientGroupID: 'client-group-1',
  });
-  const client2 = makeClientV5({
+  const client2 = makeClientV6({


This test now has one v5 client and 2 v6 clients... That seems good but maybe copy the test and have a test where all the clients are v6?

arv · 2023-04-08T08:57:30Z

packages/replicache/src/persist/refresh.test.ts

-    await assertClientHeadHash(
-      perdag,
-      clientID,
+    await assertRefreshHashes(perdag, clientID, [


Why isn't the hash removed from the refreshHashes at the end?

arv · 2023-04-08T09:01:16Z

packages/replicache/src/persist/refresh.ts

+    }
+    return undefined;
+  }
+  await setRefreshHashes([result.newPerdagClientHeadHash]);


How do we ensure we do not clobber another refresh?

packages/replicache/src/replicache-mutation-recovery-dd31.test.ts

And a few other code review fixes

to make it clearer where the transactions are

It is confusing to have chunksPersisted on the LazyStore. If it takes a write lock it is better to keep this on the write implementation.

Use the same write transaction for clarity

arv · 2023-04-11T12:55:14Z

Can you be more specific about which invariants you think we should assert?

In persist we have:

withRead(perdag) A
withRead(memdag) B
withWrite(perdag)
- I feel like we need to validate that the things we read in A are still true.
withWrite(memdag)
- And here, maybe check that what we read in B also hold. But if they do not hold we are pretty screwed.

Similar things in refresh.

arv · 2023-04-11T12:56:18Z

packages/replicache/src/dag/lazy-store.ts

-   * Acquires a write lock on the store.
-   */
-  chunksPersisted(chunkHashes: Iterable<Hash>): Promise<void> {
-    return withWrite(this, () => {


Having this write in here is problematic because it hides transactions.

arv · 2023-04-11T12:56:50Z

packages/replicache/src/dag/lazy-store.ts

-        }
-      }
-      this._sourceChunksCache.persisted(chunksToCache);
-    });


Missing commit. In this case the commit would be a no op but it is still confusing.

arv · 2023-04-11T12:57:31Z

packages/replicache/src/persist/persist.ts

-  const [newMemdagMutations, memdagBaseSnapshot] = await withRead(
-    memdag,
-    async memdagRead => {
+  const [newMemdagMutations, memdagBaseSnapshot, gatheredChunks] =


I moved things around to remove a memdag read transaction here

arv · 2023-04-11T12:58:02Z

packages/replicache/src/persist/persist.ts

+  }
+
+  let memdagBaseSnapshotPersisted = false;
+  await withWrite(perdag, async perdagWrite => {


refactor to fold the if into the transaction and reuse code.

arv · 2023-04-11T15:39:18Z

For refresh:

TXA: In the first perdag write tx we write the refreshHashes to new Set([...client.refreshHashes, perdagClientGroupHeadHash])
TXB: Then we do a memdag write tx
TXC: And finally we set or restores the client.refreshHashes

What if we have 2 clients C1 and C2 and we end up with the following sequence:

C1_TXA
C2_TXA
C1_TXB
C2_TXB
C2_TX3
C1_TX3

It looks like TXA can be overlapping without problems.
It looks like TXB can be overlapping without problems.
TXC is not as straight forward but I believe it is fine because all it does is clear the refreshHashes and as long as it comes after TXA all should be fine...

I'm convinced that refresh is fine... I will try the same exercise with persist too

arv · 2023-04-11T16:24:49Z

I'm convinced that refresh is fine... I will try the same exercise with persist too

Actually, I think we will have issues if TXC-C2 happens before TXB-C1

grgbkr · 2023-04-11T16:50:23Z

packages/replicache/src/persist/persist.ts

    });
    if (memdagBaseSnapshotPersisted) {
-      await memdag.chunksPersisted(gatheredChunks.keys());
+      await withWrite(memdag, async perdagWrite => {


memdagWrite

grgbkr · 2023-04-11T17:05:51Z

packages/replicache/src/persist/collect-idb-databases.ts

    // If increase the format version we need to decide how to deal with this
    // logic.
-    assert(db.replicacheFormatVersion === REPLICACHE_FORMAT_VERSION_DD31);
+    assert(db.replicacheFormatVersion >= REPLICACHE_FORMAT_VERSION_DD31);


This should be assert(db.replicacheFormatVersion === REPLICACHE_FORMAT_VERSION_DD31 || db.replicacheFormatVersion === REPLICACHE_FORMAT_VERSION_V6)

So that it fails next time we try to update format version and we reassess if this logic still works.

grgbkr · 2023-04-11T17:09:47Z

packages/replicache/src/persist/collect-idb-databases.test.ts

          name: 'a',
          lastOpenedTimestampMS: 0,
-          replicacheFormatVersion: REPLICACHE_FORMAT_VERSION_DD31,
+          replicacheFormatVersion: REPLICACHE_FORMAT_VERSION_V6,


We should also have a test collecting a idb with format version 5. Since the check actually reads the db's perdag to see if it has pending mutations, I think we should use a perdag snapshot.

actually I don't think this is necessary as the algorithm only looks at the client groups, and this change only effects the client map.

grgbkr · 2023-04-11T17:11:24Z

packages/replicache/src/persist/client-group-gc.test.ts

+  'initClientGroupGC starts 5 min interval that collects client groups that are not referred to by any clients and have no pending mutations',
+  () => {
+    test('v5 and v6', async () => {
+      const client1 = makeClientV5({


Since we are bumping the format version, the client map should never contain a mixture of v5 and v6. We just need the test of all v6, and the assert in client-group-gc.ts can be update form assertClientV5OrV6 to just assertClientV6.

grgbkr · 2023-04-11T17:15:20Z

packages/replicache/src/persist/refresh.ts

  );
  assertSnapshotCommitDD31(memdagBaseSnapshot);

+  type PerdagWriteResult = [


type PerdagWriteResult = [ perdagClientGroupHeadHash: Hash, perdagClientGroupBaseSnapshot: db.Commit<db.SnapshotMetaDD31>, perdagLmid: number, gatheredChunks: ReadonlyMap<Hash, ChunkWithSize>, refreshHashesForRevert: readonly Hash[] ];

fix: change how client heads work to fix ChunkNotFound

be9c6f2

grgbkr requested a review from arv April 7, 2023 19:04

vercel bot had a problem deploying to Preview April 7, 2023 19:04 Failure

fix lint and check-types

16b6478

vercel bot deployed to Preview April 7, 2023 19:12 View deployment

add comments

f354dac

vercel bot deployed to Preview April 7, 2023 21:45 View deployment

comment formatting

7d86780

vercel bot deployed to Preview April 7, 2023 21:47 View deployment

add and update tests

7da174c

vercel bot had a problem deploying to Preview April 8, 2023 05:09 Failure

fix lint

bb8bfbc

vercel bot deployed to Preview April 8, 2023 05:29 View deployment

grgbkr marked this pull request as ready for review April 8, 2023 05:30

arv reviewed Apr 11, 2023

View reviewed changes

chore: Change REPLICACHE_FORMAT_VERSION to 6

d6f8633

And a few other code review fixes

vercel bot deployed to Preview April 11, 2023 09:53 View deployment

Inline gatherMemOnlyChunks

c03ef46

to make it clearer where the transactions are

vercel bot deployed to Preview April 11, 2023 10:04 View deployment

chore: Move chunksPersisted to LazyWrite

2d123d4

It is confusing to have chunksPersisted on the LazyStore. If it takes a write lock it is better to keep this on the write implementation.

vercel bot deployed to Preview April 11, 2023 10:56 View deployment

chore: Refactor to reduce code duplication in persist

bc49a10

Use the same write transaction for clarity

vercel bot deployed to Preview April 11, 2023 12:53 View deployment

arv reviewed Apr 11, 2023

View reviewed changes

grgbkr commented Apr 11, 2023

View reviewed changes

resolve my own review comments and further simplify persist

5c0557b

vercel bot deployed to Preview April 11, 2023 23:45 View deployment

grgbkr changed the title ~~fix: change how client heads work to fix ChunkNotFound~~ fix!: change how client heads work to fix ChunkNotFound Apr 12, 2023

grgbkr merged commit a408c68 into main Apr 12, 2023

grgbkr deleted the grgbkr/chunk-not-found branch April 12, 2023 00:13

		const refreshHashesSet = new Set<Hash>();
		client.refreshHashes.forEach(h => refreshHashesSet.add(h));

	const refreshHashesSet = new Set<Hash>();
	client.refreshHashes.forEach(h => refreshHashesSet.add(h));
	const refreshHashesSet = new Set(client.refreshHashes);

fix!: change how client heads work to fix ChunkNotFound #465

fix!: change how client heads work to fix ChunkNotFound #465

Uh oh!

Conversation

grgbkr commented Apr 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vercel bot commented Apr 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

arv commented Apr 8, 2023

Uh oh!

arv commented Apr 8, 2023

Uh oh!

grgbkr commented Apr 10, 2023

Uh oh!

grgbkr commented Apr 10, 2023

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

arv commented Apr 11, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arv commented Apr 11, 2023

Uh oh!

arv commented Apr 11, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

grgbkr Apr 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

grgbkr commented Apr 7, 2023 •

edited

Loading

vercel bot commented Apr 7, 2023 •

edited

Loading

grgbkr Apr 11, 2023 •

edited

Loading