Skip to content

Conversation

grgbkr
Copy link
Contributor

@grgbkr grgbkr commented Apr 7, 2023

Instead of a headHash and tempRefreshHash, we now have: refreshHashes and persistHash.

refreshHashes
refreshHashes is a set of hashes that replaces headHash and tempRefreshHash. This change addresses a problem that occurs if the second perdag write tx in refresh fails, failing to update headHash and cleanup tempRefreshHash.
refreshHashes contains:

  1. The hash of the last commit this client refreshed from its client group (this is the commit it bootstrapped from until it completes its first refresh).
  2. One or more hashes that were added to retain chunks of a commit while it was being refreshed into this client's memdag. (This can be one or more because refresh's cleanup step is a separate transaction and can fail).
    Upon refresh completing and successfully running its clean up step, this set will contain a single hash: the hash of the last commit this client refreshed.

persistHash
Is new and most directly resolves the observed ChunkNotFound issue. persistHash is the hash of the last snapshot commit persisted by this client to this client's client group, or null if has never persisted a snapshot.

In persist, when we write the memdag's basesnapshot to the perdag we tell the memdag via call to chunksPersisted that it can move these chunks from its _memOnlyChunks (where chunks can not be evicted) to its _sourceChunksCache (where chunks can be evicted). However, the only thing retaining these chunks in the perdag is the ClientGroup's head hash, which can be updated by another client. This new persistHash will retain these chunks in the case that another client updates the ClientGroup's head hash.

Fixes #434 #30

BREAKING CHANGE: Updates format version from 5 to 6 because the structure of persisted Client(s) in the ClientMap is changed.

@grgbkr grgbkr requested a review from arv April 7, 2023 19:04
@vercel
Copy link

vercel bot commented Apr 7, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
replicache-docs-trunk ✅ Ready (Inspect) Visit Preview 💬 1 unresolved Apr 11, 2023 11:45pm

@grgbkr grgbkr marked this pull request as ready for review April 8, 2023 05:30
@arv
Copy link
Contributor

arv commented Apr 8, 2023

General approach looks good. Keeping more things alive prevents things from getting garbage collected behind our backs.

Isn't there still a problem with memdag.chunksPersisted? Maybe not because we keep everything alive?

I'm also worried about us clobbering the Client/ClientGroup write in case there are overlapping refreshes/persist. Let's add asserts that the invariants we had at the start didn't change before we do any writes.

@arv
Copy link
Contributor

arv commented Apr 8, 2023

Also, BREAKING CHANGE so update the commit title.

@grgbkr
Copy link
Contributor Author

grgbkr commented Apr 10, 2023

General approach looks good. Keeping more things alive prevents things from getting garbage collected behind our backs.

Isn't there still a problem with memdag.chunksPersisted? Maybe not because we keep everything alive?

I believe the call to memdag.chunksPersisted (i.e. moving these chunks to a cache where they can be evicted) is now safe because we retain them in the perdag with persistHash.

I'm also worried about us clobbering the Client/ClientGroup write in case there are overlapping refreshes/persist. Let's add asserts that the invariants we had at the start didn't change before we do any writes.

Can you be more specific about which invariants you think we should assert?

@grgbkr
Copy link
Contributor Author

grgbkr commented Apr 10, 2023

Also, BREAKING CHANGE so update the commit title.

Yes, I think this actually requires a format version change because the client group gc code asserts that all the clients in the client map are of the version it expects (so the old code asserts V5). I was thinking about making this code laxer so that it just checks the the client has a client group id (all that it needs), but then decided its just safer to do format version changes when we change things like this instead of reasoning about the backwards compat all over the place. I think its best if we can minimize the places where we have to consider old db formats to: 1. idb collect and 2. mutation recovery.

const result: RefreshResult =
await memdag.withSuspendedSourceCacheEvictsAndDeletes(async () => {
const perdagWriteResult:
| [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use a named type here too?

Comment on lines 132 to 133
const refreshHashesSet = new Set<Hash>();
client.refreshHashes.forEach(h => refreshHashesSet.add(h));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const refreshHashesSet = new Set<Hash>();
client.refreshHashes.forEach(h => refreshHashesSet.add(h));
const refreshHashesSet = new Set(client.refreshHashes);

clientGroupID: 'client-group-1',
});
const client2 = makeClientV5({
const client2 = makeClientV6({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test now has one v5 client and 2 v6 clients... That seems good but maybe copy the test and have a test where all the clients are v6?

await assertClientHeadHash(
perdag,
clientID,
await assertRefreshHashes(perdag, clientID, [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't the hash removed from the refreshHashes at the end?

}
return undefined;
}
await setRefreshHashes([result.newPerdagClientHeadHash]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we ensure we do not clobber another refresh?

And a few other code review fixes
to make it clearer where the transactions are
It is confusing to have chunksPersisted on the LazyStore. If it takes a
write lock it is better to keep this on the write implementation.
Use the same write transaction for clarity
@arv
Copy link
Contributor

arv commented Apr 11, 2023

Can you be more specific about which invariants you think we should assert?

In persist we have:

  • withRead(perdag) A
  • withRead(memdag) B
  • withWrite(perdag)
    • I feel like we need to validate that the things we read in A are still true.
  • withWrite(memdag)
    • And here, maybe check that what we read in B also hold. But if they do not hold we are pretty screwed.

Similar things in refresh.

* Acquires a write lock on the store.
*/
chunksPersisted(chunkHashes: Iterable<Hash>): Promise<void> {
return withWrite(this, () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this write in here is problematic because it hides transactions.

}
}
this._sourceChunksCache.persisted(chunksToCache);
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing commit. In this case the commit would be a no op but it is still confusing.

const [newMemdagMutations, memdagBaseSnapshot] = await withRead(
memdag,
async memdagRead => {
const [newMemdagMutations, memdagBaseSnapshot, gatheredChunks] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved things around to remove a memdag read transaction here

}

let memdagBaseSnapshotPersisted = false;
await withWrite(perdag, async perdagWrite => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactor to fold the if into the transaction and reuse code.

@arv
Copy link
Contributor

arv commented Apr 11, 2023

For refresh:

  • TXA: In the first perdag write tx we write the refreshHashes to new Set([...client.refreshHashes, perdagClientGroupHeadHash])
  • TXB: Then we do a memdag write tx
  • TXC: And finally we set or restores the client.refreshHashes

What if we have 2 clients C1 and C2 and we end up with the following sequence:

  • C1_TXA
  • C2_TXA
  • C1_TXB
  • C2_TXB
  • C2_TX3
  • C1_TX3

It looks like TXA can be overlapping without problems.
It looks like TXB can be overlapping without problems.
TXC is not as straight forward but I believe it is fine because all it does is clear the refreshHashes and as long as it comes after TXA all should be fine...

I'm convinced that refresh is fine... I will try the same exercise with persist too

@arv
Copy link
Contributor

arv commented Apr 11, 2023

I'm convinced that refresh is fine... I will try the same exercise with persist too

Actually, I think we will have issues if TXC-C2 happens before TXB-C1

});
if (memdagBaseSnapshotPersisted) {
await memdag.chunksPersisted(gatheredChunks.keys());
await withWrite(memdag, async perdagWrite => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memdagWrite

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// If increase the format version we need to decide how to deal with this
// logic.
assert(db.replicacheFormatVersion === REPLICACHE_FORMAT_VERSION_DD31);
assert(db.replicacheFormatVersion >= REPLICACHE_FORMAT_VERSION_DD31);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be assert(db.replicacheFormatVersion === REPLICACHE_FORMAT_VERSION_DD31 || db.replicacheFormatVersion === REPLICACHE_FORMAT_VERSION_V6)

So that it fails next time we try to update format version and we reassess if this logic still works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

name: 'a',
lastOpenedTimestampMS: 0,
replicacheFormatVersion: REPLICACHE_FORMAT_VERSION_DD31,
replicacheFormatVersion: REPLICACHE_FORMAT_VERSION_V6,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also have a test collecting a idb with format version 5. Since the check actually reads the db's perdag to see if it has pending mutations, I think we should use a perdag snapshot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I don't think this is necessary as the algorithm only looks at the client groups, and this change only effects the client map.

'initClientGroupGC starts 5 min interval that collects client groups that are not referred to by any clients and have no pending mutations',
() => {
test('v5 and v6', async () => {
const client1 = makeClientV5({
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are bumping the format version, the client map should never contain a mixture of v5 and v6. We just need the test of all v6, and the assert in client-group-gc.ts can be update form assertClientV5OrV6 to just assertClientV6.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

);
assertSnapshotCommitDD31(memdagBaseSnapshot);

type PerdagWriteResult = [
Copy link
Contributor Author

@grgbkr grgbkr Apr 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type PerdagWriteResult = [
  perdagClientGroupHeadHash: Hash,
  perdagClientGroupBaseSnapshot: db.Commit<db.SnapshotMetaDD31>,
  perdagLmid: number,
  gatheredChunks: ReadonlyMap<Hash, ChunkWithSize>,
  refreshHashesForRevert: readonly Hash[]
];

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@grgbkr grgbkr changed the title fix: change how client heads work to fix ChunkNotFound fix!: change how client heads work to fix ChunkNotFound Apr 12, 2023
@grgbkr grgbkr merged commit a408c68 into main Apr 12, 2023
@grgbkr grgbkr deleted the grgbkr/chunk-not-found branch April 12, 2023 00:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ChunkNotFound

2 participants