graphdb: add caching for isPublicNode query #10363
Conversation
Summary of Changes (Gemini Code Assist): I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request introduces a caching mechanism for the `IsPublicNode` query.
Code Review
This pull request introduces caching for the IsPublicNode query to address a significant performance bottleneck in gossip message processing. The changes are well-structured, including the addition of a new cache, configuration options, and cache invalidation logic at various points where a node's public status might change. I've identified a potential race condition in one of the cache invalidation paths and an opportunity to optimize the cache-aside pattern to prevent thundering herds. Overall, this is a valuable improvement.
graph/db/sql_store.go (outdated)

```go
// Store the result in cache.
s.cacheMu.Lock()
_, err = s.publicNodeCache.Put(pubKey, &cachedPublicNode{
	isPublic: isPublic,
})
if err != nil {
	log.Warnf("unable to store node %x in cache: %v", pubKey, err)
}
s.cacheMu.Unlock()

return isPublic, nil
```
There's a potential for a thundering herd problem here. If multiple goroutines call IsPublicNode for the same key that is not in the cache, they will all miss the cache, query the database, and then attempt to write to the cache. To optimize this and prevent redundant database queries, you can re-check the cache after acquiring the write lock. This ensures that only the first goroutine populates the cache, and subsequent ones will use the cached value.
```go
// Store the result in cache. We use a double-checked locking pattern
// here to avoid a thundering herd problem where multiple goroutines
// query the DB for the same key on a cache miss.
s.cacheMu.Lock()
defer s.cacheMu.Unlock()

// Re-check the cache to avoid a race where another goroutine populated
// it between our read-lock release and write-lock acquisition.
if cached, err := s.publicNodeCache.Get(pubKey); err == nil && cached != nil {
	return cached.isPublic, nil
}

_, err = s.publicNodeCache.Put(pubKey, &cachedPublicNode{
	isPublic: isPublic,
})
if err != nil {
	log.Warnf("unable to store node %x in cache: %v", pubKey, err)
}

return isPublic, nil
```
ziggie1984 left a comment
Good idea to cache the call; let's also add a benchmark here.
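For illustration, a minimal sketch of such a benchmark. The helpers `newTestSQLStore` and `testPubKey` are hypothetical stand-ins for whatever test harness the repo provides, not its actual API:

```go
package graphdb

import "testing"

// BenchmarkIsPublicNode measures repeated IsPublicNode lookups for the
// same key. newTestSQLStore and testPubKey are hypothetical helpers:
// the first spins up a SQLite-backed SQLStore for tests, the second
// returns a [33]byte key already present in the graph.
func BenchmarkIsPublicNode(b *testing.B) {
	store := newTestSQLStore(b)
	pubKey := testPubKey(b)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// With caching enabled, only the first iteration should hit
		// the database; the rest are served from the LRU cache.
		if _, err := store.IsPublicNode(pubKey); err != nil {
			b.Fatal(err)
		}
	}
}
```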
graph/db/sql_store.go (outdated)

```go
s.cacheMu.RLock()
cached, err := s.publicNodeCache.Get(pubKey)

switch {
```
I find this logic hard to follow; can we instead do:
```go
// Cache hit - return immediately.
if err == nil && cached != nil {
	return cached.isPublic, nil
}

// Log unexpected errors (anything other than "not found").
if err != nil && !errors.Is(err, cache.ErrElementNotFound) {
	log.Warnf("unexpected error checking node cache: %v", err)
}
```
I cherry-picked f9078e5 from #10356, adding the benchmark for `go test -tags=test_db_sqlite -bench=BenchmarkIsPublicNode -v`.
The difference is very significant with the cache (~10000x faster).
Can you add a release note entry for 20.1?
graph/db/sql_store.go (outdated)

```go
	s.chanCache.remove(chanID)
}

var pubkeys [][33]byte
```
No need to delete here; it's an LRU cache. If a node was public we don't bother, because the pubkey was already announced.
This calls `s.deleteChannels(ctx, db, chanIDsToDelete)`, which deletes the affected channels from the db. There will be a discrepancy between what would normally be returned and what the cache will have.
graph/db/sql_store.go (outdated)

```go
for _, channel := range closedChans {
	s.rejectCache.remove(channel.ChannelID)
	s.chanCache.remove(channel.ChannelID)
	s.removePublicNodeCache(
```
not able to delete
We need to if we want to stay consistent with the db.
graph/db/sql_store.go (outdated)

```go
for _, channel := range removedChans {
	s.rejectCache.remove(channel.ChannelID)
	s.chanCache.remove(channel.ChannelID)
	s.removePublicNodeCache(
```
Same here, cannot be deleted.
See above.
graph/db/sql_store.go (outdated)

```go
//
// NOTE: This can safely be called without holding a lock since the lru is
// thread safe.
func (s *SQLStore) removePublicNodeCache(pubkeys ...[33]byte) {
```
I don't think we need this if we remove all the callsites.
graph/db/sql_store.go (outdated)

```go
	return fmt.Errorf("unable to delete node: %w", err)
}

s.removePublicNodeCache(pubKey)
```
Even here, I'm actually not sure if we should remove it from the cache. It's an LRU cache, so it cycles unused values out; we might still get some info about this node if the gossip is delayed?
Shouldn't we be as cautious as possible? Let's assume initially the node is public. After a `DeleteNode` call, if you call `IsPublicNode` without a cache, you get false, but if we have a cache that wasn't invalidated, then you get true. Wouldn't that be a discrepancy?
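A hypothetical sequence to make the concern concrete (`DeleteNode` here is an illustrative stand-in, not necessarily the store's exact method name):

```go
// demonstrateStaleCache illustrates the discrepancy described above.
// The setup and method names are hypothetical.
func demonstrateStaleCache(ctx context.Context, store *SQLStore,
	pubKey [33]byte) {

	// The node is public, and the answer is now cached.
	isPublic, _ := store.IsPublicNode(pubKey) // isPublic == true

	// The node is removed from the database.
	_ = store.DeleteNode(ctx, pubKey)

	// With a cache that was never invalidated, this still returns
	// true, while a cache-less lookup against the DB would now
	// return false.
	isPublic, _ = store.IsPublicNode(pubKey)
	_ = isPublic
}
```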
I've reconsidered the cache design because I think the main flood of node announcements we are seeing is from public nodes on the network; private node announcements we should only see from direct peers with private channels to us. So the current approach of invalidating the cache on every channel close causes unnecessary churn.

Proposed approach: a public-nodes-only cache. The cache only stores public nodes. Once a node is in the cache, it's considered public (until LRU eviction or restart); see the sketch after this comment.

The minor tradeoff (stale "public" status for nodes that close all channels) is acceptable since the gossip information has already leaked and the network treats these nodes as known.
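A sketch of that flow, reusing the field names already in this PR (`publicNodeCache`, `cacheMu`, `cachedPublicNode`) with a hypothetical `isPublicNodeFromDB` helper standing in for the existing SQL query; this is illustrative, not the final implementation:

```go
// IsPublicNode, public-nodes-only variant (sketch). Presence in the
// cache means the node was observed as public at least once; entries
// only disappear through LRU eviction or a restart.
func (s *SQLStore) IsPublicNode(pubKey [33]byte) (bool, error) {
	s.cacheMu.RLock()
	_, err := s.publicNodeCache.Get(pubKey)
	s.cacheMu.RUnlock()
	if err == nil {
		// Cache hit: the node is known to be public.
		return true, nil
	}

	// Cache miss: fall back to the database.
	isPublic, err := s.isPublicNodeFromDB(pubKey)
	if err != nil {
		return false, err
	}

	// Only public nodes are stored, so a private node that later
	// becomes public is picked up naturally, and no invalidation is
	// needed when channels close.
	if isPublic {
		s.cacheMu.Lock()
		_, _ = s.publicNodeCache.Put(pubKey, &cachedPublicNode{
			isPublic: true,
		})
		s.cacheMu.Unlock()
	}

	return isPublic, nil
}
```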
And also look at this: Line 1726 in 91423ee.

We in LND don't even send periodic node announcements if we do not have public channels, which is another argument to not cache private nodes.
This makes sense. I will drop the cache invalidation and only cache public nodes.
I noticed that unit tests for … So I will update the test to reflect this behavior.
Yeah, I think we can update the tests, but let's maybe wait until either @Roasbeef or @ellemouton agrees with it.
Roasbeef left a comment
Concept ACK.
IIUC we only have this so we don't forward the node ann of an unadvertised node to the broader network. However, once a node is public, its node ann is already effectively "leaked", so we can assume that once public, always public.
```go
// SQL store caches public nodes and once a node is cached as public, it
// stays public until eviction/restart. This test asserts
// public<->private transitions, so it doesn't apply to SQL.
if _, ok := aliceGraph.V1Store.(*SQLStore); ok {
```
Yes, let's update the test case to reflect this new behavior, so we do not special-case it.
Since I'd be adding a cache for kvdb, I will leave this as is for now and remove/update this test once that is done, so as not to block this PR from getting in.
docs/release-notes (outdated)

```
  concurrent-safe.

* [Add caching for](https://github.com/lightningnetwork/lnd/pull/10363) the
  `IsPublicNode` query, which speeds up checks of a node's visibility status.
```
I wonder if we can also add this cache for the KV store. It shouldn't be a big lift, but at least it makes the config value more general, because right now kv db node runners would also think they have this feature available. So either we add clear comments that this is a SQL-only feature, or we also add it to the KV store, which probably isn't a big change and would improve performance there as well.
Sure, will aim for it in a followup, but if it's done before this gets merged that will be fine too. For now I'll update the note.
This commit adds the struct we'll use to cache the node. It also adds the required `Size` method for the lru package. We also add publicNodeCache into the SQLStore and the necessary config for initializing the cache. Additionally, we introduce a new config option, `public-node-cache-size`, which lets us set the cache size. Signed-off-by: Abdullahi Yunus <[email protected]>
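For context, a sketch of what that struct and its `Size` method might look like. This assumes the lru package requires cached values to implement `Size() (uint64, error)`; the exact interface may differ:

```go
// cachedPublicNode is the value stored in publicNodeCache.
type cachedPublicNode struct {
	isPublic bool
}

// Size returns the "weight" of a cache entry. Returning 1 makes the
// LRU capacity a simple count of cached nodes, so the configured
// public-node-cache-size is measured in nodes.
func (c *cachedPublicNode) Size() (uint64, error) {
	return 1, nil
}
```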
In this commit, we first check for the node in our cache before querying the database when determining if a node is public or not.
In this commit we add a benchmark to test the performance of the IsPublicNode query.
In this commit, we add new tests to verify SQL backend behaviour after caching for IsPublicNode is introduced. The tests verify that once a node is public and cached, it remains so until all its public channels are removed and it's evicted from the cache.
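A rough sketch of that behavioral test, using testify's `require`; `newTestSQLStore`, `addPublicNode`, and `removeAllChannels` are hypothetical helpers, not the repo's actual test fixtures:

```go
func TestIsPublicNodeSQLCache(t *testing.T) {
	store := newTestSQLStore(t)
	pubKey := addPublicNode(t, store)

	// The first call populates the cache from the DB.
	isPublic, err := store.IsPublicNode(pubKey)
	require.NoError(t, err)
	require.True(t, isPublic)

	// Removing the node's channels does not invalidate the cache:
	// the SQL store keeps answering "public" until the entry is
	// evicted or the node restarts.
	removeAllChannels(t, store, pubKey)

	isPublic, err = store.IsPublicNode(pubKey)
	require.NoError(t, err)
	require.True(t, isPublic)
}
```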
Please take your time and also implement the kvdb side; we are not going to rush things here, so take the time you need to implement it for both db implementations.
Nice usage of the lru cache! My two cents:

- we should try to fix the SQL query first
- building caching is hard, and it should stay as close to the db CRUDs as possible - for instance, the current caching doesn't have GC logic, which means we may consider removed channels as public
- we don't want to have a second caching system, and should stick to the current one - we already have `nodeScheduler` and should leverage that, similarly to how we cache chan updates in `chanScheduler`. For instance, maybe we can just use `chanCache` to decide whether the node is public or not.
Thank you for also taking a look.

You mean first go with the updated query and then see if it already solves the performance bottleneck? And in the next version introduce the cache if it is not sufficient?

I understand; I was more thinking of making the cache very easy and dumb so that no errors occur. Because if you really want to be fully in line with the db, you basically need to start tracking the number of channels, and given the nature of the lightning network, I was choosing simplicity over complexity.

Ok, I did not think about it, will check this cache out, maybe it already has everything we need.
Thanks for the review @yyforyongyu. I will also look into …
@gemini-code-assist it was proposed to use the `nodeScheduler` as a cache, can you verify if this is possible?
That's a good question, @ziggie1984. …
@Abdulkbk, remember to re-request review from reviewers when ready.
fixes #10337
continues #10356
Change Description
From the issue description:

In this PR, we add caching to the `IsPublicNode` method in the `SQLStore`. Adding a cache will significantly reduce database overhead and accelerate gossip message processing.
Steps to Test
Steps for reviewers to follow to test the change.