
Conversation

@Abdulkbk
Contributor

fixes #10337
continues #10356

Change Description

From the issue description:

CPU profiling reveals that IsPublicNode queries consume 62.36% of total CPU time (21.77s out of 34.91s) during gossip message processing. This represents a critical performance bottleneck in LND's gossip subsystem that significantly impacts node synchronization and network message processing throughput.

In this PR, we add caching to the IsPublicNode method in the SQLStore. Since the IsPublicNode function is called for every node announcement received from the Lightning Network gossip protocol, adding a cache significantly reduces database overhead and accelerates gossip message processing.
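
Conceptually, the change follows a cache-aside pattern: check an in-memory cache keyed by the node's pubkey and only fall back to the SQL query on a miss. The sketch below is illustrative only; it uses a plain mutex-protected map instead of the LRU cache the PR actually adds, and queryDB is a hypothetical stand-in for the existing database lookup.

package graphcache // hypothetical package name for this standalone sketch

import "sync"

// publicNodeCache is a simplified stand-in for the LRU cache added in this
// PR: it maps a node's compressed pubkey to its cached public status.
type publicNodeCache struct {
	mu    sync.RWMutex
	nodes map[[33]byte]bool
}

// newPublicNodeCache returns an empty cache.
func newPublicNodeCache() *publicNodeCache {
	return &publicNodeCache{nodes: make(map[[33]byte]bool)}
}

// isPublicNode consults the cache first and only falls back to queryDB (a
// stand-in for the existing SQL lookup) on a miss, remembering the result
// for subsequent calls.
func (c *publicNodeCache) isPublicNode(pubKey [33]byte,
	queryDB func([33]byte) (bool, error)) (bool, error) {

	c.mu.RLock()
	isPublic, ok := c.nodes[pubKey]
	c.mu.RUnlock()
	if ok {
		return isPublic, nil
	}

	// Cache miss: hit the database once and remember the answer.
	isPublic, err := queryDB(pubKey)
	if err != nil {
		return false, err
	}

	c.mu.Lock()
	c.nodes[pubKey] = isPublic
	c.mu.Unlock()

	return isPublic, nil
}

The actual PR bounds memory with an LRU cache sized by a new config option and, in its first iteration, also invalidates entries when the graph changes; the discussion below later simplifies this to caching public nodes only.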

Steps to Test

Steps for reviewers to follow to test the change.

go test -tags=test_db_sqlite -run TestNodeIsPublic # to ensure we still return the correct status for nodes.

go test -tags=test_db_sqlite -run TestNodeIsPublicCacheInvalidation -v # to test we're correctly invalidating cache

@gemini-code-assist

Summary of Changes

Hello @Abdulkbk, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a caching mechanism for the IsPublicNode query, which was identified as a major CPU bottleneck during Lightning Network gossip message processing. By caching the public status of nodes, the system can avoid repeated database queries, thereby significantly improving performance, reducing database overhead, and accelerating the overall gossip subsystem. The changes include the cache implementation, robust invalidation logic, and new configuration options.

Highlights

  • Performance Improvement: Implemented an LRU cache for the IsPublicNode query within the SQLStore to significantly reduce CPU consumption during gossip message processing, addressing a critical performance bottleneck.
  • Cache Integration: The IsPublicNode method now checks the cache first, retrieving the public status of a node if available, or querying the database and storing the result for future use.
  • Cache Invalidation: Comprehensive cache invalidation logic has been added to ensure data consistency. The publicNodeCache is cleared for affected nodes whenever channel edges are added, deleted, marked as zombie, or when nodes are deleted, or during graph pruning and block disconnections.
  • Configuration Options: Introduced a new DefaultPublicNodeCacheSize constant and a PublicNodeCacheSize field in StoreOptions to allow configuration of the cache's capacity, defaulting to 15,000 entries (a rough sketch follows below).
  • New Test Coverage: A new test, TestNodeIsPublicCacheInvalidation, was added to thoroughly verify that the cache is correctly invalidated across various graph modification operations.
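
For reference, a rough sketch of the configuration surface described in the highlights above; the constant and field names come from this summary, while the exact types and struct layout in the PR are assumed.

// DefaultPublicNodeCacheSize is the default number of entries kept in the
// public node cache.
const DefaultPublicNodeCacheSize = 15000

// StoreOptions holds configurable parameters of the SQL graph store; only
// the cache-related field is shown here.
type StoreOptions struct {
	// PublicNodeCacheSize caps the number of node public-status entries
	// kept in the LRU cache.
	PublicNodeCacheSize int
}

// DefaultStoreOptions returns options with the default cache size.
func DefaultStoreOptions() StoreOptions {
	return StoreOptions{
		PublicNodeCacheSize: DefaultPublicNodeCacheSize,
	}
}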

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces caching for the IsPublicNode query to address a significant performance bottleneck in gossip message processing. The changes are well-structured, including the addition of a new cache, configuration options, and cache invalidation logic at various points where a node's public status might change. I've identified a potential race condition in one of the cache invalidation paths and an opportunity to optimize the cache-aside pattern to prevent thundering herds. Overall, this is a valuable improvement.

Comment on lines 2353 to 2357
// Store the result in cache.
s.cacheMu.Lock()
_, err = s.publicNodeCache.Put(pubKey, &cachedPublicNode{
	isPublic: isPublic,
})
if err != nil {
	log.Warnf("unable to store node %x in cache: %v", pubKey, err)
}

s.cacheMu.Unlock()

return isPublic, nil


medium

There's a potential for a thundering herd problem here. If multiple goroutines call IsPublicNode for the same key that is not in the cache, they will all miss the cache, query the database, and then attempt to write to the cache. To optimize this and prevent redundant database queries, you can re-check the cache after acquiring the write lock. This ensures that only the first goroutine populates the cache, and subsequent ones will use the cached value.

	// Store the result in cache. We use a double-checked locking pattern
	// here to avoid a thundering herd problem where multiple goroutines
	// query the DB for the same key on a cache miss.
	s.cacheMu.Lock()
	defer s.cacheMu.Unlock()

	// Re-check the cache to avoid a race where another goroutine populated
	// it between our read-lock release and write-lock acquisition.
	if cached, err := s.publicNodeCache.Get(pubKey); err == nil && cached != nil {
		return cached.isPublic, nil
	}

	_, err = s.publicNodeCache.Put(pubKey, &cachedPublicNode{
		isPublic: isPublic,
	})
	if err != nil {
		log.Warnf("unable to store node %x in cache: %v", pubKey, err)
	}

	return isPublic, nil
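
For context, another common way to collapse concurrent misses for the same key, without holding the cache mutex across the database query, is golang.org/x/sync/singleflight. This is not what the PR or the suggestion above does; it is only a sketch of the alternative, with fetchFromDB standing in for the real SQL lookup.

package graphcache // hypothetical package name for this standalone sketch

import (
	"encoding/hex"

	"golang.org/x/sync/singleflight"
)

// publicNodeGroup deduplicates concurrent IsPublicNode lookups so that only
// one goroutine per pubkey hits the database at a time.
var publicNodeGroup singleflight.Group

// isPublicNodeDeduped runs fetchFromDB at most once per in-flight key; all
// concurrent callers for the same key share the single result.
func isPublicNodeDeduped(pubKey [33]byte,
	fetchFromDB func() (bool, error)) (bool, error) {

	key := hex.EncodeToString(pubKey[:])
	v, err, _ := publicNodeGroup.Do(key, func() (interface{}, error) {
		return fetchFromDB()
	})
	if err != nil {
		return false, err
	}

	return v.(bool), nil
}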

@Abdulkbk Abdulkbk force-pushed the ispub-cache branch 2 times, most recently from 0a81125 to 85b45d5 on November 12, 2025 16:12
@Roasbeef Roasbeef added this to the v0.20.1 milestone Dec 10, 2025
@saubyk saubyk added this to lnd v0.20 Dec 10, 2025
@saubyk saubyk moved this to In progress in lnd v0.20 Dec 10, 2025
@saubyk saubyk moved this from In progress to In review in lnd v0.20 Dec 10, 2025
@ziggie1984 ziggie1984 self-requested a review December 10, 2025 07:31
Collaborator

@ziggie1984 ziggie1984 left a comment


Good idea to cache the call; let's also add a benchmark here.

s.cacheMu.RLock()
cached, err := s.publicNodeCache.Get(pubKey)

switch {
Collaborator


I find this logic hard to follow; can we instead do:

	// Cache hit - return immediately.
	if err == nil && cached != nil {
		return cached.isPublic, nil
	}

	// Log unexpected errors (anything other than "not found").
	if err != nil && !errors.Is(err, cache.ErrElementNotFound) {
		log.Warnf("unexpected error checking node cache: %v", err)
	}

@Abdulkbk Abdulkbk force-pushed the ispub-cache branch 2 times, most recently from cc98da3 to 8c43ead on December 15, 2025 13:15
@Abdulkbk
Contributor Author

I cherry-picked f9078e5 from #10356, which adds a benchmark for the IsPublicNode call, and below is how it compares:

go test -tags=test_db_sqlite -bench=BenchmarkIsPublicNode -v

Scenario    Iterations (b.N)    Time per op
Cache       10422774            109.9 ns/op
No cache    390                 2991256 ns/op (~3 ms per op)

The difference is very significant with the cache (roughly 27,000x faster).
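
For readers unfamiliar with Go benchmarks, the ns/op figures above come from a standard b.N loop. The PR's actual benchmark runs against the real SQLStore; the standalone sketch below only shows the shape of such a benchmark, using a hypothetical stub store and a simplified IsPublicNode signature.

package graphcache // hypothetical package; this would live in a _test.go file

import "testing"

// publicNodeStore is the minimal surface benchmarked here; the real
// benchmark exercises the actual SQLStore.
type publicNodeStore interface {
	IsPublicNode(pubKey [33]byte) (bool, error)
}

// stubStore is a trivial stand-in so the sketch compiles on its own.
type stubStore struct{}

func (s *stubStore) IsPublicNode(pubKey [33]byte) (bool, error) {
	return true, nil
}

// BenchmarkIsPublicNode measures the per-call cost of IsPublicNode. With a
// cache in front of the database the hot path is an in-memory lookup, which
// is why the cached numbers above sit in the ~100 ns range.
func BenchmarkIsPublicNode(b *testing.B) {
	var store publicNodeStore = &stubStore{}
	var pubKey [33]byte

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := store.IsPublicNode(pubKey); err != nil {
			b.Fatal(err)
		}
	}
}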

@Abdulkbk
Contributor Author

I notice the itests are all failing. Let me check to see why...

@ziggie1984
Collaborator

Can you add a release note entry for v0.20.1?

@ziggie1984 ziggie1984 added back port candidate pr which should be back ported to last major release backport-v0.20.x-branch This label is used to trigger the creation of a backport PR to the branch `v0.20.x-branch`. labels Dec 15, 2025
s.chanCache.remove(chanID)
}

var pubkeys [][33]byte
Collaborator


No need to delete here; it's an LRU cache. If a node was public we don't bother, because the pubkey was already announced.

Contributor Author


This calls s.deleteChannels(ctx, db, chanIDsToDelete) which deletes the affected channels from the db. There will be a discrepancy between what would normally be returned and what the cache will have.

for _, channel := range closedChans {
s.rejectCache.remove(channel.ChannelID)
s.chanCache.remove(channel.ChannelID)
s.removePublicNodeCache(
Collaborator


not able to delete

Contributor Author


We need to if we want to stay consistent with the db.

for _, channel := range removedChans {
s.rejectCache.remove(channel.ChannelID)
s.chanCache.remove(channel.ChannelID)
s.removePublicNodeCache(
Collaborator


same here, cannot be deleted

Contributor Author


See above.

//
// NOTE: This can safely be called without holding a lock since the lru is
// thread safe.
func (s *SQLStore) removePublicNodeCache(pubkeys ...[33]byte) {
Collaborator


I don't think we need this if we remove all the callsites

return fmt.Errorf("unable to delete node: %w", err)
}

s.removePublicNodeCache(pubKey)
Collaborator


Even here, I am actually not sure if we should remove it from the cache. It's an LRU cache, so it cycles unused values out, and we still might get some info for this node if the gossip is delayed?

Contributor Author


Shouldn't we be as cautious as possible? Let's assume initially the node is public. After a DeleteNode call, if you call IsPublicNode without a cache, you get false, but if we have a cache that wasn't invalidated, then you get true. Wouldn't that be a discrepancy?

@Abdulkbk Abdulkbk force-pushed the ispub-cache branch 2 times, most recently from 590351c to bd7587d on December 17, 2025 06:14
@ziggie1984
Collaborator

I've reconsidered the cache design because I think the main flood of node announcements we are seeing is from public nodes on the network. Private node announcements we should only see from direct peers with private channels to us. So the current approach of invalidating the cache on every channel close causes unnecessary churn.

Proposed approach: Public-nodes-only cache

The cache only stores public nodes. Once a node is in the cache, it's considered public (until LRU eviction or restart).

Rationale:

  • Private nodes are rare and limited to direct peers—caching them adds little value
  • Once a node's pubkey is public via gossip, that information has already propagated through the network
  • Brief DB inconsistency (node closes all channels but stays cached as public) is acceptable given Lightning's gossip semantics

Flow:

  1. Cache hit → return public
  2. Cache miss → query DB
    - If public → add to cache
    - If private → don't cache (will query DB next time)
  3. On node deletion → optionally remove from cache (not strictly necessary)

Benefits:

  • Eliminates cache churn for active public nodes (the 99% case)
  • No write-path overhead (no invalidation logic needed)
  • Simpler implementation
  • LRU naturally handles cleanup, and restart clears everything

The minor tradeoff (stale "public" status for nodes that close all channels) is acceptable since the gossip information has already leaked and the network treats these nodes as known.
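
Continuing the simplified map-based sketch from the PR description above, the public-nodes-only flow could look like the following; queryDB again stands in for the SQL lookup, and this is only an illustration of the proposal, not the PR's code.

// isPublicNodePublicOnly caches only positive ("public") results. Private
// nodes are never cached, so they are re-checked against the database on
// every call, and no invalidation is needed when channels close.
func (c *publicNodeCache) isPublicNodePublicOnly(pubKey [33]byte,
	queryDB func([33]byte) (bool, error)) (bool, error) {

	// Cache hit => the node is known to be public.
	c.mu.RLock()
	_, ok := c.nodes[pubKey]
	c.mu.RUnlock()
	if ok {
		return true, nil
	}

	// Cache miss => ask the database.
	isPublic, err := queryDB(pubKey)
	if err != nil {
		return false, err
	}

	// Only remember public nodes; private nodes fall through to the
	// database next time as well.
	if isPublic {
		c.mu.Lock()
		c.nodes[pubKey] = true
		c.mu.Unlock()
	}

	return isPublic, nil
}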

@ziggie1984
Collaborator

ziggie1984 commented Dec 17, 2025

And also look at this:

func (d *AuthenticatedGossiper) retransmitStaleAnns(ctx context.Context,

In LND we don't even send periodic node announcements if we do not have public channels, which is another argument not to cache private nodes.

@Abdulkbk
Contributor Author

The cache only stores public nodes. Once a node is in the cache, it's considered public (until LRU eviction or restart).

This makes sense. I will drop the cache invalidation and only cache public nodes.

@Abdulkbk
Contributor Author

I noticed that unit tests for postgres and sqlite backends are failing. It is expected because TestNodeIsPublic assumes that a node can transition from public to private after its channels are removed, which is now not the case because of the two points highlighted by @ziggie1984:

Once a node's pubkey is public via gossip, that information has already propagated through the network
Brief DB inconsistency (node closes all channels but stays cached as public) is acceptable given Lightning's gossip semantics

So I will update the test to reflect this behavior.

@ziggie1984
Collaborator

ziggie1984 commented Dec 17, 2025

I noticed that unit tests for postgres and sqlite backends are failing. It is expected because TestNodeIsPublic assumes that a node can transition from public to private after its channels are removed, which is now not the case because of the two points highlighted by @ziggie1984:

yeah I think we can update the tests, but let's maybe wait until either @Roasbeef or @ellemouton agrees with it.

Member

@Roasbeef Roasbeef left a comment


Concept ACK.

IIUC we only have this so we don't forward the node ann of an unadvertised node to the broader network. However, once a node is public, its node ann is already effectively "leaked", so we can assume that once public, always public.

// SQL store caches public nodes and once a node is cached as public, it
// stays public until eviction/restart. This test asserts
// public<->private transitions, so it doesn't apply to SQL.
if _, ok := aliceGraph.V1Store.(*SQLStore); ok {
Collaborator

@ziggie1984 ziggie1984 Dec 18, 2025


Yes, let's update the test case to reflect this new behavior so we do not special-case it.

Contributor Author


Since I'll be adding a cache for kvdb, I will leave this as is for now and remove/update this test once that is done, so as not to block this PR from getting in.

concurrent-safe.

* [Add caching for](https://github.com/lightningnetwork/lnd/pull/10363)
`IsPublicNode` query, which speeds up checks of a node's visibility status.
Collaborator


I wonder if we can also add this cache to the KV store. It shouldn't be a big lift, and it would at least make the config value more general, because right now kvdb node runners would also think they have this feature available. So either we add clear comments that this is a SQL-only feature, or we also add it to the KV store, which probably isn't a big change and would improve performance there as well.

Contributor Author


Sure, I will aim for it in a follow-up, but if it's done before this gets merged that will be fine too. For now I'll update the note.

This commit adds the struct we'll use to cache the node. It
also adds the required `Size` method for the lru package.

We also add publicNodeCache into the sqlstore and the
necessary config for initializing the cache.

Additionally, we introduce a new config `public-node-cache-size`
which lets us set the cache size.
Signed-off-by: Abdullahi Yunus <[email protected]>
In this commit, we first check for the node in our cache before
querying the database when determining if a node is public or not.
In this commit we add a benchmark to test the performance of
the IsPublicNode query.
In this commit, we add new tests to verify SQL backend behaviour
after caching for IsPublic is introduced. The tests verify that
once a node is public and cached, it remains so until all its
public channels are removed and it's evicted from the cache.
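
The first commit above mentions a `Size` method required by the lru package. The value type could look roughly like the sketch below; the exact interface the lru package expects is assumed here (a value that reports how many cache slots it occupies).

// cachedPublicNode is the value stored in the publicNodeCache LRU.
type cachedPublicNode struct {
	isPublic bool
}

// Size reports how many cache slots this entry occupies. Returning 1 makes
// the configured capacity behave as a plain entry count (for example the
// default of 15,000 entries).
func (c *cachedPublicNode) Size() (uint64, error) {
	return 1, nil
}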
@ziggie1984
Collaborator

please take your time and also implement the kvdb side, we are not going to rush things here, so take the time you need to implement it for both db implementations.

Member

@yyforyongyu yyforyongyu left a comment


Nice usage of the lru cache! My two cents,

  • we should try to fix the sql query first
  • building a cache is hard, and it should stay as close as possible to the db CRUDs - for instance, the current caching doesn't have GC logic, which means we may consider removed channels as public.
  • we don't want to have a second caching system, and should stick to the current one - we already have nodeScheduler and should leverage that, similarly to how we cache chan updates in chanScheduler. For instance, maybe we can just use chanCache to decide whether the node is public or not.

@ziggie1984
Copy link
Collaborator

Thank you for also taking a look,

we should try to fix the sql query first

You mean first go with the updated query and then see if it already solves the performance bottleneck? And in the next version introduce the cache if it is not sufficient?

building a cache is hard, and it should stay as close as possible to the db CRUDs - for instance, the current caching doesn't have GC logic, which means we may consider removed channels as public.

I understand. I was thinking more of making the cache very simple and dumb so that no errors occur, because if you really want to be fully in line with the db, you basically need to start tracking the number of channels, and given the nature of the lightning network I chose simplicity over complexity.

we don't want to have a second caching system, and should stick to the current one - we already have nodeScheduler and should leverage that, similarly to how we cache chan updates in chanScheduler. For instance, maybe we can just use chanCache to decide whether the node is public or not.

OK, I did not think about it; I will check this cache out, maybe it already has everything we need.

@Abdulkbk
Contributor Author

Thanks for the review @yyforyongyu

we don't want to have a second caching system, and should stick to the current one - we already have nodeScheduler and should leverage that, similarly to how we cache chan updates in chanScheduler. For instance, maybe we can just use chanCache to decide whether the node is public or not.

I will also look into nodeScheduler to see the best way forward.

@ziggie1984
Collaborator

@gemini-code-assist it was proposed to use the nodeScheduler as a cache; can you verify if this is possible?

@gemini-code-assist

That's a good question, @ziggie1984. The nodeScheduler is primarily designed for batching database write operations to improve efficiency when modifying the graph. Its purpose is to schedule a set of operations to be executed in a single batch against the database, rather than serving as a cache for query results.

For caching the IsPublicNode query, we need a mechanism that stores the results of previous lookups (node public status) and allows for quick retrieval. The current publicNodeCache (an LRU cache) is well-suited for this, as it stores key-value pairs (public key to public status) and handles eviction based on usage.

While the nodeScheduler is crucial for optimizing database interactions, it doesn't inherently provide the storage and lookup capabilities required for a read-side cache like the one implemented for IsPublicNode. Therefore, directly using nodeScheduler as a cache for this purpose isn't feasible.

@ziggie1984 ziggie1984 removed back port candidate pr which should be back ported to last major release backport-v0.20.x-branch This label is used to trigger the creation of a backport PR to the branch `v0.20.x-branch`. labels Dec 20, 2025
@ziggie1984 ziggie1984 modified the milestones: v0.20.1, v0.21.0 Dec 20, 2025
@ziggie1984 ziggie1984 added this to v0.21 Dec 20, 2025
@ziggie1984 ziggie1984 removed this from lnd v0.20 Dec 20, 2025
@lightninglabs-deploy

@Abdulkbk, remember to re-request review from reviewers when ready
