Skip to content

Slow head sync in tinny devnet #8289

@dapplion

Description

@dapplion

Testing non-finalized checkpoint sync I synced a node for 23000 slots and it took 30-45 minutes. The network has a state of 96 validators and uses mock EL. There's no excuse for this speed, something is very wrong.

I did not had time to fully debug yet, this are the pieces I collected:

Empty queue but some jobs taking +10 seconds, TODO: what are they doing?

beacon_processor_worker_time_bucket{type="chain_segment",le="0.005"} 14
beacon_processor_worker_time_bucket{type="chain_segment",le="0.01"} 14
beacon_processor_worker_time_bucket{type="chain_segment",le="0.025"} 14
beacon_processor_worker_time_bucket{type="chain_segment",le="0.05"} 14
beacon_processor_worker_time_bucket{type="chain_segment",le="0.1"} 14
beacon_processor_worker_time_bucket{type="chain_segment",le="0.25"} 14
beacon_processor_worker_time_bucket{type="chain_segment",le="0.5"} 211
beacon_processor_worker_time_bucket{type="chain_segment",le="1"} 337
beacon_processor_worker_time_bucket{type="chain_segment",le="2.5"} 420
beacon_processor_worker_time_bucket{type="chain_segment",le="5"} 445
beacon_processor_worker_time_bucket{type="chain_segment",le="10"} 451
beacon_processor_worker_time_bucket{type="chain_segment",le="+Inf"} 505

beacon_processor_work_event_queue_length_sum{type="chain_segment"} 0
beacon_processor_queue_time_sum{work_type="chain_segment"} 0.0011878449999999999

# TYPE sync_range_chains_dropped_blocks_total counter
sync_range_chains_dropped_blocks_total{range_type="Head"} 101

sync_range_chains_ignored_blocks_total{chain_type="Head"} 394

sync_range_chain_batch_awaiting_processing_seconds_bucket{le="0.05"} 39
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="0.1"} 39
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="0.2"} 39
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="0.5"} 48
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="1"} 59
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="2"} 97
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="5"} 227
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="10"} 301
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="20"} 363
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="+Inf"} 727
sync_range_chain_batch_awaiting_processing_seconds_sum 16723.573466162004
sync_range_chain_batch_awaiting_processing_seconds_count 727

Head sync appeared to be inefficient when rotating into a new chain as peers are re-statused. The longest processing time I saw for a batch (90 seconds) aligned with one of these rotations. Chain id 1984 was removed because its peers where re-statused and added to a new chain 20862.

Oct 25 13:16:55.218 DEBUG Batch sending for process                     blocks: 17, id: RangeBatchId(1984, Epoch(650)), chain_id: 1984, start_epoch: 20, target_head_slot: 22839, target_head_root: 0x7a0f79a9213cdb22f32c1256bc35d0b73831bf7a9bd2394bff0e632930501c63, chain_type: Head

Oct 25 13:16:56.935 DEBUG Batch sending for process                     blocks: 13, id: RangeBatchId(1984, Epoch(651)), chain_id: 1984, start_epoch: 20, target_head_slot: 22839, target_head_root: 0x7a0f79a9213cdb22f32c1256bc35d0b73831bf7a9bd2394bff0e632930501c63, chain_type: Head

Oct 25 13:18:42.043 DEBUG Chain removed                                 id: 1984, sync_type: Head, reason: EmptyPeerPool, op: "peer removed"

Oct 25 13:18:42.200 DEBUG Batch sending for process                     blocks: 17, id: RangeBatchId(2141, Epoch(650)), chain_id: 2141, start_epoch: 20, target_head_slot: 22939, target_head_root: 0xd6d0eb6ccdbea4522307e282468843105a1074faa001f4a87c10adbb5fcb0680, chain_type: Head

Oct 25 13:18:42.226 DEBUG Batch processed                               batch_epoch: 651, first_block_slot: 20832, chain: 1984, last_block_slot: 20862, processed_blocks: 13, service: "sync", process_id: RangeBatchId(1984, Epoch(651)), downloaded_blocks: 13

Oct 25 13:18:42.323 DEBUG Batch sending for process                     blocks: 13, id: RangeBatchId(2141, Epoch(651)), chain_id: 2141, start_epoch: 20, target_head_slot: 22939, target_head_root: 0xd6d0eb6ccdbea4522307e282468843105a1074faa001f4a87c10adbb5fcb0680, chain_type: Head

From the metrics there are a lot of dropped blocks when doing the rotations, and the wait times are very long too

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingsyncing

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions