Description
Testing non-finalized checkpoint sync, I synced a node over 23000 slots and it took 30-45 minutes. The network has a state with 96 validators and uses a mock EL. There's no excuse for this speed; something is very wrong.
I haven't had time to fully debug yet; these are the pieces I've collected:
Empty queue but some jobs taking 10+ seconds (see the quick breakdown after the metrics below), TODO: what are they doing?
beacon_processor_worker_time_bucket{type="chain_segment",le="0.005"} 14
beacon_processor_worker_time_bucket{type="chain_segment",le="0.01"} 14
beacon_processor_worker_time_bucket{type="chain_segment",le="0.025"} 14
beacon_processor_worker_time_bucket{type="chain_segment",le="0.05"} 14
beacon_processor_worker_time_bucket{type="chain_segment",le="0.1"} 14
beacon_processor_worker_time_bucket{type="chain_segment",le="0.25"} 14
beacon_processor_worker_time_bucket{type="chain_segment",le="0.5"} 211
beacon_processor_worker_time_bucket{type="chain_segment",le="1"} 337
beacon_processor_worker_time_bucket{type="chain_segment",le="2.5"} 420
beacon_processor_worker_time_bucket{type="chain_segment",le="5"} 445
beacon_processor_worker_time_bucket{type="chain_segment",le="10"} 451
beacon_processor_worker_time_bucket{type="chain_segment",le="+Inf"} 505
beacon_processor_work_event_queue_length_sum{type="chain_segment"} 0
beacon_processor_queue_time_sum{work_type="chain_segment"} 0.0011878449999999999
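For a rough sense of the tail: the buckets above are cumulative, so differencing them gives per-interval counts, and 505 - 451 = 54 chain_segment jobs took longer than 10 seconds. A minimal sketch of that arithmetic (bucket values hardcoded from the dump above, purely illustrative, not how the beacon_processor records anything):

```rust
// Back-of-the-envelope: turn the cumulative chain_segment worker-time
// histogram into per-interval counts. Values are copied from the metrics
// dump above; nothing here is parsed from Prometheus.
fn main() {
    let cumulative: &[(&str, u64)] = &[
        ("<= 0.25s", 14),
        ("<= 0.5s", 211),
        ("<= 1s", 337),
        ("<= 2.5s", 420),
        ("<= 5s", 445),
        ("<= 10s", 451),
        ("+Inf", 505),
    ];
    let mut prev = 0;
    for &(label, count) in cumulative {
        println!("{:>9}: {:>3} jobs", label, count - prev);
        prev = count;
    }
    // Last line prints "+Inf:  54 jobs", i.e. 54 of 505 chain_segment jobs
    // took longer than 10 seconds even though the queue itself was empty.
}
```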
# TYPE sync_range_chains_dropped_blocks_total counter
sync_range_chains_dropped_blocks_total{range_type="Head"} 101
sync_range_chains_ignored_blocks_total{chain_type="Head"} 394
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="0.05"} 39
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="0.1"} 39
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="0.2"} 39
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="0.5"} 48
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="1"} 59
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="2"} 97
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="5"} 227
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="10"} 301
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="20"} 363
sync_range_chain_batch_awaiting_processing_seconds_bucket{le="+Inf"} 727
sync_range_chain_batch_awaiting_processing_seconds_sum 16723.573466162004
sync_range_chain_batch_awaiting_processing_seconds_count 727
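Same kind of quick arithmetic on the awaiting-processing histogram (values hardcoded from the dump above, only to make the scale explicit): the mean wait is 16723.57 / 727 ≈ 23 seconds, and 727 - 363 = 364 batches, roughly half, waited more than 20 seconds before processing.

```rust
// Quick arithmetic on sync_range_chain_batch_awaiting_processing_seconds,
// using the sum/count and the <= 20s bucket copied from above.
fn main() {
    let sum_seconds = 16723.573466162004_f64;
    let count = 727.0_f64;
    println!("mean wait: {:.1}s", sum_seconds / count); // ~23.0s

    let total = 727_u64;
    let under_20s = 363_u64; // cumulative count of the le="20" bucket
    let over_20s = total - under_20s;
    println!(
        "batches waiting > 20s: {} of {} (~{:.0}%)",
        over_20s,
        total,
        100.0 * over_20s as f64 / total as f64
    );
}
```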
Head sync appeared to be inefficient when rotating into a new chain as peers are re-statused. The longest processing time I saw for a batch (90 seconds) aligned with one of these rotations. Chain id 1984 was removed because its peers were re-statused, and they were added to a new chain, 2141.
Oct 25 13:16:55.218 DEBUG Batch sending for process blocks: 17, id: RangeBatchId(1984, Epoch(650)), chain_id: 1984, start_epoch: 20, target_head_slot: 22839, target_head_root: 0x7a0f79a9213cdb22f32c1256bc35d0b73831bf7a9bd2394bff0e632930501c63, chain_type: Head
Oct 25 13:16:56.935 DEBUG Batch sending for process blocks: 13, id: RangeBatchId(1984, Epoch(651)), chain_id: 1984, start_epoch: 20, target_head_slot: 22839, target_head_root: 0x7a0f79a9213cdb22f32c1256bc35d0b73831bf7a9bd2394bff0e632930501c63, chain_type: Head
Oct 25 13:18:42.043 DEBUG Chain removed id: 1984, sync_type: Head, reason: EmptyPeerPool, op: "peer removed"
Oct 25 13:18:42.200 DEBUG Batch sending for process blocks: 17, id: RangeBatchId(2141, Epoch(650)), chain_id: 2141, start_epoch: 20, target_head_slot: 22939, target_head_root: 0xd6d0eb6ccdbea4522307e282468843105a1074faa001f4a87c10adbb5fcb0680, chain_type: Head
Oct 25 13:18:42.226 DEBUG Batch processed batch_epoch: 651, first_block_slot: 20832, chain: 1984, last_block_slot: 20862, processed_blocks: 13, service: "sync", process_id: RangeBatchId(1984, Epoch(651)), downloaded_blocks: 13
Oct 25 13:18:42.323 DEBUG Batch sending for process blocks: 13, id: RangeBatchId(2141, Epoch(651)), chain_id: 2141, start_epoch: 20, target_head_slot: 22939, target_head_root: 0xd6d0eb6ccdbea4522307e282468843105a1074faa001f4a87c10adbb5fcb0680, chain_type: Head
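For context, a rough check on the timestamps above: the RangeBatchId(1984, Epoch(651)) batch was sent for processing at 13:16:56.935 and only reported processed at 13:18:42.226, about 105 seconds later, which brackets that rotation. The `secs` helper below is a hypothetical one-off just to do the subtraction:

```rust
// Sketch: send -> processed gap for RangeBatchId(1984, Epoch(651)),
// using the two timestamps from the logs above. `secs` is a throwaway
// helper for "HH:MM:SS.mmm" within the same day, not Lighthouse code.
fn secs(t: &str) -> f64 {
    let p: Vec<f64> = t.split(':').map(|s| s.parse().unwrap()).collect();
    p[0] * 3600.0 + p[1] * 60.0 + p[2]
}

fn main() {
    let sent = secs("13:16:56.935");
    let processed = secs("13:18:42.226");
    println!("send -> processed: {:.1}s", processed - sent); // ~105.3s
}
```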
From the metrics, a lot of blocks are dropped during these rotations, and the batch wait times are very long as well.