-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Description
Component(s)
extension/storage/filestorage
What happened?
Description
When there are issues with underlying a linux guest storage, mostly in scenarios where the hypervisor loses connectivity or experiences extremely high latency with the NAS/SAN, the database can be in a corrupted state, which requires completely deleting the filesstorage queue, to allow otel to function (or even start running) again.
Steps to Reproduce
While a VM is running, try to remove it's storage beneath it, potentially in the middle of otel writing to the bolt filestorage.
This may also be caused by machines being improperly shut down (like power loss.)
Expected Result
Some data will be lost, but the filestorage extension will recover gracefully and be able to resume running.
Actual Result
The otelcol service attempts to start, but fails to do so when loading the filestorage extension. This requires manual intervention to delete the specific filestorage file that was corrupted, and then to start the service.
Collector version
0.106.1
Environment information
Environment
OS: Ubuntu 22.04
Compiler(if manually compiled): N/A, using https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.106.1/otelcol-contrib_0.106.1_linux_amd64.deb
OpenTelemetry Collector configuration
# This is a significantly reduced config (removed additional log receivers, filters, etc.) but all the required components are here, and all the items related to the file_storage extension, queues, etc. is untouched.
# ---------------------------------------------------- #
# --------------- EXTENSIONS ------------------------ #
# ---------------------------------------------------- #
extensions:
file_storage/fingerprint:
directory: /var/lib/otelcol/fingerprint
timeout: 10s
compaction:
on_rebound: true
directory: /data/otelcol/compaction
check_interval: 5m
rebound_trigger_threshold_mib: 50
rebound_needed_threshold_mib: 500
file_storage/persistent_queue_storage:
directory: /data/otelcol/persistent_queue_storage
timeout: 10s
compaction:
on_rebound: true
directory: /data/otelcol/compaction
check_interval: 5m
rebound_trigger_threshold_mib: 50
rebound_needed_threshold_mib: 500
# ---------------------------------------------------- #
# ------------- RECEIVERS --------------------------- #
# ---------------------------------------------------- #
receivers:
filelog/syslog:
include: "/var/log/syslog"
start_at: beginning
include_file_path: true
max_log_size: 100MiB
storage: file_storage/fingerprint
prometheus:
trim_metric_suffixes: true
config:
scrape_configs:
- job_name: 'prometheus-node-exporter'
scrape_interval: 1m
static_configs:
- targets: ['127.0.0.1:9100']
# ---------------------------------------------------- #
# ------------ PROCESSORS --------------------------- #
# ---------------------------------------------------- #
processors:
batch:
send_batch_size: 10000
timeout: 5s
# ---------------------------------------------------- #
# ------------- EXPORTERS --------------------------- #
# ---------------------------------------------------- #
exporters:
logging: null
# from all normal logs
otlphttp/general:
endpoint: https://foo.bar.com:443
timeout: 10s
retry_on_failure:
enabled: true
initial_interval: 10s
max_interval: 60s
max_elapsed_time: 10m
sending_queue:
enabled: true
storage: file_storage/persistent_queue_storage
queue_size: 1000000000
num_consumers: 2
compression: gzip
tls:
insecure: false
# ---------------------------------------------------- #
# --------------- SERVICE --------------------------- #
# ---------------------------------------------------- #
service:
extensions:
- "file_storage/fingerprint"
- "file_storage/persistent_queue_storage"
telemetry:
logs:
encoding: "json"
disable_caller: true
pipelines:
logs/syslog:
receivers:
- "filelog/syslog"
processors:
- "batch"
exporters:
- "otlphttp/general"
metrics:
receivers:
- "prometheus"
processors:
- "batch"
exporters:
- "otlphttp/general"
Log output
Oct 21 17:46:01 gw-0210027-02 systemd[1]: Started OpenTelemetry Collector Contrib.
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: time="2024-10-21T17:46:01+01:00" level=warning msg="DBUS_SESSION_BUS_ADDRESS envvar looks to be not set, this can lead to runaway dbus-daemon processes. T>
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5634177,"msg":"Setting up own telemetry..."}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.563445,"msg":"OpenCensus bridge is disabled for Collector telemetry and will be removed in a future version, use --feature>
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5635238,"msg":"Serving metrics","address":":9103","metrics level":"Normal"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5654778,"msg":"Starting otelcol-contrib...","Version":"0.106.1","NumCPU":6}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5654962,"msg":"Starting extensions..."}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5655017,"msg":"Extension is starting...","kind":"extension","name":"file_storage/persistent_queue_storage"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5655134,"msg":"Extension started.","kind":"extension","name":"file_storage/persistent_queue_storage"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.565517,"msg":"Extension is starting...","kind":"extension","name":"file_storage/fingerprint"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5655217,"msg":"Extension started.","kind":"extension","name":"file_storage/fingerprint"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: panic: assertion failed: Page expected to be: 242, but self identifies as 8316213599780548146
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: goroutine 1 [running]:
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt._assert(...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/[email protected]/db.go:1387
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*page).fastCheck(0x7f795bd4c000, 0xf2)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/[email protected]/page.go:57 +0x1d9
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).page(0xfd3f188?, 0x7412180?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/[email protected]/tx.go:534 +0x7b
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).forEachPageInternal(0xc002450540, {0xc00225b3b0, 0x2, 0xa}, 0xc00353e020)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/[email protected]/tx.go:546 +0x5d
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).forEachPageInternal(0xc002450540, {0xc00225b3b0, 0x1, 0xa}, 0xc00353e020)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/[email protected]/tx.go:555 +0xc8
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).forEachPage(...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/[email protected]/tx.go:542
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).checkBucket(0xc002450540, 0xc00187db80, 0xc00353e428, 0xc00353e368, {0x9f50fa8, 0x10095f80}, 0xc002259b00)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/[email protected]/tx_check.go:83 +0x114
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).checkBucket.func2({0x7f795bca6020?, 0xc00225b360?, 0xc00347fa78?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/[email protected]/tx_check.go:110 +0x90
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Bucket).ForEachBucket(0x0?, 0xc00353e220)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/[email protected]/bucket.go:411 +0x96
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).checkBucket(0xc002450540, 0xc002450558, 0xc00353e428, 0xc00353e368, {0x9f50fa8, 0x10095f80}, 0xc002259b00)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/[email protected]/tx_check.go:108 +0x255
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*DB).freepages(0xc002253448)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/[email protected]/db.go:1205 +0x225
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*DB).loadFreelist.func1()
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/[email protected]/db.go:417 +0xc5
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: sync.(*Once).doSlow(0x3fc8bc0?, 0xc002253610?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: sync/once.go:74 +0xc2
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: sync.(*Once).Do(...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: sync/once.go:65
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*DB).loadFreelist(0xc002253448?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/[email protected]/db.go:413 +0x45
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.Open({0xc00225b2c0, 0x45}, 0x180, 0xc00353e650)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/[email protected]/db.go:295 +0x430
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/filestorage.newClient(0xc001fbab00, {0xc00225b2c0?, 0xc00353e7c0?}, 0x2540be400, 0xc003315600, 0xc0?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/[email protected]/client.go:53 +0xaf
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/filestorage.(*localFileStorage).GetClient(0xc001bfd980, {0x0?, 0x0?}, 0x0?, {{{0xc003349010?, 0x0?}}, {0xc003349019?, 0x0?}}, {0x8d9233f, 0x4})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/[email protected]/extension.go:61 +0x3c5
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/internal/queue.toStorageClient({0x9f97628, 0x10095f80}, {{{0xc003243da0, 0xc}}, {0xc003243dad, 0x18}}, {0x9f40c68?, 0xc0034ae070?}, {{{0xc003349010, 0x8}}, ...}, ...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/[email protected]/internal/queue/persistent_queue.go:499 +0x106
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/internal/queue.(*persistentQueue[...]).Start(0xa02c220, {0x9f97628, 0x10095f80}, {0x9f40c68, 0xc0034ae070?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/[email protected]/internal/queue/persistent_queue.go:107 +0xa9
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start(0x9ee6610, {0x9f97628, 0x10095f80?}, {0x9f40c68?, 0xc0034ae070?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:31 +0x4c
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).Start(0xc0033680b0, {0x9f97628?, 0x10095f80?}, {0x9f40c68?, 0xc0034ae070?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:103 +0x65
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).Start(0xc00337a160, {0x9f97628, 0x10095f80}, {0x9f40c68, 0xc0034ae070})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:322 +0xbf
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/service/internal/graph.(*Graph).StartAll(0xc00336a180, {0x9f97628, 0x10095f80}, {0x9f40c68, 0xc0034ae070}, {0x9f71d50, 0xc001ecd110})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/[email protected]/internal/graph/graph.go:425 +0x1cd
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/service.(*Service).Start(0xc003442000, {0x9f97628, 0x10095f80})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/[email protected]/service.go:217 +0x309
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents(0xc002dd97a0, {0x9f97628, 0x10095f80})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/[email protected]/collector.go:218 +0x849
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/otelcol.(*Collector).Run(0xc002dd97a0, {0x9f97628, 0x10095f80})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/[email protected]/collector.go:272 +0x55
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/otelcol.NewCommand.func1(0xc00244fb08, {0x8d9a719?, 0x7?, 0x8d9255f?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/[email protected]/command.go:37 +0xa7
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/spf13/cobra.(*Command).execute(0xc00244fb08, {0xc0001d4010, 0x1, 0x1})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/spf13/[email protected]/command.go:985 +0xaca
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/spf13/cobra.(*Command).ExecuteC(0xc00244fb08)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/spf13/[email protected]/command.go:1117 +0x3ff
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/spf13/cobra.(*Command).Execute(0x916ac28?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/spf13/[email protected]/command.go:1041 +0x13
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: main.runInteractive({0x916ac28, {{0x8dc9f6c, 0xf}, {0x8e71b88, 0x1f}, {0x8d9a5a6, 0x7}}, 0x0, {{{0x0, 0x0, ...}, ...}}, ...})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/open-telemetry/opentelemetry-collector-releases/contrib/main.go:53 +0x5d
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: main.run(...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/open-telemetry/opentelemetry-collector-releases/contrib/main_others.go:10
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: main.main()
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/open-telemetry/opentelemetry-collector-releases/contrib/main.go:46 +0x205
Oct 21 17:46:01 gw-0210027-02 systemd[1]: otelcol-contrib.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 21 17:46:01 gw-0210027-02 systemd[1]: otelcol-contrib.service: Failed with result 'exit-code'.
Oct 21 17:46:01 gw-0210027-02 systemd[1]: otelcol-contrib.service: Scheduled restart job, restart counter is at 5.
Oct 21 17:46:01 gw-0210027-02 systemd[1]: Stopped OpenTelemetry Collector Contrib.
Additional context
No response