Skip to content

Conversation

ArthurSens
Copy link
Member

@ArthurSens ArthurSens commented Jan 31, 2025

I was taking a look over #20875 and hoping to finish it.
Fixes #19363
Fixes #24399
Fixes #15277


As mentioned in #24399 (comment), I used a library to help me understand how the deadlock was happening. (1st commit). It showed that persistToWal was trying to acquire the lock, while readPrompbFromWal held it forever.

I changed the strategy here and instead of using fs.Notify, and all that complicated logic around it, we're just using a pub/sub strategy between the writer and reader Go routines.

The reader go routine, once finding an empty WAL, will now release the lock immediately and wait for a notification from the writer. While previously it would hold the lock while waiting for a write that would never happen.

Signed-off-by: Arthur Silva Sens <[email protected]>
Signed-off-by: Arthur Silva Sens <[email protected]>
Signed-off-by: Arthur Silva Sens <[email protected]>
@ArthurSens ArthurSens marked this pull request as ready for review February 2, 2025 15:25
@ArthurSens ArthurSens requested a review from a team as a code owner February 2, 2025 15:25
Signed-off-by: Arthur Silva Sens <[email protected]>
@dashpole dashpole added the ready to merge Code review completed; ready to merge by maintainers label Feb 4, 2025
@songy23 songy23 merged commit 7f581ca into open-telemetry:main Feb 4, 2025
174 checks passed
@github-actions github-actions bot added this to the next release milestone Feb 4, 2025
@ArthurSens ArthurSens deleted the prwe-wal-deadlock branch February 4, 2025 16:13
yiquanzhou added a commit to dash0hq/opentelemetry-collector-contrib that referenced this pull request Feb 5, 2025
* main: (392 commits)
  fix(deps): update module golang.org/x/text to v0.22.0 (open-telemetry#37686)
  [exporter/bmchelix] Second PR of New component: BMC Helix Exporter (open-telemetry#37350)
  chore(deps): update otel/opentelemetry-collector-contrib docker tag to v0.119.0 (open-telemetry#37688)
  [chore] fix codeowners allowlist (open-telemetry#37684)
  chore(deps): update otel/opentelemetry-collector docker tag to v0.119.0 (open-telemetry#37687)
  Update jpkroehling's affiliation (open-telemetry#37683)
  fix(deps): update module github.com/clickhouse/clickhouse-go/v2 to v2.30.3 (open-telemetry#37655)
  fix(deps): update all opentelemetry collector contrib packages to v0.119.0 (open-telemetry#37666)
  fix(deps): update module github.com/elastic/go-docappender/v2 to v2.4.0 (open-telemetry#37667)
  fix(deps): update all golang.org/x packages (open-telemetry#37680)
  [exporter/prometheusremotewrite] Fix WAL deadlock (open-telemetry#37630)
  fix(deps): update opentelemetry-go monorepo (open-telemetry#37673)
  fix(deps): update module github.com/shirou/gopsutil/v4 to v4.25.1 (open-telemetry#37671)
  fix(deps): update module github.com/spf13/pflag to v1.0.6 (open-telemetry#37658)
  fix(deps): update all github.com/aws packages (open-telemetry#37661)
  [chore] Prepare release 0.119.0 (open-telemetry#37660)
  make update-otel OTEL_VERSION=v0.119.0 OTEL_STABLE_VERSION=v1.25.0 (open-telemetry#37656)
  add documentation and warning log for deprecating AccessTokenPassthrough (open-telemetry#37575)
  chore: add myself, echlebek, as a codeowner (open-telemetry#37650)
  [processor/transform] Add support for flat configuration style (open-telemetry#37444)
  ...
yiquanzhou added a commit to dash0hq/opentelemetry-collector-contrib that referenced this pull request Feb 5, 2025
* main: (392 commits)
  fix(deps): update module golang.org/x/text to v0.22.0 (open-telemetry#37686)
  [exporter/bmchelix] Second PR of New component: BMC Helix Exporter (open-telemetry#37350)
  chore(deps): update otel/opentelemetry-collector-contrib docker tag to v0.119.0 (open-telemetry#37688)
  [chore] fix codeowners allowlist (open-telemetry#37684)
  chore(deps): update otel/opentelemetry-collector docker tag to v0.119.0 (open-telemetry#37687)
  Update jpkroehling's affiliation (open-telemetry#37683)
  fix(deps): update module github.com/clickhouse/clickhouse-go/v2 to v2.30.3 (open-telemetry#37655)
  fix(deps): update all opentelemetry collector contrib packages to v0.119.0 (open-telemetry#37666)
  fix(deps): update module github.com/elastic/go-docappender/v2 to v2.4.0 (open-telemetry#37667)
  fix(deps): update all golang.org/x packages (open-telemetry#37680)
  [exporter/prometheusremotewrite] Fix WAL deadlock (open-telemetry#37630)
  fix(deps): update opentelemetry-go monorepo (open-telemetry#37673)
  fix(deps): update module github.com/shirou/gopsutil/v4 to v4.25.1 (open-telemetry#37671)
  fix(deps): update module github.com/spf13/pflag to v1.0.6 (open-telemetry#37658)
  fix(deps): update all github.com/aws packages (open-telemetry#37661)
  [chore] Prepare release 0.119.0 (open-telemetry#37660)
  make update-otel OTEL_VERSION=v0.119.0 OTEL_STABLE_VERSION=v1.25.0 (open-telemetry#37656)
  add documentation and warning log for deprecating AccessTokenPassthrough (open-telemetry#37575)
  chore: add myself, echlebek, as a codeowner (open-telemetry#37650)
  [processor/transform] Add support for flat configuration style (open-telemetry#37444)
  ...
chengchuanpeng pushed a commit to chengchuanpeng/opentelemetry-collector-contrib that referenced this pull request Feb 8, 2025
I was taking a look over open-telemetry#20875 and hoping to finish it.
Fixes open-telemetry#19363
Fixes open-telemetry#24399
Fixes open-telemetry#15277 

---

As mentioned in
open-telemetry#24399 (comment),
I used a library to help me understand how the deadlock was happening.
(1st commit). It showed that `persistToWal` was trying to acquire the
lock, while `readPrompbFromWal` held it forever.

I changed the strategy here and instead of using fs.Notify, and all that
complicated logic around it, we're just using a pub/sub strategy between
the writer and reader Go routines.

The reader go routine, once finding an empty WAL, will now release the
lock immediately and wait for a notification from the writer. While
previously it would hold the lock while waiting for a write that would
never happen.

---------

Signed-off-by: Arthur Silva Sens <[email protected]>
zeck-ops pushed a commit to zeck-ops/opentelemetry-collector-contrib that referenced this pull request Apr 23, 2025
I was taking a look over open-telemetry#20875 and hoping to finish it.
Fixes open-telemetry#19363
Fixes open-telemetry#24399
Fixes open-telemetry#15277 

---

As mentioned in
open-telemetry#24399 (comment),
I used a library to help me understand how the deadlock was happening.
(1st commit). It showed that `persistToWal` was trying to acquire the
lock, while `readPrompbFromWal` held it forever.

I changed the strategy here and instead of using fs.Notify, and all that
complicated logic around it, we're just using a pub/sub strategy between
the writer and reader Go routines.

The reader go routine, once finding an empty WAL, will now release the
lock immediately and wait for a notification from the writer. While
previously it would hold the lock while waiting for a write that would
never happen.

---------

Signed-off-by: Arthur Silva Sens <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

exporter/prometheusremotewrite ready to merge Code review completed; ready to merge by maintainers

Projects

None yet

5 participants