
Conversation

gouthamve
Member

The previous attempt was rolled back in #10590 due to memory issues.

sl.parentCtx and sl.ctx both had a copy of the cache and target info
in the previous attempt, and it was hard to pinpoint where the context
was being retained, causing the memory increase.

I've experimented a bunch in #10627 and found that this approach doesn't
cause a memory increase. Beyond that, using this info in any other context
does cause a memory increase. I'm hoping just this won't cause issues, but doing the following on top:

diff --git a/scrape/scrape.go b/scrape/scrape.go
index 76739b9cb..8efa431f0 100644
--- a/scrape/scrape.go
+++ b/scrape/scrape.go
@@ -1288,7 +1288,7 @@ func (sl *scrapeLoop) scrapeAndReport(last, appendTime time.Time, errc chan<- er
        }

        var contentType string
-       scrapeCtx, cancel := context.WithTimeout(sl.parentCtx, sl.timeout)
+       scrapeCtx, cancel := context.WithTimeout(sl.appCtx, sl.timeout)
        contentType, scrapeErr = sl.scraper.scrape(scrapeCtx, buf)
        cancel()

still causes a slight memory increase. So at this point, I'm not sure how to debug it.

[Screenshot: 2022-04-27 at 11:03:28]

The change fixed a bunch of long-standing issues in the OTel Collector that the
community was waiting on, and releases of a few downstream distributions
of the OTel Collector are blocked waiting on a fix. I propose to merge this change in while
I investigate what is happening.

@gouthamve
Member Author

/prombench main

@prombot
Contributor

prombot commented Apr 27, 2022

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-10636 and main

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@roidelapluie
Member

  1. Can you gate this behind a manager option so Prometheus is not affected?
  2. Can we design this in another way that does not involve context?

@gouthamve
Member Author

Can you gate this behind a manager option so Prometheus is not affected?

Yes. I can do this.

Can we design this in another way that does not involve context?

I've looked into this before and couldn't find a better idea. The problem is that OTel needs to look up metadata for each sample added in Append, and the only thing we provide is scrape.NewManager, which only takes an Appendable interface. I'll take another look today.
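For illustration, here is a minimal Go sketch of the pattern under discussion, assuming hypothetical helper names (contextWithMetadata, metadataFromContext, metadataStore) rather than the exact functions this PR exports: the scrape loop attaches its target/metadata cache to the context it hands to Appendable.Appender, and a wrapper appender on the OTel side can then consult that cache for every sample.

```go
// Package otelscrape sketches the metadata-in-context pattern. The helper
// names below are hypothetical stand-ins, not the exact API of this PR.
package otelscrape

import (
	"context"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/storage"
)

// metadataStore stands in for the scrape cache holding per-metric metadata.
type metadataStore interface {
	GetMetadata(metricFamily string) (help, unit, typ string, ok bool)
}

type metadataKey struct{}

// contextWithMetadata attaches the cache to the context instead of widening
// the Appendable interface.
func contextWithMetadata(ctx context.Context, s metadataStore) context.Context {
	return context.WithValue(ctx, metadataKey{}, s)
}

// metadataFromContext retrieves the cache, if the scrape loop attached one.
func metadataFromContext(ctx context.Context) (metadataStore, bool) {
	s, ok := ctx.Value(metadataKey{}).(metadataStore)
	return s, ok
}

// MetadataAppendable wraps the receiver's real Appendable.
type MetadataAppendable struct {
	Next storage.Appendable
}

func (a *MetadataAppendable) Appender(ctx context.Context) storage.Appender {
	app := a.Next.Appender(ctx)
	if store, ok := metadataFromContext(ctx); ok {
		return &metadataAppender{Appender: app, store: store}
	}
	return app
}

// metadataAppender consults the store for each series it appends.
type metadataAppender struct {
	storage.Appender
	store metadataStore
}

func (a *metadataAppender) Append(ref storage.SeriesRef, l labels.Labels, t int64, v float64) (storage.SeriesRef, error) {
	if help, unit, typ, ok := a.store.GetMetadata(l.Get(labels.MetricName)); ok {
		_, _, _ = help, unit, typ // e.g. map these to the OTel metric type here.
	}
	return a.Appender.Append(ref, l, t, v)
}
```

Nothing in the Appendable/Appender interfaces changes; only the context carries the extra payload, which is also why retaining those contexts retains the cache.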

@gouthamve
Member Author

/prombench cancel

@prombot
Contributor

prombot commented Apr 28, 2022

Benchmark cancel is in progress.

@gouthamve
Member Author

@roidelapluie Gated it behind a manager option. Could you take a look?
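As a rough sketch of what opting in might look like from a downstream consumer such as the OTel receiver — the option field name and the v2.35-era scrape.NewManager signature here are assumptions, so check the merged code for the exact API:

```go
// Package otelscrape shows a consumer opting into the manager option.
package otelscrape

import (
	"github.com/go-kit/log"
	"github.com/prometheus/prometheus/scrape"
	"github.com/prometheus/prometheus/storage"
)

// newOptInManager builds a scrape manager with the metadata-in-context
// behaviour enabled. Prometheus itself leaves the option at its zero value,
// so its own memory profile is unaffected.
func newOptInManager(app storage.Appendable) *scrape.Manager {
	opts := &scrape.Options{
		// Field name is an assumption based on this PR's "manager option".
		PassMetadataInContext: true,
	}
	return scrape.NewManager(opts, log.NewNopLogger(), app)
}
```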

The previous attempt was rolled back in #10590 due to memory issues.

`sl.parentCtx` and `sl.ctx` both had a copy of the cache and target info
in the previous attempt, and it was hard to pinpoint where the context
was being retained, causing the memory increase.

I've experimented a bunch in #10627 and found that this approach doesn't
cause a memory increase. Beyond that, using this info in _any_ other context
does cause a memory increase.

The change fixed a bunch of long-standing issues in the OTel Collector that the
community was waiting on, and releases of a few downstream distributions
of the OTel Collector are blocked waiting on a fix. I propose to merge this change in while
I investigate what is happening.

Signed-off-by: Goutham Veeramachaneni <[email protected]>
@gouthamve gouthamve force-pushed the otel-context-metadata branch from d276fe8 to d1176d0 on May 2, 2022 21:13
Member

@roidelapluie roidelapluie left a comment


LGTM after lint fixes

Signed-off-by: Goutham Veeramachaneni <[email protected]>
@gouthamve gouthamve force-pushed the otel-context-metadata branch from d1176d0 to 5808889 on May 3, 2022 00:56
@gouthamve
Member Author

Fixed.

@Nexucis
Member

Nexucis commented May 3, 2022

Wouldn't it be nice to have it as a fix for the v2.35.x series?

@roidelapluie
Member

Wouldn't it be nice to have it as a fix for the v2.35.x series?

There is no end-user impact here; I would not release Prometheus for it. As a stop-gap, OTel can point to the commit.

Member

@roidelapluie roidelapluie left a comment


LGTM

@gouthamve gouthamve merged commit 2381d7b into main May 3, 2022
@gouthamve gouthamve deleted the otel-context-metadata branch May 3, 2022 18:45
gouthamve added a commit to gouthamve/opentelemetry-collector-contrib that referenced this pull request May 3, 2022
Prometheus started doing proper go.mod publishing and started off with
v0.35, but this means the previous version was "winning", as it was v1.8.2
(which sorts higher than v0.35), so I had to update the version everywhere.

This brings in prometheus/prometheus#10636
which fixes open-telemetry#9278

Signed-off-by: Goutham Veeramachaneni <[email protected]>
codeboten pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request May 3, 2022
* Update Prometheus to not have memory leak.

Prometheus started doing proper go.mod publishing and started off with
v0.35, but this means the previous version was "winning", as it was v1.8.2
(which sorts higher than v0.35), so I had to update the version everywhere.

This brings in prometheus/prometheus#10636
which fixes #9278

Signed-off-by: Goutham Veeramachaneni <[email protected]>
djaglowski pushed a commit to djaglowski/opentelemetry-collector-contrib that referenced this pull request May 10, 2022
* Update Prometheus to not have memory leak.

Prometheus started doing proper go.mod publishing and started off with
v0.35, but this means the previous version was "winning", as it was v1.8.2
(which sorts higher than v0.35), so I had to update the version everywhere.

This brings in prometheus/prometheus#10636
which fixes open-telemetry#9278

Signed-off-by: Goutham Veeramachaneni <[email protected]>
roidelapluie pushed a commit to roidelapluie/prometheus that referenced this pull request Jun 22, 2022
* Send target and metadata cache in context (again)

The previous attempt was rolled back in prometheus#10590 due to memory issues.

`sl.parentCtx` and `sl.ctx` both had a copy of the cache and target info
in the previous attempt, and it was hard to pinpoint where the context
was being retained, causing the memory increase.

I've experimented a bunch in prometheus#10627 and found that this approach doesn't
cause a memory increase. Beyond that, using this info in _any_ other context
does cause a memory increase.

The change fixed a bunch of long-standing issues in the OTel Collector that the
community was waiting on, and releases of a few downstream distributions
of the OTel Collector are blocked waiting on a fix. I propose to merge this change in while
I investigate what is happening.

Signed-off-by: Goutham Veeramachaneni <[email protected]>

* Gate the change behind a manager option

Signed-off-by: Goutham Veeramachaneni <[email protected]>
@prombot
Contributor

prombot commented Aug 18, 2022

@khanhntd is neither an org member nor a collaborator and cannot execute benchmarks.

