Conversation

@celian-garcia (Member) commented May 28, 2025

Description

The idea is to stop waiting for all the calls for one Azure subscription to finish before starting to collect data from another subscription.
Parallelizing this keeps all the processing and queries under one minute.
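
For illustration only, here is a minimal sketch of the intended pattern, not the receiver's actual code: fan out one goroutine per subscription and wait for all of them instead of scraping subscriptions sequentially. The `scrapeSubscription` helper and the subscription IDs are hypothetical stand-ins.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// scrapeSubscription is a hypothetical stand-in for the per-subscription work
// (listing resources, fetching metric definitions, querying metric values).
func scrapeSubscription(ctx context.Context, subscriptionID string) {
	time.Sleep(200 * time.Millisecond) // simulate Azure API latency
	fmt.Println("collected metrics for", subscriptionID)
}

func main() {
	ctx := context.Background()
	subscriptionIDs := []string{"sub-a", "sub-b", "sub-c"}

	var wg sync.WaitGroup
	for _, id := range subscriptionIDs {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			scrapeSubscription(ctx, id)
		}(id)
	}
	// Total wall-clock time is roughly that of the slowest subscription,
	// not the sum of all subscriptions.
	wg.Wait()
}
```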

Link to tracking issue

Fixes #39417

Testing

No regressions in the current tests, which should cover this case.

Documentation

N/A

@celian-garcia celian-garcia marked this pull request as draft June 2, 2025 09:40
@celian-garcia (Member, Author) commented:

Converting this to a draft since it produces some crashes. I have a bunch of never-ending, strange logs to analyze; for example:

```
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/internal/metadata/metrics.go:102 +0x77
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/azuremonitorreceiver/internal/metadata.(*MetricsBuilder).EmitForResource(0xc00aa9f3b0, {0xc007ad8f78, 0x1, 0x24?})
go.opentelemetry.io/collector/[email protected]/pmetric/generated_scopemetricsslice.go:103
go.opentelemetry.io/collector/pdata/pmetric.ScopeMetricsSlice.AppendEmpty(...)
goroutine 95038 [runnable]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/scraper_batch.go:152 +0xab
created by github.com/open-telemetry/opentelemetry-collector-contrib/receiver/azuremonitorreceiver.(*azureBatchScraper).scrape in goroutine 206
runtime/asm_amd64.s:1700 +0x1
runtime.goexit({})
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/scraper_batch.go:152
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/azuremonitorreceiver.(*azureBatchScraper).scrape.gowrap1()
goroutine 95046 [runnable]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/scraper_batch.go:152 +0xab
created by github.com/open-telemetry/opentelemetry-collector-contrib/receiver/azuremonitorreceiver.(*azureBatchScraper).scrape in goroutine 206
runtime/asm_amd64.s:1700 +0x1
runtime.goexit({})
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/scraper_batch.go:152
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/azuremonitorreceiver.(*azureBatchScraper).scrape.gowrap1()
goroutine 95073 [runnable]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/scraper_batch.go:152 +0xab
created by github.com/open-telemetry/opentelemetry-collector-contrib/receiver/azuremonitorreceiver.(*azureBatchScraper).scrape in goroutine 206
runtime/asm_amd64.s:1700 +0x1
runtime.goexit({})
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/scraper_batch.go:152
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/azuremonitorreceiver.(*azureBatchScraper).scrape.gowrap1()
goroutine 95027 [runnable]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/scraper_batch.go:152 +0xab
created by github.com/open-telemetry/opentelemetry-collector-contrib/receiver/azuremonitorreceiver.(*azureBatchScraper).scrape in goroutine 206
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/scraper_batch.go:166 +0x1b7
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/azuremonitorreceiver.(*azureBatchScraper).scrape.func1({0xc003ab2390, 0x24})
goroutine 95033 [chan receive]:
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/scraper_batch.go:152 +0xab
created by github.com/open-telemetry/opentelemetry-collector-contrib/receiver/azuremonitorreceiver.(*azureBatchScraper).scrape in goroutine 206
runtime/asm_amd64.s:1700 +0x1
runtime.goexit({})
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/scraper_batch.go:152
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/azuremonitorreceiver.(*azureBatchScraper).scrape.gowrap1()
goroutine 95081 [runnable]:
golang.org/x/[email protected]/http2/transport.go:1336 +0x485
created by golang.org/x/net/http2.(*ClientConn).roundTrip in goroutine 75645
golang.org/x/[email protected]/http2/transport.go:1431 +0x56
golang.org/x/net/http2.(*clientStream).doRequest(0xc00bbda900, 0xc001d68720?, 0x24?)
golang.org/x/[email protected]/http2/transport.go:1570 +0xc85
golang.org/x/net/http2.(*clientStream).writeRequest(0xc00bbda900, 0xc0420be640, 0x0)
```


github-actions bot (Contributor) commented:

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jun 17, 2025
github-actions bot (Contributor) commented Jul 1, 2025

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Jul 1, 2025
atoulme pushed a commit that referenced this pull request Oct 10, 2025
… Batch API mode (#43047)

#### Description

This PR revives the previously non-functional #40344, aiming to enhance
performance by running metrics collection across Azure subscriptions in
parallel. This significantly reduces latency when dealing with multiple
subscriptions.

Compared to the original PR, this version addresses a critical
concurrency issue:
```fatal error: concurrent map read and map write```
Reference: https://victoriametrics.com/blog/go-sync-map/

To resolve this, I evaluated multiple approaches and found that using the concurrent-map library yielded the best results. Details of the evaluation and benchmarks are documented in concurrency_bench_report.md.
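
For context, here is a minimal sketch of the kind of shared state that triggers this error and how a concurrent map avoids it. It assumes the github.com/orcaman/concurrent-map/v2 package and a hypothetical per-resource counter; it is not the code used in this PR.

```go
package main

import (
	"fmt"
	"sync"

	cmap "github.com/orcaman/concurrent-map/v2"
)

func main() {
	// A plain map written to from several goroutines would crash with
	// "fatal error: concurrent map read and map write"; ConcurrentMap
	// shards the keys and guards each shard with its own lock.
	metricsPerResource := cmap.New[int]()

	var wg sync.WaitGroup
	for _, resourceID := range []string{"vm-1", "vm-2", "storage-1"} {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			// Each goroutine touches its own key, so read-then-write is safe here.
			n, _ := metricsPerResource.Get(id)
			metricsPerResource.Set(id, n+1)
		}(resourceID)
	}
	wg.Wait()

	for item := range metricsPerResource.IterBuffered() {
		fmt.Println(item.Key, item.Val)
	}
}
```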


#### Link to tracking issue
Fixes #39417

#### Testing

During testing, a race condition was discovered in the Azure SDK related to the fallback behavior of the Cloud option in the NewClient function:
```go
// NewClient creates a client that accesses Azure Monitor metrics data.
// Client should be used for performing metrics queries on multiple monitored resources in the same region.
// A credential with authorization at the subscription level is required when using this client.
//
// endpoint - The regional endpoint to use, for example https://eastus.metrics.monitor.azure.com.
// The region should match the region of the requested resources. For global resources, the region should be 'global'.
func NewClient(endpoint string, credential azcore.TokenCredential, options *ClientOptions) (*Client, error) {
	if options == nil {
		options = &ClientOptions{}
	}
	if reflect.ValueOf(options.Cloud).IsZero() {
		options.Cloud = cloud.AzurePublic // <-- HERE
	}
	c, ok := options.Cloud.Services[ServiceName]
	if !ok || c.Audience == "" {
		return nil, errors.New("provided Cloud field is missing Azure Monitor Metrics configuration")
	}
	// ...
```
To prevent this, our implementation explicitly sets the Cloud option in all cases, ensuring deterministic behavior and avoiding the race.
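
A sketch of that idea, assuming the azmetrics package from the Azure SDK for Go and a DefaultAzureCredential; the helper name and endpoint below are illustrative, not the receiver's actual code:

```go
package main

import (
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/cloud"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/monitor/query/azmetrics"
)

// newMetricsClient is a hypothetical helper that always populates the Cloud
// option so NewClient's zero-value fallback path is never exercised.
func newMetricsClient(endpoint string) (*azmetrics.Client, error) {
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		return nil, err
	}

	// Set Cloud explicitly in all cases instead of relying on the SDK fallback.
	opts := &azmetrics.ClientOptions{}
	opts.Cloud = cloud.AzurePublic

	return azmetrics.NewClient(endpoint, cred, opts)
}

func main() {
	// Illustrative regional endpoint; it should match the region of the queried resources.
	client, err := newMetricsClient("https://eastus.metrics.monitor.azure.com")
	if err != nil {
		log.Fatal(err)
	}
	_ = client
}
```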

#### Documentation

A new markdown file (concurrency_bench_report.md) has been added to document:

- The rationale behind choosing concurrent-map
- Benchmark results comparing different implementations (the general shape of such a benchmark is sketched after this list)
- Notes for future contributors who may want to explore alternative concurrency strategies
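
For readers unfamiliar with Go benchmarks, this is the general shape of such a comparison. It contrasts a mutex-guarded map with the standard library's sync.Map as assumed examples only; the actual report also covers the concurrent-map approach chosen here.

```go
package concurrencybench

import (
	"sync"
	"testing"
)

// BenchmarkMutexMap: a plain map guarded by a sync.Mutex under parallel writers.
func BenchmarkMutexMap(b *testing.B) {
	var mu sync.Mutex
	m := make(map[int]int)
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			mu.Lock()
			m[i%1024] = i
			mu.Unlock()
			i++
		}
	})
}

// BenchmarkSyncMap: the standard library's sync.Map under the same write load.
func BenchmarkSyncMap(b *testing.B) {
	var m sync.Map
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			m.Store(i%1024, i)
			i++
		}
	})
}
```

Run with `go test -bench=. -benchmem` to compare ns/op and allocations.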

---------

Signed-off-by: Célian Garcia <[email protected]>
Signed-off-by: Celian GARCIA <[email protected]>
Co-authored-by: Moritz Wiesinger <[email protected]>
