@pkoenig10 pkoenig10 commented May 29, 2025

Before this PR

In our internal authentication service, we have a single-threaded executor where, at any given time, there is at most one executing task and at most one queued task. The code looks something like:

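// Guava's ListeningExecutorService is used so that submit(...) returns a ListenableFuture,
// which SettableFuture.setFuture(...) requires.
private final ListeningExecutorService updateExecutor =
        MoreExecutors.listeningDecorator(Executors.newSingleThreadExecutor());
// Future of the update that is queued but has not started yet, or null if none is queued.
private final AtomicReference<SettableFuture<Void>> pendingUpdate = new AtomicReference<>();

Future<Void> updateCache() {
    // Fast path: an update is already queued, so share its future.
    SettableFuture<Void> future = pendingUpdate.get();
    if (future != null) {
        return future;
    }

    future = SettableFuture.create();

    // Only one caller wins the race to enqueue; the others reuse the winner's future.
    SettableFuture<Void> witness = pendingUpdate.compareAndExchange(null, future);
    if (witness != null) {
        return witness;
    }

    future.setFuture(updateExecutor.submit(this::doUpdateCache, null));

    return future;
}

void doUpdateCache() {
    // Clear the slot as soon as execution starts so that one more update can be queued.
    pendingUpdate.set(null);

    ...
}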

Metrics seem to indicate that the queued duration p99 is longer than the duration p99. Here are metrics from our internal test environment.

[Screenshot of the metrics, 2025-05-29]

This should be impossible given how this executor is used: at most one task is ever queued behind at most one executing task, so a queued task should wait roughly no longer than a single execution takes.

It happens because TaggedMetricsExecutorService drops any queued duration sample below a minimum threshold (QUEUED_DURATION_MINIMUM_THRESHOLD_NANOS, 250 ms). Only the slowest samples are recorded, which artificially inflates the queued duration metric, especially for executors that typically have short queue durations.

Silently dropping measurements in this way is confusing and makes the resulting metrics misleading.
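To illustrate the inflation, here is a minimal, self-contained sketch. It is not Tritium code: the class name, the nearest-rank percentile helper, and the sample values are all made up for illustration, and only the 250 ms threshold mirrors the constant quoted in this PR. It compares the p99 over all samples with the p99 over only the samples that survive the threshold.

```java
import java.util.Arrays;

class ThresholdBiasDemo {
    // Assumed to mirror the 250 ms minimum threshold discussed in this PR.
    static final long THRESHOLD_NANOS = 250_000_000L;

    // Nearest-rank percentile over an unsorted sample array (illustrative helper).
    static long percentile(long[] samples, int p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Mimics a recorder that silently drops samples below the threshold.
    static long[] dropBelowThreshold(long[] samples) {
        return Arrays.stream(samples).filter(s -> s >= THRESHOLD_NANOS).toArray();
    }

    public static void main(String[] args) {
        // Hypothetical workload: 995 tasks wait 1 ms, 5 tasks wait 300 ms.
        long[] queuedNanos = new long[1000];
        Arrays.fill(queuedNanos, 1_000_000L);
        for (int i = 0; i < 5; i++) {
            queuedNanos[i] = 300_000_000L;
        }

        // p99 over every sample: 1000000 (1 ms)
        System.out.println(percentile(queuedNanos, 99));
        // p99 over the surviving samples only: 300000000 (300 ms)
        System.out.println(percentile(dropBelowThreshold(queuedNanos), 99));
    }
}
```

In this made-up workload only 0.5% of tasks wait 300 ms, yet the filtered metric reports 300 ms as the p99 because those are the only samples it ever sees.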

After this PR

TaggedMetricsExecutorService no longer excludes measurements from the queued duration metric. This metric now accurately captures the time between submission and execution for all submitted tasks.
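As a rough sketch of what "the time between submission and execution for all submitted tasks" means, the snippet below wraps each task so that its queued duration is recorded unconditionally. This is an assumption-laden illustration, not the actual TaggedMetricsExecutorService implementation: the class name, the `timed` helper, and the list standing in for a timer metric are all hypothetical.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class QueuedDurationSketch {
    // Stand-in for a timer metric; a real implementation would update a metric registry.
    final List<Long> queuedNanos = new CopyOnWriteArrayList<>();

    // Captures the submission time now and records the queued duration when the task starts.
    Runnable timed(Runnable task) {
        long submittedAt = System.nanoTime();
        return () -> {
            // Every sample is recorded, however small; nothing is filtered out.
            queuedNanos.add(System.nanoTime() - submittedAt);
            task.run();
        };
    }

    public static void main(String[] args) throws InterruptedException {
        QueuedDurationSketch sketch = new QueuedDurationSketch();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        for (int i = 0; i < 3; i++) {
            executor.execute(sketch.timed(() -> {}));
        }
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
        // One sample per submitted task.
        System.out.println(sketch.queuedNanos.size());
    }
}
```

Recording every sample keeps the percentiles of the queued duration comparable with those of the execution duration, since both are then computed over the same set of tasks.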

changelog-app bot commented May 29, 2025

Description

The executor queued duration metric no longer excludes small measurements. This ensures that the metrics accurately measure the time between submission and execution.


// it doesn't necessarily mean there's a queue at all. We assume anything longer than
// this threshold, which should be longer than pauses in most cases, is the result
// of queueing.
private static final long QUEUED_DURATION_MINIMUM_THRESHOLD_NANOS = 250_000_000L;

@schlosna schlosna Jun 2, 2025

#1230 is what originally added this threshold. Per discussions with @carterkozak & @pkoenig10, we explicitly do not add queue metrics for cached executors in tritium clients (support for this was added in #1012).

@bulldozer-bot bulldozer-bot bot merged commit fd19631 into develop Jun 2, 2025
5 checks passed
@bulldozer-bot bulldozer-bot bot deleted the pkoenig/queuedDuration branch June 2, 2025 15:52

autorelease3 bot commented Jun 2, 2025

Released 0.100.0
