Conversation

Contributor

@Fly-Style Fly-Style commented Dec 5, 2025

Cost-Based Autoscaler for Seekable Stream Supervisors

Overview

Implements a cost-based autoscaling algorithm for seekable stream supervisor tasks that optimizes task count by balancing lag reduction against resource efficiency.

Note: this patch doesn't support autoscaling (down) during task rollover. Temporarily, it scales down in the same manner as it scales up.

Introduces WeightedCostFunction for cost-based autoscaling decisions. The function computes a cost score (in seconds) for each candidate task count, balancing lag recovery time against idle resource waste.

Key Design Decisions

Cost Formula

totalCost = lagWeight × lagRecoveryTime + idleWeight × idlenessCost
  • lagRecoveryTime = aggregateLag / (taskCount × avgProcessingRate) — time to clear backlog
  • idlenessCost = taskCount × taskDuration × predictedIdleRatio — wasted compute time
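
As a minimal illustration (hypothetical names, not the actual classes in this patch), the total cost for a candidate task count could be computed roughly as:

// Sketch only: combines the two terms above with the configured weights.
double computeCost(
    long aggregateLag,          // sum of lag (records) across all partitions
    double avgProcessingRate,   // records per second per task
    int taskCount,
    long taskDurationSeconds,
    double predictedIdleRatio,
    double lagWeight,
    double idleWeight
)
{
  // Seconds needed to clear the current backlog with this many tasks.
  final double lagRecoveryTime = aggregateLag / (taskCount * avgProcessingRate);
  // Compute-seconds expected to be wasted idling within one task duration.
  final double idlenessCost = taskCount * taskDurationSeconds * predictedIdleRatio;
  return lagWeight * lagRecoveryTime + idleWeight * idlenessCost;
}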

Idle Prediction Model

Uses capacity-based linear scaling:

predictedIdle = 1 - (1 - currentIdle) / (proposedTasks / currentTasks)

More tasks → more idle per task; fewer tasks → busier tasks.
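
A minimal sketch of this prediction (hypothetical method name; the clamp is added here only for illustration):

double predictIdleRatio(double currentIdle, int currentTasks, int proposedTasks)
{
  // The busy fraction per task shrinks proportionally as the task count grows.
  final double currentBusy = 1.0 - currentIdle;
  final double scale = (double) proposedTasks / currentTasks;
  final double predictedIdle = 1.0 - currentBusy / scale;
  // Clamp since the linear model can overshoot at the extremes.
  return Math.max(0.0, Math.min(1.0, predictedIdle));
}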

Ideal Idle Range

Defines optimal utilization as idle ratio within [0.2, 0.6]:

  • Below 0.2: overloaded → scale up
  • Within range: optimal → no action
  • Above 0.6: underutilized → scale down
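
For illustration only (names assumed, not the actual classes in this patch), the resulting decision rule looks roughly like:

enum ScaleHint { SCALE_UP, NO_ACTION, SCALE_DOWN }

// Classify the current idle ratio against the ideal range [0.2, 0.6].
ScaleHint classify(double idleRatio)
{
  if (idleRatio < 0.2) {
    return ScaleHint.SCALE_UP;    // overloaded
  } else if (idleRatio > 0.6) {
    return ScaleHint.SCALE_DOWN;  // underutilized
  } else {
    return ScaleHint.NO_ACTION;   // within the optimal band
  }
}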

Conservative Cold Start Behavior

When processing rate is unavailable (cold start, new tasks):

  • Current task count: cost = 0.01 (allowed)
  • Any scaling: cost = +∞ (prohibited)

This prevents scaling decisions based on incomplete data.
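
A minimal sketch of this guard (hypothetical method name):

// When no processing-rate data is available yet, only the current task count
// is viable; every other candidate is made prohibitively expensive.
double costWithoutRateData(int candidateTaskCount, int currentTaskCount)
{
  return candidateTaskCount == currentTaskCount ? 0.01 : Double.POSITIVE_INFINITY;
}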

Additionally, this patch adds reading of the average poll-idle ratio from the /rowStats task endpoint.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@Fly-Style Fly-Style changed the title Cost-based autoscaler: first raw version Cost-based autoscaler Dec 5, 2025
Contributor

jtuglu1 commented Dec 6, 2025

While I think this will be very useful, the primary issue we've run into with the current scaler is that it needs to shut down tasks in order to scale (which causes a lot of lag during this process). #18466 was working on a way to fix this.

I think this will help the scaler be smarter on each scale, but each scale action still costs a lot to do.

@Fly-Style
Contributor Author

@jtuglu1 thanks for your input, appreciate it! The aim is to make a more capable / tunable autoscaler, while in parallel we will make the improvements proposed in #18466.

@Fly-Style Fly-Style marked this pull request as ready for review December 9, 2025 16:34
Contributor

@kfaraz kfaraz left a comment

Thanks for the new auto-scaler strategy, @Fly-Style !
I really like the idea of assigning a cost value to a potential task count as it helps reason about our choices! Overall, the patch looks good.

I am leaving a partial review here. I am yet to go through the WeightedCostFunction and some other aspects of the patch. Will post the remaining comments today.

}

final int currentTaskCount = currentMetrics.getCurrentTaskCount();
final List<Integer> validTaskCounts = FACTORS_CACHE.computeIfAbsent(
Contributor

@kfaraz kfaraz Dec 10, 2025

Is the computeFactors() computation really heavy enough to require caching?
Especially since we are imposing a max SCALE_FACTOR_DISCRETE_DISTANCE.

How about we simplify computeFactors so that we compute only the required factors?

Example:

List<Integer> computeValidTaskCounts(int partitionCount, int currentTaskCount)
{
   // Consider only partitions-per-task values close to the current one.
   final int currentPartitionsPerTask = partitionCount / currentTaskCount;
   final int minPartitionsPerTask = Math.max(1, currentPartitionsPerTask - 2);
   final int maxPartitionsPerTask = Math.min(partitionCount, currentPartitionsPerTask + 2);

   // Ceiling division to get the task count for each partitions-per-task value.
   return IntStream.rangeClosed(minPartitionsPerTask, maxPartitionsPerTask)
                   .map(partitionsPerTask -> (partitionCount / partitionsPerTask) + Math.min(partitionCount % partitionsPerTask, 1))
                   .boxed()
                   .collect(Collectors.toList());
}

Contributor Author

As we discussed on voice chat, we leave the current implementation as is, but slightly enhance the way the data is saved in the cache.

Contributor Author

Fly-Style commented Dec 10, 2025

@kfaraz thanks for the review. I addressed most of the comments in e5b40c7, except the comment regarding using the correct metric for poll-idle-ratio. Unfortunately, the initial plan of measuring poll from the consumer was not correct, more details in this comment. I simply did not know it measures a slightly wrong thing. :(

I will add a separate endpoint to fetch the correct metrics from all tasks and calculate the average, with subsequent data normalization, in a separate commit.

cc @cryptoe

@Fly-Style Fly-Style requested a review from kfaraz December 11, 2025 19:28
@Fly-Style Fly-Style changed the title Cost-based autoscaler Introduce cost-based tasks autoscaler Dec 12, 2025
@Fly-Style Fly-Style changed the title Introduce cost-based tasks autoscaler Introduce cost-based tasks autoscaler for streaming ingestion Dec 12, 2025
Comment on lines 123 to 125
config.getScaleActionStartDelayMillis(),
config.getScaleActionPeriodMillis(),
TimeUnit.MILLISECONDS
Contributor

Sorry, I meant that we should not have the start delay, period and collection interval as configs at all. Even with good defaults, unnecessary configs only complicate admin work and require code to handle all possible scenarios.

At most, maybe keep just one config scaleActionPeriod that can be specified as an ISO period (e.g. PT1M) or something (mostly since you would be using this in embedded tests). The other configs don't really add any value. They are legacy configs in lag-based auto-scaler which we might as well avoid adding in the new strategy.

if (optimalTaskCount > currentTaskCount) {
return optimalTaskCount;
} else if (optimalTaskCount < currentTaskCount) {
supervisor.getIoConfig().setTaskCount(optimalTaskCount);
Contributor

This line can have behavioural side effects in the supervisor, since the taskCount should always reflect the current running task count and not the desired task count. Here we are updating the taskCount without actually changing the number of tasks, or suspending the supervisor.

Instead, we could do the following:

  • Add an auto-scaler method isScaleDownOnRolloverOnly(). This will always return false for lag-based and always true for cost-based.
  • CostBasedAutoScalerConfig.computeOptimalTaskCount() should return the optimal task count for scale down cases as well.
  • SeekableStreamSupervisor can store this desired task count in an atomic integer and retrieve it upon regular task rollover.

Another (perhaps cleaner) option is to simply invoke AutoScaler.computeOptimalTaskCount() whenever we do rollover and then just go with the optimal task count.

Contributor

@Fly-Style , as you suggested on chat, we can take up the scale down behaviour in a follow up PR.
In the current PR, we can do scale down the same way as scale up.

@Fly-Style Fly-Style requested a review from kfaraz December 15, 2025 13:22
Fly-Style and others added 3 commits December 15, 2025 15:22
…e/druid/indexing/kafka/KafkaConsumerMonitor.java

Co-authored-by: Kashif Faraz <[email protected]>
…e/druid/indexing/kafka/KafkaConsumerMonitor.java

Co-authored-by: Kashif Faraz <[email protected]>
Contributor

jtuglu1 commented Dec 15, 2025

At most, maybe keep just one config scaleActionPeriod that can be specified as an ISO period (e.g. PT1M) or something (mostly since you would be using this in embedded tests). The other configs don't really add any value. They are legacy configs in lag-based auto-scaler which we might as well avoid adding in the new strategy.

I don't fully agree with this. At the very least, we use config.getScaleActionStartDelayMillis() internally when doing red/black deployments where supervisors can get paused. It's better in our case to put a delay after resubmitting the supervisor, otherwise we end up over-scaling after a deployment. Similarly, we update the specs frequently to add new/update existing columns. Putting a cooldown after submission allows the scaler to adjust accurately to the lag rather than getting into a scaling loop and becoming way over-scaled (in supervisors with 500+ tasks this is an issue). I agree the rest are not too useful in practice.

Contributor

@kfaraz kfaraz left a comment

The changes look good but the weighted cost function needs to be simplified to make the computations more intuitive and debug friendly.

{
Map<String, Map<String, Object>> taskMetrics = getStats(true);
if (taskMetrics.isEmpty()) {
return 1.;
Contributor

Returning 1 (full idle) here would cause the auto-scaler to think that the tasks are doing nothing and cause a scale-down, when in fact the tasks failed to return the metrics and may be in a bad state. Scaling down might further worsen the problem.

Should we return 0 instead?

Contributor Author

Scaling up might be overkill in that scenario - nothing may happen and instead we will waste resources.
0.5 looks optimal to me (during the implementation I considered values between 0.5 and 1).

Contributor

We can even return -1 to denote that we do not have metrics available, and just skip scaling rather than make a bad decision.

result.add(taskCount);
}
}
return result.stream().mapToInt(Integer::intValue).toArray();
Contributor

Nit: Is conversion to array still needed? Can we just return List or Set instead?

Contributor Author

It's my personal preference: I'm really not a fan of boxed primitives :)

Contributor

But we are still boxing the primitives while adding to the List. Might as well avoid the extra conversion.

Comment on lines +248 to +250
if (result.isEmpty() || result.get(result.size() - 1) != taskCount) {
result.add(taskCount);
}
Contributor

Maybe just use a set to simplify this computation.

* Weighted cost function combining lag, idle time, and change distance metrics.
* Uses adaptive bounds for normalization based on recent history.
*/
public class WeightedCostFunction
Contributor

@kfaraz kfaraz Dec 15, 2025

I like the idea of the WeightedCostFunction, but I think we need to make it much more intuitive.

Define requirements (already aligned with the current state of this PR):

  1. A function that computes cost.
  2. A task count with lower cost is better.
  3. cost = lagCost * lagWeight + idlenessCost * idleWeight
  4. Lower the task count, higher the predicted lag, higher the lag cost.
  5. Higher the task count, higher the predicted idleness, higher the idleness cost.

Simplify computations

  1. Use linear scaling only; logarithmic scaling makes the terms difficult to reason about and debug.
    The diminishing-returns effect is already enforced by the window (discrete distance). If more terms are needed to account for, say, task operational overhead, we will add them in the future.
  2. Use only one mode i.e. do not invert scaling, even when lag is abnormally high.
  3. Use actual metrics instead of normalized or adaptive bounds.
    If a supervisor once saw a lag of 100M, the adaptive ratio would make a lag of 1M seem very small (normalizedLag = 0.01 i.e. 1%). But in reality, a lag of 1M is bad too and needs to be given appropriate weight.
  4. Always perform cost computation even if idleness is in the accepted range (0.2-0.6 in the PR).
    This would help us validate the correctness of the formula against real clusters by verifying that the current task count gives minimal cost.

We may re-introduce some of these enhancements in later patches once we have more data points using this auto-scaler, but it is best to start as simple as possible.

Use an intuitive metric, e.g. compute time

Connect the result of the cost function to an actual metric to make it more intuitive. The best metric I can think of is compute time or compute cycles, as it may be related to the actual monetary cost of running tasks.

For example, what if we could model the cost as follows:

  1. lagCost = expected time (in seconds) required to recover current lag
  2. idlenessCost = total compute time (in seconds) wasted being idle in a single taskDuration
  3. Intuitively, we can see that as task count increases, lagCost would decrease and idlenessCost would increase.

The formula for these costs may be something like:

lagCost
= expected time (in seconds) required to recover current lag
= currentAggregateLag / (proposedTaskCount * avgRateOfProcessing)

where,
currentAggregateLag = sum of current lag (in records) across all partitions
avgRateOfProcessing = average of task moving averages

idlenessCost
= total time (in seconds) wasted being idle in a single taskDuration
= total task run time * predicted idleness ratio

where,
total task run time = (proposedTaskCount * taskDuration)

predicted idleness ratio = (proposedTaskCount / currentTaskCount) - (1 - avgPollToIdleRatio)

e.g. if current poll-to-idle-ratio is 0.7, tasks are idle 70% of the time,
so reducing task count by 70% will make tasks busy all the time (idleness ratio = 0).

Assumptions

  • Tasks are already at their peak processing rate and will remain at this rate.
  • poll-to-idle ratio scales linearly with task count. We may use some reasonable clamps for min (say 0.05) and max (say 0.95).
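
To make this concrete, a rough sketch of the two proposed terms (all names are illustrative; clamps as suggested above):

// Expected time (in seconds) required to recover the current lag at the proposed task count.
double lagCost(long currentAggregateLag, int proposedTaskCount, double avgRateOfProcessing)
{
  return currentAggregateLag / (proposedTaskCount * avgRateOfProcessing);
}

// Total compute time (in seconds) wasted being idle across the proposed tasks in one taskDuration.
double idlenessCost(int proposedTaskCount, int currentTaskCount, long taskDurationSeconds, double avgPollIdleRatio)
{
  // Assume idleness scales linearly with task count, clamped to [0.05, 0.95].
  final double busyFraction = 1.0 - avgPollIdleRatio;
  double predictedIdleRatio = ((double) proposedTaskCount / currentTaskCount) - busyFraction;
  predictedIdleRatio = Math.max(0.05, Math.min(0.95, predictedIdleRatio));
  return proposedTaskCount * taskDurationSeconds * predictedIdleRatio;
}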

Contributor Author

Regarding:

Use actual metrics instead of normalized or adaptive bounds.
If a supervisor once saw a lag of 100M, the adaptive ratio would make a lag of 1M seem very small (normalizedLag = 0.01 i.e. 1%). But in reality, a lag of 1M is bad too and needs to be given appropriate weight.

Always perform cost computation even if idleness is in the accepted range (0.2-0.6 in the PR).
This would help us validate the correctness of the formula against real clusters by verifying that the current task count gives minimal cost.

Implemented in the latest commit. Other than that, we need to discuss each of the other points :)
Thanks in advance for your input, appreciate it A LOT!

Contributor

kfaraz commented Dec 15, 2025

Thanks for sharing the insight on config.getScaleActionStartDelayMillis(), @jtuglu1 !
Can you share some typical values that you use for this config, and how it compares to the scale action period?

Contributor

jtuglu1 commented Dec 15, 2025

Thanks for sharing the insight on config.getScaleActionStartDelayMillis(), @jtuglu1 ! Can you share some typical values that you use for this config, and how it compares to the scale action period?

We typically set a submit (start) delay of 20-30 mins after suspension/re-submit. This gives the scaler a chance to recover at its current task count before scaling (because oftentimes, scaling too frequently will disrupt lag more than it helps). Instead, we opt to try and let the lag recover within 20 mins, and if there's a sustained decrease in read tput (e.g. due to a new column) or increase in write tput, we allow the scaler to scale.

Scale action period is typically much smaller, maybe 5-10mins.

default Map<String, Map<String, Object>> getStats()
/**
* Returns all stats from stream consumer. If {@code includeOnlyStreamConsumerStats} is true,
* returns only stream platform stats, like Kafka metrics.
Contributor

Suggested change
* returns only stream platform stats, like Kafka metrics.
* returns only stream consumer stats, like Kafka consumer metrics.

* Calculates the average poll-idle-ratio metric across all active tasks.
* This metric indicates how much time the consumer spends idle waiting for data.
*
* @return the average poll-idle-ratio across all tasks, or 1 (full idle) if no tasks or metrics are available
Contributor

Suggested change
* @return the average poll-idle-ratio across all tasks, or 1 (full idle) if no tasks or metrics are available
* @return the average poll-idle-ratio across all tasks, or 0 (fully busy) if no tasks or metrics are available

Contributor

On second thought, we might want to return -1 from here if we don't have any metrics available.
This would cause the auto-scaler to skip scaling rather than make a bad decision.


autoscalerExecutor.scheduleAtFixedRate(
supervisor.buildDynamicAllocationTask(scaleAction, onSuccessfulScale, emitter),
config.getScaleActionPeriodMillis(),
Contributor

@kfaraz kfaraz Dec 16, 2025

Based on the feedback from @jtuglu1 , I think we might need to reconsider this.

But rather than add a new config, I wonder if we can't improve this logic a bit.

Options:

  1. startDelay = Math.min(taskDuration, 30 mins). So we consider scaling up only after the current tasks have run for a bit.
  2. startDelay = 3 * config.getScaleActionPeriodMillis(). This could work too, and seems reasonable for most cases.
  3. Add a separate config scaleActionStartDelayMillis (so that we remain aligned with the existing behaviour of lag-based auto-scaler), whose default value is 3 * config.getScaleActionPeriodMillis().

@jtuglu1 , @Fly-Style , which one do you guys prefer the most?
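
For reference, options 1 and 2 (the latter also being the proposed default for option 3) might look roughly like this (names assumed):

// Option 1: wait until the current tasks have run for a while, capped at 30 minutes.
long startDelayFromTaskDuration(long taskDurationMillis)
{
  return Math.min(taskDurationMillis, 30 * 60 * 1000L);
}

// Option 2 (and the proposed default for option 3): a few scale-action periods.
long startDelayFromScalePeriod(long scaleActionPeriodMillis)
{
  return 3 * scaleActionPeriodMillis;
}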

private static final EmittingLogger log = new EmittingLogger(CostBasedAutoScaler.class);

private static final int MAX_INCREASE_IN_PARTITIONS_PER_TASK = 2;
private static final int MIN_INCREASE_IN_PARTITIONS_PER_TASK = MAX_INCREASE_IN_PARTITIONS_PER_TASK * 2;
Contributor

Please rename based on usage. IIUC, this constant represents the "maximum" amount by which we can "decrease" the current value of num partitions per task.

Suggested change
private static final int MIN_INCREASE_IN_PARTITIONS_PER_TASK = MAX_INCREASE_IN_PARTITIONS_PER_TASK * 2;
private static final int MAX_DECREASE_IN_PARTITIONS_PER_TASK = MAX_INCREASE_IN_PARTITIONS_PER_TASK * 2;

Contributor

@kfaraz kfaraz left a comment

Thanks for incorporating the feedback, @Fly-Style !

The auto-scaler looks like a good starting point.
I have left some comments which should be addressed in follow up PRs.

final int partitionCount = supervisor.getPartitionCount();

final Map<String, Map<String, Object>> taskStats = supervisor.getStats();
final double movingAvgRate = extractMovingAverage(taskStats, DropwizardRowIngestionMeters.ONE_MINUTE_NAME);
Contributor

Moving averages over a longer time window (say 15 mins) might be more stable and thus more reliable.
If not available, then fall back to the 5-minute average, then the 1-minute average.
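
For instance, the selection could be a simple fallback chain (a sketch; the window keys are assumed to match whatever the task row stats expose):

// Prefer the longest available moving-average window for stability.
double selectMovingAverage(Map<String, Double> movingAverages)
{
  for (String window : new String[]{"15m", "5m", "1m"}) {
    final Double rate = movingAverages.get(window);
    if (rate != null && rate > 0) {
      return rate;
    }
  }
  return -1; // no usable rate; the caller may skip scaling
}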

} else {
// Fallback: estimate processing rate based on idle ratio
final double utilizationRatio = Math.max(0.01, 1.0 - pollIdleRatio);
avgProcessingRate = config.getDefaultProcessingRate() * utilizationRatio;
Contributor

It might be weird to have this be fed from a config (even as a fallback mechanism).

In the future, we can consider computing this based on the stats of previously completed tasks of this supervisor.
OR just use the last known processing rate.

Comment on lines +275 to +277
if (result.isEmpty() || result.get(result.size() - 1) != taskCount) {
result.add(taskCount);
}
Contributor

Use a set to simplify this logic.


autoscalerExecutor.scheduleAtFixedRate(
supervisor.buildDynamicAllocationTask(scaleAction, onSuccessfulScale, emitter),
config.getScaleActionPeriodMillis(),
Contributor

We need to revisit the start delay.

return -1;
}

// If idle is already in the ideal range [0.2, 0.6], optimal utilization has been achieved.
Contributor

We may not always want to skip scaling even if idleness is in the accepted range.

For example, if current idleness is 0.5 and there is no lag, a cluster admin might prefer to scale down the tasks, so that idleness is more like 0.2 or so. They should be allowed to control this via the idleWeight.

For the initial testing of this auto-scaler, let's remove this guardrail.

}
}

emitter.emit(metricBuilder.setMetric(AVG_LAG_METRIC, metrics.getAvgPartitionLag()));
Contributor

This is already emitted as a metric.

What would be more useful is the computed terms lagCost and idleCost.
Getting these out as metrics would enable users to choose better values for lagWeight and idleWeight.

@kfaraz kfaraz merged commit 313ba8e into apache:master Dec 17, 2025
55 checks passed
Contributor

kfaraz commented Dec 17, 2025

@Fly-Style , I have merged this PR. Please address the open comments in a follow up PR.
