
Conversation

@phenylshima
Contributor

Fixes #3710

This pull request improves the speed of BytesQueueBuffer.get() by using memoryview, solving the problem described in #3710 .

@phenylshima phenylshima marked this pull request as ready for review November 13, 2025 11:55
@sigmavirus24
Contributor

Does it make more sense to just always store the byte chunks as memoryviews? My instinct was that would solve this as well.

We'll probably need to update the type hints as well, since the deque can now hold either bytes or memoryview.
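A minimal sketch of what always storing chunks as memoryviews could look like (hypothetical class and names, not urllib3's actual implementation):

```python
from collections import deque


class MemviewQueueBuffer:
    """Hypothetical sketch: every chunk is stored as a memoryview,
    so get() can split the head chunk without copying the tail."""

    def __init__(self) -> None:
        self.buffer: deque[memoryview] = deque()
        self._size = 0

    def __len__(self) -> int:
        return self._size

    def put(self, data: bytes) -> None:
        # Wrap on the way in; memoryview creation is cheap but not free.
        self.buffer.append(memoryview(data))
        self._size += len(data)

    def get(self, n: int) -> bytes:
        out = bytearray()
        while self.buffer and len(out) < n:
            chunk = self.buffer.popleft()
            needed = n - len(out)
            if len(chunk) > needed:
                out += chunk[:needed]
                # Slicing a memoryview is O(1); the tail is not copied.
                self.buffer.appendleft(chunk[needed:])
            else:
                out += chunk
        self._size -= len(out)
        return bytes(out)
```

With this shape, a partial read re-queues the unread tail as a zero-copy slice instead of rebuilding a new bytes object for it.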

@phenylshima
Contributor Author

@sigmavirus24 Thank you for your suggestion! I updated the code to store the data as memoryview.

@sigmavirus24
Contributor

Is that as performant at least?

@phenylshima
Contributor Author

phenylshima commented Nov 14, 2025

Yes, at least on my machine.

Test code:

from src.urllib3.response import BytesQueueBuffer
import time

buffer = BytesQueueBuffer()
buffer.put(b"x" * 1024 * 1024 * 10)  # 10 MiB
now = time.perf_counter()
while len(buffer) > 5000:
    buffer.get(1024)  # 1 KiB
delta = time.perf_counter() - now

print(delta)

The result:
495aab1: 0.0052835239985142834
19820a4: 0.004722315999970306

The new one looks faster, but only because the memoryview(chunk) call moved into the put method; I think the overall performance is the same. One thing to note: when we write one big chunk and read it back only once (either by get or get_all), I expect it to be slightly slower than the current main branch, due to running memoryview(chunk) and then converting back to bytes (although with get_all the conversion back to bytes does not happen).
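To make the tradeoff concrete, here is a small standalone illustration (not from the PR) of the copy that memoryview slicing avoids:

```python
import sys

# Slicing bytes allocates a new ~10 MiB object for the tail, while
# slicing a memoryview only records an offset into the same buffer.
data = b"x" * (10 * 2**20)  # 10 MiB

tail_bytes = data[1024:]             # copies the remaining ~10 MiB
tail_view = memoryview(data)[1024:]  # no copy: O(1)

print(len(tail_bytes) == len(tail_view))  # True
print(tail_view.obj is data)              # True: still backed by `data`
print(sys.getsizeof(tail_view) < sys.getsizeof(tail_bytes))  # True
```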

@sigmavirus24
Contributor

Worth checking for a performance hit on .put then from before your changes to now then, although I don't expect much of one. May as well also do a full end-to-end performance test. Put a bunch of chunks, do a get_all, etc.

@phenylshima
Contributor Author

phenylshima commented Nov 14, 2025

I tested with the original repro, and it looked fine.

However, the put method got slower, which surprised me:

from src.urllib3.response import BytesQueueBuffer
import time

buffer = BytesQueueBuffer()
now = time.perf_counter()

for _ in range(10 * 1024):  # 10 MiB total
    buffer.put(b"x" * 1024)  # 1 KiB

delta = time.perf_counter() - now
print(f"Time to put data: {delta}s")

while len(buffer) > 0:
    buffer.get(1024)  # 1 KiB

delta = time.perf_counter() - now

print(f"Time to get data: {delta}s")
> ~/dev/urllib3> main !1 ?1> python t.py
Time to put data: 0.0007599639998261409s
Time to get data: 0.004324924999764335s
> ~/dev/urllib3> main !1 ?1> python t.py
Time to put data: 0.0008291090002785495s
Time to get data: 0.004370479000044725s
> ~/dev/urllib3> main !1 ?1> python t.py
Time to put data: 0.0007629809997524717s
Time to get data: 0.004293366999718273s

> ~/dev/urllib3> @495aab1b ?1> python t.py
Time to put data: 0.0007698970002820715s
Time to get data: 0.004365679000329692s
> ~/dev/urllib3> @495aab1b ?1> python t.py
Time to put data: 0.000990569999885338s
Time to get data: 0.004518313000062335s
> ~/dev/urllib3> @495aab1b ?1> python t.py
Time to put data: 0.0009582410002622055s
Time to get data: 0.004823216000204411s

> ~/dev/urllib3> byte-queue-perf ?1> python t.py
Time to put data: 0.003418835000047693s
Time to get data: 0.007445904999713093s
> ~/dev/urllib3> byte-queue-perf ?1> python t.py
Time to put data: 0.0033251579998250236s
Time to get data: 0.007321049999973184s
> ~/dev/urllib3> byte-queue-perf ?1> python t.py
Time to put data: 0.003460155999619019s
Time to get data: 0.00736400899995715s

Maybe that is because this benchmark does not leverage the bytes->memoryview conversion at all, since it reads exactly the same size it writes?

It might be better to use 495aab1.

@phenylshima
Contributor Author

Now it's like this:

>~/dev/urllib3 > byte-queue-perf ?1 > python t.py
Time to put data: 0.0007707550003033248s
Time to get data: 0.004282103000150528s
>~/dev/urllib3 > byte-queue-perf ?1 > python t.py
Time to put data: 0.0007912390001365566s
Time to get data: 0.004329170999881171s
>~/dev/urllib3 > byte-queue-perf ?1 > python t.py
Time to put data: 0.0007524289994762512s
Time to get data: 0.004279935999875306s

Member

@illia-v illia-v left a comment


@phenylshima thanks for uncovering the issue!

illia-v
illia-v previously approved these changes Nov 19, 2025
Member

@illia-v illia-v left a comment


I extended the tests a bit while testing the change locally and fixed a minor issue in the changelog entry

@illia-v illia-v requested a review from sigmavirus24 November 19, 2025 22:05
@illia-v illia-v requested a review from pquentin November 22, 2025 20:39

assert len(get_func(buffer)) == 10 * 2**20
result = get_func(buffer)
assert type(result) is bytes
Contributor


Why not isinstance?

Member


I just wanted to ensure that it returns bytes and not something else, including a subclass instance, which isinstance wouldn't differentiate.
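A small illustration of that distinction (the FakeBytes subclass is hypothetical, purely for demonstration):

```python
class FakeBytes(bytes):
    """Hypothetical bytes subclass, only for illustrating the check."""


value = FakeBytes(b"abc")

print(isinstance(value, bytes))  # True: a subclass passes isinstance
print(type(value) is bytes)      # False: the exact-type check rejects it
```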

@pquentin
Member

Am I misreading the benchmarks? I'm not seeing a speed difference between main and the last byte-queue-perf benchmark: 0.7ms for put and 4ms for get.

@phenylshima
Contributor Author

phenylshima commented Nov 28, 2025

Thank you for looking at this PR. The current test does not measure execution time; rather, it checks the absence of the extra memory allocation that caused the overhead on the current main branch.
Changing buffer.get(2**20) to buffer.get(2**10), or something like that, should make the performance difference more visible (I'm not sure whether I should commit that, though).

@illia-v
Member

illia-v commented Nov 28, 2025

I believe Quentin asked about the benchmarks from #3711 (comment) and #3711 (comment) which did not measure the effect of using memoryview for chunks that are split.

The benchmarked case was related to a regression that was introduced by this PR initially and has been resolved since then in e898682: memoryview was used even when a chunk was not split.
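A rough sketch of that fix (hypothetical helper, not the actual e898682 diff): wrap a chunk in a memoryview only when it actually has to be split, so whole-chunk reads never pay the wrap/unwrap cost:

```python
from collections import deque


def pop_up_to(queue: deque, n: int) -> bytes:
    """Pop at most n bytes from the head chunk of the queue."""
    chunk = queue.popleft()
    if len(chunk) <= n:
        # Common case: the whole chunk is consumed; no memoryview is
        # created, so no wrap/unwrap cost is paid.
        return bytes(chunk)
    # Split case: slice through a memoryview so the tail is not copied.
    view = memoryview(chunk)
    queue.appendleft(view[n:])
    return bytes(view[:n])
```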

@phenylshima
Contributor Author

Ah I see, thank you for the followup!

Member

@pquentin pquentin left a comment


Thanks! LGTM. Illia showed me a (private for now) benchmark where this PR makes a drastic difference.

@illia-v illia-v merged commit 18af0a1 into urllib3:main Dec 1, 2025
42 checks passed
@phenylshima phenylshima deleted the byte-queue-perf branch December 2, 2025 22:43
Ousret added a commit to jawah/urllib3.future that referenced this pull request Dec 16, 2025
2.15.900 (2025-12-16)
=====================

- Improved pre-check for socket liveness probe before connection reuse
from pool.
- Backported "HTTPHeaderDict bytes key handling" from upstream
urllib3#3653
- Backported "Expand environment variable of SSLKEYLOGFILE" from
upstream urllib3#3705
- Backported "Fix redirect handling when an integer is passed to a pool
manager" from upstream urllib3#3655
- Backported "Improved the performance of content decoding by optimizing
``BytesQueueBuffer`` class." from upstream
urllib3#3711
- Backported "GHSA-gm62-xv2j-4w53" security patch for "attacker could
compose an HTTP response with virtually unlimited links in the
``Content-Encoding`` header" from upstream
urllib3@24d7b67


Successfully merging this pull request may close these issues.

Performance issue in BytesQueueBuffer.get() byte slicing
