`--rate-type=sweep` does not load vllm with expected count of requests #273

psydok · 2025-08-11T18:15:25Z


Describe the bug
I ran a sweep test to find the maximum throughput of the service. guidellm calculated 0.5 RPS (whereas vllm/benchmarks/benchmark_serving.py showed 0.7 RPS when iterating through concurrency levels).
I then opened the vLLM Grafana dashboard to observe the load on the service. guidellm reports 0.5-0.55 RPS, but the dashboard shows only 13-15 requests.

Expected behavior
0.5 RPS = 30 RPM. Shouldn't 30 requests be processed if the service holds 0.5 RPS?

Environment
Include all relevant environment information:

OS [e.g. Ubuntu 20.04]: -
Python version [e.g. 3.12.2]: 3.10
Docker-image: python:3.10-slim

To Reproduce
Exact steps to reproduce the behavior:

> pip install -U git+https://github.com/vllm-project/guidellm.git@1261fe81c57b07ed64333b5d50846699aa5307d4
> export GUIDELLM__PREFERRED_ROUTE="chat_completions" && export GUIDELLM__OPENAI__MAX_OUTPUT_TOKENS=512 && export GUIDELLM_MAX_REQUESTS=1000 && export GUIDELLM__REQUEST_TIMEOUT=600 && guidellm benchmark --target http://localhost:8000 --rate-type sweep --model Qwen/Qwen3-30B-A3B --processor Qwen/Qwen3-30B-A3B --random-seed 2025 --max-seconds 300 --data "prompt_tokens=4096,output_tokens=512,samples=1000" --backend-args '{"extra_body":{"chat_template_kwargs":{"enable_thinking":false}}}' --output-path "data/benchmarks.json"

Errors
guidellm:

vllm grafana:


Answered by sjmonson


sjmonson · 2025-08-11T22:12:57Z

Maintainer

For the row you highlighted in the first screenshot, keep these two things in mind:

constant@0.5 := Send 0.5 requests per second, i.e. one new request every 2 seconds
25.40 Lat := Each request takes 25.4 seconds on average to complete

Starting at the beginning of the test, the number of running requests increases by one every two seconds (1). The first request finishes after 25.4 seconds. From that point on, new requests are still added at 0.5 per second, while running requests finish at the same rate (assuming every request takes the same amount of time). In those first 25.4 seconds, 25.4/2 ≈ 12.7, i.e. roughly 12 to 13 requests were started. Assuming neither the request rate nor the per-request latency (2) changes, the number of running requests will stabilize around 12 or 13.
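The reasoning above is an instance of Little's law (requests in flight = arrival rate × latency). A minimal sketch of it follows, assuming evenly spaced arrivals and a fixed 25.4-second latency for every request. The helper `in_flight_at` is a hypothetical name for illustration and is not part of GuideLLM or vLLM.

```python
import math


def in_flight_at(t: float, rate: float, latency: float) -> int:
    """Count requests still running at time t (seconds).

    Idealized model (an assumption, not measured behavior): requests
    start at 0, 1/rate, 2/rate, ... seconds and each one takes exactly
    `latency` seconds to complete.
    """
    interval = 1.0 / rate
    # Requests started at or before t.
    started = math.floor(t / interval) + 1
    # Requests whose finish time (start + latency) is at or before t.
    finished = math.floor((t - latency) / interval) + 1 if t >= latency else 0
    return started - finished


rate, latency = 0.5, 25.4

# Little's law estimate of the steady-state number of running requests.
print("estimate:", rate * latency)

# Running requests during ramp-up and in steady state.
for t in (10, 25, 60, 120, 300):
    print(f"t={t:>3}s running={in_flight_at(t, rate, latency)}")
```

Under these assumptions the count climbs during the first 25.4 seconds and then stays at 12 or 13, which is close to the 13-15 requests seen in Grafana. Real latencies vary, so the observed number will fluctuate around that value.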

Another source of confusion is that there are two rates: input requests per second and output requests per second (responses per second). In GuideLLM, the number reported as Req is the output RPS, while constant@0.50 is the input rate. I believe the benchmark_serving.py RPS you were looking at is the output rate, since input RPS varies in a concurrency-driven test.
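To make the input/output distinction concrete, here is a rough sketch that counts started and completed requests under the same idealized model (evenly spaced arrivals, fixed 25.4-second latency). The functions `started_by` and `completed_by` are hypothetical helpers, and this is not how GuideLLM or benchmark_serving.py compute their reported numbers.

```python
import math


def started_by(t: float, rate: float) -> int:
    """Requests sent at or before time t, with sends at 0, 1/rate, 2/rate, ..."""
    return math.floor(t * rate) + 1


def completed_by(t: float, rate: float, latency: float) -> int:
    """Responses received at or before time t, assuming a fixed latency."""
    if t < latency:
        return 0
    return math.floor((t - latency) * rate) + 1


rate, latency, duration = 0.5, 25.4, 300

# Responses completed in one steady-state minute (between t=60s and t=120s).
per_minute = completed_by(120, rate, latency) - completed_by(60, rate, latency)
print("responses per steady-state minute:", per_minute)

# Totals over the whole run: the first `latency` seconds produce no responses.
sent = started_by(duration, rate)
done = completed_by(duration, rate, latency)
print("sent:", sent, "completed:", done)
print("input RPS:", sent / duration, "output RPS:", done / duration)
```

In this model the service does complete 30 requests per minute once it is in steady state, even though only 12 or 13 are running at any instant. Over a finite run, the completed count lags the sent count because the requests still in flight at the end have not produced a response yet.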

2 replies

psydok Aug 12, 2025
Author

Thank you so much for the explanation!

psydok Aug 12, 2025
Author

Do I understand correctly that this gives us the approximate concurrency? That is, the number of users we can serve with roughly the same metrics?

There is no information about concurrency in the HTML reports. Shouldn't it appear there?
