23 changes: 13 additions & 10 deletions README.md
@@ -2,9 +2,9 @@

> :warning: **Code in this repo is written for testing purposes and should not be used in production**

The Azure OpenAI Benchmarking tool is designed to aid customers in benchmarking their provisioned-throughput deployments. Provisioned throughput deployments provide a set amount of model compute, but the exact performance for your application depends on several variables, such as prompt size, generation size, and call rate.

The benchmarking tool provides a simple way to run test traffic on your deployment and validate the throughput for your traffic workloads. The script outputs key performance statistics, including average and 95th-percentile latencies and deployment utilization.

You can use this tool to experiment with total throughput at 100% utilization across different traffic patterns for a ```Provisioned-Managed``` deployment type. These tests allow you to better optimize your solution design by adjusting the prompt size, generation size, and PTUs deployed.

@@ -57,9 +57,9 @@ The table below provides an example prompt & generation size we have seen with s

Or see the [pre-configured shape-profiles below](#shape-profiles).

### Run samples

During a run, statistics are output every second to `stdout` while logs are output to `stderr`. Some metrics may not show up immediately due to lack of data.
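Because statistics and logs use separate streams, they can be captured independently. A minimal sketch of one way to do this (the output file name is hypothetical, and the `--deployment`/`--rate`/endpoint arguments are assumptions based on the sample runs below; `--output-format jsonl` mirrors `args.output_format` in `loadcmd.py`):

```python
# Sketch: write per-second stats (stdout) to a file while logs (stderr)
# stay on the console.
import subprocess
import sys

cmd = [
    sys.executable, "-m", "benchmark.bench", "load",
    "--deployment", "gpt-4",               # assumption: as in the sample run below
    "--rate", "60",                        # assumption: 60 RPM, as in the sample below
    "--output-format", "jsonl",            # emit stats as jsonl (see loadcmd.py)
    "https://myaccount.openai.azure.com",  # assumption: endpoint as positional argument
]

with open("stats.jsonl", "w") as stats_file:  # hypothetical output file
    subprocess.run(cmd, stdout=stats_file, check=True)  # stderr stays on the console
```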

**Run load test at 60 RPM with exponential retry back-off**

@@ -73,9 +73,9 @@ $ python -m benchmark.bench load \
2023-10-19 18:21:06 INFO using shape profile balanced: context tokens: 500, max tokens: 500
2023-10-19 18:21:06 INFO warming up prompt cache
2023-10-19 18:21:06 INFO starting load...
2023-10-19 18:21:06 rpm: 1.0 requests: 1 failures: 0 throttled: 0 ctx tpm: 501.0 gen tpm: 103.0 ttft avg: 0.736 ttft 95th: n/a tbt avg: 0.088 tbt 95th: n/a e2e avg: 1.845 e2e 95th: n/a util avg: 0.0% util 95th: n/a
2023-10-19 18:21:07 rpm: 5.0 requests: 5 failures: 0 throttled: 0 ctx tpm: 2505.0 gen tpm: 515.0 ttft avg: 0.937 ttft 95th: 1.321 tbt avg: 0.042 tbt 95th: 0.043 e2e avg: 1.223 e2e 95th: 1.658 util avg: 0.8% util 95th: 1.6%
2023-10-19 18:21:08 rpm: 8.0 requests: 8 failures: 0 throttled: 0 ctx tpm: 4008.0 gen tpm: 824.0 ttft avg: 0.913 ttft 95th: 1.304 tbt avg: 0.042 tbt 95th: 0.043 e2e avg: 1.241 e2e 95th: 1.663 util avg: 1.3% util 95th: 2.6%
```

**Load test with custom request shape**
@@ -120,7 +120,9 @@ The tool supports four different shape profiles via command line option `--shape
|`balanced`|[default] Balanced count of context and generation tokens. Should be representative of typical workloads.|500|500|
|`context`|Represents workloads with larger context sizes compared to generation. For example, chat assistants.|2000|200|
|`generation`|Represents workloads with larger generation and smaller contexts. For example, question answering.|500|1000|
|`custom`|Allows specifying custom values for context size (`--context-tokens`) and max generation tokens (`--max-tokens`).|||

Note: With the default prompting strategy, OpenAI models will typically return completions of at most 700-1200 tokens. If setting `max_tokens` above 750, be aware that the measured `rpm` may be higher, and `e2e` latency lower, than if the model returned completions of `max_tokens` in every response. Refer to the `gen_tokens` stats at the end of each run to see how many tokens were generated across responses.
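When capturing `--output-format jsonl` stats, this check is easy to script. A minimal sketch, assuming a hypothetical `stats.jsonl` capture and the `gen_tokens.avg` field emitted by `_dump` in `statsaggregator.py` below (the 0.9 threshold matches the aggregator's own warning):

```python
# Sketch: flag runs whose responses average well under the requested
# max_tokens, using the "gen_tokens": {"avg": ...} field from the jsonl stats.
import json

MAX_TOKENS = 500  # whatever was passed via --max-tokens

with open("stats.jsonl") as f:  # hypothetical capture of the per-second stats
    for line in f:
        stats = json.loads(line)
        avg = stats.get("gen_tokens", {}).get("avg")
        # avg is "n/a" until enough data has arrived, hence the isinstance check
        if isinstance(avg, int) and avg < 0.9 * MAX_TOKENS:
            print(f"avg tokens/response {avg} is below 90% of max_tokens={MAX_TOKENS}; "
                  "rpm and e2e figures may look better than a real workload")
```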

### Output fields

@@ -139,6 +141,7 @@ The tool supports four different shape profiles via command line option `--shape
|`ttft_95th`|95th percentile of time in seconds from the beginning of the request until the first token was received.|yes|`0.130`|
|`tbt_avg`|Average time in seconds between two consecutive generated tokens.|yes|`0.018`|
|`tbt_95th`|95th percentile of time in seconds between two consecutive generated tokens.|yes|`0.021`|
|`avg_gen_tokens`|Average number of generated tokens per model response.|yes|`509`|
|`e2e_avg`|Average end-to-end request time.|yes|`1.2`|
|`e2e_95th`|95th percentile of end-to-end request time.|yes|`1.5`|
|`util_avg`|Average deployment utilization percentage as reported by the service.|yes|`89.3%`|
@@ -162,8 +165,8 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additio

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.
28 changes: 14 additions & 14 deletions benchmark/loadcmd.py
@@ -23,11 +23,11 @@ class _RequestBuilder:
Wrapper iterator class to build request payloads.
"""
def __init__(self, model:str, context_tokens:int,
max_tokens:None,
completions:None,
frequency_penalty:None,
presence_penalty:None,
temperature:None,
top_p:None):
self.model = model
self.context_tokens = context_tokens
@@ -101,7 +101,7 @@ def load(args):
logging.info("starting load...")

_run_load(request_builder,
max_concurrency=args.clients,
api_key=api_key,
url=url,
rate_limiter=rate_limiter,
@@ -112,19 +113,20 @@
json_output=args.output_format=="jsonl")

def _run_load(request_builder: Iterable[dict],
max_concurrency: int,
api_key: str,
url: str,
rate_limiter=None,
backoff=False,
duration=None,
aggregation_duration=60,
request_count=None,
json_output=False):
aggregator = _StatsAggregator(
window_duration=aggregation_duration,
dump_duration=1,
clients=max_concurrency,
expected_gen_tokens=request_builder.max_tokens,
json_output=json_output)
requester = OAIRequester(api_key, url, backoff=backoff)

@@ -141,13 +142,13 @@ async def request_func(session:aiohttp.ClientSession):
print(e)

executer = AsyncHTTPExecuter(
request_func,
rate_limiter=rate_limiter,
max_concurrency=max_concurrency)

aggregator.start()
executer.run(
call_count=request_count,
duration=duration)
aggregator.stop()

@@ -219,4 +220,3 @@ def _validate(args):
raise ValueError("presence-penalty must be between -2.0 and 2.0")
if args.temperature is not None and (args.temperature < 0 or args.temperature > 2):
raise ValueError("temperature must be between 0 and 2.0")
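For context on what `max_concurrency` controls in `_run_load` above: the executer caps the number of in-flight requests, conceptually a semaphore-bounded task pool. A generic sketch of that idea (illustrative only, not the repo's `AsyncHTTPExecuter`):

```python
# Generic sketch of a concurrency-capped async request loop; this shows the
# max_concurrency semantics only and is not the repo's AsyncHTTPExecuter.
import asyncio

async def run_capped(request_func, call_count: int, max_concurrency: int):
    sem = asyncio.Semaphore(max_concurrency)

    async def one_call():
        async with sem:  # at most max_concurrency requests in flight at once
            await request_func()

    await asyncio.gather(*(one_call() for _ in range(call_count)))

# Usage (hypothetical coroutine): asyncio.run(run_capped(my_request, 100, 8))
```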

29 changes: 25 additions & 4 deletions benchmark/statsaggregator.py
@@ -6,6 +6,7 @@
import logging
import threading
import time
from typing import Optional

import numpy as np

@@ -29,7 +30,7 @@ def _values(self) -> [float]:
for entry in self.samples:
values.append(entry[1])
return values

def _len(self) -> int:
return len(self.samples)

@@ -56,17 +57,19 @@ class _StatsAggregator(threading.Thread):
generated_tokens = _Samples()
utilizations = _Samples()

def __init__(self, clients:int, dump_duration:float=5, window_duration:float=60, json_output=False, *args,**kwargs):
def __init__(self, clients:int, dump_duration:float=5, window_duration:float=60, expected_gen_tokens: Optional[int] = None, json_output=False, *args,**kwargs):
"""
:param clients: number of clients used in testing
:param dump_duration: duration in seconds to dump current aggregates.
:param window_duration: duration of the sliding window, in seconds, to consider for aggregation.
:param expected_gen_tokens: number of tokens expected in each response.
:param json_output: whether to dump periodic stats as json or human readable.
"""
self.clients = clients
self.dump_duration = dump_duration
self.json_output = json_output
self.window_duration = window_duration
self.expected_gen_tokens = expected_gen_tokens

super(_StatsAggregator, self).__init__(*args, **kwargs)
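A hypothetical construction showing how the new parameter is wired up (mirroring `_run_load` in `loadcmd.py` above, with illustrative values):

```python
# Hypothetical usage: loadcmd.py passes request_builder.max_tokens so the
# aggregator can warn when responses come in shorter than requested.
aggregator = _StatsAggregator(
    clients=8,                # number of concurrent test clients
    dump_duration=1,          # dump aggregates every second
    window_duration=60,       # sliding aggregation window in seconds
    expected_gen_tokens=500,  # request_builder.max_tokens in loadcmd.py
    json_output=False,
)
```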

@@ -113,7 +116,7 @@ def aggregate_request(self, stats: RequestStats):
f"request completed in {round(request_latency, 2)} seconds, while aggregation-window is {round(self.window_duration, 2)} "
"seconds, consider increasing aggregation-window to at least 2x your typical request latency."
)
)
self.request_timestamps._append(stats.request_start_time, stats.request_start_time)
self.response_latencies._append(stats.request_start_time, stats.response_time - stats.request_start_time)
self.first_token_latencies._append(stats.request_start_time, stats.first_token_time - stats.request_start_time)
@@ -138,13 +141,28 @@ def _dump(self):
tokens_per_minute += context_per_minute
if gen_per_minute != "n/a":
tokens_per_minute += gen_per_minute
avg_gen_tokens = int(np.sum(self.generated_tokens._values()) / self.generated_tokens._len()) if self.generated_tokens._len() > 0 else "n/a"
ttft_avg = round(np.average(self.first_token_latencies._values()), 3) if self.first_token_latencies._len() > 0 else "n/a"
ttft_95th = round(np.percentile(self.first_token_latencies._values(), 95), 3) if self.first_token_latencies._len() > 1 else "n/a"
tbt_avg = round(np.average(self.token_latencies._values()), 3) if self.token_latencies._len() > 0 else "n/a"
tbt_95th = round(np.percentile(self.token_latencies._values(), 95), 3) if self.token_latencies._len() > 1 else "n/a"
util_avg = f"{round(np.average(self.utilizations._values()), 1)}%" if self.utilizations._len() > 0 else "n/a"
util_95th = f"{round(np.percentile(self.utilizations._values(), 95), 1)}%" if self.utilizations._len() > 1 else "n/a"
rpm = round(60.0 * self.request_timestamps._len() / dynamic_window, 1) if self.request_timestamps._len() > 0 else "n/a"
# Periodically warn if generated TPR is consistently lower than requested, which can result in higher scores for RPM compared to reality
warning_period_secs = 10
if all((
run_seconds % warning_period_secs == 0,
self.expected_gen_tokens is not None,
isinstance(avg_gen_tokens, int)
)) and avg_gen_tokens < 0.9 * self.expected_gen_tokens:
logging.warning(
(
f"average tokens per response is {avg_gen_tokens}, compared to requested max_tokens of {self.expected_gen_tokens}."
" this may mean measured rpm is higher and e2e request latency is faster than in real-world workloads"
" (tpm, ttft & tbt stats will still be accurate)."
)
)
# Handle the 1x extra processing_request due to next request being queued
processing_requests_count = min(self.clients, self.processing_requests_count)
if self.json_output:
@@ -162,6 +180,9 @@
"gen": gen_per_minute,
"total": tokens_per_minute,
},
"gen_tokens": {
"avg": avg_gen_tokens,
},
"e2e": {
"avg": e2e_latency_avg,
"95th": e2e_latency_95th,
Expand All @@ -181,7 +202,7 @@ def _dump(self):
}
print(json.dumps(j), flush=True)
else:
print(f"{timestamp} rpm: {rpm:<5} processing: {processing_requests_count:<4} completed: {self.total_requests_count:<5} failures: {self.total_failed_count:<4} throttled: {self.throttled_count:<4} requests: {self.total_requests_count:<5} tpm: {tokens_per_minute:<6} ttft_avg: {ttft_avg:<6} ttft_95th: {ttft_95th:<6} tbt_avg: {tbt_avg:<6} tbt_95th: {tbt_95th:<6} e2e_avg: {e2e_latency_avg:<6} e2e_95th: {e2e_latency_95th:<6} util_avg: {util_avg:<6} util_95th: {util_95th:<6}", flush=True)
print(f"{timestamp} rpm: {rpm:<5} processing: {processing_requests_count:<4} completed: {self.total_requests_count:<5} failures: {self.total_failed_count:<4} throttled: {self.throttled_count:<4} requests: {self.total_requests_count:<5} tpm: {tokens_per_minute:<6} avg_gen_tokens: {avg_gen_tokens:<4} ttft_avg: {ttft_avg:<6} ttft_95th: {ttft_95th:<6} tbt_avg: {tbt_avg:<6} tbt_95th: {tbt_95th:<6} e2e_avg: {e2e_latency_avg:<6} e2e_95th: {e2e_latency_95th:<6} util_avg: {util_avg:<6} util_95th: {util_95th:<6}", flush=True)

def _slide_window(self):
with self.lock:
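The windowed averages and percentiles in `_dump` all follow the same pattern: keep timestamped samples, drop entries older than the window, then aggregate. A condensed, self-contained sketch of that pattern (simplified from `_Samples`, `_slide_window`, and `_dump` above):

```python
# Condensed sketch of the sliding-window aggregation used by _StatsAggregator.
import time
from typing import List, Tuple, Union

import numpy as np

class SlidingSamples:
    def __init__(self, window_duration: float = 60.0):
        self.window_duration = window_duration
        self.samples: List[Tuple[float, float]] = []  # (timestamp, value) pairs

    def append(self, value: float) -> None:
        self.samples.append((time.time(), value))

    def stats(self) -> Tuple[Union[float, str], Union[float, str]]:
        cutoff = time.time() - self.window_duration
        values = [v for ts, v in self.samples if ts >= cutoff]  # slide the window
        if not values:
            return "n/a", "n/a"  # mirrors the tool's placeholder output
        avg = round(float(np.average(values)), 3)
        # like _dump, require more than one sample before reporting a 95th percentile
        p95 = round(float(np.percentile(values, 95)), 3) if len(values) > 1 else "n/a"
        return avg, p95
```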
3 changes: 2 additions & 1 deletion requirements.txt
@@ -5,4 +5,5 @@ numpy
backoff
wonderwords
asyncio
aiohttp
typing