Commit d4dc127

include links to papers
1 parent 9f9f29a commit d4dc127

5 files changed: 16 additions and 4 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr

 - ### [Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models](src/inspect_evals/cybench)
   40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.
-  <sub><sup>Contributed by: [@sinman-aisi](https://github.com/sinman-aisi), [@sam-deverett-dsit](https://github.com/sam-deverett-dsit), [@kola-aisi](https://github.com/kola-aisi)</sub></sup>
+  <sub><sup>Contributed by: [@sinman-aisi](https://github.com/sinman-aisi), [@sam-deverett-dsit](https://github.com/sam-deverett-dsit), [@kola-aisi](https://github.com/kola-aisi), [@pgiav](https://github.com/pgiav)</sub></sup>
   ```
   inspect eval inspect_evals/cybench
   ```
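The `inspect eval` command shown in this hunk also accepts task parameters as `-T key=value` pairs, as the Cybench README does with `-T variants=solution`. As a rough sketch only, a hypothetical `cybench_command` helper (not part of Inspect or this repository) could assemble such an invocation; the parameter names `variants` and `max_attempts` are taken from the Cybench README.

```python
import shlex


def cybench_command(**task_params: object) -> list[str]:
    """Hypothetical helper: build an `inspect eval` argv with -T key=value pairs."""
    argv = ["inspect", "eval", "inspect_evals/cybench"]
    for key, value in task_params.items():
        # Each task parameter becomes its own "-T key=value" pair.
        argv += ["-T", f"{key}={value}"]
    return argv


cmd = cybench_command(variants="solution", max_attempts=3)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the result directly usable with `subprocess.run` without shell quoting concerns.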

docs/_scripts/prerender.py

Lines changed: 2 additions & 0 deletions
@@ -136,6 +136,8 @@ def create_front_matter(listing: dict[str, Any], sort_index: int) -> list[str]:
         readme_out.append(f" - \"{to_author_link(author)}\"")
     readme_out.append(f"code: {listing['path']}")
     readme_out.append(f"code-url: https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/{listing['path']}")
+    if "arxiv" in listing:
+        readme_out.append(f"arxiv: {listing['arxiv']}")
     readme_out.append(f"group: {listing['group']}")
     readme_out.append(f"order: {sort_index}")
     readme_out.append("---")
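This hunk emits the `arxiv` front-matter key only when a listing defines it, so entries without a paper link are unaffected. A minimal self-contained sketch of that pattern follows; `front_matter_tail` is a hypothetical, simplified stand-in for the end of `create_front_matter`, and the second listing is made up for illustration.

```python
from typing import Any


def front_matter_tail(listing: dict[str, Any], sort_index: int) -> list[str]:
    """Hypothetical, simplified stand-in for the end of create_front_matter."""
    out: list[str] = []
    out.append(f"code: {listing['path']}")
    # Optional key: only listings that define "arxiv" get a paper link.
    if "arxiv" in listing:
        out.append(f"arxiv: {listing['arxiv']}")
    out.append(f"group: {listing['group']}")
    out.append(f"order: {sort_index}")
    out.append("---")
    return out


with_paper = front_matter_tail(
    {
        "path": "src/inspect_evals/cybench",
        "group": "Cybersecurity",
        "arxiv": "https://arxiv.org/abs/2408.08926",
    },
    sort_index=1,
)
without_paper = front_matter_tail(
    {"path": "src/inspect_evals/example", "group": "Example"},  # made-up listing
    sort_index=2,
)
```

Checking membership with `in` rather than indexing directly avoids a `KeyError` for listings that have no `arxiv` field.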

docs/_templates/title-metadata.html

Lines changed: 9 additions & 0 deletions
@@ -56,6 +56,15 @@
     </div>
   $endif$

+  $if(arxiv)$
+    <div>
+      <div class="quarto-title-meta-heading">Paper</div>
+      <div class="quarto-title-meta-contents">
+        <p class="doi"><a href="$arxiv$">$arxiv$</a></p>
+      </div>
+    </div>
+  $endif$
+
   $if(date-modified)$
     <div>
       <div class="quarto-title-meta-heading">$labels.modified$</div>

src/inspect_evals/cybench/README.md

Lines changed: 2 additions & 2 deletions
@@ -3,7 +3,7 @@
 [Cybench](https://cybench.github.io/) is a benchmark for evaluating models' cybersecurity capabilities. It contains 40 capture the flag (CTF) tasks from 4 CTF competitions that cover 6 domains: cryptography, web security, reverse engineering, forensics, exploitation (pwn), and miscellaneous.

 <!-- Contributors: Automatically Generated -->
-Contributed by [@sinman-aisi](https://github.com/sinman-aisi), [@sam-deverett-dsit](https://github.com/sam-deverett-dsit), [@kola-aisi](https://github.com/kola-aisi)
+Contributed by [@sinman-aisi](https://github.com/sinman-aisi), [@sam-deverett-dsit](https://github.com/sam-deverett-dsit), [@kola-aisi](https://github.com/kola-aisi), [@pgiav](https://github.com/pgiav)
 <!-- /Contributors: Automatically Generated -->

 <!-- Usage: Automatically Generated -->
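The contributor line in this hunk is marked as automatically generated, and the same commit adds `pgiav` to the `contributors` list in `tools/listing.yaml`. A minimal sketch of how such a line could be derived from that list, using a hypothetical `contributor_line` helper (an assumption for illustration, not the repository's actual generator):

```python
def contributor_line(contributors: list[str]) -> str:
    """Hypothetical helper: render GitHub handles as a Markdown contributor line."""
    links = [f"[@{name}](https://github.com/{name})" for name in contributors]
    return "Contributed by " + ", ".join(links)


line = contributor_line(["sinman-aisi", "sam-deverett-dsit", "kola-aisi", "pgiav"])
print(line)
```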
@@ -62,7 +62,7 @@ You can specify a certain variant to run. For example, to run the solution varia
 inspect eval inspect_evals/cybench -T variants=solution
 ```

-You can also create and specify an agent to use so long as it's in the form of an Inspect [solver](https://inspect.ai-safety-institute.org.uk/solvers.html). See `default_agent` in [task.py](./task.py) for an example.
+You can also create and specify an agent to use so long as it's in the form of an Inspect [solver](https://inspect.ai-safety-institute.org.uk/solvers.html). See `default_agent` in [cybench.py](https://github.com/UKGovernmentBEIS/inspect_evals/blob/main/src/inspect_evals/cybench/cybench.py) for an example.

 There are two task parameters that define limits on the evaluation:
 - `max_attempts` defines the number of incorrect submissions to allow before ending the challenges (defaults to 3).

tools/listing.yaml

Lines changed: 2 additions & 1 deletion
@@ -61,7 +61,8 @@
     40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.
   path: src/inspect_evals/cybench
   group: Cybersecurity
-  contributors: ["sinman-aisi", "sam-deverett-dsit", "kola-aisi"]
+  contributors: ["sinman-aisi", "sam-deverett-dsit", "kola-aisi", "pgiav"]
+  arxiv: https://arxiv.org/abs/2408.08926
   tasks: ["cybench"]
   tags: ["Agent"]
