
Commit c59255a

Use lm-evaluation-harness from PyPI instead of our submodule
1 parent a07faff commit c59255a

File tree

2 files changed (+6, -10 lines)


3rdparty/lm-evaluation-harness

Submodule lm-evaluation-harness deleted from a352061

README.md

Lines changed: 6 additions & 9 deletions
@@ -101,23 +101,20 @@ Performance is expected to be comparable or better than other architectures trai…
 
 To run zero-shot evaluations of models (corresponding to Table 3 of the paper),
 we use the
-[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor)
+[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
 library.
 
-1. Pull the `lm-evaluation-harness` repo by `git submodule update --init
---recursive`. We use the `big-refactor` branch.
-2. Install `lm-evaluation-harness`: `pip install -e 3rdparty/lm-evaluation-harness`.
-On Python 3.10 you might need to manually install the latest version of `promptsource`: `pip install git+https://github.com/bigscience-workshop/promptsource.git`.
-3. Run evaluation with (more documentation at the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) repo):
+1. Install `lm-evaluation-harness` by `pip install lm-eval==0.4.2`.
+2. Run evaluation with (more documentation at the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) repo):
 ```
-python evals/lm_harness_eval.py --model mamba --model_args pretrained=state-spaces/mamba-130m --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande --device cuda --batch_size 64
+lm_eval --model mamba_ssm --model_args pretrained=state-spaces/mamba-130m --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa --device cuda --batch_size 256
 python evals/lm_harness_eval.py --model hf --model_args pretrained=EleutherAI/pythia-160m --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande --device cuda --batch_size 64
 ```
 
 To reproduce the results on the `mamba-2.8b-slimpj` model reported in the blogposts:
 ```
-python evals/lm_harness_eval.py --model mamba --model_args pretrained=state-spaces/mamba-2.8b-slimpj --tasks boolq,piqa,hellaswag,winogrande,arc_easy,arc_challenge,openbookqa,race,truthfulqa_mc2 --device cuda --batch_size 64
-python evals/lm_harness_eval.py --model mamba --model_args pretrained=state-spaces/mamba-2.8b-slimpj --tasks mmlu --num_fewshot 5 --device cuda --batch_size 64
+lm_eval --model mamba_ssm --model_args pretrained=state-spaces/mamba-2.8b-slimpj --tasks boolq,piqa,hellaswag,winogrande,arc_easy,arc_challenge,openbookqa,race,truthfulqa_mc2 --device cuda --batch_size 256
+lm_eval --model mamba_ssm --model_args pretrained=state-spaces/mamba-2.8b-slimpj --tasks mmlu --num_fewshot 5 --device cuda --batch_size 256
 ```
 
 Note that the result of each task might differ from reported values by 0.1-0.3 due to noise in the evaluation process.
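
The same zero-shot evaluation can also be driven from Python instead of the `lm_eval` CLI added above. A minimal sketch, assuming `lm-eval==0.4.2` and the `mamba_ssm` package are installed; the task list and batch size simply mirror the README command and are otherwise arbitrary:

```python
# Sketch: zero-shot evaluation of mamba-130m through lm-eval's Python API
# (lm-eval==0.4.2); equivalent in spirit to the `lm_eval` CLI call in the diff.
import lm_eval

results = lm_eval.simple_evaluate(
    model="mamba_ssm",                                 # Mamba wrapper registered by lm-eval
    model_args="pretrained=state-spaces/mamba-130m",
    tasks=["lambada_openai", "hellaswag", "piqa",
           "arc_easy", "arc_challenge", "winogrande", "openbookqa"],
    device="cuda",
    batch_size=256,                                    # reduce on smaller GPUs
)
print(results["results"])                              # per-task metrics (acc, acc_norm, ...)
```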
