
[DRAFT] prompt migration engine #808


Draft
wants to merge 38 commits into main

Commits
2570d16
added evaluator and formatter and main
heyjustinai Dec 4, 2024
08e41d0
add usage guide and init
heyjustinai Dec 4, 2024
a3e96e4
add engine and eval dataset
heyjustinai Dec 4, 2024
096249b
add .env settings and configure yml
heyjustinai Dec 4, 2024
263b8b5
placeholder readme
heyjustinai Dec 4, 2024
43a2cbc
adding eval dataset
heyjustinai Dec 5, 2024
b85811d
change eval dataset, include more robust judging, improved main
heyjustinai Dec 5, 2024
90d16cd
minor changes in eval, deleted formatter
heyjustinai Dec 5, 2024
4d75fe9
update dir
heyjustinai Jan 15, 2025
e52e1d1
updated prompt migration to use benchmark and also mipro, added meta …
heyjustinai Jan 16, 2025
1e4c6d2
update harness notebook
heyjustinai Jan 16, 2025
62b5367
update harness notebook
heyjustinai Jan 16, 2025
5730a84
beef up readme
heyjustinai Jan 16, 2025
314b6a8
added updated llama-mmlu-pro and added human-eva
heyjustinai Jan 17, 2025
2776a35
harness runcode
heyjustinai Jan 17, 2025
0bec41f
updated readme
heyjustinai Jan 21, 2025
03f2b8e
change gpu parallel size docs
heyjustinai Jan 21, 2025
becbe77
attempt to fix json output format in eval
heyjustinai Jan 22, 2025
a6f448f
<Replace this line with a title. Use 1 line only, 67 chars or less>
heyjustinai Jan 22, 2025
4fd5f29
revert to previous changes
heyjustinai Jan 22, 2025
eea9661
batching and parallelization, ran on baseline and lite
heyjustinai Jan 22, 2025
9ffb292
added inspect and modified harness
heyjustinai Jan 24, 2025
8d3a047
updated env file
heyjustinai Jan 27, 2025
e19b9e9
added fix split, gitignore and download mmlu script
heyjustinai Jan 29, 2025
21e04c2
update mmlu pro
heyjustinai Jan 29, 2025
dc406b4
setup meta-eval for benchmark, ray error
heyjustinai Jan 29, 2025
07b191b
Merge pull request #2 from pia-papanna/tools-refactory-chester
WuhanMonkey Jun 10, 2024
f8a6c7d
running mmlu pro with meta eval - fixed error
heyjustinai Jan 30, 2025
caeddcc
update utils
heyjustinai Jan 30, 2025
479b1fb
updated mmlu meta-eval for prompt migration
heyjustinai Jan 30, 2025
e1d64ca
update gitignore, added mmlu 0shot and ran a bunch of test
heyjustinai Jan 30, 2025
d214437
Stop tracking files in eval_results/meta-llama__Llama-3.3-70B-Instruct
heyjustinai Jan 30, 2025
d4638ba
updated gitignore
heyjustinai Jan 30, 2025
7a014b3
update readme
heyjustinai Jan 30, 2025
52c5a76
made changes to utils
heyjustinai Jan 30, 2025
423231e
updated mmlu and harness
heyjustinai Feb 3, 2025
3174e5b
code handoff
heyjustinai Feb 4, 2025
4768a41
merge utils, added configs and sample notebook to run optimizer
heyjustinai Feb 27, 2025
2 changes: 2 additions & 0 deletions .gitignore
@@ -3,3 +3,5 @@ __pycache__
 .ipynb_checkpoints
 wandb/
 artifacts/
+
+**/.env
2 changes: 2 additions & 0 deletions end-to-end-use-cases/benchmarks/llm_eval_harness/.gitignore
@@ -0,0 +1,2 @@
**/eval_results/**
**/old_eval_results/**
@@ -1,9 +1,9 @@
-model_name: "meta-llama/Llama-3.1-8B-Instruct" # The name of the model to evaluate. This must be a valid Meta Llama 3 based model name in the HuggingFace model hub."
+model_name: "meta-llama/Llama-3.3-70B-Instruct" # The name of the model to evaluate. This must be a valid Meta Llama 3 based model name in the HuggingFace model hub."

-evals_dataset: "meta-llama/Llama-3.1-8B-Instruct-evals" # The name of the 3.1 evals dataset to evaluate, please make sure this eval dataset corresponds to the model loaded. This must be a valid dataset name in the Llama 3.x Evals collection.
+evals_dataset: "meta-llama/Llama-3.1-70B-Instruct-evals" # The name of the 3.1 evals dataset to evaluate, please make sure this eval dataset corresponds to the model loaded. This must be a valid dataset name in the Llama 3.x Evals collection.
 # Must be one of the following ["meta-llama/Llama-3.1-8B-Instruct-evals","meta-llama/Llama-3.1-70B-Instruct-evals","meta-llama/Llama-3.1-405B-Instruct-evals","meta-llama/Llama-3.1-8B-evals","meta-llama/Llama-3.1-70B-evals","meta-llama/Llama-3.1-405B-evals","meta-llama/Llama-3.2-1B-evals","meta-llama/Llama-3.2-3B-evals", "meta-llama/Llama-3.2-1B-Instruct-evals", "meta-llama/Llama-3.2-3B-Instruct-evals"]

-tasks: "meta_instruct" # Available tasks for 3.1 instruct model: "meta_math_hard", "meta_gpqa_cot", "meta_mmlu_pro_instruct", "meta_ifeval"; or just use "meta_instruct" to run all of them.
+tasks: "meta_mmlu_pro_instruct" # Available tasks for 3.1 instruct model: "meta_math_hard", "meta_gpqa_cot", "meta_mmlu_pro_instruct", "meta_ifeval"; or just use "meta_instruct" to run all of them.
 # Available tasks for 3.1 pretrain model: "meta_bbh", "meta_mmlu_pro_pretrain"; or just use "meta_pretrain" to run all of them.
 # Available tasks for 3.2 instruct model: "meta_mmlu", "meta_math", "meta_gpqa"; or just use "meta_instruct" to run all of them.
 # Available tasks for 3.2 pretrain model: "meta_mmlu"; or just use "meta_pretrain" to run all of them
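
For context, this appears to be the meta_eval eval_config.yaml. A minimal sketch, not part of the PR, of sanity-checking the configured evals dataset before launching the harness; the local config path, the dataset config name ("Llama-3.1-70B-Instruct-evals__mmlu_pro__details"), and the "latest" split are assumptions based on the dataset_name/test_split values used in the task templates below.

import yaml
from datasets import load_dataset

# Hypothetical local path to the config shown in the diff above.
with open("eval_config.yaml") as f:
    cfg = yaml.safe_load(f)

# Assumed dataset config name and split, mirroring the dataset_name /
# test_split values in the meta task templates.
ds = load_dataset(
    cfg["evals_dataset"],
    name="Llama-3.1-70B-Instruct-evals__mmlu_pro__details",
    split="latest",
)
print(cfg["model_name"], cfg["tasks"], len(ds))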
@@ -0,0 +1,28 @@
dataset_path: meta-llama/Llama-3.1-70B-evals
dataset_name: Llama-3.1-70B-evals__bbh__details
task: meta_bbh
output_type: generate_until
process_docs: !function utils.process_docs
test_split: latest
doc_to_text: !function utils.doc_to_text
doc_to_target: answer
filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        regex_pattern: 'the answer is (.*?)\.'
      - function: "take_first"
generation_kwargs:
  until: "\n\nQ: "
  do_sample: false
  temperature: 0
  max_gen_toks: 512
num_fewshot: 0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 1.0
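
A quick illustration, assuming this mirrors the harness's regex/take_first filtering, of what the strict-match filter above extracts from a BBH generation before exact_match scoring:

import re

# Toy generation; the filter searches for the first "the answer is ...." span
# and take_first keeps that single capture for exact_match scoring.
generation = "Let's work through it step by step. So the answer is (B)."
match = re.search(r'the answer is (.*?)\.', generation)
extracted = match.group(1) if match else ""
print(extracted)  # (B)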
@@ -0,0 +1,21 @@
import random
import re

import datasets


def doc_to_text(doc: dict) -> str:
    return doc["input_final_prompts"][0]


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc: dict) -> dict:
        out_doc = {
            "problem": doc["input_question"],
            "answer": doc["input_correct_responses"][0],
        }
        return out_doc

    dataset = dataset.select_columns(["input_question", "input_correct_responses", "input_final_prompts", "is_correct", "input_question_hash", "output_prediction_text"])
    dataset = dataset.rename_column("is_correct", "previously_is_correct")
    return dataset.map(_process_doc)
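
A small usage sketch, not in the PR, showing what process_docs produces on a toy record shaped like the Llama evals rows; all field values are made up.

from datasets import Dataset

# Toy record with the same columns process_docs selects; values are invented.
toy = Dataset.from_list([{
    "input_question": "What is 2 + 2?",
    "input_correct_responses": ["4"],
    "input_final_prompts": ["Q: What is 2 + 2?\nA:"],
    "is_correct": True,
    "input_question_hash": "abc123",
    "output_prediction_text": ["4"],
}])
processed = process_docs(toy)
# The map adds "problem"/"answer" columns; is_correct is kept as
# previously_is_correct.
print(processed[0]["problem"], processed[0]["answer"])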
@@ -0,0 +1,29 @@
dataset_path: meta-llama/Llama-3.1-70B-Instruct-evals
dataset_name: Llama-3.1-70B-Instruct-evals__gpqa__details
task: meta_gpqa
output_type: generate_until
process_docs: !function utils.process_docs
test_split: latest
doc_to_text: !function utils.doc_to_text
doc_to_target: gold
filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        group_select: -1
        regex_pattern: ' ([A-Z])'
      - function: "take_first"
generation_kwargs:
  until: []
  do_sample: false
  temperature: 0
  max_gen_toks: 2048
num_fewshot: 0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 1.0
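
Illustrative only: as I read the harness config, group_select: -1 keeps the last regex match, so the filter above effectively takes the final standalone capital letter in the generation. A plain-Python equivalent:

import re

# Toy generation: several standalone capital letters appear, and the last
# one is treated as the model's choice.
generation = "Both A and B are plausible, but the correct option is C"
matches = re.findall(r' ([A-Z])', generation)
extracted = matches[-1] if matches else ""
print(extracted)  # C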
@@ -0,0 +1,19 @@
import random
import re

import datasets


def doc_to_text(doc: dict) -> str:
    return doc["input_final_prompts"][0]


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc: dict) -> dict:
        out_doc = {
            "problem": doc["input_question"],
            "gold": doc["input_correct_responses"][0],
        }
        return out_doc

    dataset = dataset.select_columns(["input_question", "input_correct_responses", "input_final_prompts", "is_correct", "input_question_hash", "input_choice_list", "output_prediction_text"])
    dataset = dataset.rename_column("is_correct", "previously_is_correct")
    return dataset.map(_process_doc)
@@ -0,0 +1,29 @@
dataset_path: meta-llama/Llama-3.1-70B-Instruct-evals
dataset_name: Llama-3.1-70B-Instruct-evals__gpqa__details
task: meta_gpqa_cot
output_type: generate_until
process_docs: !function utils.process_docs
test_split: latest
doc_to_text: !function utils.doc_to_text
doc_to_target: gold
filter_list:
  - name: "strict-match"
    filter:
      - function: "regex"
        group_select: -1
        regex_pattern: 'best answer is ([A-Z])'
      - function: "take_first"
generation_kwargs:
  until: []
  do_sample: false
  temperature: 0
  max_gen_toks: 2048
num_fewshot: 0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 1.0
@@ -0,0 +1,20 @@
import random
import re

import datasets


def doc_to_text(doc: dict) -> str:
    return doc["input_final_prompts"][0]


def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc: dict) -> dict:
        out_doc = {
            "problem": doc["input_question"],
            "gold": doc["input_correct_responses"][0],
        }
        return out_doc

    dataset = dataset.select_columns(["input_question", "input_correct_responses", "input_final_prompts", "is_correct", "input_question_hash", "input_choice_list", "output_prediction_text"])
    dataset = dataset.rename_column("is_correct", "previously_is_correct")
    return dataset.map(_process_doc)
@@ -0,0 +1,32 @@
task: meta_ifeval
dataset_path: parquet
dataset_kwargs:
  data_files: ./work_dir/joined_ifeval.parquet
output_type: generate_until
test_split: train
num_fewshot: 0
doc_to_text: prompt
doc_to_target: 0
generation_kwargs:
  until: []
  do_sample: false
  temperature: 0.0
  max_gen_toks: 1280
process_results: !function utils.process_results
metric_list:
  - metric: prompt_level_strict_acc
    aggregation: mean
    higher_is_better: true
  - metric: inst_level_strict_acc
    aggregation: !function utils.agg_inst_level_acc
    higher_is_better: true
  - metric: prompt_level_loose_acc
    aggregation: mean
    higher_is_better: true
  - metric: inst_level_loose_acc
    aggregation: !function utils.agg_inst_level_acc
    higher_is_better: true
metadata:
  version: 2.0
fewshot_config:
  sampler: first_n
@@ -0,0 +1,139 @@
import dataclasses
from typing import Dict, Optional, Union

from lm_eval.tasks.ifeval import instructions_registry


@dataclasses.dataclass
class InputExample:
    key: int
    instruction_id_list: list[str]
    prompt: str
    kwargs: list[Dict[str, Optional[Union[str, int]]]]


@dataclasses.dataclass
class OutputExample:
    instruction_id_list: list[str]
    prompt: str
    response: str
    follow_all_instructions: bool
    follow_instruction_list: list[bool]


def test_instruction_following_strict(
    inp,
    response,
):
    """Tests response to see if instructions are followed."""
    instruction_list = inp.instruction_id_list
    is_following_list = []

    for index, instruction_id in enumerate(instruction_list):
        instruction_cls = instructions_registry.INSTRUCTION_DICT[instruction_id]
        instruction = instruction_cls(instruction_id)

        # Remove None values from kwargs to avoid unexpected keyword argument errors in build_description method.
        kwargs = {k: v for k, v in inp.kwargs[index].items() if v}
        instruction.build_description(**kwargs)
        args = instruction.get_instruction_args()
        if args and "prompt" in args:
            instruction.build_description(prompt=inp.prompt)

        if response.strip() and instruction.check_following(response):
            is_following_list.append(True)
        else:
            is_following_list.append(False)

    return OutputExample(
        instruction_id_list=inp.instruction_id_list,
        prompt=inp.prompt,
        response=response,
        follow_all_instructions=all(is_following_list),
        follow_instruction_list=is_following_list,
    )


def test_instruction_following_loose(
    inp,
    response,
):
    """Tests response for an upper bound for following instructions."""
    r = response.split("\n")
    response_remove_first = "\n".join(r[1:]).strip()
    response_remove_last = "\n".join(r[:-1]).strip()
    response_remove_both = "\n".join(r[1:-1]).strip()
    revised_response = response.replace("*", "")
    revised_response_remove_first = response_remove_first.replace("*", "")
    revised_response_remove_last = response_remove_last.replace("*", "")
    revised_response_remove_both = response_remove_both.replace("*", "")
    all_responses = [
        response,
        revised_response,
        response_remove_first,
        response_remove_last,
        response_remove_both,
        revised_response_remove_first,
        revised_response_remove_last,
        revised_response_remove_both,
    ]
    instruction_list = inp.instruction_id_list
    is_following_list = []

    for index, instruction_id in enumerate(instruction_list):
        instruction_cls = instructions_registry.INSTRUCTION_DICT[instruction_id]
        instruction = instruction_cls(instruction_id)

        # Remove None values from kwargs to avoid unexpected keyword argument errors in build_description method.
        kwargs = {k: v for k, v in inp.kwargs[index].items() if v}
        instruction.build_description(**kwargs)
        args = instruction.get_instruction_args()
        if args and "prompt" in args:
            instruction.build_description(prompt=inp.prompt)

        is_following = False
        for r in all_responses:
            if r.strip() and instruction.check_following(r):
                is_following = True
                break

        is_following_list.append(is_following)

    return OutputExample(
        instruction_id_list=inp.instruction_id_list,
        prompt=inp.prompt,
        response=response,
        follow_all_instructions=all(is_following_list),
        follow_instruction_list=is_following_list,
    )


def process_results(doc, results):
    new_kwargs = []
    for item in doc["kwargs"]:
        if item["nth_paragraph"]:
            item["nth_paragraph"] = int(item["nth_paragraph"])
        new_kwargs.append(item)
    inp = InputExample(
        key=doc["key"],
        instruction_id_list=doc["instruction_id_list"],
        prompt=doc["prompt"],
        kwargs=new_kwargs,
    )
    response = results[0]

    out_strict = test_instruction_following_strict(inp, response)
    out_loose = test_instruction_following_loose(inp, response)

    return {
        "prompt_level_strict_acc": out_strict.follow_all_instructions,
        "inst_level_strict_acc": out_strict.follow_instruction_list,
        "prompt_level_loose_acc": out_loose.follow_all_instructions,
        "inst_level_loose_acc": out_loose.follow_instruction_list,
    }


def agg_inst_level_acc(items):
    flat_items = [item for sublist in items for item in sublist]
    inst_level_acc = sum(flat_items) / len(flat_items)
    return inst_level_acc
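
A tiny worked example, not in the PR, of how these pieces compose: process_results emits per-prompt booleans plus per-instruction boolean lists, and agg_inst_level_acc flattens the latter into a single accuracy.

# Two documents' inst_level results: doc 1 followed 1 of 2 instructions,
# doc 2 followed its single instruction.
per_doc_inst_level = [
    [True, False],
    [True],
]
print(agg_inst_level_acc(per_doc_inst_level))  # 2/3 = 0.666...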
@@ -0,0 +1,21 @@
dataset_path: parquet
dataset_kwargs:
  data_files: ./work_dir/joined_math.parquet
task: meta_math
process_docs: !function utils.process_docs
output_type: generate_until
test_split: train
doc_to_text: !function utils.doc_to_text
process_results: !function utils.process_results
doc_to_target: answer
generation_kwargs:
  until: []
  do_sample: false
  temperature: 0
  max_gen_toks: 512
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
@@ -0,0 +1,21 @@
dataset_path: parquet
dataset_kwargs:
  data_files: ./work_dir/joined_math_hard.parquet
task: meta_math_hard
process_docs: !function utils.process_docs
output_type: generate_until
test_split: train
doc_to_text: !function utils.doc_to_text
process_results: !function utils.process_results
doc_to_target: answer
generation_kwargs:
  until: []
  do_sample: false
  temperature: 0
  max_gen_toks: 5120
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
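
A minimal sketch, not part of the PR, that mirrors the dataset_path/dataset_kwargs/test_split settings above to load the locally built parquet for a quick look; it assumes joined_math_hard.parquet has already been produced by the meta_eval preprocessing step.

from datasets import load_dataset

# Same loader settings as the task config: parquet builder, local data file,
# and the "train" split used as test_split above.
ds = load_dataset(
    "parquet",
    data_files="./work_dir/joined_math_hard.parquet",
    split="train",
)
print(len(ds), ds.column_names)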