This is the code and data for the paper: Language Models Can Teach Themselves to Program Better
https://arxiv.org/abs/2207.14502

LICENSE
MIT License - as already specified in the ../LICENSE file of the PythonProgrammingPuzzles repo

GPU USAGE
GPU usage was large, especially for the 2.7B model, which costs roughly 20x as much to run as the 125M.
Data generation takes the most GPU time and took about 2500 GPU hours for 2.7B (on V100).
Finetuning on the 1M generated samples took about 40 GPU hours per epoch for 2.7B (on V100) - 10 epochs = 400 GPU hours.
Solving the 228-problem testset with 100 attempts using the finetuned 2.7B model took about 4 hours (on V100).
We mostly used V100s, but we used whatever was available, so sometimes T4s and A100s if they were free.
We tried everything at 125M first - debugged there until it worked well - then rolled out the 1.3B and 2.7B jobs.

DATASETS
The data directory contains the datasets used. We feel the most interesting dataset is data/Codex_PAPER_1M_iter_0.txt,
which was generated by Codex and gave the best results when finetuned on. All the datasets are part of our public release.
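If you want a quick look at the finetuning data before training, a couple of shell one-liners are enough (paths assume you are in the repo root - adjust if you keep the data elsewhere):
# peek at the start of the file and count how many lines it has
head -c 2000 data/Codex_PAPER_1M_iter_0.txt
wc -l data/Codex_PAPER_1M_iter_0.txt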

SETUP
src/requirements.txt is what we install on our cluster machines - the cluster comes with NVIDIA drivers and a matching PyTorch.
./requirements.txt is what I personally have installed on my local machine and have tested runs - but it has lots of packages you don't need.
So try src/requirements.txt first - and if that doesn't work, ./requirements.txt lists the versions of everything installed on my machine.
Getting a deepspeed 0.6.1 install that matches your PyTorch and NVIDIA driver was tricky for me on some machines; torch 1.10 and 1.11 both work.
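For a fresh environment, something like the following should work - a minimal sketch; the exact torch/CUDA pairing depends on your driver, so treat it as a starting point rather than a guaranteed install:
# create an isolated environment and install the cluster requirements
python -m venv venv && source venv/bin/activate
pip install -r src/requirements.txt
# if deepspeed/torch/driver versions clash, fall back to the exact versions from my local machine
pip install -r requirements.txt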

GENERATING/FINETUNING -> run "cd src && ./babysit.sh GPU_INDEX_TO_USE" -> GPU_INDEX_TO_USE=0 typically
src/babysit.sh is the script that generates data and then finetunes on that data in a loop, finetuning the GPT-Neo 125M/1.3B/2.7B models.
In src/babysit.sh, TEST_LOCAL=1 runs locally on the machine's GPUs, which is great for fast testing; TEST_LOCAL=0 launches on the cluster, which is slower to start but has lots of GPUs.
Realistically you have to train on a cluster - data generation takes a long time, so having lots of machines all generating data is the feasible approach.
But given enough time this will run locally on 1 GPU: roughly 1 year for 2.7B, or 2 weeks for 125M.
We found that generating 75K samples after deduping worked for iteration 0 - finetune on that data.
Then, using that finetuned model in iteration 1, generating data goes much faster - the finetuned model solves many more problems.
Repeating that process works well; a rough sketch of one round of the loop is shown after this section.
On 125M we compared training only on 125M-generated data from iter_0 versus iter_1 versus iter_2, generating 600K samples for each iteration.
Finetuning on iter_2 data was best on the testset: 26.9/228 solved, vs iter_1 = 26.1/228 and iter_0 = 22.2/228.
With 1M samples of 125M-generated data sampled across all the iterations 0,1,2 we got 26.75/228.
We understand why it's faster to generate iter_2 data with a finetuned model - it solves more problems.
But why are the generated puzzles & solutions better for training the model on?
We will explore that more in the future - and try iterating a lot further than 3 iterations - although our preliminary experiments on 125M show it tops out at 3 iterations.
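
To make the loop concrete, here is a rough sketch of one round of what babysit.sh automates, written with the scripts used elsewhere in this README. The GPT-Neo -model_path value, the omission of -model_path_solve, and the exact sample count are illustrative assumptions - read babysit.sh for the real sequence:
# 1) generate and verify puzzles/solutions with the current model (iteration 0 uses the base 125M model)
python gen.py -n=32 -max_tokens=4096 -model_path=EleutherAI/gpt-neo-125M -out=../data/125M_PAPER/iter_0 -seed=2022
# 2) turn the generated puzzle/solution data into a finetuning file and copy it next to the other datasets
python preprocess.py ../data/125M_PAPER/iter_0 125M_PAPER_25K_iter_0.txt 8 False 25000 False -seed=2022
cp ../data/125M_PAPER/iter_0/125M_PAPER_25K_iter_0.txt ../data/125M_PAPER_25K_iter_0.txt
# 3) finetune on that data, then repeat from step 1 with the finetuned checkpoint for iter_1, iter_2, ...
./fine_tune1.sh 0 125M ft1_125M_PAPER_25K_iter_0 125M_PAPER_25K_iter_0.txt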

FINETUNING ONLY -> run "cd src && ./fine_tune1.sh GPU_INDEX_TO_USE" -> GPU_INDEX_TO_USE=0 typically
# ./fine_tune1.sh GPU MODEL_TO_TRAIN EXPERIMENT_NAME_DIRECTORY TRAIN_DATA EPOCHS
This allows repeated finetuning on a specific dataset.
Use this to do a temperature grid search, or to try different parameter variations on a specific dataset (see the sketch below).
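
For example, a minimal sketch of a temperature grid search on one dataset: finetune once, then solve the testset at several fixed temperatures. The -model_path and -out values below are illustrative placeholders; the other solve.py flags are the ones used elsewhere in this README:
# finetune 125M on the verified Codex data once
./fine_tune1.sh 0 125M ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
# then sweep the sampling temperature on the 228-problem testset
for T in 0.4 0.6 0.8 1.0; do
  python solve.py -prefix=../data/train_prefix.txt -attempts=100 -model_path=<finetuned_checkpoint_dir> -gpu=0 -fixed_temp=$T -out=../data/temp_sweep_$T -puzzles=../data/test_228.json -seed=2022 -batch_size=64
done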

Detailed instructions for reproducing experiments:
# Generating Codex data
python gen.py -n=32 -max_tokens=4096 -model_path=openai/code-davinci-002 -model_path_solve=openai/code-cushman-001 -out=../data/codex/iter_0 -seed=2022

# Measuring Codex accuracy via API calls
./solve2.sh
python solve.py -prefix=../data/train_prefix.txt -attempts=1 -model_path=openai/code-cushman-001 -gpu=0 -fixed_temp=0.8 -out=../data/codex -puzzles=../data/test_228.json -seed=2022 -batch_size=64

# Producing verified Codex_PAPER_1M_iter_0.txt from the old-style puzzle/solution data generated by Codex
python preprocess.py -path=../data/codex/old_verified -f_name=Codex_PAPER_1M_iter_0.txt -max_sols_per_puzzle=8 -old_style_json=True -max_examples=1000000 -include_failures=False -seed=2022
cp ../data/codex/old_verified/Codex_PAPER_1M_iter_0.txt ../data/Codex_PAPER_1M_iter_0.txt

# Producing unverified Codex_unverified_PAPER_1M_iter_0.txt from the old-style puzzle/solution data generated by Codex
python preprocess.py -path=../data/codex/old_unverified -f_name=Codex_unverified_PAPER_1M_iter_0.txt -max_sols_per_puzzle=8 -old_style_json=True -max_examples=1000000 -include_failures=True -seed=2022
cp ../data/codex/old_unverified/Codex_unverified_PAPER_1M_iter_0.txt ../data/Codex_unverified_PAPER_1M_iter_0.txt

# Producing 125M_PAPER_25K_iter_0.txt from the new-style puzzle/solution data
python preprocess.py ../data/125M_PAPER/iter_0 125M_PAPER_25K_iter_0.txt 8 False 25000 False -seed=2022
cp ../data/125M_PAPER/iter_0/125M_PAPER_25K_iter_0.txt ../data/125M_PAPER_25K_iter_0.txt

# Producing 125M_PAPER_1M_iter_1.txt from the new-style puzzle/solution data
python preprocess.py ../data/125M_PAPER/iter_1 125M_PAPER_1M_iter_1.txt 8 False 1000000 False -seed=2022
cp ../data/125M_PAPER/iter_1/125M_PAPER_1M_iter_1.txt ../data/125M_PAPER_1M_iter_1.txt

# Producing 125M_PAPER_1M_iter_2.txt from the new-style puzzle/solution data
python preprocess.py ../data/125M_PAPER/iter_2 125M_PAPER_1M_iter_2.txt 8 False 1000000 False -seed=2022
cp ../data/125M_PAPER/iter_2/125M_PAPER_1M_iter_2.txt ../data/125M_PAPER_1M_iter_2.txt

# Producing 13B_PAPER_25K_iter_0.txt from the new-style puzzle/solution data
python preprocess.py ../data/13B_PAPER/iter_0 13B_PAPER_25K_iter_0.txt 8 False 25000 False -seed=2022
cp ../data/13B_PAPER/iter_0/13B_PAPER_25K_iter_0.txt ../data/13B_PAPER_25K_iter_0.txt

# Producing 13B_PAPER_1M_iter_1.txt from the new-style puzzle/solution data
python preprocess.py ../data/13B_PAPER/iter_1 13B_PAPER_1M_iter_1.txt 8 False 1000000 False -seed=2022
cp ../data/13B_PAPER/iter_1/13B_PAPER_1M_iter_1.txt ../data/13B_PAPER_1M_iter_1.txt

# Producing 13B_PAPER_1M_iter_2.txt from the new-style puzzle/solution data
python preprocess.py ../data/13B_PAPER/iter_2 13B_PAPER_1M_iter_2.txt 8 False 1000000 False -seed=2022
cp ../data/13B_PAPER/iter_2/13B_PAPER_1M_iter_2.txt ../data/13B_PAPER_1M_iter_2.txt

# Producing 27B_PAPER_25K_iter_0.txt from the new-style puzzle/solution data
python preprocess.py ../data/27B_PAPER/iter_0 27B_PAPER_25K_iter_0.txt 8 False 25000 False -seed=2022
cp ../data/27B_PAPER/iter_0/27B_PAPER_25K_iter_0.txt ../data/27B_PAPER_25K_iter_0.txt

# Producing 27B_PAPER_1M_iter_1.txt from the new-style puzzle/solution data
python preprocess.py ../data/27B_PAPER/iter_1 27B_PAPER_1M_iter_1.txt 8 False 1000000 False -seed=2022
cp ../data/27B_PAPER/iter_1/27B_PAPER_1M_iter_1.txt ../data/27B_PAPER_1M_iter_1.txt

# Producing 27B_PAPER_1M_iter_2.txt from the new-style puzzle/solution data
python preprocess.py ../data/27B_PAPER/iter_2 27B_PAPER_1M_iter_2.txt 8 False 1000000 False -seed=2022
cp ../data/27B_PAPER/iter_2/27B_PAPER_1M_iter_2.txt ../data/27B_PAPER_1M_iter_2.txt
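
The nine GPT-Neo dataset commands above differ only in model name, iteration, and sample count, so a short bash loop produces the same files - a sketch, assuming the directory layout matches the commands above:
# regenerate all nine GPT-Neo training files (25K at iter_0, 1M at iter_1/iter_2)
for M in 125M 13B 27B; do
  for IT in 0 1 2; do
    if [ "$IT" = "0" ]; then SIZE=25K; N=25000; else SIZE=1M; N=1000000; fi
    F=${M}_PAPER_${SIZE}_iter_${IT}.txt
    python preprocess.py ../data/${M}_PAPER/iter_${IT} $F 8 False $N False -seed=2022
    cp ../data/${M}_PAPER/iter_${IT}/$F ../data/$F
  done
done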

# Data files produced by babysit.sh - generating data from gpt-neo-* and Codex
# At the time we ran the experiments, Codex wasn't finetunable, so only iteration 0 Codex data was available
Codex_PAPER_1M_iter_0.txt
125M_PAPER_25K_iter_0.txt
13B_PAPER_25K_iter_0.txt
27B_PAPER_25K_iter_0.txt
125M_PAPER_1M_iter_1.txt
13B_PAPER_1M_iter_1.txt
27B_PAPER_1M_iter_1.txt
125M_PAPER_1M_iter_2.txt
13B_PAPER_1M_iter_2.txt
27B_PAPER_1M_iter_2.txt

# Figure 5 - 3 diagrams - showing the 3 GPT-Neo models trained on verified Codex data vs unverified Codex data vs baseline
# 5a GPT-Neo 125M
./fine_tune1.sh 0 125M ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
./fine_tune1.sh 0 125M ft1_Codex_unverified_PAPER_1M_iter_0 Codex_unverified_PAPER_1M_iter_0.txt
./solve1.sh 0 125M 10 228
# 5b GPT-Neo 1.3B (named 13B in files/scripts)
./fine_tune1.sh 0 13B ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
./fine_tune1.sh 0 13B ft1_Codex_unverified_PAPER_1M_iter_0 Codex_unverified_PAPER_1M_iter_0.txt
./solve1.sh 0 13B 10 228 5
# 5c GPT-Neo 2.7B (named 27B in files/scripts)
./fine_tune1.sh 0 27B ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
./fine_tune1.sh 0 27B ft1_Codex_unverified_PAPER_1M_iter_0 Codex_unverified_PAPER_1M_iter_0.txt
./solve1.sh 0 27B 10 228 5

# Figure 6 - 3 diagrams - showing test228 Pass@k for the 3 GPT-Neo models trained on data from 4 generators (Codex and the 3 GPT-Neo models) plus the baseline
# 6a - GPT-Neo 125M trained on 4 different datasets, plus baseline
# ./fine_tune1.sh 0 125M ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt (dupe of 5a)
./fine_tune1.sh 0 125M ft1_125M_PAPER_1M_iter_2 125M_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 125M ft1_13B_PAPER_1M_iter_2 13B_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 125M ft1_27B_PAPER_1M_iter_2 27B_PAPER_1M_iter_2.txt

# 6b - GPT-Neo 1.3B trained on 4 different datasets, plus baseline
# ./fine_tune1.sh 0 13B ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt (dupe of 5b)
./fine_tune1.sh 0 13B ft1_125M_PAPER_1M_iter_2 125M_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 13B ft1_13B_PAPER_1M_iter_2 13B_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 13B ft1_27B_PAPER_1M_iter_2 27B_PAPER_1M_iter_2.txt

# 6c - GPT-Neo 2.7B trained on 4 different datasets, plus baseline
# ./fine_tune1.sh 0 27B ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt (dupe of 5c)
./fine_tune1.sh 0 27B ft1_125M_PAPER_1M_iter_2 125M_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 27B ft1_13B_PAPER_1M_iter_2 13B_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 27B ft1_27B_PAPER_1M_iter_2 27B_PAPER_1M_iter_2.txt

# Launch on torch2020 - edit solve.yaml to set the correct model and epoch parameters
./tst_human_eval_base.sh 0 125M 1024
./tst_human_eval_ft1.sh 0 125M 1024
./tst_human_eval_ft5.sh 0 125M 1024
./tst_human_eval_ft10.sh 0 125M 1024
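
If you want to run the base checkpoint plus the epoch-1/5/10 finetuned checkpoints across all three model sizes, a small loop covers it - a sketch, assuming the scripts accept the same arguments for the larger models:
for MODEL in 125M 13B 27B; do
  for SUFFIX in base ft1 ft5 ft10; do
    ./tst_human_eval_${SUFFIX}.sh 0 $MODEL 1024
  done
done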