This is the code and data for the paper: Language Models Can Teach Themselves to Program Better
https://arxiv.org/abs/2207.14502

LICENSE
MIT License - as already specified in the ../LICENSE file of the PythonProgrammingPuzzles repo

GPU USAGE
GPU usage was large, especially for the 2.7B model, which costs roughly 20x as much to run as the 125M.
Data generation takes the most GPU time and took about 2500 GPU hours for 2.7B (on V100).
Finetuning on the 1M generated samples took about 40 GPU hours per epoch for 2.7B (on V100) - 10 epochs = 400 GPU hours.
Solving the 228-problem testset with 100 attempts using the finetuned 2.7B model took about 4 hours (on V100).
We mostly used V100s, but we used whatever was available, so sometimes T4s and A100s if they were free.
We tried everything at 125M first - debugged there until it worked well - then rolled out the 1.3B and 2.7B jobs.

DATASETS
The data directory contains the datasets used. We feel the most interesting dataset is data/Codex_PAPER_1M_iter_0.txt,
which was generated by Codex and gave the best results when finetuned on. All the datasets are part of our public release.
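If you want a quick look at the finetuning data before training, a couple of shell one-liners are enough (paths assume you are in the repo root - adjust if you keep the data elsewhere):
# peek at the start of the file and count how many lines it has
head -c 2000 data/Codex_PAPER_1M_iter_0.txt
wc -l data/Codex_PAPER_1M_iter_0.txt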

SETUP
src/requirements.txt is what we install on our cluster machines - the cluster comes with NVIDIA drivers and a matching PyTorch.
./requirements.txt is what I personally have installed on my local machine and have tested runs - but it has lots of packages you don't need.
So try src/requirements.txt first - and if that doesn't work, ./requirements.txt lists the versions of everything installed on my machine.
Getting a deepspeed 0.6.1 install that matches your PyTorch and NVIDIA driver was tricky for me on some machines; torch 1.10 and 1.11 both work.
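For a fresh environment, something like the following should work - a minimal sketch; the exact torch/CUDA pairing depends on your driver, so treat it as a starting point rather than a guaranteed install:
# create an isolated environment and install the cluster requirements
python -m venv venv && source venv/bin/activate
pip install -r src/requirements.txt
# if deepspeed/torch/driver versions clash, fall back to the exact versions from my local machine
pip install -r requirements.txt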

GENERATING/FINETUNING -> run "cd src && ./babysit.sh GPU_INDEX_TO_USE" -> GPU_INDEX_TO_USE=0 typically
src/babysit.sh is the script that generates data and then finetunes on that data in a loop, finetuning the GPT-Neo 125M/1.3B/2.7B models.
In src/babysit.sh, TEST_LOCAL=1 runs locally on the machine's GPUs, which is great for fast testing; TEST_LOCAL=0 launches on the cluster, which is slower to start but has lots of GPUs.
Realistically you have to train on a cluster - data generation takes a long time, so having lots of machines all generating data is the feasible approach.
But given enough time this will run locally on 1 GPU: roughly 1 year for 2.7B, or 2 weeks for 125M.
We found that generating 75K samples after deduping worked for iteration 0 - finetune on that data.
Then, using that finetuned model in iteration 1, generating data goes much faster - the finetuned model solves many more problems.
Repeating that process works well; a rough sketch of one round of the loop is shown after this section.
On 125M we compared training only on 125M-generated data from iter_0 versus iter_1 versus iter_2, generating 600K samples for each iteration.
Finetuning on iter_2 data was best on the testset: 26.9/228 solved, vs iter_1 = 26.1/228 and iter_0 = 22.2/228.
With 1M samples of 125M-generated data sampled across all the iterations 0,1,2 we got 26.75/228.
We understand why it's faster to generate iter_2 data with a finetuned model - it solves more problems.
But why are the generated puzzles & solutions better for training the model on?
We will explore that more in the future - and try iterating a lot further than 3 iterations - although our preliminary experiments on 125M show it tops out at 3 iterations.
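
To make the loop concrete, here is a rough sketch of one round of what babysit.sh automates, written with the scripts used elsewhere in this README. The GPT-Neo -model_path value, the omission of -model_path_solve, and the exact sample count are illustrative assumptions - read babysit.sh for the real sequence:
# 1) generate and verify puzzles/solutions with the current model (iteration 0 uses the base 125M model)
python gen.py -n=32 -max_tokens=4096 -model_path=EleutherAI/gpt-neo-125M -out=../data/125M_PAPER/iter_0 -seed=2022
# 2) turn the generated puzzle/solution data into a finetuning file and copy it next to the other datasets
python preprocess.py ../data/125M_PAPER/iter_0 125M_PAPER_25K_iter_0.txt 8 False 25000 False -seed=2022
cp ../data/125M_PAPER/iter_0/125M_PAPER_25K_iter_0.txt ../data/125M_PAPER_25K_iter_0.txt
# 3) finetune on that data, then repeat from step 1 with the finetuned checkpoint for iter_1, iter_2, ...
./fine_tune1.sh 0 125M ft1_125M_PAPER_25K_iter_0 125M_PAPER_25K_iter_0.txt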

FINETUNING ONLY -> run "cd src && ./fine_tune1.sh GPU_INDEX_TO_USE" -> GPU_INDEX_TO_USE=0 typically
# ./fine_tune1.sh GPU MODEL_TO_TRAIN EXPERIMENT_NAME_DIRECTORY TRAIN_DATA EPOCHS
This allows repeated finetuning on a specific dataset.
Use this to do a temperature grid search, or to try different parameter variations on a specific dataset (see the sketch below).
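
For example, a minimal sketch of a temperature grid search on one dataset: finetune once, then solve the testset at several fixed temperatures. The -model_path and -out values below are illustrative placeholders; the other solve.py flags are the ones used elsewhere in this README:
# finetune 125M on the verified Codex data once
./fine_tune1.sh 0 125M ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
# then sweep the sampling temperature on the 228-problem testset
for T in 0.4 0.6 0.8 1.0; do
  python solve.py -prefix=../data/train_prefix.txt -attempts=100 -model_path=<finetuned_checkpoint_dir> -gpu=0 -fixed_temp=$T -out=../data/temp_sweep_$T -puzzles=../data/test_228.json -seed=2022 -batch_size=64
done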

Detailed instructions for reproducing experiments:
# Generating Codex data
python gen.py -n=32 -max_tokens=4096 -model_path=openai/code-davinci-002 -model_path_solve=openai/code-cushman-001 -out=../data/codex/iter_0 -seed=2022

# Measuring Codex accuracy via API calls
./solve2.sh
python solve.py -prefix=../data/train_prefix.txt -attempts=1 -model_path=openai/code-cushman-001 -gpu=0 -fixed_temp=0.8 -out=../data/codex -puzzles=../data/test_228.json -seed=2022 -batch_size=64

# Producing verified Codex_PAPER_1M_iter_0.txt from the old-style puzzle/solution data generated by Codex
python preprocess.py -path=../data/codex/old_verified -f_name=Codex_PAPER_1M_iter_0.txt -max_sols_per_puzzle=8 -old_style_json=True -max_examples=1000000 -include_failures=False -seed=2022
cp ../data/codex/old_verified/Codex_PAPER_1M_iter_0.txt ../data/Codex_PAPER_1M_iter_0.txt

# Producing unverified Codex_unverified_PAPER_1M_iter_0.txt from the old-style puzzle/solution data generated by Codex
python preprocess.py -path=../data/codex/old_unverified -f_name=Codex_unverified_PAPER_1M_iter_0.txt -max_sols_per_puzzle=8 -old_style_json=True -max_examples=1000000 -include_failures=True -seed=2022
cp ../data/codex/old_unverified/Codex_unverified_PAPER_1M_iter_0.txt ../data/Codex_unverified_PAPER_1M_iter_0.txt

# Producing 125M_PAPER_25K_iter_0.txt from the new-style puzzle/solution data
python preprocess.py ../data/125M_PAPER/iter_0 125M_PAPER_25K_iter_0.txt 8 False 25000 False -seed=2022
cp ../data/125M_PAPER/iter_0/125M_PAPER_25K_iter_0.txt ../data/125M_PAPER_25K_iter_0.txt

# Producing 125M_PAPER_1M_iter_1.txt from the new-style puzzle/solution data
python preprocess.py ../data/125M_PAPER/iter_1 125M_PAPER_1M_iter_1.txt 8 False 1000000 False -seed=2022
cp ../data/125M_PAPER/iter_1/125M_PAPER_1M_iter_1.txt ../data/125M_PAPER_1M_iter_1.txt

# Producing 125M_PAPER_1M_iter_2.txt from the new-style puzzle/solution data
python preprocess.py ../data/125M_PAPER/iter_2 125M_PAPER_1M_iter_2.txt 8 False 1000000 False -seed=2022
cp ../data/125M_PAPER/iter_2/125M_PAPER_1M_iter_2.txt ../data/125M_PAPER_1M_iter_2.txt

# Producing 13B_PAPER_25K_iter_0.txt from the new-style puzzle/solution data
python preprocess.py ../data/13B_PAPER/iter_0 13B_PAPER_25K_iter_0.txt 8 False 25000 False -seed=2022
cp ../data/13B_PAPER/iter_0/13B_PAPER_25K_iter_0.txt ../data/13B_PAPER_25K_iter_0.txt

# Producing 13B_PAPER_1M_iter_1.txt from the new-style puzzle/solution data
python preprocess.py ../data/13B_PAPER/iter_1 13B_PAPER_1M_iter_1.txt 8 False 1000000 False -seed=2022
cp ../data/13B_PAPER/iter_1/13B_PAPER_1M_iter_1.txt ../data/13B_PAPER_1M_iter_1.txt

# Producing 13B_PAPER_1M_iter_2.txt from the new-style puzzle/solution data
python preprocess.py ../data/13B_PAPER/iter_2 13B_PAPER_1M_iter_2.txt 8 False 1000000 False -seed=2022
cp ../data/13B_PAPER/iter_2/13B_PAPER_1M_iter_2.txt ../data/13B_PAPER_1M_iter_2.txt

# Producing 27B_PAPER_25K_iter_0.txt from the new-style puzzle/solution data
python preprocess.py ../data/27B_PAPER/iter_0 27B_PAPER_25K_iter_0.txt 8 False 25000 False -seed=2022
cp ../data/27B_PAPER/iter_0/27B_PAPER_25K_iter_0.txt ../data/27B_PAPER_25K_iter_0.txt

# Producing 27B_PAPER_1M_iter_1.txt from the new-style puzzle/solution data
python preprocess.py ../data/27B_PAPER/iter_1 27B_PAPER_1M_iter_1.txt 8 False 1000000 False -seed=2022
cp ../data/27B_PAPER/iter_1/27B_PAPER_1M_iter_1.txt ../data/27B_PAPER_1M_iter_1.txt

# Producing 27B_PAPER_1M_iter_2.txt from the new-style puzzle/solution data
python preprocess.py ../data/27B_PAPER/iter_2 27B_PAPER_1M_iter_2.txt 8 False 1000000 False -seed=2022
cp ../data/27B_PAPER/iter_2/27B_PAPER_1M_iter_2.txt ../data/27B_PAPER_1M_iter_2.txt
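
The nine GPT-Neo dataset commands above differ only in model name, iteration, and sample count, so a short bash loop produces the same files - a sketch, assuming the directory layout matches the commands above:
# regenerate all nine GPT-Neo training files (25K at iter_0, 1M at iter_1/iter_2)
for M in 125M 13B 27B; do
  for IT in 0 1 2; do
    if [ "$IT" = "0" ]; then SIZE=25K; N=25000; else SIZE=1M; N=1000000; fi
    F=${M}_PAPER_${SIZE}_iter_${IT}.txt
    python preprocess.py ../data/${M}_PAPER/iter_${IT} $F 8 False $N False -seed=2022
    cp ../data/${M}_PAPER/iter_${IT}/$F ../data/$F
  done
done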

# Data files produced by babysit.sh - generating data from gpt-neo-* and Codex
# At the time we ran the experiments, Codex wasn't finetunable, so only iteration 0 Codex data was available
Codex_PAPER_1M_iter_0.txt
125M_PAPER_25K_iter_0.txt
13B_PAPER_25K_iter_0.txt
27B_PAPER_25K_iter_0.txt
125M_PAPER_1M_iter_1.txt
13B_PAPER_1M_iter_1.txt
27B_PAPER_1M_iter_1.txt
125M_PAPER_1M_iter_2.txt
13B_PAPER_1M_iter_2.txt
27B_PAPER_1M_iter_2.txt

# Figure 5 - 3 diagrams - showing the 3 GPT-Neo models trained on verified Codex data vs unverified Codex data vs baseline
# 5a GPT-Neo 125M
./fine_tune1.sh 0 125M ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
./fine_tune1.sh 0 125M ft1_Codex_unverified_PAPER_1M_iter_0 Codex_unverified_PAPER_1M_iter_0.txt
./solve1.sh 0 125M 10 228
# 5b GPT-Neo 1.3B (named 13B in files/scripts)
./fine_tune1.sh 0 13B ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
./fine_tune1.sh 0 13B ft1_Codex_unverified_PAPER_1M_iter_0 Codex_unverified_PAPER_1M_iter_0.txt
./solve1.sh 0 13B 10 228 5
# 5c GPT-Neo 2.7B (named 27B in files/scripts)
./fine_tune1.sh 0 27B ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt
./fine_tune1.sh 0 27B ft1_Codex_unverified_PAPER_1M_iter_0 Codex_unverified_PAPER_1M_iter_0.txt
./solve1.sh 0 27B 10 228 5

# Figure 6 - 3 diagrams - showing test228 Pass@k for the 3 GPT-Neo models trained on data from 4 generators (Codex and the 3 GPT-Neo models) plus the baseline
# 6a - GPT-Neo 125M trained on 4 different datasets, plus baseline
# ./fine_tune1.sh 0 125M ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt (dupe of 5a)
./fine_tune1.sh 0 125M ft1_125M_PAPER_1M_iter_2 125M_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 125M ft1_13B_PAPER_1M_iter_2 13B_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 125M ft1_27B_PAPER_1M_iter_2 27B_PAPER_1M_iter_2.txt

# 6b - GPT-Neo 1.3B trained on 4 different datasets, plus baseline
# ./fine_tune1.sh 0 13B ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt (dupe of 5b)
./fine_tune1.sh 0 13B ft1_125M_PAPER_1M_iter_2 125M_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 13B ft1_13B_PAPER_1M_iter_2 13B_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 13B ft1_27B_PAPER_1M_iter_2 27B_PAPER_1M_iter_2.txt

# 6c - GPT-Neo 2.7B trained on 4 different datasets, plus baseline
# ./fine_tune1.sh 0 27B ft1_Codex_PAPER_1M_iter_0 Codex_PAPER_1M_iter_0.txt (dupe of 5c)
./fine_tune1.sh 0 27B ft1_125M_PAPER_1M_iter_2 125M_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 27B ft1_13B_PAPER_1M_iter_2 13B_PAPER_1M_iter_2.txt
./fine_tune1.sh 0 27B ft1_27B_PAPER_1M_iter_2 27B_PAPER_1M_iter_2.txt

# Launch on torch2020 - edit solve.yaml to set the correct model and epoch parameters
./tst_human_eval_base.sh 0 125M 1024
./tst_human_eval_ft1.sh 0 125M 1024
./tst_human_eval_ft5.sh 0 125M 1024
./tst_human_eval_ft10.sh 0 125M 1024
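
If you want to run the base checkpoint plus the epoch-1/5/10 finetuned checkpoints across all three model sizes, a small loop covers it - a sketch, assuming the scripts accept the same arguments for the larger models:
for MODEL in 125M 13B 27B; do
  for SUFFIX in base ft1 ft5 ft10; do
    ./tst_human_eval_${SUFFIX}.sh 0 $MODEL 1024
  done
done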