Description
Hi,
I was wondering can you provide the index_mapping files that is generated by the GPT2Dataset? From the construction of gpt2dataset at here, I can see there are three npy
index files
```python
doc_idx_filename = _filename + "_doc_idx.npy"
sample_idx_filename = _filename + "_sample_idx.npy"
shuffle_idx_filename = _filename + "_shuffle_idx.npy"
```
Could you provide a copy of these files so that I don't need to regenerate them?
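For reference, this is a minimal sketch of how I imagine loading those files, assuming they are ordinary NumPy arrays saved with np.save; the prefix below is hypothetical (the real one appears to be built from the data path, split name, number of samples, sequence length, and seed):

```python
import numpy as np

# Hypothetical prefix; substitute the actual _filename prefix produced at build time.
prefix = "pile_text_document_train_indexmap"

doc_idx = np.load(prefix + "_doc_idx.npy", allow_pickle=True, mmap_mode="r")
sample_idx = np.load(prefix + "_sample_idx.npy", allow_pickle=True, mmap_mode="r")
shuffle_idx = np.load(prefix + "_shuffle_idx.npy", allow_pickle=True, mmap_mode="r")

print("doc_idx:", doc_idx.shape)          # document ordering across epochs
print("sample_idx:", sample_idx.shape)    # assumed (num_samples + 1, 2): doc position, token offset
print("shuffle_idx:", shuffle_idx.shape)  # permutation of sample indices
```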
I am making this request because I want to study the influence of the original training data, chunk by chunk. I have prepared the pythia-dedup dataset, but I was unable to build the environment. After reading the GPT2Dataset code, I found that with these index files I could reproduce the original training data of Pythia.
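To make concrete what I mean by reproducing the data, here is a rough sketch of how I read GPT2Dataset.__getitem__ as mapping a global sample index back to document ids and token offsets. The helper name sample_doc_span is my own, and I am assuming sample_idx has shape (num_samples + 1, 2):

```python
import numpy as np

def sample_doc_span(i, doc_idx, sample_idx, shuffle_idx):
    """Return the document ids and token offsets covered by training sample i.

    Sketch based on my reading of GPT2Dataset.__getitem__: shuffle_idx permutes
    sample indices, consecutive rows of sample_idx give each sample's starting
    document position and token offset, and doc_idx maps those positions back
    to document ids in the underlying indexed dataset.
    """
    j = shuffle_idx[i]                # de-shuffle the sample index
    doc_f, off_f = sample_idx[j]      # first document position and starting offset
    doc_l, off_l = sample_idx[j + 1]  # last document position and ending offset
    # Document ids this sample draws tokens from; if doc_f == doc_l the sample
    # is a single slice [off_f, off_l] of one document, otherwise it spans
    # the tail of the first document, any middle documents, and the head of the last.
    docs = [int(doc_idx[d]) for d in range(doc_f, doc_l + 1)]
    return docs, int(off_f), int(off_l)
```

If this matches the actual indexing, iterating i from 0 to num_samples - 1 should recover the exact token sequences in the order they were seen during training.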
I noticed that you provide batch_viewer.py to inspect the unshuffled data, but that data still seems to differ from the actual training data fed into the model during training.
Thanks