Description
Hi,
I was wondering can you provide the index_mapping files that is generated by the GPT2Dataset? From the construction of gpt2dataset at here, I can see there are three npy
index files
```python
doc_idx_filename = _filename + "_doc_idx.npy"
sample_idx_filename = _filename + "_sample_idx.npy"
shuffle_idx_filename = _filename + "_shuffle_idx.npy"
```
Could you provide a copy of these files so that I don't need to regenerate them?
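For reference, this is a minimal sketch of how I imagine loading those files, assuming they are ordinary NumPy arrays saved with np.save; the prefix below is hypothetical (the real one appears to be built from the data path, split name, number of samples, sequence length, and seed):

```python
import numpy as np

# Hypothetical prefix; substitute the actual _filename prefix produced at build time.
prefix = "pile_text_document_train_indexmap"

doc_idx = np.load(prefix + "_doc_idx.npy", allow_pickle=True, mmap_mode="r")
sample_idx = np.load(prefix + "_sample_idx.npy", allow_pickle=True, mmap_mode="r")
shuffle_idx = np.load(prefix + "_shuffle_idx.npy", allow_pickle=True, mmap_mode="r")

print("doc_idx:", doc_idx.shape)          # document ordering across epochs
print("sample_idx:", sample_idx.shape)    # assumed (num_samples + 1, 2): doc position, token offset
print("shuffle_idx:", shuffle_idx.shape)  # permutation of sample indices
```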
I am making this request because I want to study the influence of the original training data, chunk by chunk. I have prepared the pythia-dedup dataset, but I was unable to build the environment. After reading the GPT2Dataset code, I found that with these index files I could reproduce the original training data of Pythia.
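To make concrete what I mean by reproducing the data, here is a rough sketch of how I read GPT2Dataset.__getitem__ as mapping a global sample index back to document ids and token offsets. The helper name sample_doc_span is my own, and I am assuming sample_idx has shape (num_samples + 1, 2):

```python
import numpy as np

def sample_doc_span(i, doc_idx, sample_idx, shuffle_idx):
    """Return the document ids and token offsets covered by training sample i.

    Sketch based on my reading of GPT2Dataset.__getitem__: shuffle_idx permutes
    sample indices, consecutive rows of sample_idx give each sample's starting
    document position and token offset, and doc_idx maps those positions back
    to document ids in the underlying indexed dataset.
    """
    j = shuffle_idx[i]                # de-shuffle the sample index
    doc_f, off_f = sample_idx[j]      # first document position and starting offset
    doc_l, off_l = sample_idx[j + 1]  # last document position and ending offset
    # Document ids this sample draws tokens from; if doc_f == doc_l the sample
    # is a single slice [off_f, off_l] of one document, otherwise it spans
    # the tail of the first document, any middle documents, and the head of the last.
    docs = [int(doc_idx[d]) for d in range(doc_f, doc_l + 1)]
    return docs, int(off_f), int(off_l)
```

If this matches the actual indexing, iterating i from 0 to num_samples - 1 should recover the exact token sequences in the order they were seen during training.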
I noticed that you provide batch_viewer.py to inspect the unshuffled data, but that data still seems to differ from the actual training data fed into the model during training.
Thanks