Speed up synthetic data sampling #383

dagrayvid · 2025-09-30T21:40:33Z

Summary

Refactor synthetic data generation so that it runs faster on large datasets (thousands of prompts, each with thousands of tokens).

Basic outline of how it works now:

Tokenize the whole dataset and save it locally in npy format as an array of token IDs under XDG_CACHE_HOME.
The cache name is unique to the tokenizer and source file context.
Sample randomly from this array of token IDs. There is no need to search for text of the right length, since the data is already an array of token IDs.
Convert the selected arrays back to text.
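
As a rough illustration of the outline above, here is a minimal sketch, not the code in this PR. `ToyTokenizer`, `cache_path`, `load_token_ids`, `sample_prompts`, and the `synthetic_data_sketch` directory name are all hypothetical. The sketch assumes the cache key is a hash of the tokenizer name and the source text, and that each prompt is a random contiguous window of the token array. A real tokenizer would replace the whitespace-based stand-in.

```python
import hashlib
import os
from pathlib import Path

import numpy as np


class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: one ID per whitespace-separated word."""

    name = "toy-whitespace"

    def __init__(self, text):
        self.vocab = sorted(set(text.split()))
        self.index = {word: i for i, word in enumerate(self.vocab)}

    def encode(self, text):
        return [self.index[word] for word in text.split()]

    def decode(self, ids):
        return " ".join(self.vocab[i] for i in ids)


def cache_path(tokenizer_name, source_text):
    # Assumed naming scheme: hash the tokenizer name together with the source text,
    # so a different tokenizer or a different source gets a different cache file.
    base = Path(os.environ.get("XDG_CACHE_HOME") or Path.home() / ".cache")
    key = f"{tokenizer_name}\0{source_text}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()[:16]
    return base / "synthetic_data_sketch" / f"{digest}.npy"


def load_token_ids(tokenizer, source_text):
    # Tokenize once; later calls read the cached .npy array instead.
    path = cache_path(tokenizer.name, source_text)
    if path.exists():
        return np.load(path)
    token_ids = np.asarray(tokenizer.encode(source_text), dtype=np.int64)
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, token_ids)
    return token_ids


def sample_prompts(tokenizer, token_ids, num_prompts, prompt_tokens, seed=0):
    # Slice random windows of exactly prompt_tokens IDs, then decode each one.
    if prompt_tokens > len(token_ids):
        raise ValueError("prompt_tokens exceeds the number of cached tokens")
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(token_ids) - prompt_tokens + 1, size=num_prompts)
    return [tokenizer.decode(token_ids[s : s + prompt_tokens]) for s in starts]
```

Calling `load_token_ids` once and then `sample_prompts` repeatedly keeps tokenization out of the sampling loop. Each prompt costs one array slice and one decode, with no search over the source text for a passage of the right length.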

In terms of performance, we can now generate 10,000 prompts with 5,000 tokens each in under 30 seconds, whereas this previously took more than 10 minutes.

None?

Includes AI-assisted code completion
Includes code generated by an AI application
Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

Signed-off-by: David Whyte-Gray <40244437+dagrayvid@users.noreply.github.com>

markurtz · 2025-10-01T12:11:54Z

@dagrayvid can you take a look at the new refactor (#384), specifically the data pipelines rework, and see what is missing and what it would take to adapt this on top of it?

Speed up synthetic data sampling

57df288

Signed-off-by: David Whyte-Gray <40244437+dagrayvid@users.noreply.github.com>

dagrayvid requested review from sjmonson and markurtz September 30, 2025 21:41