DataJuicer

🍊 Welcome to Data-Juicer!

Open, modular data engineering for every intelligent application

We’re the Data-Juicer team: a group of researchers, engineers, and open-source enthusiasts from Tongyi Lab and the data intelligence community. We believe great AI starts with great data, but today’s data development is hard to reuse, fragmented across models, and far from intelligent.

So we’re building Data-Juicer: a community-driven ecosystem that makes data processing simpler, more composable, and more valuable, whether you’re working on:

🧠 LLMs / VLMs / pre-training / post-tuning
🤖 Agents & autonomous systems
📊 BI, document intelligence, or knowledge extraction
🚗 Embodied AI, autopilot, or simulation
🧬 AI for science, healthcare, finance, and beyond

🌍 A Modular Ecosystem, Built Together

As real-world use has grown across academia and industry, we’re reimagining Data-Juicer as a more open ecosystem:

✨ Modular by design: Use only what you need, such as the hundreds of operators, the data-model co-development sandbox, or the data agents
⚡️ Unmatched efficiency: Accelerate your data workflows with good scalability and deep optimization
🧩 Easy to extend: Add your own operators, data recipes, or new domain-specific modules
🤝 Open to all: Students, data hackers, researchers, startups, and enterprises — all are welcome

🔧 We’re currently migrating from a monolith to lighter repos under this org. This is the beginning of a more flexible, community-driven future.

🚀 How to Start & Join

Try it now: start your journey from the portal repo.
Contributing & Acknowledgements: We love contributions! We are grateful for all feedback, from code and docs to bug reports and ideas. See the list of our amazing contributors (such as those from Alibaba PAI, ModelScope, NVIDIA, Ray, ...) and join us.
⭐ Star this org and its repos to show your support and help us grow!

Built with ❤️ by the Data-Juicer Team, the community and you.
No matter your background or scale — if you care about data, you belong here.
