We’re the Data-Juicer team: a group of researchers, engineers, and open-source enthusiasts from Tongyi Lab and the data intelligence community. We believe great AI starts with great data, but today’s data development is hard to reuse, fragmented across models, and far from intelligent.
So we’re building Data-Juicer: a community-driven ecosystem that makes data processing simpler, more composable, and more valuable, whether you’re working on:
- 🧠 LLMs / VLMs / pre-training / post-tuning
- 🤖 Agents & autonomous systems
- 📊 BI, document intelligence, or knowledge extraction
- 🚗 Embodied AI, autopilot, or simulation
- 🧬 AI for science, healthcare, finance, and beyond
Following growing real-world use across academia and industry, we’re reimagining Data-Juicer as a more open ecosystem:
- ✨ Modular by design: Use only what you need, such as the hundreds of operators, the data-model co-development sandbox, or the data agents
- ⚡️ Unmatched efficiency: Accelerate your data workflows with good scalability and deep optimization
- 🧩 Easy to extend: Add your own operators, data recipes, or domain-specific and entirely new modules
- 🤝 Open to all: Students, data hackers, researchers, startups, and enterprises — all are welcome
🔧 We’re currently migrating from a monolith to lighter repos under this org. This is the beginning of a more flexible, community-driven future.
- Try it now: start your journey from the portal repo.
- Contributing & Acknowledgements: We love contributions! We are grateful for every contribution, from code and docs to bug reports and ideas. See the list of our amazing contributors (such as those from Alibaba PAI, ModelScope, NVIDIA, Ray, ...) and join us.
- ⭐ Star this org and its repos to show your support and help us grow!
Built with ❤️ by the Data-Juicer Team, the community, and you.
No matter your background or scale — if you care about data, you belong here.