DataJuicer

Data processing for and with large models.

🍊 Welcome to Data-Juicer!

Open, modular data engineering for every intelligent application

We’re the Data-Juicer team: a group of researchers, engineers, and open-source enthusiasts from Tongyi Lab and the data intelligence community. We believe great AI starts with great data, but today’s data development is hard to reuse, fragmented across models, and far from intelligent.

So we’re building Data-Juicer: a community-driven ecosystem to make data processing simpler, more composable, and more valuable, whether you’re working on:

  • 🧠 LLMs / VLMs / pre-training / post-tuning
  • 🤖 Agents & autonomous systems
  • 📊 BI, document intelligence, or knowledge extraction
  • 🚗 Embodied AI, autopilot, or simulation
  • 🧬 AI for science, healthcare, finance, and beyond

🌍 A Modular Ecosystem, Built Together

Following growing real-world use across academia and industry, we’re reimagining Data-Juicer as a more open ecosystem:

  • Modular by design: Use only what you need, such as the hundreds of operators, the data-model co-development sandbox, or data agents
  • ⚡️ Unmatched Efficiency: Accelerate your data workflows with good scalability and deep optimization
  • 🧩 Easy to extend: Add your own operators, data recipes, or new domain-specific modules
  • 🤝 Open to all: Students, data hackers, researchers, startups, and enterprises — all are welcome

🔧 We’re currently migrating from a monolith to lighter repos under this org. This is the beginning of a more flexible, community-driven future.


🚀 How to Start & Join

  • Try it now: start your journey from the portal repo.
  • Contributing & Acknowledgements: We love contributions! We are grateful for every kind of feedback, from code and docs to bug reports and ideas. See the list of our amazing contributors (such as those from Alibaba PAI, ModelScope, NVIDIA, Ray, ...) and join us.
  • Star this org and its repos to show your support and help us grow!

Built with ❤️ by the Data-Juicer Team, the community and you.
No matter your background or scale — if you care about data, you belong here.

Popular repositories

  1. data-juicer (Public)

    Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

    Python, 5.5k stars, 286 forks

  2. .github (Public)
