Merged
Changes from 3 commits
128 changes: 112 additions & 16 deletions README.md
@@ -1,23 +1,115 @@

<!-- ALL-CONTRIBUTORS-BADGE:START - Do not remove or modify this section -->
[![All Contributors](https://img.shields.io/badge/all_contributors-1-orange.svg?style=flat-square)](#contributors-)
<!-- ALL-CONTRIBUTORS-BADGE:END -->
---

# JZFS
<p align="center">
<picture>
<source media="(prefers-color-scheme: light)" srcset="https://github.com/GitDataAI/jzfs/blob/main/docs/jzfs-logo-words.png?raw=true">
<source media="(prefers-color-scheme: dark)" srcset="https://github.com/GitDataAI/jzfs/blob/main/docs/jzfs-logo-words.png?raw=true">
<img alt="JZFS Logo" src="https://github.com/GitDataAI/jzfs/blob/main/docs/jzfs-logo-words.png" width="400px">
</picture>
</p>

<h2 align="center">Git Based & Version Control & Joint Management <br/>for code, data, model and their relationship</h2>

> JZFS delivers a distributed data management system that keeps track of your data, from code to PB-scale datasets, and ensures reproducibility.

<div align="center">
<h3 align="center">
<a href="https://gitdata.ai">JZFS Cloud</a> |
<a href="https://gitdata.ai/">User Guide</a> |
<a href="https://gitdata.ai/">API Docs</a> |
<a href="https://github.com/GitDataAI/jzfs">Roadmap 2025</a>
</h3>

<a href="https://github.com/GitDataAI/jzfs/releases/latest">
<img src="https://img.shields.io/github/v/release/GitDataAI/jzfs.svg" alt="Version"/>
</a>
<a href="https://github.com/GitDataAI/jzfs/releases/latest">
<img src="https://img.shields.io/github/release-date/GitDataAI/jzfs.svg" alt="Releases"/>
</a>
<a href="https://hub.docker.com/r/gitdatateam/jzfs/">
<img src="https://img.shields.io/docker/pulls/gitdatateam/jzfs.svg" alt="Docker Pulls"/>
</a>
<a href="https://github.com/GitDataAI/jzfs/actions/workflows/flow.yml">
<img src="https://github.com/GitDataAI/jzfs/actions/workflows/flow.yml/badge.svg" alt="GitHub Actions"/>
</a>
<a href="https://codecov.io/gh/GitDataAI/jzfs">
<img src="https://codecov.io/gh/GitDataAI/jzfs/branch/main/graph/badge.svg?token=FITFDI3J3C" alt="Codecov"/>
</a>
<a href="https://github.com/GitDataAI/jzfs/blob/main/LICENSE">
<img src="https://img.shields.io/github/license/GitDataAI/jzfs" alt="License"/>
</a>

<br/>

<a href="https://gitdata.ai/slack">
<img src="https://img.shields.io/badge/slack-GitDataAI-0abd59?logo=slack&style=for-the-badge" alt="Slack"/>
</a>
<a href="https://x.com/GitDataAI">
<img src="https://img.shields.io/badge/twitter-follow_us-1d9bf0.svg?style=for-the-badge" alt="Twitter"/>
</a>
<a href="https://www.linkedin.com/company/gitdataai">
<img src="https://img.shields.io/badge/linkedin-connect_with_us-0a66c2.svg?style=for-the-badge" alt="LinkedIn"/>
</a>
</div>

## Introduction

**JZFS** is an open-source, cloud-native version control filesystem based on the Git protocol for data management and publication, with a command line interface and a Python API. With JZFS, you can version control arbitrarily large data, share or consume data, record your data’s provenance, and work in a computationally reproducible way.

JZFS adapts principles of open-source software development and distribution to address the technical challenges of data management, data sharing, and digital provenance collection across the life cycle of digital objects.

![](docs/jzfs-joint-management.png)

Compared with code in software development, data tend not to be as precisely
Comment on lines +57 to +63

🛠️ Refactor suggestion

Provide alt-text for the joint-management image (MD045)

-![](docs/jzfs-joint-management.png)
+![Joint management overview](docs/jzfs-joint-management.png)
🧰 Tools
🪛 LanguageTool

[uncategorized] ~57-~57: You might be missing the article “the” here.
Context: ...ive version control filesystem based on Git protocol for data management and public...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[uncategorized] ~63-~63: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...with code in software development, data tend not to be as precisely identified becau...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)

🪛 markdownlint-cli2 (0.17.2)

61-61: Images should have alternate text (alt text)

(MD045, no-alt-text)

🤖 Prompt for AI Agents
In README.md around lines 57 to 63, the image tag for
"docs/jzfs-joint-management.png" lacks alt-text, which is important for
accessibility and markdown linting. Add a descriptive alt attribute to the image
tag that briefly explains the content or purpose of the image, such as "JZFS
joint management diagram" or a similar concise description.

identified because data versioning is rarely or only coarsely practiced. Scientific computation
is not reproducible enough, because data provenance, the information about how a digital file
came to be, is often incomplete and rarely automatically captured. Last but not least, in
the absence of standardized data packages, there is no uniform way to declare actionable
data dependencies and derivative relationships between inputs and outputs of a computation. JZFS aims to solve these issues by providing streamlined, transparent management
of code, data, computing environments, and their relationship.
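
As a rough illustration of what declaring data dependencies and derivative relationships can look like, the following sketch records content hashes for the inputs and outputs of one computation step. It is plain Python and not the JZFS API; `ProvenanceRecord`, `run_step`, and the file paths are hypothetical names chosen for this example.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict


def content_id(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies these exact bytes."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ProvenanceRecord:
    """Hypothetical record linking a computation's inputs to its outputs."""
    command: str
    inputs: dict = field(default_factory=dict)   # path -> content id
    outputs: dict = field(default_factory=dict)  # path -> content id

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def run_step(command, func, inputs):
    """Apply func to the input files and record what went in and what came out."""
    outputs = func(inputs)
    record = ProvenanceRecord(
        command=command,
        inputs={path: content_id(data) for path, data in inputs.items()},
        outputs={path: content_id(data) for path, data in outputs.items()},
    )
    return outputs, record


def sum_columns(files):
    """Toy computation: sum the comma-separated integers of one input file."""
    values = files["raw/scan.csv"].decode().strip().split(",")
    return {"derived/sum.txt": str(sum(int(v) for v in values)).encode()}


outputs, record = run_step("sum-columns", sum_columns, {"raw/scan.csv": b"1,2,3\n"})
print(record.to_json())
```

Because the record stores content identifiers rather than file names alone, a later change to the input produces a different identifier, which makes stale outputs detectable.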

### Current Status and Roadmap

#### 🚧 Current Status: Incubating - JZFS is not ready for production usage. The API is still evolving and documentation is lacking.
JZFS is still in the early development stages and is considered **incubating**. There is no commitment to ongoing maintenance or development. As the project evolves, this may change in the future. Therefore, we encourage you to explore, experiment, and contribute to JZFS, but do not attempt to use it in production.

JZFS is a distributed git storage service for the Rust programming language that prioritizes ease-of-use. It supports both Single Machine as well as some distributed environments, including Kubernetes and more. Note that JZFS does not hide the store; instead, JZFS exposes features based on the target distributed git storage service.
The immediate next steps for the project are to fill obvious gaps, such as implementing error handling, removing panics throughout the codebase, supporting additional data types, and writing documentation. After that, development will be based on feedback and contributions.

## Current Status and Roadmap
JZFS's long-term goal is to build data ecosystems that enable new innovations.

JZFS is still in the early development stages and is considered **incubating**. There is no commitment to ongoing maintenance or development. As the project evolves, this may change in the future. Therefore, we encourage you to explore, experiment, and contribute to JZFS, but do not attempt to use it in production.

The immediate next steps for the project are to fill obvious gaps, such as implementing error handling, removing panics throughout the codebase, supporting additional data types, and writing documentation. After that, development will be based on feedback and contributions.
### Research Data Management

JZFS is based on Git with extended capabilities, especially with respect to managing large files.

JZFS is data management software designed to support the various stages
of the development of digital objects.

Importantly, JZFS can be seen as an overlay on top of existing data
structures and services: Tracking files does not change the files themselves or the location from which they can
be retrieved by data processing tools.

JZFS is used to collect
all experimental metadata about the complete timeline of longitudinal and multimodal animal experiments,
including MRI, histology, electrophysiology, and behavior.

![](./docs/jzfs-research-flow.png)
Project planning and experimental details are recorded in an in-house relational cloud-based database.

🛠️ Refactor suggestion

Add alt-text for remaining images (MD045)

-![](./docs/jzfs-research-flow.png)
+![Research-flow diagram](./docs/jzfs-research-flow.png)

-![](./docs/jzfs-space.png)
+![Data-space architecture](./docs/jzfs-space.png)

Also applies to: 140-142

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

95-95: Images should have alternate text (alt text)

(MD045, no-alt-text)

🤖 Prompt for AI Agents
In README.md at lines 95-96 and also lines 140-142, the images lack alt-text
which is required for accessibility and markdown linting (MD045). Add
descriptive alt-text inside the square brackets for each image markdown tag to
describe the image content meaningfully.


### Added value
A key element for both the database and the data storage is the
identifier, the study ID for each animal, used in a standardized file name structure to make the data findable.

The directory structure for the raw data follows the permit for performing animal experiments. The data for a
specific project is organized following the [YODA principles](https://handbook.datalad.org/en/latest/basics/101-127-yoda.html), which are compatible with existing standards, e.g., the BIDS structure.


🛠️ Refactor suggestion

Correct typos and preposition

-identifier, the study ID for each animal, used in a standardized fle name structure
-The directory structure for the raw data follows the permit of performing animal experiments. Te data for a
+identifier—the study ID for each animal—used in a standardized file-name structure.
+The directory structure for the raw data follows the permit for performing animal experiments. The data for a

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~101-~101: The preposition “for” seems more likely in this position.
Context: ...ure for the raw data follows the permit of performing animal experiments. Te data ...

(AI_EN_LECTOR_REPLACEMENT_PREPOSITION)

🪛 markdownlint-cli2 (0.17.2)

102-102: Bare URL used

(MD034, no-bare-urls)

🤖 Prompt for AI Agents
In README.md around lines 99 to 103, correct the typos "fle" to "file", "Te" to
"The", and "permit" to "permittee" or the correct intended word. Also, fix the
preposition "for performing animal experiments" to a more appropriate phrase
like "governing animal experiments" to improve clarity and correctness.

In preparation for publication and to facilitate
data reproducibility, the experimental raw and processed data is made publicly available on GitData.AI.

JZFS is used as the central data management tool (Fig. above) and for version control: It keeps track of which
files were modified, when, and by whom, and provides the ability to restore previous states. To this end, JZFS
is agnostic of the data type and provides a unified interface for managing code and data files.
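
The idea of tracking who changed which files and when, and of restoring an earlier state, can be sketched in a few lines of plain Python. This is a conceptual model only, not how JZFS or Git store history; `Commit` and `History` are hypothetical names, and files are treated as opaque bytes to mirror the data-type-agnostic design.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Commit:
    author: str
    timestamp: datetime
    files: dict  # path -> bytes snapshot; code and data are treated alike


class History:
    """Hypothetical in-memory history of full snapshots."""

    def __init__(self):
        self.commits = []

    def commit(self, author, files):
        """Store a snapshot and return its index in the history."""
        self.commits.append(Commit(author, datetime.now(timezone.utc), dict(files)))
        return len(self.commits) - 1

    def changed_paths(self, index):
        """Paths added, removed, or modified relative to the previous commit."""
        current = self.commits[index].files
        parent = self.commits[index - 1].files if index > 0 else {}
        return sorted(
            path for path in current.keys() | parent.keys()
            if current.get(path) != parent.get(path)
        )

    def restore(self, index):
        """Return a copy of the files as they were at the given commit."""
        return dict(self.commits[index].files)


history = History()
first = history.commit("alice", {"analysis.py": b"print(1)", "scan.nii": b"\x00\x01"})
second = history.commit("bob", {"analysis.py": b"print(2)", "scan.nii": b"\x00\x01"})

print(history.commits[second].author, history.changed_paths(second))
print(history.restore(first)["analysis.py"])
```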


### DataHub
Our central use case is the DataHub (like GitHub, but for data), which essentially consists of Git version control for data and Git collaboration for data.

In the JZFS DataHub model, each node maintains a copy of the files and all of the history of each file.
@@ -28,9 +120,9 @@ Writes are organized on branches.
Git is designed to compute differences between versions quickly.
Generally, Git relies on human action to share and merge changes.

Fault tolerance and trustlessness are achieved via the separation of remotes from individual nodes.
Each node in the network maintains its own copy and its history, and coordinates via one or many remotes.
If a remote you trust gets corrupted, you can roll back to a previous good state and switch to a new remote.
Even if you lose your copy, you can restore it from other nodes' copies.
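
The recovery behaviour described above can be modelled with a small simulation. The sketch below is an assumption-laden toy, not JZFS or Git code: `Store` and `history_digest` are hypothetical names, nodes and remotes are both just ordered lists of snapshots, and corruption is detected by comparing digests of the full history.

```python
import hashlib


def history_digest(history):
    """Hash the whole history so two copies can be compared cheaply."""
    h = hashlib.sha256()
    for snapshot in history:
        h.update(hashlib.sha256(snapshot).digest())
    return h.hexdigest()


class Store:
    """A node or a remote: both hold an ordered history of snapshots."""

    def __init__(self, name):
        self.name = name
        self.history = []

    def commit(self, snapshot):
        self.history.append(snapshot)

    def push_to(self, other):
        other.history = list(self.history)

    def pull_from(self, other):
        self.history = list(other.history)


alice, bob = Store("alice"), Store("bob")
remote_a, remote_b = Store("remote-a"), Store("remote-b")

alice.commit(b"v1")
alice.commit(b"v2")
alice.push_to(remote_a)
bob.pull_from(remote_a)

# Simulate a corrupted remote and detect it from a node's own good copy.
remote_a.history[1] = b"garbage"
corrupted = history_digest(remote_a.history) != history_digest(alice.history)

# Switch to a new remote by pushing the known-good local history.
if corrupted:
    alice.push_to(remote_b)

# Simulate losing the local copy, then restore it from another node.
alice.history = []
alice.pull_from(bob)
print(corrupted, alice.history == remote_b.history)
```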

This enables us to create transparency across internal and external data.
@@ -39,18 +131,22 @@ It forms the basis for a new way of practicing data exchange and contract design

It is crucial that data exchange works both within the company and in individually controllable data networks (data space).

### Data Space
A JZFS data space consists of so-called “DataHubs”, comparable to one or many remotes in Git, which are virtual data nodes for sharing data and building data networks.
A single DataHub manages various data connections and can join together with other hubs to form a network through targeted synchronization.
Based on data contracts mapped in the network, data can be released to other participants, enabling efficient data exchange.
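
To make the idea of contract-based release concrete, here is a minimal sketch of targeted synchronization between two hubs. It assumes a contract is nothing more than a (dataset, consumer) pair; the `DataHub` class and its methods are hypothetical and do not reflect the JZFS implementation.

```python
class DataHub:
    """Hypothetical hub holding datasets and the contracts that release them."""

    def __init__(self, name):
        self.name = name
        self.datasets = {}      # dataset name -> payload
        self.contracts = set()  # (dataset name, consumer hub name)

    def publish(self, dataset, payload):
        self.datasets[dataset] = payload

    def grant(self, dataset, consumer):
        """Record a contract that releases one dataset to one consumer hub."""
        self.contracts.add((dataset, consumer.name))

    def sync_to(self, consumer):
        """Copy only the datasets that a contract releases to this consumer."""
        released = [d for d in self.datasets if (d, consumer.name) in self.contracts]
        for dataset in released:
            consumer.datasets[dataset] = self.datasets[dataset]
        return sorted(released)


lab = DataHub("lab")
partner = DataHub("partner")

lab.publish("mri", b"mri-bytes")
lab.publish("histology", b"histology-bytes")
lab.grant("mri", partner)

print(lab.sync_to(partner))
```

Only the dataset covered by a contract reaches the partner hub; everything else stays local to the publishing hub.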


![](./docs/jzfs-space.png)

JZFS offers Git-for-data technology for exchanging data in DataHubs and data spaces.

The added value is clear: simple, transparent data management combined with intuitive linking and sharing of data in decentralized networks – data space.
This enables secure, trustworthy data exchange across organizational boundaries.

JZFS's long-term goal is to build data ecosystems that enable new innovations.




### Deploy
```bash
```
Binary file added docs/jzfs-joint-management.png
Binary file added docs/jzfs-logo-words.png
Binary file added docs/jzfs-logo.png
Binary file added docs/jzfs-research-flow.png
37 changes: 37 additions & 0 deletions docs/jzfs-spec.md
@@ -0,0 +1,37 @@

# Why JZFS for Data?

* Chaos has ensued for non-expert end users as data ecosystems progressively develop into complex and siloed systems with a continuous stream of point solutions added to the insane mix.

* Complex infrastructures requiring consistent maintenance deflect most of the engineering talent from high-value operations, such as developing data applications that directly impact the business and ultimately enhance the ROI of data teams.

* Inflexible and unstable, and therefore, fragile data pipelines constrict data engineering teams as a bottleneck for even simple data operations. It is not uncommon to hear a whole new data pipeline being spawned to answer one specific business question or 1000K data warehouse tables being created from 6K source tables.

Data Consumers suffer from unreliable data quality, Data Producers suffer from duplicated efforts to produce data for ambiguous objectives, and Data Engineers suffer from flooding requests from both data production and consumption sides.

The dearth of exemplary developer experience also robs data developers of the ability to declaratively manage resources, environments, and requests so they can focus completely on data solutions.

Due to these diversions and the lack of a unified platform, it is nearly impossible for DEs to build short and crisp data-to-insight roadmaps.

On top of that, it’s a constant struggle to adhere to the organization’s changing data compliance standards as governance and observability become afterthoughts in a maintenance-first setting. This directly impacts the quality and experience of data that passes through meandering pipelines blotched with miscellaneous integrations.

The concept of having an assembled architecture emerged over time to solve these common problems that infested the data community at large. One tool could tend to a particular problem, and assembling a collection of such tools would solve several issues. But, targeting patches of the problem led to a disconnected basket of solutions ending up with fragile data pipelines and dumping all data to a central lake that eventually created unmanageable data swamps across industries. This augmented the problem by adding the cognitive load of a plethora of tooling that had to be integrated and managed separately through expensive resources and experts.

Data swamps are no better than physical files in the basement, clogged with rich, useful, yet dormant data that businesses are unable to operationalise due to disparate and untrustworthy semantics. Semantic untrustworthiness stems from a chaotic clutter of MDS, overwhelmed with tools, integrations, and unstable pipelines. Another level of semantics is required to understand the low-level semantics, complicating the problem further.

Two distinct features become more apparent with this kind of tooling overwhelm:

1. Progressive overlap in Assembled Systems
As more tools pop in, they increasingly develop the need to become independently operable, often based on user feedback. For instance, two different point tools, say one for cataloguing and another for governance, are plugged into your data stack. This creates the need not only to learn each tool’s philosophy and to integrate and maintain each one from scratch, but eventually gives rise to completely parallel tracks. The governance tool starts requiring a native catalog, and the cataloguing tool requires policies manageable within its system. Now consider the same problem at scale, beyond just two point solutions. Even if we consider the cost of these parallel tracks as secondary, it is essentially a significantly disruptive design flaw that keeps splitting the topology of one unique capability into unmanageable duplicates.

2. Consistent and increasing desire to Decentralise
What follows from assembled systems is the sudden overwhelm of managing multiple limbs of the system and, therefore, increasing complexity and friction for end users trying to get their hands on the data. While business domains such as marketing, sales, and support have to jump through multiple hops to obtain the data they need, the organisation feels pressure to lift all dependencies clogging the central data team and distribute the workload across these domains. It was therefore no surprise that the early Data Mesh placed urgent focus on domain ownership, in other words decentralisation. While the idea is appealing in theory, how feasible is it in the field? If we apply it to any working business model, there are a few consequences:

* Not enough skilled professionals to allocate to each individual domain - Practically, how feasible is the idea of having data teams for each domain?
* Not enough professionals or budget to disrupt existing processes, detangle pipelines, and embed brand-new infrastructures.
* Not enough experts to help train and onboard during migration.

It’s both a skill- and resource-deficit issue. Moreover, with decades spent on evolving data stacks with not much value to show, organisations are not inclined to pour in more investment and effort to rip and replace their work. In essence, autonomy should take priority over decentralisation if that is the ultimate objective.

Why - Data Developer Platform
https://datadeveloperplatform.org/why_ddp_for_data/#why-build-a-ddp-for-data-products