Commit 43dec1d

Feat/doc (#191)
* add logo picture and change jzfs joint management picture
* change logo file name of dark mode
* change size of picture
* change alignment of picture
* change picture alignment
* update image
* update alignment of image
* update jzfs research flow
1 parent 536fbf2 commit 43dec1d

3 files changed

+76
-1
lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -94,7 +94,7 @@ all experimental metadata about the complete timeline of longitudinal and multim
 including MRI, histology, electrophysiology, and behavior.

 ![](./docs/jzfs-research-flow.png)
-Project planning and experimental details are recorded in an in-house relational cloud-based database.
+Project planning and experimental details are recorded in an in-house relational cloud-based database of GitData.AI.

 A key element for both the database and the data storage is the
 identifier, the study ID for each animal, used in a standardized file name structure to make the data findable.

docs/jzfs-research-flow.png

-23 KB

docs/jzfs.md

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+DataLad is a Python-based tool for the joint management of code, data, and their relationship,
+built on top of a versatile system for data logistics (git-annex) and the most popular distributed
+version control system (Git). It adapts principles of open-source software development and
+distribution to address the technical challenges of data management, data sharing, and digital
+provenance collection across the life cycle of digital objects. DataLad aims to make data
+management as easy as managing code. It streamlines procedures to consume, publish, and
+update data, for data of any size or type, and to link them as precisely versioned, lightweight
+dependencies. DataLad helps to make science more reproducible and FAIR (Wilkinson et al.,
+2016). It can capture complete and actionable process provenance of data transformations to
+enable automatic re-computation. The DataLad project (datalad.org) delivers a completely
+open, pioneering platform for flexible decentralized research data management (RDM) (Hanke,
+Pestilli, et al., 2021). It features a Python and a command-line interface, an extensible
+architecture, and does not depend on any centralized services but facilitates interoperability
+with a plurality of existing tools and services. In order to maximize its utility and target audience, DataLad is available for all major operating systems, and can be integrated into
+established workflows and environments with minimal friction.
+
+
+Statement of Need
+Code, data and computing environments are core components of scientific projects. While
+the collaborative development and use of research software and code is streamlined with established procedures and infrastructures, such as software distributions, distributed version
+control systems, and social coding portals like GitHub, other components of scientific projects
+are not as transparently managed or accessible. Data consumption is complicated by disconnected data portals that require a large variety of different data access and authentication
+methods. Compared with code in software development, data tend not to be as precisely
+identified because data versioning is rarely or only coarsely practiced. Scientific computation
+is not reproducible enough, because data provenance, the information of how a digital file
+came to be, is often incomplete and rarely automatically captured. Last but not least, in
+the absence of standardized data packages, there is no uniform way to declare actionable
+data dependencies and derivative relationships between inputs and outputs of a computation. DataLad aims to solve these issues by providing streamlined, transparent management
+of code, data, computing environments, and their relationship. It provides targeted interfaces
+and interoperability adapters to established scientific and commercial tools and services to
+set up unobstructed, unified access to all elements of scientific projects. This unique set of
+features enables workflows that are particularly suited for reproducible science, such as actionable process provenance capture for arbitrary command execution that affords automatic
+re-execution. To this end, it builds on and extends two established tools for version control
+and transport logistics, Git and git-annex.
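To make "actionable process provenance" concrete, the following is a minimal Python sketch of the idea, not DataLad's implementation. It runs a command, stores the command line together with checksums of its outputs as a plain record, and later re-executes the record to check that the outputs are regenerated identically. The record layout and the names `run_and_record`, `rerun`, and `demo` are assumptions made for illustration. DataLad offers this capability through its `datalad run` and `datalad rerun` commands, which this sketch neither calls nor reproduces.

```python
import hashlib
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_and_record(cmd: list[str], outputs: list[str], cwd: Path) -> dict:
    """Run a command and return a provenance record (hypothetical format)."""
    subprocess.run(cmd, cwd=cwd, check=True)
    return {
        "cmd": cmd,
        "outputs": {name: sha256_of(cwd / name) for name in outputs},
    }


def rerun(record: dict, cwd: Path) -> bool:
    """Re-execute a recorded command; report whether all outputs are identical."""
    subprocess.run(record["cmd"], cwd=cwd, check=True)
    return all(
        sha256_of(cwd / name) == digest
        for name, digest in record["outputs"].items()
    )


def demo() -> bool:
    """Record a toy computation, delete its output, and regenerate it."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # A deterministic toy "analysis" that writes one output file.
        script = (
            "import pathlib; "
            "pathlib.Path('result.txt').write_text(str(sum(range(10))))"
        )
        cmd = [sys.executable, "-c", script]
        record = run_and_record(cmd, ["result.txt"], workdir)
        # The record is plain JSON, so it could be stored next to the data.
        record = json.loads(json.dumps(record))
        (workdir / "result.txt").unlink()  # lose the output ...
        return rerun(record, workdir)      # ... and recompute it from the record
```

Because the record holds both the command and the expected output checksums, a re-execution can verify that a result is reproducible rather than merely assuming it.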
+
+
+Why Git and git-annex?
+Git is the most popular version control system for software development.
+It is a distributed
+content management system, specifically tuned towards managing and collaborating on text
+files, and excels at making all committed content reliably and efficiently available to all clones
+of a repository. At the same time, Git is not designed to efficiently handle large (e.g., over
+a gigabyte) or binary files (see, e.g., Kenlon, 2016). This makes it hard or impossible to
+use Git directly for distributed data storage with tailored access to individual files. Git-annex takes advantage of Git’s ability to efficiently manage textual information to overcome
+this limitation. File content handled by git-annex is placed into a managed repository annex,
+which avoids committing the file content directly to Git. Instead, git-annex commits a compact
+reference, typically derived from the checksum of a file’s content, that enables identification
+and association of a file name with the content. Using these identifiers, git-annex tracks
+content availability across all repository clones and external resources such as URLs pointing
+to individual files on the web. Upon user request, git-annex automatically manages data
+transport to and from a local repository annex at a granularity of individual files. With this
+simple approach, git-annex enables separate and optimized implementations for identification
+and transport of arbitrarily large files, using an extensible set of protocols, while retaining the
+distributed nature and compatibility with versatile workflows for versioning and collaboration
+provided by Git.
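The two mechanisms described above, a compact checksum-derived reference and per-key availability tracking, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not git-annex code. The key layout imitates the shape of keys produced by git-annex's SHA256E backend (backend name, content size, digest, file extension), but the extension handling is simplified. `ToyAnnex` and its methods are hypothetical, with `whereis` named after the `git annex whereis` command.

```python
import hashlib
from pathlib import PurePath


def annex_style_key(content: bytes, filename: str) -> str:
    """Build a key resembling git-annex's SHA256E backend (simplified)."""
    digest = hashlib.sha256(content).hexdigest()
    ext = PurePath(filename).suffix  # e.g. ".nii"; empty string if there is none
    return f"SHA256E-s{len(content)}--{digest}{ext}"


class ToyAnnex:
    """Hypothetical stand-in for a repository that versions keys, not content."""

    def __init__(self) -> None:
        # File name -> key: the small, textual part that Git would version.
        self.tree: dict[str, str] = {}
        # Key -> clones or URLs known to hold the actual content.
        self.locations: dict[str, set[str]] = {}

    def add(self, filename: str, content: bytes, location: str) -> str:
        """Register a file under its content key and note where the content lives."""
        key = annex_style_key(content, filename)
        self.tree[filename] = key
        self.locations.setdefault(key, set()).add(location)
        return key

    def whereis(self, filename: str) -> set[str]:
        """Return every known location of the content behind a file name."""
        return self.locations.get(self.tree[filename], set())
```

In this sketch, adding `b"voxels"` as `scan.nii` from a clone called `laptop` and again as `copy.nii` from a URL yields the same key both times, so `whereis("scan.nii")` reports both locations. Identical content is identified once, regardless of file name or storage place.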
+
+
+What does DataLad add to Git and git-annex?
+Easy to use modularization. Research workflows impose additional demands for an efficient research data management platform besides version control and data transport. Many
+research datasets contain millions of files, but a large number of files precludes managing
+such a dataset in a single Git repository, even if the total storage demand is not huge. Partitioning such datasets into smaller, linked components (e.g., one subdataset per sample in
+a dataset comprising thousands) allows for scalable management. Research datasets and
+projects can also be heterogeneous, comprising different data sources or evolving data across
+different processing stages, and at a different pace. Beyond scalability, modularization into
+homogeneous components also enables efficient reuse of a selected subset of datasets and
+recording a derivative relationship between datasets. Git’s submodule mechanism provides a
+way to nest individual repositories via unambiguously versioned linkage, but Git operations
+must still be performed within each individual repository. To achieve modularity without impeding usability, DataLad simplifies working with the resulting hierarchies of Git repositories
+via recursive operations across dataset boundaries. With this, DataLad provides a “monorepo”-like user experience in datasets with arbitrarily deep nesting, and makes it trivial to
+operate on individual files deep in the hierarchy or entire trees of datasets. A testament to
+this is datasets.datalad.org, created as the project’s initial goal to provide a data distribution
+with unified access to already available public data archives in neuroscience, such as crcns.org
+and openfmri.org. It is curated by the DataLad team and provides, at the time of publication,
+streamlined access to over 260 TBs of data across over 5,000 subdatasets from a wide range
+of projects and dozens of archives in a fully modularized way.
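The idea of recursive operations across dataset boundaries can be illustrated with a small Python model. It is a sketch only: the `Dataset` class, `walk`, `find_file`, and `count_files` are hypothetical names, and real DataLad datasets are Git repositories linked as submodules rather than in-memory objects. The point is that a single call on the top-level dataset reaches files at any nesting depth.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Dataset:
    """Hypothetical model of a dataset: its own files plus nested subdatasets."""
    name: str
    files: list[str] = field(default_factory=list)
    subdatasets: list[Dataset] = field(default_factory=list)


def walk(ds: Dataset, prefix: str = "") -> Iterator[tuple[str, Dataset]]:
    """Yield (path, dataset) for a dataset and everything nested below it."""
    path = f"{prefix}{ds.name}"
    yield path, ds
    for sub in ds.subdatasets:
        yield from walk(sub, prefix=f"{path}/")


def find_file(root: Dataset, filename: str) -> list[str]:
    """Locate a file anywhere in the hierarchy with one call on the root."""
    return [f"{path}/{filename}" for path, ds in walk(root) if filename in ds.files]


def count_files(root: Dataset) -> int:
    """Count files across the root dataset and all nested subdatasets."""
    return sum(len(ds.files) for _, ds in walk(root))
```

For a root dataset `study` with subdatasets `sub-01` (itself containing a `derivatives` subdataset) and `sub-02`, `find_file(study, "mask.nii")` returns the full path through all three levels without the caller entering any subdataset, which mirrors the "monorepo"-like experience described above.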
