Commit b1bfe15

Document the HF_DATASETS_CACHE environment variable in the datasets cache documentation (#7532)

* fix 7457
* fix 7457 and 7480
* Update cache.mdx

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
1 parent 8874b25 commit b1bfe15

1 file changed: +27 -1 lines changed

docs/source/cache.mdx

Lines changed: 27 additions & 1 deletion
@@ -23,6 +23,32 @@ The default 🤗 Datasets cache directory is `~/.cache/huggingface/datasets`. Ch
 $ export HF_HOME="/path/to/another/directory/datasets"
 ```
 
+Alternatively, you can set the `HF_DATASETS_CACHE` environment variable to control only the datasets-specific cache directory:
+
+```
+$ export HF_DATASETS_CACHE="/path/to/datasets_cache"
+```
+
+⚠️ This only applies to files written by the `datasets` library (e.g., Arrow files and indices).
+It does **not** affect files downloaded from the Hugging Face Hub (such as models, tokenizers, or raw dataset sources), which are located in `~/.cache/huggingface/hub` by default and controlled separately via the `HF_HUB_CACHE` variable:
+
+```
+$ export HF_HUB_CACHE="/path/to/hub_cache"
+```
+
+💡 If you'd like to relocate all Hugging Face caches — including datasets and hub downloads — use the `HF_HOME` variable instead:
+
+```
+$ export HF_HOME="/path/to/cache_root"
+```
+
+This results in:
+- datasets cache → `/path/to/cache_root/datasets`
+- hub cache → `/path/to/cache_root/hub`
+
+These distinctions are especially useful when working in shared environments or networked file systems (e.g., NFS).
+See [issue #7480](https://github.com/huggingface/datasets/issues/7480) for discussion on how users encountered unexpected cache locations when `HF_HUB_CACHE` was not set alongside `HF_DATASETS_CACHE`.
+
 When you load a dataset, you also have the option to change where the data is cached. Change the `cache_dir` parameter to the path you want:
 
 ```py
@@ -82,6 +108,6 @@ If you want to reuse a dataset from scratch, try setting the `download_mode` par
 
 Disabling the cache and copying the dataset in-memory will speed up dataset operations. There are two options for copying the dataset in-memory:
 
-1. Set `datasets.config.IN_MEMORY_MAX_SIZE` to a nonzero value (in bytes) that fits in your RAM memory.
+1. Set `datasets.config.IN_MEMORY_MAX_SIZE` to a nonzero value (in bytes) that fits in your RAM memory.
 
 2. Set the environment variable `HF_DATASETS_IN_MEMORY_MAX_SIZE` to a nonzero value. Note that the first method takes higher precedence.
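
Note on the second hunk: the precedence rule in the two list items can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's code. The helper names `effective_in_memory_max_size` and `should_copy_in_memory` are hypothetical, and the sketch assumes that a limit of 0 means "do not copy in memory" and that a dataset is copied only when it is smaller than the limit.

```python
import os


def effective_in_memory_max_size(config_value=None, env=None):
    """Pick the byte limit (hypothetical helper, not part of `datasets`).

    An explicit in-code setting (standing in for
    `datasets.config.IN_MEMORY_MAX_SIZE`) takes precedence over the
    `HF_DATASETS_IN_MEMORY_MAX_SIZE` environment variable.
    """
    env = os.environ if env is None else env
    if config_value is not None:
        return float(config_value)
    # Assumption: when neither is set, the limit is 0 (feature disabled).
    return float(env.get("HF_DATASETS_IN_MEMORY_MAX_SIZE", 0))


def should_copy_in_memory(dataset_size_bytes, max_size_bytes):
    """Assumption: copy in memory only if a nonzero limit is set and the dataset fits under it."""
    return bool(max_size_bytes) and dataset_size_bytes < max_size_bytes


if __name__ == "__main__":
    env = {"HF_DATASETS_IN_MEMORY_MAX_SIZE": "1000"}
    limit = effective_in_memory_max_size(config_value=2_000_000_000, env=env)
    print(limit, should_copy_in_memory(500_000_000, limit))
```

Passing `env` as a plain dictionary keeps the sketch testable without touching the real process environment.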

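
Note on the first hunk: the way `HF_HOME`, `HF_DATASETS_CACHE`, and `HF_HUB_CACHE` interact, as described in the added documentation, can be sketched as follows. This is a minimal sketch and not the code used by `datasets` or `huggingface_hub`. `resolve_cache_dirs` is a hypothetical helper, the default root `~/.cache/huggingface` is inferred from the default paths quoted in the diff, and any other variables the libraries may consult are deliberately ignored.

```python
import os
from pathlib import Path


def resolve_cache_dirs(env=None):
    """Return (datasets_cache, hub_cache) for a given environment mapping.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    # HF_HOME relocates the whole cache root.
    # Assumption: the root defaults to ~/.cache/huggingface.
    home = Path(env.get("HF_HOME", "~/.cache/huggingface")).expanduser()
    # Each specific variable overrides only its own cache directory.
    datasets_cache = Path(env.get("HF_DATASETS_CACHE", home / "datasets")).expanduser()
    hub_cache = Path(env.get("HF_HUB_CACHE", home / "hub")).expanduser()
    return datasets_cache, hub_cache


if __name__ == "__main__":
    # Only HF_HOME set: both caches move under the new root.
    print(resolve_cache_dirs({"HF_HOME": "/path/to/cache_root"}))
    # HF_DATASETS_CACHE set as well: only the datasets cache moves again.
    print(resolve_cache_dirs({
        "HF_HOME": "/path/to/cache_root",
        "HF_DATASETS_CACHE": "/path/to/datasets_cache",
    }))
```

The second call illustrates the situation discussed in issue #7480: setting `HF_DATASETS_CACHE` alone leaves the hub cache wherever `HF_HOME` (or the default root) puts it.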