* * fix wrong mm_key init
* + support loading images from a specified bytes key instead of paths only
* * fix the wrong context handling logic in the compute_stats_batched method of the basic Filter
* * make all image-related filters support loading from bytes data
* * make all image-related mappers support loading from bytes data
* * make all image-related mappers support updating the bytes data to the latest version
* * restore the context behavior for base op
* * support context for general fused op
* * fix minor bugs
* * fix schema alignment problem in general fused op
* * change the parent class of GeneralFusedOP from OP to Mapper
* update DownloadFileMapper and custom webdataset encoder and decoder (#724)
* support context for DownloadFileMapper and custom webdataset encoder and decoder
* update DownloadFileMapper
* update raydataset
* update webdata _custom_default_encoder
* + add ray tag for the webdataset test case
* * move webdataset_utils to the utils module
* + add the customizable reconstruct function
* + support customized webdataset format reconstruction before exporting
* + support export_type
* * update exporter args in analyzer
* fix the logic for the default alignment for download_file_mapper
* specify the stdout encoding for the new process
* - refactor the export method of RayExporter to the _router style
* * fix docstring
---------
Co-authored-by: Cathy0908 <30484308+Cathy0908@users.noreply.github.com>
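
For orientation, the export-related items above (`support export_type`, the exporter changes) boil down to a few new config keys, shown in the diff below. A minimal sketch, with illustrative values only; the concrete set of supported type strings should be checked in `Exporter._router()` (standalone mode) or `RayExporter._SUPPORTED_FORMATS` (ray mode):

```yaml
# Minimal sketch of the new export options (values are illustrative).
export_path: '/path/to/result/dataset.jsonl'  # where the processed dataset goes
export_type: 'jsonl'        # explicit format; if omitted, Data-Juicer parses the type from export_path
export_extra_args: {}       # extra exporter options, e.g. key mapping info for the WebDataset format
```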
configs/config_all.yaml: 9 additions & 3 deletions
@@ -27,8 +27,13 @@ validators: # validators are a l
 text: 'str'

 export_path: '/path/to/result/dataset.jsonl'  # path to processed result dataset. Supported suffixes include ['jsonl', 'json', 'parquet']
+export_type: 'jsonl'  # The export format type. If it's not specified, Data-Juicer will parse it from the export_path. The supported types can be found in Exporter._router() for standalone mode and RayExporter._SUPPORTED_FORMATS for ray mode
 export_shard_size: 0  # shard size of exported dataset in Byte. In default, it's 0, which means export the whole dataset into only one file. If it's set to a positive number, the exported dataset will be split into several dataset shards, and the max size of each shard won't be larger than the export_shard_size
 export_in_parallel: false  # whether to export the result dataset in parallel to a single file, which usually takes less time. It only works when export_shard_size is 0, and its default number of processes is the same as the argument np. **Notice**: If it's True, sometimes exporting in parallel might require much more time due to the IO blocking, especially for very large datasets. When this happens, False is a better choice, although it takes more time.
+keep_stats_in_res_ds: false  # whether to keep the computed stats in the result dataset. The intermediate fields to store the stats computed by Filters will be removed if it's False. It's False in default.
+keep_hashes_in_res_ds: false  # whether to keep the computed hashes in the result dataset. The intermediate fields to store the hashes computed by Deduplicators will be removed if it's False. It's False in default.
+export_extra_args: {}  # Other optional arguments for exporting in dict. For example, the key mapping info for exporting the WebDataset format.
+
 np: 4  # number of subprocesses to process your dataset
 text_keys: 'text'  # the key name of the field where the sample texts to be processed are stored, e.g., `text`, `instruction`, `output`, ...
 # Note: currently, we support specifying only ONE key for each op; for cases requiring multiple keys, users can specify the op multiple times. We will only use the first key of `text_keys` when you set multiple keys.
@@ -46,12 +51,11 @@ trace_num: 10 # number of samples
 op_fusion: false  # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
 fusion_strategy: 'probe'  # OP fusion strategy. Support ['greedy', 'probe'] now. 'greedy' means keep the basic OP order and put the fused OP to the last of each fused OP group. 'probe' means Data-Juicer will probe the running speed for each OP at the beginning and reorder the OPs and fused OPs according to their probed speed (fast to slow). It's 'probe' in default.
 cache_compress: null  # the compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.
-keep_stats_in_res_ds: false  # whether to keep the computed stats in the result dataset. The intermediate fields to store the stats computed by Filters will be removed if it's False. It's False in default.
-keep_hashes_in_res_ds: false  # whether to keep the computed hashes in the result dataset. The intermediate fields to store the hashes computed by Deduplicators will be removed if it's False. It's False in default.
 adaptive_batch_size: false  # whether to use adaptive batch sizes for each OP according to the probed results. It's False in default.

 # for multimodal data processing
 image_key: 'images'  # key name of field to store the list of sample image paths.
+image_bytes_key: 'image_bytes'  # key name of field to store the list of sample image bytes.
 image_special_token: '<__dj__image>'  # the special token that represents an image in the text. In default, it's "<__dj__image>". You can specify your own special token according to your input dataset.
 audio_key: 'audios'  # key name of field to store the list of sample audio paths.
 audio_special_token: '<__dj__audio>'  # the special token that represents an audio in the text. In default, it's "<__dj__audio>". You can specify your own special token according to your input dataset.
@@ -273,10 +277,12 @@ process:
   - extract_tables_from_html_mapper:  # extract tables from HTML content
       tables_field_name: 'html_tables'  # Field name to store the extracted tables.
       retain_html_tags: false,  # If True, retains HTML tags in the tables; otherwise, removes them.
-      include_header: true,  # If True, includes the table header; otherwise, excludes it. This parameter is effective only when `retain_html_tags` is False and applies solely to the extracted table content.
+      include_header: true  # If True, includes the table header; otherwise, excludes it. This parameter is effective only when `retain_html_tags` is False and applies solely to the extracted table content.
   - download_file_mapper:  # download url files to local files
       save_dir: null  # The directory to save downloaded files.
       download_field: null  # The field name to get the url to download.
+      save_field: null  # The field name to save the downloaded file content.
+      resume_download: false  # Whether to resume download. If True, skip the sample if the file already exists.
       timeout: 30  # The timeout in seconds for each HTTP request.
       max_concurrent: 10  # Maximum concurrent downloads.
   - fix_unicode_mapper:  # fix unicode errors in text.
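
To see how the new bytes-related keys fit together, here is a hedged sketch (not part of this diff) that pairs the new `image_bytes_key` with `download_file_mapper`'s new `save_field` and `resume_download` options; the concrete `download_field`/`save_field` values and the idea of feeding the downloaded bytes straight into the image-related ops are assumptions for illustration, not something the diff states:

```yaml
# Sketch only: one possible wiring of the new bytes support (field values are assumptions).
image_key: 'images'              # key name of field to store the list of sample image paths
image_bytes_key: 'image_bytes'   # key name of field to store the list of sample image bytes

process:
  - download_file_mapper:        # download url files to local files
      save_dir: null             # directory to save downloaded files
      download_field: 'images'   # assumed: field holding the URLs to download
      save_field: 'image_bytes'  # assumed: field to save the downloaded file content
      resume_download: false     # skip the sample if the file already exists
      timeout: 30                # timeout in seconds for each HTTP request
      max_concurrent: 10         # maximum concurrent downloads
```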