On request from @ohvssb. @BjornRoarJoneid probably also has some interest.
Partitioned Parquet is possible on Google Cloud, and it can be filtered on a row basis on read for faster loading. For the larger datasets this could be a way to avoid transitioning to databases.
Writing to a partitioned dataset is pretty simple; you need (see the sketch after this list):
- pyarrow.Table.from_pandas(df)
- outpath
- partition_cols=['FODT_AAR']
- filesystem=gcs_file_system
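A minimal sketch of the write side, assuming `pyarrow.parquet.write_to_dataset` is the intended function (the list above matches its arguments), that `gcs_file_system` is a `gcsfs.GCSFileSystem`, and that `df` and the bucket path in `outpath` are placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq
import gcsfs

# Hypothetical filesystem and output path; adjust to your own bucket.
gcs_file_system = gcsfs.GCSFileSystem()
outpath = "my-bucket/my-dataset"

# Convert the pandas DataFrame to an Arrow table and write it out,
# one subdirectory per distinct FODT_AAR value.
table = pa.Table.from_pandas(df)
pq.write_to_dataset(
    table,
    outpath,
    partition_cols=["FODT_AAR"],
    filesystem=gcs_file_system,
)
```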
Getting back to an ordinary pandas DataFrame is a bit harder; you have to (see the sketch after this list):
- decide the filter to apply on the partitioned columns, like filters=[('FODT_AAR', '>', 1995)]
- read().combine_chunks() on the pyarrow.parquet.ParquetDataset
- set the column used as partition_cols back into the data, with an appropriate datatype (another option is to duplicate it before writing, which is probably simpler but adds data)
- do a .to_pandas() on the resulting pyarrow table
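A minimal sketch of the read side, under the same assumptions (gcsfs filesystem, placeholder path). Note that which type the partition column comes back as depends on the pyarrow version, hence the explicit cast at the end:

```python
import pyarrow.parquet as pq
import gcsfs

gcs_file_system = gcsfs.GCSFileSystem()

# Only row groups/partitions matching the filter are loaded.
dataset = pq.ParquetDataset(
    "my-bucket/my-dataset",
    filesystem=gcs_file_system,
    filters=[("FODT_AAR", ">", 1995)],
)

# Materialize into a single contiguous table, then into pandas.
table = dataset.read().combine_chunks()
df = table.to_pandas()

# The partition column is reconstructed from the directory names; cast it
# back to the dtype you want (it may come back as a string or dictionary
# type depending on pyarrow version).
df["FODT_AAR"] = df["FODT_AAR"].astype("int64")
```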
Here is an experiment:
https://github.com/statisticsnorway/utd_nudb/blob/carl_experiments_daplaprod/experiments/partition_parquet.ipynb