
Support for partitioned parquet #47


Description

@aecorn

On request from @ohvssb
@BjornRoarJoneid also probably has some interest.

Partitioned parquet works on Google Cloud Storage, and a partitioned dataset can be filtered on a row basis at read time for faster loading. For the larger datasets this could be a way to avoid moving to a database.

Writing a partitioned dataset is pretty simple (see the sketch after this list):

  • convert the dataframe with pyarrow.Table.from_pandas(df)
  • pass the output path (outpath)
  • set partition_cols=['FODT_AAR']
  • pass filesystem=gcs_file_system
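
A minimal sketch of the write side, assuming a gcsfs.GCSFileSystem, a placeholder bucket path, and a small example dataframe (the bucket name and data here are illustrative, not from the actual project):

```python
import gcsfs
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical example data; FODT_AAR is the column we partition on.
df = pd.DataFrame({"FODT_AAR": [1994, 1996, 2001], "VERDI": [1, 2, 3]})

gcs_file_system = gcsfs.GCSFileSystem()
outpath = "my-bucket/nudb/partitioned_dataset"  # placeholder bucket/path

pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path=outpath,
    partition_cols=["FODT_AAR"],  # creates FODT_AAR=1994/ etc. subfolders
    filesystem=gcs_file_system,
)
```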

Getting back to an ordinary pandas dataframe is a bit harder; you have to (see the sketch after this list):

  • decide the filter to apply on the partition columns, e.g. filters=[('FODT_AAR', '>', 1995)]
  • call read().combine_chunks() on the pyarrow.parquet.ParquetDataset
  • cast the column used as partition_cols back to an appropriate datatype (another option is to duplicate the column before writing it down, which is probably simpler but stores the data twice)
  • call .to_pandas() on the resulting pyarrow table
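
A sketch of the read side under the same assumptions (same placeholder path and filesystem as above); the cast at the end is one way to restore the partition column's type, since it can come back as a categorical/string column:

```python
import gcsfs
import pyarrow.parquet as pq

gcs_file_system = gcsfs.GCSFileSystem()
outpath = "my-bucket/nudb/partitioned_dataset"  # same placeholder path as above

# Only partitions matching the filter are read from GCS.
dataset = pq.ParquetDataset(
    outpath,
    filesystem=gcs_file_system,
    filters=[("FODT_AAR", ">", 1995)],
)

df = dataset.read().combine_chunks().to_pandas()

# The partition column may come back as a categorical of strings;
# cast it back to an integer if downstream code expects that.
df["FODT_AAR"] = df["FODT_AAR"].astype(str).astype("int64")
```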

Here is an experiment:
https://github.com/statisticsnorway/utd_nudb/blob/carl_experiments_daplaprod/experiments/partition_parquet.ipynb
