
Support for partitioned parquet #47


Description

@aecorn

On request from @ohvssb
@BjornRoarJoneid also probably has some interest.

Partitioned parquet works on Google Cloud Storage, and a partitioned dataset can be filtered on a row basis at read time for faster loading. For the larger datasets this could be a way to avoid moving to a database.

Writing a partitioned dataset is pretty simple (see the sketch after this list):

  • convert the dataframe with pyarrow.Table.from_pandas(df)
  • pass the output path (outpath)
  • set partition_cols=['FODT_AAR']
  • pass filesystem=gcs_file_system
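
A minimal sketch of the write side, assuming a gcsfs.GCSFileSystem, a placeholder bucket path, and a small example dataframe (the bucket name and data here are illustrative, not from the actual project):

```python
import gcsfs
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical example data; FODT_AAR is the column we partition on.
df = pd.DataFrame({"FODT_AAR": [1994, 1996, 2001], "VERDI": [1, 2, 3]})

gcs_file_system = gcsfs.GCSFileSystem()
outpath = "my-bucket/nudb/partitioned_dataset"  # placeholder bucket/path

pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path=outpath,
    partition_cols=["FODT_AAR"],  # creates FODT_AAR=1994/ etc. subfolders
    filesystem=gcs_file_system,
)
```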

Getting back to an ordinary pandas dataframe is a bit harder; you have to (see the sketch after this list):

  • decide the filter to apply on the partition columns, e.g. filters=[('FODT_AAR', '>', 1995)]
  • call read().combine_chunks() on the pyarrow.parquet.ParquetDataset
  • cast the column used as partition_cols back to an appropriate datatype (another option is to duplicate the column before writing it down, which is probably simpler but stores the data twice)
  • call .to_pandas() on the resulting pyarrow table
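
A sketch of the read side under the same assumptions (same placeholder path and filesystem as above); the cast at the end is one way to restore the partition column's type, since it can come back as a categorical/string column:

```python
import gcsfs
import pyarrow.parquet as pq

gcs_file_system = gcsfs.GCSFileSystem()
outpath = "my-bucket/nudb/partitioned_dataset"  # same placeholder path as above

# Only partitions matching the filter are read from GCS.
dataset = pq.ParquetDataset(
    outpath,
    filesystem=gcs_file_system,
    filters=[("FODT_AAR", ">", 1995)],
)

df = dataset.read().combine_chunks().to_pandas()

# The partition column may come back as a categorical of strings;
# cast it back to an integer if downstream code expects that.
df["FODT_AAR"] = df["FODT_AAR"].astype(str).astype("int64")
```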

Here is an experiment:
https://github.com/statisticsnorway/utd_nudb/blob/carl_experiments_daplaprod/experiments/partition_parquet.ipynb
