Skip to content

PERF: parallelize libjoin calls #51364

@jbrockmendel

Description

@jbrockmendel
import numpy as np
import pandas as pd
import pandas._libs.join as libjoin

left = np.random.randint(0, 10**8, size=10**8)
right = np.random.randint(0, 10**8, size=10**8)

left.sort()
right.sort()

# The current usage
In [28]: %timeit result = libjoin.inner_join_indexer(left, right)
2.89 s ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# An alternative that _may_ be parallelizable
lloc = len(left) // 2
rloc = right.searchsorted(left[lloc])  # 734 ns ± 16.6 ns

chunk1 = libjoin.inner_join_indexer(left[:lloc], right[:rloc])
chunk2 = libjoin.inner_join_indexer(left[lloc:], right[rloc:])
result = tuple([np.r_[chunk1[i], chunk2[i]] for i in range(3)])
# totals 3.9 s ± 209 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The chunk1/chunk2 calls are each 1.47 s ± 40 ms and the concat is 1.05 s ± 179 ms. In a pyarrow or otherwise-chunked context the concat might not be needed.

For the non-object-dtype case, I'm pretty sure we could release the GIL in inner_join_indexer. Could we then do the chunk1/chunk2 calls in parallel? If so, could we split it further? cc @WillAyd?

Metadata

Metadata

Assignees

No one assigned

    Labels

    MultithreadingParallelism in pandasPerformanceMemory or execution speed performanceReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions