|
1 | 1 | # RangePartitioner
|
2 | 2 |
|
3 |
| -`RangePartitioner` is a [Partitioner](Partitioner.md) for **bucketed partitioning**. |
| 3 | +`RangePartitioner` is a [Partitioner](Partitioner.md) that partitions sortable records by range into roughly equal ranges (that can be used for **bucketed partitioning**). |
4 | 4 |
|
5 |
| -`RangePartitioner` is used for [sortByKey](OrderedRDDFunctions.md#sortByKey) operator (among other uses). |
| 5 | +`RangePartitioner` is used (_mostly_) by the [sortByKey](OrderedRDDFunctions.md#sortByKey) operator. |
6 | 6 |
|
7 | 7 | ## Creating Instance
|
8 | 8 |
|
9 | 9 | `RangePartitioner` takes the following to be created:
|
10 | 10 |
|
11 |
| -* <span id="partitions"> Number of Partitions |
| 11 | +* <span id="partitions"> Hint for the number of partitions |
12 | 12 | * <span id="rdd"> Key-Value [RDD](RDD.md) (`RDD[_ <: Product2[K, V]]`)
|
13 | 13 | * <span id="ascending"> `ascending` flag (default: `true`)
|
14 | 14 | * <span id="samplePointsPerPartitionHint"> samplePointsPerPartitionHint (default: `20`)
|
|
19 | 19 | numPartitions: Int
|
20 | 20 | ```
|
21 | 21 |
|
22 |
| -`numPartitions` is the length of the [rangeBounds](#rangeBounds) array plus `1`. |
23 |
| - |
24 | 22 | `numPartitions` is part of the [Partitioner](Partitioner.md#numPartitions) abstraction.
|
25 | 23 |
|
| 24 | +--- |
| 25 | + |
| 26 | +`numPartitions` is 1 more than the length of the [range bounds](#rangeBounds) (since the number of [range bounds](#rangeBounds) is 0 for 0 or 1 partitions). |
| 27 | + |
26 | 28 | ## <span id="getPartition"> Partition for Key
|
27 | 29 |
|
28 | 30 | ```scala
|
29 | 31 | getPartition(
|
30 | 32 | key: Any): Int
|
31 | 33 | ```
|
32 | 34 |
|
33 |
| -`getPartition`...FIXME |
34 |
| - |
35 | 35 | `getPartition` is part of the [Partitioner](Partitioner.md#getPartition) abstraction.
|
36 | 36 |
|
| 37 | +--- |
| 38 | + |
| 39 | +`getPartition` branches off based on the length of the [range bounds](#rangeBounds). |
| 40 | + |
| 41 | +For up to 128 range bounds, `getPartition` does a linear scan. It starts with the candidate partition number `0` and walks over the [rangeBounds](#rangeBounds), incrementing the candidate for every range bound that the given `key` is greater than. The scan stops at the first range bound that the `key` is not greater than (i.e. the `key` is less than or equal to that upper bound) or when there are no more [rangeBounds](#rangeBounds), in which case the candidate is the number of the [rangeBounds](#rangeBounds) (the last partition). |
| 42 | + |
| 43 | +For the number of the [rangeBounds](#rangeBounds) above 128, `getPartition`...FIXME |
| 44 | + |
| 45 | +In the end, `getPartition` returns the candidate partition number when [ascending](#ascending) is enabled, or flips it (to be the number of the [rangeBounds](#rangeBounds) minus the candidate partition number) otherwise. |
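
The linear-scan branch and the final flip can be sketched as follows. This is a minimal illustration and not the Spark source: `GetPartitionSketch` and `partitionFor` are hypothetical names, keys are assumed to be `Int` (the real partitioner works with any `K` that has an `Ordering`), and the branch for more than 128 range bounds is left out.

```scala
object GetPartitionSketch {
  // Hypothetical helper (not a Spark API): linear scan over the upper bounds, Int keys only.
  def partitionFor(key: Int, rangeBounds: Array[Int], ascending: Boolean = true): Int = {
    var partition = 0
    // Advance while the key is strictly greater than the current upper bound.
    while (partition < rangeBounds.length && key > rangeBounds(partition)) {
      partition += 1
    }
    // For descending order, flip the partition number.
    if (ascending) partition else rangeBounds.length - partition
  }
}
```

With the bounds `Array(10, 20, 30)` there are 4 partitions: a key equal to a bound (e.g. `10`) stays in the partition that the bound closes, and any key above `30` lands in the last partition.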
| 46 | + |
37 | 47 | ## <span id="rangeBounds"> Range Bounds
|
38 | 48 |
|
39 | 49 | ```scala
|
40 | 50 | rangeBounds: Array[K]
|
41 | 51 | ```
|
42 | 52 |
|
43 |
| -`rangeBounds` is an `Array[K]`...FIXME |
| 53 | +`rangeBounds` is an array of upper bounds. |
| 54 | + |
| 55 | +For the [number of partitions](#partitions) up to and including 1, `rangeBounds` is an empty array. |
| 56 | + |
| 57 | +For more than 1 [partition](#partitions), `rangeBounds` determines the sample size per partition. The total sample size is the [samplePointsPerPartitionHint](#samplePointsPerPartitionHint) multiplied by the [number of partitions](#partitions), capped at `1e6`. `rangeBounds` allows for 3x over-sampling per partition. |
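
The sample-size arithmetic can be sketched as below. `SampleSizeSketch` and its method names are hypothetical. Dividing the over-sampled total by the number of partitions of the input RDD (`inputPartitions`) and rounding up are assumptions of this sketch and are not stated in the text above.

```scala
object SampleSizeSketch {
  // Total sample size: the hint multiplied by the number of partitions, capped at 1e6.
  def sampleSize(samplePointsPerPartitionHint: Int, partitions: Int): Double =
    math.min(samplePointsPerPartitionHint.toDouble * partitions, 1e6)

  // Assumption: the 3x over-sampled total is spread evenly over the partitions
  // of the input RDD and rounded up.
  def sampleSizePerPartition(sampleSize: Double, inputPartitions: Int): Int =
    math.ceil(3.0 * sampleSize / inputPartitions).toInt
}
```

With the default hint of `20` and `10` partitions the total sample size is `200`; under the assumption above, an input RDD with `4` partitions would then be sampled at `150` keys per partition.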
| 58 | + |
| 59 | +`rangeBounds` [sketches](#sketch) the keys of the [input rdd](#rdd) (with the `sampleSizePerPartition`). |
| 60 | + |
| 61 | +!!! note |
| 62 | + There is more going on in `rangeBounds`. |
| 63 | + |
| 64 | +In the end, `rangeBounds` [determines the bounds](#determineBounds). |
44 | 65 |
|
45 |
| -### <span id="determineBounds"> determineBounds Utility |
| 66 | +### <span id="determineBounds"> determineBounds |
46 | 67 |
|
47 | 68 | ```scala
|
48 |
| -determineBounds[K : Ordering : ClassTag]( |
| 69 | +determineBounds[K: Ordering]( |
49 | 70 | candidates: ArrayBuffer[(K, Float)],
|
50 | 71 | partitions: Int): Array[K]
|
51 | 72 | ```
|
|