RangePartitioner (+ minor config fixes)

jaceklaskowski · jaceklaskowski · commit bc51417fe355 · 2022-07-18T15:56:49.000+02:00
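
The `getPartition` lookup that this change documents in `RangePartitioner.md` (the case of up to 128 range bounds) can be illustrated with a minimal, Spark-free sketch. It assumes the range bounds are sorted upper bounds and that the walk continues while the key is greater than the current bound. `RangeLookupSketch` and `partitionFor` are hypothetical names used only for illustration and are not part of the Spark API.

```scala
// Hypothetical sketch, not Spark's source code: a linear walk over sorted upper bounds.
object RangeLookupSketch {
  def partitionFor[K](rangeBounds: Array[K], key: K, ascending: Boolean = true)(
      implicit ord: Ordering[K]): Int = {
    var partition = 0
    // Advance while there are bounds left and the key is greater than the current bound.
    while (partition < rangeBounds.length && ord.gt(key, rangeBounds(partition))) {
      partition += 1
    }
    // For descending order, flip the partition number.
    if (ascending) partition else rangeBounds.length - partition
  }

  def main(args: Array[String]): Unit = {
    val bounds = Array(10, 20, 30) // 3 range bounds, hence 4 partitions
    println(partitionFor(bounds, 5))                    // 0
    println(partitionFor(bounds, 10))                   // 0 (a key equal to a bound is not greater than it)
    println(partitionFor(bounds, 25))                   // 2
    println(partitionFor(bounds, 99))                   // 3 (greater than every bound)
    println(partitionFor(bounds, 5, ascending = false)) // 3
  }
}
```

With 3 range bounds the result is always between 0 and 3, which matches `numPartitions` being 1 more than the number of range bounds.
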
diff --git a/docs/rdd/Partitioner.md b/docs/rdd/Partitioner.md
@@ -27,6 +27,5 @@ numPartitions: Int
 
 ## Implementations
 
-* CoalescedPartitioner (Spark SQL)
 * [HashPartitioner](HashPartitioner.md)
 * [RangePartitioner](RangePartitioner.md)
diff --git a/docs/rdd/RangePartitioner.md b/docs/rdd/RangePartitioner.md
@@ -1,14 +1,14 @@
 # RangePartitioner
 
-`RangePartitioner` is a [Partitioner](Partitioner.md) for **bucketed partitioning**.
+`RangePartitioner` is a [Partitioner](Partitioner.md) that partitions sortable records by range into roughly equal ranges (that can be used for **bucketed partitioning**).
 
-`RangePartitioner` is used for [sortByKey](OrderedRDDFunctions.md#sortByKey) operator (among other uses).
+`RangePartitioner` is used for the [sortByKey](OrderedRDDFunctions.md#sortByKey) operator (_mostly_).
 
 ## Creating Instance
 
 `RangePartitioner` takes the following to be created:
 
-* <span id="partitions"> Number of Partitions
+* <span id="partitions"> Hint for the number of partitions
 * <span id="rdd"> Key-Value [RDD](RDD.md) (`RDD[_ <: Product2[K, V]]`)
 * <span id="ascending"> `ascending` flag (default: `true`)
 * <span id="samplePointsPerPartitionHint"> samplePointsPerPartitionHint (default: `20`)
@@ -19,33 +19,54 @@
 numPartitions: Int
 ```
 
-`numPartitions` is the length of the [rangeBounds](#rangeBounds) array plus `1`.
-
 `numPartitions` is part of the [Partitioner](Partitioner.md#numPartitions) abstraction.
 
+---
+
+`numPartitions` is 1 more than the length of the [range bounds](#rangeBounds) (since the number of [range bounds](#rangeBounds) is 0 for 0 or 1 partitions).
+
 ## <span id="getPartition"> Partition for Key
 
 ```scala
 getPartition(
   key: Any): Int
 ```
 
-`getPartition`...FIXME
-
 `getPartition` is part of the [Partitioner](Partitioner.md#getPartition) abstraction.
 
+---
+
+`getPartition` branches off based on the length of the [range bounds](#rangeBounds).
+
+For up to 128 range bounds, `getPartition` is the index of the first range bound (from the [rangeBounds](#rangeBounds)) that the `key` value is not greater than, or the number of the [rangeBounds](#rangeBounds) (if the `key` value is greater than all of them). `getPartition` starts with the candidate partition number `0` and walks over the [rangeBounds](#rangeBounds) for as long as the given `key` value is greater than the current range bound and there are more [rangeBounds](#rangeBounds) to check. `getPartition` increments the candidate partition number every iteration.
+
+For the number of the [rangeBounds](#rangeBounds) above 128, `getPartition`...FIXME
+
+In the end, `getPartition` returns the candidate partition number when [ascending](#ascending) is enabled, or flips it (to be the number of the [rangeBounds](#rangeBounds) minus the candidate partition number) otherwise.
+
 ## <span id="rangeBounds"> Range Bounds
 
 ```scala
 rangeBounds: Array[K]
 ```
 
-`rangeBounds` is an `Array[K]`...FIXME
+`rangeBounds` is an array of upper bounds.
+
+For the [number of partitions](#partitions) up to and including 1, `rangeBounds` is an empty array.
+
+For more than 1 [partition](#partitions), `rangeBounds` determines the sample size per partition. The total sample size is the [samplePointsPerPartitionHint](#samplePointsPerPartitionHint) multiplied by the [number of partitions](#partitions), capped at `1e6`. `rangeBounds` allows for 3x over-sampling per partition.
+
+`rangeBounds` [sketches](#sketch) the keys of the [input rdd](#rdd) (with the `sampleSizePerPartition`).
+
+!!! note
+    There is more going on in `rangeBounds`.
+
+In the end, `rangeBounds` [determines the bounds](#determineBounds).
 
-### <span id="determineBounds"> determineBounds Utility
+### <span id="determineBounds"> determineBounds
 
 ```scala
-determineBounds[K : Ordering : ClassTag](
+determineBounds[K: Ordering](
   candidates: ArrayBuffer[(K, Float)],
   partitions: Int): Array[K]
 ```
diff --git a/docs/scheduler/DAGScheduler.md b/docs/scheduler/DAGScheduler.md
@@ -1483,7 +1483,7 @@ The lookup table of all stages per `ActiveJob` id
 nextJobId: AtomicInteger
 ```
 
-`nextJobId` is a Java [AtomicInteger]({{ java.doc }}/java/util/concurrent/atomic/AtomicInteger.html) for job IDs.
+`nextJobId` is a Java [AtomicInteger]({{ java.api }}/java/util/concurrent/atomic/AtomicInteger.html) for job IDs.
 
 `nextJobId` starts at `0`.
 
diff --git a/docs/scheduler/DAGSchedulerEventProcessLoop.md b/docs/scheduler/DAGSchedulerEventProcessLoop.md
@@ -4,7 +4,7 @@
 
 `DAGSchedulerEventProcessLoop` is registered under the name of **dag-scheduler-event-loop**.
 
-`DAGSchedulerEventProcessLoop` uses [java.util.concurrent.LinkedBlockingDeque]({{ java.doc }}/java/util/concurrent/LinkedBlockingDeque.html) blocking deque that can grow indefinitely.
+`DAGSchedulerEventProcessLoop` uses [java.util.concurrent.LinkedBlockingDeque]({{ java.api }}/java/util/concurrent/LinkedBlockingDeque.html) blocking deque that can grow indefinitely.
 
 ## Creating Instance