
Commit d139926

Serialized Task Size Threshold + Finding Preferred Locations for RDD Partition
1 parent 2077326 commit d139926

File tree (3 files changed: +55 −23 lines)

* docs/SparkContext.md
* docs/scheduler/DAGScheduler.md
* docs/scheduler/TaskSetManager.md

docs/SparkContext.md

Lines changed: 21 additions & 15 deletions
@@ -513,6 +513,27 @@ withScope[U](
 !!! note
     `withScope` is used for most (if not all) `SparkContext` API operators.

+## Finding Preferred Locations for RDD Partition { #getPreferredLocs }
+
+```scala
+getPreferredLocs(
+  rdd: RDD[_],
+  partition: Int): Seq[TaskLocation]
+```
+
+`getPreferredLocs` requests the [DAGScheduler](#dagScheduler) for the [preferred locations](scheduler/DAGScheduler.md#getPreferredLocs) of the given `partition` (of the given [RDD](rdd/RDD.md)).
+
+!!! note
+    **Preferred locations** of a RDD partition are also referred to as _placement preferences_ or _locality preferences_.
+
+---
+
+`getPreferredLocs` is used when:
+
+* `CoalescedRDDPartition` is requested to `localFraction`
+* `DefaultPartitionCoalescer` is requested to `currPrefLocs`
+* `PartitionerAwareUnionRDD` is requested to `currPrefLocs`
+
 ## Logging

 Enable `ALL` logging level for `org.apache.spark.SparkContext` logger to see what happens inside.
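
For context (not part of the diff above): `SparkContext.getPreferredLocs` is a `private[spark]` API, so end-user code usually observes the same placement preferences through the public `RDD.preferredLocations` method. A minimal sketch, assuming a hypothetical HDFS input path:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PreferredLocationsDemo extends App {
  val sc = new SparkContext(
    new SparkConf().setMaster("local[*]").setAppName("preferred-locations-demo"))

  // For an HDFS-backed RDD, a partition's preferred locations are typically
  // the hosts of the datanodes that store the corresponding block.
  val lines = sc.textFile("hdfs:///tmp/input.txt")  // hypothetical path

  lines.partitions.foreach { p =>
    println(s"partition ${p.index} -> ${lines.preferredLocations(p).mkString(", ")}")
  }

  sc.stop()
}
```
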
@@ -1236,21 +1257,6 @@ SparkContext may have a core:ContextCleaner.md[ContextCleaner] defined.

 `ContextCleaner` is created when `SparkContext` is created with configuration-properties.md#spark.cleaner.referenceTracking[spark.cleaner.referenceTracking] configuration property enabled.

-== [[getPreferredLocs]] Finding Preferred Locations (Placement Preferences) for RDD Partition
-
-[source, scala]
-----
-getPreferredLocs(
-  rdd: RDD[_],
-  partition: Int): Seq[TaskLocation]
-----
-
-getPreferredLocs simply scheduler:DAGScheduler.md#getPreferredLocs[requests `DAGScheduler` for the preferred locations for `partition`].
-
-NOTE: Preferred locations of a partition of a RDD are also called *placement preferences* or *locality preferences*.
-
-getPreferredLocs is used in CoalescedRDDPartition, DefaultPartitionCoalescer and PartitionerAwareUnionRDD.
-
 == [[persistRDD]] Registering RDD in persistentRdds Internal Registry -- `persistRDD` Internal Method

 [source, scala]

docs/scheduler/DAGScheduler.md

Lines changed: 7 additions & 2 deletions
@@ -736,13 +736,18 @@ In the end, with no tasks to submit for execution, `submitMissingTasks` [submits

 ```scala
 getPreferredLocs(
-rdd: RDD[_],
+  rdd: RDD[_],
   partition: Int): Seq[TaskLocation]
 ```

 `getPreferredLocs` is simply an alias for the internal (recursive) [getPreferredLocsInternal](#getPreferredLocsInternal).

-`getPreferredLocs` is used when...FIXME
+---
+
+`getPreferredLocs` is used when:
+
+* `SparkContext` is requested to [getPreferredLocs](../SparkContext.md#getPreferredLocs)
+* `DAGScheduler` is requested to [submit the missing tasks of a stage](#submitMissingTasks)

 ## Finding BlockManagers (Executors) for Cached RDD Partitions (aka Block Location Discovery) { #getCacheLocs }

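As an aside (illustrative only, not from this commit): `getPreferredLocsInternal` ultimately resolves placement preferences from the RDD lineage, for example from an RDD's own `getPreferredLocations`. A hedged sketch of a custom RDD that pins partitions to hosts (the class and host names are made up):

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type carrying only its index.
case class PinnedPartition(index: Int) extends Partition

// Hypothetical RDD with one partition per host in `hosts`.
class HostPinnedRDD(sc: SparkContext, hosts: Seq[String]) extends RDD[Int](sc, Nil) {

  override def getPartitions: Array[Partition] =
    Array.tabulate[Partition](hosts.length)(i => PinnedPartition(i))

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)

  // The per-partition placement preference the scheduler eventually sees.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(hosts(split.index))
}
```
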
docs/scheduler/TaskSetManager.md

Lines changed: 27 additions & 6 deletions
@@ -35,7 +35,7 @@ Epoch for [taskSet]: [epoch]

 `TaskSetManager` [adds the tasks as pending execution](#addPendingTask) (in reverse order from the highest partition to the lowest).

-### <span id="maxTaskFailures"> Number of Task Failures
+### Number of Task Failures { #maxTaskFailures }

 `TaskSetManager` is given `maxTaskFailures` value that is how many times a [single task can fail](#handleFailedTask) before the whole [TaskSet](#taskSet) is [aborted](#abort).

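Hedged aside (not part of the diff): the `maxTaskFailures` value handed to a `TaskSetManager` normally derives from the `spark.task.maxFailures` configuration property, which applies in cluster mode. A minimal sketch with an arbitrary example value:

```scala
import org.apache.spark.SparkConf

// spark.task.maxFailures: number of attempts per task before the TaskSet is aborted
// ("8" is only an example; the usual default is 4).
val conf = new SparkConf()
  .setAppName("task-failures-demo")
  .set("spark.task.maxFailures", "8")
```
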
@@ -88,7 +88,7 @@ In the end, `resourceOffer` returns the `TaskDescription`, `hasScheduleDelayReje

 * `TaskSchedulerImpl` is requested to [resourceOfferSingleTaskSet](TaskSchedulerImpl.md#resourceOfferSingleTaskSet)

-## <span id="getLocalityWait"> Locality Wait
+## Locality Wait { #getLocalityWait }

 ```scala
 getLocalityWait(
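
Hedged aside (not from the diff): the waits that `getLocalityWait` resolves are configured with the `spark.locality.wait*` properties, where the per-level properties fall back to the base `spark.locality.wait`. A sketch with example values only:

```scala
import org.apache.spark.SparkConf

// Locality wait configuration (values are arbitrary examples; the base default is 3s).
val conf = new SparkConf()
  .set("spark.locality.wait", "3s")          // fallback for all levels
  .set("spark.locality.wait.process", "1s")  // PROCESS_LOCAL
  .set("spark.locality.wait.node", "3s")     // NODE_LOCAL
  .set("spark.locality.wait.rack", "5s")     // RACK_LOCAL
```
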
@@ -116,11 +116,11 @@ Unless the value has been determined, `getLocalityWait` defaults to `0`.

 * `TaskSetManager` is [created](#localityWaits) and [recomputes locality preferences](#recomputeLocality)

-## <span id="maxResultSize"> spark.driver.maxResultSize
+## spark.driver.maxResultSize { #maxResultSize }

 `TaskSetManager` uses [spark.driver.maxResultSize](../configuration-properties.md#spark.driver.maxResultSize) configuration property to [check available memory for more task results](#canFetchMoreResults).

-## <span id="recomputeLocality"> Recomputing Task Locality Preferences
+## Recomputing Task Locality Preferences { #recomputeLocality }

 ```java
 recomputeLocality(): Unit
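
Hedged aside (not from the diff): `spark.driver.maxResultSize` caps the total size of serialized results that a single action can send back to the driver, with `0` disabling the limit. A sketch with an example value:

```scala
import org.apache.spark.SparkConf

// Raise the cap on serialized results collected by the driver
// ("2g" is only an example; the commonly cited default is 1g, and "0" means unlimited).
val conf = new SparkConf()
  .set("spark.driver.maxResultSize", "2g")
```
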
@@ -150,7 +150,7 @@ While in zombie state, a `TaskSetManager` can launch no new tasks and responds w

 A `TaskSetManager` remains in the zombie state until all tasks have finished running, i.e. to continue to track and account for the running tasks.

-## <span id="computeValidLocalityLevels"> Computing Locality Levels (for Scheduled Tasks)
+## Computing Locality Levels (for Scheduled Tasks) { #computeValidLocalityLevels }

 ```scala
 computeValidLocalityLevels(): Array[TaskLocality.TaskLocality]
@@ -182,7 +182,7 @@ Valid locality levels for [taskSet]: [comma-separated levels]

 * `TaskSetManager` is [created](#myLocalityLevels) and to [recomputeLocality](#recomputeLocality)

-## <span id="executorAdded"> executorAdded
+## executorAdded { #executorAdded }

 ```scala
 executorAdded(): Unit
@@ -222,6 +222,27 @@ prepareLaunchingTask(
 * `TaskSchedulerImpl` is requested to [resourceOffers](TaskSchedulerImpl.md#resourceOffers)
 * `TaskSetManager` is requested to [resourceOffers](#resourceOffers)

+## Serialized Task Size Threshold { #TASK_SIZE_TO_WARN_KIB }
+
+`TaskSetManager` object defines `TASK_SIZE_TO_WARN_KIB` value as the threshold to warn a user if any stages contain a task that has a serialized size greater than `1000` kB.
+
+### DAGScheduler { #TASK_SIZE_TO_WARN_KIB-DAGScheduler }
+
+`DAGScheduler` can print out the following WARN message to the logs when requested to [submitMissingTasks](DAGScheduler.md#submitMissingTasks):
+
+```text
+Broadcasting large task binary with size [taskBinaryBytes] [siByteSuffix]
+```
+
+### TaskSetManager { #TASK_SIZE_TO_WARN_KIB-TaskSetManager }
+
+`TaskSetManager` can print out the following WARN message to the logs when requested to [prepareLaunchingTask](#prepareLaunchingTask):
+
+```text
+Stage [stageId] contains a task of very large size ([serializedTask] KiB).
+The maximum recommended task size is 1000 KiB.
+```
+
 ## Demo

 Enable `DEBUG` logging level for `org.apache.spark.scheduler.TaskSchedulerImpl` (or `org.apache.spark.scheduler.cluster.YarnScheduler` for YARN) and `org.apache.spark.scheduler.TaskSetManager` and execute the following two-stage job to see their low-level innerworkings.
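
Hedged aside (not from the diff): a common way to run into the `TaskSetManager` warning above is parallelizing a large local collection, because its elements travel inside the serialized tasks rather than coming from a distributed data source. A minimal sketch, with sizes chosen only to illustrate the idea:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LargeTaskDemo extends App {
  val sc = new SparkContext(
    new SparkConf().setMaster("local[*]").setAppName("large-task-demo"))

  // Roughly 16 MB of integers split across 2 tasks tends to exceed
  // the 1000 KiB per-task warning threshold.
  val big = Array.fill(4 * 1000 * 1000)(scala.util.Random.nextInt())
  sc.parallelize(big, numSlices = 2).count()

  sc.stop()
}
```
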
