docs/SparkContext.md (21 additions, 15 deletions)
@@ -513,6 +513,27 @@ withScope[U](

!!! note
    `withScope` is used for most (if not all) `SparkContext` API operators.
## Finding Preferred Locations for RDD Partition { #getPreferredLocs }

```scala
getPreferredLocs(
  rdd: RDD[_],
  partition: Int): Seq[TaskLocation]
```

`getPreferredLocs` requests the [DAGScheduler](#dagScheduler) for the [preferred locations](scheduler/DAGScheduler.md#getPreferredLocs) of the given `partition` (of the given [RDD](rdd/RDD.md)).

!!! note
    **Preferred locations** of an RDD partition are also referred to as _placement preferences_ or _locality preferences_.

---

`getPreferredLocs` is used when:

* `CoalescedRDDPartition` is requested to `localFraction`
* `DefaultPartitionCoalescer` is requested to `currPrefLocs`
* `PartitionerAwareUnionRDD` is requested to `currPrefLocs`
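
As a quick check from user code, the public `RDD.preferredLocations` method reports the placement preferences of a single partition (`getPreferredLocs` itself is `private[spark]` and so not directly available to applications). A minimal sketch, assuming a `local[*]` session; a parallelized in-memory collection typically reports no preferences, unlike e.g. an HDFS-backed RDD:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("preferred-locations-demo"))

val rdd = sc.parallelize(1 to 100, numSlices = 4)

// Print the placement preferences (host names) of every partition.
// An in-memory collection has none, so this prints empty sequences.
rdd.partitions.foreach { p =>
  println(s"partition ${p.index}: ${rdd.preferredLocations(p)}")
}

sc.stop()
```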

## Logging

Enable `ALL` logging level for `org.apache.spark.SparkContext` logger to see what happens inside.
@@ -1236,21 +1257,6 @@ SparkContext may have a core:ContextCleaner.md[ContextCleaner] defined.

`ContextCleaner` is created when `SparkContext` is created with configuration-properties.md#spark.cleaner.referenceTracking[spark.cleaner.referenceTracking] configuration property enabled.
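
For illustration, reference tracking can be set explicitly when building the `SparkConf` (a sketch; `spark.cleaner.referenceTracking` defaults to `true`, so `ContextCleaner` is created unless the property is explicitly disabled):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// ContextCleaner is only created when reference tracking is enabled
// (which it is by default).
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("context-cleaner-demo")
  .set("spark.cleaner.referenceTracking", "true")

val sc = new SparkContext(conf)
```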

-== [[getPreferredLocs]] Finding Preferred Locations (Placement Preferences) for RDD Partition
-
-[source, scala]
-----
-getPreferredLocs(
-  rdd: RDD[_],
-  partition: Int): Seq[TaskLocation]
-----
-
-getPreferredLocs simply scheduler:DAGScheduler.md#getPreferredLocs[requests `DAGScheduler` for the preferred locations for `partition`].
-
-NOTE: Preferred locations of a partition of a RDD are also called *placement preferences* or *locality preferences*.
-
-getPreferredLocs is used in CoalescedRDDPartition, DefaultPartitionCoalescer and PartitionerAwareUnionRDD.

docs/scheduler/TaskSetManager.md (27 additions, 6 deletions)
@@ -35,7 +35,7 @@ Epoch for [taskSet]: [epoch]

`TaskSetManager` [adds the tasks as pending execution](#addPendingTask) (in reverse order from the highest partition to the lowest).

-### <span id="maxTaskFailures"> Number of Task Failures
+### Number of Task Failures { #maxTaskFailures }

`TaskSetManager` is given a `maxTaskFailures` value, the number of times a [single task can fail](#handleFailedTask) before the whole [TaskSet](#taskSet) is [aborted](#abort).
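
In a cluster deployment this value typically comes from the `spark.task.maxFailures` configuration property (default: `4`; in local mode it is derived from the master URL, e.g. `local[4, 2]`). A configuration sketch:

```scala
import org.apache.spark.SparkConf

// Allow each task up to 8 attempts before its TaskSet is aborted.
val conf = new SparkConf()
  .setAppName("max-task-failures-demo")
  .set("spark.task.maxFailures", "8")
```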
@@ -88,7 +88,7 @@ In the end, `resourceOffer` returns the `TaskDescription`, `hasScheduleDelayReje

* `TaskSchedulerImpl` is requested to [resourceOfferSingleTaskSet](TaskSchedulerImpl.md#resourceOfferSingleTaskSet)

-## <span id="getLocalityWait"> Locality Wait
+## Locality Wait { #getLocalityWait }

```scala
getLocalityWait(
  level: TaskLocality.TaskLocality): Long
```
@@ -116,11 +116,11 @@ Unless the value has been determined, `getLocalityWait` defaults to `0`.

* `TaskSetManager` is [created](#localityWaits) and [recomputes locality preferences](#recomputeLocality)
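
The actual wait times come from the `spark.locality.wait` family of configuration properties: `spark.locality.wait` (default: `3s`) serves as the fallback for the per-level `spark.locality.wait.process`, `spark.locality.wait.node` and `spark.locality.wait.rack`. A configuration sketch:

```scala
import org.apache.spark.SparkConf

// Per-level locality waits; setting a level to 0 disables waiting
// for that locality level altogether.
val conf = new SparkConf()
  .set("spark.locality.wait", "3s")          // fallback for the levels below
  .set("spark.locality.wait.process", "3s")  // PROCESS_LOCAL
  .set("spark.locality.wait.node", "3s")     // NODE_LOCAL
  .set("spark.locality.wait.rack", "3s")     // RACK_LOCAL
```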

`TaskSetManager` uses the [spark.driver.maxResultSize](../configuration-properties.md#spark.driver.maxResultSize) configuration property to [check available memory for more task results](#canFetchMoreResults).
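
For example (a sketch; the property defaults to `1g`, and `0` means unlimited):

```scala
import org.apache.spark.SparkConf

// Cap the total size of serialized results of all partitions
// of a single action (e.g. collect) at 2 GiB.
val conf = new SparkConf()
  .set("spark.driver.maxResultSize", "2g")
```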

The `TaskSetManager` object defines `TASK_SIZE_TO_WARN_KIB` (`1000` KiB) as the threshold to warn a user when a stage contains a task with a larger serialized size.

`TaskSetManager` can print out the following WARN message to the logs when requested to [prepareLaunchingTask](#prepareLaunchingTask):

```text
Stage [stageId] contains a task of very large size ([serializedTask] KiB).
The maximum recommended task size is 1000 KiB.
```
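
To see how a task can grow past that threshold, consider a closure that accidentally captures a large driver-side array; broadcasting the data instead keeps the serialized task small. A sketch (names are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("task-size-demo"))

// A few MB of driver-side data.
val lookup = Array.fill(1 << 20)(42)

// Anti-pattern: `lookup` is captured by the closure and shipped with
// every serialized task, which can trigger the warning above.
val slow = sc.parallelize(1 to 100).map(i => i + lookup(i))

// Better: broadcast the data once; tasks only carry a small handle.
val lookupBc = sc.broadcast(lookup)
val fast = sc.parallelize(1 to 100).map(i => i + lookupBc.value(i))

println(fast.sum())
sc.stop()
```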
## Demo

Enable `DEBUG` logging level for `org.apache.spark.scheduler.TaskSchedulerImpl` (or `org.apache.spark.scheduler.cluster.YarnScheduler` for YARN) and `org.apache.spark.scheduler.TaskSetManager` and execute the following two-stage job to see their low-level inner workings.
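
A minimal two-stage job of that kind (a sketch, not necessarily the exact demo the text refers to; the `groupBy` shuffle splits the computation into two stages) can be run in `spark-shell`:

```scala
// groupBy introduces a shuffle, so the job runs as two stages:
// one writing shuffle output, one reading it back and counting groups.
sc.parallelize(0 until 100, numSlices = 4)
  .groupBy(_ % 10)
  .count()
```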