Commit ac57607

Spark's Tips and Tricks

1 parent e5ccbaf commit ac57607

File tree

10 files changed: +192 −204 lines changed

docs/spark-tips-and-tricks-running-spark-windows.md

Lines changed: 0 additions & 135 deletions
This file was deleted.

docs/tips-and-tricks/.pages

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+title: Spark's Tips and Tricks
+nav:
+- index.md
+- ...

docs/spark-tips-and-tricks-access-private-members-spark-shell.md renamed to docs/tips-and-tricks/access-private-members-spark-shell.md

Lines changed: 3 additions & 3 deletions
@@ -1,12 +1,12 @@
-== Access private members in Scala in Spark shell
+# Access private members in Scala in Spark shell
 
-If you ever wanted to use `private[spark]` members in Spark using the Scala programming language, e.g. toy with `org.apache.spark.scheduler.DAGScheduler` or similar, you will have to use the following trick in Spark shell - use `:paste -raw` as described in https://issues.scala-lang.org/browse/SI-5299[REPL: support for package definition].
+If you ever wanted to use `private[spark]` members in Spark using the Scala programming language, e.g. toy with `org.apache.spark.scheduler.DAGScheduler` or similar, you will have to use the following trick in Spark shell - use `:paste -raw` as described in [REPL: support for package definition](https://issues.scala-lang.org/browse/SI-5299).
 
 Open `spark-shell` and execute `:paste -raw` that allows you to enter any valid Scala code, including `package`.
 
 The following snippet shows how to access `private[spark]` member `DAGScheduler.RESUBMIT_TIMEOUT`:
 
-```
+```text
 scala> :paste -raw
 // Entering paste mode (ctrl-D to finish)
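For reference, a minimal sketch of the complete trick, assuming the Spark 1.x-era `DAGScheduler.RESUBMIT_TIMEOUT` member (the wrapper name `DAGSchedulerPeek` is made up for illustration):

```scala
// Paste after `:paste -raw` in spark-shell; press ctrl-D to compile.
// Declaring the wrapper inside the org.apache.spark package is what
// makes private[spark] members visible to it.
package org.apache.spark

object DAGSchedulerPeek {
  import org.apache.spark.scheduler.DAGScheduler
  // RESUBMIT_TIMEOUT is private[spark], so it is accessible from here.
  def resubmitTimeout = DAGScheduler.RESUBMIT_TIMEOUT
}
```

Back at the regular prompt, `org.apache.spark.DAGSchedulerPeek.resubmitTimeout` should then return the value.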

docs/spark-tips-and-tricks.md renamed to docs/tips-and-tricks/index.md

Lines changed: 9 additions & 14 deletions
@@ -1,39 +1,34 @@
-= Spark Tips and Tricks
+# Spark's Tips and Tricks
 
-== [[SPARK_PRINT_LAUNCH_COMMAND]] Print Launch Command of Spark Scripts
+## Print Launch Command of Spark Scripts { #SPARK_PRINT_LAUNCH_COMMAND }
 
-`SPARK_PRINT_LAUNCH_COMMAND` environment variable controls whether the Spark launch command is printed out to the standard error output, i.e. `System.err`, or not.
-
-```
-Spark Command: [here comes the command]
-========================================
-```
+`SPARK_PRINT_LAUNCH_COMMAND` environment variable controls whether or not the Spark launch command is printed out to the standard error output.
 
 All the Spark shell scripts use `org.apache.spark.launcher.Main` class internally that checks `SPARK_PRINT_LAUNCH_COMMAND` and when set (to any value) will print out the entire command line to launch it.
 
-```
+```text
 $ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell
 Spark Command: /Library/Java/JavaVirtualMachines/Current/Contents/Home/bin/java -cp /Users/jacek/dev/oss/spark/conf/:/Users/jacek/dev/oss/spark/assembly/target/scala-2.11/spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar -Dscala.usejavacp=true -Xms1g -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://localhost:7077 --class org.apache.spark.repl.Main --name Spark shell spark-shell
 ========================================
 ```
 
-== Show Spark version in Spark shell
+## Show Spark version in Spark shell
 
 In spark-shell, use `sc.version` or `org.apache.spark.SPARK_VERSION` to know the Spark version:
 
-```
+```text
 scala> sc.version
 res0: String = 1.6.0-SNAPSHOT
 
 scala> org.apache.spark.SPARK_VERSION
 res1: String = 1.6.0-SNAPSHOT
 ```
 
-== Resolving local host name
+## Resolving local host name
 
 When you face networking issues when Spark can't resolve your local hostname or IP address, use the preferred `SPARK_LOCAL_HOSTNAME` environment variable as the custom host name or `SPARK_LOCAL_IP` as the custom IP that is going to be later resolved to a hostname.
 
-Spark checks them out before using http://docs.oracle.com/javase/8/docs/api/java/net/InetAddress.html#getLocalHost--[java.net.InetAddress.getLocalHost()] (consult https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L759[org.apache.spark.util.Utils.findLocalInetAddress()] method).
+Spark checks them out before using [java.net.InetAddress.getLocalHost()](http://docs.oracle.com/javase/8/docs/api/java/net/InetAddress.html#getLocalHost--) (consult [org.apache.spark.util.Utils.findLocalInetAddress()](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L759) method).
 
 You may see the following WARN messages in the logs when Spark finished the resolving process:
 
@@ -44,7 +39,7 @@ Set SPARK_LOCAL_IP if you need to bind to another address
 
 ## Starting standalone Master and workers on Windows 7
 
-Windows 7 users can use [spark-class](tools/spark-class.md) to start Spark Standalone as there are no launch scripts for the Windows platform.
+Windows 7 users can use [spark-class](../tools/spark-class.md) to start Spark Standalone as there are no launch scripts for the Windows platform.
 
 ```text
 ./bin/spark-class org.apache.spark.deploy.master.Master -h localhost
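A usage sketch for the "Resolving local host name" tip above (the values are illustrative):

```text
# Prefer a custom host name...
SPARK_LOCAL_HOSTNAME=localhost ./bin/spark-shell

# ...or a custom IP that is later resolved to a host name
SPARK_LOCAL_IP=127.0.0.1 ./bin/spark-shell
```
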
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
+# Running Spark Applications on Windows
+
+Running Spark applications on Windows is in general no different from running them on other operating systems like Linux or macOS.
+
+!!! note
+    A Spark application could be [spark-shell](../tools/spark-shell.md) or your own custom Spark application.
+
+What does make an important difference between the operating systems is Apache Hadoop, which Spark uses internally for file system access.
+
+You may run into a few minor issues when you are on Windows due to the way Hadoop works with Windows' POSIX-incompatible NTFS filesystem.
+
+!!! note
+    You are not required to install Apache Hadoop to develop or run Spark applications.
+
+!!! tip
+    Read the Apache Hadoop project's [Problems running Hadoop on Windows](https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems).
+
+Among the issues is the infamous `java.io.IOException` while running Spark Shell (below is a stacktrace from Spark 2.0.2 on Windows 10, so the line numbers may differ in your case).
+
+```text
+16/12/26 21:34:11 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
+java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
+  at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
+  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
+  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
+  at org.apache.hadoop.hive.conf.HiveConf$ConfVars.findHadoopBinary(HiveConf.java:2327)
+  at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:365)
+  at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
+  at java.lang.Class.forName0(Native Method)
+  at java.lang.Class.forName(Class.java:348)
+  at org.apache.spark.util.Utils$.classForName(Utils.scala:228)
+  at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:963)
+  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:91)
+```
+
+!!! note
+    You need to have Administrator rights on your laptop.
+    All the following commands must be executed in a command-line window (`cmd`) run as Administrator (i.e., using the **Run as administrator** option while starting `cmd`).
+
+Download the `winutils.exe` binary from the [steveloughran/winutils](https://github.com/steveloughran/winutils) GitHub repository.
+
+!!! note
+    Select the version of Hadoop the Spark distribution was compiled with, e.g. use `hadoop-2.7.1` for Spark 2 ([here is the direct link to the `winutils.exe` binary](https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe)).
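One way to double-check which Hadoop version a binary Spark distribution bundles is to ask Hadoop's `VersionInfo` API from `spark-shell` (a sketch; the output shown assumes a `hadoop2.7` build):

```text
scala> org.apache.hadoop.util.VersionInfo.getVersion
res0: String = 2.7.1
```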
+Save the `winutils.exe` binary to a directory of your choice (e.g., `c:\hadoop\bin`).
+
+Set `HADOOP_HOME` to reflect the directory with `winutils.exe` (without `bin`).
+
+```text
+set HADOOP_HOME=c:\hadoop
+```
+
+Set the `PATH` environment variable to include `%HADOOP_HOME%\bin` as follows:
+
+```text
+set PATH=%HADOOP_HOME%\bin;%PATH%
+```
+
+!!! tip
+    Define the `HADOOP_HOME` and `PATH` environment variables in Control Panel so any Windows program can use them.
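As an alternative to the Control Panel route, a sketch using the built-in `setx` command (note that `setx` persists the variables for newly started shells only and truncates values longer than 1024 characters):

```text
setx HADOOP_HOME c:\hadoop
setx PATH "%HADOOP_HOME%\bin;%PATH%"
```
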
+Create the `C:\tmp\hive` directory.
+
+!!! note
+    `c:\tmp\hive` directory is the default value of the [`hive.exec.scratchdir` configuration property](https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.scratchdir) in Hive 0.14.0 and later, and Spark uses a custom build of Hive 1.2.1.
+
+You can change the `hive.exec.scratchdir` configuration property to another directory as described in [Changing `hive.exec.scratchdir` Configuration Property](#changing-hive.exec.scratchdir) in this document.
+
+Execute the following command in the `cmd` window that you started using the **Run as administrator** option:
+
+```text
+winutils.exe chmod -R 777 C:\tmp\hive
+```
+
+Check the permissions (that is one of the commands that are executed under the covers):
+
+```text
+winutils.exe ls -F C:\tmp\hive
+```
+
+Open `spark-shell` and observe the output (perhaps with a few WARN messages that you can simply disregard).
+
+As a verification step, execute the following lines to display the content of a `DataFrame` (note that `lit` requires an explicit import in `spark-shell`):
+
+```text
+scala> import org.apache.spark.sql.functions.lit
+import org.apache.spark.sql.functions.lit
+
+scala> spark.range(1).withColumn("status", lit("All seems fine. Congratulations!")).show(false)
++---+--------------------------------+
+|id |status                          |
++---+--------------------------------+
+|0  |All seems fine. Congratulations!|
++---+--------------------------------+
+```
+
+!!! note
+    Disregard WARN messages when you start `spark-shell`. They are harmless.
+
+```text
+16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-core-3.2.10.jar."
+16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-api-jdo-3.2.6.jar."
+16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-rdbms-3.2.9.jar."
+```
+
+If you see the above output, you're done! You should now be able to run Spark applications on Windows. Congrats! 👏👏👏
+
+## Changing hive.exec.scratchdir { #changing-hive.exec.scratchdir }
+
+Create a `hive-site.xml` file with the following content:
+
+```xml
+<configuration>
+  <property>
+    <name>hive.exec.scratchdir</name>
+    <value>/tmp/mydir</value>
+    <description>Scratch space for Hive jobs</description>
+  </property>
+</configuration>
+```
+
+Start a Spark application (e.g., `spark-shell`) with the `HADOOP_CONF_DIR` environment variable set to the directory with `hive-site.xml`.
+
+```text
+HADOOP_CONF_DIR=conf ./bin/spark-shell
+```
docs/spark-tips-and-tricks-sparkexception-task-not-serializable.md renamed to docs/tips-and-tricks/sparkexception-task-not-serializable.md

Lines changed: 6 additions & 6 deletions
@@ -1,8 +1,8 @@
-== org.apache.spark.SparkException: Task not serializable
+# org.apache.spark.SparkException: Task not serializable
 
 When you run into `org.apache.spark.SparkException: Task not serializable` exception, it means that you use a reference to an instance of a non-serializable class inside a transformation. See the following example:
 
-```
+```text
 ➜  spark git:(master) ✗ ./bin/spark-shell
 Welcome to
       ____              __
@@ -68,8 +68,8 @@ Serialization stack:
 	... 57 more
 ```
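A minimal sketch that reproduces the exception in `spark-shell` (the class and value names are made up for illustration):

```scala
// A class that does not extend Serializable.
class NotSerializablePrinter {
  def print(msg: String): Unit = println(msg)
}

val printer = new NotSerializablePrinter
val rdd = sc.parallelize(Seq("hi"))

// The closure captures `printer`, so Spark must serialize it to ship
// the task to executors - and fails with "Task not serializable".
rdd.foreach(msg => printer.print(msg))
```

Typical fixes are to make the class extend `Serializable` or to create the instance inside the closure so nothing non-serializable is captured.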
 
-=== Further reading
+## Learn More
 
-* https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html[Job aborted due to stage failure: Task not serializable]
-* https://issues.apache.org/jira/browse/SPARK-5307[Add utility to help with NotSerializableException debugging]
-* http://stackoverflow.com/q/22592811/1305344[Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects]
+* [Job aborted due to stage failure: Task not serializable](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html)
+* [Add utility to help with NotSerializableException debugging](https://issues.apache.org/jira/browse/SPARK-5307)
+* [Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects](http://stackoverflow.com/q/22592811/1305344)
