# Running Spark Applications on Windows

Running Spark applications on Windows is, in general, no different from running them on other operating systems like Linux or macOS.

!!! note
    A Spark application could be [spark-shell](../tools/spark-shell.md) or your own custom Spark application.

What does make an important difference between the operating systems is Apache Hadoop, which Spark uses internally for file system access.

You may run into a few minor issues on Windows due to the way Hadoop works with Windows' POSIX-incompatible NTFS filesystem.

!!! note
    You are not required to install Apache Hadoop to develop or run Spark applications.

!!! tip
    Read the Apache Hadoop project's [Problems running Hadoop on Windows](https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems).

Among the issues is the infamous `java.io.IOException` while running Spark Shell (the stacktrace below comes from Spark 2.0.2 on Windows 10, so the line numbers may differ in your case).

```text
16/12/26 21:34:11 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
  at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
  at org.apache.hadoop.hive.conf.HiveConf$ConfVars.findHadoopBinary(HiveConf.java:2327)
  at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:365)
  at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.spark.util.Utils$.classForName(Utils.scala:228)
  at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:963)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:91)
```

!!! note
    You need to have Administrator rights on your computer.
    All the following commands must be executed in a command-line window (`cmd`) run as Administrator, i.e., using the **Run as administrator** option when starting `cmd`.

Download the `winutils.exe` binary from the [steveloughran/winutils](https://github.com/steveloughran/winutils) GitHub repository.

!!! note
    Select the version of Hadoop the Spark distribution was compiled with, e.g. use `hadoop-2.7.1` for Spark 2 ([here is the direct link to the `winutils.exe` binary](https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe)).
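
    To check which Hadoop version your Spark distribution was built with, you can list the Hadoop jars it ships with (a sketch, assuming `SPARK_HOME` points at your Spark installation directory):

    ```text
    dir %SPARK_HOME%\jars\hadoop-*
    ```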

Save the `winutils.exe` binary to a directory of your choice (e.g., `c:\hadoop\bin`).
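
If you prefer the command line, here is a sketch that creates the directory and downloads the binary (it assumes `curl`, which ships with Windows 10 version 1803 and later, and the `hadoop-2.7.1` layout of the repository linked above):

```text
mkdir c:\hadoop\bin
curl -L -o c:\hadoop\bin\winutils.exe https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe
```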

Set the `HADOOP_HOME` environment variable to the directory that contains `bin\winutils.exe` (note: without the `bin` part):

```text
set HADOOP_HOME=c:\hadoop
```

Set the `PATH` environment variable to include `%HADOOP_HOME%\bin` as follows:

```text
set PATH=%HADOOP_HOME%\bin;%PATH%
```
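
You can verify that `winutils.exe` is now resolvable in the current session:

```text
where winutils.exe
```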

!!! tip
    Define the `HADOOP_HOME` and `PATH` environment variables in Control Panel so they are available to any Windows program.
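
    Alternatively, here is a sketch using `setx` (note that `setx` persists the variables for future sessions but does not change the current `cmd` window, and it truncates values longer than 1024 characters):

    ```text
    setx HADOOP_HOME c:\hadoop
    setx PATH "c:\hadoop\bin;%PATH%"
    ```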

Create the `C:\tmp\hive` directory.
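
For example, in the same administrator `cmd` window:

```text
mkdir C:\tmp\hive
```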

!!! note
    The `c:\tmp\hive` directory is the default value of the [`hive.exec.scratchdir` configuration property](https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.scratchdir) in Hive 0.14.0 and later (Spark uses a custom build of Hive 1.2.1).

    You can change `hive.exec.scratchdir` to point to another directory as described in [Changing hive.exec.scratchdir](#changing-hive.exec.scratchdir) in this document.

Execute the following command in the `cmd` window that you started with the **Run as administrator** option:

```text
winutils.exe chmod -R 777 C:\tmp\hive
```

Check the permissions (this is one of the commands that Spark executes under the covers):

```text
winutils.exe ls -F C:\tmp\hive
```

Open `spark-shell` and observe the output (perhaps with a few WARN messages that you can simply disregard).

As a verification step, execute the following line to display the content of a `DataFrame`:

```text
scala> spark.range(1).withColumn("status", lit("All seems fine. Congratulations!")).show(false)
+---+--------------------------------+
|id |status                          |
+---+--------------------------------+
|0  |All seems fine. Congratulations!|
+---+--------------------------------+
```

!!! note
    Disregard WARN messages when you start `spark-shell`. They are harmless.

    ```text
    16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of
    the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-core-3.2.10.jar" is already registered,
    and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-core-
    3.2.10.jar."
    16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR
    versions of the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-api-jdo-3.2.6.jar" is already
    registered, and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-
    hadoop2.7/bin/../jars/datanucleus-api-jdo-3.2.6.jar."
    16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR
    versions of the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-rdbms-3.2.9.jar" is
    already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-
    hadoop2.7/jars/datanucleus-rdbms-3.2.9.jar."
    ```

If you see the above output, you're done! You should now be able to run Spark applications on Windows. Congrats! 👏👏👏
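
If you run your own Spark application (e.g., from an IDE) rather than `spark-shell`, you can set the `hadoop.home.dir` system property instead of the `HADOOP_HOME` environment variable, since Hadoop's `Shell` checks the system property first. Below is a minimal sketch, assuming Spark 2.x on the classpath and the `c:\hadoop` directory set up above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object WindowsSmokeTest {
  def main(args: Array[String]): Unit = {
    // Equivalent of setting HADOOP_HOME, but for this JVM only
    // (assumes winutils.exe lives in c:\hadoop\bin as set up above)
    System.setProperty("hadoop.home.dir", "c:\\hadoop")

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("WindowsSmokeTest")
      .getOrCreate()

    // Same verification as in spark-shell above
    spark.range(1).withColumn("status", lit("All seems fine. Congratulations!")).show(false)

    spark.stop()
  }
}
```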

## Changing hive.exec.scratchdir { #changing-hive.exec.scratchdir }

Create a `hive-site.xml` file with the following content:

```xml
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/mydir</value>
    <description>Scratch space for Hive jobs</description>
  </property>
</configuration>
```

Start a Spark application (e.g., `spark-shell`) with the `HADOOP_CONF_DIR` environment variable set to the directory with `hive-site.xml` (e.g., `conf`):

```text
set HADOOP_CONF_DIR=conf
bin\spark-shell
```
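
As with `C:\tmp\hive`, make sure the new scratch directory exists and is writable (a sketch, assuming the `/tmp/mydir` value above resolves to `C:\tmp\mydir` on the local filesystem):

```text
mkdir C:\tmp\mydir
winutils.exe chmod -R 777 C:\tmp\mydir
```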