
Read an Excel document using the Spark2 datasource API

Jörn Franke edited this page Sep 30, 2018 · 9 revisions

This is a Spark2 datasource application demonstrating some of the capabilities of the hadoopoffice library. It takes a set of Excel files as input and prints the number of rows, the schema, and the content of the Excel cells (including comments, formulas, and addresses). It has been tested successfully with the HDP Sandbox VM 2.5, but other Hadoop distributions that support Spark 2 should work equally well.
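The core of such an application can be sketched in a few lines of Scala. This is a minimal, hypothetical sketch, not the example's actual source: the format name `org.zuinnote.spark.office.excel` and the `read.locale.bcp47` option follow the hadoopoffice documentation, but you should verify them against the version of the datasource you actually build.

```scala
import org.apache.spark.sql.SparkSession

object ExcelReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("excel-in").getOrCreate()

    // Read the Excel file via the hadoopoffice Spark2 datasource.
    // Format name and option are assumptions taken from the project docs.
    val df = spark.read
      .format("org.zuinnote.spark.office.excel")
      .option("read.locale.bcp47", "us") // locale used to interpret cell values
      .load("/user/spark/office/excel/input/excel2013test.xlsx")

    println(df.count()) // number of rows
    df.printSchema()    // rows of cell structs (comment, formula, address, ...)
    df.show()
  }
}
```

A snippet like this requires a running Spark 2 environment and the spark-hadoopoffice-ds package on the classpath; it is meant to illustrate the datasource API, not to replace the packaged example.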

Getting an example Excel file

You can create an Excel file yourself in LibreOffice or Microsoft Excel. Alternatively, you can download an Excel file that is used for unit testing of the hadoopoffice library by executing the following command:

wget --no-check-certificate https://github.com/ZuInnoTe/hadoopoffice/raw/master/fileformat/src/test/resources/excel2013test.xlsx

You can put it on your HDFS cluster by executing the following commands:

hadoop fs -mkdir -p /user/spark/office/excel/input

hadoop fs -put ./excel2013test.xlsx /user/spark/office/excel/input

After the file has been copied, you are ready to run the example.

Building the example

Note that the datasource itself is also available on Maven Central and on Spark-packages, so building from source is only needed for this example application.

Execute

git clone https://github.com/ZuInnoTe/hadoopoffice.git hadoopoffice

You can build the application by changing to the directory hadoopoffice/examples/scala-spark2-excel-in-ds and running the following command:

sbt +clean +it:test +assembly

Running the example

Make sure that the output directory is empty:

hadoop fs -rm -R /user/spark/office/excel/output

Execute the following command (take care that you use the spark-submit of Spark2):

spark-submit --class org.zuinnote.spark.office.example.excel.SparkScalaExcelInDataSource ./example-ho-spark-scala-ds-excelin.jar /user/spark/office/excel/input /user/spark/office/excel/output                                                                                                                                             

After the Spark2 job has completed, an output file in CSV format containing the content of the input Excel file has been created. You can display it by executing:

hadoop fs -text /user/spark/office/excel/output/part-00000
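The write step that produces this CSV output can be sketched as follows. This is a hedged illustration, not the example's actual code: the format name and option are assumptions taken from the hadoopoffice documentation, and the real application may flatten the cell structs differently before writing.

```scala
import org.apache.spark.sql.SparkSession

object ExcelToCsvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("excel-to-csv").getOrCreate()

    // Read all Excel files from the input directory (names are assumptions).
    val df = spark.read
      .format("org.zuinnote.spark.office.excel")
      .option("read.locale.bcp47", "us")
      .load(args(0)) // e.g. /user/spark/office/excel/input

    // Flatten each row of parsed cells into one comma-separated line and
    // store the result as text files, which hadoop fs -text can display.
    df.rdd
      .map(row => row.mkString(","))
      .saveAsTextFile(args(1)) // e.g. /user/spark/office/excel/output
  }
}
```

Writing via the RDD as plain text keeps the sketch short; in practice a `df.write.csv(...)` call with an explicitly flattened schema would be the more idiomatic Spark 2 choice.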