Jörn Franke edited this page Nov 19, 2018
Office files are usually small, i.e. several orders of magnitude smaller than the default HDFS block size (128 MB). Hence, you may face performance penalties when processing them in a cluster, because the large number of small files causes many requests to the NameNode. Fortunately, Hadoop has offered a solution to this problem for several versions: Hadoop Archives (HAR). These archives do not compress files, but combine many small files into a single larger file.
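As a sketch, a HAR can be created with the `hadoop archive` tool that ships with Hadoop (the paths and archive name below are illustrative, matching the example later on this page):

```shell
# Combine the small files under /user/office into a single archive.
# -p sets the parent path relative to which files are archived;
# the resulting MyOfficeDocuments.har is written to /user/office.
hadoop archive -archiveName MyOfficeDocuments.har -p /user/office /user/office
```

Note that archiving runs as a MapReduce job on the cluster, and the original files are not deleted automatically.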
The files within a Hadoop Archive can be accessed transparently by any Hadoop application (including Spark, TEZ or Hive): instead of providing an HDFS URL, such as hdfs:/user/office/, you simply provide a HAR URL, such as har:/user/office/MyOfficeDocuments.har/
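For instance, the standard HDFS shell accepts the HAR URL directly, so the archive can be browsed like a directory (assuming the archive from the example above exists on the cluster):

```shell
# List the files inside the archive as if it were a plain directory
hdfs dfs -ls har:/user/office/MyOfficeDocuments.har/
# Read one archived file; the file name is hypothetical
hdfs dfs -cat har:/user/office/MyOfficeDocuments.har/report.xlsx
```

The same URLs can be passed as input paths to MapReduce, Spark, TEZ or Hive jobs without any code changes.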
In the future, Hadoop Ozone will offer an alternative for storing large numbers of small files.