Jörn Franke edited this page Nov 19, 2018
Office files are usually small, i.e. several orders of magnitude smaller than the default HDFS block size (128 MB). Hence, you may face performance penalties when processing them in a cluster, because the large number of small files causes many requests to the NameNode. Fortunately, Hadoop has offered a solution to this problem for several versions: Hadoop Archives (HAR). These archives do not compress files, but combine many small files into a single larger file.
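As a sketch, a HAR can be created with the `hadoop archive` tool that ships with Hadoop (the paths and archive name below are illustrative, matching the example later on this page):

```shell
# Combine the small files under /user/office into a single archive.
# -p sets the parent path relative to which files are archived;
# the resulting MyOfficeDocuments.har is written to /user/office.
hadoop archive -archiveName MyOfficeDocuments.har -p /user/office /user/office
```

Note that archiving runs as a MapReduce job on the cluster, and the original files are not deleted automatically.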
The files within a Hadoop Archive can be accessed transparently by any Hadoop application (including Spark, TEZ or Hive): instead of providing an HDFS URL, such as hdfs:/user/office/, you simply provide a HAR URL, such as har:/user/office/MyOfficeDocuments.har/
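For instance, the standard HDFS shell accepts the HAR URL directly, so the archive can be browsed like a directory (assuming the archive from the example above exists on the cluster):

```shell
# List the files inside the archive as if it were a plain directory
hdfs dfs -ls har:/user/office/MyOfficeDocuments.har/
# Read one archived file; the file name is hypothetical
hdfs dfs -cat har:/user/office/MyOfficeDocuments.har/report.xlsx
```

The same URLs can be passed as input paths to MapReduce, Spark, TEZ or Hive jobs without any code changes.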
In the future, Hadoop Ozone will offer an alternative for storing large numbers of small files.