26 changes: 26 additions & 0 deletions Code/01_Data_Acquisition_and_Understanding/ReadMe.md
@@ -27,4 +27,30 @@ You can install other dependencies in a similar way
/usr/bin/anaconda/bin/conda install unidecode
- The egg file needed to run the Pubmed Parser is also included in the repository.

#### Instructions

**Adding egg file**

One way to use the Pubmed Parser is to add its egg file to the Spark context. To do that, add the following line to your code:

spark.sparkContext.addPyFile(*eggFilePath*)

where *eggFilePath* is the location of the egg file.

For example, you can upload the egg file to the blob container associated with your cluster, using [Microsoft Azure Storage Explorer](https://azure.microsoft.com/en-us/features/storage-explorer/).
Say you uploaded the egg file to a folder named *eggs*. Then the path will be:

*eggFilePath* = 'wasb://eggs/pubmed_parser-0.1-py2.7.egg'

Temporary note: the egg file above was built for Python 2.7. If it does not work with Python 3.5, you will need to rebuild it using Python 3.5.
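
A minimal end-to-end sketch is shown below. It assumes the egg path from the example above, that the module inside the egg is importable as `pubmed_parser`, and a hypothetical local input file; the exact parsing function depends on your input format, so check the Pubmed Parser documentation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pubmed-parser-example").getOrCreate()

# Attach the egg so the parser can be imported on the driver and the executors.
eggFilePath = 'wasb://eggs/pubmed_parser-0.1-py2.7.egg'  # path from the example above
spark.sparkContext.addPyFile(eggFilePath)

# Import only after addPyFile so the module can be resolved from the egg.
import pubmed_parser as pp

# Hypothetical input file; parse_pubmed_xml returns a dictionary of article fields.
article = pp.parse_pubmed_xml('/tmp/sample_pubmed_article.nxml')
print(article.get('full_title'))
```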

**Installing the unidecode package**

To install the unidecode Python package, you can use a Script Action on your Spark cluster. Add the following lines to your script file (.sh):

#!/usr/bin/env bash
/usr/bin/anaconda/bin/conda install unidecode

More details about Script Actions can be found [here](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux).
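
Once the Script Action has completed, a quick way to confirm that the package is available on the worker nodes is a small PySpark check like the sketch below (the sample strings are arbitrary):

```python
from pyspark.sql import SparkSession
from unidecode import unidecode  # raises ImportError if the package is not installed

spark = SparkSession.builder.appName("unidecode-check").getOrCreate()

# Transliterating a few accented strings on the executors confirms that
# unidecode is installed cluster-wide, not just on the head node.
names = spark.sparkContext.parallelize([u'Gödel', u'Erdős', u'Poincaré'])
print(names.map(unidecode).collect())  # ['Godel', 'Erdos', 'Poincare']
```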

### Next Step