diff --git a/Code/01_Data_Acquisition_and_Understanding/ReadMe.md b/Code/01_Data_Acquisition_and_Understanding/ReadMe.md
index e75e5ca..9c730d1 100644
--- a/Code/01_Data_Acquisition_and_Understanding/ReadMe.md
+++ b/Code/01_Data_Acquisition_and_Understanding/ReadMe.md
@@ -27,4 +27,30 @@ You can install other dependencies in a similar way
     /usr/bin/anaconda/bin/conda install unidecode
 
 - The egg file needed to run the Pubmed Parser is also included in the repository.
+#### Instructions
+
+**Adding the egg file**
+
+One way to use the Pubmed Parser is to add its egg file to the Spark context. To do that, add the following command to your code:
+
+    spark.sparkContext.addPyFile(eggFilePath)
+
+where `eggFilePath` is the location of the egg file.
+
+For example, you can upload the egg file to the blob container associated with your cluster using [Microsoft Azure Storage Explorer](https://azure.microsoft.com/en-us/features/storage-explorer/).
+Say you uploaded the egg file to a folder named *eggs*. The path would then be:
+
+    eggFilePath = 'wasb:///eggs/pubmed_parser-0.1-py2.7.egg'
+
+Temporary note: the above egg file was generated for Python 2.7. If it does not work with Python 3.5, rebuild it using Python 3.5.
+
+**Installing the unidecode package**
+
+To install the unidecode Python package, you can use a Script Action on your Spark cluster. Add the following lines to your script file (.sh):
+
+    #!/usr/bin/env bash
+    /usr/bin/anaconda/bin/conda install unidecode
+
+More details about Script Actions can be found [here](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux).
+
 ### Next Step
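
To illustrate the `addPyFile` step introduced in this change, here is a minimal PySpark sketch. It assumes the egg was uploaded to an *eggs* folder in the cluster's default container as described above; the sample XML path is hypothetical, and `parse_medline_xml` is the parsing entry point documented by the Pubmed Parser project.

    # Minimal sketch: ship the Pubmed Parser egg to the cluster and use it.
    # The egg location matches the *eggs* folder assumed above; the sample
    # XML path below is a hypothetical local file.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pubmed-parser-demo").getOrCreate()

    # addPyFile distributes the egg to every node and adds it to the Python path
    spark.sparkContext.addPyFile('wasb:///eggs/pubmed_parser-0.1-py2.7.egg')

    # import only after addPyFile, so the egg is already on sys.path
    import pubmed_parser as pp

    # parse a MEDLINE XML file into a list of dictionaries (one per article)
    articles = pp.parse_medline_xml('/tmp/medline_sample.xml')
    print(articles[0].keys())

Note that the import must come after the `addPyFile` call; importing first would fail because the egg is not yet on the Python path.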
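
Similarly, once the Script Action has run, a quick sanity check (a sketch, reusing the `spark` session from the snippet above) confirms that `unidecode` is importable both on the driver and inside executor tasks:

    # -*- coding: utf-8 -*-
    # Sanity check that the unidecode package installed by the Script Action
    # is available on the driver and on the worker nodes.
    from unidecode import unidecode

    # driver-side check: transliterate a Unicode string to plain ASCII
    print(unidecode(u'café'))  # prints: cafe

    def to_ascii(text):
        # import inside the function so the check also runs on the executors
        from unidecode import unidecode
        return unidecode(text)

    # worker-side check via a tiny RDD
    print(spark.sparkContext.parallelize([u'naïve', u'Beyoncé']).map(to_ascii).collect())
    # expected: ['naive', 'Beyonce']

Because Script Actions run on every node, the worker-side check should succeed on all executors; if it raises `ImportError`, re-run the Script Action with the worker nodes selected.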