This repository was archived by the owner on Oct 26, 2023. It is now read-only.

Commit 0820254: Update README.md
1 parent b099ce4 commit 0820254

1 file changed: README.md (5 additions, 3 deletions)
@@ -9,6 +9,7 @@ products:
 - azure-functions
 name: "Big Data Processing: Serverless MapReduce on Azure"
 description: "This sample uses Azure Durable Functions to determine the average speed of New York Yellow taxi trips, per day over all of 2017."
+urlFragment: big-data-processing-serverless-mapreduce-on-azure
 ---
 
 # Big Data Processing: Serverless MapReduce on Azure
@@ -90,7 +91,7 @@ This performs the following **permanent** changes to your machine:
 - Installs [.Net Core SDK](https://www.microsoft.com/net/download) (to build v2 app)
 - Installs [.Net 4.6.1 Developer pack](https://www.microsoft.com/en-us/download/details.aspx?id=49978) (to build v1 app)
 
-## 1. Serverless MapReduce on Azure ##
+## 1. Serverless MapReduce on Azure
 
 ![](./images/MapReduceArchitecture.png)
 
@@ -103,7 +104,7 @@ You will use ***Durable Functions*** - specifically the ***Fan-out/Fan-in pattern***
 The source data you will be using for this MapReduce implementation can be found here: <http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml>
 and contains 12 CSV files of roughly 800MB each (`Yellow` dataset) you need to make available in Azure Blob storage - total size is ~9.6GB.
 
-## 2. Hands on: Implementing the MapReduce pattern step by step ##
+## 2. Hands on: Implementing the MapReduce pattern step by step
 ### 2.1 Copy the dataset to an Azure Blob Storage instance
 Open a new PowerShell window & execute `TaxiDataImporter.ps1` from the repo directory to copy each file from the NYC Taxi site in to your Azure Storage Account
 
@@ -216,7 +217,8 @@ After deployment:
 
 You'll receive back a list of URLs you can use to check status, issue new events (not handled by this sample), or terminate the orchestration.
 
-# Notes
+## Notes
+
 The implementation shown here utilizes a single reducer. If you needed multiple reducers, you would create **a sub-orchestration per reducer**, launch those in parallel, do the `CallActivityAsync()` calls within the sub-orchestrator, reduce the results, pass that up to the parent orchestrator to further reduce the results, and so on.
 
 It's also important to remember while Serverless technologies allow you to scale "infinitely" we must use the right tool for the right job. Durable Functions will, in theory, scale out to run any number of jobs in parallel and come back to the reduce step so **this approach may work very well for loads that can be highly parallelized** however the machines on which Azure Functions run (in Consumption plan) are of limited specs ([more detail can be found here in the docs](https://docs.microsoft.com/en-us/azure/azure-functions/functions-scale#how-the-consumption-plan-works)). This means **this approach _may not_ work well for loads with very large sections to be processed by each mapper**. In this case, you may look in to hosting this implementation on an App Service Plan with large VMs or, even better, a Functions Premium offering with a large VM size to process the data more quickly. Another thing to note is the way DF passes results and parameters around; these are done via Azure Storage Queues which may/may not add unacceptable latency in to your big data process.
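The multi-reducer layout the Notes describe (a sub-orchestration per reducer, fanned out in parallel, with a final reduce in the parent) can be sketched in plain Python. This is only an illustration of the control flow, not the Durable Functions API: `asyncio` coroutines stand in for orchestrators and activity functions, and the chunk/partition data is made up.

```python
import asyncio

async def mapper(chunk):
    # Stand-in for an activity function: returns (sum_of_speeds, trip_count)
    # for one chunk of trip records.
    return (sum(chunk), len(chunk))

async def reducer_suborchestration(chunks):
    # Stand-in for a sub-orchestrator: fan out one mapper per chunk,
    # then partially reduce the mapper outputs.
    partials = await asyncio.gather(*(mapper(c) for c in chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total, count

async def parent_orchestrator(partitions):
    # Stand-in for the parent orchestrator: launch one sub-orchestration
    # per reducer in parallel, then do the final reduce over their results.
    partials = await asyncio.gather(
        *(reducer_suborchestration(p) for p in partitions)
    )
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count  # overall average speed

# Hypothetical data: two reducers, each assigned two chunks of trip speeds.
partitions = [
    [[10.0, 20.0], [30.0]],
    [[40.0], [50.0, 60.0]],
]
average = asyncio.run(parent_orchestrator(partitions))
print(average)  # 35.0
```

In the real sample, each `mapper` call corresponds to a `CallActivityAsync()` made from within the sub-orchestrator, and the partial results flow back to the parent through the Durable Functions runtime (via Azure Storage Queues) rather than in-process.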
