You will use ***Durable Functions*** - specifically the ***Fan-out/Fan-in pattern***.
The source data you will be using for this MapReduce implementation can be found here: <http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml>
and contains 12 CSV files of roughly 800MB each (the `Yellow` dataset) that you need to make available in Azure Blob Storage - total size is ~9.6GB.
## 2. Hands on: Implementing the MapReduce pattern step by step
### 2.1 Copy the dataset to an Azure Blob Storage instance
Open a new PowerShell window and execute `TaxiDataImporter.ps1` from the repo directory to copy each file from the NYC Taxi site into your Azure Storage Account.
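Conceptually, the import is just a copy from each public source URL into a blob container. Below is a minimal C# sketch of that idea using the `Azure.Storage.Blobs` SDK; the connection-string variable, container name, year, and base URL are placeholder assumptions, and the repo's actual importer is the PowerShell script named above.

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

class TaxiDataImportSketch
{
    static async Task Main()
    {
        // Placeholder connection string & container name -- not the sample's real config.
        var container = new BlobContainerClient(
            Environment.GetEnvironmentVariable("AzureWebJobsStorage"),
            "nyc-taxi-data");
        await container.CreateIfNotExistsAsync();

        for (int month = 1; month <= 12; month++)
        {
            // The TLC files follow a yellow_tripdata_YYYY-MM.csv naming scheme;
            // the year and base URL here are illustrative placeholders.
            var fileName = $"yellow_tripdata_2017-{month:D2}.csv";
            var sourceUri = new Uri($"https://example.org/trip-data/{fileName}");

            // Server-side copy: Azure Storage pulls the bytes from the source URL
            // directly, so the ~800MB files never pass through your machine.
            await container.GetBlobClient(fileName).StartCopyFromUriAsync(sourceUri);
        }
    }
}
```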
You'll receive back a list of URLs you can use to check status, issue new events (not handled by this sample), or terminate the orchestration.
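For reference, the response from the Durable Functions HTTP start API has roughly this shape (the host name, instance ID, and keys below are schematic placeholders, and the exact set of URLs varies by Durable Functions version):

```json
{
  "id": "<instanceId>",
  "statusQueryGetUri": "https://<function-app>.azurewebsites.net/runtime/webhooks/durabletask/instances/<instanceId>?code=<key>",
  "sendEventPostUri": "https://<function-app>.azurewebsites.net/runtime/webhooks/durabletask/instances/<instanceId>/raiseEvent/{eventName}?code=<key>",
  "terminatePostUri": "https://<function-app>.azurewebsites.net/runtime/webhooks/durabletask/instances/<instanceId>/terminate?reason={text}&code=<key>"
}
```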
## Notes
The implementation shown here uses a single reducer. If you needed multiple reducers, you would create **a sub-orchestration per reducer**, launch those in parallel, make the `CallActivityAsync()` calls within each sub-orchestrator, reduce the results there, pass them up to the parent orchestrator to further reduce the results, and so on.
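A minimal sketch of that multi-reducer shape, written against the Durable Functions 2.x C# API; the function names, input partitioning, and `long` result type are illustrative assumptions rather than the sample's actual code:

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class MultiReducerSketch
{
    // Parent orchestrator: launches one sub-orchestration per reducer, in parallel.
    [FunctionName("ParentOrchestrator")]
    public static async Task<long> RunParent(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // Hypothetical input: one group of source files per reducer.
        var partitions = context.GetInput<string[][]>();

        var reducers = partitions.Select(p =>
            context.CallSubOrchestratorAsync<long>("ReducerOrchestrator", p));

        long[] partials = await Task.WhenAll(reducers);

        return partials.Sum(); // final reduce in the parent
    }

    // Sub-orchestrator: fans out to the mappers for its partition, then
    // performs the partial reduce before handing the result back up.
    [FunctionName("ReducerOrchestrator")]
    public static async Task<long> RunReducer(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        var files = context.GetInput<string[]>();
        var mappers = files.Select(f => context.CallActivityAsync<long>("Mapper", f));
        long[] mapped = await Task.WhenAll(mappers);
        return mapped.Sum();
    }
}
```

Because each sub-orchestration is its own instance, the partial reduces proceed in parallel just as the mappers do.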
It's also important to remember that while serverless technologies allow you to scale "infinitely," you must use the right tool for the right job. Durable Functions will, in theory, scale out to run any number of jobs in parallel and come back for the reduce step, so **this approach may work very well for loads that can be highly parallelized**. However, the machines on which Azure Functions run in the Consumption plan have limited specs ([more detail can be found here in the docs](https://docs.microsoft.com/en-us/azure/azure-functions/functions-scale#how-the-consumption-plan-works)), which means **this approach _may not_ work well for loads where each mapper has a very large section of data to process**. In that case, you may look into hosting this implementation on an App Service plan with large VMs or, even better, a Functions Premium offering with a large VM size to process the data more quickly. Another thing to note is the way Durable Functions passes results and parameters around: it does so via Azure Storage queues, which may or may not add unacceptable latency to your big data process.