Implementation of some famous big data algorithms for the Big Data Computing course.
Download the repository
git clone git@github.com:lparolari/bdc-assignments.git
# or, with https
# git clone https://github.com/lparolari/bdc-assignments.git
cd bdc-assignments
Run our algorithms (see the next sections) with Gradle from the command line
./gradlew <TASK> --args <ARGS>
# For example
./gradlew runExample --args "4 \"src/main/resources/examples/dataset.txt\""
From IntelliJ
- add a Build & Run profile
- set the desired class to execute
- add the following VM option
-Dspark.master="local[*]"
- set the program arguments
- run
Note: the Spark master is set programmatically, so there is no need for the -Dspark.master="local[*]" option.
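For reference, this is roughly what "set programmatically" looks like with the Spark Java API; the class and application names below are illustrative, not the actual homework classes.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalSparkExample {
    public static void main(String[] args) {
        // The master is hard-coded here, so no -Dspark.master option is needed.
        SparkConf conf = new SparkConf()
                .setAppName("LocalSparkExample")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build RDDs and run the algorithm ...
        sc.close();
    }
}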
- Deterministic Partition Size
- Spark Partition Size
Run the algorithm
# Small example dataset
./gradlew runClassCount --args "4 \"src/main/resources/hw1/example.txt\""
# 10000-entry dataset
./gradlew runClassCount --args "4 \"src/main/resources/hw1/input_10000.txt\""
- Exact solution
- 2-approximation by taking k random points
- Exact solution on the subset C of S, where C is the set of k centers returned by Farthest-First Traversal (see the sketch after the run commands below)
./gradlew runMaxPairwiseDistance --args "\"src/main/resources/hw2/aircraft-mainland.csv\" 4"
./gradlew runMaxPairwiseDistance --args "\"src/main/resources/hw2/uber-small.csv\" 4"
./gradlew runMaxPairwiseDistance --args "\"src/main/resources/hw2/uber-medium.csv\" 4"
./gradlew runMaxPairwiseDistance --args "\"src/main/resources/hw2/uber-large.csv\" 4"
Running time
The following tables summarize the running times of the algorithms, benchmarked over different datasets and values of K. Times are expressed in milliseconds. These runs refer only to correct results. Data were collected in a single run ("one shot").
Comparison with K = 4 (K applies only to 2-approx and kCenter)
Dataset | Exact | 2-approx | kCenter |
---|---|---|---|
Aircraft Mainland | 2468 | 190 | 197 |
Uber Small | 136 | 156 | 176 |
Uber Medium | 2078582 | 196 | 504 |
Uber Large | n/d | 675 | 822 |
Comparison with K = 128 (K applies only to 2-approx and kCenter)
Dataset | Exact | 2-approx | kCenter |
---|---|---|---|
Aircraft Mainland | 2468 | 264 | 1993 |
Uber Small | 136 | 154 | 266 |
Uber Medium | 2078582 | 1743 | 52554 |
Uber Large | n/d | 2992 | 169519 |
Full benchmark list
Algorithm | Dataset | K | Time (ms) |
---|---|---|---|
exact | Aircraft Mainland | - | 2468 |
exact | Uber Small | - | 136 |
exact | Uber Medium | - | 2078582 |
exact | Uber Large | - | n/d |
2-approx | Aircraft Mainland | 4 | 190 |
2-approx | Aircraft Mainland | 8 | 204 |
2-approx | Aircraft Mainland | 32 | 485 |
2-approx | Aircraft Mainland | 128 | 264 |
2-approx | Uber Small | 4 | 156 |
2-approx | Uber Small | 8 | 202 |
2-approx | Uber Small | 32 | 168 |
2-approx | Uber Small | 128 | 154 |
2-approx | Uber Medium | 4 | 196 |
2-approx | Uber Medium | 8 | 387 |
2-approx | Uber Medium | 32 | 473 |
2-approx | Uber Medium | 128 | 1743 |
2-approx | Uber Large | 4 | 675 |
2-approx | Uber Large | 8 | 462 |
2-approx | Uber Large | 32 | 963 |
2-approx | Uber Large | 128 | 2992 |
kCenter | Aircraft Mainland | 4 | 197 |
kCenter | Aircraft Mainland | 8 | 227 |
kCenter | Aircraft Mainland | 32 | 639 |
kCenter | Aircraft Mainland | 128 | 1993 |
kCenter | Uber Small | 4 | 176 |
kCenter | Uber Small | 8 | 148 |
kCenter | Uber Small | 32 | 191 |
kCenter | Uber Small | 128 | 266 |
kCenter | Uber Medium | 4 | 504 |
kCenter | Uber Medium | 8 | 947 |
kCenter | Uber Medium | 32 | 5395 |
kCenter | Uber Medium | 128 | 52554 |
kCenter | Uber Large | 4 | 882 |
kCenter | Uber Large | 8 | 1431 |
kCenter | Uber Large | 32 | 12574 |
kCenter | Uber Large | 128 | 169519 |
Run the benchmarks
If you want to run the benchmarks yourself, take a look at the hw2run.sh, hw2scale.sh and hw2bench.sh scripts, which we provide to simplify this task.
Please note that benchmarking was not the aim of this exercise: running times were taken in a single run, and we decided to leave the benchmark results out of source control.
Execute locally
./gradlew runDiversityMaximization --args "path/to/dataset K L"
For example, with K=100 and L=16:
./gradlew runDiversityMaximization --args "\"src/main/resources/hw3/uber-small.csv\" 100 16"
Execute on cloud
- Step 1: Build the shadow jar with
./gradlew shadowJar
We suggest deleting the build directory before re-building the shadow jar. Please be sure to use Java 8 when compiling. You can check and fix versions with
java -version
javac -version
sudo update-alternatives --config java
sudo update-alternatives --config javac
- Step 2: Copy the fat jar to a machine in the unipd network (skip this step if you are already inside the unipd network).
scp build/libs/bdc-assignments-all.jar torre.studenti.math.unipd.it:dev/bdc
# or, if you have not configured key-pair authentication (password required)
scp build/libs/bdc-assignments-all.jar username@torre.studenti.math.unipd.it:dev/bdc
- Step 3: Copy the jar to the cluster
scp dev/bdc/bdc-assignments-all.jar group49@147.162.226.106:.
- Step 4: Run the job with Spark
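# args after the jar follow the same pattern as the local run: <dataset path on the cluster> K L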
spark-submit \
--num-executors 4 \
--class it.unipd.bdc.ay2019.group49.hw.G49HW3 \
bdc-assignments-all.jar \
/data/BDC1920/uber-small.csv 4 4
Luca Parolari
- GitHub: @lparolari
Giulio Piva
Thanks to our big data course tutors. Take a look at their course web page.
The project is MIT licensed. See LICENSE file.