Big Data Computing Assignments

Implementations of some well-known big data algorithms for the Big Data Computing course.

🚀 Getting Started

Download the repository

git clone git@github.com:lparolari/bdc-assignments.git
# or, with https
# git clone https://github.com/lparolari/bdc-assignments.git

cd bdc-assignments

Run our algorithms (see the next section) with Gradle from the command line

./gradlew <TASK> --args <ARGS>

# For example
./gradlew runExample --args "4 \"src/main/resources/examples/dataset.txt\""

From IntelliJ

add a build & run profile
set the class to execute
add the following VM option: -Dspark.master="local[*]"
set the program arguments
run

Note: the Spark master is set programmatically, so the -Dspark.master="local[*]" option is not needed.

✔️ Implemented Algorithms

HW1: Class Count

Deterministic Partition Size
Spark Partition Size

Run the algorithm

# Small example dataset
./gradlew runClassCount --args "4 \"src/main/resources/hw1/example.txt\""

# 10000-entry dataset
./gradlew runClassCount --args "4 \"src/main/resources/hw1/input_10000.txt\""
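
The two-round counting idea behind the "Deterministic Partition Size" variant can be sketched in plain Java without Spark. This is only an illustration under assumptions: the input is taken to be a list of class labels, the partition key is assumed to be the position of a label modulo K, and the names ClassCountSketch and classCount are hypothetical rather than taken from the repository.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClassCountSketch {

    /** Counts class labels in two rounds, using (index mod k) as the partition key. */
    public static Map<String, Long> classCount(List<String> labels, int k) {
        // Round 1: send the label at position i to partition i mod k
        // and count the labels inside each partition independently.
        List<Map<String, Long>> partials = new ArrayList<>();
        for (int p = 0; p < k; p++) {
            partials.add(new TreeMap<>());
        }
        for (int i = 0; i < labels.size(); i++) {
            partials.get(i % k).merge(labels.get(i), 1L, Long::sum);
        }

        // Round 2: sum the partial counts of each label across all partitions.
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((label, count) -> totals.merge(label, count, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical labels; the real datasets live under src/main/resources/hw1.
        List<String> labels =
            Arrays.asList("Action", "Comedy", "Action", "Drama", "Comedy", "Action");
        System.out.println(classCount(labels, 4)); // prints {Action=3, Comedy=2, Drama=1}
    }
}
```

With a deterministic key, each partition receives at most ceil(N / K) of the N input pairs, which bounds the local memory needed in the first round.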

HW2: Max Pairwise Distance

Exact solution
2-approximation by taking k random points
Exact solution on a subset C of S, where C is the set of k centers returned by Farthest-First Traversal
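
The three strategies listed above can be sketched in plain Java as follows. This is a minimal sketch, not the repository's code: it assumes Euclidean distance, points stored as double[], random sampling with replacement, and the first point as the starting center of the traversal. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class MaxPairwiseDistanceSketch {

    /** Euclidean distance between two points of equal dimension (assumed metric). */
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Exact solution: checks every pair, so it needs O(|S|^2) distance computations. */
    public static double exact(List<double[]> s) {
        double max = 0.0;
        for (int i = 0; i < s.size(); i++) {
            for (int j = i + 1; j < s.size(); j++) {
                max = Math.max(max, distance(s.get(i), s.get(j)));
            }
        }
        return max;
    }

    /** 2-approximation: max distance between k random points and every point of S. */
    public static double twoApprox(List<double[]> s, int k, long seed) {
        Random random = new Random(seed);
        double max = 0.0;
        for (int i = 0; i < k; i++) {
            // Sampling with replacement is an assumption made for brevity.
            double[] sample = s.get(random.nextInt(s.size()));
            for (double[] point : s) {
                max = Math.max(max, distance(sample, point));
            }
        }
        return max;
    }

    /** Farthest-First Traversal: returns up to k centers, starting from the first point. */
    public static List<double[]> farthestFirstTraversal(List<double[]> s, int k) {
        List<double[]> centers = new ArrayList<>();
        // minDist[i] is the distance from point i to its closest center chosen so far.
        double[] minDist = new double[s.size()];
        Arrays.fill(minDist, Double.POSITIVE_INFINITY);
        int next = 0; // arbitrary first center
        for (int c = 0; c < k && c < s.size(); c++) {
            double[] center = s.get(next);
            centers.add(center);
            int farthest = 0;
            for (int i = 0; i < s.size(); i++) {
                minDist[i] = Math.min(minDist[i], distance(center, s.get(i)));
                if (minDist[i] > minDist[farthest]) {
                    farthest = i;
                }
            }
            next = farthest; // the point farthest from all current centers
        }
        return centers;
    }

    public static void main(String[] args) {
        // Hypothetical toy dataset.
        List<double[]> s = Arrays.asList(
            new double[] {0, 0}, new double[] {1, 0},
            new double[] {10, 0}, new double[] {10, 1});
        System.out.println("exact    = " + exact(s));
        System.out.println("2-approx = " + twoApprox(s, 2, 42L));
        System.out.println("kCenter  = " + exact(farthestFirstTraversal(s, 2)));
    }
}
```

The 2-approximation guarantee follows from the triangle inequality: for any sampled point x, the farthest point from x is at least half the maximum pairwise distance away.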

./gradlew runMaxPairwiseDistance --args "\"src/main/resources/hw2/aircraft-mainland.csv\" 4"
./gradlew runMaxPairwiseDistance --args "\"src/main/resources/hw2/uber-small.csv\" 4"
./gradlew runMaxPairwiseDistance --args "\"src/main/resources/hw2/uber-medium.csv\" 4"
./gradlew runMaxPairwiseDistance --args "\"src/main/resources/hw2/uber-large.csv\" 4"

Running time

The following tables summarize the running times, in milliseconds, of the algorithms on different datasets and for several values of K. Only runs that produced correct results are reported. Each time was measured in a single run ("one shot").

Comparison with K = 4 (K applies only to 2-approx and kCenter)

Dataset	Exact	2-approx	kCenter
Aircraft Mainland	2468	190	197
Uber Small	136	156	176
Uber Medium	2078582	196	504
Uber Large	n/d	675	822

Comparison with K = 128 (K applies only to 2-approx and kCenter)

Dataset	Exact	2-approx	kCenter
Aircraft Mainland	2468	264	1993
Uber Small	136	154	266
Uber Medium	2078582	1743	52554
Uber Large	n/d	2992	169519

Full benchmark list

Algorithm	Dataset	K	Time (ms)
exact	Aircraft Mainland	-	2468
exact	Uber Small	-	136
exact	Uber Medium	-	2078582
exact	Uber Large	-	n/d
2-approx	Aircraft Mainland	4	190
2-approx	Aircraft Mainland	8	204
2-approx	Aircraft Mainland	32	485
2-approx	Aircraft Mainland	128	264
2-approx	Uber Small	4	156
2-approx	Uber Small	8	202
2-approx	Uber Small	32	168
2-approx	Uber Small	128	154
2-approx	Uber Medium	4	196
2-approx	Uber Medium	8	387
2-approx	Uber Medium	32	473
2-approx	Uber Medium	128	1743
2-approx	Uber Large	4	675
2-approx	Uber Large	8	462
2-approx	Uber Large	32	963
2-approx	Uber Large	128	2992
kCenter	Aircraft Mainland	4	197
kCenter	Aircraft Mainland	8	227
kCenter	Aircraft Mainland	32	639
kCenter	Aircraft Mainland	128	1993
kCenter	Uber Small	4	176
kCenter	Uber Small	8	148
kCenter	Uber Small	32	191
kCenter	Uber Small	128	266
kCenter	Uber Medium	4	504
kCenter	Uber Medium	8	947
kCenter	Uber Medium	32	5395
kCenter	Uber Medium	128	52554
kCenter	Uber Large	4	882
kCenter	Uber Large	8	1431
kCenter	Uber Large	32	12574
kCenter	Uber Large	128	169519

Run the benchmarks

To run the benchmarks yourself, see the hw2run.sh, hw2scale.sh and hw2bench.sh scripts, which we provide to simplify this task.

Please note that benchmarking was not the aim of this exercise. We measured each running time in a single run, and we decided to keep benchmark results out of source control.
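
For reference, a single-run ("one shot") measurement in milliseconds could look like the sketch below. The helper timeMillis and the workload are hypothetical and are not taken from the benchmark scripts.

```java
import java.util.function.Supplier;

public class TimingSketch {

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    public static <T> long timeMillis(Supplier<T> task) {
        long start = System.nanoTime();
        task.get();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        // Hypothetical workload standing in for one algorithm run.
        long elapsed = timeMillis(() -> {
            double sum = 0.0;
            for (int i = 1; i <= 5_000_000; i++) {
                sum += Math.sqrt(i);
            }
            return sum;
        });
        System.out.println("Time (ms): " + elapsed);
    }
}
```

A single measurement like this includes JVM warm-up and other noise, which is one reason the numbers above should be read as indicative only.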

HW3: Diversity Maximization

Execute locally

./gradlew runDiversityMaximization --args "path/to/dataset K L"

For example, with K=100 and L=16:

./gradlew runDiversityMaximization --args "\"src/main/resources/hw3/uber-small.csv\" 100 16"
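
To illustrate the roles of K and L, the sketch below assumes a common coreset-based scheme: split the points into L partitions, keep K Farthest-First Traversal centers from each partition, then select K points from the union by repeatedly taking the farthest remaining pair. Whether the repository follows exactly this scheme is an assumption, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DiversityMaximizationSketch {

    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    /** Farthest-First Traversal: up to k centers, starting from the first point. */
    public static List<double[]> farthestFirstTraversal(List<double[]> s, int k) {
        List<double[]> centers = new ArrayList<>();
        double[] minDist = new double[s.size()];
        Arrays.fill(minDist, Double.POSITIVE_INFINITY);
        int next = 0;
        for (int c = 0; c < k && c < s.size(); c++) {
            double[] center = s.get(next);
            centers.add(center);
            int farthest = 0;
            for (int i = 0; i < s.size(); i++) {
                minDist[i] = Math.min(minDist[i], distance(center, s.get(i)));
                if (minDist[i] > minDist[farthest]) {
                    farthest = i;
                }
            }
            next = farthest;
        }
        return centers;
    }

    /** Round 1: round-robin split into l partitions, keep k centers from each one. */
    public static List<double[]> coreset(List<double[]> s, int k, int l) {
        List<List<double[]>> partitions = new ArrayList<>();
        for (int p = 0; p < l; p++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < s.size(); i++) {
            partitions.get(i % l).add(s.get(i));
        }
        List<double[]> result = new ArrayList<>();
        for (List<double[]> partition : partitions) {
            result.addAll(farthestFirstTraversal(partition, k));
        }
        return result;
    }

    /** Round 2: greedy selection that repeatedly takes the farthest remaining pair. */
    public static List<double[]> selectDiverse(List<double[]> points, int k) {
        List<double[]> remaining = new ArrayList<>(points);
        List<double[]> result = new ArrayList<>();
        while (result.size() + 1 < k && remaining.size() >= 2) {
            int bestI = 0;
            int bestJ = 1;
            double best = -1.0;
            for (int i = 0; i < remaining.size(); i++) {
                for (int j = i + 1; j < remaining.size(); j++) {
                    double d = distance(remaining.get(i), remaining.get(j));
                    if (d > best) {
                        best = d;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            result.add(remaining.get(bestI));
            result.add(remaining.get(bestJ));
            remaining.remove(bestJ); // larger index first, so bestI stays valid
            remaining.remove(bestI);
        }
        if (result.size() < k && !remaining.isEmpty()) {
            result.add(remaining.get(0)); // odd k: add one arbitrary remaining point
        }
        return result;
    }

    /** Average distance over all pairs of the selected points. */
    public static double averageDistance(List<double[]> points) {
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < points.size(); i++) {
            for (int j = i + 1; j < points.size(); j++) {
                sum += distance(points.get(i), points.get(j));
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : sum / pairs;
    }

    public static void main(String[] args) {
        List<double[]> s = new ArrayList<>();
        for (int x = 0; x < 10; x++) {
            s.add(new double[] {x, 0}); // hypothetical points on a line
        }
        List<double[]> solution = selectDiverse(coreset(s, 2, 2), 2);
        System.out.println("Average distance = " + averageDistance(solution));
    }
}
```

In this scheme the coreset has at most K * L points, so a larger L spreads the first round over more partitions at the cost of a larger second-round input.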

Execute on cloud

Step 1: Build the shadow jar with

./gradlew shadowJar

We suggest deleting the build directory before rebuilding the shadow jar. Make sure to compile with Java 8. You can check and change the active versions with

java -version
javac -version

sudo update-alternatives --config java
sudo update-alternatives --config javac

Step 2: Copy the fat jar to a machine in the unipd network (skip this step if you are already inside the unipd network).

scp build/libs/bdc-assignments-all.jar torre.studenti.math.unipd.it:dev/bdc

# or, if you have not configured key-based authentication (password required)
scp build/libs/bdc-assignments-all.jar username@torre.studenti.math.unipd.it:dev/bdc

Step 3: Copy the jar to the cluster

scp dev/bdc/bdc-assignments-all.jar group49@147.162.226.106:.

Step 4: Run the job with Spark

spark-submit \
  --num-executors 4 \
  --class it.unipd.bdc.ay2019.group49.hw.G49HW3 \
  bdc-assignments-all.jar \
  /data/BDC1920/uber-small.csv 4 4

👥 Authors

Luca Parolari

GitHub: @lparolari

Giulio Piva

🙏 Credits

Thanks to our big data course tutors. Take a look at their course web page.

📝 License

The project is MIT licensed. See the LICENSE file.
