A POC of a scoring REST service for Spark ML Pipelines: serve Apache Spark pipelines from Kubernetes.

REST JSON Request -> K8s service -> Apache Spark

# Create and save a simple pipeline

The code under `data/pipline-archive` contains a PySpark pipeline. Create a simple pipeline and save it:
```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext("local[2]", "SparkML")
spark = SparkSession(sc)

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

# Save the fitted pipeline.
model.save("/some-where-ml/project1/pipeline")
```

Describe the pipeline with a JSON descriptor:

```json
{
    "name": "Test pipeline",
    "pipeline": "simple_pipline1",
    "schema": [
        {"name": "name", "type": "STRING", "isNullable": false}
    ],
    "sample": {
        "name": "Bob"
    },
    "output": ["name", "features"]
}
```
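The descriptor's `schema` and `sample` fields describe the request shape the pipeline expects. As an illustration only (a hypothetical `validate_request` helper, not the server's actual code), a request record can be checked against such a schema like this:

```python
import json

# The descriptor from above, trimmed to the fields used here.
DESCRIPTOR = json.loads("""
{
    "name": "Test pipeline",
    "schema": [
        {"name": "name", "type": "STRING", "isNullable": false}
    ],
    "sample": {"name": "Bob"}
}
""")

def validate_request(record: dict, schema: list) -> list:
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for field in schema:
        if record.get(field["name"]) is None and not field["isNullable"]:
            problems.append(f"missing non-nullable field: {field['name']}")
    return problems

# The descriptor's own sample should validate cleanly.
assert validate_request(DESCRIPTOR["sample"], DESCRIPTOR["schema"]) == []
assert validate_request({}, DESCRIPTOR["schema"]) == ["missing non-nullable field: name"]
```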
Now zip the folder `/some-where-ml/project1` and put it in your `pipelines.folder`:

```shell
zip -r spark-sample-pipeline.zip /some-where-ml/project1
```

In `application.properties`, change `pipelines.folder` to your pipeline store folder, or pass it as a parameter (as shown below).
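Assuming the standard Spring Boot property format, the relevant entry in `application.properties` would be a single key (the path here is just the example folder from above):

```properties
# Folder containing the zipped pipeline archives
pipelines.folder=/some-where-ml
```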
# Build

```shell
mvn install package
```
# Run

```shell
java -Dpipelines.folder=/some-where-ml -jar target/mlserver.jar
```

```shell
curl -X POST \
  -H "Content-Type: application/json" \
  -d '[{"text":"Yehuda"}]' \
  http://localhost:9900/predict/spark-sample-pipeline
```

# TODO

- Add warm-up for modules
- Dockerize and serve it as a service
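On the warm-up TODO: each descriptor already carries a `sample` record, so one way to warm up is to score every pipeline's sample once right after startup. A hedged sketch under that assumption (the endpoint shape follows the curl example above; `build_warmup_calls`, `warm_up`, and `post_json` are hypothetical names, not part of this server):

```python
import json

def build_warmup_calls(pipelines: dict, base_url: str = "http://localhost:9900"):
    """For each pipeline name -> descriptor, build the (url, body) pair to POST.

    The predict endpoint takes a JSON array of records, as in the curl
    example, so the descriptor's sample is wrapped in a list.
    """
    calls = []
    for name, descriptor in pipelines.items():
        url = f"{base_url}/predict/{name}"
        body = json.dumps([descriptor["sample"]])
        calls.append((url, body))
    return calls

def warm_up(pipelines: dict, post_json):
    """Score each pipeline's sample once so the first real request is fast."""
    for url, body in build_warmup_calls(pipelines):
        post_json(url, body)

# Example with the sample pipeline from this README:
calls = build_warmup_calls({"spark-sample-pipeline": {"sample": {"name": "Bob"}}})
assert calls == [("http://localhost:9900/predict/spark-sample-pipeline",
                  '[{"name": "Bob"}]')]
```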
# Docker

To build:

```shell
mvn clean install package && docker build -t alefbt/spring-mlspark-serving .
```

Then run:

```shell
docker run --rm -it -p 9900:8080 alefbt/spring-mlspark-serving
```

# Notes

- This code is not optimized for sub-second serving. It is possible (I did it in another project), but you need to add some caching.
- Code contributions are welcome!
- Remember: this project is a POC.
- FIXED: Snappy (the `xerial.snappy` package) causes problems when running on a vanilla Docker `openjdk:8-jdk-alpine` image; see the `Dockerfile`, which explains the workaround.
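On the caching point above: the most obvious candidate is keeping loaded pipeline models in memory instead of reloading them per request. A minimal sketch of the idea (the `loader` argument is a stand-in for something like `PipelineModel.load`; `make_cached_loader` is a hypothetical helper, not code from this repo):

```python
from functools import lru_cache

def make_cached_loader(loader, maxsize: int = 16):
    """Wrap a model loader so each pipeline is loaded from disk only once."""
    @lru_cache(maxsize=maxsize)
    def load(name: str):
        return loader(name)
    return load

# Demo with a counting stub in place of a real model loader:
loads = []
cached = make_cached_loader(lambda name: loads.append(name) or f"model:{name}")
assert cached("spark-sample-pipeline") == "model:spark-sample-pipeline"
assert cached("spark-sample-pipeline") == "model:spark-sample-pipeline"
assert loads == ["spark-sample-pipeline"]  # loaded exactly once
```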
# License

MIT - free. Use and contribute!


