Setup:
1. A Random Forest model is trained offline and stored on the file system.
2. The model is loaded once, at the start of the Spark Streaming application, using Pipeline.load.
3. The predict function is called for every batch: model.transform(input_data_frame).
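The steps above might be sketched in Scala roughly as follows. This is a minimal sketch, not the poster's actual code: the application name, model path, and column wiring are assumptions. Note also that a fitted pipeline is normally reloaded with PipelineModel.load (Pipeline.load returns an unfitted pipeline), which may be what the poster meant.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.{DataFrame, SparkSession}

object ScoringApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rf-scoring")          // hypothetical name
      .getOrCreate()

    // Step 2: load the offline-trained, fitted pipeline once at startup.
    // The path is hypothetical.
    val model = PipelineModel.load("hdfs:///models/rf_pipeline")

    // Step 3: invoked once per micro-batch with that batch's DataFrame.
    def scoreBatch(inputDataFrame: DataFrame): DataFrame =
      model.transform(inputDataFrame)
  }
}
```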
Observation:
From the Spark UI we can see that every task of this stage spends most of its time (more than 95%) on deserialization. Our assumption is that every task is deserializing the model that was loaded initially, so we tried broadcasting the model (broadcast variables are useful when caching the data in deserialized form is important, as described in https://spark.apache.org/docs/latest/rdd-programming-guide.html), but the task deserialization time is still high.
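The broadcast attempt described above would look something like this sketch (assuming Scala; `spark` is the session and `model` the fitted pipeline loaded at startup, as in the setup):

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.DataFrame

// Broadcast once on the driver so each executor caches a single
// deserialized copy of the model instead of each task fetching its own.
val modelBc: Broadcast[PipelineModel] =
  spark.sparkContext.broadcast(model)

def scoreBatch(inputDataFrame: DataFrame): DataFrame =
  modelBc.value.transform(inputDataFrame)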
Spark standalone cluster details:
Spark version: 2.1.0
Executor cores = 7
Executor memory = 16 GB
Total executors = 17
spark.default.parallelism = total cores * 3 = (17 * 7) * 3 = 357
Answer by tam · 5 days ago
I am also facing the same issue. Did you find a solution? Can you please share an approach that avoids broadcasting the model for every mini-batch in Structured Streaming? Thank you.