
Task deserialization time is very high in the prediction phase of a Spark Streaming application.

mllib, random forest, spark-streaming
Question by niravkntr · Jul 20, 2017 at 06:17 AM

Setup:

1. A random forest model is trained offline and stored on the file system.

2. This model is loaded once at the start of the Spark Streaming application using Pipeline.load.

3. The predict function is called for every batch (model.transform(input_data_frame)).
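
The setup above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: it assumes PySpark and assumes the stored artifact is a fitted pipeline, for which PySpark's loader is `PipelineModel.load` (the post itself names `Pipeline.load`). The names `BatchScorer`, `load_pipeline_model`, and the model path are hypothetical.

```python
from typing import Any, Callable, Optional


def load_pipeline_model(path: str) -> Any:
    """Default loader (assumption: the saved artifact is a fitted PipelineModel)."""
    # Imported lazily so the rest of the sketch can be exercised without Spark.
    from pyspark.ml import PipelineModel
    return PipelineModel.load(path)


class BatchScorer:
    """Hypothetical helper: loads the model once, reuses it for every batch."""

    def __init__(self, model_path: str,
                 loader: Callable[[str], Any] = load_pipeline_model):
        self._model_path = model_path
        self._loader = loader
        self._model: Optional[Any] = None
        self.load_count = 0  # how many times the loader actually ran

    def model(self) -> Any:
        # Load on first use only; later calls return the cached object.
        if self._model is None:
            self._model = self._loader(self._model_path)
            self.load_count += 1
        return self._model

    def score(self, batch_df: Any) -> Any:
        # Step 3 of the setup: model.transform(input_data_frame) per batch.
        return self.model().transform(batch_df)
```

In a streaming job, `score` would be called once per micro-batch with that batch's DataFrame. Note that this only covers reuse of the model object on the driver; it says nothing by itself about what each task has to deserialize on the executors.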

Observation:

From the Spark UI we can see that every task in this stage spends most of its time (more than 95%) on deserialization. Our assumption is that every task deserializes the model that was loaded initially, so we tried broadcasting the model (broadcast variables are useful when caching the data in deserialized form is important, as described in https://spark.apache.org/docs/latest/rdd-programming-guide.html), but the task deserialization time is still high.

Spark standalone cluster details:

Spark version = 2.1.0

Executor cores = 7

Executor memory = 16 GB

Total executors = 17

spark.default.parallelism = total cores * 3 = (17 * 7) * 3 = 357
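
As a quick check of the parallelism figure, the arithmetic is executors times cores per executor times 3. The function and parameter names below are hypothetical and only restate the numbers given in the post.

```python
def default_parallelism(executors: int, cores_per_executor: int,
                        tasks_per_core: int) -> int:
    """Total cores in the cluster multiplied by a per-core task factor."""
    total_cores = executors * cores_per_executor
    return total_cores * tasks_per_core


# Values from the cluster details: 17 executors, 7 cores each, factor 3.
print(default_parallelism(17, 7, 3))  # 357
```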


1 Answer


Answer by tam · 5 days ago

I am also facing the same issue. Did you find a solution? Could you please share an approach that avoids broadcasting the model for every mini-batch in Structured Streaming? Thank you.




