Persisting to MEMORY_AND_DISK significantly slows read from s3 stage

Tags: cache · caching · persist · persisting dataframe

Question by Christian Rodriguez · Apr 18, 2017 at 01:44 AM

I have a DataFrame that I'm trying to cache shortly after reading it from S3:

from pyspark import StorageLevel
from pyspark.sql.functions import concat_ws

datasets["nav"]["dataFrame"] = df \
    .withColumn('dpkey', concat_ws("-", *data_partitions)) \
    .persist(StorageLevel.MEMORY_AND_DISK)

Running my notebook without caching behaves as expected, but for some reason, when I persist that DataFrame, the task takes several times longer.

I believe the dataset is larger than my available memory, but the task is slow from the get-go, even while there's still plenty of memory available.

Any ideas?


2 Answers


Answer by bill · Apr 18, 2017 at 02:02 PM

Hey Christian - persisting/caching isn't free, because you take on some serialization overhead. There can also be storage issues: if the data isn't distributed evenly across the cluster, then certain machines have to do much more work than the others.
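The serialization overhead mentioned above can be seen outside Spark, too. A minimal sketch in plain Python (not from this thread; the record layout is hypothetical) that times a pickle round-trip of a partition-sized object, which is roughly the kind of extra CPU work a serialized cache write and read-back adds:

import pickle
import time

# Hypothetical stand-in for one cached partition: a million small records.
records = [{"id": i, "dpkey": f"a-{i % 10}"} for i in range(1_000_000)]

start = time.perf_counter()
blob = pickle.dumps(records)    # serialize, as a cached partition would be
restored = pickle.loads(blob)   # deserialize on read-back
elapsed = time.perf_counter() - start

print(f"round-trip took {elapsed:.2f}s for {len(blob) / 1e6:.1f} MB")

The point of the sketch is that this is CPU-bound work that simply doesn't happen in the uncached run, so the first pass over a persisted DataFrame is expected to be slower than a plain read.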


Answer by Christian Rodriguez · Apr 18, 2017 at 02:06 PM

The data is evenly distributed: the delay immediately follows the read from mounted S3, and memory distribution across workers looks even in the Spark UI. Is the serialization tax really so high that persisting takes several times as long as reading from S3?
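For what it's worth, even distribution can be sanity-checked numerically rather than by eyeballing the UI. A hedged sketch (the helper name is mine, not Spark's): given per-partition record counts, compare the largest partition to the mean:

def skew_ratio(partition_sizes):
    """Ratio of the largest partition to the mean partition size.
    Near 1.0 means evenly distributed; well above 1.0 means skew."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean

# Evenly distributed partitions -> ratio of 1.0
print(skew_ratio([100, 100, 100, 100]))  # 1.0
# One hot partition -> ratio well above 1.0
print(skew_ratio([100, 100, 100, 700]))  # 2.8

In PySpark, per-partition counts can be obtained with df.rdd.glom().map(len).collect(), though note that this itself triggers a job over the data.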



© Databricks 2015. All rights reserved. Apache Spark and the Apache Spark Logo are trademarks of the Apache Software Foundation.
