I have a DataFrame I'm trying to cache shortly after reading from s3:
datasets["nav"]["dataFrame"] = df \
.withColumn('dpkey', concat_ws("-", *data_partitions)) \
.persist(StorageLevel.MEMORY_AND_DISK)
The notebook runs as expected without caching, but when I try to persist that DataFrame the task takes several times longer.
I believe the dataset is larger than my available memory, but the task is slow from the get-go, even while there's still plenty of memory available.
Any ideas?
Answer by bill · Apr 18, 2017 at 02:02 PM
Hey Christian - persisting/caching isn't free, because you take on some serialization overhead. There can also be storage issues: if the data isn't distributed evenly across the cluster, then certain machines have to do much more work than others.
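A rough way to check whether skew or lazy evaluation is what you're seeing (just a sketch, assuming PySpark and the df name from your snippet):

from pyspark import StorageLevel

# Skew check: row counts per partition. A few very large partitions next to
# many small ones means a handful of executors do most of the caching work.
rows_per_partition = df.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print("partitions:", df.rdd.getNumPartitions())
print("largest:", sorted(rows_per_partition)[-5:])
print("smallest:", sorted(rows_per_partition)[:5])

# persist() is lazy - the caching cost is only paid when an action
# materializes the cache, so time the first action separately.
cached = df.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()   # first action pays the serialization/caching cost
cached.count()   # later actions should read from the cache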
Answer by Christian Rodriguez · Apr 18, 2017 at 02:06 PM
Data is evenly distributed: the delay immediately follows reading from mounted s3, and memory distribution across workers looks even in the UI. Is the serialization tax really so high that it takes several times as long as reading from s3?