The official Spark documentation says:
Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.
What does caching tables using an in-memory columnar format really mean? Does it put the whole table into memory? As we know, caching is also lazy: the table is cached only after the first action on the query. Does it make any difference to the cached table which actions and queries are chosen? I've googled this caching topic several times but failed to find any detailed articles. I would really appreciate it if anyone could provide some links or articles on this topic.
http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
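To make my question concrete, here is roughly the usage I am asking about (just a small spark-shell sketch; the table name and data are made-up placeholders, and I am assuming sc and sqlContext already exist in the shell):

import sqlContext.implicits._

// Placeholder data registered as a temp table so it can be cached by name
val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "value")
df.registerTempTable("tableName")

sqlContext.cacheTable("tableName")                        // lazy: nothing is materialized yet
sqlContext.sql("SELECT COUNT(*) FROM tableName").show()   // first action builds the in-memory columnar cache
sqlContext.uncacheTable("tableName")                      // frees the cached blocks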
Answer by bill · Feb 19, 2016 at 08:52 PM
It means that Spark puts the whole table (or as much of it as fits) in memory in an optimized columnar format for later queries. It does make a difference which action you use to trigger the caching.
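For example (this is only a rough spark-shell sketch; the data and column name are made up), the action you run after cache() determines how much of the table actually ends up in memory:

import sqlContext.implicits._

// Stand-in for a real table
val bigDF = sc.parallelize(1 to 1000000).map(i => (i, i % 10)).toDF("id", "bucket")

bigDF.cache()      // lazy: only marks the plan for in-memory columnar caching
bigDF.take(1)      // may materialize only the partitions it actually touches
bigDF.count()      // scans every partition, so the whole table (or as much as fits) is cached

bigDF.groupBy("bucket").count().show()   // now answered from the cached columnar blocks

count() touches every partition, so it is a common way to force the whole table into the cache; take(1) or show() may leave most partitions uncached until a later query reads them.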
Hi bill
I have the following situation; can you please let me know whether caching can help me here?
1) I have a big table and created a DataFrame from it using the read.jdbc() method, say bigTableDataframe.
2) Later I want to join this DataFrame with 4 different small tables, like this:
bigTableDataframe = bigTableDataframe.join(smallTableDataFrame1)
bigTableDataframe = bigTableDataframe.join(smallTableDataFrame2)
...and so on for the remaining small tables.
So what do you suggest: before doing the joins, should I call bigTableDataframe.cache() and then call the join() method? Or should I make use of a broadcast join? A sketch of what I have in mind is below.
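Something like the following is what I mean (the data, column names and join key are just placeholders for illustration; in reality bigTableDataframe would come from read.jdbc(), and I am assuming a spark-shell where sc and sqlContext exist):

import org.apache.spark.sql.functions.broadcast
import sqlContext.implicits._

// Stand-ins for my real tables
var bigTableDataframe = sc.parallelize(1 to 1000000).map(i => (i, i % 4)).toDF("id", "key")
val smallTableDataFrame1 = Seq((0, "zero"), (1, "one"), (2, "two"), (3, "three")).toDF("key", "label1")
val smallTableDataFrame2 = Seq((0, "even"), (1, "odd"), (2, "even"), (3, "odd")).toDF("key", "label2")

bigTableDataframe.cache()   // only worthwhile if the big side is reused by more than one action

// broadcast() hints that the small side should be shipped to every executor,
// so the joins avoid shuffling the big table
bigTableDataframe = bigTableDataframe.join(broadcast(smallTableDataFrame1), "key")
bigTableDataframe = bigTableDataframe.join(broadcast(smallTableDataFrame2), "key")

bigTableDataframe.show(5)

My understanding is that broadcast() only helps while the small tables fit in executor memory, and that cache() only pays off if the big DataFrame is reused by more than one action, but please correct me if that is wrong.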