Below is a SQL query that will be executed using Spark:
$Query =
SELECT col1, col2, ... col7, SUM(col8), SUM(col9), ...
FROM table1 WHERE (condition1)
GROUP BY col1, col2, ... col7
HAVING SUM(col8), SUM(col9), SUM(col10), ...
1. With Spark 1.5.x and DataFrames, can the above operation be represented as df.groupBy($"col1", $"col2", ... $"col7").sum("col8", "col9", ... "col12")? Is dataframe.groupBy() optimized for data locality (i.e., similar to the reduce and aggregate operations) rather than doing a naive shuffle? With multiple groupBy columns, how effective would that optimization be for, say, a billion or two tuples? Does the resulting DataFrame include sum(col8), sum(col9), ... as selected columns?
2. If the above SQL query is executed as SqlContext/HiveContext.sql("$Query"), is this any different from dataframe.groupBy().sum()?
3. Can we use dataframe.cube() for the above groupBy query, as
dataframe.cube($"col1", $"col2", ... $"col7").sum("col8", "col9", ... "col12")?
If yes, is one approach better than the others in terms of query optimization or execution (df.groupBy().agg(), SqlContext.sql(""), df.cube().sum())? Thanks in advance @jason
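For concreteness, the three formulations being compared might look like the following sketch in Spark 1.5-era Scala (column and table names are the placeholders from the question; `df` and `sqlContext` are assumed to already exist, and the HAVING condition is an illustrative `> 0`):

```scala
import org.apache.spark.sql.functions.sum

// 1. DataFrame API: groupBy + agg. Each sum(...) produces one aggregate
//    column (named like sum(col8)) alongside the grouping columns.
val grouped = df
  .groupBy($"col1", $"col2", $"col3", $"col4", $"col5", $"col6", $"col7")
  .agg(sum($"col8"), sum($"col9"))   // one SUM per measure column

// 2. Equivalent SQL string: parsed and planned by the same Catalyst
//    optimizer, so the physical plan should match the API version.
val viaSql = sqlContext.sql(
  """SELECT col1, col2, col3, col4, col5, col6, col7,
    |       SUM(col8), SUM(col9)
    |FROM table1
    |WHERE condition1
    |GROUP BY col1, col2, col3, col4, col5, col6, col7
    |HAVING SUM(col8) > 0""".stripMargin)

// 3. cube: NOT equivalent to groupBy. It additionally computes every
//    subtotal combination of the grouping columns (2^7 grouping sets
//    here), so it does strictly more work than a plain groupBy.
val cubed = df
  .cube($"col1", $"col2", $"col3", $"col4", $"col5", $"col6", $"col7")
  .agg(sum($"col8"), sum($"col9"))
```

Note that GroupedData also offers the string-name shorthand `.sum("col8", "col9")`, which is equivalent to the `agg(sum(...), ...)` form above.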