
Spark DataFrame groupby, sql, cube - alternatives and optimization

Tags: dataframes, dataframe, spark-sql, sparksql
Question by sai · Jun 21, 2016 at 03:35 PM

The SQL query below will be executed using Spark:

$Query =

SELECT col1, col2, ... col7, SUM(col8, col9...)
FROM table1 WHERE (condition1)
GROUP BY col1, col2, ... col7
HAVING SUM(col8, col9, col10...)

1. With Spark 1.5.x and DataFrames, can the above operation be represented as df.groupBy($"col1", $"col2", .... $"col7").sum(col8, col9....col12)? Is DataFrame.groupBy() optimized for data locality (i.e. does it combine values locally, like the reduce and aggregate operations) rather than doing a naive shuffle? With multiple groupBy columns, how effective would that optimization be for, say, a billion or two tuples? Does the resulting DataFrame include sum(col8, col9....) as selected columns?
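To make the "local combine versus naive shuffle" distinction in question 1 concrete, here is a minimal sketch in plain Scala (no Spark involved). Each inner `Seq` stands in for one partition, and the names `naiveShuffle` and `partialThenMerge` are made up for illustration. As far as I understand, Spark SQL plans a sum aggregate as a partial aggregation before the exchange followed by a final aggregation, which `explain()` on the real query should confirm; this model only shows why that reduces the number of shuffled records.

```scala
// Plain-Scala model of two aggregation strategies; this is NOT Spark code.
object PartialAggregationSketch {
  type Key = (String, String) // stands in for the grouping columns (col1, col2)
  type Row = (Key, Long)      // (grouping key, value to sum, e.g. col8)

  // Naive strategy: send every row to its key's reducer, then sum there.
  def naiveShuffle(partitions: Seq[Seq[Row]]): Map[Key, Long] =
    partitions.flatten
      .groupBy(_._1)
      .map { case (k, rows) => k -> rows.map(_._2).sum }

  // Partial aggregation: sum inside each partition first, so at most one
  // record per key per partition is shuffled, then merge the partial sums.
  // Returns the merged sums and the number of records that were "shuffled".
  def partialThenMerge(partitions: Seq[Seq[Row]]): (Map[Key, Long], Int) = {
    val partials: Seq[Map[Key, Long]] = partitions.map { p =>
      p.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2).sum }
    }
    val shuffledRecords = partials.map(_.size).sum
    val merged = partials.flatten
      .groupBy(_._1)
      .map { case (k, ps) => k -> ps.map(_._2).sum }
    (merged, shuffledRecords)
  }

  def main(args: Array[String]): Unit = {
    val partitions: Seq[Seq[Row]] = Seq(
      Seq((("a", "x"), 1L), (("a", "x"), 2L), (("b", "y"), 3L)),
      Seq((("a", "x"), 4L), (("b", "y"), 5L), (("b", "y"), 6L))
    )
    val naive = naiveShuffle(partitions)
    val (merged, shuffled) = partialThenMerge(partitions)
    assert(naive == merged) // both strategies give the same sums
    println(s"sums: $merged")
    println(s"records shuffled after partial aggregation: $shuffled of ${partitions.map(_.size).sum}")
  }
}
```

The saving depends on how many distinct keys each partition holds. With seven grouping columns the key space can be large, and if most rows in a partition have distinct keys, the partial step removes few records before the shuffle.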

2. If the above SQL query is executed as SQLContext/HiveContext.sql("$Query"), is that any different from dataframe.groupBy().sum()?
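One way to check question 2 empirically is to build the same aggregation both ways and compare the plans printed by `explain(true)`. The sketch below targets the Spark 1.5.x API (`SQLContext`, `registerTempTable`) and runs in local mode. The sample rows, the reduced column set, the `col8 > 0` filter standing in for (condition1), and the `SUM(col8) > 2` threshold standing in for the elided HAVING condition are all assumptions made for illustration. To my understanding both forms go through the same Catalyst optimizer, so the plans should come out equivalent, but the printed output is the thing to trust.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

object GroupByVsSql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("groupby-vs-sql").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical stand-in for table1: two grouping columns, two measures.
    val df = sc.parallelize(Seq(
      ("a", "x", 1L, 10L),
      ("a", "x", 2L, 20L),
      ("b", "y", 3L, 30L)
    )).toDF("col1", "col2", "col8", "col9")
    df.registerTempTable("table1")

    // DataFrame form: WHERE -> filter, GROUP BY -> groupBy,
    // HAVING -> filter applied to the aggregated column.
    val viaApi = df
      .filter($"col8" > 0)                 // placeholder for (condition1)
      .groupBy($"col1", $"col2")
      .agg(sum($"col8").as("sum8"), sum($"col9").as("sum9"))
      .filter($"sum8" > 2)                 // placeholder HAVING condition

    // SQL form of the same query.
    val viaSql = sqlContext.sql(
      """SELECT col1, col2, SUM(col8) AS sum8, SUM(col9) AS sum9
        |FROM table1
        |WHERE col8 > 0
        |GROUP BY col1, col2
        |HAVING SUM(col8) > 2""".stripMargin)

    // Print the logical and physical plans of both forms for comparison.
    viaApi.explain(true)
    viaSql.explain(true)

    // Both forms should return the same rows.
    assert(viaApi.collect().toSet == viaSql.collect().toSet)
    sc.stop()
  }
}
```

Using `agg` with explicit aliases avoids depending on the auto-generated names of the sum columns, and the grouping columns appear in the result alongside the aggregates.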

3. Can we use dataframe.cube() for the above groupBy query, as in

dataframe.cube($"col1", $"col2", .... $"col7").sum(col8, col9....col12)

If yes, is one approach better than the others in terms of query optimization or execution (df.groupBy().agg(), SqlContext.sql(""), df.cube().sum())? Thanks in advance @jason
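For reference on the cube question, here is a minimal plain-Scala model (again no Spark) of how the results of a group-by and a cube differ. The helper names `groupBySum` and `cubeSum` are hypothetical, and `None` plays the role of the null that Spark puts in a grouping column that was rolled up. It is a sketch of the semantics only, not of how Spark executes either operation.

```scala
// Plain-Scala model of groupBy vs cube result sets; this is NOT Spark code.
object CubeVsGroupBySketch {
  type Row = (String, String, Long)            // (col1, col2, col8)
  type Key = (Option[String], Option[String])  // None = column rolled up

  // groupBy(col1, col2).sum(col8): a single grouping set,
  // one output row per distinct (col1, col2) pair.
  def groupBySum(rows: Seq[Row]): Map[Key, Long] =
    rows.groupBy(r => (Option(r._1), Option(r._2)): Key)
      .map { case (k, rs) => k -> rs.map(_._3).sum }

  // cube(col1, col2).sum(col8): every subset of the grouping columns is a
  // grouping set, so two columns give 2^2 = 4 grouping sets.
  def cubeSum(rows: Seq[Row]): Map[Key, Long] = {
    val masks = for (k1 <- Seq(true, false); k2 <- Seq(true, false)) yield (k1, k2)
    masks.flatMap { case (keep1, keep2) =>
      rows.groupBy { r =>
        (if (keep1) Some(r._1) else None, if (keep2) Some(r._2) else None): Key
      }.map { case (k, rs) => k -> rs.map(_._3).sum }
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] = Seq(("a", "x", 1L), ("a", "y", 2L), ("b", "x", 4L))
    val grouped = groupBySum(rows)
    val cubed = cubeSum(rows)
    // The fully-grouped slice of the cube equals the groupBy result.
    val fullSlice = cubed.filter { case ((c1, c2), _) => c1.isDefined && c2.isDefined }
    assert(fullSlice == grouped)
    println(s"groupBy rows: ${grouped.size}, cube rows: ${cubed.size}")
  }
}
```

The cube output contains the group-by rows plus subtotal rows for every other combination of grouping columns. With seven grouping columns that is 2^7 = 128 grouping sets, so cube does extra work and returns extra rows; getting the group-by answer back from it means filtering out every row that has a rolled-up column.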
