What is the optimal Number of Cores I should use for fast querying

spark sql · performance · cluster-resources · configuration
Question by msj50 · May 19, 2015 at 02:29 PM

Hi, I'm running lots of queries with spark-sql against large tables that I have stored in the compressed ORC file format and partitioned.

I have a two-slave cluster, with 16 GB of memory and 8 cores on each slave.

I have a table with approximately 44 million rows (roughly 3 GB compressed). A simple count query against the primary key column takes approximately 12 seconds. I was wondering if anyone has suggestions on how I could speed this up further, and whether this level of performance should be expected.

Here is the stage breakdown:

[attachment: sparkperformancesummary.png (98.6 kB)]
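
For reference, a minimal sketch of the kind of query in question (the table and column names are placeholders; Spark 1.x HiveContext API):

    // Minimal sketch of the count query described above (Spark 1.x API).
    // "my_orc_table" and "pk_id" are placeholder names.
    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc) // sc is the existing SparkContext, as in spark-shell

    // The table is stored partitioned and compressed with ORC.
    val result = sqlContext.sql("SELECT COUNT(pk_id) FROM my_orc_table")

    // This simple count is the query that takes roughly 12 seconds.
    println(result.collect().mkString)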
4 Answers

Sort

  • Votes
  • Created
  • Oldest
avatar image
1

Answer by juanpampliega · May 19, 2015 at 07:38 PM

Count queries are not the best benchmark of actual performance. If you are using the ORC format, a query that performs complex operations over a subset of the columns will show the advantage of ORC over other formats better. Also, try partitioning the data into multiple subdirectories according to your query patterns, so you avoid scanning the whole table every time you run a query over it. Keep in mind that roughly twice the total number of worker cores is the recommended minimum number of partitions for an RDD to take full advantage of Spark's parallelization of the query. In your case that would be (8 + 8) * 2 = 32 partitions.
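
As a rough sketch of that rule of thumb (the sqlContext variable and table name are assumptions; Spark 1.x API):

    // Applying the "2x total worker cores" rule of thumb; the values below are assumptions.
    val totalCores = 8 + 8                 // two workers with 8 cores each
    val targetPartitions = totalCores * 2  // = 32

    // Spark SQL shuffles use spark.sql.shuffle.partitions for their partition count.
    sqlContext.setConf("spark.sql.shuffle.partitions", targetPartitions.toString)

    // An under-partitioned DataFrame can also be repartitioned explicitly.
    val df = sqlContext.table("my_orc_table")
    val repartitioned = df.repartition(targetPartitions)
    println(s"Partitions: ${repartitioned.rdd.partitions.length}")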


Answer by msj50 · May 20, 2015 at 11:43 PM

Hi Juan - thanks for the response.

You do have a valid point: a simple count doesn't show the advantages of the ORC format.

We are trying to partition our data intelligently to avoid excessive table scans. However, I was wondering whether two slaves with 8 cores each is enough. Would I get a significant improvement in performance if I increased the number of cores on the slaves and the total number of slaves?

What has been your experience with this?

Thanks,

Joel


Answer by juanpampliega · May 21, 2015 at 02:21 AM

Hi Joel,

This talk on Spark internals from Spark Summit 2014 will give you some ideas about the general concepts that come into play in Spark performance: https://spark-summit.org/2014/talk/a-deeper-understanding-of-spark-internals

In short, there are two main factors that come into play when optimizing parallelization: how well (and how finely) the data is partitioned, and the operations being performed on the data (more operations that cause shuffles generally means worse performance).

This is a difficult question to answer because there are many factors that can vary. In general, if your RDD is sufficiently partitioned (the total number of partitions equals or exceeds the number of cores), then as you increase the number of cores/machines in the cluster, more operations can run in parallel and performance should improve. If your code does a lot of shuffling, though, the shuffle will perform worse as the number of machines increases.

Overall, if you can provide more cores to the cluster you use, you should get better performance. If that is not the case, check whether tasks are finishing too quickly (less than 100 ms each); that would mean your data is too small for the number of partitions being used.
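
A small sketch of that sanity check (the sqlContext variable and table name are placeholders):

    // Sanity-check partitioning relative to the cores available to the application.
    // "my_orc_table" is a placeholder name.
    val df = sqlContext.table("my_orc_table")
    val numPartitions = df.rdd.partitions.length
    val totalCores = sc.defaultParallelism // roughly the total cores granted to this app

    if (numPartitions < totalCores) {
      println(s"Only $numPartitions partitions for $totalCores cores; some cores will sit idle.")
    } else {
      println(s"$numPartitions partitions across $totalCores cores.")
    }
    // Per-task durations are easiest to read from the stage page in the Spark UI;
    // tasks consistently under ~100 ms suggest the data is too finely partitioned for its size.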

Regards, Juan.


Answer by msj50 · May 22, 2015 at 05:41 PM

Hi Juan,

Thanks for the link; I'll check it out. Yes, I guess it is quite difficult to quantify. I'll try increasing the number of cores and keep an eye on task completion times.

Cheers,

Joel


