Hi, I'm running lots of queries using spark-sql against large tables that I have compressed using the ORC file format and partitioned.
I have a two-slave cluster set up with 16 GB of memory and 8 cores on each slave.
I have a table with approximately 44 million rows (about 3 GB compressed). A simple count query against the primary key column takes approximately 12 seconds. I was wondering if anyone has suggestions on how I could speed this up further, and whether this level of performance should be expected.
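For reference, the kind of query being timed would look roughly like this in spark-shell; the table name "logs" and key column "id" are assumptions, not taken from the post:

```scala
import org.apache.spark.sql.hive.HiveContext

// sc is the SparkContext that spark-shell provides. "logs" and "id" are
// placeholder names for the table and its primary key column.
val sqlContext = new HiveContext(sc)
val rows = sqlContext.sql("SELECT COUNT(id) FROM logs").first().getLong(0)
println(s"row count: $rows")
```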
Here is the stage breakdown:
Answer by juanpampliega · May 19, 2015 at 07:38 PM
Count queries are not the best benchmark of actual performance. If you are using the ORC format, a query that does complex operations over a subset of the columns will show ORC's advantage over other formats much better. Also, try partitioning the data into multiple subdirectories according to your query patterns, so that you avoid scanning the whole table every time you run a query against it. Keep in mind that, as a rule of thumb, an RDD should have at least twice as many partitions as the total number of worker cores to take full advantage of Spark's parallelization. In your case that would be (8 + 8) * 2 = 32 partitions.
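A minimal sketch of that rule of thumb in spark-shell, reusing the placeholder table name "logs" (not from this thread):

```scala
// Check how many partitions the table's underlying RDD actually has, then
// repartition toward 2x the cluster's core count: 2 slaves x 8 cores = 16
// cores, so 32 partitions. "logs" is a placeholder table name.
val df = sqlContext.table("logs")
println(s"partitions before: ${df.rdd.partitions.length}")

val tuned = df.repartition(32)        // (8 + 8) * 2 = 32; triggers a shuffle
tuned.registerTempTable("logs_tuned") // subsequent SQL can hit the tuned copy
println(s"partitions after: ${tuned.rdd.partitions.length}")
```

Note that repartition itself shuffles the data, so it pays off when the same table is queried many times afterwards.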
Answer by msj50 · May 20, 2015 at 11:43 PM
Hi Juan - thanks for the response.
You do have a valid point; a simple count doesn't show the advantages of the ORC format.
We are trying to partition our data intelligently to avoid excessive table scans. However, I was wondering whether two slaves with 8 cores each is enough: would I get a significant improvement in performance if I increased the number of cores on the slaves and the total number of slaves?
What has been your experience with this?
Thanks,
Joel
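As a concrete illustration of the partition-by-query-pattern idea Joel mentions, here is one way to lay the data out with the DataFrameWriter API (Spark 1.4+); the column name "event_date" and the paths are assumptions:

```scala
// Write the table partitioned by a date column, so each date lands in its
// own subdirectory. "logs", "event_date", and the paths are placeholders.
val df = sqlContext.table("logs")
df.write
  .format("orc")
  .partitionBy("event_date")
  .save("/data/logs_orc")

// A filter on the partition column now only scans matching subdirectories
// instead of the whole 44M-row table.
val oneDay = sqlContext.read.format("orc").load("/data/logs_orc")
  .where("event_date = '2015-05-01'")
println(oneDay.count())
```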
Answer by juanpampliega · May 21, 2015 at 02:21 AM
Hi Joel,
This talk on Spark internals from Spark Summit 2014 will give you some ideas about the general concepts that come into play in Spark performance: https://spark-summit.org/2014/talk/a-deeper-understanding-of-spark-internals

In short, there are two main factors when optimizing parallelization: how well (and into how many partitions) the data is partitioned, and the operations being performed on the data (generally, the more operations that cause shuffles, the worse the performance).
This is a difficult question to answer because there are many factors that can vary. In general, if your RDD is sufficiently partitioned (the total number of partitions equals or exceeds the number of cores), then as you add cores/machines to the cluster, more operations run in parallel and performance should increase. If your code does a lot of shuffling, though, the shuffles will perform worse as the number of machines increases.
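To make the shuffle point concrete, here is a small sketch contrasting a shuffle-heavy aggregation with one that pre-aggregates map-side; the input path and schema are assumptions:

```scala
// Build a pair RDD of (key, 1) from a placeholder CSV file.
val pairs = sc.textFile("/data/events.csv")
  .map(_.split(","))
  .map(fields => (fields(0), 1L))

// groupByKey ships every record across the network before summing;
// reduceByKey sums within each partition first, so far less data is shuffled.
val heavy = pairs.groupByKey().mapValues(_.sum)
val light = pairs.reduceByKey(_ + _)
```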
In general, if you can provide more cores to the cluster you use, you should get better performance. If that is not the case, check whether tasks are finishing too quickly (in less than about 100 ms each), which would mean your data is too small for the number of partitions being used.
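One rough way to check for that, again with the placeholder table name (coalesce on a DataFrame requires Spark 1.4+):

```scala
// If tasks finish in well under ~100 ms each, scheduling overhead dominates.
// coalesce reduces the partition count without a full shuffle.
val df = sqlContext.table("logs")
val parts = df.rdd.partitions.length
val cores = sc.defaultParallelism      // total cores the scheduler knows about
if (parts > cores * 4) {
  val rebalanced = df.coalesce(cores * 2) // back toward the 2x-cores guideline
  rebalanced.registerTempTable("logs_rebalanced")
}
```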
Regards, Juan.
Answer by msj50 · May 22, 2015 at 05:41 PM
Hi Juan,
Thanks for the link, I'll check that out. Yes, I guess it is quite difficult to quantify. I'll try increasing the number of cores and keep an eye on task completion times.
Cheers,
Joel