Using Spark 1.3.1, connecting via JDBC to a remote PostgreSQL server:
SPARK_CLASSPATH=postgresql-9.4-1201.jdbc4.jar bin/spark-shell
....
It took over 100 s to return 2 rows, each with a single int:
scala> sqlContext.sql("SELECT person_id FROM xxx limit 2").collect
15/05/12 18:09:01 INFO ParseDriver: Parsing command: SELECT person_id FROM xxx limit 2
15/05/12 18:09:01 INFO ParseDriver: Parse Completed
15/05/12 18:09:01 INFO SparkContext: Starting job: runJob at SparkPlan.scala:122
15/05/12 18:09:01 INFO DAGScheduler: Got job 1 (runJob at SparkPlan.scala:122) with 1 output partitions (allowLocal=false)
15/05/12 18:09:01 INFO DAGScheduler: Final stage: Stage 1(runJob at SparkPlan.scala:122)
15/05/12 18:09:01 INFO DAGScheduler: Parents of final stage: List()
15/05/12 18:09:01 INFO DAGScheduler: Missing parents: List()
15/05/12 18:09:01 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[3] at map at SparkPlan.scala:97), which has no missing parents
15/05/12 18:09:01 INFO MemoryStore: ensureFreeSpace(4096) called with curMem=6688, maxMem=277842493
15/05/12 18:09:01 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.0 KB, free 265.0 MB)
15/05/12 18:09:01 INFO MemoryStore: ensureFreeSpace(2591) called with curMem=10784, maxMem=277842493
15/05/12 18:09:01 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.5 KB, free 265.0 MB)
15/05/12 18:09:01 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:50405 (size: 2.5 KB, free: 265.0 MB)
15/05/12 18:09:01 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/05/12 18:09:01 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
15/05/12 18:09:01 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[3] at map at SparkPlan.scala:97)
15/05/12 18:09:01 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/05/12 18:09:01 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1062 bytes)
15/05/12 18:09:01 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/05/12 18:09:24 INFO BlockManager: Removing broadcast 0
15/05/12 18:09:24 INFO BlockManager: Removing block broadcast_0
15/05/12 18:09:24 INFO MemoryStore: Block broadcast_0 of size 4096 dropped from memory (free 277833214)
15/05/12 18:09:24 INFO BlockManager: Removing block broadcast_0_piece0
15/05/12 18:09:24 INFO MemoryStore: Block broadcast_0_piece0 of size 2592 dropped from memory (free 277835806)
15/05/12 18:09:24 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:50405 in memory (size: 2.5 KB, free: 265.0 MB)
15/05/12 18:09:24 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/05/12 18:09:24 INFO ContextCleaner: Cleaned broadcast 0
15/05/12 18:10:43 INFO JDBCRDD: closed connection
15/05/12 18:10:43 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 874 bytes result sent to driver
15/05/12 18:10:43 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 101230 ms on localhost (1/1)
15/05/12 18:10:43 INFO DAGScheduler: Stage 1 (runJob at SparkPlan.scala:122) finished in 101.232 s
15/05/12 18:10:43 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/05/12 18:10:43 INFO DAGScheduler: Job 1 finished: runJob at SparkPlan.scala:122, took 101.244593 s
res2: Array[org.apache.spark.sql.Row] = Array([1719973], [1719976])
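For context, in Spark 1.3 a JDBC table is typically registered through the jdbc data source. A minimal sketch of what that looks like in the shell (the connection URL, credentials, and table name below are placeholders, not taken from this thread). Note also that, as far as I know, the Spark 1.3 JDBC source does not push LIMIT down to the database, so one workaround worth trying is putting the limit inside a subquery in `dbtable` so Postgres does the limiting itself:

```scala
// Spark 1.3 shell: register the remote Postgres table as a temp table.
// The url and dbtable values here are hypothetical placeholders.
val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://db-host:5432/mydb?user=me&password=secret",
  "dbtable" -> "xxx"))
df.registerTempTable("xxx")

// Workaround sketch: push the LIMIT into the database via a subquery,
// so only 2 rows ever cross the wire.
val limited = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://db-host:5432/mydb?user=me&password=secret",
  "dbtable" -> "(SELECT person_id FROM xxx LIMIT 2) AS t"))
limited.collect()
```

`sqlContext.load(source, options)` is the Spark 1.3 API (it was superseded by `sqlContext.read` in 1.4); the `dbtable` option accepts any expression that is valid in a FROM clause, which is what makes the subquery trick possible.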
Answer by vida · Jun 18, 2015 at 10:30 PM
Where is your Postgres server? Is it in the same AWS region as your Spark cluster? If you issue the same query against your SQL database with a plain JDBC client, how long does it take? Knowing that will help determine whether the slowness comes from the connection between your SQL server and the Spark cluster, or from the way Spark queries your database.
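To get that baseline number, a minimal sketch of timing the same query over plain JDBC (the `time`/`timeQuery` helpers and the connection URL are my own, not from this thread; running `timeQuery` requires the Postgres driver on the classpath and a reachable server):

```scala
import java.sql.DriverManager

// Generic timing helper: returns the result plus elapsed milliseconds.
def time[A](body: => A): (A, Long) = {
  val t0 = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - t0) / 1000000)
}

// Hypothetical baseline: run the query over plain JDBC, outside Spark.
// e.g. timeQuery("jdbc:postgresql://db-host:5432/mydb?user=me&password=secret",
//                "SELECT person_id FROM xxx LIMIT 2")
def timeQuery(url: String, sql: String): Long = {
  val conn = DriverManager.getConnection(url)
  try {
    val (_, ms) = time {
      val rs = conn.createStatement().executeQuery(sql)
      while (rs.next()) {} // drain the result set so fetch time is included
    }
    ms
  } finally conn.close()
}
```

If the plain-JDBC number is already tens of seconds, the problem is network latency or the query itself; if it returns in milliseconds, the overhead is on the Spark side.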