I have a standalone Spark cluster of 3 nodes. We are migrating data from Oracle to Cassandra using Spark. To load one of the tables from Oracle into Cassandra, I am using a JDBC connection in Spark to pull the data into a DataFrame, but it is taking too long: around 2 hours for 0.25M (250,000) records. What is the best way to optimize this?
Answer by jason · Jun 24, 2016 at 04:55 PM
Try ingesting the data from Oracle in parallel across all the workers. Clone the following notebook and scroll down toward the end for an example:
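A minimal sketch of what "ingesting in parallel" looks like with Spark's JDBC source: if the table has a numeric column (an `ID` primary key is assumed here, and the connection URL, table, and bounds are all hypothetical), `partitionColumn`/`lowerBound`/`upperBound`/`numPartitions` make Spark split the range into strides and issue one JDBC query per stride, so every executor reads concurrently instead of one connection pulling all 250,000 rows. The helper below roughly mirrors how the range is split, to show what each partition's WHERE clause looks like:

```python
# Hedged PySpark sketch (hypothetical URL, table, credentials, and bounds):
#
#   df = (spark.read.format("jdbc")
#         .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
#         .option("driver", "oracle.jdbc.OracleDriver")
#         .option("dbtable", "MY_TABLE")
#         .option("user", "scott").option("password", "tiger")
#         .option("partitionColumn", "ID")   # numeric column to split on
#         .option("lowerBound", "1")
#         .option("upperBound", "250000")
#         .option("numPartitions", "12")     # 12 concurrent JDBC reads
#         .load())

def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Roughly how Spark splits [lower, upper] into per-partition WHERE
    clauses -- one JDBC query per stride, executed in parallel."""
    stride = (upper - lower) // num_partitions
    predicates = []
    bound = lower
    for i in range(num_partitions):
        low = bound
        bound += stride
        if i == 0:
            predicates.append(f"{column} < {bound}")
        elif i == num_partitions - 1:
            predicates.append(f"{column} >= {low}")
        else:
            predicates.append(f"{column} >= {low} AND {column} < {bound}")
    return predicates

for pred in jdbc_partition_predicates("ID", 0, 100, 4):
    print(pred)
```

Note that the bounds only control how the range is partitioned, not a filter: rows outside `[lowerBound, upperBound]` still land in the first and last partitions, which is why skewed or sparse key ranges can leave some partitions doing most of the work.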
Answer by neerajbhadani · 4 days ago
I am not able to find the Databricks guide at the above URL, and I do not see a question mark in the upper-right corner either.
I am saving a PySpark DataFrame to Oracle. It has only 450 rows but takes around 40 minutes to save. Could you please help me here?
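When a save of only 450 rows is this slow, a common cause is that the DataFrame has many partitions, each opening its own JDBC connection and inserting rows one at a time. Two hedged adjustments (the URL and table name below are hypothetical): coalesce a tiny DataFrame down to one partition, and set `batchsize` so inserts are batched per round trip. The small helper gives an upper-bound estimate of the INSERT round trips, assuming rows split evenly across partitions:

```python
# Hedged PySpark sketch (hypothetical URL, table, credentials):
#
#   (df.coalesce(1)                       # one connection for a tiny table
#      .write.format("jdbc")
#      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
#      .option("driver", "oracle.jdbc.OracleDriver")
#      .option("dbtable", "TARGET_TABLE")
#      .option("batchsize", "1000")       # batched INSERTs instead of per-row
#      .mode("append")
#      .save())

import math

def jdbc_round_trips(rows, partitions, batch_size):
    """Upper-bound estimate of INSERT round trips, assuming rows are
    split evenly: each partition batches its own rows independently."""
    per_partition = math.ceil(rows / partitions)
    return partitions * math.ceil(per_partition / batch_size)

# 450 rows over 200 partitions with no batching: hundreds of round trips
# spread over 200 separate connections.
print(jdbc_round_trips(450, 200, 1))
# 450 rows in 1 partition with batchsize=1000: a single batched round trip.
print(jdbc_round_trips(450, 1, 1000))
```

The round-trip count (and connection setup per partition) is usually what dominates for small tables, which is why coalescing plus batching can turn a 40-minute save into seconds.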