I am reading a dataset from a Parquet file. The file is 5 GB. I need to check whether two of its columns contain the same set of values or not. The entries in the two columns can be shuffled.
Example:

col1    col2  col3
login1  ram   login2
login2  ram   login2
login3  ram   login1
login1  ram   login3

Here col1 and col3 hold the same distinct values, just in a different order.
Our existing approach is fail-fast: for both columns, compute the distinct values, sort them, and then compare entries from the top. On the first mismatch, break and return false.
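For reference, here is roughly what the current check looks like (a minimal sketch in plain Scala; the method name and types are just for illustration):

```scala
// Sketch of the existing fail-fast check: distinct + sort both
// columns, then compare entry by entry, bailing on the first mismatch.
def sameDistinctValues(col1: Seq[String], col2: Seq[String]): Boolean = {
  val a = col1.distinct.sorted
  val b = col2.distinct.sorted
  // Different cardinalities can never match.
  if (a.length != b.length) return false
  // forall short-circuits, so the first mismatch stops the scan.
  a.iterator.zip(b.iterator).forall { case (x, y) => x == y }
}
```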
Since we are porting the system to Spark, is there a way Spark can help with this? I know about Spark's intersect and join methods, but I don't think they would be as well optimized, theoretically.
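For context, the straightforward DataFrame version I had in mind looks something like this (a rough sketch; the path and column names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ColumnSetCompare").getOrCreate()

// Placeholder path and column names.
val df = spark.read.parquet("/path/to/file.parquet")

// Give both sides the same column name so except() can line them up.
val a = df.select(df("col1").as("v")).distinct()
val b = df.select(df("col3").as("v")).distinct()

// except() is a set difference: the two columns hold the same
// distinct values exactly when both differences are empty.
// take(1) stops after finding a single differing row, and &&
// short-circuits, so the second difference is only computed
// when the first comes back empty.
val sameValues = a.except(b).take(1).isEmpty &&
                 b.except(a).take(1).isEmpty
```

As far as I understand, except() still shuffles both sides, which is why I doubt it beats the sorted compare.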
Any inputs are appreciated.