How do we find out whether two DataFrames are the same?

spark · partitions
Question by rohitatiit · Sep 09, 2016 at 07:43 AM ·

I am reading a dataset from a Parquet file of about 5 GB. I need to check whether two columns in the file contain the same values. The entries in the two columns may appear in a different order.

Example:

col1    col2    col3
login1    ram    login2
login2    ram    login2
login3    ram    login1
login1    ram    login3

Our existing approach is fail-fast: for both columns, compute the distinct values, sort them, and then compare the entries from the top. On the first mismatch, stop and return false.
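To make the fail-fast approach concrete, here is a minimal plain-Python sketch (not Spark code) run on the example columns. The function name `same_distinct_values` and the in-memory lists are assumptions for illustration only; the real data would come from the Parquet file.

```python
from itertools import zip_longest

# Sentinel used to pad the shorter list, so a length mismatch
# also shows up as a mismatch during the pairwise comparison.
_MISSING = object()


def same_distinct_values(left, right):
    """Return True if both columns hold the same set of distinct values."""
    # Steps 1 and 2: distinct, then sort.
    left_sorted = sorted(set(left))
    right_sorted = sorted(set(right))
    # Step 3: compare from the top and stop at the first mismatch.
    for a, b in zip_longest(left_sorted, right_sorted, fillvalue=_MISSING):
        if a != b:
            return False
    return True


# Columns from the example table in the question.
col1 = ["login1", "login2", "login3", "login1"]
col2 = ["ram", "ram", "ram", "ram"]
col3 = ["login2", "login2", "login1", "login3"]

print(same_distinct_values(col1, col3))
print(same_distinct_values(col1, col2))
```

On the example, col1 and col3 hold the same distinct logins in a different order, while col1 and col2 differ at the first sorted entry.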

Since we are porting the system to Spark, is there a way Spark can help with this?

I know about Spark's intersect and join methods, but I don't think they would be as efficient, at least in theory.

Any input is appreciated.
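For comparison with the set-style operations mentioned above, the following plain-Python sketch (again not Spark code) checks the two columns with a difference in both directions instead of a sorted scan. The name `column_difference` is hypothetical. In Spark, a similar check could probably be built from `distinct` and `subtract` on single-column DataFrames, but that mapping is an assumption and has not been benchmarked against the fail-fast approach.

```python
def column_difference(left, right):
    """Return the values found in only one of the two columns.

    The result is a pair (only_left, only_right). The columns hold
    the same distinct values exactly when both sets are empty.
    """
    left_values = set(left)
    right_values = set(right)
    only_left = left_values - right_values
    only_right = right_values - left_values
    return only_left, only_right


# Columns from the example table in the question.
col1 = ["login1", "login2", "login3", "login1"]
col2 = ["ram", "ram", "ram", "ram"]
col3 = ["login2", "login2", "login1", "login3"]

only_left, only_right = column_difference(col1, col3)
print(not only_left and not only_right)

only_left, only_right = column_difference(col1, col2)
print(sorted(only_left), sorted(only_right))
```

Unlike the fail-fast scan, this version always processes both columns in full, but it also reports which values are missing on each side.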

