Let's assume we have a large CSV/Excel file containing a large number of records with the following fields:
1. Email 2. First Name 3. Last Name 4. Phone Number, etc.
Among these records, we need to identify duplicates based on matching the Email, First Name, and Last Name fields.
For the duplicate calculation, some custom rules are defined that produce a score for each individual record.
For example:
1. If the email is an exact match, the score is 100; otherwise it is 0.
2. For First Name, Last Name, etc., the score is derived from the edit distance, scaled to 0-100 (e.g. (1 - distance / max length) * 100, which gives 75 for "ABCD" vs. "ABC"); see the sketch after this list.
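To make the rules concrete, here is a minimal sketch of the scoring logic in Scala. The similarity formula and the 0-100 scaling are assumptions reconstructed from the numbers in the example below, and the names (`DuplicateScoring`, `matchPercent`, etc.) are hypothetical:

```scala
// Minimal sketch of the scoring rules. The 0-100 scaling of the
// edit-distance similarity is an assumption based on the worked
// example below (ABCD vs ABC -> 75). Serializable so it can later
// be called from inside a Spark UDF.
object DuplicateScoring extends Serializable {

  // Classic Levenshtein edit distance via dynamic programming.
  def editDistance(a: String, b: String): Int = {
    val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
      if (i == 0) j else if (j == 0) i else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      dp(i)(j) = math.min(math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1),
                          dp(i - 1)(j - 1) + cost)
    }
    dp(a.length)(b.length)
  }

  // Edit-distance similarity scaled to 0..100:
  // (1 - distance / maxLength) * 100, so "ABCD" vs "ABC" scores 75.
  def nameScore(a: String, b: String): Double = {
    val maxLen = math.max(a.length, b.length)
    if (maxLen == 0) 100.0
    else (1.0 - editDistance(a, b).toDouble / maxLen) * 100.0
  }

  // Exact email match scores 100, anything else scores 0.
  def emailScore(a: String, b: String): Double =
    if (a.equalsIgnoreCase(b)) 100.0 else 0.0

  // Total score expressed as a percentage of the maximum (300 here).
  def matchPercent(e1: String, f1: String, l1: String,
                   e2: String, f2: String, l2: String): Double =
    (emailScore(e1, e2) + nameScore(f1, f2) + nameScore(l1, l2)) / 300.0 * 100.0
}
```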
For example, let's assume the search parameters are the following:
Email: email@example.com, First Name: ABCD, Last Name: EFGH
The rows/records are:
1. Email: firstname.lastname@example.org, First Name: ABC, Last Name: EFGH
2. Email: email@example.com, First Name: ABC, Last Name: EFGH
For record 1, score = 0 (email mismatch) + 75 (first name) + 100 (last name) = 175, i.e. 58.3%.
For record 2, score = 100 (exact email match) + 75 (first name) + 100 (last name) = 275, i.e. 91.7%.
The duplicate-detection threshold is 75%, so record 2 is a duplicate and record 1 is not. This is fairly simple to implement when we have input parameters and want to use them to find the duplicates in a file.
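Using the hypothetical sketch from above, the worked example can be reproduced like this:

```scala
// Record 1: email differs -> (0 + 75 + 100) / 300 * 100 ≈ 58.3%
val p1 = DuplicateScoring.matchPercent(
  "email@example.com", "ABCD", "EFGH",               // search parameters
  "firstname.lastname@example.org", "ABC", "EFGH")   // record 1

// Record 2: email matches -> (100 + 75 + 100) / 300 * 100 ≈ 91.7%
val p2 = DuplicateScoring.matchPercent(
  "email@example.com", "ABCD", "EFGH",               // search parameters
  "email@example.com", "ABC", "EFGH")                // record 2
```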
But how do we apply this logic when all the records are in one file and we need to find the duplicates among all of them?
Here no input parameter is given, and each record has to be compared with every other record to compute its relevance score.
How can this be achieved in Apache Spark?
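The most naive approach I can think of is a pairwise self-join. Here is a rough sketch, assuming the file has been loaded into a DataFrame with columns email, firstName and lastName (column names are my assumption), and reusing the hypothetical DuplicateScoring object from above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("dedup").getOrCreate()
import spark.implicits._

// Load the records and give each row a unique id (path is illustrative).
val df = spark.read.option("header", "true")
  .csv("records.csv")
  .withColumn("id", monotonically_increasing_id())

// Wrap the scoring sketch in a UDF.
val scoreUdf = udf((e1: String, f1: String, l1: String,
                    e2: String, f2: String, l2: String) =>
  DuplicateScoring.matchPercent(e1, f1, l1, e2, f2, l2))

// Naive O(n^2) all-pairs comparison; a.id < b.id keeps each pair once.
val pairs = df.as("a").crossJoin(df.as("b"))
  .where($"a.id" < $"b.id")
  .withColumn("score", scoreUdf(
    $"a.email", $"a.firstName", $"a.lastName",
    $"b.email", $"b.firstName", $"b.lastName"))

// Pairs at or above the 75% threshold are flagged as duplicates.
val duplicates = pairs.where($"score" >= 75.0)
duplicates.show()
```

The crossJoin obviously explodes quadratically with the number of records, so for a really large file some blocking key (e.g. grouping candidate pairs by the first letters of the last name) would presumably be needed to prune the comparisons. Is there a better or more standard way to structure this in Spark?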