Identifying duplicate database records


Data duplication (where more than one record refers to the same supplier or client or person) is a common problem in large databases.

It is often very difficult to know whether you have duplicate entries for your clients or suppliers in your databases. This kind of duplication means that it is difficult or impossible to assess the financial impact of your relationship with the client or a supplier.

It requires specialized software to analyse duplicate data as it is often very hard to find manually.

How it works

Working with your staff, we can determine the best way to match duplicate or similar records. This involves some analysis of your data and how you use it.

The first step is to classify fields into four categories. This is so that when the software compares each record in the database to every other record, it has a set of rules that tell if how to compare the data in that record.

1. Identical data

The first category is for fields that constitute a match if the data in them is identical, (or identical after some amount of cleaning). For example, in a set of company data, if the contact email address or telephone number is the same between two database records then that alone determines that we've found a likely duplicate record.

2. Fields that tell us when we don't have a match

The second category is for fields that tell us that we don't have a match if the data in them differs. At the same time, if the data in this sort of field is the same for two records, that fact doesn't have much value for us. An example of this is the state in an address: it does not mean much if two records have the same state, but if they have a different state then they are probably not a match for our purposes. This obviously depends on your data. The importance of this category is that it helps reduce the amount of work that the deduplication software has to do.

3. Weighted data

The third category is for fields that make a match if they are close enough. For example, a street address might be written in many different ways but if we get a high enough score from our matching algorithm we might consider it a match.

We also determine what importance (weighting) to apply to the different category three fields. For example, company name would have a high weighting, but the country may not rate highly at all.

4. Fields to ignore

The fourth category is for data that does not affect whether a record matches or not. This is generally data such as notes, dates, discount rates etc.

Continue reading about deduplication (exporting data, results and examples).