Merge Purge Data

Merge Data

Merge Purge is a term that comes from the mailing industry, but can also applied to data migration (AKA: ETL) projects as well. Merge and purge is a computer operation combining multiple mailing lists or data tables/files, eliminates duplicate records and produces a single list as the result. When merging, a file/table structure that can effectively house all the merged data while still maintaining (or improving) their state of normalization has to be developed. Put simply, with normalized data the correct values are contained in their assigned columns (fields) and fall within predetermined ranges and consistent granularity. For example, the output would have only company names stored in the [CompanyName] column, only first names in the [FirstName] column; and so forth. If there is no company name or first name for that record the respective positions for those data are set to null (or blank).

With mailing lists, merging is somewhat easier as the number of output fields are typically fewer that those in data migrations. Merge/purge falls somewhere between the "E" and the "T" in "ETL" (Extract, Transform, Load), especially in large or complex datamarts. Merging is a critical step in any data migration or mailing operations that draw their data from multiple sources. If the resulting output list or table has been merged correctly, all subsequent processing necessary to complete the project will go much, much more smoothly. This especially true when it's time to purge the data.

Purge Data

The term purging is also known as: deduping, de-duping. The goal is to remove all duplicate records from the output data. This is relatively easy when the records are so consistent that removing exact dupes (where every character matches every character and character position in another record (or part of a record; see matchcode) is enough. If the list is to be properly purged, then records representing the same entities, but written differently, have to be accounted for as well. For example, "Dick & Margaret Smith" at a given address may also be represented as "Richard and Peg Smith" at the same address. This is called a "near" duplicate and is difficult to identify during larger merge/purge operations without specialized software and some experience. Even if your project centers around a single table/file, and you don't need full merge/purge processing, you willstill have to perform a complete purge operation after any updates (e.g. NCOA) are madfe to your data to keep them trim, accurate and free of duplication.