Advanced deduplicating

<< Click to Display Table of Contents >>

ETL > 12.6 > Implementation Guide > Tutorial and implementation guide > Steps 

Advanced deduplicating

Deduplicating is a process of taking a table as an input, group the similar records by one or more columns, and then decide on each set how to proceed with the records.

Deduplication step has the following basic syntax:

 

{

    "id": 11,

    "name": "Deduplicate table Duplicates, take random",

    "type": "deduplicate",

    "source": "Duplicates",

    "target": "Normalized_Duplicates_Random",

    "by": [ "cn", "dn" ],

    "strategy": "random"

}

 

 

clip0037

In this sample, rows will be grouped by the same values in the cn and dn column. For each group, a random row will be picked and the rest will be skipped.

Selecting a random row may sound a bit weird, so the engine supports further options:

Strategy type

Description

None

If a duplicate is detected, all duplicated rows will be removed.

Random

Take random row (this is not reproducible between sessions).

Any

Take first row (this is reproducible between sessions).

All

Take all values (no deduplication)

MaxValue

Take a row with maximum value in a specified column (requires extra config, see below)

MinValue

Take a row with minimum value in a specified column (requires extra config, see below)

Source is by default optional, meaning that the table does not have to be existing. If the table is missing, the step will not be executed and its target table will not be written. There is a way to define that the source is required, in which case in case of a missing source table the step will fail and report an error. More information about setting up required steps can be found in the following chapter: Optional and required tables.