Tutorial and implementation guide > Steps

Deduplicating is a process of taking a table as an input, group the similar records by one or more columns, and then decide on each set how to proceed with the records.

Deduplication step has the following basic syntax:

{

"id": 11,

"name": "Deduplicate table Duplicates, take random",

"type": "deduplicate",

"source": "Duplicates",

"target": "Normalized_Duplicates_Random",

"by": [ "cn", "dn" ],

"strategy": "random"

}

In this sample, rows will be grouped by the same values in the cn and dn column. For each group, a random row will be picked and the rest will be skipped.

Selecting a random row may sound a bit weird, so the engine supports further options:

Strategy type	Description
None	If a duplicate is detected, all duplicated rows will be removed.
Random	Take random row (this is not reproducible between sessions).
Any	Take first row (this is reproducible between sessions).
All	Take all values (no deduplication)
MaxValue	Take a row with maximum value in a specified column (requires extra config, see below)
MinValue	Take a row with minimum value in a specified column (requires extra config, see below)

Source is by default optional, meaning that the table does not have to be existing. If the table is missing, the step will not be executed and its target table will not be written. There is a way to define that the source is required, in which case in case of a missing source table the step will fail and report an error. More information about setting up required steps can be found in the following chapter: Optional and required tables.

Advanced deduplicating

Advanced deduplicating