<< Click to Display Table of Contents >> ETL > 14.0 u1 > Implementation Guide > Tutorial and implementation guide > Steps Advanced deduplicating |
Deduplicating is a process of taking a table as an input, group the similar records by one or more columns, and then decide on each set how to proceed with the records.
Deduplication step has the following basic syntax:
{
"id": 11,
"name": "Deduplicate table Duplicates, take random",
"type": "deduplicate",
"source": "Duplicates",
"target": "Normalized_Duplicates_Random",
"by": [ "cn", "dn" ],
"strategy": "random"
}
In this sample, rows will be grouped by the same values in the cn and dn column. For each group, a random row will be picked and the rest will be skipped.
Selecting a random row may sound a bit weird, so the engine supports further options:
Strategy type |
Description |
None |
If a duplicate is detected, all duplicated rows will be removed. |
Random |
Take random row (this is not reproducible between sessions). |
Any |
Take first row (this is reproducible between sessions). |
All |
Take all values (no deduplication) |
MaxValue |
Take a row with maximum value in a specified column (requires extra config, see below) |
MinValue |
Take a row with minimum value in a specified column (requires extra config, see below) |
Source is by default optional, meaning that the table does not have to be existing. If the table is missing, the step will not be executed and its target table will not be written. There is a way to define that the source is required, in which case in case of a missing source table the step will fail and report an error. More information about setting up required steps can be found in the following chapter: Optional and required tables.