Grouping and deduplicating

<< Click to Display Table of Contents >>

ETL > 12.6 > Implementation Guide > Tutorial and implementation guide > Steps 

Grouping and deduplicating

Grouping is a process which covers one of the following use cases:

Having a single table, it writes the output table that has the exact amount of rows, but with extra meta-columns containing a "group identifier". Two or more elements considered to be the same have always the same value in the group column. The second extra column being added is the count of rows in the given group.

Having a single table, it writes the output table that contains aggregated (de-duplicated / distinct) rows, where the values in each non-grouped column is aggregated with a specified function (maximum, minimum, average, concatenated value, coalesced value). An extra column with number of grouped rows is added.

The following snippet shows a functional example required to group rows and enter the information about the group:

   {

    "id": 1,

    "type": "group",

    "name": "Example of grouping (deduplicate)",

    "source": "Users",

    "by": [ "Name", "E-Mail" ],

    "target": "Users_Grouped_Deduplicated"

   }

You can opt-in for de-duplication by using the "action" object:

{

    "id": 2,

    "type": "group",

    "name": "Example of grouping (deduplicate)",

    "source": "Users",

    "by": [ "Name", "E-Mail" ],

    "target": "Users_Grouped_Deduplicated",

    "action": {

      "type""deduplicate"

   

   }

 

At minimum the following properties are required:

ID (must be unique)

Type (must be set to "group")

Source (must be one of the following):

oA string representing the table name

oAn integer representing the source as another step

oAn object with property "table" set to the name of the source table

oAn object with property "step" set to the ID of the source step

Source is by default optional, meaning that the table does not have to be existing. If the table is missing, the step will not be executed and its target table will not be written. There is a way to define that the source is required, in which case in case of a missing source table the step will fail and report an error. More information about setting up required steps can be found in the following chapter: Optional and required tables.

 

Following parameters are optional:

by
This defines a single column or a list of columns to be used as grouping keys. Equivalent to SQL statement Select * FROM <table> GROUP BY <columns>.
If you omit this parameter or set it to an empty list, then all columns are used for grouping.

target
If omitted, the output table is temporary and does not get saved during the LOAD phase.

action
The action to execute. This can be one of the following:

oIf the value is a string, then the action of a given type is used. Supported values are deduplicate and recognize. The values are case-insensitive.

oIf the value is an object, then its property type is used to determine the action. The value of parameter type should be either deduplicate or recognize. The values are case-insensitive. There are some extra properties available when using the object syntax.

oIf the action is omitted, the recognize action is used as the default.

 

Recognize action

This type of action is used to group and recognize grouped values, but without removing any rows yet. Instead, the ETL engine compares the values in given columns, groups matching rows and for each row writes additional two columns:

GROUP_KEY: Containing the locally-unique identifier of the group (integer-based, starting from 1)

GROUP_COUNT: Containing the number of rows within the same group.

You can customize the naming of both properties by setting the properties groupKeyColumn and groupCountColumn respectively.

Deduplicate action

This type of action is used to group and consolidate grouped rows. The non-grouped columns are aggregated using a specific algorithm. ETL engine writes additional extra column for each row:

GROUP_COUNT: Containing the number of rows within the same group.

The default setting of this action use coalescing to produce aggregated value (getting the first non-null value). You can customize the behavior of aggregation by setting the object columns, where key is the name of the column to aggregate, and the value is the aggregation method. The syntax for aggregation is similar to the definition for the join step.

Example #1

Given the following table Users

Name

LastSeen

Device

E-Mail

Marcin

2020-01-02 00:00:00.000

Android

marcin@contoso.com

Marcin

2020-01-03 00:00:00.000

Windows

marcin@contoso.com

Marcin

2020-01-01 00:00:00.000        

Apple

<null>

Denis

2020-03-03 00:00:20.000        

Windows

denis@contoso.com

Denis

2021-02-02 00:10:00.000        

Android

<null>

Andreas

2021-02-02 00:20:00.000        

Windows

andreas@contoso.com

and the following JSON file:

   {

    "id": 1,

    "type": "group",

    "name": "Example of grouping (only marking)",

    "source": "Users",

    "by": [ "Name", "E-Mail" ],

    "target": "Users_Grouped_Mark",

    "action": {

      "type": "recognize"

     }

   },

   {

    "id": 2,

    "type": "group",

    "name": "Example of grouping (deduplicate)",

    "source": "Users",

    "by": [ "Name", "E-Mail" ],

    "target": "Users_Grouped_Deduplicated",

    "action": {

      "type": "deduplicate"

     }

   }

 

The following two tables are created:

Table Users_Grouped_Mark:

Name

E-Mail

LastSeen

Device

GROUP_KEY

GROUP_COUNT

Marcin

marcin@contoso.com

2020-01-02 00:00:00.000

Android

5

2

Marcin

marcin@contoso.com

2020-01-03 00:00:00.000

Windows

5

2

Marcin

<null>

2020-01-01 00:00:00.000

Apple

4

1

Denis

denis@contoso.com

2020-03-03 00:00:20.000

Windows

3

1

Denis

<null>

2021-02-02 00:10:00.000

Android

2

1

Andreas

andreas@contoso.com

2021-02-02 00:20:00.000

Windows

1

1

The other table Users_Grouped_Deduplicated has the following rows:

Name

E-Mail

LastSeen

Device

GROUP_COUNT

Marcin

marcin@contoso.com

2020-01-02 00:00:00.000

Android

2

Marcin

NULL

2020-01-03 00:00:00.000

Windows

1

Denis

<null>

2020-01-01 00:00:00.000

Apple

1

Denis

denis@contoso.com

2020-03-03 00:00:20.000

Windows

1

Andreas

andreas@contoso.com

2021-02-02 00:20:00.000

Windows

1

Example #2

With the same input Users  as in the first example and the following JSON (yellow marks the difference between Example #1 and Example #2):

   {

    "id": 3,

    "type": "group",

    "name": "Example of grouping (only marking)",

    "source": "Users",

    "by": [ "Name", "E-Mail" ],

    "target": "Users_Grouped_Mark2",

    "action": {

      "type": "recognize",

      "groupKeyColumn""ETL-GROUP",

        "groupCountColumn""ETL-COUNT"

     }

   },

   {

    "id": 4,

    "type": "group",

    "name": "Example of grouping (deduplicate)",

    "source": "Users",

    "by": [ "Name", "E-Mail" ],

    "target": "Users_Grouped_Deduplicated2",

    "action": {

      "type": "deduplicate",

      "groupKeyColumn""ETL-GROUP",

        "groupCountColumn""ETL-COUNT",

        "columns": {

          "Device""concat",

          "LastSeen""max"

       }

     }

   }

 

The following table are created:

Table Users_Grouped_Mark2:

Name

E-Mail

LastSeen

Device

ETL-GROUP

ETL-COUNT

Marcin

marcin@contoso.com

2020-01-02 00:00:00.000

Android

5

2

Marcin

marcin@contoso.com

2020-01-03 00:00:00.000

Windows

5

2

Marcin

<null>

2020-01-01 00:00:00.000

Apple

4

1

Denis

denis@contoso.com

2020-03-03 00:00:20.000

Windows

3

1

Denis

<null>

2021-02-02 00:10:00.000

Android

2

1

Andreas

andreas@contoso.com

2021-02-02 00:20:00.000

Windows

1

1

The other table Users_Grouped_Deduplicated2 has the following rows:

Name

E-Mail

LastSeen

Device

ETL-COUNT

Marcin

marcin@contoso.com

2020-01-03 00:00:00.000

Android | Windows

2

Marcin

NULL

2020-01-03 00:00:00.000

Windows

1

Denis

<null>

2020-01-01 00:00:00.000

Apple

1

Denis

denis@contoso.com

2020-03-03 00:00:20.000

Windows

1

Andreas

andreas@contoso.com

2021-02-02 00:20:00.000

Windows

1

Note the differences between both examples:

For the first entry, without specifying any aggregation, the first non-null empty was taken, which resulted in Android in the first example. In the second example, the column is built using Concat, which means that the value in the Device column joins all grouped values (with | as a separator).

In the same row, the last seen date is different. In the first example 2020-01-02 00:00:00.000 was taken but in the second one where the MAX aggregation was used, the date is set to 2020-01-03 00:00:00.000 (the maximum of two values in the group).

The extra columns ETL-COUNT and ETL-GROUP are named after the values that were defined in the second example, but omitted from the first one. In the first one, their names are defaulted to GROUP_KEY and GROUP_COUNT.

Default aggregation of duplicated values

You can use JSON property defaultAggregation on the action object to define the required aggregation, should there be no other defined on a column-base.