Overriding cell merging strategy

<< Click to Display Table of Contents >>

ETL > 12.6 > Implementation Guide > Tutorial and implementation guide > Steps > Joining 

Overriding cell merging strategy

As already mentioned, the default strategy of merging/aggregation of overlapping columns uses concatenation with a pipe ( | ) character.

There are two ways of overriding the strategy:

For all columns in a step

Specific types for selected columns.

The order how ETL resolves defined aggregations is:

1.Column-specific aggregation

2.Step-scoped aggregation

3.Default aggregation (concatenate values using pipe as separatator).

 

Step merging strategy

papercliper

Note:

This aggregation is available in ETL module version 1.1.281 and newer.

 

The default aggregation used for all columns in the given steps can be defined by defining the value of a property called defaultAggregation. Below is a sample code to demonstrate:

{

    "steps": [

        {

            "id": 1,

            "type": "join",

            "name": "join demo",

            "sources": ["OS1", "OS2"],

            "on": ["Computer"],

            "strategy": "outer",

            "target": "joined_os_outer_max",

            "defaultAggregation": "coalesce"

        },

    ]

}

 

The value can be a string or an object. For more information about supported aggregations, see the following chapter: Defining aggregations.

 

Column-specific merging strategy

It is possible to specify custom aggregation (similar to what the complex column mapping does). In order to do that, specify a special section which will define how to handle conflicts. The value of that property (conflicts) is an object where each property is the name of the column to define, and the value is the aggregation method. The value can be either a string or an object. For more information about supported aggregations and the ways of resolving conflicts, see the following chapter: Defining aggregations.

The sample below shows how to get the max value from the OperatingSystem column (1), how to concatenate all values using custom separator (2) and how to coalesce columns by selecting the first non-empty value (3):

{

    "steps": [

        {

            "id": 1,

            "type": "join",

            "name": "join demo",

            "sources": ["OS1", "OS2"],

            "on": ["Computer"],

            "strategy": "outer",

            "target": "joined_os_outer_max",

            "conflicts": {

                "OperatingSystem": "max"

            }

        },

        {

            "id": 2,

            "type": "join",

            "name": "join demo",

            "sources": ["OS1", "OS2"],

            "on": ["Computer"],

            "strategy": "outer",

            "target": "joined_os_outer_concat",

            "conflicts":

            {

                "OperatingSystem": {

                    "type": "concat",

                    "separator": " or "

                }

            }

        },

        {

            "id": 3,

            "type": "join",

            "name": "join demo",

            "sources": ["OS1", "OS2"],

            "on": ["Computer"],

            "strategy": "outer",

            "target": "joined_os_outer_coalesce",

            "conflicts":

            {

                "OperatingSystem": "coalesce"

            }

        }        

    ]

}

 

Like already mentioned, the way of resolving a conflict is defined via a mapping of a column name and a matching value (string or an object) representing the aggregation. In the example above, both syntax variants are shown: for column step 2, a concatenation with a custom separator is used (object notation), Steps 1 and 3 use a simple string notation.

Using the same input tables as in the chapter Cell merging, the following tables are written with the above JSON (the yellow color highlights important changes and differences):

 

Table joined_os_outer_max

Computer

OperatingSystem

ValueFromOS1

ValueFromOS2

1

Windows

A

<null>

2

Linux

B

AA

3

Ubuntu

C

BB

4

Windows

D

CC

5

Windows

<null>

DD

 

Table joined_os_outer_concat

Computer

OperatingSystem

ValueFromOS1

ValueFromOS2

1

Windows

A

<null>

2

Linux or BlackBerryOS

B

AA

3

macOS or Ubuntu

C

BB

4

Android or Windows

D

CC

5

Windows

<null>

DD

 

Table joined_os_outer_coalesce

Computer

OperatingSystem

ValueFromOS1

ValueFromOS2

1

Windows

A

<null>

2

Linux 

B

AA

3

macOS

C

BB

4

Android

D

CC

5

Windows

<null>

DD

 

Remarks and points of interest:

Table joined_os_outer has most of rows, as it contains all unique values from the column Computer present in both tables.

Table joined_os_inner has the least of rows, because it contains only values present in both tables.

No extra definition for "overlapping" columns was present in the JSON file, which means that the default coalesce (first non-null) will be used. This is why the column OperatingSystem in all three output tables contains a single non-empty values from tables ordered by the master joining order.

Columns ValueFromOS1 and ValueFromOS2 are unique in all tables, and are taken as-is. Missing values are filled with <null>.

 

 

Multiple conflict handling strategies

You can define many conflict resolving strategies for each columns of your interest. The columns that you don’t define will use the standard joining strategy (concatenation with pipe character).

        {

            "id": 3,

            "type": "join",

            "name": "join demo",

            "sources": ["OS1", "OS2"],

            "on": ["Computer"],

            "strategy": "outer",

            "target": "joined_os_outer_coalesce",

            "conflicts":

            {

                "OperatingSystem": "coalesce", // take the first not-empty operating system

                "InventoryDate": "max" // take the maximum date

            }

        }        

 

Priorities when joining the cell values

Unless otherwise specified, the columns will be aggregated in the order inferred from the list of tables (for example OS1 then OS2, as in the JSON above). You can also define a custom ordering per-column - the topic is covered in the following chapter: Tables priority