Cell merging

<< Click to Display Table of Contents >>

ETL > 14.0 > Implementation Guide > Tutorial and implementation guide > Steps > Joining 

Cell merging

Join operation uses a convention based approach. When defining a join, the only information required is:

Which tables need to be joined

Which columns are the key used to join them

All tables taking a part in the join must have all columns specified as the keys. For other columns, the following logic is applied:

If a column exist in exactly one of joined tables, it is taken as-is

If a column exists in two or more tables, the values in it are merged using a specific aggregation strategy. The default aggregation is to join the values with a pipe " | " separator. This may not always be the best choice, so other strategies are also available.

For example, given the following:

 

Table OS1

Computer

OperatingSystem

ValueFromOS1

1

Windows

A

2

Linux

B

3

macOS

C

4

Android

D

Table OS2

Computer

OperatingSystem

ValueFromOS2

2

BlackBerryOS

AA

3

Ubuntu

BB

4

Windows

CC

5

Windows

DD

and the following JSON definition

{

  "steps": [

    {

      "id": 1,

      "type""join",

      "name""join demo",

      "sources": [ "OS1""OS2" ],

      "on": [ "Computer" ],

      "strategy""left",

      "target""joined_os_left"

    },

    {

      "id": 2,

      "type""join",

      "name""join demo",

      "sources": [ "OS1""OS2" ],

      "on": [ "Computer" ],

      "strategy""inner",

      "target""joined_os_inner"

    },

    {

      "id": 3,

      "type""join",

      "name""join demo",

      "sources": [ "OS1""OS2" ],

      "on": [ "Computer" ],

      "strategy""outer",

      "target""joined_os_outer"

    }

  ]

}

 

The following tables are returned:

Table joined_os_left

Computer

OperatingSystem

ValueFromOS1

ValueFromOS2

1

Windows

A

<null>

2

Linux

B

AA

3

macOS

C

BB

4

Android

D

CC

 

Table joined_os_inner

Computer

OperatingSystem

ValueFromOS1

ValueFromOS2

2

Linux

B

AA

3

macOS

C

BB

4

Android

D

CC

 

Table joined_os_outer

Computer

OperatingSystem

ValueFromOS1

ValueFromOS2

1

Windows

A

<null>

2

Linux

B

AA

3

macOS

C

BB

4

Android

D

CC

5

Windows

<null>

DD

 

Remarks and points of interest:

Table joined_os_outer has most of rows, as it contains all unique values from the column Computer present in both tables.

Table joined_os_inner has the least of rows, because it contains only values present in both tables.

No extra definition for "overlapping" columns was present in the JSON file, which means that the default COALESCE (first non-null) aggregation will be used. This is why the column OperatingSystem in all three output tables contains a single non-empty value. The order of coalescing a value should be inferred from the order of joined tables.

Columns ValueFromOS1 and ValueFromOS2 are unique in all tables, and are taken as-is. Missing values are filled with <null>.