Tables priority

<< Click to Display Table of Contents >>

ETL > 14.0 > Implementation Guide > Tutorial and implementation guide > Steps > Joining 

Tables priority

For CONCAT and COALESCE columns, the default joining priority is used (the one that the user defined for the join step). In many cases, it may be required to change the priority on column basic. Consider the following example:

Data from system A has generally all values, but a its column A is not 100% accurate (approximation). The column B in this table is always correct.

Data from system B has less data, but a 100% correct value for that column A. It contains values in column B which on the other hand is unreliable.

We need to combine two data sets in which

oIf the same row in both tables has a value in Table 1 Column A, this value should win over Table 2 Column B.

oIf the same row has only value of column A in Table 2, we should use it as a fallback.

oIf the same row in both tables has a value in Table 2 Column B, this value should win over Table 1 Column B.

oIf the same row has only value of column B in Table 1, we should use it as a fallback.

 

So essentially, we define the following priority:

General joining priority: Table 1 > Table 2

Priority for values in column A: Table 1 > Table 2

Priority for values in column B: Table 2 > Table 1

 

To define this in JSON, use the following syntax:

{

    "id": 3,

    "type": "join",

    "name": "join demo",

    "sources": ["Table 1", "Table 2"],

    "on": ["Computer"],

    "strategy": "outer",

    "target": "joined_os_outer_coalesce",

    "conflicts":

    {

        "OperatingSystem": {

            "type": "coalesce",

            "priority": ["Table 2", "Table 1"]

        }

    }

}      

 

 

The syntax for the priority is the same as the definition of source tables.

clip0028

You do not have to define all tables in the Priority property. If you omit some of them, they will be treated as fallback anyway. For example, in the above snippet, if only Table 2 was provided in the priority, Table 1 and any other table will be always treated with a lower priority.
 

papercliper

Note:

If you omit the tables you lose the control of defining the priority between omitted tables. In this case, the default priority will be used for them. The tables that you specified still win.

 

Wildcard priority

You can use wildcards when defining the joining priority. For example, if you generally prefer all tables containing a keyword "final" over "tables that end with "raw", you could do the following:

    "conflicts":

    {

        "OperatingSystem": {

            "type": "coalesce",

            "priority": ["*final*", "*raw"]

        }

    }

You can also mix wildcard and non-wildcard strings. The priority order is not validated, should there be no column matching the priority order the next will be taken.

Example

Given the following tables containing approximated and actual age:

Table Approximation

Name

Age

Marcin

34

Ed

70

Table Observation

Name

Age

Marcin

35

Jeremy

20

And the following assumptions:

Join two tables by the name

Make sure that if the value from Observation and Approximation are in conflict, prefer the Observation as it is more accurate

For entries present in only a single to these two tables, take the entry as-is

We would write the following JSON:

    {

      "id": 1,

      "type""join",

      "name""Join with priority",

      "sources": [ "Approximation*""Observation*" ],

      "strategy""outer",

      "target""Age",

      "on""Name"

      "conflicts": {

        "Age": {

          "type""coalesce",

          "priority": [ "*Obs*""*Approx*" ]

        }

      }

    }

 

Note: The example introduces a concept of wildcard joining, in which we do not have to specify the full name, but rather only a mask to match against. This works on both table selection level and on priority level.

 

The result would be

 

Table Age

Name

Age

Ed

70

Jeremy

20

Marcin

35

Remarks and points of interest:

In the priority order [ "*Obs*", "*Approx*" ] we could also skip the last item (leaving only [ "*Obs*" ])  . Like mentioned before, columns which are not explicitly named in the priority list automatically receive the lowest priority.

In the row Marcin the value of Age is 35. This is the value coming from the Observation table, although the master joining first took the Approximation table.

Other values are taken from either of two tables, as they do not have the respective counter entries in the other table.