<< Click to Display Table of Contents >> ETL > 14.0 u1 > Implementation Guide > Tutorial and implementation guide > Steps > Joining Tables priority |
For CONCAT and COALESCE columns, the default joining priority is used (the one that the user defined for the join step). In many cases, it may be required to change the priority on column basic. Consider the following example:
•Data from system A has generally all values, but a its column A is not 100% accurate (approximation). The column B in this table is always correct.
•Data from system B has less data, but a 100% correct value for that column A. It contains values in column B which on the other hand is unreliable.
•We need to combine two data sets in which
oIf the same row in both tables has a value in Table 1 Column A, this value should win over Table 2 Column B.
oIf the same row has only value of column A in Table 2, we should use it as a fallback.
oIf the same row in both tables has a value in Table 2 Column B, this value should win over Table 1 Column B.
oIf the same row has only value of column B in Table 1, we should use it as a fallback.
So essentially, we define the following priority:
•General joining priority: Table 1 > Table 2
•Priority for values in column A: Table 1 > Table 2
•Priority for values in column B: Table 2 > Table 1
To define this in JSON, use the following syntax:
{
"id": 3,
"type": "join",
"name": "join demo",
"sources": ["Table 1", "Table 2"],
"on": ["Computer"],
"strategy": "outer",
"target": "joined_os_outer_coalesce",
"conflicts":
{
"OperatingSystem": {
"type": "coalesce",
"priority": ["Table 2", "Table 1"]
}
}
}
The syntax for the priority is the same as the definition of source tables.
You do not have to define all tables in the Priority property. If you omit some of them, they will be treated as fallback anyway. For example, in the above snippet, if only Table 2 was provided in the priority, Table 1 and any other table will be always treated with a lower priority.
Note: If you omit the tables you lose the control of defining the priority between omitted tables. In this case, the default priority will be used for them. The tables that you specified still win. |
You can use wildcards when defining the joining priority. For example, if you generally prefer all tables containing a keyword "final" over "tables that end with "raw", you could do the following:
"conflicts":
{
"OperatingSystem": {
"type": "coalesce",
"priority": ["*final*", "*raw"]
}
}
You can also mix wildcard and non-wildcard strings. The priority order is not validated, should there be no column matching the priority order the next will be taken.
Given the following tables containing approximated and actual age:
Table Approximation
Name |
Age |
---|---|
Marcin |
34 |
Ed |
70 |
Table Observation
Name |
Age |
---|---|
Marcin |
35 |
Jeremy |
20 |
And the following assumptions:
•Join two tables by the name
•Make sure that if the value from Observation and Approximation are in conflict, prefer the Observation as it is more accurate
•For entries present in only a single to these two tables, take the entry as-is
We would write the following JSON:
{
"id": 1,
"type": "join",
"name": "Join with priority",
"sources": [ "Approximation*", "Observation*" ],
"strategy": "outer",
"target": "Age",
"on": "Name",
"conflicts": {
"Age": {
"type": "coalesce",
"priority": [ "*Obs*", "*Approx*" ]
}
}
}
Note: The example introduces a concept of wildcard joining, in which we do not have to specify the full name, but rather only a mask to match against. This works on both table selection level and on priority level.
The result would be
Table Age
Name |
Age |
---|---|
Ed |
70 |
Jeremy |
20 |
Marcin |
35 |
Remarks and points of interest:
•In the priority order [ "*Obs*", "*Approx*" ] we could also skip the last item (leaving only [ "*Obs*" ]) . Like mentioned before, columns which are not explicitly named in the priority list automatically receive the lowest priority.
•In the row Marcin the value of Age is 35. This is the value coming from the Observation table, although the master joining first took the Approximation table.
•Other values are taken from either of two tables, as they do not have the respective counter entries in the other table.