<< Click to Display Table of Contents >> ETL > 14.0 > Implementation Guide > Tutorial and implementation guide > Steps > Joining Overriding cell merging strategy |
As already mentioned, the default strategy of merging/aggregation of overlapping columns uses concatenation with a pipe ( | ) character.
There are two ways of overriding the strategy:
•For all columns in a step
•Specific types for selected columns.
The order how ETL resolves defined aggregations is:
1.Column-specific aggregation
2.Step-scoped aggregation
3.Default aggregation (concatenate values using pipe as separatator).
Note: This aggregation is available in ETL module version 1.1.281 and newer. |
The default aggregation used for all columns in the given steps can be defined by defining the value of a property called defaultAggregation. Below is a sample code to demonstrate:
{
"steps": [
{
"id": 1,
"type": "join",
"name": "join demo",
"sources": ["OS1", "OS2"],
"on": ["Computer"],
"strategy": "outer",
"target": "joined_os_outer_max",
"defaultAggregation": "coalesce"
},
]
}
The value can be a string or an object. For more information about supported aggregations, see the following chapter: Defining aggregations.
It is possible to specify custom aggregation (similar to what the complex column mapping does). In order to do that, specify a special section which will define how to handle conflicts. The value of that property (conflicts) is an object where each property is the name of the column to define, and the value is the aggregation method. The value can be either a string or an object. For more information about supported aggregations and the ways of resolving conflicts, see the following chapter: Defining aggregations.
The sample below shows how to get the max value from the OperatingSystem column (1), how to concatenate all values using custom separator (2) and how to coalesce columns by selecting the first non-empty value (3):
{
"steps": [
{
"id": 1,
"type": "join",
"name": "join demo",
"sources": ["OS1", "OS2"],
"on": ["Computer"],
"strategy": "outer",
"target": "joined_os_outer_max",
"conflicts": {
"OperatingSystem": "max"
}
},
{
"id": 2,
"type": "join",
"name": "join demo",
"sources": ["OS1", "OS2"],
"on": ["Computer"],
"strategy": "outer",
"target": "joined_os_outer_concat",
"conflicts":
{
"OperatingSystem": {
"type": "concat",
"separator": " or "
}
}
},
{
"id": 3,
"type": "join",
"name": "join demo",
"sources": ["OS1", "OS2"],
"on": ["Computer"],
"strategy": "outer",
"target": "joined_os_outer_coalesce",
"conflicts":
{
"OperatingSystem": "coalesce"
}
}
]
}
Like already mentioned, the way of resolving a conflict is defined via a mapping of a column name and a matching value (string or an object) representing the aggregation. In the example above, both syntax variants are shown: for column step 2, a concatenation with a custom separator is used (object notation), Steps 1 and 3 use a simple string notation.
Using the same input tables as in the chapter Cell merging, the following tables are written with the above JSON (the yellow color highlights important changes and differences):
Table joined_os_outer_max
Computer |
OperatingSystem |
ValueFromOS1 |
ValueFromOS2 |
---|---|---|---|
1 |
Windows |
A |
<null> |
2 |
Linux |
B |
AA |
3 |
Ubuntu |
C |
BB |
4 |
Windows |
D |
CC |
5 |
Windows |
<null> |
DD |
Table joined_os_outer_concat
Computer |
OperatingSystem |
ValueFromOS1 |
ValueFromOS2 |
---|---|---|---|
1 |
Windows |
A |
<null> |
2 |
Linux or BlackBerryOS |
B |
AA |
3 |
macOS or Ubuntu |
C |
BB |
4 |
Android or Windows |
D |
CC |
5 |
Windows |
<null> |
DD |
Table joined_os_outer_coalesce
Computer |
OperatingSystem |
ValueFromOS1 |
ValueFromOS2 |
---|---|---|---|
1 |
Windows |
A |
<null> |
2 |
Linux |
B |
AA |
3 |
macOS |
C |
BB |
4 |
Android |
D |
CC |
5 |
Windows |
<null> |
DD |
Remarks and points of interest:
•Table joined_os_outer has most of rows, as it contains all unique values from the column Computer present in both tables.
•Table joined_os_inner has the least of rows, because it contains only values present in both tables.
•No extra definition for "overlapping" columns was present in the JSON file, which means that the default coalesce (first non-null) will be used. This is why the column OperatingSystem in all three output tables contains a single non-empty values from tables ordered by the master joining order.
•Columns ValueFromOS1 and ValueFromOS2 are unique in all tables, and are taken as-is. Missing values are filled with <null>.
You can define many conflict resolving strategies for each columns of your interest. The columns that you don’t define will use the standard joining strategy (concatenation with pipe character).
{
"id": 3,
"type": "join",
"name": "join demo",
"sources": ["OS1", "OS2"],
"on": ["Computer"],
"strategy": "outer",
"target": "joined_os_outer_coalesce",
"conflicts":
{
"OperatingSystem": "coalesce", // take the first not-empty operating system
"InventoryDate": "max" // take the maximum date
}
}
Unless otherwise specified, the columns will be aggregated in the order inferred from the list of tables (for example OS1 then OS2, as in the JSON above). You can also define a custom ordering per-column - the topic is covered in the following chapter: Tables priority