<< Click to Display Table of Contents >> ETL > 14.0 > Implementation Guide > Tutorial and implementation guide > Steps > Joining Cell merging |
Join operation uses a convention based approach. When defining a join, the only information required is:
•Which tables need to be joined
•Which columns are the key used to join them
All tables taking a part in the join must have all columns specified as the keys. For other columns, the following logic is applied:
•If a column exist in exactly one of joined tables, it is taken as-is
•If a column exists in two or more tables, the values in it are merged using a specific aggregation strategy. The default aggregation is to join the values with a pipe " | " separator. This may not always be the best choice, so other strategies are also available.
For example, given the following:
Table OS1
Computer |
OperatingSystem |
ValueFromOS1 |
---|---|---|
1 |
Windows |
A |
2 |
Linux |
B |
3 |
macOS |
C |
4 |
Android |
D |
Table OS2
Computer |
OperatingSystem |
ValueFromOS2 |
---|---|---|
2 |
BlackBerryOS |
AA |
3 |
Ubuntu |
BB |
4 |
Windows |
CC |
5 |
Windows |
DD |
and the following JSON definition
{
"steps": [
{
"id": 1,
"type": "join",
"name": "join demo",
"sources": [ "OS1", "OS2" ],
"on": [ "Computer" ],
"strategy": "left",
"target": "joined_os_left"
},
{
"id": 2,
"type": "join",
"name": "join demo",
"sources": [ "OS1", "OS2" ],
"on": [ "Computer" ],
"strategy": "inner",
"target": "joined_os_inner"
},
{
"id": 3,
"type": "join",
"name": "join demo",
"sources": [ "OS1", "OS2" ],
"on": [ "Computer" ],
"strategy": "outer",
"target": "joined_os_outer"
}
]
}
The following tables are returned:
Table joined_os_left
Computer |
OperatingSystem |
ValueFromOS1 |
ValueFromOS2 |
---|---|---|---|
1 |
Windows |
A |
<null> |
2 |
Linux |
B |
AA |
3 |
macOS |
C |
BB |
4 |
Android |
D |
CC |
Table joined_os_inner
Computer |
OperatingSystem |
ValueFromOS1 |
ValueFromOS2 |
---|---|---|---|
2 |
Linux |
B |
AA |
3 |
macOS |
C |
BB |
4 |
Android |
D |
CC |
Table joined_os_outer
Computer |
OperatingSystem |
ValueFromOS1 |
ValueFromOS2 |
---|---|---|---|
1 |
Windows |
A |
<null> |
2 |
Linux |
B |
AA |
3 |
macOS |
C |
BB |
4 |
Android |
D |
CC |
5 |
Windows |
<null> |
DD |
Remarks and points of interest:
•Table joined_os_outer has most of rows, as it contains all unique values from the column Computer present in both tables.
•Table joined_os_inner has the least of rows, because it contains only values present in both tables.
•No extra definition for "overlapping" columns was present in the JSON file, which means that the default COALESCE (first non-null) aggregation will be used. This is why the column OperatingSystem in all three output tables contains a single non-empty value. The order of coalescing a value should be inferred from the order of joined tables.
•Columns ValueFromOS1 and ValueFromOS2 are unique in all tables, and are taken as-is. Missing values are filled with <null>.