<< Click to Display Table of Contents >> RayVentory Data Hub > 12.6 u5 > Administration and User Guide > Transformations > Creating and Editing Transformation Steps > Source Join |
Joining is an operation which takes two or more tables and combines them to a single one. The combination of rows is performed using a special column or set of columns called joining keys. They should be locally unique.
Since two tables may but do not have to have the same rows and keys, conflicts may arise:
•The left table may have some unmatched rows in the right table.
•The right table may have some unmatched rows in the left table.
There are three basic strategies that can be chosen:
•Outer joining
This will combine all matching rows and it will take all unmatched rows as they are, by default using NULL values where values are unmatched.
For example:
A list of devices is scanned by system A and system B may have a lot of entries in common. Some devices may be present only in system A and some devices are only in system B. Outer join will ensure that all devices are present in the target table, even the unmatched devices from system A and the unmatched devices from system B.
•Left joining
This will treat the left column as a master table and write all rows which were matched between left and right and any unmatched rows from the left table. Unmatched rows from the right column will be discarded.
For example:
The left table contains a list of customers and the right table contains a list of addresses, than the result table will include all the customers but only the address entries for the customers that are listed in the left table. Unmatched addresses will be discarded.
•Inner joining
This will write only rows which exist in both tables. Unmatched rows from the left table and the right table will be ignored.
For example:
The left table contains names and the right table surnames. The output table only has names and surnames. All entries where there is only a name or only a surname will be discarded.
The Join step requires a bit of extra information about the key used to join (ID). This key must be present in all joined tables otherwise an exception will be thrown.
It is also possible to join more tables at once (within the same strategy).
Joined tables are by default optional, meaning that the tables do not have to be existing. If any table is missing it will not be joined. If only a single table exists, then it is taken as is and returned as the output. If all the tables are missing, then the step will not be executed and its target table will not be written. There is a way to define required tables, in which case in a missing source table the step will not be executed and its target table will not be written. There is a way to define required tables in which case in case of a missing source table the step will fail and report an error.
The join operation uses a convention based approach. When defining a join, the information required is:
•Which tables need to be joined.
•Which columns are the key used to join them.
All tables taking a part in the join must have all the columns specified as the keys. For other columns the following logic is applied:
•If a column exists in exactly one of the joined tables, it is taken as is.
•If a column exists in two or more of the tables, the values in it are merged using a specific aggregation strategy. The default aggregation is to join the values with a pipe "|" separator.
More information about Joining can be found in the ETL Implementation Guide.
•SOURCES: Define one or more sources that will be used for the transformation. It supports auto-completion and will offer all available tables matching the current input string for selection.
•JOIN STRATEGY: Select the join strategy that will be used for the transformation. The following strategies are available:
oLeft: The left table will be used as master table. All matched rows will be combined and all unmatched rows from the left table will be written. All unmatched rows from the right table will be ignored.
oInner: All matched rows will be combined and all unmatched rows will be ignored.
oOuter: All matching rows will be combined and all unmatched rows will be taken as they are.
•JOIN KEYS: Enter the join keys used for the join. The key used to join must be present in all joined tables.
•COLUMNS: This field is used to specify the columns that will be present in the output table. If no columns are provided in this field, all columns will be available. The join keys will always be available in the output table.
•CONFLICTS: This option is used to define which column and which resolution method is used to resolve conflicts. It is possible to create multiple solution to conflicts. Enter the column into the COLUMN field available in the option and select one of the conflict resolution methods from the CONFLICT RESOLUTION METHOD field. The available methods are listed below in the Conflict Resolution Method section.
•DEFAULT AGGREGATION: Define the default aggregation that will be used by selecting a method from the dropdown menu. The following options are available:
oAverage: Selects the average value of one or more values.
oMaximum: Selects the maximum of one or more values.
oMinimum: Selects the minimum of one or more values.
oSum: Selects the sum of one or more values.
oCoalesce: Selects the first not-empty value from the list of one or more values.
oConcat: Join all given non-empty values using a specific separator (from left to right).
oFirst non null: Selects the first not-empty value from the list of one or more values.
•COLUMN: Enter the column from which the value will be taken. It supports auto-completion and will offer all available columns matching the current input string for selection.
•CONFLICT RESOLUTION METHOD: Select the method used to resolve conflicts. The available methods are First non-empty value, Join values with separator, Max value, and Min value. If First non-empty value is selected, the following further settings will be available.
•AVAILABLE SOURCES: Ordering the available sources in this field determines the order in which the sources will be taken when this conflict occurs. It supports auto-completion and will offer all available sources matching the current input string for selection.
•COLUMN: Enter the column from which the value will be taken. It supports auto-completion and will offer all available columns matching the current input string for selection.
•CONFLICT RESOLUTION METHOD: Select the method used to resolve conflicts. The available methods are First non-empty value, Join values with separator, Max value, and Min value. If Join values with separator is selected, the following further settings will be available.
•SEPARATOR: Enter the separator that will be used into this field.
•AVAILABLE SOURCES: Ordering the available sources in this field determines the order in which the sources will be taken when this conflict occurs. It supports auto-completion and will offer all available sources matching the current input string for selection.
•COLUMN: Enter the column from which the value will be taken. It supports auto-completion and will offer all available columns matching the current input string for selection.
•CONFLICT RESOLUTION METHOD: Select the method used to resolve conflicts. The available methods are First non-empty value, Join values with separator, Max value, and Min value. If Max value is selected, the following further settings will be available.
•COLUMN: Enter the column from which the value will be taken. It supports auto-completion and will offer all available columns matching the current input string for selection.
•CONFLICT RESOLUTION METHOD: Select the method used to resolve conflicts. The available methods are First non-empty value, Join values with separator, Max value, and Min value. If Min value is selected, the following further settings will be available.