Enriching

<< Click to Display Table of Contents >>

ETL > 14.0 > Implementation Guide > Tutorial and implementation guide > Steps 

Enriching

Enriching is an operation which does not translate directly to any SQL query. The basic principle of enrichment relies on look-up, where a selected table (enriched table) is being updated by writing values to one of its columns (enriched column), based on values found in other table.

Source is by default optional, meaning that the table does not have to be existing. If the table is missing, the step will not be executed and its target table will not be written. There is a way to define that the source is required, in which case in case of a missing source table the step will fail and report an error. More information about setting up required steps can be found in the following chapter: Optional and required tables.

Example

For this example, let's consider three tables:

Table People contains the list of people, with an unique ID, name, address and the car plates.

Table Car contains a mapping between car registration and the address of its owner

Table Address contains a mapping between the name and the address.

 

Table People

Id

Name

Address

CarRegistration

1

Marcin

Paderborn

PB-OT-123

2

John

<NULL>

PB-JO-123

3

Andreas

<NULL>

<NULL>

 

Table Car

CarRegistration

Address

PB-OT-123

Szczecin

PB-JO-123

Berlin

PB-JK-123

Lichtenau

 

Table Address

Name

Address

Piotr

Krakau

Adrian

Stettin

Simon

Dortmund

The problem that needs to be solved is the column Address in the table People. We want to enrich that table by adding values to the columns that do not have them yet. However, the other tables do not provide an easy matching via JOIN operator, which is where the enrichment kicks in.

The basic idea is:

1.If there is already an address in the table People we take it as-is. Otherwise, go to the next point.

2.For remaining rows with still empty value of the address, if there is an entry in the Address table, where the value of the column Name matches the value in the People table / Name column, then we should take it from there. Otherwise, go to the next point.

3.For remaining rows with still empty value of the address, if there is an entry in the Car table, where the value of the column CarRegistration matches the value in the People table / CarRegistration column, then we should take it from there. Otherwise, go to the next point.

4.If there are still rows with missing address in the People table, we take n/a as a placeholder.

This operation can be achieved by the following JSON definition:

    {

      "id": 1,

      "name""Enrich People with other tables",

      "type""enrich",

      "source""People",

      "target""People_enriched",

      "column""Address"// the column to be enriched - point (1)

      "lookup": [

        {

          "in""Car"// point (2) - lookup [People].[Car]

          "key""CarRegistration", // matching the value from [Car].[CarRegistration] column.

          "take""Address" // if found, take [Car].[Address]

        },

        {

          "in""Address"// point (3) - lookup [People].[CarRegistration] 

          "key""Name"// matching the value from [Car].[CarRegistration] column.

          "take""Address" // if found, take [Address].[Address]

        },

        {

          "take": {

            "value""n/a" // point (4) - if all the previous attempts did not resulted in a non-null value, then use n/a.

          }

        }

      ]

    }

 

As such, the lookups property is an array of objects which have one of the following structures:

An object defining the in, key and take properties for look-ups

An object defining only the value table for a fall-back.

The lookup items are ordered and used from the top to the bottom. Once a value for the row can be found from a look-up, it is not going to be looked-up in other sources (for that row). The element in is the look-up source. It can be another step (referenced by ID, see chapter Chaining steps) or a table name (string).
 

papercliper

Note:

Enrichment does not override the values in the enriched columns that are already not null. If you want the results of the look-up to be not affected by the state of the current table, use a column name that does not exist in the first table.

 
Once the above JSON is started, the following table People_Enriched is returned:

Id

Name

Address

CarRegistration

1

Marcin

Paderborn

PB-OT-123

2

John

Berlin

PB-JO-123

3

Andreas

Stettin

<NULL>

Enriched values are highlighted using the yellow color to show the differences between the original table.