JSON format


The tasks and their relations are described using a JSON format. The ETL CLI expects the user to provide a full path to a JSON file for further processing.

This user guide assumes that the user already has some experience with the JSON format. Below is a quick recap of the syntax:

Basic data types

Number: An integer or floating-point number. The numbers are written directly without any enclosing quotes, for example 100 or 10.23.

String: Represents Unicode text. Strings must be enclosed in a pair of quotation marks or a pair of apostrophes. To use a quotation mark inside a string, it must be preceded by the so-called escape character, a backslash (\). Valid examples of strings are: "" (empty string), "Hello World", "Hello \"World\" I have nested quotes". Note: Instead of quotation marks, apostrophes can be used; in this case quotation marks do not need to be escaped (but apostrophes do). Examples using apostrophes are: '' (empty string), 'Hello World', 'Hello "World" I have nested quotes', 'Hello I am John\'s computer'.

Boolean: Either true or false (without quotation or apostrophes).

Array: An ordered list of values of other types (numbers, strings, boolean values, objects or other arrays). Array elements are separated by commas and surrounded by a pair of opening and closing square brackets. Valid examples of arrays are: [] (empty array), [ "single string" ], [ 1, 2, 3 ], [ true, 1, "string" ] etc.

Object: A collection of name–value pairs where the names are strings. Objects are delimited with curly brackets and use commas to separate the pairs, while within each pair the colon ':' character separates the name (key) from its value. Examples of valid objects are: { } (empty object), { "name": "Marcin" }, { "Name": "John", "Surname": "Smith", "Age": 30 } etc.

null: An empty value, written as the word null.

Object values and array elements can be of any type, including other objects, arrays or null values.
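As a quick illustration, the snippet below is a small, made-up JSON document combining all of the types described above. The names and values are purely illustrative, and the // labels rely on the comment support described at the end of this topic:

{
   "name": "Sample",                          // string
   "version": 14.0,                           // number
   "enabled": true,                           // boolean
   "tags": [ "etl", "demo" ],                 // array
   "owner": { "Name": "John", "Age": 30 },    // nested object
   "description": null                        // empty value
}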

ETL definition format

In its simplest form, the definition of a transformation is a JSON object which defines an array of steps ("steps"):

{
   "steps": [
      {
         // step 1 here
      },
      {
         // step 2 here
      }
      // and so on...
   ]
}

 

The JSON file does not define the source connection or the database type. Instead, this information is provided via the command line (when the ETL is executed as a standalone tool), or passed by DataHub (when it is executed as a transformation task).

The array of steps can contain one or more steps, chosen from the list of supported operations:

Mapping and selecting
Transforms one table into another (the number of rows stays the same) or concatenates two or more tables into a single table.
 
Similarities to SQL:
For a single table usage this translates roughly to SELECT <columns> FROM <source>
For multiple tables usage, this translates roughly to SELECT <all-columns> FROM <source1> UNION ALL SELECT <all-columns> FROM <source2>...

Filtering
Takes an input table and produces another one with the same columns but an equal or smaller number of rows (based on the filtering condition).
 
Similarities to SQL:
This method translates roughly to SELECT * FROM <source> WHERE <condition>

Grouping and deduplicating
Takes a table and groups the rows based on a matching combination of one or more columns. Grouping has two modes: it can either only mark the rows which should be grouped together (by introducing an extra column), or perform de-duplication. De-duplication always keeps a single row from each group. Some typical aggregation modes are provided (maximum, minimum, average, concatenation and coalescing).
 
Similarities to SQL:
Grouping itself is similar to SELECT * FROM <source> GROUP BY <group_key>. Introducing an extra column with a group ID has no direct equivalent.

Joining
Given a list of several tables, this operation joins them together and automatically resolves conflicting columns using aggregation methods. It provides a way to define a "master joining order" and to override it on a per-column basis, so that different tables may have different priorities for different columns.
 
Similarities to SQL:
Less complex scenarios can be represented by a sequence of JOIN (LEFT, INNER, OUTER) queries; more complex ones have no simple mapping. Additionally, it is possible to join on wildcards (for example, all tables matching the pattern MyTable*) and to define priorities based on wildcard patterns. Neither of these is possible with simple SQL.

Advanced deduplicating
Given an input table, this operation de-duplicates it based on several conditions and produces a table without duplicates. The resulting rows can either be picked from existing rows or created from all grouped values. It is also possible to select any number of rows (for example only the first, none, all, or all but the last, etc.).
 
Similarities to SQL:
There is nothing which maps 1-1 to this; the de-duplication is represented by several operations involving GROUP BY and aggregations.
 
Note: This function is powerful, but for most cases it may be overkill. Similar functionality is offered by the Grouping and deduplicating step, which should be the first choice for de-duplication or grouping.

Splitting
Given a single input table with a column containing aggregated values separated by a specific separator, this operation produces a table which "unpivots" the data.
 
Similarities to SQL:
There is nothing which maps 1-1; the operation is represented by several functions and SELECT queries.

Enriching
Given a single input table, this operation adds extra information to it from other tables, using different methods of joining by different keys.
 
Similarities to SQL:
There is nothing which maps 1-1 to this; enrichment is represented by many different join operations, groupings, sub-queries, etc.

The steps can be chained.

Comments

Both single line comments (starting with //) and block comments (starting with /* and ending with */) are supported in the JSON file.
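For example, the skeleton shown earlier in this topic could be annotated as follows (the step contents are placeholders only):

{
   /* Transformation definition.
      Block comments may span several lines. */
   "steps": [
      {
         // step 1 here (single line comment)
      }
   ]
}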