YAML Reference

This chapter is a reference guide to the structure of the YAML specification for the delivery platform. Each notebook built on the SDP delivery platform should declare its input and output tables inside the notebook itself. This is done using a YAML definition language.

Each notebook should define its input tables using the delivery_load cell magic. This adds extra checks and validations on the input data (bronze or silver), effectively making the statement a gatekeeper: it ensures that the data going into the notebook is stable and does not change unexpectedly. This matters because the bronze and silver layers are built dynamically and follow the schema of the source system, which can change over time.

Similarly, the output tables should be specified using the delivery_write magic. The consumer of the output might have specific requirements for the data. The rules specified here verify the data and check whether it conforms to those expectations. You can check and enforce:

  1. The existence of a column

  2. The datatype of a column

  3. The integrity of the relationships between columns

  4. The nullability of a column
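The four checks above can be sketched in plain Python. This is an illustrative sketch of the kind of validation the platform performs on dict-shaped rows, not the platform's actual implementation; the function name, parameters, and rule format are hypothetical.

```python
# Illustrative sketch of the four checks (hypothetical, not the platform's code).
def check_column(rows, name, expected_type, nullable=True, allowed_values=None):
    """Validate one column over a list of dict-shaped rows; return an error or None."""
    # 1. Existence of the column
    if any(name not in row for row in rows):
        return f"column '{name}' is missing"
    values = [row[name] for row in rows]
    # 4. Nullability
    if not nullable and any(v is None for v in values):
        return f"column '{name}' contains nulls"
    # 2. Datatype
    if any(v is not None and not isinstance(v, expected_type) for v in values):
        return f"column '{name}' has a type mismatch"
    # 3. Referential integrity: every value must exist in the related column
    if allowed_values is not None:
        if any(v is not None and v not in allowed_values for v in values):
            return f"column '{name}' violates a relationship"
    return None  # all checks passed

rows = [{"id": 1, "ref": "a"}, {"id": 2, "ref": "b"}]
print(check_column(rows, "id", int, nullable=False))         # None: passes
print(check_column(rows, "ref", str, allowed_values={"a"}))  # relationship violated
```

In the real platform these rules are declared in YAML rather than passed as function arguments, and the checks run against Spark dataframes instead of Python lists.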

Finally, the order in which the separate notebooks execute can be specified using the delivery_run magic. This also lets you specify restore behaviour when a child notebook fails.

In order to use the magics, one should import the delivery package:

import adp.delivery

Now you can use the magics in any cell by prefixing them with a double percent sign. For example, using the delivery_load magic:

%%delivery_load
# YAML HERE

Sources and Sinks

The following structure is valid for both delivery_load (Sources) and delivery_write (Sinks):

...
- uri: str # Mandatory, reference to the table in the SDP in the form `layer.system_name.table_name`.
  append_mode: bool # Optional, specifies whether data should be appended when using delivery_write. Defaults to False.
  allow_additional_columns: bool # Optional, allow more columns in the data than specified under `columns`. Defaults to True.
  on_missing: recreate | fail # Optional, specifies what to do when the table is missing from the ADP. Defaults to fail.
  columns: # Optional, a list of columns. Defaults to None.
  - name: str # Mandatory, the expected name of the column.
    type: str # Mandatory, the expected datatype of the column. You can use any Spark type here, e.g. StringType, StringType(), LongType, DecimalType(10,2).
    on_type_mismatch: fail | ignore | cast # Optional, action to execute on a type mismatch. Defaults to fail. Casts are done in ANSI strict mode.
    nullable: bool # Optional, specifies whether nulls are allowed in the column. Defaults to True.
    on_missing: recreate | fail # Optional, specifies what to do when a column is missing from the data. Defaults to fail.
    relationships: # Optional, a list of relationships to other columns in the form `layer.system.entity.column`. Relationships are checked and enforced. Defaults to None.
    - uri: layer.system.entity.column # A reference to the foreign key column.
    - ...
    metadata: dict # Optional, extra key/value pairs for the column. For metadata purposes only. Defaults to None.
  - ...
- ...
...

The code below is an example of the description above.

%%delivery_load
- uri: layer.system_name.table_name
  allow_additional_columns: True
  on_missing: fail
  columns:
  - name: column_1
    type: StringType
    on_type_mismatch: fail
  - name: column_2
    type: IntegerType
    on_missing: recreate
...

This registers a temporary view over the data in the storage account with the same name as the URI, verifies that column_1 is a string, and checks that column_2 exists. The platform will automatically add column_2 with an IntegerType to the dataframe if it does not exist yet.
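The `on_missing: recreate` behaviour amounts to appending the missing column, filled with nulls, to every row. A minimal sketch of that idea over dict-shaped rows (hypothetical helper; the platform does this on a Spark dataframe with a proper IntegerType):

```python
def recreate_missing_column(rows, name, default=None):
    """Add `name` to every row that lacks it, filled with a default (null)."""
    return [{**row, name: row.get(name, default)} for row in rows]

rows = [{"column_1": "x"}, {"column_1": "y", "column_2": 3}]
fixed = recreate_missing_column(rows, "column_2")
# Every row now has column_2; rows that were missing it get None.
```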

Usage in notebooks

The main function of a notebook is to transform data. Most of the notebooks within the SDP take their data from the bronze and/or silver layer. These layers conform to the input of the source data as much as possible. However, the delivery layer should yield robust and stable results, even when the schema or form of the data changes over time.

In the past, every developer used their own method of retrieving and writing data from the storage account managed by the ADP. This was complicated by the fact that the name of a delta table and the location of its data can be managed separately, which caused delta tables to be loosely coupled to their location on the storage account. Specifying the inputs and outputs of a notebook explicitly solves this problem, as the platform will handle all reads and writes to the storage account. You should use IPython magics to specify your loads and writes. You can register these magics by importing adp.delivery. Then you can use them like this:

%%delivery_load
# Retrieves `dbo_saex_g_l_entry` from the storage account
# and creates a temporary view called `silver.navision.dbo_saex_g_l_entry`
- uri: silver.navision.dbo_saex_g_l_entry
...

Then, do your transformations and create a temporary view called export.system_name.table_name. Finally, when the transformations are done, add the following at the end of the notebook:

.. code-block:: python

   %%delivery_write
   # This will export your view to ADLS, Databricks Hive and Synapse
   - uri: export.system_name.table_name
   ...

Parent.py (delivery_run)

The parent.py file in your repository should indicate which notebooks to run, as projects may consist of multiple notebooks (for example, one notebook per table). The parent.py file specifies the order of execution for each notebook. Running this parent notebook also generates a specific run_id, and all child notebooks export their metadata as JSON with this same run_id at runtime. This allows you to gather all the metadata for that specific run.

When running parent.py in a job, it is possible to exclude certain notebooks from the run by using the dbutils widget __adp_exclude_jobs_by_name. This widget should contain a comma-separated list of job names to exclude; the platform will then skip these notebooks. This is useful when you want to run one specific notebook, or when you want to exclude a notebook that is failing.
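The exclusion logic boils down to filtering the job list against the comma-separated widget value. A sketch, assuming the widget value has already been read via dbutils (the helper function below is hypothetical, not part of the platform's API):

```python
def exclude_jobs(jobs, exclude_value):
    """Drop jobs whose name appears in the comma-separated exclude list."""
    excluded = {name.strip() for name in exclude_value.split(",") if name.strip()}
    return [job for job in jobs if job["name"] not in excluded]

jobs = [{"name": "load_customers"}, {"name": "load_orders"}, {"name": "load_invoices"}]
# e.g. widget __adp_exclude_jobs_by_name contains "load_orders, load_invoices"
remaining = exclude_jobs(jobs, "load_orders, load_invoices")
# remaining contains only load_customers
```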

To specify notebooks to run, use the delivery_run cell magic and use the following schema:

name: str # Mandatory, name of the project (e.g. navision)
layer: str # Mandatory, layer of the project (e.g. gold, export)
description: str # Optional, a short description of the project. For metadata reasons
metadata: dict # Optional, extra key/value pairs for the project. For metadata purposes only. Defaults to None.
threads: int # Optional, defaults to 1. How many parallel threads are used to execute jobs. 1 executes them sequentially; > 1 causes jobs to be executed in parallel.
restore_automatically: bool # Optional, defaults to False. If True, the platform will automatically restore all the tables when a single job fails.
check_previous_run_successful: bool # Optional, defaults to False. If True, the platform will check whether the previous run was successful. If not, the current run will not start.
jobs: # Mandatory, a list of jobs to run
- name: str # Mandatory, the name of the job
  type: databricks_notebook # Mandatory, type of the job to run. For now, the only valid possibility is `databricks_notebook`.
  description: str # Optional, a short description of the notebook. For metadata reasons.
  settings: # Mandatory, settings for a specific job type.
    path: ./child # Mandatory, path of the notebook to execute
    arguments: # Optional, a dictionary of key/value pairs to pass to the notebook.
      key1: value1
      ...
  ...
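Putting the schema together, a parent.py cell could look like the fragment below. The project, job, and path names are illustrative, not taken from a real project.

```yaml
%%delivery_run
name: navision
layer: export
description: Exports the Navision tables
threads: 2
restore_automatically: True
jobs:
- name: general_ledger
  type: databricks_notebook
  description: Exports the general ledger entries
  settings:
    path: ./child_general_ledger
    arguments:
      full_load: "false"
- name: customers
  type: databricks_notebook
  settings:
    path: ./child_customers
```

With `threads: 2`, both jobs run in parallel; with `restore_automatically: True`, a failure in either job triggers a restore of the output tables.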

check_previous_run_successful

You can enable checking whether the previous job run has finished successfully by setting check_previous_run_successful to True in the delivery_run YAML. When this is set to True, the framework will check whether the previous run was executed successfully. If this is not the case, the framework will raise an exception by default. This option defaults to False.

restore_automatically

With the restore_automatically setting, you can specify whether the platform should automatically restore tables when a notebook fails. In that case, the platform restores the tables to the state they were in before the notebook was executed, by restoring them from the previous run.
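Delta tables keep a version history, so restoring to a pre-run state can be expressed with Delta Lake's RESTORE statement. The sketch below only builds such a statement as a string; how the platform actually performs the restore, and how it picks the version, is not specified here, so the helper and the version number are assumptions.

```python
def restore_statement(table, version):
    """Build a Delta Lake RESTORE statement for one table (illustrative only)."""
    return f"RESTORE TABLE {table} TO VERSION AS OF {version}"

# Conceptually, one statement per output table of the failed run would be
# executed, e.g. via spark.sql(restore_statement(...)) in a Databricks notebook.
stmt = restore_statement("export.system_name.table_name", 41)
```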