YAML

This section is a reference guide to the SDP maintenance YAML schema. A maintenance YAML specifies two components: a list of Variable objects and a list of Job objects. The variables are used to substitute different values in the YAML for different environments (dev, tst, acc and prd). As such, a Variable specifies the key used for substitution (the name) and the different values for each of the environments. A Job object also has a name, but more importantly it contains a list of paths (which in turn contain tables) and a list of rules (RetentionRule, OptimizeRule, VacuumRule, BackupRule). These rules are applied to the tables in the paths.

Job

A Job is the main building block of the maintenance framework. Each job contains one or more paths and one or more rules.

jobs:
- name: string
  paths: ...
  rules: ...
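As a concrete illustration, a filled-in job could look as follows (the job name and rule values are hypothetical; the path reuses the dev example from this guide):

```yaml
jobs:
- name: cleanup_bronze  # hypothetical job name
  paths:
  - abfss://sdp@nubulosdpdlsdev01.dfs.core.windows.net/bronze/
  rules:
  - type: retention
    days: 30
  - type: optimize
```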

Path

The path key refers to a path on the datalake using the ABFSS notation (e.g. abfss://sdp@nubulosdpdlsdev01.dfs.core.windows.net/staging/_tmp/test_a2089d8f-0ef2-4d06-9c31-ac49da100dad/Areas.json).

You can specify multiple paths. The maintenance package will automatically find all Delta tables under each path.

paths:
- abfss://sdp@nubulosdpdlsdev01.dfs.core.windows.net/bronze/
- abfss://sdp@nubulosdpdlsdev01.dfs.core.windows.net/staging/_tmp/test_a2089d8f-0ef2-4d06-9c31-ac49da100dad/Areas.json

Rules

rules:
- type: retention
  column: __sdp_ingest_timestamp # default
  days: int
- type: optimize
- type: vacuum
  hours: int # 168 default
- type: backup
  name: str # can only contain upper- or lowercase letters
  type: Full | Sync | Archive # Full creates a backup in the `latest` folder. Sync does an incremental backup to the `latest` folder. Archive will create a separate folder for each run.
...

RetentionRule

The RetentionRule uses a column (__sdp_ingest_timestamp by default) and a retention period - the days parameter - to determine which rows can be removed from the dataset. All rows older than the retention period (days) are removed from the dataset.

Warning

The RetentionRule does not remove the underlying data files of the Delta table; it only removes the rows from the dataset. Run the VacuumRule to remove any unused files.
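For this reason, a retention rule is typically paired with a vacuum rule in the same job, for example (the values below are illustrative):

```yaml
rules:
- type: retention
  days: 365   # remove rows older than one year
- type: vacuum
  hours: 168  # then physically remove files unused for 7 days
```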

OptimizeRule

The OptimizeRule runs the following query:

OPTIMIZE table_name

OPTIMIZE recursively optimizes the layout of a Delta Lake table. Optionally, it optimizes a subset of the data or colocates data by column. If you do not specify colocation, bin-packing optimization is performed. For now, we do not specify the colocate parameter.

You can read more about OPTIMIZE in the Databricks Documentation

VacuumRule

The VacuumRule runs the following query:

VACUUM table_name [RETAIN num HOURS]

VACUUM removes the data files which are no longer used by the Delta table. This also causes the table_changes to be truncated up to the specified retention period.

You can read more about VACUUM in the Databricks Documentation

BackupRule

The BackupRule will call the Backup API to make a backup from the data. The backup will be created for the whole path (instead of the individual tables).
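A backup rule could be configured as follows (the backup name is a hypothetical example):

```yaml
rules:
- type: backup
  name: bronzeweekly  # letters only, per the naming constraint above
  type: Archive       # creates a separate folder for each run
```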

Using variables

Variables allow you to specify a different value in the YAML for each environment (dev, tst, acc, prd). Variable definitions are placed at the root level of the YAML and should always contain a value for each environment.

We define a variable in the following manner:

..
variables:
- name: server_name
  values:
    dev: 'develop.server.nl'
    tst: 'test.server.nl'
    acc: 'accept.server.nl'
    prd: 'prod.server.nl'
..

Now we can use the variable anywhere in the code:

..
name: 'Ingest for ${{ variables.server_name }}'
..

Examples
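A minimal complete maintenance YAML, combining a variable with a single job, could look like this (the job name, rule values and the tst/acc/prd storage account names are illustrative; only the dev account name is taken from the examples above):

```yaml
variables:
- name: storage_account
  values:
    dev: 'nubulosdpdlsdev01'
    tst: 'nubulosdpdlstst01'  # hypothetical
    acc: 'nubulosdpdlsacc01'  # hypothetical
    prd: 'nubulosdpdlsprd01'  # hypothetical

jobs:
- name: 'Maintenance for ${{ variables.storage_account }}'
  paths:
  - abfss://sdp@${{ variables.storage_account }}.dfs.core.windows.net/bronze/
  rules:
  - type: retention
    days: 365
  - type: optimize
  - type: vacuum
    hours: 168
```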