YAML
This section is a reference guide to the SDP maintenance YAML schema.
A maintenance YAML specifies two components: a list of Variables and a list of Jobs.
The variables are used to substitute different values into the YAML for each environment (dev, tst, acc and prd). A Variable therefore specifies the key used for substitution (its name) and a value for each of the environments.
A Job object also has a name, but more importantly it contains a list of paths (which in turn contain tables) and a list of rules (RetentionRule, OptimizeRule, VacuumRule, BackupRule). These rules are applied to the tables found under the paths.
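A minimal sketch of a complete file, combining both components (the `days` value is illustrative; the variable values are taken from the example later in this section):

```yaml
variables:
  - name: server_name
    values:
      dev: 'develop.server.nl'
      tst: 'test.server.nl'
      acc: 'accept.server.nl'
      prd: 'prod.server.nl'

jobs:
  - name: 'Ingest for ${{ variables.server_name }}'
    paths:
      - abfss://sdp@nubulosdpdlsdev01.dfs.core.windows.net/bronze/
    rules:
      - type: retention
        days: 30
```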
Job
A Job is the main building block of the maintenance framework. Each job contains one or more paths and one or more rules.
jobs:
- name: string
paths: ...
rules: ...
Path
The path key refers to a path on the datalake using the ABFSS notation (e.g. abfss://sdp@nubulosdpdlsdev01.dfs.core.windows.net/staging/_tmp/test_a2089d8f-0ef2-4d06-9c31-ac49da100dad/Areas.json).
You can specify multiple paths. The maintenance package will automatically find all Delta tables under each path.
paths:
- abfss://sdp@nubulosdpdlsdev01.dfs.core.windows.net/bronze/
- abfss://sdp@nubulosdpdlsdev01.dfs.core.windows.net/staging/_tmp/test_a2089d8f-0ef2-4d06-9c31-ac49da100dad/Areas.json
Rules
rules:
- type: retention
column: __sdp_ingest_timestamp # default
days: int
- type: optimize
- type: vacuum
hours: int # 168 default
- type: backup
name: str # can only contain upper- or lowercase letters
type: Full | Sync | Archive # Full creates a backup in the `latest` folder. Sync does an incremental backup to the `latest` folder. Archive will create a separate folder for each run.
...
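For instance, a concrete rules list for a job might look like this (the numeric values are illustrative):

```yaml
rules:
  - type: retention
    column: __sdp_ingest_timestamp
    days: 90
  - type: optimize
  - type: vacuum
    hours: 168
```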
RetentionRule
The RetentionRule uses a column (__sdp_ingest_timestamp by default) and a retention period (the days parameter) to determine which rows can be removed from the dataset.
All rows older than the retention period (days) will be removed from the dataset.
Warning
The RetentionRule does not remove the underlying data files of the Delta table; it only removes the rows from the dataset. Run the VacuumRule to remove any unused files.
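Conceptually, the rule computes a cutoff timestamp and drops every row older than it. A minimal sketch in plain Python (the function name and the list-of-dicts data model are hypothetical; the real rule deletes rows from a Delta table):

```python
from datetime import datetime, timedelta

def apply_retention(rows, days, column="__sdp_ingest_timestamp", now=None):
    """Keep only rows whose ingest timestamp falls inside the retention window.

    `rows` is a list of dicts where `column` holds a datetime. Everything
    here is illustrative -- the framework issues a DELETE on the table.
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=days)  # rows strictly before this are removed
    return [row for row in rows if row[column] >= cutoff]
```

With days: 30 and a "now" of 2024-01-31, a row ingested on 2024-01-30 survives while one from 2023-12-01 is dropped.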
OptimizeRule
The OptimizeRule runs the following query:
OPTIMIZE table_name
OPTIMIZE recursively optimizes the layout of a Delta Lake table, optionally optimizing a subset of the data or colocating data by column. If you do not specify colocation, bin-packing optimization is performed. For now, we do not specify the colocation parameter.
You can read more about OPTIMIZE in the Databricks Documentation
VacuumRule
The VacuumRule runs the following query:
VACUUM table_name [RETAIN num HOURS]
VACUUM removes the data files that are no longer referenced by the Delta table. This will also cause table_changes to be truncated up to the specified retention period.
You can read more about VACUUM in the Databricks Documentation
BackupRule
The BackupRule calls the Backup API to make a backup of the data.
The backup is created for the whole path (rather than for the individual tables).
Using variables
Variables allow you to specify a different value in the YAML for each environment (dev, tst, acc, prd). The variable definition is placed at the root level of the YAML and must always contain a value for every environment.
We define a variable in the following manner:
..
variables:
- name: server_name
values:
dev: 'develop.server.nl'
tst: 'test.server.nl'
acc: 'accept.server.nl'
prd: 'prod.server.nl'
..
Now we can use the variable anywhere in the YAML:
..
name: 'Ingest for ${{ variables.server_name }}'
..
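Under the hood, substitution can be thought of as a per-environment text replacement. A hedged sketch (the regex and function name are assumptions, not the framework's actual API):

```python
import re

def substitute(text, variables, env):
    """Replace every ${{ variables.<name> }} with that variable's value for `env`.

    `variables` mirrors the YAML definition: a list of dicts, each with a
    `name` and a `values` mapping keyed by environment.
    """
    lookup = {v["name"]: v["values"][env] for v in variables}
    return re.sub(
        r"\$\{\{\s*variables\.(\w+)\s*\}\}",
        lambda m: lookup[m.group(1)],
        text,
    )

variables = [
    {"name": "server_name",
     "values": {"dev": "develop.server.nl", "tst": "test.server.nl",
                "acc": "accept.server.nl", "prd": "prod.server.nl"}},
]
print(substitute("name: 'Ingest for ${{ variables.server_name }}'", variables, "dev"))
# prints: name: 'Ingest for develop.server.nl'
```

Running the same file against prd would instead substitute prod.server.nl.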