.. _yaml_reference_maintain:

YAML
====

This section is a reference guide to the SDP maintenance YAML schema. A maintenance YAML specifies two components: a list of :py:class:`~adp.maintain.dataclasses.Variable` and a list of :py:class:`~adp.maintain.dataclasses.Job`.

Variables substitute a different value into the YAML for each environment (dev, tst, acc and prd). A Variable therefore specifies the key used for substitution (the name) and the value for each environment.

A :py:class:`~adp.maintain.dataclasses.Job` object also has a name, but more importantly it contains a list of paths (which in turn contain tables) and a list of rules: :py:class:`~adp.maintain.dataclasses.RetentionRule`, :py:class:`~adp.maintain.rules.OptimizeRule`, :py:class:`~adp.maintain.dataclasses.VacuumRule` and :py:class:`~adp.maintain.dataclasses.BackupRule`. These rules are applied to the tables in the paths.

Job
---

A Job is the main building block of the maintenance framework. Each job contains one or more paths and one or more rules.

.. code-block:: yaml

    jobs:
      - name: string
        paths:
          ...
        rules:
          ...

Path
----

Each entry under the `paths` key is a path on the data lake in ABFSS notation (e.g. `abfss://sdp@nubulosdpdlsdev01.dfs.core.windows.net/staging/_tmp/test_a2089d8f-0ef2-4d06-9c31-ac49da100dad/Areas.json`). You can specify multiple paths. The maintenance package automatically finds all Delta tables under each path.

.. code-block:: yaml

    paths:
      - abfss://sdp@nubulosdpdlsdev01.dfs.core.windows.net/bronze/
      - abfss://sdp@nubulosdpdlsdev01.dfs.core.windows.net/staging/_tmp/test_a2089d8f-0ef2-4d06-9c31-ac49da100dad/Areas.json

Rules
-----

.. code-block:: yaml

    rules:
      - type: retention
        column: __sdp_ingest_timestamp # default
        days: int
      - type: optimize
      - type: vacuum
        hours: int # 168 default
      - type: backup
        name: str # can only contain upper- or lowercase letters
        # Full creates a backup in the `latest` folder.
        # Sync does an incremental backup to the `latest` folder.
        # Archive will create a separate folder for each run.
        type: Full | Sync | Archive
      ...

RetentionRule
^^^^^^^^^^^^^

The :py:class:`~adp.maintain.rules.RetentionRule` uses a column (`__sdp_ingest_timestamp` by default) and the retention period (the `days` parameter) to determine which rows can be removed from the dataset. All rows with a column value *before* the start of the retention period are removed from the dataset.

.. warning::

    The RetentionRule does not remove the underlying files of the Delta table. It *only* removes the rows from the dataset. Run the `VacuumRule` to remove any unused files.

OptimizeRule
^^^^^^^^^^^^

The :py:class:`~adp.maintain.rules.OptimizeRule` runs the following query:

.. code-block::

    OPTIMIZE table_name

`OPTIMIZE` recursively optimizes the layout of a Delta Lake table. It can optionally optimize only a subset of the data or colocate data by column. If no colocation is specified, bin-packing optimization is performed. For now, we do not specify the `colocate` parameter.

You can read more about `OPTIMIZE` in the Databricks documentation.

VacuumRule
^^^^^^^^^^

The :py:class:`~adp.maintain.rules.VacuumRule` runs the following query:

.. code-block::

    VACUUM table_name [RETAIN num HOURS]

`VACUUM` removes the data files that are no longer used by the Delta table. This also truncates the `table_changes` history to the specified retention period.

You can read more about `VACUUM` in the Databricks documentation.

BackupRule
^^^^^^^^^^

The :py:class:`~adp.maintain.rules.BackupRule` calls the Backup API to make a backup of the data. The backup is created for the whole path (instead of for the individual tables).

Using variables
---------------

Variables let you specify a different value in the YAML for each environment (dev, tst, acc, prd). The variable definition is placed at the `root` level of the YAML and should always contain a value for each environment.

We define a variable in the following manner:

.. code-block:: yaml

    ..
    variables:
      - name: server_name
        values:
          dev: 'develop.server.nl'
          tst: 'test.server.nl'
          acc: 'accept.server.nl'
          prd: 'prod.server.nl'
    ..

Now we can use the variable anywhere in the YAML:

.. code-block:: yaml

    ..
    name: 'Ingest for ${{ variables.server_name }}'
    ..

Examples
--------

.. literalinclude:: ../../../../sdp-maintain/examples/example.yaml
    :language: yaml
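
The substitution described under *Using variables* can also be illustrated in Python. The sketch below is not the package's implementation: the helper name `substitute_variables`, the regular expression and the plain-dictionary layout of the variables are assumptions made for illustration only. It shows how a `${{ variables.<name> }}` placeholder could be resolved for a single environment.

.. code-block:: python

    import re

    # Matches placeholders of the form ${{ variables.<name> }}.
    # The exact syntax accepted by the package may differ (assumption).
    PLACEHOLDER = re.compile(r"\$\{\{\s*variables\.(\w+)\s*\}\}")


    def substitute_variables(text, variables, environment):
        """Replace each placeholder in text with its value for one environment.

        Hypothetical helper: `variables` mirrors the `variables` list from the
        YAML, already parsed into dictionaries.
        """
        # Build a name -> value lookup for the selected environment.
        lookup = {var["name"]: var["values"][environment] for var in variables}

        def replace(match):
            name = match.group(1)
            if name not in lookup:
                raise KeyError(f"undefined variable: {name}")
            return lookup[name]

        return PLACEHOLDER.sub(replace, text)


    variables = [
        {
            "name": "server_name",
            "values": {
                "dev": "develop.server.nl",
                "tst": "test.server.nl",
                "acc": "accept.server.nl",
                "prd": "prod.server.nl",
            },
        }
    ]

    line = "name: 'Ingest for ${{ variables.server_name }}'"
    print(substitute_variables(line, variables, "tst"))

In this sketch an unknown variable name raises a `KeyError` instead of leaving the placeholder in place, so a typo in a variable name surfaces immediately.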