.. _yaml_reference_maintain:

YAML
====

This section is a reference guide to the SDP maintenance YAML schema. 
A maintenance YAML specifies two components: a list of :py:class:`~adp.maintain.dataclasses.Variable` objects and a list of :py:class:`~adp.maintain.dataclasses.Job` objects.
Variables substitute different values into the YAML for each environment (dev, tst, acc and prd). A Variable therefore specifies the key used for substitution (the name) and a value for each environment.
A :py:class:`~adp.maintain.dataclasses.Job` also has a name, but more importantly it contains a list of paths (which in turn contain tables) and a list of rules: :py:class:`~adp.maintain.dataclasses.RetentionRule`, :py:class:`~adp.maintain.rules.OptimizeRule`, :py:class:`~adp.maintain.dataclasses.VacuumRule` and :py:class:`~adp.maintain.dataclasses.BackupRule`. These rules are applied to the tables in the paths.

Job
---------------

A Job is the main building block of the maintenance framework.
Each job contains one or more paths and one or more rules.

.. code-block:: yaml
  
  jobs:
  - name: string
    paths: ...
    rules: ...

Path
----

The `paths` key lists one or more paths on the datalake in ABFSS notation (e.g. `abfss://sdp@nubulosdpdlsdev01.dfs.core.windows.net/staging/_tmp/test_a2089d8f-0ef2-4d06-9c31-ac49da100dad/Areas.json`).


You can specify multiple paths. The maintenance package automatically finds all Delta tables under each path.

.. code-block:: yaml

  paths:
  - abfss://sdp@nubulosdpdlsdev01.dfs.core.windows.net/bronze/
  - abfss://sdp@nubulosdpdlsdev01.dfs.core.windows.net/staging/_tmp/test_a2089d8f-0ef2-4d06-9c31-ac49da100dad/Areas.json

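As a rough illustration of what discovery under a path could look like: a Delta table is a directory that contains a `_delta_log` subdirectory. The sketch below walks a local directory tree as a stand-in for the datalake. `find_delta_tables` is a hypothetical helper, not part of the maintenance package, whose actual discovery logic is not shown in this reference.

.. code-block:: python

  import os
  from typing import List


  def find_delta_tables(root: str) -> List[str]:
      """Return every directory under ``root`` that holds a ``_delta_log`` folder."""
      tables = []
      for current, subdirs, _files in os.walk(root):
          if "_delta_log" in subdirs:
              tables.append(current)
              # Subfolders of a table are partitions, not tables: do not descend.
              subdirs.clear()
      return sorted(tables)
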

Rules
-----

.. code-block:: yaml

  rules:
  - type: retention
    column: __sdp_ingest_timestamp # default
    days: int
  - type: optimize
  - type: vacuum
    hours: int # 168 default
  - type: backup
    name: str #can only contain upper- or lowercase letters
    type: Full | Sync | Archive # Full creates a backup in the `latest` folder. Sync does an incremental backup to the `latest` folder. Archive will create a separate folder for each run.
  ...

RetentionRule
^^^^^^^^^^^^^

The :py:class:`~adp.maintain.rules.RetentionRule` uses a timestamp column (`__sdp_ingest_timestamp` by default) and a retention period (the `days` parameter) to determine which rows can be removed from the dataset.
All rows whose timestamp is older than the retention period are removed from the dataset.
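
To make the cutoff concrete, the sketch below selects the rows that a retention of `days` would remove. It uses plain Python dictionaries instead of a Delta table, and `rows_to_remove` is a hypothetical helper rather than the package's implementation. Measuring the period back from the current UTC time is also an assumption.

.. code-block:: python

  from datetime import datetime, timedelta, timezone


  def rows_to_remove(rows, days, column="__sdp_ingest_timestamp", now=None):
      """Return the rows whose timestamp lies before the retention cutoff."""
      now = now or datetime.now(timezone.utc)
      cutoff = now - timedelta(days=days)
      # Rows strictly older than the cutoff fall outside the retention period.
      return [row for row in rows if row[column] < cutoff]
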

.. warning::
    The RetentionRule does not remove the underlying data files of the Delta table. It *only* removes the rows from the dataset. Run the `VacuumRule` to remove any unused files.

OptimizeRule
^^^^^^^^^^^^


The :py:class:`~adp.maintain.rules.OptimizeRule` runs the following query:

.. code-block:: sql

  OPTIMIZE table_name

`OPTIMIZE` optimizes the layout of a Delta Lake table. It can optionally optimize a subset of the data or colocate data by column. If colocation is not specified, bin-packing optimization is performed. For now, the rule does not specify colocation.

You can read more about `OPTIMIZE` in the `Databricks Documentation <https://docs.databricks.com/sql/language-manual/delta-optimize.html>`_.


VacuumRule
^^^^^^^^^^


The :py:class:`~adp.maintain.rules.VacuumRule` runs the following query:

.. code-block:: sql

  VACUUM table_name [RETAIN num HOURS] 

`VACUUM` removes the data files that are no longer used by the Delta table. This also truncates the history available through `table_changes` to the specified retention period.
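
The optional `RETAIN` clause can be illustrated with a small sketch that builds the statement from the `hours` setting. `vacuum_statement` is a hypothetical helper. How the package actually assembles and runs the query is not shown in this reference.

.. code-block:: python

  from typing import Optional


  def vacuum_statement(table_name: str, hours: Optional[int] = None) -> str:
      """Build a VACUUM statement, adding RETAIN only when ``hours`` is given."""
      statement = f"VACUUM {table_name}"
      if hours is not None:
          if hours < 0:
              raise ValueError("hours must not be negative")
          statement += f" RETAIN {hours} HOURS"
      return statement
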

You can read more about `VACUUM` in the `Databricks Documentation <https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-vacuum>`_.

BackupRule
^^^^^^^^^^

The :py:class:`~adp.maintain.rules.BackupRule` calls the `Backup API <https://app-backup-api-prd-01.azurewebsites.net/swagger>`_ to make a backup of the data.
The backup is created for the whole path (instead of for the individual tables).
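
The `name` of a backup rule may only contain upper- or lowercase letters. A minimal check for that constraint could look as follows. `is_valid_backup_name` is a hypothetical helper, and restricting the name to ASCII letters is an assumption.

.. code-block:: python

  import re

  # Assumption: "letters" means the ASCII ranges A-Z and a-z.
  _LETTERS_ONLY = re.compile(r"[A-Za-z]+")


  def is_valid_backup_name(name: str) -> bool:
      """Return True when ``name`` is non-empty and consists of letters only."""
      return _LETTERS_ONLY.fullmatch(name) is not None
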


Using variables
---------------

Variables let you specify a different value in the YAML for each environment (dev, tst, acc, prd). The variable definition is placed at the `root` level of the YAML and should always contain a value for each environment.

We define a variable in the following manner:

.. code-block:: yaml

    ..
    variables:
    - name: server_name
      values:
        dev: 'develop.server.nl'
        tst: 'test.server.nl'
        acc: 'accept.server.nl'
        prd: 'prod.server.nl'
    ..


Now we can use the variable anywhere in the YAML:

.. code-block:: yaml
    
    ..
    name: 'Ingest for ${{ variables.server_name }}'
    ..

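Substitution can be pictured as a text replacement that runs before the YAML is parsed. The sketch below is an assumption about how such a step could work, not the package's implementation. `substitute_variables` and its arguments are hypothetical.

.. code-block:: python

  import re
  from typing import Dict

  # Matches ${{ variables.<name> }} with optional whitespace inside the braces.
  _PLACEHOLDER = re.compile(r"\$\{\{\s*variables\.(\w+)\s*\}\}")


  def substitute_variables(text: str, variables: Dict[str, Dict[str, str]], env: str) -> str:
      """Replace each placeholder in ``text`` with the variable's value for ``env``."""

      def replace(match):
          name = match.group(1)
          # A KeyError here means the variable or the environment is undefined.
          return variables[name][env]

      return _PLACEHOLDER.sub(replace, text)
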

Examples
--------

.. literalinclude:: ../../../../sdp-maintain/examples/example.yaml
  :language: yaml