.. _delivery requirements:

Requirements
============

RQ-DL-REPO-001 - Repository naming convention
---------------------------------------------

- Repository names should only contain lowercase letters [a-z] and hyphens "-".
- Gold repositories should be in the form: "gold-" (e.g. "gold-complaints")
- Export repositories should be in the form: "export-" (e.g. "export-lifetime")

RQ-DL-REPO-002 - All delivery repositories contain a 'parent.py' using the delivery_run cell magic
--------------------------------------------------------------------------------------------------

The parent.py file defines the order in which the notebooks should be
executed. It contains `name`, `layer` and `jobs`.

.. code-block:: python3

   %%delivery_run
   name: str   # Mandatory, name of the project (e.g. navision)
   layer: str  # Mandatory, layer of the project (e.g. gold, export)
   jobs:
     - name: notebook_1
       type: databricks_notebook
       description: description here
       settings:
         path: ./child
         arguments:
           key1: value1
     - name: notebook_2
       type: databricks_notebook
       description: description here
       settings:
         path: ./child
         arguments:
           key1: value2

RQ-DL-REPO-003 - All repositories should contain ``.gitattributes``
-------------------------------------------------------------------

Each repo should contain a ``.gitattributes`` to fix line-endings. This is
because Databricks runs on Linux, while we often use VSCode on Windows to
initialize the repositories with files.

Create a ``.gitattributes`` file with the following content:

.. code-block:: text

   ###############################################################################
   # Athora: Set behavior for normalizing line endings.
   # Code will usually be written in a Databricks environment, so keep the
   # settings to a minimum.
   # https://git-scm.com/docs/gitattributes
   # https://gitattributes.io/api/common%2Ccsharp%2Cweb%2Cvisualstudio
   # EvdK: 2022-04-07, initial
   ###############################################################################

   # Auto detect text files and perform line-end normalization
   * text=auto

   # Script
   *.py   text diff=python
   *.sql  text
   *.sh   text
   *.ps1  text eol=crlf
   *.yaml text
   *.yml  text

   # Data
   *.json text
   *.xml  text

   # Documentation
   *.markdown text
   *.md       text
   *.txt      text

RQ-DL-REPO-004 - Each notebook uses a YAML specification
--------------------------------------------------------

Each notebook should define (at minimum) its data inputs and outputs by using
a YAML specification. See the :ref:`YAML reference ` and the
:ref:`Quickstart `.

Example:

.. code-block:: python3

   # python
   %%delivery_load
   # Retrieves `dbo_saex_g_l_entry` from the storage account
   # and creates a temporary view called `silver.navision.dbo_saex_g_l_entry`
   sources:
     - uri: silver.navision.dbo_saex_g_l_entry

This registers a view called ``silver.navision.dbo_saex_g_l_entry`` which can
now be read from. The code below shows how you can write a transformation.
The transformation can do whatever you want, as long as it creates a
temporary view that matches your `Sink` (``export.system_name.table_name``
here). When you use Python, you can also retrieve the temporary `Source`
dataframe by using:
``df = spark.table('`silver.navision.dbo_saex_g_l_entry`')``.

.. code-block:: sql

   -- sql
   -- example only, may also use Python here
   CREATE TEMPORARY VIEW `export.system_name.table_name` AS
   -- Your custom transformation here
   SELECT *
   FROM `silver.navision.dbo_saex_g_l_entry`
.. code-block:: python3

   # python
   %%delivery_write
   # This will export your view to ADLS, Databricks Hive and Synapse
   sinks:
     - uri: export.system_name.table_name

RQ-DL-REPO-005 - Helper functions should be imported from ``adp.delivery.gold``
-------------------------------------------------------------------------------

The ``%%delivery_load`` magic will automatically register the user-defined
functions from the module :mod:`adp.delivery.gold` in the Spark session. This
module also contains the `append_unknown_record` function, which is fully
tested and documented. This means that you should NOT import or use functions
from ``%run /delivery-sdp/libs/functions`` anymore.

RQ-DL-REPO-006 - Checkpoint management
--------------------------------------

Checkpoints can help in truncating the execution plan of your data
transformation. However, the way checkpointing is currently implemented in
Spark (and in Databricks' distribution of Spark) leaves a lot of room to the
developer regarding:

- The storage location of the checkpointed data
- The lifetime of the checkpointed data

In order to stay in control of the checkpointed data, we state the following
requirements:

1. All checkpoints should write to the ``checkpoints`` folder in the root of
   the mounted storage account.
2. Your code should clean up its checkpoints at the end.

.. note::

   The setting ``spark.cleaner.referenceTracking.cleanCheckpoints`` is set to
   True for all the clusters managed by the ADP. This setting, however, does
   not guarantee the complete removal of all checkpointed data. This is noted
   in the JIRA discussion `here `_.

RQ-DL-REPO-006 - Each repo should have a readme.md file
-------------------------------------------------------

RQ-DL-REPO-007 - Notebooks should *not* use the .ipynb extension
----------------------------------------------------------------

Databricks currently defaults to creating notebooks in the .ipynb format.
However, this file type poses challenges for review during pull requests. To
facilitate a smoother review process, we mandate the use of standard .py
files. This policy may be revisited in the future if Azure DevOps introduces
support for diffs in .ipynb notebooks.

The setting can be changed on a per-user basis. See the `Databricks
Documentation `_ for guidelines on how to change this setting for your
personal account.

RQ-DL-GOLD-001 - Fact tables always start with 'FCT\_'
------------------------------------------------------

RQ-DL-GOLD-002 - Dimension tables always start with 'DIM\_'
-----------------------------------------------------------

RQ-DL-OPCON-001 - OPCON Job naming convention
---------------------------------------------

- A gold job should be in the form "gold-"
- A delivery job should be in the form "export-"

RQ-DL-AAD-001 - AAD Group naming convention
-------------------------------------------

- An export group should be in the form "AAD_SDP_export_"
- A gold group should be in the form "AAD_SDP_gold_"
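The repository naming convention from RQ-DL-REPO-001 can be checked
mechanically, for example in a CI step. Below is a minimal, illustrative
sketch; the helper name ``is_valid_repo_name`` and the assumption that only
the ``gold-`` and ``export-`` prefixes are allowed for delivery repositories
are ours, not part of the requirement text:

```python
import re

# Illustrative assumption (our reading of RQ-DL-REPO-001): a valid delivery
# repository name starts with the layer prefix "gold-" or "export-",
# followed by lowercase letters [a-z] and hyphens "-" only.
_REPO_NAME = re.compile(r"^(gold|export)-[a-z][a-z-]*$")


def is_valid_repo_name(name: str) -> bool:
    """Return True if `name` follows the delivery repository convention."""
    return _REPO_NAME.fullmatch(name) is not None
```

For example, ``is_valid_repo_name("gold-complaints")`` returns ``True``,
while ``is_valid_repo_name("Gold_Complaints")`` returns ``False``.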