Requirements

RQ-DL-REPO-001 - Repository naming convention

  • Repository names should only contain lowercase letters [a-z] and hyphens "-".

  • Gold repositories should be in the form: “gold-<name>” (e.g. “gold-complaints”)

  • Export repositories should be in the form: “export-<name>” (e.g. “export-lifetime”)
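These naming rules can be checked mechanically. A minimal sketch in Python (the regex and helper name are illustrative, not part of any existing tooling):

```python
import re

# A repository name: lowercase letters and hyphens only, with the
# mandatory layer prefix for gold and export repositories.
REPO_NAME = re.compile(r"(gold|export)-[a-z]+(-[a-z]+)*")

def is_valid_repo_name(name: str) -> bool:
    """Return True when `name` follows RQ-DL-REPO-001."""
    return REPO_NAME.fullmatch(name) is not None
```

For example, `is_valid_repo_name("gold-complaints")` is True, while `is_valid_repo_name("Gold_Complaints")` is False.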

RQ-DL-REPO-002 - All delivery repositories contain a ‘parent.py’ using the delivery_run cell magic

The parent.py file defines the order in which the notebooks are executed. It contains the keys name, layer and jobs.

%%delivery_run

name: str # Mandatory, name of the project (e.g. navision)
layer: str # Mandatory, layer of the project (e.g. gold, export)
jobs:
- name: notebook_1
  type: databricks_notebook
  description: description here
  settings:
    path: ./child
    arguments:
      key1: value1

- name: notebook_2
  type: databricks_notebook
  description: description here
  settings:
    path: ./child
    arguments:
      key1: value2
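The jobs list is executed in order, top to bottom. A rough illustration of that semantics in plain Python (the dicts mirror the YAML above; run_notebook is a hypothetical stand-in for the real runner):

```python
# Hypothetical stand-in for the Databricks notebook runner.
def run_notebook(path: str, arguments: dict) -> str:
    return f"ran {path} with {arguments}"

# The `jobs` section above, expressed as Python literals.
jobs = [
    {"name": "notebook_1", "type": "databricks_notebook",
     "settings": {"path": "./child", "arguments": {"key1": "value1"}}},
    {"name": "notebook_2", "type": "databricks_notebook",
     "settings": {"path": "./child", "arguments": {"key1": "value2"}}},
]

# Jobs run sequentially, in the order they appear in parent.py.
results = [run_notebook(j["settings"]["path"], j["settings"]["arguments"])
           for j in jobs]
```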

RQ-DL-REPO-003 - All repositories should contain .gitattributes

Each repo should contain a .gitattributes to fix line endings. This is because Databricks runs on Linux, while we often initialize the repositories with files using VSCode on Windows.

Create a .gitattributes file with the following content:

###############################################################################
# Athora: Set behavior for normalizing line endings.
#    Code will usually be written in a Databricks environment, so keep the
#    settings to a minimum.
# https://git-scm.com/docs/gitattributes
# https://gitattributes.io/api/common%2Ccsharp%2Cweb%2Cvisualstudio
# EvdK: 2022-04-07, initial
###############################################################################

# Auto detect text files and perform line-end normalization
* text=auto

# Script
*.py      text diff=python
*.sql     text
*.sh      text
*.ps1     text eol=crlf
*.yaml    text
*.yml     text

# Data
*.json    text
*.xml     text

# Documentation
*.markdown   text
*.md         text
*.txt        text

RQ-DL-REPO-004 - Each notebook uses a YAML specification

Each notebook should define (at minimum) its data inputs and outputs using a YAML specification. See the YAML reference and the Quickstart.

Example:

# python
%%delivery_load

# Retrieves `dbo_saex_g_l_entry` from the storage account
# and creates a temporary view called `silver.navision.dbo_saex_g_l_entry`

sources:
- uri: silver.navision.dbo_saex_g_l_entry

This registers a view called silver.navision.dbo_saex_g_l_entry which you can now read from. The code below shows how you can write a transformation. This transformation can do whatever you want, as long as it creates a temporary view that matches your Sink (export.system_name.table_name here). When you use Python, you can also retrieve the temporary Source dataframe with: df = spark.table('`silver.navision.dbo_saex_g_l_entry`').

-- sql
-- example only, may also use Python here
CREATE TEMPORARY VIEW `export.system_name.table_name`
AS
-- Your custom transformation here
SELECT
    *
FROM `silver.navision.dbo_saex_g_l_entry`
# python
%%delivery_write

# This will export your view to ADLS, Databricks Hive and Synapse
sinks:
- uri: export.system_name.table_name

RQ-DL-REPO-005 - Helper functions should be imported from adp.delivery.gold

The %%delivery_load magic automatically registers the user-defined functions from the module adp.delivery.gold in the Spark session. This module also contains the append_unknown_record function, which is fully tested and documented. This means that you should NOT import or use functions from %run /delivery-sdp/libs/functions anymore.

RQ-DL-REPO-006 - Checkpoint management

Checkpoints can help in truncating the execution plan of your data transformation. However, the current implementation in Spark (and Databricks' distribution of Spark) leaves the developer a lot of freedom regarding:

  • The storage location of the checkpointed data

  • The lifetime of the checkpointed data

In order to stay in control of the checkpointed data, we state the following requirements:

  1. All checkpoints should write to the checkpoints folder in the root of the mounted storage account.

  2. Your code should clean its checkpoints at the end.
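A small sketch of how these two rules could look in code. The mount root and job name are illustrative; the string helper only encodes requirement 1, and the comments sketch the Spark calls for requirement 2:

```python
def checkpoint_dir(mount_root: str, job_name: str) -> str:
    """Checkpoint location per requirement 1: the `checkpoints` folder
    in the root of the mounted storage account (paths are illustrative)."""
    return f"{mount_root}/checkpoints/{job_name}"

# Usage inside a notebook would look roughly like this (not runnable here):
#   path = checkpoint_dir("/mnt/datalake", "gold-complaints")
#   spark.sparkContext.setCheckpointDir(path)
#   df = df.checkpoint()
#   ...
#   # Requirement 2: clean your own checkpoints at the end, e.g. with
#   # dbutils.fs.rm(path, recurse=True) on Databricks.
```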

Note

The setting spark.cleaner.referenceTracking.cleanCheckpoints is set to True for all the clusters managed by the ADP. This setting, however, does not guarantee the complete removal of all checkpointed data. This is noted in the JIRA discussion here.

RQ-DL-REPO-006 - Each repo should have a readme.md file

RQ-DL-REPO-007 - Notebooks should not use the .ipynb extension

Databricks currently defaults to creating notebooks in the .ipynb format. However, this file type poses challenges for review during pull requests. To facilitate a smoother review process, we mandate the use of standard .py files. This policy may be revisited in the future if Azure DevOps introduces support for diffs in .ipynb notebooks. The setting can be set on a per-user basis. See the Databricks Documentation for guidelines on how to change this setting for your personal account.

RQ-DL-GOLD-001 - Fact tables always start with ‘FCT_’

RQ-DL-GOLD-002 - Dimension tables always start with ‘DIM_’

RQ-DL-OPCON-001 - OPCON Job naming convention

  • A gold job should be in the form “gold-<name>”

  • A delivery job should be in the form “export-<name>”

RQ-DL-AAD-001 - AAD Group naming convention

  • An export group should be in the form “AAD_SDP_export_<name>”

  • A gold group should be in the form “AAD_SDP_gold_<name>”
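These naming conventions lend themselves to small helpers so that names never drift. An illustrative sketch (the function names are not part of any existing tooling):

```python
def aad_group(layer: str, name: str) -> str:
    """Build an AAD group name per RQ-DL-AAD-001, e.g. AAD_SDP_gold_complaints."""
    assert layer in ("gold", "export")
    return f"AAD_SDP_{layer}_{name}"

def opcon_job(layer: str, name: str) -> str:
    """Build an OPCON job name per RQ-DL-OPCON-001, e.g. gold-complaints."""
    assert layer in ("gold", "export")
    return f"{layer}-{name}"
```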