Quickstart
1. Create repository
Ask the Core Team to create a repository under the AppSmartDataPlatform project. Make sure to follow the naming convention specified in the requirements. You can clone this repository using Databricks Repos, which allows you to develop and collaborate on your data code entirely in Databricks.
2. Create pipeline
Your code will be deployed under the root of the Databricks Workspace for each Databricks environment. To deploy the code automatically after a Pull Request (and commit on master), we use Azure Pipelines, which is controlled by azure-pipelines.yaml files. Ask the Core Team to create a pipeline. They will add an azure-pipelines.yaml file to the root of your repository with the following content:
trigger:
- main
pool:
name: DataLinux
resources:
repositories:
- repository: templates
type: git
name: AppSmartDataPlatform/pipeline-templates
stages:
- template: delivery-sdp/template.yaml@templates
This pipeline will refer to a centralised template under the pipeline-templates repository. Finally, ask a member of the Core Team to activate the pipeline under the AppSmartDataPlatform project.
Now, when a PR is completed, your code will be automatically deployed to the following Databricks workspaces:
3. Create code for each table
Now you can create the code for your data products. You can create as many notebooks as you see fit. Each notebook should be placed in a subdirectory of your repository called gold or export, and should define its inputs (sources) and outputs (sinks) using a YAML definition. The YAML definition can also contain validations or extra metadata (see YAML reference). The adp.delivery package reads this YAML definition and deploys the data to the storage account, the Databricks database and Synapse serverless views. This chapter goes into detail about this.
Example (./export/dbo_saex_g_l_entry):
# python
# Load the adp.delivery package. This will enable you to use %%delivery_load and %%delivery_write magics.
import adp.delivery
# python
%%delivery_load
# Retrieves `dbo_saex_g_l_entry` from the storage account
# and creates a temporary view called `silver.navision.dbo_saex_g_l_entry`
- uri: silver.navision.dbo_saex_g_l_entry
-- sql
CREATE TEMPORARY VIEW `export.system_name.table_name`
AS
-- Your custom transformation here
SELECT
*
FROM `silver.navision.dbo_saex_g_l_entry`
# python
%%delivery_write
# This will export your view to ADLS, Databricks Hive and Synapse
- uri: export.system_name.table_name
For gold, you can also add relationships in the FCT notebook:
# python
%%delivery_write
sinks:
- uri: gold.system_name.FCT_name
columns:
- name: key
relationships:
- uri: gold.system_name.DIM_name.key
- name: DimTijdsintervalID
relationships:
- uri: gold.system_name.DIM_name.key
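As the example above shows, each relationship points at a column of another table using a four-part URI (layer.system.table.column), whereas sink URIs have three parts. A minimal sketch of that naming pattern, using a hypothetical helper that is not part of adp.delivery:

```python
def is_valid_relationship_uri(uri: str) -> bool:
    # A relationship URI has exactly four non-empty parts:
    # layer.system.table.column
    parts = uri.split(".")
    return len(parts) == 4 and all(parts)

# The relationship URI from the example above has four parts;
# a sink URI such as export.system_name.table_name has only three.
assert is_valid_relationship_uri("gold.system_name.DIM_name.key")
assert not is_valid_relationship_uri("export.system_name.table_name")
```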
4. Create parent.py and run it
Create a file called parent.py. This file will be called by OPCON and defines the order in which the notebooks are executed.
%%delivery_run
name: str # Mandatory, name of the project (e.g. navision)
layer: str # Mandatory, layer of the project (e.g. gold, export)
jobs:
- name: notebook_1
type: databricks_notebook
description: description here
settings:
path: ./child
arguments:
key1: value1
- name: notebook_2
type: databricks_notebook
description: description here
settings:
path: ./child
arguments:
key1: value2
Run your parent.py afterwards and fix errors if needed.
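Conceptually, the jobs listed under %%delivery_run are executed in order, each with its own arguments. A hypothetical standalone illustration of that execution model (not the adp.delivery implementation; in Databricks each job would be launched via dbutils.notebook.run):

```python
# Job definitions mirroring the %%delivery_run example above.
jobs = [
    {"name": "notebook_1", "settings": {"path": "./child", "arguments": {"key1": "value1"}}},
    {"name": "notebook_2", "settings": {"path": "./child", "arguments": {"key1": "value2"}}},
]

def run_jobs(jobs, run_notebook):
    # run_notebook is injected so the ordering logic can be shown standalone;
    # on the platform it would wrap a Databricks notebook run.
    return [run_notebook(j["settings"]["path"], j["settings"]["arguments"]) for j in jobs]

# Stub runner that just records which notebook ran with which argument.
order = run_jobs(jobs, lambda path, args: (path, args["key1"]))
# order == [("./child", "value1"), ("./child", "value2")]
```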
5. Write readme.md
Create a short README.md file that briefly states the purpose of the export/gold repository and some other basic information. The goal of the README is to give a colleague some basic background on the data product. If the usage or purpose of your project is immediately clear, you may skip this step.
6. Create .gitattributes file
Create a .gitattributes file. This will normalize the line endings, which is needed because Databricks runs on Linux (and we're also editing on Windows).
The file should contain the following content:
###############################################################################
# Athora: Set behavior for normalizing line endings.
# Code will usually be written in a Databricks environment, so keep the
# settings to a minimum.
# https://git-scm.com/docs/gitattributes
# https://gitattributes.io/api/common%2Ccsharp%2Cweb%2Cvisualstudio
# EvdK: 2022-04-07, initial
###############################################################################
# Auto detect text files and perform line-end normalization
* text=auto
# Script
*.py text diff=python
*.sql text
*.sh text
*.ps1 text eol=crlf
*.yaml text
*.yml text
# Data
*.json text
*.xml text
# Documentation
*.markdown text
*.md text
*.txt text
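As a quick sanity check after normalization, you can verify that a checked-out text file no longer contains Windows-style line endings (except *.ps1, which is deliberately kept as CRLF). A minimal sketch:

```python
def has_crlf(data: bytes) -> bool:
    # True if the raw file content contains Windows-style (CRLF) line endings.
    return b"\r\n" in data

# Normalized Python source should use LF only; PowerShell scripts keep CRLF.
assert not has_crlf(b"import adp.delivery\n")
assert has_crlf(b"Write-Host 'hello'\r\n")
```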
7. Submit Pull Request
After you're satisfied with the content of the repository, create a Pull Request and let a member of the Core Team review your code.
8. Deploy to DEV
After PR completion, your code will automatically be deployed to the development and testing environments. After your code has been deployed, it will show up in the Databricks workspace.
9. Schedule opcon in DEV
Create an OPCON job following the naming convention specified in the requirements (gold-name, export-name).
The job should contain the following command line:
[[DATA_PowerShell]] [[DATA_Start_SDP_Databricks_Notebook]] -notebook_path '/<REPO_NAME>/<LAYER>/parent.py'
10. Run in OPCON DEV
Run your job now in the development environment and check the following things:
Does the data (in gold/export) conform to your expectations?
Is the storage account filled with data?
Have the Databricks databases and tables been created?
Are the views created in the Synapse layer?
11. Repeat 8, 9 and 10 for TST, ACC and PRD
Ask the Core team to create a job in the OPCON production environment. At this point you hand over the code to the Core team, which is responsible for the next steps (12, 13, 14 and 15).
12. Create AAD groups
An AAD group has to be created using the following convention:
AAD_SDP_export_name
AAD_SDP_gold_name
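The convention above can be checked mechanically. A hypothetical helper (not part of the platform tooling) that verifies a group name follows the AAD_SDP_&lt;layer&gt;_&lt;name&gt; pattern:

```python
import re

def is_valid_aad_group(name: str) -> bool:
    # Only the export and gold layers get AAD groups under this convention.
    return re.fullmatch(r"AAD_SDP_(export|gold)_\w+", name) is not None

assert is_valid_aad_group("AAD_SDP_export_lifetime")
assert not is_valid_aad_group("AAD_export_lifetime")
```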
13. Bind AAD group to SQL role in Synapse serverless
A role called "sdp_name_datareader" (e.g. sdp_lifetime_datareader) is automatically created in the Synapse database. The group created in step 12 (e.g. AAD_SDP_export_lifetime) should be assigned to this role. Ask the Core team to do this for you.
Example:
CREATE USER AAD_SDP_export_lifetime FROM EXTERNAL PROVIDER WITH DEFAULT_SCHEMA=[dbo];
ALTER ROLE sdp_lifetime_datareader ADD MEMBER AAD_SDP_export_lifetime;
14. Set ACL for AAD Group
The AAD group (as created in step 12) should have the correct ACL rights on the data lake. Ask the Core team to do this for you.
15. Inform customer
Congratulations. You’ve finished your first export/gold project. Inform users of the export/gold about the awesome things you’ve just accomplished and ask them whether everything works as expected.