Quickstart
1. Create repository
Ask the Core Team to create a repository under the AppSmartDataPlatform project. Make sure to follow the naming convention specified in the requirements. You can clone this repository using Databricks Repos, which allows you to develop and collaborate on your data code entirely in Databricks.
2. Create pipeline
Your code will be deployed under the root of the Databricks Workspace for each Databricks environment. To deploy the code automatically after a Pull Request (and commit on master), we use Azure Pipelines, which is controlled by azure-pipelines.yaml files. Ask the Core Team to create a pipeline. They will add an azure-pipelines.yaml file to the root of your repository with the following content:
trigger:
- main
pool:
name: DataLinux
resources:
repositories:
- repository: templates
type: git
name: AppSmartDataPlatform/pipeline-templates
stages:
- template: delivery-sdp/template.yaml@templates
This pipeline will refer to a centralised template under the pipeline-templates repository. Finally, ask a member of the Core Team to activate the pipeline under the AppSmartDataPlatform project.
Now, when a PR is completed, your code will be automatically deployed to the following Databricks workspaces:
3. Create code for each table
Now you can create the code for your data products. You can create as many notebooks as you see fit. Each notebook should be placed in a subdirectory of your repository called gold or export, and should define its inputs (sources) and outputs (sinks) using a YAML definition. The YAML definition can also contain validations or extra metadata (see YAML reference). The adp.delivery package reads this YAML definition and deploys the data to the storage account, the Databricks database and Synapse serverless views. This chapter goes into detail about this.
Example (./export/dbo_saex_g_l_entry):
# python
# Load the adp.delivery package. This will enable you to use %%delivery_load and %%delivery_write magics.
import adp.delivery
# python
%%delivery_load
# Retrieves `dbo_saex_g_l_entry` from the storage account
# and creates a temporary view called `silver.navision.dbo_saex_g_l_entry`
- uri: silver.navision.dbo_saex_g_l_entry
-- sql
CREATE TEMPORARY VIEW `export.system_name.table_name`
AS
-- Your custom transformation here
SELECT
*
FROM `silver.navision.dbo_saex_g_l_entry`
# python
%%delivery_write
# This will export your view to ADLS, Databricks Hive and Synapse
- uri: export.system_name.table_name
For gold, you can also add relationships in the FCT notebook:
# python
%%delivery_write
sinks:
- uri: gold.system_name.FCT_name
columns:
- name: key
relationships:
- uri: gold.system_name.DIM_name.key
- name: DimTijdsintervalID
relationships:
- uri: gold.system_name.DIM_name.key
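As the example above shows, each relationship points at a column of another table using a four-part URI (layer.system.table.column), whereas sink URIs have three parts. A minimal sketch of that naming pattern, using a hypothetical helper that is not part of adp.delivery:

```python
def is_valid_relationship_uri(uri: str) -> bool:
    # A relationship URI has exactly four non-empty parts:
    # layer.system.table.column
    parts = uri.split(".")
    return len(parts) == 4 and all(parts)

# The relationship URI from the example above has four parts;
# a sink URI such as export.system_name.table_name has only three.
assert is_valid_relationship_uri("gold.system_name.DIM_name.key")
assert not is_valid_relationship_uri("export.system_name.table_name")
```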
4. Create parent.py and run it
Create a file called parent.py. This file will be called by OPCON and defines the order in which the notebooks are executed.
%%delivery_run
name: str # Mandatory, name of the project (e.g. navision)
layer: str # Mandatory, layer of the project (e.g. gold, export)
jobs:
- name: notebook_1
type: databricks_notebook
description: description here
settings:
path: ./child
arguments:
key1: value1
- name: notebook_2
type: databricks_notebook
description: description here
settings:
path: ./child
arguments:
key1: value2
Run your parent.py afterwards and fix errors if needed.
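Conceptually, the jobs listed under %%delivery_run are executed in order, each with its own arguments. A hypothetical standalone illustration of that execution model (not the adp.delivery implementation; in Databricks each job would be launched via dbutils.notebook.run):

```python
# Job definitions mirroring the %%delivery_run example above.
jobs = [
    {"name": "notebook_1", "settings": {"path": "./child", "arguments": {"key1": "value1"}}},
    {"name": "notebook_2", "settings": {"path": "./child", "arguments": {"key1": "value2"}}},
]

def run_jobs(jobs, run_notebook):
    # run_notebook is injected so the ordering logic can be shown standalone;
    # on the platform it would wrap a Databricks notebook run.
    return [run_notebook(j["settings"]["path"], j["settings"]["arguments"]) for j in jobs]

# Stub runner that just records which notebook ran with which argument.
order = run_jobs(jobs, lambda path, args: (path, args["key1"]))
# order == [("./child", "value1"), ("./child", "value2")]
```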
5. Write readme.md
Create a short README.md file that briefly states the purpose of the export/gold repository and some other basic information. The goal of the README is to give a colleague some basic background on the data product. If the usage or purpose of your project is immediately clear, you may skip this step.
6. Create .gitattributes file
Create a .gitattributes file. This will normalize the line endings, which is needed because Databricks runs on Linux (and we're also editing on Windows).
The file should contain the following content:
###############################################################################
# Athora: Set behavior for normalizing line endings.
# Code will usually be written in a Databricks environment, so keep the
# settings to a minimum.
# https://git-scm.com/docs/gitattributes
# https://gitattributes.io/api/common%2Ccsharp%2Cweb%2Cvisualstudio
# EvdK: 2022-04-07, initial
###############################################################################
# Auto detect text files and perform line-end normalization
* text=auto
# Script
*.py text diff=python
*.sql text
*.sh text
*.ps1 text eol=crlf
*.yaml text
*.yml text
# Data
*.json text
*.xml text
# Documentation
*.markdown text
*.md text
*.txt text
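As a quick sanity check after normalization, you can verify that a checked-out text file no longer contains Windows-style line endings (except *.ps1, which is deliberately kept as CRLF). A minimal sketch:

```python
def has_crlf(data: bytes) -> bool:
    # True if the raw file content contains Windows-style (CRLF) line endings.
    return b"\r\n" in data

# Normalized Python source should use LF only; PowerShell scripts keep CRLF.
assert not has_crlf(b"import adp.delivery\n")
assert has_crlf(b"Write-Host 'hello'\r\n")
```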
7. Submit Pull Request
After you're satisfied with the content of the repository, create a Pull Request and let a member of the Core Team review your code.
8. Deploy to DEV
After PR completion, your code will automatically be deployed to the development and testing environments. After your code has been deployed, it will show up in the Databricks workspace.
9. Schedule opcon in DEV
Create an OPCON job following the naming convention specified in the requirements (gold-name, export-name).
The job should contain the following command line:
[[DATA_PowerShell]] [[DATA_Start_SDP_Databricks_Notebook]] -notebook_path '/<REPO_NAME>/<LAYER>/parent.py'
10. Run in OPCON DEV
Run your job now in the development environment and check the following things:
Does the data (in gold/export) conform to your expectations?
Is the storage account filled with data?
Have the Databricks databases and tables been created?
Are the views created in the Synapse layer?
11. Repeat 8, 9 and 10 for TST, ACC and PRD
Ask the Core team to create a job in the OPCON production environment. At this point you hand over the code to the Core team, which is responsible for the next steps (12, 13, 14 and 15).
12. Create AAD groups
An AAD group has to be created using the following convention:
AAD_SDP_export_name
AAD_SDP_gold_name
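The convention above can be checked mechanically. A hypothetical helper (not part of the platform tooling) that verifies a group name follows the AAD_SDP_&lt;layer&gt;_&lt;name&gt; pattern:

```python
import re

def is_valid_aad_group(name: str) -> bool:
    # Only the export and gold layers get AAD groups under this convention.
    return re.fullmatch(r"AAD_SDP_(export|gold)_\w+", name) is not None

assert is_valid_aad_group("AAD_SDP_export_lifetime")
assert not is_valid_aad_group("AAD_export_lifetime")
```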
13. Bind AAD group to SQL role in Synapse serverless
A role called "sdp_name_datareader" (e.g. sdp_lifetime_datareader) is automatically created in the Synapse database. The group created in step 12 (e.g. AAD_SDP_export_lifetime) should be assigned to this role. Ask the Core team to do this for you.
Example:
CREATE USER AAD_SDP_export_lifetime FROM EXTERNAL PROVIDER WITH DEFAULT_SCHEMA=[dbo];
ALTER ROLE sdp_lifetime_datareader ADD MEMBER AAD_SDP_export_lifetime;
14. Set ACL for AAD Group
The AAD group (as created in step 12) should have the correct ACL rights on the data lake. Ask the Core team to do this for you.
15. Inform customer
Congratulations. You’ve finished your first export/gold project. Inform users of the export/gold about the awesome things you’ve just accomplished and ask them whether everything works as expected.