.. _delivery quickstart:

Quickstart
==========

1. Create repository
-------------------------

Ask the `Core Team `_ to create a repository under the `AppSmartDataPlatform `_ project. Make sure to follow the naming convention as specified in the :ref:`requirements`. You can clone this repository using `Databricks Repos `_, which allows you to develop and collaborate on your data code fully in Databricks.

2. Create pipeline
-------------------------

Your code will be deployed under the root of the `Databricks Workspace` for each Databricks environment. To automatically deploy the code after a Pull Request (and commit on master), we use Azure Pipelines. Azure Pipelines is controlled through `azure-pipelines.yaml` files. Ask the Core Team to create a pipeline. They will add an `azure-pipelines.yaml` file to the root of your repository with the following content:

.. code-block:: yaml

    trigger:
    - main

    pool:
      name: DataLinux

    resources:
      repositories:
        - repository: templates
          type: git
          name: AppSmartDataPlatform/pipeline-templates

    stages:
    - template: delivery-sdp/template.yaml@templates

This pipeline refers to a centralised template in the `pipeline-templates `_ repository. Finally, ask a member of the `Core Team `_ to activate the pipeline under the `AppSmartDataPlatform `_ project.

Now, when a PR is completed, your code will be automatically deployed to the following Databricks workspaces:

- `dbr-sdp-nubulo-dev-01 `_
- `dbr-sdp-nubulo-tst-01 `_
- `dbr-sdp-nubulo-acc-01 `_
- `dbr-sdp-nubulo-prd-01 `_

3. Create code for each table
---------------------------------

Now you can create the code for your data products. You can create as many notebooks as you see fit. Each notebook should be placed in a subdirectory called `gold` or `export` of your repository. Each notebook should define its inputs (sources) and outputs (sinks) using a YAML definition. The YAML definition can also contain validations or extra metadata (see the YAML reference).
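For illustration only, a source definition with an extra validation block might look like the sketch below. The ``validations`` key and its fields are assumptions, not the confirmed schema — consult the YAML reference for the actual field names.

.. code-block:: python3

    # python
    %%delivery_load
    # Hypothetical sketch: a source with a validation attached.
    # The `validations` structure below is assumed for illustration;
    # check the YAML reference for the real schema.
    - uri: silver.system_name.table_name
      validations:
        - column: key
          check: not_null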
The ``adp.delivery`` package will read this YAML definition and deploy the data to: the storage account, the Databricks database and Synapse serverless views. :ref:`This chapter ` goes into detail about this.

Example (`./export/dbo_saex_g_l_entry`):

.. code-block:: python3

    # python
    # Load the adp.delivery package. This will enable you to use the
    # %%delivery_load and %%delivery_write magics.
    import adp.delivery

.. code-block:: python3

    # python
    %%delivery_load
    # Retrieves `dbo_saex_g_l_entry` from the storage account
    # and creates a temporary view called `silver.navision.dbo_saex_g_l_entry`
    - uri: silver.navision.dbo_saex_g_l_entry

.. code-block:: sql

    -- sql
    CREATE TEMPORARY VIEW `export.system_name.table_name` AS
    -- Your custom transformation here
    SELECT *
    FROM `silver.navision.dbo_saex_g_l_entry`

.. code-block:: python3

    # python
    %%delivery_write
    # This will export your view to ADLS, Databricks Hive and Synapse
    - uri: export.system_name.table_name

For gold, you can also add relationships in the FCT notebook:

.. code-block:: python3

    # python
    %%delivery_write
    sinks:
    - uri: gold.system_name.FCT_name
      columns:
      - name: key
        relationships:
        - uri: gold.system_name.DIM_name.key
      - name: DimTijdsintervalID
        relationships:
        - uri: gold.system_name.DIM_name.key

4. Create parent.py and run it
---------------------------------

Create a file called ``parent.py``. This file will be called by OPCON and defines the order of execution for each notebook.

.. code-block:: python3

    %%delivery_run
    name: str   # Mandatory, name of the project (e.g. navision)
    layer: str  # Mandatory, layer of the project (e.g. gold, export)
    jobs:
    - name: notebook_1
      type: databricks_notebook
      description: description here
      settings:
        path: ./child
        arguments:
          key1: value1
    - name: notebook_2
      type: databricks_notebook
      description: description here
      settings:
        path: ./child
        arguments:
          key1: value2

Run your `parent.py` afterwards and fix errors if needed.
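As a sketch of the other side: a child notebook (the ``./child`` path above) could pick up the ``arguments`` it receives via Databricks widgets. ``dbutils.widgets.get`` is the standard Databricks mechanism for notebook parameters, but confirm that this matches how ``adp.delivery`` actually passes arguments:

.. code-block:: python3

    # python
    # In the child notebook: read an argument passed from parent.py.
    # `dbutils` is provided by the Databricks notebook runtime; the
    # widget name `key1` matches the arguments in the example above.
    key1 = dbutils.widgets.get("key1")
    print(f"Running with key1={key1}")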
5. Write readme.md
-----------------------

Create a short `README.md` file in which you briefly state the purpose of the export/gold repository and some other basic info. The goal of the readme is to give a colleague some basic background on the data product. If the usage or purpose of your project is immediately clear, you may skip this step.

6. Create ``.gitattributes`` file
-------------------------------------

Create a ``.gitattributes`` file. This will normalize the line endings, which is needed because Databricks runs on Linux (and we're also editing in Windows). The file should contain the following content:

.. code-block:: text

    ###############################################################################
    # Athora: Set behavior for normalizing line endings.
    # Code will usually be written in a Databricks environment, so keep the
    # settings to a minimum.
    # https://git-scm.com/docs/gitattributes
    # https://gitattributes.io/api/common%2Ccsharp%2Cweb%2Cvisualstudio
    # EvdK: 2022-04-07, initial
    ###############################################################################

    # Auto detect text files and perform line-end normalization
    * text=auto

    # Script
    *.py text diff=python
    *.sql text
    *.sh text
    *.ps1 text eol=crlf
    *.yaml text
    *.yml text

    # Data
    *.json text
    *.xml text

    # Documentation
    *.markdown text
    *.md text
    *.txt text

7. Submit Pull Request
----------------------------

After you're satisfied with the content of the repository, create a Pull Request and let a member of the `Core Team `_ review your code.

8. Deploy to DEV
-----------------------

After PR completion, your code will automatically be deployed to the ``development`` and ``testing`` environments. After your code has been deployed, it will show up in the Databricks workspace.

9. Schedule OPCON in DEV
-----------------------------

Create an OPCON job following the naming convention as specified in the :ref:`requirements` (gold-name, export-name).
The job should contain the following command line:

.. code-block:: powershell

    [[DATA_PowerShell]] [[DATA_Start_SDP_Databricks_Notebook]] -notebook_path '///parent.py'

10. Run in OPCON DEV
--------------------------

Run your job now in the development environment and check the following things:

- Does the data (in gold/export) conform to your expectations?
- Is the storage account filled with data?
- Have the Databricks databases and tables been created?
- Are the views created in the Synapse layer?

11. Repeat 8, 9 and 10 for TST, ACC and PRD
------------------------------------------------

Ask the Core team to create a job in the OPCON production environment. At this moment you'll hand over the code to the Core team. The Core team is responsible for the next steps (12, 13, 14 and 15).

12. Create AAD groups
---------------------------

An AAD group has to be created using the following convention:

- AAD_SDP_export_name
- AAD_SDP_gold_name

13. Bind AAD group to SQL role in Synapse serverless
---------------------------------------------------------

A role called "sdp_name_datareader" (e.g. sdp_lifetime_datareader) is automatically created in the Synapse database. The group created in step 12 (e.g. AAD_SDP_export_lifetime) should be assigned to this role. Ask the Core team to do this for you. Example:

.. code-block:: t-sql

    CREATE USER AAD_SDP_export_lifetime FROM EXTERNAL PROVIDER WITH DEFAULT_SCHEMA=[dbo];
    ALTER ROLE sdp_lifetime_datareader ADD MEMBER AAD_SDP_export_lifetime;

14. Set ACL for AAD group
-------------------------------

The AAD group (as created in step 12) should have the correct ACL rights on the data lake. Ask the Core team to do this for you.

15. Inform customer
--------------------------

Congratulations, you've finished your first export/gold project. Inform the users of the export/gold about the awesome things you've just accomplished and ask them whether everything works as expected.