Breaking Changes

General Changes

  • sdp is now a single package instead of four (core, delivery, ingest, maintain)

  • Each function/class that is not part of the public API is now private (prefixed with _)

  • Pydantic now used for all dataclasses and data validation

  • UV is now the default project manager

  • Each module is now unit tested, and integration tested where applicable

  • Package can now be developed outside of the Athora network and is independent from Athora network resources

  • Azure Pipelines are updated, restructured and made more modular

  • A .devcontainer is now included, which bootstraps a development environment

  • Package is fully integration tested on a Databricks Dedicated cluster (Unity Catalog Compatible)

  • The package now uses only open-source packages and is fully independent of any Spark vendor (Databricks)
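Several bullets above mention the move to Pydantic. As a minimal sketch of what Pydantic-style validation buys (TableConfig is a hypothetical model for illustration, not an sdp class; assumes Pydantic is installed):

```python
# Minimal sketch of the Pydantic validation style now used throughout sdp.
# TableConfig is a hypothetical model, NOT an actual sdp class.
from pydantic import BaseModel, ValidationError

class TableConfig(BaseModel):
    name: str
    partitions: int = 1

# Compatible input is coerced to the declared type ("4" -> 4)
cfg = TableConfig(name="customers", partitions="4")

# Invalid input is rejected at construction time
try:
    TableConfig(name="customers", partitions="not-a-number")
    raised = False
except ValidationError:
    raised = True
```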

CORE

  • sdp.core.adls

    • with_path_converted_to_base is now private

    • base_to_abfss is now private

    • abfss_to_base is now private

    • abfss_to_blob_http is now private

    • CustomCredential is now private

  • sdp.core.authenticate

    • Now only supports authentication through the SPN provided by client_id and client_secret in settings

    • Removed REFRESH_TOKEN

    • Removed CLIENT

    • Removed set_refresh_token

    • Removed get_refresh_token

  • sdp.core.backup - Moved AuthenticationError to sdp.core.exceptions

  • sdp.core.constants

    • sdp.core.constants.ENV -> still works correctly

    • sdp.core.constants.ENV.short() -> still works correctly

    • sdp.core.constants.TENANT_ID -> sdp.core.settings.Settings().CORE.TENANT_ID

    • sdp.core.constants.IS_EXECUTING_IN_DATABRICKS -> sdp.core.settings.IS_EXECUTING_IN_DATABRICKS

    • Removed sdp.core.constants.PROXY_ADDRESS

    • Removed sdp.core.constants.PROXY_PORT

    • Removed sdp.core.constants.bootstrap_env

  • sdp.core.databricks

    • Moved sdp.core.databricks.dbsql to sdp.core.dbsql

    • Removed sdp.core.databricks.unitycatalog as it was unused

  • sdp.core.dbsql

    • Now uses Spark SQL to execute queries instead of the Thrift client

    • Deprecated the build_from_appsettings method (still callable, but now a no-op)

  • sdp.core.exceptions

    • Removed AppSettingsError

    • Removed PackageNotCompatibleError

  • sdp.core.log

    • configure_structlog is now private

    • DatabricksFormatter is now private

    • Removed DatabricksHandler

    • LogAnalyticsHandler is now private

    • setup_for_databricks is now private

  • sdp.core.loganalytics

    • LogAnalyticsClient.build_from_appsettings is now private

    • LogAnalyticsClient now uses the intended authentication method; review the arguments of the new initialization method

  • sdp.core.secrets - Now raises an explicit exception when no secret is found

  • sdp.core.settings

    • Environment settings are now prefixed with ADP__ instead of SDP__

    • Nesting in settings is now denoted with '.' instead of '_'. For example, Settings().SDP__INGEST__STORAGE_ACCOUNT_BASE is now Settings().SDP.INGEST.STORAGE_ACCOUNT_BASE

    • sdp.core.APPSETTINGS -> sdp.core.Settings()

    • Settings now load when called, not when imported

    • sdp.core.Settings().CORE.CLIENT_SECRET is now required

    • Removed sdp.core.Settings().CORE.POWERBI

    • Removed sdp.core.Settings().CORE.DATABRICKS

    • Removed sdp.core.Settings().CORE.DBSQL

    • Removed sdp.core.Settings().CORE.LOGANALYTICS.WORKSPACE_ID

    • Removed sdp.core.Settings().CORE.LOGANALYTICS.SHARED_KEY

    • Removed sdp.core.Settings().CORE.LOGANALYTICS.LOG_TYPE

    • Removed sdp.core.Settings().INGEST.STORAGE_ACCOUNT_NAME_PRD

    • Removed Unity Catalog settings

    • Added sdp.core.Settings().DELIVERY.STORAGE_ACCOUNT_NAME

    • Added sdp.core.Settings().DELIVERY.LOAD_ENV

    • Added sdp.core.Settings().CORE.SPARK to control usage of spark_session

    • Added sdp.core.Settings().CORE.LOGANALYTICS.STREAM_NAME

    • Added sdp.core.Settings().CORE.LOGANALYTICS.ENDPOINT_URI

    • Added sdp.core.Settings().CORE.LOGANALYTICS.DCR_ID

    • Added mandatory sdp.core.Settings().INGEST.STORAGE_ACCOUNT_NAME

  • Removed sdp.core.mssql module as it was unused

  • sdp.core.spark

    • Now compatible with Spark Remote, which tremendously increases developer happiness

    • create_sparksession now accepts an optional argument type ('local'/'remote') to control whether Spark Remote is used

    • list_of_dict_to_dataframe is now protected and should not be used (renamed to _list_of_dict_to_dataframe)

    • dict_to_dataframe is now protected

    • Removed json_to_dataframe
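A hedged migration sketch pulling together the constants, settings, and spark changes above. It is illustrative only: it assumes the sdp package is importable and that the attribute and argument names are exactly as listed in this changelog.

```python
# --- constants -> settings ---
# Before:
#   from sdp.core import constants
#   tenant = constants.TENANT_ID
#   in_dbx = constants.IS_EXECUTING_IN_DATABRICKS
# After:
from sdp.core.settings import Settings, IS_EXECUTING_IN_DATABRICKS

settings = Settings()  # loaded on call, not at import time
tenant = settings.CORE.TENANT_ID
in_dbx = IS_EXECUTING_IN_DATABRICKS

# --- nested settings access ---
# Before: Settings().SDP__INGEST__STORAGE_ACCOUNT_BASE
base = settings.SDP.INGEST.STORAGE_ACCOUNT_BASE

# --- spark sessions ---
from sdp.core.spark import create_sparksession

spark = create_sparksession(type="local")   # plain local session
spark = create_sparksession(type="remote")  # Spark Remote session
```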

DELIVERY

No breaking changes are made to the YAML structure. The changes below are made to the codebase. These changes may break existing code that uses the SDP Delivery module. Most users should not be affected, as they instead call this package through YAML configuration files.

  • sdp.delivery.databricks is removed

  • sdp.delivery.dataclasses now uses only Pydantic

  • sdp.delivery.data_io (used to be sdp.delivery.io)

    • Renamed sdp.delivery.io to sdp.delivery.data_io, as the debugger confuses io with the Python standard library io

    • sdp.delivery.io.SDPTable is moved to sdp.delivery.dataclasses

    • SDPTable is now a Pydantic BaseModel

    • The following methods are removed from SDPTable:

      • writer -> use sdp.delivery.data_io.write_table instead

      • reader -> use sdp.delivery.data_io.load_table instead

      • from_yaml -> use standard Pydantic parsing methods

      • to_dict -> use standard Pydantic methods

    • Removed the sdp.delivery.io.DataFrameCreator class; its methods are moved to sdp.delivery.data_io as protected functions

    • Removed all remaining classes in sdp.delivery.data_io and refactored them to functions

  • sdp.delivery.gold

    • Removed _simplify_rename_dictionary

    • Removed create_query

    • Removed simplify_columns

    • Removed the centralhash/udf_centralhash function -> use udf_hashkey instead

    • Removed the to_string/udf_to_string function

    • Simplified the hashkey/udf_hashkey function

    • Removed now/udf_now

  • sdp.delivery.jobs

    • Moved all dataclasses to sdp.delivery.dataclasses

    • Removed JobList

    • sdp.delivery.jobs.JobExecutor is now called sdp.delivery.dataclasses.JobWorkflow

    • JobExecutor.run is now run_all_jobs(job_workflow: JobWorkflow)

    • DatabricksJob.run is now run_databricks_job(job: DatabricksJob)

  • sdp.delivery.rollback

    • Added ubfs_to_uc

    • No other breaking changes expected

  • sdp.delivery.transformation

    • Removed the TableList class

    • Removed the SourceList class: instead of SourceList.load_all, loop over a list of tables with sdp.delivery.data_io.load_table

    • Removed the SinkList class: instead of SinkList.write_all, use sdp.delivery.data_io.write_tables

    • Removed the register_cell_magics function; moved to __init__.py

  • Removed sdp.delivery.yaml, all validation now uses Pydantic directly
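The delivery-side renames above can be summarised in one hedged sketch. Function signatures are shown schematically; the changelog does not spell out their exact parameters, so treat the call shapes (and the module locations of run_all_jobs and run_databricks_job) as assumptions.

```python
from sdp.delivery.dataclasses import SDPTable, JobWorkflow, DatabricksJob
from sdp.delivery.data_io import load_table, write_table, write_tables
from sdp.delivery.jobs import run_all_jobs, run_databricks_job

# SDPTable is now a Pydantic model: build it with standard Pydantic parsing
# (model_validate is the Pydantic v2 name; parse_obj in v1).
table = SDPTable.model_validate(config_dict)  # config_dict: dict parsed from your YAML

# Before: table.reader(...) / table.writer(...)
df = load_table(table)
write_table(table)

# Before: SourceList.load_all() / SinkList.write_all()
frames = [load_table(t) for t in tables]
write_tables(tables)

# Before: JobExecutor.run() / DatabricksJob.run()
run_all_jobs(job_workflow)    # job_workflow: JobWorkflow
run_databricks_job(job)       # job: DatabricksJob
```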

INGEST

  • Changes in YAML structure:

    • file_read_settings is now removed in favor of spark_read_options
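A before/after sketch of that YAML change. Only the top-level key rename is taken from the changelog; the options inside the block (delimiter, header) are illustrative.

```yaml
# Before
file_read_settings:
  delimiter: ";"
  header: true

# After: the same options move under spark_read_options
spark_read_options:
  delimiter: ";"
  header: true
```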

The changes below are made to the codebase. These changes may break existing code that uses the SDP Ingest module. Most users should not be affected, as they instead call this package through YAML configuration files.

  • sdp.ingest.constants - Moved to sdp.ingest.dataclasses

  • sdp.ingest.bootstrap - Removed, as it was unused

  • sdp.ingest.ingest

    • Removed from_adls_yaml

    • Removed from_local_yaml

    • Made set_spark_config private

    • Made unset_spark_config private

    • backup_schema is now private

    • Don't use quotes around password values: use password: ${{ secrets.Source-Prik-pwd }} instead of password: '${{ secrets.Source-Prik-pwd }}'

  • sdp.ingest.schema - Made all functions private

  • sdp.ingest.silver.gbo.append - All internal functions are now private

  • sdp.ingest.silver.gbo.checks - All Enums and Dataclasses are now Pydantic BaseModels and are moved to sdp.ingest.dataclasses - All internal functions are now private

  • sdp.ingest.silver.gbo.logging - All functions are now private

  • sdp.ingest.sources - Moved all dataclasses to sdp.ingest.dataclasses

  • sdp.ingest.yaml - Removed entirely, all YAML related functionality is now in sdp.ingest.dataclasses

MAINTAIN

This module is only used for maintaining the SDP and is only called from YAML magic cells, so breaking changes here are less critical.

  • Completely restructured module

  • Migrated all dataclasses to Pydantic and moved to sdp.maintain.dataclasses

  • Made all functions private unless they are explicitly part of the public API