Breaking Changes
General Changes
- sdp is now a single package instead of four (core, delivery, ingest, maintain)
- Each function/class that is not part of the public API is now private (prefixed with `_`)
- Pydantic is now used for all dataclasses and data validation
- UV is now the default project manager
- Each module is now unit tested, and integration tested where applicable
- The package can now be developed outside the Athora network and is independent of Athora network resources
- Azure Pipelines have been updated, restructured, and made more modular
- A `.devcontainer` is now included, which bootstraps a development environment
- The package is fully integration tested on a Databricks Dedicated cluster (Unity Catalog compatible)
- The package now uses only open-source packages and is fully independent of any Spark vendor (such as Databricks)
CORE
sdp.core.adls
- `with_path_converted_to_base` is now private
- `base_to_abfss` is now private
- `abfss_to_base` is now private
- `abfss_to_blob_http` is now private
- `CustomCredential` is now private
sdp.core.authenticate
- Now only supports authentication through the SPN provided by `client_id` and `client_secret` in settings
- Removed `REFRESH_TOKEN`
- Removed `CLIENT`
- Removed `set_refresh_token`
- Removed `get_refresh_token`
sdp.core.backup
- Moved `AuthenticationError` to `sdp.core.exceptions`
sdp.core.constants
- `sdp.core.constants.ENV` -> still works correctly
- `sdp.core.constants.ENV.short()` -> still works correctly
- `sdp.core.constants.TENANT_ID` -> `sdp.core.settings.Settings().CORE.TENANT_ID`
- `sdp.core.constants.IS_EXECUTING_IN_DATABRICKS` -> `sdp.core.settings.IS_EXECUTING_IN_DATABRICKS`
- Removed `sdp.core.constants.PROXY_ADDRESS`
- Removed `sdp.core.constants.PROXY_PORT`
- Removed `sdp.core.constants.bootstrap_env`
sdp.core.databricks
- Moved `sdp.core.databricks.dbsql` to `sdp.core.dbsql`
- Removed `sdp.core.databricks.unitycatalog`, as it was unused
sdp.core.dbsql
- Now uses Spark SQL to execute queries instead of the Thrift client
- Deprecated the `build_from_appsettings` method (still callable, but now a no-op)
sdp.core.exceptions
- Removed `AppSettingsError`
- Removed `PackageNotCompatibleError`
sdp.core.log
- `configure_structlog` is now private
- `DatabricksFormatter` is now private
- Removed `DatabricksHandler`
- `LogAnalyticsHandler` is now private
- `setup_for_databricks` is now private
sdp.core.loganalytics
- `LogAnalyticsClient.build_from_appsettings` is now private
- `LogAnalyticsClient` now uses the intended authentication method; review the arguments of the new initialization method
sdp.core.secrets
- Now raises an explicit exception when no secret is found
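The new failure mode can be sketched like this (the exception name `SecretNotFoundError` and the lookup function are illustrative assumptions, not the package's actual names):

```python
class SecretNotFoundError(KeyError):
    """Raised when a requested secret does not exist (name is illustrative)."""

# Stand-in secret store; the real package resolves secrets from its backend.
_SECRETS = {"db-password": "hunter2"}

def get_secret(name: str) -> str:
    # Previously a missing secret could fail opaquely; now an explicit,
    # descriptive exception is raised instead.
    try:
        return _SECRETS[name]
    except KeyError:
        raise SecretNotFoundError(f"no secret found for {name!r}") from None
```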
sdp.core.settings
- Environment settings are now prefixed with `ADP__` instead of `SDP__`
- Nesting in settings is now denoted with `.` instead of `__`. For example, `Settings().SDP__INGEST__STORAGE_ACCOUNT_BASE` is now `Settings().SDP.INGEST.STORAGE_ACCOUNT_BASE`
- `sdp.core.APPSETTINGS` -> `sdp.core.Settings()`
- Settings are now loaded when called, not when imported
- `sdp.core.Settings().CORE.CLIENT_SECRET` is now required
- Removed `sdp.core.Settings().CORE.POWERBI`
- Removed `sdp.core.Settings().CORE.DATABRICKS`
- Removed `sdp.core.Settings().CORE.DBSQL`
- Removed `sdp.core.Settings().CORE.LOGANALYTICS.WORKSPACE_ID`
- Removed `sdp.core.Settings().CORE.LOGANALYTICS.SHARED_KEY`
- Removed `sdp.core.Settings().CORE.LOGANALYTICS.LOG_TYPE`
- Removed `sdp.core.Settings().INGEST.STORAGE_ACCOUNT_NAME_PRD`
- Removed Unity Catalog settings
- Added `sdp.core.Settings().DELIVERY.STORAGE_ACCOUNT_NAME`
- Added `sdp.core.Settings().DELIVERY.LOAD_ENV`
- Added `sdp.core.Settings().CORE.SPARK` to control usage of `spark_session`
- Added `sdp.core.Settings().CORE.LOGANALYTICS.STREAM_NAME`
- Added `sdp.core.Settings().CORE.LOGANALYTICS.ENDPOINT_URI`
- Added `sdp.core.Settings().CORE.LOGANALYTICS.DCR_ID`
- Added mandatory `sdp.core.Settings().INGEST.STORAGE_ACCOUNT_NAME`
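The new nested attribute access can be illustrated with a stdlib stand-in (the real `Settings` is loaded from `ADP__`-prefixed environment variables at call time; the value here is a dummy):

```python
from types import SimpleNamespace

def Settings():
    # Stand-in for sdp.core.Settings(); in the real package this loads
    # lazily on each call, rather than at import time.
    return SimpleNamespace(
        SDP=SimpleNamespace(
            INGEST=SimpleNamespace(STORAGE_ACCOUNT_BASE="https://example")
        )
    )

# Old: Settings().SDP__INGEST__STORAGE_ACCOUNT_BASE
# New: nested attribute access
print(Settings().SDP.INGEST.STORAGE_ACCOUNT_BASE)
```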
sdp.core.mssql
- Removed entirely, as it was unused
sdp.core.spark
- Now compatible with Spark Remote, which tremendously increases developer happiness
- `create_sparksession` now accepts an optional argument `type` (`'local'`/`'remote'`) to control whether Spark Remote is used
- `list_of_dict_to_dataframe` is now protected and should not be used (renamed to `_list_of_dict_to_dataframe`)
- `dict_to_dataframe` is now protected
- Removed `json_to_dataframe`
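The new `type` argument can be sketched as follows (the dispatch logic is an assumption; the real function builds a PySpark session):

```python
def create_sparksession(type: str = "local"):
    # Hypothetical sketch of the new signature: 'local' builds a classic
    # in-process session, 'remote' connects via Spark Remote.
    if type not in ("local", "remote"):
        raise ValueError("type must be 'local' or 'remote'")
    # The real implementation returns a SparkSession; we return a tag
    # here so the sketch stays runnable without pyspark installed.
    return f"spark-session[{type}]"

print(create_sparksession("remote"))
```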
DELIVERY
No breaking changes were made to the YAML structure. The changes below apply to the codebase and may break existing code that uses the SDP Delivery module. Most users should not be affected, as they call this package through YAML configuration files.
sdp.delivery.databricks
- Removed entirely
sdp.delivery.dataclasses
- Now uses only Pydantic
sdp.delivery.data_io (formerly sdp.delivery.io)
- Renamed `sdp.delivery.io` to `sdp.delivery.data_io`, because debuggers confuse it with the Python standard library module `io`
- `sdp.delivery.io.SDPTable` moved to `sdp.delivery.dataclasses`
- `SDPTable` is now a Pydantic `BaseModel`
- The following methods are removed from `SDPTable`:
  - `writer` -> use `sdp.delivery.data_io.write_table` instead
  - `reader` -> use `sdp.delivery.data_io.load_table` instead
  - `from_yaml` -> use standard Pydantic parsing methods
  - `to_dict` -> use standard Pydantic methods
- Removed the `sdp.delivery.io.DataFrameCreator` class; its methods moved to `sdp.delivery.data_io` as protected functions
- Removed all remaining classes in `sdp.delivery.data_io` and refactored them into functions
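Since `SDPTable` is now a Pydantic `BaseModel`, the removed `from_yaml`/`to_dict` helpers map onto standard Pydantic (v2) methods. A minimal sketch, with illustrative field names rather than the real `SDPTable` schema:

```python
from pydantic import BaseModel

class SDPTable(BaseModel):
    # Field names are assumptions for illustration only.
    name: str
    layer: str

cfg = {"name": "customers", "layer": "gold"}  # e.g. a parsed YAML mapping
table = SDPTable.model_validate(cfg)          # replaces the removed from_yaml
print(table.model_dump())                     # replaces the removed to_dict
```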
sdp.delivery.gold
- Removed `_simplify_rename_dictionary`
- Removed `create_query`
- Removed `simplify_columns`
- Removed the `centralhash`/`udf_centralhash` functions -> use `udf_hashkey` instead
- Removed the `to_string`/`udf_to_string` functions
- Simplified the `hashkey`/`udf_hashkey` functions
- Removed `now`/`udf_now`
sdp.delivery.jobs
- Moved all dataclasses to `sdp.delivery.dataclasses`
- Removed `JobList`
- `sdp.delivery.jobs.JobExecutor` is now called `sdp.delivery.dataclasses.JobWorkflow`
- `JobExecutor.run` is now `run_all_jobs(job_workflow: JobWorkflow)`
- `DatabricksJob.run` is now `run_databricks_job(job: DatabricksJob)`
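The move from methods to free functions can be sketched like this (the dataclass fields and return values are illustrative; the real functions operate on the package's Pydantic models and trigger actual runs):

```python
from dataclasses import dataclass, field

@dataclass
class DatabricksJob:
    name: str

@dataclass
class JobWorkflow:  # formerly JobExecutor
    jobs: list = field(default_factory=list)

def run_databricks_job(job: DatabricksJob) -> str:
    # Replaces DatabricksJob.run; here it just reports what it would run.
    return f"ran {job.name}"

def run_all_jobs(job_workflow: JobWorkflow) -> list:
    # Replaces JobExecutor.run.
    return [run_databricks_job(job) for job in job_workflow.jobs]

print(run_all_jobs(JobWorkflow(jobs=[DatabricksJob("etl")])))
```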
sdp.delivery.rollback
- Added `ubfs_to_uc`
- No other potential breaking changes
sdp.delivery.transformation
- Removed the `TableList` class
- Removed the `SourceList` class: instead of `SourceList.load_all`, loop over a list of tables with `sdp.delivery.data_io.load_table`
- Removed the `SinkList` class: instead of `SinkList.write_all`, use `sdp.delivery.data_io.write_tables`
- Removed the `register_cell_magics` function; moved to `__init__.py`
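The `SourceList.load_all` replacement is simply a loop over `load_table`; a sketch with a stand-in `load_table` (the real `sdp.delivery.data_io.load_table` returns a Spark DataFrame):

```python
def load_table(name: str):
    # Stand-in for sdp.delivery.data_io.load_table.
    return {"table": name}

tables = ["bronze.customers", "bronze.orders"]
# Replaces SourceList.load_all: one load_table call per table.
frames = {name: load_table(name) for name in tables}
print(sorted(frames))
```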
sdp.delivery.yaml
- Removed entirely; all validation now uses Pydantic directly
INGEST
Changes in the YAML structure:
- `file_read_settings` is now removed in favor of `spark_read_options`
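An illustrative before/after for the ingest YAML (the option names under `spark_read_options` are assumptions, chosen to show that reader options now pass straight through to Spark):

```yaml
# Old (removed):
# file_read_settings:
#   delimiter: ";"

# New: pass Spark reader options directly
spark_read_options:
  delimiter: ";"
  header: "true"
```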
The changes below apply to the codebase and may break existing code that uses the SDP Ingest module. Most users should not be affected, as they call this package through YAML configuration files.
sdp.ingest.constants
- Moved to `sdp.ingest.dataclasses`
sdp.ingest.bootstrap
- Removed, as it was unused
sdp.ingest.ingest
- Removed `from_adls_yaml`
- Removed `from_local_yaml`
- Made `set_spark_config` private
- Made `unset_spark_config` private
- `backup_schema` is now private
- Don't quote secret placeholders in `password`: use `password: ${{ secrets.Source-Prik-pwd }}` instead of `password: '${{ secrets.Source-Prik-pwd }}'`
sdp.ingest.schema
- Made all functions private
sdp.ingest.silver.gbo.append
- All internal functions are now private
sdp.ingest.silver.gbo.checks
- All Enums and dataclasses are now Pydantic `BaseModel`s and have moved to `sdp.ingest.dataclasses`
- All internal functions are now private
sdp.ingest.silver.gbo.logging
- All functions are now private
sdp.ingest.sources
- Moved all dataclasses to `sdp.ingest.dataclasses`
sdp.ingest.yaml
- Removed entirely; all YAML-related functionality is now in `sdp.ingest.dataclasses`
MAINTAIN
This module is only used for maintaining the SDP and is only called from YAML magic cells, so breaking changes here are less critical.
- Completely restructured the module
- Migrated all dataclasses to Pydantic and moved them to `sdp.maintain.dataclasses`
- Made all functions private unless they are explicitly part of the public API