Requirements
The ingestion platform offers great flexibility to the developer of the ingest YAML, which allows for quick development of new ingests. Great flexibility should, however, be met with strict constraints. This page offers an overview of requirements. Each ingest built on the ingestion platform should conform to these requirements to keep the YAML and the way of working consistent and logical. Each requirement is denoted with a special code, which can be used to refer to a specific requirement or guideline in a PR (or any other communication).
Ingest YAML Requirements
RQ-DQ-001 - Bronze/silver should conform to datatypes of the source
The data types in our bronze and silver tables should match those of the source as closely as possible. Check this the first time you run an ingest. When settings.can_rewrite_history is False (the default), the platform will never change the datatypes of the bronze/silver layer again.
The ingestion platform also contains options to handle changes in datatypes flexibly, so that datatype mismatches can be resolved automatically. The drawback is that this technique can convert the history of the bronze/silver tables into StringType columns (when settings.can_rewrite_history is True). You can read more about so-called datatype escalation patterns here. For now, please use the following settings in the YAML unless you have a very good reason not to:
RQ-DQ-002 - settings.can_rewrite_history: False
Do not rewrite the history of the bronze/silver datasets. This will keep the datatypes of the columns intact through time and the same as after the first ingest. Defaults to False.
RQ-DQ-003 - settings.type_escalation_mode: strict
Try to convert each datatype to the strictest one in case of a conflict between the new data and the data in bronze/silver (e.g. StringType + IntegerType -> IntegerType). Defaults to strict.
RQ-DQ-004 - settings.default_datatype: infer
infer uses the datatype as inferred by the Spark DataFrameReader. Setting this to string will convert all columns to StringType. Defaults to infer.
RQ-DQ-005 - settings.silver_type: merge_and_delete
Note: when using a CSV file, silver_type should be merge instead of merge_and_delete. Defaults to merge_and_delete.
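Taken together, the recommended defaults above can be sketched as one settings block. The setting names and values come from the requirements above; the surrounding YAML layout is illustrative:

```yaml
# Recommended ingest-level settings (RQ-DQ-002 through RQ-DQ-005).
settings:
  can_rewrite_history: False     # RQ-DQ-002: never rewrite bronze/silver history
  type_escalation_mode: strict   # RQ-DQ-003: resolve type conflicts to the strictest type
  default_datatype: infer        # RQ-DQ-004: use the Spark-inferred datatypes
  silver_type: merge_and_delete  # RQ-DQ-005: use merge instead for CSV files
```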
RQ-DQ-006 - Use settings.datatypes
Use this option when Spark does not infer the type correctly and you want to force a certain datatype. Ingestion will fail when the data cannot be cast to the target type, and will give a warning when the specified column is not available in the dataset.
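A hypothetical sketch of forcing a datatype with settings.datatypes. The column name and the exact shape of the mapping are assumptions; check the platform documentation for the precise syntax:

```yaml
settings:
  datatypes:
    # Hypothetical column/type mapping: force order_id to an integer even if
    # Spark infers it as a string. Ingestion fails if the cast is impossible
    # and warns if the column is missing from the dataset.
    order_id: IntegerType
```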
RQ-DQ-007 - CSV Files: Use settings.spark_read_options.inferSchema: True
The CSV DataSourceReader does not infer the schema of the ingested data by default. Use this option to enable schema inference for CSV files. See the Spark CSV Datasource Docs. (Note: the JSON and XML DataSourceReaders enable inferSchema by default.)
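For a CSV ingest this looks like the following; inferSchema is a standard Spark CSV read option, while the surrounding layout is illustrative:

```yaml
settings:
  spark_read_options:
    inferSchema: True  # RQ-DQ-007: CSV schema inference is off by default in Spark
```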
RQ-YAML-001 - Only specify keys for the silver layer when applicable
Do not specify keys for an ingest or entity when the data does not naturally have a key. Not specifying any keys will cause the platform to skip the silver layer for that specific entity, which is good for performance.
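A sketch of the difference; the entities layout and the placement of the keys field are assumptions based on the description above:

```yaml
entities:
  orders:
    keys: [order_id]      # natural key present: silver layer is built
  clickstream_events: {}  # no natural key: omit keys, silver layer is skipped
```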
RQ-YAML-002 - Use ingest-level settings when applicable
Use ingest-level settings for settings that are applicable for each entity within a YAML file. Those ingest-level settings (settings) will be applied automatically to each underlying entity (entity.settings). This reduces code duplication and conforms to the DRY principle.
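A sketch of how ingest-level settings cascade to each entity; the exact YAML layout is an assumption, the inheritance behavior is as described above:

```yaml
settings:                       # ingest-level: applied to every entity below
  silver_type: merge_and_delete
entities:
  customers: {}                 # inherits silver_type: merge_and_delete
  orders: {}                    # inherits silver_type: merge_and_delete
```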
RQ-YAML-003 - Define all common source configurations on the ingest level
All source settings that are the same for each entity within one YAML file should be defined at the ingest level. Each entity can modify and override settings from the referenced source, but this should be done as little as possible. This keeps the YAML simple.
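A sketch of a shared source definition with a minimal entity-level override; the key names under source are hypothetical:

```yaml
source:                        # ingest-level: shared by all entities
  type: jdbc
  host: db.example.internal
  database: sales
entities:
  orders: {}                   # uses the ingest-level source as-is
  archive:
    source:
      database: sales_archive  # override only what differs; keep it minimal
```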
RQ-YAML-004 - Ingest and entity names should only contain lowercase letters, numbers and ‘_’
The name of the ingest, as well as the name of each entity, should only contain lowercase letters, numbers and underscores (‘_’). Doing so keeps naming consistent and makes referring to tables in bronze/silver easier on the delivery side of the platform.
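For example (the name values themselves are illustrative):

```yaml
# Valid: lowercase letters, numbers and underscores only
name: sales_orders_v2
# Invalid: uppercase letters, dashes or spaces
# name: Sales-Orders v2
```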
RQ-SQL-001 - SQL: Use partitioning for large tables
Normally, the Spark JDBC DataFrameReader uses only a single query to read a database table. This can be very slow for very large tables. Specifying the numPartitions option under entity.settings.spark_read_options will divide this single query into multiple (smaller) queries that can be executed by the executor threads in parallel. When specifying numPartitions, the Spark DataFrameReader also needs the partitionColumn, lowerBound and upperBound options. See a usage example here. Also refer to the Spark JDBC documentation.
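A sketch of a partitioned JDBC read; numPartitions, partitionColumn, lowerBound and upperBound are standard Spark JDBC options, while the entity name, column and bounds are illustrative:

```yaml
entities:
  orders:
    settings:
      spark_read_options:
        numPartitions: 8           # split the read into 8 parallel queries
        partitionColumn: order_id  # must be a numeric, date or timestamp column
        lowerBound: 1              # with upperBound, defines the partition stride
        upperBound: 8000000        # rows outside the bounds are still read
```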
RQ-SQL-002 - SQL: Cooperate with the database administrator
Using partitioning will increase the number of parallel connections and the speed of the ingest. This does increase the load on the source database. To make sure that the database can keep up with the requests, always cooperate with the database administrator. Try to find an optimum between ingestion speed and server load.
DevOps Requirements
RQ-REPO-001 - Ingest repos should be named like: ingest-name
RQ-REPO-002 - Ingest repos should contain a yaml folder which contains the ingest YAMLs
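A sketch of the expected repo layout; the repo and file names are illustrative:

```
ingest-sales/          # RQ-REPO-001: repo named ingest-<name>
└── yaml/              # RQ-REPO-002: folder containing the ingest YAMLs
    ├── orders.yaml
    └── customers.yaml
```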
RQ-REPO-003 - Notebooks should not use the .ipynb extension
Databricks currently defaults to creating notebooks in the .ipynb format. However, this file type poses challenges for review during pull requests. To facilitate a smoother review process, we mandate the use of standard .py files. This policy may be revisited in the future if Azure DevOps introduces support for diffs in .ipynb notebooks. The setting can be set on a per-user basis. See the Databricks Documentation for guidelines on how to change this setting for your personal account.