Requirements
############

The ingestion platform offers great flexibility to the developer of the ingest YAML. This allows for quick development of new ingests. Great flexibility should, however, be met with strict constraints. This page offers an overview of requirements. Each ingest built on the ingestion platform should conform to these requirements to keep the YAML and way-of-working consistent and logical.

Each requirement is denoted with a special code. This code can be used to refer to a specific requirement or guideline in a PR (or any other communication).

Ingest YAML Requirements
------------------------

**RQ-DQ-001** - Bronze/silver should conform to datatypes of the source
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

The data types in our bronze and silver tables should match those of the source as closely as possible. **Check this the first time you run an ingest**. When :ref:`settings.can_rewrite_history` is False (the default), the platform will never change the datatype of the bronze/silver layer again.

The ingestion platform also contains options to handle changes in datatypes in a flexible way, so that datatype mismatches can be resolved automatically. The drawback is that this technique can convert the history of the bronze/silver tables into `StringType` columns (when :ref:`settings.can_rewrite_history` is True). You can read more about so-called *Datatype escalation* patterns :ref:`here`.

For now, please use the following settings in the YAML **unless you have a very good reason not to do so**:

* **RQ-DQ-002** - :ref:`settings.can_rewrite_history`: False.
  Do not rewrite the history of the bronze/silver datasets. This keeps the datatypes of the columns intact through time and the same as after the first ingest. Defaults to False.
* **RQ-DQ-003** - :ref:`settings.type_escalation_mode`: strict.
  Try to convert each datatype to the strictest one in case of a conflict between the new data and the data in bronze/silver (e.g. StringType + IntegerType -> IntegerType). Defaults to `strict`.
* **RQ-DQ-004** - :ref:`settings.default_datatype`: infer.
  `infer` uses the datatype as provided by the Spark DataframeReader. Setting this to `string` will convert all columns to a `StringType`. Defaults to `infer`.
* **RQ-DQ-005** - :ref:`settings.silver_type`: merge_and_delete.
  Note: when using a CSV file, silver_type should be merge instead of merge_and_delete. Defaults to `merge_and_delete`.
* **RQ-DQ-006** - Use :ref:`settings.datatypes`.
  Use this option when Spark does not infer the type correctly and you want to force a certain datatype. Ingestion will fail when the data cannot be cast to the target type. Ingestion will give a warning when the specified column is not available in the dataset.
* **RQ-DQ-007** - CSV files: use `settings.spark_read_options.inferSchema: True`.
  The CSV DataSourceReader does not infer the schema of the ingested data by default. Use this option to enable schema inference for CSV files. See the `Spark CSV Datasource Docs `_. (Note: the JSON and XML DatasourceReaders enable inferSchema by default.)

**RQ-YAML-001** - Only specify keys for the silver layer when applicable
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Do not specify keys for an ingest or entity when the data does not `naturally` have any key. Not specifying any key will cause the platform to skip the silver layer for that specific entity, which is good for performance reasons.

**RQ-YAML-002** - Use ingest-level settings when applicable
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Use ingest-level settings for settings that apply to every entity within a YAML file. Those ingest-level settings (`settings`) will be applied automatically to each underlying entity (`entity.settings`).
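As an illustration, a minimal ingest YAML combining the recommended settings above might look like the following sketch. The ingest name, source and entity names are made up, and the exact key layout may differ per platform version:

```yaml
# Hypothetical sketch; names and layout are illustrative, not a platform reference.
name: sales_orders
settings:                        # ingest-level settings, inherited by every entity
  can_rewrite_history: False     # RQ-DQ-002: keep bronze/silver datatypes stable
  type_escalation_mode: strict   # RQ-DQ-003
  default_datatype: infer        # RQ-DQ-004
  silver_type: merge_and_delete  # RQ-DQ-005
entities:
  - name: orders                 # RQ-YAML-004: lowercase, numbers and '_' only
    keys: [order_id]             # RQ-YAML-001: only set because orders naturally have a key
    settings:
      datatypes:                 # RQ-DQ-006: force a type Spark infers incorrectly
        order_date: date
  - name: order_lines
    keys: [order_id, line_nr]
```

Note that only `order_date` needs an entity-level override; everything else is defined once at the ingest level.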
This reduces code duplication and conforms to the DRY principle.

**RQ-YAML-003** - Define all common source configurations on the ingest level
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

All source settings that are the same for each entity within one YAML file should be defined at the `Ingest` level. Each entity can modify and override settings from the referenced source, but this should be kept to a minimum. This keeps the YAML simple.

**RQ-YAML-004** - Ingest and entity names should only contain lowercase letters, numbers and '_'
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

The name of the ingest, as well as the name of each entity, should only contain lowercase letters, numbers and underscores ('_'). Doing so keeps naming consistent and makes referring to tables in bronze/silver easier on the delivery side of the platform.

**RQ-SQL-001** - SQL: Use partitioning for large tables
"""""""""""""""""""""""""""""""""""""""""""""""""""""""

Normally, the Spark JDBC DataframeReader uses only a single query to read a database table. This can be very slow for very large tables. Specifying the `numPartitions` option under `entity.settings.spark_read_options` divides this single query into multiple (smaller) queries that can be executed by each executor thread in parallel. When specifying `numPartitions`, the Spark DataframeReader also needs the `partitionColumn`, `lowerBound` and `upperBound` options. See a usage example `here `_. Also refer to the `Spark JDBC documentation `_.

**RQ-SQL-002** - SQL: Cooperate with the database administrator
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Using partitioning will increase the number of parallel connections and the speed of the ingest, but it also increases the load on the source database. To make sure that the database can keep up with the requests, always cooperate with the database administrator.
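A hedged sketch of the partitioning options described in RQ-SQL-001 (the entity, column name and bounds are illustrative; agree on real values with the database administrator):

```yaml
# Illustrative only: a partitioned JDBC read for a hypothetical large table.
entities:
  - name: transactions
    keys: [transaction_id]
    settings:
      spark_read_options:
        numPartitions: 8                   # 8 parallel queries/connections
        partitionColumn: transaction_id    # must be a numeric, date or timestamp column
        lowerBound: 1                      # lowerBound/upperBound define the partition
        upperBound: 50000000               # stride only; they do not filter rows
```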
Try to find an optimum between ingestion speed and server load.

DevOps Requirements
-------------------

**RQ-REPO-001** - Ingest repo names should be named like: `ingest-name`
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

**RQ-REPO-002** - Ingest repo should contain a `yaml` folder which contains the ingest YAMLs
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

**RQ-REPO-003** - Notebooks should *not* use the .ipynb extension
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Databricks currently defaults to creating notebooks in the .ipynb format. However, this file type poses challenges for review during pull requests. To facilitate a smoother review process, we mandate the use of standard .py files. This policy may be revisited in the future if Azure DevOps introduces support for diffs in .ipynb notebooks.

The setting can be set on a per-user basis. See the `Databricks Documentation `_ for guidelines on how to change this setting for your personal account.

**RQ-PIPELINE-001** - Pipeline names should be equal to repo names
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

**RQ-PIPELINE-002** - Pipelines should reside in the '2.0' folder
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""