Ingest
Contents:
- Dataflow
- DataType Handling
- YAML
- Requirements
- Ingest YAML Requirements
- RQ-DQ-001 - Bronze/silver should conform to datatypes of the source
- RQ-YAML-001 - Only specify keys for the silver layer when applicable
- RQ-YAML-002 - Use ingest-level settings when applicable
- RQ-YAML-003 - Define all common source configurations on the ingest level
- RQ-YAML-004 - Ingest and entity names should only contain lowercase letters, numbers and ‘_’
- RQ-SQL-001 - SQL: Use partitioning for large tables
- RQ-SQL-002 - SQL: Cooperate with the database administrator
- DevOps Requirements
- RQ-REPO-003 - Notebooks should not use the .ipynb extension
- Ingest YAML Requirements
- Excel ingest
The Smart Data Platform (ADP) is built around the idea of Ingestion as a Service: it aims to provide a unified way of ingesting data for every business platform within Athora. The business does not have deep technical knowledge of the data sources. In essence, the business knows which data it wants to ingest, but not how to ingest it.
Note
This is where the platform comes in. It translates the “which” into the “how” and tries to do so as efficiently and quickly as possible.
Thus, to start an ingestion process, we need a specification of what to ingest. This specification should be in a machine-readable format and should integrate with any existing system. We also want it to contain all the information needed to ingest the data: we do not want to send multiple configurations or store our ingest configurations in multiple tables. In essence, we want our ingest specification to be stateless.
So, we specify our ingests in a single .yaml file. This file is parsed by the machine responsible for starting the ingestion process, which must have the adp Python package installed. This package translates the .yaml specification into ingestion logic. The machine can be a stand-alone virtual machine, but it can also be the Databricks cluster itself (as long as the ADP package is installed correctly).
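To make the idea of a self-contained specification more concrete, the sketch below shows how such a specification might be resolved once the .yaml file has been parsed into a dictionary. It is not the adp package's actual API. The key names (`ingest`, `source`, `entities`), the example values and the helpers `is_valid_name` and `resolve_entities` are all assumptions made for illustration. The sketch only demonstrates two ideas from the requirements listed in the contents: settings defined on the ingest level are shared by every entity, and ingest and entity names are restricted to lowercase letters, numbers and ‘_’.

```python
import re
from copy import deepcopy

# Hypothetical result of parsing one ingest .yaml file into a dictionary
# (for example with a YAML parser). All keys and values are illustrative
# assumptions, not the real ADP schema.
SPEC = {
    "ingest": {
        "name": "policy_admin",
        # Common source configuration, defined once on the ingest level.
        "source": {"type": "sql", "host": "db.example.internal"},
        "entities": [
            {"name": "customers"},
            # An entity only adds what is specific to it.
            {"name": "claims_2023", "source": {"partition_column": "claim_date"}},
        ],
    }
}

# Names may only contain lowercase letters, numbers and '_'.
NAME_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_name(name):
    """Return True if the name uses only lowercase letters, digits and '_'."""
    return isinstance(name, str) and NAME_PATTERN.fullmatch(name) is not None


def resolve_entities(spec):
    """Merge ingest-level source settings into each entity's own settings."""
    ingest = spec["ingest"]
    if not is_valid_name(ingest["name"]):
        raise ValueError(f"invalid ingest name: {ingest['name']!r}")

    resolved = []
    for entity in ingest["entities"]:
        if not is_valid_name(entity["name"]):
            raise ValueError(f"invalid entity name: {entity['name']!r}")
        # Entity-level keys override ingest-level keys with the same name.
        source = {**deepcopy(ingest.get("source", {})), **entity.get("source", {})}
        resolved.append(
            {"ingest": ingest["name"], "entity": entity["name"], "source": source}
        )
    return resolved


if __name__ == "__main__":
    for item in resolve_entities(SPEC):
        print(item)
```

Because everything needed to resolve an entity lives in the one dictionary, the function needs no external tables or additional configuration, which is the property the text describes as stateless.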