DataType Handling ================= Data flows from our source systems (snapshot) through the preprocess layer into the bronze/silver layer, according to this flow diagram:: ┌──────────────┐ │ │ ┌───► BRONZE │ │ │ │ ┌──────────────┐ │ └──────────────┘ │ │ │ snapshot───► PREPROCESS ├──┤ │ │ │ └──────────────┘ │ ┌──────────────┐ │ │ │ └───► SILVER │ │ │ └──────────────┘ We can handle datatypes in two locations. First the preprocess layer. DataType handling in PREPROCESS ------------------------------- The datatypes in the preprocess layer should be handled so that a minimum number of datatype conflicts occur in later stages (bronze/silver). The developer has the following options in the YAML configuration available: :ref:`default_datatype ` and :ref:`datatype `. The `datatype` field can be used to specify a fixed datatype for a certain column. This datatype will *always* be used for that particular column. Regardless of the ``default_datatype`` setting. The `default_datatype` setting can be set to either ``infer`` or ``string``. This specifies how the platform handles the datatypes that haven't been specified using the ``datatypes`` setting. Should I use ``infer`` or ``string``? ************************************* This depends on the type of data ingested. When ingesting a large number of rows it might be OK to use `infer`. Also when reading data from SQL Databases via JDBC, as the JDBC connector is smart enough to read the datatypes from the source table. When such datasources are read, the chance of type conflicts in bronze/silver is really small. On the other hand, when ingesting small files incrementally, then it might be usefull to use `string` and specify the datatypes by hand (using the ``datatypes`` directive in the yaml). Conflict handling in BRONZE/SILVER ---------------------------------- Datatype conflicts *might* occur when the datatypes that result from the preprocess stage, conflict with the datatypes in the bronze/silver layer. From now on, the data in bronze/silver is called `history`. The following conflicts scenarios might occur: 1. (new) IntegerType(1) + (history) StringType('123') 2. (new) StringType('123') + (history) IntegerType(1) 3. (new) IntegerType(1) + (history) StringType('TEXT') 4. (new) StringType('TEXT') + (history) IntegerType(1) 5. (new) BooleanType(True) + (history) DateType('01-01-1970') How the platform handles these type conflicts is termined by two properties in the YAML: ``type_escalation_mode`` and ``can_rewrite_history``. See the :ref:`YAML Specification `. The ``type_escalation_mode`` property can be set to either ``strict`` or ``loose``. Strict-mode always tries to convert either the `new` or the `storage` dataset to the most *stringent* datatype. For example, situation (1) results in IntegerType while using `strict` mode. Loose mode will always try to convert to the most `loose` datatype. For example, situation (1) using `loose` mode results in StringType. When a cast in `strict` mode fails, it will always fallback on `loose` mode. The conversion rules for `strict` and `loose` mode are specified in :numref:`table_strict` and :numref:`table_loose`. When the platform tries to solve a conflict, it might need to convert data which is already present `bronze` and `silver` tables. `can_rewrite_history` specifies whether it is allowed to rewrite the bronze/silver data. Rewriting history rewrites every record and this is represented in the `Change Data Feed` a.k.a. `table_changes` table. Therefore, we should be careful in using this option. The following table takes our conflict scenarios shows the handling method for each setting: .. _table_conflict_handling .. table:: Conflict handling during various situations. +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | datatype (new) | datatype(history) | type_escalation_mode | can_rewrite_history | result | remark | +========================+============================+==============================+=============================+=====================+=========================================================================+ | IntegerType(1) | StringType('123') | strict | True | IntegerType | | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | IntegerType(1) | StringType('123') | strict | False | StringType | We cannot convert history, fallback on loose mode | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | IntegerType(1) | StringType('123') | loose | True | StringType | | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | IntegerType(1) | StringType('123') | loose | False | StringType | | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | StringType('123') | IntegerType(1) | strict | True | IntegerType | | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | StringType('123') | IntegerType(1) | strict | False | IntegerType | | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | StringType('123') | IntegerType(1) | loose | True | StringType | We convert history to StringType | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | StringType('123') | IntegerType(1) | loose | False | Fail | We cannot convert history, and cannot fallback on loose mode | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | IntegerType(1) | StringType('TEXT') | strict | True | StringType | We try to convert to IntegerType, this fails. Fallback on loose mode | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | IntegerType(1) | StringType('TEXT') | strict | False | StringType | Cannot rewrite hitory, fall back on loose mode | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | IntegerType(1) | StringType('TEXT') | loose | True | StringType | | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | IntegerType(1) | StringType('TEXT') | loose | False | StringType | | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | StringType('TEXT') | IntegerType(1) | strict | True | IntegerType | | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | StringType('TEXT') | IntegerType(1) | strict | False | IntegerType | | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | StringType('TEXT') | IntegerType(1) | loose | True | StringType | | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | StringType('TEXT') | IntegerType(1) | loose | False | Fail | Cannot convert history, fail | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | BooleanType(True) | DateType('01-01-1970') | strict | True | StringType | | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | BooleanType(True) | DateType('01-01-1970') | strict | False | Fail | Cannot convert history | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | BooleanType(True) | DateType('01-01-1970') | loose | True | StringType | | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ | BooleanType(True) | DateType('01-01-1970') | loose | False | Fail | Cannot convert history, fail | +------------------------+----------------------------+------------------------------+-----------------------------+---------------------+-------------------------------------------------------------------------+ .. _table_strict: .. table:: Conversion table in strict mode +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ | type_1/type_2 | ByteType | ShortType | IntegerType | LongType | DecimalType | FloatType | DoubleType | StringType | BinaryType | BooleanType | TimestampType | DateType | +========================+=================+==================+====================+=================+====================+==================+===================+====================+===================+====================+======================+======================+ | **ByteType** | ByteType | ShortType | IntegerType | LongType | DecimalType | FloatType | DoubleType | ByteType | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ | **ShortType** | | ShortType | IntegerType | LongType | DecimalType | FloatType | DoubleType | ShortType | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ | **IntegerType** | | | IntegerType | LongType | DecimalType | FloatType | DoubleType | IntegerType | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ | **LongType** | | | | LongType | DecimalType | FloatType | DoubleType | LongType | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ | **DecimalType** | | | | | DecimalType | FloatType | Doubletype | DecimalType | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ | **FloatType** | | | | | | FloatType | DoubleType | FloatType | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ | **DoubleType** | | | | | | | Doubletype | Doubletype | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ | **StringType** | | | | | | | | StringType | BinaryType | BooleanType | TimestampType | DateType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ | **VarcharType** | | | | | | | | | BinaryType | BooleanType | TimestampType | DateType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ | **CharType** | | | | | | | | | BinaryType | BooleanType | TimestampType | DateType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ | **BinaryType** | | | | | | | | | BinaryType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ | **BooleanType** | | | | | | | | | | BooleanType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ | **TimestampType** | | | | | | | | | | | TimestampType | TimeStampType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ | **DateType** | | | | | | | | | | | | DateType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+----------------------+----------------------+ .. _table_loose: .. table:: Conversion table in loose mode +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+ | type_1/type_2 | ByteType | ShortType | IntegerType | LongType | DecimalType | FloatType | DoubleType | StringType | BinaryType | BooleanType | TimestampType | DateType | +========================+=================+==================+====================+=================+====================+==================+===================+===================+===================+====================+======================+=====================+ | **ByteType** | ByteType | ShortType | IntegerType | LongType | DecimalType | FloatType | DoubleType | StringType | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+ | **ShortType** | | ShortType | IntegerType | LongType | DecimalType | FloatType | DoubleType | StringType | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+ | **IntegerType** | | | IntegerType | LongType | DecimalType | FloatType | DoubleType | StringType | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+ | **LongType** | | | | LongType | DecimalType | FloatType | DoubleType | StringType | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+ | **DecimalType** | | | | | DecimalType | FloatType | Doubletype | StringType | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+ | **FloatType** | | | | | | FloatType | DoubleType | StringType | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+ | **DoubleType** | | | | | | | Doubletype | StringType | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+ | **StringType** | | | | | | | | StringType | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+ | **VarcharType** | | | | | | | | | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+ | **CharType** | | | | | | | | | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+ | **BinaryType** | | | | | | | | | StringType | StringType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+ | **BooleanType** | | | | | | | | | | BooleanType | StringType | StringType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+ | **TimestampType** | | | | | | | | | | | TimestampType | TimeStampType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+ | **DateType** | | | | | | | | | | | | DateType | +------------------------+-----------------+------------------+--------------------+-----------------+--------------------+------------------+-------------------+-------------------+-------------------+--------------------+----------------------+---------------------+