DataType Handling

Data flows from our source systems (snapshot) through the preprocess layer into the bronze/silver layer, according to this flow diagram:

                                 ┌──────────────┐
                                 │              │
                             ┌───►    BRONZE    │
                             │   │              │
           ┌──────────────┐  │   └──────────────┘
           │              │  │
snapshot───►  PREPROCESS  ├──┤
           │              │  │
           └──────────────┘  │   ┌──────────────┐
                             │   │              │
                             └───►    SILVER    │
                                 │              │
                                 └──────────────┘

We can handle datatypes in two locations. First the preprocess layer.

DataType handling in PREPROCESS

The datatypes in the preprocess layer should be handled so that a minimum number of datatype conflicts occur in later stages (bronze/silver). The developer has the following options in the YAML configuration available: default_datatype and datatype. The datatype field can be used to specify a fixed datatype for a certain column. This datatype will always be used for that particular column. Regardless of the default_datatype setting. The default_datatype setting can be set to either infer or string. This specifies how the platform handles the datatypes that haven’t been specified using the datatypes setting.

Should I use infer or string?

This depends on the type of data ingested. When ingesting a large number of rows it might be OK to use infer. Also when reading data from SQL Databases via JDBC, as the JDBC connector is smart enough to read the datatypes from the source table. When such datasources are read, the chance of type conflicts in bronze/silver is really small.

On the other hand, when ingesting small files incrementally, then it might be usefull to use string and specify the datatypes by hand (using the datatypes directive in the yaml).

Conflict handling in BRONZE/SILVER

Datatype conflicts might occur when the datatypes that result from the preprocess stage, conflict with the datatypes in the bronze/silver layer. From now on, the data in bronze/silver is called history.

The following conflicts scenarios might occur:

  1. (new) IntegerType(1) + (history) StringType(‘123’)

  2. (new) StringType(‘123’) + (history) IntegerType(1)

  3. (new) IntegerType(1) + (history) StringType(‘TEXT’)

  4. (new) StringType(‘TEXT’) + (history) IntegerType(1)

  5. (new) BooleanType(True) + (history) DateType(‘01-01-1970’)

How the platform handles these type conflicts is termined by two properties in the YAML: type_escalation_mode and can_rewrite_history. See the YAML Specification.

The type_escalation_mode property can be set to either strict or loose. Strict-mode always tries to convert either the new or the storage dataset to the most stringent datatype. For example, situation (1) results in IntegerType while using strict mode. Loose mode will always try to convert to the most loose datatype. For example, situation (1) using loose mode results in StringType. When a cast in strict mode fails, it will always fallback on loose mode. The conversion rules for strict and loose mode are specified in table_strict and table_loose.

When the platform tries to solve a conflict, it might need to convert data which is already present bronze and silver tables. can_rewrite_history specifies whether it is allowed to rewrite the bronze/silver data. Rewriting history rewrites every record and this is represented in the Change Data Feed a.k.a. table_changes table. Therefore, we should be careful in using this option.

The following table takes our conflict scenarios shows the handling method for each setting:

Conflict handling during various situations.

datatype (new)

datatype(history)

type_escalation_mode

can_rewrite_history

result

remark

IntegerType(1)

StringType(‘123’)

strict

True

IntegerType

IntegerType(1)

StringType(‘123’)

strict

False

StringType

We cannot convert history, fallback on loose mode

IntegerType(1)

StringType(‘123’)

loose

True

StringType

IntegerType(1)

StringType(‘123’)

loose

False

StringType

StringType(‘123’)

IntegerType(1)

strict

True

IntegerType

StringType(‘123’)

IntegerType(1)

strict

False

IntegerType

StringType(‘123’)

IntegerType(1)

loose

True

StringType

We convert history to StringType

StringType(‘123’)

IntegerType(1)

loose

False

Fail

We cannot convert history, and cannot fallback on loose mode

IntegerType(1)

StringType(‘TEXT’)

strict

True

StringType

We try to convert to IntegerType, this fails. Fallback on loose mode

IntegerType(1)

StringType(‘TEXT’)

strict

False

StringType

Cannot rewrite hitory, fall back on loose mode

IntegerType(1)

StringType(‘TEXT’)

loose

True

StringType

IntegerType(1)

StringType(‘TEXT’)

loose

False

StringType

StringType(‘TEXT’)

IntegerType(1)

strict

True

IntegerType

StringType(‘TEXT’)

IntegerType(1)

strict

False

IntegerType

StringType(‘TEXT’)

IntegerType(1)

loose

True

StringType

StringType(‘TEXT’)

IntegerType(1)

loose

False

Fail

Cannot convert history, fail

BooleanType(True)

DateType(‘01-01-1970’)

strict

True

StringType

BooleanType(True)

DateType(‘01-01-1970’)

strict

False

Fail

Cannot convert history

BooleanType(True)

DateType(‘01-01-1970’)

loose

True

StringType

BooleanType(True)

DateType(‘01-01-1970’)

loose

False

Fail

Cannot convert history, fail

Conversion table in strict mode

type_1/type_2

ByteType

ShortType

IntegerType

LongType

DecimalType

FloatType

DoubleType

StringType

BinaryType

BooleanType

TimestampType

DateType

ByteType

ByteType

ShortType

IntegerType

LongType

DecimalType

FloatType

DoubleType

ByteType

StringType

StringType

StringType

StringType

ShortType

ShortType

IntegerType

LongType

DecimalType

FloatType

DoubleType

ShortType

StringType

StringType

StringType

StringType

IntegerType

IntegerType

LongType

DecimalType

FloatType

DoubleType

IntegerType

StringType

StringType

StringType

StringType

LongType

LongType

DecimalType

FloatType

DoubleType

LongType

StringType

StringType

StringType

StringType

DecimalType

DecimalType

FloatType

Doubletype

DecimalType

StringType

StringType

StringType

StringType

FloatType

FloatType

DoubleType

FloatType

StringType

StringType

StringType

StringType

DoubleType

Doubletype

Doubletype

StringType

StringType

StringType

StringType

StringType

StringType

BinaryType

BooleanType

TimestampType

DateType

VarcharType

BinaryType

BooleanType

TimestampType

DateType

CharType

BinaryType

BooleanType

TimestampType

DateType

BinaryType

BinaryType

StringType

StringType

StringType

BooleanType

BooleanType

StringType

StringType

TimestampType

TimestampType

TimeStampType

DateType

DateType

Conversion table in loose mode

type_1/type_2

ByteType

ShortType

IntegerType

LongType

DecimalType

FloatType

DoubleType

StringType

BinaryType

BooleanType

TimestampType

DateType

ByteType

ByteType

ShortType

IntegerType

LongType

DecimalType

FloatType

DoubleType

StringType

StringType

StringType

StringType

StringType

ShortType

ShortType

IntegerType

LongType

DecimalType

FloatType

DoubleType

StringType

StringType

StringType

StringType

StringType

IntegerType

IntegerType

LongType

DecimalType

FloatType

DoubleType

StringType

StringType

StringType

StringType

StringType

LongType

LongType

DecimalType

FloatType

DoubleType

StringType

StringType

StringType

StringType

StringType

DecimalType

DecimalType

FloatType

Doubletype

StringType

StringType

StringType

StringType

StringType

FloatType

FloatType

DoubleType

StringType

StringType

StringType

StringType

StringType

DoubleType

Doubletype

StringType

StringType

StringType

StringType

StringType

StringType

StringType

StringType

StringType

StringType

StringType

VarcharType

StringType

StringType

StringType

StringType

CharType

StringType

StringType

StringType

StringType

BinaryType

StringType

StringType

StringType

StringType

BooleanType

BooleanType

StringType

StringType

TimestampType

TimestampType

TimeStampType

DateType

DateType