DataType Handling
Data flows from our source systems (snapshot) through the preprocess layer into the bronze/silver layer, according to this flow diagram:
┌──────────────┐
│ │
┌───► BRONZE │
│ │ │
┌──────────────┐ │ └──────────────┘
│ │ │
snapshot───► PREPROCESS ├──┤
│ │ │
└──────────────┘ │ ┌──────────────┐
│ │ │
└───► SILVER │
│ │
└──────────────┘
We can handle datatypes in two locations. First the preprocess layer.
DataType handling in PREPROCESS
The datatypes in the preprocess layer should be handled so that a minimum number of datatype conflicts occur in later stages (bronze/silver).
The developer has the following options in the YAML configuration available: default_datatype and datatype.
The datatype field can be used to specify a fixed datatype for a certain column. This datatype will always be used for that particular column. Regardless of the default_datatype setting.
The default_datatype setting can be set to either infer or string. This specifies how the platform handles the datatypes that haven’t been specified using the datatypes setting.
Should I use infer or string?
This depends on the type of data ingested. When ingesting a large number of rows it might be OK to use infer. Also when reading data from SQL Databases via JDBC, as the JDBC connector is smart enough to read the datatypes from the source table. When such datasources are read, the chance of type conflicts in bronze/silver is really small.
On the other hand, when ingesting small files incrementally, then it might be usefull to use string and specify the datatypes by hand (using the datatypes directive in the yaml).
Conflict handling in BRONZE/SILVER
Datatype conflicts might occur when the datatypes that result from the preprocess stage, conflict with the datatypes in the bronze/silver layer. From now on, the data in bronze/silver is called history.
The following conflicts scenarios might occur:
(new) IntegerType(1) + (history) StringType(‘123’)
(new) StringType(‘123’) + (history) IntegerType(1)
(new) IntegerType(1) + (history) StringType(‘TEXT’)
(new) StringType(‘TEXT’) + (history) IntegerType(1)
(new) BooleanType(True) + (history) DateType(‘01-01-1970’)
How the platform handles these type conflicts is termined by two properties in the YAML: type_escalation_mode and can_rewrite_history. See the YAML Specification.
The type_escalation_mode property can be set to either strict or loose. Strict-mode always tries to convert either the new or the storage dataset to the most stringent datatype.
For example, situation (1) results in IntegerType while using strict mode. Loose mode will always try to convert to the most loose datatype. For example, situation (1) using loose mode results in StringType.
When a cast in strict mode fails, it will always fallback on loose mode. The conversion rules for strict and loose mode are specified in table_strict and table_loose.
When the platform tries to solve a conflict, it might need to convert data which is already present bronze and silver tables. can_rewrite_history specifies whether it is allowed to rewrite the bronze/silver data. Rewriting history rewrites every record and this is represented in the Change Data Feed a.k.a. table_changes table. Therefore, we should be careful in using this option.
The following table takes our conflict scenarios shows the handling method for each setting:
datatype (new) |
datatype(history) |
type_escalation_mode |
can_rewrite_history |
result |
remark |
|---|---|---|---|---|---|
IntegerType(1) |
StringType(‘123’) |
strict |
True |
IntegerType |
|
IntegerType(1) |
StringType(‘123’) |
strict |
False |
StringType |
We cannot convert history, fallback on loose mode |
IntegerType(1) |
StringType(‘123’) |
loose |
True |
StringType |
|
IntegerType(1) |
StringType(‘123’) |
loose |
False |
StringType |
|
StringType(‘123’) |
IntegerType(1) |
strict |
True |
IntegerType |
|
StringType(‘123’) |
IntegerType(1) |
strict |
False |
IntegerType |
|
StringType(‘123’) |
IntegerType(1) |
loose |
True |
StringType |
We convert history to StringType |
StringType(‘123’) |
IntegerType(1) |
loose |
False |
Fail |
We cannot convert history, and cannot fallback on loose mode |
IntegerType(1) |
StringType(‘TEXT’) |
strict |
True |
StringType |
We try to convert to IntegerType, this fails. Fallback on loose mode |
IntegerType(1) |
StringType(‘TEXT’) |
strict |
False |
StringType |
Cannot rewrite hitory, fall back on loose mode |
IntegerType(1) |
StringType(‘TEXT’) |
loose |
True |
StringType |
|
IntegerType(1) |
StringType(‘TEXT’) |
loose |
False |
StringType |
|
StringType(‘TEXT’) |
IntegerType(1) |
strict |
True |
IntegerType |
|
StringType(‘TEXT’) |
IntegerType(1) |
strict |
False |
IntegerType |
|
StringType(‘TEXT’) |
IntegerType(1) |
loose |
True |
StringType |
|
StringType(‘TEXT’) |
IntegerType(1) |
loose |
False |
Fail |
Cannot convert history, fail |
BooleanType(True) |
DateType(‘01-01-1970’) |
strict |
True |
StringType |
|
BooleanType(True) |
DateType(‘01-01-1970’) |
strict |
False |
Fail |
Cannot convert history |
BooleanType(True) |
DateType(‘01-01-1970’) |
loose |
True |
StringType |
|
BooleanType(True) |
DateType(‘01-01-1970’) |
loose |
False |
Fail |
Cannot convert history, fail |
type_1/type_2 |
ByteType |
ShortType |
IntegerType |
LongType |
DecimalType |
FloatType |
DoubleType |
StringType |
BinaryType |
BooleanType |
TimestampType |
DateType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
ByteType |
ByteType |
ShortType |
IntegerType |
LongType |
DecimalType |
FloatType |
DoubleType |
ByteType |
StringType |
StringType |
StringType |
StringType |
ShortType |
ShortType |
IntegerType |
LongType |
DecimalType |
FloatType |
DoubleType |
ShortType |
StringType |
StringType |
StringType |
StringType |
|
IntegerType |
IntegerType |
LongType |
DecimalType |
FloatType |
DoubleType |
IntegerType |
StringType |
StringType |
StringType |
StringType |
||
LongType |
LongType |
DecimalType |
FloatType |
DoubleType |
LongType |
StringType |
StringType |
StringType |
StringType |
|||
DecimalType |
DecimalType |
FloatType |
Doubletype |
DecimalType |
StringType |
StringType |
StringType |
StringType |
||||
FloatType |
FloatType |
DoubleType |
FloatType |
StringType |
StringType |
StringType |
StringType |
|||||
DoubleType |
Doubletype |
Doubletype |
StringType |
StringType |
StringType |
StringType |
||||||
StringType |
StringType |
BinaryType |
BooleanType |
TimestampType |
DateType |
|||||||
VarcharType |
BinaryType |
BooleanType |
TimestampType |
DateType |
||||||||
CharType |
BinaryType |
BooleanType |
TimestampType |
DateType |
||||||||
BinaryType |
BinaryType |
StringType |
StringType |
StringType |
||||||||
BooleanType |
BooleanType |
StringType |
StringType |
|||||||||
TimestampType |
TimestampType |
TimeStampType |
||||||||||
DateType |
DateType |
type_1/type_2 |
ByteType |
ShortType |
IntegerType |
LongType |
DecimalType |
FloatType |
DoubleType |
StringType |
BinaryType |
BooleanType |
TimestampType |
DateType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
ByteType |
ByteType |
ShortType |
IntegerType |
LongType |
DecimalType |
FloatType |
DoubleType |
StringType |
StringType |
StringType |
StringType |
StringType |
ShortType |
ShortType |
IntegerType |
LongType |
DecimalType |
FloatType |
DoubleType |
StringType |
StringType |
StringType |
StringType |
StringType |
|
IntegerType |
IntegerType |
LongType |
DecimalType |
FloatType |
DoubleType |
StringType |
StringType |
StringType |
StringType |
StringType |
||
LongType |
LongType |
DecimalType |
FloatType |
DoubleType |
StringType |
StringType |
StringType |
StringType |
StringType |
|||
DecimalType |
DecimalType |
FloatType |
Doubletype |
StringType |
StringType |
StringType |
StringType |
StringType |
||||
FloatType |
FloatType |
DoubleType |
StringType |
StringType |
StringType |
StringType |
StringType |
|||||
DoubleType |
Doubletype |
StringType |
StringType |
StringType |
StringType |
StringType |
||||||
StringType |
StringType |
StringType |
StringType |
StringType |
StringType |
|||||||
VarcharType |
StringType |
StringType |
StringType |
StringType |
||||||||
CharType |
StringType |
StringType |
StringType |
StringType |
||||||||
BinaryType |
StringType |
StringType |
StringType |
StringType |
||||||||
BooleanType |
BooleanType |
StringType |
StringType |
|||||||||
TimestampType |
TimestampType |
TimeStampType |
||||||||||
DateType |
DateType |