adp.delivery.gold.append_unknown_record

adp.delivery.gold.append_unknown_record(df: DataFrame, primary_key_columns: list[str] | str, end_date_column: str | None = None) → DataFrame

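Appends an "unknown" placeholder record to a DataFrame.

This is useful when creating dimension tables, where rows that match no real dimension member can reference the unknown record instead. In the example below, the placeholder row gets -1 as its primary key, 'Onbekend' (Dutch for "unknown") in the string columns, and 3000-12-31 in the end-date column.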

Example

>>> from datetime import date, datetime
>>> from decimal import Decimal
>>> from pyspark.sql.types import (
...     StructType, StructField, IntegerType, StringType, DecimalType,
...     DoubleType, FloatType, DateType, TimestampType)
>>> from adp.delivery.gold import append_unknown_record
>>> # create_sparksession() is assumed to be a helper that returns a SparkSession
>>> schema = StructType([
...     StructField('test_id', IntegerType(), nullable=False),
...     StructField('test_string_required', StringType(), nullable=False),
...     StructField('test_string_optional', StringType(), nullable=True),
...     StructField('test_string_excluded', StringType(), nullable=True),
...     StructField('test_integer', IntegerType(), nullable=True),
...     StructField('test_decimal', DecimalType(5, 2), nullable=True),
...     StructField('test_double', DoubleType()),
...     StructField('test_float', FloatType()),
...     StructField('test_date', DateType(), nullable=True),
...     StructField('test_timestamp', TimestampType(), nullable=True)
... ])
>>> # Decimal is built from a string so the value is exactly 999.99
>>> data = [
...     (1, 'test_string_required_1', 'test_string_optional_1', 'test_string_excluded_1', 123, Decimal('999.99'), 9.99, 9.99, date(2000, 1, 1), datetime.now()),
...     (2, 'test_string_required_2', 'test_string_optional_2', 'test_string_excluded_2', 123, Decimal('999.99'), 9.99, 9.99, date(2000, 1, 1), datetime.now()),
...     (3, 'test_string_required_3', 'test_string_optional_3', 'test_string_excluded_3', 123, Decimal('999.99'), 9.99, 9.99, date(2000, 1, 1), datetime.now())]
>>> df_in = create_sparksession().createDataFrame(data, schema)
>>> df_in.show()
+-------+--------------------+--------------------+--------------------+------------+------------+-----------+----------+----------+--------------------+
|test_id|test_string_required|test_string_optional|test_string_excluded|test_integer|test_decimal|test_double|test_float| test_date|      test_timestamp|
+-------+--------------------+--------------------+--------------------+------------+------------+-----------+----------+----------+--------------------+
|      1|test_string_requi...|test_string_optio...|test_string_exclu...|         123|      999.99|       9.99|      9.99|2000-01-01|2022-07-19 15:08:...|
|      2|test_string_requi...|test_string_optio...|test_string_exclu...|         123|      999.99|       9.99|      9.99|2000-01-01|2022-07-19 15:08:...|
|      3|test_string_requi...|test_string_optio...|test_string_exclu...|         123|      999.99|       9.99|      9.99|2000-01-01|2022-07-19 15:08:...|
+-------+--------------------+--------------------+--------------------+------------+------------+-----------+----------+----------+--------------------+
>>> df_out = append_unknown_record(df_in, 'test_id', 'test_date')
>>> df_out.show()
+-------+--------------------+--------------------+--------------------+------------+------------+-----------+----------+----------+--------------------+
|test_id|test_string_required|test_string_optional|test_string_excluded|test_integer|test_decimal|test_double|test_float| test_date|      test_timestamp|
+-------+--------------------+--------------------+--------------------+------------+------------+-----------+----------+----------+--------------------+
|      1|test_string_requi...|test_string_optio...|test_string_exclu...|         123|      999.99|       9.99|      9.99|2000-01-01|2022-07-19 15:08:...|
|      2|test_string_requi...|test_string_optio...|test_string_exclu...|         123|      999.99|       9.99|      9.99|2000-01-01|2022-07-19 15:08:...|
|      3|test_string_requi...|test_string_optio...|test_string_exclu...|         123|      999.99|       9.99|      9.99|2000-01-01|2022-07-19 15:08:...|
|     -1|            Onbekend|            Onbekend|            Onbekend|           1|        1.00|        1.0|       1.0|3000-12-31|3000-12-31 00:00:00.|
+-------+--------------------+--------------------+--------------------+------------+------------+-----------+----------+----------+--------------------+
Parameters:
  • df (DataFrame) – The DataFrame to which the unknown record is appended.

  • primary_key_columns (str | list[str]) – Column name(s) of the primary key.

  • end_date_column (str, optional) – Name of the column whose value in the unknown record is set to a date far in the future (3000-12-31 in the example above). Defaults to None.

Returns:

The input DataFrame with the unknown record appended.

Return type:

DataFrame
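
The placeholder values seen in the example output can be summarised with a small pure-Python sketch. This is not the library's implementation: build_unknown_row, PLACEHOLDERS and FAR_FUTURE are hypothetical names, and the per-type values are only inferred from the example row shown above. The behaviour for date columns other than end_date_column is not shown in the example, so the sketch leaves them as None.

```python
from datetime import date, datetime
from decimal import Decimal

# Hypothetical per-type placeholders, inferred from the example output only.
PLACEHOLDERS = {
    "string": "Onbekend",
    "integer": 1,
    "decimal": Decimal("1.00"),
    "double": 1.0,
    "float": 1.0,
    "timestamp": datetime(3000, 12, 31),
}
FAR_FUTURE = date(3000, 12, 31)  # value seen in the end-date column


def build_unknown_row(columns, primary_key_columns, end_date_column=None):
    """Build a dict describing the unknown record.

    columns is a list of (column_name, type_name) pairs.
    """
    # Accept a single column name as well as a list, like the documented signature.
    if isinstance(primary_key_columns, str):
        primary_key_columns = [primary_key_columns]

    row = {}
    for name, type_name in columns:
        if name in primary_key_columns:
            row[name] = -1  # primary key of the unknown record in the example
        elif name == end_date_column:
            row[name] = FAR_FUTURE
        else:
            # Types without a known placeholder (assumption) stay None.
            row[name] = PLACEHOLDERS.get(type_name)
    return row


row = build_unknown_row(
    [("test_id", "integer"), ("test_string_required", "string"),
     ("test_integer", "integer"), ("test_date", "date")],
    primary_key_columns="test_id",
    end_date_column="test_date",
)
print(row)
```

The primary key check comes first so that an integer key column receives -1 rather than the generic integer placeholder.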