adp.ingest.cdc
CDC functionality for adp.ingest
This package contains functionality to:
transform data from a multiple-snapshot to timeline representation (terms explained below)
transform data from a timeline to a multiple-snapshot representation
add a new snapshot to a timeline
MULTIPLE SNAPSHOT REPRESENTATION
With multiple-snapshot representation, the following is meant:
VALUE_1 |
VALUE_2 |
TIMESTAMP |
|---|---|---|
A |
A |
1 |
B |
B |
1 |
A |
A |
2 |
B |
B |
2 |
C |
C |
2 |
The table above contains two snapshots. First, timestamp 1, the second for timestamp 2. This data layout is used by default in the bronze layer in the ADP.
Use the function create_snapshots() to create snapshots from a timeline.
The multiple-snapshots can be converted into a timeline representation. This representation consists of two parts:
TIMELINE REPRESENTATION
Part 1:
VALUE_1 |
VALUE_2 |
FROM_TIMESTAMP |
TO_TIMESTAMP |
COUNT |
|---|---|---|---|---|
A |
A |
1 |
2 |
1 |
B |
B |
1 |
2 |
1 |
C |
C |
2 |
2 |
1 |
Part 2:
TIMESTAMPS |
|---|
1 |
2 |
The timestamp table (part 2) is required to be able to convert the timeline representation back to snapshots again. As, one must know which snapshots lay between FROM_TIMESTAMP and TO_TIMESTAMP
This timeline presentation is particulary efficient when snapshots are almost identical. Which is often the case for a daily snapshot representation of ingested SQL tables.
Use the function create_timeline() to create a timeline (part 1 + 2) from a snapshot table. The timestamp table (part 2) is in our case stored as TBLPROPERTY (adp.timestamps). See get_timestamp_data(), set_timestamp_data() and append_timestamp_data() functions.
SNAPSHOT MERGE TO TIMELINE
Finally, the add_to_timeline() function is used to merge a new snapshot to a timeline representation.
For example, consider the new snapshot (for timestamp 3):
VALUE_1 |
VALUE_2 |
TIMESTAMP |
|---|---|---|
A |
A |
3 |
B |
B |
3 |
C |
C |
3 |
Will be merged by the function add_to_timeline() to the timeline as:
Part 1 (resulting timeline table):
VALUE_1 |
VALUE_2 |
FROM_TIMESTAMP |
TO_TIMESTAMP |
COUNT |
|---|---|---|---|---|
A |
A |
1 |
3 |
1 |
B |
B |
1 |
3 |
1 |
C |
C |
2 |
3 |
1 |
Part 2 (TBLPROPERTY adp.timestamps):
TIMESTAMPS |
|---|
1 |
2 |
3 |
Functions
|
Adds a new snapshot to the timeline |
|
append_timestamp_data adds a new timestamp to the timestamp list TBLPROPETY |
create_initial_timestamp_df creates an timestamp table (part 2, see module docs) from a multiple-snapshot representation |
|
|
create_snapshots creates a snapshots representation from a timeline delta table as specified by path_delta_timeline |
|
Create a timeline table on the target_path from a dataframe of snapshots. |
|
get_timestamp_data writes timestamp data as TBLPROPERTY |
|
get_timestamp_df creates a dataframe of the timestamps in the TBLPROPERTY |
|
query_timeline queries the timeline |
|
set_timestamp_data sets the TBLPROPERTY with the list of timestamps |
Classes
|
Options for transformation to/from snapshots/timelines |
Exceptions
Custom exception for CDC errors |