adp.ingest.cdc

CDC functionality for adp.ingest

This package contains functionality to:

  1. transform data from a multiple-snapshot to timeline representation (terms explained below)

  2. transform data from a timeline to a multiple-snapshot representation

  3. add a new snapshot to a timeline

MULTIPLE SNAPSHOT REPRESENTATION

With multiple-snapshot representation, the following is meant:

VALUE_1

VALUE_2

TIMESTAMP

A

A

1

B

B

1

A

A

2

B

B

2

C

C

2

The table above contains two snapshots. First, timestamp 1, the second for timestamp 2. This data layout is used by default in the bronze layer in the ADP.

Use the function create_snapshots() to create snapshots from a timeline.

The multiple-snapshots can be converted into a timeline representation. This representation consists of two parts:

TIMELINE REPRESENTATION

Part 1:

VALUE_1

VALUE_2

FROM_TIMESTAMP

TO_TIMESTAMP

COUNT

A

A

1

2

1

B

B

1

2

1

C

C

2

2

1

Part 2:

TIMESTAMPS

1

2

The timestamp table (part 2) is required to be able to convert the timeline representation back to snapshots again. As, one must know which snapshots lay between FROM_TIMESTAMP and TO_TIMESTAMP

This timeline presentation is particulary efficient when snapshots are almost identical. Which is often the case for a daily snapshot representation of ingested SQL tables. Use the function create_timeline() to create a timeline (part 1 + 2) from a snapshot table. The timestamp table (part 2) is in our case stored as TBLPROPERTY (adp.timestamps). See get_timestamp_data(), set_timestamp_data() and append_timestamp_data() functions.

SNAPSHOT MERGE TO TIMELINE

Finally, the add_to_timeline() function is used to merge a new snapshot to a timeline representation. For example, consider the new snapshot (for timestamp 3):

VALUE_1

VALUE_2

TIMESTAMP

A

A

3

B

B

3

C

C

3

Will be merged by the function add_to_timeline() to the timeline as:

Part 1 (resulting timeline table):

VALUE_1

VALUE_2

FROM_TIMESTAMP

TO_TIMESTAMP

COUNT

A

A

1

3

1

B

B

1

3

1

C

C

2

3

1

Part 2 (TBLPROPERTY adp.timestamps):

TIMESTAMPS

1

2

3

Functions

add_to_timeline(snapshot, timeline_path, options)

Adds a new snapshot to the timeline

append_timestamp_data(new_unix_timestamp, ...)

append_timestamp_data adds a new timestamp to the timestamp list TBLPROPETY

create_initial_timestamp_df(...)

create_initial_timestamp_df creates an timestamp table (part 2, see module docs) from a multiple-snapshot representation

create_snapshots(path_delta_timeline, options)

create_snapshots creates a snapshots representation from a timeline delta table as specified by path_delta_timeline

create_timeline(snapshots_dataframe, ...)

Create a timeline table on the target_path from a dataframe of snapshots.

get_timestamp_data(delta_path, options)

get_timestamp_data writes timestamp data as TBLPROPERTY

get_timestamp_df(delta_path, options)

get_timestamp_df creates a dataframe of the timestamps in the TBLPROPERTY

query_timeline(path, keys, at_timestamp[, ...])

query_timeline queries the timeline

set_timestamp_data(data, delta_path, options)

set_timestamp_data sets the TBLPROPERTY with the list of timestamps

Classes

TransformOptions([system_prefix, ...])

Options for transformation to/from snapshots/timelines

Exceptions

CDCException

Custom exception for CDC errors