adp.ingest.schema.StructTypeGraph

class adp.ingest.schema.StructTypeGraph(value: StructType)
class adp.ingest.schema.StructTypeGraph(value: DiGraph)
class adp.ingest.schema.StructTypeGraph(value: Any)

Creates a graph from a nested DataFrame schema and allows filtering of the original DataFrame.

Examples

Creating a StructTypeGraph out of a pyspark StructType and applying select on a df.

>>> StructTypeGraph(df.schema).select([
...     '[home].[array_a].[array_b].[a]',
...     '[home].[array_a].[array_b].[b]',
... ]).apply(df)

__init__(value: StructType)
__init__(value: DiGraph)
__init__(value: Any)

Methods

__init__()

apply(df)

Apply the graph on a given dataframe, filtering the columns of the dataframe

apply_maskings(maskings, df)

Scrambles the data

delete_null_keys(keys, df)

Remove rows in which any of the keys is None

exclude([exclude])

Exclude a subset from the schema/graph and return a StructTypeGraph

get_node_by_name(node_name)

Find the node that matches the input pattern

max_depth(max_depth)

Creates a subgraph for a maximum depth

plot([graph, draw_options])

Plot the graph of the schema

printSchema(df)

Print the schema of the filtered dataframe

select([select, ignore_missing_columns])

Select a subset from the schema/graph and return a StructTypeGraph

verify_keys([keys])

verify_keys checks if all keys are present in the dataframe

apply(df: DataFrame) → DataFrame

Apply the graph on a given dataframe, filtering the columns of the dataframe

Uses the topological_generations of the directed graph (read: its layers) to explode_outer each column, layer by layer. Alternating a flatten step with an explode step in this way opens up the nested structure of the DataFrame.

Parameters:

df (DataFrame) – DataFrame that has to be filtered.

Returns:

The Dataframe with only the columns in the graph.

Return type:

DataFrame
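The layering that apply relies on can be illustrated without Spark. The following is a minimal sketch, not the library's implementation: it assumes a toy schema graph given as parent/child edges and computes the topological generations in plain Python. The helper name `topological_generations` mirrors the term used above, and `schema_edges` is a hypothetical schema. Each resulting layer corresponds to one round of flattening and exploding.

```python
from collections import defaultdict


def topological_generations(edges):
    """Group the nodes of a DAG into layers so that every node
    appears in a later layer than all of its parents."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # The first layer holds the nodes without any parent.
    layer = sorted(n for n in nodes if indegree[n] == 0)
    layers = []
    while layer:
        layers.append(layer)
        next_layer = []
        for node in layer:
            for child in children[node]:
                indegree[child] -= 1
                # A child is ready once all of its parents are placed.
                if indegree[child] == 0:
                    next_layer.append(child)
        layer = sorted(next_layer)
    return layers


# Hypothetical nested schema: home -> array_a -> array_b -> {a, b}
schema_edges = [
    ("home", "home.array_a"),
    ("home.array_a", "home.array_a.array_b"),
    ("home.array_a.array_b", "home.array_a.array_b.a"),
    ("home.array_a.array_b", "home.array_a.array_b.b"),
]

layers = topological_generations(schema_edges)
for depth, layer in enumerate(layers):
    print(depth, layer)
```

In this toy schema the two leaf columns land in the same final layer, so they would be reached in the same pass.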

apply_maskings(maskings: List[Masking], df: DataFrame) → DataFrame

Scrambles the data

delete_null_keys(keys: List[str], df: DataFrame) → DataFrame

Remove rows in which any of the keys is None

exclude(exclude: List[str] | None = None)

Exclude a subset from the schema/graph and return a StructTypeGraph

Parameters:

exclude (List[str], optional) – A list of nodes to exclude. Defaults to None.

Returns:

A StructTypeGraph with only a subset of the data

Return type:

StructTypeGraph

get_node_by_name(node_name: str)

Find the node that matches the input pattern

The input pattern can be in two formats:

  1. name.name

  2. [name].[name]

Parameters:

node_name (str) – The name of the node in the graph
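To make the two pattern formats concrete, here is a minimal sketch of how both could be reduced to the same list of name segments. The helper `split_node_pattern` is hypothetical and not part of StructTypeGraph. How the library actually parses patterns is not documented here. The handling of dots inside brackets is an assumption made for illustration.

```python
import re


def split_node_pattern(pattern):
    """Split 'a.b' or '[a].[b]' into ['a', 'b'].

    Hypothetical helper for illustration only.
    """
    # Bracketed form: take the text inside each pair of brackets.
    bracketed = re.findall(r"\[([^\[\]]+)\]", pattern)
    if bracketed:
        return bracketed
    # Dotted form: split on the separator.
    return pattern.split(".")


print(split_node_pattern("home.array_a.array_b"))
print(split_node_pattern("[home].[array_a].[array_b]"))
```

Under this assumed parsing, both calls yield the same segments, and a bracketed segment may itself contain a dot.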

max_depth(max_depth: int | None)

Creates a subgraph for a maximum depth
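As an illustration of depth-limited subgraphs in general, the sketch below trims a toy schema tree to a maximum depth using a breadth-first walk. It is not the library's implementation: the function `limit_depth`, the dict-of-children representation, the choice of the root as depth 0, and treating None as "no limit" are all assumptions.

```python
from collections import deque


def limit_depth(children, root, max_depth):
    """Return the part of `children` reachable from `root` within
    `max_depth` edges. None is treated as no limit (an assumption)."""
    kept = {root: []}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend below the depth limit
        for child in children.get(node, []):
            kept[node].append(child)
            if child not in kept:
                kept[child] = []
                queue.append((child, depth + 1))
    return kept


# Hypothetical schema tree, keyed by parent node.
schema = {
    "home": ["array_a", "name"],
    "array_a": ["array_b"],
    "array_b": ["a", "b"],
}

shallow = limit_depth(schema, "home", 1)
full = limit_depth(schema, "home", None)
print(shallow)
```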

plot(graph=None, draw_options: dict = {})

Plot the graph of the schema

Plots a graph for the schema of any dataframe.

Example

>>> StructTypeGraph(df.schema).plot()

printSchema(df)

Print the schema of the filtered dataframe

select(select: List[str] | None = None, ignore_missing_columns: bool = False) → StructTypeGraph

Select a subset from the schema/graph and return a StructTypeGraph

Parameters:

select (List[str], optional) – A list of nodes to select. Defaults to None.

Returns:

A StructTypeGraph with only a subset of the data

Return type:

StructTypeGraph

verify_keys(keys: List[str] | None = None) → List[str] | None

verify_keys checks if all keys are present in the dataframe

Parameters:

keys (List[str], optional) – Keys whose existence in the dataframe is verified. Defaults to None.