adp.ingest.schema.StructTypeGraph

class adp.ingest.schema.StructTypeGraph(value: StructType)

class adp.ingest.schema.StructTypeGraph(value: DiGraph)

class adp.ingest.schema.StructTypeGraph(value: Any)

Creates a graph from a nested DataFrame schema and allows filtering of the original DataFrame.

Examples

Creating a StructTypeGraph from a PySpark StructType and applying a selection to a DataFrame df.

>>> StructTypeGraph(df.schema).select([
...     '[home].[array_a].[array_b].[a]',
...     '[home].[array_a].[array_b].[b]',
... ]).apply(df)

__init__(value: StructType)
__init__(value: DiGraph)
__init__(value: Any)

Methods

`__init__`()
`apply`(df)	Apply the graph on a given dataframe, filtering the columns of the dataframe
`apply_maskings`(maskings, df)	Scrambles the data
`delete_null_keys`(keys, df)	Remove rows where any of the keys is None
`exclude`([exclude])	Exclude a subset from the schema/graph and return a StructTypeGraph
`get_node_by_name`(node_name)	get_node_by_name finds a node for the input pattern
`max_depth`(max_depth)	Creates a subgraph for a maximum depth
`plot`([graph, draw_options])	Plot the graph of the schema
`printSchema`(df)	Prints the schema of the filtered dataframe
`select`([select, ignore_missing_columns])	Select a subset from the schema/graph and return a StructTypeGraph
`verify_keys`([keys])	verify_keys checks if all keys are present in the dataframe

apply(df: DataFrame) → DataFrame

Apply the graph on a given dataframe, filtering the columns of the dataframe

Uses the graph's topological_generations (the layers of the directed graph) to apply explode_outer to each column, layer by layer. Alternating between flattening and exploding opens up the nested structure of the DataFrame.

Parameters:: df (DataFrame, optional) – Dataframe that has to be filtered. Defaults to None.
Returns:: The Dataframe with only the columns in the graph.
Return type:: DataFrame
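
The layer-by-layer ordering described above can be illustrated without Spark. The following is a minimal pure-Python sketch of grouping a directed acyclic graph into topological generations; the function name `topological_layers` and the edge list are hypothetical and are not part of StructTypeGraph, which may compute its layers differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def topological_layers(edges: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group the nodes of a directed acyclic graph into layers.

    Layer 0 holds the nodes without parents; every later layer holds the
    nodes whose parents all sit in earlier layers.
    """
    children: Dict[str, List[str]] = defaultdict(list)
    indegree: Dict[str, int] = defaultdict(int)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        indegree[parent] += 0  # make sure the parent is registered as a node

    layer = sorted(node for node, degree in indegree.items() if degree == 0)
    layers: List[List[str]] = []
    while layer:
        layers.append(layer)
        next_layer: List[str] = []
        for node in layer:
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:  # all parents already placed
                    next_layer.append(child)
        layer = sorted(next_layer)
    return layers


# Hypothetical nested schema: home -> array_a -> array_b -> {a, b}
schema_edges = [
    ("home", "home.array_a"),
    ("home.array_a", "home.array_a.array_b"),
    ("home.array_a.array_b", "home.array_a.array_b.a"),
    ("home.array_a.array_b", "home.array_a.array_b.b"),
]
layers = topological_layers(schema_edges)
```

In this sketch each layer corresponds to one nesting level, so a nested array column would be exploded before any of the fields it contains are visited.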

apply_maskings(maskings: List[Masking], df: DataFrame) → DataFrame: Scrambles the data

delete_null_keys(keys: List[str], df: DataFrame) → DataFrame: Remove rows where any of the keys is None

exclude(exclude: List[str] | None = None)

Exclude a subset from the schema/graph and return a StructTypeGraph

Parameters:: exclude (List[str], optional) – A list of nodes to exclude. Defaults to None.
Returns:: A StructTypeGraph with only a subset of the data
Return type:: StructTypeGraph

get_node_by_name(node_name: str)

get_node_by_name finds a node for the input pattern

The input pattern can be in two formats:

name.name
[name].[name]

Parameters:: node_name (str) – The name of the node in the graph
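
As an illustration of the two accepted formats, the sketch below normalizes either pattern into a list of name parts. The helper `split_node_pattern` is hypothetical; it only shows that both spellings can describe the same path and is not the lookup logic used by get_node_by_name.

```python
import re
from typing import List


def split_node_pattern(pattern: str) -> List[str]:
    """Split 'a.b' or '[a].[b]' into its name parts, ['a', 'b']."""
    # Bracketed form: collect everything between each pair of brackets.
    bracketed = re.findall(r"\[([^\]]+)\]", pattern)
    if bracketed:
        return bracketed
    # Dotted form: split on the separator.
    return pattern.split(".")


dotted = split_node_pattern("home.array_a.array_b")
bracketed = split_node_pattern("[home].[array_a].[array_b]")
```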

max_depth(max_depth: int | None): Creates a subgraph for a maximum depth
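
The idea of a depth-limited subgraph can be sketched in plain Python as a breadth-first walk that stops after a fixed number of levels. The function `nodes_within_depth`, the convention that the root sits at depth 0, and the treatment of None as "no limit" are all assumptions made for illustration, not documented behaviour of max_depth.

```python
from typing import Dict, List, Optional, Set


def nodes_within_depth(
    children: Dict[str, List[str]], root: str, max_depth: Optional[int]
) -> Set[str]:
    """Return the nodes reachable from root in at most max_depth steps."""
    kept = {root}
    frontier = [root]
    depth = 0
    # Assumption: max_depth=None means the whole graph is kept.
    while frontier and (max_depth is None or depth < max_depth):
        frontier = [c for node in frontier for c in children.get(node, [])]
        kept.update(frontier)
        depth += 1
    return kept


# Hypothetical schema graph as an adjacency mapping.
schema = {"home": ["array_a"], "array_a": ["array_b"], "array_b": ["a", "b"]}
shallow = nodes_within_depth(schema, "home", 1)
everything = nodes_within_depth(schema, "home", None)
```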

plot(graph=None, draw_options: dict = {})

Plot the graph of the schema

Plots a graph for the schema of any dataframe.

Example

>>> StructTypeGraph(df.schema).plot()

printSchema(df): Prints the schema of the filtered dataframe

select(select: List[str] | None = None, ignore_missing_columns: bool = False) → StructTypeGraph

Select a subset from the schema/graph and return a StructTypeGraph

Parameters:: select (List[str], optional) – A list of nodes to select. Defaults to None.
Returns:: A StructTypeGraph with only a subset of the data
Return type:: StructTypeGraph

verify_keys(keys: List[str] | None = None) → List[str] | None

verify_keys checks if all keys are present in the dataframe

Parameters:: keys (List[str]) – Keys to check for existence in the dataframe