adp.ingest.schema.StructTypeGraph
- class adp.ingest.schema.StructTypeGraph(value: StructType)
- class adp.ingest.schema.StructTypeGraph(value: DiGraph)
- class adp.ingest.schema.StructTypeGraph(value: Any)
Creates a graph from a nested DataFrame schema and allows filtering of the original DataFrame.
Examples
Create a StructTypeGraph from a PySpark StructType, select two nested columns, and apply the selection to a DataFrame:
>>> StructTypeGraph(df.schema).select(['[home].[array_a].[array_b].[a]', '[home].[array_a].[array_b].[b]']).apply(df)
- __init__(value: StructType)
- __init__(value: DiGraph)
- __init__(value: Any)
Methods
__init__()
apply(df) – Apply the graph to a given DataFrame, filtering its columns
apply_maskings(maskings, df) – Scramble the data
delete_null_keys(keys, df) – Remove rows in which any of the keys is None
exclude([exclude]) – Exclude a subset from the schema/graph and return a StructTypeGraph
get_node_by_name(node_name) – Find the node that matches the input pattern
max_depth(max_depth) – Create a subgraph limited to a maximum depth
plot([graph, draw_options]) – Plot the graph of the schema
printSchema(df) – Print the schema of the filtered DataFrame
select([select, ignore_missing_columns]) – Select a subset from the schema/graph and return a StructTypeGraph
verify_keys([keys]) – Check whether all keys are present in the DataFrame
- apply(df: DataFrame) → DataFrame
Apply the graph to a given DataFrame, filtering the columns of the DataFrame.
Uses topological_generations (that is, the layers of the directed graph) to apply explode_outer to the columns layer by layer. Alternating a flatten step with an explode step in this way opens up the nested structure of the DataFrame.
- Parameters:
df (DataFrame) – DataFrame to be filtered.
- Returns:
The Dataframe with only the columns in the graph.
- Return type:
DataFrame
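The layer-by-layer traversal described for apply can be illustrated without Spark. The following is a minimal sketch, not the library's implementation: it computes the topological generations of a small schema graph in plain Python, the same grouping that networkx provides through `topological_generations`. The `edges` dictionary and the `topological_layers` helper are hypothetical names invented for this illustration.

```python
from collections import defaultdict


def topological_layers(edges):
    """Group the nodes of a DAG into layers (topological generations)."""
    # Count incoming edges for every node that appears in the graph.
    indegree = defaultdict(int)
    nodes = set(edges)
    for parent, children in edges.items():
        for child in children:
            nodes.add(child)
            indegree[child] += 1

    # The first layer holds the nodes without any parent.
    layer = sorted(n for n in nodes if indegree[n] == 0)
    layers = []
    while layer:
        layers.append(layer)
        next_layer = []
        for node in layer:
            for child in edges.get(node, []):
                indegree[child] -= 1
                # A child joins the next layer once all its parents are placed.
                if indegree[child] == 0:
                    next_layer.append(child)
        layer = sorted(next_layer)
    return layers


# Hypothetical schema graph: home -> array_a -> array_b -> {a, b}
edges = {
    "home": ["array_a"],
    "array_a": ["array_b"],
    "array_b": ["a", "b"],
}
layers = topological_layers(edges)
```

Each layer corresponds to one nesting level of the schema. Under the description above, every level would be opened with explode_outer before the next one is processed.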
- delete_null_keys(keys: List[str], df: DataFrame) → DataFrame
Remove rows in which any of the keys is None.
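The row-filtering rule can be shown on plain Python dictionaries. This is only a sketch of the semantics as worded here (a row is dropped when any key is None); `drop_rows_with_null_keys` and the sample `rows` are hypothetical and not part of the library. In PySpark a comparable effect is available through `DataFrame.dropna(subset=keys)`, although whether delete_null_keys uses it is not stated.

```python
def drop_rows_with_null_keys(rows, keys):
    """Keep only the rows in which every key column has a non-None value."""
    return [row for row in rows if all(row.get(k) is not None for k in keys)]


# Hypothetical sample data: two of the three rows have a missing key.
rows = [
    {"id": 1, "name": "a"},
    {"id": None, "name": "b"},
    {"id": 3, "name": None},
]
kept = drop_rows_with_null_keys(rows, ["id", "name"])
```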
- exclude(exclude: List[str] | None = None)
Exclude a subset from the schema/graph and return a StructTypeGraph
- Parameters:
exclude (List[str], optional) – A list of nodes to exclude. Defaults to None.
- Returns:
A StructTypeGraph with only a subset of the data
- Return type:
StructTypeGraph
- get_node_by_name(node_name: str)
get_node_by_name finds a node for the input pattern
The input pattern can be in two formats:
name.name
[name].[name]
- Parameters:
node_name (str) – The name of the node in the graph
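The two accepted pattern formats describe the same path through the schema. A minimal sketch of how both could be normalized into a list of path parts is shown below; `parse_node_pattern` is a hypothetical helper written for this illustration, and the way the library actually parses patterns is not documented here.

```python
import re


def parse_node_pattern(pattern):
    """Split a node pattern into its path parts (hypothetical helper)."""
    # Bracketed form: collect the text inside each [ ... ] pair.
    bracketed = re.findall(r"\[([^\]]+)\]", pattern)
    if bracketed:
        return bracketed
    # Dotted form: split on the dots.
    return pattern.split(".")


parts_dotted = parse_node_pattern("home.array_a.array_b.a")
parts_bracketed = parse_node_pattern("[home].[array_a].[array_b].[a]")
```

In this sketch both calls yield the same list of parts, so a lookup can treat the two formats interchangeably.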
- max_depth(max_depth: int | None)
Creates a subgraph for a maximum depth
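A depth-limited subgraph can be sketched with a breadth-first walk from the root. The code below is an illustration under stated assumptions, not the library's method: the root is counted as depth 0, a max_depth of None is treated as no limit, and `nodes_within_depth`, `edges`, and the root name are hypothetical.

```python
from collections import deque


def nodes_within_depth(edges, root, max_depth):
    """Return each node reachable from root within max_depth steps, with its depth."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Stop expanding once the limit is reached; None never matches, so no limit.
        if depth[node] == max_depth:
            continue
        for child in edges.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


# Hypothetical schema graph: home -> array_a -> array_b -> {a, b}
edges = {
    "home": ["array_a"],
    "array_a": ["array_b"],
    "array_b": ["a", "b"],
}
shallow = nodes_within_depth(edges, "home", 2)
full = nodes_within_depth(edges, "home", None)
```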
- plot(graph=None, draw_options: dict = {})
Plot the graph of the schema
Plots a graph for the schema of any dataframe.
Example
>>> StructTypeGraph(df.schema).plot()
- printSchema(df)
Prints the schema of the filtered DataFrame.
- select(select: List[str] | None = None, ignore_missing_columns: bool = False) → StructTypeGraph
Select a subset from the schema/graph and return a StructTypeGraph
- Parameters:
select (List[str], optional) – A list of nodes to select. Defaults to None.
- Returns:
A StructTypeGraph with only a subset of the data
- Return type:
StructTypeGraph
- verify_keys(keys: List[str] | None = None) → List[str] | None
verify_keys checks if all keys are present in the dataframe
- Parameters:
keys (List[str]) – Keys whose existence in the DataFrame is verified