adp.ingest.sources.gbo.GBOEntitySource

class adp.ingest.sources.gbo.GBOEntitySource(entity: EntityType, ingest: Ingest)

Represents a GBO source for a single entity

__init__(entity: EntityType, ingest: Ingest)

Methods

__init__(entity, ingest)

all_files_in_staging([abfs_path])

all_files_in_staging retrieves all files in the staging directory for the given entity

download_files(files)

files_matching_patterns(input_paths, pattern)

Retrieve all files corresponding to the pattern string

filter_all_files(files, glob_pattern)

filter_all_files returns the files that match the glob pattern.

get_data(run_metadata)

Returns a dataframe for the GBO files

get_data_rdd(data_file)

Create an RDD with the data from the D rows in the data files

get_file_pairs(files)

Returns a list of tuples pairing data files with metadata files

remove_temp_files()

Removes the temporary files

scan_files([scan_subfolders])

unzip_files()

Unzips all files in the staging folder

upload_to_staging()

upload_to_staging downloads files from the SMB drive and uploads them to the staging folder
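
The two pattern-based methods listed above, files_matching_patterns and filter_all_files, take a pattern string. The sketch below shows how glob-style filtering of a file list might look. It is not the library's implementation: `filter_by_glob` and the sample paths are hypothetical, and matching on the file name rather than the full path is an assumption.

```python
from fnmatch import fnmatchcase
from typing import List


def filter_by_glob(files: List[str], glob_pattern: str) -> List[str]:
    """Return the paths whose file name matches glob_pattern (case-sensitive)."""
    # Assumption: only the final path component is compared with the pattern.
    return [f for f in files if fnmatchcase(f.rsplit("/", 1)[-1], glob_pattern)]


# Hypothetical staging paths, for illustration only.
files = [
    "staging/gbo/orders_001.zip",
    "staging/gbo/orders_002.zip",
    "staging/gbo/readme.txt",
]
matches = filter_by_glob(files, "orders_*.zip")
```

`fnmatchcase` is used instead of `fnmatch` so that matching behaves the same on every operating system.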

Attributes

path

Retrieves the path for the staging folder

source

get_data(run_metadata: IngestRunMetadata) → DataFrame | None

Returns a dataframe for the GBO files

get_data_rdd(data_file: str) → RDD[Any]

Create an RDD with the data from the D rows in the data files

Uses the number_of_columns property on the `GBOTableMetadata` to determine the width of the data.

Parameters:
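  • data_file (str) – Path to the data file to read

  • metadata (GBOTableMetadata) – Table metadata whose number_of_columns gives the width of the data

Returns:

An RDD containing the data from the D rows of the data file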

Return type:

RDD
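
To illustrate selecting D rows and fixing their width from number_of_columns, here is a minimal sketch in plain Python with no Spark dependency. The GBO file layout is not described in this reference, so everything about it is assumed: the `;` delimiter, the leading record-type field, the sample lines, and the helper names `parse_d_row` and `d_rows` are all hypothetical.

```python
from typing import Iterable, List, Optional

# Hypothetical layout: each line starts with a record-type field followed by
# delimited values. The real GBO format may differ.
DELIMITER = ";"


def parse_d_row(line: str, number_of_columns: int) -> Optional[List[str]]:
    """Return the data fields of a D row, or None for any other record type."""
    fields = line.rstrip("\n").split(DELIMITER)
    if fields[0] != "D":
        return None
    # Truncate to the expected width, then pad short rows with empty strings.
    values = fields[1:1 + number_of_columns]
    values += [""] * (number_of_columns - len(values))
    return values


def d_rows(lines: Iterable[str], number_of_columns: int) -> List[List[str]]:
    """Keep only the D rows, each normalized to number_of_columns fields."""
    parsed = (parse_d_row(line, number_of_columns) for line in lines)
    return [row for row in parsed if row is not None]


# Made-up sample: a header row, two data rows, and a trailer row.
sample = ["H;2024-01-01", "D;1;alice;NL", "D;2;bob", "T;2"]
rows = d_rows(sample, number_of_columns=3)
```

In a Spark job the same per-line function could be applied with `RDD.map` and `RDD.filter` on the lines of the data file. Whether get_data_rdd works this way is not stated here.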

get_file_pairs(files: List[str]) → List[tuple[str, str]]

Returns a list of tuples pairing data files with metadata files

Parameters:

files – A list of file paths to the metadata and data files

Returns:

A list of tuples. The first element is the data file, the second element is the metadata file

Return type:

List[tuple[str, str]]
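
The following sketch shows one way to build (data file, metadata file) tuples from a flat list of paths. The naming convention is an assumption: this reference does not say how data and metadata files are matched, so the `.dat` and `.meta` suffixes, the shared-stem rule, and the helper `pair_files` are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Tuple

# Hypothetical suffixes. The real GBO naming convention may differ.
DATA_SUFFIX = ".dat"
METADATA_SUFFIX = ".meta"


def pair_files(files: List[str]) -> List[Tuple[str, str]]:
    """Pair each data file with the metadata file that shares its stem."""
    # Index metadata files by their path without the suffix.
    metadata_by_stem: Dict[str, str] = {}
    for path in files:
        p = PurePosixPath(path)
        if p.suffix == METADATA_SUFFIX:
            metadata_by_stem[str(p.with_suffix(""))] = path

    pairs: List[Tuple[str, str]] = []
    for path in sorted(files):
        p = PurePosixPath(path)
        if p.suffix != DATA_SUFFIX:
            continue
        stem = str(p.with_suffix(""))
        if stem in metadata_by_stem:
            # Data file first, metadata file second.
            pairs.append((path, metadata_by_stem[stem]))
    return pairs


# Made-up paths: b.dat has no metadata file and c.meta has no data file.
files = ["staging/a.dat", "staging/a.meta", "staging/b.dat", "staging/c.meta"]
pairs = pair_files(files)
```

Files without a counterpart are dropped silently in this sketch. A real pipeline might prefer to log or raise for them.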