adp.ingest.sources.gbo.GBOEntitySource
- class adp.ingest.sources.gbo.GBOEntitySource(entity: EntityType, ingest: Ingest)
Represents a GBOSource
Methods
__init__(entity, ingest)
all_files_in_staging([abfs_path]) – Retrieves all files in the staging directory for the given entity
download_files(files)
files_matching_patterns(input_paths, pattern) – Retrieves all files corresponding to the pattern string
filter_all_files(files, glob_pattern) – Filters files by matching them against the glob pattern
get_data(run_metadata) – Returns a dataframe for the GBO files
get_data_rdd(data_file) – Creates an RDD with the data from the D rows in the data files
get_file_pairs(files) – Returns a list of tuples pairing data files with metadata files
remove_temp_files() – Removes the temporary files
scan_files([scan_subfolders])
unzip_files() – Unzips all files in the staging folder
upload_to_staging() – Downloads files from the SMB drive and uploads them to the staging folder
Attributes
path – Retrieves the path for the staging folder
source
- get_data(run_metadata: IngestRunMetadata) → DataFrame | None
Returns a dataframe for the GBO files
- get_data_rdd(data_file: str) → RDD[Any]
Creates an RDD with the data from the D rows in the data files.
Uses the number_of_columns property on the `GBOTableMetadata` to determine the width of the data.
- Parameters:
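data_file (str) – The data file to read the D rows from
metadata (GBOTableMetadata) – Table metadata; its number_of_columns property determines the width of the data
- Returns:
An RDD with the data from the D rows in the data file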
- Return type:
RDD
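The row handling described for get_data_rdd can be illustrated with a minimal plain-Python sketch. It is not the adp implementation: the function name `parse_d_rows`, the `|` separator, the convention that the first field holds the record type, and the padding of short rows are all assumptions made for illustration. Only the ideas of keeping D rows and fixing the width from a column count come from the description above.

```python
from typing import Iterable, List


def parse_d_rows(lines: Iterable[str], number_of_columns: int, sep: str = "|") -> List[List[str]]:
    """Keep only D rows and normalise each one to ``number_of_columns`` fields.

    Assumptions (not taken from the adp source): the first field of a row is
    its record type, fields are separated by ``sep``, and short rows are
    padded with empty strings.
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if fields[0] != "D":
            continue  # skip every row that is not a data (D) row
        values = fields[1:1 + number_of_columns]  # drop the record type, cap the width
        values += [""] * (number_of_columns - len(values))  # pad short rows
        rows.append(values)
    return rows


# Hypothetical file content: one header row, two data rows, one trailer row.
sample = ["H|2024-01-01", "D|1|alice|NL", "D|2|bob", "T|2"]
parsed = parse_d_rows(sample, number_of_columns=3)
```

With Spark, the same per-line logic could presumably be applied through `filter` and `map` on an RDD of text lines, with the column count read from the table metadata.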
- get_file_pairs(files: List[str]) → List[tuple[str, str]]
Returns a list of tuples pairing data files with metadata files
- Parameters:
files – A list of file paths to the metadata and data files
- Returns:
A list of tuples. The first element is the data file, the second element is the metadata file
- Return type:
List[tuple[str, str]]
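The shape of the return value can be shown with a small sketch. How GBOEntitySource actually matches a data file to its metadata file is not documented here, so the rule below (same path stem, with hypothetical `.dat` and `.meta` extensions) and the function name `pair_files` are assumptions. Only the tuple order, data file first and metadata file second, follows the description above.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Tuple


def pair_files(files: List[str], data_ext: str = ".dat", meta_ext: str = ".meta") -> List[Tuple[str, str]]:
    """Pair each data file with the metadata file that shares its stem.

    The extensions and the shared-stem rule are illustrative assumptions,
    not the matching rule used by adp.
    """
    # Index metadata files by their path without the extension.
    meta_by_stem: Dict[str, str] = {}
    for name in files:
        path = PurePosixPath(name)
        if path.suffix == meta_ext:
            meta_by_stem[str(path.with_suffix(""))] = name

    pairs = []
    for name in sorted(files):
        path = PurePosixPath(name)
        if path.suffix != data_ext:
            continue
        meta = meta_by_stem.get(str(path.with_suffix("")))
        if meta is not None:  # data files without metadata are left out
            pairs.append((name, meta))  # (data file, metadata file)
    return pairs


files = ["staging/a.meta", "staging/b.dat", "staging/a.dat", "staging/c.dat"]
pairs = pair_files(files)
```

In this sketch, data files that have no matching metadata file are silently dropped; whether the real method drops them, raises, or logs is not stated in this reference.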