Excel ingest

This page describes how Excel files (.xls and .xlsx) are ingested with ADP.

It explains which spark_read_options are relevant for Excel ingest, how they affect the selected workbook data, and which defaults and limitations apply.

Excel files in ingest

Spark can read Excel files directly, specifically files in the .xls and .xlsx formats. Excel files are more complex than flat file formats that contain only one table, but their ingestion only requires several excel specific spark_read_options. To understand these options you must first understand that Excel workbooks can contain multiple worksheets, multiple tables on the same worksheet, and even implicit data such as formulas that must be interpreted by the software opening the file.

By contrast, a simpler format such as CSV typically contains a single table.

Example of a simple CSV file with a single table. — A CSV file usually contains a single table.

When you open an Excel file, you may see multiple tables on one worksheet and multiple worksheets in the same file. If you want a computer to read data from an Excel file that is not limited to one simple table on one sheet, you must tell it where the relevant data is located.

Example Excel workbook with multiple tables and sheets. — Excel files can contain multiple tables and worksheets.

Ingest and `spark_read_options`

To understand how Excel ingest works, it helps to compare it to an ingest pattern that may already be familiar. Below is an example of an ingest that reads a table from an MSSQL server.

name: mock_layer
variables:
- name: server
  values:
    dev: dev-sqls-adp-01.database.windows.net
    tst: tst-sqls-adp-01.database.windows.net
    acc: acc-sqls-adp-01.database.windows.net
    prd: prd-sqls-adp-01.database.windows.net
sources:
- type: mssql
  name: source1
  config:
    server: ${{{{ variables.server }}}}
    database: sqld-adp-01
    auth_type: sql
    username: ${{{{ secrets.sqld-adp-01-admin-username }}}}
    password: ${{{{ secrets.sqld-adp-01-admin-password }}}}
entities:
- name: mock_table_sql_server
  source:
    name: source1
    config:
      spark_read_options:
        dbtable: 'dbo.mock_table'
  settings:
    keys:
    - id

In the entity source configuration, we specify spark_read_options. These are the options Spark needs in order to read the data. In this MSSQL example, we provide the table name through dbtable.

For CSV files, the same section is used to tell Spark things like which delimiter is used and whether a header is present. The image below shows a CSV file without a header and with a non-standard delimiter.

A CSV file without a header and with a custom delimiter.

A CSV ingest might look like this:

name: mock_layer
sources:
- type: smb
  name: source1
  config:
    root: saadp01.file.core.windows.net\\fssaadp01\\smb
    auth_protocol: negotiate
    username: localhost\saadp01
    password: ${{{{ secrets.smb-adp-01-password }}}}
entities:
- name: mock_table_smb
  source:
    name: source1
    config:
      adls_file_pattern: 'username.csv'
      spark_read_options:
        file_type: csv
        delimiter: ';'
        header: true
  settings:
    keys:
    - _Identifier

Here, delimiter sets the separator and header declares that the file contains column headers.

Reading an Excel file works in exactly the same way: the only real difference is which spark_read_options you provide. An Excel ingest can look like this:

name: mock_layer
sources:
- type: smb
  name: source1
  config:
    root: saadp01.file.core.windows.net\\fssaadp01\\smb
    auth_protocol: negotiate
    username: localhost\saadp01
    password: ${{{{ secrets.smb-adp-01-password }}}}
entities:
- name: mock_excel_uppercase_smb
  source:
    name: source1
    config:
      adls_file_pattern: 'testexcel.xlsx'
      spark_read_options:
        file_type: excel
        dataAddress: Sheet1!A2:H10
        headerRows: 1
  settings:
    keys:
    - prefix1.Column5
    columns_to_lowercase: false

With the Excel ingest above, we read the following worksheet content:

Excel data range selected for ingest. — The selected range in the Excel workbook.

With dataAddress, we specify where the relevant data is located. In this example, the data is on Sheet1 and spans from cell A2 in the top left to H10 in the bottom right.

With headerRows, we specify that this range contains one header row. Because the dataAddress starts at row 2, that means row 2 is treated as the header row.

Arguments and defaults

These options do not always need to be set explicitly, because they have useful default values. The official Databricks documentation describes the supported Excel options in more detail:

Read Excel files - Azure Databricks | Microsoft Learn

This documentation contains several useful details.

Formula columns

One useful capability is support for reading formula columns, meaning values that are calculated in Excel rather than stored directly as static cell values.

The example below shows a formula column that adds the values of two other columns.

Example of an Excel formula column. — A formula column inside the Excel sheet.

Key features of spark excel ingest. — The calculated values can be ingested as data.

`dataAddress`

The default value of dataAddress is all cells from the first worksheet. You can also provide only a worksheet name, in which case all cells from that worksheet are read.

`headerRows`

The default value of headerRows is 0. In that case, column names are generated automatically.

`operation`

The default value of operation is readSheet, which reads worksheet data. The other supported option is listSheets. That option is less useful for ingesting data, but it can be helpful if your goal is to retrieve the available worksheet names in the workbook.

Date and timestamp formats

You can also customize date and timestamp parsing using dateFormat and timestampNTZFormat. Their defaults are yyyy-MM-dd and yyyy-MM-dd'T'HH:mm:ss[.SSS].

Documentation excerpt for date and timestamp format arguments. — Default formats for dates and timestamps.

Limitations

Excel ingest also has some limitations. For example, only a single header row is supported. In practice, this means you can use 0 or 1 for headerRows.