Excel ingest
============

This page describes how Excel files (``.xls`` and ``.xlsx``) are ingested with ADP. It explains which ``spark_read_options`` are relevant for Excel ingest, how they affect the selected workbook data, and which defaults and limitations apply.

Excel files in ingest
---------------------

Spark can read Excel files directly, specifically files in the ``.xls`` and ``.xlsx`` formats. Excel files are more complex than flat file formats that contain only one table, but ingesting them only requires a few Excel-specific ``spark_read_options``. To understand these options, it helps to know that Excel workbooks can contain multiple worksheets, multiple tables on the same worksheet, and even implicit data such as formulas that must be interpreted by the software opening the file. By contrast, a simpler format such as CSV typically contains a single table.

.. figure:: figures/excel/0.png
   :alt: Example of a simple CSV file with a single table.
   :align: center

   A CSV file usually contains a single table.

When you open an Excel file, you may see multiple tables on one worksheet and multiple worksheets in the same file. If you want a computer to read data from an Excel file that is not limited to one simple table on one sheet, you must tell it where the relevant data is located.

.. figure:: figures/excel/1.png
   :alt: Example Excel workbook with multiple tables and sheets.
   :align: center

   Excel files can contain multiple tables and worksheets.

Ingest and ``spark_read_options``
---------------------------------

To understand how Excel ingest works, it helps to compare it to an ingest pattern that may already be familiar. Below is an example of an ingest that reads a table from an MSSQL server.
.. code-block:: yaml

   name: mock_layer
   variables:
     - name: server
       values:
         dev: dev-sqls-adp-01.database.windows.net
         tst: tst-sqls-adp-01.database.windows.net
         acc: acc-sqls-adp-01.database.windows.net
         prd: prd-sqls-adp-01.database.windows.net
   sources:
     - type: mssql
       name: source1
       config:
         server: ${{{{ variables.server }}}}
         database: sqld-adp-01
         auth_type: sql
         username: ${{{{ secrets.sqld-adp-01-admin-username }}}}
         password: ${{{{ secrets.sqld-adp-01-admin-password }}}}
   entities:
     - name: mock_table_sql_server
       source:
         name: source1
         config:
           spark_read_options:
             dbtable: 'dbo.mock_table'
       settings:
         keys:
           - id

In the entity source configuration, we specify ``spark_read_options``. These are the options Spark needs in order to read the data. In this MSSQL example, we provide the table name through ``dbtable``. For CSV files, the same section is used to tell Spark things like which delimiter is used and whether a header is present. The image below shows a CSV file without a header and with a non-standard delimiter.

.. figure:: figures/excel/2.png
   :alt: CSV file without a header and with a custom delimiter.
   :align: center

   A CSV file without a header and with a custom delimiter.

A CSV ingest might look like this:

.. code-block:: yaml

   name: mock_layer
   sources:
     - type: smb
       name: source1
       config:
         root: saadp01.file.core.windows.net\\fssaadp01\\smb
         auth_protocol: negotiate
         username: localhost\saadp01
         password: ${{{{ secrets.smb-adp-01-password }}}}
   entities:
     - name: mock_table_smb
       source:
         name: source1
         config:
           adls_file_pattern: 'username.csv'
           spark_read_options:
             file_type: csv
             delimiter: ';'
             header: true
       settings:
         keys:
           - _Identifier

Here, ``delimiter`` sets the separator and ``header`` declares that the file contains column headers. Reading an Excel file works in exactly the same way: the only real difference is which ``spark_read_options`` you provide. An Excel ingest can look like this:
.. code-block:: yaml

   name: mock_layer
   sources:
     - type: smb
       name: source1
       config:
         root: saadp01.file.core.windows.net\\fssaadp01\\smb
         auth_protocol: negotiate
         username: localhost\saadp01
         password: ${{{{ secrets.smb-adp-01-password }}}}
   entities:
     - name: mock_excel_uppercase_smb
       source:
         name: source1
         config:
           adls_file_pattern: 'testexcel.xlsx'
           spark_read_options:
             file_type: excel
             dataAddress: Sheet1!A2:H10
             headerRows: 1
       settings:
         keys:
           - prefix1.Column5
         columns_to_lowercase: false

With the Excel ingest above, we read the following worksheet content:

.. figure:: figures/excel/3.png
   :alt: Excel data range selected for ingest.
   :align: center

   The selected range in the Excel workbook.

With ``dataAddress``, we specify where the relevant data is located. In this example, the data is on ``Sheet1`` and spans from cell ``A2`` in the top left to ``H10`` in the bottom right. With ``headerRows``, we specify that this range contains one header row. Because the ``dataAddress`` starts at row 2, row 2 is treated as the header row.

Arguments and defaults
----------------------

These options do not always need to be set explicitly, because they have useful default values. The official Databricks documentation describes the supported Excel options in more detail: *Read Excel files - Azure Databricks | Microsoft Learn*. This documentation contains several useful details.

Formula columns
~~~~~~~~~~~~~~~

One useful capability is support for reading formula columns, meaning values that are calculated in Excel rather than stored directly as static cell values. The example below shows a formula column that adds the values of two other columns.

.. figure:: figures/excel/4.png
   :alt: Example of an Excel formula column.
   :align: center

   A formula column inside the Excel sheet.

.. figure:: figures/excel/5.png
   :alt: The calculated values of a formula column after ingest.
   :align: center

   The calculated values can be ingested as data.
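For workbooks where the relevant data fills an entire worksheet, the cell range can be omitted. The sketch below is illustrative only (the entity name and the exact sheet-only address syntax are assumptions, not taken from a verified configuration); it shows the same ingest pattern pointed at a whole second worksheet:

.. code-block:: yaml

   # Hypothetical entity: reads all cells of the worksheet named Sheet2.
   # Omitting the cell range after the sheet name selects the whole sheet.
   entities:
     - name: mock_excel_second_sheet_smb
       source:
         name: source1
         config:
           adls_file_pattern: 'testexcel.xlsx'
           spark_read_options:
             file_type: excel
             dataAddress: Sheet2   # sheet name only, no explicit range
             headerRows: 1

As with the full example above, ``headerRows: 1`` makes the first row of the selected area the header row.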
``dataAddress``
~~~~~~~~~~~~~~~

The default value of ``dataAddress`` is all cells from the first worksheet. You can also provide only a worksheet name, in which case all cells from that worksheet are read.

.. figure:: figures/excel/6.png
   :alt: Documentation excerpt for the dataAddress argument.
   :align: center

   ``dataAddress`` defaults to all cells from the first worksheet.

``headerRows``
~~~~~~~~~~~~~~

The default value of ``headerRows`` is ``0``. In that case, column names are generated automatically.

.. figure:: figures/excel/7.png
   :alt: Documentation excerpt for the headerRows argument.
   :align: center

   ``headerRows`` defaults to ``0``.

``operation``
~~~~~~~~~~~~~

The default value of ``operation`` is ``readSheet``, which reads worksheet data. The other supported option is ``listSheets``. That option is less useful for ingesting data, but it can be helpful if your goal is to retrieve the available worksheet names in the workbook.

.. figure:: figures/excel/8.png
   :alt: Documentation excerpt for the operation argument.
   :align: center

   ``operation`` defaults to ``readSheet``.

Date and timestamp formats
~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also customize date and timestamp parsing using ``dateFormat`` and ``timestampNTZFormat``. Their defaults are ``yyyy-MM-dd`` and ``yyyy-MM-dd'T'HH:mm:ss[.SSS]``.

.. figure:: figures/excel/9.png
   :alt: Documentation excerpt for date and timestamp format arguments.
   :align: center

   Default formats for dates and timestamps.

Limitations
-----------

Excel ingest also has some limitations. For example, only a single header row is supported. In practice, this means you can use ``0`` or ``1`` for ``headerRows``.

.. figure:: figures/excel/10.png
   :alt: Documentation excerpt describing Excel ingest limitations.
   :align: center

   Only a single header row is supported.

FAQ
---

The Microsoft documentation also includes a FAQ section that may help answer common questions.
.. figure:: figures/excel/11.png
   :alt: FAQ excerpt from the Databricks Excel documentation.
   :align: center

   FAQ excerpt from the official documentation.

.. figure:: figures/excel/12.png
   :alt: Additional FAQ excerpt from the Databricks Excel documentation.
   :align: center

   More FAQ details from the official documentation.
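To make the ``dataAddress`` notation used throughout this page concrete, the sketch below splits an address such as ``Sheet1!A2:H10`` into its worksheet and cell-range parts. This is plain Python for illustration only; the helper name is hypothetical and not part of ADP or Spark:

.. code-block:: python

   def split_data_address(address: str):
       """Illustrative only: split a dataAddress string into
       (sheet, cell_range). A missing '!' means the address names
       a whole worksheet, so the cell range is None."""
       sheet, sep, cell_range = address.partition("!")
       return sheet, (cell_range if sep else None)

   # "Sheet1!A2:H10" -> ("Sheet1", "A2:H10")
   # "Sheet2"        -> ("Sheet2", None)

This mirrors the two forms described above: a sheet name with an explicit range, and a bare sheet name that selects all cells of that worksheet.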