Skip to content

How to create resilient Excel models using spilling Dynamic Array Formulas for automation purposes – and not just only for SlideFab

When it comes to programming most coders or programmers accept that code and data need to be separated. “Avoiding hard-coding” is a common best practice in this regard. Unfortunately, when it comes to Excel this idea hasn’t prevailed yet. Even when it comes to VBA coding within Excel many developers still rely on intermingling code into huge XLSB files, wondering why performance and maintainability are so poor. However, this post is not about writing VBA add-ins for Excel (using XLAM files).

This post is about keeping data and calculation logic (instead of code) separate within mere Excel files. Basically an outrageous idea, as “dragging formulas” and “adapting ranges” are instantaneous associations for most Excel users once input data changes. I will explain how Dynamic Array Formulas will help to build highly robust Excel models which will adapt to changes in the raw data automatically.

Accordingly, the concepts provided here are ideal for Excel calculations that need to last, like survey evaluations or corporate reporting. Both have in common that data structure remains rather similar while new surveys or new months are just around the next corner, respectively.

Disclaimer: What you will find here is my personal opinion based on more than 10 years of Excel and VBA experience. You might have a different opinion. But maybe you will learning something new nevertheless 😉

Classic ways to build an Excel model based on data files, e.g. from survey tools or business software data dumps

Let’s assume that there is raw data which fuels some calculations. This data could be a result table from a web-based survey tool or some business data, such as a list of supplier-related spend items. This data might serve for some kind of automation task. It could be PowerPoint slide creation using SlideFab. But it could also be a VBA script which dissects the data into multiple files, e.g. one file per supplier. Such a task is the starting point for the ideas presented here.

Typical approaches on structuring logic and raw data in an Excel model

Based on my experience, there are typically three approaches how an Excel model can be build in this situation.

The basic approach: Work in the raw data directly

Plenty of Excel Users open up the data file and then add columns with formulas and lookups wherever required. Experienced users know that adding columns only on the left and the right end of the original data is smart. If there is new raw data, they paste it just into the position where the old data was. If the number of rows changes, they will drag the formulas.

The intermediate approach: Use a staging area for the raw data

Some Excel users create an Excel workbook for calculation purposes which contains a kind of “staging area” worksheet where raw data can be pasted upfront. Then calculations reside on separate sheets, referring to the stage. This is not so bad as it further separates data from logic (i.e. the formulas). This is cleaner than the basic approach as it makes raw data locations easy to find.

The advanced approach: Put the raw data into Excel table stages

The minority of Excel users might follow the idea of the intermediate approach but use Excel tables as staging areas. This is really useful as it creates legible formulas which are based on table and column names instead of ordinary worksheet and cell addresses. The tables ensure unique table headers (i.e. column names). On top of this it also reduces the need for dragging formulas to some degree.

All these approaches have in common that they mix up data and calculation logic in one workbook. Even a soberly and robustly structured Excel model cannot avoid this drawback. Accordingly, these Excel models don’t need to be bad all the time, but there is some room for improvement. Besides, when looking at the most examples provided here on SlideFab.com, then you will that they follow the advanced approach. This was the best approach I came up with, when no VBA coding was part of the game.

Let’s look at something different but related: Data links.

What is wrong with linking Excel data files?

Accidentally linking external Excel files happens quite often: Users copy and paste formulas between two workbooks and then there is a link: The pasted formula references not only the sheets and ranges but also the original workbook. Quite annoying to get rid of, most of the time.

When I see users linking to other Excel files, they often create a link here and there to certain cells or ranges in some other workbook. This works. But it is rather fragile, because changes to linked data might not be reflected in the linking workbook. Accordingly, I haven’t been a fan of such an Excel modelling approach. In my opinion, it is rather bad practice in most situations. Having scattered links across a workbook can get pretty messy.

But before discussing improvement ideas to external Excel data links, there is new Excel functionality worth mentioning. Most of the even more experienced Excel users I know weren’t aware of these formulas. So a quick wrap-up makes sense here.

What are Dynamic Array Formulas and why are they so handy?

Microsoft introduced Dynamic Arrays and Dynamic Array Formulas in September 2018. They introduced the so-called “spilled array behaviour”. It means that a formula can now reference a range where it expected only a single cell before. In these situations a single formula can not only affect the first cell but also the others. Its calculation result “spills” over to other cells as well. A quite artificial description, I know, below are some examples in order to shed some more light on how they work.

This alone is quite nice, but it gets even better: Excel allows to reference formulas to the spilled range, not only the cell which holds the formula. And when the spilled range changes, then all referencing formulas will reflect this without manual intervention.

Let’s consider a simple example about Dynamic Array Formulas

The following screenshot shows an example. The formulas aggregate results per player and in total. While the “old style” formula approach requires one unique formula per player (columns E and F), the “spilling style” with Dynamic Array Formulas requires only one formula for all players (columns G and H). Even though it is not visible on the screenshot: The range G4:G10 is empty. The results are spilled from G3 downwards since it sums up ranges instead of cells. Also note the # at the end of formula =SUM(G3#) in cell H11. This hashtag makes sure that the whole spilling range (i.e. G3:G10) is considered here (more regarding the # notation will follow).

Depiction of the difference in Excel model between classical Excel formula and new dynamic array formula
Summarising results per player and in total. Using the old formula style in columns E and F and the spilling style in columns G and H.

On top of this spill behaviour there are a couple of new functions available which make Excel work much easier than it was before. For example, new functions like SORT, UNIQUE or FILTER are tremendously useful. Reason is that without them significant work (let it be Excel formula writing or coding VBA) is required to achieve the same functionality. And they all play a role when it comes to building robust Excel models for automation tasks.

Another example gives a flavour of the power of Dynamic Array Formulas

The following screenshot depicts another example based on the one before. It illustrates how to extract the top 3 players with their total results in the right order. And all of this is possible with just one formula. A quick explanation of this formula in K3: SORTBY is used to sort the range A3:E10 by values of total result range E3:E10 decreasingly. INDEX is then used to select the top 3 rows using TRANSPOSE({1,2,3}) and columns for player name and total result using {1,5}. Cells K4,K5,L3,L4,L5 have no formula.

Example how to create a complex query with one short but powerful dynamic array formula
Selecting the top 3 players with their total results in the right order. Possible with just one formula. (Please note: The worksheet depicted starts in A2 due to a filtered row 1)

It is easy to grasp that classical Excel function are much weaker in comparison. Several calculation ranges would calculate this outcome step by step.

As explained above, mere linking of some external ranges in other workbooks is not really a game changer, but actually this is more detrimental. Instead, the whole data area on a worksheet will be linked using Dynamic Array Formulas. But not in a brute-force way such as a 1:1048576 reference which basically draws all values from the other worksheet. This way is really bad as it costs performance without any gain. Even worse, the following approach would be harmed. So, let’s have a look at a smarter way.

The idea is to only link the range where really data is. With a dynamic formula it allows that changes in the amount of data (let it be more/less rows or columns) gets reflected automatically. So how can this be done? For sake of simplicity we assume that data starts in A1 and has a column header in row 1:1 and row headers in columns A:C. The rest of this document will use some dummy financial data. This data contains four KPIs (i.e. revenues, cost of goods sold (COGS), margin and units) for countries grouped into regions between January 2020 and December 2021.

So the exemplary data looks like this:

Exemplary datafile
This is the exemplary external data file containing various business KPIs for multiple months.

Now the question is: How many rows and columns are there?

Fortunately, the following formula will answer this question:

=OFFSET('[ExternalFile.xlsx]datasheet'!$A$1,0,0,
  LOOKUP(2,1/('[ExternalFile.xlsx]datasheet'!$A:$A<>""),
    ROW('[ExternalFile.xlsx]datasheet'!$A:$A)
  ),
  LOOKUP(2,1/('[ExternalFile.xlsx]datasheet'!$1:$1<>""),
    COLUMN('[ExternalFile.xlsx]datasheet'!$1:$1)
  )
)

The idea of this formula is to use OFFSET starting in A1 of the external worksheet. And then resize the range to cover the last entry in row 1 and column 1. These last entries are found using the LOOKUPROW/COLUMN functions.

The result of this formula is a range covering the whole data in the external worksheet. According to the spilling behaviour, this formula will drag all external data and put it into its worksheet.

As a side note: The OFFSET formula is sometimes combined with COUNTA to obtain the number of non-empty entries. When there are no blanks, this will work, too. Where there are, it will underestimate the row or column count respectively. Accordingly, the formula above is the recommended version.

The linked area of the exemplary external datafile
The CalculationLogic.xlsx workbook links the “datasheet” worksheet in the ExternalFile.xlsx workbook using 1 Dynamic Array Formula in cell A1.

To be precise, it is possible to improve the idea from above a little bit: Instead of linking everything starting in cell A1, it is smarter to separate column headers into one lookup and data into another one.

So the column header formula goes into datasheet_loading_improved!A1. It works in a similar way but considers only the first row of the linked worksheet:

=OFFSET('[ExternalFile.xlsx]datasheet'!$A$1,0,0,,
  LOOKUP(2,1/('[ExternalFile.xlsx]datasheet'!$1:$1<>""),
    COLUMN('[ExternalFile.xlsx]datasheet'!$1:$1)
  )
)

The data body linking formula goes into datasheet_loading_improved!A2 and just starts referencing below the column header row like this:

=OFFSET('[ExternalFile.xlsx]datasheet'!$A$2,0,0,
  LOOKUP(2,1/('[ExternalFile.xlsx]datasheet'!$A:$A<>""),
    ROW('[ExternalFile.xlsx]datasheet'!$A:$A)
  ) - 1,
  LOOKUP(2,1/('[ExternalFile.xlsx]datasheet'!$1:$1<>""),
    COLUMN('[ExternalFile.xlsx]datasheet'!$1:$1)
  )
)
The improved version of the linked worksheet referring the external data file
The CalculationLogic.xlsx workbook links the “datasheet” worksheet in the ExternalFile.xlsx workbook using 2 Dynamic Array Formulas in cells A1 (for the header row) and A2 (for the remaining rows).

So, what’s the benefit of linking data in such a way using Dynamic Array Formulas?

There are a couple of benefits of this approach.

  1. When linking all raw data , then it is possible to separate all data from the logic. So, data is one file (or multiple files, depending on the data) and in another file is the logic, i.e. formulas and structure. So, when new data arises, it is only necessary to update the links between the logic Excel workbook and the new data file(s). Also, when there are changes to the logic Excel Workbook, there is no problem getting the data into it. Again, it is only about updating the data links.
  2. Data can now be found under a new worksheet address making use of the spilling notation, ending with a hashtag, e.g. Sheet1!A1#. This range has just the right size, according to the formula above. So it is very convenient referencing this range. No rows or columns will be missed. No superfluous rows and columns will reduce performance.
  3. This approach is very clean. There is only 1 spilling formula referencing to external data (or maybe 2, when header row and data body are linked separately). No calculation logic will refer to external data directly but instead to these linking data areas. This makes formulas way more legible when they don’t contain path and filename information on top. And even better, in cases where a link breaks, there is no search for #REF! formula errors all across the Excel workbook. Accordingly, it will be easy to fix broken links.
Edit Links window from Excel
The linked raw files can be easily replaced using the Edit Links window which opens in the Excel ribbon via the Data tab.

How to handle the Excel range which is spilling linked data?

A huge range of possibilities emerges once the external data is linked. What the next tasks are depends on the data and the objectives. Fortunately, there are a couple of patterns which are useful in many situations. The next sections will shed some light on at least some of the possibilities.

Finding the right column number by its header name

It might happen that the external data file has a different column order. Or maybe there was a formula obtaining a calculated column name. Or maybe there are new columns added every month. So the question is: How to get the column data for further handling.

As the column header and the data is assumed to be in two separate spilling ranges, the formula to obtain the column number is rather easy to build.

=XMATCH(ColumnNumbers[[#Headers],[KPI]],datasheet_loading_improved!$A$1#)

This is a mere lookup of the column header (here addressed via an Excel table ColumnNumbers[[#Headers],[KPI]]) within the spilling header range datasheet_loading_improved!$A$1#.

An Excel table with make the column number easily accessible for further use.

Using an Excel table to have the column numbers within the spill range ready
An Excel table “ColumnNumbers” makes the column numbers easily available.

Getting a unique sorted list of entries for aggregation or iteration

Once the column number is clear, as explained before, it is possible to extract a complete column using the INDEX function. For sake of easier handling it makes sense to create a worksheet containing all unique lists which are supposed to be used for aggregation or iteration purposes. In the example, it makes sense to have unique list of the KPI names and the region names.

=UNIQUE(INDEX(datasheet_loading_improved!A2#,,ColumnNumbers[KPI]))

The INDEX function returns the whole KPI column from the data range using the column number as defined in the ColumnNumbers table, cf. above. In contrast to classical Excel formulas, the INDEX function can return more than a single cell now. Accordingly, when the row parameter is left empty (the double comma which has nothing in between is not a typo), it will just return all rows. Afterwards the UNIQUE function removes all duplicates. Even though it is not shown here: How to wrap a SORT function around this, should not be too challenging, when required.

Creating unique lists from the spilling data range using the UNIQUE and INDEX as well as the column number Excel table
Getting unique lists of KPIs and regions for further usage.

Fuelling dropdown lists with spilling Dynamic Array Formulas

In fact, it does not really require dropdown lists to build a robust Excel model for automation. However, they make testing and manual use a lot more comfortable. Fortunately, the unique lists as defined above do an awesome job, when it comes to fuel dropdown lists in a dynamic way. Having a dropdown of dynamic lengths does not require the cumbersome OFFSET/COUNTA function constructs anymore.

For example, building a dynamic KPI selector is very straight forward (please note the # at the end of the formula):

Using the spilling formula with the unique list to create a drop down. Setup view
Dropdown validation using spill formula definition
Using the spilling formula with the unique list to create a drop down. Live view
Dropdown opened

Extracting values from the linked data worksheet using Dynamic Array Formulas

To get data from the spilling range is rather advanced, to not call it “cumbersome”. However, the reward is that formulas referring to these spilling ranges are highly robust and don’t require maintenance once the input data changes, as already explained above.

The following exemplary situations and their solutions shall give some inspiration.

Extracting cells or ranges using the FILTER function

When only certain rows or columns from the spilling range are required, the FILTER function does a great job. In its basic form it returns the rows/columns from the provided range that fulfil a certain condition. For example, when only Revenues are supposed to be shown, the FILTER function does a great job.

=FILTER(
   datasheet_loading_improved!A2#,
   INDEX(datasheet_loading_improved!A2#,,ColumnNumbers[KPI])=filterByKPI[KPI Selector],
   "Nothing found"
)

In this example, the filter function takes as first parameter the spilled data range and filters it in the second parameter by comparing the KPI column INDEX(datasheet_loading_improved!A2#,,ColumnNumbers[KPI]) with the selected KPI filterByKPI[KPI Selector]. Finally the third parameter contains a default message, if nothing is found at all.

The filter dynamic array formula function helps to select parts of the spilled data range based on a condition
The FILTER functions selects Revenue parts of the original data based on a KPI condition.

Aggregating data two-dimensionally using the MAKEARRAY and LAMBDA function

A rather challenging task is to use Dynamic Array Formulas to aggregate data two-dimensionally without dragging the formula. For example, when summing up the values per KPI for all the months at once. While it is rather simple to span a horizontal header row with the months and a vertical header column with the unique KPIs, the actual calculation is not straight-forward. In fact, it requires both the new MAKEARRAY and also LAMBDA functions to be solved.

=MAKEARRAY(
    ROWS(A3#),
    COLUMNS(B1#),
    LAMBDA(
       r,
       c,
       SUM(
          IFERROR(
             datasheet_loading_improved!A2# * 
             --(INDEX(datasheet_loading_improved!A2#,,ColumnNumbers[KPI])=INDEX(A3#,r)) * 
             --(datasheet_loading_improved!A1#=INDEX(B1#,1,c))
          ,0)
       )
    )
)

The MAKEARRAY formula spans an array with the required number of rows and columns via the first two parameters ROWS(A3#) and COLUMNS(B1#).

The third parameter of MAKEARRAY takes the LAMBDA function and provides provides the number of rows r and columns c to its first two parameters.

The third parameter of LAMBDA takes here the SUM function to simulate a SUMIFS calculation. It has three factors:

  1. The data range datasheet_loading_improved!A2#
  2. The row condition --(INDEX(datasheet_loading_improved!A2#,,ColumnNumbers[KPI])=INDEX(A3#,r))
  3. The column condition --(datasheet_loading_improved!A1#=INDEX(B1#,1,c))

The conditions in 2) and 3) result to 1 or 0 when applicable or not. So the SUM counts a value where it matches and 0 where it does not. Besides, the formulas in 2) and 3) are using row number r and column number c with INDEX to obtain the condition values.

Advanced 2d aggregation using the MAKEARRAY and LAMDA function within Excel
Advanced aggregation using MAKEARRAY and LAMBDA function to

How to deal with overlapping spilling areas?

The spilling areas will take as much space as required. When raw data increases, the spilling ranges will grow as well. This might lead to situations where spilling ranges overlap and the nasty #SPILL! error occurs. Thinking how much “reserve” should be left between calculation areas is seducing here. It follows the old concept of letting formulas range from row 1 to 5,000 (or the number of your choice) because it is unconceivable that there are more rows. And if there are more rows, the formulas will miss the data. The alternative of changing formula manually is also not that attractive. Good luck with this in larger Excel models, where it is so easy to overlook a formula change here and there.

Only one robust solution exists

So in situations, where both the number of rows and number of columns are dynamic, only one robust solution exists: One calculation per sheet, which theoretically can cover up to a million rows and 16,384 columns. Doing complex, dependent calculations will then lead to a series of calculation sheets. Each of these sheets will be very tidy and lean because there should in most cases not be too many spilling formulas required. So naming of worksheets gets important, to avoid confusion. But with this in place the calculation logic will be very tidy and stringent.

But no rule without exceptions

  1. There are also spilling formulas which will grow either in rows or in columns. Here it is much simpler to structure a spill-overlap-free worksheet with multiple calculations on it.
  2. If some “main” spilling area grows both towards the right side and to the bottom, then there are possibilities on the left and on the top for all kind of calculations which do not spill in the direction of the “main” spilling area. This will lead to an inverse direction, where calculation steps flow from right to left or bottom to top, respectively. This is uncommon for sure, but it avoids potential spill overlaps.

What are the drawbacks of this approach?

Of course, also this coin has two sides. While most of the article described the pros, here should also be some space for the cons:

  1. Linking external data require that the linked workbooks are open. This can be nasty, when there are too many linked Excel files.
  2. Using Dynamic Array Formulas and the spilling mechanism are not as legible as when using Excel tables instead. It is more like classic Excel: Worksheet names and addresses appear within the formulas.
  3. The spilling areas of a formula only return the values but not the formats. In particular missing number format create additional effort, e.g. by applying formats manually or using the TEXT function for it.
  4. The modularity when separating data from calculation logic leads to separate files. Some people might not like this and instead prefer “heavy” Excel files.

Processing Dynamic Array Formula calculation results for further use

Ensure number formats

After creating the calculations it is time to think about the representation of the results. This is number formatting of course. As the spilling ranges do not carry their original number format with them, it makes a lot of sense to use the TEXT function to apply the desired number format. This also preserves the spilling idea which would suffer, when formatting ranges manually. If so, then it is good to format the whole row or column, to never run out of format settings once the spilling range grows.

Creating charts based on spill ranges

Of course, charts make sense to show some insights into the results. As charts do not support Dynamic Array Formulas, a little trick will do the job: Wrapping the spill range into a named range helps. Then charts will have a much better time supporting the spilling functionality.

Summary

This article gave a rather brief overview on how to build a resilient Excel calculation workbook. The concepts brought forward build on the new Excel Dynamic Array Formulas and a separation of calculation logic and raw data. These formulas helped to avoid dragging ranges, increasing the Excel model maintainability of all formulas. Instead these new formulas adapt to changes in the raw data automatically as their spilling area takes as much space as required. The article did not go into detail how to build a SlideFab automation. This will be covered in separate blog posts.

Of course, the example files shown in this blog post are also available for download:

Do it yourself vs. purchasing Excel and SlideFab services

A personal comment at the end: In this blog post I shared a wealth of details of my thinking process on how to build resilient Excel models. However, the content in here requires an advanced Excel user for sure. If you like what you see in here, but don’t have the skills or time to build it yourself, please reach out. Building complex Excel models – in particular but not exclusively for SlideFab – is one of our offerings.

2 thoughts on “How to create resilient Excel models using spilling Dynamic Array Formulas for automation purposes – and not just only for SlideFab”

  1. I will have to give that some consideration. I had thought more in terms of importing the Lambda function/s to the data rather than the other way around.
    = ProcessDataλ(rawdata, options)
    The options might determine which intermediate tables are stacked to form the output. I do expect the data owner to correctly populate the range they identify as containing the raw data, otherwise the Lambda function could spend a lot of time processing blanks until it crashes.

  2. Dear Peter.

    Thank you for this comment.

    Regarding “importing the Lambda function/s to the data” I think this is a more classical approach. But I think it gives up the strict separation of data and formula logic. But I have to admit that I might not get the point you are making here.

    With the calculation speed, you are totally right. This is challenging. But with data and formulas separated, users don’t have to update the formula workbook when there is new data. So this avoids waiting for calculation. What I found useful is the FILTER function to remove any superfluous data as early as possible. But this is something which might be not applicable in all situations. For automation it works, as often only the current iteration data needs to be calculated. Luckily, I did not run into Excel crashes with this approach so far.
    Regarding the blanks: The formulas referencing external workbook try to be cheap on the range by looking for the last row and column with data. But if the raw data is relatively sparse, this approach might suffer. But getting raw data in useful formats doesn’t hurt most of the time 😉

    All the best
    Jens

Leave a Reply

Your email address will not be published. Required fields are marked *