Array formulas and large ranges

What can you learn here ?
  • array formulas VBA
  • using VBA profiler
  • using worksheet profiler

Are Array formulas viable with large ranges

On the Excel Blackbelts forum, someone asked a very interesting question about how to analyze relationships between customers and products.

I came up with a solution based on VBA, which was used as the basis for the Create a heatmap in Excel article and related blog post. As usual this code is implemented in the arrayFormulas.xlsm workbook downloadable from Download Complete Projects

The problem scenario

The question was how to create a heatmap of associated products, starting from a list of pairs of customer and product number. At first this seemed like a job for some combination of array formulas involving countifs, sumproduct, but the user also said that the input data could be 100000 rows with as many as 30 products. This seemed intuitively too big for an array formula solution, so I pressed ahead with a VBA solution using the Data Manipulation Classes to speed up development.

Why not array formula ?

One of the group members, Peter Bartholomew came up with what I thought was a very elegant solution involving MMULT and COUNTIFS, and it got me wondering whether this could be automated and was it a viable solution to a large data problem. If you just want the answer - then yes, it can easily be automated and no, it's not a viable solution. It's not practical to use array formulas on very large datasets like this to solve this kind of problem. Read on for the analysis. However, I believe that if we can find a way to optimize this, a hybrid of VBA automation and formulae is probably the best solution. See Optimizing the array formula, and for a more viable solution see Leveraging sorted data with Match

Generating test data

The first problem was generating enough random valid test data. Here's how. It's not great data, since with a sample this size, most cells will end up about even by chance - but the main thing is it gives me volume.

Public Sub generateTestMatrixData()
    Dim r As Range, n As Long, kr As Long, kc As Long, i As Long, j As Long, a() As String
    Dim id As Long
    kr = 100000
    kc = 30
    ReDim a(1 To kc)
    For i = 1 To kc
        a(i) = "a" & Format(i, "0#")
    Next i
    Set r = Sheets("matrixin").Range("a1")
    id = 1
    i = 0
    While i < kr
        aShuffle a
        n = Int(kc * Rnd + 1)
        id = id + 1
        For j = 1 To n
         i = i + 1
         If i <= kr Then
            r.Offset(i).Value = id
            r.Offset(i, 1).Value = a(j)
            Exit For
         End If
        Next j
End Sub
Private Sub aShuffle(p() As String)
    Dim i As Long, S As String, j As Long
    Dim x As Long
    x = UBound(p) - LBound(p) + 1

    For i = LBound(p) To UBound(p)
        j = Int(x * Rnd + 1)
        S = p(j)
        p(j) = p(i)
        p(i) = S
    Next i

End Sub

This gives me a 100000 rows that start like this

Target solution

In the screenshots following, I'm only looking at the first 10,000 rows for the sake of speed. If you are playing around with this downloaded sheet, I suggest you set calculation to manual. If you run the example it will do that anyway.

The objective is to first transform the list into a list of unique customers down the side, with a list of potential products across the top, and a series of 0/1 in the table body that would indicate whether this product was used by this customer, like this

The final array formula to do this is (entered as a CSE array formula) is (thanks Peter)
=COUNTIFS(MatrixIn!$A$2:$A$10001, tempmatrix!$A$2:$A$662, MatrixIn!$B$2:$B$10001, tempmatrix!$B$1:$AE$1)

Next step is do matrix multiplication on this to create a table of product by product, which gives us this

using this formula (and thanks again Peter)
=MMULT(TRANSPOSE( tempmatrix!$B$2:$AE$662),tempmatrix!$B$2:$AE$662)

Adding heatmap

Next we'll apply heatmapping ramp formats to that table.

and lets do a couple of heatmap chart as well.

A smaller sample size with more randomness, gives us a nicer chart of course

The performance analysis

There are 4 main stages to this process
  1. Populating - reading and organizing the input data
  2. Creating formulas to populate the temporary usage table 
  3. Creating formulas for the matrix operation
  4. Calculating the worksheet.
Using Automatic Profiling tools which you can download on this website I ran various scenarios from 100 to 50,000 rows. Here are the calculation times in seconds. I didn't even bother with 100,000 and you can see why from the graph below.


Here is the code to automate all this
Sub matrixInExampleX()
    Dim maxrows As Long
    maxrows = 100
    Dim dsIn As cDataSet, dsout As cDataSet
    Dim cop As Collection, r As Range, coc As Collection
    Dim cht As Chart

    Set dsIn = New cDataSet
    Set dsout = New cDataSet
    ' sort it first
    Application.Calculation = xlCalculationManual
    Application.ScreenUpdating = False
    Worksheets("matrixin").Range("$A:$b").Sort _
        Key1:=Worksheets("matrixin").Range("A1"), _
        Key1:=Worksheets("matrixin").Range("b1"), _
    With dsIn
        ' create a dset with input data
        .populateData wholeSheet("matrixin"), , "matrixin", , , , True, , maxrows
        ' make collections of customers and products that appear
        Set cop = .Column("product").uniqueValues(eSortAscending)
        Set coc = .Column("customer").uniqueValues(eSortNone)
        ' create the output matrix
    End With
        'get newly created output matrix into a dataset
    dsout.populateData createMatrixFormulas(coc, cop, dsIn), _
            , "matrixout"
    Set dsIn = Nothing
    Application.ScreenUpdating = True
    ' create heatmap
    Set cht = createSurfaceChart(dsout, "heatmap")
    ''''Set cht = createSurfaceChart(dsOut, "heatmaptop", xlSurfaceTopView)
    Set cht = Nothing
    Set dsout = Nothing
    Set dsIn = Nothing

End Sub
Private Function createMatrixFormulas(coc As Collection, cop As Collection, _
                dsIn As cDataSet) As Range
    Dim r As Range, wr As Range

    Dim listcustomer As Range   'MatrixIn'!$A$2:$A$98685
    Dim listProduct As Range 'MatrixIn'!$B$2:$B$98685
    Dim tableentries As Range   'Sheet2'!$B$2:$J$35
    Dim tableHeadingCustomer As Range   'Sheet2'!$A$2:$A$35
    Dim tableheadingProduct As Range    'Sheet2'!$B$1:$J$1
    Dim matrix As Range
    Set wr = firstCell(wholeSheet("tempMatrix").Cells)
    wr.Value = wr.Worksheet.Name
    Set listProduct = dsIn.Column("product").where
    Set listcustomer = dsIn.Column("customer").where
    Set tableHeadingCustomer = _
        wr.Offset(1).Resize(coc.Count, 1)
    Set tableheadingProduct = _
         wr.Offset(, 1).Resize(1, cop.Count)
    Set tableentries = tableHeadingCustomer.Offset(, _
        1).Resize(coc.Count, cop.Count)
    Set r = tableHeadingCustomer.Resize(1, 1)
    For Each cc In coc
        r.Value = cc.Value
        Set r = r.Offset(1)
    Next cc
    Set r = tableheadingProduct.Resize(1, 1)
    For Each cc In cop
        r.Value = cc.Value
        Set r = r.Offset(, 1)
    Next cc
    ' the temporary array
    tableentries.FormulaArray = _
        "=COUNTIFS(" & list( _
            SAd(listcustomer), _
            SAd(tableHeadingCustomer), _
            SAd(listProduct), _
            SAd(tableheadingProduct)) & ")"
    ' the final table
    Set matrix = firstCell(wholeSheet _
        ("matrixOutx")).Resize(cop.Count, cop.Count).Offset(1, 1)
    Set r = firstCell(matrix.Offset(-1))
    r.Offset(, -1).Value = "matrix"
    For Each cc In cop
        r.Value = cc.Value
        Set r = r.Offset(, 1)
    Next cc
    Set r = firstCell(matrix.Offset(, -1))
    For Each cc In cop
        r.Value = cc.Value
        Set r = r.Offset(1)
    Next cc
    matrix.FormulaArray = _
     "=MMULT(TRANSPOSE( " & SAd(tableentries) & ")," & _
            SAd(tableentries) & ")"
    Set createMatrixFormulas = wr.Resize(cop.Count + 1, _
            cop.Count + 1)
End Function