I am supporting CandidateX

CandidateX is a startup that focuses on creating inclusion-focused hiring solutions, designed to increase access to job opportunities for underestimated talent. Check them out if you have a few minutes to spare. They need visibility!

What can you learn here?

  • Optimization and profiling
  • filling ranges using VBA
  • using match to partition data

Another approach to Array formulas and large Ranges

previously covered a VBA solution for a problem that could have been solved using array formulas. Even after Optimizing the array formula the result still wasn’t acceptable because of  the exponential time to calculate cells using countifs on large scale dataset. This article explores how you can use MATCH() to leverage sorted data to cut calculation times down to a fraction of their natural calculation times. You can download this workbook, arrayformulas.xlsm, from Downloads

What’s good enough?

In our example, through Optimizing the array formula we have reduced the calculation times by a factor of 10 for a medium sized (50,000 * 30) test data sample. Althttps://ramblings.mcpher.com/array-formulas-doing-lots-of-things-at-once/array-formulas-and-large-ranges/hough this sounds like a good result, it’s not good enough – since workbook calculation still takes 5 minutes, and run times are exponential rather than linear. So we need to find a way to get this down to under a second, and at the same time get to a predictable run time for bigger data sets. Here are the current calculation results using countifs for the cell call calculations.  The problem here is that the bigger the data set, the more countifs has to search multiple times – so how to get a small, predictable cut of the data for each search. Our input data is sorted, and looks like this We are trying to create a matrix of 1/0 with customer down the side and product along the top that looks like the below so that we can perform MMULT() on it to get this 

If we could encourage an operation in the body of the 1/0 operation to only search in the part of the list for the customer relevant for that row, we could reduce the processing that needs to be done by a large factor. Luckily, MATCH(), has 3 modes

 And since our input data is sorted we can use a combination of 0 and -1 match options to partition the input data into small chunks. To help with that we are going to use a couple of helper columns to minimize the amount of recalculations – so the 1/0 tabel now looks like this – the first two columns showing the offset row number and length of the input data concerning the customer in column C  The formula for cell B3is –=MATCH($C3,OFFSET(MatrixIn!$A$2,$A2+$B2,0,31,1),0)+$A2+$B2-1 This is saying that we need to do an exact match on the customer number in cell C3, but we only need to look at the small strip of the input data that starts after the section belonging to the previous customer is finished. The length of the section is in cell A3, and can be calculated as follows=MATCH($C3+1,OFFSET(MatrixIn!$A$2,$B3,0,31,1),)-1 In this case we are going to count from the place that this customer’s section starts to the place when the data doesn’t belong to this customer any more ($c3+1). Although most people use 0 as the 3rd parameter to MATCH(), the default behavior is to stop when we match >=  $c3+1. These 2 simple formulas can be used to slice the input data into small manageable chunks. We now know which row each customer’s data starts on, and how many entries there are.  They can just be filled down since care has been taken with the use of $ as to what gets incremented in a fill operation. (but note that the formula for cell c2 is a little different to all the others since we don’t have a previous customer to refer to)

How to replace the countifs functionality with MATCH.

Using these two helper columns, the body of the table formula is a piece of cake. This is cell D2 =--ISNUMBER(MATCH(D$1,OFFSET(MatrixIn!$A$2,$B2,1,$A2,1),0)) This is checking the section of the input data used by this customer and defined by columns A & B, to see whether there is a match with the product associated with this column, and converting the true/false returned by isnumber() to 1/0. Again, this can be filled across an down for the whole table body.

Efficiency

We started this exercise at 1500 seconds for a 50,000 * 30 matrix, then through Optimizing the array formula got that down to 300+ seconds. Incredibly, this approach gets us to .5 seconds – a 600x improvements. When you think of it though, we are searching on average about 15 rows per customer instead of 50,000 so actually it is not really surprising that we see a dramatic improvement. Not only that, but the time to calculate is linear. Doubling to 100,000 records actually takes not much more than double the time at 1.2 seconds Here are the results   

Automating

We could just leave it at that, but a big data set like would be error prone to setup manually as well as taking a lot of time, so using some of the VBA approach used in Array formulas and large ranges, this is pretty straightforward to automate.Here is the code, which uses Data Manipulation Classes Sub matrixmatchExample()    Dim maxrows As Long    maxrows = 100000    Dim dsIn As cDataSet, dsout As cDataSet    Dim cop As Collection, r As Range, coc As Collection    Dim cht As Chart    Set dsIn = New cDataSet    Set dsout = New cDataSet    ' sort it first    Application.Calculation = xlCalculationManualApplication.ScreenUpdating = False    Worksheets("matrixin").Range("$A:$b").Sort _        Key1:=Worksheets("matrixin").Range("A1"), _        Key2:=Worksheets("matrixin").Range("b1"), _        header:=xlYes    With dsIn        ' create a dset with input data        .populateData wholeSheet("matrixin"), , "matrixin", , , , True, , maxrows        ' make collections of customers and products that appear        Set cop = .Column("product").uniqueValues(eSortAscending)        Set coc = .Column("customer").uniqueValues(eSortNone)        ' create the output matrix        wholeSheet("matrixout").Cells.ClearContents        wholeSheet("tempmatrix").Cells.ClearContents    End With        'get newly created output matrix into a dataset    dsout.populateData creatematchMatrixFormulas(coc, cop, dsIn), _            , "matrixout"    Set dsIn = Nothing    Application.ScreenUpdating = True    Application.Calculate    Application.Calculation = xlCalculationAutomatic    ' create heatmap    'executeHeatMapScale dsout    ''Set cht = createSurfaceChart(dsout, "heatmapx")    Set cht = createSurfaceChart(dsout, "heatmap", xlSurfaceTopView)    Set cht = Nothing    Set dsout = Nothing    Set dsIn = NothingEnd Sub and to populate the workbook with the formulae discussed earlier. Private Function creatematchMatrixFormulas(coc As Collection, cop As Collection, _                dsIn As cDataSet) As Range    Dim r As Range, wr As Range, cc As cCell    Dim listCustomer As Range    Dim listProduct As Range    Dim tableentries As Range    Dim tableHeadingCustomer As Range    Dim tableheadingProduct As Range    Dim helperStart As Range    Dim helperLength As Range    Dim matrix As Range    Set wr = firstCell(wholeSheet("tempMatrix").Cells)    ' these 2 columns will contain the start row and length of    ' list of products used by each customer    Set helperStart = wr.Resize(coc.Count, 1).Offset(1, 1)    helperStart.Offset(-1).Cells(1, 1).Value = "Start"    Set helperLength = helperStart.Offset(, -1)    helperLength.Offset(-1).Cells(1, 1).Value = "Length"    Set tableHeadingCustomer = helperStart.Offset(, 1)    tableHeadingCustomer.Offset(-1).Cells(1, 1).Value = "Customer"    Set tableheadingProduct = tableHeadingCustomer.Offset(-1, 1).Resize(1, cop.Count)    Set tableentries = tableHeadingCustomer.Offset(, _        1).Resize(coc.Count, cop.Count)    Set listCustomer = dsIn.Column("customer").Where    Set listProduct = dsIn.Column("product").Where    Set r = tableHeadingCustomer.Resize(1, 1)    For Each cc In coc        r.Value = cc.Value        Set r = r.Offset(1)    Next cc    Set r = tableheadingProduct.Resize(1, 1)    For Each cc In cop        r.Value = cc.Value        Set r = r.Offset(, 1)    Next cc    'START    ' the first cell is a little different    '=MATCH($C2,MatrixIn!$A$2:$A$100001,0)-1    helperStart.Cells(1, 1).Formula = _        "=Match(" & list( _            SAd(tableHeadingCustomer.Cells(1, 1), helperStart, , True), _            SAd(listCustomer, helperStart), 0) & ")-1"    '=MATCH($C3,OFFSET(MatrixIn!$A$2,$B2+$A2,0,31,1),0)+$A2+$B2-1    With helperStart.Offset(1).Resize(helperStart.Rows.Count - 1)        .Cells(1, 1).Formula = _        "=Match(" & list( _            SAd(tableHeadingCustomer.Cells(2, 1), helperStart, True, True), _            "offset(" & list( _                    SAd(listCustomer, helperStart, True), _                    SAd(helperLength, helperStart, True, True) & "+" & _                    SAd(helperStart, helperStart, True, True), _                    0, cop.Count + 1, 1) & ")", 0) & ")+" & _                    SAd(helperLength, helperStart, True, True) & "+" & _                    SAd(helperStart, helperStart, True, True) & "-1"        .FillDown    End With    'LENGTH    '=MATCH($C2+1,OFFSET(MatrixIn!$A$2,$B2,0,31,1),)-1    helperLength.Cells(1, 1).Formula = _        "=Match(" & list( _            SAd(tableHeadingCustomer.Cells(1, 1), helperStart, True, True) & "+1", _            "offset(" & list( _                    SAd(listCustomer, helperStart, True), _                    SAd(helperStart, helperStart, True, True), _                    0, cop.Count + 1, 1) & ")") & ",)-1"    helperLength.FillDown    'table body    '=--ISNUMBER(MATCH(D$1,OFFSET(MatrixIn!$A$2,$B2,1,$A2,1),0))    tableentries.Cells(1, 1).Formula = _        "=--isnumber(match(" & list( _            SAd(tableheadingProduct, tableentries, True, , True), _            "offset(" & list( _                SAd(listCustomer, tableentries, True), _                SAd(helperStart, tableentries, True, True), _                1, SAd(helperLength, tableentries, True, True), 1) & ")", 0 _            ) & "))"    tableentries.Resize(1).FillRight    tableentries.FillDown    ' the final table    Set matrix = firstCell(wholeSheet _        ("matrixOut")).Resize(cop.Count, cop.Count).Offset(1, 1)    Set r = firstCell(matrix.Offset(-1))    r.Offset(, -1).Value = "matrix"    For Each cc In cop        r.Value = cc.Value        Set r = r.Offset(, 1)    Next cc    Set r = firstCell(matrix.Offset(, -1))    For Each cc In cop        r.Value = cc.Value        Set r = r.Offset(1)    Next cc    matrix.FormulaArray = _     "=MMULT(TRANSPOSE( " & SAd(tableentries) & ")," & _            SAd(tableentries) & ")"    Set creatematchMatrixFormulas = matrix.Offset(-1, -1).Resize(cop.Count + 1, _            cop.Count + 1)End Function

Summary

Now we have a viable and scalable solution using MATCH(), MMULT() and a bit of VBA to automate the whole thing. If you want to read more about the detail of this problem, see Array formulas and large ranges. You can download this workbook, arrayformulas.xlsm, from Downloads Why not join our forum, follow the blog or follow me on twitter to ensure you get updates when they are available. You can also submit anything you want considered for publication on this site to our forum.