Leveraging sorted data with Match

What can you learn here ?
  • Optimization and profiling
  • filling ranges using VBA
  • using match to partition data

Array formulas and large ranges - another approach


Even after Optimizing the array formula the result still wasn't acceptable because of  the exponential time to calculate cells using countifs on large scale dataset. This article explores how you can use MATCH() to leverage sorted data to cut calculation times down to a fraction of their natural calculation times. You can download this workbook, arrayformulas.xlsm, from Downloads

What's good enough?

In our example, through Optimizing the array formula we have reduced the calaculation times by a factor of 10 for a medium sized (50,000 * 30) test data sample. Although this sounds like a good result, it's not good enough - since workbook calculation still takes 5 minutes, and run times are exponential rather than linear. So we need to find a way to get this down to under a second, and at the same time get to a predictable run time for bigger data sets. 

Here are the current calculation results using countifs for the cell call calculations. 

The problem here is that the bigger the data set, the more countifs has to search multiple times - so how to get a small, predictable cut of the data for each search. 

Our input data is sorted, and looks like this 
->>


We are trying to create a matrix of 1/0 with customer down the side and product along the top that looks like the below


so that we can perform MMULT() on it to get this


If we could encourage an operation in the body of the 1/0 operation to only search in the part of the list for the customer relevant for that row, we could reduce the processing that needs to be done by a large factor. Luckily, MATCH(), has 3 modes

And since our input data is sorted we can use a combination of 0 and -1 match options to partition the input data into small chunks. To help with that we are going to use a couple of helper columns to minimize the amount of recalculations - so the 1/0 tabel now looks like this - the first two columns showing the offset row number and length of the input data concerning the customer in column C


The formula for cell B3 is - 
=MATCH($C3,OFFSET(MatrixIn!$A$2,$A2+$B2,0,31,1),0)+$A2+$B2-1

This is saying that we need to do an exact match on the customer number in cell C3, but we only need to look at the small strip of the input data that starts after the section belonging to the previous customer is finished. 

The length of the section is in cell A3, and can be calculated as follows
=MATCH($C3+1,OFFSET(MatrixIn!$A$2,$B3,0,31,1),)-1

In this case we are going to count from the place that this customer's section starts to the place when the data doesn't belong to this customer any more ($c3+1). Although most people use 0 as the 3rd parameter to MATCH(), the default behavior is to stop when we match >=  $c3+1.

These 2 simple formulas can be used to slice the input data into small manageable chunks. We now know which row each customer's data starts on, and how many entries there are.  They can just be filled down since care has been taken with the use of $ as to what gets incremented in a fill operation. (but note that the formula for cell c2 is a little different to all the others since we don't have a previous customer to refer to)

How to replace the countifs functionality with MATCH.

Using these two helper columns, the body of the table formula is a piece of cake. This is cell D2

=--ISNUMBER(MATCH(D$1,OFFSET(MatrixIn!$A$2,$B2,1,$A2,1),0))

This is checking the section of the input data used by this customer and defined by columns A & B, to see whether there is a match with the product associated with this column, and converting the true/false returned by isnumber() to 1/0. Again, this can be filled across an down for the whole table body.

Efficiency

We started this exercise at 1500 seconds for a 50,000 * 30 matrix, then through Optimizing the array formula got that down to 300+ seconds. Incredibly, this approach gets us to .5 seconds - a 600x improvements. When you think of it though, we are searching on average about 15 rows per customer instead of 50,000 so actually it is not really surprising that we see a dramatic improvement. Not only that, but the time to calculate is linear. Doubling to 100,000 records actually takes not much more than double the time at 1.2 seconds

Here are the results




Automating

We could just leave it at that, but a big data set like would be error prone to setup manually as well as taking a lot of time, so using some of the VBA approach used in Array formulas and large ranges, this is pretty straightforward to automate. 
Here is the code, which uses Data Manipulation Classes

Sub matrixmatchExample()
    
    Dim maxrows As Long
    maxrows = 100000
    Dim dsIn As cDataSet, dsout As cDataSet
    Dim cop As Collection, r As Range, coc As Collection
    Dim cht As Chart

    Set dsIn = New cDataSet
    Set dsout = New cDataSet
    
    ' sort it first
    Application.Calculation = xlCalculationManual
    Application.ScreenUpdating = False
    
    Worksheets("matrixin").Range("$A:$b").Sort _
        Key1:=Worksheets("matrixin").Range("A1"), _
        Key2:=Worksheets("matrixin").Range("b1"), _
        header:=xlYes
    
    With dsIn
        ' create a dset with input data
        .populateData wholeSheet("matrixin"), , "matrixin", , , , True, , maxrows
        ' make collections of customers and products that appear
        Set cop = .Column("product").uniqueValues(eSortAscending)
        Set coc = .Column("customer").uniqueValues(eSortNone)
        ' create the output matrix
        wholeSheet("matrixout").Cells.ClearContents
        wholeSheet("tempmatrix").Cells.ClearContents
        
    End With
        'get newly created output matrix into a dataset
    dsout.populateData creatematchMatrixFormulas(coc, cop, dsIn), _
            , "matrixout"
    
    Set dsIn = Nothing
    Application.ScreenUpdating = True
    Application.Calculate
    Application.Calculation = xlCalculationAutomatic
    ' create heatmap
    'executeHeatMapScale dsout
    ''Set cht = createSurfaceChart(dsout, "heatmapx")
    Set cht = createSurfaceChart(dsout, "heatmap", xlSurfaceTopView)
    Set cht = Nothing
    Set dsout = Nothing
    Set dsIn = Nothing

End Sub

and to populate the workbook with the formulae discussed earlier.

Private Function creatematchMatrixFormulas(coc As Collection, cop As Collection, _
                dsIn As cDataSet) As Range
    Dim r As Range, wr As Range, cc As cCell

    Dim listCustomer As Range
    Dim listProduct As Range
    Dim tableentries As Range
    Dim tableHeadingCustomer As Range
    Dim tableheadingProduct As Range
    Dim helperStart As Range
    Dim helperLength As Range
    Dim matrix As Range
    
    Set wr = firstCell(wholeSheet("tempMatrix").Cells)
    ' these 2 columns will contain the start row and length of
    ' list of products used by each customer
    Set helperStart = wr.Resize(coc.Count, 1).Offset(1, 1)
    helperStart.Offset(-1).Cells(1, 1).Value = "Start"
    Set helperLength = helperStart.Offset(, -1)
    helperLength.Offset(-1).Cells(1, 1).Value = "Length"
    
    Set tableHeadingCustomer = helperStart.Offset(, 1)
    tableHeadingCustomer.Offset(-1).Cells(1, 1).Value = "Customer"
    Set tableheadingProduct = tableHeadingCustomer.Offset(-1, 1).Resize(1, cop.Count)
    Set tableentries = tableHeadingCustomer.Offset(, _
        1).Resize(coc.Count, cop.Count)
        
    Set listCustomer = dsIn.Column("customer").Where
    Set listProduct = dsIn.Column("product").Where
    
    Set r = tableHeadingCustomer.Resize(1, 1)
    For Each cc In coc
        r.Value = cc.Value
        Set r = r.Offset(1)
    Next cc
    
    Set r = tableheadingProduct.Resize(1, 1)
    For Each cc In cop
        r.Value = cc.Value
        Set r = r.Offset(, 1)
    Next cc
    'START
    ' the first cell is a little different
    '=MATCH($C2,MatrixIn!$A$2:$A$100001,0)-1
    helperStart.Cells(1, 1).Formula = _
        "=Match(" & list( _
            SAd(tableHeadingCustomer.Cells(1, 1), helperStart, , True), _
            SAd(listCustomer, helperStart), 0) & ")-1"
    '=MATCH($C3,OFFSET(MatrixIn!$A$2,$B2+$A2,0,31,1),0)+$A2+$B2-1
    With helperStart.Offset(1).Resize(helperStart.Rows.Count - 1)
        .Cells(1, 1).Formula = _
        "=Match(" & list( _
            SAd(tableHeadingCustomer.Cells(2, 1), helperStart, True, True), _
            "offset(" & list( _
                    SAd(listCustomer, helperStart, True), _
                    SAd(helperLength, helperStart, True, True) & "+" & _
                    SAd(helperStart, helperStart, True, True), _
                    0, cop.Count + 1, 1) & ")", 0) & ")+" & _
                    SAd(helperLength, helperStart, True, True) & "+" & _
                    SAd(helperStart, helperStart, True, True) & "-1"
                    
        .FillDown
    End With

    'LENGTH
    '=MATCH($C2+1,OFFSET(MatrixIn!$A$2,$B2,0,31,1),)-1
    helperLength.Cells(1, 1).Formula = _
        "=Match(" & list( _
            SAd(tableHeadingCustomer.Cells(1, 1), helperStart, True, True) & "+1", _
            "offset(" & list( _
                    SAd(listCustomer, helperStart, True), _
                    SAd(helperStart, helperStart, True, True), _
                    0, cop.Count + 1, 1) & ")") & ",)-1"
    helperLength.FillDown
    'table body
    '=--ISNUMBER(MATCH(D$1,OFFSET(MatrixIn!$A$2,$B2,1,$A2,1),0))
    tableentries.Cells(1, 1).Formula = _
        "=--isnumber(match(" & list( _
            SAd(tableheadingProduct, tableentries, True, , True), _
            "offset(" & list( _
                SAd(listCustomer, tableentries, True), _
                SAd(helperStart, tableentries, True, True), _
                1, SAd(helperLength, tableentries, True, True), 1) & ")", 0 _
            ) & "))"

    tableentries.Resize(1).FillRight
    tableentries.FillDown
    
    ' the final table
    Set matrix = firstCell(wholeSheet _
        ("matrixOut")).Resize(cop.Count, cop.Count).Offset(1, 1)
    Set r = firstCell(matrix.Offset(-1))
    r.Offset(, -1).Value = "matrix"
    For Each cc In cop
        r.Value = cc.Value
        Set r = r.Offset(, 1)
    Next cc
    Set r = firstCell(matrix.Offset(, -1))
    For Each cc In cop
        r.Value = cc.Value
        Set r = r.Offset(1)
    Next cc
    matrix.FormulaArray = _
     "=MMULT(TRANSPOSE( " & SAd(tableentries) & ")," & _
            SAd(tableentries) & ")"
    Set creatematchMatrixFormulas = matrix.Offset(-1, -1).Resize(cop.Count + 1, _
            cop.Count + 1)
End Function



Comments