Leveraging sorted data with Match

What can you learn here ?
• Optimization and profiling
• filling ranges using VBA
• using match to partition data

Array formulas and large ranges - another approach

Even after Optimizing the array formula the result still wasn't acceptable because of  the exponential time to calculate cells using countifs on large scale dataset. This article explores how you can use MATCH() to leverage sorted data to cut calculation times down to a fraction of their natural calculation times. You can download this workbook, arrayformulas.xlsm, from Downloads

What's good enough?

In our example, through Optimizing the array formula we have reduced the calaculation times by a factor of 10 for a medium sized (50,000 * 30) test data sample. Although this sounds like a good result, it's not good enough - since workbook calculation still takes 5 minutes, and run times are exponential rather than linear. So we need to find a way to get this down to under a second, and at the same time get to a predictable run time for bigger data sets.

Here are the current calculation results using countifs for the cell call calculations.
The problem here is that the bigger the data set, the more countifs has to search multiple times - so how to get a small, predictable cut of the data for each search.

Our input data is sorted, and looks like this
->>

We are trying to create a matrix of 1/0 with customer down the side and product along the top that looks like the below

so that we can perform MMULT() on it to get this

If we could encourage an operation in the body of the 1/0 operation to only search in the part of the list for the customer relevant for that row, we could reduce the processing that needs to be done by a large factor. Luckily, MATCH(), has 3 modes

And since our input data is sorted we can use a combination of 0 and -1 match options to partition the input data into small chunks. To help with that we are going to use a couple of helper columns to minimize the amount of recalculations - so the 1/0 tabel now looks like this - the first two columns showing the offset row number and length of the input data concerning the customer in column C

The formula for cell B3 is -
=MATCH(\$C3,OFFSET(MatrixIn!\$A\$2,\$A2+\$B2,0,31,1),0)+\$A2+\$B2-1

This is saying that we need to do an exact match on the customer number in cell C3, but we only need to look at the small strip of the input data that starts after the section belonging to the previous customer is finished.

The length of the section is in cell A3, and can be calculated as follows
=MATCH(\$C3+1,OFFSET(MatrixIn!\$A\$2,\$B3,0,31,1),)-1

In this case we are going to count from the place that this customer's section starts to the place when the data doesn't belong to this customer any more (\$c3+1). Although most people use 0 as the 3rd parameter to MATCH(), the default behavior is to stop when we match >=  \$c3+1.

These 2 simple formulas can be used to slice the input data into small manageable chunks. We now know which row each customer's data starts on, and how many entries there are.  They can just be filled down since care has been taken with the use of \$ as to what gets incremented in a fill operation. (but note that the formula for cell c2 is a little different to all the others since we don't have a previous customer to refer to)

How to replace the countifs functionality with MATCH.

Using these two helper columns, the body of the table formula is a piece of cake. This is cell D2

=--ISNUMBER(MATCH(D\$1,OFFSET(MatrixIn!\$A\$2,\$B2,1,\$A2,1),0))

This is checking the section of the input data used by this customer and defined by columns A & B, to see whether there is a match with the product associated with this column, and converting the true/false returned by isnumber() to 1/0. Again, this can be filled across an down for the whole table body.

Efficiency

We started this exercise at 1500 seconds for a 50,000 * 30 matrix, then through Optimizing the array formula got that down to 300+ seconds. Incredibly, this approach gets us to .5 seconds - a 600x improvements. When you think of it though, we are searching on average about 15 rows per customer instead of 50,000 so actually it is not really surprising that we see a dramatic improvement. Not only that, but the time to calculate is linear. Doubling to 100,000 records actually takes not much more than double the time at 1.2 seconds

Here are the results

Automating

We could just leave it at that, but a big data set like would be error prone to setup manually as well as taking a lot of time, so using some of the VBA approach used in Array formulas and large ranges, this is pretty straightforward to automate.
Here is the code, which uses Data Manipulation Classes

Sub matrixmatchExample()

Dim maxrows As Long
maxrows = 100000
Dim dsIn As cDataSet, dsout As cDataSet
Dim cop As Collection, r As Range, coc As Collection
Dim cht As Chart

Set dsIn = New cDataSet
Set dsout = New cDataSet

' sort it first
Application.Calculation = xlCalculationManual
Application.ScreenUpdating = False

Worksheets("matrixin").Range("\$A:\$b").Sort _
Key1:=Worksheets("matrixin").Range("A1"), _
Key2:=Worksheets("matrixin").Range("b1"), _

With dsIn
' create a dset with input data
.populateData wholeSheet("matrixin"), , "matrixin", , , , True, , maxrows
' make collections of customers and products that appear
Set cop = .Column("product").uniqueValues(eSortAscending)
Set coc = .Column("customer").uniqueValues(eSortNone)
' create the output matrix
wholeSheet("matrixout").Cells.ClearContents
wholeSheet("tempmatrix").Cells.ClearContents

End With
'get newly created output matrix into a dataset
dsout.populateData creatematchMatrixFormulas(coc, cop, dsIn), _
, "matrixout"

Set dsIn = Nothing
Application.ScreenUpdating = True
Application.Calculate
Application.Calculation = xlCalculationAutomatic
' create heatmap
'executeHeatMapScale dsout
''Set cht = createSurfaceChart(dsout, "heatmapx")
Set cht = createSurfaceChart(dsout, "heatmap", xlSurfaceTopView)
Set cht = Nothing
Set dsout = Nothing
Set dsIn = Nothing

End Sub

and to populate the workbook with the formulae discussed earlier.

Private Function creatematchMatrixFormulas(coc As Collection, cop As Collection, _
dsIn As cDataSet) As Range
Dim r As Range, wr As Range, cc As cCell

Dim listCustomer As Range
Dim listProduct As Range
Dim tableentries As Range
Dim tableHeadingCustomer As Range
Dim tableheadingProduct As Range
Dim helperStart As Range
Dim helperLength As Range
Dim matrix As Range

Set wr = firstCell(wholeSheet("tempMatrix").Cells)
' these 2 columns will contain the start row and length of
' list of products used by each customer
Set helperStart = wr.Resize(coc.Count, 1).Offset(1, 1)
helperStart.Offset(-1).Cells(1, 1).Value = "Start"
Set helperLength = helperStart.Offset(, -1)
helperLength.Offset(-1).Cells(1, 1).Value = "Length"

Set tableHeadingCustomer = helperStart.Offset(, 1)
tableHeadingCustomer.Offset(-1).Cells(1, 1).Value = "Customer"
Set tableentries = tableHeadingCustomer.Offset(, _
1).Resize(coc.Count, cop.Count)

Set listCustomer = dsIn.Column("customer").Where
Set listProduct = dsIn.Column("product").Where

Set r = tableHeadingCustomer.Resize(1, 1)
For Each cc In coc
r.Value = cc.Value
Set r = r.Offset(1)
Next cc

Set r = tableheadingProduct.Resize(1, 1)
For Each cc In cop
r.Value = cc.Value
Set r = r.Offset(, 1)
Next cc
'START
' the first cell is a little different
'=MATCH(\$C2,MatrixIn!\$A\$2:\$A\$100001,0)-1
helperStart.Cells(1, 1).Formula = _
"=Match(" & list( _
SAd(listCustomer, helperStart), 0) & ")-1"
'=MATCH(\$C3,OFFSET(MatrixIn!\$A\$2,\$B2+\$A2,0,31,1),0)+\$A2+\$B2-1
With helperStart.Offset(1).Resize(helperStart.Rows.Count - 1)
.Cells(1, 1).Formula = _
"=Match(" & list( _
"offset(" & list( _
SAd(listCustomer, helperStart, True), _
SAd(helperLength, helperStart, True, True) & "+" & _
SAd(helperStart, helperStart, True, True), _
0, cop.Count + 1, 1) & ")", 0) & ")+" & _
SAd(helperLength, helperStart, True, True) & "+" & _
SAd(helperStart, helperStart, True, True) & "-1"

.FillDown
End With

'LENGTH
'=MATCH(\$C2+1,OFFSET(MatrixIn!\$A\$2,\$B2,0,31,1),)-1
helperLength.Cells(1, 1).Formula = _
"=Match(" & list( _
SAd(tableHeadingCustomer.Cells(1, 1), helperStart, True, True) & "+1", _
"offset(" & list( _
SAd(listCustomer, helperStart, True), _
SAd(helperStart, helperStart, True, True), _
0, cop.Count + 1, 1) & ")") & ",)-1"
helperLength.FillDown
'table body
'=--ISNUMBER(MATCH(D\$1,OFFSET(MatrixIn!\$A\$2,\$B2,1,\$A2,1),0))
tableentries.Cells(1, 1).Formula = _
"=--isnumber(match(" & list( _
"offset(" & list( _
SAd(listCustomer, tableentries, True), _
SAd(helperStart, tableentries, True, True), _
1, SAd(helperLength, tableentries, True, True), 1) & ")", 0 _
) & "))"

tableentries.Resize(1).FillRight
tableentries.FillDown

' the final table
Set matrix = firstCell(wholeSheet _
("matrixOut")).Resize(cop.Count, cop.Count).Offset(1, 1)
Set r = firstCell(matrix.Offset(-1))
r.Offset(, -1).Value = "matrix"
For Each cc In cop
r.Value = cc.Value
Set r = r.Offset(, 1)
Next cc
Set r = firstCell(matrix.Offset(, -1))
For Each cc In cop
r.Value = cc.Value
Set r = r.Offset(1)
Next cc
matrix.FormulaArray = _
"=MMULT(TRANSPOSE( " & SAd(tableentries) & ")," & _