What can you learn here?
- Optimization and profiling
- filling ranges using VBA
- using match to partition data
Another approach to Array formulas and large Ranges
I previously covered a VBA solution for a problem that could have been solved using array formulas. Even after Optimizing the array formula the result still wasn’t acceptable because of the exponential time to calculate cells using countifs on large scale dataset. This article explores how you can use MATCH() to leverage sorted data to cut calculation times down to a fraction of their natural calculation times. You can download this workbook, arrayformulas.xlsm, from Downloads
What’s good enough?
In our example, through Optimizing the array formula we have reduced the calculation times by a factor of 10 for a medium sized (50,000 * 30) test data sample. Althttps://ramblings.mcpher.com/array-formulas-doing-lots-of-things-at-once/array-formulas-and-large-ranges/hough this sounds like a good result, it’s not good enough – since workbook calculation still takes 5 minutes, and run times are exponential rather than linear. So we need to find a way to get this down to under a second, and at the same time get to a predictable run time for bigger data sets. Here are the current calculation results using countifs for the cell call calculations. The problem here is that the bigger the data set, the more countifs has to search multiple times – so how to get a small, predictable cut of the data for each search. Our input data is sorted, and looks like this We are trying to create a matrix of 1/0 with customer down the side and product along the top that looks like the below so that we can perform MMULT() on it to get this
If we could encourage an operation in the body of the 1/0 operation to only search in the part of the list for the customer relevant for that row, we could reduce the processing that needs to be done by a large factor. Luckily, MATCH(), has 3 modes
And since our input data is sorted we can use a combination of 0 and -1 match options to partition the input data into small chunks. To help with that we are going to use a couple of helper columns to minimize the amount of recalculations – so the 1/0 tabel now looks like this – the first two columns showing the offset row number and length of the input data concerning the customer in column C The formula for cell B3is –=MATCH($C3,OFFSET(MatrixIn!$A$2,$A2+$B2,0,31,1),0)+$A2+$B2-1
This is saying that we need to do an exact match on the customer number in cell C3, but we only need to look at the small strip of the input data that starts after the section belonging to the previous customer is finished. The length of the section is in cell A3, and can be calculated as follows=MATCH($C3+1,OFFSET(MatrixIn!$A$2,$B3,0,31,1),)-1
In this case we are going to count from the place that this customer’s section starts to the place when the data doesn’t belong to this customer any more ($c3+1). Although most people use 0 as the 3rd parameter to MATCH(), the default behavior is to stop when we match >= $c3+1. These 2 simple formulas can be used to slice the input data into small manageable chunks. We now know which row each customer’s data starts on, and how many entries there are. They can just be filled down since care has been taken with the use of $ as to what gets incremented in a fill operation. (but note that the formula for cell c2 is a little different to all the others since we don’t have a previous customer to refer to)
How to replace the countifs functionality with MATCH.
Using these two helper columns, the body of the table formula is a piece of cake. This is cell D2 =--ISNUMBER(MATCH(D$1,OFFSET(MatrixIn!$A$2,$B2,1,$A2,1),0))
This is checking the section of the input data used by this customer and defined by columns A & B, to see whether there is a match with the product associated with this column, and converting the true/false returned by isnumber() to 1/0. Again, this can be filled across an down for the whole table body.
Efficiency
We started this exercise at 1500 seconds for a 50,000 * 30 matrix, then through Optimizing the array formula got that down to 300+ seconds. Incredibly, this approach gets us to .5 seconds – a 600x improvements. When you think of it though, we are searching on average about 15 rows per customer instead of 50,000 so actually it is not really surprising that we see a dramatic improvement. Not only that, but the time to calculate is linear. Doubling to 100,000 records actually takes not much more than double the time at 1.2 seconds Here are the results
Automating
We could just leave it at that, but a big data set like would be error prone to setup manually as well as taking a lot of time, so using some of the VBA approach used in Array formulas and large ranges, this is pretty straightforward to automate.Here is the code, which uses Data Manipulation Classes Sub matrixmatchExample()
Dim maxrows As Long
maxrows = 100000
Dim dsIn As cDataSet, dsout As cDataSet
Dim cop As Collection, r As Range, coc As Collection
Dim cht As Chart
Set dsIn = New cDataSet
Set dsout = New cDataSet
' sort it first
Application.Calculation = xlCalculationManual
Application.ScreenUpdating = False
Worksheets("matrixin").Range("$A:$b").Sort _
Key1:=Worksheets("matrixin").Range("A1"), _
Key2:=Worksheets("matrixin").Range("b1"), _
header:=xlYes
With dsIn
' create a dset with input data
.populateData wholeSheet("matrixin"), , "matrixin", , , , True, , maxrows
' make collections of customers and products that appear
Set cop = .Column("product").uniqueValues(eSortAscending)
Set coc = .Column("customer").uniqueValues(eSortNone)
' create the output matrix
wholeSheet("matrixout").Cells.ClearContents
wholeSheet("tempmatrix").Cells.ClearContents
End With
'get newly created output matrix into a dataset
dsout.populateData creatematchMatrixFormulas(coc, cop, dsIn), _
, "matrixout"
Set dsIn = Nothing
Application.ScreenUpdating = True
Application.Calculate
Application.Calculation = xlCalculationAutomatic
' create heatmap
'executeHeatMapScale dsout
''Set cht = createSurfaceChart(dsout, "heatmapx")
Set cht = createSurfaceChart(dsout, "heatmap", xlSurfaceTopView)
Set cht = Nothing
Set dsout = Nothing
Set dsIn = Nothing
End Sub
and to populate the workbook with the formulae discussed earlier. Private Function creatematchMatrixFormulas(coc As Collection, cop As Collection, _
dsIn As cDataSet) As Range
Dim r As Range, wr As Range, cc As cCell
Dim listCustomer As Range
Dim listProduct As Range
Dim tableentries As Range
Dim tableHeadingCustomer As Range
Dim tableheadingProduct As Range
Dim helperStart As Range
Dim helperLength As Range
Dim matrix As Range
Set wr = firstCell(wholeSheet("tempMatrix").Cells)
' these 2 columns will contain the start row and length of
' list of products used by each customer
Set helperStart = wr.Resize(coc.Count, 1).Offset(1, 1)
helperStart.Offset(-1).Cells(1, 1).Value = "Start"
Set helperLength = helperStart.Offset(, -1)
helperLength.Offset(-1).Cells(1, 1).Value = "Length"
Set tableHeadingCustomer = helperStart.Offset(, 1)
tableHeadingCustomer.Offset(-1).Cells(1, 1).Value = "Customer"
Set tableheadingProduct = tableHeadingCustomer.Offset(-1, 1).Resize(1, cop.Count)
Set tableentries = tableHeadingCustomer.Offset(, _
1).Resize(coc.Count, cop.Count)
Set listCustomer = dsIn.Column("customer").Where
Set listProduct = dsIn.Column("product").Where
Set r = tableHeadingCustomer.Resize(1, 1)
For Each cc In coc
r.Value = cc.Value
Set r = r.Offset(1)
Next cc
Set r = tableheadingProduct.Resize(1, 1)
For Each cc In cop
r.Value = cc.Value
Set r = r.Offset(, 1)
Next cc
'START
' the first cell is a little different
'=MATCH($C2,MatrixIn!$A$2:$A$100001,0)-1
helperStart.Cells(1, 1).Formula = _
"=Match(" & list( _
SAd(tableHeadingCustomer.Cells(1, 1), helperStart, , True), _
SAd(listCustomer, helperStart), 0) & ")-1"
'=MATCH($C3,OFFSET(MatrixIn!$A$2,$B2+$A2,0,31,1),0)+$A2+$B2-1
With helperStart.Offset(1).Resize(helperStart.Rows.Count - 1)
.Cells(1, 1).Formula = _
"=Match(" & list( _
SAd(tableHeadingCustomer.Cells(2, 1), helperStart, True, True), _
"offset(" & list( _
SAd(listCustomer, helperStart, True), _
SAd(helperLength, helperStart, True, True) & "+" & _
SAd(helperStart, helperStart, True, True), _
0, cop.Count + 1, 1) & ")", 0) & ")+" & _
SAd(helperLength, helperStart, True, True) & "+" & _
SAd(helperStart, helperStart, True, True) & "-1"
.FillDown
End With
'LENGTH
'=MATCH($C2+1,OFFSET(MatrixIn!$A$2,$B2,0,31,1),)-1
helperLength.Cells(1, 1).Formula = _
"=Match(" & list( _
SAd(tableHeadingCustomer.Cells(1, 1), helperStart, True, True) & "+1", _
"offset(" & list( _
SAd(listCustomer, helperStart, True), _
SAd(helperStart, helperStart, True, True), _
0, cop.Count + 1, 1) & ")") & ",)-1"
helperLength.FillDown
'table body
'=--ISNUMBER(MATCH(D$1,OFFSET(MatrixIn!$A$2,$B2,1,$A2,1),0))
tableentries.Cells(1, 1).Formula = _
"=--isnumber(match(" & list( _
SAd(tableheadingProduct, tableentries, True, , True), _
"offset(" & list( _
SAd(listCustomer, tableentries, True), _
SAd(helperStart, tableentries, True, True), _
1, SAd(helperLength, tableentries, True, True), 1) & ")", 0 _
) & "))"
tableentries.Resize(1).FillRight
tableentries.FillDown
' the final table
Set matrix = firstCell(wholeSheet _
("matrixOut")).Resize(cop.Count, cop.Count).Offset(1, 1)
Set r = firstCell(matrix.Offset(-1))
r.Offset(, -1).Value = "matrix"
For Each cc In cop
r.Value = cc.Value
Set r = r.Offset(, 1)
Next cc
Set r = firstCell(matrix.Offset(, -1))
For Each cc In cop
r.Value = cc.Value
Set r = r.Offset(1)
Next cc
matrix.FormulaArray = _
"=MMULT(TRANSPOSE( " & SAd(tableentries) & ")," & _
SAd(tableentries) & ")"
Set creatematchMatrixFormulas = matrix.Offset(-1, -1).Resize(cop.Count + 1, _
cop.Count + 1)
End Function
Summary
Now we have a viable and scalable solution using MATCH(), MMULT() and a bit of VBA to automate the whole thing. If you want to read more about the detail of this problem, see Array formulas and large ranges. You can download this workbook, arrayformulas.xlsm, from Downloads Why not join our forum, follow the blog or follow me on twitter to ensure you get updates when they are available. You can also submit anything you want considered for publication on this site to our forum.