Regex, scraping and visualizing - Desktop Liberation

Normally I’d want to use the Rest to Excel library to get data into Excel, but sometimes you have to resort to web scraping. My preferred route is to do something in scraperwiki, then use a rest query to get the scraped wiki into Excel (or Google Apps)

But you need to know python, R, PHP or more recently node.js, to be able to do that – these are the languages that scraperWiki supports. None are particularly hard to learn to the degree you need to be able to get on and do some scraping, but if you don’t want to do that, here’s an example of scraping using VBA, and some of the libraries on this site.

In particular, we’re going to use regex. This is exactly what I’d do on scraperWiki, but it’s often overlooked that VBA has pretty good regex support too. A simplified approach to regex was covered in Regular Expressions, but for this task we’ll have to do some thing a little more complex, as was introduced in pinyin conversions

You’ll find the code in cDataSet.xlsm, downlable from Downloads

Page Content hide

1 The source data

2 How to identify your data

3 Navigating to the page.

4 Finding the header

5 Splitting up the headers

6 Populating the sheet

7 The result

8 Visualizing the table

9 The complete code

The source data

For this example, we’ll try to get the results of the Boston Half Marathon, available here. The objective is to get the thing into an Excel table – and as a bonus I’ll also make an interactive Google Table and a static table for loading to a web site too.

A snippet below

I’m using this one, because it’s a bit trickier than the usual scraping activities where data is organized into tables – what we have here is some fairly unformatted data enclosed in <pre> </pre> tags, so we’ll be able to use some simple regex to sort it out.

How to identify your data

I normally use Chrome developer tools to take a look at the web page first. In this case,

Using those, I’ve traced the data I want to be within a <pre></pre> tag. This is pretty unusual – you would have expected this kind of data to be nicely formatted in an html table. Instead we have, essentially a snapshot of a printed report.

Navigating to the page.

I’m using the cBrowser class, since I’ve already dealt with the mechanics of browser navigation elsewhere.

  URL = "http://www.coolrunning.com/results/13/ma/Oct13_13thAn_set1.shtml"
 
    ' get the web page
    Set cb = New cBrowser
    cb.init().Navigate URL, , False

When we return we’ll have a nicely populated html document. We can find the text in the <pre> class like this. This is a one off scraper, so Im not going to bother too much about error handling.

    ' its the only <pre> in the page
    Set nodeList = cb.elementTags("pre")
    
    Set node = nodeList(0)
    ' node should contain the results
    ' kind of wierd its not an html table
    Debug.Assert InStr(1, node.innerText, "Half Marathon") > 0

Finding the header

One thing I noticed was that the headings are underlined with equals signs. If I could find them, I’d know where the data started, and not only that, but the width of each column.

This regex snippet will return a single match with the row of === markers. This will tell me not only where the headings are (on the row before), but where the data is (starting the row after), and where each column begins – at the beginning of each block of “==”

    ' a simple regex will find the title distribution
    Set rx = New RegExp
    With rx
        .ignorecase = True
        .Global = True
        .Pattern = "=.*(?=\s)"
    End With
    Set matches = rx.execute(node.innerText)
    ' now we should have a single match with all about where the title underlining is
    ' looks like this ===== ======== ==== =====
    ' except that all but the first one are misaligned by 1 for the data
    Debug.Assert matches.count = 1
    Set match = matches(0)

Splitting up the headers

Now we can use another regex to split the “==” into discrete column heading positions, and then work backwords to extrapolate where the header row begins and ends

 ' we'll split that up into seperate sections
    rx.Pattern = "=+(?=\s)"
    Set heads = rx.execute(Mid(node.innerText, match.FirstIndex + 1, match.Length - 1))
    
    ' we can get the header line
    p = 0
    For i = match.FirstIndex To 1 Step -1
        If (Mid(node.innerText, i, 1) = vbLf) Then
            p = p + 1
            If (p = 2) Then
                ' we're at the \n before the header row
                headerRow = Mid(node.innerText, i + 1, match.FirstIndex - i - 2)
                Exit For
            End If
        End If
    Next i

Populating the sheet

Now we can populate the headers and data in the sheet, using the “==” ruler as guide for where the columns begin and end. We’ll use another regex to split the data into separate lines. Finally I did notice that the headings are a little misaligned to the data, so I need to make a small correction for that too.

  'clear the worksheet
    Application.Calculation = xlCalculationManual
    Set r = firstCell(wholeSheet("marathonresults"))
    r.Worksheet.Cells.ClearContents
    
    p = 0
    For Each item In heads
        r.Offset(0, p).value = Trim(Mid(headerRow, item.FirstIndex + 1, item.Length))
        p = p + 1
    Next item
 
    ' and the data
    rx.Pattern = ".+(?=\n|$)"
    ' get the start point of the data
    k = match.FirstIndex + match.Length
    While (Mid(node.innerText, k, 1) <> vbLf)
        k = k + 1
    Wend
    Set dats = rx.execute(Mid(node.innerText, 1 + k))
 
    t = 0
    For Each data In dats
        p = 0
        t = t + 1
        For Each item In heads
            k = item.FirstIndex + 1
            ' the headings, except the first, are actually misaligned by 1
            If k > 1 Then k = k - 1
            r.Offset(t, p).value = Trim(Mid(data.value, k, item.Length))
            p = p + 1
        Next item
    Next data

The result

Here we have it- 6000+ lines scraped and copied to Excel.

Visualizing the table

So how about we turn that into a google visualization. We can use Visualizing tables to make one from this data. The code is a one liner.

Public Sub vizMaratahonGoogle()
    tableToHtml "marathonresults", , , , "google"
End Sub

and it gives us a very nice interactive version of that data,

that can be hosted on a web site – You can see it here.

Alternatively, you could create a static version – another one liner

Public Sub vizMaratahonStatic()
    ' a static html rendering
    tableToHtml "marathonresults", , , , "static"
End Sub

also nicely formatted, and hosted here.

The complete code

For help and more information, you can join the forum, follow me on Twitter or follow the blog.