Services‎ > ‎Desktop Liberation‎ > ‎

Regular Expressions


Manipulating strings in Excel.

As you will know, Excel very quickly runs of out steam as you try to do any kind of string manipulation with cells. Even very basic requirements cause you reach for VBA, but then you end up with very complex functions highly specific to each string processing task.  Many tools nowadays, especially in the Unix/Linux world allow you to use “Regular Expressions” right as part of their basic functionality. This section will show you how to implement Regular Expressions (RegEX) that you can use both in your workbooks and in your VBA procedures to harness their string processing capabilities, as well as a way to start a RegEx library that you can build up with common tasks. The examples and procedures can all be downloaded. Note that there are many regexes already written on sites such as, http://regexlib.com . Before writing your own, I recommend you check to see if anyone has already done something like it. Many of the examples here come from such libraries. This section is a wrapper class for regexes to make them easier to use from Excel, but also as a library implementation to give complex regexes a name so you can reuse them without recreating the contents. 

If' you'd rather use some of the more complex features of regexes and use them directly in Excel, you'll find some fairly extreme regexes in pinyin conversions, and an example of using simple regexes to do complicated tasks here Regex, scraping and visualizing

What is a Regular Expression. 

These are commonly called RegEX, which is how I will refer to them from now on. A regEX is a pattern that can be applied to a string, either to test or transform a string, depending on whether there is a match according to rules contained in that pattern. As with most things that come up through the Unix route, RegEX syntax is both powerful, and concise. This section is not supposed to be a tutorial on RegEx- there are thousands of them available on the internet. Just google "regular expression tutorial". Some of these examples have come from a great regex resource, http://regexlib.com/

Here's a few examples

Numeric                                  [\0-9]

Numeric exactly 3 characters long        [\0-9]{3}

Non Numeric                              [^\0-9]


So applying 'Numeric' to the string abc123  yields 123. Contrast that with a generalized formula in excel to do that.

=MID(H1,MATCH(TRUE,ISNUMBER(--MID(H1,ROW(INDIRECT("$1:$"&LEN(H1))),1)),0),COUNT(--MID(H1,ROW(INDIRECT("$1:$"&LEN(H1))),1)))

RegEX can get also very complex - but the are expressing very complex validation concepts that would be complex in VBA and unimaginably so with Excel functions.  Here's one that describes a valid UK post code, from regexlib.com, as are a few in this section.

(((^[BEGLMNS][1-9]\d?)|(^W[2-9])|(^(A[BL]|B[ABDHLNRST]|C[ABFHMORTVW]| D[ADEGHLNTY]|E[HNX]|F[KY]| G[LUY]|H[ADGPRSUX]|I[GMPV]| JE|K[ATWY]|L[ADELNSU]|M[EKL]|N[EGNPRW]|  O[LX]|P[AEHLOR]|R[GHM]|S[AEGKL-PRSTWY]|T[ADFNQRSW]|UB|W[ADFNRSV] |YO|ZE)\d\d?)|(^W1[A-HJKSTUW0-9])|(((^WC[1-2])|(^EC[1-4])|(^SW1))[ABEHMNPRVWXY]))(\s*)?([0-9][ABD-HJLNP-UW-Z]{2}))|(^GIR\s?0AA)

Aside from extracting string components, you can also use regEX to transform a string. 

This one will reduce multiple spaces inside a string to single spaces, when used in conjunction with the procedures discussed on this page

(\S+)\x20{2,}(?=\S+)

and this one will format an ip number.

^(\d{1,2}|1\d\d|2[0-4]\d|25[0-5])\.(\d{1,2}|1\d\d|2[0-4]\d|25[0-5])\.   (\d{1,2}|1\d\d|2[0-4]\d|25[0-5])\.(\d{1,2}|1\d\d|2[0-4]\d|25[0-5])$

Although all that looks rather complex, there are millions of already written regEX on the web you can find for almost any purpose. The regEX implementation you can download here also provides the concept of a 'preset'. The idea here is that you add regEX patterns to your own personal library and refer to them by name rather than needing to reference the patterns themselves. This makes for easy maintenance and less typing errors.

So for example, you can setup a preset which translates to a regEX pattern, but in your formula you refer to a preset name. You'll see more of that later, but for example, you could set up a preset called IPNUMBER. The formula you would use in your worksheet would be

=rxString ( "ipnumber", A1) 

instead of 

=rxString ( "^(\d{1,2}|1\d\d|2[0-4]\d|25[0-5])\.(\d{1,2}|1\d\d|2[0-4]\d|25[0-5])\.(\d{1,2}|1\d\d|2[0-4]\d|25[0-5])\.(\d{1,2}|1\d\d|2[0-4]\d|25[0-5])$", A1) 

RegEX implementation.

There are 3 at least functions that a generally available in implementations of RegEX

    1. Test - Check if a string matches a RegEx Pattern
    2. Replace - Replace any matching string portion(s) with something else
    3. Execute - Apply a RegEx pattern to a string and return the result(s) so you can manipulate them.

Excel does not have an implementation of RegEX, but VBscript does, and that is available to VBA developers simply by including a reference to it. If you are writing VBA code then you can build that right into your procedures. However, you cannot directly access it as an Excel formula so in this case you need to wrap them in some VBA user defined functions to make them accessible. In this case, I have created both a wrapper class (cRegEXLib) and some wrapper functions that access that class. Details are at the end of this article and are  available as a download.

examples

Note that the full regEX pattern and formulas may not be shown below in the interests of space, but you will find them in the example sheet in the download section.


These are the formulas for the above


 

Public Function rxTest(sName As String, s As String, Optional ignorecase As Boolean = True) As Boolean

Test whether the pattern sName matches s according to RegEX rules.

where
    sName            A RegEX or the name of a preset RegEX (we will discuss what is meant by preset later).
    s                        The string against which the sName pattern is applied
    ignoreCase     Whether or not matches need to be exact matching lower or upper case 

Public Function rxTest(sName As String, s As String, Optional ignorecase As Boolean = True) As Boolean

    Dim rx As cregXLib

    ' create a new regx

    Set rx = rxMakeRxLib(sName)

    rx.ignorecase = ignorecase

    ' extract the string that matches the requested pattern

    rxTest = rx.getTest(s)

End Function

 Public Function rxString(sName As String, s As String, Optional ignorecase As Boolean = True) As Boolean

Return the concatenated results of applying  the pattern sName to s according to RegEX rules.

where
    sName            A RegEX or the name of a preset RegEX (we will discuss what is meant by preset later).
    s                        The string against which the sName pattern is applied
    ignoreCase     Whether or not matches need to be exact matching lower or upper case 

Public Function rxString(sName As String, s As String, Optional ignorecase As Boolean = True) As String

    Dim rx As cregXLib

    ' create a new regx

    Set rx = rxMakeRxLib(sName)

    rx.ignorecase = ignorecase

    ' extract the string that matches the requested pattern

    rxString = rx.getString(s)

End Function


Public Function rxReplace(sName As String, sFrom As String, sTo as string Optional ignorecase As Boolean = True) As Boolean

Replace the concatenated results of applying  the pattern sName to sFrom according to RegEX rules with sTo.

where
    sName            A RegEX or the name of a preset RegEX (we will discuss what is meant by preset later).
    sFrom              The string against which the sName pattern is applied
    sTo                  The string with which to replace the outcome of applying sName to sFrom.
    ignoreCase     Whether or not matches need to be exact matching lower or upper case 

Public Function rxReplace(sName As String, sFrom As String, sTo As String, Optional ignorecase As Boolean = True) As String

    Dim rx As cregXLib

     ' create a new regx

    Set rx = rxMakeRxLib(sName)

    rx.ignorecase = ignorecase

    ' replace the string that matches the requested pattern

    rxReplace = rx.getReplace(sFrom, sTo)

End Function

 

Public Function rxPattern(sName As String) As string

Returns the pattern associated with the preset RegEX sName

where
    sName            A RegEX or the name of a preset RegEX (we will discuss what is meant by preset later).

Public Function rxPattern(sName As String) As String

    Dim rx As cregXLib

     ' create a new regx

    Set rx = rxMakeRxLib(sName)

    ' just returnthe pattern

    rxPattern = rx.Pattern

    

End Function

 

cRegEXLib class

A wrapper class to expose the  VBscript Regexp object (needs a reference to Microsoft VBSCRIPT regex)

Option Explicit
' for building up a library of useful regex expressions
Private pName As String
Private pRegex As RegExp
Public Property Get Pattern() As String
    Pattern = pRegex.Pattern
End Property
Public Property Let Pattern(p As String)
    pRegex.Pattern = p
End Property
Public Property Get Name() As String
    Name = pName
End Property
Public Property Let Name(p As String)
   pName = p
End Property
Public Property Get ignorecase() As Boolean
    ignorecase = pRegex.ignorecase
End Property
Public Property Let ignorecase(p As Boolean)
     pRegex.ignorecase = p
End Property
Public Property Get rGlobal() As Boolean
    rGlobal = pRegex.Global
End Property
Public Property Let rGlobal(p As Boolean)
   pRegex.Global = p
End Property

Public Sub init(sName As String, _
            Optional spat As String = "", _
            Optional bIgnoreSpaces As Boolean = True, _
            Optional bIgnoreCase As Boolean = True, _
            Optional bGlobal As Boolean = True)
    
    Dim s As String
    s = spat
    If bIgnoreSpaces Then
        s = Replace(s, " ", "")
    End If
    Set pRegex = New RegExp
    With pRegex
        .Pattern = s
        .ignorecase = bIgnoreCase
        .Global = bGlobal
    End With
    pName = sName

End Sub
Public Function getString(sFrom As String) As String
    Dim mc As matchcollection, am As Match, rs As String
    Set mc = pRegex.Execute(sFrom)
    rs = ""
    For Each am In mc
        rs = rs & am.Value
    Next am
    getString = rs
End Function
Function getReplace(sFrom As String, sTo As String) As String
    getReplace = pRegex.Replace(sFrom, sTo)

End Function
Function getTest(sFrom As String) As Boolean
    getTest = pRegex.Test(sFrom)

End Function

 Presets

You will have noticed the concept of 'Presets'. This allows you to refer to regex as a name rather than a pattern. Of course you need to build up your library of these. Below are the ones implemented in the download. You can see it is very easy to add your own by updating the rxMakeRxLib function in the rxLib Function.. 

Function rxMakeRxLib(sName As String) As cregXLib
    Dim rx As cregXLib, s As String
    Set rx = New cregXLib
    ' normally sname points to a preselected regEX
    ' if not known, silently assume its a regex pattern
        s = Replace(UCase(sName), " ", "")
        Select Case s
            Case "POSTALCODEUK"
                rx.init s, _
                "(((^[BEGLMNS][1-9]\d?) | (^W[2-9] ) | ( ^( A[BL] | B[ABDHLNRST] | C[ABFHMORTVW] | D[ADEGHLNTY] | E[HNX] | F[KY] | G[LUY] | H[ADGPRSUX] | I[GMPV] |" & _
                " JE | K[ATWY] | L[ADELNSU] | M[EKL] | N[EGNPRW] | O[LX] | P[AEHLOR] | R[GHM] | S[AEGKL-PRSTWY] | T[ADFNQRSW] | UB | W[ADFNRSV] | YO | ZE ) \d\d?) |" & _
                " (^W1[A-HJKSTUW0-9]) | ((  (^WC[1-2])  |  (^EC[1-4]) | (^SW1)  ) [ABEHMNPRVWXY] ) ) (\s*)?  ([0-9][ABD-HJLNP-UW-Z]{2})) | (^GIR\s?0AA)"
            
            Case "POSTALCODESPAIN"
                rx.init s, _
                    "^([1-9]{2}|[0-9][1-9]|[1-9][0-9])[0-9]{3}$"
                    
            Case "PHONENUMBERUS"
                rx.init s, _
                "^\(?(?<AreaCode>[2-9]\d{2})(\)?)(-|.|\s)?(?<Prefix>[1-9]\d{2})(-|.|\s)?(?<Suffix>\d{4})$"
                
            Case "CREDITCARD" 'amex/visa/mastercard
                rx.init s, _
                "^((4\d{3})|(5[1-5]\d{2}))(-?|\040?)(\d{4}(-?|\040?)){3}|^(3[4,7]\d{2})(-?|\040?)\d{6}(-?|\040?)\d{5}"
                
            Case "NUMERIC"
                rx.init s, _
                    "[\0-9]"
            
            Case "ALPHABETIC"
                rx.init s, _
                    "[\a-zA-Z]"
                    
            Case "NONNUMERIC"
                rx.init s, _
                    "[^\0-9]"
                    
            Case "IPADDRESS"
                rx.init s, _
                "^(\d{1,2}|1\d\d|2[0-4]\d|25[0-5])\.(\d{1,2}|1\d\d|2[0-4]\d|25[0-5])\.(\d{1,2}|1\d\d|2[0-4]\d|25[0-5])\.(\d{1,2}|1\d\d|2[0-4]\d|25[0-5])$"
            
            Case "SINGLESPACE"  ' should take a replace value of "$1 "
                rx.init s, _
                    "(\S+)\x20{2,}(?=\S+)"
            
            Case "EMAIL"
                rx.init s, _
                    "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$"
                    
            Case "EMAILINSIDE"
                rx.init s, _
                    "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"
                    

            
            Case Else
                rx.init "Adhoc", sName
        
        End Select
    
    Set rxMakeRxLib = rx
End Function

Summary


For more stuff like this see Get Started Snippets. Why not join our forum,follow the blog or follow me on twitter to ensure you get updates when they are available.