Increasing Performance of Regular Expressions in VBA

February 12, 2013 in VBA

In a previous post I presented some general-purpose wrappers for adding regular expressions functionality to VBA and SQL in Microsoft Access.

These wrappers work well, but we can make them much more efficient for situations where the same pattern is used many times, such as in a SQL query where the regular expression will be executed on each row of the input table or query.

Caching Compiled Regular Expressions

What we’ll use is the fact that, behind the scenes, every regular expression pattern gets compiled into a tiny little program that executes in order to actually apply the regular expression. The first time you call any of the methods on a RegExp object there will be a short pause while the expression is compiled. Once the regular expression has been compiled, though, it can execute right away.

The pause for compilation is usually on the order of a few tens of milliseconds. That’s small enough that you might never notice it even when stepping through code. But it can really add up when the expression is going to be applied to a few hundred thousand rows.

If you’re using a RegExp object in a procedure while looping through a recordset programmatically you can just keep using the same RegExp object. But what about when the same pattern is used in independent calls to the same function — or even to different functions entirely? The solution is to cache all of your regular expressions objects, and serve them up using a simple factory pattern. In this article I’ll show how to do just that.

Creating the Factory

We will store all the RegExp objects in a dictionary, where the keys will be a concatenation of the pattern and some symbols to represent the three options properties (Global, IngoreCase, and Multiline).

You will need the following references for this to code work:

  • Microsoft VBScript Regular Expressions 5.5
  • Microsoft Scripting Runtime
' Module-level declaration. The object will persist between calls
Private pCachedRegexes As Dictionary

Public Function GetRegex( _
    ByVal Pattern As String, _
    Optional ByVal IgnoreCase As Boolean = True, _
    Optional ByVal MultiLine As Boolean = True, _
    Optional ByVal MatchGlobal As Boolean = True) As RegExp
    
    ' Ensure the dictionary has been initialized
    If pCachedRegexes Is Nothing Then Set pCachedRegexes = New Dictionary
    
    ' Build the unique key for the regex: a combination
    ' of the boolean properties and the pattern itself
    Dim rxKey As String
    rxKey = IIf(IgnoreCase, "1", "0") & _
            IIf(MultiLine, "1", "0") & _
            IIf(MatchGlobal, "1", "0") & _
            Pattern
            
    ' If the RegExp object doesn't already exist, create it
    If Not pCachedRegexes.Exists(rxKey) Then
        Dim oRegExp As New RegExp
        With oRegExp
            .Pattern = Pattern
            .IgnoreCase = IgnoreCase
            .MultiLine = MultiLine
            .Global = MatchGlobal
        End With
        Set pCachedRegexes(rxKey) = oRegExp
    End If

    ' Fetch and return the pre-compiled RegExp object
    Set GetRegex = pCachedRegexes(rxKey)
End Function

Now anywhere in your code that you need a RegExp object, you call GetRegex instead of declaring a New RegExp. This also makes it really fast to create new RegExp objects in a single line.

Updating the Regular Expressions Wrappers

See Regular Expressions in VBA for an in-depth discussion of these methods. Here we’ll just update them to take advantage of the new RegExp factory we just built:

RxTest
Public Function RxTest( _
    ByVal SourceString As String, _
    ByVal Pattern As String, _
    Optional ByVal IgnoreCase As Boolean = True, _
    Optional ByVal MultiLine As Boolean = True) As Boolean

    ' Wow, that was easy:
    RxTest = GetRegex(Pattern, IgnoreCase, MultiLine, False).Test(SourceString)
    
End Function
RxMatch
Public Function RxMatch( _
    ByVal SourceString As String, _
    ByVal Pattern As String, _
    Optional ByVal IgnoreCase As Boolean = True, _
    Optional ByVal MultiLine As Boolean = True) As Variant

    Dim oMatches As MatchCollection
    With GetRegex(Pattern, IgnoreCase, MultiLine, False)
        Set oMatches = .Execute(SourceString)
        If oMatches.Count > 0 Then
            RxMatch = oMatches(0).Value
        Else
            RxMatch = Null
        End If
    End With

End Function
RxMatches
Public Function RxMatches( _
    ByVal SourceString As String, _
    ByVal Pattern As String, _
    Optional ByVal IgnoreCase As Boolean = True, _
    Optional ByVal MultiLine As Boolean = True, _
    Optional ByVal MatchGlobal As Boolean = True) As Variant

    Dim oMatch As Match
    Dim arrMatches
    Dim lngCount As Long
    
    arrMatches = Array()
    With GetRegex(Pattern, IgnoreCase, MultiLine, MatchGlobal)
        For Each oMatch In .Execute(SourceString)
            ReDim Preserve arrMatches(lngCount)
            arrMatches(lngCount) = oMatch.Value
            lngCount = lngCount + 1
        Next
    End With

    RxMatches = arrMatches
End Function
RxReplace
Public Function RxReplace( _
    ByVal SourceString As String, _
    ByVal Pattern As String, _
    ByVal ReplacePattern As String, _
    Optional ByVal IgnoreCase As Boolean = True, _
    Optional ByVal MultiLine As Boolean = True, _
    Optional ByVal MatchGlobal As Boolean = True) As String

    ' A single statement!
    RxReplace = GetRegex( _
        Pattern, IgnoreCase, MultiLine, MatchGlobal).Replace( _
        SourceString, ReplacePattern)

End Function

And that’s that! In a future post I’ll show examples of when and how you might actually use such functions, as well as performance tests delving deeper into the benefits of caching your RegExp objects.

Conclusion

Behind the scenes, RegExp objects generate tiny compiled programs that are executed in order to apply regular expressions. Using a simple caching mechanism you can dramatically increase the performance of using regular expressions in Microsoft Access by saving each unique compiled RegExp object, as identified by the pattern and the three options properties of the object.