Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-5914

Cache compiled regular expressions in SQL function runtime

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 1.36.0
    • None

    Description

      Cache compiled regular expressions (and other amortized work) in SQL function runtime. Compiling a regular expression to a pattern is expensive (compared to the cost of matching, given an existing pattern) and therefore caching the compiled form will yield performance benefits if the regular expression is constant or has a small number of values.

      Consider the following query:

      SELECT ename, job, RLIKE(ename, 'A.*'), RLIKE(ename, job || '.*')
      FROM emp
      

      The first regular expression, 'A.*', is constant and can be compiled at prepare time or at the start of execution; the second regular expression, job || '.*', might vary from one row to the next. However if the job column has a small number of values it still might be beneficial to cache the compiled regular expression.

      If SqlFunctions.rlike could use a cache (mapping from String to java.util.regex.Pattern) then it would achieve benefits in both the constant and non-constant cases.

      The cache needs to:

      • be thread-safe (in case queries are executing using multiple threads),
      • return thread-safe objects (as is Pattern),
      • have bounded space (so that a query doesn't blow memory with 1 million distinct regular expressions),
      • disposed after the query has terminated,
      • (ideally) share with regexes of the same language in the same query,
      • not conflict with regexes of different languages in the same query.

      One possible implementation is to add an interface FunctionState, with subclasses including class RegexpCache, and if argument 1 of a function is a subclass of FunctionState the compiler would initialize the state in the generated code. The function can rely on the state argument being initialized, and being the same object from one call to the next. Example:

      interface FunctionState {
      }
      
      class RegexpCache implements FunctionState {
        final Cache<String, Pattern> cache = ...;
      }
      

      This change should install the cache for all applicable functions, including LIKE, ILIKE, RLIKE, SIMILAR, posix regex, REGEXP_REPLACE (MySQL, Oracle), REGEXP_CONTAINS (BigQuery), other BigQuery REGEXP_ functions, PARSE_URL, JSON_REPLACE, PARSE_TIMESTAMP.

      It can also be used for functions that have mutable state, e.g. RANDOM with and without a seed.

      Attachments

        Issue Links

          Activity

            People

              julianhyde Julian Hyde
              julianhyde Julian Hyde
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: