Solr / SOLR-13242

RegexReplaceProcessorFactory not making accurate replacement


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 7.6, 7.7, 7.7.1
    • None
    • None

    Description

      We are using RegexReplaceProcessorFactory and have tried each of the following configurations in solrconfig.xml:

       

      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">content</str>
        <str name="pattern">(\s*\r?\n){2,}</str>
        <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
        <bool name="literalReplacement">true</bool>
      </processor>

      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">content</str>
        <str name="pattern">([ \s]*\r?\n){2,}</str>
        <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
        <bool name="literalReplacement">true</bool>
      </processor>

      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">content</str>
        <str name="pattern">(\s*\n){2,}</str>
        <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
        <bool name="literalReplacement">true</bool>
      </processor>

      <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">content</str>
        <str name="pattern">(\n\s*){2,}</str>
        <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
        <bool name="literalReplacement">true</bool>
      </processor>

       

      The regex patterns (\s*\r?\n){2,}, ([ \s]*\r?\n){2,}, (\s*\n){2,} and (\n\s*){2,} all work as expected on regex101.com: every run of consecutive \n is replaced by exactly two <br>.

      However, in Solr there are cases (Examples 2 and 3 below) that end up with four <br> instead of two. This should not happen, since the configuration replaces each run with two <br> regardless of how many \n appear in a row.
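To separate the regex behaviour from anything Solr-specific, the fourth pattern can be run directly against java.util.regex. The following is a minimal sketch, not Solr's actual code path: the class name `CollapseBlankLines` and the `collapse` helper are made up for illustration, and `Matcher.quoteReplacement` is used here only as a stand-in for `literalReplacement=true`. It assumes the input contains only ASCII spaces and `\n`, exactly as written in the examples.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Solr.
class CollapseBlankLines {
    // Fourth pattern from the report: a newline followed by optional
    // whitespace, repeated at least twice.
    static final Pattern BLANK_RUN = Pattern.compile("(\\n\\s*){2,}");

    // quoteReplacement makes the replacement a literal string, so that
    // '$' and '\' would not be treated as group references.
    static String collapse(String content) {
        return BLANK_RUN.matcher(content)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    public static void main(String[] args) {
        // Example 1 from the report, with plain ASCII whitespace.
        String example1 = "Dear Sir,  \n\n \n \n\n I am terminating";
        // prints: Dear Sir,  <br><br>I am terminating
        System.out.println(collapse(example1));

        // Example 2 from the report, again assuming plain ASCII whitespace.
        String example2 = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n"
                + "  3 Choa Chu Kang Avenue 4, Singapore";
        System.out.println(collapse(example2));
    }
}
```

If the ASCII-only version of Example 2 collapses to a single `<br><br>` per run here, the difference seen in the index is more likely to come from the actual characters in the field value than from the pattern itself.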

       

       

      Example 1: A sentence where the above regex patterns work correctly

      *Original content in EML file:*  

      Dear Sir, 

       

      I am terminating 

      Original content:    Dear Sir,  \n\n \n \n\n I am terminating

      Index content:     Dear Sir,  <br><br>I am terminating 

       

      Example 2: A sentence where the above regex patterns only partially work (instead of 2 <br>, there are 4 <br>)

      *Original content in EML file:*    

      exalted

      Psalm 89:17

       

      3 Choa Chu Kang Avenue 4    

      Original content: exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore

      Index content: exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa Chu Kang Avenue 4, Singapore

       

      Example 3: A sentence where the above regex patterns only partially work (instead of 2 <br>, there are 4 <br>)

      *Original content in EML file:*    

      http://www.concordpri.moe.edu.sg/

       

       

       

       

      On Tue, Dec 18, 2018 at 10:07 AM    

      Original content: http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at 10:07 AM 

      Index content: http://www.concordpri.moe.edu.sg/   <br><br>  <br><br>On Tue, Dec 18, 2018 at 10:07 AM
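One possible explanation for Examples 2 and 3, which this report does not confirm, is that some of the "spaces" in the EML content are non-breaking spaces (U+00A0). By default, `\s` in java.util.regex matches only ASCII whitespace, so a non-breaking space would split one run into two matches. The sketch below tests that assumption. The class name `NbspCheck`, the sample input, and the position of the non-breaking spaces are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical check for illustration; the NBSP placement is an assumption.
class NbspCheck {
    static final String BR = Matcher.quoteReplacement("<br><br>");

    // Fourth pattern from the report; \s is ASCII-only by default in Java.
    static final Pattern ASCII_ONLY = Pattern.compile("(\\n\\s*){2,}");

    // Variant that also treats U+00A0 (no-break space) as whitespace.
    static final Pattern WITH_NBSP = Pattern.compile("(\\n[\\s\\u00A0]*){2,}");

    static String collapse(Pattern pattern, String content) {
        return pattern.matcher(content).replaceAll(BR);
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0";
        // Shaped like Example 2, but with two assumed non-breaking spaces
        // between the two groups of newlines.
        String input = "Psalm 89:17   \n\n " + nbsp + nbsp
                + "\n\n  3 Choa Chu Kang Avenue 4";

        // The run is split around the non-breaking spaces: two replacements.
        System.out.println(collapse(ASCII_ONLY, input));
        // The whole run is matched once: one replacement.
        System.out.println(collapse(WITH_NBSP, input));
    }
}
```

If the first call reproduces the doubled `<br><br>` and the second does not, inspecting the raw field value for U+00A0 or other non-ASCII whitespace would be a reasonable next step.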


          People

            Assignee: Unassigned
            Reporter: Edwin Yeo Zheng Lin (edwinyeozl)