Browse Source

Improve efficiency of BoundedBreakIteratorScanner fragmentation algorithm (#89041)

As discussed in #73569 the current implementation is too slow in certain scenarios.

The inefficient part of the code can be stated as the following problem:

Given a text (getText()) and a position in this text (offset), find the sentence 
boundary before and after the offset, in such a way that the after boundary is 
maximal but respects end boundary - start boundary < fragment size.

In case it's impossible to produce an after boundary that respects the said 
condition, use the nearest boundary following offset.

The current approach begins by finding the nearest preceding and following boundaries, 
and expands the following boundary greedily while it respects the problem restriction. This 
is fine asymptotically, but BreakIterator which is used to find each boundary is sometimes 
expensive.

This new approach maximizes the after boundary by scanning for the last boundary 
preceding the position that would cause the condition to be violated (i.e. knowing start
boundary and offset, how many characters are left before resulting length is fragment size). 
If this scan finds the start boundary, it means it's impossible to satisfy the problem 
restriction, and we get the first boundary following offset instead (or better, since we 
already scanned [offset, targetEndOffset], start from targetEndOffset + 1).
Denilson das Mercês Amorim 3 years ago
parent
commit
6bf5078fa9

+ 8 - 0
docs/changelog/89041.yaml

@@ -0,0 +1,8 @@
+pr: 89041
+summary: Improve efficiency of `BoundedBreakIteratorScanner` fragmentation algorithm
+area: Highlighting
+type: enhancement
+issues:
+ - 73569
+ - 73785
+

+ 22 - 9
server/src/main/java/org/elasticsearch/lucene/search/uhighlight/BoundedBreakIteratorScanner.java

@@ -89,16 +89,29 @@ public class BoundedBreakIteratorScanner extends BreakIterator {
             innerStart = innerEnd;
             innerEnd = windowEnd;
         } else {
-            windowStart = innerStart = mainBreak.preceding(offset);
-            windowEnd = innerEnd = mainBreak.following(offset - 1);
-            // expand to next break until we reach maxLen
-            while (innerEnd - innerStart < maxLen) {
-                int newEnd = mainBreak.following(innerEnd);
-                if (newEnd == DONE || (newEnd - innerStart) > maxLen) {
-                    break;
-                }
-                windowEnd = innerEnd = newEnd;
+            innerStart = Math.max(mainBreak.preceding(offset), 0);
+
+            final long targetEndOffset = (long) offset + Math.max(0, maxLen - (offset - innerStart));
+            final int textEndIndex = getText().getEndIndex();
+
+            if (targetEndOffset + 1 > textEndIndex) {
+                innerEnd = textEndIndex;
+            } else {
+                innerEnd = mainBreak.preceding((int) targetEndOffset + 1);
+            }
+
+            assert innerEnd != DONE && innerEnd >= innerStart;
+
+            // in case no break was found up to maxLen, find one afterwards.
+            if (innerStart == innerEnd) {
+                innerEnd = mainBreak.following((int) targetEndOffset);
+                assert innerEnd - innerStart > maxLen;
+            } else {
+                assert innerEnd - innerStart <= maxLen;
             }
+
+            windowStart = innerStart;
+            windowEnd = innerEnd;
         }
 
         if (innerEnd - innerStart > maxLen) {

+ 28 - 0
server/src/test/java/org/elasticsearch/lucene/search/uhighlight/BoundedBreakIteratorScannerTests.java

@@ -117,4 +117,32 @@ public class BoundedBreakIteratorScannerTests extends ESTestCase {
             testRandomAsciiTextCase(BoundedBreakIteratorScanner.getSentence(Locale.ROOT, maxLen), maxLen);
         }
     }
+
+    public void testTextThatEndsBeforeMaxLen() {
+        BreakIterator bi = BoundedBreakIteratorScanner.getSentence(Locale.ROOT, 1000);
+
+        final String text = "This is the first test sentence. Here is the second one.";
+
+        int offset = text.indexOf("first");
+        bi.setText(text);
+        assertEquals(0, bi.preceding(offset));
+        assertEquals(text.length(), bi.following(offset - 1));
+
+        offset = text.indexOf("second");
+        bi.setText(text);
+        assertEquals(33, bi.preceding(offset));
+        assertEquals(text.length(), bi.following(offset - 1));
+    }
+
+    public void testFragmentSizeThatIsTooBig() {
+        final int fragmentSize = Integer.MAX_VALUE;
+        BreakIterator bi = BoundedBreakIteratorScanner.getSentence(Locale.ROOT, fragmentSize);
+
+        final String text = "Any sentence";
+        final int offset = 0; // find at beggining of text
+
+        bi.setText(text);
+        assertEquals(0, bi.preceding(offset));
+        assertEquals(text.length(), bi.following(offset - 1));
+    }
 }