[DOCS] Updated ICU-Plugin docs from the repo README

Clinton Gormley 12 years ago
commit ea05f4538c
1 changed files with 71 additions and 3 deletions

docs/reference/analysis/icu-plugin.asciidoc

@@ -39,7 +39,7 @@ Here is a sample settings:
 === ICU Folding
 
 Folding of unicode characters based on `UTR#30`. It registers itself
-under `icu_folding` and `icuFolding` names.  
+under `icu_folding` and `icuFolding` names.
 The filter also does lowercasing, which means the lowercase filter can
 normally be left out. Sample setting:
 
@@ -70,7 +70,7 @@ primary letters in a specific language is wanted. See syntax for the
 UnicodeSet
 http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UnicodeSet.html[here].
 
-The Following example excempt Swedish characters from the folding. Note
+The following example exempts Swedish characters from the folding. Note
 that the filtered characters are NOT lowercased which is why we add that
 filter below.
 
@@ -148,5 +148,73 @@ And here is a sample of custom collation:
             }
         }
     }
-}    
+}
 --------------------------------------------------
+
+[float]
+==== Options
+
+[horizontal]
+`strength`::
+    The strength property determines the minimum level of difference considered significant during comparison.
+     The default strength for the Collator is `tertiary`, unless specified otherwise by the locale used to create the Collator.
+     Possible values: `primary`, `secondary`, `tertiary`, `quaternary` or `identical`.
+ +
+ See the http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html[ICU Collation] documentation for a more detailed
+ explanation of the specific values.
+
+`decomposition`::
+    Possible values: `no` or `canonical`. Defaults to `no`. Setting this decomposition property to
+    `canonical` allows the Collator to handle un-normalized text properly, producing the same results as if the text were
+    normalized. If `no` is set, it is the user's responsibility to ensure that all text is already in the appropriate form
+    before a comparison or before getting a CollationKey. Adjusting decomposition mode allows the user to select between
+    faster and more complete collation behavior. Since a great many of the world's languages do not require text
+    normalization, most locales set `no` as the default decomposition mode.
+
+[float]
+==== Expert options
+
+[horizontal]
+`alternate`::
+    Possible values: `shifted` or `non-ignorable`. Sets the alternate handling for strength `quaternary`
+    to be either shifted or non-ignorable. In practice, this determines whether punctuation and whitespace are ignored.
+
+`caseLevel`::
+    Possible values: `true` or `false`. Defaults to `false`. Whether case level sorting is required. When
+    strength is set to `primary`, enabling this takes case into account while still ignoring accent differences.
+
+`caseFirst`::
+    Possible values: `lower` or `upper`. Useful to control which case is sorted first when case is not ignored
+    for strength `tertiary`.
+
+`numeric`::
+    Possible values: `true` or `false`. Defaults to `false`. Whether digits are sorted according to their numeric
+    value. For example, with `true` the value `egg-9` is sorted before the value `egg-21`.
+
+`variableTop`::
+    Single character or contraction. Controls which characters are treated as variable, and therefore affected by `alternate`.
+
+`hiraganaQuaternaryMode`::
+    Possible values: `true` or `false`. Defaults to `false`. Whether to distinguish between Katakana and
+    Hiragana characters at `quaternary` strength.
+
+[float]
+=== ICU Tokenizer
+
+Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).
+
+[source,js]
+--------------------------------------------------
+{
+    "index" : {
+        "analysis" : {
+            "analyzer" : {
+                "collation" : {
+                    "tokenizer" : "icu_tokenizer"
+                }
+            }
+        }
+    }
+}
+--------------------------------------------------
+
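The collation options documented in this change could be combined in a filter definition along the following lines. This is only a sketch and is not part of the commit: the analyzer name `example_sort_analyzer` and the filter name `example_numeric_collation` are hypothetical, and the `icu_collation` filter type with its `language` setting is assumed to work as in the collation samples earlier in the same file. It sets `strength` to `primary` and enables `numeric`, so that a value such as `egg-9` would be expected to sort before `egg-21`.

[source,js]
--------------------------------------------------
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "example_sort_analyzer" : {
                    "tokenizer" : "keyword",
                    "filter" : ["example_numeric_collation"]
                }
            },
            "filter" : {
                "example_numeric_collation" : {
                    "type" : "icu_collation",
                    "language" : "en",
                    "strength" : "primary",
                    "numeric" : true
                }
            }
        }
    }
}
--------------------------------------------------

The expert options (`alternate`, `caseLevel`, `caseFirst`, `variableTop`, `hiraganaQuaternaryMode`) would presumably be added to the same filter object in the same way, using the values listed for each.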