README.txt 6.6 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169
  1. This directory contains source code for the SQLite "ICU" extension, an
  2. integration of the "International Components for Unicode" library with
  3. SQLite. Documentation follows.
  4. 1. Features
  5. 1.1 SQL Scalars upper() and lower()
  6. 1.2 Unicode Aware LIKE Operator
  7. 1.3 ICU Collation Sequences
  8. 1.4 SQL REGEXP Operator
  9. 2. Compilation and Usage
  10. 3. Bugs, Problems and Security Issues
  11. 3.1 The "case_sensitive_like" Pragma
  12. 3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro
  13. 3.3 Collation Sequence Security Issue
  14. 1. FEATURES
  15. 1.1 SQL Scalars upper() and lower()
  16. SQLite's built-in implementations of these two functions only
  17. provide case mapping for the 26 letters used in the English
  18. language. The ICU based functions provided by this extension
  19. provide case mapping, where defined, for the full range of
  20. unicode characters.
  21. ICU provides two types of case mapping, "general" case mapping and
  22. "language specific". Refer to ICU documentation for the differences
  23. between the two. Specifically:
  24. http://www.icu-project.org/userguide/caseMappings.html
  25. http://www.icu-project.org/userguide/posix.html#case_mappings
  26. To utilise "general" case mapping, the upper() or lower() scalar
  27. functions are invoked with one argument:
  28. upper('ABC') -> 'abc'
  29. lower('abc') -> 'ABC'
  30. To access ICU "language specific" case mapping, upper() or lower()
  31. should be invoked with two arguments. The second argument is the name
  32. of the locale to use. Passing an empty string ("") or SQL NULL value
  33. as the second argument is the same as invoking the 1 argument version
  34. of upper() or lower():
  35. lower('I', 'en_us') -> 'i'
  36. lower('I', 'tr_tr') -> 'ı' (small dotless i)
  37. 1.2 Unicode Aware LIKE Operator
  38. Similarly to the upper() and lower() functions, the built-in SQLite LIKE
  39. operator understands case equivalence for the 26 letters of the English
  40. language alphabet. The implementation of LIKE included in this
  41. extension uses the ICU function u_foldCase() to provide case
  42. independent comparisons for the full range of unicode characters.
  43. The U_FOLD_CASE_DEFAULT flag is passed to u_foldCase(), meaning the
  44. dotless 'I' character used in the Turkish language is considered
  45. to be in the same equivalence class as the dotted 'I' character
  46. used by many languages (including English).
  47. 1.3 ICU Collation Sequences
  48. A special SQL scalar function, icu_load_collation() is provided that
  49. may be used to register ICU collation sequences with SQLite. It
  50. is always called with exactly two arguments, the ICU locale
  51. identifying the collation sequence to ICU, and the name of the
  52. SQLite collation sequence to create. For example, to create an
  53. SQLite collation sequence named "turkish" using Turkish language
  54. sorting rules, the SQL statement:
  55. SELECT icu_load_collation('tr_TR', 'turkish');
  56. Or, for Australian English:
  57. SELECT icu_load_collation('en_AU', 'australian');
  58. The identifiers "turkish" and "australian" may then be used
  59. as collation sequence identifiers in SQL statements:
  60. CREATE TABLE aust_turkish_penpals(
  61. australian_penpal_name TEXT COLLATE australian,
  62. turkish_penpal_name TEXT COLLATE turkish
  63. );
  64. 1.4 SQL REGEXP Operator
  65. This extension provides an implementation of the SQL binary
  66. comparision operator "REGEXP", based on the regular expression functions
  67. provided by the ICU library. The syntax of the operator is as described
  68. in SQLite documentation:
  69. <string> REGEXP <re-pattern>
  70. This extension uses the ICU defaults for regular expression matching
  71. behavior. Specifically, this means that:
  72. * Matching is case-sensitive,
  73. * Regular expression comments are not allowed within patterns, and
  74. * The '^' and '$' characters match the beginning and end of the
  75. <string> argument, not the beginning and end of lines within
  76. the <string> argument.
  77. Even more specifically, the value passed to the "flags" parameter
  78. of ICU C function uregex_open() is 0.
  79. 2 COMPILATION AND USAGE
  80. The easiest way to compile and use the ICU extension is to build
  81. and use it as a dynamically loadable SQLite extension. To do this
  82. using gcc on *nix:
  83. gcc -shared icu.c `icu-config --ldflags` -o libSqliteIcu.so
  84. You may need to add "-I" flags so that gcc can find sqlite3ext.h
  85. and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be
  86. loaded into sqlite in the same way as any other dynamically loadable
  87. extension.
  88. 3 BUGS, PROBLEMS AND SECURITY ISSUES
  89. 3.1 The "case_sensitive_like" Pragma
  90. This extension does not work well with the "case_sensitive_like"
  91. pragma. If this pragma is used before the ICU extension is loaded,
  92. then the pragma has no effect. If the pragma is used after the ICU
  93. extension is loaded, then SQLite ignores the ICU implementation and
  94. always uses the built-in LIKE operator.
  95. The ICU extension LIKE operator is always case insensitive.
  96. 3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro
  97. Passing very long patterns to the built-in SQLite LIKE operator can
  98. cause excessive CPU usage. To curb this problem, SQLite defines the
  99. SQLITE_MAX_LIKE_PATTERN_LENGTH macro as the maximum length of a
  100. pattern in bytes (irrespective of encoding). The default value is
  101. defined in internal header file "limits.h".
  102. The ICU extension LIKE implementation suffers from the same
  103. problem and uses the same solution. However, since the ICU extension
  104. code does not include the SQLite file "limits.h", modifying
  105. the default value therein does not affect the ICU extension.
  106. The default value of SQLITE_MAX_LIKE_PATTERN_LENGTH used by
  107. the ICU extension LIKE operator is 50000, defined in source
  108. file "icu.c".
  109. 3.3 Collation Sequence Security Issue
  110. Internally, SQLite assumes that indices stored in database files
  111. are sorted according to the collation sequence indicated by the
  112. SQL schema. Changing the definition of a collation sequence after
  113. an index has been built is therefore equivalent to database
  114. corruption. The SQLite library is not very well tested under
  115. these conditions, and may contain potential buffer overruns
  116. or other programming errors that could be exploited by a malicious
  117. programmer.
  118. If the ICU extension is used in an environment where potentially
  119. malicious users may execute arbitrary SQL (i.e. gears), they
  120. should be prevented from invoking the icu_load_collation() function,
  121. possibly using the authorisation callback.