📄 rfc4790.txt
字号:
not actually be present). The "collation-auri" form is an abstract name for an ordering, a collation pattern or a vendor private collator. collation-uri = "http://www.iana.org/assignments/collation/" collation-id ".xml" collation-auri = ( "http://www.iana.org/assignments/collation/" collation-order ".xml" ) / other-uri other-uri = <absoluteURI> ; excluding the IANA collation namespace.3.5. Naming Guidelines While this specification makes no absolute requirements on the structure of collation identifiers, naming consistency is important, so the following initial guidelines are provided. Collation identifiers with an international audience typically begin with "i;". Collation identifiers intended for a particular language or locale typically begin with a language tag [5] followed by a ";". After the first ";" is normally the name of the general collation algorithm, followed by a series of algorithm modifications separated by the ";" delimiter. Parameterized modifications will use "=" toNewman, et al. Standards Track [Page 7]RFC 4790 Collation Registry March 2007 delimit the parameter from the value. The version numbers of any lookup tables used by the algorithm SHOULD be present as parameterized modifications. Collation identifiers of the form *;vnd-hostname;* are reserved for vendor-specific collations created by the owner of the hostname following the "vnd-" prefix (e.g., vnd-example.com for the vendor example.com). Registration of such collations (or the name space as a whole), with intended use of the "Vendor", is encouraged when a public specification or open-source implementation is available, but is not required.4. Collation Specification Requirements4.1. Collation/Server Interface The collation itself defines what it operates on. Most collations are expected to operate on character strings. The i;octet (Section 9.3) collation operates on octet strings. The i;ascii- numeric (Section 9.1) operation operates on numbers. This specification defines the collation interface in terms of octet strings. However, implementations may choose to use character strings instead. Such implementations may not be able to implement e.g., i;octet. Since i;octet is not currently mandatory to implement for any protocol, this should not be a problem.4.2. Operations Supported A collation specification MUST state which of the three basic operations are supported (equality, substring, ordering) and how to perform each of the supported operations on any two input character strings, including empty strings. Collations must be deterministic, i.e., given a collation with a specific identifier, and any two fixed input strings, the result MUST be the same for the same operation. In general, collation operations should behave as their names suggest. While a collation may be new, the operations are not, so the new collation's operations should be similar to those of older collations. For example, a date/time collation should not provide a "substring" operation that would morph IMAP substring SEARCH into e.g., a date-range search. A non-obvious consequence of the rules for each collation operation is that, for any single collation, either none or all of the operations can return "undefined". For example, it is not possible to have an equality operation that never returns "undefined", and a substring operation that occasionally does.Newman, et al. Standards Track [Page 8]RFC 4790 Collation Registry March 20074.2.1. Validity The validity test takes one string as argument. It returns valid if its input string is a valid input to the collation's other operations, and invalid if not. (In other words, a string is valid if it is equal to itself according to the collation's equality operation.) The validity test is provided by all collations. It MUST NOT be listed separately in the collation registration.4.2.2. Equality The equality test always returns "match" or "no-match" when it is supplied valid input, and MAY return "undefined" if one or both input strings are not valid. The equality test MUST be reflexive and symmetric. For valid input, it MUST be transitive. If a collation provides either a substring or an ordering test, it MUST also provide an equality test. The substring and/or ordering tests MUST be consistent with the equality test. The return values of the equality test are called "match", "no-match" and "undefined" in this document.4.2.3. Substring The substring matching operation determines if the first string is a substring of the second string, i.e., if one or more substrings of the second string is equal to the first, as defined by the collation's equality operation. A collation that supports substring matching will automatically support two special cases of substring matching: prefix and suffix matching, if those special cases are supported by the application protocol. It returns "match" or "no-match" when it is supplied valid input and returns "undefined" when supplied invalid input. Application protocols MAY return position information for substring matches. If this is done, the position information SHOULD include both the starting offset and the ending offset for each match. This is important because more sophisticated collations can match strings of unequal length (for example, a pre-composed accented character can match a decomposed accented character). In general, overlapping matches SHOULD be reported (as when "ana" occurs twice within "banana"), although there are cases where a collation may decide notNewman, et al. Standards Track [Page 9]RFC 4790 Collation Registry March 2007 to. For example, in a collation which treats all whitespace sequences as identical, the substring operation could be defined such that " 1 " (SP "1" SP) is reported just once within " 1 " (SP SP "1" SP SP), not four times (SP SP "1" SP, SP "1" SP, SP "1" SP SP and SP SP "1" SP SP), since the four matches are, in a sense, the same match. A string is a substring of itself. The empty string is a substring of all strings. Note that the substring operation of some collations can match strings of unequal length. For example, a pre-composed accented character can match a decomposed accented character. The Unicode Collation Algorithm [7] discusses this in more detail. The return values of the substring operation are called "match", "no- match", and "undefined" in this document.4.2.4. Ordering The ordering operation determines how two strings are ordered. It MUST be reflexive. For valid input, it MUST be transitive and trichotomous. Ordering returns "less" if the first string is listed before the second string, according to the collation; "greater", if the second string is listed before the first string; and "equal", if the two strings are equal, as defined by the collation's equality operation. If one or both strings are invalid, the result of ordering is "undefined". When the collation is used with a "+" prefix, the behavior is the same as when used with no prefix. When the collation is used with a "-" prefix, the result of the ordering operation of the collation MUST be reversed. The return values of the ordering operation are called "less", "equal", "greater", and "undefined" in this document.4.3. Sort Keys A collation specification SHOULD describe the internal transformation algorithm to generate sort keys. This algorithm can be applied to individual strings, and the result can be stored to potentially optimize future comparison operations. A collation MAY specify that the sort key is generated by the identity function. The sort key may have no meaning to a human. The sort key may not be valid input to the collation.Newman, et al. Standards Track [Page 10]RFC 4790 Collation Registry March 20074.4. Use of Lookup Tables Some collations use customizable lookup tables, e.g., because the tables depend on locale, and may be modified after shipping the software. Collations that use more than one customizable lookup table in a documented format MUST assign numbers to the tables they use. This permits an application protocol command to access the tables used by a server collation, so that clients and servers use the same tables.5. Application Protocol Requirements This section describes the requirements and issues that an application protocol needs to consider if it offers searching, substring matching and/or sorting, and permits the use of characters outside the US-ASCII charset.5.1. Character Encoding The protocol specification has to make sure that it is clear on which characters (rather than just octets) the collations are used. This can be done by specifying the protocol itself in terms of characters (e.g., in the case of a query language), by specifying a single character encoding for the protocol (e.g., UTF-8 [3]), or by carefully describing the relevant issues of character encoding labeling and conversion. In the later case, details to consider include how to handle unknown charsets, any charsets that are mandatory-to-implement, any issues with byte-order that might apply, and any transfer encodings that need to be supported.5.2. Operations The protocol must specify which of the operations defined in this specification (equality matching, substring matching, and ordering) can be invoked in the protocol, and how they are invoked. There may be more than one way to invoke an operation. The protocol MUST provide a mechanism for the client to select the collation to use with equality matching, substring matching, and ordering. If a protocol needs a total ordering and the collation chosen does not provide it because the ordering operation returns "undefined" at least once, the recommended fallback is to sort all invalid strings after the valid ones, and use i;octet to order the invalid strings. Although the collation's substring function provides a list of matches, a protocol need not provide all that to the client. It mayNewman, et al. Standards Track [Page 11]RFC 4790 Collation Registry March 2007 provide only the first matching substring, or even just the information that the substring search matched. In this way, collations can be used with protocols that are defined such that "x is a substring of y" returns true-false. If the protocol provides positional information for the results of a substring match, that positional information SHOULD fully specify the substring(s) in the result that matches, independent of the length of the search string. For example, returning both the starting and ending offset of the match would suffice, as would the starting offset and a length. Returning just the starting offset is not acceptable. This rule is necessary because advanced collations can treat strings of different lengths as equal (for example, pre- composed and decomposed accented characters).5.3. Wildcards The protocol MUST specify whether it allows the use of wildcards in collation identifiers. If the protocol allows wildcards, then: The protocol MUST specify how comparisons behave in the absence of explicit collation negotiation, or when a collation of "default" is requested. The protocol MAY specify that the default collation used in such circumstances is sensitive to server configuration. The protocol SHOULD provide a way to list available collations matching a given wildcard pattern, or patterns.5.4. String Comparison If a protocol compares strings in any nontrivial way, using a collation may be appropriate. As an example, many protocols use case-independent strings. In many cases, a simple ASCII mapping to upper/lower case works well. In other cases, it may be better to use a specifiable collation; for example, so that a server can treat "i" and "I" as equivalent in Italy, and different in Turkey (Turkish also has a dotted upper-case" I" and a dotless lower-case "i"). Protocol designers should consider, in each case, whether to use a specifiable collation. Keywords often have other needs than user variables, and search arguments may be different again.5.5. Disconnected Clients If the protocol supports disconnected clients, and a collation is used that can use configurable tables (e.g., to support locale-specific extensions), then the client may not be able to reproduce the server's collation operations while offline.Newman, et al. Standards Track [Page 12]RFC 4790 Collation Registry March 2007 A mechanism to download such tables has been discussed. Such a mechanism is not included in the present specification, since the problem is not yet well understood.5.6. Error Codes The protocol specification should consider assigning protocol error codes for the following circumstances: o The client requests the use of a collation by identifier or pattern, but no implemented collation matches that pattern. o The client attempts to use a collation for an operation that is not supported by that collation -- for example, attempting to use the "i;ascii-numeric" collation for substring matching. o The client uses an equality or substring matching collation, and the result is an error. It may be appropriate to distinguish between the two input strings, particularly when one is supplied by the client and the other is stored by the server. It might also be appropriate to distinguish the specific case of an invalid UTF-8 string.5.7. Octet Collation The i;octet (Section 9.3) collation is only usable with protocols based on octet-strings. Clients and servers MUST NOT use i;octet with other protocols. If the protocol permits the use of collations with data structures other than strings, the protocol MUST describe the default behavior for a collation with those data structures.6. Use by Existing Protocols This section is informative. Both ACAP [11] and Sieve [14] are standards track specifications that used collations prior to the creation of this specification and registry. Those standards do not meet all the application protocol requirements described in Section 5. These protocols allow the use of the i;octet (Section 9.3) collation working directly on UTF-8 data, as used in these protocols.Newman, et al. Standards Track [Page 13]
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -