rosscarter.com

NSString -characterAtIndex:

14 August 2008 11:17 am UTC

I posit that there is no such thing; that is to say, characterAtIndex: will return a unichar value that might or might not represent the encoding of a character.

Today I filed this documentation bug on the NSString reference document:

I believe the description of characterAtIndex: is misleading and possibly incorrect.

The description makes no reference to Unicode and fosters ambiguity regarding the meaning of “character.” The method name itself is evidently a historical relic: “We have this terminology problem for historical reasons; characterAtIndex: antedates the introduction of surrogate pairs.” Douglas Davidson, http://www.cocoabuilder.com/archive/message/cocoa/2007/9/10/188991.

The documentation states that characterAtIndex: “Returns the character at a given array position.” The method does not in fact return a “character” at the specified index, but rather the numeric value, expressed as a unichar, at the index, a value which may or may not represent the smallest semantic unit in a writing system. “In particular, -characterAtIndex: can return either half of a surrogate pair (e.g. if you have a string containing a non-BMP code point like MUSICAL SYMBOL G CLEF U+1D11E, which is encoded D834 DD1E according to Character Palette, you might get 0xD834 or 0xDD1E, but you won’t ever get 0x1D11E). Nor is that the only trap for the unwary; you can also get various types of Unicode control codes as well as several kinds of combining characters (though the most common group is probably accents).” Alastair Houghton, http://www.cocoabuilder.com/archive/message/cocoa/2008/6/18/210572.
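To make the surrogate-pair point concrete, here is a minimal sketch in Python rather than Objective-C. The helper name utf16_code_units is hypothetical, my own invention; it simply exposes the UTF-16 code units of a string, which I am assuming is the same sequence that characterAtIndex: indexes into.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers (hypothetical helper)."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

g_clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a non-BMP code point
units = utf16_code_units(g_clef)

# One code point, but two UTF-16 code units: a high and a low surrogate.
print(len(units))               # 2
print([hex(u) for u in units])  # ['0xd834', '0xdd1e']
```

Neither element of the list is 0x1D11E, which matches the behaviour described in the quotation: indexing by code unit can only ever yield one half of the pair.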

Also, “In addition to what Doug says, bear in mind that even precomposed Unicode cannot be accessed one ‘unichar’ at a time. First, there may still be surrogate pairs (two consecutive UTF-16 code units used to represent characters beyond the first 16 bits of Unicode), and second, there are some characters that cannot be represented by a single Unicode code point, even in the canonical precomposed form of Unicode (NFC == Normalization Form C).” Deborah Goldsmith, http://www.cocoabuilder.com/archive/message/cocoa/2007/4/7/181483.
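The NFC point can be checked with a small sketch using Python's standard unicodedata module. The two sample strings are my own choices for illustration: "e" plus COMBINING ACUTE ACCENT has a precomposed form, while "q" plus COMBINING TILDE, as far as I know, does not.

```python
import unicodedata

# "e" + COMBINING ACUTE ACCENT composes to the single code point U+00E9.
e_acute = unicodedata.normalize("NFC", "e\u0301")

# "q" + COMBINING TILDE has no precomposed form, so NFC leaves two code points.
q_tilde = unicodedata.normalize("NFC", "q\u0303")

print(len(e_acute))  # 1
print(len(q_tilde))  # 2
```

So even after normalizing to NFC, something a reader would call one character can still occupy more than one code point, and therefore more than one unichar.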

For these reasons, Apple actively discourages use of characterAtIndex:. “The characterAtIndex: method should be avoided wherever possible; with Unicode strings, examining a single character usually is not sufficient.” Douglas Davidson, http://www.cocoabuilder.com/archive/message/cocoa/2007/4/4/181368. “We try to discourage developers from working at the level of individual characters wherever possible, primarily because in Unicode the individual character is usually not the appropriate level at which to operate. . . . For those who need to do their own low-level processing, and who are willing to handle Unicode complications themselves, we provide access to UTF-16 directly via characterAtIndex: et al., and to other representations with getBytes:… and related methods.” Douglas Davidson, http://www.cocoabuilder.com/archive/message/cocoa/2007/9/7/188858.

It might be posited that developers are given adequate warning in this statement from the String Programming Guide For Cocoa: “If you need to access string objects character-by-character, you must understand the Unicode character encoding—specifically, issues related to composed character sequences.” However, in the same paragraph the guide makes this statement: “A string object presents itself as an array of Unicode characters. You can determine how many characters it contains with the length method and can retrieve a specific character with the characterAtIndex: method,” which is simply wrong. It might have been correct in the days before surrogate pairs, but today it is wrong.

Incidentally, the best description of Unicode in the Apple docs does not appear in any programming guide, but in Tech Note 2078 (http://developer.apple.com/technotes/tn2002/tn2078.html#UNICODENOTES) dealing with FSRefs.

Might I suggest that the String Programming Guide For Cocoa contain a short section on dealing with Unicode at the unichar level (avoiding the term “Unicode character,” which IMHO has been meaningless since the introduction of surrogate pairs), and that the description of characterAtIndex: refer the reader to that discussion?

Alternatively, at the very least, the description of characterAtIndex: should contain a brief note referring the developer to rangeOfComposedCharacterSequenceAtIndex:, which is generally more appropriate for dealing with Unicode.
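For readers wondering what a composed-character-sequence lookup involves, here is a rough Python sketch. It is not Apple's algorithm, and all function names are hypothetical. It assumes a sequence is a base code point (possibly a surrogate pair) followed by code points with a non-zero canonical combining class, which is a simplification: it ignores other grapheme-joining cases that a real implementation would have to handle.

```python
import unicodedata

def to_utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def is_high(u):
    return 0xD800 <= u <= 0xDBFF

def is_low(u):
    return 0xDC00 <= u <= 0xDFFF

def decode_at(units, i):
    """Return (code_point, start, length) for the code point covering index i."""
    if is_low(units[i]) and i > 0 and is_high(units[i - 1]):
        i -= 1  # index landed on the second half of a surrogate pair
    u = units[i]
    if is_high(u) and i + 1 < len(units) and is_low(units[i + 1]):
        cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
        return cp, i, 2
    return u, i, 1

def is_combining(cp):
    # Simplifying assumption: only non-zero canonical combining class counts.
    return unicodedata.combining(chr(cp)) != 0

def composed_range(units, index):
    """Return (location, length), in code units, of the sequence covering index."""
    cp, start, length = decode_at(units, index)
    # Walk backwards from a combining mark to its base code point.
    while is_combining(cp) and start > 0:
        cp, start, prev_len = decode_at(units, start - 1)
        length += prev_len
    # Walk forwards over any trailing combining marks.
    end = start + length
    while end < len(units):
        next_cp, _, next_len = decode_at(units, end)
        if not is_combining(next_cp):
            break
        end += next_len
    return start, end - start

units = to_utf16_units("e\u0301\U0001D11Ex")  # e + acute, G clef, x
print(composed_range(units, 1))  # (0, 2): the accent belongs with the "e"
print(composed_range(units, 3))  # (2, 2): low surrogate belongs with its high half
```

The point of the sketch is only that the answer to "which character is at this index?" is a range of code units, not a single unichar.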
