
> Unicode defines text as a sequence of code points.

Does it? Do you have a link?

[edit] I looked up the spec and here is what it says.

> The Unicode Standard does not define what is and is not a text element in different processes; instead, it defines elements called encoded characters. An encoded character is represented by a number from 0 to 10FFFF_16, called a code point. A text element, in turn, is represented by a sequence of one or more encoded characters. [1]

So Unicode seems to explicitly avoid defining 'text' as a sequence of code points. Instead it describes a more nebulous sequence of text elements, each an aggregation of one or more code points. A text element is probably closest to a grapheme cluster, but they seem to want to avoid pinning it down.

[1] https://www.unicode.org/versions/Unicode15.0.0/UnicodeStanda... p. 7 (1.3 - Text Handling), PDF page 33.
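To make "a text element is represented by a sequence of one or more encoded characters" concrete, here is a minimal Python sketch. The choice of "é" is mine for illustration and does not come from the spec passage; it assumes Python 3, where `len` on a `str` counts code points.

```python
import unicodedata

# The same user-perceived character, written two ways (illustrative choice).
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE: one code point
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT: two code points

composed_len = len(composed)       # number of code points in the composed form
decomposed_len = len(decomposed)   # number of code points in the decomposed form

# The two strings differ as code point sequences...
differ_as_sequences = composed != decomposed
# ...but NFC normalization maps the decomposed form onto the composed one.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(composed_len, decomposed_len, differ_as_sequences, same_after_nfc)
```

Both strings render as one "character" to a reader, yet they contain one and two code points respectively, which is why a count of code points is not a count of text elements.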



Review chapter 2.2 Unicode Design Principles in the Unicode Standard: "Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes."

Text elements are an abstract concept whose definition depends upon what is being processed. It might be a grapheme, it might be a word, etc.


There might be something a little imprecise here: code points vs code units vs character codes.

I'm open to being wrong but I would be very surprised if they defined text as a "series of code units" the count of which can vary by encoding even for the same character. IMO in this context 'character codes' would likely be far more consistent with 'code points' and they're just trying to differentiate between styled and un-styled text. Whereas the 1.3 definition appears to be trying to make an authoritative definition of 'text.'

If we read 2.2's "character codes" as code points, then a text element can still be multiple code points, as described in 1.3.

[edit] I originally flipped 'units' and 'codes' - cleaned it up.


"Character code" is short for "character code point" or just code point. All Unicode algorithms and properties are defined in terms of the code point. UTF encodings are just a way of encoding a code point. From Unicode's perspective, you care about what is encoded (i.e. the code point) and not how it is encoded (i.e. UTF-8).
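A small Python sketch of the "what is encoded vs. how it is encoded" point. The code point U+20AC (EURO SIGN) is an arbitrary pick for illustration, and the little-endian codec names are used only to keep a byte order mark out of the byte counts.

```python
# One code point, three encodings (U+20AC is an arbitrary illustrative choice).
code_point = 0x20AC
ch = chr(code_point)

encodings = ("utf-8", "utf-16-le", "utf-32-le")

# The number of bytes depends on the encoding form...
byte_lengths = {enc: len(ch.encode(enc)) for enc in encodings}

# ...but decoding any of them yields the same single code point.
round_trips = all(
    ord(ch.encode(enc).decode(enc)) == code_point for enc in encodings
)

print(byte_lengths, round_trips)
```

The byte counts vary with the encoding, while the decoded code point does not, which is the sense in which the UTF forms are just serializations of the code point.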

Unicode is one of the most poorly understood topics. I think the confusion stems from 1. most programming languages getting the abstraction wrong, and 2. programmers trying to reconcile their non-technical interpretation of what "character" means.


I agree with everything you said. I think I'm just trying to reconcile that with the top of the thread saying Python was the most correct because it was returning '7 code points' and that 'UTF-whatever is an implementation detail'.

But 7 is not the number of code points/USVs - that's the number of UTF-16 code units. The string is 5 USVs. If UTF-whatever is an implementation detail, wouldn't the correct answer to length be 5?

What am I missing haha.


Python does return 5. JavaScript returns 7. Python is returning the number of code points, JavaScript is returning the number of UTF-16 code units.
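For reference, a sketch of the three counts in Python. The thread does not quote the string itself, so I am assuming it is the facepalm emoji sequence (U+1F926 U+1F3FC U+200D U+2642 U+FE0F), which matches the 5 and 7 figures above; the UTF-16 count is computed from the encoded bytes as a stand-in for what JavaScript's `.length` reports.

```python
# Assumed example string (not quoted in the thread): facepalm + skin tone
# modifier + zero width joiner + male sign + variation selector-16.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

# Python 3 str length counts code points.
code_points = len(s)

# UTF-16 code units: two bytes per unit, little-endian codec avoids a BOM.
utf16_units = len(s.encode("utf-16-le")) // 2

# UTF-8 code units are single bytes.
utf8_bytes = len(s.encode("utf-8"))

print(code_points, utf16_units, utf8_bytes)
```

The two astral code points each take a surrogate pair in UTF-16, which accounts for the difference between 5 and 7.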


There's my mistake. Thank you. Flipped them in my head, it's been a long day.



