> Unicode defines text as a sequence of code points. Does it? Do you have a link...

hgs3 · on June 2, 2023

Review Section 2.2, Unicode Design Principles, in the Unicode Standard: "Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes."

Text elements are an abstract concept whose definition depends upon what is being processed. It might be a grapheme, it might be a word, etc.

arcticbull · on June 2, 2023

There might be something a little imprecise here: code points vs code units vs character codes.

I'm open to being wrong but I would be very surprised if they defined text as a "series of code units" the count of which can vary by encoding even for the same character. IMO in this context 'character codes' would likely be far more consistent with 'code points' and they're just trying to differentiate between styled and un-styled text. Whereas the 1.3 definition appears to be trying to make an authoritative definition of 'text.'

If we read 2.2's "character codes" as code points, then a text element can be multiple code points, as referenced in 1.3.

[edit] I originally flipped 'units' and 'codes' - cleaned it up.

hgs3 · on June 2, 2023

"Character code" is short for "character code point" or just code point. All Unicode algorithms and properties are defined in terms of the code point. UTF encodings are just a way of encoding a code point. From Unicode's perspective, you care about what is encoded (i.e. the code point) and not how it is encoded (i.e. UTF-8).
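To make the "what is encoded vs. how it is encoded" point concrete, here is a minimal Python sketch. The character chosen (U+1F926 FACE PALM) and the helper name `code_unit_count` are my own illustrative picks, not anything from the standard. It shows one code point taking a different number of code units in each UTF, while decoding always gives back the same code point.

```python
# Illustrative only: one code point, three encoding forms.
ch = "\U0001F926"  # FACE PALM, a single code point outside the BMP


def code_unit_count(text: str, encoding: str, unit_size: int) -> int:
    """Hypothetical helper: code units `text` occupies in `encoding`.

    `unit_size` is the width of one code unit in bytes
    (1 for UTF-8, 2 for UTF-16, 4 for UTF-32).
    """
    return len(text.encode(encoding)) // unit_size


# The "-le" variants are used so no byte order mark is prepended.
utf8_units = code_unit_count(ch, "utf-8", 1)
utf16_units = code_unit_count(ch, "utf-16-le", 2)
utf32_units = code_unit_count(ch, "utf-32-le", 4)

# Round-tripping through any of the encodings yields the same code point.
decoded = {ch.encode(enc).decode(enc) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

print(utf8_units, utf16_units, utf32_units)  # 4 2 1
print(len(decoded), hex(ord(ch)))            # 1 0x1f926
```

The code unit counts differ per encoding, but the code point is the same in every case, which is why the algorithms and properties are specified at that level.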

Unicode is one of the most poorly understood topics. I think the confusion stems from 1. most programming languages getting the abstraction wrong, and 2. programmers trying to reconcile their non-technical interpretation of what "character" means.

arcticbull · on June 2, 2023

I agree with everything you said. I think I'm just trying to reconcile that with the top of the thread saying Python was the most correct because it was returning '7 code points' and that 'UTF-whatever is an implementation detail.'

But 7 is not the number of code points/USVs - that's the number of UTF-16 code units. The string is 5 USVs. If UTF-whatever is an implementation detail, wouldn't the correct answer to length be 5?

What am I missing haha.

hgs3 · on June 2, 2023

Python does return 5. JavaScript returns 7. Python is returning the number of code points, JavaScript is returning the number of UTF-16 code units.
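A quick way to check this in Python. I'm assuming a facepalm emoji sequence with a skin tone modifier and gender sign here, since it has the same counts being discussed; the exact string from the top of the thread isn't quoted above, so treat `s` as a stand-in.

```python
# Stand-in string (an assumption): FACE PALM + skin tone modifier
# + ZERO WIDTH JOINER + MALE SIGN + VARIATION SELECTOR-16.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

# Python's len() on a str counts code points.
code_points = len(s)

# What JavaScript's .length reports: UTF-16 code units.
# Each code unit is 2 bytes; "-le" avoids a byte order mark.
utf16_code_units = len(s.encode("utf-16-le")) // 2

# For comparison, the size in UTF-8 bytes.
utf8_bytes = len(s.encode("utf-8"))

print(code_points, utf16_code_units, utf8_bytes)  # 5 7 17
```

The first two code points are outside the BMP, so each needs a surrogate pair in UTF-16. That accounts for the two extra code units.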

arcticbull · on June 2, 2023

There's my mistake. Thank you. Flipped them in my head, it's been a long day.