You are right. Unicode in UTF-8 will have variable char length. Plain ASCII will...

jsjohnst · on Aug 21, 2022

> Porting that from ASCII to UTF-32 is just dividing the length in bytes by 4. For UTF-8, you'd have to iterate over each character and figure out if there is a combination of bytes that collapse into a single character.

You have to do that with UTF-32 as well. Yes, every codepoint is 32 bits, but every character can be made up of one or more codepoints.

alganet · on Aug 21, 2022

What?

U+004D in UTF-8 is 4D, in UTF-32 is 0000004D

U+0430 in UTF-8 is D0 B0, in UTF-32 is 00000430

I took this examples from page 102 of The Unicode Standard Version 5.0.

These leading zeros on the UTF-32 are what is omitted from UTF-8. One of the encodings is optimized for space, the other for processing speed.

I am no expert on unicode though, there might be something I'm missing (ligatures, this kind of stuff). I would gladly accept a more in depth expalanation on why I am wrong on this.

bzxcvbn · on Aug 21, 2022

The letter à can be represented in two different ways in Unicode:

* As the single code point U+00E0, which encodes 'à'.

* Or as the sequence of two code points U+0061 U+0300, which respectively encode the Latin letter 'a' and the combining acute accent (it's hard to display as a standalone character, so go e.g. here: https://www.compart.com/fr/unicode/U+0300). These two code points get combined into a single grapheme cluster, the technical name for what most people consider to be a character, that displays as 'à'.

As you can see, there is no difference in the visual representation of à and à. But if you inspect the string (in python or whatever) then you'll see that one of them has one code point (one "char"), while the second has two. If you're on Windows, an easy way is to type it into pwsh:

    PS> 'à'.EnumerateRunes() | % { "U+{0:X4}" -f $_.Value }
    U+0061
    U+0300
    PS> 'à'.EnumerateRunes() | % { "U+{0:X4}" -f $_.Value }
    U+00E0

alganet · on Aug 21, 2022

The character "à" as a single byte E0, as far as I know, is not UTF-8. This is the 224th character of the extended ASCII table (also ISO 8859). It's another system: https://en.wikipedia.org/wiki/Extended_ASCII

This representation is incompatible with UTF-8, which marks the octet series Cx, Dx, Ex and Fx as leading bytes. You can see the table here: https://en.wikipedia.org/wiki/UTF-8#Codepage_layout

What I believe Kernighan _wants_, is to reuse a lot of code that was designed for ASCII. Code that treats 1 char as 1 byte. In order to do that, he is going to encode each unicode character using 4 of whatever type he was previously using before, zeroing the "pad" bytes, which is exactly what UTF-32 does. This way, he doesn't have to fundamentally change all the types he is already working with. (ps: I looked at the commit after I wrote this, he doesn't seem to be doing what I suggested).

Your PS snippet failed for me "Method invocation failed because [System.String] does not contain a method named 'EnumerateRunes'."

Dit it in bash, which supports UTF-8 and lets you explicitly set the encoding:

    $ LC_ALL=C bash -c 'x=à; echo ${#x}'
    2
    $ LC_ALL=en_US.UTF-8 bash -c 'x=à; echo ${#x}'
    1

These snippets count the number of chars according to the selected encoding.

If I had UTF-32 locales in my machine, "à" would appear as 4. Unfortunatelly I don't know any system that implements UTF-32 to display its behavior.

Let's now reconstruct "à" in UTF-8, telling bash to use only ASCII:

    $ LC_ALL=C bash -c 'printf %b \\xC3\\xA0'
    à

I've tunneled UTF-8 through an ASCII program, and because my terminal accepts UTF-8, everything is fine. Didn't even had to use \u escapes. This backwards compatibility is by design, and chosing between UTF-8, 16 and 32 plays a role in how you're going to deal with these byte sequences.

bzxcvbn · on Aug 21, 2022

> The character "à" as a single byte E0, as far as I know, is not UTF-8.

It's not, but I wrote "Unicode" and "code point", not "UTF-8" and "code unit". It sounds like you don't know the difference, which makes this whole discussion a bit silly and probably a bit pointless. One of the best explanations I've read is this one: https://docs.microsoft.com/en-us/dotnet/standard/base-types/... and especially the section on grapheme clusters.

> Your PS snippet failed for me "Method invocation failed because [System.String] does not contain a method named 'EnumerateRunes'."

Sounds like you're using the outdated windows powershell, not pwsh.

alganet · on Aug 21, 2022

I'm talking about how many bytes it takes to store characters in these encodings, and how the lengh of these sequences changes according to each standard.

You are talking about another level of abstraction. These UTF-16 builtins are first-class citizens in .NET, and you never have to deal with the raw structure behind all of it unless you want to. C does not have these primitives, a string there is composed of smokes and mirrors using byte arrays.

In summary, you are talking about using a complete, black box Unicode implementation. I'm talking about the tradeoffs one might encounter when crafting stuff that goes inside one of these black boxes, and it's different from the black box you know.

bzxcvbn · on Aug 21, 2022

Okay, let's back up a little. You were replying about a comment that said:

> Yes, every codepoint is 32 bits, but every character can be made up of one or more codepoints.

And you expressed surprise and incredulity at the idea. I explained that a "character" (which, given the context, is to be taken as a "grapheme") can, indeed, take up more than one code point. This has nothing to do with UTF-32, UTF-8, or .NET's implementation. It's inherent to how Unicode works.

As an even more obvious example, try to figure out how many bytes are in the character ȃ encoded as UTF-32, using whatever method you'd like. (I say "more obvious" because you'll be forced to copy-paste the character I've used in your code editor.) Other examples include emojis such as this one: https://emojipedia.org/woman-golfing/ (that I can't use in the comment box for some reason).

Then for some reason you felt compelled to offer a rebuttal that showed you didn't understand the difference between a grapheme and a code point. So I just linked some doc. I'm not attacking you.

alganet · on Aug 21, 2022

That reply is not my first interaction on the thread. I guess you didn't see the previous one, so perhaps you are missing context. Here's the link: https://news.ycombinator.com/item?id=32535499

"ȃ" uses two bytes in UTF-8, 2 bytes in UTF-16 and 4 bytes in UTF-32. The woman golfing uses 4 bytes in UTF-8, UTF-16 and UTF-32.

My surprise was at the idea of having to look up a variable length sequence of bytes in a string stored as UTF-32. To me, UTF-32 exists exactly for the use case of "I don't want to calculate variable length" (optimizing for speed), and UTF-8 exists for "I don't want to use more bytes than I need" (optimizing for space).

I never used the term "codepoint" in this entire thread until now. I was originally talking about storing bytes, hence my little discomfort of not being able to communicate what I'm trying to say. Not feeling attacked, just incapable of expressing, which might have come off wrong. Sorry about that.

BTW, there are many things that I don't understand about Unicode. I never read the full standard. I don't know enough about a grapheme to tell whether it is something that impacts or not the number of bytes when storing in UTF-32.

Why we are talking about UTF-8 versus UTF-32 and not Unicode as a whole? Because there is a code comment about these two in the commit linked by the OP, which sparked this particular subthread.

tzot · on Aug 22, 2022

> "ȃ" uses two bytes in UTF-8

"â" uses two bytes when encoded in UTF-8, while "ȃ" (which was what bzxcvbn supplied as an example and you pasted in the quoted section) uses three bytes when encoded in UTF-8.

  >>> s="ȃ"
  >>> len(s)
  2
  >>> s.encode('utf8')
  b'a\xcc\x91'
  >>> import unicodedata as ud
  >>> [ud.name(c) for c in s]
  ['LATIN SMALL LETTER A', 'COMBINING INVERTED BREVE']

bzxcvbn · on Aug 21, 2022

> The woman golfing uses 4 bytes in UTF-8, UTF-16 and UTF-32.

No. It's made up of 5 code points. Each of these takes 32 bits, or 4 bytes, in UTF32. So that emoji, which is a single grapheme, uses 20 bytes in UTF32. Once again, just try it yourself, encode a string containing that single emoji in UTF32 using your favorite programming language and count the bytes!

alganet · on Aug 21, 2022

Let's go to the source:

https://www.unicode.org/faq/utf_bom.html

> Q: Should I use UTF-32 (or UCS-4) for storing Unicode strings in memory? > > This depends. It may seem compelling to use UTF-32 as your internal string format because it uses one code unit per code point. (...)

It also confirms what I said about implementation vs interface (or storage vs access, whatever):

> Q: How about using UTF-32 interfaces in my APIs? > > Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. (...) While a UTF-32 representation does make the programming model somewhat simpler, the increased average storage size has real drawbacks, making a complete transition to UTF-32 less compelling

Finally:

> Q: What is UTF-32? > > Any Unicode character can be represented as a single 32-bit unit in UTF-32. This single 4 code unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character.

So, it does in fact reduce the complexity of implementing it _for storage_, as I suspected. And there is a tradeoff, as I mentioned. And the Unicode documentation explicitly separates the interface from the storage side of things.

That's good enough for me. I mentioned before there might be some edge cases like ligatures, and you came up with a zero-width joiner example. None of this changes these fundamental properties of UTF-32 though.

jsjohnst · on Aug 22, 2022

Reading this thread, I feel bad for the person you’re arguing with. It’s clear you are in the “knowledgeable enough to be dangerous” stage and no amount of trying to guide you will sway you from your mistaken belief that you are right.

Now to try one last time, you are misreading the spec and not understanding important concepts. Take the “woman golfing” emoji as an example. That emoji is not a Unicode “character” and is part of why it can’t be represented by a single UTF32. That emoji is a grapheme which combines multiple “characters” together with a zero width joiner, “person golfing” and “female” in this case. Rather than have a single “character” for every supported representation of gender and skin color, modern emoji use ZWJ sequences instead, which means yes, something you incorrectly think is a “character” can in fact take up more than 4 bytes in UTF32.

alganet · on Aug 23, 2022

I am reading the spec, discussing online and trying to understand the subject better, what is wrong with that?

I said I might be wrong _multiple times_, and its genuine. I'm glad you appeared with an in-depth explanation that proves me wrong. I asked exactly for that.

The first examples in this thread are not zero-width-joiners (á, â) or complex graphemes. They all could be stored in 4 bytes. It took some time to come up with the woman golfing example.

By the way, one can still implement reading "character" by "character" in sequences of 4 UTF-32 bytes, and decide to abstract the grapheme units on top of that. It still saves a lot of leading byte checks.

Maybe someone else learned a little bit reading through the whole thing as well. If you are afraid I'm framing the subject with the wrong facts, this is me assuming, once again, that I never claimed to be a Unicode expert. I don't regret a single thing, I love being actually proven wrong.

jsjohnst · on Aug 23, 2022

> I am reading the spec, discussing online and trying to understand the subject better, what is wrong with that? I said I might be wrong _multiple times_, and its genuine.

I encourage you to reread the comment I replied to again and see if it has the tone of someone “trying to learn” or rather something different.

> I don't regret a single thing, I love being actually proven wrong.

Me too!

alganet · on Aug 23, 2022

> I encourage you to reread the comment I replied to again and see if it has the tone of someone “trying to learn” or rather something different.

To me it sounds ok. I'm not an english native speaker though, so there is a handicap on my side on tone, vocabulary and phrasing.

My intent was to admit I had wrong assumptions. At some point before this whole thread, I _really_ believed all graphemes (which, in my mind, were "character combinations") could be stored in just 4 bytes. I was aware of combining characters, I just assumed all of them could fit in 32bit. You folks taught me that it can't.

However, there's another subject we're dealing with here as well. Storing these characters at a lower level, whether they form graphemes or not at a higher level of abstraction.

The fact that I was wrong about graphemes, directly impacts the _example_ that I gave about the length of a string, but not the _fundamental principle_ of UTF-32 I was trying to convey (that you don't need to count lead bytes at the character level). Can we agree on that? If we can't, I need a more in-depth explanation on _why_ regarding this as well, and if given, that would mean I am wrong once again, and if that happens, I am fine, but it hasn't yet.

jsjohnst · on Aug 23, 2022

> The fact that I was wrong about graphemes, directly impacts the _example_ that I gave about the length of a string, but not the _fundamental principle_ of UTF-32 I was trying to convey (that you don't need to count lead bytes at the character level). Can we agree on that? If we can't, I need a more in-depth explanation on _why_ regarding this as well, and if given, that would mean I am wrong once again, and if that happens, I am fine, but it hasn't yet.

As I said in the other thread, to try to minimize confusion, consider Character and Grapheme as synonymous. They are made up of one or more codepoints. The world you’ve made up that everything is characters and they all take exactly four bytes in UTF32 is just wrong. Yes, many graphemes are a single codepoint, so yes, they are 4 bytes in UTF32, but not ALL (and it’s not just emoji’s to blame).

If what you’re on about is the leading zeros, yes, they don’t matter individually. Unicode by rule is limited to 21bits to represent all codepoints so the 11 bits left as leading zeros are wasted, which is why folks don’t use UTF32 typically as it’s the least efficient storage wise and doesn’t have really any advantage over UTF-16 outside easy counting of codepoints (but again, codepoints aren’t characters).

alganet · on Aug 23, 2022

I'm talking more about C (or any low level stuff) than Unicode now. The world I'm using as a reference has only byte arrays, smokes and mirrors.

I'm constantly pointing you _to the awk codebase_. It's a relevant context, the title of the post and it matters. Can you please stop ignoring this? There's no Unicode library there, it's an implementation from scratch.

If you are doing it from scratch, there's a part of the code that will deal with raw bytes, way before they are recognized as Unicode things (whatever they might be).

Ever since this entire post was created, the main context was always this: an implementation from scratch, in an environment that does not have unicode primitives as a first-class citizens. Your string functions don't have \u escapes, YOU are doing the string functions that support \u.

jsjohnst · on Aug 23, 2022

Ok, now I know you’re just trolling. Enjoy and goodbye!

alganet · on Aug 24, 2022

I'm not kidding. Just look at the code:

https://github.com/onetrueawk/awk/commit/d322b2b5fc16484affb...

I am talking about that environment. You are not.

jsjohnst · on Aug 23, 2022

To make the point more clear, the female golfing Unicode “character” is encoded as follows in various UTFs:

UTF16 (12 bytes total):

\ud83c\udfcc\ufe0f\u200d\u2640\ufe0f

UTF32 (20 bytes total):

u+0001f3ccu+0000fe0fu+0000200du+00002640u+0000fe0f

UTF8 (16 bytes total):

\xf0\x9f\x8f\x8c\xef\xb8\x8f\xe2\x80\x8d\xe2\x99\x80\xef\xb8\x8f

alganet · on Aug 23, 2022

You are still presenting me with abstractly encoded data. \u and u+ are in a higher level of abstraction. The only raw bytes I am seeing here are in the UTF-8 string, which you decided to serialize as hexadecimals (did you had a choice? why?).

If you had all of these expressed as pure hexadecimals (or octals, or any single-byte unit), how would they be serialized?

Then, once all of them are just hexadecimals, how would you go about parsing one of these sequences of hexadecimals into _characters_? (each hexa representing a raw byte, just like in the awk codebase we are talking about)

Another question: do you need first to parse the raw bytes into characters before recognizing graphemes, or can you do it all at once for both variable-length and fixed-length encodings?

jsjohnst · on Aug 23, 2022

> You are still presenting me with abstractly encoded data.

That’s the actual encoding for that grapheme as specified by the spec for UTF8, UTF16, and UTF32.

> \u and u+ are in a higher level of abstraction.

No, it’s not, it’s how you write escaped 16bit and 32bit hexadecimal strings for UTF-16 and UTF-32 respectively. Notice there’s 4 hex characters after \u and 8 hex after u+. Those are the “raw bytes” in hex.

> The only raw bytes I am seeing here are in the UTF-8 string, which you decided to serialize as hexadecimals

All three forms are “raw bytes” in hex form. \x is how you represent an escaped 8 bit UTF-8 byte in hex.

> Another question: do you need first to parse the raw bytes into characters before recognizing graphemes, or can you do it all at once for both variable-length and fixed-length encodings?

You need to “parse” (more like read for UTF16 and UTF32, as there’s not much actual parsing outside byte order handling) the raw bytes into codepoints. To try to minimize confusion, consider Character and Grapheme as synonymous. They are made up of one or more codepoints. It really doesn’t matter if it’s variable length or fixed length, you still have to get the codepoints before you can determine character/graphemes.

alganet · on Aug 23, 2022

I am asking you "how do you purify water"? And you're holding a bottle of Fiji and telling me "look, it's simple".

You're absolutely right about what is a character and what is a grapheme. I already said that, this subject is done. You're right, no need to come back to it. You win, I already yielded several comments ago.

Now, to the other subject: I would very much prefer if we talked only about bytes. Yes, talking only about bytes makes things harder. Somewhere down the line, there must be an implementation that deals with the byte sequence. I'm talking about these, just above the assembler (for awk). There IS confusion at this level, no way to avoid it except abstracting it by yourself, byte by byte (or, 4 bytes by 4 bytes in UTF-32).

adambatkin · on Aug 21, 2022

Take a step back and ask "why do I need to know the length of this thing?" If it's so you know how much storage to allocate, then the process/time is the same for both (the answer is: however many bytes you happen to have). If your array is bigger than the number of valid characters (code points) and you need to search through it to find the "end" (last valid character), you can do that with almost identical complexity (you don't actually need to iterate over ever byte and Code Point with UTF-8 because of how elegantly the encoding was designed).

Why else might you need to know the length? If it's to know how much space to allocate in a GUI (or even on a console) then neither encoding is going to help.

Maybe it's because of some arbitrary limitation like "your name must be less than 50 characters" and I'll just say that if that's the case, you are doing it wrong (if you need to limit it for storage/efficiency purposes, fine, but you will probably be better off limiting by bytes and using UTF-8 since most people will be able to squeeze in more of their names).

I'm not saying there aren't reasons for needing to know the "length" (number of Code Points) of a string, and certainly many existing algorithms are written in a way that they assume that calculating string length and being able index arbitrarily into the middle of a string are fast (O( 1 ) for indexing) but in reality, for almost any real world problem beyond "how much storage do I need" almost everything you need to do actually requires iterating over a string one Code Point at a time (which is O( n ) for both, with the biggest difference being that UTF-8 may require more branching, but also it's common enough that in many cases between vectorization and just generally better optimizations because of it's popularity, UTF-8 will do just fine while usually using less storage, which can significantly benefit CPU cache locality).

alganet · on Aug 21, 2022

It needs the length for operations such as substring, or to apply length modifiers on regular expressions (such as \w{3,5}), which is a common thing in awk programs.

In fact, the return value of the u8_rune as implemented in the branch we are discussing (https://github.com/onetrueawk/awk/compare/unicode-support) returns a length to be used as an offset later.

This is not me saying, it's the author. There is a code comment there:

> For most of Awk, utf-8 strings just "work", since they look like null-terminated sequences of 8-bit bytes. Functions like length(), index(), and substr() have to operate in units of utf-8 characters. The u8_* functions in run.c handle this.

I know there might be different ways of doing it, but we're talking about a specific implementation.

I was wrong to assume he is storing stuff in UTF-32. He could have, but there was already code in place there to make the UTF-8 storage easier to implement.