
length is not ambiguous at all. It's the number of elements in the array. A string in Python 3 is an array of Unicode code points, so the length of a string is the number of Unicode code points. If you want the number of bytes, you need to encode the string with a Unicode encoding (UTF-8, UTF-16, or UTF-32) to get a bytes object, which is an array of bytes. Then you can get the length of that.

Remember, one of the big accomplishments (breaking changes) of Python 3 is that all strings are Unicode, not byte arrays. If you want to view a string as bytes, you need to convert the string to bytes. But note the number of bytes depends on the encoding you use (UTF-8, …).
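A quick sketch of the distinction, using nothing beyond the standard str/bytes API:

```python
s = "café"                         # four Unicode code points
print(len(s))                      # 4 -- code points, not bytes
print(len(s.encode("utf-8")))      # 5 -- 'é' takes two bytes in UTF-8
print(len(s.encode("utf-16-le")))  # 8 -- every BMP code point takes two bytes
```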



Interestingly, the number of Unicode codepoints is probably the only measure of a string that is unlikely to ever be relevant to anyone in practice except when it happens to coincide with a different measure.

It can't be used to determine length in bytes (important for storage or network transmission), it can't be used to determine number of displayed characters, it can't be used to safely split a string at some position.
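The unsafe-split case is easy to demonstrate in Python; the combining-character example below is just one of many ways it goes wrong:

```python
s = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT; renders as one character
print(len(s))   # 2 -- two code points, but one displayed character
print(s[:1])    # 'e' -- splitting after code point 1 silently drops the accent
```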

The only reason it has caught on is that it is easy to encode into UTF-8 and UTF-16, and that anything more interesting generally requires a language context and even a font.

I hope that future languages will get rid of this single string abstraction, and instead offer two completely separate types:

- symbol strings, which would only be usable for programming purposes and should probably be limited to ASCII

- text strings, which would be intended for human display purposes, with full Unicode support, and have APIs which answer things like "in the specified culture, what is the length in human-recognizable characters of this string" or "what is the seventh human-recognizable character in this string in the specified culture"

There's no reason to pay the conceptual cost of Unicode for representing field names or enums (and yes, I don't believe supporting Unicode identifiers is a good idea for a programming language; and note that I am not a native English speaker, and while I do use an alphabet, ASCII is missing some of the letters and symbols I use in my native Romanian). And there's no reason to settle for the misleading safety of Unicode code points when trying to process human-displayable text.


The length of an array should correspond to the number of elements. Since each element is a code point, it's the most relevant number if you intend to operate on individual elements. That is, the maximum index corresponds to the length of the array.

If you care about the number of bytes, or to operate on individual bytes, then convert to utf-8,16 or 32, and operate on the bytes object. If you wish to operate on grapheme clusters, then you could probably find some 3rd party Python library that allows you to represent and operate on strings in terms of grapheme clusters.
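A very rough sketch of the grapheme-cluster idea in plain Python. This only approximates UAX #29 segmentation by attaching combining marks to the preceding code point; a real library (such as the third-party grapheme package) handles the many cases this misses, like ZWJ emoji sequences and Hangul jamo:

```python
import unicodedata

def approx_graphemes(s):
    """Rough grapheme grouping: attach combining marks to the
    preceding code point. Not full UAX #29 -- illustration only."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(approx_graphemes("the\u0301"))  # three clusters from four code points
```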


A string is not an array, it is a chunk of text, for the vast majority of uses of strings. Exactly how that chunk of text is represented in memory and what API it should expose is the discussion we're having. My point is that it shouldn't be exposed as an array of codepoints, since array operations (lengths, indexing, taking a range) are not a very useful way of manipulating text; and even if we did expose them as an array, Unicode code points are definitely not a useful data structure for almost any purpose.

There are basically only two things that can be done with a Unicode codepoint: encode it in bytes for storage, or transform it to a glyph in a particular font or culture.

You can't even compare two sequences of Unicode codepoints for equality in many cases, since there are different ways to represent the same text with Unicode. For example the strings "thá" and "thá" are different in terms of codepoints, but most people would expect to find the second when typing in the first. Even worse, there are codepoints which are supposed to represent different characters, depending on the font being used / the locale of the display (the same Unicode codepoints are used to represent related Chinese, Japanese, or Korean characters, even when these characters are not identical between the three cultures).
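The equality problem is reproducible with the standard unicodedata module; the two spellings below are the decomposed and precomposed forms of the same text:

```python
import unicodedata

a = "tha\u0301"  # 'a' + combining acute accent (decomposed, NFD-style)
b = "th\u00e1"   # precomposed 'á' (NFC)
print(a == b)                                # False -- different code point sequences
print(unicodedata.normalize("NFC", a) == b)  # True -- same text once normalized
```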


Splitting into ASCII-only and Unicode would be more of a regression than a progression. And yes, the “I’m not a native speaker” is a typical pre-emptive reply, as if it matters (neither am I—doesn’t mean anything by itself).


Let me give an example of why I don't think a single unified string API works. When doing (stringA == stringB), what do you expect to get as a result? Do you expect it to tell you whether the two strings represent the exact same code points, or whether they represent the same Unicode grapheme clusters, as Unicode recommends?

The answer is of course both, depending on context. You certainly don't want a fuzzy match when, say, decoding a protobuf, but you also don't want a codepoint match when looking up user input.

What most modern languages have settled on is having a Unicode codepoint array type, typically called string or text, and an array of bytes type. However, common string operations are often only provided for the text type, and not the bytes type - which becomes very annoying when doing low level work and using bytes for text, and hoping for simple text operations.
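Python is a partial example of this split: its bytes type keeps some str-like operations but drops the Unicode-aware ones, as a quick check shows:

```python
b = b"Hello, World"
print(b.split(b", "))          # [b'Hello', b'World'] -- split works on bytes too
print(b.upper())               # b'HELLO, WORLD' -- ASCII-only case mapping
print(hasattr(b, "casefold"))  # False -- Unicode-aware methods are str-only
```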


Exactly this. People conflate Unicode with encoding quite a bit. I think it was Plan 9 and early Go that used "runes" as a unit, where one or more runes formed a character and an array of runes could be encoded into bytes using a given encoding.

The in-memory size of a rune was just an implementation detail, and while it could be important for the programmer that the size of a rune was 2 bytes, this didn't mean the length of an array of 2 runes was 4.

I always liked the rune unit, and while my memory is hazy I think runes were just code points.

I think part of the issue is programmers and APIs mixing up units: bits for the in-memory representation of a conceptual value mapping (Unicode), conceptual characters, stored size when encoded, and so on... without firming up those abstractions with interfaces. It gets lossy.


FWIW, CPython uses one of several Unicode string implementation representations, depending on the code points involved:

  >>> import sys
  >>> s = 'A' * 1000
  >>> len(s)
  1000
  >>> sys.getsizeof(s)
  1049
  >>> s = '\N{SNOWMAN WITHOUT SNOW}' * 1000
  >>> len(s)
  1000
  >>> sys.getsizeof(s)
  2074
  >>> s = '\N{MUSICAL SYMBOL G CLEF}' * 1000
  >>> len(s)
  1000
  >>> sys.getsizeof(s)
  4076
See https://peps.python.org/pep-0393/ . Mentioned in the linked-to article with "CPython since 3.3 makes the same idea three-level with code point semantics".


>length is not ambiguous at all. It's the number of elements in the array

That's because you defined it first as "the number of elements in the array".

It is ambiguous however because that's not how people understand it when it comes to strings, and there are several counter-intuitive ways they expect it to behave.

Not to mention there might not be any "array". A string (whatever the encoding / representation) is a chunk of memory, not an array. That you can often use a method to traverse it doesn't mean it's in an array.


The Python docs say "str" is an immutable sequence of Unicode code points. Since it implements __getitem__, it's fair to call it an array (it has a length, and allows indexing). I couldn't find in the documentation whether __getitem__ is O(1), which I consider a deficiency -- this should definitely be well documented.

It doesn't really matter how some people think "how people understand" something, the documentation matters. Any string in any language is some ordered sequence of atomic text-like objects, so python's approach isn't unreasonable or unexpected, either.
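For what it's worth, indexing behaves exactly as the sequence protocol suggests (and under CPython's PEP 393 fixed-width storage it happens to be O(1), though the language documentation doesn't promise that):

```python
s = "pythön"
print(len(s))  # 6
print(s[4])    # 'ö' -- direct indexing by code point
```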


>Since it implements __getitem__, it's fair to call it an array (it has a length, and allows indexing)

Well, weren't we talking about things being "ambiguous"?

In Python we call what you describe a list. An array is something different. And people would expect something like the C (or the Java) data structure. In Python that would match the "array" lib package.
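The distinction is visible in the standard library itself; array.array is the homogeneous, C-style structure, while a list holds arbitrary object references:

```python
import array

a = array.array("i", [1, 2, 3])  # homogeneous, fixed-size machine ints
lst = [1, "two", 3.0]            # a Python list: heterogeneous object references
print(a[1], lst[1])              # both index in O(1), but store very different things
```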

And that's just discussing the meaning of array - before we even get to whether a string is an array, and what this means.

>It doesn't really matter how some people think "how people understand" something, the documentation matters

In what universe? In practical use, clarity and unambiguous, least-surprise names and semantics matter.

"But we clarify it in page 2000 of the documentation" is not an excuse. Nor is invoking moral or professional failings of those not reading the documentation. A good library design doesn't offload clearing ambiguity to the documentation.

>Any string in any language is some ordered sequence of atomic text-like objects

You'd be surprised. Especially since this isn't 1985 where strings were a bunch of 8-bit ascii characters, or even 1995, when widechar 16-bit arrays were "good enough" for Windows and Java, but we have not just non-ascii strings, but even variable length (e.g. utf-8) internal strings in mainstream languages.


Tell me a language where a string isn't an ordered sequence of elements of some atomic text-like data type. Those may have different types -- like UTF-8 bytes, bytes, Unicode code points, grapheme clusters, etc. But these are all some sort of representation of text at some level. Which one a programming language uses depends on the language, and should be checked in the documentation. It's not like some obscure "check page 2000" small print, implying that you need to read 18 tomes of language doc before you can work with the language -- no, but if you want to work with strings in any programming language, you should know what type the elements consist of.

Btw, Python may try to overload the meaning of the words array and list, but the word "array" has a generic meaning in this branch of math called computer science (an ordered sequence of elements indexable in O(1)), which is how I used it here.


Except a sequence is not an array.

So OP’s definition still does not apply to python’s definition of a string in an unambiguous manner, which was the claim they were making.

In fact, using the OP’s “unambiguous” definition leads to the conclusion that strings shouldn’t have a length function at all since it’s not an array.


I didn't say a sequence is an array. I said it's fair to call str objects in Python arrays: https://en.m.wikipedia.org/wiki/Array_(data_type)


It's ambiguous because it's not clear what elements go into separate cells of the array.


Why is it ambiguous? The Python documentation is pretty clear about what type of elements a string contains:

> Strings are immutable sequences of Unicode code points

(from https://docs.python.org/3/library/stdtypes.html#text-sequenc...)


According to Python. Every language can have a different definition. C#, for example, defines the elements to be Char objects, which are based on UTF-16:

> The Length property of a string represents the number of Char objects it contains, not the number of Unicode characters.

https://learn.microsoft.com/en-us/dotnet/csharp/programming-...
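The difference is easy to demonstrate from Python by counting UTF-16 code units, which is what C#'s Length reports:

```python
s = "\N{MUSICAL SYMBOL G CLEF}"         # U+1D11E, outside the BMP
print(len(s))                           # 1 -- Python counts code points
print(len(s.encode("utf-16-le")) // 2)  # 2 -- UTF-16 code units (a surrogate pair)
```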


Though C# also recommends the Rune APIs for more modern/better code point handling. The Rune APIs have a bit more in common with Python 3's unicode treatment than the classic (and sometimes wrong) UTF-16 approach.

https://learn.microsoft.com/en-us/dotnet/api/system.text.run...


Did you not read the article?



