Author: Jan Nijtmans <jan.nijtmans@gmail.com>
State: Final
Type: Project
Vote: Done
Created: 23-Mar-2022
Tcl-Version: 8.7
Keywords: Tcl
Tcl-Branch: full-utf-for-87
Vote-Summary: Accepted 5/0/1
Votes-For: AK, JN, KBK, KW, SL
Votes-Against: none
Votes-Present: JD
Abstract
Although Tcl 8.7 understands quite a lot of Unicode, there is one thing it still cannot do correctly:

    $ tclsh8.7
    % string length 🤝
    2
The reason is that, internally, sizeof(Tcl_UniChar) == 2, and this is visible throughout the Tcl C API. "string length" counts UTF-16 code units, not Unicode characters. Changing that (by defining TCL_UTF_MAX = 4) would change the behavior of the C API. For example:

    Tcl_Obj *obj = Tcl_NewStringObj("🤝x", -1);
    int size = Tcl_GetCharLength(obj);
    Tcl_UniChar *uniCharString = Tcl_GetUnicode(obj);
    Tcl_Obj *partOfTheString = Tcl_GetRange(obj, 2, 2);

In Tcl 8.6, the above example gives size = 3, uniCharString points to an array of 16-bit values, and partOfTheString contains the value "x". We cannot change that in Tcl 8.7: extensions that depend on this behavior and were compiled against Tcl 8.6 headers would behave differently when loaded into Tcl 8.7. The C API must stay binary compatible.
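
The difference between the two ways of counting can be illustrated without the Tcl library. The following is a minimal sketch, assuming well-formed UTF-8 input; count_units is a hypothetical helper, not part of the Tcl API. It reports both the number of Unicode code points and the number of UTF-16 code units for the same byte string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Tcl function): scans a NUL-terminated,
 * well-formed UTF-8 string and reports the number of Unicode code points
 * and the number of UTF-16 code units needed to represent it. */
void count_units(const char *s, size_t *codePoints, size_t *utf16Units)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(s != NULL && codePoints != NULL && utf16Units != NULL);
    *codePoints = 0;
    *utf16Units = 0;
    for (; *p != 0; p++) {
        if ((*p & 0xC0) == 0x80) {
            continue;           /* continuation byte, already counted */
        }
        (*codePoints)++;
        /* A lead byte >= 0xF0 starts a 4-byte sequence, i.e. a code point
         * above U+FFFF, which needs a surrogate pair in UTF-16. */
        *utf16Units += (*p >= 0xF0) ? 2 : 1;
    }
}
```

For the string "🤝x" (bytes F0 9F A4 9D 78), this yields 2 code points but 3 UTF-16 code units, which matches the size = 3 that Tcl 8.6 reports.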
This TIP proposes to set TCL_UTF_MAX=4 internally and to create a UTF-16 compatibility layer of stub entries, so that extensions are not affected. This compatibility layer will be implemented only for Tcl 8.7; it will not be forward-merged to Tcl 9.0!
Specification
In tcl.h, determine TCL_UTF_MAX as follows:

    #ifndef TCL_UTF_MAX
    #   ifdef BUILD_tcl
    #       define TCL_UTF_MAX 4
    #   else
    #       define TCL_UTF_MAX 3
    #   endif
    #endif

This means that Tcl itself is built using UTF-32 internally. A new set of stub entries is created for 5 functions:

    int TclNumUtfChars(const char *, int)
    int TclGetCharLength(Tcl_Obj *)
    const char *TclUtfAtIndex(const char *, int)
    Tcl_Obj *TclGetRange(Tcl_Obj *, int, int)
    int TclGetUniChar(Tcl_Obj *, int)

Those 5 functions are used everywhere in Tcl, and they count in UTF-32. So TclNumUtfChars("🤝", -1) will return 1, not 2 as in Tcl 8.6. But extensions using Tcl_NumUtfChars/Tcl_GetCharLength/Tcl_UtfAtIndex will continue to use the original functions, which count UTF-16 code units.
Extensions that want to use the new UTF-32 functions can define TCL_UTF_MAX=4 before including tcl.h; then Tcl_NumUtfChars/Tcl_GetCharLength/Tcl_UtfAtIndex/Tcl_GetRange/Tcl_GetUniChar will be mapped to TclNumUtfChars/TclGetCharLength/TclUtfAtIndex/TclGetRange/TclGetUniChar.
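
The compile-time redirection described above can be sketched in isolation. The block below is only an illustration of the macro pattern, not the actual contents of tcl.h: DEMO_UTF_MAX stands in for TCL_UTF_MAX, and the Demo_ functions are hypothetical stand-ins for the legacy and the new entry points. The sketch assumes the extension opts in by defining the macro as 4.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: the extension author opts in to UTF-32
 * counting. DEMO_UTF_MAX is a hypothetical stand-in for TCL_UTF_MAX. */
#ifndef DEMO_UTF_MAX
#   define DEMO_UTF_MAX 4
#endif

/* Stand-in for the legacy entry point: counts UTF-16 code units. */
int Demo_NumUtfChars16(const char *src)
{
    const unsigned char *p = (const unsigned char *)src;
    int n = 0;

    assert(src != NULL);
    for (; *p != 0; p++) {
        if ((*p & 0xC0) != 0x80) {
            n += (*p >= 0xF0) ? 2 : 1;  /* surrogate pair above U+FFFF */
        }
    }
    return n;
}

/* Stand-in for the new stub entry: counts Unicode code points. */
int Demo_NumUtfChars32(const char *src)
{
    const unsigned char *p = (const unsigned char *)src;
    int n = 0;

    assert(src != NULL);
    for (; *p != 0; p++) {
        if ((*p & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* The public name resolves to one implementation at compile time, so
 * existing binaries keep the behavior they were compiled against. */
#if DEMO_UTF_MAX > 3
#   define Demo_NumUtfChars Demo_NumUtfChars32
#else
#   define Demo_NumUtfChars Demo_NumUtfChars16
#endif
```

With DEMO_UTF_MAX set to 4, a call to Demo_NumUtfChars on the UTF-8 bytes of "🤝" returns 1; with the value 3, the same source line would return 2.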
Also, the following 3 functions, which were deprecated in Tcl 8.7 (because they don't work well with UTF-32), are added to this compatibility layer:

    int Tcl_UniCharNcmp(const Tcl_UniChar *, const Tcl_UniChar *, unsigned long);
    int Tcl_UniCharNcasecmp(const Tcl_UniChar *, const Tcl_UniChar *, unsigned long);
    int Tcl_UniCharCaseMatch(const Tcl_UniChar *, const Tcl_UniChar *, int);

Those 3 functions are still deprecated (see TIP #542), but they are implemented in the UTF-16 compatibility layer for Tcl 8.7. In Tcl 9.0, they are gone.
In addition, the "string" objType is renamed to "utf32string", and a new "string" objType is implemented which uses UTF-16 code units. This objType is used in the compatibility layer of the following 5 functions:

    Tcl_UniChar *Tcl_GetUnicode(Tcl_Obj *);
    Tcl_UniChar *Tcl_GetUnicodeFromObj(Tcl_Obj *, int *);
    Tcl_Obj *Tcl_NewUnicodeObj(Tcl_UniChar *, int);
    void Tcl_SetUnicodeObj(Tcl_Obj *, Tcl_UniChar *, int);
    void Tcl_AppendUnicodeToObj(Tcl_Obj *, const Tcl_UniChar *, int);

If Tcl is compiled with -DTCL_NO_DEPRECATED, the UTF-16 compatibility layer is removed. This is meant to verify that the compatibility layer is not used anywhere internally. It also means that extensions using any of the above APIs will panic. Compiling the extension with -DTCL_UTF_MAX=4 will make it work again, but this is meant only for testing, not for production!
Finally, this TIP proposes to undo the deprecation of Tcl_AppendUnicodeToObj. Although the deprecation was proposed in TIP #542, it turned out that this function could not really be removed; it merely became an internal stub function in Tcl 9.0. Therefore, there is no burden in exposing it again.
Caveat
Since TCL_UTF_MAX is raised internally from 3 to 4, this influences the behavior of the encoding/decoding functions. For example, consider the following code:

    Tcl_EncodingState state;
    int read;
    Tcl_Encoding encoding = Tcl_GetEncoding(interp, "utf-8");
    char buf[TCL_UTF_MAX] = "";
    int result = Tcl_ExternalToUtf(interp, encoding, "🤝", 4, flags, &state,
            buf, sizeof(buf), &read, NULL, NULL);

In Tcl 8.6, after this call, buf will be filled with the bytes 0xED 0xA0 0xBE, which is the CESU-8 representation of a high surrogate. The function Tcl_ExternalToUtf in Tcl 8.6 is guaranteed to provide some output if the buffer provided has at least 3 bytes. In Tcl 8.7, buffers used for Tcl_ExternalToUtf or Tcl_UtfToExternal need at least 4 bytes; otherwise 4-byte UTF-8 sequences cannot be handled. Therefore, in Tcl 8.7, the above example won't produce any output (read = 0), since the buffer cannot hold even a single Unicode character.
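
The three bytes quoted above can be checked by hand. The following sketch, which does not use the Tcl library, derives the high surrogate for U+1F91D (🤝) and encodes it with the 3-byte UTF-8 bit layout, which is what CESU-8 does for surrogates. Both function names are hypothetical helpers introduced for illustration.

```c
#include <assert.h>

/* Hypothetical helper: returns the high (leading) UTF-16 surrogate for a
 * code point above U+FFFF. */
unsigned high_surrogate(unsigned long cp)
{
    assert(cp >= 0x10000UL && cp <= 0x10FFFFUL);
    return 0xD800u + (unsigned)((cp - 0x10000UL) >> 10);
}

/* Hypothetical helper: writes a value in the range U+0800..U+FFFF as
 * three bytes using the UTF-8 bit layout 1110xxxx 10xxxxxx 10xxxxxx.
 * Applied to a surrogate, this gives its CESU-8 form. */
void encode_3byte(unsigned v, unsigned char out[3])
{
    assert(v >= 0x800u && v <= 0xFFFFu);
    out[0] = (unsigned char)(0xE0u | (v >> 12));
    out[1] = (unsigned char)(0x80u | ((v >> 6) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | (v & 0x3Fu));
}
```

For U+1F91D the high surrogate is 0xD83E, and its 3-byte encoding is 0xED 0xA0 0xBE. This is why a 3-byte buffer was enough for Tcl 8.6 to make progress, while the full 4-byte sequence F0 9F A4 9D does not fit.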
Implementation
See branch full-utf-for-87
Copyright
This document has been placed in the public domain.