TIP 547: New encodings: UTF-16, UCS-2

Login
Bounty program for improvements to Tcl and certain Tcl packages.
Author:         Jan Nijtmans <jan.nijtmans@users.sf.net>
Author:         Jan Nijtmans <jan.nijtmans@gmail.com>
State:          Final
Type:           Project
Vote:           Done
Created:        31-May-2019
Post-History:   
Discussions-To: Tcl Core list
Keywords:       Tcl
Tcl-Version:    8.7
Tcl-Branch:     tip-547

Abstract

This TIP proposes to add more encodings for handling utf-16 and ucs-2.

Rationale

Currently, Tcl only has one multi-byte Utf encoding named "unicode". Depending on how Tcl is compiled, this could be 16-bit or 32-bit. If 16-bit, then it's currently not clear whether surrogates are handled or not. Also, those encodings always use the platform-endian mode. There is no way to force little- or big-endianess.

Therefore this TIP proposes to clear up the ambiguity: Make clear that those encodings are always 16-bit, and provide different encodings for little- and big-endian. The "utf-16" variant handles surrogates while the "ucs-2" variant does not.

Specification

This document proposes:

  • Add new encodings "utf-16", "utf-16le", "utf-16be", "ucs-2", "ucs-2le", "ucs-2be".

  • Deprecate the "unicode" encoding. "utf-16" is supposed to be used in stead. The "unicode" encoding will NOT be removed in Tcl 9.0, since it's too common.

Use case

Tk defines it's own "ucs-2be" encoding when compiled on little-endian machines. So, this TIP means that Tk no longer needs to provide this encoding any more.

Compatibility

This is fully upwards compatible, except when Tcl is compiled with -DTCL_UTF_MAX=6 (which is - actually - not supported).

Reference Implementation

A reference implementation is available in the tip-547 branch. https://core.tcl-lang.org/tcl/timeline?r=tip-547

Copyright

This document has been placed in the public domain.

History