TIP 573: Surrogates are invalid

Login
Bounty program for improvements to Tcl and certain Tcl packages.
Author:         Jan Nijtmans <jan.nijtmans@gmail.com>
State:          Draft
Type:           Project
Tcl-Version:    8.7
Tcl-Branch:     tip-573

Abstract

In Tcl 8.5 and 8.6, it is possible to specify "\uD800" as valid character, and do reasonable things with it. Actually, U+D800 is not a valid Unicode code point, so any operation on such value is "undefined behavior". Surrogates are only allowed when Unicode characters are stored or transmitted as UTF16, and then a single Unicode character behaves as 2 characters in storage or on-line. However, when translating back to UTF-8, those 2 characters have to be merged back to a single Unicode character. They are unseparatable.

Tcl has always allowed this, but it's wrong. The UtfToUtf encoder should translate all characters in the range \uD800 - \uDFFF to the replacement character \uFFFD. Also using the form \uD800 in scripts should always behave the same as if \uFFFD was specified. This way, any possible 'hack' attempt, trying to insert invalid Unicode in Tcl, can be mitigated rapidly.

One exception to this has to be made. When using the escace sequence \uD800\uDC00, so a valid combination of an upper and a lower surrogate, in a script, there is no harm in translating that to the intended Unicode code point. In Tcl 8.6 and 8.7 there is no other way than that for specifying Emoji. Allowing this, provides a upgrade path for existing scripts handling Emoji. Starting with Tcl 8.7, the "\UXXXXXX" escape sequence should be used for this.

Specification

This TIP proposes:

  • All escape sequences in the range "\uD800" to "\uDFFF" become illegal, starting with Tcl 8.7. They will not result in an error (since Tcl has no error-handling for escape sequences), but they will behave as if they are specified as "\uFFFD" (the replacement character).
  • For Tcl 8.7, one exception will be made. Escape sequence pairs like "\uD800\uDC00", so a combination of an upper surrogate immediately followed by a lower surrogate, will be translated to the corresponding Unicode code point. For Tcl 8.7, this usage will be deprecated but still possible. For Tcl 9.0, this usage will no longer be permitted.
  • All Tcl_UtfTo*() functions will also, when encountering the 3-byte UTF-8 sequence corresponding to this range, translate this directly to \uFFFD.

Rationale

Since the code-points for surrogates are marked "invalid" in the Unicode standard, it should not be possible to use them in Tcl.

Implementation

Implementation is in branch tip-573

Discussion

See ticket 5dabd088ee.

Copyright

This document has been placed in the public domain.

History