TIP 656: A revised proposal for encodings

Author:		Ashok P. Nadkarni <apnmbx-public@yahoo.com>
State:		Final
Type:		Project
Vote:		Done
Created:	2023-02-13
Tcl-Version:	8.7
Tcl-branch:	tip-656
Vote-Summary:	Accepted 6/0/0
Votes-For:	AK, JD, JN, KW, MC, SL
Votes-Against:	none
Votes-Present:	none

Abstract

This TIP proposes enhancements to the character encoding commands and public C APIs present in Tcl 8.6, based on the profile concept of TIP 654. It differs from that TIP in syntax, C API and semantics related to other options. It supplants the previously accepted TIPs 346, 601 and 633 that targeted the same functionality.

The TIP also defines fconfigure options to associate profiles with channels to control their encoding behavior.

Rationale

Operations involving encoding transforms may encounter multiple types of errors such as invalid sequences in the source, characters that cannot be encoded in the target encoding etc. Tcl 8.6 dealt with these errors by "wishing them away", either by substituting a ? character or (effectively) assuming the offending byte was in CP1252 encoding.

TIPs 346, 601 and 607 proposed options to the encoding command that allowed applications to detect and handle encoding transform errors. Further, TIP 633 added corresponding options to fconfigure for the same purpose.

These options have shortcomings, however, as described in a separate write-up and on the mailing list. They are summarized below.

  • The options -strict and -nocomplain are added to increase or decrease the "level of strictness" relative to the default. These do not cover the conformant behaviors specified by the Unicode standard.

  • At the C API level, these translate to the flags TCL_ENCODING_STRICT and TCL_ENCODING_NOCOMPLAIN, with an additional flag TCL_ENCODING_STOPONERROR that was already present in 8.6. The use of multiple options (plus the default) is confusing in terms of their interaction and combined behavior. This is exacerbated by the difference in their treatment between Tcl 8 and Tcl 9.

  • There is no provision for the standard conformant behavior that uses the U+FFFD replacement character. This means it is actually not possible to write a W3C conformant browser (or other Web content processor) without implementing your own encoders.

  • Adding further error handling behaviors in the future would require additional mutually exclusive switches and flags which further complicate the interface.

This proposal, based on TIP 654's profile model, is intended to address the above shortcomings.

Specification

Profiles

The following types of errors may be encountered when converting an encoded byte sequence into Tcl's internal form:

  • Case 1. The byte value may be one that should never appear in the specified encoding or at a particular position in a multibyte encoding. For example, the values \xC0 and \xC1 should never appear at any point in a UTF-8 encoded byte sequence. As an example of the latter, the byte \xE0 (amongst others) should never appear as the lead byte in ShiftJIS.

  • Case 2. The rules for the encoding do not permit the value to have been encoded in the first place. For example, surrogate Unicode code points should never be encoded and thus should be treated as an error when encountered during a decoding operation. (Note the surrogate could appear in the UTF-16 encoded byte sequence. But the decoded value should never be a surrogate code point.)

  • Case 3. A byte subsequence within a byte sequence that is encoded with a multibyte encoding terminates prematurely. This may or may not be an error depending on whether the subsequence is in the middle of the containing byte sequence or at the end. In the latter case, it may just mean more bytes are needed, as may happen when data is read over a streaming interface. For example, the UTF-8 sequence \xC2\x41 is a hard error as there is no trailing byte succeeding the lead byte \xC2 (\x41 cannot be a trailing byte). On the other hand, the sequence \x41\xC2 may not be an error because additional data may arrive containing a valid trailing byte to complete the \xC2.

  • Case 4. The decoded values may lie outside the range of Unicode code points. For example, the UTF-32 encoded sequence \x7F\xFF\xFF\x7F trivially translates to the integer value U+7FFFFF7F, which is greater than the largest valid code point U+10FFFF. This is distinguished from Case 2 because it is treated differently by Tcl in the current implementation.
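To make the decoding cases concrete, the following standalone C sketch classifies a UTF-8 byte sequence by the first problem it contains and reproduces the UTF-32 arithmetic from Case 4. It is not Tcl's implementation: DecodeStatus, classify_utf8 and utf32be_value are hypothetical names introduced here, and reporting overlong forms as Case 1 is an assumption of the sketch.

```c
#include <stddef.h>

/* Hypothetical labels for the decode error cases listed above. */
typedef enum {
    DEC_OK,           /* whole input is well formed */
    DEC_INVALID,      /* Case 1: byte not allowed at this position */
    DEC_PROHIBITED,   /* Case 2: decodes to a surrogate code point */
    DEC_TRUNCATED,    /* Case 3: input ends inside a multibyte sequence */
    DEC_OUT_OF_RANGE  /* Case 4: value above U+10FFFF */
} DecodeStatus;

/*
 * Scan len bytes of UTF-8 and report the first problem found.
 * On error, *errPos receives the offset of the offending lead byte.
 */
DecodeStatus classify_utf8(const unsigned char *s, size_t len, size_t *errPos)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        unsigned long cp;
        size_t need, k;            /* need = number of trailing bytes */

        if (b < 0x80) { i++; continue; }
        if (b >= 0xC2 && b <= 0xDF)      { need = 1; cp = b & 0x1F; }
        else if (b >= 0xE0 && b <= 0xEF) { need = 2; cp = b & 0x0F; }
        else if (b >= 0xF0 && b <= 0xF4) { need = 3; cp = b & 0x07; }
        else { *errPos = i; return DEC_INVALID; }  /* e.g. \xC0, \xC1 */

        for (k = 1; k <= need; k++) {
            if (i + k >= len) { *errPos = i; return DEC_TRUNCATED; }
            if ((s[i + k] & 0xC0) != 0x80) { *errPos = i; return DEC_INVALID; }
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp > 0x10FFFF) { *errPos = i; return DEC_OUT_OF_RANGE; }
        if (cp >= 0xD800 && cp <= 0xDFFF) { *errPos = i; return DEC_PROHIBITED; }
        /* Overlong 3- and 4-byte forms; treated as Case 1 in this sketch. */
        if ((need == 2 && cp < 0x800) || (need == 3 && cp < 0x10000)) {
            *errPos = i;
            return DEC_INVALID;
        }
        i += need + 1;
    }
    return DEC_OK;
}

/* Read four big-endian bytes as a raw UTF-32 value, with no range check. */
unsigned long utf32be_value(const unsigned char b[4])
{
    return ((unsigned long)b[0] << 24) | ((unsigned long)b[1] << 16) |
           ((unsigned long)b[2] << 8) | (unsigned long)b[3];
}
```

With these definitions, \xC2\x41 is reported as DEC_INVALID at offset 0, while \x41\xC2 is reported as DEC_TRUNCATED at offset 1; a streaming caller could treat the latter as a request for more data rather than as an error.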

Similarly, the following types of errors may be encountered when converting in the other direction:

  • Case 1. The encoding does not support the Unicode code point. For example, code points higher than U+007F are not supported in the ASCII encoding.

  • Case 2. The encoding may be able to encode a Unicode code point but the rules for the encoding do not allow it. For example, the Unicode standard for UTF-8 encoding prohibits encoding of surrogate code points. So although the surrogate U+DC00 can be encoded as the byte sequence \xED\xB0\x80, it is prohibited by the standard.

  • Case 3. The value of the code point lies outside the valid code point range.
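The encoding-side cases can be sketched the same way. The C fragment below is only an illustration under stated assumptions: EncodeStatus, Target, classify_codepoint and utf8_three_bytes are hypothetical names, and only ASCII and UTF-8 targets are modelled. The second function applies the three-byte UTF-8 bit pattern mechanically, which shows why a surrogate "can" be encoded even though the standard prohibits it.

```c
/* Hypothetical labels for the encode error cases listed above. */
typedef enum {
    ENC_OK,
    ENC_UNSUPPORTED,   /* Case 1: target encoding has no such code point */
    ENC_PROHIBITED,    /* Case 2: representable, but forbidden by the rules */
    ENC_OUT_OF_RANGE   /* Case 3: above U+10FFFF */
} EncodeStatus;

typedef enum { TARGET_ASCII, TARGET_UTF8 } Target;

/* Decide which case, if any, applies to one code point for a target. */
EncodeStatus classify_codepoint(unsigned long cp, Target target)
{
    if (cp > 0x10FFFF)
        return ENC_OUT_OF_RANGE;
    if (target == TARGET_ASCII)
        return cp <= 0x7F ? ENC_OK : ENC_UNSUPPORTED;
    /* UTF-8 has a bit pattern for every code point, surrogates included,
     * but the Unicode standard forbids encoding surrogates. */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return ENC_PROHIBITED;
    return ENC_OK;
}

/* Mechanical three-byte UTF-8 pattern for U+0800..U+FFFF, no validity check. */
void utf8_three_bytes(unsigned long cp, unsigned char out[3])
{
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
}
```

Applied to U+DC00, utf8_three_bytes produces the bytes \xED\xB0\x80 mentioned in Case 2, while classify_codepoint reports the same code point as ENC_PROHIBITED for a UTF-8 target.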

A profile defines the handling of each of the above error cases by either

  1. Terminating further processing of the source data. The profile does not determine how this premature termination is conveyed to the caller. By default, this is signalled by raising an exception. The -failindex option as described in TIP 607 may be used instead.

  2. Using a fallback strategy for the offending bytes and continuing to process the rest of the data. This may be the use of a replacement character (either fixed or dependent on the invalid byte), discarding the invalid bytes etc. as defined by the profile.

Note that none of the currently defined profiles distinguish between error cases, but nothing prevents a profile defined in the future from doing so. For example, an allowsurrogates profile may pass through surrogate code points (illegal in UTF-8) but stop processing on other error cases.

This TIP defines three profiles, tcl8, strict and replace.

The tcl8 profile

The tcl8 profile corresponds to the implementation of encoders in Tcl 8.6.

When converting to Tcl's string form, with the exception of the special case noted below, each byte of an illegal byte sequence is mapped to its numerically equivalent code point. In effect, it treats the byte as being in ISO8859-1 encoding even though the transform may have specified a different encoding.

As a special case, for the UTF-8 encoding the illegal sequence \xC0\x80 is mapped to U+0000.

When converting a Tcl string to an encoded byte sequence, values that cannot be encoded in the target encoding are mapped to an encoding-specific fallback character, usually ?. For UTF encodings, this case cannot arise as they can represent all code points. Additionally, for the error case where the code point being encoded is prohibited from appearing in encoded form (surrogates for example), the tcl8 profile ignores the mandate and encodes the code point anyway.

The tcl8 profile is not conformant with the Unicode standard. Moreover, it leaves room for silent misinterpretation of data.

With respect to the current implementation, the tcl8 profile replaces the -nocomplain option of TIP 601.

The strict profile

The strict profile implements strictly conformant behavior as defined in the Unicode standard. All error cases result in the error being signalled.

With respect to the current implementation, the strict profile replaces the -strict option of TIP 346.

The replace profile

The replace profile implements an alternate conformant behaviour defined in the Unicode standard.

When converting an encoded byte sequence to a Tcl string, invalid byte sequences are replaced by the U+FFFD REPLACEMENT CHARACTER code point.

When encoding a Tcl string, characters that cannot be represented in the target encoding are transformed to an encoding-specific fallback character, U+FFFD REPLACEMENT CHARACTER for UTF targets and generally ? for other encodings.

When multiple successive invalid bytes are encountered, the Unicode standard allows for their substitution with a single or multiple replacement characters. The replace profile conforms to this.

There is no equivalent to the replace profile in the current TIP 346/601 based 8.7 implementation.
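The difference between the three profiles on the decoding side can be illustrated with a small standalone C sketch. It is a simplification under stated assumptions, not Tcl's decoder: Profile, seq_len and decode_utf8 are hypothetical names, only structurally malformed UTF-8 is detected (bad lead byte, bad or missing trail byte), the value checks for surrogates, overlong forms and range are omitted, and the replace branch emits one U+FFFD per offending byte, which is one of the substitutions the standard permits.

```c
#include <stddef.h>

typedef enum { PROFILE_TCL8, PROFILE_STRICT, PROFILE_REPLACE } Profile;

/* Length of the structurally valid UTF-8 sequence at s, or 0 if it is
 * malformed or cut short. avail must be at least 1. */
static size_t seq_len(const unsigned char *s, size_t avail)
{
    size_t need, k;
    if (s[0] < 0x80) return 1;
    if (s[0] >= 0xC2 && s[0] <= 0xDF) need = 2;
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) need = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) need = 4;
    else return 0;
    if (need > avail) return 0;
    for (k = 1; k < need; k++)
        if ((s[k] & 0xC0) != 0x80) return 0;
    return need;
}

/*
 * Decode len bytes into out (caller supplies room for len entries).
 * Returns the number of code points stored. *errPos is the offset at
 * which the strict profile stopped, or -1 if decoding ran to the end.
 */
size_t decode_utf8(const unsigned char *s, size_t len, Profile profile,
                   unsigned long *out, long *errPos)
{
    size_t i = 0, n = 0;
    *errPos = -1;
    while (i < len) {
        size_t k, sl = seq_len(s + i, len - i);
        if (sl > 0) {
            unsigned long cp = s[i];
            if (sl == 2) cp &= 0x1F;
            else if (sl == 3) cp &= 0x0F;
            else if (sl == 4) cp &= 0x07;
            for (k = 1; k < sl; k++)
                cp = (cp << 6) | (s[i + k] & 0x3F);
            out[n++] = cp;
            i += sl;
        } else if (profile == PROFILE_STRICT) {
            *errPos = (long)i;          /* stop at the first error */
            break;
        } else if (profile == PROFILE_REPLACE) {
            out[n++] = 0xFFFD;          /* REPLACEMENT CHARACTER */
            i++;
        } else if (s[i] == 0xC0 && i + 1 < len && s[i + 1] == 0x80) {
            out[n++] = 0;               /* tcl8 special case: \xC0\x80 */
            i += 2;
        } else {
            out[n++] = s[i];            /* tcl8: byte value as code point */
            i++;
        }
    }
    return n;
}
```

For the input bytes 41 C2 42, this sketch yields U+0041 U+00C2 U+0042 under tcl8, U+0041 U+FFFD U+0042 under replace, and only U+0041 with an error offset of 1 under strict.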

The default profile

This TIP does not specify the default profile to be used. That is the subject of a separate TIP.

The encoding profiles command

A new command is added that will return the names of implemented profiles.

encoding profiles

Changes to encoding convertfrom and encoding convertto

The commands encoding convertfrom and encoding convertto support a new option -profile that takes a profile name as value. The -strict and -nocomplain options are no longer supported. The commands take the form

encoding convertfrom DATA
encoding convertfrom ?-profile PROFILE? ?-failindex VAR? ENCODING DATA

encoding convertto DATA
encoding convertto ?-profile PROFILE? ?-failindex VAR? ENCODING DATA

The syntax is backward compatible with 8.6. However, it differs from the current 8.7/9.0 implementation in that there is no ambiguity. In the current implementation, when two arguments are supplied, it tries to guess whether the first is an option or an encoding name. With the above syntax, if any options are specified, the encoding must be explicitly specified as well. Note it is possible to relax this based on odd/evenness of the argument count but that would make it trickier to add options in the future that do not take an argument.

The -profile option specifies the profile to be used for the conversion as described earlier. If multiple -profile options are passed, the last one will be used.

The -failindex option behaves as defined in TIP 607. However, although not specified in that TIP, in the current 8.7/9.0 implementation the -failindex option also enables the -strict option. This TIP specifically proposes that the option not make any implicit selection of profiles. In addition to the author's opinion that options should be as orthogonal to each other as possible, the current implied behavior makes it awkward to write (for example) a proc that takes a profile and returns as much data as can be read without raising an error. The -failindex option now only determines whether an exception is raised or decoded data is returned with error location in the passed variable when processing of the input data is stopped as determined by the profile.

New option -profile for fconfigure and chan configure

A new option -profile has been added to the fconfigure command. The option's value must be a profile name. The encoding transforms in use for the channel's input and output will then be subject to the rules of that profile. Any failures will result in a channel error. The mode of reporting channel error is a function of the channel subsystem and not defined by this TIP.

The -strictencoding and -nocomplainencoding options that were defined by the earlier TIPs and currently implemented in 8.7 and 9.0 alphas are supplanted by -profile and removed.

Changes to the C APIs

Two new functions, Tcl_ExternalToUtfDStringEx and Tcl_UtfToExternalDStringEx, related to encoding transforms were added by TIP 601 for 8.7. These had the signatures

Tcl_Size Tcl_ExternalToUtfDStringEx(Tcl_Encoding encoding, const char *src, int srcLen, int flags, Tcl_DString *dsPtr);
Tcl_Size Tcl_UtfToExternalDStringEx(Tcl_Encoding encoding, const char *src, int srcLen, int flags, Tcl_DString *dsPtr);

This TIP changes the signatures of these functions to the following:

int
Tcl_ExternalToUtfDStringEx(
    Tcl_Interp *interp,     /* For error messages. May be NULL. */
    Tcl_Encoding encoding,  /* The encoding for the source string, or NULL
                             * for the default system encoding. */
    const char *src,        /* Source string in specified encoding. */
    Tcl_Size srcLen,        /* Source string length in bytes, or < 0 for
                             * encoding-specific string length. */
    int flags,              /* Conversion control flags. */
    Tcl_DString *dstPtr,    /* Uninitialized or free DString in which the
                             * converted string is stored. Must be freed
                             * after return irrespective of return value */
    Tcl_Size *errorLocPtr); /* Where to store the error location
                             * (or TCL_INDEX_NONE if no error). May be NULL. */

int
Tcl_UtfToExternalDStringEx(
    Tcl_Interp *interp,     /* For error messages. May be NULL. */
    Tcl_Encoding encoding,  /* The encoding for the converted string, or
                             * NULL for the default system encoding. */
    const char *src,        /* Source string in UTF-8. */
    Tcl_Size srcLen,        /* Source string length in bytes, or < 0 for
                             * strlen(). */
    int flags,              /* Conversion control flags. */
    Tcl_DString *dstPtr,    /* Uninitialized or free DString in which the
                             * converted string is stored. Must be freed
                             * after return irrespective of return value */
    Tcl_Size *errorLocPtr); /* Where to store the error location
                             * (or TCL_INDEX_NONE if no error). May be NULL. */

The Tcl_ExternalToUtfDStringEx function converts a source buffer from the specified encoding into UTF-8. The Tcl_UtfToExternalDStringEx function does the converse, converting a source buffer from UTF-8 to the specified encoding.

The flags parameter may be composed from OR-ing the following values:

  • At most one of TCL_ENCODING_PROFILE_DEFAULT, TCL_ENCODING_PROFILE_TCL8, TCL_ENCODING_PROFILE_STRICT and TCL_ENCODING_PROFILE_REPLACE. If none are specified, a version-dependent default profile is used.

  • For reasons of backward compatibility and consistency with 8.6 functions, the TCL_ENCODING_STOPONERROR flag remains. It has the same effect as specifying TCL_ENCODING_PROFILE_STRICT, overriding any other profile flags that might have been specified.

To preserve future compatibility, any other bits will result in an error being returned. In particular, callers should not set the TCL_ENCODING_START and TCL_ENCODING_END flags, as those are not supported by the above functions (even in the current pre-profile implementation) because they do not implement streaming operation.
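The flag rules above can be sketched as a small validation routine. Everything in this fragment is an assumption made for illustration: the SK_ constants are hypothetical stand-ins with made-up values, not the values in tcl.h, each profile is modelled as its own bit so that "at most one" can be checked, and the order of the checks is a choice of the sketch rather than something this TIP specifies.

```c
/* Hypothetical stand-ins for the TCL_ENCODING_* flags; values are made up. */
#define SK_STOPONERROR     0x00000004u
#define SK_PROFILE_DEFAULT 0x00000000u
#define SK_PROFILE_TCL8    0x01000000u
#define SK_PROFILE_STRICT  0x02000000u
#define SK_PROFILE_REPLACE 0x04000000u
#define SK_PROFILE_BITS \
    (SK_PROFILE_TCL8 | SK_PROFILE_STRICT | SK_PROFILE_REPLACE)

/*
 * Validate flags and compute the effective profile.
 * Returns 0 and stores the profile in *profile, or -1 for invalid flags.
 */
int resolve_profile(unsigned int flags, unsigned int *profile)
{
    unsigned int p = flags & SK_PROFILE_BITS;

    /* Any bit other than a profile bit or STOPONERROR is rejected. */
    if (flags & ~(SK_PROFILE_BITS | SK_STOPONERROR))
        return -1;
    /* More than one profile bit set: p is not zero or a power of two. */
    if (p & (p - 1))
        return -1;
    /* STOPONERROR acts like the strict profile and overrides the rest. */
    if (flags & SK_STOPONERROR)
        p = SK_PROFILE_STRICT;
    *profile = p;   /* SK_PROFILE_DEFAULT when no profile was given */
    return 0;
}
```

In this model, passing SK_PROFILE_TCL8 together with SK_STOPONERROR resolves to the strict profile, while a streaming-style bit such as 0x01 is rejected outright.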

Both functions have the same set of return values:

  • TCL_OK: success. Converted string in *dstPtr and NUL terminated in an encoding specific manner.

  • TCL_ERROR: Error other than conversion error, e.g. invalid parameter values. Error message is stored in the interpreter.

  • TCL_CONVERT_MULTIBYTE: Indicates that the source ends in truncated multibyte sequence.

  • TCL_CONVERT_SYNTAX: The source is not conformant to the encoding definition.

  • TCL_CONVERT_UNKNOWN: The source contained a character that could not be represented in target encoding.

In the case of the TCL_CONVERT_* return codes,

  • If errorLocPtr is NULL, an error message is stored in the interpreter, provided interp is not NULL.

  • If errorLocPtr is not NULL, no error message is stored as it is expected the caller is interested in whatever is decoded so far and not treating this as an error condition.

Differences from the current 8.7 API

As stated above, the signatures of the functions differ from the currently implemented 8.7 and 9.0 APIs. The new signatures are motivated by:

  • The older signatures had no mechanism to signal an error other than encoding errors in the data stream. In particular, there was no way to signal errors in parameter values. Examples would be invalid profiles, conflicting flags, flags not applicable to the given encoding etc.

  • Callers had to generate error messages themselves based on error codes, including computing the error offset, offending character etc. This is both inconvenient and avoidable duplication. Passing the interpreter for the purposes of retrieving error messages is a common convention in the Tcl core.

In addition to the change in signatures, the TCL_ENCODING_NOCOMPLAIN, TCL_ENCODING_STRICT and TCL_ENCODING_MODIFIED bit flags have been removed. These were not present in Tcl 8.6 so there is no backward compatibility issue.

The first two have been supplanted by the profile related flags. The TCL_ENCODING_MODIFIED bit was intended to be used to specify a variant of UTF-8 or CESU-8 for encoding nul bytes as \xC0\x80. This is never set internally within the Tcl core and not accessible at the script level either. The motivation for eliminating it arises from the belief that this is actually an encoding and best modeled as such instead of through flags. If encoding variants are enabled through flags, then why not CESU-8 as a variant of UTF-8, or UTF-16LE/UTF-16BE as variants of UTF-16? As an aside, other languages also treat this "modified" UTF-8 as a separate selectable encoding. A separate encoding would also make it usable from the script level if so desired without changing the API.

Implementation

Implementation and tests for Tcl 8.7 and 9.0 are available in the tip-656 and tip-656-tcl9 branches respectively. Currently these still use the -encodingprofile option name and will be changed to -profile dependent on TIP approval. Manpages have not been updated.

Alternative proposals

There have been a couple of alternatives proposed on the mailing list.

Finer granularity of error class selection

The first is an -onerror option which is similar to the -profile option but allows for finer granularity.

encoding convertfrom -onerror {surrogates invalid wrongcode} ....
encoding convertfrom -handle {SURROGATE error INVALID replace INCOMPLETE ignore ...}

The author is not in favor of this as I expect it to add considerable complexity to the implementation and test suites while being minimally useful to the end user. (It feels like over-generalization to me. How often would a user want to distinguish between invalid cases?)

Include the profile within the encoding parameter

Another syntactic alternative proposed was to embed the error handling options into the encoding argument.

encoding convertfrom {utf-8 strict}
fconfigure CHANNEL -encoding {utf-8 strict}

Since the difference is primarily in command option processing, implementation changes are not many. I prefer the first form from a stylistic perspective. For example, the latter makes it a little more awkward to request a profile using the default encoding.

Alternative fconfigure option name

The original option to fconfigure proposed by this TIP was -encodingprofile. That has been renamed to -profile as per Jan's suggestion. This is both less wordy and does not conflict with -encoding when used in shorter forms.

Copyright

This document has been placed in the public domain.