Author: Ashok P. Nadkarni <apnmbx-public@yahoo.com>
State: Final
Type: Project
Vote: Done
Created: 2023-02-13
Tcl-Version: 8.7
Tcl-branch: tip-656
Vote-Summary: Accepted 6/0/0
Votes-For: AK, JD, JN, KW, MC, SL
Votes-Against: none
Votes-Present: none
Abstract
This TIP proposes enhancements to the character encoding commands and public C APIs present in Tcl 8.6, based on the profile concept of TIP 654. It differs from TIP 654 in syntax, C API and semantics related to other options. It supplants previously accepted TIPs 346, 601 and 633 that targeted the same functionality.
The TIP also defines fconfigure options to associate profiles with channels to control their encoding behavior.
Rationale
Operations involving encoding transforms may encounter multiple types of
errors such as invalid sequences in the source, characters that
cannot be encoded in the target encoding etc. Tcl 8.6 dealt with these
errors by "wishing them away", either by substituting a ? character
or (effectively) assuming the offending byte was in CP1252 encoding.
TIPs 346, 601 and 607 proposed options to the encoding command that
allowed applications to detect and handle encoding transform errors.
Further, TIP 633 added corresponding options to fconfigure for the same
purpose.
There are, however, inadequacies in these options, as described in a separate write-up and on the mailing list, and summarized below.
The options -strict and -nocomplain are added to increase or decrease the "level of strictness" relative to the default. These do not cover the conformant behaviors specified by the Unicode standard.
At the C API level, these translate to the flags TCL_ENCODING_STRICT and TCL_ENCODING_NOCOMPLAIN, with an additional flag TCL_ENCODING_STOPONERROR that was already present in 8.6. The use of multiple options (plus the default) is confusing in terms of their interaction and combined behavior. This is exacerbated by the difference in their treatment between Tcl 8 and Tcl 9.
There is no provision for the standard conformant behavior that uses the U+FFFD replacement character. This means it is actually not possible to write a W3C conformant browser (or other Web content processor) without implementing your own encoders.
Adding further error handling behaviors in the future would require additional mutually exclusive switches and flags, which would further complicate the interface.
This proposal, based on TIP 654's profile model, is intended to address the above shortcomings.
Specification
Profiles
The following types of errors may be encountered when converting an encoded byte sequence into Tcl's internal form:
Case 1. The byte value may be one that should never appear in the specified encoding or at a particular position in a multibyte encoding. For example, the values \xC0 and \xC1 should never appear at any point in a UTF-8 encoded byte sequence. As an example of the latter, the byte \xE0 (amongst others) should never appear as the lead byte in ShiftJIS.
Case 2. The rules for the encoding do not permit the value to have been encoded in the first place. For example, surrogate Unicode code points should never be encoded and thus should be treated as an error when encountered during a decoding operation. (Note the surrogate could appear in the UTF-16 encoded byte sequence. But the decoded value should never be a surrogate code point.)
Case 3. A byte subsequence within a byte sequence that is encoded with a multibyte encoding terminates prematurely. This may or may not be an error depending on whether the subsequence is in the middle of the containing byte sequence or at the end. In the latter case, it may just mean more bytes are needed, as may happen when data is read over a streaming interface. For example, the UTF-8 sequence \xC2\x41 is a hard error as there is no trailing byte succeeding the lead byte \xC2 (\x41 cannot be a trailing byte). On the other hand, the sequence \x41\xC2 may not be an error because additional data may arrive containing a valid trailing byte to complete the \xC2.
Case 4. The decoded values may lie outside the range of Unicode code points. For example, the UTF-32 encoded sequence \x7F\xFF\xFF\x7F trivially translates to the integer value U+7FFFFF7F, which is greater than the largest valid code point U+10FFFF. This is distinguished from Case 2 because it is treated differently by Tcl in the current implementation.
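The four decoding cases can be observed with any strict decoder. The following is a minimal sketch using Python's built-in codecs purely as an analogy; it is not Tcl's implementation, and the helper name decode_error is hypothetical. It feeds the byte sequences from the cases above to a strict decoder, then shows the streaming variant of Case 3 with an incremental decoder.

```python
import codecs


def decode_error(data: bytes, encoding: str):
    """Return the codec's reason string if strict decoding fails, else None."""
    try:
        data.decode(encoding)  # errors="strict" is Python's default
        return None
    except UnicodeDecodeError as exc:
        return exc.reason


# Case 1: a byte value that can never appear in UTF-8.
case1 = decode_error(b"\xc0", "utf-8")
# Case 2: UTF-8 bytes that would decode to the surrogate U+DC00.
case2 = decode_error(b"\xed\xb0\x80", "utf-8")
# Case 3 (hard error): lead byte \xC2 followed by \x41, which is not a trail byte.
case3_hard = decode_error(b"\xc2\x41", "utf-8")
# Case 4: a UTF-32 value above the largest code point U+10FFFF.
case4 = decode_error(b"\x7f\xff\xff\x7f", "utf-32-be")

# Case 3 (streaming): a trailing lead byte is held back, not rejected,
# because more data may still arrive.
stream = codecs.getincrementaldecoder("utf-8")()
first = stream.decode(b"\x41\xc2")            # only "A" is produced so far
second = stream.decode(b"\xa9", final=True)   # \xC2\xA9 completes U+00A9
```

All four `case` variables end up holding a non-empty reason string, while the streaming decoder yields "A" first and the completed character once the trailing byte arrives.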
Similarly, the following types of errors may be encountered when converting in the other direction:
Case 1. The encoding does not support the Unicode code point. For example, code points higher than U+007F are not supported in the ASCII encoding.
Case 2. The encoding may be able to encode a Unicode code point but the rules for the encoding do not allow it. For example, the Unicode standard for UTF-8 encoding prohibits encoding of surrogate code points. So although the surrogate U+DC00 can be encoded as the byte sequence \xED\xB0\x80, it is prohibited by the standard.
Case 3. The value of the code point lies outside the valid code point range.
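The encoding-direction cases can be illustrated the same way. This is a sketch using Python's codecs as an analogy only; encode_error is a hypothetical helper. Note that Python strings cannot hold values above U+10FFFF at all, so Case 3 shows up there as a failure to construct the character rather than as an encoding error.

```python
def encode_error(text: str, encoding: str):
    """Return the codec's reason string if strict encoding fails, else None."""
    try:
        text.encode(encoding)  # errors="strict" is Python's default
        return None
    except UnicodeEncodeError as exc:
        return exc.reason


# Case 1: U+00E9 is above U+007F, so ASCII cannot represent it.
case1 = encode_error("\u00e9", "ascii")
# Case 2: UTF-8 could represent the surrogate U+DC00, but the rules forbid it.
case2 = encode_error("\udc00", "utf-8")
# Bypassing the rule yields exactly the byte sequence quoted in the text.
forced = "\udc00".encode("utf-8", "surrogatepass")

# Case 3: a value beyond U+10FFFF cannot even be built as a Python character.
try:
    chr(0x110000)
    case3 = None
except ValueError as exc:
    case3 = str(exc)
```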
A profile defines the handling of each of the above error cases by either
Terminating further processing of the source data. The profile does not determine how this premature termination is conveyed to the caller. By default, this is signalled by raising an exception. The -failindex option as described in TIP 607 may be used instead.
Using a fallback strategy for the offending bytes and continuing to process the rest of the data. This may be the use of a replacement character (either fixed or dependent on the invalid byte), discarding the invalid bytes, etc., as defined by the profile.
Note that none of the currently defined profiles distinguish between error
cases, but nothing prevents a profile defined in the future from doing so.
For example, an allowsurrogates profile may pass through surrogate code
points (illegal in UTF-8) but stop processing on other error cases.
This TIP defines three profiles: tcl8, strict and replace.
The tcl8 profile
The tcl8 profile corresponds to the implementation of encoders in Tcl 8.6.
When converting to Tcl's string form, with the exception of the special case noted below, each byte of an illegal byte sequence is mapped to its numerically equivalent code point. In effect, it treats the byte as being in ISO8859-1 encoding even though the transform may have specified a different encoding.
As a special case, for the UTF-8 encoding the illegal sequence \xC0\x80 is mapped to U+0000.
When converting a Tcl string to an encoded byte sequence, values that cannot
be encoded in the target encoding are mapped to an encoding-specific
fallback character, usually ?. For UTF encodings, this case cannot arise
as they can represent all code points. Additionally, for the error case
where the code point being encoded is prohibited from appearing in
encoded form (surrogates for example), the tcl8 profile ignores the
mandate and encodes the code point anyway.
The tcl8 profile is not conformant with the Unicode standard. Moreover,
it leaves room for silent misinterpretation of data.
With respect to the current implementation, the tcl8 profile
replaces the -nocomplain option of TIP 601.
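To make the byte-to-code-point fallback concrete, here is a rough sketch that approximates the tcl8 decoding rules described above with a custom Python codec error handler. The handler name tcl8_like is hypothetical, and this is an analogy rather than Tcl's actual decoder: how a decoder groups offending bytes is implementation-specific, and the \xC0\x80 special case in the text applies to UTF-8 only.

```python
import codecs


def tcl8_like(exc):
    """Decode-error handler approximating the tcl8 profile as described above."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    data = exc.object
    # Special case from the text: the illegal pair C0 80 becomes U+0000.
    if data[exc.start:exc.start + 2] == b"\xc0\x80":
        return "\x00", exc.start + 2
    # Otherwise map each offending byte to the code point of the same value,
    # as if that byte had been ISO8859-1 encoded.
    bad = data[exc.start:exc.end]
    return "".join(chr(b) for b in bad), exc.end


codecs.register_error("tcl8_like", tcl8_like)

decoded = b"A\xffB".decode("utf-8", "tcl8_like")   # \xFF passes through as U+00FF
nul = b"\xc0\x80".decode("utf-8", "tcl8_like")     # special case: U+0000
# Encoding direction: an unencodable character becomes the fallback "?".
fallback = "\u00e9".encode("ascii", "replace")
```

The first result shows why the text warns about silent misinterpretation: the invalid byte is accepted and turned into a legitimate-looking character with no error reported.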
The strict profile
The strict profile implements strictly conformant behavior as defined
in the Unicode standard. All error cases result in the error being signalled.
With respect to the current implementation, the strict profile
replaces the -strict option of TIP 346.
The replace profile
The replace profile implements an alternate conformant behavior defined
in the Unicode standard.
When converting an encoded byte sequence to a Tcl string, invalid byte sequences are replaced by the U+FFFD REPLACEMENT CHARACTER code point.
When encoding a Tcl string, characters that cannot be represented in the
target encoding are transformed to an encoding-specific fallback character,
U+FFFD REPLACEMENT CHARACTER for UTF targets and generally ? for other
encodings.
When multiple successive invalid bytes are encountered, the Unicode standard
allows for their substitution with a single or multiple replacement characters.
The replace profile conforms to this.
There is no equivalent to the replace profile in the current TIP 346/601
based 8.7 implementation.
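For comparison, Python's built-in "replace" error handler behaves much like the replace profile when decoding, so it serves as a small illustration here. This is an analogy only: on the encoding side Python substitutes ? for every target, whereas the profile described above uses U+FFFD for UTF targets. The helper replacement_count is hypothetical.

```python
def replacement_count(data: bytes, encoding: str = "utf-8") -> int:
    """Count the U+FFFD characters produced when decoding with replacement."""
    return data.decode(encoding, errors="replace").count("\ufffd")


# A single invalid byte becomes one U+FFFD; valid neighbours are preserved.
decoded = b"A\xffB".decode("utf-8", errors="replace")
# Two independent invalid bytes become two replacement characters.
two = b"\xff\xff".decode("utf-8", errors="replace")
# A truncated multibyte sequence: the standard permits one or several
# replacement characters, so the exact count is the decoder's choice.
truncated = replacement_count(b"\xe2\x82")
# Encoding to a non-UTF target falls back to "?".
encoded = "caf\u00e9".encode("ascii", errors="replace")
```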
The default profile
This TIP does not specify the default profile to be used. That is the subject of a separate TIP.
The encoding profiles command
A new command is added that will return the names of implemented profiles.
encoding profiles
Changes to encoding convertfrom and encoding convertto
The commands encoding convertfrom and encoding convertto support
a new option -profile that takes a profile name as value. The -strict
and -nocomplain options are no longer supported. The commands take
the form
encoding convertfrom DATA
encoding convertfrom ?-profile PROFILE? ?-failindex VAR? ENCODING DATA
encoding convertto DATA
encoding convertto ?-profile PROFILE? ?-failindex VAR? ENCODING DATA
The syntax is backward compatible with 8.6. However, it differs from the current 8.7/9.0 implementation in that there is no ambiguity. In the current implementation, when two arguments are supplied, it tries to guess whether the first is an option or an encoding name. With the above syntax, if any options are specified, the encoding must be explicitly specified as well. Note it is possible to relax this based on odd/evenness of the argument count but that would make it trickier to add options in the future that do not take an argument.
The -profile option specifies the profile to be used for the conversion
as described earlier. If multiple -profile options are passed, the last
one will be used.
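The unambiguous argument layout can be sketched as a small parser. This is an illustrative model in Python, not the Tcl implementation; the function name parse_convert_args and the returned dictionary shape are hypothetical, and the even-argument-count check assumes that every option takes a value, as both options defined here do.

```python
def parse_convert_args(args):
    """Parse [DATA] or [?-profile P? ?-failindex V? ENCODING DATA]."""
    if len(args) == 1:
        # One argument: it is always DATA, and the encoding is left unspecified.
        return {"profile": None, "failindex": None,
                "encoding": None, "data": args[0]}
    if len(args) < 2 or len(args) % 2 != 0:
        raise ValueError("expected ?-option value ...? ENCODING DATA")
    # With options present, the last two words are always ENCODING and DATA.
    *option_words, encoding, data = args
    parsed = {"profile": None, "failindex": None}
    for name, value in zip(option_words[0::2], option_words[1::2]):
        if name not in ("-profile", "-failindex"):
            raise ValueError(f"unknown option {name!r}")
        parsed[name[1:]] = value  # a repeated option overwrites the earlier one
    parsed["encoding"] = encoding
    parsed["data"] = data
    return parsed
```

For example, parse_convert_args(["-profile", "strict", "-profile", "replace", "utf-8", "x"]) selects the replace profile, while a three-word call such as ["-profile", "utf-8", "x"] is rejected instead of being guessed at.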
The -failindex option behaves as defined in TIP 607. However, although
not specified in that TIP, in the current 8.7/9.0 implementation the
-failindex option also enables the -strict option. This TIP specifically
proposes that the option not make any implicit selection of profiles.
In addition to the author's opinion that options should be as orthogonal to
each other as possible, the current implied behavior makes it awkward
to write (for example) a proc that takes a profile and returns as much
data as can be read without raising an error. The -failindex option now
only determines whether an exception is raised or decoded data is returned
with error location in the passed variable when processing of the input data
is stopped as determined by the profile.
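The orthogonality argued for above can be modeled in a few lines. In this sketch, which uses Python's codecs as a stand-in, the errors argument plays the role of the profile and want_failindex plays the role of -failindex; neither influences the other. The function name convert_from is hypothetical, and reporting -1 when nothing failed is an assumption made for the sketch.

```python
def convert_from(data: bytes, encoding: str, errors: str = "strict",
                 want_failindex: bool = False):
    """Return (text, failindex); failindex is -1 when nothing failed."""
    try:
        return data.decode(encoding, errors), -1
    except UnicodeDecodeError as exc:
        if not want_failindex:
            raise  # no fail index requested: report the failure as an exception
        # Fail index requested: hand back what was decoded plus the error offset.
        return data[:exc.start].decode(encoding, errors), exc.start
```

With a strict policy, convert_from(b"AB\xffC", "utf-8", want_failindex=True) returns the prefix "AB" and index 2. With errors="replace" the same call never stops early, so the index stays at -1 regardless of whether it was requested.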
New option -profile for fconfigure and chan configure
A new option -profile has been added to the fconfigure command.
The option's value must be a profile name. The encoding transforms in use
for the channel's input and output will then be subject to the rules of that
profile. Any failures will result in a channel error. The mode of reporting
channel error is a function of the channel subsystem and not defined by this
TIP.
The -strictencoding and -nocomplainencoding options that were
defined by the earlier TIPs and currently implemented in 8.7 and 9.0
alphas are supplanted by -profile and removed.
Changes to the C APIs
Two new functions, Tcl_ExternalToUtfDStringEx and Tcl_UtfToExternalDStringEx,
related to encoding transforms were added by TIP 601 for 8.7. These had the signatures
Tcl_Size Tcl_ExternalToUtfDStringEx(Tcl_Encoding encoding, const char *src, int srcLen, int flags, Tcl_DString *dsPtr);
Tcl_Size Tcl_UtfToExternalDStringEx(Tcl_Encoding encoding, const char *src, int srcLen, int flags, Tcl_DString *dsPtr);
This TIP changes the signatures of these functions to the following:
int
Tcl_ExternalToUtfDStringEx(
    Tcl_Interp *interp,     /* For error messages. May be NULL. */
    Tcl_Encoding encoding,  /* The encoding for the source string, or NULL
                             * for the default system encoding. */
    const char *src,        /* Source string in specified encoding. */
    Tcl_Size srcLen,        /* Source string length in bytes, or < 0 for
                             * encoding-specific string length. */
    int flags,              /* Conversion control flags. */
    Tcl_DString *dstPtr,    /* Uninitialized or free DString in which the
                             * converted string is stored. Must be freed
                             * after return irrespective of return value. */
    Tcl_Size *errorLocPtr); /* Where to store the error location
                             * (or TCL_INDEX_NONE if no error). May be NULL. */
int
Tcl_UtfToExternalDStringEx(
    Tcl_Interp *interp,     /* For error messages. May be NULL. */
    Tcl_Encoding encoding,  /* The encoding for the converted string, or
                             * NULL for the default system encoding. */
    const char *src,        /* Source string in UTF-8. */
    Tcl_Size srcLen,        /* Source string length in bytes, or < 0 for
                             * strlen(). */
    int flags,              /* Conversion control flags. */
    Tcl_DString *dstPtr,    /* Uninitialized or free DString in which the
                             * converted string is stored. Must be freed
                             * after return irrespective of return value. */
    Tcl_Size *errorLocPtr); /* Where to store the error location
                             * (or TCL_INDEX_NONE if no error). May be NULL. */
The Tcl_ExternalToUtfDStringEx function converts a source buffer from the
specified encoding into UTF-8. The Tcl_UtfToExternalDStringEx function does
the converse, converting a source buffer from UTF-8 to the specified encoding.
The flags parameter may be composed from OR-ing the following values:
At most one of TCL_ENCODING_PROFILE_DEFAULT, TCL_ENCODING_PROFILE_TCL8, TCL_ENCODING_PROFILE_STRICT and TCL_ENCODING_PROFILE_REPLACE. If none are specified, a version-dependent default profile is used.
For reasons of backward compatibility and consistency with 8.6 functions, the TCL_ENCODING_STOPONERROR flag remains. It has the same effect as specifying the TCL_ENCODING_PROFILE_STRICT flag, overriding any other profile flags that might have been specified.
To preserve future compatibility, any other bits will result in an error being
returned. In particular, callers should not set the TCL_ENCODING_START and
TCL_ENCODING_END flags, as those are not supported by the above functions
(even in the current pre-profile implementation) because they do not implement
streaming operation.
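The flag rules above amount to a small decision procedure, sketched here in Python. The constants are stand-ins invented for this sketch: their names drop the TCL_ENCODING_ prefix on purpose, their bit values are placeholders and not the values in tcl.h, and modeling each profile as a separate bit is an assumption that makes the "at most one" rule checkable. The function name resolve_profile is also hypothetical.

```python
# Placeholder bit values for this sketch only; NOT the real tcl.h values.
PROFILE_DEFAULT = 0x0
PROFILE_TCL8 = 0x1
PROFILE_STRICT = 0x2
PROFILE_REPLACE = 0x4
STOPONERROR = 0x8

_PROFILE_NAMES = {PROFILE_TCL8: "tcl8", PROFILE_STRICT: "strict",
                  PROFILE_REPLACE: "replace"}
_PROFILE_BITS = PROFILE_TCL8 | PROFILE_STRICT | PROFILE_REPLACE
_KNOWN_BITS = _PROFILE_BITS | STOPONERROR


def resolve_profile(flags: int, default: str = "tcl8") -> str:
    """Map a flags word to a profile name following the rules in the text."""
    if flags & ~_KNOWN_BITS:
        # Any bit outside the documented set is rejected.
        raise ValueError("unsupported flag bits")
    if flags & STOPONERROR:
        # The legacy flag acts like strict and overrides other profile flags.
        return "strict"
    selected = flags & _PROFILE_BITS
    if selected == 0:
        return default  # stands in for the version-dependent default
    if selected not in _PROFILE_NAMES:
        raise ValueError("more than one profile flag specified")
    return _PROFILE_NAMES[selected]
```

Under these assumptions, resolve_profile(PROFILE_REPLACE | STOPONERROR) yields "strict", and a word with an unrecognized bit set raises an error rather than being silently ignored.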
Both functions have the same set of return values:
TCL_OK: Success. The converted string is in *dstPtr and is NUL terminated in an encoding-specific manner.
TCL_ERROR: Error other than a conversion error, e.g. invalid parameter values. An error message is stored in the interpreter.
TCL_CONVERT_MULTIBYTE: The source ends in a truncated multibyte sequence.
TCL_CONVERT_SYNTAX: The source is not conformant to the encoding definition.
TCL_CONVERT_UNKNOWN: The source contained a character that could not be represented in the target encoding.
In the case of the TCL_CONVERT_* return codes,
If errorLocPtr is NULL, an error message is stored in the interpreter, provided the interpreter is not NULL.
If errorLocPtr is not NULL, no error message is stored as it is expected the caller is interested in whatever is decoded so far and not treating this as an error condition.
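The interplay between the return code, the error location and the error message can be modeled as follows. This is a Python sketch of the convention only, for the decoding direction; the names Status and external_to_utf are hypothetical, and the classification relies on the reason strings reported by CPython's UTF-8 decoder, which is an implementation detail assumed for the example.

```python
from enum import Enum


class Status(Enum):
    OK = "TCL_OK"
    MULTIBYTE = "TCL_CONVERT_MULTIBYTE"
    SYNTAX = "TCL_CONVERT_SYNTAX"


def external_to_utf(data: bytes, encoding: str = "utf-8",
                    want_error_loc: bool = False):
    """Return (status, decoded_prefix, error_loc, message)."""
    try:
        return Status.OK, data.decode(encoding), -1, None
    except UnicodeDecodeError as exc:
        # Assumption: CPython reports truncated input with this reason string.
        truncated = exc.reason == "unexpected end of data"
        status = Status.MULTIBYTE if truncated else Status.SYNTAX
        prefix = data[:exc.start].decode(encoding)
        if want_error_loc:
            # Caller asked for the location: no message, partial data is the goal.
            return status, prefix, exc.start, None
        # No location requested: describe the failure in a message instead.
        message = f"{status.value} at byte {exc.start}: {exc.reason}"
        return status, prefix, None, message
```

For instance, external_to_utf(b"A\xc2", want_error_loc=True) reports a truncated sequence at offset 1 with no message, whereas the same call without want_error_loc carries a message and no location.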
Differences from the current 8.7 API
As stated above, the signatures of the functions differ from the currently implemented 8.7 and 9.0 APIs. The new signatures are motivated by:
The older signatures had no mechanism to signal an error other than encoding errors in the data stream. In particular, there was no way to signal errors in parameter values. Examples would be invalid profiles, conflicting flags, flags not applicable to the given encoding etc.
Callers had to generate error messages themselves based on error codes, including computing the error offset, offending character etc. This is both inconvenient and avoidable duplication. Passing the interpreter for the purposes of retrieving error messages is a common convention in the Tcl core.
In addition to the change in signatures, the TCL_ENCODING_NOCOMPLAIN,
TCL_ENCODING_STRICT and TCL_ENCODING_MODIFIED bit flags have been removed.
These were not present in Tcl 8.6 so there is no backward compatibility issue.
The first two have been supplanted by the profile related flags. The
TCL_ENCODING_MODIFIED bit was intended to be used to specify a variant of
UTF-8 or CESU-8 for encoding nul bytes as \xC0\x80. This is never set
internally within the Tcl core and not accessible at the script level either.
The motivation for eliminating it arises from the belief that this is actually an
encoding and best modeled as such instead of through flags. If encoding variants
are enabled through flags, then why not CESU-8 as a variant of UTF-8, or
UTF-16LE/UTF-16BE as variants of UTF-16? As an aside, other languages also treat
this "modified" UTF-8 as a separate selectable encoding. A separate encoding
would also make it usable from the script level if so desired without
changing the API.
Implementation
Implementation and tests for Tcl 8.7 and 9.0 are available in the
tip-656 and tip-656-tcl9 branches respectively. Currently these still use
the -encodingprofile option name and will be changed to -profile, dependent
on TIP approval. Manpages have not been updated.
Alternative proposals
There have been a couple of alternatives proposed on the mailing list.
Finer granularity of error class selection
The first is an -onerror option which is similar to the -profile option
but allows for finer granularity.
encoding convertfrom -onerror {surrogates invalid wrongcode} ....
encoding convertfrom -handle {SURROGATE error INVALID replace INCOMPLETE ignore ...}
I am not in favor of this, as I expect it to add considerable complexity to the implementation and test suites while being minimally useful to the end user. (It feels like over-generalization to me. How often would a user want to distinguish between invalid cases?)
Include the profile within the encoding parameter
Another syntactic alternative proposed was to embed the error handling options into the encoding argument.
encoding convertfrom {utf-8 strict}
fconfigure CHANNEL -encoding {utf-8 strict}
Since the difference is primarily in command option processing, implementation changes are not many. I prefer the first form from a stylistic perspective. For example, the latter makes it a little more awkward to request a profile using the default encoding.
Alternative fconfigure option name
The original option to fconfigure proposed by this TIP was -encodingprofile.
That has been renamed to -profile as per Jan's suggestion. This is both
less wordy and does not conflict with -encoding when used in shorter forms.
Copyright
This document has been placed in the public domain.