Author: Nathan Coulter <org.tcl-lang.tips@pooryorick.com>
State: Draft
Type: Project
Vote: Pending
Created: 10-Jan-2023
Tcl-Version: 8.7
Obsoleted-By: 656
Abstract
Previous attempts to articulate the options for handling non-conforming data for a character set encoding have resulted in a set of available options that are not well defined. This TIP articulates the fundamental optional behaviours and proposes a new set of names for them.
Definitions
Non-conforming representation
One or more bytes that do not conform to the specification for the representation of code points in the encoding.
Non-conforming code point
One or more bytes that conform to the specification for the representation of code points in the encoding, but whose represented code points do not conform to the rules for the encoding.
Non-conforming data
Both non-conforming representations and non-conforming code points.
Specification
-nocomplain is no longer an option.
The "encoding" value of encoding convertfrom, encoding convertto, and chan configure -encoding is a dictionary (or at least conceptually one). The first key in the dictionary is optional; if omitted, it is "name". The "name" key provides the name of the encoding. Examples:
chan configure $chan -encoding utf-8
chan configure $chan -encoding {name utf-8}
chan configure $chan -encoding {utf-8 profile strict}
chan configure $chan -encoding {name utf-8 profile strict}
chan configure $chan -encoding returns the name of the encoding for the channel, as it always has.
chan configure $chan -encoding* returns a dictionary describing the configuration of the encoder for the channel.
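The shorthand rule for the "name" key can be illustrated with a small sketch. The procedure normalizeEncoding is hypothetical and not part of Tcl or of this proposal. It assumes that an encoding value with an odd number of elements is one whose leading "name" key was omitted, which is one plausible reading of the rule above.

```tcl
# Hypothetical helper, not part of Tcl: expand the shorthand form of an
# encoding value so that the "name" key is always explicit.
proc normalizeEncoding {value} {
    # Assumption: an odd element count means the leading "name" key
    # was omitted and the first element is the encoding name itself.
    if {[llength $value] % 2 == 1} {
        set value [linsert $value 0 name]
    }
    # Reject values that still carry no encoding name.
    if {![dict exists $value name]} {
        error "encoding value lacks a name: $value"
    }
    return $value
}

puts [normalizeEncoding utf-8]                        ;# name utf-8
puts [normalizeEncoding {utf-8 profile strict}]       ;# name utf-8 profile strict
puts [normalizeEncoding {name utf-8 profile strict}]  ;# name utf-8 profile strict
```

Under this reading, all four forms shown in the examples reduce to a dictionary with an explicit "name" key, so the remaining keys can be looked up with dict get.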
The "profile" key provides the encoding profile. Each profile is independent of the others, and activating one profile cancels any previously active profile. The profiles are identified below by the options that activate them. For each option there is a corresponding channel configuration option prefixed with the word "encoding".
The profiles are:
discard
Not strict. Discards non-conforming data by omitting them from the output.
surrogate
Not strict. Each byte of non-conforming data is transformed into a single low surrogate code point that can be transformed back to the original byte, as described in Unicode Security Considerations. This accomplishes the same purpose as tag, but requires only one character per byte instead of two.
pass
Not strict. Each non-conforming byte becomes the character having the Unicode code-point represented by that single byte.
replace
Not strict. When converting an encoded byte sequence to a Tcl string, invalid byte sequences are replaced by the U+FFFD REPLACEMENT CHARACTER code point.
When encoding a Tcl string, characters that cannot be represented in the target encoding are transformed to an encoding-specific fallback character: U+FFFD REPLACEMENT CHARACTER for UTF targets and generally ? for other encodings.
report
Not strict. Like pass, but result errors in the return options is a dictionary where each key is the starting index in the result of a non-conforming substring and each value is the corresponding ending index.
strict
Strictly conform to the specification for the encoding. It is an error for non-conforming data to occur.
tag
Not strict. Like pass, but tags each non-conforming byte by prefixing it with a replacement character, which is normally the standard replacement character for the encoding. Each occurrence of the replacement character itself is also prefixed with the replacement character.
tcl8
The same as pass, but may diverge in the future if it is discovered that Tcl 8 behaviour does not mirror that described for pass.
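To make the difference between some of the profiles concrete, the following is a minimal sketch rather than the proposed implementation. It uses a hypothetical procedure, asciiDecode, that decodes bytes as ASCII and treats every byte above 0x7F as non-conforming data. Only discard, pass, replace, and strict are modelled; the other profiles are left out because their exact mappings are not pinned down here.

```tcl
# Hypothetical helper, not part of Tcl: decode a byte string as ASCII,
# treating every byte above 0x7F as non-conforming data.
proc asciiDecode {bytes profile} {
    set out {}
    set values {}
    # Split the input into unsigned byte values.
    binary scan $bytes cu* values
    foreach b $values {
        if {$b < 0x80} {
            # Conforming byte: the character is the byte value itself.
            append out [format %c $b]
            continue
        }
        switch -- $profile {
            discard { }
            pass    { append out [format %c $b] }
            replace { append out \uFFFD }
            strict  { error "non-conforming byte 0x[format %02X $b]" }
            default { error "unsupported profile \"$profile\"" }
        }
    }
    return $out
}

set data [binary format H* 4142ff43]   ;# the bytes "A", "B", 0xFF, "C"
puts [string length [asciiDecode $data discard]]   ;# 3
puts [string length [asciiDecode $data pass]]      ;# 4
puts [catch {asciiDecode $data strict}]            ;# 1
```

With discard the offending byte disappears from the result, with pass it becomes the character U+00FF, with replace it becomes U+FFFD, and with strict the conversion fails.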
Rationale
-strict was introduced in TIP 346, which focused narrowly on issues surrounding byte arrays and non-mappable characters. It should instead have focused on conformance to the chosen encoding, which is more fundamental. -nocomplain was subsequently introduced in TIP 601, but that TIP did not describe its relationship to -strict. It turned out to be nothing more than the inverse of -strict, and has already been eliminated in the implementation of Tcl's internal encoding/decoding functions.
Under this proposal, the syntax for specifying an encoding and its options is the same for both encoding convertto/from and chan configure -encoding, which simplifies the interface.
Implementation
Implementation is in progress.
Copyright
This document has been placed in the public domain.