Author: Ashok P. Nadkarni <apnmbx-public@yahoo.com>
Tcl-Version: 9.1
State: Draft
Type: Project
Created: 2025-09-19
Tcl-Branch: tip-737
Abstract
This TIP defines new C functions Tcl_UtfToExternalEx and
Tcl_ExternalToUtfEx that parallel the existing Tcl_UtfToExternal and
Tcl_ExternalToUtf functions but support lengths up to TCL_SIZE_MAX.
The TIP also proposes some related changes to existing functionality: making
public some internally used features, potential performance improvements for
some scenarios, and fixes for some inconsistencies.
Motivation
The existing functions Tcl_ExternalToUtf and Tcl_UtfToExternal accept
lengths up to TCL_SIZE_MAX on input but only support lengths up to INT_MAX on
returned output buffers. Tcl extensions that deal with longer strings have to
break them up and loop over the fragments, a non-trivial task given
variable-length encodings, character and space limits, and different profiles.
The new Ex variants hide these complexities and allow handling of such strings
in a straightforward fashion.
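To illustrate why fragmenting is non-trivial, the following is a minimal sketch of just one of the concerns: choosing a fragment boundary in UTF-8 source data so that no multi-byte sequence is split. The function name utf8_safe_prefix is hypothetical and not part of Tcl, and the sketch assumes well-formed UTF-8. Real code would additionally have to track encoding state, profiles and character limits.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of Tcl: return the largest prefix length
 * that is <= limit and does not end inside a multi-byte UTF-8 sequence.
 * Assumes src holds well-formed UTF-8.
 */
size_t utf8_safe_prefix(const char *src, size_t len, size_t limit)
{
    size_t n;

    assert(src != NULL || len == 0);
    if (limit >= len) {
        return len;             /* Everything fits in one fragment. */
    }
    n = limit;
    /*
     * src[n] would become the first byte of the next fragment. If it is a
     * continuation byte (bit pattern 10xxxxxx) the cut falls inside a
     * sequence, so back up until it lands on a lead byte.
     */
    while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80) {
        n--;
    }
    return n;
}
```

A caller would pass INT_MAX (or a smaller working size) as limit and convert the returned number of bytes in each iteration.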
Additionally, the current non-Ex functions suffer from defects in some cases
even when the output fits within INT_MAX. For example, when the target encoding
is more compact than the source encoding, more than INT_MAX source bytes may be
transformed, but there is no way to return that count via the srcReadPtr
parameter, which points to an int variable.
The TCL_ENCODING_NO_TERMINATE flag to Tcl_ExternalToUtf controls whether
the result is null-terminated. However, the inverse Tcl_UtfToExternal function
does not support this flag, which feels inconsistent. Further, the flag is
not made public in the documentation though it would be useful in applications
as well. The TIP therefore proposes to also implement the flag for the
Tcl_UtfToExternal function (as well as the new Ex variants) and make it
public in the documentation.
Finally, the existing implementation has a strange anomaly: when asked to
encode a fixed number of characters (not bytes) via the
TCL_ENCODING_CHAR_LIMIT flag, the low-level encoders actually encode an
extra character, destination buffer space permitting (see, for example, the
check against charLimit in UtfToUtfProc). The caller then reduces the
destination buffer size and calls the low-level encoders again with the smaller
buffer so that exactly the requested number of characters is returned. It is
clear this was done with intent, but I have not deciphered the reason and there
is no explanatory comment. The circumstances in which this behaviour is
triggered depend on the estimate of the destination buffer size calculated from
prior invocations. Nevertheless, with large strings, iterating over the entire
input twice is detrimental to performance. The TIP implementation avoids this
by iterating only once. This also necessitates some small changes in the
channel subsystem.
Specification
New functions Tcl_UtfToExternalEx and Tcl_ExternalToUtfEx
The following new public API is defined.
int Tcl_ExternalToUtfEx(
Tcl_Interp *interp, Tcl_Encoding encoding, const char *src,
Tcl_Size srcLen, int flags, Tcl_EncodingState *statePtr, char *dst,
Tcl_Size dstLen, Tcl_Size *srcReadPtr, Tcl_Size *dstWrotePtr,
Tcl_Size *dstCharsPtr);
int Tcl_UtfToExternalEx(
Tcl_Interp *interp, Tcl_Encoding encoding, const char *src,
Tcl_Size srcLen, int flags, Tcl_EncodingState *statePtr, char *dst,
Tcl_Size dstLen, Tcl_Size *srcReadPtr, Tcl_Size *dstWrotePtr,
Tcl_Size *dstCharsPtr);
The signatures are equivalent to those of the existing Tcl_ExternalToUtf and
Tcl_UtfToExternal functions respectively except that the srcReadPtr,
dstWrotePtr and dstCharsPtr parameters are of type Tcl_Size * instead of
int *. The following expands on the description of those functions in the
current manpages, which is both sparse and underspecified.
The Tcl_ExternalToUtfEx function converts bytes in the encoding encoding at
address src into TUTF-8 encoding, storing them in the buffer at address dst.
Conversely, Tcl_UtfToExternalEx converts bytes encoded in TUTF-8 at address
src into the encoding given by encoding, storing them in the output buffer
at address dst. In both cases, srcLen specifies the number of bytes to be
converted and dstLen specifies the size of the output buffer.
The flags parameter is a bitmask that controls operation of the conversion and
should be a bitwise OR of zero or more of the following values:
At most one of the profile selection flags listed in the PROFILES section of the manpage.
The TCL_ENCODING_NO_TERMINATE flag disables null termination of the output. By default, the output buffer dst is terminated with an encoding-appropriate null.
The TCL_ENCODING_START and TCL_ENCODING_END flags indicate whether the source bytes correspond to the first or last blocks, respectively, in a source stream. TCL_ENCODING_START will cause the conversion routine to reset to an initial state ready to process the first byte of an encoded stream. TCL_ENCODING_END indicates the source buffer is the last block in an input stream, allowing any required finalization to be performed. Any incomplete trailing characters will then be treated as per the encoding profile in effect. Both flags may be specified in the same call when all data to be converted is passed in a single block. Both flags are also presumed to be implicitly set if the statePtr parameter is passed as NULL.
The TCL_ENCODING_CHAR_LIMIT flag specifies that the functions should not convert more characters than the number passed through the dstCharsPtr argument, if not NULL. This flag is only supported by the Tcl_ExternalToUtfEx function and should not be passed to other functions.
The statePtr parameter is an opaque pointer to a location used by the encoding
functions to store intermediate state when the data to be converted is passed
in multiple chunks. The same location should be passed in statePtr for all
related calls with the first chunk passed with the TCL_ENCODING_START
flag set and the last with TCL_ENCODING_END set. If statePtr is passed
as NULL, all data is presumed to be contained in that single call and
the functions behave as if TCL_ENCODING_START and TCL_ENCODING_END were
both set in the flags parameter.
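The chunking convention described above can be sketched as a small helper that computes the flags for each call in a multi-chunk conversion. The name chunk_flags and the SKETCH_ constants are hypothetical stand-ins so that the fragment compiles without tcl.h. Real code would use TCL_ENCODING_START and TCL_ENCODING_END and pass the same statePtr on every call.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder flag values standing in for the ones defined in tcl.h. */
enum { SKETCH_ENCODING_START = 0x01, SKETCH_ENCODING_END = 0x02 };

/*
 * Hypothetical helper: flags for chunk number index (0-based) out of count
 * chunks. The first chunk gets START, the last gets END, and a single
 * chunk gets both, matching the behaviour when statePtr is NULL.
 */
int chunk_flags(size_t index, size_t count)
{
    int flags = 0;

    assert(index < count);
    if (index == 0) {
        flags |= SKETCH_ENCODING_START;
    }
    if (index + 1 == count) {
        flags |= SKETCH_ENCODING_END;
    }
    return flags;
}
```

Any profile flag or TCL_ENCODING_NO_TERMINATE would be OR-ed into the returned value by the caller.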
The srcReadPtr, dstWrotePtr and dstCharsPtr parameters point to locations to hold the
number of bytes in the source that were processed successfully, the number of
bytes written to the output buffer, and the number of characters written to the
output buffer respectively. All three are optional and may be passed as NULL.
Further, in the case of the Tcl_ExternalToUtfEx function, the dstCharsPtr
may be used with the TCL_ENCODING_CHAR_LIMIT flag to limit the number of
characters processed as described earlier.
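The in/out use of dstCharsPtr as a character limit can be illustrated with a toy single-pass routine. This is only a sketch under stated assumptions: it copies UTF-8 to UTF-8 without null termination, requires all three count pointers to be non-NULL, and uses hypothetical names (Size, SK_ codes, copy_chars_limited) rather than anything from tcl.h. It is not the TIP implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef ptrdiff_t Size;     /* Stand-in for Tcl_Size. */
enum { SK_OK = 0, SK_MULTIBYTE = -1, SK_NOSPACE = -4 }; /* Placeholder codes. */

/* Byte length of the UTF-8 sequence introduced by the given lead byte. */
static Size utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;               /* Stray byte: counted as one unit in this sketch. */
}

/*
 * On entry *dstCharsPtr holds the character limit; on return it holds the
 * number of characters actually written. The input is scanned only once.
 */
int copy_chars_limited(const char *src, Size srcLen, char *dst, Size dstLen,
        Size *srcReadPtr, Size *dstWrotePtr, Size *dstCharsPtr)
{
    Size limit, pos = 0, chars = 0;
    int rc = SK_OK;

    assert(srcReadPtr != NULL && dstWrotePtr != NULL && dstCharsPtr != NULL);
    limit = *dstCharsPtr;
    while (pos < srcLen && chars < limit) {
        Size n = utf8_seq_len((unsigned char)src[pos]);

        if (n > srcLen - pos) { rc = SK_MULTIBYTE; break; } /* Truncated input. */
        if (n > dstLen - pos) { rc = SK_NOSPACE; break; }   /* Output is full. */
        memcpy(dst + pos, src + pos, (size_t)n);
        pos += n;
        chars++;
    }
    *srcReadPtr = pos;      /* Identity copy: bytes read equal bytes written. */
    *dstWrotePtr = pos;
    *dstCharsPtr = chars;
    return rc;
}
```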
With the exceptions noted below, the counts returned in srcReadPtr,
dstWrotePtr and dstCharsPtr are valid for all return codes listed below
other than TCL_ERROR.
The functions return one of the following codes.
TCL_OK - the function completed without any exceptional conditions. Note this
does not mean all passed input in src was processed or verified. In
particular, in the case of the caller passing TCL_ENCODING_CHAR_LIMIT to limit
the number of characters converted, only the corresponding number of bytes in
the source input would have been processed as indicated by the value in
srcReadPtr.
TCL_CONVERT_NOSPACE - the output buffer had insufficient space. The output
buffer will contain as much converted data as it could fit and will be null
terminated as appropriate unless the buffer was too small to even contain a null
terminator by itself. srcReadPtr will hold the number of processed source bytes
and the caller should call again to process the remaining bytes. Note: As a quirk
of implementation, in some cases the destination buffer needs to be
TCL_UTF_MAX bytes greater than the actual size needed. This is an existing
quirk present in both 8.6 and 9.0 that is not addressed in this TIP.
TCL_CONVERT_MULTIBYTE - the trailing bytes in the source input formed an
incomplete encoding sequence. The caller should call the function again with
additional source bytes appended to the tail at offset *srcReadPtr of the
original source bytes and with the same statePtr. Note that if flags had the
TCL_ENCODING_END flag set, indicating no more data is forthcoming, the
functions will return TCL_CONVERT_SYNTAX instead of TCL_CONVERT_MULTIBYTE.
TCL_CONVERT_SYNTAX - an invalid byte sequence was detected in the source
input. What constitutes an "invalid" sequence is subject to the
encoding profile as specified by the flags parameter. The *srcReadPtr count
will contain the number of bytes successfully processed and is therefore also
the offset of the start of the invalid sequence. The output buffer will contain
the converted data up to that point.
TCL_CONVERT_UNKNOWN - the input byte sequence represented a character that
cannot be encoded in the output encoding for the encoding profile in effect.
Treatment is similar to that of TCL_CONVERT_SYNTAX.
TCL_ERROR - an error message is stored in interp if it is not NULL. The
output locations at srcReadPtr, dstWrotePtr and dstCharsPtr may have been
modified but should not be considered valid.
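One way a caller might drive these return codes is sketched below. Because the Ex functions exist only in the TIP branch, the sketch uses local stand-ins: Size for Tcl_Size, SK_ constants for the return codes, a simplified ConvertProc signature without the interp, encoding, flags and state arguments, and a toy converter that merely copies bytes. All of these names are hypothetical. Only the loop structure is the point: flush what was written, advance by the number of source bytes read, and call again while the converter reports insufficient space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef ptrdiff_t Size;     /* Stand-in for Tcl_Size. */

/* Placeholder return codes; real code would use the names from tcl.h. */
enum {
    SK_OK = 0, SK_ERROR = 1, SK_MULTIBYTE = -1,
    SK_SYNTAX = -2, SK_UNKNOWN = -3, SK_NOSPACE = -4
};

/* Simplified, hypothetical converter signature. */
typedef int ConvertProc(const char *src, Size srcLen, char *dst, Size dstLen,
        Size *srcReadPtr, Size *dstWrotePtr);

/* Toy converter: copies bytes and reports SK_NOSPACE when dst fills up. */
int toy_convert(const char *src, Size srcLen, char *dst, Size dstLen,
        Size *srcReadPtr, Size *dstWrotePtr)
{
    Size n = (srcLen < dstLen) ? srcLen : dstLen;

    memcpy(dst, src, (size_t)n);
    *srcReadPtr = n;
    *dstWrotePtr = n;
    return (n < srcLen) ? SK_NOSPACE : SK_OK;
}

/* Convert all of src through a small scratch buffer, appending to out. */
int convert_all(ConvertProc *convert, const char *src, Size srcLen,
        char *out, Size outCap, Size *outLenPtr)
{
    char scratch[8];        /* Deliberately tiny to force SK_NOSPACE. */
    Size done = 0, outLen = 0;
    int rc;

    assert(convert != NULL && outLenPtr != NULL);
    do {
        Size nread = 0, nwrote = 0;

        rc = convert(src + done, srcLen - done, scratch,
                (Size)sizeof(scratch), &nread, &nwrote);
        if (rc == SK_ERROR) {
            return rc;      /* Counts are not valid; report nothing. */
        }
        if (nwrote > outCap - outLen) {
            rc = SK_NOSPACE; /* The caller's own sink is full. */
            break;
        }
        memcpy(out + outLen, scratch, (size_t)nwrote);
        outLen += nwrote;
        done += nread;
        if (nread == 0 && nwrote == 0) {
            break;          /* No progress; avoid looping forever. */
        }
    } while (rc == SK_NOSPACE);
    *outLenPtr = outLen;
    return rc;
}
```

The no-progress check matters in practice because, as noted for TCL_CONVERT_NOSPACE, a destination buffer that is too small may accept nothing at all.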
Changes to existing functions Tcl_UtfToExternal and Tcl_ExternalToUtf
The TCL_ENCODING_NO_TERMINATE flag is implemented for Tcl_UtfToExternal
and documented in manpages for both functions.
The TCL_ENCODING_CHAR_LIMIT flag is documented. It is not implemented for
Tcl_UtfToExternal[Ex] because it would require changes to the low level
encoding API which is public.
Both functions will return TCL_ERROR if the caller has asked for counts to be written
via srcReadPtr, dstWrotePtr or dstCharsPtr, and one or more of those
values is greater than INT_MAX. Note this is a potential incompatibility
since these functions are not documented to return TCL_ERROR as a return
code. An alternative would be to clamp inputs to INT_MAX but that leads to
misleading return codes and outputs which has potential for more confusion than
returning TCL_ERROR. This is a rare condition in practice in any case.
Documentation will be updated to mark these functions as deprecated and
recommend the use of the new Ex variants.
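The INT_MAX check on returned counts described above might look roughly like the following sketch. The helper name store_count_as_int and the Size typedef (a 64-bit stand-in for Tcl_Size) are hypothetical, and the return value 1 merely mirrors TCL_ERROR. The actual implementation in the tip-737 branch may be structured differently.

```c
#include <limits.h>
#include <stddef.h>

typedef long long Size;     /* 64-bit stand-in for Tcl_Size. */

/*
 * Hypothetical helper: store count into *intPtr if the caller asked for it
 * and the value is representable as an int. Returns 0 on success and
 * 1 (mirroring TCL_ERROR) when the count does not fit, in which case
 * *intPtr is left unmodified.
 */
int store_count_as_int(Size count, int *intPtr)
{
    if (intPtr == NULL) {
        return 0;           /* The caller did not ask for this count. */
    }
    if (count < 0 || count > INT_MAX) {
        return 1;           /* Report an error rather than clamp. */
    }
    *intPtr = (int)count;
    return 0;
}
```

Note that a count exceeding INT_MAX is not an error when the corresponding pointer is NULL, since nothing needs to be narrowed in that case.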
Testing
Test cases in utfext.test have been adapted, with those that test long
(greater than INT_MAX) strings placed under the bigdata constraint.
Implementation notes
Implementation is in the tip-737 branch.
Copyright
This document has been placed in the public domain.