TIP 737: Expanded encoding C API

Author:             Ashok P. Nadkarni <apnmbx-public@yahoo.com>
Tcl-Version:        9.1
State:              Draft
Type:               Project
Created:            2025-09-19
Tcl-Branch:         tip-737

Abstract

This TIP defines new C functions Tcl_UtfToExternalEx and Tcl_ExternalToUtfEx that parallel the existing Tcl_UtfToExternal and Tcl_ExternalToUtf functions but support lengths up to TCL_SIZE_MAX. It also proposes related changes to existing functionality: making some internally used features public, improving performance in some scenarios, and fixing some inconsistencies.

Motivation

The existing functions Tcl_ExternalToUtf and Tcl_UtfToExternal accept lengths up to TCL_SIZE_MAX on input but only support lengths up to INT_MAX on returned output buffers. Tcl extensions that deal with longer strings have to break them up and loop over the fragments, a non-trivial task given variable-length encodings, character and space limits, and different profiles. The new Ex variants hide these complexities and allow such strings to be handled in a straightforward fashion.

Additionally, the current non-Ex functions are defective in some cases even when the output fits within INT_MAX. For example, when the target encoding is more compact than the source encoding, more than INT_MAX source bytes may be transformed, but there is no way to return that count via the srcReadPtr parameter, which points to an int variable.

The TCL_ENCODING_NO_TERMINATE flag to Tcl_ExternalToUtf controls whether the result is null-terminated. However, the inverse Tcl_UtfToExternal function does not support this flag, which feels inconsistent. Further, the flag is not made public in the documentation even though it would be useful in applications as well. The TIP therefore proposes to also implement the flag for the Tcl_UtfToExternal function (as well as the new Ex variants) and to make it public in the documentation.

Finally, the existing implementation has a strange anomaly: when asked to encode a fixed number of characters (not bytes) via the TCL_ENCODING_CHAR_LIMIT flag, the low level encoders actually encode an extra character, destination buffer space permitting (see for example the check against charLimit in UtfToUtfProc). The caller then reduces the destination buffer size and calls the low level encoders again with the smaller buffer so that exactly the requested number of characters is returned. It is clear this was done with intent, but I have not deciphered the reason and there is no explanatory comment. The circumstances in which this behaviour is triggered depend on the estimate of the destination buffer size calculated from prior invocations. Nevertheless, with large strings this behaviour of iterating over the entire input twice is detrimental to performance. The TIP implementation avoids this by iterating only once. This also necessitates some small changes in the channel subsystem.

Specification

New functions Tcl_UtfToExternalEx and Tcl_ExternalToUtfEx

The following new public API is defined.

int Tcl_ExternalToUtfEx(
    Tcl_Interp *interp, Tcl_Encoding encoding, const char *src,
    Tcl_Size srcLen, int flags, Tcl_EncodingState *statePtr, char *dst,
    Tcl_Size dstLen, Tcl_Size *srcReadPtr, Tcl_Size *dstWrotePtr,
    Tcl_Size *dstCharsPtr);

int Tcl_UtfToExternalEx(
    Tcl_Interp *interp, Tcl_Encoding encoding, const char *src,
    Tcl_Size srcLen, int flags, Tcl_EncodingState *statePtr, char *dst,
    Tcl_Size dstLen, Tcl_Size *srcReadPtr, Tcl_Size *dstWrotePtr,
    Tcl_Size *dstCharsPtr);

The signatures are equivalent to those of the existing Tcl_ExternalToUtf and Tcl_UtfToExternal functions respectively, except that the srcReadPtr, dstWrotePtr and dstCharsPtr parameters are of type Tcl_Size * instead of int *. The following expands on the description of those functions in the current manpages, which is both sparse and underspecified.

The Tcl_ExternalToUtfEx function converts bytes in the encoding encoding at address src into TUTF-8 encoding, storing them in the buffer at address dst. Conversely, Tcl_UtfToExternalEx converts bytes encoded in TUTF-8 at address src into the encoding given by encoding, storing them in the output buffer at address dst. In both cases, srcLen specifies the number of bytes to be converted and dstLen specifies the size of the output buffer in bytes.

The flags parameter is a bitmask that controls operation of the conversion and should be a bitwise OR of zero or more of the following values:

  • At most one of the profile selection flags listed in the PROFILES section of the manpage.

  • The TCL_ENCODING_NO_TERMINATE flag disables null termination of the output. By default, the output buffer dst is terminated with an encoding-appropriate null.

  • The TCL_ENCODING_START and TCL_ENCODING_END flags indicate whether the source bytes correspond to the first or last blocks, respectively, in a source stream. TCL_ENCODING_START will cause the conversion routine to reset to an initial state ready to process the first byte of an encoded stream. TCL_ENCODING_END indicates the source buffer is the last block in an input stream allowing any required finalization to be performed. Any incomplete trailing characters will then be treated as per the encoding profile in effect. Both flags may be specified in the same call when all data to be converted is passed in a single block. Both flags are also presumed to be implicitly set if the statePtr parameter is passed as NULL.

  • The TCL_ENCODING_CHAR_LIMIT flag specifies that the function should not convert more characters than the number passed in through the dstCharsPtr argument, provided that argument is not NULL. This flag is only supported by the Tcl_ExternalToUtfEx function and should not be passed to other functions.

The statePtr parameter is an opaque pointer to a location used by the encoding functions to store intermediate state when the data to be converted is passed in multiple chunks. The same location should be passed in statePtr for all related calls, with the first chunk passed with the TCL_ENCODING_START flag set and the last with TCL_ENCODING_END set. If statePtr is passed as NULL, all data is presumed to be contained in that single call and the functions behave as if TCL_ENCODING_START and TCL_ENCODING_END were both set in the flags parameter.
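As a rough illustration of the chunking rule above, the following sketch picks the stream flags for one chunk out of several that share a statePtr. The helper name ChunkFlags is hypothetical and not part of this TIP, and the two flag values are repeated here only so that the fragment compiles without tcl.h; real code would include tcl.h and use its definitions.

```c
#include <stddef.h>

/* Stand-in values so the sketch compiles on its own. Assumption: they
 * match the definitions in tcl.h, which real code should include. */
#ifndef TCL_ENCODING_START
#define TCL_ENCODING_START 0x01
#define TCL_ENCODING_END   0x02
#endif

/*
 * Hypothetical helper: returns the flags to pass for chunk number
 * `index` (0-based) out of `count` chunks converted with one statePtr.
 * `baseFlags` carries the profile and any other flags unchanged.
 */
static int ChunkFlags(int baseFlags, size_t index, size_t count)
{
    int flags = baseFlags;

    if (index == 0) {
        flags |= TCL_ENCODING_START;   /* first block resets the state */
    }
    if (index + 1 == count) {
        flags |= TCL_ENCODING_END;     /* last block allows finalization */
    }
    return flags;
}
```

With a single chunk the helper sets both flags, which mirrors what the functions assume when statePtr is NULL.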

The srcReadPtr, dstWrotePtr and dstCharsPtr parameters point to locations to hold the number of bytes in the source that were processed successfully, the number of bytes written to the output buffer, and the number of characters written to the output buffer respectively. All three are optional and may be passed as NULL. Further, in the case of the Tcl_ExternalToUtfEx function, dstCharsPtr may be used with the TCL_ENCODING_CHAR_LIMIT flag to limit the number of characters processed as described earlier.

With the exceptions noted below, the counts returned in srcReadPtr, dstWrotePtr and dstCharsPtr are valid for all return codes listed below other than TCL_ERROR.

The functions return one of the following codes.

TCL_OK - the function completed without any exceptional conditions. Note this does not mean all passed input in src was processed or verified. In particular, in the case of the caller passing TCL_ENCODING_CHAR_LIMIT to limit the number of characters converted, only the corresponding number of bytes in the source input would have been processed as indicated by the value in srcReadPtr.

TCL_CONVERT_NOSPACE - the output buffer had insufficient space. The output buffer will contain as much converted data as would fit and will be null terminated as appropriate, unless the buffer was too small to contain even a null terminator by itself. srcReadPtr will hold the number of processed source bytes and the caller should call again to process the remaining bytes. Note: As a quirk of implementation, in some cases the destination buffer needs to be TCL_UTF_MAX bytes greater than the actual size needed. This is an existing quirk present in both 8.6 and 9.0 that is not addressed in this TIP.

TCL_CONVERT_MULTIBYTE - the trailing bytes in the source input formed an incomplete encoding sequence. The caller should call the function again, with the same statePtr, passing the unprocessed bytes starting at offset *srcReadPtr of the original source followed by the additional source bytes. Note that if flags had the TCL_ENCODING_END flag set, indicating no more data is forthcoming, the functions will return TCL_CONVERT_SYNTAX instead of TCL_CONVERT_MULTIBYTE.

TCL_CONVERT_SYNTAX - an invalid byte sequence was detected in the source input. What constitutes an "invalid" sequence is subject to the encoding profile as specified by the flags parameter. The *srcReadPtr count will contain the number of bytes successfully processed and is therefore also the offset of the start of the invalid sequence. The output buffer will contain the converted data up to that point.

TCL_CONVERT_UNKNOWN - the input byte sequence represented a character that cannot be encoded in the output encoding for the encoding profile in effect. Treatment is similar to that of TCL_CONVERT_SYNTAX.

TCL_ERROR - an error message is stored in interp if it is not NULL. The output locations at srcReadPtr, dstWrotePtr and dstCharsPtr may have been modified but should not be considered valid.
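The return codes above suggest a small dispatch step in any caller that loops over input. The sketch below is one possible shape for it, not the TIP's implementation: ClassifyResult maps a return code to the follow-up action described above, and CarryOverTail moves the bytes that were not consumed (those from offset *srcReadPtr onward) to the front of a buffer so that more input can be appended after them. Both helper names, the NextStep enumeration and the Size typedef (a stand-in for Tcl_Size) are hypothetical. The return code values are repeated only so that the fragment compiles without tcl.h, on the assumption that they match it.

```c
#include <stddef.h>
#include <string.h>

typedef ptrdiff_t Size;   /* stand-in for Tcl_Size */

/* Stand-in values; assumption: identical to those in tcl.h. */
#ifndef TCL_OK
#define TCL_OK                  0
#define TCL_ERROR               1
#define TCL_CONVERT_MULTIBYTE (-1)
#define TCL_CONVERT_SYNTAX    (-2)
#define TCL_CONVERT_UNKNOWN   (-3)
#define TCL_CONVERT_NOSPACE   (-4)
#endif

typedef enum {
    STEP_DONE,          /* call completed; counts say how much was used */
    STEP_DRAIN_OUTPUT,  /* consume dst, then call again with the rest */
    STEP_NEED_INPUT,    /* incomplete trailing sequence; supply more */
    STEP_BAD_INPUT,     /* invalid or unencodable data at *srcReadPtr */
    STEP_FAILED         /* TCL_ERROR or unexpected; counts not valid */
} NextStep;

/* Hypothetical helper: maps a conversion return code to an action. */
static NextStep ClassifyResult(int code)
{
    switch (code) {
    case TCL_OK:                return STEP_DONE;
    case TCL_CONVERT_NOSPACE:   return STEP_DRAIN_OUTPUT;
    case TCL_CONVERT_MULTIBYTE: return STEP_NEED_INPUT;
    case TCL_CONVERT_SYNTAX:
    case TCL_CONVERT_UNKNOWN:   return STEP_BAD_INPUT;
    default:                    return STEP_FAILED;
    }
}

/*
 * Hypothetical helper: moves the unconsumed tail buf[srcRead..bufLen)
 * to the start of buf and returns its length, which is the offset at
 * which new input should be appended.
 */
static Size CarryOverTail(char *buf, Size bufLen, Size srcRead)
{
    if (srcRead <= 0) {
        return bufLen;              /* nothing consumed, nothing to move */
    }
    if (srcRead >= bufLen) {
        return 0;                   /* everything consumed */
    }
    memmove(buf, buf + srcRead, (size_t)(bufLen - srcRead));
    return bufLen - srcRead;
}
```

A caller would typically use CarryOverTail after STEP_NEED_INPUT and after STEP_DRAIN_OUTPUT, since in both cases part of the source may remain unprocessed.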

Changes to existing functions Tcl_UtfToExternal and Tcl_ExternalToUtf

The TCL_ENCODING_NO_TERMINATE flag is implemented for Tcl_UtfToExternal and documented in manpages for both functions.

The TCL_ENCODING_CHAR_LIMIT flag is documented. It is not implemented for Tcl_UtfToExternal[Ex] because it would require changes to the low level encoding API which is public.

Both functions will raise an error if the caller has asked for counts to be written via srcReadPtr, dstWrotePtr or dstCharsPtr, and one or more of those values is greater than INT_MAX. Note this is a potential incompatibility since these functions are not documented to return TCL_ERROR as a return code. An alternative would be to clamp the values to INT_MAX, but that leads to misleading return codes and outputs, which has potential for more confusion than returning TCL_ERROR. This is a rare condition in practice in any case.
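The range check described above can be pictured roughly as follows. This is only a sketch of the idea and makes no claim about how the tip-737 branch implements it. StoreCountAsInt is a hypothetical helper and Size is a stand-in for Tcl_Size.

```c
#include <limits.h>
#include <stddef.h>

typedef ptrdiff_t Size;   /* stand-in for Tcl_Size */

/*
 * Hypothetical helper: stores `count` through an int pointer as the
 * non-Ex functions must. Returns 1 on success and 0 when the caller
 * asked for the count but it does not fit in an int, which is the
 * case in which the non-Ex functions would return TCL_ERROR.
 */
static int StoreCountAsInt(Size count, int *out)
{
    if (out == NULL) {
        return 1;                   /* count not requested, no error */
    }
    if (count < 0 || count > INT_MAX) {
        return 0;                   /* not representable; *out untouched */
    }
    *out = (int)count;
    return 1;
}
```

Rejecting the value rather than clamping it keeps the reported count from silently disagreeing with the amount of data actually processed.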

Documentation will be updated to mark these functions as deprecated and recommend the use of the new *Ex* variants.

Testing

Test cases in utfext.test have been adapted, with those that test long strings (greater than INT_MAX) placed under the bigdata constraint.

Implementation notes

Implementation is in the tip-737 branch.

Copyright

This document has been placed in the public domain.
