TIP 258: Enhanced Interface for Encodings

Author:         Don Porter <dgp@users.sf.net>
State:          Final
Type:           Project
Vote:           Done
Created:        01-Oct-2005
Post-History:	
Keywords:       encoding
Tcl-Version:    8.5

Abstract

This TIP proposes public C routines and a new encoding dirs subcommand to improve the interfaces to Tcl's encodings.

Background

During Tcl 8.5 development, several improvements have been made to the internals of how Tcl encodings are initialized, found, stored, and refcounted. This TIP is primarily about making these improvements available via public interfaces.

The lifetime of Tcl_Encoding values has been identified as a problem (Bug 1077262), where premature freeing means repeated re-loading of encoding data. Since each encoding data load involves interaction with the filesystem, this can be an expensive mistake.

The Tcl documentation has long claimed that by setting the value of a global variable ::tcl_libPath a script could influence the search path of directories where encoding data files are sought. That documentation has never been correct (Bug 463190).

Tclkit suffers from an initialization dilemma. It stores encoding data files in a virtual filesystem. In particular the system encoding is often based on a data file in the virtual filesystem. The Tclkit virtual filesystem is (largely) script-implemented and cannot exist until a Tcl_Interp has been created. However, Tcl wants to determine the correct value for the system encoding very early in its initialization, before any Tcl_Interp gets created. The consequence is that Tclkit fails to successfully set the system encoding in Tcl's early initialization, and Tclkit has had to jump through hoops to get Tcl to repeat those early initialization steps after the virtual filesystem is in place.

Proposed changes

Add the following routines to Tcl's public interface:

int Tcl_GetEncodingFromObj(Tcl_Interp *interp, Tcl_Obj *objPtr, Tcl_Encoding *encodingPtr)

Writes to *encodingPtr the Tcl_Encoding value that corresponds to the value of objPtr, and returns TCL_OK. The Tcl_Encoding value is also cached as the internal rep of objPtr so that the lifetime of the Tcl_Encoding data in the process will be at least the lifetime of that internal rep of objPtr. The caller is expected to call Tcl_FreeEncoding on *encodingPtr when it no longer needs it. If no corresponding Tcl_Encoding value for the value of objPtr can be determined, TCL_ERROR is returned, and an error message is stored in the result of interp.
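
The lifetime rule described above can be pictured with a small, self-contained C model. This is only a sketch under stated assumptions: ModelEncoding, ModelObj, and the model_* functions are hypothetical stand-ins invented for illustration, not Tcl's real structures or routines. It shows why a reference held by the object's internal rep keeps the encoding data loaded after the caller has released its own reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not Tcl's real types. */
typedef struct {
    int refCount;   /* number of holders keeping the data loaded */
    int loadCount;  /* how many times the data had to be (re)loaded */
} ModelEncoding;

typedef struct {
    ModelEncoding *cached;  /* plays the role of the internal rep */
} ModelObj;

/* Take a reference; a simulated load happens only when nobody holds one. */
void model_acquire(ModelEncoding *enc) {
    if (enc->refCount == 0) {
        enc->loadCount++;   /* stands in for reading the data file */
    }
    enc->refCount++;
}

/* Drop a reference; at zero the data counts as freed. */
void model_release(ModelEncoding *enc) {
    assert(enc->refCount > 0);
    enc->refCount--;
}

/*
 * Model of the lookup: the object keeps one reference as its cached
 * internal rep, and the caller receives a second reference that it
 * must release itself.
 */
ModelEncoding *model_get_from_obj(ModelObj *obj, ModelEncoding *enc) {
    if (obj->cached == NULL) {
        model_acquire(enc);     /* reference owned by the object */
        obj->cached = enc;
    }
    model_acquire(obj->cached); /* reference owned by the caller */
    return obj->cached;
}
```

In this model, two lookups through the same object, each followed by a release, leave loadCount at 1, because the object's cached reference keeps refCount above zero between the lookups.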

Tcl_Obj *Tcl_GetEncodingSearchPath()

Returns a list of directory pathnames that Tcl's encoding subsystem will search for encoding data files when an encoding is requested that's not already loaded in the process. This will be the value stored by the last successful call to Tcl_SetEncodingSearchPath. If no calls to Tcl_SetEncodingSearchPath have occurred, Tcl will compute an initial value based on the environment. There is one encoding search path for the entire process, shared by all threads in the process.

int Tcl_SetEncodingSearchPath(Tcl_Obj *searchPath)

Stores searchPath as the list of directory pathnames for Tcl's encoding subsystem to search for encoding data files, and returns TCL_OK. Returns TCL_ERROR only if searchPath is not a valid Tcl list. There is no checking for validity of the directory pathnames, so for example, one can place a directory on the encoding search path before mounting the Tcl_Filesystem that contains that directory. Likewise, when searching for encoding data files, Tcl's encoding subsystem ignores any non-existent directories in the search path.
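
The search behaviour can be illustrated with a minimal C model that uses only the standard library. It is a sketch, not Tcl's implementation: the function name model_find_encoding_dir is hypothetical, as is the assumption that the first directory holding the file wins. The ".enc" suffix follows the naming of the encoding data files shipped with Tcl. The point it shows is that a directory which does not exist (for example, one whose filesystem is not yet mounted) is skipped silently rather than treated as an error.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical model: return the index of the first directory in dirs
 * that contains "<name>.enc", or -1 if no directory does.
 */
int model_find_encoding_dir(const char *dirs[], size_t ndirs,
                            const char *name) {
    char path[1024];
    size_t i;

    for (i = 0; i < ndirs; i++) {
        int n = snprintf(path, sizeof path, "%s/%s.enc", dirs[i], name);
        if (n < 0 || (size_t) n >= sizeof path) {
            continue;   /* path does not fit in the buffer: skip entry */
        }
        FILE *f = fopen(path, "rb");
        if (f == NULL) {
            continue;   /* missing directory or file: skip silently */
        }
        fclose(f);
        return (int) i;
    }
    return -1;
}
```

Because missing entries are only skipped, the same path list can be stored before the directories exist and searched again once they do.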

CONST char *Tcl_GetEncodingNameFromEnvironment(Tcl_DString *bufPtr)

This routine exposes Tcl's determination about what the system encoding should be, based on system calls and examination of the environment suitable for the platform. It accepts bufPtr, a pointer to an uninitialized or freed Tcl_DString, and writes to it the string value of the appropriate system encoding dictated by the environment. The return value is the Tcl_DStringValue of bufPtr.

In a properly initialized Tcl, the string value returned by Tcl_GetEncodingNameFromEnvironment ought to be the same as that returned by Tcl_GetEncodingName(NULL); that is, the system encoding dictated by the environment ought to be the encoding Tcl will return as the result of encoding system.

If these two results do not match, it indicates that at the time Tcl was initialized, the proper system encoding was not available. Perhaps the necessary data file was not on the encoding search path at that time. With this new routine, the check for this match can be performed, and if the match does not exist, a call to Tcl_SetSystemEncoding can try again to get Tcl's system encoding to agree with what the environment dictates.

Add a new subcommand, encoding dirs with syntax:

encoding dirs ?searchPath?

This subcommand is the script-level interface to the Tcl_GetEncodingSearchPath and Tcl_SetEncodingSearchPath routines. When called without an argument, the current list of directory pathnames to be searched for encoding files is returned. When called with a searchPath argument, the value searchPath is set as the new list of directory pathnames to be searched.

The documentation for existing routines Tcl_GetDefaultEncodingDir and Tcl_SetDefaultEncodingDir will be updated to discourage their use and to encourage the use of Tcl_GetEncodingSearchPath and Tcl_SetEncodingSearchPath instead.

Compatibility

This proposal includes only new features. It is believed that existing scripts and C code that operate without errors will continue to do so.

The encoding dirs command has been available with the name ::tcl::unsupported::EncodingDirs since the Tcl 8.5a3 release. It is proposed to remove this unsupported command completely, as it has only existed in alpha releases. Anyone using it should be able to migrate to encoding dirs without difficulty.

Reference Implementation

The actual code is already complete as internal routines corresponding to the proposed public ones. Implementation is just an exercise in renaming, placing in stub tables, documentation, etc.

Copyright

This document has been placed in the public domain.
