Author: Andy Goth (andrew.m.goth@gmail.com)
State: Rejected
Type: Project
Vote: Done
Created: 22-Sep-2017
Post-History:
Keywords: Tcl,string,insert
Tcl-Version: 8.7
Abstract
This TIP proposes a [string insert
] subcommand for inserting a substring at a
given index. This new [string insert
] command is to be the string analogue of
[linsert
], much like how [string replace
] resembles [lreplace
].
This TIP additionally proposes a Tcl_ReplaceObj
() public function to provide
C-level access to the underlying functionality used to implement [string
insert
] and [string replace
].
Rationale
[string insert
] Command
Substring insertion is a basic string operation not directly available in current Tcl. Substring insertion can be synthesized from existing string commands, but the numerous legal forms of indexing yield significant difficulty. A novice user cannot be expected to know, much less implement, all possible index formats. Thus it is reasonable to provide a standard substring insertion command.
The current design of [string replace
] expressly (albeit inexplicably)
prevents its use for performing string insertion. TIP 323 originally proposed
to extend [string replace
] to allow string insertion, but this aspect of TIP
323 was withdrawn
for the sake of compatibility.
Clarification: TIP 475 proposes no changes to the semantics of [
string replace
].
To mirror the behavior of [linsert
], [string insert
] at end
should append
to the string. This is in conflict with [string replace
], were it to be
extended to permit replacing an empty substring.
[lreplace
] of an empty range at end
inserts elements immediately before the
final element, whereas appending requires the start index to be end+1
. The
hypothetical extended [string replace
] would mirror [lreplace
] and therefore
would not have the end-relative indexing semantics needed to implement [string
insert
].
In conclusion, it is most straightforward to simply provide a [string insert
]
command with the same semantics as [linsert
].
Tcl_ReplaceObj
() Function
The proposed [string insert
] cannot be implemented in terms of [string
replace
], even if [string replace
] were extended to permit replacing an empty
substring. However, this fact only argues against implementing [string
insert
] as a Tcl script which calls [string replace
]. It is indeed possible
to implement both [string insert
] and [string replace
] in terms of a common
C function supporting the superset of required capabilities.
Not only is it possible, but it is also preferable to take this approach. Minimizing the number of code paths increases the potential benefit of any given optimization, improves instruction cache utilization, and reduces both the possibility for bugs and the number of interactions to be audited.
Furthermore, this new function should be publicly exported using the stubs
interface so extension code can call it without having to enter the interpreter.
The desirability of providing C APIs is made clear by the existence of the the
FlightAware [array
] enumeration C API project.
Current Behavior
Currently available methods for string insertion are awkward and do not handle end-relative and TIP 176-style indexing without extraordinary effort. To demonstrate the surprising degree of complexity, the following is a pure Tcl script reference implementation intended to handle all possible index formats and corner cases:
# Pure Tcl implementation of [string insert] command. proc ::tcl::string::insert {string index insertString} { # Convert end-relative and TIP 176 indexes to simple integers. if {[regexp -expanded { ^(end(?![\t\n\v\f\r ]) # "end" is never followed by whitespace |[\t\n\v\f\r ]*[+-]?\d+) # m, with optional leading whitespace (?:([+-]) # op, omitted when index is "end" ([+-]?\d+))? # n, omitted when index is "end" [\t\n\v\f\r ]*$ # optional whitespace (unless "end") } $index _ m op n]} { # Convert first index to an integer. switch $m { end {set index [string length $string]} default {scan $m %d index} } # Add or subtract second index, if provided. switch $op { + {set index [expr {$index + $n}]} - {set index [expr {$index - $n}]} } } elseif {![string is integer -strict $index]} { # Reject invalid indexes. return -code error "bad index \"$index\": must be\ integer?\[+-\]integer? or end?\[+-\]integer?" } # Concatenate the pre-insert, insertion, and post-insert strings. string cat [string range $string 0 [expr {$index - 1}]] $insertString\ [string range $string $index end] } # Bind [string insert] to [::tcl::string::insert]. namespace ensemble configure string -map [dict replace\ [namespace ensemble configure string -map]\ insert ::tcl::string::insert]
More sample implementations can be found on the Additional String Functions page of the Tcler's Wiki, but at time of
writing, they do not handle end-relative indexing nor can be used to append to a
string. Since they are implemented in terms of [string replace
] and do not
perform any index arithmetic of their own, they actually do support TIP 176
indexes.
Compatibility Considerations
The existence of a command named [string insert
] breaks any existing code that
assumes [string in
] is an unambiguous abbreviation for [string index
]. Two
options exist:
- Special-case [
string in
] to mean [string index
]. - Take no special action, in which case [
string in
] becomes an error.
This TIP proposes option #2 because abbreviations are not guaranteed to be stable in the long term. This TIP targets Tcl 8.7, the first alpha version of which was recently released. Thus there are three reasons why compatibility is deemphasized in this situation.
Specification
Add a new [string insert
] command:
string insert string index insertString
Returns a copy of string with insertString inserted at the index'th character. index may be specified as described in the STRING INDICES section.
If index is start-relative, the first character inserted in the returned string will be at the specified index. If index is end-relative, the last character inserted in the returned string will be at the specified index.
If index is at or before the start of string (e.g., index is 0), insertString is prepended to string. If index is at or after the end of string (e.g., index is end), insertString is appended to string.
Add a new Tcl_ReplaceObj
() stubs-exported function:
Tcl_Obj *
Tcl_ReplaceObj(interp, objPtr, startIndex, removeCount, insertObjPtr)
Tcl_Interp *interp (in) If non-NULL, interpreter in which error result messages are stored. Tcl_Obj *objPtr (in/out) Points to a value to manipulate. int startIndex (in) The index of the first character to insert, replace, or remove. int removeCount (in) The number of characters to replace or remove. Tcl_Obj *insertObjPtr (in/out) The value to insert or replace. If NULL, interpreted as empty string. Tcl_ReplaceObj inserts, replaces, or removes a substring in objPtr and returns a pointer to the new or updated string object. If objPtr or insertObjPtr are unshared, one of them will be modified in place, and its address will be returned. If this optimization is unavailable, a new string object will be allocated to store the result, its address will be returned, and neither objPtr nor insertObjPtr will be modified. Should a memory allocation error occur, Tcl_ReplaceObj returns NULL and (if interp is non-NULL) places error information in the result value of interp.
The choice of operation performed by Tcl_ReplaceObj is determined by removeCount and the length of insertObjPtr. To insert, removeCount is zero, and insertObjPtr is the insertion substring. To replace, removeCount is the number of characters to replace, and insertObjPtr is the replacement substring. To remove, removeCount is the number of characters to remove, and insertObjPtr is NULL or empty string.
startIndex is the index of the first character in objPtr to insert, replace, or remove. Negative startIndex is taken to be zero, and large startIndex is automatically limited to be no greater than the length of objPtr. In a similar manner, negative removeCount is taken to be zero, and large removeCount is automatically limited such that the sum of startIndex and removeCount is no greater than the length of objPtr. Consequently, if startIndex is at the end of objPtr, insertObjPtr will be appended to objPtr, and removeCount will be ignored.
Implement both the bytecoded and non-bytecoded [string insert
] and [string
replace
] commands in terms of Tcl_ReplaceObj
().
Reference Implementation
A pure Tcl reference implementation is given above.
The amg-string-insert
branch in the Tcl Fossil
repository provides an optimized C implementation, complete with documentation,
bytecode compilation, and a full set of test cases exercising all code paths.
This C implementation creates a new strinsert
bytecode, also known as
STR_INSERT
or INST_STR_INSERT
in varying contexts. The decision was made to
keep strinsert
separate from the existing strreplace
opcode due to
differences in end-relative indexing.
Implementing [string insert
] in terms of strreplace
would require modifying
the interface to strreplace
and would therefore break compatibility with
tclquadcode and potentially other projects
that depend on the [tcl::unsupported::assemble
]
and [tcl::unsupported::disassemble
] commands.
Despite the use of the word "separate" above, the strinsert
and strreplace
opcodes are both implemented in terms of the common Tcl_ReplaceObj
() function,
as are the non-bytecoded [string insert
] and [string replace
] commands.
This function was written by distilling common code found across the baseline
implementations of the various subroutines that now call it.
Open Issues
The Tcl_ReplaceObj
() function header comment includes this TODO note:
Memory allocation failure is only checked when concatenating shared, non-pure byte array, non-pure Unicode character array strings. Need to commit to a consistent memory allocation failure handling policy.
This comment is describing the case in which
Tcl_ReplaceObj
() calls TclStringCatObjv
() to concatenate its two arguments.
TclStringCatObjv
() returns TCL_ERROR on memory allocation failure, and
Tcl_ReplaceObj
() dutifully translates that to a NULL return, with detailed
error information in the interpreter result.
In every other circumstance, Tcl_ReplaceObj
() does not check for memory
allocation errors. Many, if not all, of the underlying functions it uses call
Tcl_Panic
() on allocation error, so Tcl_ReplaceObj
() has no opportunity to
handle errors itself. Therefore, when Tcl_ReplaceObj
()'s documentation claims
it notifies its caller of memory allocation errors, it is making a promise it
can only rarely keep.
Copyright
This document is placed in public domain.