TIP 418: Add [binary] Subcommands for In-Place Modification

Login
Bounty program for improvements to Tcl and certain Tcl packages.
Author:         Jeff Rogers <dvrsn@diphi.com>
State:          Draft
Type:           Project
Vote:           Pending
Created:        27-Aug-2012
Post-History:   
Keywords:       Tcl,binary data
Tcl-Version:    8.7

Abstract

This TIP proposes adding new subcommands to the binary to better enable parsing and manipulation of binary values.

Rationale

The binary command efficiently deals with creating new objects or completely parsing existing ones, but it does not handle modifying existing binary objects or parsing them a little bit at a time. A few new subcommands would greatly improve the performance of these operations on large objects.

Variable vs. Value

While it will be possible to implement these modification operations as standard copy-on-write operations taking a value instead of a variable name, I believe this would result in copying unless the well-known but still clumsy technique of unsetting the variable after reading it (i.e., [K $x [set x {}]]) is used. This TIP is intended to fix this by providing a more convenient and simpler-to-use mechanism which also admits more efficient implementation.

Specification

Two new core subcommands are proposed: binary edit modifies an existing byte array "in place"; and binary scanshift parse data from a byte array and removes the data that was parsed. The intent is that additional commands can be built on top of these in library code.

The existing binary commands already make use of an internal cursor; that notion is extensively used by these new commands.

Binary Edit

binary edit varName formatStr ?value value ...?

This is similar to binary format except that the initial value of the new byte array is an existing object stored in a variable rather than an array of nulls.

Format specifiers in formatStr are as in binary format except:

  • fixed-width format specifiers (e.g., c, s, i) that do not have enough values in their corresponding argument (importantly, if they have 0 values) move the cursor by the appropriate width.

  • New z and Z format specifiers are introduced that move the cursor forward or backward in the binary string without writing anything and consume no arguments. Z is a synonym for "X", provided for symmetry. A count of "*" for the "z" format moves the cursor to the end of the existing object.

After the format string and all arguments have been processed, the length of the string is adjusted to end at the current cursor position.

Thus, a format string that starts with "z*" will append to the existing value, and one that ends with "z*" will keep the length the same.

Binary Scanshift

binary scanshift varName formatStr ?var var var ...?

This works like binary scan except that after the format string has been processed and all variables assigned to, all data in the string before the ending location of the cursor is discarded.

Thus,

binary scanshift bvar c byte1
binary scanshift bvar c byte2
binary scanshift bvar c byte3

Will put the first 3 bytes of the binary $bvar into byte1, byte2, and byte3, with bvar being subsequently three bytes shorter (the missing bytes being the first three).

This is useful to avoid keeping a separate external cursor variable that must be incremented and re-used on each iteration.

Additional Library Commands

Suggested additional library commands are poke and append. The arguments to binary poke will be:

binary poke varName index formatStr ?var ...?

This moves the cursor to a specified index, then overwrite with the specified format string. Implemented as

      binary edit varName "@${index} $formatStr z*" var ...

The arguments to binary append will be:

binary append varName formatStr "?var ...?

This appends the given formatted data to an existing var. Implemented as

      binary edit varName "z* $formatStr" ?var ...?"

Implementation Notes

Efficient implementation of the "scanshift" subcommand requires a new "offset" field in the ByteArray structure and any operations that read the object (particularly duplicating it and updating the string representation) need to be aware of this field. All external interfaces should be unaffected, as the ByteArray structure type is private to tclBinary.c, And since it's internal, EIAS is not violated.

When extending an existing byte array with the "edit" subcommand, care should be taken with memory allocation to avoid repeated realloc() and memcpy() operations. It is a reasonable assumption that a given byte array will be extended repeatedly or not at all beyond the initial creation. So a memory allocation strategy is to allocate the exact length initially (i.e., when adjusting the size from 0 to non-zero) and allocating double the requested length subsequently a typical allocation-doubling strategy should work well. A double allocation should not be needed for an initial extension (i.e., extending from 0 to some length) as that is typically the case when a binary object is first created, and most binary objects will probably not be extended; but once extended it is reasonable to prepare for more of the same.

After numerous "scanshift" operations there will be wasted space at the beginning of the memory allocated for the data. One strategy for keeping this under control would be to move the live data to the beginning of the allocated space when the offset is larger than the live data, so that the memory could be copied without worrying about overlap; and this would also leave the allocation size at roughly double the live data size.

Reference Implementation

Forthcoming.

Copyright

This document has been placed in the public domain.

History