TIP 400: Setting the Compression Dictionary and Other 'zlib' Updates

State:		Final
Type:		Project
Tcl-Version:	8.6
Vote:		Done
Post-History:	
Author:		Donal K. Fellows <dkf@users.sf.net>
Created:	30-Mar-2012
Keywords:	Tcl, zlib

Abstract

Sometimes it is necessary to set the compression dictionary so that a sequence of bytes may be compressed more efficiently (and decompressed as well). This TIP exposes that functionality. It also reduces the number of inconsistencies in the zlib command.

Rationale

The SPDY protocol extensions to HTTP require the seeding of the zlib compression dictionary (which greatly improves the performance of compression on small amounts of data, such as HTTP headers). In order to allow a pure Tcl implementation of the SPDY protocol, it is therefore necessary to provide a mechanism whereby the compression dictionary (a byte-array, normally up to 32kB long according to the zlib documentation) can be set.

There is to be no mechanism for retrieving the compression dictionary generated by the compression engine; there is no API for doing that.

A side issue discovered while working on this TIP was that there was considerable variation in what could be achieved by various parts of the API. In particular, the API was inconsistent: some features were accessible from the "simplified" parts of the API but could not be controlled from the "advanced" parts (e.g., there was no way to set the GZIP header descriptor with zlib stream gzip).

Proposed Changes: Tcl

Changes to the Channel Transforms

The zlib push command will gain two extra options, -dictionary and -limit:

-dictionary bytes

This option provides a compression dictionary (bytes is a byte-array used to initialize the compression engine). It will be supplied to the zlib compression engine at the correct moment during compression, or provided when the engine requests it during decompression. The bytes argument must be non-empty if given (we will not enforce a limit on the length of the dictionary, but an excessively long one may cause the zlib engine to issue errors). It will be illegal to use this option with gzip and gunzip streams. Its use with raw (deflate) streams is not recommended because it is difficult to detect whether a compression dictionary was applied, and the zlib-format header adds very little overhead. This value can also be set with chan configure, though doing so after data has started to be pushed through the compression engine (except after an error requesting a compression dictionary) is not recommended.
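As a minimal sketch of the -dictionary option on a channel transform, the following round-trips a short payload through a temporary file. The seed dictionary contents and the variable names (seed, payload, roundTrip) are made up for illustration; only the zlib push syntax comes from the text above.

```tcl
# Hypothetical seed dictionary: byte sequences expected to recur in the payload.
set seed [encoding convertto utf-8 "content-type: text/html\r\naccept-encoding: gzip\r\n"]
set payload [encoding convertto utf-8 "content-type: text/html\r\n"]

# Write the payload through a zlib-format compressing transform.
set out [file tempfile path]
chan configure $out -translation binary
zlib push compress $out -dictionary $seed
puts -nonewline $out $payload
close $out

# Read it back; the decompressing side must be given the same dictionary.
set in [open $path rb]
zlib push decompress $in -dictionary $seed
set roundTrip [read $in]
close $in
file delete $path
```

Both ends must agree on the dictionary bytes out of band, since the zlib-format header only records that a dictionary was used, not its contents.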

-limit size

This option (valid on the three decompressing transforms only, and where size must be a positive integer of no more than 0x10000) controls the size of the chunks read from the underlying channel and fed into the decompression engine. Its default is 1, which gives correct behavior under the widest range of conditions, but at a significant cost in performance. When the underlying data source is known never to block for long and to have complete data, a larger value can be used, which will greatly improve performance. This value can be set at runtime using chan configure.
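The sketch below illustrates -limit on a decompressing transform reading from a regular file, which is a source that has complete data and does not block. The chunk sizes (4096 and 8192) and the test data are arbitrary choices for illustration, not recommended values.

```tcl
# Produce a compressed file to read back (the contents are arbitrary test data).
set original [string repeat "abc" 10000]
set out [file tempfile path]
chan configure $out -translation binary
zlib push compress $out
puts -nonewline $out $original
close $out

# A file is complete and will not block, so a larger read chunk is reasonable.
set in [open $path rb]
zlib push decompress $in -limit 4096
# The limit can also be adjusted after the transform has been pushed.
chan configure $in -limit 8192
set restored [read $in]
close $in
file delete $path
```

For a socket or pipe where the peer may pause mid-stream, leaving the default in place is the conservative choice.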

Changes to the Streams

The zlib stream command will also gain some complexity. In particular, the compress, decompress, deflate and inflate subcommands will gain the ability to take an extra -dictionary bytes option and value (same interpretation as above), as will the add and put subcommands of the stream instance command.
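A rough sketch of the stream form, assuming the dictionary is supplied when each stream is created. The dictionary bytes and variable names are hypothetical.

```tcl
# Hypothetical seed dictionary shared by both ends.
set seed [encoding convertto utf-8 "GET POST HTTP/1.1 host: user-agent: accept: "]
set payload [encoding convertto utf-8 "GET /index.html HTTP/1.1"]

# Compress with the dictionary supplied at stream creation.
set z [zlib stream compress -dictionary $seed]
$z put -finalize $payload
set packed [$z get]
$z close

# Decompress; the same dictionary is handed to the engine when it asks for it.
set u [zlib stream decompress -dictionary $seed]
$u put $packed
set unpacked [$u get]
$u close
```

The same -dictionary option can instead be passed to put or add on an existing stream, which may suit protocols that only learn the dictionary after the stream has been created.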

In addition (as a correction to the functionality originally proposed in [234]) the zlib stream gzip subcommand will also gain the ability to take:

-header dict

(where dict is a Tcl dictionary such as is passed to the -header option to zlib gzip and not a compression dictionary), and the stream instance command will gain a header subcommand to retrieve the gzip header (it will be an error to use it on a stream not produced by zlib stream gunzip). In order to facilitate the above change, the compression level used in that case will be altered to be specified via an option:

-level compressionLevel
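To illustrate the two gzip-related additions together, here is a small sketch that writes a header with zlib stream gzip and reads it back with the header subcommand of a gunzip stream. The header values (the filename and comment) and variable names are invented for the example.

```tcl
set text "hello, world\n"

# Create a gzip stream with an explicit header and compression level.
set gz [zlib stream gzip -header {filename example.txt comment "demo"} -level 9]
$gz put -finalize $text
set packed [$gz get]
$gz close

# Decompress and retrieve the header that was stored in the stream.
set gunz [zlib stream gunzip]
$gunz put $packed
set restored [$gunz get]
set hdr [$gunz header]
$gunz close

set storedName [dict get $hdr filename]
```

The dictionary returned by header may contain further keys beyond the ones supplied, so looking up individual keys with dict get is safer than comparing whole dictionaries.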

Proposed Change: C

At the C level, one additional function will be provided:

void Tcl_ZlibStreamSetCompressionDictionary(Tcl_ZlibStream zshandle, Tcl_Obj *compressionDictionaryObj)

This sets the compression dictionary for a particular stream to the given (byte-array) Tcl_Obj, which will be duplicated. It is the caller's responsibility to dispose of the object passed in if they allocated it; they may do so immediately after calling this function.

Copyright

This document has been placed in the public domain.
