TIP 69: Improvements for the Tcl Hash Table

Author:         George A. Howlett <gah@siliconmetrics.com>
Author:         Don Porter <dgp@users.sf.net>
Author:         Donal K. Fellows <donal.k.fellows@man.ac.uk>
State:          Draft
Type:           Project
Vote:           Pending
Created:        16-Oct-2001
Post-History:   
Discussions-To: news:comp.lang.tcl
Tcl-Version:    9.0

Abstract

This document describes several improvements to the existing Tcl hash table: support for 64-bit platforms, better memory performance, and improved array hashing. The goal is a hash table that not only serves Tcl/Tk internally, but can also be used in industrial-strength applications.

Introduction

A strength of Tcl that has not diminished with the advance of other scripting languages (Perl, Python, Ruby, etc.) is the ease with which its command language can be extended with C/C++ code. For example, the prominence of Tcl in Electronic Design Automation (EDA) tools is striking: it is hard to find EDA tools that do not use Tcl to some degree. At the same time, there is a current trend toward 64-bit computing platforms. The impetus has come from industries (such as EDA) that want to solve bigger problems faster, rather than from office or home users. If Tcl applications are to operate on 64-bit platforms, a big first step towards this goal is a 64-bit version of the Tcl hash table.

The current Tcl hash table performs well on 32-bit platforms. It has been tuned and wrung out by years of use in internal Tcl/Tk code. But its one word hash function

 #define RANDOM_INDEX(tablePtr, i) \
    (((((long) (i))*1103515245) >> (tablePtr)->downShift) & (tablePtr)->mask)

cannot hash the longer 64-bit addresses properly.

Example:

    Tcl_HashTable t;
    Tcl_HashEntry *hPtr;
    unsigned long i;
    char *base, *addr;
    int isNew;
    char *mesg;

    Tcl_InitHashTable(&t, TCL_ONE_WORD_KEYS);
    base = (char *) 0xFFFFFF000000000UL;
    for (i = 0; i < 100; i++) {
        addr = base + i * 0x100000000UL;
        hPtr = Tcl_CreateHashEntry(&t, addr, &isNew);
    }
    mesg = Tcl_HashStats(&t);
    fprintf(stderr, "Stats\n%s\n", mesg);
    ckfree(mesg);

Note that the keys all have zeros in their lower 32 bits. Since multiplying such a key by 1103515245 still leaves the lower 32 bits of the product zero, and RANDOM_INDEX only ever examines bits well below bit 32, all 100 entries hash to the same bucket. The ability to address more memory is exactly what drives the need for 64-bit systems, so it is imperative that the Tcl hash table be able to hash large virtual addresses.

Building upon the current hash table implementation, the following sections describe specific areas for improvement:

* improved array/structure hashing

* better memory performance 

* support for 64-bit platforms

The goal is an improved Tcl hash table that serves internal Tcl and Tk code but is also suitable for high-performance applications.

Improved Array/Structure Hashing

The Tcl hash table handles three types of hash keys: string keys, one word keys, and multi-word (array) keys. Each key type has its own hash function associated with it. The benefit of this approach is that a better hash function can be used for each specific key type than a single general-purpose function could provide. The string and one word hash functions are very good for typical keys; the multi-word or array hash is not as good.
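
For reference, here is a minimal sketch of how a table is initialized with each key type; the keyType argument is TCL_STRING_KEYS, TCL_ONE_WORD_KEYS, or the number of ints that make up an array key.

    Tcl_HashTable strTab, wordTab, arrTab;

    /* String keys: NUL-terminated strings are copied into each entry. */
    Tcl_InitHashTable(&strTab, TCL_STRING_KEYS);

    /* One word keys: the pointer-sized value itself is the key. */
    Tcl_InitHashTable(&wordTab, TCL_ONE_WORD_KEYS);

    /* Array keys: here, each key is an array of three ints. */
    Tcl_InitHashTable(&arrTab, 3);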

The array hash sums each word of the array and then randomizes the result.

    for (index = 0, count = tablePtr->keyType, iPtr1 = arrayPtr;
	    count > 0; count--, iPtr1++) {
	index += *iPtr1;
    }
    index = RANDOM_INDEX(tablePtr, index);

This works poorly for many types of hash keys. In a contrived example hashing 1 million 3D coordinates,

    typedef struct {
        double x, y, z;
    } Double3;

    Tcl_HashTable t;
    Tcl_HashEntry *hPtr;
    Double3 d3;
    int i, j, k, isNew;

    Tcl_InitHashTable(&t, sizeof(Double3) / sizeof(int));

    for (i = 0; i < 100; i++) {
        for (j = 0; j < 100; j++) {
            for (k = 0; k < 100; k++) {
                d3.x = (double)i;
                d3.y = (double)j;
                d3.z = (double)k;
                hPtr = Tcl_CreateHashEntry(&t, (char *)&d3, &isNew);
            }
        }
    }

we get a hash table with an average search distance of 1082.3. The maximum distance is 3324! Replacing the hash function with Bob Jenkins' 32-bit mixing function (http://burtleburtle.net/bob)

   #define MIX32(a,b,c) \
	a -= b, a -= c, a ^= (c >> 13), \
	b -= c, b -= a, b ^= (a <<  8), \
	c -= a, c -= b, c ^= (b >> 13), \
	a -= b, a -= c, a ^= (c >> 12), \
	b -= c, b -= a, b ^= (a << 16), \
	c -= a, c -= b, c ^= (b >>  5), \
	a -= b, a -= c, a ^= (c >>  3), \
	b -= c, b -= a, b ^= (a << 10), \
	c -= a, c -= b, c ^= (b >> 15)

    unsigned int a, b, c, len;	/* Unsigned, so the mixing arithmetic
				 * wraps instead of overflowing. */

    len = length;		/* Number of words in the key */
    a = b = GOLDEN_RATIO32;	/* An arbitrary value */
    c = 0;			/* Previous hash value */

    while (len >= 3) {		/* Handle most of the key */
        a += key[0];
        b += key[1];
        c += key[2];
        MIX32(a, b, c);
        key += 3; len -= 3;
    }
    c += length;
    switch (len) {		/* Handle the last one or two words */
    case 2: b += key[1];	/* Fall through */
    case 1: a += key[0];
    }
    MIX32(a, b, c);
    return c;

yields a table with an average search distance of 1.48 and a maximum distance of 8. Jenkins' hash function provides good results for many different types of arrays and structures. The disadvantage is that it is slightly more expensive to compute.

Improving RebuildTable

The cost of computing a hash function is felt especially each time the table is rebuilt as new entries are added. The RebuildTable function calls the hash function on each entry's key to recompute its location in the bigger table.

    for (oldChainPtr = oldBuckets; oldSize > 0; oldSize--, oldChainPtr++) {
	for (hPtr = *oldChainPtr; hPtr != NULL; hPtr = *oldChainPtr) {
	    *oldChainPtr = hPtr->nextPtr;
	    if (tablePtr->keyType == TCL_STRING_KEYS) {
		index = HashString(hPtr->key.string) & tablePtr->mask;
	    } else if (tablePtr->keyType == TCL_ONE_WORD_KEYS) {
		index = RANDOM_INDEX(tablePtr, hPtr->key.oneWordValue);
	    } else {
		index = HashArray(hPtr->key.words, tablePtr->keyType) &
			tablePtr->mask;
	    }
	    hPtr->bucketPtr = &(tablePtr->buckets[index]);
	    hPtr->nextPtr = *hPtr->bucketPtr;
	    *hPtr->bucketPtr = hPtr;
	}
    }

The new bucket location is then stored in the hash entry.

Except for one word keys, the hash value does not depend on the table size. If the hash value were stored with each entry, it would not need to be recomputed each time the table is rebuilt.

    for (oldChainPtr = oldBuckets; oldSize > 0; oldSize--, oldChainPtr++) {
	for (hPtr = *oldChainPtr; hPtr != NULL; hPtr = *oldChainPtr) {
	    *oldChainPtr = hPtr->nextPtr;
	    if (tablePtr->keyType == TCL_ONE_WORD_KEYS) {
		index = RANDOM_INDEX(tablePtr, hPtr->key.oneWordValue);
	    } else {
		index = hPtr->hval & tablePtr->mask;
	    }
	    hPtr->bucketPtr = &(tablePtr->buckets[index]);
	    hPtr->nextPtr = *hPtr->bucketPtr;
	    *hPtr->bucketPtr = hPtr;
	}
    }

This would increase the size of a hash entry, except that the pointer to the hash bucket becomes redundant, since it can be computed cheaply:

    bucketPtr = tablePtr->buckets + (hPtr->hval & tablePtr->mask);

An added benefit is that hash table lookups become faster and simpler. If there is more than one hash entry in a bucket, the key need not be examined unless the entry's stored hash value matches.

    for (hPtr = tablePtr->buckets[hindex]; hPtr != NULL;
	    hPtr = hPtr->nextPtr) {
       /* Don't look at entry unless the hash value is the same. */
	if (hPtr->hval == hval) { 
	    register int *iPtr1, *iPtr2;
	    int count;

	    for (iPtr1 = arrayPtr, iPtr2 = (int *)hPtr->key.words,
		     count = tablePtr->keyType; ; count--, iPtr1++, iPtr2++) {
		if (count == 0) {
		    return hPtr;
		}
		if (*iPtr1 != *iPtr2) {
		    break;
		}
	    }
	}
    }
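
Putting these changes together, a hash entry under this proposal might look roughly like the following; this is a sketch only, and the hval field name is this TIP's, not the current header's.

    typedef struct Tcl_HashEntry {
        struct Tcl_HashEntry *nextPtr;	/* Next entry in this bucket, or NULL
					 * for the end of the chain. */
        struct Tcl_HashTable *tablePtr;	/* Table containing the entry. */
        unsigned int hval;		/* Full hash value of the key, stored
					 * in place of bucketPtr so the key
					 * never has to be rehashed on a
					 * rebuild. */
        ClientData clientData;		/* Application stores something here
					 * with Tcl_SetHashValue. */
        union {				/* Key, exactly as today. */
            char *oneWordValue;
            int words[1];
            char string[4];
        } key;				/* MUST BE LAST FIELD IN RECORD!! */
    } Tcl_HashEntry;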

Don Porter dgp@users.sf.net

It appears that the recommendations of this section have already been implemented in Tcl 8.4. In particular, when the symbol TCL_HASH_KEY_STORE_HASH is defined as 1 (as it is by default), the hash value is stored in each entry instead of the bucketPtr.

If that is correct, then I recommend this section of the TIP be removed. If not, more detail about how this proposal differs from the 8.4 implementation, and an argument why the proposal is superior are in order.

Better Memory Performance

One enduring complaint about the Tcl hash table on comp.lang.tcl is its unexpected memory cost. A table of 1 million one word key entries uses over 36 megabytes.

A hash entry is 20 bytes long on a 32-bit system: four 4-byte pointer-sized fields plus at least one word of key storage.

 typedef struct Tcl_HashEntry {
    struct Tcl_HashEntry *nextPtr;	/* Pointer to next entry in this
					 * hash bucket, or NULL for end of
					 * chain. */
    struct Tcl_HashTable *tablePtr;	/* Pointer to table containing entry. */
    struct Tcl_HashEntry **bucketPtr;	/* Pointer to bucket that points to
					 * first entry in this entry's chain:
					 * used for deleting the entry. */
    ClientData clientData;		/* Application stores something here
					 * with Tcl_SetHashValue. */
    union {				/* Key has one of these forms: */
	char *oneWordValue;		/* One-word value for key. */
	int words[1];			/* Multiple integer words for key.
					 * The actual size will be as large
					 * as necessary for this table's
					 * keys. */
	char string[4];			/* String for key.  The actual size
					 * will be as large as needed to hold
					 * the key. */
    } key;				/* MUST BE LAST FIELD IN RECORD!! */
 } Tcl_HashEntry;

Each entry stores a pointer to its hash table. This field is used only when deleting a hash entry. But if the hash table were passed to Tcl_DeleteHashEntry, the hash entry could be reduced to 16 bytes. Inspecting the Tcl/Tk code, I could not find a case where the hash table was not readily available to pass as a parameter.
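
As a sketch of the change in usage this proposal envisions (the two-argument form is hypothetical, not the current API):

    /* Current API: the entry carries a pointer back to its table. */
    Tcl_DeleteHashEntry(hPtr);

    /* Proposed (hypothetical): the caller supplies the table, so the
     * entry no longer needs to store tablePtr. */
    Tcl_DeleteHashEntry(&table, hPtr);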

Each hash entry is allocated using malloc. System memory allocators typically add 8-16 bytes of overhead for each allocation. Worse, calls to malloc and free tend to dominate the cost of large hash tables. Tcl_DeleteHashTable becomes very slow, freeing hash entries scattered across pages of virtual memory.

For large hash tables, a pool allocation scheme can both reduce the amount of memory used and improve memory performance. By allocating memory in larger chunks, the number of malloc and free calls is dramatically reduced. Fixed-size allocators (for one word keys and array keys) can also reclaim and reuse memory from deleted entries.
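
The following is only an illustrative sketch of such a fixed-size pool allocator for one word key entries; the chunk size and all of the names here are placeholders, not a proposed API.

    #define POOL_CHUNK 1024		/* Entries allocated per malloc call. */

    typedef struct PoolChunk {
        struct PoolChunk *nextPtr;	/* Chain of chunks; all are freed
					 * together when the table is deleted. */
        Tcl_HashEntry entries[POOL_CHUNK];
    } PoolChunk;

    typedef struct {
        PoolChunk *chunkList;		/* Every chunk allocated so far. */
        Tcl_HashEntry *freeList;	/* Entries reclaimed from deletions. */
        int nextFree;			/* Next unused slot in the newest chunk. */
    } EntryPool;

    static Tcl_HashEntry *
    PoolAllocEntry(EntryPool *poolPtr)
    {
        Tcl_HashEntry *hPtr;

        if (poolPtr->freeList != NULL) {	/* Reuse a deleted entry first. */
            hPtr = poolPtr->freeList;
            poolPtr->freeList = hPtr->nextPtr;
            return hPtr;
        }
        if ((poolPtr->chunkList == NULL) || (poolPtr->nextFree >= POOL_CHUNK)) {
            PoolChunk *newPtr = (PoolChunk *) ckalloc(sizeof(PoolChunk));

            newPtr->nextPtr = poolPtr->chunkList;  /* One malloc serves 1024 entries. */
            poolPtr->chunkList = newPtr;
            poolPtr->nextFree = 0;
        }
        return &poolPtr->chunkList->entries[poolPtr->nextFree++];
    }

    static void
    PoolFreeEntry(EntryPool *poolPtr, Tcl_HashEntry *hPtr)
    {
        hPtr->nextPtr = poolPtr->freeList;	/* Nothing is released until the
						 * whole table is deleted. */
        poolPtr->freeList = hPtr;
    }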

The disadvantage of pool allocation is that memory is not released until the hash table is deleted. This is less of an issue for large tables, which tend to grow to a steady-state size. Both Tcl and Tk use hash tables to keep track of small amounts of information that probably do not need pool allocation.

So to retain compatibility, a new specialized initialization routine can be used to indicate when to use pool-based allocation.

    Tcl_InitHashTableWithPool(&table, TCL_ONE_WORD_KEYS);

The standard Tcl_InitHashTable call

    Tcl_InitHashTable(&table, TCL_ONE_WORD_KEYS);

will still use malloc and free.

Support for 64-bit Platforms

While the C language makes no guarantees about a type's size or its relation to other types, common programming practice assumes that integers, longs, and pointers are all 32 bits long. This, of course, changes with 64-bit systems, where pointers are 64 bits wide. Depending upon the programming model, longs and ints may or may not be 64 bits too.

        Datatype      LP64    ILP64   LLP64   ILP32   LP32
        char            8       8       8       8      8
        short          16      16      16      16     16
        _int32                 32
        int            32      64      32      32     16
        long           64      64      32      32     32
        long long                      64
        pointer        64      64      64      32     32

ILP32 is typical of 32-bit systems. Windows 3.1 was an LP32 model.

In the LP64 model, pointers and longs are 64 bits, but ints remain 32 bits wide. The LLP64 model retains 32 bits for both ints and longs, but adds a 64-bit "long long" type. Most 64-bit Unix systems (Solaris, HP-UX, Tru64, AIX) are LP64. I believe that Win64 is LLP64.
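
A quick way to check which model a given compiler uses is simply to print the relevant sizes:

    #include <stdio.h>

    int
    main(void)
    {
        /* LP64 prints 4/8/8 (bytes), LLP64 prints 4/4/8, ILP32 prints 4/4/4. */
        printf("int=%u long=%u pointer=%u\n",
                (unsigned) sizeof(int), (unsigned) sizeof(long),
                (unsigned) sizeof(void *));
        return 0;
    }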

The first problem is that addresses are now 64 bits wide, not 32. This means that existing code such as

    Tcl_InitHashTable(&table, TCL_ONE_WORD_KEYS);

    ptr = CreateSomeObject();
    hPtr = Tcl_CreateHashEntry(&table, (void *)ptr, &isNew);

can possibly fail because the 32-bit one word hash function can't properly hash the 64-bit pointer address.

Don Porter dgp@users.sf.net

Pardon the interruption, but I do not understand what is meant by the assertion that hashing of 64-bit pointers "can possibly fail". I've used Tcl on a 64-bit Alpha system for years, hashing 64-bit pointers the whole time. What failures should I be seeing?

 #define RANDOM_INDEX(tablePtr, i) \
    (((((long) (i))*1103515245) >> (tablePtr)->downShift) & (tablePtr)->mask)

The above one word hash function can be replaced with a 64-bit version of Donald Knuth's multiplicative hash function.

    (((key * GOLDEN_RATIO64) >> downShift) & tablePtr->mask)

where downShift is 64 - log2(tableSize) and GOLDEN_RATIO64 is an odd constant approximately equal to 2^64 * (sqrt(5) - 1) / 2, i.e. 2^64 divided by the golden ratio.
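
For illustration, a 64-bit replacement might look roughly as follows; the macro name, the exact constant (2^64 divided by the golden ratio, rounded to an odd value), and the use of C99's uint64_t are this sketch's assumptions.

    #define GOLDEN_RATIO64	0x9E3779B97F4A7C15ULL	/* ~ 2^64 / golden ratio */

    /* downShift is 64 - log2(numBuckets); mask is numBuckets - 1. */
    #define RANDOM_INDEX64(tablePtr, key) \
	(((((uint64_t) (key)) * GOLDEN_RATIO64) >> (tablePtr)->downShift) \
		& (tablePtr)->mask)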

The 64-bit array function is again from Bob Jenkins. This time it's a 64-bit mixing function.

 #define MIX64(a,b,c) \
 	a -= b, a -= c, a ^= (c >> 43), \
 	b -= c, b -= a, b ^= (a <<  9), \
 	c -= a, c -= b, c ^= (b >>  8), \
 	a -= b, a -= c, a ^= (c >> 38), \
 	b -= c, b -= a, b ^= (a << 23), \
 	c -= a, c -= b, c ^= (b >>  5), \
 	a -= b, a -= c, a ^= (c >> 35), \
 	b -= c, b -= a, b ^= (a << 49), \
 	c -= a, c -= b, c ^= (b >> 11), \
 	a -= b, a -= c, a ^= (c >> 12), \
 	b -= c, b -= a, b ^= (a << 18), \
 	c -= a, c -= b, c ^= (b >> 22)

    uint64_t a, b, c, len;

    len = length;		/* Number of 64-bit words in the key */
    a = b = GOLDEN_RATIO64;	/* An arbitrary value */
    c = 0;			/* Previous hash value */

    while (len >= 3) {		/* Handle most of the key */
        a += key[0];
        b += key[1];
        c += key[2];
        MIX64(a, b, c);
        key += 3; len -= 3;
    }
    c += length;
    switch (len) {		/* Handle the last one or two words */
    case 2: b += key[1];	/* Fall through */
    case 1: a += key[0];
    }
    MIX64(a, b, c);
    return c;

Note that it also takes advantage of the 64-bit word size.

Summary

The following improvements to the current Tcl hash table have been suggested.

  • Replace the current array hash function.

  • Replace the bucket pointer in the hash entry with its hash value. This allows the table to be rebuilt without rehashing each entry. It also speeds bucket searches.

  • Remove the tablePtr from the hash entry, a 20% savings. This requires that callers of Tcl_DeleteHashEntry pass the hash table as a parameter.

  • Allow the hash table to use fixed or variable size pool allocation, since malloc and free costs dominate large tables. Pool allocation substantially speeds up large tables while also saving 8-16 bytes per entry. This can be done while still providing the normal malloc/free versions.

  • Support 64-bit platforms. This requires 64-bit versions of one word and array hash functions.

The suggested changes are nothing new and can be found in most hash table implementations. This work builds on the already solid foundation of the current hash table. With the above improvements, the Tcl hash table can be used in high performance applications. It also adds a useful piece to the 64-bit Tcl/Tk port.

I've created and tested a new hash table implementation on the following systems.

	System			32     64
	linux-ix86-gcc		x	
	Solaris-v9-cc		x	x
	Solaris-v9-gcc		x	x
	HPUX-11-cc		x	x
 	HPUX-11-gcc		x
	Win2k			x

It will be made publicly available on SourceForge.

Hashing of Malicious Strings

Donal K. Fellows adds:

In 2003, a possible denial-of-service attack on hash tables was published that operates by making a majority of keys map to the same bucket. While this does not make the hash table function incorrectly (no extra memory is consumed and no incorrect memory accesses occur), it permits an attacker to escalate the cost of hash accesses from O(1) to O(n) in the normal case, with obvious knock-on effects for the complexity of any algorithm built on top, and so to mount an attack out of scale with the effort required to set it up.

The way to fix this is to use a different string hashing function that varies the exact hashing algorithm on a table-by-table basis, and to base that algorithm on a hashing function with better spectral properties than Tcl's current (extremely simple) one. An algorithm that might be suitable for such uses is described online at http://burtleburtle.net/bob/hash/evahash.html though the code would need substantial adaptation (including the addition of a fairly strong random number generator) before being placed in the core.
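
As an illustration only (this is not the algorithm from the cited page), the per-table idea amounts to seeding the string hash with a random value chosen when the table is created, so that an attacker cannot precompute colliding keys:

    /* Hypothetical: seed is a random value stored in the table at
     * Tcl_InitHashTable time, e.g. drawn from a strong RNG. */
    static unsigned int
    SeededHashString(const char *string, unsigned int seed)
    {
        unsigned int result = seed;	/* Differs from table to table. */
        int c;

        while ((c = *string++) != 0) {
            result += (result << 3) + c;	/* Same mixing step as Tcl's
						 * current HashString(). */
        }
        return result;
    }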

Copyright

This document has been placed in the public domain.
