Author:         Martin Weber <ephaeton@gmx.net>
Author:         Vince Darley <vincentdarley@users.sourceforge.net>
State:          Rejected
Type:           Project
Vote:           Done
Created:        12-Dec-2002
Post-History:   
Tcl-Version:    8.6

Abstract

This TIP proposes flexible management of word and non-word characters in Tcl, to be used consistently throughout Tcl and Tk, e.g. in [regexp]'s \w and \W, Tk's text widget, and [string wordstart]/[string wordend].

Specification

Assignment to tcl_{non,}wordchars shall influence every place in Tcl that decides whether something is a word character or not, including detection of word boundaries in e.g. regular expressions and Tk's text widget.

To that end, lists of word and non-word characters shall not be hard-coded, and Tcl shall not rely on its implementation language (i.e. C's is*() functions) for this decision, as either approach prevents tcl_{non,}wordchars from being changed dynamically.

Rather, the value(s) of tcl_{non,}wordchars shall be used to determine whether a given character is part of a word or not.

Rationale

Currently, Tcl has several different hard-coded ways to decide whether a certain character is a word character or a non-word character. Because these mechanisms are separate, a change to one does not necessarily propagate to the others, so they soon yield different results. As a consequence of being hard-coded, there is also no way to change or fix that potentially broken behavior. Having Tcl look up the values of those variables at runtime provides the needed flexibility, both for nonstandard demands and for nonstandard character sets.

As an example of the breakage, you can assign a regular expression to tcl_{non,}wordchars, and the double-click binding in the text widget will honor that pattern when marking a "whole word". When you ask the text widget to deliver the data under a certain coordinate using the index modifiers 'wordstart' and 'wordend', however, the value of tcl_{non,}wordchars is not used.

The lookup may cause a performance problem, but on the other hand C's is*() functions are also implemented via a table lookup. Installing a cached static character table could guarantee the needed performance.
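The caching idea can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not a proposed implementation: the names set_word_chars, is_word_char and rebuild_table are hypothetical and do not exist in Tcl, the setting is modeled as a plain list of characters rather than a regular expression, and only 8-bit characters are handled. The table is rebuilt lazily after the setting changes, so an ordinary query costs one array access.

```c
#include <stdbool.h>
#include <string.h>

#define TABLE_SIZE 256  /* assumption: 8-bit characters only */

/* Hypothetical stand-in for the current value of tcl_wordchars. */
static char current_set[TABLE_SIZE] = "";
/* Cached classification, one entry per character code. */
static bool table[TABLE_SIZE];
/* Cleared whenever the setting changes. */
static bool table_valid = false;

/* Store a new set of word characters and invalidate the cache. */
void set_word_chars(const char *chars)
{
    strncpy(current_set, chars, sizeof current_set - 1);
    current_set[sizeof current_set - 1] = '\0';
    table_valid = false;
}

/* Recompute the whole table from the current setting. */
static void rebuild_table(void)
{
    memset(table, 0, sizeof table);
    for (const char *p = current_set; *p != '\0'; p++) {
        table[(unsigned char)*p] = true;
    }
    table_valid = true;
}

/* Classify a character; rebuilds the cache only if it is stale. */
bool is_word_char(unsigned char c)
{
    if (!table_valid) {
        rebuild_table();
    }
    return table[c];
}
```

After set_word_chars("abc_"), the call is_word_char('-') reports false; after set_word_chars("abc_-") it reports true, with the table rebuilt once on the first query following each change.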

Example of current word confusion

Tk's text widget uses "word" in several ways:

1. selection by word (double-click + drag),
2. movement by word ('insert wordstart'),
3. regexp searching with \m\M word matching,
4. line breaks when wrapping (-wrap word).

It is not at all clear from reading Tcl or Tk's documentation what the behaviour of the above options will be. It turns out that:

1. Selection by word, after a convoluted call chain, ends up calling tcl_wordBreakAfter/tcl_wordBreakBefore, which use tcl_wordchars and tcl_nonwordchars (which are actually defined differently on Windows vs. Unix/MacOS!).
2. Movement by word uses 'isalnum(char)' or '_' to define a word (hard-coded in Tk's tkTextIndex.c); in Tk 8.5a0 this has been fixed to use Tcl_UniCharIsWordChar.
3. Regexp searching uses Tcl's regexp engine's definition of a word (this ought to be the same as that used in (2)).
4. Line wrapping treats anything separated by white space from something else as a word; this is used with '-wrap word' to define line wrapping in text widgets (and canvases).

It is quite likely that most of the above differ under some circumstances or on some platforms/locales. Certainly, a user or developer who wants to create a text widget with a different word definition cannot do so in any consistent way.
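To make the disagreement concrete, the following sketch compares two of the definitions described above: the "alphanumeric or underscore" rule and the "anything that is not white space" rule. The function names are hypothetical and the code is not taken from Tcl or Tk; it only illustrates, for 8-bit characters in the default C locale, that a character such as '-' is a word character under one rule and not under the other.

```c
#include <ctype.h>
#include <stdbool.h>

/* Rule described for 'wordstart'/'wordend': alphanumeric or underscore. */
bool word_by_alnum(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Rule implied by '-wrap word': anything that is not white space. */
bool word_by_nonspace(unsigned char c)
{
    return !isspace(c);
}

/* True when the two rules classify c differently. */
bool definitions_disagree(unsigned char c)
{
    return word_by_alnum(c) != word_by_nonspace(c);
}
```

Letters and spaces are classified identically by both rules, so the difference only shows up for punctuation, which is exactly where users tend to notice inconsistent word selection and movement.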

Implementation

None yet.

Copyright