TIP 228: Tcl Filesystem Reflection API

Login
Bounty program for improvements to Tcl and certain Tcl packages.
Author:         Andreas Kupries <andreas_kupries@users.sf.net>
Author:         Andreas Kupries <akupries@shaw.ca>
Author:         Vince Darley <vincentdarley@users.sourceforge.net>
State:          Draft
Type:           Project
Vote:           Pending
Created:        02-Nov-2004
Post-History:   
Tcl-Version:    8.7

Abstract

This document describes an API which reflects the Filesystem Driver API of the core Virtual Filesystem Layer up into the Tcl level, for the implementation of filesystems in Tcl. It is an independent companion to [219] ('Tcl Channel Reflection API') and [230] ('Tcl Channel Transformation Reflection API'). As the latter TIPs bring the ability of writing channel drivers and transformations in Tcl itself into the core so this TIP provides the facilities for the implementation of filesystems in Tcl. This document specifies version 1 of the filesystem reflection API.

Motivation / Rationale

The purpose of this and the other reflection TIPs is to provide all the facilities required for the creation and usage of wrapped files (= virtual filesystems attached to executables and binary libraries) within the core.

While it is possible to implement and place all the proposed reflectivity in separate and external packages, this however means that the core itself cannot make use of wrapping technology and virtual filesystems to encapsulate and attach its own data and library files to itself. Something which is desirable as it can make the deployment and embedding of the core easier, due to having less files to deal with, and a higher degree of self-containment.

One possible application of a completely self-contained core library would be, for example, the Tcl browser plugin.

While it is also possible to create a special purpose filesystem and channel driver in the core for this type of thing, it is however my belief that the general purpose framework specified here is a better solution as it will also give users of the core the freedom to experiment with their own ideas, instead of constraining them to what we managed to envision.

Another use for reflected filesystems is as a helper for testing the generic filesystem layer of Tcl, by creating filesystems which forcibly return errors, bogus data, and the like.

An implementation of this TIP exists already as a package, TclVfs. This TIP asks to make that mechanism publicly available to script and package authors, with a bit of cleanup regarding the Tcl level API.

Specification of Tcl-Level API

The Tcl level API consists of a single new command, filesystem, and one change to the existing command file. The new command is an ensemble command providing five subcommands. These subcommands are mount, unmount, info, posixerror, and internalerror.

(Note that this TIP does not introduce a new C API, but rather exposes an existing C API to Tcl scripts.)

The mount Subcommand

filesystem mount ?-volume? path cmdprefix

This subcommand creates a new filesystem using the command prefix cmdprefix as its handler. The API this handler has to provide is specified below, in the section "Command Handler API".

The new filesystem is immediately mounted at path. After completion of the call any access to a subdirectory of path will be handled by that filesystem, through its handler. The filesystem is represented here by the command prefix which will be executed whenever an operation on a file or directory within path has to be performed.

If the option -volume is specified then the new mount point is also registered with Tcl as a new volume and will therefore from then on appear in the output of the command file volumes.

This is useful (and actually required for reasonable operation) when mounting paths like ftp://. It should not be used for paths mounted inside the native filesystem.

The new filesystem will be immediately accessible in all interpreters executed by the current process.

The command returns the empty string as its result. Returning a handle or token is not required despite the fact that the handler command can be used in more than one mount operation. The different instances can be clearly distinguished through the root argument given to each called method. This root is identical to the path specified here. In other words, the chosen path (= mount point) is the handle as well.

We have chosen to use early binding of the handler command. See the section "Early versus late binding of the handler command" for more detailed explanations.

Important note: The handler command for the filesystem resides in the interpreter performing the mount operation. This interpreter is the filesystem interpreter mentioned in the section "Interaction with threads and other interpreters".

The unmount Subcommand

filesystem unmount path

This methods unmounts the reflected filesystem which was mounted at path. An error is thrown if no reflected filesystem was mounted at that location. After the completion of the operation the filesystem which was mounted at that location is not visible anymore, and any previous filesystem accessible through this path becomes accessible again.

The command returns the empty string as its result.

The info Subcommand

filesystem info ?path?

This method will return a list of all filesystems mounted in all interpreters, if it was called without arguments.

When called with a path the reflected filesystem responsible for that path is examined and the command prefix used to handle all filesystem operations is returned. An error is thrown if no reflected filesystem is mounted for that path.

There is currently no facility to determine the filesystem interpreter (nor its thread).

The posixerror Subcommand

filesystem posixerror error

This command can be called by a handler command during the execution of a filesystem operation to signal the POSIX error code of a failure. This also aborts execution immediately, behaving like return -code -1.

The argument error is either the integer number of the POSIX error to signal, or its symbolic name, like "EEXIST", "ENOENT", etc.

The internalerror Subcommand

filesystem internalerror cmdprefix

This method registers the provided command prefix as the command to call when the core has to report internal errors thrown by a handler command for a reflected filesystem.

If no such command is registered, then internal errors will stay invisible, as the core currently does not provide a way for reporting them through the regular VFS layer.

We have chosen to use early binding of the handler command. See the section "Early versus late binding of the handler command" for more detailed explanations.

Modifications to the file Command

The existing command file_ is modified. Its method **normalize is extended to recognize a new switch, -full. When this switch is specified the method performs a normal expansion of path first , followed by an expansion of any links in the last element of path. It returns the result of the expansion as its own result.

The new signature of the method is

  • file normalize ?-full? path

Command Handler API

The Tcl-level handler command for a reflected filesystem has to support the following subcommands, as listed below. Note that the term ensemble is used to generically describe all command (prefixes) which are able to process subcommands. This TIP is not tied to the recently introduced 'namespace ensemble's.

There are three arguments whose meaning does not change across the methods. They are explained now, and left out of the specifications of the various methods.

root: This is always the path the filesystem is mounted at, i.e. the handle of the filesystem. In other words, it is the part of the absolute path we are operating upon which is 'outside' of the control of this filesystem.

relative: This is always the full path to the file or directory the operation has to work on, relative to root (s.a.). In other words, it is the part of the absolute path we are operating upon which is 'inside' of the control of the reflected filesystem.

actualpath: This is the exact path which was given to the file command which caused the invocation of the handler command. This path can be absolute or relative. If it is absolute then actualpath is identical to "root/relative". Otherwise it can be a sub- or super-path of relative, depending on the current working directory.

And finally the list of methods and their detailed specification.

The initialize Method

handler initalize root

This method is called first, and then never again (for the given root). Its responsibility is to initialize all parts of the filesystem at the Tcl level.

The return value of the method has to be a list containing two elements, the version of the reflection API, and a list containing the names of all methods which are supported by this handler. Any error thrown by the method will prevent the creation of the filesystem and aborts the mount operation which caused the call. The thrown error will appear as error thrown by filesystem mount.

The current version is 1.

The finalize Method

handler finalize root

The method is called when the filesystem was unmounted, and is the last call a handler can receive for a specific root. This happens just before the destruction of the C level data structures. Still, the command handler must not access the filesystem anymore in no way. It is now his responsibility to clean up any internal resources it allocated to this filesystem.

The return value of the method is ignored. Any error thrown by the method is returned as the error of the unmount command.

The access Method

  • handler access root relative actualpath mode

This method is called to determine the "access" permissions for the file (relative).

It has to either return successfully, or signal a POSIX error (See filesystem posixerror. The latter means that the permissions asked for via mode are not compatible with the file.

Any result returned by the method is ignored.

Regular errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

The argument mode is a list containing any of the strings read, write, and exe, the permissions the file has to have for the request to succeed.

  • write contained in mode implies "writable".

  • read contained in mode implies "readable".

  • exe contained in mode implies "executable".

The createdirectory Method

handler createdirectory root relative actualpath

This method has to create a directory with the given name (relative). The command can assume that relative does not exist yet, but the directory relative is in does. The C level of the reflection takes care of this.

Any result returned by the method is ignored.

Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

The deletefile Method

handler deletefile root relative actualpath

This method has to delete the file relative.

Any result returned by the method is ignored.

Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

The fileattributes Method

handler fileattributes root relative actualpath ?index? ?value?

The command has to return a list containing the names of all acceptable attributes, if neither index nor value were specified.

The command has to return the value of the index**th attribute if the _index is specified, but not the value. The attributes are counted in the same order as their names appear in the list returned by a call where neither index nor value were specified. The first attribute is has the index 0.

The command has to set the value of the index**th attribute to _value if both index and value were specified for the call. Any result returned by the method is ignored for this case.

Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

The matchindirectory Method

handler matchindirectory root relative actualpath pattern types perm mac

This method has to return the list of files or directories in the path relative which match the glob pattern, are compatible with the specified list of types, have the given permissions and mac creator/type data. The specified path is always the name of an existing directory.

Note: As the core VFS layer generates requests for directory-only matches from the filesystems involved when performing any type of recursive globbing this subcommand absolutely has to handle such (and file-only) requests correctly or bad things (TM) will happen.

Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

types is a list of strings, interpreted as set. The strings are the names of the types of files the caller is looking for. Allowed strings are: files, and dirs. The command has to return all files which match at least one of the types. If types is empty then all types are valid.

perm is a list of permission strings (i.e. a set), i.e. read, write, and exe. The command has to return all files which have at least all the given permissions. If perm is empty then no permissions are required.

mac is a list containing 2 strings, for Macintosh creator and type. If mac is empty then the data is irrelevant.

The open Method

handler open root relative actualpath mode permissions

This command has to return a list describing the successfully opened file relative, or throw an error describing how the operation failed. The thrown error will appear as error thrown by the open command which caused the invocation of the handler.

The list returned upon success contains at least one and at most two elements. The first element is obligatory and is always the handle of the channel which was created to allow access to the contents of the file.

If the second element is present it will be interpreted as a callback, i.e. a command prefix. This prefix will always be executed as is, i.e. without additional arguments. Any required arguments have to be returned as part of the result of the call to open. This callback is fully specified in section "The channel close callback".

The argument mode specifies if the file is opened for read, write, both, appending, etc. Its value is a string in the set r, w, a, w+, or a+.

The argument permissions determines the native mode the opened file is created with. This is relevant only if the mode actually requests the creation of a non-existing file, i.e. is not r.

Note: it is possible to return a channel implemented through reflection here. See also section "The channel close callback" for more.

The removedirectory Method

handler removedirectory root relative actualpath recursive

This method has to delete the given directory. The argument recursive is a boolean value. The method has to signal the POSIX error "EEXIST" if recursive is false and the directory is not empty. Otherwise it has to attempt to recursively delete the directory and its contents.

Any result returned by the method is ignored.

Regular errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

The stat Method

handler stat root relative actualpath

This method has to return a dictionary containing the stat structure for the file relative.

Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

The following keys and their values have to be provided by the filesystem:

dev: A long integer number, the device number of the path stat was called for. This number is optional and always overwritten by the C level of the filesystem reflection.

ino: A long integer number, the inode number of the path stat was called for.

mode: An integer number, the encoded access mode of the path. It is this mode which is checked by the method access.

nlink: A long integer number, the number of hard links to the specified path.

uid: A long integer number, the id of the user owning the virtual path.

gid: A long integer number, the id of the user group the virtual path belongs to.

size: A long integer number, the true size of the virtual path, in bytes.

atime: A long integer number, the time of the latest access to the path, in seconds since the epoch. Convertible into a readable date/time by the command clock format.

mtime: A long integer number, the time of the latest modification of the path, in seconds since the epoch. Convertible into a readable date/time by the command clock format.

ctime: A long integer number, the time of the path was created, in seconds since the epoch. Convertible into a readable date/time by the command clock format.

type: A string, either directory, or file, describing the type of the given path.

Notes: The stat data is highly Unix-centric, especially device node, inode, and the various ids for file ownership.

While the latter are not that important both device and inode number can be crucial to higher-level algorithms. An example would be a directory walker using the device/inode information to keep itself out of infinite loops generated by symbolic links referring to each other. Returning non-unique device/inode information will most likely cause such a walker to skip over paths under the wrong assumption of having them seen already.

To prevent the various reflected filesystem from stomping over each other with regard to device ids this information will be generated by the common C level of the filesystem reflection.

The inode numbers however have to be assigned by the filesystem itself.

It is possible to make a higher-level algorithm depending on device/inode data aware of the problem with virtual filesystems (and has actually been done, see the Tcllib directory walker), this however is a kludgey solution and should be avoided.

The utime Method

handler utime root relative actualpath atime ctime mtime

This method has to set the access and modification times of the file relative. The access time is set to atime, creation time to ctime, and the modification time is set to mtime. The arguments are positive integer numbers, the number of seconds since the epoch.

Any result returned by the method is ignored.

Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

The copyfile Method

handler copyfile root relative_src actualpath_src relative_dst actualpath_dst

This method is optional. It has to create a copy of a file in the filesystem under a different name, in the same filesystem. This method is not for copying of files between different filesystems and won't be called for such.

Any result returned by the method is ignored.

Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

If this method is not supported the core filesystem layer will fall back to a Tcl & channel based method of copying the file.

The same fallback will happen if the method is available, but signals the POSIX error "EXDEV".

The copydir Method

handler copydir root relative_src actualpath_src relative_dst actualpath_dst

This method is optional. It has to create a recursive copy of a directory in the filesystem under a different name, in the same filesystem. This method is not for copying of directories between different filesystems and won't be called for such.

Any result returned by the method is ignored.

Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

If this method is not supported the core filesystem layer will fall back to a Tcl based method of copying the directory file by file..

The same fallback will happen if the method is available, but signals the POSIX error "EXDEV".

The rename Method

handler rename root relative_src actualpath_src relative_dst actualpath_dst

This method is optional. It has to rename a file in the filesystem, giving it a different name in the same filesystem. This method is not for the renaming of files between different filesystems and won't be called for such.

Any result returned by the method is ignored.

Errors thrown by the method are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

If this method is not supported the core filesystem layer will fall back to a Tcl & channel based method of renaming the file.

The same fallback will happen if the method is available, but signals the POSIX error "EXDEV".

Interaction with Threads and Other Interpreters.

Virtual filesystems in Tcl are process global structures. In other words, they are seen and accessible by all interpreters, and all threads in the current process. For filesystems implemented completely at the C-level this is not that big a problem.

However a filesystem implemented based on the reflection here will always be associated with a Tcl interpreter, the interpreter executing the requested filesystem operations. This cannot be avoided as only the interpreter containing the handler command also has all the state required by it.

The filesystem/interpreter association also implies that any such filesystem is associated with a particular thread, the thread containing that interpreter.

Filesystem requests coming from a different interpreter are handled by executing the driver functionality in the filesystem interpreter instead. In the case of requests coming from a different thread the C level part of the reflection will post specialized events to the filesystem thread, essentially forwarding the invocations of the driver.

When a thread or interpreter is deleted all filesystems mounted with the filesystem mount command using this thread/interpreter as their computing base will be automatically unmounted and deleted as well. This pulls the rug out under the other thread(s) and/or interpreter(s), this however cannot be avoided. Future accesses will either fail because the virtual files are now missing, or will access different files provided by a different filesystem now owning the path.

Interaction with Safe Interpreters

The command filesystem is unsafe and safe interpreters are not allowed to use it. The reason behind this restriction: The ability of mounting filesystems gives a safe interpreter the ability to inject code into a trusted interpreter. The mechanism is as follows:

  • An application using a trusted master interpreter and safe slaves for plugins reads and evaluates a file foo directly in the trusted interpreter.

  • A malicious plugin loaded into one of the safe slaves knows about this file foo, and its actual location. It mounts a virtual filesystem using a driver which is part of its own code, over the directory foo is in.

  • When the trusted interpreter reads foo, it does not go to the native filesystem anymore, but the mounted filesystem. In other words the driver in the slave provides the contents, the code which is executed in the trusted environment. From here on the slave can do anything it wishes in the trusted environment.

  • Access to any other file in the directory can be passed through unchanged to the filesystem originally owning the path.

The Channel Close Callback

The channel close callback is an optional callback which can be set up by the Tcl layer when a file is opened. This is done in the open method, by returning a 2-element list. The first element is the channel handle as usual and the second element the command prefix of the callback.

The command prefix is early-bound, i.e. the command will be resolved when the callback is set up. The resolution happens in the current context, and thus can be anywhere in the application. Because of this it is strongly recommended to use a fully-qualified command name in the callback.

The callback is executed in the current context of the operation which caused the channel to close. It is executed just before the channel is closed by the generic filesystem layer. The callback itself must not call close. It will always be executed as is, i.e. without additional arguments. Any required arguments have to be made part of the prefix when it is set up.

The channel is still live enough at the time of the call to allow seek and read operations. In addition all available data will have been flushed into it already. This means, for example, that the callback can seek to the beginning of the said channel, read its contents and then store the gathered data elsewhere. In other words, this callback is not only crucial to the cleanup of any resources associated with an opened file, but also for the ability to implement a filesystem which can be written to. This does assume that the filesystem does not use a reflected channel to access the contents of the virtual file. If a reflected channel is used however, the close callback is not required, as the finalize method of the channel can be used for the same purpose.

Under normal circumstances return code and any errors thrown by the callback itself are ignored. In that case errors have to be signaled asynchronously, for example by calling bgerror.

Any result returned by the callback is ignored.

Errors thrown by the callback are reported through the registered handler for internal errors, if there is any. They are ignored if no such handler is present.

Note that it is possible that the channel we are working with here is implemented through reflection.

The order in which the various callbacks are called during closing is this:

  • The channel for the file is closed via close by the VFS.

  • The channel close callback has been set up as a regular close handler, and is called now.

  • The close function of the channel driver is called, reflected into the Tcl level and cleans it up.

  • The close operation completes.

The important point here is that the channel close callback set up by the filesystem is definitely called before the reflected channel cleans up its Tcl layer, so the assertion above about the channel being live enough to be read and saved from the filesystem Tcl layer holds even if both filesystem and channel are reflected. It also holds if reflected transformations are involved.

Early versus Late Binding of the Handler Command

We have two principal methods for using the handler command. These are called early and late binding.

Early binding means that the command implementation to use is determined at the time of the creation of the channel, i.e. when chan create is executed, before any methods are called. Afterward it cannot change. The result of the command resolution is stored internally and used until the channel is destroyed. Renaming the handler command has no effect. In other words, the system will automatically call the command under the new name. The destruction of the handler command is intercepted and causes the channel to close as well.

Late binding means that the handler command is stored internally essentially as a string, and this string is mapped to the implementation to use for each and every call to a method of the handler. Renaming the command, or destroying it means that the next call of a handler method will fail, causing the higher level channel command to fail as well. Depending on the method the error message may not be able to explain the reason of that failure.

Another problem with this approach is that the context for the resolution of the command name has to be specified explicitly to avoid problems with relative names. Early binding resolves once, in the context of the chan create. Late binding performs resolution anywhere where channel commands like puts, gets, etc. are called, i.e. in a random context. To prevent problems with different commands of the same name in several namespaces it becomes necessary to force the usage of a specific fixed context for the resolution.

Note that moving a different command into place after renaming the original handler allows the Tcl level to change the implementation dynamically at runtime. This however is not really an advantage over early binding as the early bound command can be written such that it delegates to the actual implementation, and that can then be changed dynamically as well.

Limitations

For now this section documents the existing limitations of the reflection.

The code of the package TclVfs has only a few limitations.

  • One subtlety one has to be aware of is that mixing case-(in)sensitive filesystems and application code may yield unexpected results.

    For example mounting a case-sensitive virtual filesystem into a case-insensitive system (like the standard Windows or MacOS filesystems) and then using this with code relying on case-insensitivity problems will appear when accessing the virtual filesystem.

    Note that application code relying on case-insensitivity will not under Unix either, i.e. is inherently non-portable, and should be fixed.

  • The C-API's for the methods link and lstat are currently not exposed to the Tcl level. This may be done in the future to allow virtual filesystems implemented in Tcl to support the reading and writing of links.

    Note - Exposure of links may require path normalization and native path generation, something the TclVfs implementation does not support. This limitation regarding any type of link, hard or or soft, is quite deeply entrenched in the TclVfs code.

  • The public C-API filesystem function Tcl_FSUtime is Unix-centric, its main data argument is a struct utimbuf *. This structure contains only a single value for both atime and ctime. The method utime of the handler command was nevertheless defined to take separate values for access and creation times, in case that this changes in the future.

  • The Tcl core VFS layer was written very near to regular filesystems and has no way to transport regular Tcl error messages through it. This is the reason for the introduction of the internal error callback. This problem cannot be fixed within the 8.5 line as it requires more extensive changes to the public API. Note that when such changes are done the reflection API has to change as well, as it then allows the direct passing of errors. At that point the C layer of the reflection will have to support both this and the new version of the API.

Examples of Filesystems

The filesystems provided by TclVfs are all examples.

  • webdav

  • ftp sites

  • http sites

  • zip archive

  • tar archive

  • metakit database

  • namespace/procedures as filesystem

  • widget fs

Some examples can be found on the Tcler's Wiki, see pages referring to http://wiki.tcl.tk/11851

  • Encryption

  • Compression

  • Jails

  • Quotas

Reference Implementation

The package TclVfshttp://sourceforge.net/projects/tclvfs/ can serve as the basis for a reference implementation. The final reference implementation will be provided at SourceForge, as an entry in the Tcl Patch Tracker. The exact url will be added here when it becomes available.

Comments

Comments on http://mini.net/tcl/12328 suggest it might be a good idea to modify the 'file attributes' callback to make it more efficient for vfs writers, especially across a network and when vfs's are stacked. Currently one needs to make multiple calls to accomplish anything.

[ Add comments on the document here ]

Copyright

This document has been placed in the public domain.

History