Tangled in the Threads

Jon Udell, November 12, 2001

Hybrid Storage Models

Mixing storage schemes creates opportunities, but poses dilemmas

When an application has to choose between a filesystem and a database, there's never a right answer.

From the moment I first saw Groove, there were two enhancements I craved. One was a bridge between the secure, instant-messaging-based communication that flows within Groove shared spaces and the email-based messaging that flows all around them. The other was a bridge from the file-based storage of conventional apps to the object-based storage of Groove apps. It was impossible to see Groove's sketch app, for example, as anything other than a stand-in for a "real" diagramming tool such as Visio. Could such an app be redirected from the filesystem to Groove's XML object store?

The answer, I learned last week when Groove showed me its forthcoming Office integration technology, is yes and no. Yes, you'll be able to collaborate on a Word document that's synchronized across Groove shared spaces. But no, Word's OLE-based storage model is not magically transposed into that of the Groove object store -- not yet, at least. According to Groove, the issues are partly social and partly technical. On the social side of the equation, live updates to a document that's synchronized across a shared space can be confusing. So updates are batched from a master instance of the document to the rest, and users negotiate who controls the master instance. Technically, though, OLE structured storage -- a "filesystem within a file" -- can yield superior results. Intercepting raw filesystem APIs isn't very helpful, because while all the bytes pass through that interface, none of the semantics do. Hooking the structured storage interface, though, gives access to richer semantics. That's a technique that Groove already exploits in order to integrate with some ActiveX components but, due to the complexity of the task, not yet with Office apps.

Ultimately, of course, the standard file system wants to become a kind of object database, to which applications can persist self-describing objects. Environments such as Groove could then add value -- security, collaboration, synchronization -- by augmenting the built-in persistence mechanism. If the newly-strengthened Microsoft/Groove relationship helps move things in this direction, I'll be delighted. But the road to an object file system is littered with corpses. Remember Cairo? Realistically, we'll be weighing the tradeoffs of hybrid solutions for a long time to come.

Using Zope's LocalFS and ExtFile/ExtImage Products

The relationship between Zope's object database, ZODB, and the underlying filesystem first came up in my programming newsgroup several years ago. Zope seemed to take an all-or-none approach. Either you dumped everything into the ZODB, including static files that might be large in number or in size, or you used Zope in concert with a conventional web server such as Apache. Here's how it looked to me then:

ZODB is ambitious and wonderful, but I suggest that it's important to realize the tried-and-true filesystem will remain a very important foundation for many things. The perception that it is necessary to commit 100% to ZODB is the kind of thing that could slow down adoption a lot. I've explored the idea of keeping node placeholders in Zope's database, to enable its object magic, while sourcing content from the filesystem. Of course, this wouldn't be ideal; you'd lose a lot of the clean manageability of Zope.
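The placeholder idea can be sketched in plain Python. This is only an illustration under stated assumptions, not Zope code: the Placeholder class and the two dicts that stand in for database folders are hypothetical names invented for the example.

```python
import os
import tempfile


class Placeholder:
    """Hypothetical database node: metadata here, content on the filesystem."""

    def __init__(self, path):
        self.path = path          # where the bytes actually live
        self.properties = {}      # metadata kept with the node

    def data(self):
        # Content is read from the filesystem each time it is requested.
        with open(self.path, "rb") as f:
            return f.read()


# Create a sample file to stand in for existing static content.
directory = tempfile.mkdtemp()
path = os.path.join(directory, "picture.jpg")
with open(path, "wb") as f:
    f.write(b"not really a JPEG")

# Plain dicts stand in for folders in an object database.
images, archive = {}, {}
images["picture.jpg"] = Placeholder(path)
images["picture.jpg"].properties["Category"] = "Horses"

# Moving the node between folders leaves the file where it is.
archive["picture.jpg"] = images.pop("picture.jpg")
print(archive["picture.jpg"].properties, len(archive["picture.jpg"].data()))
```

The node carries properties and can be moved around like any other object, but it holds only a path. If the file changes or disappears, the node has no way of knowing until someone asks for the data.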

Over time, there have been several efforts to hybridize Zope and the filesystem. First came a Zope add-in (aka "Product") called LocalFS, which you can use to effectively mount a filesystem directory in Zope. Say you have a collection of images in /var/www/images. You can use LocalFS to associate the Zope mount point /images with that directory. Handy! Watch out for the semantics, though. From the perspective of Zope's management UI, for example, it looks like you can assign properties to the files that are stored on the LocalFS. But things aren't as they appear. If you invoke the property editor on /images/picture.jpg (aka /var/www/images/picture.jpg), nothing prevents you from assigning that object a property called "Category" with a value "Horses." But when you reinvoke the property editor, the newly-assigned property seems to have vanished. In fact, LocalFS did not ignore this request. However, it attached that property not to the file -- which has no permanent Zope representation in this scheme -- but rather to the LocalFS object itself. Similarly, the LocalFS management UI presents, but does not honor, Zope's cut-and-paste semantics. You can upload files to the LocalFS, and delete them, but you can't cut from the LocalFS and paste into ZODB.

The ExtFile Product, of more recent vintage, takes a more granular approach. It implements plug-in replacements for Zope's native File and Image objects. These alternate versions, ExtFile and ExtImage, must be individually created -- you can't just point Zope at /var/www and have it automagically encapsulate what it finds there. Once you do create an ExtFile or an ExtImage, though, it comes much closer to being a real Zope object which happens to store its data on the filesystem rather than in ZODB. If the Zope pathname /images/picture.jpg refers to a picture of a horse stored in /var/www/images/picture.jpg, then you can associate Zope metadata (Category: Horses) with that external image. What's more, you can cut the image from one Zope folder and paste it into another, or delete the image and then undo the delete operation. The files themselves are kept in a filesystem repository which is, by default, /var/reposit.

This is really quite nifty. There are, of course, the obvious caveats. If you delete /var/www/images/picture.jpg from the filesystem, Zope's not going to know a thing about it until somebody asks for the file and finds it missing. And there are subtler semantic issues as well. For example, it took me a while to discover why I couldn't programmatically upload ExtFile objects the same way I upload normal File objects. Here's a snippet of a Zope method to do the latter:

def addFile(self,REQUEST):
    # The upload arrives as a form field named "file".
    id = REQUEST.file.filename
    # manage_addFile is handed the file's contents.
    file = REQUEST.file.read()
    title = ''
    self.manage_addFile(id,file,title)

Given that pattern, I figured I could just call manage_addExtFile instead. Nope. The ExtFile API is slightly different. And because it's implemented as an add-in Product, it turns out that API is accessible in an indirect way, like so:

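def addExtFile(self,REQUEST):
    id = REQUEST.file.filename
    title = ''
    descr = ''
    # Pass the upload object itself, not the result of read().
    file = REQUEST.file
    # The add-in Product's method is reached through manage_addProduct.
    self.manage_addProduct['ExtFile'].manage_addExtFile(id,title,descr,file)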

When you poke around in the ExtFile repository, /var/reposit, you'll see how ExtFile represents Zope's view of what's happening to your external file -- for example, copy_of_picture.jpg, or picture.jpg.undo.

With ExtFile, as with LocalFS, there's a penalty for taking the hybrid approach. You have to remember to back up the repository as well as the ZODB. And Zope won't know about changes applied directly to the repository.
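One way to catch that kind of drift is a periodic audit that compares what the database expects with what the repository actually holds. The following is a minimal sketch in plain Python, not part of ExtFile: audit_repository is a hypothetical helper, and the list of expected filenames is assumed to have been gathered from the database by some other means.

```python
import os
import tempfile


def audit_repository(repository, expected):
    """Compare filenames the database expects with what is really on disk."""
    on_disk = set(os.listdir(repository))
    expected = set(expected)
    missing = sorted(expected - on_disk)   # referenced, but gone from disk
    orphaned = sorted(on_disk - expected)  # on disk, but unknown to the database
    return missing, orphaned


# Build a throwaway repository holding one known file and one stray file.
repository = tempfile.mkdtemp()
for name in ("picture.jpg", "stray.tmp"):
    with open(os.path.join(repository, name), "wb") as f:
        f.write(b"x")

missing, orphaned = audit_repository(repository, ["picture.jpg", "horse.jpg"])
print("missing:", missing)
print("orphaned:", orphaned)
```

A check like this only reports that the two stores disagree. It can't say which side is right, which is the heart of the hybrid tradeoff.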

I'm interested in ExtFile because it's incredibly useful to have users pumping content into ZODB, but you tend to run into memory and performance issues when that content is made up of lots of binary files. I haven't yet implemented an ExtFile-based solution, but size and speed concerns may compel me to try. If the ExtFile/ExtImage functionality were simply an optional mode of the native File/Image objects, things would be simpler. But there's no getting around the fact that mixing storage models is a tricky business.

Hybrid Java storage

A variation on this theme of mixed storage appeared last week in the programming newsgroup, in a discussion about the uses and merits of EJB (Enterprise Java Beans) and J2EE (Java 2 Enterprise Edition).

Mark Wilcox:

Say you're building a web-based HR management system with a database backend. For database access, JDBC of course. A J2EE server will provide you with a DataSource object that will give you access to a JDBC connection. Behind the scenes, the J2EE server handles database connection pooling. You can also utilize the J2EE standard for database transactions, if your app server supports it.

You could write all of your beans as traditional JavaBeans, but there are real benefits to using EJBs. If you put your business logic into stateless session beans (including database access), you can improve performance by utilizing the J2EE server's scalability/redundancy.

Alex Staubo:

I don't see anything obvious about using JDBC. JDBC implies a relational database and specific SQL dialect. What about object databases, flat files, LDAP, in-memory data or XML?

My impression is that most app server containers assume that a single storage will carry the burden of storing all the data, which is annoying if you want to store your high-resolution photo library images in ObjectStore, your geographical map data in PostGIS, and your satellite tracking data in Oracle.

Will JDO (Java Data Objects), proposed as a generic scheme for Java persistence, save the day? What JDO aims for is the holy grail of transparent persistence. In other words, just build your objects in memory, and let an insulating layer of software sort out the object-relational mapping, if the datastore is an SQL engine, or insert the bytecode hooks, if it's an object database under the covers, or muck with the filesystem if that's where stuff really lives.
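To make the idea of an insulating layer concrete, here is a rough sketch in Python rather than Java, and emphatically not JDO. The FileStore, SqlStore, and Router classes are hypothetical names invented for the example, and the in-memory SQLite database simply stands in for an SQL engine. The application talks to one put/get interface while each kind of data lands in a different store.

```python
import os
import sqlite3
import tempfile


class FileStore:
    """Keeps each value as a file in one directory."""

    def __init__(self, directory):
        self.directory = directory

    def put(self, key, data):
        with open(os.path.join(self.directory, key), "wb") as f:
            f.write(data)

    def get(self, key):
        with open(os.path.join(self.directory, key), "rb") as f:
            return f.read()


class SqlStore:
    """Keeps each value as a row in an SQL table."""

    def __init__(self, connection):
        self.db = connection
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS items (k TEXT PRIMARY KEY, v BLOB)")

    def put(self, key, data):
        self.db.execute("INSERT OR REPLACE INTO items VALUES (?, ?)", (key, data))
        self.db.commit()

    def get(self, key):
        row = self.db.execute(
            "SELECT v FROM items WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return bytes(row[0])


class Router:
    """Sends each kind of data to the store registered for it."""

    def __init__(self, stores):
        self.stores = stores

    def put(self, kind, key, data):
        self.stores[kind].put(key, data)

    def get(self, kind, key):
        return self.stores[kind].get(key)


router = Router({
    "photo": FileStore(tempfile.mkdtemp()),
    "tracking": SqlStore(sqlite3.connect(":memory:")),
})
router.put("photo", "horse.jpg", b"image bytes")
router.put("tracking", "sat-1", b"orbit data")
```

Note how far this falls short of transparency. The caller still has to name the kind of data, and everything is flattened to bytes, so queries, transactions, and object identity are all left out.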

Indications are that JDO will not anytime soon deliver the kind of transparency we would all like to see. Nor, I guess, should that be surprising. We're not even close to the point where software designers can regard the choice of a storage model as an implementation decision, something to postpone until the hard problems of UI, behavior, and communication are solved. Applications wire themselves deeply to their storage models, and cannot easily be made to think otherwise.

While there's no immediate answer to this problem, I'm nevertheless hopeful. The notion of a universal canvas for rendering and editing information, backed by a universal data store, is compelling. People expect computers to work this way, and they have little tolerance for the endless frustration that results when data formats and storage models can't cooperate with one another. The unification of storage regimes -- including the file system, XML, SQL, and object data -- is a long-term project which, sooner or later, simply must succeed. Meanwhile, use hybrid strategies when the tradeoffs make sense.


Jon Udell (http://udell.roninhouse.com/) was BYTE Magazine's executive editor for new media, the architect of the original www.byte.com, and author of BYTE's Web Project column. He is the author of Practical Internet Groupware, from O'Reilly and Associates. Jon now works as an independent Web/Internet consultant. His recent BYTE.com columns are archived at http://www.byte.com/tangled/

This work is licensed under a Creative Commons License.