Jon Udell: Documents, databases, and XML

Tangled in the Threads
Jon Udell, May 17, 2000
Documents, databases, and XML

Are documents elements of databases, or databases in their own right? Well, both.

In an earlier column I talked about the difficulty of fitting object data into the "Procrustean bed" of relational storage. And I said that ODBMS technology, newly reincarnated as "XML data serving," continues to look like a useful -- though admittedly problematic -- alternative.

It's not a black-or-white issue, of course, and follow-on discussion revealed many shades of gray. One contributor, rpeterson, wrote:

SQL has several well-know shortcomings. Most notable, in my opinion, is the inability to work with variable depth hierarchies. Or more generally, graphs.

So why not fix SQL? Why force legions of well trained experts to aquire completely new skills in order fix these shortcomings?

This reminds me of the Road Runner. Every time Wile E. Coyote's contraptions fail, he thows them out the window, and starts over from scratch. WHY????

Replied kwillets:

Recursive queries have been in the standard for years. Unfortunately only a few vendors have made it past SQL 92. IBM is probably the most prominent.

But can such queries be optimized to anywhere near the level of performance that you'd get using a comparable OODB schema?

kwillets:

Certainly. Relational vs. ODB is not a performance issue in my book, since either model can be implemented in the same way. UniSQL uses pointers, objects, *and* SQL. All databases share the same problems of storage modelling, concurrency control, atomicity, and so on.

However I prefer non-procedural languages, simply because they're decoupled from performance issues. SQL queries can be parallelized much more often than nested loops.

Database guru Ken North agreed that SQL's inability to work with tree- or graph-structured data is a "common myth":

I spoke with Dick Beck of Information Builders about the origin of the row-and-column myth. Dick developed much of the logic to optimize SQL queries for EDA/SQL, which can query hierarchical, relational, CODASYL, and flat-file data. I asked him, "Why do you think people have a problem with the concept of using SQL to operate on XML documents and trees? It's not rocket science. Developers have been using SQL for years to do bill-of-materials processing, which is also hierarchical data." Dick's reply was insightful. He said, "Most of the early articles published about relational databases talked about rows and columns, instead of tuples and attributes. People see the row-and-column model of spreadsheets, and think that's how the data is physically laid out in a database."

In reality, the physical organization of data in a database (physical model) is separate from the logical model.

Chapter 26 of Joe Celko's SQL For Smarties (Morgan Kaufmann) discusses tree structures and directed graphs. Joe also wrote an SQL column for DBMS and those columns discussed solutions such as the nested set model and adjacency model for tree processing: "A Look at SQL Trees", http://www.dbmsmag.com/9603d06.html

David Rozenshtein teaches a seminar about tree and graph processing in SQL: http://www.dci.com/events/sql/

Thanks for those pointers, Ken! SQL For Smarties is, by the way, a terrific book. Celko's deep experience shines through on every page. With regard to SQL tree processing, though, I confess I'm not clever enough to apply the lessons taught in the chapter to which Ken refers. The technique is, I think it's fair to say, not a natural or easy way to use SQL.

For Oracle developers, there's a special CONNECT BY syntax that's used to simplify tree processing. Philip Greenspun has documented its use in a chapter of SQL for Web Nerds. But, as Ken reminds us, that's an Oracle-only technique, whereas the technique described by Celko is general.

Ken also referred to a panel discussion (audio) at last October's XMLOne conference, at which "gurus from Informix, IBM, Microsoft, and Oracle discussed how SQL databases support XML documents, which are variable-depth hierarchies."

Documents/databases, particles/waves

Listening to that discussion, I was again struck by the the kind of particle/wave duality that governs discussions about documents and databases. Is a document a database of elements, or an element of a database, or both?

One person on that panel echoed the point that SQL can process nested structures. Another noted that Informix takes the data-blade approach for XML, rather than trying to decompose XML into relational form. Another noted that SQL Server 7.5 will do on-the-fly transforms between SQL and XML.

It sounds like the industry is all over the map on this issue. Not surprisingly, I guess. Much of the world's data is in documents. As the XML discipline increasingly governs the creation, storage, search, and transmission of documents, their database-like properties will matter more and more.

So where's the common ground? Should documents should be represented, within SQL databases, in a highly granular way, with SQL-style tree processing used to manipulate elements at the intra-document level?

Or should documents be represented, within SQL databases, as blobs of XML with metadata attributes, such that the metadata is exposed to SQL querying but XML APIs govern the intra-document level?

Ken North:

The operative word is probably AND, not OR.

Microsoft and Oracle are extending their SQL dialects to support XML processing. Oracle lets you use Xpath expressions, for example. Interest in XML is high so we'll eventually see graphical tools that use tree diagrams to query XML, SQL, or SQL/XML data.

This article discusses how Oracle is tackling XML.

Perhaps it's germane to recall why the concept of the database and the DBMS came about. Before the DBMS, applications managed data on an ad hoc basis. The DBMS was conceived to make data a shared resource that is accessed in a consistent manner. Given the computing power we have today, and the power we can foresee for the future, it's time to start thinking in terms of containers for collections of documents, not using a file as a container for a single document.

XML applications haven't yet reached a maturity level where most XML developers are concerned about scalability issues -- managing very large collections, version control, concurrency, and so on. Database companies have already invested years in solving those problems. They've researched parallel architectures, query optimization and other technologies that will become important as the XML adoption curve goes up.

Oracle, Informix, and DB2 give you several alternatives for managing XML. You can store XML docs as single columns, or decompose them into multiple columns. You can also store information about the document's structure. The mechanics of storing structure are different for Informix, Oracle, and DB2, and obviously you have to make decisions based on application requirements.

In one of my classes, someone asked why anyone would want to store an XML document in a single column (instead of parsing the documents for insertion into multiple columns). Another person piped up that his application required an audit trail of documents for legal reasons.

Even when you store a document as a column, you make application-dependent decisions. You might store a tax form in a column, for example, but you'd probably break up a book.

There are reasons the industry moved away from the hierarchical and network models for databases to embrace relational technology and SQL. Flexibility, granularity, and ad hoc queries are important for many applications.

Most environments will benefit from granular access to data. However, application requirements will dictate when to store XML docs as blobs. Regardless of immediate requirements, you should consider investing in a database platform that supports all methods of storing XML, and does queries over non-XML data.

Thanks again, Ken! Your thoughtful commentary convinces me that I've been a bit naive in my portrayal of the "impedance mismatch" between object and relational storage.

Beyond MySQL: PostgreSQL builds momentum

The investment that Ken recommends is, of course, substantial. One of the reasons that Philip Greenspun's ArsDigita Community System is not more widely used is that, while the ACS itself is an open source product, as are its AOLserver and Tcl components, the ACS is an Oracle 8i application, and Oracle 8i is simply not within the means of many of the people who are today building websites around, for example, MySQL.

Of course, things are never just black or white, are they? MySQL isn't the only open source database. PostgreSQL (originally known, and still often referred to, as Postgres), which became Illustra and then was merged into Informix's Universal Server, is another key player. PostgreSQL's career as an open source product has had some fits and starts, but it's been on the move for several years now and seems to be reaching critical mass.

In an interview with Linux Magazine, Al Ritter, CEO of Great Bridge LLC, comments on his company's charter to "do for databases what Red Hat did for Linux and operating systems." Exciting stuff! And just to close the loop, on the Great Bridge site you can find a link to OpenACS, a project that aims to port ACS from Oracle 8i to PostgreSQL.

We certainly do live in interesting times!

Jon Udell (http://udell.roninhouse.com/) was BYTE Magazine's executive editor for new media, the architect of the original www.byte.com, and author of BYTE's Web Project column. He's now an independent Web/Internet consultant, and is the author of Practical Internet Groupware, from O'Reilly and Associates. His recent BYTE.com columns are archived at http://www.byte.com/index/threads

This work is licensed under a Creative Commons License.