mewsings, a blog

--dawn


Tuesday, January 02, 2007

OTLT: Metadata Piece Not Apartheid

Yes, I am going to reopen a can of worms. I first saw the acronym OTLT, for One True Lookup Table, when I read this Celko article. The article pretty much made sense to me then, as it does now. I would not recommend the use of OTLT to anyone wishing to use an SQL-DBMS properly as a relational database. As a few of you might guess, I will suggest that some might be well-served using something other than a proper SQL-DBMS or using an SQL-DBMS other than properly.

Celko writes "I am going to venture a guess that this idea came from OO programmers" in the above-mentioned article. However, OTLT (or OTLF) is a design pattern I encountered before I ever heard of OOP, or of design patterns for that matter. It even predates our use of the term table in database design, and might be called a code file or validation file, for example. It was at the point where this design pattern was brought into the world of relational tables that it switched from being a pattern to an anti-pattern.

Many moons ago, this code file pattern was employed with implementations using indexed sequential files, e.g. MIDAS files on Primos or VSAM files on MVS on an IBM 3081, and I'm pretty sure it was around well before my encounters with it.

In case you are not familiar with this pattern, you can see typical values in the picture below. It represents a simple function--put a multipart key of code_type and code_value into the vending machine to retrieve the code_description. These attribute names, as provided by Celko in that article, are pretty much what I recall, except that the names I remember were in all caps, with dashes in place of underscores.

[Figure: OTLT, a lookup table with typical code_type, code_value, and code_description rows]
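
In SQL terms, the shape behind that picture is roughly the following. The attribute names are the ones just mentioned, and Validation_Codes is the table name I will use later in this post; the DDL itself is only my sketch, and the details will vary by dialect.

  CREATE TABLE Validation_Codes
  ( code_type        VARCHAR(30)  NOT NULL,   -- e.g. 'Marital_Status', 'State'
    code_value       VARCHAR(30)  NOT NULL,   -- e.g. 'M', 'IA'
    code_description VARCHAR(100) NOT NULL,   -- e.g. 'Married', 'Iowa'
    PRIMARY KEY (code_type, code_value) );

  -- a few typical rows
  INSERT INTO Validation_Codes (code_type, code_value, code_description)
    VALUES ('Marital_Status', 'M', 'Married');
  INSERT INTO Validation_Codes (code_type, code_value, code_description)
    VALUES ('Marital_Status', 'S', 'Single');
  INSERT INTO Validation_Codes (code_type, code_value, code_description)
    VALUES ('State', 'IA', 'Iowa');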

Once upon a time, data were not democratized, and we could treat some differently than others within the data model. Now we talk about attributes having types, such as varchar or int, while tables are all seen as somewhat of the same type, i.e. Relation. [Yes, these Relations are of type Customer, Order, Person, or whatever, but I only state that so you do not wander off-course on that yourself.] Our profession once spoke of file types of a different ilk. Back when the database was perceived as a set of files, rather than tables, these files were identified with various type designations. There might be files of type:

  • master (named with nouns corresponding to strong entities)
  • transaction (events)
  • log
  • history
  • parm
  • control
  • code (we will focus on this one)
  • enzovoort (Dutch for "and so forth").

We will zero in on the code files. Historically, some systems used separateness, or apartheid in Dutch, for each type of code. Other systems split out the really large or more complex code files as separate files but kept the rest of the codes in a single piece, pouring the data from all other potential code files into one big validation file. One rule of thumb that I recall was that if there were more than a thousand entries, we were more likely to put such codes in their own file (so as not to adversely affect the performance of all applications doing lookups).

No SQL

Let's look at some of the objections to using this pattern with an SQL-DBMS to see how these play out in the case where we are using another DBMS solution that is not strictly a SQL-DBMS, such as Pick (see Is Codd Dead? for a list of Pick databases). Near the end of Celko's article, he writes that it is a "data modeling principle that a well-designed table is a set of things of the same kind instead of a pile of unrelated items." When pouring all of our code types, values, and descriptions together into one component, are we really talking about unrelated items? They certainly exhibit a similar pattern.

Relational databases, or approximations thereof, split data into two distinct groupings often referred to as data and metadata. So what are code files--data or metadata? These code files are not quite the same as typical user data. A table such as MaritalCodes doesn't relate to a traditional business proposition or entity quite like the Customer table does. With modeled propositions for a business, it is common for purposes of data quality, ease of data entry, compact representation for reports, and storage efficiency to encode values with abbreviations, aka codes, that stand for the original value within a proposition. With the value encoded in a code file, we can then look up the longer description when needed.

Code files host data that is both data and metadata, or tween data.

I'm guessing you understand why and how a code file in the old model translates to a lookup table in an SQL-DBMS. Such lookup data, equating an encoded term with a description, is also validation data. So, we might also call this a validation table. It is clearer now with GUIs that lookups for valid codes and data validation are two sides of the same coin. The list of valid codes is often used to populate a GUI widget so there is no chance of any other values being entered by the user. The values in these old code files might also compose a list for a check constraint. Code files host data that is both data and metadata, or what we might term tween data.
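
As a sketch, using the Validation_Codes names from above, the same rows do double duty: one query fills the widget and another checks what comes back. The :submitted_value marker simply stands in for a value supplied by the application.

  -- populate a GUI dropdown for marital status
  SELECT code_value, code_description
    FROM Validation_Codes
   WHERE code_type = 'Marital_Status'
   ORDER BY code_description;

  -- validate a submitted value; zero rows back means reject it
  SELECT code_value
    FROM Validation_Codes
   WHERE code_type  = 'Marital_Status'
     AND code_value = :submitted_value;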

Unlike hard-coded check constraint values, our code files often have data entry forms that permit the valid list of entries for some code types to be maintained by the end user. They are part of the data integrity function of constraint handling. They are metadata that users see, some of which serves as data too. They are used for validation as well as representation, turning codes into descriptions when needed. Such codes serve the purpose of standardizing attribute values and indicating valid entries throughout a software system.

Systems that store metadata as data, which includes most software written to be implemented using multiple DBMS solutions, might indicate that the marital_status attribute of the Person relation is validated using the Marital_Status code type in the Validation_Codes or One_True_Lookup table.
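
A sketch of how such a system might record that, using a hypothetical Attribute_Validations table (my name, not any particular product's) to hold the metadata as data:

  -- which code type validates which attribute
  CREATE TABLE Attribute_Validations
  ( table_name     VARCHAR(30) NOT NULL,   -- e.g. 'Person'
    attribute_name VARCHAR(30) NOT NULL,   -- e.g. 'marital_status'
    code_type      VARCHAR(30) NOT NULL,   -- points into Validation_Codes
    PRIMARY KEY (table_name, attribute_name) );

  INSERT INTO Attribute_Validations (table_name, attribute_name, code_type)
    VALUES ('Person', 'marital_status', 'Marital_Status');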

Other metadata that might need to be housed as data includes attribute names and file names, for example, for building business rules. Tween data is both data and metadata, and it is everywhere. So, how should we model such tween data?

We might need to model enough tween data to specify rules such as

if (order_amount > 1500) requires_authorization = "Yes"

These files are part of the application system architecture. Just as we design entities for user data, when we look at the entities involved in developing and maintaining the application itself, some standard entities such as Attributes, Files (or maybe Tables or Relations), Rules, and Validations might make their way into every application. These entities are not specific to the domain of this application; they are required because this is a software application, not because this business is in the xyz vertical market.
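
As a sketch, the authorization rule above might live in a hypothetical Rules entity rather than being buried in code; every name here is illustrative:

  -- business rules stored as tween data
  CREATE TABLE Rules
  ( rule_id          INTEGER     NOT NULL PRIMARY KEY,
    table_name       VARCHAR(30) NOT NULL,   -- e.g. 'Order'
    attribute_name   VARCHAR(30) NOT NULL,   -- e.g. 'order_amount'
    comparison       VARCHAR(2)  NOT NULL,   -- e.g. '>'
    threshold        VARCHAR(30) NOT NULL,   -- e.g. '1500'
    target_attribute VARCHAR(30) NOT NULL,   -- e.g. 'requires_authorization'
    target_value     VARCHAR(30) NOT NULL ); -- e.g. 'Yes'

  INSERT INTO Rules (rule_id, table_name, attribute_name, comparison,
                     threshold, target_attribute, target_value)
    VALUES (1, 'Order', 'order_amount', '>', '1500', 'requires_authorization', 'Yes');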

Tween data is both data and metadata and it is everywhere

How might one make a design decision about whether to pour all such codes together in OTLT rather than separating them out? Let's see, we want tight cohesion and loose coupling, right? We have the option of modeling a single proposition with a type of code, a value for that code, and a description. The implementation of this proposition can then be used for dropdowns and other GUI features as well as for validation and representation. Alternatively, we can model a set of propositions that all sound remarkably similar, with codes and their descriptions, elevating the type of code to the name of a Table rather than having it as an attribute in our Validation Table. But after designing the n-th such table, my patience starts to DRY up (see the DRY Wikipedia entry).
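
As a sketch of that second option, each code type gets promoted to a table of its own, and the DDL repeats itself (MaritalCodes comes from earlier in this post; StateCodes is just illustrative):

  -- one table per code type
  CREATE TABLE MaritalCodes
  ( code_value       VARCHAR(30)  NOT NULL PRIMARY KEY,
    code_description VARCHAR(100) NOT NULL );

  CREATE TABLE StateCodes
  ( code_value       VARCHAR(30)  NOT NULL PRIMARY KEY,
    code_description VARCHAR(100) NOT NULL );

  -- ...and so on, once per code type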

Admittedly, while we then have our codes together, these codes relate to different things. That will be a problem if we want to add an attribute for one of these types and not for another, while splitting them out is more problematic if we opt to do something consistently across the board with all of our codes and must repeat the procedure for each such table. The former is typically mitigated by a) using tools where refactoring the database when requirements change is standard fare (that is, using an agile database, such as IBM U2, Revelation, InterSystems Caché, or OpenQM) or b) adding an attribute for further classification of a code, such as code_classifier. Different values for the code_classifier can then be used to control the logic performed.
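
Option b) might amount to no more than the following sketch, with the application branching on the classifier (dialects differ on the ALTER syntax, and the classifier value is illustrative):

  ALTER TABLE Validation_Codes ADD code_classifier VARCHAR(30);

  -- e.g. mark the code types whose valid lists end users may maintain
  UPDATE Validation_Codes
     SET code_classifier = 'User_Maintained'
   WHERE code_type = 'Marital_Status';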

Let's look at more of the objections to the use of OTLT, with the assumption that we are using a MultiValue, aka Pick, DBMS. Not coming from the world of 80-column card input (see the picture in To Whom Should Size Matter?), Pick handles variable-length data really well. In fact, neither types nor lengths are enforced by the DBMS, prompting me to initially conclude that it could not be considered a DBMS, an opinion I have since reconsidered. In any case, this objection (that a single declared type and length must serve every kind of code value), while relevant to SQL-DBMSs, is irrelevant in other DBMSs, including Pick.

Another objection is that software is capable of putting the wrong data in for the code_type, so that a state abbreviation might become a marital status, for example. Well, as it turns out, software can put a multitude of incorrect values into a database. Its job is not to do that, and while DBMS-specified constraints might contribute to one kind of data quality, they certainly do not ensure data quality. As an aside, I would argue that they have even contributed to poor-quality software, forcing developers and end-users alike to play games to try to match values to type, even where ill-advised. The bottom line on this one is that if your DBMS does not enforce types, there is no related problem with this pattern. One of the code_type values could be Valid_Code_Types, so that the application that permits the addition of code_values for all code_types also permits new values here, with appropriate security, of course.
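
As a sketch, the list of valid code types can then live right in the table it governs:

  -- the code types themselves are just another code type
  INSERT INTO Validation_Codes (code_type, code_value, code_description)
    VALUES ('Valid_Code_Types', 'Marital_Status', 'Marital status codes');
  INSERT INTO Validation_Codes (code_type, code_value, code_description)
    VALUES ('Valid_Code_Types', 'State', 'State abbreviations');
  INSERT INTO Validation_Codes (code_type, code_value, code_description)
    VALUES ('Valid_Code_Types', 'Valid_Code_Types', 'The list of code types itself');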

I don't like discussing performance, but I definitely recognize that many a software project has failed due to inadequate performance. Celko brings up the fact that searching through a larger table is less efficient than searching through a smaller one when finding a particular value or a set thereof. As long as we do not make the file or table (whichever is being used) way too large, this file of validation codes is an obvious choice to have in cache throughout the run-time of a software application.
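
A single pass at start-up is enough to warm such a cache; something along these lines, give or take the dialect:

  -- load the whole validation set once and cache it in the application
  SELECT code_type, code_value, code_description
    FROM Validation_Codes
   ORDER BY code_type, code_value;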

There is no doubt that there are tasks one can perform that are slower and some that are faster using one pattern or another, so obviously there is room for taking into consideration the type of activity that will be performed with codes in any given system. Using Pick, there are even some who design the big code validation file to have one item (aka record) per code type, with associated multivalues for the valid values and descriptions. One can then have one disk read to suck in the entire validation list for a particular attribute, such as the entire set of state codes and their descriptions.

One of the features I like best about using one big code validation file is that you can write one maintenance form for this file and it handily covers all of these, even future validation code types. Sure, you will also need a code type listing which code types are permitted, and you might want to apply some different validations for different types of codes (more common in theory than in practice, I suspect).
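
A sketch of why one form suffices: a single pair of parameterized statements serves every code type the form will ever see (the :host_variable markers again stand in for values from the form):

  INSERT INTO Validation_Codes (code_type, code_value, code_description)
    VALUES (:code_type, :code_value, :code_description);

  UPDATE Validation_Codes
     SET code_description = :code_description
   WHERE code_type  = :code_type
     AND code_value = :code_value;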

Companies...with a need for a lot of tween data might be well-served using SQL-DBMS tools as if they were file systems

Some might be convinced that when using something other than a traditional SQL-DBMS, it might sometimes make sense to model the tween data in a way that acknowledges that it also serves as metadata. But this could also be relevant to SQL-DBMS users. I will end with a heresy (as if I haven't been spouting such all along, eh?). Companies and developers with a need for a lot of tween data might be well-served using SQL-DBMS tools as if they were file systems, ignoring some of the relational dogma. If you minimize your use of SQL constraints and the like, so that you are not duplicating metadata by putting it both in the SQL-DBMS metadata catalog and in the database itself, you really can use the OTLT pattern. Of course, you should be aware of the tradeoffs in doing so.

While I recognize this is not the weighty decision that Carter lays out in his controversial book Palestine: Peace Not Apartheid, for any given software metadata design for code files, you simply need to decide whether to go with a single cohesive piece of tween data or whether you favor relational apartheid.


1 Comment:

At 4:30 PM, August 09, 2007, Anonymous said...

The big table is not going to perform noticeably worse than lots of little tables. Reason: it will in all likelihood be indexed by its primary key, a two-part key consisting of code type and code. This index will support a range scan that produces all the codes for a code type without breaking a sweat. (Celko has his column order backwards: even though the code type is less selective, you'd normally want it first in the index because there is never a case where you're joining to the table and you don't have the code type anyway.) True, the range scan is less efficient than a full-table scan of a distinct table, but the difference isn't noticeable, especially if you store the payload values in the index itself using a feature such as Oracle's index-organized tables, or, better yet, if you cache portions of the table (again by code type) in your application anyway.
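
Something along these lines, to make it concrete (an Oracle-flavored sketch using the column names from the post; details vary by product):

  -- composite primary key with code type first; ORGANIZATION INDEX stores the payload in the index
  CREATE TABLE Validation_Codes
  ( code_type        VARCHAR2(30)  NOT NULL,
    code_value       VARCHAR2(30)  NOT NULL,
    code_description VARCHAR2(100) NOT NULL,
    CONSTRAINT pk_validation_codes PRIMARY KEY (code_type, code_value) )
  ORGANIZATION INDEX;

  -- a range scan on the leading column pulls back all codes of one type
  SELECT code_value, code_description
    FROM Validation_Codes
   WHERE code_type = 'Marital_Status';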

The negatives for the single table are outweighed by several advantages, some of which you've touched on. The biggies are:

1. You don't have to do DDL every time you need a new code type. You don't end up with 1000s of tiny tables in your schema.

2. You can easily build a single set of maintenance code, including a UI, data loaders, etc. that supports an unlimited number of codes. Doing that with 1000s of tiny tables is a huge headache.

The scalar data type of the code value is rarely a problem, but if it is, you can certainly use the pattern once for each scalar type, separating the integer-based codes from the string-based codes, and avoid the horrible series of check constraints in Celko's article.

 
