Tincat Group, Inc. - Mewsings, a Software Development and Data Modeling Blog

mewsings, a blog

--dawn

Thursday, July 27, 2006

To Whom Should Size Matter?

The Grand Tetons are incredibly impressive. They are big, really big. Combining business and pleasure, with one trip this year I went from Iowa to Washington D.C. and with another from Iowa to Washington State, both by car. I was able to see people and places, taking pictures from coast to coast. I love mountains and water, the bigger the better. In my home state of Michigan, the Great Lakes are great because of their size. Of course there are very impressive small points of interest too. When it comes to physical things, size matters.

Physical objects differ from words in that respect. It makes sense to care about the height of a mountain, the size of a portion of food, the length of a skirt, or the size of a vehicle. It rarely makes sense to care about the size of a word, with word games being an exception.

You might recall that feeling when you want to play all seven letters in Scrabble, maybe even on a triple word score, but it doesn't fit. The board is not long enough, or there are not enough open squares to play the word. Rats!

A similar game is often played by users of database management systems and related applications. The word you want stored in the database is just a little too long for the data entry screen to handle or the DBMS to accept. Some DBMS tools require or at least strongly encourage size constraints on data attribute values. Others do not. To an SQL-DBMS, size matters.

Focusing on this one specific type of constraint, let's take a look at reasons a software development team might design using a maximum length constraint (maxlen) rather than permitting variable lengths for values of a specific attribute.

Input

Perhaps the input device and related software provides for limited space for input data. The punch card is a good example. While variable length attribute values could be encoded on punch cards using delimiters between values such as deBoer,Reta,piano, to ensure that a fixed list of attribute values for each person fits on a single card, we might design the musical instrument value to be in card columns 33 through 46. The maxlen for all values of that attribute would then be 14.

Developers once had to design for input from 23 x 80 character screens too—remember those? Even with a fixed screen size, however, users could be given scrolling entry fields so entered data could be larger than what is shown on the screen at any given time.

We should consider whether we would want the restrictions of an input device to dictate database-level constraints. We might also consider that today's typical screen entry or data exchange has no such issues. While screen size is still a factor for designers, this technology requirement need not whittle down our original business requirements. I feel comfortable relegating this reason for a maxlen to the past.
Output

I still have my nifty 132-column ruler for designing greenbar reports. When we could fit only 132 characters in a fixed-length font on a single line of a report printed on greenbar, we naturally cared about attribute sizes. When designing paper forms such as payroll advices, it makes sense to care about the maxlen for names compared to the number of available characters on the form. The representation of the value of any particular attribute might be constrained by the design of a report. Should we also constrain the actual database values and not just the representation of those values? Perhaps there is a reason to do so in some cases, such as the codes discussed next, but, in general, no.
Lookup Table
When modeling data, we often swap out some terms within a proposition for succinct, consistent codes. For example, start with a proposition such as:

Jan (first) VanDoe (last) lives in Illinois (state).

It is typical to model the predicate for this proposition with two base relations, Person(first,last,stateCode) and State(stateCode,stateName). The original predicate is then modeled with a view that joins these two giving PersonView(first,last,stateName). In this example, a code of IL would be in the Person relation, mapping to the name Illinois in the State relation.

This is a common pattern in data modeling. In their Refactoring Databases book, Ambler and Sadalage label a change of this nature to an existing schema as Add Lookup Table. So instead of calling this the code file pattern, a term used back when we also talked about master files and transaction files, I'm OK with calling this the Lookup Table pattern, while still referring to the attribute (such as stateCode above) as a code.

Is there a good reason to constrain the length of codes? Yes. Lookup tables are all about constraining propositions to aid in conformity. This helps with accuracy of captured data and ease in performing analysis, for example. Placing a maxlen on the code does not further constrain nor alter the original proposition derived from business requirements. There are advantages to developers in fixing a maxlen, making such a design attractive in cases where it cannot adversely affect data integrity, such as truncated values, or unnecessarily restrict flexibility.

Now comes the question of who should care about the maxlen on codes for lookup tables. Have you ever seen a case where there is a requirement to change the length of a code? If not, I've seen enough such cases for both of us. In every case I can recall, the change was to a larger maxlen, although I can imagine the other scenario. In the rare case of changing to a smaller maxlen on a code, there must be a mapping from the longer attribute values to the new shorter maxlen, whether truncation or a fancier algorithm, and the existing database would likely need to be modified so it fits within the new constraints.

There are DBMS products that tightly couple the logical constraints with the physical implementation

In the much more common case of needing to upsize the maxlen, there is no similar requirement to change the existing database. The new requirement is less restrictive than the former requirement, so the database is logically already in compliance with the new maxlen. It might even be feasible to roll such a change in constraints out to users along with other maintainable business rules. Unfortunately, there are DBMS products that tightly couple the logical constraints with the physical implementation of the database. Such products actually make computer resource allocations based on the logical maxlen constraints!

In fact, every SQL-DBMS product I have seen requires some maintenance activity on the physical database simply because of an increase to a maxlen, a relaxing of the constraint. While there are products that have no such issue, I'm guessing that most of those pre-date the introduction of the relational model. Coincidence? [I have been corrected on this. Apparently this is not the case for Oracle, so I would guess others as well. I'm curious now what an average cost estimate might be for a requirement to increase a maxlen in a schema since I'm guessing estimates might have influenced my assumption. Regression testing might be the biggest part of the estimate.]

Back to our codes where we want to design in a maxlen, where should this maxlen be encoded? While some might be inclined to specify this constraint to the DBMS, you can see the danger in cases where such a maxlen specification is used for the dual purpose of physical implementation design. I recently saw a SQL-DBMS schema example where every attribute was designated with a maxlen of 50, with 50 being larger than the max logical maxlen for all attributes. Such an approach mitigates the issues with the tight coupling of logical and physical.

Where should the logical maxlen be placed in a logical data model? Given that all UI and web service validation routines need access to this constraint information, it can be made available by being captured as other business rules might be, in the database itself. I suggest that the database proper (rather than the schema) is a good place for specifying all business constraints. There must then be enforcement of the use of standard CRUD services that employ such constraint logic for all software maintaining the database. This provides more flexibility for the business and I suggest the cost of ownership is lower than when any business rules are specified to the DBMS.
Processing
As with punch cards, size matters to COBOL

I'll toss in a short note that if software is being developed with one of the older languages for processing data that require maxlen on variables (e.g. COBOL), then you are stuck specifying the maxlen to your software components for the purpose of reading or writing data. As with punch cards, size matters to COBOL. As sweet as I suspect it can still be to write business applications in COBOL, I'm comfortable making this an historical footnote.
Performance
Some might say that decoupling logical and physical maxlens could adversely affect performance. If that is the case, I would suggest that the industry move to a data model that does not have this defect. Yes, of course I know that the computer is a finite resource and there is a need for some information for performance tuning and optimization. But(3) it(2) is(2) not(3) essential(9) that(4) each(4) individual(10) attribute(9)...you get the point, right?
Space
Similarly, if the amount of disk or other resources is adversely affected by decoupling logical and physical constraints, I would suggest the industry move to a data model that does not have this defect.

While the models I know that fit the bill do have other defects, including logical/physical coupling in other respects, they provide greater flexibility for those changes that companies are likely to make to their data over time. That is because they take physical implementation information that is irrelevant to the user and the business requirements and require it of the logical model rather than the other way around. For example, because such things as the order of stored attributes is irrelevant to the business user, there need not ever be a requirement to change that order in either the logical data model or the physical implementation.
Real Input or Output Business Requirement
Changing the name, my cousin told me how she has let the relevant parties know that she absolutely does not want her hyphenated last name of Oostendorp-Holland to be truncated on documents, credit cards, etc as Oostendorp-Ho again, pa-lease. Given that she is an influential person, I'm guessing these parties are trying to comply, but it might be too costly for them to do so.

What should a company do when there is a business case for having a maximum length on a data value? They should typically not ditch the unrestricted real value for an attribute, but add in another attribute for the representation of the actual value at a shorter length. In this case, they could then keep the actual name value and also have a constrained value. They could put Oo-Holland or Oostendorp or any number of other values into the constrained version. This would only be used in an override situation (ignoring SQL issues with NULL handling here) where the original value as written or as truncated would not suffice. Additionally, if the constraint is on a derived value, such as the first name plus the last name (which happens to be the case in the example cited), the revised full name, with associated maxlen, would be a separate attribute.

Relational theory does not speak to maxlen, but every implementation of the relational model leads developers to make extensive use of such.

We have seen some of the reasons why a designer might design in a maxlen, with related commentary on these. Why focus so keenly on this one type of constraint? I would venture that the application of the maxlen in places where the proposition should have no such constraint, in addition to the tight coupling between the logical and physical maxlen, has cost the industry and individual organizations a bundle. It is but one area related to constraint handling where we have been bleeding, but one about which I hear little mention—and no noise, no fix, I'm guessing.

Relational theory is predicate logic and set theory. Relational theory does not speak to maxlen, but every implementation of the relational model leads developers to make extensive use of such. It has been suggested that if maxlen were removed from RM implementations, the solution would be less optimal in some way. While this has nothing to do with relational theory, it does seem to have something to do with existing relational models and implementations of such, particularly the SQL-DBMS. It might be the case that in order to resolve this, DBMS implementations would be well-served to permit attribute order to be specified, or reject some other tenet of the relational model. To whom should size matter? Not the DBMS.

← Previous Next →

10 Comments:

At 6:07 AM, July 29, 2006 , Tony Andrews said...: > In fact, every SQL-DBMS product I have seen requires some maintenance activity on the physical database simply because of an increase to a maxlen, a relaxing of the constraint.

Every one? Never heard of "Oracle" then? To increase the "maxlen" of employee name from 30 to 40 chars:

SQL> ALTER TABLE emp MODIFY ename VARCHAR(40);

There is no mysterious physical maintenance activity going on here. All that happens is that the constraint is changed to allow 40 instead of 30 characters in column ename from now on. No existing data gets moved about on disk.

I can't speak for other SQL DBMSs than Oracle, but I would be surprised if Oracle was unique in this.

For text of "unlimited" size (up to many gigabytes) you can always use the CLOB data type instead. I wouldn't advise using it for employee names though!
At 6:22 AM, July 29, 2006 , Tony Andrews said...: > While this has nothing to do with relational theory, it does seem to have something to do with existing relational models and implementations of such, particularly the SQL-DBMS. It might be the case that in order to resolve this, DBMS implementations would be well-served to permit attribute order to be specified, or reject some other tenant of the relational model.

LOL! You first admit that it has nothing to do with relational theory, then suggest that the resolution requires SQL DBMSs to deviate even further from the RM than they already have. Do you see the flaw in your argument?

(I presume you meant to write "tenet" rather than "tenant" by the way?)
At 11:09 AM, July 29, 2006 , --dawn said...: Thanks, Tony. I wasn't positive about that statement, only that the last time I worked with a SQL-DBMS it seemed like it was a big deal to alter the schema. Instead of researching it further, I passed it along to a group to correct me if I was wrong and no one did. I kept my original wording above (so that your comment stays relevant) and also indicated I was wrong in the blog. I'm doing this without checking for myself.

With MultiValue, you can change a (display) length from 30 to 20 to 40 all in the same day and break nothing (other than perhaps exceeding the size of a page of output) because there are no max length constraints in the DBMS (all are in a maintained layer above the DBMS or in the apps). What actually happens in Oracle and related apps if you make a change to your live database going from 30 to 20 to 40 all in one day? What would break? Would nothing change in the physical database? Thanks for your help. I don't currently have a forum to ask such questions, but I have found that if I write something that is incorrect, I do get the correct information. Cheers! --dawn
At 11:15 AM, July 29, 2006 , --dawn said...: Hi again Tony -- On your second comment, yes, I'm so glad you caught that. I tried to write it so that it was really, really clear how contradictory it sounded, but thought I had come up short on that. I was hoping that some would find that humorous.

I am toying with nuances to terms here. It sounds like Relational Theory is defined tight enough so that it is, by definition, fine. As it has been defined, it is just set theory and predicate logic. That's peachy. These can surely be applied to data.

But the RM, the Relational Model, the model on which relational database management systems have been written, has significant issues, I think. It doesn't sound like I hit the nail on the head with where the brittleness in using max lengths in a SQL-DBMS (for example) comes from, but it does seem that it is more brittle than database products that do not enforce maxlen constraints in a way so tightly coupled with writing the data.

Thanks. --dawn
At 10:46 AM, July 30, 2006 , Tony Andrews said...: > What actually happens in Oracle and related apps if you make a change to your live database going from 30 to 20 to 40 all in one day? What would break? Would nothing change in the physical database?

Nothing would break, nothing would change in the physical database. The only thing that would change would be the constraint definition to be applied to existing data and subsequent inserts and updates. Of course, if any row already contained a full 30-char value then the attempt to reduce the size of the column to 20 would fail like this:

ORA-01441: cannot decrease column length because some value is too big

However, you could still restrict the size of values entered from now on to 20 using a check constraint:

alter table t add constraint c check (length(col) <= 20) enable novalidate;

(The NOVALIDATE keyword means that data already in the table may violate the constraint).

Reducing the size of a column is slower than increasing it, because of the necessary check for existing data that would be too large for the new size.
At 12:46 PM, July 30, 2006 , --dawn said...: Thanks Tony. The difference between a novalidate check constraint and a change to the "alter table" on the size might account for some of my confusion, although not entirely. I have never been a DBA (as I'm sure you can tell), but have worked with a variety of data persistence interfaces, both packaged and homegrown. My hands-on experience with SQL-DBMS's is primarily on the read-only side (BI) and definitely less than my mgmt experience. I try to get things right, but will miss sometimes given that the bulk of my heads-down, hands-on coding experience is admittedly dated. I keep thinking I'll pick back up with hands-on development, even just as a hobby, but it is looking less likely right now (too many things interest me). Unlike some people who do not want to be corrected when they are wrong, I really want to have accurate information. I sometimes get the best informatation from hands-on work, but often get even better and more info from writing what I think and being corrected in any errors I make.

You made a comment on another blog entry that I wasn't willing to accept that SQL was not the RM. My take is that SQL is not a perfect implementation of relational theory. The data model behind SQL, however, is what has been billed as the relational model from day 1. So, I'm fine with saying that it is not a good relational model or that it is not a perfect relational model, but the reason it was developed in the first place was as an implementation of the relational model. It has then been proliferated extensively to the extent that almost every database product has felt compelled to claim it was relational and to prove it by providing at least read-only access to the data by means of SQL. So I don't know if I have a problem with someone's idea of a perfect relational model, I'm only interested in "the relational model" that changed the face of data management--actual implementations of the relational model over the past few decades that seem to have set the professional back, being one part of the high cost of developing and supporting software.

One question of interest to me is whether a company starting today, whether an application software provider or an IT shop in some vertical, would end up paying a vastly larger amount of money to accomplish the same, or perhaps a worse, implementation if they used a DBMS that was developed with the intent to implement the relational model (however flawed their result might be) than if they use a DBMS that didn't originally have any intent in implementing the relational model.

I suppose that for the sake of some purists I could call it "the model that was once thought to be a relational model but given changes in relational theory as well as its initial flaws, no longer is an implementation that relational theorists would think even comes close to implementing a pure relational model." I try to call the DBMS's "SQL-DBMS's" instead of RDBMS's, giving a nod to this notion that they are not pure in their implementations. But the data model backing various SQL-DBMS products has always been thought by the providers of such DBMS's as well as their users, in general, as being the relational model. I don't want to call that data model the SQL-data model because another language, such as QUEL, could implement a very similar model. I gather that most people completely understand when I call it the "relational model" but that a few are additionally offended because it is not a model they would come up with based on the current state of relational theory. I'm interested in what we can use to build software, and you cannot build software solutions with theory.

Did that help explain it? I'm not trying to be stubborn and am trying to use words that communicate and do not confuse. It is more confusing to many that just a few years ago we were supported to normalize data by removing "repeating groups" in order to be compliant with relational theory. Now the terms "1NF" and "normalize" seem to have been redefined by many. While I understand that we need to move forward, coming up with new terms would be better than redefining terms that have been so intrenched in the minds of software developers. If relational theorists would make more noise on how it is not necessary to normalize data using Codd's origional definition of that term, explaining, then, how you should model data, then maybe practitioners or, worse yet, management types, would not feel compelled to try to spread the news. Cheers! --dawn
At 5:09 AM, August 01, 2006 , Anonymous said...: Note that MS-SQL-Server, should you choose to redefine from 30 to 20 to 40 would, on shortening field, come up with a warning that "you are going to truncate your data - do you want to proceed". If you say "yes", anything beyond column 20 gets lost.

And (as Dawn requested when I said this in the preview) just because a DBMS is written in a language that needs maxlen, it doesn't mean that the dbms itself needs maxlen. Pr1me INFORMATION was written in FORTRAN.

But I very much agree with Dawn about maxlen - it is not an attribute of data, therefore it should not be a mandatory part of the definition of such data within a database. My full name (with title) is 20 characters long - I get well cheesed off when it gets chopped. And I know of plenty of people with FAR longer names (my wife, for example. Nearly 30 characters. And that's short for a long name!).

That's why Pick separates the constraints from the data - because all too often the constraints are everything to do with input/display and nothing whatever to do with the data itself!

Cheers,
Wol
At 7:58 AM, August 09, 2006 , Tony Andrews said...: Wol> That's why Pick separates the constraints from the data - because all too often the constraints are everything to do with input/display and nothing whatever to do with the data itself!

When there is such an input/display constraint, it is correct to enforce this outside the DBMS. OTOH, when a constraint *is* about the data then the DBMS is exactly where it belongs, by definition of the term "DBMS".
At 5:25 PM, November 09, 2006 , mc said...: This is how I read your statements, Dawn:

Premise 1: SQL is based on the RM, though patently inferior to it in many important ways.
Premise 2: SQL is inferior in some important ways to Pick and MV.
Conclusion: The RM is inferior in some important ways to Pick and MV.

This is a clear case of a straw man argument. Until someone presents a complete MV-based solution to my nine queries that approximates their simplicity, I don't see how the case for MV v. the RM can get off the ground.

Also, please stop referring to SQL DBMS's as implementations of the RM. This may be a popular myth in the IT community, but it is demonstrably false; even the SQL standard makes no such claim ("The SQL99 Standard ... does not use the word "relation" or "relational" at all ... the SQL standards committee have distanced SQL somewhat from the term "relational" ... the decision to avoid using the term "relation" was a deliberate one, taken in the 1980s").
At 6:49 PM, November 09, 2006 , --dawn said...: Hi mc -- First, should I be able to guess who you are by your initials? In case you are wondering, I do not know if I know you or not, and of course it would be nice to know who you are.

OK, your two premises and conclusion are not what I would state. I'll adjust them, sortof, below.

"Premise 1: SQL is based on the RM, though patently inferior to it in many important ways."

Because SQL-DBMS tools have traditionally been called RDBMS tools, for Relational Database Management tools, and based on the history of such, my premise is that

SQL is an implementation historically and technically based on the Relational Model, even if it is an early and flawed implementation. It is, however, the only industry standard implementation of the RM (of which I am aware).

"Premise 2: SQL is inferior in some important ways to Pick and MV."

I actually think there are SQL-based tools and apps that are better than similar tools or apps in Pick and vice versa. The data model assumed, required, or encouraged with SQL has some deficiencies compared to that used with MV/Pick. While trying to come up with why my experience has been that some aspects of maintaining MV solutions are far less expensive than their SQL-DBMS counterpart apps, I looked into several different possibiliies. For example, Pick solutions are often written by people intimately familiar with the business and business processes being automated. Pickies use 3GL procedural languages more often than OOP, for example, which might also be relevant. But what I landed on as a hypothesis includes features that are part of the data model that conflict with what has been taught as the RM.

Solutions written with an attempt to align with the RM as feasible while employing an SQL-DBMS have some unnecessary costs associated with such features as 3VL & the related un-value NULL, 1NF, and various SQL-encoded constraints. While I am not trying to move the entire industry toward Pick, I use it as an example so that it is not just a theory I have about how to develop more agile, less-costly, software. We can see it in play. Please note that I am well aware there are downsides of developing software with Pick, the biggest of which is that there is no Pick industry standard for data access. So, I cannot point off-the-shelf-reporting-or-data-analysis-tool at a database, without using SQL.

So, I'll rewrite this one as: Software applications written using a SQL-DBMS is not, in general, as flexible and easy to maintain as a corresponding applicaxtion written with Pick. The data model, including 2VL and NF2, is a significant aspect of agility in Pick software development.

"Conclusion: The RM is inferior in some important ways to Pick and MV."

Just as the RM, a theory, is not comparable to SQL, an implementation, it is also not of the same type as Pick/MV, also an implementation. So, it is the abstraction of the data model behind Pick that compares favorably, in my opinion, to the abstraction of the model behind SQL, which is, by popular use, termed the RM (recognizing it is not an exact match).

Another way to state this: If one develops a logical data model (pick your favorite def thereof if you disagree with mine) that incorporates approaches such as removing all many-to-many relationships between relations, removing attributes that could have a NULL (addressed with 3VL), removing lists and embedded non-simple data from relations they are more likely to end up with a bulkier, less agile, more expensive implementation than if they work with a logical data model that includes M-M, NULL-capable attributes, and embedded lists.

I am challenging the use of the relational model as practitioners have learned and employed it over the past decades. I think it is time for practitioners to return to some prior best practices, including use of two-valued logic and non-first-normal form data. If theorists want to redefine what practitioners understand to be the relational model and still call it the RM, fine. I prefer not to call it the RM if it not the RM we have been employing for the past couple of decades because that just confuses the matter. Every CS major since 19__ knows how to normalize data with a step that includes eliminating lists as data values in order to conform to the relational model. It would be much easier to move forward if we rename the model that permits us to keep lists, doesn't have that daggone non-value NULL, and so on. So, I'll restate this conclusion as

It is time to move to a better data model than the one that has been employed by SQL-DBMS tools, typically called the RM.

I'm coming from the language of the practitioner, which includes calling SQL an implementation, known to be flawed (a common trait among software implementations of any requirements), of the RM. This has some historical and technical accuracy, as well as being generally accepted as the language used in practice, I think (correct me if I'm wrong). It would not have been implemented as it was had not Codd's papers on the relational data model been written. QUEL and SQL were both considered implementation of the relational model among practitioners way back when (and, unfortunately, I recall that discussion, if vaguely). SQL won. If I were to use different terminology, making up a new name for the data model on which SQL was based, it would be very confusing and would not align with the terminology as used in software development, in my opinion. But if you can come up with a way to say this that would be clear to practitioners and align with your definitions (whatever they might be), please advise. Once we have a good "next approach" for practitioners, it will be much easier to move the industry from here to there if we say "time to ditch the RM" rather than "we have now redefined 1NF so that it means nothing etc. There is no punch to that message and it doesn't seem clear at all.

Sorry about not showing the exact matches to your bowling examples. I asked an actual Pick developer (I've pretty much only been design and/or mgmt during my years working with Pick) if he could pitch in, but due to other priorites neither of us has been able to bump this to the top of the stack. With everything on my plate, I feel like I'm either too ADD to code these days or just too brain-dead. I liked to code (in my CICS COBOL days) in more than 5-minute stretches, which I no longer seem to grant myself. Maybe someday.

Thanks again for your comments, mc. --dawn

Litter Box

Paw through past Mewsings, a blog about software development, with a focus on data modeling.

2005
November
A Modeling Profession

2009
January
New Year, New Blog