mewsings, a blog
Thursday, July 27, 2006
To Whom Should Size Matter?
The Grand Tetons are incredibly impressive. They are big, really big. Combining business and pleasure, one trip this year took me from Iowa to Washington, D.C., and another from Iowa to Washington State, both by car. I was able to see people and places, taking pictures from coast to coast. I love mountains and water, the bigger the better. In my home state of Michigan, the Great Lakes are great because of their size. Of course there are very impressive small points of interest too. When it comes to physical things, size matters.
Physical objects differ from words in that respect. It makes sense to care about the height of a mountain, the size of a portion of food, the length of a skirt, or the size of a vehicle. It rarely makes sense to care about the size of a word, with word games being an exception.
You might recall that feeling when you want to play all seven letters in Scrabble, maybe even on a triple word score, but it doesn't fit. The board is not long enough, or there are not enough open squares to play the word. Rats!
A similar game is often played by users of database management systems and related applications. The word you want stored in the database is just a little too long for the data entry screen to handle or the DBMS to accept. Some DBMS tools require or at least strongly encourage size constraints on data attribute values. Others do not. To an SQL-DBMS, size matters.
Focusing on this one specific type of constraint, let's take a look at reasons a software development team might design using a maximum length constraint (maxlen) rather than permitting variable lengths for values of a specific attribute.
Perhaps the input device and related software provide only limited space for input data. The punch card is a good example. While variable-length attribute values could be encoded on punch cards using delimiters between values, such as deBoer,Reta,piano, to ensure that a fixed list of attribute values for each person fits on a single card we might instead place the musical instrument value in card columns 33 through 46. The maxlen for all values of that attribute would then be 14.
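To make the fixed-column idea concrete, here is a small sketch in Python of slicing an 80-column card image. Only the instrument columns (33 through 46, maxlen 14) come from the example above; the column assignments for the name fields are hypothetical.

```python
# Hypothetical fixed-column card layout; only "instrument" in columns
# 33-46 (maxlen 14) comes from the example in the text.
CARD_COLUMNS = {            # 1-based inclusive column ranges
    "last":       (1, 16),
    "first":      (17, 32),
    "instrument": (33, 46),
}

def parse_card(card: str) -> dict:
    """Slice a fixed-format 80-column card into attribute values."""
    card = card.ljust(80)   # pad short card images to a full 80 columns
    return {name: card[start - 1:end].rstrip()
            for name, (start, end) in CARD_COLUMNS.items()}

card = "deBoer".ljust(16) + "Reta".ljust(16) + "piano".ljust(14)
record = parse_card(card)
```

With a fixed layout like this, every instrument value is forced to fit in 14 characters, no matter what the business requirement says.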
Developers once had to design for input from 24 x 80 character screens too—remember those? Even with a fixed screen size, however, users could be given scrolling entry fields so that entered data could be longer than what is shown on the screen at any given time.
We should consider whether we would want the restrictions of an input device to dictate database-level constraints. We might also consider that today's typical screen entry or data exchange has no such issues. While screen size is still a factor for designers, this technology requirement need not whittle down our original business requirements. I feel comfortable relegating this reason for a maxlen to the past.
I still have my nifty 132-column ruler for designing greenbar reports. When we could fit only 132 characters in a fixed-length font on a single line of a report printed on greenbar, we naturally cared about attribute sizes. When designing paper forms such as payroll advices, it makes sense to care about the maxlen for names compared to the number of available characters on the form. The representation of the value of any particular attribute might be constrained by the design of a report. Should we also constrain the actual database values and not just the representation of those values? Perhaps there is a reason to do so in some cases, such as the codes discussed next, but, in general, no.
When modeling data, we often swap out some terms within a proposition for succinct, consistent codes. For example, start with a proposition such as:
Jan (first) VanDoe (last) lives in Illinois (state).
It is typical to model the predicate for this proposition with two base relations, Person(first,last,stateCode) and State(stateCode,stateName). The original predicate is then modeled with a view that joins these two giving PersonView(first,last,stateName). In this example, a code of IL would be in the Person relation, mapping to the name Illinois in the State relation.
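As a sketch, the pattern can be spelled out against an SQL-DBMS—here SQLite via Python's sqlite3—using the relation and attribute names from the example above:

```python
import sqlite3

# A minimal sketch of the Lookup Table pattern, using the Person/State
# relations and the PersonView join from the example in the text.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE State (
        stateCode TEXT PRIMARY KEY,
        stateName TEXT NOT NULL
    );
    CREATE TABLE Person (
        first TEXT,
        last  TEXT,
        stateCode TEXT REFERENCES State(stateCode)
    );
    -- The original predicate is recovered by a view joining the two.
    CREATE VIEW PersonView AS
        SELECT p.first, p.last, s.stateName
        FROM Person p JOIN State s ON p.stateCode = s.stateCode;
""")
con.execute("INSERT INTO State VALUES ('IL', 'Illinois')")
con.execute("INSERT INTO Person VALUES ('Jan', 'VanDoe', 'IL')")
row = con.execute("SELECT * FROM PersonView").fetchone()
# row is ('Jan', 'VanDoe', 'Illinois')
```

The code IL lives in Person, the name Illinois lives in State, and the view restores the original proposition for the user.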
This is a common pattern in data modeling. In their Refactoring Databases book, Ambler and Sadalage label a change of this nature to an existing schema as Add Lookup Table. So instead of calling this the code file pattern, a term used back when we also talked about master files and transaction files, I'm OK with calling this the Lookup Table pattern, while still referring to the attribute (such as stateCode above) as a code.
Is there a good reason to constrain the length of codes? Yes. Lookup tables are all about constraining propositions to aid in conformity. This helps with the accuracy of captured data and eases analysis, for example. Placing a maxlen on the code neither further constrains nor alters the original proposition derived from business requirements. There are advantages to developers in fixing a maxlen, making such a design attractive in cases where it cannot adversely affect data integrity (through truncated values, for example) or unnecessarily restrict flexibility.
Now comes the question of who should care about the maxlen on codes for lookup tables. Have you ever seen a case where there is a requirement to change the length of a code? If not, I've seen enough such cases for both of us. In every case I can recall, the change was to a larger maxlen, although I can imagine the other scenario. In the rare case of changing to a smaller maxlen on a code, there must be a mapping from the longer attribute values to the new shorter maxlen, whether truncation or a fancier algorithm, and the existing database would likely need to be modified so it fits within the new constraints.
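A sketch of the shrinking case, with hypothetical instrument codes, shows why a mapping is needed and why plain truncation may not be enough:

```python
# Shrinking a maxlen forces a mapping from existing values to shorter
# ones. Truncation is the simplest mapping, but it can cause collisions.
def shrink_code(code: str, new_maxlen: int) -> str:
    return code[:new_maxlen]   # a fancier algorithm could apply instead

codes = ["PIANO", "CELLO", "VIOLIN", "VIOLA"]
remapped = [shrink_code(c, 4) for c in codes]   # new maxlen of 4
# VIOLIN and VIOLA both map to VIOL, so truncation breaks uniqueness here
assert len(set(remapped)) < len(set(codes))
```

The existing database rows would also need to be rewritten to use the new, shorter codes, which is part of what makes downsizing costly.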
In the much more common case of needing to upsize the maxlen, there is no similar requirement to change the existing database. The new requirement is less restrictive than the former requirement, so the database is logically already in compliance with the new maxlen. It might even be feasible to roll such a change in constraints out to users along with other maintainable business rules. Unfortunately, there are DBMS products that tightly couple the logical constraints with the physical implementation of the database. Such products actually make computer resource allocations based on the logical maxlen constraints!
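The logic of the upsizing case fits in a few lines (the codes and lengths here are hypothetical): any set of values that satisfies the old, tighter constraint necessarily satisfies the new, looser one.

```python
# Sketch: upsizing a maxlen needs no data migration, because every value
# legal under the old, tighter constraint is legal under the new one.
def complies(values, maxlen):
    return all(len(v) <= maxlen for v in values)

existing = ["IL", "IA", "MI"]          # hypothetical stored codes
assert complies(existing, 2)           # database meets the old maxlen of 2

new_maxlen = 5                         # requirement relaxed from 2 to 5
assert complies(existing, new_maxlen)  # ...so it already meets the new one
```

Logically the change is a no-op for existing data; any work the DBMS demands is purely a consequence of its physical implementation.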
In fact, every SQL-DBMS product I have seen requires some maintenance activity on the physical database simply because of an increase to a maxlen, a relaxing of the constraint. While there are products that have no such issue, I'm guessing that most of those pre-date the introduction of the relational model. Coincidence? [I have been corrected on this. Apparently this is not the case for Oracle, so I would guess others as well. I'm curious now what an average cost estimate might be for a requirement to increase a maxlen in a schema since I'm guessing estimates might have influenced my assumption. Regression testing might be the biggest part of the estimate.]
Back to our codes, where we want to design in a maxlen: where should this maxlen be encoded? While some might be inclined to specify this constraint to the DBMS, you can see the danger in cases where such a maxlen specification serves the dual purpose of physical implementation design. I recently saw an SQL-DBMS schema example where every attribute was designated with a maxlen of 50, with 50 being larger than the largest logical maxlen of any attribute. Such an approach mitigates the issues with the tight coupling of logical and physical.
Where should the logical maxlen be placed in a logical data model? Given that all UI and web service validation routines need access to this constraint information, it can be made available by capturing it, as other business rules might be, in the database itself. I suggest that the database proper (rather than the schema) is a good place for specifying all business constraints. There must then be enforcement of the use of standard CRUD services that employ such constraint logic for all software maintaining the database. This provides more flexibility for the business, and I suggest the cost of ownership is lower than when any business rules are specified to the DBMS.
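As a sketch of that approach—assuming a hypothetical AttributeRule table and a shared create service (SQLite via Python's sqlite3 here)—the maxlen lives in the database proper, and relaxing it later is a one-row update rather than a schema change:

```python
import sqlite3

# Sketch: maxlen rules stored in the database itself (AttributeRule is a
# hypothetical name), enforced by a standard CRUD service.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE AttributeRule (
        tableName TEXT, attrName TEXT, maxlen INTEGER,
        PRIMARY KEY (tableName, attrName)
    );
    CREATE TABLE Person (first TEXT, last TEXT, stateCode TEXT);
""")
con.execute("INSERT INTO AttributeRule VALUES ('Person', 'stateCode', 2)")

def insert_row(con, table, row):
    """Standard create service: validate against stored rules, then insert."""
    for attr, value in row.items():
        rule = con.execute(
            "SELECT maxlen FROM AttributeRule "
            "WHERE tableName=? AND attrName=?", (table, attr)).fetchone()
        if rule and len(value) > rule[0]:
            raise ValueError(f"{table}.{attr} exceeds maxlen {rule[0]}")
    cols = ", ".join(row)
    marks = ", ".join("?" for _ in row)
    con.execute(f"INSERT INTO {table} ({cols}) VALUES ({marks})",
                list(row.values()))

insert_row(con, "Person", {"first": "Jan", "last": "VanDoe", "stateCode": "IL"})
# Relaxing the rule later is one row updated, with no schema change:
con.execute("UPDATE AttributeRule SET maxlen=5 WHERE attrName='stateCode'")
```

All software maintaining the database would go through insert_row (and its update/delete siblings), so the rule is enforced in exactly one place.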
I'll toss in a short note that if software is being developed with one of the older languages for processing data that require maxlen on variables (e.g. COBOL), then you are stuck specifying the maxlen to your software components for the purpose of reading or writing data. As with punch cards, size matters to COBOL. As sweet as I suspect it can still be to write business applications in COBOL, I'm comfortable making this an historical footnote.
Some might say that decoupling logical and physical maxlens could adversely affect performance. If that is the case, I would suggest that the industry move to a data model that does not have this defect. Yes, of course I know that the computer is a finite resource and there is a need for some information for performance tuning and optimization. But(3) it(2) is(2) not(3) essential(9) that(4) each(4) individual(10) attribute(9)...you get the point, right?
Similarly, if the amount of disk or other resources is adversely affected by decoupling logical and physical constraints, I would suggest the industry move to a data model that does not have this defect.
While the models I know that fit the bill do have other defects, including logical/physical coupling in other respects, they provide greater flexibility for those changes that companies are likely to make to their data over time. That is because they take physical implementation information that is irrelevant to the user and the business requirements and require it of the logical model rather than the other way around. For example, because such things as the order of stored attributes is irrelevant to the business user, there need not ever be a requirement to change that order in either the logical data model or the physical implementation.
Real Input or Output Business Requirement
With the name changed, my cousin told me how she has let the relevant parties know that she absolutely does not want her hyphenated last name of Oostendorp-Holland truncated on documents, credit cards, and the like as Oostendorp-Ho. Again: pa-lease! Given that she is an influential person, I'm guessing these parties are trying to comply, but it might be too costly for them to do so.
What should a company do when there is a business case for having a maximum length on a data value? They should typically not ditch the unrestricted real value for an attribute, but add in another attribute for the representation of the actual value at a shorter length. In this case, they could then keep the actual name value and also have a constrained value. They could put Oo-Holland or Oostendorp or any number of other values into the constrained version. This would only be used in an override situation (ignoring SQL issues with NULL handling here) where the original value as written or as truncated would not suffice. Additionally, if the constraint is on a derived value, such as the first name plus the last name (which happens to be the case in the example cited), the revised full name, with associated maxlen, would be a separate attribute.
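A sketch of that design in Python, with the attribute names and the maxlen of 13 being illustrative (13 happens to reproduce the Oostendorp-Ho truncation above):

```python
# Keep the unrestricted actual value; a separate constrained attribute is
# used only as an override when the truncated value would not suffice.
DISPLAY_MAXLEN = 13   # hypothetical form or card limit

def display_name(actual, override=None):
    """Prefer the person's override; otherwise truncate the actual value."""
    if override is not None:
        return override
    return actual[:DISPLAY_MAXLEN]

actual = "Oostendorp-Holland"
default = display_name(actual)                  # "Oostendorp-Ho"
preferred = display_name(actual, "Oo-Holland")  # the override she chose
```

The database keeps the real name intact; only the constrained representation is shortened, and only when the default truncation will not do.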
We have seen some of the reasons why a designer might design in a maxlen, with related commentary on these. Why focus so keenly on this one type of constraint? I would venture that the application of the maxlen in places where the proposition should have no such constraint, in addition to the tight coupling between the logical and physical maxlen, has cost the industry and individual organizations a bundle. It is but one area related to constraint handling where we have been bleeding, but one about which I hear little mention—and no noise, no fix, I'm guessing.
Relational theory is predicate logic and set theory. Relational theory does not speak to maxlen, but every implementation of the relational model leads developers to make extensive use of such. It has been suggested that if maxlen were removed from RM implementations, the solution would be less optimal in some way. While this has nothing to do with relational theory, it does seem to have something to do with existing relational models and implementations of such, particularly the SQL-DBMS. It might be the case that in order to resolve this, DBMS implementations would be well-served to permit attribute order to be specified, or reject some other tenet of the relational model. To whom should size matter? Not the DBMS.