Tincat Group, Inc. - Mewsings, a Software Development and Data Modeling Blog

--dawn

Monday, January 09, 2006

Is Codd Dead?

Time Magazine issued a cover in 1966 asking "Is God Dead?" This is not the first time the name "Codd" has replaced "God" in a phrase, but in this case it is not for the purpose of comparison. In this blog, I will dare to question some of Codd's legacy including some of the dogma passed along in college database textbooks today.

E.F. (Ted) Codd died in 2003, leaving a significant contribution. Codd is often called the father of relational theory. His 1970 ACM paper "A Relational Model of Data for Large Shared Data Banks" (E. F. Codd, Communications of the ACM, v.13 n.6, p.377-387, June 1970) is a significant industry milestone.

In this paper, Codd discusses what he sees as the advantages in modeling data by use of mathematical relations compared to mathematical graphs of trees or networks.

Relations are often represented as tables of rows and columns. Trees are often visualized as nested folders and documents. The network graph, seen by Codd as overly complex and a cause of some of the problems he was addressing, can be visualized as a web. While a web, or directed-graph, might be a more complex mathematical structure than a relation, I predict that this data model might just catch on anyway (wink).

There are many viable models for data. Each has its advantages and disadvantages. This blog will not be about right and wrong as much as better and worse approaches. I'm a practitioner dabbling in theory in order to help improve the practice and not the other way around.

Codd also introduces the term "normalize" to refer to removing nonsimple domains, such as lists or tables of data often referred to as "repeating groups." He is very clear in this paper that a relation could include repeating groups, but that normalizing it would make the data model simpler for some purposes.

"The simplicity of the array representation which becomes feasible when all relations are cast in normal form is not only an advantage for storage purposes but also for communication of bulk data between systems which use widely different representations of data." (Codd, p. 381)

Anyone communicating bulk data by way of XML or JSON will recognize that we have different issues to solve today than we had in 1970. The rise of XML with its associated unnormalized data model is part of the impetus for what will likely be significant changes on the database landscape.

My advice is this: Stop normalizing your data. Stop removing all repeating groups. Note that I am using the original description of normalization from this paper. This meaning of normalization was later termed, or at least rolled into the term, "First Normal Form" or 1NF. The higher normal forms, such as BCNF, include laudable work with functional dependencies, but all are defined to first require normalization. There is definitely some good that can be salvaged from this normalizing debacle of the past few decades, but we must first ditch the requirement for data to be normalized, placed in 1NF, stripped of repeating groups. I will refer to relations that are not normalized, as others have, as NF2 for Non-First Normal Form.

Do this —>

   Id: 123456
First: Jayne
 Last: VanDoe
Email: jvdoe@abc123.com
       jov@xyz123.com
       jo3@aol.com

Not this —>

   Id: 123456
First: Jayne
 Last: VanDoe

   Id: 123456
Email: jvdoe@abc123.com

   Id: 123456
Email: jov@xyz123.com

   Id: 123456
Email: jo3@aol.com

In this way you will model entities, such as the person above, with their dependent properties, such as the list of e-mail addresses. You only need to remove lists from your model, thereby going from the first example to the second above, if you are using tools that require it. Given that SQL-92 requires it, that is a big if. There are other viable, time-tested NF2 options, however.
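
To make the two layouts above concrete, here is a minimal Python sketch. It is an illustration only, not how any MV or SQL product stores data; the names person_nf2, to_1nf, and from_1nf are hypothetical, and only the field names and values come from the example records.

```python
# Hypothetical illustration: person_nf2, to_1nf and from_1nf are made-up names
# that mirror the "Do this" / "Not this" records above.

# NF2 form: one entity, with its e-mail addresses kept as a list.
person_nf2 = {
    "Id": 123456,
    "First": "Jayne",
    "Last": "VanDoe",
    "Email": ["jvdoe@abc123.com", "jov@xyz123.com", "jo3@aol.com"],
}


def to_1nf(person):
    """Split one nested record into flat rows with no repeating group."""
    person_row = {k: v for k, v in person.items() if k != "Email"}
    email_rows = [{"Id": person["Id"], "Email": e} for e in person["Email"]]
    return person_row, email_rows


def from_1nf(person_row, email_rows):
    """Rebuild the nested record by matching the flat rows on Id."""
    emails = [r["Email"] for r in email_rows if r["Id"] == person_row["Id"]]
    return {**person_row, "Email": emails}


person_row, email_rows = to_1nf(person_nf2)
rebuilt = from_1nf(person_row, email_rows)
```

In this sketch the round trip gives back the original record only because a Python list happens to keep the e-mail rows in their original sequence. A set of rows carries no such order, so if the order of the addresses mattered, the flat form would need an extra value to record it.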

But the Relational Model (RM) is based on mathematics, right? Mathematics is precise. What part of the argument for the RM is amiss? Don't be fooled—there is no mathematical requirement to normalize data. Mathematics provides a means for modeling propositions to be handled in software, presented to end-users, passed as messages, or stored on secondary storage devices. The RM is a mathematical model. It is a model. Models are not the real thing. Models are often anorexic versions of the real thing. The mathematics of the relational model is sound, but the process of determining what this model should be used for is flawed.

The RM has been useful, but not as useful as some pre-relational models, in my opinion. Post-relational models of data for messages, such as those mentioned above, look very much like pre-relational models. I am hoping for a return to best practices for data models, whether or not the theory keeps up. I would, of course, prefer that theory be better aligned with excellent practices. Many pre- and post-relational tools use an NF2 model.

You are likely familiar with RDBMS products, often referred to as relational databases. Purists might prefer these be called SQL-DBMS products since SQL does not promote a pure relational model. I will use this column to dispel what I think to be myths that have helped SQL and the relational model rise to become king of the hill for a couple of decades. While this introductory column is admittedly not meaty, I will delve into this further and provide working examples in the coming weeks.

While I have not yet experimented with any XML DBMS tools, I have been working with one NF2 model, often referred to as the MultiValue (MV) or Pick® data model, for over a decade. This is not the only such model, but one with which I am comfortable, so I will introduce it here and use it in future illustrations and implementations.

Putting the RM and MV side-by-side while wearing both a technical and business woman's hat is what prompted me into further exploration of why the MV data model seems to yield higher productivity for developers, greater flexibility for changes over time, and lower risk of project failure. This was particularly perplexing when I started researching the topic because the RM was developed to help improve database maintenance. While the RM addresses some maintainability issues better than MV, MV seems more flexible in many respects. There are different risks and benefits associated with each approach.

Products employing an MV or NF2 data model include the IBM U2 products, Temenos jBASE, Revelation OpenInsight, Raining Data D3, Northgate Reality, EDP Plc UniVision, Ladybridge Systems OpenQM, and InterSystems Caché. There are other viable functional data model implementations with which I am less familiar, such as Berkeley DB from Sleepycat Software and other products marketed as embedded databases. This is definitely not a small niche market.

OpenQM is an open source implementation, so I will use that for my examples in future blogs. I will be the first to admit that MV isn't new, and although various flavors have tools to make it prettier, it typically doesn't look new. It is unlikely to wow you at first glance, but it often grows on developers quickly with its big bang for the buck results and maintainability. The same principles can be applied to many environments, however, and will typically not be specific to MV tools.

I would like to see the industry start with an NF2 model and move it forward rather than squeeze more out of SQL, as has been attempted with the more recent SQL standards. SQL will be with us for many years, but it is time to make an abrupt cut away from it wherever feasible.

Codd will long be remembered for some very innovative work in the area of database theory. But, yes folks, Codd is dead.

35 Comments:

At 10:08 AM, January 10, 2006 , dougc said...: very interesting!
At 3:26 PM, January 10, 2006 , Ira Krakow said...: Excellent job, Dawn. I have programmed in Revelation, Advanced Revelation, and OpenInsight for 20+ years. I also develop databases in SQL Server and Access, so I understand the difference between the MV and relational models. I prefer the MV model, for the reasons you stated.

I started an online Revelation user group called VIRTUALRUG, at:

http://groups.yahoo.com/group/virtualrug

We discuss these, and other, multivalue databases. You're all invited to join.

Sincerely,
Ira Krakow
ikrakow_1999@yahoo.com
At 3:30 PM, January 10, 2006 , Peter McMurray said...: Hi
Great starter.
The multi address item is a sitter for denormalising. However I would like to see comments on importing data to a frequently used large file to reduce runtime joins and retrieves. For example every invoice can have multiple products and every product can exist in a product group. What is the feeling about including the product group as well as the product code in the invoice line to reduce the effort at reporting time? Of course all invoice lines are included in the one invoice item, so one read, one hit every time reports are run.
At 3:39 PM, January 10, 2006 , DTsig said...: There have been other writings over the years 'PICK' vs SQL so I wonder what this is about.

Yes, Codd is dead. That is what happens when you die. But his accomplishment was really quite great. A great move away from Hierarchical and Networked Databases. I think, though I might be wrong here, that his was really the first purely 'scientific' approach to database definition and design. Certainly the most practical.

As was the accomplishment of Dick Pick and Don Nelson. Theirs was the practical design of a database with relational concepts that really worked for business. Of course, just like 'SQL' it took many years for a good implementation. For me it was R83 from PICK and Prime's implementation (much better but didn't survive long enough to see where it would go).

SQL a 'user' interface offers more power than found in PICK-A-LIKES (at the command line level) offering both exceptional power of retrieval and formatting but also update capabilities not found in PICK-A-LIKES (unless you consider the update processor (oh my:) OR various implementations of SQL (like bolting on a spoiler to the back of the car for style :).

What SQL doesn't have, and I believe the biggest problem with them, is a defined, built-in development environment. In this I would include a programming language and dynamic metadata. Though all SQL implementations have come a long way in terms of programmability they are all done differently, which causes many problems for developers.

What I don't understand is '.. SQL will be with us for many years, but it is time to make an abrupt cut away from it wherever feasible ..'. SQL IS a very powerful tool. I would never dream of telling any of my clients to 'cut away' from their SQL systems only to move to one of my MV. To just push MV because it isn't SQL is just as foolish as pushing SQL because it is SQL .. Find the tool that fits and use it.
At 4:00 PM, January 10, 2006 , DTsig said...: By the way .. don't get me wrong. I think the blog is a good idea and it never hurts to discuss differences in design thoughts and methodologies. Plus looking at other 'tools' is very important and very lacking in the 'PICK' world.

Keep up the good work

DTSig
At 4:03 PM, January 10, 2006 , --dawn said...: To DTSig -- Perhaps we have a different variation on the term "feasible." Suitable was another term I toyed with, but businesses work with feasibility studies to determine what is suitable, so I opted for that term. A reason for the industry, as a whole, to move away from SQL includes the huge dollars that will otherwise be poured into Object-Relational or XML-Relational mappings in the coming years. Some professionals don't seem to know there is an option of avoiding that altogether. It might not be feasible as yet in many cases, however.
At 11:17 PM, January 10, 2006 , jog said...: First, congratulations, a very interesting read, and I look forward to future installments.

I would however contest the statement made early on in the blog that Codd's relational model is based upon mathematical relations. All descriptions of the RM that I have been able to find either redefine a "relation" to have a brand new header component (Date), OR "tuple" to implicitly mean a set of column_name-value pairs. Both possible revisions seem to me to bend the formal mathematical definition of the respective term to breaking point. In fact one could probably put a strong case forward that the RM is completely misnamed (it being the database rows themselves which, as sets of binary pairs, are the actual mathematical relations, tables in fact being well defined sets of these relations).

Anyhow, all best, and good luck with the blog - I hope it will continue to address the myth that there is somehow an implicit mathematical truth underlying the RM.
At 5:29 AM, January 11, 2006 , Anonymous said...: The RM provides for the following:
- a small vocabulary for succinct queries and updates
- first order logic for guaranteed answers
- declarative programming
- clear separation of data from application
- simplified data export/import
- independence from physical implementation
- scalability

What is missing:
- logical independence
- standardised relational catalog

How do the NF2 or MV models compare?

Using your e-mail addresses example, how can the NF2 or MV model do this:
a) retrieve office e-mail address
b) enforce the constraint that an e-mail address is used by only one person (identified by id field)
At 9:56 AM, January 11, 2006 , HRM said...: I think you are brilliant. Keep up the good work.
At 10:39 AM, January 11, 2006 , --dawn said...: I can see that threaded comments would be nice, but I'll try to catch a few of these at once. In case anyone is wondering, my mom's initials are not HRM. Thanks to all for the nice comments.

As for Codd's use of the term "relation." Yes, the term has evolved, sometimes quite fantastically, within the database community so that it no longer refers to a simple mathematical relation. However, Codd did start with the mathematical relation and that is what he uses in his 1970 paper. I did not introduce mathematical "relationships" here, but Codd takes the mathematical relations, names the domain for each and says "Users should not normally be burdened with remembering the domain ordering of any relation (for example, the ordering supplier, then part, then project, then quantity in the relation supply). Accordingly, we propose that users deal, not with relations which are domain-ordered, but with relationships which are their domain-unordered counterparts."
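
As a rough illustration of the distinction in that quotation, here is a small Python sketch. The four domain names come from Codd's supply example; the numeric values and variable names are made up for the illustration.

```python
# Hypothetical values for Codd's "supply" example; only the domain names
# (supplier, part, project, quantity) come from the quoted passage.

# Domain-ordered relation: meaning depends on position, so the user must
# remember that the order is supplier, part, project, quantity.
ordered_a = (1, 2, 5, 17)
ordered_b = (2, 1, 5, 17)  # same numbers, supplier and part swapped

# Domain-unordered "relationship": each value is paired with its domain name.
named_a = {"supplier": 1, "part": 2, "project": 5, "quantity": 17}
named_b = {"quantity": 17, "project": 5, "part": 2, "supplier": 1}

positions_matter = ordered_a != ordered_b  # swapping positions changes the fact
names_suffice = named_a == named_b         # key order does not change the fact
```

The tuples stand in for the domain-ordered form and the dictionaries for the domain-unordered form; Python compares dictionaries by their name-value pairs, regardless of the order in which they were written.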

The term "relationship" did not stick. Mathematical relations were added to and declared necessarily unordered which then became database relations, thereby introducing some unfortunate confusion. Because MV solutions (which are not the end-all and be-all, don't get me wrong) do have ordering, I declared once that the model was closer to mathematical relations than the relational model. Needless to say, dbdebunk.com was not impressed. Since I don't want to work at the logical level with ordered tuples anyway (name-value works for me until I want to present them in an order), I don't consider this to be in MV's favor, even if accurate.

Anonymous gave a good list of RM features. I would say that the biggest thing missing from the RM as implemented in databases is flexibility. That was the purpose of the model, but since you cannot make a simple change to cardinality nor arity of a domain, it is not all that easy to maintain over time. I'm sure that will come up in other blog entries.

By the way, I was hoping people would choose "Other" and enter their name, or at least a handle, with comments, otherwise it is difficult to respond, particularly lacking the threading. If you provide an e-mail address, I can reply by e-mail too.
At 11:42 AM, January 11, 2006 , jog said...: Well I've noted in the past from the dbdebunk piece you refer to that Date states: "What he did not envision was the ensuing semantic use of that order, which he did not mean to imply (and of which MV technology is guilty). When he realized that, he explicitly dropped the order in his 1970 paper, which was actually the first public expounding of the relational idea. As we quote him in our above-mentioned forthcoming paper".

As soon as that order was dropped the model - as far as I can see - moved from relational to "relationship". Now this in no way contests the validity or usefulness of RM, but just rather simply implies it is a "Relationship Model" and the term "Relational Model" is a misnomer. However, this does in turn help to highlight that there is nothing mathematically true and right about the RM, but rather it is just a good, solid and indeed pragmatic application of mathematical process.
At 3:04 PM, January 11, 2006 , dbdude said...: First of all, the RM doesn't begin and end with Codd. Codd merely introduced the basic concepts and showed us a general way to view data management. If you are looking to criticize the RM of course it's easier to stick to Codd's original paper which is not as concise and general as, for instance, what's in Date's latest writings. For instance, non-scalar values in relational attributes are fair game, and in no way conflict with the spirit of the RM. In fact, they enhance it, because if we want to use the RM as a general model to understand all data management tasks, we'd better be able to specify arrays, sets, trees, XML documents, Customer objects, audio files, etc. in the attributes of our models.

There's no need to focus on normalization either. Normalization is just a design principle that we can name and define rigorously, so that's what everybody talks about. It's really not what the RM is all about. And as I mentioned above, lack of "repeating groups" (whatever that means) should not be considered part of normalization. Just think of normalization as something to do with the dependencies in your data (meaning 2NF on up). Think of it as a label you can assign to some logical models, and not to others. Even the RM fans talk about this a little too much, though I agree it's almost always "safest" to stick with a normalized model.

Next, you give an arbitrary example of a particular data model, with names and email addresses. This doesn't demonstrate anything. If I change the list of multiple email addresses to a single array value containing multiple email addresses, I would fulfill the requirements of the RM. I can also take all of your fields and stuff them into a single Person object, and create a unary relation to hold it (this is how I model most OO data models, in my head anyway). This is all fine. There's nothing about a multivalue model or network/object model that is incompatible with the RM, and any particular design depends on your business rules and applications.

Whether you spread your attributes out, move them into composite values, or whatever, is purely a design choice. A theoretical model should only help you understand what you've created. I think the RM performs this role very well. It is "as simple as possible and no simpler".

SQL of course doesn't let you create any design you want. It guides you toward one particular set of designs (drags you kicking and screaming in fact) so if you want to bash SQL, go right ahead. An SQL DBMS is not a completely general DBMS. The lack of high-performance, fully updateable views for instance makes even the most basic refactoring a huge chore.

If you feel it's fair to call an SQL DBMS "relational" by the way, then I feel it's equally fair to call MV databases "relational"! They both expose an abstract machine that is a subset of the RM.

Let me make one thing clear here.. I believe Date and others would say, yes, you can model all DBMS products using the RM, but you shouldn't "go there" because only the most completely general DBMS implementation is useful. I personally disagree with this.. in I.T. it is often useful to use products that aren't completely general or powerful, for various valid reasons. I don't usually program in Lisp for instance, even though the CLOS object model is so much more general than the "message passing" model. So I personally have nothing against MV or network databases in practice.. the real danger is thinking that the associated "models" are not a subset of something more general. And unfortunately this lack of understanding runs deep and leads to clouded thinking and a constant stream of buzzwords and new technologies that are just the old ones in disguise.

It's very valuable to have a single, simple, concise model which you can use in your thinking to analyze data management, don't you agree? Even if it's just in your head. So whenever you see a new design, or a new product, or a new set of taxonomy, you can see beyond the superficial differences and understand what's really going on. That is what Codd was trying to do. You are confusing this with various specific products and design principles, which all come and go, and are useful for different tasks.
At 3:26 PM, January 11, 2006 , Anonymous said...: A big difference is that hierarchical storage (multivalues being connected to the root) means not having to join on a key when the data are retrieved.

The Caché model (balanced tree) is actually more than just multivalue.
It can store anything that can be represented by XML or JSON or any other hierarchy.
At 8:38 PM, January 11, 2006 , --dawn said...: Hi dbdude -- Thanks for your comments. I'll take your point that plenty has been done with the RM both by Codd and by others. I started with a single significant document in order to start with that vocabulary. The glossary of terms itself is no small thing to grasp, so I will try to take it one blog at a time, adding information and topics as we go. I thought I would start with what normalization meant when the term was coined for use with databases, recognizing there are now many, many variations on the meaning. A lot of where we are today with DBMS tools and SQL is related to this original definition of normalization.

I do recognize that Date has redefined 1NF and definitely realize that relations could have embedded relations. If the next blog I publish is similar to the draft I'm writing, it will indicate an ordering to those e-mail addresses. Then you have to add data values in order to model it with relations.

I am curious about your statements minimizing normalization. I would think most students in college database courses today would make a strong connection between the RM and normalization. I gather you disagree with such books. I do try to be careful to not equate SQL-DBMS's (which have implementations) with RDBMS's (which have no pure implementation as best I can tell, but Alphora and others might be working to be more in line with the RM). In my draft I wrote it as SQL-DBMS's so as not to offend and one reader thought that might offend, so I added a comment to try to clarify. Still you think I might be equating the two. Nope, I'm not. But I am saying that the implications of the RM starting roughly with Codd's 1970 paper include the SQL-DBMS's we have today.

So while SQL-DBMS's are not purely relational, they are a by-product of relational theory. MV products can call themselves relational because you can model using relations (that include nested relations), but that data model pre-dates and is not a by-product of relational theory, although they typically have SQL extensions. Similarly for the MUMPS model used in Caché: it is a model that pre-dates the RM. And obviously tag-value databases predate the RM. SQL is a significant part of the legacy of the RM, even if it is not 100% true to the model.

I agree that a single, simple model for data is nice, but I don't prefer relations of atomic (typically poorly defined) values as that model. Cheers! --dawn
At 3:13 AM, January 12, 2006 , x@c.d.t. said...: Anonymous gave a good list of RM features. I would say that the biggest thing missing from the RM as implemented in databases is flexibility. That was the purpose of the model, but since you cannot make a simple change to cardinality nor arity of a domain, it is not all that easy to maintain over time. I'm sure that will come up in other blog entries.

This is what I called "logical independence".
Thinking about flexibility (as schema development in time), it occurred to me that it might be a psychological problem, not a technical one.
There is nothing (well...almost nothing) to stop one to add more relations.
At 8:56 AM, January 12, 2006 , --dawn said...: Hi X. Sure the DBMS itself is flexible but if there is even one application that updates the database through a base table, then it might not be trivial to just up and move an attribute to another table in order to take a single value and turn it into a list or table. If ordering is also important, then the inserts and deletes need to be handled. Then there are people who have queries against the database whose views would have to change.

If all updates to the database use logical views and not base tables, that might be a different story, but I have never seen a situation like that, have you?

You can actually add cardinality and arity to an attribute in an MV system and change no screens and no reports. You would then want to add new screens and reports for that area where you need the additional functionality.
At 9:22 AM, January 12, 2006 , x@c.d.t. said...: Cardinality change is a big change.
If the attribute remain "atomic", then the change should be allowed.

Otherwise the application will step in the magic kingdom of non first order statements and it will be alone in dealing with all the problems that might arise in its journey.
At 11:03 AM, January 12, 2006 , Anonymous said...: x@c.d.t. said...
>> "Otherwise the application will step in the magic kingdom of non first order statements
>> and it will be alone in dealing with all the problems that might arise in its journey.

What problems would they be?
At 3:22 PM, January 12, 2006 , dbdude said...: recognizing there are now many, many variations on the meaning. A lot of where we are today with DBMS tools and SQL is related to this original definition of normalization.

Well, it's good to decide on one clear meaning and stick with it! I agree that the original definition of atomic values gave us the inflexible SQL type system. I believe it was an error on Codd's part to require atomic values. Codd probably just didn't like the idea of putting structure outside the reach of the well-defined relational operators, by placing them in multivalued attributes for instance.. but that choice should be up to the designer.

I am curious about your statements minimizing normalization.

Having rescued many denormalized SQL databases, I find that the problems created by denormalization can be solved either by normalizing the design (can be difficult, especially with several legacy apps accessing the DB!) or adding constraints to keep the data from becoming inconsistent (often an easier choice, even though the constraints can get quite hairy!). The fundamental requirement of a DBMS, in my opinion, is to maintain integrity, and normalization is not required for that. Of course it makes things much easier, but saying that the RM requires normalization, or that a normalized design is a "relational" design, is too much. I believe Date shares this position somewhat; he has a chapter in his latest book, Database in Depth, titled "Two Cheers for Normalization".

relations of atomic (typically poorly defined) values as that model.

So don't require them to be atomic! :-)

Another point you wrote:

If all updates to the database use logical views and not base tables, that might be a different story, but I have never seen a situation like that, have you?

This is unfortunate I think. Views are a powerful concept that should be used MUCH more than they are now. It should be common to "encapsulate" your base tables within various views, analogous to how people use methods and functions in a programming language.

In fact if the DBMS was really powerful, we'd be able to create an updateable multi-value view, an XML view, etc., on the same underlying data. Will we ever see this? Who knows.
At 1:36 AM, January 13, 2006 , x@c.d.t. said...: x@c.d.t. said...
>> "Otherwise the application will step in the magic kingdom of non first order statements
>> and it will be alone in dealing with all the problems that might arise in its journey.

Anonymous said...
>>What problems would they be?

Dragons, you know ... :-)
At 6:15 AM, January 15, 2006 , Henry Keultjes said...: It would be nice to get to know that dbdude

Henry Keultjes
Database Scientifics Project http://www.ncolug.org/ppc.htm
Microdyne Company
Mansfield Ohio USA
hbkeultjes at earthlink dot net
At 4:01 PM, January 15, 2006 , Anonymous said...: Further comments from Henry:

It seems to me that a lot of these arguments are totally irrelevant to where the database rubber meets the road.

Perhaps I am the only Pick related person who started a business that required computerization and who had the background to be the architect and interface designer of such a system built on top of Pick.

Typical businesses that implemented Pick did so because they found an application package or a suite of applications that met their needs better than anything else they had seen. Often that "suiting their needs better" also meant that they could afford a Pick solution while other solutions were out of their financial reach. That affordability was undoubtedly also an aspect of our own move to Pick.

Thus I cannot speak for the masses, I am simply trying to speak to the masses.

Nowadays, db technology knowledge typically comes from learning at the college and university level. Most successful businesses originate from a twist or 180 degree flip of conventional wisdom. Why is it then that it has become an acceptable practice that a CEO or an owner, who understands the business very well but knows nothing about database technology, relies on a technologist, who may understand a *certain* (very important to understand that emphasis) database technology very well but has only a marginal understanding of the business at best, to model the enterprise. To me, from those negative combinations only negative results are possible.

A positive result can be obtained when someone, with a reasonable intelligence and knowledge of what the software needs to model, uses the only "hands-on" system that I know of, Pick, to precisely model an enterprise and, together with a knowledgeable and committed Pick programmer, then systemizes that enterprise. If the implementation aspect of the db technology, even when using Pick, is beyond the comprehension of the CEO, the usefulness of db technology is severely limited.

Systemizing the enterprise is the only legitimate function for db technology. An inherent objective of systemizing is productivity but the problem is that very few enterprises can measure the effect of database technology on the overall productivity of the enterprise. So whether they select one technology over another scarcely makes any difference, regardless of the claims that Sam Palmisano, Larry Ellison or Bill Gates may make.

I measured the effects of Pick on the overall productivity of *our* enterprise. It was simply phenomenal.

It will forever be very difficult for db technologists to deal with that need for productivity on their own. Unfortunately for our economy, most IT people use one job after another as career stepping stones. That's not their fault, it's the fault of the CEO's who are incapable of getting their arms around their own enterprises. Therefore they cannot extract the maximum efficiency from their db technologists by making them an increasingly more effective part of the enterprise they are with now. As a consequence db technologists seldom get to the point of identifying their own needs with the needs of the enterprise and our economy suffers from that great waste of talent and educational resources.

Henry Keultjes
Database Scientifics Project http://www.ncolug.org/ppc.htm
Microdyne Company
Mansfield Ohio USA
hbkeultjes at earthlink dot net
At 7:43 PM, January 22, 2006 , Anonymous said...: Dawn, you make a number of assertions without any proof.

The RM was devised to address a number of issues to do with data consistency and correctness. One aspect it addresses is repeating groups. It has been shown that repeating groups or "multi-values" lead to update anomalies. Using them puts an unreasonable burden on the application code to maintain the integrity of the data.

Your simplistic example of the email address does not appear to even be a proper example of repeating groups. Rather it appears that you have a business requirement there for 3 disparate email address fields.

It is becoming both tedious and annoying to be subjected to assertions of this nature from various people in the media and in various blogs without any supporting argument or proof being provided.

There is a great volume of very good argument and proof out there to support the RM. If you believe the RM is flawed, broken or wrong, or whatever, then would you stop just asserting it and please provide proof of your arguments?

Please?
At 8:03 PM, January 22, 2006 , --dawn said...: Hello Anonymous -- As I will be mentioning in the blog entry for this week, I hope to be able to give closing arguments on this by the end of this year. The "Is Codd Dead?" entry could be thought of as the opening statement. As you might imagine, a single blog is not going to make the entire case to turn around decades of momentum in the direction of the RM. Given that database textbooks often claim that the RM is essential for excellent software development, there is clearly a good argument for the RM.

I definitely plan to address issues such as update anomalies during the course of the argument.

Also, I will have to disagree with you about putting three separate attributes in for e-mail addresses when some people might have four, for example. Perhaps if I state a requirement that this company has a business need to capture every e-mail address that comes their way for a person, then the variable number might be more clear.
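To make the trade-off concrete, here is a minimal Python sketch. Nothing in it comes from the original post: the class names (`PersonFixed`, `PersonMulti`), the field names and the sample addresses are all hypothetical. It only contrasts three fixed attributes with one ordered, multivalued attribute when a fourth address turns up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PersonFixed:
    """Hypothetical record with exactly three e-mail slots."""
    name: str
    email1: Optional[str] = None
    email2: Optional[str] = None
    email3: Optional[str] = None

    def add_email(self, address: str) -> bool:
        """Store the address in the first empty slot; return False if full."""
        for slot in ("email1", "email2", "email3"):
            if getattr(self, slot) is None:
                setattr(self, slot, address)
                return True
        return False  # a fourth address has nowhere to go


@dataclass
class PersonMulti:
    """Hypothetical record with one ordered, multivalued e-mail attribute."""
    name: str
    emails: List[str] = field(default_factory=list)

    def add_email(self, address: str) -> bool:
        """Append the address; the list keeps arrival order."""
        self.emails.append(address)
        return True


# Made-up sample data: four addresses for one person.
addresses = ["a@example.com", "b@example.com", "c@example.com", "d@example.com"]
fixed = PersonFixed("Pat")
multi = PersonMulti("Pat")
fixed_results = [fixed.add_email(a) for a in addresses]
multi_results = [multi.add_email(a) for a in addresses]
```

With the fixed layout the fourth address is rejected unless the schema changes; with the multivalued layout it is simply appended. This says nothing about update anomalies, which are a separate question.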

I am fully aware that I have made no slam dunk case, but have plenty of material to discuss before I wrap it up, leaving it to the reader to decide. I hope you stick around for the ride. If you do, it would be great if you would give yourself a name (choose Other with the comments) so I can identify comments from the same person from blog to blog, even if you do not want to identify yourself.

Thanks for reading and commenting. --dawn
At 4:15 PM, February 11, 2006 , Anonymous said...: Is it just me or did you not say anything? I don't know much about MV, I'm glad it worked for you, but that doesn't mean normalizing data should qualify as a "debacle".

"Don't be fooled - there is no mathematical requirement to normalize data." That sentence doesn't make any sense. Of course there is not, but it certainly is a good idea to achieve the highest level of atomicity possible; that way you can save space, make easy, fast modifications, and, my favorite, it's logical and thus entails a clean solution.

I think I am finally starting to see what Fabian and Date are always talking about. People seem to believe whatever they hear.
At 1:57 AM, February 13, 2006 , Slevdi Davotica said...: Any hierarchical data structure is a view on the data. The MV concept is also a view on the data, with the meaning of the data removed. The RM stores the data so different views can be taken of it for specific purposes. This means any application that uses a hierarchical database can be replaced transparently with a relational database, where views on the data get used rather than direct access to it. (The performance would probably be poor until the program was rewritten to access data logically rather than physically.)
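One way to picture the "hierarchy as a view" claim is the sketch below. It assumes a hypothetical two-table schema (`person`, `person_email`) with an explicit `position` column, and it uses Python's built-in `sqlite3` module. The schema and sample rows are illustrative assumptions, not a design proposed by anyone in this thread.

```python
import sqlite3
from itertools import groupby

# In-memory database with a hypothetical normalized schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE person_email (
        person_id INTEGER NOT NULL REFERENCES person(id),
        position  INTEGER NOT NULL,   -- explicit ordinal, assumed for this sketch
        address   TEXT NOT NULL,
        PRIMARY KEY (person_id, position)
    );
    INSERT INTO person VALUES (1, 'Pat'), (2, 'Lee');
    INSERT INTO person_email VALUES
        (1, 1, 'pat@example.com'),
        (1, 2, 'pat@example.org'),
        (2, 1, 'lee@example.com');
""")

# Flat relational result, ordered so each person's rows are contiguous.
rows = conn.execute("""
    SELECT p.name, e.address
    FROM person AS p
    JOIN person_email AS e ON e.person_id = p.id
    ORDER BY p.id, e.position
""").fetchall()

# Fold the flat rows into a nested (hierarchical) view: name -> ordered list.
nested = {name: [addr for _, addr in group]
          for name, group in groupby(rows, key=lambda r: r[0])}
conn.close()
```

The nested dictionary is derived on demand from the flat tables, and the same tables could feed a differently shaped view. Note that the ordering survives only because the sketch stores it in the `position` column.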

Hierarchical databases are pretty good when used in applications with specialised and fairly static requirements for using the data, but don't do well in environments where the data is used to drive a constantly changing set of business requirements. This is because the data is stored as the specialised view needed by the application, where its meaning has to be inferred by position rather than being explicitly given by the schema.

The reason programmers find MV databases easier to work with is simply because the MV concept uses the equivalent of an array to store values where their order is their differentiator. Programmers understand memory location-based concepts such as arrays intuitively and can thus be very productive with databases that have converted meaningful data into application specific structures for them.

But managing data isn't only about programmer productivity. It is far more important to have data integrity, logical independence and flexibility when designing systems for a rapidly changing and widely differing set of business needs. These are the strengths of the RM and why it can be used to replace any hierarchical database. The reverse is not true.

For evidence of this, just look at your own original post. The MV example you give is clearly a lossy view on the equivalent RM schema - a half-way house between the data and the program. It stores data in a physical way in that an email address is selected by its position in the MV array, not its meaning. Thus, in a non-RM, the meaning of the data is known in part by the database and in part by the programs that interact with it, whereas in an RM the meaning is known exclusively by the database and the programs simply ask for what they need, transform and/or present it, and put it back.
At 6:47 AM, February 14, 2006 , --dawn said...: First to anonymous: If you think that I did not say anything, then that would indicate that you knew everything I said already. Fair enough. I did not say anything that has never been said before, although I presented it in a different form, putting different thoughts together than others might have. You can see this blog entry as an "opening statement" for this blog. I will be addressing this topic from a variety of perspectives throughout the year.

And to Slevdi: I won't hit everything in a quick response, but will tackle a couple of points. The web is a very dynamic structure and it is not based on the relational model. I have seen more agility with non-RM-based databases than with RDBMS's myself, but have no empirical data (wish I could figure out how to get some).

Also, I don't understand why you think a structure that you call hierarchical (I'll call it a web since it is a di-graph / web of little trees / DOM) cannot be used to implement any RM structure. It can do that easily.

And with the e-mail address, if the meaning is that the second address is the second e-mail address to use, then you want to keep the ordinal information. It is the database that knows that this is second in my example. The developer need not manipulate ordinal information at all since the database handles it. Developers often dump all ordinal meaning for implementation in an RDBMS because it is just too much work to design, implement, and maintain it when the database does not do it for you. Thanks for reading and responding. --dawn
At 6:51 PM, May 24, 2006 , James Conners said...: Dawn,

If nobody has made it clear yet, you and anyone who saw any bit of brilliance with the above piece are simply idiots.

Jimbo
At 7:00 PM, May 24, 2006 , --dawn said...: Hi Jimbo -- I was aiming for accuracy and clarity in this opening statement. Do you see anything inaccurate or unclear about it? Thanks for your opinion, but I do hope you can rise a notch above such discourse. --dawn
At 7:24 PM, February 17, 2007 , Anonymous said...: Your email example does not provide enough information about business requirements to be critically analysed by your readership.
At 9:46 PM, February 17, 2007 , --dawn said...: Thanks for your comment, anonymous. I definitely did not flesh out the full requirements for a design, but I did expect it was enough for a reader to understand the point. I can give you more requirements if you would like--just let me know. Thanks. --dawn
At 11:30 AM, February 22, 2007 , Anonymous said...: I don't understand the other guys' comments.
Isn't this a joke? You know, like
"10 things I hate about SQL Server"
http://weblogs.sqlteam.com/jeffs/archive/2005/05/24/5248.aspx

I seriously thought it was.
At 11:41 AM, February 22, 2007 , --dawn said...: Nope, not a joke. The relational model as mathematics is fine, but applying it as THE model for persisted data has set the industry back, in my opinion.

Some unnecessary restrictions, such as 1NF, are (or at least were) part of the RM.

Issues with implementations, such as three-valued logic instead of two-valued logic, rode into our standard data processing applications by way of "relational databases."

Eliminating arrays and all ordering is also unnecessary.

There are pros and cons, rather than rights and wrongs, in using 2VL, non-1NF, and lists in data, but these can be open for discussion. Some RM folks have tried to claim that the RM is proven to be the best way to model stored data. There is no such proof.

That said, while the entire argument is not a joke, the collection of blog entries on this topic hopefully does have some humor. --dawn
At 5:38 AM, April 02, 2007 , Baxter Basics said...: Reality check.... C'mon Dawn, you're having a little joke, aren't you? You're just trying to rile any grown-up database practitioners that stroll past - yes? You naughty, naughty little minx.
At 2:28 PM, April 17, 2007 , Concerned DB Professional said...: This article is merely the ramblings of an idiotic and dangerous mind. Move along, people, nothing worthwhile here.
