
mewsings, a blog

--dawn


Wednesday, March 29, 2006

Better to Have No Values

Mailbox, empty?

The topic at hand is NULLs. Some professionals think the best option for recording missing data is to use a NULL to mean no value, an approach implemented by many database management systems. SQL then recognizes missing data when it reads a marker referred to as a NULL. Working with missing data involves both of these components: how one specifies the missing information to the database and how the database language works with the missing information.

With the SQL approach where NULL does not refer to a value, but to the lack of a value, one attribute tagged as NULL is not equal to another attribute tagged as NULL. When you compare two values, x and y, if either of these is NULL, then your comparison is neither true nor false, but a third logical value of I dunno or UNKNOWN.

In other words, SQL employs a three-valued logic (3VL). There is some good reasoning behind this. Before we turn again to bowling scores, I'll use a very simple example. If my natural hair color is recorded as missing information and Sylvia's natural hair color is also missing, then surely we do not have enough information to say either that my natural hair color matches Sylvia's or that it doesn't. It's a mystery. So, if we evaluate the expression Sylvia has the same natural hair color as Dawn we don't know if it is true or false. We need a third logical value, sometimes represented as unk or UNK for unknown.
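For readers who like to see the logic spelled out, here is a minimal Python sketch (not SQL itself) of three-valued comparison, using None to stand in for both NULL and UNKNOWN; the function names are my own invention:

```python
def eq3(x, y):
    """Three-valued equality: True, False, or None (UNKNOWN)."""
    if x is None or y is None:
        return None          # comparing against NULL is neither true nor false
    return x == y

def and3(p, q):
    """Three-valued AND: UNKNOWN unless the result is forced."""
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True

dawn_hair = None     # missing information
sylvia_hair = None   # missing information
print(eq3(dawn_hair, sylvia_hair))   # None -- we just don't know
```

Note that even NULL compared to itself yields UNKNOWN, which is exactly the behavior the hair-color example describes.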

A lot of software is built with this strategy for missing values, but there are other options. The Pick, aka MV, query language employs a two-valued logic (2VL), as do most programming languages. Much has been written about NULLs and n-valued logics, particularly within the context of the relational data model. The RM does not require the 3VL of SQL, as author Chris Date has made clear. I will not revisit those discussions here, contributing only this glimpse into Pick's approach.

Step one in understanding this 2VL is to suspend the idea that NULL is the lack of a value or an indicator for no value. In MV, a NULL is a value. There are exceptions as various MV products address their interface to SQL differently, but I will simplify and focus on what is common to all such products. Think of a NULL as indicating that there are no values for a particular variable. One way to provide a mathematical model for this approach is to perceive the NULL as an empty set, aka null set.

NULL = the empty set

This empty set is implemented in MV as an empty list. MultiValue solutions permit the value of a variable, or the intersection of a row and column (loosely speaking), to be a list of values as indicated in my last mewsing. The list could have 0, 1, or many elements. If the cardinality is 0, the value is NULL. If the cardinality is 1, the attribute has a single value. If the cardinality is greater than 1, it is more obviously a list. But no matter what the cardinality, the value of an attribute can be modeled as a list, whether empty, single-valued, or multi-valued.
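The empty-list model can be sketched in a few lines of Python; the record shapes and sample scores are illustrative, not actual Pick storage:

```python
# Every attribute value is a list; NULL is the empty list.
chris = {"BOWLER_ID": "chris", "scores": [120, 183, 144]}  # cardinality 3
dawn  = {"BOWLER_ID": "dawn",  "scores": [157]}            # cardinality 1
dani  = {"BOWLER_ID": "dani",  "scores": []}               # cardinality 0: NULL

def is_null(value):
    """In this model, NULL simply means a list with no values."""
    return len(value) == 0

print(is_null(dani["scores"]))   # True: Dani has no values for scores
```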

Bowling Series Scores

If there are no values in the list, set, or bag, the value of the MV attribute for a particular record would be NULL. If a NULL is encountered by a user when looking at the data, the meaning of that value is no values. So, if Dani has a record of data and the value of her variable scores is NULL, then the corresponding proposition would be: Dani has no values for her score or Dani has no bowling scores. It would not be: Dani's score has no value. Got it?


Because each attribute in Pick does have a value, even if that value is no values, logical expressions are easily evaluated. Just true or false, that's it, two values and the obvious ones at that. If there is ever a test for equality, a list of no values is equal to another list of no values. For less than or greater than comparisons, a NULL is less than everything other than another NULL no matter what type of data is compared.

Some might be resistant to this broad use of a NULL value as it gives us no insight into why there are no values for an attribute. If there is a reason to collect information on why an attribute has no values, additional attributes could be included in a logical data model.

In our scenario, we don't know if Dani bowled the games and the scores have not yet been recorded, if Dani bowled and the score was recorded incorrectly, if Dani has not yet bowled, if Dani was sick with bird flu and missed league bowling this week, or if Dani doesn't ever bowl. In the first two cases, our data as recorded is incorrect. So we could interpret the NULL value for scores as Either Dani has no bowling scores or our data is incorrect but that is similar to the interpretation of any value. We could say that either this bowler's name is Dani or our data is incorrect. So that is not particularly helpful. The interpretation of this NULL value, then, is that Dani has no bowling scores. That's it. If there is a NULL value for an attribute, it means there are no values for that attribute.

Our resulting two-valued logic comparisons are quite simple, useful, and meaningful. If we want to list all people with their series total, ordered lowest to highest, we would get those people who have NULL values listed first.

LIST BOWLERS SCORE BY SCORE

If we want only those with values, we could request only those WITH SCORE:

LIST BOWLERS WITH SCORE SCORE BY SCORE


The first SCORE is testing for any values for the variable SCORE, the second shows the value (the ID for the bowler is listed automatically in the output), and the last SCORE is for the ordering. Using this approach, Dani and Shirl have the same value for their second game score. I'm certain this rubs some SQL users wrong. Instead of Dani and Shirl having no score, which would mean the question is meaningless, Dani and Shirl have no scores or an empty list of scores for the second and third games in the series.
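A rough Python analogue of the two LIST queries above, with SCORE derived as the series total and the empty list sorting before everything else (sample data hypothetical, and the sort rule mirrors "a NULL is less than everything other than another NULL"):

```python
bowlers = {
    "chris": [120, 183, 144],
    "dani":  [],            # NULL: no scores
    "shirl": [98],
}

def score(scores):
    """Derived attribute: the series total, or None when there are no values."""
    return sum(scores) if scores else None

# LIST BOWLERS SCORE BY SCORE -- NULLs sort first
by_score = sorted(bowlers, key=lambda b: (score(bowlers[b]) is not None,
                                          score(bowlers[b]) or 0))

# LIST BOWLERS WITH SCORE SCORE BY SCORE -- only bowlers with values
with_score = [b for b in by_score if bowlers[b]]

print(by_score)     # ['dani', 'shirl', 'chris']
print(with_score)   # ['shirl', 'chris']
```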

This logic can be applied to attributes that have list values as well as those considered to be single-valued since these can also be modeled as lists, even if short ones. You can likely see how there could still be some misunderstandings in interpreting NULLs of this ilk, but it really is an amazingly simple and useful approach to what is modeled as the lack of values in SQL. Instead of having no value, we have a value that is a list with no values.


18 Comments:

At 10:57 PM, March 29, 2006 , Anonymous Keith Johnson said...

I think you have an error. In Pick, the empty string AND the number zero are both "FALSE". Therefore, your enquiry will not give someone who has a genuine zero score (and yes, I know it's very unlikely).

Where you have numeric data (like birthdays or scores) you can infer the empty string to mean missing data, but the Pick system will, by default, regard it as a zero! A clear example is selecting everyone with a birthday between January 1st 1960 and January 1st 1970. Using

LIST PEOPLE WITH BIRTHDAY AFTER "1 JAN 1960" AND WITH BIRTHDAY BEFORE "1 Jan 1970"

will also get people with no birthdate recorded because Pick treats the empty string as zero. You have to code around this by an enquiry like

LIST PEOPLE WITH BIRTHDAY NE "" AND WITH BIRTHDAY AFTER "1 JAN 1960" AND WITH BIRTHDAY BEFORE "1 Jan 1970"

or maybe even something like

LIST PEOPLE WITH F6 NE "" AND WITH BIRTHDAY AFTER "1 JAN 1960" AND WITH BIRTHDAY BEFORE "1 Jan 1970"

I forget offhand where the conversion gets done.
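In Python terms, the coercion described here looks roughly like this; the internal date numbers are made up for illustration, not actual Pick date values:

```python
def pick_num(value):
    """Illustrative coercion: a Pick-style empty string behaves as 0."""
    return 0 if value == "" else value

people = {"ann": 100, "bob": ""}    # bob's birthdate was never recorded

jan_1960, jan_1970 = -2923, 731     # hypothetical internal day numbers

# WITH BIRTHDAY AFTER ... AND ... BEFORE ..., without the NE "" guard:
hits = [p for p, d in people.items()
        if jan_1960 < pick_num(d) < jan_1970]

# The same range test with the NE "" guard added:
guarded = [p for p, d in people.items()
           if d != "" and jan_1960 < pick_num(d) < jan_1970]

print(hits)      # ['ann', 'bob'] -- bob slips in via the zero coercion
print(guarded)   # ['ann']
```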

Mind you, I'd rather use the Pick method (infer missing data if you want, or let it be assumed zero) than get into the hell that is NULLS.

 
At 4:45 AM, March 30, 2006 , Blogger --dawn said...

If no one jumped in with information on how NULLs were reflected in the values, how empty strings were handled, for example, then I was going to have to do that in a subsequent entry. So, I'm glad you added this here. I could only cover so much in one mewsing and wanted to get the point across on the mathematics of NULLs, the logic employed, in Pick. Talking about 0 and "" gets us into a discussion of types in Pick as well.

Although I might have something wrong (I don't see it yet), if you call the empty string in Pick NULL (which people do, in my experience), I think what I have said is accurate. In order to get this one point across, I have some significant omissions, however. I don't think that saying that the empty string and 0 are both "false" is accurate, however. But it is true that with the LIST statement in the blog that has WITH SCORES, both the NULLs (encoded as an empty string "") and those SCORES that are 0 would not be listed. Adding the point about the zeros seemed to cloud the issue for me, so I omitted it from the blog entry, but am pleased to have it brought into the comments.

Additional topics that play into this are 1) representation of NULL as an empty string 2) handling of 0 in Pick 3) types in Pick, which I will get to later 4) date handling (which is definitely another topic altogether). There are also some differences between the MV flavors, with UniVerse adding a SQL NULL to the mix, for example. I have found it to be amazingly difficult to carve out a single topic without talking about everything. Thanks for adding to the completeness of this entry, Keith. I'll read it again when I get back to the office next week to see if there are incorrect statements, but I think the issues you bring up are errors of omission rather than commission, right? --dawn

 
At 7:19 AM, April 23, 2006 , Anonymous mc said...

I'm somewhat perplexed by this blog series. It seems to start off as a polemic against the RM, in favour of MV and other 'richer' models. However, the ensuing discussion of the bowling example and the obscure vagaries of Pick implementation of MV seem, for all the world, to point in exactly the opposite direction.

The bowling example would be expressed in the RM as:


VAR BOWLER RELATION { BOWLER# INT, NAME CHAR } KEY { BOWLER# }
VAR GAME RELATION { BOWLER# INT, GAME# INT, SCORE INT } KEY { BOWLER#, GAME# }

FOREIGN KEY GAME { BOWLER# } REFERENCES BOWLER { BOWLER# }


The question of what to do about NULL is completely obviated, as are confusing issues about what NULL really means in any given context.

The query examples leave me baffled. You indicate that SCORE is a list, set or bag. If so, then how can the query LIST BOWLERS SCORE BY SCORE sort by total score? Is the list implicitly folded over addition to derive a sum? In any event, the RM allows us to ask all sorts of interesting questions (in a hypothetical language inspired by Tutorial D):


-- The total score for each bowler (including anyone who didn't bowl)
BOWLER { BOWLER#, SUM( SCORE ) OVER GAME AS SUM_SCORE }

-- The total score for each bowler (excluding non-bowlers)
GAME { BOWLER#, SUM( SCORE ) OVER GAME AS SUM_SCORE }

-- bowlers who scored at least 140 in any game
(GAME WHERE SCORE >= 140) { BOWLER# }

-- games in which at least one bowler scored 140 or more
(GAME WHERE SCORE >= 140) { GAME# }

-- bowlers who never scored badly (including anyone who didn't bowl)
BOWLER SEMIMINUS (GAME WHERE SCORE < 100)

-- Number of bowlers in each game
GAME { GAME#, COUNT( ) AS #BOWLERS }

-- Bowler scores, grouped by game
GAME { GAME#, GROUP( ) AS BOWLER_SCORE }

-- The average score for each game
GAME { GAME#, AVG( SCORE ) OVER GAME AS AVG_SCORE }

-- Margin between best and worst score, by game
GAME { GAME#, MAX( SCORE ) - MIN( SCORE ) OVER GAME AS MARGIN }


One criticism of MV is that it is usually justified on the basis of a few intended usage patterns, ignoring the likelihood of needing other views that aren't readily amenable to the chosen structure. Avoidance of this problem is one of the key benefits of normalization. How would Pick fare with the above scenarios?

 
At 5:24 AM, April 24, 2006 , Blogger --dawn said...

Hi mc -- Lots to respond to here, so I will use multiple responses to respond in full.

I will start with the example that is baffling you. I can see why it is confusing, as I left too much to the non-MV-savvy reader. While the attribute names scores and SCORE were varied by plural vs. singular, lower vs. upper case, and italics vs. not, that was likely not sufficient to signal they were different to anyone not predisposed to think in terms of derived data. I think I described scores properly as the list of the scores in the series, so that the value of the scores attribute for Chris would be the list [120, 183, 144]. I did not provide a definition for SCORE, however, as I did not want to digress into derived attribute definitions. SCORE, in this example, is the Total for the series. It is derived data which would be defined as the sum of the scores in the list. I should have shown it in the spreadsheet as the column header for the Total.

LIST BOWLERS SCORE BY SCORE

Is a query against a schema such as

BOWLERS {BOWLER_ID single, scores multi-valued, BOWLER_NAME single, SCORE single}

Normally I would make scores in all upper too, but I was trying to avoid confusion. Only the BOWLER_ID and scores attributes are "stored in this file" (using MV terminology) while all of these attributes are "in this dictionary." Pick is language-based. You ask questions using terms defined in a dictionary. Those terms could relate to stored or derived data. Derived data includes data stored elsewhere (such as BOWLER_NAME) and data derived from stored data (such as SCORE).
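A small Python sketch of that dictionary idea: stored fields are answered from the file, while derived ones are computed on demand. The mechanism is illustrative, not actual Pick dictionary syntax:

```python
names = {"B1": "Chris", "B2": "Dani"}        # data "stored elsewhere"

bowlers = {                                   # stored in the BOWLERS file
    "B1": {"scores": [120, 183, 144]},
    "B2": {"scores": []},
}

dictionary = {                                # derived vocabulary
    "BOWLER_NAME": lambda bid, rec: names[bid],
    "SCORE":       lambda bid, rec: sum(rec["scores"]),
}

def field(bid, name):
    """Answer a query term from stored data if present, else derive it."""
    rec = bowlers[bid]
    if name in rec:
        return rec[name]
    return dictionary[name](bid, rec)

print(field("B1", "SCORE"))        # 447
print(field("B2", "BOWLER_NAME"))  # Dani
```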

I hope that helps clarify the example. Based on your question, I will likely change the spreadsheet headers to make it easier for the next reader.

Thanks for your comment. I will take up other aspects of your comment as well, hopefully today. Cheers! --dawn

 
At 6:28 AM, April 24, 2006 , Blogger --dawn said...

My last comment has not yet shown up, but I did change the spreadsheet for the bowling series to clarify the scores and SCORE attributes.

Now, to address some of mc's other comments. I will first admit that my dislike of a 3VL is not a beef I have with the relational model, per se, other than that the RM led to the development, massive deployment, and standardization of SQL, which employs a 3VL. I agree that one could have a relational theory that does not meet Codd's Rule 3 regarding nulls.


I applaud Tutorial D for addressing two of the big issues I have with what is typically taught in introductory database courses, often under the heading of relational theory: 1NF and 3VL. I consider 1NF and 3VL to be two of the biggest problems (unnecessary costs) with SQL, both of which were prompted by Codd's work and supported by him, at least as related to Codd's 12 rules.

So, what does this mewsing have to do with the relational model? Historically it has quite a bit to do with it. While I am pleased that even some relational theorists have come around to thinking that 2VL would be better, there is a significant amount of writing and instruction done in an effort to teach the next generation of developers that they, too, must employ a 3VL (and must normalize, 1NF, their data).

Does that mean I have no beef with the RM? No. (I'm sounding like Rumsfeld, asking and answering my own questions. Apologies) Even with Tutorial D (are there now production users of an implementation of this language?) which makes great gains over SQL, it seems, the Information Principle is still in play. While there can be relation-valued attributes, there are not list-valued attributes.

I'll take a look at the scenarios further, but I think the answer to the question and valid concern about Pick is that it is likely that after defining an initial schema, many of the questions would be straightforward, even for an end-user, while others would likely have been missed in the initial vocabulary, requiring dictionary additions. It is not unusual to add vocabulary items regularly to describe new derived data not previously identified as needed, however such derivations can often also be included inline in a query, without the more common Pick approach of assigning a name to them.

Your comments were much appreciated. I hope I have responded adequately. --dawn

 
At 1:03 AM, April 25, 2006 , Anonymous mc said...

Thank you for the thoughtful replies, Dawn. I am quite curious to understand how Pick deals with queries on structural elements that aren't captured by the chosen hierarchy; counterparts of my queries above will be very helpful in this vein.

I must point out that your criticism of the relational model is unfounded. As propounded by the authors of 'The Third Manifesto', the RM accommodates list-valued attributes with aplomb. In fact, values of almost any type at all are permitted (pointers being a notable exception). My first exposure to this idea was some years ago in Date's 'Introduction...', 7ed. His latest book, 'Database in Depth', provides an excellent exposition in the section titled, 'Data Value Atomicity' (p29). Basically, the notion of atomicity is purely psychological in that it reflects our perceptions of data more than any intrinsic non-decomposability. Is a string an atom, or a list of characters? Is an integer just that, or is it a bag of its prime factors? Thus attributes can be of type list, set, bag, relation, image, xml document, fingerprint, mathematical equation, molecular model, and so on, as long as each can be thought of, and manipulated as, a value. A list-valued attribute is almost certainly a bad idea (it would make most of my queries above rather more awkward), but it is perfectly legal.

 
At 10:08 PM, April 25, 2006 , Blogger --dawn said...

Hi mc -- I can provide the easy portion of the response now, but the definitions of the derived data would take me longer (than any real Pickie) as I will need to recall or learn the syntax for specifying virtual fields, having never been a Pick programmer myself. It would probably be good for me to do that, but it might be a while before I can get to it (sometimes I miss having a staff). I hope you understand.

In case it is helpful for understanding the practical aspects of this, when my employees hired students to write virtual fields, the typical turn-around on a request for a new v.f. was within an hour (including testing and moving to live), while some could take a few days if they needed to be backed by 3GL code. There was no backlog of requests. End-users wrote queries like those I wrote below, requesting new v.f.s when needed.

-- The total score for each bowler (including anyone who didn't bowl)
mc: BOWLER { BOWLER#, SUM( SCORE ) OVER GAME AS SUM_SCORE }
dw: LIST BOWLERS SCORE

-- The total score for each bowler (excluding non-bowlers)
mc: GAME { BOWLER#, SUM( SCORE ) OVER GAME AS SUM_SCORE }
dw: LIST BOWLERS WITH SCORE SCORE

-- bowlers who scored at least 140 in any game
mc: (GAME WHERE SCORE >= 140) { BOWLER# }
dw: LIST BOWLERS WITH scores >= 140

-- games in which at least one bowler scored 140 or more
mc: (GAME WHERE SCORE >= 140) { GAME# }
dw: LIST GAMES WITH SCORE >= 140

-- bowlers who never scored badly (including anyone who didn't bowl)
mc: BOWLER SEMIMINUS (GAME WHERE SCORE < 100)
dw: LIST BOWLERS WITH EVERY scores >= 100

-- Number of bowlers in each game
mc: GAME { GAME#, COUNT( ) AS #BOWLERS }
dw: LIST GAMES NUM_BOWLERS

-- Bowler scores, grouped by game
mc: GAME { GAME#, GROUP( ) AS BOWLER_SCORE }
dw: LIST GAMES SCORE

-- The average score for each game
mc: GAME { GAME#, AVG( SCORE ) OVER GAME AS AVG_SCORE }
dw: LIST GAMES AVG_SCORE

-- Margin between best and worst score, by game
mc: GAME { GAME#, MAX( SCORE ) - MIN( SCORE ) OVER GAME AS MARGIN }
dw: LIST GAMES MARGIN

As you can see, I wrote each as if the vocabulary were already defined in the dictionaries of the BOWLERS and GAMES files. In most cases, it would have been. In a few, it is likely new virtual fields such as MARGIN would be required at the point this query was required. Additionally, I suspect that some of these virtual fields could be specified inline in the query statement if the vocabulary were not already there.
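Two of those multivalued comparisons can be mimicked in Python to show the ANY vs. EVERY semantics (sample data hypothetical; that EVERY is vacuously true for an empty list is my reading of the semiminus equivalence above, which includes anyone who didn't bowl):

```python
bowlers = {
    "chris": [120, 183, 144],
    "dani":  [],            # NULL: didn't bowl
    "shirl": [98, 155],
}

# LIST BOWLERS WITH scores >= 140 -- matches if ANY value qualifies
any_140 = [b for b, s in bowlers.items() if any(v >= 140 for v in s)]

# LIST BOWLERS WITH EVERY scores >= 100 -- ALL values must qualify,
# vacuously true for an empty list (all([]) is True in Python too)
every_100 = [b for b, s in bowlers.items() if all(v >= 100 for v in s)]

print(any_140)    # ['chris', 'shirl']
print(every_100)  # ['chris', 'dani']
```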

Your statements are not the entire SELECT ... FROM statement, where the MV statements I wrote are the entire query AFTER the derived data are defined. I definitely see the elegance in being able to define only stored data and then manipulate it with Tutorial D. I think D is quite elegant with the primary omission from my perspective being a good approach to lists (as you noted, an LVA is almost certainly a bad idea using Tutorial D, or any RM implementation as I understand it).

Hopefully you can also see the elegance of the above Pick queries. Tutorial D is mathematically elegant, while Pick is more aligned with language.

Did this explanation help address your questions? Cheers! --dawn

 
At 5:32 AM, April 27, 2006 , Anonymous mc said...

Thanks for the reply, Dawn. First I'll address a misconception in the above comments: The expressions I gave are complete. There is no SELECT ... FROM in Tutorial D's syntax or in my derivative language, which, while syntactically somewhat different to D, is semantically identical. As a case in point, my final expression (GAME { GAME#, MAX( SCORE ) - MIN( SCORE ) OVER GAME AS MARGIN }) can be transformed into the following equivalent Tutorial D:

WITH
( GAME { GAME# } ) AS G,
( SUMMARIZE GAME PER G ADD MAX( SCORE ) AS t1 ) AS tmp1,
( SUMMARIZE GAME PER G ADD MIN( SCORE ) AS t2 ) AS tmp2:
(tmp1 JOIN tmp2) EXTEND t1 - t2 AS MARGIN { GAME#, MARGIN }

The join and its extra baggage are needed due to Tutorial D's tight restrictions over the expressions allowed in the SUMMARIZE operator's ADD clause. This is peculiar to D, however, and is not fundamental to the RM. A slight relaxation of the rules would allow a syntax almost as terse as mine:

SUMMARIZE GAME PER GAME { GAME# } ADD MAX( SCORE ) - MIN( SCORE ) AS MARGIN

However, D is a teaching language and, as such, ought to be as uncomplicated as necessary to aid the learning process. While one can probably make a case for such a relaxation in D, shortcuts of the kind I'm using are definitely the province of industrial languages.
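For the relationally uninitiated, the margin-per-game aggregation can be mimicked in a few lines of Python, assuming a GAME relation of (bowler, game, score) tuples as in my schema above (data hypothetical):

```python
GAME = [
    ("B1", 1, 120), ("B2", 1, 98),
    ("B1", 2, 183), ("B2", 2, 155),
]

# Track (min, max) score per game, then derive the margin.
bounds = {}
for _, game, score in GAME:
    lo, hi = bounds.get(game, (score, score))
    bounds[game] = (min(lo, score), max(hi, score))

margin = {g: hi - lo for g, (lo, hi) in bounds.items()}
print(margin)   # {1: 22, 2: 28}
```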

Now, the incompleteness of your counterexamples is perfectly understandable, since you believed that my examples were also abridged, and there is the issue of your busy-ness, which I quite appreciate, being an extremely busy individual myself at present. However, I reiterate that my examples are complete relational expressions and I therefore suggest that the only way to do justice to this dialectic is to follow suit and present a reasonably complete solution for the scenarios that my examples address. This is one point on which, I believe, we are not at liberty to take shortcuts. Without this, certain very big questions will simply remain open. For instance, the partial solutions you provide hint at, but fail to explain, how the same data structures can simultaneously support queries that slice through different domains (e.g., aggregating scores per game vs. aggregating scores per bowler). Also, since the BOWLERS entity holds scores, how is the GAME entity's SCORE field defined so as to avoid duplication and all the associated consistency problems? (Is this the bit where a 3GL comes into play?) And how do you define the field to refer to each bowler's score, instead of the total score across all bowlers?

There is a simpler and nastier problem lurking in the background. Your suggestion that Pick's virtual fields would only take hours (or sometimes days) to publish is not a ringing endorsement, given that I was able to produce each of my examples in less than one minute. Thus I not only performed two to three orders of magnitude faster than your Pick programmers, but I did so without changing the data model one iota (zero deployment). From what I can gather, the above requirements would lead to a very complex Pick data model that forces everyone to deal with fields that perhaps only one person or application is interested in. Plus I should point out that so far none of my examples has invoked even so much as a join (though, to be fair, (A semiminus B) = (A minus (A semijoin B)), so my semiminus example kind of does). I plan to introduce some substantially more complex scenarios in future exchanges with the addition of just one new field, in order to show the remarkable ease with which the relational algebra can accommodate a changing landscape.

Please feel at liberty to take your time responding, Dawn. I won't take it to mean anything other than that you are a busy person, as am I.

 
At 8:42 AM, April 27, 2006 , Blogger --dawn said...

mc - Thanks for the explanation of your language. I was confused by the lack of verbs, where D has them.

I very much like the fact that with RM implementations (and non-implementations) once the data model is in place, a single query language permits all possible queries. That is such a nice feature that when folks work on new query languages, they do not even consider doing otherwise. So, XQuery would be an example of a language that could be executed against a Pick data model with that feature, but the MV query language I'm using above does not. I do like the MV approach that incorporates derived data to simplify the syntax of the language. There is no reason we could not have the elegance of each.

Some of the times when using derived data I could have put an inline expression in the statement, but defining new vocabulary in the schema is done one time when there is something new we want to ask about, and it can be used for the life of the system. So up front after deploying new files/tables there might be many requests for such new glossary items, while this fades off drastically soon enough and can be nothing at all for long periods of time. Some sites permit end-users to specify derived data themselves, considering this metadata to be part of user-defined business rules. This approach bypasses the costs of coding this vocabulary in a s/w development test environment with related change management procedures (that is what takes 50 minutes of the hour). I won't mention that in small shops, the developer just types them into the dictionary (descriptive, not prescriptive) with an editor (don't tell the auditors).

I'll see if I can build the bowling scenario in OpenQM at some point. The answer to the question about how we can ask a question both from the perspective of the bowlers and the games is that in one case the data are "stored" while in the other case they are "derived." There is one piece of redundant data in the model that would not be present in a relational model -- return links. So, the games would point to the bowlers and the bowlers would point to the games they bowled. I'm not thinking clearly right now, however, so the implementation might look different from that when I get to it.

Some queries for you to add might be to list the game scores for each bowler, in order. Then it is time for frames. List the bowlers with the frames and totals for each game bowled, with the frames and the games in order. The output can align with your data model, but I'm thinking the column headers could be something like:

Name..........Game....Frame 1..2..3..4.....................Final Score

In my case, I would not repeat the name, so the result would have one record that displays in three rows. Then we get to the actual balls bowled so that it would be possible to ask a query that retrieves the data for real bowling score sheets. This gets into domain issues where it would not be difficult for me to have an attribute where a spare and a strike could be designated with letters or symbols (even / and X, if desired) as well as numbers which could be added to each other. While you could fiddle with it without doing this, the right way with the RM might be to have a user-defined type for a frame value? The individual frame scores would then be derived.

By the way, while I'm not trying to move the entire industry to MV products, when the industry adopted the RM in a big way it didn't use the "make new friends, but keep the old, one is silver and the other gold" approach; it ditched the elegance and related best practices of many pre-relational models and has been trying to add many of those features in ever since, with developers struggling in their absence. Now that the RM (in theory, at least) seems to include non-1NF (with RVAs), 2VL, virtual fields/columns associated with a specific table (in SQL Server), and even lists as user-defined types (although I suspect you would define the lists of frames and games without using list types), perhaps we can accelerate getting the rest of the features that have made other environments such a good bang for the buck for organizations into our products.

Thanks, again, for your comments. --dawn

 
At 4:45 AM, May 03, 2006 , Anonymous Wol said...

mc,

You said that the MV is "usually justified on the basis of a few intended usage patterns". IMHO that justification is false. A properly designed MV database has *natural* usage patterns.

In other words, data is either accessed via its "real world primary key", or its value. These are, imho, the only way for the person asking the query to express what they want. MV and SQL are identical in theoretical performance for an "access by value" query, but MV is provably optimally efficient for "access by primary key".

It all takes proper design (true of both MV and relational), but nearly all of your queries I could design very rapidly. Let's assume I have a FILE for bowlers, a FILE for games, and a FILE for scores. (Actually, that last FILE breaks one of my design rules - a FILE should have a real-world primary key and this has a composite real-world primary key, but never mind...)

I can now define SCORE and AVG as vf's in seconds (you're being unfair on Dawn saying your productivity is better than hers - she said a vf *typically* takes a day from idea to production - these vf's would take me five seconds!).

And (note I said I can PROVE optimal access) I could now run these queries on a ram-poor, io-bound system and get good response times (oh, I haven't declared any indices yet :-). Try doing that on a similarly constrained SQL system, especially when your database of scores gets large :-)

In any well designed database, data should be normalised (just not necessarily FNF). It should also be grouped according to its "real world id", something that happens naturally with MV and is almost impossible with SQL. As far as MV is concerned, this approach optimises all *likely* data access routes, but doesn't penalise unlikely ones. It does, however, hinder *stupid* ones - the sort that you can but shouldn't run in SQL because the results don't make sense.

Cheers,
Wol

 
At 3:39 AM, May 10, 2006 , Anonymous mc said...

I have to say, Dawn, that I would much prefer you took a month to present a complete example than expending significant amounts of thought and energy explaining why this is a difficult thing to do. Please don't get me wrong. I really appreciate you taking the time to respond with the thoroughness that you do, but a book's worth of prose doesn't make up for a concrete demonstration of what MV can do for me.

I feel compelled to raise, up front, a very serious objection to your counterexamples: They hide the real complexity of the solution. Now, I'm not saying that hiding complexity is a bad thing. On the contrary it is a very good thing to do, and RM-based languages have views and functions at their disposal to provide this kind of hiding. But your argument that this somehow simplifies your chosen model is highly dubious. It amounts to saying that MV is simpler than Tutorial D because it supports views! D supports views too, but I certainly would not have been justified in presenting, for my examples, nothing more than a list of view names.

A general observation I can make is that pretty much every criticism you've levelled at the RM is misdirected, apparently on the basis of the incorrect assumption that SQL is a faithful and reasonably complete expression of the relational model (It is neither, by a long shot). The relational model has not been trying to bring back pre-relational ideas. It has not abandoned 1NF (RVA's don't change that in the least, and your reiteration of this fallacy suggests that you didn't really grasp my earlier comments about the psychology of atomicity. If you did, and simply disagree with me, then please say so.) In light of my above criticisms of your 'simplicity' argument, I have not seen a single idea that the RM could adopt from MV to the profit of its users. My examples are invariably within a stone's throw of yours, as far as terseness is concerned, and yet are complete and self-contained, with no need of external support or further elucidation.

Bowling score sheets can be addressed in at least three ways. You've already hit on the first (a FRAME type). The second is to split the score into relations for the first and second shot. A third is to store both shots in one relation, with an additional 'shot' attribute whose type is the set {1, 2}. Since the notion of atomicity is essentially a psychological one, there is little in the way of a logical basis to favor one way over any other. And whichever model is chosen, one can readily define views that expose the other models if, and when, the need arises. Furthermore, a true relational DBMS would be free to choose whichever underlying physical layout it deemed optimal, irrespective of the logical model chosen. I personally think that your suggestion is the superior one, since it models the concept of a frame directly, allows one to explicitly encode the type of result (normal, strike and spare) and define meaningful operators such as SCORE( frame ), IS_STRIKE( frame ), etc.

To answer your questions, here is some code to deal with frames. Again, I've made my solution as complete and self-contained as possible. The only unexplained parts are the operators TO_CHAR(INT), PAD(CHAR, INT[, CHAR]) and CONCAT(LIST{CHAR}), whose definitions are self-evident, I hope.


-- The frame type (Apologies for any syntactic blunders here.
-- Tutorial D's type system is a weak point of mine, and I
-- haven't defined my own yet.)
TYPE FRAME
POSSREP NORMAL { SHOT1 (0..9), SHOT2 (0..9) } CONSTRAINT THE_SHOT1(NORMAL) + THE_SHOT2(NORMAL) < 10
POSSREP SPARE { SHOT1 (0..9) }
POSSREP STRIKE { };

OPERATOR TO_CHAR(F NORMAL) RETURNS CHAR BEGIN RETURN TO_CHAR(THE_SHOT1(F)) + TO_CHAR(THE_SHOT2(F)) END;
OPERATOR TO_CHAR(F SPARE ) RETURNS CHAR BEGIN RETURN TO_CHAR(THE_SHOT1(F)) + "/" END;
OPERATOR TO_CHAR(F STRIKE) RETURNS CHAR BEGIN RETURN "X" END;

-- The existing stuff
VAR BOWLER RELATION { BOWLER# INT, NAME CHAR } KEY { BOWLER# }
VAR GAME RELATION { BOWLER# INT, GAME# INT, SCORE INT } KEY { BOWLER#, GAME# }

FOREIGN KEY GAME { BOWLER# } REFERENCES BOWLER { BOWLER# }

-- The new relation
VAR FRAME RELATION { BOWLER# INT, GAME# INT, FRAME# INT, FRAME FRAME } KEY { BOWLER#, GAME#, FRAME# }

FOREIGN KEY FRAME { BOWLER#, GAME# } REFERENCES GAME { BOWLER#, GAME# }

-- Game scores for each bowler, with all frames for each bowler/game
-- combination grouped together.
VAR BGF VIEW (BOWLER JOIN GAME JOIN FRAME) GROUP { FRAME#, FRAME } AS FRAMES

-- The RM has no ordering concept, either for tuples or attributes.
-- However, if we want to present this in a display of some kind,
-- we can do the following. Note that the ORDER operator is still
-- permitted, but it doesn't return a relation. It returns a
-- sequence of tuples.
BGF { PAD(NAME, 20, '.') + PAD(TO_CHAR(GAME#), 5) + CONCAT(FRAMES ORDER FRAME# { PAD(TO_CHAR(FRAME), 3) }) }
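For readers who don't speak Tutorial D, here is a rough Python analogue of the FRAME possreps and the TO_CHAR operators above; the class and method names are mine and purely illustrative:

```python
from dataclasses import dataclass

# A rough analogue of the three POSSREPs: a frame is one of
# Normal(shot1, shot2), Spare(shot1) or Strike(), with the NORMAL
# constraint shot1 + shot2 < 10 checked at construction time.
@dataclass(frozen=True)
class Normal:
    shot1: int
    shot2: int
    def __post_init__(self):
        if not (0 <= self.shot1 <= 9 and 0 <= self.shot2 <= 9):
            raise ValueError("shots must be in 0..9")
        if self.shot1 + self.shot2 >= 10:
            raise ValueError("a normal frame totals less than 10")
    def to_char(self) -> str:
        return f"{self.shot1}{self.shot2}"

@dataclass(frozen=True)
class Spare:
    shot1: int
    def to_char(self) -> str:
        return f"{self.shot1}/"

@dataclass(frozen=True)
class Strike:
    def to_char(self) -> str:
        return "X"

print(Normal(3, 4).to_char())  # 34
print(Spare(7).to_char())      # 7/
print(Strike().to_char())      # X
```

As in the Tutorial D version, an invalid normal frame (say, 5 and 5, which should be a spare) is rejected at construction time rather than slipping into the data.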


I don't fully understand what you meant by three rows. Are you referring to the first shot, second shot and total score? Because by my recollection, the first and second shots are usually listed side by side. If, on the other hand, you mean to use the three rows as your data representation, then I have some major issues with it.

One final remark. Mathematical models are not social beings that have some obligation to be nice to and respect other models (something human models aren't generally good at either, I might add). They are either useful or they are not, and the only reason they should borrow from other models is because those models have ideas worth borrowing. The relational model was introduced to the world at a time when the database industry was in a disastrous state, with extremely complex, and practically inscrutable technologies ruling the day. Those ideas quite rightly fell by the wayside due to the clear advantages of the RM (even through its very poor expression in SQL).

 
At 3:56 AM, May 10, 2006 , Anonymous mc said...

Hi wol,

If it would take you seconds per vf, then please do so. I only presented nine cases, so it should only take you a couple of minutes. And when you're done, would you please outline how all of these queries could run efficiently on RAM-poor, I/O-bound hardware. You are not the first to argue that this is easy stuff in MV, and yet so far no one has presented a complete concrete implementation. Until I see some actual code, your arguments are all smoke and mirrors.

I hope that your reference to *stupid* queries doesn't imply any of my examples. As far as I am concerned they are all perfectly reasonable questions to ask, especially in the field of sports, where stats are a major component of the spectator's experience! Whatever the case, you really ought to provide some concrete examples of what a stupid query looks like.

Your comment about the third FILE breaking your design rule is strangely reminiscent of Dawn's comments about the horrible things that go on in small shops (and which she, likewise, prefers not to dwell on). The 'never mind' part really bothers me. Why should we not mind? Is it because your design rule isn't particularly important, and is therefore somewhat redundant? Or is it because, however important your design rule is, Pick is somehow forcing you to violate it? Of course, it could be that you are just trying to keep things simple for the sake of the discussion. If so, then I must cry, 'Foul!' since my examples are complete expressions, and the data model on which they operate is fully defined, even down to the attribute types and integrity constraints. In the interests of full disclosure, there is also a design violation in my examples. I fail to define subtypes of INT for each of the different identity types. This was an inadvertent omission, and can be remedied simply by introducing the necessary types (one line each) and using them at the appropriate spots in place of INT. This is, however, simply a good design principle, and is not required to make my examples work.

Finally, the last paragraph confuses me. What do you mean by "real world id"? What is the real-world id of a car sales record? Is it MODEL_NO? Is it EMPLOYEE_NO? Is it DEPARTMENT_NO? How about DEALERSHIP_NO? And, if it's the last one, what happens if most employees rotate between dealerships and you frequently want to aggregate by employee? And what if there is also a SALE_EXTRAS table that lists the added extras each car was ordered with? Is this table's "real world id" the SALE_NO, or PART_NO? Then, whichever one you choose, how would you group the employees by which extras they sold? In the RM, this last one would be something like (SALES JOIN SALE_EXTRAS) { PART_NO, EMPLOYEE_NO } GROUP { EMPLOYEE_NO } AS EMPLOYEES. Finally, how would these questions be answered efficiently on a manufacturer's global sales register, with a few million sales records, and many millions of extras records? My point is that the concept of "real world id" (if I've understood you correctly) is totally subjective; the only point of view that matters is that of the person asking the questions. Furthermore, even if a "natural" key can be identified, SQL (for all its evils) fully supports this notion through clustered primary keys, and it is essentially a redundant concept in the RM proper.
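For what it's worth, that last grouping has a straightforward SQL counterpart too. A minimal sqlite3 sketch (table and column names are mine, mirroring the example; since SQL lacks relation-valued attributes, I count distinct sellers per extra rather than nesting them):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (sale_no INTEGER PRIMARY KEY, employee_no INTEGER);
CREATE TABLE sale_extras (sale_no INTEGER, part_no TEXT);
INSERT INTO sales VALUES (1, 100), (2, 200), (3, 100);
INSERT INTO sale_extras VALUES (1, 'SUNROOF'), (2, 'SUNROOF'), (3, 'TOWBAR');
""")

# "Group the employees by which extras they sold": join sales to extras,
# then group per part number.
rows = con.execute("""
    SELECT x.part_no, COUNT(DISTINCT s.employee_no) AS sellers
    FROM sales s JOIN sale_extras x ON x.sale_no = s.sale_no
    GROUP BY x.part_no
    ORDER BY x.part_no
""").fetchall()
print(rows)  # [('SUNROOF', 2), ('TOWBAR', 1)]
```

Whether this runs efficiently on millions of rows is, of course, exactly the indexing question posed above, and nothing in the logical expression answers it.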

 
At 6:26 AM, May 11, 2006 , Blogger --dawn said...

wol & mc -- Thanks for your comments. I realize I'm not keeping up, but I'll comment on a few points. Wol - your estimates are precisely what I would normally hear from someone more accustomed to writing code than I. I'm adding in the buffer for the user and developer to recover from any initial missteps in getting the requirements exactly right, issues with testing and change management, etc. I know that in a crunch most virtual fields, even if there is 3GL code behind them, seem to get done with incredible speed. It is this agility in the environment that helped me warm up to MV when I first had employees who were using it and it had looked to be nothing like a dbms from my initial take.

I do, however, understand how it is elegant to have every query you would ever want to write be possible with a query language immediately after an initial schema is developed, which is what mc can do in his or her sandbox but is not the case with the MV query language. And, yes, mc, you are right that my queries are somewhat analogous to you writing these queries as if against views, though not precisely the same. I will also grant that some full examples would be in order and part of me would love to bump other matters off the priority list to focus on it, or I might appeal to others instead. We'll see. I just can't push it high enough right now. You can see how far behind I am in my blog writing and that uses skills I already have.

Now, about this comment from mc:
"A general observation I can make is that pretty much every criticism you've levelled at the RM is misdirected, apparently on the basis of the incorrect assumption that SQL is a faithful and reasonably complete expression of the relation model (It is neither, by a long shot)."

I think I have been consistent in NOT equating SQL with the RM. I do, however, consider SQL to be a product of the "RM movement" even though it is not a pure implementation. I have no reason to lob insults at pure theory not used for any purpose, only actual employment of such theory, however flawed it might be. SQL can be used to illustrate where the industry has headed in light of the RM approach. Please recognize that I am not opposed to modeling data as propositions, sets, or relations (or functions, digraphs, etc.), which brings me to this comment from mc:

"Mathematical models are not social beings that have some obligation to be nice to and respect other models (something human models aren't generally good at either, I might add). They are either useful or they are not, and the only reason they should borrow from other models is because those models have ideas worth borrowing."

There is no reason that a single model needs to borrow from any other. There is reason for DBMS's to provide the models (faces, that is, I'm not talking about what is under the covers) that users (s/w developers) want when working with their data, however. I could be misinterpreting it, but it seems to me that the RM comes with some exclusive clause where you must work with only its model of data. It must control all access. So, anything you cannot layer on top of a less-than-full-featured language that implements the RM would not be acceptable as a "view" into the data.

At dinner last night a colleague told me that he was at a company where they had garbage data stuck in an Oracle database and there was no means of fixing it due to the built-in RI (at the time, I don't know if things have changed to permit such fixes). I want a dbms that permits software developers to do what they need to do to have quality data and software.

On top of that, I want to minimize the overall cost of the solutions over time. If I had not experienced it, I would not believe that violating the RM provided this. Maybe there is a means of squaring up these practical concerns with the RM, by continuing to redefine the RM as has been done with 1NF, but if it gives up its exclusivity, it seems to me it ceases to be the RM. Once you can put XML documents in as attribute values without RI and constraint-handling on the values within those XML documents, for example, then I don't think it is the RM any longer. If the RM is redefined so that the MV model is one of my options when interfacing with the database, then it isn't the RM.

Caché now lets you handle your data from a MUMPS, SQL, PICK, or OO perspective, for example. It surely is not a relational database, right? So, maybe I should get clarification from you on what a DBMS must and must not have to be able to say that it is an implementation of the relational model. I don't need an MV perspective of the data to be exclusive. A DBMS, from my perspective, should have the features its users want, not just what some theory says they should be permitted to have. Again, what is under the covers is not where my concerns are -- that should obviously align with solid s/w development architecture.

I've wandered a bit too much in this response, but also see I've taken an hour in what I intended to be a quick response. Today's (much needed) sit-ups will be canceled. Cheers! --dawn

 
At 9:31 PM, May 13, 2006 , Anonymous mc said...

Thanks again Dawn, for taking considerable time to respond to my thoughts. And again, please don't mistake my request for a concrete solution as any kind of urgent demand. If you're busy, that's fine. Time is not of the essence here.

Of course, time in the sense of expended effort is another story. If the reason examples are not forthcoming is that it will take considerable time to elaborate your solution, then I worry that your case for the dev efficiency of Pick/MV is on shaky ground. If, by 'incredible speed' you mean half an hour for a vf, then we are still well over an order of magnitude apart, and if you claim better than that, I would have to skeptically inquire as to the checks and balances in your dev and maintenance processes. In my experience, deploying changes is all sugar and spice during dev, but can be mind-numbingly painful after going live. Having a simple admin UI that submits arbitrary queries (i.e., zero deployment) can be life-saving.

On the subject of RM vs SQL, I quite understand your view that it is of more value to criticise real technology than a theory. I completely agree, but this raises two very important issues:

1. You should be very careful with terminology. In particular, you should take care to call a spade a spade when levelling accusations. If you see something wrong with SQL, then by all means blame SQL (the real technology) not the RM (the theory)!

2. At a more basic level, it is a fallacy to blame the RM for the inadequacies of a language that violates many of its most important principles.

With regard to the RM's exclusivity I reiterate my position that it supersedes the earlier models, and has nothing to gain from them. To put it more precisely, I don't believe that the users of the RM have anything to gain by borrowing ideas from other models and using them in conjunction with the RM. With the RM I can deal with networks, graphs, hierarchies, flat data, structured data, and so on. Do you have a major beef with the examples I gave? Do you think I could have produced my examples more rapidly or more elegantly? What exactly am I lacking that suggests I need to look beyond the RM?

Along the same lines, the RM's exclusivity does not entail abandonment of other languages used to implement algorithms. More correctly, the RM doesn't even mandate a language! Nonetheless, assuming a language based on the RM exists: while it should be complete (in the Turing sense), there's no requirement to deny access to foreign language libraries. In fact, this will occasionally be necessary to implement custom operators.

I fail to grasp your colleague's problem. Dirty data means a broken database. Take it offline, drop all RI, fix everything up, and reintroduce RI.
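As a toy illustration of that repair cycle, here's a sketch using sqlite3, whose foreign-key enforcement can be switched off and back on (the schema is invented for the example; autocommit mode is needed because the PRAGMA is ignored inside an open transaction):

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# foreign_keys PRAGMA actually takes effect (it is a no-op mid-transaction).
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY)")
con.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, "
            "dept_id INTEGER REFERENCES dept(id))")
con.execute("INSERT INTO dept VALUES (1)")

# 'Drop all RI': disable enforcement, which is how garbage gets in
# (simulating the broken database the colleague inherited).
con.execute("PRAGMA foreign_keys = OFF")
con.execute("INSERT INTO emp VALUES (11, 99)")  # 99 is an orphan dept_id

# 'Fix everything up': repoint orphans at a valid parent (or delete them).
con.execute("UPDATE emp SET dept_id = 1 "
            "WHERE dept_id NOT IN (SELECT id FROM dept)")

# 'Reintroduce RI': re-enable enforcement and verify nothing is broken.
con.execute("PRAGMA foreign_keys = ON")
violations = con.execute("PRAGMA foreign_key_check").fetchall()
print(violations)  # [] -- no remaining violations
```

Every serious DBMS offers some equivalent (Oracle, for instance, lets you disable and re-enable constraints), so "the RI made the data unfixable" is hard to credit.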

Now, at this point I have to really object. You've again blandly stated that the RM abandoned 1NF, while completely ignoring my claim that it hasn't. Before you repeat this assertion, I would ask that you please address my explanation of why 1NF is still very much a part of the RM.

The RM does support XML documents (attributes can be of any type, as I've already explained in the same exposition that covered 1NF). Furthermore, this fact is very well documented. It also supports any operators you want on those types, and any integrity constraints you want to define in terms of those operators. What baffles me is your suggestion that by doing all this the RM somehow ceases to be the RM. I guess your tendency to conflate it with SQL would explain this.

If you want an explanation of what a database must do and be in order to call itself relational, there is ample guidance in the literature. The five key points are:

1. An open-ended collection of scalar types (BOOLEAN, which at least two major SQL products fail to properly support, is the only required type).
2. A relation type generator.
3. Facilities for defining relation variables.
4. The facility to assign relations to relation variables.
5. An open-ended set of generic relational operators that accept relation values and return relation values.

It goes without saying that the key terms must be taken to have certain precisely defined meanings. For instance, a relation is a header consisting of a set of ordered pairs, [Name, Type], called attributes, and a body comprising a set of tuples, which are sets of triplets, [Name, Type, Value], in one-to-one correspondence with the attributes of that relation's header. You can't have duplicate tuples in a relation, for instance, or enforce an ordering, or have unnamed columns, etc. In these, and many other areas, SQL violates the model with disastrous consequences. Which leads me to reinforce my point: blaming the RM for SQL is a gross misrepresentation.

To make this point more concrete, let me take a fairly simple real-world example. A car dealership wants to find out which salespeople are underperforming. So their resident database guy whips up a quick 'n' dirty SELECT:

SELECT EMP_ID, SUM(SALE_AMT) AS TOTAL_SALES
FROM SALES
WHERE :d1 <= SALE_DATE AND SALE_DATE <= :d2
GROUP BY EMP_ID
HAVING TOTAL_SALES < :target


Actually, he gets it wrong, but the UI quickly lets him see his error. You can't refer to the new name, TOTAL_SALES, in the HAVING clause (What the...?). SQL's scoping rules are apparently so bizarre that only the most expert practitioners can venture to explain this idiocy. Our resident db guy is a practical sort though. So he doesn't reflect too greatly on this, and simply makes the fix. All's good.

SELECT EMP_ID, SUM(SALE_AMT) AS TOTAL_SALES
FROM SALES
WHERE :d1 <= SALE_DATE AND SALE_DATE <= :d2
GROUP BY EMP_ID
HAVING SUM(SALE_AMT) < :target


Actually, no it isn't. It turns out that one of the employees is a lazy sort who almost never turns up to work, and seems more interested in surfing the web when he does. Management learn of this individual independently, but then want to know why he hasn't been coming up on the underperformers report. They ask the db guy to look into this, and he realises that he's made a subtle mistake. He is only reporting sales for those employees who actually made sales, and the lazy guy hasn't sold a single car for several months! So he fixes the problem by doing a right outer join on employees:

SELECT E.EMP_ID, SUM(SALE_AMT) AS TOTAL_SALES
FROM SALES S
RIGHT JOIN EMPLOYEES E ON E.EMP_ID = S.EMP_ID
WHERE :d1 <= SALE_DATE AND SALE_DATE <= :d2
GROUP BY E.EMP_ID
HAVING SUM(SALE_AMT) < :target


(You might be tempted to ask why a right outer join, instead of left. I frankly don't know the difference, and can't see why the right join is even there. But, since it's available, why shouldn't I use it?) Being once bitten, he confirms that the lazy guy is now being reported on. But he isn't! By this time, the db guy is into nail-chewing, and rings a friend, who is a db guru. A quick glance at the query and he spots the problem. SUM(SALE_AMT) will return NULL if the only input is a null field, and so the comparison with :target will fail. To fix it, our db guy has to anticipate the null and code for it. Unfortunately, the database has to work on SQL Server and Oracle, which use different functions for automatically falling back to a value when a null is encountered (ISNULL() for SQL Server and NVL() for Oracle). To avoid writing a compatibility function and having to deploy it into the heavily guarded production environment, the db guy just does it the hard way:

SELECT E.EMP_ID, CASE WHEN SUM(SALE_AMT) IS NULL THEN 0 ELSE SUM(SALE_AMT) END AS TOTAL_SALES
FROM SALES S
RIGHT JOIN EMPLOYEES E ON E.EMP_ID = S.EMP_ID
WHERE :d1 <= SALE_DATE AND SALE_DATE <= :d2
GROUP BY E.EMP_ID
HAVING SUM(SALE_AMT) IS NULL OR SUM(SALE_AMT) < :target


What a mess! Now, how would we have gone about solving this seemingly simple problem in a truly RM language like Tutorial D? Here it is:

(SUMMARIZE SALES WHERE :d1 <= SALE_DATE AND SALE_DATE <= :d2
PER EMPLOYEES { EMP_ID } ADD SUM( SALE_AMT ) AS TOTAL_SALES) WHERE TOTAL_SALES < :target


In my experimental language it's even easier:

EMPLOYEES { EMP_ID, SUM( SALE_AMT ) OVER SALES[:d1 <= SALE_DATE <= :d2] AS TOTAL_SALES }

I hasten to remind you that this is purely a syntactic enhancement that is semantically identical to the Tutorial D expression.
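Incidentally, the NULL-from-SUM trap in the story above is easy to reproduce. A minimal sqlite3 sketch (the schema is invented to mirror the example; COALESCE is the standard-SQL spelling of the ISNULL/NVL fallback, though the narrative's db guy couldn't count on his products supporting it):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY);
CREATE TABLE sales (emp_id INTEGER, sale_amt REAL);
INSERT INTO employees VALUES (1), (2);   -- employee 2 is the lazy guy
INSERT INTO sales VALUES (1, 5000);
""")

# The outer join keeps the lazy employee, but SUM over his all-NULL
# group yields NULL (seen as None), so a '< :target' test on it fails.
rows = con.execute("""
    SELECT e.emp_id, SUM(s.sale_amt) AS total_sales
    FROM employees e LEFT JOIN sales s ON s.emp_id = e.emp_id
    GROUP BY e.emp_id ORDER BY e.emp_id
""").fetchall()
print(rows)  # [(1, 5000.0), (2, None)]

# COALESCE gives the fallback without vendor-specific functions.
rows = con.execute("""
    SELECT e.emp_id, COALESCE(SUM(s.sale_amt), 0) AS total_sales
    FROM employees e LEFT JOIN sales s ON s.emp_id = e.emp_id
    GROUP BY e.emp_id ORDER BY e.emp_id
""").fetchall()
print(rows)  # [(1, 5000.0), (2, 0)]
```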

So please let's not have a repetition of the fallacy that deficiencies in SQL are the RM's fault. SQL ignored the lessons that the RM taught, to the detriment of the language itself, with disastrous consequences for the entire database industry.

I'd like to make one final point that might offer a different perspective on the matter. While I've never dealt with Pick specifically, I have come across quite a number of attempts to create a new data model. Invariably these models are variations on the old network and hierarchical systems. The whole XML/XSL/XPath/XQuery/XSchema/... mess is a rehash of the hierarchical model, and I've seen in-house systems where collections of objects point to each other in highly connected cyclic graphs. Invariably these initiatives I've encountered have failed to realise that the RM beat these other models at their own game (even through SQL, which filtered out much of its power). And thus I have invariably seen these misguided attempts founder on all the same reefs that the old models did. The all but inevitable consequences have included long-running disputes over the correct way to structure entity relationships, endless debates about ownership and lifespan management, constant bolt-ons to the 'simple' models (such as an ever-expanding set of pointer flavours) as one limitation after another is discovered and worked around, and rampant denormalization of data to redress massive performance problems. I have always found that these approaches are built on extremely shaky premises, such as some cute code sample that works really nicely with model X, or the pseudo-intellectual invocation of terms like "graph theoretic basis". The reality that always seems to come out in the wash is that graphs and hierarchies are much more subtle, complex and deceptive than a superficial analysis suggests. Ironically, it was this realisation that led E. F. Codd to the relational model thirty-seven years ago.

In conclusion, I find it profoundly revealing to compare two diametrically opposed approaches to managing data. On the RM side, the latest exposition of The Third Manifesto needs about eighty pages to fully define and describe the relational model, an extensible type system and a relatively complete programming language that includes all the conventional relational operators. Contrast this with the XML family of standards for data modelling, which clock in at over 1000 pages, almost half of which are devoted to XSchema alone!

 
At 10:22 PM, May 13, 2006 , Anonymous mc said...

Oops. I've just noticed a typo. My version of that last query was missing the final restriction. Here's the corrected version.

EMPLOYEES { EMP_ID, SUM( SALE_AMT ) OVER SALES[:d1 <= SALE_DATE <= :d2] AS TOTAL_SALES }[TOTAL_SALES < :target]

 
At 5:30 AM, July 29, 2006 , Blogger Tony Andrews said...

MC,

Your SQL example is great, but still needs a tweak to make it return any data for the lazy guy:

SELECT E.EMP_ID, CASE WHEN SUM(SALE_AMT) IS NULL THEN 0 ELSE SUM(SALE_AMT) END AS TOTAL_SALES
FROM SALES S
RIGHT JOIN EMPLOYEES E ON E.EMP_ID = S.EMP_ID
-- SALE_DATE will be NULL for employees who made no sales
WHERE ((:d1 <= SALE_DATE AND SALE_DATE <= :d2) OR SALE_DATE IS NULL)
GROUP BY E.EMP_ID
HAVING SUM(SALE_AMT) IS NULL OR SUM(SALE_AMT) < :target

Good luck with persuading Dawn that SQL != RM! From what I have experienced on comp.databases.theory it is like trying to herd cats ;-)

- Tony Andrews

 
At 2:10 AM, September 28, 2006 , Anonymous mc said...

Thanks for that great catch Tony! I guess it just underscores the fact that SQL is not a pretty sight.

Moreover I doubt, given the time that has elapsed, that I will ever see a Pick version of the examples I gave.

 
At 7:16 AM, September 28, 2006 , Blogger --dawn said...

Sorry, MC. It is on my list to do something other than research and delegate (with a little writing tossed in there) when it comes to software development these days. I keep getting pulled away into "real work" (the kind that puts food on the table). I wish I could prioritize the assignment you gave me to the top of my list, then sit down and do it, but I just can't squeeze it in. Someone who is already a Pick developer might be able to do it with their eyes closed in no time, but I'm not there, and getting myself to that point cannot be a priority right now; perhaps in a few months I'll be able to give it another look.

Have a good day. --dawn

 
