mewsings, a blog
Wednesday, March 29, 2006
Better to Have No Values
The topic at hand is NULLS. Some professionals think the best option for recording missing data is to use a NULL to mean no value, an approach implemented by many database management systems. SQL then recognizes missing data when it reads a marker referred to as a NULL. Working with missing data requires both of these components: how one specifies the missing information to the database and how the database languages work with the missing information.
With the SQL approach where NULL does not refer to a value, but to the lack of a value, one attribute tagged as NULL is not equal to another attribute tagged as NULL. When you compare two values, x and y, if either of these is NULL, then your comparison is neither true nor false, but a third logical value of I dunno or UNKNOWN.
In other words, SQL employs a three-valued logic (3VL). There is some good reasoning behind this. Before we turn again to bowling scores, I'll use a very simple example. If my natural hair color is recorded as missing information and Sylvia's natural hair color is also missing, then surely we do not have enough information to say either that my natural hair color matches Sylvia's, nor that it doesn't. It's a mystery. So, if we evaluate the expression Sylvia has the same natural hair color as Dawn we don't know if it is true or false. We need a third logical value, sometimes represented as unk or UNK for unknown.
A lot of software is built with this strategy for missing values, but there are other options. The Pick, aka MV, query language employs a two-valued logic (2VL), as do most programming languages. Much has been written about NULLs and n-valued logics, particularly within the context of the relational data model. The RM does not require the 3VL of SQL, as author Chris Date has made clear. I will not revisit those discussions here, contributing only this glimpse into Pick's approach.
Step one in understanding this 2VL is to suspend the idea that NULL is the lack of a value or an indicator for no value. In MV, a NULL is a value. There are exceptions as various MV products address their interface to SQL differently, but I will simplify and focus on what is common to all such products. Think of a NULL as indicating that there are no values for a particular variable. One way to provide a mathematical model for this approach is to perceive the NULL as an empty set, aka null set.
This empty set is implemented in MV as an empty list. MultiValue solutions permit the value of a variable or the intersection of a row and column (loosly speaking), to be a list of values as indicated in my last mewsing. The list could have 0, 1, or Many elements. If the cardinality is 0, the value is NULL. If the cardinality is 1, the attribute has a single value. If the cardinality is greater than 1, it is more obviously a list. But no matter what the cardinality, the value of an attribute can be modeled as a list, whether empty, single-valued, or multi-valued.
If there are no values in the list, set, or bag, the value of the MV attribute for a particular record would be NULL. If a NULL is encountered by a user when looking at the data, the meaning of that value is no values. So, if Dani has a record of data and the value of her variable scores is NULL, then the corresponding proposition would be: Dani has no values for her score or Dani has no bowling scores. It would not be: Dani's score has no value. Got it?
Just true or false, that's it, two values and the obvious ones at that.
Because each attribute in Pick does have a value, even if that value is no values, logical expressions are easily evaluated. Just true or false, that's it, two values and the obvious ones at that. If there is ever a test for equality, a list of no values is equal to another list of no values. For less than or greater than comparisons, a NULL is less than everything other than another NULL no matter what type of data is compared.
Some might be resistant to this broad use of a NULL value as it gives us no insight into why there are no values for an attribute. If there is a reason to collect information on why an attribute has no values, additional attributes could be included in a logical data model.
In our scenario, we don't know if Dani bowled the games and the scores have not yet been recorded, if Dani bowled and the score was recorded incorrectly, if Dani has not yet bowled, if Dani was sick with bird flu and missed league bowling this week, or if Dani doesn't ever bowl. In the first two cases, our data as recorded is incorrect. So we could interpret the NULL value for scores as Either Dani has no bowling scores or our data is incorrect but that is similar to the interpretation of any value. We could say that either this bowler's name is Dani or our data is incorrect. So that is not particularly helpful. The interpretation of this NULL value, then, is that Dani has no bowling scores. That's it. If there is a NULL value for an attribute, it means there are no values for that attribute.
Our resulting two-valued logic comparisons are quite simple, useful, and meaningful. If we want to list all people with their series total, ordered lowest to highest, we would get those people who have NULL values listed first.
LIST BOWLERS SCORE BY SCORE
If we want only those with values, we could request only those WITH SCORE:
LIST BOWLERS WITH SCORE SCORE BY SCORE
Instead of having no value, we have a value that is a list with no values.
The first SCORE is testing for any values for the variable SCORE, the second shows the value (the ID for the bowler is listed automatically in the output), and the last SCORE is for the ordering. Using this approach, Dani and Shirl have the same value for their second game score. I'm certain this rubs some SQL users wrong. Instead of Dani and Shirl having no score, which would mean the question is meaningless, Dani and Shirl have no scores or an empty list of scores for the second and third games in the series.
This logic can be applied to attributes that have list values as well as those considered to be single-valued since these can also be modeled as lists, even if short ones. You can likely see how there could still be some misunderstandings in interpreting NULLs of this ilk, but it really is an amazingly simple and useful approach to what is modeled as the lack of values in SQL. Instead of having no value, we have a value that is a list with no values.