Tincat Group, Inc. -- tincat musings blog

<<software development>>

mewsings, a blog

--dawn

Thursday, March 23, 2006

Better to Have No Values

The topic at hand is NULLS. Some professionals think the best option for recording missing data is to use a NULL to mean no value, an approach implemented by many database management systems. SQL then recognizes missing data when it reads a marker referred to as a NULL. Working with missing data requires both of these components: how a person specifies the missing information to the database and how the database languages work with the missing information.

With the SQL approach where NULL does not refer to a value, but to the lack of a value, one attribute tagged as NULL is not equal to another attribute tagged as NULL. When you compare two values, x and y, if either of these is NULL, then your comparison is neither true nor false, but a third logical value of I dunno or UNKNOWN.

In other words, SQL employs a three-valued logic (3VL). The reasoning behind this makes sense. Before we turn again to bowling scores, I'll use a very simple example. If my natural hair color is recorded as missing information and Sylvia's natural hair color is also missing, then surely we do not have enough information to say either that my natural hair color matches Sylvia's, nor that it doesn't. It's a mystery. So, if we evaluate the expression Sylvia has the same natural hair color as Dawn we don't know if it is true or false. We need a third logical value, sometimes represented as unk or UNK for unknown.

A lot of software is built with this strategy for missing values, but another option is the Pick, aka MV, approach where one might say that a variable has no values instead of no value. Pick employs a two-valued logic (2VL), as do most programming languages.

Step one in understanding this 2VL is to suspend the idea that NULL is the lack of a value or an indicator for no value. In MV, a NULL is a value. Think of it as indicating that there are no values for a particular variable. One way to provide a mathematical model for this approach is to perceive the NULL as an empty set, aka null set.

This empty set is implemented in MV as an empty list. MultiValue solutions permit the value of a variable or the intersection of a row and column (loosly speaking), to be a list of values as indicated in my last mewsing. The list could have 0, 1, or Many elements. If the cardinality is 0, the value is NULL. If the cardinality is 1, the attribute has a single value. If the cardinality is greater than 1, it is more obviously a list. But no matter what the cardinality, the value of an attribute can be modeled as a list, whether empty, single-valued, or multi-valued.

If there are no values in the list, set, or bag, the value of the MV attribute for a particular record (similar to a row) would be NULL. If a NULL is encountered by a user when looking at the data, the meaning of that value is no values. So, if Dani has a record of data and the value of her variable scores is NULL, then the corresponding proposition would be: Dani has no values for her score or Dani has no bowling scores. It would not be: Dani's score has no value. Got it?

Just true or false, that's it, two values and the obvious ones at that.

Because each attribute in Pick does have a value, even if that value is no values, logical expressions are easily evaluated to either true or false. That's it. Just true or false, that's it, two values and the obvious ones at that. If there is ever a test for equality, a list of no values is equal to another list of no values. For less than or greater than comparisons, a NULL is less than everything other than another NULL no matter what type of data is compared.

Some might be resistant to this broad use of a NULL value as it gives us no insight into why there is no value for an attribute. If there is a reason to collect information on why a variable has no value, additional attributes could be included in a logical data model. In our scenario, we don't know if Dani bowled the games and the scores have not yet been recorded, if Dani bowled and the score was recorded incorrectly, if Dani has not yet bowled, if Dani was sick with bird flu and missed league bowling this week, or if Dani doesn't ever bowl. In the first two cases, our data as recorded is incorrect. So we could interpret the NULL value for scores as Either Dani has no bowling scores or our data is incorrect but that is similar to the interpretation of any value. We could say that either this bowler's name is Dani or our data is incorrect. So that is not particularly helpful. The interpretation of this NULL value is that Dani has no bowling scores. That's it. If there is a NULL value for an attribute, it means there are no values for that attribute.

Instead of having no value, we have a value that is a list with no values.

Our resulting two-valued logic comparisons are really quite simple, useful, and meaningful. If we want to list all people with their series total, ordered with the highest totals first, we would get those people who have NULL values listed last. If we ask who has the same score for the second game as Dani, then Shirl would show up in the response. I'm certain this rubs some SQL users wrong. Instead of Dani and Shirl having no score, which would mean the question is meaningless, Dani and Shirl have no scores or an empty list of scores for the second and third games in the series.

This logic can be applied to attributes that have list values as well as those considered to be single-valued since these can also be modeled as lists, even if short ones. You can likely see how there could still be some misunderstandings in interpreting NULLs of this ilk, but it really is an amazingly simple and useful approach to what is modeled as the lack of values in SQL. Instead of having no value, we have a value that is a list with no values.