mewsings, a blog


← Previous   Next →

Saturday, March 25, 2006

With Data Modeling, What's Your Bag?


I've been avoiding some definitions for a while now, but when writing about NULLs and two-valued logic, I digressed into explanations of a few terms. Since I'm writing on the road for a few weeks, I thought I'd split out some terminology into this blog and follow with the NULLs blog next. I'll toss in an ending on this one to tie in the title, but otherwise this mewsing is a glossary.

Bowling Series Scorecard

What I know about bowling could fit in a blog entry, but I just adore the scoring system, so I'll use that for the examples, both for these descriptions and for the NULL handling to follow.

The scores in the above bowling series that correspond to the three games bowled by one person; such as Chris's 120, 183, 144; compose a list. Each such list of game scores could be the value of a variable named scores. This conceptual list could be implemented with a computer language using an array like score[3], for example.

The length of a list is the number of elements in it. Computer languages vary as to what list implementations permit variable lengths. Other variations in the implementation of lists include whether list members can be accessed directly with an index or require a sequential read. Providing different features for working with a list could prompt one to work with it as a queue or a stack, for example. Lists are a wildly popular structure in any general purpose programming language. They are missing from any implementation of the Relational Model (RM), however.

If we remove the ordering from our lists, we get a set of bowling scores for Pat and Chris that includes all three scores. However, the set of scores for Beth is {150, 160} because each element of a set must be distinct from the others. The set of Pat's scores could be written as {110, 85, 130} or as {110, 130, 85} because the set has no implied ordering. In the case of a bowling series, it would be a bad idea to treat these scores as a set in this way, given that we would then lose the information about how many games had a particular score, as illustrated in the case of Beth. A list can be transformed into a set, however, by adding an ordinal attribute to identify the first and nth value. We could, therefore, talk about Chris's set of bowling scores in this series as {(1, 120), (2, 183), (3, 144)).

A more obvious set might be our bowlers = {Chris, Pat, Dani, Shirl, Beth}. Of course, when specifying our set, we would include a unique identifier for the elements, often referred to as a key, so that our set might look like bowlers = {(11235, Chris), (628628, Pat), (11111, Dani), (223344, Shirl), (98765, Beth)}. We saw above that we can turn a list into a set. We can turn our set into a list by placing it in some order, such as an alphabethic order of Beth, Chris, Dani, Pat, Shirl.

Bowling Bag
Also known as a multiset, a bag is like a set in that it has no ordering and like a list in that it includes duplicate values. The bowling bag for Beth is [160, 150, 150] which is equal to [150, 160, 150] and [150, 150, 160]. It is as if you tossed the values into a bag, pulling them out in any order. We could turn this bag into a set by adding a quantity to each value such as Dani having a set {(150, 2), (160, 1)}. We could turn it into a list by adding order to it, such as numerical order with Dani's list being 150, 150, 160.

We talk about SQL result sets. However, because a result set may have duplicate rows, these would more accurately be termed result bags (or multisets, but that term is not as fun so I use it only like I use table tennis instead of ping pong).

Given my approach to the relational theory of data, I prefer to provide a clean, clear, crisp definition of a relation from mathematics, leaving it to others to embellish it as befits their application of this mathematical concept.

A relation is a subset of the set of ordered tuples (A1, A2, ... Am) formed by the Cartesian cross-product of sets S1 x ... x Sm where each An is an element of Sn.

Note that a relation is a set. Bags are not sets. Therefore, SQL result sets are not relations.

The prior example of a set with couples consisting of bowler id and bowler first name is a relation.

Given a relation R, a domain is a set Sn such that for each tuple (A1, A2, ...An, ...Am) in R, An is an element of Sn.

The example relation has a domain of bowler ids and another domain of bowler first names.

A binary mathematical relation with at most one b for each a in (a,b). Note that either or both a or b could be a relation, for example.
A type is a domain, so it is a set, plus related functions. Some folks define domain and type to be identical, typically tossing operations into the definition of each. You might note that this is either similar to or the same as the term class, depending on your definition of that term.
As with many of the terms above, this term may be applied to either a variable or a value. A scalar variable can hold only one value at a time. A scalar value is a single value. It is the opposite of composite. List, set, bag, relation, and function are examples of composite types. Common scalar types defined in computer languages are char, int, float, double, and boolean.
Bowling Game Scorecard

Take a look at this scorecard for a single game of bowling. There are many ways to pour the scalar values into composite types when modeling these data for use in a software application. If you were not tied to everything being a relation, what would be your first take on how to model frames in a game of bowling? That strikes me as a list. Like one of those how many triangles do you see puzzles, how many bags can you find? Sets? Lists?

To set us up for the NULLs discussion, I'll provide a little information on the MultiValue (MV) model. Conceptual sets are implemented in MV as either sets or lists. If modeling a strong entity, they are typically modeled as sets. When modeling a property of a strong entity, they are often modeled as lists. Bags and lists are always modeled as lists unless they are first changed into sets by adding attributes. Sets are top level data structures, while lists are always child structures.

The translation from the conceptual data model to the MV logical data model (see this mewsing if confused by these terms) includes modeling a conceptual set as a list. If an implemented list is a set conceptually, the developer must provide the logic to test if a value is already present before adding it. If it is a semantic bag or set, the developer needs to ignore the ordering. This is a bit more difficult because the query language only checks equality of lists, not bags or sets. A test for equality of [150, 160, 150] to [150, 150, 160] would evaluate to false. A developer might choose some order for this bag, turning it into a list, to avoid this complexity.

While the full collection of information about whether a value is a conceptual list, set, or bag is not available for implemented lists in an MV system, you can use the MV array (list) data type to implement all three concepts. If you try your hand at modeling data this way, you might find that it is often conceptually simpler than modeling exclusively with sets. I, for one, like modeling with data sets that have nested lists. At the risk that you might be too young to be hip to this slang, I'll ask: when it comes to modeling data, what's your bag?

← Previous   Next →


Post a Comment

<< Home