mewsings, a blog

--dawn


Tuesday, January 31, 2006

The Data Movement

Data. Movement. Toss in some musings about differences between the sexes and that will do for today.

dawn's high school photo

<CYA>
Gender can be a divisive subject, and I want to be clear that I have no expertise in the field of gender studies, nature vs. nurture, or how brains differ. One of my readers from down under tells me that feminism is a sensitive subject there due to a lack of full-time jobs for men over forty right now. Apparently there has been some reverse discrimination. I do not think I have been discriminated against in my career at all. Unlike some of you, I have never even been in an all-male meeting. But I have taken note of statistics indicating a downward trend of women in computing and have seen a lack of female computer science majors in at least one college in the USA. I decided to weave this topic into my mewsings today, giving possibly a new angle to the otherwise tired data vs. process topic.
</CYA>

It's time to put a definition of data on the table, data as in data modeling and database, and even the good old term data processing.

Data: encoded propositions, a combination of form and meaning; accurate data are facts.

Accurate data are facts. Ho Hum.

Ho hum. Here's a fact: I'm in my forties. If you capture that fact as data today and then present it thirty years from now, it will be as accurate as if I were to attach a high school picture to this blog today. Alas, data changes. It can also be disseminated. Data with movement—now that's much more interesting to me.

I started in data processing with a summer job during college in 1977. I had pounded the pavement for a job. If I had not gotten this job at the last minute, I would have started as a waitress the following Monday. I had memorized the menu already, but was petrified, knowing I was not really waitress material. But I also knew I didn't want to return to being a nurse's aide in a nursing home ('73-'75) or a maid in a Holiday Inn (Maid of the Month, July 1976).

My qualifications for this job programming COBOL on a Pr1me 300 were that I had taken the equivalent of one semester course covering COBOL, BASIC, and Fortran (two half courses, to be precise). The person hiring me said the other qualification I had was that I was majoring in mathematics and to him that meant that I was smart.

While it is impossible to know what would have happened under different circumstances, I am as sure as I can be that I would not major in computer science were I to enter college today. Often a computer science major is required now for those entering software development professions. I can think of no reason why I would have chosen to major in a machine. I have even avoided joining ACM until last year when I wanted to download more papers than would make sense on a pay-per basis. I didn't like Machinery in the name. I have no interest in machinery, much less an association for such machinery.

What is the percentage of women in the car industry compared to the travel industry? What is the percentage of women in computer science compared to those who were once in that now-called-something-else profession of data processing? Some think the decline is a failure of the women's movement or, perhaps, of women. But I see it as a success with the women's movement in that girls know they have options. It is a failure of our discipline to appeal to these girls as it once appealed to the girl I was when it captured me.

  • The percentage of women receiving bachelor-level degrees in computer or information sciences has declined from a peak of 35.8 percent in 1984 to 26 percent today.
  • Among the science and engineering workforce, computer science is the only area where women's participation has declined since 1993. (umbc.edu/cwit/computer_mania.html)

I like movement and connections. To me, data processing is like travel, like movement. Databases are more like computers and cars. I'm not interested in bases of data. I'm interested in uses of data, in data movement. I went on for a master's degree in pure mathematics, not applied mathematics or science. So there is little appealing to me about a discipline called Computer Science. I'm drawn to it about as much as to the term Database Management System.

That is the great distinction between the sexes. Men see objects, women see the relationships between objects. (John Fowles)

Of course this is a generalization. I do like language and data. But I would say that I like the relationship of language and data within systems that include people. I like connections, change, impact, and movement. Am I really about to blame some part of the decrease in women in computing on the change from a focus on data processing and terms implying movement to a focus on computer science and nouns of data, DBMS, tables, domains, constraints, and objects? Yup. (Do you like how I also just tossed the OO folks into the same bucket as the RM folks?)

Thinking in terms of static data, without concurrently addressing changes to the shape, content, and distribution of that data, the use of and interactions with the data, the movement of data, is simply not compelling. These encoded propositions exist in fluid, changing systems, not as data in a vacuum.


If you take a girl today similar to the girl pictured here, with similar interests and aptitudes, my hypothesis is that she will not end up in computer science. That's what happens when the women's movement meets a discipline defined in terms that suggest no movement. Girls rarely choose to focus on objects or data. Let's tap into the data movement and put the processing back with the data, please.


16 Comments:

At 12:34 AM, February 01, 2006 , Anonymous Anonymous said...

You missed the opportunity to link girls to GIRLS - such an obvious association, I was waiting, and waiting ... and then I hit the end

Ross Ferris
Stamina Software
Visage>Better by Design!

 
At 5:23 AM, February 01, 2006 , Anonymous Wol said...

You're trying to define "data". And (in a response to the previous blog?) you said that you see "data" and "information" as the same thing. I'd disagree! Knowledge (hopefully) gives you Wisdom. Data (hopefully) gives you information.

And again, I probably should have posted this to a previous blog, but this I think is where the RM (and especially First Normal Form) falls down - it hides the information by emphasising the data. And imho falls badly afoul of Occam's Razor (the Einstein version) of "Make it as simple as possible, BUT NO SIMPLER".

In Science we have something called "emergent complexity". A cell is far more than the sum of its proteins. Its proteins are far more than the sum of their functional groups. And those functional groups are far more than the sum of their atoms. But in order to put data into a typical real-world database, we have to decompose our knowledge of the real world into the atoms of data (I like that Codd actually used the word "atom" :-), thereby losing large chunks of *information* in the process.

As you know, Brian is leading an attempt to define "good practice" for database design in the MV world. I think this is where MV will shine over RM precisely because it recognises this "emergent complexity" thing. Any MV FILE (table for the RM guys) should have at least one (supposedly) unique non-composite attribute that can be used as a primary key. You don't need to use it (and indeed, with a person's name or even SSN you in practice shouldn't use it because duplicates can occur), but it should be there.

In other words, MV is blocking data in INFORMATION-sized chunks, not atom-sized chunks. (The RM equivalent would be a "view of an object".) Within the FILE, I think the data should be normalised, so we're not straying too far from the RM here :-)

This approach, looking at the INFORMATION stored in the database and not the data, imho is one of the reasons why MV databases are inherently so much more productive and easier to maintain than any RM database currently out there.

While we may not see it, every row in an RM table needs a primary key to uniquely identify it. As soon as that key has no mapping to a simple real world attribute (ie it needs to include a random attribute with no real-world relevance), the model becomes guilty of pushing complexity into the level above the model (in this case data analysis). And imho, this is the point at which the reduced complexity of the model is more than offset by the increased complexity of the level above. imho we have just fallen foul of the Einstein Corollary - the combined complexity of the data model and the data analysis is no longer at a minimum.

Cheers,
Wol

 
At 8:36 AM, February 01, 2006 , Blogger --dawn said...

First to Ross - Wish I had thought of that. Even if it would have been obscure for many readers it would have been worth it to make the Pick professionals smile. Thanks for getting it into the comments. I didn't bring PICK and its original language specification of GIRLS into this entry at all. I do have a GIRLS paper I'm planning to put out on the site soon and a play on words is surely in order.

And Wol, you make good points. I don't have any reason to model meaningless data, however, only information. Another way I could have defined the terms is to say that I'm not working with data, only information, relegating data to a lower level. But related to computers, I don't have a reason to consider data as some jumble of bits, and information is necessarily structured. So I cannot come up with any useful distinction between the words related to the computer where I might end up interested in both. Another possible way to view them differently is to suggest that information is what is transmitted when data meets humans. But again, that adds nothing to the mix as I'm talking about working with data and modeling it, so there is a human in the mix.

But I do agree that you can model data in a way that obscures its meaning and the RM is more prone to doing that than other data models. Your capitalization made me think of Prime INFORMATION, manuals for which are in my office (and an old flavor of PICK, as Wol knows). I can think in terms of turning data into information, but there is no clean line there, so you can think of data or information as being more or less obscure or evident with its presentation.

One other point is that you used the term "normalization" (spelled properly from your side of the ocean) in some way other than as Codd defined it since multivalues were disallowed. I suspect you are using it to indicate that even if you are permitting list values for your attributes, you still care about functional dependencies. We need a different word for that than normalization, I think.

I do appreciate your appeals to science and modeling reality, as always. Cheers! --dawn

 
At 8:50 AM, February 01, 2006 , Anonymous Wol said...

Hi Dawn,

(As you may or may not know, I cut my MV teeth with INFORMATION, v5 :-)

But you said I'm using the word "normalisation" differently to Codd, because I allow list attributes. I thought it simply meant "conversion to normal form", and only in the specific case of first normal form does it ban list attributes...

And you said "you don't have any reason to model meaningless data". Another strike against the RM :-) as the whole point of what I was saying was that the more you decompose data towards its constituent atoms, the more you have to add "meaningless" data in order to retain control of the data!

Cheers,
Wol

 
At 9:13 AM, February 01, 2006 , Blogger --dawn said...

Hi Wol -
If you check http://www.tincat-group.com/mewsings/2006/01/is-codd-dead.html you will see how Codd used the term. All subsequent normal forms required first "normalization" or what was later termed First Normal Form. So, if you said you like BCNF (Boyce-Codd Normal Form), you have to leave out the part that says that first you must normalize the data (put it in 1NF). Every NF requires 1NF in its definition today. You and I both know that we can consider functional dependencies without "normalizing" the data.

The RM folks will typically agree that data are propositions even as they split them into fragments. I can see your point of data being added, but I suspect RM folks would say it is not meaningless. If you take a list of e-mail addresses as a value and put it into a relational model, you need a new table and an ordering attribute as well as a foreign key. Then you need a constraint to tie the child to the parent. In MV, you just change the metadata on the description of an attribute to an M from an S and then you can query it as a multivalued attribute. (There is no UPDATE verb in MV, however. You need to use a general purpose programming language.)
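A rough sketch of the two shapes in Python may help. It is neither SQL nor an MV dictionary definition, and every name and address in it is made up for illustration. The first shape splits the list into child rows carrying a foreign key and an ordering attribute; the second keeps the list as the ordered value of a single attribute.

```python
# Hypothetical data, for illustration only.

# Relational-style shape: a parent table plus a child table whose rows carry
# a foreign key (person_id) and an ordering attribute (seq).
person_rows = [{"person_id": 1, "name": "Pat"}]
email_rows = [
    {"person_id": 1, "seq": 2, "email": "pat@home.example"},
    {"person_id": 1, "seq": 1, "email": "pat@work.example"},
]


def emails_relational(person_id):
    """Rebuild the ordered list by filtering on the foreign key and sorting on seq."""
    rows = [r for r in email_rows if r["person_id"] == person_id]
    return [r["email"] for r in sorted(rows, key=lambda r: r["seq"])]


# Multivalue-style shape: the list is the value of one attribute, and its
# order is simply the order of the values.
person_records = {
    1: {"name": "Pat", "emails": ["pat@work.example", "pat@home.example"]},
}


def emails_multivalue(person_id):
    """Return a copy of the stored list; no join or sort is needed."""
    return list(person_records[person_id]["emails"])
```

Both functions return the same ordered list for person 1. The difference is where the ordering and the parent-child tie are recorded: in extra attributes and a constraint, or in the structure of the value itself.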

Cheers! --dawn

 
At 6:02 AM, February 02, 2006 , Anonymous anonymoron said...

dawn,

You mentioned something about "data with movement", and you gave an example: "I'm in my forties. If you capture that fact as data today and then present it thirty years from now, it will be as accurate as if I were to attach a high school picture to this blog today. Alas, data changes."

In recent RM literature, it's called "temporal data", and research on its proper implementation in RDBMS's is still ongoing. See the book "Temporal Data and the Relational Model" by Hugh Darwen, Chris Date and Nikos Lorentzos (or see Darwen's site). It seems that the RM people are very interested in it, too.

How do MV DBMS's handle temporal data?

P.S.: Who's the young girl in the picture? Is it you, too?

 
At 9:21 AM, February 02, 2006 , Blogger --dawn said...

Well, anonymoron, it sounds like I've changed so much that you cannot even tell that the two pictures are of the same person. Yes, that's me. Ah well. I just read dbdebunk.com's latest Dawn page entitled MORE ON IGNORANCE AND STUPIDITY and we can just add "and she looks old too." smiles.

Yes, I have read bits and pieces related to their temporal data approach, but not enough to render an opinion on it or even know when it would typically be used. It appears that it takes time and partitions it into named intervals, thereby turning continuous infinite time into discrete finite intervals, perhaps for analytical purposes. I'll have to give it more time at some point, but it sounds like a useful technique at first glance.

If I were to take a wild (really wild) guess at it, perhaps some (likely not all) of the specific problems solved with this approach to classifying transactions (if I have that right) might be handled in MV through the perspective of (logical) "master files" given that these can have multivalued attributes. I could be way off in that off-the-cuff guess, however. I am interested in what problems are solved and what extensions might be required in existing products in order to accommodate this temporal data and it is on my list to dive into it further at some point. Cheers! --dawn

 
At 2:17 AM, February 03, 2006 , Anonymous Dave P said...

I got my first introduction to temporal data concepts while working on software that calculates premium adjustments on motor insurance policies. We were working on a project to migrate an MV system to DB2. I can testify that we struggled to find a satisfactory data model that was even vaguely elegant on either platform.

 
At 3:48 AM, February 03, 2006 , Anonymous anonymoron said...

May I add my own insights on data and information?

Some people argue that data and information are different; some argue otherwise. Some people perceive data as static, and some perceive it as dynamic.

As for me, data and information are different, though related, concepts. Data are those raw pieces of "things" or "objects" that are relevant to our endeavor and thus we collect from the real world, such as names, ages, monthly salaries, college subjects, grades, home addresses, phone numbers, and even bits of zeroes and ones. In practice, we don't just collect data; we also collect claims (propositions) about the data, claims which we know or believe to be true. Like for instance, we're not satisfied that we're simply able to obtain the name "Dawn", the age "forties" and a high school photo from the real world; we would also want to obtain the claim about these pieces of data, that "Dawn, whose high school photo is shown here, is in her forties". We keep the data and all the claims associated with them in a database.

Now, for me, information is something we want to know about the data. Anything that answers our question about the data is information. For instance, we may ask, "Whose high school photo is this?". Anything that answers this question is information about the high school photo. So we search our database for the answer, and we find that the high school photo is Dawn's. We ask again another question, "How old today is that girl in the photo?". We search the database again, and we find that this girl is now in her forties.

Propositions about data are sometimes viewed as "static", meaning that if a proposition is true today, it must be true at all times, including the future. But it seems that not all propositions are like that; if the proposition "Dawn is in her forties" is true today, it will unfortunately cease to be true thirty years from now. The real problem is that the proposition does not explicitly carry a time element; if the proposition instead is "Dawn is in her forties as of February 3, 2006", then the proposition will still be true even a thousand years from now. Representing temporal data via RM is still an active research area today.
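As a minimal sketch of that idea (in Python, and not the approach taken in the temporal RM literature), a proposition can carry its time element as an explicit field. The class and field names here are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Proposition:
    """A claim that carries its own time element (hypothetical structure)."""
    subject: str
    claim: str
    as_of: date  # the date on which the claim was asserted to be true

    def text(self) -> str:
        # The dated sentence stays true no matter when it is read.
        return f"{self.subject} {self.claim} as of {self.as_of.isoformat()}"

    def asserted_for(self, day: date) -> bool:
        # The record only vouches for the claim on its own date; it says
        # nothing about any other day.
        return day == self.as_of


p = Proposition("Dawn", "is in her forties", date(2006, 2, 3))
```

Here `p.text()` yields the dated sentence, while `p.asserted_for(date(2036, 2, 3))` is false: the stored record makes no claim about thirty years later.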

 
At 8:06 AM, February 03, 2006 , Anonymous Wol said...

Following on from Anonymoron, you could perhaps (I'm not up on theory here :-) argue that data is a permanent fact - 'on 3 Feb 06, Dawn was in her forties', while information is any inferences drawn from that set of permanent facts - anybody seeing this in Feb 2011 can conclude that by then Dawn will probably be in her fifties.

My instinctive way of handling temporal data (what I've here called information) is simply not to store it in the database, but to derive it from non-temporal stuff. One nice thing we have in MV (which I think has appeared in some relational databases too) is calculated fields.

Actually, this is a good example of another problem, both with MV and relational implementations. How does one represent "age"? In MV I'd just say '(today - dateofbirth)/365.25'. But there are TWO circumstances in which that could be invalid. If dob is unknown, the correct result should be 'unknown'. If the person is dead the correct result is 'invalid'. I could easily make the field do exactly that in MV. But in both MV and relational, what happens if an app expects to find a number there? :-)
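The derived field described above might be sketched like this in Python (not MV syntax). The function name, the `deceased` flag, and the choice to check it before the missing date of birth are assumptions made for illustration.

```python
from datetime import date
from typing import Optional, Union

UNKNOWN = "unknown"   # date of birth is not recorded
INVALID = "invalid"   # the person has died, so a current age is meaningless


def age(dob: Optional[date], today: date, deceased: bool = False) -> Union[int, str]:
    """Derive age instead of storing it, using the (today - dob) / 365.25 rule."""
    if deceased:
        return INVALID
    if dob is None:
        return UNKNOWN
    return int((today - dob).days / 365.25)


# Hypothetical person born on 15 June 1960, evaluated on 3 February 2006.
result = age(date(1960, 6, 15), date(2006, 2, 3))
```

The closing question in the paragraph shows up directly in the return type: a caller that expects a number has to check for the two marker strings before doing arithmetic on the result.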

Cheers,
Wol

 
At 8:14 AM, February 03, 2006 , Anonymous Wol said...

I was thinking I'd redefined information from the term I'd used higher up, but now I'm not sure :-)

I've almost defined information as "stuff you can't store in a database". Age changes with time - you can't store it in a database without it degrading.

My favourite example to illustrate what I was going on about in my first post is an invoice - how (especially in a relational database) do you store the fact that an invoice detail is an attribute of an invoice? (And before you say "an invoice detail row has a foreign key", remember that that could also mean "belongs to" and not "is part of". Is there a unique indicator that cannot be misunderstood?)

Cheers,
Wol

 
At 8:42 AM, February 03, 2006 , Blogger --dawn said...

Dave P - That is good to know that it seemed equally difficult to get a good data model for temporal data using either model. I will have to look at this area more closely.

Anonymoron - I'm OK with a distinction between data and information where you might say that "firstName is George" is data and "George (firstName) Van (lastName) is a college professor (occupation)" is information. In that case, however, we do information modeling and not data modeling. Since it is "data" as in "data modeling" that I'm trying to get a handle on and "information" as in "computerized information" (or some such), I don't think we lose anything by equating them.

I can see broadening the term data to include strings of seemingly meaningless bits and bytes flying by and broadening information to things that are not computerized such as the feeling of running through a field of daisies as information. So, it is for purposes of understanding "data model" data that I would define data and information to be the same. If I find there is some reason to distinguish these in the future, I'll be open to doing so. Maybe I'm just not seeing something obvious right now. I've put Data -> Information -> Knowledge -> Wisdom -> Service on a white board before, like many have (although the extension of Service is my own), but that is the "meaningless data" idea and not the "data" as in "data modeling."

Wol - enjoyed your wandering with enough other topics (derived data, unknown values) that I know I'll be picking up along the way. I'll comment on the idea of inferences being information because I considered that option. But I abandoned it quickly because if we are doing data modeling, we would not want to assume which data are stored and which are derived. So someone else for some other purpose could define data to be that which is persisted on secondary storage devices and information as that which is derived from it, but that is not "data" as in "data modeling." If I am starting with a real world scenario and doing data modeling, I will model derived facts as well as stored ones. I can also prepare a data model for capturing information from a person and then store a derivation of the collected data, so data modeling is not aligned with data storage in this way.

Thanks for all comments. Cheers! --dawn

 
At 11:51 AM, February 05, 2006 , Anonymous jog said...

Guys, there are hundreds of years of epistemological debate on this subject, and over half a century in terms of information tech. There appears to be general scholarly agreement that (as the formal term is being used in the context of this discussion) information does not exist outside the mind, only data does. Information is the result of data plus interpretation. Any use of the term (communication theory aside) that neglects this, such as using it in the term "Information-sized chunks", makes me cringe somewhat. (And of course this happens a lot in the marketing-based world of business nonsense).

The aim of a good database system should be to facilitate the easiest reconstruction of data plus interpretation for the end-user to reform the original information as it stood prior to encoding. That seems an excellent basis of comparison to me between differing approaches.

All best, Jim.

 
At 5:33 AM, February 06, 2006 , Blogger sevry said...

I think the idea of a set of unchanging data is not so uninteresting. But other things do change. If you have a photo, it might be nice if little Abby didn't change from a blond to a redhead in the course of two months in a photo taken of her on a particular date. It might also be very useful, however, to time-record various photos of her over time, and/or her friends, ancestors and their friends, etc.

But data which does change can be held in a temporal database. As for those, maybe you're right in that the idea does seem to be that one 'objectifies' the data, and attempts to lock it in for any interval. But it does show the movement you want, as well.

The idea of relationships makes sense, too, except that as I understood it, Codd's RM was an attempt to escape from hierarchy even while imposing its own rigidity in 'link'/foreign keys and the relations/tables themselves. It was to be more flexible.

Perhaps what you have, instead, are systems - that work. I want the report on my desk, Pascal, as it were. And if Pascal uses any system that gets a reasonably truthful report on the desk, maybe none of the rest matters; until another Pascal comes in with a presentation for a 2x improvement, etc. And maybe this Pascal is more of a purist, and would want a trans-model of relations, if he had the time. But he uses what the company bought, and fits the code to it.

Everyone's done that, saying that this really should have been done in another language, on another platform, etc. In the trades, they call using what's available, 'resourcefulness', in getting the parts to inspection - somehow. I think that's how our top of the line jet fighters are made. I suppose it's not so different from the back office data management, to which any purist worth the name would scream - that's just not professional.

In other words, what's better might be a bit more academic than academics realize, or just far too concrete and problematic/situational. Perhaps that explains the lack of detailed and meaningful examples in assorted theses and textbooks. That is, as you say, the math may suggest something. But it can't tell you if it applies. But even if one can be that Pascal with the 2x improvement presentation just before lunch, who wins the new contract, all one can be sure of is that he may even have something that won't be worse than what they've already got? Time would tell.

 
At 8:00 PM, March 23, 2006 , Anonymous Karen Lopez said...

Hey Dawn -

A great blog on women, IT, data, databases and movement.

I've also written on data management and the gender issue, so it was great finding someone else with these two things in common.

I believe that girls are opting out of computer science programs of study because computer science, by definition, is not applied. Unlike other professions such as engineering, medicine, and law, we tend to try to force young people to study just the science and then somehow turn around and become practitioners. Sure, lots of people do that, but that's not how professions are supposed to work.

Other professions manage to teach undergraduates *both* theory and application. The closest we have to applied computing study is MIS and Software Engineering.

Really liked your blog. I blogged about it on my site.

Karen

 
At 9:18 PM, March 23, 2006 , Blogger --dawn said...

Hi Karen -- I'm so pleased to have a comment from another woman in computing. I googled you and found http://www.eyetoit.com/2005/09/how_important_i.html . Excellent! I'm still searching for the URL of your blog so I can add it to my reading list. Feel free to post it here.

Related to the idea of applied vs theory, I was interested in mathematics with a focus in algebra (pure mathematics, not applied). I think there are still quite a few women in that area. So, I don't think of myself as only interested in application when it comes to mathematics. However, when it comes to a machine, I am only interested in it to the extent that it is useful. So I agree that an applied degree related to computing would help attract women, but I'm not sure why pure mathematics is a draw for me while CS is not.

Thanks for your comments. Cheers! --dawn

Hi Sevry -- I see I missed your comment earlier, so I'll read and comment later.

 
