
mewsings, a blog

--dawn

Monday, April 03, 2006

Consuming Frozen Data

[Image: frozen data beverage]

Organizations often freeze data for reporting purposes. By the way, before it was called business intelligence (BI), we called it all reporting, which I am still inclined to do.

Higher education often freezes data on designated census dates to get an accurate accounting of the students who are registered for a particular term. Government, registrar, and board reports are run against this data, with longitudinal analysis showing changes from term to term and year to year. Cross-tabs of faculty broken down by sex (a favorite example until gender became the preferred term) and students sliced and diced every which way are run against this specific collection of census data.

I have observed many approaches to such reporting and will present three distinct methods currently employed by organizations, two in this entry and one in the next. Many colleges and universities employ more than one of these approaches for their census reporting.

  • 1. Keep Out Reporting

    This is the basic, old-fashioned approach of ceasing all maintenance of registration data, which ripples through the campus as a considerable amount of data entry is halted for the time it takes to run the reports. Reports are often run at night to minimize disruption.

    Sites recognize that they sometimes need to re-run the reports, perhaps after repairing some data that was not properly cleansed for these reports. As a slight digression, I'll mention that preparing the reports often includes specifying or coding the report in connection with cleansing the data. Derived data prepared specifically for a particular report might be used to virtually cleanse or tailor the data, for example. Data and code dance together, as is the case with software in general.

    One advantage of freezing the data in place this way for the purpose of running reports is that these exact same reports can be run at any time against the live data. This is helpful when preparing for the census reports: prior to freezing the data, the reports can be run to ensure that data values, derived data, and reports are exactly what is needed for this census.

    Sometimes the output from these reports is saved in something other than PDF files or (my favorite) greenbar, such as an Excel workbook, where it can later be queried as needed. This brings us to the next topic.

  • 2. Data Marts

    Before we called them data marts, they were called extract files. I'll make some not-altogether-standard distinctions between these two terms, however. A data mart might be hosted by a DBMS other than the one used by the OLTP system, whereas extract files were more often hosted by the same DBMS or downloaded to client software, such as Excel. Some systems built on MS SQL Server have data marts hosted by Oracle, and vice versa.

    Another possible distinction is the transformation of the data, sometimes denormalizing it or shaping it into fact-dimension cubes for the data mart (a sketch of this kind of reshaping follows this list). The transformation might be as simple as dropping data that are unnecessary for reporting purposes; when not even that is done, the extract is referred to as a snapshot. More often for data marts, though, the logical data model is changed considerably.

    The biggest distinction just might be whether or not an organization sinks non-personnel dollars into data mart products. If you buy tools, you have data marts. If not, you might call them extract files.

    Now, I could give data warehouses a separate bullet point because they are decidedly different. But I have rarely recommended anything I would call a warehouse, due to the expansive scope required of such a project. I would rather plan a series of data mart projects that could, over time, come to be perceived as a data warehouse. But, yes, there are some warehouses out there in higher ed too. In any of these cases, the verb is extract, whether the target is an extract file, a data mart, or a data warehouse.

    Benefits of extracting data include the ability to run reports against the same set of census data from now on, typically without an adverse effect on the transaction system. Performance is one of the big reasons people extract data. The use of different reporting tools is another. Users of reporting tools such as SAS often like to have the data in a SAS data set, for example.

    Data marts are good for longitudinal analysis. If you do not freeze your census data, you are unlikely to report the exact same figures in the future, making it hard to compare this year to last. Additionally, you can model the data to optimize it for reporting and aggregation (cubes, for example).

    Data mart projects can range from reasonably priced to grossly expensive. Techniques that help minimize the cost include hosting the data mart on the same DBMS as the transaction system and even retaining the structure of the data for the data mart. This can permit the same reporting tools, and even the same reports, to be run against either live or frozen data (a sketch of this idea appears after this list).


    Here's the rub. It is rarely advisable for a SQL-DBMS site to retain the data model of the transaction system for the purpose of analytical reporting, because the relational model of data is not conducive to high-performance, easy-to-specify analytical reporting. With MultiValue (MV) databases, the data are typically structured so that the same shape serves both OLTP and OLAP, but you typically cannot specify such databases as a data source from the full range of reporting tools. Data marts are often reshaped and rehosted in order to make the data more easily accessible from tools that work exclusively with SQL data sources.

    Can you sense my frustration? As an industry, we have adopted standards that trap us into all of the high-wire acts and costs we are sinking into reporting. Getting reports out of our systems was once the easy part! From a database perspective, it is read-only, for goodness' sake. One thing that changed between then and now is the proliferation of SQL and the relational model. If systems using a non-relational data model could be a standard data source, we could get a bigger bang for the buck from our reporting solutions. We could run reports against extracts hosted in the same shape and DBMS as the source data. I have seen the benefits for organizations using MV databases combined with one of the handful of MV-specific reporting solutions.


    Enter: XML. The MV data model is, effectively, a subset of the XML data model (a sketch of the correspondence also appears after this list). Any query language that reports against multiple XML documents could, in theory, be pointed at any MV database, provided the database vendor accommodated such. But I gotta say, XQuery is dog-ugly compared to the language formerly known as GIRLS. It does work with multivalued data and employs a two-valued logic, however, so I'm encouraged by that. If an MV database provider with adequate resources were on top of this situation, they could help standardize a simplified, but proven, subset of XML for persistence, with both OLTP and OLAP capabilities.
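
Since we are talking technique, here is a minimal sketch, in Python with the built-in sqlite3 module, of what I mean by retaining the structure of the data: the exact same report can be pointed at either the live registration table or a frozen census snapshot. The table and column names are invented for illustration and are not from any particular student system.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Live registration table and a census-day snapshot with the same shape.
        CREATE TABLE registrations    (term TEXT, student_id TEXT, credits REAL);
        CREATE TABLE census_fall_2005 (term TEXT, student_id TEXT, credits REAL);
        INSERT INTO registrations    VALUES ('2005FA','S001',15), ('2005FA','S002',12);
        INSERT INTO census_fall_2005 VALUES ('2005FA','S001',15), ('2005FA','S002',9);
    """)

    def headcount_report(table):
        # The same report runs against either table because the shapes match.
        sql = f"SELECT term, COUNT(DISTINCT student_id), SUM(credits) FROM {table} GROUP BY term"
        return conn.execute(sql).fetchall()

    print("live:  ", headcount_report("registrations"))      # reflects ongoing maintenance
    print("frozen:", headcount_report("census_fall_2005"))   # stable census-day figures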
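
And here is a minimal sketch of the kind of reshaping a data mart often undergoes: a flat registration extract is split into dimension tables and a fact table keyed by those dimensions. Again, the field names and figures are made up purely for illustration.

    # A flat extract, one row per registration.
    extract = [
        {"term": "2005FA", "student": "S001", "major": "MATH", "credits": 15},
        {"term": "2005FA", "student": "S002", "major": "HIST", "credits": 12},
        {"term": "2006SP", "student": "S001", "major": "MATH", "credits": 12},
    ]

    # Dimensions: one surrogate key per distinct term and per distinct major.
    term_dim  = {t: i for i, t in enumerate(sorted({r["term"]  for r in extract}))}
    major_dim = {m: i for i, m in enumerate(sorted({r["major"] for r in extract}))}

    # Fact table: measures (total credits, headcount) keyed by dimension keys.
    facts = {}
    for r in extract:
        key = (term_dim[r["term"]], major_dim[r["major"]])
        credits, heads = facts.get(key, (0, 0))
        facts[key] = (credits + r["credits"], heads + 1)

    print(term_dim)
    print(major_dim)
    print(facts)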
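
Finally, a minimal sketch of why I say the MV data model is, effectively, a subset of the XML data model: a student record whose enrollments form a multivalued group maps naturally onto repeating child elements, which generic XML tooling (here Python's xml.etree) can then query. The record layout and element names are invented, and the query is far wordier than its MV equivalent would be.

    import xml.etree.ElementTree as ET

    doc = ET.fromstring("""
    <student id="S001">
      <name>Jane Doe</name>
      <enrollment><course>MATH101</course><credits>4</credits></enrollment>
      <enrollment><course>HIST210</course><credits>3</credits></enrollment>
    </student>
    """)

    # A query over the repeating (multivalued) group, in the spirit of what
    # XQuery or an MV query language would express.
    total = sum(int(e.findtext("credits")) for e in doc.findall("enrollment"))
    print(doc.get("id"), doc.findtext("name"), "total credits:", total)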

In case you are thinking that these two approaches cover the whole mix, they don't. We will look at a third approach to consuming frozen data in the next mewsing.
