mewsings, a blog
Thursday, April 20, 2006
Consuming Frozen Data
Organizations often freeze data for reporting purposes. By the way, before it was called business intelligence (BI), we called all of it reporting, which I am still inclined to do.
Higher education often freezes data on designated census dates to get an account of the students who are registered for a particular term. Government, registrar, and board reports are run against this data, with longitudinal analysis showing changes from term to term and year to year. Cross-tabs of faculty broken down by sex (a favorite example until gender became the preferred term) and reports of students sliced and diced every which way are run against the census data.
I have observed many approaches to such reporting and will present three distinct methods, two in this entry and one in the next, that are currently employed by organizations. Many colleges and universities employ more than one of these approaches for their census reporting.
1. Keep Out Reporting
2. Data Marts
The first approach, keep out reporting, is the basic old-fashioned approach of ceasing all maintenance of registration data. The freeze ripples through the campus, halting a considerable amount of data entry for as long as it takes to run the reports. Reports are often run at night to minimize disruption.
Sites recognize that they sometimes need to re-run the reports, perhaps after repairing some data that has not been properly cleansed for these reports. As a slight digression, I'll mention that the preparation of reports often includes specifying or coding the report in connection with cleansing the data. Derived data prepared specifically for a particular report might be used to virtually cleanse or tailor the data, for example. Data and code dance together, as is the case with software in general.
One advantage to freezing the data in this way for the purpose of running reports is that these exact same reports can be run at any time against the live data. This is helpful when preparing for the census. Prior to freezing the data, the reports can be run to ensure that data values, derived data, and reports are exactly what is needed for this census.
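The freeze-and-report cycle above can be sketched in a few lines. This is a minimal sketch using SQLite; the table layout, names, and figures are all hypothetical, and a real census copy would of course be far wider than this.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical registration table standing in for the live OLTP data.
cur.execute("CREATE TABLE registration (student_id INTEGER, term TEXT, credits INTEGER)")
cur.executemany("INSERT INTO registration VALUES (?, ?, ?)",
                [(1, "F2006", 15), (2, "F2006", 12), (3, "F2006", 9)])

# On the census date, freeze the data by copying it into a census table.
cur.execute("CREATE TABLE census_f2006 AS SELECT * FROM registration "
            "WHERE term = 'F2006'")

# The same report can be pointed at either the live table or the frozen copy.
def headcount(table):
    return cur.execute(f"SELECT COUNT(DISTINCT student_id) FROM {table}").fetchone()[0]

dry_run = headcount("registration")    # run before the freeze as a dry run
official = headcount("census_f2006")   # the official census figure

# Maintenance resumes on the live table; the frozen figure does not move.
cur.execute("INSERT INTO registration VALUES (4, 'F2006', 6)")
assert headcount("registration") == 4
assert headcount("census_f2006") == 3
```

Because the frozen table keeps the same shape as the live one, the identical report text serves for both the pre-census dry runs and the official census figures.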
While these reports will end up on web pages, as PDF files, or even greenbar, sometimes output from the reports is also captured as data. This data might be downloaded to Excel, for example, for future use. That brings us to the next topic.
Before we had data marts we had extracts. I'll make some not-altogether-standard distinctions between these two terms. A data mart might be hosted by a DBMS other than the one used by the OLTP system, whereas extracts were more often hosted by the same DBMS, by the file system of the host OS, or downloaded to client software, such as Excel. Some systems employing MS SQL Server have data marts hosted by Oracle and vice versa. These would not typically be called extracts.
Another possible distinction is the transformation of the data, sometimes denormalizing or shaping into fact-dimension cubes for the data mart. With an extract, there might be a simple transformation to drop attributes that are unnecessary for reporting purposes. When the data is not reshaped at all, the extract is referred to as a snapshot. Often for data marts, the logical data model is changed considerably.
Additionally, an extract comes from a single data source, whereas a data mart could be populated from multiple sources. There is sometimes a very complex ETL process for extracting, transforming, and loading data into a data mart, whereas with an extract you are more likely simply to talk about running the extract.
The biggest distinction just might be whether or not an organization sinks non-personnel dollars into data mart products. If you buy tools, you have invested in data marts. If not, you must have extracts.
Now, I could mention data warehouses as a separate bullet point because they are decidedly different. But I have rarely recommended anything I call a warehouse due to the required expansive scope of such a project. I would rather plan for a series of data mart projects that could, over time, be perceived as a data warehouse. But, yes, there are some warehouses out there in higher ed too. In any of these cases, the verb is extract whether the target is an extract, data mart, or data warehouse.
Benefits of extracting data include the ability to run reports against the same set of census data from now on, typically without an adverse effect on the transaction system. Performance is one of the big reasons people extract data. The use of different reporting tools is another. Users of reporting tools such as SAS often like to have the data in a SAS data set, for example.
Data marts are good for longitudinal analysis. If you do not freeze your census data, you are unlikely to report the exact same figures in the future, making it hard to compare this year to last. Additionally, you can model the data to optimize for reporting and aggregating data--cubes, for example.
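The longitudinal payoff is simply that frozen figures never move, so term-over-term deltas are trustworthy. A minimal sketch, with invented headcounts:

```python
# Frozen census headcounts per term; the figures are hypothetical.
census = {"F2004": 4180, "F2005": 4312, "F2006": 4401}

# Because each figure was frozen on its census date, the term-to-term
# changes computed today will match the ones computed next year.
terms = sorted(census)
changes = {later: census[later] - census[earlier]
           for earlier, later in zip(terms, terms[1:])}
assert changes == {"F2005": 132, "F2006": 89}
```

Without the freeze, re-running last year's query against live data would quietly shift last year's baseline.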
Data mart projects can range from reasonably priced to grossly expensive. Techniques to help minimize the cost include hosting the data mart using the same DBMS as the transaction system and even retaining the structure of the data for the data mart. This can permit the same reporting tools and even the same reports to be run against either live or frozen data. As a rule of thumb, the closer a data mart resembles an extract, the less expensive it is.
Here's the rub. This is problematic both with SQL databases and with not-highly-relational databases such as MultiValue. It is rarely advisable for a SQL-DBMS system to retain the data model of the transaction system for the purpose of analytical reporting, because the relational model of data is not conducive to high-performance, easy-to-specify analytical reporting. With MV databases, the data are typically structured so that the same shape serves for both OLTP and OLAP, but you cannot specify such databases as a data source from the full range of reporting tools. In this case, data marts are often reshaped and rehosted in order to make the data more easily accessible from tools that work exclusively with SQL data sources.
Can you sense my frustration? As an industry, we have adopted standards that trap us into all of the high-wire acts and costs we are sinking into reporting. Getting reports out of our systems was once the easy part! From a database perspective, it is read-only, for goodness sake. One thing that changed between then and now is the proliferation of SQL and the relational model. If systems using a non-relational data model were a standard data source, we could have bigger bang for the buck reporting solutions. We could run reports against extracts hosted in the same shape and DBMS as the source data. I have seen the benefits for organizations using MV databases combined with one of the handful of MV-specific reporting solutions.
Enter: XML. The MV data model is, effectively, a subset of the XML data model. Any query language that reports against multiple XML documents could, in theory, be pointed at any MV database, provided the database vendor accommodated such. But I gotta say, XQuery is dog-ugly compared to the language formerly known as GIRLS. It does work with multivalued data and employs a two-valued logic, however, so I'm encouraged by that. If an MV database provider with adequate resources were on top of this situation, they could help standardize a simplified, but proven, subset of XML for persistence, with both OLTP and OLAP capabilities.
We have now addressed the approaches of reporting against live data and against data that has been extracted for reporting. You might be thinking that these two approaches cover the mix, but they don't. We will look at a third approach to consuming frozen data in the next mewsing.