Tincat Group, Inc. - Mewsings, a Software Development and Data Modeling Blog

<<software development>>

mewsings, a blog

--dawn

Wednesday, April 26, 2006

Surf and Turf Reporting

This is a continuation of the previous mewsing, Consuming Frozen Data. There is a new wave of surf and turf reporting in the business intelligence community.

3. Surf and Turf Reporting

Our final option is a blend of live, fluid data (surf) and extracted, fixed data (turf). Some of the advantages of reporting in the same environment as your production OLTP system and some of the advantages of reporting against data marts or extracted data are present with this option. Additionally, any new data (or data that were not originally extracted) is automatically accessible for reporting in combination with data that has been extracted.

The surf and turf strategy lays an agile foundation for analytical reporting. I am a fan of this approach; however, the pros will also be able to determine the cons of various options once we dive into the descriptions.

It should come as no shock to those of you who have been reading these mewsings that SQL and MultiValue database environments have different approaches to surf and turf. Fortunately, each permits the user to perform steps that somewhat simulate the approach of the other. Users often stick to the first approach they find that works, however. So the first of the below descriptions is what SQL developers are more inclined to do and the second is a technique Pickies often employ.

Materialized Views
An SQL View can be defined as a stored SQL query. A materialized View is the persisted result set of the query. For those cases where an SQL View can be used to specify the data to be extracted from a live system, running the appropriate commands to create a table based on this view results in a materialized view. Most Pick environments, such as U2, provide a means to accomplish a similar feat without use of SQL. The materialized view may then be used as any other base table in a schema.

Using this approach, a college or university can extract registered student data without including any accounts receivable information. Then if someone decides to identify clusters of students who failed to pay in full at the time of registration over the past ten years, the extracted data could be joined to the live accounts receivable data, including historical transactions, to perform this analysis.

We don't need to measure with a micrometer before we cut with an axe.

Reporting is done in support of decision-making. If a decision requires that only data frozen at a particular point in time be used as input, then that decision and related reporting must happen only after there has been a project to identify and extract the required data. There is a cost to this which is sometimes justified. We, as a profession, want to provide the very best data possible for decision-making, but we must also be business-savvy in our cost-benefit analyses.

Very often a small subset of the data required for a particular decision must come from data frozen at a specific point in time, while additional live data contributes a sufficient approximation to make an informed decision. There is a lot of data that rarely or never changes. If it has changed since the time of the extract, it might even be a correction. There is a time to cringe at the words good enough and a time to recognize that quality includes fiscal stewardship. We don't need to measure with a micrometer before we cut with an axe.
Savedlists
MV developers and end-users alike often decouple the restriction process, narrowing down the input for a report to a fixed set of primary keys, for example, from the projection process which yields a set of attributes related to the identified keys. In Pick, the restriction process is called the selection. While these two processes can be handled together in the query language, Pick doesn't work with result sets, in general, but with select lists a.k.a. savedlists, the result of restricting the data to a set of records. These savedlists typically store the keys to the entity (file) of interest. For example, a school might keep a savedlist with the student IDs for all students registered for the fall of 2005 and another for the spring of 2006.

These savedlists are, effectively, materialized views of primary key data. It is similar to a snapshot in time of an index on a subset of rows. After executing a SELECT statement, a SAVE-LIST or SAVE.LIST command will save the resulting keys under the name provided. Then after retrieving a savedlist (GET.LIST or GET-LIST), the next command or query is executed only against that subset of keys. While selection criteria is not always saved as it could otherwise be acted upon immediately, when it is stored in a savedlist we have a unique minimalist Pick twist on surf and turf reporting.

While materialized views are not always heavily used in SQL shops, SAVE-LIST and GET-LIST are bread and butter for MV shops. It greatly improves performance if you already have a savedlist with which to start, much as an index does when reporting on live data. Unlike an index, the savedlists are not updated once they have been built. They are minimalist data marts used to run a set of reports against a consistant set of IDs.

I have seen colleges use only savedlists for their census data extracts, saving these while running their census reports using option 1 from the last mewsing. If any other questions come up related to this census, they consider the live data, including historical records, adequate for the related decision-making. I'll admit that I would advise against this approach for census data, as I think the benefits outweigh the costs for saving some of the important attribute values at the time of the census. In this case a materialized view would likely be a better strategy.

Aside from census data, a registrar's office could create a fresh savedlist of registered students on a nightly basis, and other departments on campus could use that savedlist for their work too. No one other than the registrar's office would then need to understand the criteria for determining who, precisely, is registered for this term. One advantage to using savedlists rather than full data marts is that these savedlists can be fed to update processes as well as reporting processes. Depending on the application software, a savedlist could be entered into a process that would set the graduation date for everyone stored in a savedlist, for example.

Another advantage of savedlists is that often end-users can decide what they want to put in a savedlist. Of course there are different approaches to security in different shops, but in places where end-users have such access, these power users are empowered with respect to their data. Without actually creating new tables or files (savedlists are handled differently) and without writing data extraction procedures, MV end-users are often using a surf and turf strategy with their data from day to day.
Materialized Attributes
With a current industry focus on operational data stores which permit queries against current or near current data values, there is a return to thinking about how to do reporting against your production data banks without extracting data. Although such redundancy was once considered poor form, storing data derived from other data could be just the ticket to improved reporting performance. Aggregates and other derivations are costly when they must be performed on a large data set all for one report. If they can be computed in the background whenever there is a change that would prompt such, the added cost of integrity for the redundant data might be more than balanced by the savings in reporting. Performance against the live data can then be as good as from a data mart.

For example, if we materialize the attribute CURRENT_GPA for each student, we can run reports that require this data without having to read through all courses for all terms for each student in order to compute the GPA on the fly. A policy of no redundant data, held tightly in many shops, is one factor prompting huge redundancies in the form of data marts and warehouses. Yes, redundancy does have a cost, and a few well-chosen materialized attributes could save you from the more excessive redundancies found with extracted data.

Surf and turf is likely easier to pull off successfully if not constrained by SQL

It surely isn't a new technique, but there is a revived interest in strategies used to materialize (compute and store) derived attributes. A colleague pointed me to Sybase as an example of a product that has tools for materializing derived data. I browsed the information on it without trying it out (lazy or busy, you can choose), and I'm impressed! With other products developers can get the value of a user-defined function, virtual field, or stored procedure and store it using application code. They might do this with triggers or CRUD services that adjust the materialized values when the underlying values change.

In case you are wondering where this is headed in relation to the overall topic of these mewsings, I'll give a hint, without yet providing any arguments. In short, a surf and turf strategy can result in significant savings, but is likely easier to pull off successfully, with smiling end-users, if you are not constrained by a SQL-based environment.

Getting back to today's topic, I have often been a bit queasy about the added costs vs. benefits for data marts and warehouses. Each situation is different, but in many cases, this added cost can be reduced significantly by doing surf and turf reporting.

See comments.

Thursday, April 20, 2006

Consuming Frozen Data

Organizations often freeze data for reporting purposes. By the way, before it was called business intelligence (BI), we called all of it reporting, which I am still inclined to do.

Higher education often freezes data on designated census dates to get an account of the students who are registered for a particular term. Government, registrar, and board reports are run against this data, with longitudinal analysis showing changes from term to term and year to year. Cross-tabs of faculty broken down by sex (a favorite example until gender became the preferred term) and reports of students sliced and diced every which way are run against the census data.

I have observed many approaches to such reporting and will present three distinct methods, two in this entry and one in the next, that are currently employed by organizations. Many colleges and universities employ more than one of these approaches for their census reporting.

1. Keep Out Reporting

This is the basic old fashioned approach of ceasing all maintenance of registration data, rippling through the campus so a considerable amount of data entry is halted for the time it takes to run reports. Reports are often run at night to minimize disruption.

Sites recognize that they sometimes need to re-run the reports, perhaps after repairing some data that has not been properly cleansed for these reports. As a slight digression, I'll mention that the preparation of reports often includes specifying or coding the report in connection with cleansing the data. Derived data prepared specifically for a particular report might be used to virtually cleanse or tailor the data, for example. Data and code dance together, as is the case with software in general.

One advantage to freezing the data in this way for the purpose of running reports is that these exact same reports can be run at any time against the live data. This is helpful when preparing for the census. Prior to freezing the data, the reports can be run to ensure that data values, derived data, and reports are exactly what is needed for this census.

While these reports will end up on web pages, as PDF files, or even greenbar, sometimes output from the reports is also captured as data. This data might be downloaded to Excel, for example, for future use. That brings us to the next topic.

2. Data marts

Before we had data marts we had extracts. I'll make some not-altogether-standard distinctions between these two terms. A data mart might be hosted by a DBMS other than the one used by the OLTP system, where extracts were more often hosted by the same DBMS, by the file system of the host OS, or downloaded to client software, such as Excel. Some systems employing MS SQL Server have data marts hosted by Oracle and vice versa. These would not typically be called extracts.

Another possible distinction is the transformation of the data, sometimes denormalizing or shaping into fact-dimension cubes for the data mart. With an extract, there might be a simple transformation to drop attributes that are unnecessary for reporting purposes. When the data is not reshaped at all, the extract is referred to as a snapshot. Often for data marts, the logical data model is changed considerably.

Additionally, an extract comes from a single data source, where a data mart could be populated from multiple sources. There is sometimes a very complex ETL process for extracting, transforming, and loading data for a data mart, where you are more likely to talk about running the extract.

As a rule of thumb, the closer a data mart resembles an extract, the less expensive it is.

The biggest distinction just might be whether or not an organization sinks non-personnel dollars into data mart products. If you buy tools, you have invested in data marts. If not, you must have extracts.

Now, I could mention data warehouses as a separate bullet point because they are decidedly different. But I have rarely recommended anything I call a warehouse due to the required expansive scope of such a project. I would rather plan for a series of data mart projects that could, over time, be perceived as a data warehouse. But, yes, there are some warehouses out there in higher ed too. In any of these cases, the verb is extract whether the target is an extract, data mart, or data warehouse.

Benefits of extracting data include the ability to run reports against the same set of census data from now on, typically without an adverse effect on the transaction system. Performance is one of the big reasons people extract data. The use of different reporting tools is another. Users of reporting tools such as SAS often like to have the data in a SAS data set, for example.

Data marts are good for longitudinal analysis. If you do not freeze your census data, you are unlikely to report the exact same figures in the future, making it hard to compare this year to last. Additionally, you can model the data to optimize for reporting and aggregating data--cubes, for example.

Data mart projects can range from reasonably priced to grossly expensive. Techniques to help minimize the cost include hosting the data mart using the same DBMS as the transaction system and even retaining the structure of the data for the data mart. This can permit the same reporting tools and even the same reports to be run against either live or frozen data. As a rule of thumb, the closer a data mart resembles an extract, the less expensive it is.

If systems using a non-relational data model were a standard data source, we could have bigger bang for the buck reporting solutions.

Here's the rub. This is problematic with both SQL and not-highly-relational, such as MultiValue, databases. It is rarely advisable for a SQL-DBMS system to retain the data model of the transaction system for the purpose of analytical reporting because the relational model of data is not conducive to high performance, easy-to-specify analytical reporting. With MV databases, the data are typically structured to be able to use the same shape for both OLTP and OLAP, but you cannot specify such databases as a data source from the full range of reporting tools. In this case, data marts are often reshaped and rehosted in order to make the data more easily accessible from tools that work exclusively with SQL data sources.

Can you sense my frustration? As an industry, we have adopted standards that trap us into all of the high-wire acts and costs we are sinking into reporting. Getting reports out of our systems was once the easy part! From a database perspective, it is read-only, for goodness sake. One thing that changed between then and now is the proliferation of SQL and the relational model. If systems using a non-relational data model were a standard data source, we could have bigger bang for the buck reporting solutions. We could run reports against extracts hosted in the same shape and DBMS as the source data. I have seen the benefits for organizations using MV databases combined with one of the handful of MV-specific reporting solutions.

Enter: XML. The MV data model is, effectively, a subset of the XML data model. Any query language that reports against multiple XML documents could, in theory, be pointed at any MV database, provided the database vendor accomodated such. But I gotta say, XQuery is dog-ugly compared to the language formerly known as GIRLS. It does work with multivalued data and employs a two-valued logic, however, so I'm encouraged by that. If an MV database provider with adequate resources were on top of this situation, they could help standardize a simplified, but proven, subset of XML for persistence, with both OLTP and OLAP capabilities.

We have now addressed the approaches of reporting against live data and against data that has been extracted for reporting. I'm hoping that you are thinking that these two approaches cover the mix, because they don't. We will look at a third approach to consuming frozen data in the next mewsing.

Continue to next blog →

Tuesday, April 11, 2006

Pascal Loses Wager

I'm on the road right now and stopped in at Sylvia's for the night. Sylvia decided to google me this afternoon instead of cleaning the guest bathroom. She was laughing when I arrived after having read how I was infamous. She wondered how I had offended this Pascal character after taking at look at this and this. In case these links change on the dbdebunk site, I'll just note that they basically claim that I am stupid, ignorant, and vociferous.

The following is a rebus response to Mr. Pascal's ad hominem attacks. Hover over pictures to get the word if you don't want to work the puzzle out for yourself. I hope you enjoy, but if you find it sophomoric, well, you should meet Sylvia.

,

Ewe Argh an Ass .

Oui Awl Love Cod .

Time Move a Head

Love ,

Dawn

Note: I wrote this a few weeks ago. I requested input from a few folks regarding this blog. Roughly half thought it was great and the other half thought it was a bad idea. I agree with others who have been targeted by Pascal that he and his approach to others in the industry should not be left unaddressed. I've decided to take the risk and post this because it fits my style to respond, but not respond in kind. After all, I might be ignorant of many things, but I'm not stupid.

See comments.

Litter Box

Mewsings about software development, with a focus on data modeling.

Some of this is worth pawing through.