mewsings, a blog
Wednesday, April 26, 2006
Surf and Turf Reporting
This is a continuation of the previous mewsing, Consuming Frozen Data. There is a new wave of surf and turf reporting in the business intelligence community.
- 3. Surf and Turf Reporting
- Materialized Views
An SQL view can be defined as a stored SQL query; a materialized view is the persisted result set of that query. Where an SQL view can specify the data to be extracted from a live system, running the appropriate commands to create a table based on that view results in a materialized view. Most Pick environments, such as U2, provide a means to accomplish a similar feat without SQL. The materialized view may then be used like any other base table in a schema.
Using this approach, a college or university can extract registered student data without including any accounts receivable information. Then if someone decides to identify clusters of students who failed to pay in full at the time of registration over the past ten years, the extracted data could be joined to the live accounts receivable data, including historical transactions, to perform this analysis.
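As a minimal sketch of the idea, here is the extract-then-join pattern using SQLite from Python. The table names, columns, and data are all hypothetical stand-ins for a registration system; the point is only that `CREATE TABLE ... AS SELECT` persists a view's result set, and the frozen snapshot can later be joined back to live data:

```python
import sqlite3

# Hypothetical schema: registrations (frozen at extract time) and
# accounts-receivable transactions (live data).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE registrations (student_id INTEGER, term TEXT);
CREATE TABLE ar_transactions (student_id INTEGER, amount_due REAL);

INSERT INTO registrations VALUES (1, 'FALL2005'), (2, 'FALL2005');
INSERT INTO ar_transactions VALUES (1, 0.0), (2, 250.0);

-- A view defining the extract: registered students, no A/R data.
CREATE VIEW fall2005_students AS
  SELECT student_id FROM registrations WHERE term = 'FALL2005';

-- Materialize it: persist the view's result set as a base table.
CREATE TABLE fall2005_snapshot AS SELECT * FROM fall2005_students;
""")

# Later: join the frozen snapshot to the live accounts-receivable data.
rows = con.execute("""
  SELECT s.student_id, a.amount_due
  FROM fall2005_snapshot AS s
  JOIN ar_transactions AS a USING (student_id)
  WHERE a.amount_due > 0
""").fetchall()
print(rows)  # students who still owed money: [(2, 250.0)]
```

Note that once the snapshot table exists, updates to `registrations` no longer affect it, which is exactly the frozen-at-a-point-in-time behavior a census-style report needs.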
Reporting is done in support of decision-making. If a decision requires that only data frozen at a particular point in time be used as input, then that decision and related reporting must happen only after there has been a project to identify and extract the required data. There is a cost to this which is sometimes justified. We, as a profession, want to provide the very best data possible for decision-making, but we must also be business-savvy in our cost-benefit analyses.
Very often a small subset of the data required for a particular decision must come from data frozen at a specific point in time, while additional live data contributes a sufficient approximation to make an informed decision. There is a lot of data that rarely or never changes. If it has changed since the time of the extract, it might even be a correction. There is a time to cringe at the words good enough and a time to recognize that quality includes fiscal stewardship. We don't need to measure with a micrometer before we cut with an axe.
MV developers and end-users alike often decouple the restriction process (narrowing the input for a report down to a fixed set of primary keys, for example) from the projection process, which yields a set of attributes related to the identified keys. In Pick, the restriction process is called the selection. While these two processes can be handled together in the query language, Pick generally works not with result sets but with select lists, a.k.a. savedlists: the result of restricting the data to a set of records. These savedlists typically store the keys to the entity (file) of interest. For example, a school might keep a savedlist with the student IDs for all students registered for the fall of 2005 and another for the spring of 2006.
These savedlists are, effectively, materialized views of primary key data; each is like a point-in-time snapshot of an index on a subset of rows. After executing a SELECT statement, a SAVE-LIST or SAVE.LIST command will save the resulting keys under the name provided. Then, after retrieving a savedlist (GET-LIST or GET.LIST), the next command or query is executed only against that subset of keys. The result of a selection need not be saved at all, since it can be acted upon immediately, but when it is stored in a savedlist we have a uniquely minimalist Pick twist on surf and turf reporting.
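For readers without a Pick environment at hand, the select/save/get cycle can be sketched in a few lines of plain Python. This is a toy model, not the U2 implementation: the file, the criteria, and the savedlist name are all made up, and a real savedlist is persisted by the environment rather than held in a dictionary.

```python
# Toy sketch of Pick-style savedlists: SELECT restricts a file to a set
# of keys, SAVE-LIST persists them under a name, and GET-LIST retrieves
# them so the next query runs only against that subset.
saved_lists = {}  # stands in for the savedlist catalog

students = {  # stands in for a Pick file; keys are record IDs
    "S1": {"term": "FALL2005", "name": "Ann"},
    "S2": {"term": "SPRING2006", "name": "Ben"},
    "S3": {"term": "FALL2005", "name": "Cal"},
}

def select_keys(file, criteria):
    """Restriction only: yields keys, not rows (like a Pick SELECT)."""
    return [key for key, record in file.items() if criteria(record)]

def save_list(name, keys):   # SAVE-LIST
    saved_lists[name] = list(keys)

def get_list(name):          # GET-LIST
    return saved_lists[name]

save_list("FALL2005.REG",
          select_keys(students, lambda r: r["term"] == "FALL2005"))

# Projection happens later, against the saved keys only.
names = [students[key]["name"] for key in get_list("FALL2005.REG")]
print(names)  # ['Ann', 'Cal']
```

The design point to notice is that the restriction result is just a named list of keys, so any later report (or update process) can start from it without re-evaluating the selection criteria.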
While materialized views are not always heavily used in SQL shops, SAVE-LIST and GET-LIST are bread and butter for MV shops. Starting from an existing savedlist greatly improves performance, much as an index does when reporting on live data. Unlike an index, however, a savedlist is not updated once it has been built. Savedlists are minimalist data marts used to run a set of reports against a consistent set of IDs.
I have seen colleges use only savedlists for their census data extracts, saving these lists while running their census reports using option 1 from the last mewsing. If any other questions come up related to this census, they consider the live data, including historical records, adequate for the related decision-making. I'll admit that I would advise against this approach for census data, as I think the benefits outweigh the costs of saving some of the important attribute values at the time of the census. In this case a materialized view would likely be a better strategy.
Aside from census data, a registrar's office could create a fresh savedlist of registered students on a nightly basis, and other departments on campus could use that savedlist for their work too. No one other than the registrar's office would then need to understand the criteria for determining who, precisely, is registered for this term. One advantage of savedlists over full data marts is that savedlists can be fed to update processes as well as reporting processes. Depending on the application software, a savedlist could be handed to a process that sets the graduation date for everyone in the list, for example.
Another advantage of savedlists is that end-users can often decide what they want to put in one. Of course, approaches to security differ from shop to shop, but in places where end-users have such access, these power users are empowered with respect to their data. Without actually creating new tables or files (savedlists are handled differently) and without writing data extraction procedures, MV end-users employ a surf and turf strategy with their data from day to day.
- Materialized Attributes
With the current industry focus on operational data stores, which permit queries against current or near-current data values, there is a return to thinking about how to report against your production data without extracting it. Although such redundancy was once considered poor form, storing data derived from other data could be just the ticket to improved reporting performance. Aggregates and other derivations are costly when they must be performed on a large data set for a single report. If they can instead be computed in the background whenever an underlying value changes, the added cost of maintaining integrity for the redundant data might be more than offset by the savings in reporting. Performance against the live data can then be as good as from a data mart.
For example, if we materialize the attribute CURRENT_GPA for each student, we can run reports that require this data without having to read through all courses for all terms for each student in order to compute the GPA on the fly. A policy of no redundant data, held tightly in many shops, is one factor prompting huge redundancies in the form of data marts and warehouses. Yes, redundancy does have a cost, and a few well-chosen materialized attributes could save you from the more excessive redundancies found with extracted data.
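The trigger approach mentioned below can be sketched concretely. Here is a minimal SQLite example, driven from Python, that maintains a materialized CURRENT_GPA column whenever a grade row is inserted; the schema and the simple average-of-points GPA formula are hypothetical simplifications, and a production version would also handle updates and deletes:

```python
import sqlite3

# Hypothetical schema: students carry a stored (materialized) GPA that a
# trigger keeps in sync with the underlying grades table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE students (id INTEGER PRIMARY KEY, current_gpa REAL);
CREATE TABLE grades (student_id INTEGER, points REAL);

-- Recompute the derived value whenever the underlying data changes,
-- so reports can read current_gpa without scanning all grades.
CREATE TRIGGER grades_after_insert AFTER INSERT ON grades
BEGIN
  UPDATE students
     SET current_gpa = (SELECT AVG(points) FROM grades
                         WHERE student_id = NEW.student_id)
   WHERE id = NEW.student_id;
END;

INSERT INTO students (id, current_gpa) VALUES (1, NULL);
INSERT INTO grades VALUES (1, 4.0), (1, 3.0);
""")

gpa = con.execute(
    "SELECT current_gpa FROM students WHERE id = 1").fetchone()[0]
print(gpa)  # 3.5
```

The small cost paid on every grade insert buys a report that reads one stored value per student instead of aggregating the whole grades history on the fly.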
It surely isn't a new technique, but there is a revived interest in strategies for materializing (computing and storing) derived attributes. A colleague pointed me to Sybase as an example of a product that has tools for materializing derived data. I browsed the information on it without trying it out (lazy or busy, you can choose), and I'm impressed! With other products, developers can take the value of a user-defined function, virtual field, or stored procedure and store it using application code. They might do this with triggers or CRUD services that adjust the materialized values when the underlying values change.
Our final option is a blend of live, fluid data (surf) and extracted, fixed data (turf). It offers some of the advantages of reporting in the same environment as your production OLTP system along with some of the advantages of reporting against data marts or extracted data. Additionally, any new data (or data that was not originally extracted) is automatically accessible for reporting in combination with the data that has been extracted.
The surf and turf strategy lays an agile foundation for analytical reporting. I am a fan of this approach; however, you will also be able to weigh the cons of the various options once we dive into the descriptions.
It should come as no shock to those of you who have been reading these mewsings that SQL and MultiValue database environments take different approaches to surf and turf. Fortunately, each permits the user to perform steps that somewhat simulate the approach of the other. Users often stick with the first approach they find that works, however. So the first of the descriptions below is what SQL developers are more inclined to do, and the second is a technique Pickies often employ.
In case you are wondering where this is headed in relation to the overall topic of these mewsings, I'll give a hint, without yet providing any arguments. In short, a surf and turf strategy can result in significant savings, but is likely easier to pull off successfully, with smiling end-users, if you are not constrained by a SQL-based environment.
Getting back to today's topic, I have often been a bit queasy about the added costs vs. benefits for data marts and warehouses. Each situation is different, but in many cases, this added cost can be reduced significantly by doing surf and turf reporting.