Generating Database Server-Side Cross Tabulations
by Giuseppe Maxia12/04/2003
If you are a DBA or a database programmer, your manager may have asked you to prepare a breakdown of employees by location and department, or a breakdown of products by category and provider. If your boss is a techie, instead of a breakdown she might have mentioned a pivot table, or, if she likes to use "the right words," a cross-tabulation.
Cross-tabulations, also known as "pivot tables" or crosstabs, are statistical reports that group data by one field, creating one column for each distinct value of another field. In colloquial terms, this way of representing data is called "breaking down the data by X and Y," where X and Y are the names of two columns in the dataset.
In the good old days before computers were on every desk, you would take the list of employees and manually count how many males and females were in each department. After spreadsheets became a common working tool, you would use some cute formulas to count the values. Advanced spreadsheets available since the early 1990s could do the whole report for you from a bare list.
In its basic form, a crosstab is as simple as this: starting from a list of values, we want to group them by the values of field A and create a column for each distinct value of field B. The desired result is a table with one column for field A and several columns for each value of field B.
Now, you would think that a database engine, being more powerful than a spreadsheet, would solve this problem quite easily. When your dataset is so large that it won't fit into a spreadsheet, you start digging for an SQL solution. Unfortunately, in standard SQL there is no predefined function to accomplish this task. Nonetheless, you can get the job done with some creativity. Let's assume that your bare data resembles this table.
+------------------------------+
| person |
+----+--------+--------+-------+
| id | name | gender | dept |
+----+--------+--------+-------+
| 1 | John | m | pers |
| 2 | Mario | m | pers |
| 7 | Mary | f | pers |
| 8 | Bill | m | pers |
| 3 | Frank | m | sales |
| 5 | Susan | f | sales |
| 6 | Martin | m | sales |
| 4 | Otto | m | dev |
| 9 | June | f | dev |
+----+--------+--------+-------+
Given the column gender, having distinct values m
and f, a crosstab between dept and gender would generate a row
for each dept, with a column for the dept, one for m, and one
for f.
The task is not trivial, even when the distinct values are known in advance and are relatively scarce. A solution that should work for most DBMS engines from this simple table is:
SELECT dept,
COUNT(CASE WHEN gender ='m' THEN id ELSE NULL END) AS 'm',
COUNT(CASE WHEN gender ='f' THEN id ELSE NULL END) AS 'f'
FROM
person
GROUP BY
dept
It's easy to get discouraged when the possible values increase by a dozen, especially if they may change.
Easy solutions
The easy way of performing a cross-tabulation, as I mentioned before, is to export your data locally, load it into a spreadsheet, and use its pivot table capabilities (also called "data pilot," depending on the flavor of your spreadsheet). This solution, affordable when you have to perform the task just once and with a limited dataset, becomes intolerable when you can count your records by the millions. In such cases, a spreadsheet is not the proper tool, and you should turn your attention to DBMS calculation skills.
A hybrid solution would be to export only aggregated data and finish the job with a spreadsheet. Again, this is acceptable when you have to do it once or twice, but if you need those tables as support for your data warehouse, then you are putting too much effort on the client side, and perhaps devoting too much time dealing with macros and supporting code, to accomplish your task in a manner that looks smooth to the final user. There are, of course, specialized tools that take the burden off of your shoulders to do the dirty work for you. The good ones are on the server side and give you the final product without sweating. The bad ones are often colorful and rich in features but they shift the workload to the client machine, resulting in a slow process that makes everybody unhappy.
A viable alternative, then, is a server-side cross-tabulation calculated by the database server. For programmers unaccustomed to SQL, it may seem a Herculean task, and even the smart programmer may look at the complexity ahead with a wary eye. What is needed is a tool that reduces the complexity of the task — building a sometimes-huge SQL statement — to a few parameters that will be translated into the actual query.
DBIx::SQLCrosstab is a Perl module that can do exactly that. But before plunging into details, it would be better to analyze what a crosstab report exactly is, how complex it can get, and what you can ask the DBMS to do.
Cross-Tabulation Basics
Using the same raw data seen earlier, let's look at some trivial possibilities.
A simple crosstab between dept and gender gives us:
+-------------------------+
| dept by gender |
+-------+----+----+-------+
| dept | m | f | total |
+-------+----+----+-------+
| dev | 1 | 1 | 2 |
| pers | 3 | 1 | 4 |
| sales | 2 | 1 | 3 |
+-------+----+----+-------+
The query to create this result is:
SELECT
dept,
COUNT(CASE WHEN gender = 'm' THEN id ELSE NULL END) AS m,
COUNT(CASE WHEN gender = 'f' THEN id ELSE NULL END) AS f,
COUNT(*) AS total
FROM
person
GROUP BY
dept
This crosstab is trivial for several reasons:
- The data is taken from one table only.
- The cross is between one value horizontally and one vertically.
- No totals are shown at the end of the rows.
- Since there is only one value, there is no need for subtotals either by row or by column.
Before going on to explore further reasons for complexity, I want to explore
the basic method of cross-tabulating with SQL, assisted by Perl. Suppose that
we don't know in advance which values are stored for gender. In
that case, we need to scan the dataset first.
SELECT DISTINCT
gender
FROM
person
Then we can take the result values to create the COUNT lines in
the query. Here is when a supporting high-level programming language comes in
handy. If your DBMS supports stored procedures, you can do that directly on the
server, but its portability is questionable, given the variety of dialects in
that field.
A Perl implementation of this task should be:
my $columns = $dbh->selectcol_arrayref(
"SELECT DISTINCT gender FROM person"
) or die "query failed: $DBI::errstr\n";
my $query = "SELECT dept\n";
for (@$columns) {
$query .= ",COUNT(CASE WHEN gender = '$_' THEN id ELSE NULL END) AS '$_'\n";
}
$query .= ", COUNT(*) AS total\n"
$query .= "FROM person\n"
. "GROUP BY dept";
The resulting query would be:
SELECT dept
,COUNT(CASE WHEN gender = 'm' THEN id ELSE NULL END) AS 'm'
,COUNT(CASE WHEN gender = 'f' THEN id ELSE NULL END) AS 'f'
,COUNT(*) AS total
FROM person
GROUP BY dept
This is almost the same as the one we did manually — minus some stylistic tidbits perhaps, but it gets the job done.
You must have realized that there are some issues with this already. For
example, the header column is hardcoded, as is the total line and the
GROUP BY clause. This code, while saving you some keystrokes, is
not useful in a general case. You may be tempted to make a template out of it,
but wait! There is more to be seen before coding a solution.
Introducing Complex Crosstabs
Cross-tabulations can get complex in several ways, especially when the level of headers increases, and when several operations are performed in the same result. The level of headers can be further complicated when the values in the column headers have internal dependencies. Consider the case where you have a first level of countries and a second level of locations, where a given location can only belong to a single country. I will explain each issue in detail.
Multi-Level Cross-Tabulations
If single-level cross-tabulations haven't worried you, multiple level tables should give you something to ponder. In addition to everything already mentioned, multiple-level crosstabs have:
Query-composition complexity. Each column is the combination of several conditions, one for each level.
Column subtotals. To be inserted after the appropriate section.
Row subtotals. To be inserted after the relevant rows.
An explosive increase of column numbers. For a three-level crosstab where each level has three values, you get 27 columns. If you include subtotals, your number rises to 36. If you have just a few levels with five or six values, you may be counting rows by the hundreds.
Visualization problems. While the result set from the DBMS is a simple matrix, the conceptual table has a visualization tree at the top (for columns) and a visualization tree at the left side (for rows).
| A | B | C1 | C2 | Total | column header 1 | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D1 | D2 | Total | D1 | D2 | Total | column header 2 | |||||||||||
| E1 | E2 | Total | E1 | E2 | Total | E1 | E2 | Total | E1 | E2 | Total | column header 3 | |||||
| A1 | B1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | |
| B2 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | ||
| Total | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | row sub total | |
| A2 | B1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | |
| B2 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | ||
| Total | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | row sub total | |
| Total | --- | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | row total |
| row header 1 |
row header 2 |
col sub total |
col sub total |
col sub total |
col sub total |
col sub total |
col sub total |
col total |
|||||||||
Solving this type of cross-tabulation follows the general guidelines for single-level tabulations. First we collect the values for each column header, and then we produce the lines by combining the distinct values for each row.
Consider the following test database. (A sample DBIx::SQLCrosstab database schema is also available.)
+-------------------+--------------------+
| categories | countries |
| ================= | ================== |
| cat_id category | country_id country |
| ------ ---------- | ---------- ------- |
| 1 employee | 1 UK |
| 2 contractor | 2 Germany |
| 3 consultant | 3 Italy |
+-------------------+--------------------+
+-------------------+--------------------------+
| depts | locs |
| ================= | ======================== |
| dept_id dept | loc_id loc country_id |
| ------- -------- | ------ ------ ---------- |
| 1 pers | 1 Rome 3 |
| 2 sales | 2 London 1 |
| 3 dev | 3 Munich 2 |
| 4 research | 4 Berlin 2 |
| | 5 Bonn 2 |
+-------------------+--------------------------+
+----------------------------------------------+
| person |
| ============================================ |
| id name dept_id loc_id gender salary cat_id |
| -- ----- ------- ------ ------ ------ ------ |
| 1 John 1 2 m 5000 2 |
| 2 Mario 1 1 m 6000 1 |
| 3 Frank 2 5 m 5000 1 |
| 4 Otto 3 4 m 6000 1 |
| 5 Susan 2 3 f 5500 3 |
| 6 Martin 2 2 m 5500 2 |
| 7 Mary 1 4 f 5500 1 |
| 8 Bill 1 3 m 5000 1 |
| 9 June 3 1 f 6000 3 |
+----------------------------------------------+
We want to get the number of persons by crossing country and location with department and gender.
Step 1. Collect the Values
SELECT DISTINCT dept FROM depts; # results in 'pers', 'sales, 'dev'
SELECT DISTINCT gender FROM person; # results in 'm', 'f'
Step 2. Assemble the Query
SELECT
country, loc AS location,
COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
AS 'pers-f',
COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
AS 'pers-m',
COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
AS 'sales-f',
COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
AS 'sales-m',
COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
AS 'dev-f',
COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
AS 'dev-m',
COUNT(*) AS total
FROM
person
INNER JOIN
depts ON (person.dept_id=depts.dept_id)
INNER JOIN
locs ON (locs.loc_id=person.loc_id)
INNER JOIN
countries ON (locs.country_id=countries.country_id)
GROUP BY
country, loc
You can see that making this task automatic is a bit trickier than the previous one. Notice that now the column names can be composite, so we need to add a character to separate their components. The choice of such a character can be difficult if the column values already contain non-alphabetic letters. Notice I said "values," not "names," since the values from a column become column names in the crosstab. In addition, if you want subtotals by column, you need to add a count by department in the appropriate places, and our query would become:
SELECT
country, loc AS location,
COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
AS 'pers-f',
COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
AS 'pers-m',
COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END)
AS 'pers',
COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
AS 'sales-f',
COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
AS 'sales-m',
COUNT(CASE WHEN dept = 'sales' THEN id ELSE NULL END)
AS 'sales',
COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
AS 'dev-f',
COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
AS 'dev-m',
COUNT(CASE WHEN dept = 'dev' THEN id ELSE NULL END)
AS 'dev',
COUNT(*) AS total
FROM
person
INNER JOIN
depts ON (person.dept_id=depts.dept_id)
INNER JOIN
locs ON (locs.loc_id=person.loc_id)
INNER JOIN
countries ON (locs.country_id=countries.country_id)
GROUP BY
country, loc
The result is here (don't worry about the formatting, for now. We'll come to it soon).
| country | location | pers | sales | dev | total | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| f | m | total | f | m | total | f | m | total | |||
| Germany | Berlin | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| Bonn | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | |
| Munich | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | |
| Italy | Rome | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 2 |
| UK | London | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 2 |
What about row subtotals and total? You can calculate them in the client,
but assuming that you want them all at once, you can use a UNION query, such as this monster:
SELECT
country, loc AS location,
COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
AS 'pers-f',
COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
AS 'pers-m',
COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END)
AS 'pers',
COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
AS 'sales-f',
COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
AS 'sales-m',
COUNT(CASE WHEN dept = 'sales' THEN id ELSE NULL END)
AS 'sales',
COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
AS 'dev-f',
COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
AS 'dev-m',
COUNT(CASE WHEN dept = 'dev' THEN id ELSE NULL END)
AS 'dev',
COUNT(*) AS total
FROM
person
INNER JOIN
depts ON (person.dept_id=depts.dept_id)
INNER JOIN
locs ON (locs.loc_id=person.loc_id)
INNER JOIN
countries ON (locs.country_id=countries.country_id)
GROUP BY
country, location
UNION
SELECT
country, 'zzzz' AS location,
COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
AS 'pers-f',
COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
AS 'pers-m',
COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END)
AS 'pers',
COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
AS 'sales-f',
COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
AS 'sales-m',
COUNT(CASE WHEN dept = 'sales' THEN id ELSE NULL END)
AS 'sales',
COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
AS 'dev-f',
COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
AS 'dev-m',
COUNT(CASE WHEN dept = 'dev' THEN id ELSE NULL END)
AS 'dev',
COUNT(*) AS total
FROM
person
INNER JOIN
depts ON (person.dept_id=depts.dept_id)
INNER JOIN
locs ON (locs.loc_id=person.loc_id)
INNER JOIN
countries ON (locs.country_id=countries.country_id)
GROUP BY
country, location
UNION
SELECT
'zzzz' AS country, 'zzzz' AS location,
COUNT(CASE WHEN dept = 'pers' AND gender = 'f' THEN id ELSE NULL END)
AS 'pers-f',
COUNT(CASE WHEN dept = 'pers' AND gender = 'm' THEN id ELSE NULL END)
AS 'pers-m',
COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END)
AS 'pers',
COUNT(CASE WHEN dept = 'sales' AND gender = 'f' THEN id ELSE NULL END)
AS 'sales-f',
COUNT(CASE WHEN dept = 'sales' AND gender = 'm' THEN id ELSE NULL END)
AS 'sales-m',
COUNT(CASE WHEN dept = 'sales' THEN id ELSE NULL END)
AS 'sales',
COUNT(CASE WHEN dept = 'dev' AND gender = 'f' THEN id ELSE NULL END)
AS 'dev-f',
COUNT(CASE WHEN dept = 'dev' AND gender = 'm' THEN id ELSE NULL END)
AS 'dev-m',
COUNT(CASE WHEN dept = 'dev' THEN id ELSE NULL END)
AS 'dev',
COUNT(*) AS total
FROM
person
INNER JOIN
depts ON (person.dept_id=depts.dept_id)
ORDER BY
country, location
The first UNION creates the subtotals by country. I replaced
the location name with zzzz to be sure that it will get sorted at
the end of the normal rows for each country. Similarly, the second
UNION has both country and location replaced by zzzz,
so that they will be listed at the very end of the result set. The first
UNION subquery will create one row for each country. The last
subquery will create only one row, for the grand total. Notice that the last
UNION subquery doesn't have a GROUP BY clause. The
ORDER BY clause at the end of the query it will affect the whole
result set.
The result, nicely formatted, with the zzzz translated into a
more readable total, is this:
| country | location | pers | sales | dev | total | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| f | m | total | f | m | total | f | m | total | |||
| Germany | Berlin | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| Bonn | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | |
| Munich | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | |
| Total | 1 | 1 | 2 | 1 | 1 | 2 | 0 | 1 | 1 | 5 | |
| Italy | Rome | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 2 |
| Total | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | |
| UK | London | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 2 |
| Total | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 2 | |
| Total | Total | 1 | 3 | 4 | 1 | 2 | 3 | 1 | 1 | 2 | 9 |
Automating this task is much harder than a simple crosstab, and the full explanation is beyond the scope of this article. Stay tuned to Perl Monks if you are interested in the implementation details.
Pages: 1, 2 |



