This post was originally published on codeandnumbers.co.uk on 19th September 2021
This one goes in the box marked ‘I should have thought about this a long time ago’.
I’ve been creating charts and reports for a long time and never really stepped back to consider why I tend to gravitate towards presenting data in an ‘aggregate by all useful dimensions’ format. Having spent some time thinking about this and discussing my thoughts with Alastair McKinley, I feel I can present this in a meaningful manner. I feel almost certain that this has been covered in great detail elsewhere, but I spent a long time searching and found nothing.
I tend to think about a dataset as a collection of:
- dimensions - the fields used to slice and group the records
- metrics - the values that are measured or aggregated
Most visualisations show data that has been aggregated in some way; the data is summarised before being displayed. There are an infinite number of ways that data can be aggregated, from simple counts of records over a dimension to running complex machine learning models over all records in a dimension.
As with many things, there’s a sliding scale between flexibility and performance in how we present this data - more flexible approaches are less efficient, and more efficient approaches are less flexible. The three key approaches for presenting data for visualisation are to present:
- all of the records
- a Cube - the data aggregated by every combination of the dimensions
- a Grouped view - the data aggregated by a single combination of dimensions
Presenting all records is the most flexible, as the person or process receiving that data can perform whatever aggregation they wish on the data. However, there are many situations where this is not preferable, e.g.:
- the dataset is too large to transfer and process efficiently
- the records contain personal or proprietary data that should not be shared in full
Note that aggregating your data does not ensure that personal/proprietary data cannot be extracted from your dataset - you may still need to apply statistical disclosure control to reduce this risk.
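One common disclosure-control step is small-cell suppression - hiding aggregated cells whose counts are small enough to risk identifying individuals. A minimal sketch in Python; the threshold of 5 and the `suppress_small_cells` helper are illustrative choices, not a standard:

```python
# Illustrative threshold: real thresholds are policy decisions,
# often combined with secondary suppression and rounding.
THRESHOLD = 5

def suppress_small_cells(rows, threshold=THRESHOLD):
    """Replace counts below the threshold so individuals can't be inferred
    from rare dimension combinations. `rows` is (dimension, count) pairs."""
    return [
        (dim, count if count >= threshold else None)
        for dim, count in rows
    ]

print(suppress_small_cells([("a", 12), ("b", 3), ("c", 7)]))
# → [('a', 12), ('b', None), ('c', 7)]
```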
In SQL this looks like:
SELECT
dim1,
dim2,
metric
FROM mytable;
The Cube approach is necessary when you need to show non-aggregatable metrics, such as the distinct number of records for a combination of dimensions. It requires additional effort within the visualisations to filter the data so that only the relevant parts of your Cube are shown - otherwise your metrics may be counted many times over. The benefit is that it is likely still more efficient than presenting all the records.
In SQL this looks like:
SELECT
dim1,
dim2,
agg_function(metric)
FROM mytable
GROUP BY GROUPING SETS ((dim1, dim2), (dim1), (dim2), ());
Which would produce four views of the dataset: one grouped by both dimensions together, one grouped by each dimension individually, and one overall total across all records (the empty grouping set).
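To make this concrete, here is a minimal sketch in Python using the standard-library sqlite3 module and invented toy data. SQLite doesn't support GROUPING SETS, so the same result is built as a UNION ALL of one GROUP BY per grouping set - which is also a useful way to think about what GROUPING SETS does. NULL marks a dimension that has been rolled up:

```python
import sqlite3

# Toy data, invented for illustration: two dimensions and one metric.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (dim1 TEXT, dim2 TEXT, metric REAL)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [("a", "x", 1.0), ("a", "y", 2.0), ("b", "x", 3.0), ("b", "y", 4.0)],
)

# Emulate GROUP BY GROUPING SETS ((dim1, dim2), (dim1), (dim2), ())
# as a UNION ALL of the four individual GROUP BY queries.
cube = conn.execute("""
    SELECT dim1, dim2, SUM(metric) FROM mytable GROUP BY dim1, dim2
    UNION ALL
    SELECT dim1, NULL, SUM(metric) FROM mytable GROUP BY dim1
    UNION ALL
    SELECT NULL, dim2, SUM(metric) FROM mytable GROUP BY dim2
    UNION ALL
    SELECT NULL, NULL, SUM(metric) FROM mytable
""").fetchall()

for row in cube:
    print(row)  # 4 + 2 + 2 + 1 = 9 rows; (None, None, 10.0) is the grand total
```

This also shows why a visualisation consuming a Cube must filter to exactly one grouping set: summing all nine rows would count every record four times.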
The Grouped approach is the most performant, as it calculates and transfers only the data that is needed. However, it is not usable where non-aggregatable measurements are required - such as counts of distinct items, or median values. It also requires some extra effort in visualisations to further aggregate the data where not all dimensions are used.
In SQL this looks like:
SELECT
dim1,
dim2,
agg_function(metric)
FROM mytable
GROUP BY dim1, dim2;
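A small sketch (invented data, sqlite3 from the standard library) shows why the Grouped approach breaks down for distinct counts: plain counts from a Grouped view can be rolled up further in the visualisation, but distinct counts cannot, because the same underlying item can appear in several groups:

```python
import sqlite3

# Toy data: user 1 appears under both ("a", "x") and ("a", "y").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (dim1 TEXT, dim2 TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [("a", "x", 1), ("a", "y", 1), ("b", "x", 2), ("b", "y", 3)],
)

# Grouped view: one row per (dim1, dim2) combination.
grouped = conn.execute("""
    SELECT dim1, dim2, COUNT(*) AS n, COUNT(DISTINCT user_id) AS users
    FROM mytable GROUP BY dim1, dim2
""").fetchall()

# Plain counts re-aggregate correctly: summing grouped rows over dim2
# matches grouping the raw records by dim1 alone.
rolled_up_n = sum(n for d1, d2, n, users in grouped if d1 == "a")       # 2

# Distinct counts do not: summing them double-counts user 1.
rolled_up_users = sum(users for d1, d2, n, users in grouped if d1 == "a")  # 2
true_users = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM mytable WHERE dim1 = 'a'"
).fetchone()[0]                                                          # 1
```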
Sometimes datasets are so large that, even with Cubes or Grouped views, they need to be filtered or sampled before they can be visualised effectively. The challenge then becomes how to create a good user experience when not all data is available at the same time. Possible solutions include query or drill-down type visualisations where the user controls how they explore the data.
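A drill-down interaction can be sketched as a pair of queries: aggregate at the top level first, then fetch detail only for the slice the user selects. The table, data, and LIMIT value below are all invented for illustration:

```python
import random
import sqlite3

# Invented dataset: 1000 records over one dimension.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (dim1 TEXT, metric REAL)")
random.seed(0)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(random.choice("ab"), random.random()) for _ in range(1000)],
)

# Step 1: ship only the top-level aggregate (a handful of rows).
top = conn.execute(
    "SELECT dim1, COUNT(*) FROM mytable GROUP BY dim1").fetchall()

# Step 2: when the user drills into one slice, fetch a bounded amount of
# detail for that slice alone, instead of transferring everything up front.
selected = top[0][0]  # e.g. the slice the user clicked on
detail = conn.execute(
    "SELECT metric FROM mytable WHERE dim1 = ? LIMIT 100",
    (selected,)).fetchall()
```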