Picking The Best Charts For Big Data

Get inspired - Move beyond lines and bar graphs with this guide.

Introduction

The best chart for your data depends on the data types of the variables in your visualisation. Most people work with 3 main data types:

  • Numeric data
  • Categorical data
  • Ordinal data

You may read this article here for a summary on these data types.

For the purpose of this guide, it might be useful to think of ordinal data as categorical data. Time series will also not be discussed in this guide.

The combination of these two data types (numeric and categorical) in a visualisation determines the best type of chart. In this guide, I will focus on the best charts for visualising large datasets with:

  • 1 numeric variable
  • 2 numeric variables
  • 1 numeric variable, 1 categorical variable
  • 2 numeric variables, 1 categorical variable
  • 1 numeric variable, 2 categorical variables

1 numeric variable: Histogram

Definition

A histogram shows the distribution of a numeric variable. It requires only 1 numeric variable as input. The x-axis is split into bins of equal width (e.g. 1 - 10, 11 - 20, 21 - 30, ...) and the y-axis shows the number of observations in each bin.
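To make the binning step concrete, here is a minimal sketch in plain Python. The function name `histogram_counts`, the sample `ages` list and the bin width of 10 are all made up for illustration; a plotting library would normally do this step for you.

```python
from collections import Counter

def histogram_counts(values, bin_width, start=0.0):
    """Count how many values fall into each equal-width bin."""
    counts = Counter()
    for v in values:
        # Bin i covers [start + i*bin_width, start + (i+1)*bin_width).
        index = int((v - start) // bin_width)
        counts[index] += 1
    # Only bins that contain at least one value are returned.
    return {
        (start + i * bin_width, start + (i + 1) * bin_width): counts[i]
        for i in sorted(counts)
    }

# Hypothetical sample data, for illustration only.
ages = [3, 7, 12, 15, 18, 21, 24, 25, 29, 41]
bins = histogram_counts(ages, bin_width=10)

# Print a rough text version of the histogram, one row per bin.
for (low, high), count in bins.items():
    print(f"{low:>4.0f} to {high:<4.0f} {'#' * count}")
```

For a large dataset, counting into bins like this first means you only need to draw one bar per bin rather than handle every observation at plot time.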

What is it used for?

A histogram is used to show the distribution of the dataset. The shape of the distribution gives you an insight into that variable. For example, an edge peak distribution could mean either an outlier you should take note of or that your dataset was incorrectly processed.

More tips

#1: Layer more than 1 histogram in the same visualisation

If you would like more granular insights on the distribution, you can layer a categorical variable onto the histogram, using a different color for each category. This lets you see how the distribution of values differs from one category to another. Alternatively, you may also use the box plot, as covered in the section below.
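A layered histogram only works if every layer shares the same bins. A rough sketch of that idea, assuming the data arrives as hypothetical `(category, value)` pairs and using a made-up helper name:

```python
from collections import defaultdict

def layered_histogram(rows, bin_width):
    """Return {category: {bin_start: count}}, with shared bins for all categories."""
    layers = defaultdict(lambda: defaultdict(int))
    for category, value in rows:
        # Every category uses the same bin edges, so the layers line up.
        bin_start = (value // bin_width) * bin_width
        layers[category][bin_start] += 1
    return {cat: dict(sorted(bins.items())) for cat, bins in layers.items()}

# Hypothetical sample data, for illustration only.
rows = [("A", 3), ("A", 12), ("A", 14), ("B", 11), ("B", 25), ("B", 27)]
layers = layered_histogram(rows, bin_width=10)
print(layers)
```

Each inner dictionary can then be drawn as one histogram layer in its own color.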

2 numeric variables: Scatterplot

Definition and what is it used for

A scatterplot is used to study the relationship between 2 numeric variables. It is often used to analyse linear relationships, and is hence often accompanied by a correlation coefficient.
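The guide does not say which correlation coefficient to use. As one common choice, here is a small sketch of the Pearson correlation coefficient in plain Python, with made-up sample values:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample data: y is exactly 2 * x, a perfect linear relationship.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r = pearson_r(xs, ys)
print(round(r, 3))
```

A value close to 1 or -1 suggests a strong linear relationship, while a value close to 0 suggests little or none. It is still worth looking at the scatterplot itself, since the coefficient only measures linear relationships.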

More tips on the scatterplot

#1: Scatterplots are less useful for discrete numerical data

If you are using discrete numerical data, many points will land on exactly the same coordinates and hide one another, so you might want to consider using a ridge plot instead.

#2: Add marginal distribution

Adding marginal distributions to your scatterplot also lets you see how each variable is distributed along the x and y axes.

#3: Make scatterplots interactive

Scatterplots are most useful to your end-users if they are interactive. Ideally, end-users should be able to hover their mouse over a single data point and find out more about it.

1 numeric variable, 1 categorical variable: Violin Plot

Definition and what it is for

A violin plot lets you analyse the distribution of a numeric variable across several categories or ordinal variables. The shape represents the density estimate of the variable: the more observations there are around a given value, the wider the violin at that value.

It is really close to a box plot, but allows a deeper understanding of the distribution.
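To show where the shape of a violin comes from, here is a minimal sketch of a Gaussian kernel density estimate in plain Python. The `bandwidth` value and the `scores` sample are assumptions for illustration; plotting libraries usually pick the bandwidth for you.

```python
import math

def kernel_density(values, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Each observation contributes a small bell curve centred on itself.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density

# Hypothetical sample data: four values clustered near 1.0 and one near 3.0.
scores = [1.0, 1.2, 1.1, 0.9, 3.0]
density = kernel_density(scores, bandwidth=0.3)

# The violin for this category would be widest around 1.0.
for x in (1.0, 2.0, 3.0):
    print(x, round(density(x), 3))
```

The width of the violin at each value is proportional to this estimated density, which is why a cluster of observations shows up as a bulge.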

More tips for the violin plot

#1: Including a boxplot within a violin plot

While the violin plot gives you a very granular analysis of the distribution, including a boxplot within a violin plot gives you the added benefit of viewing median and quartiles easily.
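The median and quartiles that the inner boxplot displays can be computed per category with Python's `statistics` module. This is a rough sketch with a made-up helper name and sample data; note that `statistics.quantiles` defaults to its "exclusive" method, and other tools may use slightly different quartile conventions.

```python
import statistics
from collections import defaultdict

def box_summary(rows):
    """Return {category: {"q1", "median", "q3"}} from (category, value) pairs."""
    grouped = defaultdict(list)
    for category, value in rows:
        grouped[category].append(value)
    summary = {}
    for category, values in grouped.items():
        # n=4 splits the data into quartiles and returns the 3 cut points.
        q1, median, q3 = statistics.quantiles(values, n=4)
        summary[category] = {"q1": q1, "median": median, "q3": q3}
    return summary

# Hypothetical sample data, for illustration only.
rows = [("A", v) for v in [1, 2, 3, 4, 5, 6, 7]]
rows += [("B", v) for v in [10, 20, 30, 40, 50]]
summary = box_summary(rows)
for category, stats in summary.items():
    print(category, stats)
```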

#2: Other graphs you can use for 1 numeric, 1 categorical variable
  • Box Plot

1 numeric variable, 2 categorical variables: Heatmap

Definition and what it is for

A heatmap shows the magnitude of a phenomenon as color in two dimensions. The variation in color might be in hue or intensity. This gives your end-user a clear view of how the value varies across the different categories.
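Before a heatmap can be drawn, the numeric variable has to be summarised into one value per pair of categories. The sketch below uses the mean as the summary, which is an assumption (a sum or count may suit your data better), and the record layout and names are hypothetical.

```python
from collections import defaultdict

def heatmap_grid(records):
    """Average a numeric value for every (row category, column category) pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row_cat, col_cat, value in records:
        sums[(row_cat, col_cat)] += value
        counts[(row_cat, col_cat)] += 1
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    # Cells with no observations are left as None rather than 0.
    grid = [
        [sums[(r, c)] / counts[(r, c)] if (r, c) in counts else None for c in cols]
        for r in rows
    ]
    return rows, cols, grid

# Hypothetical sample data: (day, time of day, value).
records = [("Mon", "AM", 10), ("Mon", "AM", 20), ("Mon", "PM", 30), ("Tue", "AM", 40)]
rows, cols, grid = heatmap_grid(records)
print(cols)
for label, line in zip(rows, grid):
    print(label, line)
```

Each cell of `grid` is then mapped to a color. Because the grid has one cell per pair of categories, its size does not grow with the number of observations, which is part of what makes heatmaps practical for large datasets.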

2 numeric variables, 1 categorical variable: Grouped scatterplot

Definition and what it is for

Similar to the scatterplot above, a grouped scatterplot plots 2 numeric variables, and additionally uses the color of each dot to encode the categorical variable. This allows your end-user to see how each category is distributed over the 2 numerical variables.
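One way to prepare data for a grouped scatterplot is to split the points into one series per category and give each series its own color. This is only a sketch: the `PALETTE` colors, the helper name and the sample points are all made up for illustration.

```python
from collections import defaultdict

# Assumed color names; swap in whatever palette your plotting tool uses.
PALETTE = ["blue", "orange", "green"]

def group_points(points):
    """Split (x, y, category) points into one colored series per category."""
    series = defaultdict(lambda: {"x": [], "y": []})
    for x, y, category in points:
        series[category]["x"].append(x)
        series[category]["y"].append(y)
    # Categories are sorted so that each one always gets the same color.
    return {
        category: {**data, "color": PALETTE[i % len(PALETTE)]}
        for i, (category, data) in enumerate(sorted(series.items()))
    }

# Hypothetical sample data, for illustration only.
points = [(1.0, 2.0, "cat"), (2.0, 1.5, "dog"), (3.0, 3.5, "cat")]
grouped = group_points(points)
for category, data in grouped.items():
    print(category, data)
```

Colors repeat once there are more categories than palette entries, so this works best with a small number of categories.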