Datasaurus

13 Oct 2020

This week’s Tidy Tuesday speaks to the importance of visualization in data exploration. Alberto Cairo created this simulated data set in order to demonstrate how misleading summary statistics can be and to show how useful visualization is in uncovering patterns in data. In this spirit, let’s start exploring this data set to see what we find.

## Rows: 1,846
## Columns: 3
## $ dataset <chr> "dino", "dino", "dino", "dino", "dino", "dino", "dino", "dino"…
## $ x       <dbl> 55.3846, 51.5385, 46.1538, 42.8205, 40.7692, 38.7179, 35.6410,…
## $ y       <dbl> 97.1795, 96.0256, 94.4872, 91.4103, 88.3333, 84.8718, 79.8718,…

##    dataset                x               y           
##  Length:1846        Min.   :15.56   Min.   : 0.01512  
##  Class :character   1st Qu.:41.07   1st Qu.:22.56107  
##  Mode  :character   Median :52.59   Median :47.59445  
##                     Mean   :54.27   Mean   :47.83510  
##                     3rd Qu.:67.28   3rd Qu.:71.81078  
##                     Max.   :98.29   Max.   :99.69468

##  [1] "dino"       "away"       "h_lines"    "v_lines"    "x_shape"   
##  [6] "star"       "high_lines" "dots"       "circle"     "bullseye"  
## [11] "slant_up"   "slant_down" "wide_lines"

We have 1,846 sets of x and y coordinates divided up into thirteen descriptive data sets.

## # A tibble: 13 x 6
##    dataset    Mean_X Mean_Y  SD_X  SD_Y    Corr
##    <chr>       <dbl>  <dbl> <dbl> <dbl>   <dbl>
##  1 away         54.3   47.8  16.8  26.9 -0.0641
##  2 bullseye     54.3   47.8  16.8  26.9 -0.0686
##  3 circle       54.3   47.8  16.8  26.9 -0.0683
##  4 dino         54.3   47.8  16.8  26.9 -0.0645
##  5 dots         54.3   47.8  16.8  26.9 -0.0603
##  6 h_lines      54.3   47.8  16.8  26.9 -0.0617
##  7 high_lines   54.3   47.8  16.8  26.9 -0.0685
##  8 slant_down   54.3   47.8  16.8  26.9 -0.0690
##  9 slant_up     54.3   47.8  16.8  26.9 -0.0686
## 10 star         54.3   47.8  16.8  26.9 -0.0630
## 11 v_lines      54.3   47.8  16.8  26.9 -0.0694
## 12 wide_lines   54.3   47.8  16.8  26.9 -0.0666
## 13 x_shape      54.3   47.8  16.8  26.9 -0.0656

These data sets have a lot in common. Specifically the x and y means, x and y standard deviations, and Pearson’s correlation coefficients are nearly identical.

Let’s try fitting each data set to a linear model.

## # A tibble: 26 x 6
##    dataset term        estimate std.error statistic  p.value
##    <chr>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
##  1 dino    (Intercept)  53.5        7.69      6.95  1.29e-10
##  2 dino    x            -0.104      0.136    -0.764 4.46e- 1
##  3 away    (Intercept)  53.4        7.69      6.94  1.31e-10
##  4 away    x            -0.103      0.135    -0.760 4.48e- 1
##  5 h_lines (Intercept)  53.2        7.70      6.91  1.53e-10
##  6 h_lines x            -0.0992     0.136    -0.732 4.66e- 1
##  7 v_lines (Intercept)  53.9        7.69      7.01  9.38e-11
##  8 v_lines x            -0.112      0.135    -0.824 4.12e- 1
##  9 x_shape (Intercept)  53.6        7.69      6.97  1.17e-10
## 10 x_shape x            -0.105      0.135    -0.778 4.38e- 1
## # … with 16 more rows

The intercept, slope and standard errors are all pretty much identical. Let’s plot these models and take a look.

The models match up nicely, but there’s a lot of noise and there seem to be some strong unexplained patterns in the underlying data. Let’s look at each data set individually.

These plots are much more different than the summary statisitcs would suggest!

That’s all for this week. Check out the code on GitHub.

Share!