Agenda


  • build histogram
  • specify bins
  • modify
    • color
    • fill
    • alpha
    • bin width
    • line type
    • line size
  • map aesthetics to variables

Intro


A histogram is a plot used to examine the shape and spread of continuous data. It looks similar to a bar graph, but each bar covers a range of values rather than a single category, which makes it useful for detecting outliers and skewness. A histogram shows the following:

  • center (location) of the data
  • spread (dispersion) of the data
  • skewness
  • outliers
  • presence of multiple modes
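
The properties listed above also have simple numeric counterparts, which can help when reading a histogram. The following is a minimal base R sketch on a hypothetical right-skewed sample (the vector `x` is made up for illustration and is not the ecom data); the skewness formula used is the plain moment-based one, which is only one of several conventions.

```r
# Hypothetical right-skewed sample (not the ecom data)
x <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 12)

# Center (location) and spread (dispersion)
center <- c(mean = mean(x), median = median(x))
spread <- c(sd = sd(x), iqr = IQR(x))

# Moment-based skewness: positive values indicate a longer right tail
m <- mean(x)
skew <- mean((x - m)^3) / mean((x - m)^2)^1.5

# Points beyond 1.5 * IQR from the hinges, as flagged by a box plot
outliers <- boxplot.stats(x)$out

print(center)
print(skew)
print(outliers)
```

Here the mean lies above the median and the skewness is positive, which matches the long right tail a histogram of `x` would show.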

Histograms


To construct a histogram:

  • the data is split into intervals called bins
  • the intervals may or may not be equal sized
  • for each bin, the number of data points that fall into it is counted (frequency)
  • the Y axis of the histogram represents the frequency
  • the X axis represents the variable
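
The binning and counting steps above can be sketched in base R without drawing anything. This is an illustrative example on a hypothetical vector `visits` (not the ecom data), using `cut()` to assign each value to a bin and `table()` to count the frequencies.

```r
# Hypothetical sample of visit counts (not the ecom data)
visits <- c(0, 1, 1, 2, 3, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10)

# Split the range 0 to 10 into five equal-width bins:
# [0,2], (2,4], (4,6], (6,8], (8,10]
breaks <- seq(0, 10, by = 2)
bins <- cut(visits, breaks = breaks, include.lowest = TRUE)

# Count how many observations fall into each bin (the frequency)
freq <- table(bins)
print(freq)
```

Each entry of `freq` is the height of one bar: the bins go on the X axis and the counts on the Y axis.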

Libraries


library(ggplot2)
library(dplyr)
library(tidyr)

Data


ecom <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/web.csv')
ecom
## # A tibble: 1,000 x 11
##       id referrer device bouncers n_visit n_pages duration        country
##    <int>    <chr>  <chr>    <chr>   <int>   <dbl>    <dbl>          <chr>
##  1     1   google laptop     true      10       1      693 Czech Republic
##  2     2    yahoo tablet     true       9       1      459          Yemen
##  3     3   direct laptop     true       0       1      996         Brazil
##  4     4     bing tablet    false       3      18      468          China
##  5     5    yahoo mobile     true       9       1      955         Poland
##  6     6    yahoo laptop    false       5       5      135   South Africa
##  7     7    yahoo mobile     true      10       1       75     Bangladesh
##  8     8   direct mobile     true      10       1      908      Indonesia
##  9     9     bing mobile    false       3      19      209    Netherlands
## 10    10   google mobile     true       6       1      208 Czech Republic
## # ... with 990 more rows, and 3 more variables: purchase <chr>,
## #   order_items <dbl>, order_value <dbl>

Data Dictionary


  • id: row id
  • referrer: referrer website/search engine
  • device: device used to visit the website
  • bouncers: whether the visit bounced (true/false), judging by the column name
  • n_visit: frequency of visits
  • n_pages: number of pages visited
  • duration: time spent on the website (in seconds)
  • country: country of origin
  • purchase: whether visitor purchased
  • order_items: number of items in the order, judging by the column name
  • order_value: order value of visitor (in dollars)

Histogram


ggplot(ecom) +
  geom_histogram(aes(n_visit))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Specify Bins


ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7)
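
To see which intervals a given `bins` value produces, the computed bin table can be pulled out of the plot with `layer_data()`. The sketch below uses a hypothetical data frame `visits` in place of `ecom`, so it runs without downloading the data; the sample itself is random and only illustrative.

```r
library(ggplot2)

# Hypothetical data standing in for ecom$n_visit
set.seed(42)
visits <- data.frame(n_visit = sample(0:10, 200, replace = TRUE))

p <- ggplot(visits) +
  geom_histogram(aes(n_visit), bins = 7)

# layer_data() returns the data computed for a layer,
# including the bin edges (xmin, xmax) and the count per bin
bin_table <- layer_data(p)[, c('xmin', 'xmax', 'count')]
print(bin_table)
```

The table has one row per bin, and the counts add up to the number of observations.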

Fill


ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7, fill = 'blue')

Alpha


ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7, fill = 'blue', alpha = 0.3)

Color


ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7, fill = 'white', color = 'blue')

Bins, Color & Fill


ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7, fill = 'blue', color = 'white')

Bin Width


ggplot(ecom) +
  geom_histogram(aes(n_visit), binwidth = 2, fill = 'blue', color = 'black')
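
Bins do not have to start at an arbitrary position or be equal sized. As a sketch, assuming the same kind of data as above (the data frame `visits` is hypothetical), `boundary` aligns the bin edges and `breaks` supplies explicit, possibly unequal, edges that override `bins` and `binwidth`.

```r
library(ggplot2)

# Hypothetical data standing in for ecom$n_visit
set.seed(42)
visits <- data.frame(n_visit = sample(0:10, 200, replace = TRUE))

# Equal-width bins of width 2, with edges aligned to multiples of 2
p_equal <- ggplot(visits) +
  geom_histogram(aes(n_visit), binwidth = 2, boundary = 0,
    fill = 'blue', color = 'black')

# Unequal bins: explicit bin edges at 0, 1, 3, 6 and 10
p_unequal <- ggplot(visits) +
  geom_histogram(aes(n_visit), breaks = c(0, 1, 3, 6, 10),
    fill = 'blue', color = 'black')

# Inspect the resulting bin widths
widths_equal <- with(layer_data(p_equal), xmax - xmin)
widths_unequal <- with(layer_data(p_unequal), xmax - xmin)
print(widths_equal)
print(widths_unequal)
```

With unequal bins, bar heights based on raw counts can be misleading because wider bins collect more observations; plotting density instead, for example with `aes(n_visit, y = after_stat(density))`, is one way to compensate.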

Line Type


ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 5, fill = 'white', 
    color = 'blue', linetype = 3)
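
Line types can be given by name as well as by number. In ggplot2 the integers 0 to 6 correspond to 'blank', 'solid', 'dashed', 'dotted', 'dotdash', 'longdash' and 'twodash', so `linetype = 3` above is the dotted line. A small sketch on a hypothetical data frame `visits` (not the ecom data):

```r
library(ggplot2)

# Hypothetical data standing in for ecom$n_visit
set.seed(42)
visits <- data.frame(n_visit = sample(0:10, 200, replace = TRUE))

# Same appearance as linetype = 3, but spelled out by name
p <- ggplot(visits) +
  geom_histogram(aes(n_visit), bins = 5, fill = 'white',
    color = 'blue', linetype = 'dotted')

# The fixed linetype shows up in the data computed for the layer
print(unique(layer_data(p)$linetype))
```

Using the name tends to make the code easier to read than the numeric code.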

Line Size


ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 5, fill = 'white', 
    color = 'blue', size = 1.25)  # in ggplot2 >= 3.4.0, linewidth = 1.25 is preferred over size

Map Fill to Variable


ggplot(ecom) +
  geom_histogram(aes(n_visit, fill = device), bins = 7)
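
When `fill` is mapped to a variable, `geom_histogram()` stacks the groups on top of each other by default, which can make the individual distributions hard to compare. Two common alternatives are sketched below on a hypothetical data frame `visits` with made-up `n_visit` and `device` columns (not the ecom data): overlaying the groups with `position = 'identity'`, or drawing one panel per group with `facet_wrap()`.

```r
library(ggplot2)

# Hypothetical data standing in for the n_visit and device columns of ecom
set.seed(42)
visits <- data.frame(
  n_visit = sample(0:10, 300, replace = TRUE),
  device = sample(c('laptop', 'mobile', 'tablet'), 300, replace = TRUE)
)

# Overlay: every group starts at the baseline; alpha keeps overlaps visible
p_overlay <- ggplot(visits) +
  geom_histogram(aes(n_visit, fill = device), bins = 7,
    position = 'identity', alpha = 0.4)

# Small multiples: one panel per device
p_facet <- ggplot(visits) +
  geom_histogram(aes(n_visit, fill = device), bins = 7) +
  facet_wrap(~ device)

# In the overlaid version every bar starts at zero
print(range(layer_data(p_overlay)$ymin))
```

The overlay works best with few groups; with more groups, separate panels are usually easier to read.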