Agenda

build histogram
specify bins
modify
- color
- fill
- alpha
- bin width
- line type
- line size
map aesthetics to variables

Intro

A histogram is a plot used to examine the shape and spread of continuous data. It looks similar to a bar graph, but each bar covers a range of values rather than a category, and it can be used to detect outliers and skewness. The histogram graphically shows the following:

center (location) of the data
spread (dispersion) of the data
skewness
outliers
presence of multiple modes
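
The features listed above also have numeric counterparts that can be checked alongside the plot. The sketch below uses a small made-up vector (not the ecom data) and base R only. The moment-based skewness formula and the 1.5 * IQR outlier rule are common conventions assumed here for illustration; the histogram itself does not prescribe them.

```r
# Illustrative data: a made-up right-skewed sample with one extreme value
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 20)

# Center (location)
x_mean   <- mean(x)
x_median <- median(x)

# Spread (dispersion)
x_sd  <- sd(x)
x_iqr <- IQR(x)

# Skewness: third standardized moment (population version, assumed here)
x_skew <- mean((x - x_mean)^3) / (mean((x - x_mean)^2))^1.5

# Outliers: values more than 1.5 * IQR beyond the quartiles (rule of thumb)
q <- quantile(x, c(0.25, 0.75))
x_out <- x[x < q[1] - 1.5 * x_iqr | x > q[2] + 1.5 * x_iqr]
```

A positive x_skew points to a longer right tail, which is what the histogram of such data would show visually.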

Histograms

To construct a histogram

the data is split into intervals called bins
the bins are usually of equal width, but they do not have to be
for each bin, the number of data points that fall into it is counted (the frequency)
the Y axis of the histogram represents the frequency
the X axis represents the variable being binned
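
The steps above can be sketched in base R without any plotting. The vector and the bin edges below are made up for illustration and are not taken from the ecom data.

```r
# Illustrative data: made-up values, not taken from the ecom data set
x <- c(0, 1, 1, 2, 3, 3, 3, 5, 6, 8, 9, 10)

# Step 1: split the range into equal-width intervals (bins)
breaks <- seq(0, 10, by = 2)   # bin edges: 0, 2, 4, 6, 8, 10

# Step 2: assign each value to a bin; include.lowest keeps the 0 in the first bin
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

# Step 3: count the values in each bin (the frequency shown on the Y axis)
freq <- table(bins)
print(freq)

# hist() performs the same binning; plot = FALSE returns the counts only
h <- hist(x, breaks = breaks, plot = FALSE)
```

Both freq and h$counts hold one count per bin, which is what the bar heights of a histogram encode.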

Libraries

library(ggplot2)
library(dplyr)
library(tidyr)

Data

ecom <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/web.csv')

## # A tibble: 1,000 x 11
##       id referrer device bouncers n_visit n_pages duration        country
##    <int>    <chr>  <chr>    <chr>   <int>   <dbl>    <dbl>          <chr>
##  1     1   google laptop     true      10       1      693 Czech Republic
##  2     2    yahoo tablet     true       9       1      459          Yemen
##  3     3   direct laptop     true       0       1      996         Brazil
##  4     4     bing tablet    false       3      18      468          China
##  5     5    yahoo mobile     true       9       1      955         Poland
##  6     6    yahoo laptop    false       5       5      135   South Africa
##  7     7    yahoo mobile     true      10       1       75     Bangladesh
##  8     8   direct mobile     true      10       1      908      Indonesia
##  9     9     bing mobile    false       3      19      209    Netherlands
## 10    10   google mobile     true       6       1      208 Czech Republic
## # ... with 990 more rows, and 3 more variables: purchase <chr>,
## #   order_items <dbl>, order_value <dbl>

Data Dictionary

id: row id
referrer: referrer website/search engine
device: device used to visit the website
bouncers: whether the visit bounced (exited from the landing page)
n_visit: number of visits
n_pages: number of pages visited
duration: time spent on the website (in seconds)
country: country of origin
purchase: whether visitor purchased
order_items: number of items ordered
order_value: order value of visitor (in dollars)

Histogram

ggplot(ecom) +
  geom_histogram(aes(n_visit))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
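
The message above is a reminder that 30 bins is only a default. One possible starting point is a rule of thumb such as Sturges or Freedman-Diaconis. The sketch below uses the nclass helpers from the grDevices package that ships with R, applied to a made-up vector standing in for ecom$n_visit. These rules are offered as a starting point only; they are not what ggplot2 uses internally.

```r
# Made-up stand-in for a count variable such as ecom$n_visit (values 0 to 10)
set.seed(1)
visits <- sample(0:10, size = 1000, replace = TRUE)

# Rule-of-thumb bin counts from grDevices
n_sturges <- grDevices::nclass.Sturges(visits)   # based on sample size only
n_fd      <- grDevices::nclass.FD(visits)        # based on IQR and sample size

# For integer-valued data, one bin per distinct value is another option
n_distinct <- length(unique(visits))
```

Any of these values can be passed as the bins argument of geom_histogram() and then adjusted by eye.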

Specify Bins

ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7)

Fill

ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7, fill = 'blue')

Alpha

ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7, fill = 'blue', alpha = 0.3)

Color

ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7, fill = 'white', color = 'blue')

Bins, Color & Fill

ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7, fill = 'blue', color = 'white')

Bin Width

ggplot(ecom) +
  geom_histogram(aes(n_visit), binwidth = 2, fill = 'blue', color = 'black')
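
With binwidth, the number of bars follows from the range of the data rather than being set directly. A minimal sketch, assuming a small made-up data frame named toy in place of ecom: boundary is a geom_histogram() argument that aligns the bin edges, and layer_data() returns the values ggplot2 computed for a layer, so the resulting bins can be inspected as a table.

```r
library(ggplot2)

# Made-up data frame standing in for ecom (hypothetical values)
toy <- data.frame(n_visit = c(0, 1, 1, 2, 3, 3, 3, 5, 6, 8, 9, 10))

# binwidth fixes the width of each bin; boundary = 0 aligns bin edges to 0, 2, 4, ...
p <- ggplot(toy) +
  geom_histogram(aes(n_visit), binwidth = 2, boundary = 0,
                 fill = 'blue', color = 'black')

# Inspect the bins ggplot2 computed: lower edge, upper edge, and count
bin_table <- layer_data(p)[, c('xmin', 'xmax', 'count')]
print(bin_table)
```

Checking the table this way makes it easier to see whether a chosen width splits the data at sensible values.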

Line Type

ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 5, fill = 'white', 
    color = 'blue', linetype = 3)

Line Size

# size sets the outline width; ggplot2 3.4.0 and later prefer linewidth = 1.25
ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 5, fill = 'white',
    color = 'blue', size = 1.25)

Map Fill to Variable

ggplot(ecom) +
  geom_histogram(aes(n_visit, fill = device), bins = 7)
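
When fill is mapped to a variable, geom_histogram() stacks the groups by default, which can make individual distributions hard to compare. The sketch below shows two alternatives on a made-up data frame named toy (hypothetical values standing in for ecom): overlaying the groups with position = 'identity', and drawing one panel per group with facet_wrap(). Which one reads best depends on the data.

```r
library(ggplot2)

# Made-up data frame standing in for ecom (hypothetical values)
toy <- data.frame(
  n_visit = c(0, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 9),
  device  = rep(c('laptop', 'mobile', 'tablet'), times = 4)
)

# Default: bars for each device are stacked on top of each other
p_stack <- ggplot(toy) +
  geom_histogram(aes(n_visit, fill = device), bins = 5)

# Alternative: overlay the groups from a common baseline, semi-transparent
p_overlay <- ggplot(toy) +
  geom_histogram(aes(n_visit, fill = device), bins = 5,
                 position = 'identity', alpha = 0.4)

# Alternative: one panel per device
p_facet <- p_stack + facet_wrap(~ device)
```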