Agenda


Learn to construct and use histograms to examine the underlying distribution of a continuous variable. Specifically:

  • create a bare bones histogram
  • specify the number of bins/intervals
  • represent frequency density on the Y axis
  • add colors to the bars and the border
  • add labels to the bars

Introduction


A histogram is a plot that can be used to examine the shape and spread of continuous data. It looks very similar to a bar graph and can be used to detect outliers and skewness in data. The histogram graphically shows the following:

  • center (location) of the data
  • spread (dispersion) of the data
  • skewness
  • outliers
  • presence of multiple modes

Histograms


To construct a histogram:

  • the data is split into intervals called bins
  • the intervals may or may not be of equal size
  • for each bin, the number of data points that fall into it is counted (frequency)
  • the Y axis of the histogram represents the frequency
  • the X axis represents the variable
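
The steps above can be sketched by hand in R. This is only a minimal illustration: the break points 10, 15, ..., 35 are an assumption chosen for the example, cut() assigns each value of mtcars$mpg to a bin, and table() counts the values per bin.

```r
# Assumed break points, chosen for illustration
breaks <- seq(10, 35, by = 5)

# Assign each observation to an interval; intervals are closed on the right,
# and include.lowest = TRUE keeps a value equal to the first break
bins <- cut(mtcars$mpg, breaks = breaks, include.lowest = TRUE)

# Frequency of each bin
bin_counts <- table(bins)
bin_counts
```

These counts should agree with what hist() reports for the same break points, since hist() also uses right-closed intervals by default.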

Histogram


h <- hist(mtcars$mpg)

Histogram


# break points of the intervals
h$breaks
## [1] 10 15 20 25 30 35
# frequency of the intervals
h$counts
## [1]  6 12  8  2  4
# frequency density
h$density
## [1] 0.0375 0.0750 0.0500 0.0125 0.0250
# mid points of the intervals
h$mids
## [1] 12.5 17.5 22.5 27.5 32.5
# variable name
h$xname
## [1] "mtcars$mpg"
# whether intervals are of equal size
h$equidist
## [1] TRUE

Bins


hist(mtcars$mpg, breaks = 10)
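
When breaks is a single number, hist() treats it as a suggestion and rounds the break points to "pretty" values, so the number of bins drawn may differ from the number requested. A small sketch to inspect what was actually used (the name h10 is chosen here for illustration):

```r
# plot = FALSE returns the histogram object without drawing it
h10 <- hist(mtcars$mpg, breaks = 10, plot = FALSE)

# break points actually chosen
h10$breaks

# number of bins actually used, which may differ from the 10 requested
length(h10$counts)
```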

Intervals


h <- hist(mtcars$mpg, breaks = c(10, 18, 24, 30, 35))

Frequency Density


frequency <- h$counts
class_width <- c(8, 6, 6, 5)
rel_freq <- frequency / length(mtcars$mpg)
freq_density <- rel_freq / class_width
d <- data.frame(frequency = frequency, class_width = class_width, 
  relative_frequency = rel_freq, frequency_density = freq_density)
d
##   frequency class_width relative_frequency frequency_density
## 1        13           8            0.40625        0.05078125
## 2        12           6            0.37500        0.06250000
## 3         3           6            0.09375        0.01562500
## 4         4           5            0.12500        0.02500000

Frequency Density


When each frequency density is multiplied by its class width, the products always sum to 1.

sum(d$frequency_density * d$class_width)
## [1] 1
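
The class widths do not have to be typed in by hand. As a sketch, assuming the same custom break points as above, they can be derived from the break points with diff() and the result compared with the density that hist() computes itself:

```r
# Recreate the histogram object with the custom intervals (no plot drawn)
h <- hist(mtcars$mpg, breaks = c(10, 18, 24, 30, 35), plot = FALSE)

# Class widths derived from the break points
class_width <- diff(h$breaks)

# Frequency density = relative frequency / class width
freq_density <- (h$counts / sum(h$counts)) / class_width

# Compare with the density computed by hist()
all.equal(freq_density, h$density)
```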

Intervals


We will see more of frequency density in a bit. Before we end this section, there is one more way to specify the intervals of a histogram: by naming an algorithm. The hist() function accepts the following algorithms through its breaks argument:

  • Sturges (default)
  • Scott
  • Freedman-Diaconis (FD)
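
A minimal sketch of how these algorithms might be selected by name. The object names h_sturges, h_scott and h_fd are chosen here for illustration. Because the suggested number of bins is still rounded to pretty break points, different algorithms can end up with the same bins on a small data set.

```r
# Name the algorithm through the breaks argument (no plot drawn)
h_sturges <- hist(mtcars$mpg, breaks = "Sturges", plot = FALSE)
h_scott   <- hist(mtcars$mpg, breaks = "Scott", plot = FALSE)
h_fd      <- hist(mtcars$mpg, breaks = "FD", plot = FALSE)

# Number of bins each algorithm ends up with
sapply(list(Sturges = h_sturges, Scott = h_scott, FD = h_fd),
       function(h) length(h$counts))
```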

Frequency Distribution II


Probability


hist(mtcars$mpg, probability = TRUE)

Color


hist(mtcars$mpg, col = 'blue')

Border Color


hist(mtcars$mpg, border = c('red', 'blue', 'green', 'yellow', 'brown'))

Labels


hist(mtcars$mpg, labels = TRUE)

Labels


hist(mtcars$mpg, labels = c("6", "12", "8", "2", "4"))
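
Since labels accepts a character vector, the bars can carry labels other than the raw counts. As a sketch, the percentage of observations in each bin could be shown instead. The name pct_labels and the extra headroom added to ylim are choices made for this example.

```r
# Compute the histogram first without drawing it
h <- hist(mtcars$mpg, plot = FALSE)

# Percentage of observations in each bin, as text labels
pct_labels <- paste0(round(100 * h$counts / sum(h$counts)), "%")

# Draw the histogram with the percentage labels; extra headroom for the text
hist(mtcars$mpg, labels = pct_labels, ylim = c(0, max(h$counts) + 2))
```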

Title & Axis Labels


hist(mtcars$mpg, labels = TRUE, probability = TRUE,
     ylim = c(0, 0.1), xlab = 'Miles Per Gallon',
     main = 'Distribution of Miles Per Gallon',
     col = rainbow(5))

Normal Distribution


Skewed Distribution


Case Study

Data


# read_csv() comes from the readr package
library(readr)
returns <- read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/tickers.csv')
returns
## # A tibble: 250 x 5
##         AAPL       AMZN        FB       GOOG      MSFT
##        <dbl>      <dbl>     <dbl>      <dbl>     <dbl>
##  1  1.377845  24.169983  2.119995  22.409973  1.120701
##  2  2.834412   3.250000 -0.860001   5.989990  0.766800
##  3 -0.039360   9.910034  1.450005   6.750000  0.973240
##  4  0.108261   3.759949 -0.770004 -10.690002 -0.285091
##  5  1.643570  19.840027  4.750000   8.660034  0.501365
##  6  0.068894   5.330017 -0.299996  -0.929992  0.255596
##  7 -0.560975  -5.210022 -0.630005  -7.280030 -0.707809
##  8  0.551140   0.250000 -0.459999   0.690003  0.127796
##  9 -0.216522 -13.599975  0.030007   6.559997  0.078648
## 10 -0.108253  -4.250000  0.459999   2.600037  0.471878
## # ... with 240 more rows

Daily Returns Summary: Apple


summary(returns$AAPL)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5.9650 -0.4937  0.1578  0.2573  0.9687  7.2829

Daily Returns: Apple


hist(returns$AAPL)

Daily Returns Summary: Google


summary(returns$GOOG)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -33.5800  -3.1826   1.3950   0.9718   5.8475  31.7100

Daily Returns: Google


hist(returns$GOOG)

Daily Returns Summary: Facebook


summary(returns$FB)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -7.6700 -0.5675  0.1300  0.2258  1.1800  4.8300

Daily Returns: Facebook


hist(returns$FB)

Daily Returns Summary: Microsoft


summary(returns$MSFT)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.91978 -0.22750  0.04423  0.08851  0.43451  1.65121

Daily Returns: Microsoft


hist(returns$MSFT)

Daily Returns Summary: Amazon


summary(returns$AMZN)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -32.260  -4.230   0.950   1.126   8.040  24.170

Daily Returns: Amazon


hist(returns$AMZN)
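
The five hist() calls above could also be written as a loop that draws all tickers in one grid. This is only a sketch: plot_all_hist is a hypothetical helper, not part of any package, and it assumes a data frame with one numeric column per ticker, like returns.

```r
# Hypothetical helper: draw one histogram per column of a data frame
plot_all_hist <- function(data, n_row = 2, n_col = 3) {
  old_par <- par(mfrow = c(n_row, n_col))  # arrange plots in a grid
  on.exit(par(old_par))                    # restore settings afterwards
  for (ticker in names(data)) {
    hist(data[[ticker]], main = ticker, xlab = 'Daily Returns')
  }
  invisible(names(data))
}

# Usage, assuming `returns` has been read as shown earlier:
# plot_all_hist(returns)
```

Placing the plots side by side makes it easier to compare the spread of the different stocks, although each panel still uses its own X axis range.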