Learn to construct and use histograms to examine the underlying distribution of a continuous variable.
A histogram is a plot that can be used to examine the shape and spread of continuous data. It looks very similar to a bar graph, and it graphically shows features of the data such as outliers and skewness.
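To construct a histogram, the data is split into intervals (bins) and the number of observations that fall into each interval is counted. In R, the hist() function does this and returns the details of the intervals: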
h <- hist(mtcars$mpg)
# break points of the intervals
h$breaks
## [1] 10 15 20 25 30 35
# frequency of the intervals
h$counts
## [1] 6 12 8 2 4
# frequency density
h$density
## [1] 0.0375 0.0750 0.0500 0.0125 0.0250
# mid points of the intervals
h$mids
## [1] 12.5 17.5 22.5 27.5 32.5
# variable name
h$xname
## [1] "mtcars$mpg"
# whether intervals are of equal size
h$equidist
## [1] TRUE
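The components shown above are related: each density is the count of the interval divided by the total number of observations times the width of the interval. A minimal sketch that checks this relationship, using plot = FALSE so that nothing is drawn (the names n, bin_width and manual_density are introduced here only for illustration):

```r
# compute the histogram without drawing it
h <- hist(mtcars$mpg, plot = FALSE)

n <- sum(h$counts)           # total number of observations
bin_width <- diff(h$breaks)  # width of each interval

# density = count / (n * width)
manual_density <- h$counts / (n * bin_width)

# compare with the density stored in the histogram object
all.equal(manual_density, h$density)
```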
hist(mtcars$mpg, breaks = 10)
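When breaks is a single number, the documentation of hist() describes it as a suggestion only, because the break points are set to pretty values. The number of bars drawn may therefore differ from the number requested. A small sketch for inspecting what was actually used (h10 is an illustrative name, and the pretty() call is shown only for comparison):

```r
# request roughly 10 intervals, without drawing the plot
h10 <- hist(mtcars$mpg, breaks = 10, plot = FALSE)

# break points that were actually chosen
h10$breaks

# number of intervals that resulted from the request
length(h10$counts)

# pretty() produces "nice" break points over the range of the data
pretty(range(mtcars$mpg), n = 10)
```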
h <- hist(mtcars$mpg, breaks = c(10, 18, 24, 30, 35))
frequency <- h$counts
class_width <- c(8, 6, 6, 5)
rel_freq <- frequency / length(mtcars$mpg)
freq_density <- rel_freq / class_width
d <- data.frame(frequency = frequency, class_width = class_width,
                relative_frequency = rel_freq, frequency_density = freq_density)
d
##   frequency class_width relative_frequency frequency_density
## 1        13           8            0.40625        0.05078125
## 2        12           6            0.37500        0.06250000
## 3         3           6            0.09375        0.01562500
## 4         4           5            0.12500        0.02500000
When each frequency density is multiplied by its class width, the products always sum to 1.
sum(d$frequency_density * d$class_width)
## [1] 1
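The class widths above are typed in by hand. As a sketch of an alternative, they can be derived from the break points with diff(), and the resulting frequency density can be compared with the density component that hist() computes itself. Here plot = FALSE is used so that nothing is drawn, and the variable names simply mirror the ones used above:

```r
h <- hist(mtcars$mpg, breaks = c(10, 18, 24, 30, 35), plot = FALSE)

class_width  <- diff(h$breaks)            # widths derived from the breaks
rel_freq     <- h$counts / sum(h$counts)  # relative frequency
freq_density <- rel_freq / class_width    # frequency density

# should agree with the density stored in the histogram object
all.equal(freq_density, h$density)

# total area of the bars
sum(freq_density * class_width)
```

Deriving the widths from h$breaks keeps the calculation correct if the break points are changed later.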
We will learn more about frequency density in a bit. Before we end this section, we need to learn about one more way to specify the intervals of the histogram: algorithms. Through its breaks argument, the hist() function allows us to specify the following algorithms: Sturges (the default), Scott, and Freedman-Diaconis (FD).
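A minimal sketch of selecting an algorithm by name. It assumes the strings documented for the breaks argument of hist() ("Sturges", "Scott" and "FD") and uses the nclass.Sturges(), nclass.scott() and nclass.FD() helpers from grDevices to show how many classes each rule suggests. The suggested number is only a starting point, since the break points are still set to pretty values. The names x, h_sturges, h_scott and h_fd are illustrative.

```r
x <- mtcars$mpg

# number of classes suggested by each rule
nclass.Sturges(x)
nclass.scott(x)
nclass.FD(x)

# pass the algorithm by name; plot = FALSE skips the drawing
h_sturges <- hist(x, breaks = "Sturges", plot = FALSE)
h_scott   <- hist(x, breaks = "Scott", plot = FALSE)
h_fd      <- hist(x, breaks = "FD", plot = FALSE)

# break points chosen under each rule
h_sturges$breaks
h_scott$breaks
h_fd$breaks
```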
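# plot densities instead of frequencies on the Y axis
hist(mtcars$mpg, probability = TRUE)
# color of the bars
hist(mtcars$mpg, col = 'blue')
# color of the border of each bar
hist(mtcars$mpg, border = c('red', 'blue', 'green', 'yellow', 'brown'))
# label each bar with its value
hist(mtcars$mpg, labels = TRUE)
# supply custom labels, one per bar
hist(mtcars$mpg, labels = c("6", "12", "8", "2", "4"))
# putting it all together
hist(mtcars$mpg, labels = TRUE, probability = TRUE,
     ylim = c(0, 0.1), xlab = 'Miles Per Gallon',
     main = 'Distribution of Miles Per Gallon',
     col = rainbow(5))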
library(readr)  # provides read_csv()
returns <- read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/tickers.csv')
returns
## # A tibble: 250 x 5
##         AAPL       AMZN        FB       GOOG      MSFT
##        <dbl>      <dbl>     <dbl>      <dbl>     <dbl>
##  1  1.377845  24.169983  2.119995  22.409973  1.120701
##  2  2.834412   3.250000 -0.860001   5.989990  0.766800
##  3 -0.039360   9.910034  1.450005   6.750000  0.973240
##  4  0.108261   3.759949 -0.770004 -10.690002 -0.285091
##  5  1.643570  19.840027  4.750000   8.660034  0.501365
##  6  0.068894   5.330017 -0.299996  -0.929992  0.255596
##  7 -0.560975  -5.210022 -0.630005  -7.280030 -0.707809
##  8  0.551140   0.250000 -0.459999   0.690003  0.127796
##  9 -0.216522 -13.599975  0.030007   6.559997  0.078648
## 10 -0.108253  -4.250000  0.459999   2.600037  0.471878
## # ... with 240 more rows
summary(returns$AAPL)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -5.9650 -0.4937  0.1578  0.2573  0.9687  7.2829
hist(returns$AAPL)
summary(returns$GOOG)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -33.5800  -3.1826   1.3950   0.9718   5.8475  31.7100
hist(returns$GOOG)
summary(returns$FB)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -7.6700 -0.5675  0.1300  0.2258  1.1800  4.8300
hist(returns$FB)
summary(returns$MSFT)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -1.91978 -0.22750  0.04423  0.08851  0.43451  1.65121
hist(returns$MSFT)
summary(returns$AMZN)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -32.260  -4.230   0.950   1.126   8.040  24.170
hist(returns$AMZN)
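The calls above repeat the same steps for every ticker. As a sketch, the plotting could be folded into a loop that draws one histogram per column in a single grid, which makes the distributions easier to compare. The helper name plot_histograms and its ncol argument are hypothetical, and the example is demonstrated on columns of mtcars so that it runs without downloading the data. With the data loaded, plot_histograms(returns) would be the equivalent call.

```r
# hypothetical helper: one histogram per numeric column of a data frame
plot_histograms <- function(df, ncol = 3) {
  num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  nrow <- ceiling(length(num_cols) / ncol)
  old_par <- par(mfrow = c(nrow, ncol))  # arrange the plots in a grid
  on.exit(par(old_par))                  # restore the layout afterwards
  for (col in num_cols) {
    hist(df[[col]], main = col, xlab = col)
  }
  invisible(num_cols)                    # names of the columns plotted
}

# demonstration on five columns of mtcars
plotted <- plot_histograms(mtcars[, c("mpg", "hp", "wt", "qsec", "disp")])
plotted
```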