The dataset starbucks in the openintro package contains nutritional information on 77 Starbucks food items. Spend some time reading the help file of this dataset. For this problem, you will explore the relationship between the calories and carbohydrate grams in these items. Fit a model predicting carbohydrate grams from calories using the lm() function, then make a residual plot; the ggplot2 function fortify can help a lot with this. Describe what you see in the residual plot. Does the model look like a good fit?

In most restaurants in the US these days, regulation requires menu items to have the calories in the dish listed next to them. Many Americans also care about other nutrition facts, such as grams of protein, fat, or carbohydrates. For those on low-carb diets, the amount of carbohydrates in their food is very important to them. Since menus usually only list the calorie content, we want to construct a model that someone on a low-carb diet can use to predict carbohydrate grams from calories.
In the figure below, a scatter plot shows the data collected on food items at Starbucks. As the calories increase, the number of carbohydrate grams also increases, but there is a lot of variability in carb grams at any given calorie count, and for items with 300 or more calories, carb grams appear less strongly associated with calories than they are below 300 calories. The calorie count of the item is the explanatory variable, and the grams of carbohydrates in the item is the response variable. We want to build a model that helps people on low-carb diets predict the amount of carbs in an item at Starbucks based on the number of calories in the item.
library(tidyverse)
library(openintro)
library(csafethemes)
data("starbucks")
ggplot(data = starbucks) +
geom_point(aes(x = calories, y = carb), size = 3, color = "#77BC1F") +
theme_csafe() +
labs(x = "Calories", y = "Carbohyrdates (g)", title = "Calories and Grams of Carbs in 77 Starbucks Foods") + theme(plot.title = element_text(size = rel(.9)))
Here is the model to predict carbs from calories:
model1 <- lm(carb ~ calories, data = starbucks)
model1
##
## Call:
## lm(formula = carb ~ calories, data = starbucks)
##
## Coefficients:
## (Intercept) calories
## 8.944 0.106
Let \(Y\) be the number of carbs in the item at Starbucks, and let \(X\) be the number of calories in the item. The fitted model can then be written as: \[Y = 8.944 + 0.106\cdot X.\]
This model means that when there are 0 calories in an item (\(X = 0\)), we expect there to be 8.944g of carbs. This is nonsensical, however, because a food item with 0 calories cannot have 8.944g of carbs. (1 gram of carbs contains about 4 calories.) Also, for every additional calorie in an item, the expected amount of carbs in the item increases by 0.106g. An interpretation that makes more sense in context is that for every 100 additional calories in an item, the number of carbs increases, on average, by 10.6g. So, a person wishing to have a low-carb item should eat something lower in calories.
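To make the slope interpretation concrete, here is a small sketch (not part of the original question) comparing the model's predictions for two hypothetical items that differ by 100 calories:
# Predicted carb grams for hypothetical 200- and 300-calorie items;
# the predictions differ by 0.106 * 100 = 10.6 g
predict(model1, newdata = data.frame(calories = c(200, 300)))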
We first assess the fit of this model with the coefficient of determination, \(R^2\). The \(R^2\) value for this model is:
summary(model1)$r.squared
## [1] 0.4556237
This means that about 45.6% of the variation in carbohydrate grams in an item at Starbucks can be explained by the linear relationship with calories.
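Since this is a simple linear regression with a single predictor, \(R^2\) is just the squared correlation between calories and carb grams, so the value above can be double-checked directly:
# In simple linear regression, R^2 equals the squared correlation
cor(starbucks$calories, starbucks$carb)^2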
Another way to assess the fit of the model is with the residual plot shown below:
starbucks2 <- fortify(model1)
ggplot(data = starbucks2) +
geom_hline(yintercept = 0, color = "#003A70", size = 1.5) +
geom_point(aes(x = calories, y = .resid), size = 3, color = "#CF0A2C") +
labs(x = "Calories", y = "Residual (g of carbs)", title = "Residual plot: modeling carbs with calories") +
theme_csafe() +
theme(plot.title = element_text(size = rel(.9)))
The residual plot, if the model were a good fit, would look like a random “cloud” of points above and below the zero line. This plot shows a distinct pattern that we don’t want: as the value on the \(x\)-axis increases, the residuals get larger (in absolute value). You can see a funnel-like shape (that looks like “<”), starting with low variability at low calorie counts and flaring out as calories increase. This means that the variability in carb grams increases with calories, which violates the constant-variance assumption of a linear model. So, this is not a good model for these data.
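One rough way to quantify the funnel shape (an optional check, with the 300-calorie cutoff chosen only for illustration) is to compare the spread of the residuals for lower- and higher-calorie items:
# Residual spread below vs. above 300 calories; a larger standard deviation
# in the higher-calorie group reflects the funnel shape in the plot
starbucks2 %>%
  mutate(cal_group = ifelse(calories < 300, "under 300 cal", "300+ cal")) %>%
  group_by(cal_group) %>%
  summarize(n = n(), resid_sd = sd(.resid))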
The openintro package contains a dataset called absenteeism that consists of data on 146 schoolchildren in a rural area of Australia. Spend some time reading the help file of this dataset. We are interested in seeing if the ethnicity (aboriginal or not), sex (male or female), and learning ability (average or slow) of the children affects the number of days they are absent from school.

1. Convert the Eth, Sex, and Lrn variables to binary variables. One way to do this is with the function ifelse(). You should construct them so that Eth = 1 if the student is not aboriginal and Eth = 0 if the student is aboriginal; Sex = 1 if the student is male and Sex = 0 if the student is female; and Lrn = 1 if the student is a slow learner and Lrn = 0 if the student is an average learner.
2. Fit a multiple regression with Days as the dependent variable and the three variables mentioned in (1) as explanatory variables.
3. Predict the number of days absent for the new students defined below:

library(tidyverse)
newdata <- data_frame(Eth = c(1,1,1,0,0),
                      Sex = c(0,1,0,1,0),
                      Lrn = c(0,0,1,1,0))
We want to know about the effects of ethnicity, sex, and learning ability on student absenteeism in rural Australia. We will use multiple regression to do this. The data look something like this:
data("absenteeism")
head(absenteeism)
## Eth Sex Age Lrn Days
## 1 A M F0 SL 2
## 2 A M F0 SL 11
## 3 A M F0 SL 14
## 4 A M F0 AL 5
## 5 A M F0 AL 5
## 6 A M F0 AL 13
First, we need to recode the Eth, Sex, and Lrn variables to be coded as 0-1. The first variable, \(X_1\), is the ethnicity of the student. This is stored as A in the data if the student is aboriginal, and N if the student is nonaboriginal. We will set Eth = 1 if the student is not aboriginal and Eth = 0 if the student is aboriginal. The second variable, \(X_2\), is the sex of the student, either male or female. We will set Sex = 1 if the student is male and Sex = 0 if the student is female. The final covariate, \(X_3\), is the learning ability of the student, either a slow or an average learner. We set Lrn = 1 if the student is a slow learner and Lrn = 0 if the student is an average learner. Here is one way to create these variables:
absenteeism <- absenteeism %>%
  mutate(Eth = ifelse(Eth == "N", 1, 0),
         Sex = ifelse(Sex == "M", 1, 0),
         Lrn = ifelse(Lrn == "SL", 1, 0))
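A quick sanity check (optional) confirms that the recoded variables now only take the values 0 and 1:
# Tabulate the recoded 0-1 variables
absenteeism %>% count(Eth, Sex, Lrn)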
In order to determine the relationship between these three variables and the response variable \(Y\), the number of days absent from school, we use multiple regression:
model2 <- lm(Days ~ Eth + Sex + Lrn , data = absenteeism)
model2
##
## Call:
## lm(formula = Days ~ Eth + Sex + Lrn, data = absenteeism)
##
## Coefficients:
## (Intercept) Eth Sex Lrn
## 18.932 -9.112 3.104 2.154
The fitted model is: \[Y = 18.932 - 9.112\cdot X_1 + 3.104 \cdot X_2 + 2.154 \cdot X_3\]
For students who are aboriginal, female, and average learners (so, \(X_1 = X_2 = X_3 = 0\)), we expect them to miss 18.932, or about 19, days of school, on average. If all else is held constant, we expect students who are nonaboriginal to miss 9.112, or about 9, fewer days of school than students who are aboriginal. Similarly, we expect males to miss about 3 (3.104) more days of school than females, on average, and we expect students who are slow learners to miss about 2.154 more days, on average, than students who are average learners.
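As a concrete illustration of the "all else held constant" interpretation, this small sketch (not part of the original question) compares predictions for two hypothetical students who differ only in ethnicity:
# Two hypothetical students, identical except for Eth;
# the predicted days missed differ by the Eth coefficient, -9.112
two_students <- data.frame(Eth = c(0, 1), Sex = c(1, 1), Lrn = c(0, 0))
predict(model2, newdata = two_students)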
The adjusted \(R^2\) value is:
summary(model2)$adj.r.squared
## [1] 0.07009202
So, about 7% of the variability in days of school missed can be explained with the multiple regression model.
Another way to look at the model is a residual plot with the observed response variable on the \(x\)-axis and the residual on the \(y\)-axis.
dat <- fortify(model2)
ggplot(data = dat) +
geom_hline(yintercept = 0, color = "#003A70", size = 1.5) +
geom_point(aes(x = Days, y = .resid), size = 2.5, color = "#CF0A2C") +
labs(x = "# of days missed", y = "Residual", title = "Residual plot: missing school") +
theme_csafe() +
theme(plot.title = element_text(size = rel(.9)))
We see an obvious pattern: the residual value increases as the number of days missed increases. This plot, combined with the low adjusted \(R^2\) value, indicates that this model tells us very little about the number of days of school missed. There is clearly some variable we are missing that is more important than sex, ethnicity, or learning ability.
Next, we predict the number of days missed for each of the new students.
newdata <- data_frame(Eth = c(1,1,1,0,0),
                      Sex = c(0,1,0,1,0),
                      Lrn = c(0,0,1,1,0))
newdata <- newdata %>%
  mutate(Days_Missed = predict(model2, newdata = newdata))
If we had to use this model, we would predict the number of days that these new students would miss to be:
Eth | Sex | Lrn | Days_Missed |
---|---|---|---|
1 | 0 | 0 | 9.82 |
1 | 1 | 0 | 12.92 |
1 | 0 | 1 | 11.97 |
0 | 1 | 1 | 24.19 |
0 | 0 | 0 | 18.93 |
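Given how little of the variability the model explains, these point predictions come with a lot of uncertainty; as an optional extra, prediction intervals make this explicit:
# 95% prediction intervals for the five new students
predict(model2, newdata = newdata, interval = "prediction")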
The openintro package contains a dataset called orings that contains information on 23 NASA space shuttle launches. Spend some time reading the help file of this dataset.

1. Run the code below to create a new dataset, orings2, with one row per o-ring:

library(openintro)
data("orings")
orings2 <- NULL
for(i in 1:nrow(orings)){
  new <- data.frame(temp = orings$temp[i],                  # for each row (launch) in orings,
                    fail = rep(c(1,0), c(orings$damage[i],  # create 6 new rows:
                                         6-orings$damage[i])))  # one for each o-ring
  orings2 <- rbind(orings2, new)
}

2. Using orings2, fit a logistic regression to the data using the glm function.
3. Predict the probability of o-ring failure at each temperature value from 53 to 81 (53:81). Create a plot showing the observed data (in orings2) as points (\(x\) = temp, \(y\) = fail) and draw a line through the predicted probabilities at each temperature value from 53-81. Describe what you see in the plot.

The dataset orings in the openintro package looks like this:
temp | 53 | 57 | 58 | 63 | 66 | 67 | 67 | 67 | 68 | 69 | 70 | 70 | 70 | 70 | 72 | 73 | 75 | 75 | 76 | 76 | 78 | 79 | 81 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
damage | 5 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Most of the damaged o-rings occurred at lower temperatures: of the 11 damaged o-rings, 8 occurred when the temperature was below 64 degrees Fahrenheit. The following code creates a dataset appropriate for logistic regression. Each launch has 6 o-rings, so for each launch the code creates a row with fail = 1 for every damaged o-ring and rows with fail = 0 for the remaining (undamaged) o-rings.
library(openintro)
data("orings")
orings2 <- NULL
for(i in 1:nrow(orings)){
  new <- data.frame(temp = orings$temp[i],                  # for each row (launch) in orings,
                    fail = rep(c(1,0), c(orings$damage[i],  # create 6 new rows:
                                         6-orings$damage[i])))  # one for each o-ring
  orings2 <- rbind(orings2, new)
}
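As an aside, the same expansion can be done without an explicit loop. Here is one possible tidyverse sketch using tidyr::uncount(), assuming the same temp and damage columns used above:
# Expand each launch into 6 rows: `damage` rows with fail = 1
# and the remaining rows with fail = 0
orings2_alt <- orings %>%
  mutate(intact = 6 - damage) %>%
  pivot_longer(c(damage, intact), names_to = "status", values_to = "n") %>%
  mutate(fail = ifelse(status == "damage", 1, 0)) %>%
  uncount(n) %>%
  select(temp, fail)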
Now that the data is in the proper tidy form, we can fit a logistic regression:
model3 <- glm(fail ~ temp, data = orings2, family = "binomial")
coef(model3)
## (Intercept) temp
## 11.6629897 -0.2162337
The full model is:
\[Y \sim \text{Bernoulli}(p), \qquad \text{logit}(p) = 11.663 - 0.216\cdot X\]
where \(Y\) is an indicator variable that is 1 if the o-ring was damaged and 0 otherwise, and \(X\) is the temperature (in degrees Fahrenheit) on the day of the launch. The probability that the o-ring fails is \(p\).
The coefficient for temperature is -0.216. To interpret it, we compute \(e^{-0.216} = 0.806\). One way to interpret this value is that for every additional degree Fahrenheit, the odds that the o-ring will be damaged are about 80.6% of the odds at the lower temperature. In other words, the increase in temperature decreases the odds of a damaged o-ring by about 20%. Another way to interpret this is to take the reciprocal, \(\frac{1}{0.806} = 1.241\), and talk about the change in odds when the temperature decreases. Thus, when the temperature decreases by one degree, the odds that the o-ring will be damaged increase by a factor of about 1.24 times.
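These odds ratios can be computed directly from the fitted coefficients; for example (a quick sketch):
# Odds multiplier for a one-degree increase in temperature (about 0.806)
exp(coef(model3)["temp"])
# Odds multiplier for a ten-degree decrease in temperature
exp(-10 * coef(model3)["temp"])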
The investigation into the Challenger Explosion found that the low temperature during the launch caused the o-ring to be damaged, which in turn caused the explosion. Our logistic regression supports the theory that lower temperatures increase the likelihood of o-ring damage. The lowest temperature in our data is 53 degrees F. The temperature on the day of the Challenger launch was 31 degrees F.1
Finally, we look at the predictions from our model with the observed 0-1 values for all temperatures in our range of observations.
newdata <- data.frame(temp = 53:81)
newdata <- newdata %>%
mutate(pred_prob = predict(model3, newdata = newdata, type = "response"))
ggplot(data = orings2) +
geom_line(data = newdata, aes(x= temp , y= pred_prob), color = "#003A70",
size = 1.5) +
geom_point(aes(x = temp, y = fail), size = 2.5, color = "#CF0A2C", alpha = .3) +
theme_csafe() +
labs(x = "Temperature (degrees F)", y = "Probability of Damaged O-ring",
title = "Observed and Predicted Failure probabilities",
subtitle = "Did low temperatures cause damaged o-rings?") +
theme(plot.title = element_text(size = rel(.9)), plot.subtitle = element_text(size = rel(.7)))
In the figure above, the darker points represent many observations at that temperature. As the temperature increases, the observed number of failures decreases, and the predicted probability of damage decreases, becoming close to zero at temperatures above 70 degrees F. The highest predicted probability of damage is about 55% at the lowest observed temperature, 53 degrees F. In addition, there appears to be a noticeable change in slope at about 62 degrees F, after which the probability of damage decreases more slowly.
The plot below was not required, but shows the predictions from the model outside the range of observations. Typically, extrapolation like this should not be used in a real-world context. This is for demonstration purposes only.
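For reference, the extrapolated value mentioned in the footnote can be reproduced directly from the fitted model (again, this is extrapolation and is shown for demonstration only):
# Predicted probability of o-ring damage at 31 degrees F,
# far below the lowest observed launch temperature of 53 degrees F
predict(model3, newdata = data.frame(temp = 31), type = "response")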
Though it is unwise to predict for data outside of our range of observation, the model we built predicts a probability of failure of 0.993 for a temperature of 31 degrees F.