April 29, 2020

Announcements

Relationship between dichotomous (x) and continuous (y) variables

# Simulate a dichotomous predictor (x) and a continuous response (y);
# no seed is set, so the values shown below will vary from run to run
df <- data.frame(
    x = rep(c(0, 1), each = 10),
    y = c(rnorm(10, mean = 1, sd = 1),
          rnorm(10, mean = 2.5, sd = 1.5))
)
head(df)
##   x         y
## 1 0 1.9243372
## 2 0 0.6924972
## 3 0 0.3735868
## 4 0 0.3772007
## 5 0 1.5807124
## 6 0 1.6292041
library(psych)  # provides describeBy()
tab <- describeBy(df$y, group = df$x, mat = TRUE, skew = FALSE)
tab$group1 <- as.integer(as.character(tab$group1))  # convert group labels back to 0/1

Relationship between dichotomous (x) and continuous (y) variables

library(ggplot2)
ggplot(df, aes(x = x, y = y)) + geom_point(alpha = 0.5) +
    geom_point(data = tab, aes(x = group1, y = mean), color = 'red', size = 4) +  # group means
    geom_smooth(method = 'lm', se = FALSE, formula = y ~ x)

Meetup Presentations

  • Christopher Bloome (9.15)

Regression so far…

At this point we have covered:

  • Simple linear regression
    • Relationship between numerical response and a numerical or categorical predictor
  • Multiple regression
    • Relationship between numerical response and multiple numerical and/or categorical predictors

What we haven’t seen is what to do when the predictors are weird (nonlinear, complicated dependence structure, etc.) or when the response is weird (categorical, count data, etc.)

Odds

Odds are another way of quantifying the probability of an event, commonly used in gambling (and logistic regression).

Odds

For some event \(E\),

\[\text{odds}(E) = \frac{P(E)}{P(E^c)} = \frac{P(E)}{1-P(E)}\]

Similarly, if we are told the odds of \(E\) are \(x\) to \(y\), then

\[\text{odds}(E) = \frac{x}{y} = \frac{x/(x+y)}{y/(x+y)} \]

which implies

\[P(E) = x/(x+y),\quad P(E^c) = y/(x+y)\]
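For example, odds of 3 to 1 correspond to \(P(E) = 3/(3+1) = 0.75\). A quick check in R (the odds() helper here is just for illustration):

odds <- function(p) { p / (1 - p) }  # odds from a probability
odds(0.75)
## [1] 3
3 / (3 + 1)  # and from 3-to-1 odds back to a probability
## [1] 0.75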

Generalized Linear Models

Generalized linear models (GLMs) are a generalization of OLS that allows the response variable (i.e., the dependent variable) to have an error distribution other than the normal distribution. Logistic regression is just one type of GLM, specifically for dichotomous response variables that follow a binomial distribution.

All generalized linear models have the following three characteristics:

  1. A probability distribution describing the outcome variable
  2. A linear model
    \(\eta = \beta_0+\beta_1 X_1 + \cdots + \beta_n X_n\)
  3. A link function that relates the linear model to the parameter of the outcome distribution
    \(g(p) = \eta\) or \(p = g^{-1}(\eta)\)
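In R, the outcome distribution and link function are specified together through a family object passed to glm(). A few common combinations:

binomial(link = 'logit')     # logistic regression (dichotomous outcome)
poisson(link = 'log')        # Poisson regression (count outcome)
gaussian(link = 'identity')  # ordinary linear regression as a special case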

Logistic Regression

Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors.

We assume a binomial distribution produced the outcome variable, and we therefore want to model \(p\), the probability of success for a given set of predictors.

To finish specifying the logistic model, we just need to establish a reasonable link function that connects \(\eta\) to \(p\). There are a variety of options, but the most commonly used is the logit function.

Logit function

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right),\text{ for } 0 < p < 1\]
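In R, the logit and its inverse are available directly as qlogis() and plogis(), the quantile and distribution functions of the logistic distribution:

qlogis(0.5)  # logit(0.5) = log(1) = 0
## [1] 0
plogis(0)    # inverse logit of 0 is 0.5
## [1] 0.5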

The Logistic Function

\[ \sigma(t) = \frac{e^t}{e^t + 1} = \frac{1}{1 + e^{-t}} \]

logistic <- function(t) { return(1 / (1 + exp(-t))) }
df <- data.frame(x = seq(-4, 4, by = 0.01))  # note: overwrites the earlier df
df$sigma_t <- logistic(df$x)
plot(df$x, df$sigma_t, type = 'l', xlab = 't', ylab = expression(sigma(t)))

t as a Linear Function

\[ t = \beta_0 + \beta_1 x \]

The logistic function can now be rewritten as

\[ F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]

Similar to OLS, we wish to find the parameters that best fit the data. However, instead of minimizing the sum of squared residuals, we estimate the parameters by maximizing the likelihood function (a sketch of what this looks like appears after the fitted model below).

Example: Hours Studying Predicting Passing

study <- data.frame(
    Hours=c(0.50,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,2.75,3.00,
            3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50),
    Pass=c(0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1)
)
lr.out <- glm(Pass ~ Hours, data=study, family=binomial(link='logit'))
lr.out
## 
## Call:  glm(formula = Pass ~ Hours, family = binomial(link = "logit"), 
##     data = study)
## 
## Coefficients:
## (Intercept)        Hours  
##      -4.078        1.505  
## 
## Degrees of Freedom: 19 Total (i.e. Null);  18 Residual
## Null Deviance:       27.73 
## Residual Deviance: 16.06     AIC: 20.06
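To see what maximum likelihood estimation is doing, here is a minimal sketch (not how glm() is actually implemented) that recovers approximately the same coefficients by minimizing the negative log-likelihood with optim(); nll is an illustrative name:

# Negative log-likelihood of the logistic model
nll <- function(beta, x, y) {
    p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
    -sum(y * log(p) + (1 - y) * log(1 - p))
}
optim(c(0, 0), nll, x = study$Hours, y = study$Pass)$par
## approximately -4.078 and 1.505, matching glm()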

Model

\[\log\left(\frac{p}{1-p}\right) = -4.078 + 1.505 \times \text{Hours}\]

Plotting the Results
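The figure from the original slide is not reproduced here; a minimal sketch that recreates it (observed outcomes with the fitted logistic curve, assuming ggplot2 is loaded as above) would be:

ggplot(study, aes(x = Hours, y = Pass)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = 'glm', se = FALSE, formula = y ~ x,
                method.args = list(family = binomial(link = 'logit')))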

Prediction

Odds (or probability) of passing for a student who studied zero hours?

\[\log\left(\frac{p}{1-p}\right) = -4.078 + 1.505 \times 0\] \[\frac{p}{1-p} = \exp(-4.078) = 0.0169\] \[p = \frac{0.0169}{1.0169} = 0.0166\]

Odds (or probability) of passing for a student who studied 4 hours?

\[\log\left(\frac{p}{1-p}\right) = -4.078 + 1.505 \times 4\] \[\frac{p}{1-p} = \exp(1.942) = 6.97\] \[p = \frac{6.97}{7.97} = 0.875\]
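The same values can be obtained with predict(), which returns the linear predictor (the log-odds) by default and probabilities with type = 'response':

new_data <- data.frame(Hours = c(0, 4))
predict(lr.out, newdata = new_data)                     # log-odds: approx. -4.078 and 1.942
predict(lr.out, newdata = new_data, type = 'response')  # probabilities: approx. 0.017 and 0.875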

Fitted Values

# Recreate the columns shown in the output below: the fitted probabilities and
# the predicted class using a 0.5 cutoff (this step was not shown on the slide)
study$Predict <- fitted(lr.out)
study$Predict_Pass <- study$Predict > 0.5
study[1,]
##   Hours Pass    Predict Predict_Pass
## 1   0.5    0 0.03471034        FALSE
logistic <- function(x, b0, b1) {
    return(1 / (1 + exp(-1 * (b0 + b1 * x)) ))
}
logistic(.5, b0=-4.078, b1=1.505)
## [1] 0.03470667

Of course, the fitted() function will return the same values:

fitted(lr.out)[1]
##          1 
## 0.03471034

Model Performance

The use of statistical models to predict outcomes, typically on new data, is called predictive modeling. Logistic regression is a common statistical procedure used for prediction. We will use a confusion matrix to evaluate the accuracy of these predictions.
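For example, classifying each student as passing whenever the fitted probability exceeds 0.5 (a conventional, though arbitrary, cutoff):

# Cross-tabulate observed outcomes against predicted classes (0.5 cutoff)
table(Observed = study$Pass, Predicted = fitted(lr.out) > 0.5)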