
April 29, 2020

- Please complete the course evaluation if you have not done so already.
- Springer has made a lot of data science and statistics books freely available.
- Check out springerQuarantineBooksR for automated downloading of books.

# Simulate two groups: x is a 0/1 indicator and y is drawn from a different
# normal distribution in each group (values will vary without a seed).
df <- data.frame(
  x = rep(c(0, 1), each = 10),
  y = c(rnorm(10, mean = 1, sd = 1),
        rnorm(10, mean = 2.5, sd = 1.5))
)
head(df)

##   x         y
## 1 0 1.9243372
## 2 0 0.6924972
## 3 0 0.3735868
## 4 0 0.3772007
## 5 0 1.5807124
## 6 0 1.6292041

library(psych)  # for describeBy()
# Per-group summary statistics of y, returned as a data frame
tab <- describeBy(df$y, group = df$x, mat = TRUE, skew = FALSE)
tab$group1 <- as.integer(as.character(tab$group1))

library(ggplot2)
ggplot(df, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_point(data = tab, aes(x = group1, y = mean), color = 'red', size = 4) +
  geom_smooth(method = lm, se = FALSE, formula = y ~ x)

Note that with a binary predictor, the fitted regression line passes through the two group means (the red points), so the slope equals the difference between those means.

- Christopher Bloome (9.15)

At this point we have covered:

- Simple linear regression
  - Relationship between a numerical response and a numerical or categorical predictor
- Multiple regression
  - Relationship between a numerical response and multiple numerical and/or categorical predictors

What we haven't seen is what to do when the predictors are weird (nonlinear, complicated dependence structure, etc.) or when the response is weird (categorical, count data, etc.).

Odds are another way of quantifying the probability of an event, commonly used in gambling (and logistic regression).

Odds

For some event \(E\),

\[\text{odds}(E) = \frac{P(E)}{P(E^c)} = \frac{P(E)}{1-P(E)}\]

Similarly, if we are told the odds of \(E\) are \(x\) to \(y\), then

\[\text{odds}(E) = \frac{x}{y} = \frac{x/(x+y)}{y/(x+y)} \]

which implies

\[P(E) = x/(x+y),\quad P(E^c) = y/(x+y)\]
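For example, odds of 3 to 1 correspond to

\[P(E) = \frac{3}{3+1} = 0.75, \quad \text{odds}(E) = \frac{0.75}{1 - 0.75} = 3\]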

Generalized linear models (GLMs) are a generalization of OLS that allows the response variable (i.e., the dependent variable) to have an error distribution other than the normal distribution. Logistic regression is just one type of GLM, specifically for dichotomous (binary) response variables that follow a binomial distribution.

All generalized linear models have the following three characteristics:

- A probability distribution describing the outcome variable
- A linear model:
  \(\eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n\)
- A link function that relates the linear model to the parameter of the outcome distribution:
  \(g(p) = \eta\) or \(p = g^{-1}(\eta)\)
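As a quick illustration (a sketch added here, not part of the original examples): choosing the normal distribution with the identity link reproduces OLS, which is why GLMs are described as a generalization of it. The simulated data below is arbitrary.

# A gaussian GLM with the identity link is ordinary least squares.
set.seed(2112)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)
coef(glm(y ~ x, family = gaussian(link = 'identity'), data = d))
coef(lm(y ~ x, data = d))  # same estimates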

Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors.

We assume a binomial distribution produced the outcome variable, and we therefore want to model \(p\), the probability of success, for a given set of predictors.

To finish specifying the logistic model we just need to establish a reasonable link function that connects \(\eta\) to \(p\). There are a variety of options, but the most commonly used is the logit function.

Logit function

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right),\text{ for } 0 < p < 1\]
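In R, the logit can be computed directly (a small helper added here for illustration):

logit <- function(p) {
  return(log(p / (1 - p)))
}
logit(0.5)  # 0: a 50% probability means even odds
logit(0.9)  # ~2.197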

The inverse of the logit function is the logistic (sigmoid) function:

\[\sigma(t) = \frac{e^t}{e^t + 1} = \frac{1}{1 + e^{-t}}\]

logistic <- function(t) {
  return(1 / (1 + exp(-t)))
}
df <- data.frame(x = seq(-4, 4, by = 0.01))
df$sigma_t <- logistic(df$x)
plot(df$x, df$sigma_t)  # the characteristic S-shaped curve

If we let \(t\) be a linear function of the explanatory variable \(x\),

\[ t = \beta_0 + \beta_1 x \]

The logistic function can now be rewritten as

\[F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}\]

Similar to OLS, we wish to find the parameters that best fit the data. However, instead of minimizing the sum of squared residuals, we estimate the parameters by maximizing a likelihood function.

study <- data.frame(
  Hours = c(0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
            2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50),
  Pass = c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1)
)
lr.out <- glm(Pass ~ Hours, data = study, family = binomial(link = 'logit'))
lr.out

## 
## Call:  glm(formula = Pass ~ Hours, family = binomial(link = "logit"), 
##     data = study)
## 
## Coefficients:
## (Intercept)        Hours  
##      -4.078        1.505  
## 
## Degrees of Freedom: 19 Total (i.e. Null);  18 Residual
## Null Deviance:      27.73 
## Residual Deviance: 16.06   AIC: 20.06
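To make the likelihood idea concrete, here is a minimal sketch (added here, not part of the original output) that maximizes the binomial log-likelihood directly with optim(); it recovers essentially the same coefficients that glm() reports above.

# Negative log-likelihood of the logistic model for the study data
neg_loglik <- function(beta) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * study$Hours)))
  return(-sum(study$Pass * log(p) + (1 - study$Pass) * log(1 - p)))
}
optim(c(0, 0), neg_loglik)$par  # approximately -4.078 and 1.505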

Model

\[\log\left(\frac{p}{1-p}\right) = -4.078 + 1.505 \times \text{Hours}\]

Odds (or probability) of passing if you studied **zero** hours?

\[\log\left(\frac{p}{1-p}\right) = -4.078 + 1.505 \times 0 = -4.078\]
\[\frac{p}{1-p} = \exp(-4.078) = 0.0169\]
\[p = \frac{0.0169}{1.0169} = 0.0166\]

Odds (or probability) of passing if you studied **4** hours?

\[\log\left(\frac{p}{1-p}\right) = -4.078 + 1.505 \times 4 = 1.942\]
\[\frac{p}{1-p} = \exp(1.942) = 6.97\]
\[p = \frac{6.97}{7.97} = 0.875\]
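As a quick check (added here), both hand calculations can be reproduced in R:

exp(-4.078) / (1 + exp(-4.078))  # ~0.0166 (zero hours)
eta <- -4.078 + 1.505 * 4
exp(eta) / (1 + exp(eta))        # ~0.875 (four hours)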

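The output below references Predict and Predict_Pass columns. The code that created them is not shown in these notes; based on the values, it was presumably along these lines (a reconstruction, assuming a 0.5 cutoff):

study$Predict <- fitted(lr.out)            # fitted probability of passing
study$Predict_Pass <- study$Predict > 0.5  # classify using a 0.5 cutoff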
study[1,]

##   Hours Pass    Predict Predict_Pass
## 1   0.5    0 0.03471034        FALSE

logistic <- function(x, b0, b1) {
  return(1 / (1 + exp(-1 * (b0 + b1 * x))))
}
logistic(.5, b0 = -4.078, b1 = 1.505)

## [1] 0.03470667

Of course, the `fitted` function will do the same:

fitted(lr.out)[1]

##          1 
## 0.03471034

The use of statistical models to predict outcomes, typically on new data, is called predictive modeling. Logistic regression is a common statistical procedure used for prediction. We will use a **confusion matrix** to evaluate the accuracy of the predictions.
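As a minimal sketch (using the 0.5 cutoff from above; the choice of cutoff is itself a modeling decision), a confusion matrix for the study data can be built with table():

# Rows: predicted class; columns: observed outcome
table(Predicted = study$Predict_Pass, Actual = study$Pass)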