DATA606 - Linear Regression Part 2

April 1, 2020

Presentations

8.21 Adam Gersowitz
8.23 Kevin Potter
8.25 Sung Lee

NYS Report Card

NYS publishes data for each school in the state. We will look at the grade 8 math scores for 2012 and 2013. 2013 was the first year the tests were aligned with the Common Core Standards. There was a lot of press about how the passing rates for most schools dropped. Two questions we wish to answer:

Did the passing rates drop in a predictable manner?
Were the drops different for charter and public schools?

load('../course_data/NYSReportCard-Grade7Math.Rda')
head(reportCard, n=4)

##       BEDSCODE                                  School NumTested2012 Mean2012 Pass2012 Charter
## 1 010100010020                    NORTH ALBANY ACADEMY            47      649       13   FALSE
## 2 010100010030         WILLIAM S HACKETT MIDDLE SCHOOL           212      652       30   FALSE
## 3 010100010045 STEPHEN AND HARRIET MYERS MIDDLE SCHOOL           262      670       50   FALSE
## 4 010100860867         KIPP TECH VALLEY CHARTER SCHOOL            61      684       85    TRUE
##   GradeSubject County                                BOCES NumTested2013 Mean2013 Pass2013
## 1 Grade 7 Math Albany BOCES ALBANY-SCHOH-SCHENECTADY-SARAT            45      268        0
## 2 Grade 7 Math Albany BOCES ALBANY-SCHOH-SCHENECTADY-SARAT           250      279        9
## 3 Grade 7 Math Albany BOCES ALBANY-SCHOH-SCHENECTADY-SARAT           256      284        8
## 4 Grade 7 Math Albany BOCES ALBANY-SCHOH-SCHENECTADY-SARAT            59      298        9

Descriptive Statistics

summary(reportCard$Pass2012)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   46.00   65.00   61.73   80.00  100.00

summary(reportCard$Pass2013)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.00   20.00   22.83   33.00   99.00

Histograms

melted <- melt(reportCard[,c('Pass2012', 'Pass2013')])
ggplot(melted, aes(x=value)) + geom_histogram() + facet_wrap(~ variable, ncol=1)

Log Transformation

Since the distribution of the 2013 passing rates is skewed, we can log transfor that variable to get a more reasonably normal distribution.

reportCard$LogPass2013 <- log(reportCard$Pass2013 + 1)
ggplot(reportCard, aes(x=LogPass2013)) + geom_histogram(binwidth=0.5)

Scatter Plot

ggplot(reportCard, aes(x=Pass2012, y=Pass2013, color=Charter)) + 
    geom_point(alpha=0.5) + coord_equal() + ylim(c(0,100)) + xlim(c(0,100))

Scatter Plot (log transform)

ggplot(reportCard, aes(x=Pass2012, y=LogPass2013, color=Charter)) + 
    geom_point(alpha=0.5) + xlim(c(0,100)) + ylim(c(0, log(101)))

Correlation

cor.test(reportCard$Pass2012, reportCard$Pass2013)

## 
##  Pearson's product-moment correlation
## 
## data:  reportCard$Pass2012 and reportCard$Pass2013
## t = 47.166, df = 1360, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7667526 0.8071276
## sample estimates:
##       cor 
## 0.7877848

Correlation (log transform)

cor.test(reportCard$Pass2012, reportCard$LogPass2013)

## 
##  Pearson's product-moment correlation
## 
## data:  reportCard$Pass2012 and reportCard$LogPass2013
## t = 56.499, df = 1360, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8207912 0.8525925
## sample estimates:
##       cor 
## 0.8373991

Linear Regression

lm.out <- lm(Pass2013 ~ Pass2012, data=reportCard)
summary(lm.out)

## 
## Call:
## lm(formula = Pass2013 ~ Pass2012, data = reportCard)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.484  -6.878  -0.478   5.965  51.675 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -16.68965    0.89378  -18.67   <2e-16 ***
## Pass2012      0.64014    0.01357   47.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.49 on 1360 degrees of freedom
## Multiple R-squared:  0.6206, Adjusted R-squared:  0.6203 
## F-statistic:  2225 on 1 and 1360 DF,  p-value: < 2.2e-16

Linear Regression (log transform)

lm.log.out <- lm(LogPass2013 ~ Pass2012, data=reportCard)
summary(lm.log.out)

## 
## Call:
## lm(formula = LogPass2013 ~ Pass2012, data = reportCard)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3880 -0.2531  0.0776  0.3461  2.7368 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.307692   0.046030   6.685 3.37e-11 ***
## Pass2012    0.039491   0.000699  56.499  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5915 on 1360 degrees of freedom
## Multiple R-squared:  0.7012, Adjusted R-squared:  0.701 
## F-statistic:  3192 on 1 and 1360 DF,  p-value: < 2.2e-16

Did the passing rates drop in a predictable manner?

Yes! Whether we log tranform the data or not, the correlations are statistically significant with regression models with \(R^2\) creater than 62%.

To answer the second question, whether the drops were different for public and charter schools, we’ll look at the residuals.

reportCard$residuals <- resid(lm.out)
reportCard$residualsLog <- resid(lm.log.out)

Distribution of Residuals

ggplot(reportCard, aes(x=residuals, color=Charter)) + geom_density()

Distribution of Residuals

ggplot(reportCard, aes(x=residualsLog, color=Charter)) + geom_density()

Null Hypothesis Testing

\(H_0\): There is no difference in the residuals between charter and public schools.

\(H_A\): There is a difference in the residuals between charter and public schools.

t.test(residuals ~ Charter, data=reportCard)

## 
##  Welch Two Sample t-test
## 
## data:  residuals by Charter
## t = 6.5751, df = 77.633, p-value = 5.091e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   6.411064 11.980002
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            0.479356           -8.716177

Null Hypothesis Testing (log transform)

t.test(residualsLog ~ Charter, data=reportCard)

## 
##  Welch Two Sample t-test
## 
## data:  residualsLog by Charter
## t = 4.7957, df = 74.136, p-value = 8.161e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2642811 0.6399761
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##          0.02356911         -0.42855946

Quadratic Models

It is possible to fit quatric models fairly easily in R, say of the following form:

\[ y = b_1 x^2 + b_2 x + b_0 \]

quad.out <- lm(Pass2013 ~ I(Pass2012^2) + Pass2012, data=reportCard)
summary(quad.out)$r.squared

## [1] 0.7065206

summary(lm.out)$r.squared

## [1] 0.6206049

Quadratic Models

summary(quad.out)

## 
## Call:
## lm(formula = Pass2013 ~ I(Pass2012^2) + Pass2012, data = reportCard)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.258  -4.906  -0.507   5.430  43.509 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.0466153  1.4263773   4.940 8.77e-07 ***
## I(Pass2012^2)  0.0092937  0.0004659  19.946  < 2e-16 ***
## Pass2012      -0.3972481  0.0533631  -7.444 1.72e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.11 on 1359 degrees of freedom
## Multiple R-squared:  0.7065, Adjusted R-squared:  0.7061 
## F-statistic:  1636 on 2 and 1359 DF,  p-value: < 2.2e-16

Scatter Plot

ggplot(reportCard, aes(x=Pass2012, y=Pass2013)) + 
    geom_point(alpha=0.2) + 
    geom_smooth(method='lm', formula=y~poly(x,2,raw=TRUE), size=3, se=FALSE) +
    coord_equal() + ylim(c(0,100)) + xlim(c(0,100))

Shiny App

shiny::runGitHub('NYSchools','jbryer',subdir='NYSReportCard')

See also the Github repository for more information: https://github.com/jbryer/NYSchools