- 1.9 Angel Claudio
- 1.19 Aysmel Aguasvivas
- 1.23 TJ Parker
- 1.43 Justin Hsi
February 5, 2020
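We will use the lego R package in this class, which is described as containing information about every Lego set manufactured from 1970 to 2014, a total of 5710 sets. Note that the version of the data loaded below is larger: str() reports 6172 rows, including sets from 2015.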
devtools::install_github("seankross/lego")
library(lego)
data(legosets)
str(legosets)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6172 obs. of  14 variables:
##  $ Item_Number : chr  "10246" "10247" "10248" "10249" ...
##  $ Name        : chr  "Detective's Office" "Ferris Wheel" "Ferrari F40" "Toy Shop" ...
##  $ Year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ Theme       : chr  "Advanced Models" "Advanced Models" "Advanced Models" "Advanced Models" ...
##  $ Subtheme    : chr  "Modular Buildings" "Fairground" "Vehicles" "Winter Village" ...
##  $ Pieces      : int  2262 2464 1158 898 13 39 32 105 13 11 ...
##  $ Minifigures : int  6 10 NA NA 1 2 2 3 2 2 ...
##  $ Image_URL   : chr  "http://images.brickset.com/sets/images/10246-1.jpg" "http://images.brickset.com/sets/images/10247-1.jpg" "http://images.brickset.com/sets/images/10248-1.jpg" "http://images.brickset.com/sets/images/10249-1.jpg" ...
##  $ GBP_MSRP    : num  132.99 149.99 69.99 59.99 9.99 ...
##  $ USD_MSRP    : num  159.99 199.99 99.99 79.99 9.99 ...
##  $ CAD_MSRP    : num  200 230 120 NA 13 ...
##  $ EUR_MSRP    : num  149.99 179.99 89.99 69.99 9.99 ...
##  $ Packaging   : chr  "Box" "Box" "Box" "Box" ...
##  $ Availability: chr  "Retail - limited" "Retail - limited" "LEGO exclusive" "LEGO exclusive" ...
Descriptive statistics:
- Contingency table
- Proportion table
Plot types:
- Bar plot
table(legosets$Availability, useNA='ifany')
##
##        LEGO exclusive    LEGOLAND exclusive         Not specified
##                   695                     2                  1795
##           Promotional Promotional (Airline)                Retail
##                   141                    12                  3120
##      Retail - limited               Unknown
##                   403                     4
table(legosets$Availability, legosets$Packaging, useNA='ifany')
##
##                         Blister pack  Box Box with backing card Bucket Canister
##   LEGO exclusive                  45  147                     0      1        0
##   LEGOLAND exclusive               0    2                     0      0        0
##   Not specified                    0   20                     0      0        0
##   Promotional                      0   44                     0      0        0
##   Promotional (Airline)            0   11                     0      0        0
##   Retail                          53 2575                    16     30       78
##   Retail - limited                 2  302                     1      5        0
##   Unknown                          0    1                     0      0        0
##
##                         Foil pack Loose Parts Not specified Other Plastic box
##   LEGO exclusive                0          71             7     5           1
##   LEGOLAND exclusive            0           0             0     0           0
##   Not specified                 5           0          1739     0           6
##   Promotional                   0           1             0     3           2
##   Promotional (Airline)         0           0             1     0           0
##   Retail                      285           0             0    28           0
##   Retail - limited              1           0             0     0           1
##   Unknown                       0           0             0     0           0
##
##                         Polybag Shrink-wrapped Tag Tub
##   LEGO exclusive            412              0   6   0
##   LEGOLAND exclusive          0              0   0   0
##   Not specified              24              0   0   1
##   Promotional                90              0   0   1
##   Promotional (Airline)       0              0   0   0
##   Retail                      4             18   0  33
##   Retail - limited           86              0   0   5
##   Unknown                     3              0   0   0
prop.table(table(legosets$Availability))
##
##        LEGO exclusive    LEGOLAND exclusive         Not specified
##          0.1126053143          0.0003240441          0.2908295528
##           Promotional Promotional (Airline)                Retail
##          0.0228451069          0.0019442644          0.5055087492
##      Retail - limited               Unknown
##          0.0652948801          0.0006480881
barplot(table(legosets$Availability), las=3)
barplot(prop.table(table(legosets$Availability)), las=3)
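For a two-way table, prop.table() can also compute proportions within rows or within columns through its margin argument. The following is a minimal sketch on made-up data (the vectors availability and packaging are hypothetical stand-ins, not the legosets columns):

```r
# Hypothetical toy data (not taken from the legosets data frame).
availability <- c("Retail", "Retail", "Promotional", "Retail", "Promotional", "Retail")
packaging    <- c("Box", "Box", "Polybag", "Polybag", "Polybag", "Box")

tab <- table(availability, packaging)

overall_props <- prop.table(tab)              # each cell as a share of all observations
row_props     <- prop.table(tab, margin = 1)  # proportions within each row (rows sum to 1)
col_props     <- prop.table(tab, margin = 2)  # proportions within each column (columns sum to 1)

row_props
```

Row proportions answer "how are Retail sets packaged?", while column proportions answer "which availability categories use polybags?".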
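Descriptive statistics:
- Mean, median, variance, standard deviation, five-number summary, IQR
Plot types:
- Strip chart, histogram, density plot, box plot, scatterplot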
mean(legosets$Pieces, na.rm=TRUE)
## [1] 215.1686
median(legosets$Pieces, na.rm=TRUE)
## [1] 82
var(legosets$Pieces, na.rm=TRUE)
## [1] 126876.8
sqrt(var(legosets$Pieces, na.rm=TRUE))
## [1] 356.1976
sd(legosets$Pieces, na.rm=TRUE)
## [1] 356.1976
fivenum(legosets$Pieces, na.rm=TRUE)
## [1] 0.0 30.0 82.0 256.5 5922.0
IQR(legosets$Pieces, na.rm=TRUE)
## [1] 226.25
summary Function
summary(legosets$Pieces)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     0.0    30.0    82.0   215.2   256.2  5922.0     112
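The upper quartile differs slightly between the outputs above (256.5 from fivenum(), 256.2 from summary()). This is expected: fivenum() returns Tukey's hinges, while summary() and IQR() rely on quantile() with its default interpolation. A small sketch on hypothetical toy data shows the same effect:

```r
# Hypothetical toy data chosen so the two definitions disagree.
x <- 1:8

fivenum(x)                  # min, lower hinge, median, upper hinge, max
quantile(x, c(0.25, 0.75))  # quartiles with the default interpolation (type 7)

hinges    <- fivenum(x)[c(2, 4)]
quartiles <- unname(quantile(x, c(0.25, 0.75)))

# Spread measured each way
diff(hinges)  # distance between hinges
IQR(x)        # interquartile range based on quantile()
```

For large samples the two definitions are usually close, so the choice rarely changes the interpretation.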
psych Package
library(psych)
describe(legosets$Pieces, skew=FALSE)
##    vars    n   mean    sd min  max range   se
## X1    1 6060 215.17 356.2   0 5922  5922 4.58
describeBy(legosets$Pieces, group = legosets$Availability, skew=FALSE, mat=TRUE)
##     item                group1 vars    n      mean        sd min  max range
## X11    1        LEGO exclusive    1  659 172.74203 442.96954   1 3428  3427
## X12    2    LEGOLAND exclusive    1    2 211.00000 154.14928 102  320   218
## X13    3         Not specified    1 1747 145.87178 309.19929   1 5195  5194
## X14    4           Promotional    1  140  53.97143 108.42721   1 1000   999
## X15    5 Promotional (Airline)    1   12 126.16667  47.01612  10  203   193
## X16    6                Retail    1 3094 245.78119 294.78052   0 3803  3803
## X17    7      Retail - limited    1  402 410.94030 652.06435   1 5922  5921
## X18    8               Unknown    1    4  27.50000  15.96872   6   44    38
##             se
## X11  17.255643
## X12 109.000000
## X13   7.397620
## X14   9.163772
## X15  13.572384
## X16   5.299546
## X17  32.522014
## X18   7.984360
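Median and IQR are more robust to skewness and outliers than mean and SD. Therefore, for skewed distributions it is often more helpful to describe center and spread with the median and IQR, while for roughly symmetric distributions the mean and SD work well.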
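A minimal sketch of what robustness means here, using hypothetical toy values rather than the Lego data: adding a single extreme observation moves the mean and SD substantially, while the median and IQR barely change.

```r
# Hypothetical toy values (not from legosets).
x     <- c(10, 12, 14, 16, 18)
x_out <- c(x, 500)  # the same values plus one extreme observation

compare <- rbind(
  original     = c(mean = mean(x),     sd = sd(x),     median = median(x),     IQR = IQR(x)),
  with_outlier = c(mean = mean(x_out), sd = sd(x_out), median = median(x_out), IQR = IQR(x_out))
)
round(compare, 2)
```

In this toy case the mean jumps from 14 to 95, while the median only moves from 14 to 15.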
stripchart(legosets$Pieces)
par.orig <- par(mar=c(1,10,1,1))
stripchart(legosets$Pieces ~ legosets$Availability, las=1)
par(par.orig)
hist(legosets$Pieces)
With highly skewed distributions, it is often helpful to transform the data. The log transformation is a common approach, especially when dealing with salary or similar data.
hist(log(legosets$Pieces))
plot(density(legosets$Pieces, na.rm=TRUE), main='Lego Pieces per Set')
plot(density(log(legosets$Pieces), na.rm=TRUE), main='Lego Pieces per Set (log transformed)')
boxplot(legosets$Pieces)
boxplot(log(legosets$Pieces))
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group
## == : Outlier (-Inf) in boxplot 1 is not drawn
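The warning above appears because some sets have zero pieces, and log(0) is -Inf in R. Two common workarounds are sketched below on hypothetical values; which one is appropriate (if either) is a judgment call, since both change what is being plotted.

```r
# Hypothetical piece counts, including a zero (not the real legosets values).
pieces <- c(0, 30, 82, 256, 5922)

log(pieces)  # the zero becomes -Inf

# Option 1: drop non-positive values before taking the log.
log_positive <- log(pieces[pieces > 0])

# Option 2: use log1p(), i.e. log(1 + x), which is defined at zero.
log_shifted <- log1p(pieces)

boxplot(log_positive, main = 'log(Pieces), zeros dropped')
boxplot(log_shifted, main = 'log(1 + Pieces)')
```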
plot(legosets$Pieces, legosets$USD_MSRP)
legosets[which(legosets$USD_MSRP >= 400),]
##      Item_Number                                   Name Year        Theme
## 901      2000430             Identity and Landscape Kit 2013 Serious Play
## 902      2000431                        Connections Kit 2013 Serious Play
## 2050     2000409                 Window Exploration Bag 2010 Serious Play
## 2852       10179 Ultimate Collector's Millennium Falcon 2007    Star Wars
##                       Subtheme Pieces Minifigures
## 901                                NA           6
## 902                              2455          NA
## 2050                             4900          NA
## 2852 Ultimate Collector Series   5195           5
##                                                 Image_URL GBP_MSRP USD_MSRP
## 901  http://images.brickset.com/sets/images/2000430-1.jpg   509.99   789.99
## 902  http://images.brickset.com/sets/images/2000431-1.jpg   490.18   754.99
## 2050 http://images.brickset.com/sets/images/2000409-1.jpg   314.99   484.99
## 2852   http://images.brickset.com/sets/images/10179-1.jpg   342.49   499.99
##      CAD_MSRP EUR_MSRP     Packaging  Availability
## 901    789.99   699.99 Not specified Not specified
## 902    754.99   559.99 Not specified Not specified
## 2050   484.99   359.99 Not specified Not specified
## 2852       NA       NA Not specified Not specified
legosets[which(legosets$Pieces >= 4000),]
##      Item_Number                                   Name Year           Theme
## 2047       10214                           Tower Bridge 2010 Advanced Models
## 2050     2000409                 Window Exploration Bag 2010    Serious Play
## 2628       10189                              Taj Mahal 2008 Advanced Models
## 2852       10179 Ultimate Collector's Millennium Falcon 2007       Star Wars
##                       Subtheme Pieces Minifigures
## 2047                 Buildings   4287          NA
## 2050                             4900          NA
## 2628                 Buildings   5922          NA
## 2852 Ultimate Collector Series   5195           5
##                                                 Image_URL GBP_MSRP USD_MSRP
## 2047   http://images.brickset.com/sets/images/10214-1.jpg   209.99   239.99
## 2050 http://images.brickset.com/sets/images/2000409-1.jpg   314.99   484.99
## 2628   http://images.brickset.com/sets/images/10189-1.jpg   199.99   299.99
## 2852   http://images.brickset.com/sets/images/10179-1.jpg   342.49   499.99
##      CAD_MSRP EUR_MSRP     Packaging     Availability
## 2047   299.99   219.99           Box Retail - limited
## 2050   484.99   359.99 Not specified    Not specified
## 2628   399.99       NA           Box Retail - limited
## 2852       NA       NA Not specified    Not specified
plot(legosets$Pieces, legosets$USD_MSRP)
bigAndExpensive <- legosets[which(legosets$Pieces >= 4000 | legosets$USD_MSRP >= 400),]
text(bigAndExpensive$Pieces, bigAndExpensive$USD_MSRP, labels=bigAndExpensive$Name)
There is only one pie chart in OpenIntro Statistics (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts that represent the preference of five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.
“There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart”
John Tukey
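To make the comparison concrete, the sketch below draws the same made-up preference data as pie charts and as bar plots. The numbers in prefs are hypothetical and are not the data behind the slide's figures; they are only chosen so that the groups differ by a few percentage points.

```r
# Hypothetical preference percentages for five colors in three groups (made-up data).
prefs <- rbind(
  A = c(red = 17, blue = 18, green = 20, yellow = 22, purple = 23),
  B = c(red = 20, blue = 20, green = 19, yellow = 21, purple = 20),
  C = c(red = 23, blue = 22, green = 20, yellow = 18, purple = 17)
)

op <- par(mfrow = c(2, 3))  # top row: pie charts, bottom row: bar plots
for (g in rownames(prefs)) {
  pie(prefs[g, ], main = paste('Group', g))
}
for (g in rownames(prefs)) {
  barplot(prefs[g, ], main = paste('Group', g), las = 3, ylim = c(0, 25))
}
par(op)
```

Differences of a few points are hard to judge from wedge angles but are easier to see as differences in bar height against a common axis.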
ggplot2
- ggplot2 is an R package that provides an alternative framework based upon Wilkinson’s (2005) Grammar of Graphics.
- ggplot2 is, in general, more flexible for creating “prettier” and complex plots.
- ggplot2 has at least three ways of creating plots:
  - qplot
  - ggplot(...) + geom_XXX(...) + ...
  - ggplot(...) + layer(...)
library(ggplot2)
data(diamonds)
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point()
ggplot2 Statement
- Data: ggplot(myDataFrame, aes(x=x, y=y))
- Layers: geom_point(), geom_histogram()
- Facets: facet_wrap(~ cut), facet_grid(~ cut)
- Scales: scale_y_log10()
- Other options: ggtitle('my title'), ylim(c(0, 10000)), xlab('x-axis label')
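The pieces listed above are combined with +. The following sketch puts one of each kind into a single statement, using the diamonds data that ships with ggplot2; the particular title, label, and alpha value are illustrative choices, not part of the original slides.

```r
library(ggplot2)
data(diamonds)

p <- ggplot(diamonds, aes(x = carat, y = price)) +  # data and aesthetic mappings
  geom_point(alpha = 0.2) +                         # layer: one point per diamond
  facet_wrap(~ cut) +                               # facets: one panel per cut
  scale_y_log10() +                                 # scale: log-transformed y axis
  ggtitle('Diamond price by carat') +               # other options: title
  xlab('Carat')                                     # other options: axis label

print(p)
```

Because the plot is an ordinary R object, it can be stored in a variable, extended with further + calls, and printed later.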
ls('package:ggplot2')[grep('geom_', ls('package:ggplot2'))]
##  [1] "geom_abline"          "geom_area"            "geom_bar"
##  [4] "geom_bin2d"           "geom_blank"           "geom_boxplot"
##  [7] "geom_col"             "geom_contour"         "geom_count"
## [10] "geom_crossbar"        "geom_curve"           "geom_density"
## [13] "geom_density_2d"      "geom_density2d"       "geom_dotplot"
## [16] "geom_errorbar"        "geom_errorbarh"       "geom_freqpoly"
## [19] "geom_hex"             "geom_histogram"       "geom_hline"
## [22] "geom_jitter"          "geom_label"           "geom_line"
## [25] "geom_linerange"       "geom_map"             "geom_path"
## [28] "geom_point"           "geom_pointrange"      "geom_polygon"
## [31] "geom_qq"              "geom_qq_line"         "geom_quantile"
## [34] "geom_raster"          "geom_rect"            "geom_ribbon"
## [37] "geom_rug"             "geom_segment"         "geom_sf"
## [40] "geom_sf_label"        "geom_sf_text"         "geom_smooth"
## [43] "geom_spoke"           "geom_step"            "geom_text"
## [46] "geom_tile"            "geom_violin"          "geom_vline"
## [49] "update_geom_defaults"
ggplot(legosets, aes(x=Pieces, y=USD_MSRP)) + geom_point()
ggplot(legosets, aes(x=Pieces, y=USD_MSRP, color=Availability)) + geom_point()
ggplot(legosets, aes(x=Pieces, y=USD_MSRP, size=Minifigures, color=Availability)) + geom_point()
ggplot(legosets, aes(x=Pieces, y=USD_MSRP, size=Minifigures)) + geom_point() + facet_wrap(~ Availability)
ggplot(legosets, aes(x='Lego', y=USD_MSRP)) + geom_boxplot()
ggplot(legosets, aes(x=Availability, y=USD_MSRP)) + geom_boxplot()
ggplot(legosets, aes(x=Availability, y=USD_MSRP)) + geom_boxplot() + coord_flip()
Likert scales are a type of questionnaire where respondents are asked to rate items on scales usually ranging from four to seven levels (e.g. strongly disagree to strongly agree).
library(likert)
library(reshape)
data(pisaitems)
items24 <- pisaitems[,substr(names(pisaitems), 1,5) == 'ST24Q']
items24 <- rename(items24, c(
    ST24Q01="I read only if I have to.",
    ST24Q02="Reading is one of my favorite hobbies.",
    ST24Q03="I like talking about books with other people.",
    ST24Q04="I find it hard to finish books.",
    ST24Q05="I feel happy if I receive a book as a present.",
    ST24Q06="For me, reading is a waste of time.",
    ST24Q07="I enjoy going to a bookstore or a library.",
    ST24Q08="I read only to get information that I need.",
    ST24Q09="I cannot sit still and read for more than a few minutes.",
    ST24Q10="I like to express my opinions about books I have read.",
    ST24Q11="I like to exchange books with my friends."))
likert R Package
l24 <- likert(items24)
summary(l24)
##                                                        Item      low neutral
## 10   I like to express my opinions about books I have read. 41.07516       0
## 5            I feel happy if I receive a book as a present. 46.93475       0
## 8               I read only to get information that I need. 50.39874       0
## 7                I enjoy going to a bookstore or a library. 51.21231       0
## 3             I like talking about books with other people. 54.99129       0
## 11                I like to exchange books with my friends. 55.54115       0
## 2                    Reading is one of my favorite hobbies. 56.64470       0
## 1                                 I read only if I have to. 58.72868       0
## 4                           I find it hard to finish books. 65.35125       0
## 9  I cannot sit still and read for more than a few minutes. 76.24524       0
## 6                       For me, reading is a waste of time. 82.88729       0
##        high     mean        sd
## 10 58.92484 2.604913 0.9009968
## 5  53.06525 2.466751 0.9446590
## 8  49.60126 2.484616 0.9089688
## 7  48.78769 2.428508 0.9164136
## 3  45.00871 2.328049 0.9090326
## 11 44.45885 2.343193 0.9609234
## 2  43.35530 2.344530 0.9277495
## 1  41.27132 2.291811 0.9369023
## 4  34.64875 2.178299 0.8991628
## 9  23.75476 1.974736 0.8793028
## 6  17.11271 1.810093 0.8611554
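The low and high columns in the summary above can be read as the percentage of responses in the lower and upper halves of the scale. The base R sketch below reproduces that kind of summary for one hypothetical four-level item; it assumes this interpretation of low and high and does not use the likert package itself.

```r
# Hypothetical responses to a single four-level item (made-up data).
scale_levels <- c('Strongly disagree', 'Disagree', 'Agree', 'Strongly agree')
responses <- factor(
  c('Strongly disagree', 'Disagree', 'Disagree', 'Agree', 'Agree',
    'Agree', 'Strongly agree', 'Strongly agree', 'Disagree', 'Agree'),
  levels = scale_levels, ordered = TRUE
)

scores <- as.integer(responses)  # 1 = strongly disagree, ..., 4 = strongly agree

low  <- 100 * mean(scores <= 2)  # percent in the two disagree categories
high <- 100 * mean(scores >= 3)  # percent in the two agree categories

c(low = low, high = high, mean = mean(scores), sd = sd(scores))
```

With an even number of levels there is no middle category, which is consistent with the neutral column being 0 in the output above.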
likert Plots
plot(l24)
likert Plots
plot(l24, type='heat')
likert Plots
plot(l24, type='density')
Some problems [1]:
This example looks at the relationship between NZ dollar exchange rate and trade weighted index.
library(DATA606)
shiny_demo('DualScales', package='DATA606')
My advice:
[1] http://blog.revolutionanalytics.com/2016/08/dual-axis-time-series.html
[2] http://ellisp.github.io/blog/2016/08/18/dualaxes