Import: First you must import your data into R.
Tidy: Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation.
Transform: Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).
Wrangling: Together, tidying and transforming are called wrangling, because getting your data in a form that’s natural to work with often feels like a fight!
Two main engines of knowledge generation: visualisation and modelling.
Communication: The last and most critical step.
Programming: Cuts across all aspects of the project.
# install.packages(c("tidyverse","nycflights13", "gapminder", "Lahman"))
lapply(c("tidyverse","nycflights13", "gapminder", "Lahman"), library, character.only = TRUE)
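As a rough illustration of the tidy and transform steps described above, the sketch below reshapes a small made-up table into tidy form, then filters, mutates, and summarises the `mpg` data. The `untidy` tibble and the `hwy_kpl` column are hypothetical and exist only for this example.

```r
library(tidyverse)

# Hypothetical untidy table: one column per year instead of a `year` variable
untidy <- tribble(
  ~city, ~`1999`, ~`2008`,
  "A",   10,      12,
  "B",   20,      25
)

# Tidy: each column is a variable, each row is an observation
tidy <- untidy %>%
  pivot_longer(cols = c(`1999`, `2008`), names_to = "year", values_to = "value")

# Transform: keep observations of interest, derive a variable, then summarise
mpg_summary <- mpg %>%
  filter(year == 2008) %>%              # narrow to one model year
  mutate(hwy_kpl = hwy * 0.4251) %>%    # miles per US gallon -> km per litre (approx.)
  group_by(class) %>%
  summarise(n = n(), mean_hwy_kpl = mean(hwy_kpl))

mpg_summary
```

Each verb handles one of the transformations listed earlier: `filter()` narrows the observations, `mutate()` creates a new variable from an existing one, and `summarise()` computes counts and means.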
Let's explore some useful tools that have an immediate payoff:
Using the mpg example dataset on cars from ggplot2, let's try to answer the following questions:
Do cars with big engines use more fuel than cars with small engines?
What does the relationship between engine size and fuel efficiency look like?
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manual~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manual~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto(a~ f 21 30 p comp~
## 5 audi a4 2.8 1999 6 auto(l~ f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manual~ f 18 26 p comp~
## 7 audi a4 3.1 2008 6 auto(a~ f 18 27 p comp~
## 8 audi a4 quat~ 1.8 1999 4 manual~ 4 18 26 p comp~
## 9 audi a4 quat~ 1.8 1999 4 auto(l~ 4 16 25 p comp~
## 10 audi a4 quat~ 2 2008 4 manual~ 4 20 28 p comp~
## # ... with 224 more rows
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, level = 0.95)
The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy): cars with bigger engines tend to use more fuel.
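The direction of that relationship can also be checked numerically. The following is a minimal sketch, assuming a straight-line fit is an adequate summary (the same model `geom_smooth(method = "lm")` draws); the name `fit` is introduced here only for illustration.

```r
library(ggplot2)  # provides the mpg dataset

# Fit hwy as a linear function of displ, matching geom_smooth(method = "lm")
fit <- lm(hwy ~ displ, data = mpg)
coef(fit)  # a negative slope on displ means lower hwy for larger engines

# Pearson correlation between engine size and highway mileage
cor(mpg$displ, mpg$hwy)
```

A negative slope and a negative correlation both agree with what the scatterplot suggests.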
ggplot(data = mpg, mapping = aes(x = hwy, y = cyl)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, level = 0.95)
The plot shows a negative relationship between the number of cylinders (cyl) and fuel efficiency (hwy).
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(method = "lm", se = TRUE, level = 0.95)
## `geom_smooth()` using formula 'y ~ x'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = displ < 5)) +
  geom_smooth(method = "lm", se = TRUE, level = 0.95)
## `geom_smooth()` using formula 'y ~ x'
Faceting by class gives a clear view of the class variable, but the linear regression line gave more insight than the facets.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = displ < 5)) +
  facet_wrap(~ class, nrow = 2)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = hwy < 25)) +
  facet_grid(cyl ~ drv)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth()
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()
# after_stat() replaces the deprecated stat() helper (ggplot2 >= 3.3.0)
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
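The `prop` values drawn by the bar chart above are computed by the stat behind `geom_bar()`. As a hedged sketch of the same calculation done by hand with dplyr (the name `cut_props` is hypothetical):

```r
library(tidyverse)

# Count diamonds per cut, then convert the counts to proportions of the total
cut_props <- diamonds %>%
  count(cut) %>%
  mutate(prop = n / sum(n))

cut_props
```

With `group = 1` all cuts are treated as a single group, so the proportions are taken over the whole dataset and add up to 1.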
demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)
ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, colour = cut))
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
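Jittering adds a small amount of random noise to each point so that points sharing the same position spread apart. One quick way to see how much overlap there is in `mpg`, sketched under the assumption that the overlap comes from repeated `displ` and `hwy` values (`n_positions` and `n_cars` are names introduced for this example):

```r
library(tidyverse)

# Number of distinct (displ, hwy) positions versus number of cars
n_positions <- mpg %>%
  distinct(displ, hwy) %>%
  nrow()
n_cars <- nrow(mpg)

c(positions = n_positions, cars = n_cars)
```

If there are fewer distinct positions than cars, an unjittered scatterplot hides some observations behind others.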
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()
# map_data() needs the maps package: install.packages("maps")
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()
bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()