ENST/MRNE 222 Environmental Data Analysis and Visualization
[1] "name" "sale" "lot"
[4] "position" "dealer" "year"
[7] "origin_author" "origin_cat" "school_pntg"
[10] "diff_origin" "logprice" "price"
[13] "count" "subject" "authorstandard"
[16] "artistliving" "authorstyle" "author"
[19] "winningbidder" "winningbiddertype" "endbuyer"
[22] "interm" "type_intermed" "height_in"
[25] "width_in" "surface_rect" "diam_in"
[28] "surface_rnd" "shape" "surface"
[31] "material" "mat" "material_cat"
[34] "quantity" "nfigures" "engraved"
[37] "original" "prevcoll" "othartist"
[40] "paired" "figures" "finished"
[43] "lrgfont" "relig" "lands_all"
[46] "lands_sc" "lands_elem" "lands_figs"
[49] "lands_ment" "arch" "mytho"
[52] "peasant" "othgenre" "singlefig"
[55] "portrait" "still_life" "discauth"
[58] "history" "allegory" "pastorale"
[61] "other"
“Two paintings very rich in composition, of a beautiful execution, and whose merit is very remarkable, each 17 inches 3 lines high, 23 inches wide; the first, painted on wood, comes from the Cabinet of Madame la Comtesse de Verrue; it represents a departure for the hunt: it shows in the front a child on a white horse, a man who gives the horn to gather the dogs, a falconer and other figures nicely distributed across the width of the painting; two horses drinking from a fountain; on the right in the corner a lovely country house topped by a terrace, on which people are at the table, others who play instruments; trees and fabriques pleasantly enrich the background.”
As width increases, height increases:
If we know a painting’s width, we can use the equation of the line to calculate its expected height.
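As a minimal sketch of that calculation, the code below fits a line to simulated data and computes an expected height for one width, first by hand and then with predict(). The data frame toy, its coefficients, and the width of 30 inches are made up for illustration; they are not the auction data. Base R's lm() is used here because it is the default engine behind linear_reg().

```r
# Simulated widths and heights (in inches); NOT the auction data.
set.seed(222)
toy <- data.frame(width_in = runif(50, min = 5, max = 60))
toy$height_in <- 4 + 0.8 * toy$width_in + rnorm(50, sd = 3)

# Fit the straight line: height_in = intercept + slope * width_in
toy_fit <- lm(height_in ~ width_in, data = toy)
intercept <- coef(toy_fit)[["(Intercept)"]]
slope <- coef(toy_fit)[["width_in"]]

# Expected height of a painting that is 30 inches wide, computed by hand...
by_hand <- intercept + slope * 30

# ...and with predict(), which applies the same equation.
by_predict <- predict(toy_fit, newdata = data.frame(width_in = 30))
```

Both by_hand and by_predict hold the same number, because predict() simply plugs the new width into the fitted equation.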
geom_smooth(method = "lm") draws a straight line through your data.
By default, geom_smooth() also draws a 95% confidence band around the line: a range of plausible values for the average height at each width, not a range expected to contain 95% of individual paintings.
geom_smooth(method = "lm", se = FALSE) removes the confidence band from the plot.
geom_smooth(method = "gam") fits the line using a generalized additive model (GAM), which allows for non-linear relationships using smoothing functions.
geom_smooth(method = "loess") uses locally estimated scatterplot smoothing (loess), another method that allows for non-linear relationships.
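The smoothing options listed above can be compared side by side. This is a sketch on simulated data (the data frame toy is hypothetical, not the auction data); the formula arguments are written out explicitly so that each layer states what it fits.

```r
library(ggplot2)

# Simulated widths and heights (in inches); NOT the auction data.
set.seed(222)
toy <- data.frame(width_in = runif(100, min = 5, max = 60))
toy$height_in <- 4 + 0.8 * toy$width_in + rnorm(100, sd = 3)

base_plot <- ggplot(toy, aes(x = width_in, y = height_in)) +
  geom_point(alpha = 0.4)

# Straight line with the default 95% confidence band
p_lm <- base_plot + geom_smooth(method = "lm", formula = y ~ x)

# Straight line without the confidence band
p_lm_no_ci <- base_plot + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Generalized additive model (uses the mgcv package)
p_gam <- base_plot + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))

# Locally estimated scatterplot smoothing
p_loess <- base_plot + geom_smooth(method = "loess", formula = y ~ x)
```

Printing any of the four objects (for example, p_loess) draws the corresponding plot.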
Response variable: Variable whose behavior or variation you are trying to understand; plotted on the y-axis. Also called the dependent variable.
Explanatory variables: Other variable(s) used to explain the variation in the response; plotted on the x-axis. Also called independent or predictor variables.
Predicted value: Output of the model function
Residuals: A measure of how far each observed value is from its predicted value (based on a particular model)
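To make these terms concrete, here is a small sketch on simulated data (toy is hypothetical, not the auction data). It uses base R's lm(), the default engine behind linear_reg(), and computes each residual as the observed value minus the predicted value.

```r
# Simulated widths and heights (in inches); NOT the auction data.
set.seed(222)
toy <- data.frame(width_in = runif(50, min = 5, max = 60))
toy$height_in <- 4 + 0.8 * toy$width_in + rnorm(50, sd = 3)

# Response: height_in; explanatory variable: width_in
toy_fit <- lm(height_in ~ width_in, data = toy)

# Predicted value: the output of the model function for each observed width
toy$predicted <- fitted(toy_fit)

# Residual: observed value minus predicted value
toy$residual <- toy$height_in - toy$predicted

# TRUE when the observed height lies above the fitted line
toy$above_line <- toy$residual > 0
```

The hand-computed residual column matches what resid(toy_fit) returns, and for a least-squares line with an intercept the residuals sum to zero (up to rounding).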
library(tidyverse)
library(tidymodels)

# Fit height as a linear function of width (pp holds the paintings data)
ht_wt_fit <- linear_reg() |>
  fit(height_in ~ width_in, data = pp)
ht_wt_fit_tidy <- tidy(ht_wt_fit$fit)
# Add fitted values and residuals; res_cat is TRUE for points above the line
ht_wt_fit_aug <- augment(ht_wt_fit$fit) |>
  mutate(res_cat = .resid > 0)
# Color each point by the sign of its residual and overlay the fitted line
ggplot(data = ht_wt_fit_aug) +
  geom_point(aes(x = width_in, y = height_in, color = res_cat)) +
  geom_line(aes(x = width_in, y = .fitted), linewidth = 0.75, color = "#8E2C90") +
  labs(
    title = "Height vs. width of paintings",
    subtitle = "Paris auctions, 1764 - 1780",
    x = "Width (in)",
    y = "Height (in)"
  ) +
  guides(color = "none") +
  scale_color_manual(values = c("#260b27", "#e6b0e7")) +
  annotate("text", x = 0, y = 150, label = "Positive residual", color = "#e6b0e7", hjust = 0, size = 8) +
  annotate("text", x = 150, y = 25, label = "Negative residual", color = "#260b27", hjust = 0, size = 8)
The plot below displays the relationship between height and width of paintings. The only difference from the previous plots is that it uses a smaller alpha value. What feature is apparent now that was not (as) obvious in the previous plots? What might be the reason for this feature?
Landscape painting is the depiction in art of landscapes – natural scenery such as mountains, valleys, trees, rivers, and forests, especially where the main subject is a wide view – with its elements arranged into a coherent composition.1
Landscape paintings tend to be wider than they are tall.
Portrait painting is a genre in painting, where the intent is to depict a human subject.2
Portrait paintings tend to be taller than they are wide.
How would you expect the relationship between painting width and height to vary depending on whether or not a painting has any landscape elements?
ggplot(data = pp, aes(x = width_in, y = height_in, color = factor(lands_all))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  labs(
    title = "Height vs. width of paintings, by landscape features",
    subtitle = "Paris auctions, 1764 - 1780",
    x = "Width (inches)",
    y = "Height (inches)",
    color = "Landscape"
  ) +
  scale_color_manual(values = c("#E48957", "#071381"))
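When color is mapped to a grouping variable, geom_smooth(method = "lm") fits one line per group. The sketch below reproduces that numerically by fitting a separate line to each group of a simulated data set. The data frame toy and the slopes built into it (1.1 without landscape features, 0.6 with) are assumptions chosen for illustration, not estimates from the auction data.

```r
# Simulated data; NOT the auction data.
set.seed(222)
n <- 100
toy <- data.frame(
  width_in = runif(n, min = 5, max = 60),
  lands_all = rep(c(0, 1), each = n / 2)
)

# Assumed pattern: paintings with landscape features gain less height per inch of width
toy$height_in <- ifelse(
  toy$lands_all == 1,
  3 + 0.6 * toy$width_in,
  3 + 1.1 * toy$width_in
) + rnorm(n, sd = 3)

# Fit one straight line per group, as the colored geom_smooth() layer does
fits <- lapply(
  split(toy, toy$lands_all),
  function(d) coef(lm(height_in ~ width_in, data = d))
)

# Slope of each group's line, named by the value of lands_all
slopes <- sapply(fits, function(b) b[["width_in"]])
```

Comparing slopes[["0"]] with slopes[["1"]] shows how much the height-width relationship differs between the two groups.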
Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modeling over simple visual inspection of data.
There is a real risk, however, that a model is imposing structure that is not really there on the scatter of data, just as people imagine animal shapes in the stars. A skeptical approach is always warranted.
The variation around the model is just as important as the model, if not more!
Statistics is the explanation of variation in the context of what remains unexplained.
The scatter suggests that there might be other factors that account for large parts of painting-to-painting variability, or perhaps just that randomness plays a big role.
Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We’ll talk more about this later.)
Explanation: Characterize the relationship between \(y\) and \(x\) via slopes for numerical explanatory variables or differences for categorical explanatory variables
Prediction: Plug in \(x\), get the predicted \(y\)
Slope: Two numeric variables; for every unit change in x, y changes by … units (the slope gives us …)
Difference: 1 numeric, 1 categorical variable; level a of the categorical variable (e.g., pumpkin) is, on average, … units higher/lower than level b of the categorical variable (e.g., sunflower)
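The two kinds of interpretation can be sketched on simulated data. Everything below (the data frame toy, the grouping variable landscape with levels "no" and "yes", and the numbers used to generate the heights) is hypothetical and chosen only to illustrate how the coefficients are read.

```r
# Simulated data; NOT the auction data.
set.seed(222)
toy <- data.frame(
  width_in = runif(80, min = 5, max = 60),
  landscape = rep(c("no", "yes"), each = 40)
)
toy$height_in <- 4 + 0.8 * toy$width_in -
  5 * (toy$landscape == "yes") + rnorm(80, sd = 3)

# Slope: numeric explanatory variable.
# For every one-inch increase in width, expected height changes by `slope` inches.
slope_fit <- lm(height_in ~ width_in, data = toy)
slope <- coef(slope_fit)[["width_in"]]

# Difference: categorical explanatory variable.
# The coefficient is the average height for "yes" minus the average height for "no".
diff_fit <- lm(height_in ~ landscape, data = toy)
difference <- coef(diff_fit)[["landscapeyes"]]

# The same difference computed directly from the group means
group_means <- tapply(toy$height_in, toy$landscape, mean)
```

Here difference equals group_means[["yes"]] - group_means[["no"]], which is why a coefficient on a categorical variable is read as a difference between levels rather than as a slope.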