GLM Diagnostic Plots with ggplot2

Usage

glm_diagnostic_plots(
  model,
  discrete_edm,
  which_plots = 1:4,
  cooksD_type = 1,
  standard_errors = FALSE,
  point_size = 1.1,
  theme_color = "#008EA0FF",
  n_columns = 2
)

Arguments

model: Model of class "glm"
discrete_edm: Logical value to exclusively specify if a discrete EDM is chosen to build model. Quantile residuals are used instead of Deviance/Pearson residuals for the plots for Discrete EDMs
which_plots: Choose which diagnostic plots to choose from.
Options are 1 = "Residuals vs Fitted"; 2 = "Working Responses vs Linear Predictors"; 3 = "Normal Q-Q"; 4 = "Outlier Detection"; 5 = "Half norm plot using leverages"; 6 = "Half norm plot using Cook's Distance"; 7 = "Cook's Distance"; 8 = "DFFITS"; 9 = "VIF"
cooksD_type: An integer between 1 and 4 indicating the threshold to be computed for Cook's Distance plot. Default is 1. See details for threshold computation
standard_errors: Display confidence interval around geom_smooth, FALSE by default
point_size: Change size of points in plots
theme_color: Change color of the geom_smooth line and text labels for the respective diagnostic plot
n_columns: number of columns for grid layout. Default is 2

Value

GLM diagnostic plots

Details

Check the assumptions for the systematic component of the GLM:

Plot 1: If any trends appear, then the systematic component can be improved. This usually involves doing 1 or a combination of the following: (1) changing the link fucntion, (2) adding new predictor variables, and/or (3) transforming the current predictor variables in the model.

Plot 2: If plot is not roughly linear, then another link function might be more appropriate.

Check the assumptions for the random component - EDM distribution - of the GLM:

Plot 3: If most points are NOT close to the line, then the EDM distribution may not be appropriate. The quantile residuals are used since they have an exact normal distribution if the appropriate EDM has been chosen.

Check for outliers and influential points:

Plot 4: If |residual| > 3, then the observation is flagged as an outlier.

Plot 5 & 6: Observations that deviate greatly from the rest of the data are potential outliers. It is not necessary that all points fall on the red line since the leverages and cook's distance values are not strictly expected to have a postive normal distribution. The 5 observations with the largest values are labelled on each plot.

Plot 7: Check for outliers with Cook's Distance. A data point having a large Cook's distance indicates that the data point strongly influences the fitted values of the model. The default threshold used for detecting or classifying observations as outliers is \(4/n\) (i.e cooksD_type = 1) where \(n\) is the number of observations. The thresholds computed are as follows:

cooksD_type = 1: 4/n
cooksD_type = 2: 4/(n-p-1)
cooksD_type = 3: 1/(n-p-1)
cooksD_type = 4: 3 * mean(cook's distance values)

where \(n\) is the number of observations and \(p\) is the number of predictors.

Plot 8: Check for outliers with DFFITS, which considers how much the fitted value of observation i changes between the model fitted with all the data and the model fitted with obervation i omitted. If |DFFITS| > \(2 * sqrt((parameters + 1) / (observations - parameters - 1))\), then the observation is flagged as influential.

Check for collinearity:

Plot 9: Check for collinearity with the variance inflation factor (VIF). Tolerance = \(1 - R_j^2\), VIF = \(1 / Tolerance\), where \(R_j^2\) is the coefficient of determination of a regression of predictor j on all the other predictors. A general rule of thumb is that VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 indicates a multicollinearity problem.

References

Weisberg, S.: Applied Linear Regression. John Wiley and Sons, New York (1985)

Pierce, D.A., Shafer, D.W.: Residuals in generalized linear models. Journal of the American Statistical Association 81, 977–986 (1986)

Williams, D.A.: Generalized linear models diagnostics using the deviance and single-case deletions. Applied Statistics 36(2), 181–191 (1987)

McCullagh, P., Nelder, J.A.: Generalized Linear Models, second edn. Chapman and Hall, London (1989)

Dunn, P.K., Smyth, G.K.: Randomized quantile residuals. Journal of Computational and Graphical Statistics 5(3), 236–244 (1996)

Dunn, P.K., Smyth, G.K. Generalized Linear Models with Examples in R. Springer Texts in Statistics, 297-327 (2018)

Author

Mo Amiri moamiristat@protonmail.com

Examples

model <- glm(Volume ~ Girth + Height, family = Gamma(link = "log"), data = trees)

# Default plots returned
glm_diagnostic_plots(model, discrete_edm = FALSE, which_plots = 1:4)


# Half norm plots
glm_diagnostic_plots(model, discrete_edm = FALSE, which_plots = 5:6)


# Cook's Distance & DFFITS
glm_diagnostic_plots(model, discrete_edm = FALSE, which_plots = 7:8)


# Collinearity
glm_diagnostic_plots(model, discrete_edm = FALSE, which_plots = 9, n_columns = 1)