In this study, we will use a dataset published on Kaggle for demonstration: https://www.kaggle.com/datasets/rajanand/key-indicators-of-annual-health-survey for demonstration. Note that this dataset was extracted from the Annual Health Survey (AHS) from 2012 to 2013 in India.
The original dataset includes 26 different files, where each file focus on a dedicated topic (e.g. sex ratio, disability rate, etc.). Each topic consists of 2~10 different variables, where all the variables are numeric values representing percentages. We are interested in “Section 16: Ante Natal Care”, “Section 17: Delivery Care”, and “Section 25: Mortality”. These three sections will form our initial dataset. All the other sections are unrelated to our studies. Hence, they will be dropped.
The initial dataset has 8 variables, listed as follows. It also has 284 observations.
## [1] "State_Name"
## [2] "State_District_Name"
## [3] "Mothers_Who_Received_Any_Antenatal_Check_Up"
## [4] "Mothers_Who_Had_Full_Antenatal_Check_Up"
## [5] "Mothers_Who_Underwent_Ultrasound"
## [6] "Institutional_Delivery"
## [7] "Delivery_At_Home"
## [8] "Infant_Mortality_Rate"
We check if there are any missing values in these eight variables. Our result shows that these variables are complete. This means that all eight variables have been filled for the entire 284 observations.
## State_Name
## 0
## State_District_Name
## 0
## Mothers_Who_Received_Any_Antenatal_Check_Up
## 0
## Mothers_Who_Had_Full_Antenatal_Check_Up
## 0
## Mothers_Who_Underwent_Ultrasound
## 0
## Institutional_Delivery
## 0
## Delivery_At_Home
## 0
## Infant_Mortality_Rate
## 0
Generate summary for the data.
## State_Name State_District_Name
## Length:284 Length:284
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Mothers_Who_Received_Any_Antenatal_Check_Up
## Min. :59.90
## 1st Qu.:86.47
## Median :91.70
## Mean :90.00
## 3rd Qu.:94.80
## Max. :99.70
## Mothers_Who_Had_Full_Antenatal_Check_Up Mothers_Who_Underwent_Ultrasound
## Min. : 1.01 Min. : 8.57
## 1st Qu.: 6.00 1st Qu.:23.90
## Median :10.82 Median :32.10
## Mean :13.50 Mean :36.60
## 3rd Qu.:18.34 3rd Qu.:47.52
## Max. :54.65 Max. :92.50
## Institutional_Delivery Delivery_At_Home Infant_Mortality_Rate
## Min. :23.81 Min. : 3.95 Min. :19.22
## 1st Qu.:53.56 1st Qu.:19.53 1st Qu.:47.95
## Median :64.34 Median :34.30 Median :55.00
## Mean :64.77 Mean :34.30 Mean :56.18
## 3rd Qu.:80.19 3rd Qu.:45.37 3rd Qu.:65.18
## Max. :95.88 Max. :75.92 Max. :97.00
We proceed to plot the histogram of 5 predictors and the variable of interest. Notice that for a fixed given district, P(Institutional Delivery)=1-P(Delivery At Home).
path_folder = "C:\\Users\\marcc\\Desktop\\"
file_name = "india_birth.csv"
df<- read.csv(paste0(path_folder, file_name), header = T)
df_uni<- data.frame(df$Institutional_Delivery, df$Infant_Mortality_Rate)
model <- lm(df.Infant_Mortality_Rate ~ df.Institutional_Delivery, data = df_uni)
D <- cooks.distance(model)
lev <- which(D > 4/(nrow(df_uni)-2))
par(family = 'serif')
plot(df_uni$df.Institutional_Delivery, D, type = "p",
xlab = "Institutional Delivery", ylab = "Cook's Distances",
main = "Cook's Distance", col = ifelse(D > 4/(nrow(df_uni)-2), "red", "blue"))
text(df_uni$df.Institutional_Delivery[D > 4/(nrow(df_uni)-2)]+0.5, D[D > 4/(nrow(df_uni)-2)],
labels = which(D > 4/(nrow(df_uni)-2)) )
df_uni <- df_uni[-c(79,82,83,112,141,211,265,266,272,274,281),]
model <- lm(df.Infant_Mortality_Rate ~ df.Institutional_Delivery, data = df_uni)
plot(df_uni$df.Institutional_Delivery, df_uni$df.Infant_Mortality_Rate, type = "p", pch = 20,
xlab = "Institutional_Delivery", ylab = "Infant_Mortality_Rate",
main = "Institutional Delivery vs Infant Mortality Rate", col = "red")
abline(model, lwd = 2, col = "blue")
r <- rstandard(model)
qqnorm(r)
qqline(r)
hii <- hatvalues(model)
r <- rstandard(model)
par(family = 'serif')
plot(df_uni$df.Institutional_Delivery, r, type = "p",
xlab = "Institutional Delivery", ylab = "Standardized Residuals",
main = "Standardized Residuals", col = "red", xlim = c(0, 100), ylim = c(-4, 4))
abline(h = 2, lty = 2)
abline(h = -2, lty = 2)
hii <- hatvalues(model)
r <- rstandard(model)
r_mod <- sqrt(abs(r))
best_fit <- lm(r_mod ~ df_uni$df.Institutional_Delivery)
plot(df_uni$df.Institutional_Delivery, r_mod, type = "p",
xlab = "Institutional Delivery", ylab = "Standardized Residuals",
main = "Standardized Residuals", col = "red", xlim = c(0, 100), ylim = c(-1, 3))
abline(best_fit, lwd = 2, col = "blue")
model %>%
tidy() %>%
kable(caption = "The summary from the simple linear regression model")
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 54.3116065 | 3.1352501 | 17.3228944 | 0.0000000 |
df.Institutional_Delivery | 0.0316342 | 0.0463868 | 0.6819651 | 0.4958435 |
plot(df_uni$df.Institutional_Delivery, df_uni$df.Infant_Mortality_Rate, type = "p", pch = 20,
xlab = "Institutional_Delivery", ylab = "Infant_Mortality_Rate",
main = "Institutional Delivery vs Infant Mortality Rate", col = "red")
abline(model, lwd = 2, col = "blue")
df_uni <- df_uni[order(df_uni$df.Institutional_Delivery),]
yhat <- predict(model, newdata = df_uni, type = "response", interval = "none")
conf.inter <- predict(model, newdata = df_uni, type = "response", interval = "confidence", level = 0.95)
pred.inter <- predict(model, newdata = df_uni, type = "response", interval = "prediction", level = 0.95)
plot(df_uni$df.Institutional_Delivery, df_uni$df.Infant_Mortality_Rate, type = "p", xlab = "Institutional Delivery", ylab = "Infant Mortality Rate",
main = "Institutional Delivery vs Infant Mortality Rate", col = "red", xlim = c(20, 100), ylim = c(20, 90))
abline(model, lwd = 2, col = "blue")
lines(df_uni$df.Institutional_Delivery, conf.inter[,2], lty = 2, col = "blue")
lines(df_uni$df.Institutional_Delivery, conf.inter[,3], lty = 2, col = "blue")
lines(df_uni$df.Institutional_Delivery, pred.inter[,2], lty = 2, col = "red")
lines(df_uni$df.Institutional_Delivery, pred.inter[,3], lty = 2, col = "red")
Under H0 we have the following assumption: observations are normally distributed (checked 2.1), the errors are independent of each other (checked 2.2), observations have constant variance and mean 0 (checked 2.3).
summary(model)
##
## Call:
## lm(formula = df.Infant_Mortality_Rate ~ df.Institutional_Delivery,
## data = df_uni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.381 -8.920 -1.082 8.735 31.852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.31161 3.13525 17.323 <2e-16 ***
## df.Institutional_Delivery 0.03163 0.04639 0.682 0.496
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.89 on 271 degrees of freedom
## Multiple R-squared: 0.001713, Adjusted R-squared: -0.001971
## F-statistic: 0.4651 on 1 and 271 DF, p-value: 0.4958
anova(model)
qf(0.95, 1, 282)
## [1] 3.874645
As 0.4650763 < 3.874645, we failed to Reject H0, meaning Institutional delivery has no effect on Infant Mortality Rate.
path_folder = "C:\\Users\\marcc\\Desktop\\"
file_name = "india_birth.csv"
df <- read.csv(paste0(path_folder, file_name), header = T)
model <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Delivery_At_Home + Mothers_Who_Received_Any_Antenatal_Check_Up
+ Mothers_Who_Had_Full_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, data = df)
pairs(~ Institutional_Delivery + Delivery_At_Home + Mothers_Who_Received_Any_Antenatal_Check_Up +
Mothers_Who_Had_Full_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, data = df)
Xmat<- model.matrix(model)
XTX <- solve(t(Xmat)%*%Xmat)
rankMatrix(XTX)
## [1] 6
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 1.332268e-15
H <- Xmat%*%XTX%*%t(Xmat)
rankMatrix(H)
## [1] 6
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 6.306067e-14
summary(model)
##
## Call:
## lm(formula = Infant_Mortality_Rate ~ Institutional_Delivery +
## Delivery_At_Home + Mothers_Who_Received_Any_Antenatal_Check_Up +
## Mothers_Who_Had_Full_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.004 -8.993 0.303 8.591 43.748
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 99.85501 50.83905 1.964
## Institutional_Delivery 0.32138 0.51155 0.628
## Delivery_At_Home -0.03446 0.51896 -0.066
## Mothers_Who_Received_Any_Antenatal_Check_Up -0.56068 0.14738 -3.804
## Mothers_Who_Had_Full_Antenatal_Check_Up -0.11925 0.10113 -1.179
## Mothers_Who_Underwent_Ultrasound -0.30701 0.05808 -5.286
## Pr(>|t|)
## (Intercept) 0.050510 .
## Institutional_Delivery 0.530352
## Delivery_At_Home 0.947109
## Mothers_Who_Received_Any_Antenatal_Check_Up 0.000175 ***
## Mothers_Who_Had_Full_Antenatal_Check_Up 0.239334
## Mothers_Who_Underwent_Ultrasound 2.53e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.72 on 278 degrees of freedom
## Multiple R-squared: 0.2002, Adjusted R-squared: 0.1858
## F-statistic: 13.92 on 5 and 278 DF, p-value: 3.824e-12
vif(model) ## The vif function is from the car package
## Institutional_Delivery
## 137.611436
## Delivery_At_Home
## 134.846212
## Mothers_Who_Received_Any_Antenatal_Check_Up
## 1.804854
## Mothers_Who_Had_Full_Antenatal_Check_Up
## 1.567969
## Mothers_Who_Underwent_Ultrasound
## 1.708771
As seen, Institutional Delivery and Delivery at Home are almost perferctly related. We remove the variable “Delivery At Home” as it provides the exact same information as “Instutitonal Delivery”
model <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Received_Any_Antenatal_Check_Up
+ Mothers_Who_Had_Full_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, data = df)
pairs(~ Institutional_Delivery + Mothers_Who_Received_Any_Antenatal_Check_Up +
Mothers_Who_Had_Full_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, data = df)
Xmat<- model.matrix(model)
XTX <- solve(t(Xmat)%*%Xmat)
rankMatrix(XTX)
## [1] 5
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 1.110223e-15
H <- Xmat%*%XTX%*%t(Xmat)
rankMatrix(H)
## [1] 5
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 6.306067e-14
summary(model)
##
## Call:
## lm(formula = Infant_Mortality_Rate ~ Institutional_Delivery +
## Mothers_Who_Received_Any_Antenatal_Check_Up + Mothers_Who_Had_Full_Antenatal_Check_Up +
## Mothers_Who_Underwent_Ultrasound, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.039 -9.017 0.316 8.594 43.743
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 96.56772 11.52976 8.376
## Institutional_Delivery 0.35512 0.05890 6.029
## Mothers_Who_Received_Any_Antenatal_Check_Up -0.56151 0.14658 -3.831
## Mothers_Who_Had_Full_Antenatal_Check_Up -0.11935 0.10094 -1.182
## Mothers_Who_Underwent_Ultrasound -0.30710 0.05796 -5.299
## Pr(>|t|)
## (Intercept) 2.71e-15 ***
## Institutional_Delivery 5.21e-09 ***
## Mothers_Who_Received_Any_Antenatal_Check_Up 0.000158 ***
## Mothers_Who_Had_Full_Antenatal_Check_Up 0.238066
## Mothers_Who_Underwent_Ultrasound 2.37e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.7 on 279 degrees of freedom
## Multiple R-squared: 0.2002, Adjusted R-squared: 0.1887
## F-statistic: 17.46 on 4 and 279 DF, p-value: 8.489e-13
vif(model) ## The vif function is from the car package
## Institutional_Delivery
## 1.830999
## Mothers_Who_Received_Any_Antenatal_Check_Up
## 1.791705
## Mothers_Who_Had_Full_Antenatal_Check_Up
## 1.567646
## Mothers_Who_Underwent_Ultrasound
## 1.707864
As VIF < 5 We accept the assumption that there isn’t multidisciplinary.
D <- cooks.distance(model)
which(D > qf(0.5, 5, 284-5))
## named integer(0)
dfits <- dffits(model)
which(abs(dfits) > 2*sqrt(5/284))
## 3 9 141 155 211 261 265 266 281
## 3 9 141 155 211 261 265 266 281
dfb <- dfbetas(model)
which(abs(dfb[,2]) > 2/sqrt(284))
## 3 9 77 79 80 83 86 87 121 127 137 211 261 266 272 273 274 281
## 3 9 77 79 80 83 86 87 121 127 137 211 261 266 272 273 274 281
We proceed to remove influential observations and refit the model.
df <- df[-c(3,9,77,79,80,83,86,87,121,127,137,141,155,211,261,265,266,272,273,274,281),]
model <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Received_Any_Antenatal_Check_Up
+ Mothers_Who_Had_Full_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, data = df)
resid <- rstandard(model)
par(family = 'serif')
qqnorm(resid)
qqline(resid)
fitted <- predict(model)
plot(resid ~ fitted, type = "p",
xlab = "Fitted Values", ylab = "Standardized Residual",
cex.lab = 1.2, col = "red")
lines(lowess(fitted, resid), col = "blue")
par(family = 'serif')
plot(df$Infant_Mortality_Rate ~ fitted, type = "p", xlab = "Fitted Values",
ylab = "Infant Mortality Rate", cex.lab = 1.2,
col = "red")
abline(lm(df$Infant_Mortality_Rate ~ fitted), lwd = 2, col = "blue")
lines(lowess(fitted, df$Infant_Mortality_Rate), col = "red")
We conclude that no transformation is needed for this model, or rather, we cannot find an appropriate transformation.
Note that this model has no indicator variable dividing observations into two groups.
Keep in mind that our goal is to have an interpretable model, prediction accuracy is of secondary importance
criteria <- function(model){
n <- length(model$residuals)
p <- length(model$coefficients) - 1
RSS <- sum(model$residuals^2)
R2 <- summary(model)$r.squared
R2.adj <- summary(model)$adj.r.squared
AIC <- n*log(RSS/n) + 2*p
AICc <- AIC + (2*(p+2)*(p+3))/(n-p-1)
BIC <- n*log(RSS/n) + (p+2)*log(n)
res <- c(R2, R2.adj, AIC, AICc, BIC)
names(res) <- c("R Squared", "Adjsuted R Squared", "AIC", "AICc", "BIC")
return(res)
}
## The crteria ##
## model 1 ##
model.full <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Received_Any_Antenatal_Check_Up
+ Mothers_Who_Had_Full_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, data = df)
crit1 <- criteria(model = model.full)
## model 2 ##
model.3.1 <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Received_Any_Antenatal_Check_Up +
Mothers_Who_Had_Full_Antenatal_Check_Up, data = df)
crit2 <- criteria(model = model.3.1)
## model 3 ##
model.3.2 <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Received_Any_Antenatal_Check_Up +
Mothers_Who_Underwent_Ultrasound, data = df)
crit3 <- criteria(model = model.3.2)
## model 4 ##
model.3.3 <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Had_Full_Antenatal_Check_Up +
Mothers_Who_Underwent_Ultrasound, data = df)
crit4 <- criteria(model = model.3.3)
## model 5 ##
model.3.4 <- lm(Infant_Mortality_Rate ~ Mothers_Who_Received_Any_Antenatal_Check_Up + Mothers_Who_Had_Full_Antenatal_Check_Up
+ Mothers_Who_Underwent_Ultrasound, data = df)
crit5 <- criteria(model = model.3.4)
## model 6 ##
model.2.1 <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Received_Any_Antenatal_Check_Up, data = df)
crit6 <- criteria(model = model.2.1)
## model 7 ##
model.2.2 <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Had_Full_Antenatal_Check_Up, data = df)
crit7 <- criteria(model = model.2.2)
## model 8 ##
model.2.3 <- lm(Infant_Mortality_Rate ~ Institutional_Delivery + Mothers_Who_Underwent_Ultrasound, data = df)
crit8 <- criteria(model = model.2.3)
## Comapre the criteria for each model ##
rbind(crit1, crit2, crit3, crit4, crit5, crit6, crit7, crit8)
## R Squared Adjsuted R Squared AIC AICc BIC
## crit1 0.20312161 0.19076691 1265.981 1266.306 1291.413
## crit2 0.12795046 0.11784950 1287.688 1287.920 1309.549
## crit3 0.18951398 0.18012611 1268.434 1268.665 1290.294
## crit4 0.17289599 0.16331563 1273.772 1274.003 1295.632
## crit5 0.12530303 0.11517140 1288.486 1288.717 1310.346
## crit6 0.08696132 0.07993794 1297.769 1297.922 1316.057
## crit7 0.10266439 0.09576181 1293.206 1293.360 1311.495
## crit8 0.12061342 0.11384891 1287.892 1288.046 1306.181
The result is not promising, we proceed to the next step.
df_select = data.frame(df$Infant_Mortality_Rate,
df$Institutional_Delivery,
df$Mothers_Who_Received_Any_Antenatal_Check_Up,
df$Mothers_Who_Had_Full_Antenatal_Check_Up,
df$Mothers_Who_Underwent_Ultrasound)
model.select <- lm(df.Infant_Mortality_Rate ~ df.Institutional_Delivery + df.Mothers_Who_Received_Any_Antenatal_Check_Up
+ df.Mothers_Who_Had_Full_Antenatal_Check_Up + df.Mothers_Who_Underwent_Ultrasound, data = df_select)
## choose lambda ##
set.seed(1006353848)
cv.out <- cv.glmnet(x = as.matrix(df_select[,2:5]), y = df_select$df.Infant_Mortality_Rate, standardize = T, alpha = 1)
plot(cv.out)
best.lambda <- cv.out$lambda.1se
best.lambda
## [1] 0.9303921
co<-coef(cv.out, s = "lambda.1se")
thresh <- 0.00
# select variables #
inds<-which(abs(co) > thresh )
variables<-row.names(co)[inds]
sel.var.lasso<-variables[!(variables %in% '(Intercept)')]
sel.var.lasso
## [1] "df.Institutional_Delivery"
## [2] "df.Mothers_Who_Received_Any_Antenatal_Check_Up"
## [3] "df.Mothers_Who_Had_Full_Antenatal_Check_Up"
## [4] "df.Mothers_Who_Underwent_Ultrasound"
## Based on AIC ##
n <- nrow(df_select)
sel.var.aic <- step(model.select, trace = 0, k = 2, direction = "both")
sel.var.aic<-attr(terms(sel.var.aic), "term.labels")
sel.var.aic
## [1] "df.Institutional_Delivery"
## [2] "df.Mothers_Who_Received_Any_Antenatal_Check_Up"
## [3] "df.Mothers_Who_Had_Full_Antenatal_Check_Up"
## [4] "df.Mothers_Who_Underwent_Ultrasound"
## Based on BIC ##
n <- nrow(df_select)
sel.var.bic <- step(model.select, trace = 0, k = log(n), direction = "both")
sel.var.bic<-attr(terms(sel.var.bic), "term.labels")
sel.var.bic
## [1] "df.Institutional_Delivery"
## [2] "df.Mothers_Who_Received_Any_Antenatal_Check_Up"
## [3] "df.Mothers_Who_Underwent_Ultrasound"
We will divide our initial dataset into two seperate dataset, one for validation, and one for testing.
df_test <- df_select[row.names(df) %in% 1:131, ]
df_valid <- df_select[row.names(df) %in% (134):nrow(df), ]
ols.aic <- ols(df.Infant_Mortality_Rate ~ ., data = df_valid[,which(colnames(df_valid) %in% c(sel.var.aic, "df.Infant_Mortality_Rate"))],
x=T, y=T, model = T)
aic.cross <- calibrate(ols.aic, method = "crossvalidation", B = 10) ## 10 fold cross validation
plot(aic.cross, las = 1, xlab = "Predicted Infant Mortality Rate", ylab = "Observed Infant Mortality Rate", main = "Cross-Validation calibration with AIC")
##
## n=125 Mean absolute error=2.252 Mean squared error=6.5517
## 0.9 Quantile of absolute error=3.291
pred.aic <- predict(ols.aic, newdata = df_test[,which(colnames(df_valid) %in% c(sel.var.aic, "df.Infant_Mortality_Rate"))]) ## Test Error ##
pred.error.AIC <- mean((df_test$df.Infant_Mortality_Rate - pred.aic)^2) ## Prediction error ##
ols.aic <- ols(df.Infant_Mortality_Rate ~ ., data = df_valid[,which(colnames(df_valid) %in% c(sel.var.aic, "df.Infant_Mortality_Rate"))],
x=T, y=T, model = T)
aic.cross <- calibrate(ols.aic, method = "crossvalidation", B = 10) ## 10 fold cross validation
plot(aic.cross, las = 1, xlab = "Predicted Infant Mortality Rate", ylab = "Observed Infant Mortality Rate", main = "Cross-Validation calibration with BIC")
##
## n=125 Mean absolute error=2.546 Mean squared error=8.92306
## 0.9 Quantile of absolute error=3.518
pred.aic <- predict(ols.aic, newdata = df_test[,which(colnames(df_valid) %in% c(sel.var.aic, "df.Infant_Mortality_Rate"))]) ## Test Error ##
pred.error.BIC <- mean((df_test$df.Infant_Mortality_Rate - pred.aic)^2) ## Prediction error ##
ols.lasso <- ols(df.Infant_Mortality_Rate ~ ., data = df_valid[,which(colnames(df_valid) %in% c(sel.var.lasso, "df.Infant_Mortality_Rate"))],
x=T, y=T, model = T)
## 10 fold cross validation ##
lasso.cross <- calibrate(ols.lasso, method = "crossvalidation", B = 10)
plot(lasso.cross, las = 1, xlab = "Predicted Infant Mortality Rate", ylab = "Observed Infant Mortality Rate", main = "Cross-Validation calibration with LASSO")
##
## n=125 Mean absolute error=2.308 Mean squared error=7.34532
## 0.9 Quantile of absolute error=3.987
pred.lasso <- predict(ols.lasso, newdata = df_test[,which(colnames(df_valid) %in% c(sel.var.lasso, "df.Infant_Mortality_Rate"))]) ## Test Error ##
pred.error.lasso <- mean((df_test$df.Infant_Mortality_Rate - pred.lasso)^2) ## Prediction error ##
print(c(pred.error.AIC, pred.error.BIC, pred.error.lasso))
## [1] 235.162 235.162 235.162
After model selection, we decide to apply the model from BIC based selection. LASSO and AIC include 4 predictors, BIC includes 3 predictor. The goal of this study is not to predcit, but to understand the relation. Hence, BIC based selection is desirble.
model.final <- lm(Infant_Mortality_Rate ~ Institutional_Delivery +
Mothers_Who_Received_Any_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound, data = df)
summary(model.final)
##
## Call:
## lm(formula = Infant_Mortality_Rate ~ Institutional_Delivery +
## Mothers_Who_Received_Any_Antenatal_Check_Up + Mothers_Who_Underwent_Ultrasound,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.0296 -7.8482 0.4369 7.4197 28.8795
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 98.73380 9.59736 10.288
## Institutional_Delivery 0.29644 0.05578 5.314
## Mothers_Who_Received_Any_Antenatal_Check_Up -0.56311 0.12001 -4.692
## Mothers_Who_Underwent_Ultrasound -0.30168 0.05270 -5.725
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Institutional_Delivery 2.31e-07 ***
## Mothers_Who_Received_Any_Antenatal_Check_Up 4.38e-06 ***
## Mothers_Who_Underwent_Ultrasound 2.86e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.11 on 259 degrees of freedom
## Multiple R-squared: 0.1895, Adjusted R-squared: 0.1801
## F-statistic: 20.19 on 3 and 259 DF, p-value: 8.673e-12
anova(model.final)
qf(0.95, 1, 282)
## [1] 3.874645
Again we failed to reject H0 for the predictor Institutional_Delivery. We can interpret this as “When conditioned on the percentage of Institutional Delivery, Infant Mortality Rate will not change.” Hence, we can conclude that there is no relationship between Institutional Delivery and Infant Mortality Rate. We successfully rejected H0 for the other two predictors. Surprisingly, the estimate indicates a negative relation.