STATGU4206_GR5206_Midterm
 Gabriel Young 10/18/2018
 The STAT GU4206/GR5206 Fall 2018 Midterm is open notes, open book(s), and open computer; online resources are allowed. Students are not allowed to communicate with any other people regarding the exam, with the exception of the instructor (Gabriel Young) and course TAs. This includes emailing fellow students and using WeChat or other similar forms of communication. If there is any suspicion of one or more students cheating, further investigation will take place. If students do not follow the guidelines, they will receive a zero on the exam and potentially face more severe consequences. The exam will be posted on Canvas at 8:35AM. Students are required to submit both the .pdf and .Rmd files on Canvas (or .html if you must) by 11:30AM. If students fail to knit the pdf or html file, the TA will take off a significant portion of the grade. Students will also be significantly penalized for late exams. If for some reason you are unable to upload the completed exam on Canvas by 11:30AM, then immediately email the markdown file to the course TA.
 Note: If you have a bug in your code then RMarkdown will not knit. I highly recommend that you comment out any non-working code. That way your file will knit and you will not be penalized for only uploading the Rmd file.
 For online students: Online students are required to be logged into the Zoom meeting with their cameras on. The TA will be available to answer questions during the Zoom exam. The exam shouldn’t take the whole period, so I expect students to have their knitted file uploaded by 11:30AM.
 For in-class students: In-class students are required to be physically present in Room 903. The TA/instructor will be available to answer questions during the exam. The exam shouldn’t take the whole period, so I expect students to have their knitted file uploaded by 11:30AM.
 Section 1 – Bootstrap and Robust Estimation
 Problem Statement:
 Consider the following toy dataset relating response variable Y to covariate X. Note that this dataset is an extreme case of how traditional least squares regression fails to capture the trend of the data in the presence of outlying observations.
 data <- read.csv("Problem1.csv")
 plot(data$X,data$Y,main="Linear Trend and Outlyers")
 [Figure: scatter plot of Y against X, titled "Linear Trend and Outlyers"]
 Problem 1:
 Fit a regular linear regression to the above dataset and plot the line of best fit in red.
 Also remove the three outlying points and fit the linear model on the remaining 27 cases. Plot this new line of best fit on the same graph as the first model. Create a legend on the plot describing each line. Note: remove the points corresponding to Y = 1.05, 1.94, 2.38.
 Comment on any interesting features from the graph and estimated models.
 Solution
 # Solution goes here ————
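One way to approach Problem 1 can be sketched as follows. This is only a sketch under stated assumptions: Problem1.csv is not included here, so the data frame `sim` below is simulated, and both the placement of the three outlying Y values at the largest X and the plot title spelling are hypothetical choices. With the real file, `sim` would be replaced by `data`.

```r
# Simulated stand-in for Problem1.csv (assumption: columns X and Y, 30 rows).
set.seed(1)
sim <- data.frame(X = 1:30)
sim$Y <- 4 + 2 * sim$X + rnorm(30, sd = 2)
# Hypothetical placement: the three outlying Y values sit at the largest X.
sim$Y[28:30] <- c(1.05, 1.94, 2.38)

# Fit on all 30 cases, then on the 27 cases left after dropping the outliers.
keep <- !(sim$Y %in% c(1.05, 1.94, 2.38))
fit.all <- lm(Y ~ X, data = sim)
fit.sub <- lm(Y ~ X, data = sim[keep, ])

# Scatter plot with both fitted lines and a legend describing each line.
plot(sim$X, sim$Y, main = "Linear Trend and Outliers", xlab = "X", ylab = "Y")
abline(fit.all, col = "red")
abline(fit.sub, col = "blue", lty = 2)
legend("topleft", legend = c("All 30 cases", "Outliers removed (27 cases)"),
       col = c("red", "blue"), lty = c(1, 2))

# Side-by-side coefficients of the two fits.
rbind(all = coef(fit.all), outliers.removed = coef(fit.sub))
```

On the simulated data, the slope of the full fit is pulled well below the slope of the fit without the three outlying points, which is the kind of feature Problem 1 asks you to comment on.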
 Problem 2 Set-Up:
 To fit the linear model, we minimize the total squared Euclidean distance between Y and a linear function of X, i.e., we minimize the expression below with respect to β0, β1:
 $$S(\beta_0, \beta_1) = \sum_{i=1}^{n} \left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2$$
 From the above fit, we see that the outlying Y values are influencing the estimated line. Consequently, the linear fit is pulled down and does not capture the full trend of the data. To remedy this problem, we can perform a robust estimation procedure. More specifically, instead of minimizing squared Euclidean distance (squared loss), we can minimize Huber loss. To estimate our robust model, we minimize Q(β0, β1) with respect to β0, β1:
 [Equation (1): Huber loss Q(β0, β1)]
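As a rough illustration of why a Huber-type loss is less sensitive to outliers, the sketch below compares it with squared loss on a few residuals. The function name `huber`, the threshold `delta = 1`, and the scaling (quadratic `r^2` inside the threshold, linear `2*delta*|r| - delta^2` outside) are assumptions chosen for illustration; Equation (1) on the exam may use different constants.

```r
# Huber-type loss applied elementwise to a vector of residuals r.
# Assumption: delta = 1 and this particular scaling; Equation (1) may differ.
huber <- function(r, delta = 1) {
  ifelse(abs(r) <= delta, r^2, 2 * delta * abs(r) - delta^2)
}

# Small residuals are penalized exactly like squared loss;
# large residuals grow linearly instead of quadratically.
r <- c(-10, -1, -0.5, 0, 0.5, 1, 10)
cbind(residual = r, squared = r^2, huber = huber(r))
```

A residual of 10 costs 100 under squared loss but only 19 under this Huber-type loss, so a few outlying points pull much less on the fitted line.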
 The goal of the next exercise is to write a robust estimation procedure for the coefficients β0, β1. In class we performed univariate gradient descent. On this exam, we will use the R base function nlm() to perform the optimization task. The code below shows how the non-linear minimization function is applied to minimize squared Euclidean distance.
 # First define an objective function
 S <- function(betas,response=data$Y,feature=data$X) {
   b0 <- betas[1]
   b1 <- betas[2]
   # Residuals from the line b0 + b1*feature
   lin.diff <- response-(b0+b1*feature)
   out <- sum((lin.diff)^2)
   return(out)
 }
 # Test S at c(beta0=0,beta1=0)
 S(betas=c(0,0))
 ## [1] 8207.36
 # Minimize S with nlm(), using starting point c(beta0=0,beta1=0)
 nlm(S,p=c(0,0))
 ## $minimum
 ## [1] 1415.941
 ##
 ## $estimate
 ## [1] 3.769193 1.987186
 ##
 ## $gradient
 ## [1] 7.552595e-05 4.119117e-06
 ##
 ## $code
 ## [1] 1
 ##
 ## $iterations
 ## [1] 3
 # Compare estimates to lm()
 nlm(S,p=c(0,0))$estimate
 ## [1] 3.769193 1.987186
 lm(Y~X,data=data)$coefficients
 ## (Intercept)           X
 ##    3.769174    1.987190
 Note that in this example, the nlm() function produces estimates very similar to those of the lm() function.
 Problem 2:
 Write an R function Q which computes the Huber loss as a function of the vector c(β0, β1). Note that the Huber loss is defined in Equation (1). This exercise asks you to create an objective function Q(β0, β1) so that you can run an optimization procedure later on. Test your function at the point c(0,0).
 Solution
 # Solution goes here ————
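A minimal sketch of how such an objective function might look, and how it could be passed to nlm(), is given below. It rests on several assumptions: the data frame `sim` is simulated because Problem1.csv is not included here, the outlier positions are hypothetical, and the threshold `delta = 1` and the scaling of the Huber-type loss are illustrative choices that may differ from Equation (1).

```r
# Simulated stand-in for Problem1.csv (assumption: columns X and Y, 30 rows).
set.seed(2)
sim <- data.frame(X = 1:30)
sim$Y <- 4 + 2 * sim$X + rnorm(30, sd = 2)
sim$Y[28:30] <- c(1.05, 1.94, 2.38)  # hypothetical outlier positions

# Huber-type objective, mirroring the structure of S().
# Assumption: delta = 1 and this scaling; Equation (1) may differ.
Q <- function(betas, response = sim$Y, feature = sim$X, delta = 1) {
  b0 <- betas[1]
  b1 <- betas[2]
  lin.diff <- response - (b0 + b1 * feature)
  # Quadratic for small residuals, linear for large ones.
  loss <- ifelse(abs(lin.diff) <= delta,
                 lin.diff^2,
                 2 * delta * abs(lin.diff) - delta^2)
  sum(loss)
}

# Test the objective at c(0,0), then minimize it from that starting point.
Q(c(0, 0))
robust.fit <- nlm(Q, p = c(0, 0))
ls.fit <- lm(Y ~ X, data = sim)

# Compare robust and least squares coefficients.
rbind(huber = robust.fit$estimate, least.squares = coef(ls.fit))
```

On the simulated data the robust slope stays close to the slope used to generate the bulk of the points, while the least squares slope is pulled down by the three outlying values. The two fitted lines can then be drawn with abline() and labelled with legend().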
 Problem 3:
 Optimize Huber loss Q using the nlm() function. Use the starting point c(0,0) and display your robust estimates. Plot the estimated robust linear model and include the regular least squares regression line on the plot. Create a legend and label the
