Thursday, July 21, 2016

Linear Regression - Chapter 3 - ISLR book

Data! Data! Data! I can’t make bricks without clay!
Sir Arthur Conan Doyle

Chapter 3 : Linear Regression (edX Course notes)


Prob.1. Why is linear regression important to understand? Select all that apply:

(1) The linear model is often incorrect
(2) Linear regression is very extensible and can be used to capture nonlinear effects (correct)
(3) Simple methods can outperform more complex ones if the data are noisy (correct)
(4) Understanding simpler methods sheds light on more complex ones (correct)

Prob.2. You may want to reread the paragraph on confidence intervals on page 66 of the textbook before trying this question (the distinctions are subtle).
Which of the following are true statements? Select all that apply:
(1) A 95% confidence interval is a random interval that contains the true parameter 95% of the time (correct)
(2) The true parameter is a random value that has 95% chance of falling in the 95% confidence interval
(3) I perform a linear regression and get a 95% confidence interval from 0.4 to 0.5. There is a 95% probability that the true parameter is between 0.4 and 0.5.
(4) The true parameter (unknown to me) is 0.5. If I sample data and construct a 95% confidence interval, the interval will contain 0.5, 95% of the time. (correct)
EXPLANATION
Confidence intervals are a "frequentist" concept: the interval, and not the true parameter, is considered random.

Standard Errors and Hypothesis Testing:

Standard error (SE) tells us the average amount an estimate differs from the actual value. The SE is inversely proportional to the square root of the number of observations, so it shrinks as the sample grows. For the sample mean it is given by

SE(mu_hat)^2 = sigma^2 / n, i.e. SE = sigma / sqrt(n),

where sigma is the standard deviation of each observation. SE is used to compute confidence intervals. A 95% interval is defined as the range of values such that, with 95% probability, the range will contain the actual value of the parameter.
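A minimal sketch of this calculation in Python (the sample values below are made up purely to illustrate):

import numpy as np

y = np.array([2.3, 1.9, 2.8, 3.1, 2.5, 2.0, 2.7])   # hypothetical observations
sigma_hat = y.std(ddof=1)                            # sample standard deviation of the observations
se = sigma_hat / np.sqrt(len(y))                     # SE = sigma / sqrt(n)
ci = (y.mean() - 1.96 * se, y.mean() + 1.96 * se)    # approximate 95% confidence interval
print(se, ci)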

Standard errors can also be used to perform hypothesis tests.

null hypothesis H_0: There is no relationship between X and Y, versus
alternative hypothesis H_a: There is some relationship between X and Y.

Mathematically, this corresponds to testing H_0: β_1 = 0 versus H_a: β_1 ≠ 0.

To test this, we again use the standard error of the coefficient: the t-statistic t = Beta_1_hat / SE(Beta_1_hat) measures how many standard errors the estimate lies from zero, and the corresponding p-value tells us how likely such an association between response and predictor would be just by chance (a small p-value indicates it is unlikely to be chance).
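A small Python sketch of the t-statistic and p-value computation, reusing the slope estimate and standard error from Prob. 3 below and assuming a hypothetical n = 100 observations:

import numpy as np
from scipy import stats

beta1_hat, se_beta1, n = 0.5, 0.2, 100             # hypothetical estimate, SE, and sample size
t_stat = (beta1_hat - 0) / se_beta1                # t-statistic for H_0: beta_1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-sided p-value with n - 2 degrees of freedom
print(t_stat, p_value)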


Assessing the accuracy of the model:

SSE (Sum of Squared Errors): the sum of the squared differences between the actual and predicted values.
SST (Total Sum of Squares): the SSE of the intercept-only model, i.e. the sum of squared differences between the actual values and their mean (a flat line).
R^2 = 1 - (SSE / SST): quantifies how much of the variability in the response the predictors explain.
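For example, a quick Python sketch of these quantities (the actual and predicted values here are made up):

import numpy as np

y      = np.array([3.0, 4.5, 5.1, 6.8, 7.2])   # hypothetical actual values
y_pred = np.array([3.2, 4.1, 5.5, 6.5, 7.4])   # hypothetical fitted values
sse = np.sum((y - y_pred) ** 2)                # sum of squared errors of the model
sst = np.sum((y - y.mean()) ** 2)              # SSE of the intercept-only (flat) model
r2  = 1 - sse / sst
print(r2)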
Prob 3. We run a linear regression and the slope estimate is 0.5 with an estimated standard error of 0.2. What is the largest value of b for which we would NOT reject the null hypothesis that beta_1 = b? Assume the normal approximation to the t distribution, a 5% significance level for a two-sided test, and that two-significant-digit accuracy is needed.

Beta_1 = 0.5
SE(Beta_1) = 0.2

Need to understand how this is done... 

Ans: 0.892
EXPLANATION
The 95% confidence interval Beta_1 ± 1.96 * SE(Beta_1) = 0.5 ± 1.96 * 0.2 = (0.108, 0.892) contains all parameter values that would not be rejected at a 5% significance level, so the largest such b is 0.892.
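A quick numeric check in Python:

beta1_hat, se = 0.5, 0.2
upper = beta1_hat + 1.96 * se   # 0.892: largest b not rejected at the 5% level
lower = beta1_hat - 1.96 * se   # 0.108: smallest such b
print(lower, upper)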

Prob 4. Which of the following indicates a fairly strong relationship between X and Y?
(1) R^2 = 0.9 (correct)
(2) The p-value for the null hypothesis beta_1 = 0 is 0.0001
(3) The t-statistic for the null hypothesis beta_1 = 0 is 30

The R^2 (in simple regression, the square of the correlation between the two variables) measures how closely they are associated. The p-value and t-statistic merely measure how strong the evidence is that there is a nonzero association. Even a weak effect can be extremely significant given enough data.
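A small simulation sketch in Python of that last point (the true slope of 0.02 and n = 100,000 are arbitrary choices): with enough data, even a tiny effect gives an essentially zero p-value while R^2 stays close to zero.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)       # very weak true effect
res = stats.linregress(x, y)
print(res.rvalue ** 2, res.pvalue)      # R^2 is tiny, but the p-value is essentially 0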


Multiple Regression

1. Multiple predictors
2. Correlation among predictors can increase the variance of the coefficient estimates and make interpretation difficult. In that case, claims of causality should be avoided.
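A minimal sketch of the second point (the data are simulated, and x2 is constructed to be nearly collinear with x1): the individual coefficient estimates can land far from their true values of 1 even though the overall fit is fine.

import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # x2 is almost collinear with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # multiple regression by least squares
print(beta_hat)                                    # intercept, coefficient of x1, coefficient of x2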

Prob 5. Suppose we are interested in learning about a relationship between X_1 and Y, which we would ideally like to interpret as causal.
True or False? The estimate of beta_1 in a linear regression that controls for many variables (that is, a regression with many predictors in addition to X_1) is usually a more reliable measure of a causal relationship than beta_1 from a univariate regression on X_1.
Ans: False (correct)
EXPLANATION
Adding lots of extra predictors to the model can just as easily muddy the interpretation of beta_1 as it can clarify it. One often reads in media reports of academic studies that "the investigators controlled for confounding variables," but be skeptical!
Causal inference is a difficult and slippery topic, which cannot be answered with observational data alone without additional assumptions.


3. How do we decide on the model and the number of predictive variables?
Exhaustive search (all-subsets or best-subsets regression) quickly becomes infeasible: with p = 40 predictors there are 2^40 possible models.

(i) Forward Selection

(a) Start with the null model (just the intercept, no predictors: Beta0).
(b) Add the first predictive variable, fit the model, and record the RSS (Beta0 + Beta1*X_1).
(c) Do the same for each of the other predictive variables added to the null model (e.g. Beta0 + Beta2*X_2).
(d) Choose the predictive variable that results in the lowest RSS.
(e) Keep that variable as the base and repeat the search over the remaining (p-1) variables, adding one variable at a time (see the sketch after this list).
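A rough Python sketch of forward selection by RSS (the helper names rss and forward_selection are mine, not from the book):

import numpy as np

def rss(X, y):
    # residual sum of squares of a least-squares fit of y on X (intercept included)
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def forward_selection(X, y, k):
    # greedily add, one at a time, the predictor that gives the lowest RSS
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        best = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected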

(ii) Backward Selection

(a) Start with the model containing all the predictive variables.
(b) Remove the predictive variable with the largest p-value.
(c) Refit on the remaining (p-1) variables and again remove the least significant one.
(d) Continue until all the remaining predictive variables are significant (see the sketch after this list).
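A rough Python sketch of backward selection by p-value (again, the helper names are mine; the p-values come from the usual OLS formulas):

import numpy as np
from scipy import stats

def p_values(X, y):
    # two-sided p-values of the coefficients (excluding the intercept) of an OLS fit
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = resid @ resid / (n - p - 1)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))
    t = beta / se
    return 2 * stats.t.sf(np.abs(t[1:]), df=n - p - 1)

def backward_selection(X, y, alpha=0.05):
    # repeatedly drop the predictor with the largest p-value until all are significant
    keep = list(range(X.shape[1]))
    while keep:
        pvals = p_values(X[:, keep], y)
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:
            break
        keep.pop(worst)
    return keep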

Systematic criteria for choosing an optimal number of predictive variables include:

(1) AIC (Akaike Information Criterion)
(2) BIC  (Bayesian Information Criterion)
(3) adjusted R^2
(4) Cross Validation (CV)
(5) Mallows' Cp
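As one example from this list, a small Python sketch of the adjusted R^2 for a model with d predictors (the penalty on d is what distinguishes it from the plain R^2 above):

import numpy as np

def adjusted_r2(y, y_pred, d):
    # adjusted R^2 = 1 - (SSE / (n - d - 1)) / (SST / (n - 1))
    n = len(y)
    sse = np.sum((y - y_pred) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    return 1 - (sse / (n - d - 1)) / (sst / (n - 1))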








