Thursday, July 21, 2016

Linear Regression - Chapter 3 - ISLR book

Data! Data! Data! I can’t make bricks without clay!
Sir Arthur Conan Doyle

Chapter 3 : Linear Regression (edX Course notes)


Prob.1. Why is linear regression important to understand? Select all that apply:

(1) The linear model is often incorrect
(2) Linear regression is very extensible and can be used to capture nonlinear effects (correct)
(3) Simple methods can outperform more complex ones if the data are noisy (correct)
(4) Understanding simpler methods sheds light on more complex ones (correct)

Prob.2. You may want to reread the paragraph on confidence intervals on page 66 of the textbook before trying this question (the distinctions are subtle).
Which of the following are true statements? Select all that apply:
(1) A 95% confidence interval is a random interval that contains the true parameter 95% of the time (correct)
(2) The true parameter is a random value that has 95% chance of falling in the 95% confidence interval
(3) I perform a linear regression and get a 95% confidence interval from 0.4 to 0.5. There is a 95% probability that the true parameter is between 0.4 and 0.5.
(4) The true parameter (unknown to me) is 0.5. If I sample data and construct a 95% confidence interval, the interval will contain 0.5, 95% of the time. (correct)
EXPLANATION
Confidence intervals are a "frequentist" concept: the interval, and not the true parameter, is considered random.

Standard Errors and Hypothesis Testing:

Standard error (SE) tells us the average amount an estimate differs from the actual value. SE is inversely proportional to the square root of the number of observations, so the deviation shrinks as the number of observations grows. For the sample mean this is given by

SE^2 = sigma^2 / n, i.e. SE = sigma / sqrt(n),

where sigma is the standard deviation of each observation. SE is used to compute confidence intervals. A 95% confidence interval is defined as the range of values such that with 95% probability, the range will contain the actual value of the parameter.
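As a rough illustration of these formulas (the sample below is made up), the standard error of a sample mean and the resulting approximate 95% interval can be computed as:

```python
import math

# Hypothetical sample of n = 10 observations.
data = [2.1, 1.9, 2.4, 2.0, 1.8, 2.2, 2.3, 1.7, 2.0, 2.1]
n = len(data)
mean = sum(data) / n
# Sample standard deviation (n - 1 in the denominator).
sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
se = sigma / math.sqrt(n)                  # SE shrinks like 1 / sqrt(n)
ci = (mean - 1.96 * se, mean + 1.96 * se)  # approximate 95% confidence interval
print(mean, se, ci)
```

Doubling n (with sigma unchanged) shrinks SE by a factor of sqrt(2), which is the "deviation shrinks with more observations" point above.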

Standard errors can also be used to perform hypothesis tests.

null hypothesis H_0: There is no relationship between X and Y, versus
alternative hypothesis H_a: There is some relationship between X and Y.

Mathematically, this corresponds to testing H_0: β_1 = 0 versus H_a: β_1 ≠ 0.

To check whether this holds, we again use the standard error of the coefficient. We compute the t-statistic, t = β̂_1 / SE(β̂_1), and its p-value (a small p-value indicates that it is unlikely we would observe such an association between response and predictor just by chance).
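For instance, a t-statistic and its two-sided p-value (under a normal approximation to the t-distribution) can be computed from an estimate and its standard error; the numbers here are illustrative, borrowed from Prob. 3 below:

```python
import math

beta1_hat = 0.5   # hypothetical slope estimate
se_beta1 = 0.2    # hypothetical standard error

t_stat = beta1_hat / se_beta1   # t = (estimate - 0) / SE

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p_value = 2 * (1 - norm_cdf(abs(t_stat)))  # two-sided p-value
print(t_stat, p_value)
```

Here t = 2.5, and the p-value comes out a bit over 0.01, so at the 5% level we would reject H_0: β_1 = 0.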


Assessing the accuracy of the model:

SSE (Sum of Squared Errors) : The sum of the squared differences between the actual and predicted values.
SST (Total Sum of Squares) : The SSE of the intercept-only model, i.e. the squared differences between the actual values and their mean (a flat line, disregarding all coefficients apart from the intercept).
R^2 = 1 - (SSE / SST) : This quantifies how well the coefficients of the independent variables approximate the real data.
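A minimal sketch of these three quantities on made-up numbers:

```python
# Toy data: actual responses and model predictions (hypothetical numbers).
y      = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.1, 7.3, 8.9, 10.9]

y_bar = sum(y) / len(y)
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))  # residual error
sst = sum((yi - y_bar) ** 2 for yi in y)                # error of the flat (mean-only) line
r_squared = 1 - sse / sst
print(r_squared)
```

An R^2 near 1 means the model explains almost all the variation that the flat line leaves unexplained.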
Prob 3. We run a linear regression and the slope estimate is 0.5 with estimated standard error of 0.2. What is the largest value of b for which we would not reject the null hypothesis that β_1 = b? Assume a normal distribution, that we are using the 5% significance level for a two-sided test, and that we need two-significant-digit accuracy.

Beta_1 = 0.5
SE(Beta_1) = 0.2

Need to understand how this is done... 

Ans: 0.892
EXPLANATION
The 95% confidence interval β̂_1 ± 1.96 · SE(β̂_1) = 0.5 ± 1.96 · 0.2 = (0.108, 0.892) contains all parameter values that would not be rejected at a 5% significance level.

Prob 4. Which of the following indicates a fairly strong relationship between X and Y?
(1) R^2 = 0.9 (correct)
(2) The p-value for the null hypothesis β_1 = 0 is 0.0001
(3) The t-statistic for the null hypothesis β_1 = 0 is 30

The R^2 is the squared correlation between the two variables and measures how closely they are associated. The p-value and t-statistic merely measure how strong the evidence is that there is a nonzero association. Even a weak effect can be extremely significant given enough data.


Multiple Regression

1. Multiple predictors
2. Correlation among predictors can increase variance and interpretations may become difficult. In that case, claims of causality should be avoided.

Prob 5. Suppose we are interested in learning about a relationship between X_1 and Y, which we would ideally like to interpret as causal.
True or False? False (correct)
The estimate of β̂_1 in a linear regression that controls for many variables (that is, a regression with many predictors in addition to X_1) is usually a more reliable measure of a causal relationship than β̂_1 from a univariate regression on X_1.
EXPLANATION
Adding lots of extra predictors to the model can just as easily muddy the interpretation of β̂_1 as it can clarify it. One often reads in media reports of academic studies that "the investigators controlled for confounding variables," but be skeptical!
Causal inference is a difficult and slippery topic, which cannot be answered with observational data alone without additional assumptions.


3. How do we decide on the model and the number of predictive variables?
(All subsets or best subsets regression quickly becomes infeasible: with p = 40 predictors there are 2^40 possible models.)

(i) Forward Selection

(a) Start with the null model (the model with just the intercept and no predictors: β_0).
(b) Add one predictive variable, fit the model, and record the RSS (β_0 + β_1·X_1).
(c) Do the same for each of the other predictive variables added to the null model (e.g. β_0 + β_2·X_2).
(d) Choose the predictive variable whose model has the lowest RSS.
(e) Keep that predictor in the model as the base, and repeat the procedure over the remaining (p − 1) variables, adding one more predictor at each step.
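The steps above can be sketched in a few lines on simulated data (assuming NumPy is available; `rss` and `forward_selection` are hypothetical helper names, not from the book):

```python
import numpy as np

def rss(X, y):
    """RSS of a least-squares fit of y on X (X includes an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def forward_selection(X, y, n_keep):
    """Greedy forward selection: at each step, add the predictor whose
    inclusion yields the lowest RSS."""
    n, p = X.shape
    chosen, remaining = [], list(range(p))
    for _ in range(n_keep):
        best_j = min(remaining,
                     key=lambda j: rss(np.column_stack(
                         [np.ones(n)] + [X[:, k] for k in chosen + [j]]), y))
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen

# Simulated data: y depends almost entirely on column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=100)
print(forward_selection(X, y, 2))  # column 0 should be picked first
```

Note this greedy search fits only on the order of p models per step, instead of the 2^p models of best-subset selection.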

(ii) Backward Selection

(a) Start with the model containing all the predictive variables.
(b) Remove the predictive variable with the largest p-value.
(c) Refit on the remaining (p − 1) variables and again remove the least significant one.
(d) Continue until all remaining predictive variables are significant.
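A sketch of this elimination loop on simulated data, using a normal approximation to the t-distribution for the p-values (the `backward_selection` helper and the data are illustrative, not from the book):

```python
import math
import numpy as np

def backward_selection(X, y, alpha=0.05):
    """Backward selection sketch: repeatedly drop the predictor with the
    largest p-value until every remaining predictor is significant."""
    n = X.shape[0]
    keep = list(range(X.shape[1]))
    while keep:
        Xk = np.column_stack([np.ones(n)] + [X[:, j] for j in keep])
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        resid = y - Xk @ beta
        df = n - Xk.shape[1]
        sigma2 = float(resid @ resid) / df              # residual variance
        cov = sigma2 * np.linalg.inv(Xk.T @ Xk)         # covariance of beta-hat
        se = np.sqrt(np.diag(cov))
        z = np.abs(beta / se)
        # Two-sided p-values for the slopes (skip the intercept, index 0),
        # using the normal approximation to the t-distribution.
        pvals = [2 * (1 - 0.5 * (1 + math.erf(zi / math.sqrt(2)))) for zi in z[1:]]
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:
            break                                       # all remaining are significant
        keep.pop(worst)
    return keep

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 1] - 1.5 * X[:, 3] + rng.normal(size=200)
print(backward_selection(X, y))  # the truly relevant columns (1 and 3) should survive
```

In practice one would use a proper t-distribution (and a library such as statsmodels), but the structure of the loop is the same.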

Systematic criteria for choosing an optimal number of predictive variables include

(1) AIC (Akaike Information Criterion)
(2) BIC  (Bayesian Information Criterion)
(3) adjusted R^2
(4) Cross Validation (CV)
(5) Mallows's Cp










Sunday, July 10, 2016

Exercises from Chapter 2 - ISLR book

"I never guess. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts",

Sir Arthur Conan Doyle, Author of Sherlock Holmes stories

Exercises from Chapter 2 - ISLR book by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani

2.4 Exercises
Conceptual

1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

(a)  The sample size n is extremely large, and the number of predictors p is small

Ans: A flexible method performs better here. With an extremely large sample size n, a flexible method can fit the data closely while the large n keeps its variance low, and it will have lower bias than an inflexible method. Since the number of predictors p is small, there is little risk of overfitting.

(b)  The number of predictors p is extremely large, and the number of observations n is small.

Ans: Here we have high-dimensional data (p is extremely large) but a small sample size, so neither flexible nor inflexible models will perform especially well. An inflexible model may not be able to represent the data well, and with a small sample size it can have large bias. A flexible model can represent the data well given the large p, but with few observations it drives down the mean squared error on the training data while the mean squared error on the test data increases due to large variance. In short, a flexible model will overfit the training data when the number of observations is small, so an inflexible method is expected to do better here.

(c)  The relationship between the predictors and response is highly non-linear.

Ans: A flexible model performs well here. A flexible model, with its larger number of degrees of freedom, can represent a highly non-linear relationship better than an inflexible model such as linear regression (two degrees of freedom).

(d)  The variance of the error terms, i.e. σ2 = Var(ε), is extremely high. 

Ans: If the variance of the error terms is large, a flexible model will adjust its parameters to fit these error fluctuations, leading to large variance on the test data. An inflexible model, being simpler, will not chase the noise and should achieve smaller variance on the test set. Therefore, an inflexible model is better in this case.

2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

(a)  We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary. 

Ans: Regression, and an inference problem - we are trying to understand which factors influence CEO salary by looking at predictors such as profit, number of employees, and industry. n = 500; p = 3 (profit, number of employees, industry); response = CEO salary.

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables. 

Ans: Classification, and a prediction problem, as the response is a yes or no / success or failure. n = 20; p = 13.

(c) We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market. 

Ans: Regression, and a prediction problem. n = 52 (the number of weeks in 2012, with one data point per week); p = 3.

3. We now visit the bias-variance trade off

(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one. 

Ans: (sketch omitted; the five curves are labeled as follows)

 Blue : Squared Bias
 Brown : Variance
 Yellow : Train MSE
 Green : Test MSE
 Black : Bayes or Irreducible error

X-axis - increasing flexibility
Y-axis - values

(b) Explain why each of the five curves has the shape displayed in part (a). 
Blue: Squared Bias - If a really complicated problem is modeled using a simple linear-regression-type model, the model is biased. As the flexibility of the model increases along the x-axis, the bias decreases.
Brown: Variance - Consider a highly flexible model which explains all the points in the data, including the outliers; the model overfits the training data. When we use this overfit model on the test set, the test MSE is large, and this happens due to large variance. As the graph shows, when the model is less complex and inflexible the variance is small, and it increases monotonically as flexibility increases.
Yellow: Train MSE - This is closely related to bias. When the model is simple and inflexible, the bias is large; as the model becomes more and more complex it fits the training data ever more closely, reducing the bias and thereby the training error.
Green: Test MSE - This reflects the bias-variance trade-off. When the model is simple and inflexible, the bias is large and the variance small, so the test MSE is high. As the model becomes more flexible, at first the bias decreases faster than the variance increases, so the test MSE falls; beyond some point the variance increases faster than the bias decreases, and the test MSE rises again. This trade-off produces the U-shaped curve.
Black: Bayes or Irreducible error - This is a horizontal line that does not change with increasing flexibility. It gives a lower bound on the expected test MSE of any model.

4. You will now think of some real-life applications for statistical learning. 
(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer. 

1. From a given demographic data predict whether the particular person will vote for democratic or republican government.
Response : Republican or democratic 
Predictors - Age, Gender, Ethnicity,  salary, state..etc..

2. From a given hand written numbers, predict the correct number. 
Response : 0,1,2,3,4,5,6,7,8,9
Predictors -  various hand written images of all numbers

3. Predict admission into a college for higher education 
Response : Admitted or Not admitted
Predictors : GPA, SAT score, ACT score, Extracurricular participation score, Ethnicity, State score, Relevancy of degree to the higher education, Specializations, Age, Gender

(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer. 

1. Predict next year's total sales from previous years' sales

Response: Sales next year
Predictors : Total people visited, New items added, New items sold, Customer happiness index, Customer easy finding index, Average number of items sold / visit, Average time required / visit, Number of sales person / day etc..

2. Predicting the price of wine 
Response : Price of wine
Predictors : Winter Rain, Harvest Rain, Age etc..

3. Moneyball prediction on Oakland A's win in baseball game

Response : Number of wins by the particular team
Predictors :  Previous years wins, run rate, rank, speed, parameters representing players strength, total runs, team score, budget of the team, players salary, games played that year etc..

(c) Describe three real-life applications in which cluster analysis might be useful. 

1. Clustering a group of people into various categories by looking at the spending pattern
2. Clustering an MRI image with respect to intensity and mapping into the original picture for detection
3. Clustering genes using the gene expression data for cancer and other deadly disease detection. 


5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred? 


1. A flexible approach takes into account all the predictors in the data and builds an elaborate model without strong assumptions, and hence will come closer to the true function.
2. But because the model is so flexible, it can overfit the training data, leading to large test error. To avoid overfitting we need a large number of samples.
3. The flexible versus less flexible approaches can be compared to the Epicurean principle (plenitude, or multiple explanations) versus Occam's Razor (the reductionist model with the fewest and simplest assumptions).

6. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages? 

In a parametric approach we start with an assumption about the functional form (linear or nonlinear / flexible or less flexible) and then fit the data to that form. This model-based approach is called the parametric approach; linear regression is one such method.

In a non-parametric approach, by contrast, we make no assumption about the form of the function. Instead we estimate a function that fits the data closely. Thin-plate splines are a non-parametric approach.

Advantages: 
(1) A parametric approach can fit many kinds of data by choosing from a wide range of functional forms.
(2) With a parametric approach we can do inference with a simple model (e.g. a linear model), whereas a non-parametric approach needs a large number of observations to say anything precise about the model.

Disadvantages:
(1) A parametric approach with many parameters can overfit the training data, leading to large test errors.
(2) If the chosen functional form is far from the true function, the parametric model will have large bias no matter how much training data we have.


7. The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

                              Obs.   X1   X2   X3   Y
                               1      0    3    0   Red
                               2      2    0    0   Red
                               3      0    1    3   Red
                               4      0    1    2   Green
                               5     -1    0    1   Green
                               6      1    1    1   Red

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

(a) Compute the Euclidean distance between each observation and the test point,X1 =X2 =X3 =0.
(b)  What is our prediction with K = 1? Why?
(c)  What is our prediction with K = 3? Why?
(d)  If the Bayes decision boundary in this problem is highly non- linear, then would we expect the best value for K to be large or small? Why? 

ANS
(a) Red: sqrt((0-0)^2 + (0-3)^2 + (0-0)^2 )= sqrt(9) = 3 
     Red: sqrt((0-2)^2 + (0-0)^2 + (0-0)^2 )= sqrt(4) = 2
     Red: sqrt((0-0)^2 + (0-1)^2 + (0-3)^2 )= sqrt(10) = 3.162278
     Green: sqrt((0-0)^2 + (0-1)^2 + (0-2)^2 )= sqrt(5)= 2.236068
     Green: sqrt((0+1)^2 + (0-0)^2 + (0-1)^2 )= sqrt(2) =1.414214 
     Red: sqrt((0-1)^2 + (0-1)^2 + (0-1)^2 )= sqrt(3)=1.732051
(b) K=1
Test point X1 = X2 = X3 = 0. The closest observation is a Green one (Obs. 5) at distance sqrt(2). Therefore the prediction is Green.
(c) K= 3
test set X1= X2=X3=0. Closest ones are Red (Obs 2), Green(Obs 5) and Red(Obs 6)
Prediction will be Red
(d) If the Bayes decision boundary is highly non-linear, we would expect the best value of K to be small. Small K yields a flexible decision boundary that can follow a non-linear shape, whereas large K averages over many neighbors and produces a smoother, more linear boundary.
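The calculations in (a)-(c) can be reproduced in a few lines of Python (the `knn_predict` helper is just for illustration; the data are the six observations from the table, with X1 = -1 for observation 5):

```python
import math
from collections import Counter

# Training data from the table above.
train = [((0, 3, 0), "Red"), ((2, 0, 0), "Red"), ((0, 1, 3), "Red"),
         ((0, 1, 2), "Green"), ((-1, 0, 1), "Green"), ((1, 1, 1), "Red")]
test_point = (0, 0, 0)

def euclid(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(k):
    # Take the k nearest observations and return the majority label.
    neighbors = sorted(train, key=lambda obs: euclid(obs[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict(1))   # Green (nearest is Obs. 5 at distance sqrt(2))
print(knn_predict(3))   # Red (Obs. 5, 6, 2 -> Green, Red, Red)
```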








Thursday, July 7, 2016

Variance-Bias Trade-Offs in Model building - Notes from ISLR Chapter 2

Emotional commitment is a key to achieving mastery

0. How do we measure the quality of a fit (e.g. in the case of regression)?

For a given data, we need some way to measure how well the predictions match the observed data. In regression setting, the most commonly-used is the mean squared error (MSE). The MSE will be small if the predicted responses are very close to the true responses and vice versa.
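For concreteness, the MSE of a set of predictions (on made-up numbers) is just the average squared gap between predicted and observed responses:

```python
# Hypothetical responses and model predictions.
y      = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
mse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred)) / len(y)
print(mse)
```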


1. What is training MSE and testing MSE and how are they related to the flexibility of the statistical method? 

The mean square error computed using the training data which we used to fit the model is called the training MSE. However, our interest is in checking the predictions on the unseen test data which was never used to train the statistical learning method. The mean square error computed using the testing data is called testing MSE. We want to select a statistical method for which the test MSE is smallest. It is important to know that there is no guarantee that the method with the lowest training MSE will also have lowest test MSE.

Let's examine how the training MSE and test MSE are related to the flexibility of the statistical method.

Figure 2.9 illustrates this phenomenon on a simple example. In the left- hand panel of Figure 2.9, observations with the true function f is given by the black curve. The orange, blue and green curves illustrate three possible estimates for f obtained using methods with increasing levels of flexibility. The orange line is the linear regression fit, which is relatively inflexible. The blue and green curves were produced using smoothing splines, with different levels of smoothness. It is clear that as the level of flexibility increases, the curves fit the observed data more closely. The green curve is the most flexible and matches the data very well; however, we observe that it fits the true f (shown in black) poorly because it is too wiggly. 


On the right-hand panel of Figure 2.9, the grey curve displays the average training MSE as a function of flexibility, or more formally the degrees of freedom, for a number of smoothing splines. The degrees of freedom is a quantity that summarizes the flexibility of a curve. The orange, blue and green squares indicate the MSEs associated with the corresponding curves in the left-hand panel. A more restricted and hence smoother curve has fewer degrees of freedom than a wiggly curve. Linear regression is at the most restrictive end, with two degrees of freedom. The training MSE declines monotonically as flexibility increases. In this example the true f is non-linear, and so the orange linear fit is not flexible enough to estimate f well. The green curve has the lowest training MSE of all three methods, since it corresponds to the most flexible of the three curves fit in the left-hand panel.



In this example, we know the true function f, and so we can also compute the test MSE over a very large test set, as a function of flexibility. The test MSE is displayed using the red curve in the right-hand panel of Figure 2.9. As with the training MSE, the test MSE initially declines as the level of flexibility increases. However, at some point the test MSE levels off and then starts to increase again. Consequently, the orange and green curves both have high test MSE. The blue curve minimizes the test MSE, which should not be surprising given that visually it appears to estimate f the best in the left-hand panel of Figure 2.9. 

In the right-hand panel of Figure 2.9, as the flexibility of the statistical learning method increases, we observe a monotone decrease in the training MSE and a U-shape in the test MSE. This is a fundamental property of statistical learning that holds regardless of the particular data set at hand and regardless of the statistical method being used. As model flexibility increases, training MSE will decrease, but the test MSE may not. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function f. When we overfit the training data, the test MSE will be very large because the supposed patterns that the method found in the training data simply don’t exist in the test data. Note that regardless of whether or not overfitting has occurred, we almost always expect the training MSE to be smaller than the test MSE because most statistical learning methods either directly or indirectly seek to minimize the training MSE. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.
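This monotone-training / U-shaped-test pattern is easy to reproduce with polynomial fits of increasing degree on simulated data (a sketch assuming NumPy; the true function, noise level, and sample sizes are made up, and polynomial degree plays the role of flexibility):

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    return x ** 3 - 3 * x          # a non-linear "true" function

x_train = rng.uniform(0, 3, 40)
y_train = true_f(x_train) + rng.normal(size=40)
x_test = rng.uniform(0, 3, 200)
y_test = true_f(x_test) + rng.normal(size=200)

def mses(degree):
    """Train and test MSE of a least-squares polynomial fit of this degree."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for d in (1, 3, 12):
    tr, te = mses(d)
    print(d, round(float(tr), 3), round(float(te), 3))
```

The training MSE can only fall as the degree rises (the models are nested), while the test MSE is lowest at a moderate degree: degree 1 underfits the cubic truth, and a high degree chases the noise.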


In Fig. 2.10 the true function is linear, and hence linear regression fits well: the orange and blue curves represent the data better and their test MSEs are smaller, while the green curve overfits and its test MSE is larger.










In Fig. 2.11 the true function is highly non-linear, so linear regression fits poorly and both its training and test MSEs are large, whereas the blue and green splines fit well and their test MSEs are smaller.









The above behavior in all three cases can be understood from the equation for the expected test MSE. The U-shape observed in the test MSE curves (Figures 2.9-2.11) turns out to be the result of two competing properties of statistical learning methods. The expected test MSE, for a given value x_0, can always be decomposed into the sum of three fundamental quantities: the variance of f̂(x_0), the squared bias of f̂(x_0), and the variance of the error term ε. That is,

E[(y_0 − f̂(x_0))^2] = Var(f̂(x_0)) + [Bias(f̂(x_0))]^2 + Var(ε)     (Equation 2.7)

Equation 2.7 tells us that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. Note that variance is inherently a nonnegative quantity, and squared bias is also nonnegative. Hence, the expected test MSE can never lie below Var(ε), the irreducible error. The three effects are shown separately for the three cases of Figs. 2.9 to 2.11. We discuss below the meaning of bias and variance and how each depends on model complexity.
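The decomposition can be verified by simulation: repeatedly fit a straight line to data drawn from a non-linear truth, then compare bias² + variance + Var(ε) at a point against the directly estimated expected test MSE (all numbers here are illustrative, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(7)
x0 = 1.5          # evaluation point (arbitrary)
sigma = 0.5       # noise standard deviation, so Var(eps) = 0.25

def true_f(x):
    return x ** 2  # a non-linear truth that a straight line cannot capture

preds, sq_errs = [], []
for _ in range(3000):
    # Fresh training set; fit a straight line (a deliberately biased model).
    x = rng.uniform(0, 2, 30)
    y = true_f(x) + sigma * rng.normal(size=30)
    slope, intercept = np.polyfit(x, y, 1)
    pred = intercept + slope * x0
    preds.append(pred)
    # Fresh test response at x0.
    y0 = true_f(x0) + sigma * rng.normal()
    sq_errs.append((y0 - pred) ** 2)

preds = np.array(preds)
variance = float(preds.var())                       # Var(fhat(x0))
bias_sq = float((preds.mean() - true_f(x0)) ** 2)   # [Bias(fhat(x0))]^2
lhs = float(np.mean(sq_errs))                       # estimated expected test MSE
rhs = variance + bias_sq + sigma ** 2               # Var(fhat) + Bias^2 + Var(eps)
print(lhs, rhs)
```

The two printed numbers agree up to simulation noise, and both sit above Var(ε) = 0.25, illustrating that the irreducible error is a floor on the expected test MSE.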



2. What is the meaning of Variance of a statistical method and how is it related to flexibility of the statistical method?

Variance refers to the amount by which the estimated function f̂ would change if it were estimated using a different training data set. Different training sets yield different fits, so the estimate f̂ varies from one training set to another; ideally it should not vary too much.

In general, more flexible statistical methods have higher variance. With a highly flexible fit, a small change in one data point can cause the estimate f̂ to change drastically, whereas an inflexible fit such as linear regression changes only slightly.

3. What is the meaning of bias of a statistical method and how is it related to flexibility of the statistical method?

Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between  the response / dependent variable and features / predictors / independent variables. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of function f.

Hence, more flexible statistical methods generally have less bias.