Abstract:
When many candidate explanatory variables are available in multiple regression, there is always a risk that important determinants of the response variable are omitted from the model while unimportant variables are included. Both types of error matter, and this dissertation attempts to quantify their probabilities by means of a simulation study. The number of variables ranges from p = 4 to p = 20, and the sample sizes considered are n = 0.5p, p, 2p and 4p. For each p, the underlying model assumes that roughly half of the independent variables are actually correlated with the dependent variable and the other half are not. The noise is ε ~ N(0, σ²), where σ² is held fixed. For each combination of n and p, the data were simulated 10,000 times from the known underlying model, with ε drawn at random from a normal distribution. The full model and forward selection regression are compared. For each combination of n and p, the mean squared error of the estimated coefficient β(p) relative to the true β is determined. A full discussion, together with graphs, is presented.
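The simulation design described above can be illustrated with a minimal sketch. It is not the code used in the dissertation, and several details are assumptions made only for illustration: the predictors are drawn as independent standard normal variables, the first half of the true coefficients is set to 1 and the rest to 0, no intercept is fitted, and forward selection stops when the partial F-statistic of the best remaining candidate falls below a hypothetical threshold `f_to_enter = 4.0`. The function names (`fit_ols`, `forward_selection`, `simulate`) are likewise hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients (minimum-norm solution if X is rank deficient)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def rss(X, y, cols):
    """Residual sum of squares of the no-intercept model using columns `cols`."""
    if not cols:
        return float(y @ y)
    Xs = X[:, cols]
    resid = y - Xs @ fit_ols(Xs, y)
    return float(resid @ resid)


def forward_selection(X, y, f_to_enter=4.0):
    """Greedy forward selection; the F-to-enter stopping rule is an assumption."""
    n, p = X.shape
    selected = []
    current = rss(X, y, selected)
    # Keep at least one residual degree of freedom for the F-statistic.
    while len(selected) < min(p, n - 1):
        candidates = [j for j in range(p) if j not in selected]
        trials = {j: rss(X, y, selected + [j]) for j in candidates}
        best = min(trials, key=trials.get)
        df = n - len(selected) - 1
        f_stat = (current - trials[best]) / (max(trials[best], 1e-12) / df)
        if f_stat < f_to_enter:
            break
        selected.append(best)
        current = trials[best]
    beta = np.zeros(p)
    if selected:
        beta[selected] = fit_ols(X[:, selected], y)
    return beta, selected


def simulate(n, p, n_reps=200, sigma=1.0, seed=0):
    """Compare the full model with forward selection for one (n, p) pair."""
    rng = np.random.default_rng(seed)
    half = p // 2
    beta_true = np.zeros(p)
    beta_true[:half] = 1.0  # assumption: important variables have coefficient 1
    mse_full = mse_fwd = 0.0
    omitted = included = 0
    for _ in range(n_reps):
        X = rng.standard_normal((n, p))  # assumption: independent N(0, 1) predictors
        y = X @ beta_true + rng.normal(0.0, sigma, size=n)
        b_full = fit_ols(X, y)
        b_fwd, sel = forward_selection(X, y)
        mse_full += np.mean((b_full - beta_true) ** 2)
        mse_fwd += np.mean((b_fwd - beta_true) ** 2)
        chosen = set(sel)
        omitted += sum(1 for j in range(half) if j not in chosen)
        included += sum(1 for j in range(half, p) if j in chosen)
    return {
        "mse_full": mse_full / n_reps,
        "mse_forward": mse_fwd / n_reps,
        "p_omit_important": omitted / (n_reps * half),
        "p_include_unimportant": included / (n_reps * (p - half)),
    }


if __name__ == "__main__":
    # One cell of the design: p = 8 variables with n = 4p observations.
    print(simulate(n=32, p=8))
```

The sketch uses far fewer replications than the 10,000 reported above so that it runs quickly. Calling `simulate` over a grid of n and p values would yield, for each cell, the two estimated error probabilities and the mean squared error of the coefficients under each method.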