QBUS2810: Statistical Modelling for Business

Assignment Task #3

Submission Due Date: Sunday, 22nd November, 2020 (Week 12) before 11:59 pm (Sydney time)

Instructions:

1. You are required to type up your entire assignment, including any equations. Copy and paste relevant

outputs into your text. If you are using Word, you should use the equation editor for any maths notation.

2. You should attach relevant analysis outputs (graphs, tables, etc.) while discussing your answer in the text.

3. Please answer all questions in the given order; i.e., 1a, 1b, etc. You do not need to re-write the assignment

questions again. Keep your answers clear, brief, and concise.

4. There is no requirement for font size and line spacing, but it must be legible and correctly oriented.

5. Please convert and submit your assignment in pdf, which must be uploaded to the Turnitin assignment

box on Canvas.

6. For hypothesis test questions, use the p-value approach. Your answer should include the hypotheses

(H0 and H1), decision, and conclusion.

7. Data used in this assignment are in the spreadsheet A3Dataset.xlsx.

8. You are encouraged to discuss the assignment with your classmates, tutors, and lecturer. However, you

MUST write up solutions on your own. Students caught cheating will automatically receive a mark of 0

and are subject to disciplinary action.

1. The capital asset pricing model (CAPM) is used in finance to determine a theoretically appropriate

required rate of return of an asset, where that asset is to be added to an already well-diversified

portfolio, given that asset’s non-diversifiable risk. Traditionally, applications of the CAPM use only

one variable to describe the returns of a portfolio or stock with the returns of the market as a whole:

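$$r_{\text{stock}} - r_f = \alpha_{\text{stock}} + \beta_{\text{stock}}(r_m - r_f) + u_t$$

In contrast, the Fama-French model uses three variables:

$$r_{\text{stock}} - r_f = \alpha_{\text{stock}} + \beta_{\text{stock}}(r_m - r_f) + \beta_2\,\text{SMB} + \beta_3\,\text{HML} + \varepsilon_t$$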

rstock is the stock’s rate of return, rf is the risk-free return rate, and rm is the return of the

whole stock market. The parameter αstock is the stock’s "alpha": it measures how much the stock

outperforms its "theoretical" predicted returns under the CAPM. The parameter βstock is the stock’s "beta",

which measures the stock’s exposure to the overall market. Different stocks will have different

parameters.

The Fama-French model contains two additional factors to explain stock returns. Small market capitalization Minus Big (SMB) measures the historic excess returns of small cap stocks over big caps.

High book-to-market ratio (BtM) Minus Low book-to-market ratio (HML) measures the historic

excess returns of value stocks (high BtM ratio) over growth stocks (low BtM ratio). These factors

are calculated from combinations of portfolios composed of ranked stocks (BtM ranking, capitalisation ranking) and available historical market data. Historical values are available on Kenneth

French’s web page for American stocks.

The variables used in this exercise are as follows:

rBHP = Monthly return on BHP stock as observed on the ASX.

rm = Monthly return on market index, here the All Ordinaries Index (AOI).

SMB = Small market capitalization Minus Big market capitalization factor.

HML = High book-to-market ratio Minus Low book-to-market ratio factor.

You are to assume a risk-free rate of rf = 0.005 per month. Your task is to estimate the Fama-French

three-factor model using the given data and determine whether it is any better at explaining the

BHP stock returns than the one-factor model that uses only the excess returns on the All Ordinaries Index.
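The setup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the returns are simulated (they are not the values in A3Dataset.xlsx), the helper name `ols` and all variable names are hypothetical, and NumPy's least-squares routine stands in for whatever regression software you use.

```python
import numpy as np

RF = 0.005  # monthly risk-free rate stated in the assignment

def ols(X, y):
    """Return OLS coefficients, residual sum of squares, and R^2."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    return b, rss, 1.0 - rss / tss

# Simulated stand-in data (an assumption, NOT the assignment data set).
rng = np.random.default_rng(0)
n = 120
r_m = rng.normal(0.008, 0.04, n)
smb = rng.normal(0.0, 0.02, n)
hml = rng.normal(0.0, 0.02, n)
noise = rng.normal(0.0, 0.03, n)
r_stock = RF + 0.001 + 1.2 * (r_m - RF) + 0.3 * smb - 0.2 * hml + noise

y = r_stock - RF      # excess stock return
x_mkt = r_m - RF      # excess market return
ones = np.ones(n)

# One-factor CAPM design and three-factor Fama-French design.
X_capm = np.column_stack([ones, x_mkt])
X_ff3 = np.column_stack([ones, x_mkt, smb, hml])

b_capm, rss_capm, r2_capm = ols(X_capm, y)
b_ff3, rss_ff3, r2_ff3 = ols(X_ff3, y)

print("CAPM coefficients:", b_capm, "R^2:", r2_capm)
print("FF3 coefficients: ", b_ff3, "R^2:", r2_ff3)
```

Because the one-factor model is nested in the three-factor model, the three-factor fit can never have a larger residual sum of squares. Whether the improvement is statistically meaningful is a separate question that requires a formal test.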

(a) Write down the five-number summaries plus mean, standard deviation, skewness, and kurtosis

coefficients of rBHP .

(b) Plot the rBHP series over time and comment on it.

(c) Generate two new variables rBHP - rf and rm - rf and estimate the one-factor CAPM model:

$$r_{\text{stock}} - r_f = \beta_0 + \beta_1(r_m - r_f) + u_t$$

Copy and paste the regression output into your answer sheet. Write down the fitted regression

equation.

(d) Comment on the sign of the estimated coefficient β1 and state whether this is what you expect.

(e) Test whether or not the excess market returns explain the excess returns of BHP shares at the

α = 0.05 level.

(f) Test whether or not BHP’s "beta" is greater than one at the α = 0.05 level.

(g) Estimate the Fama-French 3-Factor CAPM model:

$$r_{\text{stock}} - r_f = \beta_0 + \beta_1(r_m - r_f) + \beta_2\,\text{SMB} + \beta_3\,\text{HML} + \varepsilon_t$$

Copy and paste the regression output into your answer sheet. Write down the fitted regression

equation.

(h) Set up the general linear hypothesis for testing whether or not the Fama-French 3-Factor CAPM

model explains the stock returns better than the one-factor CAPM model; i.e., determine L,

β, and c for H0: Lβ = c.

(i) Conduct a hypothesis test for part (h).

(j) A Financial Analyst believes that the effect of book-to-market values (HML) on stock returns is

twice as great as the effect of market capitalization (SMB). Formulate an appropriate hypothesis

test and use re-parametrisation to convert it to a simple t-test to test the assertion. Perform

the required regression and state your conclusion at the α = 0.05 level.

(k) Obtain the variance-covariance matrix for the estimators of the parameters in the regression model in

part (g). Use the regression result in part (g) and the variance-covariance matrix to repeat

the hypothesis test in part (j) by means of a simple t-test.
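The mechanics behind parts (h) to (k) can be illustrated with a short sketch. It assumes simulated data with columns [1, market excess return, SMB, HML], and the helper names `fit_ols`, `f_stat` and `t_stat` are hypothetical. The particular L and a shown are one possible encoding of the restrictions, not a model answer. To obtain p-values, compare the statistics with the F(q, n − p) and t(n − p) distributions.

```python
import numpy as np

def fit_ols(X, y):
    """Return (b, V, rss): coefficients, estimated var-cov matrix, residual SS."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    rss = float(resid @ resid)
    return b, (rss / (n - p)) * xtx_inv, rss

def f_stat(b, V, L, c):
    """F statistic for H0: L beta = c, where L has q rows."""
    L = np.atleast_2d(np.asarray(L, dtype=float))
    d = L @ b - np.asarray(c, dtype=float)
    return float(d @ np.linalg.solve(L @ V @ L.T, d)) / L.shape[0]

def t_stat(b, V, a, c=0.0):
    """t statistic for a single linear restriction H0: a' beta = c."""
    a = np.asarray(a, dtype=float)
    return float((a @ b - c) / np.sqrt(a @ V @ a))

# Simulated data (an assumption, not the assignment data set).
rng = np.random.default_rng(0)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([0.0, 1.1, 0.2, 0.4]) + rng.normal(size=n)

b, V, rss_full = fit_ols(X, y)

# Joint restriction beta2 = beta3 = 0 (SMB and HML add nothing).
L_joint = [[0, 0, 1, 0], [0, 0, 0, 1]]
F_joint = f_stat(b, V, L_joint, [0, 0])

# The same F statistic from the residual sums of squares of the nested fits.
_, _, rss_restricted = fit_ols(X[:, :2], y)
F_rss = ((rss_restricted - rss_full) / 2) / (rss_full / (n - 4))

# Single restriction beta3 - 2 * beta2 = 0, tested with a t statistic.
a = [0, 0, -2, 1]
t_single = t_stat(b, V, a)
F_single = f_stat(b, V, [a], [0])

print(F_joint, F_rss, t_single, F_single)
```

Note that the standard error of a'b is the square root of a'Va, which for this a expands to Var(b3) + 4 Var(b2) − 4 Cov(b2, b3). With a single restriction the F statistic equals the square of the t statistic.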

2. The marketing manager of a company producing a new cereal aimed at children wants to examine

the effect of the colour and shape of the box’s logo on the approval rating of the cereal. He combined 4 colours

and 2 shapes to produce a total of 8 designs. Each logo was presented to 2 different groups (a total

of 16 groups of children) and the approval rating for each was recorded and is shown below.

Shape \ Color | Red | Green | Blue | Yellow |

Circle | 52, 44 | 67, 61 | 36, 44 | 45, 41 |

Square | 34, 36 | 56, 58 | 36, 31 | 21, 25 |

(a) Identify the factors in this experiment and state how many levels each factor has.

(b) If all combinations are compared, how many different treatments (cells) are there in the experiment? What is the response variable?

(c) Consider the following regression model:

$$Y = \beta_0 + \beta_1 C + \beta_2 R + \beta_3 G + \beta_4 B + \beta_5 CR + \beta_6 CG + \beta_7 CB + \varepsilon$$

where C = 1 if shape = circle; 0 otherwise. R = 1 if color = red; 0 otherwise. G = 1 if color

= green; 0 otherwise. B = 1 if color = blue; 0 otherwise.

Use the regression parameters to recover the cell means µij and fill in the following table:

Shape \ Colour | Red | Green | Blue | Yellow |

Circle | µ11 = β0 + β1 + β2 + β5 | | | |

Square | | | | |

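(d) Consider the factor effects model

$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$

where µ is a constant, αi are constants subject to the restriction Σi αi = 0, βj are constants subject to the restriction Σj βj

= 0, (αβ)ij are constants subject to the restrictions Σi Σj (αβ)ij = 0, and εijk are independent

N(0, σ²), i = 1, 2, ..., a; j = 1, 2, ..., b; k = 1, 2, ..., n.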

Why are the constraints Σ αi = Σ βj = Σ (αβ)ij = 0 required? What is the advantage of this

model?

(e) Refer to Part (d). Modify the factor effects model to apply to this study with a = 2 and b = 4.

(f) Set up the Y, X, and β matrices for the factor effects regression model.

(g) Refer to part (e). Obtain the fitted regression function.

(h) Plot the residuals against the fitted values and the QQ-plot of the residuals. Use these two

residual plots to check if the assumptions of two-way ANOVA are justifiable. Briefly explain.

(i) Plot an interaction plot. What does this plot suggest?

(j) Fill in the blanks in the following ANOVA table.

Source of Variation | SS | df | MS |

Between treatments | | | |

Factor A | | | |

Factor B | | | |

AB Interactions | | | |

Error | | | |

Total | | | |

(l) Is it meaningful here to test for main factor effects? If so, test if the main effects for color and

shape are present.

(m) All pairwise comparisons among the color level means, constructed via the Tukey procedure with a 95

percent family confidence coefficient, are shown below:

Color i | Color j | Difference | Lower 95% limit | Upper 95% limit |

Red | Green | -19.00 | -35.8696 | -2.1304 |

Red | Blue | 4.75 | -12.1196 | 21.6196 |

Red | Yellow | 8.50 | -8.3696 | 25.3696 |

Green | Blue | 23.75 | 6.8804 | 40.6196 |

Green | Yellow | 27.50 | 10.6304 | 44.3696 |

Blue | Yellow | 3.75 | -13.1196 | 20.6196 |

(n) Based on the above analysis, what combination of color and shape should be used for the logo

design?

(o) Suppose that in the shape population, 60 percent are circle, and 40 percent are square. Construct a 95 percent confidence interval for the mean overall rating in the shape population.
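For a balanced two-factor design like this one, the sums of squares in the ANOVA table can be computed directly from the cell, row and column means. The sketch below does this in plain Python using the ratings from the table at the start of this question. The dictionary `ratings` and the helper `mean` are hypothetical names chosen for illustration, and the code is meant as a way to check your working, not as a replacement for software output.

```python
# Approval ratings from the Question 2 table: ratings[shape][color] -> two groups.
ratings = {
    "Circle": {"Red": [52, 44], "Green": [67, 61], "Blue": [36, 44], "Yellow": [45, 41]},
    "Square": {"Red": [34, 36], "Green": [56, 58], "Blue": [36, 31], "Yellow": [21, 25]},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

shapes = list(ratings)            # factor A levels
colors = list(ratings["Circle"])  # factor B levels
a, b, n = len(shapes), len(colors), 2

all_obs = [y for s in shapes for c in colors for y in ratings[s][c]]
grand = mean(all_obs)
cell = {(s, c): mean(ratings[s][c]) for s in shapes for c in colors}
# In a balanced design the factor level means are averages of the cell means.
shape_mean = {s: mean(cell[s, c] for c in colors) for s in shapes}
color_mean = {c: mean(cell[s, c] for s in shapes) for c in colors}

ss_a = n * b * sum((shape_mean[s] - grand) ** 2 for s in shapes)
ss_b = n * a * sum((color_mean[c] - grand) ** 2 for c in colors)
ss_tr = n * sum((cell[s, c] - grand) ** 2 for s in shapes for c in colors)
ss_ab = ss_tr - ss_a - ss_b
ss_e = sum((y - cell[s, c]) ** 2 for s in shapes for c in colors for y in ratings[s][c])
ss_to = sum((y - grand) ** 2 for y in all_obs)

df = {"A": a - 1, "B": b - 1, "AB": (a - 1) * (b - 1), "Error": a * b * (n - 1)}

for name, ss in [("Treatments", ss_tr), ("A", ss_a), ("B", ss_b),
                 ("AB", ss_ab), ("Error", ss_e), ("Total", ss_to)]:
    print(f"{name:>10}: SS = {ss:.4f}")
```

A useful sanity check is that the treatment and error sums of squares add up to the total sum of squares, and that the degrees of freedom add up to the number of observations minus one.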

3. A person’s muscle mass is expected to decrease with age. To explore this relationship in women,

a nutritionist randomly selected 4 women from each 10-year age group, beginning with age 40 and

ending with age 79. X is age, and Y is a measure of muscle mass.

(a) Below is a scatter plot of the data with muscle mass on the y axis and age on the x axis.

Based on the plot, does it seem reasonable that there are two different (but connected) regression functions: one when age ≤ 60 and one when age > 60?

(b) The nutritionist conjectures that the regression of muscle mass on age follows a two-piece linear

relation, with the slope changing at age 60 without discontinuity. State the regression model

that applies if the nutritionist’s conjecture is correct.

(c) Refer to part (b). What are the respective response functions when age is 60 or less and when age

is over 60?

(d) Explain whether or not the model specified in part (b) violates the principle of marginality.

Also, discuss and show whether or not this model is continuous at X = 60. Is continuity or

marginality more important here and why?

(e) Estimate the regression model specified in part (b). Copy and paste the regression output into

your answer sheet. Write down the fitted regression equation.

(f) Test whether a two-piece linear regression function is needed at α = 0.05.

(g) Refer to part (e). What is the estimated regression function for muscle mass for women aged 60 or under?

For women aged over 60?

(h) Based on your estimated regression function, what is the predicted muscle mass when age =

50? When age = 70?

(i) Do you get the same prediction for age = 60 regardless of which estimated regression function

in part (e) you use?

(j) Modify the regression model in part (b) so that the slope changes at age 60 and the regression function is allowed to be discontinuous there.

(k) Specify the regression model for the case where the slope changes at age 40 and again at age

60 with no discontinuities.
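One common way to write a continuous two-piece linear model is to add a "hinge" term (X − 60) that is switched on only when X > 60. The sketch below illustrates that idea. It assumes simulated ages and muscle-mass values (not the assignment data), and the names `hinge_design`, `lower_piece` and `upper_piece` are hypothetical. Your course notes may use a different but equivalent parametrisation.

```python
import numpy as np

KNOT = 60.0  # age at which the slope is allowed to change

def hinge_design(age, knot=KNOT):
    """Columns: intercept, age, and (age - knot) switched on only above the knot."""
    age = np.asarray(age, dtype=float)
    hinge = np.where(age > knot, age - knot, 0.0)
    return np.column_stack([np.ones_like(age), age, hinge])

# Simulated data (an assumption made purely for illustration).
rng = np.random.default_rng(1)
age = rng.uniform(40, 79, 16)
muscle = 110 - 0.5 * age - 1.0 * np.clip(age - KNOT, 0, None) + rng.normal(0, 3, age.size)

b, *_ = np.linalg.lstsq(hinge_design(age), muscle, rcond=None)
b0, b1, b2 = b

def lower_piece(x):
    """Fitted line for age <= knot."""
    return b0 + b1 * x

def upper_piece(x):
    """Fitted line for age > knot: intercept b0 - b2*knot, slope b1 + b2."""
    return (b0 - b2 * KNOT) + (b1 + b2) * x

print("prediction at 50:", lower_piece(50.0))
print("prediction at 70:", upper_piece(70.0))
print("both pieces at 60:", lower_piece(KNOT), upper_piece(KNOT))
```

Because the hinge term is zero at the knot, the two fitted lines meet there by construction. Allowing a jump at the knot would need an additional indicator term for X > 60.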

4. Consider the general linear regression model Y = Xβ + ε where Y is n × 1, X is n × p and of rank

p, β is p × 1, ε is n × 1, and ε is N(0, σ²I).

(a) The hat matrix H is given by H = X(X’X)⁻¹X’. Show that (I − H) is idempotent where I is

the n x n identity matrix.

(b) Using the least squares method, we minimize RSS = e’e = (Y − Xb)’(Y − Xb) to obtain

b = (X’X)⁻¹X’Y. Show that RSS can also be written as Y’(I − H)Y.

(c) Obtain an expression for the variance-covariance matrix of the fitted values Ŷi, i = 1, 2, ..., n,

in terms of the hat matrix H.

(d) e = Y − Ŷ is the vector of residuals. Are the residuals statistically independent? Justify your

answer with an explanation.

(e) Show that e = ε − Hε. Suppose we denote by hij the (i, j) element of the hat matrix H.

Thus, ei can be written as $e_i = \varepsilon_i - \sum_{j=1}^{n} h_{ij}\varepsilon_j$. What does this equation for ei show?

Suppose that we partition X and β as

$$X = [\,X_1 \;\vdots\; X_2\,], \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}$$

where X1 is n × p1, X2 is n × p2, and p1 + p2 = p. β1 is p1 × 1, and β2 is p2 × 1.

(f) If the true model is Y = Xβ + ε, and we fit the model Y = X1β1 + u, have we underspecified

or overspecified the model?

(g) For the case in part (f), b1 = (X1’X1)⁻¹X1’Y. If the true model is Y = Xβ + ε, compute

E(b1).

(h) As a result of model misspecification in part (f), we could obtain an estimator of σ2 which is

larger than it should be. Does this affect inferences made about the model? Explain.
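The matrix identities in parts (a), (b) and (d) can be checked numerically before you attempt the algebra. The sketch below assumes a small simulated design matrix, and the names `M`, `rss_direct` and `rss_quadratic` are hypothetical. A numerical check of this kind is not a proof, but it helps catch algebra slips.

```python
import numpy as np

# Simulated full-rank design with an intercept column (an assumption).
rng = np.random.default_rng(2)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
M = np.eye(n) - H                      # I - H
b = np.linalg.solve(X.T @ X, X.T @ Y)  # least squares estimate
e = Y - X @ b                          # residual vector

# (a) I - H is idempotent: (I - H)(I - H) = I - H.
idempotent = np.allclose(M @ M, M)

# (b) RSS computed as e'e and as the quadratic form Y'(I - H)Y.
rss_direct = float(e @ e)
rss_quadratic = float(Y @ M @ Y)

# (d) Var(e) = sigma^2 (I - H); non-zero off-diagonal entries mean the
# residuals are correlated even when the errors are independent.
off_diagonal_max = float(np.abs(M - np.diag(np.diag(M))).max())

print(idempotent, rss_direct, rss_quadratic, off_diagonal_max)
print("trace of H:", np.trace(H))
```

The trace of H equals p, the number of columns of X, which is also the sum of the leverage values.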

5. Criminologists are interested in the effect of demographic characteristics and police expenditure on

crime rates. This has been studied using aggregate data on 47 states of the USA for 2016. The data

set contains the columns as described below:

Variable | Description |

M | percentage of males aged 14–24 in total state population |

So | indicator variable for a southern state |

Ed | mean years of schooling of the population aged 25 years or over |

Po1 | per capita expenditure on police protection in 2016 |

Po2 | per capita expenditure on police protection in 2015 |

LF | labour force participation rate of civilian urban males in the age-group 14–24 |

M.F | number of males per 100 females |

Pop | state population in 2016 in hundred thousands |

NW | percentage of nonwhites in the population |

U1 | unemployment rate of urban males 14–24 |

U2 | unemployment rate of urban males 35–39 |

Wealth | wealth: median value of transferable assets or family income |

Ineq | income inequality: percentage of families earning below half the median income |

Prob | probability of imprisonment: ratio of number of commitments to number of offenses |

Time | average time in months served by offenders in state prisons before their first release |

Crime | crime rate: number of offenses per 100,000 population in 2016 |

(a) Find the sample correlation between the crime rate in 2016 and police expenditure in

2015. Is the sign of the correlation what you expect? Explain.

(b) In the previous question, we saw that the sample correlation between crime rate in 2016 and

police expenditure in 2015 was positive. However, the model fitted below suggests that an

increase in police expenditure in 2015 decreases the crime rate in 2016. Is there a contradiction?

Explain.

$$\widehat{\text{Crime}} = 158.2646 + 256.1526\,\text{Po1} - 178.2880\,\text{Po2}$$

(c) Find the best (parsimonious) regression model for the given data. Do not forget to perform an

initial data analysis before applying the automatic search procedures such as forward selection,

backward elimination, and stepwise regression.

Use the best model found in part (c) to answer the following questions:

(d) What characteristics does a high leverage point have in general?

(e) Are there any high leverage points in the data used to fit the best model?

(f) What is the sum of all leverage values for the data used to fit the best model?

(g) Are there any outliers in the data for the best model?
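Leverage values are the diagonal elements of the hat matrix, so they depend only on the predictors and not on the response. The sketch below computes them for a simulated design with 47 rows and one deliberately extreme row. The data, the helper name `leverages`, and the 2p/n cutoff are all assumptions: 2p/n is a widely used rule of thumb, and your course notes may recommend a different threshold.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix, computed from a reduced QR decomposition."""
    Q, _ = np.linalg.qr(X)          # X = QR with orthonormal columns in Q
    return (Q ** 2).sum(axis=1)     # h_ii = squared row norms of Q

# Simulated predictors (an assumption, not the crime data set).
rng = np.random.default_rng(3)
n, k = 47, 3
Z = rng.normal(size=(n, k))
Z[0] = [6.0, -6.0, 6.0]             # planted row far from the centre of the predictors
X = np.column_stack([np.ones(n), Z])
p = X.shape[1]

h = leverages(X)
cutoff = 2 * p / n                  # rule-of-thumb threshold (assumption)
flagged = np.flatnonzero(h > cutoff)

print("sum of leverages:", h.sum())
print("cutoff:", cutoff)
print("flagged rows:", flagged)
```

The QR route avoids forming the full n × n hat matrix. The planted row is flagged because it lies far from the centre of the predictor values, which is the defining feature of a high leverage point.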