# UNIVERSITY OF NEW SOUTH WALES MATH3821

## MATH3821 Statistical Modelling and Computing

· 案例展示
SCHOOL OF MATHEMATICS AND STATISTICS
MATH3821 Statistical Modelling and Computing
Term Two 2020
Week 1 Tutorial Problems
1. For the Simple Linear Regression (SLR) model:
(a) Use the Least Squares (LS) or the Maximum Likelihood (ML) method to show
that
bβ1
= Y - βb2 × x:
(b) Show also that
βb2 =
1n
Pn i=1 xiYi - n1 Pn i=1 xi� n1 Pn i=1 Yi�
1n
Pn i=1 x2 i - n1 Pn i=1 xi�2 =
1n
Pn i=1(xi - x)(Yi - Y )
1n
Pn i=1(xi - x)2
where
x =
1 n
nX i
=1
xi (deterministic)
and
Y = 1
n
nX i
=1
Yi (random):
Consequently, one can write
βb2 = Sx_Y_
Sx_x_ :
2. For the multiple linear regression model Y = Xβ + �:
(a) Show that the least squares estimator is given by b = (XT X)-1XT y when the
design matrix X is full-rank i.e. (XT X) is invertible.
(b) Now assume that the errors �ijxi = xi are i.i.d. N(0; σ2). Show that b is an
unbiased estimator of β.
(c) Show that the covariance matrix is given by Var(b) = σ2(XT X)-1.
3. The Linear Probability Model (LPM) is the multiple linear regression model applied
to a binary response variable Yi 2 f0; 1g, that is,
Yi = xT i β + �i; i = 1; : : : ; n;
where �ijxi = xi are i.i.d. N(0; σ2). The LPM model is not appropriate for this data
generating process (DGP). Several assumptions are violated by the binary data. We
examine a few of these assumptions here.
1
(a) Compute the variance of Y jx = x and explain why the LPM violates the homoscedasticity (or non-constant variance) assumption of multiple linear regression.
(b) Explain why the LPM violates the assumption that the errors �ijxi = xi are
normally distributed.
(c) Another assumption that is violated is that the predicted probabilities are nonnonsensical. Consider the Paid Labor Force for Women example presented in
lectures:
LFP = β1 + β2K5 + β3K618 + β4AGE + β5WC + β6HC + β7LWG + β8INC + �

 Variable x β^i β^isx t InterceptK5K618AGEWCHCLWGINC - 1.144 - 9.000.24 -0.295 -0.154 -8.211.35 -0.011 -0.015 -0.8042.54 -0.013 -0.103 -5.020.28 0.164 - 3.570.39 0.019 - 0.451.10 0.123 0.072 4.0720.13 -0.007 -0.079 -4.30
Compute the predicted probability for a 35 year old woman to work if she has
four young kids (four aged less than five and none between six to eighteen) and if
neither she nor her husband attended college. Assume that she has average values
for all the other variables.
4. For the logistic regression model, show that the maximum likelihood estimator (MLE)
β^ is a value of β that is a solution to:
 nX yixi = nXxi 1 - e-xT i βi β ! i=1 i=1 1 + e-xT
;
and so numerical methods are required to obtain an estimate of β.
5. The Binary Regression Model (BRM) takes a non-linear form, i.e. P(Y = 1jx) =
F(xT β). To interpret the model we can consider the partial change in the probability,
also known as the marginal effects which is given by:
@P (Y = 1jx)
@xk :
Show that for the logit model the marginal effects is given by:
P (Y = 1jx) 1 - P (Y = 1jx) �βk:
2
6. (a) Show that the logit model can be written as a log-linear model,
log Ω(x) = xT β;
where
Ω(x) = P(Y = 1jx)
P(Y = 0jx) =
P(Y = 1jx)
1 - P(Y = 1jx)
is the odds of the event Y = 1 given x.
(b) Show that the Odds Ratio equals:
Ω(x; xk + δ)
Ω(x; xk) = eβkδ:
3
UNIVERSITY OF NEW SOUTH WALES
SCHOOL OF MATHEMATICS AND STATISTICS
MATH3821 Statistical Modelling and Computing
Term Two 2020
Week 2 Tutorial Problems
Tutorial Problems
1. Consider a logistic regression model for a binary response y. Let z = 1 - y, so that z
is one when y is zero and vice versa. What is the relationship between the estimated
coefficients in the model with y as the response, and the estimated coefficients in the
model with z as the response?
2. (a) Compute the Fisher information brought by a sample of n random variables i.i.d. of
law Bernoulli(p) on the parameter p.
(b) Compute the maximum likelihood estimator of p, the bias and variance of this
estimator. What can we conclude here?
3. For the special case of a binary response in binomial regression (ni = 1 for all i) show
that the formula for the deviance
2X
i �yi log yy^ii + (ni - yi) log n ni i - - y y^i i�
reduces to
-2X
i �y^i log 1 -y^i y^i + log(1 - y^i)�
Hint: use the convention y log y = (1 - y) log(1 - y) = 0 for a binary response y and
show that
Xi
yi log y^i
1 - y^i = Xy^i log 1 -y^i y^i
by using the equations to be solved to find the maximum likelihood estimator.
The remarkable feature of this deviance for a binary response is that it depends only
on the fitted values - any two models giving the same fitted values have the same
deviance, regardless of how well they explain the observed pattern of zeros and ones.
So the deviance cannot be used as a summary measure of lack of fit in the case of a
binary response.
4. Compute the null deviance in the logit model, as a function of the sample size n and
the number of observed successes n1 of the dependent variable Y .
1
5. Show that the maximized log-likelihood is, up to an additive constant:
log L(β^) =
nXi
=1
yi log P^(xi) +
nXi
=1
(ni - yi) log(1 - P^(xi))
and that the saturated deviance for the binomial regression model is (up to an additive
constant):
 log L(^ p) =nXyi log(yi=ni) + nX
i
=1
i
=1
(ni - yi) log((ni - yi)=ni):
The maximized log likelihood for the saturated model gives us a standard against which
to judge the value of the maximized likelihood for a simpler model. Deduce that the
deviance for the binomial regression model is the following measure of lack of fit:
D(β^) = -2 log L(β^)
L(^ p) = 2
nXi
=1 �yi log�yy^ii� + (ni - yi) log n ni i - - y y^i i�
where ^ yi = niP^(xi).
6. (a) By the definition of conditional probability
P (x) = P(Y = 1jx) = P(y = 1)f(xjy = 1)
f(x)
where f(xjy) is the density function of x given y. Let p = P(Y = 1) for the
unconditional probability of observing a success, then
P (x) = pf(xjy = 1)
f(x)
Show that
log�1 -P (Px()x)� = log�1 -p p� + log�f f( (x xjjy y = 1) = 0)�:
This says that the log odds is a constant plus a function of x determined by the
conditional densities f(xjy = 0) and f(xjy = 1). For common densities, it is
interesting to compute the second term on the right.
(b) If xjj ∼ N(µj; σj2); j = 0; 1 then
log�f f( (x xjjy y = 1) = 0)� = a1 + a2x + a3x2:
Hence including the predictors x and x2 seems appropriate in that case. Find the
expressions for a1, a2 and a3. Deduce that if the two conditional normal densities
have the same variance, inclusion of the term x2 is not required.
2
UNIVERSITY OF NEW SOUTH WALES
SCHOOL OF MATHEMATICS AND STATISTICS
MATH3821 Statistical Modelling and Computing
Term Two 2020
Week 3 Tutorial Problems
1. Write the normal distribution in exponential family form. What is the canonical link
function for the normal distribution? Check the help of the R software to see what is
coded (function family()).
2. You may come across slightly different definitions of the exponential family of distributions in various textbooks. In lectures we stated that an exponential family density
or probability function reduces to
f(y; θ; φ) = exp((yθ - c(θ))=φ + h(y; φ)):
This is the definition given in Venables and Ripley (1999), \Modern Applied Statistics with S-PLUS", Springer-Verlag, p. 211. Dobson (1990), \An Introduction to
Generalized Linear Models", Chapman and Hall, p. 27 give the definition
f(y; θ) = exp(a(y)b(θ) + c(θ) + d(y))
where it is noted that the functions a(·), b(·), c(·) and d(·) may depend on additional
parameters which are not of primary interest in the model (so-called nuisance parameters). In this last definition if a(y) = y then the distribution is said to be in canonical
form, and b(θ) is sometimes called the natural parameter of the distribution.
Using the last definition of an exponential family distribution, show that the following
distributions belong to the exponential family:
(a) Pareto distribution f(y; θ) = θy-θ-1
(b) Exponential distribution f(y; θ) = θ exp(-yθ)
(c) Negative binomial distribution f(y; θ) = y+r-r-1 1�θr(1 - θ)y where r is known
How would you use these distributions with the glm() function in R?
3. 2014 Final Exam Question
Consider a random variable Y belonging to the exponential family of the form
 f(y; θ) = exp(a(y)b(θ) + c(θ) + d(y))(a) Derive the expressions for E(a(Y )) and V ar(a(Y )). (1)
(b) Suppose the random variable X has the Gamma distribution with a scale parameter θ, which is the parameter of interest, and a known shape parameter φ. Show
that the distribution of X belongs to the exponential family of the form in (1).
State clearly the corresponding expressions of a(x), b(θ), c(θ) and d(x).
1
(c) Using the results from Part (a), find E(X) and V ar(X), for the Gamma random
variable X.
(d) Using the R function rgamma(), check your findings.
2
UNIVERSITY OF NEW SOUTH WALES
SCHOOL OF MATHEMATICS AND STATISTICS
MATH3821 Statistical Modelling and Computing
Term Two 2020
Week 4 Tutorial Problems
1. Let y1; :::; yn be a collection of responses and x1; :::; xn a set of corresponding predictor
values. If a cubic smoothing spline is fit to these data with the ith point (xi; yi) deleted
(to obtain the smooth f^λ-i(x)), and if (xi; yi) lies on this smooth, show that the smooth
based on all the points is the same, f^λ-i(x) = f^λ(x). (Hint: consider the optimization
of
nX j
=1
(yj - f(xj))2 + λ Z f00(x)2dx (1)
and show that if a function g produces a smaller value of (1) than f^λ-i, then g also
produces a smaller value of
X j6=i
(yj - f(xj))2 + λ Z f00(x)2dx
2. As you have seen in lectures, the cross-validation criterion
CV (λ) = 1
n
X i
yi - f^λ-i(xi)
1 - Sii(λ) !2 :
can be used to choose the smoothing parameter for a scatterplot smoother (here yi is
the ith response, xi is the ith predictor value, λ is the smoothing parameter, f^λ-i(xi)
is the value of the smooth at xi when we fit to all the data except the ith point, and
Sii(λ) is the ith diagonal element of the smoothing matrix S(λ) for which f^ = S(λ)y
and f^ = (f^(x1); :::; f^(xn))T is the vector of fitted values). Minimization of CV (λ)
with respect to λ is one way to obtain the smoothing parameter. In generalized crossvalidation we replace the term Sii(λ) in the expression above by the average of the
diagonal entries of S(λ), 1=n:trace(S(λ)) (here trace(A) denotes the trace of A):
GCV (λ) = 1
n
X i
yi - f^λ-i(xi)
1 - 1=n:trace(S(λ))!2 :
Note that we can rewrite GCV (λ) in the form
GCV (λ) = 1
n
X i
�1 - 11=n: -trace( Sii(λ)S(λ))�2 y1i --fS^λ-iii((λx)i)!2 :
1
Also, since f^ = S(λ)y, so that
f^(xi) = X
j
Sij(λ)yj
= Sii(λ)yi + X
j6=i
Sii(λ)yj
we can think of Sii(λ) as being a measure of the potential influence of yi on f^(xi) (Sii(λ)
is the weight on yi on determining f^(xi)). If we regard a simple linear regression model
as being a scatterplot smoother, f^ = Hy where H is the hat matrix H = X(XT X)-1XT
and so in this case Sii(λ) is the ith diagonal element of H, or the ith leverage value
hii.
By using this interpretation of Sii(λ) and the expression given above for GCV (λ),
explain why GCV (λ) gives less weight than CV (λ) to points that have high potential
influence on the smooth.
3. Derive the expressions for MSE(λ) and PSE(λ) given in lectures,
MSE(λ) = 1
n
nX i
=1
V ar(f^λi) + 1
n
nX i
=1
b2
λi
=
trace(SλSλT)
n
σ2 + bT λ bλ
n
PSE(λ) = �1 + trace(nSλSλT)� σ2 + bT λnbλ
where Sλ is the smoothing matrix and bλ is the bias vector bλ = E(y - f^λ) = f - Sλf
where y is the vector of the responses and f = (f(x1); :::; f(xn))T.
2
UNIVERSITY OF NEW SOUTH WALES
SCHOOL OF MATHEMATICS AND STATISTICS
MATH3821 Statistical Modelling and Computing
Term Two 2020
Week 5 Tutorial Problems
1. Locally weighted regression solves a separate weighted least squares problem at each
target point x0:
min
α(x0);β(x0)
nXi
=1
Kλ(x0; xi)[yi - α(x0) - β(x0)xi]2:
The estimate is then f^(x0) = ^ α(x0) + β^(x0)x0. Notice that altough we fit an entire
linear model to the data in the region, we only use it to evaluate the fit at the single
point x0. Let W(x0) be the n×n diagonal matrix with ith diagonal element Kλ(x0; xi)
and X be the design matrix:
X =
0B@
1 x1
...
...
1 xn
1CA
:
Then
f^(x0) = (1; x0)(XT W(x0)X)-1XT W(x0)y =
nXi
=1
li(x0)yi:
Show that Pn i=1(xi-x0)li(x0) = 0 for local linear regression. Define bj(x0) = Pn i=1(xi-
x0)jli(x0). Show that b0(x0) = 1 for local polynomial regression of any degree (including
local constants). Show that bj(x0) = 0 for all j 2 f1; 2; : : : ; kg for local polynomial
regression of degree k. What are the implications of this on the bias?
2. In your last lecture we began to discuss histogram estimators for a density function.
We stated the result that in a large sample, the approximate mean integrated squared
error (MISE) of the histogram density estimator was
1
nh +
h2R(f0)
12
where n is the sample size, h is the bin width and
R(f0) = Z-1 1 f0(x)2dx
where f(x) is the true density function. Use this result to show that the optimal choice
of bin width is
h∗ = �R(6f0)�1=3 n-1=3:
1
By computing R(f0) for a normal density, and then estimating the parameters in the
normal density from the observed data, show that a good choice of bin width if f is
Gaussian is
h∗ = 3:4931^ σn-1=3
where ^ σ is the sample standard deviation.
3. Let (x1; y1); :::; (xn; yn) be a random sample from the bivariate density function f(x; y)
of a random vector (X; Y ). Let K(x) be a kernel function (probability density function)
satisfying
Z uK(u)du = 0 Z u2K(u)du < 1
and let h1 and h2 be two positive real constants. As we saw in lectures, one form of a
bivariate kernel estimator of the density f(x; y) is given by
f^(x; y) = 1
nh1h2 X
i
K �x -h1xi � K �y -h2yi � :
Also, an estimate of the marginal density f(x) of X is given by
1
nh1 X
i
K �x -h1xi � :
The conditional density of Y jX = x is given by
f(yjx) = f(x; y)
f(x) :
By substituting the above kernel estimators of f(x; y) and f(x) into this expression
we obtain an estimate of the conditional density of Y jX = x. From this, obtain an
estimate of E(Y jX = x) and make a connection between this expression and kernel
scatterplot smoothing.
2
UNIVERSITY OF NEW SOUTH WALES
SCHOOL OF MATHEMATICS AND STATISTICS
MATH3821 Statistical Modelling and Computing
Term Two 2020
Week 7 Tutorial Problems
1. In this question we describe the method used for fitting a projection pursuit regression
model. Consider first the case of a single projection term. Then the model is
yi = f(αT xi) + �i
where xi = (; xi1; :::; xik)T , α = (α1; :::; αk)T is a unit vector and �i, i = 1; :::; n are zero
mean uncorrelated errors with common variance (we have subsumed the intercept into
the function f). The way that we estimate f and α here is to start with some guesses
for f and α, and then re-estimate f with α fixed at its current guess, then re-estimate
α with f fixed at its current value. This procedure is iterated until convergence.
Given an estimate of α, estimating f is easy since we just need to do a scatterplot
smooth of the responses yi on the predictors αT xi, i = 1; :::; n. We now describe the
way that an estimate of α is obtained for a given estimate of f.
A Taylor series expansion about the current value for α, αcur, gives
f(αT xi) ≈ f(αcur T xi) + f0(αcur T xi)(α - αcur)T xi:
Using this, show that
 nX nXf0(αcur T xi)2��αcur T xi + i=1 i=1 (yi - f(αT xi))2 ≈
yi f-0(fα(cur Tαcur Txi)xi)� - αT xi�2
and hence deduce that an updated estimate of α can be obtained by a weighted least
squares regression of the responses αcur T xi +(yi-f(αcur T xi))=f0(αcur T xi) on the predictors
xi with weights f0(αcur T xi)2 and no intercept term. Iterating this procedure for updating
α to convergence gives the method used for finding α in projection pursuit.
So far we have just dealt with a single projection term. Multiple projection terms
are fitted one at a time: first a one term model is fitted, then a second term is fitted
without adjusting the first term, then a third term without adjusting the second and so
on. Since we don’t adjust previous terms when adding a new term the above algorithm
for fitting a single term is all that is required in the algorithm (previous terms in
the sequence are fixed and can just be subtracted from the response to give a model
involving just a single projection term).
2. In deriving the conditional posterior distribution of β given σ2 in the Bayesian linear
model we used the following result.
1
Lemma:
Let Z be a q-dimensional random vector with density p(z) with
p(z) / exp �-12 zT Az - 2bT z��
where A is a fixed symmetric positive definite q ×q matrix and b is a fixed q ×1 vector.
Then Z ∼ N(A-1b; A-1).
Prove this result.
3. In deriving the marginal posterior distribution for σ2, p(σ2jy) in the Bayesian linear
model we used the following result.
Lemma:
Let z; b be q × 1 vectors, and A be a symmetric positive definite q × q matrix. Then
Z exp �-1 2 zT Az - 2bT z�� dz = (2π)q=2jAj-1=2 exp �12bT A-1b�
where jAj denotes the determinant of A.
Prove this result
4. We proved in lectures that in the Bayesian linear model with p(β; σ2) / σ-2
βjσ2; y ∼ N(b; σ2(XT X)-1)
where b = (XT X)-1XT y is the least squares estimator. Here as usual y is an n vector
of responses and X is an n × p design matrix. In this question we describe one way to
do the calculations required to obtain b and (XT X)-1 (another way using the so-called
QR decomposition of X is discussed in lectures). The Cholesky factorization of XT X
is
XT X = RT R
where R is a p × p upper triangular matrix (see the next question for a description of
how to find the Cholesky factorization). By an upper triangular matrix we mean one
where Rij = 0 if i > j. The reason why this decomposition is useful is because solving
a set of linear equations where the coefficient matrix is upper or lower trinagular is
very easy. We discuss this now and apply the idea to least squares calculations.
2
(a) Suggest an efficient way of solving the linear system Rx = z for x where R is a
known upper triangular p×p matrix and z is a known p×1 vector. This will give
x = R-1z. (Hint: consider the last equation in the system Rx = z, which since R
is upper triangular is Rnnxn = zn. This gives xn. Then consider the second last
equation and so on).
(b) If R is upper triangular, then RT = L is lower triangular (by lower triangular
we mean Lij = 0 if i < j). Suggest an efficient way of solving the linear system
Lx = z for x. This will give x = L-1z = R-1Tz.
(c) Since b = (XTX)-1XTy, we have b = (RTR)-1XTy = R-1R-1TXTy where R
is the Cholesky factor of XTX. Using the results of parts i) and ii) suggest an
efficient way to find b.
(d) Since RR-1 = I where I is the p×p identity matrix, the jth column of R-1, x say,
satisfies Rx = ej where ej is the jth column of the identity matrix (a column of
zeros with a 1 in the jth position). Hence suggest how we can efficiently compute
R-1 and hence
(XTX)-1 = (RTR)-1 = R-1R-1T
5. Let A be a 2 × 2 matrix (symmetric, positive definite). Write A = RTR where R is an
upper triangular matrix,
R = � a b 0 c �
After writing out an expression for the elements of A in terms of those in R, suggest
a way of finding the elements of R. What conditions are required on A? This line of
argument generalizes to a matrix A of any size.
6. Most statistical packages have a routine for generating independent standard normal
random variables. Recall that a χ2 ν random variable is formed as the sum of ν independent squared standard normal random variables - hence there is a simple direct way of
generating from the χ2 ν density when the degrees of freedom ν is an integer. Recalling
that Z has a χ2 ν density if its density has the form
pZ(z) = 2-ν=2
Γ ν2�zν=2-1 exp �-z2�
and from lectures that W has the Inverse-χ2(ν; s2) density if its density has the form
fW (w) =
�νs 22 �
ν2
Γ ν2� w-(ν2 +1) exp �-νs 2w2 �
show that if Z ∼ χ2 ν then W = (νs2)=Z ∼Inverse-χ2(ν; s2).
3
UNIVERSITY OF NEW SOUTH WALES
SCHOOL OF MATHEMATICS AND STATISTICS
MATH3821 Statistical Modelling and Computing
Term Two 2020
Week 8 Tutorial Problems
1. Using the inverse transform method, write an R function to generate a random variable
with the distribution function
F(x) = x2 + x
2 ; 0 ≤ x ≤ 1:
2. Give a method for generating a random variable having density function
f(x) = � exp(2 exp(-x2)x) 0-1 ≤ x < < x < 1.0
Write an R function to implement your method.
3. Suppose that it is easy to generate a random variable from any of the distributions Fi,
i = 1; :::; n. How can we generate from the following distributions?
(a)
F(x) =
nY i
=1
Fi(x)
(b)
F(x) = 1 -
nY i
=1
(1 - Fi(x))
(Hint: if Xi are independent random variables, with X having distribution Fi, which
random variable has distribution function F?)
4. Show that the distribution function described in part a), is
F(x) = xn 0 ≤ x ≤ 1:
if Fi is the distribution function of a standard uniform, hence write an R program
simulating from this distribution function using the results from Question 3.
1
UNIVERSITY OF NEW SOUTH WALES
SCHOOL OF MATHEMATICS AND STATISTICS
MATH3821 Statistical Modelling and Computing
Term Two 2020
Week 9 Tutorial Problems
1. Suppose that Y has a negative binomial distribution with parameters p and r (where
r is known) with probability mass function,
P(Y = y) = �y + yr - 1�(1 - p)ypr
where the exponential family takes the form,
P(Y = y) = exp�yθa-(φb)(θ) + c(y; φ)�:
(a) Show that Y belongs to the exponential family.
(b) Compute the mean and variance of Y using E[Y ] = b0(θ) and Var(Y ) = a(φ)b00(θ).
(c) What is the canonical link function g(·)?
2. Let
yi = f(xi) + "i for i = 1; : : : ; n
denote a model in which the response variable y depends on a single predictor variables
x: Assume that the xi are distinct and ordered, so x1 < x2 < ::: < xn: Let a and b
denote real numbers such that
a = x0 < x1 < x2 < ::: < xn < xn+1 = b:
(a) For an arbitrary point x0 within the range of the x-values, give an expression for
the pth order regression spline estimator fb(x0) of f(x0): Clearly explaining the
notations used.
(b) List 2 defining properties of natural cubic splines.
(c) Next consider response variables y which depend on k predictor variables x =
(x1; x2; :::xk) as follows
yi = f(x1i; x2i; :::xki) + "i
for i = 1; :::; n. Describe the additive model approach and the projection pursuit
regression approach to finding an estimate for the response variable in this model.
Describe the relative strengths and weaknesses of the two approach.
1
3. The inverse transform method allows us to simulate random samples for any random
variable X, having distribution function F (x), provided that we can obtain uniform
random samples u ∼ U(0; 1) .
(a) Describe the inverse transform method that simulates a single sample x such that
x ∼ F (x).
(b) Prove that the procedure described in a) produces a sample from F .
(c) Let X be a random variable which can take the values 1; 2; :::; n. Write pj =
P r(X = j) for its probability function. Write an R function which takes as input
a vector p of length n with the jth element of p equal to pj and simulates a
realization of X using the inversion method. Your function should check that
input vector takes only positive values, or return an error message. It should also
make sure that the vector p it sums to one.
4. Let X1; : : : ; Xn ∼ Poisson(λ).
(a) Let λ ∼ Gamma(α; β) be the prior. Show that the posterior is also a Gamma.
Find the posterior mean.
(b) Find the Jeffery’s prior. Find the posterior.
2

×