SCHOOL OF MATHEMATICS AND STATISTICS

MATH3821 Statistical Modelling and Computing

Term Two 2020

Week 1 Tutorial Problems

1. For the Simple Linear Regression (SLR) model:

(a) Use the Least Squares (LS) or the Maximum Likelihood (ML) method to show that
$$\hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \, \bar{x}.$$

(b) Show also that
$$\hat{\beta}_2 = \frac{\frac{1}{n}\sum_{i=1}^n x_i Y_i - \left(\frac{1}{n}\sum_{i=1}^n x_i\right)\left(\frac{1}{n}\sum_{i=1}^n Y_i\right)}{\frac{1}{n}\sum_{i=1}^n x_i^2 - \left(\frac{1}{n}\sum_{i=1}^n x_i\right)^2} = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}$$
where
$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \ \text{(deterministic)} \qquad \text{and} \qquad \bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i \ \text{(random)}.$$
Consequently, one can write
$$\hat{\beta}_2 = \frac{S_{xY}}{S_{xx}}.$$

2. For the multiple linear regression model $Y = X\beta + \epsilon$:

(a) Show that the least squares estimator is given by $b = (X^T X)^{-1} X^T y$ when the design matrix $X$ is of full rank, i.e. $(X^T X)$ is invertible.

(b) Now assume that the errors $\epsilon_i \mid x_i$ are i.i.d. $N(0, \sigma^2)$. Show that $b$ is an unbiased estimator of $\beta$.

(c) Show that the covariance matrix is given by $\operatorname{Var}(b) = \sigma^2 (X^T X)^{-1}$.
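The closed form in (a) can be checked numerically. The sheet works in R; the following is an equivalent numpy sketch on synthetic (made-up) data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # full-rank design
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)

# Least squares estimator b = (X^T X)^{-1} X^T y via the normal equations
b = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with numpy's generic least squares solver
b_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(b, b_lstsq))  # True
```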

3. The Linear Probability Model (LPM) is the multiple linear regression model applied to a binary response variable $Y_i \in \{0, 1\}$, that is,
$$Y_i = x_i^T \beta + \epsilon_i, \quad i = 1, \ldots, n,$$
where the $\epsilon_i \mid x_i$ are i.i.d. $N(0, \sigma^2)$. The LPM is not appropriate for this data generating process (DGP): several of its assumptions are violated by binary data. We examine a few of these assumptions here.


(a) Compute the variance of $Y \mid x$ and explain why the LPM violates the homoscedasticity (constant variance) assumption of multiple linear regression.

(b) Explain why the LPM violates the assumption that the errors $\epsilon_i \mid x_i$ are normally distributed.

(c) Another problem is that the predicted probabilities can be nonsensical, i.e. fall outside $[0, 1]$. Consider the Paid Labor Force for Women example presented in lectures:
$$LFP = \beta_1 + \beta_2 K5 + \beta_3 K618 + \beta_4 AGE + \beta_5 WC + \beta_6 HC + \beta_7 LWG + \beta_8 INC + \epsilon$$


Variable  |   x̄   |   β̂    |  β̂·s_x |    t
----------|-------|--------|--------|------
Intercept |   -   |  1.144 |    -   |  9.00
K5        |  0.24 | -0.295 | -0.154 | -8.21
K618      |  1.35 | -0.011 | -0.015 | -0.80
AGE       | 42.54 | -0.013 | -0.103 | -5.02
WC        |  0.28 |  0.164 |    -   |  3.57
HC        |  0.39 |  0.019 |    -   |  0.45
LWG       |  1.10 |  0.123 |  0.072 |  4.07
INC       | 20.13 | -0.007 | -0.079 | -4.30

Use the table to compute the predicted value of LFP for a woman with four young kids (four aged less than five and none between six and eighteen) and for whom neither she nor her husband attended college. Assume that she has average values for all the other variables.
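Using the coefficients and means in the table above, the fitted value can be computed directly; a sketch (assuming K618 = 0, WC = HC = 0, and the remaining predictors at their listed means):

```python
# Coefficients and covariate values taken from the table above
coef = {"Intercept": 1.144, "K5": -0.295, "K618": -0.011, "AGE": -0.013,
        "WC": 0.164, "HC": 0.019, "LWG": 0.123, "INC": -0.007}
x = {"Intercept": 1.0, "K5": 4.0, "K618": 0.0, "AGE": 42.54,
     "WC": 0.0, "HC": 0.0, "LWG": 1.10, "INC": 20.13}

pred = sum(coef[k] * x[k] for k in coef)
print(round(pred, 3))  # -0.595: a negative "probability"
```

The fitted value is well below zero, illustrating the nonsensical-prediction problem.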

4. For the logistic regression model, show that the maximum likelihood estimator (MLE) $\hat{\beta}$ is a value of $\beta$ that is a solution to
$$\sum_{i=1}^n y_i x_i = \sum_{i=1}^n x_i \left( 1 - \frac{e^{-x_i^T \beta}}{1 + e^{-x_i^T \beta}} \right)$$
and so numerical methods are required to obtain an estimate of $\beta$.
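One standard numerical method is Newton-Raphson. A numpy sketch on synthetic data (the right-hand side of the score equation is the fitted probability $p_i = 1/(1 + e^{-x_i^T\beta})$, so at the MLE $\sum_i y_i x_i = \sum_i x_i p_i$ should hold):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# Newton-Raphson iterations for the logistic log-likelihood
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (y - p)                       # gradient
    W = p * (1.0 - p)                           # IRLS weights
    beta += np.linalg.solve(X.T @ (X * W[:, None]), score)

# At the MLE the score equations hold: sum y_i x_i = sum x_i p_i
p = 1.0 / (1.0 + np.exp(-X @ beta))
print(np.allclose(X.T @ y, X.T @ p, atol=1e-6))  # True
```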

5. The Binary Regression Model (BRM) takes a non-linear form, i.e. $P(Y = 1 \mid x) = F(x^T \beta)$. To interpret the model we can consider the partial change in the probability, also known as the marginal effect, which is given by
$$\frac{\partial P(Y = 1 \mid x)}{\partial x_k}.$$
Show that for the logit model the marginal effect is given by
$$P(Y = 1 \mid x) \left( 1 - P(Y = 1 \mid x) \right) \beta_k.$$


6. (a) Show that the logit model can be written as a log-linear model,
$$\log \Omega(x) = x^T \beta,$$
where
$$\Omega(x) = \frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)} = \frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)}$$
is the odds of the event $Y = 1$ given $x$.

(b) Show that the Odds Ratio equals
$$\frac{\Omega(x, x_k + \delta)}{\Omega(x, x_k)} = e^{\beta_k \delta}.$$


UNIVERSITY OF NEW SOUTH WALES

SCHOOL OF MATHEMATICS AND STATISTICS

MATH3821 Statistical Modelling and Computing

Term Two 2020

Week 2 Tutorial Problems

Tutorial Problems

1. Consider a logistic regression model for a binary response y. Let z = 1 - y, so that z

is one when y is zero and vice versa. What is the relationship between the estimated

coefficients in the model with y as the response, and the estimated coefficients in the

model with z as the response?

2. (a) Compute the Fisher information about the parameter p brought by a sample of n i.i.d. Bernoulli(p) random variables.

(b) Compute the maximum likelihood estimator of p, and the bias and variance of this estimator. What can we conclude here?

3. For the special case of a binary response in binomial regression ($n_i = 1$ for all $i$) show that the formula for the deviance
$$2 \sum_i \left[ y_i \log \frac{y_i}{\hat{y}_i} + (n_i - y_i) \log \frac{n_i - y_i}{n_i - \hat{y}_i} \right]$$
reduces to
$$-2 \sum_i \left[ \hat{y}_i \log \frac{\hat{y}_i}{1 - \hat{y}_i} + \log(1 - \hat{y}_i) \right].$$
Hint: use the convention $y \log y = (1 - y) \log(1 - y) = 0$ for a binary response $y$ and show that
$$\sum_i y_i \log \frac{\hat{y}_i}{1 - \hat{y}_i} = \sum_i \hat{y}_i \log \frac{\hat{y}_i}{1 - \hat{y}_i}$$
by using the equations to be solved to find the maximum likelihood estimator.

The remarkable feature of this deviance for a binary response is that it depends only

on the fitted values - any two models giving the same fitted values have the same

deviance, regardless of how well they explain the observed pattern of zeros and ones.

So the deviance cannot be used as a summary measure of lack of fit in the case of a

binary response.

4. Compute the null deviance in the logit model, as a function of the sample size n and

the number of observed successes n1 of the dependent variable Y .


5. Show that the maximized log-likelihood is, up to an additive constant,
$$\log L(\hat{\beta}) = \sum_{i=1}^n y_i \log \hat{P}(x_i) + \sum_{i=1}^n (n_i - y_i) \log(1 - \hat{P}(x_i))$$
and that the maximized log-likelihood for the saturated binomial regression model is (up to an additive constant)
$$\log L(\hat{p}) = \sum_{i=1}^n y_i \log(y_i / n_i) + \sum_{i=1}^n (n_i - y_i) \log((n_i - y_i) / n_i).$$
The maximized log-likelihood for the saturated model gives us a standard against which to judge the value of the maximized likelihood for a simpler model. Deduce that the deviance for the binomial regression model is the following measure of lack of fit:
$$D(\hat{\beta}) = -2 \log \frac{L(\hat{\beta})}{L(\hat{p})} = 2 \sum_{i=1}^n \left[ y_i \log \frac{y_i}{\hat{y}_i} + (n_i - y_i) \log \frac{n_i - y_i}{n_i - \hat{y}_i} \right]$$
where $\hat{y}_i = n_i \hat{P}(x_i)$.

6. (a) By the definition of conditional probability,
$$P(x) = P(Y = 1 \mid x) = \frac{P(y = 1) f(x \mid y = 1)}{f(x)}$$
where $f(x \mid y)$ is the density function of $x$ given $y$. Let $p = P(Y = 1)$ denote the unconditional probability of observing a success; then
$$P(x) = \frac{p f(x \mid y = 1)}{f(x)}.$$
Show that
$$\log \frac{P(x)}{1 - P(x)} = \log \frac{p}{1 - p} + \log \frac{f(x \mid y = 1)}{f(x \mid y = 0)}.$$
This says that the log odds is a constant plus a function of $x$ determined by the conditional densities $f(x \mid y = 0)$ and $f(x \mid y = 1)$. For common densities, it is interesting to compute the second term on the right.

(b) If $x \mid y = j \sim N(\mu_j, \sigma_j^2)$, $j = 0, 1$, then
$$\log \frac{f(x \mid y = 1)}{f(x \mid y = 0)} = a_1 + a_2 x + a_3 x^2.$$
Hence including the predictors $x$ and $x^2$ seems appropriate in that case. Find the expressions for $a_1$, $a_2$ and $a_3$. Deduce that if the two conditional normal densities have the same variance, inclusion of the term $x^2$ is not required.
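The claim in (b) is easy to check numerically: evaluate the log density ratio on a grid and fit a quadratic. A numpy sketch (illustrative parameter values chosen here, not from the sheet):

```python
import numpy as np

def log_ratio(x, mu0, s0, mu1, s1):
    # log f(x | y=1) - log f(x | y=0) for two normal densities
    return (np.log(s0 / s1)
            - (x - mu1) ** 2 / (2 * s1 ** 2)
            + (x - mu0) ** 2 / (2 * s0 ** 2))

x = np.linspace(-3.0, 3.0, 7)

# Unequal variances: the log ratio is genuinely quadratic in x
a3_uneq = np.polyfit(x, log_ratio(x, 0.0, 1.0, 1.0, 2.0), 2)[0]

# Equal variances: the x^2 coefficient vanishes and the log ratio is linear
a3_eq = np.polyfit(x, log_ratio(x, 0.0, 1.0, 1.0, 1.0), 2)[0]

print(abs(a3_uneq) > 1e-6, abs(a3_eq) < 1e-8)  # True True
```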


UNIVERSITY OF NEW SOUTH WALES

SCHOOL OF MATHEMATICS AND STATISTICS

MATH3821 Statistical Modelling and Computing

Term Two 2020

Week 3 Tutorial Problems

1. Write the normal distribution in exponential family form. What is the canonical link function for the normal distribution? Check the R help for the family() function to see what is coded there.

2. You may come across slightly different definitions of the exponential family of distributions in various textbooks. In lectures we stated that an exponential family density or probability function reduces to
$$f(y; \theta, \phi) = \exp\left( (y\theta - c(\theta))/\phi + h(y, \phi) \right).$$
This is the definition given in Venables and Ripley (1999), "Modern Applied Statistics with S-PLUS", Springer-Verlag, p. 211. Dobson (1990), "An Introduction to Generalized Linear Models", Chapman and Hall, p. 27 gives the definition
$$f(y; \theta) = \exp\left( a(y) b(\theta) + c(\theta) + d(y) \right)$$
where it is noted that the functions $a(\cdot)$, $b(\cdot)$, $c(\cdot)$ and $d(\cdot)$ may depend on additional parameters which are not of primary interest in the model (so-called nuisance parameters). In this last definition, if $a(y) = y$ then the distribution is said to be in canonical form, and $b(\theta)$ is sometimes called the natural parameter of the distribution.
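To see how the two definitions line up, here is a worked illustration using the Poisson distribution (chosen only as an example; it is not one of the distributions asked about below):

```latex
f(y;\theta) = \frac{\theta^y e^{-\theta}}{y!}
            = \exp\bigl( y \log\theta - \theta - \log y! \bigr)
```

so in Dobson's form $a(y) = y$, $b(\theta) = \log\theta$, $c(\theta) = -\theta$ and $d(y) = -\log y!$ (canonical form, with natural parameter $\log\theta$); setting $\phi = 1$ and $\theta^* = \log\theta$ recovers the Venables-Ripley form with $c(\theta^*) = e^{\theta^*}$ and $h(y, \phi) = -\log y!$.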

Using the last definition of an exponential family distribution, show that the following distributions belong to the exponential family:

(a) Pareto distribution $f(y; \theta) = \theta y^{-\theta - 1}$

(b) Exponential distribution $f(y; \theta) = \theta \exp(-y\theta)$

(c) Negative binomial distribution $f(y; \theta) = \binom{y + r - 1}{r - 1} \theta^r (1 - \theta)^y$ where $r$ is known

How would you use these distributions with the glm() function in R?

3. 2014 Final Exam Question

Consider a random variable Y belonging to the exponential family of the form
$$f(y; \theta) = \exp\left( a(y) b(\theta) + c(\theta) + d(y) \right). \qquad (1)$$

(a) Derive the expressions for $E(a(Y))$ and $\operatorname{Var}(a(Y))$.

(b) Let X be a Gamma random variable. Show that the distribution of X belongs to the exponential family of the form in (1). State clearly the corresponding expressions of $a(x)$, $b(\theta)$, $c(\theta)$ and $d(x)$.


(c) Using the results from Part (a), find E(X) and V ar(X), for the Gamma random

variable X.

(d) Using the R function rgamma(), check your findings.
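Part (d) asks for R's rgamma(); numpy's gamma generator is the direct analogue, and simulated moments can be compared with the theoretical mean and variance (a sketch with arbitrary shape and scale values):

```python
import numpy as np

rng = np.random.default_rng(42)
shape, scale = 3.0, 2.0          # E[X] = shape*scale, Var(X) = shape*scale^2
x = rng.gamma(shape, scale, size=200_000)

print(x.mean())  # close to 6.0
print(x.var())   # close to 12.0
```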


UNIVERSITY OF NEW SOUTH WALES

SCHOOL OF MATHEMATICS AND STATISTICS

MATH3821 Statistical Modelling and Computing

Term Two 2020

Week 4 Tutorial Problems

1. Let $y_1, \ldots, y_n$ be a collection of responses and $x_1, \ldots, x_n$ a set of corresponding predictor values. If a cubic smoothing spline is fit to these data with the $i$th point $(x_i, y_i)$ deleted (to obtain the smooth $\hat{f}_\lambda^{-i}(x)$), and if $(x_i, y_i)$ lies on this smooth, show that the smooth based on all the points is the same, $\hat{f}_\lambda^{-i}(x) = \hat{f}_\lambda(x)$. (Hint: consider the optimization of
$$\sum_{j=1}^n (y_j - f(x_j))^2 + \lambda \int f''(x)^2 \, dx \qquad (1)$$
and show that if a function $g$ produces a smaller value of (1) than $\hat{f}_\lambda^{-i}$, then $g$ also produces a smaller value of
$$\sum_{j \ne i} (y_j - f(x_j))^2 + \lambda \int f''(x)^2 \, dx,$$
contradicting the definition of $\hat{f}_\lambda^{-i}$.)

2. As you have seen in lectures, the cross-validation criterion
$$CV(\lambda) = \frac{1}{n} \sum_i \left( \frac{y_i - \hat{f}_\lambda^{-i}(x_i)}{1 - S_{ii}(\lambda)} \right)^2$$
can be used to choose the smoothing parameter for a scatterplot smoother (here $y_i$ is the $i$th response, $x_i$ is the $i$th predictor value, $\lambda$ is the smoothing parameter, $\hat{f}_\lambda^{-i}(x_i)$ is the value of the smooth at $x_i$ when we fit to all the data except the $i$th point, and $S_{ii}(\lambda)$ is the $i$th diagonal element of the smoothing matrix $S(\lambda)$ for which $\hat{f} = S(\lambda) y$, where $\hat{f} = (\hat{f}(x_1), \ldots, \hat{f}(x_n))^T$ is the vector of fitted values). Minimization of $CV(\lambda)$ with respect to $\lambda$ is one way to obtain the smoothing parameter. In generalized cross-validation we replace the term $S_{ii}(\lambda)$ in the expression above by the average of the diagonal entries of $S(\lambda)$, $\frac{1}{n} \operatorname{trace}(S(\lambda))$ (here $\operatorname{trace}(A)$ denotes the trace of $A$):
$$GCV(\lambda) = \frac{1}{n} \sum_i \left( \frac{y_i - \hat{f}_\lambda^{-i}(x_i)}{1 - \frac{1}{n} \operatorname{trace}(S(\lambda))} \right)^2.$$
Note that we can rewrite $GCV(\lambda)$ in the form
$$GCV(\lambda) = \frac{1}{n} \sum_i \left( \frac{1 - S_{ii}(\lambda)}{1 - \frac{1}{n} \operatorname{trace}(S(\lambda))} \right)^2 \left( \frac{y_i - \hat{f}_\lambda^{-i}(x_i)}{1 - S_{ii}(\lambda)} \right)^2.$$


Also, since $\hat{f} = S(\lambda) y$, we have
$$\hat{f}(x_i) = \sum_j S_{ij}(\lambda) y_j = S_{ii}(\lambda) y_i + \sum_{j \ne i} S_{ij}(\lambda) y_j,$$
so we can think of $S_{ii}(\lambda)$ as a measure of the potential influence of $y_i$ on $\hat{f}(x_i)$ ($S_{ii}(\lambda)$ is the weight on $y_i$ in determining $\hat{f}(x_i)$). If we regard a simple linear regression model as a scatterplot smoother, $\hat{f} = Hy$ where $H$ is the hat matrix $H = X(X^T X)^{-1} X^T$, and so in this case $S_{ii}(\lambda)$ is the $i$th diagonal element of $H$, or the $i$th leverage value $h_{ii}$.

By using this interpretation of Sii(λ) and the expression given above for GCV (λ),

explain why GCV (λ) gives less weight than CV (λ) to points that have high potential

influence on the smooth.
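To make the weighting argument concrete, take the simple-linear-regression smoother suggested above and compute the factor $((1 - S_{ii}(\lambda))/(1 - \frac{1}{n}\operatorname{trace} S(\lambda)))^2$ for a design containing one high-leverage point (assumed data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])   # last point has high leverage
X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.solve(X.T @ X, X.T)       # hat matrix; plays the role of S(lambda)
h = np.diag(H)

# The factor by which GCV rescales each squared CV term
w = ((1 - h) / (1 - H.trace() / len(x))) ** 2
print(h.argmax() == w.argmin())  # high-leverage point gets the least weight: True
```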

3. Derive the expressions for $MSE(\lambda)$ and $PSE(\lambda)$ given in lectures,
$$MSE(\lambda) = \frac{1}{n} \sum_{i=1}^n \operatorname{Var}(\hat{f}_{\lambda i}) + \frac{1}{n} \sum_{i=1}^n b_{\lambda i}^2 = \frac{\operatorname{trace}(S_\lambda S_\lambda^T)}{n} \sigma^2 + \frac{b_\lambda^T b_\lambda}{n}$$
$$PSE(\lambda) = \left( 1 + \frac{\operatorname{trace}(S_\lambda S_\lambda^T)}{n} \right) \sigma^2 + \frac{b_\lambda^T b_\lambda}{n}$$
where $S_\lambda$ is the smoothing matrix and $b_\lambda$ is the bias vector $b_\lambda = E(y - \hat{f}_\lambda) = f - S_\lambda f$, where $y$ is the vector of responses and $f = (f(x_1), \ldots, f(x_n))^T$.


UNIVERSITY OF NEW SOUTH WALES

SCHOOL OF MATHEMATICS AND STATISTICS

MATH3821 Statistical Modelling and Computing

Term Two 2020

Week 5 Tutorial Problems

1. Locally weighted regression solves a separate weighted least squares problem at each target point $x_0$:
$$\min_{\alpha(x_0), \beta(x_0)} \sum_{i=1}^n K_\lambda(x_0, x_i) \left[ y_i - \alpha(x_0) - \beta(x_0) x_i \right]^2.$$
The estimate is then $\hat{f}(x_0) = \hat{\alpha}(x_0) + \hat{\beta}(x_0) x_0$. Notice that although we fit an entire linear model to the data in the region, we only use it to evaluate the fit at the single point $x_0$. Let $W(x_0)$ be the $n \times n$ diagonal matrix with $i$th diagonal element $K_\lambda(x_0, x_i)$ and let $X$ be the design matrix
$$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}.$$
Then
$$\hat{f}(x_0) = (1, x_0)(X^T W(x_0) X)^{-1} X^T W(x_0) y = \sum_{i=1}^n l_i(x_0) y_i.$$
Show that $\sum_{i=1}^n (x_i - x_0) l_i(x_0) = 0$ for local linear regression. Define $b_j(x_0) = \sum_{i=1}^n (x_i - x_0)^j l_i(x_0)$. Show that $b_0(x_0) = 1$ for local polynomial regression of any degree (including local constants). Show that $b_j(x_0) = 0$ for all $j \in \{1, 2, \ldots, k\}$ for local polynomial regression of degree $k$. What are the implications of this for the bias?
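The moment identities $b_0(x_0) = 1$ and $b_1(x_0) = 0$ can be verified numerically for the local linear case. A numpy sketch, assuming a Gaussian kernel for $K_\lambda$ and an arbitrary grid of $x$ values:

```python
import numpy as np

def local_linear_weights(x0, x, lam=1.0):
    # Equivalent-kernel weights l_i(x0) for local linear regression
    K = np.exp(-0.5 * ((x - x0) / lam) ** 2)     # Gaussian kernel weights
    X = np.column_stack([np.ones_like(x), x])
    W = np.diag(K)
    return np.array([1.0, x0]) @ np.linalg.solve(X.T @ W @ X, X.T @ W)

x = np.linspace(0.0, 5.0, 30)
l = local_linear_weights(2.3, x)
print(np.isclose(l.sum(), 1.0))                 # b_0(x0) = 1: True
print(np.isclose(((x - 2.3) * l).sum(), 0.0))   # b_1(x0) = 0: True
```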

2. In your last lecture we began to discuss histogram estimators for a density function. We stated the result that in a large sample, the approximate mean integrated squared error (MISE) of the histogram density estimator is
$$\frac{1}{nh} + \frac{h^2 R(f')}{12}$$
where $n$ is the sample size, $h$ is the bin width and
$$R(f') = \int_{-\infty}^{\infty} f'(x)^2 \, dx,$$
where $f(x)$ is the true density function. Use this result to show that the optimal choice of bin width is
$$h^* = \left( \frac{6}{R(f')} \right)^{1/3} n^{-1/3}.$$
By computing $R(f')$ for a normal density, and then estimating the parameters in the normal density from the observed data, show that a good choice of bin width if $f$ is Gaussian is
$$h^* = 3.49 \, \hat{\sigma} \, n^{-1/3}$$
where $\hat{\sigma}$ is the sample standard deviation.
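A quick arithmetic check of the constant: for $f = N(\mu, \sigma^2)$ one finds $R(f') = 1/(4\sqrt{\pi}\,\sigma^3)$, so the multiplier of $\hat{\sigma} n^{-1/3}$ is $(24\sqrt{\pi})^{1/3}$:

```python
import math

# h* = (6 / R(f'))^{1/3} n^{-1/3} with R(f') = 1 / (4 sqrt(pi) sigma^3)
# gives the constant (24 sqrt(pi))^{1/3} multiplying sigma * n^{-1/3}
c = (24 * math.sqrt(math.pi)) ** (1 / 3)
print(round(c, 2))  # 3.49
```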

3. Let $(x_1, y_1), \ldots, (x_n, y_n)$ be a random sample from the bivariate density function $f(x, y)$ of a random vector $(X, Y)$. Let $K(x)$ be a kernel function (probability density function) satisfying
$$\int u K(u) \, du = 0, \qquad \int u^2 K(u) \, du < \infty,$$
and let $h_1$ and $h_2$ be two positive real constants. As we saw in lectures, one form of a bivariate kernel estimator of the density $f(x, y)$ is given by
$$\hat{f}(x, y) = \frac{1}{n h_1 h_2} \sum_i K\!\left( \frac{x - x_i}{h_1} \right) K\!\left( \frac{y - y_i}{h_2} \right).$$
Also, an estimate of the marginal density $f(x)$ of $X$ is given by
$$\frac{1}{n h_1} \sum_i K\!\left( \frac{x - x_i}{h_1} \right).$$
The conditional density of $Y \mid X = x$ is given by
$$f(y \mid x) = \frac{f(x, y)}{f(x)}.$$
By substituting the above kernel estimators of $f(x, y)$ and $f(x)$ into this expression we obtain an estimate of the conditional density of $Y \mid X = x$. From this, obtain an estimate of $E(Y \mid X = x)$ and make a connection between this expression and kernel scatterplot smoothing.


UNIVERSITY OF NEW SOUTH WALES

SCHOOL OF MATHEMATICS AND STATISTICS

MATH3821 Statistical Modelling and Computing

Term Two 2020

Week 7 Tutorial Problems

1. In this question we describe the method used for fitting a projection pursuit regression model. Consider first the case of a single projection term. Then the model is
$$y_i = f(\alpha^T x_i) + \epsilon_i$$
where $x_i = (x_{i1}, \ldots, x_{ik})^T$, $\alpha = (\alpha_1, \ldots, \alpha_k)^T$ is a unit vector and the $\epsilon_i$, $i = 1, \ldots, n$, are zero mean uncorrelated errors with common variance (we have subsumed the intercept into the function $f$). The way that we estimate $f$ and $\alpha$ here is to start with some guesses for $f$ and $\alpha$, then re-estimate $f$ with $\alpha$ fixed at its current guess, then re-estimate $\alpha$ with $f$ fixed at its current value. This procedure is iterated until convergence.

Given an estimate of $\alpha$, estimating $f$ is easy since we just need to do a scatterplot smooth of the responses $y_i$ on the predictors $\alpha^T x_i$, $i = 1, \ldots, n$. We now describe the way that an estimate of $\alpha$ is obtained for a given estimate of $f$.

A Taylor series expansion about the current value for $\alpha$, $\alpha_{cur}$, gives
$$f(\alpha^T x_i) \approx f(\alpha_{cur}^T x_i) + f'(\alpha_{cur}^T x_i)(\alpha - \alpha_{cur})^T x_i.$$
Using this, show that
$$\sum_{i=1}^n (y_i - f(\alpha^T x_i))^2 \approx \sum_{i=1}^n f'(\alpha_{cur}^T x_i)^2 \left[ \alpha_{cur}^T x_i + \frac{y_i - f(\alpha_{cur}^T x_i)}{f'(\alpha_{cur}^T x_i)} - \alpha^T x_i \right]^2$$
and hence deduce that an updated estimate of $\alpha$ can be obtained by a weighted least squares regression of the responses $\alpha_{cur}^T x_i + (y_i - f(\alpha_{cur}^T x_i))/f'(\alpha_{cur}^T x_i)$ on the predictors $x_i$, with weights $f'(\alpha_{cur}^T x_i)^2$ and no intercept term. Iterating this procedure for updating $\alpha$ to convergence gives the method used for finding $\alpha$ in projection pursuit.

So far we have just dealt with a single projection term. Multiple projection terms

are fitted one at a time: first a one term model is fitted, then a second term is fitted

without adjusting the first term, then a third term without adjusting the second and so

on. Since we don’t adjust previous terms when adding a new term the above algorithm

for fitting a single term is all that is required in the algorithm (previous terms in

the sequence are fixed and can just be subtracted from the response to give a model

involving just a single projection term).

2. In deriving the conditional posterior distribution of $\beta$ given $\sigma^2$ in the Bayesian linear model we used the following result.

Lemma:
Let $Z$ be a $q$-dimensional random vector with density $p(z)$ with
$$p(z) \propto \exp\left( -\tfrac{1}{2} \left( z^T A z - 2 b^T z \right) \right)$$
where $A$ is a fixed symmetric positive definite $q \times q$ matrix and $b$ is a fixed $q \times 1$ vector. Then $Z \sim N(A^{-1} b, A^{-1})$.

Prove this result.

3. In deriving the marginal posterior distribution for $\sigma^2$, $p(\sigma^2 \mid y)$, in the Bayesian linear model we used the following result.

Lemma:
Let $z, b$ be $q \times 1$ vectors, and let $A$ be a symmetric positive definite $q \times q$ matrix. Then
$$\int \exp\left( -\tfrac{1}{2} \left( z^T A z - 2 b^T z \right) \right) dz = (2\pi)^{q/2} |A|^{-1/2} \exp\left( \tfrac{1}{2} b^T A^{-1} b \right)$$
where $|A|$ denotes the determinant of $A$.

Prove this result.

4. We proved in lectures that in the Bayesian linear model with $p(\beta, \sigma^2) \propto \sigma^{-2}$,
$$\beta \mid \sigma^2, y \sim N(b, \sigma^2 (X^T X)^{-1})$$
where $b = (X^T X)^{-1} X^T y$ is the least squares estimator. Here, as usual, $y$ is an $n$-vector of responses and $X$ is an $n \times p$ design matrix. In this question we describe one way to do the calculations required to obtain $b$ and $(X^T X)^{-1}$ (another way, using the so-called QR decomposition of $X$, is discussed in lectures). The Cholesky factorization of $X^T X$ is
$$X^T X = R^T R$$
where $R$ is a $p \times p$ upper triangular matrix (see the next question for a description of how to find the Cholesky factorization). By an upper triangular matrix we mean one where $R_{ij} = 0$ if $i > j$. The reason why this decomposition is useful is that solving a set of linear equations whose coefficient matrix is upper or lower triangular is very easy. We discuss this now and apply the idea to least squares calculations.


(a) Suggest an efficient way of solving the linear system $Rx = z$ for $x$, where $R$ is a known upper triangular $p \times p$ matrix and $z$ is a known $p \times 1$ vector. This will give $x = R^{-1} z$. (Hint: consider the last equation in the system $Rx = z$, which, since $R$ is upper triangular, is $R_{pp} x_p = z_p$. This gives $x_p$. Then consider the second last equation, and so on.)

(b) If $R$ is upper triangular, then $R^T = L$ is lower triangular (by lower triangular we mean $L_{ij} = 0$ if $i < j$). Suggest an efficient way of solving the linear system $Lx = z$ for $x$. This will give $x = L^{-1} z = (R^T)^{-1} z = R^{-T} z$.

(c) Since $b = (X^T X)^{-1} X^T y$, we have $b = (R^T R)^{-1} X^T y = R^{-1} R^{-T} X^T y$ where $R$ is the Cholesky factor of $X^T X$. Using the results of parts (a) and (b), suggest an efficient way to find $b$.

(d) Since $R R^{-1} = I$ where $I$ is the $p \times p$ identity matrix, the $j$th column of $R^{-1}$, $x$ say, satisfies $Rx = e_j$ where $e_j$ is the $j$th column of the identity matrix (a column of zeros with a 1 in the $j$th position). Hence suggest how we can efficiently compute $R^{-1}$ and hence
$$(X^T X)^{-1} = (R^T R)^{-1} = R^{-1} R^{-T}.$$
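The back- and forward-substitution steps described in (a)-(c) can be sketched in a few lines of numpy (synthetic data; numpy's own cholesky returns the lower factor, so we transpose to get the upper factor $R$):

```python
import numpy as np

def backsolve(R, z):
    # Solve R x = z for upper triangular R, working from the last equation up
    p = len(z)
    x = np.zeros(p)
    for i in range(p - 1, -1, -1):
        x[i] = (z[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

def forwardsolve(L, z):
    # Solve L x = z for lower triangular L, working from the first equation down
    p = len(z)
    x = np.zeros(p)
    for i in range(p):
        x[i] = (z[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

rng = np.random.default_rng(7)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = rng.normal(size=30)

R = np.linalg.cholesky(X.T @ X).T                 # X^T X = R^T R, R upper triangular
b = backsolve(R, forwardsolve(R.T, X.T @ y))      # b = R^{-1} R^{-T} X^T y
print(np.allclose(b, np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```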

5. Let $A$ be a $2 \times 2$ symmetric, positive definite matrix. Write $A = R^T R$ where $R$ is an upper triangular matrix,
$$R = \begin{pmatrix} a & b \\ 0 & c \end{pmatrix}.$$
After writing out an expression for the elements of $A$ in terms of those of $R$, suggest a way of finding the elements of $R$. What conditions are required on $A$? This line of argument generalizes to a matrix $A$ of any size.

6. Most statistical packages have a routine for generating independent standard normal random variables. Recall that a $\chi^2_\nu$ random variable is formed as the sum of $\nu$ independent squared standard normal random variables; hence there is a simple direct way of generating from the $\chi^2_\nu$ density when the degrees of freedom $\nu$ is an integer. Recalling that $Z$ has a $\chi^2_\nu$ density if its density has the form
$$p_Z(z) = \frac{2^{-\nu/2}}{\Gamma(\nu/2)} z^{\nu/2 - 1} \exp\left( -\frac{z}{2} \right)$$
and from lectures that $W$ has the Inverse-$\chi^2(\nu, s^2)$ density if its density has the form
$$f_W(w) = \frac{(\nu s^2 / 2)^{\nu/2}}{\Gamma(\nu/2)} w^{-(\nu/2 + 1)} \exp\left( -\frac{\nu s^2}{2w} \right),$$
show that if $Z \sim \chi^2_\nu$ then $W = (\nu s^2)/Z \sim$ Inverse-$\chi^2(\nu, s^2)$.
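The construction can be sanity-checked by simulation: build $\chi^2_\nu$ draws as sums of squared normals, transform, and compare with the known mean $E[W] = \nu s^2/(\nu - 2)$ (arbitrary illustrative values of $\nu$ and $s^2$):

```python
import numpy as np

rng = np.random.default_rng(3)
nu, s2 = 6, 2.0

# chi^2_nu as a sum of nu squared standard normals, then W = nu s^2 / Z
Z = (rng.normal(size=(200_000, nu)) ** 2).sum(axis=1)
W = nu * s2 / Z

# For Inverse-chi^2(nu, s^2): E W = nu s^2 / (nu - 2) = 3.0 here
print(abs(W.mean() - nu * s2 / (nu - 2)) < 0.05)  # True
```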


UNIVERSITY OF NEW SOUTH WALES

SCHOOL OF MATHEMATICS AND STATISTICS

MATH3821 Statistical Modelling and Computing

Term Two 2020

Week 8 Tutorial Problems

1. Using the inverse transform method, write an R function to generate a random variable with the distribution function
$$F(x) = \frac{x^2 + x}{2}, \quad 0 \le x \le 1.$$
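The sheet asks for an R function; the same inversion is easy to sketch in numpy. Solving $u = (x^2 + x)/2$ for the root in $[0, 1]$ gives $x = (\sqrt{1 + 8u} - 1)/2$:

```python
import numpy as np

def rF(n, rng):
    # Inverse transform: solve u = (x^2 + x)/2 for the root in [0, 1]
    u = rng.uniform(size=n)
    return (np.sqrt(1 + 8 * u) - 1) / 2

x = rF(5, np.random.default_rng(0))
print(np.all((x >= 0) & (x <= 1)))                      # True
# Applying F recovers the same uniforms (same seed regenerates them)
print(np.allclose((x ** 2 + x) / 2,
                  np.random.default_rng(0).uniform(size=5)))  # True
```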

2. Give a method for generating a random variable having density function
$$f(x) = \begin{cases} e^{2x}, & -\infty < x < 0 \\ e^{-2x}, & 0 < x < \infty. \end{cases}$$
Write an R function to implement your method.

3. Suppose that it is easy to generate a random variable from any of the distributions $F_i$, $i = 1, \ldots, n$. How can we generate from the following distributions?

(a)
$$F(x) = \prod_{i=1}^n F_i(x)$$

(b)
$$F(x) = 1 - \prod_{i=1}^n (1 - F_i(x))$$

(Hint: if $X_i$ are independent random variables, with $X_i$ having distribution $F_i$, which random variable has distribution function $F$?)

4. Show that the distribution function described in part (a) is
$$F(x) = x^n, \quad 0 \le x \le 1,$$
if each $F_i$ is the distribution function of a standard uniform; hence write an R program simulating from this distribution function using the results of Question 3.
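A sketch of the simulation (the sheet asks for R; numpy shown), using the Question 3(a) observation that the maximum of $n$ independent draws has distribution function $\prod_i F_i$; the sample mean is compared with the known mean $E[X] = n/(n+1)$ of $F(x) = x^n$:

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
# X = max(U_1, ..., U_n) has F(x) = x^n on [0, 1]
x = rng.uniform(size=(100_000, n)).max(axis=1)

# Theoretical mean E X = n / (n + 1) = 0.8 here
print(abs(x.mean() - n / (n + 1)) < 0.005)  # True
```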


UNIVERSITY OF NEW SOUTH WALES

SCHOOL OF MATHEMATICS AND STATISTICS

MATH3821 Statistical Modelling and Computing

Term Two 2020

Week 9 Tutorial Problems

1. Suppose that Y has a negative binomial distribution with parameters p and r (where r is known), with probability mass function
$$P(Y = y) = \binom{y + r - 1}{y} (1 - p)^y p^r,$$
where the exponential family takes the form
$$P(Y = y) = \exp\left( \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right).$$

(a) Show that Y belongs to the exponential family.

(b) Compute the mean and variance of Y using $E[Y] = b'(\theta)$ and $\operatorname{Var}(Y) = a(\phi) b''(\theta)$.

(c) What is the canonical link function $g(\cdot)$?

2. Let
$$y_i = f(x_i) + \varepsilon_i \quad \text{for } i = 1, \ldots, n$$
denote a model in which the response variable $y$ depends on a single predictor variable $x$. Assume that the $x_i$ are distinct and ordered, so $x_1 < x_2 < \cdots < x_n$. Let $a$ and $b$ denote real numbers such that
$$a = x_0 < x_1 < x_2 < \cdots < x_n < x_{n+1} = b.$$

(a) For an arbitrary point $x_0$ within the range of the $x$-values, give an expression for the $p$th order regression spline estimator $\hat{f}(x_0)$ of $f(x_0)$, clearly explaining the notation used.

(b) List two defining properties of natural cubic splines.

(c) Next consider response variables $y$ which depend on $k$ predictor variables $x = (x_1, x_2, \ldots, x_k)$ as follows:
$$y_i = f(x_{1i}, x_{2i}, \ldots, x_{ki}) + \varepsilon_i$$
for $i = 1, \ldots, n$. Describe the additive model approach and the projection pursuit regression approach to finding an estimate for the response variable in this model. Describe the relative strengths and weaknesses of the two approaches.


3. The inverse transform method allows us to simulate random samples for any random variable $X$ with distribution function $F(x)$, provided that we can obtain uniform random samples $u \sim U(0, 1)$.

(a) Describe the inverse transform method for simulating a single sample $x$ such that $x \sim F(x)$.

(b) Prove that the procedure described in (a) produces a sample from $F$.

(c) Let $X$ be a random variable which can take the values $1, 2, \ldots, n$. Write $p_j = \Pr(X = j)$ for its probability function. Write an R function which takes as input a vector $p$ of length $n$, with the $j$th element of $p$ equal to $p_j$, and simulates a realization of $X$ using the inversion method. Your function should check that the input vector takes only positive values, and return an error message otherwise. It should also make sure that the vector $p$ sums to one.
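The discrete inversion step in (c) can be sketched as follows (numpy rather than the requested R; the function takes the uniform draw $u$ explicitly so the logic is easy to test):

```python
import numpy as np

def rdiscrete(u, p):
    # Discrete inverse transform: return the smallest j with F(j) >= u,
    # where F is the cumulative sum of p and the values are 1, ..., n
    p = np.asarray(p, dtype=float)
    if np.any(p <= 0):
        raise ValueError("p must have positive entries")
    if not np.isclose(p.sum(), 1.0):
        raise ValueError("p must sum to one")
    return int(np.searchsorted(np.cumsum(p), u) + 1)

p = [0.2, 0.3, 0.5]
print([rdiscrete(u, p) for u in (0.1, 0.25, 0.6, 0.99)])  # [1, 2, 3, 3]
```

In practice $u$ would be drawn with rng.uniform() and the function called once per realization.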

4. Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$.

(a) Let $\lambda \sim \text{Gamma}(\alpha, \beta)$ be the prior. Show that the posterior is also a Gamma. Find the posterior mean.

(b) Find the Jeffreys prior. Find the posterior.
