Normal modeling

STA360 at Duke University

Exercise

Libraries used:

library(tidyverse)
library(latex2exp)

Data

bass = read_csv("https://sta360-fa24.github.io/data/bass.csv")
glimpse(bass)
Rows: 171
Columns: 5
$ river   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ station <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ length  <dbl> 47.0, 48.7, 55.7, 45.2, 44.7, 43.8, 38.5, 45.8, 44.0, 40.4, 47…
$ weight  <dbl> 1.616, 1.862, 2.855, 1.199, 1.320, 1.225, 0.870, 1.455, 1.220,…
$ mercury <dbl> 1.60, 1.50, 1.70, 0.73, 0.56, 0.51, 0.48, 0.95, 1.40, 0.50, 0.…

Mercury is a naturally occurring element that can have toxic effects on the nervous, digestive, and immune systems of humans. Bass from the Waccamaw and Lumber Rivers (NC) were caught randomly, weighed, and measured. In addition, a filet from each fish caught was sent to the lab so that the tissue concentration of mercury could be determined for each fish. Each fish caught corresponds to a single row of the data frame. Today we will examine two columns from the data set: mercury (concentration of mercury in ppm) and weight (weight of the fish in kg). We’ll model the mercury content \(y\) of each fish as a function of the fish’s weight \(x\).

Model

Let

\[ \begin{aligned} Y_i | \theta &\overset{\text{ind.}}{\sim} N(\theta x_i, 1)\\ \theta &\sim N(\mu_0, 1 / \kappa_0) \end{aligned} \]

Let \(\mu_0 = 0\), \(\kappa_0 = 1\).

(a). Suppose you observe data \(y_1,\ldots y_n\). Write out the formula for \(p(\theta | y_1, \ldots y_n)\).

(b). Given the data on the previous slide, use Monte Carlo simulation to plot \(p(\theta | y_1, \ldots, y_n)\). Additionally, report \(E[\theta | y_1,\ldots y_n]\) together with a 95% posterior confidence interval.

(c). If you caught a new fish with weight 4 kg, what would you predict the mercury content to be? In other words, let \(x = 4\) and compute \(E[\tilde{y}|y_1,\ldots, y_n, x = 4]\). Additionally, plot the posterior predictive density \(p(\tilde{y} | y_1, \ldots y_n, x = 4)\).

(d). Critique your model. Hint: compare to the models below:
lm(mercury ~ weight, data = bass)
lm(mercury ~ 0 + weight, data = bass)

Solution (a)


\[ \theta | y_1, \ldots y_n \sim N(\mu_n, \tau_n^2) \]

where

\[ \begin{aligned} \mu_n &= \frac{\kappa_0 \mu_0 + \sum y_i x_i}{\kappa_0 + \sum x_i^2}\\ \tau_n^2 &= \frac{1}{\kappa_0 + \sum x_i^2} \end{aligned} \]
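
This is the standard conjugate-normal calculation; for completeness, a brief sketch obtained by collecting the terms in \(\theta\):

\[ \begin{aligned} p(\theta | y_1, \ldots, y_n) &\propto \exp\left\{-\tfrac{\kappa_0}{2}(\theta - \mu_0)^2\right\} \prod_{i=1}^n \exp\left\{-\tfrac{1}{2}(y_i - \theta x_i)^2\right\}\\ &\propto \exp\left\{-\tfrac{1}{2}\left[\left(\kappa_0 + \textstyle\sum x_i^2\right)\theta^2 - 2\left(\kappa_0 \mu_0 + \textstyle\sum y_i x_i\right)\theta\right]\right\}, \end{aligned} \]

which is the kernel of a normal density with precision \(\kappa_0 + \sum x_i^2\) and mean \(\mu_n\) as given above.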

Solution (b)


Demo with simulated data to make sure code works:

# simulated data
set.seed(123)
true.theta = 4
true.sigma = 1
N = 10
x = seq(from = 1, to = 10, length = N)
y = rnorm(N, true.theta * x, true.sigma)

# prior parameters
k0 = 1
mu0 = 1

sumYX = sum(y * x)
d = (k0 + sum(x^2))
mun = (k0 * mu0 + sumYX) / d
tn = sqrt(1 / d)

theta.postsample = rnorm(10000, mun, tn)
hist(theta.postsample)

x = bass$weight
y = bass$mercury

# prior parameters
k0 = 1
mu0 = 0 # exercise specifies mu_0 = 0, kappa_0 = 1

sumYX = sum(y * x)
d = (k0 + sum(x^2))
mun = (k0 * mu0 + sumYX) / d
tn = sqrt(1 / d)

theta.postsample = rnorm(10000, mun, tn)
hist(theta.postsample)

mean(theta.postsample)
[1] 0.8370612
quantile(theta.postsample, c(0.025, 0.975))
     2.5%     97.5% 
0.7340858 0.9412182 

Solution (c)

# use posterior samples of theta and x = 4 to simulate ytilde

ytilde = rnorm(10000, theta.postsample * 4, 1)
hist(ytilde)
mean(ytilde)
[1] 3.341773

This matches intuition: by the law of total expectation, \(E[\tilde{y} | y_1, \ldots, y_n, x = 4] = 4\,E[\theta | y_1, \ldots, y_n] \approx 4 \times 0.837 = 3.348\).
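
In fact the full posterior predictive is available in closed form, which gives a check on the simulation. Writing \(\tilde{y} = 4\theta + \epsilon\) with \(\epsilon \sim N(0, 1)\) independent of \(\theta | y_1, \ldots, y_n \sim N(\mu_n, \tau_n^2)\),

\[ \tilde{y} \mid y_1, \ldots, y_n, x = 4 \;\sim\; N\!\left(4\mu_n,\; 16\tau_n^2 + 1\right), \]

so the predictive mean is \(4\mu_n\) and the predictive variance is the posterior variance scaled by \(x^2 = 16\) plus the sampling variance of a new observation.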

Solution (d)

We have no intercept term. We are assuming that our regression line goes through the origin. This is a strong assumption. Our model will be most similar to the lm model without an intercept term:

lm(mercury ~ 0 + weight, data = bass)

Call:
lm(formula = mercury ~ 0 + weight, data = bass)

Coefficients:
weight  
0.8343  
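
The connection to the no-intercept `lm` fit is direct. With the stated prior mean \(\mu_0 = 0\), the posterior mean is the least-squares slope shrunk slightly toward zero:

\[ \hat{\theta}_{\text{OLS}} = \frac{\sum y_i x_i}{\sum x_i^2}, \qquad \mu_n = \frac{\sum y_i x_i}{\kappa_0 + \sum x_i^2} = \frac{\sum x_i^2}{\kappa_0 + \sum x_i^2}\,\hat{\theta}_{\text{OLS}}, \]

and the two estimates nearly agree here because \(\sum x_i^2\) is much larger than \(\kappa_0 = 1\).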

However, we’ll get a different estimate of \(\hat{\theta}\) if we include an intercept term,

lm(mercury ~ weight, data = bass)

Call:
lm(formula = mercury ~ weight, data = bass)

Coefficients:
(Intercept)       weight  
     0.6387       0.4818