
Bayesian Linear Regression

Bayesian linear regression is an approach to linear regression in which the statistical analysis is carried out within the framework of Bayesian inference.

Bayesian Likelihood Linear Regression

Consider a standard linear regression problem in which, for $i = 1, 2, \ldots, n$, we specify the conditional distribution of $y_i$ given a predictor vector $X_i$ of order $k \times 1$:

$y_i = X_i^T \beta + \epsilon_i$

Here, $\beta$ is a coefficient vector of order $k \times 1$, and the $\epsilon_i$ are independent, identically distributed normal random variables:
$\epsilon_i \sim N(0, \sigma^2)$

This corresponds to the following likelihood function:

$\rho(y \mid X, \beta, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta)\right)$
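
As a rough illustration, the likelihood above can be evaluated in log form, dropping the constant of proportionality. The sketch below assumes NumPy; the function name `log_likelihood_kernel` and the toy data (sizes, coefficients, noise variance) are hypothetical choices made only for this example.

```python
import numpy as np

def log_likelihood_kernel(y, X, beta, sigma2):
    """Log-likelihood up to an additive constant:
    -(n/2) * log(sigma2) - (1 / (2 * sigma2)) * (y - X beta)^T (y - X beta)."""
    n = y.shape[0]
    resid = y - X @ beta
    return -0.5 * n * np.log(sigma2) - 0.5 * (resid @ resid) / sigma2

# Hypothetical toy data: n = 200 observations, k = 3 predictors.
rng = np.random.default_rng(0)
n, k = 200, 3
X = rng.normal(size=(n, k))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2_true = 0.25
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=n)

# The generating coefficients should score higher than an all-zero vector.
ll_true = log_likelihood_kernel(y, X, beta_true, sigma2_true)
ll_zero = log_likelihood_kernel(y, X, np.zeros(k), sigma2_true)
print(ll_true, ll_zero)
```

Working in log space avoids numerical underflow, since the likelihood itself is a product of $n$ small densities.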

Bayesian Multivariate Linear Regression

In this variant of linear regression, the predicted outcome is not a single scalar random variable but a vector of correlated random variables.

Consider a regression problem in which the dependent variable to be predicted is a vector of $m$ correlated real numbers. There are $n$ observations, and each observation consists of $k - 1$ explanatory variables, grouped into a vector $X_i$ of length $k$ (a constant entry equal to 1 is added to allow for an intercept). We have,

$y_{i,1} = X_i^T \beta_1 + \epsilon_{i,1}$

$\vdots$

$y_{i,m} = X_i^T \beta_m + \epsilon_{i,m}$

The errors $\{\epsilon_{i,1}, \ldots, \epsilon_{i,m}\}$ are assumed to be correlated with one another.

These equations can be written as a single regression problem:

$y_i^T = X_i^T B + \epsilon_i^T$

$B$ is a coefficient matrix of order $k \times m$, in which the coefficient vectors $\beta_1, \ldots, \beta_m$ are stacked horizontally as columns.

For each $i$, the noise vector $\epsilon_i$ is jointly normal, so that the outcomes for a given observation are correlated:

$\epsilon_i \sim N(0, \Sigma_\epsilon)$

where $\Sigma_\epsilon$ is the $m \times m$ covariance matrix of the errors.

In matrix form, the entire regression problem can be written as
$Y = XB + E$

Both $Y$ and $E$ are matrices of order $n \times m$. The matrix $X$ is a design matrix of order $n \times k$ in which the observations are stacked vertically.
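
To make the shapes concrete, the matrix form can be simulated as in the sketch below. It assumes NumPy; the sizes, the coefficient matrix, and the noise covariance (named `Sigma_eps` here) are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 500, 3, 2          # hypothetical sizes

# Design matrix: a leading column of ones (intercept) plus k - 1 predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
B = rng.normal(size=(k, m))  # coefficient matrix; column j plays the role of beta_j

# Hypothetical m x m noise covariance with correlated components.
Sigma_eps = np.array([[1.0, 0.8],
                      [0.8, 1.0]])
E = rng.multivariate_normal(mean=np.zeros(m), cov=Sigma_eps, size=n)

Y = X @ B + E

# Row i of Y is the per-observation equation y_i^T = X_i^T B + eps_i^T.
row_ok = np.allclose(Y[0], X[0] @ B + E[0])
# Column j of Y is an ordinary scalar regression with coefficient vector beta_j.
col_ok = np.allclose(Y[:, 1], X @ B[:, 1] + E[:, 1])
print(Y.shape, E.shape, row_ok, col_ok)
```

Each row of `E` is one draw of the jointly normal noise vector, which is what ties the $m$ outcomes of a single observation together.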

Bayesian Estimation of Linear Regression

We need to estimate the unknown parameters in $\beta$. The maximum likelihood estimate of $\beta$ is based on the Gaussian likelihood:

$\rho(y \mid X, \beta; \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \|y - X\beta\|^2\right)$

Note that this is simply the product of the likelihoods of the individual components of $y$.

Taking the log of this likelihood and differentiating with respect to $\beta$ gives

$\nabla_\beta \ln \rho(y \mid X, \beta; \sigma^2) = \frac{1}{\sigma^2} X^T (y - X\beta)$

Setting this equal to zero gives

$\hat{\beta} = (X^T X)^{-1} X^T y$

The maximum likelihood estimate of $\beta$ is unbiased and Gaussian distributed:

$\hat{\beta} \sim N(\beta, \sigma^2 (X^T X)^{-1})$
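
The estimate and its sampling covariance can be computed numerically along the following lines. This is a minimal sketch assuming NumPy and a known noise variance; the simulated data and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 300, 3
X = rng.normal(size=(n, k))
beta_true = np.array([0.5, 1.5, -1.0])
sigma2 = 0.5
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Solve the normal equations (X^T X) beta = X^T y without forming an inverse.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Gradient of the log-likelihood at beta_hat; it should vanish up to rounding.
grad = X.T @ (y - X @ beta_hat) / sigma2

# Sampling covariance of beta_hat when sigma2 is known.
cov_beta_hat = sigma2 * np.linalg.inv(XtX)

print(beta_hat)
print(np.max(np.abs(grad)))
print(np.sqrt(np.diag(cov_beta_hat)))  # standard errors of the coefficients
```

Solving the linear system is generally preferred over computing $(X^T X)^{-1}$ explicitly for the estimate itself; the inverse is formed here only because the covariance matrix requires it.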

Conjugate Priors for Basic Linear Regression

Here we consider a conjugate prior, for which the posterior distribution can be derived analytically.

A prior $\rho(\beta, \sigma^2)$ is conjugate to the likelihood function if it has the same functional form with respect to $\beta$ and $\sigma^2$. Because the log-likelihood is quadratic in $\beta$, it can be rewritten so that the likelihood becomes normal in $(\beta - \hat{\beta})$:

$(y - X\beta)^T (y - X\beta) = (y - X\hat{\beta})^T (y - X\hat{\beta}) + (\beta - \hat{\beta})^T (X^T X) (\beta - \hat{\beta})$
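
The cross term disappears in this decomposition because the least-squares residual is orthogonal to the columns of $X$, that is, $X^T (y - X\hat{\beta}) = 0$. The identity can be checked numerically for an arbitrary $\beta$, as in this sketch (NumPy assumed; the random data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 4
X = rng.normal(size=(n, k))
y = rng.normal(size=n)
beta = rng.normal(size=k)          # an arbitrary coefficient vector

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

lhs = (y - X @ beta) @ (y - X @ beta)                     # total sum of squares at beta
rss = (y - X @ beta_hat) @ (y - X @ beta_hat)             # residual sum of squares at beta_hat
quad = (beta - beta_hat) @ (X.T @ X) @ (beta - beta_hat)  # quadratic form in beta - beta_hat

identity_holds = np.isclose(lhs, rss + quad)
print(identity_holds)
```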

The likelihood can then be rewritten as

$\rho(y \mid X, \beta, \sigma^2) \propto (\sigma^2)^{-v/2} \exp\left(-\frac{v s^2}{2\sigma^2}\right) (\sigma^2)^{-(n - v)/2} \exp\left(-\frac{1}{2\sigma^2} (\beta - \hat{\beta})^T (X^T X) (\beta - \hat{\beta})\right)$

Here, $v s^2 = (y - X\hat{\beta})^T (y - X\hat{\beta})$ and $v = n - k$, where $k$ is the number of regression coefficients.
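
One way to sanity-check this factorization is to compute $v$ and $s^2$ from data and confirm that the factored form agrees with the direct one in log space. The sketch below assumes NumPy; the helper names `log_kernel_direct` and `log_kernel_factored`, the data, and the test point are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 80, 3
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
v = n - k                                          # residual degrees of freedom
s2 = (y - X @ beta_hat) @ (y - X @ beta_hat) / v   # so that v * s2 is the residual sum of squares

def log_kernel_direct(beta, sigma2):
    """Log-likelihood kernel written in terms of y - X beta."""
    r = y - X @ beta
    return -0.5 * n * np.log(sigma2) - 0.5 * (r @ r) / sigma2

def log_kernel_factored(beta, sigma2):
    """Same kernel, split into a sigma2-only factor and a factor normal in beta."""
    d = beta - beta_hat
    sigma2_part = -0.5 * v * np.log(sigma2) - 0.5 * v * s2 / sigma2
    beta_part = -0.5 * (n - v) * np.log(sigma2) - 0.5 * (d @ (X.T @ X) @ d) / sigma2
    return sigma2_part + beta_part

beta_test, sigma2_test = np.array([0.8, 0.1, -1.2]), 1.7
forms_agree = np.isclose(log_kernel_direct(beta_test, sigma2_test),
                         log_kernel_factored(beta_test, sigma2_test))
print(v, s2, forms_agree)
```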

This suggests a prior of the form

$\rho(\beta, \sigma^2) = \rho(\sigma^2)\, \rho(\beta \mid \sigma^2)$

Here, $\rho(\sigma^2)$ is an inverse-gamma distribution:

$\rho(\sigma^2) \propto (\sigma^2)^{-\left(\frac{v_0}{2} + 1\right)} \exp\left(-\frac{v_0 s_0^2}{2\sigma^2}\right)$

Also, $\rho(\beta \mid \sigma^2)$ is the conditional prior density, which is a normal distribution:

$\rho(\beta \mid \sigma^2) \propto (\sigma^2)^{-k/2} \exp\left(-\frac{1}{2\sigma^2} (\beta - \mu_0)^T \Lambda_0 (\beta - \mu_0)\right)$

In standard normal-distribution notation, the conditional prior distribution is $N(\mu_0, \sigma^2 \Lambda_0^{-1})$.
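
Drawing from this prior shows how the two pieces fit together: first sample $\sigma^2$ from the inverse-gamma distribution, then sample $\beta$ given that $\sigma^2$. The sketch below assumes NumPy and the usual shape/scale reading of the inverse-gamma density, with shape $a_0 = v_0/2$ and scale $b_0 = v_0 s_0^2/2$. The function name `sample_prior` and the hyperparameter values are hypothetical.

```python
import numpy as np

def sample_prior(mu0, Lambda0, v0, s0_sq, size, rng):
    """Draw (beta, sigma2) pairs from the normal-inverse-gamma prior.

    sigma2 ~ Inv-Gamma(a0, b0) with a0 = v0 / 2 and b0 = v0 * s0_sq / 2,
    beta | sigma2 ~ N(mu0, sigma2 * inv(Lambda0)).
    """
    a0, b0 = 0.5 * v0, 0.5 * v0 * s0_sq
    # If G ~ Gamma(shape=a0, scale=1/b0), then 1/G ~ Inv-Gamma(a0, b0).
    sigma2 = 1.0 / rng.gamma(shape=a0, scale=1.0 / b0, size=size)
    # Cholesky factor L of inv(Lambda0), so that L z has covariance inv(Lambda0).
    L = np.linalg.cholesky(np.linalg.inv(Lambda0))
    z = rng.standard_normal(size=(size, mu0.shape[0]))
    beta = mu0 + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return beta, sigma2

# Hypothetical hyperparameters for k = 2 coefficients.
rng = np.random.default_rng(5)
mu0 = np.array([0.0, 1.0])
Lambda0 = np.array([[2.0, 0.0],
                    [0.0, 4.0]])
beta, sigma2 = sample_prior(mu0, Lambda0, v0=10.0, s0_sq=1.5, size=20000, rng=rng)

print(beta.mean(axis=0))   # should be close to mu0
print(sigma2.mean())       # should be close to b0 / (a0 - 1) when a0 > 1
```

Because the prior covariance of $\beta$ is scaled by $\sigma^2$, larger draws of the noise variance produce more dispersed coefficient draws.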