The approach to linear regression in which the statistical analysis is carried out within the context of Bayesian inference is commonly known as Bayesian linear regression.


We have a general linear regression problem in which we specify the conditional distribution of $y_i$ for $i = 1, 2, \ldots, n$, given a $k \times 1$ predictor vector $X_i$:

$y_i = X_i^T \beta + \epsilon_i$

Here, $\beta$ is a vector of order $k \times 1$, and the $\epsilon_i$ are independent and identically distributed normal random variables:

$\epsilon_i \sim N(0, \sigma^2)$

This corresponds to the likelihood function

$p(y \mid X, \beta, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta)\right)$
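This likelihood can be evaluated directly. Below is a minimal sketch (with hypothetical toy data) of the log of the kernel above, which is higher near the true $\beta$ than at a distant point:

```python
import numpy as np

# Hypothetical toy data: n = 5 observations, k = 2 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
beta_true = np.array([1.0, -2.0])
sigma2 = 0.25
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=5)

def log_likelihood(beta, sigma2, X, y):
    """Log of the kernel (sigma^2)^(-n/2) exp(-(y - X b)^T (y - X b) / (2 sigma^2))."""
    n = len(y)
    resid = y - X @ beta
    return -0.5 * n * np.log(sigma2) - resid @ resid / (2.0 * sigma2)

print(log_likelihood(beta_true, sigma2, X, y))
```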

In multivariate linear regression, the predicted outcome is not a single scalar random variable but a vector of correlated random variables.

Consider a regression problem in which the dependent variable to be predicted is a vector of $m$ correlated real numbers. There are $n$ observations, each consisting of $k - 1$ explanatory variables which, together with an intercept term, are grouped into a vector $X_i$ of length $k$. We have

$y_{i,1} = X_i^T \beta_1 + \epsilon_{i,1}$

$\vdots$

$y_{i,m} = X_i^T \beta_m + \epsilon_{i,m}$

It is given that the set of errors $\{\epsilon_{i,1}, \ldots, \epsilon_{i,m}\}$ is correlated within each observation.

This can be written as a single regression problem:

$y_i^T = X_i^T B + \epsilon_i^T$

$B$ is a coefficient matrix of order $k \times m$, in which the coefficient vectors $\beta_1, \ldots, \beta_m$ are stacked horizontally as columns.

The noise vector $\epsilon_i$ is jointly normal for each $i$, so that the outcomes for a given observation are correlated:

$\epsilon_i \sim N(0, \Sigma_\epsilon)$

In matrix form, the entire regression problem can be written as

$Y = XB + E$

Both $Y$ and $E$ are matrices of order $n \times m$. The matrix $X$ is a design matrix of order $n \times k$, in which the observations are stacked vertically.
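The stacked form above can be sketched numerically. The dimensions below are hypothetical; the point is that each row of $Y$ reproduces the per-observation equation $y_i^T = X_i^T B + \epsilon_i^T$:

```python
import numpy as np

# Hypothetical dimensions: n = 6 observations, k = 3 predictors, m = 2 outcomes.
rng = np.random.default_rng(1)
n, k, m = 6, 3, 2
X = rng.normal(size=(n, k))   # design matrix: the X_i^T stacked vertically
B = rng.normal(size=(k, m))   # coefficient matrix: beta_1, ..., beta_m as columns
E = rng.normal(size=(n, m))   # noise matrix: the eps_i^T stacked vertically
Y = X @ B + E                 # stacked regression Y = X B + E

print(Y.shape)                # n x m
```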

We need to estimate the unknown parameters $\beta$. The maximum likelihood estimate of $\beta$ is based on the Gaussian likelihood:

$p(y \mid X, \beta; \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\,\|y - X\beta\|^2\right)$

Note that this is simply the product of the likelihoods of the individual components of $y$.

Taking the log of this likelihood and differentiating with respect to $\beta$, we get

$\nabla_\beta \ln p(y \mid X, \beta; \sigma^2) = \frac{1}{\sigma^2}\, X^T (y - X\beta)$

Setting this equal to zero, we get

$\hat{\beta} = (X^T X)^{-1} X^T y$
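A minimal sketch of computing this estimate on hypothetical toy data (solving the normal equations rather than inverting $X^T X$ explicitly, which is numerically preferable):

```python
import numpy as np

# Hypothetical data generated from known coefficients.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=50)

# Solve (X^T X) beta_hat = X^T y instead of forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq computes the same estimate more stably.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```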

The maximum likelihood estimate of $\beta$ is unbiased and Gaussian distributed:

$\hat{\beta} \sim N(\beta,\ \sigma^2 (X^T X)^{-1})$

Here we consider conjugate priors, for which the posterior distribution can be derived analytically.

A prior $p(\beta, \sigma^2)$ is conjugate to the likelihood function only if it has the same functional form with respect to $\beta$ and $\sigma^2$. Since the log-likelihood is quadratic in $\beta$, it can be rewritten so that the likelihood becomes normal in $(\beta - \hat{\beta})$:

$(y - X\beta)^T (y - X\beta) = (y - X\hat{\beta})^T (y - X\hat{\beta}) + (\beta - \hat{\beta})^T (X^T X)(\beta - \hat{\beta})$
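This decomposition holds exactly because the cross term involves $X^T(y - X\hat{\beta})$, which is zero at the least-squares estimate. A quick numerical check on hypothetical data:

```python
import numpy as np

# Verify (y - Xb)^T (y - Xb) = (y - Xbh)^T (y - Xbh) + (b - bh)^T X^T X (b - bh).
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

beta = np.array([1.0, -2.0, 3.0])   # an arbitrary test point
lhs = (y - X @ beta) @ (y - X @ beta)
d = beta - beta_hat
rhs = (y - X @ beta_hat) @ (y - X @ beta_hat) + d @ (X.T @ X) @ d
print(np.isclose(lhs, rhs))        # the cross term vanishes at beta_hat
```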

The likelihood can now be rewritten as

$p(y \mid X, \beta, \sigma^2) \propto (\sigma^2)^{-v/2} \exp\left(-\frac{v s^2}{2\sigma^2}\right) (\sigma^2)^{-(n - v)/2} \exp\left(-\frac{1}{2\sigma^2}(\beta - \hat{\beta})^T (X^T X)(\beta - \hat{\beta})\right)$

Here, $v s^2 = (y - X\hat{\beta})^T (y - X\hat{\beta})$, $v = n - k$, and $k$ is the number of regression coefficients.
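A small sketch, on hypothetical data, of these two quantities: the residual sum of squares $v s^2$ and the degrees of freedom $v = n - k$, from which $s^2$ is the usual estimate of $\sigma^2$:

```python
import numpy as np

# Hypothetical data with n = 30 observations and k = 4 coefficients.
rng = np.random.default_rng(5)
n, k = 30, 4
X = rng.normal(size=(n, k))
y = X @ rng.normal(size=k) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss = (y - X @ beta_hat) @ (y - X @ beta_hat)   # this is v s^2
v = n - k                                        # residual degrees of freedom
s_sq = rss / v                                   # estimate of sigma^2
print(v, s_sq)
```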

This suggests the following form for the prior:

$p(\beta, \sigma^2) = p(\sigma^2)\, p(\beta \mid \sigma^2)$

Here, $p(\sigma^2)$ is an inverse-gamma distribution:

$p(\sigma^2) \propto (\sigma^2)^{-\left(\frac{v_0}{2} + 1\right)} \exp\left(-\frac{v_0 s_0^2}{2\sigma^2}\right)$

Also, $p(\beta \mid \sigma^2)$ is a conditional prior density in the form of a normal distribution:

$p(\beta \mid \sigma^2) \propto (\sigma^2)^{-k/2} \exp\left(-\frac{1}{2\sigma^2}(\beta - \mu_0)^T \Lambda_0 (\beta - \mu_0)\right)$

In normal-distribution notation, the conditional prior distribution is $N(\mu_0,\ \sigma^2 \Lambda_0^{-1})$.
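Putting the two pieces together, one sample $(\sigma^2, \beta)$ can be drawn from this conjugate prior as sketched below; all hyperparameter values ($v_0$, $s_0^2$, $\mu_0$, $\Lambda_0$) are hypothetical:

```python
import numpy as np

# Draw from p(beta, sigma^2) = p(sigma^2) p(beta | sigma^2):
# sigma^2 ~ Inv-Gamma(v0/2, v0 s0^2 / 2), beta | sigma^2 ~ N(mu0, sigma^2 Lambda0^{-1}).
rng = np.random.default_rng(4)
k = 2
v0, s0_sq = 4.0, 1.0          # hypothetical inverse-gamma hyperparameters
mu0 = np.zeros(k)             # hypothetical prior mean
Lambda0 = np.eye(k)           # hypothetical prior precision matrix

# Inverse-gamma draw via the reciprocal of a gamma variate.
sigma2 = 1.0 / rng.gamma(shape=v0 / 2.0, scale=2.0 / (v0 * s0_sq))

# Conditional normal draw with covariance sigma^2 Lambda0^{-1}.
beta = rng.multivariate_normal(mu0, sigma2 * np.linalg.inv(Lambda0))
print(sigma2, beta)
```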
