Non-Parametric Density Estimation

Given $n$ discrete observations of a random variable

$\displaystyle \mathcal{S} = \{ x_1,x_2, \cdots , x_n \} \subseteq \mathcal{X}$

all of which are independently and identically distributed (iid) according to some unknown probability distribution $F(x)$, we seek an estimate $\hat{f}(x)$ of the true probability density function $f(x)$. The search for $\hat{f}(x)$ is usually performed in a restricted functional space $\mathcal{H}_f$, which is in fact a Reproducing Kernel Hilbert Space (see Section [*]), further restricted to only those functions that are non-negative and integrate to one. The probability density function (PDF) is simply the derivative of the cumulative distribution function (CDF):

$\displaystyle f(x) = \frac{dF(x)}{dx} \; \; \Longleftrightarrow \; \; F(x) = \int_{-\infty}^{x} f(\psi) \, d\psi$ (2)
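For instance (an illustrative example), an exponential distribution with rate $\lambda > 0$ gives the pair

$\displaystyle F(x) = 1 - e^{-\lambda x} \;\; (x \geq 0) \quad \Longrightarrow \quad f(x) = \frac{dF(x)}{dx} = \lambda e^{-\lambda x}$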

We can rewrite (2) as a linear mapping [WGS+99]:

$\displaystyle \int_{-\infty}^{+\infty} \mathbb{I}_{\psi \leq x} \, f(\psi) \, d\psi = F(x) \; \; \Longleftrightarrow \; \; \mathcal{A} f(x) = F(x)$ (3)

where both integrals in (2) and (3) are vector integrations and $\mathcal{A}$ is an injective mapping from $\mathcal{H}_f$ to $\mathcal{H}_F$, the Hilbert space in which $F$ is defined. Neither $f$ nor $F$ is known (whereas the operator $\mathcal{A}$ and its inverse are well defined), so we begin by estimating $F$ using the samples $\mathcal{S}$ generated by the random process, and then derive $\hat{f}$ from our estimate $\hat{F}$ using an approximation of the inverse of the linear transformation $\mathcal{A}$.
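To make the operator view concrete, here is a minimal numerical sketch (in Python with numpy; the grid, its resolution, and the Gaussian test density are illustrative assumptions, not taken from the text): on a uniform grid, $\mathcal{A}$ becomes a lower-triangular cumulative-sum matrix acting on the vector of density values, and its inverse is well defined.

\begin{verbatim}
import numpy as np

grid = np.linspace(-5.0, 5.0, 1001)    # illustrative uniform grid over X
dx = grid[1] - grid[0]

# Discretized A: integrating f up to each grid point is multiplication by a
# lower-triangular matrix of Riemann weights, so (A f)(x) = F(x) on the grid.
A = np.tril(np.ones((grid.size, grid.size))) * dx

f_true = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)   # test density (standard normal)
F_true = A @ f_true                                  # F = A f

# A has a positive diagonal, hence is invertible: recover f from F exactly.
f_back = np.linalg.solve(A, F_true)
print(np.max(np.abs(f_back - f_true)))               # ~1e-12
\end{verbatim}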

The empirical distribution at $ x$ can be estimated from the data by taking the ratio of the number of samples that are less than or equal to $ x$ to the total number of samples:

$\displaystyle \hat{F}(x) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}_{x_i \leq x}$ (4)

and is an unbiased maximum likelihood estimate that is piece-wise constant. For the density to exist, the estimated distribution $\hat{F}$ must be differentiable and hence continuous, so to smooth out the estimate $\hat{F}$, a non-linear regression (Figure 1) is used to approximate the distribution in the regions where training samples are unavailable. Specifically, the regression is performed on the set of pairs

$\displaystyle \left\{ \left( x_i, \hat{F}(x_i) \right) : \; \; i=1,\cdots,n \right\} $

and is parametrized by the vector $\rho$, which leads to our new estimate for the distribution, $\hat{F}_{\rho}(x)$. Support Vector Regression techniques may also be used to derive accurate regressions, since regularization (see Section [*]) is then possible.
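A minimal sketch of (4) together with the smoothing step (in Python; a smoothing spline from scipy stands in for the non-linear or support-vector regression, and the sample size and smoothing factor are illustrative assumptions):

\begin{verbatim}
import numpy as np
from scipy.interpolate import UnivariateSpline

def empirical_cdf(samples, x):
    """Evaluate (4): the fraction of samples less than or equal to x."""
    samples = np.asarray(samples)
    return np.mean(samples[:, None] <= np.atleast_1d(x), axis=0)

rng = np.random.default_rng(0)
xs = np.sort(rng.normal(size=200))    # iid training samples S (sorted, a.s. distinct)
F_hat = empirical_cdf(xs, xs)         # the pairs (x_i, F_hat(x_i)) to regress on

# Stand-in for the non-linear regression F_rho: a smoothing spline fitted to
# the pairs above (Support Vector Regression could be used instead).
F_rho = UnivariateSpline(xs, F_hat, s=0.02)   # s: smoothing strength (illustrative)

print(F_rho(0.0))   # smoothed estimate of F(0); near 0.5 for a standard normal
\end{verbatim}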

Now, to estimate the density function at $x$, we need the derivative of $\hat{F}_{\rho}(x)$, which can be approximated by taking the difference between two evaluations of the distribution function at offsets $+h$ and $-h$ from $x$:

Figure: Estimating the Density and Distribution Functions: [left] A linear interpolation on $\hat{F}_{\rho}$ between the end points of the region $\mathcal{R} = [x-h, x+h]$ gives an estimate of the slope $\hat{f}(x)$. [right] A non-linear regression on evaluations of (4) at all training samples, in this case on $(x_1, \hat{F}(x_1)), (x_2, \hat{F}(x_2)), \cdots, (x_6, \hat{F}(x_6))$, yields a smooth estimate of the distribution function.
\includegraphics[scale=1]{image_density_interpolation.eps}

Figure 2: Frequency Histogram Estimation: The true unknown density (top left) can be estimated by taking random samples (top right, 1000 random samples) and placing them in bins of fixed length to generate a histogram. Histograms with bin sizes $h=0.2$, $h=0.1$, $h=0.01$ and $h=0.001$ are shown. The bin size or bandwidth (as well as the actual placement of the bins) is an important parameter in estimating the density function; in this case only a bin size of $h=0.01$ is able to capture the multi-modality of the true density. In the limit, as the bandwidth goes to zero and the number of samples goes to infinity, the histogram converges to the true density.
\includegraphics[scale=0.5]{image_histogram_original_curve_01.eps} \includegraphics[scale=0.5]{image_histogram_random_samples_01.eps} \includegraphics[scale=0.5]{image_histogram_02_01.eps} \includegraphics[scale=0.5]{image_histogram_01_01.eps} \includegraphics[scale=0.5]{image_histogram_001_01.eps} \includegraphics[scale=0.5]{image_histogram_0001_01.eps}
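The histogram construction described in the caption takes only a few lines; a sketch in Python with numpy, where the bimodal mixture used as the "true" density is an illustrative assumption:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
# Illustrative bimodal "true" density: an equal mixture of two normals.
samples = np.concatenate([rng.normal(-1.0, 0.3, 500),
                          rng.normal(+1.0, 0.3, 500)])

for h in (0.2, 0.1, 0.01, 0.001):
    edges = np.arange(samples.min(), samples.max() + h, h)  # fixed-length bins
    counts, _ = np.histogram(samples, bins=edges)
    density = counts / (len(samples) * h)   # normalize so the bars integrate to one
    print(f"h={h}: {len(counts)} bins, highest bar {density.max():.2f}")
\end{verbatim}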


\begin{align*}
\hat{f}(x) &= \mathcal{A}^{-1} \hat{F}_{\rho}(x) \\
&= \frac{1}{2h} \int_{x-h}^{x+h} d\hat{F}_{\rho}(\psi) \tag{5} \\
&= \frac{1}{2h}\left( \hat{F}_{\rho}(x+h) - \hat{F}_{\rho}(x-h) \right) \\
&= \frac{1}{2h}\left( \frac{1}{n}\sum_{i=1}^n \mathbb{I}_{x_i \leq x+h} - \frac{1}{n}\sum_{i=1}^n \mathbb{I}_{x_i \leq x-h} \right) \\
&= \frac{1}{2h} \cdot \frac{1}{n}\sum_{i=1}^n \left( \mathbb{I}_{x_i \leq x+h} - \mathbb{I}_{x_i \leq x-h} \right) \\
&= \frac{1}{2h} \cdot \frac{1}{n}\sum_{i=1}^n \mathbb{I}_{x-h \; < \; x_i \; \leq \; x+h} \\
&= \frac{1}{V} \times \frac{k(x)}{n}
\end{align*}

where $k(x) = \sum_{i=1}^n \mathbb{I}_{x-h \; < \; x_i \; \leq \; x+h}$ is the number of samples that fall in the region $\mathcal{R} = [x-h, x+h]$ and $V$ is the volume of $\mathcal{R}$, which in this case is simply $2h$.
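Equation (5) is exactly a uniform ("box") kernel estimator; a direct sketch in Python (the sample data, evaluation point and bandwidth are illustrative choices):

\begin{verbatim}
import numpy as np

def box_density(samples, x, h):
    """Estimate f(x) via (5): k(x) samples in R = [x-h, x+h], divided by n*V."""
    samples = np.asarray(samples)
    k = np.sum(np.abs(samples - x) <= h)   # k(x): number of samples in R
    V = 2 * h                              # volume of R
    return k / (len(samples) * V)

rng = np.random.default_rng(2)
xs = rng.normal(size=5000)
print(box_density(xs, 0.0, 0.1))   # near 1/sqrt(2*pi) = 0.399 for a standard normal
\end{verbatim}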

The shape of the region $\mathcal{R}$, and hence its volume $V$, can be adjusted as more random samples become available. It has been shown [DHS01] that as the regions $\mathcal{R}$ get smaller ( $\lim_{n \to \infty} V = 0$) and the number of samples in the region increases ( $\lim_{n \to \infty} k = \infty$), the estimated density $\hat{f}(x)$ will converge to $f(x)$ provided that $\lim_{n \to \infty} k/n = 0$ (that is to say, the proportion of samples falling within the region is vanishingly small); in the limit it will be a smooth density function.
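These conditions can be satisfied by shrinking the window with $n$; a sketch in Python, where the rate $h_n = n^{-1/5}$ is a common textbook choice rather than anything prescribed by this text:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
true_f0 = 1 / np.sqrt(2 * np.pi)    # standard normal density at x = 0

for n in (10**2, 10**4, 10**6):
    h = n ** (-1 / 5)               # V = 2h -> 0, k ~ n*h -> infinity, k/n -> 0
    xs = rng.normal(size=n)
    k = np.sum(np.abs(xs) <= h)     # k(0): samples falling in [-h, +h]
    print(f"n={n}: f_hat(0) = {k / (n * 2 * h):.4f}  (true {true_f0:.4f})")
\end{verbatim}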

