Kernel smoother

 weicat 2011-10-25
From Wikipedia, the free encyclopedia

A kernel smoother is a statistical technique for estimating a real-valued function f(X)\,\,\left( X\in \mathbb{R}^{p} \right) from its noisy observations when no parametric model for the function is known. The estimated function is smooth, and the level of smoothness is set by a single parameter.

This technique is most appropriate for low-dimensional (p < 3) data visualization. In effect, the kernel smoother represents a set of irregular data points as a smooth line or surface.


Definitions

Let K_{h_\lambda}(X_0 ,X) be a kernel defined by

K_{h_\lambda}(X_0 ,X) = D\left( \frac{\left\| X-X_0 \right\|}{h_\lambda (X_0)} \right)

where:

  • X,X_0 \in \mathbb{R}^p
  • \left\| \cdot  \right\| is the Euclidean norm
  • h_\lambda (X_0) is a parameter (the kernel radius)
  • D(t) is typically a positive real-valued function whose value decreases (or at least does not increase) as the distance between X and X_0 increases.

Popular kernels used for smoothing include the parabolic (Epanechnikov), tricube, and Gaussian kernels.
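As an illustration, three commonly used choices of D(t) can be sketched in Python with NumPy. The function names are chosen here for illustration only, and the tricube kernel is written without a normalizing constant; this is an assumption that does no harm in the kernel-weighted average, where any constant factor in D cancels between numerator and denominator.

```python
import numpy as np

def epanechnikov(t):
    """Parabolic (Epanechnikov) kernel: 3/4 * (1 - t^2) for |t| <= 1, else 0."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def tricube(t):
    """Tricube kernel (unnormalized): (1 - |t|^3)^3 for |t| <= 1, else 0."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, (1.0 - np.abs(t)**3)**3, 0.0)

def gaussian(t):
    """Gaussian kernel: the standard normal density, with unbounded support."""
    t = np.asarray(t, dtype=float)
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)
```

The first two kernels vanish outside |t| <= 1, so only points within the kernel radius contribute. The Gaussian kernel gives every point a positive weight, with the radius acting as a standard deviation.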

Let Y(X):\mathbb{R}^p \to \mathbb{R} be a continuous function of X. For each X_0 \in \mathbb{R}^p, the Nadaraya-Watson kernel-weighted average (a smooth estimate of Y(X)) is defined by

\hat{Y}(X_{0})=\frac{\sum\limits_{i=1}^{N}{K_{h_{\lambda }}(X_{0},X_{i})Y(X_{i})}}{\sum\limits_{i=1}^{N}{K_{h_{\lambda }}(X_{0},X_{i})}}

where:

  • N is the number of observed points
  • Y(X_i) are the observations at the points X_i.
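The kernel-weighted average above can be sketched directly in Python. This is a minimal illustration rather than a reference implementation: the names nadaraya_watson and gaussian_profile are hypothetical, the radius h is assumed to be a positive scalar already evaluated at X_0, and returning NaN when all weights are zero is a choice made here, not part of the definition.

```python
import numpy as np

def gaussian_profile(t):
    """Example kernel profile D(t); constant factors cancel in the ratio."""
    return np.exp(-0.5 * np.asarray(t, dtype=float) ** 2)

def nadaraya_watson(x0, X, y, D, h):
    """Kernel-weighted average of y at the query point x0.

    X: N observed points (shape (N,) or (N, p)), y: N observations,
    D: kernel profile D(t), h: kernel radius h_lambda(x0) > 0.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)   # N x p
    x0 = np.asarray(x0, dtype=float).reshape(-1)         # length p
    dist = np.linalg.norm(X - x0, axis=1)                # ||X_i - X_0||
    w = D(dist / h)                                      # K(X_0, X_i)
    total = w.sum()
    if total == 0.0:
        return float("nan")   # no point carries weight (assumed convention)
    return float(np.dot(w, y) / total)

# Usage: estimate Y at X_0 = 0 from two symmetric observations.
estimate = nadaraya_watson(0.0, [-1.0, 1.0], [0.0, 2.0], gaussian_profile, h=1.0)
```

Because both observations are equally far from X_0, they receive equal weights and the estimate is their plain average.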

In the following sections, we describe some particular cases of kernel smoothers.

Nearest neighbor smoother

The idea of the nearest neighbor smoother is as follows. For each point X_0, take the m nearest neighbors and estimate the value of Y(X_0) by averaging the values of these neighbors.

Formally, h_m (X_0)=\left\| X_0 - X_{[m]} \right\|, where X_{[m]} is the m-th closest neighbor of X_0, and

D(t)= \begin{cases}
1/m & \text{if } |t| \le 1 \\
0 & \text{otherwise}
\end{cases}

Example:

[Figure: NNSmoother.svg]

In this example, X is one-dimensional. For each X_0, \hat{Y}(X_0) is the average of the 16 points closest to X_0 (shown in red). The result is not smooth enough, because the estimate jumps whenever the set of nearest neighbors changes.
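A nearest neighbor smoother along these lines might look as follows. The function name knn_smoother is hypothetical, and the comparison dist <= h is used in place of dist / h <= 1 so that the case h = 0 (a query point that coincides with a data point when m = 1) needs no special handling. Points tied at the m-th distance are all included, which is one possible convention among several.

```python
import numpy as np

def knn_smoother(x0, X, y, m):
    """Average of the observations at the m points nearest to x0."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)   # N x p
    x0 = np.asarray(x0, dtype=float).reshape(-1)
    dist = np.linalg.norm(X - x0, axis=1)
    h = np.sort(dist)[m - 1]     # h_m(x0): distance to the m-th closest point
    inside = dist <= h           # same set as D(dist / h) > 0
    return float(y[inside].mean())

# Usage: the outlier at X = 10 is ignored when m = 3 and x0 = 1.
X = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
y = np.array([0.0, 1.0, 2.0, 3.0, 100.0])
estimate = knn_smoother(1.0, X, y, m=3)
```

Here the three nearest points to X_0 = 1 are 0, 1 and 2, so the estimate is the mean of their observations.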

Kernel average smoother

The idea of the kernel average smoother is as follows. For each data point X_0, choose a constant distance λ (the kernel radius, or window width for p = 1), and compute a weighted average over all data points that are closer than λ to X_0, with points closer to X_0 receiving higher weights.

Formally, h_\lambda (X_0) = \lambda = constant, and D(t) is one of the popular kernels.

Example:

[Figure: KernelSmoother.svg]

For each X_0 the window width is constant, and the weight of each point in the window is shown schematically by the yellow figure in the graph. The estimate is smooth, but it is biased at the boundary points. The reason is that, when X_0 is close enough to the boundary, the window contains unequal numbers of points to the left and to the right of X_0.
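The boundary bias can be reproduced with a small numerical sketch. The setup is an assumption made for illustration: noise-free data Y = X on an evenly spaced grid in [0, 1], a constant radius λ = 0.2, and the Epanechnikov kernel. With no noise, any deviation of the estimate from the true value is pure bias.

```python
import numpy as np

def epanechnikov(t):
    """Parabolic kernel, zero outside |t| <= 1."""
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def kernel_average(x0, X, y, lam):
    """Kernel average smoother for p = 1 with constant window width lam."""
    w = epanechnikov(np.abs(X - x0) / lam)
    return float(np.sum(w * y) / np.sum(w))

X = np.linspace(0.0, 1.0, 101)   # evenly spaced design points
y = X.copy()                     # true function Y(X) = X, no noise
lam = 0.2

interior = kernel_average(0.5, X, y, lam)    # window [0.3, 0.7], symmetric
left_edge = kernel_average(0.0, X, y, lam)   # window [0.0, 0.2], one-sided

print(f"estimate at 0.5: {interior:.4f} (true value 0.5)")
print(f"estimate at 0.0: {left_edge:.4f} (true value 0.0)")
```

In the interior the window is symmetric and the weighted average recovers the true value. At the left edge every point in the window lies to the right of X_0, so the average is pulled above the true value of 0.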

Local linear regression

In the two previous sections we assumed that the underlying function Y(X) is locally constant, which is why a weighted average could be used for the estimate. The idea of local linear regression is to fit a straight line locally (or a hyperplane in higher dimensions) rather than a constant (a horizontal line). After the line is fitted, the estimate \hat{Y}(X_{0}) is the value of this line at the point X_0. Repeating this procedure for each X_0 yields the estimated function \hat{Y}(X). As in the previous section, the window width is constant: h_\lambda (X_0) = \lambda = constant. Formally, local linear regression is computed by solving a weighted least squares problem.

For one dimension (p = 1):

\begin{align}
  & \min_{\alpha (X_0),\beta (X_0)} \sum\limits_{i=1}^N {K_{h_{\lambda }}(X_0,X_i)\left( Y(X_i)-\alpha (X_0)-\beta (X_{0})X_i \right)^2} \\
  & \Downarrow \\
  & \hat{Y}(X_{0})=\alpha (X_{0})+\beta (X_{0})X_{0} \\
\end{align}

The closed form solution is given by:

\hat{Y}(X_0)=\left( 1,X_0 \right)\left( B^{T}W(X_0)B \right)^{-1}B^{T}W(X_0)y

where:

  • y=\left( Y(X_1),\dots,Y(X_N) \right)^T
  • W(X_0)= \operatorname{diag} \left( K_{h_{\lambda }}(X_0,X_i) \right)_{N\times N}
  • B^{T}=\left( \begin{matrix}
   1 & 1 & \dots & 1 \\
   X_{1} & X_{2} & \dots & X_{N} \\
\end{matrix} \right)

Example:

[Figure: Localregressionsmoother.svg]

The resulting function is smooth, and the bias at the boundary points is reduced.
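The closed form solution for p = 1 can be sketched as below. The function name local_linear, the Epanechnikov kernel and the noise-free test data are assumptions made for illustration. The sketch solves the 2 x 2 weighted normal equations instead of forming an explicit inverse, and it assumes at least two distinct points receive positive weight, since otherwise B^T W B is singular.

```python
import numpy as np

def epanechnikov(t):
    """Parabolic kernel, zero outside |t| <= 1."""
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def local_linear(x0, X, y, lam):
    """Local linear regression estimate at x0 for one-dimensional X."""
    w = epanechnikov(np.abs(X - x0) / lam)        # diagonal of W(x0)
    B = np.column_stack([np.ones_like(X), X])     # N x 2 design matrix
    BtW = B.T * w                                 # B^T W without building diag(w)
    coef = np.linalg.solve(BtW @ B, BtW @ y)      # (alpha(x0), beta(x0))
    return float(np.array([1.0, x0]) @ coef)      # (1, x0) applied to the fit

# Noise-free linear data: a local line should reproduce it everywhere.
X = np.linspace(0.0, 1.0, 101)
y = 2.0 + 3.0 * X

edge_estimate = local_linear(0.0, X, y, lam=0.2)   # true value is 2.0
```

On exactly linear data the weighted least squares fit recovers the line itself, so the estimate at the boundary matches the true value even though the window is one-sided there. For curved functions some bias remains.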

Local polynomial regression

Instead of fitting locally linear functions, one can fit polynomial functions.

For p=1, one should minimize:

\underset{\alpha (X_{0}),\beta _{j}(X_{0}),j=1,...,d}{\mathop{\min }}\,\sum\limits_{i=1}^{N}{K_{h_{\lambda }}(X_{0},X_{i})\left( Y(X_{i})-\alpha (X_{0})-\sum\limits_{j=1}^{d}{\beta _{j}(X_{0})X_{i}^{j}} \right)^{2}}

with \hat{Y}(X_{0})=\alpha (X_{0})+\sum\limits_{j=1}^{d}{\beta _{j}(X_{0})X_{0}^{j}}

In the general case (p>1), one should minimize:

\begin{align}
  & \hat{\beta }(X_{0})=\underset{\beta (X_{0})}{\mathop{\arg \min }}\,\sum\limits_{i=1}^{N}{K_{h_{\lambda }}(X_{0},X_{i})\left( Y(X_{i})-b(X_{i})^{T}\beta (X_{0}) \right)}^{2} \\ 
 & b(X)=\left( \begin{matrix}
   1, & X_{1}, & X_{2},\dots & X_{1}^{2}, & X_{2}^{2},\dots & X_{1}X_{2},\dots \\
\end{matrix} \right) \\ 
 & \hat{Y}(X_{0})=b(X_{0})^{T}\hat{\beta }(X_{0}) \\ 
\end{align}
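For p = 1 the local polynomial fit can be sketched as a weighted least squares problem on the basis 1, X, ..., X^d. The function name local_polynomial, the default degree and the unnormalized tricube kernel are assumptions for illustration. The sketch also assumes that at least d + 1 distinct points fall inside the window, so that the normal equations have a unique solution.

```python
import numpy as np

def local_polynomial(x0, X, y, lam, degree=2):
    """Local polynomial regression estimate at x0 for one-dimensional X."""
    t = np.abs(X - x0) / lam
    w = np.where(t <= 1.0, (1.0 - t**3) ** 3, 0.0)   # tricube weights
    B = np.vander(X, degree + 1, increasing=True)    # columns 1, X, ..., X^d
    BtW = B.T * w                                    # B^T W
    coef = np.linalg.solve(BtW @ B, BtW @ y)         # alpha, beta_1, ..., beta_d
    b0 = x0 ** np.arange(degree + 1)                 # basis evaluated at x0
    return float(b0 @ coef)

# Noise-free quadratic data: a local quadratic fit should reproduce it.
X = np.linspace(0.0, 1.0, 101)
y = 1.0 - 2.0 * X + 3.0 * X**2

estimate = local_polynomial(0.3, X, y, lam=0.25, degree=2)
```

With degree=1 this reduces to local linear regression, and with degree=0 to the kernel average smoother.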

References

  • Li, Q. and J.S. Racine. Nonparametric Econometrics: Theory and Practice. Princeton University Press, 2007, ISBN 0691121613.
  • T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, Chapter 6, Springer, 2001. ISBN 0387952845.
