Sunday, 26 July 2020

Decomposing Bregman divergences

For a PDF of this post, see here.

Here are a couple of neat little results about Bregman divergences that I just happened upon. They might help to prove some more decomposition theorems along the lines of this classic result by Morris DeGroot and Stephen Fienberg and, more recently, this paper by my colleagues in the computer science department at Bristol. I should say that a lot is known about Bregman divergences because of their role in information geometry, so these results are almost certainly known already, but I don't know where.

Refresher on Bregman divergences

First up, what's a divergence? It's essentially generalization of the notion of a measure of distance from one point to another. The points live in some closed convex subset $\mathcal{X} \subseteq \mathbf{R}^n$. A divergence is a function $D : \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty]$ such that
  • $D(x, y) \geq 0$, for all $x$, $y$ in $\mathcal{X}$, and
  • $D(x, y) = 0$ iff $x = y$.
Note: We do not assume that a divergence is symmetric. So the distance from $x$ to $y$ need not be the same as the distance from $y$ to $x$. That is, we do not assume $D(x, y) = D(y, x)$ for all $x$, $y$ in $\mathcal{X}$. Indeed, among the family of divergences that we will consider -- the Bregman divergences -- only one is symmetric -- the squared Euclidean distance. And we do not assume the triangle inequality. That is, we don't assume that the divergence from $x$ to $z$ is at most the sum of the divergence from $x$ to $y$ and the divergence from $y$ to $z$. That is, we do not assume $D(x, z) \leq D(x, y) + D(y, z)$. Indeed, the conditions under which $D(x, z) = D(x, y) + D(y, z)$ for a Bregman divergence $D$ will be our concern here.

So, what's a Bregman divergence? $D : \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty]$ is a Bregman divergence if there is a strictly convex function $\Phi : \mathcal{X} \rightarrow \mathbb{R}$ that is differentiable on the interior of $\mathcal{X}$ such that$$D(x, y) = \Phi(x) - \Phi(y) - \nabla \Phi(y) (x-y)$$In other words, to find the divergence from $x$ to $y$, you go to $y$, find the tangent to $\Phi$ at $y$. Then hop over to $x$ and subtract the value at $x$ of the tangent you just drew at $y$ from the value at $x$ of $\Phi$. That is, you subtract $\nabla \Phi(y) (x-y) + \Phi(y)$ from $\Phi(x)$. Because $\Phi$ is convex, it is always curving away from the tangent, and so $\nabla \Phi(y) (x-y) + \Phi(y)$, the value at $x$ of the tangent you drew at $y$, is always less than $\Phi(x)$, the value at $x$ of $\Phi$.

The two most famous Bregman divergences are:
  • Squared Euclidean distance. Let $\Phi(x) = ||x||^2 = \sum_i x_i^2$, in which case$$D(x, y) = ||x-y||^2 = \sum_i (x_i - y_i)^2$$
  • Generalized Kullback-Leibler divergence. Let $\Phi(x) = \sum_i x_i \log x_i$, in which case$$D(x, y) = \sum_i x_i\log\frac{x_i}{y_i} - x_i + y_i$$
Bregman divergences are convex in the first argument. Thus, we can define, for $z$ in $\mathcal{X}$ and for a closed convex subset $C \subseteq \mathcal{X}$, the $D$-projection of $z$ into $C$ is the point $\pi_{z, C}$ in $C$ such that $D(y, z)$ is minimized, as a function of $y$, at $y = \pi_{z, C}$. Now, we have the following theorem about Bregman divergences, due to Imre Csiszár:

Theorem (Generalized Pythagorean Theorem) If $\mathcal{C} \subseteq \mathcal{X}$ is closed and convex, then$$D(x, \pi_{z, C}) + D(\pi_{z, C}, z) \leq D(x, z)$$

Decomposing Bregman divergences

This invites the question: when does equality hold? The following result gives a particular class of cases, and in doing so provides us with a recipe for creating decompositions of Bregman divergences into their component parts. Essentially, it says that the above inequality is an equality if $C$ is a plane in $\mathbb{R}^n$.

Theorem 1  Suppose $r$ is in $\mathbb{R}$ and $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\sum_i \alpha_i = 1$. Then let $C := \{(x_1, \ldots, x_n) : \sum_i \alpha_ix_i = r\}$. Then if $z$ in $\mathcal{X}$ and $x$ is in $C$,$$D_\Phi(x, z) = D_\Phi(x, \pi_{z, C}) + D_\Phi(\pi_{z, C}, z)$$

Proof of Theorem 1.  We begin by showing:

Lemma 1 For any $x$, $y$, $z$ in $\mathcal{X}$,$$D_\Phi(x, z) = D_\Phi(x, y) + D_\Phi(y, z) \Leftrightarrow (\nabla \Phi(y) - \nabla \Phi(z))(x-y) = 0$$

Proof of Lemma 1.  $$D_\Phi(x, z) = D_\Phi(x, y) + D_\Phi(y, z)$$iff

$\Phi(x) - \Phi(z) - \nabla(z)(x-z)$

$= \Phi(x) - \Phi(y) - \nabla(y)(x-y) + \Phi(y) - \Phi(z) - \nabla(z)(y-z)$

iff$$(\nabla \Phi(y) - \nabla \Phi(z))(x-y) = 0$$as required.

Return to Proof of Theorem 1. Now we show that if $x$ is in $C$, then$$(\nabla \Phi(\pi_{z, C}) - \Phi(z))(x-\pi_{z, C}) = 0$$We know that $D(y, z)$ is minimized on $C$, as a function of $y$, at $y = \pi_{z, C}$. Thus, let $y = \pi_{z, C}$. And let $h(x) := \sum_i \alpha_ix^i - r$. Then $\frac{\partial}{\partial x_i} h(x) = \alpha_i$. So, by the KKT conditions, there is $\lambda$ such that,$$\nabla \Phi(y) - \nabla \Phi(z) + (\lambda \alpha_1, \ldots, \lambda \alpha_n) = (0, \ldots, 0)$$Thus,$$\frac{\partial}{\partial y_i} \Phi(y) - \frac{\partial}{\partial z_i} \Phi(z) = -\lambda \alpha_i$$for all $i = 1, \ldots, n$.

Thus, finally, 
& &(\nabla \Phi(y) - \nabla \Phi(z))(x-y) \\
& = & \sum_i \left (\frac{\partial}{\partial y_i} \Phi(y) - \frac{\partial}{\partial z_i} \Phi(z)\right )(x_i-y_i) \\
& = &  \sum_i (-\lambda \alpha_i) (x_i - y_i) \\
& = & -\lambda \left (\sum_i \alpha_i x_i - \sum_i \alpha_i y_i\right ) \\
& = & -\lambda (r-r) \\
& = & 0
as required. $\Box$

Theorem 2  Suppose $1 \leq k \leq n$. Let $C := \{(x_1, \ldots, x_n) : x_1 = x_2 = \ldots = x_k\}$. Then if $z$ in $\mathcal{X}$ and $x$ is in $C$,$$D_\Phi(x, z) = D_\Phi(x, \pi_{z, C}) + D_\Phi(\pi_{z, C}, z)$$

Proof of Theorem 2. We know that $D(y, z)$ is minimized on $C$, as a function of $y$, at $y = \pi_{z, C}$. Thus, let $y = \pi_{z, C}$. And let $h_i(x) := x_{i+1} - x_i$, for $i = 1, \ldots, k-1$. Then$$\frac{\partial}{\partial x_j} h_i(x) = \left \{ \begin{array}{ll} 1 & \mbox{if } i+1 = j \\ -1 & \mbox{if } i = j \\ 0 & \mbox{otherwise}\end{array} \right.$$ So, by the KKT conditions, there are $\lambda_1, \ldots, \lambda_k$ such that,

$\nabla \Phi(y) - \nabla \Phi(z)$

$+ (-\lambda_1, \lambda_1, 0, \ldots, 0) + (0, -\lambda_2, \lambda_2, 0, \ldots, 0) + \ldots$

$+ (0, \ldots, 0, -\lambda_k, \lambda_k, 0, \ldots, 0) = (0, \ldots, 0)$

Thus,$$\begin{eqnarray*}\frac{\partial}{\partial y_1} \Phi(y) - \frac{\partial}{\partial z_1} \Phi(z) & = & - \lambda_1 \\ \frac{\partial}{\partial y_2} \Phi(y) - \frac{\partial}{\partial z_2} \Phi(z) & = & \lambda_1 - \lambda_2 \\ \vdots & \vdots & \vdots \\ \frac{\partial}{\partial y_{k-1}} \Phi(y) - \frac{\partial}{\partial z_{k-1}} \Phi(z) & = & \lambda_{k-2}- \lambda_{k-1} \\ \frac{\partial}{\partial y_k} \Phi(y) - \frac{\partial}{\partial z_k} \Phi(z) & = & \lambda_{k-1} \\ \frac{\partial}{\partial y_{k+1}} \Phi(y) - \frac{\partial}{\partial z_{k+1}} \Phi(z) & = & 0 \\ \vdots & \vdots & \vdots \\ \frac{\partial}{\partial y_n} \Phi(y) - \frac{\partial}{\partial z_n} \Phi(z) & = & 0 \end{eqnarray*}$$

Thus, finally, 
& &(\nabla \Phi(y) - \nabla \Phi(z))(x-y) \\
& = & \sum_i \left (\frac{\partial}{\partial y_i} \Phi(y) - \frac{\partial}{\partial z_i} \Phi(z)\right )(x_i-y_i) \\
& = & -\lambda_1(x_1-y_1) + (\lambda_1 - \lambda_2)(x_2-y_2) + \ldots \\
&& + (\lambda_{k-2} - \lambda_{k-1})(x_{k-1}-y_{k-1}) + \lambda_{k-1}(x_k-y_k) \\
&& + 0(x_{k+1} - y_{k+1}) + \ldots + 0 (x_n - y_n) \\
& = & \sum^{k-1}_{i=1} \lambda_i (x_{i+1} - x_i) + \sum^{k-1}_{i=1} \lambda_i (y_i - y_{i+1})\\
& = & 0
as required. $\Box$

DeGroot and Fienberg's calibration and refinement decomposition

To obtain these two decomposition results, we needed to assume nothing more than that $D$ is a Bregman divergence. The classic result by DeGroot and Fienberg requires a little more. We can see this by considering a very special case of it. Suppose $(X_1, \ldots, X_n)$ is a sequence of propositions that forms a partition. And suppose $w$ is a possible world. Then we can represent $w$ as the vector $w = (0, \ldots, 0, 1, 0, \ldots, 0)$, which takes value 1 at the proposition that is true in $w$ and 0 everywhere else. Now suppose $c = (c, \ldots, c)$ is an assignment of the same credence to each proposition. Then one very particular case of DeGroot and Fienberg's result says that, if $(0, \ldots, 0, 1, 0, \ldots, 0)$ is the world at which $X_i$ is true, then

$D((0, \ldots, 0, 1, 0, \ldots, 0), (c, \ldots, c))$

$= D((0, \ldots, 0, 1, 0, \ldots, 0), (\frac{1}{n}, \ldots, \frac{1}{n})) + D((\frac{1}{n}, \ldots, \frac{1}{n}), (c, \ldots, c))$

Now, we know from Lemma 1 that this is true iff$$(\nabla \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \nabla \Phi(c, \ldots, c))((0, \ldots, 0, 1, 0, \ldots, 0) - (\frac{1}{n}, \ldots, \frac{1}{n})) = 0$$which is true iff

$\left ( \frac{\partial}{\partial x_i} \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \frac{\partial}{\partial x_i} \Phi(c, \ldots, c) \right )$

$= \frac{1}{n} \sum^n_{j=1} \left ( \frac{\partial}{\partial x_j} \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \frac{\partial}{\partial x_j} \Phi(c, \ldots, c) \right )$

and that is true iff

$\frac{\partial}{\partial x_i} \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \frac{\partial}{\partial x_i} \Phi(c, \ldots, c)$

$= \frac{\partial}{\partial x_j} \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \frac{\partial}{\partial x_j} \Phi(c, \ldots, c)$

for all $1 \leq i, j, \leq n$, which is true iff, for any $x$, $1 \leq i, j \leq n$,$$\frac{\partial}{\partial x_i} \Phi(x, \ldots, x) = \frac{\partial}{\partial x_j} \Phi(x, \ldots, x)$$Now, this is true if $\Phi(x_1, \ldots, x_n) = \sum^n_{i=1} \varphi(x_i)$ for some $\varphi$. That is, it is true if $D$ is an additive Bregman divergence. But it is also true for certain non-additive Bregman divergences, such as the one generated from the log-sum-exp function:

Definition (log-sum-exp) Suppose $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\sum^n_{i=1} \alpha_i = 1$. Then let $$\Phi^A(x_1, \ldots, x_n) = \log(1 + \alpha_1e^{x_1} + \ldots \alpha_ne^{x_n})$$Then
$$D(x, y) = \log (1 + \sum_i \alpha_ie^{x_i}) - \log(1 + \sum_i \alpha_ie^{y_i}) - \sum_k \frac{\alpha_k(x_k - y_k)e^{y_k}}{1 + \sum_i \alpha_ie^{y_i}}$$

Now$$\frac{\partial}{\partial x_i} \Phi^A(x_1, \ldots, x_n) = \frac{\alpha_i e^{x_i}}{1 + \alpha_1 e^{x_1} + \ldots + \alpha_ne^{x_n}}$$So, if $\alpha_i = \alpha_j$ for all $1 \leq i, j \leq n$, then$$\frac{\partial}{\partial x_i} \Phi^A(x, \ldots, x) = \frac{\alpha e^x}{1 + e^x} = \frac{\partial}{\partial x_j} \Phi^A(x, \ldots, x)$$But if $\alpha_i \neq \alpha_j$ for some $1 \leq i, j \leq n$, then$$\frac{\partial}{\partial x_i} \Phi^A(x, \ldots, x) = \frac{\alpha_ie^x}{1 + e^x} \neq \frac{\alpha_je^x}{1 + e^x} = \frac{\partial}{\partial x_j} \Phi^A(x, \ldots, x)$$

And indeed, the result even fails if we have a semi-additive Bregman divergence. That is, there are different $\phi_1, \ldots, \phi_n$ such that $\Phi(x) = \sum^n_{i=1} \phi_i(x_i)$. For instance, suppose $\phi_1(x) = x^2$ and $\phi_2(x) = x\log x$ and $\Phi(x, y) = \phi_1(x) +  \phi_2(y) = x^2 + y\log y$. Then$$\frac{\partial}{\partial x_1} \Phi(x, x) = 2x \neq 1 + \log x = \frac{\partial}{\partial x_2} \Phi(x, x)$$

Proving the Generalized Pythagorean Theorem

In this section, I really just spell out in more detail the proof that Predd, et al. give of the Generalized Pythagorean Theorem, which is their Proposition 3. But that proof contains some important general facts that might be helpful for people working with Bregman divergences. I collect these together here into one lemma.

Lemma 2  Suppose $D$ is a Bregman divergence generated from $\Phi$. And suppose $x, y, z \in \mathcal{X}$. Then$$\begin{eqnarray*} & & D(x, z) - [D(x, y) + D(y, z)] \\ & = & (\nabla \Phi(y) - \nabla \Phi(z))(x - y) \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [D(y + \varepsilon (x - y), z) - D(y, z)] \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z)] \end{eqnarray*}$$

We can then prove the Generalized Pythagorean Theorem easily. After all, if $x$ is in a closed convex set $C$ and $y$ is the point in $C$ that minimizes $D(y, z)$ as a function of $y$. Then, for all $0 \leq \varepsilon \leq 1$, $\varepsilon x + (1-\varepsilon)y$ is in $C$. And since $y$ minimizes,$$D(\varepsilon x + (1-\varepsilon)y, z) \geq D(y, z)$$. So $D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z) \geq 0$. So $$\lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon}D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z) \geq 0$$So, by Lemma 2,$$D(x, z) \geq D(x, y) + D(y, z)$$

Proof of Lemma 2.  $$\begin{eqnarray*} && \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z)] \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [(\Phi(\varepsilon x + (1-\varepsilon)y) - \Phi(z) - \nabla \Phi(z)(\varepsilon x + (1-\varepsilon)y - z)) - \\ & & (\Phi(y) - \Phi(z) - \nabla\Phi(z)(y-z))]\\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [(\Phi(\varepsilon x + (1-\varepsilon)y) - \Phi(y) - \varepsilon\nabla \Phi(z)(x -y)] \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [(\Phi(\varepsilon x + (1-\varepsilon)y) - \Phi(y)]  - \nabla \Phi(z)(x -y) \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [(\Phi(y + \varepsilon (x -y)) - \Phi(y)]  - \nabla \Phi(z)(x -y) \\ & = & \nabla \Phi(y)(x -y) - \nabla \Phi(z)(x -y) \\ & = & (\nabla \Phi(y) - \nabla \Phi(z))(x -y)\end{eqnarray*}$$

1 comment:

  1. In the paragraph before the statement of the Generalized Pythagorean Theorem, does the uniqueness of the minimizer need D to be strictly convex in its first argument?