Decomposing Bregman divergences

Posted by Richard Pettigrew July 26, 2020

For a PDF of this post, see here.

Here are a couple of neat little results about Bregman divergences that I just happened upon. They might help to prove some more decomposition theorems along the lines of this classic result by Morris DeGroot and Stephen Fienberg and, more recently, this paper by my colleagues in the computer science department at Bristol. I should say that a lot is known about Bregman divergences because of their role in information geometry, so these results are almost certainly known already, but I don't know where.

Refresher on Bregman divergences

First up, what's a divergence? It's essentially a generalization of the notion of a measure of distance from one point to another. The points live in some closed convex subset $X \subseteq \mathbb{R}^{n}$. A divergence is a function $D : X \times X \to [0, \infty]$ such that

$D (x, y) \geq 0$, for all $x$, $y$ in $X$, and
$D (x, y) = 0$ iff $x = y$.

Note: We do not assume that a divergence is symmetric. So the distance from $x$ to $y$ need not be the same as the distance from $y$ to $x$. That is, we do not assume $D (x, y) = D (y, x)$ for all $x$, $y$ in $X$. Indeed, among the family of divergences that we will consider, the Bregman divergences, only one is symmetric: the squared Euclidean distance. And we do not assume the triangle inequality. That is, we don't assume that the divergence from $x$ to $z$ is at most the sum of the divergence from $x$ to $y$ and the divergence from $y$ to $z$. That is, we do not assume $D (x, z) \leq D (x, y) + D (y, z)$. Indeed, the conditions under which $D (x, z) = D (x, y) + D (y, z)$ for a Bregman divergence $D$ will be our concern here.

So, what's a Bregman divergence? $D : X \times X \to [0, \infty]$ is a Bregman divergence if there is a strictly convex function $Φ : X \to \mathbb{R}$ that is differentiable on the interior of $X$ such that

D (x, y) = Φ (x) - Φ (y) - \nabla Φ (y) (x - y)

In other words, to find the divergence from $x$ to $y$, you go to $y$ and find the tangent to $Φ$ at $y$. Then hop over to $x$ and subtract the value at $x$ of the tangent you just drew at $y$ from the value at $x$ of $Φ$. That is, you subtract $\nabla Φ (y) (x - y) + Φ (y)$ from $Φ (x)$. Because $Φ$ is convex, it is always curving away from the tangent, and so $\nabla Φ (y) (x - y) + Φ (y)$, the value at $x$ of the tangent you drew at $y$, is at most $Φ (x)$, the value at $x$ of $Φ$, with equality only when $x = y$, since $Φ$ is strictly convex.
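As a quick worked instance of this recipe (a one-dimensional illustration of my own choosing, not part of the results below), take $Φ (x) = x^{2}$, $y = 1$ and $x = 3$:

```latex
\begin{aligned}
\text{tangent to } Φ \text{ at } y = 1: &\quad t(u) = Φ(1) + Φ'(1)(u - 1) = 1 + 2(u - 1) \\
\text{value of the tangent at } x = 3: &\quad t(3) = 1 + 2 \cdot 2 = 5 \\
\text{value of } Φ \text{ at } x = 3: &\quad Φ(3) = 9 \\
D(3, 1) &= Φ(3) - t(3) = 9 - 5 = 4 = (3 - 1)^{2}
\end{aligned}
```

So in this case the gap between the curve and the tangent is just the squared distance between the two points.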

The two most famous Bregman divergences are:

Squared Euclidean distance. Let $Φ (x) = \| x \|^{2} = \sum_{i} x_{i}^{2}$, in which case $D (x, y) = \| x - y \|^{2} = \sum_{i} (x_{i} - y_{i})^{2}$.
Generalized Kullback-Leibler divergence. Let $Φ (x) = \sum_{i} x_{i} \log x_{i}$, in which case $D (x, y) = \sum_{i} \left( x_{i} \log \frac{x_{i}}{y_{i}} - x_{i} + y_{i} \right)$.
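Here is a minimal Python sketch of the general definition, checked against the two closed forms just given. The function names (`bregman`, `sq_norm`, `neg_entropy` and so on) and the test vectors are my own illustrative choices, and the entropy example assumes strictly positive coordinates.

```python
import math

def bregman(phi, grad_phi, x, y):
    """D(x, y) = phi(x) - phi(y) - grad_phi(y) . (x - y)."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner

# Squared Euclidean distance: Phi(x) = sum_i x_i^2
def sq_norm(x):
    return sum(xi * xi for xi in x)

def sq_norm_grad(x):
    return [2.0 * xi for xi in x]

# Generalized KL: Phi(x) = sum_i x_i log x_i (strictly positive coordinates assumed)
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def neg_entropy_grad(x):
    return [1.0 + math.log(xi) for xi in x]

x = [0.2, 0.5, 1.3]
y = [0.4, 0.4, 0.9]

# Divergences computed from the generic definition.
d_euclid = bregman(sq_norm, sq_norm_grad, x, y)
d_kl = bregman(neg_entropy, neg_entropy_grad, x, y)

# The closed forms stated in the text.
closed_euclid = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
closed_kl = sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))

# Swapping the arguments changes the generalized KL value: no symmetry.
d_kl_reversed = bregman(neg_entropy, neg_entropy_grad, y, x)
```

The generic and closed-form values agree up to floating-point error, while `d_kl` and `d_kl_reversed` differ, which illustrates the lack of symmetry noted earlier.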

Bregman divergences are strictly convex in the first argument, since $Φ$ is strictly convex and $D (y, z)$, as a function of $y$, differs from $Φ (y)$ only by a term that is affine in $y$. Thus, for $z$ in $X$ and a closed convex subset $C \subseteq X$, we can define the $D$-projection of $z$ into $C$ to be the point $π_{z, C}$ in $C$ such that $D (y, z)$ is minimized, as a function of $y$ in $C$, at $y = π_{z, C}$. Now, we have the following theorem about Bregman divergences, due to Imre Csiszár:

Theorem (Generalized Pythagorean Theorem) If $C \subseteq X$ is closed and convex, then, for any $z$ in $X$ and any $x$ in $C$,

D (x, π_{z, C}) + D (π_{z, C}, z) \leq D (x, z)
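To see that the inequality can be strict, here is a small sketch using squared Euclidean distance and the unit box $[0, 1]^{2}$ as the closed convex set. The set, the points and the helper names are hypothetical choices of mine for illustration; for this particular divergence, the projection onto a box is coordinate-wise clipping.

```python
def sq_dist(x, y):
    """Squared Euclidean distance, the Bregman divergence of Phi(x) = sum_i x_i^2."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def project_onto_unit_box(z):
    # For squared Euclidean distance, projecting onto [0, 1]^n clips each coordinate.
    return [min(1.0, max(0.0, zi)) for zi in z]

z = [2.0, 2.0]
p = project_onto_unit_box(z)   # the projection of z into the box
x = [0.0, 0.0]                 # another point of the box

lhs = sq_dist(x, p) + sq_dist(p, z)   # D(x, p) + D(p, z)
rhs = sq_dist(x, z)                   # D(x, z)
```

Here `lhs` is 4 and `rhs` is 8, so the two sides differ by a wide margin when the set has a corner at the projection.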

Decomposing Bregman divergences

This invites the question: when does equality hold? The following result gives a particular class of cases, and in doing so provides us with a recipe for creating decompositions of Bregman divergences into their component parts. Essentially, it says that the above inequality is an equality if $C$ is a plane in $\mathbb{R}^{n}$.

Theorem 1 Suppose $r$ is in $\mathbb{R}$ and $0 \leq α_{1}, \dots, α_{n} \leq 1$ with $\sum_{i} α_{i} = 1$. Then let $C := \{(x_{1}, \dots, x_{n}) : \sum_{i} α_{i} x_{i} = r\}$. Then if $z$ is in $X$ and $x$ is in $C$,

D_{Φ} (x, z) = D_{Φ} (x, π_{z, C}) + D_{Φ} (π_{z, C}, z)

Proof of Theorem 1. We begin by showing:

Lemma 1 For any $x$, $y$, $z$ in $X$,

D_{Φ} (x, z) = D_{Φ} (x, y) + D_{Φ} (y, z) \Leftrightarrow (\nabla Φ (y) - \nabla Φ (z)) (x - y) = 0

Proof of Lemma 1.

D_{Φ} (x, z) = D_{Φ} (x, y) + D_{Φ} (y, z)

iff

Φ (x) - Φ (z) - \nabla Φ (z) (x - z)

= Φ (x) - Φ (y) - \nabla Φ (y) (x - y) + Φ (y) - Φ (z) - \nabla Φ (z) (y - z)

iff

(\nabla Φ (y) - \nabla Φ (z)) (x - y) = 0

as required.
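The last step of that chain is worth spelling out. Here is a sketch of the algebra, using only the definition of $D_{Φ}$ and cancelling the $Φ (x)$, $Φ (y)$ and $Φ (z)$ terms:

```latex
\begin{aligned}
D_{Φ}(x, z) - D_{Φ}(x, y) - D_{Φ}(y, z)
&= -\nabla Φ(z)(x - z) + \nabla Φ(y)(x - y) + \nabla Φ(z)(y - z) \\
&= \nabla Φ(y)(x - y) - \nabla Φ(z)\big((x - z) - (y - z)\big) \\
&= (\nabla Φ(y) - \nabla Φ(z))(x - y)
\end{aligned}
```

So the left-hand side is zero exactly when the right-hand side is.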

Return to Proof of Theorem 1. Now we show that if $x$ is in $C$, then

(\nabla Φ (π_{z, C}) - \nabla Φ (z)) (x - π_{z, C}) = 0

We know that $D (y, z)$ is minimized on $C$, as a function of $y$, at $y = π_{z, C}$. Thus, let $y = π_{z, C}$. And let $h (x) := \sum_{i} α_{i} x_{i} - r$. Then $\frac{\partial}{\partial x_{i}} h (x) = α_{i}$. Since the gradient of $D (y, z)$ with respect to $y$ is $\nabla Φ (y) - \nabla Φ (z)$, the KKT conditions tell us that there is $λ$ such that

\nabla Φ (y) - \nabla Φ (z) + (λ α_{1}, \dots, λ α_{n}) = (0, \dots, 0)

Thus,

\frac{\partial}{\partial y_{i}} Φ (y) - \frac{\partial}{\partial z_{i}} Φ (z) = - λ α_{i}

for all $i = 1, \dots, n$.

Thus, finally,

\begin{array}{rcl}
& & (\nabla Φ (y) - \nabla Φ (z)) (x - y) \\
& = & \sum_{i} (\frac{\partial}{\partial y_{i}} Φ (y) - \frac{\partial}{\partial z_{i}} Φ (z)) (x_{i} - y_{i}) \\
& = & \sum_{i} (- λ α_{i}) (x_{i} - y_{i}) \\
& = & - λ (\sum_{i} α_{i} x_{i} - \sum_{i} α_{i} y_{i}) \\
& = & - λ (r - r) \\
& = & 0
\end{array}

as required.

◻
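Theorem 1 is easy to check numerically. The sketch below does so for the generalized Kullback-Leibler divergence; the particular weights, points and function names are illustrative assumptions of mine. For this divergence the stationarity condition reads $\log (y_{i} / z_{i}) = - λ α_{i}$, so the projection has the form $y_{i} = z_{i} e^{- λ α_{i}}$, and I find $λ$ by bisection rather than in closed form.

```python
import math

def gen_kl(x, y):
    """Generalized KL divergence D(x, y) for strictly positive vectors."""
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))

def kl_project_onto_hyperplane(z, alpha, r, lo=-50.0, hi=50.0, iters=200):
    """D-projection of z onto {y : sum_i alpha_i y_i = r} for generalized KL.

    Stationarity gives y_i = z_i * exp(-lam * alpha_i); lam is found by
    bisection on the constraint, which is decreasing in lam.
    """
    def constraint(lam):
        return sum(a * zi * math.exp(-lam * a) for a, zi in zip(alpha, z)) - r

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if constraint(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [zi * math.exp(-lam * a) for a, zi in zip(alpha, z)]

alpha = [0.2, 0.3, 0.5]
r = 1.5
z = [1.0, 2.0, 4.0]
x = [1.0, 1.0, 2.0]   # lies in C: 0.2 * 1 + 0.3 * 1 + 0.5 * 2 = 1.5

p = kl_project_onto_hyperplane(z, alpha, r)
# D(x, z) - [D(x, p) + D(p, z)], which Theorem 1 says is zero.
gap = gen_kl(x, z) - (gen_kl(x, p) + gen_kl(p, z))
```

Up to floating-point error, `gap` comes out as zero, as the theorem predicts for any `x` on the hyperplane.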

Theorem 2 Suppose $1 \leq k \leq n$. Let $C := \{(x_{1}, \dots, x_{n}) : x_{1} = x_{2} = \dots = x_{k}\}$. Then if $z$ is in $X$ and $x$ is in $C$,

D_{Φ} (x, z) = D_{Φ} (x, π_{z, C}) + D_{Φ} (π_{z, C}, z)

Proof of Theorem 2. We know that $D (y, z)$ is minimized on $C$, as a function of $y$, at $y = π_{z, C}$. Thus, let $y = π_{z, C}$. And let $h_{i} (x) := x_{i + 1} - x_{i}$, for $i = 1, \dots, k - 1$. Then

\frac{\partial}{\partial x_{j}} h_{i} (x) = \begin{cases} 1 & \text{if } i + 1 = j \\ - 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}

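So, by the KKT conditions, there are $λ_{1}, \dots, λ_{k - 1}$, one for each constraint $h_{i}$, such that $\nabla Φ (y) - \nabla Φ (z) = \sum_{i = 1}^{k - 1} λ_{i} \nabla h_{i} (y)$ (the sign of the multiplier on an equality constraint is unrestricted, so we are free to write it this way). That is,

\nabla Φ (y) - \nabla Φ (z)

= (- λ_{1}, λ_{1}, 0, \dots, 0) + (0, - λ_{2}, λ_{2}, 0, \dots, 0) + \dots

+ (0, \dots, 0, - λ_{k - 1}, λ_{k - 1}, 0, \dots, 0)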

Thus,

\begin{array}{rcl} \frac{\partial}{\partial y_{1}} Φ (y) - \frac{\partial}{\partial z_{1}} Φ (z) & = & - λ_{1} \\ \frac{\partial}{\partial y_{2}} Φ (y) - \frac{\partial}{\partial z_{2}} Φ (z) & = & λ_{1} - λ_{2} \\ ⋮ & ⋮ & ⋮ \\ \frac{\partial}{\partial y_{k - 1}} Φ (y) - \frac{\partial}{\partial z_{k - 1}} Φ (z) & = & λ_{k - 2} - λ_{k - 1} \\ \frac{\partial}{\partial y_{k}} Φ (y) - \frac{\partial}{\partial z_{k}} Φ (z) & = & λ_{k - 1} \\ \frac{\partial}{\partial y_{k + 1}} Φ (y) - \frac{\partial}{\partial z_{k + 1}} Φ (z) & = & 0 \\ ⋮ & ⋮ & ⋮ \\ \frac{\partial}{\partial y_{n}} Φ (y) - \frac{\partial}{\partial z_{n}} Φ (z) & = & 0 \end{array}

Thus, finally,

\begin{array}{rcl}
& & (\nabla Φ (y) - \nabla Φ (z)) (x - y) \\
& = & \sum_{i} (\frac{\partial}{\partial y_{i}} Φ (y) - \frac{\partial}{\partial z_{i}} Φ (z)) (x_{i} - y_{i}) \\
& = & - λ_{1} (x_{1} - y_{1}) + (λ_{1} - λ_{2}) (x_{2} - y_{2}) + \dots \\
& & + (λ_{k - 2} - λ_{k - 1}) (x_{k - 1} - y_{k - 1}) + λ_{k - 1} (x_{k} - y_{k}) \\
& & + 0 (x_{k + 1} - y_{k + 1}) + \dots + 0 (x_{n} - y_{n}) \\
& = & \sum_{i = 1}^{k - 1} λ_{i} (x_{i + 1} - x_{i}) + \sum_{i = 1}^{k - 1} λ_{i} (y_{i} - y_{i + 1}) \\
& = & 0
\end{array}

as required.

◻
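Again, a quick numerical check, this time for Theorem 2 with the generalized Kullback-Leibler divergence. The numbers and function names are hypothetical choices of mine. For this divergence, minimizing over the shared value $t$ of the first $k$ coordinates gives $\sum_{i \leq k} \log (t / z_{i}) = 0$, so $t$ is the geometric mean of $z_{1}, \dots, z_{k}$, and the other coordinates stay where they are.

```python
import math

def gen_kl(x, y):
    """Generalized KL divergence D(x, y) for strictly positive vectors."""
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))

def kl_project_equal_first_k(z, k):
    """D-projection of z onto {y : y_1 = ... = y_k} for generalized KL.

    The shared value is the geometric mean of z_1..z_k; coordinates
    k+1..n are unconstrained and so remain equal to z_i.
    """
    t = math.exp(sum(math.log(zi) for zi in z[:k]) / k)
    return [t] * k + list(z[k:])

z = [1.0, 4.0, 2.0, 3.0]
k = 3
x = [0.5, 0.5, 0.5, 2.5]   # lies in C: the first three coordinates agree

p = kl_project_equal_first_k(z, k)   # shared value is (1 * 4 * 2) ** (1 / 3)
# D(x, z) - [D(x, p) + D(p, z)], which Theorem 2 says is zero.
gap = gen_kl(x, z) - (gen_kl(x, p) + gen_kl(p, z))
```

The shared value here is the cube root of 8, and `gap` vanishes up to rounding.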

DeGroot and Fienberg's calibration and refinement decomposition

To obtain these two decomposition results, we needed to assume nothing more than that $D$ is a Bregman divergence. The classic result by DeGroot and Fienberg requires a little more. We can see this by considering a very special case of it. Suppose $(X_{1}, \dots, X_{n})$ is a sequence of propositions that forms a partition. And suppose $w$ is a possible world. Then we can represent $w$ as the vector $w = (0, \dots, 0, 1, 0, \dots, 0)$, which takes value 1 at the proposition that is true in $w$ and 0 everywhere else. Now suppose $c = (c, \dots, c)$ is an assignment of the same credence to each proposition. Then one very particular case of DeGroot and Fienberg's result says that, if $(0, \dots, 0, 1, 0, \dots, 0)$ is the world at which $X_{i}$ is true, then

D ((0, \dots, 0, 1, 0, \dots, 0), (c, \dots, c))

= D ((0, \dots, 0, 1, 0, \dots, 0), (\frac{1}{n}, \dots, \frac{1}{n})) + D ((\frac{1}{n}, \dots, \frac{1}{n}), (c, \dots, c))

Now, we know from Lemma 1 that this is true iff

(\nabla Φ (\frac{1}{n}, \dots, \frac{1}{n}) - \nabla Φ (c, \dots, c)) ((0, \dots, 0, 1, 0, \dots, 0) - (\frac{1}{n}, \dots, \frac{1}{n})) = 0

which is true iff

(\frac{\partial}{\partial x_{i}} Φ (\frac{1}{n}, \dots, \frac{1}{n}) - \frac{\partial}{\partial x_{i}} Φ (c, \dots, c))

= \frac{1}{n} \sum_{j = 1}^{n} (\frac{\partial}{\partial x_{j}} Φ (\frac{1}{n}, \dots, \frac{1}{n}) - \frac{\partial}{\partial x_{j}} Φ (c, \dots, c))

and that is true iff

\frac{\partial}{\partial x_{i}} Φ (\frac{1}{n}, \dots, \frac{1}{n}) - \frac{\partial}{\partial x_{i}} Φ (c, \dots, c)

= \frac{\partial}{\partial x_{j}} Φ (\frac{1}{n}, \dots, \frac{1}{n}) - \frac{\partial}{\partial x_{j}} Φ (c, \dots, c)

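for all $1 \leq i, j \leq n$. And that holds, whatever $c$ is, if for any $x$ and any $1 \leq i, j \leq n$,

\frac{\partial}{\partial x_{i}} Φ (x, \dots, x) = \frac{\partial}{\partial x_{j}} Φ (x, \dots, x)

(Strictly, it holds for every $c$ iff the difference between these two partial derivatives takes the same value at every point $(x, \dots, x)$ on the diagonal, so the displayed condition is sufficient but slightly stronger than necessary.) Now, this is true if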

Φ (x_{1}, \dots, x_{n}) = \sum_{i = 1}^{n} φ (x_{i})

for some $φ$. That is, it is true if $D$ is an additive Bregman divergence. But it is also true for certain non-additive Bregman divergences, such as the one generated from the log-sum-exp function:

Definition (log-sum-exp) Suppose $0 \leq α_{1}, \dots, α_{n} \leq 1$ with $\sum_{i = 1}^{n} α_{i} = 1$. Then let

Φ^{A} (x_{1}, \dots, x_{n}) = \log (1 + α_{1} e^{x_{1}} + \dots + α_{n} e^{x_{n}})

Then

D (x, y) = \log (1 + \sum_{i} α_{i} e^{x_{i}}) - \log (1 + \sum_{i} α_{i} e^{y_{i}}) - \sum_{k} \frac{α_{k} (x_{k} - y_{k}) e^{y_{k}}}{1 + \sum_{i} α_{i} e^{y_{i}}}

Now

\frac{\partial}{\partial x_{i}} Φ^{A} (x_{1}, \dots, x_{n}) = \frac{α_{i} e^{x_{i}}}{1 + α_{1} e^{x_{1}} + \dots + α_{n} e^{x_{n}}}

So, if $α_{i} = α_{j} = \frac{1}{n}$ for all $1 \leq i, j \leq n$, then

\frac{\partial}{\partial x_{i}} Φ^{A} (x, \dots, x) = \frac{\frac{1}{n} e^{x}}{1 + e^{x}} = \frac{\partial}{\partial x_{j}} Φ^{A} (x, \dots, x)

But if $α_{i} \neq α_{j}$ for some $1 \leq i, j \leq n$, then

\frac{\partial}{\partial x_{i}} Φ^{A} (x, \dots, x) = \frac{α_{i} e^{x}}{1 + e^{x}} \neq \frac{α_{j} e^{x}}{1 + e^{x}} = \frac{\partial}{\partial x_{j}} Φ^{A} (x, \dots, x)
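The two cases can be seen side by side in a short sketch. The function names and the particular weights are my own assumptions; the gradient formula is the one given above.

```python
import math

def lse_grad(x, alpha):
    """Gradient of Phi^A(x) = log(1 + sum_i alpha_i * exp(x_i))."""
    denom = 1.0 + sum(a * math.exp(xi) for a, xi in zip(alpha, x))
    return [a * math.exp(xi) / denom for a, xi in zip(alpha, x)]

def diagonal_partials(t, alpha):
    # Partial derivatives evaluated at the diagonal point (t, ..., t).
    return lse_grad([t] * len(alpha), alpha)

# Equal weights: every partial derivative agrees on the diagonal.
equal = diagonal_partials(0.3, [1 / 3, 1 / 3, 1 / 3])

# Unequal weights: the partial derivatives are ordered like the weights.
unequal = diagonal_partials(0.3, [0.5, 0.3, 0.2])
```

With unequal weights the difference between two partial derivatives is $(α_{i} - α_{j}) \frac{e^{x}}{1 + e^{x}}$, which also changes as $x$ moves along the diagonal.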

And indeed, the result can even fail for a semi-additive Bregman divergence, that is, one for which there are different $ϕ_{1}, \dots, ϕ_{n}$ such that $Φ (x) = \sum_{i = 1}^{n} ϕ_{i} (x_{i})$. For instance, suppose $ϕ_{1} (x) = x^{2}$ and $ϕ_{2} (x) = x \log x$ and $Φ (x, y) = ϕ_{1} (x) + ϕ_{2} (y) = x^{2} + y \log y$. Then

\frac{\partial}{\partial x_{1}} Φ (x, x) = 2 x \neq 1 + \log x = \frac{\partial}{\partial x_{2}} Φ (x, x)
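Here is a sketch that tests the special case of the decomposition directly for $n = 2$, once with an additive generator (squared Euclidean distance) and once with the semi-additive generator just described. The credence value 0.3, the helper names, and the convention $0 \log 0 = 0$ are assumptions I have made for the illustration.

```python
import math

def bregman(phi, grad_phi, x, y):
    """D(x, y) = phi(x) - phi(y) - grad_phi(y) . (x - y)."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner

def xlogx(t):
    return 0.0 if t == 0 else t * math.log(t)   # convention: 0 log 0 = 0

# Additive generator: Phi(a, b) = a^2 + b^2
def additive(v):
    return v[0] ** 2 + v[1] ** 2

def additive_grad(v):
    return [2 * v[0], 2 * v[1]]

# Semi-additive generator from the text: Phi(a, b) = a^2 + b log b
def semi(v):
    return v[0] ** 2 + xlogx(v[1])

def semi_grad(v):
    return [2 * v[0], 1 + math.log(v[1])]

def decomposition_gap(phi, grad_phi, w, c):
    """D(w, c) - [D(w, u) + D(u, c)] with u the uniform vector (1/n, ..., 1/n)."""
    n = len(w)
    u = [1 / n] * n
    direct = bregman(phi, grad_phi, w, c)
    via_uniform = bregman(phi, grad_phi, w, u) + bregman(phi, grad_phi, u, c)
    return direct - via_uniform

w = [1.0, 0.0]   # the world at which X_1 is true
c = [0.3, 0.3]   # the same credence in both propositions

gap_additive = decomposition_gap(additive, additive_grad, w, c)
gap_semi = decomposition_gap(semi, semi_grad, w, c)
```

The additive gap is zero up to rounding, while the semi-additive gap equals $0.2 - \frac{1}{2} \log \frac{0.5}{0.3}$, which is not zero, in line with Lemma 1.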

Proving the Generalized Pythagorean Theorem

In this section, I really just spell out in more detail the proof that Predd, et al. give of the Generalized Pythagorean Theorem, which is their Proposition 3. But that proof contains some important general facts that might be helpful for people working with Bregman divergences. I collect these together here into one lemma.

Lemma 2 Suppose $D$ is a Bregman divergence generated from $Φ$. And suppose $x, y, z \in X$. Then

\begin{array}{rcl}
& & D (x, z) - [D (x, y) + D (y, z)] \\
& = & (\nabla Φ (y) - \nabla Φ (z)) (x - y) \\
& = & \lim_{ε \to 0} \frac{1}{ε} [D (y + ε (x - y), z) - D (y, z)] \\
& = & \lim_{ε \to 0} \frac{1}{ε} [D (ε x + (1 - ε) y, z) - D (y, z)]
\end{array}

We can then prove the Generalized Pythagorean Theorem easily. After all, suppose $x$ is in a closed convex set $C$ and $y$ is the point in $C$ that minimizes $D (y, z)$ as a function of $y$. Then, for all $0 \leq ε \leq 1$, $ε x + (1 - ε) y$ is in $C$. And since $y$ minimizes, $D (ε x + (1 - ε) y, z) \geq D (y, z)$. So $D (ε x + (1 - ε) y, z) - D (y, z) \geq 0$. So

\lim_{ε \to 0} \frac{1}{ε} [D (ε x + (1 - ε) y, z) - D (y, z)] \geq 0

So, by Lemma 2,

D (x, z) \geq D (x, y) + D (y, z)

Proof of Lemma 2.

\begin{array}{rcl}
& & \lim_{ε \to 0} \frac{1}{ε} [D (ε x + (1 - ε) y, z) - D (y, z)] \\
& = & \lim_{ε \to 0} \frac{1}{ε} [(Φ (ε x + (1 - ε) y) - Φ (z) - \nabla Φ (z) (ε x + (1 - ε) y - z)) \\
& & \quad - (Φ (y) - Φ (z) - \nabla Φ (z) (y - z))] \\
& = & \lim_{ε \to 0} \frac{1}{ε} [Φ (ε x + (1 - ε) y) - Φ (y) - ε \nabla Φ (z) (x - y)] \\
& = & \lim_{ε \to 0} \frac{1}{ε} [Φ (ε x + (1 - ε) y) - Φ (y)] - \nabla Φ (z) (x - y) \\
& = & \lim_{ε \to 0} \frac{1}{ε} [Φ (y + ε (x - y)) - Φ (y)] - \nabla Φ (z) (x - y) \\
& = & \nabla Φ (y) (x - y) - \nabla Φ (z) (x - y) \\
& = & (\nabla Φ (y) - \nabla Φ (z)) (x - y)
\end{array}
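Lemma 2 can also be sanity-checked numerically. The sketch below uses the generalized Kullback-Leibler divergence with three arbitrary positive vectors; these, the step size and the names are my own choices. The limit is approximated by a finite difference, so agreement with the third quantity is only approximate.

```python
import math

def gen_kl(x, y):
    """Generalized KL divergence D(x, y) for strictly positive vectors."""
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))

def grad_neg_entropy(x):
    # Gradient of Phi(x) = sum_i x_i log x_i
    return [1.0 + math.log(xi) for xi in x]

x = [0.2, 0.5, 1.3]
y = [0.4, 0.4, 0.9]
z = [1.0, 0.7, 0.6]

# D(x, z) - [D(x, y) + D(y, z)]
three_point = gen_kl(x, z) - (gen_kl(x, y) + gen_kl(y, z))

# (grad Phi(y) - grad Phi(z)) . (x - y)
inner = sum((gy - gz) * (xi - yi)
            for gy, gz, xi, yi in zip(grad_neg_entropy(y), grad_neg_entropy(z), x, y))

# Finite-difference stand-in for the limit as eps -> 0.
eps = 1e-6
y_eps = [yi + eps * (xi - yi) for xi, yi in zip(x, y)]
finite_diff = (gen_kl(y_eps, z) - gen_kl(y, z)) / eps
```

The first two quantities agree to rounding error, and the finite difference approaches them as `eps` shrinks.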

Comments

Anonymous, 4 August 2020 at 01:23
In the paragraph before the statement of the Generalized Pythagorean Theorem, does the uniqueness of the minimizer need D to be strictly convex in its first argument?