Decomposing Bregman divergences
For a PDF of this post, see here.
Here are a couple of neat little results about Bregman divergences that I just happened upon. They might help to prove some more decomposition theorems along the lines of this classic result by Morris DeGroot and Stephen Fienberg and, more recently, this paper by my colleagues in the computer science department at Bristol. I should say that a lot is known about Bregman divergences because of their role in information geometry, so these results are almost certainly known already, but I don't know where.

Refresher on Bregman divergences
First up, what's a divergence? It's essentially a generalization of the notion of a measure of distance from one point to another. The points live in some closed convex subset $\mathcal{X} \subseteq \mathbb{R}^n$. A divergence is a function $\mathfrak{D} : \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty]$ such that $\mathfrak{D}(x, y) \geq 0$ for all $x$, $y$ in $\mathcal{X}$, and $\mathfrak{D}(x, y) = 0$ iff $x = y$.

Note that a divergence need not be a metric. The divergence from $x$ to $y$ need not be the same as the divergence from $y$ to $x$. That is, we do not assume $\mathfrak{D}(x, y) = \mathfrak{D}(y, x)$ for all $x$, $y$ in $\mathcal{X}$. Indeed, among the family of divergences that we will consider -- the Bregman divergences -- only one is symmetric -- the squared Euclidean distance. And we do not assume the triangle inequality. That is, we don't assume that the divergence from $x$ to $z$ is at most the sum of the divergence from $x$ to $y$ and the divergence from $y$ to $z$. That is, we do not assume $\mathfrak{D}(x, z) \leq \mathfrak{D}(x, y) + \mathfrak{D}(y, z)$. Indeed, the conditions under which $\mathfrak{D}(x, z) = \mathfrak{D}(x, y) + \mathfrak{D}(y, z)$ for a Bregman divergence will be our concern here.
So, what's a Bregman divergence? $\mathfrak{D}$ is a Bregman divergence if there is a strictly convex function $\Phi : \mathcal{X} \rightarrow \mathbb{R}$ that is differentiable on the interior of $\mathcal{X}$ such that $$\mathfrak{D}(x, y) = \Phi(x) - \Phi(y) - \langle \nabla \Phi(y), x - y \rangle$$ In other words, to find the divergence from $x$ to $y$, you go to $y$ and find the tangent to $\Phi$ at $y$. Then hop over to $x$ and subtract the value at $x$ of the tangent you just drew at $y$ from the value at $x$ of $\Phi$. That is, you subtract $\Phi(y) + \langle \nabla \Phi(y), x - y \rangle$ from $\Phi(x)$. Because $\Phi$ is convex, it is always curving away from the tangent, and so $\Phi(y) + \langle \nabla \Phi(y), x - y \rangle$, the value at $x$ of the tangent you drew at $y$, is always less than $\Phi(x)$, the value at $x$ of $\Phi$.
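To make the tangent-gap picture concrete, here is a minimal sketch in Python. The names `bregman`, `phi`, and `grad` are mine, not from the post: given a convex $\Phi$ and its gradient, the divergence is the gap at $x$ between $\Phi$ and the tangent to $\Phi$ drawn at $y$.

```python
# A minimal sketch of a Bregman divergence built from a convex function Phi
# and its gradient; the names here are illustrative, not from the post.

def bregman(phi, grad_phi, x, y):
    """D(x, y) = Phi(x) - [Phi(y) + <grad Phi(y), x - y>]: the gap at x
    between Phi and the tangent to Phi drawn at y."""
    tangent_at_x = phi(y) + sum(g * (a - b) for g, a, b in zip(grad_phi(y), x, y))
    return phi(x) - tangent_at_x

# Phi(x) = sum_i x_i^2 is strictly convex; it generates squared Euclidean distance.
phi = lambda x: sum(a ** 2 for a in x)
grad = lambda x: [2 * a for a in x]

x, y = [0.2, 0.8], [0.5, 0.5]
d = bregman(phi, grad, x, y)
# Convexity keeps the graph of Phi above every tangent, so the divergence
# is >= 0, and it is 0 when the two points coincide.
assert d >= 0 and bregman(phi, grad, x, x) == 0
```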
The two most famous Bregman divergences are:

- Squared Euclidean distance. Let $\Phi(x) = \sum^n_{i=1} x_i^2$, in which case $$\mathfrak{D}(x, y) = \sum^n_{i=1} (x_i - y_i)^2$$
- Generalized Kullback-Leibler divergence. Let $\Phi(x) = \sum^n_{i=1} x_i \log x_i$, in which case $$\mathfrak{D}(x, y) = \sum^n_{i=1} x_i \log \frac{x_i}{y_i} - \sum^n_{i=1} x_i + \sum^n_{i=1} y_i$$

Given $z$ in $\mathcal{X}$ and a closed convex subset $\mathcal{C} \subseteq \mathcal{X}$, the $\mathfrak{D}$-projection of $z$ into $\mathcal{C}$ is the point $\pi$ in $\mathcal{C}$ such that $\mathfrak{D}(y, z)$ is minimized, as a function of $y$, at $y = \pi$. (Since $\Phi$ is strictly convex, $\mathfrak{D}(y, z)$ is strictly convex in $y$, so this minimizer is unique.) Now, we have the following theorem about Bregman divergences, due to Imre Csiszár:

Theorem (Generalized Pythagorean Theorem) If $\mathcal{C} \subseteq \mathcal{X}$ is closed and convex, $z$ is in $\mathcal{X}$, and $\pi$ is the $\mathfrak{D}$-projection of $z$ into $\mathcal{C}$, then for all $x$ in $\mathcal{C}$, $$\mathfrak{D}(x, z) \geq \mathfrak{D}(x, \pi) + \mathfrak{D}(\pi, z)$$
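Both pieces above can be checked numerically. The sketch below (all names and the particular set $\mathcal{C}$ are illustrative choices of mine) first confirms that $\Phi(x) = \sum_i x_i \log x_i$ generates the generalized Kullback-Leibler divergence, then checks the Pythagorean inequality on a simple closed convex set, with the projection approximated by grid search.

```python
import math

# Sketch checking two things numerically (all names are mine, not the post's):
# (1) Phi(x) = sum_i x_i log x_i generates the generalized Kullback-Leibler
#     divergence; (2) the Generalized Pythagorean inequality on a segment,
#     with the D-projection found by grid search.

def bregman(phi, grad_phi, x, y):
    return phi(x) - phi(y) - sum(g * (a - b) for g, a, b in zip(grad_phi(y), x, y))

phi = lambda x: sum(a * math.log(a) for a in x)
grad = lambda x: [math.log(a) + 1 for a in x]

def gkl(x, y):
    # sum_i x_i log(x_i / y_i) - x_i + y_i
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, y))

v, w = [0.1, 0.6, 0.3], [0.2, 0.5, 0.3]
assert abs(bregman(phi, grad, v, w) - gkl(v, w)) < 1e-12

# C = the segment from (0.9, 0.1) to (0.1, 0.9): closed and convex.
def point(t):
    return [0.9 - 0.8 * t, 0.1 + 0.8 * t]

z = [0.7, 0.7]
# D-projection of z into C, approximated on a fine grid.
t_proj = min((i / 10000 for i in range(10001)), key=lambda t: gkl(point(t), z))
pi = point(t_proj)
for i in range(0, 10001, 500):
    x = point(i / 10000)
    # D(x, z) >= D(x, pi) + D(pi, z), up to grid error.
    assert gkl(x, z) >= gkl(x, pi) + gkl(pi, z) - 1e-6
```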
Decomposing Bregman divergences

This invites the question: when does equality hold? The following result gives a particular class of cases, and in doing so provides us with a recipe for creating decompositions of Bregman divergences into their component parts. Essentially, it says that the above inequality is an equality if $\mathcal{C}$ is a plane in $\mathcal{X}$.
Theorem 1 Suppose $z$ is in $\mathcal{X}$ and $\lambda_1, \ldots, \lambda_n, r$ are in $\mathbb{R}$ with $(\lambda_1, \ldots, \lambda_n) \neq 0$. Then let $P = \{x \in \mathcal{X} : \sum^n_{i=1} \lambda_i x_i = r\}$. Then if $\pi$ is the $\mathfrak{D}$-projection of $z$ into $P$ and $x$ is in $P$, $$\mathfrak{D}(x, z) = \mathfrak{D}(x, \pi) + \mathfrak{D}(\pi, z)$$
Proof of Theorem 1. We begin by showing:
Lemma 1 For any $x$, $y$, $z$ in $\mathcal{X}$, $$\mathfrak{D}(x, z) = \mathfrak{D}(x, y) + \mathfrak{D}(y, z) \quad \text{iff} \quad \langle \nabla \Phi(z) - \nabla \Phi(y), x - y \rangle = 0$$

Proof of Lemma 1. $\mathfrak{D}(x, z) = \mathfrak{D}(x, y) + \mathfrak{D}(y, z)$ iff $$\Phi(x) - \Phi(z) - \langle \nabla \Phi(z), x - z \rangle = \Phi(x) - \Phi(y) - \langle \nabla \Phi(y), x - y \rangle + \Phi(y) - \Phi(z) - \langle \nabla \Phi(z), y - z \rangle$$
iff $$\langle \nabla \Phi(z), x - z \rangle = \langle \nabla \Phi(y), x - y \rangle + \langle \nabla \Phi(z), y - z \rangle$$ iff $$\langle \nabla \Phi(z) - \nabla \Phi(y), x - y \rangle = 0$$ as required.
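This three-point identity is easy to check numerically. Here is a sketch (names mine) using the generalized Kullback-Leibler divergence: the gap $\mathfrak{D}(x, y) + \mathfrak{D}(y, z) - \mathfrak{D}(x, z)$ always equals the inner product $\langle \nabla \Phi(z) - \nabla \Phi(y), x - y \rangle$.

```python
import math

# A numeric check of the identity behind Lemma 1, for the generalized
# Kullback-Leibler divergence (names are mine):
# D(x, y) + D(y, z) - D(x, z) = <grad Phi(z) - grad Phi(y), x - y>.

def gkl(x, y):
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, y))

def grad_phi(x):  # gradient of Phi(x) = sum_i x_i log x_i
    return [math.log(a) + 1 for a in x]

x, y, z = [0.2, 0.8], [0.5, 0.5], [0.7, 0.3]
lhs = gkl(x, y) + gkl(y, z) - gkl(x, z)
rhs = sum((gz - gy) * (a - b) for gz, gy, a, b in zip(grad_phi(z), grad_phi(y), x, y))
assert abs(lhs - rhs) < 1e-12
# So D(x, z) = D(x, y) + D(y, z) exactly when the inner product vanishes.
```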
Return to Proof of Theorem 1. Now we show that if $x$ is in $P$, then $$\langle \nabla \Phi(z) - \nabla \Phi(\pi), x - \pi \rangle = 0$$ We know that $\mathfrak{D}(y, z)$ is minimized on $P$, as a function of $y$, at $y = \pi$. Thus, let $f(y) = \mathfrak{D}(y, z)$. And let $g(y) = \sum^n_{i=1} \lambda_i y_i - r$. Then $g(\pi) = 0$. So, by the KKT conditions, there is $\mu$ such that, $$\nabla f(\pi) = \nabla \Phi(\pi) - \nabla \Phi(z) = \mu \nabla g(\pi) = \mu (\lambda_1, \ldots, \lambda_n)$$ Thus, $\frac{\partial \Phi}{\partial x_i}(\pi) - \frac{\partial \Phi}{\partial x_i}(z) = \mu \lambda_i$ for all $i$.

Thus, finally, since $x$ and $\pi$ are both in $P$, $$\langle \nabla \Phi(z) - \nabla \Phi(\pi), x - \pi \rangle = -\mu \sum^n_{i=1} \lambda_i (x_i - \pi_i) = -\mu(r - r) = 0$$ and so, by Lemma 1, $\mathfrak{D}(x, z) = \mathfrak{D}(x, \pi) + \mathfrak{D}(\pi, z)$, as required.
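Theorem 1 can be checked numerically for squared Euclidean distance, where the $\mathfrak{D}$-projection onto a plane is the ordinary orthogonal projection and has a closed form. The plane, weights, and points below are illustrative choices of mine.

```python
# A numeric check of Theorem 1 for squared Euclidean distance. For this
# divergence the D-projection onto the plane {x : sum_i lam_i x_i = r} is
# the ordinary orthogonal projection, which has a closed form.

def sq(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

lam, r = [1.0, 2.0], 1.0          # the plane x_1 + 2 x_2 = 1
z = [1.0, 1.0]                    # a point off the plane
scale = (r - sum(l * a for l, a in zip(lam, z))) / sum(l * l for l in lam)
pi = [a + scale * l for a, l in zip(z, lam)]  # projection of z onto the plane

x = [1.0, 0.0]                    # another point on the plane
assert abs(sum(l * a for l, a in zip(lam, x)) - r) < 1e-12
# On the plane, the Pythagorean inequality is an equality:
assert abs(sq(x, z) - (sq(x, pi) + sq(pi, z))) < 1e-12
```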
Theorem 2 Suppose $z$ is in $\mathcal{X}$, $\lambda^1, \ldots, \lambda^m$ are vectors in $\mathbb{R}^n$, and $r_1, \ldots, r_m$ are in $\mathbb{R}$. Let $P = \{x \in \mathcal{X} : \sum^n_{i=1} \lambda^k_i x_i = r_k \text{ for } k = 1, \ldots, m\}$. Then if $\pi$ is the $\mathfrak{D}$-projection of $z$ into $P$
and $x$ is in $P$, $$\mathfrak{D}(x, z) = \mathfrak{D}(x, \pi) + \mathfrak{D}(\pi, z)$$

Proof of Theorem 2. We know that $\mathfrak{D}(y, z)$ is minimized on $P$, as a function of $y$, at $y = \pi$. Thus, let $f(y) = \mathfrak{D}(y, z)$. And let $g_k(y) = \sum^n_{i=1} \lambda^k_i y_i - r_k$, for $k = 1, \ldots, m$.
Then $g_k(\pi) = 0$ for each $k$. So, by the KKT
conditions, there are $\mu_1, \ldots, \mu_m$ such that, $$\nabla \Phi(\pi) - \nabla \Phi(z) = \sum^m_{k=1} \mu_k \nabla g_k(\pi) = \sum^m_{k=1} \mu_k (\lambda^k_1, \ldots, \lambda^k_n)$$
Thus, $$\frac{\partial \Phi}{\partial x_i}(\pi) - \frac{\partial \Phi}{\partial x_i}(z) = \sum^m_{k=1} \mu_k \lambda^k_i \text{ for all } i$$
Thus, finally, since $x$ and $\pi$ are both in $P$, $$\langle \nabla \Phi(z) - \nabla \Phi(\pi), x - \pi \rangle = -\sum^m_{k=1} \mu_k \sum^n_{i=1} \lambda^k_i (x_i - \pi_i) = -\sum^m_{k=1} \mu_k (r_k - r_k) = 0$$ and so, by Lemma 1, $\mathfrak{D}(x, z) = \mathfrak{D}(x, \pi) + \mathfrak{D}(\pi, z)$, as required.
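Theorem 2 can likewise be checked numerically for squared Euclidean distance, with $P$ the intersection of two planes in $\mathbb{R}^3$ (a line). The particular planes and points below are illustrative choices of mine.

```python
# A numeric check of Theorem 2 for squared Euclidean distance, with P the
# intersection of two planes in R^3 (a line).

def sq(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

# P = {x : x_1 + x_2 + x_3 = 1 and x_1 - x_3 = 0}, parametrized as p + s*d,
# where d = (1,1,1) x (1,0,-1) is the cross product of the two normals.
p = [0.25, 0.5, 0.25]
d = [-1.0, 2.0, -1.0]

def line(s):
    return [a + s * b for a, b in zip(p, d)]

z = [1.0, 2.0, 3.0]
# Orthogonal projection of z onto the line:
s_star = sum(b * (c - a) for a, b, c in zip(p, d, z)) / sum(b * b for b in d)
pi = line(s_star)

for s in (-1.0, 0.0, 0.7):
    x = line(s)
    # Equality in the Pythagorean decomposition holds on all of P:
    assert abs(sq(x, z) - (sq(x, pi) + sq(pi, z))) < 1e-9
```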
DeGroot and Fienberg's calibration and refinement decomposition

To obtain these two decomposition results, we needed to assume nothing more than that $\mathfrak{D}$ is a Bregman divergence. The classic result by DeGroot and Fienberg requires a little more. We can see this by considering a very special case of it. Suppose $A_1, \ldots, A_n$ is a sequence of propositions that forms a partition. And suppose $w$ is a possible world. Then we can represent $w$ as the vector $(0, \ldots, 0, 1, 0, \ldots, 0)$, which takes value 1 at the proposition that is true in $w$ and 0 everywhere else. Now suppose $c = (x, \ldots, x)$ is an assignment of the same credence $x$ to each proposition, and let $u = (\frac{1}{n}, \ldots, \frac{1}{n})$. Then one very particular case of DeGroot and Fienberg's result says that, if $w_i$ is the world at which $A_i$ is true, then $$\mathfrak{D}(w_i, c) = \mathfrak{D}(w_i, u) + \mathfrak{D}(u, c)$$

Now, we know from Lemma 1 that this is true iff $$\langle \nabla \Phi(c) - \nabla \Phi(u), w_i - u \rangle = 0$$ which is true iff
$$\left(1 - \frac{1}{n}\right)\left(\frac{\partial \Phi}{\partial x_i}(c) - \frac{\partial \Phi}{\partial x_i}(u)\right) - \frac{1}{n}\sum_{j \neq i}\left(\frac{\partial \Phi}{\partial x_j}(c) - \frac{\partial \Phi}{\partial x_j}(u)\right) = 0$$
for all $i$, which is true iff, for any $i$, $j$, $$\frac{\partial \Phi}{\partial x_i}(c) - \frac{\partial \Phi}{\partial x_i}(u) = \frac{\partial \Phi}{\partial x_j}(c) - \frac{\partial \Phi}{\partial x_j}(u)$$ Now, this is true if $\Phi(x) = \sum^n_{i=1} \varphi(x_i)$ for some strictly convex $\varphi$. That is, it is true if $\mathfrak{D}$ is an additive Bregman divergence. But it is also true for certain non-additive Bregman divergences, such as the one generated from the log-sum-exp function:
Definition (log-sum-exp) Let $$\Phi(x) = \log\left(\sum^n_{i=1} e^{x_i}\right)$$ Then $$\mathfrak{D}(x, y) = \log \frac{\sum^n_{i=1} e^{x_i}}{\sum^n_{i=1} e^{y_i}} - \sum^n_{i=1} \frac{e^{y_i}}{\sum^n_{j=1} e^{y_j}}(x_i - y_i)$$
Now $$\frac{\partial \Phi}{\partial x_i}(x) = \frac{e^{x_i}}{\sum^n_{j=1} e^{x_j}}$$ So, if $c_i = c_j$ for all $i$, $j$, then $$\frac{\partial \Phi}{\partial x_i}(c) - \frac{\partial \Phi}{\partial x_i}(u) = \frac{1}{n} - \frac{1}{n} = 0$$ for all $i$, and the decomposition holds. But if $c_i \neq c_j$ for some $i$, $j$, then the differences $\frac{\partial \Phi}{\partial x_i}(c) - \frac{\partial \Phi}{\partial x_i}(u)$ need not all be equal, and the decomposition can fail.
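The special case above can be checked numerically (all names are mine): with a uniform credence $c = (x, \ldots, x)$ and $u = (\frac{1}{n}, \ldots, \frac{1}{n})$, the decomposition $\mathfrak{D}(w_i, c) = \mathfrak{D}(w_i, u) + \mathfrak{D}(u, c)$ holds both for the additive squared Euclidean distance and for the non-additive log-sum-exp divergence.

```python
import math

# Check D(w_i, c) = D(w_i, u) + D(u, c) for a uniform credence c, both for
# squared Euclidean distance (additive) and log-sum-exp (non-additive).

def bregman(phi, grad_phi, x, y):
    return phi(x) - phi(y) - sum(g * (a - b) for g, a, b in zip(grad_phi(y), x, y))

sq_phi = lambda x: sum(a ** 2 for a in x)
sq_grad = lambda x: [2 * a for a in x]

lse_phi = lambda x: math.log(sum(math.exp(a) for a in x))
def lse_grad(x):
    s = sum(math.exp(a) for a in x)
    return [math.exp(a) / s for a in x]

n = 3
c = [0.4] * n           # the same credence in each cell of the partition
u = [1 / n] * n
for i in range(n):
    w = [1.0 if j == i else 0.0 for j in range(n)]
    for phi, grad in ((sq_phi, sq_grad), (lse_phi, lse_grad)):
        lhs = bregman(phi, grad, w, c)
        rhs = bregman(phi, grad, w, u) + bregman(phi, grad, u, c)
        assert abs(lhs - rhs) < 1e-12
```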
And indeed, the result even fails if we have a merely semi-additive Bregman divergence. That is, one generated by $\Phi(x) = \sum^n_{i=1} \varphi_i(x_i)$ where there are different $\varphi_i \neq \varphi_j$. For instance, suppose $n = 2$ and $\varphi_1(x) = x^2$ and $\varphi_2(x) = x^4$ (one illustrative choice). Then, with $c = (x, x)$ and $u = (\frac{1}{2}, \frac{1}{2})$, $$\frac{\partial \Phi}{\partial x_1}(c) - \frac{\partial \Phi}{\partial x_1}(u) = 2x - 1 \quad \text{while} \quad \frac{\partial \Phi}{\partial x_2}(c) - \frac{\partial \Phi}{\partial x_2}(u) = 4x^3 - \frac{1}{2}$$ and these differ for most values of $x$, so the decomposition fails.
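The failure can be seen numerically. This sketch uses one semi-additive choice, $\Phi(x_1, x_2) = x_1^2 + x_2^4$ (an illustrative choice of mine), and shows that the two sides of the decomposition come apart.

```python
# A numeric check that the decomposition fails for a semi-additive
# divergence: Phi(x1, x2) = x1^2 + x2^4 (an illustrative choice).

def bregman(phi, grad_phi, x, y):
    return phi(x) - phi(y) - sum(g * (a - b) for g, a, b in zip(grad_phi(y), x, y))

phi = lambda x: x[0] ** 2 + x[1] ** 4
grad = lambda x: [2 * x[0], 4 * x[1] ** 3]

x = 0.8
c = [x, x]            # uniform credence
u = [0.5, 0.5]
w = [1.0, 0.0]        # the world at which the first proposition is true
lhs = bregman(phi, grad, w, c)
rhs = bregman(phi, grad, w, u) + bregman(phi, grad, u, c)
# By Lemma 1 the gap equals <grad Phi(c) - grad Phi(u), w - u>,
# which is nonzero here, so the decomposition fails.
assert abs(lhs - rhs) > 0.1
```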
Proving the Generalized Pythagorean Theorem
In this section, I really just spell out in more detail the proof that Predd et al. give of the Generalized Pythagorean Theorem, which is their Proposition 3. But that proof contains some important general facts that might be helpful for people working with Bregman divergences. I collect these together here into one lemma.
Lemma 2 Suppose $\mathfrak{D}$ is a Bregman divergence generated from $\Phi$. And suppose $x$, $y$, $z$ are in $\mathcal{X}$. Then $$\lim_{t \rightarrow 0^+} \frac{\mathfrak{D}(y + t(x - y), z) - \mathfrak{D}(y, z)}{t} = \langle \nabla \Phi(y) - \nabla \Phi(z), x - y \rangle$$
We can then prove the Generalized Pythagorean Theorem easily. After all, suppose $z$ is in $\mathcal{X}$, $\mathcal{C}$ is a closed convex set, and $\pi$ is the point in $\mathcal{C}$ that minimizes $\mathfrak{D}(y, z)$ as a function of $y$. Then, for all $0 \leq t \leq 1$, $\pi + t(x - \pi)$ is in $\mathcal{C}$, since $\mathcal{C}$ is convex. And since $\pi$ minimizes, $\mathfrak{D}(\pi + t(x - \pi), z) \geq \mathfrak{D}(\pi, z)$. So $$\lim_{t \rightarrow 0^+} \frac{\mathfrak{D}(\pi + t(x - \pi), z) - \mathfrak{D}(\pi, z)}{t} \geq 0$$ So, by Lemma 2, $$\langle \nabla \Phi(\pi) - \nabla \Phi(z), x - \pi \rangle \geq 0$$ and so, by Lemma 1, $\mathfrak{D}(x, z) \geq \mathfrak{D}(x, \pi) + \mathfrak{D}(\pi, z)$, as required.
Proof of Lemma 2. Write $\Delta = x - y$. Then $$\mathfrak{D}(y + t\Delta, z) = \Phi(y + t\Delta) - \Phi(z) - \langle \nabla \Phi(z), y + t\Delta - z \rangle$$ Differentiating with respect to $t$ at $t = 0^+$ gives $$\langle \nabla \Phi(y), \Delta \rangle - \langle \nabla \Phi(z), \Delta \rangle = \langle \nabla \Phi(y) - \nabla \Phi(z), x - y \rangle$$ as required.
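Lemma 2 admits a quick finite-difference check (names mine), here using the generalized Kullback-Leibler divergence: the one-sided derivative of $t \mapsto \mathfrak{D}(y + t(x - y), z)$ at $t = 0$ matches $\langle \nabla \Phi(y) - \nabla \Phi(z), x - y \rangle$.

```python
import math

# A finite-difference check of Lemma 2 for the generalized
# Kullback-Leibler divergence (all names are illustrative).

def gkl(x, y):
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, y))

def grad_phi(x):  # gradient of Phi(x) = sum_i x_i log x_i
    return [math.log(a) + 1 for a in x]

x, y, z = [0.2, 0.8], [0.5, 0.5], [0.7, 0.3]

def move(t):  # the point y + t(x - y)
    return [b + t * (a - b) for a, b in zip(x, y)]

t = 1e-6
numeric = (gkl(move(t), z) - gkl(y, z)) / t
exact = sum((gy - gz) * (a - b)
            for gy, gz, a, b in zip(grad_phi(y), grad_phi(z), x, y))
# The finite-difference slope approximates <grad Phi(y) - grad Phi(z), x - y>.
assert abs(numeric - exact) < 1e-4
```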
Comments

In the paragraph before the statement of the Generalized Pythagorean Theorem, does the uniqueness of the minimizer need $\mathfrak{D}$ to be strictly convex in its first argument?

Reply: Yes, that's right! Looking back, I see I wasn't very careful about the convex/strictly convex distinction. If $\mathfrak{D}$ is a Bregman divergence, $\Phi$ is strictly convex, and that ensures that $\mathfrak{D}$ is strictly convex in its first argument.