Decomposing Bregman divergences

For a PDF of this post, see here.

Here are a couple of neat little results about Bregman divergences that I just happened upon. They might help to prove some more decomposition theorems along the lines of this classic result by Morris DeGroot and Stephen Fienberg and, more recently, this paper by my colleagues in the computer science department at Bristol. I should say that a lot is known about Bregman divergences because of their role in information geometry, so these results are almost certainly known already, but I don't know where.

Refresher on Bregman divergences


First up, what's a divergence? It's essentially a generalization of the notion of a measure of distance from one point to another. The points live in some closed convex subset $\mathcal{X} \subseteq \mathbb{R}^n$. A divergence is a function $D : \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty]$ such that
  • $D(x, y) \geq 0$, for all $x$, $y$ in $\mathcal{X}$, and
  • $D(x, y) = 0$ iff $x = y$.
Note: We do not assume that a divergence is symmetric. So the distance from $x$ to $y$ need not be the same as the distance from $y$ to $x$. That is, we do not assume $D(x, y) = D(y, x)$ for all $x$, $y$ in $\mathcal{X}$. Indeed, among the family of divergences that we will consider -- the Bregman divergences -- only one is symmetric -- the squared Euclidean distance. And we do not assume the triangle inequality. That is, we don't assume that the divergence from $x$ to $z$ is at most the sum of the divergence from $x$ to $y$ and the divergence from $y$ to $z$. That is, we do not assume $D(x, z) \leq D(x, y) + D(y, z)$. Indeed, the conditions under which $D(x, z) = D(x, y) + D(y, z)$ for a Bregman divergence $D$ will be our concern here.

So, what's a Bregman divergence? $D : \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty]$ is a Bregman divergence if there is a strictly convex function $\Phi : \mathcal{X} \rightarrow \mathbb{R}$ that is differentiable on the interior of $\mathcal{X}$ such that $$D(x, y) = \Phi(x) - \Phi(y) - \nabla\Phi(y) \cdot (x - y)$$ In other words, to find the divergence from $x$ to $y$, you go to $y$ and find the tangent to $\Phi$ at $y$. Then you hop over to $x$ and subtract the value at $x$ of the tangent you just drew at $y$ from the value at $x$ of $\Phi$. That is, you subtract $\Phi(y) + \nabla\Phi(y) \cdot (x - y)$ from $\Phi(x)$. Because $\Phi$ is convex, it always curves away from the tangent, and so $\Phi(y) + \nabla\Phi(y) \cdot (x - y)$, the value at $x$ of the tangent you drew at $y$, is always less than $\Phi(x)$, the value at $x$ of $\Phi$.

The two most famous Bregman divergences are:
  • Squared Euclidean distance. Let $\Phi(x) = ||x||^2 = \sum_i x_i^2$, in which case $$D(x, y) = ||x - y||^2 = \sum_i (x_i - y_i)^2$$
  • Generalized Kullback-Leibler divergence. Let $\Phi(x) = \sum_i x_i \log x_i$, in which case $$D(x, y) = \sum_i x_i \log\frac{x_i}{y_i} - x_i + y_i$$
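To see these formulas in action, here is a minimal sketch in Python/NumPy (my own illustration, not part of the post; the helper names and example vectors are arbitrary) that computes a Bregman divergence directly from $\Phi$ and its gradient, and checks that the two choices of $\Phi$ above recover the squared Euclidean distance and the generalized Kullback-Leibler divergence:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_Phi(x, y) = Phi(x) - Phi(y) - <grad Phi(y), x - y>."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# Squared Euclidean distance: Phi(x) = ||x||^2.
sq      = lambda x: np.sum(x ** 2)
sq_grad = lambda x: 2 * x

# Generalized Kullback-Leibler divergence: Phi(x) = sum_i x_i log x_i (x_i > 0).
gkl      = lambda x: np.sum(x * np.log(x))
gkl_grad = lambda x: np.log(x) + 1

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.1, 0.6, 0.3])

print(bregman(sq, sq_grad, x, y), np.sum((x - y) ** 2))                  # should agree
print(bregman(gkl, gkl_grad, x, y), np.sum(x * np.log(x / y) - x + y))   # should agree
```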
Bregman divergences are convex in the first argument. Thus, we can define, for $z$ in $\mathcal{X}$ and for a closed convex subset $C \subseteq \mathcal{X}$, the $D$-projection of $z$ into $C$ as the point $\pi_{z,C}$ in $C$ such that $D(y, z)$ is minimized, as a function of $y$, at $y = \pi_{z,C}$. Now, we have the following theorem about Bregman divergences, due to Imre Csiszár:

Theorem (Generalized Pythagorean Theorem) If $C \subseteq \mathcal{X}$ is closed and convex, then, for all $x$ in $C$ and $z$ in $\mathcal{X}$, $$D(x, \pi_{z,C}) + D(\pi_{z,C}, z) \leq D(x, z)$$
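Here is a quick numerical illustration of the theorem (my own sketch, not part of the original result): it computes the $D$-projection of a point $z$ into a closed convex box $C$ by numerical minimization, using the generalized Kullback-Leibler divergence, and checks the inequality at an arbitrary point $x$ of $C$. The box and the particular numbers are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

def D(x, y):
    """Generalized Kullback-Leibler divergence (Phi(x) = sum_i x_i log x_i)."""
    return np.sum(x * np.log(x / y) - x + y)

z = np.array([0.05, 0.8, 1.4])                 # the point we project
bounds = [(0.3, 0.6)] * 3                      # C: a closed convex box, chosen arbitrarily

# D-projection of z into C: the point of C that minimizes D(., z).
res = minimize(lambda y: D(y, z), x0=np.full(3, 0.45),
               bounds=bounds, method='SLSQP')
proj = res.x

x = np.array([0.35, 0.5, 0.55])                # an arbitrary point of C
lhs = D(x, proj) + D(proj, z)
rhs = D(x, z)
print(lhs <= rhs + 1e-8, lhs, rhs)             # the Generalized Pythagorean inequality
```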

Decomposing Bregman divergences


This invites the question: when does equality hold? The following result gives a particular class of cases, and in doing so provides us with a recipe for creating decompositions of Bregman divergences into their component parts. Essentially, it says that the above inequality is an equality if $C$ is a hyperplane in $\mathbb{R}^n$.

Theorem 1  Suppose $r$ is in $\mathbb{R}$ and $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\sum_i \alpha_i = 1$. Then let $C := \{(x_1, \ldots, x_n) : \sum_i \alpha_i x_i = r\}$. Then if $z$ is in $\mathcal{X}$ and $x$ is in $C$, $$D_\Phi(x, z) = D_\Phi(x, \pi_{z,C}) + D_\Phi(\pi_{z,C}, z)$$

Proof of Theorem 1.  We begin by showing:

Lemma 1 For any $x$, $y$, $z$ in $\mathcal{X}$, $$D_\Phi(x, z) = D_\Phi(x, y) + D_\Phi(y, z) \iff (\nabla\Phi(y) - \nabla\Phi(z)) \cdot (x - y) = 0$$

Proof of Lemma 1. $D_\Phi(x, z) = D_\Phi(x, y) + D_\Phi(y, z)$ iff

$$\Phi(x) - \Phi(z) - \nabla\Phi(z) \cdot (x - z)$$

$$= \Phi(x) - \Phi(y) - \nabla\Phi(y) \cdot (x - y) + \Phi(y) - \Phi(z) - \nabla\Phi(z) \cdot (y - z)$$

iff $$(\nabla\Phi(y) - \nabla\Phi(z)) \cdot (x - y) = 0$$ as required.

Return to Proof of Theorem 1. Now we show that if $x$ is in $C$, then $$(\nabla\Phi(\pi_{z,C}) - \nabla\Phi(z)) \cdot (x - \pi_{z,C}) = 0$$ We know that $D(y, z)$ is minimized on $C$, as a function of $y$, at $y = \pi_{z,C}$. Thus, let $y = \pi_{z,C}$. And let $h(x) := \sum_i \alpha_i x_i - r$. Then $\frac{\partial}{\partial x_i} h(x) = \alpha_i$. So, by the KKT conditions, there is $\lambda$ such that $$\nabla\Phi(y) - \nabla\Phi(z) + (\lambda\alpha_1, \ldots, \lambda\alpha_n) = (0, \ldots, 0)$$ Thus, $$\frac{\partial}{\partial y_i}\Phi(y) - \frac{\partial}{\partial z_i}\Phi(z) = -\lambda\alpha_i$$ for all $i = 1, \ldots, n$.

Thus, finally, 
$$(\nabla\Phi(y) - \nabla\Phi(z)) \cdot (x - y) = \sum_i \left(\frac{\partial}{\partial y_i}\Phi(y) - \frac{\partial}{\partial z_i}\Phi(z)\right)(x_i - y_i) = \sum_i (-\lambda\alpha_i)(x_i - y_i) = -\lambda\left(\sum_i \alpha_i x_i - \sum_i \alpha_i y_i\right) = -\lambda(r - r) = 0$$
as required.
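As a sanity check on Theorem 1, here is a short numerical sketch (my own, with arbitrary weights $\alpha$, threshold $r$, and points) that computes the projection into the hyperplane $C$ by constrained minimization and verifies that the inequality becomes an equality:

```python
import numpy as np
from scipy.optimize import minimize

def D(x, y):
    """Generalized Kullback-Leibler divergence."""
    return np.sum(x * np.log(x / y) - x + y)

alpha = np.array([0.2, 0.3, 0.5])              # the weights in Theorem 1
r = 0.8                                        # C = {x : alpha . x = r}
z = np.array([0.6, 1.1, 0.4])                  # arbitrary point to project

cons = [{'type': 'eq', 'fun': lambda y: alpha @ y - r}]
res = minimize(lambda y: D(y, z), x0=np.full(3, 0.8),
               bounds=[(1e-9, None)] * 3, constraints=cons, method='SLSQP')
proj = res.x

# An arbitrary point of C: fix two coordinates and solve for the third.
x = np.array([0.5, 0.9, 0.0])
x[2] = (r - alpha[:2] @ x[:2]) / alpha[2]

print(D(x, z), D(x, proj) + D(proj, z))        # should agree up to solver tolerance
```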

Theorem 2  Suppose $1 \leq k \leq n$. Let $C := \{(x_1, \ldots, x_n) : x_1 = x_2 = \cdots = x_k\}$. Then if $z$ is in $\mathcal{X}$ and $x$ is in $C$, $$D_\Phi(x, z) = D_\Phi(x, \pi_{z,C}) + D_\Phi(\pi_{z,C}, z)$$

Proof of Theorem 2. We know that $D(y, z)$ is minimized on $C$, as a function of $y$, at $y = \pi_{z,C}$. Thus, let $y = \pi_{z,C}$. And let $h_i(x) := x_{i+1} - x_i$, for $i = 1, \ldots, k-1$. Then $$\frac{\partial}{\partial x_j} h_i(x) = \begin{cases} 1 & \text{if } i + 1 = j \\ -1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$ So, by the KKT conditions, there are $\lambda_1, \ldots, \lambda_{k-1}$ such that

$$\nabla\Phi(y) - \nabla\Phi(z)$$

$$+ (\lambda_1, -\lambda_1, 0, \ldots, 0) + (0, \lambda_2, -\lambda_2, 0, \ldots, 0) + \cdots$$

$$+ (0, \ldots, 0, \lambda_{k-1}, -\lambda_{k-1}, 0, \ldots, 0) = (0, \ldots, 0)$$

Thus, $$\begin{aligned} \frac{\partial}{\partial y_1}\Phi(y) - \frac{\partial}{\partial z_1}\Phi(z) &= -\lambda_1 \\ \frac{\partial}{\partial y_2}\Phi(y) - \frac{\partial}{\partial z_2}\Phi(z) &= \lambda_1 - \lambda_2 \\ &\ \ \vdots \\ \frac{\partial}{\partial y_{k-1}}\Phi(y) - \frac{\partial}{\partial z_{k-1}}\Phi(z) &= \lambda_{k-2} - \lambda_{k-1} \\ \frac{\partial}{\partial y_k}\Phi(y) - \frac{\partial}{\partial z_k}\Phi(z) &= \lambda_{k-1} \\ \frac{\partial}{\partial y_{k+1}}\Phi(y) - \frac{\partial}{\partial z_{k+1}}\Phi(z) &= 0 \\ &\ \ \vdots \\ \frac{\partial}{\partial y_n}\Phi(y) - \frac{\partial}{\partial z_n}\Phi(z) &= 0 \end{aligned}$$

Thus, finally, 
$$\begin{aligned} (\nabla\Phi(y) - \nabla\Phi(z)) \cdot (x - y) &= \sum_i \left(\frac{\partial}{\partial y_i}\Phi(y) - \frac{\partial}{\partial z_i}\Phi(z)\right)(x_i - y_i) \\ &= -\lambda_1(x_1 - y_1) + (\lambda_1 - \lambda_2)(x_2 - y_2) + \cdots + (\lambda_{k-2} - \lambda_{k-1})(x_{k-1} - y_{k-1}) + \lambda_{k-1}(x_k - y_k) \\ &\qquad + 0 \cdot (x_{k+1} - y_{k+1}) + \cdots + 0 \cdot (x_n - y_n) \\ &= \sum_{i=1}^{k-1}\lambda_i(x_{i+1} - x_i) + \sum_{i=1}^{k-1}\lambda_i(y_i - y_{i+1}) \\ &= 0 \end{aligned}$$
as required, since $x$ and $y$ are both in $C$, so that $x_{i+1} - x_i = 0$ and $y_i - y_{i+1} = 0$ for all $i = 1, \ldots, k-1$.
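Theorem 2 can also be checked numerically. For the generalized Kullback-Leibler divergence, minimizing $D(y, z)$ subject to $y_1 = \cdots = y_k$ can be done by hand: the common value works out to be the geometric mean of $z_1, \ldots, z_k$, and the remaining coordinates stay at $z_i$ (that closed form is my own derivation, not something stated in the post, so treat it as part of the sketch). Here is the check, with arbitrary numbers:

```python
import numpy as np

def D(x, y):
    """Generalized Kullback-Leibler divergence."""
    return np.sum(x * np.log(x / y) - x + y)

k = 3
z = np.array([0.2, 0.5, 0.9, 0.4, 1.3])

# Projection of z into C = {x : x_1 = ... = x_k} for this particular Phi:
# the first k coordinates all become the geometric mean of z_1, ..., z_k,
# and the rest stay put (my derivation, see the lead-in above).
proj = z.copy()
proj[:k] = np.exp(np.mean(np.log(z[:k])))

x = np.array([0.7, 0.7, 0.7, 1.1, 0.3])        # an arbitrary point of C
print(D(x, z), D(x, proj) + D(proj, z))        # should agree (up to rounding)
```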

DeGroot and Fienberg's calibration and refinement decomposition


To obtain these two decomposition results, we needed to assume nothing more than that $D$ is a Bregman divergence. The classic result by DeGroot and Fienberg requires a little more. We can see this by considering a very special case of it. Suppose $(X_1, \ldots, X_n)$ is a sequence of propositions that forms a partition. And suppose $w$ is a possible world. Then we can represent $w$ as the vector $w = (0, \ldots, 0, 1, 0, \ldots, 0)$, which takes value 1 at the proposition that is true in $w$ and 0 everywhere else. Now suppose $c = (c, \ldots, c)$ is an assignment of the same credence to each proposition. Then one very particular case of DeGroot and Fienberg's result says that, if $(0, \ldots, 0, 1, 0, \ldots, 0)$ is the world at which $X_i$ is true, then

$$D((0, \ldots, 0, 1, 0, \ldots, 0), (c, \ldots, c))$$

$$= D\left((0, \ldots, 0, 1, 0, \ldots, 0), \left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right)\right) + D\left(\left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right), (c, \ldots, c)\right)$$

Now, we know from Lemma 1 that this is true iff $$\left(\nabla\Phi\left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right) - \nabla\Phi(c, \ldots, c)\right) \cdot \left((0, \ldots, 0, 1, 0, \ldots, 0) - \left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right)\right) = 0$$ which is true iff

$$\frac{\partial}{\partial x_i}\Phi\left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right) - \frac{\partial}{\partial x_i}\Phi(c, \ldots, c)$$

$$= \frac{1}{n}\sum_{j=1}^n\left(\frac{\partial}{\partial x_j}\Phi\left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right) - \frac{\partial}{\partial x_j}\Phi(c, \ldots, c)\right)$$

and that is true iff

$$\frac{\partial}{\partial x_i}\Phi\left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right) - \frac{\partial}{\partial x_i}\Phi(c, \ldots, c)$$

$$= \frac{\partial}{\partial x_j}\Phi\left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right) - \frac{\partial}{\partial x_j}\Phi(c, \ldots, c)$$

for all $1 \leq i, j \leq n$, which is true iff, for any $x$ and $1 \leq i, j \leq n$, $$\frac{\partial}{\partial x_i}\Phi(x, \ldots, x) = \frac{\partial}{\partial x_j}\Phi(x, \ldots, x)$$ Now, this is true if $\Phi(x_1, \ldots, x_n) = \sum_{i=1}^n \varphi(x_i)$ for some $\varphi$. That is, it is true if $D$ is an additive Bregman divergence. But it is also true for certain non-additive Bregman divergences, such as the one generated from the log-sum-exp function:

Definition (log-sum-exp) Suppose $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\sum_{i=1}^n \alpha_i = 1$. Then let $$\Phi_A(x_1, \ldots, x_n) = \log(1 + \alpha_1 e^{x_1} + \cdots + \alpha_n e^{x_n})$$ Then
$$D(x, y) = \log\left(1 + \sum_i \alpha_i e^{x_i}\right) - \log\left(1 + \sum_i \alpha_i e^{y_i}\right) - \frac{\sum_k \alpha_k(x_k - y_k)e^{y_k}}{1 + \sum_i \alpha_i e^{y_i}}$$

Now $$\frac{\partial}{\partial x_i}\Phi_A(x_1, \ldots, x_n) = \frac{\alpha_i e^{x_i}}{1 + \alpha_1 e^{x_1} + \cdots + \alpha_n e^{x_n}}$$ So, if $\alpha_i = \alpha_j$ for all $1 \leq i, j \leq n$ (writing $\alpha$ for their common value), then $$\frac{\partial}{\partial x_i}\Phi_A(x, \ldots, x) = \frac{\alpha e^x}{1 + e^x} = \frac{\partial}{\partial x_j}\Phi_A(x, \ldots, x)$$ But if $\alpha_i \neq \alpha_j$ for some $1 \leq i, j \leq n$, then $$\frac{\partial}{\partial x_i}\Phi_A(x, \ldots, x) = \frac{\alpha_i e^x}{1 + e^x} \neq \frac{\alpha_j e^x}{1 + e^x} = \frac{\partial}{\partial x_j}\Phi_A(x, \ldots, x)$$

And indeed, the result can even fail if we have a semi-additive Bregman divergence, that is, one for which $\Phi(x) = \sum_{i=1}^n \phi_i(x_i)$ for possibly different $\phi_1, \ldots, \phi_n$. For instance, suppose $\phi_1(x) = x^2$ and $\phi_2(x) = x\log x$, and $\Phi(x, y) = \phi_1(x) + \phi_2(y) = x^2 + y\log y$. Then $$\frac{\partial}{\partial x_1}\Phi(x, x) = 2x \neq 1 + \log x = \frac{\partial}{\partial x_2}\Phi(x, x)$$
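To make the role of the $\alpha_i$ vivid, here is a small numerical sketch (my own; the credence $c$, the world $w$, and the weights are arbitrary) that builds the Bregman divergence generated by $\Phi_A$ and checks the special case of the DeGroot and Fienberg decomposition above: it holds when the $\alpha_i$ are all equal and fails otherwise.

```python
import numpy as np

def make_D(alpha):
    """Bregman divergence generated by Phi_A(x) = log(1 + sum_i alpha_i e^{x_i})."""
    phi  = lambda x: np.log(1 + alpha @ np.exp(x))
    grad = lambda x: alpha * np.exp(x) / (1 + alpha @ np.exp(x))
    return lambda x, y: phi(x) - phi(y) - grad(y) @ (x - y)

n = 3
w = np.array([0.0, 1.0, 0.0])                  # the world at which X_2 is true
c = np.full(n, 0.7)                            # same credence in every proposition
u = np.full(n, 1.0 / n)                        # the uniform point (1/n, ..., 1/n)

for alpha in (np.full(n, 1.0 / n), np.array([0.1, 0.3, 0.6])):
    D = make_D(alpha)
    print(np.isclose(D(w, c), D(w, u) + D(u, c)),   # True for equal alphas, False otherwise
          D(w, c), D(w, u) + D(u, c))
```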

Proving the Generalized Pythagorean Theorem


In this section, I really just spell out in more detail the proof that Predd et al. give of the Generalized Pythagorean Theorem, which is their Proposition 3. But that proof contains some important general facts that might be helpful for people working with Bregman divergences. I collect these together here into one lemma.

Lemma 2  Suppose $D$ is a Bregman divergence generated from $\Phi$. And suppose $x, y, z \in \mathcal{X}$. Then $$\begin{aligned} D(x, z) - [D(x, y) + D(y, z)] &= (\nabla\Phi(y) - \nabla\Phi(z)) \cdot (x - y) \\ &= \lim_{\varepsilon \rightarrow 0}\frac{1}{\varepsilon}\left[D(y + \varepsilon(x - y), z) - D(y, z)\right] \\ &= \lim_{\varepsilon \rightarrow 0}\frac{1}{\varepsilon}\left[D(\varepsilon x + (1 - \varepsilon)y, z) - D(y, z)\right] \end{aligned}$$

We can then prove the Generalized Pythagorean Theorem easily. After all, suppose $x$ is in a closed convex set $C$ and $y$ is the point in $C$ that minimizes $D(y, z)$ as a function of $y$. Then, for all $0 \leq \varepsilon \leq 1$, $\varepsilon x + (1 - \varepsilon)y$ is in $C$. And since $y$ minimizes, $D(\varepsilon x + (1 - \varepsilon)y, z) \geq D(y, z)$. So $D(\varepsilon x + (1 - \varepsilon)y, z) - D(y, z) \geq 0$. So $$\lim_{\varepsilon \rightarrow 0}\frac{1}{\varepsilon}\left[D(\varepsilon x + (1 - \varepsilon)y, z) - D(y, z)\right] \geq 0$$ So, by Lemma 2, $$D(x, z) \geq D(x, y) + D(y, z)$$

Proof of Lemma 2.  $$\begin{aligned} & \lim_{\varepsilon \rightarrow 0}\frac{1}{\varepsilon}\left[D(\varepsilon x + (1 - \varepsilon)y, z) - D(y, z)\right] \\ &= \lim_{\varepsilon \rightarrow 0}\frac{1}{\varepsilon}\left[\left(\Phi(\varepsilon x + (1 - \varepsilon)y) - \Phi(z) - \nabla\Phi(z) \cdot (\varepsilon x + (1 - \varepsilon)y - z)\right) - \left(\Phi(y) - \Phi(z) - \nabla\Phi(z) \cdot (y - z)\right)\right] \\ &= \lim_{\varepsilon \rightarrow 0}\frac{1}{\varepsilon}\left[\Phi(\varepsilon x + (1 - \varepsilon)y) - \Phi(y) - \varepsilon\nabla\Phi(z) \cdot (x - y)\right] \\ &= \lim_{\varepsilon \rightarrow 0}\frac{1}{\varepsilon}\left[\Phi(\varepsilon x + (1 - \varepsilon)y) - \Phi(y)\right] - \nabla\Phi(z) \cdot (x - y) \\ &= \lim_{\varepsilon \rightarrow 0}\frac{1}{\varepsilon}\left[\Phi(y + \varepsilon(x - y)) - \Phi(y)\right] - \nabla\Phi(z) \cdot (x - y) \\ &= \nabla\Phi(y) \cdot (x - y) - \nabla\Phi(z) \cdot (x - y) \\ &= (\nabla\Phi(y) - \nabla\Phi(z)) \cdot (x - y) \end{aligned}$$
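For what it's worth, the identity in Lemma 2 is easy to check numerically (my own sketch, with arbitrary points, using the generalized Kullback-Leibler divergence and a small finite $\varepsilon$ to approximate the limit):

```python
import numpy as np

def D(x, y):
    """Generalized Kullback-Leibler divergence."""
    return np.sum(x * np.log(x / y) - x + y)

grad_phi = lambda x: np.log(x) + 1             # gradient of Phi(x) = sum_i x_i log x_i

x = np.array([0.3, 0.9, 0.5])
y = np.array([0.6, 0.4, 0.8])
z = np.array([1.1, 0.2, 0.7])

lhs = D(x, z) - (D(x, y) + D(y, z))
mid = (grad_phi(y) - grad_phi(z)) @ (x - y)
eps = 1e-6
lim = (D(y + eps * (x - y), z) - D(y, z)) / eps

print(lhs, mid, lim)                           # all three should (approximately) agree
```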

Comments

  1. In the paragraph before the statement of the Generalized Pythagorean Theorem, does the uniqueness of the minimizer need D to be strictly convex in its first argument?

    1. Yes, that's right! Looking back, I see I wasn't very careful about the convex/strictly convex distinction. If D is a Bregman divergence, Phi is strictly convex, and that ensures that D is strictly convex in its first argument.

