# UMass LIVING

## July 8, 2009

### (Jul 2) Meeting Summary: “GBP” & “EP for LDA”

Filed under: Uncategorized — umassliving @ 3:36 pm

“Expectation Propagation for the Generative Aspect Model”, Minka

This summary paper contains a very intuitive explanation of EP and covers most of the high-level material that was presented. More specifically, it covers the KL divergence view of EP: approximating an intractable distribution $p(x)$ with a tractable member of the exponential family, $q(x)$, by minimizing $KL(p || q)$. Note that minimizing this form of the KL divergence when $q(x)$ is a member of the exponential family is equivalent to moment matching.
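As a small illustration of the moment-matching claim, here is a minimal sketch for the special case where $q$ is a Gaussian. The target mixture (its weights, means and variances) is a made-up example, not something from the paper. Since $KL(p || q) = E_p[-\log q] - H(p)$ and $H(p)$ does not depend on $q$, it is enough to compare the cross-entropy term, which for a Gaussian $q$ only needs the mean and variance of $p$.

```python
import math

# Hypothetical target p(x): a two-component Gaussian mixture (illustrative values only).
weights = [0.3, 0.7]
means = [-2.0, 1.0]
variances = [0.5, 1.5]

# First two moments of p, available in closed form for a mixture.
mean_p = sum(w * m for w, m in zip(weights, means))
second_p = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
var_p = second_p - mean_p ** 2


def cross_entropy(m, v):
    """E_p[-log q(x)] for q = N(m, v). Equals KL(p || q) plus the entropy of p,
    which is a constant with respect to (m, v)."""
    expected_sq_dev = var_p + (mean_p - m) ** 2  # E_p[(x - m)^2]
    return 0.5 * math.log(2.0 * math.pi * v) + expected_sq_dev / (2.0 * v)


# Value at the moment-matched Gaussian.
matched = cross_entropy(mean_p, var_p)

# Values at perturbed means and variances; none should beat the matched one.
perturbed = [
    cross_entropy(mean_p + dm, var_p * scale)
    for dm in (-0.5, 0.0, 0.5)
    for scale in (0.5, 1.0, 2.0)
    if not (dm == 0.0 and scale == 1.0)
]

print(matched, min(perturbed))
```

Every perturbed setting gives a larger cross-entropy, and hence a larger $KL(p || q)$, than the Gaussian whose mean and variance match those of $p$.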

The mean field methods that we will cover in chapter 5 minimize the reverse, $KL(q || p)$, which has other properties. Note that a large number of message-passing algorithms can be viewed as minimizing information divergence (paper).

We covered the EP algorithm for the LDA model, where the prior $Dir(\lambda |\alpha)$ was the tractable component and each of the $p(w | \lambda)$ terms was an intractable component. Each is a mixture of multinomials that was approximated with a Dirichlet, making $q(\lambda)$ a product of Dirichlet distributions, which is itself a Dirichlet (with updated parameters).
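The closure property used here (a product of Dirichlet densities over the same variable is proportional to a Dirichlet) can be checked numerically. The sketch below is only an illustration with made-up parameter vectors. It uses the fact that multiplying $n$ Dirichlet densities adds the exponents $(\alpha_k - 1)$, so the combined parameter is the sum of the individual parameters minus $(n - 1)$.

```python
import math


def dirichlet_logpdf(x, alpha):
    """Log density of Dir(alpha) at a point x on the probability simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))


def product_params(alphas):
    """Parameters of the Dirichlet proportional to the product of Dir(alpha_i).
    Exponents add: sum_i (alpha_i - 1) = (sum_i alpha_i - (n - 1)) - 1."""
    n = len(alphas)
    return [sum(column) - (n - 1) for column in zip(*alphas)]


# Hypothetical factors (illustrative values only).
factors = [[2.0, 3.0, 4.0], [1.5, 2.5, 1.2], [3.0, 1.1, 2.0]]
combined = product_params(factors)


def log_ratio(x):
    """log(product of factor densities) - log(combined density) at x.
    If the product really is proportional to Dir(combined), this is constant in x."""
    return sum(dirichlet_logpdf(x, a) for a in factors) - dirichlet_logpdf(x, combined)


points = [[0.2, 0.3, 0.5], [0.6, 0.1, 0.3], [0.1, 0.8, 0.1]]
ratios = [log_ratio(x) for x in points]
print(combined, ratios)
```

The log ratio is the same at every test point, which is exactly what "proportional to a Dirichlet" means; the constant is the log of the normalizer that EP tracks separately.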

To put this in the Wainwright variational framework, the $\tilde{t}_w$ basically correspond to the Lagrange multipliers $\lambda^i$ in Equation 4.77 on page 119 in Wainwright.  Multiplying $p$ with all the $\tilde{t}_w$'s gives us the base distribution 4.77.  We then divide out by one specific $\tilde{t}_w$ (i.e. the $\lambda^i$ that isn’t in the summation in 4.78), and multiply in the corresponding exact $t_w$, which gives us the augmented distribution in 4.78.  We compute the expectation of the sufficient statistics under this distribution, and then go back and modify the $\tilde{t}_w$ which we divided out so that the base distribution and augmented distribution have matching moments.

We briefly covered the steps for the EP algorithm. I (Moe) have worked out the updates in each step (they are too much to type up), so let me know if you get stuck on any of them.

Notice that the sufficient statistic for a Dirichlet is $\phi(x) = \log(x)$, but in the paper they match the mean and variance instead (more specifically, the first and second moments, which is equivalent to matching the mean and variance). This is because matching $\phi(x) = \log(x)$ is intractable, so matching the mean and variance is a tractable approximation. More info can be found here.
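To make the mean-and-variance matching concrete, here is a minimal sketch of recovering Dirichlet parameters from first and second moments. It relies on the standard Dirichlet identities $E[x_k] = \alpha_k / s$ and $E[x_k^2] = \alpha_k(\alpha_k + 1) / (s(s + 1))$ with $s = \sum_k \alpha_k$. Estimating $s$ from the first component only is a simplifying assumption made here for illustration; it is not necessarily the exact scheme used in the paper, and the function name is hypothetical.

```python
def dirichlet_from_moments(mean, second):
    """Fit Dir(alpha) so that its means equal `mean` and the second moment
    of component 0 equals `second[0]`.

    Assumption for this sketch: the precision s = sum(alpha) is estimated
    from component 0 alone, using s = (m - m2) / (m2 - m^2).
    """
    m, m2 = mean[0], second[0]
    precision = (m - m2) / (m2 - m * m)
    return [precision * mk for mk in mean]


# Round trip on a known Dirichlet (illustrative values only).
alpha_true = [2.0, 3.0, 5.0]
s = sum(alpha_true)
mean = [a / s for a in alpha_true]
second = [a * (a + 1.0) / (s * (s + 1.0)) for a in alpha_true]

alpha_fit = dirichlet_from_moments(mean, second)
print(alpha_fit)
```

When the moments come from an exact Dirichlet, the fit recovers the original parameters. In EP the moments come from the augmented distribution instead, so the fitted Dirichlet is only an approximation to it.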

“Constructing Free Energies and the Generalized Belief Propagation Algorithm”, Yedidia, Freeman & Weiss

This paper gives an alternative view on inference in terms of free energies that we can map back into the framework of Wainwright/Jordan that we’ve been studying.  We should first understand that the sufficient statistics are the standard over-complete indicator functions: one set for each $f_a$, with one indicator for each possible assignment to the subset of variables that participate in the function, i.e. $x_a$.  Since $p(x)$ is proportional to the product of the $f_a(x_a)$, the corresponding canonical parameters are the values of $\ln f_a(x_a)$ for each specific assignment to $x_a$.

With this notation, we can now rewrite the distribution $p(x)$ in equation (1) in the canonical form given in Wainwright as $p(x) = \exp(\langle \theta, \phi(x) \rangle - A(\theta))$.  Now we can see that $A(\theta)$ is the same as $\ln Z$.  In other words, the Helmholtz free energy $F_H$ is just the negative of the log partition function.

Now consider a trial distribution $b(x)$.  Using the definition of KL divergence in equation (18), it is straightforward to derive equation (17) (and worth doing so for your own edification).  Since we know the KL divergence is always non-negative and only equal to zero when the two distributions are the same, we now have a variational representation of the Helmholtz free energy as the minimum over trial distributions of the variational free energy (Gibbs free energy), which is equal to the variational average energy minus the variational entropy.  If we could solve this minimization exactly, we would find the parameters that make the trial distribution equal to the actual distribution.  Of course, we will generally be working with a simpler family of trial distributions, since the original distribution is assumed to be intractable, so we won’t be able to make the KL divergence go to zero.

Note that $U(b)$ is the negative of $\langle \theta, \tau \rangle$, and $H(b)$ is an approximation to the negative of the conjugate dual $A^*(\mu)$.  Thus we have just rewritten the constrained maximization problem of Wainwright as a constrained minimization problem.

One unresolved question is what exactly is meant by “cycle” in Proposition 8.  Clearly it can’t mean a directed cycle, since the region graph is acyclic due to the way the edges are directed.  Does it simply mean undirected cycles in the region graph?  Wainwright gives a much more complicated definition of an acyclic hypergraph (equivalent to a region graph) in terms of junction trees, so it’s not clear if this is the same as a cycle in the region graph or not.

Understanding the example after Proposition 8 is good to do.

Note there are three possible ways to do generalized belief propagation – parent to child (same as what was presented in Wainwright), child to parent, and two way.  One benefit of parent to child is that there is no need to actually use the counting numbers $c$.
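For reference, the counting numbers of a region graph follow the recursion $c_R = 1 - \sum_{A \in \mathrm{ancestors}(R)} c_A$. The sketch below computes them for a small made-up region graph (the region names, variable sets and parent lists are illustrative assumptions, not an example from the paper) and checks that every variable ends up counted exactly once.

```python
def ancestors(region, parents):
    """All regions reachable from `region` by following parent edges upward."""
    seen = set()
    stack = list(parents[region])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def counting_numbers(parents):
    """c_R = 1 - sum of c_A over all ancestors A of R (top regions get c = 1)."""
    counts = {}

    def visit(region):
        if region not in counts:
            counts[region] = 1 - sum(visit(a) for a in ancestors(region, parents))
        return counts[region]

    for region in parents:
        visit(region)
    return counts


# Hypothetical three-level region graph (illustrative only).
regions = {
    "R1": {"a", "b", "c"},
    "R2": {"b", "c", "d"},
    "R3": {"c", "d", "e"},
    "bc": {"b", "c"},
    "cd": {"c", "d"},
    "c": {"c"},
}
parents = {
    "R1": [], "R2": [], "R3": [],
    "bc": ["R1", "R2"],
    "cd": ["R2", "R3"],
    "c": ["bc", "cd"],
}

counts = counting_numbers(parents)

# Each variable should be counted exactly once across the regions containing it.
totals = {
    v: sum(counts[r] for r, members in regions.items() if v in members)
    for v in "abcde"
}
print(counts, totals)
```

The top-level regions get counting number 1, the pairwise intersections get -1, and the bottom region gets 0 in this example.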

You can get the message-update rule for the parent-to-child algorithm by plugging the definition of the beliefs (113) into the two sides of the marginalization constraint (117) and working out the math.

The proof that the fixed points of GBP are the same as the stationary points of the region-based free energy is done by defining a second set of Lagrange multipliers that immediately give $b(x)$ as in (113) from (122), and then showing that the two sets of Lagrange multipliers define equivalent constraints.  To be a little more explicit, if you have a fixed point of GBP, then it must satisfy the constraints of the original Lagrange multipliers, so it must satisfy the constraints of the second set of Lagrange multipliers, and (122) shows $b(x)$ satisfies the stationarity conditions for the second optimization problem, which is equivalent to the original, and hence you have a stationary point of the region-based free energy.  Going the other direction is similar.

Section X is very useful if you want to see what the GBP updates actually are.  It’s not hard to work them out from (114).

Lastly, section XI gives a nice example of a lattice network in which straightforward BP will give the correct labels if you just look at the arg max of the marginals, but will be overconfident, whereas GBP will give actual marginals that are very close to the exact marginals.