# UMass LIVING

## July 9, 2009

### (Jul 9 3-5 PM) Meeting Summary – Chapter 5: Mean Field Methods

Filed under: Uncategorized — umassliving @ 6:41 pm

Mean field methods differ from those discussed in chapter 4 in that $\mathcal{M}$ is approximated using a tractable (nonconvex) inner bound for which $A^*$ is easy to characterize. The approximation rests on the notion of a tractable subgraph: the set of sufficient statistics is restricted by setting the canonical parameters outside the subgraph to $0$. Setting certain canonical parameters to zero is equivalent to adding certain constraints between the mean parameters (which have an obvious form in the naive mean field case).

We discussed the notions of tractable subgraphs, the lower bound property of mean field, and its interpretations as maximizing a lower bound and as minimizing the KL divergence (the reverse form of that used in EP). We worked through example 5.2 for naive mean field and example 5.4 on the nonconvexity of the objective function. We worked out $\mathbb{P}[X_1 = X_2]$ to show why the fraction in $\theta(q)$ must be 1/2 instead of 1/4. In example 5.4, we clarified the difference between convex sets and convex functions and how figure 5.2(b) depicts sets. We discussed the extreme points of $\mathcal{M}$, how they arise, and why they must be contained in $\mathcal{M}_{F}$, which explains why $\mathcal{M}_{F} \subset \mathcal{M}$ can't be convex. Finally, we discussed structured mean field as a way to use arbitrary tractable subgraphs rather than only fully disconnected subgraphs (as in naive mean field). The constraints between the mean parameters take a form that depends on the subgraph and are thus represented generically as $g_{\alpha}$.
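The naive mean field updates for an Ising model can be sketched in a few lines; this is a minimal illustration with made-up parameter values, comparing the fixed point against brute-force exact marginals on a tiny chain:

```python
import itertools
import math

# Minimal sketch (parameter values made up): naive mean field for a small
# Ising model p(x) ~ exp(sum_s theta_s x_s + sum_{(s,t)} theta_st x_s x_t)
# with x_s in {0, 1}. tau[s] approximates P(X_s = 1); coordinate ascent on
# the mean field objective gives tau_s <- sigmoid(theta_s + sum_t theta_st tau_t).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def naive_mean_field(theta, edges, sweeps=200):
    tau = [0.5] * len(theta)
    for _ in range(sweeps):
        for s in range(len(theta)):
            z = theta[s]
            for (u, v), w in edges.items():
                if u == s:
                    z += w * tau[v]
                elif v == s:
                    z += w * tau[u]
            tau[s] = sigmoid(z)
    return tau

def exact_marginals(theta, edges):
    # Brute-force enumeration, feasible only for tiny models.
    n = len(theta)
    weights = {}
    for x in itertools.product([0, 1], repeat=n):
        e = sum(theta[s] * x[s] for s in range(n))
        e += sum(w * x[u] * x[v] for (u, v), w in edges.items())
        weights[x] = math.exp(e)
    Z = sum(weights.values())
    return [sum(w for x, w in weights.items() if x[s] == 1) / Z for s in range(n)]

theta = [0.5, -0.2, 0.1]                # singleton parameters (made up)
edges = {(0, 1): 0.3, (1, 2): -0.4}     # edge parameters (made up)
tau = naive_mean_field(theta, edges)    # mean field marginals
mu = exact_marginals(theta, edges)      # exact marginals
```

For weak couplings such as these, the fixed point lands close to the exact marginals; with strong couplings the nonconvexity discussed in example 5.4 shows up as dependence on the initialization of $\tau$.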

Next time we will briefly make sure that everyone understood examples 5.5 and 5.6.

## July 8, 2009

### (Jul 2) Meeting Summary: “GBP” & “EP for LDA”

Filed under: Uncategorized — umassliving @ 3:36 pm

“Expectation Propagation for the Generative Aspect Model”, Minka

This summary paper contains a very intuitive explanation of EP and covers most of the high-level material that was presented. More specifically, it covers the KL divergence view of EP: approximating an intractable distribution $p(x)$ with a tractable member of the exponential family, $q(x)$, by minimizing $KL(p || q)$. Note that minimizing this form of the KL divergence when $q(x)$ is a member of the exponential family is equivalent to moment matching.
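As a concrete illustration of the moment-matching view (all numbers made up): the best single Gaussian under $KL(p || q)$ for a bimodal mixture simply copies the mixture's overall mean and variance, the "mode-covering" behavior characteristic of this direction of the KL divergence:

```python
# Minimal numeric sketch (weights/means/variances made up): for q in an
# exponential family, minimizing KL(p || q) matches expected sufficient
# statistics. With q Gaussian, the minimizer copies the mean and variance
# of p, here a two-component Gaussian mixture.

weights = [0.5, 0.5]
means = [-2.0, 2.0]
variances = [1.0, 1.0]

# Mixture moments: E[x] = sum_k w_k m_k, E[x^2] = sum_k w_k (v_k + m_k^2).
mean_p = sum(w * m for w, m in zip(weights, means))
second_p = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
var_p = second_p - mean_p ** 2

# The moment-matched q covers both modes, hence is much broader than either.
q_mean, q_var = mean_p, var_p
```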

The mean field methods that we will cover in chapter 5 minimize the reverse, $KL(q || p)$, which has different properties. Note that a large number of message-passing algorithms can be viewed as minimizing information divergence (paper).

We covered the EP algorithm for the LDA model, where the prior $Dir(\lambda |\alpha)$ was the tractable component and each of the $p(w | \lambda)$ were the intractable components – each is a mixture of multinomials that was approximated with a Dirichlet, making $q(\lambda)$ a product of Dirichlet distributions, which is itself a Dirichlet (with updated parameters).

To put this in the Wainwright variational framework, the $\tilde{t}_w$ basically correspond to the Lagrange multipliers $\lambda^i$ in Equation 4.77 on page 119 in Wainwright.  Multiplying $p$ with all the $\tilde{t}_w$'s gives us the base distribution 4.77.  We then divide out by one specific $\tilde{t}_w$ (i.e. the $\lambda^i$ that isn't in the summation in 4.78), and multiply in the corresponding exact $t_w$, which gives us the augmented distribution in 4.78.  We compute the expectation of the sufficient statistics under this distribution, and then go back and modify the $\tilde{t}_w$ which we divided out so that the base distribution and augmented distribution have matching moments.
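Schematically (my paraphrase of the step above, with $p(\lambda)$ the Dirichlet prior and $\phi$ the sufficient statistics of the approximating family):

```latex
\begin{align*}
q(\lambda) &\propto p(\lambda) \prod_{w} \tilde{t}_w(\lambda)
  && \text{base distribution (cf. 4.77)} \\
q^{i}(\lambda) &\propto t_i(\lambda)\, p(\lambda) \prod_{w \neq i} \tilde{t}_w(\lambda)
  && \text{augmented distribution (cf. 4.78)} \\
\tilde{t}_i &\leftarrow \tilde{t}_i \cdot \frac{q^{\mathrm{new}}(\lambda)}{q(\lambda)},
  \quad \text{where } \mathbb{E}_{q^{\mathrm{new}}}[\phi(\lambda)] = \mathbb{E}_{q^{i}}[\phi(\lambda)]
  && \text{moment matching}
\end{align*}
```

The last line is exactly "modify the $\tilde{t}_w$ which we divided out so that the base and augmented distributions have matching moments."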

We briefly covered the steps for the EP algorithm, I (Moe) have worked out the updates in each step (they are too much to type up), so let me know if you get stuck on any of them.

Notice that the sufficient statistic for a Dirichlet is $\phi(x) = \log(x)$, but in the paper they match the mean and variance instead (more specifically, the first and second moments, which are equivalent to the mean and variance). This is because matching $\phi(x) = \log(x)$ is intractable, so matching the mean and variance is a tractable approximation. More info can be found here.
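A sketch of that tractable approximation (target numbers made up): matching the full mean vector plus one marginal variance pins down a Dirichlet, since $\mathrm{Var}[x_i] = m_i(1-m_i)/(\alpha_0 + 1)$ determines the precision $\alpha_0$:

```python
# Moment matching for a Dirichlet, in the spirit of Minka's EP for LDA:
# instead of matching E[log x] (intractable), match the mean vector and
# one marginal variance. Target numbers below are made up.

def dirichlet_from_moments(mean, var1):
    # Var[x_1] = m_1 (1 - m_1) / (a0 + 1) determines the precision a0;
    # then alpha_i = a0 * m_i recovers the individual parameters.
    a0 = mean[0] * (1.0 - mean[0]) / var1 - 1.0
    return [a0 * m for m in mean]

def dirichlet_moments(alpha):
    a0 = sum(alpha)
    mean = [a / a0 for a in alpha]
    var = [m * (1.0 - m) / (a0 + 1.0) for m in mean]
    return mean, var

target_mean = [0.5, 0.3, 0.2]
target_var1 = 0.02
alpha = dirichlet_from_moments(target_mean, target_var1)
mean, var = dirichlet_moments(alpha)   # round-trip check
```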

“Constructing Free Energies and the Generalized Belief Propagation Algorithm”, Yedidia, Freeman & Weiss

This paper gives an alternative view on inference in terms of free energies that we can map back into the framework of Wainwright/Jordan that we’ve been studying.  We should first understand that the sufficient statistics are the standard over-complete indicator functions, one set for each $f_a$, one for each possible assignment to the subset of variables that participate in the function, e.g. $x_a$.  The corresponding canonical parameters are then the corresponding values of $f_a(x_a)$ for each specific assignment to $x_a$.

With this notation, we can now rewrite the distribution $p(x)$ in equation (1) in the canonical form given in Wainwright as $p(x) = \exp(\langle \theta, \phi(x) \rangle - A(\theta))$.  Now we can see that $A(\theta)$ is the same as $\ln Z$.  In other words, the Helmholtz free energy $F_H$ is just the negative of the log partition function.
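A tiny numeric check of this identification (factor values made up): writing a one-factor model in canonical form, $A(\theta)$ is exactly $\ln Z$, so $F_H = -\ln Z = -A(\theta)$:

```python
import math

# One factor f over two binary variables; with overcomplete indicator
# sufficient statistics the canonical parameters are log f(x_a).
# Factor values are made up.
f = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 4.0}
theta = {x: math.log(v) for x, v in f.items()}

Z = sum(math.exp(theta[x]) for x in f)   # partition function
A = math.log(Z)                          # log partition function
F_H = -A                                 # Helmholtz free energy

# p(x) = exp(<theta, phi(x)> - A) reduces to exp(theta_x - A) here.
p = {x: math.exp(theta[x] - A) for x in f}
```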

Now consider a trial distribution $b(x)$.  Using the definition of KL divergence in equation (18), it is straightforward to derive equation (17) (and worth doing so for your own edification).  Since the KL divergence is always non-negative and equals zero only when the two distributions are the same, we now have a variational representation of the Helmholtz free energy: it is the minimum over trial distributions of the variational (Gibbs) free energy, which equals the variational average energy minus the variational entropy.  If we could solve this minimization exactly, we would find the parameters that make the trial distribution equal to the actual distribution.  Of course, we will generally work with a simpler family of trial distributions, since the original distribution is assumed to be intractable, so we won't be able to make the KL divergence go to zero.
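For reference, the derivation in question, writing $p(x) = e^{-E(x)}/Z$ with $E(x) = -\sum_a \ln f_a(x_a)$:

```latex
\begin{align*}
D(b \,\|\, p) &= \sum_x b(x) \ln \frac{b(x)}{p(x)}
  = \underbrace{\sum_x b(x) E(x)}_{U(b)}
    - \underbrace{\Bigl(-\sum_x b(x) \ln b(x)\Bigr)}_{H(b)}
    + \ln Z \\
  &= U(b) - H(b) - F_H ,
\end{align*}
```

so the variational (Gibbs) free energy $F(b) = U(b) - H(b)$ satisfies $F(b) = F_H + D(b \,\|\, p) \ge F_H$, with equality iff $b = p$.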

Note that $U(b)$ is the negative of $\langle \theta, \tau \rangle$, and $H(b)$ is an approximation to the negative of the conjugate dual $A^*(\mu)$.  Thus we have just rewritten the constrained maximization problem of Wainwright as a constrained minimization problem.

One unresolved question is what exactly is meant by “cycle” in Proposition 8 (clearly it can’t mean a directed cycle, since the region graph is acyclic due to the way the edges are directed).  Does it simply mean an undirected cycle in the region graph?  Wainwright gives a much more complicated definition of an acyclic hypergraph (equivalent to a region graph) in terms of junction trees, so it’s not clear whether that definition coincides with the absence of undirected cycles in the region graph.

The example after Proposition 8 is worth working through.

Note there are three possible ways to do generalized belief propagation – parent to child (same as what was presented in Wainwright), child to parent, and two way.  One benefit of parent to child is that there is no need to actually use the counting numbers $c$.

You can get the message-update rule for the parent-to-child algorithm by plugging the definition of the beliefs, (113) into the two sides of the marginalization constraint (117), and working out the math.

The proof that the fixed points of GBP are the same as the stationary points of the region-based free energy works by defining a second set of Lagrange multipliers that immediately give $b(x)$ as in (113) from (122), and then showing that the two sets of Lagrange multipliers define equivalent constraints.  To be a little more explicit: if you have a fixed point of GBP, it must satisfy the constraints of the original Lagrange multipliers, so it must satisfy the constraints of the second set of Lagrange multipliers, and (122) shows that $b(x)$ satisfies the stationarity conditions for the second optimization problem, which is equivalent to the original; hence you have a stationary point of the region-based free energy.  Going the other direction is similar.

Section X is very useful if you want to see what the GBP updates actually are.  It’s not hard to work them out from (114).

Lastly, section XI gives a nice example of a lattice network in which straightforward BP will give the correct labels if you just look at the arg max of the marginals, but will be overconfident, whereas GBP will give actual marginals that are very close to the exact marginals.

## June 26, 2009

### (Jun 25) Meeting Summary: Wainwright / Jordan, ch 4.3

Filed under: Uncategorized — umassliving @ 4:35 pm

Cleared up confusion over univariate versus multivariate sufficient statistics.  Key point was that $\Phi^i$ will be a vector, not a scalar.

“Marginalization” for a Gaussian means computing the mean and covariance; in general it means computing the mean parameters (for discrete random variables using the standard overcomplete sufficient statistics, this will coincide with our normal definition of marginalization).

Note there’s a missing $x$ in equation 4.63, the beginning should be $\exp(-\frac{1}{2}x^T \Sigma^{-1} x)$.  Also, starting on page 113 and repeating several other times, the paper references equation 4.48, but actually means equation 4.58.

Answer to the pre-meeting question of how we know that $\mathcal{L}(\phi, \Phi)$ is convex: for any given projection $\Pi^i$ for a fixed $i$, the set is convex because of the convexity of $\mathcal{M}$.  $\mathcal{L}$ is then just the intersection of all these convex sets, and hence is convex itself.

How do we get the term-by-term entropy approximation 4.68?  If you start with the definition of entropy as $- \int p(x) \ln p(x) dx$ and use equation 4.58 for $p(x)$, and in addition, ignore the log partition function, you can derive 4.68, but this is a mistake, because the log partition function ties all the components together, so you can’t decompose the probability.  The approximation, then, is to assume that the distribution factorizes as $p(x;\theta,\tilde{\theta}) = p(x;\theta,\vec{0}) \prod_i p(x; \theta, \tilde{\theta}^i) / p(x; \theta, \vec{0})$, and using that approximation, you can derive 4.68.
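Schematically, taking entropies of that factorized approximation gives the term-by-term form, where each entropy is that of a tractable distribution with the corresponding mean parameters:

```latex
H \;\approx\; H\bigl(p(x;\theta,\vec{0})\bigr)
  \;+\; \sum_i \Bigl[ H\bigl(p(x;\theta,\tilde{\theta}^i)\bigr)
  - H\bigl(p(x;\theta,\vec{0})\bigr) \Bigr]
```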

In the pre-meeting overview, there was the question of what would happen in example 4.9, showing the Bethe approximation as a special case of EP, if you took the sufficient statistics associated with one particular node $v$, $\tau_v$, and put those as a single element of the intractable component.  What I believe will happen is that you will lose the marginalization constraints for that particular node $v$ for all edges connected to $v$ ($\sum_{x_u} \tau_{uv} (x_u, x_v) = \tau_v(x_v)$).  In addition, the entropy approximation will change, since $H(\tau_{uv}) - H(\tau_u) - H(\tau_v)$ will no longer be equal to the mutual information due to the lack of the marginalization constraint.

Deriving equation 4.77: The key is how to take the derivative of $H(\tau)$.  Unlike in the proof of Theorem 4.2, we do not know the exact form of $\tau$ and so we can’t assume that $\tau$ are the marginals and use them in the standard entropy formula.  It turns out that the derivative of the entropy is the negative of the canonical parameters that correspond to $\tau$.  This is because (as you can read on page 68) the conjugate dual is the negative entropy, and $\nabla A^*$ gives the mapping from mean parameters to canonical parameters.  Also, you can prove this directly by letting $\theta'(\tau)$ be the canonical parameters associated with $\tau$, using the fact that $p(x) = \exp(\langle \theta'(\tau), \phi(x) \rangle - A(\theta'(\tau)))$, and take the derivative of $\int p(x) \ln p(x) dx$.  Remember that $\theta'(\tau)$ depends on $\tau$, so you’ll need to take that into account when you take the derivative with respect to $\tau$.
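The direct derivation sketched above, with $\theta'(\tau)$ the canonical parameters associated with $\tau$:

```latex
\begin{align*}
-H(\tau) &= \int p(x) \ln p(x)\, dx
  \;=\; \langle \theta'(\tau), \tau \rangle - A(\theta'(\tau)), \\
-\nabla_\tau H &= \theta'(\tau)
  + \bigl(\nabla_\tau \theta'\bigr)^{T} \tau
  - \bigl(\nabla_\tau \theta'\bigr)^{T} \nabla A\bigl(\theta'(\tau)\bigr)
  \;=\; \theta'(\tau),
\end{align*}
```

since $\nabla A(\theta'(\tau)) = \tau$; hence $\nabla_\tau H = -\theta'(\tau)$, as claimed.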

If anyone figures out how to derive equation 4.78, please put that in a comment.

One thing that wasn’t mentioned during the meeting, and which may either be important or obvious, is that $\lambda^i$ does not appear in the augmented distribution $q^i$.  Therefore, you can compute the expectation of $\phi(x)$ with respect to $q^i$ (which is $\eta_i$), and then adjust $\lambda^i$ so that the expected value of $\phi(x)$ is the same under base distribution $q$ without changing $q^i$, which would not be true if you adjusted any of the other $\lambda$'s.

Another question for thought: at the bottom of p123, it says the entropy $H(\tau(T),\tau_{uv})$ associated with the augmented distribution does not have an explicit form, but can be computed easily.  How would you compute it, and why can’t you do the same thing for the original entire distribution?

In the middle of p124, there is a typo – after the product $\prod_{(s,t) \neq (u,v)}$, to the right of that, it should be $\lambda^{st}$, not $\lambda^{uv}$.

## June 22, 2009

### (Jun 19) Meeting Summary: Wainwright / Jordan, ch 4.1.6 – 4.2 (incl)

Filed under: Uncategorized — umassliving @ 6:06 am

1. Ideas for paper(s) to read week of June 29 after finishing Chapter 4. Can be theoretical paper going more in depth on topics covered in Chapter 4 (e.g. read paper on generalized BP), theoretical paper extending Chapter 4 (e.g. Sudderth paper on Bethe approximation as lower bound in certain attractive graphical models), or application paper (e.g. vision paper applying generalized BP).
2. Ideas for topics to cover in fall semester. Some potential ideas: MCMC methods for inference, learning theory.

Meeting notes:

The reparameterization 4.27 in Proposition 4.3 is only possible because of the overcomplete set of indicator function sufficient statistics. Does using an even more redundant set of sufficient statistics lead to the existence of more reparameterizations and hence more local optima? The Bethe approximation does not give a lower or upper bound in general, unlike the mean field approximation, but in certain cases it can be convexified (Wainwright [246]) to give an upper bound, and in some models it is a lower bound (Sudderth [224]).

Clarified definition of hypergraphs, and noted that edges in hypergraph diagrams (e.g. Fig. 4.4) denote subset inclusion and are used for generalized belief propagation; they are not hyperedges themselves.

Using 4.41, we can see 4.42 can be put into a similar form as the distribution given by the junction tree, 2.12 on p32, by multiplying and dividing by $\phi_h$ as necessary to get the marginals $\mu_h$ for the maximal hyperedges in the numerator and $\mu_h$ for the intersections between maximal hyperedges (separator sets in the junction tree) in the denominator.

Be able to derive 4.45 from 4.44 and 4.42.

Generalized BP is to the Kikuchi approximation as ordinary sum product is to the Bethe approximation. Kikuchi gives a better approximation both to the set of realizable mean parameters $\mathcal{M}$ and to the conjugate dual (negative entropy) $A^*$.

On p103, in the center equation array, middle line, $\omega(e,f)$ should be $\omega(f,e)$.

### (Jun 12) Meeting Summary: Wainwright / Jordan, ch 4 – 4.1.5 (incl)

Filed under: Uncategorized — umassliving @ 6:01 am

Discussed the problems with solving the variational form of $A(\theta)$, namely that the space of mean parameters is hard to characterize explicitly and that, for most distributions, $A^*(\mu)$ does not have an explicit form. Noted that the Bethe Variational Principle (4.16) is an approximation to $A(\theta)$ obtained by utilizing a simple-to-characterize polyhedral outer bound $\mathbb{L}(G)$ (to approximate $\mathbb{M}(G)$) and applying the Bethe entropy approximation (to approximate $A^*(\mu)$). Theorem 4.2 shows that a Lagrangian method for solving the BVP is equivalent to the sum-product updates, which provides a principled basis for applying the sum-product algorithm to graphs with cycles, namely as a means of optimizing the Bethe objective function. Note that, as expected, $A_{\mathrm{Bethe}}(\theta) = A(\theta)$ for tree-structured graphs, since sum-product is an exact inference algorithm for tree-structured graphs.
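For reference, the Bethe Variational Principle (4.16) for a pairwise MRF:

```latex
A_{\mathrm{Bethe}}(\theta) \;=\; \max_{\tau \in \mathbb{L}(G)}
  \Bigl\{ \langle \theta, \tau \rangle
  + \sum_{s \in V} H_s(\tau_s)
  - \sum_{(s,t) \in E} I_{st}(\tau_{st}) \Bigr\}
```

where $\mathbb{L}(G)$ imposes only nonnegativity, normalization, and pairwise marginalization consistency on the pseudomarginals $\tau$.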

Note that the right hand side of Equation 4.17 should be reversed, $\sum \tau_s - 1$, to be consistent with a positive $\lambda_{SS}$ in Equation 4.21a.

### (May 28) Meeting Summary: Wainwright / Jordan, Chapter 1-3 review

Filed under: Uncategorized — umassliving @ 5:59 am

In short,

• chapter 1 gives an overview of variational versus MCMC inference and outlines the remaining 9 chapters.
• chapter 2 covers a brief overview of graphical models, discussing (1) directed, undirected and factor graphs, and their Markov properties (2) what is inference, (3) several applications of graphical models including bioinformatics, computer vision, and text, (4) the sum-product algorithm and its exact form for tree-structured graphs and approximate form for graphs with cycles, (5) the junction tree algorithm which is an exact inference algorithm for general graphs (although potentially computationally expensive) and the subsequent need for approximations.
• chapter 3 covers (1) the basics of the exponential family of distributions including several examples, (2) the role of forward/backward mapping between canonical and mean parameters in inference problems, (3) properties of the log-partition function, $A$, its conjugate dual, $A^*$, and the space of mean parameters, $\mathcal{M}$, and (4) conjugate duality. The most important aspect of chapter 3 is setting up the variational representation (equation 3.45 on pg. 68) which all forthcoming inference algorithms attempt to approximate.

Chapter 2 is a necessarily brief overview of graphical models, good references include:

• Chapter 8 of “Pattern Recognition and Machine Learning”, by Chris Bishop (chapter available online). Additionally, chapter 10 covers variational methods and chapter 11 covers MCMC.
• Chapter 2 of Erik Sudderth’s Ph.D. thesis [pdf]
• “Graphical Models,” by Lauritzen
• “Probabilistic Graphical Models,” by Koller and Friedman

### (May 20) Meeting Summary: Wainwright / Jordan, p58-74

Filed under: Uncategorized — umassliving @ 5:56 am

Definition of cumulant? Be able to prove the second derivative of $A$ as equation 3.41b in Proposition 3.1. Why does a positive definite Hessian mean the function is convex? Why is the gradient map $\nabla A$ onto the interior of $\mathcal{M}$, even for discrete random variables?
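A minimal 1-D numeric check of Proposition 3.1 (the value of $\theta$ is made up): for a Bernoulli variable, $A(\theta) = \log(1 + e^\theta)$, and the second derivative of $A$ equals the variance of the sufficient statistic; a positive (semi)definite Hessian of this kind is what makes $A$ convex.

```python
import math

# Bernoulli in canonical form: A(theta) = log(1 + e^theta).
# Proposition 3.1 says A''(theta) = Var[X] = mu (1 - mu), the covariance
# of the sufficient statistic. Check by finite differences.

def A(theta):
    return math.log(1.0 + math.exp(theta))

theta = 0.7
h = 1e-4
second_fd = (A(theta + h) - 2.0 * A(theta) + A(theta - h)) / (h * h)

mu = math.exp(theta) / (1.0 + math.exp(theta))   # mean parameter, A'(theta)
var = mu * (1.0 - mu)                            # Var[X] for a Bernoulli
```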

### (May 14) Meeting Summary: Wainwright / Jordan, p51-74

Filed under: Uncategorized — umassliving @ 5:55 am

Discussed mean parameterization, the set of realizable mean parameters $\mathcal{M}$, the example of $\mathcal{M}$ for a Gaussian, the convexity of $\mathcal{M}$, $\mathcal{M}$ as a convex hull for discrete random variables, and the equivalent representation as the intersection of a finite collection of half-spaces.
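A toy empirical check of the half-space view (this example is mine, not from the meeting): for two binary variables with $\phi(x) = (x_1, x_2, x_1 x_2)$, the convex hull of the four vectors $\phi(x)$ is cut out by four half-spaces, and the moments of any distribution satisfy them:

```python
import itertools
import random

# For phi(x) = (x1, x2, x1*x2) over x in {0,1}^2, the marginal polytope is
# {mu : mu12 >= 0, mu12 >= mu1 + mu2 - 1, mu12 <= mu1, mu12 <= mu2}.
# Check that the moments of random distributions always satisfy these.

random.seed(0)
configs = list(itertools.product([0, 1], repeat=2))
all_ok = True
for _ in range(100):
    w = [random.random() for _ in range(4)]
    z = sum(w)
    p = [wi / z for wi in w]          # random distribution over 4 configs
    mu1 = sum(pi * x[0] for pi, x in zip(p, configs))
    mu2 = sum(pi * x[1] for pi, x in zip(p, configs))
    mu12 = sum(pi * x[0] * x[1] for pi, x in zip(p, configs))
    ok = (mu12 >= -1e-12
          and mu12 <= min(mu1, mu2) + 1e-12
          and mu12 >= mu1 + mu2 - 1.0 - 1e-12)
    all_ok = all_ok and ok
```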

### (May 7) Meeting Summary: Wainwright / Jordan, p39-74

Filed under: Uncategorized — umassliving @ 5:53 am

Reviewed exponential family basics, including regular families (families for which the domain $\Omega$ is an open set) and overcomplete representations, using the example of a multinomial distribution. Covered details on the Ising model, its extension to multinomial variables, and the Gaussian MRF (focusing on $\Theta$ as the negative definite precision matrix), and the dimensions of such families.

### (Apr 30) Meeting Summary: Wainwright / Jordan, p29-66

Filed under: Uncategorized — umassliving @ 5:52 am

Clarified definitions of clique tree and junction tree. Showed how to derive 2.12 for the simple example of a directed tree. Proved 3.4 using Lagrange multipliers, from Cover and Thomas, Information Theory, Chapter 11. Digression on definition of sufficient statistics. Why “is this distribution a member of the exponential family” is not the right question to ask.

