UMass LIVING

July 23, 2009

July 22 10:30 am – 12:30 pm: Meeting Summary (1) Optimization of SMF Objectives (Bouchard-Cote, Jordan), (2) EP for Generative Aspect Model (Minka, Lafferty)

Filed under: Uncategorized — umassliving @ 10:05 am

(1) Optimization of SMF Objectives (Bouchard-Cote, Jordan)

This paper discusses the Structured Mean Field (SMF) approach for inference in graphical models. We have covered SMF in Section 5.5 of Wainwright-Jordan, and we also looked at Naive Mean Field (NMF) in Section 5.3. Recall that in Mean Field methods, we break a graphical model into “tractable” subgraph(s) and then do approximate inference based on the subgraph(s) and their properties. In NMF, the subgraph is simply the graph with no edges; in SMF, some edges are kept while still keeping the subgraph tractable. The main objective of this paper is to examine the tractability of the SMF method, which depends on the choice of subgraph: certain graph properties of the chosen subgraph(s) determine how tractable the SMF updates are.

Specifically, if the subgraph is v-acyclic, the computation can be done efficiently; if it is b-acyclic, the computation is much more involved. A v-acyclic (“very acyclic”) subgraph is one in which adding any single edge from the original graph still results in an acyclic graph. If adding some edge from the original graph to the tractable subgraph creates a cycle, the subgraph is b-acyclic (“barely acyclic”).
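A concrete way to read these definitions, sketched in Python (our own illustration, not code from the paper): an edge added to an acyclic subgraph creates a cycle exactly when its two endpoints already lie in the same connected component of the subgraph.

def connected_components(nodes, edges):
    """Union-find over edges; returns a dict mapping each node to a component id."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return {v: find(v) for v in nodes}

def classify_subgraph(nodes, graph_edges, subgraph_edges):
    """Return 'v-acyclic' if no missing edge closes a cycle in the subgraph, else 'b-acyclic'."""
    comp = connected_components(nodes, subgraph_edges)
    missing = set(graph_edges) - set(subgraph_edges)
    return "b-acyclic" if any(comp[u] == comp[v] for u, v in missing) else "v-acyclic"

# Example: a 4-cycle.  Keeping the chain 0-1-2-3 leaves the missing edge (0, 3)
# inside a single component, so that subgraph is b-acyclic; keeping only the two
# disjoint edges (0, 1) and (2, 3) leaves every missing edge between components.
nodes = [0, 1, 2, 3]
graph_edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
print(classify_subgraph(nodes, graph_edges, [(0, 1), (1, 2), (2, 3)]))  # b-acyclic
print(classify_subgraph(nodes, graph_edges, [(0, 1), (2, 3)]))          # v-acyclic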

Note the following dimensionalities:

|\mathcal{M}_{MF}| = |\mathcal{M}| = |\mu| = |\theta| = |E[\Phi(Y_\omega)]| = |F| = (Number of sufficient statistics in original distribution)

|\mathcal{N}| = |\tau| = |\omega| = |F'| = (Number of sufficient statistics in tractable distribution)

The embedding function \Gamma maps from the space of \tau to the space of “missing” or “ignored” sufficient statistics.

Proof that A^*(\mu) = A^*(\tau):
To be added

Eqn (6) in this paper is the same as Eqn (5.8) in Wainwright-Jordan. We are looking to find the sup (maximum) value of this expression, and that value will be our approximation of the log partition function A(\theta). The maximum value is an approximation and not the exact log partition function because our search space is not all valid values of \mu for the original distribution, but only the ones that can be realized by the tractable distribution.
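Written out in the notation of Wainwright-Jordan (a paraphrase, just to make the restriction explicit):

A(\theta) = \sup_{\mu \in \mathcal{M}} \{ \langle \theta, \mu \rangle - A^*(\mu) \} \geq \sup_{\mu \in \mathcal{M}_F} \{ \langle \theta, \mu \rangle - A^*(\mu) \},

where \mathcal{M}_F \subseteq \mathcal{M} is the set of mean parameters realizable by the tractable distribution. The right-hand side is what Eqn (6) optimizes, which is why its value is only a lower bound on A(\theta).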

Section 2.4 takes the partial derivative of (6) to find the maximum. The embedding Jacobian J is defined as the first-order partial derivative of \Gamma w.r.t. \tau.
The maximizer of Eqn (6) is then found to satisfy
\tau = \nabla A(\omega + J(\tau)\vartheta)
The vector \tau can be broken down into (\tau^{(1)}, \tau^{(2)}, \ldots, \tau^{(k)}), where (1), \ldots, (k) index the k disjoint connected components (subgraphs) into which the original graph has been decomposed.
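Schematically, the fixed-point condition above can be applied as block updates over these connected components (a sketch with grad_A, J, and vartheta left abstract; this is our own illustration of the iteration, not the paper's code):

import numpy as np

def smf_block_updates(tau, omega, vartheta, grad_A, J, components, num_sweeps=100):
    """Repeatedly enforce tau = grad_A(omega + J(tau) vartheta), one component at a time.

    grad_A maps canonical to mean parameters of the tractable family, J(tau) is
    the embedding Jacobian, and vartheta holds the canonical parameters of the
    missing potentials."""
    tau = np.array(tau, dtype=float)
    for _ in range(num_sweeps):
        for c in components:                      # c = index set of one connected component
            update = grad_A(omega + J(tau) @ vartheta)
            tau[c] = update[c]                    # update this block, hold the rest fixed
    return tau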

Section 3.3 shows that the embedding Jacobian is easy to compute when the subgraph in question is v-acyclic. Note that this section assumes that the exponential family is expressed in an overcomplete representation. Here f ranges over the statistics in the tractable graph and g over the statistics that are not in the tractable graph. The v-acyclic property ensures that when we add any external edge (from the g set) to the connected component in question, the two endpoints of the edge are independent under the tractable distribution (there is no path between them). This independence makes computing the J matrix easy for v-acyclic graphs. The update of \tau^{(c)} in Prop 5 thus takes O(|F'^{(c)}|) time.

For b-acyclic graphs, adding an edge to the connected component f could create a loop; the two endpoints of the edge are then not independent, and we need the chain rule to compute J. The complexity of this computation is
O(|E'| \times |F'| \times |F\backslash F'|), which is large.
The alternative approach is to view each missing statistic g as defining an exponential family in its own right; calculating the J matrix in this setting reduces the cost to O(|E'| \times |F\backslash F'|), a considerable reduction over the naive approach.

The paper also makes a connection between the v-acyclic case and Gibbs sampling.

Results:

Choosing a b-acyclic subgraph for SMF in a given graphical model achieves much better accuracy (Figure 3).

Convergence time: v-acyclic SMF takes an order of magnitude more time than NMF, and b-acyclic SMF takes two orders of magnitude more time than v-acyclic SMF (Figure 4), as expected from the theory in Section 3.

(2) EP for Generative Aspect Model (Minka, Lafferty)

This paper discusses the use of EP for the Generative Aspect model (which is the same as the LDA model). It is thus related to Section 6 of the Wainwright-Jordan paper. The paper compares learning using Variational Inference (VB) and Expectation Propagation (EP); results show EP is more accurate than VB.

In the Generative Aspect model (LDA), \lambda is the Dirichlet-distributed variable (with parameter \alpha) that specifies the distribution of the “aspect” variable a. We cannot use EM directly by treating \lambda as the hidden variable, because the E step requires expectations over the posterior for \lambda, which cannot be computed exactly.

Alternate approaches:

(1) Maximize the Variational Inference likelihood in Eqn (8)

(2) “Variational EM”:

E step – Use an approximate posterior for \lambda. Calculating this can be done using the EP algorithm of section 4.2 (Eqns (15) and (21))

M step – Maximize the lower bound on the log-likelihood (Eq (29)). (29) can be obtained by applying the inequality in Eqn (6) to the likelihood in (26) and then taking the log. The solution for the M step is in Eqn (31). The appendix has a more accurate solution based on the Taylor expansion (Eqn (34)).

It may be noted that approach (1) could be taken for EP as well (find the parameters that result in the largest possible EP estimate of the likelihood), but the authors make the point that EP’s approximation is close to the true likelihood in an average sense, so maximizing EP’s “best approximation on average” will not necessarily yield the maximum of the original likelihood (see the paragraph just before Section 5.2). Hence (2) is preferred for EP.
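A schematic of approach (2) as a loop (a sketch only; ep_posterior and m_step are placeholders for the computations in Eqns (15)/(21) and (31), not functions from the paper):

def variational_em(docs, init_params, ep_posterior, m_step, num_iters=50):
    """Alternate an approximate E step with an M step on a lower bound of the log-likelihood."""
    params = init_params
    for _ in range(num_iters):
        # E step: approximate the posterior over lambda for each document with EP.
        posteriors = [ep_posterior(doc, params) for doc in docs]
        # M step: maximize the lower bound on the log-likelihood (Eqn (29));
        # the paper's closed-form solution is Eqn (31).
        params = m_step(docs, posteriors)
    return params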

Results:

(a) Synthetic data –

Using the VB approach (1), the estimated likelihood resembles the max approximation of the true likelihood. Using the EP approach (2), the estimated likelihood resembles the true likelihood (Figure 1)

VB tends to choose extreme parameters which can explain the data, while EP tends to reflect the true parameters that generated the data.

(b) Real data (TREC)

EP does better than VB, but the difference in performance between them is not very large (as can be seen from the difference in the perplexity curves of EP and VB in Figure 3; lower perplexity is better). A possible explanation is that in the data set used there was little overlap between the aspects (the topics had little in common), so the true distributions of the aspects are sharply peaked. This means the difference between the max approximation and the exact likelihood is not as large as in Figure 1, and the EP and VB approximations are therefore close to each other, with EP still doing slightly better.


July 20, 2009

July 22 10:30 am – 12:30 pm: Pre-Meeting Overview – (1) Optimization of SMF Objectives (Bouchard-Cote, Jordan), (2) EP for Generative Aspect Model (Minka, Lafferty)

Filed under: Uncategorized — umassliving @ 12:07 pm

(1) Optimization of SMF Objectives (Bouchard-Cote, Jordan)

This paper discusses the Structured Mean Field (SMF) approach for inference in graphical models. We have covered SMF in Section 5.5 of Wainwright-Jordan, and we also looked at Naive Mean Field (NMF) in Section 5.3. Recall that in Mean Field methods, we break a graphical model into “tractable” subgraph(s) and then do approximate inference based on the subgraph(s) and their properties. In NMF, the subgraph is simply the graph with no edges; in SMF, some edges are kept while still keeping the subgraph tractable. The main objective of this paper is to examine the tractability of the SMF method, which depends on the choice of subgraph: certain graph properties of the chosen subgraph(s) determine how tractable the SMF updates are.

Specifically, if the subgraph is v-acyclic, the computation can be done efficiently; if it is b-acyclic, the computation is much more involved. A v-acyclic (“very acyclic”) subgraph is one in which adding any single edge from the original graph still results in an acyclic graph. If adding some edge from the original graph to the tractable subgraph creates a cycle, the subgraph is b-acyclic (“barely acyclic”).

Points to consider are:

Sections 2.1 and 2.2 cover the basics, which we now know from Wainwright-Jordan. The only addition is the second part of Eq 3: the Hessian of the log partition function is the covariance of the sufficient statistics.
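Explicitly, this is the standard exponential-family identity

\nabla^2 A(\theta) = \mathbb{E}_\theta[\phi(X)\phi(X)^T] - \mathbb{E}_\theta[\phi(X)]\,\mathbb{E}_\theta[\phi(X)]^T = \mathrm{Cov}_\theta[\phi(X)],

so the second derivative of the log partition function is the covariance matrix of the sufficient statistics.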

Section 2.3 is also mostly covered in Wainwright-Jordan, except for the part starting in the last paragraph of Column 1. Here \Gamma maps to the “missing potential” expectations; this shows that the optimization can be done in the smaller space of tractable potentials, while ignoring the “missing potentials”.

(a) Think about the dimensions of the spaces in which \tau and \theta live.

Section 2.4 gives the formula for the Mean Field updates (top of page 4). Remember that \tau are the mean parameters of the tractable subfamily. The formula involves a term J, which is defined in Definition 2.

Section 3.1 explains what v-acyclic and b-acyclic mean. We will talk about it very briefly.

Section 3.2 shows how the Mean Field update formula can be further broken down into multiple subgraphs (connected components). It is these subgraphs and their properties that determine the complexity of the update equation.

Section 3.3 shows that the updates are easy to compute if the subgraph (connected component) is v-acyclic (shown by considering three subcases). Note that 1[f=(a,s)] is an indicator function.

(b) What does g = (e, (s,t)) mean?

(c) It is important to see why the equation in Proposition 5 does not depend on \tau^{(c)}.

Section 3.4 shows how the update equations are equivalent to Gibbs sampling. We will talk about this.

(d) Hopefully students with experience in Gibbs sampling can help the rest make the connection.

Section 3.5 shows how the update equation is not as easy to solve in case of b-acyclic subgraphs.

(e) This is probably the most confusing part of the paper; we will probably spend a good amount of time going through this section.

Results

(f) Choosing a b-acyclic subgraph for SMF in a given graphical model achieves much better accuracy (Figure 3).

(g) Convergence time: v-acyclic SMF takes an order of magnitude more time than NMF, and b-acyclic SMF takes two orders of magnitude more time than v-acyclic SMF (Figure 4), as expected from the theory in Section 3.

(2) EP for Generative Aspect Model (Minka, Lafferty)

This paper discusses the use of EP for a Generative Aspect model (which is the same as an LDA model (?)). It is thus related to Section 6 of the Wainwright-Jordan paper. The paper compares Variational Inference (VB) and Expectation Propagation (EP); results show EP is more accurate than VB.

The main objective of the discussion is to understand how model parameters are learned for exponential families. We will mainly be looking at Section 5 of the paper, but Section 5 is short, and understanding Section 4 will help us go through it more meaningfully. Section 4 was covered in last fortnight’s meeting; since it is important, we will brush up on its main points.

We will briefly go over the model itself (Section 2); we should not spend too much time on this section during the meeting. Section 3 discusses EP for the GA model. Again, we will review this quickly because it was done earlier. We should now be able to relate the “inclusion-deletion” discussed in this section to the EP algorithm as explained in Section 4 (Figure 4.7).

Section 5 will be the main discussion point.

(a) Why not use the EM algorithm in a straightforward manner (treating \lambda as a hidden variable)? This point is mentioned briefly after Eq 26; we should understand why this is so.

Section 5.1 gives one possible learning algorithm, which can be used for the VB approach. Points to think about:

(b) How can Eqs 27 and 28 be understood as EM? This point is mentioned in paragraph 2 of section 5.1

(c) Why can the same approach not be used for EP? See the last few lines of section 5.1.

Section 5.2 gives the learning algorithm for the EP approach (can be used for VB too (?)).

(d) How is Eq 29 the log likelihood in this model?

(e) We should understand what Eqs 31 – 34 mean. It might be helpful to revisit section 4.2 to understand what each of the RHS quantities mean.

Results show that EP does significantly better than VB. Points:

(f) For synthetic data, EP likelihood resembles the exact likelihood. VB likelihood resembles the max likelihood (Figure 1). (Briefly, we will talk about what exact and max likelihoods are)

(g) For real data, EP has lower “perplexity” than VB (Figure 3). (Briefly, what is “perplexity”?)


July 15, 2009

July 16 3-5 pm: Pre-Meeting Overview, Chapter 6 – Variational Methods in Parameter Estimation

Filed under: Uncategorized — umassliving @ 7:48 pm

This chapter describes several methods for estimating the parameter \theta from observed data. Sections 6.1 and 6.2 focus on this problem under the assumption that \theta is fixed but unknown, while section 6.3 views \theta as a random variable with an associated probability distribution.

The first section describes a common approach to estimating parameters: maximizing the log likelihood of the data. Unfortunately, doing this can be computationally difficult, so two different solutions are provided: a closed-form solution for triangulated graphs only, and an iterative method for arbitrary graphs based on coordinate ascent.
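To see where the difficulty comes from (a standard exponential-family calculation, stated here for reference): with fully observed data and empirical mean parameters \hat{\mu}, the scaled log likelihood is

\ell(\theta) = \langle \theta, \hat{\mu} \rangle - A(\theta), \qquad \nabla \ell(\theta) = \hat{\mu} - \nabla A(\theta),

so the maximum likelihood estimate must satisfy the moment-matching condition \nabla A(\hat{\theta}) = \hat{\mu}. Evaluating \nabla A(\theta) is itself an inference problem, which is why a closed form is available only for triangulated graphs and an iterative scheme is needed in general.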

The next section describes a different setting, where the data is only partially observed (noisy).  The EM algorithm is commonly used in this setting to calculate the maximum likelihood estimate of \theta.  This section first presents the exact form of EM for exponential families and then presents ‘variational’ EM for use when the mean parameters cannot be computed exactly.
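The lower bound EM works with has the usual form (standard, included here for reference): for any distribution q over the unobserved variables x,

\log p(y; \theta) \geq \mathbb{E}_q[\log p(y, x; \theta)] + H(q),

with equality when q is the exact posterior p(x | y; \theta). “Variational” EM replaces this exact posterior, whose mean parameters may be intractable, with the best approximation from a tractable (mean field) family.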

The last section focuses on Variational Bayes, a method for estimating \theta in which \theta is treated as a random variable with a probability distribution.

July 9, 2009

(Jul 9 3-5 PM) Meeting Summary – Chapter 5: Mean Field Methods

Filed under: Uncategorized — umassliving @ 6:41 pm

Mean field methods differ from those discussed in chapter 4 in that \mathcal{M} is approximated using a tractable (nonconvex) inner bound for which A^* is easy to characterize. They use the notion of a tractable subgraph to restrict the set of sufficient statistics by setting the corresponding canonical parameters to 0. Setting certain canonical parameters to zero is equivalent to adding certain constraints between the mean parameters (which have an obvious form in the naive mean field case).

We discussed the notions of tractable subgraphs, the lower bound property of mean field, and its interpretations as maximizing the bound and minimizing the KL divergence (the reverse form of that used in EP). We worked through example 5.2 for naive mean field and example 5.4 for the nonconvexity of the objective function, and we worked out \mathbb{P}[X_1 = X_2] to show why the fraction in \theta(q) must be 1/2 instead of 1/4. In example 5.4, we clarified the difference between convex sets and convex functions and how figure 5.2(b) depicts sets. We discussed the extreme points of \mathcal{M}, how they arise, and why they must be contained in \mathcal{M}_{F}, which explains why \mathcal{M}_{F} \subset \mathcal{M} can’t be convex. We also discussed structured mean field as a way to use arbitrary tractable subgraphs as opposed to just fully disconnected subgraphs (as in naive mean field). The constraints between the mean parameters take a form that depends on the subgraph and are thus represented generically as g_{\alpha}.
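As a reminder of what the example 5.2 updates look like in practice, here is a minimal sketch (our own code, written from the standard naive mean field coordinate update for an Ising model on \{0,1\} variables, \mu_s \leftarrow \sigma(\theta_s + \sum_{t \in N(s)} \theta_{st}\mu_t)):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def naive_mean_field_ising(theta_node, theta_edge, num_sweeps=100):
    """Coordinate ascent on the naive mean field objective for an Ising model.

    theta_node: length-n vector of singleton parameters; theta_edge: symmetric
    n x n matrix of pairwise parameters (zero on the diagonal and for non-edges)."""
    n = len(theta_node)
    mu = np.full(n, 0.5)                  # initial mean parameters
    for _ in range(num_sweeps):
        for s in range(n):                # update one coordinate at a time
            mu[s] = sigmoid(theta_node[s] + theta_edge[s] @ mu)
    # mu[s] approximates the marginal P(X_s = 1) under the product approximation.
    # Different initializations can converge to different local optima, which is
    # exactly the nonconvexity illustrated in example 5.4.
    return mu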

Next time we will briefly make sure that everyone understood examples 5.5 and 5.6.

July 8, 2009

(Jul 9 3-5 PM) Pre-Meeting Overview of Chapter 5 – Mean Field Method

Filed under: Uncategorized — umassliving @ 4:59 pm

This week’s chapter introduces Mean Field approaches to approximating A(\theta). The general idea is to search for the best approximation over a tractable subgraph F rather than the whole graph G.

According to Proposition 5.1, any mean parameter \mu gives a lower bound on A(\theta). Mean Field approaches try to find the best (tightest) such lower bound, as shown in Equation 5.8. The chapter also presents an alternative view of this approximation using KL divergence, where minimizing D(\mu || \theta) (bottom of page 133) is the same as Equation 5.8.
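To see the equivalence (a short derivation in our own words, not a quote from the book): for a mean parameter \mu realized by the tractable family, the KL divergence between the corresponding distribution and p(x; \theta) is

D(\mu \| \theta) = A(\theta) + A^*(\mu) - \langle \theta, \mu \rangle,

and since A(\theta) does not depend on \mu, minimizing this over the tractable mean parameters is the same as maximizing \langle \theta, \mu \rangle - A^*(\mu), which is Equation 5.8.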

The Naive Mean Field Algorithm chooses the product distribution as the tractable distribution (that is, a graph with no edges). Section 5.4 describes the nonconvexity of the Mean Field approach, which may lead to local optima. Structured Mean Field approaches use a graph with more structure as the tractable distribution (such as a tree).

Some things to examine closely:

  • Proposition 5.1, equation 5.8, and the relationship between 5.8 and the KL divergence on the bottom of page 133.
  • Why is Mean Field guaranteed to give a lower bound to A(\theta) ?
  • The update rules in Example 5.2
  • Example 5.4. Note that on the bottom of page 138, \theta_{12} = \frac{1}{2} and not \frac{1}{4}
  • Understand how Naive Mean Field is a special case of Structured Mean Field as well as understand Structured Mean Field in general.

(Jul 2) Meeting Summary: “GBP” & “EP for LDA”

Filed under: Uncategorized — umassliving @ 3:36 pm

“Expectation Propagation for the Generative Aspect Model”, Minka

This summary paper contains a very intuitive explanation of EP and covers most of the high-level material that was presented. More specifically, it covers the KL divergence view of EP: approximating an intractable distribution p(x) with a tractable member of the exponential family, q(x), by minimizing KL(p || q). Note that minimizing this form of the KL divergence when q(x) is a member of the exponential family is equivalent to moment matching.
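Concretely (a standard calculation, included for completeness): writing q(x) = \exp(\langle \theta, \phi(x) \rangle - A(\theta)), only the second term of KL(p \| q) = \mathbb{E}_p[\log p] - \mathbb{E}_p[\log q] depends on \theta, and

\nabla_\theta KL(p \| q) = -\mathbb{E}_p[\phi(x)] + \nabla A(\theta) = \mathbb{E}_q[\phi(x)] - \mathbb{E}_p[\phi(x)],

so setting the gradient to zero gives exactly the moment-matching condition \mathbb{E}_q[\phi(x)] = \mathbb{E}_p[\phi(x)].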

The mean field methods that we will cover in chapter 5 minimize the reverse, KL(q || p), which has different properties. Note that a large number of message-passing algorithms can be viewed as minimizing information divergence (paper).

We covered the EP algorithm for the LDA model, where the prior Dir(\lambda |\alpha) was the tractable component and each of the p(w | \lambda) was an intractable component; each is a mixture of multinomials that was approximated with a Dirichlet, making q(\lambda) a product of Dirichlet distributions, which is itself a Dirichlet (with updated parameters).

To put this in the Wainwright variational framework, the \tilde{t}_w basically correspond to the Lagrange multipliers \lambda^i in Equation 4.77 on page 119 in Wainwright.  Multiplying p with all the \tilde{t}_w’s gives us the base distribution 4.77.  We then divide out by one specific \tilde{t}_w (i.e. the \lambda^i that isn’t in the summation in 4.78), and multiply in the corresponding exact t_w, which gives us the augmented distribution in 4.78.  We compute the expectation of the sufficient statistics under this distribution, and then go back and modify the \tilde{t}_w which we divided out so that the base distribution and augmented distribution have matching moments.
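The divide/multiply/moment-match cycle can be written as a small schematic loop (our own sketch, with the factor operations left abstract; these are not the paper's actual update equations):

def ep_pass(q, t_tilde, t_exact, divide, multiply, moment_match, words):
    """One EP sweep: for each word, remove its approximate factor, put back the
    exact factor, project onto the tractable (Dirichlet) family, and refresh
    the approximate factor so that q keeps matching moments."""
    for w in words:
        cavity = divide(q, t_tilde[w])            # base distribution without factor w
        augmented = multiply(cavity, t_exact[w])  # augmented distribution (cf. 4.78)
        q = moment_match(augmented)               # project back into the family
        t_tilde[w] = divide(q, cavity)            # updated approximate factor
    return q, t_tilde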

We briefly covered the steps for the EP algorithm; I (Moe) have worked out the updates in each step (they are too much to type up), so let me know if you get stuck on any of them.

Notice that the sufficient statistic for a Dirichlet is \phi(x) = \log(x), but in the paper they match the mean and variance instead (more specifically, the first and second moments which is equivalent to the mean and variance). This is because matching \phi(x) = \log(x) is intractable and so the mean and variance is a tractable approximation. More info can be found here.
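For reference, the Dirichlet moments being matched are the standard ones (not specific to the paper): if \lambda \sim Dir(\alpha) with \alpha_0 = \sum_k \alpha_k, then

\mathbb{E}[\lambda_k] = \frac{\alpha_k}{\alpha_0}, \qquad \mathrm{Var}[\lambda_k] = \frac{\alpha_k(\alpha_0 - \alpha_k)}{\alpha_0^2(\alpha_0 + 1)},

whereas matching the sufficient statistic itself would mean matching \mathbb{E}[\log \lambda_k] = \psi(\alpha_k) - \psi(\alpha_0), which is the intractable matching mentioned above.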

“Constructing Free Energies and the Generalized Belief Propagation Algorithm”, Yedidia, Freeman & Weiss

This paper gives an alternative view on inference in terms of free energies that we can map back into the framework of Wainwright/Jordan that we’ve been studying.  We should first understand that the sufficient statistics are the standard over-complete indicator functions, one set for each f_a, one for each possible assignment to the subset of variables that participate in the function, e.g. x_a.  The corresponding canonical parameters are then the corresponding values of f_a(x_a) for each specific assignment to x_a.

With this notation, we can now rewrite the distribution p(x) in equation (1) in the canonical form given in Wainwright as p(x) = \exp(\langle \theta, \phi(x) \rangle - A(\theta)).  Now we can see that A(\theta) is the same as \ln Z.  In other words, the Helmholtz free energy F_H is just the negative of the log partition function.

Now consider a trial distribution b(x).  Using the definition of KL divergence in equation (18), it is straightforward to derive equation (17) (and worth doing so for your own edification).  Since we know the KL divergence is always non-negative and only equal to zero when the two distributions are the same, we now have a variational representation of the Helmholtz free energy as the minimum over trial distributions of the variational free energy (Gibbs free energy), which is equal to the variational average energy minus the variational entropy.  If we could solve this exactly, we would find the parameters that make the trial distribution equal to the actual distribution.  Of course, we will generally be working with a simpler family of trial distributions, since the original distribution is assumed to be intractable, so we won’t be able to make the KL divergence go to zero.
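The derivation is short (sketched here in our own words): with p(x) = \frac{1}{Z} \prod_a f_a(x_a) and E(x) = -\sum_a \ln f_a(x_a), we have \ln p(x) = -E(x) - \ln Z, so

KL(b \| p) = \sum_x b(x) \ln \frac{b(x)}{p(x)} = \mathbb{E}_b[E(x)] - H(b) + \ln Z = G(b) - F_H,

which is non-negative and zero only at b = p, giving F_H = \min_b G(b) with G(b) = U(b) - H(b).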

Note that U(b) is the negative of \langle \theta, \tau \rangle, and H(b) is an approximation to the negative of the conjugate dual A^*(\mu).  Thus we have just rewritten the constrained maximization problem of Wainwright as a constrained minimization problem.

One unresolved question is what exactly is meant by “cycle” in Proposition 8 (clearly it can’t mean a directed cycle since the region graph is acyclic due to the way the edges are directed), does it simply mean undirected cycles in the region graph?  Wainwright gives a much more complicated definition of an acyclic hypergraph (equivalent to a region graph) in terms of junction trees, so it’s not clear if this is the same as a cycle in the region graph or not.

It is worth working through the example after Proposition 8.

Note there are three possible ways to do generalized belief propagation – parent to child (same as what was presented in Wainwright), child to parent, and two way.  One benefit of parent to child is that there is no need to actually use the counting numbers c.

You can get the message-update rule for the parent-to-child algorithm by plugging the definition of the beliefs, (113) into the two sides of the marginalization constraint (117), and working out the math.

The proof that the fixed points of GBP are the same as the stationary points of the region-based free energy is done by defining a second set of Lagrange multipliers that immediately give b(x) as in (113) from (122), and then showing that the two sets of Lagrange multipliers define equivalent constraints.  To be a little more explicit, if you have a fixed point of GBP, then it must satisfy the constraints of the original Lagrange multipliers, so it must satisfy the constraints of the second set of Lagrange multipliers, and (122) shows b(x) satisfies the stationarity conditions for the second optimization problem, which is equivalent to the original; hence you have a stationary point of the region-based free energy.  Going the other direction is similar.

Section X is very useful if you want to see what the GBP updates actually are.  It’s not hard to work them out from (114).

Lastly, section XI gives a nice example of a lattice network in which straightforward BP will give the correct labels if you just look at the arg max of the marginals, but will be overconfident, whereas GBP will give actual marginals that are very close to the exact marginals.

June 30, 2009

(Jul 2 3-5pm) Pre-meeting Overview: “GBP” & “EP for LDA”

Filed under: Uncategorized — umassliving @ 2:19 pm

Paper 1: “Constructing Free Energies and the Generalized Belief Propagation Algorithm”, Yedidia, Freeman & Weiss

Sections I and II are an intro and overview, covering the basics of factor graphs and BP.  Useful to review, although we probably won’t spend any time on it.

Sections III and IV give the approximation framework – think about how this relates to the variational framework given in Wainwright and Jordan.  For instance, what corresponds to U(b) and H(b)?

Section V puts the Bethe approximation in this framework and covers some of its properties, and Section VI covers the correspondence between Bethe and BP.  Useful for review (although the subsections dealing with hard constraints may not be as important) but we probably won’t spend any time on these either.

Section VII builds on III and IV, and corresponds to the hypergraph section of Wainwright and Jordan, VIII discusses some details of the region graph method, IX gives the details of GBP, X is a detailed example of GBP, and XI discusses the accuracy of GBP versus BP.

Appendices A, B, and C talk about some specific variants and special cases of the region graph method, and puts everything together in a Venn diagram.  Good things to know, but we might not have time to discuss.

Summary:

Sections III, IV, VII – XI – read carefully, we will focus on these sections

Sections I, II, V, VI – should mostly be review, we’ll skip these sections

Appendices A-C – will cover if we have time

Paper 2: “Expectation Propagation for the Generative Aspect Model”, Minka & Lafferty

The generative aspect model is the Latent Dirichlet Allocation (LDA) [paper, code] model we covered in Wainwright & Jordan (fig 2.7 on pg. 21 and example 3.5 on pg. 47). All the factors in the model are discrete and thus the model is a member of the exponential family, as discussed in detail on pg. 47. This paper tackles the two problems of inference (i.e. marginalization to compute the document probability using (3)) and learning (i.e. estimating the parameters of the model given a collection of documents). Note that we have not yet covered learning in Wainwright & Jordan (that’s the topic of chapter 6, after we cover the mean field algorithms in chapter 5), so for the purpose of this meeting please pay particular attention to the inference (section 4.2). The most important sections for our purposes are 3 and 4.2 and how they relate to section 4.3 of Wainwright and Jordan, so spend the majority of your time understanding the connections between the two papers.

Summary:

Sections 3 & 4.2: Read carefully. We’ll spend the most time on this. I added a link to the original EP paper in case section 3 is too brief.

Sections 6 & 7: Results and conclusions. We’ll spend some time on this.

Sections 1 & 2: Overview of LDA, we’ll spend very little time on this

Section 4.1: The original variational inference algorithm for the LDA, read briefly, we’ll cover in more detail in chapter 5 of Wainwright & Jordan.

Section 5: Covers learning parameters, read briefly, we’ll cover in more detail in chapter 6 of Wainwright & Jordan

June 26, 2009

(Jun 25) Meeting Summary: Wainwright / Jordan, ch 4.3

Filed under: Uncategorized — umassliving @ 4:35 pm

Cleared up confusion over univariate versus multivariate sufficient statistics.  Key point was that \Phi^i will be a vector, not a scalar.

“Marginalization” for a Gaussian means computing mean and covariance, in general it means computing the mean parameters (for discrete random variables using the standard overcomplete sufficient statistics, this will coincide with our normal definition of marginalization).

Note there’s a missing x in equation 4.63, the beginning should be \exp(-\frac{1}{2}x^T \Sigma^{-1} x).  Also, starting on page 113 and repeating several other times, the paper references equation 4.48, but actually means equation 4.58.

Answer to the pre-meeting question of how we know that \mathcal{L}(\phi, \Phi) is convex: for any given projection \Pi^i for a fixed i, the set is convex because of the convexity of \mathcal{M}.  \mathcal{L} is then just the intersection of all these convex sets, and hence is convex itself.

How do we get the term-by-term entropy approximation 4.68?  If you start with the definition of entropy as - \int p(x) \ln p(x) dx and use equation 4.58 for p(x), and in addition, ignore the log partition function, you can derive 4.68, but this is a mistake, because the log partition function ties all the components together, so you can’t decompose the probability.  The approximation, then, is to assume that the distribution factorizes as p(x;\theta,\tilde{\theta}) = p(x;\theta,\vec{0}) \prod_i p(x; \theta, \tilde{\theta}^i) / p(x; \theta, \vec{0}), and using that approximation, you can derive 4.68.

In the pre-meeting overview, there was the question of what would happen in example 4.9, showing the Bethe approximation as a special case of EP, if you took the sufficient statistics associated with one particular node v, \tau_v, and put those as a single element of the intractable component.  What I believe will happen is that you will lose the marginalization constraints for that particular node v for all edges connected to v (\sum_{x_u} \tau_{uv} (x_u, x_v) = \tau_v(x_v)).  In addition, the entropy approximation will change, since H(\tau_{uv}) - H(\tau_u) - H(\tau_v) will no longer be equal to the mutual information due to the lack of the marginalization constraint.

Deriving equation 4.77: The key is how to take the derivative of H(\tau).  Unlike in the proof of Theorem 4.2, we do not know the exact form of \tau and so we can’t assume that \tau are the marginals and use them in the standard entropy formula.  It turns out that the derivative of the entropy is the negative of the canonical parameters that correspond to \tau.  This is because (as you can read on page 68) the conjugate dual is the negative entropy, and \nabla A^* gives the mapping from mean parameters to canonical parameters.  Also, you can prove this directly by letting \theta'(\tau) be the canonical parameters associated with \tau, using the fact that p(x) = \exp(\langle \theta'(\tau), \phi(x) \rangle - A(\theta'(\tau))), and take the derivative of \int p(x) \ln p(x) dx.  Remember that \theta'(\tau) depends on \tau, so you’ll need to take that into account when you take the derivative with respect to \tau.
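In symbols, the argument above is just

A^*(\tau) = -H(\tau) \quad \text{and} \quad \nabla A^*(\tau) = \theta'(\tau) \quad \Longrightarrow \quad \nabla_\tau H(\tau) = -\theta'(\tau),

i.e. the gradient of the entropy at \tau is the negative of the canonical parameters \theta'(\tau) that realize \tau.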

If anyone figures out how to derive equation 4.78, please put that in a comment.

One thing that wasn’t mentioned during the meeting, and which may either be important or obvious, is that \lambda^i does not appear in the augmented distribution q^i.  Therefore, you can compute the expectation of \phi(x) with respect to q^i (which is \eta_i), and then adjust \lambda^i so that the expected value of \phi(x) is the same under base distribution q without changing q^i, which would not be true if you adjusted any of the other \lambda‘s.

Another question for thought: at the bottom of p123, it says the entropy H(\tau(T),\tau_{uv}) associated with the augmented distribution does not have an explicit form, but can be computed easily.  How would you compute it, and why can’t you do the same thing for the original entire distribution?

In the middle of p124, there is a typo – after the product \prod_{(s,t) \neq (u,v)}, to the right of that, it should be \lambda^{st}, not \lambda^{uv}.

June 22, 2009

(Jun 25 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 4.3

Filed under: Uncategorized — umassliving @ 6:14 pm

This week’s reading is on expectation propagation (EP), another method for approximating the negative entropy A^\ast and the set of realizable mean parameters \mathcal{M}.

The idea behind EP is to partition the sufficient statistics into a tractable component and an intractable component, such that it is possible to exactly compute marginals in polynomial time for the distribution associated with the tractable component, as well as the distributions associated with the tractable component combined with any one element of the intractable component.

This partitioning leads to an approximation of \mathcal{M} as \mathcal{L} given in Equation 4.67 (how do we know that \mathcal{L} is convex?), and an approximation of the entropy as H_{ep} given in Equation 4.68.  As before, we can apply the Lagrangian method to the resulting constrained optimization problem, and derive the moment matching EP updates (Fig 4.7).

Important things to follow:

  • The running example of the mixture model – Example 4.8, Example 4.12
  • How the Bethe approximation can be seen as a specific case of EP and sum-product as a case of moment matching – Example 4.9, Example 4.10
  • Deriving the EP updates – Section 4.3.2
  • Tree-structured EP – Example 4.11

To think about:

The Bethe approximation is a special case of EP where all the sufficient statistics associated with the nodes are put into the tractable component, and each sufficient statistic associated with an edge is an element of the intractable component.  How would H_{ep} and \mathcal{L} change if, instead, we took the sufficient statistic associated with one particular node, removed it from the tractable component, and made it a separate element of the intractable component?

(Jun 19) Meeting Summary: Wainwright / Jordan, ch 4.1.6 – 4.2 (incl)

Filed under: Uncategorized — umassliving @ 6:06 am

Things to be thinking about:

  1. Ideas for paper(s) to read week of June 29 after finishing Chapter 4. Can be theoretical paper going more in depth on topics covered in Chapter 4 (e.g. read paper on generalized BP), theoretical paper extending Chapter 4 (e.g. Sudderth paper on Bethe approximation as lower bound in certain attractive graphical models), or application paper (e.g. vision paper applying generalized BP).
  2. Ideas for topics to cover in fall semester. Some potential ideas: MCMC methods for inference, learning theory.

Meeting notes:

The reparameterization 4.27 in Proposition 4.3 is only possible because of the overcomplete set of indicator function sufficient statistics. Does using an even more redundant set of sufficient statistics lead to the existence of more reparameterizations and hence more local optima? The Bethe approximation does not give a lower or upper bound in general, unlike the mean field approximation, but in certain cases it can be convexified (Wainwright [246]) to give an upper bound, and in some models it is a lower bound (Sudderth [224]).

Clarified definition of hypergraphs, and noted that edges in hypergraph diagrams (e.g. Fig. 4.4) denote subset inclusion and are used for generalized belief propagation, they are not hyperedges themselves.

Using 4.41, we can see 4.42 can be put into a similar form as the distribution given by the junction tree, 2.12 on p32, by multiplying and dividing by \phi_h as necessary to get the marginals \mu_h for the maximal hyperedges in the numerator and \mu_h for the intersections between maximal hyperedges (separator sets in the junction tree) in the denominator.

Be able to derive 4.45 from 4.44 and 4.42.

Generalized BP is to the Kikuchi approximation as ordinary sum product is to the Bethe approximation. Kikuchi gives a better approximation both to the set of realizable mean parameters \mathcal{M} and to the conjugate dual (negative entropy) A^*.

On p103, in the center equation array, middle line, \omega(e,f) should be \omega(f,e).

