UMass LIVING

June 30, 2009

(Jul 2 3-5pm) Pre-meeting Overview: “GBP” & “EP for LDA”

Filed under: Uncategorized — umassliving @ 2:19 pm

Paper 1: “Constructing Free-Energy Approximations and Generalized Belief Propagation Algorithms”, Yedidia, Freeman & Weiss

Sections I and II are an intro and overview, covering the basics of factor graphs and BP.  Useful to review, although we probably won’t spend any time on it.

Sections III and IV give the approximation framework – think about how this relates to the variational framework given in Wainwright and Jordan.  For instance, what corresponds to U(b) and H(b)?

Section V puts the Bethe approximation in this framework and covers some of its properties, and Section VI covers the correspondence between Bethe and BP.  Useful for review (although the subsections dealing with hard constraints may not be as important) but we probably won’t spend any time on these either.

Section VII builds on III and IV, and corresponds to the hypergraph section of Wainwright and Jordan, VIII discusses some details of the region graph method, IX gives the details of GBP, X is a detailed example of GBP, and XI discusses the accuracy of GBP versus BP.

Appendices A, B, and C talk about some specific variants and special cases of the region graph method, and put everything together in a Venn diagram.  Good things to know, but we might not have time to discuss.

Summary:

Sections III, IV, VII – XI – read carefully, we will focus on these sections

Sections I, II, V, VI – should mostly be review, we’ll skip these sections

Appendices A-C – will cover if we have time

Paper 2: “Expectation Propagation for the Generative Aspect Model”, Minka & Lafferty

The generative aspect model is the Latent Dirichlet Allocation (LDA) [paper, code] model we covered in Wainwright & Jordan (fig 2.7 on pg. 21 and example 3.5 on pg. 47). All the factors in the model are discrete, and thus the model is a member of the exponential family, as discussed in detail on pg. 47. This paper tackles two problems: inference (i.e. marginalization to compute the document probability using (3)) and learning (i.e. estimating the parameters of the model given a collection of documents). Note that we have not yet covered learning in Wainwright & Jordan (that's the topic of chapter 6, after we cover the mean field algorithms in chapter 5), so for the purposes of this meeting please pay particular attention to inference (section 4.2). The most important sections for our purposes are 3 and 4.2 and how they relate to section 4.3 of Wainwright and Jordan, so spend the majority of your time understanding the connections between the two papers.
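As a reminder of what that inference problem looks like (my notation, which may differ from the paper's equation (3)): with Dirichlet-distributed topic proportions \lambda and per-topic word distributions p(w \mid a), the probability of a document d with word counts n_w is

p(d \mid \alpha, \beta) = \int \mathrm{Dir}(\lambda \mid \alpha) \prod_{w \in d} \Big( \sum_a \lambda_a \, p(w \mid a, \beta) \Big)^{n_w} d\lambda

and the integral over \lambda couples all the words together, which is what makes exact marginalization intractable and motivates the EP (and variational) approximations.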

Summary:

Sections 3 & 4.2: Read carefully. We’ll spend the most time on this. I added a link to the original EP paper in case section 3 is too brief.

Sections 6 & 7: Results and conclusions. We’ll spend some time on this.

Sections 1 & 2: Overview of LDA; we'll spend very little time on this.

Section 4.1: The original variational inference algorithm for LDA. Read briefly; we'll cover it in more detail in chapter 5 of Wainwright & Jordan.

Section 5: Covers learning the parameters. Read briefly; we'll cover it in more detail in chapter 6 of Wainwright & Jordan.


June 26, 2009

(Jun 25) Meeting Summary: Wainwright / Jordan, ch 4.3

Filed under: Uncategorized — umassliving @ 4:35 pm

Cleared up confusion over univariate versus multivariate sufficient statistics.  Key point was that \Phi^i will be a vector, not a scalar.

“Marginalization” for a Gaussian means computing the mean and covariance; in general it means computing the mean parameters (for discrete random variables using the standard overcomplete sufficient statistics, this coincides with our usual definition of marginalization).
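For concreteness (a standard fact, not from the chapter): if the joint is Gaussian with a block partition

x = (x_A, x_B) \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}, \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix} \right), \qquad \text{then} \qquad x_A \sim \mathcal{N}(\mu_A, \Sigma_{AA}),

so “marginalizing” amounts to reading off the relevant blocks of the mean vector and covariance matrix, i.e. computing the mean parameters of the marginal.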

Note there’s a missing x in equation 4.63: the beginning should be \exp(-\frac{1}{2}x^T \Sigma^{-1} x).  Also, starting on page 113 and repeated several other times, the text references equation 4.48 but actually means equation 4.58.

Answer to the pre-meeting question of how we know that \mathcal{L}(\phi, \Phi) is convex: for any given projection \Pi^i with a fixed i, the corresponding constraint set is convex because of the convexity of \mathcal{M}; \mathcal{L} is then just the intersection of all these convex sets, and hence is convex itself.

How do we get the term-by-term entropy approximation 4.68?  If you start with the definition of entropy as - \int p(x) \ln p(x) dx, use equation 4.58 for p(x), and in addition ignore the log partition function, you can derive 4.68, but that step is not justified, because the log partition function ties all the components together, so the probability doesn't actually decompose.  The approximation, then, is to assume that the distribution factorizes as p(x;\theta,\tilde{\theta}) = p(x;\theta,\vec{0}) \prod_i p(x; \theta, \tilde{\theta}^i) / p(x; \theta, \vec{0}), and using that approximation you can derive 4.68.
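A sketch of that derivation (roughly in the chapter's notation): plugging the factorization above into - \int p(x) \ln p(x) dx and evaluating each term under its own distribution gives

H_{ep}(\tau, \tilde{\tau}) = H(\tau) + \sum_{i=1}^{d_I} \left[ H(\tau, \tilde{\tau}^i) - H(\tau) \right],

i.e. the entropy of the base distribution plus, for each intractable component, the change in entropy from augmenting the base distribution with that one component, which is exactly the term-by-term form of 4.68.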

In the pre-meeting overview, there was the question of what would happen in example 4.9, showing the Bethe approximation as a special case of EP, if you took the sufficient statistics associated with one particular node v, \tau_v, and put those as a single element of the intractable component.  What I believe will happen is that you will lose the marginalization constraints for that particular node v for all edges connected to v (\sum_{x_u} \tau_{uv} (x_u, x_v) = \tau_v(x_v)).  In addition, the entropy approximation will change, since H(\tau_{uv}) - H(\tau_u) - H(\tau_v) will no longer be equal to the mutual information due to the lack of the marginalization constraint.

Deriving equation 4.77: The key is how to take the derivative of H(\tau).  Unlike in the proof of Theorem 4.2, we do not know the exact form of \tau, so we can't assume that the \tau are marginals and use them in the standard entropy formula.  It turns out that the derivative of the entropy is the negative of the canonical parameters that correspond to \tau.  This is because (as you can read on page 68) the conjugate dual is the negative entropy, and \nabla A^* gives the mapping from mean parameters to canonical parameters.  You can also prove this directly by letting \theta'(\tau) be the canonical parameters associated with \tau, using the fact that p(x) = \exp(\langle \theta'(\tau), \phi(x) \rangle - A(\theta'(\tau))), and taking the derivative of - \int p(x) \ln p(x) dx.  Remember that \theta'(\tau) depends on \tau, so you'll need to take that into account when you take the derivative with respect to \tau.
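Written out, that direct argument is short (my own sketch). Since p(x) = \exp(\langle \theta'(\tau), \phi(x) \rangle - A(\theta'(\tau))) has mean parameters \tau,

H(\tau) = - \int p(x) \ln p(x) dx = A(\theta'(\tau)) - \langle \theta'(\tau), \tau \rangle,

and differentiating with respect to \tau using the chain rule,

\nabla_\tau H(\tau) = \Big( \frac{\partial \theta'(\tau)}{\partial \tau} \Big)^{\top} \big( \nabla A(\theta'(\tau)) - \tau \big) - \theta'(\tau) = -\theta'(\tau),

because \nabla A(\theta'(\tau)) = \tau by definition of the forward mapping.  This is the same statement as \nabla A^*(\tau) = \theta'(\tau), since H(\tau) = -A^*(\tau).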

If anyone figures out how to derive equation 4.78, please put that in a comment.

One thing that wasn’t mentioned during the meeting, and which may either be important or obvious, is that \lambda^i does not appear in the augmented distribution q^i.  Therefore, you can compute the expectation of \phi(x) with respect to q^i (which is \eta_i), and then adjust \lambda^i so that the expected value of \phi(x) is the same under the base distribution q without changing q^i, which would not be true if you adjusted any of the other \lambda's.

Another question for thought: at the bottom of p123, it says the entropy H(\tau(T),\tau_{uv}) associated with the augmented distribution does not have an explicit form, but can be computed easily.  How would you compute it, and why can’t you do the same thing for the original entire distribution?

In the middle of p124, there is a typo – after the product \prod_{(s,t) \neq (u,v)}, to the right of that, it should be \lambda^{st}, not \lambda^{uv}.

June 22, 2009

(Jun 25 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 4.3

Filed under: Uncategorized — umassliving @ 6:14 pm

This week’s reading is on expectation propagation (EP), another method for approximating the negative entropy A^\ast and the set of realizable mean parameters \mathcal{M}.

The idea behind EP is to partition the sufficient statistics into a tractable component and an intractable component, such that it is possible to exactly compute marginals in polynomial time for the distribution associated with the tractable component, as well as the distributions associated with the tractable component combined with any one element of the intractable component.

This partitioning leads to an approximation of \mathcal{M} as \mathcal{L} given in Equation 4.67 (how do we know that \mathcal{L} is convex?), and an approximation of the entropy as H_{ep} given in Equation 4.68.  As before, we can apply the Lagrangian method to the resulting constrained optimization problem, and derive the moment matching EP updates (Fig 4.7).
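In symbols (roughly, in the chapter's notation), the approximate variational problem is

\max_{(\tau, \tilde{\tau}) \in \mathcal{L}(\phi, \Phi)} \; \langle \theta, \tau \rangle + \langle \tilde{\theta}, \tilde{\tau} \rangle + H_{ep}(\tau, \tilde{\tau}),

and the Lagrangian stationarity conditions reduce to the moment-matching condition that the base distribution q and each augmented distribution q^i agree on the expectations of the tractable sufficient statistics, \mathbb{E}_q[\phi(x)] = \mathbb{E}_{q^i}[\phi(x)], which is what the updates in Fig 4.7 enforce.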

Important things to follow:

  • The running example of the mixture model – Example 4.8, Example 4.12
  • How the Bethe approximation can be seen as a specific case of EP and sum-product as a case of moment matching – Example 4.9, Example 4.10
  • Deriving the EP updates – Section 4.3.2
  • Tree-structured EP – Example 4.11

To think about:

The Bethe approximation is a special case of EP where all the sufficient statistics associated with the nodes are put into the tractable component, and each sufficient statistic associated with an edge is an element of the intractable component.  How would H_{ep} and \mathcal{L} change if, instead, we took the sufficient statistic associated with one particular node, removed it from the tractable component, and made it a separate element of the intractable component?

(Jun 19) Meeting Summary: Wainwright / Jordan, ch 4.1.6 – 4.2 (incl)

Filed under: Uncategorized — umassliving @ 6:06 am

Things to be thinking about:

  1. Ideas for paper(s) to read week of June 29 after finishing Chapter 4. Can be theoretical paper going more in depth on topics covered in Chapter 4 (e.g. read paper on generalized BP), theoretical paper extending Chapter 4 (e.g. Sudderth paper on Bethe approximation as lower bound in certain attractive graphical models), or application paper (e.g. vision paper applying generalized BP).
  2. Ideas for topics to cover in fall semester. Some potential ideas: MCMC methods for inference, learning theory.

Meeting notes:

The reparameterization 4.27 in Proposition 4.3 is only possible because of the overcomplete set of indicator function sufficient statistics. Does using an even more redundant set of sufficient statistics lead to the existence of more reparameterizations and hence more local optima? The Bethe approximation does not give a lower or upper bound in general, unlike the mean field approximation, but in certain cases it can be convexified (Wainwright [246]) to give an upper bound, and in some models it is a lower bound (Sudderth [224]).

Clarified the definition of hypergraphs, and noted that the edges in hypergraph diagrams (e.g. Fig. 4.4) denote subset inclusion and are used for generalized belief propagation; they are not hyperedges themselves.

Using 4.41, we can see that 4.42 can be put into a form similar to the distribution given by the junction tree (2.12 on p32) by multiplying and dividing by \phi_h as necessary, so that the marginals \mu_h for the maximal hyperedges appear in the numerator and the \mu_h for the intersections between maximal hyperedges (the separator sets in the junction tree) appear in the denominator.
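For reference, the junction tree form 2.12 writes the distribution in terms of clique and separator marginals,

p(x) = \frac{\prod_{C \in \mathcal{C}} \mu_C(x_C)}{\prod_{S \in \mathcal{S}} \mu_S(x_S)},

and the manipulation above puts 4.42 into the analogous form, with maximal hyperedges playing the role of cliques and their intersections playing the role of separator sets.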

Be able to derive 4.45 from 4.44 and 4.42.

Generalized BP is to the Kikuchi approximation as ordinary sum-product is to the Bethe approximation. Kikuchi gives a better approximation both to the set of realizable mean parameters \mathcal{M} and to the conjugate dual (negative entropy) A^*.

On p103, in the center equation array, middle line, \omega(e,f) should be \omega(f,e).

(Jun 12) Meeting Summary: Wainwright / Jordan, ch 4 – 4.1.5 (incl)

Filed under: Uncategorized — umassliving @ 6:01 am

Discussed the problems with solving the variational form of A(\theta), namely that the space of mean parameters is hard to characterize explicitly and that, for most distributions, A^*(\mu) does not have an explicit form. Noted that the Bethe Variational Principle (4.16) is an approximation to A(\theta) obtained by using a simple-to-characterize polyhedral outer bound \mathbb{L}(G) (to approximate \mathbb{M}(G)) and applying the Bethe entropy approximation (to approximate A^*(\mu)). Theorem 4.2 shows that a Lagrangian method for solving the BVP is equivalent to the sum-product updates, which provides a principled basis for applying the sum-product algorithm to graphs with cycles, namely as a means of approximating the Bethe objective function. Note that, as expected, A_{\mathrm{Bethe}}(\theta) = A(\theta) for tree-structured graphs, since sum-product is an exact inference algorithm for tree-structured graphs.
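For reference, the Bethe Variational Principle is the relaxed problem

A_{\mathrm{Bethe}}(\theta) = \max_{\tau \in \mathbb{L}(G)} \left\{ \langle \theta, \tau \rangle + \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} I_{st}(\tau_{st}) \right\},

where the sum of singleton entropies minus edgewise mutual informations is the Bethe entropy approximation to -A^*(\tau), and \mathbb{L}(G) imposes only local normalization and marginalization constraints.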

Note that the right hand side of Equation 4.17 should be reversed, \sum \tau_s - 1, to be consistent with a positive \lambda_{SS} in Equation 4.21a.

(May 28) Meeting Summary: Wainwright / Jordan, Chapter 1-3 review

Filed under: Uncategorized — umassliving @ 5:59 am

In short,

  • chapter 1 covers an overview of variational versus MCMC inference and outlines the remaining 9 chapters.
  • chapter 2 covers a brief overview of graphical models, discussing (1) directed, undirected, and factor graphs and their Markov properties, (2) what inference is, (3) several applications of graphical models including bioinformatics, computer vision, and text, (4) the sum-product algorithm and its exact form for tree-structured graphs and approximate form for graphs with cycles, and (5) the junction tree algorithm, which is an exact inference algorithm for general graphs (although potentially computationally expensive), and the subsequent need for approximations.
  • chapter 3 covers (1) the basics of the exponential family of distributions including several examples, (2) the role of forward/backward mapping between canonical and mean parameters in inference problems, (3) properties of the log-partition function, A, its conjugate dual, A^*, and the space of mean parameters, \mathcal{M}, and (4) conjugate duality. The most important aspect of chapter 3 is setting up the variational representation (equation 3.45 on pg. 68, reproduced below) which all forthcoming inference algorithms attempt to approximate.
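That variational representation (3.45) is

A(\theta) = \sup_{\mu \in \mathcal{M}} \left\{ \langle \theta, \mu \rangle - A^*(\mu) \right\},

so each approximate inference method in the later chapters can be read as a particular choice of approximation to \mathcal{M}, to A^*, or to both.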

Chapter 2 is a necessarily brief overview of graphical models, good references include:

  • Chapter 8 of “Pattern Recognition and Machine Learning”, by Chris Bishop (chapter available online). Additionally, chapter 10 covers variational methods and chapter 11 covers MCMC.
  • Chapter 2 of Erik Sudderth’s Ph.D. thesis [pdf]
  • “Graphical Models,” by Lauritzen
  • “Probabilistic Graphical Models,” by Koller and Friedman

Regarding chapter 3, the appendices contain additional helpful information and proofs.

This page overviews each of the chapters in more detail.

(May 20) Meeting Summary: Wainwright / Jordan, p58-74

Filed under: Uncategorized — umassliving @ 5:56 am

Definition of cumulant? Be able to prove second derivative of A as equation 3.41b in Proposition 3.1. Why does positive definite Hessian mean function is convex? Why is the gradient map \nabla A onto the interior of \mathcal{M}, even for discrete random variables?
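As a reminder of the relevant facts from Proposition 3.1,

\nabla A(\theta) = \mathbb{E}_\theta[\phi(X)], \qquad \nabla^2 A(\theta) = \mathrm{Cov}_\theta[\phi(X)],

and since a covariance matrix is always positive semidefinite, the Hessian of A is positive semidefinite everywhere, which is exactly the second-order condition for convexity of A.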

(May 14) Meeting Summary: Wainwright / Jordan, p51-74

Filed under: Uncategorized — umassliving @ 5:55 am

Discussed mean parameterization, the set of realizable mean parameters \mathcal{M}, the example of \mathcal{M} for a Gaussian, the convexity of \mathcal{M}, \mathcal{M} as a convex hull for discrete random variables, and the equivalent representation as the intersection of a finite collection of half-spaces.
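In symbols, the set discussed was

\mathcal{M} = \left\{ \mu \in \mathbb{R}^d : \exists\, p \text{ such that } \mathbb{E}_p[\phi(X)] = \mu \right\},

which for discrete random variables is the convex hull of \{\phi(x)\} over all configurations x, and which in general can equivalently be written as an intersection of half-spaces.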

(May 7) Meeting Summary: Wainwright / Jordan, p39-74

Filed under: Uncategorized — umassliving @ 5:53 am

Reviewed exponential family basics, including regular families (families for which the domain \Omega is an open set) and overcomplete representations, using the example of a multinomial distribution. Covered details of the Ising model, its extension to multinomial variables, and the Gaussian MRF (focusing on \Theta as the negative definite precision matrix), and the dimensions of such families.
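For reference (standard forms, possibly differing from the book's exact parameterization), the Ising model and the Gaussian MRF are the exponential families

p(x; \theta) = \exp\Big( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t - A(\theta) \Big), \quad x \in \{0,1\}^m,

p(x; \theta, \Theta) = \exp\Big( \langle \theta, x \rangle + \tfrac{1}{2} \langle\langle \Theta, x x^{\top} \rangle\rangle - A(\theta, \Theta) \Big),

where \Theta is the negative definite matrix mentioned above (for a Gaussian with covariance \Sigma, \Theta = -\Sigma^{-1}).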

(Apr 30) Meeting Summary: Wainwright / Jordan, p29-66

Filed under: Uncategorized — umassliving @ 5:52 am

Clarified definitions of clique tree and junction tree. Showed how to derive 2.12 for the simple example of a directed tree. Proved 3.4 using Lagrange multipliers, from Cover and Thomas, Information Theory, Chapter 11. Digression on definition of sufficient statistics. Why “is this distribution a member of the exponential family” is not the right question to ask.

