# UMass LIVING

## July 15, 2009

### July 16 3-5 pm: Pre-Meeting Overview, Chapter 6 – Variational Methods in Parameter Estimation

Filed under: Uncategorized — umassliving @ 7:48 pm
Tags: ,

This chapter describes several methods for estimating the parameter theta from observed data. Section 6.1 and 6.2 focus on this problem under the assumption that theta is fixed but unknown and section 6.3 views theta as a random variable with an associated probability distribution.

The first section describes a common approach for estimation parameters, maximizing the log likelihood of the data. Unfortunately, doing this can be computationally difficult, so two different solutions are provided.  The first is a closed form function for triangulated graphs only and the second is an iterative method for arbitrary graphs based on coordinate ascent.

The next section describes a different setting, where the data is only partially observed (noisy).  The EM algorithm is commonly used in this setting to calculate the maximum likelihood estimate of theta.  This section first presents the exact form of EM for exponential families and then presents ‘variational’ EM for use when the mean parameters are not known exactly.

The last section focuses Variational Bayes, which is a method for estimating theta where theta is thought of as a random variable with a probability distribution.

## July 8, 2009

### (Jul 9 3-5 PM) Pre-Meeting Overview of Chapter 5 – Mean Field Method

Filed under: Uncategorized — umassliving @ 4:59 pm
Tags: ,

This week’s chapter introduces Mean Field approaches to approximating $A(\theta)$. The general idea is to optimize the search for the best approximation from a tractable graph $F$ rather than the whole graph $G$.

According to Proposition 5.1, any mean parameter $\mu$ gives a lower bound on $A(\theta)$. Mean Field approaches try to find the best (tightest) lower bound as shown in Equation 5.8. They also present an alternative view of this approximation using KL divergence where minimizing $D(\mu || \theta)$ (bottom of page 133) is the same as Equation 5.8.

The Naive Mean Field Algorithm chooses the product distribution as the tractable distribution (that is, a graph with no edges). Section 5.4 describes the noncovexity of the Mean Field approach which may give local minima. Structured Mean Field approaches use a graph with more structure as the tractable distribution (such as a tree).

Some things to examine closely:

• Proposition 5.1 and equation 5.8 and the relationship between 5.8 and KL divergence on the bottom of page 133.
• Why is Mean Field guaranteed to give a lower bound to $A(\theta)$ ?
• The update rules in Example 5.2
• Example 5.4. Note that on the bottom of page 138, $\theta_{12} = \frac{1}{2}$ and not $\frac{1}{4}$
• Understand how Naive Mean Field is a special case of Structured Mean Field as well as understand Structured Mean Field in general.

## June 30, 2009

### (Jul 2 3-5pm) Pre-meeting Overview: “GBP” & “EP for LDA”

Filed under: Uncategorized — umassliving @ 2:19 pm
Tags: ,

Paper 1: “Constructing Free Energies and the Generalized Belief Propagation Algorithm”, Yedida, Freeman & Weiss

Sections I and II are an intro and overview, covering the basics of factor graphs and BP.  Useful to review, although we probably won’t spend any time on it.

Section III and IV give the approximation framework – think about how this relates to the variational framework given in Wainwright and Jordan.  For instance, what corresponds to $U(b)$ and $H(b)$?

Section V puts the Bethe approximation in this framework and covers some of its properties, and Section VI covers the correspondence between Bethe and BP.  Useful for review (although the subsections dealing with hard constraints may not be as important) but we probably won’t spend any time on these either.

Section VII builds on III and IV, and corresponds to the hypergraph section of Wainwright and Jordan, VIII discusses some details of the region graph method, IX gives the details of GBP, X is a detailed example of GBP, and XI discusses the accuracy of GBP versus BP.

Appendices A, B, and C talk about some specific variants and special cases of the region graph method, and puts everything together in a Venn diagram.  Good things to know, but we might not have time to discuss.

Summary:

Sections III, IV, VII – XI – read carefully, we will focus on these sections

Sections I, II, V, VI – should mostly be review, we’ll skip these sections

Appendices A-C – will cover if we have time

Paper 2: “Expectation Propagation for the Generative Aspect Model”, Minka & Lafferty

The generative aspect model, is the Latent Dirichlet Allocation (LDA) [paper, code] model we covered in Wainwright & Jordan (fig 2.7 on pg. 21 and example 3.5 on pg. 47). All the factors in the model are discrete and thus the model is a member of the exponential family as discussed in detail on pg. 47. This paper tackles the two problems of inference (i.e. marginalization to compute the document probability using (3)) and learning (i.e. estimating the parameters of the model given a collection of documents). Note that we have not yet covered learning in Wainwright & Jordan (thats the topic of chapter 6, after we cover the mean field algorithms in chapter 5), thus for the purpose of this meeting please pay particular attention to the inference (section 4.2). The most important sections for our purposes are 3 and 4.2 and how they relate to section 4.3 of Wainwright and Jordan, so spend the majority of your time understanding the connections between the two papers.

Summary:

Sections 3 & 4.2: Read carefully. We’ll spend the most time on this. I added a link to the original EP paper in case section 3 is too brief.

Sections 6 & 7: Results and conclusions. We’ll spend some time on this.

Sections 1 & 2: Overview of LDA, we’ll spend very little time on this

Section 4.1: The original variational inference algorithm for the LDA, read briefly, we’ll cover in more detail in chapter 5 of Wainwright & Jordan.

Section 5: Covers learning parameters, read briefly, we’ll cover in more detail in chapter 6 of Wainwright & Jordan

## June 22, 2009

### (Jun 25 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 4.3

Filed under: Uncategorized — umassliving @ 6:14 pm
Tags: ,

This week’s reading is on expectation propagation (EP), another method for approximating the negative entropy $A^\ast$ and the set of realizable mean parameters $\mathcal{M}$.

The idea behind EP is to partition the sufficient statistics into a tractable component and an intractable component, such that it is possible to exactly compute marginals in polynomial time for the distribution associated with the tractable component, as well as the distributions associated with the tractable component combined with any one element of the intractable component.

This partitioning leads to an approximation of $\mathcal{M}$ as $\mathcal{L}$ given in Equation 4.67 (how do we know that $\mathcal{L}$ is convex?), and an approximation of the entropy as $H_{ep}$ given in Equation 4.68.  As before, we can apply the Lagrangian method to the resulting constrained optimization problem, and derive the moment matching EP updates (Fig 4.7).

Important things to follow:

• The running example of the mixture model – Example 4.8, Example 4.12
• How the Bethe approximation can be seen as a specific case of EP and sum-product as a case of moment matching – Example 4.9, Example 4.10
• Deriving the EP updates – Section 4.3.2
• Tree-structured EP – Example 4.11

The Bethe approximation is a special case of EP where all the sufficient statistics associated with the nodes are put into the tractable component, and each sufficient statistic associated with an edge is an element of the intractable component.  How would $H_{ep}$ and $\mathcal{L}$ change if, instead, we took the sufficient statistic associated with one particular node, removed it from the tractable component, and made it a separate element of the intractable component?