# UMass LIVING

## July 31, 2009

### (Jul 30) Meeting Summary: Wainwright / Jordan, ch 7

Filed under: Uncategorized — umassliving @ 10:43 am
Tags: ,

Discussion from the meeting:

• Why is the Bethe approximation to the entropy $H_{Bethe}$ not generally concave?
• Since we know that $A^*$ is convex, and is the negative entropy, we know that the entropy is concave.  However, $H_{Bethe}$ does not generally correspond to a true entropy.  We assume that the edges form a tree and compute the entropy as if it were, subtracting on the mutual information on each edge, but this is not a true entropy unless the edges actually do form a tree, and hence is not generally concave.
• On page 167, in the discussion of the projection operation $\mapsto$, how do we know that $\mu(F) \in \mathcal{M}(F)$?
• To see this, we need to go back to the original definition of $\mathcal{M}$.  Assume $\mu \in \mathcal{M}$.  By the definition of $\mathcal{M}$, there exists some distribution $p$ such that $\mu_\alpha = \mathbb{E}_p[\phi_\alpha(x)]$ for all sufficient statistics indexed by $\alpha$, where the expectation is with respect to $p$.  Clearly then, we can use the exact same $p$ to show that $\mu(F) \in \mathcal{M}(F)$ since the requirements are the same, but only for a subset of the sufficient statistics $\mathcal{I}(F)$.
• Understand 7.2.1.  In particular, consider the distribution over spanning trees $\rho$ that places all its mass on one specific spanning tree.  This seems very similar to structured mean field, where we just use the one spanning tree as the tractable subfamily we optimize over.  Yet structured mean field gives us a lower bound on $A(\theta)$ whereas tree-reweighted Bethe gives us an upper bound.  What accounts for this difference?  For instance, why does structured mean field give a lower bound when we know that $A^*(\mu(F)) \leq A^*(\mu)$, and we are subtracting off $A^*$ in the objective function?
• In mean field, we have that $A^*(\tau) = A^*(\mu(F)) = A^*(\mu)$, as proven in last week’s meeting summary.  Understanding this is the key to understanding this issue.  In mean field, we assumed that the canonical parameters that were outside those defined for the tractable subgraph were 0.  Therefore, $A(\theta)$ is the same, whether you include the inner product with the extra parameters, which are all zero, or not.  The convex combination of spanning trees is different because of the projection operator $\mapsto$ above.  We are still searching across the full space of possible mean parameters, but when we compute entropy, we use the projection operator to essentially ignore all the mean parameters that are not in our subgraph.  This differs from mean field, where the mean parameters that are not in our subgraph are deterministic functions of the mean parameters that are in the subgraph, corresponding to the functions $g$ in Wainwright and $\Gamma$ in the structured mean field optimization paper.  This difference means that you are searching over an outer bound as opposed to an inner bound, and are using an upper bound to the objective function, thus you are guaranteed to get an upper bound to $A(\theta)$.
• There should be a $\rho(T')$ in front of the second inner product in the second equation on page 176.
• This is for the $T'$ they mention that has $\Pi^{T'}(\lambda)$ different from zero.  Additionally, if you are wondering why $A^*$ is strictly convex, it is a consequence of the fact that $A$ is strictly convex.  The details are in Appendix B.3.
• Understand the general ideas in 7.2.2 and 7.2.3, the exact details are not as important.  For instance, understand why $H_{ep}$ is generally not concave, but why using the re-weighting subject to $\sum_{\ell=1}^{d_I} \rho(\ell) = 1$ gives a concave function.
• Since entropy is concave, if we take a positive combination of entropy functions, that should also be concave.  In $H_{ep}$, we have a negative amount $(1-d_I) H(\tau)$, so it is not concave.  Re-weighting subject to $\sum_{\ell=1}^{d_I} \rho(\ell) = 1$, or actually, $\sum_{\ell=1}^{d_I} \rho(\ell) \leq 1$, will fix this problem, making the approximation concave.
• Understand the benefit of the convexified algorithms for algorithmic stability and what it means for an algorithm to be globally Lipschitz stable, why that is important, and how Figure 7.3 suggests that ordinary sum-product is not globally Lipschitz stable.
• One other important take-away from Figure 7.3 is why the approximations are better the lower the coupling strength (size of the parameter on the edges).  This is because, the lower the coupling strength, the more independent the individual nodes, which in the extreme case of a coupling strength of zero becomes a simple product distribution.  The higher the coupling strength, the more important the loops in the graph become, the more difficult inference becomes.
• Understand how the convexified algorithms can be used within learning in 7.5.  We use the fact that when $B$ upper bounds the true cumulant function, then the surrogate likelihood lower bounds the true likelihood.  Yet in chapter 6, we saw how using mean field within EM also gives us a lower bound on the true likelihood, despite the fact that mean field gives a lower bound on the cumulant function.  What accounts for this discrepancy?
• This is a check on your understanding of mean field within EM.  If you look at what mean field is approximating in EM, you should see why a lower bound on the cumulant gives a lower bound, whereas in 7.5 we need an upper bound on the cumulant to give a lower bound.
• Understand the intuitative interpretation of example 7.3.
• Another interesting point is the idea, in the final paragraph of the chapter, that it may be better to have an inconsistent estimator of the true parameter $\theta^*$, if your interest is in prediction (inference) and you must use an approximate inference algorithm.

Slightly more esoteric details:

• On page 184, why do $\mu_i$ show up along the diagonal of $N_1[\mu]$?  Hint: each random variable $X_i$ is binary valued.
• The diagonal elements are $\mathbb{E}[X_i^2]$.  Since $X_i$ is binary, we have $\mathbb{E}[X_i^2] = \sum_{0,1} X_i^2 p(X_i) = p(X_i = 1) = \mu_i$.
• Still unsure of whether you can go the other direction and show that $\text{cov}(X) \succeq 0$ implies $N_1[\mu] \succeq 0$.  The wording in Wainwright seems to imply that you can, but it’s not clear to me how to do that using the Schur complement formula as they claim.
• Update: Kim, Kojima, Yamashtia, “Second Order Cone Programming Relaxation of Positive Semidefinite Constraint” has a simple proof of this equivalence.  Let $A = \left( \begin{array}{cc} 1 & -\mu^T \\ 0 & \mathbf{I} \end{array} \right)$.  Then $B = A^T N_1[\mu] A$ where $B = \left( \begin{array}{cc} 1 & 0 \\ 0 & \mathbb{E}[XX^T] - \mu \mu^T \end{array} \right)$.  Clearly $B$ is positive semidefinite if and only if $\mathbb{E}[XX^T] - \mu \mu^T = \text{cov}(X) \succeq 0$, and since $A$ is nonsingular, $N_1[\mu]$ is psd if and only if $B$, and hence $\text{cov}(X)$, is psd.

## July 23, 2009

### (Jul 30 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 7

Filed under: Uncategorized — umassliving @ 4:26 pm
Tags: ,

The approximation methods we’ve covered in chapters 4 and 5 have both been nonconvex.  This chapter addresses the issue of how to create convex approximations, where both the set $\mathcal{L}$ that we use is convex and the objective function that we seek to maximize is concave.

7.1 deals with the generic framework for the convex approximations.  7.2 shows how to convexify the algorithms we saw in chapter 4, while 7.3 shows how to create new algorithms that are convex, not based on using convex combinations.  7.4 shows a separate benefit of convexity in terms of algorithmic stability, and 7.5 discusses how the convex approximations for inference fit in to learning.

Errata:

• There is a missing subscript in equation (7.1) on page 167, the expectation should be $\mathbb{E}[\phi_\alpha(X)]$.
• There should be (I believe) a $\rho(T)$ in front of the second inner product in the second equation on page 176.

• Why is the Bethe approximation to the entropy $H_{Bethe}$ not generally concave?
• On page 167, in the discussion of the projection operation $\mapsto$, how do we know that $\mu(F) \in \mathcal{M}(F)$?
• Understand 7.2.1.  In particular, consider the distribution over spanning trees $\rho$ that places all its mass on one specific spanning tree.  This seems very similar to structured mean field, where we just use the one spanning tree as the tractable subfamily we optimize over.  Yet structured mean field gives us a lower bound on $A(\theta)$ whereas tree-reweighted Bethe gives us an upper bound.  What accounts for this difference?  For instance, why does structured mean field give a lower bound when we know that $A^*(\mu(F)) \leq A^*(\mu)$, and we are subtracting off $A^*$ in the objective function?
• Convince yourself that you can obtain the tree-reweighted sum-product updates in equation (7.12) on page 172 by following the Lagrangian formulation of ordinary sum-product in chapter 4, and using the new messages $M_{ts}(x_s) = \exp(\frac{1}{\rho_{st}} \lambda_{st}(x_t))$.
• Understand the general ideas in 7.2.2 and 7.2.3, the exact details are not as important.  For instance, understand why $H_{ep}$ is generally not concave, but why using the re-weighting subject to $\sum_{\ell=1}^{d_I} \rho(\ell) = 1$ gives a concave function.
• Understand the general idea in 7.3.1.  The details are not as important, although for the interested, I have put some notes on this section below.
• Understand the benefit of the convexified algorithms for algorithmic stability and what it means for an algorithm to be globally Lipschitz stable, why that is important, and how Figure 7.3 suggests that ordinary sum-product is not globally Lipschitz stable.
• Understand how the convexified algorithms can be used within learning in 7.5.  We use the fact that when $B$ upper bounds the true cumulant function, then the surrogate likelihood lower bounds the true likelihood.  Yet in chapter 6, we saw how using mean field within EM also gives us a lower bound on the true likelihood, despite the fact that mean field gives a lower bound on the cumulant function.  What accounts for this discrepancy?
• Understand the intuitative interpretation of example 7.3.

Slightly more esoteric details:

• On page 184, why do $\mu_i$ show up along the diagonal of $N_1[\mu]$?  Hint: each random variable $X_i$ is binary valued.
• On page 184, why is the constraint $N_1[\mu] \succeq 0$ imply the constraint $\text{cov}(X) \succeq 0$?  Use the Schur complement formula and the stated fact that if $M$ is symmetric and positive-definite, then so is the Schur complement of $D$ in $M$.
• Here is a quick proof of that fact: we want to show that, if for all $x$, $x^T M x > 0$, then, for all $y$, $y^T S y > 0$ where $S$ is the Schur complement of $D$.  Split $x$ into two components, $x_1$ and $x_2$, with $x_1$ matching the size of $A$, and $x_2$ matching the size of $D$.  Then we have, $x^T M x = x_1^T A x_1 + x_2^T C x_1 + x_1^T B x_2 + x_2^T D x_2 > 0$.  Now substitute in $C = B^T$ since $M$ is symmetric, and assign $x_2 = -D^{-1} B^T x_1$.  Putting that in and simplifying, we get $x_1^T A x - x_1^T B D^{-1} B^T x_1 = x_1^T S x_1 > 0$, where $x_1$ was arbitrary, thus proving that $S$ is positive-definite.
• Now, how can we apply that to prove that $N_1[\mu] \succeq 0$ implies $\text{cov}(X) \succeq 0$?  If we let the single element 1 in the top left of $N_1[\mu]$ be $A$, and $\mathbb{E}[XX^T]$ in the bottom right be $D$, and take the Schur complement of $A$, we get the fact that, if $N_1[\mu] \succeq 0$, then $\mathbb{E}[XX^T] - \mu\mu^T \succeq 0$, and the second term is exactly $\text{cov}(X)$, thus proving the implication.
• One thing I am unsure of is whether you can go the other direction, and show that $\text{cov}(X) \succeq 0$ implies $N_1[\mu] \succeq 0$.  The wording in Wainwright seems to imply that you can, but it’s not clear to me how to do that using the Schur complement formula as they claim.
• The important thing is that we know that the constraint $N_1[\mu] \succeq 0$ is an outer bound to the marginal polytope $\mathbb{M}(K_m)$.  To show this, we use the fact that if there is a distribution $p(X)$ that realizes mean parameter $\mu$, then letting $Y$ be the random vector $(1, X)$, we have for any vector $a$, $a^T N_1[\mu] a = a^T \mathbb{E}[YY^T] a = \mathbb{E}[(a^T Y)^2] > 0$, proving that $N_1[\mu] \succeq 0$.
• So right now, we have $\mathbb{M}(K_m)$ is outer bounded by the constraint $N_1[\mu] \succeq 0$ which in turn is outer bounded by the constraint $\text{cov}(X) \succeq 0$, and uncertain as to whether the two constraints are equivalent or not.
• Why is it valid to replace the quantity within the determinant on page 184, $\text{cov}(X)$ to $N_1[\tau]$, as in equation (7.22) on page 185?  Again we use the Schur complement formula, and the property that $\det(M) = \det(A)\det(S)$ where $S$ is the Schur complement of $D$ in $M$ (see properties of determinant).  Using this and the fact above that by setting $A$ as the single element 1 in the top right corner of $N_1[\mu]$, the Schur complement of $N_1[\mu]$ is $\text{cov}(X)$ shows that the substitution is valid.
• Why does setting canonical parameters as in (7.28) lead to mean parameters matching the empirical mean parameters?  The paper Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching gives a proof that the optimal $\tau$ returned by the modified sum-product algorithm satisfies an admissibility condition that $\langle \theta, \phi(x) \rangle + \text{constant} = \sum \log \tau_s(x_s) + \sum \rho_{st} \log \frac{\tau_{st}(x_s, x_t)}{\tau_s(x_s) \tau_t(x_t)}$.  Using the $\tilde{\theta}$ given in (7.28), we can see that $\hat{\mu}$ satisfies this condition, and since the solution is unique, we have $\hat{\mu} = \tau$.
• On page 193, how do we know that $\|\theta\|_q = \sup_{\|\tau\|_{q'} \leq 1} \langle \tau, \theta \rangle$?  From Holder’s inequality.

### July 22 10:30 am – 12:30 pm: Meeting Summary (1) Optimization of SMF Obectives (Bouchard-Cote, Jordan), (2) EP for Generative Aspect Model (Minka, Lafferty)

Filed under: Uncategorized — umassliving @ 10:05 am
Tags: ,

(1) Optimization of SMF Obectives (Bouchard-Cote, Jordan)

This paper discusses the Structured Mean Field (SMF) approach for inference in graphical models. We have covered SMF in Section 5.5 of Wainwright-Jordan. We also looked at Naive Mean Field (NMF) in Wainwright-Jordan (Section 5.3). Recall that in Mean Field methods, we break a graphical model into “tractable” subgraph(s) and then do approximate inference based on the subgraph(s) and its properties. In NMF, the subgraph is basically the graph with no edges. In SMF, some edges are considered while still keeping the subgraph tractable.  The main objective in this paper is to look at tractability of the SMF method (which depends on the choice of the subgraph). Interestingly, some graph properties of the chosen subgraph(s) determine the tractability of the SMF method.

Specifically, if a subgraph is v-acyclic, then the computation can be done in quick time. If the subgraph is b-acyclic, then the computation is much more complicated. A v-acyclic subgraph is one in which adding any single edge from the original graph results in an acyclic subgraph ( Hence the name very-acyclic). If adding any edge from the graph to the tractable subgraph results in a cyclic graph, then the subgraph is b-acyclic (barely-acyclic).

Note the following dimensionalities:

$|\mathcal{M}_{MF}| = |\mathcal{M}| = |\mu| = |\theta| = |E[\Phi(Y_\omega)]| = |F|$ = (Number of sufficient statistics in original distribution)

$|\mathcal{N}| = |\tau| = |\omega| = |F'|$ = (Number of sufficient statistics in tractable distribution)

The embedding function $\Gamma$ maps from the space of $\tau$ to the space of “missing” or “ignored” sufficient statistics.

Proof that $A^*(\mu) = A^*(\tau)$:

Eqn (6) in this paper is the same as Eqn (5.8) in Wainwright-Jordan. We are looking to find the sup (maximum) value of this expression, and that value will be our approximation of the log partition function $A(\theta)$. The maximum value is an approximation and not the exact log partition function because our search space is not all valid values of $\mu$ for the original distribution, but only the ones that can be realized by the tractable distribution.

Section 2.4 takes the partial derivative of (6) to find the max point. The embedding Jacobian ($J$) is defined, which is the first order partial derivative of  $\Gamma$ w.r.t. $\tau$.
The solution for max of Eqn (6) is then found to be
$\tau = \bigtriangledown A(\omega + J(\tau)\vartheta)$
The vector $\tau$ can be broken down into $(\tau^{(1)} \tau^{(2)} ... \tau^{(k)})$ where $(1)...(k)$ represent the $k$ disjoint connected components (subgraphs) into which the original graph has been broken down.

Section 3.3 shows that the embedding Jacobian is easy to compute when the subgraph in question is a v-acyclic subgraph. Note that this section assumes that the exponential family is expressed in an overcomplete representation. In this section, $f$ is the set of statistics in the tractable graph and $g$ is the set of statistics that are not in the tractable graph.  The v-acyclic subgraph property ensures that when we add any external edge (from the $g$ set) to the connected component ($f$) in question, the two nodes of the edge are independent (There is no path between the two nodes). This independence makes computing the $J$ matrix easy for v-acyclic graphs.  The complexity of updating $\tau$ in Prop 5 thus takes $O(|F'^{(c)}|)$ time.

For b-acyclic graphs, adding an edge to the $f$ connected component could result in a loop, the two nodes that define the edge are not independent, and we need to use a chain rule to figure out the value of $J$. Complexity of computation is
$O(|E'| \times |F'| \times |F\backslash F'|)$, which is large.
The alternative approach is to consider each case of $g$ as an exponential family. Calculating the $J$ matrix under this setting reduces the computation cost to $O(|E'| \times |F\backslash F'|)$, which is a considerable reduction from the naive approach.

The paper also makes a connection between the v-acyclic case and Gibbs sampling.

Results:

Choosing a b-acyclic subgraph for SMF in a given graphical model, we can achieve much better accuracy (Figure 3)

Convergence time : v-acyclic SMF takes an order magnitude more time compared to NMF. b-acyclic SMF takes two orders of magnitude more time than v-acyclic SMF (Figure 4) – as expected from the theory in Section 3.

(2) EP for Generative Aspect Model (Minka, Lafferty)

This paper talks about use of EP for a Generative Aspect model (which is the same as an LDA model). It is thus related to Section 6 of Wainwright-Jordan paper. This paper makes a comparison between learning using Variational Inference (VB) and Expectation Propagation (EP). Results show EP is more accurate than VB.

Given the Generative Aspect model (LDA), where $\lambda$ is the variable (Dirichlet distributed with parameter $\alpha$) that specifies the distribution of the “aspect” variable $a$. We cannot use EM directly by treating $\lambda$ as the hidden variable because the E step requires expectations over the posterior for $\lambda$.

Alternate approaches:

(1) Maximize the Variational Inference likelihood in Eqn (8)

(2) “Variational EM”:

E step – Use an approximate posterior for $\lambda$. Calculating this can be done using the EP algorithm of section 4.2 (Eqns (15) and (21))

M step – Maximize the lower bound on the log-likelihood (Eq (29)). (29) can be obtained by applying the inequality in Eqn (6) to the likelihood in (26) and then taking the log. The solution for the M step is in Eqn (31). The appendix has a more accurate solution based on the Taylor expansion (Eqn (34)).

It may be noted that approach (1) could be taken for EP as well (Find the parameters that results in the largest possible EP estimate of the likelihood ), but the authors make the point that EP makes an approximation which is close to the true likelihood in an average sense. Hence finding the max in EP’s “best approximation on average” will not yield the true maximum in the original likelihood. (Paragraph just before Section 5.2).  Hence (2) is preferred for EP.

Results:

(a) Synthetic data –

Using the VB approach (1), the estimated likelihood resembles the max approximation of the true likelihood. Using the EP approach (2), the estimated likelihood resembles the true likelihood (Figure 1)

VB tends to choose extreme parameters which can explain the data, while EP tends to reflect the true parameters that generated the data.

(b) Real data (TREC)

EP does better than VB. The difference between performance between EP and VB is not very high (As can be seem from the difference in the perplexity curves of EP and VB in Figure 3. Lower perplexity is better). Possible explanation is that in the data set that was used, there was little overlap between the aspects (The topics had little in common). Thus, the true distribution of the aspects are sharply peaked. This means that the difference between the max approximation and the exact likelihoods is not as large as in Figure 1. Thus, the EP and VB approximations are close to each other, with EP still doing slightly better.

## July 20, 2009

### July 22 10:30 am – 12:30 pm: Pre-Meeting Overview – (1) Optimization of SMF Obectives (Bouchard-Cote, Jordan), (2) EP for Generative Aspect Model (Minka, Lafferty)

Filed under: Uncategorized — umassliving @ 12:07 pm
Tags: ,

(1) Optimization of SMF Obectives (Bouchard-Cote, Jordan)

This paper discusses the Structured Mean Field (SMF) approach for inference in graphical models. We have covered SMF in Section 5.5 of Wainwright-Jordan. We also looked at Naive Mean Field (NMF) in Wainwright-Jordan (Section 5.3). Recall that in Mean Field methods, we break a graphical model into “tractable” subgraph(s) and then do approximate inference based on the subgraph(s) and its properties. In NMF, the subgraph is basically the graph with no edges. In SMF, some edges are considered while still keeping the subgraph tractable.  The main objective in this paper is to look at tractability of the SMF method (which depends on the choice of the subgraph). Interestingly, some graph properties of the chosen subgraph(s) determine the tractability of the SMF method.

Specifically, if a subgraph is v-acyclic, then the computation can be done in quick time. If the subgraph is b-acyclic, then the computation is much more complicated. A v-acyclic subgraph is one in which adding any single edge from the original graph results in an acyclic subgraph ( Hence the name very-acyclic). If adding any edge from the graph to the tractable subgraph results in a cyclic graph, then the subgraph is b-acyclic (barely-acyclic).

Points to consider are:

Sections 2.1 and 2.2 cover the basics, which we know know from Wainwright-Jordan. The only addition is second part of Eq 3. Hessian of the log partition function is the Variance of the Sufficient statistics.

Section 2.3 is also mostly covered in Wainwright-Jordan, except the part from the last paragraph in Column 1. Here Γ maps to the “missing potential” expectations. This is to show that the optimization can be done in the smaller space of tractable potentials, while ignoring the “missing potentials”.

(a) Think about the dimension of the space in which τ, and θ exist.

Section 2.4 gives the formula for Mean Field updates (top of page 4). Remember τ are the mean parameters in the tractable subfamily. The formula involves a term J which is defined in Definition 2.

Section 3.1 explains what v-acyclic and b-acyclic mean. We will talk about it very briefly.

Section 3.2 shows how the Mean Field update formula can be further broken down into multiple subgraphs (connected components). It is these subgraphs and their properties that determine the complexity of the update equation.

Section 3.3 shows that the updates are easy to compute if the subgraph (connected component) is v-acyclic (Shown by  considering three subcases). Note that 1[f=(a,s)] is an indicator function.

(b) What does g = (e, (s,t)) mean?

(c) It is important to see why equation in Proposition 5 does not depend on $\tau^{(c)}$

Section 3.4 shows how the update equations are equivalent to Gibbs sampling. We will talk about this.

(d) Hopefully students with experience in Gibbs sampling can help the rest make the connection.

Section 3.5 shows how the update equation is not as easy to solve in case of b-acyclic subgraphs.

(e) This is probably the most confusing part of this paper, we will probably spend good amount of time going through this section.

Results

(f) Choosing a b-acyclic subgraph for SMF in a given graphical model, we can achieve much better accuracy (Figure 3)

(g) Convergence time : v-acyclic SMF takes an order magnitude more time compared to NMF. b-acyclic SMF takes two orders of magnitude more time than v-acyclic SMF (Figure 4) – as expected from the theory in Section 3.

(2) EP for Generative Aspect Model (Minka, Lafferty)

This paper talks about use of EP for a Generative Aspect model (which is the same as an LDA model (?)). It is thus related to Section 6 of Wainwright-Jordan paper. This paper makes a comparison between Variational Inference (VB) and Expectation Propagation (EP). Results show EP is more accurate than VB.

The main objective for the discussion is to understand how model parameters are Learned for exponential families. We will be looking at Section 5 of the paper mainly. But section 5 is a small section and understanding section 4 will help go through 5 more meaningfully. Section 4 was covered in last fortnight’s meeting. But since it is important, we will brush up on the main points from last fortnight’s meeting.

We will briefly go over the model itself (Section 2). We should not be spending too much time on this section during the meeting. Section 3 discusses EP for the GA model. Again, we will quickly review this because it was done earlier. We should now be able to relate the “inclusion-deletion” being talked about this section to the EP algorithm as explained in Section 4 (Figure 4.7)

Section 5 will be the main discussion point.

(a) Why not use EM algorithm in a straight-forward manner ( Treating λ as a hidden variable ) This point is mentioned briefly after Eq 26. We should understand Why this is so.

Section 5.1 gives the one possible learning algorithm, which can be used for VB approach. Points to think about:

(b) How can Eqs 27 and 28 be understood as EM? This point is mentioned in paragraph 2 of section 5.1

(c) Why the same approach cannot be used for EP? Last few lines of section 5.1

Section 5.2 gives the learning algorithm for EP approach (Can be used for VB too (?)).

(d) How is Eq 29 the log likelihood in this model?

(e) We should understand what Eqs 31 – 34 mean. It might be helpful to revisit section 4.2 to understand what each of the RHS quantities mean.

Results show that EP does significantly better than VB. Points:

(f) For synthetic data, EP likelihood resembles the exact likelihood. VB likelihood resembles the max likelihood (Figure 1). (Briefly, we will talk about what exact and max likelihoods are)

(g) For real data, EP has lower “perplexity” than VB (Figure 3). (Briefly, what is “perplexity”?)

## July 15, 2009

### July 16 3-5 pm: Pre-Meeting Overview, Chapter 6 – Variational Methods in Parameter Estimation

Filed under: Uncategorized — umassliving @ 7:48 pm
Tags: ,

This chapter describes several methods for estimating the parameter theta from observed data. Section 6.1 and 6.2 focus on this problem under the assumption that theta is fixed but unknown and section 6.3 views theta as a random variable with an associated probability distribution.

The first section describes a common approach for estimation parameters, maximizing the log likelihood of the data. Unfortunately, doing this can be computationally difficult, so two different solutions are provided.  The first is a closed form function for triangulated graphs only and the second is an iterative method for arbitrary graphs based on coordinate ascent.

The next section describes a different setting, where the data is only partially observed (noisy).  The EM algorithm is commonly used in this setting to calculate the maximum likelihood estimate of theta.  This section first presents the exact form of EM for exponential families and then presents ‘variational’ EM for use when the mean parameters are not known exactly.

The last section focuses Variational Bayes, which is a method for estimating theta where theta is thought of as a random variable with a probability distribution.

## July 9, 2009

### (Jul 9 3-5 PM) Meeting Summary – Chapter 5: Mean Field Methods

Filed under: Uncategorized — umassliving @ 6:41 pm
Tags: ,

Mean field methods differ from those discussed in chapter 4 in that $\mathcal{M}$ is approximated used a tractable (nonconvex) inner bound for which $A^*$ is easy to characterize. It uses the notion of a tractable subgraph to restrict the set of sufficient statistics by setting their respective canonical parameters to $0$. Setting certain canonical parameters to zero is equivalent to adding certain constraints between the mean parameters (which have an obvious form for the naive mean field case).

We discussed the notions of tractable subgraphs, the lower bound property of mean field, and its interpretations as maximizing the bound and minimizing the KL divergence (the reverse form of that used in EP). Worked through example 5.2 for the naive mean field and example 5.4 for showing the nonconvexity of the objective function. We worked out $\mathbb{P}[X_1 = X_2]$ to show why the fraction in $\theta(q)$ must be 1/2 instead of 1/4. In example 5.4, clarified the difference between convex sets and convex functions and how figure 5.2(b) is depicting sets. Discussed the extreme points of $\mathcal{M}$, how they arise, and why they must be contained in $\mathcal{M}_{F}$, which explains why $\mathcal{M}_{F} \subset \mathcal{M}$ cant be convex. Discussed structured mean field as a way to use arbitrary tractable subgraphs as opposed to just disconnected subgraphs (as in the mean field case). The constraints between the mean parameters take a form that depends on the subgraph and thus is represented generically as $g_{\alpha}$.

Next time we will briefly make sure that everyone understood examples 5.5 and 5.6.

## July 8, 2009

### (Jul 9 3-5 PM) Pre-Meeting Overview of Chapter 5 – Mean Field Method

Filed under: Uncategorized — umassliving @ 4:59 pm
Tags: ,

This week’s chapter introduces Mean Field approaches to approximating $A(\theta)$. The general idea is to optimize the search for the best approximation from a tractable graph $F$ rather than the whole graph $G$.

According to Proposition 5.1, any mean parameter $\mu$ gives a lower bound on $A(\theta)$. Mean Field approaches try to find the best (tightest) lower bound as shown in Equation 5.8. They also present an alternative view of this approximation using KL divergence where minimizing $D(\mu || \theta)$ (bottom of page 133) is the same as Equation 5.8.

The Naive Mean Field Algorithm chooses the product distribution as the tractable distribution (that is, a graph with no edges). Section 5.4 describes the noncovexity of the Mean Field approach which may give local minima. Structured Mean Field approaches use a graph with more structure as the tractable distribution (such as a tree).

Some things to examine closely:

• Proposition 5.1 and equation 5.8 and the relationship between 5.8 and KL divergence on the bottom of page 133.
• Why is Mean Field guaranteed to give a lower bound to $A(\theta)$ ?
• The update rules in Example 5.2
• Example 5.4. Note that on the bottom of page 138, $\theta_{12} = \frac{1}{2}$ and not $\frac{1}{4}$
• Understand how Naive Mean Field is a special case of Structured Mean Field as well as understand Structured Mean Field in general.

### (Jul 2) Meeting Summary: “GBP” & “EP for LDA”

Filed under: Uncategorized — umassliving @ 3:36 pm
Tags: ,

“Expectation Propagation for the Generative Aspect Model”, Minka

This summary paper contains a very intuitive explanation of EP, covers most of the high-level that was presented. More specifically, it covers the KL divergence view of EP: approximating an intractable distribution $p(x)$ with a tractable member of the exponential family, $q(x)$ by minimizing $KL(p || q)$. Note that minimizing this form of the KL divergence when $q(x)$ is a member of the exponential family is equivalent to moment matching.

The mean field methods that we will cover in chapter 5, minimize the reverse: $KL(q || p)$, which has other properties. Note that a large number of message-passing algorithms can be viewed as minimizing information divergence (paper).

We covered the EP algorithm for the LDA model, where the prior $Dir(\lambda |\alpha)$ was the tractable component, and each of the $p(w | \lambda)$ where the intractable component – each is a mixture of multinomials that was approximated with a Dirichlet making $q(\lambda)$ the product of Dirichlet distributions which is itself a Dirichlet (with updated parameters).

To put this in the Wainwright variational framework, the $\tilde{t}_w$ basically correspond to the Lagrange multipliers $\lambda^i$ in Equation 4.77 on page 119 in Wainwright.  Multiplying $p$ with all the $\tilde{t}_w$‘s gives us the base distribution 4.77.  We then divide out by one specific $\tilde{t}_w$ (i.e. the $\lambda^i$ that isn’t in the summation in 4.78), and multiply in the corresponding exact $t_w$, which gives us the augmented distribution in 4.78.  We compute the expectation of the sufficient statistics under this distribution, and then go back and modify the $\tilde{t}_w$ which we divided out so that the base distribution and augmented distribution have matching moments.

We briefly covered the steps for the EP algorithm, I (Moe) have worked out the updates in each step (they are too much to type up), so let me know if you get stuck on any of them.

Notice that the sufficient statistic for a Dirichlet is $\phi(x) = \log(x)$, but in the paper they match the mean and variance instead (more specifically, the first and second moments which is equivalent to the mean and variance). This is because matching $\phi(x) = \log(x)$ is intractable and so the mean and variance is a tractable approximation. More info can be found here.

“Constructing Free Energies and the Generalized Belief Propagation Algorithm”, Yedida, Freeman & Weiss

This paper gives an alternative view on inference in terms of free energies that we can map back into the framework of Wainwright/Jordan that we’ve been studying.  We should first understand that the sufficient statistics are the standard over-complete indicator functions, one set for each $f_a$, one for each possible assignment to the subset of variables that participate in the function, e.g. $x_a$.  The corresponding canonical parameters are then the corresponding values of $f_a(x_a)$ for each specific assignment to $x_a$.

With this notation, we can now rewrite the distribution $p(x)$ in equation (1) in the canonical form given in Wainwright as $p(x) = \exp(\langle \theta, \phi(x) \rangle - A(\theta))$.  Now we can see that $A(\theta)$ is the same as $\ln Z$.  In other words, the Helmholtz free energy $F_H$ is just the negative of the log partition function.

Now consider a trial distribution $b(x)$.  Using the definition of KL divergence in equation (18), it is straight forward to derive equation (17) (and worth doing so for your own edification).  Since we know the KL divergence is always non-negative and only equal to zero when the two distributions are the same, we now have a variational representation of the Helmholtz free energy as the minimum over trial distributions of the variational free energy (Gibbs free energy), which is equal to the variational average energy minus the variational entropy, such that, if we could solve it exactly, we would find the parameters that would make the trial distribution equal to the actual distribution.  Of course, we will generally be working with a simplier family of trial distributions, since the original distribution is assumed to be intractable, so we won’t be able to make the KL divergence go to zero.

Note that $U(b)$ is the negative of $\langle \theta, \tau \rangle$, and $H(b)$ is an approximation to the negative of the conjugate dual $A^*(\mu)$.  Thus we have just rewritten the constrained maximization problem of Wainwright as a constrained minimization problem.

One unresolved question is what exactly is meant by “cycle” in Proposition 8 (clearly it can’t mean a directed cycle since the region graph is acyclic due to the way the edges are directed), does it simply mean undirected cycles in the region graph?  Wainwright gives a much more complicated definition of an acyclic hypergraph (equivalent to a region graph) in terms of junction trees, so it’s not clear if this is the same as a cycle in the region graph or not.

Understanding the example after Proposition 8 is good to do.

Note there are three possible ways to do generalized belief propagation – parent to child (same as what was presented in Wainwright), child to parent, and two way.  One benefit of parent to child is that there is no need to actually use the counting numbers $c$.

You can get the message-update rule for the parent-to-child algorithm by plugging the definition of the beliefs, (113) into the two sides of the marginalization constraint (117), and working out the math.

The proof that the fixed point of GBP are the same as the stationary points of the region-based free energy is done by defining a second set of Lagrange multipliers that immediate give $b(x)$ as in (113) from (122), and then showing that the two sets of Lagrange multipliers define equivalent constraints.  To be a little more explicit, if you have a fixed point of GBP, then it must satisfy the constraints of the original Lagrange multipliers, so it must satisfy the constraints of the second set of Lagrange multipliers, and (122) shows $b(x)$ satisfies the stationarity conditions for the second optimization problem, which is equivalent to the original, and hence you have a stationary point of the region-based free energy.  Going the other direction is similar.

Section X is very useful if you want to see what the GBP updates actually are.  It’s not hard to work them out from (114).

Lastly, section XI gives a nice example of a lattice network in which straightforward BP will give the correct labels if you just look at the arg max of the marginals, but will be overconfident, whereas GBP will give actual marginals that are very close to the exact marginals.

Blog at WordPress.com.