September 19, 2009

(Sep 25 12-2pm) Pre-meeting Overview: Energy Minimization via Graph Cuts

Filed under: Uncategorized — umassliving @ 4:39 pm
Tags: ,

This week, continuing the theme of MAP inference, we’ll look at graph cut methods for energy minimization.

This will be the main paper: What Energy Functions Can Be Minimized via Graph Cuts? Komogorov, Zabih

This additional paper may help in understanding graph cuts: Fast Approximate Energy Minimization via Graph Cuts. Boykov, Veksler, Zabih.

There is no paper set yet for this week.  Some possibilities:

Variable Elimination, Constructing the Pseudo-tree, Min-fill, and Mini-bucket heuristic

A two week sequence of:

The first paper would be related to our previous reading, in that they do search with branch and bound.

Anything else on this page –

Post your suggestions soon, with the aim of settling on a paper by mid-day Monday.


September 14, 2009

(Sep 18 12-2pm) Pre-meeting Overview: Best-First AND/OR Search for MPE (Marinescue, Dechter)

Filed under: Uncategorized — umassliving @ 11:57 am
Tags: ,

Here is the reading for this week:

The motivation for reading this paper is as follows. In our work on both OCR and on Scene Text Recognition, I have been somewhat less focused on parameter setting, and somewhat more focused on finding the MAP solution, or the Most Probable Explanation (MPE) solution.

There are many ways to do this, including:

1) Loopy Belief Propagation, or Max Product; (problems: convergence not guaranteed; if it converges, still don’t know if we have the right solution)

2) Linear programming on a relaxed version of the marginal polytope, using techniques of Wainwright, or the cutting plane inequalities we just read about in Jaakkola and Sontag, currently under investigation by Gary.

3) A* search as being investigated by David and Jackie at the moment.

4) Graph cut methods.  These work on binary labelings with submodular potentials, and can be extended to multi-class labelings using \alpha-expansion moves.  For instance, see

and probably others.

I’m trying to learn more about the A* version, and in particular, compare it to the Jaakkola Sontag approach. I’m hoping this paper illuminates graph-search type methods of solving the MAP assignment problem.

September 5, 2009

(Sep 09 1-3pm) Pre-meeting Overview: New Outer Bounds on the Marginal Polytope (Sontag, Jaakkola)

Filed under: Uncategorized — umassliving @ 3:23 pm
Tags: ,

Meeting will be in room 243.

This week we’ll be covering the following paper from NIPS 2007 (best student paper) – Sontag and Jaakola, “New Outer Bounds on the Marginal Polytope“.

Aside from the paper, we will also need to deal with two organizational issues, when to have our regular meeting for the rest of fall semester, and what topics we want to cover (at least decide upon the first topic to cover).  See the previous organizational post for details.

Paper Overview:

This paper presents a new variational algorithm for inference in discrete (binary & non-binary) MRF’s. Recall that the main difficulty in exact inference is in explicitly characterizing the marginal polytope and in exactly computing the conjugate dual. The main contribution of this paper is a new [tighter] outer-bound on the marginal polytope. For the conjugate dual, their algorithm utilizes existing approximations such as the log-determinant and tree-reweighted (TRW) approximations.

A broader goal of this paper is in highlighting an emerging connection between polyhedral combinatorics and probabilistic inference. To this end, their outer-bound for the marginal polytope is based on a cutting-plane algorithm (Algorithm 1 on page 3). A key aspect cutting-plane algorithms is in having an efficient mechanism for detecting violated constraints (Step 5 of Algorithm 1). One contribution in this paper is in using the cycle inequalities for which efficient separation algorithms are known (Section 2, page 5). A second main contribution is extending these inequalities and the separation mechanism to non-binary MRF’s (Section 4). Notice that the extension to non-pairwise MRF’s is trivial since any non-pairwise MRF can be easily converted to a pairwise one by introducing auxiliary variables (as described in Appendix E.3 of Wainwright & Jordan). The first two pages of the paper provide a nice & succinct summary of most of the points we covered in Wainwright & Jordan.

The experiments in this paper show the improvement in their inference procedure on computing marginals and MAP estimation for protein side-chain prediction.

August 29, 2009

(Sep 03 3-5pm) Pre-meeting Overview: Convex Relaxations for MAP Estimation (Kumar, Kolmogorov, Torr)

Filed under: Uncategorized — umassliving @ 5:03 pm
Tags: ,

This week we’ll be covering Pawan, Kolmogorov, and Torr, “An Analysis of Convex Relaxations for MAP Estimation of Discrete MRFs”, JMLR 2009.

A related paper is Kumar, Torr, and Zisserman, “Solving Markov Random Fields using Second Order Cone Programming Relaxations“, CVPR 2006, which may be useful as a quick introduction as well as an example of how to apply these methods to a vision problem (object recognition), but the main focus of the meeting will be on the JMLR paper.

The paper begins by describing the integer programming formulation of MAP estimation, equivalently energy minimization, hence the integer program and all relaxations are minimization problems in the paper.  Next, the paper gives three relaxations of the integer program – LP-S, the linear programming relaxation (the main one we dealt with in Wainwright); QP-RL, a quadratic programming relaxation; and SOCP-MS, a second order cone programming relaxation.  One of the focuses of the paper is to compare these different relaxations to one another, and describes how to do so in the next section, through the notion of dominance.  The paper goes on to prove that LP-S dominates SOCP-MS, and prove that SOCP-MS and QP-RL are equivalent.

The next focus of the paper is to prove that for a class of relaxations (on trees and certain types of cycles) LP-S strictly dominates SOCP-MS (and consequently QP-RL).  The paper then gives two additional types of constraints that can be added to the SOCP relaxation, such that the new SOCP relaxation dominates LP-S for some problems and has a feasibility region not a super-set of the LP-S feasibility region.

The majority of the paper is in the proofs, so beyond understanding the basic set-up, I think that’s where we should be spending our time.  In particular, if there are any steps that you do not follow, post them in the comments and we can try to work through it during the meeting.


Going from SDP to SOCP relaxation:

One fact that seems to be used without an explicit proof is A \succeq 0 \Leftrightarrow A \bullet C \geq 0 \quad \forall C \succeq 0, where A \bullet C = \sum_{ij} A_{ij}C_{ij} is the Frobenius inner product on matrices.  Here’s a quick proof, where the second direction uses part of the CVPR paper.

First, note that x x^T \succeq 0, since for any vector a, a^T x x^T a = (a^T x)^2 \geq 0.  Therefore, A \bullet C \geq 0 \quad \forall C \succeq 0 \Rightarrow A \bullet x x^T \geq 0 \quad \forall x \Rightarrow \sum_{ij} A_{ij} x_i x_j \geq 0 \Rightarrow x^T A x \geq 0 \Rightarrow A \succeq 0.

To go the reverse direction, note that any symmetric positive semidefinite matrix C can be written as UU^T using the Cholesky decomposition.  Let A \succeq 0, and C = UU^T be any positive semidefinite matrix.  A \bullet C = A \bullet UU^T = \sum_{ijk} A_{ij} U_{ik} U^T_{kj} = \sum_{ijk} U^T_{ki} A_{ij} U_{jk} = \sum_k u^T_k A u_k \geq 0, where u_k is the kth column of U (not row, as stated in the CVPR paper), and the last inequality uses the fact that A \succeq 0.

Recall that in the SDP relaxation, our positive semidefinite constraint was X - xx^T \succeq 0.  From the CVPR paper, we then have, for any positive semidefinite matrix C = UU^T, (X - xx^T) \bullet C \geq 0 \Leftrightarrow xx^T \bullet C \leq X \bullet C \Leftrightarrow x^T UU^T x \leq X \bullet C \Leftrightarrow \|U^T x\|^2 \leq X \bullet C.

Finally, note that the final inequality is equivalent to the following SOCP constraint: \|u\| \leq t where u = \left( \begin{array}{c} 1 - X \bullet C \\ 2U^T x\end{array} \right) and t = 1 + X \bullet C, which can be verified by squaring both sides.

At this point, the SOCP relaxation is equivalent to the SDP relaxation.  However, the SOCP relaxation is made more tractable (at the cost of removing constraints) by only considering a subset of psd matrices C.  The typical set of C used is those with all zeros except for a 2 x 2 submatrix corresponding to two vertices connected by an edge in the MRF.

(Aug 27 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 9, 10

Filed under: Uncategorized — umassliving @ 4:39 pm
Tags: ,

to be filled in

August 13, 2009

(Aug 20 3-5pm) Pre-meeting Overview: Wainwright – Estimating the “wrong” graphical model

Filed under: Uncategorized — umassliving @ 6:37 pm
Tags: ,

M. J. Wainwright. Estimating the “wrong” graphical model: Benefits in the computation-limited setting. –

This paper is about variational approaches to parameter estimation and prediction in an MRF.  In particular, they argue that, in the variational setting, an inconsistent estimator of \theta can be desirable, because it can offset errors later in the prediction step.

Section 2,3 are covered in the Graphical Model survey.

Some points to examine:

  • the overall procedure presented in Section 4.2.
  • the application to Gaussian mixtures in Section 6.1
  • the comparison of the tree-reweighted approach and sum-product in Section 7.

August 11, 2009

(Aug 13 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 8, p214-233

Filed under: Uncategorized — umassliving @ 3:49 pm
Tags: ,

Section 8.4.3 till 8.5

The main points that will be covered in these subsections are

(1) Extending the LP view of mode finding that we saw for max-product on tree graphs to reweighted-max-product.

(2) Examples of first order LP relaxations.

(3) Higher order LP approximations and how they can be more accurate.

Section 8.4.3 extends the arguments from earlier subsections in this chapter to show that reweighted max-product (and tree-reweighted Bethe variational problem) can be thought of as an LP problem. Conditions 8.23 have to be met for the max-product solution and LP problem solution to be dual of one another.

Section 8.4.4 discusses examples where an LP problem is formulated from a graph problem. Example 8.3 shows how the Error-control coding problem may be thought of as an LP problem. Example 8.4 shows the same for the (a) Ising model, (b) (Maximum Weight) Independent Set problem, (c) (Minimum Weight) Vertex Cover problem, (d) Max-cut problem. These are all examples of intractable problems, but a first oder LP approximate solution is possible. Example 8.5 is another classic graph problem, Maximum Weight Matching problem, solved using an LP.

Section 8.5 discusses higher order LP relaxations by using the marginal polytopes defined on hypergraphs (rather than the polytopes defined by pairwise interactions in a tree). The main claim and proof in this section is that higher order relaxation LPs are more accurate the lower order relaxation LPs for the same graphical model. The t^{th} relaxation LP solution, where t is the tree-width of the hypergraph for the graphical model, is the same as the exact solution on the original polytope \mathbb{M}(G).

Finally, example 8.6 shows how the second order relaxation adds additional constraints that makes the solution “tighter” than in a first order relaxation. This is done by solving the same problem solved earlier in Example 8.3 by the two methods, and comparing the constraints and considering one solution \tilde{\tau} = (1, \frac{1}{2}, \frac{1}{2}, 0).

We will be reading papers next week, so please come with suggestions for papers you think are interesting.

August 6, 2009

(Aug 06 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 8, p195-213

Filed under: Uncategorized — umassliving @ 10:06 am
Tags: ,

Section 8 covers the problem of mode computation. That is given a probability distribution, compute the most probable configuration or the mode. It turns out that, this problem also has a variational representation as made explicit in Theorem 8.1 (see Appendix B.5. for the proof).  Similar to our treatment in chapter 4, this variational problem can be solved exactly for tree-structured and Gaussian graphical models. Some points to focus on for this week:

  • The objective function is simpler than the one for inference since it does not include A^*, and so the main source of difficulty is in characterizing the space of mean parameters, \mathcal{M} explicitly. It should be clear that the optimization problem is an integer programming one (wikipedia).
  • Section 8.2 transforms the IP problem to a LP one over the set \mathbb{L}(G) for the case of tree-structured distributions and shows that a Lagrangian method for solving the dual of the LP are the max-product updates. Note that this result does not carry over to non-tree-structured graphs as shown in section 8.4.2.
  • Section 8.3 discusses the exact computation of the mode for Gaussian graphical models (see Appendix C).
  • Section 8.4.1 discusses first order LP relaxation for computing the mode of a discrete graphical model with cycles, and discusses the concept of strong and weak persistency.

July 23, 2009

(Jul 30 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 7

Filed under: Uncategorized — umassliving @ 4:26 pm
Tags: ,

The approximation methods we’ve covered in chapters 4 and 5 have both been nonconvex.  This chapter addresses the issue of how to create convex approximations, where both the set \mathcal{L} that we use is convex and the objective function that we seek to maximize is concave.

7.1 deals with the generic framework for the convex approximations.  7.2 shows how to convexify the algorithms we saw in chapter 4, while 7.3 shows how to create new algorithms that are convex, not based on using convex combinations.  7.4 shows a separate benefit of convexity in terms of algorithmic stability, and 7.5 discusses how the convex approximations for inference fit in to learning.


  • There is a missing subscript in equation (7.1) on page 167, the expectation should be \mathbb{E}[\phi_\alpha(X)].
  • There should be (I believe) a \rho(T) in front of the second inner product in the second equation on page 176.

Important things to follow and think about:

  • Why is the Bethe approximation to the entropy H_{Bethe} not generally concave?
  • On page 167, in the discussion of the projection operation \mapsto, how do we know that \mu(F) \in \mathcal{M}(F)?
  • Understand 7.2.1.  In particular, consider the distribution over spanning trees \rho that places all its mass on one specific spanning tree.  This seems very similar to structured mean field, where we just use the one spanning tree as the tractable subfamily we optimize over.  Yet structured mean field gives us a lower bound on A(\theta) whereas tree-reweighted Bethe gives us an upper bound.  What accounts for this difference?  For instance, why does structured mean field give a lower bound when we know that A^*(\mu(F)) \leq A^*(\mu), and we are subtracting off A^* in the objective function?
  • Convince yourself that you can obtain the tree-reweighted sum-product updates in equation (7.12) on page 172 by following the Lagrangian formulation of ordinary sum-product in chapter 4, and using the new messages M_{ts}(x_s) = \exp(\frac{1}{\rho_{st}} \lambda_{st}(x_t)).
  • Understand the general ideas in 7.2.2 and 7.2.3, the exact details are not as important.  For instance, understand why H_{ep} is generally not concave, but why using the re-weighting subject to \sum_{\ell=1}^{d_I} \rho(\ell) = 1 gives a concave function.
  • Understand the general idea in 7.3.1.  The details are not as important, although for the interested, I have put some notes on this section below.
  • Understand the benefit of the convexified algorithms for algorithmic stability and what it means for an algorithm to be globally Lipschitz stable, why that is important, and how Figure 7.3 suggests that ordinary sum-product is not globally Lipschitz stable.
  • Understand how the convexified algorithms can be used within learning in 7.5.  We use the fact that when B upper bounds the true cumulant function, then the surrogate likelihood lower bounds the true likelihood.  Yet in chapter 6, we saw how using mean field within EM also gives us a lower bound on the true likelihood, despite the fact that mean field gives a lower bound on the cumulant function.  What accounts for this discrepancy?
  • Understand the intuitative interpretation of example 7.3.

Slightly more esoteric details:

  • On page 184, why do \mu_i show up along the diagonal of N_1[\mu]?  Hint: each random variable X_i is binary valued.
  • On page 184, why is the constraint N_1[\mu] \succeq 0 imply the constraint \text{cov}(X) \succeq 0?  Use the Schur complement formula and the stated fact that if M is symmetric and positive-definite, then so is the Schur complement of D in M.
    • Here is a quick proof of that fact: we want to show that, if for all x, x^T M x > 0, then, for all y, y^T S y > 0 where S is the Schur complement of D.  Split x into two components, x_1 and x_2, with x_1 matching the size of A, and x_2 matching the size of D.  Then we have, x^T M x = x_1^T A x_1 + x_2^T C x_1 + x_1^T B x_2 + x_2^T D x_2 > 0.  Now substitute in C = B^T since M is symmetric, and assign x_2 = -D^{-1} B^T x_1.  Putting that in and simplifying, we get x_1^T A x - x_1^T B D^{-1} B^T x_1 = x_1^T S x_1 > 0, where x_1 was arbitrary, thus proving that S is positive-definite.
    • Now, how can we apply that to prove that N_1[\mu] \succeq 0 implies \text{cov}(X) \succeq 0?  If we let the single element 1 in the top left of N_1[\mu] be A, and \mathbb{E}[XX^T] in the bottom right be D, and take the Schur complement of A, we get the fact that, if N_1[\mu] \succeq 0, then \mathbb{E}[XX^T] - \mu\mu^T \succeq 0, and the second term is exactly \text{cov}(X), thus proving the implication.
  • One thing I am unsure of is whether you can go the other direction, and show that \text{cov}(X) \succeq 0 implies N_1[\mu] \succeq 0.  The wording in Wainwright seems to imply that you can, but it’s not clear to me how to do that using the Schur complement formula as they claim.
  • The important thing is that we know that the constraint N_1[\mu] \succeq 0 is an outer bound to the marginal polytope \mathbb{M}(K_m).  To show this, we use the fact that if there is a distribution p(X) that realizes mean parameter \mu, then letting Y be the random vector (1, X), we have for any vector a, a^T N_1[\mu] a = a^T \mathbb{E}[YY^T] a = \mathbb{E}[(a^T Y)^2] > 0, proving that N_1[\mu] \succeq 0.
  • So right now, we have \mathbb{M}(K_m) is outer bounded by the constraint N_1[\mu] \succeq 0 which in turn is outer bounded by the constraint \text{cov}(X) \succeq 0, and uncertain as to whether the two constraints are equivalent or not.
  • Why is it valid to replace the quantity within the determinant on page 184, \text{cov}(X) to N_1[\tau], as in equation (7.22) on page 185?  Again we use the Schur complement formula, and the property that \det(M) = \det(A)\det(S) where S is the Schur complement of D in M (see properties of determinant).  Using this and the fact above that by setting A as the single element 1 in the top right corner of N_1[\mu], the Schur complement of N_1[\mu] is \text{cov}(X) shows that the substitution is valid.
  • Why does setting canonical parameters as in (7.28) lead to mean parameters matching the empirical mean parameters?  The paper Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching gives a proof that the optimal \tau returned by the modified sum-product algorithm satisfies an admissibility condition that \langle \theta, \phi(x) \rangle + \text{constant} = \sum \log \tau_s(x_s) + \sum \rho_{st} \log \frac{\tau_{st}(x_s, x_t)}{\tau_s(x_s) \tau_t(x_t)}.  Using the \tilde{\theta} given in (7.28), we can see that \hat{\mu} satisfies this condition, and since the solution is unique, we have \hat{\mu} = \tau.
  • On page 193, how do we know that \|\theta\|_q = \sup_{\|\tau\|_{q'} \leq 1} \langle \tau, \theta \rangle?  From Holder’s inequality.

July 20, 2009

July 22 10:30 am – 12:30 pm: Pre-Meeting Overview – (1) Optimization of SMF Obectives (Bouchard-Cote, Jordan), (2) EP for Generative Aspect Model (Minka, Lafferty)

Filed under: Uncategorized — umassliving @ 12:07 pm
Tags: ,

(1) Optimization of SMF Obectives (Bouchard-Cote, Jordan)

This paper discusses the Structured Mean Field (SMF) approach for inference in graphical models. We have covered SMF in Section 5.5 of Wainwright-Jordan. We also looked at Naive Mean Field (NMF) in Wainwright-Jordan (Section 5.3). Recall that in Mean Field methods, we break a graphical model into “tractable” subgraph(s) and then do approximate inference based on the subgraph(s) and its properties. In NMF, the subgraph is basically the graph with no edges. In SMF, some edges are considered while still keeping the subgraph tractable.  The main objective in this paper is to look at tractability of the SMF method (which depends on the choice of the subgraph). Interestingly, some graph properties of the chosen subgraph(s) determine the tractability of the SMF method.

Specifically, if a subgraph is v-acyclic, then the computation can be done in quick time. If the subgraph is b-acyclic, then the computation is much more complicated. A v-acyclic subgraph is one in which adding any single edge from the original graph results in an acyclic subgraph ( Hence the name very-acyclic). If adding any edge from the graph to the tractable subgraph results in a cyclic graph, then the subgraph is b-acyclic (barely-acyclic).

Points to consider are:

Sections 2.1 and 2.2 cover the basics, which we know know from Wainwright-Jordan. The only addition is second part of Eq 3. Hessian of the log partition function is the Variance of the Sufficient statistics.

Section 2.3 is also mostly covered in Wainwright-Jordan, except the part from the last paragraph in Column 1. Here Γ maps to the “missing potential” expectations. This is to show that the optimization can be done in the smaller space of tractable potentials, while ignoring the “missing potentials”.

(a) Think about the dimension of the space in which τ, and θ exist.

Section 2.4 gives the formula for Mean Field updates (top of page 4). Remember τ are the mean parameters in the tractable subfamily. The formula involves a term J which is defined in Definition 2.

Section 3.1 explains what v-acyclic and b-acyclic mean. We will talk about it very briefly.

Section 3.2 shows how the Mean Field update formula can be further broken down into multiple subgraphs (connected components). It is these subgraphs and their properties that determine the complexity of the update equation.

Section 3.3 shows that the updates are easy to compute if the subgraph (connected component) is v-acyclic (Shown by  considering three subcases). Note that 1[f=(a,s)] is an indicator function.

(b) What does g = (e, (s,t)) mean?

(c) It is important to see why equation in Proposition 5 does not depend on \tau^{(c)}

Section 3.4 shows how the update equations are equivalent to Gibbs sampling. We will talk about this.

(d) Hopefully students with experience in Gibbs sampling can help the rest make the connection.

Section 3.5 shows how the update equation is not as easy to solve in case of b-acyclic subgraphs.

(e) This is probably the most confusing part of this paper, we will probably spend good amount of time going through this section.


(f) Choosing a b-acyclic subgraph for SMF in a given graphical model, we can achieve much better accuracy (Figure 3)

(g) Convergence time : v-acyclic SMF takes an order magnitude more time compared to NMF. b-acyclic SMF takes two orders of magnitude more time than v-acyclic SMF (Figure 4) – as expected from the theory in Section 3.

(2) EP for Generative Aspect Model (Minka, Lafferty)

This paper talks about use of EP for a Generative Aspect model (which is the same as an LDA model (?)). It is thus related to Section 6 of Wainwright-Jordan paper. This paper makes a comparison between Variational Inference (VB) and Expectation Propagation (EP). Results show EP is more accurate than VB.

The main objective for the discussion is to understand how model parameters are Learned for exponential families. We will be looking at Section 5 of the paper mainly. But section 5 is a small section and understanding section 4 will help go through 5 more meaningfully. Section 4 was covered in last fortnight’s meeting. But since it is important, we will brush up on the main points from last fortnight’s meeting.

We will briefly go over the model itself (Section 2). We should not be spending too much time on this section during the meeting. Section 3 discusses EP for the GA model. Again, we will quickly review this because it was done earlier. We should now be able to relate the “inclusion-deletion” being talked about this section to the EP algorithm as explained in Section 4 (Figure 4.7)

Section 5 will be the main discussion point.

(a) Why not use EM algorithm in a straight-forward manner ( Treating λ as a hidden variable ) This point is mentioned briefly after Eq 26. We should understand Why this is so.

Section 5.1 gives the one possible learning algorithm, which can be used for VB approach. Points to think about:

(b) How can Eqs 27 and 28 be understood as EM? This point is mentioned in paragraph 2 of section 5.1

(c) Why the same approach cannot be used for EP? Last few lines of section 5.1

Section 5.2 gives the learning algorithm for EP approach (Can be used for VB too (?)).

(d) How is Eq 29 the log likelihood in this model?

(e) We should understand what Eqs 31 – 34 mean. It might be helpful to revisit section 4.2 to understand what each of the RHS quantities mean.

Results show that EP does significantly better than VB. Points:

(f) For synthetic data, EP likelihood resembles the exact likelihood. VB likelihood resembles the max likelihood (Figure 1). (Briefly, we will talk about what exact and max likelihoods are)

(g) For real data, EP has lower “perplexity” than VB (Figure 3). (Briefly, what is “perplexity”?)

Next Page »

Create a free website or blog at