UMass LIVING

August 29, 2009

(Sep 03 3-5pm) Pre-meeting Overview: Convex Relaxations for MAP Estimation (Kumar, Kolmogorov, Torr)

Filed under: Uncategorized — umassliving @ 5:03 pm

This week we’ll be covering Kumar, Kolmogorov, and Torr, “An Analysis of Convex Relaxations for MAP Estimation of Discrete MRFs”, JMLR 2009.

A related paper is Kumar, Torr, and Zisserman, “Solving Markov Random Fields using Second Order Cone Programming Relaxations”, CVPR 2006, which may be useful as a quick introduction as well as an example of how to apply these methods to a vision problem (object recognition), but the main focus of the meeting will be on the JMLR paper.

The paper begins by describing the integer programming formulation of MAP estimation (equivalently, energy minimization), hence the integer program and all relaxations in the paper are minimization problems.  Next, the paper gives three relaxations of the integer program: LP-S, the linear programming relaxation (the main one we dealt with in Wainwright); QP-RL, a quadratic programming relaxation; and SOCP-MS, a second order cone programming relaxation.  One focus of the paper is comparing these relaxations to one another; the next section describes how to do so through the notion of dominance.  The paper goes on to prove that LP-S dominates SOCP-MS, and that SOCP-MS and QP-RL are equivalent.

The next focus of the paper is to prove that LP-S strictly dominates a large class of SOCP relaxations (those whose constraints are defined over trees and certain types of cycles), and consequently QP-RL as well.  The paper then gives two additional types of constraints that can be added to the SOCP relaxation, such that the new SOCP relaxation dominates LP-S for some problems, i.e., its feasibility region is not a superset of the LP-S feasibility region.

The majority of the paper is in the proofs, so beyond understanding the basic set-up, that is where we should spend our time.  In particular, if there are any steps you do not follow, post them in the comments and we can work through them during the meeting.

Notes:

Going from SDP to SOCP relaxation:

One fact that seems to be used without an explicit proof is A \succeq 0 \Leftrightarrow A \bullet C \geq 0 \quad \forall C \succeq 0, where A \bullet C = \sum_{ij} A_{ij}C_{ij} is the Frobenius inner product on matrices.  Here’s a quick proof, where the second direction uses part of the CVPR paper.

First, note that x x^T \succeq 0, since for any vector a, a^T x x^T a = (a^T x)^2 \geq 0.  Therefore, A \bullet C \geq 0 \quad \forall C \succeq 0 \Rightarrow A \bullet x x^T \geq 0 \quad \forall x \Rightarrow \sum_{ij} A_{ij} x_i x_j \geq 0 \Rightarrow x^T A x \geq 0 \Rightarrow A \succeq 0.

To go the reverse direction, note that any symmetric positive semidefinite matrix C can be written as UU^T using the Cholesky decomposition.  Let A \succeq 0, and C = UU^T be any positive semidefinite matrix.  A \bullet C = A \bullet UU^T = \sum_{ijk} A_{ij} U_{ik} U^T_{kj} = \sum_{ijk} U^T_{ki} A_{ij} U_{jk} = \sum_k u^T_k A u_k \geq 0, where u_k is the kth column of U (not row, as stated in the CVPR paper), and the last inequality uses the fact that A \succeq 0.
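Here is a quick numerical illustration of both directions (a sketch using numpy; the random_psd helper is our own, not from either paper):

    import numpy as np

    rng = np.random.default_rng(0)

    def random_psd(n):
        # M M^T is positive semidefinite for any M
        M = rng.standard_normal((n, n))
        return M @ M.T

    A = random_psd(5)
    for _ in range(1000):
        C = random_psd(5)
        assert np.sum(A * C) >= -1e-9   # A . C >= 0 for every PSD C

    # conversely, a non-PSD A is "caught" by some rank-one C = x x^T
    A_bad = np.diag([1.0, -1.0])
    x = np.array([0.0, 1.0])
    print(np.sum(A_bad * np.outer(x, x)))   # -1.0 < 0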

Recall that in the SDP relaxation, our positive semidefinite constraint was X - xx^T \succeq 0.  From the CVPR paper, we then have, for any positive semidefinite matrix C = UU^T, (X - xx^T) \bullet C \geq 0 \Leftrightarrow xx^T \bullet C \leq X \bullet C \Leftrightarrow x^T UU^T x \leq X \bullet C \Leftrightarrow \|U^T x\|^2 \leq X \bullet C.

Finally, note that the final inequality is equivalent to the following SOCP constraint: \|u\| \leq t where u = \left( \begin{array}{c} 1 - X \bullet C \\ 2U^T x\end{array} \right) and t = 1 + X \bullet C, which can be verified by squaring both sides.

At this point, the SOCP relaxation is equivalent to the SDP relaxation.  However, the SOCP relaxation is made more tractable (at the cost of removing constraints) by only considering a subset of psd matrices C.  The typical choice is matrices C that are all zeros except for a 2 x 2 submatrix corresponding to a pair of vertices connected by an edge in the MRF.
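To make this concrete, here is a small numpy sketch (our own construction, not from either paper) that builds a C supported on a single edge’s 2 x 2 block and checks the squaring identity behind the SOCP constraint, t^2 - \|u\|^2 = 4(X \bullet C - \|U^T x\|^2):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4

    # C supported on the 2 x 2 block of an edge (s, t): with v = e_s + e_t,
    # C = v v^T is PSD and U = v (a single column) satisfies C = U U^T
    s, t_edge = 0, 1
    v = np.zeros(n)
    v[s] = v[t_edge] = 1.0
    U = v.reshape(n, 1)
    C = U @ U.T

    x = rng.uniform(size=n)
    X = np.outer(x, x) + 0.1 * np.eye(n)   # here X - x x^T = 0.1 I is PSD

    XC = np.sum(X * C)                     # Frobenius product X . C
    u = np.concatenate(([1.0 - XC], 2.0 * (U.T @ x)))
    t = 1.0 + XC

    # the squaring step: t^2 - ||u||^2 = 4 (X . C - ||U^T x||^2)
    assert np.isclose(t**2 - u @ u, 4.0 * (XC - np.sum((U.T @ x)**2)))
    assert np.linalg.norm(u) <= t          # SOCP constraint holds, as expected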


(Aug 27 3-5pm) Meeting Summary: Wainwright / Jordan, ch 9, 10

Filed under: Uncategorized — umassliving @ 4:40 pm

This section describes how to use moment matrices and conic programming, specifically semidefinite programming (SDP) and second-order cone programming (SOCP), to construct variational relaxations.

We spent a good amount of time going over the background information in the section.  First we discussed the definition of a moment matrix and the property that any valid moment matrix is positive semidefinite (see the small example below).  Next we looked at the definitions of two different bases, the multinomial basis and the indicator basis, and went through the lemma showing that it is always possible to convert between the two.  Lastly, we looked at the new definition of the marginal polytope for hypergraphs in terms of the multinomial basis.
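For intuition, here is a tiny numpy example (ours, not from the chapter) of the positive semidefinite property: for any distribution over \{0,1\}^2 and the basis vector z = (1, x_1, x_2), the moment matrix E[z z^T] is an average of rank-one PSD matrices, hence PSD.

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(0)

    # a random distribution over x = (x1, x2) in {0,1}^2
    p = rng.uniform(size=4)
    p /= p.sum()

    # moment matrix M = E[z z^T] for the basis vector z = (1, x1, x2)
    M = np.zeros((3, 3))
    for prob, (x1, x2) in zip(p, product([0, 1], repeat=2)):
        z = np.array([1.0, x1, x2])
        M += prob * np.outer(z, z)

    # M is a convex combination of rank-one PSD matrices z z^T, hence PSD
    print(np.linalg.eigvalsh(M))  # all eigenvalues are >= 0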

Next we went over how the Lasserre sequence of moment-matrix relaxations provides a nested hierarchy of outer bounds on the marginal polytope.  For a hypergraph on m nodes, S_t with t = m provides an exact characterization of the marginal polytope.

The last section discussed an alternate second-order cone relaxation technique, which we deferred to next week’s meeting.

(Aug 27 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 9, 10

Filed under: Uncategorized — umassliving @ 4:39 pm

to be filled in

July 16 3-5 pm: Meeting Summary, Chapter 6 – Variational Methods in Parameter Estimation

Filed under: Uncategorized — umassliving @ 4:38 pm

to be filled in

August 20, 2009

(Aug 20 3-5PM) Meeting Summary : Wainwright – Estimating the “Wrong” Graphical Model

Filed under: Uncategorized — umassliving @ 5:08 pm

Things we covered today:

  1. In the regularized surrogate likelihood \ell_B(\theta) (Eq 16), B(\theta) is the Bethe surrogate covered in the survey paper. We then went through the steps of joint estimation and prediction in Section 4.2. We noted that the noise model is localized for each element of y: p(y|z) = \prod_{s = 1}^N p(y_s | z_s). Erik wondered whether this was done for simplicity or by necessity, and we thought it was for simplicity.
  2. The parameter estimate based on the surrogate likelihood, \hat{\theta}^n, is asymptotically normal.
  3. We had a long discussion about the relationship between globally stable, Lipschitz, and strongly convex.  Any variational method that has a strongly convex entropy approximation is globally stable. Also, any variational method based on a convex optimization is Lipschitz stable.  I’m not sure if there’s a difference between Lipschitz stable and globally stable…
  4. None of us knew how to derive Eq 22, but we got some intuition by observing that \alpha = 0 is the case where the SNR is pure noise and the observation Y is useless; this is reflected in Eq 22 by the removal of the term involving Y. Similarly, if \alpha = 1, then Z_s = Y_s, which is intuitive.
  5. When the SNR \alpha = 0, Section 6.2 shows that \nabla B(\hat{\theta}) = \mu^* = \nabla A(\theta), which means \nabla B(\theta) will be different.
  6. In Figure 4, the curve marked with red diamonds (approximate model) is upper bounded, as stated in Theorem 7.  Figure 4 also illustrates that performing approximate inference (the TRW method) using the approximate model parameters can be superior to performing approximate inference using the true model (black circles).

August 13, 2009

(Aug 20 3-5pm) Pre-meeting Overview: Wainwright – Estimating the “wrong” graphical model

Filed under: Uncategorized — umassliving @ 6:37 pm

M. J. Wainwright. Estimating the “wrong” graphical model: Benefits in the computation-limited setting. – http://www.eecs.berkeley.edu/~wainwrig/Papers/Wainwright06_JMLR.pdf

This paper is about variational approaches to parameter estimation and prediction in an MRF.  In particular, they argue that, in the variational setting, an inconsistent estimator of \theta can be desirable, because it can offset errors later in the prediction step.

Sections 2 and 3 are covered in the graphical models survey.

Some points to examine:

  • the overall procedure presented in Section 4.2.
  • the application to Gaussian mixtures in Section 6.1
  • the comparison of the tree-reweighted approach and sum-product in Section 7.

(Aug 13 3-5pm) Meeting Summary: Wainwright / Jordan, ch 8, p214-233

Filed under: Uncategorized — umassliving @ 6:35 pm

The main points that will be covered in these subsections are

(1) Extending the LP view of mode finding that we saw for max-product on tree graphs to reweighted-max-product.

(2) Examples of first order LP relaxations.

(3) Higher order LP approximations and how they can be more accurate.


(1) Extending the LP view of mode finding that we saw for max-product on tree graphs to reweighted-max-product.

The tree-reweighted Bethe variational problem uses the same polytope \mathbb{L} that defines the first order LP in the first half of this section. Thus, naturally, the tree-reweighted Bethe variational problem (and hence the reweighted max-product) has the same connection to first order LP as the regular Bethe problem (or regular max-product).

\lambda^* := \log M^*, where M^* is a fixed point of the reweighted max-product algorithm, is an optimal solution to the LP in the reweighted case, provided the pseudo-max-marginals satisfy equations 8.23.

(2) Examples of first order LP relaxations.

Section 8.4.4 discusses examples where an LP problem is formulated from a graph problem. Example 8.3 shows how the Error-control coding problem may be thought of as an LP problem.

Steps in the process:

(a) Convert the error-control coding problem’s graph to a pairwise Markov random field, creating auxiliary variables to handle any interactions involving more than two variables (Appendix E.3 should help in understanding this).

(b) Find the constraints that need to be enforced in the model (Equations 8.24, which may be rewritten as 8.25).

(c) Solve the LP problem (Feldman)

Example 8.4 shows the same for (a) the Ising model, (b) the (Maximum Weight) Independent Set problem, (c) the (Minimum Weight) Vertex Cover problem, and (d) the Max-cut problem. These are all intractable problems, but a first order LP approximation is possible (a sketch of the MWIS relaxation appears after this paragraph). Example 8.5 solves another classic graph problem, Maximum Weight Matching, using an LP: Equation 8.29 is the exact problem, and Equation 8.30 is the LP version.
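To make case (b) concrete, here is a minimal sketch of the first order MWIS relaxation as an explicit LP (our own encoding, using scipy’s linprog): maximize \sum_s w_s \tau_s subject to \tau_s + \tau_t \leq 1 on every edge and 0 \leq \tau_s \leq 1.

    import numpy as np
    from scipy.optimize import linprog

    # MWIS relaxation on a 5-cycle with unit weights:
    #   max w . tau   s.t.   tau_s + tau_t <= 1 on every edge,  0 <= tau <= 1
    w = np.ones(5)
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]

    A_ub = np.zeros((len(edges), 5))
    for row, (s, t) in enumerate(edges):
        A_ub[row, s] = A_ub[row, t] = 1.0

    # linprog minimizes, so negate the weights
    res = linprog(-w, A_ub=A_ub, b_ub=np.ones(len(edges)),
                  bounds=[(0, 1)] * 5)
    print(res.x, -res.fun)
    # the LP optimum is 2.5 at tau = (1/2, ..., 1/2), strictly above the
    # integral optimum 2: the fractional vertex that higher order
    # (cycle) constraints cut off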

Other important points in this subsection

(a) The first order LP relaxation of an integer program with binary variables has a strong persistency property. This does not hold for variables with more than two states.

(b) MWIS and Max-cut are submodular maximization problems, and MWVC is a supermodular minimization problem; thus none of the three is a regular binary QP. They are intractable in general, hence LP approximations are needed for all three.

(3) Higher order LP approximations and how they can be more accurate.

Section 8.5 discusses higher order LP relaxations using marginal polytopes defined on hypergraphs. So far, we have used polytopes defined by the pairwise interactions in the graph; by considering polytopes defined on hypergraphs instead, we can enforce more constraints, which makes the LP relaxations higher order.

Higher order relaxation LPs are more accurate than (or at least as accurate as) lower order relaxation LPs for the same graphical model, because the higher order constraints are imposed in addition to the lower order ones.

Equation 8.34 is significant. \mathbb{L}_1(G) \supseteq \mathbb{L}_2(G) \supseteq \mathbb{L}_3(G) ...\supseteq \mathbb{L}_k(G) \supseteq...\supseteq \mathbb{M}(G).

If t is the treewidth of the hypergraph for the graphical model, then the t-th relaxation is tight, \mathbb{L}_t(G) = \mathbb{M}(G), and its LP solution coincides with the exact solution on the original polytope.

Finally, Example 8.6 shows how the second order relaxation adds constraints that make the solution “tighter” than the first order relaxation. This is done by solving the problem from Example 8.3 in two ways:

(a) binary MRF, which is a first order relaxation LP

(b) hypergraph MRF, which is a second order relaxation LP

Comparing the constraints, we see that \tau_{23} is unconstrained in (a), whereas Equation 8.38, applied in case (b), enforces the constraint on \tau_{23}.

Considering the solution \tilde{\tau} = (1, \frac{1}{2}, \frac{1}{2}, 0), we see it is feasible in (a) but not in (b). Thus, adding higher order constraints yields a tighter relaxation.

Lifting and Projecting

The introduction of additional parameters in order to achieve higher order relaxations is called lifting; the lifted polytope can then be mapped back to the lower-dimensional space by projecting. Example 8.7 illustrates this: lifting by introducing \tau_{stu} yields a set of inequalities, which can be projected back onto the original pairwise parameters using the simple Fourier-Motzkin elimination method (sketched below), resulting in inequalities 8.41. Some of these inequalities correspond to the cycle inequalities or triangle inequalities alluded to earlier in Example 4.1.
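Since Fourier-Motzkin elimination is easy to state, here is a generic sketch (our own, not tied to the specific inequalities of Example 8.7): to eliminate x_k from a system A x \leq b, combine each pair of inequalities in which x_k appears with opposite signs, scaled so that x_k cancels, and keep the rest.

    def fourier_motzkin(A, b, k):
        # Eliminate variable k from the system A x <= b (A: list of rows,
        # b: list of right-hand sides). Returns the projected system in
        # the remaining variables.
        pos, neg, new = [], [], []
        for a, bi in zip(A, b):
            if a[k] > 0:
                pos.append((a, bi))
            elif a[k] < 0:
                neg.append((a, bi))
            else:
                new.append((a, bi))
        for ap, bp in pos:
            for an, bn in neg:
                # scale by positive multipliers so the x_k terms cancel
                lam, mu = -an[k], ap[k]
                row = [lam * u + mu * v for u, v in zip(ap, an)]
                new.append((row, lam * bp + mu * bn))
        # drop column k from every row
        return ([[v for j, v in enumerate(row) if j != k] for row, _ in new],
                [bi for _, bi in new])

    # tiny demo: eliminate x_2 from {x_1 + x_2 <= 1, -x_2 <= 0}
    print(fourier_motzkin([[1.0, 1.0], [0.0, -1.0]], [1.0, 0.0], 1))
    # -> ([[1.0]], [1.0]), i.e. x_1 <= 1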

August 11, 2009

(Aug 13 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 8, p214-233

Filed under: Uncategorized — umassliving @ 3:49 pm

Sections 8.4.3 through 8.5

The main points that will be covered in these subsections are

(1) Extending the LP view of mode finding that we saw for max-product on tree graphs to reweighted-max-product.

(2) Examples of first order LP relaxations.

(3) Higher order LP approximations and how they can be more accurate.

Section 8.4.3 extends the arguments from earlier subsections in this chapter to show that reweighted max-product (and the tree-reweighted Bethe variational problem) can be thought of as an LP. Conditions 8.23 must be met for the max-product solution and the LP solution to be duals of one another.

Section 8.4.4 discusses examples where an LP is formulated from a graph problem. Example 8.3 shows how the error-control coding problem may be thought of as an LP. Example 8.4 shows the same for (a) the Ising model, (b) the (Maximum Weight) Independent Set problem, (c) the (Minimum Weight) Vertex Cover problem, and (d) the Max-cut problem. These are all intractable problems, but a first order LP approximation is possible. Example 8.5 solves another classic graph problem, Maximum Weight Matching, using an LP.

Section 8.5 discusses higher order LP relaxations using marginal polytopes defined on hypergraphs (rather than the polytopes defined by pairwise interactions in the graph). The main claim proved in this section is that higher order relaxation LPs are at least as accurate as lower order relaxation LPs for the same graphical model. The t^{th} relaxation LP, where t is the treewidth of the hypergraph for the graphical model, recovers the exact solution on the original polytope \mathbb{M}(G).

Finally, Example 8.6 shows how the second order relaxation adds constraints that make the solution “tighter” than in a first order relaxation. This is done by solving the problem from Example 8.3 with both methods, comparing the constraints, and considering the solution \tilde{\tau} = (1, \frac{1}{2}, \frac{1}{2}, 0).

We will be reading papers next week, so please come with suggestions for papers you think are interesting.

August 8, 2009

Duality for Linear Programs

Filed under: Uncategorized — umassliving @ 5:11 pm

I felt bad that my explanation of duality for linear programs was not especially cogent during the meeting, so I decided to write up a quick ~5 minute primer on duality.

First, I should mention some references.  The bulk of what follows is from Introduction to Linear Optimization by Bertsimas and Tsitsiklis, who have written a lot of other material on optimization that is probably useful if you’re really interested in this area.  There’s also a book called Convex Optimization by Boyd and Vandenberghe that covers the more general case of convex problems rather than strictly linear ones.  Both books are in our vision lab library in Moe’s cubicle.  Lastly, duality plays an important role in Support Vector Machines; in particular, going from the primal to the dual is necessary to see how kernels can be used with SVMs.  The notes on this page – http://www.stanford.edu/class/cs229/materials.html – are a good overview of duality for SVMs.  (Also let me plug the fact that these lectures can be found on YouTube.)

To begin with, consider the following dietary problem: you have n types of food and m types of nutrients.  One unit of food j gives you a_{ij} units of nutrient i.  You want to achieve an ideal diet that contains b_i units of nutrient i using amount x_j of food j, where each food j has a cost c_j.  Achieving a mix of foods that gives you the desired nutrients at minimal cost is then the following linear program:

minimize c^T x subject to Ax = b

Now let’s add to this a vector of Lagrange multipliers p, forming the generalized Lagrangian:

minimize c^T x + p^T (b - Ax)

This is what Moe was referring to as turning a constrained problem into an unconstrained one, but it is not itself the dual.

Let’s now define q(p) = \min_x c^T x + p^T (b - Ax), and also define x^* as an optimal feasible solution to the original (primal) minimization problem.  Knowing this, we can see that for any p:

q(p) = \min_x c^T x + p^T (b - Ax) \leq c^T x^* + p^T (b - Ax^*) = c^T x^* \leq c^T x'

The second equality comes from the fact that x^* is a feasible solution and so satisfies Ax^* = b, and the final inequality holds for every feasible x' and follows from the fact that x^* is an optimal solution.

Thus we know that for any p, q(p) is a lower bound to the original primal problem, so we want to maximize q(p) over all possible values of p to get the tightest lower bound possible.  We can also rearrange the terms:

q(p) = \min_x p^T b + (c^T - p^T A)x

Now we can see that if p^T A \neq c^T, then we can choose x with sign opposite to some non-zero component of c^T - p^T A and scale it arbitrarily, driving the minimum to -\infty.  Putting this together with the fact that we want to maximize q(p), we arrive at the dual problem:

\max_p q(p) subject to p^T A = c^T

What happened here?  Our constraints turned into variables p, and our variables x turned into the constraints p^T A = c^T.  This holds in general: going from the primal to the dual turns a minimization problem into a maximization problem, constraints into variables, and variables into constraints.

Moreover, think about going back to the original primal problem and adding the constraint x \geq 0, corresponding to the requirement that we use a non-negative amount of each food.  Then we only need c^T \geq p^T A in order to avoid -\infty, and hence direct inequality constraints on variables (x \geq 0) in the primal turn into inequality constraints in the dual, while inequality constraints in the primal turn into direct inequality constraints on variables in the dual.  Equality constraints in one form turn into free variables in the other, as we saw originally.
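Putting these rules together for the diet problem with x \geq 0, the full primal-dual pair is (a standard correspondence, written out here for reference):

\min_x c^T x subject to Ax = b, x \geq 0 \qquad \longleftrightarrow \qquad \max_p p^T b subject to A^T p \leq c, with p free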

What’s the point of all of this?  One thing to note is what happens if you manage to find a pair (x', p') satisfying c^T x' = q(p').  We know from above that q(p) \leq c^T x for any p and any feasible x, so x' must be an optimal solution to the primal problem, and likewise p' must be optimal for the dual.  Thus, one obvious approach to solving a linear program is to work on the primal and dual at the same time: the difference between the two values bounds the optimal solution, and once you’ve closed the gap to zero, you know you’ve found it.

There is also the stronger result of strong duality, which says that if a linear programming problem has an optimal solution, then so does its dual, and the optimal costs of the two problems are equal.  So if you find an optimal solution to one problem, you know the other is feasible and has the same optimal cost.
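Here is a quick numerical illustration on a made-up diet problem (a sketch using scipy; the matrix and the numbers are arbitrary):

    import numpy as np
    from scipy.optimize import linprog

    # a made-up diet problem: 2 nutrients, 3 foods
    A = np.array([[2.0, 1.0, 0.0],    # nutrient 1 per unit of each food
                  [0.0, 1.0, 3.0]])   # nutrient 2 per unit of each food
    b = np.array([4.0, 6.0])          # required nutrient amounts
    c = np.array([3.0, 2.0, 4.0])     # food costs

    # primal:  min c^T x   s.t.  A x = b,  x >= 0
    primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3)

    # dual:    max p^T b   s.t.  A^T p <= c,  p free
    dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 2)

    print(primal.fun, -dual.fun)   # equal at the optimum (strong duality)
    print(dual.x)                  # the prices on the two nutrients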

Note that these results depend on the original problem being linear; they will not hold for an arbitrary problem, although many of the results extend to convex problems.  There are also further topics, such as complementary slackness and the KKT conditions, that play a deeper role in understanding duality and help with solving such problems.  Complementary slackness also gives an intuition for the meaning of the dual of the dietary problem: food j has cost c_j, and we can think of p as prices on the nutrients, so that each food j gets a second price p^T A_j, where A_j is the jth column of A.  The primal-dual relationship says that when the nutrient prices are chosen correctly, the constraints are satisfied and the price accounting balances: at the optimal solution, c^T x^* = p^{*T} b.

August 6, 2009

(Aug 06 3-5pm) Meeting Summary: Wainwright / Jordan, ch 8, p195-213

Filed under: Uncategorized — umassliving @ 5:47 pm

Discussed the variational formulation of mode computation in Theorem 8.1 and the zero-temperature limit of the Bethe variational principle. The resulting IP can be transformed into an LP over \mathbb{L}(G) (more precisely, its dual). Gary provided a description of the dual of an LP.

Covered the max-product algorithm for Gaussians and the process of converting the standard Gaussian in exponential form to a pairwise MRF. Gary pointed out the typo in 8.15 (see the pre-meeting overview comments).

Covered LP relaxation and how it provides an approximation to the exact LP. Discussed the relationship between the extreme points in \mathbb{M}(G) and \mathbb{L}(G) and the definition of weakly and strongly persistent solutions (Gary won the debate, after Manju sided with him).

Covered the main point of Section 8.4.2, that max-product does not solve the relaxed LP in the case of graphs with cycles (in contrast to the analogous result for sum-product).

