to be filled in

## August 29, 2009

### July 16 3-5 pm: Meeting Summary, Chapter 6 – Variational Methods in Parameter Estimation

## August 20, 2009

### (Aug 20 3-5PM) Meeting Summary : Wainwright – Estimating the “Wrong” Graphical Model

Things we covered today:

- In the regularized surrogate likelihood (Eq 16) , we use , from the Bethe surrogate formulation covered in the survey paper. We then went through the steps in the joint estimation and prediction in section 4.2. We noted that the noise model is localized for each element of : . Erik wondered whether this was done for simplicity or by necessity and we thought it was for simplicity.
- The parameter estimate based on the surrogate likelihood is asymptotically normal.
- We had a long discussion about the relationship between globally stable, Lipschitz, and strongly convex. Any variational method that has a strongly convex entropy approximation is globally stable. Also, any variational method based on a convex optimization is Lipschitz stable. I’m not sure if there’s a difference between Lipschitz stable and globally stable…
- None of us knew how to derive eq 22, but we got some intuition by observing that if , then this is when the SNR is pure noise and the observation is useless. This is reflected in eq 22 by removing the term involving . Similarly, if , then , which is intuitive.
- When the SNR , in section 6.2 they show that which means will be different.
- In Figure 4, the curve marked with the red diamonds (approximate model) is upper bounded, as stated in Theorem 7. Figure 4 also illustrates that performing approximate inference (TRW method) using the approximate model (parameters) can be superior to performing approximate inference using the true model (black circles).

## August 13, 2009

### (Aug 20 3-5pm) Pre-meeting Overview: Wainwright – Estimating the “wrong” graphical model

M. J. Wainwright. Estimating the “wrong” graphical model: Benefits in the computation-limited setting. – http://www.eecs.berkeley.edu/~wainwrig/Papers/Wainwright06_JMLR.pdf

This paper is about variational approaches to parameter estimation and prediction in an MRF. In particular, they argue that, in the variational setting, an inconsistent estimator of can be desirable, because it can offset errors later in the prediction step.

Section 2,3 are covered in the Graphical Model survey.

Some points to examine:

- the overall procedure presented in Section 4.2.
- the application to Gaussian mixtures in Section 6.1
- the comparison of the tree-reweighted approach and sum-product in Section 7.

### (Aug 13 3-5pm) Meeting Summary: Wainwright / Jordan, ch 8, p214-233

The main points that will be covered in these subsections are

**(1) Extending the LP view of mode finding that we saw for max-product on tree graphs to reweighted-max-product.**

**(2) Examples of first order LP relaxations.**

**(3) Higher order LP approximations and how they can be more accurate.**

**(1) Extending the LP view of mode finding that we saw for max-product on tree graphs to reweighted-max-product.**

The tree-reweighted Bethe variational problem uses the same polytope that defines the first order LP in the first half of this section. Thus, naturally, the tree-reweighted Bethe variational problem (and hence the reweighted max-product) has the same connection to first order LP as the regular Bethe problem (or regular max-product).

, where is the fixed point in reweighted max- product algorithm, can be the solution to the LP in the reweighted case. This is true when equations 8.23 are satisfied by the *pseudo-max-marginals.*

**(2) Examples of first order LP relaxations.**

Section 8.4.4 discusses examples where an LP problem is formulated from a graph problem. Example 8.3 shows how the Error-control coding problem may be thought of as an LP problem.

Steps in the process:

(a) Convert the error-control coding problem’s graph to a pairwise markov random field. Create auxillary variables to handle any interactions that have more than two variables. Appendix E.3 should help in understanding this

(b) Find out the constraints that need to be enforced in the model (Equations 8.24, which may be rewritten as 8.25)

(c) Solve the LP problem (Feldman)

Example 8.4 shows the same for the (a) Ising model, (b) (Maximum Weight) Independent Set problem, (c) (Minimum Weight) Vertex Cover problem, (d) Max-cut problem. These are all examples of intractable problems, but a first oder LP approximate solution is possible. Example 8.5 is another classic graph problem, Maximum Weight Matching problem, solved using an LP. Equation 8.29 is the exact problem, Equation 8.30 is the LP version.

Other important points in this subsection

(a) First order LP interger program with binary variables has strong consistency property. This is not true for higher order variables.

(b) MWIS, Maxcut are submodular maximization problems, and MWVC is a supermodular minimization problem. Thus, all three are not *regular binary QPs. *They are intractable in general. LP approximate solutions are hence needed for all three.

**(3) Higher order LP approximations and how they can be more accurate.**

Section 8.5 discusses higher order LP relaxations by using the marginal polytopes defined on hypergraphs. So far, we used the polytopes as defined by the pairwise interactions in the graph. By considering polytopes defined by the hypergraphs instead, we can enforce more constraints. This means that the relaxations in the LP will be of higher order.

Higher order relaxation LPs are more accurate than (or atleast as accurate as) the lower order relaxation LPs for the same graphical model. This is because the higher order relaxation constraints are in addition to the lower order ones.

Equation 8.34 is significant. .

If is the tree-width of the hypergraph for the graphical model, then the LP solution is the same as the exact solution on the original polytope .

Finally, example 8.6 shows how the second order relaxation adds additional constraints that makes the solution “tighter” than a first order relaxation. This is done by solving the same problem solved earlier in Example 8.3 by two methods:

(a) binary MRF, which is a first order relaxation LP

(b) hypergraph MRF, which is a second order relaxation LP

Comparing the constraints, we see that has no constraints in (a). 8.38 applied in case of (b) enforces the contraint on .

Considering one solution , we see this is valid in (a), but not valid in (b). Thus, adding higher order constraints results in more accurate solutions.

*Lifting and Projecting*

The introduction of additional parameters in order to achieve higher order relaxations is called *lifting*. The lifted polytope can be projected back to lower space by *projecting*. Example 8.7 illustrates this. Lifting by introducing results in inequalities 8.41. These can be projected back to the original pairwise parameters by using the simple Fourier-Motzkin elimination method, resulting in inequalities 8.41. Some of these inequalites correspond to the *cycle inequalities* or *triangle inequalities*. This was alluded to earlier in Example 4.1)

## August 11, 2009

### (Aug 13 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 8, p214-233

Section 8.4.3 till 8.5

The main points that will be covered in these subsections are

(1) Extending the LP view of mode finding that we saw for max-product on tree graphs to reweighted-max-product.

(2) Examples of first order LP relaxations.

(3) Higher order LP approximations and how they can be more accurate.

Section 8.4.3 extends the arguments from earlier subsections in this chapter to show that reweighted max-product (and tree-reweighted Bethe variational problem) can be thought of as an LP problem. Conditions 8.23 have to be met for the max-product solution and LP problem solution to be dual of one another.

Section 8.4.4 discusses examples where an LP problem is formulated from a graph problem. Example 8.3 shows how the Error-control coding problem may be thought of as an LP problem. Example 8.4 shows the same for the (a) Ising model, (b) (Maximum Weight) Independent Set problem, (c) (Minimum Weight) Vertex Cover problem, (d) Max-cut problem. These are all examples of intractable problems, but a first oder LP approximate solution is possible. Example 8.5 is another classic graph problem, Maximum Weight Matching problem, solved using an LP.

Section 8.5 discusses higher order LP relaxations by using the marginal polytopes defined on hypergraphs (rather than the polytopes defined by pairwise interactions in a tree). The main claim and proof in this section is that higher order relaxation LPs are more accurate the lower order relaxation LPs for the same graphical model. The relaxation LP solution, where is the tree-width of the hypergraph for the graphical model, is the same as the exact solution on the original polytope .

Finally, example 8.6 shows how the second order relaxation adds additional constraints that makes the solution “tighter” than in a first order relaxation. This is done by solving the same problem solved earlier in Example 8.3 by the two methods, and comparing the constraints and considering one solution .

We will be reading papers next week, so please come with suggestions for papers you think are interesting.

## August 8, 2009

### Duality for Linear Programs

I felt bad that my explanation of duality for linear programs was not especially cogent during the meeting, so I decided to write up a quick ~5 minute primer on duality.

First, I should mention some references. The bulk of what will follow is from Introduction to Linear Optimization by Bertsimas and Tsitsiklis. They also write a lot of other stuff on optimization that is probably useful if you’re really interested in this area. Also, there’s a book called Convex Optimization by Boyd and Vandenberghe that talks about the more general case of convex problems rather than strictly linear problems. Both of these books are in our vision lab library in Moe’s cubicle. Lastly, duality plays in important role in Support Vector Machines, in particular going from the primal to the dual is necessary to see how to kernels can be used with SVMs. The notes on this page – http://www.stanford.edu/class/cs229/materials.html – are a good overview of duality for SVMs. (Also let me plug the fact that these lectures can be found on YouTube.)

To begin with, consider the following dietary problem – you have types of food and types of nutrients. You know that one unit of food gives you units of nutrient . You want to achieve an ideal diet that contains units of nutrient using amount of food , and finally you have a cost associated with each food . Achieving a mix of foods that gives you your desired nutrients at minimal cost is then the following linear program:

minimize subject to

Now to that, let’s add a vector of Lagrange multipliers to form the generalized Lagrangian:

minimize

This is what Moe was referring to as turning a constrained problem to an unconstrained problem, but is different from the dual.

Let’s now define , and also define as an optimal feasible solution to the original (primal) minimization problem. Knowing this, we can see that for any :

The second equality comes from the fact that is a feasible solution and so satisfies , and the final inequality is for all feasible and follows from the fact that is an optimal solution.

Thus we know that for any , is a lower bound to the original primal problem, so we want to maximize over all possible values of to get the tightest lower bound possible. We can also rearrange the terms:

Now we can see that if , then we can set to have the opposite sign as the non-zero component of and make the minimum value equal to . Putting together this fact with the fact that we want to maximize , we now have our dual problem:

subject to

What happened here? Our constraints turned into variables , and our variables turned into constraints . This is going to be true in general – going from the primal to the dual will change a minimization problem to a maximization problem, change constraints to variables and variables to constraints.

Moreover, think about going back to the original primal problem and adding the constraint , corresponding to the constraint that we have to use a non-negative amount of food. Then you can see that we only need in order to avoid , and hence direct inequality constraints on variables in the primal turn into inequality constraints in the dual, and inequality constraints in the primal turn into direct inequality constraints on variables in the dual. Equality constraints in one form turn into free variables in the other form, as we saw originally.

What’s the point of all of this? One thing to note is what happens if you manage to find a pair that satisfy ? We know from above that for any and any feasible , therefore we know that is an optimal solution to the primal problem, and vice versa that is optimal for the dual. Thus, one obvious thing to try when solving a linear program is to try to solve the primal and dual at the same time. The difference in the values gives you bounds on what the optimal solution can be, and once you’ve closed the gap to zero, you know you’ve found the optimal solution.

There is also the stronger result of strong duality, that says if a linear programming problem has an optimal solution, so does its dual, and the optimal costs of the two problems are equal. So if you find a solution to one problem, then you know the other must be feasible and have the same optimal solution.

Note these results depend on the fact that the original problem was linear, they will not hold for any arbitrary problem. However, many results can be extended to convex problems. There are also additional details such as complementary slackness and KKT conditions that play a deeper role in understanding duality and help with solving such problems. The notion of complementary slackness also gives an intuition behind the meaning of the dual problem for the original dietary problem: we know that food has cost . In the dual, we can think of as being costs on the nutrients, and based on how much food has of each nutrient, we have a second price on food , where is the th column of . The primal-dual relationship is saying that, when the prices are correctly chosen on the nutrients, we satisfy the constraints as well as the price accounting that, for the optimal solution, we have .

## August 6, 2009

### (Aug 06 3-5pm) Meeting Summary: Wainwright / Jordan, ch 8, p195-213

Discussed the variational formulation of mode computation in theorem 8.1 and the zero-temperature limit of the Bethe variational principle. The resulting IP can be transformed to a LP over (more precisely, its dual). Gary provided a description of the dual of a LP.

Covered the max-product algorithm for Gaussians and the process of converting the standard Gaussian in exponential form to a pairwise MRF. Gary pointed out the type in 8.15 (see pre-meeting overview comments).

Covered LP relaxation and how it provides an approximation to the exact LP. Discussed the relationship between the extreme points in and and the definition of weakly and strongly persistent solutions (Gary won the debate, after Manju sided with him).

Covered the main point of section 8.4.2, that max-product does not solve the relaxed LP in the case of graphs with cycles (contrary to the analogous result for sum-product).

### (Aug 06 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 8, p195-213

Section 8 covers the problem of mode computation. That is given a probability distribution, compute the most probable configuration or the mode. It turns out that, this problem also has a variational representation as made explicit in Theorem 8.1 (see Appendix B.5. for the proof). Similar to our treatment in chapter 4, this variational problem can be solved exactly for tree-structured and Gaussian graphical models. Some points to focus on for this week:

- The objective function is simpler than the one for inference since it does not include , and so the main source of difficulty is in characterizing the space of mean parameters, explicitly. It should be clear that the optimization problem is an integer programming one (wikipedia).
- Section 8.2 transforms the IP problem to a LP one over the set for the case of tree-structured distributions and shows that a Lagrangian method for solving the dual of the LP are the max-product updates. Note that this result does not carry over to
**non**-tree-structured graphs as shown in section 8.4.2. - Section 8.3 discusses the exact computation of the mode for Gaussian graphical models (see Appendix C).
- Section 8.4.1 discusses first order LP relaxation for computing the mode of a discrete graphical model with cycles, and discusses the concept of strong and weak persistency.

## July 31, 2009

### (Jul 30) Meeting Summary: Wainwright / Jordan, ch 7

Discussion from the meeting:

- Why is the Bethe approximation to the entropy not generally concave?
- Since we know that is convex, and is the negative entropy, we know that the entropy is concave. However, does not generally correspond to a true entropy. We assume that the edges form a tree and compute the entropy as if it were, subtracting on the mutual information on each edge, but this is not a true entropy unless the edges actually do form a tree, and hence is not generally concave.

- On page 167, in the discussion of the projection operation , how do we know that ?
- To see this, we need to go back to the original definition of . Assume . By the definition of , there exists some distribution such that for all sufficient statistics indexed by , where the expectation is with respect to . Clearly then, we can use the exact same to show that since the requirements are the same, but only for a subset of the sufficient statistics .

- Understand 7.2.1. In particular, consider the distribution over spanning trees that places all its mass on one specific spanning tree. This seems very similar to structured mean field, where we just use the one spanning tree as the tractable subfamily we optimize over. Yet structured mean field gives us a lower bound on whereas tree-reweighted Bethe gives us an upper bound. What accounts for this difference? For instance, why does structured mean field give a lower bound when we know that , and we are subtracting off in the objective function?
- In mean field, we have that , as proven in last week’s meeting summary. Understanding this is the key to understanding this issue. In mean field, we assumed that the canonical parameters that were outside those defined for the tractable subgraph were 0. Therefore, is the same, whether you include the inner product with the extra parameters, which are all zero, or not. The convex combination of spanning trees is different because of the projection operator above. We are still searching across the full space of possible mean parameters, but when we compute entropy, we use the projection operator to essentially ignore all the mean parameters that are not in our subgraph. This differs from mean field, where the mean parameters that are not in our subgraph are deterministic functions of the mean parameters that are in the subgraph, corresponding to the functions in Wainwright and in the structured mean field optimization paper. This difference means that you are searching over an outer bound as opposed to an inner bound, and are using an upper bound to the objective function, thus you are guaranteed to get an upper bound to .

- There should be a in front of the second inner product in the second equation on page 176.
- This is for the they mention that has different from zero. Additionally, if you are wondering why is strictly convex, it is a consequence of the fact that is strictly convex. The details are in Appendix B.3.

- Understand the general ideas in 7.2.2 and 7.2.3, the exact details are not as important. For instance, understand why is generally not concave, but why using the re-weighting subject to gives a concave function.
- Since entropy is concave, if we take a positive combination of entropy functions, that should also be concave. In , we have a negative amount , so it is not concave. Re-weighting subject to , or actually, , will fix this problem, making the approximation concave.

- Understand the benefit of the convexified algorithms for algorithmic stability and what it means for an algorithm to be globally Lipschitz stable, why that is important, and how Figure 7.3 suggests that ordinary sum-product is not globally Lipschitz stable.
- One other important take-away from Figure 7.3 is why the approximations are better the lower the coupling strength (size of the parameter on the edges). This is because, the lower the coupling strength, the more independent the individual nodes, which in the extreme case of a coupling strength of zero becomes a simple product distribution. The higher the coupling strength, the more important the loops in the graph become, the more difficult inference becomes.

- Understand how the convexified algorithms can be used within learning in 7.5. We use the fact that when upper bounds the true cumulant function, then the surrogate likelihood lower bounds the true likelihood. Yet in chapter 6, we saw how using mean field within EM also gives us a lower bound on the true likelihood, despite the fact that mean field gives a lower bound on the cumulant function. What accounts for this discrepancy?
- This is a check on your understanding of mean field within EM. If you look at what mean field is approximating in EM, you should see why a lower bound on the cumulant gives a lower bound, whereas in 7.5 we need an upper bound on the cumulant to give a lower bound.

- Understand the intuitative interpretation of example 7.3.
- Another interesting point is the idea, in the final paragraph of the chapter, that it may be better to have an inconsistent estimator of the true parameter $\theta^*$, if your interest is in prediction (inference) and you must use an approximate inference algorithm.

Slightly more esoteric details:

- On page 184, why do show up along the diagonal of ? Hint: each random variable is binary valued.
- The diagonal elements are . Since is binary, we have .

- Still unsure of whether you can go the other direction and show that implies . The wording in Wainwright seems to imply that you can, but it’s not clear to me how to do that using the Schur complement formula as they claim.
**Update**: Kim, Kojima, Yamashtia, “Second Order Cone Programming Relaxation of Positive Semidefinite Constraint” has a simple proof of this equivalence. Let . Then where . Clearly is positive semidefinite if and only if , and since is nonsingular, is psd if and only if , and hence , is psd.

## July 23, 2009

### (Jul 30 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 7

The approximation methods we’ve covered in chapters 4 and 5 have both been nonconvex. This chapter addresses the issue of how to create convex approximations, where both the set that we use is convex and the objective function that we seek to maximize is concave.

7.1 deals with the generic framework for the convex approximations. 7.2 shows how to convexify the algorithms we saw in chapter 4, while 7.3 shows how to create new algorithms that are convex, not based on using convex combinations. 7.4 shows a separate benefit of convexity in terms of algorithmic stability, and 7.5 discusses how the convex approximations for inference fit in to learning.

Errata:

- There is a missing subscript in equation (7.1) on page 167, the expectation should be .
- There should be (I believe) a in front of the second inner product in the second equation on page 176.

Important things to follow and think about:

- Why is the Bethe approximation to the entropy not generally concave?
- On page 167, in the discussion of the projection operation , how do we know that ?
- Understand 7.2.1. In particular, consider the distribution over spanning trees that places all its mass on one specific spanning tree. This seems very similar to structured mean field, where we just use the one spanning tree as the tractable subfamily we optimize over. Yet structured mean field gives us a lower bound on whereas tree-reweighted Bethe gives us an upper bound. What accounts for this difference? For instance, why does structured mean field give a lower bound when we know that , and we are subtracting off in the objective function?
- Convince yourself that you can obtain the tree-reweighted sum-product updates in equation (7.12) on page 172 by following the Lagrangian formulation of ordinary sum-product in chapter 4, and using the new messages .
- Understand the general ideas in 7.2.2 and 7.2.3, the exact details are not as important. For instance, understand why is generally not concave, but why using the re-weighting subject to gives a concave function.
- Understand the general idea in 7.3.1. The details are not as important, although for the interested, I have put some notes on this section below.
- Understand the benefit of the convexified algorithms for algorithmic stability and what it means for an algorithm to be globally Lipschitz stable, why that is important, and how Figure 7.3 suggests that ordinary sum-product is not globally Lipschitz stable.
- Understand how the convexified algorithms can be used within learning in 7.5. We use the fact that when upper bounds the true cumulant function, then the surrogate likelihood lower bounds the true likelihood. Yet in chapter 6, we saw how using mean field within EM also gives us a lower bound on the true likelihood, despite the fact that mean field gives a lower bound on the cumulant function. What accounts for this discrepancy?
- Understand the intuitative interpretation of example 7.3.

Slightly more esoteric details:

- On page 184, why do show up along the diagonal of ? Hint: each random variable is binary valued.
- On page 184, why is the constraint imply the constraint ? Use the Schur complement formula and the stated fact that if is symmetric and positive-definite, then so is the Schur complement of in .
- Here is a quick proof of that fact: we want to show that, if for all , , then, for all , where is the Schur complement of . Split into two components, and , with matching the size of , and matching the size of . Then we have, . Now substitute in since is symmetric, and assign . Putting that in and simplifying, we get , where was arbitrary, thus proving that is positive-definite.
- Now, how can we apply that to prove that implies ? If we let the single element 1 in the top left of be , and in the bottom right be , and take the Schur complement of , we get the fact that, if , then , and the second term is exactly , thus proving the implication.

- One thing I am unsure of is whether you can go the other direction, and show that implies . The wording in Wainwright seems to imply that you can, but it’s not clear to me how to do that using the Schur complement formula as they claim.
- The important thing is that we know that the constraint is an outer bound to the marginal polytope . To show this, we use the fact that if there is a distribution that realizes mean parameter , then letting be the random vector , we have for any vector , , proving that .
- So right now, we have is outer bounded by the constraint which in turn is outer bounded by the constraint , and uncertain as to whether the two constraints are equivalent or not.
- Why is it valid to replace the quantity within the determinant on page 184, to , as in equation (7.22) on page 185? Again we use the Schur complement formula, and the property that where is the Schur complement of in (see properties of determinant). Using this and the fact above that by setting as the single element 1 in the top right corner of , the Schur complement of is shows that the substitution is valid.
- Why does setting canonical parameters as in (7.28) lead to mean parameters matching the empirical mean parameters? The paper Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching gives a proof that the optimal returned by the modified sum-product algorithm satisfies an admissibility condition that . Using the given in (7.28), we can see that satisfies this condition, and since the solution is unique, we have .
- On page 193, how do we know that ? From Holder’s inequality.