July 23, 2009

(Jul 30 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 7

Filed under: Uncategorized — umassliving @ 4:26 pm
Tags: ,

The approximation methods we’ve covered in chapters 4 and 5 have both been nonconvex.  This chapter addresses the issue of how to create convex approximations, where both the set \mathcal{L} that we use is convex and the objective function that we seek to maximize is concave.

7.1 deals with the generic framework for the convex approximations.  7.2 shows how to convexify the algorithms we saw in chapter 4, while 7.3 shows how to create new algorithms that are convex, not based on using convex combinations.  7.4 shows a separate benefit of convexity in terms of algorithmic stability, and 7.5 discusses how the convex approximations for inference fit in to learning.


  • There is a missing subscript in equation (7.1) on page 167, the expectation should be \mathbb{E}[\phi_\alpha(X)].
  • There should be (I believe) a \rho(T) in front of the second inner product in the second equation on page 176.

Important things to follow and think about:

  • Why is the Bethe approximation to the entropy H_{Bethe} not generally concave?
  • On page 167, in the discussion of the projection operation \mapsto, how do we know that \mu(F) \in \mathcal{M}(F)?
  • Understand 7.2.1.  In particular, consider the distribution over spanning trees \rho that places all its mass on one specific spanning tree.  This seems very similar to structured mean field, where we just use the one spanning tree as the tractable subfamily we optimize over.  Yet structured mean field gives us a lower bound on A(\theta) whereas tree-reweighted Bethe gives us an upper bound.  What accounts for this difference?  For instance, why does structured mean field give a lower bound when we know that A^*(\mu(F)) \leq A^*(\mu), and we are subtracting off A^* in the objective function?
  • Convince yourself that you can obtain the tree-reweighted sum-product updates in equation (7.12) on page 172 by following the Lagrangian formulation of ordinary sum-product in chapter 4, and using the new messages M_{ts}(x_s) = \exp(\frac{1}{\rho_{st}} \lambda_{st}(x_t)).
  • Understand the general ideas in 7.2.2 and 7.2.3, the exact details are not as important.  For instance, understand why H_{ep} is generally not concave, but why using the re-weighting subject to \sum_{\ell=1}^{d_I} \rho(\ell) = 1 gives a concave function.
  • Understand the general idea in 7.3.1.  The details are not as important, although for the interested, I have put some notes on this section below.
  • Understand the benefit of the convexified algorithms for algorithmic stability and what it means for an algorithm to be globally Lipschitz stable, why that is important, and how Figure 7.3 suggests that ordinary sum-product is not globally Lipschitz stable.
  • Understand how the convexified algorithms can be used within learning in 7.5.  We use the fact that when B upper bounds the true cumulant function, then the surrogate likelihood lower bounds the true likelihood.  Yet in chapter 6, we saw how using mean field within EM also gives us a lower bound on the true likelihood, despite the fact that mean field gives a lower bound on the cumulant function.  What accounts for this discrepancy?
  • Understand the intuitative interpretation of example 7.3.

Slightly more esoteric details:

  • On page 184, why do \mu_i show up along the diagonal of N_1[\mu]?  Hint: each random variable X_i is binary valued.
  • On page 184, why is the constraint N_1[\mu] \succeq 0 imply the constraint \text{cov}(X) \succeq 0?  Use the Schur complement formula and the stated fact that if M is symmetric and positive-definite, then so is the Schur complement of D in M.
    • Here is a quick proof of that fact: we want to show that, if for all x, x^T M x > 0, then, for all y, y^T S y > 0 where S is the Schur complement of D.  Split x into two components, x_1 and x_2, with x_1 matching the size of A, and x_2 matching the size of D.  Then we have, x^T M x = x_1^T A x_1 + x_2^T C x_1 + x_1^T B x_2 + x_2^T D x_2 > 0.  Now substitute in C = B^T since M is symmetric, and assign x_2 = -D^{-1} B^T x_1.  Putting that in and simplifying, we get x_1^T A x - x_1^T B D^{-1} B^T x_1 = x_1^T S x_1 > 0, where x_1 was arbitrary, thus proving that S is positive-definite.
    • Now, how can we apply that to prove that N_1[\mu] \succeq 0 implies \text{cov}(X) \succeq 0?  If we let the single element 1 in the top left of N_1[\mu] be A, and \mathbb{E}[XX^T] in the bottom right be D, and take the Schur complement of A, we get the fact that, if N_1[\mu] \succeq 0, then \mathbb{E}[XX^T] - \mu\mu^T \succeq 0, and the second term is exactly \text{cov}(X), thus proving the implication.
  • One thing I am unsure of is whether you can go the other direction, and show that \text{cov}(X) \succeq 0 implies N_1[\mu] \succeq 0.  The wording in Wainwright seems to imply that you can, but it’s not clear to me how to do that using the Schur complement formula as they claim.
  • The important thing is that we know that the constraint N_1[\mu] \succeq 0 is an outer bound to the marginal polytope \mathbb{M}(K_m).  To show this, we use the fact that if there is a distribution p(X) that realizes mean parameter \mu, then letting Y be the random vector (1, X), we have for any vector a, a^T N_1[\mu] a = a^T \mathbb{E}[YY^T] a = \mathbb{E}[(a^T Y)^2] > 0, proving that N_1[\mu] \succeq 0.
  • So right now, we have \mathbb{M}(K_m) is outer bounded by the constraint N_1[\mu] \succeq 0 which in turn is outer bounded by the constraint \text{cov}(X) \succeq 0, and uncertain as to whether the two constraints are equivalent or not.
  • Why is it valid to replace the quantity within the determinant on page 184, \text{cov}(X) to N_1[\tau], as in equation (7.22) on page 185?  Again we use the Schur complement formula, and the property that \det(M) = \det(A)\det(S) where S is the Schur complement of D in M (see properties of determinant).  Using this and the fact above that by setting A as the single element 1 in the top right corner of N_1[\mu], the Schur complement of N_1[\mu] is \text{cov}(X) shows that the substitution is valid.
  • Why does setting canonical parameters as in (7.28) lead to mean parameters matching the empirical mean parameters?  The paper Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching gives a proof that the optimal \tau returned by the modified sum-product algorithm satisfies an admissibility condition that \langle \theta, \phi(x) \rangle + \text{constant} = \sum \log \tau_s(x_s) + \sum \rho_{st} \log \frac{\tau_{st}(x_s, x_t)}{\tau_s(x_s) \tau_t(x_t)}.  Using the \tilde{\theta} given in (7.28), we can see that \hat{\mu} satisfies this condition, and since the solution is unique, we have \hat{\mu} = \tau.
  • On page 193, how do we know that \|\theta\|_q = \sup_{\|\tau\|_{q'} \leq 1} \langle \tau, \theta \rangle?  From Holder’s inequality.

1 Comment »

  1. “Convince yourself that setting canonical parameters as in 7.28 leads to mean parameters matching the empirical mean parameters.”

    This is actually more difficult than I thought at first glance, so I moved it to the slightly more esoteric details section and replaced it with something easier.

    Comment by Gary — July 27, 2009 @ 1:04 pm | Reply

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Create a free website or blog at

%d bloggers like this: