# UMass LIVING

## July 23, 2009

### (Jul 30 3-5pm) Pre-meeting Overview: Wainwright / Jordan, ch 7

Filed under: Uncategorized — umassliving @ 4:26 pm

The approximation methods we’ve covered in chapters 4 and 5 have both been nonconvex.  This chapter addresses the issue of how to create convex approximations, where both the set $\mathcal{L}$ that we use is convex and the objective function that we seek to maximize is concave.

7.1 presents the generic framework for the convex approximations.  7.2 shows how to convexify the algorithms we saw in chapter 4, while 7.3 shows how to create new convex algorithms that are not based on convex combinations.  7.4 shows a separate benefit of convexity in terms of algorithmic stability, and 7.5 discusses how the convex approximations for inference fit into learning.

Errata:

• There is a missing subscript in equation (7.1) on page 167, the expectation should be $\mathbb{E}[\phi_\alpha(X)]$.
• There should be (I believe) a $\rho(T)$ in front of the second inner product in the second equation on page 176.

Questions to consider:

• Why is the Bethe approximation to the entropy $H_{Bethe}$ not generally concave?
• On page 167, in the discussion of the projection operation $\mapsto$, how do we know that $\mu(F) \in \mathcal{M}(F)$?
• Understand 7.2.1.  In particular, consider the distribution over spanning trees $\rho$ that places all its mass on one specific spanning tree.  This seems very similar to structured mean field, where we just use the one spanning tree as the tractable subfamily we optimize over.  Yet structured mean field gives us a lower bound on $A(\theta)$ whereas tree-reweighted Bethe gives us an upper bound.  What accounts for this difference?  For instance, why does structured mean field give a lower bound when we know that $A^*(\mu(F)) \leq A^*(\mu)$, and we are subtracting off $A^*$ in the objective function?
• Convince yourself that you can obtain the tree-reweighted sum-product updates in equation (7.12) on page 172 by following the Lagrangian formulation of ordinary sum-product in chapter 4, and using the new messages $M_{ts}(x_s) = \exp(\frac{1}{\rho_{st}} \lambda_{ts}(x_s))$.
• Understand the general ideas in 7.2.2 and 7.2.3; the exact details are not as important.  For instance, understand why $H_{ep}$ is generally not concave, but why using the re-weighting subject to $\sum_{\ell=1}^{d_I} \rho(\ell) = 1$ gives a concave function.
• Understand the general idea in 7.3.1.  The details are not as important, although for the interested, I have put some notes on this section below.
• Understand the benefit of the convexified algorithms for algorithmic stability and what it means for an algorithm to be globally Lipschitz stable, why that is important, and how Figure 7.3 suggests that ordinary sum-product is not globally Lipschitz stable.
• Understand how the convexified algorithms can be used within learning in 7.5.  We use the fact that when $B$ upper bounds the true cumulant function, then the surrogate likelihood lower bounds the true likelihood.  Yet in chapter 6, we saw how using mean field within EM also gives us a lower bound on the true likelihood, despite the fact that mean field gives a lower bound on the cumulant function.  What accounts for this discrepancy?
• Understand the intuitive interpretation of example 7.3.
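The upper-bounding idea behind 7.2.1 is easy to check numerically.  The following sketch (the 3-node Ising model, its parameter values, and the uniform distribution over the cycle's three spanning trees are all my own illustrative choices, not from the book) verifies that writing $\theta = \sum_T \rho(T)\,\theta(T)$ and applying convexity of $A$ gives $A(\theta) \leq \sum_T \rho(T) A(\theta(T))$:

```python
import itertools
import math

def log_partition(node_theta, edge_theta):
    """Brute-force A(theta) = log sum_x exp(<theta, phi(x)>) for 3 spins x_i in {-1,+1}."""
    scores = []
    for x in itertools.product([-1, 1], repeat=3):
        s = sum(node_theta[i] * x[i] for i in range(3))
        s += sum(w * x[u] * x[v] for (u, v), w in edge_theta.items())
        scores.append(s)
    m = max(scores)  # log-sum-exp, stabilized
    return m + math.log(sum(math.exp(s - m) for s in scores))

# Arbitrary parameters for a single 3-cycle (illustrative values only).
node_theta = [0.5, -0.3, 0.2]
edges = [(0, 1), (1, 2), (0, 2)]
edge_theta = {e: 1.0 for e in edges}

A_exact = log_partition(node_theta, edge_theta)

# The 3-cycle has three spanning trees, each omitting one edge.  With uniform
# rho(T) = 1/3, each edge appears with probability rho_st = 2/3, and rescaling
# the included edge parameters by 1/rho_st makes sum_T rho(T) theta(T) = theta.
rho_T, rho_st = 1.0 / 3.0, 2.0 / 3.0
A_upper = 0.0
for omitted in edges:
    tree_theta = {e: (0.0 if e == omitted else edge_theta[e] / rho_st) for e in edges}
    A_upper += rho_T * log_partition(node_theta, tree_theta)

print(A_exact, A_upper)
assert A_exact <= A_upper + 1e-12  # convexity of A gives the upper bound
```

Each tree's $A(\theta(T))$ is computed exactly here by enumeration; in the chapter the point is that tree log-partition functions are tractable even when the loopy graph's is not, and optimizing over $\rho$ and the decomposition tightens this bound.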

Slightly more esoteric details:

• On page 184, why do $\mu_i$ show up along the diagonal of $N_1[\mu]$?  Hint: each random variable $X_i$ is binary valued.
• On page 184, why does the constraint $N_1[\mu] \succeq 0$ imply the constraint $\text{cov}(X) \succeq 0$?  Use the Schur complement formula and the stated fact that if $M$ is symmetric and positive-definite, then so is the Schur complement of $D$ in $M$.
• Here is a quick proof of that fact: we want to show that, if $x^T M x > 0$ for all nonzero $x$, then $y^T S y > 0$ for all nonzero $y$, where $S$ is the Schur complement of $D$.  Split $x$ into two components, $x_1$ and $x_2$, with $x_1$ matching the size of $A$ and $x_2$ matching the size of $D$.  Then we have $x^T M x = x_1^T A x_1 + x_2^T C x_1 + x_1^T B x_2 + x_2^T D x_2 > 0$.  Now substitute $C = B^T$ (since $M$ is symmetric) and assign $x_2 = -D^{-1} B^T x_1$ (note $D$ is invertible, since it is itself positive-definite when $M$ is).  Putting that in and simplifying, we get $x_1^T A x_1 - x_1^T B D^{-1} B^T x_1 = x_1^T S x_1 > 0$, where $x_1$ was an arbitrary nonzero vector, thus proving that $S$ is positive-definite.
• Now, how can we apply that to prove that $N_1[\mu] \succeq 0$ implies $\text{cov}(X) \succeq 0$?  If we let the single element 1 in the top left of $N_1[\mu]$ be $A$, and $\mathbb{E}[XX^T]$ in the bottom right be $D$, and take the Schur complement of $A$, we get the fact that, if $N_1[\mu] \succeq 0$, then $\mathbb{E}[XX^T] - \mu\mu^T \succeq 0$, and the second term is exactly $\text{cov}(X)$, thus proving the implication.
• One thing I am unsure of is whether you can go the other direction, and show that $\text{cov}(X) \succeq 0$ implies $N_1[\mu] \succeq 0$.  The wording in Wainwright seems to imply that you can, but it’s not clear to me how to do that using the Schur complement formula as they claim.
• The important thing is that we know that the constraint $N_1[\mu] \succeq 0$ is an outer bound to the marginal polytope $\mathbb{M}(K_m)$.  To show this, we use the fact that if there is a distribution $p(X)$ that realizes mean parameter $\mu$, then letting $Y$ be the random vector $(1, X)$, we have for any vector $a$, $a^T N_1[\mu] a = a^T \mathbb{E}[YY^T] a = \mathbb{E}[(a^T Y)^2] \geq 0$, proving that $N_1[\mu] \succeq 0$.
• So right now, we have that $\mathbb{M}(K_m)$ is outer bounded by the constraint $N_1[\mu] \succeq 0$, which in turn is outer bounded by the constraint $\text{cov}(X) \succeq 0$, and we are uncertain whether the two constraints are equivalent or not.
• Why is it valid to replace the quantity within the determinant on page 184, $\text{cov}(X)$, with $N_1[\tau]$, as in equation (7.22) on page 185?  Again we use the Schur complement formula, together with the determinant identity $\det(M) = \det(A)\det(S)$, where $S$ is the Schur complement of $A$ in $M$.  Setting $A$ to be the single element 1 in the top left corner of $N_1[\mu]$, the Schur complement of $A$ in $N_1[\mu]$ is $\text{cov}(X)$ (as shown above), so $\det(N_1[\mu]) = 1 \cdot \det(\text{cov}(X))$, which shows that the substitution is valid.
• Why does setting canonical parameters as in (7.28) lead to mean parameters matching the empirical mean parameters?  The paper *Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching* gives a proof that the optimal $\tau$ returned by the modified sum-product algorithm satisfies the admissibility condition $\langle \theta, \phi(x) \rangle + \text{constant} = \sum_s \log \tau_s(x_s) + \sum_{(s,t)} \rho_{st} \log \frac{\tau_{st}(x_s, x_t)}{\tau_s(x_s) \tau_t(x_t)}$.  Using the $\tilde{\theta}$ given in (7.28), we can see that the empirical marginals $\hat{\mu}$ satisfy this condition, and since the solution is unique, we have $\hat{\mu} = \tau$.
• On page 193, how do we know that $\|\theta\|_q = \sup_{\|\tau\|_{q'} \leq 1} \langle \tau, \theta \rangle$?  This is the standard duality of $\ell_q$ norms: Hölder’s inequality gives $\langle \tau, \theta \rangle \leq \|\tau\|_{q'} \|\theta\|_q$, and the bound is attained by an appropriate choice of $\tau$.
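The Schur-complement facts in the bullets above are easy to sanity-check numerically.  This sketch (the random distribution over $\{0,1\}^3$ and the seed are arbitrary choices of mine) builds $N_1[\mu]$ for a realizable $\mu$, confirms it is PSD, and confirms that the Schur complement of the top-left 1 is exactly $\text{cov}(X)$, with $\det(N_1[\mu]) = \det(\text{cov}(X))$ as used for (7.22):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3

# All 2^n configurations of X in {0,1}^n, and a random distribution p over them.
configs = np.array([[(k >> i) & 1 for i in range(n)] for k in range(2 ** n)], dtype=float)
p = rng.random(2 ** n)
p /= p.sum()

# Y = (1, X), so N_1[mu] = E[Y Y^T] stacks 1, E[X], and E[X X^T] into one matrix.
Y = np.hstack([np.ones((2 ** n, 1)), configs])
N1 = (Y.T * p) @ Y

mu = N1[0, 1:]   # E[X]
D = N1[1:, 1:]   # E[X X^T]

# mu is realizable, so N_1[mu] must be PSD: a^T N_1[mu] a = E[(a^T Y)^2] >= 0.
assert np.linalg.eigvalsh(N1).min() > -1e-10

# Schur complement of the top-left entry A = 1: D - C A^{-1} B = E[XX^T] - mu mu^T.
schur = D - np.outer(mu, mu)
cov = sum(p[k] * np.outer(configs[k] - mu, configs[k] - mu) for k in range(2 ** n))
assert np.allclose(schur, cov)  # the Schur complement is exactly cov(X)

# det(N_1[mu]) = det(A) * det(Schur complement) = det(cov(X)).
assert np.isclose(np.linalg.det(N1), np.linalg.det(cov))
```

Note this only checks the forward direction (realizable $\mu$ gives $N_1[\mu] \succeq 0$ and a PSD Schur complement); it does not settle the converse question raised above about whether $\text{cov}(X) \succeq 0$ implies $N_1[\mu] \succeq 0$.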