UMass LIVING

July 31, 2009

(Jul 30) Meeting Summary: Wainwright / Jordan, ch 7

Filed under: Uncategorized — umassliving @ 10:43 am
Tags: ,

Discussion from the meeting:

• Why is the Bethe approximation to the entropy $H_{Bethe}$ not generally concave?
• Since we know that $A^*$ is convex, and is the negative entropy, we know that the entropy is concave.  However, $H_{Bethe}$ does not generally correspond to a true entropy.  We assume that the edges form a tree and compute the entropy as if it were, subtracting on the mutual information on each edge, but this is not a true entropy unless the edges actually do form a tree, and hence is not generally concave.
• On page 167, in the discussion of the projection operation $\mapsto$, how do we know that $\mu(F) \in \mathcal{M}(F)$?
• To see this, we need to go back to the original definition of $\mathcal{M}$.  Assume $\mu \in \mathcal{M}$.  By the definition of $\mathcal{M}$, there exists some distribution $p$ such that $\mu_\alpha = \mathbb{E}_p[\phi_\alpha(x)]$ for all sufficient statistics indexed by $\alpha$, where the expectation is with respect to $p$.  Clearly then, we can use the exact same $p$ to show that $\mu(F) \in \mathcal{M}(F)$ since the requirements are the same, but only for a subset of the sufficient statistics $\mathcal{I}(F)$.
• Understand 7.2.1.  In particular, consider the distribution over spanning trees $\rho$ that places all its mass on one specific spanning tree.  This seems very similar to structured mean field, where we just use the one spanning tree as the tractable subfamily we optimize over.  Yet structured mean field gives us a lower bound on $A(\theta)$ whereas tree-reweighted Bethe gives us an upper bound.  What accounts for this difference?  For instance, why does structured mean field give a lower bound when we know that $A^*(\mu(F)) \leq A^*(\mu)$, and we are subtracting off $A^*$ in the objective function?
• In mean field, we have that $A^*(\tau) = A^*(\mu(F)) = A^*(\mu)$, as proven in last week’s meeting summary.  Understanding this is the key to understanding this issue.  In mean field, we assumed that the canonical parameters that were outside those defined for the tractable subgraph were 0.  Therefore, $A(\theta)$ is the same, whether you include the inner product with the extra parameters, which are all zero, or not.  The convex combination of spanning trees is different because of the projection operator $\mapsto$ above.  We are still searching across the full space of possible mean parameters, but when we compute entropy, we use the projection operator to essentially ignore all the mean parameters that are not in our subgraph.  This differs from mean field, where the mean parameters that are not in our subgraph are deterministic functions of the mean parameters that are in the subgraph, corresponding to the functions $g$ in Wainwright and $\Gamma$ in the structured mean field optimization paper.  This difference means that you are searching over an outer bound as opposed to an inner bound, and are using an upper bound to the objective function, thus you are guaranteed to get an upper bound to $A(\theta)$.
• There should be a $\rho(T')$ in front of the second inner product in the second equation on page 176.
• This is for the $T'$ they mention that has $\Pi^{T'}(\lambda)$ different from zero.  Additionally, if you are wondering why $A^*$ is strictly convex, it is a consequence of the fact that $A$ is strictly convex.  The details are in Appendix B.3.
• Understand the general ideas in 7.2.2 and 7.2.3, the exact details are not as important.  For instance, understand why $H_{ep}$ is generally not concave, but why using the re-weighting subject to $\sum_{\ell=1}^{d_I} \rho(\ell) = 1$ gives a concave function.
• Since entropy is concave, if we take a positive combination of entropy functions, that should also be concave.  In $H_{ep}$, we have a negative amount $(1-d_I) H(\tau)$, so it is not concave.  Re-weighting subject to $\sum_{\ell=1}^{d_I} \rho(\ell) = 1$, or actually, $\sum_{\ell=1}^{d_I} \rho(\ell) \leq 1$, will fix this problem, making the approximation concave.
• Understand the benefit of the convexified algorithms for algorithmic stability and what it means for an algorithm to be globally Lipschitz stable, why that is important, and how Figure 7.3 suggests that ordinary sum-product is not globally Lipschitz stable.
• One other important take-away from Figure 7.3 is why the approximations are better the lower the coupling strength (size of the parameter on the edges).  This is because, the lower the coupling strength, the more independent the individual nodes, which in the extreme case of a coupling strength of zero becomes a simple product distribution.  The higher the coupling strength, the more important the loops in the graph become, the more difficult inference becomes.
• Understand how the convexified algorithms can be used within learning in 7.5.  We use the fact that when $B$ upper bounds the true cumulant function, then the surrogate likelihood lower bounds the true likelihood.  Yet in chapter 6, we saw how using mean field within EM also gives us a lower bound on the true likelihood, despite the fact that mean field gives a lower bound on the cumulant function.  What accounts for this discrepancy?
• This is a check on your understanding of mean field within EM.  If you look at what mean field is approximating in EM, you should see why a lower bound on the cumulant gives a lower bound, whereas in 7.5 we need an upper bound on the cumulant to give a lower bound.
• Understand the intuitative interpretation of example 7.3.
• Another interesting point is the idea, in the final paragraph of the chapter, that it may be better to have an inconsistent estimator of the true parameter $\theta^*$, if your interest is in prediction (inference) and you must use an approximate inference algorithm.

Slightly more esoteric details:

• On page 184, why do $\mu_i$ show up along the diagonal of $N_1[\mu]$?  Hint: each random variable $X_i$ is binary valued.
• The diagonal elements are $\mathbb{E}[X_i^2]$.  Since $X_i$ is binary, we have $\mathbb{E}[X_i^2] = \sum_{0,1} X_i^2 p(X_i) = p(X_i = 1) = \mu_i$.
• Still unsure of whether you can go the other direction and show that $\text{cov}(X) \succeq 0$ implies $N_1[\mu] \succeq 0$.  The wording in Wainwright seems to imply that you can, but it’s not clear to me how to do that using the Schur complement formula as they claim.
• Update: Kim, Kojima, Yamashtia, “Second Order Cone Programming Relaxation of Positive Semidefinite Constraint” has a simple proof of this equivalence.  Let $A = \left( \begin{array}{cc} 1 & -\mu^T \\ 0 & \mathbf{I} \end{array} \right)$.  Then $B = A^T N_1[\mu] A$ where $B = \left( \begin{array}{cc} 1 & 0 \\ 0 & \mathbb{E}[XX^T] - \mu \mu^T \end{array} \right)$.  Clearly $B$ is positive semidefinite if and only if $\mathbb{E}[XX^T] - \mu \mu^T = \text{cov}(X) \succeq 0$, and since $A$ is nonsingular, $N_1[\mu]$ is psd if and only if $B$, and hence $\text{cov}(X)$, is psd.