# UMass LIVING

## June 26, 2009

### (Jun 25) Meeting Summary: Wainwright / Jordan, ch 4.3

Filed under: Uncategorized — umassliving @ 4:35 pm

Cleared up confusion over univariate versus multivariate sufficient statistics.  Key point was that $\Phi^i$ will be a vector, not a scalar.

“Marginalization” for a Gaussian means computing the mean and covariance.  In general it means computing the mean parameters (for discrete random variables using the standard overcomplete sufficient statistics, this coincides with our normal definition of marginalization).
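To make the discrete case concrete, here is a minimal sketch checking that the mean parameters of the overcomplete indicator statistics are exactly the ordinary node marginals. The joint table and the helper names (`indicator_stats`, `mean_parameters`, `marginal`) are made up for illustration and do not come from the paper.

```python
# Hypothetical joint distribution over two binary variables (x_0, x_1).
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}


def indicator_stats(x):
    """Overcomplete node statistics: one indicator I[x_s = j] per node s and state j."""
    return {(s, j): 1.0 if x[s] == j else 0.0 for s in range(2) for j in range(2)}


def mean_parameters(p):
    """Expected value of every indicator statistic under the distribution p."""
    mu = {(s, j): 0.0 for s in range(2) for j in range(2)}
    for x, px in p.items():
        for key, value in indicator_stats(x).items():
            mu[key] += px * value
    return mu


def marginal(p, s):
    """Ordinary marginal of node s, obtained by summing out the other variable."""
    m = [0.0, 0.0]
    for x, px in p.items():
        m[x[s]] += px
    return m


mu = mean_parameters(joint)
for s in range(2):
    # The two lists printed on each row should agree up to rounding.
    print(s, [mu[(s, j)] for j in range(2)], marginal(joint, s))
```

The same check works for edge indicators $I[x_s = j, x_t = k]$, whose mean parameters are the pairwise marginals.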

Note there’s a missing $x$ in equation 4.63; the beginning should be $\exp(-\frac{1}{2}x^T \Sigma^{-1} x)$.  Also, starting on page 113 and repeating several other times, the paper references equation 4.48, but actually means equation 4.58.

Answer to the pre-meeting question of how we know that $\mathcal{L}(\phi, \Phi)$ is convex: for a fixed $i$, the set of points whose projection $\Pi^i$ satisfies the constraint is convex because of the convexity of $\mathcal{M}$ (the preimage of a convex set under a linear projection is convex).  $\mathcal{L}$ is then just the intersection of all these convex sets, and hence is convex itself.

How do we get the term-by-term entropy approximation 4.68?  If you start with the definition of entropy as $- \int p(x) \ln p(x) dx$ and use equation 4.58 for $p(x)$, and in addition, ignore the log partition function, you can derive 4.68, but this is a mistake, because the log partition function ties all the components together, so you can’t decompose the probability.  The approximation, then, is to assume that the distribution factorizes as $p(x;\theta,\tilde{\theta}) = p(x;\theta,\vec{0}) \prod_i p(x; \theta, \tilde{\theta}^i) / p(x; \theta, \vec{0})$, and using that approximation, you can derive 4.68.
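As a rough sketch of that last step (not the paper's own derivation), write $p_0(x) = p(x;\theta,\vec{0})$ and $p_i(x) = p(x;\theta,\tilde{\theta}^i)$; this shorthand is introduced here only for illustration. Taking logs of the assumed factorization gives

$$\ln p(x) = \ln p_0(x) + \sum_i \left[ \ln p_i(x) - \ln p_0(x) \right],$$

so that

$$-\int p(x) \ln p(x)\, dx = -\int p(x) \ln p_0(x)\, dx - \sum_i \left[ \int p(x) \ln p_i(x)\, dx - \int p(x) \ln p_0(x)\, dx \right].$$

If we further assume that each expectation under $p$ can be replaced by the expectation under the factor whose log appears inside it (plausible when the relevant mean parameters match, but an extra assumption all the same), this becomes

$$H(p) \approx H(p_0) + \sum_i \left[ H(p_i) - H(p_0) \right],$$

which has the base-entropy-plus-one-correction-per-term structure of the term-by-term approximation.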

In the pre-meeting overview, there was the question of what would happen in example 4.9, showing the Bethe approximation as a special case of EP, if you took the sufficient statistics associated with one particular node $v$, $\tau_v$, and put those as a single element of the intractable component.  What I believe will happen is that you will lose the marginalization constraints for that particular node $v$ for all edges connected to $v$ ($\sum_{x_u} \tau_{uv} (x_u, x_v) = \tau_v(x_v)$).  In addition, the entropy approximation will change, since $H(\tau_{uv}) - H(\tau_u) - H(\tau_v)$ will no longer be equal to the negative of the mutual information once the marginalization constraint is dropped.
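A small numerical sketch of the entropy point: when $\tau_u$ and $\tau_v$ really are the marginals of $\tau_{uv}$, the combination $H(\tau_{uv}) - H(\tau_u) - H(\tau_v)$ is minus the mutual information of $\tau_{uv}$; once $\tau_v$ is free to disagree with $\tau_{uv}$, the identity breaks. The numbers below are made up, and "mutual information" here is taken to mean the quantity computed from $\tau_{uv}$'s own marginals, which is an assumption about how to read the term in the inconsistent case.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a list of probabilities."""
    return -sum(v * math.log(v) for v in p if v > 0)


def entropy_difference(tau_uv, tau_u, tau_v):
    """H(tau_uv) - H(tau_u) - H(tau_v), with tau_uv indexed as [x_u][x_v]."""
    flat = [v for row in tau_uv for v in row]
    return entropy(flat) - entropy(tau_u) - entropy(tau_v)


def mutual_information(tau_uv):
    """Mutual information of tau_uv, using the marginals implied by tau_uv itself."""
    n_u, n_v = len(tau_uv), len(tau_uv[0])
    m_u = [sum(tau_uv[a]) for a in range(n_u)]
    m_v = [sum(tau_uv[a][b] for a in range(n_u)) for b in range(n_v)]
    return sum(
        tau_uv[a][b] * math.log(tau_uv[a][b] / (m_u[a] * m_v[b]))
        for a in range(n_u)
        for b in range(n_v)
        if tau_uv[a][b] > 0
    )


# Hypothetical pairwise pseudo-marginal; its true marginals are both [0.5, 0.5].
tau_uv = [[0.4, 0.1], [0.1, 0.4]]
tau_u = [0.5, 0.5]
tau_v_consistent = [0.5, 0.5]
tau_v_inconsistent = [0.9, 0.1]  # violates sum_{x_u} tau_uv(x_u, x_v) = tau_v(x_v)

print("-I(tau_uv):          ", -mutual_information(tau_uv))
print("consistent tau_v:    ", entropy_difference(tau_uv, tau_u, tau_v_consistent))
print("inconsistent tau_v:  ", entropy_difference(tau_uv, tau_u, tau_v_inconsistent))
```

The first two printed values should agree up to rounding, while the third differs.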

Deriving equation 4.77: The key is how to take the derivative of $H(\tau)$.  Unlike in the proof of Theorem 4.2, we do not know the exact form of $\tau$ and so we can’t assume that $\tau$ are the marginals and use them in the standard entropy formula.  It turns out that the derivative of the entropy is the negative of the canonical parameters that correspond to $\tau$.  This is because (as you can read on page 68) the conjugate dual $A^*$ is the negative entropy, and $\nabla A^*$ gives the mapping from mean parameters to canonical parameters.  You can also prove this directly by letting $\theta'(\tau)$ be the canonical parameters associated with $\tau$, using the fact that $p(x) = \exp(\langle \theta'(\tau), \phi(x) \rangle - A(\theta'(\tau)))$, and taking the derivative of the negative entropy $\int p(x) \ln p(x) dx$.  Remember that $\theta'(\tau)$ depends on $\tau$, so you’ll need to take that into account when you take the derivative with respect to $\tau$.
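As a sanity check on the claim that the gradient of the entropy is minus the canonical parameter, here is a minimal numerical sketch for the simplest exponential family, a Bernoulli with sufficient statistic $\phi(x) = x$, mean parameter $\mu$, and canonical parameter $\theta = \ln\frac{\mu}{1-\mu}$. The function names are illustrative only, and the derivative is approximated by a central finite difference rather than computed in closed form.

```python
import math


def entropy(mu):
    """Entropy (in nats) of a Bernoulli distribution with mean parameter mu."""
    return -mu * math.log(mu) - (1 - mu) * math.log(1 - mu)


def canonical(mu):
    """Canonical (natural) parameter theta corresponding to mean parameter mu."""
    return math.log(mu / (1 - mu))


def entropy_derivative(mu, h=1e-6):
    """Central finite-difference approximation of dH/dmu."""
    return (entropy(mu + h) - entropy(mu - h)) / (2 * h)


for mu in (0.2, 0.5, 0.9):
    # The last two columns should agree up to finite-difference error.
    print(mu, entropy_derivative(mu), -canonical(mu))
```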

If anyone figures out how to derive equation 4.78, please put that in a comment.

One thing that wasn’t mentioned during the meeting, and which may either be important or obvious, is that $\lambda^i$ does not appear in the augmented distribution $q^i$.  Therefore, you can compute the expectation of $\phi(x)$ with respect to $q^i$ (which is $\eta_i$), and then adjust $\lambda^i$ so that the expected value of $\phi(x)$ is the same under the base distribution $q$ without changing $q^i$, which would not be true if you adjusted any of the other $\lambda$’s.

Another question for thought: at the bottom of p123, it says the entropy $H(\tau(T),\tau_{uv})$ associated with the augmented distribution does not have an explicit form, but can be computed easily.  How would you compute it, and why can’t you do the same thing for the original entire distribution?

In the middle of p124, there is a typo: the term to the right of the product $\prod_{(s,t) \neq (u,v)}$ should be $\lambda^{st}$, not $\lambda^{uv}$.