# UMass LIVING

## July 23, 2009

### July 22 10:30 am – 12:30 pm: Meeting Summary (1) Optimization of SMF Obectives (Bouchard-Cote, Jordan), (2) EP for Generative Aspect Model (Minka, Lafferty)

Filed under: Uncategorized — umassliving @ 10:05 am
Tags: ,

(1) Optimization of SMF Obectives (Bouchard-Cote, Jordan)

This paper discusses the Structured Mean Field (SMF) approach for inference in graphical models. We have covered SMF in Section 5.5 of Wainwright-Jordan. We also looked at Naive Mean Field (NMF) in Wainwright-Jordan (Section 5.3). Recall that in Mean Field methods, we break a graphical model into “tractable” subgraph(s) and then do approximate inference based on the subgraph(s) and its properties. In NMF, the subgraph is basically the graph with no edges. In SMF, some edges are considered while still keeping the subgraph tractable.  The main objective in this paper is to look at tractability of the SMF method (which depends on the choice of the subgraph). Interestingly, some graph properties of the chosen subgraph(s) determine the tractability of the SMF method.

Specifically, if a subgraph is v-acyclic, then the computation can be done in quick time. If the subgraph is b-acyclic, then the computation is much more complicated. A v-acyclic subgraph is one in which adding any single edge from the original graph results in an acyclic subgraph ( Hence the name very-acyclic). If adding any edge from the graph to the tractable subgraph results in a cyclic graph, then the subgraph is b-acyclic (barely-acyclic).

Note the following dimensionalities:

$|\mathcal{M}_{MF}| = |\mathcal{M}| = |\mu| = |\theta| = |E[\Phi(Y_\omega)]| = |F|$ = (Number of sufficient statistics in original distribution)

$|\mathcal{N}| = |\tau| = |\omega| = |F'|$ = (Number of sufficient statistics in tractable distribution)

The embedding function $\Gamma$ maps from the space of $\tau$ to the space of “missing” or “ignored” sufficient statistics.

Proof that $A^*(\mu) = A^*(\tau)$:

Eqn (6) in this paper is the same as Eqn (5.8) in Wainwright-Jordan. We are looking to find the sup (maximum) value of this expression, and that value will be our approximation of the log partition function $A(\theta)$. The maximum value is an approximation and not the exact log partition function because our search space is not all valid values of $\mu$ for the original distribution, but only the ones that can be realized by the tractable distribution.

Section 2.4 takes the partial derivative of (6) to find the max point. The embedding Jacobian ($J$) is defined, which is the first order partial derivative of  $\Gamma$ w.r.t. $\tau$.
The solution for max of Eqn (6) is then found to be
$\tau = \bigtriangledown A(\omega + J(\tau)\vartheta)$
The vector $\tau$ can be broken down into $(\tau^{(1)} \tau^{(2)} ... \tau^{(k)})$ where $(1)...(k)$ represent the $k$ disjoint connected components (subgraphs) into which the original graph has been broken down.

Section 3.3 shows that the embedding Jacobian is easy to compute when the subgraph in question is a v-acyclic subgraph. Note that this section assumes that the exponential family is expressed in an overcomplete representation. In this section, $f$ is the set of statistics in the tractable graph and $g$ is the set of statistics that are not in the tractable graph.  The v-acyclic subgraph property ensures that when we add any external edge (from the $g$ set) to the connected component ($f$) in question, the two nodes of the edge are independent (There is no path between the two nodes). This independence makes computing the $J$ matrix easy for v-acyclic graphs.  The complexity of updating $\tau$ in Prop 5 thus takes $O(|F'^{(c)}|)$ time.

For b-acyclic graphs, adding an edge to the $f$ connected component could result in a loop, the two nodes that define the edge are not independent, and we need to use a chain rule to figure out the value of $J$. Complexity of computation is
$O(|E'| \times |F'| \times |F\backslash F'|)$, which is large.
The alternative approach is to consider each case of $g$ as an exponential family. Calculating the $J$ matrix under this setting reduces the computation cost to $O(|E'| \times |F\backslash F'|)$, which is a considerable reduction from the naive approach.

The paper also makes a connection between the v-acyclic case and Gibbs sampling.

Results:

Choosing a b-acyclic subgraph for SMF in a given graphical model, we can achieve much better accuracy (Figure 3)

Convergence time : v-acyclic SMF takes an order magnitude more time compared to NMF. b-acyclic SMF takes two orders of magnitude more time than v-acyclic SMF (Figure 4) – as expected from the theory in Section 3.

(2) EP for Generative Aspect Model (Minka, Lafferty)

This paper talks about use of EP for a Generative Aspect model (which is the same as an LDA model). It is thus related to Section 6 of Wainwright-Jordan paper. This paper makes a comparison between learning using Variational Inference (VB) and Expectation Propagation (EP). Results show EP is more accurate than VB.

Given the Generative Aspect model (LDA), where $\lambda$ is the variable (Dirichlet distributed with parameter $\alpha$) that specifies the distribution of the “aspect” variable $a$. We cannot use EM directly by treating $\lambda$ as the hidden variable because the E step requires expectations over the posterior for $\lambda$.

Alternate approaches:

(1) Maximize the Variational Inference likelihood in Eqn (8)

(2) “Variational EM”:

E step – Use an approximate posterior for $\lambda$. Calculating this can be done using the EP algorithm of section 4.2 (Eqns (15) and (21))

M step – Maximize the lower bound on the log-likelihood (Eq (29)). (29) can be obtained by applying the inequality in Eqn (6) to the likelihood in (26) and then taking the log. The solution for the M step is in Eqn (31). The appendix has a more accurate solution based on the Taylor expansion (Eqn (34)).

It may be noted that approach (1) could be taken for EP as well (Find the parameters that results in the largest possible EP estimate of the likelihood ), but the authors make the point that EP makes an approximation which is close to the true likelihood in an average sense. Hence finding the max in EP’s “best approximation on average” will not yield the true maximum in the original likelihood. (Paragraph just before Section 5.2).  Hence (2) is preferred for EP.

Results:

(a) Synthetic data –

Using the VB approach (1), the estimated likelihood resembles the max approximation of the true likelihood. Using the EP approach (2), the estimated likelihood resembles the true likelihood (Figure 1)

VB tends to choose extreme parameters which can explain the data, while EP tends to reflect the true parameters that generated the data.

(b) Real data (TREC)

EP does better than VB. The difference between performance between EP and VB is not very high (As can be seem from the difference in the perplexity curves of EP and VB in Figure 3. Lower perplexity is better). Possible explanation is that in the data set that was used, there was little overlap between the aspects (The topics had little in common). Thus, the true distribution of the aspects are sharply peaked. This means that the difference between the max approximation and the exact likelihoods is not as large as in Figure 1. Thus, the EP and VB approximations are close to each other, with EP still doing slightly better.