**(1) Optimization of SMF Obectives (Bouchard-Cote, Jordan)**

This paper discusses the Structured Mean Field (SMF) approach for inference in graphical models. We have covered SMF in Section 5.5 of Wainwright-Jordan. We also looked at Naive Mean Field (NMF) in Wainwright-Jordan (Section 5.3). Recall that in Mean Field methods, we break a graphical model into “tractable” subgraph(s) and then do approximate inference based on the subgraph(s) and its properties. In NMF, the subgraph is basically the graph with no edges. In SMF, some edges are considered while still keeping the subgraph tractable. The main objective in this paper is to look at tractability of the SMF method (which depends on the choice of the subgraph). Interestingly, some graph properties of the chosen subgraph(s) determine the tractability of the SMF method.

Specifically, if a subgraph is *v*-acyclic, then the computation can be done in quick time. If the subgraph is *b*-acyclic, then the computation is much more complicated. A *v*-acyclic subgraph is one in which adding any single edge from the original graph results in an acyclic subgraph ( Hence the name* very*-acyclic). If adding any edge from the graph to the tractable subgraph results in a cyclic graph, then the subgraph is *b*-acyclic (*barely*-acyclic).

Note the following dimensionalities:

= (Number of sufficient statistics in original distribution)

= (Number of sufficient statistics in tractable distribution)

*To be added*

*v*-acyclic subgraph. Note that this section assumes that the exponential family is expressed in an overcomplete representation. In this section, is the set of statistics in the tractable graph and is the set of statistics that are not in the tractable graph. The

*v*-acyclic subgraph property ensures that when we add any external edge (from the set) to the connected component () in question, the two nodes of the edge are independent (There is no path between the two nodes). This independence makes computing the matrix easy for

*v*-acyclic graphs. The complexity of updating in Prop 5 thus takes time.

*b*-acyclic graphs, adding an edge to the connected component could result in a loop, the two nodes that define the edge are not independent, and we need to use a chain rule to figure out the value of . Complexity of computation is

*v*-acyclic case and Gibbs sampling.

**Results: **

Choosing a *b*-acyclic subgraph for SMF in a given graphical model, we can achieve much better accuracy (Figure 3)

Convergence time : *v*-acyclic SMF takes an order magnitude more time compared to NMF. *b*-acyclic SMF takes two orders of magnitude more time than *v*-acyclic SMF (Figure 4) – as expected from the theory in Section 3.

**(2) EP for Generative Aspect Model (Minka, Lafferty)**

This paper talks about use of EP for a Generative Aspect model (which is the same as an LDA model). It is thus related to Section 6 of Wainwright-Jordan paper. This paper makes a comparison between learning using Variational Inference (VB) and Expectation Propagation (EP). Results show EP is more accurate than VB.

Given the Generative Aspect model (LDA), where is the variable (Dirichlet distributed with parameter ) that specifies the distribution of the “aspect” variable $a$. We cannot use EM directly by treating as the hidden variable because the E step requires expectations over the posterior for .

Alternate approaches:

(1) Maximize the Variational Inference likelihood in Eqn (8)

(2) “Variational EM”:

E step – Use an approximate posterior for . Calculating this can be done using the EP algorithm of section 4.2 (Eqns (15) and (21))

M step – Maximize the lower bound on the log-likelihood (Eq (29)). (29) can be obtained by applying the inequality in Eqn (6) to the likelihood in (26) and then taking the log. The solution for the M step is in Eqn (31). The appendix has a more accurate solution based on the Taylor expansion (Eqn (34)).

It may be noted that approach (1) could be taken for EP as well (Find the parameters that results in the largest possible EP estimate of the likelihood ), but the authors make the point that EP makes an approximation which is close to the true likelihood in an average sense. Hence finding the max in EP’s “best approximation on average” will not yield the true maximum in the original likelihood. (Paragraph just before Section 5.2). Hence (2) is preferred for EP.

**Results:**

(a) Synthetic data –

Using the VB approach (1), the estimated likelihood resembles the max approximation of the true likelihood. Using the EP approach (2), the estimated likelihood resembles the true likelihood (Figure 1)

VB tends to choose extreme parameters which can explain the data, while EP tends to reflect the true parameters that generated the data.

(b) Real data (TREC)

EP does better than VB. The difference between performance between EP and VB is not very high (As can be seem from the difference in the perplexity curves of EP and VB in Figure 3. Lower perplexity is better). Possible explanation is that in the data set that was used, there was little overlap between the aspects (The topics had little in common). Thus, the true distribution of the aspects are sharply peaked. This means that the difference between the max approximation and the exact likelihoods is not as large as in Figure 1. Thus, the EP and VB approximations are close to each other, with EP still doing slightly better.