UMass LIVING

September 19, 2009

(Sep 18 12-2pm) Meeting Summary: Best-First AND/OR Search for MPE (Marinescu, Dechter)

Filed under: Uncategorized — umassliving @ 4:31 pm
Tags: ,

We began with a discussion of MPE versus MAP and max-product versus sum-product.  MPE is concerned with the single assignment to all variables that gives the highest probability.  MAP is a generalization of MPE where we want the maximizing assignment to a subset of the variables, marginalizing out the remaining variables.  Due to the marginalization, MAP is the more difficult problem.

Max-product can be used to compute the MPE.  However, inference can also be done using sum-product and then taking, for each variable, the value that maximizes its computed marginal distribution.  If our variables are letters in a particular word, then max-product gives us the word with the highest probability, while sum-product gives us each individual letter with the highest probability (marginalizing over the possible assignments to the other letters), and these need not agree.
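As a sanity check on this distinction, here is a tiny brute-force sketch (my own, not from the paper; the joint distribution P is made up) in which the MPE assignment and the per-variable marginal maximizers disagree:

import numpy as np

# Toy illustration (not from the paper): a joint distribution over two binary
# "letters" where the jointly most probable assignment (max-product / MPE view)
# differs from taking the argmax of each marginal separately (sum-product view).
P = np.array([[0.00, 0.35],
              [0.35, 0.30]])   # P[x, y]

# MPE: the single assignment (x, y) with the highest joint probability.
mpe = np.unravel_index(np.argmax(P), P.shape)
print("MPE assignment:", mpe, "joint prob:", P[mpe])          # (0, 1), 0.35

# Per-variable argmax of the marginals (what sum-product + argmax would give).
px, py = P.sum(axis=1), P.sum(axis=0)
marg = (int(np.argmax(px)), int(np.argmax(py)))
print("Marginal argmaxes:", marg, "joint prob:", P[marg])     # (1, 1), 0.30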

There was a question about the requirements for being a pseudo-tree and how to construct one.  The key is that edges in the original graph but not in the pseudo-tree must be back-arcs, that is, they must connect a node to one of its ancestors in the tree.  This is required so that it is possible to compute the relevant CPTs while traversing a single branch.

There was also the question of how AOBF differs from A*.  It appears that they are the same, the only difference being that AOBF provides the framework for taking advantage of conditional independencies in the original graph.

A subquestion came up about exactly how AOBF works.  We noted that h(\cdot) is an overestimate of the probability contributed by the remaining, as-yet-unassigned variables, and that the bottom-up propagation revises these initial overestimates into the values v(\cdot).  The marked arcs indicate, given the current estimates, which path is best and should be expanded next (there can be ties, for instance when choosing a child at an AND node, in which case a path can be chosen arbitrarily).  The key point is that the marked arcs may change if later expansions revise the overestimates.

More later.


September 14, 2009

(Sep 09 1-3pm) Meeting Summary: New Outer Bounds on the Marginal Polytope (Sontag, Jaakkola)

Filed under: Uncategorized — umassliving @ 11:53 am
Tags: ,

(place holder, to be filled in by Moe)

September 3, 2009

(Sep 03 3-5pm) Meeting Summary: Convex Relaxations for MAP Estimation (Kumar, Kolmogorov, Torr)

Filed under: Uncategorized — umassliving @ 6:03 pm
Tags: ,

We discussed the various types of relaxations of the original IP for MAP estimation, where the two non-convex constraints are the integer constraint {1, -1} on the variables and the constraint that X = x x^T.  In general, it was assumed that the tightest relaxation was the SDP relaxation (though also the most computationally demanding), followed by SOCP and QP, with LP being the least tight of the relaxations.

However, this paper showed that, using only certain SOC constraints associated with edges of the MRF, the SOCP is equivalent to the QP, and both are dominated by the LP.  We discussed the notion of dominance, and why it makes sense that, for the minimization problem, a relaxation dominates a second relaxation if the first has a greater minimum value than the second.  One key to understanding this is that every relaxation uses a feasible set that is a superset of the original problem's, so minimizing over the larger set can only decrease the value; hence the original problem has a minimum value at least as great as that of every relaxation, and thus dominates every relaxation, as would be expected.
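To spell the dominance argument out in symbols (my paraphrase, not a quote from the paper): let f be the objective, F_0 the original feasible set, and F_A, F_B \supseteq F_0 two relaxations with optimal values e_A = \min_{x \in F_A} f(x) and e_B = \min_{x \in F_B} f(x).  Enlarging the feasible set can only lower a minimum, so e_A \leq \min_{x \in F_0} f(x) and likewise for e_B; relaxation A dominates relaxation B exactly when e_A \geq e_B, i.e. when its optimal value sits closer to the true minimum.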

It took me a little while to prove the facts stated in the paper regarding z_{a;i}, on page 95 below equation (46) and at the beginning of page 96.  The set identity on page 95 can be proved by considering the different cases of positive and negative x, for instance x_{b;j} \geq 0 \geq x_{a;i} with x_{b;j} \geq | x_{a;i} |.  The first inequality on page 96 is easy to prove (in fact I think it’s an equality).  The second inequality can be proven using the fact that the geometric mean is always less than or equal to the arithmetic mean.
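For reference, the fact being used is the two-term AM-GM inequality: for a, b \geq 0, \sqrt{ab} \leq \frac{a+b}{2}, with equality iff a = b.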

The final section of the paper deals with ways of moving forward given the theoretical results.  One option is to simply add the linear marginalization constraints to the SOCP, rather than using only the SOC constraints on edges.  Cycle-inequality SOC constraints can be added to the SOCP (to get SOCP-C), but the linear cycle inequalities can also be added to the LP.  The paper ends by giving additional SOC constraints on cliques (SOCP-Q), which result in a feasible region that is not a superset of the corresponding LP feasible region.

The takeaway message is that using only the straightforward SOC constraints on edges is actually dominated by the LP.  In general, SOCP is seen as a balance between LP and SDP, but in order to do better than the LP, additional SOC constraints must be used.

The Ravikumar and Lafferty QP paper is a good related read on this general topic.  One thing that is not immediately clear is why, in the QP paper, the LP performs substantially worse, despite the results in the paper we read.  It could be that the LP or QP used in the Ravikumar paper does not match those in the paper we read.

August 29, 2009

(Aug 27 3-5pm) Meeting Summary: Wainwright / Jordan, ch 9, 10

Filed under: Uncategorized — umassliving @ 4:40 pm
Tags: ,

This section describes how to use moment matrices and conic programming, specifically semidefinite programming (SDP) and second-order cone programming (SOCP), to construct variational relaxations.

We spent a good amount of time going over the background information in the section.  First we discussed the definition of a moment matrix and the property that any valid moment matrix is positive semidefinite.  Next we looked at the definitions of two different bases, the multinomial basis and the indicator basis, and went through the lemma showing that it is always possible to convert between the two.  Lastly, we looked at the new definition of the marginal polytope for hypergraphs in terms of the multinomial basis.

Next we went over how the Lasserre sequence (of moment matrices) provides a nested hierarchy of outer bounds on the marginal polytope.  For any hypergraph on m nodes, S_t with t = m provides an exact characterization of the marginal polytope.

The last section discussed an alternate second-order cone relaxation technique, which we deferred discussing until next week’s meeting.

July 16 3-5 pm: Meeting Summary, Chapter 6 – Variational Methods in Parameter Estimation

Filed under: Uncategorized — umassliving @ 4:38 pm
Tags: ,

to be filled in

August 20, 2009

(Aug 20 3-5PM) Meeting Summary : Wainwright – Estimating the “Wrong” Graphical Model

Filed under: Uncategorized — umassliving @ 5:08 pm
Tags: ,

Things we covered today:

  1. In the regularized surrogate likelihood (Eq 16) \ell_B(\theta), we use B(\theta) from the Bethe surrogate formulation covered in the survey paper.  We then went through the steps of the joint estimation and prediction in section 4.2.  We noted that the noise model is localized for each element of y: p(y|z) = \prod_{s = 1}^N p(y_s | z_s).  Erik wondered whether this was done for simplicity or by necessity, and we thought it was for simplicity.
  2. The parameter estimate based on the surrogate likelihood \hat{\theta}^n is asymptotically normal.
  3. We had a long discussion about the relationship between globally stable, Lipschitz, and strongly convex.  Any variational method that has a strongly convex entropy approximation is globally stable. Also, any variational method based on a convex optimization is Lipschitz stable.  I’m not sure if there’s a difference between Lipschitz stable and globally stable…
  4. None of us knew how to derive eq 22, but we got some intuition by observing that when \alpha = 0, the observation is pure noise (zero SNR) and Y is useless; this is reflected in eq 22 by the removal of the term involving Y.  Similarly, if \alpha = 1, then Z_s = Y_s, which is intuitive.
  5. When the SNR \alpha = 0, in section 6.2 they show that \nabla B(\hat{\theta}) = \mu^* = \nabla A(\theta), which means \nabla B(\theta) will be different.
  6. In Figure 4, the curve marked with the red diamonds (approximate model) is upper bounded, as stated in Theorem 7.  Figure 4 also illustrates that performing approximate inference (TRW method) using the approximate model (parameters) can be superior to performing approximate inference using the true model (black circles).

August 13, 2009

(Aug 13 3-5pm) Meeting Summary: Wainwright / Jordan, ch 8, p214-233

Filed under: Uncategorized — umassliving @ 6:35 pm
Tags: ,

The main points that will be covered in these subsections are

(1) Extending the LP view of mode finding that we saw for max-product on tree graphs to reweighted-max-product.

(2) Examples of first order LP relaxations.

(3) Higher order LP approximations and how they can be more accurate.


(1) Extending the LP view of mode finding that we saw for max-product on tree graphs to reweighted-max-product.

The tree-reweighted Bethe variational problem uses the same polytope \mathbb{L} that defines the first order LP in the first half of this section. Thus, naturally, the tree-reweighted Bethe variational problem (and hence the reweighted max-product) has the same connection to first order LP as the regular Bethe problem (or regular max-product).

\lambda^* := \log M^*, where M^* is a fixed point of the reweighted max-product algorithm, can be a solution to the LP in the reweighted case.  This is true when equations 8.23 are satisfied by the pseudo-max-marginals.

(2) Examples of first order LP relaxations.

Section 8.4.4 discusses examples where an LP problem is formulated from a graph problem. Example 8.3 shows how the Error-control coding problem may be thought of as an LP problem.

Steps in the process:

(a) Convert the error-control coding problem’s graph to a pairwise Markov random field, creating auxiliary variables to handle any interactions that involve more than two variables.  Appendix E.3 should help in understanding this step.

(b) Identify the constraints that need to be enforced in the model (Equations 8.24, which may be rewritten as 8.25).

(c) Solve the LP problem (Feldman)

Example 8.4 shows the same for (a) the Ising model, (b) the (Maximum Weight) Independent Set problem, (c) the (Minimum Weight) Vertex Cover problem, and (d) the Max-Cut problem.  These are all examples of intractable problems, but a first order LP approximate solution is possible.  Example 8.5 is another classic graph problem, the Maximum Weight Matching problem, solved using an LP.  Equation 8.29 is the exact problem; Equation 8.30 is the LP version.
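As a concrete instance of the kind of LP these examples produce, one standard way to write the first order relaxation of the Maximum Weight Independent Set problem (generic notation, not copied verbatim from Example 8.4) is: maximize \sum_{s \in V} w_s \mu_s subject to \mu_s + \mu_t \leq 1 for every edge (s,t) \in E and 0 \leq \mu_s \leq 1 for every node s.  Forcing \mu_s \in \{0,1\} recovers the exact (intractable) integer program; relaxing to the interval [0,1] gives the tractable LP.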

Other important points in this subsection

(a) The first order LP relaxation of an integer program with binary variables has a strong consistency property.  This is not true for higher order (non-binary) variables.

(b) MWIS and Max-Cut are submodular maximization problems, and MWVC is a supermodular minimization problem.  Thus, none of the three is a regular binary QP, and they are intractable in general.  LP approximate solutions are hence needed for all three.

(3) Higher order LP approximations and how they can be more accurate.

Section 8.5 discusses higher order LP relaxations by using the marginal polytopes defined on hypergraphs. So far, we used the polytopes as defined by the pairwise interactions in the graph. By considering polytopes defined by the hypergraphs instead, we can enforce more constraints. This means that the relaxations in the LP will be of higher order.

Higher order relaxation LPs are at least as accurate as the lower order relaxation LPs for the same graphical model, because the higher order relaxations include all of the lower order constraints plus additional ones.

Equation 8.34 is significant: \mathbb{L}_1(G) \supseteq \mathbb{L}_2(G) \supseteq \mathbb{L}_3(G) \supseteq \ldots \supseteq \mathbb{L}_k(G) \supseteq \ldots \supseteq \mathbb{M}(G).

If t is the tree-width of the hypergraph for the graphical model, then \mathbb{L}_t(G) = \mathbb{M}(G), so the relaxed LP solution coincides with the exact solution on the original polytope.

Finally, example 8.6 shows how the second order relaxation adds constraints that make the solution “tighter” than the first order relaxation.  This is done by solving the same problem from Example 8.3 in two ways:

(a) binary MRF, which is a first order relaxation LP

(b) hypergraph MRF, which is a second order relaxation LP

Comparing the constraints, we see that \tau_{23} is unconstrained in (a), whereas 8.38, applied in case (b), enforces the constraint on \tau_{23}.

Considering the candidate solution \tilde{\tau} = (1, \frac{1}{2}, \frac{1}{2}, 0), we see it is feasible in (a) but not in (b).  Thus, adding higher order constraints yields a tighter relaxation and more accurate solutions.
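A related toy demonstration of first order looseness (a minimal sketch of my own, not Example 8.6 itself): for a 3-cycle with purely repulsive pairwise potentials, the local polytope LP relaxation of MAP has a fractional optimum of value 0, strictly above the exact MAP value of -1.  Cycle/triangle constraints of the kind produced by lifting (see the Lifting and Projecting notes below) would cut this fractional vertex off.

import itertools

import numpy as np
from scipy.optimize import linprog

# Toy sketch (not from the book): MAP on a 3-cycle with repulsive potentials,
# theta_st(x_s, x_t) = -1 if x_s == x_t else 0, and no unary terms.
edges = [(0, 1), (1, 2), (0, 2)]

def value(x):
    return sum(-1.0 if x[s] == x[t] else 0.0 for s, t in edges)

exact = max(value(x) for x in itertools.product([0, 1], repeat=3))   # -1.0

# Local polytope LP in the minimal binary parameterization:
# variables [mu_0, mu_1, mu_2, mu_01, mu_12, mu_02], mu_s = tau_s(1),
# mu_st = tau_st(1,1).  Each edge contributes -(1 - mu_s - mu_t + 2*mu_st),
# so maximizing the total equals minimizing c @ x below, up to the constant -3.
c = np.array([-2.0, -2.0, -2.0, 2.0, 2.0, 2.0])

A_ub, b_ub = [], []
for k, (s, t) in enumerate(edges):
    e = 3 + k
    for coefs, rhs in (({e: 1.0, s: -1.0}, 0.0),            # mu_st <= mu_s
                       ({e: 1.0, t: -1.0}, 0.0),            # mu_st <= mu_t
                       ({s: 1.0, t: 1.0, e: -1.0}, 1.0)):   # mu_s + mu_t - mu_st <= 1
        a = np.zeros(6)
        for idx, coef in coefs.items():
            a[idx] = coef
        A_ub.append(a)
        b_ub.append(rhs)

res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
              bounds=[(0.0, 1.0)] * 6, method="highs")
lp_value = -(3.0 + res.fun)            # undo the sign flip, re-add the constant

print("exact MAP value:     ", exact)      # -1.0
print("LP relaxation value: ", lp_value)   #  0.0 (strictly looser)
print("node pseudomarginals:", res.x[:3])  # [0.5, 0.5, 0.5], a fractional vertex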

Lifting and Projecting

The introduction of additional variables in order to achieve higher order relaxations is called lifting; the lifted polytope can then be projected back onto the lower dimensional space.  Example 8.7 illustrates this.  Lifting by introducing \tau_{stu} results in the inequalities in 8.41.  These can be projected back onto the original pairwise mean parameters using Fourier-Motzkin elimination; some of the resulting inequalities correspond to the cycle (triangle) inequalities alluded to earlier in Example 4.1.

August 6, 2009

(Aug 06 3-5pm) Meeting Summary: Wainwright / Jordan, ch 8, p195-213

Filed under: Uncategorized — umassliving @ 5:47 pm
Tags: ,

Discussed the variational formulation of mode computation in Theorem 8.1 and the zero-temperature limit of the Bethe variational principle.  The resulting IP can be transformed into an LP over \mathbb{L}(G) (more precisely, its dual).  Gary provided a description of the dual of an LP.
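As a reminder, the variational formulation of mode computation says (roughly, in the book’s notation) that \max_{x} \langle \theta, \phi(x) \rangle = \max_{\mu \in \mathbb{M}(G)} \langle \theta, \mu \rangle, i.e. mode finding is a linear program over the marginal polytope, whose vertices are the mean parameter vectors of point-mass distributions; the relaxation then replaces \mathbb{M}(G) with the outer bound \mathbb{L}(G).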

Covered the max-product algorithm for Gaussians and the process of converting the standard Gaussian in exponential form to a pairwise MRF.  Gary pointed out the typo in 8.15 (see the pre-meeting overview comments).

Covered the LP relaxation and how it approximates the exact LP.  Discussed the relationship between the extreme points of \mathbb{M}(G) and \mathbb{L}(G) and the definition of weakly and strongly persistent solutions (Gary won the debate, after Manju sided with him).

Covered the main point of section 8.4.2, that max-product does not solve the relaxed LP in the case of graphs with cycles (contrary to the analogous result for sum-product).

July 31, 2009

(Jul 30) Meeting Summary: Wainwright / Jordan, ch 7

Filed under: Uncategorized — umassliving @ 10:43 am
Tags: ,

Discussion from the meeting:

  • Why is the Bethe approximation to the entropy H_{Bethe} not generally concave?
    • Since we know that A^* is convex and is the negative entropy, we know that the entropy is concave.  However, H_{Bethe} does not generally correspond to a true entropy.  We compute the entropy as if the edges formed a tree, subtracting off the mutual information on each edge, but this is not a true entropy unless the edges actually do form a tree, and hence it is not generally concave.
  • On page 167, in the discussion of the projection operation \mapsto, how do we know that \mu(F) \in \mathcal{M}(F)?
    • To see this, we need to go back to the original definition of \mathcal{M}.  Assume \mu \in \mathcal{M}.  By the definition of \mathcal{M}, there exists some distribution p such that \mu_\alpha = \mathbb{E}_p[\phi_\alpha(x)] for all sufficient statistics indexed by \alpha, where the expectation is with respect to p.  Clearly then, we can use the exact same p to show that \mu(F) \in \mathcal{M}(F) since the requirements are the same, but only for a subset of the sufficient statistics \mathcal{I}(F).
  • Understand 7.2.1.  In particular, consider the distribution over spanning trees \rho that places all its mass on one specific spanning tree.  This seems very similar to structured mean field, where we just use the one spanning tree as the tractable subfamily we optimize over.  Yet structured mean field gives us a lower bound on A(\theta) whereas tree-reweighted Bethe gives us an upper bound.  What accounts for this difference?  For instance, why does structured mean field give a lower bound when we know that A^*(\mu(F)) \leq A^*(\mu), and we are subtracting off A^* in the objective function?
    • In mean field, we have that A^*(\tau) = A^*(\mu(F)) = A^*(\mu), as proven in last week’s meeting summary, and understanding this is the key to understanding the issue.  In mean field, we assumed that the canonical parameters outside those defined for the tractable subgraph were 0.  Therefore, A(\theta) is the same whether or not you include the inner product with the extra parameters, which are all zero.  The convex combination of spanning trees is different because of the projection operator \mapsto above.  We are still searching across the full space of possible mean parameters, but when we compute the entropy, we use the projection operator to essentially ignore all the mean parameters that are not in our subgraph.  This differs from mean field, where the mean parameters that are not in our subgraph are deterministic functions of the mean parameters that are in the subgraph, corresponding to the functions g in Wainwright and \Gamma in the structured mean field optimization paper.  This difference means that you are searching over an outer bound as opposed to an inner bound, and are using an upper bound to the objective function; thus you are guaranteed to get an upper bound on A(\theta).  (A one-line statement of the underlying upper bound appears after this list.)
  • There should be a \rho(T') in front of the second inner product in the second equation on page 176.
    • This is for the T' they mention that has \Pi^{T'}(\lambda) different from zero.  Additionally, if you are wondering why A^* is strictly convex, it is a consequence of the fact that A is strictly convex.  The details are in Appendix B.3.
  • Understand the general ideas in 7.2.2 and 7.2.3; the exact details are not as important.  For instance, understand why H_{ep} is generally not concave, but why re-weighting subject to \sum_{\ell=1}^{d_I} \rho(\ell) = 1 gives a concave function.
    • Since entropy is concave, a positive combination of entropy functions is also concave.  In H_{ep}, however, we have a negatively weighted term (1-d_I) H(\tau), so it is not concave in general.  Re-weighting subject to \sum_{\ell=1}^{d_I} \rho(\ell) = 1 (or actually \sum_{\ell=1}^{d_I} \rho(\ell) \leq 1) fixes this problem, making the approximation concave.
  • Understand the benefit of the convexified algorithms for algorithmic stability and what it means for an algorithm to be globally Lipschitz stable, why that is important, and how Figure 7.3 suggests that ordinary sum-product is not globally Lipschitz stable.
    • One other important take-away from Figure 7.3 is why the approximations are better the lower the coupling strength (the size of the parameters on the edges).  This is because the lower the coupling strength, the more independent the individual nodes are; in the extreme case of zero coupling the model becomes a simple product distribution.  The higher the coupling strength, the more important the loops in the graph become, and the more difficult inference becomes.
  • Understand how the convexified algorithms can be used within learning in 7.5.  We use the fact that when B upper bounds the true cumulant function, then the surrogate likelihood lower bounds the true likelihood.  Yet in chapter 6, we saw how using mean field within EM also gives us a lower bound on the true likelihood, despite the fact that mean field gives a lower bound on the cumulant function.  What accounts for this discrepancy?
    • This is a check on your understanding of mean field within EM.  If you look at what mean field is approximating in EM, you should see why a lower bound on the cumulant gives a lower bound, whereas in 7.5 we need an upper bound on the cumulant to give a lower bound.
  • Understand the intuitive interpretation of example 7.3.
    • Another interesting point is the idea, in the final paragraph of the chapter, that it may be better to have an inconsistent estimator of the true parameter $\theta^*$, if your interest is in prediction (inference) and you must use an approximate inference algorithm.
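As referenced in the 7.2.1 discussion above, the underlying upper bound is just convexity (a standard argument, stated here in the book’s notation rather than quoted from it): since A is convex and \theta = \sum_T \rho(T) \theta(T) for a distribution \rho over spanning trees, Jensen’s inequality gives A(\theta) = A(\sum_T \rho(T) \theta(T)) \leq \sum_T \rho(T) A(\theta(T)).  The tree-reweighted Bethe problem can be viewed as optimizing (a variational form of) this right-hand side, which is why it yields an upper bound rather than the mean field lower bound.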

Slightly more esoteric details:

  • On page 184, why do \mu_i show up along the diagonal of N_1[\mu]?  Hint: each random variable X_i is binary valued.
    • The diagonal elements are \mathbb{E}[X_i^2].  Since X_i is binary, we have \mathbb{E}[X_i^2] = \sum_{x_i \in \{0,1\}} x_i^2 \, p(x_i) = p(X_i = 1) = \mu_i.  (A small numerical check of this, and of the semidefinite constraint itself, appears after this list.)
  • Still unsure of whether you can go the other direction and show that \text{cov}(X) \succeq 0 implies N_1[\mu] \succeq 0.  The wording in Wainwright seems to imply that you can, but it’s not clear to me how to do that using the Schur complement formula as they claim.
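For concreteness, here is a small numerical check (my own sketch, not from the book) that builds N_1[\mu] for two binary variables, confirms it is PSD for mean parameters coming from a genuine distribution, and shows it is not PSD for a made-up, unrealizable mean vector:

import numpy as np

# Toy check (not from the book): N_1[mu] = E[(1, X1, X2)(1, X1, X2)^T] for two
# binary variables.  Because X_i is binary, E[X_i^2] = P(X_i = 1) = mu_i sits
# on the diagonal.
def moment_matrix(p):
    # p is a 2x2 array with p[x1, x2] = P(X1 = x1, X2 = x2)
    mu1 = p[1, :].sum()            # E[X1]
    mu2 = p[:, 1].sum()            # E[X2]
    mu12 = p[1, 1]                 # E[X1 X2]
    return np.array([[1.0,  mu1,  mu2 ],
                     [mu1,  mu1,  mu12],
                     [mu2,  mu12, mu2 ]])

# Mean parameters from a genuine distribution: N_1[mu] is PSD.
p = np.array([[0.1, 0.2],
              [0.3, 0.4]])
print(np.linalg.eigvalsh(moment_matrix(p)))      # all eigenvalues >= 0

# A made-up mean vector (mu1, mu2, E[X1 X2]) = (0.9, 0.9, 0.5) that no
# distribution can realize; the semidefinite constraint catches it.
N_bad = np.array([[1.0, 0.9, 0.9],
                  [0.9, 0.9, 0.5],
                  [0.9, 0.5, 0.9]])
print(np.linalg.eigvalsh(N_bad))                 # smallest eigenvalue < 0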

July 23, 2009

July 22 10:30 am – 12:30 pm: Meeting Summary (1) Optimization of SMF Objectives (Bouchard-Cote, Jordan), (2) EP for Generative Aspect Model (Minka, Lafferty)

Filed under: Uncategorized — umassliving @ 10:05 am
Tags: ,

(1) Optimization of SMF Objectives (Bouchard-Cote, Jordan)

This paper discusses the Structured Mean Field (SMF) approach for inference in graphical models.  We have covered SMF in Section 5.5 of Wainwright-Jordan, and we also looked at Naive Mean Field (NMF) in Section 5.3.  Recall that in mean field methods, we break a graphical model into “tractable” subgraph(s) and then do approximate inference based on the subgraph(s) and their properties.  In NMF, the subgraph is simply the graph with no edges; in SMF, some edges are kept while still keeping the subgraph tractable.  The main objective in this paper is to study the tractability of the SMF method, which depends on the choice of the subgraph.  Interestingly, certain graph properties of the chosen subgraph(s) determine this tractability.

Specifically, if a subgraph is v-acyclic, then the computation can be done quickly; if the subgraph is b-acyclic, the computation is much more complicated.  A v-acyclic subgraph is one in which adding any single edge from the original graph still leaves the subgraph acyclic (hence the name, very-acyclic).  Otherwise, that is, if adding some omitted edge would create a cycle, the subgraph is b-acyclic (barely-acyclic).
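As a toy check of this distinction (my own sketch of the definition as paraphrased above, using networkx; not code from the paper):

import networkx as nx

# Classify a tractable (acyclic) subgraph as v-acyclic or b-acyclic:
# v-acyclic if adding any single omitted edge keeps it acyclic, b-acyclic otherwise.
# This follows my paraphrase of the definition above, not the paper's exact wording.
def classify_subgraph(graph_edges, sub_edges, nodes):
    sub = nx.Graph()
    sub.add_nodes_from(nodes)
    sub.add_edges_from(sub_edges)
    assert nx.is_forest(sub), "the tractable subgraph should itself be acyclic"
    omitted = [e for e in graph_edges
               if e not in sub_edges and (e[1], e[0]) not in sub_edges]
    # Adding edge (u, v) creates a cycle iff u and v are already connected in sub.
    creates_cycle = [e for e in omitted if nx.has_path(sub, e[0], e[1])]
    return "b-acyclic" if creates_cycle else "v-acyclic"

nodes = [0, 1, 2, 3]
cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]                          # the original graph
print(classify_subgraph(cycle4, [(0, 1), (2, 3)], nodes))          # v-acyclic
print(classify_subgraph(cycle4, [(0, 1), (1, 2), (2, 3)], nodes))  # b-acyclic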

Note the following dimensionalities:

|\mathcal{M}_{MF}| = |\mathcal{M}| = |\mu| = |\theta| = |E[\Phi(Y_\omega)]| = |F| = (Number of sufficient statistics in original distribution)

|\mathcal{N}| = |\tau| = |\omega| = |F'| = (Number of sufficient statistics in tractable distribution)

The embedding function \Gamma maps from the space of \tau to the space of “missing” or “ignored” sufficient statistics.

Proof that A^*(\mu) = A^*(\tau):
To be added

Eqn (6) in this paper is the same as Eqn (5.8) in Wainwright-Jordan.  We are looking for the sup (maximum) value of this expression, and that value is our approximation of the log partition function A(\theta).  The maximum value is an approximation, and not the exact log partition function, because our search space is not all valid values of \mu for the original distribution, but only those that can be realized by the tractable distribution.
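Spelled out (in Wainwright-Jordan notation, as I understand it): A(\theta) = \sup_{\mu \in \mathcal{M}} \{ \langle \theta, \mu \rangle - A^*(\mu) \}, and restricting the sup to the mean parameters realizable by the tractable subfamily can only decrease it, which is why the mean field value is a lower bound on A(\theta).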

Section 2.4 takes the partial derivative of (6) to find the maximum.  The embedding Jacobian J is defined as the first order partial derivative of \Gamma with respect to \tau.
The stationarity condition for the maximum of Eqn (6) is then
\tau = \nabla A(\omega + J(\tau)\vartheta)
The vector \tau can be broken down into (\tau^{(1)}, \tau^{(2)}, \ldots, \tau^{(k)}), where the superscripts index the k disjoint connected components (subgraphs) into which the original graph has been broken down.

Section 3.3 shows that the embedding Jacobian is easy to compute when the subgraph in question is v-acyclic.  Note that this section assumes the exponential family is expressed in an overcomplete representation.  Here f is the set of statistics in the tractable subgraph and g is the set of statistics that are not.  The v-acyclic property ensures that when we add any external edge (from the g set) to the connected component (f) in question, the two endpoints of the edge are independent under the tractable distribution (there is no path between them in the subgraph).  This independence makes computing the J matrix easy for v-acyclic subgraphs.  Updating \tau in Prop 5 thus takes O(|F'^{(c)}|) time.

For b-acyclic subgraphs, adding an edge to the connected component f can create a loop; the two nodes that define the edge are then not independent, and we need to use the chain rule to compute J.  The complexity of this computation is O(|E'| \times |F'| \times |F\backslash F'|), which is large.  The alternative approach is to consider each element of g as an exponential family in its own right; calculating the J matrix under this setting reduces the cost to O(|E'| \times |F\backslash F'|), a considerable reduction from the naive approach.

The paper also makes a connection between the v-acyclic case and Gibbs sampling.

Results:

Choosing a b-acyclic subgraph for SMF in a given graphical model can achieve much better accuracy (Figure 3).

Convergence time: v-acyclic SMF takes an order of magnitude more time than NMF, and b-acyclic SMF takes two orders of magnitude more time than v-acyclic SMF (Figure 4), as expected from the theory in Section 3.

(2) EP for Generative Aspect Model (Minka, Lafferty)

This paper talks about the use of EP for the generative aspect model (which is the same as the LDA model), and is thus related to Section 6 of the Wainwright-Jordan paper.  It makes a comparison between learning using variational inference (VB) and expectation propagation (EP); the results show EP is more accurate than VB.

In the generative aspect model (LDA), \lambda is the variable (Dirichlet distributed with parameter \alpha) that specifies the distribution of the “aspect” variable a.  We cannot use EM directly by treating \lambda as the hidden variable, because the E step requires expectations over the posterior for \lambda, which is intractable.

Alternate approaches:

(1) Maximize the Variational Inference likelihood in Eqn (8)

(2) “Variational EM”:

E step – Use an approximate posterior for \lambda.  This can be calculated using the EP algorithm of section 4.2 (Eqns (15) and (21)).

M step – Maximize the lower bound on the log-likelihood (Eq (29)).  Equation (29) can be obtained by applying the inequality in Eqn (6) to the likelihood in (26) and then taking the log.  The solution for the M step is in Eqn (31); the appendix has a more accurate solution based on a Taylor expansion (Eqn (34)).

It may be noted that approach (1) could be taken for EP as well (find the parameters that result in the largest possible EP estimate of the likelihood), but the authors make the point that EP makes an approximation which is close to the true likelihood in an average sense.  Hence finding the maximum of EP’s “best approximation on average” will not yield the true maximum of the original likelihood (see the paragraph just before Section 5.2).  Hence (2) is preferred for EP.

Results:

(a) Synthetic data –

Using the VB approach (1), the estimated likelihood resembles the max approximation of the true likelihood; using the EP approach (2), the estimated likelihood resembles the true likelihood (Figure 1).

VB tends to choose extreme parameters which can explain the data, while EP tends to reflect the true parameters that generated the data.

(b) Real data (TREC)

EP does better than VB, though the performance difference between the two is not very large (as can be seen from the difference in the perplexity curves of EP and VB in Figure 3; lower perplexity is better).  A possible explanation is that in the data set used, there was little overlap between the aspects (the topics had little in common), so the true distribution over aspects is sharply peaked.  This means the difference between the max approximation and the exact likelihood is not as large as in Figure 1, and thus the EP and VB approximations are close to each other, with EP still doing slightly better.

