# UMass LIVING

### Wainwright & Jordan Summary

This page contains a summary of “Graphical Models, Exponential Families, and Variational Inference” by Wainwright & Jordan (2008). The following set of slides summarizes the survey.

### Chapter 1: Introduction

Graphical models = graph theory + probability theory

There are generally two types of computational tasks associated with graphical models:

1. Inference: querying the distribution once all of its parameters have been set. Example queries include marginal distributions (marginals), maximum probabilities (modes), and the values of the random variables (i.e. the configuration) that achieve the maximum probability.
2. Learning: given a collection of samples from the distribution, estimating its parameters.
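As a toy illustration of the two query types (marginals and modes), consider a joint distribution over two binary variables small enough to store as an explicit table. The numbers below are made up for illustration; real graphical-model inference exploits factorization rather than enumerating the full joint:

```python
# Unnormalized joint p(x1, x2) over binary variables, as an explicit table.
f = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
Z = sum(f.values())                      # partition function
p = {x: v / Z for x, v in f.items()}     # normalized joint

# Query 1: marginal distribution of x1 (sum out x2)
marg_x1 = {x1: sum(p[(x1, x2)] for x2 in (0, 1)) for x1 in (0, 1)}

# Query 2: mode -- the configuration with maximum probability
mode = max(p, key=p.get)
```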

As will be discussed in the sequel, exact computation of the above quantities is typically not possible for most graph structures, and approximate algorithms must be used. There are two main classes of approximate algorithms: the variational methods discussed in this survey, and the complementary class of Markov chain Monte Carlo (MCMC) sampling algorithms. MCMC algorithms can in theory be made exact with sufficient computation, so their approximation error stems from the limited computation available in practice. “Variational methods” is an umbrella term for optimization-based formulations of the above problems; approximations arise when we relax or approximate the objective function and/or the set over which the optimization takes place.

This survey covers variational methods for exponential families (a large class of distributions heavily studied in the statistics literature). The notions of convexity and conjugate duality for exponential families play a prominent role in the design of variational methods and, consequently, in this survey.

### Chapter 2: Background

The following Appendices are helpful for this chapter:

• A.1 – Background on Graphs and Hypergraphs

Graphical Models

Graphical models are diagrammatic representations of probability distributions. They provide a simple way to visualize the structure of a probability model and offer insights into its properties, such as conditional independence. Additionally, inference and learning tasks can be expressed as graph operations, in which the underlying mathematical expressions are carried out implicitly. There are two main types of graphical models: Bayesian networks, which are composed of directed edges, and Markov random fields, which are composed of undirected edges. Graphs that contain both directed and undirected edges are called chain graphs, but these are not discussed in this survey.

The following table summarizes the properties of the two types of graphical models:

| | Bayesian networks | Markov random fields |
| --- | --- | --- |
| Uses | More suited for representing conditional independencies (i.e. causal relationships) between random variables. | More suited for expressing soft constraints between random variables. |
| Factorization | $p(x) = \prod_{s \in V} p_s(x_s \mid x_{\pi(s)})$, where $x_{\pi(s)} \in x$ are the parent nodes of $x_s$. | $p(x) = \frac{1}{Z}\prod_{C \in \mathcal{C}} \psi_C (x_C)$, with $Z = \sum_x \prod_{C \in \mathcal{C}} \psi_C (x_C)$, where $\mathcal{C}$ is a collection of cliques (not necessarily maximal). The compatibility functions $\psi_C (x_C)$ need not have any obvious or direct relation to marginal or conditional distributions; as long as $\psi_C(x_C) \ge 0$, $p(x)$ is a valid probability distribution. Thus it is common to set $\psi_C(x_C) = \exp\{-E(x_C)\}$, where $E(\cdot)$ is an unconstrained “energy” function. |
| Factor graph representation | Factors are the conditional probabilities over the child-parent sets. Note that the same pdf can be decomposed in multiple ways; for example, $p(a,b,c) = p(a)p(b \mid a)p(c \mid a,b) = p(b)p(c \mid b)p(a \mid b,c)$. However, the graphical representation is explicit about the decomposition, so there is a unique factor graph representation for each graphical representation. | Factors are the compatibility functions. Factor graphs are especially helpful for MRFs, since for each undirected graphical representation there are many possible factorizations (depending on the choice of compatibility functions), and factor graphs resolve any ambiguities. |
| Testing $x_A \bot x_B \mid x_C$ | Use d-separation. | Much simpler: remove the $x_C$ nodes from the graph and check for reachability from any node in $x_A$ to any node in $x_B$. |
| Markov blanket | Parents, children, and co-parents (other parents of its children). | Neighbors. |
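The MRF factorization can be made concrete with a small sketch. Below is a hypothetical 3-node binary chain MRF with pairwise compatibility functions of the form $\psi = \exp\{-E\}$ (an Ising-style energy chosen purely for illustration), where the partition function $Z$ is computed by brute-force enumeration:

```python
import math
from itertools import product

# Hypothetical 3-node binary chain MRF x1 - x2 - x3 with pairwise cliques
# C = {(1,2), (2,3)} and compatibility psi(xs, xt) = exp(-E(xs, xt)),
# where the (illustrative) energy E penalizes disagreement.
def psi(xs, xt, coupling=1.0):
    return math.exp(-coupling * (0.0 if xs == xt else 1.0))

def unnormalized(x):
    x1, x2, x3 = x
    return psi(x1, x2) * psi(x2, x3)

# Partition function Z = sum_x prod_C psi_C(x_C), by exhaustive enumeration
Z = sum(unnormalized(x) for x in product((0, 1), repeat=3))

def p(x):
    return unnormalized(x) / Z
```

Brute-force enumeration is only feasible for tiny models; the exponential blow-up in the number of configurations is exactly what motivates the inference algorithms discussed later.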

First, some comments explaining the table:

1. Factorization is a key idea behind graphical models, that is, each graphical model consists of a collection of probability distributions that factorize according to the graph structure.
2. Factor graphs provide an alternative visualization that emphasizes the factorization.
3. We say that $x_A \bot x_B | x_C$, that is $x_A$ is independent of $x_B$, given $x_C$ if $p(x_A,x_B | x_C) = p(x_A | x_C)p(x_B | x_C)$.
4. For both types of graphical models, one can establish an equivalence between the set of probability distributions that factorize according to the graph structure and the set of conditional independence statements. For Bayesian networks this equivalence is given by the d-separation theorem; for Markov random fields it is the Hammersley-Clifford theorem.
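The reachability test for $x_A \bot x_B \mid x_C$ in an MRF described above can be sketched as a breadth-first search on the graph with the conditioning nodes removed. The 3-node chain below is a made-up example:

```python
from collections import deque

def separated(adj, A, B, C):
    """Return True if every path from A to B passes through C (graph separation)."""
    blocked = set(C)
    frontier = deque(a for a in A if a not in blocked)
    seen = set(frontier)
    while frontier:
        u = frontier.popleft()
        if u in B:
            return False          # reached B without passing through C
        for v in adj.get(u, ()):
            if v not in blocked and v not in seen:
                seen.add(v)
                frontier.append(v)
    return True

# Chain 1 - 2 - 3: node 2 separates 1 from 3, the empty set does not.
chain = {1: [2], 2: [1, 3], 3: [2]}
```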

It is important to note that Bayesian networks and Markov random fields have the ability to represent different types of probability distributions. While some probability distributions can be represented as both, some can only be represented as a Bayesian network or Markov random field. Additionally, some probability distributions cannot be represented as a graphical model (examples include models whose factorization properties change depending on the values of the random variables).

Examples of Graphical Models

Section 2.4 covers several popular directed and undirected graphical models that re-appear in later chapters.

Exact Inference

Before starting, it is important to note that both directed and undirected graphical models involve factorized expressions for joint probabilities and thus inference algorithms treat them in an essentially identical manner. In order to allow a unified treatment of inference algorithms, it is convenient to work exclusively with undirected models and thus we transform a directed model to an undirected one via moralization.

For inference on trees, the sum-product algorithm for computing marginals (and its slight variant, the max-product algorithm for computing modes) is exact and subsumes a large number of inference algorithms derived for special cases:

• the forward/backward algorithm for computing the probability of a specific observation sequence in an HMM is an application of sum-product;
• the Viterbi algorithm for computing the most probable state sequence in an HMM is an application of max-product (or max-sum when working with log probabilities for numerical stability);
• the Kalman filter is an application of sum-product for Gaussian models;
• Pearl’s belief propagation algorithm for Bayesian networks is a special case of sum-product.

Note that sum-product is applicable to trees and poly-trees; that is, the factor graph must be a tree. Additionally, it is mostly used for discrete and Gaussian models, where the messages have known forms: a vector of probabilities for the former, and a mean and covariance for the latter. For general continuous non-Gaussian nodes, certain parameterizations of the messages need to be used; the nonparametric belief propagation algorithm (not covered in this survey) was developed to handle this case.
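A minimal sum-product sketch on a 3-node chain of binary variables (with arbitrary illustrative potentials, not taken from the survey) shows the message-passing computation of an exact marginal, checked against brute-force enumeration:

```python
import numpy as np

# 3-node chain x1 - x2 - x3 of binary variables with pairwise potentials.
psi12 = np.array([[2.0, 1.0], [1.0, 2.0]])   # psi(x1, x2)
psi23 = np.array([[1.0, 3.0], [3.0, 1.0]])   # psi(x2, x3)

# Message from x1 to x2: m12(x2) = sum_{x1} psi(x1, x2)
m12 = psi12.sum(axis=0)
# Message from x3 to x2: m32(x2) = sum_{x3} psi(x2, x3)
m32 = psi23.sum(axis=1)

# Marginal of x2 is the normalized product of incoming messages
b2 = m12 * m32
b2 /= b2.sum()

# Brute-force check against the explicit joint p(x1, x2, x3)
joint = np.einsum('ij,jk->ijk', psi12, psi23)
marg2 = joint.sum(axis=(0, 2))
marg2 /= marg2.sum()
```

On a tree, the message-passing marginal agrees exactly with the brute-force one; on a graph with cycles this guarantee is lost, which is the subject of Chapter 4.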

The junction tree algorithm is an exact inference algorithm for general graphs (with cycles), but its complexity is exponential in the size of the maximal clique rendering it impractical for a large number of graphical models. This motivates the need for approximate inference algorithms such as variational methods.

Conclusions

The main goal of this survey is to develop a general theoretical framework for understanding and analyzing variational methods for computing approximations to marginals, modes, and likelihoods. Chapter 3 provides the mathematical background for doing so, and we begin describing how algorithms fit into this framework in Chapter 4, which covers loopy BP, generalized BP, and expectation propagation.

### Chapter 3: Graphical Models as Exponential Families

The following Appendices are helpful for this chapter:

• A.2 – Basics of Convex Sets and Functions
• B.1 – Proof of Theorem 3.3
• B.2. – Proof of Theorem 3.4
• B.3 – General Properties of $\mathcal{M}$ and $A^*$
• C.1 – Variational Principles of Multivariate Gaussians with Known Covariance
• C.2 – Variational Principles of Multivariate Gaussians with Unknown Covariance
• E.3 – Conversion to a Pairwise Markov Random Field

Exponential Families

Exponential families play a critical role in this survey. More specifically, the convexity and conjugate duality of exponential families are central to the discussion.

An exponential family of probability distributions is characterized by a set of sufficient statistics, $\{\phi_{\alpha}(x) \,\,|\,\, \alpha \in \mathcal{I}\}$. Thus given a vector of sufficient statistics $\phi(x)$, the corresponding exponential family, parametrized by the canonical parameters $\theta$ is formally defined as:

$p(x) = \exp\{\langle \theta, \phi(x)\rangle - A(\theta)\},\,\,\,\,A(\theta) = \log \int_{\mathcal{X}^m} \exp\{\langle \theta, \phi(x)\rangle\} \,\nu(dx)$

where $A(\theta)$ is the log partition function that ensures the density integrates to $1$, and $m$ is the dimensionality of the random variable $X$. Note that a sufficient statistic is simply a function $\phi_\alpha: \mathcal{X}^m \rightarrow \Re$. It turns out that many common distributions, such as the normal, multinomial, Bernoulli, Poisson, and beta, form exponential families. For example, the normal distribution forms an exponential family with sufficient statistics $\phi_1(x) = x$ and $\phi_2(x) = x^2$. Table 3.1 on page 42 provides other examples.
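The normal case can be checked numerically: with sufficient statistics $\phi(x) = (x, x^2)$, the canonical parameters and log partition function below are the standard ones for $N(\mu, \sigma^2)$, and the exponential-family form reproduces the usual density (the parameter values are illustrative):

```python
import math

# N(mu, sigma^2) in exponential-family form with phi(x) = (x, x^2):
# theta = (mu / sigma^2, -1 / (2 sigma^2)),
# A(theta) = -theta1^2 / (4 theta2) + (1/2) log(pi / -theta2).
mu, sigma2 = 1.5, 0.7
theta1 = mu / sigma2
theta2 = -1.0 / (2.0 * sigma2)
A = -theta1**2 / (4.0 * theta2) + 0.5 * math.log(math.pi / (-theta2))

def p_expfam(x):
    # exponential-family form exp{<theta, phi(x)> - A(theta)}
    return math.exp(theta1 * x + theta2 * x**2 - A)

def p_gauss(x):
    # standard Gaussian density, for comparison
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
```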

One way to motivate these families of distributions is via the principle of maximum entropy. Consider the problem of estimating a probability density function given $N$ IID samples from it. Generically, there are many distributions that are consistent with the data, and the principle of maximum entropy gives a way of choosing among them: the unique distribution $\tilde{p}(x)$ that is consistent with the data and has maximal entropy is a member of the exponential family. This result says that if the sufficient statistics are the only information we have about the density, then an exponential family provides a natural model.

Note that multiplying distributions from the exponential family yields a product distribution that is also in the exponential family, with the sufficient statistics (and corresponding parameters) aggregated. This is very useful in the context of graphical models: since a probability distribution is represented as a product of factors, as long as each factor is in the exponential family, the entire distribution will be too. Section 3.3 provides examples of graphical models in the exponential family.

Definitions / Notations

$\mathcal{I}$ is the set of indices for the sufficient statistics of a distribution, where $\alpha \in \mathcal{I}$ is an index for a sufficient statistic $\phi_\alpha: \mathcal{X}^m \rightarrow \Re$, where $m$ is the dimensionality of the random variable. Let $|\mathcal{I}| = d$.

$\Omega = \{\theta \in \Re^d \,\, | \,\, A(\theta) < +\infty\}$ is the space of canonical parameters.

$\mu$ are the mean parameters, which provide an alternative parametrization of the exponential family that is critical to this survey. The mean parameter for each sufficient statistic is defined as: $\mu_\alpha = E_p[\phi_\alpha(X)]\,\,\,\,\, \forall \alpha \in \mathcal{I}$.

$\mathcal{M} = \{\mu \in \Re^d \,\,|\,\, \exists p \,\, s.t. \,\, E_p[\phi_\alpha(X)] = \mu_\alpha \,\,\,\, \forall \alpha \in \mathcal{I}\}$ is the space of mean parameters.
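As a concrete sketch of the canonical-to-mean mapping, take a Bernoulli variable with $\phi(x) = x$, for which $A(\theta) = \log(1 + e^{\theta})$ and the mean parameter is the logistic function of $\theta$ (so $\mathcal{M}$ is the interval $[0, 1]$):

```python
import math

# Bernoulli family: p_theta(x) = exp(theta * x - A(theta)) for x in {0, 1},
# with A(theta) = log(1 + e^theta). Then mu = E_theta[phi(X)] = p_theta(1).
def A(theta):
    return math.log(1.0 + math.exp(theta))

def mean_param(theta):
    # mu = sum_x x * p_theta(x) = p_theta(1) = sigmoid(theta)
    return math.exp(theta - A(theta))

mu = mean_param(0.8)
```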

Let $S$ be a set, and $C \subseteq \Re^d$ be a convex set:

• $\mathrm{aff}(S)$ represents the affine hull of $S$, which is the smallest set that contains all the affine combinations of the elements in $S$. Recall that affine combinations are linear combinations where the weights sum to 1. (The affine hull of two points in $\Re^2$ is the line passing through the two points.)
• $\mathrm{conv}(S)$ represents the convex hull of $S$, which is the smallest set that contains all the convex combinations of the elements in $S$. Recall that convex combinations are affine combinations where the weights are all nonnegative. (The convex hull of two points in $\Re^2$ is the segment connecting the two points.)
• $C^{\circ}$ is the interior of $C$.
• $\mathrm{ri}(C)$ is the relative interior of $C$, i.e. the interior of $C$ relative to its affine hull $\mathrm{aff}(C)$.
• $C$ is full-dimensional if $\mathrm{aff}(C) = \Re^d$. If $C$ is full-dimensional, then $C^{\circ} = \mathrm{ri}(C)$.

Summary of Results

We have introduced two different parametrizations for an exponential family: the canonical parameters $\theta \in \Omega$ and the mean parameters $\mu \in \mathcal{M}$. It is thus natural to ask what the relationships are between these two parametrizations and their corresponding spaces. The following table summarizes the relationships stated and proven in Chapter 3. First note that a set of sufficient statistics can form either a minimal or an overcomplete representation, and the two cases have different characteristics, which we make explicit in the table.

| | Minimal | Overcomplete |
| --- | --- | --- |
| Definition | There is a unique canonical parameter associated with each distribution. Equivalently, there does not exist a nonzero vector $a \in \Re^d$ such that $\langle a,\phi(x)\rangle$ is equal to a constant. | There is an entire affine subset of canonical parameters associated with each distribution. Equivalently, there exists a nonzero vector $a \in \Re^d$ such that $\langle a,\phi(x)\rangle$ is equal to a constant. |
| Probability distribution | $p_{\theta}(x) = \exp\{\langle \theta,\phi(x)\rangle - A(\theta)\}$, where $A(\theta) = \log \int_{\mathcal{X}^m} \exp\{\langle \theta, \phi(x)\rangle\} \,\nu(dx)$. | Same as minimal. |
| Maximum likelihood | $\ell(\theta \mid x_1, \dots, x_N) = \sum_{n=1}^N \log p_{\theta}(x_n) = \sum_{n=1}^N\{\langle \theta,\phi(x_n)\rangle - A(\theta)\} = N\{\langle \theta,\hat{\mu}\rangle - A(\theta)\}$, so we maximize $\langle \theta,\hat{\mu}\rangle - A(\theta)$, where $\hat{\mu} = \frac{1}{N}\sum_{n=1}^N \phi(x_n)$ are the empirical moments of the sample. | Same as minimal. |
| Entropy | $H[p_{\theta}] = -\mathbb{E}[\log p_{\theta}(X)] = -[\langle \theta,\mu\rangle - A(\theta)]$, where $\mu = \mathbb{E}_{\theta}[\phi(X)]$. Notice that the parameter $\theta$ that maximizes the rescaled log-likelihood is the one that minimizes the entropy. | Same as minimal. |
| $\Omega$ | $\Omega = \{\theta \in \Re^d \mid A(\theta) < +\infty\}$ is a convex set, since $A(\theta)$ is a convex function (see below). | Same as minimal. |
| $\mathcal{M}$ | $\mathcal{M} = \{\mu \in \Re^d \mid \exists p \text{ s.t. } \mathbb{E}_p[\phi_\alpha(X)] = \mu_\alpha \,\, \forall \alpha \in \mathcal{I}\}$ is also a convex set; it is bounded if and only if $\Omega = \Re^d$ and $A$ is globally Lipschitz on $\Re^d$. In the minimal case, $\mathcal{M}$ is full-dimensional. | Same definition, but $\mathcal{M}$ is not full-dimensional, so $\mathcal{M}^{\circ}$ is the empty set. |
| $\mathcal{M}$ (discrete) | For multinomial random variables, $\mathcal{M}$ takes on a special form: $\mathcal{M} = \{\mu \in \Re^d \mid \mu = \sum_{x \in \mathcal{X}^m} \phi(x)p(x) \text{ for some } p(x) \ge 0, \sum_{x \in \mathcal{X}^m} p(x) = 1\} = \mathrm{conv}\{\phi(x), x \in \mathcal{X}^m\}$, which is, by definition, a convex polytope. A convex polytope can alternatively be represented as the intersection of a finite collection of half-spaces (as characterized by the Minkowski-Weyl theorem); the number of constraints depends critically on the graph structure. | Same as minimal. |
| $\mathbb{M}(G)$ | Not defined. | For multinomial random variables in a pairwise Markov random field $G$ using the standard overcomplete representation, $\mathcal{M}$ is referred to as the marginal polytope $\mathbb{M}(G)$. |
| $A$ | $A$ is strictly convex. | $A$ is convex. |
| $\nabla A$ | $\frac{\partial A}{\partial \theta_\alpha}(\theta) = \mathbb{E}_{\theta}[\phi_{\alpha}(X)] = \mu_\alpha$ and $\frac{\partial^2 A}{\partial \theta_{\alpha} \partial \theta_{\beta}}(\theta) = \mathbb{E}_{\theta}[\phi_{\alpha}(X)\phi_{\beta}(X)] - \mathbb{E}_{\theta}[\phi_{\alpha}(X)]\,\mathbb{E}_{\theta}[\phi_{\beta}(X)] = \mathrm{cov}\{\phi_{\alpha}(X), \phi_{\beta}(X)\}$. Thus $\nabla A$ is the mean of the random vector $\phi(X)$ and the Hessian $\nabla^2 A$ is its covariance matrix. The convexity (and, in the minimal case, strict convexity) of $A$ is a consequence of $\nabla^2 A$ being positive semi-definite. | Same as minimal. |
| Gradient map | $\nabla A$ maps $\Omega$ onto $\mathcal{M}^{\circ}$ and is a one-to-one correspondence between $\Omega$ and $\nabla A(\Omega)$. Consequently, for each $\mu \in \mathcal{M}^{\circ}$ there is a unique density in minimal exponential family form that realizes it (i.e. $\exists\, \theta$ s.t. $\mathbb{E}_{\theta}[\phi(X)] = \mu$). | $\nabla A$ maps $\Omega$ onto $\mathrm{ri}(\mathcal{M})$, and each element of $\nabla A(\Omega)$ corresponds to an affine subset of $\Omega$ (the same affine subset that corresponds to a unique distribution). Consequently, for each $\mu \in \mathrm{ri}(\mathcal{M})$ there is a set of densities in exponential family form that realize it (i.e. $\exists\, \theta_1, \dots, \theta_n$ s.t. $\mathbb{E}_{\theta_i}[\phi(X)] = \mu$ for all $1 \le i \le n$). |
| $A^*$ | $A^*(\mu) = \sup_{\theta \in \Omega} \{\langle\theta,\mu\rangle - A(\theta)\}$. Notice that $A^*(\hat{\mu})$, where $\hat{\mu}$ are the empirical moments induced by a data set, is the maximum value of the rescaled log-likelihood; it thus comes as no surprise that $A^*$ is related to entropy, as made explicit by Theorem 3.4(a). Here $A^*$ is strictly convex and essentially smooth, and it is differentiable on $\mathcal{M}^{\circ}$. | Same definition, but $A^*$ is only convex and lower semi-continuous. |
| $\nabla A^*$ | $\forall\, \mu \in \mathcal{M}^{\circ}$, $\nabla A^*(\mu) = (\nabla A)^{-1}(\mu)$. | — |
| $(A^*)^*$ | $(A^*)^* = A$, since $A$ is always convex and lower semi-continuous. | Same as minimal. |
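The gradient identities above ($\nabla A$ equals the mean of $\phi(X)$, $\nabla^2 A$ its covariance) can be checked numerically for the Bernoulli family via finite differences:

```python
import math

# Bernoulli family: A(theta) = log(1 + e^theta), so dA/dtheta should equal the
# mean sigmoid(theta), and d^2A/dtheta^2 the variance sigmoid(theta)(1 - sigmoid(theta)).
def A(t):
    return math.log(1.0 + math.exp(t))

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

theta, h = 0.3, 1e-4
grad_fd = (A(theta + h) - A(theta - h)) / (2 * h)              # central difference for dA
hess_fd = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h**2  # central difference for d2A

mean = sigmoid(theta)
var = mean * (1.0 - mean)
```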

The above table warrants some comments:

The mean parameters play a major role in inference and learning for exponential families. More specifically, computing the forward mapping from canonical to mean parameters is equivalent to marginalization, and computing the backward mapping from mean parameters to canonical parameters is equivalent to maximum-likelihood estimation of the canonical parameters given samples from the distribution.
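A sketch of this backward mapping for the Bernoulli family: the maximum-likelihood condition matches the model mean to the empirical moment, which inverts in closed form, and gradient ascent on the rescaled log-likelihood $\langle\theta, \hat{\mu}\rangle - A(\theta)$ reaches the same point (the data below are made up for illustration):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

data = [1, 1, 0, 1, 0, 1, 1, 0]           # toy sample
mu_hat = sum(data) / len(data)            # empirical moment

# Closed form: the moment-matching condition sigmoid(theta) = mu_hat
# inverts to the logit of mu_hat.
theta_closed = math.log(mu_hat / (1.0 - mu_hat))

# Gradient ascent: d/dtheta [<theta, mu_hat> - A(theta)] = mu_hat - sigmoid(theta)
theta = 0.0
for _ in range(2000):
    theta += 0.5 * (mu_hat - sigmoid(theta))
```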

Note that in the definition of $\mathcal{M}$, we did not restrict the density $p$ to be a member of the exponential family.

Naming the dual variables in $A^*$ as $\mu$ is deliberate, since they turn out to have a natural interpretation as mean parameters.

### Chapter 4: Sum-Product, Bethe-Kikuchi, and Expectation-Propagation

The following Appendices are helpful for this chapter:

• A.1 – Background on Graphs and Hypergraphs
• B.4 – Proof of Theorem 4.2(b)
• D. – Clustering and Augmented Hypergraphs
• E.1 – Möbius Inversion