This page contains a summary of “Graphical Models, Exponential Families, and Variational Inference,” by Wainwright & Jordan (2008). The following set of slides summarizes the survey.
Chapter 1: Introduction
Graphical models = graph theory + probability theory
There are generally two types of computational tasks associated with graphical models: (1) inference, which can be thought of as querying the distribution once all of its parameters have been set; example queries include marginal distributions (marginals), maximum probabilities (modes), and the values of the random variables (i.e., the configuration) that achieve the maximum probability; and (2) learning, that is, given a collection of samples from the distribution, estimating its parameters.
As will be discussed in the sequel, exact computation of the above quantities is typically not possible in most graph structures, and approximate algorithms need to be utilized. There are two main classes of approximate algorithms: the variational methods discussed in this survey and the complementary class of Markov chain Monte Carlo (MCMC) sampling algorithms. MCMC algorithms can in theory be exact given sufficient computation, so their approximation error stems from the limited computation available in practice. “Variational methods” is an umbrella term referring to optimization-based formulations of the above problems. Approximations then arise when we relax or approximate the objective function and/or the set over which the optimization takes place.
This survey covers variational methods for exponential families, a large class of distributions heavily studied in the statistics literature. The notions of convexity and conjugate duality of exponential families play a prominent role in the design of variational methods and, consequently, in this survey.
Chapter 2: Background
The following Appendices are helpful for this chapter:
 A.1 – Background on Graphs and Hypergraphs
Graphical Models
Graphical models are diagrammatic representations of probability distributions. They provide a simple way to visualize the structure of a probability model and offer insights into its properties, such as conditional independence. Additionally, inference and learning tasks can be expressed as graph operations, in which the underlying mathematical expressions are carried out implicitly. There are two main types of graphical models: Bayesian networks, which are composed of directed edges, and Markov random fields, which are composed of undirected edges. Note that graphs that contain both directed and undirected edges are called chain graphs, but they are not discussed in this survey.
The following table summarizes the properties of the two types of graphical models:
| | Bayesian networks | Markov random fields |
| --- | --- | --- |
| Uses | More suited to representing conditional independencies (e.g., causal relationships) between random variables. | More suited to expressing soft constraints between random variables. |
| Factorization | $p(x) = \prod_{s \in V} p(x_s \mid x_{\pi(s)})$, where $\pi(s)$ are the parent nodes of $s$. | $p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)$, where $\mathcal{C}$ is a collection of cliques (not necessarily maximal). The compatibility functions $\psi_C$ need not have any obvious or direct relation to marginal or conditional distributions. As long as $Z < \infty$, then $p$ is a valid probability distribution. Thus it is common to set $\psi_C(x_C) = \exp(-E_C(x_C))$, where $E_C$ is an unconstrained “energy” function. |
| Factor graph representation | Factors are the conditional probabilities over the child-parent sets. Note that the same pdf can be decomposed in multiple ways; for example, $p(x_1, x_2) = p(x_1)\,p(x_2 \mid x_1) = p(x_2)\,p(x_1 \mid x_2)$. However, the graphical representation is explicit about the decomposition, and thus there is a unique factor graph representation for each graphical representation. | Factors are the compatibility functions. Factor graphs are especially helpful for MRFs, since for each undirected graphical representation there are many possible factorizations (depending on the choice of compatibility functions), and factor graphs resolve any ambiguities. |
| Testing conditional independence $X_A \perp X_B \mid X_C$ | Use d-separation. | Much simpler: remove the nodes in $C$ from the graph and check for reachability from any node in $A$ to any node in $B$. |
| Markov blanket | Parents, children, and co-parents (other parents of its children). | Neighbors. |
First, some comments explaining the table:
 Factorization is a key idea behind graphical models, that is, each graphical model consists of a collection of probability distributions that factorize according to the graph structure.
 Factor graphs provide an alternative visualization that emphasizes the factorization.
 We say that $X_A \perp X_B \mid X_C$, that is, $X_A$ is independent of $X_B$ given $X_C$, if $p(x_A, x_B \mid x_C) = p(x_A \mid x_C)\, p(x_B \mid x_C)$.
 For both types of graphical models, one can establish an equivalence between the set of probability distributions that factorize according to the graph structure and the set of conditional independence statements. For Bayesian networks, the d-separation theorem is used; for Markov random fields, it is the Hammersley-Clifford theorem.
It is important to note that Bayesian networks and Markov random fields have the ability to represent different types of probability distributions. While some probability distributions can be represented as both, some can only be represented as a Bayesian network or Markov random field. Additionally, some probability distributions cannot be represented as a graphical model (examples include models whose factorization properties change depending on the values of the random variables).
Examples of Graphical Models
Section 2.4 covers several popular directed and undirected graphical models that reappear in later chapters.
Exact Inference
Before starting, it is important to note that both directed and undirected graphical models involve factorized expressions for joint probabilities and thus inference algorithms treat them in an essentially identical manner. In order to allow a unified treatment of inference algorithms, it is convenient to work exclusively with undirected models and thus we transform a directed model to an undirected one via moralization.
For inference on trees, the sum-product algorithm for computing marginals (and its slight variant, the max-product algorithm for computing modes) is an exact inference algorithm and subsumes a large number of inference algorithms derived for special cases. For example, the forward/backward algorithm for computing the probability of a specific observation sequence in an HMM is an application of sum-product; the Viterbi algorithm for computing the most probable state sequence in an HMM is an application of max-product (or max-sum when dealing with log probabilities for numerical stability); the Kalman filter is an application of sum-product for Gaussian models; and finally, Pearl’s belief propagation algorithm for Bayesian networks is a special case of the sum-product algorithm. Note that sum-product is applicable to trees and polytrees, that is, the factor graph must be a tree. Additionally, it is mostly used for discrete and Gaussian models where the messages have known forms: a vector of probabilities for the former, and the mean and covariance for the latter. For general continuous non-Gaussian nodes, certain parameterizations of the messages need to be used. Note that the nonparametric belief propagation algorithm (not covered in this survey) was developed to handle this case.
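As an illustration of the message-passing idea, here is a minimal sketch of sum-product on a three-node binary chain, checked against brute-force enumeration. The graph, function names, and compatibility values below are made up for this example and are not from the survey.

```python
import itertools

# A tiny chain MRF  x1 -- x2 -- x3  over binary variables, with
# arbitrary (made-up) pairwise compatibility functions psi12 and psi23.
psi12 = [[1.0, 0.5],
         [0.5, 2.0]]   # psi12[x1][x2]
psi23 = [[2.0, 1.0],
         [0.3, 2.5]]   # psi23[x2][x3]

def marginal_x2_sum_product():
    # Each leaf sends a message toward x2; the marginal of x2 is the
    # normalized product of the incoming messages.
    m1 = [sum(psi12[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]  # message 1 -> 2
    m3 = [sum(psi23[x2][x3] for x3 in (0, 1)) for x2 in (0, 1)]  # message 3 -> 2
    unnorm = [m1[x2] * m3[x2] for x2 in (0, 1)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def marginal_x2_brute_force():
    # Reference answer: enumerate all 2^3 configurations explicitly.
    unnorm = [0.0, 0.0]
    for x1, x2, x3 in itertools.product((0, 1), repeat=3):
        unnorm[x2] += psi12[x1][x2] * psi23[x2][x3]
    z = sum(unnorm)
    return [u / z for u in unnorm]

print(marginal_x2_sum_product())  # agrees with marginal_x2_brute_force()
```

On a chain this costs a handful of additions, while the brute-force sum grows exponentially in the number of nodes; that gap is the entire point of the algorithm.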
The junction tree algorithm is an exact inference algorithm for general graphs (with cycles), but its complexity is exponential in the size of the maximal clique rendering it impractical for a large number of graphical models. This motivates the need for approximate inference algorithms such as variational methods.
Conclusions
The main goal of this survey is to develop a general theoretical framework for understanding and analyzing variational methods for computing approximations to marginals, modes, and likelihoods. Chapter 3 provides the mathematical background for doing so, and we begin describing how algorithms fit into this framework in Chapter 4, where we cover loopy BP, generalized BP, and expectation propagation.
Chapter 3: Graphical Models as Exponential Families
The following Appendices are helpful for this chapter:
 A.2 – Basics of Convex Sets and Functions
 B.1 – Proof of Theorem 3.3
 B.2. – Proof of Theorem 3.4
 B.3 – General Properties of $\mathcal{M}$ and $A^*$
 C.1 – Variational Principles of Multivariate Gaussians with Known Covariance
 C.2 – Variational Principles of Multivariate Gaussians with Unknown Covariance
 E.3 – Conversion to a Pairwise Markov Random Field
Exponential Families
Exponential families play a critical role in this survey. More specifically, the convexity and conjugate duality of exponential families are central to the discussion.
An exponential family of probability distributions is characterized by a set of sufficient statistics $\phi = (\phi_\alpha,\ \alpha \in \mathcal{I})$. Thus, given a vector of sufficient statistics $\phi(x)$, the corresponding exponential family, parametrized by the canonical parameters $\theta$, is formally defined as:

$p_\theta(x) = \exp\{\langle \theta, \phi(x) \rangle - A(\theta)\}$

where $A(\theta) = \log \int \exp\langle \theta, \phi(x) \rangle \, \nu(dx)$ is the log partition function that ensures the density integrates to 1, and $m$ is the dimensionality of the random variable $X$. Note that a sufficient statistic is simply a function $\phi_\alpha : \mathcal{X}^m \to \mathbb{R}$. It turns out that many common distributions such as the normal, multinomial, Bernoulli, Poisson, and beta form exponential families. For example, a normal distribution forms an exponential family with sufficient statistics $x$ and $x^2$. Table 3.1 on page 42 provides other examples.
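To make the normal example concrete, here is a small sketch (the helper names and test values below are my own, not from the survey) that maps $(\mu, \sigma^2)$ to the canonical parameters for the sufficient statistics $(x, x^2)$ and checks that the exponential-family form reproduces the usual density formula:

```python
import math

# N(mu, sigma^2) in exponential-family form:
#   p_theta(x) = exp(theta1 * x + theta2 * x^2 - A(theta))
# with sufficient statistics phi(x) = (x, x^2).
def gaussian_canonical(mu, var):
    # Standard mapping: theta1 = mu / sigma^2, theta2 = -1 / (2 sigma^2).
    return mu / var, -1.0 / (2.0 * var)

def log_partition(theta1, theta2):
    # A(theta) = theta1^2 / (-4 theta2) + 0.5 * log(pi / (-theta2)),
    # obtained from the Gaussian integral (requires theta2 < 0).
    return theta1 ** 2 / (-4.0 * theta2) + 0.5 * math.log(math.pi / -theta2)

def pdf_exponential_family(x, mu, var):
    t1, t2 = gaussian_canonical(mu, var)
    return math.exp(t1 * x + t2 * x * x - log_partition(t1, t2))

def pdf_direct(x, mu, var):
    # The familiar closed-form normal density, for comparison.
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

for x in (-1.0, 0.0, 2.5):
    assert abs(pdf_exponential_family(x, 1.0, 4.0) - pdf_direct(x, 1.0, 4.0)) < 1e-12
```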
One way to motivate these families of distributions is via the principle of maximum entropy. Consider the problem of estimating a probability density function given IID samples from it. The unique distribution that is consistent with the data and has maximal entropy is a member of the exponential family. Generically, there are many distributions that are consistent with the data, and thus we use the principle of maximum entropy for choosing amongst them. This result states that if the sufficient statistics are the only information we have about the density, then an exponential family provides a natural model.
Note that multiplying distributions from the exponential family results in a product distribution that is in the exponential family with the sufficient statistics (and corresponding parameters) aggregated. This is very useful in the context of graphical models, since a probability distribution is represented as a product of factors and thus as long as each factor is an exponential family, the entire distribution will be an exponential family. Section 3.3. provides examples of graphical models in the exponential family.
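The parameter-aggregation point can be checked on a toy case: multiplying two Bernoulli factors written in exponential-family form and renormalizing yields another Bernoulli whose canonical parameter is the sum. A minimal sketch (the parameter values are chosen arbitrarily):

```python
import math

# Bernoulli in exponential-family form: p_theta(x) = exp(theta*x - A(theta)),
# x in {0, 1}, with log partition A(theta) = log(1 + e^theta).
def bernoulli(theta):
    A = math.log1p(math.exp(theta))
    return [math.exp(theta * x - A) for x in (0, 1)]

ta, tb = 0.7, -1.2
pa, pb = bernoulli(ta), bernoulli(tb)

# Multiply the two factors pointwise and renormalize ...
prod = [pa[x] * pb[x] for x in (0, 1)]
z = sum(prod)
prod = [p / z for p in prod]

# ... and the result is the family member with canonical parameter ta + tb.
assert all(abs(p - q) < 1e-12 for p, q in zip(prod, bernoulli(ta + tb)))
```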
Definitions / Notations
$\mathcal{I}$ is the set of indices for the sufficient statistics of a distribution, where $\alpha \in \mathcal{I}$ is an index for a sufficient statistic $\phi_\alpha : \mathcal{X}^m \to \mathbb{R}$, where $m$ is the dimensionality of the random variable. Let $d = |\mathcal{I}|$.
$\Omega = \{\theta \in \mathbb{R}^d : A(\theta) < \infty\}$ is the space of canonical parameters.
$\mu \in \mathbb{R}^d$ are the mean parameters, which provide an alternative parametrization of the exponential family that is critical to this survey. The mean parameter for each sufficient statistic is defined as: $\mu_\alpha = \mathbb{E}_p[\phi_\alpha(X)] = \int \phi_\alpha(x)\, p(x)\, \nu(dx)$.
$\mathcal{M} = \{\mu \in \mathbb{R}^d : \exists\, p \text{ s.t. } \mathbb{E}_p[\phi_\alpha(X)] = \mu_\alpha \ \forall \alpha \in \mathcal{I}\}$ is the space of mean parameters.
Let $S$ be a set, and let $C$ be a convex set:
 $\text{aff}(S)$ represents the affine hull of $S$, which is the smallest set that contains all the affine combinations of the elements in $S$. Recall that affine combinations are linear combinations whose weights sum to 1. (The affine hull of two points in $\mathbb{R}^2$ is the line passing through the two points.)
 $\text{conv}(S)$ represents the convex hull of $S$, which is the smallest set that contains all the convex combinations of the elements in $S$. Recall that convex combinations are affine combinations whose weights are all nonnegative. (The convex hull of two points in $\mathbb{R}^2$ is the segment connecting the two points.)
 $C^\circ$ (also written $\text{int}(C)$) is the interior of $C$.
 $\text{ri}(C)$ is the relative interior of $C$, that is, its interior taken with respect to $\text{aff}(C)$.
 $C$ is full-dimensional if $\text{aff}(C) = \mathbb{R}^d$. If $C$ is full-dimensional, then $\text{ri}(C) = C^\circ$.
Summary of Results
We have introduced two different parametrizations of an exponential family: the canonical parameters $\theta$ and the mean parameters $\mu$. It is thus natural to ask what the relationships are between these two parametrizations and their corresponding spaces. The following table summarizes the relationships stated/proven in Chapter 3. First note that a set of sufficient statistics can form either a minimal or an overcomplete representation, each of which results in different characteristics, which we make explicit in the table.
| | Minimal | Overcomplete |
| --- | --- | --- |
| Definition | There is a unique canonical parameter $\theta$ associated with each distribution. Equivalently, there does not exist a nonzero vector $a \in \mathbb{R}^d$ such that $\langle a, \phi(x) \rangle$ is equal to a constant. | There is an entire affine subset of canonical parameters associated with each distribution. Equivalently, there exists a nonzero vector $a \in \mathbb{R}^d$ such that $\langle a, \phi(x) \rangle$ is equal to a constant. |
| Probability distribution | $p_\theta(x) = \exp\{\langle \theta, \phi(x) \rangle - A(\theta)\}$ | Same form. |
| Maximum likelihood | Maximize the rescaled log-likelihood $\ell(\theta) = \langle \theta, \hat{\mu} \rangle - A(\theta)$, where $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \phi(x^{(i)})$ are the empirical moments of the sample. | Same. |
| Entropy | $H(p_\theta) = A(\theta) - \langle \theta, \mu(\theta) \rangle$, where $\mu(\theta) = \mathbb{E}_\theta[\phi(X)]$. Notice that the maximum value of the rescaled log-likelihood equals the negative entropy of the fitted distribution. | Same. |
| $\Omega$ | $\Omega$ is a convex set, since $A$ is a convex function (see below). | Same. |
| $\mathcal{M}$ | $\mathcal{M}$ is also a convex set. $\mathcal{M}$ is bounded if and only if $\Omega = \mathbb{R}^d$ and $A$ is globally Lipschitz on $\mathbb{R}^d$. | Same. |
| Dimensionality of $\mathcal{M}$ | $\mathcal{M}$ is full-dimensional. | $\mathcal{M}$ is not full-dimensional, so its interior $\mathcal{M}^\circ$ is an empty set; one works with the relative interior $\text{ri}(\mathcal{M})$ instead. |
| $\mathcal{M}$ (discrete case) | For the case of multinomial random variables, $\mathcal{M}$ takes on a special form, $\mathcal{M} = \text{conv}\{\phi(x) : x \in \mathcal{X}^m\}$, which is, by definition, a convex polytope. A convex polytope can alternatively be represented as the intersection of a finite collection of half-spaces (as characterized by the Minkowski-Weyl theorem). The number of constraints critically depends on the graph structure. | Same. |
| Marginal polytope | Not defined. | For the case of multinomial random variables in a pairwise Markov random field using the standard overcomplete representation, $\mathcal{M}$ is referred to as the marginal polytope $\mathbb{M}(G)$. |
| $A$ | $A$ is strictly convex. | $A$ is convex. |
| Derivatives of $A$ | The gradient $\nabla A(\theta) = \mathbb{E}_\theta[\phi(X)]$ is the mean of the sufficient statistics, and the Hessian $\nabla^2 A(\theta)$ is their covariance matrix. The convexity (respectively, strict convexity) of $A$ is a consequence of the Hessian being positive semidefinite (respectively, positive definite). | Same, except the Hessian is positive semidefinite but never positive definite. |
| Range of $\nabla A$ | The gradient map $\nabla A$ is onto $\mathcal{M}^\circ$. | The gradient map $\nabla A$ is onto $\text{ri}(\mathcal{M})$. |
| Injectivity of $\nabla A$ | $\nabla A$ is a one-to-one correspondence between the elements of $\Omega$ and the elements of $\mathcal{M}^\circ$. | $\nabla A$ is a one-to-one correspondence between the elements of $\text{ri}(\mathcal{M})$ and affine subsets of $\Omega$ (the same affine subsets that correspond to unique distributions). |
| Realizability | Consequently, for each $\mu \in \mathcal{M}^\circ$ there is a unique density in minimal exponential family form that realizes it (i.e., $\mathbb{E}_\theta[\phi(X)] = \mu$). | Consequently, for each $\mu \in \text{ri}(\mathcal{M})$ there is an entire affine subset of canonical parameters, all inducing the same exponential family density that realizes it. |
| $A^*$ and entropy | Notice that $A^*(\hat{\mu})$, where $\hat{\mu}$ are the empirical moments induced by a data set, is the maximum value of the rescaled log-likelihood. Thus it comes as no surprise that $A^*$ is related to entropy, as made explicit by Theorem 3.4(a): $A^*(\mu) = -H(p_{\theta(\mu)})$ for $\mu \in \mathcal{M}^\circ$. | Same, with $\text{ri}(\mathcal{M})$ in place of $\mathcal{M}^\circ$. |
| Properties of $A^*$ | $A^*$ is strictly convex and essentially smooth; $A^*$ is differentiable on $\mathcal{M}^\circ$. | $A^*$ is convex and lower semicontinuous. |
| Variational principle | $A(\theta) = \sup_{\mu \in \mathcal{M}} \{\langle \theta, \mu \rangle - A^*(\mu)\}$ (Theorem 3.4(b)). | Same. |
| Double conjugate | $(A^*)^* = A$, since $A$ is always convex and lower semicontinuous. | Same. |
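The fact that the derivatives of $A$ yield the moments of the sufficient statistics (gradient = mean, Hessian = covariance) is easy to check numerically. Here is a small sketch for the Bernoulli family, where $A(\theta) = \log(1 + e^\theta)$; the test point and tolerances are arbitrary choices of mine:

```python
import math

# Bernoulli log partition function A(theta) = log(1 + e^theta).
def A(theta):
    return math.log1p(math.exp(theta))

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

theta, h = 0.4, 1e-5
# Central finite differences approximate the first and second derivatives.
grad = (A(theta + h) - A(theta - h)) / (2 * h)               # ~ E[phi(X)] = mu
hess = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h**2   # ~ Var[phi(X)]

mu = sigmoid(theta)            # the mean parameter for this theta
assert abs(grad - mu) < 1e-8          # gradient matches the mean
assert abs(hess - mu * (1 - mu)) < 1e-4  # Hessian matches the variance
```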
The above table warrants some comments:
The mean parameters play a major role in inference and learning for exponential families. More specifically, computing the forward mapping from canonical to mean parameters is equivalent to marginalization, and computing the backward mapping from mean parameters to canonical parameters is equivalent to computing the maximum likelihood estimate of the canonical parameters given samples from the distribution.
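For the Bernoulli family both mappings have closed forms, which makes the correspondence easy to see in a few lines. In this sketch the sample size, seed, and the true parameter 0.3 are arbitrary choices of mine:

```python
import math
import random

# Backward mapping for the Bernoulli family: the MLE matches moments,
# so theta_hat solves A'(theta) = mu_hat, i.e. sigmoid(theta) = mu_hat,
# giving the closed form theta_hat = logit(mu_hat).
random.seed(0)
samples = [1 if random.random() < 0.3 else 0 for _ in range(10000)]

mu_hat = sum(samples) / len(samples)          # empirical moment
theta_hat = math.log(mu_hat / (1 - mu_hat))   # backward mapping mu -> theta

# Forward mapping at theta_hat reproduces the empirical moment exactly.
assert abs(1.0 / (1.0 + math.exp(-theta_hat)) - mu_hat) < 1e-12
```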
Note that in the definition of $\mathcal{M}$, we did not restrict the density to be a member of the exponential family.
Naming the dual variables in $A^*(\mu) = \sup_{\theta \in \Omega} \{\langle \theta, \mu \rangle - A(\theta)\}$ as $\mu$ is deliberate, since they turn out to have a natural interpretation as mean parameters.
Chapter 4: Sum-Product, Bethe-Kikuchi, and Expectation-Propagation
The following Appendices are helpful for this chapter:
 A.1 – Background on Graphs and Hypergraphs
 B.4 – Proof of Theorem 4.2(b)
 D. – Clustering and Augmented Hypergraphs
 E.1 – Möbius Inversion