June 22, 2009

(Apr 23) Meeting Summary: Wainwright / Jordan, p26-33

Filed under: Uncategorized — umassliving @ 5:43 am

Worked out an example of sum-product to calculate the marginal at one node. Noted that \sum_i \sum_j i j = (\sum_i i) (\sum_j j). Looked at the extension of sum-product to simultaneously compute marginals at all nodes using 2 |E| messages. Noted that max can also be used in place of sum; in fact, the updates can be run on any commutative semiring, of which sum-product and max-product are specific examples.
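The distributive-law identity above is exactly what sum-product exploits. A minimal sketch (with made-up potentials) on a three-node chain x1 — x2 — x3 shows the marginal at x2 computed two ways: by brute force over the full joint, and by multiplying the two incoming messages.

```python
import numpy as np

# The identity noted above: sum_i sum_j i*j = (sum_i i)(sum_j j).
assert sum(i * j for i in range(1, 4) for j in range(1, 4)) == sum(range(1, 4)) ** 2

rng = np.random.default_rng(0)

# Sum-product on a chain x1 - x2 - x3 with hypothetical pairwise potentials.
k = 4
psi12 = rng.random((k, k))   # potential over (x1, x2)
psi23 = rng.random((k, k))   # potential over (x2, x3)

# Brute force: marginal of x2 by summing the full joint (O(k^3) work).
joint = psi12[:, :, None] * psi23[None, :, :]    # shape (k, k, k)
brute = joint.sum(axis=(0, 2))
brute /= brute.sum()

# Sum-product: one message from each leaf into x2, then multiply (O(k^2) work).
m1_to_2 = psi12.sum(axis=0)   # sum out x1
m3_to_2 = psi23.sum(axis=1)   # sum out x3
marg = m1_to_2 * m3_to_2
marg /= marg.sum()

assert np.allclose(brute, marg)
```

The distributive law is what lets each sum be pushed inside the product, so the cost scales with the number of edges rather than the size of the joint state space.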

Looked at the junction tree algorithm. Noted that, in the terminology of the paper, a junction tree is a clique tree that satisfies the running intersection property. A clique tree can be constructed for any graph, but a junction tree can only be built on a triangulated graph. The nodes of a junction tree correspond to the maximal cliques of the triangulated graph, and the edges are chosen by a maximum weight spanning tree, where the weight on an edge is the size of the separator set between the two cliques it connects. Potentials are kept not only over the cliques in the junction tree but also over the separator sets, and messages are sent by dividing by the stored separator potential in order to avoid double-counting information. Worked through Example 2.2.
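The divide-by-the-separator update can be made concrete on the smallest possible junction tree, two cliques sharing one separator (a sketch with made-up potentials, not the paper's notation): cliques C1 = {A,B} and C2 = {B,C} with separator S = {B}. After one pass in each direction the clique potentials become the unnormalized marginals of the joint.

```python
import numpy as np
rng = np.random.default_rng(1)

# Two-clique junction tree: C1 = {A,B} -- S = {B} -- C2 = {B,C}.
a, b, c = 2, 3, 2
phi1_0 = rng.random((a, b))   # initial clique potential over (A, B)
phi2_0 = rng.random((b, c))   # initial clique potential over (B, C)
phi1, phi2 = phi1_0.copy(), phi2_0.copy()
phiS = np.ones(b)             # separator potential over B

# Message C1 -> C2: marginalize C1 onto S, divide out the old separator
# potential (this is what prevents double-counting), multiply into C2.
new_S = phi1.sum(axis=0)
phi2 = phi2 * (new_S / phiS)[:, None]
phiS = new_S

# Message C2 -> C1: the same update in the other direction.
new_S = phi2.sum(axis=1)
phi1 = phi1 * (new_S / phiS)[None, :]
phiS = new_S

# Calibrated: phi1 now equals the marginal over (A, B) of the full joint.
joint = phi1_0[:, :, None] * phi2_0[None, :, :]
assert np.allclose(phi1, joint.sum(axis=2))
```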


(Apr 9) Meeting Summary: Wainwright / Jordan, p1-29

Filed under: Uncategorized — umassliving @ 5:42 am

Specific issues covered: directed and undirected models; parameterizing undirected models using maximal and non-maximal cliques; factor graphs; converting between directed models, undirected models, and factor graphs; conditional independence; the need for moralization when converting from directed to undirected models; conditional independence assumptions that can only be modeled by either directed or undirected models; explaining away; variable elimination; the sum-product algorithm on trees; and directed trees versus polytrees.

(Mar 26) Meeting Summary: Roth & Black, “Field of Experts”

Filed under: Uncategorized — umassliving @ 5:40 am

The main point of confusion was the exact difference between Product of Experts and Field of Experts. There were two aspects to this confusion. The first was whether Field of Experts is simply Product of Experts trained on overlapping patches. This is not the case, though it might appear so from equation 12. The key is that there is a single normalizing constant Z for the entire image (it does not factor over the patches), whereas in Product of Experts there is one normalizing factor per patch.
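The distinction can be seen in the shape of the unnormalized log-density. A hedged sketch (with made-up filters and a Student-t expert, as in the paper; the function name and sizes are ours): every expert filter is applied at every overlapping patch position, and the resulting terms are tied together by one image-wide log Z that never factors into per-patch constants.

```python
import numpy as np
rng = np.random.default_rng(2)

def foe_log_density_unnorm(x, filters, alphas):
    """Unnormalized FoE log-density: sum of expert log-responses over ALL
    overlapping patch positions. log p(x) = (this) - log Z, where Z is a
    single constant for the whole image, not one per patch."""
    h, w = x.shape
    ph, pw = filters[0].shape
    total = 0.0
    for J, alpha in zip(filters, alphas):
        for i in range(h - ph + 1):          # every overlapping position
            for j in range(w - pw + 1):
                resp = np.sum(J * x[i:i + ph, j:j + pw])
                # Student-t expert: log phi(r) = -alpha * log(1 + r^2 / 2)
                total += -alpha * np.log1p(0.5 * resp ** 2)
    return total

image = rng.random((8, 8))
filters = [rng.standard_normal((3, 3)) for _ in range(2)]   # hypothetical filters
alphas = [1.0, 0.5]
lp = foe_log_density_unnorm(image, filters, alphas)
```

Because overlapping positions share pixels, the per-position terms are coupled, which is exactly the dependence the second point below is about.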

The second point was the rationale behind coupling the patches this way. Although during training, one could pick out independent patches, when applying the model to an entire image, overlapping patches will be dependent, and one would want this dependence to be captured during the training process. The experiment results in section 5.3.5 validate this reasoning.

(Mar 12) Meeting Summary: Liang & Jordan, “An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators”

Filed under: Uncategorized — umassliving @ 5:40 am

Why use log loss and Bayes Risk rather than 0-1 loss and Bayes Error? Then again, why use 0-1 loss rather than log loss?

Some confusion about r( \cdot ). r is a partitioning: for the generative and fully discriminative models there is a single partitioning, while for the pseudolikelihood discriminative model there is one r_i for each node in the graphical model. For the generative model, the single partition encompasses the entire outcome space Z. For the fully discriminative model, there is one partition for each data point x, encompassing all possible labels Y. Maximizing the quantity (3) moves probability mass within the neighborhood r(z) onto the point z. Introducing the r notation allows the generative, discriminative, and pseudolikelihood methods to be discussed under one framework.

The take-away messages are Corollaries 1 and 2: discriminative models enjoy O(n^{-1}) convergence in risk even when the model is misspecified, unlike the generative and pseudolikelihood methods. This matches the observation in Wainwright (2006) that one should use the same inference procedure in both training and testing.

(Feb 26) Meeting Summary: Larochelle, Bengio, Louradour & Lamblin, “Exploring Strategies for Training Deep Neural Networks”

Filed under: Uncategorized — umassliving @ 5:39 am

What is the difference between the RBM and the Autoassociator Network? The two seem quite similar, especially given the constraint on the Autoassociator Network that the encoding functions are sigmoids and W^T = W^*, and the fact that the RBM is trained with one-step contrastive divergence. However, there appear to be important differences, arising from the fact that the RBM is a generative probabilistic model and the Autoassociator Network is not.
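The structural similarity is easiest to see side by side. A sketch with made-up sizes (not the paper's experiments): both models encode with the same W and reconstruct with the tied weights W^T, but the RBM's CD-1 update comes from a sampled negative phase, while the autoassociator deterministically minimizes reconstruction error (the backprop gradient is omitted here).

```python
import numpy as np
rng = np.random.default_rng(5)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

nv, nh = 6, 4
W = 0.1 * rng.standard_normal((nh, nv))
v0 = (rng.random(nv) < 0.5).astype(float)        # a binary input vector

# RBM, one-step contrastive divergence: up, sample, down, up again;
# the update is the positive-phase correlation minus the negative phase.
h0 = sigmoid(W @ v0)
h0_sample = (rng.random(nh) < h0).astype(float)
v1 = sigmoid(W.T @ h0_sample)                    # stochastic "reconstruction"
h1 = sigmoid(W @ v1)
dW_rbm = np.outer(h0, v0) - np.outer(h1, v1)

# Tied-weight autoassociator: deterministic encode/decode, trained to drive
# the cross-entropy reconstruction error to zero (gradient step not shown).
h = sigmoid(W @ v0)
v_hat = sigmoid(W.T @ h)
recon_error = -np.sum(v0 * np.log(v_hat) + (1 - v0) * np.log(1 - v_hat))
```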

In particular, the best approach seems to be to apply some linear combination of both the updates for the RBM and the Autoassociator Network, as discussed on p26. One hypothesis is that “there is no guarantee that the RBM will encode in its hidden representation all the information in the input vector, [while] an autoassociator is trying to achieve this”.

This relates to the comment on p17 that if, at some point in a deep belief network, the units at the highest level are mostly independent, then the next level would have no statistical structure to learn, and training would thus initialize the weights of the new RBM to zero (and appropriately set the bias weights for the penultimate layer). Conceivably, the RBM could instead set very strong weights connecting each node in the penultimate layer only to the node directly above it (a one-to-one connection). However, we are not sure which the RBM would actually do under the training scheme given in the paper. Moreover, there is the further question of whether most of the information in the DBN would end up in bias weights at intermediate levels, which would not be as useful when constructing a classifier that uses the top layer of the DBN as input.

(Feb 12) Meeting Summary: Weiss, Torralba & Fergus, “Spectral Hashing”

Filed under: Uncategorized — umassliving @ 5:37 am

There was some confusion about independence versus correlation. Independence means p(a,b) = p(a)p(b) (equivalently, mutual information is zero), while two random variables are uncorrelated if E[AB] = E[A]E[B]. It is fairly easy to show that independence implies being uncorrelated, but the reverse is not always true (jointly Gaussian random variables are a notable exception: if they are uncorrelated, they are also independent). One simple counterexample: X = -1, 0, 1, each with probability 1/3, and Y = 1 if X = 0, otherwise Y = 0. Then X and Y are uncorrelated but clearly not independent.
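The counterexample can be checked exactly, since the distribution is finite:

```python
# X uniform on {-1, 0, 1}; Y = 1 if X == 0 else 0.
support = [(-1, 1/3), (0, 1/3), (1, 1/3)]

E_X  = sum(x * p for x, p in support)
E_Y  = sum((1 if x == 0 else 0) * p for x, p in support)
E_XY = sum(x * (1 if x == 0 else 0) * p for x, p in support)

# Uncorrelated: E[XY] = 0 = E[X] E[Y].
assert E_XY == E_X * E_Y

# Not independent: p(X=0, Y=0) = 0, but p(X=0) * p(Y=0) = (1/3)(2/3) != 0.
p_x0_y0 = 0.0
assert p_x0_y0 != (1/3) * (2/3)
```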

There was some confusion also about PCA. Here is one summary [16].

One other question was how to go from the minimization (1) to the minimization (2), and why the eigenvectors are the solution to the relaxed version of (2). If you expand the ||y_i - y_j||^2 terms into y_i^2 + y_j^2 - 2 y_i y_j for each bit, you should be able to rearrange terms to get 2 times the objective function of (2).
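The rearrangement can be checked numerically (with a random symmetric affinity matrix W and random ±1 codes Y, standing in for the paper's quantities): the weighted sum of squared code distances equals twice the graph Laplacian quadratic form, which is the objective of (2).

```python
import numpy as np
rng = np.random.default_rng(3)

n, k = 6, 3
W = rng.random((n, n)); W = (W + W.T) / 2        # symmetric affinity matrix
Y = rng.choice([-1.0, 1.0], size=(n, k))         # n codes of k bits each

# Left side: sum_ij W_ij ||y_i - y_j||^2.
lhs = sum(W[i, j] * np.sum((Y[i] - Y[j]) ** 2)
          for i in range(n) for j in range(n))

# Right side: 2 * trace(Y^T (D - W) Y), with D the degree matrix.
D = np.diag(W.sum(axis=1))
rhs = 2 * np.trace(Y.T @ (D - W) @ Y)

assert np.isclose(lhs, rhs)
```

The cross terms -2 y_i y_j collect into -2 tr(Y^T W Y), and the squared terms into 2 tr(Y^T D Y), which is where the D - W Laplacian comes from.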

Here are two sets of lecture notes related to this [17] and [18]. These course notes deal with graphs and their Laplacians (the D-W term), which may help in understanding part of the paper. In particular, this set of notes [19] contains a proof of the Courant-Fischer theorem, which shows that the minimal value of the relaxed problem (2) is the minimal eigenvalue.
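The Courant-Fischer statement is easy to sanity-check on a small random Laplacian: the minimum of x^T L x over unit vectors is the smallest eigenvalue, attained at the corresponding eigenvector, and no other unit vector does better.

```python
import numpy as np
rng = np.random.default_rng(4)

n = 5
W = rng.random((n, n)); W = (W + W.T) / 2
L = np.diag(W.sum(axis=1)) - W                   # graph Laplacian D - W

vals, vecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
x = vecs[:, 0]                                   # eigenvector of smallest eigenvalue
assert np.isclose(x @ L @ x, vals[0])

# Any other unit vector has a Rayleigh quotient at least as large.
z = rng.standard_normal(n); z /= np.linalg.norm(z)
assert z @ L @ z >= vals[0] - 1e-9
```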

(Feb 5) Meeting Summary: Jeff Hawkins, “On Intelligence”, first several chapters

Filed under: Uncategorized — umassliving @ 5:36 am

What is intelligence? Is it behavior, or memory, or something else? General agreement that Searle's Chinese room argument is flawed. If intelligence is behavior, then is the Turing test an appropriate measure of intelligence? Does the fact that a system like ELIZA [21] can apparently fool some laypeople into believing that they are talking to an actual person, despite not having what most AI practitioners would consider real intelligence, indicate a flaw in the Turing test?

If intelligence is memory, then is something like Torralba’s 80 Million Tiny Images [22] the path to AI? Can general AI really be solved by such an approach? Can such an approach provide an interpretation of a conference room, handling the variety of possible chairs, tables, faces, viewing angles, etc?

What will a general AI solution require? Hawkins argues that any approach must have a time component, mimic the physical architecture of the human brain, and have many feedback connections. What else will have to be a part of any general AI solution? Higher-order feature induction?

