IR-Lab project of Tien-ho (Henry) Lin
Instructor: Yiming Yang
Dual-wing harmoniums is a promising technique for modeling the relationship between heterogeneous data sources, like associated text and images or genetic variations and observed traits. Unsatisfied with contrastive divergence (CD) and maximum likelihood learning, we implemented Bayesian learning in DWH using brief Langevin MCMC approach. We proposed three different types of priors for both information retrieval and quatitative trait loci (QTL) mapping and examined the results. Bayesian learning and different priors are evaluated by classification and retrieval task for TRECVID'03 video clips and cross validation of predicting morphological shapes of two Drosophila species. Experiments shows that Bayesian learning gives close to CD accuracy and average precision in information retrieval, but fails on QTL prediction. We discussed possible causes and future directions.
Task |
Done by |
Status |
| Proposal and work plan | Oct. 5 | Complete |
| Literature review of DWH | Oct. 14 | Complete |
| Literature review of linkage analysis and QTL | Oct. 28 | Complete |
| DWH learning method | Nov. 30 | Complete |
| Applying DWH to quantitative trait loci data | Dec. 6 | Complete |
| Comparison & performance analysis | Dec. 12 | Complete |
| Presentation | Dec. 16 | Complete |
| Documentation | Dec. 23 | Complete |
Many machine learning problems require modeling the relationship between heterogeneous data sources. For example, in multimedia information retrieval we want to model the relation between video and caption, and in genetic linkage analysis we want to model the relation between genotype (genetic variations) and phenotype (observable traits). The model will contribute to various tasks including classification, retrieval and prediction.
Recently, there are many successful work regarding models with one layer of observed variables and one layer of hidden variables representing latent concepts. Although directed models like latent Dirichlet Association (LDA) [1] have advantages on causal interpretation, missing data, and easier sampling, inferencing posterior (required for EM) could be computationally intractable and one has to use variational or sampling method. Undirected models like the exponential family harmoniums (EFH) [8] can run inference efficiently but suffer from slow learning, which is an acceptable trade-off for many IR applications. Harmoniums can be generalized to multi-wing harmoniums (MWH) to model heterogeneous data sources, in particular, dual-wing harmonium (DWH) has been applied to associated text and images classification and retrieval [9]. Results show that inference is not only faster but also more accurate, probably due to better topic mixing.
Learning in EFH, as well as any undirected graphical models, is difficult. Traditionally maximum likelihood estimates is obtained by approximate gradient ascend, either contrast divergence (CD) [2] or mean field [9]. We believe Bayesian learning enables the use of prior knowledge and usually performs better when less data is available. Bayesian learning in undirected graphical models are only recently been explored, and a MCMC technique called brief Langevin method, inspired by Contrastive Divergence, is found to be efficient and effective [3]. We applied brief Langevin to DWH with three different types of priors, and evaluated the models on information retrieval and genomics domain.
We model the relationship between heterogeneous data, one discrete denoted by x and one continuous denoted by z, by defining p(x, z). This is in essence a unsupervised learning problem. However to evaluate our model we need a supervised learning task, using (distribution of) latent features in the model and classification algorithm like support vector machine (SVM). Domain-related data and supervised learning tasks for information retrieval and genomics are detailed in this section.
The data is images and captions in a single video shot, sampled from TRECVIDˇ¦03 video collection from ABC World News, CNN Headlines, and C-SPAN. Text x is represented as word occurence, 1894 binary features, and image z is represented by color histogram in HSV space, 166 features. There are totally 1078 text and image pairs, labeled as one of five categories: airplane, basketball, weather, baseball, and hockey.
We evaluate the performance of a model by doing retrieval and classification on the reduced feature space, as the case of LSI and LDA. Like in these latent layer models, learning is unsupervised, without concerning labels. Retrieval is performed by matching training examples to query (testing) text-image pair using 1-Nearest Neighbor. Retrieved document is considered relevant if the category is the same. Retrieval is evaluated by precision and recall, combined in mean average precision (MAP) [6] which corresponds to the area under the precision-recall curve. Classification is done by running support vector machine on reduced features, measured by accuracy.
Quantitative trait loci (QTL) are the (multiple) genes responsible for an observed quantitative trait, like blood pressure or tree diameter. To simplify complexity, experiments are often based on backcross population, in which a population is generated from two ˇ¨pureˇ¨ inbred strains, and then bred with one of these two strains (for example, one with larger value), to get the backcross population. QTLs are mapped using genetic markers on the genome, their values called genotype, while the observed traits are called phenotype. For biallelic genes, the genotypes are ˇ¨AAˇ¨, ˇ¨ABˇ¨, and ˇ¨BBˇ¨, but in backcross population there is only homozygous ˇ¨AAˇ¨ and heterozygous ˇ¨ABˇ¨ genotypes. Hence the genotype x are binary (corresponding to homozygous ˇ¨AAˇ¨ and heterozygous ˇ¨ABˇ¨ genotypes) and the phenotype z are one real number. To evaluate a model we follow [11] to use two-fold cross validation and measure correlation coefficient of predicted and true phenotype.
The data is the size and shape of the posterior lobe of male genital arch of two Drosophila species, D. simulans and D. mauritiana [11]. This trait is believed to be under high natural selection. The shape is represented as the top eigenvalues of the elliptical Fourier. There are 45 genotypes and 1 real-valued phenotype. Two-fold cross validation is based on two independent samples and performed on two species, sample size being 186 and 288 in D. simulans population and 192 and 299 in D. mauritiana population. There is always missing values in genotype data due to technical limitations.
Traditionally the relationship between heterogeneous data is modeled as a simple mixture model [7], for instance the Gaussian-multinomial mixture (GM-Mix) [1]. Both LSI and LDA have been applied to multimedia information retrieval. LDA has been generalized to Gaussian-multinomial LDA (GM-LDA) and Correspondence LDA (Corr-LDA), the formerˇ¦s precision-recall lower than GM-Mix while the laterˇ¦s higher in image retrieval task [1]. Unfortunately when word counts are few or just one (common in video captions), it is observed that LDA variants assume single topic draws and no direct topic mixing [9].
MWH is a special case of EFH which has multiple types of observed variables following the same exponential family distribution [9]. It can be viewed as the undirectional counterpart of the directed models with latent topic layer, like LDA. DWH for associated text and images models word counts as a Poisson distribution and image color-histogram as a Gaussian distribution (conditioned on the latent topics). Latent topics are modeled as unitvariance Gaussian distribution. Experiments show that DWH outperforms LSI, GM-Mix and GM-LDA in classification, retrieval, and image annotation.
State-of-the-art QTL mapping are composite interval mapping (CIM) [10], based on multiple regression, and its extension with forward and backward stepwise model selection, called multiple interval mapping (MIM) [12]. A new approach capable of Bayesian inference called pseudomarker has also been proposed, using MCMC sampling and imputation for missing values [5]. Although multiple regression model is very effective, especially on simpler QTLs, we believe both stepwise model selection and imputation have not fully utilized the structure in this data.
Based on EFH, the DWH can be defined by the conditional distribution of observed
variables
x, z and latent variables h. Let xi denote the word counts of word i in text
document
x,

And zk denote the kth bin value in the color correlogram of the image z,

The latent topic for this text-image pair h also follows normal distributions,

Marginalize out h in (1) and (2), we obtain the marginal of x, z,

Maximum likelihood learning of parameters can be done by gradient ascend, if
we taking
derivative of the log p(x, z) in (4),

where < . >p denotes the expectation based on distribution p(x, z) in
(4), < . >~p denotes the expectation
based on the empirical distribution, and
.
Note that the partition function Z(theta) is implicitly present in p(x, z) not shown in (4), so < . >p is intractable to compute and CD or mean field is used to approximate it. Previous experiments show that CD performed better but converged slower. For prediction, required in QTL, it is also intractable. We prefer to make inference fast so we use mean field approximation.
In Bayesian inference, the probability of the testing data X0 given training data X0 is
![]()
It is generally intractable to obtain closed form solution, so usually the parameter is sampled from the posterior p(|X) using MCMC. Even if Metropolis sampling is employed, in undirected graphical models the acceptance test will still be intractable due to the partition function Z(). A nested approximation either by mean field or Bethe loopy belief propagation is necessary. However, the outer loop of ˇ¨random-walkˇ¨ Metropolis sampling combined with inner loop of approximation or sampling might results in slow or no convergence.
Dynamical Monte Carlo methods from molecular dynamics can utilize gradient information to reduce random walk behaviour [4]. Particularly, uncorrected Langevin method provides a way to avoid the partition function by always accepting the newly sampled state (parameter),

This is justified by observing that the total rejections expected in the entire parameter space is propotional to epsilon [4]. Corrected Langevin require calculating (or approximating) the partition function [3]. However, the gradient for just one step in Langevin Monte Carlo is still intractable. Using nested MCMC not only is inefficient but also has high variance and introduces biases [2]. Inspired by Contrastive Divergence, the brief Langevin method starts from the data X and runs a few sampling steps [3]. Since the outer loop of Langevin method reduces random-walk and the inner loop only takes a few sampling steps, brief Langevin is very efficient.
Brief Langevin requires another approximation for hidden variables on general
undirected graphical models. Fortunately the marginal probability can be written
in closed form, so there is no need to approximate the maginal. As in previous
DWH, retrieval and classification is performed by feeding the hidden variables
to k Nearest Neighbor and support vector machine, respectively. For Bayesian
inference we use expected hidden variables instead, which is simply
. The choice of priors plays an important role in Bayesian learning, so we experimented
three types of priors based on different assumptions, and evaluated their behaviors.
All DWH equations assume x and z features are known. This is true for information retrieval, but there are missing marker genotypes in QTL mapping and special care must be taken. We compared four methods of handling missing values.
First we use simple priors for all parameters, mainly for regularization, expecting most weights between the hidden layer and the input layers to be close to zero. Hence i and k are modeled as uniform distribution and Wij and Ukj are modeled as normal distribution with mean zero and a given variance a priori.

We prefer to estimate rho_w, rho_u by empirical Bayes, but how to perform empirical Bayes on undirected graphical models still remains an open problem. Hence we simply set a reasonable value for each parameters by analyzing the data, rho_w = 0.1, rho_u = 0.01, and examine its behaviors.
Although Gaussian is the natural conjugate prior for W and U, assuming (conditional) independence over all weights is inapproapriate. Particularly, weights sharing the same hidden variable or observed variable should not be independent. Hence we assume only rows and columns in W and U are dependent, or, assume two weights not on the same row or column are conditionally independent given other weights. This corresponds to elements of the inverse covariance matrix,


and the derivatives will be

One justification of this prior is that we expect weights on the same row and column to compete with each other in a normalization effect. Ideally only a few weights are high and others are low, either for one observed variable or for one one hidden variable. Following independent Gaussian prior we choose rho_w1 = 5, rho_w2 = 5/|x|, rho_w3 = 5/|h|, rho_u1 = 50, rho_u2 = 50/|z|, rho_u3 = 50/|h|, where |x|, |z|, |x| denotes the dimension of each vector.
The idea of the third prior is taken from structure learning,
![]()
where |W|, |U| denotes number of weights greater than 0.01 and 0.001, respectively. Similarly we set rho_w = 0.1and rho_u = 0.01. The motivation is that most latent aspects should be related to only a few features.
We tested Bayesian DWH on video and QTL data mentioned above. Three chains are used to track convergence, some parameters shown in Figure 1. We ran 5000 iterations for CD, 15000 iterations for independent Gaussian prior and 5000 iterations for dependent Gaussian prior. For missing value imputation method (randomly sampled data) in QTL, 10 imputed dataset are generated and averaged. The results on video data are shown in Figure 2. Bayesian learning with brief Langevin achieves similar accuracy with less latent aspects, but have not outperform CD in the best case (20 latent aspects). The results on QTL are shown in Figure 3. Assigning 0.5 or . Brief Langevin fails utterly (r < 0.2) and is not shown. DWH learned by CD gives higher correlation, but still lower than MIM results reported before. Possible reasons are discussed below.

Figure 2: Classification accuracy and retrieval average precision
of CD and brief Langevin with different priors

Figure 3: Correlation of MIM and DWH (CD). CD0-11 denotes assigning
0 to missing values while CD11 denotes sample missing values in one step Gibbs
sampling. Both have 11 latent aspects.
Consider similar performance of brief Langevin and CD, which is superior than directed models in multimedia IR, the poor performance of brief Langevin is quite surprising. One possible reason is that approximate prediction is based on naive mean field, which approximate the conditional distribution of the phenotype given genotype by a single-modal Gaussian. However mean field tend to approximate multi-modal Gaussian by a singlemodal Gaussian in the center, which may be far away from the optimum. Belief propagation and MCMC sampling should also be tested.
Other future works for Bayesian DWH include Bayesian model selection, especially number of latent aspects. Another direction is using corrected (brief) Langevin to avoid bias, which may be possible due to special structure of Harmoniums. We would also like to test the model on different data set, like text and image data using image regions, and compare with correspondence LDA.
Source code of bayesian DWH is available here.
[1] Blei, D. & Jordan, M. (2003) Modeling annotated data. SIGIR-2003.
[2] Hinton, G. E. (2002) Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800.
[3] Murray, I. & Ghahramani, Z. (2004) Bayesian Learning in Undirected Graphical Models: Approximate MCMC algorithms. UAI-2004, pp. 392-399.
[4] Neal, R. M. (1993) Probabilistic inference using Markov chain Monte Carlo methods. Technical report, Department of Computer Science, Univ of Toronto.
[5] Sen, S. & Churchill G. A. (2001) A statistical framework for quantitative trait loci mapping. Genetics, 159:371-387.
[6] Smeaton, A. F. and Over, P. (2003) TRECVID: Benchmarking the effectiveness of information retrieval tasks on digital video. Proc. of the Intl. Conf. on Image and Video Retrieval.
[7] Taskar, B., Segal, E. & Koller, D. (2001) Probabilistic clustering in relational data. IJCAI-2001, pp. 870-876.
[8] Welling, M., Rosen-Zvi M.& Hinton, G. (2004) Exponential family harmoniums with an application to information retrieval. Advances in Neural Information Processing Systems 17.
[9] Xing, E.P. , Yan, R. & Hauptmann, A.G. (2005) Mining associated text and images with dual-wing harmoniums. UAI-2005.
[10] Zeng, Z-B. (1994) Precision mapping of quantitative trait loci. Genetics, 136:1457- 1468.
[11] Zeng, Z-B., Liu, J., Stam, L. F., Kao, C-H., Mercer, J. M., & Laurie, C. C. (2000) Genetic architecture of a morphological shape difference between two Drosophila species. Genetics, 154(1):299-310.
[12] Kao, C-H., Zeng, Z-B., & Teasdale, R. D. (1999) Multiple interval mapping for quantitative trait loci. Genetics, 152:1203-1216.