An NSF-funded Collaborative Project # III-COR 0704628 & 0704689
Title: Multi-Field Correlated Topic Modeling
Summary: Many practical applications need to analyze and maintain data
collections in which each entity (object or event) consists of multiple fields
with different but interrelated contents. For example, in a troubleshooting
scenario, each record may contain
several free-text fields, such as a brief problem description by a user, an
initial analysis of the problem by a technical specialist, and a detailed
technical description by the expert(s) who fixed the problem. Other fields in
the record may include related information in the form of nominal,
categorical, ordinal, and numerical attributes, such as who reported the
problem, what level of urgency was specified, and which expert(s) were assigned.
For each new troubleshooting scenario, multiple interrelated tasks must be
solved, such as finding similar past cases (retrieval) or automatically
determining the severity of the problem, its category, and the right experts
(prediction). The main challenge in this scenario
is to model the dependencies among multiple fields so that the rich connections
among tasks can be effectively leveraged.
1) We have developed a new multi-field correlated topic modeling approach that
models such multi-field data in a single global Bayesian graphical structure.
2) We have developed a variant of the mean-field variational algorithm as the
approximation procedure for inference and parameter estimation.
3) We have evaluated our approach on real troubleshooting data. Our approach
outperforms state-of-the-art Correlated Topic Modeling (CTM) in terms of
likelihood (Figure 1) and predictive perplexity (Figure 2).
Figure 1 shows the likelihood of two modifications of our approach
(mf-CTM.dt, mf-CTM.ct) and of the state-of-the-art baseline (CTM) as a function
of the number of latent topics (a parameter of the algorithm that must be
tuned).

Figure 2 shows the predictive perplexity (lower is better) of two
modifications of our approach (mf-CTM.dt, mf-CTM.ct) and of the
state-of-the-art baseline (CTM). Perplexity reflects the ability of the model
to predict unseen fields (unsolved tasks) given the observed fields (solved
tasks).
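
To make the perplexity measure concrete, the sketch below uses the common
definition of perplexity as the exponential of the negative average
log-probability per held-out token. This summary does not state the exact
formula used in the project, so the definition here is an assumption, and the
names perplexity and held_out_log_probs are hypothetical. Each input value
stands for log p(token | observed fields) as produced by some already-fitted
model.

```python
import math


def perplexity(held_out_log_probs):
    """Perplexity of held-out tokens under the assumed standard definition.

    held_out_log_probs: natural-log probabilities that a fitted model assigns
    to each token of the unseen field(s), conditioned on the observed fields.
    Returns exp(-(1/N) * sum(log p)); lower values mean better prediction.
    """
    n_tokens = len(held_out_log_probs)
    if n_tokens == 0:
        raise ValueError("need at least one held-out token")
    avg_log_prob = sum(held_out_log_probs) / n_tokens
    return math.exp(-avg_log_prob)


if __name__ == "__main__":
    # Illustrative values only: a model that spreads probability uniformly
    # over 4 candidate words scores a perplexity of about 4.
    uniform_guess = [math.log(0.25)] * 10
    # A model that assigns probability 0.5 to each held-out token scores
    # about 2, i.e. it is "less perplexed" by the unseen field.
    sharper_guess = [math.log(0.5)] * 10
    print(perplexity(uniform_guess), perplexity(sharper_guess))
```

Under this definition a perplexity of k can be read as the model being as
uncertain as if it were choosing uniformly among k words for each held-out
token, which is why lower curves in Figure 2 indicate better prediction.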
