BNT_HOME = 'C:\murphyk\matlab\BNT';

assuming I downloaded the file into C:\murphyk\matlab.
>> cd C:\murphyk\matlab\BNT
To specify this directed acyclic graph (dag), we create an adjacency matrix:
N = 4; dag = zeros(N,N); C = 1; S = 2; R = 3; W = 4; dag(C,[R S]) = 1; dag(R,W) = 1; dag(S,W)=1;
We have numbered the nodes as follows: Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4. The nodes must always be numbered in topological order, i.e., ancestors before descendants. For a more complicated graph, this is a little inconvenient: we will see how to get around this below.
In addition to specifying the graph structure, we must specify the size and type of each node. If a node is discrete, its size is the number of possible values each node can take on; if a node is continuous, it can be a vector, and its size is the length of this vector. In this case, we will assume all nodes are discrete and binary.
discrete_nodes = 1:N;
node_sizes = 2*ones(1,N);

We are now ready to make the Bayes net.
bnet = mk_bnet(dag, node_sizes, discrete_nodes);

There are several optional arguments to the 'mk_bnet' function, some of which we will see later. In general, to find out more about a function, please see its documentation string by typing
help mk_bnet

See also other useful Matlab tips.
A model consists of the graph structure and the parameters. The parameters are represented by CPD objects (CPD = Conditional Probability Distribution), which define the probability distribution of a node given its parents. (We will use the terms "node" and "random variable" interchangeably.) The simplest kind of CPD is a table (multi-dimensional array), which is suitable when all the nodes are discrete-valued. Note that the discrete values are not assumed to be ordered in any way; that is, they represent categorical quantities, like male and female, rather than ordinal quantities, like low, medium and high.
Tabular CPDs, also called CPTs (conditional probability tables), are stored as multidimensional arrays, where the dimensions are arranged in the same order as the nodes, e.g., the CPT for node 4 (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself. Hence the child is always the last dimension. If a node has no parents, its CPT is a column vector representing its prior. Note that in Matlab (unlike C), arrays are indexed from 1, and are laid out in memory such that the first index toggles fastest, e.g., the CPT for node 4 (WetGrass) is as follows

S  R    P(W=F)   P(W=T)
------------------------
F  F    1.0      0.0
T  F    0.1      0.9
F  T    0.1      0.9
T  T    0.01     0.99
where we have used the convention that false==1, true==2. We can create this CPT in Matlab as follows
CPT = zeros(2,2,2);
CPT(1,1,1) = 1.0;
CPT(2,1,1) = 0.1;
...

Here is an easier way:
CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);

In fact, we don't need to reshape the array, since the CPD constructor will do that for us. So we can just write
bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);

If we omit the third argument to tabular_CPD, random parameters will be created. To ensure repeatable results, use
rand('state', seed); randn('state', seed);
engine = jtree_inf_engine(bnet);

The other engines have similar constructors, but might take additional, algorithm-specific parameters. All engines are used in the same way, once they have been created. We illustrate this in the following sections.
Suppose we want to compute the probability that the sprinkler was on given that the grass is wet. The evidence consists of the fact that W=2. All the other nodes are hidden (unobserved). We can specify this as follows.
evidence = cell(1,N);
evidence{W} = 2;

We use a 1D cell array instead of a vector to cope with the fact that nodes can be vectors of different lengths. In addition, the value [] can be used to denote 'no evidence', instead of having to specify the observation pattern as a separate argument.
We are now ready to add the evidence to the engine.
[engine, loglik] = enter_evidence(engine, evidence);

The behavior of this function is algorithm-specific, and is discussed in more detail below. In the case of the jtree engine, enter_evidence implements a two-pass message-passing scheme. The first return argument contains the modified engine, which incorporates the evidence. The second return argument contains the log-likelihood of the evidence. (Not all engines are capable of computing the log-likelihood.)
Finally, we can compute p=P(S=2|W=2) as follows.
marg = marginal_nodes(engine, S);
p = marg.T(2);

We find that p = 0.4298.
Now let us add the evidence that it was raining, and see what difference it makes.
evidence{R} = 2;
[engine, loglik] = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, S);
p = marg.T(2);

We find that p = P(S=2|W=2,R=2) = 0.1945, which is lower than before, because the rain can "explain away" the fact that the grass is wet.
Note: you can plot a marginal distribution over a discrete variable as a barchart using the built-in 'bar' function.
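For example, continuing the sprinkler example above, something along these lines will do (the tick labels are just for readability):

marg = marginal_nodes(engine, S);
bar(marg.T)                            % bar heights are P(S=1|evidence) and P(S=2|evidence)
set(gca, 'XTickLabel', {'off','on'})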
evidence = cell(1,N);
[engine, ll] = enter_evidence(engine, evidence);
m = marginal_nodes(engine, [S R W]);

m is a structure. The 'T' field is a multi-dimensional array (in this case, 3-dimensional) that contains the joint probability distribution on the specified nodes.
>> m.T
ans(:,:,1) =
    0.2900    0.0410
    0.0210    0.0009
ans(:,:,2) =
         0    0.3690
    0.1890    0.0891

We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass to be wet if both the rain and sprinkler are off.
Let us now add some evidence to R.
evidence{R} = 2;
[engine, ll] = enter_evidence(engine, evidence);
m = marginal_nodes(engine, [S R W])
m =
    domain: [2 3 4]
         T: [2x1x2 double]
>> m.T
ans(:,:,1) =
    0.0820
    0.0018
ans(:,:,2) =
    0.7380
    0.1782

The joint T(i,j,k) = P(S=i,R=j,W=k|evidence) should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible with the evidence that R=2. Instead of creating large tables with many 0s, BNT sets the effective size of observed (discrete) nodes to 1. This is why m.T has size 2x1x2. We can convert this back to a 2x2x2 table as follows.
fullm = add_ev_to_dmarginal(m, evidence, ns)
Note: It is not always possible to compute the joint on arbitrary sets of nodes: it depends on the engine, as discussed in more detail below.
nsamples = 50;
samples = sample_bnet(bnet, nsamples);
data = cell2num(samples);

sample_bnet returns a cell array because, in general, each node might be a vector of different length. In this case, all nodes are discrete (and hence scalars), so we convert samples to a regular array: data(i,l) = samples{i,l} is the value of node i in case l.
Now we create a network with random parameters.
% Make a tabula rasa
bnet2 = mk_bnet(dag, ns);
seed = 0;
rand('state', seed);
bnet2.CPD{C} = tabular_CPD(bnet2, C);
bnet2.CPD{R} = tabular_CPD(bnet2, R);
bnet2.CPD{S} = tabular_CPD(bnet2, S);
bnet2.CPD{W} = tabular_CPD(bnet2, W);

Finally, we find the maximum likelihood estimates of the parameters.
[bnet3, LL] = learn_params_tabular(bnet2, data);

To view the learned parameters, we use a little Matlab hackery. (The kosher approach would be to implement a display method for each CPD object.)
CPT3 = cell(1,N);
for i=1:N
  s = struct(bnet3.CPD{i});  % violate object privacy
  CPT3{i} = s.CPT;
end

Here are the results.
celldisp(CPT3)
CPT3{1} =
    0.6000    0.4000
CPT3{2} =
    0.5000    0.5000
    1.0000         0
CPT3{3} =
    0.8667    0.1333
    0.2000    0.8000
CPT3{4} =
(:,:,1) =
    1.0000         0
    0.3846         0
(:,:,2) =
         0    1.0000
    0.6154    1.0000

So we see that the learned parameters are fairly close to the original ones, which we display below for completeness.
CPT{1} =
    0.5000    0.5000
CPT{2} =
    0.5000    0.5000
    0.9000    0.1000
CPT{3} =
    0.8000    0.2000
    0.2000    0.8000
CPT{4} =
(:,:,1) =
    1.0000    0.1000
    0.1000    0.0100
(:,:,2) =
         0    0.9000
    0.9000    0.9900

We can get better results by using a larger training set, or using informative priors (see below).
The Dirichlet has a simple interpretation in terms of pseudo counts. If we let N_ijk = the number of times X_i=k and Pa_i=j occurs in the training set, where Pa_i are the parents of X_i, then the maximum likelihood (ML) estimate is T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0. To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this event was not seen in the training set, we can pretend we saw value k of X_i, for each value j of Pa_i, alpha_ijk times in the past. The MAP (maximum a posteriori) estimate is then
T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij)

and is never 0 if all alpha_ijk > 0. For example, consider the network A->B, where A is binary and B has 3 values. A Jeffreys prior for B has the form
        B=1   B=2   B=3
A=1     1/6   1/6   1/6
A=2     1/6   1/6   1/6

The sum of all the entries is called the equivalent sample size (ess). In BNT, to create this prior, with random CPT entries, use
ess = 1;
tabular_CPD(bnet, i, 'rnd', ess);

ess=0 gives a 0 (uninformative) prior, which is equivalent to ML estimation.
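To make the pseudo-count arithmetic concrete, here is a small made-up example for a single row of B's CPT (i.e., one value j of the parent A):

N_ijk = [0 3 7];                % counts of B=1,2,3 in cases where A=j
alpha = [1/6 1/6 1/6];          % pseudo counts from the Jeffreys prior above
T_ML  = N_ijk / sum(N_ijk)      % ML estimate: [0 0.3 0.7]; the first entry is 0
T_MAP = (N_ijk + alpha) / (sum(N_ijk) + sum(alpha))   % MAP estimate: never 0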
The marginal likelihood of a case (which integrates out the parameter values) can be computed by setting each row of the CPT to the normalisation of the corresponding row of the prior. In the above example, this becomes
        B=1   B=2   B=3
A=1     1/3   1/3   1/3
A=2     1/3   1/3   1/3

This is a stochastic matrix, since each row sums to 1. This can be created as follows:
ess = 1; tabular_CPD(bnet, i, 'unif', ess);
use_prior = 1; data = cell2num(samples); [bnet3, LL3, bic_score, mscore] = learn_params_tabular(bnet2, data, use_prior);
LL3 is the log-likelihood of the data computed at the final MAP estimate of the parameters:
LL3 = log \prod_l P(X_i | Pa_i, data(:,l), theta_hat)

where theta_hat is the MAP estimate.
mscore is the log marginal likelihood, which integrates out the parameters:
mscore = log \int_theta \prod_l P(X_i | Pa_i, data(:,l), theta) P(theta)

bic_score is a large sample approximation to mscore; it is computed by assuming the parameter posterior is approximately Gaussian. This is known as the Bayesian Information Criterion (BIC).
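In its standard form, this approximation is

BIC = log P(data | theta_hat) - (d/2) log(ncases)

where d is the number of free parameters in the model and ncases is the number of training cases.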
We can compute the same marginal likelihood sequentially (one case at a time) as follows.
LL4 = 0;
bnet4 = bnet2;
for l=1:nsamples
  LL4 = LL4 + log_likelihood(bnet4, samples(:,l));
  bnet4 = update_params(bnet4, samples(:,l));
end

The parameters of bnet4 are identical to bnet3.
LL4 is given by
LL4 = log P(x1 | th0) + log P(x2 | th1) + ... + log P(xn | th(n-1))

where th(i) = the result of Bayesian updating applied to th(i-1) and x(i)=samples(i), and th(0) is specified by the prior. It is simple to show that LL4 = mscore, provided we set the initial CPT entries to the mean of the initial prior, so that log_likelihood computes the log marginal likelihood. (This can be done by making the initial CPTs uniform, and the initial prior a Jeffreys prior.) See the file BNT/examples/static/seqlearn1.m for details.
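As a quick sanity check of this claim (assuming the uniform-CPT, Jeffreys-prior initialisation just described), one can compare the two quantities directly:

assert(approxeq(LL4, mscore))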
samples2 = samples;
hide = rand(N, nsamples) > 0.5;
[I,J] = find(hide);
for k=1:length(I)
  samples2{I(k), J(k)} = [];
end

samples2{i,l} is the value of node i in training case l, or [] if unobserved.
Now we will compute the MLEs using the EM algorithm. We need to use an inference algorithm to compute the expected sufficient statistics in the E step; the M (maximization) step is as above.
engine2 = jtree_inf_engine(bnet2);
max_iter = 10;
[bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter);

LLtrace(i) is the log-likelihood at iteration i. We can plot this as follows:
plot(LLtrace, 'x-')

Let's display the results after 10 iterations of EM.
celldisp(CPT4)
CPT4{1} =
    0.6616    0.3384
CPT4{2} =
    0.6510    0.3490
    0.8751    0.1249
CPT4{3} =
    0.8366    0.1634
    0.0197    0.9803
CPT4{4} =
(:,:,1) =
    0.8276    0.0546
    0.5452    0.1658
(:,:,2) =
    0.1724    0.9454
    0.4548    0.8342

We can get improved performance by (1) increasing the size of the training set, (2) decreasing the amount of hidden data, (3) running EM for longer, or (4) using informative priors.
We now give a simple example of how to use K2. We use the sprinkler network and sample 50 cases from it as before. Then we can recover the exact structure as follows:
order = [C S R W];
max_fan_in = 2;
dag = learn_struct_K2(data, ns, order, max_fan_in);

The PC algorithm takes as arguments a function f, the number of nodes N, the maximum fan in K, and additional arguments A which are passed to f. The function f(X,Y,S,A) returns 1 if X is conditionally independent of Y given S, and 0 otherwise. For example, suppose we cheat by passing in the true graph structure as the argument A, and use d-separation to implement the CI test, i.e., f(X,Y,S) calls dsep(X,Y,S,dag).
pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag);

pdag(i,j) = -1 if there is definitely an i->j arc, and pdag(i,j) = 1 if there is either an i->j or an i<-j arc. (Graphs which belong to the same Markov equivalence class can have the direction of some of their arcs reversed without changing any of the CI relationships. The pdag (partially oriented dag, or pattern) is a concise representation of this equivalence class.)
Applied to the sprinkler network, this returns
pdag =
     0     1     1     0
     1     0     0    -1
     1     0     0    -1
     0     0     0     0

So as expected, we see that the V-structure at the W node is uniquely identified, but the other arcs have ambiguous orientation.
We now give an example from p141 of the SGS book. This example concerns the female orgasm. We are given a correlation matrix C between 7 measured factors (such as subjective experiences of coital and masturbatory experiences), derived from 281 samples, and want to learn a causal model of the data. We will not discuss the merits of this type of work here, but merely show how to reproduce the results in the SGS book. Their program, tetrad, makes use of the Fisher Z-test for conditional independence, so we do the same:
max_fan_in = 4;
nsamples = 281;
alpha = 0.05;
pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);

In this case, the CI test is
f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C, nsamples, alpha)

The results match those of Fig 12a of SGS apart from two edge differences; presumably this is due to rounding error (although it could be a bug). This example can be found in the file BNT/examples/static/pc2.m.
We can create this model as follows.
F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
n = 9;
dag = zeros(n);
dag(F,E) = 1;
dag(W,[E Min D]) = 1;
dag(E,D) = 1;
dag(B,[C D]) = 1;
dag(D,[L Mout]) = 1;
dag(Min,Mout) = 1;
% node sizes - all cts nodes are scalar, all discrete nodes are binary
ns = ones(1, n);
dnodes = [F W B];
cnodes = mysetdiff(1:n, dnodes);
ns(dnodes) = 2;
bnet = mk_bnet(dag, ns, dnodes);

'dnodes' is a list of the discrete nodes; 'cnodes' lists the continuous nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'.
The parameters of the discrete nodes can be specified as follows.
bnet.CPD{B} = tabular_CPD(bnet, B, [0.85 0.15]'); % 1=stable, 2=unstable
bnet.CPD{F} = tabular_CPD(bnet, F, [0.95 0.05]'); % 1=intact, 2=defect
bnet.CPD{W} = tabular_CPD(bnet, W, [2/7 5/7]');   % 1=industrial, 2=household

In general, Gaussian nodes are parameterized as follows. Suppose the node is called Y, its continuous parents (if any) are called X, and its discrete parents (if any) are called Q. Then the distribution on Y is defined as follows:
- no parents:               Y ~ N(mu, Sigma)
- cts parents:              Y|X=x ~ N(mu + W x, Sigma)
- discrete parents:         Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
- cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i))

where N(mu, Sigma) denotes a Normal distribution with mean mu and covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q respectively. If there are no discrete parents, |Q|=1; if there is more than one, then |Q| = a vector of the sizes of each discrete parent. If there are no continuous parents, |X|=0; if there is more than one, then |X| = the sum of their sizes. Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight) matrix.
We can specify the parameters of the model as follows. (If we omit any parameters, random values of the appropriate size will be created.)
mu = reshape([-3.9 -0.4 -3.2 -0.5], [1 2 2]);
Sigma = reshape([0.00002 0.0001 0.00002 0.0001], [1 1 2 2]);
bnet.CPD{E} = gaussian_CPD(bnet, E, mu, Sigma);

mu = reshape([6.5 6.0 7.5 7.0], [1 2 2]);
Sigma = reshape([0.03 0.04 0.1 0.1], [1 1 2 2]);
weights = reshape([1 1 1 1], [1 1 2 2]);
bnet.CPD{D} = gaussian_CPD(bnet, D, mu, Sigma, weights);

mu = reshape([-2 -1], [1 2]);
Sigma = reshape([0.1 0.3], [1 1 2]);
bnet.CPD{C} = gaussian_CPD(bnet, C, mu, Sigma);

bnet.CPD{L} = gaussian_CPD(bnet, L, 3, 0.25, -0.5);

mu = reshape([0.5 -0.5], [1 2]);
Sigma = reshape([0.01 0.005], [1 1 2]);
bnet.CPD{Min} = gaussian_CPD(bnet, Min, mu, Sigma);

bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 0, 0.002, [1 1]);
engine = jtree_inf_engine(bnet);
evidence = cell(1,n);
[engine, ll] = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, E);

'marg' is a structure that contains the fields 'mu' and 'Sigma', which contain the mean and (co)variance of the marginal on E. In this case, they are both scalars. Let us check they match the published figures to 2dp. (We can't expect more precision than this in general because I have implemented the algorithm of Lauritzen (1992), which can be numerically unstable.)
tol = 1e-2;
assert(approxeq(marg.mu, -3.25, tol));
assert(approxeq(sqrt(marg.Sigma), 0.709, tol));

We can compute the other posteriors similarly. Now let us add some evidence.
evidence = cell(1,n);
evidence{W} = 1;    % industrial
evidence{L} = 1.1;
evidence{C} = -0.9;
[engine, ll] = enter_evidence(engine, evidence);

Now we find
marg = marginal_nodes(engine, E);
assert(approxeq(marg.mu, -3.8983, tol));
assert(approxeq(sqrt(marg.Sigma), 0.0763, tol));

We can also compute the joint probability on a set of nodes. For example, P(D, Mout | evidence) is a 2D Gaussian:
marg = marginal_nodes(engine, [D Mout])
marg =
    domain: [6 8]
        mu: [2x1 double]
     Sigma: [2x2 double]
         T: 1.0000

The mean is
marg.mu
ans =
    3.6077
    4.1077

and the covariance matrix is
marg.Sigma
ans =
    0.1062    0.1062
    0.1062    0.1182

It is easy to visualize this posterior using standard Matlab plotting functions (see e.g., BNT/misc/gaussplot). The T field indicates that the mixing weight of this Gaussian component is 1.0. If the joint contains discrete and continuous variables, the result will be a mixture of Gaussians, e.g.,
marg = marginal_nodes(engine, [F E])
    domain: [1 3]
        mu: [-3.9000 -0.4003]
     Sigma: [1x1x2 double]
         T: [0.9995 4.7373e-04]

The interpretation is Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ]. In this case, E is a scalar, so i=j=1; k specifies the mixture component.
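Going back to the 2D joint on [D Mout] above, here is a minimal sketch of such a visualization using only standard plotting commands (gaussplot, mentioned above, wraps up something similar); the grid limits are arbitrary choices around the mean:

marg = marginal_nodes(engine, [D Mout]);
[x1, x2] = meshgrid(linspace(2.5, 4.5, 100), linspace(3, 5, 100));
dx = [x1(:) - marg.mu(1), x2(:) - marg.mu(2)];          % deviations from the mean
p = exp(-0.5 * sum((dx / marg.Sigma) .* dx, 2)) / (2*pi*sqrt(det(marg.Sigma)));
contour(x1, x2, reshape(p, size(x1)));                  % contours of the Gaussian density
xlabel('D'); ylabel('Mout');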
We saw in the sprinkler network that BNT sets the effective size of observed discrete nodes to 1, since they only have one legal value. For continuous nodes, BNT sets their length to 0, since they have been reduced to a point. For example,
marg = marginal_nodes(engine, [B C])
    domain: [4 5]
        mu: []
     Sigma: []
         T: [0.0123 0.9877]

It is simple to post-process the output of marginal_nodes. For example, the file BNT/examples/static/cg1 sets the mu term of observed nodes to their observed value, and the Sigma term to 0 (since observed nodes have no variance).
(Figure: four graphical models, panels (a)-(d); panel (c) shows the mixture of factor analysers discussed below.)
We can create this model in BNT as follows.
ns = [k D];
dag = zeros(2,2);
dag(1,2) = 1;
dnodes = [];
onodes = 2;
bnet = mk_bnet(dag, ns, dnodes);
bnet.CPD{1} = gaussian_CPD(bnet, 1, zeros(k,1), eye(k), [], 'diag', 'untied', 'clamp_mean', 'clamp_cov');
bnet.CPD{2} = gaussian_CPD(bnet, 2, zeros(D,1), diag(Psi0), W0, 'diag', 'untied', 'clamp_mean');

The root node is clamped to the N(0,I) distribution, so that we will not update these parameters during learning. The mean of the leaf node is clamped to 0, since we assume the data has been centered (had its mean subtracted off); this is just for simplicity. Finally, the covariance of the leaf node is constrained to be diagonal. W0 and Psi0 are the initial parameter guesses.
We can fit this model using EM in the usual way (see the file BNT/examples/static/fa1 for details). Not surprisingly, the ML estimates for mu and Psi turn out to be identical to the sample mean and variance, which can be computed directly as
mu_ML = mean(data);
Psi_ML = diag(cov(data));

Note that W can only be identified up to a rotation matrix, because of the spherical symmetry of the source.
If we restrict Psi to be spherical, i.e., Psi = sigma I, there is a closed-form solution for W as well, i.e., we do not need to use EM. In particular, W contains the first |X| eigenvectors of the sample covariance matrix, with scalings determined by the eigenvalues and sigma. Classical PCA can be obtained by taking the sigma->0 limit. For details, see
By adding a hidden discrete variable, we can create mixtures of FA models, as shown in (c). Now we can explain the data using a set of subspaces. We can create this model in BNT as follows.
ns = [M k D];
dag = zeros(3);
dag(1,3) = 1;
dag(2,3) = 1;
dnodes = 1;
onodes = 3;
bnet = mk_bnet(dag, ns, dnodes);
bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0);
bnet.CPD{2} = gaussian_CPD(bnet, 2, zeros(k, 1), eye(k), [], 'diag', 'untied', 'clamp_mean', 'clamp_cov');
bnet.CPD{3} = gaussian_CPD(bnet, 3, Mu0', repmat(diag(Psi0), [1 1 M]), W0, 'diag', 'tied');

Notice how the covariance matrix for Y is the same for all values of Q; that is, the noise level in each sub-space is assumed the same. However, we allow the offset, mu, to vary. For details, see
I have included Zoubin's specialized MFA code (with his permission) with the toolbox, so you can check that BNT gives the same results: see 'BNT/examples/static/mfa1.m'.
Independent Factor Analysis (IFA) generalizes FA by allowing a non-Gaussian prior on each component of X. (Note that we can approximate a non-Gaussian prior using a mixture of Gaussians.) This means that the likelihood function is no longer rotationally invariant, so we can uniquely identify W and the hidden sources X. IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y). We recover classical Independent Components Analysis (ICA) in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the weight matrix W is square and invertible. For details, see
                      exp(w(:,i)'*x + b(i))
Pr(Q=i | X=x) = -----------------------------------
                   sum_j exp(w(:,j)'*x + b(j))

The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the following interpretation: w(:,i)-w(:,j) is the normal vector to the decision boundary between classes i and j, and b(i)-b(j) is its offset (bias). For example, suppose X is a 2-vector, and Q is binary. Then
w = [1 -1; 0 0]; b = [0 0];

means that class 1 consists of points in the 2D plane with positive x coordinate, and class 2 of points with negative x coordinate. If w has large magnitude, the decision boundary is sharp, otherwise it is soft. In the special case that Q is binary (0/1), the softmax function reduces to the logistic (sigmoid) function.
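To see these parameters in action, we can evaluate the softmax by hand at a couple of test points (the probabilities in the comments are approximate):

x = [2; 0];                   % a point with positive x coordinate
a = w'*x + b';                % one activation per class
p = exp(a) / sum(exp(a))      % approx [0.98; 0.02], i.e., class 1
x = [-2; 0];                  % a point with negative x coordinate
a = w'*x + b';
p = exp(a) / sum(exp(a))      % approx [0.02; 0.98], i.e., class 2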
Fitting a softmax function can be done using the iteratively reweighted least squares (IRLS) algorithm. We use the implementation from netlab. Note that since the softmax distribution is not in the exponential family, it does not have finite sufficient statistics, and hence we must store all the training data in uncompressed form. If this takes too much space, one should use online (stochastic) gradient descent (not implemented in BNT).
(Figure: two mixture-of-experts graphical models; X is the observed input, the Q nodes are the hidden gating nodes, and Y is the output.)
X is the observed input, Y is the output, and the Q nodes are hidden "gating" nodes, which select the appropriate set of parameters for Y. During training, Y is assumed observed, but for testing, the goal is to predict Y given X. Note that this is a conditional density model, so we don't associate any parameters with X. Hence X's CPD will be a root CPD, which is a way of modelling exogenous nodes. If the output is a continuous-valued quantity, we assume the "experts" are linear-regression units, and set Y's CPD to linear-Gaussian. If the output is discrete, we set Y's CPD to a softmax function. The Q CPDs will always be softmax functions.
As a concrete example, consider the mixture of experts model where X and Y are scalars, and Q is binary. This is just piecewise linear regression, where we have two line segments.
We can create this model with random parameters as follows.
X = 1; Q = 2; Y = 3;
dag = zeros(3,3);
dag(X,[Q Y]) = 1;
dag(Q,Y) = 1;
ns = [1 2 1];      % make X and Y scalars, and have 2 experts
dnodes = [2];
onodes = [1 3];    % observed nodes
bnet = mk_bnet(dag, ns, dnodes);

rand('state', 0);
randn('state', 0);
bnet.CPD{1} = root_CPD(bnet, 1);
bnet.CPD{2} = softmax_CPD(bnet, 2);
bnet.CPD{3} = gaussian_CPD(bnet, 3);

Now let us fit this model using EM. First we load the 200 training cases and plot them.
load data -ascii
plot(data(:,1), data(:,2), '.');
This is what the model looks like before training. (Thanks to Thomas Hoffman for writing this plotting routine.)
Now let's train the model, and plot the final performance.
ncases = size(data, 1);              % each row of data is a training case
cases = cell(3, ncases);
cases([1 3], :) = num2cell(data');   % each column of cases is a training case
engine = jtree_inf_engine(bnet, onodes);
max_iter = 20;
[bnet2, LLtrace] = learn_params_em(engine, cases, max_iter);

(We specify which nodes will be observed when we create the engine. Hence BNT knows that the hidden nodes are all discrete. For complex models, this can lead to a significant speedup.) Below we show what the model looks like after 16 iterations of EM (with 100 IRLS iterations per M step), when it converged using the default convergence tolerance (that the fractional change in the log-likelihood be less than 1e-3). Before learning, the log-likelihood was -322.927442; afterwards, it was -13.728778.
For more details on the softmax function, the hierarchical mixtures of experts model and logistic regression respectively, I recommend the following:
When we have C->D arcs, where C is hidden, we need to use approximations. One approach (not implemented in BNT) is described in
A  B    P(C=off)     P(C=on)
-------------------------------
F  F    1.0          0.0
T  F    q(A)         1-q(A)
F  T    q(B)         1-q(B)
T  T    q(A)q(B)     1-q(A)q(B)

Thus we see that the causes get inhibited independently. It is common to associate a "leak" node with a noisy-or CPD, which is like a parent that is always on. This can account for all other unmodelled causes which might turn the node on.
The noisy-or distribution is similar to the logistic distribution defined earlier. To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j) be the prob. that j inhibits i. Then
Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)

Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then
Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))

For a sigmoid node, we have
Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))

where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of the activation function (although both are monotonically increasing). In addition, in the case of a noisy-or, the weights are constrained to be positive, since they derive from probabilities q(i,j).
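As a small numerical check of this equivalence, consider a node with two parents, both on, and made-up inhibition probabilities 0.1 and 0.2:

q  = [0.1 0.2];                 % inhibition probabilities q(i,j)
Sj = [1 1];                     % both parents are on
p1 = 1 - prod(q .^ Sj)          % noisy-OR directly: 1 - 0.1*0.2 = 0.98
w  = -log(q);                   % the corresponding weights w(i,j)
p2 = 1 - exp(-sum(w .* Sj))     % rho(sum_j w(i,j) S(j)) = 0.98 as well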
function bnet = mk_qmr_bnet(G, inhibit, leak, prior)
% MK_QMR_BNET Make a QMR model
% bnet = mk_qmr_bnet(G, inhibit, leak, prior)
%
% G(i,j) = 1 iff there is an arc from disease i to finding j
% inhibit(i,j) = inhibition probability on i->j arc
% leak(j) = inhibition prob. on leak->j arc
% prior(i) = prob. disease i is on

[Ndiseases Nfindings] = size(inhibit);
N = Ndiseases + Nfindings;
finding_node = Ndiseases+1:N;
ns = 2*ones(1,N);
dag = zeros(N,N);
dag(1:Ndiseases, finding_node) = G;
bnet = mk_bnet(dag, ns);

for d=1:Ndiseases
  CPT = [1-prior(d) prior(d)];
  bnet.CPD{d} = tabular_CPD(bnet, d, CPT');
end

for i=1:Nfindings
  fnode = finding_node(i);
  ps = parents(G, i);
  bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i));
end

In the file BNT/examples/static/qmr1, we create a random bipartite graph G, with 5 diseases and 10 findings, and random parameters. (In general, to create a random dag, use 'mk_random_dag'.) (We are not allowed to publish the actual parameters, since they are proprietary.) We can visualize the resulting graph structure using graph drawing code contributed by Ali Taylan Cemgil from the University of Nijmegen:
draw_layout(bnet.dag);
pos = 2:floor(Nfindings/2);
neg = (pos(end)+1):(Nfindings-1);
onodes = myunion(pos, neg);
evidence = cell(1, N);
evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos)));
evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg)));

engine = jtree_inf_engine(bnet, onodes);
[engine, ll] = enter_evidence(engine, evidence);
post = zeros(1, Ndiseases);
for i=diseases(:)'
  m = marginal_nodes(engine, i);
  post(i) = m.T(2);
end

The junction tree algorithm is quite slow on the QMR network, since the cliques are so big. One simple trick we can use is to notice that hidden leaves do not affect the posteriors on the roots, and hence do not need to be included in the network. A second trick is to notice that the negative findings can be "absorbed" into the prior: see the file BNT/examples/static/mk_minimal_qmr_bnet for details.
engine = quickscore_inf_engine(inhibit, leak, prior); engine = enter_evidence(engine, pos, neg); m = marginal_nodes(engine, i);
max_iter = 30;
engine = loopy_pearl_inf_engine(bnet, max_iter);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);

We found that this algorithm often converges, and when it does, often is very accurate, but it depends on the precise setting of the parameter values of the network. (See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.) This is a topic of ongoing research.
BNT does support importance sampling, however: see likelihood_weighting_inf_engine.
An inference engine is an object that contains a bnet and supports the 'enter_evidence' and 'marginal_nodes' methods. The engine constructor takes the bnet as argument and may do some model-specific processing. When 'enter_evidence' is called, the engine may do some evidence-specific processing. Finally, when 'marginal_nodes' is called, the engine may do some query-specific processing.
The inference engines differ in many other important ways. Here are some of the major "axes":
Although being maximally lazy may seem desirable, this is not always the most efficient. For example, when learning, we need to call marginal_nodes N times, where N is the number of nodes. Variable elimination would end up repeating a lot of work each time marginal_nodes is called, making it inefficient for learning. The junction tree algorithm, by contrast, uses dynamic programming to avoid this redundant computation --- it calculates all marginals in two passes during 'enter_evidence', so calling 'marginal_nodes' takes constant time.
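For instance, with the jtree engine the pattern used throughout this documentation amortizes the cost over all queries (a sketch, reusing the sprinkler network from above):

[engine, ll] = enter_evidence(engine, evidence);   % all the message passing happens here
for i=1:N
  m = marginal_nodes(engine, i);                   % each of these calls is now cheap
end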
| Name | Exact? | Node type | Topology | Evidence | Model-specific processing | Evidence-specific processing | Query-specific processing |
|---|---|---|---|---|---|---|---|
| cond_gauss | exact | CG | all | any | compute joint | condition joint | marginalize joint |
| enumerative | exact | allD | all | any | none | none | marginalize joint |
| gaussian | exact | allG | all | any | compute joint | condition joint | marginalize joint |
| jtree | exact | D,G,CG | all | any | find elim order | propagate msgs | marginalize clique |
| jtree_fast | exact | D | all | fixed | find elim order | propagate msgs | marginalize clique |
| jtree_onepass | exact | D,G,CG | any | all | find elim order | propagate msgs | lookup |
| jtree_onepass_fast | exact | D | fixed | all | find elim order | propagate msgs | lookup |
| likelihood_weighting | approx | any | all | any | none | sample | lookup |
| loopy_pearl | approx | D,G | all | any | none | propagate msgs | lookup |
| pearl | exact | D,G | polytree | any | none | propagate msgs | lookup |
| quickscore | exact | noisy-or | QMR | leaves | none | marginalize root nodes | lookup |
| var_elim | exact | D,G,CG | all | any | none | none | nested marginalizations |
Documentation last updated: 22 November 2000