BNT_HOME = 'C:\murphyk\matlab\BNT';

assuming I downloaded the file into C:\murphyk\matlab.
>> cd C:\murphyk\matlab\BNT
To specify this directed acyclic graph (dag), we create an adjacency matrix:
N = 4; dag = zeros(N,N); C = 1; S = 2; R = 3; W = 4; dag(C,[R S]) = 1; dag(R,W) = 1; dag(S,W)=1;
We have numbered the nodes as follows: Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4. The nodes must always be numbered in topological order, i.e., ancestors before descendants. For a more complicated graph, this is a little inconvenient: we will see how to get around this below.
In addition to specifying the graph structure, we must specify the size and type of each node. If a node is discrete, its size is the number of possible values each node can take on; if a node is continuous, it can be a vector, and its size is the length of this vector. In this case, we will assume all nodes are discrete and binary.
discrete_nodes = 1:N;
node_sizes = 2*ones(1,N);

We are now ready to make the Bayes net.
bnet = mk_bnet(dag, node_sizes, discrete_nodes);

There are several optional arguments to the 'mk_bnet' function, some of which we will see later. In general, to find out more about a function, please see its documentation string by typing
help mk_bnet

See also other useful Matlab tips.
A model consists of the graph structure and the parameters. The parameters are represented by CPD objects (CPD = Conditional Probability Distribution), which define the probability distribution of a node given its parents. (We will use the terms "node" and "random variable" interchangeably.) The simplest kind of CPD is a table (multi-dimensional array), which is suitable when all the nodes are discrete-valued. Note that the discrete values are not assumed to be ordered in any way; that is, they represent categorical quantities, like male and female, rather than ordinal quantities, like low, medium and high.
Tabular CPDs, also called CPTs (conditional probability tables), are stored as multidimensional arrays, where the dimensions are arranged in the same order as the nodes, e.g., the CPT for node 4 (WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself. Hence the child is always the last dimension. If a node has no parents, its CPT is a column vector representing its prior. Note that in Matlab (unlike C), arrays are indexed from 1, and are laid out in memory such that the first index toggles fastest, e.g., the CPT for node 4 (WetGrass) is as follows

S  R    P(W=F)   P(W=T)
------------------------
F  F    1.0      0.0
T  F    0.1      0.9
F  T    0.1      0.9
T  T    0.01     0.99
where we have used the convention that false==1, true==2. We can create this CPT in Matlab as follows
CPT = zeros(2,2,2);
CPT(1,1,1) = 1.0;
CPT(2,1,1) = 0.1;
...

Here is an easier way:
CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);

In fact, we don't need to reshape the array, since the CPD constructor will do that for us. So we can just write
bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);

If we omit the third argument to tabular_CPD, random parameters will be created. To ensure repeatable results, use
rand('state', seed); randn('state', seed);
engine = jtree_inf_engine(bnet);

The other engines have similar constructors, but might take additional, algorithm-specific parameters. All engines are used in the same way, once they have been created. We illustrate this in the following sections.
Suppose we want to compute the probability that the sprinkler was on given that the grass is wet. The evidence consists of the fact that W=2. All the other nodes are hidden (unobserved). We can specify this as follows.
evidence = cell(1,N);
evidence{W} = 2;

We use a 1D cell array instead of a vector to cope with the fact that nodes can be vectors of different lengths. In addition, the value [] can be used to denote 'no evidence', instead of having to specify the observation pattern as a separate argument.
We are now ready to add the evidence to the engine.
[engine, loglik] = enter_evidence(engine, evidence);

The behavior of this function is algorithm-specific, and is discussed in more detail below. In the case of the jtree engine, enter_evidence implements a two-pass message-passing scheme. The first return argument contains the modified engine, which incorporates the evidence. The second return argument contains the log-likelihood of the evidence. (Not all engines are capable of computing the log-likelihood.)
Finally, we can compute p=P(S=2|W=2) as follows.
marg = marginal_nodes(engine, S);
p = marg.T(2);

We find that p = 0.4298.
Now let us add the evidence that it was raining, and see what difference it makes.
evidence{R} = 2;
[engine, loglik] = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, S);
p = marg.T(2);

We find that p = P(S=2|W=2,R=2) = 0.1945, which is lower than before, because the rain can "explain away" the fact that the grass is wet.
Note: you can plot a marginal distribution over a discrete variable as a barchart using the built-in 'bar' function.
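For example, continuing the sprinkler example above, something along these lines will do (the tick labels are just for readability):

marg = marginal_nodes(engine, S);
bar(marg.T)                            % bar heights are P(S=1|evidence) and P(S=2|evidence)
set(gca, 'XTickLabel', {'off','on'})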
evidence = cell(1,N);
[engine, ll] = enter_evidence(engine, evidence);
m = marginal_nodes(engine, [S R W]);

m is a structure. The 'T' field is a multi-dimensional array (in this case, 3-dimensional) that contains the joint probability distribution on the specified nodes.
>> m.T
ans(:,:,1) =
    0.2900    0.0410
    0.0210    0.0009
ans(:,:,2) =
         0    0.3690
    0.1890    0.0891

We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass to be wet if both the rain and sprinkler are off.
Let us now add some evidence to R.
evidence{R} = 2;
[engine, ll] = enter_evidence(engine, evidence);
m = marginal_nodes(engine, [S R W])
m =
    domain: [2 3 4]
         T: [2x1x2 double]
>> m.T
ans(:,:,1) =
    0.0820
    0.0018
ans(:,:,2) =
    0.7380
    0.1782

The joint T(i,j,k) = P(S=i,R=j,W=k|evidence) should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible with the evidence that R=2. Instead of creating large tables with many 0s, BNT sets the effective size of observed (discrete) nodes to 1. This is why m.T has size 2x1x2. We can convert this back to a 2x2x2 table as follows.
fullm = add_ev_to_dmarginal(m, evidence, ns)
Note: It is not always possible to compute the joint on arbitrary sets of nodes: it depends on the engine, as discussed in more detail below.
nsamples = 50;
samples = sample_bnet(bnet, nsamples);
data = cell2num(samples);

sample_bnet returns a cell array because, in general, each node might be a vector of different length. In this case, all nodes are discrete (and hence scalars), so we convert samples to a regular array: data(i,l) = samples{i,l} is the value of node i in case l.
Now we create a network with random parameters.
% Make a tabula rasa
bnet2 = mk_bnet(dag, ns);
seed = 0;
rand('state', seed);
bnet2.CPD{C} = tabular_CPD(bnet2, C);
bnet2.CPD{R} = tabular_CPD(bnet2, R);
bnet2.CPD{S} = tabular_CPD(bnet2, S);
bnet2.CPD{W} = tabular_CPD(bnet2, W);

Finally, we find the maximum likelihood estimates of the parameters.
[bnet3, LL] = learn_params_tabular(bnet2, data);

To view the learned parameters, we use a little Matlab hackery. (The kosher approach would be to implement a display method for each CPD object.)
CPT3 = cell(1,N);
for i=1:N
  s = struct(bnet3.CPD{i});  % violate object privacy
  CPT3{i} = s.CPT;
end

Here are the results.
celldisp(CPT3)
CPT3{1} =
    0.6000    0.4000
CPT3{2} =
    0.5000    0.5000
    1.0000         0
CPT3{3} =
    0.8667    0.1333
    0.2000    0.8000
CPT3{4} =
(:,:,1) =
    1.0000         0
    0.3846         0
(:,:,2) =
         0    1.0000
    0.6154    1.0000

So we see that the learned parameters are fairly close to the original ones, which we display below for completeness.
CPT{1} =
    0.5000    0.5000
CPT{2} =
    0.5000    0.5000
    0.9000    0.1000
CPT{3} =
    0.8000    0.2000
    0.2000    0.8000
CPT{4} =
(:,:,1) =
    1.0000    0.1000
    0.1000    0.0100
(:,:,2) =
         0    0.9000
    0.9000    0.9900

We can get better results by using a larger training set, or using informative priors (see below).
The Dirichlet has a simple interpretation in terms of pseudo counts. If we let N_ijk = the number of times X_i=k and Pa_i=j occurs in the training set, where Pa_i are the parents of X_i, then the maximum likelihood (ML) estimate is T_ijk = N_ijk / N_ij (where N_ij = sum_k' N_ijk'), which will be 0 if N_ijk=0. To prevent us from declaring that (X_i=k, Pa_i=j) is impossible just because this event was not seen in the training set, we can pretend we saw value k of X_i, for each value j of Pa_i, alpha_ijk times in the past. The MAP (maximum a posteriori) estimate is then
T_ijk = (N_ijk + alpha_ijk) / (N_ij + alpha_ij)

and is never 0 if all alpha_ijk > 0. For example, consider the network A->B, where A is binary and B has 3 values. A Jeffreys prior for B has the form
        B=1   B=2   B=3
A=1     1/6   1/6   1/6
A=2     1/6   1/6   1/6

The sum of all the entries is called the equivalent sample size (ess). In BNT, to create this prior, with random CPT entries, use
ess = 1;
tabular_CPD(bnet, i, 'rnd', ess);

ess=0 gives a 0 (uninformative) prior, which is equivalent to ML estimation.
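To make the pseudo-count arithmetic concrete, here is a small made-up example for a single row of B's CPT (i.e., one value j of the parent A):

N_ijk = [0 3 7];                % counts of B=1,2,3 in cases where A=j
alpha = [1/6 1/6 1/6];          % pseudo counts from the Jeffreys prior above
T_ML  = N_ijk / sum(N_ijk)      % ML estimate: [0 0.3 0.7]; the first entry is 0
T_MAP = (N_ijk + alpha) / (sum(N_ijk) + sum(alpha))   % MAP estimate: never 0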
The marginal likelihood of a case (which integrates out the parameter values) can be computed by setting each row of the CPT to the normalisation of the corresponding row of the prior. In the above example, this becomes
        B=1   B=2   B=3
A=1     1/3   1/3   1/3
A=2     1/3   1/3   1/3

This is a stochastic matrix, since each row sums to 1. This can be created as follows:
ess = 1; tabular_CPD(bnet, i, 'unif', ess);
use_prior = 1; data = cell2num(samples); [bnet3, LL3, bic_score, mscore] = learn_params_tabular(bnet2, data, use_prior);
LL3 is the log-likelihood of the data computed at the final MAP estimate of the parameters:
LL3 = log \prod_l P(X_i | Pa_i, data(:,l), theta_hat)

where theta_hat is the MAP estimate.
mscore is the log marginal likelihood, which integrates out the parameters:
mscore = log \int_theta \prod_l P(X_i | Pa_i, data(:,l), theta) P(theta)

bic_score is a large sample approximation to mscore; it is computed by assuming the parameter posterior is approximately Gaussian. This is known as the Bayesian Information Criterion (BIC).
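In its standard form, this approximation is

BIC = log P(data | theta_hat) - (d/2) log(ncases)

where d is the number of free parameters in the model and ncases is the number of training cases.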
We can compute the same marginal likelihood sequentially (one case at a time) as follows.
LL4 = 0;
bnet4 = bnet2;
for l=1:nsamples
  LL4 = LL4 + log_likelihood(bnet4, samples(:,l));
  bnet4 = update_params(bnet4, samples(:,l));
end

The parameters of bnet4 are identical to bnet3.
LL4 is given by
LL4 = log P(x1 | th0) + log P(x2 | th1) + ... + log P(xn | th(n-1))

where th(i) = the result of Bayesian updating applied to th(i-1) and x(i)=samples(i), and th(0) is specified by the prior. It is simple to show that LL4 = mscore, provided we set the initial CPT entries to the mean of the initial prior, so that log_likelihood computes the log marginal likelihood. (This can be done by making the initial CPTs uniform, and the initial prior a Jeffreys prior.) See the file BNT/examples/static/seqlearn1.m for details.
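As a quick sanity check of this claim (assuming the uniform-CPT, Jeffreys-prior initialisation just described), one can compare the two quantities directly:

assert(approxeq(LL4, mscore))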
samples2 = samples;
hide = rand(N, nsamples) > 0.5;
[I,J] = find(hide);
for k=1:length(I)
  samples2{I(k), J(k)} = [];
end

samples2{i,l} is the value of node i in training case l, or [] if unobserved.
Now we will compute the MLEs using the EM algorithm. We need to use an inference algorithm to compute the expected sufficient statistics in the E step; the M (maximization) step is as above.
engine2 = jtree_inf_engine(bnet2);
max_iter = 10;
[bnet4, LLtrace] = learn_params_em(engine2, samples2, max_iter);

LLtrace(i) is the log-likelihood at iteration i. We can plot this as follows:
plot(LLtrace, 'x-')

Let's display the results after 10 iterations of EM.
celldisp(CPT4)
CPT4{1} =
    0.6616    0.3384
CPT4{2} =
    0.6510    0.3490
    0.8751    0.1249
CPT4{3} =
    0.8366    0.1634
    0.0197    0.9803
CPT4{4} =
(:,:,1) =
    0.8276    0.0546
    0.5452    0.1658
(:,:,2) =
    0.1724    0.9454
    0.4548    0.8342

We can get improved performance by (1) increasing the size of the training set, (2) decreasing the amount of hidden data, (3) running EM for longer, or (4) using informative priors.
We now give a simple example of how to use K2. We use the sprinkler network and sample 50 cases from it as before. Then we can recover the exact structure as follows:
order = [C S R W];
max_fan_in = 2;
dag = learn_struct_K2(data, ns, order, max_fan_in);

The PC algorithm takes as arguments a function f, the number of nodes N, the maximum fan in K, and additional arguments A which are passed to f. The function f(X,Y,S,A) returns 1 if X is conditionally independent of Y given S, and 0 otherwise. For example, suppose we cheat by passing in the true graph structure as the argument A, and use d-separation to implement the CI test, i.e., f(X,Y,S) calls dsep(X,Y,S,dag).
pdag = learn_struct_pdag_pc('dsep', N, max_fan_in, dag);

pdag(i,j) = -1 if there is definitely an i->j arc, and pdag(i,j) = 1 if there is either an i->j or an i<-j arc. (Graphs which belong to the same Markov equivalence class can have the direction of some of their arcs reversed without changing any of the CI relationships. The pdag (partially oriented dag, or pattern) is a concise representation of this equivalence class.)
Applied to the sprinkler network, this returns
pdag =
     0     1     1     0
     1     0     0    -1
     1     0     0    -1
     0     0     0     0

So as expected, we see that the V-structure at the W node is uniquely identified, but the other arcs have ambiguous orientation.
We now give an example from p141 of the SGS book. This example concerns the female orgasm. We are given a correlation matrix C between 7 measured factors (such as subjective experiences of coital and masturbatory experiences), derived from 281 samples, and want to learn a causal model of the data. We will not discuss the merits of this type of work here, but merely show how to reproduce the results in the SGS book. Their program, tetrad, makes use of the Fisher Z-test for conditional independence, so we do the same:
max_fan_in = 4;
nsamples = 281;
alpha = 0.05;
pdag = learn_struct_pdag_pc('cond_indep_fisher_z', n, max_fan_in, C, nsamples, alpha);

In this case, the CI test is
f(X,Y,S) = cond_indep_fisher_z(X,Y,S, C, nsamples, alpha)

The results match those of Fig 12a of SGS apart from two edge differences; presumably this is due to rounding error (although it could be a bug). This example can be found in the file BNT/examples/static/pc2.m.
We can create this model as follows.
F = 1; W = 2; E = 3; B = 4; C = 5; D = 6; Min = 7; Mout = 8; L = 9;
n = 9;
dag = zeros(n);
dag(F,E) = 1;
dag(W,[E Min D]) = 1;
dag(E,D) = 1;
dag(B,[C D]) = 1;
dag(D,[L Mout]) = 1;
dag(Min,Mout) = 1;
% node sizes - all cts nodes are scalar, all discrete nodes are binary
ns = ones(1, n);
dnodes = [F W B];
cnodes = mysetdiff(1:n, dnodes);
ns(dnodes) = 2;
bnet = mk_bnet(dag, ns, dnodes);

'dnodes' is a list of the discrete nodes; 'cnodes' lists the continuous nodes. 'mysetdiff' is a faster version of the built-in 'setdiff'.
The parameters of the discrete nodes can be specified as follows.
bnet.CPD{B} = tabular_CPD(bnet, B, [0.85 0.15]'); % 1=stable, 2=unstable
bnet.CPD{F} = tabular_CPD(bnet, F, [0.95 0.05]'); % 1=intact, 2=defect
bnet.CPD{W} = tabular_CPD(bnet, W, [2/7 5/7]');   % 1=industrial, 2=household

In general, Gaussian nodes are parameterized as follows. Suppose the node is called Y, its continuous parents (if any) are called X, and its discrete parents (if any) are called Q. Then the distribution on Y is defined as follows:
- no parents:               Y ~ N(mu, Sigma)
- cts parents:              Y|X=x ~ N(mu + W x, Sigma)
- discrete parents:         Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
- cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i))

where N(mu, Sigma) denotes a Normal distribution with mean mu and covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q respectively. If there are no discrete parents, |Q|=1; if there is more than one, then |Q| = a vector of the sizes of each discrete parent. If there are no continuous parents, |X|=0; if there is more than one, then |X| = the sum of their sizes. Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight) matrix.
We can specify the parameters of the model as follows. (If we omit any parameters, random values of the appropriate size will be created.)
mu = reshape([-3.9 -0.4 -3.2 -0.5], [1 2 2]);
Sigma = reshape([0.00002 0.0001 0.00002 0.0001], [1 1 2 2]);
bnet.CPD{E} = gaussian_CPD(bnet, E, mu, Sigma);

mu = reshape([6.5 6.0 7.5 7.0], [1 2 2]);
Sigma = reshape([0.03 0.04 0.1 0.1], [1 1 2 2]);
weights = reshape([1 1 1 1], [1 1 2 2]);
bnet.CPD{D} = gaussian_CPD(bnet, D, mu, Sigma, weights);

mu = reshape([-2 -1], [1 2]);
Sigma = reshape([0.1 0.3], [1 1 2]);
bnet.CPD{C} = gaussian_CPD(bnet, C, mu, Sigma);

bnet.CPD{L} = gaussian_CPD(bnet, L, 3, 0.25, -0.5);

mu = reshape([0.5 -0.5], [1 2]);
Sigma = reshape([0.01 0.005], [1 1 2]);
bnet.CPD{Min} = gaussian_CPD(bnet, Min, mu, Sigma);

bnet.CPD{Mout} = gaussian_CPD(bnet, Mout, 0, 0.002, [1 1]);
engine = jtree_inf_engine(bnet);
evidence = cell(1,n);
[engine, ll] = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, E);

'marg' is a structure that contains the fields 'mu' and 'Sigma', which contain the mean and (co)variance of the marginal on E. In this case, they are both scalars. Let us check they match the published figures to 2dp. (We can't expect more precision than this in general because I have implemented the algorithm of Lauritzen (1992), which can be numerically unstable.)
tol = 1e-2;
assert(approxeq(marg.mu, -3.25, tol));
assert(approxeq(sqrt(marg.Sigma), 0.709, tol));

We can compute the other posteriors similarly. Now let us add some evidence.
evidence = cell(1,n);
evidence{W} = 1;    % industrial
evidence{L} = 1.1;
evidence{C} = -0.9;
[engine, ll] = enter_evidence(engine, evidence);

Now we find
marg = marginal_nodes(engine, E);
assert(approxeq(marg.mu, -3.8983, tol));
assert(approxeq(sqrt(marg.Sigma), 0.0763, tol));

We can also compute the joint probability on a set of nodes. For example, P(D, Mout | evidence) is a 2D Gaussian:
marg = marginal_nodes(engine, [D Mout])
marg =
    domain: [6 8]
        mu: [2x1 double]
     Sigma: [2x2 double]
         T: 1.0000

The mean is
marg.mu
ans =
    3.6077
    4.1077

and the covariance matrix is
marg.Sigma
ans =
    0.1062    0.1062
    0.1062    0.1182

It is easy to visualize this posterior using standard Matlab plotting functions (see e.g., BNT/misc/gaussplot). The T field indicates that the mixing weight of this Gaussian component is 1.0. If the joint contains discrete and continuous variables, the result will be a mixture of Gaussians, e.g.,
marg = marginal_nodes(engine, [F E])
    domain: [1 3]
        mu: [-3.9000 -0.4003]
     Sigma: [1x1x2 double]
         T: [0.9995 4.7373e-04]

The interpretation is Sigma(i,j,k) = Cov[ E(i) E(j) | F=k ]. In this case, E is a scalar, so i=j=1; k specifies the mixture component.
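Going back to the 2D joint on [D Mout] above, here is a minimal sketch of such a visualization using only standard plotting commands (gaussplot, mentioned above, wraps up something similar); the grid limits are arbitrary choices around the mean:

marg = marginal_nodes(engine, [D Mout]);
[x1, x2] = meshgrid(linspace(2.5, 4.5, 100), linspace(3, 5, 100));
dx = [x1(:) - marg.mu(1), x2(:) - marg.mu(2)];          % deviations from the mean
p = exp(-0.5 * sum((dx / marg.Sigma) .* dx, 2)) / (2*pi*sqrt(det(marg.Sigma)));
contour(x1, x2, reshape(p, size(x1)));                  % contours of the Gaussian density
xlabel('D'); ylabel('Mout');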
We saw in the sprinkler network that BNT sets the effective size of observed discrete nodes to 1, since they only have one legal value. For continuous nodes, BNT sets their length to 0, since they have been reduced to a point. For example,
marg = marginal_nodes(engine, [B C])
    domain: [4 5]
        mu: []
     Sigma: []
         T: [0.0123 0.9877]

It is simple to post-process the output of marginal_nodes. For example, the file BNT/examples/static/cg1 sets the mu term of observed nodes to their observed value, and the Sigma term to 0 (since observed nodes have no variance).
(Figure: four graphical models, panels (a)-(d); panel (c) shows the mixture of factor analysers discussed below.)
We can create this model in BNT as follows.
ns = [k D];
dag = zeros(2,2);
dag(1,2) = 1;
dnodes = [];
onodes = 2;
bnet = mk_bnet(dag, ns, dnodes);
bnet.CPD{1} = gaussian_CPD(bnet, 1, zeros(k,1), eye(k), [], 'diag', 'untied', 'clamp_mean', 'clamp_cov');
bnet.CPD{2} = gaussian_CPD(bnet, 2, zeros(D,1), diag(Psi0), W0, 'diag', 'untied', 'clamp_mean');

The root node is clamped to the N(0,I) distribution, so that we will not update these parameters during learning. The mean of the leaf node is clamped to 0, since we assume the data has been centered (had its mean subtracted off); this is just for simplicity. Finally, the covariance of the leaf node is constrained to be diagonal. W0 and Psi0 are the initial parameter guesses.
We can fit this model using EM in the usual way (see the file BNT/examples/static/fa1 for details). Not surprisingly, the ML estimates for mu and Psi turn out to be identical to the sample mean and variance, which can be computed directly as
mu_ML = mean(data);
Psi_ML = diag(cov(data));

Note that W can only be identified up to a rotation matrix, because of the spherical symmetry of the source.
If we restrict Psi to be spherical, i.e., Psi = sigma I, there is a closed-form solution for W as well, i.e., we do not need to use EM. In particular, W contains the first |X| eigenvectors of the sample covariance matrix, with scalings determined by the eigenvalues and sigma. Classical PCA can be obtained by taking the sigma->0 limit. For details, see
By adding a hidden discrete variable, we can create mixtures of FA models, as shown in (c). Now we can explain the data using a set of subspaces. We can create this model in BNT as follows.
ns = [M k D];
dag = zeros(3);
dag(1,3) = 1;
dag(2,3) = 1;
dnodes = 1;
onodes = 3;
bnet = mk_bnet(dag, ns, dnodes);
bnet.CPD{1} = tabular_CPD(bnet, 1, Pi0);
bnet.CPD{2} = gaussian_CPD(bnet, 2, zeros(k, 1), eye(k), [], 'diag', 'untied', 'clamp_mean', 'clamp_cov');
bnet.CPD{3} = gaussian_CPD(bnet, 3, Mu0', repmat(diag(Psi0), [1 1 M]), W0, 'diag', 'tied');

Notice how the covariance matrix for Y is the same for all values of Q; that is, the noise level in each sub-space is assumed the same. However, we allow the offset, mu, to vary. For details, see
I have included Zoubin's specialized MFA code (with his permission) with the toolbox, so you can check that BNT gives the same results: see 'BNT/examples/static/mfa1.m'.
Independent Factor Analysis (IFA) generalizes FA by allowing a non-Gaussian prior on each component of X. (Note that we can approximate a non-Gaussian prior using a mixture of Gaussians.) This means that the likelihood function is no longer rotationally invariant, so we can uniquely identify W and the hidden sources X. IFA also allows a non-diagonal Psi (i.e. correlations between the components of Y). We recover classical Independent Components Analysis (ICA) in the Psi -> 0 limit, and by assuming that |X|=|Y|, so that the weight matrix W is square and invertible. For details, see
                      exp(w(:,i)'*x + b(i))
Pr(Q=i | X=x) = -----------------------------------
                   sum_j exp(w(:,j)'*x + b(j))

The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the following interpretation: w(:,i)-w(:,j) is the normal vector to the decision boundary between classes i and j, and b(i)-b(j) is its offset (bias). For example, suppose X is a 2-vector, and Q is binary. Then
w = [1 -1; 0 0]; b = [0 0];

means that class 1 consists of points in the 2D plane with positive x coordinate, and class 2 of points with negative x coordinate. If w has large magnitude, the decision boundary is sharp, otherwise it is soft. In the special case that Q is binary (0/1), the softmax function reduces to the logistic (sigmoid) function.
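To see these parameters in action, we can evaluate the softmax by hand at a couple of test points (the probabilities in the comments are approximate):

x = [2; 0];                   % a point with positive x coordinate
a = w'*x + b';                % one activation per class
p = exp(a) / sum(exp(a))      % approx [0.98; 0.02], i.e., class 1
x = [-2; 0];                  % a point with negative x coordinate
a = w'*x + b';
p = exp(a) / sum(exp(a))      % approx [0.02; 0.98], i.e., class 2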
Fitting a softmax function can be done using the iteratively reweighted least squares (IRLS) algorithm. We use the implementation from netlab. Note that since the softmax distribution is not in the exponential family, it does not have finite sufficient statistics, and hence we must store all the training data in uncompressed form. If this takes too much space, one should use online (stochastic) gradient descent (not implemented in BNT).
(Figure: two mixture-of-experts graphical models; X is the observed input, the Q nodes are the hidden gating nodes, and Y is the output.)
X is the observed input, Y is the output, and the Q nodes are hidden "gating" nodes, which select the appropriate set of parameters for Y. During training, Y is assumed observed, but for testing, the goal is to predict Y given X. Note that this is a conditional density model, so we don't associate any parameters with X. Hence X's CPD will be a root CPD, which is a way of modelling exogenous nodes. If the output is a continuous-valued quantity, we assume the "experts" are linear-regression units, and set Y's CPD to linear-Gaussian. If the output is discrete, we set Y's CPD to a softmax function. The Q CPDs will always be softmax functions.
As a concrete example, consider the mixture of experts model where X and Y are scalars, and Q is binary. This is just piecewise linear regression, where we have two line segments.
We can create this model with random parameters as follows.
X = 1; Q = 2; Y = 3;
dag = zeros(3,3);
dag(X,[Q Y]) = 1;
dag(Q,Y) = 1;
ns = [1 2 1];      % make X and Y scalars, and have 2 experts
dnodes = [2];
onodes = [1 3];    % observed nodes
bnet = mk_bnet(dag, ns, dnodes);

rand('state', 0);
randn('state', 0);
bnet.CPD{1} = root_CPD(bnet, 1);
bnet.CPD{2} = softmax_CPD(bnet, 2);
bnet.CPD{3} = gaussian_CPD(bnet, 3);

Now let us fit this model using EM. First we load the 200 training cases and plot them.
load data -ascii
plot(data(:,1), data(:,2), '.');
This is what the model looks like before training. (Thanks to Thomas Hoffman for writing this plotting routine.)
Now let's train the model, and plot the final performance.
ncases = size(data, 1);              % each row of data is a training case
cases = cell(3, ncases);
cases([1 3], :) = num2cell(data');   % each column of cases is a training case
engine = jtree_inf_engine(bnet, onodes);
max_iter = 20;
[bnet2, LLtrace] = learn_params_em(engine, cases, max_iter);

(We specify which nodes will be observed when we create the engine. Hence BNT knows that the hidden nodes are all discrete. For complex models, this can lead to a significant speedup.) Below we show what the model looks like after 16 iterations of EM (with 100 IRLS iterations per M step), when it converged using the default convergence tolerance (that the fractional change in the log-likelihood be less than 1e-3). Before learning, the log-likelihood was -322.927442; afterwards, it was -13.728778.
For more details on the softmax function, the hierarchical mixtures of experts model and logistic regression respectively, I recommend the following:
When we have C->D arcs, where C is hidden, we need to use approximations. One approach (not implemented in BNT) is described in
A  B    P(C=off)     P(C=on)
-------------------------------
F  F    1.0          0.0
T  F    q(A)         1-q(A)
F  T    q(B)         1-q(B)
T  T    q(A)q(B)     1-q(A)q(B)

Thus we see that the causes get inhibited independently. It is common to associate a "leak" node with a noisy-or CPD, which is like a parent that is always on. This can account for all other unmodelled causes which might turn the node on.
The noisy-or distribution is similar to the logistic distribution defined earlier. To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j) be the prob. that j inhibits i. Then
Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)

Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then
Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))

For a sigmoid node, we have
Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))

where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of the activation function (although both are monotonically increasing). In addition, in the case of a noisy-or, the weights are constrained to be positive, since they derive from probabilities q(i,j).
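As a small numerical check of this equivalence, consider a node with two parents, both on, and made-up inhibition probabilities 0.1 and 0.2:

q  = [0.1 0.2];                 % inhibition probabilities q(i,j)
Sj = [1 1];                     % both parents are on
p1 = 1 - prod(q .^ Sj)          % noisy-OR directly: 1 - 0.1*0.2 = 0.98
w  = -log(q);                   % the corresponding weights w(i,j)
p2 = 1 - exp(-sum(w .* Sj))     % rho(sum_j w(i,j) S(j)) = 0.98 as well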
function bnet = mk_qmr_bnet(G, inhibit, leak, prior)
% MK_QMR_BNET Make a QMR model
% bnet = mk_qmr_bnet(G, inhibit, leak, prior)
%
% G(i,j) = 1 iff there is an arc from disease i to finding j
% inhibit(i,j) = inhibition probability on i->j arc
% leak(j) = inhibition prob. on leak->j arc
% prior(i) = prob. disease i is on

[Ndiseases Nfindings] = size(inhibit);
N = Ndiseases + Nfindings;
finding_node = Ndiseases+1:N;
ns = 2*ones(1,N);
dag = zeros(N,N);
dag(1:Ndiseases, finding_node) = G;
bnet = mk_bnet(dag, ns);

for d=1:Ndiseases
  CPT = [1-prior(d) prior(d)];
  bnet.CPD{d} = tabular_CPD(bnet, d, CPT');
end

for i=1:Nfindings
  fnode = finding_node(i);
  ps = parents(G, i);
  bnet.CPD{fnode} = noisyor_CPD(bnet, fnode, leak(i), inhibit(ps, i));
end

In the file BNT/examples/static/qmr1, we create a random bipartite graph G, with 5 diseases and 10 findings, and random parameters. (In general, to create a random dag, use 'mk_random_dag'.) (We are not allowed to publish the actual parameters, since they are proprietary.) We can visualize the resulting graph structure using graph drawing code contributed by Ali Taylan Cemgil from the University of Nijmegen:
draw_layout(bnet.dag);
pos = 2:floor(Nfindings/2);
neg = (pos(end)+1):(Nfindings-1);
onodes = myunion(pos, neg);
evidence = cell(1, N);
evidence(findings(pos)) = num2cell(repmat(2, 1, length(pos)));
evidence(findings(neg)) = num2cell(repmat(1, 1, length(neg)));

engine = jtree_inf_engine(bnet, onodes);
[engine, ll] = enter_evidence(engine, evidence);
post = zeros(1, Ndiseases);
for i=diseases(:)'
  m = marginal_nodes(engine, i);
  post(i) = m.T(2);
end

The junction tree algorithm is quite slow on the QMR network, since the cliques are so big. One simple trick we can use is to notice that hidden leaves do not affect the posteriors on the roots, and hence do not need to be included in the network. A second trick is to notice that the negative findings can be "absorbed" into the prior: see the file BNT/examples/static/mk_minimal_qmr_bnet for details.
engine = quickscore_inf_engine(inhibit, leak, prior); engine = enter_evidence(engine, pos, neg); m = marginal_nodes(engine, i);
max_iter = 30;
engine = loopy_pearl_inf_engine(bnet, max_iter);
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, i);

We found that this algorithm often converges, and when it does, often is very accurate, but it depends on the precise setting of the parameter values of the network. (See the file BNT/examples/static/qmr1 to repeat the experiment for yourself.) This is a topic of ongoing research.
BNT does support importance sampling, however: see likelihood_weighting_inf_engine.
An inference engine is an object that contains a bnet and supports the 'enter_evidence' and 'marginal_nodes' methods. The engine constructor takes the bnet as argument and may do some model-specific processing. When 'enter_evidence' is called, the engine may do some evidence-specific processing. Finally, when 'marginal_nodes' is called, the engine may do some query-specific processing.
The inference engines differ in many other important ways. Here are some of the major "axes":
Although being maximally lazy may seem desirable, this is not always the most efficient. For example, when learning, we need to call marginal_nodes N times, where N is the number of nodes. Variable elimination would end up repeating a lot of work each time marginal_nodes is called, making it inefficient for learning. The junction tree algorithm, by contrast, uses dynamic programming to avoid this redundant computation --- it calculates all marginals in two passes during 'enter_evidence', so calling 'marginal_nodes' takes constant time.
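For instance, with the jtree engine the pattern used throughout this documentation amortizes the cost over all queries (a sketch, reusing the sprinkler network from above):

[engine, ll] = enter_evidence(engine, evidence);   % all the message passing happens here
for i=1:N
  m = marginal_nodes(engine, i);                   % each of these calls is now cheap
end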
| Name | Exact? | Node type | Topology | Evidence | Model-specific processing | Evidence-specific processing | Query-specific processing |
|---|---|---|---|---|---|---|---|
| cond_gauss | exact | CG | all | any | compute joint | condition joint | marginalize joint |
| enumerative | exact | allD | all | any | none | none | marginalize joint |
| gaussian | exact | allG | all | any | compute joint | condition joint | marginalize joint |
| jtree | exact | D,G,CG | all | any | find elim order | propagate msgs | marginalize clique |
| jtree_fast | exact | D | all | fixed | find elim order | propagate msgs | marginalize clique |
| jtree_onepass | exact | D,G,CG | any | all | find elim order | propagate msgs | lookup |
| jtree_onepass_fast | exact | D | fixed | all | find elim order | propagate msgs | lookup |
| likelihood_weighting | approx | any | all | any | none | sample | lookup |
| loopy_pearl | approx | D,G | all | any | none | propagate msgs | lookup |
| pearl | exact | D,G | polytree | any | none | propagate msgs | lookup |
| quickscore | exact | noisy-or | QMR | leaves | none | marginalize root nodes | lookup |
| var_elim | exact | D,G,CG | all | any | none | none | nested marginalizations |
Documentation last updated: 22 November 2000