{"title": "Fisher Scoring and a Mixture of Modes Approach for Approximate Inference and Learning in Nonlinear State Space Models", "book": "Advances in Neural Information Processing Systems", "page_first": 403, "page_last": 409, "abstract": null, "full_text": "Fisher Scoring and a Mixture of Modes \nApproach for Approximate Inference and \nLearning in Nonlinear State Space Models \n\nThomas Briegel and Volker Tresp \nSiemens AG, Corporate Technology \n\nDept. Information and Communications \n\nOtto-Hahn-Ring 6,81730 Munich, Germany \n\n{Thomas.Briegel, Volker.Tresp} @mchp.siemens.de \n\nAbstract \n\nWe present Monte-Carlo generalized EM equations for learning in non(cid:173)\nlinear state space models. The difficulties lie in the Monte-Carlo E-step \nwhich consists of sampling from the posterior distribution of the hidden \nvariables given the observations. The new idea presented in this paper is \nto generate samples from a Gaussian approximation to the true posterior \nfrom which it is easy to obtain independent samples. The parameters of \nthe Gaussian approximation are either derived from the extended Kalman \nfilter or the Fisher scoring algorithm. In case the posterior density is mul(cid:173)\ntimodal we propose to approximate the posterior by a sum of Gaussians \n(mixture of modes approach). We show that sampling from the approxi(cid:173)\nmate posterior densities obtained by the above algorithms leads to better \nmodels than using point estimates for the hidden states. In our exper(cid:173)\niment, the Fisher scoring algorithm obtained a better approximation of \nthe posterior mode than the EKF. For a multimodal distribution, the mix(cid:173)\nture of modes approach gave superior results. \n\n1 INTRODUCTION \nNonlinear state space models (NSSM) are a general framework for representing nonlinear \ntime series. In particular, any NARMAX model (nonlinear auto-regressive moving average \nmodel with external inputs) can be translated into an equivalent NSSM. 
Mathematically, an NSSM is described by the system equation \n\nx_t = f_w(x_{t-1}, u_t) + ε_t     (1) \n\nwhere x_t denotes a hidden state variable, ε_t denotes zero-mean uncorrelated Gaussian noise with covariance Q_t and u_t is an exogenous (deterministic) input vector. The time-series measurements y_t are related to the unobserved hidden states x_t through the observation equation \n\ny_t = g_v(x_t, u_t) + v_t     (2) \n\nwhere v_t is uncorrelated Gaussian noise with covariance R_t. In the following we assume that the nonlinear mappings f_w(·) and g_v(·) are neural networks with weight vectors w and v, respectively. The initial state x_0 is assumed to be Gaussian distributed with mean a_0 and covariance Q_0. All variables are in general multidimensional. The two challenges in NSSMs are the interrelated tasks of inference and learning. In inference we try to estimate the states of unknown variables x_s given some measurements y_1, ..., y_t (typically the states of past (s < t), present (s = t) or future (s > t) values of x_t) and in learning we want to adapt some unknown parameters in the model (i.e. the neural network weight vectors w and v) given a set of measurements.[1] In the special case of linear state space models with Gaussian noise, efficient algorithms for inference and maximum likelihood learning exist. The latter can be implemented using EM update equations in which the E-step is implemented using forward-backward Kalman filtering (Shumway & Stoffer, 1982). If the system is nonlinear, however, inference and learning lead to complex integrals which are usually considered intractable (Anderson & Moore, 1979). A useful approximation is presented in section 3 where we show how the learning equations for NSSMs can be implemented using two steps which are repeated until convergence. First, in the (Monte-Carlo) E-step, random samples are generated from the unknown variables (e.g. the hidden variables x_t) given the measurements. 
In the second step (a generalized M-step) those samples are treated as real data and are used to adapt f_w(·) and g_v(·) using some version of the backpropagation algorithm. The problem lies in the first step, since it is difficult to generate independent samples from a general multidimensional distribution. Since it is difficult to generate samples from the proper distribution, the next best thing might be to generate samples using an approximation to the proper distribution, which is the idea pursued in this paper. The first thing which might come to mind is to approximate the posterior distribution of the hidden variables by a multidimensional Gaussian distribution, since generating samples from such a distribution is simple. In the first approach we use the extended Kalman filter and smoother to obtain mode and covariance of this Gaussian.[2] Alternatively, we estimate the mode and the covariance of the posterior distribution using an efficient implementation of Fisher scoring derived by Fahrmeir and Kaufmann (1991) and use those as parameters of the Gaussian. In some cases the approximation of the posterior mode by a single Gaussian might be considered too crude. Therefore, as a third solution, we approximate the posterior distribution by a sum of Gaussians (mixture of modes approach). Modes and covariances of those Gaussians are obtained using the Fisher scoring algorithm. The weights of the Gaussians are derived from the likelihood of the observed data given the individual Gaussian. In the following section we derive the gradient of the log-likelihood with respect to the weights in f_w(·) and g_v(·). In section 3, we show that the network weights can be updated using a Monte-Carlo E-step and a generalized M-step. Furthermore, we derive the different Gaussian approximations to the posterior distribution and introduce the mixture of modes approach. 
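As a concrete illustration of the generative model of equations (1) and (2), the following minimal sketch simulates a scalar NSSM; the functions f and g are illustrative stand-ins for the neural networks f_w(·) and g_v(·), and the inputs u_t are folded into the time argument for brevity (this is not the authors' code):

```python
import numpy as np

def simulate_nssm(f, g, Q, R, a0, Q0, T, seed=0):
    """Draw one trajectory from the NSSM of equations (1)-(2).

    f(x_prev, t) and g(x, t) stand in for the networks f_w and g_v."""
    rng = np.random.default_rng(seed)
    x = np.empty(T + 1)
    y = np.empty(T + 1)
    x[0] = rng.normal(a0, np.sqrt(Q0))       # Gaussian initial state x_0
    for t in range(1, T + 1):
        x[t] = f(x[t - 1], t) + rng.normal(0.0, np.sqrt(Q))  # system equation (1)
        y[t] = g(x[t], t) + rng.normal(0.0, np.sqrt(R))      # observation equation (2)
    return x, y

# a linear toy instance: x_t = 0.9 x_{t-1} + eps_t, y_t = x_t + v_t
x, y = simulate_nssm(lambda xp, t: 0.9 * xp, lambda xv, t: xv,
                     Q=1.0, R=0.5, a0=0.0, Q0=1.0, T=100)
```

Only the hidden trajectory x is latent at inference time; the learning problem of section 3 observes y alone.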
In section 4 we validate our algorithms using a standard nonlinear stochastic time-series model. In section 5 we present conclusions. \n\n2 THE GRADIENTS FOR NONLINEAR STATE SPACE MODELS \n\nGiven our assumptions we can write the joint probability of the complete data for t = 1, ..., T as[3] \n\np(X_T, Y_T, U_T) = p(U_T) p(x_0) ∏_{t=1}^T p(x_t | x_{t-1}, u_t) ∏_{t=1}^T p(y_t | x_t, u_t)     (3) \n\n[1] In this paper we focus on the case s ≤ t (smoothing and offline learning, respectively). \n[2] Independently from our work, a single Gaussian approximation to the E-step using the EKFS has been proposed by Ghahramani & Roweis (1999) for the special case of an RBF network. They show that one obtains a closed-form M-step when adapting only the linear parameters while holding the nonlinear parameters fixed. Although avoiding sampling, the computational load of their M-step seems to be significant. \n[3] In the following, each probability density is conditioned on the current model. For notational convenience, we do not indicate this fact explicitly. \n\nwhere U_T = {u_1, ..., u_T} is a set of known inputs, which means that p(U_T) is irrelevant in the following. Since only Y_T = {y_1, ..., y_T} and U_T are observed, the log-likelihood of the model is \n\nlog L = log ∫ p(X_T, Y_T | U_T) p(U_T) dX_T ∝ log ∫ p(X_T, Y_T | U_T) dX_T     (4) \n\nwith X_T = {x_0, ..., x_T}. By inserting the Gaussian noise assumptions we obtain the gradients of the log-likelihood with respect to the neural network weight vectors w and v, respectively (Tresp & Hofmann, 1995) \n\n∂log L/∂w ∝ Σ_{t=1}^T ∫ [∂f_w(x_{t-1}, u_t)/∂w]^T (x_t − f_w(x_{t-1}, u_t)) p(x_t, x_{t-1} | Y_T, U_T) dx_{t-1} dx_t \n∂log L/∂v ∝ Σ_{t=1}^T ∫ [∂g_v(x_t, u_t)/∂v]^T (y_t − g_v(x_t, u_t)) p(x_t | Y_T, U_T) dx_t     (5) \n\n3 APPROXIMATIONS TO THE E-STEP \n\n3.1 Monte-Carlo Generalized EM Learning \n\nThe integrals in the previous equations can be solved using Monte-Carlo integration, which leads to the following learning algorithm. \n\n1. Generate S samples {x̃_0^s, ..., x̃_T^s}_{s=1}^S from p(X_T | Y_T, U_T), assuming the current model is correct (Monte-Carlo E-step). \n2. Treat those samples as real data and update w_new = w_old + η ∂log L/∂w and v_new = v_old + η ∂log L/∂v with stepsize η and \n\n∂log L/∂w ∝ (1/S) Σ_{t=1}^T Σ_{s=1}^S [∂f_w(x_{t-1}, u_t)/∂w]|_{x_{t-1}=x̃_{t-1}^s} (x̃_t^s − f_w(x̃_{t-1}^s, u_t))     (6) \n∂log L/∂v ∝ (1/S) Σ_{t=1}^T Σ_{s=1}^S [∂g_v(x_t, u_t)/∂v]|_{x_t=x̃_t^s} (y_t − g_v(x̃_t^s, u_t))     (7) \n\n(generalized M-step). Go back to step one. \n\nThe second step is simply a stochastic gradient step. The computational difficulties lie in the first step. Methods which produce samples from multivariate distributions, such as Gibbs sampling and other Markov chain Monte-Carlo methods, have (at least) two problems. First, the sampling process has to \"forget\" its initial condition, which means that the first samples have to be discarded, and there are no simple analytical tools available to determine how many samples must be discarded. Secondly, subsequent samples are highly correlated, which means that many samples have to be generated before a sufficient number of independent samples is available. Since it is so difficult to sample from the correct posterior distribution p(X_T | Y_T, U_T), the idea in this paper is to generate samples from an approximate distribution from which it is easy to draw samples. In the next sections we present approximations using a multivariate Gaussian and a mixture of Gaussians. 
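The Monte-Carlo M-step of equation (6) can be sketched for a scalar model with a single weight; f and df_dw below are illustrative stand-ins for a network's forward pass and its weight gradient, not the authors' implementation:

```python
import numpy as np

def mc_gradient_w(samples, f, df_dw):
    """Monte-Carlo estimate of the gradient in eq. (6).

    samples[s, t] is the s-th posterior draw of x_t (shape (S, T+1));
    f(x_prev, t) is the current model, df_dw(x_prev, t) its gradient
    with respect to the (here scalar) weight w."""
    S, T1 = samples.shape
    grad = 0.0
    for s in range(S):
        for t in range(1, T1):
            resid = samples[s, t] - f(samples[s, t - 1], t)  # x_t - f_w(x_{t-1})
            grad += df_dw(samples[s, t - 1], t) * resid
    return grad / S

# generalized M-step for the toy model f_w(x) = w * x, where df/dw = x:
w = 0.4
samples = np.array([[1.0, 2.0, 1.5],
                    [1.0, 1.8, 1.2]])  # S = 2 posterior draws, T = 2
grad = mc_gradient_w(samples, lambda xp, t: w * xp, lambda xp, t: xp)
w_new = w + 0.01 * grad  # stochastic gradient step with stepsize eta = 0.01
```

In the paper the draws come from the Gaussian (or mixture-of-Gaussians) approximation of the following sections rather than from the exact posterior.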
\n\n3.2 Approximate Mode Estimation Using the Extended Kalman Filter \n\nWhereas the Kalman filter is an optimal state estimator for linear state space models, the extended Kalman filter is a suboptimal state estimator for NSSMs based on local linearizations of the nonlinearities.[4] The extended Kalman filter and smoother (EKFS) algorithm is a forward-backward algorithm and can be derived as an approximation to posterior mode estimation for Gaussian error sequences (Sage & Melsa, 1971). Its application to our framework amounts to approximating x_t^mode ≈ x_t^EKFS, where x_t^EKFS is the smoothed estimate of x_t obtained from forward-backward extended Kalman filtering over the set of measurements Y_T and x_t^mode is the mode of the posterior distribution p(x_t | Y_T, U_T). We use x_t^EKFS as the center of the approximating Gaussian. The EKFS also provides an estimate of the error covariance of the state vector at each time step t, which can be used to form the covariance matrix of the approximating Gaussian. The EKFS equations can be found in Anderson & Moore (1979). To generate samples we recursively apply the following algorithm. Given that x̃_{t-1}^s is a sample from the Gaussian approximation of p(x_{t-1} | Y_T, U_T) at time t − 1, draw a sample x̃_t^s from p(x_t | x_{t-1} = x̃_{t-1}^s, Y_T, U_T). The last conditional density is Gaussian with mean and covariance calculated from the EKFS approximation and the lag-one error covariances derived in Shumway & Stoffer (1982), respectively. \n\n[4] Note that we do not include the parameters in the NSSM as additional states to be estimated, as done by other authors, e.g. Puskorius & Feldkamp (1994). \n\n3.3 Exact Mode Estimation Using the Fisher Scoring Algorithm \n\nIf the system is highly nonlinear, however, the EKFS can perform badly in finding the posterior mode due to the fact that it uses a first order Taylor series expansion of the nonlinearities f_w(·) and g_v(·) (for an illustration, see Figure 1). A useful - and computationally tractable - alternative to the EKFS is to compute the \"exact\" posterior mode by maximizing log p(X_T | Y_T, U_T) with respect to X_T. A suitable way to determine a stationary point of the log posterior, or equivalently, of p(X_T, Y_T | U_T) (derived from (3) by dropping p(U_T)), is to apply Fisher scoring. With the current estimate X_T^{FS,old} we get a better estimate X_T^{FS,new} = X_T^{FS,old} + η δ for the unknown state sequence X_T, where δ is the solution of \n\nS(X_T^{FS,old}) δ = s(X_T^{FS,old})     (8) \n\nwith the score function s(X_T) = ∂log p(X_T, Y_T | U_T)/∂X_T and the expected information matrix S(X_T) = E[−∂²log p(X_T, Y_T | U_T)/∂X_T ∂X_T^T].[5] By extending the arguments given in Fahrmeir & Kaufmann (1991) to nonlinear state space models, it turns out that solving equation (8) - e.g. computing the inverse of the expected information matrix - can be performed by Cholesky decomposition in one forward and one backward pass.[6] The forward-backward steps can be implemented as a fast EKFS-like algorithm which has to be iterated to obtain the maximum posterior estimates x_t^mode = x_t^FS (see Appendix). Figure 1 shows the estimate obtained by the Fisher scoring procedure for a bimodal posterior density. Fisher scoring is successful in finding the \"exact\" mode; the EKFS algorithm is not. Samples of the approximating Gaussian are generated in the same way as in the last section. \n\n[5] Note that the difference between the Fisher scoring and the Gauss-Newton update is that in the former we take the expectation of the information matrix. \n[6] The expected information matrix is a positive definite block-tridiagonal matrix. \n\n3.4 The Mixture of Modes Approach \n\nThe previous two approaches to posterior mode smoothing can be viewed as single Gaussian approximations of the mode of p(X_T | Y_T, U_T). In some cases the approximation of the posterior density by a single Gaussian might be considered too crude, in particular if the posterior distribution is multimodal. In this section we approximate the posterior by a weighted sum of m Gaussians, p(X_T | Y_T, U_T) ≈ Σ_{k=1}^m α^k p(X_T | k), where p(X_T | k) is the k-th Gaussian. 
If the individual Gaussians model the different modes, we are able to model multimodal posterior distributions accurately. The approximations of the individual modes are local maxima of the Fisher scoring algorithm, which are found by starting the algorithm from different initial conditions. Given the different Gaussians, the optimal weighting factors are α^k = p(Y_T | k) p(k) / p(Y_T), where p(Y_T | k) = ∫ p(Y_T | X_T) p(X_T | k) dX_T is the likelihood of the data given mode k. If we approximate that integral by inserting the Fisher scoring solutions x_t^{FS,k} for each time step t and linearize the nonlinearity g_v(·) about the Fisher scoring solutions, we obtain a closed-form solution for computing the α^k (see Appendix). The resulting estimator is a weighted sum of the m single Fisher scoring estimates, x_t^MM = Σ_{k=1}^m α^k x_t^{FS,k}. The mixture of modes algorithm can be found in the Appendix. For the learning task, samples of the mixture of Gaussians are based on samples of each of the m single Gaussians, which are obtained in the same way as in subsection 3.2. \n\n4 EXPERIMENTAL RESULTS \n\nIn the first experiment we want to test how well the different approaches can approximate the posterior distribution of a nonlinear time series (inference). As a time-series model we chose \n\nf(x_{t-1}, u_t) = 0.5 x_{t-1} + 25 x_{t-1} / (1 + x_{t-1}²) + 8 cos(1.2 (t − 1)),   g(x_t) = x_t² / 20,     (9) \n\nthe covariances Q_t = 10, R_t = 1 and initial conditions a_0 = 0 and Q_0 = 5, which is considered a hard inference problem (Kitagawa, 1987). At each time step we calculate the expected value of the hidden variables x_t, t = 1, ...
, 400 based on a set of measurements Y_400 = {y_1, ..., y_400} (which is the optimal estimator in the mean squared sense) and based on the different approximations presented in the last section. Note that for the single mode approximation, x_t^mode is the best estimate of x_t based on the approximating Gaussian. For the mixture of modes approach, the best estimate is Σ_{k=1}^m α^k x_t^{FS,k}, where x_t^{FS,k} is the mode of the k-th Gaussian in the dimension of x_t. Figure 2 (left) shows the mean squared error (MSE) of the smoothed estimates using the different approaches. The Fisher scoring (FS) algorithm is significantly better than the EKFS approach. In this experiment, the mixture of modes (MM) approach is significantly better than both the EKFS and Fisher scoring. The reason is that the posterior probability is multimodal, as shown in Figure 1. \n\nIn the second experiment we used the same time-series model and trained a neural network to approximate f_w(·), where all covariances were assumed to be fixed and known. For adaptation we used the learning rules of section 3 with the various approximations to the posterior distribution of X_T. Figure 2 (right) shows the results. The experiments show that truly sampling from the approximating Gaussians gives significantly better results than using the expected value as a point estimate. Furthermore, using the mixture of modes approach in conjunction with sampling gave significantly better results than the approximations using a single Gaussian. When used for inference, the network trained using the mixture of modes approach was not significantly worse than the true model (5% significance level, based on 20 experiments). \n\n5 CONCLUSIONS \n\nIn our paper we presented novel approaches for inference and learning in NSSMs. The application of Fisher scoring and the mixture of modes approach to nonlinear models as presented in our paper is new. 
Also the idea of sampling from an approximation to the posterior distribution of the hidden variables is presented here for the first time. Our results indicate that the Fisher scoring algorithm gives better estimates of the expected value of the hidden variable than the EKFS based approximations. Note that the Fisher scoring algorithm is more complex, requiring typically 5 forward-backward passes instead of only one forward-backward pass for the EKFS approach. Our experiments also showed that if the posterior distribution is multimodal, the mixture of modes approach gives significantly better estimates compared to the approaches based on a single Gaussian approximation. Our learning experiments show that it is important to sample from the approximate distributions and that it is not sufficient to simply substitute point estimates. Based on the sampling approach it is also possible to estimate hyperparameters (e.g. the covariance matrices), which was not done in this paper. The approaches can also be extended towards online learning and estimation in various ways (e.g. missing data problems). \n\nFigure 1: Approximations to the posterior distribution p(x_t | Y_400, U_400) for t = 294 and t = 295. The continuous line shows the posterior distribution based on Gibbs sampling using 1000 samples and can be considered a close approximation to the true posterior. The EKFS approximation (dotted) does not converge to a mode. The Fisher scoring solution (dash-dotted) finds the largest mode. The mixture of modes approach with 50 modes (dashed) correctly finds the two modes. 
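For reference, the benchmark model of equation (9) used in section 4 can be simulated in a few lines. This is a sketch under the usual reading of Kitagawa's benchmark (quadratic observation g(x_t) = x_t²/20, state noise variance 10, observation noise variance 1), not the authors' code:

```python
import numpy as np

def simulate_benchmark(T=400, a0=0.0, Q0=5.0, seed=0):
    """Simulate the nonlinear benchmark of eq. (9) (Kitagawa, 1987)."""
    rng = np.random.default_rng(seed)
    x = np.empty(T + 1)
    y = np.empty(T + 1)
    x[0] = rng.normal(a0, np.sqrt(Q0))
    y[0] = x[0]**2 / 20.0 + rng.normal(0.0, 1.0)
    for t in range(1, T + 1):
        x[t] = (0.5 * x[t - 1]
                + 25.0 * x[t - 1] / (1.0 + x[t - 1]**2)
                + 8.0 * np.cos(1.2 * (t - 1))
                + rng.normal(0.0, np.sqrt(10.0)))     # Q_t = 10
        y[t] = x[t]**2 / 20.0 + rng.normal(0.0, 1.0)  # quadratic observation
    return x, y
```

The quadratic observation leaves the sign of x_t ambiguous given y_t, which is exactly why the posterior shown in Figure 1 is bimodal and why the mixture of modes approach pays off on this model.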
\n\nAppendix: Mixture of Modes Algorithm \n\nThe mixture of modes estimate x_t^MM is derived as a weighted sum of k = 1, ..., m individual Fisher scoring (mode) estimates x_t^{FS,k}. For m = 1 we obtain the Fisher scoring algorithm of subsection 3.3. First, one performs the set of forward recursions (t = 1, ..., T) for each single mode estimator k \n\nΣ^k_{t|t-1} = F_t(x_{t-1}^{FS,k}) Σ^k_{t-1|t-1} F_t^T(x_{t-1}^{FS,k}) + Q_t     (10) \nB^k_t = Σ^k_{t-1|t-1} F_t^T(x_{t-1}^{FS,k}) (Σ^k_{t|t-1})^{-1}     (11) \nΣ^k_{t|t} = ((Σ^k_{t|t-1})^{-1} + G_t^T(x_t^{FS,k}) R_t^{-1} G_t(x_t^{FS,k}))^{-1}     (12) \nγ^k_t = s_t(x_t^{FS,k}) + B_t^{kT} γ^k_{t-1}     (13) \n\nwith the initialization Σ^k_{0|0} = Q_0, γ^k_0 = s_0(x_0^{FS,k}). Then, one performs the set of backward smoothing recursions (t = T, ..., 1) \n\n(D^k_{t-1})^{-1} = Σ^k_{t-1|t-1} − B^k_t Σ^k_{t|t-1} B_t^{kT}     (14) \nΣ^k_{t-1} = (D^k_{t-1})^{-1} + B^k_t Σ^k_t B_t^{kT}     (15) \nδ^k_{t-1} = (D^k_{t-1})^{-1} γ^k_{t-1} + B^k_t δ^k_t     (16) \n\nwith F_t(z) = ∂f_w(x_{t-1}, u_t)/∂x_{t-1} |_{x_{t-1}=z}, G_t(z) = ∂g_v(x_t, u_t)/∂x_t |_{x_t=z}, s_t(z) = ∂log p(X_T, Y_T | U_T)/∂x_t |_{x_t=z}     (17) \n\nand the initialization δ^k_T = Σ^k_T γ^k_T. The m individual mode estimates x_t^{FS,k} are obtained by iterative application of the update rule X^{FS,k} := X^{FS,k} + η δ^k with stepsize η, where X^{FS,k} = {x_0^{FS,k}, ..., x_T^{FS,k}} and δ^k = {δ^k_0, ..., δ^k_T}. After convergence we obtain the mixture of modes estimate as the weighted sum x_t^MM = Σ_{k=1}^m α^k x_t^{FS,k} with weighting coefficients α^k := α^k_0, which are computed recursively starting with a uniform prior α^k_T = 1/m (N(x | μ, Σ) stands for a Gaussian with center μ and covariance Σ evaluated at x): \n\nα^k_t ∝ α^k_{t+1} N(y_t | g_v(x_t^{FS,k}, u_t), R̃^k_t)     (18) \n\nFigure 2: Left (inference): The heights of the bars indicate the mean squared error between the true x_t (which we know since we simulated the system) and the estimates using the various approximations. The error bars show the standard deviation derived from 20 repetitions of the experiment. Based on the paired t-test, Fisher scoring is significantly better than the EKFS, and all mixture of modes approaches are significantly better than both EKFS and Fisher scoring based on a 1% rejection region. The mixture of modes approximation with 50 modes (MM 50) is significantly better than the approximation using 20 modes. The improvement of the approximation using 20 modes (MM 20) over the approximation with 10 modes (MM 10) is not significant using a 5% rejection region. \nRight (learning): The heights of the bars indicate the mean squared error between the true f_w(·) (which is known) and the approximations using a multi-layer perceptron with 3 hidden units and T = 200. Shown are results using the EKFS approximation (left), the Fisher scoring approximation (center) and the mixture of modes approximation (right). There are two bars for each experiment: the left bars show results where the expected value of x_t calculated using the approximating Gaussians is used as a (single) sample for the generalized M-step - in other words - we use a point estimate for x_t. Using the point estimates, the results of all three approximations are not significantly different based on a 5% significance level. 
The right bars show the results where S = 50 samples are generated for approximating the gradient using the Gaussian approximations. The results using sampling are all significantly better than the results using point estimates (1% significance level). The sampling approach using the mixture of modes approximation is significantly better than the other two sampling-based approaches (1% significance level). Compared to the inference results of the experiments shown on the left, we achieved a mean squared error of 6.02 for the mixture of modes approach with 10 modes, which is not significantly worse than the result with the true model of 5.87 (5% significance level). \n\nReferences \n\nAnderson, B. and Moore, J. (1979) Optimal Filtering, Prentice-Hall, New Jersey. \nFahrmeir, L. and Kaufmann, H. (1991) On Kalman Filtering, Posterior Mode Estimation and Fisher Scoring in Dynamic Exponential Family Regression, Metrika, 38, pp. 37-60. \nGhahramani, Z. and Roweis, S. (1999) Learning Nonlinear Stochastic Dynamics using the Generalized EM Algorithm, Advances in Neural Information Processing Systems 11, eds. M. Kearns, S. Solla, D. Cohn, MIT Press, Cambridge, MA. \nKitagawa, G. (1987) Non-Gaussian State Space Modeling of Nonstationary Time Series (with Comments), JASA 82, pp. 1032-1063. \nPuskorius, G. and Feldkamp, L. (1994) Neurocontrol of Nonlinear Dynamical Systems with Kalman Filter Trained Recurrent Networks, IEEE Transactions on Neural Networks, 5:2, pp. 279-297. \nSage, A. and Melsa, J. (1971) Estimation Theory with Applications to Communications and Control, McGraw-Hill, New York. \nShumway, R. and Stoffer, D. (1982) Time Series Smoothing and Forecasting Using the EM Algorithm, Technical Report No. 27, Division of Statistics, UC Davis. \nTresp, V. and Hofmann, R. (1995) Missing and Noisy Data in Nonlinear Time-Series Prediction, Neural Networks for Signal Processing 5, IEEE Sig. Proc. Soc., pp. 1-10. \n", "award": [], "sourceid": 1539, "authors": [{"given_name": "Thomas", "family_name": "Briegel", "institution": null}, {"given_name": "Volker", "family_name": "Tresp", "institution": null}]}