\section{Introduction}\label{sec1}
Numerical analysis can be seen as the “operational” part of a scientific computation. In particular, a thorough error analysis is necessary for an informed assessment of the quality of any obtained result. Three examples are considered to illustrate different types of errors and error estimation techniques.
Some preliminaries on machine data representation and error analysis are given in Section 2. In Section 3, two methods of evaluating trigonometric functions are compared with respect to precision. Section 4 illustrates the use of iteration for solving the discrete time matrix algebraic Riccati equation. A third example on numerical integration of a stochastic differential equation is given in Section 5, which shows the application of statistical error analysis. Some conclusions are given in Section 6.
\section{Preliminaries}\label{sec2}
A computer handles pieces of information of a fixed size, a \emph{word}. The number of digits in a word is the \emph{word length}. A normalized \emph{floating point representation} of a real number $x$ is a representation in the form
\[
x = m \cdot \beta^{e} .
\]
All real numbers can be expressed in this representation, where the number $m$ is the \emph{mantissa}, $\beta$ is the \emph{base} and $e$ is the \emph{exponent}. In a computer, the number of digits for $m$ and $e$ is limited by the word length.
Suppose that $t$ digits are used to represent $m$. Then it is only possible to represent floating point numbers of the form
\[
\bar{x} = \bar{m} \cdot \beta^{e},
\]
where $\bar{m}$ is the mantissa $m$ rounded to $t$ digits, and the exponent $e$ is limited to a finite range
\[
e_{\min} \le e \le e_{\max} .
\]
The normalization ensures that $\beta^{-1} \le |\bar{m}| < 1$, that is, the leading digit of the mantissa is nonzero. In the binary system ($\beta = 2$) the leading digit is then always equal to one, so this digit need not be stored. An exception is the number $x = 0$, for which one sets $\bar{m} = 0$, and it is also practical to set $e = e_{\min}$, the lower limit of the exponent.
The set $F$ of floating point numbers that can be represented by the system contains exactly $2(\beta - 1)\beta^{t-1}(e_{\max} - e_{\min} + 1) + 1$ numbers. The limited number of digits in the exponent implies that $x$ is limited in magnitude to an interval – the \emph{range} of the floating point system. If $x$ is larger in magnitude than the largest number in the set $F$, then $x$ cannot be represented at all. The same is true, in principle, of numbers smaller in magnitude than the smallest nonzero number in the set $F$.
It can be shown that in a floating point system, every real number $x$ within the range of $F$ can be represented with a relative error which does not exceed the \emph{machine unit} $\mu$, which, when rounding is used, is defined by
\[
\mu = \tfrac{1}{2}\beta^{1-t} .
\]
In a floating point system, both large and small numbers are represented with the same relative accuracy. In IEEE arithmetic the machine unit is approximately $6 \cdot 10^{-8}$ in single precision and $1.1 \cdot 10^{-16}$ in double precision.
The IEEE 754 standard from 1985 for floating point arithmetic is implemented on most chips used for personal computers and workstations. Two main formats, single and double precision, are defined. The standard specifies that arithmetic operations should be performed as if they were first calculated to infinite precision and then rounded off. The default rounding mode is to round to the nearest representable number, with round to even in case of a tie. The infinite precision is implemented by using extra guard digits in the intermediate result of the operation before normalization and rounding. In single precision a floating point representation of a number $x$ is stored as
\[
x = (-1)^{s} \cdot (1.f)_2 \cdot 2^{\,e - 127},
\]
where $s$ is the sign (one bit), $e$ is the exponent (8 bits, stored with a bias of 127), and $f$ is the mantissa (23 bits). If $x$ and $y$ are two floating point numbers, the corresponding operations are denoted by
\[
fl(x + y), \quad fl(x - y), \quad fl(x \cdot y), \quad fl(x / y),
\]
for the results of floating point addition, subtraction, multiplication and division. These operations do not, in general, have the same properties as the exact arithmetic operations: the associative and distributive laws may fail for addition and multiplication. Consequently, the order of calculation may have a large impact on the error propagation.
An approximation to the machine unit can be determined as $\mu \approx \epsilon$, where $\epsilon$ is the smallest floating point number such that
\[
fl(1 + \epsilon) > 1 .
\]
The machine unit can therefore be regarded as a lower bound for the relative error in any computation, see [1].
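As an illustration, the machine unit can be estimated in a few lines of Python; this is a minimal sketch assuming IEEE double precision arithmetic with round to nearest, not part of the experiments reported later.
\begin{verbatim}
# Find the smallest power of two eps such that fl(1 + eps) > 1.
# With round to nearest this eps is twice the machine unit mu,
# so mu is approximated by eps / 2.
eps = 1.0
while 1.0 + eps / 2.0 > 1.0:
    eps /= 2.0

mu = eps / 2.0
print("machine epsilon (spacing at 1.0):", eps)  # ~2.2e-16 in double precision
print("approximate machine unit:", mu)           # ~1.1e-16 in double precision
\end{verbatim}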
\subsection{Error Types}
Numerical results are influenced by many types of error. Some sources of error are difficult to influence while others can be reduced or eliminated by changing the computational sequence. Errors are propagated from their sources to quantities computed later, sometimes with a considerable amplification. It is therefore important to distinguish between the new error produced at the computation of a quantity, and errors propagated from the data that the quantity is calculated from. The following types of error are encountered in machine computations.
\subsubsection{Errors in Given Input Data}
Input data can be the results of measurements which have been influenced by errors. A \emph{rounding error} occurs, for example, whenever an irrational number is shortened (“rounded off”) to a fixed number of decimals.
\subsubsection{Errors During the Computations}
The limited word length in a computer or calculator leads at times to a loss of information. There are two typical and important cases:
\begin{itemize}\item If the device cannot handle numbers which have more than, say, $s$ digits, then the exact product of two $s$-digit numbers (which contains $2s$ or $2s - 1$ digits) cannot be used in subsequent calculations, but is rounded off. \item When a relatively small term $b$ is added to $a$, then some digits of $b$ are “shifted out”, and they will not have any effect on future quantities that depend on the value of $a + b$. The effect of such roundings can be quite noticeable in an extensive calculation, or in an algorithm which is numerically unstable.\end{itemize}
\subsubsection{Truncation Errors}
These are errors arising when a limiting process is truncated (broken off) before the limiting value has been reached. Such \emph{truncation error} occurs, for example, when an infinite series is broken off after a finite number of terms, when a nonlinear function is approximated with a linear function, or when a derivative is approximated with a difference quotient (also called \emph{discretization error}).
\subsubsection{Other Types of Errors}
The most important, but sometimes overlooked sources of error are due to simplifications in the mathematical model, and human or machine errors.
\paragraph{Simplifications in the Mathematical Model}
In most of the applications of mathematics, one makes idealizations and assumptions in order to formulate a mathematical model. The idealizations and assumptions made may not be appropriate to describe the present problem and can therefore cause errors in the result. The effects of such sources of error are usually more difficult to estimate than errors in given input data, rounding errors during the computations and truncation errors.
\paragraph{Human Errors and Machine Errors}
In all numerical work, one must be aware that clerical errors, errors in hand calculation, and misunderstandings may occur. When using computers, one can encounter errors in the program itself, typing errors in entering the data, operator errors, and (more seldom) machine errors.
Errors in given input data and errors due to simplification of the mathematical model are usually considered to be uncontrollable in the numerical treatment. Truncation errors are usually controllable. The rounding error in the individual arithmetic operation is, in a computer, controllable to a limited extent, mainly through the choice of single and double precision. This type of error can also be reduced by reformulation of the calculation sequence.
\subsection{Functions of a Single Variable}
One of the fundamental extensions of a computing device is to have methods to compute \emph{elementary functions}. A function of a single variable is a mapping $f : \mathbb{R} \to \mathbb{R}$, where the argument of $f$ is a one-dimensional variable. For all calculated quantities, the following definition of absolute and relative error can be applied.
Definition: Let $\bar{x}$ be an approximate value to an exact quantity $x$. The \emph{absolute error} in $\bar{x}$ is
\[
\Delta x = \bar{x} - x,
\]
and if $x \neq 0$, the \emph{relative error} is
\[
\frac{\Delta x}{x} = \frac{\bar{x} - x}{x} .
\]
In many situations one wants to compute error bounds, either strict or approximate, for the absolute or relative error. Since it is sometimes rather hard to obtain an error bound that is both strict and sharp, one often prefers to use less strict but realistic error estimates, see [1].
To carry out rounding error analysis of an algorithm based on floating point operations, a model of how these operations are carried out is needed. The standard model is
\[
fl(x \ \mathrm{op}\ y) = (x \ \mathrm{op}\ y)(1 + \delta), \qquad |\delta| \le \mu ,
\]
where $\mathrm{op}$ is any of the four elementary operations. The model is valid for the IEEE standard.
For a function $f$, elementary results from calculus can be used to give a bound for the error in the evaluated function value. Let $f$ be a differentiable function of a single variable and let $\bar{x} = x + \Delta x$. Then, by the mean value theorem,
\[
f(\bar{x}) - f(x) = f'(\xi)\,\Delta x ,
\]
where $\xi$ lies between $x$ and $\bar{x}$. Suppose that $|\Delta x| \le \epsilon$. Then
\[
|f(\bar{x}) - f(x)| \le \epsilon \max_{|\xi - x| \le \epsilon} |f'(\xi)| .
\]
General Error Propagation Formula: Assume that the errors in $x_1, \dots, x_n$ are $\Delta x_1, \dots, \Delta x_n$. Then the maximal error in $f(x_1, \dots, x_n)$ has the approximate bound
\[
|\Delta f| \lesssim \sum_{i=1}^{n} \left| \frac{\partial f}{\partial x_i} \right| |\Delta x_i| .
\]
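To illustrate, the approximate bound can be evaluated numerically. The following Python sketch estimates the partial derivatives by central differences; the example function $f(x_1, x_2) = x_1 \sin x_2$, the step $h$ and the input errors are illustrative assumptions.
\begin{verbatim}
import math

def propagated_error_bound(f, x, dx, h=1e-6):
    """Approximate bound sum_i |df/dx_i| * |dx_i|, with the partial
    derivatives estimated by central differences (a rough sketch)."""
    bound = 0.0
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        dfdxi = (f(xp) - f(xm)) / (2.0 * h)
        bound += abs(dfdxi) * abs(dx[i])
    return bound

# Example: f(x1, x2) = x1*sin(x2) with input errors 1e-3 and 2e-3.
f = lambda v: v[0] * math.sin(v[1])
print(propagated_error_bound(f, [2.0, 0.5], [1e-3, 2e-3]))
\end{verbatim}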
\subsection{Vectors and Matrices}
A \emph{vector norm} on $\mathbb{R}^n$ is a function $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}$ satisfying
\begin{itemize}\item $\|x\| > 0$ for all $x \neq 0$, \item $\|\alpha x\| = |\alpha|\,\|x\|$ for all scalars $\alpha$, \item $\|x + y\| \le \|x\| + \|y\|$ for all $x, y$ (the triangle inequality).\end{itemize}
The vector norm of a vector $x$ is denoted $\|x\|$. The most common vector norms are special cases of the family of \emph{H\"older norms} or \emph{$p$-norms}, defined by
\[
\|x\|_p = \Big( \sum_{i=1}^{n} |x_i|^p \Big)^{1/p}, \qquad 1 \le p < \infty .
\]
The $\infty$-norm (or \emph{maximum norm}) is defined by
\[
\|x\|_\infty = \max_{1 \le i \le n} |x_i| .
\]
An infinite sequence of matrices $A_1, A_2, A_3, \dots$ is said to converge to a matrix $A$, that is $\lim_{k \to \infty} A_k = A$, if
\[
\lim_{k \to \infty} \|A_k - A\| = 0 .
\]
From the equivalence of norms in a finite dimensional vector space it follows that convergence is independent of the choice of norm. The particular choice of the maximum norm shows that convergence of vectors in $\mathbb{R}^n$ is equivalent to convergence of the sequences of scalars formed by the components of the vectors. This conclusion is extended to matrices, see [1].
\subsection{Iteration}
A general idea often used in numerical computation is \emph{iteration}, where a numerical process is applied repeatedly to a problem. As an example, an equation of the form
\[
x = F(x)
\]
can be solved by successive approximation using an initial approximation $x_0$. By using the equation repeatedly, a sequence of numbers $x_0, x_1, x_2, \dots$ is obtained that (hopefully) converges to a limiting value $x^{*}$, which is a root to the equation. As $n$ grows, the numbers $x_n$ are then successively better and better approximations to the root, that is
\[
\lim_{n \to \infty} x_n = x^{*},
\]
so that $x^{*}$ satisfies the equation $x^{*} = F(x^{*})$. The iteration is stopped when sufficient accuracy has been attained. Consider a nonlinear equation (or system of equations) of the form
\[
x = F(x) .
\]
Then, for $k = 0, 1, 2, \dots$, the \emph{fixed-point iteration}
\[
x_{k+1} = F(x_k)
\]
can be used to find the root to the equation. If $F$ is continuous and $x_k \to x^{*}$ as $k \to \infty$, then $x^{*} = F(x^{*})$ and $x^{*}$ solves the system $x = F(x)$.
Generally, a vector sequence $\{x_k\}$ is said to converge to a limit $x^{*}$ if
\[
\lim_{k \to \infty} \|x_k - x^{*}\| = 0
\]
for some norm $\|\cdot\|$, see [1].
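A minimal Python sketch of such a fixed-point iteration follows; the example map $F(x) = \cos x$, the tolerance and the iteration limit are illustrative choices and not taken from the text.
\begin{verbatim}
import math

def fixed_point(F, x0, tol=1e-12, max_iter=1000):
    """Iterate x_{k+1} = F(x_k) until successive iterates agree to tol."""
    x = x0
    for _ in range(max_iter):
        x_new = F(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")

# Example: the root of x = cos(x), starting from x0 = 1.
print(fixed_point(math.cos, 1.0))   # approximately 0.7390851332
\end{verbatim}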
\subsection{Numerical Stochastic Integration}
Consider an initial value problem consisting of an ordinary differential equation and an initial value,
\[
\frac{dx}{dt} = f(t, x), \qquad x(t_0) = x_0 .
\]
There is a multitude of methods available for the numerical integration of this problem. One simple and common method is the \emph{Euler method}, in which the derivative is approximated with a forward time difference as
\[
\frac{dx}{dt}(t_n) \approx \frac{x(t_{n+1}) - x(t_n)}{t_{n+1} - t_n}
\]
for some given time discretization $t_0 < t_1 < \dots < t_N$, which gives the Euler method for the problem,
\[
x_{n+1} = x_n + f(t_n, x_n)(t_{n+1} - t_n), \qquad n = 0, 1, \dots, N - 1 .
\]
For the Euler method, the \emph{local discretization error}, the error introduced in each step of the method, is defined as
\[
d_{n+1} = x(t_{n+1}) - \big( x(t_n) + f(t_n, x(t_n))(t_{n+1} - t_n) \big),
\]
and the \emph{global discretization error}, the total error after $n$ steps, is defined as
\[
e_n = x(t_n) - x_n .
\]
A method is \emph{convergent} if the global discretization error converges to zero with the maximum time step $\Delta = \max_n (t_{n+1} - t_n)$, that is,
\[
\lim_{\Delta \to 0} \max_n |e_n| = 0 ,
\]
where the maximum is taken over all steps $t_n$ on any finite time interval $[t_0, T]$. A method is \emph{stable} if the propagated errors remain bounded.
Now, consider an \emph{It{\^o} process} $X_t$ satisfying the stochastic differential equation
\[
dX_t = a(t, X_t)\,dt + b(t, X_t)\,dW_t, \qquad X_{t_0} = X_0 ,
\]
where $W_t$ is a Wiener process. For a given discretization $t_0 = \tau_0 < \tau_1 < \dots < \tau_N = T$ of the time interval $[t_0, T]$, the \emph{Euler-Maruyama approximation} is the continuous stochastic process $Y_t$ satisfying
\[
Y_{n+1} = Y_n + a(\tau_n, Y_n)(\tau_{n+1} - \tau_n) + b(\tau_n, Y_n)(W_{\tau_{n+1}} - W_{\tau_n})
\]
for $n = 0, 1, \dots, N - 1$, with initial value $Y_0 = X_0$ and step lengths $\Delta_n = \tau_{n+1} - \tau_n$. The process is constructed by generating a white noise process approximation
\[
\Delta W_n = W_{\tau_{n+1}} - W_{\tau_n} \sim N(0, \Delta_n)
\]
of the Wiener process $W_t$, that is, the increments of the Wiener process are normally distributed with a variance equal to the time step length. The white noise process therefore has the properties that
\[
E[\Delta W_n] = 0, \qquad E\big[(\Delta W_n)^2\big] = \Delta_n .
\]
In order to generate a continuous function, piecewise constant interpolation is performed to give
\[
Y_t = Y_{n_t}, \qquad \tau_{n_t} \le t < \tau_{n_t + 1},
\]
with $n_t = \max\{ n : \tau_n \le t \}$. Error estimates for deterministic problems cannot be used in this case. However, if the exact solution is known, a statistical analysis based on simulation can be used to determine the global discretization error. By using path-wise approximations, the expectation of the global discretization error is
\[
\epsilon = E\big[\, |X_T - Y_T| \,\big] .
\]
The approximation $Y^{\delta}$ with maximum step size $\delta$ \emph{converges strongly} to $X$ at time $T$ if
\[
\lim_{\delta \downarrow 0} E\big[\, |X_T - Y_T^{\delta}| \,\big] = 0 .
\]
The Euler-Maruyama approximation is the simplest useful approximation for stochastic integration. It is, however, not very numerically efficient. A comprehensive discussion on numerical stochastic integration is given in [2].
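As a concrete sketch (not the experiments of Section 5), an equidistant Euler-Maruyama scheme in Python could be written as follows; the drift $a$, diffusion $b$ and all numerical parameters in the example call are placeholder assumptions.
\begin{verbatim}
import numpy as np

def euler_maruyama(a, b, x0, t0, T, n_steps, rng=None):
    """Euler-Maruyama approximation of dX = a(t,X) dt + b(t,X) dW on [t0, T]."""
    rng = rng or np.random.default_rng()
    dt = (T - t0) / n_steps
    t = t0 + dt * np.arange(n_steps + 1)
    y = np.empty(n_steps + 1)
    y[0] = x0
    for n in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt))     # Wiener increment ~ N(0, dt)
        y[n + 1] = y[n] + a(t[n], y[n]) * dt + b(t[n], y[n]) * dW
    return t, y

# Placeholder example: dX = 1.5*X dt + 0.1*X dW on [0, 1] with 1000 steps.
t, y = euler_maruyama(lambda t, x: 1.5 * x, lambda t, x: 0.1 * x,
                      x0=1.0, t0=0.0, T=1.0, n_steps=1000)
\end{verbatim}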
\section{Evaluation of Trigonometric Functions}\label{sec3}
The evaluation of trigonometric functions consists of two numerical problems: the approximation of the irrational number $\pi$ – which cannot be expressed in a finite sequence of decimals in a floating point representation – and the evaluation of the trigonometric function using basic operations. A numerical experiment was carried out to compare two different methods of evaluation of a trigonometric function with an exact argument. The machine unit on the computer was determined using the method described in \autoref{sec2}.
Hida et al. [3] suggest an algorithm giving a precision of four times that of double precision IEEE arithmetic. The method for computing $\sin x$ and $\cos x$ uses argument reduction, look-up tables and Taylor expansions. In order to compute $\sin x$, the argument is first reduced modulo $2\pi$, so that $x = 2\pi n + r$ with $|r| \le \pi$, due to the fact that the sine function is periodic. Noting that the sine and cosine functions are symmetric with respect to $\pi/2$ and $\pi$, the argument can be reduced further modulo $\pi/2$. In general, $\sin x$ can then be expressed as $\pm \sin r$ or $\pm \cos r$, so that it is only necessary to compute $\sin r$ and $\cos r$ with $|r| \le \pi/4$. Finally, let $r = kb + t$, where $b$ is the spacing of a pre-computed table and the integer $k$ is chosen so that $|t| \le b/2$.
The value of the sine function with an arbitrary argument can then be computed by using the relation
\[
\sin r = \sin(kb + t) = \sin(kb)\cos t + \cos(kb)\sin t ,
\]
where $\sin t$ (and, analogously, $\cos t$) is expanded in a Maclaurin series (Taylor expansion around $0$) and $\sin(kb)$ and $\cos(kb)$ are tabulated pre-computed values. These values are computed using the recursive relation
\[
\sin\frac{x}{2} = \sqrt{\frac{1 - \cos x}{2}}, \qquad \cos\frac{x}{2} = \sqrt{\frac{1 + \cos x}{2}},
\]
starting with $\sin(\pi/4) = \cos(\pi/4) = \sqrt{2}/2$. By using the trigonometric formulas
\[
\sin(u + v) = \sin u \cos v + \cos u \sin v, \qquad \cos(u + v) = \cos u \cos v - \sin u \sin v,
\]
all the values $\sin(kb)$ and $\cos(kb)$ for integers $k$ can be computed. Using the reduction of the argument, the convergence rate for the Maclaurin series is significantly increased. The Maclaurin expansion for a function $f$ is
\[
f(x) = \sum_{j=0}^{n} \frac{f^{(j)}(0)}{j!}\,x^{j} + R_{n+1}(x),
\]
so that
\[
\sin t = t - \frac{t^3}{3!} + \frac{t^5}{5!} - \cdots, \qquad \cos t = 1 - \frac{t^2}{2!} + \frac{t^4}{4!} - \cdots .
\]
The \emph{remainder} after $n$ terms gives an estimate of the absolute error,
\[
R_{n+1}(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\,x^{n+1},
\]
for some $\xi$ between $0$ and $x$. Since all derivatives of the sine and cosine functions are bounded by one in magnitude, an upper bound for the absolute error of the trigonometric functions is given by $|x|^{n+1}/(n+1)!$, and the number of terms $n$ can be chosen to obtain the desired accuracy.
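The following Python sketch (illustrative only, not the algorithm of [3]) sums the Maclaurin series of $\sin t$ and stops when the magnitude of the next term, which bounds the remainder of this alternating series for small $|t|$, falls below a tolerance.
\begin{verbatim}
import math

def sin_maclaurin(t, tol=1e-16):
    """Sum the Maclaurin series of sin(t); stop when the next term,
    which bounds the remaining error, is smaller than tol."""
    term = t           # current term: (-1)^k t^(2k+1) / (2k+1)!
    total = 0.0
    k = 0
    while abs(term) > tol:
        total += term
        k += 1
        term *= -t * t / ((2 * k) * (2 * k + 1))   # next odd-order term
    return total

print(sin_maclaurin(0.5), math.sin(0.5))   # the two values agree closely
\end{verbatim}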
Since the tabulated values are determined with the recursive relation above, the function computations are dependent only on the square root operation, apart from the basic operations. The square root is computed using \emph{Newton's method} on the function
\[
f(x) = \frac{1}{x^2} - A ,
\]
which has the positive root $x = 1/\sqrt{A}$. This gives the iterative formula
\[
x_{k+1} = x_k + \frac{x_k (1 - A x_k^2)}{2} .
\]
Newton's method is quadratically convergent and any degree of accuracy in the result can be obtained with the given precision. The square root is obtained by multiplying the result with $A$, that is, $\sqrt{A} = A \cdot \frac{1}{\sqrt{A}}$.
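A Python sketch of this reciprocal square root iteration is given below; the initial guess and the stopping tolerance are illustrative assumptions.
\begin{verbatim}
def sqrt_newton(A, tol=1e-15, max_iter=60):
    """Compute sqrt(A), A > 0, with Newton's method on f(x) = 1/x**2 - A,
    i.e. x_{k+1} = x_k + x_k*(1 - A*x_k**2)/2, which converges to 1/sqrt(A)."""
    x = 1.0 / A if A > 1.0 else 1.0      # crude initial guess for 1/sqrt(A)
    for _ in range(max_iter):
        x_new = x + x * (1.0 - A * x * x) / 2.0
        converged = abs(x_new - x) <= tol * abs(x_new)
        x = x_new
        if converged:
            break
    return A * x                          # sqrt(A) = A * (1/sqrt(A))

print(sqrt_newton(2.0))   # approximately 1.4142135623730951
\end{verbatim}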
If the elementary operations are assumed to be performed without losing precision (according to the IEEE standard), then the calculation of the sine function benefits from the above-mentioned method, since the error in the truncated Maclaurin expansion is smaller the closer the argument is to zero.
The computation of the sine function, where the argument is assumed to be exact, was performed on a personal computer using two methods: the Maclaurin expansion up to 5 terms without argument reduction (method 1) and with argument reduction using the above procedure (method 2). The difference in the calculated results between the methods is summarized in the following table:
The result obtained with method 2 is here assumed to have the error bound given by the machine unit $\mu$. The example illustrates how the truncation error depends on the argument: by reducing the argument, higher precision can be achieved.
If there is an error $\Delta x$ in the input, the general error propagation formula (see \autoref{sec2}) can be used to estimate the propagated error in the result,
\[
|\Delta \sin x| \approx |\cos x|\,|\Delta x| .
\]
\section{The Algebraic Riccati Equation}\label{sec4}
Many measurements of events evolving in time can be regarded as signals disturbed by noise representing measurement errors. One mathematical model that can be used to describe this is the \emph{dynamic system}, defined by known \emph{state equations}, with an input signal controlling the states of the system and an output signal consisting of the observations of the system states.
A discrete time model can be represented by the matrix equation
\[
x_{k+1} = \Phi x_k + \Gamma u_k + v_k ,
\]
where $k$ is the discrete time step. In this representation, the vector $x_k$ represents the state of the system, where the number of state variables is equal to the length of $x_k$.
The state at time $k + 1$ is a linear combination of previous states (at times less than or equal to $k$) and a control variable $u_k$, with a superimposed white noise process $v_k$. The linear combinations are given by the matrices $\Phi$ and $\Gamma$, respectively. The initial state at time $k = 0$ is $x_0$. Measurements on the system are represented by the matrix equation
\[
y_k = H x_k + e_k ,
\]
where the vector $y_k$ represents the observed values at time $k$, that is the measurements of some linear combination of the state variables of the system at time $k$, with a noise component $e_k$. The linear combination is defined by the matrix $H$.
If the input signal is assumed to be constant with an imposed white noise, the \emph{Kalman filter} defines a scheme to obtain the optimal estimation $\hat{x}_k$ of the actual system states, given the noisy observations $y_k$, see for example [4].
The Kalman estimator is optimal in the sense that it minimizes the mean square error of the state estimate, that is
\[
E\big[ (x_k - \hat{x}_k)^T (x_k - \hat{x}_k) \big]
\]
is minimized. The Kalman filter is given by
\[
\hat{x}_{k+1} = \Phi \hat{x}_k + \Gamma u_k + K ( y_k - H \hat{x}_k ),
\]
where the Kalman gain is given by
\[
K = \Phi P H^T ( H P H^T + R_2 )^{-1} ,
\]
and $P$ is the stationary solution to the discrete matrix \emph{Riccati equation}
\[
P = \Phi P \Phi^T + R_1 - \Phi P H^T ( H P H^T + R_2 )^{-1} H P \Phi^T ,
\]
where $R_1$ and $R_2$ are the covariance matrices of the process noise $v_k$ and the measurement noise $e_k$, respectively. Normally the \emph{steady-state matrices} $K$ and $P$ for $k \to \infty$ are used.
It can be shown (see [4]) that if the system is \emph{controllable} and \emph{observable}, then the discrete-time algebraic Riccati equation has a unique positive semi-definite solution.
Definition: \emph{Construct the \emph{controllability matrix}
\[
\mathcal{C} = \begin{pmatrix} \Gamma & \Phi\Gamma & \cdots & \Phi^{n-1}\Gamma \end{pmatrix} .
\]
Then the matrix pair $(\Phi, \Gamma)$ is called \emph{controllable} if $\operatorname{rank}(\mathcal{C}) = n$.}
Definition: \emph{Construct the \emph{observability matrix}
\[
\mathcal{O} = \begin{pmatrix} H \\ H\Phi \\ \vdots \\ H\Phi^{n-1} \end{pmatrix} .
\]
Then the matrix pair $(\Phi, H)$ is called \emph{observable} if $\operatorname{rank}(\mathcal{O}) = n$.}
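These rank conditions are easy to check numerically. The NumPy sketch below builds both matrices and tests their rank; the example matrices correspond to a sampled double integrator with an assumed time step $T = 1$, anticipating the model used later in this section.
\begin{verbatim}
import numpy as np

def is_controllable(Phi, Gamma):
    """Rank test on [Gamma, Phi Gamma, ..., Phi^(n-1) Gamma]."""
    n = Phi.shape[0]
    C = np.hstack([np.linalg.matrix_power(Phi, k) @ Gamma for k in range(n)])
    return np.linalg.matrix_rank(C) == n

def is_observable(Phi, H):
    """Rank test on [H; H Phi; ...; H Phi^(n-1)]."""
    n = Phi.shape[0]
    O = np.vstack([H @ np.linalg.matrix_power(Phi, k) for k in range(n)])
    return np.linalg.matrix_rank(O) == n

T = 1.0                                   # assumed sample step
Phi = np.array([[1.0, T], [0.0, 1.0]])
Gamma = np.array([[T**2 / 2.0], [T]])
H = np.array([[1.0, 0.0]])
print(is_controllable(Phi, Gamma), is_observable(Phi, H))   # True True
\end{verbatim}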
As a numerical example, let the vertical motion of a cruising aircraft be modelled by the one-dimensional motion of a particle, that is
\[
m \ddot{z}(t) = F(t),
\]
where $z$ is the position of the particle and $F$ is the applied force (the mass is here taken as $m = 1$). If the deterministic part of the force is constant (it is assumed here that it is zero), then the remaining force can be assumed to be the influence of turbulence on the aircraft motion, which is modelled by the white noise process $v(t)$. The tracking system performs measurements of the height $z$, which are assumed to be disturbed by measurement errors modelled by white noise $e(t)$. Define the state space variables $x_1$ to be the position and $x_2$ to be the velocity of the aircraft. Then
\[
\dot{x}_1(t) = x_2(t), \qquad \dot{x}_2(t) = v(t),
\]
so that
\[
\dot{x}(t) = A x(t) + B v(t), \qquad y(t) = H x(t) + e(t),
\qquad
A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \quad
B = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad
H = \begin{pmatrix} 1 & 0 \end{pmatrix},
\]
where the matrices are time-independent and $v$ and $e$ are scalar white noise processes. Furthermore, it is assumed that the tracking system is operating in discrete time, sampled with time step $T$. If $T$ is small, the applied force can be approximated by a piecewise constant signal. Integration of the state equation gives
\[
x_{k+1} = e^{AT} x_k + \Big( \int_0^T e^{As}\,ds \Big) B\, v_k ,
\]
by the assumption of the scalar noise process $v$ being constant on each sampling interval and being independent of $x_k$. Series expansion gives
\[
e^{AT} = I + AT + \frac{(AT)^2}{2!} + \cdots = \begin{pmatrix} 1 & T \\ 0 & 1 \end{pmatrix},
\]
since $A^2 = 0$, and therefore
\[
\Big( \int_0^T e^{As}\,ds \Big) B = \begin{pmatrix} T^2/2 \\ T \end{pmatrix},
\]
which gives the resulting discrete time state equation
\[
x_{k+1} = \begin{pmatrix} 1 & T \\ 0 & 1 \end{pmatrix} x_k + \begin{pmatrix} T^2/2 \\ T \end{pmatrix} v_k ,
\qquad
y_k = \begin{pmatrix} 1 & 0 \end{pmatrix} x_k + e_k ,
\]
where $x_1(k)$ and $x_2(k)$ are the position and velocity of the particle at time $kT$, respectively. The system is controllable and observable for $T > 0$, since
\[
\operatorname{rank} \begin{pmatrix} T^2/2 & 3T^2/2 \\ T & T \end{pmatrix} = 2,
\qquad
\operatorname{rank} \begin{pmatrix} 1 & 0 \\ 1 & T \end{pmatrix} = 2 .
\]
The discrete time algebraic Riccati equation can be interpreted as a system of nonlinear equations. The form of the equation suggests that the solution can be found by iterating the Riccati equation.
As initial value, any positive definite matrix, for example $P_0 = I$, can be used. The result from each iteration is tested componentwise using the maximum norm (see \autoref{sec2}), so that the componentwise maximum error is used as a stop condition for the iteration. The result after a few iterations is
with a componentwise error of approximately . The corresponding Kalman gain and filter are, respectively
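A minimal NumPy sketch of this fixed-point iteration is given below; the sample step $T = 1$ and the unit noise covariances are illustrative assumptions, not the values used in the experiment above.
\begin{verbatim}
import numpy as np

def solve_dare_by_iteration(Phi, H, R1, R2, tol=1e-12, max_iter=10000):
    """Iterate P <- Phi P Phi' + R1 - Phi P H'(H P H' + R2)^(-1) H P Phi',
    starting from P = I, until the maximum-norm change is below tol."""
    P = np.eye(Phi.shape[0])
    for _ in range(max_iter):
        S = H @ P @ H.T + R2
        P_new = Phi @ P @ Phi.T + R1 \
                - Phi @ P @ H.T @ np.linalg.solve(S, H @ P @ Phi.T)
        if np.max(np.abs(P_new - P)) < tol:
            return P_new
        P = P_new
    raise RuntimeError("Riccati iteration did not converge")

T = 1.0                                  # assumed sample step
Phi = np.array([[1.0, T], [0.0, 1.0]])
Gamma = np.array([[T**2 / 2.0], [T]])
H = np.array([[1.0, 0.0]])
R1 = Gamma @ Gamma.T                     # process noise covariance (unit-variance v)
R2 = np.array([[1.0]])                   # measurement noise covariance
P = solve_dare_by_iteration(Phi, H, R1, R2)
K = Phi @ P @ H.T @ np.linalg.inv(H @ P @ H.T + R2)    # Kalman gain
\end{verbatim}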
\section{Numerical Solution of a Stochastic Differential Equation}\label{sec5}
A numerical experiment was performed on the It{\^o} process
\[
dX_t = a X_t\,dt + b X_t\,dW_t, \qquad X_0 = x_0 ,
\]
on the interval $[0, T]$. The equation has the exact solution
\[
X_t = x_0 \exp\!\big( (a - \tfrac{1}{2} b^2)\,t + b W_t \big) .
\]
To make an error estimation on a computer, simulations of the same sample paths of the It{\^o} process and their Euler approximation corresponding to the same sample paths of the Wiener process are repeated $M \cdot N$ times. The error estimate (see \autoref{sec2}) is given by
\[
\hat{\epsilon} = \frac{1}{MN} \sum_{j=1}^{M} \sum_{k=1}^{N} |X_{T,j,k} - Y_{T,j,k}| .
\]
In order to estimate the variance of $\hat{\epsilon}$, the simulations are arranged into $M$ batches of $N$ simulations each. The average errors
\[
\hat{\epsilon}_j = \frac{1}{N} \sum_{k=1}^{N} |X_{T,j,k} - Y_{T,j,k}| , \qquad j = 1, \dots, M,
\]
of the batches are independent and approximately Gaussian for large $N$. Student's $t$-distribution with $M - 1$ degrees of freedom is used to construct a confidence interval $\hat{\epsilon} \pm \Delta\hat{\epsilon}$ for $\epsilon = E\big[ |X_T - Y_T| \big]$, where
\[
\Delta\hat{\epsilon} = t_{1-\alpha/2,\,M-1} \sqrt{ \frac{\hat{\sigma}_{\epsilon}^2}{M} } ,
\qquad
\hat{\sigma}_{\epsilon}^2 = \frac{1}{M - 1} \sum_{j=1}^{M} \big( \hat{\epsilon}_j - \hat{\epsilon} \big)^2 .
\]
The Euler-Maruyama method has order of strong convergence $1/2$. The solution was simulated and the Euler-Maruyama approximation was computed with a fixed time step. In order to estimate the global discretization error, the simulations and approximations were repeated as described above, which yielded the error estimates
An approximate confidence interval for $\epsilon$, that is, the expected deviation of the Euler-Maruyama approximation from the exact solution at time $T$, is therefore given by . Some simulations and the corresponding Euler-Maruyama approximations are shown in Figure 1.
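A sketch of this statistical error estimation in Python follows. The drift $a = 1.5$, diffusion $b = 1.0$, time step, batch count $M$ and batch size $N$ are illustrative assumptions, and SciPy is used only for the Student-$t$ quantile.
\begin{verbatim}
import numpy as np
from scipy.stats import t as student_t

def strong_error_estimate(a, b, x0, T, n_steps, M, N, alpha=0.1, rng=None):
    """Estimate E|X_T - Y_T| for dX = a X dt + b X dW with M batches of
    N Euler-Maruyama paths, and a Student-t confidence half-width."""
    rng = rng or np.random.default_rng()
    dt = T / n_steps
    batch_means = np.empty(M)
    for j in range(M):
        errors = np.empty(N)
        for k in range(N):
            dW = rng.normal(0.0, np.sqrt(dt), n_steps)
            y = x0
            for n in range(n_steps):            # Euler-Maruyama path
                y = y + a * y * dt + b * y * dW[n]
            # exact solution driven by the same Wiener increments
            x_exact = x0 * np.exp((a - 0.5 * b**2) * T + b * dW.sum())
            errors[k] = abs(x_exact - y)
        batch_means[j] = errors.mean()
    eps_hat = batch_means.mean()
    s = batch_means.std(ddof=1)
    half_width = student_t.ppf(1.0 - alpha / 2.0, M - 1) * s / np.sqrt(M)
    return eps_hat, half_width

eps_hat, hw = strong_error_estimate(a=1.5, b=1.0, x0=1.0, T=1.0,
                                    n_steps=100, M=20, N=100)
print(f"estimated strong error: {eps_hat:.4f} +/- {hw:.4f}")
\end{verbatim}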
\section{Conclusion}\label{sec6}
In this paper, three examples are given to illustrate estimation of different types of errors. In numerical computation, it is essential to identify error sources and follow the propagation of errors throughout the calculations. Solving a problem using two different methods is a powerful way to distinguish error sources, when this approach is feasible. Two different methods are used to evaluate a trigonometric function, which reveals the impact of different error sources.
For matrix equations, error estimation is more complex, since a suitable (matrix) norm has to be chosen. The discrete time Riccati equation can be solved by iteration, and the use of the maximum norm as a measure of the error is illustrated in the second example.
For numerical solution methods for stochastic problems, statistical methods have to be used for error estimation. This approach is illustrated in the example of numerical solution of a stochastic differential equation.
\section{References}
[bibshow][bibcite key=Approx,SDE,Hida,Control][/bibshow]