7 Appendix: Notes on Gaussian Approximation
Mathematically, computing the mutual information between trajectories in the Gaussian framework is relatively simple. From Shannon's formula for the entropy of a multivariate Gaussian distribution it follows that the MI between the input $\mathbf{s}$ and the output $\mathbf{x}$ is given by

$$ I(\mathbf{s};\mathbf{x}) = \frac{1}{2} \ln \frac{\det C^{ss}\,\det C^{xx}}{\det Z} \,; \tag{7.1} $$

see [1,2] for details. Here $C^{ss}$ and $C^{xx}$ are the covariance matrices of input and output, and $Z$ is the joint covariance matrix, all defined in Section 7.1 below. This is a straightforward formula involving the logarithm of determinants of symmetric matrices.
However, evaluating this formula correctly and efficiently on a computer requires some thought. The matrices $C^{ss}$ and $C^{xx}$ have dimensions $N \times N$, and the matrix $Z$ has dimensions $2N \times 2N$, where $N$ is the trajectory length. Thus, for long trajectories, the involved matrices can become very large. Computing the determinant of a large matrix is computationally non-trivial and possibly numerically unstable. Note that the generic algorithm to compute the determinant of a matrix of dimensions $n \times n$ scales with order $\mathcal{O}(n^3)$. Thus, doubling the size of the matrix makes the computation of the determinant take approximately 8 times as long. Moreover, the determinant of a large matrix may be extremely small or extremely large. Representing such numbers in the computer using floating-point arithmetic leads to underflow or overflow and hence to significant numerical accuracy issues. In practice this further limits the maximal size of covariance matrices that one can use.
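To see how quickly this becomes a problem, consider the following minimal sketch (in Python with NumPy; the matrix is a generic stand-in, not one of the covariance matrices above). Even for a perfectly well-conditioned matrix, the determinant overflows double precision while its logarithm remains unproblematic:

```python
import numpy as np

# A well-conditioned symmetric positive definite matrix: the identity scaled by 2.
# Its determinant is 2**n, which overflows double precision for n > 1024.
n = 2000
C = 2.0 * np.eye(n)

print(np.linalg.det(C))            # inf -- the true value 2**2000 is not representable
sign, logdet = np.linalg.slogdet(C)
print(logdet)                      # 1386.29... == 2000 * ln(2), perfectly representable
```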
Fortunately, we can leverage the special structure of the Gaussian covariance matrices to simplify the computation of determinants significantly. Generally, all involved matrices are symmetric and positive definite. Using the Cholesky decomposition we can speed up the computation of the determinant of such a matrix. But even then, we still have a scaling of $\mathcal{O}(n^3)$ with matrix size. Using one additional assumption we can do much better. If the system under consideration is in steady state, the matrices $C^{ss}$, $C^{xx}$, $C^{sx}$, and $C^{xs}$ have Toeplitz structure. We will see that this structure allows us to construct an $\mathcal{O}(n \log n)$ algorithm that approximates the determinant very well for large $n$. Lastly, to deal with numerical accuracy issues, we will discuss how to compute the log-determinant of a matrix directly. This is dramatically more accurate than first computing the determinant and then taking the logarithm.
7.1 Structure of Covariance Matrices in the Gaussian Framework
We consider a Gaussian system with stochastic input $\mathbf{s}$ and output $\mathbf{x}$. Both input and output are vectors representing trajectories $s(t)$ and $x(t)$ sampled at $N$ discrete times $t_1, \ldots, t_N$. Hence, $N$ is the dimensionality of the vector space of $\mathbf{s}$ and $\mathbf{x}$. The joint probability density of $\mathbf{s}$ and $\mathbf{x}$ (taken to have zero mean for notational simplicity) is given by

$$ p(\mathbf{s},\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^{2N} \det Z}} \exp\left[ -\frac{1}{2} \begin{pmatrix} \mathbf{s} \\ \mathbf{x} \end{pmatrix}^{\!T} Z^{-1} \begin{pmatrix} \mathbf{s} \\ \mathbf{x} \end{pmatrix} \right], \tag{7.2} $$

which is the PDF of a $2N$-dimensional multivariate Gaussian distribution. $Z$ is a $2N \times 2N$ matrix with the block form

$$ Z = \begin{pmatrix} C^{ss} & C^{sx} \\ C^{xs} & C^{xx} \end{pmatrix}. \tag{7.3} $$

The individual block entries of $Z$ correspond to the (cross-)covariance matrices of input and output. Specifically, for $a,b \in \{s,x\}$, $C^{ab}$ is an $N \times N$ matrix with entries

$$ C^{ab}_{ij} = \langle a(t_i)\, b(t_j) \rangle - \langle a(t_i) \rangle \langle b(t_j) \rangle \,. \tag{7.4} $$
We can make a few observations concerning the form of these matrices. It follows immediately from the definition that both $C^{ss}$ and $C^{xx}$ are positive definite, symmetric matrices. $C^{sx}$ and $C^{xs}$ are generally not symmetric, but they are transposes of each other, i.e., $C^{sx} = (C^{xs})^T$. It follows from these two observations that $Z$ itself is symmetric. Furthermore, we usually deal with systems in steady state, in which case the matrices have even more structure. When a system is in steady state, the correlation functions are (time-)translation invariant. That means that the correlation of observations at $t_i$ and $t_j$ only depends on the difference $t_j - t_i$, i.e., we can find functions $c^{ab}$ such that $C^{ab}_{ij} = c^{ab}(t_j - t_i)$. For evenly spaced sampling times, $t_j - t_i = (j-i)\,\Delta t$, so the matrix element $C^{ab}_{ij}$ only depends on the index difference $j - i$. We can thus write the matrix elements as $C^{ab}_{ij} = c^{ab}_{j-i}$, where the $c^{ab}_k$ are the components of a vector that fully specifies the matrix. A matrix with this structure is called a Toeplitz matrix. We can use this structure to simplify the computation of determinants significantly, as illustrated by the example below and described in detail in the next section.
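As a small illustration, the following sketch (in Python with SciPy; the exponential covariance function is a hypothetical stand-in for a concrete system) builds a symmetric Toeplitz covariance block from a stationary covariance function:

```python
import numpy as np
from scipy.linalg import toeplitz

# Hypothetical stationary covariance function c(tau) = exp(-|tau|); for a
# concrete system this would come from the model's correlation functions.
def c(tau):
    return np.exp(-np.abs(tau))

N, dt = 5, 0.5
ck = c(np.arange(N) * dt)     # vector (c_0, c_1, ..., c_{N-1})
C = toeplitz(ck)              # symmetric Toeplitz block with C_ij = c_{|j-i|}
print(C)
```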
Before we explain how to compute determinants efficiently, we discuss one further issue regarding the matrix $Z$ itself. Note that unlike the matrices $C^{ab}$, $Z$ as a whole does not have Toeplitz structure in general. This means the method described below is not directly applicable for the computation of $\det Z$. Yet, it seems we need $\det Z$ to compute the MI using Equation 7.1. It turns out that this is not necessary. One can make use of the fact that $Z$ is composed of Toeplitz blocks to avoid computing its determinant. At the end of the following section we show how to adapt the formula described in Ref. [3] for computing the mutual information rate to derive an efficient approximation of it in terms of the discrete Fourier transform.
7.2 Efficient Evaluation of Determinants
For a symmetric positive definite matrix $C$ the determinant can be computed faster than for a general matrix. We exploit the fact that such a matrix can always be decomposed as

$$ C = L L^T \,, \tag{7.5} $$

where $L$ is a lower triangular matrix with positive diagonal entries. This factorization is called the Cholesky decomposition, and efficient routines to compute it are provided by many numerical linear algebra packages.
Now we make use of two basic properties of the determinant. First, the determinant of a product of matrices is equal to the product of the individual matrices' determinants. Second, the determinant of the transpose is equal to the determinant of the original matrix. Combining these two facts, we see that $\det C = \det(L)\det(L^T) = (\det L)^2$. Now, we remember that $L$ is a lower triangular matrix. It is easily checked that the determinant of a lower triangular matrix is given by the product of its diagonal entries, $\det L = \prod_{i=1}^{n} L_{ii}$, where $n$ is the dimension of the matrix. Thus, we find two important identities for symmetric positive definite matrices:

$$ \det C = \prod_{i=1}^{n} L_{ii}^2 \qquad\text{and}\qquad \ln \det C = 2 \sum_{i=1}^{n} \ln L_{ii} \,. \tag{7.6} $$
The second form is especially useful for large matrices, where computing the log-determinant directly in this way is numerically preferable to first computing the determinant itself.
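As a concrete illustration, the following minimal sketch (again in Python with NumPy, on a randomly generated stand-in matrix) implements Equation 7.6:

```python
import numpy as np

def logdet_cholesky(C):
    """Log-determinant of a symmetric positive definite matrix via Cholesky.

    Uses ln det C = 2 * sum_i ln L_ii, avoiding the overflow/underflow that
    would occur if det C were formed explicitly.
    """
    L = np.linalg.cholesky(C)                 # C = L @ L.T, L lower triangular
    return 2.0 * np.sum(np.log(np.diag(L)))

# Example: a random symmetric positive definite test matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(500, 500))
C = A @ A.T + 500 * np.eye(500)               # guarantees positive definiteness
print(logdet_cholesky(C))
print(np.linalg.slogdet(C)[1])                # agrees with NumPy's reference value
```

Note that the Cholesky factorization still costs $\mathcal{O}(n^3)$ operations; it improves the constant factor and, via the sum of logarithms, the numerical robustness.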
We can also exploit the structure of symmetric Toeplitz matrices to further simplify the computation of the determinant. However, here we won’t obtain an exact result but only an asymptotic approximation. This will nonetheless be very useful for large matrices where even the Cholesky factorization can be difficult to obtain.
We will investigate the asymptotic behavior of the eigenvalues of Toeplitz matrices. First, we recall the defining property of Toeplitz matrices. Let $T$ be a Toeplitz matrix. Then we can write $T_{ij} = c_{j-i}$, where $c$ is a vector that fully specifies the matrix. We will now try to understand what happens in the asymptotic limit in which the vector $c$ is very high dimensional. Thus, suppose we have an infinite sequence of real numbers $(c_k)_{k\in\mathbb{Z}}$. From this sequence we can construct a sequence of Toeplitz matrices $T_1, T_2, \ldots$ as follows. For any $n$, we define the matrix elements of the $n \times n$ matrix $T_n$ as

$$ (T_n)_{ij} = c_{j-i} \qquad \text{for } i,j = 1,\ldots,n \,. \tag{7.7} $$

The definition of $T_n$ thus only needs the elements $c_{-(n-1)}, \ldots, c_{n-1}$ from the sequence. Note that for a symmetric Toeplitz matrix $c_{-k} = c_k$.
Under some mild conditions there exists a continuous $2\pi$-periodic function $f(\theta)$ that is defined through the discrete-time Fourier transform

$$ f(\theta) = \sum_{k=-\infty}^{\infty} c_k\, e^{ik\theta} \,. \tag{7.8} $$

Conversely, the $c_k$ can be recovered from $f$ using the inverse transform

$$ c_k = \frac{1}{2\pi} \int_0^{2\pi} f(\theta)\, e^{-ik\theta}\, \mathrm{d}\theta \,. \tag{7.9} $$

Thus the sequence $(c_k)_{k\in\mathbb{Z}}$ determines the function $f$ and vice versa. Moreover, this means that the entire sequence of Toeplitz matrices $T_1, T_2, \ldots$ is completely specified by the continuous function $f$.
Now we are ready to state one of the most important theorems about Toeplitz matrices: Szegő's theorem. Let $\lambda^{(n)}_1, \ldots, \lambda^{(n)}_n$ be the eigenvalues of the matrix $T_n$. Szegő's theorem states that

$$ \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} F\big(\lambda^{(n)}_i\big) = \frac{1}{2\pi} \int_0^{2\pi} F\big(f(\theta)\big)\, \mathrm{d}\theta \tag{7.10} $$

for any continuous function $F$. This result is very useful for finding the asymptotic determinant. In particular, note that the determinant is the product of the eigenvalues, and hence $\ln \det T_n = \sum_{i=1}^{n} \ln \lambda^{(n)}_i$. Thus, choosing $F = \ln$, it is not hard to show that

$$ \lim_{n\to\infty} \frac{1}{n} \ln \det T_n = \frac{1}{2\pi} \int_0^{2\pi} \ln f(\theta)\, \mathrm{d}\theta \,, \tag{7.11} $$

which we can leverage to compute information rates. In particular, this formula can be used for the analytical computation of entropy rates of Gaussian processes [1].
We will instead use this theorem to find an approximate way to compute the determinant of a large Toeplitz matrix. We can approximate the integral in Equation 7.11 by a Riemann sum,

$$ \frac{1}{2\pi} \int_0^{2\pi} \ln f(\theta)\, \mathrm{d}\theta \approx \frac{1}{m} \sum_{j=0}^{m-1} \ln f(\theta_j) \,, \tag{7.12} $$

where $\theta_j = 2\pi j/m$ for some large integer $m$. Now, inserting Equation 7.8 we find

$$ \ln \det T_n \approx \frac{n}{m} \sum_{j=0}^{m-1} \ln \sum_{k=-\infty}^{\infty} c_k\, e^{2\pi i\,jk/m} \,. \tag{7.13} $$
In practice we cannot perform the infinite sum over $k$. Thus, we must truncate the series, which we do by setting $c_k = 0$ for $|k| \geq m/2$. Then, the sum over $k$ becomes a discrete Fourier transform (DFT). We recall that the DFT of a sequence $d_k$ for $k = 0, \ldots, m-1$ is another sequence $\hat{d}_j$, defined by

$$ \hat{d}_j = \sum_{k=0}^{m-1} d_k\, e^{2\pi i\,jk/m} \tag{7.14} $$

for $j = 0, \ldots, m-1$. Because $e^{2\pi i\,jk/m}$ is periodic in $k$ with period $m$, the truncated inner sum in Equation 7.13 equals the DFT of the periodically wrapped sequence $d_k = c_k$ for $0 \leq k < m/2$ and $d_k = c_{k-m}$ for $m/2 < k \leq m-1$. Hence, we have derived the approximation

$$ \ln \det T_n \approx \frac{n}{m} \sum_{j=0}^{m-1} \ln \hat{d}_j \,. \tag{7.15} $$

The DFT coefficients $\hat{d}_j$ thus represent the inner sum in Equation 7.13, truncated to $|k| < m/2$. In principle we want to choose $m$ as large as possible. However, given a matrix $T_n$ we only know the values $c_{-(n-1)}, \ldots, c_{n-1}$. Thus, the maximum value of $m$ that we can use is $m = 2n - 1$.
This means our approximation for the log-determinant of $T_n$ for large $n$ can be written as

$$ \ln \det T_n \approx \frac{n}{2n-1} \sum_{j=0}^{2n-2} \ln \hat{d}_j \,. \tag{7.16} $$
It is easy to verify that this formula converges to Equation 7.13 as $n \to \infty$. It is also very efficient to evaluate on modern computers: the sequence $\hat{d}_j$ can be computed from the $d_k$ via the FFT algorithm in $\mathcal{O}(n \log n)$ operations, and efficient implementations of the FFT are widely available.
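Putting the pieces together, a minimal sketch of Equation 7.16 in Python/NumPy could look as follows (the exponentially decaying covariance sequence at the bottom is a hypothetical test case):

```python
import numpy as np
from scipy.linalg import toeplitz

def logdet_toeplitz(c):
    """Szegő-type approximation of ln det T_n for a symmetric Toeplitz matrix,
    following Equation 7.16.

    c : array of length n containing c_0, ..., c_{n-1} (the first column of T_n).
    """
    n = len(c)
    # Periodically wrapped sequence d = (c_0, ..., c_{n-1}, c_{n-1}, ..., c_1)
    # of length m = 2n - 1.
    d = np.concatenate([c, c[-1:0:-1]])
    # For this real, symmetric sequence the DFT is real, and the sign
    # convention of the exponent in the transform does not matter.
    d_hat = np.fft.fft(d).real
    return n / (2 * n - 1) * np.sum(np.log(d_hat))

# Hypothetical test case: exponentially decaying covariances c_k = exp(-k/4).
n = 512
c = np.exp(-np.arange(n) / 4.0)
print(logdet_toeplitz(c))                   # O(n log n) approximation
print(np.linalg.slogdet(toeplitz(c))[1])    # exact O(n^3) reference
```

For a test case like this, the error of the approximation stays bounded while $\ln \det T_n$ grows linearly in $n$, so the relative error vanishes for large $n$.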
Finally, we discuss how to evaluate the mutual information rate, defined as $\mathrm{MI} = \lim_{N\to\infty} I(\mathbf{s};\mathbf{x})/N$, i.e., the mutual information per sampling step (the rate per unit time follows by dividing by the sampling interval $\Delta t$), using Szegő's theorem. Munakata and Kamiyabu [3] used Szegő's theorem to derive the formula

$$ \mathrm{MI} = -\frac{1}{4\pi} \int_0^{2\pi} \ln\!\left( 1 - \frac{|f^{sx}(\theta)|^2}{f^{ss}(\theta)\, f^{xx}(\theta)} \right) \mathrm{d}\theta \,, \tag{7.17} $$

where the functions $f^{ab}(\theta)$ represent the discrete-time Fourier transforms of the covariance sequences $c^{ab}_k$ of the matrices $C^{ab}$ in the limit $N \to \infty$. As above, we approximate this formula using the discrete Fourier transform. We define DFT sequences $\hat{d}^{ab}_j$ for the matrices $C^{ab}$ as in Equation 7.14, for $j = 0, \ldots, m-1$. This results in the estimate for the mutual information rate

$$ \mathrm{MI} \approx -\frac{1}{2m} \sum_{j=0}^{m-1} \ln\!\left( 1 - \frac{\big|\hat{d}^{sx}_j\big|^2}{\hat{d}^{ss}_j\, \hat{d}^{xx}_j} \right). \tag{7.18} $$
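A minimal sketch of Equation 7.18, assuming the covariance sequences have already been computed (the function name and argument layout are illustrative, not part of any established API):

```python
import numpy as np

def mi_rate(c_ss, c_xx, c_sx_pos, c_sx_neg):
    """Estimate of the Gaussian mutual information rate via Equation 7.18.

    c_ss, c_xx : autocovariances c^ss_k and c^xx_k for k = 0, ..., n-1
    c_sx_pos   : cross-covariances c^sx_k for k = 0, ..., n-1
    c_sx_neg   : cross-covariances c^sx_{-k} for k = 0, ..., n-1
    Returns the rate in nats per sampling step.
    """
    n = len(c_ss)
    m = 2 * n - 1
    # Periodically wrapped length-m sequences, as in Equation 7.16. The
    # cross-covariance is generally not symmetric, so its negative-lag
    # values enter the wrap explicitly.
    d_ss = np.concatenate([c_ss, c_ss[-1:0:-1]])
    d_xx = np.concatenate([c_xx, c_xx[-1:0:-1]])
    d_sx = np.concatenate([c_sx_pos, c_sx_neg[-1:0:-1]])

    f_ss = np.fft.fft(d_ss).real      # real: symmetric sequences
    f_xx = np.fft.fft(d_xx).real
    f_sx = np.fft.fft(d_sx)           # complex in general

    coherence = np.abs(f_sx) ** 2 / (f_ss * f_xx)
    # log1p(-x) evaluates ln(1 - x) accurately when the coherence is small.
    return -np.sum(np.log1p(-coherence)) / (2 * m)
```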
More information about the asymptotic behavior of Toeplitz matrices and related results can be found in the review by Gray [4].