1 Introduction
The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design.
We live in the era of information. Information technology permeates every aspect of modern life, shaping how we communicate, learn, interact socially, and spend our leisure time. Beyond daily life, information plays a crucial role in fields like physics, biology, neuroscience, and engineering, where it is used to study and enhance the function of complex systems and machines. Quantifying the flow of information within these domains is essential, and, although the concept of information is abstract, its power in explaining the processes that shape our world is profound.
The explanatory power of information stems from the intrinsic link between information and performance. Without a potential reward, or the possibility of avoiding harm, information has no value.1 As a result, information collection and processing typically serves a clear purpose. For example, a self-driving car processes information from its sensors in order to make decisions about navigation [3]. Similarly, bacteria acquire chemical information about their environment in order to optimize their movement toward nutrients and away from toxins, maximizing their chance of survival [4]. More generally, evolutionary biology explores the link between genetic information and fitness [5,6]. Thus, whether in biological organisms or engineered systems, understanding how information is used is essential for optimizing performance.
Quantifying information transmission is vital for understanding and improving natural or engineered information-processing systems. Shannon’s information theory [7] provides the framework for studying the efficiency and reliability of any communication channel, whether it is a telephone line, a biochemical signaling cascade, or a neural pathway in the brain. The cornerstone of information theory is a set of mathematical definitions to rigorously quantify amounts of information. These make it possible to determine, in absolute terms, the amount of information that is transmitted by a given information-processing mechanism, for a specific input signal. Moreover, it is possible to quantify the maximum amount of information that can be transmitted through a given mechanism under optimal conditions: this limit is known as the channel capacity, measured in bits per time unit. Shannon’s information measures enable us to characterize a wide range of systems in terms of their information transmission capabilities.
Information theory has found many applications across disciplines, and is frequently used to understand and improve sensory or computational systems. In biology, information transmission is studied, e.g., in the brain, by analyzing the timing of electrical impulses between neurons [8,9]. Within cells, information flow in biochemical signaling and transcription regulation has been extensively studied by analyzing biochemical pathways [10–12]. In artificial intelligence, information theory has proven useful in improving learning in neural networks. The information bottleneck theory [13] suggests that the performance of neural networks can be enhanced by balancing compression and information retention during training [14,15]. In economics and finance, information theory has been applied to describe financial markets [16] and to optimize financial decision-making under uncertainty [17]. In optics, information theory is employed to study the efficiency of signal processing in optical resonators, with applications in precision sensing and optical computing [18,19]. Information theory boasts a wealth of applications and is essential for the analysis and theoretical understanding of information-processing systems.
The canonical measure for the quality of information transmission is the mutual information. It quantifies how much information is shared between two random variables, such as the input and output signals of an information-processing mechanism, see Figure 1.1. Let $S$ and $X$ be two random variables that are jointly distributed according to the density $P(s, x)$ and with marginal densities $P(s)$ and $P(x)$. The mutual information between $S$ and $X$ is then defined as

$$ I(S; X) = \int \mathrm{d}s\, \mathrm{d}x\; P(s, x) \ln \frac{P(s, x)}{P(s)\, P(x)} \tag{1.1} $$
and provides a measure of correlation between the random variables.2 From the definition it follows that $I(S; X) = 0$ only if $S$ and $X$ are statistically independent, and $I(S; X) > 0$ otherwise. Thus, the mutual information quantifies the statistical dependence between random variables, equally characterizing the degree of influence from $S$ on $X$ and from $X$ on $S$. Hence, the mutual information is a symmetric measure, satisfying $I(S; X) = I(X; S)$. In a typical information-processing system, the input $S$ influences the output $X$, but there is no feedback from $X$ to $S$. In such cases, the mutual information provides a measure for how effectively information about $S$ is transmitted through the system into the output $X$.
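To make Equation 1.1 concrete, the short sketch below estimates the mutual information between paired scalar samples with a simple plug-in (histogram) estimator. It is only an illustration of the definition, not one of the methods developed in this thesis; the function name and the bin count are arbitrary choices.

```python
import numpy as np

def mutual_information_plugin(samples_s, samples_x, bins=32):
    """Plug-in estimate of I(S;X) in nats from paired scalar samples,
    using a 2D histogram as a stand-in for the joint density P(s, x)."""
    joint, _, _ = np.histogram2d(samples_s, samples_x, bins=bins)
    p_sx = joint / joint.sum()                     # empirical joint distribution
    p_s = p_sx.sum(axis=1, keepdims=True)          # marginal over x
    p_x = p_sx.sum(axis=0, keepdims=True)          # marginal over s
    nonzero = p_sx > 0
    return float(np.sum(p_sx[nonzero] * np.log(p_sx[nonzero] / (p_s @ p_x)[nonzero])))
```

Such plug-in estimators work well for low-dimensional variables but, as discussed below, break down when the variables are entire trajectories.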
In biological systems, information transmission has frequently been quantified via the instantaneous mutual information (IMI) $I(s_t; x_{t'})$, i.e. the mutual information between the stimulus and the response at two given time points. This measure has been applied to analyze biochemical pathways [12,22,25–29] and neural spiking dynamics [8,30]. However, in many cases, the IMI cannot correctly quantify information transmission due to correlations within the input or the output, which reduce the total information transmitted. More generally, information may be encoded in the temporal patterns of signals, which cannot be captured by a pointwise information measure like the IMI. Thus, the IMI is generally inadequate for computing information transmission in systems that process dynamical signals.
There are many examples of information being encoded in dynamical features of signals. In cellular Ca$^{2+}$ signaling, information seems to be encoded in the timing and duration of calcium bursts [31], while in the MAPK pathway information is encoded in the amplitude and duration of the transient phosphorylation response to external stimuli [32,33]. Moreover, there are reasons to believe that encoding information in dynamical signal features is advantageous for reliable information transmission [34]. Studying the information transmitted via temporal features is thus highly desirable but not possible with an instantaneous information measure. Therefore, in cases where the dynamics of input or output time-series may carry relevant information, the need for appropriate dynamical information measures has been widely recognized [4,33,35–42].
The natural measure for quantifying information transmission via dynamical signals is the trajectory mutual information. It takes into account the total information encoded in the input and output trajectories of a system, and therefore captures all information transmitted over a specific time interval. Conceptually, its definition is simple. The trajectory mutual information is the mutual information between the input and output trajectories of a stochastic process, given by

$$ I(\boldsymbol{S}; \boldsymbol{X}) = \int \mathcal{D}\boldsymbol{s}\, \mathcal{D}\boldsymbol{x}\; P(\boldsymbol{s}, \boldsymbol{x}) \ln \frac{P(\boldsymbol{s}, \boldsymbol{x})}{P(\boldsymbol{s})\, P(\boldsymbol{x})} \tag{1.2} $$
where the bold symbols $\boldsymbol{s}$ and $\boldsymbol{x}$ are used to denote trajectories. These trajectories arise from a stochastic process that defines the joint probability distribution $P(\boldsymbol{s}, \boldsymbol{x})$. The integral itself runs over all possible input and output trajectories.
The closely related mutual information rate is defined as the rate at which the trajectory mutual information increases with the duration of the trajectories in the long-time limit. Let $\boldsymbol{S}_T$ and $\boldsymbol{X}_T$ be trajectories of duration $T$; then the mutual information rate is given by

$$ R = \lim_{T \to \infty} \frac{I(\boldsymbol{S}_T; \boldsymbol{X}_T)}{T} \,. \tag{1.3} $$
The mutual information rate quantifies how many independent messages can be transmitted per unit time, on average, via a communication channel. It depends both on the signal statistics of the input and on the transmission properties of the channel. In the absence of feedback it is equal to the transfer entropy [43,44].
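In practice, the limit in Equation 1.3 is approached by computing the trajectory mutual information for several trajectory durations and extracting the asymptotic slope. The sketch below illustrates this step, assuming the MI estimates have already been obtained; the function name is a placeholder.

```python
import numpy as np

def information_rate(durations, mi_values):
    """Estimate the mutual information rate R (Eq. 1.3) as the asymptotic slope
    of I(S_T; X_T) versus T; a linear fit over sufficiently long durations
    approximates the long-time limit."""
    slope, _intercept = np.polyfit(np.asarray(durations, float),
                                   np.asarray(mi_values, float), deg=1)
    return slope  # nats per unit time
```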
The trajectory mutual information and the mutual information rate are fundamental measures for information transmission in dynamical systems. They serve as key performance metrics for biochemical signaling networks [12,36], as well as for neural sensory systems [8,30]. More generally, in communication channels with memory, the mutual information rate for the optimal input signal determines the channel capacity [20]. In financial markets, it quantifies correlations in stochastic time series, such as stock prices and trading volumes [16]. Finally, in non-equilibrium thermodynamics, the trajectory mutual information provides a link between information theory and stochastic thermodynamics [45,46]. Efficient methods for calculating the trajectory mutual information and the mutual information rate are needed and constitute the primary objective of this thesis.
Unfortunately, calculating the mutual information between trajectories is notoriously difficult due to the high dimensionality of trajectory space [47]. Conventional approaches for computing mutual information require non-parametric estimates of the input and output entropy, typically obtained via histograms or kernel density estimators [8,10,12,38,47,48]. However, the high-dimensional nature of trajectories makes it infeasible to obtain enough data for accurate non-parametric distribution estimates. Other non-parametric entropy estimators such as the k-nearest-neighbor estimator [44,49] depend on a choice of metric in trajectory space and become unreliable for long trajectories [50]. Thus, except for very simple systems [38], the curse of dimensionality makes it infeasible to obtain accurate results for the trajectory mutual information using conventional mutual information estimators.
Due to the inherent difficulty of directly estimating the mutual information between trajectories, previous research has often employed simplified models or approximations. In some cases, the problem can be simplified by considering static (scalar) inputs instead of input signal trajectories [34,39,50]. But this approach ignores the dynamics of the input signal. Lower bounds for the mutual information can be derived from the Donsker-Varadhan inequality [51–53], or obtained through general-purpose compression algorithms [50,54,55]. While exact analytical results for the trajectory mutual information are available for certain simple processes such as Gaussian [36] or Poisson channels [56,57], many complex, realistic systems lack analytical solutions, and approximations have to be employed. For systems governed by a master equation, numerical or analytical approximations are sometimes feasible [58,59] but these become intractable for complex systems. Finally, the Gaussian framework for approximating the mutual information rate is particularly widely used [4,36,40], though it assumes linear system dynamics and Gaussian noise statistics. These assumptions make it ill-suited for many realistic nonlinear information-processing systems.
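For jointly stationary Gaussian signals, the mutual information rate reduces to an integral over the spectral coherence of input and output [36]. The snippet below is a minimal numerical sketch of this spectral formula, assuming the power spectra have already been estimated on a common frequency grid; the function and variable names are illustrative.

```python
import numpy as np

def gaussian_information_rate(omega, S_ss, S_xx, S_sx):
    """Gaussian-framework estimate of the mutual information rate,
    R = -(1 / 4*pi) * Integral d_omega ln(1 - |S_sx|^2 / (S_ss * S_xx)),
    from the input, output, and cross power spectra sampled on the grid omega."""
    coherence = np.abs(S_sx) ** 2 / (S_ss * S_xx)
    integrand = -np.log(1.0 - coherence)
    return np.trapz(integrand, omega) / (4.0 * np.pi)  # nats per unit time
```

Because this formula only sees second-order statistics, any information carried by non-Gaussian features of the signals is invisible to it, which is part of what Chapter 5 examines.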
To address the limitations of previous methods, we introduce Path Weight Sampling (PWS), a novel Monte Carlo technique for computing the trajectory mutual information efficiently and accurately. PWS leverages free-energy estimators from statistical physics and combines analytical and numerical methods to circumvent the curse of dimensionality associated with long trajectories. The approach relies on exact calculations of trajectory likelihoods derived analytically from a stochastic model. By averaging these likelihoods in a Monte Carlo fashion, PWS can accurately compute the trajectory mutual information, even in high-dimensional settings.
PWS is an exact Monte Carlo scheme, in the sense that it provides an unbiased statistical estimate of the trajectory mutual information. In PWS, the mutual information is computed via the identity

$$ I(\boldsymbol{S}; \boldsymbol{X}) = H(\boldsymbol{X}) - H(\boldsymbol{X} \mid \boldsymbol{S}) \tag{1.4} $$
as the difference between the marginal output entropy $H(\boldsymbol{X})$ associated with the marginal distribution $P(\boldsymbol{x})$ of the output trajectories and the conditional output entropy $H(\boldsymbol{X} \mid \boldsymbol{S})$ associated with $P(\boldsymbol{x} \mid \boldsymbol{s})$, the conditional output distribution for a given input $\boldsymbol{s}$. Both entropies are evaluated as Monte Carlo averages over the associated distribution, i.e., $H(\boldsymbol{X}) = -\langle \ln P(\boldsymbol{x}) \rangle$ and $H(\boldsymbol{X} \mid \boldsymbol{S}) = -\langle \ln P(\boldsymbol{x} \mid \boldsymbol{s}) \rangle$, where the notation $\langle \cdot \rangle$ denotes an average with respect to the joint distribution $P(\boldsymbol{s}, \boldsymbol{x})$. The key insights of PWS are that the conditional probability $P(\boldsymbol{x} \mid \boldsymbol{s})$ can be directly evaluated from a generative model of the system, and that the marginal probability $P(\boldsymbol{x})$ can be computed efficiently via marginalization using Monte Carlo procedures inspired by computational statistical physics.
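Schematically, a PWS estimate therefore averages the log-likelihood ratio $\ln P(\boldsymbol{x} \mid \boldsymbol{s}) - \ln P(\boldsymbol{x})$ over samples from the joint distribution. The sketch below conveys this structure using a hypothetical model object that exposes the three required ingredients; it is not the implementation of any specific PWS variant.

```python
import numpy as np

def pws_mutual_information(model, n_samples=1000):
    """Schematic PWS estimate of I(S;X) = H(X) - H(X|S) (Eq. 1.4).
    `model` is assumed to provide:
      sample_joint()          -> one pair (s, x) drawn from P(s, x)
      log_p_conditional(x, s) -> ln P(x | s), exact from the generative model
      log_p_marginal(x)       -> ln P(x), obtained by marginalization (Eq. 1.5)"""
    estimates = []
    for _ in range(n_samples):
        s, x = model.sample_joint()
        estimates.append(model.log_p_conditional(x, s) - model.log_p_marginal(x))
    return float(np.mean(estimates))  # mutual information in nats
```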
The crux of PWS lies in the efficient computation of $P(\boldsymbol{x})$ via the marginalization integral

$$ P(\boldsymbol{x}) = \int \mathcal{D}\boldsymbol{s}\; P(\boldsymbol{x} \mid \boldsymbol{s})\, P(\boldsymbol{s}) \,. \tag{1.5} $$
To evaluate this integral efficiently, we present different variants of PWS. In Chapter 2 we introduce Direct PWS, the simplest variant of PWS, where Equation 1.5 is computed via a “brute-force” Monte Carlo approach that works well for short trajectories, but which becomes exponentially harder for long trajectories. In Chapter 3, we present two additional variants of PWS that evaluate the marginalization integral more efficiently, RR-PWS and TI-PWS. Rosenbluth-Rosenbluth PWS (RR-PWS) is based on efficient free-energy estimation techniques developed in polymer physics [60–63]. Thermodynamic integration PWS (TI-PWS) uses techniques from transition path sampling to derive an MCMC sampler in trajectory space [64]. From this MCMC chain, we can compute the marginalization integral using thermodynamic integration [63,65,66]. Finally, in Chapter 6, we introduce a fourth marginalization technique based on variational inference via neural networks [67]. Its conceptual simplicity, coupled with powerful marginalization methods, makes PWS a versatile framework for computing the trajectory mutual information in a variety of scenarios.
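As an illustration of the brute-force approach of Direct PWS, the marginalization integral of Equation 1.5 can be estimated by averaging the conditional path likelihood over input trajectories drawn from $P(\boldsymbol{s})$, which is numerically best done in log space. The sketch below assumes user-supplied routines for sampling inputs and evaluating $\ln P(\boldsymbol{x} \mid \boldsymbol{s})$; their names are placeholders.

```python
import numpy as np
from scipy.special import logsumexp

def log_p_marginal_direct(x, sample_input, log_p_conditional, n_inputs=1000):
    """Brute-force Monte Carlo estimate of ln P(x) (Eq. 1.5): draw inputs s ~ P(s)
    and average the conditional likelihoods P(x|s) using a stable log-sum-exp."""
    log_weights = np.array([log_p_conditional(x, sample_input())
                            for _ in range(n_inputs)])
    return logsumexp(log_weights) - np.log(n_inputs)
```

The variance of this estimate grows rapidly with trajectory length, which is why the more sophisticated variants of Chapter 3 are needed.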
Yet, to compute the mutual information PWS requires evaluating the conditional trajectory probability $P(\boldsymbol{x} \mid \boldsymbol{s})$, which in turn requires a stochastic model defining a probability measure over trajectories. While (stochastic) mechanistic models of experimental systems are increasingly becoming available, the question remains whether PWS can be applied directly to experimental data when no such model is available. In Chapter 6, we show that machine learning can be used to construct a data-driven stochastic model that captures the trajectory statistics, i.e. $P(\boldsymbol{x} \mid \boldsymbol{s})$, enabling the application of PWS to experimental data.
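As a rough illustration of such a data-driven model, the sketch below parameterizes the conditional trajectory likelihood $P(\boldsymbol{x} \mid \boldsymbol{s})$ autoregressively with a recurrent neural network whose output defines a Gaussian distribution for each time step. This is a hypothetical minimal architecture written in PyTorch, not the specific model used in Chapter 6.

```python
import torch
import torch.nn as nn

class AutoregressiveTrajectoryModel(nn.Module):
    """Learns p(x_t | x_{<t}, s_{<=t}) as a Gaussian whose mean and log-std
    are produced by a GRU; summing the per-step log-probabilities gives an
    exact (model-based) ln P(x | s) for discretely sampled trajectories."""
    def __init__(self, hidden_size=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=2, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 2)  # per-step mean and log-std

    def log_prob(self, s, x):
        # s, x: tensors of shape (batch, T); each x_t is predicted from (s_t, x_{t-1})
        x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        features, _ = self.rnn(torch.stack([s, x_prev], dim=-1))
        mean, log_std = self.head(features).unbind(-1)
        return torch.distributions.Normal(mean, log_std.exp()).log_prob(x).sum(dim=1)
```

Training such a model by maximizing `log_prob` over recorded input-output pairs yields a generative model whose exact trajectory likelihood can then be plugged into the PWS estimator.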
We demonstrate the practical utility of PWS by calculating the trajectory mutual information for a range of systems. In Chapters 3 and 5, we study a minimal model for gene expression, showing that PWS can estimate the mutual information rate for this system more accurately than any previous technique. Using PWS, we reveal that the Gaussian approximation, though expected to hold due to the system’s linearity, does not provide an accurate estimate in this case. In Chapters 5 and 6 we extend our analysis to simple nonlinear models for information transmission, comparing PWS results against the Gaussian approximation; for these models, PWS is the first technique capable of accurately computing the trajectory mutual information. Moreover, in Chapter 4 we apply PWS to a complex stochastic model of bacterial chemotaxis, marking the first instance where the information rate for a system of this complexity can be computed exactly. Together, these examples demonstrate that an exact technique like PWS is indispensable for understanding information transmission in realistic scenarios.
1.1 Contributions of This Work
The main contributions of this thesis are as follows:
PWS: A novel framework for computing the trajectory mutual information: We introduce Path Weight Sampling, a computational framework for calculating the trajectory mutual information in dynamical stochastic systems. This framework is exact, applicable to both continuous- and discrete-time processes, and does not rely on any assumptions about the system’s dynamics. PWS and its main variants are described in Chapters 2 and 3.
Discovery of discrepancies between experiments and mathematical models of chemotaxis: We apply PWS to various systems, including the complex bacterial chemotaxis signaling network. By studying the information transmission rate of chemotaxis and comparing our results against those of Mattingly et al. [4], we find that the widely-used MWC model of chemotaxis cannot explain the experimental data. We find that the number of receptor clusters is smaller and that the size of these clusters is larger than hitherto believed. We describe and characterize this finding in Chapter 4.
Study of the accuracy of the Gaussian approximation for the information rate: In Chapter 5, we use PWS to quantitatively study the accuracy of the widely-used Gaussian approximation. Before PWS, no exact technique was available to obtain ground-truth results for the mutual information rate of nonlinear systems, and the accuracy of the Gaussian framework could not be evaluated. We reveal that the Gaussian model can be surprisingly inaccurate, even for linear reaction systems.
Neural networks for learning the stochastic dynamics from time-series data: In Chapter 6, we demonstrate that recent machine learning techniques can be employed to automatically learn the stochastic dynamics from experimental data. We show that by combining these learned models with PWS, it becomes possible to compute the trajectory mutual information directly from time-series data. This approach outperforms previous techniques, like the Gaussian approximation, for estimating information rates from data.
1.2 Thesis Outline
The remainder of this thesis is divided into five chapters. We first present three variants of PWS, all of which compute the conditional entropy $H(\boldsymbol{X} \mid \boldsymbol{S})$ in the same manner, but differ in the way the Monte Carlo averaging procedure for computing the marginal probability $P(\boldsymbol{x})$ is carried out. Chapters 2 to 4 of this thesis have been published previously in Physical Review X.3
In Chapter 2 we present the simplest PWS variant, Direct PWS (DPWS). To compute $P(\boldsymbol{x})$, DPWS performs a brute-force average of the path likelihoods $P(\boldsymbol{x} \mid \boldsymbol{s})$ over the input trajectories $\boldsymbol{s}$. While we show that this scheme works for simple systems, the brute-force Monte Carlo averaging procedure becomes more difficult for larger systems and exponentially harder for longer trajectories.
In Chapter 3, we present our second and third variants of PWS, which are based on the realization that the marginal probability $P(\boldsymbol{x})$ is akin to a partition function. These schemes leverage techniques for computing free energies from statistical physics. We also apply PWS to a minimal model system consisting of a pair of coupled birth-death processes, which allows us to compare the efficiency of the three PWS variants, as well as to compare the PWS results against analytical results from the Gaussian approximation [36].
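For reference, such a pair of coupled birth-death processes can be simulated with the standard Gillespie algorithm. The sketch below implements a generic version of this kind of model; the reaction scheme and rate names are illustrative and not necessarily the parameterization used in Chapter 3.

```python
import numpy as np

def gillespie_birth_death(kappa, lam, rho, mu, t_end, rng=None):
    """Gillespie simulation of two coupled birth-death processes:
    input S:  birth at rate kappa,   death at rate lam * s
    output X: birth at rate rho * s, death at rate mu * x"""
    if rng is None:
        rng = np.random.default_rng()
    t, s, x = 0.0, 0, 0
    times, traj_s, traj_x = [t], [s], [x]
    while t < t_end:
        rates = np.array([kappa, lam * s, rho * s, mu * x])
        total = rates.sum()
        if total == 0.0:
            break
        t += rng.exponential(1.0 / total)
        reaction = rng.choice(4, p=rates / total)
        if reaction == 0:
            s += 1
        elif reaction == 1:
            s -= 1
        elif reaction == 2:
            x += 1
        else:
            x -= 1
        times.append(t); traj_s.append(s); traj_x.append(x)
    return np.array(times), np.array(traj_s), np.array(traj_x)
```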
In Chapter 4, we apply PWS to the bacterial chemotaxis system, which is arguably the best characterized signaling system in biology. Mattingly et al. [4] recently argued that bacterial chemotaxis in shallow gradients is information limited. Yet, to compute the information rate from their experimental data they had to employ a Gaussian framework. PWS makes it possible to assess the accuracy of this approximation.
Chapter 5 is devoted to studying the accuracy of the Gaussian approximation for non-Gaussian systems. By understanding the limitations and strengths of the Gaussian approximation, this chapter aims to provide deeper insights into selecting the appropriate method for estimating the mutual information depending on the system.
Finally, in Chapter 6 we introduce ML-PWS, which combines recent machine learning models with PWS to compute the mutual information directly from data. This idea significantly extends the range of applications for PWS, since we no longer require a mechanistic model of the system. Instead, the stochastic model is automatically learned from the data.
Footnotes
1. In mathematical terms, this interplay between information and reward can be characterized by utility functions, which quantify the benefits of different actions based on available information [1,2].
2. In contrast to other correlation measures used in statistics, such as the Pearson correlation coefficient, the mutual information captures both linear and nonlinear dependencies between variables. Additionally, in contrast to other correlation measures, the mutual information satisfies the data processing inequality, which states that no type of post-processing can increase the mutual information between the input and output [20,21]. These properties make the mutual information uniquely suited for describing the fidelity of the input-output mapping in information-processing systems. Note, however, that a naïve use of the data processing inequality leads to seemingly contradictory results when applied to the stationary dynamics of processing cascades [22–24].
3. M. Reinhardt, G. Tkačik, and P. R. ten Wolde, Path Weight Sampling: Exact Monte Carlo Computation of the Mutual Information between Stochastic Trajectories, Phys. Rev. X 13, 041017 (2023) [68].