3 More Efficient Variants of PWS
The contents of this chapter have been published in Phys. Rev. X 13, 041017 (2023) [1].
In the previous chapter, we introduced Path Weight Sampling (PWS), a computational approach capable of providing exact information rate estimates for any stochastic system. However, the direct implementation of PWS becomes inefficient for complex systems and long trajectories due to the high dimensionality of trajectory space. To overcome these limitations, we present two improved PWS variants in this chapter, inspired by free-energy estimation techniques from statistical physics. First, Rosenbluth-Rosenbluth PWS (RR-PWS) leverages computational strategies developed for polymer chemical potential calculations, enhancing efficiency for sampling in trajectory spaces. Second, Thermodynamic Integration PWS (TI-PWS) applies thermodynamic integration combined with trajectory space MCMC sampling, inspired by transition path sampling. We benchmark these methods using a simple coupled birth-death model, comparing the effectiveness of all three PWS variants against analytical results and the Gaussian approximation.
3.1 Introduction
To quantify information transmission, be it in a natural or an engineered information processing system, we need to be able to compute the mutual information between input and output trajectories, from which the information transmission rate can be obtained. However, because of the high dimensionality of the trajectory space, the conventional non-parametric binning approach to estimating the required trajectory probability distributions cannot be used, which makes computing the mutual information between trajectories exceedingly difficult, if not impossible. Indeed, except for very simple models, the mutual information between trajectories is typically computed using approximations that are often uncontrolled. In the previous chapter, we introduced a computational scheme called Path Weight Sampling (PWS), which, for the first time, makes it possible to compute the information rate exactly, for any stochastic system.
Yet, the scheme presented in that chapter, Direct PWS, becomes inefficient for more complex systems and longer trajectories. The reason is that the number of possible trajectories increases exponentially with trajectory length, leading to a corresponding increase in the variance of the estimate. Hence, for long trajectories the PWS estimate may prove to be computationally infeasible. To address this issue, we describe two improved variants of PWS in this chapter, both based on free-energy estimators from statistical physics.
Specifically, in Section 3.3 we present Rosenbluth-Rosenbluth PWS (RR-PWS), which exploits the observation that the computation of the marginal probability $\mathrm{P}(\boldsymbol{x})$ is analogous to the calculation of the (excess) chemical potential of a polymer, for which efficient methods have been developed [2–4]. In Section 3.4, we present Thermodynamic Integration PWS (TI-PWS), which is based on the classic free-energy estimation technique of thermodynamic integration [5–7] in conjunction with a trajectory space MCMC sampler using ideas from transition path sampling [8].
In Section 3.5 we apply PWS to a well-known model system. It consists of a simple pair of coupled birth-death processes, which allows us to test the efficiency of the three PWS variants, as well as to compare the PWS results with analytical results from the Gaussian approximation [9] and the technique by Duso and Zechner [10].
3.2 Marginalizing in Trajectory Space
PWS evaluates the mutual information from the marginal entropy $H(\boldsymbol{x})$ and the conditional entropy $H(\boldsymbol{x}|\boldsymbol{s})$, see Equation 2.2. Of these two entropies, the conditional one can be efficiently computed using the scheme described in the previous chapter, as used in Direct PWS (DPWS). However, obtaining the marginal entropy is much more challenging. Indeed, the computationally most expensive part of Direct PWS is the evaluation of the marginalization integral, which needs to be performed for every output sample $\boldsymbol{x}_i$. Consequently, the computational efficiency of this marginalization is essential for the overall performance.
Marginalization is a general term to denote an operation where one or more variables are integrated out of a joint probability distribution. For instance, we obtain the marginal probability distribution $\mathrm{P}(\boldsymbol{x})$ from the joint distribution $\mathrm{P}(\boldsymbol{s}, \boldsymbol{x})$ by computing the integral
$$\mathrm{P}(\boldsymbol{x}) = \int \mathcal{D}\boldsymbol{s}\, \mathrm{P}(\boldsymbol{s}, \boldsymbol{x}). \tag{3.1}$$
In DPWS, we use Equation 2.7 to compute $\mathrm{P}(\boldsymbol{x})$, which involves generating independent input trajectories from $\mathrm{P}(\boldsymbol{s})$. However, this is not the optimal Monte Carlo technique to perform the marginalization. The generated input trajectories are independent of the output trajectory $\boldsymbol{x}$. Thus, we ignore the causal connection between $\boldsymbol{s}$ and $\boldsymbol{x}$, and we typically end up sampling trajectories whose likelihoods $\mathrm{P}(\boldsymbol{x}|\boldsymbol{s})$ are very small. Then, most sampled trajectories have small integral weights, and only very few samples provide a significant contribution to the average. The variance of the result is then very large because the effective sample size is much smaller than the total sample size. The use of $\mathrm{P}(\boldsymbol{s})$ as the sampling distribution is thus only practical in cases where the dependence of the output on the input is not too strong. It follows that this sampling scheme works best when the mutual information is not too large.
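To see concretely why sampling inputs blindly from the prior degrades with trajectory length, consider the following Python sketch (the published PWS code is written in Julia; this toy model, its Gaussian likelihood, and all parameter values are illustrative assumptions, not part of PWS itself). It measures the effective sample size of the importance weights as the number of "segments" of the trajectory grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def ess(w):
    """Effective sample size of a set of unnormalized importance weights."""
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

# Toy stand-in for trajectory sampling: an "input trajectory" consists of L
# i.i.d. Gaussian segments drawn blindly from the prior; the fixed "output"
# couples tightly to every segment through the likelihood.
N, sigma = 10_000, 0.3
ess_by_L = {}
for L in [1, 5, 25, 125]:
    s = rng.normal(size=(N, L))                       # samples from the prior
    logw = -0.5 * np.sum(s ** 2, axis=1) / sigma**2   # log-likelihood of the output
    w = np.exp(logw - logw.max())                     # weights (up to a constant)
    ess_by_L[L] = ess(w)
    print(L, round(ess_by_L[L], 1))
```

The effective sample size collapses roughly exponentially with the number of segments: this is precisely the failure mode of the brute-force marginalization that the improved PWS variants are designed to avoid.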
This is a well-known Monte Carlo sampling problem, and a large number of techniques have been developed to solve it. The two variants of our scheme, RR-PWS and TI-PWS, both make use of ideas from statistical physics for the efficient computation of free energies.
To understand how these ideas can be applied to compute the marginal probability $\mathrm{P}(\boldsymbol{x})$, it is helpful to rephrase the marginalization integral in Equation 3.1 in the language of statistical physics. In this language, $\mathrm{P}(\boldsymbol{x})$ corresponds to the normalization constant, or partition function, of the Boltzmann distribution for the potential
$$U(\boldsymbol{s}; \boldsymbol{x}) = -\ln \mathrm{P}(\boldsymbol{s}, \boldsymbol{x}). \tag{3.2}$$
In Equation 3.2, $\boldsymbol{s}$ is interpreted as a variable in the configuration space, while $\boldsymbol{x}$ acts as an auxiliary variable, i.e., a parameter. Note that both $\boldsymbol{s}$ and $\boldsymbol{x}$ still represent trajectories. For this potential, the partition function is given by
$$Z(\boldsymbol{x}) = \int \mathcal{D}\boldsymbol{s}\, e^{-U(\boldsymbol{s}; \boldsymbol{x})}. \tag{3.3}$$
The integral only runs over the configuration space, i.e. we integrate only with respect to $\boldsymbol{s}$. By inserting the expression for $U(\boldsymbol{s}; \boldsymbol{x})$, we see that the partition function is exactly equal to the marginal probability of the output, i.e. $Z(\boldsymbol{x}) = \mathrm{P}(\boldsymbol{x})$. The free energy is given by
$$F(\boldsymbol{x}) = -\ln Z(\boldsymbol{x}) = -\ln \mathrm{P}(\boldsymbol{x}). \tag{3.4}$$
In statistical physics it is well known that the free energy cannot be directly measured from a simulation. Instead, one estimates the free-energy difference
$$\Delta F = F(\boldsymbol{x}) - F_0 \tag{3.5}$$
between the system and a reference system with known free energy $F_0$. The reference system can be freely chosen and is usually defined using a Boltzmann distribution for a convenient reference potential $U_0$. In our case, a natural choice of reference potential is
$$U_0(\boldsymbol{s}) = -\ln \mathrm{P}(\boldsymbol{s}), \tag{3.6}$$
with the corresponding partition function being simply
$$Z_0 = \int \mathcal{D}\boldsymbol{s}\, e^{-U_0(\boldsymbol{s})} = \int \mathcal{D}\boldsymbol{s}\, \mathrm{P}(\boldsymbol{s}) = 1. \tag{3.7}$$
The reference free energy therefore is zero ($F_0 = -\ln Z_0 = 0$). Hence, the free-energy difference is
$$\Delta F = F(\boldsymbol{x}) = -\ln \mathrm{P}(\boldsymbol{x}). \tag{3.8}$$
Note that in our case the reference potential does not depend on the output trajectory $\boldsymbol{x}$, i.e. $U_0 = U_0(\boldsymbol{s})$. It describes a non-interacting version of our input-output system where the input trajectories evolve independently of the fixed output trajectory $\boldsymbol{x}$.
What is the interaction between the output $\boldsymbol{x}$ and the input trajectory ensemble? We define the interaction potential $U_\mathrm{int}(\boldsymbol{s}; \boldsymbol{x})$ through
$$U(\boldsymbol{s}; \boldsymbol{x}) = U_0(\boldsymbol{s}) + U_\mathrm{int}(\boldsymbol{s}; \boldsymbol{x}). \tag{3.9}$$
The interaction potential makes it apparent that the distribution of $\boldsymbol{s}$ corresponding to the potential $U(\boldsymbol{s}; \boldsymbol{x})$ is biased by the factor $e^{-U_\mathrm{int}(\boldsymbol{s}; \boldsymbol{x})}$ with respect to the distribution corresponding to the reference potential $U_0(\boldsymbol{s})$. By inserting the expressions for $U$ and $U_0$ into Equation 3.9 we see that
$$U_\mathrm{int}(\boldsymbol{s}; \boldsymbol{x}) = -\ln \mathrm{P}(\boldsymbol{x} \mid \boldsymbol{s}), \tag{3.10}$$
where $\mathrm{P}(\boldsymbol{x} \mid \boldsymbol{s})$ is given by Equation 2.15 and can be directly computed from the master equation. This expression illustrates that the interaction of the output trajectory with the ensemble of input trajectories is characterized by the trajectory likelihood $\mathrm{P}(\boldsymbol{x} \mid \boldsymbol{s})$.
To summarize, in this section we have introduced notation (see Table 3.1) showing that computing a marginalization integral is equivalent to the computation of a free-energy difference. This picture allows us to distinguish between two ensembles for $\boldsymbol{s}$: the non-interacting ensemble distributed according to $e^{-U_0(\boldsymbol{s})} = \mathrm{P}(\boldsymbol{s})$, and the interacting ensemble distributed according to $e^{-U(\boldsymbol{s}; \boldsymbol{x})}/Z(\boldsymbol{x}) = \mathrm{P}(\boldsymbol{s} \mid \boldsymbol{x})$. With these notions we can rewrite the brute-force estimate in Direct PWS (Chapter 2) as
$$\mathrm{P}(\boldsymbol{x}) = \left\langle e^{-U_\mathrm{int}(\boldsymbol{s}; \boldsymbol{x})} \right\rangle_0 \approx \frac{1}{N} \sum_{i=1}^{N} e^{-U_\mathrm{int}(\boldsymbol{s}_i; \boldsymbol{x})}, \qquad \boldsymbol{s}_i \sim \mathrm{P}(\boldsymbol{s}), \tag{3.11}$$
where the notation $\langle \cdot \rangle_0$ refers to an average with respect to the non-interacting ensemble. By inserting the expression for $U_\mathrm{int}$, it is easy to verify that this estimate is equivalent to Equation 2.7.
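As a concrete illustration of the brute-force estimate above, the following Python sketch uses a three-state toy model in place of real trajectory ensembles (all probabilities here are invented for illustration; the published PWS implementation is in Julia). Averaging $e^{-U_\mathrm{int}(\boldsymbol{s};\boldsymbol{x})} = \mathrm{P}(\boldsymbol{x}|\boldsymbol{s})$ over samples from the non-interacting ensemble recovers the exact marginal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete model standing in for trajectories: P(s) and P(x|s) are known.
P_s = np.array([0.5, 0.3, 0.2])           # prior over three "input trajectories"
P_x_given_s = np.array([0.7, 0.2, 0.05])  # likelihood of one fixed output x

exact = float(np.dot(P_s, P_x_given_s))   # marginal P(x) by direct summation

# Brute-force estimate: average e^{-U_int} = P(x|s) over samples s ~ P(s).
N = 200_000
samples = rng.choice(3, size=N, p=P_s)
estimate = P_x_given_s[samples].mean()
print(exact, estimate)
```

In this toy example the variance of the weights is small; for long trajectories the weights $\mathrm{P}(\boldsymbol{x}|\boldsymbol{s}_i)$ span many orders of magnitude and the same estimator fails.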
The more advanced variants of PWS introduced below leverage the analogy with statistical physics to improve the efficiency of marginalization. RR-PWS draws ideas from soft condensed matter simulations by recognizing that Equation 3.5 has the same form as the excess chemical potential of a polymer for which efficient computation techniques have been developed [2,11]. Meanwhile, TI-PWS is inspired by Transition Path Sampling (TPS) for sampling rare trajectories [8] and uses thermodynamic integration for free-energy estimation.
3.3 RR-PWS
In Rosenbluth-Rosenbluth PWS we compute the free-energy difference between the ideal (non-interacting) system and the interacting system in a single simulation, just like in the brute-force method. However, instead of generating trajectories in an uncorrelated fashion according to $\mathrm{P}(\boldsymbol{s})$, we bias our sampling distribution towards $\mathrm{P}(\boldsymbol{s} \mid \boldsymbol{x})$ to reduce the sampling problems found in DPWS.
The classical scheme for biasing the sampling distribution in polymer physics is due to Rosenbluth and Rosenbluth [12] in their study of self-avoiding chains. A substantial improvement of the Rosenbluth algorithm was achieved by Grassberger, by generating polymers using pruning and enrichment steps, thereby eliminating configurations that do not significantly contribute to the average. This scheme is known as the pruned-enriched Rosenbluth method, or PERM [3]. While PERM is much more powerful than the standard Rosenbluth algorithm, its main drawback is that it requires careful tuning of the pruning and enrichment schedule to achieve optimal convergence. We have therefore opted to use a technique that is similar in spirit to PERM but requires less tuning, the bootstrap particle filter [13]. We will describe how to use PWS with a particle filter below. That said, we want to stress that the particle filter can easily be replaced by PERM or other related methods [14]. Schemes inspired by variants of Forward Flux Sampling [15,16] could also be developed.
3.3.1 Bootstrap Particle Filter
In the methods discussed above, a polymer is grown monomer by monomer. In a continuous-time Markov process this translates to trajectories being grown segment by segment. To define the segments, we choose a time discretization $0 = t_0 < t_1 < \cdots < t_n = T$. Thus, each trajectory of duration $T$ consists of $n$ segments, where we denote the segment between $t_{i-1}$ and $t_i$ by $\boldsymbol{s}_{[t_{i-1}, t_i]}$ (we define $t_0 = 0$ and $t_n = T$). The particle filter uses the following procedure to grow an ensemble of $M$ trajectories segment by segment:
1. Generate $M$ starting points $s^{(1)}_0, \ldots, s^{(M)}_0$ according to the initial condition of the input signal, $\mathrm{P}(s_0 \mid x_0)$.
2. Iterate for $i = 1, \ldots, n$:
(a) Starting with an ensemble of partial trajectories of duration $t_{i-1}$ (if $i = 1$, an ensemble of starting points), which we label $\boldsymbol{s}^{(j)}_{[0, t_{i-1}]}$ for $j = 1, \ldots, M$: propagate each trajectory (or each starting point) forward in time from $t_{i-1}$ to $t_i$. Propagation is performed according to the natural dynamics of $\boldsymbol{s}$, i.e. generating a new segment with probability $\mathrm{P}(\boldsymbol{s}^{(j)}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}^{(j)}_{[0, t_{i-1}]})$ for $j = 1, \ldots, M$.
(b) Compute the Boltzmann weight of each new segment. The Boltzmann weight of a segment from $t_{i-1}$ to $t_i$ can be expressed as
$$v^{(j)}_i = \mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}^{(j)}_{[0, t_i]}, \boldsymbol{x}_{[0, t_{i-1}]}),$$
see Equation 3.10, and is therefore straightforward to compute from the master equation.
(c) Sample $M$ times from the distribution
$$p(j) = \frac{v^{(j)}_i}{M W_i},$$
where the Rosenbluth weight $W_i$ is defined as
$$W_i = \frac{1}{M} \sum_{j=1}^{M} v^{(j)}_i.$$
This sampling procedure yields $M$ randomly drawn indices $a_1, \ldots, a_M$. Each $a_j$ is an index that lies in the range from $1$ to $M$ and that points to one of the trajectories that have been generated up to $t_i$. To continue the sampling procedure, we relabel the indices such that the resampled set of trajectories is defined by $\tilde{\boldsymbol{s}}^{(j)}_{[0, t_i]} = \boldsymbol{s}^{(a_j)}_{[0, t_i]}$ for $j = 1, \ldots, M$. This list is subsequently used as the input for the next iteration of the algorithm.
The normalized Rosenbluth factor of the final ensemble is then given by
$$W = \prod_{i=1}^{n} W_i. \tag{3.12}$$
As shown in Section 3.3.3, we can derive an unbiased estimate for the desired ratio $Z(\boldsymbol{x})/Z_0 = \mathrm{P}(\boldsymbol{x})$ based on the Rosenbluth factor:
$$\mathrm{P}(\boldsymbol{x}) \approx \mathrm{P}(x_0) \prod_{i=1}^{n} W_i, \tag{3.13}$$
with $\mathrm{P}(x_0)$ being the probability of the initial output $x_0$. The particle filter can therefore be integrated into the DPWS algorithm to compute the marginal density $\mathrm{P}(\boldsymbol{x})$, substituting the brute-force estimate given in Equation 2.7. We call the resulting algorithm to compute the mutual information RR-PWS.
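The procedure above can be sketched compactly in code. The following Python example (the published implementation is in Julia; here a discrete-time hidden Markov chain with invented transition and emission probabilities stands in for the continuous-time trajectories, and resampling is performed at every step rather than adaptively) estimates the marginal probability of a fixed output trajectory as a product of Rosenbluth weights, and compares it against the exact forward algorithm:

```python
import numpy as np

rng = np.random.default_rng(7)

# Discrete-time stand-in for the input dynamics: a 3-state Markov chain.
T_mat = np.array([[0.8, 0.2, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.2, 0.8]])
p0 = np.array([1 / 3, 1 / 3, 1 / 3])
emit = np.array([[0.9, 0.1],   # likelihood of each output value given the state
                 [0.5, 0.5],
                 [0.1, 0.9]])
x_obs = [0, 0, 1, 1, 1, 0, 1]  # one fixed output trajectory

# Exact marginal probability of x_obs via the forward algorithm, for comparison.
alpha = p0 * emit[:, x_obs[0]]
for x in x_obs[1:]:
    alpha = (alpha @ T_mat) * emit[:, x]
exact = alpha.sum()

# Bootstrap particle filter: propagate, weight, resample, accumulate W_i.
M = 50_000
s = rng.choice(3, size=M, p=p0)                     # starting points
log_W = 0.0
for i, x in enumerate(x_obs):
    if i > 0:  # step 2(a): propagate every particle under the prior dynamics
        u = rng.random(M)[:, None]
        s = (u > np.cumsum(T_mat[s], axis=1)).sum(axis=1)
    v = emit[s, x]                                  # step 2(b): segment weights
    log_W += np.log(v.mean())                       # Rosenbluth weight W_i
    s = s[rng.choice(M, size=M, p=v / v.sum())]     # step 2(c): resample
estimate = np.exp(log_W)                            # product of the W_i
print(exact, estimate)
```

In this discrete-time sketch the emission probability of the first observation plays the role of the initial-output factor in Equation 3.13, so the product of Rosenbluth weights alone already estimates the full marginal.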
3.3.2 Intuitive Justification of the Algorithm
First note that steps 1 and 2(a) of the procedure above involve just propagating $M$ trajectories in parallel, according to $\mathrm{P}(\boldsymbol{s})$. The interesting steps are 2(b-c), where we eliminate or duplicate some of the trajectories according to the Boltzmann weights of the most recent segment. Note that in general the list of indices that are sampled in step 2(c) will contain duplicates, i.e., the same index may be drawn more than once, thus cloning the corresponding trajectory. Concomitantly, the sampled indices may not include every original index, therefore eliminating some trajectories. Since indices of trajectories with high Boltzmann weight are more likely to be sampled from the resampling distribution in step 2(c), this scheme biases the sampling distribution towards trajectories with large Boltzmann weight, ensuring that we only spend computational effort on propagating trajectories which contribute significantly to the marginalization integral.
Hence, at its heart, the particle filter is an importance sampling scheme. It produces samples that are biased towards the ideal importance sampling distribution $\mathrm{P}(\boldsymbol{s} \mid \boldsymbol{x})$, i.e., towards the Boltzmann distribution of the interacting ensemble. The Rosenbluth factor represents the importance sampling weight which would be required to correct for the sampling bias when computing averages using the sampled trajectories. Importantly for our case, the Rosenbluth factor can also be used to estimate the marginal probability $\mathrm{P}(\boldsymbol{x})$. To illustrate the algorithm, one iteration of the particle filter is presented schematically in Figure 3.1.
3.3.3 Detailed Justification
This subsection justifies the marginal probability estimate shown in Equation 3.13 in greater detail, and may be skipped on first reading. We show that the bootstrap particle filter provides a consistent estimator for the marginal probability $\mathrm{P}(\boldsymbol{x})$, or, equivalently, the ratio of partition functions $Z(\boldsymbol{x})/Z_0$. The result that this estimate is also unbiased is more difficult to establish; a proof is given by Del Moral [17].
We structure our justification of the particle filter into three steps. We first give a brief description of how a resampling procedure can generally be used to generate samples approximating a target distribution when only samples from a different distribution are available. Secondly, we use these insights to explain how the resampling procedure used in the particle filter generates trajectories whose distribution is biased towards the interacting ensemble, even though we only generate trajectories according to $\mathrm{P}(\boldsymbol{s})$. Finally, we use this result to show that the particle filter provides a consistent estimate for $\mathrm{P}(\boldsymbol{x})$.
3.3.3.1 Sampling and resampling
Sampling and then resampling is a strategy to use samples from a given prior distribution $q(\boldsymbol{s})$ to generate approximate samples from a different distribution of interest, with density proportional to the product $g(\boldsymbol{s})\, q(\boldsymbol{s})$ for some weighting function $g$. In general, $g(\boldsymbol{s})\, q(\boldsymbol{s})$ is not normalized, and we denote the corresponding normalized probability density by $\pi(\boldsymbol{s})$. To generate samples from $\pi$, we assign each of the $M$ existing samples $\boldsymbol{s}^{(1)}, \ldots, \boldsymbol{s}^{(M)}$ from $q$ a normalized weight
$$w^{(j)} = \frac{g(\boldsymbol{s}^{(j)})}{\sum_{k=1}^{M} g(\boldsymbol{s}^{(k)})}. \tag{3.14}$$
Then, by sampling from the discrete set $\{\boldsymbol{s}^{(1)}, \ldots, \boldsymbol{s}^{(M)}\}$ according to the assigned weights $w^{(j)}$, we select samples that are approximately distributed according to $\pi$. Indeed, for $M \to \infty$ the distribution of the resulting samples approaches the density $\pi$ [18]. We use resampling at each iteration of the algorithm of Section 3.3.1 to regularly prune those trajectories with low overall contribution to the marginalization integral.
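The sample-resample step is easy to verify numerically. In the following Python sketch (a standard Gaussian example chosen for illustration, since the posterior is known in closed form), the prior is $\mathcal{N}(0,1)$, the weighting function is the likelihood of an observation, and the resampled set should approximate the analytical posterior $\mathcal{N}(0.5, 0.5)$:

```python
import numpy as np

rng = np.random.default_rng(3)

M = 100_000
prior = rng.normal(size=M)              # samples from the prior N(0, 1)
x, sigma = 1.0, 1.0                     # one observation x ~ N(s, sigma^2)
g = np.exp(-0.5 * (x - prior) ** 2 / sigma**2)  # weighting function (likelihood)
w = g / g.sum()                         # normalized weights

idx = rng.choice(M, size=M, p=w)        # resample according to the weights
posterior = prior[idx]                  # approximates N(0.5, 0.5) analytically
print(posterior.mean(), posterior.var())
```

The empirical mean and variance of the resampled set match the analytical posterior, confirming that weighting by $g$ and resampling converts prior samples into (approximate) target samples.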
3.3.3.2 Particle filter
In the bootstrap particle filter, at each iteration, we start with a set of $M$ trajectories. In each iteration of the particle filter, the goal is to produce a set of elongated trajectories (from time $t_{i-1}$ to $t_i$) whose distribution tends towards $\mathrm{P}(\boldsymbol{s}_{[0, t_i]} \mid \boldsymbol{x}_{[0, t_i]})$. By iterating such a procedure, we can generate a set of trajectories distributed approximately according to $\mathrm{P}(\boldsymbol{s}_{[0, t_i]} \mid \boldsymbol{x}_{[0, t_i]})$ for any $i$. Thus, the particle filter is a biased sampling scheme which provides an approximation of the interacting ensemble. Moreover, using the particle filter we can also compute the corresponding importance weights, which can be used to compute the marginal probability $\mathrm{P}(\boldsymbol{x}_{[0, t_i]})$ for any $i$.
We now take a closer look at one iteration of the particle filter. Start with a set of $M$ trajectories in $[0, t_{i-1}]$, denoted by $\boldsymbol{s}^{(1)}_{[0, t_{i-1}]}, \ldots, \boldsymbol{s}^{(M)}_{[0, t_{i-1}]}$ and assumed to be distributed approximately according to $\mathrm{P}(\boldsymbol{s}_{[0, t_{i-1}]} \mid \boldsymbol{x}_{[0, t_{i-1}]})$. These trajectories are then propagated forward to time $t_i$, by adding a new segment to the trajectory $\boldsymbol{s}^{(j)}_{[0, t_{i-1}]}$ for $j = 1, \ldots, M$. Each new segment is generated from the distribution $\mathrm{P}(\boldsymbol{s}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}^{(j)}_{[0, t_{i-1}]})$, such that the propagation step results in a set of trajectories $\boldsymbol{s}^{(1)}_{[0, t_i]}, \ldots, \boldsymbol{s}^{(M)}_{[0, t_i]}$, distributed according to the pre-resampling density $q_i(\boldsymbol{s}_{[0, t_i]}) = \mathrm{P}(\boldsymbol{s}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}_{[0, t_{i-1}]})\, \mathrm{P}(\boldsymbol{s}_{[0, t_{i-1}]} \mid \boldsymbol{x}_{[0, t_{i-1}]})$.
Next, we resample from the set of $M$ trajectories, with the goal of producing a set of trajectories distributed according to the target density $\mathrm{P}(\boldsymbol{s}_{[0, t_i]} \mid \boldsymbol{x}_{[0, t_i]})$. Thus, we have to find the appropriate weighting function $g$ in order to approximately produce samples according to the target distribution. By choosing $g(\boldsymbol{s}_{[0, t_i]}) = \mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}_{[0, t_i]}, \boldsymbol{x}_{[0, t_{i-1}]})$, we generate normalized weights
$$w^{(j)} = \frac{\mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}^{(j)}_{[0, t_i]}, \boldsymbol{x}_{[0, t_{i-1}]})}{\sum_{k=1}^{M} \mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}^{(k)}_{[0, t_i]}, \boldsymbol{x}_{[0, t_{i-1}]})}, \tag{3.15}$$
cf. Equation 3.14. Note that this is the same choice of weighting function as in step 2(c) of Section 3.3.1. By comparison with the notation used there, we see that the Boltzmann factors $v^{(j)}_i$ and Rosenbluth weights $W_i$ were defined such that we can express the normalized weight equivalently as
$$w^{(j)} = \frac{v^{(j)}_i}{M W_i}.$$
Why is this choice of weighting function the correct one? First, observe that resampling with the normalized weights of Equation 3.15 produces samples approximately distributed according to the (unnormalized) density
$$\tilde{\pi}_i(\boldsymbol{s}_{[0, t_i]}) = \mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}_{[0, t_i]}, \boldsymbol{x}_{[0, t_{i-1}]})\, \mathrm{P}(\boldsymbol{s}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}_{[0, t_{i-1}]})\, \mathrm{P}(\boldsymbol{s}_{[0, t_{i-1}]} \mid \boldsymbol{x}_{[0, t_{i-1}]}).$$
What remains to be shown is that this density $\tilde{\pi}_i$, when normalized, becomes the desired target distribution $\mathrm{P}(\boldsymbol{s}_{[0, t_i]} \mid \boldsymbol{x}_{[0, t_i]})$.
To do so, we first combine the last two factors into $\mathrm{P}(\boldsymbol{s}_{[0, t_i]} \mid \boldsymbol{x}_{[0, t_{i-1}]})$ (past outputs carry no additional information on future inputs once the full input history is given, since the input evolves autonomously), and then rewrite the expression for $\tilde{\pi}_i$ using Bayes' theorem
$$\tilde{\pi}_i(\boldsymbol{s}_{[0, t_i]}) = \mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}_{[0, t_i]}, \boldsymbol{x}_{[0, t_{i-1}]})\, \mathrm{P}(\boldsymbol{s}_{[0, t_i]} \mid \boldsymbol{x}_{[0, t_{i-1}]}) = \mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]}, \boldsymbol{s}_{[0, t_i]} \mid \boldsymbol{x}_{[0, t_{i-1}]}). \tag{3.17}$$
Notice that the joint density on the right-hand side can be factorized as $\mathrm{P}(\boldsymbol{s}_{[0, t_i]} \mid \boldsymbol{x}_{[0, t_i]})\, \mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{x}_{[0, t_{i-1}]})$. After inserting this result into Equation 3.17, we obtain
$$\tilde{\pi}_i(\boldsymbol{s}_{[0, t_i]}) = \mathrm{P}(\boldsymbol{s}_{[0, t_i]} \mid \boldsymbol{x}_{[0, t_i]})\, \mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{x}_{[0, t_{i-1}]}).$$
The second term in this product is a constant, since $\boldsymbol{x}$ is fixed. The first term is a normalized probability density for $\boldsymbol{s}_{[0, t_i]}$. Therefore we find that the normalized density corresponding to $\tilde{\pi}_i$ is
$$\pi_i(\boldsymbol{s}_{[0, t_i]}) = \mathrm{P}(\boldsymbol{s}_{[0, t_i]} \mid \boldsymbol{x}_{[0, t_i]}).$$
Consequently, this is the distribution that is approximated by the set of trajectories at the end of the $i$-th iteration of the particle filter, which is what we wanted to show. At its heart, the particle filter is therefore an algorithm to produce samples that are approximately distributed according to $\mathrm{P}(\boldsymbol{s}_{[0, t_i]} \mid \boldsymbol{x}_{[0, t_i]})$.
3.3.3.3 Marginal probability estimate
We now use these insights to derive an estimate of $\mathrm{P}(\boldsymbol{x})$. We start by noting that the marginal density of the $i$-th output segment, conditioned on the previous segments, is given by
$$\begin{aligned}
\mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{x}_{[0, t_{i-1}]}) &= \int \mathcal{D}\boldsymbol{s}_{[0, t_i]}\, \mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}_{[0, t_i]}, \boldsymbol{x}_{[0, t_{i-1}]})\, \mathrm{P}(\boldsymbol{s}_{[0, t_i]} \mid \boldsymbol{x}_{[0, t_{i-1}]}) \\
&= \int \mathcal{D}\boldsymbol{s}_{[0, t_i]}\, \mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}_{[0, t_i]}, \boldsymbol{x}_{[0, t_{i-1}]})\, \mathrm{P}(\boldsymbol{s}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}_{[0, t_{i-1}]})\, \mathrm{P}(\boldsymbol{s}_{[0, t_{i-1}]} \mid \boldsymbol{x}_{[0, t_{i-1}]}) \\
&= \int \mathcal{D}\boldsymbol{s}_{[0, t_i]}\, \mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}_{[0, t_i]}, \boldsymbol{x}_{[0, t_{i-1}]})\, q_i(\boldsymbol{s}_{[0, t_i]}).
\end{aligned}$$
The third line follows from the definition of $q_i$. Hence, we find that the probability $\mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{x}_{[0, t_{i-1}]})$ can be expressed as the average
$$\mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{x}_{[0, t_{i-1}]}) = \left\langle \mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}_{[0, t_i]}, \boldsymbol{x}_{[0, t_{i-1}]}) \right\rangle_{q_i}. \tag{3.22}$$
In principle, this average can be computed using a Monte Carlo scheme, using trajectories generated from $q_i$. Notice that at each iteration of the particle filter, we do dispose of a set of trajectories which are approximately distributed according to $q_i$, namely the ensemble obtained directly after the propagation step. Therefore, we can compute the average in Equation 3.22 directly from the trajectories that are present at each iteration of the particle filter. With the notation from Section 3.3.1, using $v^{(j)}_i = \mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{s}^{(j)}_{[0, t_i]}, \boldsymbol{x}_{[0, t_{i-1}]})$, we thus obtain the estimate
$$\mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{x}_{[0, t_{i-1}]}) \approx \frac{1}{M} \sum_{j=1}^{M} v^{(j)}_i = W_i.$$
The probability of the entire output trajectory is given by the product
$$\mathrm{P}(\boldsymbol{x}) = \mathrm{P}(x_0) \prod_{i=1}^{n} \mathrm{P}(\boldsymbol{x}_{[t_{i-1}, t_i]} \mid \boldsymbol{x}_{[0, t_{i-1}]}),$$
where $\mathrm{P}(x_0)$ is the probability of the initial output state, which is assumed to be known. In conclusion, we arrive at the following estimate for the marginal output probability
$$\mathrm{P}(\boldsymbol{x}) \approx \mathrm{P}(x_0) \prod_{i=1}^{n} W_i,$$
which is precisely Equation 3.13.
3.3.4 Tuning the Particle Filter
For the efficiency of the particle filter, it is important to carefully choose the number of segments $n$. When segments are very short (i.e., when $n$ is large), the Boltzmann weights of step 2(b) tend to differ very little between newly generated segments. Hence, the pruning and enrichment of the segments is dominated by noise. In contrast, when the segments are very long, the distribution of Boltzmann weights becomes very wide. Then only a small number of segments contribute substantially to the corresponding Rosenbluth weight $W_i$. Hence, to correctly choose $n$, we need a measure that quantifies the variance in the trajectory weights of the particles. To this end, we follow Martino et al. [19] and introduce an effective sample size (ESS)
$$\mathrm{ESS} = \frac{\left( \sum_{j=1}^{M} v^{(j)} \right)^2}{\sum_{j=1}^{M} \left( v^{(j)} \right)^2},$$
which lies in the range $[1, M]$; $\mathrm{ESS} \to 1$ if one trajectory has a much higher weight than all the others, and $\mathrm{ESS} = M$ if all trajectories have the same weight. As a rule of thumb, we resample only when the ESS drops below $M/2$. Additionally, as recommended in Ref. [20], we use the systematic sampling algorithm to randomly draw the indices in step 2(c), which helps to reduce the variance; we find, however, that the improvement over simple sampling is very minor. Using these techniques, the only parameter that needs to be chosen by hand for the particle filter is the ensemble size $M$.
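The ESS and the adaptive resampling rule of thumb are only a few lines of code. The following Python sketch (the published implementation is in Julia; the weight vectors here are contrived to show the two extreme cases) illustrates both:

```python
import numpy as np

def ess(v):
    """Effective sample size computed from unnormalized particle weights v."""
    v = np.asarray(v, dtype=float)
    return v.sum() ** 2 / np.sum(v ** 2)

def should_resample(v, M):
    """Rule of thumb: trigger resampling only once the ESS drops below M/2."""
    return ess(v) < M / 2

uniform = np.ones(100)        # all trajectories weighted equally -> ESS = M
peaked = np.zeros(100)
peaked[0] = 1.0               # a single trajectory dominates -> ESS = 1
print(ess(uniform), ess(peaked))
```

With equal weights resampling is skipped (it would only inject noise), while a sharply peaked weight distribution triggers it, pruning trajectories that no longer contribute to the marginalization integral.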
3.4 TI-PWS
Thermodynamic integration PWS (TI-PWS) is based on the analogy of marginalization integrals with free-energy computations. As before, we view the problem of computing the marginal probability $\mathrm{P}(\boldsymbol{x})$ as equivalent to that of computing the free-energy difference between the ensembles defined by the potentials $U_0(\boldsymbol{s})$ and $U(\boldsymbol{s}; \boldsymbol{x})$, respectively. For TI-PWS, we define a potential $U_\theta(\boldsymbol{s}; \boldsymbol{x})$ with a continuous parameter $\theta$ that allows us to transform the ensemble from $U_0$ to $U$. The corresponding partition function is
$$Z_\theta(\boldsymbol{x}) = \int \mathcal{D}\boldsymbol{s}\, e^{-U_\theta(\boldsymbol{s}; \boldsymbol{x})}. \tag{3.27}$$
For instance, for $\theta \in [0, 1]$, we can define our potential as
$$U_\theta(\boldsymbol{s}; \boldsymbol{x}) = U_0(\boldsymbol{s}) + \theta\, U_\mathrm{int}(\boldsymbol{s}; \boldsymbol{x}), \tag{3.28}$$
such that $U_{\theta=0} = U_0$ and $U_{\theta=1} = U$. Note that this is the simplest choice for a continuous transformation between $U_0$ and $U$, but by no means the only one. For reasons of computational efficiency, it can be beneficial to choose a different path between $U_0$ and $U$, depending on the specific system [6]. Here we will not consider other paths, however, and derive the thermodynamic integration estimate for the potential given in Equation 3.28.
To derive the thermodynamic integration estimate for the free-energy difference, we first compute the derivative of $F_\theta = -\ln Z_\theta$ with respect to $\theta$:
$$\frac{\partial F_\theta}{\partial \theta} = -\frac{1}{Z_\theta} \frac{\partial Z_\theta}{\partial \theta} = \int \mathcal{D}\boldsymbol{s}\, U_\mathrm{int}(\boldsymbol{s}; \boldsymbol{x})\, \frac{e^{-U_\theta(\boldsymbol{s}; \boldsymbol{x})}}{Z_\theta(\boldsymbol{x})} = \left\langle U_\mathrm{int}(\boldsymbol{s}; \boldsymbol{x}) \right\rangle_\theta. \tag{3.29}$$
Thus, the derivative of $F_\theta$ is an average of the interaction potential $U_\mathrm{int}$ with respect to $\mathrm{P}_\theta(\boldsymbol{s})$, which is the ensemble distribution of $\boldsymbol{s}$ given by
$$\mathrm{P}_\theta(\boldsymbol{s}) = \frac{e^{-U_\theta(\boldsymbol{s}; \boldsymbol{x})}}{Z_\theta(\boldsymbol{x})}. \tag{3.30}$$
Integrating Equation 3.29 with respect to $\theta$ leads to the formula for the free-energy difference
$$\Delta F = F_1 - F_0 = \int_0^1 d\theta\, \left\langle U_\mathrm{int}(\boldsymbol{s}; \boldsymbol{x}) \right\rangle_\theta = -\ln \mathrm{P}(\boldsymbol{x}), \tag{3.31}$$
which is the fundamental identity underlying thermodynamic integration.
To compute the free-energy difference using Equation 3.31, we evaluate the integral with respect to $\theta$ numerically using Gaussian quadrature, while the inner average $\langle U_\mathrm{int} \rangle_\theta$ is computed using MCMC simulations. To perform MCMC simulations in trajectory space we use ideas from transition path sampling (TPS). Specifically, we define an MCMC proposal distribution for trajectories using forward shooting and backward shooting [8]. These proposals regrow either the end or the beginning of a trajectory, respectively. A proposal is accepted according to the Metropolis criterion [21]. Since the efficiency of MCMC samplers strongly depends on the proposal moves that are employed, we are confident that better MCMC estimates are possible with more sophisticated proposal distributions.
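The structure of the estimator, quadrature over $\theta$ of the inner average, can be checked on a model small enough that the averages are available exactly. In the following Python sketch (a three-state toy model with invented probabilities; the inner averages are computed by exact summation instead of MCMC, purely to isolate the thermodynamic integration step), the quadrature recovers $-\ln \mathrm{P}(\boldsymbol{x})$:

```python
import numpy as np

# Toy discrete stand-in: exact theta-averages replace the MCMC simulations.
P_s = np.array([0.5, 0.3, 0.2])           # reference ensemble, exp(-U_0)
P_x_given_s = np.array([0.7, 0.2, 0.05])  # exp(-U_int) for one fixed output x
U_int = -np.log(P_x_given_s)

def mean_U_int(theta):
    """<U_int>_theta under P_theta(s) proportional to P(s) P(x|s)^theta."""
    p = P_s * P_x_given_s ** theta
    p /= p.sum()
    return float(np.dot(p, U_int))

# Thermodynamic integration of the derivative of F_theta over [0, 1],
# evaluated by 8-point Gauss-Legendre quadrature.
nodes, weights = np.polynomial.legendre.leggauss(8)
theta = 0.5 * (nodes + 1.0)               # map quadrature nodes to [0, 1]
dF = 0.5 * float(np.dot(weights, [mean_U_int(t) for t in theta]))

exact = -np.log(float(np.dot(P_s, P_x_given_s)))  # -ln P(x) directly
print(dF, exact)
```

Because $\langle U_\mathrm{int} \rangle_\theta$ is a smooth function of $\theta$, a handful of quadrature nodes suffices; in the full TI-PWS scheme each node instead requires its own trajectory-space MCMC simulation.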
3.4.1 MCMC Sampling in Trajectory Space
TI-PWS relies on the computation of averages with respect to the ensembles corresponding to the interaction parameter $\theta$, given by $\mathrm{P}_\theta(\boldsymbol{s})$. Sampling from this family of distributions using the stochastic simulation algorithm (SSA, or Gillespie algorithm) is not possible. Instead, in this section, we show different ways to implement a Markov chain Monte Carlo (MCMC) sampler in trajectory space to generate correctly distributed trajectories.
We can build an MCMC sampler in trajectory space using the Metropolis-Hastings algorithm. To create a Markov chain in trajectory space, we need to find a suitable proposal kernel that generates a new trajectory $\boldsymbol{s}'$ from a given trajectory $\boldsymbol{s}$ with probability $q(\boldsymbol{s}' \mid \boldsymbol{s})$. We accept the proposal using the Metropolis criterion with probability
$$p_\mathrm{acc}(\boldsymbol{s} \to \boldsymbol{s}') = \min\left( 1, \frac{e^{-U_\theta(\boldsymbol{s}'; \boldsymbol{x})}\, q(\boldsymbol{s} \mid \boldsymbol{s}')}{e^{-U_\theta(\boldsymbol{s}; \boldsymbol{x})}\, q(\boldsymbol{s}' \mid \boldsymbol{s})} \right) \tag{3.32}$$
to create a chain of trajectories with stationary distribution given by $\mathrm{P}_\theta(\boldsymbol{s})$ for any $\theta \in [0, 1]$. To ensure efficient convergence of the resulting Markov chain to its stationary distribution, the proposal kernel must balance two conflicting requirements. To efficiently explore the state space per unit of CPU time, the proposed trajectory $\boldsymbol{s}'$ must be sufficiently different from the original trajectory $\boldsymbol{s}$, while at the same time it should not be so different that the acceptance probability becomes too low. Thus, the design of the proposal kernel is crucial for an efficient MCMC sampler, and we will discuss various strategies to create trial trajectories. Since different types of trial moves can easily be combined in a Metropolis-Hastings algorithm, the most efficient samplers often incorporate multiple complementary proposal strategies to improve the exploration speed of the trajectory space.
The simplest (and naïve) proposal kernel is to generate an entirely new trajectory $\boldsymbol{s}'$ independent of $\boldsymbol{s}$, by sampling directly from $\mathrm{P}(\boldsymbol{s})$ using the SSA. Hence, the transition kernel is given by $q(\boldsymbol{s}' \mid \boldsymbol{s}) = \mathrm{P}(\boldsymbol{s}')$, and a proposal is accepted with probability
$$p_\mathrm{acc}(\boldsymbol{s} \to \boldsymbol{s}') = \min\left( 1, \frac{e^{-U_\theta(\boldsymbol{s}'; \boldsymbol{x})}\, \mathrm{P}(\boldsymbol{s})}{e^{-U_\theta(\boldsymbol{s}; \boldsymbol{x})}\, \mathrm{P}(\boldsymbol{s}')} \right) = \min\left( 1, \left[ \frac{\mathrm{P}(\boldsymbol{x} \mid \boldsymbol{s}')}{\mathrm{P}(\boldsymbol{x} \mid \boldsymbol{s})} \right]^\theta \right), \tag{3.33}$$
where the second equality follows by using the definition of $U_\theta$ given in Equation 3.28. Although this simple scheme is correct, it should not be used in practice to compute $\langle U_\mathrm{int} \rangle_\theta$. Indeed, one would get a better estimate of $\mathrm{P}(\boldsymbol{x})$ by just using the same number of independent sample trajectories from $\mathrm{P}(\boldsymbol{s})$ in the brute-force scheme of Equation 2.7, without taking the detour of using MCMC to estimate the normalization constant.
Instead, an idea from transition path sampling is to only regenerate a part of the old trajectory as part of the proposal kernel [22]. By not regenerating the entire trajectory, the new trajectory $\boldsymbol{s}'$ is going to be correlated with the original trajectory $\boldsymbol{s}$, improving the acceptance rate. The simplest way to generate trial trajectories using a partial update is a move termed forward shooting, in which a time point along the existing trajectory is randomly selected, and a new trajectory segment is regrown from this point to the end, resulting in the proposal $\boldsymbol{s}'$. Since the new segment is generated according to the unbiased input statistics, the acceptance probability for the proposed trajectory is given by Equation 3.33, i.e., the same as if the entire trajectory had been regenerated. If the input dynamics given by $\mathrm{P}(\boldsymbol{s})$ are time-reversible, we can also perform a backward shooting move. Here, the beginning of $\boldsymbol{s}$ is replaced by a new segment that is generated backwards in time. Assuming that the initial condition is the input's steady-state distribution, the corresponding acceptance probability of the backward shooting move is again given by Equation 3.33. Using these two moves we create an MCMC sampler where both ends of the trajectory are flexible, and thus if the trajectory is not too long, the chain will quickly relax to its stationary distribution.
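A forward shooting sampler can be sketched in a few lines. The following Python example (the published implementation is in Julia; here a short discrete-time Markov chain with invented transition and emission probabilities stands in for the continuous-time input trajectories) regrows the trajectory from a random time point onwards and accepts with the likelihood ratio raised to the power $\theta$, as in Equation 3.33; the result is checked against the exact $\theta$-ensemble average obtained by enumerating all paths:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(11)

# Toy 3-state chain standing in for the input dynamics.
T_mat = np.array([[0.8, 0.2, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.2, 0.8]])
p0 = np.array([1 / 3, 1 / 3, 1 / 3])
emit = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
x_obs = [0, 1, 1]      # fixed output trajectory
theta = 0.5            # interaction strength of the theta-ensemble
n = len(x_obs)

def likelihood(s):     # P(x|s) = exp(-U_int)
    p = 1.0
    for si, xi in zip(s, x_obs):
        p *= emit[si, xi]
    return p

def regrow(s, k):
    """Forward shooting: keep s[:k], regrow s[k:] from the prior dynamics."""
    s_new = list(s)
    if k == 0:
        s_new[0] = rng.choice(3, p=p0)
        k = 1
    for j in range(k, n):
        s_new[j] = rng.choice(3, p=T_mat[s_new[j - 1]])
    return s_new

s = [0, 0, 0]
total, sweeps = 0.0, 20_000
for _ in range(sweeps):
    s_new = regrow(s, rng.integers(n))
    # Metropolis criterion: the prior factors of kept and regrown parts
    # cancel against the proposal density; only the likelihood ratio remains.
    if rng.random() < (likelihood(s_new) / likelihood(s)) ** theta:
        s = s_new
    total += -np.log(likelihood(s))    # accumulate U_int along the chain
mcmc_mean = total / sweeps

# Exact <U_int>_theta by enumerating all 27 paths of the toy chain.
def prior(s):
    p = p0[s[0]]
    for a, b in zip(s[:-1], s[1:]):
        p *= T_mat[a, b]
    return p

paths = list(product(range(3), repeat=n))
w = np.array([prior(p) * likelihood(p) ** theta for p in paths])
w /= w.sum()
exact = float(np.dot(w, [-np.log(likelihood(p)) for p in paths]))
print(mcmc_mean, exact)
```

Because the regrown segment is proposed from the unbiased input statistics, the prior contributions cancel in the acceptance ratio, exactly as in the trajectory-space scheme described above.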
For long trajectories it can prove to be a problem that the middle section is too inflexible when the proposal moves only regenerate either the beginning or the end of a trajectory. Therefore, one could additionally try to incorporate mid-section regrowth to make sure that the middle parts of the trajectory also become flexible. To regrow a middle segment of a trajectory, we have to generate a new segment of the same duration according to the stochastic dynamics given by $\mathrm{P}(\boldsymbol{s})$, but with the additional condition that both endpoints of the new segment connect to the existing trajectory. Although the starting point of the segment can be freely chosen, the challenge is to ensure that the end point of the new segment satisfies the end-point constraint. Stochastic processes that generate trajectories under the condition of hitting a specific point after a given duration are called stochastic bridging processes.
The simplest way to generate trajectories from a bridging process is by generating a trajectory segment of the required duration from the normal stochastic process and rejecting the segment if it does not hit the correct end point [23]. Clearly, this strategy is only feasible for very short segments and when the state space is discrete, as otherwise almost every generated segment will be rejected due to not hitting the correct end point. To avoid this problem, more efficient algorithms have been developed to simulate stochastic bridges for some types of stochastic processes. For diffusion processes, bridges can be simulated efficiently by introducing a guiding term into the corresponding Langevin equation [24]. For jump processes, bridges can be simulated using particle filters [25], by a weighted stochastic simulation algorithm (wSSA) [26], or using random time-discretization (uniformization) [23].
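The rejection strategy for a discrete-state jump process is straightforward to sketch. The following Python example (a discrete-time birth-death chain with invented jump probabilities, used as a stand-in for a continuous-time jump process) resamples a candidate segment until it satisfies the end-point constraint:

```python
import numpy as np

rng = np.random.default_rng(5)

# Discrete-time birth-death chain on {0, ..., 10}: a jump-process stand-in.
def step(state):
    r = rng.random()
    if r < 0.3 and state < 10:
        return state + 1
    if r < 0.6 and state > 0:
        return state - 1
    return state

def bridge_by_rejection(start, end, steps, max_tries=100_000):
    """Generate a segment from `start` that hits `end` after `steps` jumps,
    by naive rejection: resample until the end-point constraint is met."""
    for _ in range(max_tries):
        seg = [start]
        for _ in range(steps):
            seg.append(step(seg[-1]))
        if seg[-1] == end:
            return seg
    raise RuntimeError("no bridge found; segment too long for rejection sampling")

seg = bridge_by_rejection(start=3, end=5, steps=6)
print(seg)
```

For this short segment a valid bridge is found after a handful of tries; the acceptance probability, however, decays quickly with the segment duration, which is why the guided and uniformization-based bridge samplers cited above are preferred in practice.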
Further techniques to create trajectory space MCMC samplers have been developed in the literature. Crooks [27] describes a scheme to create MCMC moves for trajectories evolving under non-equilibrium dynamics, by making MCMC moves that change the trajectories' noise histories. In the Particle Markov Chain Monte Carlo (PMCMC) algorithm, proposal trajectories are generated using a particle filter and accepted with an appropriate Metropolis criterion [28]. Another class of efficient samplers for Markov jump processes can be built using uniformization [29].
3.5 Simple Application and Benchmark
To demonstrate the power of our framework and illustrate how the techniques of the previous sections can be used in practice, we apply PWS to a simple chemical reaction network. We consider a linearly coupled birth-death process which has been studied previously using a Gaussian model [9], and by Duso and Zechner [10] using an approximate technique, and we compare our results with these studies. This simple birth-death system serves to illustrate the main ideas of our approach and also highlights that linear systems can be distinctly non-Gaussian.
The code used to produce the PWS estimates was written in the Julia programming language [30] and has been made freely available [31,32]. For performing stochastic simulations we use the DifferentialEquations.jl package [33].
We consider a stochastic process of a species $X$ which is created at rate $\rho s$ and decays with constant rate $\mu$ per copy of $X$. This system receives information from an input signal $S$ that modulates the birth rate of $X$. For simplicity, we assume that the birth rate is simply proportional to the instantaneous input copy number,
$$\text{birth rate of } X = \rho s,$$
where $\rho$ is a constant and $s$ is the input copy number at time $t$. This is a simple model for gene expression, where the rate of production of a protein $X$ is controlled by a transcription factor $S$, and $X$ itself has a characteristic decay rate. The input trajectories themselves are generated via a separate birth-death process with production rate $\kappa$ and decay rate $\lambda$.
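The coupled birth-death model is easy to simulate with the Gillespie algorithm. The following minimal Python sketch illustrates the reaction network (the published PWS code is written in Julia using DifferentialEquations.jl; the rate constants and initial copy numbers below are illustrative placeholders, not the values used for Figure 3.2):

```python
import numpy as np

rng = np.random.default_rng(2)

# Minimal Gillespie (SSA) sketch of the coupled birth-death model.
# Rate constants are illustrative placeholders.
kappa, lam = 10.0, 1.0   # input:  0 -> S at kappa,     S -> 0 at lam * s
rho, mu = 5.0, 5.0       # output: S -> S + X at rho*s, X -> 0 at mu * x

def ssa(T, s=10, x=10):
    """Simulate one joint input-output trajectory of duration T."""
    t, traj = 0.0, [(0.0, s, x)]
    while True:
        rates = np.array([kappa, lam * s, rho * s, mu * x])
        total = rates.sum()
        t += rng.exponential(1.0 / total)   # waiting time to the next reaction
        if t > T:
            return traj
        r = rng.choice(4, p=rates / total)  # pick which reaction fires
        if r == 0:
            s += 1
        elif r == 1:
            s -= 1
        elif r == 2:
            x += 1
        else:
            x -= 1
        traj.append((t, s, x))

traj = ssa(T=10.0)
print(len(traj), traj[-1])
```

With these placeholder rates the stationary means are $\langle s \rangle = \kappa/\lambda$ and $\langle x \rangle = \rho \langle s \rangle / \mu$; copy numbers can never become negative because the corresponding decay propensities vanish at zero.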
We compute the trajectory mutual information for this system as a function of the duration $T$ of the input and output trajectories. For large $T$, the trajectory mutual information is expected to increase linearly with $T$, since, on average, every additional output segment contains the same additional amount of information about the input trajectory. Because we are interested in the mutual information in steady state, the initial states were drawn from the stationary distribution $\mathrm{P}(s_0, x_0)$. This distribution was obtained using a Gaussian approximation. This does not influence the asymptotic rate of increase of the mutual information, but leads to a nonzero mutual information already for $T = 0$.
Figure 3.2 shows the mutual information as a function of the trajectory duration $T$. We compare the three PWS variants and two approximate schemes. One is that of Duso and Zechner [10]. To apply it, we used the code publicly provided by the authors, and to avoid making modifications to this code, we chose a fixed initial condition, which causes the mutual information to be zero for $T = 0$. The figure also shows the analytical result of a Gaussian model [9], obtained using the linear-noise approximation (see Section 3.7.1).
We find that the efficiency of the respective PWS variants depends on the duration of the input-output trajectories. For short trajectories all PWS variants yield very similar estimates for the mutual information. However, for longer trajectories the estimates of DPWS and, to a smaller degree, TI-PWS diverge, because of poor sampling of the trajectory space in the estimate of $\mathrm{P}(\boldsymbol{x})$. For longer trajectories, the estimate becomes increasingly dominated by rare trajectories, which make an exceptionally large contribution to the marginalization average. Missing these rare trajectories with a high weight tends to increase the marginal entropy [see Equation 2.8], and thereby the mutual information; indeed, the estimates of DPWS and TI-PWS are higher than that of RR-PWS. For brute-force DPWS, the error decreases as we increase the number of input trajectories per output trajectory used to estimate $\mathrm{P}(\boldsymbol{x})$. Similarly, for TI-PWS the error decreases as we use more MCMC samples for the marginalization scheme. For RR-PWS, however, the estimate has already converged at a moderate ensemble size $M$; we verified that a further increase of $M$ does not change the results.
We also find excellent agreement between the RR-PWS estimate and the approximate result of Duso and Zechner [10]; only very small deviations are visible in Figure 3.2. These deviations are mostly caused by the different choices of initial conditions: in RR-PWS, the initial conditions are drawn from the stationary distribution, while in the Duso scheme they are fixed, such that at zero trajectory duration the mutual information computed with RR-PWS is finite while that computed with the Duso scheme is zero. Yet, as the trajectory duration increases, the Duso estimate slowly "catches up" with the RR-PWS result.
Figure 3.2 also shows that although the Gaussian model matches the PWS result at zero trajectory duration, it systematically underestimates the mutual information for trajectories of finite duration. Interestingly, this is not a consequence of small copy-number fluctuations: increasing the average copy number does not significantly improve the Gaussian estimate.⁵
The different approaches for computing the marginal probability lead to different computational efficiencies of the respective PWS schemes. In Figure 3.3, as a benchmark, we show the magnitude of the error of the different PWS estimates as a function of the required CPU time. As expected, the computation of the marginal probability poses problems for long trajectories when using the brute-force DPWS scheme. More interestingly, while TI-PWS improves the estimate of the mutual information, the improvement is not dramatic. Unlike the brute-force scheme, thermodynamic integration does make it possible to generate input trajectories that are correlated with the output trajectories, but it still overestimates the mutual information for long trajectories unless a very large number of MCMC samples is used.
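The thermodynamic-integration strategy can likewise be illustrated on a toy stand-in. This is a minimal sketch, assuming a 1D Gaussian prior and likelihood in place of trajectory space and a plain Metropolis random walk in place of trajectory-space MCMC; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Thermodynamic-integration sketch: with Z(theta) = \int P(s) P(x|s)^theta ds,
# ln P(x) = ln Z(1) - ln Z(0) = \int_0^1 <ln P(x|s)>_theta dtheta,
# where <.>_theta averages over p_theta(s) proportional to P(s) P(x|s)^theta.
sigma_s, sigma_n, x_obs = 1.0, 0.5, 1.3

def log_lik(s):
    return -0.5 * ((x_obs - s) / sigma_n) ** 2 - 0.5 * np.log(2 * np.pi * sigma_n**2)

def mean_log_lik(theta, n_steps=20000, step=1.0):
    # random-walk Metropolis sampling from p_theta(s)
    def log_target(s):
        return -0.5 * (s / sigma_s) ** 2 + theta * log_lik(s)
    s, lt, samples = 0.0, None, []
    lt = log_target(s)
    for _ in range(n_steps):
        s_new = s + rng.normal(0.0, step)
        lt_new = log_target(s_new)
        if np.log(rng.random()) < lt_new - lt:   # Metropolis acceptance
            s, lt = s_new, lt_new
        samples.append(log_lik(s))
    return float(np.mean(samples[n_steps // 10:]))  # drop 10% burn-in

thetas = np.linspace(0.0, 1.0, 11)
integrand = np.array([mean_log_lik(th) for th in thetas])
# trapezoidal rule over theta
ln_px_ti = float(np.sum((integrand[:-1] + integrand[1:]) / 2 * np.diff(thetas)))

var = sigma_s**2 + sigma_n**2        # exact: x ~ N(0, sigma_s^2 + sigma_n^2)
ln_px_exact = -0.5 * x_obs**2 / var - 0.5 * np.log(2 * np.pi * var)
print(f"TI estimate {ln_px_ti:.3f} vs exact {ln_px_exact:.3f}")
```

Bridging from θ = 0 (prior) to θ = 1 lets the sampler visit inputs correlated with the output, but the estimate inherits the discretization and MCMC errors of the integrand, in line with the observation that TI-PWS needs many MCMC samples for long trajectories.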
The RR-PWS implementation clearly outperforms the other estimates for this system. The regular resampling steps ensure that we mostly sample input trajectories with non-vanishing likelihood, thereby avoiding the sampling problem of DPWS. Moreover, sequential Monte Carlo techniques such as RR-PWS and FFS [15] have a considerable advantage over MCMC techniques in trajectory sampling: with MCMC path sampling, we typically make small changes to an existing trajectory, such that the system moves slowly through path space, leading to poor statistics. In contrast, RR-PWS generates new trajectories from scratch, segment by segment, and these explore trajectory space much faster.
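The sequential structure just described, generating trajectories segment by segment with interleaved resampling, is the same as that of a particle filter estimating a marginal likelihood. Here is a minimal sketch on a toy linear-Gaussian state-space model (all parameters illustrative, not the birth-death model of the text), chosen because the exact answer is available from a Kalman filter for comparison.

```python
import numpy as np

rng = np.random.default_rng(1)

# Propagate an ensemble of input "segments", weight each by the likelihood of
# the corresponding output segment, and resample, accumulating ln P(x) from
# the mean weight at every step.
a, sig_s, sig_o, T, M = 0.9, 0.3, 0.4, 50, 2000

# simulate one input/output trajectory pair
s_true, x_obs = np.zeros(T), np.zeros(T)
for t in range(T):
    s_prev = s_true[t - 1] if t > 0 else 0.0
    s_true[t] = a * s_prev + rng.normal(0, sig_s)
    x_obs[t] = s_true[t] + rng.normal(0, sig_o)

def log_obs(x, s):
    return -0.5 * ((x - s) / sig_o) ** 2 - 0.5 * np.log(2 * np.pi * sig_o**2)

particles, log_marginal = np.zeros(M), 0.0
for t in range(T):
    particles = a * particles + rng.normal(0, sig_s, M)   # extend input segments
    logw = log_obs(x_obs[t], particles)
    m = logw.max()
    w = np.exp(logw - m)
    log_marginal += m + np.log(w.mean())                  # running ln P(x) estimate
    particles = particles[rng.choice(M, size=M, p=w / w.sum())]  # resample

# exact ln P(x) from a Kalman filter, for comparison
mu, P, ll = 0.0, 0.0, 0.0
for t in range(T):
    mu, P = a * mu, a * a * P + sig_s**2                  # predict
    S = P + sig_o**2
    ll += -0.5 * (x_obs[t] - mu) ** 2 / S - 0.5 * np.log(2 * np.pi * S)
    K = P / S
    mu, P = mu + K * (x_obs[t] - mu), (1 - K) * P         # update

print(f"SMC estimate {log_marginal:.2f} vs Kalman exact {ll:.2f}")
```

The resampling step prunes input segments with vanishing likelihood before they contaminate the average, which is precisely the mechanism that lets RR-PWS avoid the DPWS sampling problem.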
3.6 Discussion
Aside from Direct PWS introduced in Chapter 2, we developed two additional variants of PWS, capitalizing on the connection between information theory and statistical physics. Specifically, the computation of the mutual information requires the evaluation of the marginal probability of individual output trajectories, which corresponds to the computation of a partition function in statistical physics. RR-PWS and TI-PWS employ techniques from polymer and rare-event simulations to make the computation of the marginal trajectory probability more efficient.
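Written out explicitly, the correspondence reads as follows; this is a restatement in the notation of the statistical-physics analogy, with the convention β = 1 noted in the footnotes:

```latex
% marginal trajectory probability as a partition function
Z(x) \equiv \mathrm{P}(x)
  = \int \mathcal{D}s\, \mathrm{P}(s)\, e^{-U[s]},
\qquad U[s] \equiv -\ln \mathrm{P}(x \mid s),
\qquad F(x) \equiv -\ln Z(x) = -\ln \mathrm{P}(x).
```

Estimating the marginal probability of an output trajectory is thus formally a free-energy calculation, which is exactly what RR-PWS and TI-PWS exploit.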
The different PWS variants share some characteristics yet differ in others. DPWS and RR-PWS are static Monte Carlo schemes, in which new trajectories are generated independently of the previous ones. These methods are similar to static polymer sampling schemes like PERM [3] and rare-event methods like DFFS or BG-FFS [15]. In contrast, TI-PWS is a dynamic Monte Carlo scheme, in which a new trajectory is generated from the previous trajectory. In this regard, it is similar to the CBMC scheme for polymer simulations [35] and the TPS [8], TIS [36], and RB-FFS [15] schemes for harvesting transition paths. The benefit of static schemes is that the newly generated trajectories are uncorrelated with the previous ones, which makes them less likely to get stuck in certain regions of path space; concomitantly, they tend to diffuse faster through configuration space. Indeed, TI-PWS suffers from a problem that is also often encountered in TPS and TIS, namely that the middle sections of the trajectories move only slowly in the direction perpendicular to the path. Tricks that have been applied to TPS and TIS to solve this problem, such as parallel tempering, could also be of use here [37].
Another distinction is that RR-PWS generates all the trajectories in the ensemble simultaneously yet segment by segment, like DFFS, while DPWS and TI-PWS generate only one full trajectory at a time, similar to RB-FFS, BG-FFS, and also TPS and TIS. Consequently, RR-PWS, like DFFS, faces the risk of genetic drift: after sufficiently many resampling steps, most paths of the ensemble will originate from the same initial seed. When continuing to sample new segments, the old segments far in the past thus become essentially fixed, which makes it possible to miss important paths in the RR-PWS sampling procedure. As in DFFS, the risk of genetic drift in RR-PWS can be mitigated by increasing the initial number of path segments. Although we did not employ this trick here, we found that RR-PWS was by far the most powerful of the three variants studied.
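The genetic-drift effect is easy to demonstrate numerically: even with perfectly uniform weights, repeated multinomial resampling makes an ensemble coalesce onto ever fewer initial seeds. This is a self-contained toy demonstration, not PWS code; the ensemble size and step count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Track, for each path in the ensemble, the label of its initial seed.
# Uniform-weight multinomial resampling at every step is the most benign
# case possible -- yet the number of surviving distinct seeds still collapses.
M, n_steps = 200, 100
ancestors = np.arange(M)                   # label each path by its initial seed
unique_counts = []
for _ in range(n_steps):
    idx = rng.integers(0, M, size=M)       # uniform multinomial resampling
    ancestors = ancestors[idx]
    unique_counts.append(len(np.unique(ancestors)))

print(f"distinct seeds: start {M}, after 1 step {unique_counts[0]}, "
      f"after {n_steps} steps {unique_counts[-1]}")
```

In this toy run only a handful of the original seeds survive, which is why old path segments become effectively frozen in resampling-based schemes.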
Nonetheless, we expect DPWS and TI-PWS to become more efficient for systems that respond to the input signal with a significant delay. In these cases, the weight of a particular output trajectory depends on the degree to which its dynamics correlates with the dynamics of the input trajectory at earlier times. Because in RR-PWS a new segment of an output trajectory is generated based on the segment of the input trajectory that spans the same time interval, RR-PWS may miss these correlations between the dynamics of the output and that of the input at earlier times. In contrast, DPWS and TI-PWS generate full trajectories one at a time, and are therefore more likely to capture such correlations. The machine-learning-based approach for determining the optimal importance-sampling distribution presented in Chapter 6 (Section 6.2.4) is also likely to prove useful in these scenarios with complex temporal dependencies between the input and output trajectories.
3.7 Supplementary Information
3.7.1 Gaussian Approximation of the Linear System
We derive the Gaussian approximation of the simple reaction system used in Section 3.5. We recall the elementary biochemical reaction motif consisting of four reactions,

$$\emptyset \xrightarrow{\kappa} S, \qquad S \xrightarrow{\lambda} \emptyset, \qquad S \xrightarrow{\rho} S + X, \qquad X \xrightarrow{\mu} \emptyset,$$
with input S and output X. This reaction motif is a simple model for gene expression, where the rate of production of a protein X is controlled by a transcription factor S, and X itself decays at a characteristic rate. The dynamics of S are given by a constant birth rate and a constant (per-molecule) decay rate.
We compute the covariance functions of this model, which are then used to derive the Gaussian signal statistics and allow us to compute the Gaussian information rate. Specifically, we assume that the process is sampled at regular intervals, with $\mathbf{s} = (s_1, \ldots, s_N)$ the sequence of sampled inputs and $\mathbf{x} = (x_1, \ldots, x_N)$ the sequence of sampled outputs. We describe the dynamics of the input and output as fluctuations around the mean, i.e., we write $\delta s_i = s_i - \langle s \rangle$ and $\delta x_i = x_i - \langle x \rangle$. In the limit of large copy numbers, due to the central limit theorem, the distribution of these fluctuations becomes Gaussian [38].
In particular, let $\mathbf{z} = (\delta\mathbf{s}, \delta\mathbf{x})$ be the concatenation of the input and output fluctuation sequences. Then, the distribution of $\mathbf{z}$ is multivariate normal, i.e.,

$$\mathrm{P}(\mathbf{z}) = \frac{1}{\sqrt{(2\pi)^{2N} \det C}} \exp\!\left( -\frac{1}{2}\, \mathbf{z}^{\mathsf{T}} C^{-1} \mathbf{z} \right),$$
where the covariance matrix $C$ has the following block structure

$$C = \begin{pmatrix} C_{SS} & C_{SX} \\ C_{XS} & C_{XX} \end{pmatrix}.$$
Here, $C_{SS}$ and $C_{XX}$ are the (auto-)covariance matrices of the input and the output, respectively, whereas $C_{SX}$ and $C_{XS}$ contain the cross-covariances. The matrix elements are given by $(C_{AB})_{ij} = C_{ab}(t_i - t_j)$, where $C_{ab}(\tau)$ denotes the corresponding (cross-)covariance function and $t_i$ are the sampling times. Thus, the full statistics of the trajectories are determined by the (cross-)covariance functions.
Since the reaction scheme in Equation 3.35 features only first-order reactions, the covariance functions can be calculated explicitly using the regression theorem [38,39]. For $\tau \geq 0$, we obtain the following expressions for the covariance functions:
In the expressions above, we used the relative exponential function $\operatorname{exprel}(x) = (e^x - 1)/x$
and the point (co-)variances
Because the process is stationary, the values of the covariance functions for $\tau < 0$ can be obtained by applying the symmetry relation $C_{ab}(-\tau) = C_{ba}(\tau)$.
Now, we can directly compute the mutual information from the covariances. Using Equation 3.38 together with the expressions for the cross- and output covariance functions, we obtain the matrix elements of the joint covariance matrix defined in Equation 3.37, and the mutual information is then given by the expression

$$I(\mathbf{s}; \mathbf{x}) = \frac{1}{2} \ln\!\left( \frac{\det C_{SS}\, \det C_{XX}}{\det C} \right).$$
Note that for discretized trajectories of length $N$, the matrix $C$ has dimensions $2N \times 2N$. Thus, the computation of the trajectory mutual information requires evaluating the determinant of a $2N \times 2N$ matrix, which can be computationally challenging for long trajectories (large $N$). Elsewhere in this thesis, we discuss some techniques to considerably accelerate the computation of the mutual information.
Using spectral analysis, Tostevin and ten Wolde [40] derived an analytical expression for the Gaussian mutual information rate of this model in the continuous-time limit (Equation 3.42).
The information rate of the discretely sampled process converges to this value as the sampling rate approaches infinity [40]. Notably, however, the model corresponding to Equation 3.42 is a continuum description that assumes Gaussian statistics; indeed, this rate deviates from the mutual information rate of the exact model described by the chemical master equation, even in the limit of large copy numbers [41]. This finding is discussed further in Chapter 5 (Section 5.3.1).
References
Indeed, the mutual information precisely quantifies how strong the statistical dependence is between the trajectory-valued input and output random variables. From its definition, we can understand more clearly how this affects the efficiency of the Monte Carlo estimate. Roughly speaking, the input entropy is related to the number of distinct input trajectories that can arise from the input dynamics, while the conditional input entropy is related to the number of distinct input trajectories that could have led to a specific output trajectory, on average. Therefore, if the mutual information is very large, the difference between these two numbers is very large, and consequently the number of distinct input trajectories overall is much larger than the number of distinct input trajectories compatible with a given output. Now, if we generate input trajectories according to the prior input dynamics, with overwhelming probability we generate a trajectory that is not compatible with the output trajectory, and whose likelihood therefore essentially vanishes. Hence, the effective number of samples is much smaller than the actual number of generated trajectories. We therefore only expect the estimate in Equation 2.7 to be reliable when computing the mutual information for systems where it is not too high. Thus, strikingly, the difficulty of computing the mutual information grows with the magnitude of the mutual information itself.
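The collapse of the effective number of samples described in this footnote can be made quantitative with the Kish effective sample size, $\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$. Below is a toy illustration in which each "trajectory" is simply a T-dimensional Gaussian input observed with additive noise; all names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Draw inputs blindly from their prior and weight them by the likelihood of a
# fixed output, as in the brute-force marginalization. The effective sample
# size of the weighted ensemble collapses as the "trajectories" get longer.
def effective_sample_size(T, n=5000, sigma_n=0.5):
    s_true = rng.normal(0, 1, T)
    x = s_true + rng.normal(0, sigma_n, T)               # one fixed output
    s = rng.normal(0, 1, (n, T))                          # prior input samples
    logw = -0.5 * np.sum(((x - s) / sigma_n) ** 2, axis=1)  # log P(x|s) up to const
    w = np.exp(logw - logw.max())
    return w.sum() ** 2 / (w ** 2).sum()                  # Kish effective sample size

ess_short, ess_long = effective_sample_size(5), effective_sample_size(50)
print(f"ESS for T=5: {ess_short:.1f}, for T=50: {ess_long:.1f} (out of 5000)")
```

The ESS shrinks roughly exponentially with the trajectory length, which is this footnote's statement that the difficulty grows with the mutual information itself.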
Note that while the distribution resembles the equilibrium distribution of a canonical ensemble from statistical mechanics, this does not imply that we are studying systems in thermal equilibrium. Indeed, PWS is used to quantify information transmission in systems driven out of equilibrium by the input signal. The notation introduced in this section is thus merely a mathematical reformulation of the marginalization integral that makes the analogy to statistical physics apparent; we assign no physical meaning to the potentials and free energies. Also note that we set $\beta = 1$, since temperature is irrelevant here.
The samples generated through resampling are only approximate because they are limited to the discrete set of previously generated points, which was originally drawn from the prior distribution. The resampling process assigns weights based on the target distribution and uses them to select from that set, but it does not generate entirely new samples directly from the target. Therefore, the sampled points only constitute draws from the target distribution in the limit of infinitely many prior samples.
A detailed analysis of this observation was carried out in a (currently unpublished) collaborative work with Anne-Lena Moor and Christoph Zechner from the MPI-CBG in Dresden [34]. This work demonstrates that the discrepancy arises because all the information on the input signal is contained in the production process of the output species, which is catalyzed by the input, rather than in the decay process of the output, which occurs independently of the input. The PWS result captures this distinction by using a fully discrete approach. In contrast, the Gaussian estimate of the linear-noise approximation misses this distinction because, in the noise term for the output, the contributions from the production and decay reactions are added together. We comment further on this matter in Chapter 5.