
An R wrapper for the Inform v1.0.0 C library developed by Douglas G. Moore for performing information analysis of complex systems. As with the Inform library, rinform is structured around the concepts of:

  • discrete empirical probability distributions which form the basis for all of the information-theoretic measures,
  • classic information-theoretic measures built upon empirical distributions,
  • measures of information dynamics for time series.

In addition to the core components, rinform also provides a small collection of utilities to deal with time series.

If you are using rinform, consider citing the following articles:

  • D.G. Moore, G. Valentini, S.I. Walker, M. Levin. “Inform: Efficient Information-Theoretic Analysis of Collective Behaviors”. Frontiers in Robotics & AI. (under review)
  • D.G. Moore, G. Valentini, S.I. Walker, M. Levin. “Inform: A Toolkit for Information-Theoretic Analysis of Complex Systems”. In: Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence, Symposium on Artificial Life, 1-8, IEEE Press, 2017. https://doi.org/10.1109/SSCI.2017.8285197

Acknowledgement: This project was supported in part by the grant “Emergent computation in collective decision making by the crevice-dwelling rock ant Temnothorax rugatulus” from the National Science Foundation (NSF grant PHY-1505048).

1 Getting Started

1.1 Installation

The rinform package includes some C code, namely the sources of the Inform library. You may need some extra tools to install rinform, as they are required to compile the Inform sources (e.g., Xcode for Mac users, Rtools for Windows users).

1.1.1 Installation from CRAN

To install rinform directly from the CRAN repository you will simply need to type:
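
    install.packages("rinform")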

Once installed, you can load rinform prior to using it with:
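
    library(rinform)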

1.1.2 Installation from GitHub

To install rinform from its GitHub repository you will need to have installed the devtools package. If you don’t have devtools already, you can install it with:
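
    install.packages("devtools")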

Load devtools and install the latest stable version of rinform (i.e., master branch):
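
For example, assuming the GitHub repository path ELIFE-ASU/rinform (check the project page if the path differs):

    library(devtools)
    install_github("ELIFE-ASU/rinform")   # repository path assumed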

In case you need to use the development version of rinform, you can specify to install from the dev branch:
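
    # Repository path assumed as above
    install_github("ELIFE-ASU/rinform", ref = "dev")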

Once installed, you can load rinform prior to using it with:

1.2 Getting Help

rinform, like its parent library Inform, is developed to help anyone interested in applying information-theoretic techniques get things done quickly and painlessly. We cannot do that without your feedback. We host the project’s source code and issue tracker on GitHub. Please create an issue if you find a bug, an error in this documentation, or have a feature you’d like to request. Your contribution will make rinform a better tool for everyone.

If you are interested in contributing to rinform, please contact the developers, and we’ll get you up and running!

Resources: Source Repository and Issue Tracker

2 Empirical Distributions

The Dist class provides an empirical distribution, i.e. a histogram, representing the observed frequencies of some fixed-sized set of events. This class is the basis for all of the fundamental information measures provided on discrete probability distributions.

The purpose of the Dist class is to make writing custom implementations of information-theoretic measures possible. Most of the measures implemented in rinform use this structure at their core.

Type                 Functions
Class                Dist
Allocation           resize, copy, infer, approximate, uniform
Accessors/Mutators   length, counts, valid, get_item, set_item, tick, accumulate
Probabilities        probability, dump

2.1 Distribution Class

This class is the basis for almost all of the calculations performed via the library. It provides a means of managing observations and converting the observed event frequencies to probabilities.

The distribution is, roughly, a histogram with finite support. The events are assumed to be mapped to the dense set of integers \(\{0,1,\ldots,N-1\}\) where \(N\) is the number of observable events. The number of observable events can be extracted with length.

Whenever the size of the support or the number of observed events is zero, the distribution is considered invalid, meaning that you can’t trust any probabilities extracted from it.

Interface:

A distribution of observed event frequencies. If the parameter n is an integer, the distribution is constructed with a zeroed support of size \(n\). If n is a vector of integer values, the sequence is treated as the underlying support. On the other hand, if n is a vector of floating point values, it is treated as a probability distribution and must sum to unity. Note that if a probability distribution is given as the underlying support, it will first be converted to a histogram with a precision of 9 significant figures.

Examples:

Create an empty distribution with support size 3:

Create a distribution with 2 events, the first observed 5 times, the second observed 3 times:
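
A minimal sketch covering both constructions described above (integer vectors are treated as counts, as noted in the interface):

    d <- Dist(3)          # empty distribution with a zeroed support of size 3
    d <- Dist(c(5L, 3L))  # 2 events observed 5 and 3 times, respectively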

2.2 Allocation

2.2.1 Resize

If the distribution is NULL, a new distribution of the requested size is created.

If the distribution already has the requested size, then the original distribution is returned. If the requested size is greater than the current size, then the newly observable events are assigned zero frequency. If the requested size is less than the current size, then the discarded events are lost and the observation count is adjusted accordingly.

Note that the distribution cannot be reduced to size 0. In that case the distribution is left unmodified.

Interface:

Resize the distribution d to have new support of size n.

Examples:

Create a distribution with size 2:

Increase the size of the support to 5:

Decrease the size of the support to 3:
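
A sketch of the three steps above (assuming resize returns the resized distribution):

    d <- Dist(c(1L, 2L))   # distribution with a support of size 2
    d <- resize(d, 5)      # newly observable events get zero frequency
    d <- resize(d, 3)      # discarded events are lost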

2.2.2 Copy

If the source distribution is NULL, then NULL is returned.

Interface:

Return a copy of the distribution d.

Examples:

Create a base distribution d:

Copy distribution to a different object p:

2.2.3 Infer

Interface:

Infer a distribution from a collection of observed events.

Examples:

Create a distribution \(d = \{3/5, 2/5\}\):

Create a distribution \(d = \{3/8, 3/8, 2/8\}\):
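
Hedged sketches for both distributions above, assuming events are labelled from 0 as in the underlying Inform C library:

    d <- infer(c(0, 0, 0, 1, 1))            # d = {3/5, 2/5}
    d <- infer(c(0, 1, 2, 0, 1, 2, 0, 1))   # d = {3/8, 3/8, 2/8}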

2.2.4 Approximate

Interface:

Approximate a given probability distribution to a given tolerance.

Examples:

Approximate a distribution with tolerance \(10^{-3}\):

2.2.5 Uniform

Interface:

Create a uniform distribution of a given size n. The support size n must define a valid support (i.e., n must be greater than 0).

Examples:

Initialize a uniform distribution of size 3:
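
    u <- uniform(3)   # three equally likely events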

2.3 Accessors/Mutators

2.3.1 Length

Interface:

Get the size of the distribution’s support. If the distribution is NULL, then a support of 0 is returned.

Examples:

Size of a NULL distribution:

Size of a distribution with 5 elements:

2.3.2 Counts

Interface:

Get the total number of observations so far made.

counts(d)

Examples:

Counts of valid distribution:

Let’s modify the distribution and count again:

2.3.3 Valid

Interface:

Determine whether or not the distribution is valid. In order to safely extract probabilities, the size of the support must be non-zero and the number of observations must be non-zero. In any other case, the distribution is invalid.

Examples:

Invalid distribution with 0 observations:

A (valid) distribution with a support of 5 elements:

Let’s invalidate distribution by zeroing its support:

2.3.4 Get Item

Interface:

Get the number of occurrences of a given event.

Examples:

Get all items in a sequence from a valid distribution:

2.3.5 Set Item

Interface:

Set the number of occurrences of a given event. This function manually sets the number of occurrences of a given event. Note that the only restriction is that the value be non-negative. This means that this function can be used to invalidate the distribution by changing all of the event frequencies to zero.

Examples:

Let’s initialize an empty distribution:

Now we can set items and make it a valid distribution:

2.3.6 Tick

Interface:

Increment the number of observations of a given event and return an updated copy of the distribution d. As an alternative to set_item, this function simply increments the number of occurrences of a given event. This is useful when iteratively observing events.

Examples:

Let’s create an initial distribution:

Now we can add an observation for each event:

2.3.7 Accumulate

Interface:

Accumulate observations from a series. If an invalid distribution is provided, no events will be observed and an error will be raised. If an invalid event is provided, then the number of valid events to that point will be added and a warning will be raised.

Examples:

Let’s create a valid distribution and dump it:

We can now accumulate events:

If we try to accumulate invalid events,

only events up to the first invalid entry will be added.

2.4 Probabilities

2.4.1 Probability

Interface:

Extract the probability of an event from a distribution d. This function simply computes the probability of a given event and returns that value.

Example:

Let’s initialize a distribution with a support of 3:

We can compute probabilities as:

2.4.2 Dump

Interface:

Dump the probabilities of all events to an array. This function computes the probabilities of all of the events in the support and returns them as an array.

Example:

Let’s initialize distribution and dump its probabilities:
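
For instance (a sketch; the printed format may differ):

    d <- Dist(c(2L, 2L, 4L))
    dump(d)   # 0.25 0.25 0.50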

Now we can modify the distribution and dump it again:

3 Shannon Information Measures

The rinform package provides a collection of entropy and other information measures defined on discrete probability distributions (see Dist). These functions form the core of Inform, as all of the time series analysis functions are built upon them. rinform provides access to these functions in case the user wants to use them outside the (already wrapped) time series measures.

3.1 Entropy

Taking \(X\) to be a random variable with \(p_X\) a probability distribution on \(X\), the base-\(b\) Shannon entropy is defined as \[ H(X) = - \sum_{x} p_X(x) \log_b p_X(x). \]

See (Shannon 1948) for more details.

Interface:

Compute the base-b Shannon entropy of the distribution p.

Example:

Compute the Shannon entropy of a uniform distribution:

Compute the Shannon entropy of a non-uniform distribution:
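
As a plain base-R illustration of the definition above (not necessarily the rinform interface), both values can be computed directly from a probability vector:

    # Base-b Shannon entropy of a probability vector p
    H <- function(p, b = 2) -sum(p[p > 0] * log(p[p > 0], base = b))
    H(rep(1/4, 4))           # uniform over 4 events: 2 bits
    H(c(0.5, 0.25, 0.25))    # non-uniform distribution: 1.5 bits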

3.2 Mutual Information

Mutual information provides a measure of the mutual dependence between two random variables. Let \(X\) and \(Y\) be random variables with probability distributions \(p_X\) and \(p_Y\) respectively, and \(p_{X,Y}\) the joint probability distribution over \((X,Y)\). The base-\(b\) mutual information between \(X\) and \(Y\) is defined as \[ \begin{split} I(X;Y) &= \sum_{x,y} p_{X,Y}(x,y) \log_b \frac{p_{X,Y}(x,y)}{p_X(x)p_Y(y)} \\ &= H(X) + H(Y) - H(X,Y). \end{split} \] Here the second line takes advantage of the properties of logarithms and the definition of Shannon entropy.

To some degree one can think of mutual information as a measure of the (linear and non-linear) correlations between random variables.

See (Cover and Thomas 1991) for more details.

Interface:

Compute the base-b mutual information between two random variables \(X\) and \(Y\).

Examples:

Compute the base-2 mutual information between two random variables:

3.3 Conditional Entropy

Conditional entropy quantifies the amount of information required to describe a random variable \(X\) given knowledge of a random variable \(Y\). With \(p_Y\) the probability distribution of \(Y\), and \(p_{X,Y}\) the joint distribution for \((X,Y)\), the base-\(b\) conditional entropy is defined as \[ \begin{split} H(X|Y) &= -\sum_{x,y} p_{X,Y}(x,y) \log_b \frac{p_{X,Y}(x,y)}{p_Y(y)} \\ &= H(X,Y) - H(Y). \end{split} \]

See (Cover and Thomas 1991) for more details.

Interface:

Compute the base-b conditional entropy given joint (p_xy) and marginal (p_y) distributions.

Examples:

Compute the base-2 conditional entropy between two random variables:

3.4 Conditional Mutual Information

Conditional mutual information was introduced by (Dobrushin 1959) and (Wyner 1978) and approximately quantifies the average mutual information between random variables \(X\) and \(Y\) given knowledge of a third \(Z\). Following the same notation as for the conditional entropy, the base-\(b\) conditional mutual information is defined as \[ \begin{split} I(X;Y|Z) &= \sum_{x,y,z} p_{X,Y,Z}(x,y,z) \log_b \frac{p_{X,Y|Z}(x,y|z)}{p_{X|Z}(x|z)p_{Y|Z}(y|z)} \\ &= \sum_{x,y,z} p_{X,Y,Z}(x,y,z) \log_b \frac{p_{X,Y,Z}(x,y,z)p_{Z}(z)}{p_{X,Z}(x,z)p_{Y,Z}(y,z)} \\ &= H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z) \end{split} \]

Interface:

Compute the base-b conditional mutual information given the joint (p_xyz) and marginal (p_xz, p_yz, p_z) distributions.

Examples:

Compute the base-2 conditional mutual information between two random variables given a third:

3.5 Relative Entropy

Relative entropy, also known as the Kullback-Leibler divergence, was introduced by Kullback and Leibler in 1951 (Kullback and Leibler 1951). Given a random variable \(X\) and two probability distributions \(p_X\) and \(q_X\), relative entropy measures the information gained in switching from the prior \(q_X\) to the posterior \(p_X\): \[ D_{KL}(p_X || q_X) = \sum_x p_X(x) \log_b \frac{p_X(x)}{q_X(x)}. \]

Many of the information measures (e.g. mutual information, conditional entropy) amount to applications of relative entropy for various prior and posterior distributions.

Interface:

Compute the base-b relative entropy between posterior (p) and prior (q) distributions.

Examples:

Compute the base-2 relative entropy between posterior and prior distributions:

3.6 Cross Entropy

The cross entropy between two probability distributions \(p\) and \(q\) over the same support measures the average information needed to identify an event drawn from the set using a coding optimized for an “unnatural” probability distribution \(q\), rather than the “true” distribution \(p\). The cross entropy for the distributions \(p\) and \(q\) is defined as: \[ \begin{split} H(p_{X}, q_{X}) &= -\sum_{x} p_{X}(x) \log_b q_{X}(x) \\ &= H(p_{X}) + D_{KL}(p_X || q_X). \end{split} \]

Interface:

Compute the base-b Shannon cross entropy between a true distribution p and an unnatural distribution q.

Examples:

Compute the base-2 cross entropy between two distributions:

4 Time Series Measures

The original purpose of Inform was to analyze time series data. This explains why most of Inform’s functionality resides in functions specifically optimized for analyzing time series. Many information measures have “local” variants which compute a time series of point-wise values. This feature is directly inherited by the rinform wrapper.

We have been meticulous in ensuring that function and parameter names are consistent across measures. If you notice some inconsistency, please report it as an issue.

4.1 Notation

Throughout the discussion of time series measures, we will try to use a consistent notation. We will denote random variables as \(X,Y,\ldots\), and let \(x_i,y_i,\ldots\) represent the \(i\)-th time step of a time series drawn from the associated random variable. Many of the measures consider \(k\)-histories (a.k.a \(k\)-blocks) of the time series, e.g. \[x_i^{(k)} = \left\{x_{i-k+1}, x_{i-k+2},\ldots,x_i\right\}\].

When denoting probability distributions, we will only make the random variable explicit in situations where the notation is ambiguous. We will typically write \(p(x_i)\), \(p(x_i^{(k)})\), and \(p(x_i^{(k)}, x_{i+1})\) to denote the empirical probability of observing the \(x_i\) state, the \(x_i^{(k)}\) \(k\)-history, and the joint probability of observing \(\left(x_i^{(k)}, x_{i+1}\right)\).

4.2 Implementation Details

4.2.1 The Base: States and Logarithms

The word “base” has two different meanings in the context of information measures on time series. It could refer to the base of the time series itself, that is the number of unique states in the time series. For example, the time series \(\{0,2,1,0,0\}\) is a base-3 time series. On the other hand, it could refer to the base of the logarithm used in computing the information content of the inferred probability distributions. The problem is that these two meanings clash. The base of the time series affects the range of values the measure can produce, and the base of the logarithm represents a rescaling of those values.

In the rinform wrapper we deal with this by always using base-2 logarithms and automatically computing the base of each time series internally to the wrapper. All of this ensures that the library is as simple as reasonably possible.

4.2.2 Multiple Variables, Initial Conditions, and Time Steps

You generally need a lot of data to infer a probability distribution. An experimentalist or simulator might then collect data over multiple trials or initial conditions. Generally, this process also involves a large number of variables. Most of rinform’s time series measures allow the user to pass in a two-dimensional matrix with each column representing a time series from a different initial condition and/or variable. From this the probability distributions are inferred, and the desired value is calculated. This has the downside of requiring that the user store all of the data in memory, but it has the advantage of being fast and simple.

Time series measures provided by rinform accept as arguments Vector and Matrix objects. A Vector, for example,
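
    xs <- c(0, 1, 1, 0, 1)   # illustrative values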

defines one time series and its length defines the number of time steps, \(m\). Specifically, it defines one initial condition or trial for one variable. The user can deal with multiple variables, \(l\), and multiple initial conditions, \(n\), using Matrix objects with \(l*n\) columns and \(m\) rows:
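
    # Illustrative values: columns are trials 1 and 2 of X1, then trials 1 and 2 of X2
    series <- matrix(c(0, 1, 1, 0, 1,
                       1, 0, 0, 1, 1,
                       0, 0, 1, 1, 0,
                       1, 1, 0, 1, 0), nrow = 5, ncol = 4)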

where each column represents a different time series, possibly from different variables. In the previous example we have \(l = 2\) variables, respectively, \(X_1\) and \(X_2\), with \(n = 2\) trials each, where each series has \(m = 5\) time steps. Columns [, 1] and [, 2] represent trial 1 and 2 of variable \(X_1\) while columns [, 3] and [, 4] represent trial 1 and 2 of variable \(X_2\).

When possible, rinform’s functions infer the number of variables, \(l\), the number of initial conditions, \(n\), and the number of time steps \(m\) from the provided data structures. This is possible in most cases due to the fact that all time series must have the same number of initial conditions and time steps. If this is not the case, rinform’s API will explicitly require the user to specify the number of variables \(l\).

4.3 Active Information

Active information (AI) was introduced in (Lizier, Prokopenko, and Zomaya 2012) to quantify information storage in distributed computations. Active information is defined in terms of a temporally local variant

\[ a_{X,i}(k) = \log_2{\frac{p(x_i^{(k)}, x_{i+1})}{p(x_i^{(k)})p(x_{i+1})}}, \]

where the probabilities are constructed empirically from the entire time series. From the local variant, the temporally global active information is defined as

\[ A_X(k) = \langle a_{X,i}(k) \rangle_i = \sum_{x_i^{(k)},x_{i+1}} p(x_i^{(k)},x_{i+1}) \log_2{\frac{p(x_i^{(k)}, x_{i+1})}{p(x_i^{(k)})p(x_{i+1})}}. \]

Strictly speaking, the local and average active information are defined as

\[ a_{X,i} = \lim_{k\rightarrow \infty}a_{X,i}(k) \qquad \textrm{and} \qquad A_X = \lim_{k\rightarrow \infty}A_X(k), \]

but we do not yet provide this limiting functionality in the library (see the GitHub issue tracker).

Interface:

Compute the average or local active information with a history length k.

Examples:

Compute Active Information with one initial condition:

and local variant:
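
Hedged sketches for both calls above, assuming the measure is exposed as active_info(series, k, local = FALSE), mirroring the Inform naming (check the package reference for the exact signature):

    xs <- c(0, 0, 1, 1, 1, 1, 0, 0, 0)
    active_info(xs, k = 2)                 # average active information
    active_info(xs, k = 2, local = TRUE)   # vector of local values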

Compute Active Information with two initial conditions:

and local variant:

4.4 Block Entropy

Block entropy, also known as \(N\)-gram entropy (Shannon 1948), is the standard Shannon entropy of the \(k\)-histories of a time series: \[ H(X^{(k)}) = -\sum_{x_i^{(k)}} p(x_i^{(k)}) \log_2{p(x_i^{(k)})} \] which reduces to the traditional Shannon entropy for \(k = 1\).

Interface:

Compute the average or local block entropy of a time series with block size k.

Examples:

Compute Block Entropy with one initial condition and history length \(k \in \{1, 2\}\):

and local variant:

Compute Block Entropy with two initial conditions and history length \(k = 2\):

and local variant:

4.5 Conditional Entropy

Conditional entropy is a measure of the amount of information required to describe a random variable \(Y\) given knowledge of another random variable \(X\). When applied to time series, two time series are used to construct the empirical distributions, and the conditional entropy is given by \[ H(Y|X) = - \sum_{x_i,y_i} p(x_i,y_i) \log_2{p(y_i|x_i)}. \]

This can be viewed as the time-average of the local conditional entropy \[ h_i(Y|X) = -\log_2{p(y_i|x_i)}. \] See (Cover and Thomas 1991) for more information.

Interface:

Compute the average and local conditional entropy between two time series.

This function expects the condition to be the first argument, xs. It is expected that each time series be the same length n, but may have different bases.

Examples:

Compute Conditional Entropy between two time series:

and local variant:

4.6 Cross Entropy

Cross entropy between two distributions \(p_X\) and \(q_X\) measures the amount of information needed to identify events using a coding scheme optimized for \(q_X\) when \(p_X\) is the “real” distribution over \(X\). \[ H(p,q) = -\sum_{x} p(x) \log_2{q(x)} \] Cross entropy’s local variant is equivalent to the self-information of \(q_X\) and as such is implemented by the local block entropy.

See (Cover and Thomas 1991) for more details.

Interface:

Compute the cross entropy between the “true” and “unnatural” distributions \(p_X\) and \(q_X\) from associated time series ps and qs, respectively.

Examples:

Compute Cross Entropy between two time series:

4.7 Effective Information

Effective information is a causal measure aimed at teasing out the causal structure of a dynamical system. In essence, it is the mutual information between an “intervention” distribution — a probability distribution over initial states — and the distribution after one time step: \[ EI(A,p) = I(p,p^TA) \] where \(A\) is the transition probability matrix and \(p\) an intervention distribution. Functionality to construct a transition probability matrix from time series is provided by the series_to_tpm function.

See (Tononi and Sporns 2003), (Hoel, Albantakis, and Tononi 2013) or (Hoel 2017) for more information.

Interface:

Compute the effective information from an \(n \times n\) transition probability matrix tpm given an intervention distribution inter. If inter is NULL, then the uniform distribution over the \(n\) states is used.

Examples:

Compute Effective Information with a uniform intervention distribution:

Compute Effective Information with a non-uniform intervention distribution:

4.8 Entropy Rate

Entropy rate quantifies the amount of information needed to describe the next state of \(X\) given observations of \(X^{(k)}\). In other words, it is the entropy of the time series conditioned on the \(k\)-histories. The local entropy rate \[ h_{X,i}(k) = \log_2{\frac{p(x_i^{(k)}, x_{i+1})}{p(x_i^{(k)})}} \] can be averaged to obtain the global entropy rate \[ H_X(k) = \langle h_{X,i}(k) \rangle_i = \sum_{x_i^{(k)},x_{i+1}} p(x_i^{(k)},x_{i+1}) \log_2{\frac{p(x_i^{(k)}, x_{i+1})}{p(x_i^{(k)})}}. \]

Much as with active information, the local and average entropy rates are formally obtained in the limit \[ h_{X,i} = \lim_{k\rightarrow \infty}h_{X,i}(k) \qquad \textrm{and} \qquad H_X = \lim_{k\rightarrow \infty}H_X(k), \]

but we do not yet provide this limiting functionality in the library (see the GitHub issue tracker).

See (Cover and Thomas 1991) for more details.

Interface:

Compute the average or local entropy rate with a history length k.

Examples:

Compute the Entropy Rate for one initial condition:

and local variant:

Compute the Entropy Rate for two initial conditions:

local variant:

4.9 Excess Entropy

Formally, the excess entropy is the mutual information between two adjacent, semi-infinite blocks of variables: \[ E_X = \lim_{k \rightarrow \infty}I[(x_{-k},\ldots,x_{-1}),(x_0,\ldots,x_{k-1})]. \] Because we cannot take the limit in practice, we implement the finite form: \[ E_X(k) = I[(x_{-k},\ldots,x_{-1}),(x_0,\ldots,x_{k-1})]. \]

We can think of excess entropy as a slight generalization of active information or a special case of predictive information.

See (Crutchfield and Feldman 2003) and (Feldman and Crutchfield 2003) for more details.

Interface:

Compute the average or local excess entropy from a time series with block size k.

Examples:

Compute Excess Entropy for one initial condition:

and local variant:

Compute Excess Entropy for two initial conditions:

and local variant:

4.10 Evidence of Integration

Evidence of Integration (EoI) was introduced in (Biehl, Ikegami, and Polani 2016) as a means of identifying representations of agents in dynamical systems. Given a collection of random variables, \(\mathcal{X}=\{X_1,X_2,\ldots,X_l\}\), we can form (non-trivial) partitionings of \(\mathcal{X}\): \[ \{\mathcal{Z} \subset \mathcal{P}(\mathcal{X})\setminus\{\mathcal{X},\varnothing\}\ |\ \mathcal{Z} \not= \varnothing, \bigcup_{Z \in \mathcal{Z}} Z = \mathcal{X} ~\textrm{and}~ Z_i \cap Z_j = \varnothing, i \not = j \} \]

Each partitioning \(\mathcal{Z} = \{Z_1, \ldots, Z_m\}\) consists of a collection of joint random variables built from \(\mathcal{X}\). The configurations of \(\mathcal{Z}\) are in one-to-one correspondence with the configurations of \(\mathcal{X}\). Given such a partitioning, we can ask what the pointwise mutual information is: \[ mi_\mathcal{Z}(z_1, \ldots, z_m) = \log{\frac{p(z_1, \ldots, z_m)}{p(z_1) \ldots p(z_m)}}. \]

If \(z=\{z_1,\ldots,z_m\}\) corresponds to the configuration \(x=\{x_1,\ldots,x_l\}\), we refer to the above mutual information as the “evidence of integration of \(x\) with respect to the partitioning \(\mathcal{Z}\)”.

We say that a configuration \(x\) is integrated if and only if the evidence of integration of \(x\) is positive for all partitionings of the system.

Interface:

Given a sequence of \(n\) observed states of \(l\) random variables (series), compute the evidence of integration for each partitioning of the \(l\) variables, and return the minimum and maximum evidence for each observation. If the number of variables \(l\) is large, the user can test a small subset of the partitioning schemata by providing a partitioning parts to increase confidence that the system is integrated. In this case, the function computes and returns the evidence of integration for each observation with respect to the partitioning parts.

Examples:

Compute Evidence of Integration of three time series:

Compute Evidence of Integration for the same three time series but only for the partitioning c(1, 1, 2):

4.11 Information Flow

Information flow (IF) was introduced by Ay and Polani as a measure of the “strength of a causal effect” (Ay and Polani 2008). Unlike many of the other measures in rinform, information flow was designed around interventions. Rather than allowing dynamics to progress naturally, some subset of the random variables are forced into a particular state. If no interventions are performed, then IF reduces to Conditional Mutual Information.

Practically, there is no way for rinform to know from time series whether or not any interventions have actually been performed upon the system. As such, the onus is on the user to determine whether this function is in fact computing Ay and Polani’s information flow.

In accordance with (Ay and Polani 2008), information flow is defined as \[ I_p(X \rightarrow Y | Z) = \sum_{x,y,z} p(z)p(x|\hat{z})p(y|\hat{x},\hat{z}) \log{ \frac{p(y|\hat{x},\hat{z})} {\sum_{x^\prime} p(x^\prime|\hat{z})p(y|\hat{x}^\prime,\hat{z})} } \]

All probabilities are estimated from sequences of observations — the order of those sequences is irrelevant. A future version of Inform, and thereafter rinform, will allow the user to specify intervention distributions.

Interface:

Compute the information flow from lsrc time series src to another ldst time series dst. Optionally, the user can specify lback background time series back on which to condition the probabilities.

Examples:

This example demonstrates how to use info_flow to analyze Ay and Polani’s diamond structure example, p. 28-29 in (Ay and Polani 2008). In this example, we have four interacting Boolean variables \(W\), \(X\), \(Y\) and \(Z\). The variables \(X\) and \(Y\) are each dependent on \(W\): they simply copy its state. The \(Z\) variable is the XOR of the states of \(X\) and \(Y\). The variable \(W\) is uniformly distributed.

Left to its own devices, this system might produce a sequence of observations like the following (natural dynamics):

From these we can compute the following information flows: \[ I_p(X \rightarrow Y) = 1 \\ I_p(X \rightarrow Y | W) = 0 \\ I_p(W \rightarrow Z | Y) = 0. \]

Our notation departs somewhat from that of (Ay and Polani 2008) as we denote these as information flows despite the fact that we are not strictly intervening upon the system. If the user prefers, they can think of this as “intervening on the system in a way that is indistinguishable from the natural dynamics”. In any case, we can use info_flow to compute the above values:

Now, consider that we can intervene on \(Y\) and force it to whichever state we so choose. Then we might end up with a sequence of observations as:

From these we can compute the following information flows: \[ I_p(X \rightarrow Y) = 0 \\ I_p(X \rightarrow Y | W) = 0 \\ I_p(W \rightarrow Z | Y) = 1. \]

In rinform this looks like:

4.12 Mutual Information

Mutual information (MI) is a measure of the amount of mutual dependence between at least two random variables. Locally, MI is defined as \[ i_i(X_1,\ldots,X_l) = \log_2{\frac{p(x_{1,i},\ldots,x_{l,i})}{p(x_{1,i})\ldots p(x_{l,i})}}. \] The mutual information is then just the time average of \(i_i(X_1, \ldots, X_l)\): \[ I(X_1,\ldots,X_l) = \sum_{x_{1,i},\ldots,x_{l,i}} p(x_{1,i},\ldots,x_{l,i}) \log_2{\frac{p(x_{1,i},\ldots,x_{l,i})}{p(x_{1,i})\ldots p(x_{l,i})}}. \]

See (Cover and Thomas 1991) for more details.

Interface:

Compute the mutual information or its local variant between two or more time series. Each variable can have a different base.

Examples:

Compute the Mutual Information between two variables:

and local variant:

Compute the Mutual Information between three variables:

and local variant:

4.13 Predictive Information

Formally, the predictive information is the mutual information between a finite history and a semi-infinite future: \[ P_X(k) = \lim_{l \rightarrow \infty}I[(x_{-k},\ldots,x_{-1}),(x_0,\ldots,x_{l-1})]. \] Of course, we cannot take the limit in practice, so we implement the finite form: \[ P_X(k,l) = I[(x_{-k},\ldots,x_{-1}),(x_0,\ldots,x_{l-1})]. \]

We can think of active information and excess entropy as special cases of predictive information.

See (Bialek, Nemenman, and Tishby 2001b) and (Bialek, Nemenman, and Tishby 2001a) for more details.

Interface:

Compute the average or local predictive information from a time series series with history length kpast and future length kfuture.

Examples:

Our examples will compute the predictive information between the current time step and the next two time steps of a time series, \(P_X(1, 2)\).

One initial condition:

and local variant:

Two initial conditions:

and local variant:

4.14 Relative Entropy

Relative entropy, also known as the Kullback-Leibler divergence, measures the amount of information gained in switching from a prior distribution \(q_X\) to a posterior distribution \(p_X\) over the same support: \[ D_{KL}(p||q) = \sum_{x_i} p(x_i) \log_2{\frac{p(x_i)}{q(x_i)}}. \] The local counterpart is \[ d_{KL,i}(p||q) = \log_2{\frac{p(x_i)}{q(x_i)}}. \] Note that the average in moving from the local to the non-local relative entropy is taken over the posterior distribution.

See (Kullback and Leibler 1951) and (Cover and Thomas 1991) for more information.

Interface:

Compute the average and local relative entropy between time series drawn from posterior and prior distributions, here xs and ys respectively.

Examples:

Compute the average relative entropy between two time series:

and local variant:

4.15 Separable Information

Separable information (SI) was introduced in (Lizier, Prokopenko, and Zomaya 2010) as a method for detecting information modification in distributed computations. Given a random variable \(Y\) and a collection of potentially informative sources \(V_Y\) (wherein \(Y \not\in V_Y\)), separable information quantifies the total information gained about \(Y\) from separate observations of histories of \(Y\) and “transfer contributions” from each of the sources \(X \in V_Y\). In other words, it is the sum of the Active Information of \(Y\) and the (apparent) Transfer Entropy from each source to \(Y\): \[ s_{Y,i}(k) = a_{Y,i}(k) + \sum_{X \in V_Y} t_{X \rightarrow Y, i}(k),\\ S_Y(k) = \langle s_{Y,i}(k) \rangle_i = A_Y(k) + \sum_{X \in V_Y} T_{X \rightarrow Y}(k). \]

Interface:

Compute the average or local separable information of dest given a set of sources srcs and history length k.

Examples:

Compute Separable Information for one initial condition and one source:

and local variant:

Compute Separable Information for one initial condition and multiple sources:

and local variant:

Compute Separable Information for multiple initial conditions and multiple sources:

and local variant:

4.16 Transfer Entropy

Transfer entropy (TE) was introduced in (Schreiber 2000) to quantify information transfer between an information source and a destination, conditioning out shared history effects. TE was originally formulated considering only the source and destination; however, many systems of interest have more than just those two components. As such, it may be further necessary to condition the probabilities on the states of all “background” components in the system. These two forms are sometimes referred to as apparent and complete transfer entropy, respectively (Lizier, Prokopenko, and Zomaya 2008).

Our implementation of TE allows the user to condition the probabilities on any number of background processes, within hardware limits. For the subsequent description, take \(X\) to be the source, \(Y\) the target, and \(\mathcal{W} = \{W_1, \ldots, W_l\}\) the background processes on which we’d like to condition. For example, we might take the state of two nodes in a dynamical network as the source and target, while all other nodes in the network are treated as the background. Transfer entropy is then defined in terms of a time-local variant \[ t_{X \rightarrow Y, \mathcal{W}, i}(k) = \log_2{\frac{p(y_{i+1},x_i|y^{(k)}_i,W_i)} {p(y_{i+1}|y^{(k)}_i,W_i)p(x_{i}|y^{(k)}_i,W_i)}} \]

where \(W_i = w_{1,i}, \ldots, w_{l,i}\) are the states of each of the background nodes at time step \(i\), and the probabilities are constructed empirically from the entire time series. From the local variant, the temporally global transfer entropy is defined as \[ T_{X \rightarrow Y, \mathcal{W}}(k) = \langle t_{X \rightarrow Y, \mathcal{W}, i}(k) \rangle_i = \sum_{y_{i+1},y^{(k)}_i,x_i,W_i} p(y_{i+1},y^{(k)}_i,x_i,W_i) \log_2{\frac{p(y_{i+1},x_i|y^{(k)}_i,W_i)} {p(y_{i+1}|y^{(k)}_i,W_i)p(x_{i}|y^{(k)}_i,W_i)}}. \]

Strictly speaking, the local and average transfer entropy are defined as \[ t_{X \rightarrow Y, \mathcal{W}, i} = \lim_{k\rightarrow \infty} t_{X \rightarrow Y, \mathcal{W}, i}(k) \qquad \textrm{and} \qquad T_{X \rightarrow Y, \mathcal{W}} = \lim_{k\rightarrow \infty} T_{X \rightarrow Y, \mathcal{W}}(k), \]

but we do not yet provide this limiting functionality in the library (see the GitHub issue tracker).

Interface:

Compute the average or local transfer entropy with history length k, optionally conditioning on the background processes ws.

Examples:

One initial condition, no background:

and local variant:

Two initial conditions, no background:

and local variant:

One initial condition, one background process:

and local variant:

The following example is interesting in that the two background processes together provide the same information about the destination as the source process does (\(X = W_1 \oplus W_2\)), but neither does so separately.

One initial condition, two background processes:

and local variant:

5 Utilities

Type                       Functions
Binning                    series_range, bin_series
Black-Boxing               black_box, black_box_parts
Coalescing                 coalesce
State Encoding             encode, decode
Partitioning Time Series   partitioning
Time Series to TPM         series_to_tpm

5.1 Binning Time Series

All of the currently implemented time series measures are only defined on discretely-valued time series. However, in practice continuously-valued time series are ubiquitous. There are two approaches to accommodating continuous values.

The simplest is to bin the time series, forcing the values into discrete states. This method has its downsides, namely that the binning is often a bit unphysical and it can introduce bias. What’s more, without some kind of guiding principle it can be difficult to decide exactly which binning approach to use.

The second approach attempts to infer continuous probability distributions from continuous data. This is potentially more robust, but more technically difficult. Unfortunately, rinform does not yet have an implementation of information measures on continuous distributions.

5.1.1 Series Range

Interface:

Compute the range, minimum and maximum values in a floating-point time series and return them as a list.

Examples:

Find range, minimum, and maximum values of a continuous state time series:
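
A sketch (the names of the returned list elements may differ):

    xs <- c(0.2, 0.5, -3.2, 1.8, 0.6, 2.3)
    series_range(xs)   # range 5.5, minimum -3.2, maximum 2.3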

5.1.2 Bin Series

Bin a floating-point time series following one of three different modalities:

  • Bin a floating-point time series into a finite-state time series with b uniformly sized bins, and return the size of the bins. If the size of each bin is too small, less than \(10 \epsilon\), then all entries are placed in the same bin and an error is set. (\(\epsilon\) is the double-precision machine epsilon.)
  • Bin a floating-point time series into bins of a specified size step, and return the number of bins. If the bin size is too small, less than \(10 \epsilon\), then an error is set.
  • Bin a floating-point time series into bins defined by an array of bounds, and return the bounds of bins.

Interface:

Bin a floating-point time series and return a list giving the binned sequence, the number of bins and either the bin sizes or bin bounds.

Examples:

First method with b uniformly sized bins:

Second method with specified bin size step:

Third method with specific bin bounds:
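
Hedged sketches of the three modes above, assuming the signature bin_series(series, b = NA, step = NA, bounds = NA):

    xs <- c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
    bin_series(xs, b = 2)                  # two uniformly sized bins
    bin_series(xs, step = 2.0)             # bins of size 2.0
    bin_series(xs, bounds = c(2.5, 5.0))   # bins delimited by the given bounds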

5.2 Black-Boxing Time Series

It is often useful when analyzing complex systems to black-box components of the system by aggregating their states. This process amounts to grouping the states of different components of the system and treating them as a single state on a larger state space; in doing so, the black-boxed components are treated as a single entity, without regard for their small-scale structure. As an example, suppose you have two Boolean random variables \(X\) and \(Y\), and you observe time series of each simultaneously: \[ X:\ \{0,1,1,0,1,0,0,1\}\\ Y:\ \{1,0,0,1,1,0,1,0\} \]

It may be worthwhile to consider these observations as observations of the joint variable \((X, Y)\): \[ (X,Y):\ \{(0,1),(1,0),(1,0),(0,1),(1,1),(0,0),(0,1),(1,0)\}. \]

The joint observations can then be naturally encoded, for example, as base-4 states \[ (X,Y):\ \{1,2,2,1,3,0,1,2\}. \]

We refer to this process of mapping the observations of \(X\) and \(Y\) to the encoded observations of \((X, Y)\) as black boxing. In this case, the black boxing procedure is performed in “space” (you might think of \(X\) and \(Y\) as having locations in space). We may also black box in time. The canonical example of this is considering the \(k\)-history of a random variable: \[ X:\ \{0,1,1,0,1,0,0,1\}\\ X^{(2)}:\ \{(0,1),(1,1),(1,0),(0,1),(1,0),(0,0),(0,1)\}, \] and the observations of \(X^{(2)}\) can be encoded as base-4 states: \[ X^{(2)}:\ \{1,3,2,1,2,0,1\}. \]

We provide a basic black-boxing function, black_box, that allows the user to black-box in both space and into the future and past of a collection of random variables. The black_box_parts function allows the user to black-box time series based on a partitioning scheme (useful in the implementation of integration measures such as Evidence of Integration).

5.2.1 Black-Box

Interface:

Black-box a collection of l time series, each with the same number of time steps and the same number of initial conditions but possibly different bases, into a single time series. The base of the resulting time series is given by the product of the bases of each time series in the collection. Black-boxing can be performed in time by providing history lengths r and future lengths s.

Examples:

We’ll black box two time series with no history and future:
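
A hedged sketch, assuming black_box(series, l, r, s) accepts the l series as the columns of a matrix:

    xs <- c(0, 1, 1, 0, 1, 0, 0, 1)
    ys <- c(1, 0, 0, 1, 1, 0, 1, 0)
    black_box(matrix(c(xs, ys), ncol = 2), l = 2)   # expected: 1 2 2 1 3 0 1 2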

Note that this is the first example described at the beginning of this section.

Black-box a single time series in time with history length 2:

This is the second example described at the beginning of this section.

Black-box two time series with histories and futures:

5.2.2 Black-Box Parts

Interface:

Black-box a collection of \(l\) time series, each with the same number of time steps but possibly different bases, into a number of different time series according to the partitioning scheme parts. The resulting time series and their bases are returned. The base of each resulting time series is given by the product of the bases of the time series in the corresponding partition.

Examples:

Black-box 4 time series (each of length 8) into a single time series:

This could have been done more simply using black_box, but it is shown here for illustrative purposes.

Black-box 4 time series (of length 8) into two time series using the partitioning scheme \((1,2,2,1)\). That is, combine the 1st and 4th, and the 2nd and 3rd.

Note that the two time series each have a base of 4, and the bases are returned as $b.

Finally, in this example, we compute the multivariate mutual information between 4 time series (each of length 8) for all partitionings of the system (except the very first since that only yields a single time series). That is, each partition is treated as a separate variable and the mutual information is computed between the partitioned variables.

5.3 Coalescing Time Series

The magic of information measures is that the actual values of a time series are irrelevant. For example, \(\{0, 1, 0, 1, 1 \}\) has the same entropy as \(\{2, 9, 2, 9, 9\}\) (possibly up to a rescaling). This gives us the freedom to shift around the values of a time series as long as we do not change the relative number of states.

This function thus provides a way of “compressing” a time series into as small a base as possible. Why is this useful? Many of the measures use the base of the time series to determine how much memory to allocate; the larger the base, the higher the memory usage. It also affects the overall performance as the combinatorics climb exponentially with the base.

Notice that the encoding that is used ensures that the ordering of the states stays the same, e.g., \[ \{-8 \rightarrow 0, -2 \rightarrow 1, 2 \rightarrow 2, 4 \rightarrow 3, 6 \rightarrow 4\} \].

Interface:

Coalesce a time series into as few contiguous states as possible.

Examples:

The two standard usage cases for this function are to reduce the base of a time series:

and to ensure that the states are non-negative:
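
Sketches of both cases (states are remapped to 0, 1, ... while preserving their order):

    coalesce(c(2, 9, 2, 9, 9))      # -> 0 1 0 1 1
    coalesce(c(-2, 0, 3, -2, 3))    # -> 0 1 2 0 2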

5.4 State Encoding

State encoding is a necessity when complex time series are being analyzed. For example, a k-history must be encoded as an integer in order to “observe” it using a Dist. What if you are interested in correlating the aggregate state of one group of nodes with that of another? You’d need to encode the groups’ states as integers. The following functions provide such functionality.

Attention!

As a practical matter, these utility functions should only be used as a stop-gap while a solution for your problem is implemented in the core Inform library and wrapped in rinform. “Why?” you ask? Well, these functions are about as efficient as they can be for one-off state encodings, but most of the time you are interested in encoding sequences of states. This can be done much more efficiently if you encode the entire sequence at once. You need domain-specific information to make that happen.

5.4.1 Encode

This function uses a big-endian encoding scheme. That is, the most significant bits of the encoded integer are determined by the left-most end of the unencoded state. If b is not provided (or is NA), the base is inferred from the state with a minimum value of 2.

Interface:

Encode a base-b array of integers into a single integer.

Examples:

Using the big-endian encoding scheme, the most significant bits of the encoded integer are determined by the left-most end of the unencoded state:

If b is not provided or is NA, it is inferred from the state:
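
Hedged sketches of both cases, assuming the signature encode(state, b = NA):

    encode(c(1, 0, 0), b = 2)   # 1*4 + 0*2 + 0 = 4
    encode(c(0, 1, 1), b = 3)   # 0*9 + 1*3 + 1 = 4
    encode(c(3, 0, 1))          # base inferred as 4: 3*16 + 0*4 + 1 = 49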

5.4.2 Decode

The provided encoded state is decoded using the big-endian encoding scheme. Note that the base b must be provided, but the number of digits n is optional. If n is provided then the decoded state will have exactly that many elements.

Interface:

Decode an integer into a base-b array with n digits.

Examples:

This function assumes a big-endian encoding scheme:

If n is provided then the decoded state will have exactly that many elements. However, note that if the provided n is too small to contain a full representation of the state, an error will be raised.

If n is not provided, the length of the decoded state is inferred from the encoding to be as small as possible:
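
Hedged sketches, assuming the signature decode(encoding, b, n = NA):

    decode(2, b = 2, n = 2)   # -> 1 0
    decode(6, b = 2, n = 4)   # -> 0 1 1 0
    decode(6, b = 2)          # n inferred to be as small as possible: -> 1 1 0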

5.5 Partitioning Time Series

Many analyses of complex systems consider partitioning the system into components or modules. One example of this is Evidence of Integration.

For generality, we represent a partitioning of \(N\) items into \(1 \leq M \leq N\) partitions as a sequence of integers \(p_1, \ldots, p_N\) where \(1 \leq p_i \leq M\) is the partition to which the \(i\)-th item belongs.

As an example, suppose we partitioned \(\{X_1, X_2, X_3\}\) as \(\{\{X_1\}, \{X_2,X_3\}\}\). This would be represented as \((1, 2, 2)\) because \(X_1\) belongs to the first partition and \(X_2, X_3\) belong to the second partition.

We provide a function, partitioning, to facilitate the generation of all unique partitionings of a system of \(N\) elements.

Interface:

Generate all unique partitionings of n items into non-empty subsets. The number of such partitionings is given by the Bell number \(B_n\).

Examples:

Say we wish to partition a set containing 3 items:
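
A sketch (each column of the returned matrix gives one partitioning of the 3 items):

    partitioning(3)
    # 3 rows (items) by 5 columns (partitionings); e.g., the column c(1, 2, 2)
    # places item 1 in one partition and items 2 and 3 in another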

You’ll notice that the number of partitionings, i.e., the number of columns, is equal to the Bell number \(B_3 = 5\).

We can compute the 9-th Bell number as follows:
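
    ncol(partitioning(9))   # 21147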

5.6 Time Series to TPM

Some information measures are defined on transition probability matrices (TPM) as, for example, Effective Information. For this reason, we provide a function series_to_tpm which estimates the one-time-step TPM from time series data.

Interface:

Estimate the one-time-step transition probability matrix \(A\) from a time series. The element \(A_{ji}\) is the probability of transitioning to state \(j\) in the next time step given the system is in state \(i\) (note the column-major convention).

Examples:

Estimate a TPM from one time series:
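
A hedged sketch (state labelling conventions should be checked against the package reference):

    xs <- c(0, 2, 1, 0, 1, 2, 0, 1, 2, 1)
    series_to_tpm(xs)
    # 3 x 3 matrix; the element in row j, column i is the probability of
    # transitioning from the i-th state to the j-th state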

Estimate a TPM from multiple time series, in this case seven 2-step series:

References

Ay, N., and D. Polani. 2008. “Information Flows in Causal Networks.” Advances in Complex Systems 11 (01): 17–41. https://doi.org/10.1142/S0219525908001465.

Bialek, W., I. Nemenman, and N. Tishby. 2001a. “Complexity Through Nonextensivity.” Physica A: Statistical Mechanics and Its Applications 302 (1): 89–99. https://doi.org/10.1016/S0378-4371(01)00444-7.

———. 2001b. “Predictability, Complexity, and Learning.” Neural Computation 13 (11): 2409–63. https://doi.org/10.1162/089976601753195969.

Biehl, M., T. Ikegami, and D. Polani. 2016. “Towards Information Based Spatiotemporal Patterns as a Foundation for Agent Representation in Dynamical Systems.” In Proceedings of the Artificial Life Conference 2016, 722–29. https://doi.org/10.7551/978-0-262-33936-0-ch115.

Cover, T. M., and J. A. Thomas. 1991. Elements of Information Theory. New York: John Wiley & Sons.

Crutchfield, J. P., and D. P. Feldman. 2003. “Regularities Unseen, Randomness Observed: Levels of Entropy Convergence.” Chaos: An Interdisciplinary Journal of Nonlinear Science 13 (1): 25–54. https://doi.org/10.1063/1.1530990.

Dobrushin, R. L. 1959. “A General Formulation of the Fundamental Theorem of Shannon in the Theory of Information.” Uspekhi Matematicheskikh Nauk 14 (6): 3–104.

Feldman, D. P., and J. P. Crutchfield. 2003. “Structural Information in Two-Dimensional Patterns: Entropy Convergence and Excess Entropy.” Physical Review E 67 (5): 051104. https://doi.org/10.1103/PhysRevE.67.051104.

Hoel, E. P. 2017. “When the Map Is Better Than the Territory.” Entropy 19 (5): 188. https://doi.org/10.3390/e19050188.

Hoel, E. P., L. Albantakis, and G. Tononi. 2013. “Quantifying Causal Emergence Shows That Macro Can Beat Micro.” Proceedings of the National Academy of Sciences 110 (49): 19790–5. https://doi.org/10.1073/pnas.1314922110.

Kullback, S., and R. A. Leibler. 1951. “On Information and Sufficiency.” The Annals of Mathematical Statistics 22 (1): 79–86. https://doi.org/10.1214/aoms/1177729694.

Lizier, J. T., M. Prokopenko, and A. Y. Zomaya. 2010. “Information Modification and Particle Collisions in Distributed Computation.” Chaos 20 (3): 037109. https://doi.org/10.1063/1.3486801.

———. 2012. “Local Measures of Information Storage in Complex Distributed Computation.” Information Sciences 208 (Supplement C): 39–54. https://doi.org/10.1016/j.ins.2012.04.016.

Schreiber, T. 2000. “Measuring Information Transfer.” Physical Review Letters 85 (2): 461–64. https://doi.org/10.1103/PhysRevLett.85.461.

Shannon, C. E. 1948. “A Mathematical Theory of Communication.” The Bell System Technical Journal 27 (3): 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.

Tononi, G., and O. Sporns. 2003. “Measuring Information Integration.” BMC Neuroscience 4 (1): 31. https://doi.org/10.1186/1471-2202-4-31.

Wyner, A.D. 1978. “A Definition of Conditional Mutual Information for Arbitrary Ensembles.” Information and Control 38 (1): 51–59. https://doi.org/10.1016/S0019-9958(78)90026-8.