An information-theoretic approach to higher-order Markov processes: Theory and applications

De Gregorio, Juan (supervisors: Toral, Raúl; Sánchez, David)
PhD Thesis (2025)

Accurately modeling the temporal evolution of a stochastic process, which is essential in many areas of complex-systems research, requires a thorough understanding of the correlations present in the system. This is particularly relevant for higher-order Markov chains, in which the probability of transitioning to a future state depends only on a finite number of past outcomes. However, without prior knowledge of the system's dynamics, quantifying these temporal dependencies is not straightforward.
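For reference, the defining property of a Markov chain of order m can be written as follows; this is the standard textbook definition, with generic symbols X_t and m that are not quoted from the thesis itself:

```latex
P\!\left(X_{t+1}=x \mid X_t = x_t,\, X_{t-1} = x_{t-1},\, \dots\right)
  \;=\; P\!\left(X_{t+1}=x \mid X_t = x_t,\, \dots,\, X_{t-m+1} = x_{t-m+1}\right),
```

so that the probability of the next state depends only on the m most recent outcomes.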
In this thesis, we demonstrate that information theory provides a comprehensive framework for describing correlations within a system. In particular, the block entropy, an extension of the Shannon entropy defined over blocks of consecutive outcomes of the process, and its discrete derivatives are shown to be effective tools for quantifying, step by step, the influence of previous outcomes on the evolution of the system. Within this approach, we redefine the order, or memory, of a process entirely in terms of information-theoretic measures, which allows us to develop a method to determine this memory value.
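As a sketch of the quantities involved, using standard notation that is assumed rather than quoted from the thesis, the block entropy of blocks of length n and its discrete derivative can be written as

```latex
H_n \;=\; -\sum_{x_1,\dots,x_n} P(x_1,\dots,x_n)\,\log P(x_1,\dots,x_n),
\qquad
h_n \;=\; H_{n+1} - H_n ,
```

where h_n is the entropy of the next outcome conditioned on the previous n outcomes; in this view, the memory is the smallest n beyond which h_n no longer decreases.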
Adapting the proposed method to data samples of finite length requires estimating the entropy. This leads us to introduce two new estimators designed to account for correlations in the data and to compare them with other well-known methods when applied to Markovian sequences. By combining the theoretical results that link the memory of a process to information theory, the entropy estimator that we find to have the lowest overall mean squared error on correlated sequences, and statistical methods from hypothesis testing, we develop a memory estimator that shows high accuracy and is independent of model selection.
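A minimal illustration of these quantities on finite data, assuming a naive maximum-likelihood (plug-in) entropy estimate rather than the bias-corrected estimators introduced in the thesis, might look like the following sketch; the function names and the toy order-1 chain are hypothetical:

```python
from collections import Counter
import numpy as np

def block_entropy(seq, n):
    """Naive plug-in estimate of the block entropy H_n (in nats)
    from the empirical frequencies of length-n blocks."""
    blocks = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    counts = np.array(list(Counter(blocks).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def conditional_entropies(seq, n_max):
    """Discrete derivatives h_n = H_{n+1} - H_n, i.e. the entropy of the
    next symbol conditioned on the previous n symbols, for n = 0..n_max."""
    H = [0.0] + [block_entropy(seq, n) for n in range(1, n_max + 2)]
    return [H[n + 1] - H[n] for n in range(n_max + 1)]

# Toy example: a binary order-1 Markov chain; h_n should stop
# decreasing (up to estimation noise) at n = 1.
rng = np.random.default_rng(0)
seq, x = [], 0
for _ in range(100_000):
    x = rng.choice(2, p=[0.9, 0.1] if x == 0 else [0.3, 0.7])
    seq.append(x)
print([round(h, 4) for h in conditional_entropies(seq, 4)])
```

In practice the plug-in estimate is biased for long blocks and short samples, which is precisely the regime the thesis's estimators are designed to handle.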
Subsequently, we apply these results to two real-world datasets. First, we analyze the correlations in sequences of precipitation occurrence across Spain, finding that the strength of these correlations varies seasonally, being stronger in winter than in summer, and across regions, with more pronounced temporal dependencies in northern Spain. The second application analyzes correlations within sequences of parts of speech for a large number of contemporary languages, showing that the syntactic structure of these languages is effectively captured by the probability distribution of three consecutive parts of speech. Defining a distance metric between languages based on these distributions, we identify well-known language families and groups, and we observe that languages that are geographically closer tend to have more similar syntactic structures than those located farther apart.
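The abstract does not specify which distance metric is used; as one plausible illustration, a Jensen-Shannon distance between empirical trigram distributions of part-of-speech tags could be computed as in the following sketch (the tag names and toy sequences are hypothetical):

```python
from collections import Counter
from math import log, sqrt

def trigram_distribution(pos_tags):
    """Empirical probability distribution of three consecutive
    part-of-speech tags in a sequence."""
    trigrams = Counter(zip(pos_tags, pos_tags[1:], pos_tags[2:]))
    total = sum(trigrams.values())
    return {t: c / total for t, c in trigrams.items()}

def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions
    given as {outcome: probability} dictionaries."""
    support = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in support}
    def kl(a, b):
        return sum(a[t] * log(a[t] / b[t]) for t in a if a[t] > 0)
    return sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Toy usage with two tiny tag sequences standing in for two languages.
lang_a = ["DET", "NOUN", "VERB", "DET", "NOUN", "ADJ", "NOUN", "VERB"]
lang_b = ["NOUN", "VERB", "DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"]
print(js_distance(trigram_distribution(lang_a), trigram_distribution(lang_b)))
```

Any such pairwise distance matrix can then be fed to standard clustering to recover language groupings, along the lines described above.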
Overall, through a combination of theoretical and statistical methods, this thesis develops a framework for quantifying and analyzing correlations and memory effects in stochastic processes using information theory, demonstrating the effectiveness of these methods in the analysis of real-world data.

