Automatic identification of states from time series

Thanks to ever-increasing computing resources, temporal simulations of biological macromolecules are reaching time scales that allow one to shed light on physiological and pathological processes, e.g., protein folding and the aggregation of amyloid peptides in Alzheimer’s disease. The resulting large amount of mostly high-dimensional data has to be analyzed with scalable protocols that provide meaningful, human-readable information about the system dynamics. Typically, the time trajectories enter, sample, and exit particular regions of the state space many times, such that the whole time series can be interpreted as a jump process between underlying discrete states. We have developed a scalable analysis tool, called the SAPPHIRE (states and pathways projected at high resolution) plot, that exploits recurrence within these states to identify data points that are structurally and kinetically similar and to reorder them on the basis of their degree of similarity (Blöchliger et al., 2013; 2014). Importantly, the SAPPHIRE plot provides annotations that visually emphasize the high-density regions and the transitions between them with optimal resolution. The states are then identified automatically from these annotations. Applications of the algorithm to molecular dynamics trajectories of proteins have provided a comprehensive picture of the main metastable basins and of the pathways connecting them (the figure shows a SAPPHIRE plot analysis of atomistic simulations of the protein BPTI). Further applications to time series of complex systems, ranging from river hydrology data to neuronal recordings, are currently being considered.
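To make the idea of reordering by similarity concrete, the following is a minimal sketch on a toy one-dimensional series, not the published SAPPHIRE implementation. It assumes a simple nearest-neighbor growth rule (each step adds the unvisited snapshot closest to any snapshot already ordered) and a raw count of time-adjacent pairs split by each position in the ordering as a stand-in for the kinetic annotation. The function names `progress_order` and `cut_profile`, the toy data, and the absolute-difference distance are all illustrative assumptions.

```python
def progress_order(points, start=0):
    """Order snapshots by repeatedly adding the unvisited snapshot that is
    closest to any snapshot already in the ordering (Prim-style growth).
    Assumption: snapshots are scalars compared by absolute difference."""
    n = len(points)
    # best[i]: distance from snapshot i to the nearest already-ordered snapshot
    best = [abs(points[i] - points[start]) for i in range(n)]
    visited = [False] * n
    visited[start] = True
    order = [start]
    for _ in range(n - 1):
        nxt = min((i for i in range(n) if not visited[i]), key=lambda i: best[i])
        visited[nxt] = True
        order.append(nxt)
        # Update nearest distances using the newly added snapshot
        for i in range(n):
            if not visited[i]:
                d = abs(points[i] - points[nxt])
                if d < best[i]:
                    best[i] = d
    return order


def cut_profile(order):
    """For each position in the ordering, count the time-adjacent pairs
    (t, t+1) with exactly one member among the snapshots ordered so far."""
    n = len(order)
    rank = [0] * n
    for pos, idx in enumerate(order):
        rank[idx] = pos
    cuts = []
    for pos in range(n - 1):
        c = sum(1 for t in range(n - 1)
                if (rank[t] <= pos) != (rank[t + 1] <= pos))
        cuts.append(c)
    return cuts


# Hypothetical toy series that jumps between a "low" and a "high" state
series = [0.0, 0.1, 0.2, 0.1, 5.0, 5.1, 5.2, 0.05, 0.15, 5.05]
order = progress_order(series)
cuts = cut_profile(order)
print("ordering of time indices:", order)
print("cut counts along ordering:", cuts)
```

In this toy case, all snapshots of the low state are placed before any snapshot of the high state, and the cut count at the boundary between the two groups equals the number of jumps between the states in the original time series. The annotations of the actual SAPPHIRE plot are more elaborate than this raw count; the sketch only illustrates how an ordering plus a kinetic trace can expose discrete states.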