Persistent Topology for Peak Detection
This is a little story of how an abstract mathematical theory like persistent homology can be a practical tool for a usual task in industry.
Update: A related paper on this topic appeared at iDSC2020.
Motivation
At B+R Industrial Automation we have an open discussion culture and from time to time there are little take-away questions for people from computer science or mathematics. One such task was the following:
In order to identify the natural frequency of some mechanics attached to an ACOPOS servo drive one actuates the drive with a frequency sweep signal and records the response signal of the position. The (Fourier) spectrum of such a response signal could then look like in the picture below. The question is now how to detect the “main peaks” in this signal?
The original problem statement came from motion control, however, the underlying task carries more universality1: Given a real-valued signal resp. a time series, where are the “significant peaks”?
One of the first attempts would be to find maxima. The global maximum is not what we look for because we might be interested in this very significant local maximum at 10. However, all (or only some?) local maxima is probably also not what we look for: (i) the local maximum at 33 is quite insignificant compared to the local maximum at around 55 although the signal is higher and (ii) measured signals come with noise that cause many local maxima that are irrelevant for our analysis.
We could get rid of the noise by smoothing the signal. However, this often comes at costs: For instance the sharp peak at 10 would be quickly dampened and our impression on the “significance” of the peak would strongly altered. If possible we would always prefer to operate on the original data. But even if we would decide to smooth the signal, say, by a Gaussian smoothing kernel, what would be suitable standard deviation? Bringing bigger guns like scale space into the game appears to be inadequate to address the simple question originally raised.
Persistence
What we look for is a method to identify the “relative maxima” in a stable, noise-resilient fashion, and, moreover, quantify the significance (or dominance) of these maxima. If we look for the eight most significant peaks in the signal above then we would expect something similar to this:
What the plot above depicts are the eight most persistent 0-dimensional cycles in the sublevel set filtration of the function graph.
In simple words: Consider a water level starting above the global maximum that continuously lowers its level. Whenever the water level reaches a local maximum a new island is born. Whenever it reaches a local minimum two islands merge, and we consider the lower island to be merged into the higher island. This gives a precise notion of a lifespan of an island; it is the significance we mentioned above, or the so-called persistence in persistent homology. In this precise sense the above plot shows the eight most persistent islands/0-dimensional cycles on the original data.
The blue bars start at local maxima and grow downwards until the level where the corresponding island merges with another island. Only the island raised by the global maximum never dies; everything is eventually merged into it on the way to minus infinity. The blue bars are known as the barcode in persistent topology.
The persistence diagram is another representation of the features of persistent homology. It depicts every island as a point with birth- and death-level as the x- and y-coordinates. In our example the (0-dimensional) persistence diagram looks as follows:
An efficient algorithm
The above example is a simple case of the general problem of computing the persistent homology (over \(\mathbb{Z}_2\)) in a given dimension of a filtration of a simplicial complex. The classical approach to this is via matrix reduction (c.f. Gaussian elimination) by Edelsbrunner and Harer2 and takes \(O(n^3)\) time, where \(n\) denotes the length of the signal in the above example, or the number of simplices of the underlying simplcial complex in general. This is what my C++ library libstick or one of the several other available implementations do.
In the simple case of the sublevel set filtration of a real-valued signal, however, we can do much better, namely in the speed of sorting numbers, which is \(O(n \log n)\), and the algorithm is very simple, too:
(I have received many mails about the license of the code. You may use the code under MIT license.)
The general theme for this algorithm is to track growing islands and merge them from time to time. A union-find data structure is what we look for in general. However, since our sets are disjoint intervals on the real line, we can replace the part after sorting by signals values to a loop with a simple, constant-time body instead of otherwise time-optimal \(O(\alpha(n))\) union-find operations.
Concluding remarks
Typically one considers the superlevel set filtration in which the water level rises and death levels are higher than birth levels, which then leads to the usual definition of persistence as death level minus birth level and dots in the persistence diagrams reside above the diagonal. However, the sublevel set filtration of one function is the superlevel set filtration of the negative function.
References and further reading
This article deals with algorithms and mathematical formalisms that belong to the field of computational topology. The introduction to this field is the book by Edelsbrunner and Harer2.
Examples for the 0-th and 1-st dimensional persistence diagram on two-dimensional input are given on the demonstration page of the libstick library. The getting started of libstick describes the general work flow from defining simplicial complexes to computing persistence diagrams.
Update. A demonstration and code of peak detection in two-dimensional data is given here.