Persistent Homology and Noise

Oliver Vipond

Persistent Homology and Noise

This presentation is designed to serve multiple purposes. Through a series of interactive examples we will:

  • Introduce key concepts from persistent homology and multiparameter persistent homology.
  • Explore the significance of outliers in disrupting the output of single parameter persistent homology.
  • Present multiparameter persistence as a remedy for ameliorating the impact of noise.

In order to be accessible to the widest possible audience, occasionally we will compromise precision and detail in favour of giving a higher level understandable description.

Software

The plots in this presentation are built upon a visualization library Bokeh, and a number of software packages developed by the Topological Data Analysis community.

The slides are assembled using reveal.js. Below are some keyboard shortcuts to aid navigating the slides.

  • Spacebar or N = Advance Slide
  • P = Previous Slide
  • F = Fullscreen
  • O = Slide Overview

Introduction 1-Parameter Persistent Homology

What is persistent homology?

Algebraic topology is a field of pure mathematics which seeks to understand shapes - (topology) using numbers - (algebra). Homology originates from this field of mathematics as a way to distinguish shapes by examining the holes they contain.

Consider a billiard ball, a donut, and a soap bubble. We can distinguish the shape of these three objects by virtue of the holes they contain. A billiard ball is a solid mass containing no holes, a donut has a hole through the middle and the bubble is a thin layer of soap film which has a hole on the inside full of air.

A Billiard Ball, Donut and Bubble

At this point you might notice that the hole in the donut and the hole in the bubble have a distinct nature. We could thread a piece of string through the hole in the donut to make a donut necklace, whilst we could not make a bubble necklace without bursting the bubble. Similarly the hole in the bubble contains a volume of air which cannot escape, whilst air is free to flow through the hole in the middle of the donut.

Mathematically speaking these holes have different dimensions. The dimension of the hole in the soap bubble is said to be dimension 2 because the hole is encompassed by a 2 dimensional sphere soap film. The hole in the donut is said to be dimension 1 because the dough surrounding the hole forms a 1 dimensional circle.$^1$

A 0 dimensional hole is formed for each connected piece of our shape. The billiard ball, donut, and soap bubble each have a single zero dimensional hole because they are each in one connected piece. In contrast a sprinkle of 137 donut crumbs consists of 137 pieces and so contains 137 zero dimensional holes.

The number of holes of each dimension in a shape are detected by homology groups. We shall use $H_d$ to denote $d$-dimensional homology.

Homology is sufficient to distinguish certain shapes, for example a donut and a pretzel have a different number of 1 dimensional holes. However homology is not sufficient for distinguishing a donut from a mini-donut, both have a single 1 dimensional hole.

Persistent homology allows us to detect both the number of holes and the size of the holes in a shape. This allows us to distinguish a donut and a mini-donut. Both the donut and donut have a 1 dimensional hole, but the donut has a larger 1 dimensional hole than the mini-donut.


$^1$A $d$-dimensional hole is enclosed by a $d$-dimensional sphere.

Persistent Homology for Data

Traditional data analysis techniques are often insensitive to the shape of data. This is acutely exhibited by the Datasaurus data set. This data set consists of 13 point clouds. The mean of the x-values, the mean of the y-values, the standard deviation and the Pearson correlation coefficient of each of these point clouds all coincide to 2 decimal places. These statistics are not well suited to detect the shape of these point clouds.

Datasaurus data

We could distinguish these point clouds by examining the number and size of holes, and the number of connected pieces each point cloud contains. Using persistent homology we can extract these features and provide descriptions of data sets which are usually missed by more traditional descriptors.

Persistent Homology Examples

Single Loop

Consider 100 points arranged in a circle plotted in this first simple example.

In order to create a shape from this collection of points we will imagine gluing together points whenever they touch each other.

When plotted as small points very few of the points are glued together and so the circular shape of the data is not realised. We couldn't thread a piece of string through the apparent hole in the centre of the points and make a necklace. If we increase the size of the points nearby points begin to touch and so we glue them together. At some stage each point will be glued to its neighbours on the perimeter of the circle. At this moment which we could thread a piece of string through the hole in the centre and make a necklace.

If we continue increasing the size of the points the hole in the centre will persist. However at some stage all of our points will touch each other and so will all glued to each other in a single blob. This gluing will close the hole in the centre.

The procedure described above, explains how persistent homology detects the number and size of the holes in a point cloud. We gradually increase the size of the plotted points and record when holes are formed and filled in. This is most easily visualised with the interactive plot below.

Let us explain in detail how the 1 dimensional homology ${H_1}$ evolves in this example and how we record this information.

As stated earlier, initially when the points are too small, the points do not form a complete circle and do not enclose a 1 dimensional hole. As we increase the size of our points eventually the points grow large enough to touch each other enclose a hole in the middle. Try sliding the "Rips Parameter" slider across to the value of $0.37$. The final gap in the perimeter our circle is closed at this point. We record that a 1 dimensional hole has been formed by drawing a green bar starting at value $0.37$.

As we continue increasing the size of our points the hole in the middle of the circle begins to be filled in. We want to record when the hole is filled in by finishing our bar at the value for which the hole is filled in to denote that this hole has now disappeared. For technical computational reasons$^1$ it is easier to undershoot and end the bar slightly before the hole is completely filled in. If you continue sliding the "Rips Parameter" slider we stop the bar at the value $1.73$, shortly before the hole is filled in at Rips Parameter value $2.00$.$^2$

Similarly we may consider the 0 dimensional homology ${H_0}$, i.e. the number of connected pieces our point cloud breaks up into at each stage. Initially our points are all in separate pieces and so we start drawing many blue bars one for each point. As we increase the "Rips Parameter" slider value the points touch each other and we glue them together reducing the number of connected pieces. Each time two pieces join together one of our blue bars forms an endpoint. At value $0.27$ the final two pieces join and we are left with a single bar corresponding to the fact that our data is now all in one piece.

Together the collection of bars is known as the barcode. We can also summarise the collection of bars using landscape functions. We replace each bar with a mountain with peak in the centre of the bar at of half the height of the length of the bar.

For example the green bar corresponding to the 1 dimensional hole has length ${1.73 - 0.37} = 1.36$ and midpoint $\frac{1.73 +0.37}{2} = 1.05$, and thus gives rise to a green mountain with a peak of height $\frac{1.36}{2} = 0.68$ at the value $1.05$.

The small blue bars gives rise to a series of small blue mountains with the largest peak corresponding to the 2 pieces of the circle joining together at value $0.27$.$^3$ The midpoint and of this blue bar and the This mountain has peak at the value $0.135 = \frac{0.27 + 0}{2}$ and height $0.135 = \frac{0.27 - 0}{2}$.


$^1$We compute homology using the Rips Complex. The Rips Complex only consider pairwise gluings and is easier to compute than the Cech complex which examines higher order gluings. The tradeoff is that the Rips Complex thinks holes are filled in slightly before they are.
$^2$Note that $ 1.73 \sim \sqrt{3}$ as the Rips complex thinks the hole is filled in once the points lying at the vertices of an almost equilateral triangle are big enough to be touching each other.
$^3$We omit the peak corresponding to the infinite length bar.

Multiple Loops Different Sizes

We claimed earlier that persistent homology allows us to detect both the number of holes and the size of the holes in a shape, and that this allows us to distinguish a donut and a mini-donut. Let us consider sampling 100 points from a large circle and 100 points from a small circle, and examine the persistent homology features they give rise to which allows us to distinguish the two donuts.

Notice that the points we sampled from the small circle are more tightly packed, which means we do not have to increase the size of the points very much until these points form a 1 dimensional hole ($\sim 0.20$). A little while later ($\sim 0.33$) the large circle of points form a hole. Thus we record two green bars one for each hole. The smaller hole is filled in earlier than the large hole and gives rise to a shorter bar.

We translate these two bars to $H_1$ landscape functions. Each green bar gives rise to a mountain whose peak lies in the centre of the bar and has height half of the length of the bar. We organise the collection of mountains into a sequence of functions, one function for $k=1,2,3,...$. The first landscape $k=1$ picks out the highest point in the mountain range at each value, the second landscape $k=2$ picks out the second highest point in the range at each value. Try clicking $H_1: k=1$ in the legend to highlight the first landscape function.

We notice that our first landscape has two peaks. A small peak at $0.54$ corresponding to the small circle and a large peak at $1.03$ corresponding to the large circle. The second landscape $k=2$ has a mountain peak centred at the range of values for which we can see both bars simultaneously.$^1$ The third landscape $k=3$ has no peaks since at no point are three holes seen simultaneously.

Persistent homology is often described as a multi-scale approach, since this methodology analyses the shape of the data at all scales. Notice that at different "Rips Parameter" values our data forms different shapes. At radius $0.25$ we only see the small circle, at radius $0.60$ we see both circles clearly, and at radius $1.00$ we see only the large circle. A priori it was not clear that the radius value $\sim 0.60$ was the appropriate scale to see both feature simultaneously, we can read this from the barcode and landscape function. The multi-scale nature of persistent homology is a strength since a priori we don't need to specify a scale and can explore all scale simultaneously.


$^1$Using the Rips complex rather than the Cech complex for computational convenience means this mountain hits ground level slightly before the small circle is filled.

Multiple Loops Perturbed

You may be concerned that in our previous examples the point clouds perfectly lie on circles. It should be reassuring to learn that if we slightly perturb the locations of the points in our point cloud we only slightly perturb the resulting barcodes and landscapes. This property is known as stability.

A number of tiny holes are created as we increase the size of the points, but these holes are filled in very soon after, thus the bars and mountain peaks they give rise too are tiny. The bars and mountain peaks corresponding to the large and small circle are only slightly diminished by the perturbation.

Olympic Pointcloud

As a little exercise to check you have digested the previous slides, let us consider running persistent homology on the Olympic Rings. We will sample 150 points from each of the rings.

Olympic Rings

Before advancing to the next slide, think about how many $H_1$ bars you expect to see. How will these bars be converted into a sequence of landscapes?

Olympic Pointcloud

Notice that we have 5 large bars corresponding to the 5 rings, but also 4 smaller bars corresponding to the holes formed where the rings link. The first 4 landscapes have a small peak and a large peak, the fifth landscape just has a large peak, and the sixth landscape onwards is flat. The bars and mountain peaks are not uniform in length and height due to the variation introduced by our random sampling of the rings.

Introduction to Multiparameter Persistent Homology

What is Multiparameter Persistent Homology?

We have seen single parameter persistent homology applied to example point clouds. We plotted points at increasing size and recorded the homology at each stage. In other words we recorded the homology as we varied a single parameter - the size of the points.

Multiparameter persistent homology describes the same procedure, but we record the homology as we vary more than one parameter. Recording the homology as we vary more than one parameter is a complicated task.

There is no multiparameter QR code in analogy to the barcodes we saw earlier.

However we can use multiparameter landscape functions for multiparameter persistent homology in analogy to the one parameter case. Again, mountains indicate the presence of holes and the height of the mountains corresponds to the prominence of the corresponding hole.

Filtering by Radius and x-coordinate

In this first example we will sample 100 points from a circle of radius 0.2 and 100 points from a circle of radius 0.5. We will vary the size of the points and also filter the points by their x-coordinate.

Varying the sliders corresponding to each parameter displays the point cloud under consideration and highlights the corresponding position in the landscape. The small loop gives rise to a low mountain ridge and the large loop gives rise to mountain in the bottom right of the first landscape.

The second landscape highlights the range of values for which both circles are detected.

Filtering by Radius and Colour

Recall the Olympic Rings point cloud. We can refine our analysis of this toy point cloud using multiparameter persistent homology.

For example we may filter by the colour of the rings in our second parameter. Using the multiparameter persistence landscape we can detect the rings and their colours.

Recall the $k^\text{th}$ landscape has peaks when the point cloud has $k$ holes.

Notice the following features:

  • The mountains in our second landscape function do not appear until the black and blue rings are introduced.
  • The fourth landscape has a small peak when the black. blue and green rings are present since there are 3 holes formed by the rings and another small hole formed in the link between the black and green ring.
  • The we only see a maximum of 2 holes at large Rips parameter $\sim 1.5$. Unlinked rings are not filled in at this Rips parameter value whereas linked rings are filled in. We only ever see a maximum of 2 unlinked rings in our filtration.

Multiparameter Perturbation

We may perturb the position of the points in this point cloud or perturb the colour of the the points and examine the effect of the multiparameter landscape functions.

Perturbing both the point cloud and colours slightly perturbs the multiparameter landscape functions, in the same way a perturbation of point locations slightly perturbed one parameter landscape functions.

1 Parameter Persistent Homology and Noise

Adding Noise

We've seen that persistent homology is robust to a perturbation of a point cloud. However we are also likely to encounter another type of noise in data sets whereby rogue outlier points in random positions are introduced rather than the true data points being perturbed slightly.

Suppose once again we sample 100 points on the unit circle, but this time we accidentally introduce 25 outlier points sampled from the unit disc. In our first example without noise the unit circle gave rise to a bar of length $\sim 1.36$. The perturbation of points diminished the length of the bar to a length of $\sim 1.20$. With outlier noise our longest bar in the $H_1$ barcode is $\sim 0.48$ units long as seen in the plot above. The outlier noise has filled the hole and we can no longer detect the true size of the large circle.

Adding Noise Continued

Suppose we sample a point cloud with noise from 100 points from different size circles together with 25 outlier points within each circle. The outliers are so disruptive to 1 parameter persistent homology we can no longer identify two significant bars in the barcode or two significant peaks in the landscape functions.

We would like to find a remedy to reduce the impact of these noisy outlier points.

Multiparameter Persistent Homology and Noise

Filtering Outliers

In this section we shall present multiparameter persistence as a remedy to outlier noise.

Ideally, we want to exclude noisy points from out data set in order to detect the true shape of the data undisturbed by the outliers. If we knew which points were outliers this would be easy - we could just remove the outliers from our collection of points! In the more likely case that we don't know which points are outliers we need to come up with a rule distinguishes a true data point from an outlier.

We could dictate that if a data point is far from from all other data points then it is an outlier. That is we say a point is an outlier if it is distance $x_\text{far}$ from all other points. This method requires us to decide the threshold $x_\text{far}$, which is somewhat arbitrary, and a priori it may not be possible to come up with a good threshold. In the same way we took a multi-scale approach to plot our data points at a range of radii, we may avoid choosing a threshold $x_\text{far}$ and instead vary the threshold parameter $x_\text{far}$ across a range of values.

The distance of a point from its neighbouring points is a measure of codensity. If the distance from a point to its neighbouring points is small it is sitting in a dense region with many points, and if the distance to its neighbours is large it is lying in a sparse region.

We shall filter our points by codensity, explore our point cloud by gradually admitting more and more points, from sparser regions. In analogy to the barcode formed when we track the holes across different radius values in the earlier examples, we would like to be able to track holes as we vary both the radius and codensity threshold using a 2-dimensional barcode. For technical reasons, it is not possible to produce a 2-dimensional barcode, however the landscape functions from the 1-parameter examples generalise to multiparameter persistent homology.

We can compute multiparameter landscape functions whose peaks again correspond to holes in the data set at the corresponding values. We plot these two dimensional landscapes as images using colour to indicate the height of the landscape, from black at ground level to yellow at the mountain peaks. An example should make this idea clearer.

Single Loop with Noise Revisited

Let us explore filtering the outliers by codensity in practice. Recall our unit circle with 100 points sampled and 25 noise points sampled from the enclosed disc. Our codensity value is computed by summing the distance to the 5-nearest neighbours of each point. We colour each point by the codensity value, from purple indicating low codensity to yellow indicating high codensity.

Using the sliders you can experiment with the various codensity thresholds and radius parameters. The white crosshairs highlight the corresponding position on the multiparameter landscape functions. Hovering over the landscape functions also raises the height of the landscape and the radius and codensity values.

With the codensity threshold in the range $0.45$ to $0.85$ and radius parameter in the range $0.40$ to $1.60$ we detect the large circle. If we increase the codensity parameter further we admit outlier points which significantly reduces the maximum radius value for which we still detect a hole.

Multiple Loops Different Sizes with Noise Revisited

Recall the point cloud with a large circle and small circle both containing 25 noisy points. Let us apply our previous technique to this data. We see two mountain ranges running diagonally in our first landscape. The small mountain range with peak at radius $\sim 0.55$ and codensity threshold $\sim 0.55$ corresponds to the small circle and the large mountain range corresponds to the large circle.

We extract the two feature of this noisy point cloud where filtering solely by the radius failed in our earlier analysis.

Statistics for Multiparameter Persistent Homology

Example Methodology

We have seen that it is possible to extract features from data which describe the data's shape. If we want to compare the shape of two collections of data we would like to be able to decide whether the difference in the shape of the data is significant of coincidental.

Let us return to our analysis of baked goods. Suppose a bakery served two snack options:

  • Donut with sprinkles
  • Combo: Plain donut and a mini-donut

One way to compare these purchases would be to compare the weight of the two orders. We might hypothesise that the combined weight of the plain donut and mini-donut is more than the weight of the donut with sprinkles.

To test this hypothesis we could order 30 donuts with sprinkles and 30 combos, and weigh each order. We would then be able to compute the sample mean and sample variance of each snack option and verify our hypthesis using statistical tests.

Another way we could compare these snack options is to analyse their shape. We could quantify the shape of the two snack options using persistent homology.

Again, we could order 30 donuts with sprinkles and 30 combos. We could take a picture of our orders and convert them to point clouds.

We may notice that in converting our pictures to point clouds we introduce outlier noise. Bread crumbs in the centre of our donuts have been mistakenly registered as donut points.

To remedy the introduced noise we could use multiparameter persistent homology, filtering by codensity, in order to filter the outliers introduced.

Now for each point cloud we produce multiparameter persistence landscapes. In total we have 30 landscapes for each order. We can compare the landscapes using functional summaries of the landscapes.

One option is to examine the mass of the mountains in our landscapes. We can look at the mass of the mountains in different regions of our landscape to detect different features.

Below we plot the average landscape over 30 samples for our two snack options. In the boxplot we display the distribution of the mass of the mountains in the region specified by the range sliders:

Using the range sliders you can dictate the region over which we sum the mass of the mountains in the landscapes.

If you set the ranges as: $x_\text{range} = [0.40, 1.10]$, $y_\text{range} = [1.30, 1.80]$. Then we sum the mass of large moutain range corresponding to the hole in the donut with sprinkles and the plain donut. The resulting distributions in the boxplot do not distinguish the shape of these donuts.

If you set the ranges as: $x_\text{range} = [0.20, 0.70]$, $y_\text{range} = [0.30, 0.80]$. Then we sum up the small moutain range corresponding to the hole in the mini-donut. The resulting distributions identify that the combo order contains a donut with a small donut and the donut with sprinkles does not have a small hole.

The Bigger Picture

The Fingerprint of a Datasaurus

Many of our illustrative examples have been contrived, and there are no doubts more efficient methods for measuring the size of holes in donuts.

In practice, persistent homology is useful for quantifying features of data sets with obscure shapes that do not admit a simple description. For example, the active site of a protein, brain arteries, and cells in histology samples all form shapes that are not easy to describe and quantify.

Persistent homology and multiparameter persistence homology allow us to produce fingerprints for these more complicated shapes. We can statistically analyse these fingerprints in order to compare the shape of data in these more complicated situations.

As a toy example we plot the fingerprint of the Datasaurus point cloud. The various bars correspond to the holes formed by the mouth, skull, eye and hands. On our next slide we summarise some of the real world applications of persistent homology.

End