This notebook is designed to exhibit the functionality of the Python scripts in this repository.
For those unfamiliar with persistent homology, see these introductory slides, which give an intuitive overview. The slides also explain how one can use multiparameter persistence to filter out noise.
The plots in this notebook are built upon the visualization library Bokeh and a number of software packages developed by the Topological Data Analysis community.
import numpy as np
from bokeh.io import output_notebook, show
output_notebook()
In our first simple example let us remind ourselves how we derive the one-parameter persistence landscape from the barcode of a persistence module.
We replace each bar with a mountain whose peak sits above the centre of the bar, at a height equal to half the length of the bar.
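As a minimal sketch (not part of the repository's code), the mountain for a single bar $(b, d)$ is the tent function $t \mapsto \max(0, \min(t - b, d - t))$:

```python
import numpy as np

def tent(b, d, t):
    """Height of the landscape mountain for the bar (b, d) at parameter value t."""
    return np.maximum(0.0, np.minimum(t - b, d - t))

t = np.linspace(0.0, 2.0, 201)
heights = tent(0.5, 1.5, t)
# The peak sits above the bar's midpoint 1.0, with height (1.5 - 0.5) / 2 = 0.5
```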
# The Rips_Filtration function takes as input a 2D pointcloud
# and range of radii and outputs an interactive bokeh plot of
# the filtration, barcode and persistence landscapes.
from one_parameter_plotting import Rips_Filtration
# 100 points on a radius 1 circle
points = np.load('example_pointclouds/CirclePoints.npy')
show(Rips_Filtration(points,[0.01,2]))
The first landscape $k=1$ picks out the highest point in the mountain range at each value, the second landscape $k=2$ picks out the second highest point in the range at each value. Thus for small Rips parameter values the first landscape traces the mountain corresponding to the small circle and for large Rips parameter values the first landscape traces the mountain corresponding to the large circle. The second landscape is non-zero when both circles are detected by the filtration.
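This rule can be sketched in a few lines of NumPy; the `landscape` helper below is our own illustration, not the repository's API:

```python
import numpy as np

def tent(b, d, t):
    return np.maximum(0.0, np.minimum(t - b, d - t))

def landscape(bars, k, t):
    """k-th landscape (k = 1 is the highest) of a list of (birth, death) bars."""
    tents = np.array([tent(b, d, t) for b, d in bars])
    # At each value of t, sort the tent heights in descending order and take the k-th
    return np.sort(tents, axis=0)[::-1][k - 1]

bars = [(0.0, 1.0), (0.25, 0.75)]
t = np.linspace(0.0, 1.0, 101)
lam1 = landscape(bars, 1, t)  # traces the taller mountain at each value
lam2 = landscape(bars, 2, t)  # non-zero only where both bars are detected
```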
# 100 points on a radius 1 circle and 100 points on a radius 0.5 circle
points = np.load('example_pointclouds/BigCircleSmallCircle.npy')
show(Rips_Filtration(points,[0.01,2]))
The landscape functions are stable in the sense that a small perturbation of the point cloud induces only a small perturbation of the landscape functions.
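A small numerical illustration of this stability, using the single-bar tent construction as an assumed stand-in for the full landscape: perturbing a bar's endpoints by at most $\delta$ moves its mountain by at most $\delta$ in the supremum norm.

```python
import numpy as np

def tent(b, d, t):
    return np.maximum(0.0, np.minimum(t - b, d - t))

t = np.linspace(0.0, 3.0, 3001)
delta = 0.05
original = tent(0.5, 2.5, t)
perturbed = tent(0.5 + delta, 2.5 - delta, t)  # endpoints each moved by delta
sup_dist = np.abs(original - perturbed).max()  # bounded above by delta
```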
# 100 points on a radius 1 circle and 100 points on a radius 0.5 circle
points = np.load('example_pointclouds/PeturbedBigCircleSmallCircle.npy')
show(Rips_Filtration(points,[0.01,2]))
As a little exercise to check you have digested the previous examples, let us consider running persistent homology on the Olympic Rings. We will sample 150 points from each of the rings.
Before scrolling to the next cell, think about how many $H_1$ bars you expect to see. How will these bars be converted into a sequence of landscapes?
points = np.load('example_pointclouds/OlympicPoints.npy')
show(Rips_Filtration(points,[0.01,2]))
Notice that we have 5 large bars corresponding to the 5 rings, but also 4 smaller bars corresponding to the holes formed where the rings link. The first 4 landscapes have a small peak and a large peak, the fifth landscape just has a large peak, and the sixth landscape onwards is flat. The bars and mountain peaks are not uniform in length and height due to the variation introduced by our random sampling of the rings.
We have seen single parameter persistent homology applied to example point clouds. We plotted points at increasing size and recorded the homology at each stage. In other words we recorded the homology as we varied a single parameter - the size of the points.
Multiparameter persistent homology describes the same procedure, but we record the homology as we vary more than one parameter - a considerably more complicated task.
There is no multiparameter analogue of the barcodes we saw earlier.
However we can use multiparameter landscape functions for multiparameter persistent homology in analogy to the one parameter case. Again, mountains indicate the presence of holes and the height of the mountains corresponds to the prominence of the corresponding hole.
In this first example we will sample 100 points from a circle of radius 0.2 and 100 points from a circle of radius 0.5. We will vary the size of the points and also filter the points by their x-coordinate.
Varying the sliders corresponding to each parameter displays the point cloud under consideration and highlights the corresponding position in the landscape. The small loop gives rise to a low mountain ridge and the large loop gives rise to a mountain in the bottom right of the first landscape.
The second landscape highlights the range of values for which both circles are detected.
# The function Rips_Filter_Bifiltration takes as input a 2D point cloud
# with a filter function, given as an nx3 array whose columns are the
# x-coordinate, y-coordinate and filter value, together with a range of
# radii for the Rips parameter.
from multiparameter_landscape_plotting import Rips_Filter_Bifiltration
filtered_points = np.load('example_pointclouds/SideBySideCircleFiltered.npy')
show(Rips_Filter_Bifiltration(filtered_points=filtered_points,
                              radius_range=[0, 2],
                              palette="Cividis256",
                              FilterName="x Coordinate", maxind=3, dim=1))
Recall the Olympic Rings point cloud. We can refine our analysis of this toy point cloud using multiparameter persistent homology.
For example we may filter by the colour of the rings as our second parameter. Using the multiparameter persistence landscape we can detect the rings and their colours.
Recall the $k^\text{th}$ landscape has peaks when the point cloud has $k$ holes.
Notice the following features:
filtered_points = np.load('example_pointclouds/OlympicPointsColoured.npy')
show(Rips_Filter_Bifiltration(filtered_points, [0, 2], palette="Turbo256",
                              FilterName="Colour", maxind=5, dim=1))
The multiparameter landscape functions are also stable in the sense that a small perturbation of the position and filter values of the filtered point cloud induces only a small perturbation of the multiparameter landscape functions.
filtered_points = np.load('example_pointclouds/OlympicPointsColouredPerturbed.npy')
show(Rips_Filter_Bifiltration(filtered_points, [0, 2], palette="Turbo256",
                              FilterName="Colour", maxind=5, dim=1))
We've seen that persistent homology is robust to a perturbation of a point cloud. However we are also likely to encounter another type of noise in data sets, in which rogue outlier points appear at random positions rather than the true data points being perturbed slightly.
Suppose once again we sample 100 points on the unit circle, but this time we accidentally introduce 25 outlier points sampled from the unit disc. In our first example without noise the unit circle gave rise to a bar of length $\sim 1.36$. The perturbation of points diminished the length of the bar to a length of $\sim 1.20$. With outlier noise the longest bar in the $H_1$ barcode is $\sim 0.48$ units long, as seen in the plot below. The outlier noise has filled the hole and we can no longer detect the true size of the large circle.
points = np.load('example_pointclouds/25NoisyPoints.npy')
show(Rips_Filtration(points,[0.01,2]))
Suppose we sample a noisy point cloud consisting of 100 points on each of two circles of different sizes, together with 25 outlier points within each circle. The outliers are so disruptive to one-parameter persistent homology that we can no longer identify two significant bars in the barcode or two significant peaks in the landscape functions.
We would like to find a remedy to reduce the impact of these noisy outlier points.
points = np.load('example_pointclouds/NoisyBigCircleSmallCircle.npy')
show(Rips_Filtration(points,[0.01,2]))
In this section we shall present multiparameter persistence as a remedy to outlier noise.
Ideally, we want to exclude noisy points from our data set in order to detect the true shape of the data undisturbed by the outliers. If we knew which points were outliers this would be easy - we could just remove them from our collection of points! In the more likely case that we don't know which points are outliers, we need to come up with a rule that distinguishes a true data point from an outlier.
We could dictate that a data point is an outlier if it is far from all other data points; that is, we say a point is an outlier if it is at least distance $x_\text{far}$ from every other point. This method requires us to decide the threshold $x_\text{far}$, which is somewhat arbitrary, and a priori it may not be possible to come up with a good threshold. In the same way we took a multi-scale approach and plotted our data points at a range of radii, we may avoid choosing a single threshold $x_\text{far}$ and instead vary the threshold parameter across a range of values.
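The naive fixed-threshold rule can be sketched as follows; the `flag_outliers` helper and the threshold value are our own illustrative choices, not part of the repository:

```python
import numpy as np

def flag_outliers(points, x_far):
    """Boolean mask: True where a point's nearest neighbour is more than x_far away."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore each point's zero distance to itself
    return dists.min(axis=1) > x_far

pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [5.0, 5.0]])
mask = flag_outliers(pts, x_far=0.3)  # only the isolated last point is flagged
```

Varying `x_far` across a range of values, rather than fixing it, is exactly the multi-scale idea pursued below.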
The distance of a point from its neighbouring points is a measure of codensity. If the distance from a point to its neighbouring points is small it is sitting in a dense region with many points, and if the distance to its neighbours is large it is lying in a sparse region.
We shall filter our points by codensity, exploring our point cloud by gradually admitting more and more points from sparser and sparser regions.
We can compute multiparameter landscape functions whose peaks again correspond to holes in the data set at the corresponding values. We plot these two dimensional landscapes as images using colour to indicate the height of the landscape, from black at ground level to yellow at the mountain peaks. An example should make this idea clearer.
Recall the unit circle point cloud with 100 sampled points and 25 noise points sampled from the enclosed disc. Our codensity value is computed by summing the distances to the 5 nearest neighbours of each point. We colour each point by its codensity value, from purple indicating low codensity to yellow indicating high codensity.
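A minimal sketch of this codensity function (the helper name is ours, and the repository may compute it differently):

```python
import numpy as np

def knn_codensity(points, k=5):
    """Sum of distances from each point to its k nearest neighbours."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Sort each row; column 0 is the zero distance from a point to itself
    return np.sort(dists, axis=1)[:, 1:k + 1].sum(axis=1)

rng = np.random.default_rng(0)
sample = rng.normal(size=(50, 2))
cod = knn_codensity(sample, k=5)  # one codensity value per point
```

Larger values indicate sparser surroundings, so admitting points in order of increasing codensity brings in the dense core of the data first and the outliers last.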
Using the sliders you can experiment with the various codensity thresholds and radius parameters. The white crosshairs highlight the corresponding position on the multiparameter landscape functions. Hovering over the landscape functions also displays the height of the landscape and the radius and codensity values.
With the codensity threshold in the range $0.45$ to $0.85$ and radius parameter in the range $0.40$ to $1.60$ we detect the large circle. If we increase the codensity parameter further we admit outlier points which significantly reduces the maximum radius value for which we still detect a hole.
# The function Rips_Codensity_Bifiltration takes as input a 2D point cloud
# as an nx2 array, a range of radii for the Rips parameter, and an integer
# k for the k-nearest-neighbour codensity function.
from multiparameter_landscape_plotting import Rips_Codensity_Bifiltration
points = np.load('example_pointclouds/25NoisyPoints.npy')
show(Rips_Codensity_Bifiltration(points, [0, 2],
                                 kNN=5,
                                 maxind=3,
                                 dim=1))
Recall the point cloud with a large circle and small circle both containing 25 noisy points. Let us apply our previous technique to this data. In our first landscape, we see two mountain ranges running diagonally from low-codensity high-radius to high-codensity low-radius. The small mountain range with peak at radius $\sim 0.55$ and codensity threshold $\sim 0.55$ corresponds to the small circle and the large mountain range corresponds to the large circle.
We extract the two features of this noisy point cloud, where filtering solely by the radius failed in our earlier analysis.
points = np.load('example_pointclouds/NoisyBigCircleSmallCircle.npy')
show(Rips_Codensity_Bifiltration(points, [0, 2], kNN=5, maxind=3, dim=1))
Suppose we want to compare the multiparameter persistence modules arising from two collections of data. We would like to be able to decide whether the difference in the shape of the data is significant or coincidental.
Suppose we sample 30 'Large Loop' point clouds (100 points sampled from the unit circle and 25 noise points sampled from the enclosed unit disc), and 30 'Large and Small Loop' point clouds (100 points from the unit circle and 100 points from a radius 0.5 circle, together with 25 outlier points within each circle).
Now for each point cloud we produce multiparameter persistence landscapes. We can compare the landscapes using functional summaries of the landscapes.
One option is to integrate the resulting multiparameter landscapes over different regions of the parameter space.
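One way to sketch such an integral, assuming the landscape is sampled as a 2D array of heights on a regular grid (the actual data layout in the repository may differ):

```python
import numpy as np

def integrate_landscape(values, x_vals, y_vals, x_range, y_range):
    """Approximate the integral of a gridded landscape over a rectangular region."""
    xmask = (x_vals >= x_range[0]) & (x_vals <= x_range[1])
    ymask = (y_vals >= y_range[0]) & (y_vals <= y_range[1])
    dx = x_vals[1] - x_vals[0]
    dy = y_vals[1] - y_vals[0]
    # Simple grid sum; adequate for comparing regions, not an exact quadrature
    return values[np.ix_(ymask, xmask)].sum() * dx * dy

x = np.linspace(0.0, 2.0, 21)
y = np.linspace(0.0, 2.0, 21)
flat = np.ones((21, 21))  # a constant landscape of height 1
area = integrate_landscape(flat, x, y, [0.0, 1.0], [0.0, 1.0])
```

A grid sum slightly overcounts the boundary cells, but since the same rule is applied to every sample, comparisons between the two groups are unaffected.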
Below we plot the average landscape over 30 samples for our two point cloud types. In the boxplot we display the distribution of the value of the integrals of the multiparameter landscapes over the parameter region specified by the range sliders:
# The function compare_multiparameter_landscape_samples takes as input a list of lists of
# multiparameter persistence landscapes, a range of indices of multiparameter landscapes
# to plot, and GroupLabels.
from multiparameter_landscape_plotting import compare_multiparameter_landscape_samples
one_loop_landscapes = np.load('example_pointclouds/one_loop_landscapes.npy', allow_pickle=True)
two_loop_landscapes = np.load('example_pointclouds/two_loop_landscapes.npy', allow_pickle=True)
show(compare_multiparameter_landscape_samples(Samples = [one_loop_landscapes, two_loop_landscapes],
indices = [1,3],
GroupLabels = ['Large Loop', 'Large and Small Loop']))
Using the range sliders you can dictate the region over which we integrate the multiparameter landscapes.
If you set the ranges to $x_\text{range} = [0.40, 1.10]$ and $y_\text{range} = [1.30, 1.80]$, then we integrate the multiparameter landscapes over the range of parameter values for which the large loop is detected, and thus we see no significant difference in the boxplot distributions.
If you set the ranges to $x_\text{range} = [0.20, 0.70]$ and $y_\text{range} = [0.30, 0.80]$, then we integrate the multiparameter landscapes over the parameter values for which the small loop is detected, and thus we see signal from the 'Large and Small Loop' point clouds but not the 'Large Loop' point clouds.