Introduction to GPSat

Introduction to GPSat#

The original documentation of GPSat can be accessed here. In the previous session, we’ve looked at some basics in Gaussian Processes. As discussed, we will dive into GPSat in this session. Let’s now begin with some recap about the GP basics and get familar with the GPsat Model API to get get/set parameters.

First of all, please run the following code to install GPSat.

try:
    import google.colab
    IN_COLAB = True
except:
    IN_COLAB = False

# TODO: allow for mounting of gdrive
# TODO: allow for checking out a branch

if IN_COLAB:
    
    import os
    import re

    # change to working directory
    work_dir = "/content"
    
    assert os.path.exists(work_dir), f"workspace directory: {work_dir} does not exist"
    os.chdir(work_dir)
    
    # clone repository
    !git clone https://github.com/CPOMUCL/GPSat.git

    repo_dir = os.path.join(work_dir, "GPSat")

    print(f"changing directory to: {repo_dir}")
    os.chdir(repo_dir)

if IN_COLAB:
    !pip install -r requirements.txt

if IN_COLAB:
    !pip install -e .

import numpy as np
import scipy
import matplotlib.pyplot as plt
from GPSat.models.sklearn_models import sklearnGPRModel

Consider data generated from a simple cosine function

\[ \tag{1} y = \cos(X) + \epsilon, \]

where \(X = (x_1, \ldots, x_N)\) is a set of randomly generated points within the interval \([-L, L]\), and \(\epsilon\) is a measurement error, which we take to be an i.i.d. zero-mean Gaussian noise with standard deviation \(0.05\).

Our goal is to use a Gaussian process model to filter out the noise \(\epsilon\) and recover the function \(f(x) = \cos(x)\) from the training data \((X, y)\).

# Set random seed
np.random.seed(0)

# Generate data
N = 30
L = 5
noise_std = 0.05

X_grid = np.linspace(-L, L, 100)
X = np.random.uniform(-L, L, (N,))
f = lambda x: np.cos(x)
epsilon = noise_std * np.random.randn(N)

y = f(X) + epsilon
f_truth = f(X_grid) # Ground truth

# Plot
plt.plot(X_grid, f_truth, 'k', zorder=1, label='Ground truth')
plt.scatter(X, y, color='C3', alpha=0.6, zorder=2, label='Noisy observations')
plt.legend()

Gaussian processes#

Intuitively, a zero-mean Gaussian process (GP) can be understood as a Gaussian distribution on an arbitrary collection of inputs.

More specifically, it is a random function \(f : \mathbb{R} \rightarrow \mathbb{R}\) such that for an arbitrary collection of inputs \(X = (x_1, \ldots, x_N)\), the random variable \(f(X)\) is a multivariate Gaussian

\[ f(X) \sim \mathcal{N}(0, K_{XX}), \]

for some \(N \times N\) covariance matrix \(K_{XX}\). Importantly, this covariance matrix can be computed using a kernel function \(k : \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}\) as

\[ [K_{XX}]_{ij} = k(x_i, x_j), \quad \forall i,j = 1, \ldots, N, \]

and this completely characterises the zero-mean GP. i.e. the properties of a GP are completely determined by the kernel!

Below, we set up a GP with the so-called radial basis function (RBF) kernel, given as

\[ \tag{RBF} k_{\text{RBF}}(x, x') = \sigma^2 \exp\left(-\frac{|x-x'|^2}{2\ell^2}\right), \]

using sklearnGPRModel, a GPSat GP regression model based on the scikit-learn GPR module.

gpr = sklearnGPRModel(coords=X, obs=y, kernel='RBF', verbose=False)

In the expression for the kernel (RBF) above, we see that it is controlled by two parameters \(\sigma^2\) and \(\ell\), which are referred to as the kernel variance and the lengthscale hyperparameters respectively (in machine learning lingo, we refer to parameters that define the models as hyperparameters).

Every GPSat model comes equipped with a getter/setter method for all (hyper)-parameters in the model. A list of all parameters is stored in the param_names property.

print(gpr.param_names)

We can retrieve their values using the get_*() method, where * is to be substituted with the parameter name.

ls = gpr.get_lengthscales()
kv = gpr.get_kernel_variance()

print(f"Lengthscale: {ls}, Kernel variance: {kv}")

Suppose we want to set the kernel variance to 1.5. We can achieve this using the set_*() method.

gpr.set_kernel_variance(1.5)
kv = gpr.get_kernel_variance()
print(f"New kernel variance: {kv:.1f}")

The likelihood#

In param_names above, we also saw a parameter likelihood_variance. This is not a hyperparameter of the GP kernel, but is instead a parameter of the so-called likelihood.

In general, the likelihood describes the probability of an observation \(y\) given the ground truth field \(f(X)\), i.e., the conditional distribution \(p(y | f(X))\). In our case, since the observations are assumed to only differ from the ground truth by some measurement error, the likelihood is understood as modelling precisely this measurement error.

From (1), we see that the likelihood is given by

\[ p(y | f(X)) \sim \mathcal{N}(f(X), \alpha^2 I), \]

with \(\alpha = 0.05\) and \(f(x) = \cos(x)\). Here, the parameter \(\alpha^2\) is referred to as the likelihood variance.

We can get the default value for the likelihood variance using the get_* method:

print(f"Likelihood variance: {gpr.get_likelihood_variance()}")

and set the correct value by using the set_* method.

alpha = 0.05
gpr.set_likelihood_variance(alpha**2)
print(f"New likelihood variance: {gpr.get_likelihood_variance():.4f}")

Alternatively, we could have also initialised the GPR model by specifying the likelihood_variance and kernel_variance arguments with their respective values.

# This initialises a GP model with the desired values
gpr = sklearnGPRModel(coords=X, obs=y, kernel='RBF', likelihood_variance=alpha**2, kernel_variance=1.5, verbose=False)

ls = gpr.get_lengthscales()
kv = gpr.get_kernel_variance()
lv = gpr.get_likelihood_variance()

print(f"Lengthscale: {ls},  Kernel variance: {kv:.1f},  Likelihood variance: {lv:.4f}")

Prediction#

From just these information, we can now infer what our ground truth function \(f\) should be, given the data pair \((X, y)\).

Mathematically this is achieved by the simple, yet powerful Bayes’ rule to update our belief on the function \(f\) given our data \((X, y)\). Informally, this reads:

\[ \tag{2} \underbrace{p(f \,|\, X, y)}_{\text{posterior}} \propto \underbrace{p(y \,|\, f(X))}_{\text{likelihood}} \,\, \underbrace{p(f)}_{\text{prior}}. \]

In GP regression, one can understand the GP as modelling a prior distribution on the function \(f\). Thus, the term \(p(f)\) corresponds to our GP model. The posterior distribution \(p(f | X, y)\) thus gives our prediction of the field \(f\) given the data.

In GPSat models, this is computed using the predict() method. This takes as inputs a set of \(N_*\) prediction points, which must be an array of size \((N_*, D)\), where \(D\) is the input dimension (in our case, just 1).

# Make predictions on a uniform grid
pred_dict = gpr.predict(X_grid.reshape(-1, 1))

# Extract the mean and variance from the results dictionary
f_mean = pred_dict['f*']
f_var = pred_dict['f*_var']
f_std = np.sqrt(f_var)

# Plot results
plt.plot(X_grid, f_truth, 'k', zorder=0, label='Ground truth')
plt.plot(X_grid, f_mean, color='C0', zorder=1, label='GP Prediction')
plt.fill_between(X_grid, f_mean-1.96*f_std, f_mean+1.96*f_std, color='C0', alpha=0.3)
plt.scatter(X, y, color='C3', alpha=0.6, zorder=2, label='Noisy observations')
plt.legend()

In the plot above, we have plotted the prediction in blue, where the shaded region indicates a 90% credible interval, where we believe the ground truth lies.

To assess the quality of this prediction, we can look at two metrics:

(1) the mean squared error \(\frac{1}{N_*} \sum_{n=1}^{N^*} (f_{truth}(x_n') - f_{mean}(x_n'))^2\) between the predictive mean and the ground truth, and

(2) the mean log-likelihood \(\frac{1}{N_*} \sum_{n=1}^{N^*} \log \mathcal{N}(f_{truth}(x_n') \,|\, f_{mean}(x_n'), f_{std}(x_n')^2)\) of the ground truth given the predictive mean and standard deviation.

The former only assess the quality of the mean, however the latter also assess the quality of the predictive uncertainty. For the log-likelihood loss, a higher value indicates better performance.

print(f"Mean squared error: {np.mean((f_truth - f_mean)**2):.4f}")
print(f"Mean log likelihood: {scipy.stats.norm.logpdf(f_truth, f_mean, f_std).mean():.4f}")

Training#

To further improve our predictions, we can think of finding a value for the hyperparameters \(\Theta = (\sigma^2, \ell, \alpha^2)\) that fit the data “better”. This is called the training process.

To define what a “better” model means, we can compare them using a certain metric. A perferred such metric in Bayesian modelling is the so-called marginal likelihood, defined as:

\[ \tag{3} p(y | \Theta) = \int p(y | f(X), \Theta) \,p(f(X) | \Theta) \,df(X). \]

Thus, we can find an optimal set of parameters by maximising (3) with respect to \(\Theta\). Equivalently, we can also maximise their log-transformed counterpart, which is more typically used.

In GPSat models, we can compute the log-transformed version of the metric (3) by simply calling the get_objective_function_value() method.

print(f"log marginal likelihood = {gpr.get_objective_function_value():.4f}")

Now, let’s optimise this loss function, which can be achieved in GPSat model by calling the optimise_parameters() method.

# Optimise model
opt_success = gpr.optimise_parameters()

# Print outputs
print(f"Optimise success: {opt_success}")
print("-"*30)
param_dict = gpr.get_parameters(*gpr.param_names)
print("Values of model hyperparameters after training:")
for k, v in param_dict.items():
    print(f"{k} : {v:.4f}")
print("-"*30)
print(f"log marginal likelihood (after training) = {gpr.get_objective_function_value():.4f}")

We see that after training, the values of the lengthscale and kernel variance hyperparameters have changed. In addition, the log marginal likelihood value has increased.

Note: For scikit-learn models, the likelihood variance is assumed constant and does not get optimised. If you want to optimise the likelihood variance, use e.g. GPSat.models.gpflow_models.GPflowGPRModel instead.

Now let’s see the updated predictions.

# Predict again
pred_dict = gpr.predict(X_grid[:,None])

# Extract mean, variance and standard deviation
f_mean = pred_dict['f*']
f_var = pred_dict['f*_var']
f_std = np.sqrt(f_var)

# Plot predictions
plt.plot(X_grid, f_truth, 'k', zorder=0, label='Ground truth')
plt.plot(X_grid, f_mean, color='C0', zorder=1, label='GP Prediction')
plt.fill_between(X_grid, f_mean-1.96*f_std, f_mean+1.96*f_std, color='C0', alpha=0.3)
plt.scatter(X, y, color='C3', alpha=0.6, zorder=2, label='Noisy observations')
plt.legend()

We see that the uncertainty bounds are now tighter around the ground truth after training, although the mean prediction is a little bit more off than before.

This is reflected in the metrics:

print(f"Mean squared error: {np.mean((f_truth - f_mean)**2):.4f}")
print(f"Mean log likelihood: {scipy.stats.norm.logpdf(f_truth, f_mean, f_std).mean():.4f}")

Using GPUs to accelerate training and inference#

Now, with some basics knowledge in Gaussian Processes, we will see the advantages of using GPUs to do training and inference on GP.

import numpy as np
import matplotlib.pyplot as plt
from GPSat.models.sklearn_models import sklearnGPRModel
from GPSat.models.gpflow_models import GPflowGPRModel

For the experiment, we use the same model as before but sampling more data points.

# Set random seed
np.random.seed(0)

# Generate data
N = 2500
L = 5
noise_std = 0.05

X_grid = np.linspace(-L, L, 100)
X = np.random.uniform(-L, L, (N,))
f = lambda x: np.cos(x)
epsilon = noise_std * np.random.randn(N)

y = f(X) + epsilon
f_truth = f(X_grid) # Ground truth

# Plot
plt.plot(X_grid, f_truth, 'k', zorder=1, label='Ground truth')
plt.scatter(X, y, color='C3', alpha=0.6, zorder=2, s=0.5, label='Noisy observations')
plt.legend()

As before, we model this data using sklearnGPRModel. Scikit-learn does not have GPU support so this will do all the training and prediciton on CPU, even if it detects GPUs on the machine.

# ---------------
# Training on CPU
# ---------------

# Initialise sklearn GPR model
sklearn_gpr = sklearnGPRModel(coords=X, obs=y, kernel='RBF', likelihood_variance=noise_std**2, verbose=False)

# Train model
_ = sklearn_gpr.optimise_parameters()

# Predict on test points
pred_dict_sklearn = sklearn_gpr.predict(X_grid[:,None])

Now, let’s use GPflowGPRModel to model the same data, which is based on the python package GPflow, itself a tensorflow based package for modelling with GPs. This automatically does the training and prediction on GPUs, if available.

# ---------------
# Training on GPU
# ---------------

# Initialise GPflow GPR model
gpflow_gpr = GPflowGPRModel(coords=X, obs=y, kernel='RBF', noise_variance=noise_std**2, verbose=False)

# Train model
_ = gpflow_gpr.optimise_parameters(fixed_params=['likelihood_variance'])

# Predict on test points
pred_dict_gpflow = gpflow_gpr.predict(X_grid[:,None])

Note: In GPflowGPRModel, the likelihood variance is initialised with the argument noise_variance instead of likelihood_variance as in sklearnGPRModel. Note also that since likelihood variance is a trainable parameter in GPflowGPRModel, we pass an additional argument fixed_params = ['likelihood_variance'] to the optimise_parameters() method to freeze the assigned likelhood variance value. For more details, see the API references for both models.

We see that training time is much shorter on a GPU than on CPU, which is where most of the computation takes place in the enitre GP workflow. Hence, when we have a large dataset, it is advantageous to use GPUs over CPUs.

We can also check that the results of the two predictions are near identical:

# Extract mean and variance for the sklearn prediction
f_mean_sklearn = pred_dict_sklearn['f*']
f_var_sklearn = pred_dict_sklearn['f*_var']
f_std_sklearn = np.sqrt(f_var_sklearn)

# Extract mean and variance for the gpflow prediction
f_mean_gpflow = pred_dict_gpflow['f*']
f_var_gpflow = pred_dict_gpflow['f*_var']
f_std_gpflow = np.sqrt(f_var_gpflow)

# Plot results
plt.plot(X_grid, f_truth, 'k', zorder=0, label='Ground truth')
plt.plot(X_grid, f_mean_sklearn, color='C0', zorder=1, label='sklearn Prediction')
plt.fill_between(X_grid, f_mean_sklearn-1.96*f_std_sklearn, f_mean_sklearn+1.96*f_std_sklearn, color='C0', alpha=0.6)
plt.plot(X_grid, f_mean_gpflow, color='C1', zorder=1, label='gpflow Prediction')
plt.fill_between(X_grid, f_mean_gpflow-1.96*f_std_gpflow, f_mean_gpflow+1.96*f_std_gpflow, color='C1', alpha=0.2)
plt.legend()

GPSat modelling with local GP experts: A 1-D case study (Part 1)#

The main intended use case of GPSat is to model complex looking fields from a large set of data points, a situation commonly encountered in the geosciences, namely optimal interpolation (OI). The strategy that we adopted is quite simple: to model local chunks of data using different GPs and then gluing their predictions together.

In this tutorial notebook, we will see how this method performs compared to a single “global” GP. We note that GPSat has an automated API to carry out this whole workflow, which we will see in the second part of this tutorial. However for the sake of understanding, we will hard-code this method here.

First, we generate noisy data as follows:

\[ \tag{1} y = \sin(1/X) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 0.05^2 I), \]

in the region \(X \in [0.1, 0.6]\).

# Set random seed
np.random.seed(0)

# Generate data
N = 100
noise_std = 0.05

X_grid = np.linspace(0.1, 0.6, 100)
X = np.random.uniform(0.1, 0.6, (N,))
f = lambda x: np.sin(1/x)
epsilon = noise_std * np.random.randn(N)

y = f(X) + epsilon
f_truth = f(X_grid) # Ground truth

# Plot
plt.plot(X_grid, f_truth, 'k', zorder=1, label='Ground truth')
plt.scatter(X, y, color='C3', alpha=0.6, zorder=2, label='Noisy observations')
plt.legend()

This function is tricky to model well as the variability of the curve varies depending on where you are (shorter lengthscales near 0 and longer lengthscales near 1).

Let’s first see how a standard GP fits on this data.

# Initialise sklearn GPR model
gpr = sklearnGPRModel(coords=X, obs=y, kernel='RBF', likelihood_variance=noise_std**2, verbose=False)

# Train model
_ = gpr.optimise_parameters()

# Predict on test locations
pred_dict = gpr.predict(X_grid[:,None])

# Extract mean and variance of predictions
f_mean = pred_dict['f*']
f_var = pred_dict['f*_var']
f_std = np.sqrt(f_var)

# Plot results
plt.plot(X_grid, f_truth, 'k', zorder=0, label='Ground truth')
plt.plot(X_grid, f_mean, color='C0', zorder=1, label='sklearn GPR prediction')
plt.fill_between(X_grid, f_mean-1.96*f_std, f_mean+1.96*f_std, color='C0', alpha=0.3)
plt.legend()

We see that the model fits quite well near 0.1, however as it approaches 0.6, we start to see some spurious fluctuations that does not exist in the ground truth field.

Checking the learned lengthscale,

print(f"Lengthscale: {gpr.get_lengthscales():.4f}")

it is quite short, which explains the rapid fluctuations.

Let us also check the mean squared error and the log-likelihood loss from the ground truth field for future reference.

print(f"Mean squared error: {np.mean((f_truth - f_mean)**2):.4f}")
print(f"Mean log likelihood: {scipy.stats.norm.logpdf(f_truth, f_mean, f_std).mean():.4f}")

Local GP experts#

Next, let us consider a natural idea to solve this issue by fitting different GPs (called local experts) on different regions of the domain. For simplicity, we will use two GP experts: one to model the data for smaller values of \(X\) and another to model the data for larger values of \(X\).

We assign the following data to the two GPs (call it GP1 and GP2):

GP1 gets assigned data points within the interval [0.1, 0.4] and makes predictions in the same region.
GP2 gets assigned data points within the interval [0.3, 0.6] and makes predictions in the same region.

Note that we deliberately let the two regions overlap, which will be useful later.

# Data points assigned to GP1 and GP2
Data1 = np.array([[x, y] for (x, y) in zip(X, y) if x < 0.4])
Data2 = np.array([[x, y] for (x, y) in zip(X, y) if x > 0.3])

# Prediction points assigned to GP1 and GP2
X_test_1 = X_grid[X_grid < 0.4]
X_test_2 = X_grid[X_grid > 0.3]

By convention, we will associate each region ([0.1, 0.4] and [0.3, 0.6]) by their mid-points (0.25 and 0.45 respectively), and refer to them as the local expert locations.

The distance from the local expert location to the boundary of the region where data points are assigned is referred to as the training radius. In this case, we can check that our two experts both have a training radius of 0.15.

Likewise, the distance from the local expert location to the boundary of the region where prediction points are assigned is referred to as the inference radius. In our case, we have set the inference radius to be equal to the training radius.

# Set expert locations
xpert_loc_1 = 0.25
xpert_loc_2 = 0.45

# Set training and inference radii
training_radius = inference_radius = 0.15

Now we train GP1 and GP2 in their respective regions and make predictions.

# Train and predict with GP1
gp1 = sklearnGPRModel(coords=Data1[:,0], obs=Data1[:,1], kernel='RBF', likelihood_variance=noise_std**2, verbose=False)
_ = gp1.optimise_parameters()
pred_dict_1 = gp1.predict(X_test_1[:,None])
f_mean_1 = pred_dict_1['f*']
f_var_1 = pred_dict_1['f*_var']
f_std_1 = np.sqrt(f_var_1)

# Train and predict with GP2
gp2 = sklearnGPRModel(coords=Data2[:,0], obs=Data2[:,1], kernel='RBF', likelihood_variance=noise_std**2, verbose=False)
_ = gp2.optimise_parameters()
pred_dict_2 = gp2.predict(X_test_2[:,None])
f_mean_2 = pred_dict_2['f*']
f_var_2 = pred_dict_2['f*_var']
f_std_2 = np.sqrt(f_var_2)

Below, we plot the predictions from the two GPs over-layed on top of one another.

# Plot results
plt.plot(X_grid, f_truth, 'k', zorder=0, label='Ground truth')
plt.plot(X_test_1, f_mean_1, color='C0', zorder=1, label='Local expert 1')
plt.fill_between(X_test_1, f_mean_1-1.96*f_std_1, f_mean_1+1.96*f_std_1, color='C0', alpha=0.3)
plt.plot(X_test_2, f_mean_2, color='C1', zorder=1, label='Local expert 2')
plt.fill_between(X_test_2, f_mean_2-1.96*f_std_2, f_mean_2+1.96*f_std_2, color='C1', alpha=0.3)
plt.legend()

As expected, we find that GP1 fits the data with a shorter lengthscale and GP2 fits the data with a longer lengthscale.

print(f"Lengthscale of GP1: {gp1.get_lengthscales():.4f}")
print(f"Lengthscale of GP2: {gp2.get_lengthscales():.4f}")

We can now use the glue_local_predictions_1d() method from the GPSat.postprocessing module to glue the two predictions smoothly to yield a single prediction.

This is achieved by a gating mechanism, which considers a Gaussian-weighted average of the two predictions.

First, we record our results into a pandas dataframe as follows. This dataframe should have as columns (1) the prediction locations, (2) local expert locations, and (3) any results we wish to glue such as the predicted mean and variance.

from GPSat.postprocessing import glue_local_predictions_1d

# Prediction locations for GP1 + GP2
pred_locs = list(X_test_1) + list(X_test_2)

# Expert locations for GP1 + GP2
expert_locs = [xpert_loc_1 for _ in X_test_1] + [xpert_loc_2 for _ in X_test_2]

# Predictions from GP1 + GP2
f_mean = list(f_mean_1) + list(f_mean_2)
f_var = list(f_var_1) + list(f_var_2)

# Put these information together into a dataframe
results_df = pd.DataFrame({'pred_locs': pred_locs, 'xprt_locs': expert_locs, 'f_mean': f_mean, 'f_var': f_var})

print(results_df.head())
print(" "*20 + ":")
print(" "*20 + ":")
print(results_df.tail())

We can now glue local predictions by running glue_local_predictions_1d(). This returns a dataframe containing the results of a single glued prediction.

# Glue predictions
glued_preds = glue_local_predictions_1d(preds_df = results_df,                  # The dataframe where results are stored
                                        pred_loc_col = 'pred_locs',             # The column in dataframe corresponding to the prediction locations
                                        xprt_loc_col = 'xprt_locs',             # The column in dataframe corresponding to the local expert locations
                                        vars_to_glue = ['f_mean', 'f_var'],     # The columns in dataframe corresponding to the predictions
                                        inference_radius = inference_radius)    # The inference radius (by passing a single float, it is assumed to be equal for both regions)

glued_preds.head()

Finally we plot the results of this glued prediction.

# Extract mean and variance of glued prediction
f_mean = glued_preds["f_mean"]
f_var = glued_preds["f_var"]
f_std = np.sqrt(f_var)
X_test = glued_preds['pred_locs']

# Plot results
plt.plot(X_grid, f_truth, 'k', zorder=0, label='Ground truth')
plt.plot(X_test, f_mean, color='C3', zorder=1, label='Glued predictions')
plt.fill_between(X_test, f_mean-1.96*f_std, f_mean+1.96*f_std, color='C3', alpha=0.3)
plt.legend()

The result of the glued prediction looks slightly better than our first attempt using a global GP. This improvement is also reflected in the metrics, with a slightly improved log-likelihood score:

print(f"Mean squared error: {np.mean((f_truth - f_mean)**2):.4f}")
print(f"Mean log likelihood: {scipy.stats.norm.logpdf(f_truth, f_mean, f_std).mean():.4f}")

GPSat modelling with local GP experts: A 1-D case study (Part 2)#

In the previous part of the tutorial, we implemented a local GP expert model to fit on non-stationary data. Here, we will do the same except using GPSat’s LocalExpertOI class, which automates some of the procedures involved making experiments less cumbersome.

import scipy
import os
import GPSat
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from GPSat import get_parent_path
from GPSat.postprocessing import glue_local_predictions_1d

We generate the same data as before:

# Set random seed
np.random.seed(0)

# Generate data
N = 100
noise_std = 0.05

X_grid = np.linspace(0.1, 0.6, 100)
X = np.random.uniform(0.1, 0.6, (N,))
f = lambda x: np.sin(1/x)
epsilon = noise_std * np.random.randn(N)

y = f(X) + epsilon
f_truth = f(X_grid) # Ground truth

# Plot
plt.plot(X_grid, f_truth, 'k', zorder=1, label='Ground truth')
plt.scatter(X, y, color='C3', alpha=0.6, zorder=2, label='Noisy observations')
plt.legend()

Configuration dataclasses#

We will now conduct the same experiments as in the previous tutorial using GPSat.local_experts.LocalExpertOI.

First, we break down a single experiment into the following four key components:

The local expert locations
The GP model assigned to each local expert
The training data
The points where we want to make predictions

In GPSat, we configure each of these four components with a so-called configuration dataclass. The goal is to allow sufficient modelling flexibility to accomodate various problems and datasets.

1. Local expert config#

We start by setting the configuration for the local expert locations. This can be done by assigning a dataframe containing the locations of the local experts.

from GPSat.config_dataclasses import DataConfig, ModelConfig, PredictionLocsConfig, ExpertLocsConfig

# Construct a data frame containing the two local expert locations
xpert_loc_1 = 0.25
xpert_loc_2 = 0.45
xpert_locs_df = pd.DataFrame({'x': [xpert_loc_1, xpert_loc_2]})

# Set up an expert location configuration dataclass
expert_loc_config = ExpertLocsConfig(source=xpert_locs_df)

The source argument is where we point to the expert locations. In this case, we simply used a dataframe to represent the expert locations and pointed to that. However in more advanced applications, we also have the functionality to instead point to a file where the expert locations are saved, which can be more convenient.

2. Model config#

Next, we set up the configuration for the model assigned to each local expert. Here, we will use the sklearnGPRModel, which we specify as follows:

# Set up configuration for the model
model_config = ModelConfig(oi_model="sklearnGPRModel",
                           init_params={"likelihood_variance": noise_std**2,
                                        "kernel": 'RBF',
                                        "verbose": False}
                           )

We specified the model we are using in oi_model (pre-implemented GPSat models can be referred to by strings), and in init_params, we pass any arguments used to initialise the model (expressed as a dictionary). Note that we do not need to specify arguments to set the data here (namely data, coords and obs) as this will be done automatically in the main loop.

There are also functionalities to specify constraints on parameters, re-scale the data, etc… however, we will ignore these for the sake of keeping the presentation simple.

3. Data config#

Next we set up the configuration for data. Here, we configure information such as the source of data and instructions on how to assign a subset of the data to each local expert.

First, we put our training data into a pandas dataframe.

# Write data as dataframe
data_df = pd.DataFrame({'x': X, 'y': y})

data_df.head()

For the local data selection, we want to select data points within ± the training radius from the expert locations.

In GPSat, we have a unique API to select data from simple instructions. These instructions are expressed in a dictionary with the keys "col", "comp" and "val". For example, see below for the instructions to select data within ± the inference radius of some reference point.

# Set inference radius
training_radius = 0.15

local_select_instructions = [{"col": "x", "comp": "<=", "val": training_radius},
                             {"col": "x", "comp": ">=", "val": -training_radius}]
                

The first argument "col" indicates which column in the dataframe we want to impose conditions on (in this case "x"), the "comp" arguments specifies a relation such as “greater than”, “less than”, etc… and the "val" argument specifies the value with which we want to compare our column with (in our case, the training radius).

Thus programmatically, the above list of commands will select data as follows:

>>> data_1 = data_df[ (data_df["x"] - ref_point) <= training_radius ]
>>> data_2 = data_df[ (data_df["x"] - ref_point) >= -training_radius ]
>>> local_data = union(data_1, data_2)

Here, ref_point is some reference point, which, in the main loop, will correspond to the expert locations. The command union is a pseudo-function to take the intersection of members in data_1 and data_2.

With this data selection instruction specified, we can now set the configuration for data as follows.

# Set data config
data_config = DataConfig(data_source = data_df,
                         obs_col = ["y"],
                         coords_col = ["x"],
                         local_select = local_select_instructions
                        )

The argument data_source points to the dataframe where our data is stored, obs_col specifies the column in our dataframe corresponding to the measurements, coords_col specifies the column corresponding to the input coordinates, and local_select is where we put our instructions for local data selection.

4. Prediction location config#

Finally, we configure the prediction locations. This should include information about the test locations and the local inference region, where the local experts make predictions. The inference region is simply set to be a circular region around the expert location, with radius given by the inference radius.

First, we write the prediction locations into a dataframe.

# Set up prediction locations as a dataframe
prediction_locs = X_grid
prediction_locs_df = pd.DataFrame({'x': X_grid})

prediction_locs_df.head()

We can now set the configuration for prediction locations. We take the inference radius to be slightly larger than the training radius to include predictions on the boundaries.

inference_radius = training_radius + 1e-8

pred_loc_config = PredictionLocsConfig(method = "from_dataframe",
                                       df = prediction_locs_df,
                                       max_dist = inference_radius
                                       )

Here, the method argument specifies how the prediction locations are selected. In our case this is from_dataframe and we specify the dataframe in the argument df. The max_dist argument specifies the inference radius around the expert location.

Run experiment#

We are now in shape to run our experiment. To do this, we initialise a LocalExpertOI object from the four config classes we created.

from GPSat.local_experts import LocalExpertOI

# Set up local expert experiment
locexp = LocalExpertOI(data_config = data_config,
                       model_config = model_config,
                       expert_loc_config = expert_loc_config,
                       pred_loc_config = pred_loc_config)
                       

Now, we just need to specify a path where we want to store our results and run an experiment with the run() method. The stored path should be a HDF5 file, which uses the extension ".h5".

# path to store results
store_path = get_parent_path("results", "1d_tutorial_example.h5")

# for the purposes of a simple example, if store_path exists: delete it
if os.path.exists(store_path):
    os.remove(store_path)
    
# run local expert optimal interpolation
locexp.run(store_path=store_path)

We can extract the results from the HDF5 with the local_experts.get_results_from_h5file() method.

# extract, store in dict
dfs, _ = GPSat.local_experts.get_results_from_h5file(store_path)

Check the results that are stored by accessing the keys of the dictionary dfs:

dfs.keys()

We see tables storing the model parameters ('kernel_variance', 'lengthscales', 'likelihood_variance'), the full configuration used to run the experiment stored in json format ('oi_config'), model predictions ('preds') and details of the experiment run such as run time, device name, etc… ('run_details').

Let’s check the 'preds' table storing the model predictions.

dfs['preds'].head()

As in the previous tutorial, we can glue overlapping predictions from different experts by running the glue_local_predictions_1d() method.

glued_preds = glue_local_predictions_1d(preds_df = dfs['preds'],
                                        pred_loc_col = 'pred_loc_x',
                                        xprt_loc_col = 'x',
                                        vars_to_glue = ['f*', 'f*_var'],
                                        inference_radius = inference_radius)
glued_preds.head()

We plot the results below

# Extract glued mean and variance predictions
f_mean = glued_preds['f*']
f_var = glued_preds['f*_var']
f_std = np.sqrt(f_var)
X_test = glued_preds['pred_loc_x']

# Plot results
plt.plot(X_grid, f_truth, 'k', zorder=0, label='Ground truth')
plt.plot(X_test, f_mean, color='C3', zorder=1, label='Glued predictions (2 experts)')
plt.fill_between(X_test, f_mean-1.96*f_std, f_mean+1.96*f_std, color='C3', alpha=0.3)

xvals = [0.25, 0.45]
yvals = [-1.2, -1.]
plt.errorbar(xvals, yvals, xerr=0.15, fmt='o', elinewidth=2, barsabove=True, capsize=5, alpha=0.5, label='Local expert locations')
for (x, y) in zip(xvals, yvals):
    plt.vlines(x, -1.4, y, linestyles='dashed')
ax = plt.gca()
ax.set_ylim([-1.4, 1.1])

plt.legend()

In the above, we have also illustrated the local expert locations (blue circle) and the inference regions (blue horizontal bars).

Below, we assess the performance of the model.

print(f"Mean squared error: {np.mean((f_truth - f_mean)**2):.4f}")
print(f"Mean log likelihood: {scipy.stats.norm.logpdf(f_truth, f_mean, f_std).mean():.4f}")

Using more local experts#

Finally, let’s see what happens when we double the number of local experts. Below, we set up the configurations for an experiment using the expert locations at x = [0.2, 0.3, 0.4, 0.5] and training radius = 0.1.

# Set new expert locations
xprt_locs = [0.2, 0.3, 0.4, 0.5]

# Set training and inference radii
training_radius = 0.1
inference_radius = training_radius + 1e-8

# Set up configs
expert_loc_config = ExpertLocsConfig(source=pd.DataFrame({'x': xprt_locs}))

model_config = ModelConfig(oi_model="sklearnGPRModel",
                           init_params={"likelihood_variance": noise_std**2,
                                        "kernel": 'RBF',
                                        "verbose": False}
                           )

data_config = DataConfig(data_source=data_df,
                         obs_col=["y"],
                         coords_col=["x"],
                         local_select=[{"col": "x", "comp": "<=", "val": training_radius},
                                       {"col": "x", "comp": ">=", "val": -training_radius}]
                        )

pred_loc_config = PredictionLocsConfig(method="from_dataframe",
                                       df=prediction_locs_df,
                                       max_dist=inference_radius
                                       )

We run this experiment below using LocalExpertOI.

# Set up local expert experiment
locexp = LocalExpertOI(data_config = data_config,
                       model_config = model_config,
                       expert_loc_config = expert_loc_config,
                       pred_loc_config = pred_loc_config)

# path to store results
store_path = get_parent_path("results", "1d_tutorial_example.h5")

# for the purposes of a simple example, if store_path exists: delete it
if os.path.exists(store_path):
    os.remove(store_path)
    
# run local expert optimal interpolation
locexp.run(store_path=store_path, store_every=1)

Plot results:

# extract, store in dict
dfs, _ = GPSat.local_experts.get_results_from_h5file(store_path)

glued_preds = glue_local_predictions_1d(preds_df = dfs['preds'],
                                        pred_loc_col = 'pred_loc_x',
                                        xprt_loc_col = 'x',
                                        vars_to_glue = ['f*', 'f*_var'],
                                        inference_radius = inference_radius)

# Extract glued mean and variance predictions
f_mean = glued_preds['f*']
f_var = glued_preds['f*_var']
f_std = np.sqrt(f_var)
X_test = glued_preds['pred_loc_x']

# Plot results
plt.plot(X_grid, f_truth, 'k', zorder=0, label='Ground truth')
plt.plot(X_test, f_mean, color='C4', zorder=1, label='Glued predictions (4 experts)')
plt.fill_between(X_test, f_mean-1.96*f_std, f_mean+1.96*f_std, color='C4', alpha=0.3)

xvals = [0.2, 0.3, 0.4, 0.5]
yvals = [-1.3, -1.2, -1.1, -1.]
plt.errorbar(xvals, yvals, xerr=0.1, fmt='o', elinewidth=2, barsabove=True, capsize=5, alpha=0.5, label='Local expert locations')
for (x, y) in zip(xvals, yvals):
    plt.vlines(x, -1.4, y, linestyles='dashed')
ax = plt.gca()
ax.set_ylim([-1.4, 1.1])

plt.legend()

We see that the results look much better and this is also reflected in the metrics:

print(f"Mean squared error: {np.mean((f_truth - f_mean)**2):.4f}")
print(f"Mean log likelihood: {scipy.stats.norm.logpdf(f_truth, f_mean, f_std).mean():.4f}")

Note: To achieve the best performance using local experts model, each local experts should have sufficiently many data points to prevent overfitting on a particular region. However, if this happens, we can prevent this by hyperparameter smoothing.

Note: In GPSat, we have not yet considered learning the optimal distribution of expert locations and the corresponding inference/training radii that best fit the data. We typically assume the expert locations to be distributed on an even grid and use the same inference/training at every expert locations. However it might be interesting in the future to consider the learning of such hyperparameters to further improve performance.

2D interpolation of ABC satellite data#

Now, we look at a more practical example to generate gridded predictions of the sea-ice freeboard from satellite data using GPSat. For the satellite data, we use a sample of the Sentinel 3A and 3B data as well as CryoSat-2 data (referred to as A, B and C respectively).

import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from GPSat import get_data_path, get_parent_path
from GPSat.dataprepper import DataPrep
from GPSat.utils import WGS84toEASE2_New, EASE2toWGS84_New, cprint, grid_2d_flatten, get_weighted_values
from GPSat.local_experts import LocalExpertOI, get_results_from_h5file
from GPSat.plot_utils import plot_pcolormesh, get_projection, plot_pcolormesh_from_results_data
from GPSat.config_dataclasses import DataConfig, ModelConfig, PredictionLocsConfig, ExpertLocsConfig
from GPSat.postprocessing import glue_local_predictions_2d

We read in the raw ABC satellite data saved in this repo for demo purposes. We combine the data sources into a single dataframe.

# read in all the *_RAW.csv files in data/example
# - get files to read
raw_files = [get_data_path("example", i)
             for i in os.listdir(get_data_path("example")) if re.search("_RAW\.csv$", i)]

# read in, add source col
tmp = []
for file in raw_files:
    source = re.sub("_RAW\.csv$", "", os.path.basename(file))
    sat_data = pd.read_csv(file)
    sat_data['source'] = source
    tmp.append(sat_data)
df = pd.concat(tmp)

df.head()

Here, the variable z signifies the freeboard measurement that we wish to interpolate. For convenience, we will convert the lon, lat variables into x, y EASE(2) grid coordinates and the datetime column to t UTC days.

# convert lon, lat, datetime to x, y, t - to be used as the coordinate space
# - x,y are in meters, t in days
df['x'], df['y'] = WGS84toEASE2_New(lon=df['lon'], lat=df['lat'], lat_0=90, lon_0=0)
df['t'] = df['datetime'].values.astype("datetime64[D]").astype(float)

df.head()

We bin the raw data to a 50km grid to reduce the total data size.

# bin by date, source to a 50x50km grid
# - returns a DataSet
bin_ds = DataPrep.bin_data_by(df=df.loc[(df['z'] > -0.35) & (df['z'] < 0.65)],
                              by_cols=['t', 'source'],
                              val_col='z',
                              x_col='x',
                              y_col='y',
                              grid_res=50_000,
                              x_range=[-4500000.0, 4500000.0],
                              y_range=[-4500000.0, 4500000.0])

# convert bin data to DataFrame
# - removing all the nans that would be added at grid locations away from data
bin_df = bin_ds.to_dataframe().dropna().reset_index()

bin_df.head()

Below we plot the binned data:

# this will plot all observations, some on top of each other
bin_df['lon'], bin_df['lat'] = EASE2toWGS84_New(bin_df['x'], bin_df['y'])

fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(1, 1, 1, projection=get_projection('north'))

plot_pcolormesh(ax=ax,
                lon=bin_df['lon'],
                lat=bin_df['lat'],
                plot_data=bin_df['z'],
                title="example: binned obs",
                scatter=True,
                s=20,
                fig=fig,
                extent=[-180, 180, 60, 90])

plt.tight_layout()
plt.show()

Setting up configurations#

Now, we set up the configuration dataclasses for interpolating this data using local GP experts. For the expert locations, we use points on an evenly spaced 200km grid. For this tutorial, we will only consider interpolating on a small square region around the pole.

# - spaced every 200km for some x,y range
xy_grid = grid_2d_flatten(x_range=[-500000.0, 500000.0],
                          y_range=[-500000.0, 500000.0],
                          step_size=200_000)

# store in dataframe
eloc = pd.DataFrame(xy_grid, columns=['x', 'y'])

# add a time coordinate
eloc['t'] = np.floor(df['t'].mean())

print("Local expert locations:")
eloc.head()

We plot this below.

# plot expert locations
eloc['lon'], eloc['lat'] = EASE2toWGS84_New(eloc['x'], eloc['y'])


fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(1, 1, 1, projection=get_projection('north'))

plot_pcolormesh(ax=ax,
                lon=eloc['lon'],
                lat=eloc['lat'],
                plot_data=eloc['t'],
                title="expert locations",
                scatter=True,
                s=20,
                extent=[-180, 180, 60, 90])

plt.tight_layout()
plt.show()

For the prediction locations, we use points on a finer 5km grid.

# - spaced every 5km
xy_grid = grid_2d_flatten(x_range=[-500000.0, 500000.0],
                          y_range=[-500000.0, 500000.0],
                          step_size=5_000)

# store in dataframe
# NOTE: the missing 't' coordinate will be determine by the expert location
# - alternatively the prediction location can be specified
ploc = pd.DataFrame(xy_grid, columns=['x', 'y'])

# Add lon-lat measurements
ploc['lon'], ploc['lat'] = EASE2toWGS84_New(ploc['x'], ploc['y'])

print("Prediction locations:")
ploc.head()

Again we plot this below.

# plot prediction locations
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(1, 1, 1, projection=get_projection('north'))

plot_pcolormesh(ax=ax,
                lon=ploc['lon'],
                lat=ploc['lat'],
                plot_data=np.full(len(ploc), 1.0), #np.arange(len(ploc)),
                title="prediction locations",
                scatter=True,
                s=0.1,
                extent=[-180, 180, 60, 90])

plt.tight_layout()
plt.show()

We now set up configurations for Local Expert OI

# Set training and inference radius
training_radius = 300_000   # 300km
inference_radius = 200_000  # 200km

# Local expert locations config
local_expert = ExpertLocsConfig(source = eloc)

# Model config
model = ModelConfig(oi_model = "GPflowGPRModel", # Use GPflow GPR model
                    init_params = {
                        # normalise xy coordinates by 50km
                        "coords_scale": [50_000, 50_000, 1]
                        },
                    constraints = {
                        # set bounds on the lengthscale hyperparameters
                        "lengthscales": {
                            "low": [1e-08, 1e-08, 1e-08],
                            "high": [600_000, 600_000, 9]
                        }
                        }
                    )

# Data config
data = DataConfig(data_source = bin_df,
                  obs_col = "z",
                  coords_col = ["x", "y", "t"],
                  local_select = [
                    # Select data within 300km and ± 4 days of the expert location
                    {"col": "t", "comp": "<=", "val": 4},
                    {"col": "t", "comp": ">=", "val": -4},
                    {"col": ["x", "y"], "comp": "<", "val": training_radius}
                  ]
                )

# Prediction locs config
pred_loc = PredictionLocsConfig(method = "from_dataframe",
                                df = ploc,
                                max_dist = inference_radius)

Set up Local Expert OI object from these configs.

locexp = LocalExpertOI(expert_loc_config = local_expert,
                       data_config = data,
                       model_config = model,
                       pred_loc_config = pred_loc)

Specify a file to store results and run the experiment.

# path to store results
store_path = get_parent_path("results", "inline_example.h5")

# for the purposes of a simple example, if store_path exists: delete it
if os.path.exists(store_path):
    cprint(f"removing: {store_path}")
    os.remove(store_path)

# run optimal interpolation
locexp.run(store_path=store_path)

Extract the results from the store path:

# extract, store in dict
dfs, _ = get_results_from_h5file(store_path)

print(f"tables in results file: {list(dfs.keys())}")

preds_data = dfs["preds"]

preds_data.head()

As before there are going to be overlapping predictions coming from different local experts hence we glue them together using the glue_local_predictions_2d() method.

# multiple local experts may make predictions at the same prediction location (pred_loc).
# - for each prediction at a given location, take we weighted combination
# - weights being a function of the distance to each local expert that made a prediction at a given location.

plt_data = glue_local_predictions_2d(preds_df=preds_data,
                                     pred_loc_cols=["pred_loc_x", "pred_loc_y"],
                                     xprt_loc_cols=["x", "y"],
                                     vars_to_glue=["f*", "f*_var"],
                                     inference_radius=inference_radius)

plt_data.head()

We then plot this result:

# add convert x,y to lon,lat
plt_data['lon'], plt_data['lat'] = EASE2toWGS84_New(plt_data['pred_loc_x'], plt_data['pred_loc_y'])


fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(1, 1, 1, projection=get_projection('north'))
plot_pcolormesh_from_results_data(ax=ax,
                                  dfs={"preds": plt_data},
                                  table='preds',
                                  val_col="f*",
                                  scatter=False,
                                  x_col='pred_loc_x',
                                  y_col='pred_loc_y',
                                  fig=fig,
                                  plot_kwargs={"title": "f*: predictions"})
plt.tight_layout()
plt.show()