Application of Regression Techniques in Satellite Imagery Analysis#

In the first session of Week 5, we explored an array of regression techniques, including polynomial regression, neural networks, and Gaussian processes, each offering a different perspective and methodology for modelling complex relationships within data. This week, we pivot to the practical application of these techniques to a challenging yet highly relevant problem in satellite imagery analysis: predicting sea-ice concentration and the fraction of leads/melt ponds. Our dataset comprises 21 spectral bands from satellite imagery, each with over 5000 data points, which we regress onto a scalar target per observation; depending on the aim, that target may represent sea-ice concentration or the lead/melt-pond fraction. In the previous notebook, we prepared exactly such a dataset, and here we apply the regression techniques to it.

from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Data Preprocessing#

Let’s recall some key phases of our machine learning project cycle:

  • Data Collection: Data is the cornerstone of any ML project. This stage involves gathering necessary data relevant to our problem. The quality, quantity, and variety of data can significantly influence the model’s performance. For example, collecting satellite images like those from OLCI represents a common data collection process, with much of the raw data being publicly available online for download.

  • Data Preprocessing: Raw data often requires cleaning and formatting before use. This step includes converting raw data into a format interpretable by ML models, handling missing values, normalizing data, and feature engineering. Previously, we introduced a method for creating a machine learning dataset using IRIS.

In the previous notebook, we completed the data collection phase of our cycle. Now we move on to data preprocessing. The primary task here is to split the data into training, validation, and testing sets, which allows us to evaluate the model’s performance after training. The code below performs the train/test split; a sketch of holding out an additional validation set follows it.

import numpy as np
from sklearn.model_selection import train_test_split

dataset_path = '/content/drive/MyDrive/GEOL0069/2324/Week_6/2425_test/s3_ML_dataset.npz' # change it to the directory where you saved the dataset from the last notebook

data = np.load(dataset_path)
input_features = data['X']
target_variables = data['y']

X_train, X_test, y_train, y_test = train_test_split(
    input_features, target_variables, test_size=0.2, random_state=42)

print("Training features shape:", X_train.shape)
print("Testing features shape:", X_test.shape)
print("Training targets shape:", y_train.shape)
print("Testing targets shape:", y_test.shape)
Training features shape: (11556, 21)
Testing features shape: (2889, 21)
Training targets shape: (11556,)
Testing targets shape: (2889,)
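The split above holds out 20% of the samples for testing. If you also want a separate validation set, e.g. for choosing the polynomial degree or the network size before touching the test set, one option is to split the training portion again. A minimal sketch, assuming the arrays created above:

# Carve a validation set out of the training portion (25% of it here),
# leaving the test set untouched for final evaluation.
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

print("Training subset shape:", X_train_sub.shape)
print("Validation features shape:", X_val.shape)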

Polynomial Regression [Draper and Smith, 1998]#

Recall Polynomial Regression#

Polynomial regression is a form of regression analysis in which the relationship between the independent variable \(x\) and the dependent variable \(y\) is modelled as an \(n\)th degree polynomial. Polynomial regression fits a nonlinear relationship between the value of \(x\) and the corresponding conditional mean of \(y\), denoted \(E(y|x)\). The code below shows how we can apply it to our data.
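Before applying it to all 21 bands, it helps to see what the degree-2 feature expansion actually produces. A minimal sketch on a toy two-feature input (the values are purely illustrative):

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

toy = np.array([[2.0, 3.0]])  # one sample with two features, x1 and x2
expanded = PolynomialFeatures(degree=2).fit_transform(toy)

# Columns are: 1, x1, x2, x1^2, x1*x2, x2^2
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]

With 21 input bands the same expansion yields 253 columns (1 bias term, 21 linear terms, 21 squares, and 210 pairwise products), and it is on these expanded features that the linear model below is fitted.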

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

# Expand the 21 bands into degree-2 polynomial features, then fit a linear model on them
polynomial_features = PolynomialFeatures(degree=2)
X_poly_train = polynomial_features.fit_transform(X_train)

model_poly = LinearRegression()
model_poly.fit(X_poly_train, y_train)

# Apply the same feature expansion to the test set and evaluate
X_poly_test = polynomial_features.transform(X_test)

y_pred_poly = model_poly.predict(X_poly_test)
mse = mean_squared_error(y_test, y_pred_poly)
print(f"The Mean Squared Error (MSE) on the test set is: {mse}")

sample_idx = np.random.choice(np.arange(len(y_test)), size=50, replace=False)
plt.scatter(X_test[sample_idx, 0], y_test[sample_idx], color='black', label='Actual')
plt.scatter(X_test[sample_idx, 0], y_pred_poly[sample_idx], color='blue', label='Predicted', alpha=0.5)
plt.title('Polynomial Regression with Degree 2 - Test Set Prediction')
plt.xlabel('First Feature')
plt.ylabel('y')
plt.legend()
plt.show()
The Mean Squared Error (MSE) on the test set is: 0.0031374039199429933
_images/56ffc678200824f0a700f963a2a57498a8011cad7d3728ecaeba461653724abf.png
y_test
array([0.92356688, 0.8701482 , 0.97062987, ..., 0.80819498, 0.8700885 ,
       0.88825825])
y_pred_poly
array([0.9067383 , 0.7709961 , 0.8955078 , ..., 0.87353516, 0.87646484,
       0.9160156 ], dtype=float32)
import numpy as np

# Assuming y_test and y_pred_poly are NumPy arrays
# Compute Mean Squared Error
mse = np.mean((y_test - y_pred_poly) ** 2)
print("Mean Squared Error:", mse)
Mean Squared Error: 0.0031374039199429933

Neural Networks [Goodfellow et al., 2016]#

Recall Important Components of Neural Networks#

  1. Layers: Composed of neurons, layers are the fundamental units of neural networks. A fully connected network consists of input, hidden, and output layers.

  2. Neurons: Each neuron in a layer is connected to all neurons in the previous and next layers, processing the input data and passing on its output.

  3. Weights and Biases: These parameters are adjusted during training to minimize the network’s error in predicting the target variable (see the parameter-count sketch after this list).

  4. Activation Functions: Functions like ReLU or Sigmoid introduce non-linearities, allowing the network to model complex relationships.
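For the fully connected network built below (21 inputs → 256 → 256 → 1), the number of trainable parameters follows directly from the weights-plus-biases rule in point 3. A minimal sketch of that arithmetic, which matches the model_nn.summary() output further down:

# Dense layer parameters = inputs * units (weights) + units (biases)
layer_1 = 21 * 256 + 256    # 5,632
layer_2 = 256 * 256 + 256   # 65,792
layer_3 = 256 * 1 + 1       # 257
print(layer_1 + layer_2 + layer_3)  # 71,681 trainable parameters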

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import mean_squared_error

# Two hidden layers of 256 ReLU units, with a single linear output for regression
model_nn = Sequential([
    Dense(256, activation='relu', input_shape=(21,)),
    Dense(256, activation='relu'),
    Dense(1)
])

model_nn.compile(optimizer='adam', loss='mean_squared_error')

# Train for 10 epochs on the training set
model_nn.fit(X_train, y_train, epochs=10)

y_pred = model_nn.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

print(f"The Mean Squared Error (MSE) on the test set is: {mse}")

model_nn.summary()

y_pred_nn = y_pred.flatten()

sample_idx = np.random.choice(np.arange(len(y_test)), size=50, replace=False)

plt.scatter(X_test[sample_idx, 0], y_test[sample_idx], color='black', label='Actual')
plt.scatter(X_test[sample_idx, 0], y_pred_nn[sample_idx], color='blue', label='Predicted', alpha=0.5)
plt.title('Neural Network Regression - Test Set Prediction')
plt.xlabel('First Feature')
plt.ylabel('Target')
plt.legend()
plt.show()
/usr/local/lib/python3.11/dist-packages/keras/src/layers/core/dense.py:87: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(activity_regularizer=activity_regularizer, **kwargs)
Epoch 1/10
362/362 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - loss: 0.0291
Epoch 2/10
362/362 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.0064
Epoch 3/10
362/362 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.0052
Epoch 4/10
362/362 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 0.0048
Epoch 5/10
362/362 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 0.0044
Epoch 6/10
362/362 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 0.0043
Epoch 7/10
362/362 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.0046
Epoch 8/10
362/362 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.0041
Epoch 9/10
362/362 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.0037
Epoch 10/10
362/362 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.0041
91/91 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
The Mean Squared Error (MSE) on the test set is: 0.003859773428732748
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense (Dense)                        │ (None, 256)                 │           5,632 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_1 (Dense)                      │ (None, 256)                 │          65,792 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_2 (Dense)                      │ (None, 1)                   │             257 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 215,045 (840.02 KB)
 Trainable params: 71,681 (280.00 KB)
 Non-trainable params: 0 (0.00 B)
 Optimizer params: 143,364 (560.02 KB)
_images/909195fa03c82ca84651dfcde424af407224258854412ffc9ed53138f6c375e9.png

Gaussian Processes [Bishop and Nasrabadi, 2006]#

Recall Gaussian Processes#

A Gaussian Process (GP) is essentially an advanced form of a Gaussian (or normal) distribution, but instead of being over simple variables, it is over functions. Imagine a GP as a method to predict or estimate a function based on known data points. Note that we use a sparse GP here: exact GP regression scales cubically with the number of training points, and with over ten thousand 21-band training samples an approximation based on a small set of inducing points keeps the computation tractable.
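The kernel we use below is GPy’s RBF (squared-exponential) kernel, which encodes the assumption that spectrally similar pixels should have similar target values. A minimal NumPy sketch of that kernel, with illustrative lengthscale and variance values (GPy learns these during optimisation):

import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

# A sparse GP only evaluates kernels against a small set of M inducing points,
# reducing the cost from O(N^3) to roughly O(N * M^2).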

pip install GPy
Collecting GPy
  Downloading GPy-1.13.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Requirement already satisfied: numpy<2.0.0,>=1.7 in /usr/local/lib/python3.11/dist-packages (from GPy) (1.26.4)
Requirement already satisfied: six in /usr/local/lib/python3.11/dist-packages (from GPy) (1.17.0)
Requirement already satisfied: cython>=0.29 in /usr/local/lib/python3.11/dist-packages (from GPy) (3.0.11)
Collecting paramz>=0.9.6 (from GPy)
  Downloading paramz-0.9.6-py3-none-any.whl (103 kB)
Collecting scipy<=1.12.0,>=1.3.0 (from GPy)
  Downloading scipy-1.12.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.4 MB)
Requirement already satisfied: decorator>=4.0.10 in /usr/local/lib/python3.11/dist-packages (from paramz>=0.9.6->GPy) (4.4.2)
Installing collected packages: scipy, paramz, GPy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.13.1
    Uninstalling scipy-1.13.1:
      Successfully uninstalled scipy-1.13.1
Successfully installed GPy-1.13.2 paramz-0.9.6 scipy-1.12.0
import GPy
from sklearn.metrics import mean_squared_error

# RBF kernel over the 21 spectral bands; 100 inducing points for the sparse approximation
kernel = GPy.kern.RBF(input_dim=21)
num_inducing = 100

gp = GPy.models.SparseGPRegression(X_train, y_train.reshape(-1, 1), kernel, num_inducing=num_inducing)

gp.optimize(messages=True)

# Predict on the test set; GPy returns the predictive mean and variance
y_pred_gp, variance = gp.predict(X_test)
y_pred_gp = y_pred_gp.flatten()
sigma = np.sqrt(variance).flatten()

mse = mean_squared_error(y_test, y_pred_gp)
print(f"The Mean Squared Error (MSE) on the test set is: {mse}")

sample_idx = np.random.choice(np.arange(len(y_test)), size=50, replace=False)
plt.scatter(X_test[sample_idx, 0], y_test[sample_idx], color='black', label='Actual')
plt.scatter(X_test[sample_idx, 0], y_pred_gp[sample_idx], color='blue', label='Predicted', alpha=0.5)
plt.title('Gaussian Process Regression - Test Set Prediction')
plt.xlabel('First Feature')
plt.ylabel('Target')
plt.legend()
plt.show()

The Mean Squared Error (MSE) on the test set is: 0.003859773428732748
_images/b64f8f83c1b1241a7bd7949e2d393e3ed3bce8f33ba2ee55f9d31e3aaf8db55f.png
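The predictive standard deviation sigma computed above is not used in the plot, but it is one of the main advantages of a GP over the other two models: every prediction comes with an uncertainty estimate. A minimal sketch of visualising it with error bars, reusing the variables from the cell above:

# GP predictions with +/- 2 sigma error bars for a random subset of test points
sample_idx = np.random.choice(np.arange(len(y_test)), size=50, replace=False)
plt.errorbar(X_test[sample_idx, 0], y_pred_gp[sample_idx],
             yerr=2 * sigma[sample_idx], fmt='o', color='blue',
             alpha=0.5, label='Predicted ± 2σ')
plt.scatter(X_test[sample_idx, 0], y_test[sample_idx], color='black', label='Actual')
plt.xlabel('First Feature')
plt.ylabel('Target')
plt.legend()
plt.show()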

Comparison of Performances#

import matplotlib.pyplot as plt
import numpy as np

x_min, x_max = X_test[:, 0].min(), X_test[:, 0].max()
y_min, y_max = y_test.min(), y_test.max()

predictions = [y_pred_poly, y_pred_nn, y_pred_gp]
titles = ['Polynomial Regression', 'Neural Network', 'Gaussian Process']

for i, y_pred in enumerate(predictions):
    plt.figure(figsize=(8, 6))
    sample_idx = np.random.choice(np.arange(len(y_test)), size=50, replace=False)
    plt.scatter(X_test[sample_idx, 0], y_test[sample_idx], color='black', label='Actual')
    plt.scatter(X_test[sample_idx, 0], y_pred[sample_idx], color='blue', label='Predicted', alpha=0.5)
    plt.plot([x_min, x_max], [y_min, y_max], 'r--', label='Perfect Prediction')
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.title(titles[i] + ' - Test Set Prediction')
    plt.xlabel('First Feature')
    plt.ylabel('Target')
    plt.legend()
    plt.show()
_images/62443162b33efc683d8349a69ffabf338026b8f58bbceac378c46053ed68fe42.png _images/463f568ea3308842d3e47339bbc989db85c9d39e646f0d6eeb0b466067c4482b.png _images/f67ff77a12bf8cdead500593de7584de04947bd0c52f2a17829ab74bb7fbfda9.png
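The plots above give a qualitative picture; a quick quantitative check on the same test set is also useful. A minimal sketch, reusing the predictions and titles lists defined above:

# Print the test-set MSE of each model side by side
from sklearn.metrics import mean_squared_error

for name, pred in zip(titles, predictions):
    print(f"{name}: MSE = {mean_squared_error(y_test, pred):.6f}")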

Rollout#

Now we can test our models on another part of the OLCI image. It comes from the same pair of S3/S2 imagery, and the relevant mask limits are:

zoom_x_min = 390000.0
zoom_x_max = 410000.0
zoom_y_min = 7790230.0
zoom_y_max = 7820000.0
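These limits define the bounding box of the zoomed-in region in projected coordinates. The file loaded below is already cut to this box, but if you needed to subset the full scene yourself, a boolean mask along these lines would do it (x_full, y_full, and reflectance_full are hypothetical names for the full-scene arrays):

# Keep only pixels whose projected coordinates fall inside the zoom window
in_box = ((x_full >= zoom_x_min) & (x_full <= zoom_x_max) &
          (y_full >= zoom_y_min) & (y_full <= zoom_y_max))

s3_x_new = x_full[in_box]
s3_y_new = y_full[in_box]
s3_reflectance_new = reflectance_full[in_box]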

import numpy as np
import matplotlib.pyplot as plt


# 1. Load the new S3 data
s3_test = np.load('/content/drive/MyDrive/GEOL0069/2324/Week_6/2425_test/s3_zoomed_data_test.npz')
s3_x_new = s3_test['x']
s3_y_new = s3_test['y']
s3_reflectance_new = s3_test['reflectance']


# 2. Make Predictions using the 3 Models

# Prediction for Polynomial Regression:
X_poly_new = polynomial_features.transform(s3_reflectance_new)
y_pred_poly_new = model_poly.predict(X_poly_new)

# Prediction for Neural Network:
y_pred_nn_new = model_nn.predict(s3_reflectance_new).flatten()

# Prediction for Gaussian Process Regression:
y_pred_gp_new, variance_gp = gp.predict(s3_reflectance_new)
y_pred_gp_new = y_pred_gp_new.flatten()


# 3. Plot the Original S3 Data and Predicted SIC for each Model

plt.figure(figsize=(16, 4))

# Plot 1: Original S3 data (using reflectance band 1 as a proxy for visualization)
plt.subplot(1, 4, 1)
sc1 = plt.scatter(s3_x_new, s3_y_new, c=s3_reflectance_new[:, 0], cmap='viridis', s=10)
plt.title("Original S3\n(Reflectance Band 1)")
plt.xlabel("X")
plt.ylabel("Y")
plt.colorbar(sc1, label='Reflectance')

# Plot 2: Predicted SIC using Polynomial Regression
y_pred_poly_new = np.clip(y_pred_poly_new, 0, 1)
plt.subplot(1, 4, 2)
sc2 = plt.scatter(s3_x_new, s3_y_new, c=y_pred_poly_new, cmap='viridis', s=10)
plt.title("Predicted SIC\n(Poly Regression)")
plt.xlabel("X")
plt.ylabel("Y")
plt.colorbar(sc2, label='SIC')

# Plot 3: Predicted SIC using Neural Network
plt.subplot(1, 4, 3)
sc3 = plt.scatter(s3_x_new, s3_y_new, c=y_pred_nn_new, cmap='viridis', s=10)
plt.title("Predicted SIC\n(Neural Network)")
plt.xlabel("X")
plt.ylabel("Y")
plt.colorbar(sc3, label='SIC')

# Plot 4: Predicted SIC using Gaussian Process
plt.subplot(1, 4, 4)
sc4 = plt.scatter(s3_x_new, s3_y_new, c=y_pred_gp_new, cmap='viridis', s=10)
plt.title("Predicted SIC\n(Gaussian Process)")
plt.xlabel("X")
plt.ylabel("Y")
plt.colorbar(sc4, label='SIC')

plt.tight_layout()
plt.show()
233/233 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step
_images/e2399459c3f93e595e2276f46ed2645037750cc149d0ab67ee442e93f79c7b5f.png