Application of Regression Techniques in Satellite Imagery Analysis#
In Week 5, we delved into an array of regression techniques, including polynomial regression, neural networks, and Gaussian processes, each offering a distinct methodology for modeling complex relationships within data. This week, we pivot to the practical application of these techniques to a challenging yet highly relevant problem in satellite imagery analysis: predicting sea-ice concentration and the fraction of leads/melt ponds. Our dataset comprises 21 spectral bands from satellite imagery, each spanning roughly 5000 data points, which we regress onto scalar values representing either sea-ice concentration or lead/melt-pond fractions across the same 5000 observations, depending on the task. In the previous notebook, we prepared just such a dataset for applying these regression techniques.
Data Preprocessing#
Let’s recall some key phases of our machine learning project cycle:
Data Collection: Data is the cornerstone of any ML project. This stage involves gathering the data relevant to our problem. The quality, quantity, and variety of the data can significantly influence the model’s performance. For example, collecting satellite images such as those from OLCI is a typical data collection process, with much of the raw data publicly available online for download.
Data Preprocessing: Raw data often requires cleaning and formatting before use. This step includes converting raw data into a format interpretable by ML models, handling missing values, normalizing data, and feature engineering (a minimal normalization sketch follows below). Previously, we introduced a method for creating a machine learning dataset using IRIS.
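As an illustration of one such step, the sketch below standardizes each spectral band to zero mean and unit variance with scikit-learn. It is purely illustrative, using toy numbers; we do not apply scaling in the pipeline that follows.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy example: three observations of two spectral bands
bands = np.array([[0.2, 0.9],
                  [0.4, 0.7],
                  [0.6, 0.5]])

# Standardize each band to zero mean and unit variance
scaled = StandardScaler().fit_transform(bands)
print(scaled.mean(axis=0), scaled.std(axis=0))  # approximately [0. 0.] and [1. 1.]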
In the previous notebook, we completed the data collection phase of our cycle. Now, we move to data preprocessing. The primary task here is to split the data into training and test sets, which will allow us to evaluate our model’s performance after training; if a validation set is needed, it can be carved out of the training set in the same way.
import numpy as np
from sklearn.model_selection import train_test_split

# Paths to the feature and target arrays prepared in the previous notebook
features_path = '/content/drive/MyDrive/GEOL0069/Week_6/reshaped_array_condition_21.npy'
targets_path = '/content/drive/MyDrive/GEOL0069/Week_6/SICavg_condition.npy'

# Load the 21-band features and the corresponding scalar targets
input_features = np.load(features_path)
target_variables = np.load(targets_path)

# Hold out 20% of the observations as a test set
X_train, X_test, y_train, y_test = train_test_split(input_features, target_variables, test_size=0.2, random_state=42)
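A quick sanity check on the resulting shapes is good practice; the sizes in the comment are illustrative and depend on the arrays saved in the previous notebook.
# Features should be (n_samples, 21); targets should be (n_samples,)
print(X_train.shape, X_test.shape)  # e.g. (4000, 21) (1000, 21)
print(y_train.shape, y_test.shape)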
Polynomial Regression [Draper and Smith, 1998]#
Recall Polynomial Regression#
Polynomial regression is a form of regression analysis in which the relationship between the independent variable \(x\) and the dependent variable \(y\) is modeled as an \(n\)th degree polynomial. Polynomial regression fits a nonlinear relationship between the value of \(x\) and the corresponding conditional mean of \(y\), denoted \(E(y \mid x)\). Before applying it to our data, it helps to see what the feature expansion actually does.
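Here is a minimal, self-contained sketch of how PolynomialFeatures maps two toy features onto all degree-2 terms. Applied to our 21 bands, the same expansion produces 253 columns: the bias term, 21 linear terms, and 231 quadratic terms.
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Two toy samples with two features each
X_toy = np.array([[2.0, 3.0],
                  [1.0, 4.0]])

# Degree-2 expansion: [1, x1, x2, x1^2, x1*x2, x2^2]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X_toy))
# [[ 1.  2.  3.  4.  6.  9.]
#  [ 1.  1.  4.  1.  4. 16.]]
With that in mind, we now apply degree-2 polynomial regression to our dataset.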
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

# Expand the 21 bands into all polynomial terms up to degree 2
polynomial_features = PolynomialFeatures(degree=2)
X_poly_train = polynomial_features.fit_transform(X_train)

# Fit an ordinary linear regression on the expanded features
model_poly = LinearRegression()
model_poly.fit(X_poly_train, y_train)

# Apply the same expansion to the test set and predict
X_poly_test = polynomial_features.transform(X_test)
y_pred_poly = model_poly.predict(X_poly_test)

# Evaluate with mean squared error
mse = mean_squared_error(y_test, y_pred_poly)
print(f"The Mean Squared Error (MSE) on the test set is: {mse}")

# Visualize actual vs predicted targets against the first band for a random sample
sample_idx = np.random.choice(np.arange(len(y_test)), size=50, replace=False)
plt.scatter(X_test[sample_idx, 0], y_test[sample_idx], color='black', label='Actual')
plt.scatter(X_test[sample_idx, 0], y_pred_poly[sample_idx], color='blue', label='Predicted', alpha=0.5)
plt.title('Polynomial Regression with Degree 2 - Test Set Prediction')
plt.xlabel('First Feature')
plt.ylabel('y')
plt.legend()
plt.show()
Neural Networks [Goodfellow et al., 2016]#
Recall Important Components of Neural Networks#
Layers: Composed of neurons, layers are the fundamental units of neural networks. A fully connected network consists of input, hidden, and output layers.
Neurons: Each neuron in a layer is connected to all neurons in the previous and next layers, processing the input data and passing on its output.
Weights and Biases: These parameters are adjusted during training to minimize the network’s error in predicting the target variable.
Activation Functions: Functions like ReLU or Sigmoid introduce non-linearities, allowing the network to model complex relationships; a minimal ReLU sketch follows this list.
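As a small self-contained illustration, ReLU simply zeroes out negative inputs while passing positive values through unchanged.
import numpy as np

def relu(x):
    # ReLU: pass positive values through, zero out negatives
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]
We now define and train a small fully connected network on our data.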
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import mean_squared_error

# A small fully connected network: two hidden layers of 256 ReLU units
# and a single linear output neuron for the scalar target
model_nn = Sequential([
    Dense(256, activation='relu', input_shape=(21,)),
    Dense(256, activation='relu'),
    Dense(1)
])

# Adam optimizer with an MSE loss, matching our evaluation metric
model_nn.compile(optimizer='adam', loss='mean_squared_error')
model_nn.fit(X_train, y_train, epochs=10)

# Predict on the test set and evaluate
y_pred = model_nn.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"The Mean Squared Error (MSE) on the test set is: {mse}")
model_nn.summary()

# Flatten Keras's (n, 1) output to a 1-D array for plotting
y_pred_nn = y_pred.flatten()
sample_idx = np.random.choice(np.arange(len(y_test)), size=50, replace=False)
plt.scatter(X_test[sample_idx, 0], y_test[sample_idx], color='black', label='Actual')
plt.scatter(X_test[sample_idx, 0], y_pred_nn[sample_idx], color='blue', label='Predicted', alpha=0.5)
plt.title('Neural Network Regression - Test Set Prediction')
plt.xlabel('First Feature')
plt.ylabel('Target')
plt.legend()
plt.show()
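In practice you would usually monitor performance on held-out data during training. A minimal variant of the fit call above, using Keras's built-in validation_split together with early stopping, might look like the sketch below; the hyperparameters are illustrative, not tuned.
from tensorflow.keras.callbacks import EarlyStopping

# Stop training once the validation loss stops improving
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
model_nn.fit(X_train, y_train,
             epochs=50,
             validation_split=0.2,  # hold out 20% of the training data for validation
             callbacks=[early_stop])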
Gaussian Processes [Bishop and Nasrabadi, 2006]#
Recall Gaussian Processes#
A Gaussian Process (GP) is essentially a generalization of the Gaussian (normal) distribution: instead of describing individual variables, it describes a distribution over functions. Think of a GP as a method for predicting or estimating a function from known data points. Note that we use a sparse GP here: exact GP regression scales cubically with the number of training points, so with several thousand observations of 21 bands each, we approximate the full GP with a smaller set of inducing points.
pip install GPy
import GPy
from sklearn.metrics import mean_squared_error

# RBF kernel over the 21 spectral bands
kernel = GPy.kern.RBF(input_dim=21)

# Sparse GP: approximate the full GP with 100 inducing points
num_inducing = 100
gp = GPy.models.SparseGPRegression(X_train, y_train.reshape(-1, 1), kernel, num_inducing=num_inducing)

# Optimize kernel hyperparameters and inducing-point locations
gp.optimize(messages=True)

# Predict mean and variance on the test set
y_pred_gp, variance = gp.predict(X_test)
y_pred_gp = y_pred_gp.flatten()
sigma = np.sqrt(variance).flatten()

mse = mean_squared_error(y_test, y_pred_gp)
print(f"The Mean Squared Error (MSE) on the test set is: {mse}")

sample_idx = np.random.choice(np.arange(len(y_test)), size=50, replace=False)
plt.scatter(X_test[sample_idx, 0], y_test[sample_idx], color='black', label='Actual')
plt.scatter(X_test[sample_idx, 0], y_pred_gp[sample_idx], color='blue', label='Predicted', alpha=0.5)
plt.title('Gaussian Process Regression - Test Set Prediction')
plt.xlabel('First Feature')
plt.ylabel('Target')
plt.legend()
plt.show()
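Unlike the other two models, the GP also returns a predictive variance, stored in sigma above. As a minimal sketch, we can visualize ±2σ uncertainty on the sampled test points:
# Visualize GP predictive uncertainty: error bars of +/- 2 standard deviations
sample_idx = np.random.choice(np.arange(len(y_test)), size=50, replace=False)
plt.errorbar(X_test[sample_idx, 0], y_pred_gp[sample_idx], yerr=2 * sigma[sample_idx],
             fmt='o', color='blue', alpha=0.5, label='Predicted ±2σ')
plt.scatter(X_test[sample_idx, 0], y_test[sample_idx], color='black', label='Actual')
plt.title('Gaussian Process Regression - Predictive Uncertainty')
plt.xlabel('First Feature')
plt.ylabel('Target')
plt.legend()
plt.show()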
Comparison of Performances#
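To compare the three models side by side, we plot predicted against actual values for a random sample of test points; the closer the points lie to the red dashed diagonal, the better the prediction.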
import matplotlib.pyplot as plt
import numpy as np

# Common axis limits so the three plots are directly comparable
y_min, y_max = y_test.min(), y_test.max()

predictions = [y_pred_poly, y_pred_nn, y_pred_gp]
titles = ['Polynomial Regression', 'Neural Network', 'Gaussian Process']

for title, y_pred in zip(titles, predictions):
    plt.figure(figsize=(8, 6))
    sample_idx = np.random.choice(np.arange(len(y_test)), size=50, replace=False)
    # Predicted vs actual: points on the red dashed diagonal are perfect predictions
    plt.scatter(y_test[sample_idx], y_pred[sample_idx], color='blue', alpha=0.5, label='Predicted')
    plt.plot([y_min, y_max], [y_min, y_max], 'r--', label='Perfect Prediction')
    plt.xlim(y_min, y_max)
    plt.ylim(y_min, y_max)
    plt.title(title + ' - Test Set Prediction')
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    plt.legend()
    plt.show()
Rollout#
Now we can test our models on another part of the OLCI image. We will use polynomial regression as an example.
import numpy as np

# Rollout data: 21-band features plus the x/y coordinates of each pixel
path1 = '/content/drive/MyDrive/GEOL0069/Week_6/reshaped_array_condition21_rollout.npy'
path2 = '/content/drive/MyDrive/GEOL0069/Week_6/x_s3_condition_rollout.npy'
path3 = '/content/drive/MyDrive/GEOL0069/Week_6/y_s3_condition_rollout.npy'
reshaped_array_condition21_rollout = np.load(path1)
x_s3_condition_rollout = np.load(path2)
y_s3_condition_rollout = np.load(path3)

# Apply the fitted degree-2 expansion and the trained model to the new region
X_poly_test = polynomial_features.transform(reshaped_array_condition21_rollout)
y_pred_poly = model_poly.predict(X_poly_test)

# Map the predictions back onto the image coordinates
plt.scatter(x_s3_condition_rollout, y_s3_condition_rollout, c=y_pred_poly, vmin=0.7, vmax=1)
plt.colorbar()
plt.show()