GPSat package

Submodules

GPSat.bin_data module

class GPSat.bin_data.BinData

Bases: object

bin_data(file=None, source=None, load_by=None, table=None, where=None, batch=False, add_output_cols=None, bin_config=None, chunksize=5000000, **data_load_kwargs)

Bins the dataset, either in a single pass or in batches, based on the provided configuration.

This method decides between processing the entire dataset at once or in chunks based on the batch parameter. It applies binning according to the specified bin_config, along with any preprocessing defined by col_funcs, col_select, and row_select. Additional columns can be added to the output dataset using add_output_cols. The method is capable of handling both small and very large datasets efficiently.

Parameters:
file: str, optional

Path to the source file containing the dataset if source is not specified. Defaults to None.

source: str, optional

An alternative specification of the data source. This could be a path to a file or another identifier depending on the context. If both file and source are provided, source takes precedence. Defaults to None.

load_by: list of str, optional

List of column names based on which data will be loaded and binned in batches if batch is True. Each unique combination of values in these columns defines a batch. Defaults to None.

table: str, optional

The name of the table within the data source from which to load the data. Defaults to None.

where: list of dict, optional

Conditions for filtering rows from the source, expressed as a list of dictionaries representing SQL-like where clauses. Defaults to None.

batch: bool, optional

If True, the data is processed in chunks based on load_by columns. If False, the entire dataset is processed at once. Defaults to False.

add_output_cols: dict, optional

Dictionary mapping new column names to functions that define their values, for adding columns to the output DataFrame after binning. Defaults to None.

bin_config: dict

Configuration for the binning process, including parameters such as bin sizes, binning method, and criteria for binning. This parameter is required.

chunksize: int, optional

The number of rows to read into memory and process at a time, applicable when batch is True. Defaults to 5,000,000.

**data_load_kwargs: dict, optional

Additional keyword arguments passed to DataLoader.load; see load.

Returns:
df_bin: pandas.DataFrame

A DataFrame containing the binned data.

stats: pandas.DataFrame

A DataFrame containing statistics of the binned data, useful for analyzing the distribution and quality of the binned data.

Raises:
AssertionError

If bin_config is not provided or is not a dictionary.

Notes

The bin_data method offers flexibility in processing datasets of various sizes by allowing for both batch processing and single-pass processing. The choice between these modes is controlled by the batch parameter, making it suitable for scenarios ranging from small datasets that fit easily into memory to very large datasets requiring chunked processing to manage memory usage effectively.

The additional parameters for row and column selection and the ability to add new columns after binning allow for significant customization of the binning process, enabling users to tailor the method to their specific data processing and analysis needs.
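For illustration, a minimal single-pass sketch is given below. The file path, table name and the bin_config keys shown (grid_res, val_col, x_col, y_col, bin_statistic) are placeholders chosen to mirror the binning parameters documented elsewhere in this package, not values prescribed by this method; it is also assumed that BinData can be instantiated without arguments.

>>> from GPSat.bin_data import BinData
>>> bd = BinData()
>>> # hypothetical source file, table and bin_config - adjust to your data
>>> df_bin, stats = bd.bin_data(
...     file="/path/to/obs.h5",
...     table="data",
...     where=[{"col": "lat", "comp": ">=", "val": 65}],
...     batch=False,
...     bin_config={"grid_res": 50_000, "val_col": "obs",
...                 "x_col": "x", "y_col": "y", "bin_statistic": "mean"})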

bin_data_all_at_once(file=None, source=None, table=None, where=None, add_output_cols=None, bin_config=None, **data_load_kwargs)

Reads the entire dataset, applies binning, and returns binned data along with statistics.

This method handles the entire binning process in a single pass, making it suitable for datasets that can fit into memory. It allows for preprocessing of data through column functions, selection of specific rows and columns, and the addition of output columns after binning based on provided configurations.

Parameters:
file: str, optional

Path to the source file containing the dataset if source is not specified. Defaults to None.

source: str, optional

An alternative specification of the data source. This could be a path to a file or another identifier depending on the context. If both file and source are provided, source takes precedence. Defaults to None.

table: str, optional

The name of the table within the data source to apply binning. Defaults to None.

where: list of dict, optional

Conditions for filtering rows before binning, expressed as a list of dictionaries representing SQL-like where clauses. Defaults to None.

add_output_cols: dict, optional

Dictionary mapping new column names to functions that define their values, for adding columns to the output DataFrame after binning. Defaults to None.

bin_config: dict

Configuration for the binning process, including parameters such as bin sizes, binning method, and criteria for binning. This parameter is required.

**data_load_kwargs: dict, optional

Additional keyword arguments passed to DataLoader.load; see load.

Returns:
df_bin: pandas.DataFrame

A DataFrame containing the binned data.

stats_df: pandas.DataFrame

A DataFrame containing statistics of the binned data, useful for analyzing the distribution and quality of the binned data.

Raises:
AssertionError

If bin_config is not provided or is not a dictionary.

Notes

This method is designed to handle datasets that can be loaded entirely into memory. For very large datasets, consider using the bin_data_by_batch method to process the data in chunks and avoid memory issues.

The add_output_cols parameter allows for the dynamic addition of columns to the binned dataset based on custom logic, which can be useful for enriching the dataset with additional metrics or categorizations derived from the binned data.

bin_data_by_batch(file=None, source=None, load_by=None, table=None, where=None, add_output_cols=None, chunksize=5000000, bin_config=None, **data_load_kwargs)

Bins the data in chunks based on unique values of specified columns and returns the aggregated binned data and statistics.

This method is particularly useful for very large datasets that cannot fit into memory. It reads the data in batches, applies binning to each batch based on the unique values of the specified load_by columns, and aggregates the results. This approach helps manage memory usage while allowing for comprehensive data analysis and binning.

Parameters:
file: str, optional

Path to the source file containing the dataset if source is not specified. Defaults to None.

source: str, optional

An alternative specification of the data source. This could be a path to a file or another identifier depending on the context. If both file and source are provided, source takes precedence. Defaults to None.

load_by: list of str

List of column names based on which data will be loaded and binned in batches. Each unique combination of values in these columns defines a batch.

table: str, optional

The name of the table within the data source from which to load the data. Defaults to None.

where: list of dict, optional

Conditions for filtering rows from the source, expressed as a list of dictionaries representing SQL-like where clauses. Defaults to None.

add_output_cols: dict, optional

Dictionary mapping new column names to functions that define their values, for adding columns to the output DataFrame after binning. Defaults to None.

chunksize: int, optional

The number of rows to read into memory and process at a time. Defaults to 5,000,000.

bin_config: dict

Configuration for the binning process, including parameters such as bin sizes, binning method, and criteria for binning. This parameter is required.

**data_load_kwargs: dict, optional

Additional keyword arguments passed to DataLoader.load; see load.

Returns:
df_bin: pandas.DataFrame

A DataFrame containing the aggregated binned data from all batches.

stats_all: pandas.DataFrame

A DataFrame containing aggregated statistics of the binned data from all batches, useful for analyzing the distribution and quality of the binned data.

Raises:
AssertionError

If bin_config is not provided or is not a dictionary.

Notes

The bin_data_by_batch method is designed to handle large datasets by processing them in manageable chunks. It requires specifying load_by columns to define how the dataset is divided into batches for individual binning operations. This method ensures efficient memory usage while allowing for complex data binning and analysis tasks on large datasets.

The add_output_cols parameter enables the dynamic addition of columns to the output dataset based on custom logic applied after binning, which can be used to enrich the dataset with additional insights or metrics derived from the binned data.
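As with bin_data above, a hedged sketch of a batched call follows; the source, table, load_by column and bin_config values are placeholders, and BinData is again assumed to be instantiable without arguments.

>>> from GPSat.bin_data import BinData
>>> bd = BinData()
>>> # batches are defined by the unique values of the (hypothetical) "date" column
>>> df_bin, stats_all = bd.bin_data_by_batch(
...     source="/path/to/obs.h5",
...     table="data",
...     load_by=["date"],
...     chunksize=1_000_000,
...     bin_config={"grid_res": 50_000, "val_col": "obs",
...                 "x_col": "x", "y_col": "y"})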

static bin_wrapper(df, col_funcs=None, print_stats=True, **bin_config)

Perform binning on a DataFrame with optional statistics printing and column modifications.

This function wraps the binning process, allowing for optional statistics on the data before binning, dynamic column additions or modifications, and the application of various binning configurations.

Parameters:
df: pandas.DataFrame

The DataFrame to be binned.

col_funcs: dict, optional

A dictionary where keys are column names to add or modify, and values are functions that take a pandas Series and return a modified Series. This allows for the dynamic addition or modification of columns before binning. Defaults to None.

print_stats: bool, optional

If True, prints basic statistics of the DataFrame before binning. Useful for a preliminary examination of the data. Defaults to True.

**bin_config: dict

Arbitrary keyword arguments defining the binning configuration. These configurations dictate how binning is performed and include parameters such as bin sizes, binning method, criteria for binning, etc.

Returns:
ds_bin: xarray.Dataset

The binned data as an xarray Dataset. Contains the result of binning the input DataFrame according to the specified configurations.

stats_df: pandas.DataFrame

A DataFrame containing statistics of the input DataFrame after any column additions or modifications and before binning. Provides insights into the data distribution and can inform decisions on binning parameters or data preprocessing.

Notes

The actual structure and contents of the ds_bin xarray Dataset will depend on the binning configurations specified in **bin_config. Similarly, the stats_df DataFrame provides a summary of the data’s distribution based on the column specified in the binning configuration and can vary widely in its specifics.

The binning process may be adjusted significantly through the **bin_config parameters, allowing for a wide range of binning behaviors and outcomes. For detailed configuration options, refer to the documentation of the specific binning functions used within this wrapper.

write_dataframe_to_table(df_bin, file=None, table=None)

Writes the binned DataFrame to a specified table in an HDF5 file.

This method saves the binned data, represented by a DataFrame, into a table within an HDF5 file. The method assumes that the HDF5 file is accessible and writable. It allows for the efficient storage of large datasets and facilitates easy retrieval for further analysis or processing.

Parameters:
df_bin: pandas.DataFrame

The DataFrame containing the binned data to be written to the file. This DataFrame should already be processed and contain the final form of the data to be saved.

file: str

The path to the HDF5 file where the DataFrame will be written. If the file does not exist, it will be created. If the file exists, the method will write the DataFrame to the specified table within the file.

table: str

The name of the table within the HDF5 file where the DataFrame will be stored. If the table already exists, the new data will be appended to it.

Raises:
AssertionError

If either file or table is not specified.

Notes

The HDF5 file format is a versatile data storage format that can efficiently store large datasets. It is particularly useful in contexts where data needs to be retrieved for analysis, as it supports complex queries and data slicing. This method leverages the pandas HDFStore mechanism for storing DataFrames, which abstracts away many of the complexities of working directly with HDF5 files.

This method also includes the raw_data_config, config (the binning configuration), and run_info as attributes of the stored table, providing a comprehensive audit trail of how the binned data was generated. This can be crucial for reproducibility and understanding the context of the stored data.
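A short, hedged usage sketch (the file path and table name are placeholders, and df_bin is assumed to come from a prior bin_data call such as the examples above):

>>> from GPSat.bin_data import BinData
>>> bd = BinData()
>>> # df_bin obtained from bd.bin_data(...) as shown earlier
>>> bd.write_dataframe_to_table(df_bin, file="/path/to/binned.h5", table="data_bin")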

GPSat.bin_data.get_bin_data_config()
GPSat.bin_data.plot_wrapper(plt_df, val_col, lon_col='lon', lat_col='lat', date_col='date', scatter_plot_size=2, plt_where=None, projection=None, extent=None)

GPSat.dataloader module

class GPSat.dataloader.DataLoader(hdf_store=None, dataset=None)

Bases: object

static add_cols(df, col_func_dict=None, filename=None, verbose=False)

Adds new columns to a given DataFrame based on the provided dictionary of column-function pairs.

This function allows the user to add new columns to a DataFrame using a dictionary that maps new column names to functions that compute the column values. The functions can be provided as values in the dictionary, and the new columns can be added to the DataFrame in a single call to this function.

If a tuple is provided as a key in the dictionary, it is assumed that the corresponding function will return multiple columns. The length of the returned columns should match the length of the tuple.

Parameters:
df: pandas.DataFrame

The input DataFrame to which new columns will be added.

col_func_dict: dict, optional

A dictionary that maps new column names (keys) to functions (values) that compute the column values. If a tuple is provided as a key, it is assumed that the corresponding function will return multiple columns. The length of the returned columns should match the length of the tuple. If None, an empty dictionary will be used. Default is None.

filename: str, optional

The name of the file from which the DataFrame was read. This parameter will be passed to the functions provided in the col_func_dict. Default is None.

verbose: int or bool, optional

Determines the level of verbosity of the function. If verbose is 3 or higher, the function will print messages about the columns being added. Default is False.

Returns:
None
Raises:
AssertionError

If the length of the new columns returned by the function does not match the length of the tuple key in the col_func_dict.

Notes

DataFrame is manipulated inplace. If a single value is returned by the function, it will be assigned to a column with the name specified in the key. See help(utils.config_func) for more details.

Examples

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> add_one = lambda x: x + 1
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> DataLoader.add_cols(df, col_func_dict={
...     'C': {'func': add_one, "col_args": "A"}
... })
>>> df
   A  B  C
0  1  4  2
1  2  5  3
2  3  6  4
static add_data_to_col(df, add_data_to_col=None, verbose=False)

Adds new data to an existing column or creates a new column with the provided data in a DataFrame.

This function takes a DataFrame and a dictionary with the column name as the key and the data to be added as the value. It can handle scalar values or lists of values, and will replicate the DataFrame rows for each value in the list.

Parameters:
df: pandas.DataFrame

The input DataFrame to which data will be added or updated.

add_data_to_col: dict, optional

A dictionary with the column name (key) and data to be added (value). The data can be a scalar value or a list of values. If a list of values is provided, the DataFrame rows will be replicated for each value in the list. If None, an empty dictionary will be used. Default is None.

verbose: bool, default False

If True, the function will print progress messages.

Returns:
df: pandas.DataFrame

The DataFrame with the updated or added columns.

Raises:
AssertionError

If the add_data_to_col parameter is not a dictionary.

Notes

This method adds data to a specified column in a pandas DataFrame repeatedly. The method creates a copy of the DataFrame for each entry in the data to be added, and concatenates them to create a new DataFrame with the added data.

Examples

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> updated_df = DataLoader.add_data_to_col(df, add_data_to_col={"C": [7, 8]})
>>> print(updated_df)
   A  B  C
0  1  4  7
1  2  5  7
2  3  6  7
0  1  4  8
1  2  5  8
2  3  6  8
>>> len(df)
3
>>> out = DataLoader.add_data_to_col(df, add_data_to_col={"a": [1,2,3,4]})
>>> len(out)
12
>>> out = DataLoader.add_data_to_col(df, add_data_to_col={"a": [1,2,3,4], "b": [5,6,7,8]})
>>> len(out)
48
static bin_data(df, x_range=None, y_range=None, grid_res=None, x_col='x', y_col='y', val_col=None, bin_statistic='mean', return_bin_center=True)

Bins data from a given DataFrame into a 2D grid, applying the specified statistical function to the data in each bin.

This function takes a DataFrame containing x, y, and value columns and bins the data into a 2D grid. It returns the resulting grid, as well as the x and y bin edges or centers, depending on the value of return_bin_center.

Parameters:
df: pd.DataFrame

The input DataFrame containing the data to be binned.

x_range: list or tuple of floats, optional

The range of x values, specified as [min, max]. If not provided, a default value of [-4500000.0, 4500000.0] will be used.

y_range: list or tuple of floats, optional

The range of y values, specified as [min, max]. If not provided, a default value of [-4500000.0, 4500000.0] will be used.

grid_res: float or None

The grid resolution, expressed in kilometers. This parameter must be provided.

x_col: str, default is "x".

The name of the column in the DataFrame containing the x values.

y_col: str, default is "y".

The name of the column in the DataFrame containing the y values.

val_col: str, optional

The name of the column in the DataFrame containing the values to be binned. This parameter must be provided.

bin_statistic: str, default is "mean".

The statistic to apply to the binned data. Options are 'mean', 'median', 'count', 'sum', 'min', 'max', or a custom callable function.

return_bin_center: bool, default is True.

If True, the function will return the bin centers instead of the bin edges.

Returns:
binned_data: numpy.ndarray

The binned data as a 2D grid.

x_out: numpy.ndarray

The x bin edges or centers, depending on the value of return_bin_center.

y_out: numpy.ndarray

The y bin edges or centers, depending on the value of return_bin_center.
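The behaviour described above amounts to 2D statistical binning of the kind provided by scipy.stats.binned_statistic_2d (the bin_statistic options listed for DataPrep.bin_data below refer to the same scipy functions). The standalone snippet below illustrates only that underlying mechanism; it is not a call to DataLoader.bin_data itself, and the ranges and resolution are arbitrary.

>>> import numpy as np
>>> from scipy.stats import binned_statistic_2d
>>> rng = np.random.default_rng(0)
>>> x = rng.uniform(-100, 100, 500)
>>> y = rng.uniform(-100, 100, 500)
>>> z = rng.normal(size=500)
>>> # 25-unit wide bins over [-100, 100] in both directions -> an 8 x 8 grid
>>> stat, x_edge, y_edge, _ = binned_statistic_2d(
...     x, y, z, statistic="mean",
...     bins=[np.arange(-100, 101, 25), np.arange(-100, 101, 25)])
>>> stat.shape
(8, 8)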

classmethod bin_data_by(df, by_cols=None, val_col=None, x_col='x', y_col='y', x_range=None, y_range=None, grid_res=None, bin_statistic='mean', limit=10000)

Bins the input DataFrame df based on the given columns and computes the bin statistics for a specified value column.

This function takes a DataFrame, filters it based on the unique combinations of the by_cols column values, and then bins the data in each filtered DataFrame based on the x_col and y_col column values. It computes the bin statistic for the specified val_col and returns the result as an xarray DataArray. The output DataArray has dimensions "y", "x", and the given by_cols.

Parameters:
df: pandas.DataFrame

The input DataFrame to be binned.

by_cols: str or list[str] or tuple[str]

The column(s) by which the input DataFrame should be filtered. Unique combinations of these columns are used to create separate DataFrames for binning.

val_col: str

The column in the input DataFrame for which the bin statistics should be computed.

x_col: str, optional, default='x'

The column in the input DataFrame to be used for binning along the x-axis.

y_col: str, optional, default='y'

The column in the input DataFrame to be used for binning along the y-axis.

x_range: tuple, optional

The range of the x-axis values for binning. If None, the minimum and maximum x values are used.

y_range: tuple, optional

The range of the y-axis values for binning. If None, the minimum and maximum y values are used.

grid_res: float, optional

The resolution of the grid used for binning. If None, the resolution is calculated based on the input data.

bin_statistic: str, optional, default="mean"

The statistic to compute for each bin. Supported values are "mean", "median", "sum", "min", "max", and "count".

limit: int, optional, default=10000

The maximum number of unique combinations of the by_cols column values allowed. Raises an AssertionError if the number of unique combinations exceeds this limit.

Returns:
out: xarray.Dataset

The binned data as an xarray Dataset with dimensions 'y', 'x', and the given by_cols.

Raises:
DeprecationWarning

If the deprecated method DataLoader.bin_data_by(...) is used instead of DataPrep.bin_data_by(...).

AssertionError

If any of the input parameters do not meet the specified conditions.

connect_to_hdf_store(store, table=None, mode='r')
classmethod data_select(obj, where=None, combine_where='AND', table=None, return_df=True, reset_index=False, drop=True, copy=True, columns=None, close=False, **kwargs)

Selects data from an input object (pd.DataFrame, pd.HDFStore, xr.DataArray or xr.DataSet) based on filtering conditions.

This function filters data from various types of input objects based on the provided conditions specified in the 'where' parameter. It also supports selecting specific columns, resetting the index, and returning the output as a DataFrame.

Parameters:
obj: pd.DataFrame, pd.Series, dict, pd.HDFStore, xr.DataArray, or xr.Dataset

The input object from which data will be selected. If dict, it will try to convert it to pandas.DataFrame.

where: dict, list of dict or None, default None

Filtering conditions to be applied to the input object. It can be a single dictionary or a list of dictionaries. Each dictionary should have keys: "col", "comp", "val". e.g.

where = {"col": "t", "comp": "<=", "val": 4}

The "col" value specifies the column, "comp" specifies the comparison to be performed (>, >=, ==, !=, <=, <) and “val” is the value to be compared against. If None, then selects all data. Specifying 'where' parameter can avoid reading all data in from filesystem when obj is pandas.HDFStore or xarray.Dataset.

combine_where: str, default ‘AND’

How should where conditions, if there are multiple, be combined? Valid values are ["AND", "OR"], not case-sensitive.

table: str, default None

The table name to select from when using an HDFStore object. If obj is pandas.HDFStore then table must be supplied.

return_df: bool, default True

If True, the output will be returned as a pandas.DataFrame.

reset_index: bool, default False

If True, the index of the output DataFrame will be reset.

drop: bool, default True

If True, the output will have the filtered-out values removed. Applicable only for xarray objects. Default is True.

copy: bool, default True

If True, the output will be a copy of the selected data. Applicable only for DataFrame objects.

columns: list or None, default None

A list of column names to be selected from the input object. If None, selects all columns.

close: bool, default False

If True and obj is a pandas.HDFStore, it will be closed after selecting data.

kwargs: any

Additional keyword arguments to be passed to the obj.select method when using an HDFStore object.

Returns:
out: pandas.DataFrame, pandas.Series, or xarray.DataArray

The filtered data as a pd.DataFrame, pd.Series, or xr.DataArray, based on the input object type and return_df parameter.

Raises:
AssertionError

If the table parameter is not provided when using an HDFStore object.

AssertionError

If the provided columns are not found in the input object when using a DataFrame object.

Examples

>>> import pandas as pd
>>> import xarray as xr
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> # Select data from a DataFrame with a filtering condition
>>> selected_df = DataLoader.data_select(df, where={"col": "A", "comp": ">=", "val": 2})
>>> print(selected_df)
   A  B
1  2  5
2  3  6
file_suffix_engine_map = {'csv': 'read_csv', 'h5': 'HDFStore', 'nc': 'netcdf4', 'parquet': 'read_parquet', 'tsv': 'read_csv', 'zarr': 'zarr'}
classmethod generate_local_expert_locations(loc_dims, ref_data=None, format_type=None, masks=None, include_col='include', col_func_dict=None, row_select=None, keep_cols=None, sort_by=None)
static get_attribute_from_table(source, table, attribute_name)

Retrieve an attribute from a specific table in a HDF5 file or HDFStore.

This function can handle both cases when the source is a filepath string to a HDF5 file or a pandas HDFStore object. The function opens the source (if it’s a filepath), then attempts to retrieve the specified attribute from the specified table within the source. If the retrieval fails for any reason, a warning is issued and None is returned.

Parameters:
source: str or pandas.HDFStore

The source from where to retrieve the attribute. If it’s a string, it is treated as a filepath to a HDF5 file. If it’s a pandas HDFStore object, the function operates directly on it.

table: str

The name of the table within the source from where to retrieve the attribute.

attribute_name: str

The name of the attribute to retrieve.

Returns:
attribute: object

The attribute retrieved from the specified table in the source. If the attribute could not be retrieved, None is returned.

Raises:
NotImplementedError

If the type of the source is neither a string nor a pandas.HDFStore.
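A brief, hedged sketch (the file path and table name are placeholders; the attribute name 'raw_data_config' follows the attributes mentioned for write_dataframe_to_table above):

>>> from GPSat.dataloader import DataLoader
>>> cfg = DataLoader.get_attribute_from_table(source="/path/to/binned.h5",
...                                           table="data_bin",
...                                           attribute_name="raw_data_config")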

classmethod get_keys(source, verobse=False)
static get_masks_for_expert_loc(ref_data, el_masks=None, obs_col=None)

Generate a list of masks based on given local experts locations (el_masks) and a reference data (ref_data).

This function can generate masks in two ways:
  1. If el_mask is a string “had_obs”, a mask is created based on the obs_col of the reference data where any non-NaN value is present.

  2. If el_mask is a dictionary with “grid_space” key, a regularly spaced mask is created based on the dimensions specified and the grid_space value.

The reference data is expected to be an xarray DataArray or xarray Dataset. Support for pandas DataFrame may be added in future.

Parameters:
ref_data: xarray.DataArray or xarray.Dataset

The reference data to use when generating the masks. The data should have coordinates that match the dimensions specified in the el_masks dictionary, if provided.

el_masks: list of str or dict, optional

A list of instructions for generating the masks. Each element in the list can either be a string or a dictionary. If a string, it should be “had_obs”, which indicates a mask should be created where any non-NaN value is present in the obs_col of the ref_data. If a dictionary, it should have a “grid_space” key indicating the regular spacing to be used when creating a mask and ‘dims’ key specifying dimensions in the reference data to be considered. By default, it is None, which indicates no mask is to be generated.

obs_col: str, optional

The column in the reference data to use when generating a mask based on “had_obs” instruction. This parameter is ignored if “had_obs” is not present in el_masks.

Returns:
list of xarray.DataArray

A list of masks generated based on the el_masks instructions. Each mask is an xarray DataArray with the same coordinates as the ref_data. Each value in the mask is a boolean indicating whether a local expert should be located at that point.

Raises:
AssertionError

If ref_data is not an instance of xarray.DataArray or xarray.Dataset, or if “grid_space” is in el_masks but the corresponding dimensions specified in the ‘dims’ key do not exist in ref_data.

Notes

The function could be extended to read data from file system and allow different reference data.

Future extensions could also include support for el_masks to be only a list of dict and for reference data to be a pandas DataFrame.

static get_run_info(script_path=None)

Retrieves information about the current Python script execution environment, including run time, Python executable path, and Git information.

This function collects information about the current script execution environment, such as the date and time when the script is executed, the path of the Python interpreter, the script’s file path, and Git information (if available).

Parameters:
script_path: str, default None

The file path of the currently executed script. If None, it will try to retrieve the file path automatically.

Returns:
run_info: dict

A dictionary containing the following keys:

  • "run_time": The date and time when the script was executed, formatted as "YYYY-MM-DD HH:MM:SS".

  • "python_executable": The path of the Python interpreter.

  • "script_path": The absolute file path of the script (if available).

  • Git-related keys: "git_branch", "git_commit", "git_url", and "git_modified" (if available).

Examples

>>> from GPSat.dataloader import DataLoader
>>> run_info = DataLoader.get_run_info()
>>> print(run_info)
{
    "run_time": "2023-04-28 10:30:00",
    "python_executable": "/usr/local/bin/python3.9",
    "script_path": "/path/to/your/script.py",
    "branch": "main",
    "commit": "123abc",
    "remote": ["https://github.com/user/repo.git (fetch)", "https://github.com/user/repo.git (push)"],
    "details": ['commit 123abc',
      'Author: UserName <username42@gmail.com>',
      'Date:   Fri Apr 28 07:22:31 2023 +0100',
      ':bug: fix '],
    "modified": ['list_of_files.py', 'modified_since.py', 'last_commit.py']
}
static get_where_list(global_select, local_select=None, ref_loc=None)

Generate a list of selection criteria for data filtering based on global and local conditions, as well as reference location.

The function accepts a list of global select conditions, and optional local select conditions and reference location. Each condition in global select can either be ‘static’ (with keys ‘col’, ‘comp’, and ‘val’) or ‘dynamic’ (requiring local select and reference location and having keys ‘loc_col’, ‘src_col’, ‘func’). The function evaluates each global select condition and constructs a corresponding selection dictionary.

Parameters:
global_select: list of dict

A list of dictionaries defining global selection conditions. Each dictionary can be either ‘static’ or ‘dynamic’. ‘Static’ dictionaries should contain the keys ‘col’, ‘comp’, and ‘val’ which define a column, a comparison operator, and a value respectively. ‘Dynamic’ dictionaries should contain the keys ‘loc_col’, ‘src_col’, and ‘func’ which define a location column, a source column, and a function respectively.

local_select: list of dict, optional

A list of dictionaries defining local selection conditions. Each dictionary should contain keys ‘col’, ‘comp’, and ‘val’ defining a column, a comparison operator, and a value respectively. This parameter is required if any ‘dynamic’ condition is present in global_select.

ref_loc: pandas DataFrame, optional

A reference location as a pandas DataFrame. This parameter is required if any ‘dynamic’ condition is present in global_select.

Returns:
list of dict

A list of dictionaries each representing a selection condition to be applied on data. Each dictionary contains keys ‘col’, ‘comp’, and ‘val’ defining a column, a comparison operator, and a value respectively.

Raises:
AssertionError

If a ‘dynamic’ condition is present in global_select but local_select or ref_loc is not provided, or if the required keys are not present in the ‘dynamic’ condition, or if the location column specified in a ‘dynamic’ condition is not present in ref_loc.
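A minimal sketch using a 'static' condition only (the column name and value are placeholders); a 'dynamic' condition (keys 'loc_col', 'src_col', 'func') would additionally require the local_select and ref_loc arguments described above:

>>> from GPSat.dataloader import DataLoader
>>> where = DataLoader.get_where_list(
...     global_select=[{"col": "lat", "comp": ">=", "val": 65}])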

static get_where_list_legacy(read_in_by=None, where=None)

Generate a list (of lists) of where conditions that can be consumed by pd.HDFStore(...).select.

Parameters:
read_in_by: dict of dict or None

Sub-dictionary must contain the keys "values", "how".

where: str or None

Used if read_in_by is not provided.

Returns:
list of list

Containing string where conditions.

classmethod hdf_tables_in_store(store=None, path=None)

Retrieve the list of tables available in an HDFStore.

This class method allows the user to get the names of all tables stored in a given HDFStore. It accepts either an already open HDFStore object or a path to an HDF5 file. If a path is provided, the method will open the HDFStore in read-only mode, retrieve the table names, and then close the store.

Parameters:
store: pd.io.pytables.HDFStore, optional

An open HDFStore object. If this parameter is provided, path should not be specified.

path: str, optional

The file path to an HDF5 file. If this parameter is provided, store should not be specified. The method opens the HDFStore at this path in read-only mode to retrieve the table names.

Returns:
list of str

A list containing the names of all tables in the HDFStore.

Raises:
AssertionError

If both store and path are None, or if the store provided is not an instance of pd.io.pytables.HDFStore.

Notes

The method ensures that only one of store or path is provided. If path is specified, the HDFStore is opened in read-only mode and closed after retrieving the table names.

Examples

>>> DataLoader.hdf_tables_in_store(store=my_store)
['/table1', '/table2']
>>> DataLoader.hdf_tables_in_store(path='path/to/hdf5_file.h5')
['/table1', '/table2', '/table3']
static is_list_of_dict(lst)

Checks if the given input is a list of dictionaries.

This utility function tests if the input is a list where all elements are instances of the dict type.

Parameters:
lst: list

The input list to be checked for containing only dictionaries.

Returns:
bool

True if the input is a list of dictionaries, False otherwise.

Examples

>>> from GPSat.dataloader import DataLoader
>>> DataLoader.is_list_of_dict([{"col": "t", "comp": "==", "val": 1}])
True
>>> DataLoader.is_list_of_dict([{"a": 1, "b": 2}, {"c": 3, "d": 4}])
True
>>> DataLoader.is_list_of_dict([1, 2, 3])
False
>>> DataLoader.is_list_of_dict("not a list")
False
static kdt_tree_list_for_local_select(df, local_select)

Pre-calculates a list of KDTree objects for selecting points within a radius based on the local_select input.

Given a DataFrame and a list of local selection criteria, this function builds a list of KDTree objects that can be used for spatially selecting points within specified radii.

Parameters:
df: pd.DataFrame

The input DataFrame containing the data to be used for KDTree construction.

local_select: list of dict

A list of dictionaries containing the selection criteria for each local select. Each dictionary should have the following keys:

  • "col": The name of the column(s) used for spatial selection. Can be a single string or a list of strings.

  • "comp": The comparison operator, either "<" or "<=". Currently, only less than comparisons are supported for multi-dimensional values.

Returns:
out: list

A list of KDTree objects or None values, where each element corresponds to an entry in the local_select input. If an entry in local_select has a single string for "col", the corresponding output element will be None. Otherwise, the output element will be a KDTree object built from the specified columns.

Examples

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> local_select = [{"col": ["x", "y"], "comp": "<"}]
>>> kdt_trees = DataLoader.kdt_tree_list_for_local_select(df, local_select)
>>> print(kdt_trees)
classmethod load(source, where=None, engine=None, table=None, source_kwargs=None, col_funcs=None, row_select=None, col_select=None, reset_index=False, add_data_to_col=None, close=False, verbose=False, combine_row_select='AND', **kwargs)

Load data from various sources and (optionally) apply selection of columns/rows and add/modify columns.

Parameters:
source: str, pd.DataFrame, pd.Series, pd.HDFStore, xr.DataSet, default None

If str, will try to convert to other types.

where: dict or list of dict, default None

Used when querying pd.HDFStore, xr.DataSet, xr.DataArray. Specified as a list of one or more dictionaries, each containing the keys:

  • "col": refers to a column (or variable, for xarray objects).

  • "comp": is the type of comparison to apply e.g. "==", "!=", ">=", ">", "<=", "<".

  • "val": value to be compared with.

e.g.

where = [{"col": "A", "comp": ">=", "val": 0}]

will select entries where the column "A" is greater than 0.

Note: Think of this as a database query, with the where used to read data from the file system into memory.

engine: str or None, default None

Specify the type of ‘engine’ to use to read in data. If not supplied, it will be inferred by source if source is string. Valid values: "HDFStore", "netcdf4", "scipy", "pydap", "h5netcdf", "pynio", "cfgrib", "pseudonetcdf", "zarr" or any of Pandas "read_*".

table: str or None, default None

Used only if source is pd.HDFStore (or is converted to one) and is required if so. Should be a valid table (i.e. key) in HDFStore.

source_kwargs: dict or None, default None

Additional keyword arguments to pass to the data source reading functions, depending on engine. e.g. keyword arguments for pandas.read_csv() if engine=read_csv.

col_funcs: dict or None, default None

If dict, it will be provided to add_cols method to add or modify columns.

row_select: dict, list of dict, or None, default None

Used to select a subset of data after data is initially read into memory. Can be the same type of input as where i.e.

row_select = {"col": "A", "comp": ">=", "val": 0}

or use col_funcs that return bool array

e.g.

row_select = {"func": "lambda x: ~np.isnan(x)", "col_args": 1}

see help(utils.config_func) for more details.

col_select: list of str or None, default None

If specified as a list of strings, it will return a subset of columns using col_select. All values must be valid. If None, all columns will be returned.

filename: str or None, default None

Used by add_cols method.

reset_index: bool, default False

Apply reset_index(inplace=True) before returning?

add_data_to_col:

Add new column(s) to the DataFrame; see the add_data_to_col argument of the add_data_to_col method.

close: bool, default False

See DataLoader.data_select for details

verbose: bool, default False

Set verbosity.

kwargs:

Additional arguments to be provided to data_select method

Returns:
pd.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df = DataLoader.load(source = df,
...                      where = {"col": "A", "comp": ">=", "val": 2})
>>> print(df.head())
   A  B
0  2  5
1  3  6

If the data is stored in a file, we can extract it as follows (here, we assume the data is saved in “path/to/data.h5” under the table “data”):

>>> df = DataLoader.load(source = "path/to/data.h5",
...                      table = "data")
classmethod local_data_select(df, reference_location, local_select, kdtree=None, verbose=True)

Selects data from a DataFrame based on a given criteria and reference (expert) location.

This method applies local selection criteria to a DataFrame, allowing for flexible, column-wise data selection based on comparison operations. For multi (dimensional) column selections, a KDTree can be used for efficiency.

Parameters:
df: pd.DataFrame

The DataFrame from which data will be selected.

reference_location: dict or pd.DataFrame

Reference location used for comparisons. If DataFrame is provided, it will be converted to dict.

local_select: list of dict

List of dictionaries containing the selection criteria for each local select. Each dictionary must contain keys ‘col’, ‘comp’, and ‘val’. ‘col’ is the column in ‘df’ to apply the comparison on, ‘comp’ is the comparison operator as a string (can be ‘>=’, ‘>’, ‘==’, ‘<’, ‘<=’), and ‘val’ is the value to compare with.

kdtree: KDTree or list of KDTree, optional

Precomputed KDTree or list of KDTrees for optimization. Each KDTree in the list corresponds to an entry in local_select. If not provided, a new KDTree will be created.

verbose: bool, default=True

If True, print details for each selection criteria.

Returns:
pd.DataFrame

A DataFrame containing only the data that meets all of the selection criteria.

Raises:
AssertionError

If ‘col’ is not in ‘df’ or ‘reference_location’, if the comparison operator in ‘local_select’ is not valid, or if the provided ‘kdtree’ is not of type KDTree.

Notes

If ‘col’ is a string, a simple comparison is performed. If ‘col’ is a list of strings, a KDTree-based selection is performed where each dimension is a column from ‘df’. For multi-dimensional comparisons, only less than comparisons are currently handled.

If ‘kdtree’ is provided and is a list, it must be of the same length as ‘local_select’ with each element corresponding to the same index in ‘local_select’.
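A hedged sketch follows; the 'val' entries are taken to be offsets applied relative to the reference location's 't' value (an assumption consistent with the use of a reference location, but not spelled out above), and the column names are placeholders:

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"t": [1, 2, 3, 4], "obs": [0.1, 0.2, 0.3, 0.4]})
>>> ref = {"t": 3}
>>> # keep rows with 't' within +/- 1 of the reference 't'
>>> out = DataLoader.local_data_select(df, reference_location=ref,
...                                    local_select=[{"col": "t", "comp": ">=", "val": -1},
...                                                  {"col": "t", "comp": "<=", "val": 1}],
...                                    verbose=False)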

static make_multiindex_df(idx_dict, **kwargs)

Create a multi-indexed DataFrame from the provided index dictionary for each keyword argument supplied.

This function creates a multi-indexed DataFrame, with each row having the same multi-index value. The index dictionary serves as the levels and labels for the multi-index, while the keyword arguments provide the data.

Parameters:
idx_dict: dict or pd.Series

A dictionary or pandas Series containing the levels and labels for the multi-index.

**kwargs: dict

Keyword arguments specifying the data and column names for the resulting DataFrame. The data can be of various types: int, float, bool, np.ndarray, pd.DataFrame, dict, or tuple. This data will be transformed into a DataFrame, where the multi-index will be added.

Returns:
dict

A dictionary containing the multi-indexed DataFrames with keys corresponding to the keys of provided keyword arguments.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> idx_dict = {"year": 2020, "month": 1}
>>> data = pd.DataFrame({"x": np.arange(10)})
>>> df = pd.DataFrame({"y": np.arange(3)})
>>> DataLoader.make_multiindex_df(idx_dict, data=data, df=df)
{'data': <pandas.DataFrame (multiindexed) with shape (3, 4)>}
static mindex_df_to_mindex_dataarray(df, data_name, dim_cols=None, infer_dim_cols=True, index_name='index')

Converts a multi-index DataFrame to a multi-index DataArray.

The method facilitates a transition from pandas DataFrame representation to the Xarray DataArray format, while preserving multi-index structure. This can be useful for higher-dimensional indexing, labeling, and performing mathematical operations on the data.

Parameters:
df: pd.DataFrame

The input DataFrame with a multi-index to be converted to a DataArray.

data_name: str

The name of the column in ‘df’ that contains the data values for the DataArray.

dim_cols: list of str, optional

A list of columns in ‘df’ that will be used as additional dimensions in the DataArray. If None, dimension columns will be inferred if ‘infer_dim_cols’ is True.

infer_dim_cols: bool, default=True

If True and 'dim_cols' is None, dimension columns will be inferred from 'df'. Columns will be considered a dimension column if they match the pattern "^_dim_\d".

index_name: str, default="index"

The name assigned to the placeholder index created during the conversion process.

Returns:
xr.DataArray

A DataArray derived from the input DataFrame with the same multi-index structure. The data values are taken from the column in ‘df’ specified by ‘data_name’. Additional dimensions can be included from ‘df’ as specified by ‘dim_cols’.

Raises:
AssertionError

If ‘data_name’ is not a column in ‘df’.

Notes

The function manipulates ‘df’ by reference. If the original DataFrame needs to be preserved, provide a copy to the function.
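A small sketch, assuming a DataFrame that already carries a multi-index (a copy is passed because, per the note above, the input is modified by reference):

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> mi = pd.MultiIndex.from_tuples([(2020, 1), (2020, 2), (2020, 3)],
...                                names=["year", "month"])
>>> df = pd.DataFrame({"obs": [0.1, 0.2, 0.3]}, index=mi)
>>> da = DataLoader.mindex_df_to_mindex_dataarray(df.copy(), data_name="obs")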

classmethod read_flat_files(file_dirs, file_regex, sub_dirs=None, read_csv_kwargs=None, col_funcs=None, row_select=None, col_select=None, new_column_names=None, strict=True, verbose=False)

Wrapper for read_from_multiple_files with read_engine='csv'.

Read flat files (.csv, .tsv, etc.) from the file system and return a pd.DataFrame object.

Parameters:
file_dirs: str or List[str]

The directories containing the files to read.

file_regex: str

A regular expression pattern to match file names within the specified directories.

sub_dirs: str or List[str], optional

Subdirectories within each file directory to search for files.

read_csv_kwargs: dict, optional

Additional keyword arguments specifically for CSV reading. These are keyword arguments for the function pandas.read_csv().

col_funcs: dict of dict, optional

A dictionary with column names as keys and column functions to apply during data reading as values. The column functions should be a dictionary of keyword arguments to utils.config_func.

row_select: list of dict, optional

A list of functions to select rows during data reading.

col_select: list of str, optional

A list of column names to read from data.

new_column_names: List[str], optional

New column names to assign to the resulting DataFrame.

strict: bool, default True

Whether to raise an error if a file directory does not exist.

verbose: bool or int, default False

Verbosity level for printing progress.

Returns:
pd.DataFrame

A DataFrame containing the combined data from multiple files.

Notes

  • This method reads data from multiple files located in specified directories and subdirectories.

  • The file_regex argument is used to filter files to be read.

  • Various transformations can be applied to the data, including adding new columns and selecting rows/columns.

  • If new_column_names is provided, it should be a list with names matching the number of columns in the output DataFrame.

  • The resulting DataFrame contains the combined data from all the specified files.

Examples

The command below reads the files "A_RAW.csv", "B_RAW.csv" and "C_RAW.csv" in the path "/path/to/dir" and combines them into a single dataframe.

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> col_funcs = {
...    "source": { # Add a new column "source" with entries "A", "B" or "C".
...        "func": "lambda x: re.sub('_RAW.*$', '', os.path.basename(x))",
...        "filename_as_arg": True
...    },
...    "datetime": { # Modify column "datetime" by converting to datetime64[s].
...        "func": "lambda x: x.astype('datetime64[s]')",
...        "col_args": "datetime"
...    },
...    "obs": { # Rename column "z" to "obs" and subtract mean value 0.1.
...        "func": "lambda x: x-0.1",
...        "col_args": "z"
...    }
... }
>>> row_select = [ # Read data whose "lat" value is >= 65.
...    {
...        "func": "lambda x: x>=65",
...        "col_kwargs": {
...            "x": "lat"
...        }
...    }
... ]
>>> df = DataLoader.read_flat_files(file_dirs = "/path/to/dir/",
...                                 file_regex = ".*_RAW.csv$",
...                                 col_funcs = col_funcs,
...                                 row_select = row_select)
>>> print(df.head(2))
        lon             lat             datetime                source  obs
0       59.944790       82.061122       2020-03-01 13:48:50     C       -0.0401
1       59.939555       82.063771       2020-03-01 13:48:50     C       -0.0861
classmethod read_from_multiple_files(file_dirs, file_regex, read_engine='csv', sub_dirs=None, col_funcs=None, row_select=None, col_select=None, new_column_names=None, strict=True, read_kwargs=None, read_csv_kwargs=None, verbose=False)

Reads and merges data from multiple files in specified directories, Optionally apply various transformations such as column renaming, row selection, column selection or other transformation functions to the data.

The primary input is a list of directories and a regular expression used to select which files within those directories should be read.

Parameters:
file_dirs: list of str

A list of directories to read the files from. Each directory is a string. If a string is provided instead of a list, it will be wrapped into a single-element list.

file_regex: str

Regular expression to match the files to be read from the directories specified in 'file_dirs', e.g. "NEW.csv$" will match all files ending with "NEW.csv".

read_engine: str, optional

The engine to be used to read the files. Options include ‘csv’, ‘nc’, ‘netcdf’, and ‘xarray’. Default is ‘csv’.

sub_dirs: list of str, optional

A list of subdirectories to be appended to each directory in ‘file_dirs’. If a string is provided, it will be wrapped into a single-element list. Default is None.

col_funcs: dict, optional

A dictionary that maps new column names to functions that compute the column values. Provided to add_cols via col_func_dict parameter. Default is None.

row_select: list of dict, optional

A list of dictionaries, each representing a condition to select rows from the DataFrame. Provided to the row_select_bool method. Default is None.

col_select: slice, optional

A slice object to select specific columns from the DataFrame. If not provided, all columns are selected.

new_column_names: list of str, optional

New names for the DataFrame columns. The length should be equal to the number of columns in the DataFrame. Default is None.

strict: bool, optional

Determines whether to raise an error if a directory in ‘file_dirs’ does not exist. If False, a warning is issued instead. Default is True.

read_kwargs: dict, optional

Additional keyword arguments to pass to the read function (pd.read_csv or xr.open_dataset). Default is None.

read_csv_kwargs: dict, optional

Deprecated. Additional keyword arguments to pass to pd.read_csv. Use ‘read_kwargs’ instead. Default is None.

verbose: bool or int, optional

Determines the verbosity level of the function. If True or an integer equal to or higher than 3, additional print statements are executed.

Returns:
out: pandas.DataFrame

The resulting DataFrame, merged from all the files that were read and processed.

Raises:
AssertionError

Raised if the ‘read_engine’ parameter is not one of the valid choices, if ‘read_kwargs’ or ‘col_funcs’ are not dictionaries, or if the length of ‘new_column_names’ is not equal to the number of columns in the DataFrame. Raised if ‘strict’ is True and a directory in ‘file_dirs’ does not exist.

Notes

The function supports reading from csv, netCDF files and xarray Dataset formats. For netCDF and xarray Dataset, the data is converted to a DataFrame using the ‘to_dataframe’ method.

static read_from_npy(npy_files, npy_dir, dims=None, flatten_xy=True, return_xarray=True)

Read NumPy array(s) from the specified .npy file(s) and return as xarray DataArray(s).

This function reads one or more .npy files from the specified directory and returns them as xarray DataArray(s). The input can be a single file, a list of files, or a dictionary of files with the desired keys. The returned dictionary contains the xarray DataArray(s) with the corresponding keys.

Parameters:
npy_files: str, list, or dict

The .npy file(s) to be read. It can be a single file (str), a list of files, or a dictionary of files.

npy_dir: str

The directory containing the .npy file(s).

dims: list or tuple, optional

The dimensions for the xarray DataArray(s), (default: None).

flatten_xy: bool, optional

If True, flatten the x and y arrays by taking the first row and first column, respectively (default: True).

return_xarray: bool, default True

If True, will convert numpy arrays to xarray DataArray; otherwise will return a dict of numpy arrays.

Returns:
dict

A dictionary containing xarray DataArray(s) with keys corresponding to the input files.

Examples

>>> read_from_npy(npy_files="data.npy", npy_dir="./data")
{'obs': <xarray.DataArray (shape)>}
>>> read_from_npy(npy_files=["data1.npy", "data2.npy"], npy_dir="./data")
{'obs': [<xarray.DataArray (shape1)>, <xarray.DataArray (shape2)>]}
>>> read_from_npy(npy_files={"x": "data_x.npy", "y": "data_y.npy"}, npy_dir="./data")
{'x': <xarray.DataArray (shape_x)>, 'y': <xarray.DataArray (shape_y)>}
static read_from_pkl_dict(pkl_files, pkl_dir=None, default_name='obs', strict=True, dim_names=None)

Reads and processes data from pickle files and returns a DataFrame containing all data.

Parameters:
pkl_files: str, list, or dict

The pickle file(s) to be read. This can be a string (representing a single file), a list of strings (representing multiple files), or a dictionary, where keys are the names of different data sources and the values are lists of file names.

pkl_dir: str, optional

The directory where the pickle files are located. If not provided, the current directory is used.

default_name: str, optional

The default data source name. This is used when pkl_files is a string or a list. Default is “obs”.

strict: bool, optional

If True, the function will raise an exception if a file does not exist. If False, it will print a warning and continue with the remaining files. Default is True.

dim_names: list, optional

The names of the dimensions. This is used when converting the data to a DataArray. If not provided, default names are used.

Returns:
DataFrame

A DataFrame containing the data from all provided files. The DataFrame has a MultiIndex with ‘idx0’, ‘idx1’ and ‘date’ as index levels, and ‘obs’ and ‘source’ as columns. Each ‘source’ corresponds to a different data source (file).

Notes

The function reads the data from the pickle files and converts them into a DataFrame. For each file, it creates a MultiIndex DataFrame where the indices are a combination of two dimensions and dates extracted from the keys in the dictionary loaded from the pickle file.

The function assumes the dictionary loaded from the pickle file has keys that can be converted to dates with the format "YYYYMMDD". It also assumes that the values in the dictionary are 2D numpy arrays.

If pkl_files is a string or a list, the function treats them as files from a single data source and uses default_name as the source name. If it’s a dictionary, the keys are treated as data source names, and the values are lists of file names.

When multiple files are provided, the function concatenates the data along the date dimension.

static read_hdf(table, store=None, path=None, close=True, **select_kwargs)

Reads data from an HDF5 file, and returns a DataFrame.

This method can either read data directly from an open HDF5 store or from a provided file path. In case a file path is provided, it opens the HDF5 file in read mode, and closes it after reading, if ‘close’ is set to True.

Parameters:
table: str

The key or the name of the dataset in the HDF5 file.

store: pd.io.pytables.HDFStore, optional

An open HDF5 store. If provided, the method will directly read data from it. Default is None.

path: str, optional

The path to the HDF5 file. If provided, the method will open the HDF5 file in read mode, and read data from it. Default is None.

close: bool, optional

A flag that indicates whether to close the HDF5 store after reading the data. It is only relevant when ‘path’ is provided, in which case the default is True.

**select_kwargs: dict, optional

Additional keyword arguments that are passed to the ‘select’ method of the HDFStore object. This can be used to select only a subset of data from the HDF5 file.

Returns:
df: pd.DataFrame

A DataFrame containing the data read from the HDF5 file.

Raises:
AssertionError

If both ‘store’ and ‘path’ are None, or if ‘store’ is not an instance of pd.io.pytables.HDFStore.

Notes

Either ‘store’ or ‘path’ must be provided. If ‘store’ is provided, ‘path’ will be ignored.

Examples

>>> store = pd.HDFStore('data.h5')
>>> df = DataLoader.read_hdf(table='my_data', store=store)
>>> print(df)

classmethod row_select_bool(df, row_select=None, combine='AND', **kwargs)

Returns a boolean array indicating which rows of the DataFrame meet the specified conditions.

This class method applies a series of conditions, provided in the ‘row_select’ list, to the input DataFrame ‘df’. Each condition is represented by a dictionary that is used as input to the ‘_bool_numpy_from_where’ method.

All conditions are combined via an '&' (AND) operator: if all conditions for a given row are True, the return value for that row will be True; if any condition is not satisfied, it will be False.

If ‘row_select’ is None or an empty dictionary, all indices will be True.

Parameters:
df: DataFrame

The DataFrame to apply the conditions on.

row_select: list of dict, optional

A list of dictionaries, each representing a condition to apply to ‘df’. Each dictionary should contain the information needed for the ‘_bool_numpy_from_where’ method. If None or an empty dictionary, all indices in the returned array will be True.

verbose: bool or int, optional

If set to True or a number greater than or equal to 3, additional print statements will be executed.

kwargs: dict

Additional keyword arguments passed to the ‘_bool_numpy_from_where’ method.

Returns:
select: np.array of bool

A boolean array indicating which rows of the DataFrame meet the conditions. The length of the array is equal to the number of rows in ‘df’.

Raises:
AssertionError

If ‘row_select’ is not None, not a dictionary and not a list, or if any element in ‘row_select’ is not a dictionary.

Notes

The function is designed to work with pandas DataFrames.

If ‘row_select’ is None or an empty dictionary, the function will return an array with all elements set to True (indicating all rows of ‘df’ are selected).
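
A minimal usage sketch; the condition-dictionary keys shown ('col', 'comp', 'val') and the DataLoader module/class name are assumptions for illustration and should be checked against the '_bool_numpy_from_where' documentation:

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader  # module and class name assumed
>>> df = pd.DataFrame({"lat": [55.0, 65.0, 75.0], "obs": [0.1, 0.2, 0.3]})
>>> keep = DataLoader.row_select_bool(df, row_select=[{"col": "lat", "comp": ">=", "val": 60}])
>>> df_subset = df[keep]  # rows with lat >= 60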

classmethod write_to_hdf(df, store, table=None, append=False, config=None, run_info=None)
static write_to_netcdf(ds, path, mode='w', **to_netcdf_kwargs)

GPSat.dataprepper module

class GPSat.dataprepper.DataPrep

Bases: object

static bin_data(df, x_range=None, y_range=None, grid_res=None, x_col='x', y_col='y', val_col=None, bin_statistic='mean', bin_2d=True, return_bin_center=True)

Bins the data contained within a DataFrame into a grid, optionally computes a statistic on the binned data, and returns the binned data along with the bin edges or centers.

This method supports both 2D and 1D binning, allowing for various statistical computations on the binned values such as mean, median, count, etc.

Parameters:
dfpandas.DataFrame

The DataFrame containing the data to be binned.

x_rangetuple of float, optional

The minimum and maximum values of the x-axis to be binned. If not provided, a default range is used.

y_rangetuple of float, optional

The minimum and maximum values of the y-axis to be binned. Only required for 2D binning. If not provided, a default range is used.

grid_resfloat

The resolution of the grid in the same units as the x and y data. Defines the size of each bin.

x_colstr, default ‘x’

The name of the column in df that contains the x-axis values.

y_colstr, default ‘y’

The name of the column in df that contains the y-axis values. Ignored if bin_2d is False.

val_colstr

The name of the column in df that contains the values to be binned and aggregated.

bin_statisticstr, default ‘mean’

The statistic to compute on the binned data. Can be ‘mean’, ‘median’, ‘count’, or any other statistic supported by scipy.stats.binned_statistic or scipy.stats.binned_statistic_2d.

bin_2dbool, default True

If True, performs 2D binning using both x and y values. If False, performs 1D binning using only x values.

return_bin_centerbool, default True

If True, returns the center of each bin. If False, returns the edges of the bins.

Returns:
binned_datanumpy.ndarray

An array of the binned and aggregated data. The shape of the array depends on the binning dimensions and the grid resolution.

x_binnumpy.ndarray

An array of the x-axis bin centers or edges, depending on the value of return_bin_center.

y_binnumpy.ndarray, optional

An array of the y-axis bin centers or edges, only returned if bin_2d is True and return_bin_center is specified.

Raises:
AssertionError

If val_col or grid_res is not specified, or if the DataFrame df is empty. Also raises an error if the provided x_range or y_range are invalid or if the specified column names are not present in df.

Notes

  • The default x_range and y_range are set to [-4500000.0, 4500000.0] if not provided.

  • This method requires that val_col and grid_res be explicitly provided.

  • The binning process is influenced by the bin_statistic parameter, which determines how the values in each bin are aggregated.

  • When bin_2d is False, y_col is ignored and only x_col and val_col are used for binning.

  • The method ensures that the x_col, y_col, and val_col exist in the DataFrame df.
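
A minimal sketch of 2D binning on synthetic data (column names, ranges and resolution below are illustrative only):

>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataprepper import DataPrep
>>> rng = np.random.default_rng(0)
>>> df = pd.DataFrame({"x": rng.uniform(-1e6, 1e6, 500),
...                    "y": rng.uniform(-1e6, 1e6, 500),
...                    "obs": rng.normal(size=500)})
>>> vals, x_bin, y_bin = DataPrep.bin_data(df, val_col="obs", grid_res=200_000,
...                                        x_range=(-1e6, 1e6), y_range=(-1e6, 1e6))
>>> # vals holds the per-cell mean of "obs"; x_bin / y_bin are the bin centres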

classmethod bin_data_by(df, col_funcs=None, row_select=None, by_cols=None, val_col=None, x_col='x', y_col='y', x_range=None, y_range=None, grid_res=None, bin_statistic='mean', bin_2d=True, limit=10000, return_df=False, verbose=False)

Class method to bin data by given columns.

Parameters:
dfpandas.DataFrame

The dataframe containing the data to be binned.

col_funcsdict, optional

Dictionary with functions to be applied on the dataframe columns.

row_selectdict, optional

Dictionary with conditions to select rows of the dataframe.

by_colsstr, list, tuple, optional

Columns to be used for binning.

val_colstr, optional

Column with values to be used for binning.

x_colstr, optional

Name of the column to be used for x-axis, by default ‘x’.

y_colstr, optional

Name of the column to be used for y-axis, by default ‘y’.

x_rangelist, tuple, optional

Range for the x-axis binning.

y_rangelist, tuple, optional

Range for the y-axis binning.

grid_resfloat, optional

Grid resolution for the binning process.

bin_statisticstr or list, optional

Statistic(s) to compute (default is ‘mean’).

bin_2dbool, default True

If True, bin data on a 2D grid; otherwise, perform 1D binning using only ‘x’.

limitint, optional

Maximum number of unique values for the by_cols, by default 10000.

return_dfbool, default False

If True, return results in a DataFrame; otherwise, return an xarray Dataset.

verbosebool or int, optional

If True or integer larger than 0, print information about process.

Returns:
xarray.Dataset

An xarray.Dataset containing the binned data.
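
A minimal sketch of binning grouped by a date column, returning an xarray Dataset (values are illustrative):

>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataprepper import DataPrep
>>> rng = np.random.default_rng(1)
>>> df = pd.DataFrame({"x": rng.uniform(-1e6, 1e6, 400),
...                    "y": rng.uniform(-1e6, 1e6, 400),
...                    "obs": rng.normal(size=400),
...                    "date": np.repeat(["2020-03-01", "2020-03-02"], 200)})
>>> ds = DataPrep.bin_data_by(df, by_cols="date", val_col="obs",
...                           x_range=(-1e6, 1e6), y_range=(-1e6, 1e6),
...                           grid_res=200_000)
>>> # ds is an xarray.Dataset with a binned field per unique date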

GPSat.datetime_utils module

GPSat.datetime_utils.date_from_datetime(dt)

Remove the time component of an array of datetimes (represented as strings) and just return the date

The datetime format is expected to be YYYY-MM-DD HH:mm:SS. The returned date format is YYYYMMDD.

Parameters:
dt: list, np.array, pd.Series

strings with datetime format YYYY-MM-DD HH:mm:SS.

Returns:
numpy.ndarray: A date column with format YYYYMMDD.

Note

This function uses a lambda function to remove the time portion and the dash from the datetime column. It then returns a numpy array of the resulting date column. It is possible to use apply on a Series to achieve the same result, but it may not be as fast as using a lambda function and numpy array.
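
A short sketch of the behaviour described above (the exact output dtype is determined by the function):

>>> import numpy as np
>>> from GPSat.datetime_utils import date_from_datetime
>>> dt = np.array(["2020-03-01 12:34:56", "2020-03-02 01:02:03"])
>>> date_from_datetime(dt)  # expected to give array(['20200301', '20200302'], ...) per the description above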

GPSat.datetime_utils.datetime_from_float_column(float_datetime, epoch=(1950, 1, 1), time_unit='D')

Converts a float datetime column to a datetime64 format.

Parameters:
float_datetimepd.Series or np.array

A pandas series or numpy array containing float values, corresponding to datetime.

epochtuple, default is (1950, 1, 1).

A tuple representing the epoch date in the format (year, month, day).

time_unitstr, optional

The time unit of the float datetime values. Default is ‘D’ (days).

Returns:
numpy.ndarray

A numpy array of datetime64 values, with dtype ‘datetime64[s]’

Examples

>>> df = pd.DataFrame({'float_datetime': [18262.5, 18263.5, 18264.5]})
>>> datetime_from_float_column(df['float_datetime'])
array(['2000-01-01T12:00:00', '2000-01-02T12:00:00',
       '2000-01-03T12:00:00'], dtype='datetime64[s]')
>>> df = pd.DataFrame({'float_datetime': [18262.5, 18263.5, 18264.5]})
>>> datetime_from_float_column(df['float_datetime'], epoch=(1970, 1, 1))
array(['2020-01-01T12:00:00', '2020-01-02T12:00:00',
       '2020-01-03T12:00:00'], dtype='datetime64[s]')
>>> x = np.array([18262.5, 18263.5, 18264.5])
>>> datetime_from_float_column(x, epoch=(1970, 1, 1))
array(['2020-01-01T12:00:00', '2020-01-02T12:00:00',
       '2020-01-03T12:00:00'], dtype='datetime64[s]')
GPSat.datetime_utils.datetime_from_ymd_cols(year, month, day, hhmmss)

Converts separate columns/arrays of year, month, day, and time (in hhmmss format) into a numpy array of datetime objects.

Parameters:
yeararray-like

An array of integers representing the year.

montharray-like

An array of integers representing the month (1-12).

dayarray-like

An array of integers representing the day of the month.

hhmmssarray-like

An array of integers representing the time in hhmmss format.

Returns:
datetimenumpy.ndarray

An array of datetime objects representing the input dates and times.

Raises:
AssertionError

If the input arrays are not of equal length.

Examples

>>> year = [2021, 2021, 2021]
>>> month = [1, 2, 3]
>>> day = [10, 20, 30]
>>> hhmmss = [123456, 234537, 165648]
>>> datetime_from_ymd_cols(year, month, day, hhmmss)
array(['2021-01-10T12:34:56', '2021-02-20T23:45:37',
       '2021-03-30T16:56:48'], dtype='datetime64[s]')
GPSat.datetime_utils.from_file_start_end_datetime_GPOD(f, df)

Extract an implied sequence of evenly spaced time intervals based off of a ‘processed’ GPOD file name

This function takes in a file path and a pandas dataframe as input. It extracts the start and end datetime from the file name and calculates the time interval between them.

It then generates a datetime array with the same length as the dataframe, evenly spaced over the time interval. The resulting datetime array is returned.

Parameters:
f: str

filename

df: pd.DataFrame, pd.Series, np.array, tuple, list

the len(df) is used to determine the number and size of the intervals

Returns:
np.array

dtype datetime64[ns]

Examples

>>> f = "/path/to/S3A_GPOD_SAR__SRA_A__20191031T233355_20191101T002424_2019112_IL_v3.proc"
>>> df = pd.DataFrame({"x": np.arange(11)})
>>> from_file_start_end_datetime_GPOD(f, df)
array(['2019-10-31T23:33:55.000000000', '2019-10-31T23:38:57.900000000',
       '2019-10-31T23:44:00.800000000', '2019-10-31T23:49:03.700000000',
       '2019-10-31T23:54:06.600000000', '2019-10-31T23:59:09.500000000',
       '2019-11-01T00:04:12.400000000', '2019-11-01T00:09:15.300000000',
       '2019-11-01T00:14:18.200000000', '2019-11-01T00:19:21.100000000',
       '2019-11-01T00:24:24.000000000'], dtype='datetime64[ns]')
GPSat.datetime_utils.from_file_start_end_datetime_SARAL(f, df)

This function takes in a file path to a file and a pandas dataframe and returns a numpy array of datetime objects.

The file path is expected to be in the format of SARAL data files, with the datetime information encoded in the file name. The function extracts the start and end datetime information from the file name, calculates the time interval between them based on the length of the dataframe, and generates a numpy array of datetime objects with the same length as the dataframe.

Parameters:
f: str

the file path of the SARAL data file

df: pd.DataFrame

the data contained in the SARAL data file

Returns:
np.array

datetime objects, representing the time stamps of the data in the SARAL data file with dtype: ‘datetime64[s]’

Examples

>>> f = "/path/to/SARAL_C139_0036_20200331_234125_20200401_003143_CS2mss_IL_v1.proc"
>>> df = pd.DataFrame({"x": np.arange(11)})
>>> from_file_start_end_datetime_SARAL(f, df)
array(['2020-03-31T23:41:25', '2020-03-31T23:46:26',
       '2020-03-31T23:51:28', '2020-03-31T23:56:30',
       '2020-04-01T00:01:32', '2020-04-01T00:06:34',
       '2020-04-01T00:11:35', '2020-04-01T00:16:37',
       '2020-04-01T00:21:39', '2020-04-01T00:26:41',
       '2020-04-01T00:31:43'], dtype='datetime64[s]')

GPSat.decorators module

GPSat.decorators.timer(func)

This function is a decorator that can be used to time the execution of other functions.

It takes a function as an argument and returns a new function that wraps the original function.

When the wrapped function is called, it measures the time it takes to execute the original function and prints the result to the console.

The function uses the time.perf_counter() function to measure the time. This function returns the current value of a performance counter, which is a high-resolution timer that measures the time in seconds since a fixed point in time.

The wrapped function takes any number of positional and keyword arguments, which are passed on to the original function. The result of the original function is returned by the wrapped function.

The decorator also uses the functools.wraps() function to preserve the metadata of the original function, such as its name, docstring, and annotations. This makes it easier to debug and introspect the code.

To use the decorator, simply apply it to the function you want to time, like this:

@timer
def my_function():
    ...
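
A minimal usage sketch (the exact wording of the printed timing message is determined by the decorator):

>>> import time
>>> from GPSat.decorators import timer
>>> @timer
... def slow_add(a, b):
...     time.sleep(0.5)
...     return a + b
>>> result = slow_add(1, 2)  # prints the elapsed time to the console
>>> result
3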

GPSat.local_experts module

class GPSat.local_experts.LocalExpertData(obs_col: str | None = None, coords_col: list | None = None, global_select: list | None = None, local_select: list | None = None, where: list | None = None, row_select: list | None = None, col_select: list | None = None, col_funcs: list | None = None, table: str | None = None, data_source: str | None = None, engine: str | None = None, read_kwargs: dict | None = None)

Bases: object

col_funcs: list | None = None
col_select: list | None = None
coords_col: list | None = None
data_source: str | None = None
engine: str | None = None
file_suffix_engine_map = {'csv': 'read_csv', 'h5': 'HDFStore', 'nc': 'netcdf4', 'tsv': 'read_csv', 'zarr': 'zarr'}
global_select: list | None = None
load(where=None, verbose=False, **kwargs)
local_select: list | None = None
obs_col: str | None = None
read_kwargs: dict | None = None
row_select: list | None = None
set_data_source(verbose=False)
table: str | None = None
where: list | None = None
class GPSat.local_experts.LocalExpertOI(expert_loc_config: Dict | ExpertLocsConfig | None = None, data_config: Dict | DataConfig | None = None, model_config: Dict | ModelConfig | None = None, pred_loc_config: Dict | PredictionLocsConfig | None = None, local_expert_config: ExperimentConfig | None = None)

Bases: object

This provides the main interface for conducting an experiment in GPSat to predict an underlying field from satellite measurements using local Gaussian process (GP) models.

This proceeds by iterating over the local expert locations, training the local GPs on data in a neighbourhood of the expert location and making predictions on specified locations. The results will be saved in an HDF5 file.

Example usage:

>>> store_path = "/path/to/store.h5"
>>> locexp = LocalExpertOI(data_config, model_config, expert_loc_config, pred_loc_config)
>>> locexp.run(store_path=store_path) # Run full sweep and save results in store_path
static dict_of_array_to_table(x, ref_loc=None, concat=False, table=None, default_dim=1)

Given a dictionary of numpy arrays, create DataFrame(s) with ref_loc as the multi-index.

file_suffix_engine_map = {'csv': 'read_csv', 'h5': 'HDFStore', 'nc': 'netcdf4', 'tsv': 'read_csv', 'zarr': 'zarr'}
load_params(model, previous=None, previous_params=None, file=None, param_names=None, ref_loc=None, index_adjust=None, table_suffix='', **param_dict)
plot_locations_and_obs(image_file, obs_col=None, lat_col='lat', lon_col='lon', exprt_lon_col='lon', exprt_lat_col='lat', sort_by='date', col_funcs=None, xrpt_loc_col_funcs=None, vmin=None, vmax=None, s=0.5, s_exprt_loc=250, cbar_label='Input Observations', cmap='YlGnBu_r', figsize=(15, 15), projection=None, extent=None)
run(store_path=None, store_every=10, check_config_compatible=True, skip_valid_checks_on=None, optimise=True, predict=True, min_obs=3, table_suffix='')

Run a full sweep to perform local optimal interpolation at every expert location. The results will be stored in an HDF5 file containing (1) the predictions at each location, (2) parameters of the model at each location, (3) run details such as run times, and (4) the full experiment configuration.

Parameters:
store_path: str

File path where results should be stored as HDF5 file.

store_every: int, default 10

Results will be stored to file after every store_every expert locations. Reduce this value if optimisation is slow; must be greater than 1.

check_config_compatible: bool, default True

Check if current LocalExpertOI configuration is compatible with previous, if applicable. If file exists in store_path, it will check the oi_config attribute in the oi_config table to ensure that configurations are compatible.

skip_valid_checks_on: list, optional

When checking if config is compatible, skip keys specified in this list.

optimise: bool, default True

If True, will run model.optimise_parameters() to learn the model parameters at each expert location.

predict: bool, default True

If True, will run model.predict() to make predictions at the locations specified in the prediction locations configuration.

min_obs: int, default 3

Minimum number observations required to run optimisation or make predictions.

table_suffix: str, optional

Suffix to be appended to all table names when writing to file.

Returns:
None

Notes

  • By default, both training and inference are performed at every location. However, one can opt to do only one of the two with the optimise and predict options, respectively.

  • If check_config_compatible is set to True, it makes sure that all results saved to store_path use the same configurations. That is, if one re-runs an experiment with a different configuration but pointing to the same store_path, it will return an error. Make sure that if you run an experiment with a different configuration, either set a different store_path, or if you want to override the results, delete the generated store_path.

  • The table_suffix is useful for storing multiple results in a single HDF5 file, each with a different suffix. See <hyperparameter smoothing> for an example use case.

set_data(**kwargs)
set_expert_locations(df=None, file=None, source=None, where=None, add_data_to_col=None, col_funcs=None, keep_cols=None, col_select=None, row_select=None, sort_by=None, reset_index=False, source_kwargs=None, verbose=False, **kwargs)
set_model(oi_model=None, init_params=None, constraints=None, load_params=None, optim_kwargs=None, pred_kwargs=None, params_to_store=None, replacement_threshold=None, replacement_model=None, replacement_init_params=None, replacement_constraints=None, replacement_optim_kwargs=None, replacement_pred_kwargs=None)
set_pred_loc(**kwargs)
GPSat.local_experts.get_results_from_h5file(results_file, global_col_funcs=None, merge_on_expert_locations=True, select_tables=None, table_suffix='', add_suffix_to_table=True, verbose=False)

Retrieve results from an HDF5 file.

Parameters:
results_file: str

The location where the results file is saved. Must point to an HDF5 file with the file extension .h5.

select_tables: list, optional

A list of table names to select from the HDF5 file.

global_col_funcs: dict, optional

A dictionary of column functions to apply to selected tables.

merge_on_expert_locations: bool, default True

Whether to merge expert location data with results data.

table_suffix: str, optional

A suffix to add to selected table names.

add_suffix_to_table: bool, default True

Whether to add the table suffix to selected table names.

verbose: bool, default False

Set verbosity.

Returns:
tuple:

A tuple containing two elements:

  1. dict: A dictionary of DataFrames where each table name is the key. This contains the predictions and learned model parameters at every location.

  2. list: A list of configuration dictionaries.

Notes

  • This function reads data from an HDF5 file, applies optional column functions, and optionally merges expert location data with results data.

  • The 'select_tables' parameter allows you to choose specific tables from the HDF5 file.

  • Column functions specified in 'global_col_funcs' can be applied to selected tables.

  • Expert location data can be merged onto results data if 'merge_on_expert_locations' is set to True.
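
A minimal usage sketch, assuming 'results.h5' was produced by LocalExpertOI.run; the available table names depend on the model and configuration used:

>>> from GPSat.local_experts import get_results_from_h5file
>>> dfs, configs = get_results_from_h5file("results.h5")
>>> list(dfs.keys())        # available tables, e.g. predictions and hyperparameters
>>> preds = dfs["preds"]    # "preds" is an assumed table name, for illustration only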

GPSat.plot_utils module

GPSat.plot_utils.get_projection(projection=None)
GPSat.plot_utils.plot_gpflow_minimal_example(model: object, model_init: object = None, opt_params: object = None, pred_params: object = None) object

Run a basic usage example for a given model. Model will be initialised, parameters will be optimised and predictions will be made for the minimal model example found (as of 2023-05-04):

https://gpflow.github.io/GPflow/2.8.0/notebooks/getting_started/basic_usage.html

Methods called are: optimise_parameters, predict, get_parameters

The predict method is expected to return a dict with ‘f*’, ‘f*_var’ and ‘y_var’ as np.arrays.

Parameters:
model: any model inherited from BaseGPRModel
model_init: dict or None, default None

dict of parameters to be provided when model is initialised. If None default parameters are used

opt_params: dict or None, default None

dict of parameters to be passed to optimise_parameter method. If None default parameters are used

pred_params: dict or None, default None

dict of parameters to be passed to predict method. If None default parameters are used

Returns:
tuple:

A tuple containing the predictions dict and the parameters dict.

GPSat.plot_utils.plot_hist(ax, data, title='Histogram / Density', ylabel=None, xlabel=None, select_bool=None, stats_values=None, stats_loc=(0.2, 0.9), drop_nan_inf=True, q_vminmax=None, rasterized=False)
GPSat.plot_utils.plot_hist_from_results_data(ax, dfs, table, val_col, load_kwargs=None, plot_kwargs=None, verbose=False, **kwargs)
GPSat.plot_utils.plot_hyper_parameters(dfs, coords_col, row_select=None, table_names=None, table_suffix='', plot_template: dict | None = None, plots_per_row=3, suptitle='hyper params', qvmin=0.01, qvmax=0.99)
GPSat.plot_utils.plot_pcolormesh(ax, lon, lat, plot_data, fig=None, title=None, vmin=None, vmax=None, qvmin=None, qvmax=None, cmap='YlGnBu_r', cbar_label=None, scatter=False, extent=None, ocean_only=False, **scatter_args)
GPSat.plot_utils.plot_pcolormesh_from_results_data(ax, dfs, table, val_col, lon_col=None, lat_col=None, x_col=None, y_col=None, lat_0=90, lon_0=0, fig=None, load_kwargs=None, plot_kwargs=None, weighted_values_kwargs=None, verbose=False, **kwargs)
GPSat.plot_utils.plot_wrapper(plt_df, val_col, lon_col='lon', lat_col='lat', scatter_plot_size=2, plt_where=None, projection=None, extent=None, max_obs=1000000.0, vmin_max=None, q_vminmax=None, abs_vminmax=False, stats_loc=None, figsize=None, where_sep='\n ')
GPSat.plot_utils.plot_xy(ax, x, y, title=None, y_label=None, x_label=None, xtick_rotation=45, scatter=False, **kwargs)
GPSat.plot_utils.plot_xy_from_results_data(ax, dfs, table, x_col, y_col, load_kwargs=None, plot_kwargs=None, verbose=False, **kwargs)
GPSat.plot_utils.plots_from_config(plot_configs, dfs: dict[str, DataFrame], plots_per_row: int = 3, num_plots_row_col_size: dict[int, dict] | None = None, suptitle: str = '')

GPSat.postprocessing module

class GPSat.postprocessing.SmoothingConfig(l_x: int | float = 1, l_y: int | float = 1, max: int | float = None, min: int | float = None)

Bases: object

Configuration used for hyperparameter smoothing.

Attributes:
l_x: int or float, default 1

The lengthscale (x-direction) parameter for Gaussian smoothing.

l_y: int or float, default 1

The lengthscale (y-direction) parameter for Gaussian smoothing.

max: int or float, optional

Maximal value that the hyperparameter can take.

min: int or float, optional

Minimal value that the hyperparameter can take.

Notes

This configuration is used to smooth 2D hyperparameter fields.

get(key, default=None)
l_x: int | float = 1
l_y: int | float = 1
max: int | float = None
min: int | float = None
GPSat.postprocessing.get_smooth_params_config()
GPSat.postprocessing.glue_local_predictions(preds_df: DataFrame, inference_radius: DataFrame, R: int | float | list = 3) DataFrame

DEPRECATED. See glue_local_predictions_1d and glue_local_predictions_2d. Glues overlapping predictions by taking a normalised Gaussian weighted average.

WARNING: This method only deals with expert locations on a regular grid

Parameters:
preds_df: pd.DataFrame

containing predictions generated from local expert OI. It should have the following columns:

  • pred_loc_x (float): The x-coordinate of the prediction location.

  • pred_loc_y (float): The y-coordinate of the prediction location.

  • f* (float): The predictive mean at the location (pred_loc_x, pred_loc_y).

  • f*_var (float): The predictive variance at the location (pred_loc_x, pred_loc_y).

expert_locs_df: pd.DataFrame

containing local expert locations used to perform OI. It should have the following columns:

  • x (float): The x-coordinate of the expert location.

  • y (float): The y-coordinate of the expert location.

sigma: int, float, or list, default 3

The standard deviation of the Gaussian weighting in the x and y directions. If a single value is provided, it is used for both directions. If a list is provided, the first value is used for the x direction and the second value is used for the y direction. Defaults to 3.

Returns:
pd.DataFrame:

dataframe consisting of glued predictions (mean and std). It has the following columns:

  • pred_loc_x (float): The x-coordinate of the prediction location.

  • pred_loc_y (float): The y-coordinate of the prediction location.

  • f* (float): The glued predictive mean at the location (pred_loc_x, pred_loc_y).

  • f*_std (float): The glued predictive standard deviation at the location (pred_loc_x, pred_loc_y).

Notes

The function assumes that the expert locations are equally spaced in both the x and y directions. The function uses the scipy.stats.norm.pdf function to compute the Gaussian weights. The function normalizes the weighted sums with the total weights at each location.

GPSat.postprocessing.glue_local_predictions_1d(preds_df: DataFrame, pred_loc_col: str, xprt_loc_col: str, vars_to_glue: str | List[str], inference_radius: int | float | dict, R=3) DataFrame

Glues together overlapping local expert predictions in 1D by Gaussian-weighted averaging.

Parameters:
preds_df: pandas dataframe

A dataframe containing the results of local expert predictions. The dataframe should have columns containing the (1) prediction locations, (2) expert locations, and (3) any predicted variables we wish to glue (e.g. the predictive mean).

pred_loc_col: str

The column in the results dataframe corresponding to the prediction locations

xprt_loc_col: str

The column in the results dataframe corresponding to the local expert locations

vars_to_glue: str | list of strs

The column(s) corresponding to variables we wish to glue (e.g. the predictive mean and variance).

inference_radius: int | float | dict

The inference radius for each local expert. If specified as a dict, the keys should be the expert locations and the values should be the corresponding inference radius of each expert. If specified as an int or float, all experts are assumed to have the same inference radius.

R: int | float, default 3

A weight controlling the standard deviation of the Gaussian weights. The standard deviation will be given by the formula std = inference_radius / R. The default value of 3 will place 99% of the Gaussian mass within the inference radius.

Returns:
pandas dataframe

A dataframe of glued predictions, whose columns contain (1) the prediction locations and (2) the glued variables.
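
A minimal sketch with two overlapping 1D experts (column names and values are illustrative):

>>> import pandas as pd
>>> from GPSat.postprocessing import glue_local_predictions_1d
>>> preds_df = pd.DataFrame({
...     "pred_loc":   [0.0, 1.0, 2.0, 1.0, 2.0, 3.0],  # prediction locations
...     "expert_loc": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],  # two experts, at 1.0 and 2.0
...     "f*":         [0.9, 1.1, 1.3, 1.0, 1.2, 1.4],  # overlapping predictive means
... })
>>> glued = glue_local_predictions_1d(preds_df,
...                                   pred_loc_col="pred_loc",
...                                   xprt_loc_col="expert_loc",
...                                   vars_to_glue="f*",
...                                   inference_radius=2.0)
>>> # glued contains one Gaussian-weighted "f*" value per unique prediction location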

GPSat.postprocessing.glue_local_predictions_2d(preds_df: DataFrame, pred_loc_cols: List[str], xprt_loc_cols: List[str], vars_to_glue: str | List[str], inference_radius: int | float | dict, R=3) DataFrame

Glues together overlapping local expert predictions in 2D by Gaussian-weighted averaging.

Parameters:
preds_df: pandas dataframe

A dataframe containing the results of local expert predictions. The dataframe should have columns containing the (1) prediction locations, (2) expert locations, and (3) any predicted variables we wish to glue (e.g. the predictive mean).

pred_loc_cols: list of strs

The xy-columns in the results dataframe corresponding to the prediction locations

xprt_loc_cols: list of strs

The xy-columns in the results dataframe corresponding to the local expert locations

vars_to_glue: str | list of strs

The column(s) corresponding to variables we wish to glue (e.g. the predictive mean and variance).

inference_radius: int | float

The inference radius for each local expert. All experts are assumed to have the same inference radius.

R: int | float, default 3

A weight controlling the standard deviation of the Gaussian weights. The standard deviation will be given by the formula std = inference_radius / R. The default value of 3 will place 99% of the Gaussian mass within the inference radius.

Returns:
pandas dataframe

A dataframe of glued predictions, whose columns contain (1) the prediction locations and (2) the glued variables.

GPSat.postprocessing.smooth_hyperparameters(result_file: str, params_to_smooth: List[str], smooth_config_dict: Dict[str, SmoothingConfig], xy_dims: List[str] = ['x', 'y'], reference_table_suffix: str = '', table_suffix: str = '_SMOOTHED', output_file: str = None, model_name: str = None, save_config_file: bool = True)

Smooth hyperparameters in an HDF5 results file using Gaussian smoothing.

Parameters:
result_file: str

The path to the HDF5 results file.

params_to_smooth: list of str

A list of hyperparameters to be smoothed.

smooth_config_dict: Dict[str, SmoothingConfig]

A dictionary specifying smoothing configurations for each hyperparameter. This should be a dictionary where keys are hyperparameter names, and values are instances of the SmoothingConfig class specifying smoothing parameters.

xy_dims: list of str, default [‘x’, ‘y’]

The dimensions to use for smoothing (default: ['x', 'y']).

reference_table_suffix: str, default “”

The suffix to use for reference table names (default: "").

table_suffix: str, default “_SMOOTHED”

The suffix to add to smoothed hyperparameter table names (default: "_SMOOTHED").

output_file: str, optional

The path to the output HDF5 file to store smoothed hyperparameters.

model_name: str, optional

The name of the model for which hyperparameters are being smoothed.

save_config_file: bool, optional

Whether to save a configuration file for making predictions with smoothed values.

Returns:
None

Notes

  • This function applies Gaussian smoothing to specified hyperparameters in an HDF5 results file.

  • The output_file parameter allows you to specify a different output file for storing the smoothed hyperparameters.

  • If model_name is not provided, it will be determined from the input HDF5 file.

  • If save_config_file is True, a configuration file for making predictions with smoothed values will be saved.
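
A minimal sketch; the hyperparameter names ("lengthscales", "kernel_variance") depend on the model stored in the results file and are shown here only for illustration:

>>> from GPSat.postprocessing import SmoothingConfig, smooth_hyperparameters
>>> smooth_config = {
...     "lengthscales":    SmoothingConfig(l_x=200_000, l_y=200_000),
...     "kernel_variance": SmoothingConfig(l_x=200_000, l_y=200_000, min=1e-6),
... }
>>> smooth_hyperparameters(result_file="results.h5",
...                        params_to_smooth=list(smooth_config.keys()),
...                        smooth_config_dict=smooth_config)
>>> # smoothed tables are written back with the "_SMOOTHED" suffix by default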

GPSat.prediction_locations module

class GPSat.prediction_locations.PredictionLocations(method='expert_loc', coords_col=None, expert_loc=None, **kwargs)

Bases: object

property coords_col
property expert_loc

GPSat.read_and_store module

GPSat.read_and_store.update_attr(x, cid, vals)

GPSat.utils module

GPSat.utils.EASE2toWGS84(x, y, return_vals='both', lon_0=0, lat_0=90)

Converts EASE2 grid coordinates to WGS84 longitude and latitude coordinates.

Parameters:
x: float

EASE2 grid x-coordinate in meters.

y: float

EASE2 grid y-coordinate in meters.

return_vals: str, optional

Determines what values to return. Valid options are "both" (default), "lon", or "lat".

lon_0: float, optional

Longitude of the center of the EASE2 grid in degrees. Default is 0.

lat_0: float, optional

Latitude of the center of the EASE2 grid in degrees. Default is 90.

Returns:
tuple or float

Depending on the value of return_vals, either a tuple of WGS84 longitude and latitude coordinates (both floats), or a single float representing either the longitude or latitude.

Raises:
AssertionError

If return_vals is not one of the valid options.

Examples

>>> EASE2toWGS84(1000000, 2000000)
(153.434948822922, 69.86894542225777)
GPSat.utils.EASE2toWGS84_New(*args, **kwargs)
GPSat.utils.WGS84toEASE2(lon, lat, return_vals='both', lon_0=0, lat_0=90)

Converts WGS84 longitude and latitude coordinates to EASE2 grid coordinates.

Parameters:
lonfloat

Longitude coordinate in decimal degrees.

latfloat

Latitude coordinate in decimal degrees.

return_valsstr, optional

Determines what values to return. Valid options are "both" (default), "x", or "y".

lon_0float, optional

Longitude of the center of the EASE2 grid in decimal degrees. Default is 0.

lat_0float, optional

Latitude of the center of the EASE2 grid in decimal degrees. Default is 90.

Returns:
float

If return_vals is "x". Returns the x EASE2 grid coordinate in meters.

float

If return_vals is "y". Returns the y EASE2 grid coordinate in meters

tuple of float

If return_vals is "both". Returns a tuple of (x, y) EASE2 grid coordinates in meters.

Raises:
AssertionError

If return_vals is not one of the valid options.

Examples

>>> WGS84toEASE2(-105.01621, 39.57422)
(-5254767.014984061, 1409604.1043472202)
GPSat.utils.WGS84toEASE2_New(*args, **kwargs)
GPSat.utils.array_to_dataframe(x, name, dim_prefix='_dim_', reset_index=False)

Converts a numpy array to a pandas DataFrame with a multi-index based on the array’s dimensions.

(Also see dataframe_to_array)

Parameters:
xnp.ndarray

The numpy array to be converted to a DataFrame.

namestr

The name of the column in the resulting DataFrame.

dim_prefixstr, optional

The prefix to be used for the dimension names in the multi-index. Default is "_dim_". Integers will be appended to dim_prefix for each dimension of x, i.e. if x is 2d, it will have dimension names "_dim_0", "_dim_1", assuming default dim_prefix is used.

reset_indexbool, optional

Whether to reset the index of the resulting DataFrame. Default is False.

Returns:
outpd.DataFrame

The resulting DataFrame with a multi-index based on the dimensions of the input array.

Raises:
AssertionError

If the input is not a numpy array.

Examples

>>> # express a 2d numpy array in DataFrame
>>> x = np.array([[1, 2], [3, 4]])
>>> array_to_dataframe(x, "data")
                data
_dim_0 _dim_1
0      0        1
       1        2
1      0        3
       1        4
GPSat.utils.assign_category_col(val, df, categories=None)

Generate categorical pd.Series equal in length to a reference DataFrame (df)

Parameters:
valstr

The value to assign to the categorical Series.

dfpandas DataFrame

reference DataFrame, used to determine length of output

categorieslist, optional

A list of categories to be used for the categorical column.

Returns:
pandas Categorical Series

A categorical column with the assigned value and specified categories (if provided).

Notes

This function creates a new categorical Series, equal in length to the reference DataFrame, filled with the specified value and using the specified categories. If categories are not provided, they will be inferred from the data. The function returns a pandas Categorical object representing the new column.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
>>> x_series = assign_category_col('x', df)
GPSat.utils.bin_obs_by_date(df, val_col, date_col='date', all_dates_in_range=True, x_col='x', y_col='y', grid_res=None, date_col_format='%Y%m%d', x_min=-4500000.0, x_max=4500000.0, y_min=-4500000.0, y_max=4500000.0, n_x=None, n_y=None, bin_statistic='mean', verbose=False)

This function takes in a pandas DataFrame and bins the data based on the values in a specified column and the x and y coordinates in other specified columns. The data is binned based on a grid with a specified resolution or number of bins. The function returns a dictionary of binned values for each unique date in the DataFrame.

Parameters:
df: pandas DataFrame

A DataFrame containing the data to be binned.

val_col: string

Name of the column containing the values to be binned.

date_col: string, default “date”

Name of the column containing the dates for which to bin the data.

all_dates_in_range: boolean, default True

Whether to include all dates in the range of the DataFrame.

x_col: string, default “x”

Name of the column containing the x coordinates.

y_col: string, default “y”

Name of the column containing the y coordinates.

grid_res: float or int, default None

Resolution of the grid in kilometers. If None, then n_x and n_y must be specified.

date_col_format: string, default “%Y%m%d”

Format of the date column.

x_min: float, default -4500000.0

Minimum x value for the grid.

x_max: float, default 4500000.0

Maximum x value for the grid.

y_min: float, default -4500000.0

Minimum y value for the grid.

y_max: float, default 4500000.0

Maximum y value for the grid.

n_x: int, default None

Number of bins in the x direction.

n_y: int, default None

Number of bins in the y direction.

bin_statistic: string or callable, default “mean”

Statistic to compute in each bin.

verbose: boolean, default False

Whether to print additional information during execution.

Returns:
bvals: dictionary

The binned values for each unique date in the DataFrame.

x_edge: numpy array

x values for the edges of the bins.

y_edge: numpy array

y values for the edges of the bins.

Notes

The x and y coordinates are swapped in the returned binned values due to the transpose operation used in the function.

GPSat.utils.check_prev_oi_config(prev_oi_config, oi_config, skip_valid_checks_on=None)

This function checks if the previous configuration matches the current one. It takes in two dictionaries, prev_oi_config and oi_config, which represent the previous and current configurations respectively.

The function also takes an optional list skip_valid_checks_on, which contains keys that should be skipped during the comparison.

Parameters:
prev_oi_config: dict

Previous configuration to be compared against.

oi_config: dict

Current configuration to compare against prev_oi_config.

skip_valid_checks_on: list or None, default None

If not None, should be a list of keys to not check.

Returns:
None

Notes

  • If skip_valid_checks_on is not provided, it defaults to an empty list. The function then compares the two configurations and raises an AssertionError if any key-value pairs do not match.

  • If the configurations do not match exactly, an AssertionError is raised.

  • This function assumes that the configurations are represented as dictionaries and that the keys in both dictionaries are the same.

GPSat.utils.compare_dataframes(df1, df2, merge_on, columns_to_compare, drop_other_cols=False, how='outer', suffixes=['_1', '_2'])
GPSat.utils.config_func(func, source=None, args=None, kwargs=None, col_args=None, col_kwargs=None, df=None, filename_as_arg=False, filename=None, col_numpy=True)

Apply a function based on configuration input.

The aim is to allow one to apply a function, possibly on data from a DataFrame, using a specification that can be stored in a JSON configuration file.

Note

  • This function uses eval() so could allow for arbitrary code execution.

  • If DataFrame df is provided, then can provide input (col_args and/or col_kwargs) based on columns of df.

Parameters:
func: str or callable.
  • If str, it will use eval(func) to convert it to a function.

  • If it contains one of "|", "&", "=", "+", "-", "*", "/", "%", "<", and ">", it will create a lambda function:

lambda arg1, arg2: eval(f"arg1 {func} arg2")
  • If eval(func) raises NameError and source is not None, it will run

f"from {source} import {func}"

and try again. This is to allow import function from a source.

source: str or None, default None

Package name where func can be found, if applicable. Used to import func from a package. e.g.

>>> GPSat.utils.config_func(func="cumprod", source="numpy", ...)

calls the function cumprod from the package numpy.

args: list or None, default None

If None, an empty list will be used, i.e. no args will be used. The values will be unpacked and provided to func: i.e. func(*args, **kwargs)

kwargs: dict or None, default None

If dict, it will be unpacked (**kwargs) to provide key word arguments to func.

col_args: None or list of str, default None

If DataFrame df is provided, it can use col_args to specify which columns of df will be passed into func as arguments.

col_kwargs: None or dict, default is None

Keyword arguments to be passed to func specified as dict whose keys are parameters of func and values are column names of a DataFrame df. Only applicable if df is provided.

df: DataFrame or None, default None

To provide if one wishes to use columns of a DataFrame as arguments to func.

filename_as_arg: bool, default False

Set True if filename is used as an argument to func.

filename: str or None, default None

If filename_as_arg is True, then will provide filename as first arg.

col_numpy: bool, default True

If True, when extracting columns from DataFrame, .values is used to convert to numpy array.

Returns:
any

Values returned by applying func on data. The type depends on func.

Raises:
AssertionError

If kwargs is not a dict, col_kwargs is not a dict, or func is not a string or callable.

AssertionError

If df is not provided but col_args or col_kwargs are.

AssertionError

If func is a string and cannot be imported on its own and source is None.

Examples

>>> import pandas as pd
>>> from GPSat.utils import config_func
>>> config_func(func="lambda x, y: x + y", args=[1, 1]) # Computes 1 + 1
2
>>> config_func(func="==", args=[1, 1]) # Computes 1 == 1
True

Using columns of a DataFrame as inputs:

>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> config_func(func="lambda x, y: x + y", df=df, col_args=["A", "B"]) # Computes df["A"] + df["B"]
array([5, 7, 9])
>>> config_func(func="<=", col_args=["A", "B"], df=df) # Computes df["A"] <= df["B"]
array([ True,  True,  True])

We can also use functions from an external package by specifying source. For example, the below reproduces the last example in numpy.cumprod:

>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> config_func(func="cumprod", source="numpy", df=df, kwargs={"axis": 0}, col_args=[["A", "B"]])
array([[  1,   4],
       [  2,  20],
       [  6, 120]])
GPSat.utils.convert_lon_lat_str(x)

Converts a string representation of longitude or latitude to a float value.

Parameters:
x: str

A string representation of longitude or latitude in the format of "[degrees] [minutes] [direction]", where [direction] is one of "N", "S", "E", or "W".

Returns:
float

The converted value of the input string as a float.

Raises:
AssertionError

If the input is not a string.

Examples

>>> convert_lon_lat_str('74 0.1878 N')
74.00313
>>> convert_lon_lat_str('140 0.1198 W')
-140.001997
GPSat.utils.cprint(x, c='ENDC', bcolors=None, sep=' ', end='\n')

Add color to print statements.

Based off of https://stackoverflow.com/questions/287871/how-do-i-print-colored-text-to-the-terminal.

Parameters:
x: str

String to be printed.

c: str, default “ENDC”

Valid key in bcolors. If bcolors is not provided, then default will be used, containing keys: 'HEADER', 'OKBLUE', 'OKCYAN', 'OKGREEN', 'WARNING', 'FAIL', 'ENDC', 'BOLD', 'UNDERLINE'.

bcolors: dict or None, default None

Dict with values being colors / how to format the font. These can be chained together. See the codes in: https://en.wikipedia.org/wiki/ANSI_escape_code#3-bit_and_4-bit.

sep: str, default “ “

sep argument passed along to print().

end: str, default “\n”

end argument passed along to print().

Returns:
None
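
A short usage sketch using two of the default colour keys listed above:

>>> from GPSat.utils import cprint
>>> cprint("optimisation finished", c="OKGREEN")
>>> cprint("check your inputs", c="WARNING")
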
GPSat.utils.dataframe_to_2d_array(df, x_col, y_col, val_col, tol=1e-09, fill_val=nan, dtype=None, decimals=1)

Extract values from DataFrame to create a 2-d array of values (val_col) - assuming the values came from a 2-d array. Requires dimension columns x_col, y_col (do not have to be ordered in DataFrame).

Parameters:
df: pandas.DataFrame

The dataframe to convert to a 2D array.

x_col: str

The name of the column in the dataframe that contains the x coordinates.

y_col: str

The name of the column in the dataframe that contains the y coordinates.

val_col: str

The name of the column in the dataframe that contains the values to be placed in the 2D array.

tol: float, default 1e-9

The tolerance for matching the x and y coordinates to the grid.

fill_val: float, default np.nan

The value to fill the 2D array with if a coordinate is missing.

dtype: str or numpy.dtype or None, default None

The data type of the values in the 2D array.

decimals: int, default 1

The number of decimal places to round x and y values to before taking unique. If decimals is negative, it specifies the number of positions to the left of the decimal point.

Returns:
tuple

A tuple containing the 2D numpy array of values, the x coordinates of the grid, and the y coordinates of the grid.

Raises:
AssertionError

If any of the required columns are missing from the dataframe, or if any coordinates have more than one value.

Notes

  • The spacing of grid is determined by the smallest step size in the x_col, y_col direction, respectively.

  • This is meant to reverse the process of putting values from a regularly spaced grid into a DataFrame. Do not expect this to work on arbitrary x,y coordinates.
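
A minimal sketch, re-assembling values that originally came from a regular 2x2 grid (values are illustrative):

>>> import pandas as pd
>>> from GPSat.utils import dataframe_to_2d_array
>>> df = pd.DataFrame({"x":   [0.0, 1.0, 0.0, 1.0],
...                    "y":   [0.0, 0.0, 1.0, 1.0],
...                    "val": [1.0, 2.0, 3.0, 4.0]})
>>> arr, x_grid, y_grid = dataframe_to_2d_array(df, x_col="x", y_col="y", val_col="val")
>>> # arr is the 2-d array of "val"; x_grid and y_grid give the grid coordinates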

GPSat.utils.dataframe_to_array(df, val_col, idx_col=None, dropna=True, fill_val=nan)

Converts a pandas DataFrame to a numpy array, where the DataFrame has columns that represent dimensions of the array and the DataFrame rows represent values in the array.

Parameters:
dfpandas DataFrame

The DataFrame containing values to convert to a numpy ndarray.

val_colstr

The name of the column in the DataFrame that contains the values to be placed in the array.

idx_colstr or list of str or None, default None

The name(s) of the column(s) in the DataFrame that represent the dimensions of the array. If not provided, the index of the DataFrame will be used as the dimension(s).

dropnabool, default True

Whether to drop rows with missing values before converting to the array.

fill_valscalar, default np.nan

The value to fill in the array for missing values.

Returns:
numpy array

The resulting numpy array.

Raises:
AssertionError

If the dimension values are not integers or have gaps, or if the idx_col parameter contains column names that are not in the DataFrame.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from GPSat.utils import dataframe_to_array
>>> df = pd.DataFrame({
...     'dim1': [0, 0, 1, 1],
...     'dim2': [0, 1, 0, 1],
...     'values': [1, 2, 3, 4]
... })
>>> arr = dataframe_to_array(df, 'values', ['dim1', 'dim2'])
>>> print(arr)
[[1 2]
 [3 4]]
GPSat.utils.dict_of_array_to_dict_of_dataframe(array_dict, concat=False, reset_index=False)

Converts a dictionary of arrays to a dictionary of pandas DataFrames.

Parameters:
array_dictdict

A dictionary where the keys are strings and the values are numpy arrays.

concatbool, optional

If True, concatenates DataFrames with the same number of dimensions. Default is False.

reset_indexbool, optional

If True, resets the index of each DataFrame. Default is False.

Returns:
dict

A dictionary where the keys are strings and the values are pandas DataFrames.

Notes

This function uses the array_to_dataframe function to convert each array to a DataFrame. If concat is True, it will concatenate DataFrames with the same number of dimensions. If reset_index is True, it will reset the index of each DataFrame.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> array_dict = {'a': np.array([1, 2, 3]), 'b': np.array([[1, 2], [3, 4]]), 'c': np.array([1.1, 2.2, 3.3])}
>>> dict_of_array_to_dict_of_dataframe(array_dict)
{'a':       a
    _dim_0   
    0       1
    1       2
    2       3,
'b':               b
    _dim_0 _dim_1   
    0      0       1
           1       2
    1      0       3
           1       4,
'c':        c
    _dim_0     
    0       1.1
    1       2.2
    2       3.3}
>>> dict_of_array_to_dict_of_dataframe(array_dict, concat=True)
{1:         a    c
    _dim_0
    0       1  1.1
    1       2  2.2
    2       3  3.3,
2:                 b
    _dim_0 _dim_1
    0      0       1
           1       2
    1      0       3
           1       4}
>>> dict_of_array_to_dict_of_dataframe(array_dict, reset_index=True)
{'a':    _dim_0  a
    0       0    1
    1       1    2
    2       2    3,
 'b':    _dim_0  _dim_1  b
    0       0       0    1
    1       0       1    2
    2       1       0    3
    3       1       1    4,
 'c':    _dim_0  c
    0       0    1.1
    1       1    2.2
    2       2    3.3}
GPSat.utils.diff_distance(x, p=2, k=1, default_val=nan)
GPSat.utils.expand_dict_by_vals(d, expand_keys)
GPSat.utils.get_col_values(df, col, return_numpy=True)

This function takes in a pandas DataFrame, a column name or index, and a boolean flag indicating whether to return the column values as a numpy array or not. It returns the values of the specified column as either a pandas Series or a numpy array, depending on the value of the return_numpy flag.

If the column is specified by name and it does not exist in the DataFrame, the function will attempt to use the column index instead. If the column is specified by index and it is not a valid integer index, the function will raise an AssertionError.

Parameters:
df: pandas DataFrame

A pandas DataFrame containing data.

col: str or int

The name of column to extract data from. If specified as an int n, it will extract data from the n-th column.

return_numpy: bool, default True

Whether to return as numpy array.

Returns:
numpy array

If return_numpy is set to True.

pandas Series

If return_numpy is set to False.

Examples

>>> import pandas as pd
>>> from GPSat.utils import get_col_values
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> col_values = get_col_values(df, 'A')
>>> print(col_values)
[1 2 3]
GPSat.utils.get_config_from_sysargv(argv_num=1)

This function takes an optional argument argv_num (default value of 1) and attempts to read a JSON configuration file from the corresponding index in sys.argv.

If the file extension is not .json, it prints a message indicating that the file is not a JSON file.

If an error occurs while reading the file, it prints an error message.

This function could benefit from refactoring to use the argparse package instead of manually parsing sys.argv.

Parameters:
argv_num :int, default 1

The index in sys.argv to read the configuration file from.

Returns:
dict or None

The configuration data loaded from the JSON file, or None if an error occurred while reading the file.
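
A minimal sketch, assuming a script invoked as 'python my_script.py /path/to/config.json':

>>> from GPSat.utils import get_config_from_sysargv
>>> config = get_config_from_sysargv()  # loads the JSON file passed as the first argument (sys.argv[1])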

GPSat.utils.get_git_information()

This function retrieves information about the current state of a Git repository.

Returns:
dict

Contains the following keys:

  • "branch": the name of the current branch.

  • "remote": a list of strings representing the remote repositories and their URLs.

  • "commit": the hash of the current commit.

  • "details": a list of strings representing the details of the last commit (author, date, message).

  • "modified" (optional): a list of strings representing the files modified since the last commit.

Note

  • If the current branch cannot be determined, the function will attempt to retrieve it from the list of all branches.

  • If there are no remote repositories, the "remote" key will be an empty list.

  • If there are no modified files, the "modified" key will not be present in the output.

  • This function requires the Git command line tool to be installed and accessible from the command line.

GPSat.utils.get_previous_oi_config(store_path, oi_config, table_name='oi_config', skip_valid_checks_on=None)

This function retrieves the previous configuration from optimal interpolation (OI) results file (store_path)

If the store_path exists, it is expected to contain a table called “oi_config” with the previous configurations stored as rows.

If store_path does not exist, the function creates the file and adds the current configuration (oi_config) as the first row in “oi_config” table.

Each row in the “oi_config” table contains columns ‘idx’ (index), ‘datetime’ and ‘config’. The values in the ‘config’ column are the provided oi_config (dict) converted to str.

If the table (oi_config) already exists, the function will match the provided oi_config against the previous config values; if any match exactly, the largest matching config id will be returned. Otherwise (oi_config does not exactly match any previous config), the largest idx value will be incremented and returned.

Parameters:
store_path: str

The file path where the configurations are stored.

oi_config: dict

Representing the current configuration for the OI system.

table_name: str, default “oi_config”

The table where the configurations will be stored.

skip_valid_checks_on: list of str or None, default None

If a list, the names of the configuration keys that should be skipped during validation checks. Note: validation checks are not performed in this function.

Returns:
dict

Previous configuration as a dictionary.

list

List of configuration keys to skipped during validation checks.

int

Configuration ID.

GPSat.utils.get_weighted_values(df, ref_col, dist_to_col, val_cols, weight_function='gaussian', drop_weight_cols=True, **weight_kwargs)

Calculate the weighted values of specified columns in a DataFrame based on the distance between two other columns, using a specified weighting function. The current implementation supports a Gaussian weight based on the euclidean distance between the values in ref_col and dist_to_col.

Parameters:
dfpandas.DataFrame

The input DataFrame containing the reference column, distance-to column, and value columns.

ref_collist of str or str

The name of the column(s) to use as reference points for calculating distances.

dist_to_collist of str or str

The name of the column(s) to calculate distances to, from ref_col. They should align / correspond to the column(s) set by ref_col.

val_colslist of str or str

The names of the column(s) for which the weighted values are calculated. Can be a single column name or a list of names.

weight_functionstr, optional

The type of weighting function to use. Currently, only “gaussian” is implemented, which applies a Gaussian weighting (exp(-d^2)) based on the squared euclidean distance. The default is “gaussian”.

drop_weight_cols: bool, optional, default True.

If False, the total weight and total weighted values are included in the output.

**weight_kwargsdict

Additional keyword arguments for the weighting function. For the Gaussian weight, this includes:

  • lengthscale (float): The length scale to use in the Gaussian function. This parameter scales the distance before applying the Gaussian function and must be provided.

Returns:
pandas.DataFrame

A DataFrame containing the weighted values for each of the specified value columns. The output DataFrame has the reference column as the index and each of the specified value columns with their weighted values.

Raises:
AssertionError

If the shapes of the ref_col and dist_to_col do not match, or if the required lengthscale parameter for the Gaussian weighting function is not provided.

NotImplementedError

If a weight_function other than “gaussian” is specified.

Notes

  • The function currently only implements Gaussian weighting. The Gaussian weight is calculated as exp(-d^2 / (2 * l^2)), where d is the euclidean distance between ref_col and dist_to_col, and l is the lengthscale.

  • This implementation assumes the input DataFrame does not contain NaN values in the reference or distance-to columns. Handling NaN values may require additional preprocessing or the use of fillna methods.

Examples

>>> import pandas as pd
>>>
>>> data = {
...     'ref_col': [0, 1, 0, 1],
...     'dist_to_col': [1, 2, 3, 4],
...     'value1': [10, 20, 30, 40],
...     'value2': [100, 200, 300, 400]
... }
>>> df = pd.DataFrame(data)
>>> weighted_df = get_weighted_values(df, 'ref_col', 'dist_to_col', ['value1', 'value2'], lengthscale=1.0)
>>> print(weighted_df)
GPSat.utils.glue_local_predictions(preds_df: DataFrame, expert_locs_df: DataFrame, sigma: int | float | list = 3) DataFrame

Deprecated. Use glue_local_predictions_1d and glue_local_predictions_2d instead.

Glues overlapping predictions by taking a normalised Gaussian weighted average.

Warning: This method only deals with expert locations on a regular grid.

Parameters:
preds_df: pd.DataFrame

containing predictions generated from local expert OI. It should have the following columns:

  • pred_loc_x (float): The x-coordinate of the prediction location.

  • pred_loc_y (float): The y-coordinate of the prediction location.

  • f* (float): The predictive mean at the location (pred_loc_x, pred_loc_y).

  • f*_var (float): The predictive variance at the location (pred_loc_x, pred_loc_y).

expert_locs_df: pd.DataFrame

containing local expert locations used to perform optimal interpolation. It should have the following columns:

  • x (float): The x-coordinate of the expert location.

  • y (float): The y-coordinate of the expert location.

sigma: int, float, or list, default 3

The standard deviation of the Gaussian weighting in the x and y directions.

  • If a single value is provided, it is used for both directions.

  • If a list is provided, the first value is used for the x direction and the second value is used for the y direction. Defaults to 3.

Returns:
pd.DataFrame:

Dataframe consisting of glued predictions (mean and std). It has the following columns:

  • pred_loc_x (float): The x-coordinate of the prediction location.

  • pred_loc_y (float): The y-coordinate of the prediction location.

  • f* (float): The glued predictive mean at the location (pred_loc_x, pred_loc_y).

  • f*_std (float): The glued predictive standard deviation at the location (pred_loc_x, pred_loc_y).

Notes

  • The function assumes that the expert locations are equally spaced in both the x and y directions.

  • The function uses the scipy.stats.norm.pdf function to compute the Gaussian weights.

  • The function normalizes the weighted sums with the total weights at each location.

GPSat.utils.grid_2d_flatten(x_range, y_range, grid_res=None, step_size=None, num_step=None, center=True)

Create a 2D grid of points defined by x and y ranges, with the option to specify the grid resolution, step size, or number of steps. The resulting grid is flattened and concatenated into a 2D array of (x,y) coordinates.

Parameters:
x_range: tuple or list of floats

Two values representing the minimum and maximum values of the x-axis range.

y_range: tuple or list of floats

Two values representing the minimum and maximum values of the y-axis range.

grid_res: float or None, default None

The grid resolution, i.e. the distance between adjacent grid points. If specified, this parameter takes precedence over step_size and num_step.

step_size: float or None, default None

The step size between adjacent grid points. If specified, this parameter takes precedence over num_step.

num_step: int or None, default None

The number of steps between the minimum and maximum values of the x and y ranges. This parameter is used only if grid_res and step_size are both unspecified (None). Note: the count includes the starting point, so going from 0 to 1 counts as two steps (the points 0 and 1).

center: bool, default True
  • If True, the resulting grid points will be the centers of the grid cells.

  • If False, the resulting grid points will be the edges of the grid cells.

Returns:
ndarray

A 2D array of (x,y) coordinates, where each row represents a single point in the grid.

Raises:
AssertionError

If grid_res, step_size, and num_step are all unspecified; at least one must be provided.

Examples

>>> from GPSat.utils import grid_2d_flatten
>>> grid_2d_flatten(x_range=(0, 2), y_range=(0, 2), grid_res=1)
array([[0.5, 0.5],
       [1.5, 0.5],
       [0.5, 1.5],
       [1.5, 1.5]])
GPSat.utils.guess_track_num(x, thresh, start_track=0)
GPSat.utils.inverse_sigmoid(y, low=0, high=1)
GPSat.utils.inverse_softplus(y, shift=0)
GPSat.utils.json_load(file_path)

This function loads a JSON file from the specified file path and applies a nested dictionary literal evaluation (nested_dict_literal_eval) to convert any string keys in the format of ‘(…,…)’ to tuple keys.

The resulting dictionary is returned.

Parameters:
file_path: str

The path to the JSON file to be loaded.

Returns:
dict or list of dict

The loaded JSON file as a dictionary or list of dictionaries.

Examples

Assuming a JSON file named 'config.json' with the following contents:

{
    "key1": "value1",
    "('key2', 'key3')": "value2",
    "key4": {"('key5', 'key6')": "value3"}
}

the following code will load the file and convert the string keys "('key2', 'key3')" and "('key5', 'key6')" to tuple keys:

>>> config = json_load('config.json')
>>> print(config)
{'key1': 'value1', ('key2', 'key3'): 'value2', 'key4': {('key5', 'key6'): 'value3'}}

GPSat.utils.json_serializable(d, max_len_df=100)

Converts a dictionary to a format that can be stored as JSON via the json.dumps() method.

Parameters:
d :dict

The dictionary to be converted.

max_len_df: int, default 100

The maximum length of a Pandas DataFrame or Series that can be converted to a string representation. If the length of the DataFrame or Series is greater than this value, it will be stored as a string. Defaults to 100.

Returns:
dict

The converted dictionary.

Raises:
AssertionError: If the input is not a dictionary.

Notes

  • If a key in the dictionary is a tuple, it will be converted to a string. To recover the original tuple, use nested_dict_literal_eval.

  • If a value in the dictionary is a dictionary, the function will be called recursively to convert it.

  • If a value in the dictionary is a NumPy array, it will be converted to a list.

  • If a value in the dictionary is a Pandas DataFrame or Series and its length is less than or equal to max_len_df, it will be converted to a dictionary and the function will be called recursively to convert it. Otherwise, it will be stored as a string.

  • If a value in the dictionary is not JSON serializable, it will be cast as a string.
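
A minimal round-trip sketch combining json_serializable with json.dump and json_load (the file name 'example.json' is illustrative):

>>> import json
>>> import pandas as pd
>>> from GPSat.utils import json_serializable, json_load
>>> d = {("key2", "key3"): "value2", "df": pd.DataFrame({"a": [1, 2]})}
>>> out = json_serializable(d)           # tuple key -> string key, small DataFrame -> dict
>>> with open("example.json", "w") as f:
...     json.dump(out, f)
>>> config = json_load("example.json")   # string tuple keys converted back to tuple keys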

GPSat.utils.log_lines(*args, level='debug')

This function logs lines to a file with a specified logging level.

This function takes in any number of arguments and a logging level.

The function checks that the logging level is valid and then iterates through the arguments.

If an argument is a string, integer, float, dictionary, tuple, or list, it is printed and logged with the specified logging level.

If an argument is not one of these types, it is not logged and a message is printed indicating the argument’s type.

Parameters:
*args: tuple

arguments to be provided to logging using the method specified by level

level: str, default “debug”

Must be one of ["debug", "info", "warning", "error", "critical"]. Each argument provided is logged with getattr(logging, level)(arg).

Returns:
None
GPSat.utils.match(x, y, exact=True, tol=1e-09)

This function takes two arrays, x and y, and returns an array of indices indicating where the elements of x match the elements of y. Can match exactly or within a specified tolerance.

Parameters:
x: array-like

the first array to be matched. If not already an array, it will be converted via to_array.

y: array-like

the second array to be matched against. If not already an array, it will be converted via to_array.

exact: bool, default=True.

If True, the function matches exactly. If False, the function matches within a specified tolerance.

tol: float, optional, default=1e-9.

The tolerance used for matching when exact=False.

Returns:
indices: array

the indices of the matching elements in y for each element in x.

Raises:
AssertionError: if any element in x is not found in y or if multiple matches are found for any element in x.

Note

This function requires x and y to be arrays, or objects that can be converted to arrays via to_array. If exact=False, matching only makes sense for floats; use exact=True for int and str. If both x and y are large, with lengths n and m, this function can use a lot of memory, as an intermediate bool array of size n x m is created. If there are multiple matches of an element of x in y, the index of the first match is returned.
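
A short illustration of the documented behaviour; the expected indices are given as comments rather than asserted output:

>>> import numpy as np
>>> from GPSat.utils import match
>>> x = np.array([3.0, 1.0])
>>> y = np.array([1.0, 2.0, 3.0])
>>> match(x, y)                        # expect indices [2, 0]: positions of 3.0 and 1.0 in y
>>> match(x + 1e-12, y, exact=False)   # near-matches resolved within tol when exact=False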

GPSat.utils.move_to_archive(top_dir, file_names=None, suffix='', archive_sub_dir='Archive', verbose=False)

Moves specified files from a directory to an archive sub-directory within the same directory. Moved files will have a suffix added on before file extension.

Parameters:
top_dir: str

The path to the directory containing the files to be moved.

file_names: list of str, default None

The names of the files to be moved. If not specified, all files in the directory will be moved.

suffix: str, default ""

A string to be added to the end of the file name before the extension in the archive directory.

archive_sub_dir: str, default 'Archive'

The name of the sub-directory within the top directory where the files will be moved.

verbose: bool, default False

If True, prints information about the files being moved.

Returns:
None

The function only moves files and does not return anything.

Note

If the archive sub-directory does not exist, it will be created.

If a file with the same name as the destination file already exists in the archive sub-directory, it will be overwritten.

Raises:
AssertionError

If top_dir does not exist or file_names is not specified.

Examples

Move all files in a directory to the archive sub-directory:

>>> move_to_archive("path/to/directory")

Move specific files to the archive sub-directory, with a suffix added to the file names:

>>> move_to_archive("path/to/directory", file_names=["file1.txt", "file2.txt"], suffix="_backup")

Move specific files to a custom archive sub-directory:

>>> move_to_archive("path/to/directory", file_names=["file1.txt", "file2.txt"], archive_sub_dir="Old Files")

GPSat.utils.nested_dict_literal_eval(d, verbose=False)

Converts a nested dictionary with string keys that represent tuples to a dictionary with tuple keys.

Parameters:
d: dict

The nested dictionary to be converted.

verbose: bool, default False

If True, prints information about the keys being converted.

Returns:
dict

The converted dictionary with tuple keys.

Raises:
ValueError: If a string key cannot be evaluated as a tuple.

Note

This function modifies the original dictionary in place.
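
A minimal sketch of the conversion described above (expected result shown as a comment):

>>> from GPSat.utils import nested_dict_literal_eval
>>> d = {"('a', 'b')": 1, "c": {"('d', 'e')": 2}}
>>> nested_dict_literal_eval(d)
>>> # expected: {('a', 'b'): 1, 'c': {('d', 'e'): 2}}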

GPSat.utils.nll(y, mu, sig, return_tot=True)
GPSat.utils.not_nan(x)
GPSat.utils.pandas_to_dict(x)

Converts a pandas Series or DataFrame (row) to a dictionary.

Parameters:
x: pd.Series, pd.DataFrame or dict

The input object to be converted to a dictionary.

Returns:
dict:

A dictionary representation of the input object.

Raises:
AssertionError: If the input object is a DataFrame with more than one row.

Warning

If the input object is not a pandas Series, DataFrame, or dictionary, a warning is issued and the input object is returned as is.

Examples

>>> import pandas as pd
>>> data = {'name': ['John', 'Jane'], 'age': [30, 25]}
>>> df = pd.DataFrame(data)
>>> pandas_to_dict(df)
AssertionError: in pandas_to_dict input provided as DataFrame, expected to only have 1 row, shape is: (2, 2)
>>> series = pd.Series(data['name'])
>>> pandas_to_dict(series)
{0: 'John', 1: 'Jane'}
>>> dictionary = {'name': ['John', 'Jane'], 'age': [30, 25]}
>>> pandas_to_dict(dictionary)
{'name': ['John', 'Jane'], 'age': [30, 25]}

Select a single row of the DataFrame:

>>> pandas_to_dict(df.iloc[[0]])
{'name': 'John', 'age': 30}
GPSat.utils.pip_freeze_to_dataframe()
GPSat.utils.pretty_print_class(x)

This function takes a class or class instance as input and returns a string representation of the class name without the leading "<class '" and trailing "'>".

Alternatively, it will remove a leading '<__main__.' and remove ' object at ', including anything that follows.

The function achieves this by invoking the __str__ method of the object and then using regular expressions to remove the unwanted characters.

Parameters:
x: an arbitrary class instance
Returns:
str

Examples

>>> class MyClass:
...     pass
>>> print(pretty_print_class(MyClass))

GPSat.utils.rmse(y, mu)
GPSat.utils.sigmoid(x, low=0, high=1)
GPSat.utils.softplus(x, shift=0)
GPSat.utils.sparse_true_array(shape, grid_space=1, grid_space_offset=0)

Create a boolean numpy array with True values regularly spaced throughout, and False elsewhere.

Parameters:
shape: iterable (e.g. list or tuple)

representing the shape of the output array.

grid_space: int, default 1

representing the spacing between True values.

grid_space_offset: int, default 0

representing the offset of the first True value in each dimension.

Returns:
np.array

A boolean array with dimension equal to shape, with False everywhere except for Trues regularly spaced every ‘grid_space’. The fraction of True will be roughly equal to (1/n)^d where n = grid_space, d = len(shape).

Note

The first dimension is treated as the y dimension. grid_space_offset can also be specified per dimension, allowing a different offset in each.
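
A small usage sketch, following the description above (with grid_space=2, roughly (1/2)^2 = 25% of entries should be True):

>>> import numpy as np
>>> from GPSat.utils import sparse_true_array
>>> mask = sparse_true_array(shape=(4, 4), grid_space=2)
>>> print(mask.shape, mask.dtype, mask.sum())   # expect a (4, 4) bool array with 4 True values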

GPSat.utils.stats_on_vals(vals, measure=None, name=None, qs=None)

This function calculates various statistics on a given array of values.

Parameters:
vals: array-like

The input array of values.

measure: str or None, default is None

The name of the measure being calculated.

name: str or None, default is None

The name of the column in the output dataframe. Default is None.

qs: list or None, default None

A list of quantiles to calculate. If None then will use [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99].

Returns:
pd.DataFrame

containing the following statistics:

  • measure: The name of the measure being calculated.

  • size: The number of elements in the input array.

  • num_not_nan: The number of non-NaN elements in the input array.

  • num_inf: The number of infinite elements in the input array.

  • min: The minimum value in the input array.

  • mean: The mean value of the input array.

  • max: The maximum value in the input array.

  • std: The standard deviation of the input array.

  • skew: The skewness of the input array.

  • kurtosis: The kurtosis of the input array.

  • qX: The Xth quantile of the input array, where X takes the values given in the qs parameter.

Note

The function also includes a timer decorator that calculates the time taken to execute the function.
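
A minimal usage sketch (the measure and column names, and the quantile list, are illustrative):

>>> import numpy as np
>>> from GPSat.utils import stats_on_vals
>>> vals = np.random.normal(size=1000)
>>> stats = stats_on_vals(vals, measure="obs", name="z", qs=[0.05, 0.5, 0.95])
>>> print(stats)   # a single-column DataFrame containing the statistics listed above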

GPSat.utils.to_array(*args, date_format='%Y-%m-%d')

Converts input arguments to numpy arrays.

Parameters:
*args: tuple

Input arguments to be converted to numpy arrays.

date_format: str, optional

Date format to be used when converting datetime.date objects to numpy arrays.

Returns:
generator

A generator that yields numpy arrays.

Note

This function converts its input arguments to numpy arrays:

  • If the input argument is already a numpy array, it is yielded as is.

  • If the input argument is a list or tuple, it is converted to a numpy array and yielded.

  • If the input argument is an integer, float, string, boolean, or numpy boolean, it is converted to a numpy array and yielded.

  • If the input argument is a numpy integer or float, it is converted to a numpy array and yielded.

  • If the input argument is a datetime.date object, it is converted to a numpy array using the specified date format and yielded.

  • If the input argument is a numpy datetime64 object, it is yielded as is.

  • If the input argument is None, an empty numpy array is yielded.

  • If the input argument is of any other data type, a warning is issued and it is converted to a numpy array of dtype object and yielded.

Examples

>>> import datetime
>>> import numpy as np
>>> x = [1, 2, 3]

Since the function returns a generator, get values out with next:

>>> print(next(to_array(x)))
[1 2 3]

Or, for a single array-like object, assign with:

>>> c, =  to_array(x)
>>> y = np.array([4, 5, 6])
>>> z = datetime.date(2021, 1, 1)
>>> for arr in to_array(x, y, z):
...     print(f"arr type: {type(arr)}, values: {arr}")
arr type: <class 'numpy.ndarray'>, values: [1 2 3]
arr type: <class 'numpy.ndarray'>, values: [4 5 6]
arr type: <class 'numpy.ndarray'>, values: ['2021-01-01']
GPSat.utils.track_num_for_date(x)

GPSat.vff module

Code adapted from: https://github.com/st--/VFF

class GPSat.vff.BlockDiagMat(A, B)

Bases: object

get()
get_diag()
inv()
inv_diag()
logdet()
matmul(X)
matmul_sqrt(X)
matmul_sqrt_transpose(X)
property shape
solve(X)
property sqrt_dims
trace_KiX(X)

X is a square matrix of the same size as this one. If self is K, compute tr(K^{-1} X).

class GPSat.vff.DiagMat(d)

Bases: object

get()
get_diag()
inv()
inv_diag()
logdet()
matmul(B)
matmul_sqrt(B)
matmul_sqrt_transpose(B)
property shape
solve(B)
property sqrt_dims
trace_KiX(X)

X is a square matrix of the same size as this one. If self is K, compute tr(K^{-1} X).
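
For a diagonal K = diag(d), tr(K^{-1} X) reduces to sum(diag(X) / d); a quick numpy check of that identity (pure linear algebra, independent of this class's implementation):

>>> import numpy as np
>>> d = np.array([2.0, 4.0, 5.0])
>>> X = np.arange(9.0).reshape(3, 3)
>>> np.trace(np.linalg.inv(np.diag(d)) @ X)   # tr(K^{-1} X)
>>> np.sum(np.diag(X) / d)                    # same value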

class GPSat.vff.GPR_kron(data, ms, a, b, kernel_list)

Bases: GPModel, InternalDataTrainingLossMixin

elbo()
maximum_log_likelihood_objective()

Objective for maximum likelihood estimation. Should be maximized. E.g. log-marginal likelihood (hyperparameter likelihood) for GPR, or lower bound to the log-marginal likelihood (ELBO) for sparse and variational GPs.

Returns:

  • return has shape [].

predict_f(Xnew, full_cov=False, full_output_cov=False)

Compute the mean and variance of the posterior latent function(s) at the input points.

Given $x_i$ this computes $f_i$, for:

\begin{align} \theta & \sim p(\theta) \\ f & \sim \mathcal{GP}(m(x), k(x, x'; \theta)) \\ f_i & = f(x_i) \\ \end{align}

For an example of how to use predict_f, see ../../../../notebooks/getting_started/basic_usage.

Parameters:
  • Xnew

    • Xnew has shape [batch…, N, D].

    Input locations at which to compute mean and variance.

  • full_cov – If True, compute the full covariance between the inputs. If False, only returns the point-wise variance.

  • full_output_cov – If True, compute the full covariance between the outputs. If False, assumes outputs are independent.

Returns:

  • return[0] has shape [batch…, N, P].

  • return[1] has shape [batch…, N, P, N, P] if full_cov and full_output_cov.

  • return[1] has shape [batch…, N, P, P] if (not full_cov) and full_output_cov.

  • return[1] has shape [batch…, N, P] if (not full_cov) and (not full_output_cov).

  • return[1] has shape [batch…, P, N, N] if full_cov and (not full_output_cov).

class GPSat.vff.LowRankMat(d, W)

Bases: object

get()
get_diag()
inv()
inv_diag()
logdet()
matmul(B)
matmul_sqrt(B)
There’s a non-square sqrt of this matrix given by

[ D^{1/2} ]
[ W^T     ]

This method right-multiplies the sqrt by the matrix B (see the numpy sketch at the end of this class listing for the identity the stacked form implies).

matmul_sqrt_transpose(B)
There’s a non-square sqrt of this matrix given by

[ D^{1/2} ]
[ W^T     ]

This method right-multiplies the transposed-sqrt by the matrix B

property shape
solve(B)
property sqrt_dims
trace_KiX(X)

X is a square matrix of the same size as this one. If self is K, compute tr(K^{-1} X).
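
Reading the stacked square root above as S = [D^{1/2}; W^T], the matrix represented is K = S^T S = D + W W^T; a quick numpy check of that identity (pure linear algebra, not the class API):

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> n, k = 4, 2
>>> d = rng.uniform(1.0, 2.0, n)
>>> W = rng.normal(size=(n, k))
>>> K = np.diag(d) + W @ W.T
>>> S = np.vstack([np.diag(np.sqrt(d)), W.T])   # the stacked "non-square sqrt"
>>> np.allclose(S.T @ S, K)                     # True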

class GPSat.vff.LowRankMatNeg(d, W)

Bases: object

get()
property shape
class GPSat.vff.Rank1Mat(d, v)

Bases: object

get()
get_diag()
inv()
inv_diag()
logdet()
matmul(B)
matmul_sqrt(B)
There’s a non-square sqrt of this matrix given by

[ D^{1/2} ]
[ V^T     ]

This method right-multiplies the sqrt by the matrix B

matmul_sqrt_transpose(B)
There’s a non-square sqrt of this matrix given by

[ D^{1/2} ]
[ V^T     ]

This method right-multiplies the transposed-sqrt by the matrix B

property shape
solve(B)
property sqrt_dims
trace_KiX(X)

X is a square matrix of the same size as this one. If self is K, compute tr(K^{-1} X).

class GPSat.vff.Rank1MatNeg(d, v)

Bases: object

get()
property shape
GPSat.vff.kron(K)
GPSat.vff.kron_two(A, B)

Compute the Kronecker product of two TensorFlow tensors.
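
As an illustration of the operation itself, the Kronecker product can be seen with numpy's np.kron (the GPSat.vff version operates on TensorFlow tensors):

>>> import numpy as np
>>> A = np.array([[1.0, 2.0], [3.0, 4.0]])
>>> B = np.eye(2)
>>> np.kron(A, B)   # shape (4, 4): each entry A[i, j] scales a copy of B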

GPSat.vff.make_Kuf(k, X, a, b, ms)
GPSat.vff.make_Kuf_np(X, a, b, ms)
GPSat.vff.make_Kuu(kern, a, b, ms)

Make a representation of the Kuu matrices.

GPSat.vff.make_kvs(k)

Compute the Kronecker-vector stack of the list of matrices k.

GPSat.vff.make_kvs_np(A_list)
GPSat.vff.make_kvs_two(A, B)

Compute the Kronecker-vector stack of the matrices A and B.

GPSat.vff.make_kvs_two_np(A, B)

Module contents

Add package docstring here

GPSat.get_config_path(*sub_dir)
GPSat.get_data_path(*sub_dir)
GPSat.get_parent_path(*sub_dir)
GPSat.get_path(*sub_dir)

Get the path to the package.