Utils

TODO: Divide utils into specific categories.

GPSat.utils.EASE2toWGS84(x, y, return_vals='both', lon_0=0, lat_0=90)

Converts EASE2 grid coordinates to WGS84 longitude and latitude coordinates.

Parameters:
x: float

EASE2 grid x-coordinate in meters.

y: float

EASE2 grid y-coordinate in meters.

return_vals: str, optional

Determines what values to return. Valid options are "both" (default), "lon", or "lat".

lon_0: float, optional

Longitude of the center of the EASE2 grid in degrees. Default is 0.

lat_0: float, optional

Latitude of the center of the EASE2 grid in degrees. Default is 90.

Returns:
tuple or float

Depending on the value of return_vals, either a tuple of WGS84 longitude and latitude coordinates (both floats), or a single float representing either the longitude or latitude.

Raises:
AssertionError

If return_vals is not one of the valid options.

Examples

>>> EASE2toWGS84(1000000, 2000000)
(153.434948822922, 69.86894542225777)
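
For the default north-polar grid (lon_0=0, lat_0=90), the longitude in the example above can be cross-checked from plane geometry alone. The sketch below is not the GPSat implementation; polar_lon_from_xy is a hypothetical helper, and it assumes an azimuthal projection in which the 0-degree meridian runs along the negative y axis. Latitude is not recovered here, since that needs the full ellipsoidal projection.

```python
import math

def polar_lon_from_xy(x, y):
    """Longitude in degrees for a north-polar azimuthal grid with lon_0=0.

    Hypothetical helper: the angle is measured from the negative y axis,
    so atan2 takes (x, -y).
    """
    return math.degrees(math.atan2(x, -y))

lon = polar_lon_from_xy(1_000_000, 2_000_000)
print(round(lon, 6))  # 153.434949
```

The value agrees with the longitude returned in the documented example.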
GPSat.utils.EASE2toWGS84_New(*args, **kwargs)
GPSat.utils.WGS84toEASE2(lon, lat, return_vals='both', lon_0=0, lat_0=90)

Converts WGS84 longitude and latitude coordinates to EASE2 grid coordinates.

Parameters:
lon: float

Longitude coordinate in decimal degrees.

lat: float

Latitude coordinate in decimal degrees.

return_vals: str, optional

Determines what values to return. Valid options are "both" (default), "x", or "y".

lon_0: float, optional

Longitude of the center of the EASE2 grid in decimal degrees. Default is 0.

lat_0: float, optional

Latitude of the center of the EASE2 grid in decimal degrees. Default is 90.

Returns:
float

If return_vals is "x". Returns the x EASE2 grid coordinate in meters.

float

If return_vals is "y". Returns the y EASE2 grid coordinate in meters.

tuple of float

If return_vals is "both". Returns a tuple of (x, y) EASE2 grid coordinates in meters.

Raises:
AssertionError

If return_vals is not one of the valid options.

Examples

>>> WGS84toEASE2(-105.01621, 39.57422)
(-5254767.014984061, 1409604.1043472202)
GPSat.utils.WGS84toEASE2_New(*args, **kwargs)
GPSat.utils.array_to_dataframe(x, name, dim_prefix='_dim_', reset_index=False)

Converts a numpy array to a pandas DataFrame with a multi-index based on the array’s dimensions.

(Also see dataframe_to_array)

Parameters:
x: np.ndarray

The numpy array to be converted to a DataFrame.

name: str

The name of the column in the resulting DataFrame.

dim_prefix: str, optional

The prefix to be used for the dimension names in the multi-index. Default is "_dim_". Integers will be appended to dim_prefix for each dimension of x, i.e. if x is 2d, it will have dimension names "_dim_0", "_dim_1", assuming default dim_prefix is used.

reset_index: bool, optional

Whether to reset the index of the resulting DataFrame. Default is False.

Returns:
out: pd.DataFrame

The resulting DataFrame with a multi-index based on the dimensions of the input array.

Raises:
AssertionError

If the input is not a numpy array.

Examples

>>> # express a 2d numpy array in DataFrame
>>> x = np.array([[1, 2], [3, 4]])
>>> array_to_dataframe(x, "data")
                data
_dim_0 _dim_1
0      0        1
       1        2
1      0        3
       1        4
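
The multi-index layout can be reproduced with plain pandas. The following is a sketch of the idea, not the GPSat source: it assumes the values are flattened in C order, which lines up with the ordering produced by pd.MultiIndex.from_product.

```python
import numpy as np
import pandas as pd

x = np.array([[1, 2], [3, 4]])
dim_prefix = "_dim_"

# One index level per array dimension, named _dim_0, _dim_1, ...
names = [f"{dim_prefix}{i}" for i in range(x.ndim)]
index = pd.MultiIndex.from_product([range(n) for n in x.shape], names=names)

# from_product varies the last level fastest, matching C-order flattening
df = pd.DataFrame({"data": x.ravel()}, index=index)

# The array is recovered by reshaping the column to the original shape
x_back = df["data"].to_numpy().reshape(x.shape)
```

Because each row is keyed by its position in the array, the conversion is lossless as long as the shape is known.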
GPSat.utils.assign_category_col(val, df, categories=None)

Generate a categorical pd.Series equal in length to a reference DataFrame (df).

Parameters:
val: str

The value to assign to the categorical Series.

df: pandas DataFrame

Reference DataFrame, used to determine the length of the output.

categories: list, optional

A list of categories to be used for the categorical column.

Returns:
pandas Categorical Series

A categorical column with the assigned value and specified categories (if provided).

Notes

This function creates a new categorical column in the DataFrame with the specified value and categories. If categories are not provided, they will be inferred from the data. The function returns a pandas Categorical object representing the new column.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
>>> x_series = assign_category_col('x', df)
GPSat.utils.bin_obs_by_date(df, val_col, date_col='date', all_dates_in_range=True, x_col='x', y_col='y', grid_res=None, date_col_format='%Y%m%d', x_min=-4500000.0, x_max=4500000.0, y_min=-4500000.0, y_max=4500000.0, n_x=None, n_y=None, bin_statistic='mean', verbose=False)

This function takes in a pandas DataFrame and bins the data based on the values in a specified column and the x and y coordinates in other specified columns. The data is binned based on a grid with a specified resolution or number of bins. The function returns a dictionary of binned values for each unique date in the DataFrame.

Parameters:
df: pandas DataFrame

A DataFrame containing the data to be binned.

val_col: string

Name of the column containing the values to be binned.

date_col: string, default “date”

Name of the column containing the dates for which to bin the data.

all_dates_in_range: boolean, default True

Whether to include all dates in the range of the DataFrame.

x_col: string, default “x”

Name of the column containing the x coordinates.

y_col: string, default “y”

Name of the column containing the y coordinates.

grid_res: float or int, default None

Resolution of the grid in kilometers. If None, then n_x and n_y must be specified.

date_col_format: string, default “%Y%m%d”

Format of the date column.

x_min: float, default -4500000.0

Minimum x value for the grid.

x_max: float, default 4500000.0

Maximum x value for the grid.

y_min: float, default -4500000.0

Minimum y value for the grid.

y_max: float, default 4500000.0

Maximum y value for the grid.

n_x: int, default None

Number of bins in the x direction.

n_y: int, default None

Number of bins in the y direction.

bin_statistic: string or callable, default “mean”

Statistic to compute in each bin.

verbose: boolean, default False

Whether to print additional information during execution.

Returns:
bvals: dictionary

The binned values for each unique date in the DataFrame.

x_edge: numpy array

x values for the edges of the bins.

y_edge: numpy array

y values for the edges of the bins.

Notes

The x and y coordinates are swapped in the returned binned values due to the transpose operation used in the function.
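The per-date binning and the transpose mentioned above can be sketched with NumPy. This is an illustration under assumptions: it uses np.histogram2d to form a bin mean (sum divided by count), which may differ from what GPSat uses internally, and the sample data and edges are made up.

```python
import numpy as np

dates = np.array(["20200301", "20200301", "20200302"])
x = np.array([0.5, 0.6, 1.5])
y = np.array([0.5, 0.4, 0.5])
vals = np.array([1.0, 3.0, 5.0])

x_edge = np.array([0.0, 1.0, 2.0])  # two bins in x
y_edge = np.array([0.0, 1.0])       # one bin in y

bvals = {}
for date in np.unique(dates):
    sel = dates == date
    total, _, _ = np.histogram2d(x[sel], y[sel], bins=[x_edge, y_edge], weights=vals[sel])
    count, _, _ = np.histogram2d(x[sel], y[sel], bins=[x_edge, y_edge])
    with np.errstate(invalid="ignore"):
        mean = total / count       # shape (n_x, n_y); empty bins become NaN
    bvals[str(date)] = mean.T      # after the transpose: shape (n_y, n_x)
```

After the transpose, the first axis of each array indexes y and the second indexes x, which is the swap the note refers to.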

GPSat.utils.check_prev_oi_config(prev_oi_config, oi_config, skip_valid_checks_on=None)

This function checks if the previous configuration matches the current one. It takes in two dictionaries, prev_oi_config and oi_config, which represent the previous and current configurations respectively.

The function also takes an optional list skip_valid_checks_on, which contains keys that should be skipped during the comparison.

Parameters:
prev_oi_config: dict

Previous configuration to be compared against.

oi_config: dict

Current configuration to compare against prev_oi_config.

skip_valid_checks_on: list or None, default None

If not None, should be a list of keys to not check.

Returns:
None

Notes

  • If skip_valid_checks_on is not provided, it defaults to an empty list. The function then compares the two configurations and raises an AssertionError if any key-value pairs do not match.

  • If the configurations do not match exactly, an AssertionError is raised.

  • This function assumes that the configurations are represented as dictionaries and that the keys in both dictionaries are the same.
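The comparison described in these notes can be sketched in a few lines. configs_match_sketch is a hypothetical stand-in, not the GPSat function, and the configuration keys are invented for the example.

```python
def configs_match_sketch(prev_config, config, skip_keys=None):
    """Raise AssertionError if any non-skipped key differs (illustrative only)."""
    skip_keys = [] if skip_keys is None else skip_keys
    for key, value in config.items():
        if key in skip_keys:
            continue
        assert key in prev_config, f"key '{key}' missing from previous config"
        assert prev_config[key] == value, f"mismatch on key '{key}'"

prev = {"grid_res": 50, "kernel": "Matern32", "run_date": "2020-03-01"}
curr = {"grid_res": 50, "kernel": "Matern32", "run_date": "2020-03-02"}

# Skipping the key that is expected to change lets the check pass
configs_match_sketch(prev, curr, skip_keys=["run_date"])

# Without the skip list the differing key triggers an AssertionError
try:
    configs_match_sketch(prev, curr)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
```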

GPSat.utils.compare_dataframes(df1, df2, merge_on, columns_to_compare, drop_other_cols=False, how='outer', suffixes=['_1', '_2'])
GPSat.utils.config_func(func, source=None, args=None, kwargs=None, col_args=None, col_kwargs=None, df=None, filename_as_arg=False, filename=None, col_numpy=True)

Apply a function based on configuration input.

The aim is to allow one to apply a function, possibly on data from a DataFrame, using a specification that can be stored in a JSON configuration file.

Note

  • This function uses eval() so could allow for arbitrary code execution.

  • If a DataFrame df is provided, inputs to func can be taken from its columns via col_args and/or col_kwargs.

Parameters:
func: str or callable.
  • If str, it will use eval(func) to convert it to a function.

  • If it contains one of "|", "&", "=", "+", "-", "*", "/", "%", "<", and ">", it will create a lambda function:

lambda arg1, arg2: eval(f"arg1 {func} arg2")
  • If eval(func) raises NameError and source is not None, it will run

f"from {source} import {func}"

and try again. This allows func to be imported from a source package.

source: str or None, default None

Package name where func can be found, if applicable. Used to import func from a package. e.g.

>>> GPSat.utils.config_func(func="cumprod", source="numpy", ...)

calls the function cumprod from the package numpy.

args: list or None, default None

If None, an empty list will be used, i.e. no args will be used. The values will be unpacked and provided to func: i.e. func(*args, **kwargs)

kwargs: dict or None, default None

If dict, it will be unpacked (**kwargs) to provide key word arguments to func.

col_args: None or list of str, default None

If DataFrame df is provided, it can use col_args to specify which columns of df will be passed into func as arguments.

col_kwargs: None or dict, default is None

Keyword arguments to be passed to func specified as dict whose keys are parameters of func and values are column names of a DataFrame df. Only applicable if df is provided.

df: DataFrame or None, default None

To provide if one wishes to use columns of a DataFrame as arguments to func.

filename_as_arg: bool, default False

Set True if filename is used as an argument to func.

filename: str or None, default None

If filename_as_arg is True, filename is passed to func as the first argument.

col_numpy: bool, default True

If True, when extracting columns from DataFrame, .values is used to convert to numpy array.

Returns:
any

Values returned by applying func on data. The type depends on func.

Raises:
AssertionError

If kwargs is not a dict, col_kwargs is not a dict, or func is not a string or callable.

AssertionError

If df is not provided but col_args or col_kwargs are.

AssertionError

If func is a string and cannot be imported on its own and source is None.

Examples

>>> import pandas as pd
>>> from GPSat.utils import config_func
>>> config_func(func="lambda x, y: x + y", args=[1, 1]) # Computes 1 + 1
2
>>> config_func(func="==", args=[1, 1]) # Computes 1 == 1
True

Using columns of a DataFrame as inputs:

>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> config_func(func="lambda x, y: x + y", df=df, col_args=["A", "B"]) # Computes df["A"] + df["B"]
array([5, 7, 9])
>>> config_func(func="<=", col_args=["A", "B"], df=df) # Computes df["A"] <= df["B"]
array([ True,  True,  True])

We can also use functions from an external package by specifying source. For example, the below reproduces the last example in numpy.cumprod:

>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> config_func(func="cumprod", source="numpy", df=df, kwargs={"axis": 0}, col_args=[["A", "B"]])
array([[  1,   4],
       [  2,  20],
       [  6, 120]])
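
How a string becomes a callable can be sketched as follows. resolve_func_sketch is hypothetical, and the order of the checks (lambda strings first, then operator strings) is an assumption made for this illustration rather than a description of the GPSat source. As noted above, eval() executes arbitrary code, so this pattern is only suitable for trusted configuration files.

```python
def resolve_func_sketch(func):
    """Turn a callable, lambda string, operator string or name into a callable."""
    if callable(func):
        return func
    if func.lstrip().startswith("lambda"):
        return eval(func)  # trusted input only
    if any(ch in func for ch in "|&=+-*/%<>"):
        # Operator strings such as "==" become a two-argument lambda
        return eval(f"lambda arg1, arg2: arg1 {func} arg2")
    return eval(func)  # e.g. the name of a builtin such as "max"

add = resolve_func_sketch("lambda x, y: x + y")
eq = resolve_func_sketch("==")

print(add(1, 1))                      # 2
print(eq(1, 1))                       # True
print(resolve_func_sketch("max")(3, 7))  # 7
```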
GPSat.utils.convert_lon_lat_str(x)

Converts a string representation of longitude or latitude to a float value.

Parameters:
x: str

A string representation of longitude or latitude in the format of "[degrees] [minutes] [direction]", where [direction] is one of "N", "S", "E", or "W".

Returns:
float

The converted value of the input string as a float.

Raises:
AssertionError

If the input is not a string.

Examples

>>> convert_lon_lat_str('74 0.1878 N')
74.00313
>>> convert_lon_lat_str('140 0.1198 W')
-140.001997
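
The documented outputs are consistent with reading the second field as decimal arc-minutes and negating the result for "S" and "W". A minimal sketch under that assumption (lon_lat_str_to_float is a hypothetical name, not part of GPSat):

```python
def lon_lat_str_to_float(x):
    """Convert '[degrees] [minutes] [direction]' to signed decimal degrees."""
    assert isinstance(x, str), "input must be a string"
    degrees, minutes, direction = x.split()
    value = float(degrees) + float(minutes) / 60.0
    # Southern and western hemispheres are negative
    if direction.upper() in ("S", "W"):
        value = -value
    return value

print(round(lon_lat_str_to_float("74 0.1878 N"), 5))   # 74.00313
print(round(lon_lat_str_to_float("140 0.1198 W"), 6))  # -140.001997
```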
GPSat.utils.cprint(x, c='ENDC', bcolors=None, sep=' ', end='\n')

Add color to print statements.

Based off of https://stackoverflow.com/questions/287871/how-do-i-print-colored-text-to-the-terminal.

Parameters:
x: str

String to be printed.

c: str, default “ENDC”

Valid key in bcolors. If bcolors is not provided, then default will be used, containing keys: 'HEADER', 'OKBLUE', 'OKCYAN', 'OKGREEN', 'WARNING', 'FAIL', 'ENDC', 'BOLD', 'UNDERLINE'.

bcolors: dict or None, default None

Dict with values being colors / how to format the font. These can be chained together. See the codes in: https://en.wikipedia.org/wiki/ANSI_escape_code#3-bit_and_4-bit.

sep: str, default “ “

sep argument passed along to print().

end: str, default “\n”

end argument passed along to print().

Returns:
None
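
Colored output of this kind works by wrapping the text in ANSI escape codes. The sketch below uses the code table from the Stack Overflow answer linked above; colorize_sketch is a hypothetical helper that returns the string rather than printing it, and whether GPSat's default table is identical is an assumption.

```python
BCOLORS = {
    "HEADER": "\033[95m",
    "OKBLUE": "\033[94m",
    "OKCYAN": "\033[96m",
    "OKGREEN": "\033[92m",
    "WARNING": "\033[93m",
    "FAIL": "\033[91m",
    "ENDC": "\033[0m",
    "BOLD": "\033[1m",
    "UNDERLINE": "\033[4m",
}

def colorize_sketch(x, c="ENDC", bcolors=None):
    """Wrap x in the escape code for c, then reset the formatting."""
    bcolors = BCOLORS if bcolors is None else bcolors
    return f"{bcolors[c]}{x}\033[0m"

print(colorize_sketch("all checks passed", c="OKGREEN"))
```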
GPSat.utils.dataframe_to_2d_array(df, x_col, y_col, val_col, tol=1e-09, fill_val=nan, dtype=None, decimals=1)

Extract values from DataFrame to create a 2-d array of values (val_col) - assuming the values came from a 2-d array. Requires dimension columns x_col, y_col (do not have to be ordered in DataFrame).

Parameters:
df: pandas.DataFrame

The dataframe to convert to a 2D array.

x_col: str

The name of the column in the dataframe that contains the x coordinates.

y_col: str

The name of the column in the dataframe that contains the y coordinates.

val_col: str

The name of the column in the dataframe that contains the values to be placed in the 2D array.

tol: float, default 1e-9

The tolerance for matching the x and y coordinates to the grid.

fill_val: float, default np.nan

The value to fill the 2D array with if a coordinate is missing.

dtype: str or numpy.dtype or None, default None

The data type of the values in the 2D array.

decimals: int, default 1

The number of decimal places to round x and y values to before taking unique. If decimals is negative, it specifies the number of positions to the left of the decimal point.

Returns:
tuple

A tuple containing the 2D numpy array of values, the x coordinates of the grid, and the y coordinates of the grid.

Raises:
AssertionError

If any of the required columns are missing from the dataframe, or if any coordinates have more than one value.

Notes

  • The spacing of grid is determined by the smallest step size in the x_col, y_col direction, respectively.

  • This is meant to reverse the process of putting values from a regularly spaced grid into a DataFrame. Do not expect this to work on arbitrary x,y coordinates.

GPSat.utils.dataframe_to_array(df, val_col, idx_col=None, dropna=True, fill_val=nan)

Converts a pandas DataFrame to a numpy array, where the DataFrame has columns that represent dimensions of the array and the DataFrame rows represent values in the array.

Parameters:
df: pandas DataFrame

The DataFrame containing the values to convert to a numpy ndarray.

val_col: str

The name of the column in the DataFrame that contains the values to be placed in the array.

idx_col: str or list of str or None, default None

The name(s) of the column(s) in the DataFrame that represent the dimensions of the array. If not provided, the index of the DataFrame will be used as the dimension(s).

dropna: bool, default True

Whether to drop rows with missing values before converting to the array.

fill_val: scalar, default np.nan

The value to fill in the array for missing values.

Returns:
numpy array

The resulting numpy array.

Raises:
AssertionError

If the dimension values are not integers or have gaps, or if the idx_col parameter contains column names that are not in the DataFrame.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from GPSat.utils import dataframe_to_array
>>> df = pd.DataFrame({
...     'dim1': [0, 0, 1, 1],
...     'dim2': [0, 1, 0, 1],
...     'values': [1, 2, 3, 4]
... })
>>> arr = dataframe_to_array(df, 'values', ['dim1', 'dim2'])
>>> print(arr)
[[1 2]
 [3 4]]
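
The role of fill_val shows up when some index combinations are absent. The following sketch illustrates the idea with plain NumPy indexing and is not the GPSat implementation: the array is pre-filled, then values are written at the positions given by the dimension columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dim1": [0, 0, 1],
    "dim2": [0, 1, 1],
    "values": [1.0, 2.0, 4.0],   # entry (1, 0) is deliberately missing
})

idx = df[["dim1", "dim2"]].to_numpy()
shape = tuple(int(n) for n in idx.max(axis=0) + 1)

arr = np.full(shape, np.nan)                 # start from the fill value
arr[tuple(idx.T)] = df["values"].to_numpy()  # write each value at its position
print(arr)
```

The missing position (1, 0) keeps the fill value, while every other cell holds the value from its row.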
GPSat.utils.dict_of_array_to_dict_of_dataframe(array_dict, concat=False, reset_index=False)

Converts a dictionary of arrays to a dictionary of pandas DataFrames.

Parameters:
array_dict: dict

A dictionary where the keys are strings and the values are numpy arrays.

concat: bool, optional

If True, concatenates DataFrames with the same number of dimensions. Default is False.

reset_index: bool, optional

If True, resets the index of each DataFrame. Default is False.

Returns:
dict

A dictionary where the keys are strings and the values are pandas DataFrames.

Notes

This function uses the array_to_dataframe function to convert each array to a DataFrame. If concat is True, it will concatenate DataFrames with the same number of dimensions. If reset_index is True, it will reset the index of each DataFrame.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> array_dict = {'a': np.array([1, 2, 3]), 'b': np.array([[1, 2], [3, 4]]), 'c': np.array([1.1, 2.2, 3.3])}
>>> dict_of_array_to_dict_of_dataframe(array_dict)
{'a':       a
    _dim_0   
    0       1
    1       2
    2       3,
'b':               b
    _dim_0 _dim_1   
    0      0       1
           1       2
    1      0       3
           1       4,
'c':        c
    _dim_0     
    0       1.1
    1       2.2
    2       3.3}
>>> dict_of_array_to_dict_of_dataframe(array_dict, concat=True)
{1:         a    c
    _dim_0
    0       1  1.1
    1       2  2.2
    2       3  3.3,
2:                 b
    _dim_0 _dim_1
    0      0       1
           1       2
    1      0       3
           1       4}
>>> dict_of_array_to_dict_of_dataframe(array_dict, reset_index=True)
{'a':    _dim_0  a
    0       0    1
    1       1    2
    2       2    3,
 'b':    _dim_0  _dim_1  b
    0       0       0    1
    1       0       1    2
    2       1       0    3
    3       1       1    4,
 'c':    _dim_0  c
    0       0    1.1
    1       1    2.2
    2       2    3.3}
GPSat.utils.diff_distance(x, p=2, k=1, default_val=nan)
GPSat.utils.expand_dict_by_vals(d, expand_keys)
GPSat.utils.get_col_values(df, col, return_numpy=True)

This function takes in a pandas DataFrame, a column name or index, and a boolean flag indicating whether to return the column values as a numpy array or not. It returns the values of the specified column as either a pandas Series or a numpy array, depending on the value of the return_numpy flag.

If the column is specified by name and it does not exist in the DataFrame, the function will attempt to use the column index instead. If the column is specified by index and it is not a valid integer index, the function will raise an AssertionError.

Parameters:
df: pandas DataFrame

A pandas DataFrame containing data.

col: str or int

The name of column to extract data from. If specified as an int n, it will extract data from the n-th column.

return_numpy: bool, default True

Whether to return as numpy array.

Returns:
numpy array

If return_numpy is set to True.

pandas Series

If return_numpy is set to False.

Examples

>>> import pandas as pd
>>> from GPSat.utils import get_col_values
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> col_values = get_col_values(df, 'A')
>>> print(col_values)
[1 2 3]
GPSat.utils.get_config_from_sysargv(argv_num=1)

This function takes an optional argument argv_num (default value of 1) and attempts to read a JSON configuration file from the corresponding index in sys.argv.

If the file extension is not .json, it prints a message indicating that the file is not a JSON file.

If an error occurs while reading the file, it prints an error message.

This function could benefit from refactoring to use the argparse package instead of manually parsing sys.argv.

Parameters:
argv_num: int, default 1

The index in sys.argv to read the configuration file from.

Returns:
dict or None

The configuration data loaded from the JSON file, or None if an error occurred while reading the file.
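
The behavior described above can be sketched with the standard library. config_from_argv_sketch is a hypothetical helper that takes the argument list explicitly (a stand-in for sys.argv) so that it can be exercised without a command line. Returning None for a non-JSON extension is an assumption of this sketch.

```python
import json
import os
import tempfile

def config_from_argv_sketch(argv, argv_num=1):
    """Read a JSON config from argv[argv_num]; return None on any problem."""
    try:
        path = argv[argv_num]
    except IndexError:
        print(f"no argument found at position {argv_num}")
        return None
    if not path.lower().endswith(".json"):
        print(f"{path} is not a JSON file")
        return None
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError) as e:
        print(f"could not read {path}: {e}")
        return None

# Demonstration with a temporary file and a stand-in for sys.argv
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w") as f:
        json.dump({"grid_res": 50}, f)
    config = config_from_argv_sketch(["script.py", path])
    missing = config_from_argv_sketch(["script.py"])
```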

GPSat.utils.get_git_information()

This function retrieves information about the current state of a Git repository.

Returns:
dict

Contains the following keys:

  • "branch": the name of the current branch.

  • "remote": a list of strings representing the remote repositories and their URLs.

  • "commit": the hash of the current commit.

  • "details": a list of strings representing the details of the last commit (author, date, message).

  • "modified" (optional): a list of strings representing the files modified since the last commit.

Note

  • If the current branch cannot be determined, the function will attempt to retrieve it from the list of all branches.

  • If there are no remote repositories, the "remote" key will be an empty list.

  • If there are no modified files, the "modified" key will not be present in the output.

  • This function requires the Git command line tool to be installed and accessible from the command line.

GPSat.utils.get_previous_oi_config(store_path, oi_config, table_name='oi_config', skip_valid_checks_on=None)

This function retrieves the previous configuration from an optimal interpolation (OI) results file (store_path).

If the store_path exists, it is expected to contain a table called “oi_config” with the previous configurations stored as rows.

If store_path does not exist, the function creates the file and adds the current configuration (oi_config) as the first row in “oi_config” table.

Each row in the “oi_config” table contains the columns ‘idx’ (index), ‘datetime’ and ‘config’. The ‘config’ column holds the provided oi_config (dict) converted to str.

If the table (oi_config) already exists, the function matches the provided oi_config against the previous config values. If any match exactly, the largest config id is returned. Otherwise (oi_config does not exactly match any previous config), the largest idx value is incremented and returned.

Parameters:
store_path: str

The file path where the configurations are stored.

oi_config: dict

Representing the current configuration for the OI system.

table_name: str, default “oi_config”

The table where the configurations will be stored.

skip_valid_checks_on: list of str or None, default None

If a list, the names of the configuration keys that should be skipped during validation checks. Note: validation checks are not done in this function.

Returns:
dict

Previous configuration as a dictionary.

list

List of configuration keys to be skipped during validation checks.

int

Configuration ID.
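
The id-matching rule can be sketched without any file storage. In this illustration the stored table is replaced by an in-memory list of (idx, config string) pairs. find_config_id_sketch is hypothetical, it returns the largest idx among the exact matches (one reading of the rule above), and starting the numbering at 1 for an empty table is an assumption.

```python
def find_config_id_sketch(prev_rows, oi_config):
    """Return the idx of a matching stored config, or the next unused idx."""
    config_str = str(oi_config)  # configs are compared as strings
    matches = [idx for idx, cfg in prev_rows if cfg == config_str]
    if matches:
        return max(matches)
    largest = max((idx for idx, _ in prev_rows), default=0)
    return largest + 1

rows = [(1, str({"grid_res": 50})), (2, str({"grid_res": 25}))]

print(find_config_id_sketch(rows, {"grid_res": 25}))  # 2
print(find_config_id_sketch(rows, {"grid_res": 10}))  # 3
```

Because the comparison is on the string form, two dictionaries with the same content but a different key order would not match in this sketch.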

GPSat.utils.get_weighted_values(df, ref_col, dist_to_col, val_cols, weight_function='gaussian', drop_weight_cols=True, **weight_kwargs)

Calculate the weighted values of specified columns in a DataFrame based on the distance between two other columns, using a specified weighting function. The current implementation supports a Gaussian weight based on the euclidean distance between the values in ref_col and dist_to_col.

Parameters:
df: pandas.DataFrame

The input DataFrame containing the reference column, distance-to column, and value columns.

ref_col: list of str or str

The name of the column(s) to use as reference points for calculating distances.

dist_to_col: list of str or str

The name of the column(s) to calculate distances to, from ref_col. They should align / correspond to the column(s) set by ref_col.

val_cols: list of str or str

The names of the column(s) for which the weighted values are calculated. Can be a single column name or a list of names.

weight_function: str, optional

The type of weighting function to use. Currently, only “gaussian” is implemented, which applies a Gaussian weighting (exp(-d^2)) based on the squared euclidean distance. The default is “gaussian”.

drop_weight_cols: bool, optional, default True

If False, the total weight and total weighted function values are included in the output.

**weight_kwargs: dict

Additional keyword arguments for the weighting function. For the Gaussian weight, this includes:

  • lengthscale (float): The length scale to use in the Gaussian function. This parameter scales the distance before applying the Gaussian function and must be provided.

Returns:
pandas.DataFrame

A DataFrame containing the weighted values for each of the specified value columns. The output DataFrame has the reference column as the index and each of the specified value columns with their weighted values.

Raises:
AssertionError

If the shapes of the ref_col and dist_to_col do not match, or if the required lengthscale parameter for the Gaussian weighting function is not provided.

NotImplementedError

If a weight_function other than “gaussian” is specified.

Notes

  • The function currently only implements Gaussian weighting. The Gaussian weight is calculated as exp(-d^2 / (2 * l^2)), where d is the Euclidean distance between ref_col and dist_to_col, and l is the lengthscale.

  • This implementation assumes the input DataFrame does not contain NaN values in the reference or distance-to columns. Handling NaN values may require additional preprocessing or the use of fillna methods.

Examples

>>> import pandas as pd
>>>
>>> data = {
...     'ref_col': [0, 1, 0, 1],
...     'dist_to_col': [1, 2, 3, 4],
...     'value1': [10, 20, 30, 40],
...     'value2': [100, 200, 300, 400]
... }
>>> df = pd.DataFrame(data)
>>> weighted_df = get_weighted_values(df, 'ref_col', 'dist_to_col', ['value1', 'value2'], lengthscale=1.0)
>>> print(weighted_df)
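
The Gaussian weighting in the Notes can be worked through by hand on the example data. This is a standard-library sketch, not the GPSat implementation: it assumes the result for each reference value is the weighted sum divided by the total weight, and it handles a single value column only.

```python
import math
from collections import defaultdict

ref = [0, 1, 0, 1]
dist_to = [1, 2, 3, 4]
value1 = [10, 20, 30, 40]
lengthscale = 1.0

weight_sum = defaultdict(float)
weighted_sum = defaultdict(float)
for r, t, v in zip(ref, dist_to, value1):
    d = abs(r - t)                                   # 1-d Euclidean distance
    w = math.exp(-d**2 / (2 * lengthscale**2))       # Gaussian weight
    weight_sum[r] += w
    weighted_sum[r] += w * v

# Normalised weighted value per reference value (assumed output form)
weighted = {r: weighted_sum[r] / weight_sum[r] for r in weight_sum}
print(weighted)
```

With a lengthscale of 1, the rows at distance 1 dominate those at distance 3, so each result sits close to the value of the nearer row.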
GPSat.utils.glue_local_predictions(preds_df: DataFrame, expert_locs_df: DataFrame, sigma: int | float | list = 3) -> DataFrame

Deprecated. Use glue_local_predictions_1d and glue_local_predictions_2d instead.

Glues overlapping predictions by taking a normalised Gaussian weighted average.

Warning: This method only deals with expert locations on a regular grid.

Parameters:
preds_df: pd.DataFrame

DataFrame containing predictions generated from local expert OI. It should have the following columns:

  • pred_loc_x (float): The x-coordinate of the prediction location.

  • pred_loc_y (float): The y-coordinate of the prediction location.

  • f* (float): The predictive mean at the location (pred_loc_x, pred_loc_y).

  • f*_var (float): The predictive variance at the location (pred_loc_x, pred_loc_y).

expert_locs_df: pd.DataFrame

DataFrame containing local expert locations used to perform optimal interpolation. It should have the following columns:

  • x (float): The x-coordinate of the expert location.

  • y (float): The y-coordinate of the expert location.

sigma: int, float, or list, default 3

The standard deviation of the Gaussian weighting in the x and y directions.

  • If a single value is provided, it is used for both directions.

  • If a list is provided, the first value is used for the x direction and the second value is used for the y direction. Defaults to 3.

Returns:
pd.DataFrame:

Dataframe consisting of glued predictions (mean and std). It has the following columns:

  • pred_loc_x (float): The x-coordinate of the prediction location.

  • pred_loc_y (float): The y-coordinate of the prediction location.

  • f* (float): The glued predictive mean at the location (pred_loc_x, pred_loc_y).

  • f*_std (float): The glued predictive standard deviation at the location (pred_loc_x, pred_loc_y).

Notes

  • The function assumes that the expert locations are equally spaced in both the x and y directions.

  • The function uses the scipy.stats.norm.pdf function to compute the Gaussian weights.

  • The function normalizes the weighted sums with the total weights at each location.
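A one-dimensional version of the normalised Gaussian weighted average can be sketched as follows. This is illustrative only: it assumes that distances and sigma are in the same units, uses a hand-written Gaussian density in place of scipy.stats.norm.pdf, and glues only the predictive mean.

```python
import math

def gaussian_pdf(z, sigma):
    """Density of a zero-mean normal distribution with standard deviation sigma."""
    return math.exp(-0.5 * (z / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Two experts predict at the same location, pred_loc = 1.0
expert_x = [0.0, 2.0]     # expert locations
pred_mean = [1.0, 3.0]    # each expert's predictive mean at pred_loc
pred_loc = 1.0
sigma = 3

# Weight each expert by its distance to the prediction location
weights = [gaussian_pdf(pred_loc - ex, sigma) for ex in expert_x]

# Normalise the weighted sum by the total weight
glued_mean = sum(w * m for w, m in zip(weights, pred_mean)) / sum(weights)
print(glued_mean)
```

Here both experts are equally distant from the prediction location, so the glued mean is the plain average of the two predictions.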

GPSat.utils.grid_2d_flatten(x_range, y_range, grid_res=None, step_size=None, num_step=None, center=True)

Create a 2D grid of points defined by x and y ranges, with the option to specify the grid resolution, step size, or number of steps. The resulting grid is flattened and concatenated into a 2D array of (x,y) coordinates.

Parameters:
x_range: tuple or list of floats

Two values representing the minimum and maximum values of the x-axis range.

y_range: tuple or list of floats

Two values representing the minimum and maximum values of the y-axis range.

grid_res: float or None, default None

The grid resolution, i.e. the distance between adjacent grid points. If specified, this parameter takes precedence over step_size and num_step.

step_size: float or None, default None

The step size between adjacent grid points. If specified, this parameter takes precedence over num_step.

num_step: int or None, default None

The number of steps between the minimum and maximum values of the x and y ranges. If specified, this parameter is used only if grid_res and step_size are not specified (are None). Note: the number of steps includes the starting point, so from 0 to 1 is two steps.

center: bool, default True
  • If True, the resulting grid points will be the centers of the grid cells.

  • If False, the resulting grid points will be the edges of the grid cells.

Returns:
ndarray

A 2D array of (x,y) coordinates, where each row represents a single point in the grid.

Raises:
AssertionError

If grid_res, step_size and num_step are all unspecified. Must specify at least one.

Examples

>>> from GPSat.utils import grid_2d_flatten
>>> grid_2d_flatten(x_range=(0, 2), y_range=(0, 2), grid_res=1)
array([[0.5, 0.5],
       [1.5, 0.5],
       [0.5, 1.5],
       [1.5, 1.5]])
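
The ordering of the rows in the example (x varying fastest) is what np.meshgrid followed by ravel() produces. A sketch of the center=True case under that assumption, using a hypothetical helper that only supports grid_res:

```python
import numpy as np

def grid_centers_flat_sketch(x_range, y_range, grid_res):
    """Flattened (x, y) cell centers for a regular grid (illustrative helper)."""
    n_x = int(round((x_range[1] - x_range[0]) / grid_res))
    n_y = int(round((y_range[1] - y_range[0]) / grid_res))
    # Cell centers sit half a cell in from the lower edge
    x_centers = x_range[0] + grid_res * (np.arange(n_x) + 0.5)
    y_centers = y_range[0] + grid_res * (np.arange(n_y) + 0.5)
    # With meshgrid's default "xy" indexing, x varies fastest after ravel()
    xx, yy = np.meshgrid(x_centers, y_centers)
    return np.column_stack([xx.ravel(), yy.ravel()])

points = grid_centers_flat_sketch(x_range=(0, 2), y_range=(0, 2), grid_res=1)
print(points)
```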
GPSat.utils.guess_track_num(x, thresh, start_track=0)
GPSat.utils.inverse_sigmoid(y, low=0, high=1)
GPSat.utils.inverse_softplus(y, shift=0)
GPSat.utils.json_load(file_path)

This function loads a JSON file from the specified file path and applies a nested dictionary literal evaluation (nested_dict_literal_eval) to convert any string keys in the format of ‘(…,…)’ to tuple keys.

The resulting dictionary is returned.

Parameters:
file_path: str

The path to the JSON file to be loaded.

Returns:
dict or list of dict

The loaded JSON file as a dictionary or list of dictionaries.

Examples

Assuming a JSON file named 'config.json' with the following contents:

    {
        "key1": "value1",
        "('key2', 'key3')": "value2",
        "key4": {"('key5', 'key6')": "value3"}
    }

The following code will load the file and convert the "('key2', 'key3')" and "('key5', 'key6')" keys to tuple keys:

>>> config = json_load('config.json')
>>> print(config)
{'key1': 'value1', ('key2', 'key3'): 'value2', 'key4': {('key5', 'key6'): 'value3'}}
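
The key conversion can be sketched with json and ast.literal_eval from the standard library. literal_eval_keys_sketch is a hypothetical stand-in for nested_dict_literal_eval, and converting only keys that evaluate to a tuple is an assumption of this sketch.

```python
import ast
import json

def literal_eval_keys_sketch(obj):
    """Recursively turn string keys such as "('a', 'b')" into tuple keys."""
    if isinstance(obj, dict):
        out = {}
        for key, value in obj.items():
            new_key = key
            try:
                parsed = ast.literal_eval(key)
                if isinstance(parsed, tuple):
                    new_key = parsed
            except (ValueError, SyntaxError):
                pass  # ordinary string keys are left unchanged
            out[new_key] = literal_eval_keys_sketch(value)
        return out
    if isinstance(obj, list):
        return [literal_eval_keys_sketch(v) for v in obj]
    return obj

raw = """
{
    "key1": "value1",
    "('key2', 'key3')": "value2",
    "key4": {"('key5', 'key6')": "value3"}
}
"""
config = literal_eval_keys_sketch(json.loads(raw))
print(config)
# {'key1': 'value1', ('key2', 'key3'): 'value2', 'key4': {('key5', 'key6'): 'value3'}}
```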

GPSat.utils.json_serializable(d, max_len_df=100)

Converts a dictionary to a format that can be stored as JSON via the json.dumps() method.

Parameters:
d: dict

The dictionary to be converted.

max_len_df: int, default 100

The maximum length of a Pandas DataFrame or Series that will be converted to a dictionary. If the length of the DataFrame or Series is greater than this value, it will be stored as a string. Defaults to 100.

Returns:
dict

The converted dictionary.

Raises:
AssertionError

If the input is not a dictionary.

Notes

  • If a key in the dictionary is a tuple, it will be converted to a string. To recover the original tuple, use nested_dict_literal_eval.

  • If a value in the dictionary is a dictionary, the function will be called recursively to convert it.

  • If a value in the dictionary is a NumPy array, it will be converted to a list.

  • If a value in the dictionary is a Pandas DataFrame or Series, it will be converted to a dictionary and the function will be called recursively to convert it if its length is less than or equal to max_len_df. Otherwise, it will be stored as a string.

  • If a value in the dictionary is not JSON serializable, it will be cast as a string.
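A reduced version of these rules, covering tuple keys, nested dictionaries and the string fallback, can be sketched with the standard library. to_json_ready_sketch is hypothetical, and the NumPy and Pandas branches are left out to keep the example dependency-free.

```python
import json

def to_json_ready_sketch(d):
    """Return a copy of d that json.dumps can handle (partial illustration)."""
    assert isinstance(d, dict), "input must be a dictionary"
    out = {}
    for key, value in d.items():
        if isinstance(key, tuple):
            key = str(key)                      # tuple keys become strings
        if isinstance(value, dict):
            value = to_json_ready_sketch(value)  # recurse into nested dicts
        else:
            try:
                json.dumps(value)
            except TypeError:
                value = str(value)              # fall back to a string
        out[key] = value
    return out

d = {("a", "b"): {"z": 1 + 2j}, "n": 3}
ready = to_json_ready_sketch(d)
text = json.dumps(ready)
print(text)  # {"('a', 'b')": {"z": "(1+2j)"}, "n": 3}
```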

GPSat.utils.log_lines(*args, level='debug')

This function logs lines to a file with a specified logging level.

This function takes in any number of arguments and a logging level.

The function checks that the logging level is valid and then iterates through the arguments.

If an argument is a string, integer, float, dictionary, tuple, or list, it is printed and logged with the specified logging level.

If an argument is not one of these types, it is not logged and a message is printed indicating the argument’s type.

Parameters:
*args: tuple

Arguments to be passed to logging using the method specified by level.

level: str, default “debug”

Must be one of [“debug”, “info”, “warning”, “error”, “critical”]. Each argument provided is logged with getattr(logging, level)(arg).

Returns:
None
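
The dispatch via getattr(logging, level) can be sketched as below. log_args_sketch is a hypothetical stand-in for the function, and the small list-backed handler exists only so that the example can show what was logged.

```python
import logging

def log_args_sketch(*args, level="debug"):
    """Print and log each argument of a supported type at the given level."""
    assert level in ["debug", "info", "warning", "error", "critical"]
    for arg in args:
        if isinstance(arg, (str, int, float, dict, tuple, list)):
            print(arg)
            getattr(logging, level)(arg)   # e.g. logging.info(arg)
        else:
            print(f"not logging argument of type: {type(arg)}")

class ListHandler(logging.Handler):
    """Collects formatted messages in a list (for demonstration only)."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

handler = ListHandler()
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.DEBUG)

log_args_sketch("start", {"a": 1}, object(), level="info")
print(handler.messages)  # ['start', "{'a': 1}"]
```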
GPSat.utils.match(x, y, exact=True, tol=1e-09)

This function takes two arrays, x and y, and returns an array of indices indicating where the elements of x match the elements of y. Can match exactly or within a specified tolerance.

Parameters:
x: array-like

the first array to be matched. If not an array will convert via to_array.

y: array-like

the second array to be matched against. If not an array will convert via to_array.

exact: bool, default=True.

If True, the function matches exactly. If False, the function matches within a specified tolerance.

tol: float, optional, default=1e-9.

The tolerance used for matching when exact=False.

Returns:
indices: array

the indices of the matching elements in y for each element in x.

Raises:
AssertionError: if any element in x is not found in y or if multiple matches are found for any element in x.

Note

x and y must be arrays or be convertible to arrays by to_array. If exact=False, the function only makes sense with floats; use exact=True for int and str. If both x and y are large, with lengths n and m, this function can use a lot of memory, as an intermediate bool array of size n x m is created. If there are multiple matches of x in y, the index of the first match is returned.
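
The n x m intermediate array mentioned in the note comes from comparing every element of x with every element of y. Below is a minimal NumPy sketch of that idea. The name `match_sketch` is hypothetical, the comparison `<= tol` is an assumption, and the sketch returns the first match without checking for duplicates.

```python
import numpy as np


def match_sketch(x, y, exact=True, tol=1e-9):
    """Hypothetical sketch: index in y of each element of x."""
    x = np.atleast_1d(np.asarray(x))
    y = np.atleast_1d(np.asarray(y))
    if exact:
        # Broadcasting builds a bool array of shape (len(x), len(y)).
        hits = x[:, None] == y[None, :]
    else:
        # Assumption: values within tol of each other count as a match.
        hits = np.abs(x[:, None] - y[None, :]) <= tol
    assert hits.any(axis=1).all(), "some element of x was not found in y"
    # argmax returns the position of the first True in each row.
    return hits.argmax(axis=1)


exact_idx = match_sketch([3, 1], [1, 2, 3])
approx_idx = match_sketch([0.3], [0.1 + 0.2, 1.0], exact=False)
```

The second call shows why a tolerance is useful for floats: 0.1 + 0.2 is not exactly equal to 0.3, so an exact comparison would fail to find it.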

GPSat.utils.move_to_archive(top_dir, file_names=None, suffix='', archive_sub_dir='Archive', verbose=False)

Moves specified files from a directory to an archive sub-directory within the same directory. Moved files will have a suffix added on before file extension.

Parameters:
top_dir: str

The path to the directory containing the files to be moved.

file_names: list of str, default None

The names of the files to be moved. If not specified, all files in the directory will be moved.

suffix: str, default ""

A string to be added to the end of the file name before the extension in the archive directory.

archive_sub_dir: str, default 'Archive'

The name of the sub-directory within the top directory where the files will be moved.

verbose: bool, default False

If True, prints information about the files being moved.

Returns:
None

The function only moves files and does not return anything.

Note

If the archive sub-directory does not exist, it will be created.

If a file with the same name as the destination file already exists in the archive sub-directory, it will be overwritten.

Raises:
AssertionError

If top_dir does not exist or file_names is not specified.

Examples

Move all files in directory to archive sub-directory:

>>> move_to_archive("path/to/directory")

Move specific files to archive sub-directory with a suffix added to the file name:

>>> move_to_archive("path/to/directory", file_names=["file1.txt", "file2.txt"], suffix="_backup")

Move specific files to a custom archive sub-directory:

>>> move_to_archive("path/to/directory", file_names=["file1.txt", "file2.txt"], archive_sub_dir="Old Files")
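
To make the suffix placement and overwrite behaviour concrete, here is a minimal standard-library sketch run against a temporary directory. The name `move_to_archive_sketch` is hypothetical. Unlike the documented function, it returns the destination paths so the result can be inspected, and it uses os.replace as one possible way to overwrite existing files.

```python
import os
import tempfile


def move_to_archive_sketch(top_dir, file_names, suffix="", archive_sub_dir="Archive"):
    """Hypothetical sketch: move files into top_dir/archive_sub_dir, adding a suffix."""
    assert os.path.exists(top_dir), f"top_dir does not exist: {top_dir}"
    archive_dir = os.path.join(top_dir, archive_sub_dir)
    # Create the archive sub-directory if it is missing.
    os.makedirs(archive_dir, exist_ok=True)
    moved = []
    for name in file_names:
        base, ext = os.path.splitext(name)
        # The suffix goes between the base name and the extension.
        dst = os.path.join(archive_dir, f"{base}{suffix}{ext}")
        # os.replace overwrites an existing destination file.
        os.replace(os.path.join(top_dir, name), dst)
        moved.append(dst)
    return moved


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "file1.txt"), "w") as f:
        f.write("example")
    archived = move_to_archive_sketch(tmp, ["file1.txt"], suffix="_backup")
    archived_names = [os.path.basename(p) for p in archived]
    still_in_top_dir = os.path.exists(os.path.join(tmp, "file1.txt"))
```

Here file1.txt ends up as Archive/file1_backup.txt and no longer exists in the top directory.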

GPSat.utils.nested_dict_literal_eval(d, verbose=False)

Converts a nested dictionary with string keys that represent tuples to a dictionary with tuple keys.

Parameters:
d: dict

The nested dictionary to be converted.

verbose: bool, default False

If True, prints information about the keys being converted.

Returns:
dict

The converted dictionary with tuple keys.

Raises:
ValueError: If a string key cannot be evaluated as a tuple.

Note

This function modifies the original dictionary in place.
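
The key recovery relies on evaluating the string form of a tuple. The following minimal sketch uses ast.literal_eval. The name `restore_tuple_keys` is hypothetical, and evaluating only keys wrapped in parentheses is an assumption made here so that ordinary string keys are left untouched.

```python
import ast


def restore_tuple_keys(d):
    """Hypothetical sketch: turn keys such as "(1, 2)" back into tuples, in place."""
    for key in list(d.keys()):
        value = d[key]
        if isinstance(value, dict):
            # Recurse into nested dictionaries first.
            restore_tuple_keys(value)
        if isinstance(key, str) and key.startswith("(") and key.endswith(")"):
            # literal_eval raises ValueError for content it cannot evaluate.
            new_key = ast.literal_eval(key)
            if isinstance(new_key, tuple):
                d[new_key] = d.pop(key)
    return d


loaded = {"(1, 2)": {"('a', 'b')": 0.5}, "name": "run_1"}
restored = restore_tuple_keys(loaded)
```

Because the dictionary is modified in place, `restored` and `loaded` refer to the same object. Copy the input first if the string-keyed version is still needed.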

GPSat.utils.nll(y, mu, sig, return_tot=True)
GPSat.utils.not_nan(x)
GPSat.utils.pandas_to_dict(x)

Converts a pandas Series or DataFrame (row) to a dictionary.

Parameters:
x: pd.Series, pd.DataFrame or dict

The input object to be converted to a dictionary.

Returns:
dict:

A dictionary representation of the input object.

Raises:
AssertionError: If the input object is a DataFrame with more than one row.

Warning

If the input object is not a pandas Series, DataFrame, or dictionary, a warning is issued and the input object is returned as is.

Examples

>>> import pandas as pd
>>> data = {'name': ['John', 'Jane'], 'age': [30, 25]}
>>> df = pd.DataFrame(data)
>>> pandas_to_dict(df)
AssertionError: in pandas_to_dict input provided as DataFrame, expected to only have 1 row, shape is: (2, 2)
>>> series = pd.Series(data['name'])
>>> pandas_to_dict(series)
{0: 'John', 1: 'Jane'}
>>> dictionary = {'name': ['John', 'Jane'], 'age': [30, 25]}
>>> pandas_to_dict(dictionary)
{'name': ['John', 'Jane'], 'age': [30, 25]}

select a single row of the dataframe

>>> pandas_to_dict(df.iloc[[0]])
{'name': 'John', 'age': 30}
GPSat.utils.pip_freeze_to_dataframe()
GPSat.utils.pretty_print_class(x)

This function takes in a class object as input and returns a string representation of the class name without the leading "<class '" and trailing "'>".

Alternatively, it will remove a leading '<__main__.' and remove ' object at ', including anything that follows.

The function achieves this by invoking the __str__ method of the class object and then using regular expressions to remove the unwanted characters.

Parameters:
x: an arbitrary class instance
Returns:
str

Examples

>>> class MyClass:
...     pass
...
>>> print(pretty_print_class(MyClass))
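
The regular-expression clean-up described above can be sketched as follows. The name `pretty_print_class_sketch` and the exact patterns are assumptions based on the description, not the GPSat source.

```python
import re


def pretty_print_class_sketch(x):
    """Hypothetical sketch of the string clean-up described above."""
    text = str(x)
    # "<class 'collections.OrderedDict'>" -> "collections.OrderedDict"
    text = re.sub(r"^<class '", "", text)
    text = re.sub(r"'>$", "", text)
    # "<__main__.MyClass object at 0x...>" -> "MyClass"
    text = re.sub(r"^<__main__\.", "", text)
    text = re.sub(r" object at .*$", "", text)
    return text


class_name = pretty_print_class_sketch(dict)
```

Passing the built-in `dict` class gives the string "dict", since `str(dict)` is "<class 'dict'>".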

GPSat.utils.rmse(y, mu)
GPSat.utils.sigmoid(x, low=0, high=1)
GPSat.utils.softplus(x, shift=0)
GPSat.utils.sparse_true_array(shape, grid_space=1, grid_space_offset=0)

Create a boolean numpy array with True values regularly spaced throughout, and False elsewhere.

Parameters:
shape: iterable (e.g. list or tuple)

representing the shape of the output array.

grid_space: int, default 1

representing the spacing between True values.

grid_space_offset: int, default 0

representing the offset of the first True value in each dimension.

Returns:
np.array

A boolean array with dimension equal to shape, with False everywhere except for Trues regularly spaced every ‘grid_space’. The fraction of True will be roughly equal to (1/n)^d where n = grid_space, d = len(shape).

Note

The first dimension is treated as the y dimension. The function allows for grid_space_offset to be specific to each dimension.
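
The (1/n)^d fraction can be checked with a small NumPy sketch. The name `sparse_true_array_sketch` is hypothetical, and the sketch assumes a single scalar grid_space and grid_space_offset shared by all dimensions rather than per-dimension values.

```python
import numpy as np


def sparse_true_array_sketch(shape, grid_space=1, grid_space_offset=0):
    """Hypothetical sketch using one scalar spacing and offset for all dimensions."""
    out = np.zeros(shape, dtype=bool)
    # Select every grid_space-th index, starting at the offset, along each axis.
    index = tuple(slice(grid_space_offset, None, grid_space) for _ in shape)
    out[index] = True
    return out


mask = sparse_true_array_sketch((4, 6), grid_space=2)
fraction_true = mask.mean()
```

With shape (4, 6) and grid_space=2, rows 0 and 2 and columns 0, 2 and 4 are selected, giving 6 True values out of 24, a fraction of 0.25 = (1/2)^2.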

GPSat.utils.stats_on_vals(vals, measure=None, name=None, qs=None)

This function calculates various statistics on a given array of values.

Parameters:
vals: array-like

The input array of values.

measure: str or None, default is None

The name of the measure being calculated.

name: str or None, default is None

The name of the column in the output dataframe. Default is None.

qs: list or None, default None

A list of quantiles to calculate. If None then will use [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99].

Returns:
pd.DataFrame

A DataFrame containing the following statistics:

  • measure: The name of the measure being calculated.
  • size: The number of elements in the input array.
  • num_not_nan: The number of non-NaN elements in the input array.
  • num_inf: The number of infinite elements in the input array.
  • min: The minimum value in the input array.
  • mean: The mean value of the input array.
  • max: The maximum value in the input array.
  • std: The standard deviation of the input array.
  • skew: The skewness of the input array.
  • kurtosis: The kurtosis of the input array.
  • qX: The Xth quantile of the input array, where X is the value in the qs parameter.

Note

The function also includes a timer decorator that calculates the time taken to execute the function.
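
A few of the listed statistics can be reproduced with NumPy alone. This is a rough sketch, not the GPSat implementation: the name `basic_stats_sketch`, the dictionary return type, the `q{value}` labels, and the exclusion of NaN and infinite values from the summary statistics are all assumptions. Skewness and kurtosis are left out.

```python
import numpy as np


def basic_stats_sketch(vals, qs=(0.05, 0.5, 0.95)):
    """Hypothetical sketch: a few of the listed statistics, returned as a dict."""
    vals = np.asarray(vals, dtype=float)
    # Assumption: NaN and infinite values are excluded from the summary statistics.
    finite = vals[np.isfinite(vals)]
    out = {
        "size": vals.size,
        "num_not_nan": int((~np.isnan(vals)).sum()),
        "num_inf": int(np.isinf(vals).sum()),
        "min": finite.min(),
        "mean": finite.mean(),
        "max": finite.max(),
        "std": finite.std(),
    }
    for q in qs:
        # Assumed label format; the real column labels may differ.
        out[f"q{q}"] = np.quantile(finite, q)
    return out


stats = basic_stats_sketch([1.0, 2.0, 3.0, np.nan, np.inf])
```

In this example the infinite value counts towards num_not_nan and num_inf but, under the stated assumption, not towards min, mean, max or the quantiles.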

GPSat.utils.to_array(*args, date_format='%Y-%m-%d')

Converts input arguments to numpy arrays.

Parameters:
*args: tuple

Input arguments to be converted to numpy arrays.

date_format: str, optional

Date format to be used when converting datetime.date objects to numpy arrays.

Returns:
generator

A generator that yields numpy arrays.

Note

This function converts input arguments to numpy arrays:

  • If the input argument is already a numpy array, it is yielded as is.
  • If the input argument is a list or tuple, it is converted to a numpy array and yielded.
  • If the input argument is an integer, float, string, boolean, or numpy boolean, it is converted to a numpy array and yielded.
  • If the input argument is a numpy integer or float, it is converted to a numpy array and yielded.
  • If the input argument is a datetime.date object, it is converted to a numpy array using the specified date format and yielded.
  • If the input argument is a numpy datetime64 object, it is yielded as is.
  • If the input argument is None, an empty numpy array is yielded.
  • If the input argument is of any other data type, a warning is issued and the input argument is converted to a numpy array of type object and yielded.

Examples

>>> import datetime
>>> import numpy as np
>>> x = [1, 2, 3]

since the function returns a generator, get values out with next

>>> print(next(to_array(x)))
[1 2 3]

or, for a single array-like object, assign with

>>> c, =  to_array(x)
>>> y = np.array([4, 5, 6])
>>> z = datetime.date(2021, 1, 1)
>>> for arr in to_array(x, y, z):
...     print(f"arr type: {type(arr)}, values: {arr}")
arr type: <class 'numpy.ndarray'>, values: [1 2 3]
arr type: <class 'numpy.ndarray'>, values: [4 5 6]
arr type: <class 'numpy.ndarray'>, values: ['2021-01-01']
GPSat.utils.track_num_for_date(x)