Utils
TODO: Divide utils into specific categories.
- GPSat.utils.EASE2toWGS84(x, y, return_vals='both', lon_0=0, lat_0=90)
Converts EASE2 grid coordinates to WGS84 longitude and latitude coordinates.
- Parameters:
- x: float
EASE2 grid x-coordinate in meters.
- y: float
EASE2 grid y-coordinate in meters.
- return_vals: str, optional
Determines what values to return. Valid options are "both" (default), "lon", or "lat".
- lon_0: float, optional
Longitude of the center of the EASE2 grid in degrees. Default is 0.
- lat_0: float, optional
Latitude of the center of the EASE2 grid in degrees. Default is 90.
- Returns:
- tuple or float
Depending on the value of return_vals, either a tuple of WGS84 longitude and latitude coordinates (both floats), or a single float representing either the longitude or latitude.
- Raises:
- AssertionError
If return_vals is not one of the valid options.
Examples
>>> EASE2toWGS84(1000000, 2000000)
(153.434948822922, 69.86894542225777)
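The longitude half of this conversion can be checked by hand. The sketch below is not the GPSat implementation: it assumes a north-polar azimuthal grid (lat_0=90), and `ease2_north_lon` is a hypothetical helper name. Recovering the latitude needs the ellipsoidal equal-area inverse and is best left to a projection library.

```python
import math

def ease2_north_lon(x, y, lon_0=0.0):
    """Longitude in degrees of a point on a north-polar azimuthal grid.

    Hypothetical helper for illustration; assumes lat_0 = 90.
    """
    # On such a grid: x = rho * sin(lon - lon_0), y = -rho * cos(lon - lon_0)
    lon = lon_0 + math.degrees(math.atan2(x, -y))
    # wrap the result into [-180, 180)
    return (lon + 180.0) % 360.0 - 180.0

lon = ease2_north_lon(1_000_000, 2_000_000)
print(round(lon, 6))  # 153.434949
```

The value agrees with the longitude in the example above.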
- GPSat.utils.EASE2toWGS84_New(*args, **kwargs)
- GPSat.utils.WGS84toEASE2(lon, lat, return_vals='both', lon_0=0, lat_0=90)
Converts WGS84 longitude and latitude coordinates to EASE2 grid coordinates.
- Parameters:
- lon: float
Longitude coordinate in decimal degrees.
- lat: float
Latitude coordinate in decimal degrees.
- return_vals: str, optional
Determines what values to return. Valid options are "both" (default), "x", or "y".
- lon_0: float, optional
Longitude of the center of the EASE2 grid in decimal degrees. Default is 0.
- lat_0: float, optional
Latitude of the center of the EASE2 grid in decimal degrees. Default is 90.
- Returns:
- float
If return_vals is "x". Returns the x EASE2 grid coordinate in meters.
- float
If return_vals is "y". Returns the y EASE2 grid coordinate in meters.
- tuple of float
If return_vals is "both". Returns a tuple of (x, y) EASE2 grid coordinates in meters.
- Raises:
- AssertionError
If return_vals is not one of the valid options.
Examples
>>> WGS84toEASE2(-105.01621, 39.57422)
(-5254767.014984061, 1409604.1043472202)
- GPSat.utils.WGS84toEASE2_New(*args, **kwargs)
- GPSat.utils.array_to_dataframe(x, name, dim_prefix='_dim_', reset_index=False)
Converts a numpy array to a pandas DataFrame with a multi-index based on the array’s dimensions.
(Also see dataframe_to_array)
- Parameters:
- x: np.ndarray
The numpy array to be converted to a DataFrame.
- name: str
The name of the column in the resulting DataFrame.
- dim_prefix: str, optional
The prefix to be used for the dimension names in the multi-index. Default is "_dim_". Integers will be appended to dim_prefix for each dimension of x, i.e. if x is 2d, it will have dimension names "_dim_0", "_dim_1", assuming default dim_prefix is used.
- reset_index: bool, optional
Whether to reset the index of the resulting DataFrame. Default is False.
- Returns:
- out: pd.DataFrame
The resulting DataFrame with a multi-index based on the dimensions of the input array.
- Raises:
- AssertionError
If the input is not a numpy array.
Examples
>>> # express a 2d numpy array in DataFrame
>>> x = np.array([[1, 2], [3, 4]])
>>> array_to_dataframe(x, "data")
               data
_dim_0 _dim_1
0      0          1
       1          2
1      0          3
       1          4
- GPSat.utils.assign_category_col(val, df, categories=None)
Generate categorical pd.Series equal in length to a reference DataFrame (df)
- Parameters:
- val: str
The value to assign to the categorical Series.
- df: pandas DataFrame
Reference DataFrame, used to determine the length of the output.
- categories: list, optional
A list of categories to be used for the categorical column.
- Returns:
- pandas Categorical Series
A categorical column with the assigned value and specified categories (if provided).
Notes
This function creates a new categorical column in the DataFrame with the specified value and categories. If categories are not provided, they will be inferred from the data. The function returns a pandas Categorical object representing the new column.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
>>> x_series = assign_category_col('x', df)
- GPSat.utils.bin_obs_by_date(df, val_col, date_col='date', all_dates_in_range=True, x_col='x', y_col='y', grid_res=None, date_col_format='%Y%m%d', x_min=-4500000.0, x_max=4500000.0, y_min=-4500000.0, y_max=4500000.0, n_x=None, n_y=None, bin_statistic='mean', verbose=False)
This function takes in a pandas DataFrame and bins the data based on the values in a specified column and the x and y coordinates in other specified columns. The data is binned based on a grid with a specified resolution or number of bins. The function returns a dictionary of binned values for each unique date in the DataFrame.
- Parameters:
- df: pandas DataFrame
A DataFrame containing the data to be binned.
- val_col: string
Name of the column containing the values to be binned.
- date_col: string, default “date”
Name of the column containing the dates for which to bin the data.
- all_dates_in_range: boolean, default True
Whether to include all dates in the range of the DataFrame.
- x_col: string, default “x”
Name of the column containing the x coordinates.
- y_col: string, default “y”
Name of the column containing the y coordinates.
- grid_res: float or int, default None
Resolution of the grid in kilometers. If None, then n_x and n_y must be specified.
- date_col_format: string, default “%Y%m%d”
Format of the date column.
- x_min: float, default -4500000.0
Minimum x value for the grid.
- x_max: float, default 4500000.0
Maximum x value for the grid.
- y_min: float, default -4500000.0
Minimum y value for the grid.
- y_max: float, default 4500000.0
Maximum y value for the grid.
- n_x: int, default None
Number of bins in the x direction.
- n_y: int, default None
Number of bins in the y direction.
- bin_statistic: string or callable, default “mean”
Statistic to compute in each bin.
- verbose: boolean, default False
Whether to print additional information during execution.
- Returns:
- bvals: dictionary
The binned values for each unique date in the DataFrame.
- x_edge: numpy array
x values for the edges of the bins.
- y_edge: numpy array
y values for the edges of the bins.
Notes
The x and y coordinates are swapped in the returned binned values due to the transpose operation used in the function.
- GPSat.utils.check_prev_oi_config(prev_oi_config, oi_config, skip_valid_checks_on=None)
This function checks if the previous configuration matches the current one. It takes in two dictionaries, prev_oi_config and oi_config, which represent the previous and current configurations respectively.
The function also takes an optional list skip_valid_checks_on, which contains keys that should be skipped during the comparison.
- Parameters:
- prev_oi_config: dict
Previous configuration to be compared against.
- oi_config: dict
Current configuration to compare against prev_oi_config.
- skip_valid_checks_on: list or None, default None
If not None, should be a list of keys to not check.
- Returns:
- None
Notes
If skip_valid_checks_on is not provided, it defaults to an empty list. The function then compares the two configurations and raises an AssertionError if any key-value pairs do not match.
This function assumes that the configurations are represented as dictionaries and that the keys in both dictionaries are the same.
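The comparison described above can be sketched as follows. This is a minimal illustration under assumptions, not the GPSat source: `check_configs_match` and its `skip_keys` argument are hypothetical names, and the real function may compare values differently.

```python
def check_configs_match(prev_config, config, skip_keys=None):
    """Raise AssertionError if any non-skipped key differs between the two dicts."""
    skip_keys = [] if skip_keys is None else skip_keys
    for key, value in config.items():
        if key in skip_keys:
            continue
        assert key in prev_config, f"key '{key}' is missing from the previous config"
        assert prev_config[key] == value, f"key '{key}' differs from the previous config"

prev = {"data": {"col": "obs"}, "run": {"verbose": True}}
curr = {"data": {"col": "obs"}, "run": {"verbose": False}}

# passes: the "run" key is skipped
check_configs_match(prev, curr, skip_keys=["run"])

try:
    check_configs_match(prev, curr)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
print(mismatch_detected)  # True
```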
- GPSat.utils.compare_dataframes(df1, df2, merge_on, columns_to_compare, drop_other_cols=False, how='outer', suffixes=['_1', '_2'])
- GPSat.utils.config_func(func, source=None, args=None, kwargs=None, col_args=None, col_kwargs=None, df=None, filename_as_arg=False, filename=None, col_numpy=True)
Apply a function based on configuration input.
The aim is to allow one to apply a function, possibly on data from a DataFrame, using a specification that can be stored in a JSON configuration file.
Note
This function uses eval(), so it could allow arbitrary code execution.
If a DataFrame df is provided, inputs can be supplied from its columns via col_args and/or col_kwargs.
- Parameters:
- func: str or callable
If str, it will use eval(func) to convert it to a function.
If it contains one of "|", "&", "=", "+", "-", "*", "/", "%", "<", and ">", it will create a lambda function:
lambda arg1, arg2: eval(f"arg1 {func} arg2")
If eval(func) raises NameError and source is not None, it will run
f"from {source} import {func}"
and try again. This allows the function to be imported from a source package.
- source: str or None, default None
Package name where func can be found, if applicable. Used to import func from a package. e.g.
>>> GPSat.utils.config_func(func="cumprod", source="numpy", ...)
calls the function cumprod from the package numpy.
- args: list or None, default None
If None, an empty list will be used, i.e. no args will be used. The values will be unpacked and provided to func: i.e. func(*args, **kwargs)
- kwargs: dict or None, default None
If dict, it will be unpacked (**kwargs) to provide key word arguments to func.
- col_args: None or list of str, default None
If DataFrame df is provided, it can use col_args to specify which columns of df will be passed into func as arguments.
- col_kwargs: None or dict, default is None
Keyword arguments to be passed to func specified as dict whose keys are parameters of func and values are column names of a DataFrame df. Only applicable if df is provided.
- df: DataFrame or None, default None
To provide if one wishes to use columns of a DataFrame as arguments to func.
- filename_as_arg: bool, default False
Set True if filename is used as an argument to func.
- filename: str or None, default None
If filename_as_arg is True, then will provide filename as first arg.
- col_numpy: bool, default True
If True, when extracting columns from DataFrame, .values is used to convert to numpy array.
- Returns:
- any
Values returned by applying func on data. The type depends on func.
- Raises:
- AssertionError
If kwargs is not a dict, col_kwargs is not a dict, or func is not a string or callable.
- AssertionError
If df is not provided but col_args or col_kwargs are.
- AssertionError
If func is a string and cannot be imported on its own and source is None.
Examples
>>> import pandas as pd
>>> from GPSat.utils import config_func
>>> config_func(func="lambda x, y: x + y", args=[1, 1]) # Computes 1 + 1
2
>>> config_func(func="==", args=[1, 1]) # Computes 1 == 1
True
Using columns of a DataFrame as inputs:
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> config_func(func="lambda x, y: x + y", df=df, col_args=["A", "B"]) # Computes df["A"] + df["B"]
array([5, 7, 9])
>>> config_func(func="<=", col_args=["A", "B"], df=df) # Computes df["A"] <= df["B"]
array([ True,  True,  True])
We can also use functions from an external package by specifying source. For example, the below reproduces the last example in numpy.cumprod:
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> config_func(func="cumprod", source="numpy", df=df, kwargs={"axis": 0}, col_args=[["A", "B"]])
array([[  1,   4],
       [  2,  20],
       [  6, 120]])
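To make the operator-string behaviour concrete, here is a minimal sketch of how a string such as "==" can become a two-argument function, following the lambda given in the description of func. `operator_to_function` is a hypothetical name, not part of GPSat. Because it relies on eval(), it should only ever be given trusted strings.

```python
def operator_to_function(op):
    """Turn an operator string such as "<=" into a two-argument function.

    Hypothetical helper mirroring the lambda described for func.
    Uses eval(), so only pass trusted strings.
    """
    return lambda arg1, arg2: eval(f"arg1 {op} arg2")

is_equal = operator_to_function("==")
add = operator_to_function("+")
print(is_equal(1, 1))  # True
print(add(2, 3))       # 5
```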
- GPSat.utils.convert_lon_lat_str(x)
Converts a string representation of longitude or latitude to a float value.
- Parameters:
- x: str
A string representation of longitude or latitude in the format of "[degrees] [minutes] [direction]", where [direction] is one of "N", "S", "E", or "W".
- Returns:
- float
The converted value of the input string as a float.
- Raises:
- AssertionError
If the input is not a string.
Examples
>>> convert_lon_lat_str('74 0.1878 N')
74.00313
>>> convert_lon_lat_str('140 0.1198 W')
-140.001997
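The arithmetic behind these values is degrees plus minutes divided by 60, negated for "S" and "W". The sketch below reproduces it with a hypothetical stand-in, `parse_deg_min`; the actual GPSat function may parse or round differently.

```python
def parse_deg_min(text):
    """Parse "[degrees] [minutes] [direction]" into signed decimal degrees.

    Hypothetical stand-in for illustration, not the GPSat source.
    """
    assert isinstance(text, str), "input must be a string"
    degrees, minutes, direction = text.split()
    value = float(degrees) + float(minutes) / 60.0
    # south and west are negative by convention
    return -value if direction.upper() in ("S", "W") else value

print(round(parse_deg_min("74 0.1878 N"), 6))   # 74.00313
print(round(parse_deg_min("140 0.1198 W"), 6))  # -140.001997
```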
- GPSat.utils.cprint(x, c='ENDC', bcolors=None, sep=' ', end='\n')
Add color to print statements.
Based on https://stackoverflow.com/questions/287871/how-do-i-print-colored-text-to-the-terminal.
- Parameters:
- x: str
String to be printed.
- c: str, default “ENDC”
Valid key in bcolors. If bcolors is not provided, then default will be used, containing keys: 'HEADER', 'OKBLUE', 'OKCYAN', 'OKGREEN', 'WARNING', 'FAIL', 'ENDC', 'BOLD', 'UNDERLINE'.
- bcolors: dict or None, default None
Dict with values being colors / how to format the font. These can be chained together. See the codes in: https://en.wikipedia.org/wiki/ANSI_escape_code#3-bit_and_4-bit.
- sep: str, default “ “
sep argument passed along to print().
- end: str, default “\n”
end argument passed along to print().
- Returns:
- None
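As an illustration of how such colour keys map to terminal output, the sketch below wraps text in ANSI escape sequences. It is not the GPSat implementation: `colorize`, `color_print` and the reduced `BCOLORS` table are hypothetical, and only a subset of the keys listed above is included.

```python
# A subset of colour keys; the values are standard ANSI escape sequences.
BCOLORS = {
    "OKGREEN": "\033[92m",
    "WARNING": "\033[93m",
    "FAIL": "\033[91m",
    "BOLD": "\033[1m",
    "ENDC": "\033[0m",
}

def colorize(text, c="ENDC", bcolors=None):
    """Wrap text in the escape code stored under key c, then reset formatting."""
    bcolors = BCOLORS if bcolors is None else bcolors
    return f"{bcolors[c]}{text}{bcolors['ENDC']}"

def color_print(text, c="ENDC", bcolors=None, sep=" ", end="\n"):
    """Hypothetical coloured print(); sep and end are passed to print()."""
    print(colorize(text, c, bcolors), sep=sep, end=end)

color_print("done", c="OKGREEN")
# codes can be chained, e.g. bold followed by red
color_print("failed", c="BOLDFAIL", bcolors={**BCOLORS, "BOLDFAIL": "\033[1m\033[91m"})
```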
- GPSat.utils.dataframe_to_2d_array(df, x_col, y_col, val_col, tol=1e-09, fill_val=nan, dtype=None, decimals=1)
Extract values from DataFrame to create a 2-d array of values (val_col) - assuming the values came from a 2-d array. Requires dimension columns x_col, y_col (do not have to be ordered in DataFrame).
- Parameters:
- df: pandas.DataFrame
The dataframe to convert to a 2D array.
- x_col: str
The name of the column in the dataframe that contains the x coordinates.
- y_col: str
The name of the column in the dataframe that contains the y coordinates.
- val_col: str
The name of the column in the dataframe that contains the values to be placed in the 2D array.
- tol: float, default 1e-9
The tolerance for matching the x and y coordinates to the grid.
- fill_val: float, default np.nan
The value to fill the 2D array with if a coordinate is missing.
- dtype: str or numpy.dtype or None, default None
The data type of the values in the 2D array.
- decimals: int, default 1
The number of decimal places to round x and y values to before taking unique. If decimals is negative, it specifies the number of positions to the left of the decimal point.
- Returns:
- tuple
A tuple containing the 2D numpy array of values, the x coordinates of the grid, and the y coordinates of the grid.
- Raises:
- AssertionError
If any of the required columns are missing from the dataframe, or if any coordinates have more than one value.
Notes
The spacing of the grid is determined by the smallest step size in the x_col, y_col direction, respectively.
This is meant to reverse the process of putting values from a regularly spaced grid into a DataFrame. Do not expect this to work on arbitrary x,y coordinates.
- GPSat.utils.dataframe_to_array(df, val_col, idx_col=None, dropna=True, fill_val=nan)
Converts a pandas DataFrame to a numpy array, where the DataFrame has columns that represent dimensions of the array and the DataFrame rows represent values in the array.
- Parameters:
- df: pandas DataFrame
The DataFrame containing the values to convert to a numpy ndarray.
- val_col: str
The name of the column in the DataFrame that contains the values to be placed in the array.
- idx_col: str or list of str or None, default None
The name(s) of the column(s) in the DataFrame that represent the dimensions of the array. If not provided, the index of the DataFrame will be used as the dimension(s).
- dropna: bool, default True
Whether to drop rows with missing values before converting to the array.
- fill_val: scalar, default np.nan
The value to fill in the array for missing values.
- Returns:
- numpy array
The resulting numpy array.
- Raises:
- AssertionError
If the dimension values are not integers or have gaps, or if the idx_col parameter contains column names that are not in the DataFrame.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from GPSat.utils import dataframe_to_array
>>> df = pd.DataFrame({
...     'dim1': [0, 0, 1, 1],
...     'dim2': [0, 1, 0, 1],
...     'values': [1, 2, 3, 4]
... })
>>> arr = dataframe_to_array(df, 'values', ['dim1', 'dim2'])
>>> print(arr)
[[1 2]
 [3 4]]
- GPSat.utils.dict_of_array_to_dict_of_dataframe(array_dict, concat=False, reset_index=False)
Converts a dictionary of arrays to a dictionary of pandas DataFrames.
- Parameters:
- array_dict: dict
A dictionary where the keys are strings and the values are numpy arrays.
- concat: bool, optional
If True, concatenates DataFrames with the same number of dimensions. Default is False.
- reset_index: bool, optional
If True, resets the index of each DataFrame. Default is False.
- Returns:
- dict
A dictionary where the keys are strings and the values are pandas DataFrames.
Notes
This function uses the array_to_dataframe function to convert each array to a DataFrame. If concat is True, it will concatenate DataFrames with the same number of dimensions. If reset_index is True, it will reset the index of each DataFrame.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> array_dict = {'a': np.array([1, 2, 3]), 'b': np.array([[1, 2], [3, 4]]), 'c': np.array([1.1, 2.2, 3.3])}
>>> dict_of_array_to_dict_of_dataframe(array_dict)
{'a':         a
_dim_0
0       1
1       2
2       3,
'b':                b
_dim_0 _dim_1
0      0       1
       1       2
1      0       3
       1       4,
'c':           c
_dim_0
0       1.1
1       2.2
2       3.3}
>>> dict_of_array_to_dict_of_dataframe(array_dict, concat=True)
{1:         a    c
_dim_0
0       1  1.1
1       2  2.2
2       3  3.3,
2:                b
_dim_0 _dim_1
0      0       1
       1       2
1      0       3
       1       4}
>>> dict_of_array_to_dict_of_dataframe(array_dict, reset_index=True)
{'a':    _dim_0  a
0       0  1
1       1  2
2       2  3,
'b':    _dim_0  _dim_1  b
0       0       0  1
1       0       1  2
2       1       0  3
3       1       1  4,
'c':    _dim_0    c
0       0  1.1
1       1  2.2
2       2  3.3}
- GPSat.utils.diff_distance(x, p=2, k=1, default_val=nan)
- GPSat.utils.expand_dict_by_vals(d, expand_keys)
- GPSat.utils.get_col_values(df, col, return_numpy=True)
This function takes in a pandas DataFrame, a column name or index, and a boolean flag indicating whether to return the column values as a numpy array or not. It returns the values of the specified column as either a pandas Series or a numpy array, depending on the value of the return_numpy flag.
If the column is specified by name and it does not exist in the DataFrame, the function will attempt to use the column index instead. If the column is specified by index and it is not a valid integer index, the function will raise an AssertionError.
- Parameters:
- df: pandas DataFrame
A pandas DataFrame containing data.
- col: str or int
The name of column to extract data from. If specified as an int n, it will extract data from the n-th column.
- return_numpy: bool, default True
Whether to return as numpy array.
- Returns:
- numpy array
If return_numpy is set to True.
- pandas Series
If return_numpy is set to False.
Examples
>>> import pandas as pd
>>> from GPSat.utils import get_col_values
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> col_values = get_col_values(df, 'A')
>>> print(col_values)
[1 2 3]
- GPSat.utils.get_config_from_sysargv(argv_num=1)
This function takes an optional argument argv_num (default value of 1) and attempts to read a JSON configuration file from the corresponding index in sys.argv.
If the file extension is not .json, it prints a message indicating that the file is not a JSON file.
If an error occurs while reading the file, it prints an error message.
This function could benefit from refactoring to use the argparse package instead of manually parsing sys.argv.
- Parameters:
- argv_num: int, default 1
The index in sys.argv to read the configuration file from.
- Returns:
- dict or None
The configuration data loaded from the JSON file, or None if an error occurred while reading the file.
- GPSat.utils.get_git_information()
This function retrieves information about the current state of a Git repository.
- Returns:
- dict
Contains the following keys:
"branch": the name of the current branch.
"remote": a list of strings representing the remote repositories and their URLs.
"commit": the hash of the current commit.
"details": a list of strings representing the details of the last commit (author, date, message).
"modified" (optional): a list of strings representing the files modified since the last commit.
Note
If the current branch cannot be determined, the function will attempt to retrieve it from the list of all branches.
If there are no remote repositories, the "remote" key will be an empty list.
If there are no modified files, the "modified" key will not be present in the output.
This function requires the Git command line tool to be installed and accessible from the command line.
- GPSat.utils.get_previous_oi_config(store_path, oi_config, table_name='oi_config', skip_valid_checks_on=None)
This function retrieves the previous configuration from an optimal interpolation (OI) results file (store_path).
If the store_path exists, it is expected to contain a table called “oi_config” with the previous configurations stored as rows.
If store_path does not exist, the function creates the file and adds the current configuration (oi_config) as the first row in the “oi_config” table.
Each row in the “oi_config” table contains columns ‘idx’ (index), ‘datetime’ and ‘config’. The values in ‘config’ are the provided oi_config (dict) converted to str.
If the table (oi_config) already exists, the function will match the provided oi_config against the previous config values; if any match exactly, the largest matching config id will be returned. Otherwise (oi_config does not exactly match any previous config), the largest idx value will be incremented and returned.
- Parameters:
- store_path: str
The file path where the configurations are stored.
- oi_config: dict
Representing the current configuration for the OI system.
- table_name: str, default “oi_config”
The table where the configurations will be stored.
- skip_valid_checks_on: list of str or None, default None
If a list, the names of the configuration keys that should be skipped during validation checks. Note: validation checks are not done in this function.
- Returns:
- dict
Previous configuration as a dictionary.
- list
List of configuration keys to be skipped during validation checks.
- int
Configuration ID.
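The id selection rule described above (reuse the largest matching idx, otherwise increment the largest stored idx) can be sketched without any file storage. Everything here is an assumption for illustration: `find_or_assign_config_id` is a hypothetical name, the stored table is modelled as a plain dict of idx to config string, and starting from 1 for an empty table is a guess.

```python
def find_or_assign_config_id(stored_configs, config):
    """Return the idx to use for config.

    stored_configs: dict mapping idx (int) -> config stored as str.
    """
    config_str = str(config)
    matches = [idx for idx, stored in stored_configs.items() if stored == config_str]
    if matches:
        # exact match: reuse the largest matching idx
        return max(matches)
    # no exact match: one more than the largest stored idx (assumed 1 if empty)
    return max(stored_configs, default=0) + 1

stored = {1: str({"a": 1}), 2: str({"a": 2})}
print(find_or_assign_config_id(stored, {"a": 2}))  # 2
print(find_or_assign_config_id(stored, {"a": 3}))  # 3
```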
- GPSat.utils.get_weighted_values(df, ref_col, dist_to_col, val_cols, weight_function='gaussian', drop_weight_cols=True, **weight_kwargs)
Calculate the weighted values of specified columns in a DataFrame based on the distance between two other columns, using a specified weighting function. The current implementation supports a Gaussian weight based on the euclidean distance between the values in ref_col and dist_to_col.
- Parameters:
- df: pandas.DataFrame
The input DataFrame containing the reference column, distance-to column, and value columns.
- ref_col: list of str or str
The name of the column(s) to use as reference points for calculating distances.
- dist_to_col: list of str or str
The name of the column(s) to calculate distances to, from ref_col. They should align / correspond to the column(s) set by ref_col.
- val_cols: list of str or str
The names of the column(s) for which the weighted values are calculated. Can be a single column name or a list of names.
- weight_function: str, optional
The type of weighting function to use. Currently, only “gaussian” is implemented, which applies a Gaussian weighting (exp(-d^2)) based on the squared euclidean distance. The default is “gaussian”.
- drop_weight_cols: bool, optional, default True
If False, the total weight and total weighted function values are included in the output.
- **weight_kwargs: dict
Additional keyword arguments for the weighting function. For the Gaussian weight, this includes:
- lengthscale (float): The length scale to use in the Gaussian function. This parameter scales the distance before applying the Gaussian function and must be provided.
- Returns:
- pandas.DataFrame
A DataFrame containing the weighted values for each of the specified value columns. The output DataFrame has the reference column as the index and each of the specified value columns with their weighted values.
- Raises:
- AssertionError
If the shapes of the ref_col and dist_to_col do not match, or if the required lengthscale parameter for the Gaussian weighting function is not provided.
- NotImplementedError
If a weight_function other than “gaussian” is specified.
Notes
The function currently only implements Gaussian weighting. The Gaussian weight is calculated as exp(-d^2 / (2 * l^2)), where d^2 is the squared euclidean distance between ref_col and dist_to_col, and l is the lengthscale.
This implementation assumes the input DataFrame does not contain NaN values in the reference or distance-to columns. Handling NaN values may require additional preprocessing or the use of fillna methods.
Examples
>>> import pandas as pd
>>>
>>> data = {
...     'ref_col': [0, 1, 0, 1],
...     'dist_to_col': [1, 2, 3, 4],
...     'value1': [10, 20, 30, 40],
...     'value2': [100, 200, 300, 400]
... }
>>> df = pd.DataFrame(data)
>>> weighted_df = get_weighted_values(df, 'ref_col', 'dist_to_col', ['value1', 'value2'], lengthscale=1.0)
>>> print(weighted_df)
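The weighting formula from the Notes, exp(-d^2 / (2 * l^2)), can be worked through on plain numbers. The sketch below is a one-dimensional illustration with a hypothetical helper, `gaussian_weighted_mean`; it does not reproduce the DataFrame grouping that get_weighted_values performs.

```python
import math

def gaussian_weighted_mean(ref, points, values, lengthscale):
    """Weighted mean of values, weighting each point by exp(-d^2 / (2 l^2))."""
    weights = [
        math.exp(-((p - ref) ** 2) / (2.0 * lengthscale ** 2)) for p in points
    ]
    total_weight = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total_weight

# two points equidistant from the reference get equal weight
result = gaussian_weighted_mean(0.0, [-1.0, 1.0], [10.0, 30.0], lengthscale=1.0)
print(round(result, 6))  # 20.0
```

A larger lengthscale flattens the weights, so distant points contribute more to the average.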
- GPSat.utils.glue_local_predictions(preds_df: DataFrame, expert_locs_df: DataFrame, sigma: int | float | list = 3) DataFrame
Deprecated. Use glue_local_predictions_1d and glue_local_predictions_2d instead.
Glues overlapping predictions by taking a normalised Gaussian weighted average.
Warning: This method only deals with expert locations on a regular grid.
- Parameters:
- preds_df: pd.DataFrame
containing predictions generated from local expert OI. It should have the following columns:
pred_loc_x (float): The x-coordinate of the prediction location.
pred_loc_y (float): The y-coordinate of the prediction location.
f* (float): The predictive mean at the location (pred_loc_x, pred_loc_y).
f*_var (float): The predictive variance at the location (pred_loc_x, pred_loc_y).
- expert_locs_df: pd.DataFrame
containing local expert locations used to perform optimal interpolation. It should have the following columns:
x (float): The x-coordinate of the expert location.
y (float): The y-coordinate of the expert location.
- sigma: int, float, or list, default 3
The standard deviation of the Gaussian weighting in the x and y directions.
If a single value is provided, it is used for both directions.
If a list is provided, the first value is used for the x direction and the second value is used for the y direction. Defaults to 3.
- Returns:
- pd.DataFrame:
Dataframe consisting of glued predictions (mean and std). It has the following columns:
pred_loc_x (float): The x-coordinate of the prediction location.
pred_loc_y (float): The y-coordinate of the prediction location.
f* (float): The glued predictive mean at the location (pred_loc_x, pred_loc_y).
f*_std (float): The glued predictive standard deviation at the location (pred_loc_x, pred_loc_y).
Notes
The function assumes that the expert locations are equally spaced in both the x and y directions.
The function uses the scipy.stats.norm.pdf function to compute the Gaussian weights.
The function normalizes the weighted sums with the total weights at each location.
- GPSat.utils.grid_2d_flatten(x_range, y_range, grid_res=None, step_size=None, num_step=None, center=True)
Create a 2D grid of points defined by x and y ranges, with the option to specify the grid resolution, step size, or number of steps. The resulting grid is flattened and concatenated into a 2D array of (x,y) coordinates.
- Parameters:
- x_range: tuple or list of floats
Two values representing the minimum and maximum values of the x-axis range.
- y_range: tuple or list of floats
Two values representing the minimum and maximum values of the y-axis range.
- grid_res: float or None, default None
The grid resolution, i.e. the distance between adjacent grid points. If specified, this parameter takes precedence over step_size and num_step.
- step_size: float or None, default None
The step size between adjacent grid points. If specified, this parameter takes precedence over num_step.
- num_step: int or None, default None
The number of steps between the minimum and maximum values of the x and y ranges. If specified, this parameter is used only if grid_res and step_size are not specified (are None). Note: the number of steps includes the starting point, so from 0 to 1 is two steps.
- center: bool, default True
If True, the resulting grid points will be the centers of the grid cells.
If False, the resulting grid points will be the edges of the grid cells.
- Returns:
- ndarray
A 2D array of (x,y) coordinates, where each row represents a single point in the grid.
- Raises:
- AssertionError
If grid_res, step_size and num_step are all unspecified. Must specify at least one.
Examples
>>> from GPSat.utils import grid_2d_flatten
>>> grid_2d_flatten(x_range=(0, 2), y_range=(0, 2), grid_res=1)
array([[0.5, 0.5],
       [1.5, 0.5],
       [0.5, 1.5],
       [1.5, 1.5]])
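To show where the cell-centre coordinates come from, the following pure-Python sketch rebuilds the same grid for the grid_res case with center=True. The helper names `cell_centers` and `grid_2d_centers` are hypothetical; the GPSat function returns a numpy array and also supports step_size and num_step, which this sketch ignores.

```python
def cell_centers(lo, hi, grid_res):
    """Centres of cells of width grid_res covering the interval [lo, hi]."""
    n_cells = int(round((hi - lo) / grid_res))
    return [lo + (i + 0.5) * grid_res for i in range(n_cells)]

def grid_2d_centers(x_range, y_range, grid_res):
    """Flattened list of [x, y] cell centres."""
    xs = cell_centers(*x_range, grid_res)
    ys = cell_centers(*y_range, grid_res)
    # x varies fastest, matching the row order of the documented output
    return [[x, y] for y in ys for x in xs]

grid = grid_2d_centers((0, 2), (0, 2), grid_res=1)
print(grid)  # [[0.5, 0.5], [1.5, 0.5], [0.5, 1.5], [1.5, 1.5]]
```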
- GPSat.utils.guess_track_num(x, thresh, start_track=0)
- GPSat.utils.inverse_sigmoid(y, low=0, high=1)
- GPSat.utils.inverse_softplus(y, shift=0)
- GPSat.utils.json_load(file_path)
This function loads a JSON file from the specified file path and applies a nested dictionary literal evaluation (nested_dict_literal_eval) to convert any string keys in the format of ‘(…,…)’ to tuple keys.
The resulting dictionary is returned.
- Parameters:
- file_path: str
The path to the JSON file to be loaded.
- Returns:
- dict or list of dict
The loaded JSON file as a dictionary or list of dictionaries.
Examples
Assuming a JSON file named 'config.json' with the following contents:
{
"key1": "value1",
"('key2', 'key3')": "value2",
"key4": {"('key5', 'key6')": "value3"}
}
The following code will load the file and convert the "('key2', 'key3')" and "('key5', 'key6')" keys to tuple keys:
config = json_load('config.json')
print(config)
{'key1': 'value1', ('key2', 'key3'): 'value2', 'key4': {('key5', 'key6'): 'value3'}}
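The tuple-key conversion can be sketched with the standard library alone. This is an illustration under assumptions rather than the GPSat source: `keys_to_tuples` is a hypothetical name, and the JSON is read from a string here instead of a file.

```python
import ast
import json

def keys_to_tuples(d):
    """Recursively turn string keys that look like tuples into tuple keys."""
    out = {}
    for key, value in d.items():
        if isinstance(key, str) and key.startswith("(") and key.endswith(")"):
            try:
                key = ast.literal_eval(key)
            except (ValueError, SyntaxError):
                pass  # leave keys that are not valid literals untouched
        out[key] = keys_to_tuples(value) if isinstance(value, dict) else value
    return out

raw = """{"key1": "value1", "('key2', 'key3')": "value2", "key4": {"('key5', 'key6')": "value3"}}"""
config = keys_to_tuples(json.loads(raw))
print(config)
# {'key1': 'value1', ('key2', 'key3'): 'value2', 'key4': {('key5', 'key6'): 'value3'}}
```

ast.literal_eval only evaluates Python literals, so it does not execute arbitrary code the way eval() would.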
- GPSat.utils.json_serializable(d, max_len_df=100)
Converts a dictionary to a format that can be stored as JSON via the json.dumps() method.
- Parameters:
- d: dict
The dictionary to be converted.
- max_len_df: int, default 100
The maximum length of a Pandas DataFrame or Series that will be converted to a dictionary. If the length of the DataFrame or Series is greater than this value, it will be stored as a string instead.
- Returns:
- dict
The converted dictionary.
- Raises:
- AssertionError
If the input is not a dictionary.
Notes
- If a key in the dictionary is a tuple, it will be converted to a string. To recover the original tuple, use nested_dict_literal_eval.
- If a value in the dictionary is a dictionary, the function will be called recursively to convert it.
- If a value in the dictionary is a NumPy array, it will be converted to a list.
- If a value in the dictionary is a Pandas DataFrame or Series, it will be converted to a dictionary and the function will be called recursively to convert it if its length is less than or equal to max_len_df. Otherwise, it will be stored as a string.
- If a value in the dictionary is not JSON serializable, it will be cast as a string.
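A reduced sketch of these conversion rules is shown below. It covers only tuple keys, nested dictionaries and the fall-back to str(); the NumPy and Pandas branches are omitted, and `make_serializable` is a hypothetical name rather than the GPSat function.

```python
import json

def make_serializable(d):
    """Convert tuple keys to str and fall back to str() for unsupported values."""
    assert isinstance(d, dict), "input must be a dict"
    out = {}
    for key, value in d.items():
        key = str(key) if isinstance(key, tuple) else key
        if isinstance(value, dict):
            value = make_serializable(value)
        else:
            try:
                json.dumps(value)  # probe: is the value serializable as-is?
            except TypeError:
                value = str(value)
        out[key] = value
    return out

config = {("a", "b"): 1, "nested": {"z": complex(1, 2)}}
dumped = json.dumps(make_serializable(config))
print(dumped)  # {"('a', 'b')": 1, "nested": {"z": "(1+2j)"}}
```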
- GPSat.utils.log_lines(*args, level='debug')
This function logs lines to a file with a specified logging level.
This function takes in any number of arguments and a logging level.
The function checks that the logging level is valid and then iterates through the arguments.
If an argument is a string, integer, float, dictionary, tuple, or list, it is printed and logged with the specified logging level.
If an argument is not one of these types, it is not logged and a message is printed indicating the argument’s type.
- Parameters:
- *args: tuple
Arguments to be provided to logging using the method specified by level.
- level: str, default “debug”
Must be one of [“debug”, “info”, “warning”, “error”, “critical”]. Each argument provided is logged with getattr(logging, level)(arg).
- Returns:
- None
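The dispatch via getattr(logging, level) can be sketched as follows. This is a hypothetical re-creation, not the GPSat source: `log_each` is an invented name, and unlike the documented function it returns the list of logged arguments so the behaviour is easy to inspect.

```python
import logging

VALID_LEVELS = ["debug", "info", "warning", "error", "critical"]

def log_each(*args, level="debug"):
    """Print and log each argument of a supported type; report the others."""
    assert level in VALID_LEVELS, f"level '{level}' is not valid"
    logged = []
    for arg in args:
        if isinstance(arg, (str, int, float, dict, tuple, list)):
            print(arg)
            # e.g. level="info" resolves to logging.info(arg)
            getattr(logging, level)(arg)
            logged.append(arg)
        else:
            print(f"not logged, type: {type(arg)}")
    return logged

logging.basicConfig(level=logging.DEBUG)
logged = log_each("starting run", {"n": 3}, object(), level="info")
```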
- GPSat.utils.match(x, y, exact=True, tol=1e-09)
This function takes two arrays, x and y, and returns an array of indices indicating where the elements of x match the elements of y. Can match exactly or within a specified tolerance.
- Parameters:
- x: array-like
The first array to be matched. If not an array, it will be converted via to_array.
- y: array-like
The second array to be matched against. If not an array, it will be converted via to_array.
- exact: bool, default True
If True, the function matches exactly. If False, the function matches within a specified tolerance.
- tol: float, optional, default 1e-9
The tolerance used for matching when exact=False.
- Returns:
- indices: array
The indices of the matching elements in y for each element in x.
- Raises:
- AssertionError
If any element in x is not found in y or if multiple matches are found for any element in x.
Note
This function requires x and y to be arrays, or convertible to arrays by to_array.
If exact=False, the function only makes sense with floats. Use exact=True for int and str.
If both x and y are large, with lengths n and m, this function can take up a lot of memory, as an intermediate bool array of size n x m is created.
If there are multiple matches of x in y, the index of the first match is returned.
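The matching behaviour can be illustrated with plain lists. The sketch below is an assumption-laden stand-in: `match_indices` is a hypothetical name, it loops instead of building the n x m boolean array, it takes the first hit when several exist, and whether the real tolerance test is strict or inclusive is not stated in this page.

```python
def match_indices(x, y, exact=True, tol=1e-9):
    """For each element of x, return the index of its first match in y."""
    indices = []
    for xi in x:
        if exact:
            hits = [j for j, yj in enumerate(y) if xi == yj]
        else:
            hits = [j for j, yj in enumerate(y) if abs(xi - yj) <= tol]
        assert hits, f"{xi!r} not found in y"
        indices.append(hits[0])
    return indices

print(match_indices(["b", "a"], ["a", "b", "c"]))          # [1, 0]
# 0.1 + 0.2 is not exactly 0.3, so a tolerance is needed
print(match_indices([0.3], [0.1 + 0.2, 0.5], exact=False))  # [0]
```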
- GPSat.utils.move_to_archive(top_dir, file_names=None, suffix='', archive_sub_dir='Archive', verbose=False)
Moves specified files from a directory to an archive sub-directory within the same directory. Moved files will have a suffix added on before file extension.
- Parameters:
- top_dir: str
The path to the directory containing the files to be moved.
- file_names: list of str, default None
The names of the files to be moved. If not specified, all files in the directory will be moved.
- suffix: str, default ""
A string to be added to the end of the file name, before the extension, in the archive directory.
- archive_sub_dir: str, default "Archive"
The name of the sub-directory within the top directory where the files will be moved.
- verbose: bool, default False
If True, prints information about the files being moved.
- Returns:
- None
The function only moves files and does not return anything.
Note
If the archive sub-directory does not exist, it will be created.
If a file with the same name as the destination file already exists in the archive sub-directory, it will be overwritten.
- Raises:
- AssertionError
If top_dir does not exist or file_names is not specified.
Examples
Move all files in a directory to the archive sub-directory:
>>> move_to_archive("path/to/directory")
Move specific files to the archive sub-directory with a suffix added to the file name:
>>> move_to_archive("path/to/directory", file_names=["file1.txt", "file2.txt"], suffix="_backup")
Move specific files to a custom archive sub-directory:
>>> move_to_archive("path/to/directory", file_names=["file1.txt", "file2.txt"], archive_sub_dir="Old Files")
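How the suffix is placed before the file extension can be sketched for a single file as below. The helper archive_file_sketch is hypothetical and only illustrates the naming rule and the creation of the sub-directory. It is not the GPSat implementation and handles one file at a time.

```python
import os
import shutil


def archive_file_sketch(top_dir, file_name, suffix="", archive_sub_dir="Archive"):
    """Move one file into top_dir/archive_sub_dir, adding suffix before the extension."""
    archive_dir = os.path.join(top_dir, archive_sub_dir)
    # Create the archive sub-directory if it does not exist yet.
    os.makedirs(archive_dir, exist_ok=True)
    # "file1.txt" with suffix "_backup" becomes "file1_backup.txt".
    base, ext = os.path.splitext(file_name)
    dst = os.path.join(archive_dir, f"{base}{suffix}{ext}")
    shutil.move(os.path.join(top_dir, file_name), dst)
    return dst
```

Returning the destination path is a choice made for this sketch only; the documented function returns None.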
- GPSat.utils.nested_dict_literal_eval(d, verbose=False)
Converts a nested dictionary with string keys that represent tuples to a dictionary with tuple keys.
- Parameters:
- d: dict
The nested dictionary to be converted.
- verbose: bool, default False
If True, prints information about the keys being converted.
- Returns:
- dict
The converted dictionary with tuple keys.
- Raises:
- ValueError: If a string key cannot be evaluated as a tuple.
Note
This function modifies the original dictionary in place.
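A recursive conversion of this kind can be sketched with ast.literal_eval from the standard library. The name nested_dict_literal_eval_sketch is hypothetical, and the sketch assumes every string key is a valid Python literal. It is not taken from the GPSat source.

```python
import ast


def nested_dict_literal_eval_sketch(d):
    """Replace string keys such as "(1, 2)" with evaluated keys such as (1, 2), in place."""
    # Iterate over a copy of the keys because the dict is modified in the loop.
    for key in list(d.keys()):
        value = d.pop(key)
        if isinstance(value, dict):
            # Recurse into nested dictionaries.
            value = nested_dict_literal_eval_sketch(value)
        # literal_eval raises ValueError for strings that are not literals, e.g. "abc".
        new_key = ast.literal_eval(key) if isinstance(key, str) else key
        d[new_key] = value
    return d
```

This is useful after loading a configuration from JSON, where tuple keys must be stored as strings. Because the input is modified in place, pass a copy if the original is still needed.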
- GPSat.utils.nll(y, mu, sig, return_tot=True)
- GPSat.utils.not_nan(x)
- GPSat.utils.pandas_to_dict(x)
Converts a pandas Series or DataFrame (row) to a dictionary.
- Parameters:
- x: pd.Series, pd.DataFrame or dict
The input object to be converted to a dictionary.
- Returns:
- dict:
A dictionary representation of the input object.
- Raises:
- AssertionError: If the input object is a DataFrame with more than one row.
Warning
If the input object is not a pandas Series, DataFrame, or dictionary, a warning is issued and the input object is returned as is.
Examples
>>> import pandas as pd
>>> data = {'name': ['John', 'Jane'], 'age': [30, 25]}
>>> df = pd.DataFrame(data)
>>> pandas_to_dict(df)
AssertionError: in pandas_to_dict input provided as DataFrame, expected to only have 1 row, shape is: (2, 2)
>>> series = pd.Series(data['name'])
>>> pandas_to_dict(series)
{0: 'John', 1: 'Jane'}
>>> dictionary = {'name': ['John', 'Jane'], 'age': [30, 25]}
>>> pandas_to_dict(dictionary)
{'name': ['John', 'Jane'], 'age': [30, 25]}
Select a single row of the DataFrame:
>>> pandas_to_dict(df.iloc[[0]])
{'name': 'John', 'age': 30}
- GPSat.utils.pip_freeze_to_dataframe()
- GPSat.utils.pretty_print_class(x)
Takes a class object as input and returns a string representation of the class name without the leading "<class '" and trailing "'>".
Alternatively, it removes a leading "<__main__." and the text " object at ", including anything that follows.
It does this by invoking the __str__ method of the object and then using regular expressions to remove the unwanted characters.
- Parameters:
- x: an arbitrary class instance
- Returns:
- str
Examples
>>> class MyClass:
...     pass
>>> print(pretty_print_class(MyClass))
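The string clean-up described above can be sketched with the re module. The name pretty_print_class_sketch and the exact patterns are assumptions based on the description, not the GPSat source.

```python
import re


def pretty_print_class_sketch(x):
    """Strip the "<class '...'>" or "<__main__.... object at 0x...>" wrapping from str(x)."""
    out = str(x)
    # "<class 'int'>" -> "int"
    out = re.sub(r"^<class '", "", out)
    out = re.sub(r"'>$", "", out)
    # "<__main__.MyClass object at 0x...>" -> "MyClass"
    out = re.sub(r"^<__main__\.", "", out)
    out = re.sub(r" object at .*$", "", out)
    return out
```

For a built-in type such as int, whose string form is "<class 'int'>", pretty_print_class_sketch(int) returns 'int'.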
- GPSat.utils.rmse(y, mu)
- GPSat.utils.sigmoid(x, low=0, high=1)
- GPSat.utils.softplus(x, shift=0)
- GPSat.utils.sparse_true_array(shape, grid_space=1, grid_space_offset=0)
Create a boolean numpy array with True values regularly spaced throughout, and False elsewhere.
- Parameters:
- shape: iterable (e.g. list or tuple)
representing the shape of the output array.
- grid_space: int, default 1
representing the spacing between True values.
- grid_space_offset: int, default 0
representing the offset of the first True value in each dimension.
- Returns:
- np.array
A boolean array with dimension equal to shape, with False everywhere except for Trues regularly spaced every ‘grid_space’. The fraction of True will be roughly equal to (1/n)^d where n = grid_space, d = len(shape).
Note
The first dimension is treated as the y dimension. The grid_space_offset value can be specific to each dimension.
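A regularly spaced boolean mask of this kind can be built with NumPy as sketched below. The name sparse_true_array_sketch is hypothetical. Taking the offset modulo grid_space and accepting either a scalar or one offset per dimension are assumptions, not confirmed details of the GPSat implementation.

```python
import numpy as np


def sparse_true_array_sketch(shape, grid_space=1, grid_space_offset=0):
    """Bool array of the given shape with True every grid_space steps along each axis."""
    ndim = len(shape)
    # Allow one offset for all dimensions or one offset per dimension.
    offsets = np.broadcast_to(grid_space_offset, (ndim,))
    out = np.zeros(shape, dtype=bool)
    # Indices on the regular grid along each axis.
    idx = [np.arange(off % grid_space, n, grid_space) for n, off in zip(shape, offsets)]
    # np.ix_ forms the open mesh, so True is set only where every axis is on the grid.
    out[np.ix_(*idx)] = True
    return out
```

With shape (4, 6) and grid_space=2, the sketch sets 2 x 3 = 6 of the 24 entries to True, a fraction of (1/2)^2.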
- GPSat.utils.stats_on_vals(vals, measure=None, name=None, qs=None)
This function calculates various statistics on a given array of values.
- Parameters:
- vals: array-like
The input array of values.
- measure: str or None, default is None
The name of the measure being calculated.
- name: str or None, default is None
The name of the column in the output dataframe.
- qs: list or None, default None
A list of quantiles to calculate. If None then will use [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99].
- Returns:
- pd.DataFrame
containing the following statistics:
- measure: The name of the measure being calculated.
- size: The number of elements in the input array.
- num_not_nan: The number of non-NaN elements in the input array.
- num_inf: The number of infinite elements in the input array.
- min: The minimum value in the input array.
- mean: The mean value of the input array.
- max: The maximum value in the input array.
- std: The standard deviation of the input array.
- skew: The skewness of the input array.
- kurtosis: The kurtosis of the input array.
- qX: The Xth quantile of the input array, where X is the value in the qs parameter.
Note
The function also includes a timer decorator that calculates the time taken to execute the function.
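Most of the statistics listed above can be sketched with NumPy alone. The name stats_on_vals_sketch is hypothetical, and the sketch returns a plain dict rather than a DataFrame. It omits skew and kurtosis, and both the use of NaN-aware functions and the "q{value}" key format are assumptions.

```python
import numpy as np

# Default quantiles, as listed for the qs parameter.
DEFAULT_QS = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]


def stats_on_vals_sketch(vals, measure=None, qs=None):
    """Collect summary statistics of vals in a dict (NaN values are ignored)."""
    vals = np.asarray(vals, dtype=float).ravel()
    qs = DEFAULT_QS if qs is None else qs
    out = {
        "measure": measure,
        "size": vals.size,
        "num_not_nan": int((~np.isnan(vals)).sum()),
        "num_inf": int(np.isinf(vals).sum()),
        "min": np.nanmin(vals),
        "mean": np.nanmean(vals),
        "max": np.nanmax(vals),
        "std": np.nanstd(vals),
    }
    # One entry per requested quantile; the key format is hypothetical.
    for q in qs:
        out[f"q{q}"] = np.nanquantile(vals, q)
    return out
```

A dict of this form can be turned into a single-column table with pandas if needed, which is closer to the documented return type.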
- GPSat.utils.to_array(*args, date_format='%Y-%m-%d')
Converts input arguments to numpy arrays.
- Parameters:
- *args: tuple
Input arguments to be converted to numpy arrays.
- date_format: str, optional
Date format to be used when converting datetime.date objects to numpy arrays.
- Returns:
- generator
A generator that yields numpy arrays.
Note
This function converts input arguments to numpy arrays, depending on the type of each argument:
- numpy array: yielded as is.
- list or tuple: converted to a numpy array and yielded.
- integer, float, string, boolean, or numpy boolean: converted to a numpy array and yielded.
- numpy integer or float: converted to a numpy array and yielded.
- datetime.date object: converted to a numpy array using the specified date format and yielded.
- numpy datetime64 object: yielded as is.
- None: an empty numpy array is yielded.
- any other data type: a warning is issued and the input argument is converted to a numpy array of type object and yielded.
Examples
>>> import datetime
>>> import numpy as np
>>> x = [1, 2, 3]
Since the function returns a generator, get values out with next:
>>> print(next(to_array(x)))
[1 2 3]
Or, for a single array-like object, assign with:
>>> c, = to_array(x)
>>> y = np.array([4, 5, 6])
>>> z = datetime.date(2021, 1, 1)
>>> for arr in to_array(x, y, z):
...     print(f"arr type: {type(arr)}, values: {arr}")
arr type: <class 'numpy.ndarray'>, values: [1 2 3]
arr type: <class 'numpy.ndarray'>, values: [4 5 6]
arr type: <class 'numpy.ndarray'>, values: ['2021-01-01']
- GPSat.utils.track_num_for_date(x)