GPSat package
Submodules
GPSat.bin_data module
- class GPSat.bin_data.BinData
Bases:
object
- bin_data(file=None, source=None, load_by=None, table=None, where=None, batch=False, add_output_cols=None, bin_config=None, chunksize=5000000, **data_load_kwargs)
Bins the dataset, either in a single pass or in batches, based on the provided configuration.
This method decides between processing the entire dataset at once or in chunks based on the batch parameter. It applies binning according to the specified bin_config, along with any preprocessing defined by col_funcs, col_select, and row_select. Additional columns can be added to the output dataset using add_output_cols. The method is capable of handling both small and very large datasets efficiently.
- Parameters:
- file: str, optional
Path to the source file containing the dataset if source is not specified. Defaults to None.
- source: str, optional
An alternative specification of the data source. This could be a path to a file or another identifier, depending on the context. If both file and source are provided, source takes precedence. Defaults to None.
- load_by: list of str, optional
List of column names by which data will be loaded and binned in batches if batch is True. Each unique combination of values in these columns defines a batch. Defaults to None.
- table: str, optional
The name of the table within the data source from which to load the data. Defaults to None.
- where: list of dict, optional
Conditions for filtering rows from the source, expressed as a list of dictionaries representing SQL-like where clauses. Defaults to None.
- batch: bool, optional
If True, the data is processed in chunks based on the load_by columns. If False, the entire dataset is processed at once. Defaults to False.
- add_output_cols: dict, optional
Dictionary mapping new column names to functions that define their values, used to add columns to the output DataFrame after binning. Defaults to None.
- bin_config: dict
Configuration for the binning process, including parameters such as bin sizes, binning method, and criteria for binning. This parameter is required.
- chunksize: int, optional
The number of rows to read into memory and process at a time, applicable when batch is True. Defaults to 5,000,000.
- **data_load_kwargs: dict, optional
Additional keyword arguments passed to DataLoader.load.
- Returns:
- df_bin: pandas.DataFrame
A DataFrame containing the binned data.
- stats: pandas.DataFrame
A DataFrame containing statistics of the binned data, useful for analyzing the distribution and quality of the binned data.
- Raises:
- AssertionError
If bin_config is not provided or is not a dictionary.
Notes
The bin_data method offers flexibility in processing datasets of various sizes by allowing for both batch processing and single-pass processing. The choice between these modes is controlled by the batch parameter, making it suitable for scenarios ranging from small datasets that fit easily into memory to very large datasets requiring chunked processing to manage memory usage effectively.
The additional parameters for row and column selection and the ability to add new columns after binning allow for significant customization of the binning process, enabling users to tailor the method to their specific data processing and analysis needs.
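As a rough illustration, the call below sketches single-pass usage. It assumes BinData can be constructed without arguments; the file path, table name and the keys inside bin_config are assumptions (modelled on the DataPrep.bin_data parameters documented later on this page), not values prescribed by this API.
>>> from GPSat.bin_data import BinData
>>> bd = BinData()
>>> bin_config = {                      # assumed keys, mirroring DataPrep.bin_data
...     "val_col": "obs",               # column to aggregate
...     "x_col": "x", "y_col": "y",     # projected coordinate columns
...     "grid_res": 50_000,             # bin size (assumed units)
...     "bin_statistic": "mean",
... }
>>> df_bin, stats = bd.bin_data(source="path/to/data.h5",   # placeholder file
...                             table="data",               # placeholder table
...                             where=[{"col": "lat", "comp": ">=", "val": 60}],
...                             batch=False,
...                             bin_config=bin_config)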
- bin_data_all_at_once(file=None, source=None, table=None, where=None, add_output_cols=None, bin_config=None, **data_load_kwargs)
Reads the entire dataset, applies binning, and returns binned data along with statistics.
This method handles the entire binning process in a single pass, making it suitable for datasets that can fit into memory. It allows for preprocessing of data through column functions, selection of specific rows and columns, and the addition of output columns after binning based on provided configurations.
- Parameters:
- file: str, optional
Path to the source file containing the dataset if source is not specified. Defaults to None.
- source: str, optional
An alternative specification of the data source. This could be a path to a file or another identifier, depending on the context. If both file and source are provided, source takes precedence. Defaults to None.
- table: str, optional
The name of the table within the data source to apply binning to. Defaults to None.
- where: list of dict, optional
Conditions for filtering rows before binning, expressed as a list of dictionaries representing SQL-like where clauses. Defaults to None.
- add_output_cols: dict, optional
Dictionary mapping new column names to functions that define their values, used to add columns to the output DataFrame after binning. Defaults to None.
- Returns:
- df_bin: pandas.DataFrame
A DataFrame containing the binned data.
- stats_df: pandas.DataFrame
A DataFrame containing statistics of the binned data, useful for analyzing the distribution and quality of the binned data.
- Raises:
- AssertionError
If bin_config is not provided or is not a dictionary.
Notes
This method is designed to handle datasets that can be loaded entirely into memory. For very large datasets, consider using the bin_data_by_batch method to process the data in chunks and avoid memory issues.
The add_output_cols parameter allows for the dynamic addition of columns to the binned dataset based on custom logic, which can be useful for enriching the dataset with additional metrics or categorizations derived from the binned data.
- bin_data_by_batch(file=None, source=None, load_by=None, table=None, where=None, add_output_cols=None, chunksize=5000000, bin_config=None, **data_load_kwargs)
Bins the data in chunks based on unique values of specified columns and returns the aggregated binned data and statistics.
This method is particularly useful for very large datasets that cannot fit into memory. It reads the data in batches, applies binning to each batch based on the unique values of the specified load_by columns, and aggregates the results. This approach helps manage memory usage while allowing for comprehensive data analysis and binning.
- Parameters:
- file: str, optional
Path to the source file containing the dataset if source is not specified. Defaults to None.
- source: str, optional
An alternative specification of the data source. This could be a path to a file or another identifier, depending on the context. If both file and source are provided, source takes precedence. Defaults to None.
- load_by: list of str
List of column names by which data will be loaded and binned in batches. Each unique combination of values in these columns defines a batch.
- table: str, optional
The name of the table within the data source from which to load the data. Defaults to None.
- where: list of dict, optional
Conditions for filtering rows from the source, expressed as a list of dictionaries representing SQL-like where clauses. Defaults to None.
- add_output_cols: dict, optional
Dictionary mapping new column names to functions that define their values, used to add columns to the output DataFrame after binning. Defaults to None.
- chunksize: int, optional
The number of rows to read into memory and process at a time. Defaults to 5,000,000.
- bin_config: dict
Configuration for the binning process, including parameters such as bin sizes, binning method, and criteria for binning. This parameter is required.
- **data_load_kwargs: dict, optional
Additional keyword arguments passed to DataLoader.load.
- Returns:
- df_bin: pandas.DataFrame
A DataFrame containing the aggregated binned data from all batches.
- stats_all: pandas.DataFrame
A DataFrame containing aggregated statistics of the binned data from all batches, useful for analyzing the distribution and quality of the binned data.
- Raises:
- AssertionError
If bin_config is not provided or is not a dictionary.
Notes
The bin_data_by_batch method is designed to handle large datasets by processing them in manageable chunks. It requires specifying load_by columns to define how the dataset is divided into batches for individual binning operations. This method ensures efficient memory usage while allowing for complex data binning and analysis tasks on large datasets.
The add_output_cols parameter enables the dynamic addition of columns to the output dataset based on custom logic applied after binning, which can be used to enrich the dataset with additional insights or metrics derived from the binned data.
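For orientation only, the sketch below shows how batch mode might be driven by a load_by column; the source path, the "date" column name and the bin_config keys are assumptions rather than documented values.
>>> from GPSat.bin_data import BinData
>>> bd = BinData()
>>> df_bin, stats_all = bd.bin_data_by_batch(
...     source="path/to/data.h5",        # placeholder file
...     table="data",                    # placeholder table
...     load_by=["date"],                # one batch per unique date (assumed column)
...     chunksize=1_000_000,             # rows read into memory at a time
...     bin_config={"val_col": "obs",    # assumed keys, as in the earlier sketch
...                 "x_col": "x", "y_col": "y",
...                 "grid_res": 50_000,
...                 "bin_statistic": "mean"})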
- static bin_wrapper(df, col_funcs=None, print_stats=True, **bin_config)
Perform binning on a DataFrame with optional statistics printing and column modifications.
This function wraps the binning process, allowing for optional statistics on the data before binning, dynamic column additions or modifications, and the application of various binning configurations.
- Parameters:
- df: pandas.DataFrame
The DataFrame to be binned.
- col_funcs: dict, optional
A dictionary where keys are column names to add or modify, and values are functions that take a pandas Series and return a modified Series. This allows for the dynamic addition or modification of columns before binning. Defaults to None.
- print_stats: bool, optional
If True, prints basic statistics of the DataFrame before binning. Useful for a preliminary examination of the data. Defaults to True.
- **bin_config: dict
Arbitrary keyword arguments defining the binning configuration. These dictate how binning is performed and include parameters such as bin sizes, binning method, criteria for binning, etc.
- Returns:
- ds_bin: xarray.Dataset
The binned data as an xarray Dataset. Contains the result of binning the input DataFrame according to the specified configuration.
- stats_df: pandas.DataFrame
A DataFrame containing statistics of the input DataFrame after any column additions or modifications and before binning. Provides insights into the data distribution and can inform decisions on binning parameters or data preprocessing.
Notes
The actual structure and contents of the ds_bin xarray Dataset will depend on the binning configurations specified in **bin_config. Similarly, the stats_df DataFrame provides a summary of the data’s distribution based on the column specified in the binning configuration and can vary widely in its specifics.
The binning process may be adjusted significantly through the **bin_config parameters, allowing for a wide range of binning behaviors and outcomes. For detailed configuration options, refer to the documentation of the specific binning functions used within this wrapper.
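A minimal sketch of calling bin_wrapper directly on an in-memory DataFrame follows; the **bin_config keys shown are assumptions borrowed from DataPrep.bin_data and may not match the wrapped binning function exactly.
>>> import pandas as pd
>>> from GPSat.bin_data import BinData
>>> df = pd.DataFrame({"x": [0.0, 1.0, 2.0],
...                    "y": [0.0, 1.0, 2.0],
...                    "obs": [0.1, 0.2, 0.3]})
>>> ds_bin, stats_df = BinData.bin_wrapper(df,
...                                        print_stats=False,
...                                        val_col="obs",      # assumed bin_config keys
...                                        x_col="x", y_col="y",
...                                        x_range=[0.0, 3.0],
...                                        y_range=[0.0, 3.0],
...                                        grid_res=1.0,
...                                        bin_statistic="mean")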
- write_dataframe_to_table(df_bin, file=None, table=None)
Writes the binned DataFrame to a specified table in an HDF5 file.
This method saves the binned data, represented by a DataFrame, into a table within an HDF5 file. The method assumes that the HDF5 file is accessible and writable. It allows for the efficient storage of large datasets and facilitates easy retrieval for further analysis or processing.
- Parameters:
- df_bin: pandas.DataFrame
The DataFrame containing the binned data to be written to the file. This DataFrame should already be processed and contain the final form of the data to be saved.
- file: str
The path to the HDF5 file where the DataFrame will be written. If the file does not exist, it will be created. If the file exists, the method will write the DataFrame to the specified table within the file.
- table: str
The name of the table within the HDF5 file where the DataFrame will be stored. If the table already exists, the new data will be appended to it.
- Raises:
- AssertionError
If either file or table is not specified.
Notes
The HDF5 file format is a versatile data storage format that can efficiently store large datasets. It is particularly useful in contexts where data needs to be retrieved for analysis, as it supports complex queries and data slicing. This method leverages the pandas HDFStore mechanism for storing DataFrames, which abstracts away many of the complexities of working directly with HDF5 files.
This method also includes the raw_data_config, config (the binning configuration), and run_info as attributes of the stored table, providing a comprehensive audit trail of how the binned data was generated. This can be crucial for reproducibility and understanding the context of the stored data.
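Continuing the hypothetical sketch from bin_data above (bd and df_bin refer to that example; the output file and table names are placeholders), the binned result could be persisted like so:
>>> bd.write_dataframe_to_table(df_bin,
...                             file="path/to/binned_data.h5",  # placeholder output file
...                             table="binned_obs")             # placeholder table name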
- GPSat.bin_data.get_bin_data_config()
- GPSat.bin_data.plot_wrapper(plt_df, val_col, lon_col='lon', lat_col='lat', date_col='date', scatter_plot_size=2, plt_where=None, projection=None, extent=None)
GPSat.dataloader module
- class GPSat.dataloader.DataLoader(hdf_store=None, dataset=None)
Bases:
object
- static add_cols(df, col_func_dict=None, filename=None, verbose=False)
Adds new columns to a given DataFrame based on the provided dictionary of column-function pairs.
This function allows the user to add new columns to a DataFrame using a dictionary that maps new column names to functions that compute the column values. The functions can be provided as values in the dictionary, and the new columns can be added to the DataFrame in a single call to this function.
If a tuple is provided as a key in the dictionary, it is assumed that the corresponding function will return multiple columns. The length of the returned columns should match the length of the tuple.
- Parameters:
- df: pandas.DataFrame
The input DataFrame to which new columns will be added.
- col_func_dict: dict, optional
A dictionary that maps new column names (keys) to functions (values) that compute the column values. If a tuple is provided as a key, it is assumed that the corresponding function will return multiple columns; the number of returned columns should match the length of the tuple. If None, an empty dictionary will be used. Default is None.
- filename: str, optional
The name of the file from which the DataFrame was read. This parameter is passed to the functions provided in col_func_dict. Default is None.
- verbose: int or bool, optional
Determines the level of verbosity of the function. If verbose is 3 or higher, the function will print messages about the columns being added. Default is False.
- Returns:
- None
- Raises:
- AssertionError
If the length of the new columns returned by the function does not match the length of the tuple key in the col_func_dict.
Notes
The DataFrame is manipulated in place. If a single value is returned by the function, it will be assigned to a column with the name specified in the key. See help(utils.config_func) for more details.
Examples
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> add_one = lambda x: x + 1
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> DataLoader.add_cols(df, col_func_dict={
...     'C': {'func': add_one, "col_args": "A"}
... })
>>> df
   A  B  C
0  1  4  2
1  2  5  3
2  3  6  4
- static add_data_to_col(df, add_data_to_col=None, verbose=False)
Adds new data to an existing column or creates a new column with the provided data in a DataFrame.
This function takes a DataFrame and a dictionary with the column name as the key and the data to be added as the value. It can handle scalar values or lists of values, and will replicate the DataFrame rows for each value in the list.
- Parameters:
- df: pandas.DataFrame
The input DataFrame to which data will be added or updated.
- add_data_to_col: dict, optional
A dictionary with the column name (key) and data to be added (value). The data can be a scalar value or a list of values. If a list of values is provided, the DataFrame rows will be replicated for each value in the list. If None, an empty dictionary will be used. Default is None.
- verbose: bool, default False
If True, the function will print progress messages.
- Returns:
- df: pandas.DataFrame
The DataFrame with the updated or added columns.
- Raises:
- AssertionError
If the add_data_to_col parameter is not a dictionary.
Notes
This method adds data to a specified column in a pandas DataFrame repeatedly. The method creates a copy of the DataFrame for each entry in the data to be added, and concatenates them to create a new DataFrame with the added data.
Examples
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> updated_df = DataLoader.add_data_to_col(df, add_data_to_col={"C": [7, 8]})
>>> print(updated_df)
   A  B  C
0  1  4  7
1  2  5  7
2  3  6  7
0  1  4  8
1  2  5  8
2  3  6  8
>>> len(df)
3
>>> out = DataLoader.add_data_to_col(df, add_data_to_col={"a": [1, 2, 3, 4]})
>>> len(out)
12
>>> out = DataLoader.add_data_to_col(df, add_data_to_col={"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
>>> len(out)
48
- static bin_data(df, x_range=None, y_range=None, grid_res=None, x_col='x', y_col='y', val_col=None, bin_statistic='mean', return_bin_center=True)
Bins data from a given DataFrame into a 2D grid, applying the specified statistical function to the data in each bin.
This function takes a DataFrame containing x, y, and value columns and bins the data into a 2D grid. It returns the resulting grid, as well as the x and y bin edges or centers, depending on the value of return_bin_center.
- Parameters:
- df: pd.DataFrame
The input DataFrame containing the data to be binned.
- x_range: list or tuple of floats, optional
The range of x values, specified as [min, max]. If not provided, a default value of [-4500000.0, 4500000.0] will be used.
- y_range: list or tuple of floats, optional
The range of y values, specified as [min, max]. If not provided, a default value of [-4500000.0, 4500000.0] will be used.
- grid_res: float or None
The grid resolution, expressed in kilometers. This parameter must be provided.
- x_col: str, default is "x"
The name of the column in the DataFrame containing the x values.
- y_col: str, default is "y"
The name of the column in the DataFrame containing the y values.
- val_col: str, optional
The name of the column in the DataFrame containing the values to be binned. This parameter must be provided.
- bin_statistic: str, default is "mean"
The statistic to apply to the binned data. Options are 'mean', 'median', 'count', 'sum', 'min', 'max', or a custom callable function.
- return_bin_center: bool, default is True
If True, the function will return the bin centers instead of the bin edges.
- Returns:
- binned_data: numpy.ndarray
The binned data as a 2D grid.
- x_out: numpy.ndarray
The x bin edges or centers, depending on the value of return_bin_center.
- y_out: numpy.ndarray
The y bin edges or centers, depending on the value of return_bin_center.
- classmethod bin_data_by(df, by_cols=None, val_col=None, x_col='x', y_col='y', x_range=None, y_range=None, grid_res=None, bin_statistic='mean', limit=10000)
Bins the input DataFrame df based on the given columns and computes the bin statistics for a specified value column.
This function takes a DataFrame, filters it based on the unique combinations of the by_cols column values, and then bins the data in each filtered DataFrame based on the x_col and y_col column values. It computes the bin statistic for the specified val_col and returns the result as an xarray DataArray. The output DataArray has dimensions "y", "x", and the given by_cols.
.- Parameters:
- df: pandas.DataFrame
The input DataFrame to be binned.
- by_cols: str or list[str] or tuple[str]
The column(s) by which the input DataFrame should be filtered. Unique combinations of these columns are used to create separate DataFrames for binning.
- val_col: str
The column in the input DataFrame for which the bin statistics should be computed.
- x_col: str, optional, default='x'
The column in the input DataFrame to be used for binning along the x-axis.
- y_col: str, optional, default='y'
The column in the input DataFrame to be used for binning along the y-axis.
- x_range: tuple, optional
The range of the x-axis values for binning. If None, the minimum and maximum x values are used.
- y_range: tuple, optional
The range of the y-axis values for binning. If None, the minimum and maximum y values are used.
- grid_res: float, optional
The resolution of the grid used for binning. If None, the resolution is calculated based on the input data.
- bin_statistic: str, optional, default="mean"
The statistic to compute for each bin. Supported values are "mean", "median", "sum", "min", "max", and "count".
- limit: int, optional, default=10000
The maximum number of unique combinations of the by_cols column values allowed. Raises an AssertionError if the number of unique combinations exceeds this limit.
- Returns:
- out: xarray.Dataset
The binned data as an xarray Dataset with dimensions 'y', 'x', and the given by_cols.
- Raises:
- DeprecationWarning
If the deprecated method DataLoader.bin_data_by(...) is used instead of DataPrep.bin_data_by(...).
- AssertionError
If any of the input parameters do not meet the specified conditions.
- connect_to_hdf_store(store, table=None, mode='r')
- classmethod data_select(obj, where=None, combine_where='AND', table=None, return_df=True, reset_index=False, drop=True, copy=True, columns=None, close=False, **kwargs)
Selects data from an input object (pd.DataFrame, pd.HDFStore, xr.DataArray or xr.Dataset) based on filtering conditions.
This function filters data from various types of input objects based on the conditions specified in the 'where' parameter. It also supports selecting specific columns, resetting the index, and returning the output as a DataFrame.
- Parameters:
- obj: pd.DataFrame, pd.Series, dict, pd.HDFStore, xr.DataArray, or xr.Dataset
The input object from which data will be selected. If dict, it will try to convert it to pandas.DataFrame.
- where: dict, list of dict or None, default None
Filtering conditions to be applied to the input object. It can be a single dictionary or a list of dictionaries. Each dictionary should have the keys "col", "comp", "val", e.g. where = {"col": "t", "comp": "<=", "val": 4}. The "col" value specifies the column, "comp" specifies the comparison to be performed (>, >=, ==, !=, <=, <) and "val" is the value to be compared against. If None, all data is selected. Specifying the 'where' parameter can avoid reading all data in from the filesystem when obj is a pandas.HDFStore or xarray.Dataset.
- combine_where: str, default 'AND'
How should where conditions, if there are multiple, be combined? Valid values are ["AND", "OR"], not case-sensitive.
- table: str, default None
The table name to select from when using an HDFStore object. If obj is a pandas.HDFStore then table must be supplied.
- return_df: bool, default True
If True, the output will be returned as a pandas.DataFrame.
- reset_index: bool, default False
If True, the index of the output DataFrame will be reset.
- drop: bool, default True
If True, the output will have the filtered-out values removed. Applicable only for xarray objects. Default is True.
- copy: bool, default True
If True, the output will be a copy of the selected data. Applicable only for DataFrame objects.
- columns: list or None, default None
A list of column names to be selected from the input object. If None, selects all columns.
- close: bool, default False
If True, and obj is a pandas.HDFStore, it will be closed after selecting data.
- kwargs: any
Additional keyword arguments to be passed to the obj.select method when using an HDFStore object.
- Returns:
- out: pandas.DataFrame, pandas.Series, or xarray.DataArray
The filtered data as a pd.DataFrame, pd.Series, or xr.DataArray, based on the input object type and the return_df parameter.
- Raises:
- AssertionError
If the table parameter is not provided when using an HDFStore object.
- AssertionError
If the provided columns are not found in the input object when using a DataFrame object.
Examples
>>> import pandas as pd
>>> import xarray as xr
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> # Select data from a DataFrame with a filtering condition
>>> selected_df = DataLoader.data_select(df, where={"col": "A", "comp": ">=", "val": 2})
>>> print(selected_df)
   A  B
1  2  5
2  3  6
- file_suffix_engine_map = {'csv': 'read_csv', 'h5': 'HDFStore', 'nc': 'netcdf4', 'parquet': 'read_parquet', 'tsv': 'read_csv', 'zarr': 'zarr'}
- classmethod generate_local_expert_locations(loc_dims, ref_data=None, format_type=None, masks=None, include_col='include', col_func_dict=None, row_select=None, keep_cols=None, sort_by=None)
- static get_attribute_from_table(source, table, attribute_name)
Retrieve an attribute from a specific table in an HDF5 file or HDFStore.
This function handles both cases: the source being a filepath string to an HDF5 file, or a pandas HDFStore object. The function opens the source (if it is a filepath), then attempts to retrieve the specified attribute from the specified table within the source. If the retrieval fails for any reason, a warning is issued and None is returned.
- Parameters:
- source: str or pandas.HDFStore
The source from which to retrieve the attribute. If it is a string, it is treated as a filepath to an HDF5 file. If it is a pandas HDFStore object, the function operates directly on it.
- table: str
The name of the table within the source from which to retrieve the attribute.
- attribute_name: str
The name of the attribute to retrieve.
- Returns:
- attribute: object
The attribute retrieved from the specified table in the source. If the attribute could not be retrieved, None is returned.
- Raises:
- NotImplementedError
If the type of the source is neither a string nor a pandas.HDFStore.
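For instance, the run_info attribute that BinData.write_dataframe_to_table stores alongside a table could be read back as follows (the path and table name are placeholders):
>>> from GPSat.dataloader import DataLoader
>>> run_info = DataLoader.get_attribute_from_table("path/to/binned_data.h5",
...                                                table="binned_obs",
...                                                attribute_name="run_info")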
- classmethod get_keys(source, verobse=False)
- static get_masks_for_expert_loc(ref_data, el_masks=None, obs_col=None)
Generate a list of masks based on given local expert locations (el_masks) and reference data (ref_data).
- This function can generate masks in two ways:
- If el_mask is a string "had_obs", a mask is created based on the obs_col of the reference data where any non-NaN value is present.
- If el_mask is a dictionary with "grid_space" key, a regularly spaced mask is created based on the dimensions specified and the grid_space value.
The reference data is expected to be an xarray DataArray or xarray Dataset. Support for pandas DataFrame may be added in future.
- Parameters:
- ref_data: xarray.DataArray or xarray.Dataset
The reference data to use when generating the masks. The data should have coordinates that match the dimensions specified in the el_masks dictionary, if provided.
- el_masks: list of str or dict, optional
A list of instructions for generating the masks. Each element in the list can be either a string or a dictionary. If a string, it should be "had_obs", which indicates a mask should be created where any non-NaN value is present in the obs_col of ref_data. If a dictionary, it should have a "grid_space" key indicating the regular spacing to be used when creating a mask and a 'dims' key specifying the dimensions in the reference data to be considered. By default, it is None, which indicates no mask is to be generated.
- obs_col: str, optional
The column in the reference data to use when generating a mask based on the "had_obs" instruction. This parameter is ignored if "had_obs" is not present in el_masks.
- Returns:
- list of xarray.DataArray
A list of masks generated based on the el_masks instructions. Each mask is an xarray DataArray with the same coordinates as the ref_data. Each value in the mask is a boolean indicating whether a local expert should be located at that point.
- Raises:
- AssertionError
If ref_data is not an instance of xarray.DataArray or xarray.Dataset, or if “grid_space” is in el_masks but the corresponding dimensions specified in the ‘dims’ key do not exist in ref_data.
Notes
The function could be extended to read data from the file system and to allow different reference data.
Future extensions could also include support for el_masks to be a list of dict only, and for the reference data to be a pandas DataFrame.
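A sketch of the two mask types described above; the dimension names ('y', 'x'), the grid_space value and the variable ds are assumptions about the reference data, not documented defaults.
>>> from GPSat.dataloader import DataLoader
>>> el_masks = ["had_obs",                                  # mask where obs_col has any non-NaN value
...             {"grid_space": 2, "dims": ["y", "x"]}]      # regularly spaced mask (assumed dims)
>>> masks = DataLoader.get_masks_for_expert_loc(ref_data=ds,  # an xarray Dataset defined elsewhere
...                                             el_masks=el_masks,
...                                             obs_col="obs")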
- static get_run_info(script_path=None)
Retrieves information about the current Python script execution environment, including run time, Python executable path, and Git information.
This function collects information about the current script execution environment, such as the date and time when the script is executed, the path of the Python interpreter, the script’s file path, and Git information (if available).
- Parameters:
- script_path: str, default None
The file path of the currently executed script. If None, it will try to retrieve the file path automatically.
- Returns:
- run_info: dict
A dictionary containing the following keys:
"run_time": The date and time when the script was executed, formatted as "YYYY-MM-DD HH:MM:SS".
"python_executable": The path of the Python interpreter.
"script_path": The absolute file path of the script (if available).
Git-related keys: "git_branch", "git_commit", "git_url", and "git_modified" (if available).
Examples
>>> from GPSat.dataloader import DataLoader
>>> run_info = DataLoader.get_run_info()
>>> print(run_info)
{
    "run_time": "2023-04-28 10:30:00",
    "python_executable": "/usr/local/bin/python3.9",
    "script_path": "/path/to/your/script.py",
    "branch": "main",
    "commit": "123abc",
    "remote": ["https://github.com/user/repo.git (fetch)", "https://github.com/user/repo.git (push)"],
    "details": ["commit 123abc", "Author: UserName <username42@gmail.com>", "Date: Fri Apr 28 07:22:31 2023 +0100", ":bug: fix"],
    "modified": ["list_of_files.py", "modified_since.py", "last_commit.py"]
}
- static get_where_list(global_select, local_select=None, ref_loc=None)
Generate a list of selection criteria for data filtering based on global and local conditions, as well as reference location.
The function accepts a list of global select conditions, and optional local select conditions and reference location. Each condition in global select can either be ‘static’ (with keys ‘col’, ‘comp’, and ‘val’) or ‘dynamic’ (requiring local select and reference location and having keys ‘loc_col’, ‘src_col’, ‘func’). The function evaluates each global select condition and constructs a corresponding selection dictionary.
- Parameters:
- global_select: list of dict
A list of dictionaries defining global selection conditions. Each dictionary can be either 'static' or 'dynamic'. 'Static' dictionaries should contain the keys 'col', 'comp', and 'val', which define a column, a comparison operator, and a value respectively. 'Dynamic' dictionaries should contain the keys 'loc_col', 'src_col', and 'func', which define a location column, a source column, and a function respectively.
- local_select: list of dict, optional
A list of dictionaries defining local selection conditions. Each dictionary should contain the keys 'col', 'comp', and 'val' defining a column, a comparison operator, and a value respectively. This parameter is required if any 'dynamic' condition is present in global_select.
- ref_loc: pandas DataFrame, optional
A reference location as a pandas DataFrame. This parameter is required if any 'dynamic' condition is present in global_select.
- Returns:
- list of dict
A list of dictionaries each representing a selection condition to be applied on data. Each dictionary contains keys ‘col’, ‘comp’, and ‘val’ defining a column, a comparison operator, and a value respectively.
- Raises:
- AssertionError
If a ‘dynamic’ condition is present in global_select but local_select or ref_loc is not provided, or if the required keys are not present in the ‘dynamic’ condition, or if the location column specified in a ‘dynamic’ condition is not present in ref_loc.
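To make the 'static' versus 'dynamic' distinction concrete, a hypothetical global_select might look like the following; the column names, the lambda inside 'func' and the ref_loc_df variable are assumptions for illustration, not part of the documented API.
>>> from GPSat.dataloader import DataLoader
>>> global_select = [
...     # 'static' condition: keys 'col', 'comp', 'val'
...     {"col": "lat", "comp": ">=", "val": 60.0},
...     # 'dynamic' condition: keys 'loc_col', 'src_col', 'func' (hypothetical function)
...     {"loc_col": "t", "src_col": "datetime",
...      "func": "lambda x, y: np.abs(x - y) <= 4"},
... ]
>>> where_list = DataLoader.get_where_list(global_select,
...                                        local_select=[{"col": "t", "comp": "<=", "val": 4}],
...                                        ref_loc=ref_loc_df)  # a pandas DataFrame of reference locations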
- static get_where_list_legacy(read_in_by=None, where=None)
Generate a list (of lists) of where conditions that can be consumed by pd.HDFStore(...).select.
- Parameters:
- read_in_by: dict of dict or None
Sub-dictionary must contain the keys "values" and "how".
- where: str or None
Used if read_in_by is not provided.
- Returns:
- list of list
Containing string where conditions.
- classmethod hdf_tables_in_store(store=None, path=None)
Retrieve the list of tables available in an HDFStore.
This class method allows the user to get the names of all tables stored in a given HDFStore. It accepts either an already open HDFStore object or a path to an HDF5 file. If a path is provided, the method will open the HDFStore in read-only mode, retrieve the table names, and then close the store.
- Parameters:
- store: pd.io.pytables.HDFStore, optional
An open HDFStore object. If this parameter is provided, path should not be specified.
- path: str, optional
The file path to an HDF5 file. If this parameter is provided, store should not be specified. The method opens the HDFStore at this path in read-only mode to retrieve the table names.
- Returns:
- list of str
A list containing the names of all tables in the HDFStore.
- Raises:
- AssertionError
If both store and path are None, or if the store provided is not an instance of pd.io.pytables.HDFStore.
Notes
The method ensures that only one of store or path is provided. If path is specified, the HDFStore is opened in read-only mode and closed after retrieving the table names.
Examples
>>> DataLoader.hdf_tables_in_store(store=my_store)
['/table1', '/table2']
>>> DataLoader.hdf_tables_in_store(path='path/to/hdf5_file.h5')
['/table1', '/table2', '/table3']
- static is_list_of_dict(lst)
Checks if the given input is a list of dictionaries.
This utility function tests whether the input is a list where all elements are instances of the dict type.
- Parameters:
- lst: list
The input list to be checked for containing only dictionaries.
- Returns:
- bool
True if the input is a list of dictionaries, False otherwise.
Examples
>>> from GPSat.dataloader import DataLoader
>>> DataLoader.is_list_of_dict([{"col": "t", "comp": "==", "val": 1}])
True
>>> DataLoader.is_list_of_dict([{"a": 1, "b": 2}, {"c": 3, "d": 4}])
True
>>> DataLoader.is_list_of_dict([1, 2, 3])
False
>>> DataLoader.is_list_of_dict("not a list")
False
- static kdt_tree_list_for_local_select(df, local_select)
Pre-calculates a list of KDTree objects for selecting points within a radius based on the local_select input.
Given a DataFrame and a list of local selection criteria, this function builds a list of KDTree objects that can be used for spatially selecting points within specified radii.
- Parameters:
- df: pd.DataFrame
The input DataFrame containing the data to be used for KDTree construction.
- local_select: list of dict
A list of dictionaries containing the selection criteria for each local select. Each dictionary should have the following keys:
"col": The name of the column(s) used for spatial selection. Can be a single string or a list of strings.
"comp": The comparison operator, either "<" or "<=". Currently, only less-than comparisons are supported for multi-dimensional values.
- Returns:
- out: list
A list of KDTree objects or None values, where each element corresponds to an entry in the local_select input. If an entry in local_select has a single string for "col", the corresponding output element will be None. Otherwise, the output element will be a KDTree object built from the specified columns.
Examples
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> local_select = [{"col": ["x", "y"], "comp": "<"}]
>>> kdt_trees = DataLoader.kdt_tree_list_for_local_select(df, local_select)
>>> print(kdt_trees)
- classmethod load(source, where=None, engine=None, table=None, source_kwargs=None, col_funcs=None, row_select=None, col_select=None, reset_index=False, add_data_to_col=None, close=False, verbose=False, combine_row_select='AND', **kwargs)
Load data from various sources and (optionally) apply selection of columns/rows and add/modify columns.
- Parameters:
- source: str, pd.DataFrame, pd.Series, pd.HDFStore or xr.Dataset, default None
If str, will try to convert to one of the other types.
- where: dict or list of dict, default None
Used when querying pd.HDFStore, xr.Dataset or xr.DataArray. Specified as a list of one or more dictionaries, each containing the keys:
"col": refers to a column (or variable for xarray objects).
"comp": the type of comparison to apply, e.g. "==", "!=", ">=", ">", "<=", "<".
"val": value to be compared with.
e.g. where = [{"col": "A", "comp": ">=", "val": 0}] will select entries where the column "A" is greater than or equal to 0.
Note: think of this as a database query, with where used to read data from the file system into memory.
- engine: str or None, default None
Specify the type of 'engine' to use to read in data. If not supplied, it will be inferred from source if source is a string. Valid values: "HDFStore", "netcdf4", "scipy", "pydap", "h5netcdf", "pynio", "cfgrib", "pseudonetcdf", "zarr" or any of the pandas "read_*" functions.
- table: str or None, default None
Used only if source is a pd.HDFStore (or is converted to one), and is required if so. Should be a valid table (i.e. key) in the HDFStore.
- source_kwargs: dict or None, default None
Additional keyword arguments to pass to the data source reading functions, depending on engine, e.g. keyword arguments for pandas.read_csv() if engine="read_csv".
- col_funcs: dict or None, default None
If dict, it will be provided to the add_cols method to add or modify columns.
- row_select: dict, list of dict, or None, default None
Used to select a subset of data after the data is initially read into memory. Can be the same type of input as where, i.e. row_select = {"col": "A", "comp": ">=", "val": 0}, or use col_funcs that return a bool array, e.g. row_select = {"func": "lambda x: ~np.isnan(x)", "col_args": 1}. See help(utils.config_func) for more details.
- col_select: list of str or None, default None
If specified as a list of strings, a subset of columns will be returned using col_select. All values must be valid. If None, all columns will be returned.
- filename: str or None, default None
Used by the add_cols method.
- reset_index: bool, default False
Apply reset_index(inplace=True) before returning?
- add_data_to_col:
Add a new column to the data frame. See the add_data_to_col argument of the add_data_to_col method.
- close: bool, default False
See DataLoader.data_select for details.
- verbose: bool, default False
Set verbosity.
- kwargs:
Additional arguments to be provided to the data_select method.
- Returns:
- pd.DataFrame
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df = DataLoader.load(source=df,
...                      where={"col": "A", "comp": ">=", "val": 2})
>>> print(df.head())
   A  B
0  2  5
1  3  6
If the data is stored in a file, we can extract it as follows (here, we assume the data is saved in “path/to/data.h5” under the table “data”):
>>> df = DataLoader.load(source="path/to/data.h5",
...                      table="data")
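A slightly fuller sketch combining where, row_select and col_funcs is shown below; the file path, table name and column names are illustrative assumptions, while the dictionary formats follow the parameter descriptions above.
>>> df = DataLoader.load(source="path/to/data.h5",
...                      table="data",
...                      where=[{"col": "lat", "comp": ">=", "val": 60}],
...                      row_select=[{"col": "A", "comp": ">=", "val": 0}],
...                      col_funcs={"A_squared": {"func": "lambda x: x**2",
...                                               "col_args": "A"}})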
- classmethod local_data_select(df, reference_location, local_select, kdtree=None, verbose=True)
Selects data from a DataFrame based on a given criteria and reference (expert) location.
This method applies local selection criteria to a DataFrame, allowing for flexible, column-wise data selection based on comparison operations. For multi (dimensional) column selections, a KDTree can be used for efficiency.
- Parameters:
- df: pd.DataFrame
The DataFrame from which data will be selected.
- reference_location: dict or pd.DataFrame
Reference location used for comparisons. If a DataFrame is provided, it will be converted to a dict.
- local_select: list of dict
List of dictionaries containing the selection criteria for each local select. Each dictionary must contain the keys 'col', 'comp', and 'val'. 'col' is the column in 'df' to apply the comparison on, 'comp' is the comparison operator as a string (can be '>=', '>', '==', '<', '<='), and 'val' is the value to compare with.
- kdtree: KDTree or list of KDTree, optional
Precomputed KDTree or list of KDTrees for optimization. Each KDTree in the list corresponds to an entry in local_select. If not provided, a new KDTree will be created.
- verbose: bool, default=True
If True, print details for each selection criterion.
- Returns:
- pd.DataFrame
A DataFrame containing only the data that meets all of the selection criteria.
- Raises:
- AssertionError
If ‘col’ is not in ‘df’ or ‘reference_location’, if the comparison operator in ‘local_select’ is not valid, or if the provided ‘kdtree’ is not of type KDTree.
Notes
If ‘col’ is a string, a simple comparison is performed. If ‘col’ is a list of strings, a KDTree-based selection is performed where each dimension is a column from ‘df’. For multi-dimensional comparisons, only less than comparisons are currently handled.
If ‘kdtree’ is provided and is a list, it must be of the same length as ‘local_select’ with each element corresponding to the same index in ‘local_select’.
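A minimal sketch: select rows of df within an assumed radius of a reference (expert) location, using the two-column KDTree comparison described in the Notes. The interpretation of 'val' as a radius is an assumption here.
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"x": [0.0, 1.0, 5.0], "y": [0.0, 1.0, 5.0], "obs": [0.1, 0.2, 0.3]})
>>> ref_loc = {"x": 0.0, "y": 0.0}
>>> local_select = [{"col": ["x", "y"], "comp": "<", "val": 2.0}]  # keep points within radius 2.0 (assumed semantics)
>>> nearby = DataLoader.local_data_select(df,
...                                       reference_location=ref_loc,
...                                       local_select=local_select,
...                                       verbose=False)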
- static make_multiindex_df(idx_dict, **kwargs)
Create a multi-indexed DataFrame from the provided index dictionary for each keyword argument supplied.
This function creates a multi-indexed DataFrame, with each row having the same multi-index value. The index dictionary serves as the levels and labels for the multi-index, while the keyword arguments provide the data.
- Parameters:
- idx_dict: dict or pd.Series
A dictionary or pandas Series containing the levels and labels for the multi-index.
- **kwargs: dict
Keyword arguments specifying the data and column names for the resulting DataFrame. The data can be of various types: int, float, bool, np.ndarray, pd.DataFrame, dict, or tuple. This data will be transformed into a DataFrame, to which the multi-index will be added.
- Returns:
- dict
A dictionary containing the multi-indexed DataFrames with keys corresponding to the keys of provided keyword arguments.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> idx_dict = {"year": 2020, "month": 1}
>>> data = pd.DataFrame({"x": np.arange(10)})
>>> df = pd.DataFrame({"y": np.arange(3)})
>>> DataLoader.make_multiindex_df(idx_dict, data=data, df=df)
{'data': <pandas.DataFrame (multiindexed) with shape (3, 4)>}
- static mindex_df_to_mindex_dataarray(df, data_name, dim_cols=None, infer_dim_cols=True, index_name='index')
Converts a multi-index DataFrame to a multi-index DataArray.
The method facilitates a transition from pandas DataFrame representation to the Xarray DataArray format, while preserving multi-index structure. This can be useful for higher-dimensional indexing, labeling, and performing mathematical operations on the data.
- Parameters:
- df: pd.DataFrame
The input DataFrame with a multi-index to be converted to a DataArray.
- data_name: str
The name of the column in 'df' that contains the data values for the DataArray.
- dim_cols: list of str, optional
A list of columns in 'df' that will be used as additional dimensions in the DataArray. If None, dimension columns will be inferred if 'infer_dim_cols' is True.
- infer_dim_cols: bool, default=True
If True and 'dim_cols' is None, dimension columns will be inferred from 'df'. Columns will be considered dimension columns if they match the pattern "^_dim_\d".
- index_name: str, default="index"
The name assigned to the placeholder index created during the conversion process.
- Returns:
- xr.DataArray
A DataArray derived from the input DataFrame with the same multi-index structure. The data values are taken from the column in ‘df’ specified by ‘data_name’. Additional dimensions can be included from ‘df’ as specified by ‘dim_cols’.
- Raises:
- AssertionError
If ‘data_name’ is not a column in ‘df’.
Notes
The function manipulates ‘df’ by reference. If the original DataFrame needs to be preserved, provide a copy to the function.
- classmethod read_flat_files(file_dirs, file_regex, sub_dirs=None, read_csv_kwargs=None, col_funcs=None, row_select=None, col_select=None, new_column_names=None, strict=True, verbose=False)
Wrapper for read_from_multiple_files with read_engine='csv'.
Read flat files (.csv, .tsv, etc.) from the file system and return a pd.DataFrame object.
- Parameters:
- file_dirs: str or List[str]
The directories containing the files to read.
- file_regex: str
A regular expression pattern to match file names within the specified directories.
- sub_dirs: str or List[str], optional
Subdirectories within each file directory to search for files.
- read_csv_kwargs: dict, optional
Additional keyword arguments specifically for CSV reading. These are keyword arguments for the function pandas.read_csv().
- col_funcs: dict of dict, optional
A dictionary with column names as keys and column functions to apply during data reading as values. The column functions should be a dictionary of keyword arguments to utils.config_func.
- row_select: list of dict, optional
A list of functions to select rows during data reading.
- col_select: list of str, optional
A list of column names to read from data.
- new_column_names: List[str], optional
New column names to assign to the resulting DataFrame.
- strict: bool, default True
Whether to raise an error if a file directory does not exist.
- verbose: bool or int, default False
Verbosity level for printing progress.
- Returns:
- pd.DataFrame
A DataFrame containing the combined data from multiple files.
Notes
This method reads data from multiple files located in specified directories and subdirectories.
The file_regex argument is used to filter files to be read. Various transformations can be applied to the data, including adding new columns and selecting rows/columns.
If new_column_names is provided, it should be a list with names matching the number of columns in the output DataFrame. The resulting DataFrame contains the combined data from all the specified files.
Examples
The command below reads the files "A_RAW.csv", "B_RAW.csv" and "C_RAW.csv" in the path "/path/to/dir" and combines them into a single dataframe.
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> col_funcs = {
...     "source": {  # Add a new column "source" with entries "A", "B" or "C".
...         "func": "lambda x: re.sub('_RAW.*$', '', os.path.basename(x))",
...         "filename_as_arg": True
...     },
...     "datetime": {  # Modify column "datetime" by converting to datetime64[s].
...         "func": "lambda x: x.astype('datetime64[s]')",
...         "col_args": "datetime"
...     },
...     "obs": {  # Rename column "z" to "obs" and subtract mean value 0.1.
...         "func": "lambda x: x-0.1",
...         "col_args": "z"
...     }
... }
>>> row_select = [  # Read data whose "lat" value is >= 65.
...     {
...         "func": "lambda x: x>=65",
...         "col_kwargs": {
...             "x": "lat"
...         }
...     }
... ]
>>> df = DataLoader.read_flat_files(file_dirs="/path/to/dir/",
...                                 file_regex=".*_RAW.csv$",
...                                 col_funcs=col_funcs,
...                                 row_select=row_select)
>>> print(df.head(2))
         lon        lat             datetime source     obs
0  59.944790  82.061122  2020-03-01 13:48:50      C -0.0401
1  59.939555  82.063771  2020-03-01 13:48:50      C -0.0861
- classmethod read_from_multiple_files(file_dirs, file_regex, read_engine='csv', sub_dirs=None, col_funcs=None, row_select=None, col_select=None, new_column_names=None, strict=True, read_kwargs=None, read_csv_kwargs=None, verbose=False)
Reads and merges data from multiple files in specified directories, Optionally apply various transformations such as column renaming, row selection, column selection or other transformation functions to the data.
The primary input is a list of directories and a regular expression used to select which files within those directories should be read.
- Parameters:
- file_dirs: list of str
A list of directories to read the files from. Each directory is a string. If a string is provided instead of a list, it will be wrapped into a single-element list.
- file_regex: str
Regular expression used to match the files to be read from the directories specified in 'file_dirs', e.g. "NEW.csv$" will match all files ending with "NEW.csv".
- read_engine: str, optional
The engine to be used to read the files. Options include 'csv', 'nc', 'netcdf', and 'xarray'. Default is 'csv'.
- sub_dirs: list of str, optional
A list of subdirectories to be appended to each directory in 'file_dirs'. If a string is provided, it will be wrapped into a single-element list. Default is None.
- col_funcs: dict, optional
A dictionary that maps new column names to functions that compute the column values. Provided to add_cols via the col_func_dict parameter. Default is None.
- row_select: list of dict, optional
A list of dictionaries, each representing a condition to select rows from the DataFrame. Provided to the row_select_bool method. Default is None.
- col_select: slice, optional
A slice object to select specific columns from the DataFrame. If not provided, all columns are selected.
- new_column_names: list of str, optional
New names for the DataFrame columns. The length should be equal to the number of columns in the DataFrame. Default is None.
- strict: bool, optional
Determines whether to raise an error if a directory in 'file_dirs' does not exist. If False, a warning is issued instead. Default is True.
- read_kwargs: dict, optional
Additional keyword arguments to pass to the read function (pd.read_csv or xr.open_dataset). Default is None.
- read_csv_kwargs: dict, optional
Deprecated. Additional keyword arguments to pass to pd.read_csv. Use 'read_kwargs' instead. Default is None.
- verbose: bool or int, optional
Determines the verbosity level of the function. If True or an integer equal to or higher than 3, additional print statements are executed.
- Returns:
- out: pandas.DataFrame
The resulting DataFrame, merged from all the files that were read and processed.
- Raises:
- AssertionError
Raised if the ‘read_engine’ parameter is not one of the valid choices, if ‘read_kwargs’ or ‘col_funcs’ are not dictionaries, or if the length of ‘new_column_names’ is not equal to the number of columns in the DataFrame. Raised if ‘strict’ is True and a directory in ‘file_dirs’ does not exist.
Notes
The function supports reading from csv, netCDF files and xarray Dataset formats. For netCDF and xarray Dataset, the data is converted to a DataFrame using the ‘to_dataframe’ method.
- static read_from_npy(npy_files, npy_dir, dims=None, flatten_xy=True, return_xarray=True)
Read NumPy array(s) from the specified .npy file(s) and return as xarray DataArray(s).
This function reads one or more .npy files from the specified directory and returns them as xarray DataArray(s). The input can be a single file, a list of files, or a dictionary of files with the desired keys. The returned dictionary contains the xarray DataArray(s) with the corresponding keys.
- Parameters:
- npy_files: str, list, or dict
The .npy file(s) to be read. It can be a single file (str), a list of files, or a dictionary of files.
- npy_dir: str
The directory containing the .npy file(s).
- dims: list or tuple, optional
The dimensions for the xarray DataArray(s) (default: None).
- flatten_xy: bool, optional
If True, flatten the x and y arrays by taking the first row and first column, respectively (default: True).
- return_xarray: bool, default True
If True, the NumPy arrays are converted to xarray DataArrays; otherwise a dict of NumPy arrays is returned.
- Returns:
- dict
A dictionary containing xarray DataArray(s) with keys corresponding to the input files.
Examples
>>> read_from_npy(npy_files="data.npy", npy_dir="./data")
{'obs': <xarray.DataArray (shape)>}
>>> read_from_npy(npy_files=["data1.npy", "data2.npy"], npy_dir="./data")
{'obs': [<xarray.DataArray (shape1)>, <xarray.DataArray (shape2)>]}
>>> read_from_npy(npy_files={"x": "data_x.npy", "y": "data_y.npy"}, npy_dir="./data")
{'x': <xarray.DataArray (shape_x)>, 'y': <xarray.DataArray (shape_y)>}
- static read_from_pkl_dict(pkl_files, pkl_dir=None, default_name='obs', strict=True, dim_names=None)
Reads and processes data from pickle files and returns a DataFrame containing all data.
- Parameters:
- pkl_files: str, list, or dict
The pickle file(s) to be read. This can be a string (representing a single file), a list of strings (representing multiple files), or a dictionary where keys are the names of different data sources and the values are lists of file names.
- pkl_dir: str, optional
The directory where the pickle files are located. If not provided, the current directory is used.
- default_name: str, optional
The default data source name. This is used when pkl_files is a string or a list. Default is "obs".
- strict: bool, optional
If True, the function will raise an exception if a file does not exist. If False, it will print a warning and continue with the remaining files. Default is True.
- dim_names: list, optional
The names of the dimensions. This is used when converting the data to a DataArray. If not provided, default names are used.
- Returns:
- DataFrame
A DataFrame containing the data from all provided files. The DataFrame has a MultiIndex with ‘idx0’, ‘idx1’ and ‘date’ as index levels, and ‘obs’ and ‘source’ as columns. Each ‘source’ corresponds to a different data source (file).
Notes
The function reads the data from the pickle files and converts them into a DataFrame. For each file, it creates a MultiIndex DataFrame where the indices are a combination of two dimensions and dates extracted from the keys in the dictionary loaded from the pickle file.
The function assumes the dictionary loaded from the pickle file has keys that can be converted to dates with the format "YYYYMMDD". It also assumes that the values in the dictionary are 2D numpy arrays.
If pkl_files is a string or a list, the function treats them as files from a single data source and uses default_name as the source name. If it’s a dictionary, the keys are treated as data source names, and the values are lists of file names.
When multiple files are provided, the function concatenates the data along the date dimension.
- static read_hdf(table, store=None, path=None, close=True, **select_kwargs)
Reads data from an HDF5 file, and returns a DataFrame.
This method can either read data directly from an open HDF5 store or from a provided file path. In case a file path is provided, it opens the HDF5 file in read mode, and closes it after reading, if ‘close’ is set to True.
- Parameters:
- table: str
The key or the name of the dataset in the HDF5 file.
- store: pd.io.pytables.HDFStore, optional
An open HDF5 store. If provided, the method will directly read data from it. Default is None.
- path: str, optional
The path to the HDF5 file. If provided, the method will open the HDF5 file in read mode and read data from it. Default is None.
- close: bool, optional
A flag that indicates whether to close the HDF5 store after reading the data. It is only relevant when 'path' is provided, in which case the default is True.
- **select_kwargs: dict, optional
Additional keyword arguments that are passed to the 'select' method of the HDFStore object. This can be used to select only a subset of data from the HDF5 file.
- Returns:
- df: pd.DataFrame
A DataFrame containing the data read from the HDF5 file.
- Raises:
- AssertionError
If both ‘store’ and ‘path’ are None, or if ‘store’ is not an instance of pd.io.pytables.HDFStore.
Notes
Either ‘store’ or ‘path’ must be provided. If ‘store’ is provided, ‘path’ will be ignored.
Examples
>>> store = pd.HDFStore('data.h5')
>>> df = DataLoader.read_hdf(table='my_data', store=store)
>>> print(df)
- classmethod row_select_bool(df, row_select=None, combine='AND', **kwargs)
Returns a boolean array indicating which rows of the DataFrame meet the specified conditions.
This class method applies a series of conditions, provided in the ‘row_select’ list, to the input DataFrame ‘df’. Each condition is represented by a dictionary that is used as input to the ‘_bool_numpy_from_where’ method.
All conditions are combined via an ‘&’ operator, meaning if all conditions for a given row are True the return value for that row entry will be True and False if any condition is not satisfied.
If ‘row_select’ is None or an empty dictionary, all indices will be True.
- Parameters:
- df: DataFrame
The DataFrame to apply the conditions on.
- row_select: list of dict, optional
A list of dictionaries, each representing a condition to apply to 'df'. Each dictionary should contain the information needed for the '_bool_numpy_from_where' method. If None or an empty dictionary, all indices in the returned array will be True.
- verbose: bool or int, optional
If set to True or a number greater than or equal to 3, additional print statements will be executed.
- kwargs: dict
Additional keyword arguments passed to the '_bool_numpy_from_where' method.
- Returns:
- select: np.array of bool
A boolean array indicating which rows of the DataFrame meet the conditions. The length of the array is equal to the number of rows in 'df'.
- Raises:
- AssertionError
If ‘row_select’ is not None, not a dictionary and not a list, or if any element in ‘row_select’ is not a dictionary.
Notes
The function is designed to work with pandas DataFrames.
If ‘row_select’ is None or an empty dictionary, the function will return an array with all elements set to True (indicating all rows of ‘df’ are selected).
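As a small illustration (the condition format follows the where/row_select dictionaries used elsewhere in this module; the displayed output is what such a selection would be expected to look like, not a verified doctest):
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> mask = DataLoader.row_select_bool(df, row_select=[{"col": "A", "comp": ">=", "val": 2}])
>>> df.loc[mask]
   A  B
1  2  5
2  3  6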
- classmethod write_to_hdf(df, store, table=None, append=False, config=None, run_info=None)
- static write_to_netcdf(ds, path, mode='w', **to_netcdf_kwargs)
GPSat.dataprepper module
- class GPSat.dataprepper.DataPrep
Bases:
object
- static bin_data(df, x_range=None, y_range=None, grid_res=None, x_col='x', y_col='y', val_col=None, bin_statistic='mean', bin_2d=True, return_bin_center=True)
Bins the data contained within a DataFrame into a grid, optionally computes a statistic on the binned data, and returns the binned data along with the bin edges or centers.
This method supports both 2D and 1D binning, allowing for various statistical computations on the binned values such as mean, median, count, etc.
- Parameters:
- dfpandas.DataFrame
The DataFrame containing the data to be binned.
- x_rangetuple of float, optional
The minimum and maximum values of the x-axis to be binned. If not provided, a default range is used.
- y_rangetuple of float, optional
The minimum and maximum values of the y-axis to be binned. Only required for 2D binning. If not provided, a default range is used.
- grid_resfloat
The resolution of the grid in the same units as the x and y data. Defines the size of each bin.
- x_colstr, default ‘x’
The name of the column in df that contains the x-axis values.
- y_colstr, default ‘y’
The name of the column in df that contains the y-axis values. Ignored if bin_2d is False.
- val_colstr
The name of the column in df that contains the values to be binned and aggregated.
- bin_statisticstr, default ‘mean’
The statistic to compute on the binned data. Can be ‘mean’, ‘median’, ‘count’, or any other statistic supported by scipy.stats.binned_statistic or scipy.stats.binned_statistic_2d.
- bin_2dbool, default True
If True, performs 2D binning using both x and y values. If False, performs 1D binning using only x values.
- return_bin_centerbool, default True
If True, returns the center of each bin. If False, returns the edges of the bins.
- Returns:
- binned_datanumpy.ndarray
An array of the binned and aggregated data. The shape of the array depends on the binning dimensions and the grid resolution.
- x_binnumpy.ndarray
An array of the x-axis bin centers or edges, depending on the value of return_bin_center.
- y_binnumpy.ndarray, optional
An array of the y-axis bin centers or edges (depending on the value of return_bin_center), only returned if bin_2d is True.
- Raises:
- AssertionError
If val_col or grid_res is not specified, or if the DataFrame df is empty. Also raises an error if the provided x_range or y_range are invalid or if the specified column names are not present in df.
Notes
The default x_range and y_range are set to [-4500000.0, 4500000.0] if not provided.
This method requires that val_col and grid_res be explicitly provided.
The binning process is influenced by the bin_statistic parameter, which determines how the values in each bin are aggregated.
When bin_2d is False, y_col is ignored and only x_col and val_col are used for binning.
The method ensures that the x_col, y_col, and val_col exist in the DataFrame df.
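Examples
A minimal sketch; the column names, grid resolution and synthetic data below are illustrative only.
>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataprepper import DataPrep
>>> rng = np.random.default_rng(0)
>>> df = pd.DataFrame({"x": rng.uniform(-4.5e6, 4.5e6, 1000),
...                    "y": rng.uniform(-4.5e6, 4.5e6, 1000),
...                    "obs": rng.normal(size=1000)})
>>> vals, x_centers, y_centers = DataPrep.bin_data(df, val_col="obs", grid_res=500_000,
...                                                bin_statistic="mean", bin_2d=True)
>>> vals.shape  # one aggregated value per grid cell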
- classmethod bin_data_by(df, col_funcs=None, row_select=None, by_cols=None, val_col=None, x_col='x', y_col='y', x_range=None, y_range=None, grid_res=None, bin_statistic='mean', bin_2d=True, limit=10000, return_df=False, verbose=False)
Class method to bin data by given columns.
- Parameters:
- dfpandas.DataFrame
The dataframe containing the data to be binned.
- col_funcsdict, optional
Dictionary with functions to be applied on the dataframe columns.
- row_selectdict, optional
Dictionary with conditions to select rows of the dataframe.
- by_colsstr, list, tuple, optional
Columns to be used for binning.
- val_colstr, optional
Column with values to be used for binning.
- x_colstr, optional
Name of the column to be used for x-axis, by default ‘x’.
- y_colstr, optional
Name of the column to be used for y-axis, by default ‘y’.
- x_rangelist, tuple, optional
Range for the x-axis binning.
- y_rangelist, tuple, optional
Range for the y-axis binning.
- grid_resfloat, optional
Grid resolution for the binning process.
- bin_statisticstr or list, optional
Statistic(s) to compute (default is ‘mean’).
- bin_2dbool, default True
If True, bin data on a 2D grid; otherwise, perform 1D binning using only the x values.
- limitint, optional
Maximum number of unique values for the by_cols, by default 10000.
- return_dfbool, default False
If True, return results in a pandas DataFrame; otherwise, return an xarray Dataset.
- verbosebool or int, optional
If True, or an integer greater than 0, print information about the process.
- Returns:
- xarray.Dataset
An xarray.Dataset containing the binned data.
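Examples
For example (illustrative only; column names and values are made up), binning by a date column might look like:
>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataprepper import DataPrep
>>> rng = np.random.default_rng(0)
>>> df = pd.DataFrame({"x": rng.uniform(-4.5e6, 4.5e6, 500),
...                    "y": rng.uniform(-4.5e6, 4.5e6, 500),
...                    "obs": rng.normal(size=500),
...                    "date": "2020-01-01"})
>>> ds = DataPrep.bin_data_by(df=df, by_cols=["date"], val_col="obs",
...                           grid_res=500_000, bin_statistic="mean")
>>> # ds is an xarray.Dataset with a 'date' coordinate; pass return_df=True for a DataFrame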
GPSat.datetime_utils module
- GPSat.datetime_utils.date_from_datetime(dt)
Remove the time component of an array of datetimes (represented as strings) and just return the date
The datetime format is expected to be YYYY-MM-DD HH:mm:SS. The returned date format is YYYYMMDD.
- Parameters:
- dt: list, np.array, pd.Series
string with datetime format YYYY-MM-DD HH:mm:SS.
- Returns:
- numpy.ndarray: A date array with format YYYYMMDD.
Note
This function uses a lambda function to remove the time portion and the dash from the datetime column. It then returns a numpy array of the resulting date column. It is possible to use apply on a Series to achieve the same result, but it may not be as fast as using a lambda function and numpy array.
- GPSat.datetime_utils.datetime_from_float_column(float_datetime, epoch=(1950, 1, 1), time_unit='D')
Converts a float datetime column to a datetime64 format.
- Parameters:
- float_datetimepd.Series or np.array
A pandas series or numpy array containing float values, corresponding to datetime.
- epochtuple, default is (1950, 1, 1).
A tuple representing the epoch date in the format (year, month, day).
- time_unitstr, optional
The time unit of the float datetime values. Default is ‘D’ (days).
- Returns:
- numpy.ndarray
A numpy array of datetime64 values, with dtype ‘datetime64[s]’
Examples
>>> df = pd.DataFrame({'float_datetime': [18262.5, 18263.5, 18264.5]})
>>> datetime_from_float_column(df['float_datetime'])
array(['2000-01-01T12:00:00', '2000-01-02T12:00:00', '2000-01-03T12:00:00'], dtype='datetime64[s]')
>>> df = pd.DataFrame({'float_datetime': [18262.5, 18263.5, 18264.5]})
>>> datetime_from_float_column(df['float_datetime'], epoch=(1970, 1, 1))
array(['2020-01-01T12:00:00', '2020-01-02T12:00:00', '2020-01-03T12:00:00'], dtype='datetime64[s]')
>>> x = np.array([18262.5, 18263.5, 18264.5])
>>> datetime_from_float_column(x, epoch=(1970, 1, 1))
array(['2020-01-01T12:00:00', '2020-01-02T12:00:00', '2020-01-03T12:00:00'], dtype='datetime64[s]')
- GPSat.datetime_utils.datetime_from_ymd_cols(year, month, day, hhmmss)
Converts separate columns/arrays of year, month, day, and time (in hhmmss format) into a numpy array of datetime objects.
- Parameters:
- yeararray-like
An array of integers representing the year.
- montharray-like
An array of integers representing the month (1-12).
- dayarray-like
An array of integers representing the day of the month.
- hhmmssarray-like
An array of integers representing the time in hhmmss format.
- Returns:
- datetimenumpy.ndarray
An array of datetime objects representing the input dates and times.
- Raises:
- AssertionError
If the input arrays are not of equal length.
Examples
>>> year = [2021, 2021, 2021]
>>> month = [1, 2, 3]
>>> day = [10, 20, 30]
>>> hhmmss = [123456, 234537, 165648]
>>> datetime_from_ymd_cols(year, month, day, hhmmss)
array(['2021-01-10T12:34:56', '2021-02-20T23:45:37', '2021-03-30T16:56:48'], dtype='datetime64[s]')
- GPSat.datetime_utils.from_file_start_end_datetime_GPOD(f, df)
Extract an implied sequence of evenly spaced time intervals based off of a ‘processed’ GPOD file name
This function takes in a file path and a pandas dataframe as input. It extracts the start and end datetime from the file name and calculates the time interval between them.
It then generates a datetime array with the same length as the dataframe, evenly spaced over the time interval. The resulting datetime array is returned.
- Parameters:
- f: str
filename
- df: pd.DataFrame, pd.Series, np.array, tuple, list
the len(df) is used to determine the number and size of the intervals
- Returns:
- np.array
dtype datetime64[ns]
Examples
>>> f = "/path/to/S3A_GPOD_SAR__SRA_A__20191031T233355_20191101T002424_2019112_IL_v3.proc"
>>> df = pd.DataFrame({"x": np.arange(11)})
>>> from_file_start_end_datetime_GPOD(f, df)
array(['2019-10-31T23:33:55.000000000', '2019-10-31T23:38:57.900000000',
       '2019-10-31T23:44:00.800000000', '2019-10-31T23:49:03.700000000',
       '2019-10-31T23:54:06.600000000', '2019-10-31T23:59:09.500000000',
       '2019-11-01T00:04:12.400000000', '2019-11-01T00:09:15.300000000',
       '2019-11-01T00:14:18.200000000', '2019-11-01T00:19:21.100000000',
       '2019-11-01T00:24:24.000000000'], dtype='datetime64[ns]')
- GPSat.datetime_utils.from_file_start_end_datetime_SARAL(f, df)
This function takes in a file path to a file and a pandas dataframe and returns a numpy array of datetime objects.
The file path is expected to be in the format of SARAL data files, with the datetime information encoded in the file name. The function extracts the start and end datetime information from the file name, calculates the time interval between them based on the length of the dataframe, and generates a numpy array of datetime objects with the same length as the dataframe.
- Parameters:
- f: str
the file path of the SARAL data file
- df: pd.DataFrame
the data contained in the SARAL data file
- Returns:
- np.array
datetime objects, representing the time stamps of the data in the SARAL data file with dtype: ‘datetime64[s]’
Examples
>>> f = "/path/to/SARAL_C139_0036_20200331_234125_20200401_003143_CS2mss_IL_v1.proc"
>>> df = pd.DataFrame({"x": np.arange(11)})
>>> from_file_start_end_datetime_SARAL(f, df)
array(['2020-03-31T23:41:25', '2020-03-31T23:46:26', '2020-03-31T23:51:28',
       '2020-03-31T23:56:30', '2020-04-01T00:01:32', '2020-04-01T00:06:34',
       '2020-04-01T00:11:35', '2020-04-01T00:16:37', '2020-04-01T00:21:39',
       '2020-04-01T00:26:41', '2020-04-01T00:31:43'], dtype='datetime64[s]')
GPSat.decorators module
- GPSat.decorators.timer(func)
This function is a decorator that can be used to time the execution of other functions.
It takes a function as an argument and returns a new function that wraps the original function.
When the wrapped function is called, it measures the time it takes to execute the original function and prints the result to the console.
The function uses the time.perf_counter() function to measure the time. This function returns the current value of a performance counter, which is a high-resolution timer that measures the time in seconds since a fixed point in time.
The wrapped function takes any number of positional and keyword arguments, which are passed on to the original function. The result of the original function is returned by the wrapped function.
The decorator also uses the functools.wraps() function to preserve the metadata of the original function, such as its name, docstring, and annotations. This makes it easier to debug and introspect the code.
To use the decorator, apply it to the function you want to time:
@timer
def my_function():
    ...
GPSat.local_experts module
- class GPSat.local_experts.LocalExpertData(obs_col: str | None = None, coords_col: list | None = None, global_select: list | None = None, local_select: list | None = None, where: list | None = None, row_select: list | None = None, col_select: list | None = None, col_funcs: list | None = None, table: str | None = None, data_source: str | None = None, engine: str | None = None, read_kwargs: dict | None = None)
Bases:
object
- col_funcs: list | None = None
- col_select: list | None = None
- coords_col: list | None = None
- data_source: str | None = None
- engine: str | None = None
- file_suffix_engine_map = {'csv': 'read_csv', 'h5': 'HDFStore', 'nc': 'netcdf4', 'tsv': 'read_csv', 'zarr': 'zarr'}
- global_select: list | None = None
- load(where=None, verbose=False, **kwargs)
- local_select: list | None = None
- obs_col: str | None = None
- read_kwargs: dict | None = None
- row_select: list | None = None
- set_data_source(verbose=False)
- table: str | None = None
- where: list | None = None
- class GPSat.local_experts.LocalExpertOI(expert_loc_config: Dict | ExpertLocsConfig | None = None, data_config: Dict | DataConfig | None = None, model_config: Dict | ModelConfig | None = None, pred_loc_config: Dict | PredictionLocsConfig | None = None, local_expert_config: ExperimentConfig | None = None)
Bases:
object
This provides the main interface for conducting an experiment in GPSat to predict an underlying field from satellite measurements using local Gaussian process (GP) models. This proceeds by iterating over the local expert locations, training the local GPs on data in a neighbourhood of the expert location and making predictions at specified locations. The results will be saved in an HDF5 file.
Example usage:
>>> store_path = "/path/to/store.h5"
>>> locexp = LocalExpertOI(data_config, model_config, expert_loc_config, pred_loc_config)
>>> locexp.run(store_path=store_path)  # Run full sweep and save results in store_path
- static dict_of_array_to_table(x, ref_loc=None, concat=False, table=None, default_dim=1)
given a dictionary of numpy arrays create DataFrame(s) with ref_loc as the multi index
- file_suffix_engine_map = {'csv': 'read_csv', 'h5': 'HDFStore', 'nc': 'netcdf4', 'tsv': 'read_csv', 'zarr': 'zarr'}
- load_params(model, previous=None, previous_params=None, file=None, param_names=None, ref_loc=None, index_adjust=None, table_suffix='', **param_dict)
- plot_locations_and_obs(image_file, obs_col=None, lat_col='lat', lon_col='lon', exprt_lon_col='lon', exprt_lat_col='lat', sort_by='date', col_funcs=None, xrpt_loc_col_funcs=None, vmin=None, vmax=None, s=0.5, s_exprt_loc=250, cbar_label='Input Observations', cmap='YlGnBu_r', figsize=(15, 15), projection=None, extent=None)
- run(store_path=None, store_every=10, check_config_compatible=True, skip_valid_checks_on=None, optimise=True, predict=True, min_obs=3, table_suffix='')
Run a full sweep to perform local optimal interpolation at every expert location. The results will be stored in an HDF5 file containing (1) the predictions at each location, (2) parameters of the model at each location, (3) run details such as run times, and (4) the full experiment configuration.
- Parameters:
- store_path: str
File path where results should be stored as HDF5 file.
- store_every: int, default 10
Results will be stored to file after every store_every expert locations. Reduce if optimisation is slow; must be greater than 1.
- check_config_compatible: bool, default True
Check if the current LocalExpertOI configuration is compatible with the previous one, if applicable. If a file exists in store_path, the oi_config attribute in the oi_config table will be checked to ensure that configurations are compatible.
- skip_valid_checks_on: list, optional
When checking if config is compatible, skip keys specified in this list.
- optimise: bool, default True
If True, will run model.optimise_parameters() to learn the model parameters at each expert location.
- predict: bool, default True
If True, will run model.predict() to make predictions at the locations specified in the prediction locations configuration.
- min_obs: int, default 3
Minimum number of observations required to run optimisation or make predictions.
- table_suffix: str, optional
Suffix to be appended to all table names when writing to file.
- Returns:
- None
Notes
By default, both training and inference are performed at every location. However, one can opt to do either one with the optimise and predict options, respectively.
If check_config_compatible is set to True, it makes sure that all results saved to store_path use the same configuration. That is, if one re-runs an experiment with a different configuration but pointing to the same store_path, it will return an error. Make sure that if you run an experiment with a different configuration, you either set a different store_path or, if you want to override the results, delete the generated store_path.
The table_suffix is useful for storing multiple results in a single HDF5 file, each with a different suffix. See <hyperparameter smoothing> for an example use case.
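As a sketch illustrating the optimise/predict switches and table_suffix (the configuration objects and store path below are placeholders): one could run optimisation only, then write a prediction-only pass under a different table suffix.
>>> locexp = LocalExpertOI(expert_loc_config=expert_loc_config, data_config=data_config,
...                        model_config=model_config, pred_loc_config=pred_loc_config)
>>> locexp.run(store_path="/path/to/store.h5", optimise=True, predict=False)  # training only
>>> locexp.run(store_path="/path/to/store.h5", optimise=False, predict=True,
...            table_suffix="_preds")  # prediction-only pass, written to suffixed tables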
- set_data(**kwargs)
- set_expert_locations(df=None, file=None, source=None, where=None, add_data_to_col=None, col_funcs=None, keep_cols=None, col_select=None, row_select=None, sort_by=None, reset_index=False, source_kwargs=None, verbose=False, **kwargs)
- set_model(oi_model=None, init_params=None, constraints=None, load_params=None, optim_kwargs=None, pred_kwargs=None, params_to_store=None, replacement_threshold=None, replacement_model=None, replacement_init_params=None, replacement_constraints=None, replacement_optim_kwargs=None, replacement_pred_kwargs=None)
- set_pred_loc(**kwargs)
- GPSat.local_experts.get_results_from_h5file(results_file, global_col_funcs=None, merge_on_expert_locations=True, select_tables=None, table_suffix='', add_suffix_to_table=True, verbose=False)
Retrieve results from an HDF5 file.
- Parameters:
- results_file: str
The location where the results file is saved. Must point to an HDF5 file with the file extension .h5.
- select_tables: list, optional
A list of table names to select from the HDF5 file.
- global_col_funcs: dict, optional
A dictionary of column functions to apply to selected tables.
- merge_on_expert_locations: bool, default True
Whether to merge expert location data with results data.
- table_suffix: str, optional
A suffix to add to selected table names.
- add_suffix_to_table: bool, default True
Whether to add the table suffix to selected table names.
- verbose: bool, default False
Set verbosity.
- Returns:
- tuple:
A tuple containing two elements:
dict: A dictionary of DataFrames where each table name is the key. This contains the predictions and learned model parameters at every location.
list: A list of configuration dictionaries.
Notes
This function reads data from an HDF5 file, applies optional column functions, and optionally merges expert location data with results data.
The 'select_tables' parameter allows you to choose specific tables from the HDF5 file.
Column functions specified in 'global_col_funcs' can be applied to selected tables.
Expert location data can be merged onto results data if 'merge_on_expert_locations' is set to True.
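Examples
A minimal sketch; the file path is a placeholder and the available table names depend on the model and run configuration.
>>> from GPSat.local_experts import get_results_from_h5file
>>> dfs, configs = get_results_from_h5file("/path/to/store.h5")
>>> list(dfs.keys())   # table names, e.g. predictions and hyperparameter tables
>>> configs[0]         # the (first) experiment configuration stored with the results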
GPSat.plot_utils module
- GPSat.plot_utils.get_projection(projection=None)
- GPSat.plot_utils.plot_gpflow_minimal_example(model: object, model_init: object = None, opt_params: object = None, pred_params: object = None) object
Run a basic usage example for a given model. Model will be initialised, parameters will be optimised and predictions will be made for the minimal model example found (as of 2023-05-04):
https://gpflow.github.io/GPflow/2.8.0/notebooks/getting_started/basic_usage.html
Methods called are: optimise_parameters, predict, get_parameters
predict is expected to return a dict with 'f*', 'f*_var' and 'y_var' as np.arrays
- Parameters:
- model: any model inherited from BaseGPRModel
- model_init: dict or None, default None
dict of parameters to be provided when model is initialised. If None default parameters are used
- opt_params: dict or None, default None
dict of parameters to be passed to optimise_parameter method. If None default parameters are used
- pred_params: dict or None, default None
dict of parameters to be passed to predict method. If None default parameters are used
- Returns:
- tuple:
A tuple containing the predictions dictionary and the parameters dictionary.
- GPSat.plot_utils.plot_hist(ax, data, title='Histogram / Density', ylabel=None, xlabel=None, select_bool=None, stats_values=None, stats_loc=(0.2, 0.9), drop_nan_inf=True, q_vminmax=None, rasterized=False)
- GPSat.plot_utils.plot_hist_from_results_data(ax, dfs, table, val_col, load_kwargs=None, plot_kwargs=None, verbose=False, **kwargs)
- GPSat.plot_utils.plot_hyper_parameters(dfs, coords_col, row_select=None, table_names=None, table_suffix='', plot_template: dict | None = None, plots_per_row=3, suptitle='hyper params', qvmin=0.01, qvmax=0.99)
- GPSat.plot_utils.plot_pcolormesh(ax, lon, lat, plot_data, fig=None, title=None, vmin=None, vmax=None, qvmin=None, qvmax=None, cmap='YlGnBu_r', cbar_label=None, scatter=False, extent=None, ocean_only=False, **scatter_args)
- GPSat.plot_utils.plot_pcolormesh_from_results_data(ax, dfs, table, val_col, lon_col=None, lat_col=None, x_col=None, y_col=None, lat_0=90, lon_0=0, fig=None, load_kwargs=None, plot_kwargs=None, weighted_values_kwargs=None, verbose=False, **kwargs)
- GPSat.plot_utils.plot_wrapper(plt_df, val_col, lon_col='lon', lat_col='lat', scatter_plot_size=2, plt_where=None, projection=None, extent=None, max_obs=1000000.0, vmin_max=None, q_vminmax=None, abs_vminmax=False, stats_loc=None, figsize=None, where_sep='\n ')
- GPSat.plot_utils.plot_xy(ax, x, y, title=None, y_label=None, x_label=None, xtick_rotation=45, scatter=False, **kwargs)
- GPSat.plot_utils.plot_xy_from_results_data(ax, dfs, table, x_col, y_col, load_kwargs=None, plot_kwargs=None, verbose=False, **kwargs)
- GPSat.plot_utils.plots_from_config(plot_configs, dfs: dict[str, DataFrame], plots_per_row: int = 3, num_plots_row_col_size: dict[int, dict] | None = None, suptitle: str = '')
GPSat.postprocessing module
- class GPSat.postprocessing.SmoothingConfig(l_x: int | float = 1, l_y: int | float = 1, max: int | float = None, min: int | float = None)
Bases:
object
Configuration used for hyperparameter smoothing.
- Attributes:
- l_x: int or float, default 1
The lengthscale (x-direction) parameter for Gaussian smoothing.
- l_y: int or float, default 1
The lengthscale (y-direction) parameter for Gaussian smoothing.
- max: int or float, optional
Maximal value that the hyperparameter can take.
- min: int or float, optional
Minimal value that the hyperparameter can take.
Notes
This configuration is used to smooth 2D hyperparameter fields.
- get(key, default=None)
- l_x: int | float = 1
- l_y: int | float = 1
- max: int | float = None
- min: int | float = None
- GPSat.postprocessing.get_smooth_params_config()
- GPSat.postprocessing.glue_local_predictions(preds_df: DataFrame, inference_radius: DataFrame, R: int | float | list = 3) DataFrame
DEPRECATED. See glue_local_predictions_1d and glue_local_predictions_2d.
Glues overlapping predictions by taking a normalised Gaussian weighted average.
WARNING: This method only deals with expert locations on a regular grid.
- Parameters:
- preds_df: pd.DataFrame
containing predictions generated from local expert OI. It should have the following columns:
- pred_loc_x (float): The x-coordinate of the prediction location.
- pred_loc_y (float): The y-coordinate of the prediction location.
- f* (float): The predictive mean at the location (pred_loc_x, pred_loc_y).
- f*_var (float): The predictive variance at the location (pred_loc_x, pred_loc_y).
- expert_locs_df: pd.DataFrame
containing local expert locations used to perform OI. It should have the following columns:
- x (float): The x-coordinate of the expert location.
- y (float): The y-coordinate of the expert location.
- sigma: int, float, or list, default 3
The standard deviation of the Gaussian weighting in the x and y directions. If a single value is provided, it is used for both directions. If a list is provided, the first value is used for the x direction and the second value is used for the y direction. Defaults to 3.
- Returns:
- pd.DataFrame:
dataframe consisting of glued predictions (mean and std). It has the following columns:
- pred_loc_x (float): The x-coordinate of the prediction location.
- pred_loc_y (float): The y-coordinate of the prediction location.
- f* (float): The glued predictive mean at the location (pred_loc_x, pred_loc_y).
- f*_std (float): The glued predictive standard deviation at the location (pred_loc_x, pred_loc_y).
Notes
The function assumes that the expert locations are equally spaced in both the x and y directions. The function uses the scipy.stats.norm.pdf function to compute the Gaussian weights. The function normalizes the weighted sums with the total weights at each location.
- GPSat.postprocessing.glue_local_predictions_1d(preds_df: DataFrame, pred_loc_col: str, xprt_loc_col: str, vars_to_glue: str | List[str], inference_radius: int | float | dict, R=3) DataFrame
Glues together overlapping local expert predictions in 1D by Gaussian-weighted averaging.
- Parameters:
- preds_df: pandas dataframe
A dataframe containing the results of local experts predictions. The dataframe should have columns containing the (1) prediction locations, (2) expert locations, and (3) any predicted variables we wish to glue (e.g. the predictive mean).
- pred_loc_col: str
The column in the results dataframe corresponding to the prediction locations
- xprt_loc_col: str
The column in the results dataframe corresponding to the local expert locations
- vars_to_glue: str | list of strs
The column(s) corresponding to variables we wish to glue (e.g. the predictive mean and variance).
- inference_radius: int | float | dict
The inference radius for each local expert. If specified as a dict, the keys should be the expert locations and the values the corresponding inference radius of that expert. If specified as an int or float, all experts are assumed to have the same inference radius.
- R: int | float, default 3
A weight controlling the standard deviation of the Gaussian weights. The standard deviation will be given by the formula
std = inference_radius / R
. The default value of 3 will place 99% of the Gaussian mass within the inference radius.
- Returns:
- pandas dataframe
A dataframe of glued predictions, whose columns contain (1) the prediction locations and (2) the glued variables.
- GPSat.postprocessing.glue_local_predictions_2d(preds_df: DataFrame, pred_loc_cols: List[str], xprt_loc_cols: List[str], vars_to_glue: str | List[str], inference_radius: int | float | dict, R=3) DataFrame
Glues together overlapping local expert predictions in 2D by Gaussian-weighted averaging.
- Parameters:
- preds_df: pandas dataframe
A dataframe containing the results of local experts predictions. The dataframe should have columns containing the (1) prediction locations, (2) expert locations, and (3) any predicted variables we wish to glue (e.g. the predictive mean).
- pred_loc_cols: list of strs
The xy-columns in the results dataframe corresponding to the prediction locations
- xprt_loc_cols: list of strs
The xy-columns in the results dataframe corresponding to the local expert locations
- vars_to_glue: str | list of strs
The column(s) corresponding to variables we wish to glue (e.g. the predictive mean and variance).
- inference_radius: int | float
The inference radius for each local expert. All experts are assumed to have the same inference radius.
- R: int | float, default 3
A weight controlling the standard deviation of the Gaussian weights. The standard deviation will be given by the formula
std = inference_radius / R
. The default value of 3 will place 99% of the Gaussian mass within the inference radius.
- Returns:
- pandas dataframe
A dataframe of glued predictions, whose columns contain (1) the prediction locations and (2) the glued variables.
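Examples
A small synthetic sketch (column names are arbitrary) showing the Gaussian-weighted averaging: two experts equidistant from a prediction location contribute equal weight, so the glued mean is the simple average.
>>> import pandas as pd
>>> from GPSat.postprocessing import glue_local_predictions_2d
>>> preds = pd.DataFrame({"pred_x": [0.0, 0.0], "pred_y": [0.0, 0.0],
...                       "x": [-1.0, 1.0], "y": [0.0, 0.0],
...                       "f*": [1.0, 3.0]})
>>> glued = glue_local_predictions_2d(preds, pred_loc_cols=["pred_x", "pred_y"],
...                                   xprt_loc_cols=["x", "y"], vars_to_glue="f*",
...                                   inference_radius=2.0)
>>> # expected: a single row at (0, 0) with f* = 2.0 (equal weights)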
- GPSat.postprocessing.smooth_hyperparameters(result_file: str, params_to_smooth: List[str], smooth_config_dict: Dict[str, SmoothingConfig], xy_dims: List[str] = ['x', 'y'], reference_table_suffix: str = '', table_suffix: str = '_SMOOTHED', output_file: str = None, model_name: str = None, save_config_file: bool = True)
Smooth hyperparameters in an HDF5 results file using Gaussian smoothing.
- Parameters:
- result_file: str
The path to the HDF5 results file.
- params_to_smooth: list of str
A list of hyperparameters to be smoothed.
- smooth_config_dict: Dict[str, SmoothingConfig]
A dictionary specifying smoothing configurations for each hyperparameter. Keys are hyperparameter names and values are instances of the SmoothingConfig class specifying smoothing parameters.
- xy_dims: list of str, default ['x', 'y']
The dimensions to use for smoothing.
- reference_table_suffix: str, default ""
The suffix to use for reference table names.
- table_suffix: str, default "_SMOOTHED"
The suffix to add to smoothed hyperparameter table names.
- output_file: str, optional
The path to the output HDF5 file to store smoothed hyperparameters.
- model_name: str, optional
The name of the model for which hyperparameters are being smoothed.
- save_config_file: bool, optional
Whether to save a configuration file for making predictions with smoothed values.
- Returns:
- None
Notes
This function applies Gaussian smoothing to specified hyperparameters in an HDF5 results file.
The output_file parameter allows you to specify a different output file for storing the smoothed hyperparameters.
If model_name is not provided, it will be determined from the input HDF5 file.
If save_config_file is True, a configuration file for making predictions with smoothed values will be saved.
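Examples
A sketch of a typical call; the file path, lengthscales and hyperparameter names are placeholders (the actual names depend on the model used).
>>> from GPSat.postprocessing import smooth_hyperparameters, SmoothingConfig
>>> smooth_config = {
...     "lengthscales": SmoothingConfig(l_x=200_000, l_y=200_000, min=1.0),
...     "kernel_variance": SmoothingConfig(l_x=200_000, l_y=200_000),
... }
>>> smooth_hyperparameters(result_file="/path/to/store.h5",
...                        params_to_smooth=list(smooth_config.keys()),
...                        smooth_config_dict=smooth_config)
>>> # smoothed tables are written with the default "_SMOOTHED" suffix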
GPSat.prediction_locations module
GPSat.read_and_store module
- GPSat.read_and_store.get_dirs_to_search(file_dirs, sub_dirs=None, walk=False)
- GPSat.read_and_store.update_attr(x, cid, vals)
GPSat.utils module
- GPSat.utils.EASE2toWGS84(x, y, return_vals='both', lon_0=0, lat_0=90)
Converts EASE2 grid coordinates to WGS84 longitude and latitude coordinates.
- Parameters:
- x: float
EASE2 grid x-coordinate in meters.
- y: float
EASE2 grid y-coordinate in meters.
- return_vals: str, optional
Determines what values to return. Valid options are "both" (default), "lon", or "lat".
- lon_0: float, optional
Longitude of the center of the EASE2 grid in degrees. Default is 0.
- lat_0: float, optional
Latitude of the center of the EASE2 grid in degrees. Default is 90.
- Returns:
- tuple or float
Depending on the value of
return_vals
, either a tuple of WGS84 longitude and latitude coordinates (both floats), or a single float representing either the longitude or latitude.
- Raises:
- AssertionError
If
return_vals
is not one of the valid options.
Examples
>>> EASE2toWGS84(1000000, 2000000)
(153.434948822922, 69.86894542225777)
- GPSat.utils.EASE2toWGS84_New(*args, **kwargs)
- GPSat.utils.WGS84toEASE2(lon, lat, return_vals='both', lon_0=0, lat_0=90)
Converts WGS84 longitude and latitude coordinates to EASE2 grid coordinates.
- Parameters:
- lon: float
Longitude coordinate in decimal degrees.
- lat: float
Latitude coordinate in decimal degrees.
- return_vals: str, optional
Determines what values to return. Valid options are "both" (default), "x", or "y".
- lon_0: float, optional
Longitude of the center of the EASE2 grid in decimal degrees. Default is 0.
- lat_0: float, optional
Latitude of the center of the EASE2 grid in decimal degrees. Default is 90.
- Returns:
- float
If return_vals is "x", returns the x EASE2 grid coordinate in meters.
- float
If return_vals is "y", returns the y EASE2 grid coordinate in meters.
- tuple of float
If return_vals is "both", returns a tuple of (x, y) EASE2 grid coordinates in meters.
- Raises:
- AssertionError
If
return_vals
is not one of the valid options.
Examples
>>> WGS84toEASE2(-105.01621, 39.57422)
(-5254767.014984061, 1409604.1043472202)
- GPSat.utils.WGS84toEASE2_New(*args, **kwargs)
- GPSat.utils.array_to_dataframe(x, name, dim_prefix='_dim_', reset_index=False)
Converts a numpy array to a pandas DataFrame with a multi-index based on the array’s dimensions.
(Also see dataframe_to_array)
- Parameters:
- xnp.ndarray
The numpy array to be converted to a DataFrame.
- namestr
The name of the column in the resulting DataFrame.
- dim_prefixstr, optional
The prefix to be used for the dimension names in the multi-index. Default is "_dim_". Integers will be appended to dim_prefix for each dimension of x, i.e. if x is 2-d, it will have dimension names "_dim_0", "_dim_1", assuming the default dim_prefix is used.
- reset_index: bool, optional
Whether to reset the index of the resulting DataFrame. Default is False.
- Returns:
- outpd.DataFrame
The resulting DataFrame with a multi-index based on the dimensions of the input array.
- Raises:
- AssertionError
If the input is not a numpy array.
Examples
>>> # express a 2d numpy array in DataFrame
>>> x = np.array([[1, 2], [3, 4]])
>>> array_to_dataframe(x, "data")
               data
_dim_0 _dim_1
0      0          1
       1          2
1      0          3
       1          4
- GPSat.utils.assign_category_col(val, df, categories=None)
Generate a categorical pd.Series equal in length to a reference DataFrame (df).
- Parameters:
- valstr
The value to assign to the categorical Series.
- dfpandas DataFrame
reference DataFrame, used to determine length of output
- categorieslist, optional
A list of categories to be used for the categorical column.
- Returns:
- pandas Categorical Series
A categorical column with the assigned value and specified categories (if provided).
Notes
This function creates a new categorical column in the DataFrame with the specified value and categories. If categories are not provided, they will be inferred from the data. The function returns a pandas Categorical object representing the new column.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
>>> x_series = assign_category_col('x', df)
- GPSat.utils.bin_obs_by_date(df, val_col, date_col='date', all_dates_in_range=True, x_col='x', y_col='y', grid_res=None, date_col_format='%Y%m%d', x_min=-4500000.0, x_max=4500000.0, y_min=-4500000.0, y_max=4500000.0, n_x=None, n_y=None, bin_statistic='mean', verbose=False)
This function takes in a pandas DataFrame and bins the data based on the values in a specified column and the x and y coordinates in other specified columns. The data is binned based on a grid with a specified resolution or number of bins. The function returns a dictionary of binned values for each unique date in the DataFrame.
- Parameters:
- df: pandas DataFrame
A DataFrame containing the data to be binned.
- val_col: string
Name of the column containing the values to be binned.
- date_col: string, default “date”
Name of the column containing the dates for which to bin the data.
- all_dates_in_range: boolean, default True
Whether to include all dates in the range of the DataFrame.
- x_col: string, default “x”
Name of the column containing the x coordinates.
- y_col: string, default “y”
Name of the column containing the y coordinates.
- grid_res: float or int, default None
Resolution of the grid in kilometers. If None, then n_x and n_y must be specified.
- date_col_format: string, default "%Y%m%d"
Format of the date column.
- x_min: float, default -4500000.0
Minimum x value for the grid.
- x_max: float, default 4500000.0
Maximum x value for the grid.
- y_min: float, default -4500000.0
Minimum y value for the grid.
- y_max: float, default 4500000.0
Maximum y value for the grid.
- n_x: int, default None
Number of bins in the x direction.
- n_y: int, default None
Number of bins in the y direction.
- bin_statistic: string or callable, default “mean”
Statistic to compute in each bin.
- verbose: boolean, default False
Whether to print additional information during execution.
- Returns:
- bvals: dictionary
The binned values for each unique date in the DataFrame.
- x_edge: numpy array
x values for the edges of the bins.
- y_edge: numpy array
y values for the edges of the bins.
Notes
The x and y coordinates are swapped in the returned binned values due to the transpose operation used in the function.
- GPSat.utils.check_prev_oi_config(prev_oi_config, oi_config, skip_valid_checks_on=None)
This function checks if the previous configuration matches the current one. It takes in two dictionaries, prev_oi_config and oi_config, which represent the previous and current configurations respectively. The function also takes an optional list skip_valid_checks_on, which contains keys that should be skipped during the comparison.
- Parameters:
- prev_oi_config: dict
Previous configuration to be compared against.
- oi_config: dict
Current configuration to compare against
prev_oi_config
.- skip_valid_checks_on: list or None, default None
If not
None
, should be a list of keys to not check.
- Returns:
- None
Notes
If skip_valid_checks_on is not provided, it defaults to an empty list. The function then compares the two configurations and raises an AssertionError if any key-value pairs do not match.
This function assumes that the configurations are represented as dictionaries and that the keys in both dictionaries are the same.
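Examples
For illustration (configuration contents are made up):
>>> from GPSat.utils import check_prev_oi_config
>>> prev = {"data": {"obs_col": "obs"}, "model": "GPflowGPRModel"}
>>> curr = {"data": {"obs_col": "obs"}, "model": "GPflowSGPRModel"}
>>> check_prev_oi_config(prev, curr, skip_valid_checks_on=["model"])  # passes: 'model' is skipped
>>> check_prev_oi_config(prev, curr)  # raises AssertionError: 'model' differs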
- GPSat.utils.compare_dataframes(df1, df2, merge_on, columns_to_compare, drop_other_cols=False, how='outer', suffixes=['_1', '_2'])
- GPSat.utils.config_func(func, source=None, args=None, kwargs=None, col_args=None, col_kwargs=None, df=None, filename_as_arg=False, filename=None, col_numpy=True)
Apply a function based on configuration input.
The aim is to allow one to apply a function, possibly on data from a DataFrame, using a specification that can be stored in a JSON configuration file.
Note
This function uses eval() so could allow for arbitrary code execution.
If a DataFrame df is provided, then input can be provided (col_args and/or col_kwargs) based on columns of df.
- Parameters:
- func: str or callable.
If str, it will use eval(func) to convert it to a function.
If it contains one of "|", "&", "=", "+", "-", "*", "/", "%", "<", and ">", it will create a lambda function:
lambda arg1, arg2: eval(f"arg1 {func} arg2")
If eval(func) raises NameError and source is not None, it will run f"from {source} import {func}" and try again. This is to allow importing a function from a source.
- source: str or None, default None
Package name where func can be found, if applicable. Used to import func from a package, e.g.
>>> GPSat.utils.config_func(func="cumprod", source="numpy", ...)
calls the function cumprod from the package numpy.
- args: list or None, default None
If None, an empty list will be used, i.e. no args will be used. The values will be unpacked and provided to func: i.e. func(*args, **kwargs)
- kwargs: dict or None, default None
If dict, it will be unpacked (**kwargs) to provide keyword arguments to func.
- col_args: None or list of str, default None
If DataFrame df is provided, col_args can be used to specify which columns of df will be passed into func as arguments.
- col_kwargs: None or dict, default is None
Keyword arguments to be passed to func, specified as a dict whose keys are parameters of func and values are column names of a DataFrame df. Only applicable if df is provided.
- df: DataFrame or None, default None
To provide if one wishes to use columns of a DataFrame as arguments to func.
- filename_as_arg: bool, default False
Set True if filename is to be used as an argument to func.
- filename: str or None, default None
If filename_as_arg is True, then filename will be provided as the first arg.
- col_numpy: bool, default True
If True, when extracting columns from a DataFrame, .values is used to convert to a numpy array.
- Returns:
- any
Values returned by applying func on data. The type depends on func.
- Raises:
- AssertionError
If kwargs is not a dict, col_kwargs is not a dict, or func is not a string or callable.
- AssertionError
If df is not provided but col_args or col_kwargs are.
- AssertionError
If func is a string and cannot be imported on its own and source is None.
Examples
>>> import pandas as pd
>>> from GPSat.utils import config_func
>>> config_func(func="lambda x, y: x + y", args=[1, 1])  # computes 1 + 1
2
>>> config_func(func="==", args=[1, 1])  # computes 1 == 1
True
Using columns of a DataFrame as inputs:
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> config_func(func="lambda x, y: x + y", df=df, col_args=["A", "B"])  # computes df["A"] + df["B"]
array([5, 7, 9])
>>> config_func(func="<=", col_args=["A", "B"], df=df)  # computes df["A"] <= df["B"]
array([ True, True, True])
We can also use functions from an external package by specifying source. For example, the below reproduces the last example in numpy.cumprod:
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> config_func(func="cumprod", source="numpy", df=df, kwargs={"axis": 0}, col_args=[["A", "B"]])
array([[  1,   4],
       [  2,  20],
       [  6, 120]])
- GPSat.utils.convert_lon_lat_str(x)
Converts a string representation of longitude or latitude to a float value.
- Parameters:
- x: str
A string representation of longitude or latitude in the format "[degrees] [minutes] [direction]", where [direction] is one of "N", "S", "E", or "W".
- Returns:
- float
The converted value of the input string as a float.
- Raises:
- AssertionError
If the input is not a string.
Examples
>>> convert_lon_lat_str('74 0.1878 N')
74.00313
>>> convert_lon_lat_str('140 0.1198 W')
-140.001997
- GPSat.utils.cprint(x, c='ENDC', bcolors=None, sep=' ', end='\n')
Add color to print statements.
Based off of https://stackoverflow.com/questions/287871/how-do-i-print-colored-text-to-the-terminal.
- Parameters:
- x: str
String to be printed.
- c: str, default “ENDC”
Valid key in bcolors. If bcolors is not provided, a default will be used, containing keys: 'HEADER', 'OKBLUE', 'OKCYAN', 'OKGREEN', 'WARNING', 'FAIL', 'ENDC', 'BOLD', 'UNDERLINE'.
- bcolors: dict or None, default None
Dict with values being colors / how to format the font. These can be chained together. See the codes in: https://en.wikipedia.org/wiki/ANSI_escape_code#3-bit_and_4-bit.
- sep: str, default “ “
sep argument passed along to print().
- end: str, default "\n"
end argument passed along to print().
- Returns:
- None
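Examples
For example:
>>> from GPSat.utils import cprint
>>> cprint("optimisation complete", c="OKGREEN")
>>> cprint("results file already exists", c="WARNING")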
- GPSat.utils.dataframe_to_2d_array(df, x_col, y_col, val_col, tol=1e-09, fill_val=nan, dtype=None, decimals=1)
Extract values from a DataFrame to create a 2-d array of values (val_col), assuming the values came from a 2-d array. Requires dimension columns x_col, y_col (which do not have to be ordered in the DataFrame).
- Parameters:
- df: pandas.DataFrame
The dataframe to convert to a 2D array.
- x_col: str
The name of the column in the dataframe that contains the x coordinates.
- y_col: str
The name of the column in the dataframe that contains the y coordinates.
- val_col: str
The name of the column in the dataframe that contains the values to be placed in the 2D array.
- tol: float, default 1e-9
The tolerance for matching the x and y coordinates to the grid.
- fill_val: float, default np.nan
The value to fill the 2D array with if a coordinate is missing.
- dtype: str or numpy.dtype or None, default None
The data type of the values in the 2D array.
- decimals: int, default 1
The number of decimal places to round x and y values to before taking unique. If decimals is negative, it specifies the number of positions to the left of the decimal point.
- Returns:
- tuple
A tuple containing the 2D numpy array of values, the x coordinates of the grid, and the y coordinates of the grid.
- Raises:
- AssertionError
If any of the required columns are missing from the dataframe, or if any coordinates have more than one value.
Notes
The spacing of the grid is determined by the smallest step size in the x_col, y_col directions, respectively.
This is meant to reverse the process of putting values from a regularly spaced grid into a DataFrame. Do not expect this to work on arbitrary x, y coordinates.
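Examples
A minimal sketch, reversing a DataFrame built from a small regular grid (column names are arbitrary):
>>> import pandas as pd
>>> from GPSat.utils import dataframe_to_2d_array
>>> df = pd.DataFrame({"x": [0.0, 1.0, 0.0, 1.0],
...                    "y": [0.0, 0.0, 1.0, 1.0],
...                    "val": [1.0, 2.0, 3.0, 4.0]})
>>> vals, x_grid, y_grid = dataframe_to_2d_array(df, x_col="x", y_col="y", val_col="val")
>>> vals.shape  # 2-d array of 'val' on the reconstructed 2 x 2 grid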
- GPSat.utils.dataframe_to_array(df, val_col, idx_col=None, dropna=True, fill_val=nan)
Converts a pandas DataFrame to a numpy array, where the DataFrame has columns that represent dimensions of the array and the DataFrame rows represent values in the array.
- Parameters:
- dfpandas DataFrame
The DataFrame containing values convert to a numpy ndarray.
- val_colstr
The name of the column in the DataFrame that contains the values to be placed in the array.
- idx_colstr or list of str or None, default None
The name(s) of the column(s) in the DataFrame that represent the dimensions of the array. If not provided, the index of the DataFrame will be used as the dimension(s).
- dropnabool, default True
Whether to drop rows with missing values before converting to the array.
- fill_valscalar, default np.nan
The value to fill in the array for missing values.
- Returns:
- numpy array
The resulting numpy array.
- Raises:
- AssertionError
If the dimension values are not integers or have gaps, or if the
idx_col
parameter contains column names that are not in the DataFrame.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from GPSat.utils import dataframe_to_array
>>> df = pd.DataFrame({
...     'dim1': [0, 0, 1, 1],
...     'dim2': [0, 1, 0, 1],
...     'values': [1, 2, 3, 4]
... })
>>> arr = dataframe_to_array(df, 'values', ['dim1', 'dim2'])
>>> print(arr)
[[1 2]
 [3 4]]
- GPSat.utils.dict_of_array_to_dict_of_dataframe(array_dict, concat=False, reset_index=False)
Converts a dictionary of arrays to a dictionary of pandas DataFrames.
- Parameters:
- array_dictdict
A dictionary where the keys are strings and the values are numpy arrays.
- concatbool, optional
If True, concatenates DataFrames with the same number of dimensions. Default is False.
- reset_index: bool, optional
If True, resets the index of each DataFrame. Default is False.
- Returns:
- dict
A dictionary where the keys are strings and the values are pandas DataFrames.
Notes
This function uses the array_to_dataframe function to convert each array to a DataFrame. If concat is True, it will concatenate DataFrames with the same number of dimensions. If reset_index is True, it will reset the index of each DataFrame.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> array_dict = {'a': np.array([1, 2, 3]), 'b': np.array([[1, 2], [3, 4]]), 'c': np.array([1.1, 2.2, 3.3])}
>>> dict_of_array_to_dict_of_dataframe(array_dict)
{'a': a _dim_0 0 1 1 2 2 3, 'b': b _dim_0 _dim_1 0 0 1 1 2 1 0 3 1 4, 'c': c _dim_0 0 1.1 1 2.2 2 3.3}
>>> dict_of_array_to_dict_of_dataframe(array_dict, concat=True)
{1: a c _dim_0 0 1 1.1 1 2 2.2 2 3 3.3, 2: b _dim_0 _dim_1 0 0 1 1 2 1 0 3 1 4}
>>> dict_of_array_to_dict_of_dataframe(array_dict, reset_index=True)
{'a': _dim_0 a 0 0 1 1 1 2 2 2 3, 'b': _dim_0 _dim_1 b 0 0 0 1 1 0 1 2 2 1 0 3 3 1 1 4, 'c': _dim_0 c 0 0 1.1 1 1 2.2 2 2 3.3}
- GPSat.utils.diff_distance(x, p=2, k=1, default_val=nan)
- GPSat.utils.expand_dict_by_vals(d, expand_keys)
- GPSat.utils.get_col_values(df, col, return_numpy=True)
This function takes in a pandas DataFrame, a column name or index, and a boolean flag indicating whether to return the column values as a numpy array or not. It returns the values of the specified column as either a pandas Series or a numpy array, depending on the value of the return_numpy flag.
If the column is specified by name and it does not exist in the DataFrame, the function will attempt to use the column index instead. If the column is specified by index and it is not a valid integer index, the function will raise an AssertionError.
- Parameters:
- df: pandas DataFrame
A pandas DataFrame containing data.
- col: str or int
The name of column to extract data from. If specified as an int n, it will extract data from the n-th column.
- return_numpy: bool, default True
Whether to return as numpy array.
- Returns:
- numpy array
If return_numpy is set to True.
- pandas Series
If return_numpy is set to False.
Examples
>>> import pandas as pd
>>> from GPSat.utils import get_col_values
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> col_values = get_col_values(df, 'A')
>>> print(col_values)
[1 2 3]
- GPSat.utils.get_config_from_sysargv(argv_num=1)
This function takes an optional argument
argv_num
(default value of1
) and attempts to read a JSON configuration file from the corresponding index insys.argv
.If the file extension is not
.json
, it prints a message indicating that the file is not a JSON file.If an error occurs while reading the file, it prints an error message.
This function could benefit from refactoring to use the
argparse
package instead of manually parsingsys.argv
.- Parameters:
- argv_num :int, default 1
The index in
sys.argv
to read the configuration file from.
- Returns:
- dict or None
The configuration data loaded from the JSON file, or
None
if an error occurred while reading the file.
- GPSat.utils.get_git_information()
This function retrieves information about the current state of a Git repository.
- Returns:
- dict
Contains the following keys:
"branch"
: the name of the current branch."remote"
: a list of strings representing the remote repositories and their URLs."commit"
: the hash of the current commit."details"
: a list of strings representing the details of the last commit (author, date, message)."modified"
(optional): a list of strings representing the files modified since the last commit.
Note
If the current branch cannot be determined, the function will attempt to retrieve it from the list of all branches.
If there are no remote repositories, the "remote" key will be an empty list.
If there are no modified files, the "modified" key will not be present in the output.
This function requires the Git command line tool to be installed and accessible from the command line.
- GPSat.utils.get_previous_oi_config(store_path, oi_config, table_name='oi_config', skip_valid_checks_on=None)
This function retrieves the previous configuration from an optimal interpolation (OI) results file (store_path).
If store_path exists, it is expected to contain a table called "oi_config" with the previous configurations stored as rows.
If store_path does not exist, the function creates the file and adds the current configuration (oi_config) as the first row in the "oi_config" table.
Each row in the "oi_config" table contains the columns 'idx' (index), 'datetime' and 'config'. The values in 'config' are the provided oi_config (dict) converted to str.
If the table (oi_config) already exists, the function will match the provided oi_config against the previous config values; if any match exactly, the largest matching config id will be returned. Otherwise (oi_config does not exactly match any previous config), the largest idx value will be incremented and returned.
does not exactly match any previous config) then the largest idx value will be increment and returned.- Parameters:
- store_path: str
The file path where the configurations are stored.
- oi_config: dict
Representing the current configuration for the OI system.
- table_name: str, default “oi_config”
The table where the configurations will be stored.
- skip_valid_checks_on: list of str or None, default None
If a list, the names of the configuration keys that should be skipped during validation checks. Note: validation checks are not done in this function.
- Returns:
- dict
Previous configuration as a dictionary.
- list
List of configuration keys to be skipped during validation checks.
- int
Configuration ID.
- GPSat.utils.get_weighted_values(df, ref_col, dist_to_col, val_cols, weight_function='gaussian', drop_weight_cols=True, **weight_kwargs)
Calculate the weighted values of specified columns in a DataFrame based on the distance between two other columns, using a specified weighting function. The current implementation supports a Gaussian weight based on the euclidean distance between the values in ref_col and dist_to_col.
- Parameters:
- dfpandas.DataFrame
The input DataFrame containing the reference column, distance-to column, and value columns.
- ref_collist of str or str
The name of the column(s) to use as reference points for calculating distances.
- dist_to_collist of str or str
The name of the column(s) to calculate distances to, from ref_col. They should align / correspond to the column(s) set by ref_col.
- val_colslist of str or str
The names of the column(s) for which the weighted values are calculated. Can be a single column name or a list of names.
- weight_functionstr, optional
The type of weighting function to use. Currently, only “gaussian” is implemented, which applies a Gaussian weighting (exp(-d^2)) based on the squared euclidean distance. The default is “gaussian”.
- drop_weight_cols: bool, optional, default True.
If False, the total weight and total weighted values are included in the output.
- **weight_kwargsdict
Additional keyword arguments for the weighting function. For the Gaussian weight, this includes: - lengthscale (float): The length scale to use in the Gaussian function. This parameter scales the distance before applying the Gaussian function and must be provided.
- Returns:
- pandas.DataFrame
A DataFrame containing the weighted values for each of the specified value columns. The output DataFrame has the reference column as the index and each of the specified value columns with their weighted values.
- Raises:
- AssertionError
If the shapes of the ref_col and dist_to_col do not match, or if the required lengthscale parameter for the Gaussian weighting function is not provided.
- NotImplementedError
If a weight_function other than “gaussian” is specified.
Notes
The function currently only implements Gaussian weighting. The Gaussian weight is calculated as exp(-d^2 / (2 * l^2)), where d is the squared euclidean distance between ref_col and dist_to_col, and l is the lengthscale.
This implementation assumes the input DataFrame does not contain NaN values in the reference or distance-to columns. Handling NaN values may require additional preprocessing or the use of fillna methods.
Examples
>>> import pandas as pd
>>> data = {
...     'ref_col': [0, 1, 0, 1],
...     'dist_to_col': [1, 2, 3, 4],
...     'value1': [10, 20, 30, 40],
...     'value2': [100, 200, 300, 400]
... }
>>> df = pd.DataFrame(data)
>>> weighted_df = get_weighted_values(df, 'ref_col', 'dist_to_col', ['value1', 'value2'], lengthscale=1.0)
>>> print(weighted_df)
- GPSat.utils.glue_local_predictions(preds_df: DataFrame, expert_locs_df: DataFrame, sigma: int | float | list = 3) DataFrame
Deprecated. Use
glue_local_predictions_1d
andglue_local_predictions_2d
instead.Glues overlapping predictions by taking a normalised Gaussian weighted average.
Warning: This method only deals with expert locations on a regular grid.
- Parameters:
- preds_df: pd.DataFrame
containing predictions generated from local expert OI. It should have the following columns:
pred_loc_x
(float): The x-coordinate of the prediction location.pred_loc_y
(float): The y-coordinate of the prediction location.f*
(float): The predictive mean at the location (pred_loc_x, pred_loc_y).f*_var
(float): The predictive variance at the location (pred_loc_x, pred_loc_y).
- expert_locs_df: pd.DataFrame
containing local expert locations used to perform optimal interpolation. It should have the following columns:
x
(float): The x-coordinate of the expert location.y
(float): The y-coordinate of the expert location.
- sigma: int, float, or list, default 3
The standard deviation of the Gaussian weighting in the x and y directions.
If a single value is provided, it is used for both directions.
If a list is provided, the first value is used for the x direction and the second value is used for the y direction. Defaults to 3.
- Returns:
- pd.DataFrame:
Dataframe consisting of glued predictions (mean and std). It has the following columns:
pred_loc_x
(float): The x-coordinate of the prediction location.pred_loc_y
(float): The y-coordinate of the prediction location.f*
(float): The glued predictive mean at the location (pred_loc_x
,pred_loc_y
).f*_std
(float): The glued predictive standard deviation at the location (pred_loc_x
,pred_loc_y
).
Notes
The function assumes that the expert locations are equally spaced in both the x and y directions.
The function uses the
scipy.stats.norm.pdf
function to compute the Gaussian weights.The function normalizes the weighted sums with the total weights at each location.
- GPSat.utils.grid_2d_flatten(x_range, y_range, grid_res=None, step_size=None, num_step=None, center=True)
Create a 2D grid of points defined by x and y ranges, with the option to specify the grid resolution, step size, or number of steps. The resulting grid is flattened and concatenated into a 2D array of (x,y) coordinates.
- Parameters:
- x_range: tuple or list of floats
Two values representing the minimum and maximum values of the x-axis range.
- y_range: tuple or list of floats
Two values representing the minimum and maximum values of the y-axis range.
- grid_res: float or None, default None
The grid resolution, i.e. the distance between adjacent grid points. If specified, this parameter takes precedence over step_size and num_step.
- step_size: float or None, default None
The step size between adjacent grid points. If specified, this parameter takes precedence over num_step.
- num_step: int or None, default None
The number of steps between the minimum and maximum values of the x and y ranges. This parameter is used only if grid_res and step_size are not specified (are None). Note: the number of steps includes the starting point, so from 0 to 1 is two steps.
- center: bool, default True
If True, the resulting grid points will be the centers of the grid cells. If False, the resulting grid points will be the edges of the grid cells.
- Returns:
- ndarray
A 2D array of (x,y) coordinates, where each row represents a single point in the grid.
- Raises:
- AssertionError
If grid_res, step_size and num_step are all unspecified. Must specify at least one.
Examples
>>> from GPSat.utils import grid_2d_flatten
>>> grid_2d_flatten(x_range=(0, 2), y_range=(0, 2), grid_res=1)
array([[0.5, 0.5],
       [1.5, 0.5],
       [0.5, 1.5],
       [1.5, 1.5]])
- GPSat.utils.guess_track_num(x, thresh, start_track=0)
- GPSat.utils.inverse_sigmoid(y, low=0, high=1)
- GPSat.utils.inverse_softplus(y, shift=0)
- GPSat.utils.json_load(file_path)
This function loads a JSON file from the specified file path and applies a nested dictionary literal evaluation (nested_dict_literal_eval) to convert any string keys in the format of ‘(…,…)’ to tuple keys.
The resulting dictionary is returned.
- Parameters:
- file_path: str
The path to the JSON file to be loaded.
- Returns:
- dict or list of dict
The loaded JSON file as a dictionary or list of dictionaries.
Examples
Assuming a JSON file named 'config.json' with the following contents:
{
    "key1": "value1",
    "('key2', 'key3')": "value2",
    "key4": {"('key5', 'key6')": "value3"}
}
the following code will load the file and convert the "('key2', 'key3')" and "('key5', 'key6')" string keys to tuple keys:
config = json_load('config.json')
print(config)
{'key1': 'value1', ('key2', 'key3'): 'value2', 'key4': {('key5', 'key6'): 'value3'}}
- GPSat.utils.json_serializable(d, max_len_df=100)
Converts a dictionary to a format that can be stored as JSON via the json.dumps() method.
- Parameters:
- d :dict
The dictionary to be converted.
- max_len_df: int, default 100
The maximum length of a Pandas DataFrame or Series that can be converted to a string representation. If the length of the DataFrame or Series is greater than this value, it will be stored as a string. Defaults to 100.
- Returns:
- dict
The converted dictionary.
- Raises:
- AssertionError: If the input is not a dictionary.
Notes
- If a key in the dictionary is a tuple, it will be converted to a string. To recover the original tuple, use nested_dict_literal_eval.
- If a value in the dictionary is a dictionary, the function will be called recursively to convert it.
- If a value in the dictionary is a NumPy array, it will be converted to a list.
- If a value in the dictionary is a Pandas DataFrame or Series, it will be converted to a dictionary and the function will be called recursively to convert it if its length is less than or equal to max_len_df. Otherwise, it will be stored as a string.
- If a value in the dictionary is not JSON serializable, it will be cast as a string.
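A small illustrative sketch of the conversions listed above (the exact string form used for tuple keys is an assumption here):
import json
import numpy as np
from GPSat.utils import json_serializable

d = {
    ("lon", "lat"): np.array([1.0, 2.0]),   # tuple key and a numpy array value
    "config": {"grid_res": 50000},          # nested dictionary, converted recursively
}
out = json_serializable(d)
# the tuple key becomes a string (e.g. "('lon', 'lat')") and the array becomes a list,
# so the result can now be written with json.dumps
json.dumps(out)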
- GPSat.utils.log_lines(*args, level='debug')
This function logs lines to a file with a specified logging level.
This function takes in any number of arguments and a logging level.
The function checks that the logging level is valid and then iterates through the arguments.
If an argument is a string, integer, float, dictionary, tuple, or list, it is printed and logged with the specified logging level.
If an argument is not one of these types, it is not logged and a message is printed indicating the argument’s type.
- Parameters:
- *args: tuple
arguments to be provided to logging using the method specified by level
- level: str, default "debug"
must be one of ["debug", "info", "warning", "error", "critical"]. Each argument provided is logged with getattr(logging, level)(arg).
- Returns:
- None
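A minimal usage sketch; configuring the logging module to write to a file (shown here with logging.basicConfig and a hypothetical file name) is assumed to have been done elsewhere:
import logging
from GPSat.utils import log_lines

logging.basicConfig(filename="run.log", level=logging.DEBUG)  # hypothetical log file

# each argument is logged via getattr(logging, level)(arg)
log_lines("starting binning", {"grid_res": 50000}, [2019, 2020], level="info")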
- GPSat.utils.match(x, y, exact=True, tol=1e-09)
This function takes two arrays, x and y, and returns an array of indices indicating where the elements of x match the elements of y. Can match exactly or within a specified tolerance.
- Parameters:
- x: array-like
The first array to be matched. If not an array, it will be converted via to_array.
- y: array-like
The second array to be matched against. If not an array, it will be converted via to_array.
- exact: bool, default=True.
If True, the function matches exactly. If False, the function matches within a specified tolerance.
- tol: float, optional, default=1e-9.
The tolerance used for matching when exact=False.
- Returns:
- indices: array
the indices of the matching elements in y for each element in x.
- Raises:
- AssertionError: if any element in x is not found in y or if multiple matches are found for any element in x.
Note
This function requires x and y to be arrays, or objects that can be converted via to_array. If exact=False, the function only makes sense for floats; use exact=True for int and str. If both x and y are large, with lengths n and m, this function can take up a lot of memory, as an intermediate boolean array of size n x m is created. If there are multiple matches of an element of x in y, the index of the first match is returned.
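A short illustrative example (expected results inferred from the description above):
import numpy as np
from GPSat.utils import match

x = np.array([3.0, 1.0])
y = np.array([1.0, 2.0, 3.0])

# indices of y matching each element of x; expected: array([2, 0]), so y[idx] == x
idx = match(x, y)

# approximate matching within a tolerance (exact=False only makes sense for floats)
idx_approx = match(np.array([1.0 + 1e-12]), y, exact=False, tol=1e-9)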
- GPSat.utils.move_to_archive(top_dir, file_names=None, suffix='', archive_sub_dir='Archive', verbose=False)
Moves specified files from a directory to an archive sub-directory within the same directory. Moved files will have a suffix added on before file extension.
- Parameters:
- top_dir: str
The path to the directory containing the files to be moved.
- file_names: list of str, default None
The names of the files to be moved. If not specified, all files in the directory will be moved.
- suffix: str, default ""
A string to be added to the end of the file name before the extension in the archive directory.
- archive_sub_dir: str, default 'Archive'
The name of the sub-directory within the top directory where the files will be moved.
- verbose: bool, default False
If True, prints information about the files being moved.
- Returns:
- None
The function only moves files and does not return anything.
Note
If the archive sub-directory does not exist, it will be created.
If a file with the same name as the destination file already exists in the archive sub-directory, it will be overwritten.
- Raises:
- AssertionError
If top_dir does not exist or file_names is not specified.
Examples
Move all files in a directory to the archive sub-directory:
>>> move_to_archive("path/to/directory")
Move specific files to the archive sub-directory, with a suffix added to the file names:
>>> move_to_archive("path/to/directory", file_names=["file1.txt", "file2.txt"], suffix="_backup")
Move specific files to a custom archive sub-directory:
>>> move_to_archive("path/to/directory", file_names=["file1.txt", "file2.txt"], archive_sub_dir="Old Files")
- GPSat.utils.nested_dict_literal_eval(d, verbose=False)
Converts a nested dictionary with string keys that represent tuples to a dictionary with tuple keys.
- Parameters:
- d: dict
The nested dictionary to be converted.
- verbose: bool, default False
If True, prints information about the keys being converted.
- Returns:
- dict
The converted dictionary with tuple keys.
- Raises:
- ValueError: If a string key cannot be evaluated as a tuple.
Note
This function modifies the original dictionary in place.
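A brief illustrative call (expected output inferred from the description above):
from GPSat.utils import nested_dict_literal_eval

d = {"('a', 'b')": 1, "other": {"('c', 'd')": 2}}
out = nested_dict_literal_eval(d)
# expected: {('a', 'b'): 1, 'other': {('c', 'd'): 2}}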
- GPSat.utils.nll(y, mu, sig, return_tot=True)
- GPSat.utils.not_nan(x)
- GPSat.utils.pandas_to_dict(x)
Converts a pandas Series or DataFrame (row) to a dictionary.
- Parameters:
- x: pd.Series, pd.DataFrame or dict
The input object to be converted to a dictionary.
- Returns:
- dict:
A dictionary representation of the input object.
- Raises:
- AssertionError: If the input object is a DataFrame with more than one row.
Warning
If the input object is not a pandas Series, DataFrame, or dictionary, a warning is issued and the input object is returned as is.
Examples
>>> import pandas as pd
>>> data = {'name': ['John', 'Jane'], 'age': [30, 25]}
>>> df = pd.DataFrame(data)
>>> pandas_to_dict(df)
AssertionError: in pandas_to_dict input provided as DataFrame, expected to only have 1 row, shape is: (2, 2)

>>> series = pd.Series(data['name'])
>>> pandas_to_dict(series)
{0: 'John', 1: 'Jane'}

>>> dictionary = {'name': ['John', 'Jane'], 'age': [30, 25]}
>>> pandas_to_dict(dictionary)
{'name': ['John', 'Jane'], 'age': [30, 25]}

Select a single row of the DataFrame:
>>> pandas_to_dict(df.iloc[[0]])
{'name': 'John', 'age': 30}
- GPSat.utils.pip_freeze_to_dataframe()
- GPSat.utils.pretty_print_class(x)
This function takes in a class object as input and returns a string representation of the class name without the leading “<class ‘” and trailing “’>”.
Alternatively, it will remove a leading '<__main__.' and remove ' object at ', including anything that follows.
The function achieves this by invoking the __str__ method of the class object and then using regular expressions to remove the unwanted characters.
- Parameters:
- x: an arbitrary class instance
- Returns:
- str
Examples
class MyClass:
    pass

print(pretty_print_class(MyClass))
- GPSat.utils.rmse(y, mu)
- GPSat.utils.sigmoid(x, low=0, high=1)
- GPSat.utils.softplus(x, shift=0)
- GPSat.utils.sparse_true_array(shape, grid_space=1, grid_space_offset=0)
Create a boolean numpy array with True values regularly spaced throughout, and False elsewhere.
- Parameters:
- shape: iterable (e.g. list or tuple)
representing the shape of the output array.
- grid_space: int, default 1
representing the spacing between True values.
- grid_space_offset: int, default 0
representing the offset of the first True value in each dimension.
- Returns:
- np.array
A boolean array with dimension equal to shape, with False everywhere except for Trues regularly spaced every ‘grid_space’. The fraction of True will be roughly equal to (1/n)^d where n = grid_space, d = len(shape).
Note
The first dimension is treated as the y dimension. The function allows grid_space_offset to be specified per dimension.
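An illustrative call (the exact placement of the first True value, i.e. whether it falls at index 0 when grid_space_offset=0, is an assumption here):
import numpy as np
from GPSat.utils import sparse_true_array

# for a (4, 4) array with grid_space=2, roughly (1/2)^2 = 25% of entries are True,
# regularly spaced every 2 positions in each dimension
mask = sparse_true_array(shape=(4, 4), grid_space=2)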
- GPSat.utils.stats_on_vals(vals, measure=None, name=None, qs=None)
This function calculates various statistics on a given array of values.
- Parameters:
- vals: array-like
The input array of values.
- measure: str or None, default is None
The name of the measure being calculated.
- name: str or None, default None
The name of the column in the output dataframe.
- qs: list or None, default None
A list of quantiles to calculate. If None, will use [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99].
- Returns:
- pd.DataFrame
containing the following statistics:
- measure: The name of the measure being calculated.
- size: The number of elements in the input array.
- num_not_nan: The number of non-NaN elements in the input array.
- num_inf: The number of infinite elements in the input array.
- min: The minimum value in the input array.
- mean: The mean value of the input array.
- max: The maximum value in the input array.
- std: The standard deviation of the input array.
- skew: The skewness of the input array.
- kurtosis: The kurtosis of the input array.
- qX: The Xth quantile of the input array, where X is a value in the qs parameter.
Note
The function also includes a timer decorator that calculates the time taken to execute the function.
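A simple usage sketch (the exact row labels used for the quantiles are an assumption here):
import numpy as np
from GPSat.utils import stats_on_vals

vals = np.random.default_rng(0).normal(size=1000)
stats = stats_on_vals(vals, measure="z", name="obs", qs=[0.05, 0.5, 0.95])
# stats is a DataFrame with column "obs" and rows such as
# measure, size, num_not_nan, num_inf, min, mean, max, std, skew, kurtosis,
# plus one row per requested quantile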
- GPSat.utils.to_array(*args, date_format='%Y-%m-%d')
Converts input arguments to numpy arrays.
- Parameters:
- *args: tuple
Input arguments to be converted to numpy arrays.
- date_format: str, optional
Date format to be used when converting datetime.date objects to numpy arrays.
- Returns:
- generator
A generator that yields numpy arrays.
Note
This function converts input arguments to numpy arrays:
- If the input argument is already a numpy array, it is yielded as is.
- If it is a list or tuple, it is converted to a numpy array and yielded.
- If it is an integer, float, string, boolean, or numpy boolean, it is converted to a numpy array and yielded.
- If it is a numpy integer or float, it is converted to a numpy array and yielded.
- If it is a datetime.date object, it is converted to a numpy array using the specified date format and yielded.
- If it is a numpy datetime64 object, it is yielded as is.
- If it is None, an empty numpy array is yielded.
- If it is of any other data type, a warning is issued and the argument is converted to a numpy array of type object and yielded.
Examples
>>> import datetime
>>> import numpy as np
>>> x = [1, 2, 3]

Since the function returns a generator, get values out with next:

>>> print(next(to_array(x)))
[1 2 3]

Or, for a single array-like object, assign with:

>>> c, = to_array(x)

>>> y = np.array([4, 5, 6])
>>> z = datetime.date(2021, 1, 1)
>>> for arr in to_array(x, y, z):
...     print(f"arr type: {type(arr)}, values: {arr}")
arr type: <class 'numpy.ndarray'>, values: [1 2 3]
arr type: <class 'numpy.ndarray'>, values: [4 5 6]
arr type: <class 'numpy.ndarray'>, values: ['2021-01-01']
- GPSat.utils.track_num_for_date(x)
GPSat.vff module
Code adapted from: https://github.com/st--/VFF
- class GPSat.vff.BlockDiagMat(A, B)
Bases:
object
- get()
- get_diag()
- inv()
- inv_diag()
- logdet()
- matmul(X)
- matmul_sqrt(X)
- matmul_sqrt_transpose(X)
- property shape
- solve(X)
- property sqrt_dims
- trace_KiX(X)
X is a square matrix of the same size as this one. If self is K, compute tr(K^{-1} X).
- class GPSat.vff.DiagMat(d)
Bases:
object
- get()
- get_diag()
- inv()
- inv_diag()
- logdet()
- matmul(B)
- matmul_sqrt(B)
- matmul_sqrt_transpose(B)
- property shape
- solve(B)
- property sqrt_dims
- trace_KiX(X)
X is a square matrix of the same size as this one. If self is K, compute tr(K^{-1} X).
- class GPSat.vff.GPR_kron(data, ms, a, b, kernel_list)
Bases:
GPModel
,InternalDataTrainingLossMixin
- elbo()
- maximum_log_likelihood_objective()
Objective for maximum likelihood estimation. Should be maximized. E.g. log-marginal likelihood (hyperparameter likelihood) for GPR, or lower bound to the log-marginal likelihood (ELBO) for sparse and variational GPs.
- Returns:
return has shape [].
- predict_f(Xnew, full_cov=False, full_output_cov=False)
Compute the mean and variance of the posterior latent function(s) at the input points.
Given $x_i$, this computes $f_i$, for:
\begin{align}
\theta & \sim p(\theta) \\
f & \sim \mathcal{GP}(m(x), k(x, x'; \theta)) \\
f_i & = f(x_i)
\end{align}
For an example of how to use predict_f, see ../../../../notebooks/getting_started/basic_usage.
- Parameters:
Xnew –
Xnew has shape [batch…, N, D].
Input locations at which to compute mean and variance.
full_cov – If True, compute the full covariance between the inputs. If False, only return the point-wise variance.
full_output_cov – If True, compute the full covariance between the outputs. If False, assume outputs are independent.
- Returns:
return[0] has shape [batch…, N, P].
return[1] has shape [batch…, N, P, N, P] if full_cov and full_output_cov.
return[1] has shape [batch…, N, P, P] if (not full_cov) and full_output_cov.
return[1] has shape [batch…, N, P] if (not full_cov) and (not full_output_cov).
return[1] has shape [batch…, P, N, N] if full_cov and (not full_output_cov).
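A minimal usage sketch, assuming m is an already-constructed GPR_kron model (its constructor arguments are not shown here) and P output dimensions:
import numpy as np

Xnew = np.random.default_rng(0).uniform(-1.0, 1.0, size=(10, 2))  # [N=10, D=2]

# point-wise predictive mean and variance: both have shape [10, P]
mean, var = m.predict_f(Xnew, full_cov=False)

# full covariance between the inputs: cov has shape [P, 10, 10]
mean, cov = m.predict_f(Xnew, full_cov=True)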
- class GPSat.vff.LowRankMat(d, W)
Bases:
object
- get()
- get_diag()
- inv()
- inv_diag()
- logdet()
- matmul(B)
- matmul_sqrt(B)
- There’s a non-square sqrt of this matrix given by
[ D^{1/2} ]
[ W^T ]
This method right-multiplies the sqrt by the matrix B.
- matmul_sqrt_transpose(B)
- There’s a non-square sqrt of this matrix given by
[ D^{1/2} ]
[ W^T ]
This method right-multiplies the transposed sqrt by the matrix B.
- property shape
- solve(B)
- property sqrt_dims
- trace_KiX(X)
X is a square matrix of the same size as this one. If self is K, compute tr(K^{-1} X).
- class GPSat.vff.Rank1Mat(d, v)
Bases:
object
- get()
- get_diag()
- inv()
- inv_diag()
- logdet()
- matmul(B)
- matmul_sqrt(B)
- There’s a non-square sqrt of this matrix given by
[ D^{1/2} ]
[ V^T ]
This method right-multiplies the sqrt by the matrix B.
- matmul_sqrt_transpose(B)
- There’s a non-square sqrt of this matrix given by
[ D^{1/2} ]
[ V^T ]
This method right-multiplies the transposed sqrt by the matrix B.
- property shape
- solve(B)
- property sqrt_dims
- trace_KiX(X)
X is a square matrix of the same size as this one. If self is K, compute tr(K^{-1} X).
- GPSat.vff.kron(K)
- GPSat.vff.kron_two(A, B)
Compute the Kronecker product of two TensorFlow tensors.
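For reference, the Kronecker product of A (m x n) and B (p x q) is the (mp x nq) block matrix whose (i, j) block is A[i, j] * B. A small TensorFlow sketch (illustrative only, not necessarily the implementation used here):
import tensorflow as tf

def kron_two_sketch(A, B):
    # broadcast to blocks A[i, j] * B, then reshape into an (m*p) x (n*q) matrix
    m, n = A.shape
    p, q = B.shape
    blocks = A[:, None, :, None] * B[None, :, None, :]
    return tf.reshape(blocks, (m * p, n * q))

A = tf.constant([[1.0, 2.0], [3.0, 4.0]])
B = tf.eye(2)
kron_two_sketch(A, B)  # matches np.kron applied to the corresponding numpy arrays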
- GPSat.vff.make_Kuf(k, X, a, b, ms)
- GPSat.vff.make_Kuf_np(X, a, b, ms)
- GPSat.vff.make_Kuu(kern, a, b, ms)
Make a representation of the Kuu matrices.
- GPSat.vff.make_kvs(k)
Compute the Kronecker-vector stack of the list of matrices k.
- GPSat.vff.make_kvs_np(A_list)
- GPSat.vff.make_kvs_two(A, B)
Compute the Kronecker-vector stack of the matrices A and B.
- GPSat.vff.make_kvs_two_np(A, B)
Module contents
Add package docstring here
- GPSat.get_config_path(*sub_dir)
- GPSat.get_data_path(*sub_dir)
- GPSat.get_parent_path(*sub_dir)
- GPSat.get_path(*sub_dir)
get_path to package