Dataloader

The GPSat DataLoader class provides utility methods for loading and manipulating data.

class GPSat.dataloader.DataLoader(hdf_store=None, dataset=None)

Bases: object

static add_cols(df, col_func_dict=None, filename=None, verbose=False)

Adds new columns to a given DataFrame based on the provided dictionary of column-function pairs.

This function allows the user to add new columns to a DataFrame using a dictionary that maps new column names to functions that compute the column values. The functions can be provided as values in the dictionary, and the new columns can be added to the DataFrame in a single call to this function.

If a tuple is provided as a key in the dictionary, it is assumed that the corresponding function will return multiple columns. The length of the returned columns should match the length of the tuple.

Parameters:
df: pandas.DataFrame

The input DataFrame to which new columns will be added.

col_func_dict: dict, optional

A dictionary that maps new column names (keys) to functions (values) that compute the column values. If a tuple is provided as a key, it is assumed that the corresponding function will return multiple columns. The length of the returned columns should match the length of the tuple. If None, an empty dictionary will be used. Default is None.

filename: str, optional

The name of the file from which the DataFrame was read. This parameter will be passed to the functions provided in the col_func_dict. Default is None.

verbose: int or bool, optional

Determines the level of verbosity of the function. If verbose is 3 or higher, the function will print messages about the columns being added. Default is False.

Returns:
None
Raises:
AssertionError

If the length of the new columns returned by the function does not match the length of the tuple key in the col_func_dict.

Notes

DataFrame is manipulated inplace. If a single value is returned by the function, it will be assigned to a column with the name specified in the key. See help(utils.config_func) for more details.

Examples

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> add_one = lambda x: x + 1
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> DataLoader.add_cols(df, col_func_dict={
...     'C': {'func': add_one, "col_args": "A"}
...     })
>>> print(df)
   A  B  C
0  1  4  2
1  2  5  3
2  3  6  4
static add_data_to_col(df, add_data_to_col=None, verbose=False)

Adds new data to an existing column or creates a new column with the provided data in a DataFrame.

This function takes a DataFrame and a dictionary with the column name as the key and the data to be added as the value. It can handle scalar values or lists of values, and will replicate the DataFrame rows for each value in the list.

Parameters:
df: pandas.DataFrame

The input DataFrame to which data will be added or updated.

add_data_to_col: dict, optional

A dictionary with the column name (key) and data to be added (value). The data can be a scalar value or a list of values. If a list of values is provided, the DataFrame rows will be replicated for each value in the list. If None, an empty dictionary will be used. Default is None.

verbose: bool, default False

If True, the function will print progress messages.

Returns:
df: pandas.DataFrame

The DataFrame with the updated or added columns.

Raises:
AssertionError

If the add_data_to_col parameter is not a dictionary.

Notes

This method adds data to a specified column in a pandas DataFrame repeatedly. The method creates a copy of the DataFrame for each entry in the data to be added, and concatenates them to create a new DataFrame with the added data.

Examples

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> updated_df = DataLoader.add_data_to_col(df, add_data_to_col={"C": [7, 8]})
>>> print(updated_df)
   A  B  C
0  1  4  7
1  2  5  7
2  3  6  7
0  1  4  8
1  2  5  8
2  3  6  8
>>> len(df)
3
>>> out = DataLoader.add_data_to_col(df, add_data_to_col={"a": [1,2,3,4]})
>>> len(out)
12
>>> out = DataLoader.add_data_to_col(df, add_data_to_col={"a": [1,2,3,4], "b": [5,6,7,8]})
>>> len(out)
48
static bin_data(df, x_range=None, y_range=None, grid_res=None, x_col='x', y_col='y', val_col=None, bin_statistic='mean', return_bin_center=True)

Bins data from a given DataFrame into a 2D grid, applying the specified statistical function to the data in each bin.

This function takes a DataFrame containing x, y, and value columns and bins the data into a 2D grid. It returns the resulting grid, as well as the x and y bin edges or centers, depending on the value of return_bin_center.

Parameters:
df: pd.DataFrame

The input DataFrame containing the data to be binned.

x_range: list or tuple of floats, optional

The range of x values, specified as [min, max]. If not provided, a default value of [-4500000.0, 4500000.0] will be used.

y_range: list or tuple of floats, optional

The range of y values, specified as [min, max]. If not provided, a default value of [-4500000.0, 4500000.0] will be used.

grid_res: float or None

The grid resolution, expressed in kilometers. This parameter must be provided.

x_col: str, default is “x”.

The name of the column in the DataFrame containing the x values.

y_col: str, default is “y”.

The name of the column in the DataFrame containing the y values.

val_col: str, optional

The name of the column in the DataFrame containing the values to be binned. This parameter must be provided.

bin_statistic: str, default is “mean”.

The statistic to apply to the binned data. Options are 'mean', 'median', 'count', 'sum', 'min', 'max', or a custom callable function.

return_bin_center: bool, default is True.

If True, the function will return the bin centers instead of the bin edges.

Returns:
binned_data: numpy.ndarray

The binned data as a 2D grid.

x_out: numpy.ndarray

The x bin edges or centers, depending on the value of return_bin_center.

y_out: numpy.ndarray

The y bin edges or centers, depending on the value of return_bin_center.
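
Examples

A minimal, illustrative sketch of calling bin_data; the synthetic x, y and obs columns and the 50 km resolution below are assumptions made for this example, not package defaults.

>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> rng = np.random.default_rng(0)
>>> df = pd.DataFrame({"x": rng.uniform(-4.5e6, 4.5e6, 1000),
...                    "y": rng.uniform(-4.5e6, 4.5e6, 1000),
...                    "obs": rng.normal(size=1000)})
>>> binned, x_out, y_out = DataLoader.bin_data(df,
...                                            grid_res=50,           # resolution in km (required)
...                                            val_col="obs",         # column to bin (required)
...                                            bin_statistic="mean",
...                                            return_bin_center=True)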

classmethod bin_data_by(df, by_cols=None, val_col=None, x_col='x', y_col='y', x_range=None, y_range=None, grid_res=None, bin_statistic='mean', limit=10000)

Bins the input DataFrame df based on the given columns and computes the bin statistics for a specified value column.

This function takes a DataFrame, filters it based on the unique combinations of the by_cols column values, and then bins the data in each filtered DataFrame based on the x_col and y_col column values. It computes the bin statistic for the specified val_col and returns the result as an xarray DataArray. The output DataArray has dimensions "y", "x", and the given by_cols.

Parameters:
df: pandas.DataFrame

The input DataFrame to be binned.

by_cols: str or list[str] or tuple[str]

The column(s) by which the input DataFrame should be filtered. Unique combinations of these columns are used to create separate DataFrames for binning.

val_col: str

The column in the input DataFrame for which the bin statistics should be computed.

x_col: str, optional, default=’x’

The column in the input DataFrame to be used for binning along the x-axis.

y_col: str, optional, default=’y’

The column in the input DataFrame to be used for binning along the y-axis.

x_range: tuple, optional

The range of the x-axis values for binning. If None, the minimum and maximum x values are used.

y_range: tuple, optional

The range of the y-axis values for binning. If None, the minimum and maximum y values are used.

grid_res: float, optional

The resolution of the grid used for binning. If None, the resolution is calculated based on the input data.

bin_statistic: str, optional, default=”mean”

The statistic to compute for each bin. Supported values are "mean", "median", "sum", "min", "max", and "count".

limit: int, optional, default=10000

The maximum number of unique combinations of the by_cols column values allowed. Raises an AssertionError if the number of unique combinations exceeds this limit.

Returns:
out: xarray.Dataset

The binned data as an xarray Dataset with dimensions 'y', 'x', and the given by_cols.

Raises:
DeprecationWarning

If the deprecated method DataLoader.bin_data_by(...) is used instead of DataPrep.bin_data_by(...).

AssertionError

If any of the input parameters do not meet the specified conditions.

classmethod data_select(obj, where=None, combine_where='AND', table=None, return_df=True, reset_index=False, drop=True, copy=True, columns=None, close=False, **kwargs)

Selects data from an input object (pd.DataFrame, pd.HDFStore, xr.DataArray or xr.Dataset) based on filtering conditions.

This function filters data from various types of input objects based on the provided conditions specified in the 'where' parameter. It also supports selecting specific columns, resetting the index, and returning the output as a DataFrame.

Parameters:
obj: pd.DataFrame, pd.Series, dict, pd.HDFStore, xr.DataArray, or xr.Dataset

The input object from which data will be selected. If dict, it will try to convert it to pandas.DataFrame.

where: dict, list of dict or None, default None

Filtering conditions to be applied to the input object. It can be a single dictionary or a list of dictionaries. Each dictionary should have keys: "col", "comp", "val". e.g.

where = {"col": "t", "comp": "<=", "val": 4}

The "col" value specifies the column, "comp" specifies the comparison to be performed (>, >=, ==, !=, <=, <) and “val” is the value to be compared against. If None, then selects all data. Specifying 'where' parameter can avoid reading all data in from filesystem when obj is pandas.HDFStore or xarray.Dataset.

combine_where: str, default ‘AND’

How should where conditions, if there are multiple, be combined? Valid values are ["AND", "OR"], not case-sensitive.

table: str, default None

The table name to select from when using an HDFStore object. If obj is pandas.HDFStore then table must be supplied.

return_df: bool, default True

If True, the output will be returned as a pandas.DataFrame.

reset_index: bool, default False

If True, the index of the output DataFrame will be reset.

drop: bool, default True

If True, the output will have the filtered-out values removed. Applicable only for xarray objects. Default is True.

copy: bool, default True

If True, the output will be a copy of the selected data. Applicable only for DataFrame objects.

columns: list or None, default None

A list of column names to be selected from the input object. If None, selects all columns.

close: bool, default False

If True, and obj is pandas.HDFStore it will be closed after selecting data.

kwargs: any

Additional keyword arguments to be passed to the obj.select method when using an HDFStore object.

Returns:
out: pandas.DataFrame, pandas.Series, or xarray.DataArray

The filtered data as a pd.DataFrame, pd.Series, or xr.DataArray, based on the input object type and return_df parameter.

Raises:
AssertionError

If the table parameter is not provided when using an HDFStore object.

AssertionError

If the provided columns are not found in the input object when using a DataFrame object.

Examples

>>> import pandas as pd
>>> import xarray as xr
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> # Select data from a DataFrame with a filtering condition
>>> selected_df = DataLoader.data_select(df, where={"col": "A", "comp": ">=", "val": 2})
>>> print(selected_df)
   A  B
1  2  5
2  3  6
static get_attribute_from_table(source, table, attribute_name)

Retrieve an attribute from a specific table in a HDF5 file or HDFStore.

This function can handle both cases when the source is a filepath string to a HDF5 file or a pandas HDFStore object. The function opens the source (if it’s a filepath), then attempts to retrieve the specified attribute from the specified table within the source. If the retrieval fails for any reason, a warning is issued and None is returned.

Parameters:
source: str or pandas.HDFStore

The source from where to retrieve the attribute. If it’s a string, it is treated as a filepath to a HDF5 file. If it’s a pandas HDFStore object, the function operates directly on it.

table: str

The name of the table within the source from where to retrieve the attribute.

attribute_name: str

The name of the attribute to retrieve.

Returns:
attribute: object

The attribute retrieved from the specified table in the source. If the attribute could not be retrieved, None is returned.

Raises:
NotImplementedError

If the type of the source is neither a string nor a pandas.HDFStore.
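
Examples

An illustrative sketch; the file path, table name and attribute name below are placeholders.

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> # from a file path (the HDF5 file is opened and closed internally)
>>> attr = DataLoader.get_attribute_from_table(source="path/to/data.h5",
...                                            table="data",
...                                            attribute_name="config")
>>> # or from an already open HDFStore
>>> with pd.HDFStore("path/to/data.h5", mode="r") as store:
...     attr = DataLoader.get_attribute_from_table(store, "data", "config")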

static get_masks_for_expert_loc(ref_data, el_masks=None, obs_col=None)

Generate a list of masks based on given local experts locations (el_masks) and a reference data (ref_data).

This function can generate masks in two ways:
  1. If an element of el_masks is the string “had_obs”, a mask is created based on the obs_col of the reference data where any non-NaN value is present.

  2. If an element of el_masks is a dictionary with a “grid_space” key, a regularly spaced mask is created based on the dimensions specified and the grid_space value.

The reference data is expected to be an xarray DataArray or xarray Dataset. Support for pandas DataFrame may be added in future.

Parameters:
ref_data: xarray.DataArray or xarray.Dataset

The reference data to use when generating the masks. The data should have coordinates that match the dimensions specified in the el_masks dictionary, if provided.

el_masks: list of str or dict, optional

A list of instructions for generating the masks. Each element in the list can either be a string or a dictionary. If a string, it should be “had_obs”, which indicates a mask should be created where any non-NaN value is present in the obs_col of the ref_data. If a dictionary, it should have a “grid_space” key indicating the regular spacing to be used when creating a mask and ‘dims’ key specifying dimensions in the reference data to be considered. By default, it is None, which indicates no mask is to be generated.

obs_col: str, optional

The column in the reference data to use when generating a mask based on “had_obs” instruction. This parameter is ignored if “had_obs” is not present in el_masks.

Returns:
list of xarray.DataArray

A list of masks generated based on the el_masks instructions. Each mask is an xarray DataArray with the same coordinates as the ref_data. Each value in the mask is a boolean indicating whether a local expert should be located at that point.

Raises:
AssertionError

If ref_data is not an instance of xarray.DataArray or xarray.Dataset, or if “grid_space” is in el_masks but the corresponding dimensions specified in the ‘dims’ key do not exist in ref_data.

Notes

The function could be extended to read data from file system and allow different reference data.

Future extensions could also include support for el_masks to be only a list of dict, and for the reference data to be a pandas DataFrame.
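
Examples

A sketch of the two documented mask types, assuming a small synthetic reference Dataset; the exact structure of the “grid_space” dictionary used here (with a ‘dims’ list) is an assumption based on the parameter description above.

>>> import numpy as np
>>> import xarray as xr
>>> from GPSat.dataloader import DataLoader
>>> ref_data = xr.Dataset({"obs": (("y", "x"), np.random.randn(10, 10))},
...                       coords={"y": np.arange(10), "x": np.arange(10)})
>>> masks = DataLoader.get_masks_for_expert_loc(ref_data,
...                                             el_masks=["had_obs",
...                                                       {"grid_space": 2, "dims": ["x", "y"]}],
...                                             obs_col="obs")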

static get_run_info(script_path=None)

Retrieves information about the current Python script execution environment, including run time, Python executable path, and Git information.

This function collects information about the current script execution environment, such as the date and time when the script is executed, the path of the Python interpreter, the script’s file path, and Git information (if available).

Parameters:
script_path: str, default None

The file path of the currently executed script. If None, it will try to retrieve the file path automatically.

Returns:
run_info: dict

A dictionary containing the following keys:

  • "run_time": The date and time when the script was executed, formatted as "YYYY-MM-DD HH:MM:SS".

  • "python_executable": The path of the Python interpreter.

  • "script_path": The absolute file path of the script (if available).

  • Git-related keys: "git_branch", "git_commit", "git_url", and "git_modified" (if available).

Examples

>>> from GPSat.dataloader import DataLoader
>>> run_info = DataLoader.get_run_info()
>>> print(run_info)
{
    "run_time": "2023-04-28 10:30:00",
    "python_executable": "/usr/local/bin/python3.9",
    "script_path": "/path/to/your/script.py",
    "branch": "main",
    "commit": "123abc",
    "remote": ["https://github.com/user/repo.git" (fetch),"https://github.com/user/repo.git" (push)]
    "details": ['commit 123abc',
      'Author: UserName <username42@gmail.com>',
      'Date:   Fri Apr 28 07:22:31 2023 +0100',
      ':bug: fix ']
    "modified" : ['list_of_files.py', 'modified_since.py', 'last_commit.py']
}
static get_where_list(global_select, local_select=None, ref_loc=None)

Generate a list of selection criteria for data filtering based on global and local conditions, as well as reference location.

The function accepts a list of global select conditions, and optional local select conditions and reference location. Each condition in global select can either be ‘static’ (with keys ‘col’, ‘comp’, and ‘val’) or ‘dynamic’ (requiring local select and reference location and having keys ‘loc_col’, ‘src_col’, ‘func’). The function evaluates each global select condition and constructs a corresponding selection dictionary.

Parameters:
global_select: list of dict

A list of dictionaries defining global selection conditions. Each dictionary can be either ‘static’ or ‘dynamic’. ‘Static’ dictionaries should contain the keys ‘col’, ‘comp’, and ‘val’ which define a column, a comparison operator, and a value respectively. ‘Dynamic’ dictionaries should contain the keys ‘loc_col’, ‘src_col’, and ‘func’ which define a location column, a source column, and a function respectively.

local_select: list of dict, optional

A list of dictionaries defining local selection conditions. Each dictionary should contain keys ‘col’, ‘comp’, and ‘val’ defining a column, a comparison operator, and a value respectively. This parameter is required if any ‘dynamic’ condition is present in global_select.

ref_loc: pandas DataFrame, optional

A reference location as a pandas DataFrame. This parameter is required if any ‘dynamic’ condition is present in global_select.

Returns:
list of dict

A list of dictionaries each representing a selection condition to be applied on data. Each dictionary contains keys ‘col’, ‘comp’, and ‘val’ defining a column, a comparison operator, and a value respectively.

Raises:
AssertionError

If a ‘dynamic’ condition is present in global_select but local_select or ref_loc is not provided, or if the required keys are not present in the ‘dynamic’ condition, or if the location column specified in a ‘dynamic’ condition is not present in ref_loc.
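
Examples

A sketch using only ‘static’ conditions (the column names and values are illustrative); ‘dynamic’ conditions additionally require local_select and ref_loc, as described above.

>>> from GPSat.dataloader import DataLoader
>>> global_select = [{"col": "lat", "comp": ">=", "val": 60.0},
...                  {"col": "datetime", "comp": "<=", "val": "2020-03-31"}]
>>> where_list = DataLoader.get_where_list(global_select)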

static get_where_list_legacy(read_in_by=None, where=None)

Generate a list (of lists) of where conditions that can be consumed by pd.HDFStore(...).select.

Parameters:
read_in_by: dict of dict or None

Sub-dictionary must contain the keys "values", "how".

where: str or None

Used if read_in_by is not provided.

Returns:
list of list

Containing string where conditions.

classmethod hdf_tables_in_store(store=None, path=None)

Retrieve the list of tables available in an HDFStore.

This class method allows the user to get the names of all tables stored in a given HDFStore. It accepts either an already open HDFStore object or a path to an HDF5 file. If a path is provided, the method will open the HDFStore in read-only mode, retrieve the table names, and then close the store.

Parameters:
store: pd.io.pytables.HDFStore, optional

An open HDFStore object. If this parameter is provided, path should not be specified.

path: str, optional

The file path to an HDF5 file. If this parameter is provided, store should not be specified. The method opens the HDFStore at this path in read-only mode to retrieve the table names.

Returns:
list of str

A list containing the names of all tables in the HDFStore.

Raises:
AssertionError

If both store and path are None, or if the store provided is not an instance of pd.io.pytables.HDFStore.

Notes

The method ensures that only one of store or path is provided. If path is specified, the HDFStore is opened in read-only mode and closed after retrieving the table names.

Examples

>>> DataLoader.hdf_tables_in_store(store=my_store)
['/table1', '/table2']
>>> DataLoader.hdf_tables_in_store(path='path/to/hdf5_file.h5')
['/table1', '/table2', '/table3']
static is_list_of_dict(lst)

Checks if the given input is a list of dictionaries.

This utility function tests if the input is a list where all elements are instances of the dict type.

Parameters:
lst: list

The input list to be checked for containing only dictionaries.

Returns:
bool

True if the input is a list of dictionaries, False otherwise.

Examples

>>> from GPSat.dataloader import DataLoader
>>> DataLoader.is_list_of_dict([{"col": "t", "comp": "==", "val": 1}])
True
>>> DataLoader.is_list_of_dict([{"a": 1, "b": 2}, {"c": 3, "d": 4}])
True
>>> DataLoader.is_list_of_dict([1, 2, 3])
False
>>> DataLoader.is_list_of_dict("not a list")
False
static kdt_tree_list_for_local_select(df, local_select)

Pre-calculates a list of KDTree objects for selecting points within a radius based on the local_select input.

Given a DataFrame and a list of local selection criteria, this function builds a list of KDTree objects that can be used for spatially selecting points within specified radii.

Parameters:
df: pd.DataFrame

The input DataFrame containing the data to be used for KDTree construction.

local_select: list of dict

A list of dictionaries containing the selection criteria for each local select. Each dictionary should have the following keys:

  • "col": The name of the column(s) used for spatial selection. Can be a single string or a list of strings.

  • "comp": The comparison operator, either "<" or "<=". Currently, only less than comparisons are supported for multi-dimensional values.

Returns:
out: list

A list of KDTree objects or None values, where each element corresponds to an entry in the local_select input. If an entry in local_select has a single string for "col", the corresponding output element will be None. Otherwise, the output element will be a KDTree object built from the specified columns.

Examples

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> local_select = [{"col": ["x", "y"], "comp": "<"}]
>>> kdt_trees = DataLoader.kdt_tree_list_for_local_select(df, local_select)
>>> print(kdt_trees)
classmethod load(source, where=None, engine=None, table=None, source_kwargs=None, col_funcs=None, row_select=None, col_select=None, reset_index=False, add_data_to_col=None, close=False, verbose=False, combine_row_select='AND', **kwargs)

Load data from various sources and (optionally) apply selection of columns/rows and add/modify columns.

Parameters:
source: str, pd.DataFrame, pd.Series, pd.HDFStore, or xr.Dataset

If a str is provided, it is treated as a file path: the data will be read in (see engine) and converted to one of the other types.

where: dict or list of dict, default None

Used when querying pd.HDFStore, xr.Dataset, xr.DataArray. Specified as a list of one or more dictionaries, each containing the keys:

  • "col": refers to a column (or variable for xarray objects.

  • "comp": is the type of comparison to apply e.g. "==", "!=", ">=", ">", "<=", "<".

  • "val": value to be compared with.

e.g.

where = [{"col": "A", "comp": ">=", "val": 0}]

will select entries where the column "A" is greater than 0.

Note: Think of this as a database query, with the where used to read data from the file system into memory.

engine: str or None, default None

Specify the type of ‘engine’ to use to read in data. If not supplied, it will be inferred by source if source is string. Valid values: "HDFStore", "netcdf4", "scipy", "pydap", "h5netcdf", "pynio", "cfgrib", "pseudonetcdf", "zarr" or any of Pandas "read_*".

table: str or None, default None

Used only if source is pd.HDFStore (or is converted to one) and is required if so. Should be a valid table (i.e. key) in HDFStore.

source_kwargs: dict or None, default None

Additional keyword arguments to pass to the data source reading functions, depending on engine, e.g. keyword arguments for pandas.read_csv() if engine="read_csv".

col_funcs: dict or None, default None

If dict, it will be provided to add_cols method to add or modify columns.

row_select: dict, list of dict, or None, default None

Used to select a subset of data after data is initially read into memory. Can be the same type of input as where i.e.

row_select = {"col": "A", "comp": ">=", "val": 0}

or use col_funcs that return bool array

e.g.

row_select = {"func": "lambda x: ~np.isnan(x)", "col_args": 1}

see help(utils.config_func) for more details.

col_select: list of str or None, default None

If specified as a list of strings, only that subset of columns will be returned; all values must be valid column names. If None, all columns will be returned.

filename: str or None, default None

Used by add_cols method.

reset_index: bool, default False

Apply reset_index(inplace=True) before returning?

add_data_to_col: dict or None, default None

Add new column(s) to the DataFrame. See the add_data_to_col argument of the add_data_to_col method.

close: bool, default False

See DataLoader.data_select for details

verbose: bool, default False

Set verbosity.

kwargs:

Additional arguments to be provided to data_select method

Returns:
pd.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df = DataLoader.load(source = df,
...                      where = {"col": "A", "comp": ">=", "val": 2})
>>> print(df.head())
   A  B
0  2  5
1  3  6

If the data is stored in a file, we can extract it as follows (here, we assume the data is saved in “path/to/data.h5” under the table “data”):

>>> df = DataLoader.load(source = "path/to/data.h5",
...                      table = "data")
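
As a further illustrative sketch, row_select and col_funcs can be combined in a single call; the lambda functions and column names used here are assumptions for the example only:

>>> df = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [4, 5, 6]})
>>> out = DataLoader.load(source=df,
...                       # keep rows where column "A" is not NaN
...                       row_select={"func": "lambda x: ~np.isnan(x)", "col_args": "A"},
...                       # add a new column "C" derived from column "B"
...                       col_funcs={"C": {"func": "lambda x: x + 1", "col_args": "B"}})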
classmethod local_data_select(df, reference_location, local_select, kdtree=None, verbose=True)

Selects data from a DataFrame based on a given criteria and reference (expert) location.

This method applies local selection criteria to a DataFrame, allowing for flexible, column-wise data selection based on comparison operations. For multi (dimensional) column selections, a KDTree can be used for efficiency.

Parameters:
df: pd.DataFrame

The DataFrame from which data will be selected.

reference_location: dict or pd.DataFrame

Reference location used for comparisons. If DataFrame is provided, it will be converted to dict.

local_select: list of dict

List of dictionaries containing the selection criteria for each local select. Each dictionary must contain keys ‘col’, ‘comp’, and ‘val’. ‘col’ is the column in ‘df’ to apply the comparison on, ‘comp’ is the comparison operator as a string (can be ‘>=’, ‘>’, ‘==’, ‘<’, ‘<=’), and ‘val’ is the value to compare with.

kdtree: KDTree or list of KDTree, optional

Precomputed KDTree or list of KDTrees for optimization. Each KDTree in the list corresponds to an entry in local_select. If not provided, a new KDTree will be created.

verbose: bool, default=True

If True, print details for each selection criteria.

Returns:
pd.DataFrame

A DataFrame containing only the data that meets all of the selection criteria.

Raises:
AssertionError

If ‘col’ is not in ‘df’ or ‘reference_location’, if the comparison operator in ‘local_select’ is not valid, or if the provided ‘kdtree’ is not of type KDTree.

Notes

If ‘col’ is a string, a simple comparison is performed. If ‘col’ is a list of strings, a KDTree-based selection is performed where each dimension is a column from ‘df’. For multi-dimensional comparisons, only less than comparisons are currently handled.

If ‘kdtree’ is provided and is a list, it must be of the same length as ‘local_select’ with each element corresponding to the same index in ‘local_select’.
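
Examples

An illustrative sketch of the call structure; the columns, reference location and threshold values below are assumptions for the example, with the selection semantics as described in the Parameters and Notes above.

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"x": [0.0, 1.0, 5.0], "y": [0.0, 1.0, 5.0], "t": [1, 2, 3]})
>>> reference_location = {"x": 0.0, "y": 0.0, "t": 2}
>>> local_select = [{"col": "t", "comp": "<=", "val": 1},           # single-column comparison
...                 {"col": ["x", "y"], "comp": "<", "val": 3.0}]   # multi-column (KDTree) selection
>>> selected = DataLoader.local_data_select(df, reference_location, local_select, verbose=False)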

static make_multiindex_df(idx_dict, **kwargs)

Create a multi-indexed DataFrame from the provided index dictionary for each keyword argument supplied.

This function creates a multi-indexed DataFrame, with each row having the same multi-index value. The index dictionary serves as the levels and labels for the multi-index, while the keyword arguments provide the data.

Parameters:
idx_dict: dict or pd.Series

A dictionary or pandas Series containing the levels and labels for the multi-index.

**kwargs: dict

Keyword arguments specifying the data and column names for the resulting DataFrame. The data can be of various types: int, float, bool, np.ndarray, pd.DataFrame, dict, or tuple. This data will be transformed into a DataFrame, where the multi-index will be added.

Returns:
dict

A dictionary containing the multi-indexed DataFrames with keys corresponding to the keys of provided keyword arguments.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> idx_dict = {"year": 2020, "month": 1}
>>> data = pd.DataFrame({"x": np.arange(10)})
>>> df = pd.DataFrame({"y": np.arange(3)})
>>> DataLoader.make_multiindex_df(idx_dict, data=data, df=df)
{'data': <pandas.DataFrame (multiindexed) with shape (3, 4)>}
static mindex_df_to_mindex_dataarray(df, data_name, dim_cols=None, infer_dim_cols=True, index_name='index')

Converts a multi-index DataFrame to a multi-index DataArray.

The method facilitates a transition from pandas DataFrame representation to the Xarray DataArray format, while preserving multi-index structure. This can be useful for higher-dimensional indexing, labeling, and performing mathematical operations on the data.

Parameters:
df: pd.DataFrame

The input DataFrame with a multi-index to be converted to a DataArray.

data_name: str

The name of the column in ‘df’ that contains the data values for the DataArray.

dim_cols: list of str, optional

A list of columns in ‘df’ that will be used as additional dimensions in the DataArray. If None, dimension columns will be inferred if ‘infer_dim_cols’ is True.

infer_dim_cols: bool, default=True

If True and ‘dim_cols’ is None, dimension columns will be inferred from ‘df’. Columns will be considered a dimension column if they match the pattern “^_dim_\d”.

index_name: str, default=”index”

The name assigned to the placeholder index created during the conversion process.

Returns:
xr.DataArray

A DataArray derived from the input DataFrame with the same multi-index structure. The data values are taken from the column in ‘df’ specified by ‘data_name’. Additional dimensions can be included from ‘df’ as specified by ‘dim_cols’.

Raises:
AssertionError

If ‘data_name’ is not a column in ‘df’.

Notes

The function manipulates ‘df’ by reference. If the original DataFrame needs to be preserved, provide a copy to the function.
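
Examples

A small sketch with a two-level multi-index; a copy of the DataFrame is passed, in line with the note above that the input is manipulated by reference.

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> idx = pd.MultiIndex.from_tuples([(2020, 1), (2020, 2), (2020, 3)],
...                                 names=["year", "month"])
>>> df = pd.DataFrame({"obs": [1.0, 2.0, 3.0]}, index=idx)
>>> da = DataLoader.mindex_df_to_mindex_dataarray(df.copy(), data_name="obs")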

classmethod read_flat_files(file_dirs, file_regex, sub_dirs=None, read_csv_kwargs=None, col_funcs=None, row_select=None, col_select=None, new_column_names=None, strict=True, verbose=False)

Wrapper for read_from_multiple_files with read_engine=’csv’.

Reads flat files (.csv, .tsv, etc.) from the file system and returns a pd.DataFrame object.

Parameters:
file_dirs: str or List[str]

The directories containing the files to read.

file_regex: str

A regular expression pattern to match file names within the specified directories.

sub_dirs: str or List[str], optional

Subdirectories within each file directory to search for files.

read_csv_kwargs: dict, optional

Additional keyword arguments specifically for CSV reading. These are keyword arguments for the function pandas.read_csv().

col_funcs: dict of dict, optional

A dictionary with column names as keys and column functions to apply during data reading as values. The column functions should be a dictionary of keyword arguments to utils.config_func.

row_select: list of dict, optional

A list of functions to select rows during data reading.

col_select: list of str, optional

A list of column names to read from data.

new_column_names: List[str], optional

New column names to assign to the resulting DataFrame.

strict: bool, default True

Whether to raise an error if a file directory does not exist.

verbose: bool or int, default False

Verbosity level for printing progress.

Returns:
pd.DataFrame

A DataFrame containing the combined data from multiple files.

Notes

  • This method reads data from multiple files located in specified directories and subdirectories.

  • The file_regex argument is used to filter files to be read.

  • Various transformations can be applied to the data, including adding new columns and selecting rows/columns.

  • If new_column_names is provided, it should be a list with names matching the number of columns in the output DataFrame.

  • The resulting DataFrame contains the combined data from all the specified files.

Examples

The command below reads the files "A_RAW.csv", "B_RAW.csv" and "C_RAW.csv" in the path "/path/to/dir" and combines them into a single dataframe.

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> col_funcs = {
...    "source": { # Add a new column "source" with entries "A", "B" or "C".
...        "func": "lambda x: re.sub('_RAW.*$', '', os.path.basename(x))",
...        "filename_as_arg": True
...    },
...    "datetime": { # Modify column "datetime" by converting to datetime64[s].
...        "func": "lambda x: x.astype('datetime64[s]')",
...        "col_args": "datetime"
...    },
...    "obs": { # Rename column "z" to "obs" and subtract mean value 0.1.
...        "func": "lambda x: x-0.1",
...        "col_args": "z"
...    }
... }
>>> row_select = [ # Read data whose "lat" value is >= 65.
...    {
...        "func": "lambda x: x>=65",
...        "col_kwargs": {
...            "x": "lat"
...        }
...    }
... ]
>>> df = DataLoader.read_flat_files(file_dirs = "/path/to/dir/",
...                                 file_regex = ".*_RAW.csv$",
...                                 col_funcs = col_funcs,
...                                 row_select = row_select)
>>> print(df.head(2))
        lon             lat             datetime                source  obs
0       59.944790       82.061122       2020-03-01 13:48:50     C       -0.0401
1       59.939555       82.063771       2020-03-01 13:48:50     C       -0.0861
classmethod read_from_multiple_files(file_dirs, file_regex, read_engine='csv', sub_dirs=None, col_funcs=None, row_select=None, col_select=None, new_column_names=None, strict=True, read_kwargs=None, read_csv_kwargs=None, verbose=False)

Reads and merges data from multiple files in specified directories, optionally applying various transformations such as column renaming, row selection, column selection or other transformation functions to the data.

The primary input is a list of directories and a regular expression used to select which files within those directories should be read.

Parameters:
file_dirs: list of str

A list of directories to read the files from. Each directory is a string. If a string is provided instead of a list, it will be wrapped into a single-element list.

file_regex: str

Regular expression to match the files to be read from the directories specified in ‘file_dirs’, e.g. “NEW.csv$” will match all files ending with NEW.csv.

read_engine: str, optional

The engine to be used to read the files. Options include ‘csv’, ‘nc’, ‘netcdf’, and ‘xarray’. Default is ‘csv’.

sub_dirs: list of str, optional

A list of subdirectories to be appended to each directory in ‘file_dirs’. If a string is provided, it will be wrapped into a single-element list. Default is None.

col_funcs: dict, optional

A dictionary that maps new column names to functions that compute the column values. Provided to add_cols via col_func_dict parameter. Default is None.

row_select: list of dict, optional

A list of dictionaries, each representing a condition to select rows from the DataFrame. Provided to the row_select_bool method. Default is None.

col_select: slice, optional

A slice object to select specific columns from the DataFrame. If not provided, all columns are selected.

new_column_names: list of str, optional

New names for the DataFrame columns. The length should be equal to the number of columns in the DataFrame. Default is None.

strict: bool, optional

Determines whether to raise an error if a directory in ‘file_dirs’ does not exist. If False, a warning is issued instead. Default is True.

read_kwargs: dict, optional

Additional keyword arguments to pass to the read function (pd.read_csv or xr.open_dataset). Default is None.

read_csv_kwargs: dict, optional

Deprecated. Additional keyword arguments to pass to pd.read_csv. Use ‘read_kwargs’ instead. Default is None.

verbose: bool or int, optional

Determines the verbosity level of the function. If True or an integer equal to or higher than 3, additional print statements are executed.

Returns:
out: pandas.DataFrame

The resulting DataFrame, merged from all the files that were read and processed.

Raises:
AssertionError

Raised if the ‘read_engine’ parameter is not one of the valid choices, if ‘read_kwargs’ or ‘col_funcs’ are not dictionaries, or if the length of ‘new_column_names’ is not equal to the number of columns in the DataFrame. Raised if ‘strict’ is True and a directory in ‘file_dirs’ does not exist.

Notes

The function supports reading from csv, netCDF files and xarray Dataset formats. For netCDF and xarray Dataset, the data is converted to a DataFrame using the ‘to_dataframe’ method.
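
Examples

An illustrative sketch; the directory, file pattern and usecols values are placeholders, and read_kwargs is forwarded to pandas.read_csv as described above.

>>> from GPSat.dataloader import DataLoader
>>> df = DataLoader.read_from_multiple_files(
...     file_dirs=["/path/to/dir"],
...     file_regex=".*_RAW.csv$",
...     read_engine="csv",
...     read_kwargs={"usecols": ["lon", "lat", "datetime", "z"]})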

static read_from_npy(npy_files, npy_dir, dims=None, flatten_xy=True, return_xarray=True)

Read NumPy array(s) from the specified .npy file(s) and return as xarray DataArray(s).

This function reads one or more .npy files from the specified directory and returns them as xarray DataArray(s). The input can be a single file, a list of files, or a dictionary of files with the desired keys. The returned dictionary contains the xarray DataArray(s) with the corresponding keys.

Parameters:
npy_files: str, list, or dict

The .npy file(s) to be read. It can be a single file (str), a list of files, or a dictionary of files.

npy_dir: str

The directory containing the .npy file(s).

dims: list or tuple, optional

The dimensions for the xarray DataArray(s), (default: None).

flatten_xy: bool, optional

If True, flatten the x and y arrays by taking the first row and first column, respectively (default: True).

return_xarray: bool, default True

If True, the numpy arrays will be converted to xarray DataArrays; otherwise a dict of numpy arrays is returned.

Returns:
dict

A dictionary containing xarray DataArray(s) with keys corresponding to the input files.

Examples

>>> DataLoader.read_from_npy(npy_files="data.npy", npy_dir="./data")
{'obs': <xarray.DataArray (shape)>}
>>> DataLoader.read_from_npy(npy_files=["data1.npy", "data2.npy"], npy_dir="./data")
{'obs': [<xarray.DataArray (shape1)>, <xarray.DataArray (shape2)>]}
>>> DataLoader.read_from_npy(npy_files={"x": "data_x.npy", "y": "data_y.npy"}, npy_dir="./data")
{'x': <xarray.DataArray (shape_x)>, 'y': <xarray.DataArray (shape_y)>}
static read_from_pkl_dict(pkl_files, pkl_dir=None, default_name='obs', strict=True, dim_names=None)

Reads and processes data from pickle files and returns a DataFrame containing all data.

Parameters:
pkl_files: str, list, or dict

The pickle file(s) to be read. This can be a string (representing a single file), a list of strings (representing multiple files), or a dictionary, where keys are the names of different data sources and the values are lists of file names.

pkl_dir: str, optional

The directory where the pickle files are located. If not provided, the current directory is used.

default_name: str, optional

The default data source name. This is used when pkl_files is a string or a list. Default is “obs”.

strict: bool, optional

If True, the function will raise an exception if a file does not exist. If False, it will print a warning and continue with the remaining files. Default is True.

dim_names: list, optional

The names of the dimensions. This is used when converting the data to a DataArray. If not provided, default names are used.

Returns:
DataFrame

A DataFrame containing the data from all provided files. The DataFrame has a MultiIndex with ‘idx0’, ‘idx1’ and ‘date’ as index levels, and ‘obs’ and ‘source’ as columns. Each ‘source’ corresponds to a different data source (file).

Notes

The function reads the data from the pickle files and converts them into a DataFrame. For each file, it creates a MultiIndex DataFrame where the indices are a combination of two dimensions and dates extracted from the keys in the dictionary loaded from the pickle file.

The function assumes the dictionary loaded from the pickle file has keys that can be converted to dates with the format “YYYYMMDD”. It also assumes that the values in the dictionary are 2D numpy arrays.

If pkl_files is a string or a list, the function treats them as files from a single data source and uses default_name as the source name. If it’s a dictionary, the keys are treated as data source names, and the values are lists of file names.

When multiple files are provided, the function concatenates the data along the date dimension.
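
Examples

A sketch that writes and reads back a single pickle file matching the structure described in the Notes above (a dict keyed by “YYYYMMDD” dates with 2D numpy array values); the file name is a placeholder.

>>> import pickle
>>> import numpy as np
>>> from GPSat.dataloader import DataLoader
>>> # create an illustrative pickle file: {"YYYYMMDD": 2D array}
>>> with open("obs_grid.pkl", "wb") as f:
...     pickle.dump({"20200301": np.random.randn(5, 5)}, f)
>>> df = DataLoader.read_from_pkl_dict(pkl_files="obs_grid.pkl", pkl_dir=".")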

static read_hdf(table, store=None, path=None, close=True, **select_kwargs)

Reads data from an HDF5 file, and returns a DataFrame.

This method can either read data directly from an open HDF5 store or from a provided file path. In case a file path is provided, it opens the HDF5 file in read mode, and closes it after reading, if ‘close’ is set to True.

Parameters:
table: str

The key or the name of the dataset in the HDF5 file.

store: pd.io.pytables.HDFStore, optional

An open HDF5 store. If provided, the method will directly read data from it. Default is None.

path: str, optional

The path to the HDF5 file. If provided, the method will open the HDF5 file in read mode, and read data from it. Default is None.

close: bool, optional

A flag that indicates whether to close the HDF5 store after reading the data. It is only relevant when ‘path’ is provided, in which case the default is True.

**select_kwargs: dict, optional

Additional keyword arguments that are passed to the ‘select’ method of the HDFStore object. This can be used to select only a subset of data from the HDF5 file.

Returns:
df: pd.DataFrame

A DataFrame containing the data read from the HDF5 file.

Raises:
AssertionError

If both ‘store’ and ‘path’ are None, or if ‘store’ is not an instance of pd.io.pytables.HDFStore.

Notes

Either ‘store’ or ‘path’ must be provided. If ‘store’ is provided, ‘path’ will be ignored.

Examples

>>> store = pd.HDFStore('data.h5')
>>> df = DataLoader.read_hdf(table='my_data', store=store)
>>> print(df)

classmethod row_select_bool(df, row_select=None, combine='AND', **kwargs)

Returns a boolean array indicating which rows of the DataFrame meet the specified conditions.

This class method applies a series of conditions, provided in the ‘row_select’ list, to the input DataFrame ‘df’. Each condition is represented by a dictionary that is used as input to the ‘_bool_numpy_from_where’ method.

All conditions are combined via an ‘&’ operator, meaning that the return value for a given row will be True only if all conditions are satisfied for that row, and False if any condition is not.

If ‘row_select’ is None or an empty dictionary, all indices will be True.

Parameters:
df: DataFrame

The DataFrame to apply the conditions on.

row_select: list of dict, optional

A list of dictionaries, each representing a condition to apply to ‘df’. Each dictionary should contain the information needed for the ‘_bool_numpy_from_where’ method. If None or an empty dictionary, all indices in the returned array will be True.

verbose: bool or int, optional

If set to True or a number greater than or equal to 3, additional print statements will be executed.

kwargs: dict

Additional keyword arguments passed to the ‘_bool_numpy_from_where’ method.

Returns:
select: np.array of bool

A boolean array indicating which rows of the DataFrame meet the conditions. The length of the array is equal to the number of rows in ‘df’.

Raises:
AssertionError

If ‘row_select’ is not None, not a dictionary and not a list, or if any element in ‘row_select’ is not a dictionary.

Notes

The function is designed to work with pandas DataFrames.

If ‘row_select’ is None or an empty dictionary, the function will return an array with all elements set to True (indicating all rows of ‘df’ are selected).
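
Examples

A minimal sketch using ‘where’-style conditions, which row_select supports as described for the load method; the returned boolean array can be used directly to index the DataFrame.

>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> select = DataLoader.row_select_bool(df,
...                                     row_select=[{"col": "A", "comp": ">=", "val": 2},
...                                                 {"col": "B", "comp": "<", "val": 6}])
>>> df[select]
   A  B
1  2  5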