Dataloader
The GPSat DataLoader class provides utility methods for loading and manipulating data.
- class GPSat.dataloader.DataLoader(hdf_store=None, dataset=None)
Bases: object
- static add_cols(df, col_func_dict=None, filename=None, verbose=False)
Adds new columns to a given DataFrame based on the provided dictionary of column-function pairs.
This function allows the user to add new columns to a DataFrame using a dictionary that maps new column names to functions that compute the column values. The functions can be provided as values in the dictionary, and the new columns can be added to the DataFrame in a single call to this function.
If a tuple is provided as a key in the dictionary, it is assumed that the corresponding function will return multiple columns. The length of the returned columns should match the length of the tuple.
- Parameters:
- df: pandas.DataFrame
The input DataFrame to which new columns will be added.
- col_func_dict: dict, optional
A dictionary that maps new column names (keys) to functions (values) that compute the column values. If a tuple is provided as a key, it is assumed that the corresponding function will return multiple columns. The length of the returned columns should match the length of the tuple. If None, an empty dictionary will be used. Default is None.
- filename: str, optional
The name of the file from which the DataFrame was read. This parameter will be passed to the functions provided in the col_func_dict. Default is None.
- verbose: int or bool, optional
Determines the level of verbosity of the function. If verbose is 3 or higher, the function will print messages about the columns being added. Default is False.
- Returns:
- None
- Raises:
- AssertionError
If the length of the new columns returned by the function does not match the length of the tuple key in the col_func_dict.
Notes
DataFrame is manipulated inplace. If a single value is returned by the function, it will be assigned to a column with the name specified in the key. See help(utils.config_func) for more details.
Examples
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> add_one = lambda x: x + 1
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> DataLoader.add_cols(df, col_func_dict={
...     'C': {'func': add_one, "col_args": "A"}
... })
   A  B  C
0  1  4  2
1  2  5  3
2  3  6  4
- static add_data_to_col(df, add_data_to_col=None, verbose=False)
Adds new data to an existing column or creates a new column with the provided data in a DataFrame.
This function takes a DataFrame and a dictionary with the column name as the key and the data to be added as the value. It can handle scalar values or lists of values, and will replicate the DataFrame rows for each value in the list.
- Parameters:
- df: pandas.DataFrame
The input DataFrame to which data will be added or updated.
- add_data_to_col: dict, optional
A dictionary with the column name (key) and data to be added (value). The data can be a scalar value or a list of values. If a list of values is provided, the DataFrame rows will be replicated for each value in the list. If None, an empty dictionary will be used. Default is None.
- verbose: bool, default False
If True, the function will print progress messages.
- Returns:
- df: pandas.DataFrame
The DataFrame with the updated or added columns.
- Raises:
- AssertionError
If the add_data_to_col parameter is not a dictionary.
Notes
This method adds data to a specified column in a pandas DataFrame repeatedly. The method creates a copy of the DataFrame for each entry in the data to be added, and concatenates them to create a new DataFrame with the added data.
Examples
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> updated_df = DataLoader.add_data_to_col(df, add_data_to_col={"C": [7, 8]})
>>> print(updated_df)
   A  B  C
0  1  4  7
1  2  5  7
2  3  6  7
0  1  4  8
1  2  5  8
2  3  6  8

>>> len(df)
3
>>> out = DataLoader.add_data_to_col(df, add_data_to_col={"a": [1, 2, 3, 4]})
>>> len(out)
12
>>> out = DataLoader.add_data_to_col(df, add_data_to_col={"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
>>> len(out)
48
- static bin_data(df, x_range=None, y_range=None, grid_res=None, x_col='x', y_col='y', val_col=None, bin_statistic='mean', return_bin_center=True)
Bins data from a given DataFrame into a 2D grid, applying the specified statistical function to the data in each bin.
This function takes a DataFrame containing x, y, and value columns and bins the data into a 2D grid. It returns the resulting grid, as well as the x and y bin edges or centers, depending on the value of return_bin_center.
- Parameters:
- df: pd.DataFrame
The input DataFrame containing the data to be binned.
- x_range: list or tuple of floats, optional
The range of x values, specified as [min, max]. If not provided, a default value of [-4500000.0, 4500000.0] will be used.
- y_range: list or tuple of floats, optional
The range of y values, specified as [min, max]. If not provided, a default value of [-4500000.0, 4500000.0] will be used.
- grid_res: float or None
The grid resolution, expressed in kilometers. This parameter must be provided.
- x_col: str, default is "x"
The name of the column in the DataFrame containing the x values.
- y_col: str, default is "y"
The name of the column in the DataFrame containing the y values.
- val_col: str, optional
The name of the column in the DataFrame containing the values to be binned. This parameter must be provided.
- bin_statistic: str, default is "mean"
The statistic to apply to the binned data. Options are 'mean', 'median', 'count', 'sum', 'min', 'max', or a custom callable function.
- return_bin_center: bool, default is True
If True, the function will return the bin centers instead of the bin edges.
- Returns:
- binned_data: numpy.ndarray
The binned data as a 2D grid.
- x_out: numpy.ndarray
The x bin edges or centers, depending on the value of return_bin_center.
- y_out: numpy.ndarray
The y bin edges or centers, depending on the value of return_bin_center.
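Examples
A minimal sketch (not from the library's docstrings), assuming grid_res is in kilometers while the x/y ranges are in meters, consistent with the defaults above; column names are illustrative:
>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> rng = np.random.default_rng(0)
>>> df = pd.DataFrame({"x": rng.uniform(-4.5e6, 4.5e6, 1000),
...                    "y": rng.uniform(-4.5e6, 4.5e6, 1000),
...                    "obs": rng.normal(size=1000)})
>>> grid, x_out, y_out = DataLoader.bin_data(df, grid_res=50, val_col="obs")
>>> # under the stated assumption: 9,000,000 m span / 50,000 m per bin = 180 bins per axis
>>> grid.shape
(180, 180)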
- classmethod bin_data_by(df, by_cols=None, val_col=None, x_col='x', y_col='y', x_range=None, y_range=None, grid_res=None, bin_statistic='mean', limit=10000)
Bins the input DataFrame df based on the given columns and computes the bin statistics for a specified value column.
This function takes a DataFrame, filters it based on the unique combinations of the by_cols column values, and then bins the data in each filtered DataFrame based on the x_col and y_col column values. It computes the bin statistic for the specified val_col and returns the result as an xarray DataArray. The output DataArray has dimensions "y", "x", and the given by_cols.
.- Parameters:
- df: pandas.DataFrame
The input DataFrame to be binned.
- by_cols: str or list[str] or tuple[str]
The column(s) by which the input DataFrame should be filtered. Unique combinations of these columns are used to create separate DataFrames for binning.
- val_col: str
The column in the input DataFrame for which the bin statistics should be computed.
- x_col: str, optional, default='x'
The column in the input DataFrame to be used for binning along the x-axis.
- y_col: str, optional, default='y'
The column in the input DataFrame to be used for binning along the y-axis.
- x_range: tuple, optional
The range of the x-axis values for binning. If None, the minimum and maximum x values are used.
- y_range: tuple, optional
The range of the y-axis values for binning. If None, the minimum and maximum y values are used.
- grid_res: float, optional
The resolution of the grid used for binning. If None, the resolution is calculated based on the input data.
- bin_statistic: str, optional, default="mean"
The statistic to compute for each bin. Supported values are "mean", "median", "sum", "min", "max", and "count".
- limit: int, optional, default=10000
The maximum number of unique combinations of the by_cols column values allowed. Raises an AssertionError if the number of unique combinations exceeds this limit.
- Returns:
- out: xarray.Dataset
The binned data as an xarray Dataset with dimensions 'y', 'x', and the given by_cols.
- Raises:
- DeprecationWarning
If the deprecated method DataLoader.bin_data_by(...) is used instead of DataPrep.bin_data_by(...).
- AssertionError
If any of the input parameters do not meet the specified conditions.
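Examples
A hedged sketch of per-date binning (column names and grid settings are illustrative, and the same kilometer/meter convention as bin_data is assumed); note the DeprecationWarning above, which points to DataPrep.bin_data_by as the preferred entry point:
>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> rng = np.random.default_rng(0)
>>> df = pd.DataFrame({"x": rng.uniform(-4e6, 4e6, 100),
...                    "y": rng.uniform(-4e6, 4e6, 100),
...                    "obs": rng.normal(size=100),
...                    "date": np.repeat(["2020-03-01", "2020-03-02"], 50)})
>>> ds = DataLoader.bin_data_by(df, by_cols="date", val_col="obs",
...                             x_range=[-4.5e6, 4.5e6], y_range=[-4.5e6, 4.5e6],
...                             grid_res=500)
>>> # ds: an xarray Dataset with dims ("y", "x", "date")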
- classmethod data_select(obj, where=None, combine_where='AND', table=None, return_df=True, reset_index=False, drop=True, copy=True, columns=None, close=False, **kwargs)
Selects data from an input object (pd.DataFrame, pd.HDFStore, xr.DataArray or xr.DataSet) based on filtering conditions.
This function filters data from various types of input objects based on the provided conditions specified in the 'where' parameter. It also supports selecting specific columns, resetting the index, and returning the output as a DataFrame.
- Parameters:
- obj: pd.DataFrame, pd.Series, dict, pd.HDFStore, xr.DataArray, or xr.Dataset
The input object from which data will be selected. If dict, it will try to convert it to pandas.DataFrame.
- where: dict, list of dict or None, default None
Filtering conditions to be applied to the input object. It can be a single dictionary or a list of dictionaries. Each dictionary should have the keys "col", "comp", "val", e.g. where = {"col": "t", "comp": "<=", "val": 4}. The "col" value specifies the column, "comp" specifies the comparison to be performed (>, >=, ==, !=, <=, <) and "val" is the value to be compared against. If None, then selects all data. Specifying the 'where' parameter can avoid reading all data in from the filesystem when obj is pandas.HDFStore or xarray.Dataset.
- combine_where: str, default 'AND'
How should where conditions, if there are multiple, be combined? Valid values are ["AND", "OR"], not case-sensitive.
- table: str, default None
The table name to select from when using an HDFStore object. If obj is pandas.HDFStore then table must be supplied.
- return_df: bool, default True
If True, the output will be returned as a pandas.DataFrame.
- reset_index: bool, default False
If True, the index of the output DataFrame will be reset.
- drop: bool, default True
If True, the output will have the filtered-out values removed. Applicable only for xarray objects. Default is True.
- copy: bool, default True
If True, the output will be a copy of the selected data. Applicable only for DataFrame objects.
- columns: list or None, default None
A list of column names to be selected from the input object. If None, selects all columns.
- close: bool, default False
If True, and obj is pandas.HDFStore, it will be closed after selecting data.
- kwargs: any
Additional keyword arguments to be passed to the obj.select method when using an HDFStore object.
- Returns:
- out: pandas.DataFrame, pandas.Series, or xarray.DataArray
The filtered data as a pd.DataFrame, pd.Series, or xr.DataArray, based on the input object type and the return_df parameter.
- Raises:
- AssertionError
If the table parameter is not provided when using an HDFStore object.
- AssertionError
If the provided columns are not found in the input object when using a DataFrame object.
Examples
>>> import pandas as pd
>>> import xarray as xr
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

>>> # Select data from a DataFrame with a filtering condition
>>> selected_df = DataLoader.data_select(df, where={"col": "A", "comp": ">=", "val": 2})
>>> print(selected_df)
   A  B
1  2  5
2  3  6
- static get_attribute_from_table(source, table, attribute_name)
Retrieve an attribute from a specific table in an HDF5 file or HDFStore.
This function handles both cases: the source being a filepath string to an HDF5 file, or a pandas HDFStore object. The function opens the source (if it's a filepath), then attempts to retrieve the specified attribute from the specified table within the source. If the retrieval fails for any reason, a warning is issued and None is returned.
- Parameters:
- source: str or pandas.HDFStore
The source from which to retrieve the attribute. If it's a string, it is treated as a filepath to an HDF5 file. If it's a pandas HDFStore object, the function operates directly on it.
- table: str
The name of the table within the source from which to retrieve the attribute.
- attribute_name: str
The name of the attribute to retrieve.
- Returns:
- attribute: object
The attribute retrieved from the specified table in the source. If the attribute could not be retrieved, None is returned.
- Raises:
- NotImplementedError
If the type of the source is neither a string nor a pandas.HDFStore.
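Examples
A short usage sketch; the file path, table and attribute names below are hypothetical:
>>> from GPSat.dataloader import DataLoader
>>> attr = DataLoader.get_attribute_from_table(source="path/to/results.h5",
...                                            table="data",
...                                            attribute_name="config")
>>> # attr holds the stored attribute, or None (with a warning) if it could not be retrieved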
- static get_masks_for_expert_loc(ref_data, el_masks=None, obs_col=None)
Generate a list of masks based on given local expert locations (el_masks) and reference data (ref_data).
This function can generate masks in two ways:
- If an el_masks entry is the string "had_obs", a mask is created based on the obs_col of the reference data where any non-NaN value is present.
- If an el_masks entry is a dictionary with a "grid_space" key, a regularly spaced mask is created based on the dimensions specified and the grid_space value.
The reference data is expected to be an xarray DataArray or xarray Dataset. Support for pandas DataFrame may be added in the future.
- Parameters:
- ref_data: xarray.DataArray or xarray.Dataset
The reference data to use when generating the masks. The data should have coordinates that match the dimensions specified in the el_masks dictionary, if provided.
- el_masks: list of str or dict, optional
A list of instructions for generating the masks. Each element in the list can be either a string or a dictionary. If a string, it should be "had_obs", which indicates a mask should be created where any non-NaN value is present in the obs_col of the ref_data. If a dictionary, it should have a "grid_space" key indicating the regular spacing to be used when creating a mask, and a 'dims' key specifying the dimensions in the reference data to be considered. By default, it is None, which indicates no mask is to be generated.
- obs_col: str, optional
The column in the reference data to use when generating a mask based on the "had_obs" instruction. This parameter is ignored if "had_obs" is not present in el_masks.
- Returns:
- list of xarray.DataArray
A list of masks generated based on the el_masks instructions. Each mask is an xarray DataArray with the same coordinates as the ref_data. Each value in the mask is a boolean indicating whether a local expert should be located at that point.
- Raises:
- AssertionError
If ref_data is not an instance of xarray.DataArray or xarray.Dataset, or if “grid_space” is in el_masks but the corresponding dimensions specified in the ‘dims’ key do not exist in ref_data.
Notes
The function could be extended to read data from the file system and to allow different types of reference data.
Future extensions could also include support for el_masks to be only a list of dicts, and for the reference data to be a pandas DataFrame.
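Examples
A minimal sketch using synthetic reference data (the el_masks entries follow the formats described above; dimension names are illustrative):
>>> import numpy as np
>>> import xarray as xr
>>> from GPSat.dataloader import DataLoader
>>> obs = np.full((10, 10), np.nan)
>>> obs[2:5, 2:5] = 1.0  # a small patch of (non-NaN) observations
>>> ref = xr.Dataset({"obs": (("y", "x"), obs)},
...                  coords={"y": np.arange(10), "x": np.arange(10)})
>>> masks = DataLoader.get_masks_for_expert_loc(
...     ref,
...     el_masks=["had_obs", {"grid_space": 2, "dims": ["x", "y"]}],
...     obs_col="obs")
>>> len(masks)  # one mask per el_masks entry
2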
- static get_run_info(script_path=None)
Retrieves information about the current Python script execution environment, including run time, Python executable path, and Git information.
This function collects information about the current script execution environment, such as the date and time when the script is executed, the path of the Python interpreter, the script’s file path, and Git information (if available).
- Parameters:
- script_path: str, default None
The file path of the currently executed script. If None, it will try to retrieve the file path automatically.
- Returns:
- run_info: dict
A dictionary containing the following keys:
- "run_time": The date and time when the script was executed, formatted as "YYYY-MM-DD HH:MM:SS".
- "python_executable": The path of the Python interpreter.
- "script_path": The absolute file path of the script (if available).
- Git-related keys: "git_branch", "git_commit", "git_url", and "git_modified" (if available).
Examples
>>> from GPSat.dataloader import DataLoader
>>> run_info = DataLoader.get_run_info()
>>> print(run_info)
{
    "run_time": "2023-04-28 10:30:00",
    "python_executable": "/usr/local/bin/python3.9",
    "script_path": "/path/to/your/script.py",
    "branch": "main",
    "commit": "123abc",
    "remote": ["https://github.com/user/repo.git (fetch)", "https://github.com/user/repo.git (push)"],
    "details": ["commit 123abc", "Author: UserName <username42@gmail.com>", "Date: Fri Apr 28 07:22:31 2023 +0100", ":bug: fix "],
    "modified": ["list_of_files.py", "modified_since.py", "last_commit.py"]
}
- static get_where_list(global_select, local_select=None, ref_loc=None)
Generate a list of selection criteria for data filtering based on global and local conditions, as well as reference location.
The function accepts a list of global select conditions, and optional local select conditions and reference location. Each condition in global select can either be ‘static’ (with keys ‘col’, ‘comp’, and ‘val’) or ‘dynamic’ (requiring local select and reference location and having keys ‘loc_col’, ‘src_col’, ‘func’). The function evaluates each global select condition and constructs a corresponding selection dictionary.
- Parameters:
- global_select: list of dict
A list of dictionaries defining global selection conditions. Each dictionary can be either 'static' or 'dynamic'. 'Static' dictionaries should contain the keys 'col', 'comp', and 'val', which define a column, a comparison operator, and a value, respectively. 'Dynamic' dictionaries should contain the keys 'loc_col', 'src_col', and 'func', which define a location column, a source column, and a function, respectively.
- local_select: list of dict, optional
A list of dictionaries defining local selection conditions. Each dictionary should contain the keys 'col', 'comp', and 'val', defining a column, a comparison operator, and a value, respectively. This parameter is required if any 'dynamic' condition is present in global_select.
- ref_loc: pandas DataFrame, optional
A reference location as a pandas DataFrame. This parameter is required if any 'dynamic' condition is present in global_select.
- Returns:
- list of dict
A list of dictionaries each representing a selection condition to be applied on data. Each dictionary contains keys ‘col’, ‘comp’, and ‘val’ defining a column, a comparison operator, and a value respectively.
- Raises:
- AssertionError
If a ‘dynamic’ condition is present in global_select but local_select or ref_loc is not provided, or if the required keys are not present in the ‘dynamic’ condition, or if the location column specified in a ‘dynamic’ condition is not present in ref_loc.
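Examples
A sketch of the two condition types; the 'dynamic' lambda is illustrative and assumes the function combines the reference-location value with the matching local_select entry to produce a concrete col/comp/val dictionary:
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> global_select = [
...     # 'static': passed through as a col/comp/val condition
...     {"col": "lat", "comp": ">=", "val": 60},
...     # 'dynamic': resolved against ref_loc and the local_select entry for 't'
...     {"loc_col": "t", "src_col": "date", "func": "lambda x, y: x + y"}
... ]
>>> local_select = [{"col": "t", "comp": ">=", "val": -4}]
>>> ref_loc = pd.DataFrame({"t": [10], "x": [0], "y": [0]})
>>> where_list = DataLoader.get_where_list(global_select,
...                                        local_select=local_select,
...                                        ref_loc=ref_loc)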
- static get_where_list_legacy(read_in_by=None, where=None)
Generate a list (of lists) of where conditions that can be consumed by pd.HDFStore(...).select.
- Parameters:
- read_in_by: dict of dict or None
Sub-dictionary must contain the keys "values" and "how".
- where: str or None
Used if read_in_by is not provided.
- Returns:
- list of list
Containing string where conditions.
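Examples
A minimal sketch using the plain where fallback (the read_in_by path is omitted here, as the accepted "how" values are not documented above):
>>> from GPSat.dataloader import DataLoader
>>> where_list = DataLoader.get_where_list_legacy(where="datetime >= '2020-03-01'")
>>> # a list of lists of string conditions, consumable by pd.HDFStore(...).select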
- classmethod hdf_tables_in_store(store=None, path=None)
Retrieve the list of tables available in an HDFStore.
This class method allows the user to get the names of all tables stored in a given HDFStore. It accepts either an already open HDFStore object or a path to an HDF5 file. If a path is provided, the method will open the HDFStore in read-only mode, retrieve the table names, and then close the store.
- Parameters:
- store: pd.io.pytables.HDFStore, optional
An open HDFStore object. If this parameter is provided, path should not be specified.
- path: str, optional
The file path to an HDF5 file. If this parameter is provided, store should not be specified. The method opens the HDFStore at this path in read-only mode to retrieve the table names.
- Returns:
- list of str
A list containing the names of all tables in the HDFStore.
- Raises:
- AssertionError
If both store and path are None, or if the store provided is not an instance of pd.io.pytables.HDFStore.
Notes
The method ensures that only one of store or path is provided. If path is specified, the HDFStore is opened in read-only mode and closed after retrieving the table names.
Examples
>>> DataLoader.hdf_tables_in_store(store=my_store)
['/table1', '/table2']

>>> DataLoader.hdf_tables_in_store(path='path/to/hdf5_file.h5')
['/table1', '/table2', '/table3']
- static is_list_of_dict(lst)
Checks if the given input is a list of dictionaries.
This utility function tests if the input is a list where all elements are instances of the dict type.
- Parameters:
- lst: list
The input list to be checked for containing only dictionaries.
- Returns:
- bool
True if the input is a list of dictionaries, False otherwise.
Examples
>>> from GPSat.dataloader import DataLoader
>>> DataLoader.is_list_of_dict([{"col": "t", "comp": "==", "val": 1}])
True
>>> DataLoader.is_list_of_dict([{"a": 1, "b": 2}, {"c": 3, "d": 4}])
True
>>> DataLoader.is_list_of_dict([1, 2, 3])
False
>>> DataLoader.is_list_of_dict("not a list")
False
- static kdt_tree_list_for_local_select(df, local_select)
Pre-calculates a list of KDTree objects for selecting points within a radius based on the local_select input.
Given a DataFrame and a list of local selection criteria, this function builds a list of KDTree objects that can be used for spatially selecting points within specified radii.
- Parameters:
- df: pd.DataFrame
The input DataFrame containing the data to be used for KDTree construction.
- local_select: list of dict
A list of dictionaries containing the selection criteria for each local select. Each dictionary should have the following keys:
- "col": The name of the column(s) used for spatial selection. Can be a single string or a list of strings.
- "comp": The comparison operator, either "<" or "<=". Currently, only less-than comparisons are supported for multi-dimensional values.
- Returns:
- out: list
A list of KDTree objects or None values, where each element corresponds to an entry in the local_select input. If an entry in local_select has a single string for "col", the corresponding output element will be None. Otherwise, the output element will be a KDTree object built from the specified columns.
Examples
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> local_select = [{"col": ["x", "y"], "comp": "<"}]
>>> kdt_trees = DataLoader.kdt_tree_list_for_local_select(df, local_select)
>>> print(kdt_trees)
- classmethod load(source, where=None, engine=None, table=None, source_kwargs=None, col_funcs=None, row_select=None, col_select=None, reset_index=False, add_data_to_col=None, close=False, verbose=False, combine_row_select='AND', **kwargs)
Load data from various sources and (optionally) apply selection of columns/rows and add/modify columns.
- Parameters:
- source: str, pd.DataFrame, pd.Series, pd.HDFStore, xr.DataSet, default None
If str, will try to convert to other types.
- where: dict or list of dict, default None
Used when querying pd.HDFStore, xr.DataSet, xr.DataArray. Specified as a list of one or more dictionaries, each containing the keys:
- "col": refers to a column (or variable for xarray objects).
- "comp": the type of comparison to apply, e.g. "==", "!=", ">=", ">", "<=", "<".
- "val": the value to be compared with.
e.g. where = [{"col": "A", "comp": ">=", "val": 0}] will select entries where the column "A" is greater than or equal to 0.
Note: Think of this as a database query, with the where conditions used to read data from the file system into memory.
- engine: str or None, default None
Specify the type of 'engine' to use to read in the data. If not supplied, it will be inferred from source, if source is a string. Valid values: "HDFStore", "netcdf4", "scipy", "pydap", "h5netcdf", "pynio", "cfgrib", "pseudonetcdf", "zarr", or any of the Pandas "read_*" functions.
- table: str or None, default None
Used only if source is pd.HDFStore (or is converted to one) and is required if so. Should be a valid table (i.e. key) in the HDFStore.
- source_kwargs: dict or None, default None
Additional keyword arguments to pass to the data source reading functions, depending on engine, e.g. keyword arguments for pandas.read_csv() if engine="read_csv".
- col_funcs: dict or None, default None
If dict, it will be provided to the add_cols method to add or modify columns.
- row_select: dict, list of dict, or None, default None
Used to select a subset of data after it is initially read into memory. Can be the same type of input as where, i.e. row_select = {"col": "A", "comp": ">=", "val": 0}, or use col_funcs that return a bool array, e.g. row_select = {"func": "lambda x: ~np.isnan(x)", "col_args": 1}. See help(utils.config_func) for more details.
- col_select: list of str or None, default None
If specified as a list of strings, it will return a subset of columns using col_select. All values must be valid. If None, all columns will be returned.
- filename: str or None, default None
Used by the add_cols method.
- reset_index: bool, default False
Apply reset_index(inplace=True) before returning?
- add_data_to_col: dict or None, default None
Add a new column to the data frame. See the add_data_to_col argument of the add_data_to_col method.
- close: bool, default False
See DataLoader.data_select for details.
Set verbosity.
- kwargs:
Additional arguments to be provided to the data_select method.
- Returns:
- pd.DataFrame
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df = DataLoader.load(source=df,
...                      where={"col": "A", "comp": ">=", "val": 2})
>>> print(df.head())
   A  B
0  2  5
1  3  6
If the data is stored in a file, we can extract it as follows (here, we assume the data is saved in “path/to/data.h5” under the table “data”):
>>> df = DataLoader.load(source="path/to/data.h5",
...                      table="data")
- classmethod local_data_select(df, reference_location, local_select, kdtree=None, verbose=True)
Selects data from a DataFrame based on a given criteria and reference (expert) location.
This method applies local selection criteria to a DataFrame, allowing for flexible, column-wise data selection based on comparison operations. For multi (dimensional) column selections, a KDTree can be used for efficiency.
- Parameters:
- df: pd.DataFrame
The DataFrame from which data will be selected.
- reference_location: dict or pd.DataFrame
Reference location used for comparisons. If a DataFrame is provided, it will be converted to a dict.
- local_select: list of dict
List of dictionaries containing the selection criteria for each local select. Each dictionary must contain keys 'col', 'comp', and 'val'. 'col' is the column in 'df' to apply the comparison on, 'comp' is the comparison operator as a string (can be '>=', '>', '==', '<', '<='), and 'val' is the value to compare with.
- kdtree: KDTree or list of KDTree, optional
Precomputed KDTree or list of KDTrees for optimization. Each KDTree in the list corresponds to an entry in local_select. If not provided, a new KDTree will be created.
- verbose: bool, default=True
If True, print details for each selection criterion.
- Returns:
- pd.DataFrame
A DataFrame containing only the data that meets all of the selection criteria.
- Raises:
- AssertionError
If ‘col’ is not in ‘df’ or ‘reference_location’, if the comparison operator in ‘local_select’ is not valid, or if the provided ‘kdtree’ is not of type KDTree.
Notes
If ‘col’ is a string, a simple comparison is performed. If ‘col’ is a list of strings, a KDTree-based selection is performed where each dimension is a column from ‘df’. For multi-dimensional comparisons, only less than comparisons are currently handled.
If ‘kdtree’ is provided and is a list, it must be of the same length as ‘local_select’ with each element corresponding to the same index in ‘local_select’.
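Examples
A hedged sketch (column names and values are illustrative; single-column conditions are written as offsets relative to the reference location, the convention used for local_select elsewhere in GPSat):
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 10.0],
...                    "y": [0.0, 1.0, 2.0, 10.0],
...                    "t": [0, 1, 2, 8]})
>>> ref_loc = {"x": 0.0, "y": 0.0, "t": 0}
>>> out = DataLoader.local_data_select(
...     df,
...     reference_location=ref_loc,
...     local_select=[
...         # multi-column: KDTree-based radius selection about the reference (x, y)
...         {"col": ["x", "y"], "comp": "<", "val": 5},
...         # single-column: simple comparisons, keeping t within [-4, 4] of the reference
...         {"col": "t", "comp": "<=", "val": 4},
...         {"col": "t", "comp": ">=", "val": -4},
...     ],
...     verbose=False)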
- static make_multiindex_df(idx_dict, **kwargs)
Create a multi-indexed DataFrame from the provided index dictionary for each keyword argument supplied.
This function creates a multi-indexed DataFrame, with each row having the same multi-index value. The index dictionary serves as the levels and labels for the multi-index, while the keyword arguments provide the data.
- Parameters:
- idx_dict: dict or pd.Series
A dictionary or pandas Series containing the levels and labels for the multi-index.
- **kwargs: dict
Keyword arguments specifying the data and column names for the resulting DataFrame. The data can be of various types: int, float, bool, np.ndarray, pd.DataFrame, dict, or tuple. This data will be transformed into a DataFrame, where the multi-index will be added.
- Returns:
- dict
A dictionary containing the multi-indexed DataFrames with keys corresponding to the keys of provided keyword arguments.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> idx_dict = {"year": 2020, "month": 1}
>>> data = pd.DataFrame({"x": np.arange(10)})
>>> df = pd.DataFrame({"y": np.arange(3)})
>>> DataLoader.make_multiindex_df(idx_dict, data=data, df=df)
{'data': <pandas.DataFrame (multiindexed) with shape (3, 4)>}
- static mindex_df_to_mindex_dataarray(df, data_name, dim_cols=None, infer_dim_cols=True, index_name='index')
Converts a multi-index DataFrame to a multi-index DataArray.
The method facilitates a transition from pandas DataFrame representation to the Xarray DataArray format, while preserving multi-index structure. This can be useful for higher-dimensional indexing, labeling, and performing mathematical operations on the data.
- Parameters:
- df: pd.DataFrame
The input DataFrame with a multi-index to be converted to a DataArray.
- data_name: str
The name of the column in 'df' that contains the data values for the DataArray.
- dim_cols: list of str, optional
A list of columns in 'df' that will be used as additional dimensions in the DataArray. If None, dimension columns will be inferred if 'infer_dim_cols' is True.
- infer_dim_cols: bool, default=True
If True and 'dim_cols' is None, dimension columns will be inferred from 'df'. Columns will be considered dimension columns if they match the pattern "^_dim_\d" (i.e. start with "_dim_" followed by a digit).
- index_name: str, default="index"
The name assigned to the placeholder index created during the conversion process.
- Returns:
- xr.DataArray
A DataArray derived from the input DataFrame with the same multi-index structure. The data values are taken from the column in ‘df’ specified by ‘data_name’. Additional dimensions can be included from ‘df’ as specified by ‘dim_cols’.
- Raises:
- AssertionError
If ‘data_name’ is not a column in ‘df’.
Notes
The function manipulates ‘df’ by reference. If the original DataFrame needs to be preserved, provide a copy to the function.
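Examples
A minimal sketch; a copy of the DataFrame is passed in because, per the Notes above, the input is manipulated by reference:
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"obs": [1.0, 2.0, 3.0]},
...                   index=pd.MultiIndex.from_tuples([(2020, 1), (2020, 2), (2020, 3)],
...                                                   names=["year", "month"]))
>>> da = DataLoader.mindex_df_to_mindex_dataarray(df.copy(), data_name="obs")
>>> # da: xr.DataArray holding the 'obs' values, preserving the ("year", "month") multi-index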
- classmethod read_flat_files(file_dirs, file_regex, sub_dirs=None, read_csv_kwargs=None, col_funcs=None, row_select=None, col_select=None, new_column_names=None, strict=True, verbose=False)
Wrapper for read_from_multiple_files with read_engine='csv'.
Reads flat files (.csv, .tsv, etc) from the file system and returns a pd.DataFrame object.
- Parameters:
- file_dirs: str or List[str]
The directories containing the files to read.
- file_regex: str
A regular expression pattern to match file names within the specified directories.
- sub_dirs: str or List[str], optional
Subdirectories within each file directory to search for files.
- read_csv_kwargs: dict, optional
Additional keyword arguments specifically for CSV reading. These are keyword arguments for the function pandas.read_csv().
- col_funcs: dict of dict, optional
A dictionary with column names as keys and column functions to apply during data reading as values. The column functions should be a dictionary of keyword arguments to utils.config_func.
- row_select: list of dict, optional
A list of functions to select rows during data reading.
- col_select: list of str, optional
A list of column names to read from data.
- new_column_names: List[str], optional
New column names to assign to the resulting DataFrame.
- strict: bool, default True
Whether to raise an error if a file directory does not exist.
- verbose: bool or int, default False
Verbosity level for printing progress.
- Returns:
- pd.DataFrame
A DataFrame containing the combined data from multiple files.
Notes
This method reads data from multiple files located in specified directories and subdirectories.
The file_regex argument is used to filter the files to be read.
Various transformations can be applied to the data, including adding new columns and selecting rows/columns.
If new_column_names is provided, it should be a list with names matching the number of columns in the output DataFrame.
The resulting DataFrame contains the combined data from all the specified files.
Examples
The command below reads the files "A_RAW.csv", "B_RAW.csv" and "C_RAW.csv" in the path "/path/to/dir" and combines them into a single dataframe.
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> col_funcs = {
...     "source": {  # Add a new column "source" with entries "A", "B" or "C".
...         "func": "lambda x: re.sub('_RAW.*$', '', os.path.basename(x))",
...         "filename_as_arg": True
...     },
...     "datetime": {  # Modify column "datetime" by converting to datetime64[s].
...         "func": "lambda x: x.astype('datetime64[s]')",
...         "col_args": "datetime"
...     },
...     "obs": {  # Rename column "z" to "obs" and subtract mean value 0.1.
...         "func": "lambda x: x-0.1",
...         "col_args": "z"
...     }
... }
>>> row_select = [  # Read data whose "lat" value is >= 65.
...     {
...         "func": "lambda x: x>=65",
...         "col_kwargs": {
...             "x": "lat"
...         }
...     }
... ]
>>> df = DataLoader.read_flat_files(file_dirs="/path/to/dir/",
...                                 file_regex=".*_RAW.csv$",
...                                 col_funcs=col_funcs,
...                                 row_select=row_select)
>>> print(df.head(2))
         lon        lat            datetime source     obs
0  59.944790  82.061122 2020-03-01 13:48:50      C -0.0401
1  59.939555  82.063771 2020-03-01 13:48:50      C -0.0861
- classmethod read_from_multiple_files(file_dirs, file_regex, read_engine='csv', sub_dirs=None, col_funcs=None, row_select=None, col_select=None, new_column_names=None, strict=True, read_kwargs=None, read_csv_kwargs=None, verbose=False)
Reads and merges data from multiple files in specified directories, optionally applying various transformations such as column renaming, row selection, column selection or other transformation functions to the data.
The primary input is a list of directories and a regular expression used to select which files within those directories should be read.
- Parameters:
- file_dirs: list of str
A list of directories to read the files from. Each directory is a string. If a string is provided instead of a list, it will be wrapped into a single-element list.
- file_regex: str
Regular expression to match the files to be read from the directories specified in 'file_dirs', e.g. "NEW.csv$" will match all files ending with NEW.csv.
- read_engine: str, optional
The engine to be used to read the files. Options include 'csv', 'nc', 'netcdf', and 'xarray'. Default is 'csv'.
- sub_dirs: list of str, optional
A list of subdirectories to be appended to each directory in 'file_dirs'. If a string is provided, it will be wrapped into a single-element list. Default is None.
- col_funcs: dict, optional
A dictionary that maps new column names to functions that compute the column values. Provided to add_cols via the col_func_dict parameter. Default is None.
- row_select: list of dict, optional
A list of dictionaries, each representing a condition to select rows from the DataFrame. Provided to the row_select_bool method. Default is None.
- col_select: slice, optional
A slice object to select specific columns from the DataFrame. If not provided, all columns are selected.
- new_column_names: list of str, optional
New names for the DataFrame columns. The length should be equal to the number of columns in the DataFrame. Default is None.
- strict: bool, optional
Determines whether to raise an error if a directory in 'file_dirs' does not exist. If False, a warning is issued instead. Default is True.
- read_kwargs: dict, optional
Additional keyword arguments to pass to the read function (pd.read_csv or xr.open_dataset). Default is None.
- read_csv_kwargs: dict, optional
Deprecated. Additional keyword arguments to pass to pd.read_csv. Use 'read_kwargs' instead. Default is None.
- verbose: bool or int, optional
Determines the verbosity level of the function. If True or an integer equal to or higher than 3, additional print statements are executed.
- Returns:
- out: pandas.DataFrame
The resulting DataFrame, merged from all the files that were read and processed.
- Raises:
- AssertionError
Raised if the ‘read_engine’ parameter is not one of the valid choices, if ‘read_kwargs’ or ‘col_funcs’ are not dictionaries, or if the length of ‘new_column_names’ is not equal to the number of columns in the DataFrame. Raised if ‘strict’ is True and a directory in ‘file_dirs’ does not exist.
Notes
The function supports reading from csv, netCDF files and xarray Dataset formats. For netCDF and xarray Dataset, the data is converted to a DataFrame using the ‘to_dataframe’ method.
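Examples
A short usage sketch; the directory and file pattern are hypothetical:
>>> from GPSat.dataloader import DataLoader
>>> df = DataLoader.read_from_multiple_files(
...     file_dirs=["/path/to/dir"],
...     file_regex="NEW.csv$",
...     read_engine="csv",
...     read_kwargs={"sep": ","})
>>> # df combines all matching files into a single DataFrame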
- static read_from_npy(npy_files, npy_dir, dims=None, flatten_xy=True, return_xarray=True)
Read NumPy array(s) from the specified .npy file(s) and return as xarray DataArray(s).
This function reads one or more .npy files from the specified directory and returns them as xarray DataArray(s). The input can be a single file, a list of files, or a dictionary of files with the desired keys. The returned dictionary contains the xarray DataArray(s) with the corresponding keys.
- Parameters:
- npy_files: str, list, or dict
The .npy file(s) to be read. It can be a single file (str), a list of files, or a dictionary of files.
- npy_dir: str
The directory containing the .npy file(s).
- dims: list or tuple, optional
The dimensions for the xarray DataArray(s) (default: None).
- flatten_xy: bool, optional
If True, flatten the x and y arrays by taking the first row and first column, respectively (default: True).
- return_xarray: bool, default True
If True, will convert numpy arrays to xarray DataArray, otherwise will return a dict of numpy arrays.
- Returns:
- dict
A dictionary containing xarray DataArray(s) with keys corresponding to the input files.
Examples
>>> read_from_npy(npy_files="data.npy", npy_dir="./data") {'obs': <xarray.DataArray (shape)>
>>> read_from_npy(npy_files=["data1.npy", "data2.npy"], npy_dir="./data") {'obs': [<xarray.DataArray (shape1)>, <xarray.DataArray (shape2)>]}
>>> read_from_npy(npy_files={"x": "data_x.npy", "y": "data_y.npy"}, npy_dir="./data") {'x': <xarray.DataArray (shape_x)>, 'y': <xarray.DataArray (shape_y)>}
- static read_from_pkl_dict(pkl_files, pkl_dir=None, default_name='obs', strict=True, dim_names=None)
Reads and processes data from pickle files and returns a DataFrame containing all data.
- Parameters:
- pkl_files: str, list, or dict
The pickle file(s) to be read. This can be a string (representing a single file), a list of strings (representing multiple files), or a dictionary, where keys are the names of different data sources and the values are lists of file names.
- pkl_dir: str, optional
The directory where the pickle files are located. If not provided, the current directory is used.
- default_name: str, optional
The default data source name. This is used when pkl_files is a string or a list. Default is "obs".
- strict: bool, optional
If True, the function will raise an exception if a file does not exist. If False, it will print a warning and continue with the remaining files. Default is True.
- dim_names: list, optional
The names of the dimensions. This is used when converting the data to a DataArray. If not provided, default names are used.
- Returns:
- DataFrame
A DataFrame containing the data from all provided files. The DataFrame has a MultiIndex with ‘idx0’, ‘idx1’ and ‘date’ as index levels, and ‘obs’ and ‘source’ as columns. Each ‘source’ corresponds to a different data source (file).
Notes
The function reads the data from the pickle files and converts them into a DataFrame. For each file, it creates a MultiIndex DataFrame where the indices are a combination of two dimensions and dates extracted from the keys in the dictionary loaded from the pickle file.
The function assumes the dictionary loaded from the pickle file has keys that can be converted to dates with the format "YYYYMMDD". It also assumes that the values in the dictionary are 2D numpy arrays.
If pkl_files is a string or a list, the function treats them as files from a single data source and uses default_name as the source name. If it’s a dictionary, the keys are treated as data source names, and the values are lists of file names.
When multiple files are provided, the function concatenates the data along the date dimension.
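Examples
A self-contained sketch that first writes a toy pickle matching the assumed structure (keys are "YYYYMMDD" strings, values are 2D numpy arrays):
>>> import pickle
>>> import numpy as np
>>> from GPSat.dataloader import DataLoader
>>> with open("toy_obs.pkl", "wb") as f:
...     pickle.dump({"20200301": np.random.randn(4, 4),
...                  "20200302": np.random.randn(4, 4)}, f)
>>> df = DataLoader.read_from_pkl_dict(pkl_files="toy_obs.pkl", pkl_dir=".")
>>> # df: MultiIndex ('idx0', 'idx1', 'date'), with columns 'obs' and 'source'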
- static read_hdf(table, store=None, path=None, close=True, **select_kwargs)
Reads data from an HDF5 file, and returns a DataFrame.
This method can either read data directly from an open HDF5 store or from a provided file path. In case a file path is provided, it opens the HDF5 file in read mode, and closes it after reading, if ‘close’ is set to True.
- Parameters:
- table: str
The key or the name of the dataset in the HDF5 file.
- store: pd.io.pytables.HDFStore, optional
An open HDF5 store. If provided, the method will directly read data from it. Default is None.
- path: str, optional
The path to the HDF5 file. If provided, the method will open the HDF5 file in read mode and read data from it. Default is None.
- close: bool, optional
A flag that indicates whether to close the HDF5 store after reading the data. It is only relevant when 'path' is provided, in which case the default is True.
- **select_kwargs: dict, optional
Additional keyword arguments that are passed to the 'select' method of the HDFStore object. This can be used to select only a subset of data from the HDF5 file.
- Returns:
- df: pd.DataFrame
A DataFrame containing the data read from the HDF5 file.
- Raises:
- AssertionError
If both ‘store’ and ‘path’ are None, or if ‘store’ is not an instance of pd.io.pytables.HDFStore.
Notes
Either ‘store’ or ‘path’ must be provided. If ‘store’ is provided, ‘path’ will be ignored.
Examples
>>> store = pd.HDFStore('data.h5')
>>> df = DataLoader.read_hdf(table='my_data', store=store)
>>> print(df)
- classmethod row_select_bool(df, row_select=None, combine='AND', **kwargs)
Returns a boolean array indicating which rows of the DataFrame meet the specified conditions.
This class method applies a series of conditions, provided in the ‘row_select’ list, to the input DataFrame ‘df’. Each condition is represented by a dictionary that is used as input to the ‘_bool_numpy_from_where’ method.
All conditions are combined via an '&' operator: the return value for a given row is True only if all conditions are satisfied for that row, and False otherwise.
If ‘row_select’ is None or an empty dictionary, all indices will be True.
- Parameters:
- df: DataFrame
The DataFrame to apply the conditions on.
- row_select: list of dict, optional
A list of dictionaries, each representing a condition to apply to 'df'. Each dictionary should contain the information needed for the '_bool_numpy_from_where' method. If None or an empty dictionary, all indices in the returned array will be True.
- verbose: bool or int, optional
If set to True or a number greater than or equal to 3, additional print statements will be executed.
- kwargs: dict
Additional keyword arguments passed to the '_bool_numpy_from_where' method.
- Returns:
- select: np.array of bool
A boolean array indicating which rows of the DataFrame meet the conditions. The length of the array is equal to the number of rows in ‘df’.
- Raises:
- AssertionError
If ‘row_select’ is not None, not a dictionary and not a list, or if any element in ‘row_select’ is not a dictionary.
Notes
The function is designed to work with pandas DataFrames.
If ‘row_select’ is None or an empty dictionary, the function will return an array with all elements set to True (indicating all rows of ‘df’ are selected).
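Examples
A minimal sketch showing both condition styles; the function-style entry mirrors the row_select usage documented for DataLoader.load, with "col_args" naming the column passed to the lambda:
>>> import numpy as np
>>> import pandas as pd
>>> from GPSat.dataloader import DataLoader
>>> df = pd.DataFrame({"A": [1.0, np.nan, 3.0]})
>>> # comparison-style condition (col/comp/val)
>>> DataLoader.row_select_bool(df, row_select=[{"col": "A", "comp": ">=", "val": 2}])
array([False, False,  True])
>>> # function-style condition returning a bool array
>>> DataLoader.row_select_bool(df, row_select=[{"func": "lambda x: ~np.isnan(x)", "col_args": "A"}])
array([ True, False,  True])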