Utilities¶
DataSet¶
- class summit.utils.dataset.DataSet(data=None, index=None, columns=None, metadata_columns=[], units=None, dtype=None, copy=False)[source]¶
A representation of a dataset.
This is essentially a pandas DataFrame with a set of “metadata” columns that are removed when the DataFrame is converted to a numpy array.
Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
- Parameters
data (ndarray (structured or homogeneous), Iterable, dict, or DataFrame) –
Dict can contain Series, arrays, constants, or list-like objects
Changed in version 0.23.0: If data is a dict, argument order is maintained for Python 3.6 and later.
index (Index or array-like) – Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided
columns (Index or array-like) – Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided
metadata_columns (Array-like) – A list of metadata columns that are already contained in the columns parameter.
dtype (dtype, default None) – Data type to force. Only a single dtype is allowed. If None, infer
copy (boolean, default False) – Copy data from inputs. Only affects DataFrame / 2d ndarray input
See also
DataFrame.from_records
Constructor from tuples, also record arrays.
DataFrame.from_dict
From dicts of Series, arrays, or dicts.
DataFrame.from_items
From sequence of (key, value) pairs. See also pandas.read_csv, pandas.read_table, and pandas.read_clipboard.
Examples
>>> data_columns = ["tau", "equiv_pldn", "conc_dfnb", "temperature"]
>>> metadata_columns = ["strategy"]
>>> columns = data_columns + metadata_columns
>>> values = [[1.5, 0.5, 0.1, 30.0, "test"]]
>>> ds = DataSet(values, columns=columns, metadata_columns=metadata_columns)
>>> values = {("tau", "DATA"): [1.5, 10.0], ("equiv_pldn", "DATA"): [0.5, 3.0],
...           ("conc_dfnb", "DATA"): [0.1, 4.0], ("temperature", "DATA"): [30.0, 100.0],
...           ("strategy", "METADATA"): ["test", "test"]}
>>> ds = DataSet(values)
Notes
Based on https://notes.mikejarrett.ca/storing-metadata-in-pandas-dataframes/
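The metadata mechanism described above can be sketched in plain pandas (this is an illustration of the idea, not a summit API call): metadata lives in a second level of the column index, so the numeric DATA columns can be selected and converted to a numpy array on their own.

```python
import pandas as pd

# Illustrative sketch in plain pandas (not the summit implementation):
# metadata is stored as a second level of the column index, so the
# numeric DATA columns can be isolated before conversion to numpy.
values = {
    ("tau", "DATA"): [1.5, 10.0],
    ("temperature", "DATA"): [30.0, 100.0],
    ("strategy", "METADATA"): ["test", "test"],
}
df = pd.DataFrame(values)

# Keep only the DATA columns, mimicking what DataSet does on conversion
data_only = df.loc[:, df.columns.get_level_values(1) == "DATA"]
arr = data_only.to_numpy(dtype=float)
print(arr.shape)  # (2, 2)
```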
- property data_columns¶
Names of the data columns
- static from_df(df: pandas.core.frame.DataFrame, metadata_columns: List = [], units: List = [])[source]¶
Create a DataSet from a pandas DataFrame.
- Parameters
df (pandas.DataFrame) – Dataframe to be converted to a DataSet
metadata_columns (list, optional) – names of the columns in the dataframe that are metadata columns
units (list, optional) – A list of objects representing the units of the columns
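The idea behind this conversion can be sketched in plain pandas (a hedged illustration, not the summit implementation): each column of an ordinary DataFrame is tagged as DATA or METADATA in a second column-index level.

```python
import pandas as pd

# Hedged sketch of the concept behind DataSet.from_df (plain pandas,
# not summit's code): tag each column as DATA or METADATA.
df = pd.DataFrame({"tau": [1.5, 10.0], "strategy": ["test", "test"]})
metadata_columns = ["strategy"]
levels = ["METADATA" if c in metadata_columns else "DATA" for c in df.columns]
tagged = df.copy()
tagged.columns = pd.MultiIndex.from_arrays([df.columns, levels])
print(tagged.columns.tolist())  # [('tau', 'DATA'), ('strategy', 'METADATA')]
```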
- classmethod from_dict(d)[source]¶
Construct DataFrame from dict of array-like or dicts.
Creates DataFrame object from dictionary by columns or by index allowing dtype specification.
- Parameters
data (dict) – Of the form {field : array-like} or {field : dict}.
orient ({'columns', 'index'}, default 'columns') – The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
dtype (dtype, default None) – Data type to force, otherwise infer.
columns (list, default None) – Column labels to use when orient='index'. Raises a ValueError if used with orient='columns'.
- Returns
- Return type
DataFrame
See also
DataFrame.from_records
DataFrame from structured ndarray, sequence of tuples or dicts, or DataFrame.
DataFrame
DataFrame object creation using constructor.
Examples
By default the keys of the dict become the DataFrame columns:
>>> data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
Specify orient='index' to create the DataFrame using dictionary keys as rows:
>>> data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data, orient='index')
       0  1  2  3
row_1  3  2  1  0
row_2  a  b  c  d
When using the ‘index’ orientation, the column names can be specified manually:
>>> pd.DataFrame.from_dict(data, orient='index',
...                        columns=['A', 'B', 'C', 'D'])
       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d
- insert(loc, column, value, type='DATA', units=None, allow_duplicates=False)[source]¶
Insert column into DataFrame at specified location.
Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.
- Parameters
loc (int) – Insertion index. Must verify 0 <= loc <= len(columns).
column (str, number, or hashable object) – Label of the inserted column.
value (int, Series, or array-like) –
allow_duplicates (bool, optional) –
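A plain-pandas analogue of this insertion (a sketch of the concept, not summit's implementation) shows how the column label can carry the DATA/METADATA tag supplied via the type parameter:

```python
import pandas as pd

# Plain-pandas analogue of DataSet.insert with type='DATA': the column
# label carries the DATA/METADATA tag as a second level.
df = pd.DataFrame({("tau", "DATA"): [1.5, 10.0]})
df.insert(loc=1, column=("conc_dfnb", "DATA"), value=[0.1, 4.0])
print(list(df.columns))  # [('tau', 'DATA'), ('conc_dfnb', 'DATA')]
```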
- property metadata_columns¶
Names of the metadata columns
- standardize(small_tol=1e-05, return_mean=False, return_std=False, **kwargs) → numpy.ndarray[source]¶
Standardize data columns by removing the mean and scaling to unit variance.
- The standard score of each data column is calculated as:
z = (x - u) / s
where u is the mean and s is the standard deviation of each data column.
- Parameters
small_tol (float, optional) – The minimum value allowed in the final scaled array. This is used to prevent very small values that will cause issues in later calculations. Defaults to 1e-5.
return_mean (bool, optional) – Return an array with the mean of each column in the DataSet
return_std (bool, optional) – Return an array with the standard deviation of each column in the DataSet
mean (array, optional) – Pass a precalculated array of means for the columns
std (array, optional) – Pass a precalculated array of standard deviations for the columns
- Returns
standard – Numpy array of the standardized data columns
- Return type
np.ndarray
Notes
This method does not change the internal values of the data columns in place.
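The formula above can be checked numerically with plain numpy; the small_tol clipping shown here is our reading of the docstring, not summit's exact implementation.

```python
import numpy as np

# Numerical sketch of z = (x - u) / s applied per data column.
x = np.array([[1.5, 30.0], [10.0, 100.0], [4.0, 60.0]])
u = x.mean(axis=0)
s = x.std(axis=0)
z = (x - u) / s

small_tol = 1e-5
z[np.abs(z) < small_tol] = small_tol  # guard against effectively-zero values (assumed behaviour)

print(np.allclose(z.mean(axis=0), 0.0))  # True
```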
- to_dict(**kwargs)[source]¶
Convert the DataFrame to a dictionary.
The type of the key-value pairs can be customized with the parameters (see below).
- Parameters
orient (str {'dict', 'list', 'series', 'split', 'records', 'index'}) –
Determines the type of the values of the dictionary.
’dict’ (default) : dict like {column -> {index -> value}}
’list’ : dict like {column -> [values]}
’series’ : dict like {column -> Series(values)}
’split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
’records’ : list like [{column -> value}, … , {column -> value}]
’index’ : dict like {index -> {column -> value}}
Abbreviations are allowed. s indicates series and sp indicates split.
into (class, default dict) – The collections.abc.Mapping subclass used for all Mappings in the return value. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.
- Returns
Return a collections.abc.Mapping object representing the DataFrame. The resulting transformation depends on the orient parameter.
- Return type
dict, list or collections.abc.Mapping
See also
DataFrame.from_dict
Create a DataFrame from a dictionary.
DataFrame.to_json
Convert a DataFrame to JSON format.
Examples
>>> df = pd.DataFrame({'col1': [1, 2],
...                    'col2': [0.5, 0.75]},
...                   index=['row1', 'row2'])
>>> df
      col1  col2
row1     1  0.50
row2     2  0.75
>>> df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}
You can specify the return orientation.
>>> df.to_dict('series')
{'col1': row1    1
row2    2
Name: col1, dtype: int64, 'col2': row1    0.50
row2    0.75
Name: col2, dtype: float64}
>>> df.to_dict('split')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
 'data': [[1, 0.5], [2, 0.75]]}
>>> df.to_dict('records')
[{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
>>> df.to_dict('index')
{'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}
You can also specify the mapping type.
>>> from collections import OrderedDict, defaultdict
>>> df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])),
             ('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])
If you want a defaultdict, you need to initialize it:
>>> dd = defaultdict(list)
>>> df.to_dict('records', into=dd)
[defaultdict(<class 'list'>, {'col1': 1, 'col2': 0.5}),
 defaultdict(<class 'list'>, {'col1': 2, 'col2': 0.75})]
- zero_to_one(small_tol=1e-05, return_min_max=False) → numpy.ndarray[source]¶
Scale the data columns between zero and one.
Each of the data columns is scaled between zero and one based on the maximum and minimum values of each column
- Parameters
small_tol (float, optional) – The minimum value allowed in the final scaled array. This is used to prevent very small values that will cause issues in later calculations. Defaults to 1e-5.
- Returns
scaled (numpy.ndarray) – A numpy array with the scaled data columns
If return_min_max is True, returns a tuple of (scaled, mins, maxes).
Notes
This method does not change the internal values of the data columns in place.
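The per-column min-max scaling described above can be sketched with plain numpy (summit's version additionally applies small_tol):

```python
import numpy as np

# Numerical sketch of the min-max scaling behind zero_to_one,
# applied independently to each column.
x = np.array([[1.5, 30.0], [10.0, 100.0], [4.0, 60.0]])
mins = x.min(axis=0)
maxes = x.max(axis=0)
scaled = (x - mins) / (maxes - mins)
print(scaled.min(), scaled.max())  # 0.0 1.0
```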
Multi-Objective¶
- summit.utils.multiobjective.hypervolume(pointset, ref)[source]¶
Compute the absolute hypervolume of a pointset according to the reference point ref.
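For intuition, here is a minimal two-objective sketch of a hypervolume computation for a minimization problem: sum the rectangles between each non-dominated point and the reference point. summit's implementation handles the general n-objective case; this only illustrates the idea.

```python
# Two-objective hypervolume sketch (minimization): assumes pointset is
# already a mutually non-dominated front. Not summit's implementation.
def hypervolume_2d(pointset, ref):
    pts = sorted(pointset)  # ascending in the first objective
    volume, prev_y = 0.0, ref[1]
    for x, y in pts:
        volume += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return volume

front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
print(hypervolume_2d(front, ref=(4.0, 4.0)))  # 6.0
```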
- summit.utils.multiobjective.pareto_efficient(data, maximize=True)[source]¶
Find the pareto-efficient points
- Parameters
data (array-like) – An (n_points, n_data) array
maximize (bool, optional) – Whether the problem is a maximization or minimization problem. Defaults to maximization (i.e., True)
- Returns
data is an array with the pareto front values; indices is an array with the indices of the pareto points in the original data array
- Return type
data, indices
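A hedged numpy sketch of the pareto filter (not summit's implementation): a point is kept if no other point is at least as good in every objective and strictly better in at least one.

```python
import numpy as np

# Pareto-efficiency sketch. For maximize=False the scores are negated so
# the same dominance test applies.
def pareto_front(data, maximize=True):
    scores = np.asarray(data, dtype=float)
    if not maximize:
        scores = -scores
    keep = np.ones(len(scores), dtype=bool)
    for i in range(len(scores)):
        if keep[i]:
            # mark every point strictly dominated by point i
            dominated = (np.all(scores[i] >= scores, axis=1)
                         & np.any(scores[i] > scores, axis=1))
            keep[dominated] = False
    indices = np.where(keep)[0]
    return np.asarray(data)[indices], indices

front, idx = pareto_front([[1, 3], [3, 1], [2, 2], [0, 0]])
print(idx.tolist())  # [0, 1, 2]
```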