Dataset
MAT-Tools: Python Framework for Multiple Aspect Trajectory Data Mining
This application provides a tool to support the user in preprocessing multiple aspect trajectory data. It is part of a unified framework for multiple aspect trajectories and, more generally, for multidimensional sequence data mining methods. Copyright (C) 2022, MIT license (this portion of the code is subject to the licensing of the source project distribution).
Created in December 2023. Copyright (C) 2023, GPL version 3 or later (see LICENSE file).
- Authors: Tarlis Portela
- matdata.dataset.load_ds(dataset='mat.FoursquareNYC', prefix='', missing=None, sample_size=1, random_num=1, sort=True)
Load a dataset for training or testing from a GitHub repository.
Parameters:
- dataset : str, optional
The name of the dataset to load (default ‘mat.FoursquareNYC’).
- prefix : str, optional
The prefix to be added to the dataset file name (default ‘’).
- missing : str, optional
The placeholder value used to denote missing data (default ‘-999’ according to this description; the signature above shows None).
- sample_size : float, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_num : int, optional
Random seed for reproducibility (default 1).
Returns:
- pandas.DataFrame
The loaded dataset with optional sampling.
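The interaction of sample_size and random_num can be sketched as follows. This is a minimal illustration in plain Python, not the library's implementation: the helper sample_trajectories is hypothetical, and it assumes that sampling keeps whole trajectories (grouped by a tid column), which the description above does not state.

```python
import random


def sample_trajectories(rows, tid_col="tid", sample_size=1, random_num=1):
    """Keep a reproducible fraction of trajectories (hypothetical helper)."""
    if sample_size >= 1:
        return list(rows)  # default: use the entire dataset
    # Assumption: sampling is done per trajectory ID, not per row.
    tids = sorted({row[tid_col] for row in rows})
    rng = random.Random(random_num)  # fixed seed gives the same sample each run
    n_keep = max(1, round(len(tids) * sample_size))
    kept = set(rng.sample(tids, n_keep))
    return [row for row in rows if row[tid_col] in kept]


# Ten toy trajectories with three points each.
rows = [{"tid": t, "poi": f"p{i}"} for t in range(10) for i in range(3)]
half = sample_trajectories(rows, sample_size=0.5, random_num=1)
```

Calling the helper twice with the same random_num returns the same subset, which is the sense in which the seed makes the sample reproducible.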
- matdata.dataset.load_ds_holdout(dataset='mat.FoursquareNYC', train_size=0.7, prefix='', missing='-999', sample_size=1, random_num=1, sort=True)
Load a dataset for training and testing with a holdout method from a GitHub repository.
Parameters:
- dataset : str, optional
The name of the dataset file to load from the GitHub repository (default ‘mat.FoursquareNYC’). Format: category.DatasetName.
- train_size : float, optional
The proportion of the dataset to include in the training set (default 0.7).
- prefix : str, optional
The prefix to be added to the dataset file name (default ‘’).
- missing : str, optional
The placeholder value used to denote missing data (default ‘-999’).
- sample_size : float, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_num : int, optional
Random seed for reproducibility (default 1).
Returns:
- train : pandas.DataFrame
The training dataset.
- test : pandas.DataFrame
The testing dataset.
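A holdout split with train_size=0.7 can be pictured with the sketch below. It is an illustration under stated assumptions, not the library's code: holdout_split is a hypothetical helper, and it assumes the split is made over trajectory IDs so that no trajectory is shared between train and test.

```python
import random


def holdout_split(rows, tid_col="tid", train_size=0.7, random_num=1):
    """Split rows into train/test by trajectory ID (hypothetical helper)."""
    tids = sorted({row[tid_col] for row in rows})
    random.Random(random_num).shuffle(tids)  # seeded, so the split is repeatable
    n_train = round(len(tids) * train_size)
    train_ids = set(tids[:n_train])
    train = [r for r in rows if r[tid_col] in train_ids]
    test = [r for r in rows if r[tid_col] not in train_ids]
    return train, test


# Ten toy trajectories with two points each.
rows = [{"tid": t, "point": i} for t in range(10) for i in range(2)]
train, test = holdout_split(rows, train_size=0.7)
```

Splitting on IDs rather than on rows keeps all points of a trajectory on the same side of the split.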
- matdata.dataset.load_ds_kfold(dataset='mat.FoursquareNYC', k=5, prefix='', missing='-999', sample_size=1, random_num=1)
Load a dataset for k-fold cross-validation from a GitHub repository.
Parameters:
- dataset : str, optional
The name of the dataset file to load from the GitHub repository (default ‘mat.FoursquareNYC’).
- k : int, optional
The number of folds for cross-validation (default 5).
- prefix : str, optional
The prefix to be added to the dataset file name (default ‘’).
- missing : str, optional
The placeholder value used to denote missing data (default ‘-999’).
- sample_size : float, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_num : int, optional
Random seed for reproducibility (default 1).
Returns:
- ktrain : list of pandas.DataFrame
The training datasets for each fold.
- ktest : list of pandas.DataFrame
The testing datasets for each fold.
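The shape of the returned ktrain and ktest lists can be illustrated with a small k-fold sketch. This is an assumption-laden example rather than the library's implementation: kfold_split is a hypothetical helper, and it assumes that folds partition the trajectory IDs.

```python
import random


def kfold_split(rows, tid_col="tid", k=5, random_num=1):
    """Build k train/test pairs over trajectory IDs (hypothetical helper)."""
    tids = sorted({row[tid_col] for row in rows})
    random.Random(random_num).shuffle(tids)
    # Deal the shuffled IDs into k disjoint folds.
    folds = [set(tids[i::k]) for i in range(k)]
    ktrain, ktest = [], []
    for held_out in folds:
        ktrain.append([r for r in rows if r[tid_col] not in held_out])
        ktest.append([r for r in rows if r[tid_col] in held_out])
    return ktrain, ktest


# Ten toy trajectories with two points each.
rows = [{"tid": t, "point": i} for t in range(10) for i in range(2)]
ktrain, ktest = kfold_split(rows, k=5)
```

Entry i of ktrain pairs with entry i of ktest, and each trajectory appears in exactly one test fold.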
- matdata.dataset.prepare_ds(df, tid_col='tid', class_col=None, sample_size=1, random_num=1, sort=True)
Prepare dataset for training or testing (helper function).
Parameters:
- df : pandas.DataFrame
The DataFrame containing the dataset.
- tid_col : str, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_col : str or None, optional
The name of the column representing class labels. If None, no class column is used for ordering data (default None).
- sample_size : float, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_num : int, optional
Random seed for reproducibility (default 1).
Returns:
- pandas.DataFrame
The prepared dataset with optional sampling.
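The role of class_col in ordering can be sketched as below. The exact sort order used by the library is not given here, so this example assumes rows are ordered by class label and then by trajectory ID when class_col is set, and by trajectory ID alone otherwise; order_rows is a hypothetical helper.

```python
def order_rows(rows, tid_col="tid", class_col=None):
    """Order rows by (class, tid) or by tid alone (hypothetical helper)."""
    if class_col is None:
        key = lambda r: r[tid_col]
    else:
        # Assumption: class label is the primary key, trajectory ID the secondary.
        key = lambda r: (r[class_col], r[tid_col])
    # sorted() is stable, so points keep their order within a trajectory.
    return sorted(rows, key=key)


rows = [
    {"tid": 2, "label": "b", "poi": "x"},
    {"tid": 1, "label": "b", "poi": "u"},
    {"tid": 3, "label": "a", "poi": "v"},
    {"tid": 2, "label": "b", "poi": "y"},
]
by_class = order_rows(rows, class_col="label")
by_tid = order_rows(rows)
```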
- matdata.dataset.read_ds(data_file, tid_col='tid', class_col=None, missing='-999', sample_size=1, random_num=1)
Read a dataset from a file.
Parameters:
- data_file : str
The path to the dataset file.
- tid_col : str, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_col : str or None, optional
The name of the column representing class labels. If None, no class column is used (default None).
- missing : str, optional
The placeholder value used to denote missing data (default ‘-999’).
- sample_size : float, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_num : int, optional
Random seed for reproducibility (default 1).
Returns:
- pandas.DataFrame
The read dataset.
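How a missing-data placeholder might be handled when reading a file is sketched below with the standard csv module. This is only an illustration: read_rows is a hypothetical helper, the column names are invented, and the example assumes that cells equal to the placeholder are the ones treated as missing, which the description above leaves open.

```python
import csv
import io


def read_rows(handle, missing="-999"):
    """Read CSV rows, mapping the placeholder to None (hypothetical helper)."""
    reader = csv.DictReader(handle)
    return [
        {col: (None if val == missing else val) for col, val in row.items()}
        for row in reader
    ]


# In-memory stand-in for a dataset file; the columns are made up.
sample = io.StringIO("tid,label,poi,weather\n1,a,Cafe,-999\n1,a,-999,Clear\n")
rows = read_rows(sample)
```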
- matdata.dataset.read_ds_5fold(data_path, prefix='specific', suffix='.csv', tid_col='tid', class_col=None, missing='-999')
Read datasets for 5-fold cross-validation from files in a directory.
See also: read_ds_kfold (read datasets for k-fold cross-validation).
Parameters:
- data_path : str
The path to the directory containing the dataset files.
- prefix : str, optional
The prefix of the dataset file names (default ‘specific’).
- suffix : str, optional
The suffix of the dataset file names (default ‘.csv’).
- tid_col : str, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_col : str or None, optional
The name of the column representing class labels. If None, no class column is used (default None).
- missing : str, optional
The placeholder value used to denote missing data (default ‘-999’).
Returns:
- 5_train : list of pandas.DataFrame
The training datasets for each fold.
- 5_test : list of pandas.DataFrame
The testing datasets for each fold.
- matdata.dataset.read_ds_holdout(data_path, prefix=None, suffix='.csv', tid_col='tid', class_col=None, missing='-999', fold=None)
Read datasets for holdout validation from files in a directory.
Parameters:
- data_path : str
The path to the directory containing the dataset files.
- prefix : str, optional
The prefix of the dataset file names (default ‘specific’ according to this description; the signature above shows None).
- suffix : str, optional
The suffix of the dataset file names (default ‘.csv’).
- tid_col : str, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_col : str or None, optional
The name of the column representing class labels. If None, no class column is used (default None).
- missing : str, optional
The placeholder value used to denote missing data (default ‘-999’).
- fold : int or None, optional
The fold number to load for holdout validation; the files are read from the corresponding subdirectory (e.g., run1). If None, the files are read directly from data_path.
Returns:
- train : pandas.DataFrame
The training dataset.
- test : pandas.DataFrame
The testing dataset.
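The effect of the fold parameter on where files are looked up can be sketched as follows. The run<N> directory naming is inferred from the "run1" example in the parameter description and is an assumption; resolve_fold_dir is a hypothetical helper, and the names of the train and test files inside the directory are not covered here.

```python
import os


def resolve_fold_dir(data_path, fold=None):
    """Return the directory to read from (hypothetical helper)."""
    if fold is None:
        return data_path  # read files directly from data_path
    # Assumption: fold N lives in a subdirectory named "run<N>".
    return os.path.join(data_path, f"run{fold}")


base_dir = resolve_fold_dir("datasets")
first_fold_dir = resolve_fold_dir("datasets", fold=1)
```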
- matdata.dataset.read_ds_kfold(data_path, k=5, prefix='specific', suffix='.csv', tid_col='tid', class_col=None, missing='-999')
Read datasets for k-fold cross-validation from files in a directory.
Parameters:
- data_path : str
The path to the directory containing the dataset files.
- k : int, optional
The number of folds for cross-validation (default 5).
- prefix : str, optional
The prefix of the dataset file names (default ‘specific’).
- suffix : str, optional
The suffix of the dataset file names (default ‘.csv’).
- tid_col : str, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_col : str or None, optional
The name of the column representing class labels. If None, no class column is used (default None).
- missing : str, optional
The placeholder value used to denote missing data (default ‘-999’).
Returns:
- ktrain : list of pandas.DataFrame
The training datasets for each fold.
- ktest : list of pandas.DataFrame
The testing datasets for each fold.