Dataset

MAT-Tools: Python Framework for Multiple Aspect Trajectory Data Mining

This application provides a tool to support the user in preprocessing multiple aspect trajectory data. It integrates into a single framework for multiple aspect trajectory data mining and, more generally, for multidimensional sequence data mining methods. Copyright (C) 2022, MIT license (this portion of code is subject to licensing from source project distribution)

Created in Dec 2023. Copyright (C) 2023, License GPL Version 3 or later (see LICENSE file)

Authors:
  • Tarlis Portela

matdata.dataset.load_ds(dataset='mat.FoursquareNYC', prefix='', missing=None, sample_size=1, random_num=1, sort=True)

Load a dataset for training or testing from a GitHub repository.

Parameters:

dataset : str, optional

The name of the dataset to load (default ‘mat.FoursquareNYC’).

prefix : str, optional

The prefix to be added to the dataset file name (default ‘’).

missing : str, optional

The placeholder value used to denote missing data (default None).

sample_size : float, optional

The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).

random_num : int, optional

Random seed for reproducibility (default 1).

sort : bool, optional

Whether to sort the data (default True).

Returns:

pandas.DataFrame

The loaded dataset with optional sampling.

matdata.dataset.load_ds_holdout(dataset='mat.FoursquareNYC', train_size=0.7, prefix='', missing='-999', sample_size=1, random_num=1, sort=True)

Load a dataset for training and testing with a holdout method from a GitHub repository.

Parameters:

dataset : str, optional

The name of the dataset file to load from the GitHub repository (default ‘mat.FoursquareNYC’), formatted as category.DatasetName.

train_size : float, optional

The proportion of the dataset to include in the training set (default 0.7).

prefix : str, optional

The prefix to be added to the dataset file name (default ‘’).

missing : str, optional

The placeholder value used to denote missing data (default ‘-999’).

sample_size : float, optional

The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).

random_num : int, optional

Random seed for reproducibility (default 1).

sort : bool, optional

Whether to sort the data (default True).

Returns:

train : pandas.DataFrame

The training dataset.

test : pandas.DataFrame

The testing dataset.
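
To illustrate what a holdout split with train_size and random_num can look like for trajectory data, here is a minimal sketch. It is not the matdata implementation: the helper name holdout_by_trajectory and the toy columns are hypothetical, and it assumes the split is made over trajectory IDs so that all points of one trajectory stay on the same side.

```python
import pandas as pd

def holdout_by_trajectory(df, train_size=0.7, tid_col="tid", random_num=1):
    """Split df into (train, test), keeping each trajectory in exactly one part."""
    # Shuffle the unique trajectory IDs with a fixed seed for reproducibility.
    tids = pd.Series(df[tid_col].unique()).sample(frac=1, random_state=random_num)
    n_train = int(round(len(tids) * train_size))
    train_ids = set(tids.iloc[:n_train])
    in_train = df[tid_col].isin(train_ids)
    return df[in_train].reset_index(drop=True), df[~in_train].reset_index(drop=True)

# Toy data (hypothetical): 10 trajectories with 3 points each.
toy = pd.DataFrame({
    "tid": [t for t in range(10) for _ in range(3)],
    "poi": ["p%d" % i for i in range(30)],
})
train, test = holdout_by_trajectory(toy, train_size=0.7)
```

Splitting over IDs rather than rows avoids placing points of the same trajectory in both the training and the testing set.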

matdata.dataset.load_ds_kfold(dataset='mat.FoursquareNYC', k=5, prefix='', missing='-999', sample_size=1, random_num=1)

Load a dataset for k-fold cross-validation from a GitHub repository.

Parameters:

dataset : str, optional

The name of the dataset file to load from the GitHub repository (default ‘mat.FoursquareNYC’).

k : int, optional

The number of folds for cross-validation (default 5).

prefix : str, optional

The prefix to be added to the dataset file name (default ‘’).

missing : str, optional

The placeholder value used to denote missing data (default ‘-999’).

sample_size : float, optional

The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).

random_num : int, optional

Random seed for reproducibility (default 1).

Returns:

ktrain : list of pandas.DataFrame

The training datasets for each fold.

ktest : list of pandas.DataFrame

The testing datasets for each fold.
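
The shape of the returned ktrain and ktest lists can be pictured with the following sketch. It is an illustration under stated assumptions, not the library's code: the helper kfold_by_trajectory is hypothetical, folds are built over shuffled trajectory IDs, and stratification by class is not attempted.

```python
import pandas as pd

def kfold_by_trajectory(df, k=5, tid_col="tid", random_num=1):
    """Return (ktrain, ktest): two lists with one DataFrame per fold."""
    # Shuffle trajectory IDs once, then deal them round-robin into k folds.
    tids = pd.Series(df[tid_col].unique()).sample(frac=1, random_state=random_num).tolist()
    folds = [tids[i::k] for i in range(k)]
    ktrain, ktest = [], []
    for fold_ids in folds:
        in_test = df[tid_col].isin(fold_ids)
        ktest.append(df[in_test].reset_index(drop=True))
        ktrain.append(df[~in_test].reset_index(drop=True))
    return ktrain, ktest

# Toy data (hypothetical): 10 trajectories with 2 points each.
toy = pd.DataFrame({
    "tid": [t for t in range(10) for _ in range(2)],
    "hour": list(range(20)),
})
ktrain, ktest = kfold_by_trajectory(toy, k=5)
```

Each trajectory appears in exactly one test fold, and ktrain[i] holds every trajectory that is not in ktest[i].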

matdata.dataset.prepare_ds(df, tid_col='tid', class_col=None, sample_size=1, random_num=1, sort=True)

Prepare dataset for training or testing (helper function).

Parameters:

df : pandas.DataFrame

The DataFrame containing the dataset.

tid_col : str, optional

The name of the column representing trajectory IDs (default ‘tid’).

class_col : str or None, optional

The name of the column representing class labels. If None, no class column is used for ordering data (default None).

sample_size : float, optional

The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).

random_num : int, optional

Random seed for reproducibility (default 1).

sort : bool, optional

Whether to sort the data (default True).

Returns:

pandas.DataFrame

The prepared dataset with optional sampling.
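
As a rough picture of how sampling and ordering might combine, consider the sketch below. It is an assumption-laden illustration rather than the actual helper: sample_and_sort is a hypothetical name, sampling is assumed to pick a proportion of trajectory IDs, and ordering is assumed to go by class label first (when given) and trajectory ID second, preserving the point order inside each trajectory.

```python
import pandas as pd

def sample_and_sort(df, tid_col="tid", class_col=None, sample_size=1, random_num=1, sort=True):
    """Optionally keep a proportion of trajectories, then order the rows."""
    if sample_size < 1:
        # Sample whole trajectories so that no trajectory is truncated.
        tids = pd.Series(df[tid_col].unique())
        kept = tids.sample(frac=sample_size, random_state=random_num)
        df = df[df[tid_col].isin(kept)]
    if sort:
        keys = [class_col, tid_col] if class_col else [tid_col]
        # A temporary position column keeps the original point order within a trajectory.
        df = (df.assign(_pos=range(len(df)))
                .sort_values(keys + ["_pos"])
                .drop(columns="_pos"))
    return df.reset_index(drop=True)

# Toy data (hypothetical): 4 trajectories with 2 points each.
toy = pd.DataFrame({
    "tid":   [3, 3, 1, 1, 2, 2, 4, 4],
    "label": ["b", "b", "a", "a", "a", "a", "b", "b"],
    "hour":  [9, 10, 8, 9, 7, 8, 12, 13],
})
full = sample_and_sort(toy, class_col="label")
half = sample_and_sort(toy, class_col="label", sample_size=0.5)
```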

matdata.dataset.read_ds(data_file, tid_col='tid', class_col=None, missing='-999', sample_size=1, random_num=1)

Read a dataset from a file.

Parameters:

data_file : str

The path to the dataset file.

tid_col : str, optional

The name of the column representing trajectory IDs (default ‘tid’).

class_col : str or None, optional

The name of the column representing class labels. If None, no class column is used (default None).

missing : str, optional

The placeholder value used to denote missing data (default ‘-999’).

sample_size : float, optional

The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).

random_num : int, optional

Random seed for reproducibility (default 1).

Returns:

pandas.DataFrame

The read dataset.
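
The role of the missing placeholder can be sketched with plain pandas. This is one plausible reading, not a description of what read_ds does internally: the helper read_with_placeholder is hypothetical, and it assumes that cells without a value in a CSV file are filled with the placeholder string.

```python
import io
import pandas as pd

def read_with_placeholder(source, missing="-999"):
    """Read a CSV and replace empty cells with a placeholder string."""
    # dtype=str keeps every column as text, so one string placeholder fits all columns.
    df = pd.read_csv(source, dtype=str)
    return df.fillna(missing)

# In-memory CSV standing in for a dataset file; the second row has two empty cells.
raw = io.StringIO("tid,poi,rating\n1,Cafe,4.5\n1,,\n2,Park,3.0\n")
df = read_with_placeholder(raw)
```

A single explicit placeholder lets downstream methods treat missing values uniformly across numeric and categorical aspects.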

matdata.dataset.read_ds_5fold(data_path, prefix='specific', suffix='.csv', tid_col='tid', class_col=None, missing='-999')

Read datasets for 5-fold cross-validation from files in a directory.

See also

read_ds_kfold

Read datasets for k-fold cross-validation.

Parameters:

data_path : str

The path to the directory containing the dataset files.

prefix : str, optional

The prefix of the dataset file names (default ‘specific’).

suffix : str, optional

The suffix of the dataset file names (default ‘.csv’).

tid_col : str, optional

The name of the column representing trajectory IDs (default ‘tid’).

class_col : str or None, optional

The name of the column representing class labels. If None, no class column is used (default None).

missing : str, optional

The placeholder value used to denote missing data (default ‘-999’).

Returns:

5_train : list of pandas.DataFrame

The training datasets for each fold.

5_test : list of pandas.DataFrame

The testing datasets for each fold.

matdata.dataset.read_ds_holdout(data_path, prefix=None, suffix='.csv', tid_col='tid', class_col=None, missing='-999', fold=None)

Read datasets for holdout validation from files in a directory.

Parameters:

data_path : str

The path to the directory containing the dataset files.

prefix : str, optional

The prefix of the dataset file names (default None).

suffix : str, optional

The suffix of the dataset file names (default ‘.csv’).

tid_col : str, optional

The name of the column representing trajectory IDs (default ‘tid’).

class_col : str or None, optional

The name of the column representing class labels. If None, no class column is used (default None).

missing : str, optional

The placeholder value used to denote missing data (default ‘-999’).

fold : int or None, optional

The fold number to load for holdout validation; the files are then read from the corresponding subdirectory (e.g., run1). If None, the files are read directly from data_path (default None).

Returns:

train : pandas.DataFrame

The training dataset.

test : pandas.DataFrame

The testing dataset.
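
How data_path, prefix, suffix and fold might combine into file locations is sketched below. The layout is an assumption made purely for illustration: the helper holdout_paths is hypothetical, the "run<fold>" subdirectory follows the run1 example above, and the "train"/"test" file stems (joined to the prefix with an underscore) are guesses that the documentation does not confirm.

```python
import os

def holdout_paths(data_path, prefix=None, suffix=".csv", fold=None):
    """Build hypothetical (train_path, test_path) for a holdout directory layout."""
    # Assumption: fold n lives in a "run<n>" subdirectory of data_path.
    folder = os.path.join(data_path, "run" + str(fold)) if fold is not None else data_path
    # Assumption: file stems are "train"/"test", optionally preceded by "<prefix>_".
    stem = prefix + "_" if prefix else ""
    train_path = os.path.join(folder, stem + "train" + suffix)
    test_path = os.path.join(folder, stem + "test" + suffix)
    return train_path, test_path

plain = holdout_paths("data")
run1 = holdout_paths("data", prefix="specific", fold=1)
```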

matdata.dataset.read_ds_kfold(data_path, k=5, prefix='specific', suffix='.csv', tid_col='tid', class_col=None, missing='-999')

Read datasets for k-fold cross-validation from files in a directory.

Parameters:

data_path : str

The path to the directory containing the dataset files.

k : int, optional

The number of folds for cross-validation (default 5).

prefix : str, optional

The prefix of the dataset file names (default ‘specific’).

suffix : str, optional

The suffix of the dataset file names (default ‘.csv’).

tid_col : str, optional

The name of the column representing trajectory IDs (default ‘tid’).

class_col : str or None, optional

The name of the column representing class labels. If None, no class column is used (default None).

missing : str, optional

The placeholder value used to denote missing data (default ‘-999’).

Returns:

ktrain : list of pandas.DataFrame

The training datasets for each fold.

ktest : list of pandas.DataFrame

The testing datasets for each fold.

matdata.dataset.repository_datasets()

Read the datasets available in the repository and organize them by category.

Returns:

dict

A dictionary containing lists of datasets, where each category is a key.
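
The returned structure, a dictionary keyed by category, relates to the category.DatasetName naming used by the loaders roughly as in this sketch. It only illustrates the data shape: group_by_category is a hypothetical helper, the names other than mat.FoursquareNYC are made up, and whether the real lists hold bare or fully qualified names is not stated here.

```python
from collections import defaultdict

def group_by_category(names):
    """Group 'category.DatasetName' strings into {category: [DatasetName, ...]}."""
    grouped = defaultdict(list)
    for name in names:
        # Split on the first dot only, so dataset names may contain further dots.
        category, _, dataset = name.partition(".")
        grouped[category].append(dataset)
    return dict(grouped)

# 'mat.FoursquareNYC' comes from the defaults above; the other names are hypothetical.
catalog = group_by_category(["mat.FoursquareNYC", "mat.ExampleA", "raw.ExampleB"])
```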