matdata package

Subpackages

Submodules

matdata.converter module

MAT-Tools: Python Framework for Multiple Aspect Trajectory Data Mining

This application provides tools that support the preprocessing of multiple aspect trajectory data. It is part of a unified framework for mining multiple aspect trajectories and, more generally, multidimensional sequence data. Copyright (C) 2022, MIT license (this portion of code is subject to licensing from source project distribution)

Created on Dec, 2023 Copyright (C) 2023, License GPL Version 3 or superior (see LICENSE file)

Authors:
  • Tarlis Portela

matdata.converter.any2ts(data_path, folder, file, cols=None, tid_col='tid', class_col='label', opLabel='Converting TS')[source]

Converts data from various formats (CSV, Parquet, etc.) to a time series format.

Parameters:

data_pathstr

The directory path where the data files are located.

folderstr

The folder containing the data file to be converted.

filestr

The name of the data file to be converted.

colslist of str, optional

A list of column names to be included in the time series data.

tid_colstr, optional (default=’tid’)

The name of the column to be used as the trajectory identifier.

class_colstr, optional (default=’label’)

The name of the column to be treated as the class/label column.

opLabelstr, optional (default=’Converting TS’)

A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame

A DataFrame containing the time series data, with trajectory identifier, class label, and specified columns.

matdata.converter.csv2df(url, class_col='label', tid_col='tid', missing=None)[source]

Converts a CSV file from a given URL into a pandas DataFrame.

Parameters:

urlstr

The URL pointing to the CSV file to be read.

class_colstr, optional (default=’label’)

Unused; kept so that the converter functions share a common signature.

tid_colstr, optional (default=’tid’)

Unused; kept so that the converter functions share a common signature.

missingstr, optional (default=None)

The placeholder for missing values in the CSV file.

Returns:

pandas.DataFrame

A DataFrame containing the data from the CSV file, with missing values handled as specified and columns renamed if necessary.
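As a rough illustration of what handling missing values with a placeholder can look like, the sketch below parses CSV text with the standard library and substitutes a placeholder for empty cells. It is not the matdata implementation: the helper name `fill_missing` and the sample data are hypothetical, and treating an empty string as a missing value is an assumption.

```python
import csv
import io

def fill_missing(csv_text, missing='?'):
    """Parse CSV text and replace empty cells with the `missing` placeholder."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        # Assumption: an empty string marks a missing value.
        rows.append({k: (v if v != '' else missing) for k, v in row.items()})
    return rows

# Hypothetical trajectory data with one empty 'poi' cell.
sample = "tid,label,poi\n1,A,Home\n1,A,\n2,B,Work\n"
rows = fill_missing(sample)
```

Here the second point of trajectory 1 ends up with the placeholder in its `poi` field, while all other cells are left unchanged.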

matdata.converter.descTypes(df)[source]
matdata.converter.df2csv(df, data_path, file='train', tid_col='tid', class_col='label', select_cols=None, opLabel='Writing CSV')[source]

Writes a pandas DataFrame to a CSV file.

Parameters:

dfpandas.DataFrame

The DataFrame to be written to the CSV file.

data_pathstr

The directory path where the CSV file will be saved.

filestr, optional (default=’train’)

The base name of the CSV file (without extension).

tid_colstr, optional (default=’tid’)

The name of the column to be used as the trajectory identifier.

class_colstr, optional (default=’label’)

The name of the column to be treated as the class/label column.

select_colslist of str, optional

A list of column names to be included in the CSV file. If None, all columns are included.

opLabelstr, optional (default=’Writing CSV’)

A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame

The input DataFrame

matdata.converter.df2mat(df, folder, file, cols=None, mat_cols=None, desc_cols=None, label_columns=None, other_dsattrs=None, tid_col='tid', class_col='label', opLabel='Converting MAT')[source]

Converts a pandas DataFrame to a Multiple Aspect Trajectory .mat file and saves it to the specified folder.

Parameters:

dfpandas.DataFrame

The DataFrame to be converted to a .mat file.

folderstr

The directory where the .mat file will be saved.

filestr

The base name of the .mat file (without extension).

colslist of str, optional

A list of column names from the DataFrame to include in the .mat file. If None, all columns are included.

mat_colslist of str, optional

A list of column names representing the trajectory attributes. If None, no columns are used.

desc_colslist of str, optional

A dict of column descriptors to be included as descriptive metadata.

label_columnslist of str, optional

A list of column names that can be treated as labels in the .mat file.

other_dsattrsdict, optional

A dictionary of additional dataset attributes to be included in the .mat file.

tid_colstr, optional (default=’tid’)

The name of the column to be used as the trajectory identifier.

class_colstr, optional (default=’label’)

The name of the column to be treated as the class/label column.

opLabelstr, optional (default=’Converting MAT’)

A label describing the operation, useful for logging or display purposes.

Returns:

None

matdata.converter.df2parquet(df, data_path, file='train', tid_col='tid', class_col='label', select_cols=None, opLabel='Writing Parquet')[source]

Writes a pandas DataFrame to a Parquet file.

Parameters:

dfpandas.DataFrame

The DataFrame to be written to the Parquet file.

data_pathstr

The directory path where the Parquet file will be saved.

filestr, optional (default=’train’)

The base name of the Parquet file (without extension).

tid_colstr, optional (default=’tid’)

The name of the column to be used as the trajectory identifier.

class_colstr, optional (default=’label’)

The name of the column to be treated as the class/label column.

select_colslist of str, optional

A list of column names to be included in the Parquet file. If None, all columns are included.

opLabelstr, optional (default=’Writing Parquet’)

A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame

The input DataFrame

matdata.converter.df2zip(df, data_path, file, tid_col='tid', class_col='label', select_cols=None, opLabel='Writing ZIP')[source]

Writes a pandas DataFrame to a CSV file and compresses it into a ZIP archive.

  • This format is used by older movelet methods, such as Movelets, MasterMovelets, SuperMovelets, and the Dodge, Xiao, and Zheng feature extractors. In this format, all ‘,’ (commas) in the data are replaced with ‘_’ to avoid problems when reading the CSV trajectory files.

Parameters:

dfpandas.DataFrame

The DataFrame to be written to the CSV file and then compressed into a ZIP archive.

data_pathstr

The directory path where the ZIP archive will be saved.

filestr

The base name of the CSV file (without extension) to be compressed into the ZIP archive.

tid_colstr, optional (default=’tid’)

The name of the column to be used as the trajectory identifier.

class_colstr, optional (default=’label’)

The name of the column to be treated as the class/label column.

select_colslist of str, optional

A list of column names to be included in the CSV file. If None, all columns are included.

opLabelstr, optional (default=’Writing ZIP’)

A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame

The input DataFrame
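The comma replacement described above can be sketched with the standard library alone. This is an illustrative approximation, not the code used by df2zip: the helper `rows_to_zip_bytes`, the member name `train.csv`, and the single-file archive layout are all assumptions.

```python
import csv
import io
import zipfile

def rows_to_zip_bytes(header, rows, member_name='train.csv'):
    """Write rows as CSV text, replacing commas inside values with '_', then zip it."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(header)
    for row in rows:
        # Replace commas so that readers splitting on ',' keep the columns aligned.
        writer.writerow([str(value).replace(',', '_') for value in row])
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member_name, buffer.getvalue())
    return archive.getvalue()

header = ['tid', 'label', 'poi']
rows = [(1, 'A', 'Cafe, Downtown'), (1, 'A', 'Home')]
payload = rows_to_zip_bytes(header, rows)

# Read the archive back to inspect the sanitized CSV text.
with zipfile.ZipFile(io.BytesIO(payload)) as zf:
    text = zf.read('train.csv').decode('utf-8')
```

After the replacement, every line splits into exactly three fields, which is the property that simple CSV readers rely on.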

matdata.converter.mat2df(url, class_col='label', tid_col='tid', missing=None)[source]

Converts a MATLAB .mat file from a given URL into a pandas DataFrame.

Parameters:

urlstr

The URL pointing to the .mat file to be read.

class_colstr, optional (default=’label’)

The name of the column to be treated as the class/label column.

tid_colstr, optional (default=’tid’)

The name of the column to be used as the unique trajectory identifier.

missingstr, optional (default=None)

The placeholder for missing values in the dataset.

Returns:

pandas.DataFrame

A DataFrame containing the data from the .mat file, with missing values handled as specified and columns renamed if necessary.

Raises:

Exception

Raised because this conversion is not implemented.

matdata.converter.parquet2df(url, class_col='label', tid_col='tid', missing=None)[source]

Converts a Parquet file from a given URL into a pandas DataFrame.

Parameters:

urlstr

The URL pointing to the Parquet file to be read.

class_colstr, optional (default=’label’)

Unused; kept so that the converter functions share a common signature.

tid_colstr, optional (default=’tid’)

Unused; kept so that the converter functions share a common signature.

missingstr, optional (default=None)

The placeholder for missing values in the dataset.

Returns:

pandas.DataFrame

A DataFrame containing the data from the Parquet file, with missing values handled as specified and columns renamed if necessary.

matdata.converter.read_zip(zipFile, cols=None, class_col='label', tid_col='tid', opLabel='Reading ZIP')[source]
matdata.converter.ts2df(url, class_col='label', tid_col='tid', missing=None)[source]

Converts a time series file from a given URL into a pandas DataFrame.

Parameters:

urlstr

The URL pointing to the time series file to be read.

class_colstr, optional (default=’label’)

The name of the column to be treated as the class/label column.

tid_colstr, optional (default=’tid’)

The name of the column to be used as the unique trajectory identifier.

missingstr, optional (default=None)

The placeholder for missing values in the dataset.

Returns:

pandas.DataFrame

A DataFrame containing the data from the time series file, with missing values handled as specified and columns renamed if necessary.

matdata.converter.xes2df(url, class_col='label', tid_col='tid', missing=None, opLabel='Converting XES', save=False, start_tid=1)[source]

Converts an XES (eXtensible Event Stream) file from a given URL into a pandas DataFrame.

Parameters:

urlstr

The URL pointing to the XES file to be read.

class_colstr, optional (default=’label’)

The name of the column to be treated as the class/label column.

tid_colstr, optional (default=’tid’)

The name of the column to be used as the trajectory identifier.

missingstr, optional (default=None)

The placeholder for missing values in the dataset.

opLabelstr, optional (default=’Converting XES’)

A label describing the operation, useful for logging or display purposes.

savebool, optional (default=False)

A flag indicating whether to save the DataFrame to a file after conversion.

start_tidint, optional (default=1)

The starting value for the generated trajectory identifiers; tid_col values must be created during the conversion.

Returns:

pandas.DataFrame

A DataFrame containing the data from the XES file, with columns renamed if necessary.

matdata.converter.zip2arf(folder, file, cols, tid_col='tid', class_col='label', missing='?', opLabel='Reading ZIP')[source]

Extracts a CSV file from a ZIP archive and converts it into an ARFF (Attribute-Relation File Format) file.

Parameters:

folderstr

The directory path where the ZIP archive is located.

filestr

The name of the ZIP archive file (with or without extension).

colslist of str

A list of column names to be included in the ARFF file.

tid_colstr, optional (default=’tid’)

The name of the column to be used as the trajectory identifier.

class_colstr, optional (default=’label’)

The name of the column to be treated as the class/label column.

missingstr, optional (default=’?’)

The placeholder for missing values in the CSV file.

opLabelstr, optional (default=’Reading ZIP’)

A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame

A DataFrame containing the data from the extracted ZIP file, with missing values handled as specified and columns renamed if necessary.
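For readers unfamiliar with ARFF, the sketch below renders a few rows in that layout: a relation name, one `@attribute` declaration per column, and a `@data` section in which `?` marks a missing value. It is a simplified illustration, not the output of zip2arf: the helper `to_arff` is hypothetical, every column is declared nominal, and values containing spaces or commas would additionally need quoting.

```python
def to_arff(relation, columns, rows, missing='?'):
    """Render rows as ARFF text; every column is declared as a nominal attribute."""
    lines = ['@relation ' + relation, '']
    for index, name in enumerate(columns):
        # Collect the distinct non-missing values of this column, in sorted order.
        values = sorted({str(row[index]) for row in rows if row[index] is not None})
        lines.append('@attribute %s {%s}' % (name, ','.join(values)))
    lines += ['', '@data']
    for row in rows:
        # None is written as the missing-value placeholder.
        lines.append(','.join(missing if v is None else str(v) for v in row))
    return '\n'.join(lines) + '\n'

arff = to_arff('trajectories', ['tid', 'poi', 'label'],
               [(1, 'Home', 'A'), (1, None, 'A'), (2, 'Work', 'B')])
```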

matdata.converter.zip2csv(folder, file, cols, class_col='label', tid_col='tid', missing='?')[source]

Extracts the trajectory CSV files from a ZIP archive, compiles them, and converts the result into a pandas DataFrame.

Parameters:

folderstr

The directory path where the ZIP archive is located and where the resulting CSV file is written.

filestr

The name of the ZIP archive file (with or without extension).

colslist of str

A list of column names to be included in the DataFrame.

class_colstr, optional (default=’label’)

The name of the column to be treated as the class/label column.

tid_colstr, optional (default=’tid’)

The name of the column to be used as the trajectory identifier.

missingstr, optional (default=’?’)

The placeholder for missing values in the CSV file.

Returns:

pandas.DataFrame

A DataFrame containing the data from the extracted CSV file, with missing values handled as specified and columns renamed if necessary.

matdata.converter.zip2df(url, class_col='label', tid_col='tid', missing='?', opLabel='Reading ZIP')[source]

Extracts and converts a CSV trajectory file from a ZIP archive located at a given URL into a pandas DataFrame.

Parameters:

urlstr

The URL pointing to the ZIP archive containing the CSV file to be read.

class_colstr, optional (default=’label’)

The name of the column to be treated as the class/label column.

tid_colstr, optional (default=’tid’)

The name of the column to be used as the unique trajectory identifier.

missingstr, optional (default=’?’)

The placeholder for missing values in the CSV file.

opLabelstr, optional (default=’Reading ZIP’)

A label describing the operation, for logging purposes.

Returns:

pandas.DataFrame

A DataFrame containing the data from the extracted CSV file, with missing values handled as specified and columns renamed if necessary.

matdata.dataset module

matdata.dataset.load_ds(dataset='mat.FoursquareNYC', prefix='', missing=None, sample_size=1, random_num=1, sort=True)[source]

Load a dataset for training or testing from a GitHub repository.

Parameters:

datasetstr, optional

The name of the dataset to load (default ‘mat.FoursquareNYC’).

prefixstr, optional

The prefix to be added to the dataset file name (default ‘’).

missingstr, optional

The placeholder value used to denote missing data (default None).

sample_sizefloat, optional

The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).

random_numint, optional

Random seed for reproducibility (default 1).

sortbool, optional

Whether to sort the resulting data (default True).

Returns:

pandas.DataFrame

The loaded dataset with optional sampling.

matdata.dataset.load_ds_holdout(dataset='mat.FoursquareNYC', train_size=0.7, prefix='', missing='-999', sample_size=1, random_num=1, sort=True)[source]

Load a dataset for training and testing with a holdout method from a GitHub repository.

Parameters:

datasetstr, optional

The name of the dataset file to load from the GitHub repository (default ‘mat.FoursquareNYC’). Use the format category.DatasetName.

train_sizefloat, optional

The proportion of the dataset to include in the training set (default 0.7).

prefixstr, optional

The prefix to be added to the dataset file name (default ‘’).

missingstr, optional

The placeholder value used to denote missing data (default ‘-999’).

sample_sizefloat, optional

The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).

random_numint, optional

Random seed for reproducibility (default 1).

sortbool, optional

Whether to sort the resulting data (default True).

Returns:

trainpandas.DataFrame

The training dataset.

testpandas.DataFrame

The testing dataset.
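A holdout split of trajectory data is usually made over trajectory identifiers rather than over individual points, so that no trajectory is divided between the two sets. The sketch below shows that idea with plain Python dictionaries. It is an assumption about the general technique, not the code behind load_ds_holdout, which may also stratify by class; the helper `holdout_split` and the toy data are hypothetical.

```python
import random

def holdout_split(rows, train_size=0.7, tid_col='tid', random_num=1):
    """Split rows into train and test so that every trajectory stays in one set."""
    tids = sorted({row[tid_col] for row in rows})
    rng = random.Random(random_num)   # fixed seed for reproducibility
    rng.shuffle(tids)
    cut = int(round(len(tids) * train_size))
    train_tids = set(tids[:cut])
    train = [row for row in rows if row[tid_col] in train_tids]
    test = [row for row in rows if row[tid_col] not in train_tids]
    return train, test

# Ten toy trajectories with three points each.
rows = [{'tid': t, 'label': 'A' if t % 2 else 'B', 'point': p}
        for t in range(1, 11) for p in range(3)]
train, test = holdout_split(rows)
```

With train_size=0.7, seven of the ten trajectories (21 points) go to the training set and the remaining three (9 points) to the test set.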

matdata.dataset.load_ds_kfold(dataset='mat.FoursquareNYC', k=5, prefix='', missing='-999', sample_size=1, random_num=1)[source]

Load a dataset for k-fold cross-validation from a GitHub repository.

Parameters:

datasetstr, optional

The name of the dataset file to load from the GitHub repository (default ‘mat.FoursquareNYC’).

kint, optional

The number of folds for cross-validation (default 5).

prefixstr, optional

The prefix to be added to the dataset file name (default ‘’).

missingstr, optional

The placeholder value used to denote missing data (default ‘-999’).

sample_sizefloat, optional

The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).

random_numint, optional

Random seed for reproducibility (default 1).

Returns:

ktrainlist of pandas.DataFrame

The training datasets for each fold.

ktestlist of pandas.DataFrame

The testing datasets for each fold.
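The relation between the two returned lists can be illustrated with a small k-fold split over trajectory identifiers: fold i of the test list holds one k-th of the trajectories, and fold i of the training list holds all the others. This is a generic sketch under that assumption, not the matdata implementation; `kfold_split` is a hypothetical helper.

```python
import random

def kfold_split(tids, k=5, random_num=1):
    """Return (train_tids, test_tids) pairs, one per fold."""
    tids = sorted(tids)
    random.Random(random_num).shuffle(tids)
    folds = [tids[i::k] for i in range(k)]   # round-robin assignment to k folds
    splits = []
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        splits.append((train, test))
    return splits

splits = kfold_split(range(1, 21), k=5)
ktrain = [train for train, _ in splits]
ktest = [test for _, test in splits]
```

Each trajectory identifier appears in exactly one test fold and in the training data of the other four folds.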

matdata.dataset.prepare_ds(df, tid_col='tid', class_col=None, sample_size=1, random_num=1, sort=True)[source]

Prepare dataset for training or testing (helper function).

Parameters:

dfpandas.DataFrame

The DataFrame containing the dataset.

tid_colstr, optional

The name of the column representing trajectory IDs (default ‘tid’).

class_colstr or None, optional

The name of the column representing class labels. If None, no class column is used for ordering data (default None).

sample_sizefloat, optional

The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).

random_numint, optional

Random seed for reproducibility (default 1).

sortbool, optional

Whether to sort the resulting data (default True).

Returns:

pandas.DataFrame

The prepared dataset with optional sampling.
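The following sketch shows one plausible reading of the sample_size, random_num, and sort parameters: keep a fraction of the trajectories of each class, chosen with a seeded generator, and then order the result by class and trajectory ID. Sampling per class and the exact sort order are assumptions, and `sample_trajectories` is a hypothetical helper rather than the prepare_ds code.

```python
import random
from collections import defaultdict

def sample_trajectories(rows, sample_size=1, tid_col='tid', class_col='label',
                        random_num=1, sort=True):
    """Keep a `sample_size` fraction of the trajectories of each class."""
    by_class = defaultdict(set)
    for row in rows:
        by_class[row[class_col]].add(row[tid_col])
    rng = random.Random(random_num)
    keep = set()
    for label in sorted(by_class):
        tids = sorted(by_class[label])
        n = max(1, int(len(tids) * sample_size))   # keep at least one per class
        keep.update(rng.sample(tids, n))
    sampled = [row for row in rows if row[tid_col] in keep]
    if sort:
        # Order by class, then trajectory ID; sorted() is stable, so point order is kept.
        sampled = sorted(sampled, key=lambda r: (r[class_col], r[tid_col]))
    return sampled

# Twenty toy trajectories of two points each, split evenly between two classes.
rows = [{'tid': t, 'label': 'A' if t <= 10 else 'B', 'point': p}
        for t in range(1, 21) for p in range(2)]
half = sample_trajectories(rows, sample_size=0.5)
```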

matdata.dataset.read_ds(data_file, tid_col='tid', class_col=None, missing='-999', sample_size=1, random_num=1)[source]

Read a dataset from a file.

Parameters:

data_filestr

The path to the dataset file.

tid_colstr, optional

The name of the column representing trajectory IDs (default ‘tid’).

class_colstr or None, optional

The name of the column representing class labels. If None, no class column is used (default None).

missingstr, optional

The placeholder value used to denote missing data (default ‘-999’).

sample_sizefloat, optional

The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).

random_numint, optional

Random seed for reproducibility (default 1).

Returns:

pandas.DataFrame

The read dataset.

matdata.dataset.read_ds_5fold(data_path, prefix='specific', suffix='.csv', tid_col='tid', class_col=None, missing='-999')[source]

Read datasets for 5-fold cross-validation from files in a directory.

See also

read_ds_kfold

Read datasets for k-fold cross-validation.

Parameters:

data_path

str The path to the directory containing the dataset files.

prefix

str, optional The prefix of the dataset file names (default ‘specific’).

suffix

str, optional The suffix of the dataset file names (default ‘.csv’).

tid_col

str, optional The name of the column representing trajectory IDs (default ‘tid’).

class_col

str or None, optional The name of the column representing class labels. If None, no class column is used (default None).

missing

str, optional The placeholder value used to denote missing data (default ‘-999’).

Returns:

5_train

list of pandas.DataFrame The training datasets for each fold.

5_test

list of pandas.DataFrame The testing datasets for each fold.

matdata.dataset.read_ds_holdout(data_path, prefix=None, suffix='.csv', tid_col='tid', class_col=None, missing='-999', fold=None)[source]

Read datasets for holdout validation from files in a directory.

Parameters:

data_pathstr

The path to the directory containing the dataset files.

prefixstr, optional

The prefix of the dataset file names (default None).

suffixstr, optional

The suffix of the dataset file names (default ‘.csv’).

tid_colstr, optional

The name of the column representing trajectory IDs (default ‘tid’).

class_colstr or None, optional

The name of the column representing class labels. If None, no class column is used (default None).

missingstr, optional

The placeholder value used to denote missing data (default ‘-999’).

foldint or None, optional

The fold number to load for holdout validation; it selects the corresponding subdirectory (e.g. run1). If None, the files are read directly from data_path.

Returns:

trainpandas.DataFrame

The training dataset.

testpandas.DataFrame

The testing dataset.

matdata.dataset.read_ds_kfold(data_path, k=5, prefix='specific', suffix='.csv', tid_col='tid', class_col=None, missing='-999')[source]

Read datasets for k-fold cross-validation from files in a directory.

Parameters:

data_pathstr

The path to the directory containing the dataset files.

kint, optional

The number of folds for cross-validation (default 5).

prefixstr, optional

The prefix of the dataset file names (default ‘specific’).

suffixstr, optional

The suffix of the dataset file names (default ‘.csv’).

tid_colstr, optional

The name of the column representing trajectory IDs (default ‘tid’).

class_colstr or None, optional

The name of the column representing class labels. If None, no class column is used (default None).

missingstr, optional

The placeholder value used to denote missing data (default ‘-999’).

Returns:

ktrainlist of pandas.DataFrame

The training datasets for each fold.

ktestlist of pandas.DataFrame

The testing datasets for each fold.

matdata.dataset.repository_datasets()[source]

Read the datasets available in the repository and organize them by category.

Returns:

dict

A dictionary containing lists of datasets, where each category is a key.
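The shape of such a dictionary, and its relation to the category.DatasetName format used by the loading functions, can be sketched as follows. The helper `group_by_category` is hypothetical, and all dataset names other than the documented default `mat.FoursquareNYC` are made up for the example.

```python
from collections import defaultdict

def group_by_category(names):
    """Group 'category.DatasetName' strings into {category: [DatasetName, ...]}."""
    grouped = defaultdict(list)
    for name in names:
        # Split only at the first dot: the category comes first, the name second.
        category, _, dataset = name.partition('.')
        grouped[category].append(dataset)
    return dict(grouped)

# 'mat.FoursquareNYC' is the documented default; the other names are invented.
catalog = group_by_category(['mat.FoursquareNYC', 'mat.ExampleA', 'raw.ExampleB'])
```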

matdata.dataset.translateCategory(dataset, category, descName=None)[source]
matdata.dataset.translateDesc(dataset, category, descName)[source]

matdata.generator module

class matdata.generator.AttributeGenerator(name='attr', atype='nominal', method='random', interval=None, n=-1, dependency=None, cellSize=1, adjacents=None, precision=2)[source]

Bases: object

descType()[source]
next()[source]
nextn(n)[source]
class matdata.generator.NominalGenerator(method='random', n=50, interval=None)[source]

Bases: object

next()[source]
nextn(n)[source]
static nominalInterval(n)[source]
class matdata.generator.NumericGenerator(method='random', start=0, end=100, precision=2)[source]

Bases: object

next()[source]
nextn(n)[source]
class matdata.generator.SpatialGrid2D(X=(1, 5), Y=(1, 5), cellSize=1, spatial_adjacents=None, precision=2, dependency=[])[source]

Bases: object

SPATIAL_ADJACENTS_1 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
SPATIAL_ADJACENTS_2 = [(-2, -2), (-2, -1), (-2, 0), (-2, 1), (-2, 2), (-1, -2), (-1, -1), (-1, 0), (-1, 1), (-2, 2), (0, -2), (0, -1), (0, 1), (-2, 2), (1, -2), (1, -1), (1, 0), (1, 1), (-2, 2), (2, -2), (2, -1), (2, 0), (2, 1), (-2, 2)]
adjacents(cell)[source]
next()[source]
nextin(cell)[source]
nextn(n)[source]
position(x, y)[source]
randomRoute(startCell, n)[source]
size()[source]
text(point)[source]
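The adjacency constants above are lists of (dx, dy) offsets around a grid cell. As a sketch of how such offsets can be produced and applied, the code below generates the offsets for a given radius and clips the resulting neighbors to the grid bounds. The helpers `neighbor_offsets` and `adjacent_cells` are hypothetical, and zero-based cell coordinates are an assumption; SpatialGrid2D itself may index and clip cells differently.

```python
def neighbor_offsets(radius=1):
    """All (dx, dy) offsets within `radius` cells, excluding the cell itself."""
    span = range(-radius, radius + 1)
    return [(dx, dy) for dx in span for dy in span if (dx, dy) != (0, 0)]

def adjacent_cells(cell, width, height, offsets):
    """Cells reachable from `cell` through `offsets`, clipped to the grid bounds."""
    x, y = cell
    return [(x + dx, y + dy) for dx, dy in offsets
            if 0 <= x + dx < width and 0 <= y + dy < height]

offsets_1 = neighbor_offsets(1)   # same eight offsets as SPATIAL_ADJACENTS_1
corner = adjacent_cells((0, 0), 5, 5, offsets_1)
center = adjacent_cells((2, 2), 5, 5, offsets_1)
```

A corner cell of a 5 x 5 grid has three neighbors, while an interior cell has all eight.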
matdata.generator.cycleGenerators(L, generators)[source]
matdata.generator.default_types()[source]
matdata.generator.getMiddleE(X)[source]
matdata.generator.getSamplingData(base_data, cols_for_sampling)[source]
matdata.generator.getScale(start=100, n_ele=10)[source]
matdata.generator.instantiate_generators(attr_desc=[{'atype': 'space', 'interval': [(0.0, 1000.0), (0.0, 1000.0)], 'method': 'grid_cell', 'name': 'space'}, {'atype': 'time', 'interval': [0, 1440], 'method': 'random', 'name': 'time'}, {'atype': 'numeric', 'interval': [-1000, 1000], 'method': 'random', 'name': 'n1'}, {'atype': 'numeric', 'interval': [0.0, 1000.0], 'method': 'random', 'name': 'n2'}, {'atype': 'nominal', 'method': 'random', 'n': 1000, 'name': 'nominal'}, {'atype': 'day', 'interval': ['Monday', 'Tuesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Wednesday'], 'method': 'random', 'name': 'day'}, {'atype': 'weather', 'interval': ['Clear', 'Clouds', 'Fog', 'Unknown', 'Rain', 'Snow'], 'method': 'random', 'name': 'weather'}, {'atype': 'category', 'dependency': 'space', 'interval': ['Residence', 'Food', 'Travel & Transport', 'Professional & Other Places', 'Shop & Service', 'Outdoors & Recreation', 'College & University', 'Arts & Entertainment', 'Nightlife Spot', 'Event'], 'method': 'random', 'name': 'category'}])[source]
matdata.generator.randomGenerator(N=10, M=50, L=10, C=10, random_seed=1, fileprefix='random', fileposfix='train', attr_desc=None, save_to=False, outformats=['csv'])[source]

Function to generate trajectories based on random data.

Parameters:

Nint, optional

Number of trajectories (default 10)

Mint, optional

Size of trajectories (default 50)

Lint, optional

Number of attributes (default 10)

Cint, optional

Number of classes (default 10)

random_seedint, optional

Random Seed (default 1)

attr_desclist of dict, optional

Attribute descriptions used to generate the data, given either as a list of descriptive dicts or as a list of AttributeGenerator instances. Default: None (uses the default types)

save_tostr or bool, optional

Destination folder for the output files, or False to skip saving the CSV files (default False)

fileprefixstr, optional

Output filename prefix (default ‘random’)

fileposfixstr, optional

Output filename postfix (default ‘train’)

outformatslist, optional

Output file formats for saving (default [‘csv’])

Returns:

pandas.DataFrame

The generated dataset.
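To make the roles of N, M, C, and random_seed concrete, the sketch below generates a small random dataset with the standard library. It is only an approximation of the idea: `random_trajectories` is a hypothetical helper, the three attributes are a subset loosely based on the default attribute descriptions listed for instantiate_generators, and assigning class labels by cycling through the trajectories is an assumption.

```python
import random

def random_trajectories(N=10, M=50, C=10, random_seed=1):
    """Generate N trajectories of M points each, cycling through C class labels."""
    rng = random.Random(random_seed)
    days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    rows = []
    for tid in range(1, N + 1):
        label = 'C%d' % ((tid - 1) % C + 1)   # assumed labeling scheme
        for _ in range(M):
            rows.append({
                'tid': tid,
                'label': label,
                'time': rng.randint(0, 1440),              # minutes of the day
                'n1': round(rng.uniform(-1000, 1000), 2),  # numeric attribute
                'day': rng.choice(days),                   # nominal attribute
            })
    return rows

data = random_trajectories(N=4, M=5, C=2)
same = random_trajectories(N=4, M=5, C=2)   # same seed, same data
```

Because the generator is seeded, two calls with the same arguments produce identical datasets.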

matdata.generator.random_set(N, M, L, label, j, generators)[source]
matdata.generator.random_trajectory(M, L, tid, generators)[source]
matdata.generator.sample_set(df_for_sampling, N, M, label, j)[source]
matdata.generator.sample_trajectory(df, M, tid)[source]
matdata.generator.samplerGenerator(N=10, M=50, C=1, random_seed=1, fileprefix='sample', fileposfix='train', cols_for_sampling=['space', 'time', 'day', 'rating', 'price', 'weather', 'root_type', 'type'], save_to=False, base_data=None, outformats=['csv'])[source]

Function to generate trajectories based on real data.

Parameters:

Nint, optional

Number of trajectories (default 10)

Mint, optional

Size of trajectories, number of points (default 50)

Cint, optional

Number of classes (default 1)

random_seedint, optional

Random seed (default 1)

cols_for_samplinglist, optional

Columns to add in the generated dataset. Default: [‘space’, ‘time’, ‘day’, ‘rating’, ‘price’, ‘weather’, ‘root_type’, ‘type’].

save_tostr or bool, optional

Destination folder for the output files, or False to skip saving the CSV files (default False)

fileprefixstr, optional

Output filename prefix (default ‘sample’)

fileposfixstr, optional

Output filename postfix (default ‘train’)

base_dataDataFrame, optional

DataFrame of trajectories to use as a base for sampling data. Default: None (uses example data)

outformatslist, optional

Output file formats for saving (default [‘csv’])

Returns:

pandas.DataFrame

The generated dataset.

matdata.generator.scalerRandomGenerator(Ns=[100, 10], Ms=[10, 10], Ls=[8, 10], Cs=[2, 10], random_seed=1, fileprefix='scalability', fileposfix='train', attr_desc=None, save_to=None, save_desc_files=True, outformats=['csv'])[source]

Function to generate trajectory datasets based on random data.

Parameters:

Nslist of int, optional

Parameters to scale the number of trajectories. List of 2 values: starting number, number of elements (default [100, 10])

Mslist of int, optional

Parameters to scale the size of trajectories. List of 2 values: starting number, number of elements (default [10, 10])

Lslist of int, optional

Parameters to scale the number of attributes (* doubles the columns). List of 2 values: starting number, number of elements (default [8, 10])

Cslist of int, optional

Parameters to scale the number of classes. List of 2 values: starting number, number of elements (default [2, 10])

random_seedint, optional

Random seed (default 1)

attr_desclist, optional

Data type intervals to generate attributes as a list of descriptive dicts. Default: None (uses default types)

save_tostr or bool, optional

Destination folder for the output files, or False to skip saving the CSV files (default None)

fileprefixstr, optional

Output filename prefix (default ‘scalability’)

fileposfixstr, optional

Output filename postfix (default ‘train’)

save_desc_filesbool, optional

Whether to save the .json description files (default True)

outformatslist, optional

Output file formats for saving (default [‘csv’])

Returns:

None

matdata.generator.scalerSamplerGenerator(Ns=[100, 10], Ms=[10, 10], Ls=[8, 10], Cs=[2, 10], random_seed=1, fileprefix='scalability', fileposfix='train', cols_for_sampling=['space', 'time', 'day', 'rating', 'price', 'weather', 'root_type', 'type'], save_to=None, base_data=None, save_desc_files=True, outformats=['csv'])[source]

Generates trajectory datasets based on real data.

Parameters:

Nslist of int, optional

Parameters to scale the number of trajectories. List of 2 values: starting number, number of elements (default [100, 10])

Mslist of int, optional

Parameters to scale the size of trajectories. List of 2 values: starting number, number of elements (default [10, 10])

Lslist of int, optional

Parameters to scale the number of attributes (* doubles the columns). List of 2 values: starting number, number of elements (default [8, 10])

Cslist of int, optional

Parameters to scale the number of classes. List of 2 values: starting number, number of elements (default [2, 10])

random_seedint, optional

Random seed (default 1)

fileprefixstr, optional

Output filename prefix (default ‘scalability’)

fileposfixstr, optional

Output filename postfix (default ‘train’)

cols_for_samplinglist or dict, optional

Columns to add in the generated dataset. Default: [‘space’, ‘time’, ‘day’, ‘rating’, ‘price’, ‘weather’, ‘root_type’, ‘type’]. If a dictionary is provided in the format: {‘aspectName’: ‘type’, ‘aspectName’: ‘type’}, it is used when providing base_data and saving .MAT.

save_tostr or bool, optional

Destination folder for the output files, or False to skip saving the CSV files (default None)

base_dataDataFrame, optional

DataFrame of trajectories to use as a base for sampling data. Default: None (uses example data)

save_desc_filesbool, optional

Whether to save the .json description files (default True)

outformatslist, optional

Output file formats for saving (default [‘csv’])

Returns:

None

matdata.preprocess module

matdata.preprocess.convertDataset(dir_path, k=None, cols=None, fileprefix='', tid_col='tid', class_col='label')[source]
matdata.preprocess.countClasses(data_path, folder, file='train.csv', tid_col='tid', class_col='label', markd=False)[source]

Counts the occurrences of each class label in a dataset.

Parameters:

data_pathstr

The directory path where the dataset file is located.

folderstr

The subfolder within the data path where the dataset file is located.

filestr, optional (default=’train.csv’)

The name of the dataset file to be read.

tid_colstr, optional (default=’tid’)

The name of the column to be used as the trajectory identifier.

class_colstr, optional (default=’label’)

The name of the column to be treated as the class/label column.

markdbool, optional (default=False)

A flag indicating whether to print the class counts in Markdown format.

Returns:

pandas.DataFrame or str

If markd is False, prints the Markdown text and returns a DataFrame containing the count of each class label in the dataset. If markd is True, returns the counts of each class label as a Markdown string.
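A class count over trajectory data and its Markdown rendering can be sketched as below. Counting each trajectory once (rather than each point) is an assumption, and `count_classes` and `to_markdown` are hypothetical helpers, not the functions documented here.

```python
from collections import Counter

def count_classes(rows, tid_col='tid', class_col='label'):
    """Count trajectories (not points) per class label."""
    # Map each trajectory ID to its label so repeated points count only once.
    label_of = {row[tid_col]: row[class_col] for row in rows}
    return Counter(label_of.values())

def to_markdown(counts):
    """Render the counts as a two-column Markdown table."""
    lines = ['| label | trajectories |', '| --- | --- |']
    for label in sorted(counts):
        lines.append('| %s | %d |' % (label, counts[label]))
    return '\n'.join(lines)

rows = [{'tid': 1, 'label': 'A'}, {'tid': 1, 'label': 'A'},
        {'tid': 2, 'label': 'B'}, {'tid': 3, 'label': 'A'}]
counts = count_classes(rows)
table = to_markdown(counts)
```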

matdata.preprocess.countClasses_df(df, tid_col='tid', class_col='label', markd=False)[source]
matdata.preprocess.datasetStatistics(data_path, folder, file_prefix='', tid_col='tid', class_col='label', to_file=False)[source]

Computes statistics for a dataset, including summary statistics for each column and the class distribution, formatted as Markdown.

Parameters:

data_path : str

The directory path where the dataset file(s) are located.

folder : str

The subfolder within the data path where the dataset file(s) are located.

file_prefix : str, optional (default='')

The prefix to be added to the dataset file names.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

to_file : bool or str, optional (default=False)

Whether to save the statistics to a file. If a str is given, it is used as the name of the output file.

Returns:

str

If to_file is False, prints the Markdown and returns a str containing the computed statistics. If to_file is a str, returns the Markdown str and saves the statistics to the file named by to_file.

matdata.preprocess.dfStats(df)[source]

Computes summary statistics for each column in a DataFrame.

Parameters:

df : pandas.DataFrame

The DataFrame for which statistics are to be computed.

Returns:

pandas.DataFrame

A DataFrame containing summary statistics for each column, including mean, standard deviation, and variance. Columns are sorted by variance in descending order.
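
The shape of such a summary can be sketched with plain pandas. This is a minimal illustration of the behavior described above, not the library's implementation; the toy columns are made up, and restricting the computation to numeric columns is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "speed": [1.0, 2.0, 3.0, 4.0],
    "temp":  [10.0, 10.0, 10.0, 10.0],
    "hour":  [0.0, 10.0, 20.0, 30.0],
})

# One row per column of df, with mean, standard deviation and variance.
stats = pd.DataFrame({
    "mean": df.mean(numeric_only=True),
    "std": df.std(numeric_only=True),
    "var": df.var(numeric_only=True),
})

# Most variable columns first.
stats = stats.sort_values("var", ascending=False)
print(stats)
```

Sorting by variance puts constant columns such as temp at the bottom, which makes uninformative attributes easy to spot.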

matdata.preprocess.dfVariance(df)[source]

Computes the variance for each column in a DataFrame.

Parameters:

df : pandas.DataFrame

The DataFrame for which variance is to be computed.

Returns:

pandas.Series

A Series containing the variance for each column in the DataFrame.

matdata.preprocess.dropLabelsltk(df, k, tid_col='tid', class_col='label')[source]

matdata.preprocess.featuresJSON(df, version=1, deftype='nominal', defcomparator='equals', tid_col='tid', label_col='label', file=False)[source]

Generates a JSON representation of features from a DataFrame.

Parameters:

df : pandas.DataFrame

The DataFrame containing the dataset.

version : int, optional (default=1)

The version number of the JSON schema (1 for MASTERMovelets format, 2 for HiPerMovelets format).

deftype : str, optional (default='nominal')

The default type of features.

defcomparator : str, optional (default='equals')

The default comparator for features.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

label_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

file : bool or str, optional (default=False)

Whether to save the JSON representation to a file. If a str is given, it is used as the name of the output file.

Returns:

str

If file is False, returns a str representing the features in JSON format. If file is a str, returns the JSON str and also saves it to the file named by file.

matdata.preprocess.joinTrainTest(dir_path, train_file='train.csv', test_file='test.csv', tid_col='tid', class_col='label', to_file=False)[source]

Joins training and testing datasets from separate files into a single DataFrame.

Parameters:

dir_path : str

The directory path where the training and testing files are located.

train_file : str, optional (default='train.csv')

The name of the training file to be read.

test_file : str, optional (default='test.csv')

The name of the testing file to be read.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

to_file : bool, optional (default=False)

A flag indicating whether to save the joined DataFrame to a file named 'joined.csv'.

Returns:

pandas.DataFrame

A DataFrame containing the joined training and testing data. If to_file is True, the DataFrame is also saved to a file named 'joined.csv'.
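
Conceptually, the join amounts to stacking the two sets. The sketch below shows that idea with plain pandas on in-memory toy frames; it is an assumption-level illustration and does not reproduce the library's file handling.

```python
import pandas as pd

train = pd.DataFrame({"tid": [1, 1, 3], "label": ["a", "a", "b"], "poi": ["home", "work", "gym"]})
test = pd.DataFrame({"tid": [2, 2], "label": ["a", "a"], "poi": ["bar", "home"]})

# Stack the two sets; ignore_index avoids duplicated row labels.
joined = pd.concat([train, test], ignore_index=True)

# Optional ordering by class label and then trajectory id.
joined = joined.sort_values(["label", "tid"]).reset_index(drop=True)
print(joined)

# Saving would be a single call, e.g. joined.to_csv("joined.csv", index=False).
```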

matdata.preprocess.joinTrainTest_df(train, test, tid_col='tid', class_col='label', sort=True)[source]

matdata.preprocess.kfold_stratify(df, k=10, inc=1, limit=10, random_num=1, tid_col='tid', class_col='label', fileprefix='', ktrain=None, ktest=None, organize_columns=True, mat_columns=None, data_path='.', outformats=[], ignore_ltk=True, sort=True)[source]

matdata.preprocess.kfold_trainTestSplit(df, k, random_num=1, tid_col='tid', class_col='label', fileprefix='', columns_order=None, ktrain=None, ktest=None, mat_columns=None, data_path='.', outformats=[], verbose=False, sort=True)[source]

Splits a DataFrame into k folds for k-fold cross-validation, optionally organizes columns, and saves them to files.

Parameters:

df : pandas.DataFrame

The DataFrame to be split into k folds.

k : int

The number of folds for cross-validation.

random_num : int, optional (default=1)

The random seed for reproducible results.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

fileprefix : str, optional (default='')

The prefix to be added to the file names when saving, for example: 'specific_' or 'generic_'.

columns_order : list of str, optional

A list of column names specifying the desired order of columns. If None, no reordering is performed.

ktrain : list of pandas.DataFrame, optional

A list of training sets for each fold. If None, the function will split the data into training and testing sets.

ktest : list of pandas.DataFrame, optional

A list of testing sets for each fold. If None, the function will split the data into training and testing sets.

mat_columns : list of str, optional

A list of column names to be included in the .mat files, corresponding to columns_order.

data_path : str, optional (default='.')

The directory path where the output files will be saved.

outformats : list of str, optional

A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).

verbose : bool, optional (default=False)

A flag indicating whether to display progress messages.

sort : bool, optional (default=True)

If True, sort the data by class_col and tid_col.

Returns:

ktrain : list of pandas.DataFrame

A list of DataFrames containing the training sets, one per fold.

ktest : list of pandas.DataFrame

A list of DataFrames containing the testing sets, one per fold.
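
The sketch below illustrates one way a k-fold split can be made at trajectory level, so that all points of a trajectory land in the same fold and each fold keeps the class proportions. The helper kfold_sketch and the round-robin assignment are assumptions for illustration only; the library may split differently, and no files are written here.

```python
import random
import pandas as pd

df = pd.DataFrame({
    "tid":   [t for t in range(1, 13) for _ in range(2)],      # 12 trajectories, 2 points each
    "label": [lab for lab in "aaaaaabbbbbb" for _ in range(2)],
})

def kfold_sketch(df, k, random_num=1, tid_col="tid", class_col="label"):
    """Class-stratified k-fold split at trajectory level (hypothetical helper)."""
    rng = random.Random(random_num)
    folds = [[] for _ in range(k)]
    # Deal the shuffled trajectory ids of each class round-robin into the folds.
    for _, group in df.groupby(class_col):
        tids = sorted(group[tid_col].unique())
        rng.shuffle(tids)
        for i, tid in enumerate(tids):
            folds[i % k].append(tid)
    ktrain, ktest = [], []
    for fold in folds:
        in_test = df[tid_col].isin(fold)
        ktest.append(df[in_test])
        ktrain.append(df[~in_test])
    return ktrain, ktest

ktrain, ktest = kfold_sketch(df, k=3)
for i, (train, test) in enumerate(zip(ktrain, ktest), start=1):
    print(f"fold {i}: {train['tid'].nunique()} train / {test['tid'].nunique()} test trajectories")
```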

matdata.preprocess.klabels_extract(df, kl=10, random_num=1, tid_col='tid', class_col='label', organize_columns=True, sort=True)[source]

Extracts from a DataFrame the data for a specified number of class labels (kl), optionally organizing columns. Unlike klabels_stratify, it returns a single DataFrame rather than training and testing sets.

Parameters:

df : pandas.DataFrame

The DataFrame from which the class labels are extracted.

kl : int, optional (default=10)

The number of class labels to extract from the DataFrame.

random_num : int, optional (default=1)

The random seed for reproducible results.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

organize_columns : bool, optional (default=True)

A flag indicating whether to organize columns in the returned DataFrame.

sort : bool, optional (default=True)

If True, sort the data by class_col and tid_col.

Returns:

data : pandas.DataFrame

A DataFrame containing the data for the extracted class labels.
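
A minimal sketch of reducing a dataset to kl class labels follows. It assumes the labels are chosen at random under a fixed seed, which this documentation does not state; the helper klabels_sketch is hypothetical and is not the library's code.

```python
import random
import pandas as pd

df = pd.DataFrame({
    "tid":   [1, 1, 2, 3, 3, 4],
    "label": ["a", "a", "b", "c", "c", "d"],
})

def klabels_sketch(df, kl=2, random_num=1, class_col="label"):
    """Keep the rows of kl randomly chosen class labels (hypothetical helper)."""
    labels = sorted(df[class_col].unique())
    rng = random.Random(random_num)
    # Never ask for more labels than the data contains.
    chosen = rng.sample(labels, min(kl, len(labels)))
    return df[df[class_col].isin(chosen)]

subset = klabels_sketch(df, kl=2)
print(sorted(subset["label"].unique()))
```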

matdata.preprocess.klabels_stratify(df, kl=10, train_size=0.7, random_num=1, tid_col='tid', class_col='label', organize_columns=True, mat_columns=None, fileprefix='', outformats=[], data_path='.', sort=True)[source]

Stratifies a DataFrame by a specified number of class labels and splits it into training and testing sets, optionally organizes columns, and saves them to files.

Parameters:

df : pandas.DataFrame

The DataFrame to be stratified and split into training and testing sets.

kl : int, optional (default=10)

The number of class labels to stratify the DataFrame.

train_size : float, optional (default=0.7)

The proportion of the stratified dataset to include in the training set.

random_num : int, optional (default=1)

The random seed for reproducible results.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

organize_columns : bool, optional (default=True)

A flag indicating whether to organize columns before saving.

mat_columns : list of str, optional (unused for now)

A list of column names to be included in the .mat files, if set to save.

fileprefix : str, optional (default='')

The prefix to be added to the file names when saving.

outformats : list of str, optional

A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).

data_path : str, optional (default='.')

The directory path where the output files will be saved.

sort : bool, optional (default=True)

If True, sort the data by class_col and tid_col.

Returns:

train : pandas.DataFrame

A DataFrame containing the training set.

test : pandas.DataFrame

A DataFrame containing the testing set.

matdata.preprocess.labels_extract(df, labels=[], tid_col='tid', class_col='label', organize_columns=True)[source]

matdata.preprocess.organizeFrame(df, columns_order=None, tid_col='tid', class_col='label', make_spatials=False)[source]

Organizes a DataFrame by reordering columns and optionally converting spatial columns.

Parameters:

df : pandas.DataFrame

The DataFrame to be organized.

columns_order : list of str, optional

A list of column names specifying the desired order of columns. If None, no reordering is performed.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

make_spatials : bool, optional (default=False)

A flag indicating whether to produce the spatial columns in both formats: separate lat/lon columns and the space format, in which lat and lon are concatenated into one column.

Returns:

pandas.DataFrame

A DataFrame containing the organized data, with columns added as specified and spatial columns converted if requested.

columns_order_zip

A list of the column names including the space column, if present.

columns_order_csv

A list of the column names including the lat/lon columns, if present.
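
The two column layouts can be pictured with a small pandas sketch. Everything specific here is an assumption made for illustration: the blank separator inside the space column, the attribute-first ordering with tid and label last, and the toy column names.

```python
import pandas as pd

df = pd.DataFrame({
    "label": ["a", "a"],
    "lat":   [-27.59, -27.60],
    "lon":   [-48.54, -48.55],
    "poi":   ["home", "work"],
    "tid":   [1, 1],
})

# Assumed "space" format: lat and lon joined by a single blank in one column.
df["space"] = df["lat"].astype(str) + " " + df["lon"].astype(str)

# Attributes first, then tid and label last (an assumed ordering convention).
columns_order_zip = ["space", "poi", "tid", "label"]        # layout with the space column
columns_order_csv = ["lat", "lon", "poi", "tid", "label"]   # layout with lat/lon columns

df_zip = df[columns_order_zip]
df_csv = df[columns_order_csv]
print(df_zip)
print(df_csv)
```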

matdata.preprocess.readDataset(data_path, folder=None, file='train.csv', class_col='label', tid_col='tid', missing=None)[source]

Reads a dataset file (CSV format by default, ‘train.csv’) and returns it as a pandas DataFrame.

Parameters:

data_path : str

The directory path where the dataset file is located.

folder : str, optional

The subfolder within the data path where the dataset file is located.

file : str, optional (default='train.csv')

The name of the dataset file to be read.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

missing : str, optional (default='?')

The placeholder for missing values in the dataset.

Returns:

pandas.DataFrame

A DataFrame containing the dataset from the specified file, with trajectory identifier, class label, and missing values handled as specified.
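
To show what a missing-value placeholder means in practice, the sketch below parses a small in-memory CSV with pandas and treats '?' as missing. It is only an illustration of the idea; whether the library converts placeholders to NaN in exactly this way is an assumption.

```python
import io
import pandas as pd

# Stand-in for a file such as <data_path>/<folder>/train.csv.
csv_text = """tid,label,poi,rating
1,a,home,4.5
1,a,?,3.0
2,b,gym,?
"""

# Treat the '?' placeholder as missing while parsing.
df = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(df.isna().sum())
```

Because the placeholder is removed at parse time, the rating column is read as numeric instead of text.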

matdata.preprocess.readDsDesc(data_path, folder=None, file='train.csv', tid_col='tid', class_col='label', missing='?')[source]

matdata.preprocess.sortByLabel(df, tid_col='tid', class_col='label')[source]

Sort a DataFrame by class label column and trajectory ID column.

Parameters:

df : pandas.DataFrame

The DataFrame to be sorted.

tid_col : str, optional (default='tid')

The name of the column representing trajectory IDs.

class_col : str, optional (default='label')

The name of the column representing class labels.

Returns:

pandas.DataFrame

The sorted DataFrame.
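
In plain pandas, such an ordering can be sketched as below. The seq column is a hypothetical helper added here so that the order of points inside each trajectory is explicit; the library function takes no such column.

```python
import pandas as pd

df = pd.DataFrame({
    "tid":   [3, 1, 2, 1, 3],
    "label": ["b", "a", "a", "a", "b"],
    "seq":   [0, 0, 0, 1, 1],   # position of the point inside its trajectory
})

# Sort by class label, then trajectory id, then position inside the trajectory.
by_label = df.sort_values(["label", "tid", "seq"]).reset_index(drop=True)
print(by_label)
```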

matdata.preprocess.sortByTID(df_, tid_col='tid')[source]

Sort a DataFrame by trajectory ID column.

Parameters:

df_ : pandas.DataFrame

The DataFrame to be sorted.

tid_col : str, optional (default='tid')

The name of the column representing trajectory IDs.

Returns:

pandas.DataFrame

The sorted DataFrame.

matdata.preprocess.splitData(df, k, random_num, tid_col='tid', class_col='label', opLabel='Spliting Data', ignore_ltk=True)[source]

matdata.preprocess.splitTIDs(df, train_size=0.7, random_num=1, tid_col='tid', class_col='label', min_elements=1, opLabel='Spliting Data (class-balanced)')[source]

matdata.preprocess.splitTIDsUnbalanced(df, train_size=0.7, random_num=1, tid_col='tid', min_elements=1, opLabel='Spliting Data')[source]

matdata.preprocess.splitframe(data, name='tid')[source]

matdata.preprocess.stratify(df, sample_size=0.5, random_num=1, tid_col='tid', class_col='label', organize_columns=True, sort=True, opLabel='Data Stratification (class-balanced)')[source]

Stratifies a DataFrame by class label, drawing a class-balanced sample of it, and optionally organizes columns.

Parameters:

df : pandas.DataFrame

The DataFrame to be stratified.

sample_size : float, optional (default=0.5)

The proportion of the dataset to sample for stratification.

random_num : int, optional (default=1)

The random seed for reproducible results.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

organize_columns : bool, optional (default=True)

A flag indicating whether to organize columns in the returned DataFrame.

sort : bool, optional (default=True)

If True, sort the data by class_col and tid_col.

Returns:

pandas.DataFrame

A DataFrame containing the stratified set.
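
Class-balanced sampling can be sketched as taking the same fraction of trajectories from every class. The helper stratify_sketch, the rounding rule and the at-least-one-per-class guard are assumptions for illustration, not the library's implementation.

```python
import random
import pandas as pd

df = pd.DataFrame({
    "tid":   [t for t in range(1, 9) for _ in range(3)],      # 8 trajectories, 3 points each
    "label": [lab for lab in "aaaabbbb" for _ in range(3)],
})

def stratify_sketch(df, sample_size=0.5, random_num=1, tid_col="tid", class_col="label"):
    """Sample the same fraction of trajectories from every class (hypothetical helper)."""
    rng = random.Random(random_num)
    keep = []
    for _, group in df.groupby(class_col):
        tids = sorted(group[tid_col].unique())
        n = max(1, int(len(tids) * sample_size))  # keep at least one trajectory per class
        keep.extend(rng.sample(tids, n))
    # Selecting by tid keeps every point of the sampled trajectories.
    return df[df[tid_col].isin(keep)]

sample = stratify_sketch(df, sample_size=0.5)
print(sample.drop_duplicates("tid")["label"].value_counts())
```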

matdata.preprocess.stratifyTrainTest(df, sample_size=0.5, train_size=0.7, random_num=1, tid_col='tid', class_col='label', organize_columns=True, mat_columns=None, fileprefix='', outformats=[], data_path='.', sort=True)[source]

Stratifies a DataFrame by class label and splits it into training and testing sets, optionally organizes columns, and saves them to files.

Parameters:

df : pandas.DataFrame

The DataFrame to be stratified and split into training and testing sets.

sample_size : float, optional (default=0.5)

The proportion of the dataset to sample for stratification.

train_size : float, optional (default=0.7)

The proportion of the stratified dataset to include in the training set.

random_num : int, optional (default=1)

The random seed for reproducible results.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

organize_columns : bool, optional (default=True)

A flag indicating whether to organize columns before saving.

mat_columns : list of str, optional (unused for now)

A list of column names to be included in the .mat files, if set to save.

fileprefix : str, optional (default='')

The prefix to be added to the file names when saving.

outformats : list of str, optional

A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).

data_path : str, optional (default='.')

The directory path where the output files will be saved.

sort : bool, optional (default=True)

If True, sort the data by class_col and tid_col.

Returns:

train : pandas.DataFrame

A DataFrame containing the training set.

test : pandas.DataFrame

A DataFrame containing the testing set.

matdata.preprocess.suffle_df(df, tid_col='tid', random_num=1)[source]

Shuffle a DataFrame by trajectory ID column.

Parameters:

df : pandas.DataFrame

The DataFrame to be shuffled.

tid_col : str, optional (default='tid')

The name of the column representing trajectory IDs.

random_num : int, optional (default=1)

Random seed for reproducibility.

Returns:

pandas.DataFrame

The shuffled DataFrame.
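
Shuffling by trajectory ID can be read as reordering whole trajectories while leaving the points of each one together and in order. The sketch below illustrates that reading; it is an assumption about the semantics, and shuffle_by_tid_sketch is a hypothetical helper rather than the library's code.

```python
import random
import pandas as pd

df = pd.DataFrame({
    "tid": [1, 1, 2, 2, 3, 3],
    "poi": ["a1", "a2", "b1", "b2", "c1", "c2"],
})

def shuffle_by_tid_sketch(df, tid_col="tid", random_num=1):
    """Shuffle whole trajectories, keeping each one's points in order (hypothetical helper)."""
    tids = list(df[tid_col].unique())
    random.Random(random_num).shuffle(tids)
    # Re-assemble the frame one trajectory at a time, in the shuffled order.
    parts = [df[df[tid_col] == tid] for tid in tids]
    return pd.concat(parts, ignore_index=True)

shuffled = shuffle_by_tid_sketch(df)
print(shuffled)
```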

matdata.preprocess.trainTestSplit(df, train_size=0.7, random_num=1, tid_col='tid', class_col='label', fileprefix='', data_path='.', outformats=[], verbose=False, organize_columns=True, sort=True)[source]

Splits a DataFrame into training and testing sets, optionally organizes columns, and saves them to files.

Parameters:

df : pandas.DataFrame

The DataFrame to be split into training and testing sets.

train_size : float, optional (default=0.7)

The proportion of the dataset to include in the training set.

random_num : int, optional (default=1)

The random seed for reproducible results.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

fileprefix : str, optional (default='')

The prefix to be added to the file names when saving.

data_path : str, optional (default='.')

The directory path where the output files will be saved.

outformats : list of str, optional

A list of output formats for saving the datasets (e.g., ['csv', 'parquet']).

verbose : bool, optional (default=False)

A flag indicating whether to display progress messages.

organize_columns : bool, optional (default=True)

A flag indicating whether to organize columns before saving.

sort : bool, optional (default=True)

If True, sort the data by class_col and tid_col.

Returns:

train : pandas.DataFrame

A DataFrame containing the training set.

test : pandas.DataFrame

A DataFrame containing the testing set.
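
A hold-out split of this kind can be sketched at trajectory level, cutting each class at train_size so both sets keep the class proportions. The helper train_test_sketch and its rounding rule are assumptions for illustration; the sketch writes no files and is not the library's implementation.

```python
import random
import pandas as pd

df = pd.DataFrame({
    "tid":   [t for t in range(1, 21) for _ in range(2)],     # 20 trajectories, 2 points each
    "label": [lab for lab in "a" * 10 + "b" * 10 for _ in range(2)],
})

def train_test_sketch(df, train_size=0.7, random_num=1, tid_col="tid", class_col="label"):
    """Class-balanced hold-out split at trajectory level (hypothetical helper)."""
    rng = random.Random(random_num)
    train_tids = []
    for _, group in df.groupby(class_col):
        tids = sorted(group[tid_col].unique())
        rng.shuffle(tids)
        cut = int(round(len(tids) * train_size))
        train_tids.extend(tids[:cut])
    in_train = df[tid_col].isin(train_tids)
    return df[in_train], df[~in_train]

train, test = train_test_sketch(df)
print(train["tid"].nunique(), "train /", test["tid"].nunique(), "test trajectories")
```

Splitting on trajectory ids rather than rows keeps points of one trajectory from appearing in both sets.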

matdata.preprocess.writeFile(data_path, df, file, tid_col, class_col, columns_order, mat_columns=None, desc_cols=None, outformat='zip', opSuff='')[source]

matdata.preprocess.writeFiles(data_path, file, train, test, tid_col, class_col, columns_order, mat_columns=None, desc_cols=None, outformat='zip', opSuff='')[source]

Module contents

Multiple Aspect Trajectory Tools Framework

MAT-data: Data Preprocessing for Multiple Aspect Trajectory Data Mining

The present application offers a tool to support the user in the classification of multiple aspect trajectories, specifically by extracting and visualizing movelets, the parts of a trajectory that best discriminate a class. It integrates the fragmented approaches available for multiple aspect trajectories, and for multidimensional sequence classification in general, into a single web-based and Python library system. It offers both movelet visualization and classification methods.

Created on Dec, 2023 Copyright (C) 2023, License GPL Version 3 or superior (see LICENSE file)

@author: Tarlis Portela