matdata package

Submodules

matdata.converter module

MAT-Tools: Python Framework for Multiple Aspect Trajectory Data Mining

The present application offers a tool to support the user in preprocessing multiple aspect trajectory data. It integrates into a single framework for multiple aspect trajectories and, more generally, for multidimensional sequence data mining methods. Copyright (C) 2022, MIT license (this portion of code is subject to licensing from source project distribution)

Authors:

Tarlis Portela

matdata.converter.any2ts(data_path, folder, file, cols=None, tid_col='tid', class_col='label', opLabel='Converting TS')[source]

Converts data from various formats (CSV, Parquet, etc.) to a time series format.

Parameters:

data_path (str): The directory path where the data files are located.
folder (str): The folder containing the data file to be converted.
file (str): The name of the data file to be converted.
cols (list of str, optional): A list of column names to be included in the time series data.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
opLabel (str, optional, default='Converting TS'): A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame: A DataFrame containing the time series data, with trajectory identifier, class label, and specified columns.

matdata.converter.csv2df(url, class_col='label', tid_col='tid', missing=None)[source]

Converts a CSV file from a given URL into a pandas DataFrame.

Parameters:

url (str): The URL pointing to the CSV file to be read.
class_col (str, optional, default='label'): Unused; kept for a consistent signature across converters.
tid_col (str, optional, default='tid'): Unused; kept for a consistent signature across converters.
missing (str, optional, default=None): The placeholder for missing values in the CSV file.

Returns:

pandas.DataFrame: A DataFrame containing the data from the CSV file, with missing values handled as specified and columns renamed if necessary.

matdata.converter.descTypes(df)[source]

matdata.converter.df2csv(df, data_path, file='train', tid_col='tid', class_col='label', select_cols=None, opLabel='Writing CSV')[source]

Writes a pandas DataFrame to a CSV file.

Parameters:

df (pandas.DataFrame): The DataFrame to be written to the CSV file.
data_path (str): The directory path where the CSV file will be saved.
file (str, optional, default='train'): The base name of the CSV file (without extension).
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
select_cols (list of str, optional): A list of column names to be included in the CSV file. If None, all columns are included.
opLabel (str, optional, default='Writing CSV'): A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame: The input DataFrame

matdata.converter.df2mat(df, folder, file, cols=None, mat_cols=None, desc_cols=None, label_columns=None, other_dsattrs=None, tid_col='tid', class_col='label', opLabel='Converting MAT')[source]

Converts a pandas DataFrame to a Multiple Aspect Trajectory .mat file and saves it to the specified folder.

Parameters:

df (pandas.DataFrame): The DataFrame to be converted to a .mat file.
folder (str): The directory where the .mat file will be saved.
file (str): The base name of the .mat file (without extension).
cols (list of str, optional): A list of column names from the DataFrame to include in the .mat file. If None, all columns are included.
mat_cols (list of str, optional): A list of column names representing the trajectory attributes. If None, no columns are used.
desc_cols (list of str, optional): A dict of column descriptors to be included as descriptive metadata.
label_columns (list of str, optional): A list of column names that can be treated as labels in the .mat file.
other_dsattrs (dict, optional): A dictionary of additional dataset attributes to be included in the .mat file.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
opLabel (str, optional, default='Converting MAT'): A label describing the operation, useful for logging or display purposes.

Returns:

None

matdata.converter.df2parquet(df, data_path, file='train', tid_col='tid', class_col='label', select_cols=None, opLabel='Writing Parquet')[source]

Writes a pandas DataFrame to a Parquet file.

Parameters:

df (pandas.DataFrame): The DataFrame to be written to the Parquet file.
data_path (str): The directory path where the Parquet file will be saved.
file (str, optional, default='train'): The base name of the Parquet file (without extension).
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
select_cols (list of str, optional): A list of column names to be included in the Parquet file. If None, all columns are included.
opLabel (str, optional, default='Writing Parquet'): A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame: The input DataFrame

matdata.converter.df2zip(df, data_path, file, tid_col='tid', class_col='label', select_cols=None, opLabel='Writing ZIP')[source]

Writes a pandas DataFrame to a CSV file and compresses it into a ZIP archive.

This format is used by older movelet methods, such as Movelets, MasterMovelets, SuperMovelets, and the Dodge, Xiao, and Zheng feature extractors. In this format, every ',' (comma) is replaced with '_' to avoid problems when reading CSV trajectory files.

Parameters:

df (pandas.DataFrame): The DataFrame to be written to the CSV file and then compressed into a ZIP archive.
data_path (str): The directory path where the ZIP archive will be saved.
file (str): The base name of the CSV file (without extension) to be compressed into the ZIP archive.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
select_cols (list of str, optional): A list of column names to be included in the CSV file. If None, all columns are included.
opLabel (str, optional, default='Writing ZIP'): A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame: The input DataFrame
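
The comma replacement described for this format can be illustrated with the standard library alone. This is a minimal sketch and not the matdata implementation: the helper name `rows_to_zip_csv`, the sample rows, and the archive member name are hypothetical, and the archive is kept in memory rather than written to `data_path`.

```python
import csv
import io
import zipfile


def rows_to_zip_csv(rows, fieldnames, member_name="train.csv"):
    """Write rows to a CSV stored inside an in-memory ZIP archive.

    Hypothetical helper: commas inside values are replaced with '_'
    before writing, mirroring the convention described above.
    """
    text = io.StringIO()
    writer = csv.DictWriter(text, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({k: str(row[k]).replace(",", "_") for k in fieldnames})

    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member_name, text.getvalue())
    return archive.getvalue()


rows = [
    {"tid": 1, "label": "A", "poi": "Cafe, Downtown"},
    {"tid": 1, "label": "A", "poi": "Park"},
]
zip_bytes = rows_to_zip_csv(rows, ["tid", "label", "poi"])

# Read the member back to confirm that no value contains a comma.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    csv_text = zf.read("train.csv").decode("utf-8")
```

Because no value contains a comma after the replacement, a reader that splits lines naively on ',' still recovers the right number of fields.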

matdata.converter.mat2df(url, class_col='label', tid_col='tid', missing=None)[source]

Converts a MATLAB .mat file from a given URL into a pandas DataFrame.

Parameters:

url (str): The URL pointing to the .mat file to be read.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
tid_col (str, optional, default='tid'): The name of the column to be used as the unique trajectory identifier.
missing (str, optional, default=None): The placeholder for missing values in the dataset.

Returns:

pandas.DataFrame: A DataFrame containing the data from the .mat file, with missing values handled as specified and columns renamed if necessary.

Raises:

Exception: Not Implemented.

matdata.converter.parquet2df(url, class_col='label', tid_col='tid', missing=None)[source]

Converts a Parquet file from a given URL into a pandas DataFrame.

Parameters:

url (str): The URL pointing to the Parquet file to be read.
class_col (str, optional, default='label'): Unused; kept for a consistent signature across converters.
tid_col (str, optional, default='tid'): Unused; kept for a consistent signature across converters.
missing (str, optional, default=None): The placeholder for missing values in the dataset.

Returns:

pandas.DataFrame: A DataFrame containing the data from the Parquet file, with missing values handled as specified and columns renamed if necessary.

matdata.converter.read_zip(zipFile, cols=None, class_col='label', tid_col='tid', opLabel='Reading ZIP')[source]

matdata.converter.ts2df(url, class_col='label', tid_col='tid', missing=None)[source]

Converts a time series file from a given URL into a pandas DataFrame.

Parameters:

url (str): The URL pointing to the time series file to be read.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
tid_col (str, optional, default='tid'): The name of the column to be used as the unique trajectory identifier.
missing (str, optional, default=None): The placeholder for missing values in the dataset.

Returns:

pandas.DataFrame: A DataFrame containing the data from the time series file, with missing values handled as specified and columns renamed if necessary.

matdata.converter.xes2df(url, class_col='label', tid_col='tid', missing=None, opLabel='Converting XES', save=False, start_tid=1)[source]

Converts an XES (eXtensible Event Stream) file from a given URL into a pandas DataFrame.

Parameters:

url (str): The URL pointing to the XES file to be read.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
opLabel (str, optional, default='Converting XES'): A label describing the operation, useful for logging or display purposes.
save (bool, optional, default=False): A flag indicating whether to save the DataFrame to a file after conversion.
start_tid (int, optional, default=1): The starting value for trajectory identifiers, since tid_col values need to be generated.

Returns:

pandas.DataFrame: A DataFrame containing the data from the XES file, with columns renamed if necessary.

matdata.converter.zip2arf(folder, file, cols, tid_col='tid', class_col='label', missing='?', opLabel='Reading ZIP')[source]

Extracts a CSV file from a ZIP archive and converts it into an ARFF (Attribute-Relation File Format) file.

Parameters:

folder (str): The directory path where the ZIP archive is located.
file (str): The name of the ZIP archive file (with or without extension).
cols (list of str): A list of column names to be included in the ARFF file.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
missing (str, optional, default='?'): The placeholder for missing values in the CSV file.
opLabel (str, optional, default='Reading ZIP'): A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame: A DataFrame containing the data from the extracted ZIP file, with missing values handled as specified and columns renamed if necessary.
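
For readers unfamiliar with ARFF, the layout of such a file can be sketched with plain Python. This is an illustration of the format only, not of how `zip2arf` builds its output: the helper `rows_to_arff`, the relation name, and the attribute types are assumptions chosen for the example.

```python
def rows_to_arff(relation, attributes, rows, missing="?"):
    """Render rows as ARFF text (hypothetical helper).

    attributes is a list of (name, arff_type) pairs, for example
    ("lat", "NUMERIC") or ("label", "{A,B}") for a nominal attribute.
    Values that are None are written as the missing placeholder.
    """
    lines = [f"@RELATION {relation}", ""]
    for name, arff_type in attributes:
        lines.append(f"@ATTRIBUTE {name} {arff_type}")
    lines += ["", "@DATA"]
    for row in rows:
        values = [
            missing if row.get(name) is None else str(row[name])
            for name, _ in attributes
        ]
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


arff_text = rows_to_arff(
    "trajectories",
    [("tid", "NUMERIC"), ("lat", "NUMERIC"), ("label", "{A,B}")],
    [
        {"tid": 1, "lat": 40.7, "label": "A"},
        {"tid": 2, "lat": None, "label": "B"},
    ],
)
```

The header declares one `@ATTRIBUTE` per column, and every line after `@DATA` is a comma-separated record, with '?' standing in for missing values.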

matdata.converter.zip2csv(folder, file, cols, class_col='label', tid_col='tid', missing='?')[source]

Extracts and compiles trajectory CSV files from a ZIP archive and converts them into a pandas DataFrame.

Parameters:

folder (str): The directory path where the ZIP archive is located, and the destination for the resulting CSV file.
file (str): The name of the ZIP archive file (with or without extension).
cols (list of str): A list of column names to be included in the DataFrame.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
missing (str, optional, default='?'): The placeholder for missing values in the CSV file.

Returns:

pandas.DataFrame: A DataFrame containing the data from the extracted CSV file, with missing values handled as specified and columns renamed if necessary.

matdata.converter.zip2df(url, class_col='label', tid_col='tid', missing='?', opLabel='Reading ZIP')[source]

Extracts and converts a CSV trajectory file from a ZIP archive located at a given URL into a pandas DataFrame.

Parameters:

url (str): The URL pointing to the ZIP archive containing the CSV file to be read.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
tid_col (str, optional, default='tid'): The name of the column to be used as the unique trajectory identifier.
missing (str, optional, default='?'): The placeholder for missing values in the CSV file.
opLabel (str, optional, default='Reading ZIP'): A label describing the operation, for logging purposes.

Returns:

pandas.DataFrame: A DataFrame containing the data from the extracted CSV file, with missing values handled as specified and columns renamed if necessary.
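
The role of the `missing` placeholder when reading a zipped CSV can be sketched as follows. This is a standard-library illustration under stated assumptions, not the matdata code: `read_zip_csv` is a hypothetical helper, the archive is built in memory instead of fetched from a URL, and mapping the placeholder to `None` is one plausible reading of "missing values handled as specified".

```python
import csv
import io
import zipfile


def read_zip_csv(zip_bytes, member_name, missing="?"):
    """Read one CSV member of a ZIP archive into a list of dicts.

    Hypothetical helper: cells equal to the placeholder become None.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        text = zf.read(member_name).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (None if value == missing else value) for key, value in row.items()}
        for row in reader
    ]


# Build a tiny archive in memory to stand in for a downloaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("train.csv", "tid,label,weather\n1,A,Clear\n1,A,?\n")

records = read_zip_csv(buffer.getvalue(), "train.csv")
```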

matdata.dataset module

matdata.dataset.load_ds(dataset='mat.FoursquareNYC', prefix='', missing=None, sample_size=1, random_num=1, sort=True)[source]

Load a dataset for training or testing from a GitHub repository.

Parameters:

dataset (str, optional): The name of the dataset to load (default 'mat.FoursquareNYC').
prefix (str, optional): The prefix to be added to the dataset file name (default '').
missing (str, optional): The placeholder value used to denote missing data (default None).
sample_size (float, optional): The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
random_num (int, optional): Random seed for reproducibility (default 1).

Returns:

pandas.DataFrame: The loaded dataset with optional sampling.

matdata.dataset.load_ds_holdout(dataset='mat.FoursquareNYC', train_size=0.7, prefix='', missing='-999', sample_size=1, random_num=1, sort=True)[source]

Load a dataset for training and testing with a holdout method from a GitHub repository.

Parameters:

dataset (str, optional): The name of the dataset file to load from the GitHub repository (default 'mat.FoursquareNYC'). Use the format category.DatasetName.
train_size (float, optional): The proportion of the dataset to include in the training set (default 0.7).
prefix (str, optional): The prefix to be added to the dataset file name (default '').
missing (str, optional): The placeholder value used to denote missing data (default '-999').
sample_size (float, optional): The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
random_num (int, optional): Random seed for reproducibility (default 1).

Returns:

train (pandas.DataFrame): The training dataset.
test (pandas.DataFrame): The testing dataset.
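
The idea of a holdout split over trajectory data can be sketched without any dependencies. The sketch below assumes the split is made over trajectory identifiers, so that all points of a trajectory stay on the same side, and it does not stratify by class; whether matdata does either is not stated here. The helper `holdout_by_tid` and the sample rows are hypothetical.

```python
import random


def holdout_by_tid(rows, train_size=0.7, random_num=1, tid_col="tid"):
    """Split rows into train/test by trajectory id (hypothetical helper)."""
    tids = sorted({row[tid_col] for row in rows})
    rng = random.Random(random_num)  # seeded for reproducibility
    rng.shuffle(tids)
    n_train = int(round(train_size * len(tids)))
    train_tids = set(tids[:n_train])
    train = [row for row in rows if row[tid_col] in train_tids]
    test = [row for row in rows if row[tid_col] not in train_tids]
    return train, test


# Ten trajectories with three points each.
rows = [
    {"tid": tid, "label": "A" if tid % 2 else "B", "step": step}
    for tid in range(1, 11)
    for step in range(3)
]
train, test = holdout_by_tid(rows, train_size=0.7)
```

With ten trajectories and `train_size=0.7`, seven trajectories land in the training set and three in the test set, and no trajectory is shared between them.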

matdata.dataset.load_ds_kfold(dataset='mat.FoursquareNYC', k=5, prefix='', missing='-999', sample_size=1, random_num=1)[source]

Load a dataset for k-fold cross-validation from a GitHub repository.

Parameters:

dataset (str, optional): The name of the dataset file to load from the GitHub repository (default 'mat.FoursquareNYC').
k (int, optional): The number of folds for cross-validation (default 5).
prefix (str, optional): The prefix to be added to the dataset file name (default '').
missing (str, optional): The placeholder value used to denote missing data (default '-999').
sample_size (float, optional): The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
random_num (int, optional): Random seed for reproducibility (default 1).

Returns:

ktrain (list of pandas.DataFrame): The training datasets for each fold.
ktest (list of pandas.DataFrame): The testing datasets for each fold.
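
The shape of the returned `ktrain` and `ktest` lists can be illustrated with a small k-fold sketch over trajectory identifiers. It is an assumption-laden illustration, not the library's procedure: `kfold_by_tid` is a hypothetical helper, folds are formed by striding over a shuffled id list, and no class stratification is applied.

```python
import random


def kfold_by_tid(rows, k=5, random_num=1, tid_col="tid"):
    """Build k train/test partitions over trajectory ids (hypothetical helper)."""
    tids = sorted({row[tid_col] for row in rows})
    random.Random(random_num).shuffle(tids)
    # Fold i holds every k-th id starting at position i.
    folds = [set(tids[i::k]) for i in range(k)]
    ktrain, ktest = [], []
    for held_out in folds:
        ktrain.append([row for row in rows if row[tid_col] not in held_out])
        ktest.append([row for row in rows if row[tid_col] in held_out])
    return ktrain, ktest


# Ten trajectories with two points each.
rows = [{"tid": tid, "step": step} for tid in range(1, 11) for step in range(2)]
ktrain, ktest = kfold_by_tid(rows, k=5)
```

Each position `i` pairs `ktrain[i]` with `ktest[i]`, and every trajectory appears in exactly one test fold.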

matdata.dataset.prepare_ds(df, tid_col='tid', class_col=None, sample_size=1, random_num=1, sort=True)[source]

Prepare dataset for training or testing (helper function).

Parameters:

df (pandas.DataFrame): The DataFrame containing the dataset.
tid_col (str, optional): The name of the column representing trajectory IDs (default 'tid').
class_col (str or None, optional): The name of the column representing class labels. If None, no class column is used for ordering data (default None).
sample_size (float, optional): The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
random_num (int, optional): Random seed for reproducibility (default 1).

Returns:

pandas.DataFrame: The prepared dataset with optional sampling.

matdata.dataset.read_ds(data_file, tid_col='tid', class_col=None, missing='-999', sample_size=1, random_num=1)[source]

Read a dataset from a file.

Parameters:

data_file (str): The path to the dataset file.
tid_col (str, optional): The name of the column representing trajectory IDs (default 'tid').
class_col (str or None, optional): The name of the column representing class labels. If None, no class column is used (default None).
missing (str, optional): The placeholder value used to denote missing data (default '-999').
sample_size (float, optional): The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
random_num (int, optional): Random seed for reproducibility (default 1).

Returns:

pandas.DataFrame: The read dataset.

matdata.dataset.read_ds_5fold(data_path, prefix='specific', suffix='.csv', tid_col='tid', class_col=None, missing='-999')[source]

Read datasets for k-fold cross-validation from files in a directory.

Parameters:

data_path (str): The path to the directory containing the dataset files.
prefix (str, optional): The prefix of the dataset file names (default 'specific').
suffix (str, optional): The suffix of the dataset file names (default '.csv').
tid_col (str, optional): The name of the column representing trajectory IDs (default 'tid').
class_col (str or None, optional): The name of the column representing class labels. If None, no class column is used (default None).
missing (str, optional): The placeholder value used to denote missing data (default '-999').
fold (int or None, optional): The fold number to load for holdout validation, including the subdirectory (e.g., run1). If None, files are read from data_path.

Returns:

train (pandas.DataFrame): The training dataset.
test (pandas.DataFrame): The testing dataset.

matdata.dataset.read_ds_kfold(data_path, k=5, prefix='specific', suffix='.csv', tid_col='tid', class_col=None, missing='-999')[source]

Read datasets for k-fold cross-validation from files in a directory.

Parameters:

data_path (str): The path to the directory containing the dataset files.
k (int, optional): The number of folds for cross-validation (default 5).
prefix (str, optional): The prefix of the dataset file names (default 'specific').
suffix (str, optional): The suffix of the dataset file names (default '.csv').
tid_col (str, optional): The name of the column representing trajectory IDs (default 'tid').
class_col (str or None, optional): The name of the column representing class labels. If None, no class column is used (default None).
missing (str, optional): The placeholder value used to denote missing data (default '-999').

Returns:

ktrain (list of pandas.DataFrame): The training datasets for each fold.
ktest (list of pandas.DataFrame): The testing datasets for each fold.

matdata.dataset.repository_datasets()[source]

Read the datasets available in the repository and organize them by category.

Returns:

dict: A dictionary containing lists of datasets, where each category is a key.

matdata.dataset.translateCategory(dataset, category, descName=None)[source]

matdata.dataset.translateDesc(dataset, category, descName)[source]

matdata.generator module

class matdata.generator.AttributeGenerator(name='attr', atype='nominal', method='random', interval=None, n=-1, dependency=None, cellSize=1, adjacents=None, precision=2)[source]

Bases: object

descType()[source]

next()[source]

nextn(n)[source]

class matdata.generator.NominalGenerator(method='random', n=50, interval=None)[source]

Bases: object

next()[source]

nextn(n)[source]

static nominalInterval(n)[source]

class matdata.generator.NumericGenerator(method='random', start=0, end=100, precision=2)[source]

Bases: object

next()[source]

nextn(n)[source]

class matdata.generator.SpatialGrid2D(X=(1, 5), Y=(1, 5), cellSize=1, spatial_adjacents=None, precision=2, dependency=[])[source]

Bases: object

SPATIAL_ADJACENTS_1 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

SPATIAL_ADJACENTS_2 = [(-2, -2), (-2, -1), (-2, 0), (-2, 1), (-2, 2), (-1, -2), (-1, -1), (-1, 0), (-1, 1), (-2, 2), (0, -2), (0, -1), (0, 1), (-2, 2), (1, -2), (1, -1), (1, 0), (1, 1), (-2, 2), (2, -2), (2, -1), (2, 0), (2, 1), (-2, 2)]

adjacents(cell)[source]

next()[source]

nextin(cell)[source]

nextn(n)[source]

position(x, y)[source]

randomRoute(startCell, n)[source]

size()[source]

text(point)[source]
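
The SPATIAL_ADJACENTS_1 constant lists the eight offsets surrounding a cell. A minimal sketch of how such offsets can be generated and applied to a bounded grid follows. The helpers `neighborhood_offsets` and `adjacent_cells` are hypothetical, and the treatment of cells as integer (x, y) pairs within inclusive X and Y ranges is an assumption; the actual behavior of `adjacents(cell)` is not documented here.

```python
from itertools import product


def neighborhood_offsets(radius):
    """Return all (dx, dy) offsets within the given radius, excluding (0, 0)."""
    span = range(-radius, radius + 1)
    return [(dx, dy) for dx, dy in product(span, span) if (dx, dy) != (0, 0)]


def adjacent_cells(cell, x_range=(1, 5), y_range=(1, 5), radius=1):
    """List the neighbors of a cell that fall inside the grid bounds.

    Assumes integer cell coordinates and inclusive ranges.
    """
    x, y = cell
    cells = []
    for dx, dy in neighborhood_offsets(radius):
        nx, ny = x + dx, y + dy
        if x_range[0] <= nx <= x_range[1] and y_range[0] <= ny <= y_range[1]:
            cells.append((nx, ny))
    return cells


offsets_1 = neighborhood_offsets(1)        # same values as SPATIAL_ADJACENTS_1
corner_neighbors = adjacent_cells((1, 1))  # a corner cell keeps only 3 neighbors
```

For radius 2 the same construction yields 24 distinct offsets, including (-1, 2), (0, 2), (1, 2), and (2, 2); the SPATIAL_ADJACENTS_2 value listed above shows (-2, 2) in those positions instead.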

matdata.generator.cycleGenerators(L, generators)[source]

matdata.generator.default_types()[source]

matdata.generator.getMiddleE(X)[source]

matdata.generator.getSamplingData(base_data, cols_for_sampling)[source]

matdata.generator.getScale(start=100, n_ele=10)[source]

matdata.generator.instantiate_generators(attr_desc=[{'atype': 'space', 'interval': [(0.0, 1000.0), (0.0, 1000.0)], 'method': 'grid_cell', 'name': 'space'}, {'atype': 'time', 'interval': [0, 1440], 'method': 'random', 'name': 'time'}, {'atype': 'numeric', 'interval': [-1000, 1000], 'method': 'random', 'name': 'n1'}, {'atype': 'numeric', 'interval': [0.0, 1000.0], 'method': 'random', 'name': 'n2'}, {'atype': 'nominal', 'method': 'random', 'n': 1000, 'name': 'nominal'}, {'atype': 'day', 'interval': ['Monday', 'Tuesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Wednesday'], 'method': 'random', 'name': 'day'}, {'atype': 'weather', 'interval': ['Clear', 'Clouds', 'Fog', 'Unknown', 'Rain', 'Snow'], 'method': 'random', 'name': 'weather'}, {'atype': 'category', 'dependency': 'space', 'interval': ['Residence', 'Food', 'Travel & Transport', 'Professional & Other Places', 'Shop & Service', 'Outdoors & Recreation', 'College & University', 'Arts & Entertainment', 'Nightlife Spot', 'Event'], 'method': 'random', 'name': 'category'}])[source]

matdata.generator.randomGenerator(N=10, M=50, L=10, C=10, random_seed=1, fileprefix='random', fileposfix='train', attr_desc=None, save_to=False, outformats=['csv'])[source]

Function to generate trajectories based on random data.

Parameters:

N (int, optional): Number of trajectories (default 10)
M (int, optional): Size of trajectories (default 50)
L (int, optional): Number of attributes (default 10)
C (int, optional): Number of classes (default 10)
random_seed (int, optional): Random seed (default 1)
attr_desc (list of dict, optional): Data type intervals used to generate attributes, given as a list of descriptive dicts or as a list of AttributeGenerator instances. Default: None (uses default types)
save_to (str or bool, optional): Destination folder to save, or False if not to save CSV files (default False)
fileprefix (str, optional): Output filename prefix (default 'random')
fileposfix (str, optional): Output filename postfix (default 'train')
outformats (list, optional): Output file formats for saving (default ['csv'])

Returns:

pandas.DataFrame: The generated dataset.
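
The descriptive dicts accepted by `attr_desc` follow the shape visible in the `instantiate_generators` default above (keys such as `name`, `atype`, `method`, and `interval`). The sketch below shows how descriptions of that shape could drive a random generator. It is not the matdata generator: the functions `random_value` and `random_trajectories`, the round-robin class assignment, the `C1`, `C2` label names, and the reading of N as the total number of trajectories are all assumptions made for illustration.

```python
import random

# Two descriptions in the same dict shape as the defaults listed above.
EXAMPLE_ATTR_DESC = [
    {"name": "n1", "atype": "numeric", "method": "random", "interval": [-1000, 1000]},
    {"name": "weather", "atype": "weather", "method": "random",
     "interval": ["Clear", "Clouds", "Fog", "Unknown", "Rain", "Snow"]},
]


def random_value(desc, rng, precision=2):
    """Draw one value: uniform for numeric types, a random choice otherwise."""
    interval = desc["interval"]
    if desc["atype"] == "numeric":
        return round(rng.uniform(interval[0], interval[1]), precision)
    return rng.choice(interval)


def random_trajectories(N=10, M=50, C=10, random_seed=1, attr_desc=None):
    """Generate N trajectories of M points each as a list of row dicts."""
    attr_desc = EXAMPLE_ATTR_DESC if attr_desc is None else attr_desc
    rng = random.Random(random_seed)
    rows = []
    for tid in range(1, N + 1):
        label = f"C{(tid - 1) % C + 1}"  # assumed round-robin class assignment
        for _ in range(M):
            row = {"tid": tid, "label": label}
            for desc in attr_desc:
                row[desc["name"]] = random_value(desc, rng)
            rows.append(row)
    return rows


dataset = random_trajectories(N=4, M=3, C=2)
```

Seeding a dedicated `random.Random` instance keeps the output reproducible for a given `random_seed` without touching the global random state.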

matdata.generator.random_set(N, M, L, label, j, generators)[source]

matdata.generator.random_trajectory(M, L, tid, generators)[source]

matdata.generator.sample_set(df_for_sampling, N, M, label, j)[source]

matdata.generator.sample_trajectory(df, M, tid)[source]

matdata.generator.samplerGenerator(N=10, M=50, C=1, random_seed=1, fileprefix='sample', fileposfix='train', cols_for_sampling=['space', 'time', 'day', 'rating', 'price', 'weather', 'root_type', 'type'], save_to=False, base_data=None, outformats=['csv'])[source]

Function to generate trajectories based on real data.

Parameters:

N (int, optional): Number of trajectories (default 10)
M (int, optional): Size of trajectories, number of points (default 50)
C (int, optional): Number of classes (default 1)
random_seed (int, optional): Random seed (default 1)
cols_for_sampling (list, optional): Columns to add in the generated dataset. Default: ['space', 'time', 'day', 'rating', 'price', 'weather', 'root_type', 'type'].
save_to (str or bool, optional): Destination folder to save, or False if not to save CSV files (default False)
fileprefix (str, optional): Output filename prefix (default 'sample')
fileposfix (str, optional): Output filename postfix (default 'train')
base_data (DataFrame, optional): DataFrame of trajectories to use as a base for sampling data. Default: None (uses example data)
outformats (list, optional): Output file formats for saving (default ['csv'])

Returns:

pandas.DataFrame: The generated dataset.

matdata.generator.scalerRandomGenerator(Ns=[100, 10], Ms=[10, 10], Ls=[8, 10], Cs=[2, 10], random_seed=1, fileprefix='scalability', fileposfix='train', attr_desc=None, save_to=None, save_desc_files=True, outformats=['csv'])[source]

Function to generate trajectory datasets based on random data.

Parameters:

Ns (list of int, optional): Parameters to scale the number of trajectories. List of 2 values: starting number, number of elements (default [100, 10])
Ms (list of int, optional): Parameters to scale the size of trajectories. List of 2 values: starting number, number of elements (default [10, 10])
Ls (list of int, optional): Parameters to scale the number of attributes (* doubles the columns). List of 2 values: starting number, number of elements (default [8, 10])
Cs (list of int, optional): Parameters to scale the number of classes. List of 2 values: starting number, number of elements (default [2, 10])
random_seed (int, optional): Random seed (default 1)
attr_desc (list, optional): Data type intervals used to generate attributes, given as a list of descriptive dicts. Default: None (uses default types)
save_to (str or bool, optional): Destination folder to save, or False if not to save CSV files (default None)
fileprefix (str, optional): Output filename prefix (default 'scalability')
fileposfix (str, optional): Output filename postfix (default 'train')
save_desc_files (bool, optional): True to save the .json description files, False otherwise (default True)
outformats (list, optional): Output file formats for saving (default ['csv'])

Returns:

None

matdata.generator.scalerSamplerGenerator(Ns=[100, 10], Ms=[10, 10], Ls=[8, 10], Cs=[2, 10], random_seed=1, fileprefix='scalability', fileposfix='train', cols_for_sampling=['space', 'time', 'day', 'rating', 'price', 'weather', 'root_type', 'type'], save_to=None, base_data=None, save_desc_files=True, outformats=['csv'])[source]

Generates trajectory datasets based on real data.

Parameters:

Ns (list of int, optional): Parameters to scale the number of trajectories. List of 2 values: starting number, number of elements (default [100, 10])
Ms (list of int, optional): Parameters to scale the size of trajectories. List of 2 values: starting number, number of elements (default [10, 10])
Ls (list of int, optional): Parameters to scale the number of attributes (* doubles the columns). List of 2 values: starting number, number of elements (default [8, 10])
Cs (list of int, optional): Parameters to scale the number of classes. List of 2 values: starting number, number of elements (default [2, 10])
random_seed (int, optional): Random seed (default 1)
fileprefix (str, optional): Output filename prefix (default 'scalability')
fileposfix (str, optional): Output filename postfix (default 'train')
cols_for_sampling (list or dict, optional): Columns to add in the generated dataset. Default: ['space', 'time', 'day', 'rating', 'price', 'weather', 'root_type', 'type']. A dictionary in the format {'aspectName': 'type', 'aspectName': 'type'} may be given instead; it is used when providing base_data and saving .MAT.
save_to (str or bool, optional): Destination folder to save, or False if not to save CSV files (default None)
base_data (DataFrame, optional): DataFrame of trajectories to use as a base for sampling data. Default: None (uses example data)
save_desc_files (bool, optional): True to save the .json description files, False otherwise (default True)
outformats (list, optional): Output file formats for saving (default ['csv'])

Returns:

None

matdata.preprocess module

matdata.preprocess.convertDataset(dir_path, k=None, cols=None, fileprefix='', tid_col='tid', class_col='label')[source]

matdata.preprocess.countClasses(data_path, folder, file='train.csv', tid_col='tid', class_col='label', markd=False)[source]

Counts the occurrences of each class label in a dataset.

Parameters:

data_path (str): The directory path where the dataset file is located.
folder (str): The subfolder within the data path where the dataset file is located.
file (str, optional, default='train.csv'): The name of the dataset file to be read.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
markd (bool, optional, default=False): A flag indicating whether to print the class counts in Markdown format.

Returns:

pandas.DataFrame or str: If markd is False, prints the Markdown text and returns a DataFrame containing the counts of each class label in the dataset. If markd is True, returns a Markdown str with the counts of each class label in the dataset.
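
Class counting and its Markdown rendering can be sketched with the standard library. The sketch assumes that counts refer to distinct trajectories per class rather than to individual points, which is not confirmed here; `count_classes` and `counts_to_markdown` are hypothetical helpers and the table layout is illustrative only.

```python
from collections import Counter


def count_classes(rows, tid_col="tid", class_col="label"):
    """Count distinct trajectories per class label (hypothetical helper)."""
    # One label per trajectory id; later points overwrite earlier ones.
    label_by_tid = {row[tid_col]: row[class_col] for row in rows}
    return Counter(label_by_tid.values())


def counts_to_markdown(counts):
    """Render the counts as a two-column Markdown table."""
    lines = ["| label | trajectories |", "| --- | --- |"]
    for label, n in sorted(counts.items()):
        lines.append(f"| {label} | {n} |")
    return "\n".join(lines)


rows = [
    {"tid": 1, "label": "A"},
    {"tid": 1, "label": "A"},
    {"tid": 2, "label": "B"},
    {"tid": 3, "label": "A"},
]
counts = count_classes(rows)
table = counts_to_markdown(counts)
```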

matdata.preprocess.countClasses_df(df, tid_col='tid', class_col='label', markd=False)[source]

matdata.preprocess.datasetStatistics(data_path, folder, file_prefix='', tid_col='tid', class_col='label', to_file=False)[source]

Computes statistics for a dataset, including summary statistics for each column and class distribution into a markdown file format.

Parameters:

data_path (str): The directory path where the dataset file(s) are located.
folder (str): The subfolder within the data path where the dataset file(s) are located.
file_prefix (str, optional, default=''): The prefix to be added to the dataset file names.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
to_file (bool or str, optional, default=False): Whether to save the statistics to a file. If a str is given, it is used as the output file name.

Returns:

str: If to_file is False, prints the Markdown and returns a str containing the computed statistics. If to_file is a str, returns the Markdown str and saves the statistics to a file with that name.

matdata.preprocess.dfStats(df)[source]

Computes summary statistics for each column in a DataFrame.

Parameters:

df (pandas.DataFrame): The DataFrame for which statistics are to be computed.

Returns:

pandas.DataFrame: A DataFrame containing summary statistics for each column, including mean, standard deviation, and variance. Columns are sorted by variance in descending order.
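
The statistics named above (mean, standard deviation, variance, sorted by variance in descending order) can be reproduced for plain numeric columns with the `statistics` module. This is a sketch under assumptions, not the `dfStats` implementation: `column_stats` is a hypothetical helper, it uses sample variance (n - 1 in the denominator), and it does not handle nominal columns.

```python
import statistics


def column_stats(columns):
    """Summarize numeric columns and sort them by variance, largest first.

    columns maps a column name to a list of numeric values.
    """
    stats = []
    for name, values in columns.items():
        stats.append({
            "column": name,
            "mean": statistics.mean(values),
            "std": statistics.stdev(values),          # sample standard deviation
            "variance": statistics.variance(values),  # sample variance
        })
    return sorted(stats, key=lambda entry: entry["variance"], reverse=True)


stats = column_stats({
    "speed": [1.0, 2.0, 3.0],
    "altitude": [10.0, 20.0, 30.0],
})
```

Sorting by variance places the most widely spread columns first, which is why `altitude` precedes `speed` in this example.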

matdata.preprocess.dfVariance(df)[source]

Computes the variance for each column in a DataFrame.

Parameters:

df (pandas.DataFrame): The DataFrame for which variance is to be computed.

Returns:

pandas.Series: A Series containing the variance for each column in the DataFrame.

matdata.preprocess.dropLabelsltk(df, k, tid_col='tid', class_col='label')[source]

matdata.preprocess.featuresJSON(df, version=1, deftype='nominal', defcomparator='equals', tid_col='tid', label_col='label', file=False)[source]

Generates a JSON representation of features from a DataFrame.

Parameters:

df (pandas.DataFrame): The DataFrame containing the dataset.
version (int, optional, default=1): The version of the JSON schema (1 for the MASTERMovelets format, 2 for the HiPerMovelets format).
deftype (str, optional, default='nominal'): The default type of features.
defcomparator (str, optional, default='equals'): The default comparator for features.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
label_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
file (bool or str, optional, default=False): Where to save the JSON representation. If False, nothing is saved; if a str, the JSON is saved to a file with that name.

Returns:

str: The features in JSON format. If file is a str, the JSON is also saved to the file named by file.

matdata.preprocess.joinTrainTest(dir_path, train_file='train.csv', test_file='test.csv', tid_col='tid', class_col='label', to_file=False)[source]

Joins training and testing datasets from separate files into a single DataFrame.

Parameters:

dir_path (str): The directory path where the training and testing files are located.
train_file (str, optional, default='train.csv'): The name of the training file to be read.
test_file (str, optional, default='test.csv'): The name of the testing file to be read.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
to_file (bool, optional, default=False): A flag indicating whether to save the joined DataFrame to a file named 'joined.csv'.

Returns:

pandas.DataFrame: A DataFrame containing the joined training and testing data. If to_file is True, the DataFrame is also saved to a file named 'joined.csv'.

matdata.preprocess.joinTrainTest_df(train, test, tid_col='tid', class_col='label', sort=True)[source]

matdata.preprocess.kfold_stratify(df, k=10, inc=1, limit=10, random_num=1, tid_col='tid', class_col='label', fileprefix='', ktrain=None, ktest=None, organize_columns=True, mat_columns=None, data_path='.', outformats=[], ignore_ltk=True, sort=True)[source]

matdata.preprocess.kfold_trainTestSplit(df, k, random_num=1, tid_col='tid', class_col='label', fileprefix='', columns_order=None, ktrain=None, ktest=None, mat_columns=None, data_path='.', outformats=[], verbose=False, sort=True)[source]

Splits a DataFrame into k folds for k-fold cross-validation, optionally organizes columns, and saves them to files.

Parameters:

df (pandas.DataFrame): The DataFrame to be split into k folds.
k (int): The number of folds for cross-validation.
random_num (int, optional, default=1): The random seed for reproducible results.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
fileprefix (str, optional, default=''): The prefix to be added to the file names when saving, for example 'specific_' or 'generic_'.
columns_order (list of str, optional): A list of column names specifying the desired order of columns. If None, no reordering is performed.
ktrain (list of pandas.DataFrame, optional): A list of training sets for each fold. If None, the function will split the data into training and testing sets.
ktest (list of pandas.DataFrame, optional): A list of testing sets for each fold. If None, the function will split the data into training and testing sets.
mat_columns (list of str, optional): A list of column names to be included in the .mat files, corresponding to columns_order.
data_path (str, optional, default='.'): The directory path where the output files will be saved.
outformats (list of str, optional): A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).
verbose (bool, optional, default=False): A flag indicating whether to display progress messages.
sort (bool, optional, default=True): If True, sort the data by class_col and tid_col.

Returns:

ktrain (list of pandas.DataFrame): A list of DataFrames containing the training sets.
ktest (list of pandas.DataFrame): A list of DataFrames containing the testing sets.
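
To make the fold structure concrete, here is a minimal standard-library sketch of a class-stratified k-fold split over trajectory ids. It is not the library's algorithm: the helper name kfold_tids is hypothetical, rows are plain dicts, the result holds sets of trajectory ids rather than DataFrames, and the round-robin assignment is one simple way to balance classes across folds.

```python
import random
from collections import defaultdict


def kfold_tids(rows, k, seed=1, tid_col="tid", class_col="label"):
    """Return (ktrain, ktest): k sets of trajectory ids each."""
    # One label per trajectory, then group trajectory ids by class.
    label_of = {row[tid_col]: row[class_col] for row in rows}
    tids_by_class = defaultdict(list)
    for tid, label in label_of.items():
        tids_by_class[label].append(tid)

    rng = random.Random(seed)  # seeded for reproducible folds
    ktest = [set() for _ in range(k)]
    for label in sorted(tids_by_class):
        tids = sorted(tids_by_class[label])
        rng.shuffle(tids)
        # Deal the shuffled ids round-robin so each fold sees each class.
        for i, tid in enumerate(tids):
            ktest[i % k].add(tid)

    all_tids = set(label_of)
    ktrain = [all_tids - fold for fold in ktest]
    return ktrain, ktest


# Six trajectories with two points each: tids 1-3 are class A, 4-6 class B.
rows = [
    {"tid": t, "label": "A" if t <= 3 else "B"}
    for t in range(1, 7)
    for _ in range(2)
]
ktrain, ktest = kfold_tids(rows, k=3)
for fold, (train, test) in enumerate(zip(ktrain, ktest), start=1):
    print(fold, sorted(train), sorted(test))
```

Splitting on trajectory ids rather than on rows keeps all points of a trajectory on the same side of each fold, so no trajectory leaks between training and testing.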

matdata.preprocess.klabels_extract(df, kl=10, random_num=1, tid_col='tid', class_col='label', organize_columns=True, sort=True)[source]

Extracts the data for a specified number of class labels from a DataFrame, optionally organizing columns.

Parameters:

df (pandas.DataFrame): The DataFrame from which the class labels are extracted.
kl (int, optional, default=10): The number of class labels to extract from the DataFrame.
random_num (int, optional, default=1): The random seed for reproducible results.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
organize_columns (bool, optional, default=True): A flag indicating whether to organize columns.
sort (bool, optional, default=True): If True, sort the data by class_col and tid_col.

Returns:

data (pandas.DataFrame): A DataFrame containing the dataset.

matdata.preprocess.klabels_stratify(df, kl=10, train_size=0.7, random_num=1, tid_col='tid', class_col='label', organize_columns=True, mat_columns=None, fileprefix='', outformats=[], data_path='.', sort=True)[source]

Stratifies a DataFrame by a specified number of class labels and splits it into training and testing sets, optionally organizes columns, and saves them to files.

Parameters:

df (pandas.DataFrame): The DataFrame to be stratified and split into training and testing sets.
kl (int, optional, default=10): The number of class labels to stratify the DataFrame.
train_size (float, optional, default=0.7): The proportion of the stratified dataset to include in the training set.
random_num (int, optional, default=1): The random seed for reproducible results.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
organize_columns (bool, optional, default=True): A flag indicating whether to organize columns before saving.
mat_columns (list of str, optional, unused for now): A list of column names to be included in the .mat files, if set to save.
fileprefix (str, optional, default=''): The prefix to be added to the file names when saving.
outformats (list of str, optional): A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).
data_path (str, optional, default='.'): The directory path where the output files will be saved.
sort (bool, optional, default=True): If True, sort the data by class_col and tid_col.

Returns:

train (pandas.DataFrame): A DataFrame containing the training set.
test (pandas.DataFrame): A DataFrame containing the testing set.

matdata.preprocess.labels_extract(df, labels=[], tid_col='tid', class_col='label', organize_columns=True)[source]

matdata.preprocess.organizeFrame(df, columns_order=None, tid_col='tid', class_col='label', make_spatials=False)[source]

Organizes a DataFrame by reordering columns and optionally converting spatial columns.

Parameters:

df (pandas.DataFrame): The DataFrame to be organized.
columns_order (list of str, optional): A list of column names specifying the desired order of columns. If None, no reordering is performed.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
make_spatials (bool, optional, default=False): A flag indicating whether to provide the spatial columns in both representations: separate lat/lon columns, and the space format, which is lat and lon concatenated in one column.

Returns:

pandas.DataFrame: A DataFrame containing the organized data, with columns added as specified and spatial columns converted if requested.
columns_order_zip (list): A list of the columns including the space column, if present.
columns_order_csv (list): A list of the columns including the lat/lon columns, if present.
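
The two spatial representations mentioned above can be converted into each other with a few lines of code. The sketch below is an assumption-laden illustration, not matdata's code: the helper names add_space and split_space are hypothetical, the column names 'lat', 'lon', and 'space' are assumed, and a single blank is assumed as the separator inside the space value.

```python
def add_space(row, lat_col="lat", lon_col="lon", space_col="space"):
    """Replace separate lat/lon fields with one concatenated space field."""
    out = {k: v for k, v in row.items() if k not in (lat_col, lon_col)}
    # Assumed encoding: "<lat> <lon>", separated by a single blank.
    out[space_col] = f"{row[lat_col]} {row[lon_col]}"
    return out


def split_space(row, lat_col="lat", lon_col="lon", space_col="space"):
    """Replace the concatenated space field with separate lat/lon fields."""
    lat, lon = row[space_col].split()
    out = {k: v for k, v in row.items() if k != space_col}
    out[lat_col] = float(lat)
    out[lon_col] = float(lon)
    return out


point = {"tid": 1, "label": "A", "lat": 40.7, "lon": -74.0}
merged = add_space(point)
print(merged)               # lat/lon replaced by a single space field
print(split_space(merged))  # back to separate lat/lon fields
```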

matdata.preprocess.readDataset(data_path, folder=None, file='train.csv', class_col='label', tid_col='tid', missing=None)[source]

Reads a dataset file (CSV format by default, 'train.csv') and returns it as a pandas DataFrame.

Parameters:

data_path (str): The directory path where the dataset file is located.
folder (str, optional): The subfolder within the data path where the dataset file is located.
file (str, optional, default='train.csv'): The name of the dataset file to be read.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
missing (str, optional, default=None): The placeholder for missing values in the dataset, for example '?'.

Returns:

pandas.DataFrame: A DataFrame containing the dataset from the specified file, with trajectory identifier, class label, and missing values handled as specified.

matdata.preprocess.readDsDesc(data_path, folder=None, file='train.csv', tid_col='tid', class_col='label', missing='?')[source]

matdata.preprocess.sortByLabel(df, tid_col='tid', class_col='label')[source]

Sort a DataFrame by class label column and trajectory ID column.

Parameters:

df (pandas.DataFrame): The DataFrame to be sorted.
tid_col (str, optional, default='tid'): The name of the column representing trajectory IDs.
class_col (str, optional, default='label'): The name of the column representing class labels.

Returns:

pandas.DataFrame: The sorted DataFrame.
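
The ordering described here is a two-key sort: class label first, trajectory ID second. A minimal standard-library sketch follows; the helper name sort_by_label is hypothetical, rows are plain dicts standing in for DataFrame rows, and the 't' field is an illustrative point index.

```python
def sort_by_label(rows, tid_col="tid", class_col="label"):
    """Order rows by class label, then by trajectory id."""
    # sorted() is stable, so the points of one trajectory keep their order.
    return sorted(rows, key=lambda row: (row[class_col], row[tid_col]))


rows = [
    {"tid": 3, "label": "B", "t": 0},
    {"tid": 1, "label": "B", "t": 0},
    {"tid": 2, "label": "A", "t": 0},
    {"tid": 2, "label": "A", "t": 1},
    {"tid": 1, "label": "B", "t": 1},
]
ordered = sort_by_label(rows)
print([(r["label"], r["tid"], r["t"]) for r in ordered])
```

Because the sort is stable, the two points of trajectory 1 stay in their original sequence even though they were not adjacent in the input.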

matdata.preprocess.sortByTID(df_, tid_col='tid')[source]

Sort a DataFrame by trajectory ID column.

Parameters:

df_ (pandas.DataFrame): The DataFrame to be sorted.
tid_col (str, optional, default='tid'): The name of the column representing trajectory IDs.

Returns:

pandas.DataFrame: The sorted DataFrame.

matdata.preprocess.splitData(df, k, random_num, tid_col='tid', class_col='label', opLabel='Spliting Data', ignore_ltk=True)[source]

matdata.preprocess.splitTIDs(df, train_size=0.7, random_num=1, tid_col='tid', class_col='label', min_elements=1, opLabel='Spliting Data (class-balanced)')[source]

matdata.preprocess.splitTIDsUnbalanced(df, train_size=0.7, random_num=1, tid_col='tid', min_elements=1, opLabel='Spliting Data')[source]

matdata.preprocess.splitframe(data, name='tid')[source]

matdata.preprocess.stratify(df, sample_size=0.5, random_num=1, tid_col='tid', class_col='label', organize_columns=True, sort=True, opLabel='Data Stratification (class-balanced)')[source]

Stratifies a DataFrame by class label, sampling a proportion of the dataset, and optionally organizes columns.

Parameters:

df (pandas.DataFrame): The DataFrame to be stratified.
sample_size (float, optional, default=0.5): The proportion of the dataset to sample for stratification.
random_num (int, optional, default=1): The random seed for reproducible results.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
organize_columns (bool, optional, default=True): A flag indicating whether to organize columns.
sort (bool, optional, default=True): If True, sort the data by class_col and tid_col.

Returns:

pandas.DataFrame: A DataFrame containing the stratified set.

matdata.preprocess.stratifyTrainTest(df, sample_size=0.5, train_size=0.7, random_num=1, tid_col='tid', class_col='label', organize_columns=True, mat_columns=None, fileprefix='', outformats=[], data_path='.', sort=True)[source]

Stratifies a DataFrame by class label and splits it into training and testing sets, optionally organizes columns, and saves them to files.

Parameters:

df (pandas.DataFrame): The DataFrame to be stratified and split into training and testing sets.
sample_size (float, optional, default=0.5): The proportion of the dataset to sample for stratification.
train_size (float, optional, default=0.7): The proportion of the stratified dataset to include in the training set.
random_num (int, optional, default=1): The random seed for reproducible results.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
organize_columns (bool, optional, default=True): A flag indicating whether to organize columns before saving.
mat_columns (list of str, optional, unused for now): A list of column names to be included in the .mat files, if set to save.
fileprefix (str, optional, default=''): The prefix to be added to the file names when saving.
outformats (list of str, optional): A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).
data_path (str, optional, default='.'): The directory path where the output files will be saved.
sort (bool, optional, default=True): If True, sort the data by class_col and tid_col.

Returns:

train (pandas.DataFrame): A DataFrame containing the training set.
test (pandas.DataFrame): A DataFrame containing the testing set.

matdata.preprocess.suffle_df(df, tid_col='tid', random_num=1)[source]

Shuffle a DataFrame by trajectory ID column.

Parameters:

df (pandas.DataFrame): The DataFrame to be shuffled.
tid_col (str, optional, default='tid'): The name of the column representing trajectory IDs.
random_num (int, optional, default=1): Random seed for reproducibility.

Returns:

pandas.DataFrame: The shuffled DataFrame.
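
Shuffling by trajectory ID means permuting whole trajectories rather than individual rows. The following standard-library sketch shows one way to do that; it is an illustration under stated assumptions, not the library's implementation. The helper name shuffle_by_tid is hypothetical and rows are plain dicts.

```python
import random


def shuffle_by_tid(rows, tid_col="tid", seed=1):
    """Shuffle the order of trajectories, keeping each one's points together."""
    # Collect rows per trajectory, preserving point order within each one.
    groups = {}
    for row in rows:
        groups.setdefault(row[tid_col], []).append(row)

    tids = list(groups)
    random.Random(seed).shuffle(tids)  # seeded, so the result is repeatable
    return [row for tid in tids for row in groups[tid]]


# Three trajectories with two points each.
rows = [{"tid": t, "step": s} for t in (1, 2, 3) for s in (0, 1)]
shuffled = shuffle_by_tid(rows, seed=1)
print([(r["tid"], r["step"]) for r in shuffled])
```

Calling the function again with the same seed returns the same order, which is what makes a seeded shuffle usable in reproducible experiments.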

matdata.preprocess.trainTestSplit(df, train_size=0.7, random_num=1, tid_col='tid', class_col='label', fileprefix='', data_path='.', outformats=[], verbose=False, organize_columns=True, sort=True)[source]

Splits a DataFrame into training and testing sets, optionally organizes columns, and saves them to files.

Parameters:

df (pandas.DataFrame): The DataFrame to be split into training and testing sets.
train_size (float, optional, default=0.7): The proportion of the dataset to include in the training set.
random_num (int, optional, default=1): The random seed for reproducible results.
tid_col (str, optional, default='tid'): The name of the column to be used as the trajectory identifier.
class_col (str, optional, default='label'): The name of the column to be treated as the class/label column.
fileprefix (str, optional, default=''): The prefix to be added to the file names when saving.
data_path (str, optional, default='.'): The directory path where the output files will be saved.
outformats (list of str, optional): A list of output formats for saving the datasets (e.g., ['csv', 'parquet']).
verbose (bool, optional, default=False): A flag indicating whether to display progress messages.
organize_columns (bool, optional, default=True): A flag indicating whether to organize columns before saving.
sort (bool, optional, default=True): If True, sort the data by class_col and tid_col.

Returns:

train (pandas.DataFrame): A DataFrame containing the training set.
test (pandas.DataFrame): A DataFrame containing the testing set.
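
A hold-out split of trajectory data is usually made per class over trajectory ids, so that both sets keep the class proportions. The sketch below shows that idea with the standard library only. It is not the library's implementation: the helper name split_tids is hypothetical, rows are plain dicts, the result holds sets of trajectory ids, and rounding the per-class cut with round() is an assumption.

```python
import random
from collections import defaultdict


def split_tids(rows, train_size=0.7, seed=1, tid_col="tid", class_col="label"):
    """Return (train_tids, test_tids), split class by class."""
    # One label per trajectory, then group trajectory ids by class.
    label_of = {row[tid_col]: row[class_col] for row in rows}
    by_class = defaultdict(list)
    for tid, label in label_of.items():
        by_class[label].append(tid)

    rng = random.Random(seed)
    train, test = set(), set()
    for label in sorted(by_class):
        tids = sorted(by_class[label])
        rng.shuffle(tids)
        cut = round(len(tids) * train_size)  # per-class training share
        train.update(tids[:cut])
        test.update(tids[cut:])
    return train, test


# Twenty single-point trajectories: tids 0-9 are class A, 10-19 class B.
rows = [{"tid": t, "label": "A" if t < 10 else "B"} for t in range(20)]
train, test = split_tids(rows, train_size=0.7)
print(len(train), len(test))
```

Rows would then be routed to the training or testing set by looking up their tid, which keeps every trajectory whole.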

matdata.preprocess.writeFile(data_path, df, file, tid_col, class_col, columns_order, mat_columns=None, desc_cols=None, outformat='zip', opSuff='')[source]

matdata.preprocess.writeFiles(data_path, file, train, test, tid_col, class_col, columns_order, mat_columns=None, desc_cols=None, outformat='zip', opSuff='')[source]

Module contents

Multiple Aspect Trajectory Tools Framework

MAT-data: Data Preprocessing for Multiple Aspect Trajectory Data Mining

This application supports the user in the classification of multiple aspect trajectories, specifically by extracting and visualizing movelets, the parts of a trajectory that best discriminate a class. It integrates the fragmented approaches available for multiple aspect trajectories, and for multidimensional sequence classification in general, into a single web-based and Python library system. It offers both movelets visualization and classification methods.

@author: Tarlis Portela
