matdata package
Subpackages
- matdata.inc package
- Submodules
- matdata.inc.ts_io module
LongFormatDataParseException
TsFileParseException
from_long_to_nested()
generate_example_long_table()
load_from_arff_to_dataframe()
load_from_long_to_dataframe()
load_from_tsfile()
load_from_tsfile_to_dataframe()
load_from_ucr_tsv_to_dataframe()
write_dataframe_to_tsfile()
write_results_to_uea_format()
- Module contents
- matdata.util package
Submodules
matdata.converter module
MAT-Tools: Python Framework for Multiple Aspect Trajectory Data Mining
The present application offers a tool to support the user in preprocessing multiple aspect trajectory data. It integrates into a single framework for multiple aspect trajectories and, more generally, for multidimensional sequence data mining methods. Copyright (C) 2022, MIT license (this portion of code is subject to licensing from source project distribution)
Created on Dec, 2023 Copyright (C) 2023, License GPL Version 3 or superior (see LICENSE file)
- Authors:
Tarlis Portela
- matdata.converter.any2ts(data_path, folder, file, cols=None, tid_col='tid', class_col='label', opLabel='Converting TS')[source]
Converts data from various formats (CSV, Parquet, etc.) to a time series format.
Parameters:
- data_pathstr
The directory path where the data files are located.
- folderstr
The folder containing the data file to be converted.
- filestr
The name of the data file to be converted.
- colslist of str, optional
A list of column names to be included in the time series data.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- opLabelstr, optional (default=’Converting TS’)
A label describing the operation, useful for logging or display purposes.
Returns:
- pandas.DataFrame
A DataFrame containing the time series data, with trajectory identifier, class label, and specified columns.
- matdata.converter.csv2df(url, class_col='label', tid_col='tid', missing=None)[source]
Converts a CSV file from a given URL into a pandas DataFrame.
Parameters:
- urlstr
The URL pointing to the CSV file to be read.
- class_colstr, optional (default=’label’)
Unused; kept for a consistent signature across converters.
- tid_colstr, optional (default=’tid’)
Unused; kept for a consistent signature across converters.
- missingstr, optional (default=None)
The placeholder for missing values in the CSV file.
Returns:
- pandas.DataFrame
A DataFrame containing the data from the CSV file, with missing values handled as specified and columns renamed if necessary.
- matdata.converter.df2csv(df, data_path, file='train', tid_col='tid', class_col='label', select_cols=None, opLabel='Writing CSV')[source]
Writes a pandas DataFrame to a CSV file.
Parameters:
- dfpandas.DataFrame
The DataFrame to be written to the CSV file.
- data_pathstr
The directory path where the CSV file will be saved.
- filestr, optional (default=’train’)
The base name of the CSV file (without extension).
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- select_colslist of str, optional
A list of column names to be included in the CSV file. If None, all columns are included.
- opLabelstr, optional (default=’Writing CSV’)
A label describing the operation, useful for logging or display purposes.
Returns:
- pandas.DataFrame
The input DataFrame
- matdata.converter.df2mat(df, folder, file, cols=None, mat_cols=None, desc_cols=None, label_columns=None, other_dsattrs=None, tid_col='tid', class_col='label', opLabel='Converting MAT')[source]
Converts a pandas DataFrame to a Multiple Aspect Trajectory .mat file and saves it to the specified folder.
Parameters:
- dfpandas.DataFrame
The DataFrame to be converted to a .mat file.
- folderstr
The directory where the .mat file will be saved.
- filestr
The base name of the .mat file (without extension).
- colslist of str, optional
A list of column names from the DataFrame to include in the .mat file. If None, all columns are included.
- mat_colslist of str, optional
A list of column names representing the trajectory attributes. If None, no columns are used.
- desc_colslist of str, optional
A dict of column descriptors to be included as descriptive metadata.
- label_columnslist of str, optional
A list of column names that can be treated as labels in the .mat file.
- other_dsattrsdict, optional
A dictionary of additional dataset attributes to be included in the .mat file.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- opLabelstr, optional (default=’Converting MAT’)
A label describing the operation, useful for logging or display purposes.
Returns:
None
- matdata.converter.df2parquet(df, data_path, file='train', tid_col='tid', class_col='label', select_cols=None, opLabel='Writing Parquet')[source]
Writes a pandas DataFrame to a Parquet file.
Parameters:
- dfpandas.DataFrame
The DataFrame to be written to the Parquet file.
- data_pathstr
The directory path where the Parquet file will be saved.
- filestr, optional (default=’train’)
The base name of the Parquet file (without extension).
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- select_colslist of str, optional
A list of column names to be included in the Parquet file. If None, all columns are included.
- opLabelstr, optional (default=’Writing Parquet’)
A label describing the operation, useful for logging or display purposes.
Returns:
- pandas.DataFrame
The input DataFrame
- matdata.converter.df2zip(df, data_path, file, tid_col='tid', class_col='label', select_cols=None, opLabel='Writing ZIP')[source]
Writes a pandas DataFrame to a CSV file and compresses it into a ZIP archive.
This format is used by older movelet methods, such as Movelets, MasterMovelets, SuperMovelets, and the Dodge, Xiao, and Zheng feature extractors. In this format, all commas (‘,’) are replaced with ‘_’ to avoid problems when reading CSV trajectory files.
Parameters:
- dfpandas.DataFrame
The DataFrame to be written to the CSV file and then compressed into a ZIP archive.
- data_pathstr
The directory path where the ZIP archive will be saved.
- filestr
The base name of the CSV file (without extension) to be compressed into the ZIP archive.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- select_colslist of str, optional
A list of column names to be included in the CSV file. If None, all columns are included.
- opLabelstr, optional (default=’Writing ZIP’)
A label describing the operation, useful for logging or display purposes.
Returns:
- pandas.DataFrame
The input DataFrame
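The comma replacement described for df2zip can be illustrated with a small standard-library sketch. It is not the library's implementation: the helper name rows_to_zip_bytes, the single-CSV archive layout, and the member name train.csv are all assumptions made for illustration.

```python
import csv
import io
import zipfile


def rows_to_zip_bytes(header, rows, csv_name="train.csv"):
    """Serialize rows to CSV, replacing commas inside values, then zip the CSV."""
    text = io.StringIO()
    writer = csv.writer(text, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        # Replace ',' with '_' so readers that split on commas do not break.
        writer.writerow([str(value).replace(",", "_") for value in row])

    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(csv_name, text.getvalue())
    return archive.getvalue()


header = ["tid", "label", "poi"]
rows = [(1, "A", "Bar, Grill"), (1, "A", "Park")]
payload = rows_to_zip_bytes(header, rows)

# Read the archive back to check what was stored.
with zipfile.ZipFile(io.BytesIO(payload)) as zf:
    content = zf.read("train.csv").decode("utf-8")
print(content)
```

Because the commas are removed before writing, no value needs quoting, which is what makes the file safe for simple line-splitting parsers.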
- matdata.converter.mat2df(url, class_col='label', tid_col='tid', missing=None)[source]
Converts a MATLAB .mat file from a given URL into a pandas DataFrame.
Parameters:
- urlstr
The URL pointing to the .mat file to be read.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the unique trajectory identifier.
- missingstr, optional (default=None)
The placeholder for missing values in the dataset.
Returns:
- pandas.DataFrame
A DataFrame containing the data from the .mat file, with missing values handled as specified and columns renamed if necessary.
Raises:
- Exception
Not Implemented.
- matdata.converter.parquet2df(url, class_col='label', tid_col='tid', missing=None)[source]
Converts a Parquet file from a given URL into a pandas DataFrame.
Parameters:
- urlstr
The URL pointing to the Parquet file to be read.
- class_colstr, optional (default=’label’)
Unused; kept for a consistent signature across converters.
- tid_colstr, optional (default=’tid’)
Unused; kept for a consistent signature across converters.
- missingstr, optional (default=None)
The placeholder for missing values in the dataset.
Returns:
- pandas.DataFrame
A DataFrame containing the data from the Parquet file, with missing values handled as specified and columns renamed if necessary.
- matdata.converter.read_zip(zipFile, cols=None, class_col='label', tid_col='tid', opLabel='Reading ZIP')[source]
- matdata.converter.ts2df(url, class_col='label', tid_col='tid', missing=None)[source]
Converts a time series file from a given URL into a pandas DataFrame.
Parameters:
- urlstr
The URL pointing to the time series file to be read.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the unique trajectory identifier.
- missingstr, optional (default=None)
The placeholder for missing values in the dataset.
Returns:
- pandas.DataFrame
A DataFrame containing the data from the time series file, with missing values handled as specified and columns renamed if necessary.
- matdata.converter.xes2df(url, class_col='label', tid_col='tid', missing=None, opLabel='Converting XES', save=False, start_tid=1)[source]
Converts an XES (eXtensible Event Stream) file from a given URL into a pandas DataFrame.
Parameters:
- urlstr
The URL pointing to the XES file to be read.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- opLabelstr, optional (default=’Converting XES’)
A label describing the operation, useful for logging or display purposes.
- savebool, optional (default=False)
A flag indicating whether to save the DataFrame to a file after conversion.
- start_tidint, optional (default=1)
The starting value for the generated trajectory identifiers, since the tid_col values need to be generated.
Returns:
- pandas.DataFrame
A DataFrame containing the data from the XES file, with columns renamed if necessary.
- matdata.converter.zip2arf(folder, file, cols, tid_col='tid', class_col='label', missing='?', opLabel='Reading ZIP')[source]
Extracts a CSV file from a ZIP archive and converts it into an ARFF (Attribute-Relation File Format) file.
Parameters:
- folderstr
The directory path where the ZIP archive is located.
- filestr
The name of the ZIP archive file (with or without extension).
- colslist of str
A list of column names to be included in the ARFF file.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- missingstr, optional (default=’?’)
The placeholder for missing values in the CSV file.
- opLabelstr, optional (default=’Reading ZIP’)
A label describing the operation, useful for logging or display purposes.
Returns:
- pandas.DataFrame
A DataFrame containing the data from the extracted ZIP file, with missing values handled as specified and columns renamed if necessary.
- matdata.converter.zip2csv(folder, file, cols, class_col='label', tid_col='tid', missing='?')[source]
Extracts and compiles trajectory CSV files from a ZIP archive and converts them into a pandas DataFrame.
Parameters:
- folderstr
The directory path where the ZIP archive is located, and the destination for the resulting CSV file.
- filestr
The name of the ZIP archive file (with or without extension).
- colslist of str
A list of column names to be included in the DataFrame.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- missingstr, optional (default=’?’)
The placeholder for missing values in the CSV file.
Returns:
- pandas.DataFrame
A DataFrame containing the data from the extracted CSV file, with missing values handled as specified and columns renamed if necessary.
- matdata.converter.zip2df(url, class_col='label', tid_col='tid', missing='?', opLabel='Reading ZIP')[source]
Extracts and converts a CSV trajectory file from a ZIP archive located at a given URL into a pandas DataFrame.
Parameters:
- urlstr
The URL pointing to the ZIP archive containing the CSV file to be read.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the unique trajectory identifier.
- missingstr, optional (default=’?’)
The placeholder for missing values in the CSV file.
- opLabelstr, optional (default=’Reading ZIP’)
A label describing the operation, for logging purposes.
Returns:
- pandas.DataFrame
A DataFrame containing the data from the extracted CSV file, with missing values handled as specified and columns renamed if necessary.
matdata.dataset module
- matdata.dataset.load_ds(dataset='mat.FoursquareNYC', prefix='', missing=None, sample_size=1, random_num=1, sort=True)[source]
Load a dataset for training or testing from a GitHub repository.
Parameters:
- datasetstr, optional
The name of the dataset to load (default ‘mat.FoursquareNYC’).
- prefixstr, optional
The prefix to be added to the dataset file name (default ‘’).
- missingstr, optional
The placeholder value used to denote missing data (default None).
- sample_sizefloat, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_numint, optional
Random seed for reproducibility (default 1).
Returns:
- pandas.DataFrame
The loaded dataset with optional sampling.
- matdata.dataset.load_ds_holdout(dataset='mat.FoursquareNYC', train_size=0.7, prefix='', missing='-999', sample_size=1, random_num=1, sort=True)[source]
Load a dataset for training and testing with a holdout method from a GitHub repository.
Parameters:
- datasetstr, optional
The name of the dataset file to load from the GitHub repository (default ‘mat.FoursquareNYC’). Use the format category.DatasetName.
- train_sizefloat, optional
The proportion of the dataset to include in the training set (default 0.7).
- prefixstr, optional
The prefix to be added to the dataset file name (default ‘’).
- missingstr, optional
The placeholder value used to denote missing data (default ‘-999’).
- sample_sizefloat, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_numint, optional
Random seed for reproducibility (default 1).
Returns:
- trainpandas.DataFrame
The training dataset.
- testpandas.DataFrame
The testing dataset.
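As an illustration of what a holdout split with train_size and random_num means at the trajectory level, here is a minimal, dependency-free sketch. The helper holdout_split is hypothetical, and splitting class by class is an assumption; the source does not state whether load_ds_holdout stratifies.

```python
import random
from collections import defaultdict


def holdout_split(tid_labels, train_size=0.7, random_num=1):
    """Split trajectory ids into train/test lists, class by class.

    tid_labels maps each trajectory id to its class label.
    """
    by_class = defaultdict(list)
    for tid, label in sorted(tid_labels.items()):
        by_class[label].append(tid)

    rng = random.Random(random_num)  # seeded for reproducibility
    train, test = [], []
    for label in sorted(by_class):
        tids = by_class[label]
        rng.shuffle(tids)
        cut = round(len(tids) * train_size)
        train.extend(tids[:cut])
        test.extend(tids[cut:])
    return sorted(train), sorted(test)


# 20 trajectories: ids 1..10 belong to class A, ids 11..20 to class B.
tid_labels = {tid: ("A" if tid <= 10 else "B") for tid in range(1, 21)}
train, test = holdout_split(tid_labels, train_size=0.7, random_num=1)
print(len(train), len(test))
```

Splitting ids rather than rows keeps every point of a trajectory on the same side of the split.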
- matdata.dataset.load_ds_kfold(dataset='mat.FoursquareNYC', k=5, prefix='', missing='-999', sample_size=1, random_num=1)[source]
Load a dataset for k-fold cross-validation from a GitHub repository.
Parameters:
- datasetstr, optional
The name of the dataset file to load from the GitHub repository (default ‘mat.FoursquareNYC’).
- kint, optional
The number of folds for cross-validation (default 5).
- prefixstr, optional
The prefix to be added to the dataset file name (default ‘’).
- missingstr, optional
The placeholder value used to denote missing data (default ‘-999’).
- sample_sizefloat, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_numint, optional
Random seed for reproducibility (default 1).
Returns:
- ktrainlist of pandas.DataFrame
The training datasets for each fold.
- ktestlist of pandas.DataFrame
The testing datasets for each fold.
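The shape of the (ktrain, ktest) return value can be sketched as follows. This is an illustrative partition of trajectory ids under stated assumptions, not the library's code: the helper kfold_splits is hypothetical and ignores class labels.

```python
import random


def kfold_splits(tids, k=5, random_num=1):
    """Return (ktrain, ktest): k lists of train ids and k lists of test ids."""
    shuffled = sorted(tids)
    random.Random(random_num).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k near-equal, disjoint folds

    ktrain, ktest = [], []
    for i in range(k):
        # Fold i is held out for testing; the other folds form the training set.
        ktest.append(sorted(folds[i]))
        ktrain.append(sorted(t for j, fold in enumerate(folds) if j != i for t in fold))
    return ktrain, ktest


ktrain, ktest = kfold_splits(range(1, 11), k=5)
for fold_train, fold_test in zip(ktrain, ktest):
    print(len(fold_train), len(fold_test))
```

Each id appears in exactly one test fold, so the k test sets together cover the whole dataset.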
- matdata.dataset.prepare_ds(df, tid_col='tid', class_col=None, sample_size=1, random_num=1, sort=True)[source]
Prepare dataset for training or testing (helper function).
Parameters:
- dfpandas.DataFrame
The DataFrame containing the dataset.
- tid_colstr, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_colstr or None, optional
The name of the column representing class labels. If None, no class column is used for ordering data (default None).
- sample_sizefloat, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_numint, optional
Random seed for reproducibility (default 1).
Returns:
- pandas.DataFrame
The prepared dataset with optional sampling.
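One plausible reading of sample_size, random_num, and sort is sketched below on plain row dicts. The helper sample_trajectories is hypothetical; sampling whole trajectories (rather than individual rows) and ordering by class and then tid are assumptions, not behavior confirmed by the source.

```python
import random


def sample_trajectories(rows, tid_col="tid", class_col=None,
                        sample_size=1, random_num=1, sort=True):
    """Keep a fraction of whole trajectories, optionally ordered by class and tid."""
    tids = sorted({row[tid_col] for row in rows})
    if sample_size < 1:
        keep_n = max(1, round(len(tids) * sample_size))
        kept = set(random.Random(random_num).sample(tids, keep_n))
    else:
        kept = set(tids)

    selected = [row for row in rows if row[tid_col] in kept]
    if sort:
        # list.sort is stable, so points keep their order within a trajectory.
        if class_col is None:
            selected.sort(key=lambda row: row[tid_col])
        else:
            selected.sort(key=lambda row: (row[class_col], row[tid_col]))
    return selected


# Four trajectories of three points each; odd ids are class B, even ids class A.
rows = [
    {"tid": tid, "label": "B" if tid % 2 else "A", "point": p}
    for tid in (1, 2, 3, 4)
    for p in range(3)
]
full = sample_trajectories(rows, class_col="label")
half = sample_trajectories(rows, class_col="label", sample_size=0.5)
print(len(full), len(half))
```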
- matdata.dataset.read_ds(data_file, tid_col='tid', class_col=None, missing='-999', sample_size=1, random_num=1)[source]
Read a dataset from a file.
Parameters:
- data_filestr
The path to the dataset file.
- tid_colstr, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_colstr or None, optional
The name of the column representing class labels. If None, no class column is used (default None).
- missingstr, optional
The placeholder value used to denote missing data (default ‘-999’).
- sample_sizefloat, optional
The proportion of the dataset to include in the sample (default 1, i.e., use the entire dataset).
- random_numint, optional
Random seed for reproducibility (default 1).
Returns:
- pandas.DataFrame
The read dataset.
- matdata.dataset.read_ds_5fold(data_path, prefix='specific', suffix='.csv', tid_col='tid', class_col=None, missing='-999')[source]
Read datasets for k-fold cross-validation from files in a directory.
See also
read_ds_kfold
Read datasets for k-fold cross-validation.
Parameters:
- data_pathstr
The path to the directory containing the dataset files.
- prefixstr, optional
The prefix of the dataset file names (default ‘specific’).
- suffixstr, optional
The suffix of the dataset file names (default ‘.csv’).
- tid_colstr, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_colstr or None, optional
The name of the column representing class labels. If None, no class column is used (default None).
- missingstr, optional
The placeholder value used to denote missing data (default ‘-999’).
Returns:
- 5_trainlist of pandas.DataFrame
The training datasets for each fold.
- 5_testlist of pandas.DataFrame
The testing datasets for each fold.
- matdata.dataset.read_ds_holdout(data_path, prefix=None, suffix='.csv', tid_col='tid', class_col=None, missing='-999', fold=None)[source]
Read datasets for holdout validation from files in a directory.
Parameters:
- data_pathstr
The path to the directory containing the dataset files.
- prefixstr, optional
The prefix of the dataset file names (default None).
- suffixstr, optional
The suffix of the dataset file names (default ‘.csv’).
- tid_colstr, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_colstr or None, optional
The name of the column representing class labels. If None, no class column is used (default None).
- missingstr, optional
The placeholder value used to denote missing data (default ‘-999’).
- foldint or None, optional
The fold number to load for holdout validation, including subdirectory (ex. run1). If None, read files in data_path.
Returns:
- trainpandas.DataFrame
The training dataset.
- testpandas.DataFrame
The testing dataset.
- matdata.dataset.read_ds_kfold(data_path, k=5, prefix='specific', suffix='.csv', tid_col='tid', class_col=None, missing='-999')[source]
Read datasets for k-fold cross-validation from files in a directory.
Parameters:
- data_pathstr
The path to the directory containing the dataset files.
- kint, optional
The number of folds for cross-validation (default 5).
- prefixstr, optional
The prefix of the dataset file names (default ‘specific’).
- suffixstr, optional
The suffix of the dataset file names (default ‘.csv’).
- tid_colstr, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_colstr or None, optional
The name of the column representing class labels. If None, no class column is used (default None).
- missingstr, optional
The placeholder value used to denote missing data (default ‘-999’).
Returns:
- ktrainlist of pandas.DataFrame
The training datasets for each fold.
- ktestlist of pandas.DataFrame
The testing datasets for each fold.
matdata.generator module
- class matdata.generator.AttributeGenerator(name='attr', atype='nominal', method='random', interval=None, n=-1, dependency=None, cellSize=1, adjacents=None, precision=2)[source]
Bases:
object
- class matdata.generator.NominalGenerator(method='random', n=50, interval=None)[source]
Bases:
object
- class matdata.generator.NumericGenerator(method='random', start=0, end=100, precision=2)[source]
Bases:
object
- class matdata.generator.SpatialGrid2D(X=(1, 5), Y=(1, 5), cellSize=1, spatial_adjacents=None, precision=2, dependency=[])[source]
Bases:
object
- SPATIAL_ADJACENTS_1 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
- SPATIAL_ADJACENTS_2 = [(-2, -2), (-2, -1), (-2, 0), (-2, 1), (-2, 2), (-1, -2), (-1, -1), (-1, 0), (-1, 1), (-2, 2), (0, -2), (0, -1), (0, 1), (-2, 2), (1, -2), (1, -1), (1, 0), (1, 1), (-2, 2), (2, -2), (2, -1), (2, 0), (2, 1), (-2, 2)]
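The adjacency constants are lists of (dx, dy) cell offsets. The sketch below regenerates the one-cell neighbourhood and applies it to a bounded grid. The helpers ring_offsets and neighbours are hypothetical, and treating X and Y as inclusive cell bounds is an assumption. Note that the listed SPATIAL_ADJACENTS_2 value repeats (-2, 2) where a full two-cell neighbourhood would contain (-1, 2), (0, 2), (1, 2), and (2, 2); the source does not say whether that is intentional.

```python
def ring_offsets(radius):
    """All (dx, dy) offsets within `radius` cells of the origin, excluding (0, 0)."""
    span = range(-radius, radius + 1)
    return [(dx, dy) for dx in span for dy in span if (dx, dy) != (0, 0)]


def neighbours(cell, offsets, x_range=(1, 5), y_range=(1, 5)):
    """Cells adjacent to `cell` that stay inside the inclusive grid bounds."""
    x, y = cell
    return [
        (x + dx, y + dy)
        for dx, dy in offsets
        if x_range[0] <= x + dx <= x_range[1] and y_range[0] <= y + dy <= y_range[1]
    ]


# Value copied from the class attribute documented above.
SPATIAL_ADJACENTS_1 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

print(ring_offsets(1) == SPATIAL_ADJACENTS_1)  # True
print(neighbours((1, 1), ring_offsets(1)))     # [(1, 2), (2, 1), (2, 2)]
print(len(ring_offsets(2)))                    # 24
```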
- matdata.generator.instantiate_generators(attr_desc=[{'atype': 'space', 'interval': [(0.0, 1000.0), (0.0, 1000.0)], 'method': 'grid_cell', 'name': 'space'}, {'atype': 'time', 'interval': [0, 1440], 'method': 'random', 'name': 'time'}, {'atype': 'numeric', 'interval': [-1000, 1000], 'method': 'random', 'name': 'n1'}, {'atype': 'numeric', 'interval': [0.0, 1000.0], 'method': 'random', 'name': 'n2'}, {'atype': 'nominal', 'method': 'random', 'n': 1000, 'name': 'nominal'}, {'atype': 'day', 'interval': ['Monday', 'Tuesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Wednesday'], 'method': 'random', 'name': 'day'}, {'atype': 'weather', 'interval': ['Clear', 'Clouds', 'Fog', 'Unknown', 'Rain', 'Snow'], 'method': 'random', 'name': 'weather'}, {'atype': 'category', 'dependency': 'space', 'interval': ['Residence', 'Food', 'Travel & Transport', 'Professional & Other Places', 'Shop & Service', 'Outdoors & Recreation', 'College & University', 'Arts & Entertainment', 'Nightlife Spot', 'Event'], 'method': 'random', 'name': 'category'}])[source]
- matdata.generator.randomGenerator(N=10, M=50, L=10, C=10, random_seed=1, fileprefix='random', fileposfix='train', attr_desc=None, save_to=False, outformats=['csv'])[source]
Function to generate trajectories based on random data.
Parameters:
- Nint, optional
Number of trajectories (default 10)
- Mint, optional
Size of trajectories (default 50)
- Lint, optional
Number of attributes (default 10)
- Cint, optional
Number of classes (default 10)
- random_seedint, optional
Random Seed (default 1)
- attr_desclist of dict, optional
Data type intervals used to generate attributes, given either as a list of descriptive dicts or as a list of AttributeGenerator instances. Default: None (uses the default types).
- save_tostr or bool, optional
Destination folder to save to, or False to skip saving the files (default False)
- fileprefixstr, optional
Output filename prefix (default ‘random’)
- fileposfixstr, optional
Output filename postfix (default ‘train’)
- outformatslist, optional
Output file formats for saving (default [‘csv’])
Returns:
- pandas.DataFrame
The generated dataset.
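The roles of N, M, C, random_seed, and attr_desc can be illustrated with a small seeded generator. This is a sketch under stated assumptions, not the library's algorithm: the helper random_trajectories, the round-robin class assignment, and the value naming are hypothetical, and only the 'numeric' and 'nominal' descriptor types are handled.

```python
import random


def random_trajectories(N=10, M=50, C=10, attr_desc=None, random_seed=1):
    """Generate N trajectories of M points each as a list of row dicts."""
    if attr_desc is None:
        attr_desc = [
            {"name": "n1", "atype": "numeric", "interval": [-1000, 1000]},
            {"name": "nominal", "atype": "nominal", "n": 1000},
        ]
    rng = random.Random(random_seed)  # one seeded generator for reproducibility
    rows = []
    for tid in range(1, N + 1):
        label = "C" + str((tid - 1) % C + 1)  # spread trajectories over C classes
        for _ in range(M):
            row = {"tid": tid, "label": label}
            for desc in attr_desc:
                if desc["atype"] == "numeric":
                    low, high = desc["interval"]
                    row[desc["name"]] = round(rng.uniform(low, high), 2)
                else:  # nominal: one of n symbolic values
                    row[desc["name"]] = "v" + str(rng.randrange(desc["n"]))
            rows.append(row)
    return rows


rows = random_trajectories(N=4, M=3, C=2)
print(len(rows), sorted({row["label"] for row in rows}))
```

Calling the function twice with the same random_seed yields identical rows, which is the point of exposing the seed as a parameter.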
- matdata.generator.samplerGenerator(N=10, M=50, C=1, random_seed=1, fileprefix='sample', fileposfix='train', cols_for_sampling=['space', 'time', 'day', 'rating', 'price', 'weather', 'root_type', 'type'], save_to=False, base_data=None, outformats=['csv'])[source]
Function to generate trajectories based on real data.
Parameters:
- Nint, optional
Number of trajectories (default 10)
- Mint, optional
Size of trajectories, number of points (default 50)
- Cint, optional
Number of classes (default 1)
- random_seedint, optional
Random seed (default 1)
- cols_for_samplinglist, optional
Columns to add in the generated dataset. Default: [‘space’, ‘time’, ‘day’, ‘rating’, ‘price’, ‘weather’, ‘root_type’, ‘type’].
- save_tostr or bool, optional
Destination folder to save to, or False to skip saving the files (default False)
- fileprefixstr, optional
Output filename prefix (default ‘sample’)
- fileposfixstr, optional
Output filename postfix (default ‘train’)
- base_dataDataFrame, optional
DataFrame of trajectories to use as a base for sampling data. Default: None (uses example data)
- outformatslist, optional
Output file formats for saving (default [‘csv’])
Returns:
- pandas.DataFrame
The generated dataset.
- matdata.generator.scalerRandomGenerator(Ns=[100, 10], Ms=[10, 10], Ls=[8, 10], Cs=[2, 10], random_seed=1, fileprefix='scalability', fileposfix='train', attr_desc=None, save_to=None, save_desc_files=True, outformats=['csv'])[source]
Function to generate trajectory datasets based on random data.
Parameters:
- Nslist of int, optional
Parameters to scale the number of trajectories. List of 2 values: starting number, number of elements (default [100, 10])
- Mslist of int, optional
Parameters to scale the size of trajectories. List of 2 values: starting number, number of elements (default [10, 10])
- Lslist of int, optional
Parameters to scale the number of attributes (* doubles the columns). List of 2 values: starting number, number of elements (default [8, 10])
- Cslist of int, optional
Parameters to scale the number of classes. List of 2 values: starting number, number of elements (default [2, 10])
- random_seedint, optional
Random seed (default 1)
- attr_desclist, optional
Data type intervals to generate attributes as a list of descriptive dicts. Default: None (uses default types)
- save_tostr or bool, optional
Destination folder to save to, or False to skip saving the files (default None)
- fileprefixstr, optional
Output filename prefix (default ‘scalability’)
- fileposfixstr, optional
Output filename postfix (default ‘train’)
- save_desc_filesbool, optional
True if to save the .json description files, False otherwise (default True)
- outformatslist, optional
Output file formats for saving (default [‘csv’])
Returns:
None
- matdata.generator.scalerSamplerGenerator(Ns=[100, 10], Ms=[10, 10], Ls=[8, 10], Cs=[2, 10], random_seed=1, fileprefix='scalability', fileposfix='train', cols_for_sampling=['space', 'time', 'day', 'rating', 'price', 'weather', 'root_type', 'type'], save_to=None, base_data=None, save_desc_files=True, outformats=['csv'])[source]
Generates trajectory datasets based on real data.
Parameters:
- Nslist of int, optional
Parameters to scale the number of trajectories. List of 2 values: starting number, number of elements (default [100, 10])
- Mslist of int, optional
Parameters to scale the size of trajectories. List of 2 values: starting number, number of elements (default [10, 10])
- Lslist of int, optional
Parameters to scale the number of attributes (* doubles the columns). List of 2 values: starting number, number of elements (default [8, 10])
- Cslist of int, optional
Parameters to scale the number of classes. List of 2 values: starting number, number of elements (default [2, 10])
- random_seedint, optional
Random seed (default 1)
- fileprefixstr, optional
Output filename prefix (default ‘scalability’)
- fileposfixstr, optional
Output filename postfix (default ‘train’)
- cols_for_samplinglist or dict, optional
Columns to add in the generated dataset. Default: [‘space’, ‘time’, ‘day’, ‘rating’, ‘price’, ‘weather’, ‘root_type’, ‘type’]. If a dictionary is provided in the format: {‘aspectName’: ‘type’, ‘aspectName’: ‘type’}, it is used when providing base_data and saving .MAT.
- save_tostr or bool, optional
Destination folder to save to, or False to skip saving the files (default None)
- base_dataDataFrame, optional
DataFrame of trajectories to use as a base for sampling data. Default: None (uses example data)
- save_desc_filesbool, optional
True if to save the .json description files, False otherwise (default True)
- outformatslist, optional
Output file formats for saving (default [‘csv’])
Returns:
None
matdata.preprocess module
- matdata.preprocess.convertDataset(dir_path, k=None, cols=None, fileprefix='', tid_col='tid', class_col='label')[source]
- matdata.preprocess.countClasses(data_path, folder, file='train.csv', tid_col='tid', class_col='label', markd=False)[source]
Counts the occurrences of each class label in a dataset.
Parameters:
- data_pathstr
The directory path where the dataset file is located.
- folderstr
The subfolder within the data path where the dataset file is located.
- filestr, optional (default=’train.csv’)
The name of the dataset file to be read.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- markdbool, optional (default=False)
A flag indicating whether to print the class counts in Markdown format.
Returns:
- pandas.DataFrame or str
If markd is False, prints the Markdown text and returns a DataFrame containing the counts of each class label in the dataset. If markd is True, returns a Markdown str with the counts of each class label in the dataset.
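A class count with a Markdown rendering can be sketched with the standard library. The helpers count_classes and to_markdown are hypothetical, and counting trajectories rather than individual points is an assumption; the table layout is also illustrative and may differ from the output of countClasses.

```python
from collections import Counter


def count_classes(rows, tid_col="tid", class_col="label"):
    """Count trajectories (not points) per class label."""
    label_of = {row[tid_col]: row[class_col] for row in rows}
    return Counter(label_of.values())


def to_markdown(counts):
    """Render the counts as a two-column Markdown table."""
    lines = ["| label | trajectories |", "| --- | --- |"]
    lines += ["| {} | {} |".format(label, n) for label, n in sorted(counts.items())]
    return "\n".join(lines)


# Three trajectories with two points each: ids 1 and 2 are class A, id 3 is class B.
rows = [
    {"tid": tid, "label": label}
    for tid, label in ((1, "A"), (2, "A"), (3, "B"))
    for _ in range(2)
]
counts = count_classes(rows)
table = to_markdown(counts)
print(table)
```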
- matdata.preprocess.datasetStatistics(data_path, folder, file_prefix='', tid_col='tid', class_col='label', to_file=False)[source]
Computes statistics for a dataset, including summary statistics for each column and class distribution into a markdown file format.
Parameters:
- data_pathstr
The directory path where the dataset file(s) are located.
- folderstr
The subfolder within the data path where the dataset file(s) are located.
- file_prefixstr, optional (default=’’)
The prefix to be added to the dataset file names.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- to_filebool, optional (default=False)
A flag indicating whether to save the statistics to a file.
Returns:
- dict or None
If to_file is False, prints the Markdown and returns a str containing the computed statistics. If to_file is a str, returns the Markdown str and also saves the statistics to a file named by the to_file value.
- matdata.preprocess.dfStats(df)[source]
Computes summary statistics for each column in a DataFrame.
Parameters:
- dfpandas.DataFrame
The DataFrame for which statistics are to be computed.
Returns:
- pandas.DataFrame
A DataFrame containing summary statistics for each column, including mean, standard deviation, and variance. Columns are sorted by variance in descending order.
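As a rough illustration of the summary described above (mean, standard deviation, and variance per column, sorted by variance in descending order), here is a standard-library sketch. The helper `column_stats` is hypothetical, the input is a dict of numeric lists rather than a DataFrame, and the use of the sample variance (n - 1 denominator) is an assumption.

```python
import statistics

def column_stats(columns):
    """columns: dict mapping column name -> list of numbers.

    Returns a list of (name, mean, std, variance) tuples,
    sorted by variance in descending order.
    """
    stats = []
    for name, values in columns.items():
        var = statistics.variance(values)  # sample variance (assumption)
        stats.append((name, statistics.mean(values), statistics.stdev(values), var))
    return sorted(stats, key=lambda s: s[3], reverse=True)

# Toy numeric columns; real trajectory data may also hold nominal attributes,
# which this sketch does not handle.
columns = {"rating": [4.0, 4.5, 5.0, 4.5], "hour": [8, 12, 18, 22]}
for name, mean, std, var in column_stats(columns):
    print(f"{name}: mean={mean:.2f} std={std:.2f} var={var:.2f}")
```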
- matdata.preprocess.dfVariance(df)[source]
Computes the variance for each column in a DataFrame.
Parameters:
- dfpandas.DataFrame
The DataFrame for which variance is to be computed.
Returns:
- pandas.Series
A Series containing the variance for each column in the DataFrame.
- matdata.preprocess.featuresJSON(df, version=1, deftype='nominal', defcomparator='equals', tid_col='tid', label_col='label', file=False)[source]
Generates a JSON representation of features from a DataFrame.
Parameters:
- dfpandas.DataFrame
The DataFrame containing the dataset.
- versionint, optional (default=1)
The version number of the JSON schema (1 for MASTERMovelets format, 2 for HiPerMovelets format).
- deftypestr, optional (default=’nominal’)
The default type of features.
- defcomparatorstr, optional (default=’equals’)
The default comparator for features.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- label_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- filebool, optional (default=False)
A flag indicating whether to save the JSON representation to a file.
Returns:
- str
If file is False, returns a str representing the features in JSON format. If file is a str, returns the JSON str and also saves it to a file with that name.
- matdata.preprocess.joinTrainTest(dir_path, train_file='train.csv', test_file='test.csv', tid_col='tid', class_col='label', to_file=False)[source]
Joins training and testing datasets from separate files into a single DataFrame.
Parameters:
- dir_pathstr
The directory path where the training and testing files are located.
- train_filestr, optional (default=”train.csv”)
The name of the training file to be read.
- test_filestr, optional (default=”test.csv”)
The name of the testing file to be read.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- to_filebool, optional (default=False)
A flag indicating whether to save the joined DataFrame to a file named ‘joined.csv’.
Returns:
- pandas.DataFrame
A DataFrame containing the joined training and testing data. If to_file is True, the DataFrame is also saved to a file named ‘joined.csv’.
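The join described above amounts to concatenating the two sets. The sketch below shows the idea with the standard `csv` module on in-memory text; it is not the library's code. The helpers `read_rows` and `join_train_test` are hypothetical, trajectory ids are assumed to be unique across both files, and sorting the result by class and trajectory id is an assumption.

```python
import csv
import io

def read_rows(text):
    """Parse CSV text into a list of dict rows (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))

def join_train_test(train_rows, test_rows, tid_col="tid", class_col="label"):
    """Concatenate both sets, then sort by class label and trajectory id."""
    joined = train_rows + test_rows
    # sorted() is stable, so points of one trajectory keep their relative order.
    return sorted(joined, key=lambda r: (r[class_col], int(r[tid_col])))

# In-memory stand-ins for 'train.csv' and 'test.csv'.
train_csv = "tid,label,poi\n1,A,Home\n1,A,Work\n3,B,Gym\n"
test_csv = "tid,label,poi\n2,A,Cafe\n4,B,Park\n"

joined = join_train_test(read_rows(train_csv), read_rows(test_csv))
print([(r["tid"], r["label"]) for r in joined])
```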
- matdata.preprocess.joinTrainTest_df(train, test, tid_col='tid', class_col='label', sort=True)[source]
- matdata.preprocess.kfold_stratify(df, k=10, inc=1, limit=10, random_num=1, tid_col='tid', class_col='label', fileprefix='', ktrain=None, ktest=None, organize_columns=True, mat_columns=None, data_path='.', outformats=[], ignore_ltk=True, sort=True)[source]
- matdata.preprocess.kfold_trainTestSplit(df, k, random_num=1, tid_col='tid', class_col='label', fileprefix='', columns_order=None, ktrain=None, ktest=None, mat_columns=None, data_path='.', outformats=[], verbose=False, sort=True)[source]
Splits a DataFrame into k folds for k-fold cross-validation, optionally organizes columns, and saves them to files.
Parameters:
- dfpandas.DataFrame
The DataFrame to be split into k folds.
- kint
The number of folds for cross-validation.
- random_numint, optional (default=1)
The random seed for reproducible results.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- fileprefixstr, optional (default=’’)
The prefix to be added to the file names when saving, for example: ‘specific_’ or ‘generic_’.
- columns_orderlist of str, optional
A list of column names specifying the desired order of columns. If None, no reordering is performed.
- ktrainlist of pandas.DataFrame, optional
A list of training sets for each fold. If None, the function will split the data into training and testing sets.
- ktestlist of pandas.DataFrame, optional
A list of testing sets for each fold. If None, the function will split the data into training and testing sets.
- mat_columnslist of str, optional
A list of column names to be included in the .mat files, corresponding to columns_order.
- data_pathstr, optional (default=’.’)
The directory path where the output files will be saved.
- outformatslist of str, optional
A list of output formats for saving the datasets (e.g., [‘csv’, ‘zip’, ‘parquet’]).
- verbosebool, optional (default=False)
A flag indicating whether to display progress messages.
- sortbool, optional
If True, sort the data by class_col and tid_col (default True).
Returns:
- ktrainlist of pandas.DataFrame
A list of DataFrames containing the training set of each fold.
- ktestlist of pandas.DataFrame
A list of DataFrames containing the testing set of each fold.
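To make the shape of the returned ktrain and ktest lists concrete, the following sketch builds k folds at the trajectory level, so that all points of a trajectory land in the same fold. It is a simplified illustration, not the library's algorithm: the helpers `kfold_tids` and `kfold_split` are hypothetical, rows are plain dicts, and dealing each class's trajectories round-robin into folds is an assumed way to keep the folds class-balanced.

```python
import random
from collections import defaultdict

def kfold_tids(label_of, k, random_num=1):
    """label_of: dict tid -> class label. Returns k lists of held-out tids."""
    by_class = defaultdict(list)
    for tid, label in label_of.items():
        by_class[label].append(tid)
    rng = random.Random(random_num)  # seeded for reproducible folds
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        tids = sorted(by_class[label])
        rng.shuffle(tids)
        # Deal the trajectories of this class round-robin across the folds.
        for i, tid in enumerate(tids):
            folds[i % k].append(tid)
    return folds

def kfold_split(rows, k, random_num=1, tid_col="tid", class_col="label"):
    """Return (ktrain, ktest): two lists with one row list per fold."""
    label_of = {r[tid_col]: r[class_col] for r in rows}
    ktrain, ktest = [], []
    for fold in kfold_tids(label_of, k, random_num):
        held_out = set(fold)
        ktest.append([r for r in rows if r[tid_col] in held_out])
        ktrain.append([r for r in rows if r[tid_col] not in held_out])
    return ktrain, ktest

# Toy data: 12 trajectories (6 per class) with 2 points each.
rows = [{"tid": t, "label": "A" if t <= 6 else "B", "step": s}
        for t in range(1, 13) for s in range(2)]
ktrain, ktest = kfold_split(rows, k=3)
print([len(fold) for fold in ktest])
```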
- matdata.preprocess.klabels_extract(df, kl=10, random_num=1, tid_col='tid', class_col='label', organize_columns=True, sort=True)[source]
Stratifies a DataFrame by a specified number of class labels and optionally organizes columns, returning the resulting dataset.
Parameters:
- dfpandas.DataFrame
The DataFrame to be stratified and split into training and testing sets.
- klint, optional (default=10)
The number of class labels to stratify the DataFrame.
- random_numint, optional (default=1)
The random seed for reproducible results.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- organize_columnsbool, optional (default=True)
A flag indicating whether to organize columns before saving.
- sortbool, optional
If True, sort the data by class_col and tid_col (default True).
Returns:
- datapandas.DataFrame
A DataFrame containing the dataset.
- matdata.preprocess.klabels_stratify(df, kl=10, train_size=0.7, random_num=1, tid_col='tid', class_col='label', organize_columns=True, mat_columns=None, fileprefix='', outformats=[], data_path='.', sort=True)[source]
Stratifies a DataFrame by a specified number of class labels and splits it into training and testing sets, optionally organizes columns, and saves them to files.
Parameters:
- dfpandas.DataFrame
The DataFrame to be stratified and split into training and testing sets.
- klint, optional (default=10)
The number of class labels to stratify the DataFrame.
- train_sizefloat, optional (default=0.7)
The proportion of the stratified dataset to include in the training set.
- random_numint, optional (default=1)
The random seed for reproducible results.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- organize_columnsbool, optional (default=True)
A flag indicating whether to organize columns before saving.
- mat_columnslist of str, optional (unused for now)
A list of column names to be included in the .mat files, if set to save.
- fileprefixstr, optional (default=’’)
The prefix to be added to the file names when saving.
- outformatslist of str, optional
A list of output formats for saving the datasets (e.g., [‘csv’, ‘zip’, ‘parquet’]).
- data_pathstr, optional (default=’.’)
The directory path where the output files will be saved.
- sortbool, optional
If True, sort the data by class_col and tid_col (default True).
Returns:
- trainpandas.DataFrame
A DataFrame containing the training set.
- testpandas.DataFrame
A DataFrame containing the testing set.
- matdata.preprocess.labels_extract(df, labels=[], tid_col='tid', class_col='label', organize_columns=True)[source]
- matdata.preprocess.organizeFrame(df, columns_order=None, tid_col='tid', class_col='label', make_spatials=False)[source]
Organizes a DataFrame by reordering columns and optionally converting spatial columns.
Parameters:
- dfpandas.DataFrame
The DataFrame to be organized.
- columns_orderlist of str, optional
A list of column names specifying the desired order of columns. If None, no reordering is performed.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- make_spatialsbool, optional (default=False)
A flag indicating whether to convert spatial columns into both formats: separate lat/lon columns, and the space format, in which lat and lon are concatenated in one column.
Returns:
- pandas.DataFrame
A DataFrame containing the organized data, with columns added as specified and spatial columns converted if requested.
- columns_order_zip
A list of the columns with space column, if present.
- columns_order_csv
A list of the columns with lat/lon columns, if present.
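The two spatial layouts mentioned above (separate lat/lon columns versus a single space column) can be sketched on one row as follows. This is only an illustration under assumptions: the column names `lat`, `lon`, and `space`, the single-space separator, and the helper names are not confirmed by this reference.

```python
def to_space(row, lat_col="lat", lon_col="lon", space_col="space"):
    """Return a copy of row with lat/lon merged into one 'space' value."""
    out = {k: v for k, v in row.items() if k not in (lat_col, lon_col)}
    out[space_col] = f"{row[lat_col]} {row[lon_col]}"  # separator is an assumption
    return out

def to_lat_lon(row, lat_col="lat", lon_col="lon", space_col="space"):
    """Return a copy of row with the 'space' value split back into lat and lon."""
    out = {k: v for k, v in row.items() if k != space_col}
    lat, lon = row[space_col].split(" ")
    out[lat_col], out[lon_col] = float(lat), float(lon)
    return out

def reorder(row, columns_order):
    """Rebuild the row with its keys in the requested column order."""
    return {col: row[col] for col in columns_order}

point = {"tid": 7, "label": "A", "lat": -27.59, "lon": -48.54}
zipped = reorder(to_space(point), ["space", "tid", "label"])
restored = to_lat_lon(zipped)
print(zipped)
print(restored)
```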
- matdata.preprocess.readDataset(data_path, folder=None, file='train.csv', class_col='label', tid_col='tid', missing=None)[source]
Reads a dataset file (CSV format by default, ‘train.csv’) and returns it as a pandas DataFrame.
Parameters:
- data_pathstr
The directory path where the dataset file is located.
- folderstr, optional
The subfolder within the data path where the dataset file is located.
- filestr, optional (default=’train.csv’)
The name of the dataset file to be read.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- missingstr, optional (default=None)
The placeholder for missing values in the dataset.
Returns:
- pandas.DataFrame
A DataFrame containing the dataset from the specified file, with trajectory identifier, class label, and missing values handled as specified.
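One plausible reading of the missing placeholder described above is that cells equal to it are treated as absent values. The sketch below shows that reading with the standard `csv` module; the helper `read_dataset` is hypothetical, the CSV text is an in-memory stand-in for a file such as train.csv, and mapping the placeholder to None is an assumption about the behavior.

```python
import csv
import io

def read_dataset(text, missing="?"):
    """Parse CSV text; cells equal to the `missing` placeholder become None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({k: (None if v == missing else v) for k, v in row.items()})
    return rows

# Two trajectories; '?' marks values that were not recorded.
csv_text = "tid,label,poi,rating\n1,A,Home,?\n1,A,Work,4.5\n2,B,?,3.0\n"
rows = read_dataset(csv_text)
for row in rows:
    print(row)
```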
- matdata.preprocess.readDsDesc(data_path, folder=None, file='train.csv', tid_col='tid', class_col='label', missing='?')[source]
- matdata.preprocess.sortByLabel(df, tid_col='tid', class_col='label')[source]
Sort a DataFrame by class label column and trajectory ID column.
Parameters:
- dfpandas.DataFrame
The DataFrame to be sorted.
- tid_colstr, optional
The name of the column representing trajectory IDs (default ‘tid’).
- class_colstr, optional
The name of the column representing class labels (default ‘label’).
Returns:
- pandas.DataFrame
The sorted DataFrame.
- matdata.preprocess.sortByTID(df_, tid_col='tid')[source]
Sort a DataFrame by trajectory ID column.
Parameters:
- df_pandas.DataFrame
The DataFrame to be sorted.
- tid_colstr, optional
The name of the column representing trajectory IDs (default ‘tid’).
Returns:
- pandas.DataFrame
The sorted DataFrame.
- matdata.preprocess.splitData(df, k, random_num, tid_col='tid', class_col='label', opLabel='Spliting Data', ignore_ltk=True)[source]
- matdata.preprocess.splitTIDs(df, train_size=0.7, random_num=1, tid_col='tid', class_col='label', min_elements=1, opLabel='Spliting Data (class-balanced)')[source]
- matdata.preprocess.splitTIDsUnbalanced(df, train_size=0.7, random_num=1, tid_col='tid', min_elements=1, opLabel='Spliting Data')[source]
- matdata.preprocess.stratify(df, sample_size=0.5, random_num=1, tid_col='tid', class_col='label', organize_columns=True, sort=True, opLabel='Data Stratification (class-balanced)')[source]
Stratifies a DataFrame by class label and optionally organizes columns, returning the stratified sample.
Parameters:
- dfpandas.DataFrame
The DataFrame to be stratified.
- sample_sizefloat, optional (default=0.5)
The proportion of the dataset to sample for stratification.
- random_numint, optional (default=1)
The random seed for reproducible results.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- organize_columnsbool, optional (default=True)
A flag indicating whether to organize columns before saving.
- sortbool, optional
If True, sort the data by class_col and tid_col (default True).
Returns:
- pandas.DataFrame
A DataFrame containing the stratified set.
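Class-balanced stratification as described above can be pictured as sampling the same fraction of trajectories from every class. The sketch below is a minimal standard-library version under assumptions: `stratified_tids` is a hypothetical helper, rows are plain dicts, and keeping at least one trajectory per class is a design choice of this sketch, not a documented behavior.

```python
import random
from collections import defaultdict

def stratified_tids(label_of, sample_size=0.5, random_num=1):
    """label_of: dict tid -> class label. Pick a fraction of tids from each class."""
    by_class = defaultdict(list)
    for tid, label in label_of.items():
        by_class[label].append(tid)
    rng = random.Random(random_num)  # seeded for reproducible samples
    chosen = set()
    for label in sorted(by_class):
        tids = sorted(by_class[label])
        n = max(1, round(len(tids) * sample_size))  # at least one per class
        chosen.update(rng.sample(tids, n))
    return chosen

# Toy data: 8 trajectories of class A and 4 of class B, 3 points each.
rows = [{"tid": t, "label": "A" if t <= 8 else "B"}
        for t in range(1, 13) for _ in range(3)]
label_of = {r["tid"]: r["label"] for r in rows}

keep = stratified_tids(label_of, sample_size=0.5)
sample = [r for r in rows if r["tid"] in keep]
print(len(keep), len(sample))
```

Sampling trajectory ids instead of rows keeps every selected trajectory complete, which matters because its points are only meaningful as a sequence.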
- matdata.preprocess.stratifyTrainTest(df, sample_size=0.5, train_size=0.7, random_num=1, tid_col='tid', class_col='label', organize_columns=True, mat_columns=None, fileprefix='', outformats=[], data_path='.', sort=True)[source]
Stratifies a DataFrame by class label and splits it into training and testing sets, optionally organizes columns, and saves them to files.
Parameters:
- dfpandas.DataFrame
The DataFrame to be stratified and split into training and testing sets.
- sample_sizefloat, optional (default=0.5)
The proportion of the dataset to sample for stratification.
- train_sizefloat, optional (default=0.7)
The proportion of the stratified dataset to include in the training set.
- random_numint, optional (default=1)
The random seed for reproducible results.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- organize_columnsbool, optional (default=True)
A flag indicating whether to organize columns before saving.
- mat_columnslist of str, optional (unused for now)
A list of column names to be included in the .mat files, if set to save.
- fileprefixstr, optional (default=’’)
The prefix to be added to the file names when saving.
- outformatslist of str, optional
A list of output formats for saving the datasets (e.g., [‘csv’, ‘zip’, ‘parquet’]).
- data_pathstr, optional (default=’.’)
The directory path where the output files will be saved.
- sortbool, optional
If True, sort the data by class_col and tid_col (default True).
Returns:
- trainpandas.DataFrame
A DataFrame containing the training set.
- testpandas.DataFrame
A DataFrame containing the testing set.
- matdata.preprocess.suffle_df(df, tid_col='tid', random_num=1)[source]
Shuffle a DataFrame by trajectory ID column.
Parameters:
- dfpandas.DataFrame
The DataFrame to be shuffled.
- tid_colstr, optional
The name of the column representing trajectory IDs (default ‘tid’).
- random_numint, optional
Random seed for reproducibility (default 1).
Returns:
- pandas.DataFrame
The shuffled DataFrame.
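Shuffling by trajectory ID, as opposed to shuffling rows, reorders whole trajectories while leaving the points inside each one in sequence. A small sketch of that idea follows; `shuffle_by_tid` is a hypothetical helper operating on plain dict rows, and it is not taken from the library's source.

```python
import random

def shuffle_by_tid(rows, tid_col="tid", random_num=1):
    """Shuffle the order of trajectories; points inside each keep their order."""
    tids = list(dict.fromkeys(r[tid_col] for r in rows))  # unique, first-seen order
    random.Random(random_num).shuffle(tids)
    position = {tid: i for i, tid in enumerate(tids)}
    # A stable sort on the new trajectory position preserves intra-trajectory order.
    return sorted(rows, key=lambda r: position[r[tid_col]])

# Toy data: 4 trajectories with 3 ordered points each.
rows = [{"tid": t, "step": s} for t in (1, 2, 3, 4) for s in (0, 1, 2)]
shuffled = shuffle_by_tid(rows)
print([r["tid"] for r in shuffled])
```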
- matdata.preprocess.trainTestSplit(df, train_size=0.7, random_num=1, tid_col='tid', class_col='label', fileprefix='', data_path='.', outformats=[], verbose=False, organize_columns=True, sort=True)[source]
Splits a DataFrame into training and testing sets, optionally organizes columns, and saves them to files.
Parameters:
- dfpandas.DataFrame
The DataFrame to be split into training and testing sets.
- train_sizefloat, optional (default=0.7)
The proportion of the dataset to include in the training set.
- random_numint, optional (default=1)
The random seed for reproducible results.
- tid_colstr, optional (default=’tid’)
The name of the column to be used as the trajectory identifier.
- class_colstr, optional (default=’label’)
The name of the column to be treated as the class/label column.
- fileprefixstr, optional (default=’’)
The prefix to be added to the file names when saving.
- data_pathstr, optional (default=’.’)
The directory path where the output files will be saved.
- outformatslist of str, optional
A list of output formats for saving the datasets (e.g., [‘csv’, ‘parquet’]).
- verbosebool, optional (default=False)
A flag indicating whether to display progress messages.
- organize_columnsbool, optional (default=True)
A flag indicating whether to organize columns before saving.
- sortbool, optional
If True, sort the data by class_col and tid_col (default True).
Returns:
- trainpandas.DataFrame
A DataFrame containing the training set.
- testpandas.DataFrame
A DataFrame containing the testing set.
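The train/test split described above can be sketched as a per-class split of trajectory ids, with rows then routed to the side their trajectory belongs to. This is an illustration under assumptions rather than the library's implementation: `train_test_tids` is a hypothetical helper, rows are plain dicts, and rounding the per-class cut point is a choice made for this sketch.

```python
import random
from collections import defaultdict

def train_test_tids(label_of, train_size=0.7, random_num=1):
    """label_of: dict tid -> class label. Returns (train_tids, test_tids) as sets."""
    by_class = defaultdict(list)
    for tid, label in label_of.items():
        by_class[label].append(tid)
    rng = random.Random(random_num)  # seeded for reproducible splits
    train, test = set(), set()
    for label in sorted(by_class):
        tids = sorted(by_class[label])
        rng.shuffle(tids)
        cut = round(len(tids) * train_size)  # per-class cut keeps classes balanced
        train.update(tids[:cut])
        test.update(tids[cut:])
    return train, test

# Toy data: 10 trajectories per class, 2 points each.
rows = [{"tid": t, "label": "A" if t <= 10 else "B"}
        for t in range(1, 21) for _ in range(2)]
label_of = {r["tid"]: r["label"] for r in rows}

train_ids, test_ids = train_test_tids(label_of)
train = [r for r in rows if r["tid"] in train_ids]
test = [r for r in rows if r["tid"] in test_ids]
print(len(train), len(test))
```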
Module contents
Multiple Aspect Trajectory Tools Framework
MAT-data: Data Preprocessing for Multiple Aspect Trajectory Data Mining
The present application offers a tool to support the user in the classification task of multiple aspect trajectories, specifically for extracting and visualizing the movelets, the parts of the trajectory that better discriminate a class. It integrates the fragmented approaches available for multiple aspect trajectories, and for multidimensional sequence classification in general, into a single web-based and Python library system. It offers both movelets visualization and classification methods.
Created on Dec, 2023 Copyright (C) 2023, License GPL Version 3 or superior (see LICENSE file)
@author: Tarlis Portela