Converter

MAT-Tools: Python Framework for Multiple Aspect Trajectory Data Mining

This application offers a tool to support the user in preprocessing multiple aspect trajectory data. It is part of a unified framework for data mining methods over multiple aspect trajectories and, more generally, multidimensional sequence data.

Copyright (C) 2022, MIT license (this portion of code is subject to licensing from source project distribution)

Created on Dec, 2023
Copyright (C) 2023, License GPL Version 3 or superior (see LICENSE file)

Authors:
  • Tarlis Portela

matdata.converter.any2ts(data_path, folder, file, cols=None, tid_col='tid', class_col='label', opLabel='Converting TS')

Converts data from various formats (CSV, Parquet, etc.) to a time series format.

Parameters:

data_path : str

The directory path where the data files are located.

folder : str

The folder containing the data file to be converted.

file : str

The name of the data file to be converted.

cols : list of str, optional

A list of column names to be included in the time series data.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

opLabel : str, optional (default='Converting TS')

A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame

A DataFrame containing the time series data, with trajectory identifier, class label, and specified columns.

matdata.converter.csv2df(url, class_col='label', tid_col='tid', missing=None)

Converts a CSV file from a given URL into a pandas DataFrame.

Parameters:

url : str

The URL pointing to the CSV file to be read.

class_col : str, optional (default='label')

Unused; kept so that all reader functions share the same signature.

tid_col : str, optional (default='tid')

Unused; kept so that all reader functions share the same signature.

missing : str, optional (default=None)

The placeholder for missing values in the CSV file.

Returns:

pandas.DataFrame

A DataFrame containing the data from the CSV file, with missing values handled as specified and columns renamed if necessary.
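
The role of a missing-value placeholder can be illustrated with a minimal sketch. It uses only the Python standard library rather than matdata or pandas, and the helper name read_csv_rows, the sample data, and the choice of mapping the placeholder to None are all assumptions made for illustration; the reference above does not say how csv2df treats the placeholder internally.

```python
import csv
import io


def read_csv_rows(text, missing="?"):
    """Parse CSV text into a list of dicts, mapping the placeholder to None.

    Hypothetical helper: it only illustrates what a missing-value
    placeholder means, not how matdata implements csv2df.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        # Swap the placeholder string for None so later code can tell
        # "no value recorded" apart from a real string value.
        rows.append({key: (None if value == missing else value)
                     for key, value in record.items()})
    return rows


# Made-up sample: two trajectories with '?' marking unknown values.
sample = "tid,label,poi,rating\n1,A,Home,?\n1,A,Work,4\n2,B,?,5\n"
rows = read_csv_rows(sample)
print(rows[0])
```

All other values stay as strings here; any type conversion would be a separate step.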

matdata.converter.df2csv(df, data_path, file='train', tid_col='tid', class_col='label', select_cols=None, opLabel='Writing CSV')

Writes a pandas DataFrame to a CSV file.

Parameters:

df : pandas.DataFrame

The DataFrame to be written to the CSV file.

data_path : str

The directory path where the CSV file will be saved.

file : str, optional (default='train')

The base name of the CSV file (without extension).

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

select_cols : list of str, optional

A list of column names to be included in the CSV file. If None, all columns are included.

opLabel : str, optional (default='Writing CSV')

A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame

The input DataFrame
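
The interplay of data_path, file, and select_cols described above can be sketched as follows. This is a standard-library illustration under stated assumptions, not the library's implementation: write_rows_csv is a hypothetical helper, rows are plain dicts instead of a DataFrame, and the ".csv" suffix is inferred from the statement that file is a base name without extension.

```python
import csv
import os
import tempfile


def write_rows_csv(rows, data_path, file="train", select_cols=None):
    """Write dict rows to <data_path>/<file>.csv and return the path.

    Hypothetical helper: select_cols=None keeps every column, mirroring
    the documented behaviour of the select_cols parameter.
    """
    columns = list(select_cols) if select_cols is not None else list(rows[0].keys())
    target = os.path.join(data_path, file + ".csv")
    with open(target, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" silently drops columns that were not selected.
        writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return target


# Made-up sample rows.
rows = [
    {"tid": 1, "label": "A", "poi": "Home", "rating": 5},
    {"tid": 1, "label": "A", "poi": "Work", "rating": 3},
]

with tempfile.TemporaryDirectory() as tmp:
    path = write_rows_csv(rows, tmp, select_cols=["tid", "label", "poi"])
    with open(path, encoding="utf-8") as handle:
        written = handle.read()

print(written)
```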

matdata.converter.df2mat(df, folder, file, cols=None, mat_cols=None, desc_cols=None, label_columns=None, other_dsattrs=None, tid_col='tid', class_col='label', opLabel='Converting MAT')

Converts a pandas DataFrame to a Multiple Aspect Trajectory .mat file and saves it to the specified folder.

Parameters:

df : pandas.DataFrame

The DataFrame to be converted to a .mat file.

folder : str

The directory where the .mat file will be saved.

file : str

The base name of the .mat file (without extension).

cols : list of str, optional

A list of column names from the DataFrame to include in the .mat file. If None, all columns are included.

mat_cols : list of str, optional

A list of column names representing the trajectory attributes. If None, no columns are used.

desc_cols : list of str, optional

A dict of column descriptors to be included as descriptive metadata.

label_columns : list of str, optional

A list of column names that can be treated as labels in the .mat file.

other_dsattrs : dict, optional

A dictionary of additional dataset attributes to be included in the .mat file.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

opLabel : str, optional (default='Converting MAT')

A label describing the operation, useful for logging or display purposes.

Returns:

None

matdata.converter.df2parquet(df, data_path, file='train', tid_col='tid', class_col='label', select_cols=None, opLabel='Writing Parquet')

Writes a pandas DataFrame to a Parquet file.

Parameters:

df : pandas.DataFrame

The DataFrame to be written to the Parquet file.

data_path : str

The directory path where the Parquet file will be saved.

file : str, optional (default='train')

The base name of the Parquet file (without extension).

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

select_cols : list of str, optional

A list of column names to be included in the Parquet file. If None, all columns are included.

opLabel : str, optional (default='Writing Parquet')

A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame

The input DataFrame

matdata.converter.df2zip(df, data_path, file, tid_col='tid', class_col='label', select_cols=None, opLabel='Writing ZIP')

Writes a pandas DataFrame to a CSV file and compresses it into a ZIP archive.

  • This format is used by older movelet methods, such as Movelets, MasterMovelets, SuperMovelets, and the Dodge, Xiao, and Zheng feature extractors. In this format, every ',' (comma) in the data is replaced with '_' to avoid problems when reading the CSV trajectory files.

Parameters:

df : pandas.DataFrame

The DataFrame to be written to the CSV file and then compressed into a ZIP archive.

data_path : str

The directory path where the ZIP archive will be saved.

file : str

The base name of the CSV file (without extension) to be compressed into the ZIP archive.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

select_cols : list of str, optional

A list of column names to be included in the CSV file. If None, all columns are included.

opLabel : str, optional (default='Writing ZIP')

A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame

The input DataFrame
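
The comma replacement mentioned for this format can be sketched with the standard library alone. The helper rows_to_zip_bytes, the member name train.csv, and the single-CSV archive layout are assumptions made for illustration; the reference above does not specify how df2zip lays out the archive.

```python
import csv
import io
import zipfile


def rows_to_zip_bytes(header, rows, csv_name="train.csv"):
    """Write rows as one CSV inside an in-memory ZIP archive.

    Hypothetical helper: commas inside values become '_' so that readers
    which split lines on ',' always see the expected number of fields.
    """
    text = io.StringIO()
    writer = csv.writer(text, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow([str(value).replace(",", "_") for value in row])

    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(csv_name, text.getvalue())
    return archive.getvalue()


# Made-up sample: the first POI name contains a comma.
header = ["tid", "label", "poi"]
rows = [[1, "A", "Cafe, Downtown"], [1, "A", "Home"]]
payload = rows_to_zip_bytes(header, rows)

# Read the member back to inspect what was stored.
with zipfile.ZipFile(io.BytesIO(payload)) as zf:
    content = zf.read("train.csv").decode("utf-8")

print(content)
```

Without the replacement, the csv module would quote the value instead, which a naive line-splitting reader would not handle.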

matdata.converter.mat2df(url, class_col='label', tid_col='tid', missing=None)

Converts a MATLAB .mat file from a given URL into a pandas DataFrame.

Parameters:

url : str

The URL pointing to the .mat file to be read.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

tid_col : str, optional (default='tid')

The name of the column to be used as the unique trajectory identifier.

missing : str, optional (default=None)

The placeholder for missing values in the dataset.

Returns:

pandas.DataFrame

A DataFrame containing the data from the .mat file, with missing values handled as specified and columns renamed if necessary.

Raises:

Exception

Raised because this conversion is not implemented.

matdata.converter.parquet2df(url, class_col='label', tid_col='tid', missing=None)

Converts a Parquet file from a given URL into a pandas DataFrame.

Parameters:

url : str

The URL pointing to the Parquet file to be read.

class_col : str, optional (default='label')

Unused; kept so that all reader functions share the same signature.

tid_col : str, optional (default='tid')

Unused; kept so that all reader functions share the same signature.

missing : str, optional (default=None)

The placeholder for missing values in the dataset.

Returns:

pandas.DataFrame

A DataFrame containing the data from the Parquet file, with missing values handled as specified and columns renamed if necessary.

matdata.converter.ts2df(url, class_col='label', tid_col='tid', missing=None)

Converts a time series file from a given URL into a pandas DataFrame.

Parameters:

url : str

The URL pointing to the time series file to be read.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

tid_col : str, optional (default='tid')

The name of the column to be used as the unique trajectory identifier.

missing : str, optional (default=None)

The placeholder for missing values in the dataset.

Returns:

pandas.DataFrame

A DataFrame containing the data from the time series file, with missing values handled as specified and columns renamed if necessary.

matdata.converter.xes2df(url, class_col='label', tid_col='tid', missing=None, opLabel='Converting XES', save=False, start_tid=1)

Converts an XES (eXtensible Event Stream) file from a given URL into a pandas DataFrame.

Parameters:

url : str

The URL pointing to the XES file to be read.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

missing : str, optional (default=None)

The placeholder for missing values in the dataset.

opLabel : str, optional (default='Converting XES')

A label describing the operation, useful for logging or display purposes.

save : bool, optional (default=False)

A flag indicating whether to save the DataFrame to a file after conversion.

start_tid : int, optional (default=1)

The starting value for the generated trajectory identifiers, since tid_col values must be created during conversion.

Returns:

pandas.DataFrame

A DataFrame containing the data from the XES file, with columns renamed if necessary.
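
To show why start_tid exists, the following minimal sketch flattens XES traces into rows and numbers them from a chosen starting value. It is an illustration under stated assumptions, not the library's parser: xes_to_rows and the sample log are hypothetical, the sample omits the XML namespace that real XES files usually declare, and trace-level attributes are ignored.

```python
import xml.etree.ElementTree as ET

# Made-up XES fragment; real logs usually declare an XML namespace.
XES_SAMPLE = """<log>
  <trace>
    <string key="concept:name" value="case-a"/>
    <event><string key="concept:name" value="register"/></event>
    <event><string key="concept:name" value="approve"/></event>
  </trace>
  <trace>
    <string key="concept:name" value="case-b"/>
    <event><string key="concept:name" value="register"/></event>
  </trace>
</log>"""


def xes_to_rows(xes_text, tid_col="tid", start_tid=1):
    """Flatten each event into a dict, numbering traces from start_tid.

    Hypothetical helper: one row per event, one generated identifier
    per trace, with event attributes stored under their XES keys.
    """
    root = ET.fromstring(xes_text)
    rows = []
    for tid, trace in enumerate(root.findall("trace"), start=start_tid):
        for event in trace.findall("event"):
            row = {tid_col: tid}
            for attribute in event:
                row[attribute.get("key")] = attribute.get("value")
            rows.append(row)
    return rows


rows = xes_to_rows(XES_SAMPLE, start_tid=10)
for row in rows:
    print(row)
```

Choosing a start_tid above the largest identifier already in use keeps identifiers unique when several converted logs are combined.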

matdata.converter.zip2arf(folder, file, cols, tid_col='tid', class_col='label', missing='?', opLabel='Reading ZIP')

Extracts a CSV file from a ZIP archive and converts it into an ARFF (Attribute-Relation File Format) file.

Parameters:

folder : str

The directory path where the ZIP archive is located.

file : str

The name of the ZIP archive file (with or without extension).

cols : list of str

A list of column names to be included in the ARFF file.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

missing : str, optional (default='?')

The placeholder for missing values in the CSV file.

opLabel : str, optional (default='Reading ZIP')

A label describing the operation, useful for logging or display purposes.

Returns:

pandas.DataFrame

A DataFrame containing the data from the extracted ZIP file, with missing values handled as specified and columns renamed if necessary.
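
For readers unfamiliar with ARFF, the sketch below renders a few rows in that layout: a relation name, one attribute declaration per column, and a data section where '?' marks a missing value. The helper rows_to_arff and the sample rows are hypothetical, all non-class columns are declared as string for simplicity, and values are assumed to contain no spaces or commas (ARFF would require quoting otherwise). It does not claim to reproduce the file that zip2arf writes.

```python
def rows_to_arff(relation, columns, rows, class_col="label", missing="?"):
    """Render dict rows as ARFF text (hypothetical, simplified helper)."""
    # The class column is declared as a nominal attribute listing its values.
    class_values = sorted({row[class_col] for row in rows
                           if row[class_col] is not None})
    nominal = "{" + ",".join(class_values) + "}"

    lines = ["@relation " + relation, ""]
    for col in columns:
        kind = nominal if col == class_col else "string"
        lines.append("@attribute " + col + " " + kind)

    lines.extend(["", "@data"])
    for row in rows:
        # None is written as the missing-value placeholder.
        lines.append(",".join(missing if row[col] is None else str(row[col])
                              for col in columns))
    return "\n".join(lines) + "\n"


# Made-up sample rows; None marks a missing value.
rows = [
    {"tid": "1", "poi": "Home", "label": "A"},
    {"tid": "1", "poi": None, "label": "A"},
    {"tid": "2", "poi": "Work", "label": "B"},
]
arff = rows_to_arff("train", ["tid", "poi", "label"], rows)
print(arff)
```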

matdata.converter.zip2csv(folder, file, cols, class_col='label', tid_col='tid', missing='?')

Extracts and compiles the trajectory CSV files from a ZIP archive and converts them into a pandas DataFrame.

Parameters:

folder : str

The directory path where the ZIP archive is located; the resulting CSV file is also written here.

file : str

The name of the ZIP archive file (with or without extension).

cols : list of str

A list of column names to be included in the DataFrame.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

tid_col : str, optional (default='tid')

The name of the column to be used as the trajectory identifier.

missing : str, optional (default='?')

The placeholder for missing values in the CSV file.

Returns:

pandas.DataFrame

A DataFrame containing the data from the extracted CSV file, with missing values handled as specified and columns renamed if necessary.
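
Compiling several CSV members of an archive into one table can be sketched as below. The archive layout (one CSV per trajectory, each with the same header), the member names, and the helpers build_sample_zip and compile_zip_csvs are assumptions made for illustration; the reference above does not describe the layout that zip2csv expects.

```python
import csv
import io
import zipfile


def build_sample_zip():
    """Create a small in-memory archive with one CSV per trajectory (made up)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("traj_1.csv", "tid,label,poi\n1,A,Home\n1,A,Work\n")
        zf.writestr("traj_2.csv", "tid,label,poi\n2,B,Gym\n")
    return buffer.getvalue()


def compile_zip_csvs(zip_bytes, cols):
    """Concatenate the rows of every CSV member, keeping only cols."""
    compiled = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # Sort member names so the output order is deterministic.
        for name in sorted(zf.namelist()):
            if not name.endswith(".csv"):
                continue
            with zf.open(name) as member:
                text = io.TextIOWrapper(member, encoding="utf-8", newline="")
                for record in csv.DictReader(text):
                    compiled.append({col: record[col] for col in cols})
    return compiled


table = compile_zip_csvs(build_sample_zip(), cols=["tid", "poi", "label"])
for row in table:
    print(row)
```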

matdata.converter.zip2df(url, class_col='label', tid_col='tid', missing='?', opLabel='Reading ZIP')

Extracts and converts a CSV trajectory file from a ZIP archive located at a given URL into a pandas DataFrame.

Parameters:

url : str

The URL pointing to the ZIP archive containing the CSV file to be read.

class_col : str, optional (default='label')

The name of the column to be treated as the class/label column.

tid_col : str, optional (default='tid')

The name of the column to be used as the unique trajectory identifier.

missing : str, optional (default='?')

The placeholder for missing values in the CSV file.

opLabel : str, optional (default='Reading ZIP')

A label describing the operation, for logging purposes.

Returns:

pandas.DataFrame

A DataFrame containing the data from the extracted CSV file, with missing values handled as specified and columns renamed if necessary.