Pre-processing
MAT-Tools: Python Framework for Multiple Aspect Trajectory Data Mining
This application offers a tool to support the user in the preprocessing of multiple aspect trajectory data. It integrates into a unique framework for multiple aspect trajectories and, more generally, for multidimensional sequence data mining methods. Copyright (C) 2022, MIT license (this portion of code is subject to licensing from source project distribution)
Created in Dec, 2023. Copyright (C) 2023, License GPL Version 3 or superior (see LICENSE file)
- Authors:
Tarlis Portela
- matdata.preprocess.countClasses(data_path, folder, file='train.csv', tid_col='tid', class_col='label', markd=False)
Counts the occurrences of each class label in a dataset.
Parameters:
- data_path : str
The directory path where the dataset file is located.
- folder : str
The subfolder within the data path where the dataset file is located.
- file : str, optional (default='train.csv')
The name of the dataset file to be read.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- markd : bool, optional (default=False)
A flag indicating whether to produce the class counts in Markdown format.
Returns:
- pandas.DataFrame or str
If markd is False, prints the counts and returns a DataFrame containing the counts of each class label in the dataset. If markd is True, returns a Markdown str with the counts of each class label in the dataset.
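As an illustration of the counting idea, here is a minimal plain-Python sketch. It assumes that counts are taken per trajectory (one count per distinct tid) rather than per point, which this page does not state. The helper names `count_classes` and `to_markdown` and the toy rows are hypothetical and are not part of matdata.

```python
from collections import Counter

# Hypothetical toy data: one row per trajectory point.
rows = [
    {"tid": 1, "label": "walk"},
    {"tid": 1, "label": "walk"},
    {"tid": 2, "label": "bus"},
    {"tid": 3, "label": "walk"},
]


def count_classes(rows, tid_col="tid", class_col="label"):
    """Count distinct trajectories per class label (assumed semantics)."""
    label_of = {row[tid_col]: row[class_col] for row in rows}
    return Counter(label_of.values())


def to_markdown(counts):
    """Render the counts as a small Markdown table, largest class first."""
    lines = ["| Label | # |", "|---|---|"]
    lines += [f"| {label} | {n} |" for label, n in counts.most_common()]
    return "\n".join(lines)


counts = count_classes(rows)
print(to_markdown(counts))
```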
- matdata.preprocess.datasetStatistics(data_path, folder, file_prefix='', tid_col='tid', class_col='label', to_file=False)
Computes statistics for a dataset, including summary statistics for each column and the class distribution, formatted as Markdown.
Parameters:
- data_path : str
The directory path where the dataset file(s) are located.
- folder : str
The subfolder within the data path where the dataset file(s) are located.
- file_prefix : str, optional (default='')
The prefix to be added to the dataset file names.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- to_file : bool or str, optional (default=False)
Whether to save the statistics to a file. When a str is given, it is used as the file name.
Returns:
- str
If to_file is False, prints the Markdown and returns a str containing the computed statistics. If to_file is a str, returns the Markdown str and also saves the statistics to the file named by to_file.
- matdata.preprocess.dfStats(df)
Computes summary statistics for each column in a DataFrame.
Parameters:
- df : pandas.DataFrame
The DataFrame for which statistics are to be computed.
Returns:
- pandas.DataFrame
A DataFrame containing summary statistics for each column, including mean, standard deviation, and variance. Columns are sorted by variance in descending order.
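The summary described above (mean, standard deviation and variance per column, ordered by variance) can be sketched with the standard library alone. This is an illustration, not the library's code: the `column_stats` helper and the toy columns are hypothetical, and the sketch assumes sample statistics (n - 1 denominator), which is also the default of pandas `DataFrame.var`.

```python
import statistics

# Hypothetical numeric columns, stored as plain lists.
columns = {
    "lat": [10.0, 10.5, 11.0, 11.5],
    "hour": [0.0, 6.0, 12.0, 18.0],
}


def column_stats(columns):
    """Return (name, mean, std, variance) tuples sorted by variance, descending."""
    stats = []
    for name, values in columns.items():
        stats.append((
            name,
            statistics.mean(values),
            statistics.stdev(values),     # sample standard deviation
            statistics.variance(values),  # sample variance (n - 1 denominator)
        ))
    return sorted(stats, key=lambda item: item[3], reverse=True)


result = column_stats(columns)
for name, mean, std, var in result:
    print(f"{name}: mean={mean:.2f} std={std:.2f} var={var:.2f}")
```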
- matdata.preprocess.dfVariance(df)
Computes the variance for each column in a DataFrame.
Parameters:
- df : pandas.DataFrame
The DataFrame for which variance is to be computed.
Returns:
- pandas.Series
A Series containing the variance for each column in the DataFrame.
- matdata.preprocess.featuresJSON(df, version=1, deftype='nominal', defcomparator='equals', tid_col='tid', label_col='label', file=False)
Generates a JSON representation of features from a DataFrame.
Parameters:
- df : pandas.DataFrame
The DataFrame containing the dataset.
- version : int, optional (default=1)
The version number of the JSON schema (1 for MASTERMovelets format, 2 for HiPerMovelets format).
- deftype : str, optional (default='nominal')
The default type of features.
- defcomparator : str, optional (default='equals')
The default comparator for features.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- label_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- file : bool or str, optional (default=False)
Whether to save the JSON representation to a file. When a str is given, it is used as the file name.
Returns:
- str
If file is False, returns a str representing the features in JSON format. If file is a str, returns the JSON str and also saves the JSON representation to the file named by file.
- matdata.preprocess.joinTrainTest(dir_path, train_file='train.csv', test_file='test.csv', tid_col='tid', class_col='label', to_file=False)
Joins training and testing datasets from separate files into a single DataFrame.
Parameters:
- dir_path : str
The directory path where the training and testing files are located.
- train_file : str, optional (default='train.csv')
The name of the training file to be read.
- test_file : str, optional (default='test.csv')
The name of the testing file to be read.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- to_file : bool, optional (default=False)
A flag indicating whether to save the joined DataFrame to a file named 'joined.csv'.
Returns:
- pandas.DataFrame
A DataFrame containing the joined training and testing data. If to_file is True, the DataFrame is also saved to a file named 'joined.csv'.
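The join can be pictured as a plain concatenation of the two files. The sketch below uses in-memory CSV text in place of real train.csv and test.csv files; the helper names `read_rows` and `join_train_test` and the sample contents are hypothetical, and the sketch says nothing about how matdata itself orders or renumbers rows.

```python
import csv
import io

# Hypothetical file contents; in practice these would be train.csv and test.csv.
train_csv = "tid,label,poi\n1,A,home\n1,A,work\n"
test_csv = "tid,label,poi\n2,B,gym\n"


def read_rows(text):
    """Parse CSV text into a list of dict rows (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def join_train_test(train_rows, test_rows):
    """Concatenate the two sets, training rows first."""
    return train_rows + test_rows


joined = join_train_test(read_rows(train_csv), read_rows(test_csv))
print(len(joined), [row["tid"] for row in joined])
```

If the two files reuse the same tid values, the identifiers would have to be made unique before joining. Whether the library handles that case is not stated on this page.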
- matdata.preprocess.kfold_trainTestSplit(df, k, random_num=1, tid_col='tid', class_col='label', fileprefix='', columns_order=None, ktrain=None, ktest=None, mat_columns=None, data_path='.', outformats=[], verbose=False, sort=True)
Splits a DataFrame into k folds for k-fold cross-validation, optionally organizing columns and saving the folds to files.
Parameters:
- df : pandas.DataFrame
The DataFrame to be split into k folds.
- k : int
The number of folds for cross-validation.
- random_num : int, optional (default=1)
The random seed for reproducible results.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- fileprefix : str, optional (default='')
The prefix to be added to the file names when saving, for example: 'specific_' or 'generic_'.
- columns_order : list of str, optional
A list of column names specifying the desired order of columns. If None, no reordering is performed.
- ktrain : list of pandas.DataFrame, optional
A list of training sets, one for each fold. If None, the function will split the data into training and testing sets.
- ktest : list of pandas.DataFrame, optional
A list of testing sets, one for each fold. If None, the function will split the data into training and testing sets.
- mat_columns : list of str, optional
A list of column names to be included in the .mat files, corresponding to columns_order.
- data_path : str, optional (default='.')
The directory path where the output files will be saved.
- outformats : list of str, optional
A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).
- verbose : bool, optional (default=False)
A flag indicating whether to display progress messages.
- sort : bool, optional (default=True)
If True, sorts the data by class_col and tid_col.
Returns:
- ktrain : list of pandas.DataFrame
A list of DataFrames containing the training set of each fold.
- ktest : list of pandas.DataFrame
A list of DataFrames containing the testing set of each fold.
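For trajectory data, a fold split is usually made over trajectory identifiers rather than over individual points, so that all points of a trajectory end up on the same side. The sketch below shows one way to do that with class stratification. It is an assumption about the general technique, not a description of matdata's internals; the `kfold_tids` helper and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy data: tid -> class label (one entry per trajectory).
labels = {tid: ("A" if tid <= 6 else "B") for tid in range(1, 13)}


def kfold_tids(labels, k, random_num=1):
    """Return a list of (train_tids, test_tids) pairs, stratified by class."""
    by_class = defaultdict(list)
    for tid, label in sorted(labels.items()):
        by_class[label].append(tid)

    rng = random.Random(random_num)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        tids = by_class[label]
        rng.shuffle(tids)
        # Deal the shuffled ids round-robin so each fold gets a share of the class.
        for i, tid in enumerate(tids):
            folds[i % k].append(tid)

    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(t for j, fold in enumerate(folds) if j != i for t in fold)
        splits.append((train, test))
    return splits


splits = kfold_tids(labels, k=3)
for train, test in splits:
    print(len(train), len(test))
```

Each trajectory appears in exactly one test fold, and the rows of a fold would then be selected by filtering the point-level data on the tid column.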
- matdata.preprocess.klabels_extract(df, kl=10, random_num=1, tid_col='tid', class_col='label', organize_columns=True, sort=True)
Extracts the data for a specified number of class labels from a DataFrame, optionally organizing columns.
Parameters:
- df : pandas.DataFrame
The DataFrame from which the class labels are extracted.
- kl : int, optional (default=10)
The number of class labels to extract.
- random_num : int, optional (default=1)
The random seed for reproducible results.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- organize_columns : bool, optional (default=True)
A flag indicating whether to organize columns.
- sort : bool, optional (default=True)
If True, sorts the data by class_col and tid_col.
Returns:
- data : pandas.DataFrame
A DataFrame containing the dataset.
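One plausible reading of extracting kl class labels is to draw kl labels with a seeded random generator and keep only the rows that carry them. The sketch below follows that reading. How matdata actually chooses the labels is not documented here, so the selection rule, the `extract_k_labels` helper and the toy rows are all assumptions.

```python
import random

# Hypothetical toy data: eight trajectories over four classes.
rows = [{"tid": tid, "label": label} for tid, label in enumerate("AABBCCDD", start=1)]


def extract_k_labels(rows, kl=2, random_num=1, class_col="label"):
    """Keep only the rows of kl randomly chosen class labels."""
    all_labels = sorted({row[class_col] for row in rows})
    rng = random.Random(random_num)
    chosen = set(rng.sample(all_labels, min(kl, len(all_labels))))
    return [row for row in rows if row[class_col] in chosen]


subset = extract_k_labels(rows, kl=2)
print(sorted({row["label"] for row in subset}), len(subset))
```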
- matdata.preprocess.klabels_stratify(df, kl=10, train_size=0.7, random_num=1, tid_col='tid', class_col='label', organize_columns=True, mat_columns=None, fileprefix='', outformats=[], data_path='.', sort=True)
Stratifies a DataFrame by a specified number of class labels and splits it into training and testing sets, optionally organizing columns and saving the sets to files.
Parameters:
- df : pandas.DataFrame
The DataFrame to be stratified and split into training and testing sets.
- kl : int, optional (default=10)
The number of class labels used to stratify the DataFrame.
- train_size : float, optional (default=0.7)
The proportion of the stratified dataset to include in the training set.
- random_num : int, optional (default=1)
The random seed for reproducible results.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- organize_columns : bool, optional (default=True)
A flag indicating whether to organize columns before saving.
- mat_columns : list of str, optional (unused for now)
A list of column names to be included in the .mat files, if set to save.
- fileprefix : str, optional (default='')
The prefix to be added to the file names when saving.
- outformats : list of str, optional
A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).
- data_path : str, optional (default='.')
The directory path where the output files will be saved.
- sort : bool, optional (default=True)
If True, sorts the data by class_col and tid_col.
Returns:
- train : pandas.DataFrame
A DataFrame containing the training set.
- test : pandas.DataFrame
A DataFrame containing the testing set.
- matdata.preprocess.organizeFrame(df, columns_order=None, tid_col='tid', class_col='label', make_spatials=False)
Organizes a DataFrame by reordering columns and optionally converting spatial columns.
Parameters:
- df : pandas.DataFrame
The DataFrame to be organized.
- columns_order : list of str, optional
A list of column names specifying the desired order of columns. If None, no reordering is performed.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- make_spatials : bool, optional (default=False)
A flag indicating whether to convert spatial columns between the separate lat/lon format and the space format, in which lat and lon are concatenated in one column.
Returns:
- pandas.DataFrame
A DataFrame containing the organized data, with columns added as specified and spatial columns converted if requested.
- columns_order_zip : list
A list of the columns including the space column, if present.
- columns_order_csv : list
A list of the columns including the lat/lon columns, if present.
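The two spatial layouts mentioned above can be converted into each other in a few lines. The sketch below assumes that the space value is the latitude and longitude joined by a single space character; the separator, the helper names `add_space` and `split_space`, and the sample point are assumptions for illustration and are not taken from matdata.

```python
def add_space(row, sep=" "):
    """Return a copy of the row with lat/lon merged into a 'space' field."""
    out = {key: value for key, value in row.items() if key not in ("lat", "lon")}
    out["space"] = f"{row['lat']}{sep}{row['lon']}"
    return out


def split_space(row, sep=" "):
    """Return a copy of the row with 'space' split back into lat and lon."""
    out = {key: value for key, value in row.items() if key != "space"}
    lat, lon = row["space"].split(sep)
    out["lat"], out["lon"] = float(lat), float(lon)
    return out


# Hypothetical trajectory point.
point = {"tid": 1, "label": "A", "lat": -27.6, "lon": -48.52}
merged = add_space(point)
print(merged["space"])
print(split_space(merged) == point)
```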
- matdata.preprocess.readDataset(data_path, folder=None, file='train.csv', class_col='label', tid_col='tid', missing=None)
Reads a dataset file (CSV format by default, 'train.csv') and returns it as a pandas DataFrame.
Parameters:
- data_path : str
The directory path where the dataset file is located.
- folder : str, optional
The subfolder within the data path where the dataset file is located.
- file : str, optional (default='train.csv')
The name of the dataset file to be read.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- missing : str, optional (default=None)
The placeholder for missing values in the dataset, for example '?'.
Returns:
- pandas.DataFrame
A DataFrame containing the dataset from the specified file, with trajectory identifier, class label, and missing values handled as specified.
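To make the role of a missing-value placeholder concrete, the sketch below reads CSV text and turns every cell equal to the placeholder into None. This is one plausible interpretation of the missing parameter, not a statement of what matdata does with it; the `read_with_missing` helper and the sample text are hypothetical.

```python
import csv
import io

# Hypothetical file contents with '?' marking a missing rating.
csv_text = "tid,label,poi,rating\n1,A,home,?\n1,A,work,4.5\n"


def read_with_missing(text, missing="?"):
    """Parse CSV text, mapping cells equal to the placeholder to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({key: (None if value == missing else value)
                     for key, value in row.items()})
    return rows


rows = read_with_missing(csv_text)
print(rows[0]["rating"], rows[1]["rating"])
```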
- matdata.preprocess.sortByLabel(df, tid_col='tid', class_col='label')
Sorts a DataFrame by the class label column and the trajectory ID column.
Parameters:
- df : pandas.DataFrame
The DataFrame to be sorted.
- tid_col : str, optional (default='tid')
The name of the column representing trajectory IDs.
- class_col : str, optional (default='label')
The name of the column representing class labels.
Returns:
- pandas.DataFrame
The sorted DataFrame.
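Sorting by class label and then by trajectory ID amounts to a sort on a two-part key. A minimal sketch, in which the `sort_by_label` helper, the seq field and the toy rows are hypothetical; it relies on Python's `sorted` being stable, so the points of one trajectory keep their original relative order.

```python
# Hypothetical toy data: 'seq' marks the original order of points in a trajectory.
rows = [
    {"tid": 3, "label": "B", "seq": 0},
    {"tid": 1, "label": "B", "seq": 0},
    {"tid": 2, "label": "A", "seq": 0},
    {"tid": 1, "label": "B", "seq": 1},
]


def sort_by_label(rows, tid_col="tid", class_col="label"):
    """Sort by (class label, trajectory id); ties keep their input order."""
    return sorted(rows, key=lambda row: (row[class_col], row[tid_col]))


ordered = sort_by_label(rows)
print([(row["label"], row["tid"], row["seq"]) for row in ordered])
```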
- matdata.preprocess.sortByTID(df_, tid_col='tid')
Sorts a DataFrame by the trajectory ID column.
Parameters:
- df_ : pandas.DataFrame
The DataFrame to be sorted.
- tid_col : str, optional (default='tid')
The name of the column representing trajectory IDs.
Returns:
- pandas.DataFrame
The sorted DataFrame.
- matdata.preprocess.stratify(df, sample_size=0.5, random_num=1, tid_col='tid', class_col='label', organize_columns=True, sort=True, opLabel='Data Stratification (class-balanced)')
Stratifies a DataFrame by class label, optionally organizing columns.
Parameters:
- df : pandas.DataFrame
The DataFrame to be stratified.
- sample_size : float, optional (default=0.5)
The proportion of the dataset to sample for stratification.
- random_num : int, optional (default=1)
The random seed for reproducible results.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- organize_columns : bool, optional (default=True)
A flag indicating whether to organize columns.
- sort : bool, optional (default=True)
If True, sorts the data by class_col and tid_col.
Returns:
- pandas.DataFrame
A DataFrame containing the stratified set.
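Stratified sampling keeps the same fraction of trajectories from every class, so the class proportions of the sample match those of the full dataset. The sketch below shows the idea over trajectory identifiers. The rounding rule, the minimum of one trajectory per class, the `stratified_sample` helper and the toy labels are assumptions and do not come from matdata.

```python
import random
from collections import defaultdict

# Hypothetical toy data: tid -> class label, with 8 trajectories of A and 4 of B.
labels = {tid: ("A" if tid <= 8 else "B") for tid in range(1, 13)}


def stratified_sample(labels, sample_size=0.5, random_num=1):
    """Sample the same fraction of trajectory ids from every class."""
    by_class = defaultdict(list)
    for tid, label in sorted(labels.items()):
        by_class[label].append(tid)

    rng = random.Random(random_num)
    kept = []
    for label in sorted(by_class):
        tids = by_class[label]
        n = max(1, round(len(tids) * sample_size))  # keep at least one per class
        kept.extend(rng.sample(tids, n))
    return sorted(kept)


sample = stratified_sample(labels)
print(len(sample), sum(labels[tid] == "A" for tid in sample))
```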
- matdata.preprocess.stratifyTrainTest(df, sample_size=0.5, train_size=0.7, random_num=1, tid_col='tid', class_col='label', organize_columns=True, mat_columns=None, fileprefix='', outformats=[], data_path='.', sort=True)
Stratifies a DataFrame by class label and splits it into training and testing sets, optionally organizing columns and saving the sets to files.
Parameters:
- df : pandas.DataFrame
The DataFrame to be stratified and split into training and testing sets.
- sample_size : float, optional (default=0.5)
The proportion of the dataset to sample for stratification.
- train_size : float, optional (default=0.7)
The proportion of the stratified dataset to include in the training set.
- random_num : int, optional (default=1)
The random seed for reproducible results.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- organize_columns : bool, optional (default=True)
A flag indicating whether to organize columns before saving.
- mat_columns : list of str, optional (unused for now)
A list of column names to be included in the .mat files, if set to save.
- fileprefix : str, optional (default='')
The prefix to be added to the file names when saving.
- outformats : list of str, optional
A list of output formats for saving the datasets (e.g., ['csv', 'zip', 'parquet']).
- data_path : str, optional (default='.')
The directory path where the output files will be saved.
- sort : bool, optional (default=True)
If True, sorts the data by class_col and tid_col.
Returns:
- train : pandas.DataFrame
A DataFrame containing the training set.
- test : pandas.DataFrame
A DataFrame containing the testing set.
- matdata.preprocess.suffle_df(df, tid_col='tid', random_num=1)
Shuffles a DataFrame by the trajectory ID column.
Parameters:
- df : pandas.DataFrame
The DataFrame to be shuffled.
- tid_col : str, optional (default='tid')
The name of the column representing trajectory IDs.
- random_num : int, optional (default=1)
Random seed for reproducibility.
Returns:
- pandas.DataFrame
The shuffled DataFrame.
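Shuffling by trajectory ID, as opposed to shuffling rows, would typically permute whole trajectories while leaving the points inside each one contiguous and in order. The sketch below implements that interpretation, which is an assumption about the intended behavior; the `shuffle_by_tid` helper, the seq field and the toy rows are hypothetical.

```python
import random

# Hypothetical toy data: 'seq' marks the order of points within a trajectory.
rows = [
    {"tid": 1, "seq": 0}, {"tid": 1, "seq": 1},
    {"tid": 2, "seq": 0},
    {"tid": 3, "seq": 0}, {"tid": 3, "seq": 1},
]


def shuffle_by_tid(rows, tid_col="tid", random_num=1):
    """Permute trajectories as blocks; points inside a trajectory keep their order."""
    groups = {}
    for row in rows:
        groups.setdefault(row[tid_col], []).append(row)
    tids = list(groups)
    random.Random(random_num).shuffle(tids)  # seeded, so the result is reproducible
    return [row for tid in tids for row in groups[tid]]


shuffled = shuffle_by_tid(rows)
print([(row["tid"], row["seq"]) for row in shuffled])
```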
- matdata.preprocess.trainTestSplit(df, train_size=0.7, random_num=1, tid_col='tid', class_col='label', fileprefix='', data_path='.', outformats=[], verbose=False, organize_columns=True, sort=True)
Splits a DataFrame into training and testing sets, optionally organizing columns and saving the sets to files.
Parameters:
- df : pandas.DataFrame
The DataFrame to be split into training and testing sets.
- train_size : float, optional (default=0.7)
The proportion of the dataset to include in the training set.
- random_num : int, optional (default=1)
The random seed for reproducible results.
- tid_col : str, optional (default='tid')
The name of the column to be used as the trajectory identifier.
- class_col : str, optional (default='label')
The name of the column to be treated as the class/label column.
- fileprefix : str, optional (default='')
The prefix to be added to the file names when saving.
- data_path : str, optional (default='.')
The directory path where the output files will be saved.
- outformats : list of str, optional
A list of output formats for saving the datasets (e.g., ['csv', 'parquet']).
- verbose : bool, optional (default=False)
A flag indicating whether to display progress messages.
- organize_columns : bool, optional (default=True)
A flag indicating whether to organize columns before saving.
- sort : bool, optional (default=True)
If True, sorts the data by class_col and tid_col.
Returns:
- train : pandas.DataFrame
A DataFrame containing the training set.
- test : pandas.DataFrame
A DataFrame containing the testing set.
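A hold-out split for trajectories can be sketched by dividing the trajectory identifiers of each class according to train_size, so that no trajectory is shared between the two sets. This is a generic illustration under stated assumptions (per-class split, rounding to the nearest integer), not matdata's implementation; the `train_test_tids` helper and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy data: tid -> class label, with 10 trajectories per class.
labels = {tid: ("A" if tid <= 10 else "B") for tid in range(1, 21)}


def train_test_tids(labels, train_size=0.7, random_num=1):
    """Split trajectory ids into train and test lists, class by class."""
    by_class = defaultdict(list)
    for tid, label in sorted(labels.items()):
        by_class[label].append(tid)

    rng = random.Random(random_num)
    train, test = [], []
    for label in sorted(by_class):
        tids = by_class[label]
        rng.shuffle(tids)
        cut = round(len(tids) * train_size)  # number of training ids for this class
        train.extend(tids[:cut])
        test.extend(tids[cut:])
    return sorted(train), sorted(test)


train, test = train_test_tids(labels)
print(len(train), len(test))
```

The point-level rows of each set would then be obtained by filtering the data on the tid column with these two lists.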