
Detection of Generated Text Reviews by Leveraging Methods from Authorship Attribution

In this software repository you can find code that was used to calculate the results for our paper submitted to BTW23 (https://sites.google.com/view/btw-2023-tud).

In the following, we give a step-by-step description of how to reproduce our experiments.

Environment

Hardware

The experiments were executed on a virtual machine running Debian 10, with 12 cores and 128 GB of memory.

If the experiments are run on a machine with less memory, the n_jobs parameter in the function grid_parameters in fakereviewsmodels.py should be changed from "n_jobs": -1 to "n_jobs": 1, so that only one job is used by the grid search.
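
As an illustration, such a change could look as follows, assuming grid_parameters returns a dictionary of settings for the grid search; the actual structure of fakereviewsmodels.py may differ.

    # Hypothetical excerpt; the real grid_parameters may contain different settings.
    def grid_parameters():
        return {
            "scoring": "f1",
            "cv": 10,
            # "n_jobs": -1,  # default: use all available cores
            "n_jobs": 1,     # use a single job to reduce memory usage
        }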

Software

Our software project uses poetry to manage Python dependencies; poetry itself requires a basic Python installation. Instructions on the requirements to run and install poetry can be found here: https://python-poetry.org/docs.

For the rest of this document, we assume that a Python 3.8.12 installation is available, since this version of Python was used to run the experiments.

  1. Once poetry is successfully installed, clone this repository to the computer on which you want to run our code.

  2. Change into the folder of the repository and execute the following commands. They will set the Python version to use, create a virtual Python environment, and install the necessary dependencies. For example, <path-to-python-with-version-3.8.12> could be ~/.pyenv/versions/3.8.12/bin/python on a system that uses pyenv.

     $ poetry env use <path-to-python-with-version-3.8.12>
     $ poetry install
  3. Execute the following command to activate the virtual environment.

     $ poetry shell

Data Preprocessing

The dataset needs to be downloaded and some pre-processing steps need to be run first. This is described step by step in the following.

  1. Download the dataset published by Salminen et al. [paper] [dataset] and place the file fakereviewsdataset.csv into the folder data/raw. Make sure to remove any blank characters from the filename of the CSV file.

  2. In the next step, the data from the CSV is split into 10 folders, using one folder per review category. This is done by changing into the folder tools and running the code in preprocessing.py, using the following command in the poetry shell. The expected runtime is only a few seconds.

     $ python -m preprocessing
  3. The dependency trees are built using the stanza library from Stanford (https://stanfordnlp.github.io/stanza/), which is done for each category folder individually. This can be achieved by executing the following command in the root directory of the repository for each of the folders separately, replacing <FOLDER_REVIEW_CATEGORY> with the corresponding folder name, or, alternatively, by running the shell script parsedependencies.sh to parse all category folders.

     $ parse_dependency data/processed/<FOLDER_REVIEW_CATEGORY>

    When the parsing pipeline is started for the first time, the necessary models for English are downloaded (~320 MB). Parsing all the texts and creating the dependency trees takes some time, up to a few hours. A minimal example illustrating a single parse is shown after this list.

  4. The extraction of the statistical text features can be done by changing into the folder tools and executing the following command. We based our code on the implementation of Strøm. This step should finish within a few minutes.

     $ python -m extract_textfeatures

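To illustrate what the dependency parsing step produces, the following minimal stanza sketch parses a single example sentence and prints the dependency head and relation of each word. It is not the repository's parse_dependency code, which processes whole category folders; the sentence is made up for illustration.

    import stanza

    # Downloads the English models (~320 MB) on first use, as noted above.
    stanza.download("en")

    # Pipeline with the processors needed for dependency parsing.
    nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")

    doc = nlp("The hotel room was spotless and the staff were very friendly.")
    for sentence in doc.sentences:
        for word in sentence.words:
            # word.head is the index of the governing word (0 means root).
            print(word.id, word.text, word.deprel, "->", word.head)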

With the preprocessing done, the following section describes how to run our experiments.

Experiment Configurations

Our experiments are executed using the dbispipeline package, which builds on scikit-learn pipelines. The central configuration for each individual experiment can be found in the folders preliminary and final, which contain the code for the preliminary and the final experiments, respectively.

The code for the preliminary experiments is split between three folders, as displayed below: the first folder contains the code for the initial grid search, the second the experiments to find the best parameters for the tf-idf vectorizer, and the third the code where the hyperparameters were optimized. The experiments in 1_initial and 2_ngram are so-called plan files; a description of how to execute them can be found below in the section Plan Configuration. In the folder 3_optimization, the files whose names include lightgbm can be run by executing python <path-and-filename>. The remaining files in this folder are again plan files; see the description below on how to run these.

    preliminary
    ├── 1_initial
    ├── 2_ngram
    └── 3_optimization
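
For illustration, the kind of tf-idf n-gram search run in 2_ngram could be expressed in plain scikit-learn as in the following sketch; the classifier and all parameter values are placeholders, and the actual grids are defined in the plan files.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    # Illustrative pipeline and grid only; the real experiments are configured via dbispipeline.
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    param_grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],  # word n-gram sizes to compare
        "tfidf__min_df": [1, 5],
    }

    search = GridSearchCV(pipeline, param_grid, cv=10, n_jobs=1)
    # search.fit(texts, labels)  # texts/labels: the preprocessed reviews and their classes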

The code for the final experiments is split into two folders, as shown below.

    final
    ├── predictions_memory
    └── runtime

The folder predictions_memory includes the code used for the experiments regarding predictions and memory. When executed, the code runs a ten-fold cross-validation prediction using the feature set and the classifier stated in the filename. Once finished, a file including the predictions and a second file containing the split indices from the cross-validation are stored in the folder results. The scripts can be run by executing the following command on the command line; the results regarding memory usage are printed to the command line.

    python -m memory_profiler <path-and-filename.py>
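
Note that python -m memory_profiler reports line-by-line memory usage for functions decorated with @profile. A minimal, stand-alone sketch of such a script is shown below; it is unrelated to the actual experiment scripts, which are structured differently.

    # Run with: python -m memory_profiler example.py
    from memory_profiler import profile

    @profile
    def build_feature_matrix():
        # Allocate something measurable so the report shows a memory increment.
        rows = [[float(i)] * 1000 for i in range(10000)]
        return len(rows)

    if __name__ == "__main__":
        build_feature_matrix()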

The folder runtime contains the code for the experiments used to report the runtimes; these are again plan files that can be run as described below. The results are printed on the command line.

Plan Configuration

To keep the plans for the final experiments tidy, some parts of the configuration were relocated to the folder textreviewdetection, as displayed in the tree below.

    textreviewdetection
    ├── dataloaders
    ├── features
    ├── models
    └── transformers
folder                             description
textreviewdetection/dataloaders    handles loading of the already preprocessed data
textreviewdetection/features       configurations of the three individual feature sets
textreviewdetection/models         configurations of the selected classifiers

Running Plan Files

The plan files can be executed in the poetry shell from the repository root folder. For example, to run the experiment corresponding to the plan textfeatures_initial.py, execute the following command:

    dbispipeline --force --dry-run preliminary/1_initial/textfeatures_initial.py

This calls dbispipeline to process the given configuration file, transforming the configuration into an sklearn pipeline. This pipeline is then executed and the results are printed on the command line when finished.
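
Conceptually, the executed pipeline corresponds to a scikit-learn pipeline evaluated with ten-fold cross-validation, roughly as in the following sketch; the feature set and classifier are placeholders, since the actual components are defined by the respective plan file.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import StratifiedKFold, cross_val_predict
    from sklearn.pipeline import Pipeline

    # Placeholder components; the plan file selects the actual feature set and classifier.
    pipeline = Pipeline([
        ("features", TfidfVectorizer()),
        ("clf", RandomForestClassifier()),
    ])

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    # predictions = cross_val_predict(pipeline, texts, labels, cv=cv)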