
DIY or GPT-4?

A comparative evaluation of GPT-4 Turbo and Analyse & Tal and TrygFonden's A&ttack models on five parameters

With the report DIY or GPT-4? we contribute concrete knowledge about the pros and cons of using different AI-based technologies for word processing and propose a methodological framework for evaluating the performance of AI models.


The study answers the question of whether it is worth investing in custom-built supervised algorithms such as A&ttack, or whether they should be retired in favour of the prompt-based Swiss Army knife, GPT-4.

The study is a comparative evaluation of our own supervised classification models, A&ttack 1 and A&ttack 2.5, and the most talked about commercial AI on the market, GPT-4. The use case is the identification of linguistic attacks in the public debate on Facebook.

The models are evaluated comparatively on five parameters:

1. Performance - How accurate are the models' results compared to human judgements?

2. Fairness - Are there biases in the models' results?

3. Stability - How reliable are the results over time?

4. Cost - How much does it cost to use the technologies?

5. Power consumption - How much power do the models use?

In addition, we test and evaluate the annotation potential of GPT-4:

6. Annotation potential - Can GPT-4 replace or complement human annotators when generating training data?

Based on the evaluation, we conclude that it is currently not appropriate to use GPT-4 as a classification tool for mapping attacks in the public debate on Facebook in a Danish context.


A&ttack 2.5 beats GPT-4 on the standard performance parameter. GPT-4's results are also significantly less fair, measured as the average pairwise difference in classification across 19 protected groups. GPT-4 has stability problems as well: even over a short period of three days, the model changes its classifications for 10% of our test dataset. At the same time, classifying the debate with GPT-4 is three times more expensive than retraining the A&ttack model, and the carbon footprint of using GPT-4 to classify attacks in the public debate is 150 times greater than with A&ttack 2.5. We would not rule out using GPT-4 to annotate training data, but this strategy would require additional testing, which is currently against OpenAI's terms of use.
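The fairness comparison above rests on an average pairwise difference across protected groups. As a rough sketch of how such a metric can be computed (the report does not publish its exact formula, so the function name and the group rates below are hypothetical):

```python
from itertools import combinations

def avg_pairwise_difference(rates):
    """Average absolute difference in attack-classification rate
    between every pair of protected groups.

    `rates` maps group name -> share of that group's comments the
    model labels as attacks. Hypothetical helper, not the report's
    published implementation.
    """
    pairs = list(combinations(rates.values(), 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Illustrative rates for three (of 19) protected groups
rates = {"group_a": 0.12, "group_b": 0.30, "group_c": 0.18}
print(round(avg_pairwise_difference(rates), 3))  # 0.12
```

A higher value means the model's classification rate varies more between groups, i.e. the model is less fair under this definition.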

TrygFonden and Analyse & Tal are behind the study, which is a methodological complement to our analysis of attacks and hate in the public debate on Facebook.


Publication date

December 1, 2024