Report
DIY or GPT-4?
A comparative evaluation of GPT-4 Turbo and Analyse & Tal and TrygFonden's A&ttack models on five parameters
With the report DIY or GPT-4?, we contribute concrete knowledge about the pros and cons of using different AI-based technologies for text processing and propose a methodological framework for evaluating the performance of AI models.
The study answers the question of whether it is worth investing in custom-built supervised algorithms such as A&ttack, or whether they should be retired in favour of the prompt-based Swiss Army Knife GPT-4.
The study is a comparative evaluation of our own supervised classification models, A&ttack 1 and A&ttack 2.5, and the most talked-about commercial AI on the market, GPT-4. The use case is the identification of linguistic attacks in the public debate on Facebook.
The models are evaluated comparatively on five parameters:
1. Performance - How accurate are the models' results compared to human judgements?
2. Fairness - Are there biases in the models' results?
3. Stability - How reliable are the results over time?
4. Cost - How much does it cost to use the technologies?
5. Power consumption - How much power do the models use?
In addition, we test and evaluate the annotation potential of GPT-4:
6. Annotation potential - What is the ability of GPT-4 to replace or complement human annotators in the process of generating training data?
Based on the evaluation, we conclude that it is currently not appropriate to use GPT-4 as a classification tool for mapping attacks in the public debate on Facebook in a Danish context.
A&ttack 2.5 beats GPT-4 on the standard performance parameter. GPT-4's results are also significantly less fair, measured as the average pairwise difference in classification across 19 protected groups. GPT-4 has stability problems as well: even over a short period of three days, the model changed its classifications for 10% of our test dataset. At the same time, classifying the debate with GPT-4 is three times more expensive than retraining the A&ttack model, and the carbon footprint of using GPT-4 to classify attacks in the public debate is 150 times greater than with A&ttack 2.5. We would not rule out using GPT-4 to annotate training data, but this strategy would require additional testing, which is currently against OpenAI's terms of use.
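The fairness and stability parameters above can be expressed as simple metrics. The sketch below is illustrative only, not the report's actual code; the function names, the group rates, and the label sequences are invented for the example (the report itself uses 19 protected groups and a full test dataset).

```python
# Illustrative sketch of two evaluation parameters: fairness as the
# average pairwise difference in classification rates between protected
# groups, and stability as the share of classifications that change
# between two runs on the same data. All numbers are hypothetical.
from itertools import combinations


def avg_pairwise_difference(group_rates):
    """Fairness proxy: mean absolute difference in the attack-classification
    rate between every pair of groups."""
    pairs = list(combinations(group_rates, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def instability(labels_run1, labels_run2):
    """Stability proxy: fraction of items whose predicted label changed
    between two runs of the same model on the same items."""
    changed = sum(a != b for a, b in zip(labels_run1, labels_run2))
    return changed / len(labels_run1)


# Three hypothetical groups with attack-classification rates 10%, 25%, 15%
print(avg_pairwise_difference([0.10, 0.25, 0.15]))  # 0.1

# Ten hypothetical items classified on two different days
day1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
day2 = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]
print(instability(day1, day2))  # 0.2 (2 of 10 labels changed)
```

A model with identical classification rates across all groups would score 0 on the first metric, and a model that never changes its answers would score 0 on the second; larger values indicate worse fairness and worse stability, respectively.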
TrygFonden and Analyse & Tal are behind the study, which is a methodological complement to our analysis of attacks and hate in the public debate on Facebook.
Publication date
December 1, 2024