OpenAI’s CriticGPT outperforms humans in catching AI-generated code bugs

As a general rule, isn't it usually preferable for the backup/correcting mechanism to be different? For example, aircraft with GPS still have INS, compasses and maps, etc. The idea being that if GPS is down, you can still use INS, and if INS is malfunctioning or it's a general computer failure, if you've got a compass, a map, and a good watch (and all pilots should always have all three), you can still navigate.

This is a basic safety engineering principle. Why the hell are they using the same technology as the backup/correction mechanism? Seriously, no one deliberately designing a system would do it this way. It reeks of desperation to prove that LLMs can do things they are simply not capable of doing, by people who seem to believe that humans put sentences together in a similar fashion (!?!?!?!).
 
Upvote
14 (32 / -18)
I use ChatGPT as a starting point. It's like having a drunk intern help you write emails and code. It can be helpful, but you need to thoroughly review it before deploying it.

I'm not sure how helpful CriticGPT will be; maybe it'll be like a drunk junior staffer now? I still wouldn't trust its output without verifying it.
 
Upvote
0 (9 / -9)

Robin-3

Ars Praetorian
423
Subscriptor
It strikes me that some GPT errors will result in "fixable" errors within the GPT model itself, and these are likely the kind a CriticGPT could identify.

However, others are likely to stem from the fact that GPTs are designed to mimic human intelligence, but they aren't human. They don't have actual experience, judgement (which even humans with poor judgement have), comprehension, etc. And I can't conceive of any CriticGPT being able to address these, even if it can be taught to identify patterns that signal them.

I'm not sure how to better describe this. And maybe I'm wrong, and CriticGPT will do brilliantly... but I don't think so, and I find I actually hope it doesn't.
 
Upvote
3 (11 / -8)
OpenAI created CriticGPT to act as an AI assistant to human trainers who review programming code generated by the ChatGPT AI.
Act as an assistant for now. Later on, maybe not so much. Then they'll just need a third system to replace the prompt-engineer / types-the-question guy, to eliminate the human aspect entirely below the C-suite level.
 
Upvote
-2 (2 / -4)
Well, this does sound ridiculous, but I have had some success with LLMs when I point out their errors and ask them to rewrite the output. Maybe it'll help a little bit? Nothing will fix the fundamental issues at hand, though, because it's not AI in the sci-fi sense of the word, the way they want it to be.
 
Upvote
-1 (6 / -7)
So it's turtles AI all the way down? Sounds about right for a company that makes turtles AI.
Yep, it's insane.

This tells me a few things just from a high level.

1) Their tool (ChatGPT) does not do what they claim.
2) Their tool (ChatGPT) is not as accurate as they claim.
3) Given the first two, I'm fairly certain that other machine learning systems besides large language models might be both more accurate and cheaper, in the short run and the long run.
 
Upvote
-4 (6 / -10)

Psyborgue

Ars Praefectus
3,663
Subscriptor++
It's chatbots watching chatbots all the way down
Yeah, but, does it work? If annotators are choosing the generated critiques over the human ones, then I guess it does.
Self certification absolutely works. Just ask Boeing.
This is to aid the annotators, not replace them. Read the actual paper and/or article before posting. OpenAI's papers are among the most accessible out there.
 
Upvote
1 (6 / -5)

Ralf The Dog

Ars Praefectus
4,321
Subscriptor++
As a general rule, isn't it usually preferable for the backup/correcting mechanism to be different? For example, aircraft with GPS still have INS, compasses and maps, etc. The idea being that if GPS is down, you can still use INS, and if INS is malfunctioning or it's a general computer failure, if you've got a compass, a map, and a good watch (and all pilots should always have all three), you can still navigate.

This is a basic safety engineering principle. Why the hell are they using the same technology as the backup/correction mechanism? Seriously, no one deliberately designing a system would do it this way. It reeks of desperation to prove that LLMs can do things they are simply not capable of doing, by people who seem to believe that humans put sentences together in a similar fashion (!?!?!?!).
It largely is different. First off, they are feeding the supervisor model lots of data about mistakes GPT has made in the past; the primary LLM is not trained on that data, so it effectively gives the model several new layers. Even more important, they are quite likely playing some very cool games with the system prompt that is fed to the fact checker. After a while this can get very expensive, because you wind up with lots of different versions of the LLM talking to each other. You eat lots of tokens and use up more context memory than you would like to think about.
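For the curious, here is roughly what a separate "checker pass" looks like from the API side: a second call with its own system prompt reviewing the first model's answer. This is a minimal sketch using the public openai Python SDK, not OpenAI's actual training pipeline; the model name, prompts, and helper functions are illustrative assumptions.

```python
# Hypothetical two-pass review: a second "critic" call with its own system
# prompt checks the primary model's answer. Model name, prompts, and helper
# names are illustrative; this is not OpenAI's internal pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_code(task: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are a coding assistant."},
            {"role": "user", "content": task},
        ],
    )
    return resp.choices[0].message.content

def critique_code(task: str, code: str) -> str:
    # The critic sees the same task plus the candidate answer, but its
    # system prompt is tuned for finding bugs rather than writing code.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "You are a code reviewer. List concrete bugs, security "
                "issues, and missing functionality, quoting the offending "
                "lines. Do not rewrite the code.")},
            {"role": "user", "content": f"Task:\n{task}\n\nCandidate code:\n{code}"},
        ],
    )
    return resp.choices[0].message.content

task = "Write a function that parses an ISO 8601 date string."
candidate = generate_code(task)
# The critique goes to a human reviewer; nothing is applied automatically.
print(critique_code(task, candidate))
```

Note that every review pass re-sends the task plus the full candidate answer, which is exactly where the token and context costs pile up.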
 
Upvote
2 (6 / -4)

Psyborgue

Ars Praefectus
3,663
Subscriptor++
So the AI that has 100% confidence that the bullshit code filled with placeholders instead of functionality is somehow exactly what you wanted is going to be babysat by another AI that knows that's wrong?

I have doubts.
Then maybe read the paper. This is to aid human annotators. Besides which, do you find bugs when reading over your own work? Because I do. It is possible to use a model to improve itself. This is proven.
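As a rough illustration of that "second look" idea, here is a toy self-review loop, assuming the public openai SDK: the same model drafts, critiques, and then revises. It is a sketch only, not the training procedure described in the paper, and the model name and prompts are made up.

```python
# Toy "read over your own work" loop: the same model critiques and then
# revises its own draft. A sketch only; prompts and model name are made up,
# and this is not the procedure described in the paper.
from openai import OpenAI

client = OpenAI()

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content

draft = ask("You are a coding assistant.",
            "Write a Python function that reverses a singly linked list.")
review = ask("You are a strict code reviewer. List specific defects only.", draft)
revised = ask("You are a coding assistant. Apply the review comments.",
              f"Draft:\n{draft}\n\nReview:\n{review}\n\nRewrite the draft accordingly.")
print(revised)
```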

 
Upvote
6 (8 / -2)

Yeah, but, does it work? If annotators are choosing the generated critiques over the human ones, then I guess it does.

This is to aid the annotators, not replace them. Read the actual paper and/or article before posting. OpenAI's papers are among the most accessible out there.
Aid the annotators... by giving them another AI process to monitor for bad outputs? Shades of FSD.
 
Upvote
0 (2 / -2)

DeschutesCore

Ars Scholae Palatinae
1,045
Subscriptor
Then maybe read the paper. This is to aid human annotators. Besides which, do you find bugs when reading over your own work? Because I do. It is possible to use a model to improve itself. This is proven.

I did read the paper.

I also happen to know OpenAI plays fast and loose with everything they say and do, and them slapping this together as yet another iteration and selling access is damned near guaranteed.
 
Upvote
11 (12 / -1)

DeschutesCore

Ars Scholae Palatinae
1,045
Subscriptor
Language models can do that. OpenAI has an API to call arbitrary functions/tools which can include a shell, interpreter, compiler -- anything.


The limit is creativity.
ChatGPT can run Python scripts and such, but it cannot compile code that I have ever seen, and I have numerous logs of it saying it can't run code*. This is most visible when you ask it complex math problems. I think they changed the settings so you only see "analyzing" now unless you change the setting in your account panel.

* I can't even count the number of times the model has insisted it can't run code, right after it said it can't read files that you upload, nor could it explain why the upload button exists.
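For reference, the quoted "API to call arbitrary functions/tools" refers to the tools parameter of the Chat Completions API, where the model only requests a call and the developer's own code executes it. A minimal sketch follows; the run_shell tool is hypothetical (a real system would sandbox it), and the model name and prompts are illustrative.

```python
# Minimal sketch of the tools/function-calling flow in the Chat Completions
# API. The run_shell tool is hypothetical; a real system would sandbox it.
import json
import subprocess
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

messages = [{"role": "user", "content": "Compile hello.c with gcc and show any errors."}]
resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
msg = resp.choices[0].message

if msg.tool_calls:  # the model asked to use the tool; it never runs anything itself
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = subprocess.run(args["command"], shell=True, capture_output=True, text=True)
    messages += [msg, {"role": "tool", "tool_call_id": call.id,
                       "content": result.stdout + result.stderr}]
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```

The execution here happens on the developer's side, which is a different thing from the hosted Python tool the reply above is describing.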
 
Upvote
4 (5 / -1)