Why is the Ars community so anti-AI?

Dmytry

Ars Legatus Legionis
10,279
Other example of language modeling:
  • The village barber shaves himself and every other man in the village who don't shave himself. Does he shave himself?


  • The scenario you've described is a classic example of a logical paradox known as the "Barber Paradox," which was formulated by the British mathematician and logician Bertrand Russell in 1918.
  • The paradox goes as follows: In a village, there is a barber who shaves all those, and only those, men in the village who do not shave themselves. The question then arises: Does the barber shave himself?
  • If we assume that the barber does shave himself, according to the definition, he should not shave himself (since he only shaves those who do not shave themselves). On the other hand, if we assume that he does not shave himself, then according to the definition, he should shave himself (since he shaves all those who do not shave themselves).
  • Both assumptions lead to a contradiction, which means that the initial premise of the barber's shaving rule is logically inconsistent. There cannot exist such a barber within the constraints of the scenario as described, because it violates the rules of set theory and creates a paradox.
  • The Barber Paradox is a self-referential paradox related to the "set of all sets that do not contain themselves" and is used to illustrate the difficulties in attempting to construct a set of all sets that do not contain themselves. It's an example of the type of logical inconsistencies that Russell and Alfred North Whitehead were trying to resolve with their type theory in the Principia Mathematica.

The fundamental issue here is that there's no understanding of either Russell's paradox or even of the tautology that "a barber who shaves himself [...] shaves himself".

Both of those token patterns are stored inside the language model, without it having learned any relation between the two (such as the primacy of the tautology over the paradox). The Russell's-paradox pattern matches more strongly, so that's what it talks about.
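As a crude illustration of "the strongest surface pattern wins" (a toy, not how a transformer actually retrieves anything; the two candidate "patterns" below are made-up stand-ins for training text), compare bag-of-words overlap between the prompt and two canned continuations:

```python
# Toy illustration: which "stored pattern" does the prompt overlap with more,
# measured by TF-IDF cosine similarity? Requires scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

prompt = ("The village barber shaves himself and every other man "
          "in the village who don't shave himself. Does he shave himself?")

# Hypothetical stand-ins for two patterns a model might have memorized.
paradox_boilerplate = ("The barber shaves all those, and only those, men who do "
                       "not shave themselves. Does the barber shave himself? This "
                       "is Russell's barber paradox, a self-referential paradox.")
plain_tautology = ("A barber who shaves himself shaves himself. The answer "
                   "follows directly; there is no paradox.")

vec = TfidfVectorizer().fit([prompt, paradox_boilerplate, plain_tautology])
X = vec.transform([prompt, paradox_boilerplate, plain_tautology])

print("overlap with paradox boilerplate:", cosine_similarity(X[0], X[1])[0, 0])
print("overlap with plain tautology:    ", cosine_similarity(X[0], X[2])[0, 0])
# With these made-up stand-ins, the paradox boilerplate wins on surface overlap,
# even though the prompt, read literally, is the non-paradoxical variant.
```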

It's also not clear how a language model could ever gradually learn an "understanding"-based representation. When autocompleting advanced mathematics, which is the product of many man-centuries of consecutive thought, any sort of limited step-by-step representation that fits inside an LLM would be inferior to memorization.

(Of course, OpenAI could easily fix Russell's non-paradox by extending the training dataset with the non-paradox, but then other variants would lead to still more bizarre, more illogical answers; improvements at fake logic can only damage its performance as a knowledge machine.)
 
Last edited:
  • Like
Reactions: VividVerism

demultiplexer

Ars Praefectus
3,259
Subscriptor
There are countless ways to mathematically get to the impossibility of fitting any kind of consistent solution model in an LLM. 200 billion parameters suddenly is very, very little when having to contend with the breadth of human knowledge and understanding.

There are really cool things you can do with this tech in narrow applications, but the oxygen in the room is being taken up by essentially DOA systems. Like, it feels more and more to me like the end of the LLM era will be marked by lots of money having been spent on slightly better chatbots, and then no money being left for actually useful scientific applications. A bit like neural networks have been massively underfunded for... ever. They're still underfunded for applications that have proven value. The bulk of practical image recognition and classification work was done by academics and volunteers.
 

Dmytry

Ars Legatus Legionis
10,279
There are countless ways to mathematically get to the impossibility of fitting any kind of consistent solution model in an LLM. 200 billion parameters suddenly is very, very little when having to contend with the breadth of human knowledge and understanding.

There are really cool things you can do with this tech in narrow applications, but the oxygen in the room is being taken up by essentially DOA systems. Like, it feels more and more to me like the end of the LLM era will be marked by lots of money having been spent on slightly better chatbots, and then no money being left for actually useful scientific applications. A bit like neural networks have been massively underfunded for... ever. They're still underfunded for applications that have proven value. The bulk of practical image recognition and classification work was done by academics and volunteers.
I would think Amazon has sunk a lot of money into trying to improve image recognition for their "just walk out" checkout.

I think the issue is that you can't just make a fundamental breakthrough as a simple matter of N software engineers working for M months.

It doesn't fit into timelines, it's fundamental research.

On top of that, a lot of the improvement has been a result of finding ways to turn negative returns into very rapidly diminishing returns. Skip connections, residual networks, layer normalization, dropout: all those things make it possible to stack dramatically more layers without getting a whole lot of benefit from them, as opposed to the pre-2010s situation where adding more layers made everything work worse.
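For concreteness, here is roughly what that bag of tricks looks like in code: a minimal pre-norm residual block with layer normalization and dropout in PyTorch. This is a generic sketch of the pattern, not any particular published architecture.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """A generic pre-norm residual block: LayerNorm -> MLP -> Dropout, wrapped
    in a skip connection. The skip path means a deep stack of these degrades
    toward the identity function instead of actively hurting as depth grows."""

    def __init__(self, dim: int, hidden: int, p_drop: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )
        self.drop = nn.Dropout(p_drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x + f(norm(x)): if f learns nothing useful, the block is roughly the identity.
        return x + self.drop(self.mlp(self.norm(x)))

# Stacking many of these is cheap to write and, thanks to the skip connections,
# no longer catastrophic to train; it just stops adding much.
model = nn.Sequential(*[ResidualBlock(dim=256, hidden=1024) for _ in range(24)])
out = model(torch.randn(8, 256))
```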

edit: Also, I think literally all of the real advancement is far more fundamental, below the level that laymen or management are ever aware of. Architectures like the transformer come and go; the things on the list above are here to stay and will likely still be around in 20 years.
 

Dmytry

Ars Legatus Legionis
10,279
Also, with regard to generative AI, there's a pretty simple argument for why, when it's done using solely scraped images, it is plagiarism.

If you download 1 image off the internet and pass it off as your own, that's plagiarism. If you download N images and select one at random, ditto.

What do most AI image generators do? They use N images to reconstruct the underlying probability distribution of images, and then sample that probabilistic model instead of choosing an image. A very large N and a bunch of math are used to approximate the limit N -> infinity.

One could of course argue that "but humans do something like that". But the imagery which informs the human visual cortex is largely original "video" captured by our eyes; even if, hypothetically, a human were capable of building a statistical model and sampling it, the sample would be almost entirely original.
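A toy version of that argument in code, assuming scikit-learn and using small random feature vectors as stand-ins for images: fit a density to the N scraped works, then sample the density instead of picking one work.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Stand-ins for N scraped images, flattened to feature vectors.
N, D = 10_000, 64
scraped = rng.normal(size=(N, D))

# Option 1: pick one work at random (plain copying).
picked = scraped[rng.integers(N)]

# Option 2: fit a distribution to all N works, then sample it.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(scraped)
generated = kde.sample(1, random_state=0)

# Both outputs are purely mechanical functions of the scraped works (plus noise);
# the argument above is that option 2 deserves no special status over option 1.
```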
 
Last edited:

karolus

Ars Tribunus Angusticlavius
6,685
Subscriptor++
I would think Amazon has sunk a lot of money into trying to improve image recognition for their "just walk out" checkout.

I think the issue is that you can't just make a fundamental breakthrough as a simple matter of N software engineers working for M months.

It doesn't fit into timelines, it's fundamental research.

On top of it a lot of improvement has been a result of finding ways to turn negative returns into very rapidly diminishing returns. Skip connections, residual networks, layer normalization, dropout, all those things make it possible to stack dramatically more layers without getting a whole lot of benefit from those layers - as opposed to pre-2010s situation where adding more layers made everything work worse.

edit: Also I think literally all of the real advancement is way more fundamental below the level that laymen or management are ever aware about. Things like transformer architecture come and go, things like the above list, they are here to stay and will likely be around in 20 years.
Is AI the best tool for this particular job? Wouldn't improving RFID or similar technologies work better for inventory tracking and billing?
 

wco81

Ars Legatus Legionis
28,661
Also with regards to generative AI there's a pretty simple argument why, when done using solely scraped images, it is plagiarism.

If you download 1 image off the internet and pass it off as your own, that's plagiarism. If you download N images and select one at random, ditto.

What do most AI image generators do? They use N images to reconstruct the underlying probability distribution of images. This probabilistic model is sampled instead of choosing an image. A very large N and a bunch of math are used to approximate the limit of N->infinity .

One could of course argue that "but humans do something like that". But the imagery which informs human visual cortex, is largely original "video" captured by our eyes; even if hypothetically a human was capable of building a statistical model and sampling it, the sample would be almost entirely original.
There was a discussion of Image Playgrounds that Apple showed. They purposely restricted the kind of images it could generate, to avoid issues like deep fakes or copyright infringement. Plus they're not trying to compete with Midjourneys.

However, it was noted that Apple could simply buy a photo clearinghouse such as Getty Images, which is worth less than $2 billion, and then use those photos for image generation.

It may not be prohibitive for all these other AI ventures to buy up photo clearinghouses either.
 

Dmytry

Ars Legatus Legionis
10,279
Is AI the best tool for this particular job? Wouldn't improving RFID or similar technologies work better for inventory tracking and billing?
It may be a question of which salesmen are better at lying with statistics. ML salesmen are really good.

Take those high-profile rollbacks of, e.g., the McDonald's AI checkout. How does something like this happen? Do they just not compare AI performance to human performance?

I think they do, but it's done using every trick in the book: have humans and AI transcribe recordings without giving the human the opportunity to ask "what was that?", penalize humans for spelling errors and for things like transcribing "sprite and coke" as "coke and sprite", and so on and so forth.

And then there's the cardinal sin of training on the test dataset. I don't think it's quite illegal to do that, and if it's not illegal, then you can bet that whenever there's any money involved, it's done.
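For reference, keeping that split honest is trivial when you actually want to: for example, a deterministic, ID-based split so a test item can never drift into the training set between runs. A generic sketch, not anyone's actual evaluation pipeline; the example records are made up.

```python
import hashlib

def is_test(example_id: str, test_fraction: float = 0.1) -> bool:
    """Deterministically assign an example to the test set based on its ID, so the
    split cannot change between runs and test items cannot leak into training."""
    h = int(hashlib.sha256(example_id.encode()).hexdigest(), 16)
    return (h % 10_000) < int(test_fraction * 10_000)

# Hypothetical drive-thru transcription examples.
examples = [{"id": f"order_{i}", "audio": f"order_{i}.wav"} for i in range(1_000)]
train = [e for e in examples if not is_test(e["id"])]
test = [e for e in examples if is_test(e["id"])]
print(len(train), len(test))  # roughly 900 / 100, and stable across runs
```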

So I think what happened at Amazon is that machine learning folks upsold the tech. It's not like the tech doesn't work at all, it does, and a lot of ML people sincerely believe they can get very good performance in a year (which they keep sincerely believing indefinitely).

edit: the other thing is that the compute power available at the checkout merely grows with Moore's law, which is pretty slow, while the compute power thrown at best-in-class models has been growing at a massively accelerated rate due to accelerated $ investment. So all the management expects rapid progress, and none of the "experts" who do know better disabuse them.

This also goes for things like "full self driving" using cameras only.

edit: also, an anecdote from work, literally today. I can't even fucking tell by eyeballing it whether a model is just naturally shit, or whether RGB got mixed up with BGR, or whether the input pixel values are incorrectly scaled. (I work on software that has to support a lot of other people's machine-vision-related models.)

Models generally perform like absolute shit when tested not on their usual test dataset but on some random obscure stock videos that we probably actually bought and that we use for functional tests (complete with ordinary lighting, motion blur, video compression artifacts, etc.).
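For anyone who hasn't hit this: the two failure modes above (channel order and input scaling) are trivial to get wrong and produce the same symptom, a model that is just mysteriously bad. A sanity-check sketch, assuming OpenCV and a model that expects RGB floats in [0, 1]; the file name is hypothetical.

```python
import cv2
import numpy as np

def preprocess(path: str) -> np.ndarray:
    """Load an image the way a typical RGB / [0, 1] model expects it.
    Getting either step wrong rarely crashes; it just quietly degrades accuracy."""
    img = cv2.imread(path)                      # OpenCV loads as BGR, uint8
    if img is None:
        raise FileNotFoundError(path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # BGR -> RGB (easy to forget)
    return img.astype(np.float32) / 255.0       # scale to [0, 1], not [0, 255]

# Cheap eyeball check: if max() is ~255, scaling is off; if skin tones come out
# blue when you plot the array, channel order is off.
x = preprocess("frame_0001.png")
print(x.shape, x.dtype, x.min(), x.max())
```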
 
Last edited:
  • Like
Reactions: parejkoj

Ajar

Ars Tribunus Angusticlavius
8,904
Subscriptor++
Perplexity AI raised $63 million at a $1B valuation. They claim their product combines web search with LLMs to provide a better search experience. "It's almost like Wikipedia and ChatGPT had a kid," said their CEO. Basically, a conversational interface to a chatbot that has near-real-time access to web content to provide up-to-date, factual answers, complete with citations to send traffic to information sources.

The reality is a little different:

 

Ajar

Ars Tribunus Angusticlavius
8,904
Subscriptor++
A little more on Perplexity:


It's a good example to highlight because it's a peek into how so many of these AI startups are totally ignoring conventions, regulations, and (in some cases) laws - because if they hit it big, it won't matter, and if they fail and go out of business, it also won't matter. And even with the "move fast and break the internet" ethos, there's still nothing behind the curtain - just some "prompt engineering" and existing off the shelf LLMs.
 

Pont

Ars Legatus Legionis
25,788
Subscriptor
And nobody has yet solved the "2.0" problem.

How do you train an AI on public internet data, now that the internet is flooded with AI-generated data?

It'll get interesting once data poisoning becomes standard practice, too. I anticipate a loop where AI-generated images use data poisoning to make it even harder for competitors to train on them.
 
  • Like
Reactions: Yagisama
And nobody has yet solved the "2.0" problem.

How do you train an AI on public internet data, now that the internet is flooded with AI-generated data?

It'll get interesting once data poisoning becomes standard practice, too. I anticipate a loop where AI-generated images use data poisoning to make it even harder for competitors to train on them.
That’s an already solved problem. The marketing is that they’re on their way to AGI, but what they’re selling is mostly pointless garbage that mostly produces more garbage.

The “2.0” solution is already baked into what they’re selling, not what they’re saying.

They’re already selling garbage accelerant. In effect, there is no actual “2.0” problem, and there’s thus no actual need for a “2.0” solution.
 
With the logical conclusion that all of this, the AI industry now valued at roughly $6 trillion, has no plan for the future and will collapse?

Because that's the only conclusion I see. If training on all written work on the internet doesn't yield broadly useful tools, and if your valuation assumes future revenues in the thousands of dollars per human on Earth with no plan for how to monetize that... what's going to happen?
 
  • Like
Reactions: BurntToShreds
Also with regards to generative AI there's a pretty simple argument why, when done using solely scraped images, it is plagiarism.

If you download 1 image off the internet and pass it off as your own, that's plagiarism. If you download N images and select one at random, ditto.

What do most AI image generators do? They use N images to reconstruct the underlying probability distribution of images. This probabilistic model is sampled instead of choosing an image. A very large N and a bunch of math are used to approximate the limit of N->infinity .

One could of course argue that "but humans do something like that". But the imagery which informs human visual cortex, is largely original "video" captured by our eyes; even if hypothetically a human was capable of building a statistical model and sampling it, the sample would be almost entirely original.
Would you say it is plagiarism if the AI was instead trained on the output of a camera pointing at a screen?

What if they used Blender (or some other 3D modeling program) to construct something silly like a 2D room with an image floating in it, and the model is trained on the output of a virtual camera pointing at that image (with some random variations in position, angle, and lighting)?

My gut says analogies and existing law just don’t work very well in this situation. I don’t think the generative models are actually doing creativity, they are just very complicated tools. But the person using a tool usually is considered to be the one injecting the human intent and creativity.

Arguments about whether or not the AI is doing something like a human seem like a red herring to me. Maybe the tool required breaking copyright to make, so it just shouldn’t be allowed to exist. Or maybe we expected some level of effort on the part of the human to qualify for copyright protection, but in that case we should modify our laws to make this intent clear.

Although we’ll have to be careful to write these laws, if we lean too hard on “just recycling old works doesn’t qualify for protection ” we might accidentally put the Star Wars sequel trilogy and some Marvel movies in the public domain…
 

Soriak

Ars Legatus Legionis
11,745
Subscriptor
There’s a ton of private data that isn’t actively being used for training, as far as we know. Scientific journals, for example. Company internal data is starting to get used, but nobody is handing that over to OpenAI. So that’s going to drive demand for chips to train proprietary models.

My university just spent over $10m on NVIDIA chips for internal use. A few research innovations have led to billion dollar companies, and the university pockets some of that. Is there a chance that mining all our internal data could spark some more patents? I would be shocked if not.
 
Would you say it is plagiarism if the AI was instead trained on the output of a camera pointing at a screen?

What if they used a Blender (or some other 3D modeling program) to construct something silly like a 2D room with an image floating in it, and the model is trained on the output of a virtual camera pointing at that image (with some random variations in position, angle, lighting)?

My gut says analogies and existing law just don’t work very well in this situation. I don’t think the generative models are actually doing creativity, they are just very complicated tools. But the person using a tool usually is considered to be the one injecting the human intent and creativity.

Arguments about whether or not the AI is doing something like a human seem like a red herring to me. Maybe the tool required breaking copyright to make, so it just shouldn’t be allowed to exist. Or maybe we expected some level of effort on the part of the human to qualify for copyright protection, but in that case we should modify our laws to make this intent clear.

Although we’ll have to be careful to write these laws, if we lean too hard on “just recycling old works doesn’t qualify for protection ” we might accidentally put the Star Wars sequel trilogy and some Marvel movies in the public domain…
Well, my point is that sampling a statistical distribution derived by a purely mechanical process from a bunch of copyrighted works should be treated identically to the variation where you just directly sample from said copyrighted works at random by picking one.

There's basically no societal benefit to "deriving a statistical distribution from a bunch of original works and then sampling it", even less than to just sampling one original work at random.

The weird thing that's happening with AI is that, like corporations, it got privileges that humans do not have. For example, had OpenAI run a sweatshop where they provided each worker with training material consisting of pirated books and (for Sora and their voice thing) movies, they would be in an enormous amount of trouble.
 
Last edited:

Megalodon

Ars Legatus Legionis
34,201
Subscriptor++
There’s a ton of private data that isn’t actively being used for training, as far as we know. Scientific journals, for example.

Where in the world did you get the idea science journals aren't publishing LLM generated text?

Scientific American: AI Chatbots Have Thoroughly Infiltrated Scientific Publishing

The answer to every confident assertion about places where large amounts of reliably human-generated prose can still be gleaned is: Wrong.

And even then, it is extremely common for English-language journals to carry a lot of prose written by people whose English is, frankly, not all that good. I don't have a problem with that in itself; there are plenty of extraordinary researchers who happen to come from other linguistic backgrounds, and as long as the text is legible and clear in its meaning, there's no reason to make that a barrier to publishing. But it's not helpful for training a model to generate prose that feels fluent to a fluent speaker.
 
Where in the world did you get the idea science journals aren't publishing LLM generated text?

Scientific American: AI Chatbots Have Thoroughly Infiltrated Scientific Publishing
That's why you can't use them for training...

In all seriousness though, scientific journals are almost certainly already used for training, given that everyone in the procedurally generated text scene just downloads books off Bibliotik and the like.

edit: sidenote, I think LLMs are basically procedural texturing, but for text. Their use in any sort of factual tool is about as limited as the use of procedural texturing in a maps app.
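To make the "procedural texturing for text" analogy concrete, here is the smallest possible version of the idea: a word-level Markov chain that produces locally plausible, fact-free text from whatever corpus you feed it. An LLM is this on an astronomically larger scale with a far better conditioning mechanism, but the analogy is the point; the corpus file name is hypothetical.

```python
import random
from collections import defaultdict

def build_chain(text: str, order: int = 2) -> dict:
    """Map each tuple of `order` consecutive words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain: dict, length: int = 40, seed: int = 0) -> str:
    rng = random.Random(seed)
    state = rng.choice(list(chain))
    out = list(state)
    for _ in range(length):
        out.append(rng.choice(chain.get(state, ["."])))
        state = tuple(out[-len(state):])
    return " ".join(out)

corpus = open("corpus.txt", encoding="utf-8").read()  # hypothetical training text
print(generate(build_chain(corpus)))
# Output is locally fluent and globally meaningless: texture, not knowledge.
```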
 
  • Like
Reactions: VividVerism
FWIW, I think assailing these things on the basis of plagiarism and copyright infringement is among the weakest of available indictments.

Why? Because it is perhaps the only way in which they’re actually kinda sorta like humans. It’s an entirely easy to sell and understand narrative of LLMs just having a much bigger corpus of answers to, “Who are your influences?” Watch practically any interview or look at any Wikipedia page for any artist in any domain and somewhere you will find mention of who or what they say their influences are. That influencing was them consuming lots of content, mashing it up with other things, and then crapping out derivative content of their own with varying degrees of derivation. Sometimes the distance between the influence and the derivation is too small, and you get into plagiarism.

This inevitably lands the LLM crap in the territory of policing the outputs for plagiarism the same way one does for human contributions, a problem which is already intractable at human scale and impossible at machine scale.

edit: And! It’s probably the least idiotic, problematic thing about them.
 
That's why you can't use them for training...

In all seriousness though, scientific journals are almost certainly used for training, given that everyone in the procedurally generated text scene just downloads books off bibliotik and such.

edit: sidenote, I think LLMs are basically procedural texturing but for text. Their use in any sort of factual tool is about as limited as use of procedural texturing in a maps app.
You work in the field, right? Do you have any accessible explanation, or know of one, for why LLM based text can’t be used to train new LLMs?

In particular I think there’s a sort of gut-feel intuition that the models require some additional source of creativity or cognition or deeper understanding to make progress. Or, like, the models can only summarize ideas that are already in their training set somehow, which means they can’t add to the training set.

First, this makes me suspicious because… I don’t know, it has the feeling of a mathematical or technical argument but I hand-waved quite a bit while making it. Essentially it has the smell of something that we want to be true, and also seems a bit clever, so I expect it to not be true, haha.

Second, even if we do need some human spark to create new ideas, it isn’t obvious to me that LLM based journal papers would lack that spark. If someone writes a rough draft and has the LLM touch it up, it is still summarizing the human’s ideas.

Even if an LLM is given a really broad prompt and generates a couple of papers using… it seems like the human is still injecting some sort of human creativity by picking which paper to share. Or humanity as a whole does: the LLM could release a handful of papers, most of which are garbage; which ones make it through review and get cited should produce some signal as to what a "good" idea is, right?

Anyway, that’s all very hand-wavy of course (which is why I’m asking, rather than arguing—I fully expect my line of thought to be annihilated because I’ve seen lots of people worrying about this).
 

Shavano

Ars Legatus Legionis
59,253
Subscriptor
With the logical conclusion that all of this, the now ~6 trillion dollar valued AI industry, has no plan for the future and will collapse?

Because that's the only conclusion I see. If training on all written work on the internet doesn't yield a broadly useful tools, if your valuation assumes future revenues in the thousands of dollars per human on earth with no plan for how to monetize that... what's going to happen?
I think it won't totally collapse. There will still be useful models trained on data that has to be generated and selected by humans. But approaching AI this way? No, that's a dead end. Humans don't have to ingest terabytes of text to be able to generate text with intelligible thoughts, so it's clearly possible to do something better, more efficient, and hopefully more resilient than the garbage repeaters we're getting inundated with now.
 
You work in the field, right? Do you have any accessible explanation, or know of one, for why LLM based text can’t be used to train new LLMs?

In particular I think there’s a sort of gut-feel intuition that the models require some additional source of creativity or cognition or deeper understanding to make progress. Or, like, the models can only summarize ideas that are already in their training set somehow, which means they can’t add to the training set.

How much do you know about how an LLM works internally, or even most modern what-we-all-call-AI? It might be totally obvious or quite weird what I'll say next depending on your prior knowledge.

All current AIs are fundamentally nonsynthetic, meaning they are not able to generate any new information beyond their model and training set. There was a lot of excitement at the very start of the LLM and transformer era that they might in fact be synthetic, but aside from that being mathematically impossible due to the way they work internally, all recent research indicates they're not. Modern AIs like diffusion models for generating images may produce something that, as a whole, does not yet exist, but under the hood it's all just a mosaic of compressed training data. It is genuinely true that modern AI algorithms are a kind of supercharged predictive text, using what amounts to an asymptotically optimized internal dataset.

This axiomatically means you can't use the output of LLMs to generate new data for use in training. All it's spitting out is a kind of lossily compressed version of its training data, and, like with JPEGs, you don't make it better by compressing it twice.
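The standard toy demonstration of that "compressing twice" point, stripped down to one dimension (roughly the setup the model-collapse discussion uses; NumPy only): fit a Gaussian to data, sample the fit, fit a new Gaussian to those samples, and repeat. Each generation can only re-encode, with sampling error, what the previous one had; the tails go first and nothing new is ever added.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=100)

for gen in range(1, 21):
    mu, sigma = data.mean(), data.std()     # "train" a tiny model on the current data
    data = rng.normal(mu, sigma, size=100)  # the next generation sees only model output
    print(f"gen {gen:2d}: mean={mu:+.3f}  std={sigma:.3f}")

# Over generations the mean drifts in a random walk and the spread tends to decay;
# no amount of re-fitting recovers information that the sampling step threw away.
```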

Second, even if we do need some human spark to create new ideas, it isn’t obvious to me that LLM based journal papers would lack that spark. If someone writes a rough draft and has the LLM touch it up, it is still summarizing the human’s ideas.
An LLM as a writing tool is potentially interesting, but for information-dense things like research papers it is nearly useless. They almost always pertain to some kind of niche subject with novel or at least unseen-by-LLM-training-data ideas.

Language in general is a tool of human thought. By taking away the human, you kind of lose the whole function of language. LLMs should be used as a writing tool for stuff that's been done 1000 times.
 
  • Like
Reactions: slowtech

Ajar

Ars Tribunus Angusticlavius
8,904
Subscriptor++
As far as that goes, I found the original model collapse paper quite persuasive, but there's a new paper out that modifies the procedure from the first paper and claims that it's possible to avoid model collapse - that there is a provable upper bound on deterioration as GenAI outputs are increasingly added to training datasets that predate GenAI.

That doesn't dispute what others are saying here - these models are producing remixes and mashups, not covers - never mind wholly original material.

*

What the boosters claim is going to get them to something that can actually generalize is combining large training sets with huge models and reinforcement learning between model agents to reward learning relationships and principles - i.e. having one agent pose various logical problems to another so that the second one eventually starts getting them right. If the boosters are right and it's actually possible to reach something indistinguishable from the output of skilled human reasoning, I expect that's when they'll claim AGI.

Personally, I question whether that's possible, at least purely through a text / token based interface. I read a long piece by a former OpenAI engineer on this topic and there were some pretty striking omissions in how they seem to be thinking about these things. I'll dig up the link and share when I get a chance.
 

demultiplexer

Ars Praefectus
3,259
Subscriptor
What the boosters claim is going to get them to something that can actually generalize is combining large training sets with huge models and reinforcement learning between model agents to reward learning relationships and principles - i.e. having one agent pose various logical problems to another so that the second one eventually starts getting them right. If the boosters are right and it's actually possible to reach something indistinguishable from the output of skilled human human reasoning, I expect that's when they'll claim AGI.

Personally, I question whether that's possible, at least purely through a text / token based interface. I read a long piece by a former OpenAI engineer on this topic and there were some pretty striking omissions in how they seem to be thinking about these things. I'll dig up the link and share when I get a chance.
I really feel like at that point we're again entering this split reality between academics/researchers and laypeople/businesspeople, where it's been really well-understood that you can't get to synthesizing analysis with neural networks (or any variation/training model). But people with money can be convinced it's somehow possible (because see how good this chatbot is!) and thus the world chases something that is provably impossible.

Also, people claiming AGI have a much higher mountain to climb than just learning how to correctly interpret logical problems; that's something Wolfram Alpha already does without any need for modern AI models. They would need to show that the system can synthesize new information, and prove that this doesn't trivially follow from its parameters. That by itself is one of those NP-hard problems!

I'm now extra annoyed at all of this because it's happened before in my own fields (namely hydrogen and nuclear). I've even tried to publish on those topics to help mitigate even more waste in those areas, but it's taken ~20 years for the world to finally come around to conclusions that were widespread and very, very well understood among experts before the turn of the century. All that money and talent could have gone to solar, wind, and batteries, all areas that were very widely understood to be game changers in the energy transition. Is the same thing now happening in AI?
 
  • Like
Reactions: BurntToShreds

Ajar

Ars Tribunus Angusticlavius
8,904
Subscriptor++
I think the thesis is that self-play / reinforcement-learning approaches with enormous models will put so much data into the training set that would otherwise have been out of sample that, eventually, humans won't be able to tell the difference between generated output that's in sample and the model actually generalizing.

A commonly cited example is AlphaGo making moves human players hadn't considered. But that's in a prescribed domain where it's easy to generate valid samples.
 

ramases

Ars Tribunus Angusticlavius
7,569
Subscriptor++
I think the the thesis is that self-play / reinforcement learning approaches with enormous models will put so much data into the training set that would otherwise have been out of sample that, eventually, humans won't be able to tell the difference between generated output that's in sample and the model actually generalizing.

A commonly cited example is AlphaGo making moves human players hadn't considered. But that's in a prescribed domain where it's easy to generate valid samples.

Go is both a closed world (the number of possible future states may not fit into memory, but it is finite) and a domain with a closed-world assumption (all the rules of Go are perfectly known a priori, and a move that was once recognized as admissible will never become inadmissible through the introduction of new knowledge). As a consequence, for any given state vector, AlphaGo has hypothetically perfect knowledge of all possible and admissible futures.

None of this is true for reality.