Most likely using invisible text...

edit: I predict that SEO will get weird, like pages written to try to get the AI to recommend something, if the page is ingested via RAG.
- The village barber shaves himself and every other man in the village who doesn't shave himself. Does he shave himself?
- The scenario you've described is a classic example of a logical paradox known as the "Barber Paradox," which was formulated by the British mathematician and logician Bertrand Russell in 1918.
- The paradox goes as follows: In a village, there is a barber who shaves all those, and only those, men in the village who do not shave themselves. The question then arises: Does the barber shave himself?
- If we assume that the barber does shave himself, according to the definition, he should not shave himself (since he only shaves those who do not shave themselves). On the other hand, if we assume that he does not shave himself, then according to the definition, he should shave himself (since he shaves all those who do not shave themselves).
- Both assumptions lead to a contradiction, which means that the initial premise of the barber's shaving rule is logically inconsistent. There cannot exist such a barber within the constraints of the scenario as described, because it violates the rules of set theory and creates a paradox.
- The Barber Paradox is a self-referential paradox related to the "set of all sets that do not contain themselves" and is used to illustrate the difficulties in attempting to construct a set of all sets that do not contain themselves. It's an example of the type of logical inconsistencies that Russell and Alfred North Whitehead were trying to resolve with their type theory in the Principia Mathematica.
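As an aside, the contradiction in the classic formulation can be checked mechanically. A minimal Python sketch (my own, purely for illustration) that enumerates the only two cases the answer above walks through:

```python
# Brute-force check of the classic rule: the barber shaves exactly those men
# who do not shave themselves. Test both possible answers for the barber.
for barber_shaves_himself in (True, False):
    # What the rule requires of the barber, given that answer:
    rule_requires = not barber_shaves_himself
    verdict = "consistent" if rule_requires == barber_shaves_himself else "contradiction"
    print(f"barber shaves himself = {barber_shaves_himself}: {verdict}")
# Both cases print "contradiction": no assignment satisfies the classic rule.
```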
I would think Amazon has sunk a lot of money into trying to improve image recognition for their "just walk out" checkout.

There are countless ways to mathematically get to the impossibility of fitting any kind of consistent solution model in an LLM. 200 billion parameters suddenly is very, very little when having to contend with the breadth of human knowledge and understanding.
There are really cool things you can do with this tech in narrow applications, but the oxygen in the room is being taken up by essentially DOA systems. Like, it feels more and more to me like the end of the LLM era will be marked by lots of money having been spent on slightly better chatbots, and then no money being left for actually useful scientific applications. A bit like neural networks have been massively underfunded for... ever. They're still underfunded for applications that have proven value. The bulk of practical image recognition and classification work was done by academics and volunteers.
Is AI the best tool for this particular job? Wouldn't improving RFID or similar technologies work better for inventory tracking and billing?

I would think Amazon has sunk a lot of money into trying to improve image recognition for their "just walk out" checkout.
I think the issue is that you can't just make a fundamental breakthrough as a simple matter of N software engineers working for M months.
It doesn't fit into timelines; it's fundamental research.
On top of that, a lot of the improvement has been the result of finding ways to turn negative returns into very rapidly diminishing returns. Skip connections, residual networks, layer normalization, dropout: all of those make it possible to stack dramatically more layers without getting a whole lot of benefit from those layers, as opposed to the pre-2010s situation where adding more layers made everything work worse.
edit: Also, I think literally all of the real advancement happens at a much more fundamental level, below anything laymen or management are ever aware of. Things like the transformer architecture come and go; things like the list above are here to stay and will likely still be around in 20 years.
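For what it's worth, here is a minimal sketch of what that list looks like in practice (PyTorch-style, with dimensions and block structure invented purely for illustration): a block that normalizes, transforms, applies dropout, and adds the result back onto its input, so a layer that learns nothing degrades to roughly the identity instead of actively hurting.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One stackable block: LayerNorm + Linear + Dropout + skip connection."""
    def __init__(self, dim: int, p_drop: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)   # keeps activations well-scaled as depth grows
        self.fc = nn.Linear(dim, dim)
        self.drop = nn.Dropout(p_drop)  # regularization so extra layers don't just memorize

    def forward(self, x):
        # The skip connection means a layer that learns nothing behaves like the
        # identity, so extra depth tends toward "little benefit" rather than "worse".
        return x + self.drop(torch.relu(self.fc(self.norm(x))))

# Stacking many such blocks stays trainable in a way plain stacked Linear+ReLU often doesn't.
deep_net = nn.Sequential(*[ResidualBlock(256) for _ in range(48)])
out = deep_net(torch.randn(8, 256))
```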
There was a discussion of Image Playgrounds that Apple showed. They purposely restricted the kind of images it could generate, to avoid issues like deep fakes or copyright infringement. Plus they're not trying to compete with Midjourneys.

Also with regards to generative AI there's a pretty simple argument why, when done using solely scraped images, it is plagiarism.
If you download 1 image off the internet and pass it off as your own, that's plagiarism. If you download N images and select one at random, ditto.
What do most AI image generators do? They use N images to reconstruct the underlying probability distribution of images. This probabilistic model is sampled instead of choosing an image. A very large N and a bunch of math are used to approximate the limit as N -> infinity.
One could of course argue that "but humans do something like that". But the imagery which informs the human visual cortex is largely original "video" captured by our eyes; even if hypothetically a human was capable of building a statistical model and sampling it, the sample would be almost entirely original.
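A toy, one-dimensional version of that argument (my own construction; a Gaussian KDE stands in for the "bunch of math", and random numbers stand in for images): picking one of the N scraped works at random and sampling a density fitted to those N works converge on the same distribution as N grows.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
scraped = rng.normal(loc=2.0, scale=1.0, size=5000)   # stand-in for N scraped works

# Option 1: pick one of the N works at random and pass it off as your own.
picked = rng.choice(scraped, size=5000)

# Option 2: fit a probability model to the N works, then sample the model instead.
model = gaussian_kde(scraped)                          # the "bunch of math"
generated = model.resample(5000)[0]

# For large N the two procedures describe nearly the same distribution.
for name, xs in (("picked", picked), ("generated", generated)):
    print(f"{name:9s} mean={xs.mean():+.2f} std={xs.std():.2f}")
```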
It may be a question of which salesmen are better at lying with statistics. ML salesmen are really good.

Is AI the best tool for this particular job? Wouldn't improving RFID or similar technologies work better for inventory tracking and billing?
That’s an already solved problem. The marketing is that they’re on their way to AGI, but what they’re selling is mostly pointless garbage that mostly produces more garbage.

And nobody has yet solved the "2.0" problem.
How do you train an AI on public internet data, now that the internet is flooded with AI-generated data?
It'll get interesting once data poisoning becomes standard practice, too. I anticipate a loop where AI-generated images use data poisoning to make it even harder for competitors to train on them.
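A purely illustrative sketch of why that loop worries people (a Gaussian fit stands in for a model; this is nothing like real LLM training): refit a model to its own samples a few times and the estimate drifts, compounding each generation's error.

```python
import numpy as np

rng = np.random.default_rng(0)
human_data = rng.normal(loc=0.0, scale=1.0, size=500)   # stand-in for human-written data

mu, sigma = human_data.mean(), human_data.std()
for generation in range(1, 11):
    # Each generation "trains" only on the previous generation's output,
    # standing in for an internet that is mostly model-generated by now.
    synthetic = rng.normal(mu, sigma, size=500)
    mu, sigma = synthetic.mean(), synthetic.std()
    print(f"generation {generation:2d}: mu={mu:+.3f} sigma={sigma:.3f}")
# The fitted parameters random-walk away from the originals: each generation
# compounds the previous generation's estimation error instead of correcting it.
```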
Would you say it is plagiarism if the AI was instead trained on the output of a camera pointing at a screen?

Also with regards to generative AI there's a pretty simple argument why, when done using solely scraped images, it is plagiarism.
If you download 1 image off the internet and pass it off as your own, that's plagiarism. If you download N images and select one at random, ditto.
What do most AI image generators do? They use N images to reconstruct the underlying probability distribution of images. This probabilistic model is sampled instead of choosing an image. A very large N and a bunch of math are used to approximate the limit as N -> infinity.
One could of course argue that "but humans do something like that". But the imagery which informs the human visual cortex is largely original "video" captured by our eyes; even if hypothetically a human was capable of building a statistical model and sampling it, the sample would be almost entirely original.
Well, my point is that sampling a statistical distribution derived by a purely mechanical process from a bunch of copyrighted works should be treated identically to a variation where you just directly sample from said copyrighted works at random by picking one.

Would you say it is plagiarism if the AI was instead trained on the output of a camera pointing at a screen?
What if they used Blender (or some other 3D modeling program) to construct something silly like a 2D room with an image floating in it, and the model is trained on the output of a virtual camera pointing at that image (with some random variations in position, angle, lighting)?
My gut says analogies and existing law just don’t work very well in this situation. I don’t think the generative models are actually doing creativity, they are just very complicated tools. But the person using a tool usually is considered to be the one injecting the human intent and creativity.
Arguments about whether or not the AI is doing something like a human seem like a red herring to me. Maybe the tool required breaking copyright to make, so it just shouldn’t be allowed to exist. Or maybe we expected some level of effort on the part of the human to qualify for copyright protection, but in that case we should modify our laws to make this intent clear.
We’ll have to be careful writing these laws, though: if we lean too hard on “just recycling old works doesn’t qualify for protection”, we might accidentally put the Star Wars sequel trilogy and some Marvel movies in the public domain…
There’s a ton of private data that isn’t actively being used for training, as far as we know. Scientific journals, for example.
That's why you can't use them for training...

Where in the world did you get the idea science journals aren't publishing LLM generated text?
Scientific American: AI Chatbots Have Thoroughly Infiltrated Scientific Publishing
You work in the field, right? Do you have any accessible explanation, or know of one, for why LLM based text can’t be used to train new LLMs?

That's why you can't use them for training...
In all seriousness though, scientific journals are almost certainly used for training, given that everyone in the procedurally generated text scene just downloads books off bibliotik and such.
edit: sidenote, I think LLMs are basically procedural texturing but for text. Their use in any sort of factual tool is about as limited as use of procedural texturing in a maps app.
I think it won't totally collapse. There will still be useful models trained on data that has to be generated and selected by humans. But approaching AI this way? No, that's a dead end. Humans don't have to ingest terabytes of text to be able to generate text with intelligible thoughts, so it's clearly possible to do something better, more efficient, and hopefully more resilient than the garbage repeaters we're getting inundated with now.

With the logical conclusion that all of this, the now ~6 trillion dollar valued AI industry, has no plan for the future and will collapse?
Because that's the only conclusion I see. If training on all written work on the internet doesn't yield broadly useful tools, if your valuation assumes future revenues in the thousands of dollars per human on earth with no plan for how to monetize that... what's going to happen?
You work in the field, right? Do you have any accessible explanation, or know of one, for why LLM based text can’t be used to train new LLMs?
In particular I think there’s a sort of gut-feel intuition that the models require some additional source of creativity or cognition or deeper understanding to make progress. Or, like, the models can only summarize ideas that are already in their training set somehow, which means they can’t add to the training set.
An LLM as a writing tool is potentially interesting, but for information-dense things like research papers it is nearly useless. They almost always pertain to some kind of niche subject with novel or at least unseen-by-LLM-training-data ideas.

Second, even if we do need some human spark to create new ideas, it isn’t obvious to me that LLM based journal papers would lack that spark. If someone writes a rough draft and has the LLM touch it up, it is still summarizing the human’s ideas.
I really feel like at that point we're again entering this split reality between academics/researchers and laypeople/businesspeople, where it's been really well understood that you can't get to synthesizing analysis with neural networks (or any variation/training model). But people with money can be convinced it's somehow possible (because see how good this chatbot is!) and thus the world chases something that is provably impossible.

What the boosters claim is going to get them to something that can actually generalize is combining large training sets with huge models and reinforcement learning between model agents to reward learning relationships and principles - i.e. having one agent pose various logical problems to another so that the second one eventually starts getting them right. If the boosters are right and it's actually possible to reach something indistinguishable from the output of skilled human reasoning, I expect that's when they'll claim AGI.
Personally, I question whether that's possible, at least purely through a text / token based interface. I read a long piece by a former OpenAI engineer on this topic and there were some pretty striking omissions in how they seem to be thinking about these things. I'll dig up the link and share when I get a chance.
I think the thesis is that self-play / reinforcement learning approaches with enormous models will put so much data into the training set that would otherwise have been out of sample that, eventually, humans won't be able to tell the difference between generated output that's in sample and the model actually generalizing.
A commonly cited example is AlphaGo making moves human players hadn't considered. But that's in a prescribed domain where it's easy to generate valid samples.
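To make the "prescribed domain" point concrete, here is a deliberately silly sketch of that kind of loop (entirely my own construction, not anyone's actual training setup): one agent poses arithmetic problems, another answers using an initially wrong parameter, and a programmatic checker supplies the reward. Everything hinges on the verifier being cheap and exact, which is precisely what open-ended text lacks.

```python
import random

random.seed(0)

def poser():
    """'Teacher' agent: poses a random, machine-checkable problem."""
    return random.randint(0, 99), random.randint(0, 99)

bias = 7.0   # the 'solver' agent starts out systematically wrong

def solver(a, b):
    """'Student' agent: answers using its (initially wrong) internal parameter."""
    return round(a + b + bias)

def verifier(a, b, answer):
    """Reward is trivial to compute because the domain has exact ground truth."""
    return 1.0 if answer == a + b else 0.0

for step in range(200):
    a, b = poser()
    answer = solver(a, b)
    if verifier(a, b, answer) == 0.0:
        # Crude 'learning': nudge the parameter toward the verified answer.
        bias += 0.2 * ((a + b) - answer)
    if step % 50 == 0:
        print(f"step {step:3d}: bias={bias:+.2f}")
# The loop works only because the verifier is cheap and exact; for open-ended
# prose there is no comparable verifier, which is the crux of the objection above.
```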