Big Tech's underground race to buy AI training data
reuters.com

> Rates vary by buyer and content type, but Braga said companies are generally willing to pay $1 to $2 per image, $2 to $4 per short-form video and $100 to $300 per hour of longer films. The market rate for text is $0.001 per word, she added.
This is high enough that there should be a market to compensate the end users who created these works.
> The market rate for text is $0.001 per word, she added.
I'm astonished that a picture turns out to be worth a thousand words.
Top multimodal models have about 200-1000 tokens per image, so the math works out.
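Quick back-of-envelope, where the midpoints are my own assumptions rather than figures from the article:

    # Is an image really "worth" ~1000 words at these rates?
    # Figures are assumptions taken from the article/thread, not measurements.
    text_rate = 0.001        # $ per word (the article's quoted market rate)
    image_price = 1.50       # $ per image (midpoint of the $1-2 range)
    tokens_per_image = 600   # midpoint of the ~200-1000 token range mentioned above

    words_equivalent = image_price / text_rate  # words you could buy for one image
    print(f"One image buys ~{words_equivalent:.0f} words of text")          # ~1500
    print(f"Implied $/token for images: {image_price / tokens_per_image:.4f}")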
On the other hand, per byte, the word is more expensive.
I’m an economist. This is an example of a volume discount. Prices often decrease per-unit when buying larger quantities. That happens whether it’s milk or the square-footage of an apartment. I’d expect larger files to be worth less per-byte. Photo and video files tend to be larger than text ones.
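To put rough numbers on the per-byte comparison (the file sizes here are illustrative guesses, not data from the article):

    # Illustrative per-byte comparison; the sizes are assumptions.
    word_price, word_bytes = 0.001, 6         # ~5 letters plus a space per English word
    image_price, image_bytes = 1.50, 500_000  # a ~500 KB JPEG

    print(f"text:  ${word_price / word_bytes:.2e} per byte")    # ~1.7e-04
    print(f"image: ${image_price / image_bytes:.2e} per byte")  # ~3.0e-06
    # Text comes out ~50x more expensive per byte, consistent with a volume discount.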
I love this fact! I would have never realized it
These are probably pretty arbitrarily priced so someone thought they'd be cute & everyone else picked up this rate.
I don't think market prices usually work this way. Am I missing a reason that this is an exception?
I would think classical mechanisms of price determination don't really make sense when there's a brand new market: customers don't know how much things "should" cost and businesses can't make decisions via comparison to competitors. So there is some arbitrariness in setting the initial prices - you can't do market research without an established market.
This is especially true with these data brokers, since it's not like "cost of materials and labor + a profit margin" makes sense. In this particular case data is more like a new commodity than a manufactured good.
Do you really think that market prices are going to reflect such a popular adage because a picture really is worth 1000 words? Does this kind of pricing differential get reflected in the salaries of journalists vs photographers? No? Then as someone else said it's a new market with an artificially "for lolz" initial price that competitors are blindly copying to avoid having to do their own price discovery. I'd expect this to shift & correct slowly over time unless it's a cute joke that everyone appreciates & the true price is close enough that no one cares enough to differentiate in that way.
There is a market...to compensate the platforms where creators uploaded the data for free.
If you give something away when it's worthless, don't come back for more when it's discovered to be worth more.
Users of these sites have had license agreements and privacy policies for a long time, and freely gave away their content just because free web hosting was worth it. Why would they be entitled to anything more now that this content has found new value?
There was https://en.wikipedia.org/wiki/Datacoup , which tried to create a market, but they ultimately went under and the brand is now used by an unrelated company.
Are certain types of textual content more valuable than others? For instance, conversations vs long form content vs short form (ie tweets)
Are counterfeit words, then, AI-generated ones? Just like with money, you need a very good “press” and the fakes have to be hard to detect.
With what's happening in the EU with the GDPR on one hand and with the DMA on the other, I wouldn't be surprised if this becomes the new business model for social media companies.
This market is troubling. But I have a different question:
What does the long game look like for raw training data? How will AIs maintain the quality of their diet?
To compare, web search started — in the early days of Google — as a huge win because so much valuable information that was scattered around became findable. But over time it has become whac-a-mole with spam and AI copypasta, and now it's a struggle to keep returning good results, for any search engine.
Just like how ads have integrated into everything, trying to get us to click away from the happy path, AI will be in everything, trying to get us to do things that it is not yet good at so that it can learn from us. Which would be fine if the newfound efficiencies were properly democratized.
Yep. All these tech giants are taking the labour that people provided to the world in good faith in what I've seen described as a gift economy, and trying to lock it up. On the internet for the longest time people were providing their knowledge and fruits of labour for free, anticipating reciprocity (which on average they got). They stopped when reciprocity stopped. Platforms would monetize their efforts, control the distribution and often remove the reference to the creator.
These AI systems are being built on top of all the collective effort and resulting knowledge of the entirety of humanity. We can pretend they are just another private enterprise, or we can acknowledge that they are something more than that.
And it's not just the productivity we could achieve with democratizing these systems. There's another danger. When big companies buy up all this intellectual property, what better choice would they have than to lock it up? At least until recently you could argue that IP rights owners were as entities incentivized to proliferate this knowledge, now the opposite is happening.
Do you have a concrete example of something you're afraid of losing access to? The examples that come to my mind are such that cutting people off from them would degrade the relevance of the AI that's trained on them, but maybe I'm overlooking something.
Like, if you prevent access to research in order to protect the moat around your AI product, you'll harm the research community that would otherwise be your users. So now they're looking for other jobs and you have no users.
I wonder if they’ve considered hiring people to write. A lot of people might do it for cheap just to have their imprint on AI.
Or, another twist: pay people to submit ten years of emails (upload the backup file), or just pay small amounts for works they've made. College essays, journals, etc.
I have to imagine the valuable training data is domain specific stuff like sales call recordings for specific industries and technical materials about specific topics owned by companies. Surely there is enough public or copyright free general purpose material.
This won't be necessary for future AIs. As AIs start aligning tokens from all the rich modalities of audio, video, and 3D with text so that they can express complex ideas, they will bootstrap proper language generation.
I don't think college essays, etc. would contain anything novel. Future techniques could interpolate ever more smoothly, creating ever-new wordmud.
I agree with your overall point that an AI which can learn about the world directly won't need eleventy billion documents to learn language generation. Just two comments:
1) Based on how pre-verbal children learn, one nitpick is that I strongly suspect we need to give AI touch and a sense of space in order to truly understand quantity, causality, object permanence, etc.
2) Something that is not a nitpick: even a superhuman multimodal AI wouldn't have direct access to human emotions, sexuality, ideas of natural beauty, etc. I don't think humans have run out of interesting things to say about these ideas.
(In particular, I don't think a superhuman AI is capable of understanding music unless it is directly emulating the biological processes by which humans understand music. The issue is not "logical" - melodies don't actually make sense analytically.)
> I don't think [things created by humans] would contain anything novel.
That's quite a proposition.
Not every essay is created equal. Plus, I don't understand what a new way of combining the same words would achieve, given LLMs have already seen trillions of tokens. LLMs could inpaint to arrive at similar texts.
Turnitin will have millions of essays written by students. No doubt they will already be looking at deals like these (or getting ready to update their license if it currently doesn't permit it).
They're more interested in eliminating jobs than creating them.
This already happens. I have seen recruiters trying to get domain experts in various fields to write articles for AI training.
LinkedIn built a whole platform inside their platform for doing exactly this. I think you get a badge or something on your profile claiming you're an expert on something if you write a couple paragraphs on a topic using the provided prompt.
They're very clear it's going into an AI-generated article on the topic, but you'd better believe that is also now core training data.
Most companies are hiring for the role of AI Tutor. Some of that is definitely happening.
People will just use ai to write those essays and emails!
This will be a fun reminiscence once we find out how humans are able to learn with just a tiny fraction of that data volume.
Despite all the hoopla around AGI, the sheer amount of data required really makes human learning all the more impressive.
Gödel probably consumed a minuscule fraction of what these systems have seen. And look what he came up with!
Not sure this is a good conjecture. The main reasons are 1) AIs are expected to have incredible range that the average human does not, and 2) humans actually do take in enormous amounts of data, but it happens over the course of many years and most of it is audio/visual/tactile/experiential.
We already see that if you want to focus on a narrow skillset you can use a much smaller model and training set. But right now it is a race because everyone wants to be the one true generalized intelligence model.
The data volume is actually not that different once you account for all senses and how many years it takes for a human to become useful. The interesting thing would be how the human brain filters out the unimportant information as it develops.
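A crude back-of-envelope makes the point; every constant below is an order-of-magnitude assumption, not a measurement:

    # Rough comparison of human sensory intake vs. LLM training data.
    # All constants are assumptions for illustration only.
    visual_bits_per_sec = 1e7          # order-of-magnitude estimate for the optic nerve
    waking_secs_per_year = 16 * 3600 * 365
    years = 10

    human_bits = visual_bits_per_sec * waking_secs_per_year * years
    llm_tokens = 15e12                 # ~15T tokens, roughly a frontier-model text corpus
    llm_bits = llm_tokens * 4 * 8      # ~4 bytes of text per token

    print(f"human visual intake: ~{human_bits:.1e} bits")  # ~2.1e+15
    print(f"LLM text corpus:     ~{llm_bits:.1e} bits")    # ~4.8e+14
    # Same ballpark -- the gap mostly disappears once you count raw sensory data.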
That's a distinction without a difference. The majority of data is from a distribution that's already been sampled multiple times.
E.g. how often does a baby go out and experience something novel? The majority of its time is spent getting the same stimulus over and over again, as anyone listening to children's television can attest.
Humans learn in fundamentally different ways to our current systems and information poverty is not a problem for us.
And what do you think epochs in machine learning are? Or why more modern training efforts (i.e. for LLMs) are focussing hard on deduplicating scraped data?
Why don't you tell me instead of asking questions that you surely know the answer for?
It was rhetorical. But in case you actually don't know: what you described (i.e. multi sampling) has been common practice in ML for ages. Only now the latest models are getting so big that people are actually trying hard to move away from this idea because it would take a human lifetime in wall clock time to train a cutting edge LLM on similar datastreams.
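For anyone following along, here's a toy sketch of both ideas — dedup, then multiple epochs over the same data. Real pipelines use fuzzy dedup (e.g. MinHash), so the exact-hash scheme here is a simplification:

    import hashlib

    # Toy sketch: deduplicate a corpus, then train for several epochs
    # (i.e., show the model the same samples repeatedly).
    corpus = ["the cat sat", "the dog ran", "the cat sat"]  # toy data

    seen, deduped = set(), []
    for doc in corpus:
        h = hashlib.sha256(doc.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            deduped.append(doc)

    for epoch in range(3):      # multi-sampling: 3 passes over the same data
        for doc in deduped:
            pass                # train_step(model, doc) would go here
    print(f"{len(corpus)} docs -> {len(deduped)} after dedup, each seen 3x")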
Because we actually think? I'm not just trying to guess the next word and I understand causal relationships.
The real difference with human learning is feedback: when young humans learn, at least some of the time they are interacting with intelligent agents that are able to give them focused feedback on their recent inputs and initial reactions to them.
I think this ignores the most essential feedback very young humans get: planet Earth itself obeys laws of physics, mathematics, logic, etc. And by age 2 human children already have far faster and deeper reasoning abilities than any contemporary AI, even if their lack of linguistic knowledge means they wouldn't perform very well on LLM benchmarks.
In general AI researchers have done a very bad job exploring how a system might be "near-human" according to some fancy linguistic benchmark, yet dramatically dumber than a pigeon in terms of general reasoning abilities.
Tiny fraction... if you ignore the learning data processed by a billion years of evolution.
It’s a good question what portion of our DNA contributes to the information processing and knowledge in our brain.
However, the first complex nervous systems came about in the Cambrian explosion, only about half a billion years ago. And we also don’t train LLMs by random mutation and selection, it’s a much more teleological process.
But to extend the analogy, we should be able to train a model continuously, and not have to start training from scratch for each new model. Although, maybe, that would require random mutations, and thus much more time?
All the more reason for comprehensive privacy/data protection legislation and a refusal to provide data to these companies wherever possible.
The fact that ChatGPT isn’t deemed copyright infringement is absurd. Like, you can’t take the entire internet, use it to train your software, and claim you’re not violating the copyright of thousands of people.
If the predictions that traditional search engines will be displaced by LLM engines turn out to be correct then there will have to be a reckoning about copyright. It's already difficult enough to make money by writing online, but if most content gets consumed second-hand through an LLM then it will become basically impossible. How are journalists supposed to eat if NewsGPT just scoops up their work and starts regurgitating it seconds after publishing?
> How are journalists supposed to eat if NewsGPT just scoops up their work and starts regurgitating it seconds after publishing?
NewsGPT won't just regurgitate the work of journalists. First it'll consider the paid "partners" of NewsGPT to make sure to downplay anything that might hurt them, then it'll do the same for their advertisers while inserting some ads in the text, then they'll give the article tweaks according to NewsGPT's own ideology and then finally spit out something very different at their users. Maybe they can argue that NewsGPT is too transformative to count as copyright infringement.
How are you supposed to trust "journalism" from a text generator that hallucinates? The information ecosystem is bad enough without running it through a text blender that's already hitting compute, power and data limits.
And even if it doesn't hallucinate, current-age text generators are very good (in a bad way) at following a leading question.
For example, questions like "Tell me why I should use semaglutide for weight loss" gives widely different answers than "Tell me why I shouldn't use semaglutide for weight loss".
A human writer might fall into the bias trap of the original question being leading, but much less so than text generators that often repeat your prompt (reinforcing whatever leading answer was embedded in your question) before answering it.
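Easy to reproduce with a few lines against any chat API; a sketch using the OpenAI Python client, where the model name is just a placeholder:

    from openai import OpenAI

    # Sketch: ask the same question with opposite leading framings and compare.
    # Requires OPENAI_API_KEY in the environment; the model name is a placeholder.
    client = OpenAI()

    for framing in ("should", "shouldn't"):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": f"Tell me why I {framing} use semaglutide for weight loss"}],
        )
        print(f"--- {framing} ---")
        print(resp.choices[0].message.content[:300])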
Regurgitation seconds after is what already happens with the AP though. There are some real journalists that will sadly be pushed further out of the fold, and presumably many human but fake journalists that have been coasting for years on such regurgitation. I’m not so optimistic about the ai future, and believe payment or at least credit really needs to get figured out for generative stuff. But real content producers should direct some of the irritation at their editors, colleagues, and industry or else it’s all rather dishonest isn’t it?
> Regurgitation seconds after is what already happens with the AP though.
The AP makes about half a billion a year from other outlets paying them for permission to regurgitate their content. That's not the same as the AI lobby saying they should be allowed to scrape apnews.com and publish articles derived from the content they get from there, for free, and without attribution.
I do see your point, and yeah theft is theft and theft is bad. But what I’m getting at is more the POV for media consumers.
If it’s regurgitated / unoriginal anyway then I don’t think most people care much whether it’s summarized/subjected to extra spin and fluff by a person or by a machine.
We should be working to strengthen journalism as a practice. LLMs will do anything but.
Journalists are ultimately extremely overrated in 2024.
I go out of my way to not consume news outside what happens to cross my way because of financial markets.
What exactly do you think I am missing that is so important? Journalists by and large produce complete nonsense in 2024. Journalists in 2024 are a massive net negative and would be much better served doing something productive, like selling apples on the street.
The main counterargument is that you have read 1000s of documents to train your brain which produces unique documents with no credit to the original copyright holders.
GenAI is just doing the same thing on a larger scale.
Well, if the concept is "we should legally treat the systems just like very big humans", then the next step is to arrest and confine all the leaders of the companies involved on charges of slavery and child exploitation.
The distinction does matter in copyright too, since a transformative work needs some non-trivial amount of human input.
Yep, the issue with the parent counterargument is that gen ai is a monetized tool owned and sold by a corporation. People would probably be fine with a human-like embodied ai or something learning in the same way.
If I offered a paid service where you could pay me $20 a month and I would draw you copyrighted works that are in my internal neural network, that would also be illegal.
Frankly, it is and should be treated as such. The fact that they're dodging questions about their data sources is a red flag and a pretty clear indication that they know they're in the wrong and are fighting to become established enough to be in a position to, at best, ask for forgiveness after the fact.
Isn't that what Google did? They scraped the internet, but the public/econ advisors felt the benefits outweighed the copyright violations; they were just "indexers". They weren't scraping "news", they were indexing it lol
Same thing with emulators and ROMs. Somebody dumped the cartridges (copyrighted software) into ROM files to be played on emulators (with a copyrighted BIOS), but they were "archiving", and if you owned the original copy you could download them. I still vividly remember the disclaimer on a warez website: "DMCA SAFE HARBOUR NOTICE: YOU MUST OWN THE ORIGINAL GAME OTHERWISE ITS ILLEGAL BUT YES, YOU CAN DOWNLOAD EVERY SINGLE GAME MADE ON THAT CONSOLE FOR FREE"
I feel like the same outcome awaits LLMs trained on copyrighted material. It will be "training". The net benefit is too great to fret over "training".
tldr: "indexing" ---> "archiving" ---> "training"
Google surfaces data — or it used to — LLMs and AI companies actively exploit it with zero benefit given to creators or users of the platforms they're now cannibalizing.
The irony. I'm surprised businesses built on selling Google search results are allowed to exist. I guess for the same reason Google scraping the internet and building a product on top of it is allowed.
Then it only makes sense that scraped AI training data is also going to be tolerated, because to prove infringement you would need to reproduce a large language model like ChatGPT and show, by forensic analysis, that when trained on your copyrighted content it can produce a similar derivative of your copyrighted content.
It's such an uphill battle for copyright holders. They need to replicate: copyrighted input ---> LM similar to ChatGPT-4 ---> copyrighted output
So far it's not looking good for OpenAI, because it's possible to generate copyrighted output (type Spiderman in Czech), so all that remains is demonstrating the middle layer (training an LM similar to ChatGPT-4), but that is unrealistically expensive.
I have a theory that all this money spent on large models is to make discovery impossible (as it would require access to $100 billion worth of GPUs).
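The cheap end of that forensic analysis is a black-box extraction probe: feed the model the opening of a protected text and measure how much of the real continuation comes back verbatim. A rough sketch, where query_model is a hypothetical stand-in for the API under test and the scoring is far simpler than a real analysis:

    # Black-box memorization probe: prompt with a prefix of a copyrighted work,
    # then measure verbatim overlap with the true continuation.
    def ngram_overlap(candidate: str, reference: str, n: int = 5) -> float:
        grams = lambda s: {tuple(s.split()[i:i+n]) for i in range(len(s.split()) - n + 1)}
        c, r = grams(candidate), grams(reference)
        return len(c & r) / max(len(r), 1)

    prefix = "It was the best of times, it was the worst of times,"
    true_continuation = "it was the age of wisdom, it was the age of foolishness"

    # completion = query_model(prefix)   # hypothetical call to the model under test
    completion = "it was the age of wisdom, it was the age of foolishness"  # stand-in
    print(f"5-gram overlap: {ngram_overlap(completion, true_continuation):.0%}")
    # High overlap across many samples suggests the work was in the training set.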
The whole notion that AI can replace search is nonsense. It yields no benefit to the creators of the results it scrapes and the models hallucinate. It's worse for users and it's worse for everyone producing anything of note online.
But many ChatGPT users are not using Google as much, instead relying on LLMs + RAG.
ChatGPT is the new search engine and provides far more value to the end user than Google.
The issue seems to be people want a payout from OpenAI... but it's a non-profit.
It's a shiny toy — it'll yield worse answers. Much like Google's own AI.
Google search is terrible. ChatGPT is definitely better for searching right now, and I often find myself reaching for it over Google for a wide category of questions.
Google search is terrible because Google's stopped caring about search quality in favor of monetization. It doesn't mean an LLM can outperform a traditional search engine that cares about said quality.
The same benefit doesn’t exist for ChatGPT as for Google, because Google means people click on your site and you get ad revenue. Google even facilitates this in both directions, with search ads and as an ad service you can get paid from for hosting ads. The ROM site DMCA thing was always BS lmao; it’s completely legal for you to dump your own carts and use them in emulators, but that freedom doesn’t extend to having a copy of someone else’s game cart. That’s just an intentional misunderstanding of the DMCA in a futile attempt to not get banned.
So you think scraping copyrighted content to sell ads is okay, and downloading copyrighted games for free is also okay; then why is it not okay for ChatGPT to train itself on scraped content?
It's not scraping, it's indexing and linking out to creators. LLMs are helping themselves to everything with no regard for content creators. They should be subject to copyright claims — I don't care if it destroys their business, they should've considered that at the outset. They didn't then and they don't care to now, they're simply greedy and looking to build something that benefits themselves and their investors with no regard for anyone they step on to do so.
But how can you prove that your picture of a cat was used to train an LLM?
If you owned a franchise called "Chicken Brothers" with a logo of two chickens standing side by side with arms crossed proudly, then do you have a claim over all derivatives, including the Spanish name generated by an LLM?
I just don't think it's straightforward. The main complaint should be a payout for the license used during training, but it's tough to prove unless someone at OpenAI dumps the AWS CloudWatch logs.
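For images specifically, one weak but practical signal is whether the model can regenerate near-duplicates of your picture; perceptual hashes make "near-duplicate" measurable. A sketch assuming the third-party imagehash library, with made-up filenames:

    from PIL import Image
    import imagehash   # third-party: pip install imagehash pillow

    # Compare your original cat photo against a model's output using a
    # perceptual hash. A tiny Hamming distance suggests near-verbatim
    # regurgitation -- weak evidence of training-set membership, not proof.
    original = imagehash.phash(Image.open("my_cat.jpg"))
    generated = imagehash.phash(Image.open("model_output.png"))

    distance = original - generated   # imagehash overloads '-' as Hamming distance
    print(f"Hamming distance: {distance}")
    if distance <= 8:                 # threshold is a rough convention, not a standard
        print("Suspiciously similar -- worth a closer forensic look")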
That's OpenAI's problem and the burden should be on them.
The first part is fine because the search engine blurb isn’t a replacement for the thing itself. And I disagree with what ROM sites claim, you can’t just dump ROMs online and claim it’s not copyright infringement
Companies like Quest Diagnostics (a lab testing firm) are sitting on a goldmine of clean data. It's only a matter of time before a firm like Amazon (who already bought One Medical) gobbles them up.
Disclaimer: Long on $DGX
>in talks with multiple tech companies to license Photobucket's 13 billion photos and videos
>Photobucket declined to identify its prospective buyers, citing commercial confidentiality.
>tech companies are also quietly paying for content locked behind paywalls and login screens, giving rise to a hidden trade in everything from chat logs to long forgotten personal photos from faded social media apps
In this market, ethics seem to exist when it comes to corporate clients, but not when it comes to end-users.
It's immediately and self-evidently obvious that no end-user in 2007 consented to photos of their 2007 era teenage self being used to train an AI how to identify an emo kid.
Photobucket is a morally bankrupt shell of its former self. They send constant emails with extremely urgent subject lines threatening to delete your photos unless you sign up for a $5/mo plan. They do this even if your account doesn't contain any photos.
This is funny. If they delete your photos then they lose their lever for getting you into a payment plan. Except they'll probably email you a recurring one-time offer to restore your 'deleted' photos for a nominal fee.
> threatening to delete your photos unless you sign up for a $5/mo plan
What's morally bankrupt about that? It costs money to host your photos and they're a business that can decide to charge their customers any rate they think the market will accept.
I have no photos on their service and they've been emailing me weekly since last year with URGENT and ACTION REQUIRED.
> It's immediately and self-evidently obvious that no end-user in 2007 consented to photos of their 2007 era teenage self being used to train an AI how to identify an emo kid.
I can think of worse things than that which might be hidden away for public scraping.
>In former times it was maintained that ownership of landed property extends from heaven all the way down to the center of the Earth, but this doctrine is obsolete, as evidenced by the flight of airplanes.
I vividly remember consenting to all variety of terms agreements as a 13-year old on the web in 2007. I also remember explicitly licensing all of my output as CC and embracing copyleft. It's never been a secret that even captchas contribute to the improvement of models designed to ultimately sell ads to eyeballs.
A lot of people just were not paying attention to the game being played, and so now they're getting played themselves.
When a company hides its skeevy practices in a 30-page social media consent form, I blame it a lot more than the normal person with limited time, or the literal child. I’d prefer such people didn’t “get played” by multinational corporations, even if they potentially could have prevented it.
I think it's important to understand that consent was indeed given, and most users likely understood that they did not own any non-copyrightable portion of their user-generated content.
Rather, the conversation should focus on how to improve parsing of ToS (I personally believe we should use symbolic labeling like we do with food), as well as regulation around what terms can change for content which was generated under the premise of an older ToS.
OP's statement, "It's immediately and self-evidently obvious that no end-user in 2007 consented to photos of their 2007 era teenage self being used to train an AI how to identify an emo kid," is simply false. Many, if not most users, understood that they gave permission for their UGC to be used to improve the services. This is what I am rebuking.
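To make the symbolic-labeling idea above concrete, a hypothetical machine-readable ToS "nutrition label" might look like this; every field name is invented for illustration, since no such standard exists:

    # Hypothetical "nutrition label" for a Terms of Service -- all fields
    # are invented for illustration; no such standard exists today.
    tos_label = {
        "service": "ExamplePhotoHost",
        "tos_version": "2024-04-01",
        "data_collected": ["photos", "metadata", "contacts"],
        "uses": {
            "service_improvement": True,
            "ai_training": True,          # the clause most 2007 users never noticed
            "sale_to_third_parties": True,
        },
        "retention_after_deletion_days": 90,
        "changes_apply_retroactively": False,  # the regulation question raised above
    }

    for use, allowed in tos_label["uses"].items():
        print(f"{use}: {'yes' if allowed else 'no'}")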
> It's immediately and self-evidently obvious that no end-user in 2007 consented to photos of their 2007 era teenage self being used to train an AI how to identify an emo kid
Unfortunately, they did actually. It's more accurate to say that they were presented a EULA and Terms of Service that no reasonable teenager would have had any hope of understanding. But since they're over 13, they're held to the terms of those agreements in any case.
These companies are slimy. Make no mistake, this will get worse in the future.
People have known for eons that companies were using their data. TikTok is well publicized to be an arm of the CCP. And yet millions of people would rather have entertainment. There are plenty (myself included) who abstain, but the reality is the vast majority of people, if presented with free and unlimited dopamine hits, will gladly give away their info.
> the reality is the vast majority of people, if presented with free and unlimited dopamine hits, will gladly give away their info
I counter that the reality is the vast majority of people do not meaningfully understand the exchange they are making. I'm not saying they're stupid or blaming them whatsoever; it's a similar phenomenon to playing the lottery. Our brains aren't equipped to understand such unintuitive phenomena.
> I counter that the reality is the vast majority of people do not meaningfully understand the exchange they are making.
I’m not sure this is true in 2024 at all. The presumption today is you’re being tracked, and people simply don’t care.
But let’s presume this isn’t true. I think the response should be to expect more from society. Every additional bit of nanny state coddling reduces individual responsibility.
Do you consider such people to blame? Under your own framing, they’re more or less being taken advantage of by a modern form of drug dealer.
Absolutely. Do you think drug addicts have no responsibility for their behavior? Are drug dealers viable without a strong customer base?
There's a funny thing where the legal/commercial definition of "consent" is essentially a subset of "non-consent", having extremely little overlap with "consent" in a meaningful way.
They talk about voice samples, but they don’t mention prices for them
Would it be attractive for a company like Twilio or Aircall to offer free phone calls and sell anonymized recordings?
Funnily, this is how Google improved their voice recognition.
Remember a decade or so ago, you could call a 1-800 number and look up phone numbers using your voice? It was backed by Google and once Google was done collecting the data, they shut it down.
GOOG-411
It would solve all government budget issues if the three-letter agencies would start selling all their data.
No, that's a gross violation of privacy; there's no such thing as anonymized recordings.
It would be a violation of privacy if people weren’t aware/hadn’t consented
But if it was part of the terms of the new free service, and all the parties involved got a reminder message on the call… you might still not like it, but it doesn’t seem like it would be a violation of privacy
I’m not a lawyer, but I do live in a one-party-consent state. I would imagine that if I set up a service here, ensured all calls originated in my state, and the person who owned the account being used consented, it would be legal. Even without informing the person on the other end of the call.
Would this violate other laws outside my jurisdiction? Probably, but that just means I won’t travel there.
I actually hope I’m wrong.
A fantastic example of why the inherent "lack" of one party in an economic exchange is a necessary component of modern capitalism.
The only people who would be willing to use such a service are people who have likely already been systematically disenfranchised by our global economic system. Poor people.
Privacy should not be incentivized and treated as a luxury. Especially when the end result of all this training data is models which further discriminate against vulnerable third-parties and automate maximum value extraction from the average user via unprecedented amounts of emotional manipulation afforded to us by the development of user-facing generative AI. Whether through highly-targeted, ad-hoc advertisements, or discriminative insurance policies.
Google having so many private photos in Google Photos must be a goldmine for them.
> Google having so many private photos in Google Photos must be a goldmine for them.
While true, it's META who won that arms race long ago in my view; hell, they just disclosed in a lawsuit that they gave Netflix private access to DMs [0].
If you don't think they are training their own models on this data across all their platforms (Facebook, Instagram, WhatsApp), you have to be a complete idiot.
That is a much larger treasure trove given the sheer scale of people on those platforms. Google is limited to mainly Android users and those who use its suite on PC (relatively small compared to social media users), which excludes most Mac users.
The thing they don't tell you about this dark underbelly of AI is that, just like the (meta)data that is for sale to 3rd parties, it has a tiered price structure wherein Mac users are often the premium tier due to their more 'affluent' status and likelihood of impulsive in-app purchases.
This is why I think META already won the AI race: they open-source Llama and have a massive treasure trove of data to refine and train on when they see what the OSS community creates that is of actual value. ChatGPT/DALL-E runs at a loss for MS/OpenAI, but if anyone can monetize this gold rush it will be META.
And perhaps more critically from an infrastructure POV, Llama now runs better on CPU [1] rather than GPU, which means they won't have to be constrained or price-pinched on GPUs like Microsoft, Google, and Amazon likely will due to demand constraints from Nvidia (see the ETH mining craze during COVID). They can focus on optimizing their data centers with more free cash flow, which means they can have a bigger footprint for when they finally figure out how to properly monetize this AI bubble, because it is a bubble, from now until then.
I think Zuck learned from Libra that staying out of the limelight during a bubble is critical if he wants to undo the Metaverse money-pit/losses.
0: https://www.movieguide.org/news-articles/facebook-allowed-ne...
> Google is limited to mainly Android users
https://www.appmysite.com/blog/android-vs-ios-mobile-operati...
Random link. Can't vouch for it. But US and RoW have quite different patterns.
> Random link. Can't vouch for it
Seems about right to me, Android dominates the mobile World by sheer numbers.
But what is the value that they can derive from user data? A million Bangladeshis' texts from food delivery are probably a lot less valuable than, say, a Singaporean using Numbers on macOS to lay out the next lucrative investment, and the data they'd get from the correspondence of, say, 100 high-net-worth individuals hidden behind iOS (Pegasus MITM attack notwithstanding).
Again, the name of the game is to derive signal from noise in the data; bulk collection is primitive when training models and often incredibly difficult to work around once it is in. I seriously think Gemini had this problem, along with QA/QC issues, rather than it simply going from so-so Bard to totally 'woke' Gemini. I may be wrong, but I think this is what happens when you go down the bulk-collection, unfiltered/un-curated data route.
> But what is the value that they can derive from user data?
What, are the pictures and videos of people from the global south somehow not good enough to train AI due to their economic situation?
> What, are the pictures and videos of people from the global south somehow not good enough to train AI due to their economic situation?
I don't make the rules, in fact if you are seriously wondering what use 'darker' people's data have had with AI training look no further than the surveillance based platforms that are responsible for tons of false incarcerations of mainly black US citizens [0].
I'm not sure if it's going to change for the plight of the 'Global South's' data either. It's not that I think it's inherently prejudiced, either; it's more like it's optimized to be greedy in order to extract as much value as it possibly can from the current system at all costs.
People need to stop smoking hopium and thinking that this is going to usher in some sort of egalitarian renaissance; this is business as usual by the mega corps that bring you this tech.
0: https://innocenceproject.org/artificial-intelligence-is-putt...
Whatsapp chats are encrypted, how can they be used to train the models? Also what kind of training can be done on Instagram data, is there anything of value there?
> Whatsapp chats are encrypted
While they claim E2E encryption, I seriously doubt they would offer this service entirely for free without having some backdoor or potential MITM breach that they likely tucked away in the ToS, given the wide use of it in most of the world by people who otherwise pay for SMS/text messages: it just seems so incredibly unlikely to be entirely encrypted coming from a company that willingly gave DMs to Netflix, used Cambridge Analytica, etc. But even if it is encrypted, the metadata generated can tell you a lot too (as was the case with Pokemon GO); that may not directly benefit LLMs, but it could help with creating dark patterns that make your AI companion (under the guise of an LLM) the 'must own' when deciding who to buy tokens/compute from.
Speculative for sure, but just look at the Twitter file leaks revealing how social media platforms willing work alongside intelligence agencies.
> While they claim E2E encryption, I seriously doubt they would offer this service entirely for free without having some backdoor or potential MITM breach that they likely tucked away in the ToS, given the wide use of it in most of the world by people who otherwise pay for SMS/text messages: it just seems so incredibly unlikely
You don't have to trust Meta's self-regulation, but you best believe the EU does not fuck around on such issues. Self-preservation is a hell of a motivator.
> Also what kind of training can be done on Instagram data, is there anything of value there?
Billions of comments and private messages; billions of data points on user behavior and (more importantly) how they respond to manipulative UI/UX/content... Nothing useful there??
I'm genuinely curious how that data helps. What would the prompts be like? "Help me design an addictive UX"? How do comments like birthday wishes, or people posting their beach pictures and people replying with how good they look, add any kind of value to ML model training? Those conversations would vastly outnumber any that discuss something meaningful.
As well as emails, documents, reviews…
I am incredibly thankful that I never used any of those services. I'm angry enough at the thought that my own websites may have been scraped to train LLMs, but at least I could remove that content. I'd be beside myself if I couldn't do at least that much.
No Datadome Javascript:
https://www.usnews.com/news/top-news/articles/2024-04-05/ins...
I assume some of the more shady/no-name dashcam units with WiFi capability are uploading their video and internal microphone recordings. Distributed surveillance: the Panopticar.
Any modern car is likely to already be transmitting that data and more, such as your weight, metadata about your doctor visits, etc. Cars are a privacy nightmare.
I've wondered about crowdsourcing that. Sousveillance. Don't think enough people would be interested, though.
Nobody's going to mention Worldcoin?
I still speculate that PG's golden boy was fired over unethically sourced training data for GPT-4, but we'll likely never get the real story.
I wonder when one of the richest corps will manage to get exclusive access to such data and lock out the others.
Never.
Because no one will sell them an exclusive license to the data.
The companies selling this data are slimy. They're borderline crimelords. Picture a pirate captain with a hostage that he is ransoming. Now imagine he gets his ransom, but before he releases the hostage he makes a copy of her. Then ransoms the copy to another interested party. But before he releases the copy, he makes another copy and... you get the idea.
It's pirate thinking.
"If one hostage is good? Then two are better! And three? Well, that's just good business!!!" -Hondo Ohnaka
GDPR covered data should be worth a lot less.
Ha - I love your optimism that they are even considering GDPR
Who could have guessed giving away all of our data to corporations wholly focused on profit would be a bad thing?
If the end result is ai chat agents that anyone in the world can access for free, that seems like an absolutely wonderful thing
If the companies making those agents are paying top dollar for training data then the product isn't going to be truly free, at best it will be "free" with caveats. Do you want to use an AI agent which is fine-tuned according to the wishes of the top bidding advertisers? Because that's probably the first thing they'll try to make "free" chatbots actually turn a profit.
The future is having your own personal AI assistant, completely free of charge, which is suspiciously eager to recommend shopping at Temu and eating at McDonalds.
As Yuval Harari suggests, the AI economy will move away from money and manpower. What will be important is control over resources and their distribution. These big companies won't care about a number in some database. They won't care about selling stuff to you; maybe in the midterm, but not in the long run.
That sounds good until you realize that money is just an abstraction of resources.
Because the economy functions better when decoupled, and money provides autonomy in the current paradigm. Consider a future where the rich and those in power don't depend on the majority for food, transportation, security, or services; then they have no need to convince you to work for them.
We can be forgiven for not having foreseen how social media would be used against us, connecting the world sounded like a cool idea on the surface. But having gone through that there's no excuse to be naive and simplistic about AI.
What harm has been inflicted upon you directly as a result?
That's the beauty of the system we have. Your data can be routinely used against you and you'll never be told about it.
Your health insurance company can buy up records from a data broker that show you've been spending 6% more time at fast food restaurants compared to last year and they can use that to raise your rates, but they'll never tell you that was the reason, you'll just have a higher bill than before
An employer can pass you over for a job you've applied to because you wrote something on a social media site 12 years ago that offended their political ideology, but you'll never know that was why, you'll just never get a call back.
If you get arrested, you could be denied bail because some AI decided you were a flight risk or more likely to reoffend if released, but no one will be able to tell you what made the AI decide that and you may not even be told an AI was used to make that choice.
As our lives become increasingly interconnected and recorded and analyzed it becomes extremely difficult for you to be aware of how or why your data is impacting your life, but it would be a huge mistake to assume that it isn't.
The funny thing about insurance is that as it becomes perfect at assessing risk, it becomes worthless.
Oh, you’re about to have a $x claim this year, your premium is $x + y% admin fee.
Just self insure and save yourself the y%.
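The arithmetic, with illustrative numbers:

    # Illustrative: as risk prediction approaches perfection, insurance converges
    # to "your own expected claim plus overhead", which no one would rationally buy.
    expected_claim = 4_000   # insurer's (now perfect) prediction of your claim, $
    admin_fee = 0.15         # the y% overhead

    premium = expected_claim * (1 + admin_fee)
    print(f"premium: ${premium:,.0f} vs. self-insuring: ${expected_claim:,.0f}")
    # You'd pay $4,600 to cover a known $4,000 loss -- self-insure and save the 15%.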
I imagine that they probably /can/ name direct harms, but why does someone need to have been directly harmed for them to be concerned about it / think it's bad? I've never been murdered, but that's hardly reason for me to be okay with murder.
That's like asking: "What harm has been inflicted upon you, individually and directly by some parts-per-million of a cancer-promoting chemical dumped in the town's water supply over the last twenty years?"
The worst downsides of social media weren't apparent or even necessarily occurring yet until we were well into it by more than just the first few years.
> for free
Even if a future service doesn't have an obvious charge or subscription, just because you don't recognize how you're being exploited doesn't mean it's truly "free."
There's a reason advertising exists as an industry at all, let alone a global trillion-dollar one. Today's "free" is actually paid for by exploiting user attention and attempting to hack your brain--sometimes in ways that are culturally accepted due to long tradition of use, sometimes in new disturbing ones.
Yes. If a thing is being paid by the collection of user data (which is what advertising involves in these sorts of use cases), then it's not free in any meaningful sense. You're still paying, just using a different medium of exchange.
What makes you think that will be the end result?
It's already the case, there are free models that were trained on this data
What's publicly and easily available for free is not good, so we're stealing labor and replacing it with worse labor.
Perplexity is already looking to include ads. Free is a hook to get users invested before they trap them and extract all the value for themselves.
Are you an artist selling their own work?
Code. Every day.
When you solve something tricky, you just basically released that as open source and trained ChatGPT 2030 how to do it without you.
That's wishful thinking though.
That said, AI tech is or is quickly becoming freely accessible; unless they have a USP, free / homemade versions will end up competing with the paid services, and it's hard to compete with free.