How good could GPT-4 get if copyright was not an issue?
If we could train a GPT on every book in the Library of Congress, every stock photo, every piece of art, etc., how smart could it get? What if it was trained on all intercepted data? Why would this NOT be happening right now? I don't think that they are worrying about copyright today.

From what I've heard, OpenAI currently does not have enough free GPUs to train GPT-5. There are also unsolved technical challenges, such as supporting very long context lengths without accuracy degradation, or efficient access to real-time information. But even ignoring the above, it's not clear that training on more data will dramatically improve the quality of the model. You want high quality data to get a high quality model. I'm not sure if much more high quality data exists in the world than what they already have access to. It's possible that to make it smarter we should train it on video: predicting the next video frame might be a more challenging task, requiring different world models and leading to a deeper understanding of reality.

This is old and dates from the period surrounding the launch of the 6B model from Eleuther, but have a gander, it sheds some light on their sources: https://arxiv.org/abs/2101.00027 The GPT-4 technical paper doesn't seem to disclose it (or I didn't dive deep enough): https://arxiv.org/abs/2303.08774 One could ask it niche questions about the stuff you mentioned to gain insight, but its answers may stem from discussions about that material rather than from the material itself.

It's a great question I've wondered about as well. I think we're going to find out soon. Some countries are creating fairly strict regulation limiting AI, but others are laying out clear guidelines encouraging innovation. Japan, for example, has ruled that copyright does not apply to AI training, so as I understand it your idea could be carried out there. I suspect the jurisdictions with the most innovation-friendly approaches will attract talent and capital and wind up with the strongest models.

It's not happening now because some of us respect property rights.