TikTok’s entrancing “For You” feed made its parent company, ByteDance, an AI leader on the world stage. But that same company is now so behind in the generative AI race that it has been secretly using OpenAI’s technology to develop its own competing large language model, or LLM.
This practice is generally considered a faux pas in the AI world. It’s also in direct violation of OpenAI’s terms of service, which state that its model output can’t be used “to develop any artificial intelligence models that compete with our products and services.” Microsoft, through which ByteDance buys its OpenAI access, has the same policy. Nevertheless, internal ByteDance documents shared with me confirm that ByteDance has relied on the OpenAI API during nearly every phase of development of its foundational LLM, codenamed Project Seed, including to train and evaluate the model.
Employees involved are well aware of the implications; I’ve seen conversations on Lark, ByteDance’s internal communication platform for employees, about how to “whitewash” the evidence through “data desensitization.” The misuse is so rampant that Project Seed employees regularly hit their max allowance for API access.
Use of OpenAI’s platform was more brazen in the early days of Project Seed, but a few months ago, ByteDance ordered the team to stop using GPT-generated text in “any stage of model development,” the internal documents show. It was around this time that the company gained regulatory approval in China to release Project Seed through a chatbot platform called Doubao.
Even so, I’m told that the API continues to be used in ways that violate OpenAI’s and Microsoft’s terms of service, including for evaluating the performance of ByteDance’s model behind Doubao. As a person with firsthand knowledge of the situation inside ByteDance put it, “They say they want to make sure everything is legal, but they really just don’t want to get caught.”
In response to a detailed list of facts in this story, ByteDance spokesperson Jodi Seth said that GPT-generated data was used early on in the development of Project Seed for annotating the model, and that it was removed from ByteDance’s training data around the middle of this year. “ByteDance is licensed by Microsoft to use the GPT APIs,” she said in a statement. “We use GPT to power products and features in non-China markets, but use our self-developed model to power Doubao, which is available only in China.”
“Microsoft AI solutions such as Azure OpenAI Service are part of our Limited Access Framework, meaning all customers must apply for and be approved by Microsoft for access,” Microsoft spokesperson Frank Shaw said in a statement. “We also set standards and provide resources to help our customers use these technologies responsibly and in compliance with our terms of service, and have processes in place to detect misuse and discontinue access if companies violate our code of conduct.”
Update December 15th, 6:40PM ET: After this story was published, OpenAI spokesperson Niko Felix sent me the following statement confirming that ByteDance’s account has been suspended: “All API customers must adhere to our usage policies to ensure that our technology is used for good. While ByteDance’s use of our API was minimal, we have suspended their account while we further investigate. If we discover that their usage doesn’t follow these policies, we will ask them to make necessary changes or terminate their account.”
While it isn’t discussed in the open, using proprietary AI models — particularly OpenAI’s — to help build competing products has become common practice for smaller companies. It’s generally viewed as a legal gray area since OpenAI and Microsoft have yet to make an example of an offender. “A lot of startups now are just taking that risk,” according to Naveen Rao, VP of generative AI at Databricks.
However, everyone I spoke to while reporting on this story agreed that it’s highly unusual for a company with ByteDance’s scale and resources to behave this way. It suggests the Project Seed team has been put under tremendous pressure to deliver quickly. “I get recruitment emails from ByteDance very regularly,” one AI researcher at a large American tech company told me. “I usually just ignore them. But this makes me wanna move them to spam.”
Other companies have dealt with similar concerns about misusing GPT’s output to build a competitor. At Google, for example, a researcher quit in protest earlier this year because some employees were trying to use data from a website where people had uploaded their conversations with ChatGPT. That episode, which didn’t involve misusing OpenAI’s API the way ByteDance has, was a humiliating moment internally and led to the employees involved getting a slap on the wrist, I’m told.
Since Project Seed was kicked off inside ByteDance about a year ago, it has become a high-priority, secretive initiative. Employees working on it have to sign separate nondisclosure agreements, and information access within the project has become increasingly siloed over time. Zhang Yiming, ByteDance’s billionaire co-founder and former CEO, keeps close tabs on its progress.
The two main products being developed under Project Seed are Doubao, the consumer chatbot platform that is already live in China (and seemingly accessible outside of the country as well), and a separate business-focused bot platform that’s in development to be sold through ByteDance’s cloud division.
While employees have been told that the goal for Project Seed is to, like OpenAI, eventually build artificial general intelligence, the real goal seems to be becoming the ChatGPT of China as quickly as possible. The team has been ordered to match GPT-3.5’s performance by the end of this year and GPT-4’s by mid-2024. The current Seed model has around 200 billion parameters. For comparison, GPT-3.5 has 175 billion parameters. (OpenAI has yet to disclose this number for GPT-4.)
Project Seed has nothing to do with TikTok, at least for now, and is developed on China-based servers. Most of the team sits in China, though there are team members based in the US as well. The leader of the project is Zhu Wenjia, ByteDance’s head of search, who reports to Yang Zhenyuan, the company’s top engineering leader. Other key leaders include Qiao Mu under Wenjia and Xiang Liang, who leads the applied machine learning team.
While I’ve heard that OpenAI is working on the ability to identify the output of its API for potential misuse, the genie appears to be out of the bottle already. It’s unclear whether behavior like ByteDance’s could further inflame the high tensions that already exist between the US and China, which both consider AI a national security concern.
Another big question is what happens to the quality of information online when LLMs are increasingly helping build other LLMs. Since foundational models are already trained on nonfactual, human-created data, using them to build more LLMs could only amplify the hallucination problem. As Rao from Databricks put it to me, “You end up in this place where you’re completely unhinged.”
The watercooler
My notes on what else happened in the tech industry recently:
- My bingo card did not have Epic Games winning its antitrust case against Google but losing against Apple. Aside from the deleted chats, Google’s sweetheart deals to pay off potential rivals to the Play Store ended up looking really bad. Google made these deals because it had to, since Android isn’t technically locked down like iOS. If iOS were designed like Android to allow sideloading, I have no doubt Apple would have behaved the exact same way Google did. And if secret deals to pay off competition end up being what ultimately sank Google here, I don’t see how it’s going to be allowed to keep paying off Apple to be the default search engine on iPhones.
- Sam Altman is back on the press circuit and clearly set on trying to move past the details of his shock firing. While receiving his Time magazine “CEO of the year” award, he suggested it had something to do with concerns about “superintelligence,” which is a new explanation that I frankly don’t buy. “We’ve never been more unified” is something he needs to say, but the reality is that Ilya Sutskever is no longer in the building and is talking to the press through a lawyer. (I also got a kick out of Time owner Marc Benioff, who offered to match any OpenAI employee’s comp package if they jumped to Salesforce during Altman’s firing saga, in the audience recording Altman onstage from his phone.)
- It was a big week for Threads, with the launch of EU availability and the first public tests of integration with ActivityPub, the decentralized social media protocol that powers Mastodon. I understand the doubt some had that Meta would make good on its promise to federate, and as Meta has been telling people, the company knows it’s operating at a “trust deficit.” As Mark Zuckerberg explained to me a few months ago, he thinks we’re entering a new phase for social media: “Maybe for phase one of social networking, it was fine to have these systems that people felt a little more locked into, but I think for the mature state of the ecosystem, I don’t think that that’s going to be where it goes.”
- Apple is continuing to build up hype ahead of the release of the Vision Pro. Based on the reporting that Apple retail employees will start being trained on the headset in mid-January, it sounds like it’ll ship sooner than I expected. I’m very interested to see the front-facing display in action since that wasn’t working for my demo earlier this year, and it appears that no one outside of Apple has tried it yet.
People moves
Some interesting job moves I noticed this week:
- Tesla keeps losing its AI leaders. This time, it’s Tim Zaman, the former head of AI infrastructure who was also moonlighting as head of machine learning at X (formerly Twitter) for the past year. He’s now an engineer at Google DeepMind.
- Mad about your view counts? Kim Farrell is officially the head of creators at TikTok and is moving to Los Angeles.
- Sarah Franklin left her role as president and CMO of Salesforce to be CEO of Lattice.
- Quartz co-founder Zach Seward is the first-ever director of AI initiatives at The New York Times.
- Kevin McCarthy wants to work on AI with Elon Musk. Great.
- A move I should have caught a couple of weeks ago but still want to note: Javier Soltero, who oversaw the launching and killing of more messaging apps than I care to count at Google, is now running Canva’s enterprise division.
I’ll be back next week with at least one more issue before the holiday break. In the meantime, send me your feedback, tips, CES parties, and ideas for improving this newsletter in the new year. Thanks for subscribing.