
Four takeaways from the race to accumulate data for AI | GuyWhoKnowsThings


Online data has long been a valuable commodity. For years, Meta and Google have used data to target their online advertising. Netflix and Spotify have used it to recommend more movies and music. Political candidates have turned to data to learn which groups of voters to focus their attention on.

Over the past 18 months, it has become increasingly clear that digital data is also crucial in the development of artificial intelligence. This is what you should know.

The success of AI depends on data. This is because AI models become more accurate and more humanlike with more data.

In the same way that a student learns by reading more books, essays, and other information, large language models (the systems that underlie chatbots) also become more accurate and powerful if they are fed more data.

Some large language models, such as OpenAI's GPT-3, released in 2020, were trained on hundreds of billions of “tokens,” which are essentially words or fragments of words. More recent large language models were trained on more than three trillion tokens.

Technology companies are using up publicly available online data to develop their AI models faster than new data is being produced. According to one prediction, high-quality digital data will run out by 2026.

In the race for more data, OpenAI, Google, and Meta are turning to new tools, changing their terms of service, and engaging in internal debates.

At OpenAI, researchers created a program in 2021 that converted audio from YouTube videos into text and then fed the transcripts into one of its AI models, going against YouTube's terms of service, people with knowledge of the matter said.

(The New York Times has sued OpenAI and Microsoft for using copyrighted news articles without permission for AI development. OpenAI and Microsoft have said they used news articles in transformative ways that did not violate copyright law.)

Google, which owns YouTube, also used YouTube data to develop its artificial intelligence models, entering a legal gray area of copyright, people with knowledge of the action said. And Google revised its privacy policy last year so it could use publicly available material to develop more artificial intelligence products.

At Meta, executives and lawyers last year debated how to get more data for AI development and discussed purchasing a major publisher like Simon & Schuster. In private meetings, they weighed including copyrighted works in their AI model, even if it meant they would be sued later, according to recordings of the meetings, which were obtained by The Times.

OpenAI, Google and other companies are exploring using their AI to create more data. The result would be what is known as “synthetic” data. The idea is that AI models generate new text that can then be used to build better AI.

Synthetic data is risky because AI models can make mistakes. Relying on that data can exacerbate those errors.

