Large artificial intelligence models will only get “crazier and crazier” unless more is done to control what information they are trained on, according to the founder of one of the UK’s leading AI start-ups.
Emad Mostaque, CEO of Stability AI, argues that continuing to train large language models like OpenAI's GPT-4 and Google's LaMDA on what is effectively the entire internet is making them too unpredictable and potentially dangerous.
“The labs themselves say this could pose an existential threat to humanity,” said Mr Mostaque.
On Tuesday the head of OpenAI, Sam Altman, told the United States Congress that the technology could “go quite wrong” and called for regulation.
Today Sir Antony Seldon, headteacher of Epsom College, told Sky News's Sophy Ridge on Sunday that AI could be "invidious and dangerous".
“When the people making [the models] say that, we should probably have an open discussion about that,” added Mr Mostaque.
But AI developers like Stability AI may have little choice but to join that discussion. Much of the data used to train their powerful text-to-image AI products was also "scraped" from the internet.
That includes millions of copyrighted images, which have led to legal action against the company – as well as big questions about who ultimately "owns" the products that image- or text-generating AI systems create.
His firm collaborated on the development of Stable Diffusion, one of the leading text-to-image AIs. Stability AI has just launched a new model called Deep Floyd that it claims is the most advanced image-generating AI yet.
A necessary step in making the AI safe, explained Daria Bakshandaeva, senior researcher at Stability AI, was to remove illegal, violent and pornographic images from the training data.
If the AI sees harmful or explicit images during its training, it could recreate them in its output. To avoid this, the developers remove these images from the training data, so the AI cannot “imagine” how they would look.
But it still took two billion images from online sources to train it. Stability AI says it is actively working on new datasets to train AI models that respect people’s rights to their data.
Stability AI is being sued in the US by photo agency Getty Images for using 12 million of its images as part of the dataset used to train its model. Stability AI has responded that rules around "fair use" of the images mean no copyright has been infringed.
But the concern isn't just about copyright. An increasing amount of the data available on the web, whether pictures, text or computer code, is itself being generated by AI.
“If you look at coding, 50% of all the code generated now is AI generated, which is an amazing shift in just over one year or 18 months,” said Mr Mostaque.
And text-generating AIs are creating increasing amounts of online content, even news reports.
US company NewsGuard, which verifies online content, recently found 49 almost entirely AI-generated "fake news" websites online being used to drive clicks to advertising content.
"We remain really concerned about an average internet user's ability to find information and know that it is accurate information," said Matt Skibinski, managing director at NewsGuard.
AIs risk polluting the web with content that is deliberately misleading and harmful, or just rubbish. It's not that people haven't been doing that for years; it's that now AIs might end up being trained on data scraped from the web that other AIs have created.
All the more reason to think hard now about what data we use to train even more powerful AIs.
"Don't feed them junk food," said Mr Mostaque. "We can have better free-range, organic models right now. Otherwise, they'll become crazier and crazier."
A good place to start, he argues, is training AIs on data, whether text, images or medical records, that is more specific to the users they are being built for. Right now, most AIs are designed and trained in California.
“I think we need our own datasets or our own models to reflect the diversity of humanity,” said Mr Mostaque.
“I think that will be safer as well. I think they’ll be more aligned with human values than just having a very limited data set and a very limited set of experiences that are only available to the richest people in the world.”