OpenAI upgrades Sora and rolls it out in ChatGPT


OpenAI is integrating Sora’s image generation capabilities directly into ChatGPT starting today, in a feature dubbed “Images in ChatGPT.” While Sora was previously only accessible through a separate website, users can now use it to generate images within ChatGPT itself.

Sora was announced as an AI-powered video generator, but this initial release focuses solely on image creation and will be available across ChatGPT Plus, Pro, Team, and Free subscription tiers. The free tier’s usage limit is the same as DALL-E’s, spokesperson Taya Christianson told The Verge, adding that the company “didn’t have a specific number to share” and that “these may change over time based on demand.” Per the ChatGPT FAQ, free users were previously able to generate “three images per day with DALL·E 3.” As for the fate of DALL-E, Christianson said “fans” will “still have access via a custom GPT.”

“This model is a step change above previous models,” research lead Gabriel Goh told The Verge, adding that the team used the GPT-4o “omnimodal” foundation (a model that can generate any kind of data, such as text, image, audio, and video) for this iteration of Sora.

Some of the improvements Goh noted include “binding,” which refers to how well AI image generators maintain correct relationships between attributes and objects; a model with poor binding, for instance, might get a prompt for a blue star plus a red triangle and create a red star and no triangle. Most image models struggle with this, Goh said, often mixing up colors and shapes when asked to render multiple items — typically around 5 to 8. He says Sora’s new image generation can correctly bind attributes for 15 to 20 objects without confusion, representing a significant improvement in accuracy and reliability.
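To make the idea of binding concrete, here is a minimal sketch of how one might score it, assuming the objects in a generated image have already been identified as (color, shape) pairs. The function name `binding_accuracy` and the scoring rule are illustrative assumptions, not anything OpenAI has described.

```python
from collections import Counter


def binding_accuracy(requested, rendered):
    """Return the fraction of requested (color, shape) pairs found in the image.

    Hypothetical metric for illustration only: each rendered object can
    satisfy at most one requested pair.
    """
    if not requested:
        return 1.0
    remaining = Counter(rendered)  # rendered objects not yet matched
    hits = 0
    for pair in requested:
        if remaining[pair] > 0:
            remaining[pair] -= 1
            hits += 1
    return hits / len(requested)


# The article's example: the prompt asks for a blue star and a red triangle,
# but the model draws a red star and no triangle.
requested = [("blue", "star"), ("red", "triangle")]
rendered = [("red", "star")]
score = binding_accuracy(requested, rendered)
print(score)
```

In this toy scoring, the red star counts for nothing: the color and the shape each appear in the prompt, but not attached to the same object, which is exactly the failure that poor binding describes.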

An example of Sora’s “binding” capabilities: an image with multiple colored shapes, numbers, patterns, and a cursive OpenAI. Image: OpenAI

Users will also notice an improvement in text rendering, which makes it easier to generate coherent text without typos on an image (in existing tools, you’ll often notice that text gets garbled pretty easily). Getting text rendering right was a significant challenge, Goh said. If small titles or text elements have typos or errors, the entire image can become unusable.

“This was just like a process of iteration that took many, many months to get right,” Goh said. While not perfect, he said, the team reached a point where the text quality is consistently usable, though it still tends to stumble on very small text. “It’s been just many months of small improvements.”

The system uses an autoregressive approach — generating images sequentially from left to right and top to bottom, similar to how text is written — rather than the diffusion model technique used by most image generators (like DALL-E) that create the entire image at once. Goh speculates that this technical difference could be what gives Sora better text rendering and binding capabilities.
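The raster-order generation described above can be sketched with a toy example. This is only an illustration under stated assumptions: the image is a small grid of integer tokens, and `toy_next_token` is a hypothetical stand-in for a learned model. OpenAI has not published how its system tokenizes or samples images.

```python
import random


def generate_raster(height, width, next_token, seed=0):
    """Fill a grid one cell at a time, left to right and top to bottom.

    Each new token is chosen from everything generated so far, which is
    what makes the process autoregressive.
    """
    rng = random.Random(seed)
    tokens = []  # flat list of tokens in raster order
    order = []   # (row, col) positions in the order they were filled
    for row in range(height):
        for col in range(width):
            tokens.append(next_token(tokens, rng))
            order.append((row, col))
    grid = [tokens[r * width:(r + 1) * width] for r in range(height)]
    return grid, order


def toy_next_token(prefix, rng):
    """Hypothetical model: usually repeat the previous token, else pick one of 4."""
    if prefix and rng.random() < 0.7:
        return prefix[-1]
    return rng.randrange(4)


grid, order = generate_raster(3, 4, toy_next_token)
for row in grid:
    print(row)
```

A diffusion model, by contrast, would start from noise over the whole grid and refine every cell together across several steps, rather than committing to cells one after another.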

An example of Sora’s ability to generate coherent text: an AI-generated image showing the four most popular cocktails, with the ingredients to make them. Image: OpenAI

In a briefing before the feature launch, the team demonstrated several examples showing the system’s capabilities, including scientific diagrams like Newton’s prism experiment with correctly labeled components, multi-panel comics with consistent characters and text bubbles, and informational posters with accurate text. They also highlighted practical applications like creating transparent background images for stickers, restaurant menus, and logos.

“If I go to draw an image, I do so with the limitation of my own skill… but also with all of the knowledge of the world that I’ve built up,” ChatGPT multimodal product lead Jackie Shannon explained. “The model brings world knowledge to the equation, so when you ask for an image of Newton’s prism experiment, you don’t have to explain what that is to get an image back.”

The new system does take longer to generate images than before, though OpenAI suggests this is a worthwhile tradeoff. “While we certainly have room to improve on latency…the quality of these images, the capability, the world knowledge, really makes up for the additional seconds that they’ll spend waiting,” Shannon said.

An AI-generated image of Newton’s prism experiment rendered on a notepad in Washington Square Park. Image: OpenAI

When asked about safeguards, with reference to the infamous nude deepfakes of Taylor Swift generated using a Microsoft model, the ability of xAI’s Grok to render Kamala Harris with a gun, and Google Gemini’s knack for removing watermarks, the OpenAI team emphasized that the system includes robust safeguards to prevent misuse. Shannon said the tool prevents watermark removal, blocks generation of sexual deepfakes, and refuses CSAM generation requests.

OpenAI’s new image generation system doesn’t include visual watermarks or indicators showing images are AI-generated. However, Shannon explained that “all of our generated images will include standard C2PA metadata to mark the image as having been created by OpenAI” and the company “will have some internal tooling to be able to look up images as well.”

“Ultimately, no system is perfect for this type of thing, but we’re continuously improving our safeguards and we think of this as a starting point,” Shannon added. “One thing that’s true about all of the images generated from ChatGPT is that the user owns them and are free to use them within the bounds of our usage policies as they would like.”


