OpenAI has always been at the forefront of LLMs, but they've never been great at generating images. At least until today's 4o image generation...
Thank you Posthog for sponsoring! Check them out at: soydev.link/posthog
SOURCE
x.com/OpenAI/status/1904602845221187829
T3 Chat (image generation coming soon™️): soydev.link/chat
Want to sponsor a video? Learn more here: soydev.link/sponsor-me
Check out my Twitch, Twitter, Discord more at t3.gg/
S/O Ph4se0n3 for the awesome edit 🙏
@figloalds
"Create a <describes Theo>" The AI: ** creates an image of Primeagen **
4 days ago | [YT] | 508
@DavidMishchenko
Looks like someone forgot to turn off night filter when making the thumbnail >_>
4 days ago | [YT] | 775
@frittex
new way to end up homeless after uni just dropped
4 days ago | [YT] | 407
@JeremyDawesJezweb
It’s like watching a progressively loading JPEG on dial up
4 days ago | [YT] | 242
@markopavlovic8750
Previous video: google won. This video: OpenAI is pretty much unmatched
4 days ago | [YT] | 145
@lisan_al_g4ib
I gave it this prompt: “Create a realistic image of a 100% full glass of wine, where the wine is on the verge of overflowing and only the surface tension of the wine is preventing it from spilling over the top. Literally no more empty space in the glass for more wine. This is a tricky task for AI image generation, so think long and hard about it first. Don’t mess this up.” And it actually worked!
4 days ago | [YT] | 113
@hjups
It's still partly diffusion. Most likely, the LLM first autoregressively generates a set of semantic embeddings, which are then decoded with diffusion (similar to Würstchen, but with 4o as the autoregressive stage-C model). Since it was trained multimodally, they may be using hierarchical CLIP image embeddings (e.g. 64 full-scale tokens + 4x64 overlapping patch tokens), which then go into the latent diffusion decoder. My guess is that the top-to-bottom effect is on the user side (a "big reveal" that also reduces the request rate); otherwise it wouldn't be possible to show a "preview" image. A blurry preview could be decoded from the CLIP image embeddings before they are passed to the diffusion model. On the other hand, if it were purely autoregressive, it might be similar to VAR, but that should be much faster and would eat too much of 4o's capacity (a diffusion decoder lets most of the "generation" capability live in an auxiliary model).
4 days ago | [YT] | 37
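The two-stage hypothesis in this comment (an autoregressive model emits semantic tokens, a diffusion decoder turns them into pixels, and a cheap blurry preview can be decoded from the tokens alone) can be sketched in toy form. Everything below is illustrative: the shapes, the denoising schedule, and every function name are assumptions for the sketch, not OpenAI's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, N_TOKENS, IMG_PIXELS, STEPS = 16, 64, 256, 10

def autoregressive_stage() -> np.ndarray:
    """Stage 1 (the '4o' role in this hypothesis): emit semantic
    embedding tokens one at a time, each conditioned on the last."""
    tokens = [rng.normal(size=EMB_DIM)]
    for _ in range(N_TOKENS - 1):
        # Toy "model": the next token is a noisy function of the previous one.
        tokens.append(0.9 * tokens[-1] + 0.1 * rng.normal(size=EMB_DIM))
    return np.stack(tokens)  # shape: (N_TOKENS, EMB_DIM)

def cheap_preview(embeddings: np.ndarray) -> np.ndarray:
    """A blurry preview decoded directly from the embeddings,
    without running the diffusion decoder at all."""
    # One coarse value per token, upsampled to image size.
    return np.repeat(embeddings.mean(axis=1), IMG_PIXELS // N_TOKENS)

def diffusion_decode(embeddings: np.ndarray) -> np.ndarray:
    """Stage 2: start from pure noise and iteratively denoise toward
    an image, conditioned on the semantic embeddings."""
    cond = cheap_preview(embeddings)        # conditioning signal
    x = rng.normal(size=IMG_PIXELS)         # pure noise
    for t in range(STEPS):
        alpha = (t + 1) / STEPS             # toy denoising schedule
        x = (1 - alpha) * x + alpha * cond  # toy denoising step
    return x

emb = autoregressive_stage()
img = diffusion_decode(emb)
```

The point of the structure: the expensive, "smart" part of generation lives in stage 1, so the diffusion decoder can stay small, and a preview is available before the decoder ever runs.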
@ShaharHarshuv
Interesting thing about the text: it seems to only be able to render it as if a font were being rendered. I tried to make a "broken sign" and it couldn't make half of the text misaligned with the rest.
4 days ago | [YT] | 16
@0x.rorschach
The craziest part is they're using the 4o model to do the diffusion. I wonder if 4o is generating the picture the same way it does text, producing an SVG-like image that's just an array of pixel data. That would explain the way it renders.
4 days ago | [YT] | 8
@yankotliarov9239
I'll wait for the Fireship video on it
4 days ago | [YT] | 149
@mrgyani
You need to start the podcast with - "As a $400 haircut user, I need to pay my bills. Today's sponsors are.."
4 days ago | [YT] | 10
@Ownedyou
13:38 I am laughing my ass off! Now the OpaBI makes sense - the couple is bisexual and the girl in blue is checking out the girl in red too!!
4 days ago | [YT] | 9
@KevinDay
27:18 "A little choppiness in his ear..." 😐..
4 days ago | [YT] | 23
@0xLostInCode
Honestly, people are gonna forget about this model in a couple of weeks.
4 days ago | [YT] | 11
@CloakDev
This is human concurrency at its peak. I strive to be this efficient. Me: Let me tell you about the new image processing (waits until the image processes and explains what's happening in the meantime). Theo: Let me tell you about the new image processing. While that's going, let me spin up another example with the text updated. Let's also spin up a DALL-E example. While that's going, let's also read this article and react to a video.
4 days ago (edited) | [YT] | 4
@TristanWayne-h1j
All of these image generators are just party tricks if there's no consistency: consistent characters, sets, vehicles, weapons, and landscapes. You can't do long-form content with random images.
4 days ago (edited) | [YT] | 24
@theDanielJLewis
Shoutout to Posthog! Thanks to their session-recording feature, I was recently able to prove that a user did access something they claimed they never got access to.
4 days ago | [YT] | 11
@Bub24
I am still surprised that no one talks about how dangerous this new image generation is. The internet will be full of AI-generated images that are hard to tell apart from real ones. It is scary.
4 days ago | [YT] | 18
@solmateusbraga
Does anyone know the name of the app/site he uses to sketch?
4 days ago | [YT] | 7