Pricing
- Input tokens: 0.08
- Output tokens: 0.09
- Context window: pending information
- Throughput: pending information
Image generation is powered by DALL-E 2, a generative model built on a diffusion architecture. It leverages CLIP (Contrastive Language-Image Pre-training) to map text inputs into a shared visual-semantic space. Generation follows a two-stage process, the approach known as unCLIP: a 'prior' generates a CLIP image embedding from the text caption, and a diffusion 'decoder' then produces the final image from that embedding. This architecture enables the model to capture complex relationships between objects and generate coherent, high-resolution visuals.
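The two-stage flow above can be sketched in code. This is a minimal illustrative stub, not the real DALL-E 2 networks: the embedding size, the `prior` and `decoder` transforms, and all function names are assumptions made purely to show the text-embedding → image-embedding → image pipeline shape.

```python
# Illustrative sketch of the two-stage unCLIP pipeline: caption -> CLIP text
# embedding -> prior (CLIP image embedding) -> diffusion decoder (image).
# All components below are stand-in stubs, not the actual models.
import random

EMBED_DIM = 4  # real CLIP embeddings are far larger (e.g. 768-d)


def clip_text_embed(caption: str) -> list[float]:
    """Stub CLIP text encoder: deterministic pseudo-embedding per caption."""
    rng = random.Random(caption)
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def prior(text_emb: list[float]) -> list[float]:
    """Stage 1: map the text embedding to a CLIP *image* embedding.
    (Placeholder linear transform; the real prior is a learned model.)"""
    return [0.5 * x for x in text_emb]


def decoder(image_emb: list[float], size: int = 2) -> list[list[float]]:
    """Stage 2: diffusion decoder stub that 'renders' pixels from the
    image embedding. Here it just fills a grid with the embedding mean."""
    mean = sum(image_emb) / len(image_emb)
    return [[mean for _ in range(size)] for _ in range(size)]


def generate(caption: str) -> list[list[float]]:
    """Full pipeline: text -> prior -> decoder -> image grid."""
    return decoder(prior(clip_text_embed(caption)))


image = generate("a corgi playing a trumpet")
print(len(image), len(image[0]))  # grid dimensions of the stub 'image'
```

The key design point the sketch mirrors is the decoupling: the prior and decoder are trained separately and communicate only through the CLIP image-embedding space, which is what lets the decoder generate varied images consistent with one semantic embedding.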