The term “natively multimodal” has been making waves in the AI community for over a year, yet companies have only recently begun to fully harness the multimodal capabilities of their AI models. Google has now unveiled its latest “Gemini 2.0 Flash Experimental” model, which includes the ability to generate and edit images directly.
You may be asking yourself, what’s the fuss about image generation? True, AI-generated images have been a feature in many popular chatbots like ChatGPT for some time. However, image generation in platforms like ChatGPT or Gemini typically involves sending prompts to specialized diffusion models such as DALL-E 3 or Imagen 3. These models are specifically trained to create images and function as add-ons to the primary AI model, rather than being integrated within it.
In contrast, vision-language models like Gemini are inherently multimodal: they can understand, create, and alter both text and images natively. Until now, no tech company had offered this level of functionality to users. OpenAI previewed native image generation with GPT-4o in 2024, but the feature was never made publicly available.
With native image generation, you get better consistency: because the multimodal model is trained on extensive datasets spanning text and images together, it brings a stronger grasp of concepts and a broader general knowledge base to the pictures it produces.
In addition to generating images, you can effortlessly edit them using simple prompts. For instance, you can upload an image and ask the model to add sunglasses, insert legible text, remove objects, and more. Unlike diffusion models, which regenerate the entire image with each request, natively multimodal models preserve the rest of the scene across successive edits.
Native Image Generation with Gemini 2.0 Flash Experimental
As of now, native image generation has not rolled out to the general public. The Gemini 2.0 Flash Experimental model with this capability can only be accessed through Google’s AI Studio, at no cost.
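If you would rather experiment programmatically than click around the AI Studio interface, the sketch below shows how image output is typically requested through Google’s google-genai Python SDK. Treat it as a minimal example under assumptions, not an official recipe: the model ID (gemini-2.0-flash-exp), the response_modalities setting, and whether the experimental model is reachable via the API at all may differ from what Google ships.

```python
# pip install google-genai pillow
from io import BytesIO

from google import genai
from google.genai import types
from PIL import Image

# Assumption: an API key created in AI Studio and the experimental model ID.
client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # assumed model ID; may change
    contents="Illustrate the steps for making an omelet, one image per step.",
    # Ask the model to return both text and image parts in a single response.
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# The response interleaves text and inline image parts.
for i, part in enumerate(response.candidates[0].content.parts):
    if part.text is not None:
        print(part.text)
    elif part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save(f"step_{i}.png")
```

Because the text and the images come back from the same model in one response, the step captions and the pictures stay aligned, which is exactly the consistency the examples below rely on.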
Having tried out the model on AI Studio, I found it to be a thrilling experience. To begin, I created a visual guide showcasing the consistency of Gemini’s image generation capabilities by asking it to illustrate the steps for making an omelet, generating an image for each step.
The results were impressively consistent, with no noticeable glitches. Even small details, like the bowl, remained the same between images. The images can be downloaded in a resolution of 1024 x 680, allowing you to produce visual guides on a variety of topics.
Next, I requested Gemini to create an aesthetically pleasing table and then to display the table from a central camera angle. It executed this task flawlessly. I then asked Gemini to add a PlayStation to the table and give me a closer look. Once again, it delivered beautifully, capturing the PS5’s reflection in a nearby mirror.
Native Image Editing with Gemini 2.0 Flash Experimental
To showcase Gemini’s image editing feature, I uploaded an image from my gallery and asked Gemini 2.0 to remove a wine glass from the table. Next, I had it add mushrooms to a pizza and was impressed by the outcome. When I then asked it to include a croissant, it delivered once again, showing what AI image editing can do with Gemini’s native multimodal capabilities.
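The same kind of edit can be scripted by passing the source photo alongside the instruction. Here is a hedged sketch using the same SDK as above; the model ID, API availability, and the local file name are assumptions for illustration.

```python
# pip install google-genai pillow
from io import BytesIO

from google import genai
from google.genai import types
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")
source = Image.open("dinner_table.jpg")  # hypothetical local photo

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # assumed model ID; may change
    # Send the photo and the edit instruction together; the model returns an
    # edited version of the same scene instead of a brand-new image.
    contents=[source, "Remove the wine glass from the table."],
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save("edited.png")
```
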
Then, I uploaded a personal image and asked Gemini to add sunglasses, followed by text that read “Beebom” on my shirt. Both requests were executed adeptly.
Lastly, I asked Gemini to colorize an image, which it executed beautifully. The end result was even more stunning than the original, free from glitches or distortions.

There are countless possibilities you can explore with Gemini’s new multimodal capabilities. Google has done an impressive job integrating native image generation and editing, and I plan to use it extensively in the upcoming weeks to push its boundaries.
