minimaxir/gemimg

gemimg

gemimg is a lightweight (<400 LoC) Python package for easily interfacing with Google's Gemini API and the Gemini 2.5 Flash Image model (a.k.a. Nano Banana) with robust features. This tool:

  • Creates images in many aspect ratios with only a few lines of code!
  • Has minimal dependencies, and does not use Google's Client SDK.
  • Handles image I/O, including multi-image I/O and image encoding/decoding.
  • Generates images only: no irrelevant text output.
  • Provides utilities for common use cases, such as saving, resizing, and compositing multiple images.

Although Gemini 2.5 Flash Image can be used for free in Google AI Studio or Google Gemini, those interfaces place a visible watermark on their outputs and have generation limits. Using gemimg and the Gemini API directly gives you more programmatic control over the generation and makes complex inputs much easier to work with, which increases productivity for power users.

Installation

gemimg can be installed from PyPI:

pip3 install gemimg
uv pip install gemimg

Demo

First, you will need a Gemini API key, either from a GCP project which has billing information or an applicable free API key.

from gemimg import GemImg

g = GemImg(api_key="AI...")

You can also pass the API key by storing it in an .env file with a GEMINI_API_KEY field in the working directory (recommended), or by setting the environment variable of GEMINI_API_KEY directly to the API key.
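As a rough illustration of the two lookup options above, here is a stdlib-only sketch of resolving the key. The helper name load_api_key and the "environment variable first, then .env" precedence are assumptions made for this example; they are not necessarily how gemimg resolves the key internally.

```python
import os
from pathlib import Path


def load_api_key(env_path=".env", var="GEMINI_API_KEY"):
    """Return the key from the environment, falling back to a .env file.

    Hypothetical helper for illustration; not part of gemimg.
    """
    # Assumption: an explicitly set environment variable takes precedence.
    value = os.environ.get(var)
    if value:
        return value
    path = Path(env_path)
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, raw = line.partition("=")
        if name.strip() == var:
            # Drop optional surrounding quotes around the value.
            return raw.strip().strip("'\"")
    return None
```

If the function returns None, neither source provided a key, so one would have to be passed explicitly via api_key.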

Now, you can generate images with a simple text prompt!

gen = g.generate("A kitten with prominent purple-and-green fur.")

The generated image is stored as a PIL.Image object and can be retrieved with gen.image for passing again to Gemini 2.5 Flash Image for further edits. By default, generate() also automatically saves the generated image as a PNG file in the current working directory. You can save a WEBP instead by specifying webp=True, change the save directory by specifying save_dir, or disable the saving behavior with save=False.
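The save-related options described above can be sketched as follows. The wrapper function and the "outputs" directory name are placeholders for this example; only the webp, save_dir, and save keyword arguments come from the description above, and whether save_dir must already exist is not covered here.

```python
def generate_with_save_options(g, prompt):
    """Call generate() three ways to show the save-related options.

    `g` is expected to be a GemImg instance; this wrapper is illustrative only.
    Note that it issues three separate generation requests.
    """
    # Save a WEBP instead of the default PNG.
    as_webp = g.generate(prompt, webp=True)
    # Save into a specific directory ("outputs" is a placeholder name).
    in_dir = g.generate(prompt, save_dir="outputs")
    # Skip saving entirely and keep only the in-memory result.
    unsaved = g.generate(prompt, save=False)
    return as_webp, in_dir, unsaved
```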

Due to Gemini 2.5 Flash Image's multimodal text encoder, you can create nuanced prompts including details and positioning that are not as consistent in Flux or Midjourney:

prompt = """
Create an image of a three-dimensional pancake in the shape of a skull, garnished on top with blueberries and maple syrup.
"""

gen = g.generate(prompt)

Gemini 2.5 Flash Image allows you to make highly-targeted edits to images. With gemimg, you can pass along the image you just generated very easily for editing.

edit_prompt = """
Make ALL of the following edits to the image:
- Put a strawberry in the left eye socket.
- Put a blackberry in the right eye socket.
- Put a mint garnish on top of the pancake.
- Change the plate to a plate-shaped chocolate-chip cookie.
- Add happy people to the background.
"""

gen_edit = g.generate(edit_prompt, gen.image)

You may have noticed from the previous example that the prompt input is a Markdown dashed list. Because it is built on Gemini's text encoder, Nano Banana is extremely responsive to Markdown formatting compared to the older text encoders used in traditional image generation models. You can prompt engineer highly nuanced subject and compositional requirements, and Nano Banana follows them with very high accuracy:

prompt = """
Create an image featuring three specific kittens in three specific positions.

All of the kittens MUST follow these descriptions EXACTLY:
- Left: a kitten with prominent black-and-silver fur, wearing both blue denim overalls and a blue plain denim baseball hat.
- Middle: a kitten with prominent white-and-gold fur and prominent gold-colored long goatee facial hair, wearing a 24k-carat golden monocle.
- Right: a kitten with prominent #9F2B68-and-#00FF00 fur, wearing a San Francisco Giants sports jersey.

Aspects of the image composition that MUST be followed EXACTLY:
- All kittens MUST be positioned according to the "rule of thirds" both horizontally and vertically.
- All kittens MUST lay prone, facing the camera.
- All kittens MUST have heterochromatic eye colors matching their two specified fur colors.
- The image is shot on top of a bed in a multimillion-dollar Victorian mansion.
- The image is a Pulitzer Prize winning cover photo for The New York Times with neutral diffuse 3PM lighting for both the subjects and background that complement each other.
- NEVER include any text, watermarks, or line overlays.
"""

gen = g.generate(prompt, aspect_ratio="16:9")
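Since the prompts above are plain strings containing Markdown dashed lists, they can also be assembled from Python lists. This is a small sketch of that idea; build_list_prompt is a hypothetical helper written for this example and is not part of gemimg.

```python
def build_list_prompt(instruction, items):
    """Join an instruction line and a Markdown dashed list into one prompt."""
    # One "- item" line per non-empty entry.
    bullets = "\n".join(f"- {item.strip()}" for item in items if item.strip())
    return f"{instruction.strip()}\n{bullets}\n"


edit_prompt = build_list_prompt(
    "Make ALL of the following edits to the image:",
    [
        "Put a strawberry in the left eye socket.",
        "Put a blackberry in the right eye socket.",
    ],
)
```

The resulting edit_prompt string can then be passed to g.generate() like any handwritten prompt.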

You can also input two (or more!) images/image paths to do things like combine images or put an object from Image A into Image B without having to train a LoRA. For example, here's a mirror selfie of myself, and a fantasy lava pool generated with gemimg that beckons me to claim its power:

edit_prompt = """
Make the person in the first image stand waist-deep in the lava of the second image. The person's arms are raised high in cheer.

The lighting of the person must match that of the second image.
"""

gen = g.generate(edit_prompt, ["max_woolf.webp", gen_volcano.image])

You can also guide the generation with an input image, similar to ControlNet implementations. Giving Gemini 2.5 Flash Image this handmade input drawing and prompt:

prompt = """
Generate an image of characters playing a poker game sitting at a green felt table, directly facing the front. This new image MUST map ALL of the following characters to the poses and facial expressions represented by the specified colors of the provided image:
- Green: Spongebob SquarePants
- Red: Shadow the Hedgehog
- Purple: Pedro Pascal
- Pink: Taylor Swift
- Blue: The Mona Lisa
- Yellow: Evangelion Unit-01 from "Neon Genesis Evangelion"

The image is an award-winning highly-detailed painting, oil on oaken canvas. All characters MUST adhere to the oil on oaken canvas artistic style, even if this varies from their typical styles. All characters must be present individually in the image.
"""

gen = g.generate(prompt, "pose_control_base.png")

Jupyter Notebook which randomizes the character order.

This is just the tip of the iceberg of things you can do with Gemini 2.5 Flash Image (a blog post is coming shortly). By leveraging Gemini 2.5 Flash Image's long context window, you can even give it HTML and have it render a webpage (Jupyter Notebook). And that's not even getting into JSON prompting of the model, which can offer extremely granular control of the generation. (Jupyter Notebook)
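To give a rough idea of what JSON prompting can look like, the sketch below serializes a Python dict into the prompt text. The field names (subject, pose, setting, constraints) and the framing sentence are invented for this example; the linked notebook may structure its JSON differently.

```python
import json

# Hypothetical scene description; these keys are illustrative, not a schema
# defined by gemimg or the Gemini API.
scene = {
    "subject": "a kitten with prominent purple-and-green fur",
    "pose": "lying prone, facing the camera",
    "setting": "on top of a bed",
    "constraints": ["no text", "no watermarks"],
}

# The JSON is embedded in an ordinary text prompt.
prompt = "Generate an image matching this JSON description:\n" + json.dumps(
    scene, indent=2
)
```

The prompt is still just a string, so it is passed to g.generate() the same way as the other examples.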

Gemini 2.5 Flash Image Model Notes

  • Gemini 2.5 Flash Image cannot do style transfer, e.g. turn me into Studio Ghibli, and seems to ignore commands that try to do so. Google's developer documentation example of style transfer unintentionally demonstrates this by incorrectly applying the specified style. The only way to shift the style is to generate a completely new image in that style, which can still have mixed results if the source style is intrinsic.
    • This also causes issues with the "put subject from Image A into Image B" use case if either image is in a substantially different style.
  • Gemini 2.5 Flash Image does have moderation in the form of both prompt moderation and post-generation image moderation, although it's more lenient than typical for Google's services. In the former case, gen.text will indicate the refusal reason. In the latter case, a PROHIBITED_CONTENT error will be thrown.
  • Gemini 2.5 Flash Image is unsurprisingly bad at free-form text generation, both in terms of text fidelity and frequency of typos. However, a workaround is to provide the rendered text as an input image, and ask the model to composite it with another image.
  • Yes, both a) LLM-style prompt engineering with Markdown-formatted lists and b) old-school AI image style quality syntactic sugar such as award-winning and DSLR camera are extremely effective with Gemini 2.5 Flash Image, due to its text encoder and larger training dataset, which can now more accurately discriminate which specific image traits are present in an award-winning image and which aren't. I've tried generations both with and without those tricks and the tricks definitely have an impact. Google's developer documentation encourages the latter.
  • Cherry-picking outputs, in the sense that multiple generations with the same prompt are needed to get one good output, is surprisingly minimal for an image-generation model, and Gemini 2.5 Flash Image tends to correctly interpret the intent on the first try. Any obvious logical mistakes are consistently fixed with more prompt engineering. Most superfluous prompts you see in the examples are cases where such a fix is applied.
  • Although the Gemini 2.5 Flash Image API schema suggests that it supports system prompts, they don't appear to have any impact on the resulting output, so they are not supported in this package.
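The two moderation outcomes mentioned in the notes above could be handled along these lines. This is a sketch under several assumptions: the exception type is not specified here, so it catches Exception and looks for the PROHIBITED_CONTENT string in the message, and it assumes gen.image is None when the prompt was refused. Check both against the package before relying on them.

```python
def generate_or_explain(g, prompt):
    """Return (image, reason); reason is None when generation succeeded.

    Hypothetical wrapper around a GemImg instance `g`, for illustration only.
    """
    try:
        gen = g.generate(prompt)
    except Exception as exc:  # Assumption: the exact exception type is unknown.
        if "PROHIBITED_CONTENT" in str(exc):
            return None, "blocked by post-generation image moderation"
        raise  # Anything else is an unrelated failure.
    image = getattr(gen, "image", None)
    if image is None:
        # Assumption: a prompt refusal yields no image; gen.text is described
        # above as carrying the refusal reason.
        return None, getattr(gen, "text", None)
    return image, None
```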

Miscellaneous Notes

  • gemimg is intended to be bespoke and very tightly scoped. Compatibility for other image generation APIs and/or endpoints will not be supported, unless they follow an identical API (i.e. a hypothetical gemini-3-flash-image). As this repository is designed to be future-proof, there likely will not be many updates other than bug/compatibility fixes.
  • gemimg intentionally does not support true multiturn conversations within a single conversational thread as:
    1. The technical lift for doing so would no longer make this package lightweight
    2. It is unclear if it's actually better for the typical use cases.
  • gemimg intentionally does not support text output (and therefore the "interweaving" use case from the API examples) because:
    1. Text output slows down image generation, and image generation is the purpose of this package.
    2. Text output can cause the model to rethink aspects of the generations, which adds undesirable entropy to the prompt.
    3. Interweaving has the same issues as generating multiple images in a single call and is unreliable.
  • By default, input images to generate() are resized such that their max dimension is 768px while maintaining the aspect ratio. This is done a) as a sanity safeguard against providing a massive image and b) because Gemini processes images in tiles of 768x768px, so this forces the input to be 1 tile, which should lower costs and improve consistency. If you want to disable this behavior, set resize_inputs=False.
  • Do not question my example image prompts. I assure you, there is a specific reason or objective for every model input and prompt engineering trick. There is a method to my madness...although for this particular project I confess it's more madness than method.
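The default input resizing described in the notes above (max dimension of 768px, aspect ratio preserved) comes down to a small piece of arithmetic. The sketch below only computes the target size; fit_max_dimension is a hypothetical name, and leaving already-small images untouched and rounding to the nearest pixel are assumptions, not confirmed gemimg behavior.

```python
def fit_max_dimension(width, height, max_dim=768):
    """Scale (width, height) so the longer side is at most max_dim."""
    longest = max(width, height)
    if longest <= max_dim:
        # Assumption: images already within the limit are left as-is.
        return width, height
    # The same scale factor on both sides preserves the aspect ratio.
    scale = max_dim / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_max_dimension(1536, 1024))  # (768, 512)
print(fit_max_dimension(640, 480))    # (640, 480)
```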

Roadmap

  • Async support (for parallel calls and FastAPI support)
  • Additional model parameters if the Gemini API supports them.

Maintainer/Creator

Max Woolf (@minimaxir)

Max's open-source projects are supported by his Patreon and GitHub Sponsors. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.

License

MIT

About

Lightweight wrapper for generating and editing images from Gemini 2.5 Flash Image/Nano Banana
