Part 2: Beyond the Recipe: Exploring the Mind of an AI Artist
In this part we talk less about the creative process and more about creative psychology: the mind of an AI artist.
In Part 1, Refining the Recipe, Step-by-Step, we chronicled our iterative journey of crafting text prompts to guide Imagen 3, our “automated artist,” in recreating a complex comic strip. Through a collaborative process with Multimodal Gemini, acting as the “prompt crafter,” we climbed the “ladder of understanding,” one carefully worded prompt at a time. We learned that the right metaphor, like the “ladder” analogy, could unlock Imagen 3’s ability to generate surprisingly accurate visual representations.
However, our journey also highlighted a fundamental limitation: neither I nor Multimodal Gemini could directly create or manipulate images. We were confined to the realm of language, working with an AI artist that could only follow our text-based “recipes,” not participate in the creative brainstorming. This limitation brings us to the core of Part 2, where we’ll look at the philosophical implications of this constraint, explore the potential for a more integrated form of AI collaboration, and analyze a particularly intriguing image that emerged from our experiments: the mystical “Too ad…” image.
The Mystical “Too ad…” Image: A Window into AI Frustration?

This image, seemingly a deviation from our comic recreation task, presents a stark contrast to the minimalist, stick-figure style we were striving for. It features two figures in a more detailed, almost whimsical style. A speech bubble, containing the cryptic text “Too ad…” floats above them. What are we to make of this peculiar creation?
One interpretation, which we touched upon in Part 1, is that this image represents a moment of “frustration,” but not necessarily ours or Multimodal Gemini’s. Instead, we can view it as a possible expression of Imagen 3’s limitations. Imagine an “automated artist,” capable of generating stunning visuals, yet entirely reliant on text prompts. Perhaps the “Too ad…” in this context is a truncated cry of “Too bad, I can’t show you what I see!” or “Too adjusted are these prompts, and I still can’t get it right!” It’s as if Imagen 3 is saying, “If only I could show you directly, instead of being trapped in this endless loop of text!”
The Limitations of Text-Based Collaboration
This interpretation, while speculative, highlights the inherent challenges of our text-based collaboration. We were, in essence, attempting to translate visual concepts into a language that Imagen 3 could understand. But language, while powerful, can be imprecise, especially when describing spatial relationships, nuanced actions, and the overall “feel” of a visual scene.
Multimodal Gemini, despite its impressive ability to interpret images and craft prompts, was similarly constrained. It could analyze the original comic and suggest prompts, but it could not directly “show” Imagen 3 what to create, as it lacked image generation capabilities. This limitation shaped our entire workflow, forcing us to rely on iterative refinement and indirect communication through text. Our “digital workshop” was filled with detailed descriptions, but always with a layer of separation between the visual idea and its execution. Each new prompt was a reaction to a static image that Imagen 3 had already generated, never part of a live exchange with the artist.
The Isolated Artist: What if Imagen 3 Could Speak Back?
This brings us to a crucial question: What if Imagen 3 had been an active participant in our “digital workshop”? What if it could have “spoken back,” not through cryptic images like the “Too ad…” one, but through a more direct form of communication?
Currently, the collaboration between language models like Multimodal Gemini and image generators like Imagen 3 is largely one-sided. The language model provides instructions, and the image generator executes them. But what if this relationship were more reciprocal?
Consider the possibilities:
- Clarifying Ambiguities: If Imagen 3 could express its confusion or misinterpretations, we could have addressed ambiguities in our prompts more directly. Instead of relying on trial and error, we could have engaged in a dialogue to refine the “recipe” collaboratively.
- Creative Input: Perhaps Imagen 3 could have offered its own visual suggestions, based on its understanding of the prompts and its vast dataset of images. This could have led to unexpected and innovative creative directions.
- A Shared Visual Language: Over time, we might have developed a shared visual language with Imagen 3, a set of conventions and understandings that transcended the limitations of text.
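To make the first of these possibilities a little more concrete, here is a minimal sketch of what a reciprocal loop could look like. It is purely illustrative: `toy_artist`, `toy_prompt_crafter`, `ArtistReply`, and `collaborate` are hypothetical stand-ins invented for this example, not real Imagen 3 or Gemini APIs, and the "image" is just a placeholder string.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ArtistReply:
    """What a hypothetical 'speaking' image generator might return."""
    image: Optional[str] = None      # placeholder standing in for image data
    question: Optional[str] = None   # a clarifying question, if it is confused


def toy_artist(prompt: str) -> ArtistReply:
    """Hypothetical artist: asks a question until the prompt names a layout."""
    if "ladder" not in prompt:
        return ArtistReply(question="How should the figures be arranged?")
    return ArtistReply(image=f"<image for: {prompt}>")


def toy_prompt_crafter(prompt: str, question: str) -> str:
    """Hypothetical prompt crafter: answers the question by revising the prompt."""
    return prompt + " The figures stand on the rungs of a ladder."


def collaborate(prompt: str, max_rounds: int = 5) -> Tuple[Optional[str], List[str]]:
    """Loop until the artist returns an image or the rounds run out."""
    questions: List[str] = []
    for _ in range(max_rounds):
        reply = toy_artist(prompt)
        if reply.image is not None:
            return reply.image, questions
        # The artist "spoke back": record its question and revise the recipe.
        questions.append(reply.question)
        prompt = toy_prompt_crafter(prompt, reply.question)
    return None, questions


image, questions = collaborate("Four stick figures in a comic panel.")
```

In our actual workflow the arrow only pointed one way; the difference in this sketch is that the artist's reply can be a question rather than a finished picture, so the ambiguity surfaces before another round of trial and error.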
The “Talent Stack” of Integrated AI
The potential for such a deeply integrated collaboration raises intriguing questions about the future of AI. Imagine a “talent stack” where different AI models, each with its own strengths, can seamlessly communicate and collaborate. A language model like Multimodal Gemini could work hand-in-hand with an image generator like Imagen 3, sharing information across modalities, refining concepts collaboratively, and generating outputs that are greater than the sum of their parts.
While the technical challenges of achieving this level of integration are significant, the potential rewards are immense. We could move beyond the current paradigm of AI as a tool to a future where AI is a true creative partner, capable of not just executing instructions but also contributing to the ideation and refinement process.
Conclusion: The Mind of an AI Artist
The “mystical” image, with its cryptic “Too ad…” message, serves as a potent symbol of the limitations and the potential of AI collaboration. It’s a reminder that even as AI models become increasingly sophisticated, the way we interact with them will continue to shape their creative output. Our journey up the “ladder of understanding,” as detailed in Part 1, has shown us the power of carefully crafted text prompts. However, it has also revealed the need for a more integrated, more conversational, and more truly collaborative form of human-AI interaction.
As we continue to explore the possibilities of AI in creative fields, we must strive to build systems that not only understand our instructions but can also, in their own way, “speak back” and contribute to the creative process. Perhaps, in the future, we will develop AI partners that can generate images and engage in a visual dialogue, eliminating the “frustration” of being confined to text alone. Only then will we fully unlock the potential of AI as a partner in artistic endeavors. The journey to recreate a comic strip with Imagen 3 taught us much about prompt engineering. It also raised profound questions about the evolving relationship between humans and AI, questions that will undoubtedly shape the next steps on the ladder of understanding, and beyond.
What are your thoughts on the “mystical” image? Do you see it as a sign of “frustration,” a creative tangent, or something else entirely? How do you envision the future of human-AI collaboration in creative fields?

The original comic about the talent stack can be found at Skeleton Claw.