General Introduction – For the Entire Two-Part Series
This two-part blog post series about the ladder of understanding continues our exploration of AI’s capabilities in the realm of visual understanding and creation. In our previous post, Can AI Interpret a Comic Strip?, we saw how Multimodal Gemini could analyze and interpret a complex comic, using a comic by Skeleton Claw as an example.
Now, we embark on a new adventure: attempting to recreate that very same comic using only the power of words. “The Ladder of Understanding” chronicles our collaborative journey in text-based image generation, a journey shaped by a fundamental constraint: neither I nor my AI partner, Multimodal Gemini, can directly create or manipulate images. Bound to the realm of language, we are tasked with crafting the perfect “recipe” for an AI “artist” to follow. This series will explore the intricacies of this process, the breakthroughs, limitations, and the philosophical implications of collaborating with an AI that can understand, but not generate, images.
See also Part 2: Beyond the Recipe.
Part 1: Refining the Recipe, Step-by-Step
In this first part, we dive deep into the “digital workshop” where the “recipe” for our comic recreation was forged. I, acting as the “intermediary” and “project manager,” teamed up with Multimodal Gemini, the “prompt crafter” and “recipe developer,” to guide Imagen 3, our “automated artist,” in generating the visual elements of the comic strip. What follows is a step-by-step account of our iterative journey, a chronological exploration of the prompts we crafted, the images Imagen 3 produced, and the key learnings that emerged along the way. From initial struggles with spatial relationships to the breakthrough moment when the “ladder of understanding” finally appeared, we’ll uncover the challenges and triumphs of communicating visual ideas through the medium of text alone. This is the story of how a human and an AI collaboratively climbed the “ladder of understanding,” one carefully crafted prompt at a time.
The Iterative Journey: Crafting the Recipe
Round 1: The “Image of Collapse” – When Text Overloads the Image

In our initial approach, Multimodal Gemini, drawing on its prior success in interpreting the comic, crafted a prompt that included both descriptions of the visual elements and the original text from each panel. Reasoning that a human artist might benefit from such a detailed script, Multimodal Gemini aimed to provide Imagen 3 with as much context as possible.
- Prompt Example: “A four-panel cartoon comic strip in a simple, minimalist style with thick black lines and flat, limited colors. Each panel should feature a white, stick-figure character with large, round eyes. Each panel has a light blue background.
- Panel 1: The character is standing confidently next to a single, large horizontal blue bar that goes from left to right across the panel. The text above reads: ‘You can achieve true greatness if you focus fully on one thing.’
- Panel 2: The character is standing looking up, next to a series of shorter horizontal blue bars. The text above reads: ‘If you have too many interests you’ll never reach your full potential.’
- Panel 3: The same short blue bars are depicted, but the character is climbing over them. The text above reads: ‘Wait, where the f*** are you going?’
- Panel 4: The character is sitting by a blue sea in a white platform. A few short bars are visible descending down the platform. The sky is blue with white clouds. The overall style should be clean, with a light blue background. The text should be above each panel.”
Analysis: This initial failure, while discouraging, provided a crucial insight. Instead of the well-defined panels and clear narrative progression we expected, Imagen 3 produced an “image of collapse”.
The text, instead of providing helpful context, was scattered randomly, adding to the visual chaos. While Multimodal Gemini understood the comic’s format and the interplay between text and image, Imagen 3, at least in its current form, was overwhelmed by this combined information. It seemed unable to parse the prompt into separate visual and textual instructions.
My initial feedback to Multimodal Gemini was merely a text description of this failure. However, I quickly realized that describing the chaos wasn’t enough. To truly understand the nature of the problem, Multimodal Gemini needed to see the output itself. This marked a turning point in our collaboration. We shifted from relying solely on text descriptions to incorporating direct visual feedback, mimicking the way a human artist might critique a rough sketch. We had to adjust our expectations and focus on pure image description, and I had to learn to “show, not just tell.”
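To make the “pure image description” idea concrete, here is a minimal Python sketch of how a panel script could be split into a visual part and a caption part, so that only the visual part reaches the image generator. This is not the workflow we actually used (we worked by hand, in conversation). The names Panel, STYLE, and visual_prompt are hypothetical, and the visual descriptions are adapted from the Round 1 prompt above.

```python
from dataclasses import dataclass


@dataclass
class Panel:
    visual: str   # what the image generator should draw
    caption: str  # text from the comic, deliberately kept out of the prompt


# Shared style description, adapted from the Round 1 prompt (an assumption,
# not a wording that is known to work best).
STYLE = (
    "A cartoon panel in a simple, minimalist style with thick black lines "
    "and flat, limited colors, on a light blue background."
)


def visual_prompt(panel: Panel, style: str = STYLE) -> str:
    """Build a prompt for one panel from its visual description only."""
    return f"{style} {panel.visual}"


panels = [
    Panel(
        visual=(
            "A white stick-figure character with large, round eyes is standing "
            "confidently next to a single, large horizontal blue bar that goes "
            "from left to right across the panel."
        ),
        caption="You can achieve true greatness if you focus fully on one thing.",
    ),
    Panel(
        visual=(
            "A white stick-figure character with large, round eyes is standing "
            "looking up, next to a series of shorter horizontal blue bars."
        ),
        caption="If you have too many interests you'll never reach your full potential.",
    ),
]

for panel in panels:
    # The caption is never passed on; it can be added to the image afterwards.
    print(visual_prompt(panel))
```

Keeping the caption in a separate field means the comic’s text is not lost; it simply never enters the prompt, which is the adjustment we made after Round 1.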
Round 2: Initial Attempts and the Challenge of Horizontal Bars

New Idea: We began by removing the text and focusing solely on the visual description of the first panel, particularly the long horizontal bar.
Analysis: Imagen 3 struggled to accurately depict the horizontal bars, often placing them vertically or misinterpreting their spatial relationship to the stick figure. This highlighted a key challenge: spatial relationships, easily grasped by humans, are not inherently understood by AI. We realized that we needed to be much more explicit in our language, breaking down visual concepts into their most basic components. It was also clear that using the term “horizontal” was creating difficulty for the “automated artist”.
Round 3: Embracing the Vertical: Bars as Steps

New Idea: In an attempt to simplify the instructions, we shifted our focus to vertical bars, imagining them as steps the stick figure could climb.
Analysis: This approach yielded some improvements. Imagen 3 was now consistently generating vertical bars. However, the concept of “steps” was not fully realized, and the overall composition still lacked the dynamic movement of the original comic. We were learning that metaphors, while helpful, needed to be carefully chosen and explicitly defined.
Round 4: Refining the Climb: From Vertical to Horizontal Again

New Idea: We needed to refine the depiction of the climbing action. Initially, the stick figure was climbing vertically, as Spider-Man would. We needed to communicate to Imagen 3 that the bars were to be used as handles, regardless of their length.
Analysis: The results were getting better, but we were not quite there yet. It was clear that we needed to emphasize that the bars were horizontal.
Round 5: The Ladder Analogy Breakthrough

New Idea: Then came the breakthrough. In our “digital workshop,” we hit upon the “ladder” analogy. Instead of describing the bars as steps, we reframed them as rungs of a ladder.
Analysis: This seemingly simple change in language transformed Imagen 3’s output. The “ladder” metaphor provided a clear, universally understood framework for the arrangement and function of the bars. The stick figure was now positioned as if climbing, and the overall composition began to resemble the original comic. This was a pivotal moment, demonstrating the power of the right analogy in bridging the gap between human intention and AI interpretation.
Round 6: Success With Horizontal Bars as Rungs of a Ladder

New Idea: Finally, we managed to get it right by combining all our learnings and specifying that the stick figure was climbing a ladder, using horizontal bars as rungs.
Analysis: We had finally cracked the code. This was a major breakthrough, demonstrating that even complex visual concepts could be communicated through carefully crafted text prompts.
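Looking back over Rounds 3 to 6, the scene description stayed mostly the same while the framing of the bars changed. The sketch below shows one way to make that explicit: a fixed template with a single slot for the framing. The template and the three phrasings are hypothetical reconstructions for illustration only, not the exact prompts we sent to Imagen 3.

```python
# Fixed scene description with one slot for how the bars are framed.
# Both the template and the framings are illustrative assumptions.
TEMPLATE = (
    "A white stick-figure character with large, round eyes is climbing {bars}. "
    "Simple, minimalist style, thick black lines, light blue background."
)

FRAMINGS = {
    "round 3, steps": "vertical blue bars used as steps",
    "round 4, handles": "blue bars of different lengths used as handles",
    "round 6, ladder rungs": "a ladder whose rungs are horizontal blue bars",
}


def prompt_variants(template: str, framings: dict[str, str]) -> dict[str, str]:
    """Return one prompt per framing, changing only the description of the bars."""
    return {label: template.format(bars=phrase) for label, phrase in framings.items()}


for label, prompt in prompt_variants(TEMPLATE, FRAMINGS).items():
    print(f"{label}: {prompt}")
```

Changing one phrase at a time makes it easier to tell which wording caused a change in the output, which is roughly what our round-by-round refinement amounted to.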
The “Talent Stack” of Text-Based Collaboration
This journey wasn’t just about finding the right prompts; it was about leveraging a unique combination of skills, a “talent stack” built for the specific challenges of text-based image generation. My role as the “intermediary” and “project manager” required visual analysis to deconstruct the comic, critical evaluation to assess Imagen 3’s output, and conceptual bridging to connect abstract ideas to concrete language.
Multimodal Gemini, the “prompt crafter” and “recipe developer,” brought to the table its strengths in natural language processing, image interpretation (as demonstrated in our previous post), and iterative refinement. Together, we formed a collaborative “talent stack” that compensated for the fact that neither of us could directly manipulate images. We were both working within the limitations of language, using it as a bridge to communicate with our “automated artist.”
The Mystical “Too ad…” Image

While seemingly a deviation from our goal, this “mystical” image, with its cryptic “Too ad…” text, offered a glimpse into the unexpected creative spaces that can emerge from AI collaboration. We’ll investigate the possible meanings of this image in Part 2, exploring whether it represents a moment of frustration, a creative tangent, or perhaps something more.
Conclusion: The Ladder of Understanding
Our journey to recreate a comic strip through text prompts was challenging but ultimately rewarding. Each carefully crafted prompt built upon the last, forming a “ladder of understanding” that brought us closer to our goal. We learned that the right metaphor, the precise choice of words, could unlock Imagen 3’s ability to generate surprisingly accurate visual representations.
However, this process also highlighted the limitations of our approach. We were, after all, working with an “automated artist” that could only follow instructions, not participate in the creative brainstorming. This limitation stems from the fact that neither I nor Multimodal Gemini could provide direct visual input. Together with the potential for a more integrated form of AI collaboration, it will be the focus of our exploration in Part 2.
We’ll also delve into the curious case of the ‘mystical’ image, and what it might reveal about the nature of AI creativity. For now, we invite you to try your own hand at prompt engineering. Can you craft the perfect “recipe” for Imagen 3 or other image generators? What “ladders of understanding” will you discover?
So please read the follow-up post, Beyond the Recipe, where we look more closely at the phenomenon of the “Too ad…” comic strip.