How I Built a Short Animated Film Using Free AI Tools (Without Lip-Sync)
Hi there,
I’d like to share some practical impressions from working on animation and voice-over for short stories. In my earlier videos I relied on static images and was only starting to experiment with multiple voices.
This video runs a little over four minutes and already uses animation. There is no lip-sync yet and no complex or expensive visual effects. The animation is deliberately symbolic: it doesn’t try to reproduce events literally, but instead supports the meaning of each scene and leaves room for the viewer’s imagination.
You could say this project was built on a wing and a prayer, running on fumes. Aside from paid voice acting, I used the free ChatGPT image generator and the free Grok video generator. Even so, the process, rough edges and all, turned out to be genuinely interesting.
At this stage, prompt quality matters more than anything else. You have to understand how Grok actually interprets instructions. Otherwise, a character may suddenly perform an unexpected gesture that has nothing to do with the script.
The story has two main characters—celestial bureaucrats from the intake and dispatch department of purgatory—and roughly 85% of the video consists of dialogue. As a starting point, I chose real-world visual references: UN Secretary-General António Guterres and German Chancellor Olaf Scholz. The AI’s output was… approximate. Guterres, at least, came out somewhat recognizable.
I then provided ChatGPT with the full story and asked it to generate prompts for several static scenes showing both characters together. The goal was simple: visual variety in poses, angles, and expressions. This is necessary because Grok’s video generator cannot deviate much from the spatial layout of the original “template” image. Since the story is short, I limited myself to four base scenes, later adding two more.
You get the idea.
Based on these static images, ChatGPT generated prompts specifically for Grok’s video output. Using the resulting clips, I assembled the core of the video. A major constraint became obvious here: the combined duration of all Grok clips was about ten and a half seconds, while the audio track runs for almost four minutes. The only workable solution was to reuse and alternate segments from different source animations to introduce at least minimal visual variation.
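To make the arithmetic concrete: with about ten and a half seconds of footage and almost four minutes of audio, every second of animation ends up on screen more than twenty times on average. Below is a minimal Python sketch of one way to plan that reuse, by cutting each clip into short segments and alternating between clips until the audio length is covered. The clip names, their lengths, the 2-second segment size, and the 236-second target are all hypothetical values chosen for illustration, not figures from the actual project.

```python
from itertools import cycle


def build_schedule(clip_lengths, target_seconds, segment_seconds=2.0):
    """Alternate short segments from several clips until target_seconds is filled.

    Returns a list of (clip_name, start, end) tuples in playback order.
    """
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")

    # Cut each clip into consecutive segments of at most segment_seconds.
    segments = {}
    for name, length in clip_lengths.items():
        cuts, start = [], 0.0
        while start < length:
            end = min(start + segment_seconds, length)
            cuts.append((start, end))
            start = end
        if cuts:  # ignore empty clips
            segments[name] = cycle(cuts)

    schedule, total = [], 0.0
    # Round-robin over the clips so that, with two or more clips,
    # neighbouring segments come from different sources.
    for name in cycle(segments):
        remaining = target_seconds - total
        if remaining <= 1e-9:
            break
        start, end = next(segments[name])
        end = min(end, start + remaining)  # trim the final segment to fit
        schedule.append((name, start, end))
        total += end - start
    return schedule


# Hypothetical clip lengths that add up to about 10.5 seconds of footage.
clips = {"scene_a": 6.0, "scene_b": 4.5}
plan = build_schedule(clips, target_seconds=236.0)
```

The resulting plan is only a cut list; the actual trimming and ordering would still happen in whatever video editor is being used.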
This required learning a few technical tricks. In many clips, both characters appear to “talk” at the same time—why Grok interpreted the prompts this way is unclear. Rather than fight it, I used a masking technique suggested by ChatGPT: one character is frozen on a single frame while the other remains animated and carries the dialogue.
Technically, this is done by applying a mask to one character on the top video layer, while the lower layer is locked to a single frame. The advantage is reuse: the same clip can be repurposed by swapping masks and assigning dialogue to the other character.
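The layering can be illustrated outside any editor. The following is a minimal Python sketch of the idea, assuming frames are plain nested lists of pixel values and the mask is a grid of booleans; a real editor does the same thing with its own mask tools, and the function name and the tiny 1x4 "frames" here are made up for the example.

```python
def freeze_outside_mask(frames, mask, freeze_index=0):
    """Keep animation where mask is True; show one frozen frame everywhere else.

    frames: list of frames, each a list of rows of pixel values.
    mask:   rows of booleans with the same height and width as a frame.
    """
    still = frames[freeze_index]  # the lower layer, locked to a single frame
    return [
        [
            [top if keep else bottom
             for top, bottom, keep in zip(row, still_row, mask_row)]
            for row, still_row, mask_row in zip(frame, still, mask)
        ]
        for frame in frames
    ]


# Three tiny 1x4 "frames": the left two pixels are character A, the right two are B.
frames = [[[10, 11, 20, 21]], [[12, 13, 22, 23]], [[14, 15, 24, 25]]]
mask_a = [[True, True, False, False]]                    # A stays animated, B is frozen
mask_b = [[not keep for keep in row] for row in mask_a]  # swapped: B talks, A is frozen

a_talks = freeze_outside_mask(frames, mask_a)
b_talks = freeze_outside_mask(frames, mask_b)
```

Swapping `mask_a` for `mask_b` is the reuse step: the same source clip yields a second shot in which the other character carries the dialogue.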
This approach has limitations, especially with hand gestures. In several cases, both characters’ hands occupied the same screen space at different times. When that happened, I had to split the clip into multiple segments and manually adjust mask boundaries for each one.
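The segment-by-segment adjustment can be pictured as a small table of mask boundaries keyed by time. This is a deliberately simplified Python sketch that assumes the mask is just a vertical dividing line between the two characters; the timestamps and column positions are hypothetical, and real masks in an editor are usually freeform shapes.

```python
from bisect import bisect_right


def boundary_at(keyframes, t):
    """Return the mask boundary that applies at time t.

    keyframes: list of (start_second, boundary_column), sorted by start_second.
    Each boundary holds from its start time until the next keyframe begins.
    """
    starts = [start for start, _ in keyframes]
    index = bisect_right(starts, t) - 1
    if index < 0:
        raise ValueError("t is before the first segment")
    return keyframes[index][1]


def column_mask(width, boundary):
    """True for columns left of the boundary (the animated character's side)."""
    return [x < boundary for x in range(width)]


# Hypothetical segments: the boundary moves right while a hand crosses the
# middle of the frame, then moves back once the gesture is over.
keyframes = [(0.0, 4), (1.5, 6), (3.0, 3)]
mask_during_gesture = column_mask(8, boundary_at(keyframes, 2.0))
```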
Another simple but effective technique was time-stretching—slowing down footage to match the required duration. I also used a few auxiliary effects: introducing a static image first and then “activating” it, or adding brief on-screen explanations outside the spoken dialogue when the scene benefited from clarification.
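The stretch amount is simple arithmetic: the slowdown factor is the required duration divided by the clip's own duration. Here is a small Python sketch that computes it and, as one possible way to apply it, formats the factor for ffmpeg's `setpts` video filter. Using ffmpeg is an assumption made for the example (any editor's speed control does the same job), and the 6-second clip, 9-second target, and 24 fps source rate are hypothetical.

```python
def stretch_factor(clip_seconds, target_seconds):
    """How many times longer the clip must play to fill target_seconds."""
    if clip_seconds <= 0 or target_seconds <= 0:
        raise ValueError("durations must be positive")
    return target_seconds / clip_seconds


def setpts_expression(clip_seconds, target_seconds):
    """Build an ffmpeg video filter expression such as 'setpts=1.5000*PTS'."""
    return f"setpts={stretch_factor(clip_seconds, target_seconds):.4f}*PTS"


def unique_frames_per_second(source_fps, factor):
    """Distinct source frames shown per second after stretching, without interpolation."""
    return source_fps / factor


factor = stretch_factor(6.0, 9.0)                  # 1.5
expression = setpts_expression(6.0, 9.0)           # "setpts=1.5000*PTS"
smoothness = unique_frames_per_second(24, factor)  # 16.0
```

The last number is the practical limit of the trick: without frame interpolation, the more a clip is stretched, the fewer distinct frames appear each second, so large factors start to look choppy.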
If I were starting this project again with what I know now, I could avoid many of the mistakes, and the final result would be noticeably smoother. In theory, the rational move would be to delete everything and rebuild it properly. In practice, after investing this much cognitive and emotional energy, the project is “done enough,” and it’s time to move on.
For now, I’m deliberately avoiding longer projects—until the process feels more predictable and controllable.
Hope you like it anyway.






