
Shorts Factory

2025 · Active

An autonomous media pipeline that transforms a topic into a polished educational short—research to final render—using Gemini, VEO 3.1, and a 7-step production workflow.

// GitHub

View Repository
425 commits
TypeScript

// Problem

I wanted to make YouTube videos like Wendover Productions and PolyMatter—content that takes complex topics and turns them into compelling, educational entertainment. The kind of videos where you learn how the universe works while being genuinely entertained. But making even one of these videos takes a full day: deep research, writing a narrative arc, scripting, recording voiceover, creating a storyboard, generating visuals, and assembling everything. And I didn't want to produce the kind of low-quality, faceless 'AI slop' flooding YouTube. I wanted high-production-value content that leveraged the actual breakthroughs in image and video generation—not just threw prompts at a model and called it a day.

// Solution

A structured production pipeline that codifies the entire video creation workflow into seven discrete steps, each powered by specialized AI models. Instead of treating video generation as a black box, I broke down the process into the same stages a professional production team would use: research, narrative design, scriptwriting, voiceover recording, storyboarding, image generation, and video clip assembly. Each step can be AI-assisted or human-refined, with approval gates that prevent the pipeline from advancing until the content meets quality standards. The system can now produce five videos in the time it used to take me to make one.

// What I Built

A full-stack Next.js 16 application with a 7-step pipeline:

  • 01 Research uses Google's Interactions API with web search to produce comprehensive topic briefs with citations.
  • 02 Narrative takes that research and structures it into a story arc—identifying the hero object, pop culture bridges, and scene transitions (the beat mapping is trained on classifications of viral video scripts).
  • 03 Script generates word-for-word voiceover text, with a 'swipe file' system that can clone the style of specific creators (archetypes like 'systems_engineer' or 'map_nerd').
  • 04 Voiceover provides a teleprompter interface for recording, with real-time audio levels and Gemini transcription.
  • 05 Storyboard breaks the script into visual sequences—frames with timing, visual notes, and shot descriptions (inspired by ripping apart Stranger Things storyboards).
  • 06 Images generates each frame using Gemini 2.5 Flash Image with style presets.
  • 07 Clips uses VEO 3.1 to animate each image into 5-8 second video segments.

A copilot system with specialized agents (director, researcher, narrative_writer, script_writer, storyboard_generator, video_generator) guides you through clarifying questions at each transition, and approved steps become immutable to prevent corruption.
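A minimal sketch of how that step sequencing and the approval gates could be modeled; the names (PIPELINE_STEPS, StepState, canAdvanceTo) are illustrative, not the repository's actual types:

```typescript
// Illustrative sketch of the 7-step pipeline with approval gates.
// Names here are hypothetical, not the repo's actual schema.

const PIPELINE_STEPS = [
  "research",
  "narrative",
  "script",
  "voiceover",
  "storyboard",
  "images",
  "clips",
] as const;

type PipelineStep = (typeof PIPELINE_STEPS)[number];

interface StepState {
  step: PipelineStep;
  // Once set, the step is treated as immutable by the UI and API.
  approvedAt: Date | null;
}

// A step may only be worked on when every earlier step has been approved.
function canAdvanceTo(target: PipelineStep, states: StepState[]): boolean {
  const targetIndex = PIPELINE_STEPS.indexOf(target);
  return PIPELINE_STEPS.slice(0, targetIndex).every(
    (step) => states.find((s) => s.step === step)?.approvedAt != null,
  );
}
```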

// Screenshots

Pipeline overview

Research step

Script editor

Video generation

// Technologies

Google Gemini 2.5/3 + Interactions API

Text generation for research, narrative, and script steps—using Interactions API with google_search tool for web research, plus thinking mode (configurable token budgets) for complex reasoning tasks
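A hedged sketch of what a grounded research call can look like with the @google/genai SDK: Google Search as a tool plus a thinking budget. The model id, prompt, and budget are placeholders, and the project's actual Interactions API wiring may differ:

```typescript
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Produce a cited research brief for a topic, grounded in web search,
// with a configurable reasoning-token budget.
async function researchTopic(topic: string) {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash", // placeholder model id
    contents: `Produce a cited research brief on: ${topic}`,
    config: {
      tools: [{ googleSearch: {} }],            // web search grounding
      thinkingConfig: { thinkingBudget: 4096 }, // thinking-mode token budget
    },
  });
  return response.text;
}
```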

Google VEO 3.1

Video generation from images with audio prompts—5-8 second clips in 9:16, 16:9, or 1:1 aspect ratios. Includes retry logic for rate limiting and polling for long-running jobs
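A sketch of the image-to-video call and polling loop using the @google/genai SDK; the model id is assumed and the rate-limit retry wrapper is omitted for brevity:

```typescript
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Animate one storyboard frame into a short clip, polling the
// long-running operation until it completes.
async function animateFrame(imageBytes: string, motionPrompt: string) {
  let operation = await ai.models.generateVideos({
    model: "veo-3.1-generate-preview", // assumed model id
    prompt: motionPrompt,              // e.g. "camera slowly pulls back to reveal"
    image: { imageBytes, mimeType: "image/png" },
    config: { aspectRatio: "9:16" },
  });

  // VEO jobs run for a while; poll until the operation is done.
  while (!operation.done) {
    await new Promise((resolve) => setTimeout(resolve, 10_000));
    operation = await ai.operations.getVideosOperation({ operation });
  }
  return operation.response?.generatedVideos?.[0];
}
```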

Next.js 16 + React 19 + TanStack Query

App Router with RSC for secure data fetching, REST-only mutations (no server actions), and TanStack Query for client-side caching with query key factories
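A small sketch of the query-key-factory pattern against a hypothetical REST route (videoKeys and /api/videos/... are illustrative names):

```typescript
import { useQuery } from "@tanstack/react-query";

// Query key factory: every cache key for a video's data derives from one place.
export const videoKeys = {
  all: ["videos"] as const,
  detail: (videoId: string) => [...videoKeys.all, videoId] as const,
  steps: (videoId: string) => [...videoKeys.detail(videoId), "steps"] as const,
};

// Client-side hook: reads go through REST endpoints, never server actions.
export function usePipelineSteps(videoId: string) {
  return useQuery({
    queryKey: videoKeys.steps(videoId),
    queryFn: async () => {
      const res = await fetch(`/api/videos/${videoId}/steps`);
      if (!res.ok) throw new Error(`Failed to load steps: ${res.status}`);
      return res.json();
    },
  });
}
```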

PostgreSQL + Drizzle ORM

Type-safe database with atomic JSONB operations to prevent race conditions during concurrent image/clip generation. Separate tables for videos, runs, copilot threads, and swipe files
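A sketch of the atomic-JSONB idea with Drizzle: issue a single jsonb_set UPDATE instead of reading, mutating, and writing the whole document back. The table and column names are hypothetical:

```typescript
import { eq, sql } from "drizzle-orm";
import { drizzle } from "drizzle-orm/node-postgres";
import { jsonb, pgTable, uuid } from "drizzle-orm/pg-core";
import { Pool } from "pg";

// Hypothetical table shape: a pipeline run keeps its generated frames in a JSONB column.
const runs = pgTable("runs", {
  id: uuid("id").primaryKey().defaultRandom(),
  frames: jsonb("frames").notNull(),
});

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const db = drizzle(pool);

// Record one frame's image URL with a single jsonb_set UPDATE, so two
// concurrent generations can't clobber each other's writes.
async function saveFrameImage(runId: string, frameId: string, imageUrl: string) {
  await db
    .update(runs)
    .set({
      frames: sql`jsonb_set(${runs.frames}, ARRAY[${frameId}]::text[], to_jsonb(${imageUrl}::text), true)`,
    })
    .where(eq(runs.id, runId));
}
```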

Google Cloud Storage

Persistent storage for audio recordings, generated images, and video clips—with public URL generation and GCS URI proxy for playback
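A sketch of the upload path with @google-cloud/storage; the bucket and object names are placeholders:

```typescript
import { Storage } from "@google-cloud/storage";

const storage = new Storage();
const bucket = storage.bucket("shorts-factory-assets"); // placeholder bucket name

// Persist a generated frame image and hand back a playable URL.
async function uploadFrameImage(runId: string, frameId: string, png: Buffer) {
  const file = bucket.file(`runs/${runId}/frames/${frameId}.png`);
  await file.save(png, { contentType: "image/png" });
  // publicUrl() works when the bucket allows public reads; otherwise the app
  // proxies the gs:// URI through its own route for playback.
  return file.publicUrl();
}
```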

// Lessons Learned

  • 01 Doing it manually first was non-negotiable. I made three videos by hand before writing a single line of code—not to validate the market, but to understand the actual workflow. The research-to-narrative transition, the storyboard timing calculations, the way a good script references earlier beats—none of that would have been obvious from theorizing. The manual process became the product spec.
  • 02 The swipe file system is more powerful than I expected. Training the script generator on classified examples of viral videos (what makes a 'systems_engineer' voice different from a 'map_nerd') produces dramatically better output than generic prompts. Style transfer through examples beats style description every time.
  • 03 State hardening prevents catastrophic errors. Once a step is approved (marked with an ApprovedAt timestamp), the UI becomes read-only (see the sketch after this list). This seems restrictive, but it prevents the nightmare scenario where you tweak your research after generating images, creating inconsistencies that cascade through the pipeline.
  • 04 VEO 3.1's audio prompts are game-changing. Previous image-to-video models would just zoom and pan. VEO actually understands motion prompts like 'camera slowly pulls back to reveal' and 'subject turns toward camera.' The results feel directed, not procedurally generated.
  • 05 This project crystallized how I think about AI-assisted creative work: the AI handles the tedious parts (research synthesis, visual generation, timing calculations) while humans make the creative decisions (what angle to take, which style to use, when to break the rules). The 7-step pipeline enforces that separation at the architecture level.
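A sketch of the server-side half of that read-only rule, assuming a Next.js route handler and an approvedAt column; function and field names are illustrative:

```typescript
import { NextResponse } from "next/server";

interface StepRecord {
  step: string;
  approvedAt: Date | null;
}

// Reject any mutation against a step that already carries an approvedAt
// timestamp; the route handler returns this response instead of writing.
export function assertEditable(record: StepRecord): NextResponse | null {
  if (record.approvedAt !== null) {
    return NextResponse.json(
      { error: `"${record.step}" was approved and is now immutable` },
      { status: 409 },
    );
  }
  return null; // safe to apply the mutation
}
```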