Parker Rex · April 26, 2025

I Built Flappy Bird Using Every AI Model (You Won’t Guess Who Won)

Watch me build Flappy Bird with every AI model (Gemini 2.0 Flash, o1, and o3-mini variants, among others) and see which one wins in this live benchmark.

Show Notes

I ran a hands-on benchmark by rebuilding Flappy Bird in Python with Pygame and feeding the same prompt to a lineup of AI models. The goal was to see which model delivers the most usable, responsive game loop, and Grok 3 turned out to be the surprise winner.

Setup and approach

  • Same prompt used across all models to keep the comparison fair.
  • Each model was run from its own folder with a small, self-contained Python script (main.py) that uses Pygame (a minimal sketch of that script follows this list).
  • The workflow: open multiple terminals, cd into each model folder (o1, o3-mini, o4-mini, etc.), and run python main.py to test.
  • Observations often involved rendering quirks (bird shapes: square, circle, triangle) and control responsiveness (gravity, speed, and spacebar behavior).
  • Tools noted:
    • VS Code for editing and quick syntax fixes
    • Copilot used live to fix syntax issues when needed
    • Pygame as the game engine required by the prompt
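
For context, here is a minimal sketch of what each folder’s main.py roughly amounts to. This is my own illustration of the structure the prompt asks for, not any model’s actual output, and the tuning constants (gravity, flap velocity, pipe speed) are assumptions:

```python
# main.py - minimal Flappy Bird-style sketch in Pygame (illustrative, not a model's output)
import random
import sys

import pygame

WIDTH, HEIGHT = 400, 600
GRAVITY = 1400        # px/s^2 -- assumed tuning values
FLAP_VELOCITY = -420  # px/s
PIPE_SPEED = 180      # px/s
PIPE_GAP = 170


def main():
    pygame.init()
    screen = pygame.display.set_mode((WIDTH, HEIGHT))
    clock = pygame.time.Clock()

    bird = pygame.Rect(80, HEIGHT // 2, 34, 24)
    bird_y = float(bird.y)
    velocity = 0.0
    pipes = []          # list of (top_rect, bottom_rect) pairs
    spawn_timer = 0.0

    while True:
        dt = clock.tick(60) / 1000.0  # elapsed seconds, capped at 60 FPS

        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                pygame.quit()
                sys.exit()
            if event.type == pygame.KEYDOWN and event.key == pygame.K_SPACE:
                velocity = FLAP_VELOCITY  # one flap per spacebar press

        # Gravity scaled by real elapsed time keeps the feel frame-rate independent
        velocity += GRAVITY * dt
        bird_y += velocity * dt
        bird.y = int(bird_y)

        # Spawn a new pipe pair every 1.5 seconds
        spawn_timer += dt
        if spawn_timer >= 1.5:
            spawn_timer = 0.0
            gap_y = random.randint(140, HEIGHT - 140)
            pipes.append((
                pygame.Rect(WIDTH, 0, 60, gap_y - PIPE_GAP // 2),
                pygame.Rect(WIDTH, gap_y + PIPE_GAP // 2, 60, HEIGHT),
            ))

        # Scroll pipes left and drop the ones that have left the screen
        dx = int(-PIPE_SPEED * dt)
        pipes = [(t.move(dx, 0), b.move(dx, 0)) for t, b in pipes]
        pipes = [(t, b) for t, b in pipes if t.right > 0]

        # Hitting a pipe or the screen edge resets the run
        crashed = bird.top < 0 or bird.bottom > HEIGHT or any(
            bird.colliderect(t) or bird.colliderect(b) for t, b in pipes)
        if crashed:
            bird_y, velocity, pipes, spawn_timer = HEIGHT / 2, 0.0, [], 0.0

        screen.fill((135, 206, 235))
        pygame.draw.rect(screen, (255, 220, 0), bird)  # the "bird" (a square here)
        for t, b in pipes:
            pygame.draw.rect(screen, (0, 160, 0), t)
            pygame.draw.rect(screen, (0, 160, 0), b)
        pygame.display.flip()


if __name__ == "__main__":
    main()
```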

Example prompt approach (paraphrased):

  • A fixed prompt was used for all models to request a self-contained Flappy Bird implementation using Pygame, with gravity, collision, and a responsive spacebar control.
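
A concrete version of that prompt might look something like this (my paraphrase of the idea above, not the exact wording used in the video):

```
Write a complete, self-contained Flappy Bird clone in a single main.py using
Python and Pygame. Include gravity, scrolling pipes with collision detection,
and a spacebar control that makes the bird flap. Run at a steady frame rate,
reset cleanly after a collision, and draw everything with simple shapes so no
external assets are needed.
```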

Models tested

  • Google Gemini family
    • Gemini 2.0 Flash
    • Gemini 2.5
  • OpenAI family
    • o1
    • o3-mini
    • o3-mini-high
    • o4-mini
    • o4-mini-high
    • OpenAI’s latest model (as of the test period)
  • Anthropic / Sonnet family
    • Sonnet 3.7
    • Sonnet 3.5
    • EnAnthropic (Anthropic’s style prompt)
  • xAI / Grok family
    • Grok 3 (the sleeper hit)

Observations and quick verdicts

  • Gemini 2.0 Flash
    • Verdict: rough start, slow and unimpressive
  • o1
    • Verdict: poor baseline
  • o3-mini
    • Verdict: underwhelming behavior; gravity and control feel off
  • o3-mini-high
    • Verdict: notably better; one of the better performers in the run
  • Sonnet 3.7
    • Verdict: strong potential, good thinking/adjustments; bugged reset at one point
  • Sonnet 3.5
    • Verdict: hit-or-miss; some syntax/implementation issues
  • EnAnthropic
    • Verdict: mixed results; reliability varied across runs
  • Gemini 2.5
    • Verdict: mixed-to-average outcomes; not the standout
  • OpenAI latest
    • Verdict: included in the mix, but not clearly the best in this batch
  • o4-mini
    • Verdict: fast start, but hit stability/syntax problems; not consistently reliable
  • o4-mini-high
    • Verdict: one of the faster, more responsive attempts; still some quirks
  • Grok 3
    • Verdict: the clear winner in this test; finally delivered a reliable, playable result
    • Noted as the “line 129” moment in the video, where the implementation aligned cleanly with the prompt
    • Final assessment: Grok 3 beat the others for playable performance and stability

The winner and what it means

  • Grok 3 emerged as the winner of this Flappy Bird benchmark.
  • Why it stood out:
    • More stable gravity and collision behavior
    • More reliable spacebar responsiveness
    • Fewer quirky rendering anomalies compared to other models
  • Takeaway: even with a single shared prompt and the same task, model performance can be all over the map. The best result often comes from a model that handles the control flow and timing more predictably, not just raw language quality.

Notable moments and gotchas

  • The experiment wasn’t just about accuracy; it was about “playability” and responsiveness in a simple game loop. Some models produced nice text but struggled to drive a real-time Pygame window smoothly.
  • Syntax hiccups and code-integration issues were common (hence the Copilot mentions). Having a quick fix path can save lots of time, but it doesn’t change the underlying model performance.
  • Visual quirks (bird shapes, velocities) were a reminder that rendering decisions can skew perceived performance even when the logic is correct.

How you can reproduce

  • Use a fixed prompt across all models to ensure comparability.
  • Create a separate folder per model and place a minimal Pygame-based main.py in each.
  • Run python main.py from each model’s folder (or script it; see the sketch after this list) and observe:
    • Responsiveness to spacebar
    • Consistency of gravity and bird movement
    • Stability of the game loop (no freezes or crashes)
  • Grade each model by how playable the result is, not just whether it runs.
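
If you’d rather not juggle a terminal per folder, a small runner script can launch each main.py in turn. The folder names below are assumptions; rename them to match your own layout:

```python
# run_all.py - launch each model folder's main.py one at a time (hypothetical helper)
import subprocess
import sys
from pathlib import Path

# Assumed folder names -- adjust to however you organize the per-model folders
MODEL_DIRS = [
    "gemini-2.0-flash", "o1", "o3-mini", "o3-mini-high",
    "sonnet-3.7", "sonnet-3.5", "gemini-2.5",
    "o4-mini", "o4-mini-high", "grok-3",
]

for name in MODEL_DIRS:
    folder = Path(name)
    if not (folder / "main.py").exists():
        print(f"skipping {name}: no main.py found")
        continue
    print(f"=== {name} === (close the game window to move on)")
    # Each game runs in its own process, so a crash in one doesn't stop the rest
    result = subprocess.run([sys.executable, "main.py"], cwd=folder)
    print(f"{name} exited with code {result.returncode}")
```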

Actionable takeaways

  • When benchmarking AI models on code tasks, prioritize runtime behavior and user experience (e.g., input responsiveness, frame rate stability) in addition to correctness.
  • A single “best” model can be elusive; you may need to test multiple generations within a provider’s lineup to find a playable option.
  • Don’t rely on a model’s language quality alone—test its ability to produce real-time, deterministic results in an executable environment.
  • Having quick tooling fixes (e.g., Copilot or editor assist) is helpful, but keep the core tests model-centric rather than tool-centric.

If you want more battle-tested AI benchmarks like this, I’ll run additional rounds and share the breakdowns with a focus on real-time performance and robustness.