The Incredible Machine AI Benchmark
Project Overview
Built a benchmarking system that tests the planning and reasoning capabilities of frontier AI models using The Incredible Machine, a vintage 16-bit puzzle simulation game. The project addresses the need for AI-hard benchmarks that are not contaminated by training data.
Technical Architecture
• QEMU system emulator for precise environment control
• QDB debug interface for real-time command execution
• Windows 98SE installation with modern mouse drivers
• Touch-based driver system for accurate coordinate mapping
• Automated screenshot capture and analysis pipeline
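As a rough illustration of how the control and capture pieces might fit together, the sketch below builds the messages for one screenshot and one mouse click. The QDB command format is not described here, so the sketch substitutes QEMU's QMP JSON protocol (its `screendump` and `input-send-event` commands) as a stand-in. The helper names and the file name are hypothetical.

```python
import json


def screendump_cmd(filename):
    """QMP command asking QEMU to write the guest display to a host file."""
    return {"execute": "screendump", "arguments": {"filename": filename}}


def _input_cmd(events):
    """Wrap a list of input events in a QMP input-send-event command."""
    return {"execute": "input-send-event", "arguments": {"events": events}}


def click_cmds(ax, ay, button="left"):
    """Move an absolute pointer to (ax, ay), then press and release a button.

    ax and ay are absolute device units, not screen pixels.
    """
    move = _input_cmd([
        {"type": "abs", "data": {"axis": "x", "value": ax}},
        {"type": "abs", "data": {"axis": "y", "value": ay}},
    ])
    press = _input_cmd([{"type": "btn", "data": {"down": True, "button": button}}])
    release = _input_cmd([{"type": "btn", "data": {"down": False, "button": button}}])
    return [move, press, release]


def to_wire(cmds):
    """Serialize commands as one JSON object per line."""
    return "".join(json.dumps(c) + "\n" for c in cmds)


# Hypothetical usage: capture the screen, then click near the middle of it.
batch = [screendump_cmd("shot.ppm")] + click_cmds(16384, 16384)
payload = to_wire(batch)
```

The sketch only builds the messages. A real QMP session would also need an open monitor socket and the initial `qmp_capabilities` handshake before any of these commands are accepted.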
Coordinate System Challenge
Resolved the coordinate scaling problems that undermined earlier VNC- and browser-based attempts:
• VNC Docker containers on Raspberry Pi proved unreliable
• Browser DOSBox on archive.org had coordinate mismatches
• QEMU with QDB provided repeatable, faster-than-human control
• Modern mouse drivers for Windows 98SE enabled precise positioning
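The scaling problem described above comes down to converting a pixel position in a captured screenshot into the units of an absolute pointing device. The following is a minimal sketch of that conversion, assuming a device range of 0 to 0x7FFF (the range QEMU uses for absolute pointer events) and a 640x480 guest display. Both values are assumptions for illustration, as are the function names.

```python
ABS_MAX = 0x7FFF  # assumed absolute-pointer range: 0..32767


def screenshot_to_abs(x, y, shot_w, shot_h, abs_max=ABS_MAX):
    """Map a screenshot pixel (x, y) to absolute pointer coordinates."""
    if not (0 <= x < shot_w and 0 <= y < shot_h):
        raise ValueError(f"pixel ({x}, {y}) is outside {shot_w}x{shot_h}")
    # Aim at the centre of the pixel, then scale to the device range.
    ax = round((x + 0.5) * abs_max / shot_w)
    ay = round((y + 0.5) * abs_max / shot_h)
    return ax, ay


def abs_to_screenshot(ax, ay, shot_w, shot_h, abs_max=ABS_MAX):
    """Inverse mapping: which screenshot pixel does a pointer position hit?"""
    px = min(shot_w - 1, int(ax * shot_w / abs_max))
    py = min(shot_h - 1, int(ay * shot_h / abs_max))
    return px, py


# Hypothetical usage: target pixel (320, 240) on an assumed 640x480 display.
target = screenshot_to_abs(320, 240, 640, 480)
```

Aiming at the pixel centre rather than its top-left corner keeps the round trip exact for every pixel as long as the device range is larger than the screen dimensions, which avoids off-by-one misses on small interface targets.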
AI Model Testing
Developed zero-shot evaluation protocols:
• Detailed descriptions of puzzle elements and interactions
• Interface component documentation
• Step-by-step task execution generation
• Performance evaluation across multiple frontier models
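One way the protocol above could be wired up is sketched below: a prompt is assembled from the element and interface descriptions, and the model's reply is parsed into executable steps. The step grammar (`click X Y`, `drag X1 Y1 X2 Y2`), the function names, and the example descriptions are all hypothetical; the actual prompt and reply formats used in the project are not specified here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str    # "click" or "drag"
    coords: tuple  # (x, y) for click, (x1, y1, x2, y2) for drag


# Matches numbered lines such as "1. click 120 340" or "2) drag 5 6 7 8".
STEP_RE = re.compile(r"^\s*\d+[.)]\s*(click|drag)\s+(.*)$", re.IGNORECASE)


def build_prompt(elements, interface, goal):
    """Assemble a zero-shot prompt from descriptions and a puzzle goal."""
    lines = ["Puzzle elements:"]
    lines += [f"- {name}: {desc}" for name, desc in elements.items()]
    lines.append("Interface components:")
    lines += [f"- {name}: {desc}" for name, desc in interface.items()]
    lines.append(f"Goal: {goal}")
    lines.append("Reply with numbered steps, one per line, using only:")
    lines.append("  click X Y")
    lines.append("  drag X1 Y1 X2 Y2")
    return "\n".join(lines)


def parse_steps(reply):
    """Extract well-formed steps from a model reply, skipping other lines."""
    steps = []
    for line in reply.splitlines():
        match = STEP_RE.match(line)
        if not match:
            continue
        action = match.group(1).lower()
        nums = tuple(int(n) for n in re.findall(r"\d+", match.group(2)))
        expected = 2 if action == "click" else 4
        if len(nums) == expected:
            steps.append(Step(action, nums))
    return steps
```

Restricting replies to a small, fixed grammar keeps parsing deterministic, so a malformed step is dropped rather than executed against the emulator.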
Research Implications
Opens new avenues for AI benchmarking using vintage software:
• 16-bit and 32-bit era simulation games as test environments
• Air traffic control simulators for complex scenario testing
• Professional engineering software (OrCAD, Tanner L-Edit)
• Lower fidelity interfaces enabling focused capability assessment
Current Status
The research prototype demonstrates the concept. With sufficient coaching, frontier models can generate correct task execution steps. Further development is needed for reliable automated problem solving across the full game library.