The Incredible Machine AI Benchmark

Project Details

Date: March 2025

Duration: 2 months of development

Technologies: QEMU, QDB Debug Interface, Windows 98SE, DOS emulation

Focus: AI planning and reasoning benchmarks

Role: Research Engineer & System Architect

Project Overview

Built a benchmarking system that tests frontier AI models' planning and reasoning capabilities using The Incredible Machine, a vintage 16-bit simulation game. The project addresses the need for AI-hard benchmarks that are unlikely to be contaminated by training data.

Technical Architecture

• QEMU system emulator for precise environment control

• QDB debug interface for real-time command execution

• Windows 98SE installation with modern mouse drivers

• Touch-based driver system for accurate coordinate mapping

• Automated screenshot capture and analysis pipeline
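
The capture side of this pipeline can be sketched in a few lines. The QDB command set is not described here, so the sketch below swaps in QEMU's standard QMP JSON monitor protocol as a stand-in; that substitution is an assumption, not the project's actual interface. The helper names (`qmp_command`, `capture_sequence`) and the frame file names are hypothetical. The code only builds the command lines, so it runs without a live emulator.

```python
import json


def qmp_command(name, **arguments):
    """Serialize one QMP command as a JSON line ready to write to a monitor socket."""
    message = {"execute": name}
    if arguments:
        message["arguments"] = arguments
    return json.dumps(message) + "\n"


def capture_sequence(frame_paths):
    """Return the lines to send: capability handshake, then one screendump per frame."""
    # QMP requires the qmp_capabilities handshake before any other command.
    lines = [qmp_command("qmp_capabilities")]
    for path in frame_paths:
        # screendump writes the guest display to a file on the host (PPM by default).
        lines.append(qmp_command("screendump", filename=path))
    return lines


if __name__ == "__main__":
    # Hypothetical frame names, used only for illustration.
    for line in capture_sequence(["frame_000.ppm", "frame_001.ppm"]):
        print(line, end="")
```

In a real run these lines would be written to the monitor socket and each reply read back before the next command is sent; the saved frames would then feed the analysis step.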

Coordinate System Challenge

Solved the coordinate scaling issues that plagued earlier attempts, including VNC-based approaches:

• VNC Docker containers on Raspberry Pi proved unreliable

• Browser DOSBox on archive.org had coordinate mismatches

• QEMU with QDB provided repeatable, faster-than-human control

• Modern mouse drivers for Windows 98SE enabled precise positioning
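
One way to picture the coordinate mapping: an absolute pointing device reports positions on a fixed axis range that does not depend on the guest's screen resolution, so a target pixel has to be rescaled before it is sent. The sketch below assumes QEMU's absolute axis range of 0 to 0x7FFF and the event layout of QMP's `input-send-event` command; whether the project used this exact path is not stated, and the 640x480 resolution in the usage note is hypothetical. Mapping the last pixel to the top of the range is also a choice made here, not something taken from the project.

```python
ABS_MAX = 0x7FFF  # upper bound of QEMU's absolute pointer axes (32767)


def pixel_to_abs(pixel, extent):
    """Map a pixel index in [0, extent - 1] onto the absolute axis range [0, ABS_MAX]."""
    if extent < 2:
        raise ValueError("extent must be at least 2 pixels")
    if not 0 <= pixel < extent:
        raise ValueError(f"pixel {pixel} is outside 0..{extent - 1}")
    span = extent - 1
    # Integer arithmetic with round-half-up avoids floating-point drift.
    return (pixel * ABS_MAX + span // 2) // span


def click_events(x, y, width, height):
    """Build argument dicts for three input-send-event calls: move, press, release."""
    move = {"events": [
        {"type": "abs", "data": {"axis": "x", "value": pixel_to_abs(x, width)}},
        {"type": "abs", "data": {"axis": "y", "value": pixel_to_abs(y, height)}},
    ]}
    press = {"events": [{"type": "btn", "data": {"down": True, "button": "left"}}]}
    release = {"events": [{"type": "btn", "data": {"down": False, "button": "left"}}]}
    return [move, press, release]
```

For example, `click_events(320, 240, 640, 480)` yields a move to roughly the screen centre followed by a left-button press and release. Because the scaling depends only on the guest resolution, the same pixel target produces the same event values on every run.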

AI Model Testing

Developed zero-shot prompting and evaluation protocols:

• Detailed descriptions of puzzle elements and interactions

• Interface component documentation

• Step-by-step task execution generation

• Performance evaluation across multiple frontier models
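
A minimal sketch of how such a protocol could be wired together: assemble a single zero-shot prompt from the puzzle and interface descriptions, then parse the model's numbered reply into executable steps. The prompt wording, the `CLICK`/`DRAG` step vocabulary, and every name below are hypothetical illustrations; the project's actual prompt and step format are not given here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str
    args: tuple


# Hypothetical step grammar: "<n>. CLICK x y" or "<n>. DRAG x1 y1 x2 y2".
STEP_PATTERN = re.compile(r"^\s*\d+[.)]\s*(CLICK|DRAG)\s+([\d\s]+)$", re.IGNORECASE)
ARITY = {"CLICK": 2, "DRAG": 4}


def build_prompt(parts_description, interface_description, goal):
    """Assemble one zero-shot prompt: no worked examples, only descriptions and the goal."""
    return "\n\n".join([
        "Puzzle elements:\n" + parts_description,
        "Interface:\n" + interface_description,
        "Goal:\n" + goal,
        "Reply with numbered steps using only CLICK x y or DRAG x1 y1 x2 y2.",
    ])


def parse_steps(reply):
    """Extract well-formed steps from a model reply, ignoring any surrounding prose."""
    steps = []
    for line in reply.splitlines():
        match = STEP_PATTERN.match(line)
        if match is None:
            continue
        action = match.group(1).upper()
        numbers = tuple(int(token) for token in match.group(2).split())
        if len(numbers) != ARITY[action]:
            raise ValueError(f"{action} expects {ARITY[action]} numbers: {line!r}")
        steps.append(Step(action, numbers))
    return steps
```

Keeping the step vocabulary this small makes replies from different models directly comparable, since every model is scored on the same parsed action sequence rather than on free-form text.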

Research Implications

Opens new avenues for AI benchmarking using vintage software:

• 16-bit and 32-bit era simulation games as test environments

• Air traffic control simulators for complex scenario testing

• Professional engineering software (OrCAD, Tanner, L-Edit)

• Lower fidelity interfaces enabling focused capability assessment

Current Status

The research prototype successfully demonstrates the concept. With sufficient coaching, frontier models can generate correct task execution steps. Further development is needed for reliable automated problem solving across the full game library.
