FDM-1 stakes a claim as a general computer action model trained on an 11‑million‑hour video corpus, operating at 30 FPS to navigate websites, complete multi-step CAD workflows, and even drive a car. Unlike systems fine-tuned on screenshots, it runs directly on video for long-horizon tasks, and the team reports steady gains with scale. The demos span web use, automated UI testing, and real-world control, signaling breadth across digital and physical interfaces. 🎥🖱️🚗 si.inc
Under the hood, a video encoder compresses almost two hours of 30 FPS footage into about one million tokens, a move aimed at efficient long-context learning. Action labels come from an inverse dynamics model, which infers the action taken between consecutive frames, so training can run on unlabeled internet video rather than contractor annotations. The recipe trades narrow fine-tunes for durable, general skills by fusing perception with action across extended timelines. 🧩⚙️📦 si.inc
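The two ingredients above can be sketched numerically. A minimal illustration, assuming only the reported figures (roughly two hours of 30 FPS video into about one million tokens) and a toy inverse dynamics model; the `toy_idm` function and its cursor-delta logic are hypothetical stand-ins, not the actual FDM-1 pipeline:

```python
# Token budget implied by the reported figures: ~2 hours of
# 30 FPS footage compressed into roughly one million tokens.
FPS = 30
SECONDS = 2 * 3600
TOKEN_BUDGET = 1_000_000

frames = FPS * SECONDS                     # 216,000 frames of context
tokens_per_frame = TOKEN_BUDGET / frames   # ~4.63 tokens per frame
print(f"{frames} frames -> ~{tokens_per_frame:.2f} tokens/frame")

# Inverse-dynamics labeling (toy version): infer the action that
# took the screen from frame t to frame t+1, turning unlabeled
# video into (observation, action) training pairs.
def toy_idm(prev, curr):
    # Hypothetical IDM: here, just the cursor displacement.
    dx = curr["cursor"][0] - prev["cursor"][0]
    dy = curr["cursor"][1] - prev["cursor"][1]
    return ("move", dx, dy) if (dx, dy) != (0, 0) else ("idle", 0, 0)

video = [{"cursor": (0, 0)}, {"cursor": (5, 3)}, {"cursor": (5, 3)}]
labels = [toy_idm(a, b) for a, b in zip(video, video[1:])]
print(labels)  # [('move', 5, 3), ('idle', 0, 0)]
```

The point of the first half is scale: compressing each frame to a handful of tokens is what makes two-hour contexts tractable. The second half shows why IDM labels matter: any screen recording becomes supervised action data with no human annotation.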
The authors position FDM-1 as a coworker for CAD, finance, and engineering, with demonstrations that emphasize end-to-end task execution. Training directly on video and learning without bespoke labels reduce friction for scaling to new domains. If the reported capabilities hold broadly, this approach could reshape how software is operated by automated systems. 📈🔧🌐 si.inc