🚀 NVMe‑to‑GPU streaming fits a 70B‑parameter transformer onto a single RTX 3090 by moving layers over PCIe on demand and, optionally, reading them directly from NVMe without CPU involvement. The ntransformer engine is written in C++ and CUDA, supports several quantization formats, and targets setups where the parameter footprint far exceeds VRAM. Linux and the CUDA Toolkit are required, and a build flag enables the direct NVMe path. The approach aims at practical deployment on commodity cards rather than bespoke multi‑GPU rigs. github.com
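The layer‑streaming idea can be sketched in plain host‑side C++: while the current layer is being computed, the next one is prefetched from storage on a background thread, so transfer latency hides behind compute. This is a minimal illustration, not the engine's actual code; `make_fake_storage`, `load_layer`, and `compute` are hypothetical stand‑ins for NVMe reads and GPU kernels.

```cpp
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical stand-in for NVMe-resident weights: a flat host array
// holding `layers` layers of `layer_size` floats each.
static std::vector<float> make_fake_storage(int layers, int layer_size) {
    std::vector<float> s((size_t)layers * layer_size);
    for (size_t i = 0; i < s.size(); ++i) s[i] = float(i % 7);
    return s;
}

// Stand-in for an NVMe->staging-buffer transfer of one layer's weights.
static void load_layer(const std::vector<float>& storage, int layer,
                       int layer_size, std::vector<float>& buf) {
    buf.assign(storage.begin() + (size_t)layer * layer_size,
               storage.begin() + (size_t)(layer + 1) * layer_size);
}

// Stand-in for running one transformer layer; here just a reduction.
static double compute(const std::vector<float>& buf) {
    return std::accumulate(buf.begin(), buf.end(), 0.0);
}

// Double-buffered streaming: layer i is consumed from one buffer while
// layer i+1 is prefetched into the other on a background thread.
double run_streamed(const std::vector<float>& storage, int layers,
                    int layer_size) {
    std::vector<float> buf[2];
    load_layer(storage, 0, layer_size, buf[0]);  // warm the first buffer
    double total = 0.0;
    for (int i = 0; i < layers; ++i) {
        std::thread prefetch;
        if (i + 1 < layers)
            prefetch = std::thread(load_layer, std::cref(storage), i + 1,
                                   layer_size, std::ref(buf[(i + 1) % 2]));
        total += compute(buf[i % 2]);  // overlaps with the prefetch thread
        if (prefetch.joinable()) prefetch.join();
    }
    return total;
}
```

In the real engine the prefetch would be a `cudaMemcpyAsync` from pinned host memory (or a direct NVMe read) on a separate CUDA stream, but the overlap structure is the same.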
🧠 The core trick is a three‑tier adaptive cache that juggles VRAM, pinned RAM, and NVMe, minimizing stalls while streaming weights. The project reports a 33x speedup over traditional methods under this pipeline, suggesting that careful storage orchestration, not just raw memory capacity, determines throughput. By keeping only the active layers resident and pulling the rest on demand, the system treats fast storage as an extension of memory. The result turns a capacity problem into a scheduling problem that software can optimize. github.com
🛠️ For practitioners, this narrows the gap between the models they want to run and what consumer hardware can actually support. The Hacker News discussion notes that current token rates suit experimentation more than interactive production, but the accessibility shift is clear. The roadmap includes further quantization advances and support for new architectures, suggesting continued gains from software rather than only silicon upgrades. This is a blueprint for squeezing more from hardware many people already own. github.com