A serious bug is causing major headaches for early adopters of Nvidia's new Blackwell GPUs, including the flagship RTX 5090 and the professional RTX PRO 6000. Reports from a GPU cloud provider, CloudRift, and various community forums confirm the issue, which can leave a GPU completely unresponsive and force a full system reboot to fix. The problem has become so significant that a $1,000 bug bounty has been offered to anyone who can find a solution.
The root of the issue appears to be a virtualization reset bug. When a GPU that was passed through to a virtual machine (VM) is handed back to the host, for example after the VM shuts down, a PCIe function-level reset (FLR) is supposed to return the card to a clean state. On these new GPUs the reset fails, leaving the card frozen and undetectable by the host machine. The only way to get the card working again is a hard reboot of the entire system.
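For readers who want to check whether a passed-through card is still alive after a VM shuts down, the following is a minimal Python sketch, not a fix or an official diagnostic. It assumes a Linux host that exposes PCI devices under /sys/bus/pci/devices, and a kernel recent enough (roughly 5.15 or later) to provide the per-device `reset_method` attribute. The function names and the address `0000:01:00.0` are hypothetical placeholders. The check relies on the general PCIe behavior that reads from a device that has dropped off the bus come back as all ones; whether the Blackwell failure always shows up this way is an assumption.

```python
from pathlib import Path

# Standard Linux sysfs location for PCI devices (assumed; Linux only).
SYSFS_PCI = Path("/sys/bus/pci/devices")


def reset_methods(bdf, root=SYSFS_PCI):
    """Return the reset methods the kernel lists for a device, e.g. ['flr', 'bus'].

    Returns an empty list if the attribute is missing (older kernel,
    or the device is no longer present in sysfs).
    """
    attr = Path(root) / bdf / "reset_method"
    try:
        return attr.read_text().split()
    except OSError:
        return []


def is_responding(bdf, root=SYSFS_PCI):
    """Return True if the device still answers config-space reads.

    The first four bytes of config space hold the vendor and device IDs.
    A device that has fallen off the bus typically reads back as 0xFF bytes.
    """
    cfg = Path(root) / bdf / "config"
    try:
        with open(cfg, "rb") as f:
            head = f.read(4)
    except OSError:
        return False
    return len(head) == 4 and head != b"\xff\xff\xff\xff"


if __name__ == "__main__":
    gpu = "0000:01:00.0"  # hypothetical address; find yours with lspci
    print("reset methods:", reset_methods(gpu))
    print("responding:", is_responding(gpu))
```

Running this before starting a VM and again after shutting it down gives a quick before-and-after comparison. A card that reports `flr` among its reset methods but stops responding after the VM exits would be consistent with the behavior described in the reports.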
While this bug poses a serious risk to multi-tenant AI workloads and cloud-based systems, it is also affecting home users and enthusiasts. Reports on the Proxmox and Level1Techs forums show that even home lab setups are experiencing complete host hangs and GPU failures after a virtual machine is shut down. One user noted that their host became unresponsive after a Linux VM was shut down, a problem they never experienced with their previous RTX 4080.
The issue seems to be limited to the new Blackwell family of cards, as older models like the RTX 4090 are unaffected. As of now, Nvidia has not officially acknowledged the bug, and there is no known mitigation. For anyone using or planning to use GPU passthrough, this bug presents a significant risk, as a single card failure could take down the entire system.