PyTorch Bug: Corrupted Tensors From Failed Storage Resizes
Have you ever encountered a situation in PyTorch where your tensors seem to behave erratically, leading to cryptic errors or even crashes like segmentation faults? It’s a frustrating experience, especially when you’re deep into your machine learning workflow. Well, it turns out there's a specific bug in PyTorch that can cause exactly this kind of trouble, particularly when you’re dealing with tensors that have shared storage and a resize operation fails. This issue, which we'll explore in detail, affects how PyTorch updates tensor metadata, leaving you with what can be described as a 'Zombie' tensor – a tensor with seemingly valid dimensions but no actual data behind it.
Understanding the Problem: The 'Zombie' Tensor Scenario
Let's dive into the core of the issue. PyTorch is designed to be robust, and when you try to perform an operation that's fundamentally not allowed, it should ideally handle it gracefully. One such scenario involves resizing a tensor's storage when that storage is not resizable. This commonly happens when a tensor shares its underlying storage with a NumPy array that was directly injected into PyTorch using set_(). Storage created this way is backed by memory that PyTorch does not own, so PyTorch cannot grow it.
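To make that restriction concrete, here is a small illustrative sketch (separate from the minimal reproduction later in this post). It only uses standard torch and numpy calls, and the exact wording of the error message may vary between PyTorch versions: a tensor built with from_numpy() borrows the NumPy buffer, so growing its storage from the PyTorch side fails.
import torch
import numpy as np
# from_numpy() shares memory with the NumPy array; PyTorch does not own the buffer,
# so the resulting storage cannot be grown from the PyTorch side.
arr = np.zeros(4, dtype=np.float32)
shared = torch.from_numpy(arr)
try:
    shared.resize_((10,))  # would need more bytes than the borrowed buffer holds
except RuntimeError as err:
    print(err)  # e.g. "Trying to resize storage that is not resizable"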
When resize_() is called on such a tensor, PyTorch should recognize that the storage is immutable and prevent the resize. Indeed, it does raise a RuntimeError with a message like: "Trying to resize storage that is not resizable." This is good – the program stops the invalid operation. However, the bug lies in the timing of these checks. Before PyTorch fully verifies that the storage can be resized, it updates the tensor's shape and stride metadata to reflect the intended new size. When the check then fails, the RuntimeError is thrown, but the tensor's metadata has already been altered.
This leaves the tensor in a precarious state. Its shape attribute reports a new, larger shape (e.g., torch.Size([5, 5, 5])), but its actual storage() is still the original one, in this case an empty storage of 0 bytes. This mismatch between the tensor's declared size and its actual underlying data capacity is what creates the 'Zombie' tensor: an object that looks like a tensor of a certain shape, but with no data to back it up. Subsequently trying to access or print this tensor often results in a Segmentation Fault or another internal RuntimeError, because the program is attempting to operate on data that simply doesn't exist in the allocated storage.
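If you suspect a tensor has ended up in this state, you can check for the mismatch without ever dereferencing its data. The helper below is my own sketch (looks_corrupted is not a PyTorch API) and assumes a contiguous tensor with a zero storage offset.
import torch
def looks_corrupted(t: torch.Tensor) -> bool:
    # Bytes the metadata claims to need (contiguous case) vs. bytes actually allocated.
    needed = t.numel() * t.element_size()
    available = t.untyped_storage().nbytes()
    return needed > available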
The core problem: resize_() is not exception-safe when the storage cannot be resized. Even though an exception is raised, the internal state of the tensor is left corrupted. Ideally, if an operation fails, all related metadata should revert to its state before the operation began. Here, the shape and stride metadata are not reverted, leading to the observed corruption. This is a critical bug because it can produce hard-to-debug crashes in applications that rely on PyTorch's tensor manipulations.
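Until a fix lands upstream, a defensive wrapper can approximate exception-safe behavior from user code. This is only a sketch, under the assumption that restoring the old view metadata with as_strided_() is acceptable for your tensors; safe_resize_ is a hypothetical name, not part of PyTorch.
import torch
def safe_resize_(t: torch.Tensor, new_shape) -> torch.Tensor:
    # Snapshot the view metadata, then roll it back if resize_() raises,
    # so a failed resize cannot leave shape and storage out of sync.
    old_shape, old_stride = tuple(t.shape), t.stride()
    try:
        return t.resize_(new_shape)
    except RuntimeError:
        t.as_strided_(old_shape, old_stride)  # restore the original shape and stride
        raise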
Minimal Reproduction of the Bug
To truly grasp the severity and nature of this bug, let’s walk through a minimal reproduction case provided by the researchers. This code snippet clearly demonstrates how to trigger the issue and observe the corrupted state:
import torch
import numpy as np
# Create non-resizable storage (0 bytes)
locked_storage = torch.from_numpy(np.array([], dtype=np.int32)).untyped_storage()
# Inject into a fresh tensor
t = torch.tensor([], dtype=torch.int32)
t.set_(locked_storage)
# Attempt to resize (Expected: Fail, maintain original shape)
# (Actual: Fails, but updates shape to 5x5x5)
try:
    t.resize_((5, 5, 5))
except RuntimeError:
    pass
# Verify corruption
print(f"Shape: {t.shape}") # Prints: torch.Size([5, 5, 5])
print(f"Storage: {t.untyped_storage().nbytes()}") # Prints: 0
print(t) # CRASH
In this example, we first obtain an untyped_storage that is empty (0 bytes) by converting an empty NumPy array to a tensor and taking its storage. This creates the condition where the storage is fixed and cannot be resized. We then inject this storage into a new tensor t via set_(). When t.resize_((5, 5, 5)) is called, PyTorch attempts to change the tensor's dimensions. As expected, it encounters the non-resizable storage and raises a RuntimeError. However, as the code shows, the tensor’s shape is erroneously updated to torch.Size([5, 5, 5]) before the exception is raised. The storage size remains 0. The subsequent print(t) statement, which tries to access the tensor's elements based on its reported shape, triggers the crash because it attempts to read data from memory that was never allocated.
Expected Behavior vs. Actual Behavior:
- Expected: If resize_() fails due to immutable storage, the tensor's metadata (shape, stride) should remain unchanged. The shape should stay torch.Size([0]), consistent with the empty storage. The operation would fail cleanly, and no corruption would occur.
- Actual: The RuntimeError is raised, but the tensor's shape metadata is incorrectly updated to the target size (torch.Size([5, 5, 5]) in the example). This desynchronization between shape and storage leads to crashes upon access.
This minimal reproduction clearly illustrates the bug, highlighting the importance of exception safety in library implementations. The provided gist also mentions that in a more complex loop, the original program encountered a segmentation fault rather than just a RuntimeError on print, underscoring the potential for severe instability caused by this issue.
Analyzing the Root Cause and Potential Solutions
This bug stems from a fundamental issue in how PyTorch handles exceptions during tensor operations, specifically within the resize_() method when it interacts with tensor storage. The problem isn't that PyTorch fails to detect an invalid operation – it correctly identifies that the storage cannot be resized and raises a RuntimeError. The critical flaw is that the tensor's internal metadata, its shape and stride information, is modified before the final check that would throw the exception. This means that even though the operation is aborted, the tensor object is left in an inconsistent state.
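The ordering problem can be illustrated with a deliberately toy model. The sketch below is not PyTorch's actual C++ implementation; it only contrasts the "mutate then validate" ordering with the exception-safe "validate then mutate" ordering.
from dataclasses import dataclass

@dataclass
class ToyTensor:
    # Tiny stand-in for a tensor: a shape plus a flag for whether its "storage" may grow.
    shape: tuple
    storage_resizable: bool

def buggy_resize(t: ToyTensor, new_shape: tuple) -> None:
    t.shape = new_shape  # metadata mutated first...
    if not t.storage_resizable:
        raise RuntimeError("storage is not resizable")  # ...then the check fails, too late

def exception_safe_resize(t: ToyTensor, new_shape: tuple) -> None:
    if not t.storage_resizable:  # validate before mutating anything
        raise RuntimeError("storage is not resizable")
    t.shape = new_shape  # metadata changes only on success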
Think of it like this: you're trying to change the size of a locked box. Before checking if the lock is truly unbreakable, you start writing down the new dimensions on a piece of paper. Then, you discover the lock is indeed unbreakable. You throw away the idea of changing the box, but you forget to erase the dimensions you wrote down on the paper. Now, you have a piece of paper with dimensions for a box that you never actually resized, and this paper is supposed to represent the box. If you try to use this piece of paper to interact with the original, unchanged box, you'll run into confusion and errors.
In PyTorch terms, the tensor object has pointers and attributes that define its shape, strides, and the actual memory buffer (storage) it uses. When resize_() is called, it first prepares the metadata for the new shape. Then, it attempts to perform the actual resizing of the underlying storage. If the storage is immutable (like when it's tied to a NumPy array), this storage resizing fails, triggering an exception. However, because the metadata update happened before this failure was definitively known, the metadata remains in the