PyTorch Bug: Tensor Corruption On Failed Resize

Alex Johnson

Hey there, fellow PyTorch enthusiasts! We've stumbled upon a peculiar and potentially problematic bug in PyTorch, and we want to shed some light on it. It involves how the library handles tensor shape metadata when an attempt to resize storage fails. When resize_() is called on a tensor that shares its storage with a non-resizable buffer, the failed operation can leave behind corrupted "zombie" tensors. These can later trigger unexpected crashes, such as segmentation faults or internal runtime errors, which are never fun when you're in the middle of developing or running your models. Let's dive into what's happening, why it's happening, and what it means for your PyTorch projects.

Understanding the "Zombie Tensor" Bug in PyTorch

So, what exactly is this "zombie tensor" bug? It starts when you try to resize a tensor whose storage cannot be resized. A prime example is a tensor backed by a NumPy array's memory via set_(). In that case, PyTorch correctly detects the problem and raises a RuntimeError: "Trying to resize storage that is not resizable." So far, so good: PyTorch knows the operation is impossible.

The problem is that the operation is not fully exception-safe. Before the check for non-resizable storage fails, PyTorch has already updated the tensor's shape and stride metadata to the new target size. This leaves the tensor in an inconsistent state: its .shape attribute reports a large, seemingly valid size (torch.Size([5, 5, 5]) in our example), while its underlying storage (.storage()) still holds zero bytes of data. It's like having the blueprint for a mansion but only enough materials to build a tiny shed; there is a fundamental disconnect between what the tensor thinks it is (its metadata) and what it actually is (its storage capacity).

When you then try to access or print this "zombie" tensor, PyTorch cannot reconcile the discrepancy, and you get one of those dreaded crashes. It's a subtle bug, but one that can cause significant headaches if you're not aware of it. Below, we walk through a minimal reproduction, the observed behavior, and the expected, safer outcome.

Minimal Reproduction and Observed Behavior

To really get a handle on this PyTorch bug, let's walk through a minimal reproduction scenario. This will help demystify the process and show you exactly how this "zombie tensor" state can be achieved. We'll then look at what happens when the code runs, highlighting the discrepancy between the expected and actual outcomes.

First, we need to create a scenario where we have storage that absolutely cannot be resized. The easiest way to do this in PyTorch is to leverage NumPy arrays. We start by creating an empty NumPy array with a specific data type (e.g., np.int32) and then convert it into PyTorch's internal untyped_storage. This locked_storage is essentially a fixed block of memory that PyTorch won't attempt to reallocate or resize later.

import torch
import numpy as np

# Create non-resizable storage (0 bytes)
locked_storage = torch.from_numpy(np.array([], dtype=np.int32)).untyped_storage()

Next, we create a fresh, empty PyTorch tensor. Crucially, we then set this tensor to use the locked_storage we just created. This is where the tensor becomes tied to the non-resizable memory.

# Inject into a fresh tensor
t = torch.tensor([], dtype=torch.int32)
t.set_(locked_storage)

Now comes the critical part: attempting to resize this tensor. We'll try to resize it to a shape that requires more memory than our current 0-byte storage can provide, say (5, 5, 5).

# Attempt to resize (Expected: Fail, maintain original shape)
# (Actual: Fails, but updates shape to 5x5x5)
try:
    t.resize_((5, 5, 5))
except RuntimeError:
    pass

According to PyTorch's design, this resize_() operation should fail when it detects that the underlying storage (locked_storage) cannot be resized. And indeed, it does throw a RuntimeError with the message "Trying to resize storage that is not resizable." The try...except block catches this error, preventing the program from crashing at this exact point.

However, this is where the bug manifests. After the exception is caught, the tensor's metadata is left in an inconsistent state. Let's check:

# Verify corruption
print(f"Shape: {t.shape}")       # Prints: torch.Size([5, 5, 5])
print(f"Storage: {t.untyped_storage().nbytes()}") # Prints: 0
print(t) # CRASH

As you can see, t.shape is now reported as torch.Size([5, 5, 5]), indicating a tensor that should hold a significant amount of data. But t.untyped_storage().nbytes() proudly displays 0, confirming that the underlying storage is still empty. The final print(t) line is where the real trouble usually occurs. Because the shape indicates a large tensor but the storage is empty, attempting to access the data to print it results in a crash. In the provided gist, this manifested as a RuntimeError on print, but in other scenarios, especially within complex loops, it has been observed to cause a full Segmentation Fault, which is a much more severe and harder-to-debug issue. This is the "zombie tensor" – it has a shape, but no substance, and it lurks in your program, waiting to cause trouble.
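Until this is fixed upstream, a defensive consistency check can flag a zombie tensor before you touch its data. The helper below is a sketch, not an official PyTorch API: it computes the highest byte offset the tensor's shape and strides could address (assuming the usual strided layout) and compares it against what the storage actually holds.

```python
import torch

def is_consistent(t: torch.Tensor) -> bool:
    """Return True if t's storage is large enough for its shape and strides.

    Defensive sketch, not an official PyTorch API: we compute the largest
    element index the tensor could address and compare the implied byte
    requirement against the storage's actual size.
    """
    if t.numel() == 0:
        return True  # an empty tensor addresses no storage at all
    # Largest linear element index reachable through the strides
    max_index = t.storage_offset() + sum(
        (size - 1) * stride for size, stride in zip(t.shape, t.stride())
    )
    required_bytes = (max_index + 1) * t.element_size()
    return t.untyped_storage().nbytes() >= required_bytes

# A healthy tensor passes the check; a zombie tensor (big shape, 0-byte
# storage) would fail it, letting you bail out before a crash.
healthy = torch.zeros(5, 5, 5, dtype=torch.int32)
print(is_consistent(healthy))  # True
```

Running this check before printing or indexing a tensor that went through a failed resize_() lets you fail gracefully instead of segfaulting.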

Expected vs. Actual Behavior: The Importance of Exception Safety

The core of this PyTorch bug lies in a violation of what's known as the Strong Exception Guarantee. In software engineering, a strong exception guarantee means that if a function or operation fails (throws an exception), the program should be left in the same state as it was before the operation was attempted. No partial updates, no corrupted data, just as if the operation never happened. This is the ideal scenario because it ensures predictability and prevents the introduction of subtle bugs.
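To make the guarantee concrete, here is a toy illustration (not PyTorch's actual internals) of the two orderings. The "unsafe" method mutates state before validating, mirroring the bug; the "safe" method validates everything that can fail before touching any state, which is the standard way to provide a strong guarantee.

```python
class FixedBuffer:
    """Toy container with non-resizable backing storage, for illustration only."""

    def __init__(self, capacity: int):
        self.capacity = capacity  # number of elements the storage can hold
        self.shape = (0,)

    def resize_unsafe(self, shape):
        # Buggy ordering: mutate first, validate second (mirrors the PyTorch bug)
        self.shape = shape
        needed = 1
        for dim in shape:
            needed *= dim
        if needed > self.capacity:
            raise RuntimeError("storage is not resizable")

    def resize_safe(self, shape):
        # Strong guarantee: validate first, mutate only once nothing can fail
        needed = 1
        for dim in shape:
            needed *= dim
        if needed > self.capacity:
            raise RuntimeError("storage is not resizable")
        self.shape = shape

buf = FixedBuffer(capacity=0)
try:
    buf.resize_safe((5, 5, 5))
except RuntimeError:
    pass
print(buf.shape)  # (0,) -- the failed call left no trace
```

With resize_unsafe, the same failed call would leave buf.shape as (5, 5, 5): exactly the zombie state described above.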

In the case of resize_() on a tensor with non-resizable storage, here's how the guarantees stack up:

  • Expected Behavior (Strong Exception Guarantee): When t.resize_((5, 5, 5)) is called, PyTorch should first check if the underlying storage is resizable. If it's not, it should immediately raise the RuntimeError without modifying the tensor's existing metadata. Therefore, after the except RuntimeError: block, the tensor t should still have its original shape, which in our minimal example is torch.Size([0]). The storage size would remain 0 bytes. This is a safe state; the operation failed cleanly, and the tensor is unchanged.

  • Actual Behavior (Weak Exception Guarantee / Corrupted State): As we saw in the reproduction, PyTorch updates the tensor's shape metadata before it fully validates the storage. So, when the RuntimeError is thrown, the shape has already been changed to torch.Size([5, 5, 5]). The storage, however, remains at 0 bytes because the resize operation ultimately failed. This creates a critical inconsistency: the tensor thinks it has a large shape but has no data to back it up. Accessing this tensor, even just to print its contents, leads to undefined behavior, often resulting in a crash (a RuntimeError on print, or a more severe Segmentation Fault). This is a weaker guarantee, often referred to as a basic exception guarantee, and in this case even that is arguably violated, since the tensor is not left in any valid state at all.
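Until the check ordering is fixed in PyTorch itself, one hedged workaround is to wrap resize_() and roll the metadata back if it throws. The helper below is a sketch built on the public set_() API; it assumes the failed call leaves the storage, offset, and strides otherwise usable, which matches the behavior observed in the reproduction.

```python
import torch
import numpy as np

def safe_resize_(t: torch.Tensor, shape) -> bool:
    """Attempt t.resize_(shape); on failure, restore t's original metadata.

    Returns True on success, False on failure. Workaround sketch, not an
    official PyTorch API.
    """
    old_shape, old_stride = tuple(t.shape), t.stride()
    old_offset = t.storage_offset()
    try:
        t.resize_(shape)
        return True
    except RuntimeError:
        # Roll back the (possibly corrupted) shape/stride metadata so the
        # tensor stays consistent with its storage.
        t.set_(t.untyped_storage(), old_offset, old_shape, old_stride)
        return False

# Reproduce the locked-storage scenario from above
locked = torch.from_numpy(np.array([], dtype=np.int32)).untyped_storage()
t = torch.tensor([], dtype=torch.int32)
t.set_(locked)

ok = safe_resize_(t, (5, 5, 5))
print(ok)       # False
print(t.shape)  # torch.Size([0]) -- metadata restored, no zombie tensor
```

On a tensor with ordinary resizable storage the wrapper behaves exactly like resize_() and returns True, so it can be used as a drop-in replacement while the upstream fix lands.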
