Attempting to extend an "ATS-only" datastore with a non-ATS LUN, or issues with ATS Heartbeats on certain storage firmware.

Troubleshooting & Resolution
Warning: Deactivating this setting may increase metadata locking overhead across your cluster. Use this strictly as a diagnostic step or temporary workaround while working with vendor support.

Summary Checklist
Diagnostic Area | Action Item
The mechanism is often implemented via commands or similar primitives (e.g., NVMe Compare and Write, or Linux’s BLKZEROOUT with verification).
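The compare-and-write idea can be sketched in a few lines of Python. This is a toy in-memory model, not how any array or ESXi implements it: the `Lun` class, its method names, and the 512-byte block size are all assumptions made for illustration, and a `threading.Lock` stands in for the atomicity the hardware provides.

```python
import threading


class Lun:
    """Toy in-memory LUN (hypothetical); the lock models the array's atomicity."""

    def __init__(self, num_blocks, block_size=512):
        self.block_size = block_size
        self._blocks = [bytes(block_size) for _ in range(num_blocks)]
        self._lock = threading.Lock()

    def read(self, lba):
        with self._lock:
            return self._blocks[lba]

    def compare_and_write(self, lba, expected, new):
        """Write `new` only if the block still equals `expected`.

        Returns True on success and False on a miscompare.
        """
        with self._lock:
            if self._blocks[lba] != expected:
                return False  # the "test" failed: someone changed the block
            self._blocks[lba] = new  # the "set" happens in the same step
            return True


lun = Lun(num_blocks=4)
zero = bytes(512)
host_a = b"A" * 512
host_b = b"B" * 512

first = lun.compare_and_write(0, zero, host_a)   # block is still zero
second = lun.compare_and_write(0, zero, host_b)  # block now holds host A's data
```

In this sketch the second caller still expects the block to be zero, so its write is rejected and the block keeps the first caller's data.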
Firmware bugs or misconfigurations on the storage array can lead to incorrect reporting of block states.
ESXi hosts do not use ATS only for metadata locks. They also use the SCSI Compare and Write command as a heartbeat mechanism to periodically check the health and accessibility of datastores. These heartbeats are lightweight ATS operations that verify a host's ability to read and write to a critical block on the LUN.
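A heartbeat built on compare-and-write might look like the sketch below. The slot layout (an owner id and a generation counter packed as two 64-bit integers) and the names `HeartbeatSlot` and `beat` are hypothetical; the real VMFS on-disk heartbeat format is not described here.

```python
import struct

SLOT_FORMAT = "<QQ"  # (owner id, generation): an assumed layout, not VMFS's


class HeartbeatSlot:
    """Toy stand-in for one on-disk heartbeat block."""

    def __init__(self):
        self.data = struct.pack(SLOT_FORMAT, 0, 0)

    def compare_and_write(self, expected, new):
        if self.data != expected:
            return False  # miscompare
        self.data = new
        return True


def beat(slot, host_id, last_written):
    """Bump the generation only if the slot still holds what we last wrote.

    Returns the new slot image on success, or None on a miscompare.
    """
    _, generation = struct.unpack(SLOT_FORMAT, last_written)
    new = struct.pack(SLOT_FORMAT, host_id, generation + 1)
    if slot.compare_and_write(last_written, new):
        return new
    return None


slot = HeartbeatSlot()
image = beat(slot, host_id=7, last_written=slot.data)

# Simulate an array that silently loses the last write.
slot.data = struct.pack(SLOT_FORMAT, 0, 0)
after_loss = beat(slot, host_id=7, last_written=image)
```

The second call models the failure described above: the host's expected image no longer matches what the array returns, so the heartbeat reports a miscompare even though no other host touched the slot.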
In VMware environments, the feature uses Atomic Test and Set (ATS) commands. If the underlying storage array has a firmware bug or a momentary timeout, the ATS equality test may wrongly return false, leading to VM freezes or "Lost access to volume" messages.

3. Latency and Connectivity Spikes
The most frequent cause is simple resource starvation. If hundreds of virtual machines on different hosts are demanding high input/output operations per second (IOPS) simultaneously, metadata updates stack up. The time elapsed between a host’s "Test" phase and its "Set" phase widens, dramatically increasing the probability that a neighboring host will modify the block first.

2. Excessive Micro-Operations
The error occurs when the ESXi host attempts to update a block but finds that the existing data on that block does not match what it expected (the "test" part of "test and set" failed). This typically signifies lock contention or a mismatch in state between the host and the storage array (Broadcom support portal).

Common Causes
Performance issues with VM operations
When the OS asks, "Is this zero?" the drive lies and says "Yes" (because it forgot it wrote something else). Then the atomic compare fails.
| Level | Action | Steps & Details |
|:-----:|:-------|:----------------|
| 1 | | Identify your Fibre Channel HBA, iSCSI adapter, or RAID controller model. Check the VMware Compatibility Guide (VCG) for the recommended driver and firmware version for your ESXi build. Upgrade to these versions. The affected user found that upgrading a "buggy FC HBA driver" resolved the issue. |
| 2 | Force a Locking Mechanism Downgrade | Switch the VMFS datastore from "ATS-Only" locking to "ATS+SCSI" locking. This forces the cluster to fall back to traditional SCSI reservations for some operations. Use the command: `esxcli storage vmfs lockmode set -s\|--scsi -l\|--volume-label=<VMFS label> -u\|--volume-uuid=<VMFS UUID>`. This is a diagnostic step that impacts performance. It can be reverted by switching the datastore back to ATS-only with the `-a\|--ats` option of the same command. |
| 3 | Perform a Full Storage & Hardware Audit | Check all physical components (cables, SFPs, switches). Check the storage array's logs for hardware errors. Run a full disk check on the affected volume. Update the storage array's firmware. |
| 4 | Contact VMware or Storage Vendor Support | If the error persists, open a support case with VMware and your storage array vendor. Provide the `vmkernel.log`, `vobd.log`, HBA driver details, and storage firmware versions. This is complex behavior that often requires multi-vendor collaboration to resolve. |
If the block on the disk has changed since the host last checked it, the equality test returns false. The array then returns an "ATS Miscompare" error.

Common Causes of This Error
This is a more severe case. A host's in-memory view of the metadata for a particular block might be outdated. For example, a host might crash and reboot. Upon restarting, its VMFS driver might have an incorrect, stale version of the block's expected value. The ATS operation will fail as a safety mechanism to prevent corruption.
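The stale-view case can be illustrated with a generic optimistic-concurrency pattern: attempt the update with the cached value, and on a miscompare re-read the block and retry. This is a sketch under assumptions, not ESXi's actual recovery logic; the dictionary-backed disk and the names `compare_and_write` and `update_block` are hypothetical.

```python
def compare_and_write(disk, lba, expected, new):
    """Toy ATS: write `new` only if the block still equals `expected`."""
    if disk[lba] != expected:
        return False  # miscompare protects the block from a stale writer
    disk[lba] = new
    return True


def update_block(disk, lba, cached, make_new, max_retries=3):
    """Apply an update, refreshing a stale cached view after each miscompare.

    Returns the number of miscompares seen before the write succeeded.
    """
    expected = cached
    for attempt in range(max_retries + 1):
        if compare_and_write(disk, lba, expected, make_new(expected)):
            return attempt
        expected = disk[lba]  # re-read the current on-disk value
    raise RuntimeError("persistent miscompare; on-disk state keeps changing")


# The host rebooted holding an old view (b"v1") while the disk moved on to b"v2".
disk = {0: b"v2"}
miscompares = update_block(disk, 0, cached=b"v1", make_new=lambda cur: cur + b"+lock")
```

The first attempt is rejected because the cached value is out of date, which is the safety behaviour described above; only after re-reading does the update go through, and it is built on the current contents rather than the stale ones.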
The entire process happens at the hardware level in a single, uninterrupted (atomic) step.

Decoding the Error Message