What Happened
The build quality of my previous laptop (a Lenovo Legion R720) wasn't great. Over 4 years of use, I swapped its keyboard once and replaced its failed mechanical hard drive with a SATA SSD. In addition, every time I opened its back panel, bits of plastic or even screws would fall out of the back panel or the hinge. These were strong indications that the laptop already had one foot in the grave. So I took advantage of a special deal, got a new laptop for cheap, and planned to migrate my data over to it.
The operating system I daily drive, Arch Linux, lives on a 1TB Western Digital SN550 SSD I bought half a year ago, which holds both the EFI boot partition and the Btrfs-formatted root partition. The usual procedure would be to reinstall the operating system on the new laptop, but I had a second thought. Apart from some tweaks to the touchpad and the NVIDIA GPU, I didn't have much hardware-specific configuration on the OS, so I could simply adjust or remove those settings and have a working system. Easy peasy.
So I took that drive and inserted it into the second M.2 slot of my new laptop. I turned on the computer and saw the EFI boot partition automatically picked up and listed in the BIOS boot menu. I was back in business in no time.
As a regular Arch Linux user, of course the first thing I did after logging in was
sudo pacman -Syu. As the system update ran, a "No space left on device" error popped up, and the update process was aborted. I ran
df -h and saw that I had only used 600GB of that 1TB drive. What? I reran
sudo pacman -Syu and saw another error: "Read-only file system." I ran
mount and found that the root partition was now mounted
ro (read-only).
I took a look at
dmesg and saw a bunch of Btrfs error messages. They said some data blocks had incorrect checksums.
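For anyone facing a similar wall of kernel messages, here is a minimal sketch that condenses data checksum failures into a short list of affected subvolume roots and inode numbers. It assumes the kernel logs such failures as lines containing `csum failed root <id> ino <number>`; the exact wording varies between kernel versions, and the function name `csum_failed_inodes` is made up for this example.

```shell
#!/bin/sh
# Read kernel log text on stdin and print unique "root inode" pairs for
# Btrfs data checksum failures.
# Assumption: failures look like "... BTRFS ... csum failed root 5 ino 257 ...".
# Adjust the patterns if your kernel words these messages differently.
csum_failed_inodes() {
    awk '/BTRFS/ && /csum failed/ {
        root = ""; ino = ""
        # Walk the fields and pick the values that follow "root" and "ino".
        for (i = 1; i < NF; i++) {
            if ($i == "root") root = $(i + 1)
            if ($i == "ino")  ino  = $(i + 1)
        }
        if (root != "" && ino != "") print root, ino
    }' | sort -u
}

# Typical use (reading the kernel log may need root):
#   dmesg | csum_failed_inodes
```

An inode number can usually be mapped back to a file path with `btrfs inspect-internal inode-resolve <inode> <mountpoint>` on the mounted filesystem, which makes it easier to decide what needs reinstalling or restoring.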
I turned off the computer, grabbed my USB drive, and booted into the Arch Linux installation ISO on it. I first ran
smartctl -a /dev/nvme0n1 and saw that the drive was 100% healthy (percent of available spare blocks) with 0 data errors. Weird. I then ran
btrfs check /dev/nvme0n1p2 and after a 10-minute scan, it found 3 corrupted files. Two of them were packages downloaded by
pacman, and the third belonged to the Java OpenJFX component. I noted down the name of that component so I could reinstall the package later. I removed the 3 corrupted files and reran
btrfs check. Everything was back in order.
No big deal
While I was rebooting the computer and wondering how the Btrfs partition had ended up with corrupted data out of nowhere, the system hit a kernel panic early in the startup process, with dozens of Btrfs functions in the call stack.
As I booted from my USB drive again, I suddenly recalled something. I DID have some hardware-specific configuration on my previous laptop. To be specific, I used Intel-Undervolt to undervolt the CPU by 0.1V. For 7th-gen Intel laptop CPUs, like the i7-7700HQ in my previous laptop, a 0.1V undervolt doesn't impact stability, yet this minor tweak reduces heat output and power consumption by a lot. However, starting with 8th-gen CPUs, under increasing competition from AMD, Intel has been pushing its CPUs to their limits, leaving little room for undervolting. My configuration therefore undervolted the new CPU too far and made it unstable.
No big deal
I mounted the Btrfs partition, removed the configuration and the binary of the Intel-Undervolt program, then rebooted again. This time, although I didn't see another Kernel Panic, Btrfs printed many messages instead about how the metadata was damaged, and it couldn't mount my
As a long-time Btrfs user, I knew that Btrfs had an extremely complex on-disk format, and the chance of data recovery with damaged metadata was close to zero. To avoid such situations, I enabled the
-m dup option a while back, to store two copies of metadata on the disk. Back in the installation environment, I ran
btrfs check again. It found a damaged metadata block and reported that its checksum was incorrect. It then fell back to the duplicate copy of that metadata block, and printed the exact same checksum error.
Both metadata blocks are dead
I tried to copy my data off first. I plugged in my backup USB drive and ran
cp -r /mnt/home /backup_usb/, only to see the console covered with "Input/output error", and no file made its way to the backup drive. Out of desperation, I ran
btrfs check --repair, which cleared my entire
I took a look at the backup drive. My last backup was from August this year. Thankfully, I usually push most of my files to GitHub or my private Git server, and most of the remaining files can be obtained from other places (like my Steam library). So I didn't lose too much data from those 4 months.
All that's left is the boring reinstallation process.
Most likely, the Btrfs metadata was corrupted on my second attempt to boot the operating system, the one that ended in a Btrfs kernel panic. The likely cause is the CPU undervolt, which either made communication between the CPU and the RAM or SSD unstable, or caused errors in checksum computation.
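One plausible reading of why both metadata copies failed with the same checksum error is that they were written from the same buffer, so anything that goes wrong between checksumming and writing lands in both copies. The sketch below is a toy model of that idea using plain files and `cksum`. It is not Btrfs code, and the function name `simulate_dup_write` is invented for the illustration.

```shell
#!/bin/sh
# Toy model of a "dup" write: one in-memory buffer, one checksum,
# two on-disk copies. This only illustrates the reasoning; it does not
# reproduce how Btrfs actually lays out or verifies metadata.
simulate_dup_write() {
    corrupt_before_write=$1          # "yes" alters the buffer after checksumming
    work=$(mktemp -d) || return 1

    printf 'metadata block\n' > "$work/buffer"
    expected=$(cksum < "$work/buffer" | cut -d' ' -f1)

    # Stand-in for an unstable CPU or memory path damaging the buffer
    # after its checksum was computed but before it is written out.
    if [ "$corrupt_before_write" = yes ]; then
        printf 'metadata blXck\n' > "$work/buffer"
    fi

    # Both copies come from the same (possibly damaged) buffer.
    cp "$work/buffer" "$work/copy1"
    cp "$work/buffer" "$work/copy2"

    for copy in copy1 copy2; do
        actual=$(cksum < "$work/$copy" | cut -d' ' -f1)
        if [ "$actual" = "$expected" ]; then
            echo "$copy: ok"
        else
            echo "$copy: checksum mismatch"
        fi
    done
    rm -rf "$work"
}

simulate_dup_write no    # both copies verify
simulate_dup_write yes   # both copies fail in the same way
```

In this model the second copy protects against one copy going bad on the disk, but not against bad data being handed to the disk in the first place.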
The adjustments made by Intel-Undervolt remain in effect until the system is completely powered off. Since I turned off the computer by long-pressing the power button, it cold-booted every time, so the CPU voltage was normal in the installation environment on my USB drive, and operations there executed correctly and stably.
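A quick scan for leftover hardware-specific tuning before moving a system drive to a different machine could have caught this. The following is a rough sketch, not an exhaustive check: the function name `list_hw_tuning` is hypothetical, and the candidate paths are assumptions based on common default locations (an intel-undervolt config file and service, Xorg snippets, and modprobe options), so adapt the list to your own setup.

```shell
#!/bin/sh
# Print hardware-specific config files found under a root directory.
# Pass the mount point of the installed system (for example /mnt when
# working from a live USB), or nothing to check the running system.
# Assumption: the paths below are illustrative defaults, not a complete list.
list_hw_tuning() {
    root=${1:-/}
    prefix=${root%/}
    for path in \
        "$prefix"/etc/intel-undervolt.conf \
        "$prefix"/etc/systemd/system/*.wants/intel-undervolt*.service \
        "$prefix"/etc/X11/xorg.conf.d/*.conf \
        "$prefix"/etc/modprobe.d/*.conf
    do
        # Unmatched globs stay literal in sh, so skip anything that does
        # not exist. -L also catches symlinks whose target is missing,
        # which is common when inspecting a mounted root from a live system.
        if [ -e "$path" ] || [ -L "$path" ]; then
            printf '%s\n' "$path"
        fi
    done
}

# Example:
#   list_hw_tuning /mnt
```

Anything it prints is worth reviewing, and disabling or removing if it was tuned for the old hardware, before the first boot on the new machine.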
Lessons Learnt
- Practice the 3-2-1 backup strategy for your important data. That is a minimum of 3 copies, with 2 stored locally (the one you're currently using, plus a backup) and 1 stored remotely (like a cloud storage service).
- Whenever Btrfs has the slightest hiccup, the first thing to do should always be mounting the partition read-only and copying the data elsewhere, before attempting further diagnosis or repair steps. I'd imagine that this applies to other CoW filesystems (like ZFS) as well.
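The second lesson can be sketched as a small rescue routine: mount read-only, then copy file by file so that one unreadable file doesn't stop everything else from being saved. This is a minimal sketch under assumptions. The function name `rescue_copy` is made up for the example, the device and paths in the comments are the ones from this post, and it will mishandle file names that contain newlines.

```shell
#!/bin/sh
# Copy every regular file from src to dst one at a time, continuing past
# read errors and recording the relative paths that failed in a log file.
rescue_copy() {
    src=$1
    dst=$2
    log=$3
    : > "$log"
    # List files relative to src, then copy each one individually.
    ( cd "$src" && find . -type f ) | while IFS= read -r rel; do
        mkdir -p "$dst/$(dirname "$rel")"
        cp -p "$src/$rel" "$dst/$rel" 2>/dev/null || printf '%s\n' "$rel" >> "$log"
    done
}

# Intended use from a live USB environment (needs root, so left commented out):
#   mount -o ro /dev/nvme0n1p2 /mnt
#   rescue_copy /mnt/home /backup_usb/home /backup_usb/failed.log
# Newer kernels also document rescue-oriented mount options for Btrfs;
# see the btrfs(5) man page before relying on any of them.
```

A failed copy can leave a partial file behind in the destination, so treat the entries in the log as files to restore from another backup. The point of the ordering is that nothing here writes to the damaged filesystem, which keeps later options (including `btrfs restore` or a repair attempt) open.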