| - | ====== | + | ====== |
When installing Linux with software MD RAID, there is a temptation to use a //partitionable array//: build one array out of the whole disks and then partition the resulting array device, instead of partitioning every disk first and assembling the array out of partitions. At first sight it promises simpler setup and easier maintenance.

TL;DR: don't. The rest of this article explains why.

===== The problem =====

Problems arise when one desires to also boot from this array. Now not only Linux must understand the structure on the disks; the system firmware must too. The firmware, however, won't interpret the MD superblock (metadata) which describes the shape of the array. Often GRUB is used as a bootloader for Linux; it employs some unused areas on disk to install itself. So the disks must //look normal// to the firmware even without metadata interpretation, and //be useful// to GRUB or another bootloader.

The obvious first consequence is that we are bound to RAID1 (mirror), because this is the only level which stores identical data on all components, so each component device carries a complete copy of the array's contents. But there's more.
| + | ===== MD RAID1 superblocks ===== | ||
{{ md-raid-superblocks.svg |On-disk placement of various MD RAID superblock variants}}
There are two types of MD superblocks: the deprecated version 0.9 and the modern version 1, the latter placed in three different locations depending on the minor number:
  * version 0.9 (deprecated) and version 1.0 are placed near the end of the component device (regardless of the array size!), not farther than 128KiB from the end.
  * version 1.1 is placed at the beginning of component devices
  * version 1.2 is placed 4KiB past the beginning of component devices
If the full device capacity is used (when e.g. all component devices are the same size), the data area (green) is slightly smaller than the device itself, by something between 64KiB and 128KiB.
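To make the placement rules concrete, here is a small Python sketch. It is not authoritative: the v0.9/v1.0 alignment rules are simplified, and the 1MiB data offset assumed for v1.1/v1.2 is merely a common mdadm default (a real mdadm may choose a larger one):

<code python>
KIB = 1024
MIB = 1024 * KIB

def superblock_offset(version: str, device_size: int) -> int:
    """Byte offset of the MD superblock on one component device."""
    if version == "0.9":
        # A 64KiB-aligned block within the last 128KiB of the device.
        return ((device_size - 64 * KIB) // (64 * KIB)) * (64 * KIB)
    if version == "1.0":
        # Near the end of the device, about 8KiB before it (simplified).
        return device_size - 8 * KIB
    if version == "1.1":
        return 0               # the very beginning
    if version == "1.2":
        return 4 * KIB         # 4KiB past the beginning
    raise ValueError(version)

def data_area(version: str, device_size: int):
    """(start, size) of the array data area on one component device."""
    if version in ("0.9", "1.0"):
        start, end = 0, superblock_offset(version, device_size)
    else:
        start, end = MIB, device_size     # assumed 1MiB data offset
    size = ((end - start) // (64 * KIB)) * (64 * KIB)   # round down to 64KiB
    return start, size

# The 1 GB (10**9 bytes) device used in the "GPT primer" section below:
print(data_area("1.2", 1_000_000_000))    # (1048576, 998899712)
</code>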
| + | |||
| + | ===== Partition tables ===== | ||
| + | Now let's observe how partition tables look from the point of view of the firmware who doesn' | ||
| + | |||
| + | ==== MBR ==== | ||
| + | {{ mbr-boot-structure.svg |MBR disk layout}} | ||
| + | MBR partition table contains the bootloader code for " | ||
| + | * v0.9 or v1.0: the partition table and the bootloader code is in its expected place, partition locations in the table are correct. The last partition never covers the few thousands of sectors at the end, which is fine. The disk can contain " | ||
| + | * v1.1: the place where MBR is expected to be found we see the MD superblock which looks like garbage to the firmware. This disk structure is invalid and disk is **not bootable**. | ||
| + | * v1.2: normally there will be no MBR in the place where firmware expects it, but it is possible to carefully craft a special MBR with the bootloader in such a way so it properly points to the really existing partitions needed to boot, and place it into the free use area (orange). | ||
| + | |||
| + | ==== GPT ==== | ||
| + | {{ gpt-grub-legacy-boot.svg |GPT disk layout when used for legacy boot}} | ||
| + | {{ gpt-uefi-boot.svg |GPT disk layout when used for UEFI boot}} | ||
| + | |||
| + | GPT partition table is stored within the first 34 sectors (occupying 17408 bytes((more for 4K sector disks))), **and** the last 33 sectors (16896 bytes((more for 4K sector disks))), so again, if we partition the array, it occupies the very beginning and very end of the data area. The first sector (LBA 0) is the protective MBR header which can contain the bootloader code for the " | ||
| + | * v0.9 or v1.0, GPT: the first GPT appears to be in its normal place, but the second copy of the GPT at the end is missing. (It is actually exists on the disk but not at the place firmware expects it to be, so it considers it to be non-existing). This disk partition table is invalid so **not bootable**. It is possible to craft yet another GPT copy at the very end of the device into "lost space" area, but | ||
| + | - that area might be not big enough to hold a GPT, | ||
| + | - the first GPT header contains the actual address of the header of the copy, and that would be the address of really existing " | ||
| + | * v1.1, GPT: no partition table copies appear on their usual places; the place where first partition table is expected to be found we see MD superblock. This is **not bootable**. | ||
| + | * v1.2, GPT: for the same reason as previous, it's **not bootable**. At first sight it might look like both places where GPT should appear are unused and we should be able to make it if we craft //both// copies with first one having correct pointer to the second one, but the GPT is much larger than 4KiB that is left for us in the free use area, and the lost space area at the end might be not big enough too. As in v1.2 MBR case, we can craft a special MBR with bootloader and put it into free use area, but that will only enable the legacy boot. | ||
| + | |||
| + | In addition to crafting the MBR where it's possible one will need to maintain it: update the bootloader code with system updates, manually clone in case of disk replacement, | ||
| + | |||
| + | ==== GPT primer ==== | ||
| + | As an example, the following table is the complete on-disk structure of the RAID1 MD array with superblock version 1.2, partitioned with GPT, created out of devices exactly 1000000000 = 1GB with sector size 512. Notice how MD pads the beginning of data to 1MiB, the size to 64KiB, and GPT pads partitions to 1MiB: | ||
| + | | ^ Address | ||
| + | | ::: | 0x00000000 | ||
| + | | ::: | 0x00001000 | ||
| + | | ::: | 0x00002000 | ||
| + | ^ Virtual disk | 0x00100000 | ||
| + | ^ ::: | 0x00100200 | ||
| + | ^ ::: | 0x00100400 | ||
| + | ^ ::: | 0x00104400 | ||
| + | ^ ::: | 0x00200000 | ||
| + | ^ ::: | 0x00300000 | ||
| + | ^ ::: | 0x3b900000 | ||
| + | ^ ::: | 0x3b99be00 | ||
| + | ^ ::: | 0x3b99fe00 | ||
| + | | | 0x3b9a0000 | ||
| + | | ::: | 0x3b9aca00 = 10000000000 | ||
| + | |||
| + | ===== Conclusion ===== | ||
| + | Here's a summary table: | ||
| + | ^ Configuration | ||
| + | | MBR, v0.9, v1.0 | Yes | No | | ||
| + | | MBR, v1.1 | No | No | | ||
| + | | MBR, v1.2 | Yes, with great care | No | | ||
| + | | GPT, v0.9, v1.0 | Yes, with care; boot code must appear within first 2T of the array | No | | ||
| + | | GPT, v1.1 | No | No | | ||
| + | | GPT, v1.2 | Yes, with great care | No | | ||
| + | |||
| + | In addition to these complications, | ||
| + | |||
| + | MBR is obsolete as it allows to partition up to 2 TiB of space per device and some newer system don't support this kind of a boot sequence anymore. | ||
So, does partitionable MD RAID actually provide easier maintenance? Hardly: booting from it is somewhere between fragile and impossible.

===== Resolution =====
| + | What to do? Do not use partitionable RAID. | ||
  * On each disk create a separate partition table; always use GPT for UEFI boot and for disks larger than 2TiB.
| + | * Create few small partitions required | ||
| + | * ESP for UEFI boot | ||
| + | * BIOS boot for legacy boot | ||
| + | * A separate partition for ''/ | ||
| + | * The rest of the space will be one big partition | ||
| + | * Use LVM to partition this big RAID space into volumes. LVM is far better than whatever partition table we were considering. Since we're using Linux' | ||
| + | * You can also add additional layers between RAID and LVM, in the following order: | ||
| + | * bcache layer for SSD caching | ||
| + | * then LUKS goes for crypto | ||
| + | * then there might be VDO for deduplication and compression | ||
| + | * Provide for redundant bootloading: | ||
| + | * install bootloader onto each device in case of legacy boot | ||
| + | * create additional ESPs on all devices | ||
| + | * install GRUB and/or initramfs hook that does this cloning every time the contents may update | ||
| + | * create additional firmware boot entries to permit booting from each device | ||
The disk replacement procedure will now include partitioning the new disk (the partition table can be copied from a surviving one) and restoring the bootloader pieces, in addition to re-adding the big partition into the array.
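A sketch of that procedure, assuming (for illustration only) that ''/dev/sda'' is a surviving disk, ''/dev/sdb'' is the replacement, and the big RAID component is partition 3; ''sgdisk'' comes from the gdisk package:

<code python>
#!/usr/bin/env python3
"""Replace a failed disk in the RAID1 (device names are assumptions)."""
import subprocess

GOOD, NEW = "/dev/sda", "/dev/sdb"      # surviving disk, replacement disk
ARRAY, PART = "/dev/md0", "/dev/sdb3"   # array and its new component

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run("sgdisk", "-R", NEW, GOOD)   # replicate GOOD's partition table onto NEW
run("sgdisk", "-G", NEW)         # randomize GUIDs so the disks stay distinct
run("mdadm", "--manage", ARRAY, "--add", PART)   # re-add into the mirror
run("grub-install", NEW)         # legacy boot; for UEFI, also restore the ESP
</code>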