====== Software RAID and UEFI boot ======

When installing Linux with software MD RAID, there is a temptation to use a //partitionable// software RAID, to reduce the maintenance burden of repartitioning when replacing a failed drive. In this case one creates an array out of raw, unpartitioned devices, and then partitions it.

TL;DR: don't fall for this temptation. The operational gains promised by partitionable RAID are illusory, and the supposed reduction in maintenance burden is bogus for most practical cases.

===== The problem =====

Problems arise when one also desires to boot from this array. Now not only Linux must understand the structure on the disks; the system firmware must too. The firmware, however, won't interpret the MD superblock (the metadata which describes the shape of the array). Often GRUB is used as the bootloader for Linux; it employs some unused areas on disk to install itself. The disks must //look normal// to the firmware even without metadata interpretation, and //be useful// to GRUB or another bootloader.
  
The obvious first consequence is that we are bound to RAID1 (mirror), because it is the only level that stores data on the disks as-is, on all components. But there's more.

===== MD RAID1 superblocks =====

{{ md-raid-superblocks.svg |On-disk placement of various MD RAID superblock variants}}
  
There are two types of MD superblocks (blue), v0.9 and v1.x. They have different structures, while the v1.x variants differ from each other only by their placement within the component device:
  * version 0.9 (deprecated) and version 1.0 are placed near the end of the component device (regardless of the array size!), no farther than 128KiB from the end;
  * version 1.1 is placed at the very beginning of the component device;
  * version 1.2 is placed 4KiB past the beginning of the component device.
  
If the full device capacity is used (e.g. when all component devices are the same size), the data area (green) is slightly smaller than the device: between 64KiB and around 1MiB less than the device size. In the case of RAID1 this area is the virtual disk.
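For illustration, here is a sketch of how one could create a RAID1 array with a chosen superblock version and then check where MD actually placed the metadata and the data (''/dev/sdX'' and ''/dev/sdY'' are placeholder component devices):

<code bash>
# v1.0 puts the superblock near the end of each component;
# --metadata=1.2 would put it 4KiB past the beginning instead.
mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.0 /dev/sdX /dev/sdY

# For v1.x superblocks, "Super Offset" and "Data Offset" (in 512-byte sectors)
# show exactly where the superblock and the data area sit on the component.
mdadm --examine /dev/sdX
</code>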
  
===== Partition tables =====

Now let's observe how partition tables look from the point of view of firmware that doesn't interpret the MD metadata.
  
==== MBR ====
{{ mbr-boot-structure.svg |MBR disk layout}}

The MBR partition table contains the bootloader code for "legacy" boot; it is 512 bytes stored in the very first sector of the device (LBA 0), so if we partition our array, the very beginning of the data area (green) will be the MBR. This partition table only permits partitioning the first 2TiB of the device, because it uses 32-bit sector numbers to specify the location and length of a partition. When GRUB is used as the bootloader, it places part of its code into an otherwise unused area of the disk, so you have to keep that area large enough.
  * v0.9 or v1.0: the partition table and the bootloader code are in their expected places, and the partition locations in the table are correct. The last partition never covers the few thousand sectors at the very end, which is fine. The disk can contain an "extended" partition (the trick in the MBR partitioning scheme which allows having more than 4 partitions); this structure is also fully possible.
  * v1.1: in the place where the MBR is expected, the firmware sees the MD superblock, which looks like garbage to it. This disk structure is invalid and the disk is **not bootable**.
  * v1.2: normally there will be no MBR in the place where the firmware expects it, but it is possible to carefully craft a special MBR with the bootloader, in such a way that it properly points to the actually existing partitions needed for boot, and place it into the free-use area (orange).
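A quick way to check what the firmware will see in these cases is to dump LBA 0 of a component device; a valid MBR ends with the ''55 aa'' boot signature at offset 0x1FE (a sketch; ''/dev/sdX'' is a placeholder component device):

<code bash>
# Show the tail of the first sector: with a v0.9/v1.0 superblock the 55 aa
# signature is present; with v1.1 this area holds the MD superblock instead.
dd if=/dev/sdX bs=512 count=1 2>/dev/null | xxd | tail -n 2
</code>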
  
==== GPT ====
{{ gpt-grub-legacy-boot.svg |GPT disk layout when used for legacy boot}}
{{ gpt-uefi-boot.svg |GPT disk layout when used for UEFI boot}}

The GPT partition table is stored within the first 34 sectors of the device (occupying 17408 bytes((more for 4K-sector disks))) **and** the last 33 sectors (16896 bytes((more for 4K-sector disks))), so again, if we partition the array, the table occupies the very beginning and the very end of the data area. The first sector (LBA 0) is the protective MBR, which can contain the bootloader code for "legacy" boot; it can also be a normal MBR defining the same partitions as are defined in the GPT (only four of them; this time the extended partition trick won't work). GPT is the only scheme that enables UEFI boot, which also requires a special partition with GUID ''C12A7328-F81F-11D2-BA4B-00A0C93EC93B'', called the **EFI System Partition** (ESP), containing an EFI executable file with the bootloader code((or it can be the Linux kernel itself, with an integrated initramfs and an **EFI stub** attached)). GRUB can be used for legacy boot, but it needs a place for its stage 1.5 loader, which cannot go right after the partition table as it does with MBR, so it is placed into a special partition with GUID ''21686148-6449-6E6F-744E-656564454649'', called **BIOS boot**.
  * v0.9 or v1.0, GPT: the first GPT appears to be in its normal place, but the second copy of the GPT at the end is missing. (It actually exists on the disk, just not at the place the firmware expects it to be, so the firmware considers it non-existent.) This disk partition table is invalid, so the disk is **not bootable**. It is possible to craft yet another GPT copy at the very end of the device, in the "lost space" area, but:
    - that area might not be big enough to hold a GPT,
In addition to crafting the MBR where it's possible, one will need to maintain it: update the bootloader code with system updates, manually clone it in case of disk replacement, and update the partition table if anything important for boot changes. This MBR can also provide false hints; it's a dirty hack, so it's dangerous. In the MBR case, the 4KiB of space the v1.2 superblock leaves is not enough to hold the GRUB stage 1.5 code, so another bootloader must be used. In the GPT case, the BIOS boot partition can't really be used to hold the GRUB code((even if we successfully install GRUB into any of these crafted structures, everything will have its sector numbers shifted; good luck patching!)). All of this defeats the purpose of having partitionable RAID to reduce maintenance.
  
==== Layout example ====
As an example, the following table shows the complete on-disk structure of a RAID1 MD array with a version 1.2 superblock, partitioned with GPT, created out of devices of exactly 1000000000 bytes = 1GB with 512-byte sectors. Notice how MD pads the beginning of the data area to 1MiB and its size to a multiple of 64KiB, and how GPT aligns partitions to 1MiB:

|              ^ Address     ^ Length      ^ Length (dec) ^ Contents            ^
| :::          | 0x00000000  | 0x00001000  | 4096         | Free use area       |
| :::          | 0x00001000  | 0x00001000  | 4096         | MD superblock v1.2  |
| :::          | 0x00002000  | 0x000fe000  | 1040384      | Padding             |
^ Virtual disk | 0x00100000  | 0x00000200  | 512          | Protective MBR      |
^ :::          | 0x00100200  | 0x00000200  | 512          | GPT 1 header        |
^ :::          | 0x00100400  | 0x00004000  | 16384        | GPT 1 entries       |
^ :::          | 0x00104400  | 0x000fbc00  | 1031168      | Padding             |
^ :::          | 0x00200000  | 0x00100000  | 1048576      | Partition 1 data    |
^ :::          | 0x00300000  | 0x3b600000  | 996147200    | Partition 2 data    |
^ :::          | 0x3b900000  | 0x0009be00  | 638464       | Padding             |
^ :::          | 0x3b99be00  | 0x00004000  | 16384        | GPT 2 entries       |
^ :::          | 0x3b99fe00  | 0x00000200  | 512          | GPT 2 header        |
|              | 0x3b9a0000  | 0x0000ca00  | 51712        | Padding             |
| :::          | 0x3b9aca00 = 1000000000 (end of device)                    ||||

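These numbers can be cross-checked on a live system; a sketch, assuming the array is ''/dev/md0'' and ''/dev/sdX'' is one of its components (placeholder names):

<code bash>
# For a v1.2 superblock mdadm reports "Data Offset : 2048 sectors", i.e. the
# 1MiB start of the virtual disk seen in the table above.
mdadm --examine /dev/sdX | grep -E 'Data Offset|Avail Dev Size'

# Print the GPT of the array itself; the partition start/end sectors should
# match the "Partition 1 data" and "Partition 2 data" rows.
sgdisk -p /dev/md0
</code>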
===== Conclusion =====

Here's a summary table:
^ Configuration    ^ Legacy boot possible                                                ^ UEFI boot possible ^
  
In addition to these complications, the UEFI specification provides no support for OS-managed software RAID like MD: the ESP must be a plain GPT partition.

MBR is obsolete: it only allows partitioning the first 2TiB of space per device, and some newer systems don't support this kind of boot sequence anymore.

So partitionable MD RAID actually provides easier maintenance only for legacy-boot systems with a boot disk smaller than 2TiB, and only with RAID1; booting from it with UEFI is impossible. This kind of setup is obsolete and very limiting.
  
===== Resolution =====
What to do? Do not use partitionable RAID. Instead (see the sketch after this list):

  * On each disk create a separate partition table; always use GPT for UEFI boot and for disks larger than 2TiB.
  * Create the few small partitions required for boot on each disk:
    * an ESP for UEFI boot,
    * a BIOS boot partition for legacy boot,
    * a separate partition for ''/boot'' to build a RAID1 from((technically this is not required anymore, since GRUB is able to interpret MD RAID and LVM. However, it is still needed if you want to add additional layers, and overall it's cleaner. It can also be argued that on a UEFI system the ESP should be used for kernel and initramfs storage, i.e. that it takes over the role of ''/boot''.)),
    * and the rest of the space as one big partition to hold a software RAID of whatever level you want; you're not restricted to RAID1 anymore.
  * Use LVM to partition this big RAID space into volumes. LVM is far better than any partition table we have been considering. Since we're using Linux's RAID, hopes for compatibility with other systems are already lost, so introducing another Linux technology doesn't impose any additional limits.
  * You can also add additional layers between the RAID and LVM, in the following order:
    * a bcache layer for SSD caching,
    * then LUKS for encryption,
    * then possibly VDO for deduplication and compression.
  * Provide for redundant booting:
    * install the bootloader onto each device in the case of legacy boot;
    * create additional ESPs on all the devices and copy the contents over;
    * install a GRUB and/or initramfs hook that re-does this cloning every time the contents may have been updated;
    * create additional firmware boot entries to permit booting from each device using its own ESP.
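To make this concrete, here is a minimal sketch of such a setup for a two-disk UEFI system. All the names (''/dev/sda'', ''/dev/sdb'', ''md0'', ''vg0'', the ''\EFI\debian\grubx64.efi'' loader path) are placeholders to adapt, and a distribution installer may perform some of these steps for you:

<code bash>
# Partition both disks identically: a small ESP, a /boot mirror member,
# and one big partition for the main RAID.
for d in /dev/sda /dev/sdb; do
    sgdisk -n 1:0:+512M -t 1:EF00 -c 1:"EFI system" "$d"
    sgdisk -n 2:0:+1G   -t 2:FD00 -c 2:"boot RAID"  "$d"
    sgdisk -n 3:0:0     -t 3:FD00 -c 3:"main RAID"  "$d"
done

# RAID1 for /boot; the big array can be any level (RAID1 shown here).
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3

# LVM on top of the big array.
pvcreate /dev/md1
vgcreate vg0 /dev/md1
lvcreate -L 20G -n root vg0

# After the OS is installed: GRUB goes onto the first ESP, then the second
# ESP gets a copy of its contents and its own firmware boot entry.
grub-install --target=x86_64-efi --efi-directory=/boot/efi
mount /dev/sdb1 /mnt && cp -a /boot/efi/. /mnt/ && umount /mnt
efibootmgr -c -d /dev/sdb -p 1 -L "Linux (disk 2)" -l '\EFI\debian\grubx64.efi'
</code>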
The disk replacement procedure will now include partitioning the new disk into the same structure and re-installing the bootloader (or re-creating the firmware boot entry), but all the steps are standard and obvious, and no dangerous hand-crafted configuration is involved. This is the proper, robust, and most flexible way to use Linux software RAID and to make the system boot redundantly from the devices that constitute it.
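For example, with the layout sketched above, replacing a failed ''/dev/sdb'' with a fresh drive could look like this (again, all device and partition names are placeholders):

<code bash>
# Replicate the partition table from the surviving disk onto the new one,
# then randomize its GUIDs so the two tables don't clash.
sgdisk -R /dev/sdb /dev/sda
sgdisk -G /dev/sdb

# Re-add the new members; MD rebuilds the mirrors in the background.
mdadm /dev/md0 --add /dev/sdb2
mdadm /dev/md1 --add /dev/sdb3

# Restore the redundant ESP and, if it was lost, its firmware boot entry.
mount /dev/sdb1 /mnt && cp -a /boot/efi/. /mnt/ && umount /mnt
efibootmgr -c -d /dev/sdb -p 1 -L "Linux (disk 2)" -l '\EFI\debian\grubx64.efi'
</code>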