{"version":"https://jsonfeed.org/version/1","title":"Lan Tian @ Blog","description":null,"home_page_url":"https://lantian.pub","feed_url":"https://lantian.pub/feed.json","language":["zh","en","default"],"author":{"name":"Lan Tian","url":"https://lantian.pub"},"items":[{"id":"https://lantian.pub/article/modify-computer/nixos-low-ram-vps.lantian/","url":"https://lantian.pub/article/modify-computer/nixos-low-ram-vps.lantian/","title":"NixOS 系列(五):制作小内存 VPS 的 DD 磁盘镜像","link":"https://lantian.pub/article/modify-computer/nixos-low-ram-vps.lantian/","summary":"","image":"/usr/uploads/202110/nixos-social-preview.png","banner_image":"/usr/uploads/202110/nixos-social-preview.png","content_html":"
\n\nNixOS 系列文章目录:
\n\n
黑色星期五已经过了,相信有一些读者新买了一些特价的 VPS、云服务器等,并且想在 VPS 上安装 NixOS。但是由于 NixOS 的知名度不如 CentOS、Debian、Ubuntu 等老牌 Linux 发行版,几乎没有 VPS 服务商提供预装 NixOS 的磁盘镜像,只能由用户使用以下方法之一手动安装:
\n由于你可以在 NixOS 安装镜像的环境中随意操作 VPS 的硬盘,这种方法自由度最高,可以任意对硬盘进行分区,指定文件系统格式。但是,使用这种方法前,你的主机商需要在以下三项前提中满足任意一项:
\n我这次就购买了一台内存刚好为 1GB 的 VPS,没有足够内存解压 NixOS 23.05 的镜像,因此无法使用 netboot.xyz 启动 NixOS 安装环境。同时由于我的主机商也不提供自定义镜像功能,我也无法通过光盘启动 NixOS 安装程序。
\nNixOS-Infect 工具的原理是在本地系统上安装 Nix Daemon,再使用它构建一个完整的 NixOS 系统,最后将原系统的启动项替换成 NixOS 的。由于这种方法不需要在内存中解压 NixOS 的完整安装镜像,这种方法更适合小内存的 VPS。但这种方法的缺点是无法自定义分区结构和文件系统类型。只能使用 VPS 服务商的默认分区配置。对于使用 Btrfs/ZFS 以及 Impermanence 等非标准分区方案 / 文件系统的用户不友好。
\n而 NixOS-Anywhere 的原理是通过 Linux 内核的 kexec
功能替换当前运行的内核,直接启动到内存中的 NixOS 的安装镜像,本质原理与 netboot.xyz 大致相同,因此也与 netboot.xyz 一样需要较大的内存空间。
对于类似的小内存 VPS,我曾经使用的方法是,先使用 NixOS-Infect 安装一个普通的 NixOS,然后部署一份开启了 Btrfs 和 Impermanence 的配置,然后重启到恢复环境,在恢复环境中调整分区、转换分区格式。这种方法能用,但是很麻烦,而且一旦中间一步操作出错,很难修复系统,只能从头开始。
\n最近 NixOS 社区发布了一款工具 Disko,它的原本用途是在 NixOS 安装环境中自动对硬盘进行分区,从而实现用 Nix 配置文件声明式管理硬盘分区。但是,这款工具也提供了根据给定的分区表和 NixOS 配置,自动生成磁盘镜像的功能。那么,我们就可以配置好 Btrfs/ZFS/Impermanence,生成对应的磁盘镜像,再在 VPS 上直接用 dd
命令写入硬盘,就可以简单地安装 NixOS 了。
由于这种方法对 VPS 上运行的恢复环境几乎没有要求(有网络和 dd
命令就可以),我们可以启动到占用内存很小的 Alpine Linux 发行版,然后通过网络传输磁盘镜像写入 VPS 硬盘。
在开始这个方法前,我们需要准备一份简单的 NixOS 配置,包含最基础的引导、网络、root 密码、SSH 密钥等配置,以保证你后续可以部署完整的配置。当然你也可以直接使用一份完整的 NixOS 配置,只不过稍后创建的磁盘镜像体积会更大。
\n我准备的配置文件如下,存为 configuration.nix
:
{\n config,\n pkgs,\n lib,\n ...\n}: {\n # 我用的一些内核参数\n boot.kernelParams = [\n # 关闭内核的操作审计功能\n \"audit=0\"\n # 不要根据 PCIe 地址生成网卡名(例如 enp1s0,对 VPS 没用),而是直接根据顺序生成(例如 eth0)\n \"net.ifnames=0\"\n ];\n\n # 我用的 Initrd 配置,开启 ZSTD 压缩和基于 systemd 的第一阶段启动\n boot.initrd = {\n compressor = \"zstd\";\n compressorArgs = [\"-19\" \"-T0\"];\n systemd.enable = true;\n };\n\n # 安装 Grub\n boot.loader.grub = {\n enable = !config.boot.isContainer;\n default = \"saved\";\n devices = [\"/dev/vda\"];\n };\n\n # 时区,根据你的所在地修改\n time.timeZone = \"America/Los_Angeles\";\n\n # Root 用户的密码和 SSH 密钥。如果网络配置有误,可以用此处的密码在控制台上登录进去手动调整网络配置。\n users.mutableUsers = false;\n users.users.root = {\n hashedPassword = \"$6$9iybgF./X/RNsRrQ$h7Zlk//loJDPg7yCCPT/9jVU0Tvep6vEA1FvPBT.kqJUA5qlzhDJEYnBFlpBZmTXuUXjF0qgmDWmGkXIMC9JD/\";\n openssh.authorizedKeys.keys = [\n \"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMcWoEQ4Mh27AV3ixcn9CMaUK/R+y4y5TqHmn2wJoN6i lantian@lantian-lenovo-archlinux\"\n \"ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCulLscvKjEeroKdPE207W10MbZ3+ZYzWn34EnVeIG0GzfZ3zkjQJVfXFahu97P68Tw++N6zIk7htGic9SouQuAH8+8kzTB8/55Yjwp7W3bmqL7heTmznRmKehtKg6RVgcpvFfciyxQXV/bzOkyO+xKdmEw+fs92JLUFjd/rbUfVnhJKmrfnohdvKBfgA27szHOzLlESeOJf3PuXV7BLge1B+cO8TJMJXv8iG8P5Uu8UCr857HnfDyrJS82K541Scph3j+NXFBcELb2JSZcWeNJRVacIH3RzgLvp5NuWPBCt6KET1CCJZLsrcajyonkA5TqNhzumIYtUimEnAPoH51hoUD1BaL4wh2DRxqCWOoXn0HMrRmwx65nvWae6+C/7l1rFkWLBir4ABQiKoUb/MrNvoXb+Qw/ZRo6hVCL5rvlvFd35UF0/9wNu1nzZRSs9os2WLBMt00A4qgaU2/ux7G6KApb7shz1TXxkN1k+/EKkxPj/sQuXNvO6Bfxww1xEWFywMNZ8nswpSq/4Ml6nniS2OpkZVM2SQV1q/VdLEKYPrObtp2NgneQ4lzHmAa5MGnUCckES+qOrXFZAcpI126nv1uDXqA2aytN6WHGfN50K05MZ+jA8OM9CWFWIcglnT+rr3l+TI/FLAjE13t6fMTYlBH0C8q+RnQDiIncNwyidQ== lantian@LandeMacBook-Pro.local\"\n ];\n };\n\n # 使用 systemd-networkd 管理网络\n systemd.network.enable = true;\n services.resolved.enable = false;\n\n # 配置网络 IP 和 DNS\n systemd.network.networks.eth0 = {\n address = [\"123.45.678.90/24\"];\n gateway = [\"123.45.678.1\"];\n matchConfig.Name = \"eth0\";\n };\n 
networking.nameservers = [\n \"8.8.8.8\"\n ];\n\n # 开启 SSH 服务端,监听 2222 端口\n services.openssh = {\n enable = true;\n ports = [2222];\n settings = {\n PasswordAuthentication = false;\n PermitRootLogin = lib.mkForce \"prohibit-password\";\n };\n };\n\n # 关闭 NixOS 自带的防火墙\n networking.firewall.enable = false;\n\n # 关闭 DHCP,手动配置 IP\n networking.useDHCP = false;\n\n # 主机名,随意设置即可\n networking.hostName = \"bootstrap\";\n\n # 首次安装系统时 NixOS 的最新版本,用于在大版本升级时避免发生向前不兼容的情况\n system.stateVersion = \"23.05\";\n\n # QEMU(KVM)虚拟机需要使用的内核模块\n boot.initrd.postDeviceCommands = lib.mkIf (!config.boot.initrd.systemd.enable) ''\n # Set the system time from the hardware clock to work around a\n # bug in qemu-kvm > 1.5.2 (where the VM clock is initialised\n # to the *boot time* of the host).\n hwclock -s\n '';\n\n boot.initrd.availableKernelModules = [\n \"virtio_net\"\n \"virtio_pci\"\n \"virtio_mmio\"\n \"virtio_blk\"\n \"virtio_scsi\"\n ];\n boot.initrd.kernelModules = [\n \"virtio_balloon\"\n \"virtio_console\"\n \"virtio_rng\"\n ];\n}\n
\n然后,准备一份 flake.nix
,用 Flake 的方式管理 nixpkgs 的版本,并同时引入 Impermanence 等我使用的模块:
{\n description = \"Lan Tian's NixOS Flake\";\n\n inputs = {\n nixpkgs.url = \"github:NixOS/nixpkgs/nixos-unstable\";\n impermanence.url = \"github:nix-community/impermanence\";\n };\n\n outputs = {\n self,\n nixpkgs,\n ...\n } @ inputs: let\n lib = nixpkgs.lib;\n in rec {\n nixosConfigurations.bootstrap = lib.nixosSystem {\n system = \"x86_64-linux\";\n modules = [\n inputs.impermanence.nixosModules.impermanence\n ./configuration.nix\n ];\n };\n };\n}\n
\n这个系统配置现在是无法构建的,因为我们还没有配置文件系统。如果你现在用 nixos-rebuild build --flake .#bootstrap
试图构建,会遇到以下错误:
error:\nFailed assertions:\n- The 『fileSystems』 option does not specify your root file system.\n
\n所以接下来,我们就要加入 Disko 模块,以及分区表和文件系统的配置。
\n\n\n如果你不使用 Impermanence 等将 root 分区放在 tmpfs 上的方案,请跳到下一小节。
\n
修改 flake.nix
引入 Disko 模块:
{\n description = \"Lan Tian's NixOS Flake\";\n\n inputs = {\n nixpkgs.url = \"github:NixOS/nixpkgs/nixos-unstable\";\n impermanence.url = \"github:nix-community/impermanence\";\n # 新增下面几行\n disko = {\n url = \"github:nix-community/disko\";\n inputs.nixpkgs.follows = \"nixpkgs\";\n };\n };\n\n outputs = {\n self,\n nixpkgs,\n ...\n } @ inputs: let\n lib = nixpkgs.lib;\n in rec {\n nixosConfigurations.bootstrap = lib.nixosSystem {\n system = \"x86_64-linux\";\n modules = [\n inputs.impermanence.nixosModules.impermanence\n\n # 新增下面一行\n inputs.disko.nixosModules.disko\n\n ./configuration.nix\n ];\n };\n };\n}\n
\n接下来,我们就要通过 Disko 模块提供的配置选项,配置磁盘镜像中的分区了。修改 configuration.nix
,加入以下配置:
{\n config,\n pkgs,\n lib,\n ...\n}: {\n # 其余配置省略\n\n disko = {\n # 不要让 Disko 直接管理 NixOS 的 fileSystems.* 配置。\n # 原因是 Disko 默认通过 GPT 分区表的分区名挂载分区,但分区名很容易被 fdisk 等工具覆盖掉。\n # 导致一旦新配置部署失败,磁盘镜像自带的旧配置也无法正常启动。\n enableConfig = false;\n\n devices = {\n # 定义一个磁盘\n disk.main = {\n # 要生成的磁盘镜像的大小,2GB 足够我使用,可以按需调整\n imageSize = \"2G\";\n # 磁盘路径。Disko 生成磁盘镜像时,实际上是启动一个 QEMU 虚拟机走一遍安装流程。\n # 因此无论你的 VPS 上的硬盘识别成 sda 还是 vda,这里都以 Disko 的虚拟机为准,指定 vda。\n device = \"/dev/vda\";\n type = \"disk\";\n # 定义这块磁盘上的分区表\n content = {\n # 使用 GPT 类型分区表。Disko 对 MBR 格式分区的支持似乎有点问题。\n type = \"gpt\";\n # 分区列表\n partitions = {\n # GPT 分区表不存在 MBR 格式分区表预留给 MBR 主启动记录的空间,因此这里需要预留\n # 硬盘开头的 1MB 空间给 MBR 主启动记录,以便后续 Grub 启动器安装到这块空间。\n boot = {\n size = \"1M\";\n type = \"EF02\"; # for grub MBR\n # 优先级设置为最高,保证这块空间在硬盘开头\n priority = 0;\n };\n\n # ESP 分区,或者说是 boot 分区。这套配置理论上同时支持 EFI 模式和 BIOS 模式启动的 VPS。\n ESP = {\n name = \"ESP\";\n # 根据我个人的需求预留 512MB 空间。如果你的 boot 分区占用更大/更小,可以按需调整。\n size = \"512M\";\n type = \"EF00\";\n # 优先级设置成第二高,保证在剩余空间的前面\n priority = 1;\n # 格式化成 FAT32 格式\n content = {\n type = \"filesystem\";\n format = \"vfat\";\n # 用作 Boot 分区,Disko 生成磁盘镜像时根据此处配置挂载分区,需要和 fileSystems.* 一致\n mountpoint = \"/boot\";\n mountOptions = [\"fmask=0077\" \"dmask=0077\"];\n };\n };\n\n # 存放 NixOS 系统的分区,使用剩下的所有空间。\n nix = {\n size = \"100%\";\n # 格式化成 Btrfs,可以按需修改\n content = {\n type = \"filesystem\";\n format = \"btrfs\";\n # 用作 Nix 分区,Disko 生成磁盘镜像时根据此处配置挂载分区,需要和 fileSystems.* 一致\n mountpoint = \"/nix\";\n mountOptions = [\"compress-force=zstd\" \"nosuid\" \"nodev\"];\n };\n };\n };\n };\n };\n\n # 由于我开了 Impermanence,需要声明一下根分区是 tmpfs,以便 Disko 生成磁盘镜像时挂载分区\n nodev.\"/\" = {\n fsType = \"tmpfs\";\n mountOptions = [\"relatime\" \"mode=755\" \"nosuid\" \"nodev\"];\n };\n };\n };\n\n # 由于我们没有让 Disko 管理 fileSystems.* 配置,我们需要手动配置\n # 根分区,由于我开了 Impermanence,所以这里是 tmpfs\n fileSystems.\"/\" = {\n device = \"tmpfs\";\n fsType = \"tmpfs\";\n options = [\"relatime\" \"mode=755\" \"nosuid\" \"nodev\"];\n };\n\n # /nix 
分区,是磁盘镜像上的第三个分区。由于我的 VPS 将硬盘识别为 sda,因此这里用 sda3。如果你的 VPS 识别结果不同请按需修改\n fileSystems.\"/nix\" = {\n device = \"/dev/sda3\";\n fsType = \"btrfs\";\n options = [\"compress-force=zstd\" \"nosuid\" \"nodev\"];\n };\n\n # /boot 分区,是磁盘镜像上的第二个分区。由于我的 VPS 将硬盘识别为 sda,因此这里用 sda2。如果你的 VPS 识别结果不同请按需修改\n fileSystems.\"/boot\" = {\n device = \"/dev/sda2\";\n fsType = \"vfat\";\n options = [\"fmask=0077\" \"dmask=0077\"];\n };\n}\n
\n\n\n如果你使用 Impermanence 等将 root 分区放在 tmpfs 上的方案,请参照上一小节并跳过这一小节。
\n
与上一小节一样,修改 flake.nix
引入 Disko 模块:
{\n description = \"Lan Tian's NixOS Flake\";\n\n inputs = {\n nixpkgs.url = \"github:NixOS/nixpkgs/nixos-unstable\";\n impermanence.url = \"github:nix-community/impermanence\";\n # 新增下面几行\n disko = {\n url = \"github:nix-community/disko\";\n inputs.nixpkgs.follows = \"nixpkgs\";\n };\n };\n\n outputs = {\n self,\n nixpkgs,\n ...\n } @ inputs: let\n lib = nixpkgs.lib;\n in rec {\n nixosConfigurations.bootstrap = lib.nixosSystem {\n system = \"x86_64-linux\";\n modules = [\n inputs.impermanence.nixosModules.impermanence\n\n # 新增下面一行\n inputs.disko.nixosModules.disko\n\n ./configuration.nix\n ];\n };\n };\n}\n
\n接下来,我们就要通过 Disko 模块提供的配置选项,配置磁盘镜像中的分区了。修改 configuration.nix
,加入以下配置:
{\n config,\n pkgs,\n lib,\n ...\n}: {\n # 其余配置省略\n\n disko = {\n # 不要让 Disko 直接管理 NixOS 的 fileSystems.* 配置。\n # 原因是 Disko 默认通过 GPT 分区表的分区名挂载分区,但分区名很容易被 fdisk 等工具覆盖掉。\n # 导致一旦新配置部署失败,磁盘镜像自带的旧配置也无法正常启动。\n enableConfig = false;\n\n devices = {\n # 定义一个磁盘\n disk.main = {\n # 要生成的磁盘镜像的大小,2GB 足够我使用,可以按需调整\n imageSize = \"2G\";\n # 磁盘路径。Disko 生成磁盘镜像时,实际上是启动一个 QEMU 虚拟机走一遍安装流程。\n # 因此无论你的 VPS 上的硬盘识别成 sda 还是 vda,这里都以 Disko 的虚拟机为准,指定 vda。\n device = \"/dev/vda\";\n type = \"disk\";\n # 定义这块磁盘上的分区表\n content = {\n # 使用 GPT 类型分区表。Disko 对 MBR 格式分区的支持似乎有点问题。\n type = \"gpt\";\n # 分区列表\n partitions = {\n # GPT 分区表不存在 MBR 格式分区表预留给 MBR 主启动记录的空间,因此这里需要预留\n # 硬盘开头的 1MB 空间给 MBR 主启动记录,以便后续 Grub 启动器安装到这块空间。\n boot = {\n size = \"1M\";\n type = \"EF02\"; # for grub MBR\n # 优先级设置为最高,保证这块空间在硬盘开头\n priority = 0;\n };\n\n # ESP 分区,或者说是 boot 分区。这套配置理论上同时支持 EFI 模式和 BIOS 模式启动的 VPS。\n ESP = {\n name = \"ESP\";\n # 根据我个人的需求预留 512MB 空间。如果你的 boot 分区占用更大/更小,可以按需调整。\n size = \"512M\";\n type = \"EF00\";\n # 优先级设置成第二高,保证在剩余空间的前面\n priority = 1;\n # 格式化成 FAT32 格式\n content = {\n type = \"filesystem\";\n format = \"vfat\";\n # 用作 Boot 分区,Disko 生成磁盘镜像时根据此处配置挂载分区,需要和 fileSystems.* 一致\n mountpoint = \"/boot\";\n mountOptions = [\"fmask=0077\" \"dmask=0077\"];\n };\n };\n\n # 存放 NixOS 系统的分区,使用剩下的所有空间。\n nix = {\n size = \"100%\";\n # 格式化成 Btrfs,可以按需修改\n content = {\n type = \"filesystem\";\n format = \"btrfs\";\n # 用作根分区,Disko 生成磁盘镜像时根据此处配置挂载分区,需要和 fileSystems.* 一致\n mountpoint = \"/\";\n mountOptions = [\"compress-force=zstd\" \"nosuid\" \"nodev\"];\n };\n };\n };\n };\n };\n };\n };\n\n # 由于我们没有让 Disko 管理 fileSystems.* 配置,我们需要手动配置\n # 根分区,是磁盘镜像上的第三个分区。由于我的 VPS 将硬盘识别为 sda,因此这里用 sda3。如果你的 VPS 识别结果不同请按需修改\n fileSystems.\"/\" = {\n device = \"/dev/sda3\";\n fsType = \"btrfs\";\n options = [\"compress-force=zstd\" \"nosuid\" \"nodev\"];\n };\n\n # /boot 分区,是磁盘镜像上的第二个分区。由于我的 VPS 将硬盘识别为 sda,因此这里用 sda3。如果你的 VPS 识别结果不同请按需修改\n fileSystems.\"/boot\" = {\n device = \"/dev/sda2\";\n fsType = \"vfat\";\n options = 
[\"fmask=0077\" \"dmask=0077\"];\n };\n}\n
\n修改 flake.nix
添加一个「软件包」,调用 Disko 的生成磁盘镜像功能:
{\n description = \"Lan Tian's NixOS Flake\";\n\n inputs = {\n nixpkgs.url = \"github:NixOS/nixpkgs/nixos-unstable\";\n impermanence.url = \"github:nix-community/impermanence\";\n disko = {\n url = \"github:nix-community/disko\";\n inputs.nixpkgs.follows = \"nixpkgs\";\n };\n };\n\n outputs = {\n self,\n nixpkgs,\n ...\n } @ inputs: let\n lib = nixpkgs.lib;\n in rec {\n nixosConfigurations.bootstrap = lib.nixosSystem {\n system = \"x86_64-linux\";\n modules = [\n inputs.impermanence.nixosModules.impermanence\n inputs.disko.nixosModules.disko\n ./configuration.nix\n ];\n };\n\n # 新增下面几行\n packages.x86_64-linux = {\n image = self.nixosConfigurations.bootstrap.config.system.build.diskoImages;\n };\n };\n}\n
\n最后运行 nix build .#image
。稍等片刻,磁盘镜像就会生成在 result/main.raw
路径下。
在 VPS 上启动救援系统,或者 Alpine Linux 等轻量化系统。
\n如果你的救援系统有 SSH 服务端,可以使用下列命令上传镜像:
\n# 根据 VPS 上的硬盘识别结果,修改 sda/vda\ncat result/main.raw | ssh root@123.45.678.90 \"dd of=/dev/sda\"\n
\n如果你的救援系统没有 SSH,可以使用下列命令: (注意:没有加密!)
\n# 根据 VPS 上的硬盘识别结果,修改 sda/vda\n# 在 VPS 上运行\nnc -l 1234 | dd of=/dev/sda\n# 在本地运行\ncat result/main.raw | nc 123.45.678.89 1234\n
\n等待命令执行结束,然后重启 VPS。此时你应该就进入了已经安装好的 NixOS 系统了。
\n由于我们创建的磁盘镜像大小只有 2GB,dd
完成后的镜像不会占满 VPS 的硬盘空间,需要手动扩展分区。
运行 fdisk /dev/sda
,删除第三个 /nix
(或者 /
)分区,然后重新创建,保证分区起始位置不变,分区结束位置扩展到硬盘结尾。如果看到擦除文件系统头部信息的提示,不要擦除!
最后运行文件系统对应的命令扩展文件系统的大小。ext4 分区可以使用 resize2fs /dev/sda3
。Btrfs 分区可以使用 btrfs filesystem resize max /nix
(或者 /
)。
\n\nList of NixOS Series Posts:
\n\n
\n- NixOS Series 1: Why I fell in love
\n- NixOS Series 2: Basic Config, Nix Flake & Batch Deploy\n
\n\n
\n- Recommended: NixOS & Nix Flakes - A Guide for Beginners by Ryan Yin
\n- NixOS Series 3: Software Packaging 101
\n- NixOS Series 4: \"Stateless\" Operating System
\n- NixOS Series 5: Creating Disk Image for Low RAM VPS
\n
Black Friday has passed. I believe some readers have purchased VPSes or cloud servers on sale and want to install NixOS on them. However, since NixOS is nowhere near as well known as popular Linux distros such as CentOS, Debian and Ubuntu, almost no VPS provider offers a disk image with NixOS preinstalled. This leaves users with one of the following options for installing it manually:
\nSince you can operate on the VPS's hard drive as you wish in NixOS's installation media, repartitioning the drive and specifying file system types, this approach offers the maximum freedom. However, before you can use this approach, your provider must satisfy one of the three prerequisites:
\nIn my case, I purchased a VPS with exactly 1GB of RAM, which is not enough to extract the NixOS 23.05 image. Therefore, I cannot boot into the NixOS installation environment with netboot.xyz. In addition, my provider doesn't support custom ISOs, so I cannot boot the NixOS installer that way either.
\nNixOS-Infect works by installing a Nix daemon on the existing OS, building a complete NixOS system with it, and finally replacing the bootloader entries with those for NixOS. Since this approach doesn't require extracting the full installer image into RAM, it is more suitable for VPSes with low RAM. The downside is that you cannot customize the partition layout or filesystem types: you are left with the defaults configured by the provider. This makes it unsuitable for users who depend on non-standard partition or filesystem schemes, such as Btrfs/ZFS or Impermanence.
\nNixOS-Anywhere, on the other hand, works by replacing the currently running kernel via kexec
, and booting straight into a NixOS installation image held in RAM. Since this works in almost the same way as netboot.xyz, it needs a similarly large amount of RAM.
I used to set up similar low-RAM VPSes by first installing a normal NixOS with NixOS-Infect, then deploying a configuration with Btrfs and Impermanence enabled, rebooting into the rescue environment, and finally adjusting partitions and converting filesystems there. It works, but takes many steps. In addition, if any step goes wrong, the system is very hard to repair and I have to start over.
\nRecently, the NixOS community released a tool called Disko. Its original purpose is to partition hard drives automatically in the NixOS installation environment, so that users can manage disk partitions declaratively with a Nix config file. However, the tool can also generate a disk image from a given partition table and NixOS config. Therefore, we can configure Btrfs/ZFS/Impermanence, generate the corresponding disk image, and dd
the image onto the VPS's hard drive to install NixOS with little effort.
Since this method requires next to nothing from the rescue environment on the VPS (network access and the dd
command are enough), we can boot into Alpine Linux, a distro known for minimal RAM usage, and transfer the disk image over the network onto the VPS's hard drive.
Before using this method, we need to prepare a simple NixOS configuration that covers the basics (bootloader, networking, root password and SSH keys), so that you can deploy the full configuration later. Of course, you can simply use your full NixOS configuration instead, at the cost of a larger disk image.
\nHere is the configuration file I prepared, stored as configuration.nix
:
{\n config,\n pkgs,\n lib,\n ...\n}: {\n # Kernel parameters I use\n boot.kernelParams = [\n # Disable auditing\n \"audit=0\"\n # Do not generate NIC names based on PCIe addresses (e.g. enp1s0, useless for VPS)\n # Generate names based on orders (e.g. eth0)\n \"net.ifnames=0\"\n ];\n\n # My Initrd config, enable ZSTD compression and use systemd-based stage 1 boot\n boot.initrd = {\n compressor = \"zstd\";\n compressorArgs = [\"-19\" \"-T0\"];\n systemd.enable = true;\n };\n\n # Install Grub\n boot.loader.grub = {\n enable = !config.boot.isContainer;\n default = \"saved\";\n devices = [\"/dev/vda\"];\n };\n\n # Timezone, change based on your location\n time.timeZone = \"America/Los_Angeles\";\n\n # Root password and SSH keys. If network config is incorrect, use this password\n # to manually adjust network config on serial console/VNC.\n users.mutableUsers = false;\n users.users.root = {\n hashedPassword = \"$6$9iybgF./X/RNsRrQ$h7Zlk//loJDPg7yCCPT/9jVU0Tvep6vEA1FvPBT.kqJUA5qlzhDJEYnBFlpBZmTXuUXjF0qgmDWmGkXIMC9JD/\";\n openssh.authorizedKeys.keys = [\n \"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMcWoEQ4Mh27AV3ixcn9CMaUK/R+y4y5TqHmn2wJoN6i lantian@lantian-lenovo-archlinux\"\n \"ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCulLscvKjEeroKdPE207W10MbZ3+ZYzWn34EnVeIG0GzfZ3zkjQJVfXFahu97P68Tw++N6zIk7htGic9SouQuAH8+8kzTB8/55Yjwp7W3bmqL7heTmznRmKehtKg6RVgcpvFfciyxQXV/bzOkyO+xKdmEw+fs92JLUFjd/rbUfVnhJKmrfnohdvKBfgA27szHOzLlESeOJf3PuXV7BLge1B+cO8TJMJXv8iG8P5Uu8UCr857HnfDyrJS82K541Scph3j+NXFBcELb2JSZcWeNJRVacIH3RzgLvp5NuWPBCt6KET1CCJZLsrcajyonkA5TqNhzumIYtUimEnAPoH51hoUD1BaL4wh2DRxqCWOoXn0HMrRmwx65nvWae6+C/7l1rFkWLBir4ABQiKoUb/MrNvoXb+Qw/ZRo6hVCL5rvlvFd35UF0/9wNu1nzZRSs9os2WLBMt00A4qgaU2/ux7G6KApb7shz1TXxkN1k+/EKkxPj/sQuXNvO6Bfxww1xEWFywMNZ8nswpSq/4Ml6nniS2OpkZVM2SQV1q/VdLEKYPrObtp2NgneQ4lzHmAa5MGnUCckES+qOrXFZAcpI126nv1uDXqA2aytN6WHGfN50K05MZ+jA8OM9CWFWIcglnT+rr3l+TI/FLAjE13t6fMTYlBH0C8q+RnQDiIncNwyidQ== lantian@LandeMacBook-Pro.local\"\n ];\n };\n\n # Manage networking with 
systemd-networkd\n systemd.network.enable = true;\n services.resolved.enable = false;\n\n # Configure network IP and DNS\n systemd.network.networks.eth0 = {\n address = [\"123.45.678.90/24\"];\n gateway = [\"123.45.678.1\"];\n matchConfig.Name = \"eth0\";\n };\n networking.nameservers = [\n \"8.8.8.8\"\n ];\n\n # Enable SSH server and listen on port 2222\n services.openssh = {\n enable = true;\n ports = [2222];\n settings = {\n PasswordAuthentication = false;\n PermitRootLogin = lib.mkForce \"prohibit-password\";\n };\n };\n\n # Disable NixOS's builtin firewall\n networking.firewall.enable = false;\n\n # Disable DHCP and configure IP manually\n networking.useDHCP = false;\n\n # Hostname, can be set as you wish\n networking.hostName = \"bootstrap\";\n\n # Latest NixOS version on your first install. Used to prevent backward\n # incompatibilities on major upgrades\n system.stateVersion = \"23.05\";\n\n # Kernel modules required by QEMU (KVM) virtual machine\n boot.initrd.postDeviceCommands = lib.mkIf (!config.boot.initrd.systemd.enable) ''\n # Set the system time from the hardware clock to work around a\n # bug in qemu-kvm > 1.5.2 (where the VM clock is initialised\n # to the *boot time* of the host).\n hwclock -s\n '';\n\n boot.initrd.availableKernelModules = [\n \"virtio_net\"\n \"virtio_pci\"\n \"virtio_mmio\"\n \"virtio_blk\"\n \"virtio_scsi\"\n ];\n boot.initrd.kernelModules = [\n \"virtio_balloon\"\n \"virtio_console\"\n \"virtio_rng\"\n ];\n}\n
\nThen, prepare flake.nix
to manage nixpkgs versions in the Flake way, as well as introduce other modules I use, such as Impermanence:
{\n description = \"Lan Tian's NixOS Flake\";\n\n inputs = {\n nixpkgs.url = \"github:NixOS/nixpkgs/nixos-unstable\";\n impermanence.url = \"github:nix-community/impermanence\";\n };\n\n outputs = {\n self,\n nixpkgs,\n ...\n } @ inputs: let\n lib = nixpkgs.lib;\n in rec {\n nixosConfigurations.bootstrap = lib.nixosSystem {\n system = \"x86_64-linux\";\n modules = [\n inputs.impermanence.nixosModules.impermanence\n ./configuration.nix\n ];\n };\n };\n}\n
\nRight now, this system config will not build, as we haven't configured filesystems yet. If you try to build it with nixos-rebuild build --flake .#bootstrap
now, you will see the following errors:
error:\nFailed assertions:\n- The 『fileSystems』 option does not specify your root file system.\n
\nTherefore, our next step is adding the Disko module, and the configuration for partition tables and filesystems.
\n\n\nIf you don't use Impermanence, or other mechanisms that use tmpfs as the root partition, please skip to the next section.
\n
Change your flake.nix
to add the Disko module:
{\n description = \"Lan Tian's NixOS Flake\";\n\n inputs = {\n nixpkgs.url = \"github:NixOS/nixpkgs/nixos-unstable\";\n impermanence.url = \"github:nix-community/impermanence\";\n # Add the following lines\n disko = {\n url = \"github:nix-community/disko\";\n inputs.nixpkgs.follows = \"nixpkgs\";\n };\n };\n\n outputs = {\n self,\n nixpkgs,\n ...\n } @ inputs: let\n lib = nixpkgs.lib;\n in rec {\n nixosConfigurations.bootstrap = lib.nixosSystem {\n system = \"x86_64-linux\";\n modules = [\n inputs.impermanence.nixosModules.impermanence\n\n # Add the next line\n inputs.disko.nixosModules.disko\n\n ./configuration.nix\n ];\n };\n };\n}\n
\nThen, we need to set up partitioning in the disk image with options provided by Disko. Modify configuration.nix
and add the following config:
{\n config,\n pkgs,\n lib,\n ...\n}: {\n # Other configurations omitted\n\n disko = {\n # Do not let Disko manage fileSystems.* config for NixOS.\n # Reason is that Disko mounts partitions by GPT partition names, which are\n # easily overwritten with tools like fdisk. When you fail to deploy a new\n # config in this case, the old config that comes with the disk image will\n # not boot either.\n enableConfig = false;\n\n devices = {\n # Define a disk\n disk.main = {\n # Size for generated disk image. 2GB is enough for me. Adjust per your need.\n imageSize = \"2G\";\n # Path to disk. When Disko generates disk images, it actually runs a QEMU\n # virtual machine and runs the installation steps. Whether your VPS\n # recognizes its hard disk as \"sda\" or \"vda\" doesn't matter. We go by\n # Disko's QEMU VM and use \"vda\" here.\n device = \"/dev/vda\";\n type = \"disk\";\n # Partition table for this disk\n content = {\n # Use GPT partition table. There seem to be some issues with MBR support\n # from Disko.\n type = \"gpt\";\n # Partition list\n partitions = {\n # Compared to MBR, GPT partition table doesn't reserve space for MBR\n # boot record. We need to reserve the first 1MB for MBR boot record,\n # so Grub can be installed here.\n boot = {\n size = \"1M\";\n type = \"EF02\"; # for grub MBR\n # Use the highest priority to ensure it's at the beginning\n priority = 0;\n };\n\n # ESP partition, or \"boot\" partition as you may call it. In theory,\n # this config will support VPSes with both EFI and BIOS boot modes.\n ESP = {\n name = \"ESP\";\n # Reserve 512MB of space per my own need. If you use more/less\n # on your boot partition, adjust accordingly.\n size = \"512M\";\n type = \"EF00\";\n # Use the second highest priority so it's before the remaining space\n priority = 1;\n # Format as FAT32\n content = {\n type = \"filesystem\";\n format = \"vfat\";\n # Use as boot partition. Disko uses the information here to mount\n # partitions on disk image generation. 
Use the same settings as\n # fileSystems.*\n mountpoint = \"/boot\";\n mountOptions = [\"fmask=0077\" \"dmask=0077\"];\n };\n };\n\n # Partition to store the NixOS system, use all remaining space.\n nix = {\n size = \"100%\";\n # Format as Btrfs. Change per your needs.\n content = {\n type = \"filesystem\";\n format = \"btrfs\";\n # Use as the Nix partition. Disko uses the information here to mount\n # partitions on disk image generation. Use the same settings as\n # fileSystems.*\n mountpoint = \"/nix\";\n mountOptions = [\"compress-force=zstd\" \"nosuid\" \"nodev\"];\n };\n };\n };\n };\n };\n\n # Since I enabled Impermanence, I need to declare the root partition as tmpfs,\n # so Disko can mount the partitions when generating disk images\n nodev.\"/\" = {\n fsType = \"tmpfs\";\n mountOptions = [\"relatime\" \"mode=755\" \"nosuid\" \"nodev\"];\n };\n };\n };\n\n # Since we aren't letting Disko manage fileSystems.*, we need to configure it ourselves\n # Root partition. It is tmpfs because I enabled Impermanence.\n fileSystems.\"/\" = {\n device = \"tmpfs\";\n fsType = \"tmpfs\";\n options = [\"relatime\" \"mode=755\" \"nosuid\" \"nodev\"];\n };\n\n # /nix partition, third partition on the disk image. Since my VPS recognizes\n # hard drive as \"sda\", I specify \"sda3\" here. If your VPS recognizes the drive\n # differently, change accordingly\n fileSystems.\"/nix\" = {\n device = \"/dev/sda3\";\n fsType = \"btrfs\";\n options = [\"compress-force=zstd\" \"nosuid\" \"nodev\"];\n };\n\n # /boot partition, second partition on the disk image. Since my VPS recognizes\n # hard drive as \"sda\", I specify \"sda2\" here. If your VPS recognizes the drive\n # differently, change accordingly\n fileSystems.\"/boot\" = {\n device = \"/dev/sda2\";\n fsType = \"vfat\";\n options = [\"fmask=0077\" \"dmask=0077\"];\n };\n}\n
\n\n\nIf you use Impermanence, or other mechanisms that use tmpfs as the root partition, follow the previous section and skip this one.
\n
Same as the last section, change your flake.nix
to add the Disko module:
{\n description = \"Lan Tian's NixOS Flake\";\n\n inputs = {\n nixpkgs.url = \"github:NixOS/nixpkgs/nixos-unstable\";\n impermanence.url = \"github:nix-community/impermanence\";\n # Add the following lines\n disko = {\n url = \"github:nix-community/disko\";\n inputs.nixpkgs.follows = \"nixpkgs\";\n };\n };\n\n outputs = {\n self,\n nixpkgs,\n ...\n } @ inputs: let\n lib = nixpkgs.lib;\n in rec {\n nixosConfigurations.bootstrap = lib.nixosSystem {\n system = \"x86_64-linux\";\n modules = [\n inputs.impermanence.nixosModules.impermanence\n\n # Add the next line\n inputs.disko.nixosModules.disko\n\n ./configuration.nix\n ];\n };\n };\n}\n
\nThen, we need to set up partitioning in the disk image with options provided by Disko. Modify configuration.nix
and add the following config:
{\n config,\n pkgs,\n lib,\n ...\n}: {\n # Other configurations omitted\n\n disko = {\n # Do not let Disko manage fileSystems.* config for NixOS.\n # Reason is that Disko mounts partitions by GPT partition names, which are\n # easily overwritten with tools like fdisk. When you fail to deploy a new\n # config in this case, the old config that comes with the disk image will\n # not boot either.\n enableConfig = false;\n\n devices = {\n # Define a disk\n disk.main = {\n # Size for generated disk image. 2GB is enough for me. Adjust per your need.\n imageSize = \"2G\";\n # Path to disk. When Disko generates disk images, it actually runs a QEMU\n # virtual machine and runs the installation steps. Whether your VPS\n # recognizes its hard disk as \"sda\" or \"vda\" doesn't matter. We go by\n # Disko's QEMU VM and use \"vda\" here.\n device = \"/dev/vda\";\n type = \"disk\";\n # Partition table for this disk\n content = {\n # Use GPT partition table. There seem to be some issues with MBR support\n # from Disko.\n type = \"gpt\";\n # Partition list\n partitions = {\n # Compared to MBR, GPT partition table doesn't reserve space for MBR\n # boot record. We need to reserve the first 1MB for MBR boot record,\n # so Grub can be installed here.\n boot = {\n size = \"1M\";\n type = \"EF02\"; # for grub MBR\n # Use the highest priority to ensure it's at the beginning\n priority = 0;\n };\n\n # ESP partition, or \"boot\" partition as you may call it. In theory,\n # this config will support VPSes with both EFI and BIOS boot modes.\n ESP = {\n name = \"ESP\";\n # Reserve 512MB of space per my own need. If you use more/less\n # on your boot partition, adjust accordingly.\n size = \"512M\";\n type = \"EF00\";\n # Use the second highest priority so it's before the remaining space\n priority = 1;\n # Format as FAT32\n content = {\n type = \"filesystem\";\n format = \"vfat\";\n # Use as boot partition. Disko uses the information here to mount\n # partitions on disk image generation. 
Use the same settings as\n # fileSystems.*\n mountpoint = \"/boot\";\n mountOptions = [\"fmask=0077\" \"dmask=0077\"];\n };\n };\n\n # Partition to store the NixOS system, use all remaining space.\n nix = {\n size = \"100%\";\n # Format as Btrfs. Change per your needs.\n content = {\n type = \"filesystem\";\n format = \"btrfs\";\n # Use as the root partition. Disko uses the information here to mount\n # partitions on disk image generation. Use the same settings as\n # fileSystems.*\n mountpoint = \"/\";\n mountOptions = [\"compress-force=zstd\" \"nosuid\" \"nodev\"];\n };\n };\n };\n };\n };\n };\n };\n\n # Since we aren't letting Disko manage fileSystems.*, we need to configure it ourselves\n # Root partition, third partition on the disk image. Since my VPS recognizes\n # hard drive as \"sda\", I specify \"sda3\" here. If your VPS recognizes the drive\n # differently, change accordingly\n fileSystems.\"/\" = {\n device = \"/dev/sda3\";\n fsType = \"btrfs\";\n options = [\"compress-force=zstd\" \"nosuid\" \"nodev\"];\n };\n\n # /boot partition, second partition on the disk image. Since my VPS recognizes\n # hard drive as \"sda\", I specify \"sda2\" here. If your VPS recognizes the drive\n # differently, change accordingly\n fileSystems.\"/boot\" = {\n device = \"/dev/sda2\";\n fsType = \"vfat\";\n options = [\"fmask=0077\" \"dmask=0077\"];\n };\n}\n
\nChange flake.nix
to add a \"package\" that invokes Disko's disk image generation:
{\n description = \"Lan Tian's NixOS Flake\";\n\n inputs = {\n nixpkgs.url = \"github:NixOS/nixpkgs/nixos-unstable\";\n impermanence.url = \"github:nix-community/impermanence\";\n disko = {\n url = \"github:nix-community/disko\";\n inputs.nixpkgs.follows = \"nixpkgs\";\n };\n };\n\n outputs = {\n self,\n nixpkgs,\n ...\n } @ inputs: let\n lib = nixpkgs.lib;\n in rec {\n nixosConfigurations.bootstrap = lib.nixosSystem {\n system = \"x86_64-linux\";\n modules = [\n inputs.impermanence.nixosModules.impermanence\n inputs.disko.nixosModules.disko\n ./configuration.nix\n ];\n };\n\n # Add the following lines\n packages.x86_64-linux = {\n image = self.nixosConfigurations.bootstrap.config.system.build.diskoImages;\n };\n };\n}\n
\nFinally, run nix build .#image
. After a short while, you will see the generated disk image at result/main.raw
.
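Before uploading, a quick sanity check that the file really starts with a GPT partition table can save a round trip. The sketch below is my own addition rather than anything Disko provides: it assumes 512-byte logical sectors and only looks at two well-known markers, the protective MBR boot signature (`0x55 0xAA` at byte offset 510) and the GPT header signature (`EFI PART` at the start of the second sector). The function name `looks_like_gpt_image` is hypothetical.

```python
SECTOR_SIZE = 512  # assumption: the image uses 512-byte logical sectors


def looks_like_gpt_image(path: str) -> bool:
    """Return True if the file begins with a protective MBR followed by a GPT header."""
    with open(path, "rb") as f:
        head = f.read(2 * SECTOR_SIZE)
    if len(head) < 2 * SECTOR_SIZE:
        # Too short to contain both the MBR sector and the GPT header sector.
        return False
    # The last two bytes of sector 0 carry the MBR boot signature.
    has_mbr_signature = head[510:512] == b"\x55\xaa"
    # Sector 1 (LBA 1) starts with the 8-byte GPT header signature.
    has_gpt_signature = head[SECTOR_SIZE:SECTOR_SIZE + 8] == b"EFI PART"
    return has_mbr_signature and has_gpt_signature
```

Calling `looks_like_gpt_image("result/main.raw")` from the flake directory would run the check against the freshly built image. It says nothing about whether the partitions inside are correct, only that the header markers are where a GPT disk would have them.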
Boot your VPS into its rescue environment, or into a lightweight Linux distro such as Alpine Linux.
\nIf your rescue environment has an SSH server, use the following command to upload your image:
\n# Change to sda/vda based on how your VPS recognizes its hard drive\ncat result/main.raw | ssh root@123.45.678.90 \"dd of=/dev/sda\"\n
\nIf your rescue environment doesn't have SSH, use netcat instead (WARNING: the transfer is not encrypted!):
\n# Change to sda/vda based on how your VPS recognizes its hard drive\n# Run this on VPS\nnc -l 1234 | dd of=/dev/sda\n# Run this on local computer\ncat result/main.raw | nc 123.45.678.90 1234\n
\nReboot your VPS after the command finishes. It should now boot into the freshly installed NixOS.
\nSince the disk image we created is only 2GB in size, the partitions written to the VPS's hard drive don't cover the whole drive. You will need to expand the last partition manually.
\nRun fdisk /dev/sda
, remove the third partition for /nix
(or /
), then recreate it with the same start position and an end position at the end of the hard drive. When fdisk asks whether to remove the existing filesystem signature, answer no!
Finally, run the filesystem resize command for your filesystem. For ext4 partitions, use resize2fs /dev/sda3
. For Btrfs, use btrfs filesystem resize max /nix
 (or /, depending on where the partition is mounted).
(题图来自:维基百科 - 三角函数)
\n我想计算我的所有 VPS 节点之间的网络延迟,并把延迟写入 Bird BGP 服务端的配置中,以便让节点之间的数据转发经过延迟最低的路径。但是,我的节点截至今天有 17 个,我不想在节点之间手动两两 Ping 获取延迟。
\n于是我想了一种方法:标记所有节点所在物理地点的经纬度,根据经纬度计算物理距离,再将距离除以光速的一半即可获得大致的延迟。我随机抽样了几对节点,发现她们之间的路由都比较直,没有严重的绕路现象,此时物理距离就是一个可以满足我要求的近似值。
\n因为我的节点上用的都是 NixOS,统一使用 Nix 语言管理配置,所以我需要找到一种在 Nix 中计算这个距离的方法。一种常用的根据经纬度算距离的方法是半正矢公式(Haversine Formula),它将地球近似为一个半径为 6371 公里的球体,再使用以下公式计算经纬度之间的距离:
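\n\n\n$$\\begin{aligned} h = \\operatorname{hav}\\left(\\frac{d}{r}\\right) &= \\operatorname{hav}(\\varphi_2 - \\varphi_1) + \\cos(\\varphi_1) \\cos(\\varphi_2) \\operatorname{hav}(\\lambda_2 - \\lambda_1) \\\\ \\text{其中:} \\operatorname{hav}(\\theta) &= \\sin^2\\left(\\frac{\\theta}{2}\\right) = \\frac{1 - \\cos(\\theta)}{2} \\\\ \\text{可得:} d &= r \\cdot \\operatorname{archav}(h) = 2r \\cdot \\arcsin\\left(\\sqrt{h}\\right) \\\\ &= 2r \\cdot \\arcsin\\left(\\sqrt{\\sin^2\\left(\\frac{\\varphi_2 - \\varphi_1}{2}\\right) + \\cos(\\varphi_1) \\cos(\\varphi_2) \\sin^2\\left(\\frac{\\lambda_2 - \\lambda_1}{2}\\right)}\\right) \\end{aligned}$$\n参考资料:维基百科 - 半正矢公式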
\n
\n\n注:半正矢公式有几种变体,我实际参考的是 Stackoverflow 上的这一版使用 arctan 函数的实现:https://stackoverflow.com/a/27943
\n
但是,Nix 作为一个打包、写软件配置用的语言,自然没有三角函数的支持,只能完成一些简单的浮点数计算。
\n于是我用了另一种方法,直接调用 Python 的 geopy
模块计算距离:
{\n pkgs,\n lib,\n ...\n}: let\nin {\n # 计算两个经纬度之间的距离,单位是公里\n distance = a: b: let\n py = pkgs.python3.withPackages (p: with p; [geopy]);\n\n helper = a: b:\n lib.toInt (builtins.readFile (pkgs.runCommandLocal\n \"geo-result.txt\"\n {nativeBuildInputs = [py];}\n ''\n python > $out <<EOF\n import geopy.distance\n print(int(geopy.distance.geodesic((${a.lat}, ${a.lng}), (${b.lat}, ${b.lng})).km))\n EOF\n ''));\n in\n if a.lat < b.lat || (a.lat == b.lat && a.lng < b.lng)\n then helper a b\n else helper b a;\n}\n
\n这种方法能用,但这相当于为每组不同的经纬度单独创建了一个「软件包」,再让 Nix 进行构建。Nix 为了尽可能保持可重复打包,避免软件包打包过程中引入变量,会创建一个不联网、磁盘访问受限的沙盒环境,然后在这个虚拟环境中启动 Python,加载 geopy
,进行计算。这个过程很慢,在我的笔记本电脑(i7-11800H)上需要为每个软件包花大约 0.5 秒,而且由于 Nix 的限制无法并行处理。截至今天,我的 17 个节点分散在全世界 10 个不同的城市,这意味着计算这些距离就要花费 $\\frac{10 \\cdot 9}{2} \\cdot 0.5\\,\\mathrm{s} = 22.5\\,\\mathrm{s}$ 的时间。
而且,由于构建软件包的函数 pkgs.runCommandLocal
的输出立即被 builtins.readFile
 读取,这些距离计算用的软件包并不会被我的 NixOS 配置直接引用,也就意味着它们的引用计数为 0,在运行 nix-collect-garbage -d
时会被立即清理。之后构建下一次配置时,又要花费 22.5 秒再计算一遍。
那么,我能不能不再依赖 Python,而是使用 Nix 的简单的浮点数功能实现 sin,cos,tan 这些三角函数,从而实现计算半正矢函数呢?
\n于是就有了今天的项目:使用纯 Nix 语言实现的三角函数库。
\n正弦 sin 和余弦 cos 这两个三角函数都有比较简单的计算方法:泰勒级数。我们都知道,正弦 sin 有如下的泰勒展开式:
\n$$\\sin x = \\sum_{n=0}^{\\infty} (-1)^n \\frac{x^{2n+1}}{(2n+1)!} = x - \\frac{x^3}{3!} + \\frac{x^5}{5!} - \\dots$$\n不难发现,每个泰勒展开项可以用基本的四则运算完成计算。我们就可以在 Nix 中实现如下的函数:
\n{\n pi = 3.14159265358979323846264338327950288;\n\n # 辅助函数,对数列中的所有项求和/乘积\n sum = builtins.foldl' builtins.add 0;\n multiply = builtins.foldl' builtins.mul 1;\n\n # 取余函数,计算 a mod b,用于将 sin/cos 的输入限制到 (-2pi, 2pi)\n mod = a: b:\n if a < 0\n then mod (b - mod (0 - a) b) b\n else a - b * (div a b);\n\n # 乘方函数,计算 x^times,其中 times 为整数\n pow = x: times: multiply (lib.replicate times x);\n\n # 正弦函数\n sin = x: let\n # 将 x 转为浮点数避免整数乘除法,并取余 2pi 限制输入范围,避免精度损失\n x' = mod (1.0 * x) (2 * pi);\n # 计算数列中的第 i 项,其中 i 从 1 开始\n step = i: (pow (0 - 1) (i - 1)) * multiply (lib.genList (j: x' / (j + 1)) (i * 2 - 1));\n # 注:此处 lib.genList 的调用相当于 for (j = 0; j < i*2-1; j++)\n in\n # TODO:咕咕咕\n 0;\n}\n
\n其中计算单个泰勒展开项时,为了避免浮点数的精度损失,没有分别计算分子分母两个大数再相除,而是将 $\\frac{x^n}{n!}$ 展开成 $\\frac{x}{1} \\cdot \\frac{x}{2} \\cdot \\dots \\cdot \\frac{x}{n}$,单独计算每一项,再将所有数值相对较小的结果相乘。
\n然后,我们要决定计算多少项。我们可以选择计算固定的项数,比如 10 项:
\n{\n sin = x: let\n x' = mod (1.0 * x) (2 * pi);\n step = i: (pow (0 - 1) (i - 1)) * multiply (lib.genList (j: x' / (j + 1)) (i * 2 - 1));\n in\n # 如果 x < 0 就取负,进一步缩小要处理的范围\n if x < 0\n then -sin (0 - x)\n # 计算 10 项泰勒展开项并求和\n else sum (lib.genList (i: step (i + 1)) 10);\n}\n
\n但是计算固定项数时,因为 Nix 的浮点数精度有限(64 位双精度浮点数),输入值很小时泰勒展开项很快就小于浮点数精度,浪费计算次数,而输入值很大时计算 10 项又不能保证计算足够精确。于是我决定改成根据泰勒展开项的值决定,在这一步计算结果小于精度要求时结束计算:
\n{\n # 精度限制,泰勒展开项小于该值时停止计算\n epsilon = pow (0.1) 10;\n\n # 绝对值函数 abs 以及别名 fabs\n abs = x:\n if x < 0\n then 0 - x\n else x;\n fabs = abs;\n\n sin = x: let\n x' = mod (1.0 * x) (2 * pi);\n step = i: (pow (0 - 1) (i - 1)) * multiply (lib.genList (j: x' / (j + 1)) (i * 2 - 1));\n # 如果当前项的绝对值小于 epsilon 就停止计算,否则继续算下一步\n # tmp 用于累加,i 是泰勒展开项的编号计数\n helper = tmp: i: let\n value = step i;\n in\n if (fabs value) < epsilon\n then tmp\n else helper (tmp + value) (i + 1);\n in\n if x < 0\n then -sin (0 - x)\n # 累加从 0 开始,编号从 1 开始\n else helper 0 1;\n}\n
\n于是我们就有了一个足够精确的正弦 sin 函数。把它的输入值从 0 到 10(大于 2π),每隔 0.001 扫描一遍:
\n{\n # arange:生成一个从 min(含)到 max(不含),间隔 step 的数列\n arange = min: max: step: let\n count = floor ((max - min) / step);\n in\n lib.genList (i: min + step * i) count;\n\n # arange2:生成一个从 min(含)到 max(含),间隔 step 的数列\n arange2 = min: max: step: arange min (max + step) step;\n\n # 测试函数:将数组 inputs 中的每个值都用函数 fn 计算一遍,生成 input -> output 的 attrset\n testOnInputs = inputs: fn:\n builtins.listToAttrs (builtins.map (v: {\n name = builtins.toString v;\n value = fn v;\n })\n inputs);\n\n # 测试函数:将从 min(含)到 max(含),间隔 step 的输入都测试一遍\n testRange = min: max: step: testOnInputs (math.arange2 min max step);\n\n testOutput = testRange (0 - 10) 10 0.001 math.sin;\n}\n
\n将 testOutput
和 Python Numpy 的 np.sin
比较,所有结果的差距都小于 0.0001%,满足要求。
类似的,我们可以实现余弦 cos 函数:
\n{\n # 将余弦转换成正弦\n cos = x: sin (0.5 * pi - x);\n}\n
\n你不会真以为我会从零开始再来一遍吧?不会吧不会吧?
\n类似的,正切 tan 函数也很简单:
\n{\n tan = x: (sin x) / (cos x);\n}\n
\n将 cos
和 tan
用类似的方法测试,差距均小于 0.0001%。
arctan 函数也有泰勒展开式:
\n$$\\arctan x = \\sum_{n=0}^{\\infty} (-1)^n \\frac{x^{2n+1}}{2n+1} = x - \\frac{x^3}{3} + \\frac{x^5}{5} - \\dots$$\n但是很容易发现,arctan 的泰勒展开式收敛远不如 sin 的展开式快。由于 arctan 展开式的分母线性增加,计算到小于 epsilon 所需的项数大幅增加,甚至可能直接让 Nix 的栈溢出:
\nerror: stack overflow (possible infinite recursion)\n
\n所以我们不能用泰勒展开式了,得用其它计算次数少的方法。受到 https://stackoverflow.com/a/42542593 的启发,我决定用多项式回归来拟合 [0,1] 上的 arctan 曲线,并将其它范围的 arctan 按如下规则进行映射:
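\n$$\\begin{aligned} x < 0: &\\quad \\arctan(x) = -\\arctan(-x) \\\\ x > 1: &\\quad \\arctan(x) = \\frac{\\pi}{2} - \\arctan\\left(\\frac{1}{x}\\right) \\end{aligned}$$\n启动 Python,加载 Numpy,开始拟合: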
\nimport numpy as np\n\n# 生成 arctan 函数的输入,[0, 1] 的 1000 个点:\na = np.linspace(0, 1, 1000)\n\n# 多项式回归,我指定用十次函数回归(x^10)\nfit = np.polyfit(a, np.arctan(a), 10)\n\n# 输出回归结果\nprint('\\n'.join([\"{0:.7f}\".format(i) for i in (fit[::-1])]))\n# 0.0000000\n# 0.9999991\n# 0.0000361\n# -0.3339481\n# 0.0056166\n# 0.1692346\n# 0.1067547\n# -0.3812212\n# 0.3314050\n# -0.1347016\n# 0.0222228\n
\n以上输出代表 [0,1] 上的 arctan 可以近似为:
\n$$\\arctan(x) \\approx 0 + 0.9999991x + 0.0000361x^2 - \\dots + 0.0222228x^{10}$$
\n于是我们就可以在 Nix 中实现以上多项式:
\n{\n # 多项式计算,x^0*poly[0] + x^1*poly[1] + ... + x^n*poly[n]\n polynomial = x: poly: let\n step = i: (pow x i) * (builtins.elemAt poly i);\n in\n sum (lib.genList step (builtins.length poly));\n\n # 反正切函数\n atan = x: let\n poly = [\n 0.0000000\n 0.9999991\n 0.0000366\n (0 - 0.3339528)\n 0.0056430\n 0.1691462\n 0.1069422\n (0 - 0.3814731)\n 0.3316130\n (0 - 0.1347978)\n 0.0222419\n ];\n in\n # x < 0 的映射\n if x < 0\n then -atan (0 - x)\n # x > 1 的映射\n else if x > 1\n then pi / 2 - atan (1 / x)\n # 0 <= x <= 1,多项式计算\n else polynomial x poly;\n}\n
\n进行精度测试,所有结果误差小于 0.0001%。
\n对于平方根函数,我们可以使用著名的牛顿法进行递推。我使用的递推公式是:
\n$$a_{n+1} = \\frac{a_n + \\frac{x}{a_n}}{2}$$
\n其中 x 是平方根函数的输入。
\n我们可以在 Nix 中如下实现牛顿法求平方根,递推到结果变化小于 epsilon 即可:
\n{\n # 平方根函数\n sqrt = x: let\n helper = tmp: let\n value = (tmp + 1.0 * x / tmp) / 2;\n in\n if (fabs (value - tmp)) < epsilon\n then value\n else helper value;\n in\n if x < epsilon\n then 0\n else helper (1.0 * x);\n}\n
\n精度测试显示所有结果的误差小于 1e−10(绝对值)。
\n有了以上函数,终于可以开始实现半正矢公式了。我参考的是 Stackoverflow 上这一版的实现:https://stackoverflow.com/a/27943
\n{\n # 角度转换成弧度\n deg2rad = x: x * pi / 180;\n\n # 半正矢公式,输入两个经纬度,输出地球上的球面距离\n haversine = lat1: lon1: lat2: lon2: let\n # 将地球视为半径 6371 公里的球体\n radius = 6371000;\n # 纬度差的弧度\n rad_lat = deg2rad ((1.0 * lat2) - (1.0 * lat1));\n # 经度差的弧度\n rad_lon = deg2rad ((1.0 * lon2) - (1.0 * lon1));\n # 按公式计算\n a = (sin (rad_lat / 2)) * (sin (rad_lat / 2)) + (cos (deg2rad (1.0 * lat1))) * (cos (deg2rad (1.0 * lat2))) * (sin (rad_lon / 2)) * (sin (rad_lon / 2));\n c = 2 * atan ((sqrt a) / (sqrt (1 - a)));\n in\n radius * c;\n}\n
\n最后根据光速计算理论延迟:
\n{\n # 150000:光每毫秒行进的米数,再除以 2(计算的是双向延迟)\n rttMs = lat1: lon1: lat2: lon2: floor ((haversine lat1 lon1 lat2 lon2) / 150000);\n}\n
\n我终于达成了最开始的目标:用经纬度除以光速计算节点间网络理论延迟。
\n以上三角函数(和一些额外的数学函数)可以在我的 GitHub 获取:https://github.com/xddxdd/nix-math
\n如果你使用 Nix Flake,可以用以下方式使用这些函数:
\n{\n inputs = {\n nix-math.url = \"github:xddxdd/nix-math\";\n };\n\n outputs = inputs: let\n math = inputs.nix-math.lib.math;\n in{\n value = math.sin (math.deg2rad 45);\n };\n}\n
","date_published":"2023-09-20T15:10:57.000Z","date_modified":"2024-03-18T07:22:20.513Z","author":{"name":"Lan Tian","url":"https://lantian.pub"},"tags":"计算机与客户端"},{"id":"https://lantian.pub/en/article/modify-computer/nix-trigonometric-math-library-from-zero.lantian/","url":"https://lantian.pub/en/article/modify-computer/nix-trigonometric-math-library-from-zero.lantian/","title":"Nix Trigonometric Math Library from Ground Zero","link":"https://lantian.pub/en/article/modify-computer/nix-trigonometric-math-library-from-zero.lantian/","summary":"","image":"/usr/uploads/202309/trigonometric.png","banner_image":"/usr/uploads/202309/trigonometric.png","content_html":"(Title image sourced from: Wikipedia - Trigonometry)
\nI wanted to calculate the network latency between all my VPS nodes and write it into the configuration of the Bird BGP daemon, so that packets are forwarded along the lowest-latency route. However, I have 17 nodes as of today, and I didn't want to manually run a ping
command between each pair.
So I came up with a solution: mark the latitude and longitude of each node's physical location, calculate the physical distance between nodes, and divide it by half the speed of light to get an approximate latency. I randomly sampled a few node pairs and found that the Internet routing between them is mostly direct, with no significant detours. In this case, the physical distance is an approximation good enough for my needs.
\nBecause I use NixOS on all my nodes and manage all configs with Nix, I need a way to calculate this distance in Nix. One commonly used method to calculate distance from latitude/longitude is the Haversine formula. It approximates the Earth as a sphere with a radius of 6371km, and then uses the following formula to calculate the distance:
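\n\n\n$$\\begin{aligned} h = \\operatorname{hav}\\left(\\frac{d}{r}\\right) &= \\operatorname{hav}(\\varphi_2 - \\varphi_1) + \\cos(\\varphi_1) \\cos(\\varphi_2) \\operatorname{hav}(\\lambda_2 - \\lambda_1) \\\\ \\text{Where: } \\operatorname{hav}(\\theta) &= \\sin^2\\left(\\frac{\\theta}{2}\\right) = \\frac{1 - \\cos(\\theta)}{2} \\\\ \\text{Therefore: } d &= r \\cdot \\operatorname{archav}(h) = 2r \\cdot \\arcsin\\left(\\sqrt{h}\\right) \\\\ &= 2r \\cdot \\arcsin\\left(\\sqrt{\\sin^2\\left(\\frac{\\varphi_2 - \\varphi_1}{2}\\right) + \\cos(\\varphi_1) \\cos(\\varphi_2) \\sin^2\\left(\\frac{\\lambda_2 - \\lambda_1}{2}\\right)}\\right) \\end{aligned}$$\nReference: Wikipedia - Haversine formula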
\n
\n\nNote: there are a few variations of Haversine formula. I actually used this arctan-based implementation from Stackoverflow: https://stackoverflow.com/a/27943
\n
However, Nix is a language focused on packaging and generating config files, so it naturally has no built-in trigonometric functions and only supports simple floating point arithmetic.
\nSo I initially worked around it by calling Python's geopy module to compute the distance:
{\n pkgs,\n lib,\n ...\n}: let\nin {\n # Calculate distance between latitudes/longitudes in kilometers\n distance = a: b: let\n py = pkgs.python3.withPackages (p: with p; [geopy]);\n\n helper = a: b:\n lib.toInt (builtins.readFile (pkgs.runCommandLocal\n \"geo-result.txt\"\n {nativeBuildInputs = [py];}\n ''\n python > $out <<EOF\n import geopy.distance\n print(int(geopy.distance.geodesic((${a.lat}, ${a.lng}), (${b.lat}, ${b.lng})).km))\n EOF\n ''));\n in\n if a.lat < b.lat || (a.lat == b.lat && a.lng < b.lng)\n then helper a b\n else helper b a;\n}\n
\nIt works, but what it really does is create a new \"package\" for each pair of latitudes/longitudes and have Nix build it. To keep packaging as reproducible as possible and prevent extra variables from being introduced, Nix creates a sandbox that is isolated from the Internet and restricted from arbitrary disk access, runs Python in this sandbox, has it load geopy, and does the calculation. This process is slow, taking around 0.5s for each package on my laptop (i7-11800H), and cannot be parallelized due to Nix's limitations. As of today, my 17 nodes are distributed across 10 different cities around the world. This means calculating all these distances alone takes $\\frac{10 \\cdot 9}{2} \\cdot 0.5\\,\\mathrm{s} = 22.5\\,\\mathrm{s}$.
In addition, since the output of the packaging function pkgs.runCommandLocal
is immediately consumed by builtins.readFile
, the packages for distance calculation are not directly referenced by my NixOS config. This means nothing keeps them alive, and they are removed the next time I run nix-collect-garbage -d. The next time I build my config, it takes another 22.5s to calculate all of them again.
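As a sanity check on the 22.5s figure, the arithmetic can be reproduced in a few lines of Python. This is only a sketch: the 10 cities and the 0.5s per package are the numbers quoted above, and the function name total_build_seconds is made up for this illustration.

```python
from math import comb


def total_build_seconds(cities: int, seconds_per_pair: float) -> float:
    """Time to build one package per unordered pair of cities, one after another."""
    # comb(cities, 2) is the number of unordered pairs: cities * (cities - 1) / 2
    return comb(cities, 2) * seconds_per_pair


print(total_build_seconds(10, 0.5))
```

The cost grows quadratically with the number of cities, which is why the per-package overhead matters more than it first appears.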
Could I stop relying on Python, and instead implement the trigonometric functions sin, cos and tan with Nix's simple floating point arithmetic, and then build the Haversine formula on top of them?
\nAnd so came today's project: a trigonometric math library implemented in pure Nix.
\nThe trigonometric functions, sine and cosine, have a relatively easy way to compute: Taylor expansions. We all know that the sine function has the following Taylor expansion:
\n$$\\sin x = \\sum_{n=0}^{\\infty} (-1)^n \\frac{x^{2n+1}}{(2n+1)!} = x - \\frac{x^3}{3!} + \\frac{x^5}{5!} - \\dots$$\nEach term of the expansion can be calculated with basic arithmetic operations. Therefore, we can implement the following functions in Nix:
\n{\n  pi = 3.14159265358979323846264338327950288;\n\n  # Helper functions to sum/multiply all items in the array\n  sum = builtins.foldl' builtins.add 0;\n  multiply = builtins.foldl' builtins.mul 1;\n\n  # Modulo function, \"a mod b\". Used for limiting input to sin/cos to (-2pi, 2pi)\n  mod = a: b:\n    if a < 0\n    then mod (b - mod (0 - a) b) b\n    else a - b * (div a b);\n\n  # Power function, calculates \"x^times\", where \"times\" is an integer\n  pow = x: times: multiply (lib.replicate times x);\n\n  # Sine function\n  sin = x: let\n    # Convert x to floating point to avoid integer arithmetic.\n    # Also take it modulo 2pi to limit input range and avoid precision loss\n    x' = mod (1.0 * x) (2 * pi);\n    # Calculate i-th item in the expansion, i starts from 1\n    step = i: (pow (0 - 1) (i - 1)) * multiply (lib.genList (j: x' / (j + 1)) (i * 2 - 1));\n    # Note: this lib.genList call is equal to for (j = 0; j < i*2-1; j++)\n  in\n    # TODO: Not completed yet!\n    0;\n}\n
\nWhen calculating a single Taylor expansion term, to avoid precision loss, I don't calculate the large numerator and denominator separately and then divide them. Instead, I expand $\\frac{x^n}{n!}$ into $\\frac{x}{1} \\cdot \\frac{x}{2} \\cdot \\dots \\cdot \\frac{x}{n}$, calculate each factor on its own, and multiply all these much smaller results.
\nThen, we need to decide how many terms to calculate. We could use a fixed number of terms, 10 for example:
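The factor-by-factor product can be sketched in Python as well. This is a hypothetical port for illustration only, not the code of the Nix library: taylor_term is a made-up name, and its loop mirrors the lib.genList call in the Nix snippet above.

```python
import math


def taylor_term(x: float, i: int) -> float:
    """i-th term (i starts from 1) of the sine series, built as a product of x/j factors."""
    sign = 1.0 if i % 2 == 1 else -1.0
    term = 1.0
    # j runs over 1 .. 2i-1, so the product equals x^(2i-1) / (2i-1)!
    for j in range(1, 2 * i):
        term *= x / j
    return sign * term


# Summing the first 10 terms at x = 1 lands very close to math.sin(1.0)
approx = sum(taylor_term(1.0, i) for i in range(1, 11))
print(approx, math.sin(1.0))
```

Because every factor x/j stays small, no intermediate value gets anywhere near the size of x^n or n! on their own.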
\n{\n sin = x: let\n x' = mod (1.0 * x) (2 * pi);\n step = i: (pow (0 - 1) (i - 1)) * multiply (lib.genList (j: x' / (j + 1)) (i * 2 - 1));\n in\n # Invert when x < 0 to reduce input range\n if x < 0\n then -sin (0 - x)\n # Calculate 10 Taylor expansion items and add them up\n else sum (lib.genList (i: step (i + 1)) 10);\n}\n
\nBut a fixed number of terms is a poor fit. Floating point precision is limited (Nix floats are 64-bit IEEE 754 doubles), so when the input is very small the Taylor expansion terms quickly fall below floating point accuracy and the remaining calculations are wasted, while with larger inputs the terms after the 10th are still too large to be ignored. So I decided to make the decision based on the value of the Taylor expansion terms, and stop the computation when a term falls below our accuracy target:
\n{\n # Accuracy target, stop iterating when Taylor expansion item is below this\n epsilon = pow (0.1) 10;\n\n # Absolute value function \"abs\" and its alias \"fabs\"\n abs = x:\n if x < 0\n then 0 - x\n else x;\n fabs = abs;\n\n sin = x: let\n x' = mod (1.0 * x) (2 * pi);\n step = i: (pow (0 - 1) (i - 1)) * multiply (lib.genList (j: x' / (j + 1)) (i * 2 - 1));\n # Stop if absolute value of current item is below epsilon, continue otherwise\n # \"tmp\" is the accumulator, and \"i\" is the index for the Taylor expansion item\n helper = tmp: i: let\n value = step i;\n in\n if (fabs value) < epsilon\n then tmp\n else helper (tmp + value) (i + 1);\n in\n if x < 0\n then -sin (0 - x)\n # Accumulate from 0, index start from 1\n else helper 0 1;\n}\n
\nNow we have a sine function with sufficient accuracy. Scan its results with inputs from -10 to 10 (beyond ±2π), with a step of 0.001:
\n{\n  # arange: generate an array from \"min\" (inclusive) to \"max\" (exclusive) every \"step\"\n  arange = min: max: step: let\n    count = floor ((max - min) / step);\n  in\n    lib.genList (i: min + step * i) count;\n\n  # arange2: generate an array from \"min\" (inclusive) to \"max\" (inclusive) every \"step\"\n  arange2 = min: max: step: arange min (max + step) step;\n\n  # Test function: calculate each value from array \"inputs\" with \"fn\", and generate an attrset for input -> output\n  testOnInputs = inputs: fn:\n    builtins.listToAttrs (builtins.map (v: {\n        name = builtins.toString v;\n        value = fn v;\n      })\n      inputs);\n\n  # Test function: try all inputs from \"min\" (inclusive) to \"max\" (inclusive) every \"step\"\n  testRange = min: max: step: testOnInputs (math.arange2 min max step);\n\n  testOutput = testRange (0 - 10) 10 0.001 math.sin;\n}\n
\nI compared testOutput to the output of NumPy's np.sin, and all the results are within 0.0001% of the true value. This satisfies our precision requirements.
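The whole procedure (range reduction, adaptive stopping, and the sweep against a reference) can be mirrored in Python to see how it behaves. This is a rough sketch under stated assumptions, not the library's code: taylor_sin and max_abs_error are hypothetical names, math.fmod stands in for the Nix mod helper, and the check uses absolute rather than relative error.

```python
import math

EPSILON = 0.1 ** 10  # same accuracy target as the Nix version


def taylor_sin(x: float) -> float:
    """Sine via Taylor series, stopping once a term drops below EPSILON."""
    if x < 0:
        return -taylor_sin(-x)
    x = math.fmod(x, 2 * math.pi)  # limit the input range
    total, i = 0.0, 1
    while True:
        term = 1.0
        for j in range(1, 2 * i):  # x^(2i-1) / (2i-1)! as a product of x/j
            term *= x / j
        if abs(term) < EPSILON:
            return total
        total += term if i % 2 == 1 else -term
        i += 1


def max_abs_error(lo: float, hi: float, step: float) -> float:
    """Largest absolute difference from math.sin over [lo, hi]."""
    count = int(round((hi - lo) / step)) + 1
    points = (lo + k * step for k in range(count))
    return max(abs(taylor_sin(p) - math.sin(p)) for p in points)


print(max_abs_error(-10, 10, 0.01))
```

The sketch uses a loop instead of recursion, so it does not reproduce any stack depth limits of the Nix evaluator.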
Similarly, we can implement the cosine function:
\n{\n # Convert cosine to sine\n cos = x: sin (0.5 * pi - x);\n}\n
\nYou didn't really think I'd do it all from ground zero again, did you?
\nSimilarly, the tangent function is also simple:
\n{\n tan = x: (sin x) / (cos x);\n}\n
\nI also ran the test on cos
and tan
, and the error is also within 0.0001%.
The arctangent function also has a Taylor expansion:
\n$$\\arctan x = \\sum_{n=0}^{\\infty} (-1)^n \\frac{x^{2n+1}}{2n+1} = x - \\frac{x^3}{3} + \\frac{x^5}{5} - \\dots$$\nYet it is easy to notice that arctan's Taylor expansion doesn't converge nearly as fast as sine's. Since its denominator only increases linearly, we need to calculate many more terms before one falls below epsilon, which may cause a stack overflow in Nix:
\nerror: stack overflow (possible infinite recursion)\n
\nSo Taylor expansion is no longer an option, and we need something that takes far fewer steps. Inspired by https://stackoverflow.com/a/42542593, I decided to fit the arctangent curve on [0,1] with polynomial regression, and map the arctangent function in other ranges using the following rules:
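\n$$\\begin{aligned} x < 0: &\\quad \\arctan(x) = -\\arctan(-x) \\\\ x > 1: &\\quad \\arctan(x) = \\frac{\\pi}{2} - \\arctan\\left(\\frac{1}{x}\\right) \\end{aligned}$$\nStart Python and Numpy, and begin the fitting process: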
\nimport numpy as np\n\n# Generate input to arctan, 1000 points on [0, 1]:\na = np.linspace(0, 1, 1000)\n\n# Polynomial regression, I'm using 10th order polynomial (x^10)\nfit = np.polyfit(a, np.arctan(a), 10)\n\n# Output regression results\nprint('\\n'.join([\"{0:.7f}\".format(i) for i in (fit[::-1])]))\n# 0.0000000\n# 0.9999991\n# 0.0000361\n# -0.3339481\n# 0.0056166\n# 0.1692346\n# 0.1067547\n# -0.3812212\n# 0.3314050\n# -0.1347016\n# 0.0222228\n
\nThe output above means that the arctangent function on [0,1] can be approximated with:
\n$$\\arctan(x) \\approx 0 + 0.9999991x + 0.0000361x^2 - \\dots + 0.0222228x^{10}$$
\nWe can replicate this polynomial function in Nix:
\n{\n # Polynomial calculation, x^0*poly[0] + x^1*poly[1] + ... + x^n*poly[n]\n polynomial = x: poly: let\n step = i: (pow x i) * (builtins.elemAt poly i);\n in\n sum (lib.genList step (builtins.length poly));\n\n # Arctangent function\n atan = x: let\n poly = [\n 0.0000000\n 0.9999991\n 0.0000366\n (0 - 0.3339528)\n 0.0056430\n 0.1691462\n 0.1069422\n (0 - 0.3814731)\n 0.3316130\n (0 - 0.1347978)\n 0.0222419\n ];\n in\n # Mapping when x < 0\n if x < 0\n then -atan (0 - x)\n # Mapping when x > 1\n else if x > 1\n then pi / 2 - atan (1 / x)\n # Polynomial calculation when 0 <= x <= 1\n else polynomial x poly;\n}\n
\nI ran the precision test, and all results are within 0.0001% of true value.
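A comparable precision check can be sketched in plain Python. The coefficients below are copied from the Nix list above, and the two mapping rules are the same. The names POLY and poly_atan are made up for this illustration, and the polynomial is evaluated with Horner's rule instead of the sum of powers used in the Nix version, which is a swapped-in technique rather than the author's method.

```python
import math

# Coefficients copied from the Nix list above, lowest order first
POLY = [
    0.0, 0.9999991, 0.0000366, -0.3339528, 0.0056430, 0.1691462,
    0.1069422, -0.3814731, 0.3316130, -0.1347978, 0.0222419,
]


def poly_atan(x: float) -> float:
    """Arctangent from the fitted polynomial on [0, 1], mapped to other ranges."""
    if x < 0:
        return -poly_atan(-x)
    if x > 1:
        return math.pi / 2 - poly_atan(1 / x)
    result = 0.0
    for coefficient in reversed(POLY):  # Horner's rule
        result = result * x + coefficient
    return result


worst = max(abs(poly_atan(k / 100) - math.atan(k / 100)) for k in range(-1000, 1001))
print(worst)
```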
\nFor the square root function, we can iterate with the famous Newton's method. The iteration formula I'm using is:
\n$$a_{n+1} = \\frac{a_n + \\frac{x}{a_n}}{2}$$
\nwhere x is the input to the square root function.
\nWe can implement Newton's method for square roots in Nix with the following code, iterating until the change in the result is below epsilon:
\n{\n # Square root function\n sqrt = x: let\n helper = tmp: let\n value = (tmp + 1.0 * x / tmp) / 2;\n in\n if (fabs (value - tmp)) < epsilon\n then value\n else helper value;\n in\n if x < epsilon\n then 0\n else helper (1.0 * x);\n}\n
\nThe precision test shows that all results are within 1e-10 (absolute error) of the true value.
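For comparison, the same iteration can be written as a short Python sketch. It is an illustrative port with a hypothetical name (newton_sqrt) and a loop in place of recursion, starting from the input itself as in the Nix code.

```python
import math

EPSILON = 0.1 ** 10  # same accuracy target as the Nix version


def newton_sqrt(x: float) -> float:
    """Square root via Newton's method, iterating until the change is below EPSILON."""
    if x < EPSILON:
        return 0.0
    guess = float(x)
    while True:
        next_guess = (guess + x / guess) / 2
        if abs(next_guess - guess) < EPSILON:
            return next_guess
        guess = next_guess


for value in (2, 9, 0.25, 1e6):
    print(value, newton_sqrt(value), math.sqrt(value))
```

The stopping rule compares absolute differences, so this sketch is only meant for inputs of moderate size, where neighbouring floating point values are closer together than EPSILON.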
\nWith the functions above ready, we can finally start implementing the Haversine formula. I'm using this implementation from Stackoverflow as a reference: https://stackoverflow.com/a/27943
\n{\n # Convert degree to radian\n deg2rad = x: x * pi / 180;\n\n # Haversine formula, input a pair of latitudes/longitudes, output surface distance on Earth\n haversine = lat1: lon1: lat2: lon2: let\n # Treat the Earth as a sphere with radius of 6371km\n radius = 6371000;\n # Radian of latitude difference\n rad_lat = deg2rad ((1.0 * lat2) - (1.0 * lat1));\n # Radian of longitude difference\n rad_lon = deg2rad ((1.0 * lon2) - (1.0 * lon1));\n # Calculate based on formula\n a = (sin (rad_lat / 2)) * (sin (rad_lat / 2)) + (cos (deg2rad (1.0 * lat1))) * (cos (deg2rad (1.0 * lat2))) * (sin (rad_lon / 2)) * (sin (rad_lon / 2));\n c = 2 * atan ((sqrt a) / (sqrt (1 - a)));\n in\n radius * c;\n}\n
\nFinally, calculate the theoretical delay based on light speed:
\n{\n  # 150000: meters light travels per millisecond (about 300000), divided by 2 (for the round trip)\n  rttMs = lat1: lon1: lat2: lon2: floor ((haversine lat1 lon1 lat2 lon2) / 150000);\n}\n
\nI finally reached the target I was aiming for: calculating the theoretical network latency between my nodes from their coordinates and the speed of light.
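To cross-check the numbers outside Nix, here is a small Python sketch of the same calculation using the standard math module. The names haversine_m and rtt_ms are hypothetical, and math.atan2 is used in place of atan on the quotient of the two square roots, a substitution that avoids dividing by zero for antipodal points.

```python
import math

EARTH_RADIUS_M = 6371000            # Earth treated as a sphere, in meters
HALF_LIGHT_SPEED_M_PER_MS = 150000  # about 300000 m per ms, halved for the round trip


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Surface distance in meters between two latitude/longitude pairs in degrees."""
    d_lat = math.radians(lat2 - lat1)
    d_lon = math.radians(lon2 - lon1)
    a = (math.sin(d_lat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(d_lon / 2) ** 2)
    a = min(1.0, max(0.0, a))  # guard against rounding just outside [0, 1]
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_M * c


def rtt_ms(lat1: float, lon1: float, lat2: float, lon2: float) -> int:
    """Theoretical round trip time in whole milliseconds."""
    return math.floor(haversine_m(lat1, lon1, lat2, lon2) / HALF_LIGHT_SPEED_M_PER_MS)


# A quarter of the way around the equator
print(haversine_m(0, 0, 0, 90), rtt_ms(0, 0, 0, 90))
```

A quarter of the equator is roughly 10,000km, which gives a theoretical round trip of 66ms.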
\nAll these trigonometric functions (and some extra math functions) can be obtained from my GitHub: https://github.com/xddxdd/nix-math
\nIf you're using Nix Flake, you can use the function as follows:
\n{\n inputs = {\n nix-math.url = \"github:xddxdd/nix-math\";\n };\n\n outputs = inputs: let\n math = inputs.nix-math.lib.math;\n in{\n value = math.sin (math.deg2rad 45);\n };\n}\n
","date_published":"2023-09-20T15:10:57.000Z","date_modified":"2024-03-18T07:22:20.537Z","author":{"name":"Lan Tian","url":"https://lantian.pub"},"tags":"Computers and Clients"},{"id":"https://lantian.pub/article/random-notes/pipewire-sigkill-fix.lantian/","url":"https://lantian.pub/article/random-notes/pipewire-sigkill-fix.lantian/","title":"解决 Pipewire 被 SIGKILL 的问题","link":"https://lantian.pub/article/random-notes/pipewire-sigkill-fix.lantian/","summary":"","image":"/usr/uploads/202305/pipewire.png","banner_image":"/usr/uploads/202305/pipewire.png","content_html":"我频繁遇到 Pipewire 音频框架突然停止运行的情况:
\nsystemctl --user status pipewire.service
只能看到 Pipewire 进程被 SIGKILL
信号终止,没有其它有用的日志信息;coredumpctl
和 dmesg
里也找不到 Coredump 内存转储事件的记录。Pipewire 进程运行时具有实时优先级,其调度需求被最优先满足,以便及时处理音频数据,避免音频卡顿。Pipewire 提高进程优先级是通过它的 libpipewire-module-rt
模块请求系统中以 root
权限运行的 RTKit
(Realtime Kit)服务,然后 RTKit
以特权修改进程优先级来达成的。
但是,如果一个具有实时优先级的进程出了 Bug,进入了死循环,那么它会占用所有的 CPU 资源。系统上绝大部分其它进程(包括但不限于 SSH 服务端,Xorg,还有你的 Shell)由于优先级更低,就无法得到任何 CPU 时间片,无法处理任何任务,包括你尝试修复系统时输入的命令。
\n为了避免这个问题,Linux 内核默认对实时进程的运行时长做了限制。在默认设置下,实时进程必须在 200 毫秒内完成这一次的计算(例如 Pipewire 的音频处理),调用 sched_yield
系统调用把 CPU 时间片交还给其它进程。之后这个进程就可以在后台等待下一次事件触发(例如声卡的音频缓冲区即将耗尽),Linux 内核再次调度这个实时进程。如果实时进程在 200 毫秒后仍未完成计算,Linux 内核会直接发送 SIGKILL 信号结束进程。
由于我的电脑在切换性能模式时发生卡顿,Pipewire 处理音频数据的耗时超过了 200 毫秒,就被 Linux 内核直接结束了进程。
\n由于我没法解决电脑切换性能模式时卡顿的问题,我选择把 Pipewire 的运行时长限制提升到 5 秒,足够电脑卡顿时 Pipewire 处理音频数据。
\n首先需要修改 Pipewire libpipewire-module-rt
模块的参数,让 Pipewire 申请更长的时间限制:
{\n \"context.modules\": [\n {\n \"args\": {\n \"nice.level\": -11,\n \"rt.prio\": 88,\n \"rt.time.hard\": 5000000,\n \"rt.time.soft\": 5000000\n },\n \"flags\": [\n \"ifexists\",\n \"nofail\"\n ],\n \"name\": \"libpipewire-module-rt\"\n }\n ]\n}\n
\n其中 5000000 的单位是微秒,换算成秒数为 5 秒整。
\n然后,由于 RTKit
还有一层运行时长限制,我们还需要给 RTKit
添加启动参数,提高它的限制。运行 systemctl edit rtkit-daemon.service
,然后输入以下内容:
[Service]\n# 先清除掉原先的 ExecStart 命令\nExecStart=\n# 然后换成我们的加了参数的命令,如果你的发行版 rtkit-daemon 路径不同,请自行修改\nExecStart=/usr/lib/rtkit-daemon --rttime-usec-max=5000000\n
\n如果你用的是 NixOS 系统,可以直接使用下面的配置:
\nlet\n # 时间限制,单位是微秒\n realtimeLimitUS = 5000000;\nin {\n security.rtkit.enable = true;\n systemd.services.rtkit-daemon.serviceConfig.ExecStart = [\n \"\" # 清除掉原先的 ExecStart 命令\n \"${pkgs.rtkit}/libexec/rtkit-daemon --rttime-usec-max=${builtins.toString realtimeLimitUS}\"\n ];\n\n services.pipewire.enable = true;\n\n environment.etc = {\n \"pipewire/pipewire.conf.d/rtprio.conf\".text = builtins.toJSON {\n \"context.modules\" = [\n {\n name = \"libpipewire-module-rt\";\n args = {\n \"nice.level\" = -11;\n \"rt.prio\" = 88;\n \"rt.time.soft\" = realtimeLimitUS;\n \"rt.time.hard\" = realtimeLimitUS;\n };\n flags = [\"ifexists\" \"nofail\"];\n }\n ];\n };\n };\n}\n
","date_published":"2023-05-28T17:38:58.000Z","date_modified":"2024-03-18T07:22:20.535Z","author":{"name":"Lan Tian","url":"https://lantian.pub"},"tags":"随手记"},{"id":"https://lantian.pub/en/article/random-notes/pipewire-sigkill-fix.lantian/","url":"https://lantian.pub/en/article/random-notes/pipewire-sigkill-fix.lantian/","title":"Preventing Pipewire from being SIGKILLed","link":"https://lantian.pub/en/article/random-notes/pipewire-sigkill-fix.lantian/","summary":"","image":"/usr/uploads/202305/pipewire.png","banner_image":"/usr/uploads/202305/pipewire.png","content_html":"I frequently encounter the situation that the Pipewire audio server is suddenly stopped:
\nsystemctl --user status pipewire.service
only shows that the Pipewire process was terminated by a SIGKILL
 signal, without any other useful log information.
\nNeither coredumpctl nor dmesg shows any record of a core dump.
\nThe Pipewire process runs with realtime priority, which means its scheduling needs are satisfied first, so it can process audio data in time to prevent stuttering. To increase its process priority, Pipewire uses its libpipewire-module-rt
module to send requests to the RTKit
service running as root
in the system. RTKit
then changes process priority with its privileges.
However, if a process with realtime priority encountered a bug, for example an infinite loop, it will consume all CPU resources. Since most of the other processes (including but not limited to, SSH daemon, Xorg, and your shell) are running with a lower priority, they won't get any CPU time slices, and won't be able to handle any tasks, including your command inputs trying to fix the system.
\nTo mitigate this problem, the Linux kernel can limit the execution time of realtime processes. Under the default settings, a realtime process must finish its computations (like Pipewire's audio processing) within 200ms of CPU time, and then give up the CPU with a blocking system call, for example by waiting for the next event (like the sound card running low on audio buffer), after which the Linux kernel schedules the process again. Note that according to the setrlimit(2) man page, calling sched_yield is not enough to reset this timer. If the realtime process does not finish within 200ms, the Linux kernel sends a SIGKILL signal to terminate it.
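The limit described here appears to correspond to the RLIMIT_RTTIME resource limit documented in setrlimit(2). That mapping is an assumption, since the text does not name the limit. Under that assumption, a minimal Python sketch for inspecting the limit on Linux could look like this (rttime_limits and describe are hypothetical helper names):

```python
import resource


def rttime_limits():
    """Return (soft, hard) RLIMIT_RTTIME in microseconds, or None where unsupported."""
    limit = getattr(resource, 'RLIMIT_RTTIME', None)
    if limit is None:
        return None
    return resource.getrlimit(limit)


def describe(value: int) -> str:
    """Format a limit value; RLIM_INFINITY means no limit is set."""
    if value == resource.RLIM_INFINITY:
        return 'unlimited'
    return f'{value} us ({value / 1_000_000:g} s)'


limits = rttime_limits()
if limits is None:
    print('RLIMIT_RTTIME is not available on this platform')
else:
    soft, hard = limits
    print('soft:', describe(soft), 'hard:', describe(hard))
```

This reports the limit of the Python process itself, not of a running Pipewire process, so it is mainly useful for understanding the units involved.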
Because my computer was lagging while switching between performance profiles, Pipewire spent more than 200ms handling audio data, and was therefore terminated by the Linux kernel.
\nSince I'm unable to fix the lagging while switching performance profiles, I decided to increase Pipewire's time limit to 5 seconds, enough for it to process audio data even when the computer is lagging.
\nFirst we need to change the settings for Pipewire's libpipewire-module-rt
module, to make it request a longer time limit:
{\n \"context.modules\": [\n {\n \"args\": {\n \"nice.level\": -11,\n \"rt.prio\": 88,\n \"rt.time.hard\": 5000000,\n \"rt.time.soft\": 5000000\n },\n \"flags\": [\n \"ifexists\",\n \"nofail\"\n ],\n \"name\": \"libpipewire-module-rt\"\n }\n ]\n}\n
\nThe unit for 5000000 is microseconds. Pipewire will now request a 5 second time limit.
\nThen, since RTKit
imposes an additional layer of execution time limit, we need to add a startup argument to RTKit
to increase that limit as well. Run systemctl edit rtkit-daemon.service
, and enter the following config:
[Service]\n# First, clear the original ExecStart command.\nExecStart=\n# Then replace with our command with additional arguments.\n# If your distro puts rtkit-daemon elsewhere, change command to match.\nExecStart=/usr/lib/rtkit-daemon --rttime-usec-max=5000000\n
\nIf you're using NixOS, you can use the following config instead:
\n{pkgs, ...}: let\n  # Time limit in microseconds\n  realtimeLimitUS = 5000000;\nin {\n  security.rtkit.enable = true;\n  systemd.services.rtkit-daemon.serviceConfig.ExecStart = [\n    \"\" # Override command in rtkit package's service file\n    \"${pkgs.rtkit}/libexec/rtkit-daemon --rttime-usec-max=${builtins.toString realtimeLimitUS}\"\n  ];\n\n  services.pipewire.enable = true;\n\n  environment.etc = {\n    \"pipewire/pipewire.conf.d/rtprio.conf\".text = builtins.toJSON {\n      \"context.modules\" = [\n        {\n          name = \"libpipewire-module-rt\";\n          args = {\n            \"nice.level\" = -11;\n            \"rt.prio\" = 88;\n            \"rt.time.soft\" = realtimeLimitUS;\n            \"rt.time.hard\" = realtimeLimitUS;\n          };\n          flags = [\"ifexists\" \"nofail\"];\n        }\n      ];\n    };\n  };\n}\n
","date_published":"2023-05-28T17:38:58.000Z","date_modified":"2024-03-18T07:22:20.545Z","author":{"name":"Lan Tian","url":"https://lantian.pub"},"tags":"Random Notes"},{"id":"https://lantian.pub/article/modify-website/how-to-kill-the-dn42-network.lantian/","url":"https://lantian.pub/article/modify-website/how-to-kill-the-dn42-network.lantian/","title":"如何引爆 DN42 网络(2023-05-12 更新)","link":"https://lantian.pub/article/modify-website/how-to-kill-the-dn42-network.lantian/","summary":"","image":"/usr/uploads/202008/i-love-niantic-network.png","banner_image":"/usr/uploads/202008/i-love-niantic-network.png","content_html":"\n\nDN42 是一个测试网络,所有人都在帮助所有人。即使你不小心搞砸了,也没有人会指责你。你可以在 DN42 的 IRC 频道,邮件列表或者非官方 Telegram 群组寻求帮助。
\n
由于 DN42 是一个实验用网络,其中也有很多新手、小白参与,因此时不时会有新手配置出现错误,而对整个 DN42 网络造成影响,甚至炸掉整个网络。
\n现在,作为一名长者(x),我将教各位小白如何操作才能炸掉 DN42,以及如果你作为小白的邻居(指 Peer 关系),应该如何防止他炸到你。
\n\n\n注意:你不应该在 DN42 网络中实际执行这些操作,你应该更加注重对破坏的防御。
\n恶意破坏会导致你被踢出 DN42 网络。
\n
本文信息根据 Telegram 群及 IRC 中的真实惨案改编。
\n你刚刚加入 DN42,并且准备把你手上的几台服务器都连接进去。你通过邮件,IRC 或者 Telegram 找了几个人分别和你的几台服务器 Peer,但是你还没有配置好你的内部路由分发。
\n于是你准备配置 OSPF,并打开 Bird 的配置文件加了一个 protocol:
\nprotocol ospf {\n ipv4 {\n import all;\n export all;\n };\n area 0.0.0.0 {\n interface zt0 {\n type broadcast;\n # 略掉一些不重要的参数\n };\n };\n};\n
\n你心满意足地把配置文件复制到每台服务器上,然后 bird configure
,看到你的各台服务器都通过 OSPF 获取到了其它服务器的路由。
突然,你的 IRC / Telegram 弹出了一个提示框,你点开来一看:
\n<mc**> shit.... as424242**** is hijacking my prefixes, for example 172.23.*.*/27\n 草…… AS424242**** 在劫持我的地址前缀(即地址块),例如 172.23.*.*/27\n<he**> yup, I see some roa fails for them as well\n 对,我也看到 ROA 验证失败了\n
\n恭喜你,你成功劫持了 DN42 网络(的一部分)。
\n当你的服务器通过 BGP 协议和其他人 Peer 时,每一条路由都包含了路径信息,包括它从哪里来,经过了哪些节点到达你这里。例如 172.22.76.184/29
这条路由可能就带有 4242422547 -> 4242422601 -> 424242****
这条路径,其中 4242422547
是路由来源(就是我),而 4242422601
是你的邻居(此处以 Burble 举例)。
但是,你的内网在传递路由时使用的是 OSPF 协议,而 OSPF 在传递路由信息时不会保留 BGP 的路径,因为它并不认识这些东西。此时你的另一台服务器通过 OSPF 获取到了 172.22.76.184/29
这条路由,但是不包含任何路径信息,它在与邻居的 BGP 宣告中就会将这条路由使用你自己的 ASN 播出去,造成劫持效果。
画成图大概是这样的:
\n[2547] -> [2601] -> [你的 A 节点] -> [你的 B 节点] -> [你的 B 节点的邻居]\n 2547 2547 2547 没了! 你的 ASN(BOOM)\n 2601 2601\n 你的 ASN\n
\nTelegram 里的老哥说话很好听,一边帮助你修上面那个 Bug,一边向你推荐 Babel:
\n但是,群友不推荐你使用 Bird 自带的 Babel 协议支持,因为 Bird 的 Babel 不能根据延迟选路。
\n你心动了,删掉了 OSPF 的配置文件,并装了一个 Babeld。很快你的每台机器上都出现了其它节点通过 Babel 发来的路由。你等了几分钟,似乎没有爆炸。
\n但是你注意到,你的 Bird 没有把这些路由通过 BGP 发出去。老哥们怂恿你开启 Bird Kernel Protocol 的 Learn:
\nprotocol kernel sys_kernel_v4 {\n scan time 20;\n # 群友怂恿你添加这一行\n learn;\n # 不重要的略过\n};\n
\n你照做了。几分钟后,你被 IRC 和 Telegram 里的人疯狂艾特。是的,你又把其他人的网络劫持了。
\n这和上面 OSPF 一段其实是相同的问题,Babel 在传递路由时丢弃了 BGP 的路径信息。只不过默认情况下,Bird 会忽略其它路由软件写入内核路由表的路由信息,除非你开了 learn。
\nexport filter
写成这样:export filter {\n # 只允许向外发送来自 STATIC(手动配置)和 BGP 协议的路由\n if source ~ [RTS_STATIC, RTS_BGP] then accept;\n # 拒绝掉其它路由协议的路由\n reject;\n}\n
\nRoute Origin Authorization
),限制每条路由的来源 ASN。\n左右横跳是多种错误的总称,它们会造成 BGP 路由程序频繁切换获得的最优路径。由于最优路径会通过 Peering 传递给别的节点,这个切换过程会造成连锁反应,相连的多个节点都会因为一个节点的故障而一起切换,最终故障扩散到全网。
\n这一过程会造成大量的流量消耗,而由于 DN42 内多数人用的是便宜的 VPS 做节点,因此长期下来结果只有以下两种:
\n而且左右横跳错误可能会造成严重的影响:
\n例如,某 Telegram 群友从 Fullmesh + Direct 转向 Multihop 时出现事故,造成了非常大量的路由切换。
\n\n他在切换过程中没有断开 BGP,而 Babel 的配置错误导致大量路由被传递及撤销。
\n由于上述路由切换的连锁传递,并且该群友接了较多的 Peering,多个较大的 AS 被迫断开之间的连接,以(在该群友睡醒之前)控制住问题规模。
\n另外,该群友先前还有多次类似的路由切换事故,但这里地方太小了写不下。(滑稽)
\n<bur*> is someone awake who is on telegram ?\n 有用 Telegram 的人醒着吗?\n<bur*> Kio*, sun*, ie**, lantian perhaps ?\n 可能是 Kio*,sun*,ie**,Lan Tian?\n<Kio*> Kio* is here\n Kio* 在\n<fox*> I am in that dn42 telegram chat too but I do not understand moon runes\n 我也在 DN42 的 Telegram 群,但我不懂月相\n<fox*> also its midnight for china?\n 另外现在是中国的半夜?\n<bur*> yes, I'm going to be nuking a lot of peerings if they are all asleep\n 对,如果他们全在睡觉,我就要炸掉一大堆 Peering 了\n<bur*> I think its originating from NIA*, but a lovely multi mb/s flap going on for the past hour\n 我觉得问题来自 NIA*,一个小时前开始有一个好几 MB/s 的「可爱」的左右横跳\n<bur*> and its like whack-a-mole, if I disable one peering the traffic just pops up on a different one\n 而且像打地鼠,如果我关掉一个 Peering,它又会从另一个 Peering 上跳出来\n<fox*> petition for bur* network to stop accepting new peers to help save dn42 network health\n 建议 Bur* 的网络不要再接受新的 Peer 了,以保证「42 号去中心网络」的健康发展\n<Kio*> NIA* is awake now\n NIA* 现在醒了\n<bur*> NIA* certainly has ipv4 next hop problems, they are advertising routes with next hops in other networks\n NIA* 的 IPv4 Nexthop 肯定有问题,他们广播的路由的 Nexthop 都在其它网络\n<Kio*> He says he is adjusting his \"network from full-mesh to rr and multihops\"\n 他说他在「把网络从 Full-mesh 调整成 Route Reflector 和 Multihop」\n<bur*> well its not working ;)\n 唔姆,这没有正常工作 ;)\n<stv*> bur*: I also took down our peering\n bur*:我也把我们的 Peering 断了\n<bur*> stv*, too much traffic from the grc?\n stv*, 来自 GRC(全球路由收集节点)的流量太多了?\n<stv*> I added a new peer around 1hr ago. 
Just to check that this hasnt be the cause..\n 我一小时前接了一个新的 Peer,只是为了确认这不是原因……\n<stv*> bur*: no the grc is still up and running\n bur*:不,GRC 还在正常工作\n<bur*> ah, if you are getting a lot of route updates its cos of NIA*\n 啊,如果你收到很多路由更新,它们是来自 NIA* 的\n<bur*> grc is currently pumping about 4mb/s to downstram peers\n GRC 现在正在向下游发送 4 MB/s 的更新\n<sun*> bur*: what happen?\n bur*:发生了什么?\n<bur*> NIA* is having issues\n NIA* 出了问题\n<bur*> sun* anyway, you are up late!\n sun* 不管怎么说,你睡得好晚!\n<sun*> I just came back from the bar:)\n 我刚从酒吧回来 :)\n<do**> don't drink and root\n 酒后不要 root(指用管理员权限修改系统)\n<bur*> nice :)\n 不错 :)\n<sun*> l like drink ;)\n 我喜欢喝酒 ;)\n<bur*> ok, I'm bored of this now, if you are currently sending me more than 1mb/s of bgp traffic your peering is about to get disabled.\n 行吧,我现在累了,如果你正在向我发送超过 1MB/s 的 BGP 流量,那你的 Peering 会被我禁用。\n<bur*> Kio*, sun*, Tch*, jrb*, lantian, ie**, so far\n 目前是 Kio*,sun*,Tch*,jrb*,Lan Tian,ie** 几个\n<Kio*> barely notice any flapping here, is it v4 or v6 ?\n 几乎没观察到左右横跳,是 IPv4 还是 IPv6?\n<bur*> 4 mostly, I think. 
you got killed on us-nyc1\n 我觉得大部分是 IPv4,你和我美国纽约 1 号节点的 Peer 被关了\n<bur*> Nap*\n Nap*\n<Nap*> Shut mine down if you need, I can't look into with much detail until tonight\n 有必要的话就把我的 Peer 关了吧,我今晚之前都不能仔细检查\n<bau*> half of dn42 is about to loose connectivity due to bur* disableing peerings lol\n 哈哈,半个 DN42 会因为 Bur* 禁用 Peering 而断网\n<do**> oh yeah, this looks nice\n 哦耶,太棒了\n<Kio*> thats why everybody should be at least multi homed with two peers\n 因此所有人都应该至少接两个 Peer\n<jrb*> bur*: and on which peering?\n bur*:在哪个 Peering 上?\n<Kio*> you shouldnt loose connectivity if only one peer drops\n 如果只有一个 Peer 掉线,你不应该也掉线\n<bur*> jrb* us-nyc1 and us-lax1 for you so far\n jrb* 目前是美国纽约 1 号和美国洛杉矶 1 号\n<jrb*> mapping table says us-3 and us-5, let me check.\n 映射表显示是美国 3 号和 5 号,我检查一下。\n<Nap*> Do we know what routes are flapping causing the updates?\n 我们知道是谁的路由造成这些更新吗?\n<Kio*> filtering problematic ASN on my us node now\n 正在我的节点上过滤有问题的 ASN\n<bur*> Nap* its NIA*\n Nap*,是 NIA*\n<bur*> AS42424213**\n AS42424213**\n<jrb*> sun*, rou*: disabling my peerings with you for now, there seems to be serious flapping\n sun*,rou*:我现在禁用和你们的 Peering,看起来有严重的左右横跳\n<do**> him again?\n 又是他?\n<sun*> what?\n 啥?\n<sun*> is me problem?\n 我的问题吗?\n<bur*> sun*, I've killed all of our peerings\n sun*,我关掉了我们所有的 Peering\n<sun*> why?\n 为什么?\n<bur*> sun*, you are distributing the problems from NIA*\n sun*,你在传递 NIA* 造成的问题\n<Nap*> bur*: K, gonna try to filter on ATL/CHI at least.\n bur*:行,准备尝试至少在亚特兰大和芝加哥节点上做过滤。\n<bur*> thanks Nap*\n 谢了 Nap*\n<Kio*> recommend everybody to temporarily enable \"bgp_path ~\" filter for the problematic ASN\n 推荐所有人暂时打开「bgp_path ~」过滤掉有问题的 ASN\n<sun*> i disabled NIA*, would fix problem?\n 我禁用了 NIA*,会解决问题吗?\n<do**> bur*: I also peer with NIA* and I don't get any bgp updates from him\n bur*:我也和 NIA* Peer 了,但没收到他的任何 BGP 更新\n<do**> ah wait\n 啊等等\n<bur*> sun*, depends if you are also getting the updates from other peers too\n sun*,取决于你会不会也从其他 Peer 收到这些更新\n<do**> now I see it\n 现在我看到了\n<do**> disabling 
peering\n 正在禁用 Peering\n<sun*> if bgp_path ~ [= 42424213** =] then reject;\n (Bird Filter 命令)\n<bur*> ~ [= * 42424213** * =] to reject all paths\n 用「~ [= * 42424213** * =]」过滤掉所有包含他的路径\n<sun*> ohh\n 噢哦\n<jrb*> bur*: seems to be mostly rou* from my perspective\n bur*:从我这看主要是 rou*\n<Kio*> Should be filtered on my side, if anyone continues to receive those updates please notify\n 我这里应该过滤好了,如果任何人继续收到这些更新,请通知我\n<bur*> sun*, I tried re-enabling you on lax1 but you jumped striaght to 1mb/s+ again\n sun*,我尝试在洛杉矶 1 号节点重新启用我们的 Peering,但流量马上到了 1 MB/s 多\n<bur*> jrb*, re-enabled\n jrb*,重新启用了\n<sun*> i have disabled NIA*\n 我也禁用 NIA* 了\n<bur*> Kio*, re-enabled\n Kio*,重新启用了\n<do**> oh btw, I have notified NIA* about this issue\n 哦顺便提一句,我已经告知 NIA* 这个问题了\n<jrb*> do**: also tell him to notify everybody to get out of the blacklists.\n do**:另外告诉他(修好网络后)通知所有人解除黑名单。\n<do**> jrb*: will do\n jib*:好的\n<Nap*> bur*: I should have it filtered on my ATL (your CHI)\n bur*:我应该在我的亚特兰大节点上过滤了(对应你的芝加哥节点)\n<Kio*> wrote NIA* also directly on telegram\n 在 Telegram 上直接向 NIA* 发了消息\n<sun*> bur*: is it better now?\n bur*:现在好点了吗?\n<bur*> for the record, this is the first time that I've mass disabled peerings, but this was causing issues across the board\n 这是我有史以来第一次大规模禁用 Peering,但这次的确造成了很多问题\n<bur*> sun*, no not really\n sun*,不,没有\n<An**> I've stop importing route from NIA*\n 我已经停止从 NIA* 导入路由了\n<stv*> I am also dropping NIA* now\n 我现在也丢弃 NIA*(的路由)了\n<bur*> sun*, thats like 1k updates every few seconds\n sun*,每过几秒就会有一千条路由更新\n<Nap*> bur*: all host should have it filtered now.\n bur*:所有节点都应该过滤了。\n<bur*> Nap*, looks to me, thanks\n Nap*,看起来没问题,谢谢\n<sun*> bur*: seems to have reduced traffic\n bur*:看起来流量降低了\n<bur*> sun*, yes that looks better\n sun*,的确看起来好些了\n<bur*> sun*, is that now ok across all your nodes ?\n sun*,现在你的所有节点都正常吗?\n<sun*> yep\n 对\n<bur*> sun*, ok re-enabled\n sun*,好的,重新启用了\n<do**> alright, also filtered 42424213**\n 好的,也把 42424213** 过滤了\n<tm**> hi, also filtered 42424213**\n 大家好,我也把 42424213** 
过滤了\n<bur*> I guess they got the message, seems we're back to normal again and everyone I disabled is back again\n 我猜他们(指 NIA*)收到消息了,看起来我们再次回复正常了,所有我禁用的人都被重新启用了\n<do**> bur*: I think NIA* is asleep, probably everyone filtered it\n bur*:我觉得 NIA* 还在睡觉,也许所有人都过滤了\n<do**> or disabled peering\n 或者禁用了 Peering\n<bur*> do**, there is that, but I also renabled NIA* and am not getting the same errors now\n do**,有可能,但我也重新启用了 NIA*,现在没有看到先前的错误\n<do**> oh, interesting\n 哦,有趣\n<bur*> I might regret doing that by morning, but hey. I do try and keep everything open as best as possible.\n 到了早上我有可能会后悔(指 NIA* 的问题在 bur* 睡觉时再次出现),但我尝试尽量公开/开放所有东西。\n<do**> bur*: last time when NIA* did that I waited for their response\n bur*:上次 NIA* 搞出这种事情的时候,我等他们的回复(后才采取行动)\n<Kio*> Nope nia* just messaged in Telegram about it\n 不,NIA* 刚在 Telegram 上发了消息\n<do**> ah\n 啊\n<bur*> my peering hasn't re-established, so I guess they hit the big red shutdown button\n 我(和 NIA*)的 Peering 还没有重新建立,我猜他们按下了那个巨大的、红色的关闭按钮\n<Kio*> He tried to migrate his network to a full mesh\n 他尝试把网络迁移到 Full mesh\n<Kio*> and is now \"pulling all the wires\"\n 现在正在「全部拔线」\n<do**> Kio*: did you message him directly or was that on any of the groups?\n Kio*:你给他直接发了消息吗,还是在哪个群里?\n<Kio*> on the telegram group\n 在 Telegram 群里\n<do**> bur*: you didn't get that many bgp updates from me?\n bur*:你没有从我这里收到那么多 BGP 更新?\n<sun*> NIA* woke up :)\n NIA* 醒了 :)\n<bur*> do**, you went from an average of ~3kbs to ~10kbs+, peaking at 50kbs. In the grand scheme of things that was lost in the noise\n do**,你从平均 3 KB/s 到十几 KB/s,峰值 50 KB/s。在如此巨大的量级中这点小问题被淹没了\n<do**> interesting\n 有趣\n<do**> I also peer directly with NIA*\n 我也和 NIA* 直接 Peer 了\n<bur*> do**, yes, interesting. Is the link restricted in bandwidth ?\n do**,是的,有趣。(你和他的)链路有带宽限制吗?\n<do**> not at all\n 完全没有\n
\n因为今年是 2020 年,你准备给你的网络加一组 IPv6 地址。按照我的 DN42 注册教程,你很快就给自己注册了一个 IPv6 地址块,并且很快被合并进了 Registry。
\n在你看来,一切都很正常。但在地球的另一边,一个人的手机 / 电脑上弹出消息,告诉他他的 DN42 ROA 记录生成器出现了错误。他打开 Registry,扶额叹息,并 commit 了这样一个修改:
\n\nhttps://git.dn42.dev/dn42/registry/commit/9f45ee31cdea4a997d59a262c4a8ac8eb3cbd1f1
\n这位群友添加了 fd37:03b3:cae6:5158::/48
这样一个地址块。因为一个 IPv6 地址由 32 个 16 进制数构成(共 128 比特),而这个地址块显式定义了其中的前 16 个数(即 64 位),对应的子网掩码应该是 /64
或更高。
但是由于未知原因,这个错误没有被 DN42 Registry 的内容检查程序检查出来,当时也没有被操作合并的管理员发现,就成功进入了 Registry。
\n随后,ROA 记录生成器在解析 Registry 内容时遇到了这个格式错误的地址块,就直接报错退出了。
\nhttps://git.dn42.dev/dn42/registry/commit/00f90f592a35e325152ce28157f64d3fca7c8d7d
\n万幸的是这个问题对整个 DN42 网络影响不大,只是 ROA 更新延迟了几小时而已。
\n由于 DN42 从建立之初就在强调去中心特性,因此你可以写一个自己的 ROA 生成器作为备份。
\n\n\n虽然这次我的 ROA 生成器也挂掉了……
\n
原因是不同人写的程序即使功能相同,也会在实现上有细微的差别。这样在遇到这样一个输入内容的 Bug 时,就有可能有人的程序仍能保持正常运行。
\n我有一个朋友…… 行吧就是我自己。
\n因为我同时接了 DN42 和 NeoNetwork,还有一段自己的内网,所以为了防止把内网路由发到 DN42 和 NeoNetwork,我采取了以下方法:
\n配置完后一切看起来都很正常,直到几天后群友发现我的 Telegram Bot(就是我的 Looking Glass)Ping 不通任何 DN42 内的 IP。
\n刚开始一切都很正常,我的网段 172.22.76.184/29
被正常广播。直到某次 Direct 协议刷新了一次,从系统的某个网络界面获取到了 172.22.76.184/29
这个网段,并再次将它传进了路由表。
这条新的路由信息就把原先的路由覆盖了,同时因为这条路由来自 Direct 协议,被打上了 Community,就不再被广播了。并且 Static 如其名是「静态」协议,其内容不会改变,自然也不会产生新的路由再覆盖回去。
\n此时我相当于停止宣告了我的 IP 段,自然就无法收到回程数据包了。
\n在 Bird 中,尽量避免多个路由协议产生相同的路由条目,相互覆盖可能会造成不可预料的后果。
\n我最终选择添加 Filter 将 Direct 协议限制在我的内网网段,避免它再次覆盖我的 DN42 网段。
\n一名新玩家注册了一个 ASN:
\n\n这是 DN42 发生的变化:
\nTelegram 群:
\n\n蒂 花 之 秀:
\n\nIRC:
\n<lantian> Someone successfully registered in DN42 with ASN 424242236 (9 digits)\n 有人成功在 DN42 上注册了 ASN 424242236(9 位数)\n<lantian> Is this expected?\n 这是正常的吗?\n <xu**> doh\n 噢\n <xu**> shouldt have happened\n 不应该发生\n <xu**> probably forgot the extra 2\n 或许忘了个 2\n <xu**> 424242 2236\n 424242 2236\n <Kai*> too late tho. it already has one peer with tech9\n 太晚了,已经和 Tech9 Peer 上了\n <dne*> filtering fail!\n 过滤器挂了!\n <xu**> pomoke?\n (用户名)\n<lantian> yep, doesn't seem to be on irc though\n 对,但看起来不在 IRC 上\n<lantian> nor on telegram\n 也不在 Telegram 上\n <0x7*> so how a 9-digit ASN passed the schema checker...?\n 所以 9 位数 ASN 怎么过的检查程序……?\n<lantian> I don't think schema checker checks ASN, or it will block out clearnet ASNs\n 我不觉得检查程序会检查 ASN,否则会阻挡掉公网 ASN\n<lantian> But maybe we need a warning?\n 但也许需要加个警告?\n <xu**> probably a bug in the policy checker\n 也许是检查程序的一个 Bug\n <xu**> i wish we had gone with a prefix that had a visual space\n 我希望我们的 ASN 前缀有个看起来明显的分隔\n <xu**> like AS424200xxxx\n 例如 AS424200xxxx\n<lantian> Well pomoke tried to peer with me via email (but ended in spam folder)\n 总之 Pomoke 尝试发邮件找我 Peer(但进了垃圾箱)\n<lantian> I'm going to tell him/her to correct the ASN\n 我准备告诉他/她改正自己的 ASN\n <Kai*> 9 is a good number tho\n 不管怎么说 9 是个好数字\n <Kai*> once in a blue moon that bur* made mistake\n bur* 犯错,蓝月将至(英语成语,即千载难逢)\n <sun*> westerners love digital 9\n 西方人喜欢数字 9\n <bur*> crap\n 草\n <bur*> lantian, are you in contact with pomoke? if they can submit a fix quickly\n then I'll merge it. Otherwise I'll need to pull the commit\n Lan Tian,你能联系上 Pomoke 吗?如果他们可以迅速提交修正信息我就马上把它合并了。\n 否则我就得撤销变更了\n<lantian> bur*: I sent him/her an email, not sure about response time\n bur*,我给他/她发了封邮件,不知什么时候会回\n <bur*> umm, I'm going to have to pull it then\n 唔姆,那我就不得不撤销了\n
\n裁决之镰:
\n\n看戏就完事了,这种事情太少见了:
\n<Kai*> once in a blue moon that bur* made mistake\n bur* 犯错,蓝月将至\n
\n当然看戏归看戏,还是要上 IRC 说一句出问题了。
\n在和别人 Peer 的时候,多检查一遍对方的信息。
\n以及没事可以翻翻 DN42 New ASN 这个自动推送新 ASN 的 Telegram 频道。
\n在 DN42 Telegram 群帮别人调试网络时,我突然发现我的两个节点之间出现了环路:
\ntraceroute to fd28:cb8f:4c92:1::1 (fd28:cb8f:4c92:1::1), 30 hops max, 80 byte packets\n 1 us-new-york-city.virmach-ny1g.lantian.dn42 (fdbc:f9dc:67ad:8::1) 88.023 ms\n 2 lu-bissen.buyvm.lantian.dn42 (fdbc:f9dc:67ad:2::1) 94.401 ms\n 3 us-new-york-city.virmach-ny1g.lantian.dn42 (fdbc:f9dc:67ad:8::1) 167.664 ms\n 4 lu-bissen.buyvm.lantian.dn42 (fdbc:f9dc:67ad:2::1) 174.235 ms\n 5 us-new-york-city.virmach-ny1g.lantian.dn42 (fdbc:f9dc:67ad:8::1) 247.213 ms\n 6 lu-bissen.buyvm.lantian.dn42 (fdbc:f9dc:67ad:2::1) 253.499 ms\n 7 us-new-york-city.virmach-ny1g.lantian.dn42 (fdbc:f9dc:67ad:8::1) 326.690 ms\n 8 lu-bissen.buyvm.lantian.dn42 (fdbc:f9dc:67ad:2::1) 333.412 ms\n 9 us-new-york-city.virmach-ny1g.lantian.dn42 (fdbc:f9dc:67ad:8::1) 406.978 ms\n10 lu-bissen.buyvm.lantian.dn42 (fdbc:f9dc:67ad:2::1) 413.537 ms\n11 us-new-york-city.virmach-ny1g.lantian.dn42 (fdbc:f9dc:67ad:8::1) 486.762 ms\n12 lu-bissen.buyvm.lantian.dn42 (fdbc:f9dc:67ad:2::1) 493.147 ms\n\n18 hops not responding.\n
\n我登录上这两个节点一看,VirMach 节点的确优先选择了 BuyVM 发来的路由,而 BuyVM 也选择了 VirMach 的路由。
\nBGP 不应该是防环路的吗?为什么这两个节点会互相选择对方的路由?
\n这个问题总共涉及到三个 AS 的四个节点:
\n\n\n\n\n\n
\n其中 KSKB 是 fd28:cb8f:4c92::/48
这条路由的源头,他将路由广播给了 Lutoma,以及我的 VirMach 节点。Lutoma 随后将这条路由广播给了我的 BuyVM 节点。
我的所有节点都开启了 add paths yes;
选项,也就是说节点间会互相交换所有收到的路由,而不只是节点选出来、写入内核路由表的最佳路由。因此,对于我的 BuyVM 节点来说,到路由的源头有两条路线:
\n\n\n\n\n
\n对于 VirMach 节点也是一样的:
\n\n\n\n\n\n
\n一般来说,VirMach 节点肯定选择直连 KSKB 的路由,而不是经过 BuyVM 和 Lutoma,总共两跳(iBGP 同一 AS 内不计跳数)的路由。此时 BuyVM 节点下一跳无论选择 Lutoma 还是 VirMach 节点,都可以获得一条可达的路由,而不是出现环路。
\n问题是,我用 BIRD 的 Filter 手动调整了路由优先级。DN42 有一组标准的 BGP Community,用于标记每条路由来源的地区。为了降低网络延迟,我使用下面的算法(简化后)来调整路由优先级:
\n优先级 = 200 - 10 * 路由跳数\n如果当前节点和路由来源在同一地区:\n 优先级 += 100\n
\n问题发生时,KSKB 的原始路由并没有添加来源地区的 Community。但是 Lutoma 的网络配置错误,给来自 KSKB 的路由也加上了来源地区 Community,地区和我的 VirMach 节点相同。(根据 DN42 的标准,各个网络只应该给自己的路由添加来源地区 Community,不能给别人的路由添加。)
\n此时我的 BuyVM 节点算出了以下的路由优先级,并选择了经过我的 VirMach 节点的路由:
\n\n\n\n\n\n
\n而我的 VirMach 节点反而选择了经过 BuyVM 节点的路由:
\n\n\n\n\n\n
\n这样,环路就形成了。
\n这个问题出现时,以下三个因素缺一不可:
\nadd paths yes;
选项,导致备选路由被同时发给其它节点。如果不开启此选项,BuyVM 节点在选择 VirMach 作为下一跳时,就不会把经过 Lutoma 的路由也发给 VirMach 节点了,此时 VirMach 节点只有直连 KSKB 的一条路由可走。add paths yes;
选项开启,就需要在设计 iBGP 用的优先级算法时,保证在任何情况下,来自同一节点的路由之间优先级顺序都不变,从而保证总能选到这一节点的首选路由,而非备选路由。我解决问题的方法是,不再在 iBGP 内部重新计算路由优先级,而是统一使用由收到路由的节点计算的、由 iBGP 传递来的优先级,来保证首选、备选路由的优先级顺序不变。
","date_published":"2023-05-12T14:03:33.000Z","date_modified":"2024-03-18T07:22:20.524Z","author":{"name":"Lan Tian","url":"https://lantian.pub"},"tags":"网站与服务端"},{"id":"https://lantian.pub/en/article/modify-website/how-to-kill-the-dn42-network.lantian/","url":"https://lantian.pub/en/article/modify-website/how-to-kill-the-dn42-network.lantian/","title":"How to Kill the DN42 Network (Updated 2023-05-12)","link":"https://lantian.pub/en/article/modify-website/how-to-kill-the-dn42-network.lantian/","summary":"","image":"/usr/uploads/202008/i-love-niantic-network.png","banner_image":"/usr/uploads/202008/i-love-niantic-network.png","content_html":"\n\nDN42 is an experimental network, where everyone helps everyone. Nobody is going to blame you if you screwed up. You may seek help at DN42's IRC channel, mailing list or the unofficial Telegram group.
\n
Since DN42 is a network for experimentation, a lot of relatively inexperienced users also participate in it. Therefore, occasionally an inexperienced user may misconfigure his/her system and impact the whole DN42 network or even shut it down.
\nAs a more experienced user, I will show new users some of the operations that can kill the network, along with the defenses that everyone can set up against misconfigured peers.
\n\n\nWARNING: You should not actually perform these operations in DN42. You should focus more on protecting yourself against them.
\nMalicious actions will get you kicked out of DN42.
\n
The stories are based on real disasters in the Telegram group and IRC channel.
\nYou just joined DN42 and plan to connect all of your servers. You've already peered with a few others on several of your nodes, but you haven't finished setting up your internal routing yet.
\nSo you plan to configure OSPF. You opened Bird's configuration file and added a protocol:
\nprotocol ospf {\n ipv4 {\n import all;\n export all;\n };\n area 0.0.0.0 {\n interface zt0 {\n type broadcast;\n # Unimportant stuff redacted\n };\n };\n};\n
\nSatisfied, you copied the config file to every server and ran birdc configure
. You checked and confirmed that every server obtained routes from each other via OSPF.
Suddenly a message box pops up on your IRC client / Telegram. You click on it:
\n<mc**> shit.... as424242**** is hijacking my prefixes, for example 172.23.*.*/27\n<he**> yup, I see some roa fails for them as well\n
\nCongratulations! You've successfully hijacked (part of) DN42.
\nWhen your server peers with others via BGP, each route carries path information, including its origin and the list of ASes it passed through. For example, the route 172.22.76.184/29
may have the path information of 4242422547 -> 4242422601 -> 424242****
, where 4242422547
is the origin (me by the way), and 4242422601
is your neighbor (Burble here, as an example).
But since your internal networking uses OSPF, which has no idea what BGP paths are, it doesn't preserve them while passing routes around. Now another node of yours obtained 172.22.76.184/29
via OSPF, yet without any path information. It will then proceed to announce the route with your own ASN to your peers, causing a hijack.
Here is a graph of what's going on:
\n[2547] -> [2601] -> [Your Node A] -> [Your Node B] -> [Peer of Node B]\n 2547 2547 2547 Gone! Your ASN (BOOM)\n 2601 2601\n Your ASN\n
\nThe people in the Telegram group are really nice. While helping you fix the problem, they also recommend Babel to you:
\nBut they don't recommend Bird's built-in Babel support since it doesn't support selecting paths by latency.
\nPersuaded, you removed the OSPF configuration and installed Babeld. Soon each of your nodes was receiving Babel routes. You waited for a few minutes. No sign of catastrophe yet.
\nBut you notice that Bird isn't announcing these routes via BGP. The Telegram folks urge you to enable the learn
option of Bird's kernel protocol:
protocol kernel sys_kernel_v4 {\n scan time 20;\n # You're gonna add this line!\n learn;\n # Unimportant stuff redacted\n};\n
\nYou do this. A few minutes later, you are called out again by people on IRC and Telegram. Yes, you hijacked other people's networks. Again.
\nIt is actually the same problem as the OSPF one, since Babel also drops all BGP path information while passing routes around. By default, however, Bird ignores routes installed into the kernel routing table by other routing software, unless you enable learn
.
export filter
to this in Bird:export filter {\n # Only allow announcing STATIC (manually configured) and BGP routes\n if source ~ [RTS_STATIC, RTS_BGP] then accept;\n # Reject routes from other protocols\n reject;\n}\n
\nRoute flapping covers a whole range of errors with one common symptom: the BGP routing software frequently switches (or flaps) the best route it has chosen. Since the best route gets announced to other nodes via peering, the flapping sets off a chain reaction in which multiple connected nodes flap together because of one node's mistake. Eventually, the problem spreads across the whole network.
\nThis process consumes a significant amount of bandwidth or traffic. Since many people in DN42 use cheap VPSes for nodes, there are only two possible outcomes eventually:
\nIn addition, route flapping may cause severe impacts:
\nFor example, one user in the Telegram group had a misconfiguration while transitioning from Full-mesh + Direct connections to Multihop.
\n\nHe didn't disconnect BGP in the process, and the Babel configuration error caused a large number of routes to be announced and withdrawn.
\nBecause of the chain reaction and the number of peerings this user had set up, multiple large ASes had to disconnect from each other to contain the problem (before he woke up).
\n\n\nBy the way, this guy had a number of similar accidents before at a smaller scale, which this margin is too narrow to contain.
\n
<bur*> is someone awake who is on telegram ?\n<bur*> Kio*, sun*, ie**, lantian perhaps ?\n<Kio*> Kio* is here\n<fox*> I am in that dn42 telegram chat too but I do not understand moon runes\n<fox*> also its midnight for china?\n<bur*> yes, I'm going to be nuking a lot of peerings if they are all asleep\n<bur*> I think its originating from NIA*, but a lovely multi mb/s flap going on for the past hour\n<bur*> and its like whack-a-mole, if I disable one peering the traffic just pops up on a different one\n<fox*> petition for bur* network to stop accepting new peers to help save dn42 network health\n<Kio*> NIA* is awake now\n<bur*> NIA* certainly has ipv4 next hop problems, they are advertising routes with next hops in other networks\n<Kio*> He says he is adjusting his \"network from full-mesh to rr and multihops\"\n<bur*> well its not working ;)\n<stv*> bur*: I also took down our peering\n<bur*> stv*, too much traffic from the grc?\n<stv*> I added a new peer around 1hr ago. Just to check that this hasnt be the cause..\n<stv*> bur*: no the grc is still up and running\n<bur*> ah, if you are getting a lot of route updates its cos of NIA*\n<bur*> grc is currently pumping about 4mb/s to downstram peers\n<sun*> bur*: what happen?\n<bur*> NIA* is having issues\n<bur*> sun* anyway, you are up late!\n<sun*> I just came back from the bar:)\n<do**> don't drink and root\n<bur*> nice :)\n<sun*> l like drink ;)\n<bur*> ok, I'm bored of this now, if you are currently sending me more than 1mb/s of bgp traffic your peering is about to get disabled.\n<bur*> Kio*, sun*, Tch*, jrb*, lantian, ie**, so far\n<Kio*> barely notice any flapping here, is it v4 or v6 ?\n<bur*> 4 mostly, I think. 
you got killed on us-nyc1\n<bur*> Nap*\n<Nap*> Shut mine down if you need, I can't look into with much detail until tonight\n<bau*> half of dn42 is about to loose connectivity due to bur* disableing peerings lol\n<do**> oh yeah, this looks nice\n<Kio*> thats why everybody should be at least multi homed with two peers\n<jrb*> bur*: and on which peering?\n<Kio*> you shouldnt loose connectivity if only one peer drops\n<bur*> jrb* us-nyc1 and us-lax1 for you so far\n<jrb*> mapping table says us-3 and us-5, let me check.\n<Nap*> Do we know what routes are flapping causing the updates?\n<Kio*> filtering problematic ASN on my us node now\n<bur*> Nap* its NIA*\n<bur*> AS42424213**\n<jrb*> sun*, rou*: disabling my peerings with you for now, there seems to be serious flapping\n<do**> him again?\n<sun*> what?\n<sun*> is me problem?\n<bur*> sun*, I've killed all of our peerings\n<sun*> why?\n<bur*> sun*, you are distributing the problems from NIA*\n<Nap*> bur*: K, gonna try to filter on ATL/CHI at least.\n<bur*> thanks Nap*\n<Kio*> recommend everybody to temporarily enable \"bgp_path ~\" filter for the problematic ASN\n<sun*> i disabled NIA*, would fix problem?\n<do**> bur*: I also peer with NIA* and I don't get any bgp updates from him\n<do**> ah wait\n<bur*> sun*, depends if you are also getting the updates from other peers too\n<do**> now I see it\n<do**> disabling peering\n<sun*> if bgp_path ~ [= 42424213** =] then reject;\n<bur*> ~ [= * 42424213** * =] to reject all paths\n<sun*> ohh\n<jrb*> bur*: seems to be mostly rou* from my perspective\n<Kio*> Should be filtered on my side, if anyone continues to receive those updates please notify\n<bur*> sun*, I tried re-enabling you on lax1 but you jumped striaght to 1mb/s+ again\n<bur*> jrb*, re-enabled\n<sun*> i have disabled NIA*\n<bur*> Kio*, re-enabled\n<do**> oh btw, I have notified NIA* about this issue\n<jrb*> do**: also tell him to notify everybody to get out of the blacklists.\n<do**> jrb*: will do\n<Nap*> bur*: I should 
have it filtered on my ATL (your CHI)\n<Kio*> wrote NIA* also directly on telegram\n<sun*> bur*: is it better now?\n<bur*> for the record, this is the first time that I've mass disabled peerings, but this was causing issues across the board\n<bur*> sun*, no not really\n<An**> I've stop importing route from NIA*\n<stv*> I am also dropping NIA* now\n<bur*> sun*, thats like 1k updates every few seconds\n<Nap*> bur*: all host should have it filtered now.\n<bur*> Nap*, looks to me, thanks\n<sun*> bur*: seems to have reduced traffic\n<bur*> sun*, yes that looks better\n<bur*> sun*, is that now ok across all your nodes ?\n<sun*> yep\n<bur*> sun*, ok re-enabled\n<do**> alright, also filtered 42424213**\n<tm**> hi, also filtered 42424213**\n<bur*> I guess they got the message, seems we're back to normal again and everyone I disabled is back again\n<do**> bur*: I think NIA* is asleep, probably everyone filtered it\n<do**> or disabled peering\n<bur*> do**, there is that, but I also renabled NIA* and am not getting the same errors now\n<do**> oh, interesting\n<bur*> I might regret doing that by morning, but hey. I do try and keep everything open as best as possible.\n<do**> bur*: last time when NIA* did that I waited for their response\n<Kio*> Nope nia* just messaged in Telegram about it\n<do**> ah\n<bur*> my peering hasn't re-established, so I guess they hit the big red shutdown button\n<Kio*> He tried to migrate his network to a full mesh\n<Kio*> and is now \"pulling all the wires\"\n<do**> Kio*: did you message him directly or was that on any of the groups?\n<Kio*> on the telegram group\n<do**> bur*: you didn't get that many bgp updates from me?\n<sun*> NIA* woke up :)\n<bur*> do**, you went from an average of ~3kbs to ~10kbs+, peaking at 50kbs. In the grand scheme of things that was lost in the noise\n<do**> interesting\n<do**> I also peer directly with NIA*\n<bur*> do**, yes, interesting. Is the link restricted in bandwidth ?\n<do**> not at all\n
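\nThe path filter that sun* and bur* exchange in the log can be written out as a complete BIRD filter. This is only a minimal sketch: the filter name drop_flapping_asn is made up for illustration, and 4242429999 is a placeholder rather than the ASN redacted in the log. It would be referenced from the import side of each BGP session.
\n# Hypothetical placeholder ASN; replace it with the AS that is actually flapping\nfilter drop_flapping_asn {\n    # [= * ASN * =] matches the ASN at any position in the AS path,\n    # whereas [= ASN =] only matches a path consisting of that ASN alone\n    if bgp_path ~ [= * 4242429999 * =] then reject;\n    accept;\n}\n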
\nSince it's the year 2020, you plan to add an IPv6 block to your network. With my DN42 registration guide, you registered yourself an IPv6 block, which quickly got merged into the registry.
\nFrom your perspective, everything is normal. Yet on the other side of the planet, a message pops up on one person's phone/computer that his DN42 ROA generator is malfunctioning. He opens the registry page, facepalms, and commits this change:
\n\nhttps://git.dn42.dev/dn42/registry/commit/9f45ee31cdea4a997d59a262c4a8ac8eb3cbd1f1
\nThis user added an IPv6 block, fd37:03b3:cae6:5158::/48
. Since an IPv6 address consists of 32 hexadecimal digits (128 bits in total), and this block explicitly sets the first 16 digits (64 bits), the prefix length should have been /64
 or longer.
But for some reason, this error wasn't detected by DN42 Registry's schema checker, nor by the admin who inspected and merged the change, so it successfully ended up in the registry.
\nLater, the ROA generator found the erroneous IP block while parsing the registry and crashed.
\nhttps://git.dn42.dev/dn42/registry/commit/00f90f592a35e325152ce28157f64d3fca7c8d7d
\nFortunately, except that the ROA update was delayed by a few hours, this error didn't impact the network itself much.
\nSince DN42 has emphasized decentralization from the very beginning, you can write your own ROA generator as a backup.
\n\n\nAlthough my ROA generator also failed this time...
\n
The reasoning is that different implementations behave slightly differently even when they do the same job. When malformed input like this shows up, some of them may keep working.
\nThe story starts with my friend Joe... Fine. The story starts with me.
\nMy network is connected to both DN42 and NeoNetwork, and I also run an internal network on a private IP range. To avoid announcing my internal routes to DN42 and NeoNetwork, I did this:
\nEverything looked normal at first. A few days later, however, some users on Telegram found that my looking glass bot timed out on every IP in DN42.
\nAt first, everything was indeed normal, and my IP block 172.22.76.184/29
 was announced correctly. Then the Direct protocol performed a refresh, obtained 172.22.76.184/29
 from one of the network interfaces, and sent the route to Bird's routing table again.
The new route overwrote the previous one, and since it came from the Direct protocol, it was labeled with the community and was no longer announced. The Static protocol, on the other hand, is indeed \"static\": its routes never change, so it never produced a new route to overwrite the Direct one.
\nAt this point, I had effectively stopped announcing my IP range, so naturally no return packets could reach my nodes.
\nIn Bird, you should avoid getting the same route entry from multiple routing protocols, as they overwrite each other and may cause unexpected behavior.
\nIn the end, I chose to limit the Direct protocol to my internal IP range with a filter, so it won't overwrite my DN42 ranges again.
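\nAs a rough sketch of such a filter, assuming BIRD 2 syntax and using 10.0.0.0/8 purely as a stand-in for the internal range (the real range is not given here), the Direct protocol could be restricted like this:
\nprotocol direct {\n    ipv4 {\n        # Only accept interface routes inside the (hypothetical) internal range\n        import filter {\n            if net ~ [ 10.0.0.0/8+ ] then accept;\n            reject;\n        };\n    };\n}\n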
\nA new user registered an ASN:
\n\nThis is what happened to DN42:
\nTelegram Group: (Translation available below the image)
\n\nTranslation:
\n<lantian> Why someone with an ASN of 424242236 came to peer with me\n<lantian> Yep, 9 digits\n<lantian> /whois@lantian_lg_bot 424242236\n <lg> (outputs WHOIS information of the AS)\n<lantian> And it has proper WHOIS information\n <KaiKai> https://net-info.nia.ac.cn/#424242236\n <KaiKai> Really, it exists\n <Pastel> Burble didn't spot the error?\n <Pastel> Like the /64, which crashed the ROA generator\n
\nIRC:
\n<lantian> Someone successfully registered in DN42 with ASN 424242236 (9 digits)\n<lantian> Is this expected?\n <xu**> doh\n <xu**> shouldt have happened\n <xu**> probably forgot the extra 2\n <xu**> 424242 2236\n <Kai*> too late tho. it already has one peer with tech9\n <dne*> filtering fail!\n <xu**> pomoke?\n<lantian> yep, doesn't seem to be on irc though\n<lantian> nor on telegram\n <0x7*> so how a 9-digit ASN passed the schema checker...?\n<lantian> I don't think schema checker checks ASN, or it will block out clearnet ASNs\n<lantian> But maybe we need a warning?\n <xu**> probably a bug in the policy checker\n <xu**> i wish we had gone with a prefix that had a visual space\n <xu**> like AS424200xxxx\n<lantian> Well pomoke tried to peer with me via email (but ended in spam folder)\n<lantian> I'm going to tell him/her to correct the ASN\n <Kai*> 9 is a good number tho\n <Kai*> once in a blue moon that bur* made mistake\n <sun*> westerners love digital 9\n <bur*> crap\n <bur*> lantian, are you in contact with pomoke? if they can submit a fix quickly\n then I'll merge it. Otherwise I'll need to pull the commit\n<lantian> bur*: I sent him/her an email, not sure about response time\n <bur*> umm, I'm going to have to pull it then\n
\nJustice Has Arrived:
\n\nJust have fun, as this is so rare:
\n<Kai*> once in a blue moon that bur* made mistake\n
\nBut while having fun, remember to point out the problem on IRC.
\nDouble-check your peer's information when peering.
\nCheck DN42 New ASN, a Telegram channel that notifies of new DN42 ASNs, in your free time.
\nWhile helping others debug their networks in the DN42 Telegram group, I suddenly noticed a routing loop between two of my nodes:
\ntraceroute to fd28:cb8f:4c92:1::1 (fd28:cb8f:4c92:1::1), 30 hops max, 80 byte packets\n 1 us-new-york-city.virmach-ny1g.lantian.dn42 (fdbc:f9dc:67ad:8::1) 88.023 ms\n 2 lu-bissen.buyvm.lantian.dn42 (fdbc:f9dc:67ad:2::1) 94.401 ms\n 3 us-new-york-city.virmach-ny1g.lantian.dn42 (fdbc:f9dc:67ad:8::1) 167.664 ms\n 4 lu-bissen.buyvm.lantian.dn42 (fdbc:f9dc:67ad:2::1) 174.235 ms\n 5 us-new-york-city.virmach-ny1g.lantian.dn42 (fdbc:f9dc:67ad:8::1) 247.213 ms\n 6 lu-bissen.buyvm.lantian.dn42 (fdbc:f9dc:67ad:2::1) 253.499 ms\n 7 us-new-york-city.virmach-ny1g.lantian.dn42 (fdbc:f9dc:67ad:8::1) 326.690 ms\n 8 lu-bissen.buyvm.lantian.dn42 (fdbc:f9dc:67ad:2::1) 333.412 ms\n 9 us-new-york-city.virmach-ny1g.lantian.dn42 (fdbc:f9dc:67ad:8::1) 406.978 ms\n10 lu-bissen.buyvm.lantian.dn42 (fdbc:f9dc:67ad:2::1) 413.537 ms\n11 us-new-york-city.virmach-ny1g.lantian.dn42 (fdbc:f9dc:67ad:8::1) 486.762 ms\n12 lu-bissen.buyvm.lantian.dn42 (fdbc:f9dc:67ad:2::1) 493.147 ms\n\n18 hops not responding.\n
\nI logged onto these two nodes, and indeed, the VirMach node did choose BuyVM's route as the preferred path, and the BuyVM node did the same for VirMach's route.
\nIsn't BGP supposed to prevent loops? Why are these two nodes choosing the route from each other?
\nThe problem involves 4 nodes from 3 ASes:
\n\n\n\n\n\n
\nKSKB is the source for the route fd28:cb8f:4c92::/48
. He announced the route to Lutoma as well as to my VirMach node. Lutoma then announced the route to my BuyVM node.
All my nodes have the add paths yes;
option turned on, which means my nodes exchange all received routes with each other, rather than only the preferred ones written into the kernel routing table. Therefore, as far as the BuyVM node is concerned, it can choose from two paths to the source:
\n\n\n\n\n
\nThe same applies for my VirMach node:
\n\n\n\n\n\n
\nGenerally speaking, the VirMach node should prefer its direct route to KSKB over the path through my BuyVM node and Lutoma's node, which takes 2 hops in total (hops aren't counted for iBGP within the same AS). Then, whichever next hop the BuyVM node prefers, Lutoma's node or my VirMach node, it will have a reachable path rather than a routing loop.
\nThe problem is that I manually adjusted route preferences with a BIRD filter. DN42 has a standard set of BGP communities to mark the source region of each route. To reduce network latency, I used the following algorithm (simplified) to adjust my route preferences:
\nPreference = 200 - 10 * (Hop count)\nIf the current node is in the same region as the route source:\n Preference += 100\n
\nWhen the problem happened, the original route from KSKB didn't have a source region community set. However, Lutoma's network was misconfigured: it added a source region community to KSKB's route as well, with the same region as my VirMach node. (According to the DN42 standard, networks should only add source region communities to their own routes, not to routes received from other networks.)
\nNow my BuyVM node calculated the following route preferences, and chose the route through my VirMach node:
\n\n\n\n\n\n
\nYet my VirMach node chose the route through BuyVM:
\n\n\n\n\n\n
\nAnd now we have a routing loop.
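The loop can be reproduced with a small sketch of the simplified algorithm. The hop counts (taken here as AS path lengths) and the region labels are my assumptions for illustration, not values read from the real routers or from the DN42 community list.

```python
def preference(hop_count, route_region, node_region):
    '''Simplified preference from the pseudocode above.'''
    value = 200 - 10 * hop_count
    # The bonus applies only if the route carries a region community
    # matching the region of the node evaluating it.
    if route_region is not None and route_region == node_region:
        value += 100
    return value


# Hypothetical region labels; the real DN42 community values differ.
VIRMACH_REGION = 'region-a'
BUYVM_REGION = 'region-b'

# Candidate routes per node: (next hop, hop count, region community seen).
# KSKB's own announcement carries no region community; the copy relayed by
# Lutoma was wrongly tagged with the VirMach node's region.
buyvm_routes = [
    ('Lutoma', 2, VIRMACH_REGION),
    ('VirMach', 1, None),
]
virmach_routes = [
    ('KSKB', 1, None),
    ('BuyVM', 2, VIRMACH_REGION),
]


def best_next_hop(routes, node_region):
    '''Pick the next hop of the route with the highest preference.'''
    best = max(routes, key=lambda r: preference(r[1], r[2], node_region))
    return best[0]


buyvm_choice = best_next_hop(buyvm_routes, BUYVM_REGION)
virmach_choice = best_next_hop(virmach_routes, VIRMACH_REGION)
```

Under these assumptions the BuyVM node scores 180 for Lutoma and 190 for VirMach, while the VirMach node scores 190 for KSKB and 280 for BuyVM, so each node points at the other.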
\nFor this problem to appear, all three requirements must be met:
\nadd paths yes;
option is turned on, so that secondary routes are sent to other nodes as well. If this option were off, then as soon as the BuyVM node chose the VirMach node as the next hop, it would no longer announce its route through Lutoma to the VirMach node. The VirMach node would then have no option but to send traffic directly to KSKB.add paths yes;
option on, while designing the iBGP route preference algorithm, we need to guarantee that routes from the same node keep their priorities in the same order as on that node, so that primary routes are always used over secondary routes. My solution to the problem is to no longer recalculate the priority of routes received over iBGP. Instead, I always use the priority value calculated by the edge node that received the route, which is passed along with the route announcement over iBGP. This guarantees that the order of primary and secondary routes never changes.
","date_published":"2023-05-12T14:03:33.000Z","date_modified":"2024-03-18T07:22:20.541Z","author":{"name":"Lan Tian","url":"https://lantian.pub"},"tags":"Website and Servers"},{"id":"https://lantian.pub/article/modify-computer/laptop-muxed-nvidia-passthrough.lantian/","url":"https://lantian.pub/article/modify-computer/laptop-muxed-nvidia-passthrough.lantian/","title":"Optimus MUXed 笔记本上的 NVIDIA 虚拟机显卡直通(2023-05 更新)","link":"https://lantian.pub/article/modify-computer/laptop-muxed-nvidia-passthrough.lantian/","summary":"","content_html":"一年前,为了能够一边用 Arch Linux 浏览网页、写代码,一边用 Windows 运行游戏等没法在 Linux 上方便地完成的任务,我试着在我的联想 R720 游戏本上进行了显卡直通。但是由于那台电脑是 Optimus MUXless 架构(前文有各种架构的介绍),也就是独显没有输出端口、全靠核显显示画面,那套配置的应用受到了很大的阻碍,最后被我放弃。
\n但是现在,我换了台新电脑。这台电脑的 HDMI 输出接口是直连 NVIDIA 独立显卡的,也就是 Optimus MUXed 架构。在这种架构下,有办法让虚拟机识别到一个「独显上的显示器」,从而正常启用大部分功能。于是,我终于可以配置出一套可以长期使用的显卡直通配置。
\n在按照本文进行操作前,你需要准备好:
\n一台 Optimus MUXed 架构的笔记本电脑。我的电脑型号是 HP OMEN 17t-ck000(i7-11800H,RTX 3070)。
\n用 Libvirt(Virt-Manager)配置好一台 Windows 10 或 Windows 11 的虚拟机,我用的是 Windows 11。
\n(可选)根据电脑视频输出接口的不同,一个 HDMI,DP,或 USB Type-C 接口的假显示器(诱骗接头),淘宝上一般几块到十几块钱一个。
\n(可选)外接一套 USB 键鼠套装。
\n开始操作之前,预先提醒:
\n如果你有兴趣尝试显卡直通,并正准备购买一台新电脑,你可以参考以下我的选择方法。
\n显卡直通的前提条件是:
\n但是,游戏本厂商很少会在宣传页上写明视频接口连接的是独显还是核显。因此我们只能根据常见的参数进行推测:
\n优先选择支持「独显直连内屏」的电脑,因为这种情况下独显一定具有视频输出功能,并且厂家大概率会将机身视频接口连接到独显上。
\n或者选择带有中高端独立显卡的电脑,一般 NVIDIA 显卡型号要以 60 或以上结尾。
\n用好七天无理由退货服务。
\nIntel 第五代到第九代的 CPU 核显都支持对显卡本身进行虚拟化,也就是划分出几个虚拟的显卡,将虚拟显卡直通进虚拟机、让虚拟机享受显卡加速的同时,允许宿主机同时使用显卡进行显示。
\n但是 Linux 下的 GVT-g 驱动不支持第十代及更新的 CPU,而且 Intel 也没有支持的计划。再加上 GVT-g 虚拟显卡无法和 NVIDIA 独显组成 Optimus 结构,它也没有什么用。
\n所以,我们不用管 GVT-g 了,只直通 NVIDIA 独显就好。
\nIntel 十一代及之后的 CPU 核显使用另一种虚拟化方式:SR-IOV。Intel 官方已经发布了 SR-IOV 的内核模块代码,但尚未合入 Linux 主线。有第三方项目将这部分内核代码移植成 DKMS 模块,但根据 Issues 反馈成功率不高,我在 i7-11800H 上测试也没成功。所以,本文将不涉及 Intel 核显的 SR-IOV 功能。
\n\n\n这一段的大部分内容和 2021 年的这篇文章是一样的。
\n
宿主系统上的 NVIDIA 的驱动会占用独显,阻止虚拟机调用它,因此需要先用 PCIe 直通用的 vfio-pci
驱动替换掉它。
禁用 NVIDIA 驱动,把独显交给处理虚拟机 PCIe 直通的内核模块管理的步骤如下:
\n运行 lspci -nn | grep NVIDIA
,获得类似如下输出:
0000:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA104M [GeForce RTX 3070 Mobile / Max-Q] [10de:249d] (rev a1)\n0000:01:00.1 Audio device [0403]: NVIDIA Corporation GA104 High Definition Audio Controller [10de:228b] (rev a1)\n
\n这里的 [10de:249d]
就是独显的制造商 ID 和设备 ID,其中 10de
代表这个 PCIe 设备由 NVIDIA 生产,而 249d
代表这是张 3070。228b
是 HDMI 接口的音频输出,也需要用 vfio-pci
驱动接管。
创建 /etc/modprobe.d/lantian.conf
,添加如下内容:
options vfio-pci ids=10de:249d,10de:228b\n
\n给 vfio-pci
这个负责 PCIe 直通的内核驱动一个配置,让它去管理独显。ids
参数就是要直通的独显的制造商 ID 和设备 ID。
修改 /etc/mkinitcpio.conf
,在 MODULES
中添加以下内容:
MODULES=(vfio_pci vfio vfio_iommu_type1 vfio_virqfd)\n
\n删除 nvidia
等与 NVIDIA 驱动相关的内核模块,或者确保它们排在 VFIO 驱动后面。这样 PCIe 直通模块就会在系统启动的早期抢占独显,阻止 NVIDIA 驱动后续占用独显。
运行 mkinitcpio -P
更新 initramfs。
重启电脑。
\n(2023-05)如果你用的是 NixOS 系统,可以直接使用下面的配置:
\n{\n boot.kernelModules = [\"vfio-pci\"];\n boot.extraModprobeConfig = ''\n # 这里改成你的显卡的制造商 ID 和设备 ID\n options vfio-pci ids=10de:249d\n '';\n\n boot.blacklistedKernelModules = [\"nouveau\" \"nvidiafb\" \"nvidia\" \"nvidia-uvm\" \"nvidia-drm\" \"nvidia-modeset\"];\n}\n
\n在 2021 年的这篇文章中,我在这里介绍了一大堆绕过 NVIDIA 驱动限制的内容。但是从 465 版本开始,NVIDIA 解除了大部分的限制,理论上来说现在直接把显卡直通进虚拟机就能用。
\n但也只是理论上而已。
\n我依然建议大家做完所有的隐藏虚拟机的步骤,因为:
\n(2022-01)对于笔记本电脑来说,NVIDIA 并没有解除所有的限制。
\n即使 NVIDIA 驱动不检测虚拟机,你运行的程序也会检测虚拟机,隐藏虚拟机特征可以提高成功运行这些程序的概率。
\n那么,开始操作:
\n与 Optimus MUXless 架构不同,我这次没有手动提取显卡 BIOS、修改 UEFI 固件就成功进行了显卡直通。
\nPCI\\VEN_10DE&DEV_1C8D&SUBSYS_39D117AA&REV_A1
。如果 SUBSYS
后面跟着的是一串 0,这就意味着显卡 BIOS 加载失败,你需要手动提取显卡 BIOS。编辑你的虚拟机配置,virsh edit Windows
,做如下修改:
<!-- 把 features 一段改成这样,就是让 QEMU 隐藏虚拟机的特征 -->\n<features>\n <acpi/>\n <apic/>\n <hyperv mode=\"custom\">\n <relaxed state=\"on\"/>\n <vapic state=\"on\"/>\n <spinlocks state=\"on\" retries=\"8191\"/>\n <vpindex state=\"on\"/>\n <runtime state=\"on\"/>\n <synic state=\"on\"/>\n <stimer state=\"on\"/>\n <reset state=\"on\"/>\n <vendor_id state=\"on\" value=\"GenuineIntel\"/>\n <frequencies state=\"on\"/>\n <tlbflush state=\"on\"/>\n </hyperv>\n <kvm>\n <hidden state=\"on\"/>\n </kvm>\n <vmport state=\"off\"/>\n</features>\n<!-- 添加显卡直通的 PCIe 设备 -->\n<hostdev mode='subsystem' type='pci' managed='yes'>\n <source>\n <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>\n </source>\n <rom bar='off'/>\n <!-- 注意这里的 PCIe 总线地址必须是 01:00.0,一点都不能差 -->\n <!-- 如果保存时提示 PCIe 总线地址冲突,就把其它设备的 <address> 全部删掉 -->\n <!-- 这样 Libvirt 会重新分配一遍 PCIe 地址 -->\n <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0' multifunction='on'/>\n</hostdev>\n<!-- 添加一块在虚拟机和宿主机之间共享的内存,以便将虚拟机显示内容传回宿主机 -->\n<shmem name='looking-glass'>\n <model type='ivshmem-plain'/>\n <!-- 这里内存大小的公式是:分辨率宽 x 分辨率高 / 131072,然后向上取到 2 的 n 次方 -->\n <!-- 因为大部分 HDMI 假显示器的分辨率都是 3840 x 2160,计算结果是 63.28MB,向上取到 64MB -->\n <size unit='M'>64</size>\n</shmem>\n<!-- 禁用内存 Balloon,也就是内存动态伸缩,严重影响性能 -->\n<memballoon model=\"none\"/>\n<!-- 在 </qemu:commandline> 之前添加这些参数 -->\n<qemu:arg value='-acpitable'/>\n<qemu:arg value='file=/ssdt1.dat'/>\n
\n此处的 ssdt1.dat 是一个修改后的 ACPI 表,用来模拟一块满电的电池。它对应如下 Base64,可以用 Base64 解码网站转换成二进制文件,放在根目录,或者从本站下载。
\nU1NEVKEAAAAB9EJPQ0hTAEJYUENTU0RUAQAAAElOVEwYEBkgoA8AFVwuX1NCX1BDSTAGABBMBi5f\nU0JfUENJMFuCTwVCQVQwCF9ISUQMQdAMCghfVUlEABQJX1NUQQCkCh8UK19CSUYApBIjDQELcBcL\ncBcBC9A5C1gCCywBCjwKPA0ADQANTElPTgANABQSX0JTVACkEgoEAAALcBcL0Dk=\n
\n修改共享内存文件的权限。
\n修改 /etc/apparmor.d/local/abstractions/libvirt-qemu
文件增加一行:
/dev/shm/looking-glass rw,\n
\n然后运行 sudo systemctl restart apparmor
重启 AppArmor。
创建 /etc/tmpfiles.d/looking-glass.conf
,写入以下内容,把 lantian
换成你的用户名:
f /dev/shm/looking-glass 0660 lantian kvm -\n
\n然后运行 sudo systemd-tmpfiles /etc/tmpfiles.d/looking-glass.conf --create
生效。
启动虚拟机,等一会,Windows 会自动装好 NVIDIA 驱动。
\nDevice by Connection
(按照连接方式显示设备),确认显卡的地址是总线 Bus 1,接口 Slot 0,功能 Function 0,并且确认显卡上级的 PCIe 接口是总线 Bus 0,接口 Slot 1,功能 Function 0。SUBSYS
后是否跟着一串 0。\n关闭虚拟机并再次启动,注意不是直接重启,再次在设备管理器里确认显卡工作正常。
\n以下步骤二选一:
\nC:\\IddSampleDriver
。注意这个文件夹不能移动到其它位置!C:\\IddSampleDriver\\option.txt
,你会看到第一行是一个数字 1(不要修改),然后是分辨率 / 刷新率列表。只保留你想要的一项分辨率 / 刷新率,把其它的分辨率 / 刷新率都删掉。C:\\IddSampleDriver\\IddSampleDriver.inf
并一路下一步完成安装。(2023-05)现在新版 Looking Glass 会自动安装 IVSHMEM 驱动(虚拟机和宿主机共享内存的驱动),你无需再手动安装驱动。这里保留手动安装步骤以供参考:
\n(2022-01)下载这份 Virtio 驱动复制到虚拟机内解压,注意一定是这份,其它的版本大都没有 IVSHMEM 驱动:
\n\n在虚拟机里进入设备管理器,找到系统设备 - PCI 标准内存控制器(PCI standard RAM controller
):
Virtio 驱动/Win10/amd64/ivshmem.inf
文件IVSHMEM
安装 Looking Glass,这是一个将虚拟机的显示画面传输到宿主机的工具。
\n(2023-05)如果按照 2022-01 的步骤操作,虚拟机开机过程中、Looking Glass 启动前你将无法看到开机画面。因此我推荐在设备管理器中直接禁用 QXL 虚拟显卡。以下旧版步骤保留以供参考。
\n(2022-01)关闭虚拟机,virsh edit Windows
编辑虚拟机配置。
找到 <video><model type=\"qxl\" ...></video>
,将 type
改为 none
,以禁用 QXL 虚拟显卡:
<video>\n <model type=\"none\"/>\n</video>\n
\n在宿主机上安装 Looking Glass 的客户端,Arch Linux 用户可以直接从 AUR 安装 looking-glass
包。运行 looking-glass-client
命令启动客户端。
回到 Virt-Manager,关掉虚拟机的窗口(就是查看虚拟机桌面、编辑配置的窗口),在 Virt-Manager 主界面右键选择你的虚拟机,点击启动。
\n稍等片刻,Looking Glass 的客户端就会显示出虚拟机的画面,此时显卡直通就配置完成了。
\n虽然显卡直通已经完成,但是虚拟机的体验还需要优化。具体来说:
\n我们将一个个解决以上问题。
\n(2023-05)新版 Looking Glass 已经可以传输声音。以下步骤保留以供参考。
\n虽然 Virt-Manager 本身可以通过 SPICE 协议连接虚拟机,从而传输虚拟机的声音,但是 Looking Glass 也会通过 SPICE 传输键鼠操作,而虚拟机上同时只能有一个 SPICE 连接。这就意味着我们无法使用 Virt-Manager 来听声音了。
\n我们可以安装 Scream,一个 Windows 下的虚拟声卡软件,将声音通过虚拟机的网卡来传输,然后在宿主机上用 Scream 的客户端接收。
\n在虚拟机上,从 Scream 的下载页面下载 Scream 安装程序,解压后右键以管理员身份运行 Install-x64.bat
脚本安装驱动,然后重启。
在宿主机上安装 Scream 客户端,Arch Linux 用户可以安装 AUR 中的 scream
软件包。
在宿主机上开一个终端运行 scream -v
,在虚拟机中播放音频,测试能不能听到。如果无法听到,尝试指定 Scream 客户端监听的网卡,例如 scream -i virbr0 -v
,其中 virbr0
对应 Virt-Manager 默认的 NAT 网络,是你的虚拟机与宿主机通信的网卡。
最后,可以创建一个 SystemD 服务,来方便地启动 Scream 客户端。创建 ~/.config/systemd/user/scream.service
,写入以下内容:
[Unit]\nDescription=Scream\n\n[Service]\nType=simple\nRestart=always\nRestartSec=1\nExecStart=/usr/bin/scream -i virbr0 -v\n\n[Install]\nWantedBy=graphical-session.target\n
\n以后使用时就只需要运行 systemctl --user start scream
了。
(2023-05)新版 Looking Glass 已经可以稳定传输鼠标键盘操作。以下步骤保留以供参考。
\nLooking Glass 的键盘鼠标传输不太稳定,有时会丢失一些操作,因此如果你想在虚拟机里玩游戏,就需要用更稳定的方法将键鼠操作传进虚拟机。
\n我们有两种方法:让 Libvirt 虚拟机直接捕获宿主机的键鼠操作,或者把一套 USB 键鼠直接直通进虚拟机。
\n捕获宿主机键鼠操作。
\n在 Linux 系统上,所有的键鼠操作都是通过 evdev
(即 Event Device
)框架传输给桌面环境的。Libvirt 可以监听你的键鼠操作,将你的操作传给虚拟机。同时,Libvirt 可以在你按下左 Ctrl + 右 Ctrl 这套组合键的时候,在虚拟机和宿主机之间切换,这样你就可以用同一套键盘鼠标同时操作宿主机和虚拟机了。
首先在宿主机上运行 ls -l /dev/input/by-path
查看你现有的 evdev
设备,例如我就有:
pci-0000:00:14.0-usb-0:1:1.1-event-mouse # USB 外接鼠标\npci-0000:00:14.0-usb-0:1:1.1-mouse\npci-0000:00:14.0-usb-0:6:1.0-event\npci-0000:00:15.0-platform-i2c_designware.0-event-mouse # 电脑内置的触摸板\npci-0000:00:15.0-platform-i2c_designware.0-mouse\npci-0000:00:1f.3-platform-skl_hda_dsp_generic-event\nplatform-i8042-serio-0-event-kbd # 电脑内置的键盘\nplatform-pcspkr-event-spkr\n
\n名字中带有 event-mouse
的就是鼠标,带有 event-kbd
的就是键盘。
然后,virsh edit Windows
编辑虚拟机配置,在 <devices>
中添加一段:
<input type=\"evdev\">\n <!-- 根据上面 ls 的结果,修改鼠标或键盘的路径 -->\n <source dev=\"/dev/input/by-path/platform-i8042-serio-0-event-kbd\" grab=\"all\" repeat=\"on\"/>\n</input>\n<!-- 有多个鼠标键盘时,重复即可 -->\n<input type=\"evdev\">\n <source dev=\"/dev/input/by-path/pci-0000:00:15.0-platform-i2c_designware.0-event-mouse\" grab=\"all\" repeat=\"on\"/>\n</input>\n
\n启动虚拟机,这时你会发现键鼠操作没反应了,因为它们被虚拟机捕获了。按下左 Ctrl + 右 Ctrl 组合键就可以恢复宿主机键鼠控制,再按一次就可以控制虚拟机。
\n然后,我们就可以禁用 Looking Glass 的键鼠传输功能了。创建 /etc/looking-glass-client.ini
,写入以下内容:
[spice]\nenable=no\n
\nUSB 键鼠直通
\n捕获键鼠操作并不是万能的,例如我的触摸板就无法被正常捕获,体现为无法移动虚拟机内的光标。
\n如果你也遇到了这种情况,并且你有一套 USB 键鼠,就可以将它们直通进虚拟机,专门用它们控制虚拟机。虚拟机的 USB 直通技术非常成熟,你遇到问题的概率非常小。
\n在 Virt-Manager 里选择添加硬件(Add Hardware
) - USB 宿主设备(USB Host Device
),选择你的鼠标键盘即可。
\n\n\n
Looking Glass 提供了一个内核模块,可以用于 IVSHMEM 共享内存设备,让 Looking Glass 能使用 DMA 技术高效地读取虚拟机画面,从而提高帧率。
\n安装 Linux 内核头文件和 DKMS,在 Arch Linux 上就是安装 linux-headers
和 dkms
两个包。
从 AUR 安装 looking-glass-module-dkms
。
配置 Udev 规则:创建 /etc/udev/rules.d/99-kvmfr.rules
,写入以下内容:
SUBSYSTEM==\"kvmfr\", OWNER=\"lantian\", GROUP=\"kvm\", MODE=\"0660\"\n
\n将 lantian
替换成你自己的用户名。
配置内存大小:创建 /etc/modprobe.d/looking-glass.conf
,写入以下内容:
# 这里的内存大小计算方法和虚拟机的 shmem 一项相同。\noptions kvmfr static_size_mb=64\n
\n开机自动加载模块:创建 /etc/modules-load.d/looking-glass.conf
,写入一行 kvmfr
。
运行 sudo modprobe kvmfr
加载模块,此时 /dev
下会多出一个 kvmfr0
设备,就是 Looking Glass 的内存设备了。
修改 /etc/apparmor.d/local/abstractions/libvirt-qemu
文件增加一行:
/dev/kvmfr0 rw,\n
\n以允许虚拟机访问这个设备。运行 sudo systemctl restart apparmor
重启 AppArmor。
virsh edit Windows
编辑虚拟机配置:
在 <devices>
中删除 <shmem>
一段:
<shmem name='looking-glass'>\n <model type='ivshmem-plain'/>\n <size unit='M'>64</size>\n</shmem>\n
\n在 <qemu:commandline>
中增加下面几行:
<qemu:arg value=\"-device\"/>\n<qemu:arg value=\"{"driver":"ivshmem-plain","id":"shmem-looking-glass","memdev":"looking-glass"}\"/>\n<qemu:arg value=\"-object\"/>\n<!-- 下一行有一个 67108864,对应 64MB * 1048576 -->\n<!-- 如果你之前设置的内存大小不同请相应修改 -->\n<qemu:arg value=\"{"qom-type":"memory-backend-file","id":"looking-glass","mem-path":"/dev/kvmfr0","size":67108864,"share":true}\"/>\n
\n启动虚拟机。
\n修改 /etc/looking-glass-client.ini
,添加以下内容:
[app]\nshmFile=/dev/kvmfr0\n
\n启动 Looking Glass,此时应该可以看到虚拟机画面。
\n(2023-05)如果你用的是 NixOS,可以直接使用下面的配置:
\n{\n boot.extraModulePackages = with config.boot.kernelPackages; [\n kvmfr\n ];\n boot.extraModprobeConfig = ''\n # 这里的内存大小计算方法和虚拟机的 shmem 一项相同。\n options kvmfr static_size_mb=64\n '';\n boot.kernelModules = [\"kvmfr\"];\n services.udev.extraRules = ''\n SUBSYSTEM==\"kvmfr\", OWNER=\"root\", GROUP=\"libvirtd\", MODE=\"0660\"\n '';\n\n environment.etc.\"looking-glass-client.ini\".text = ''\n [app]\n shmFile=/dev/kvmfr0\n '';\n}\n
\n2022-01-26 更新:实测应用这个补丁后,NVIDIA 显卡仍未完全断电,耗电量与未使用补丁前相同。本段内容失效。
\n\n\n这一段只适用于 20 系及以上的 NVIDIA 显卡,当使用 NVIDIA 官方驱动时,它们也可以自动断电。10 系及以下的 NVIDIA 显卡不支持此功能。
\n这一段涉及自行编译内核,和使用未经严格检查和测试的内核补丁,不建议不熟悉 Linux 的用户操作。请自行衡量风险。
\n
当你不使用虚拟机时,管理 PCIe 直通的 vfio-pci
驱动会将设备设置成 D3
模式,也就是 PCIe 设备的省电模式。但是 D3
模式也分两种:D3hot
,此时设备仍然通电,和 D3cold
,此时设备完全断电。现在内核中的 vfio-pci
驱动只支持 D3hot
,此时 NVIDIA 独立显卡由于芯片未断电,仍会消耗 10W 左右的功率,从而导致笔记本电脑续航下降。
一位 NVIDIA 的工程师在 Linux 内核的邮件列表上发布了一组让 vfio-pci
支持 D3cold
模式的补丁。应用此补丁后,当虚拟机关机时,NVIDIA 独立显卡会被彻底断电,从而节省电量。
这组补丁可以在 https://lore.kernel.org/lkml/20211115133640.2231-1-abhsahu@nvidia.com/T/ 看到。它总共由三个补丁组成,我将三个补丁合并后上传到了 https://github.com/xddxdd/pkgbuild/blob/master/linux-xanmod-lantian/0007-vfio-pci-d3cold.patch。
\n对于 Arch Linux 来说,给内核打补丁是比较简单的。AUR 中大部分内核的 PKGBUILD 都可以自动打补丁,只需要下载一个内核的 PKGBUILD,然后把这个补丁加入 PKGBUILD 的 source
部分就可以了。具体修改可以看我的这个 commit:https://github.com/xddxdd/pkgbuild/commit/406adb7bf5657cfe07bb17ff561d11ed97ebab39
要注意的是,这个补丁无法保证稳定。
\n根据邮件列表的讨论:
\n[RFC]
。D3cold
模式,这个补丁存在将显卡 reset,导致状态丢失,继而导致虚拟机崩溃的风险。虽然目前我使用 Windows 11 虚拟机暂时没有发现类似的问题,但是你需要了解其中的隐患。风险自负。
感谢前人在显卡直通上做出的努力,没有他们的努力本文不可能存在。
\n以下是我配置时参考的资料:
\n<domain xmlns:qemu=\"http://libvirt.org/schemas/domain/qemu/1.0\" type=\"kvm\">\n <name>Windows11</name>\n <uuid>5d5b00d8-475a-4b6c-8053-9dda30cd2f95</uuid>\n <metadata>\n <libosinfo:libosinfo xmlns:libosinfo=\"http://libosinfo.org/xmlns/libvirt/domain/1.0\">\n <libosinfo:os id=\"http://microsoft.com/win/11\"/>\n </libosinfo:libosinfo>\n </metadata>\n <memory unit=\"KiB\">16777216</memory>\n <currentMemory unit=\"KiB\">16777216</currentMemory>\n <vcpu placement=\"static\">16</vcpu>\n <os>\n <type arch=\"x86_64\" machine=\"pc-q35-8.0\">hvm</type>\n <loader readonly=\"yes\" type=\"pflash\">/run/libvirt/nix-ovmf/OVMF_CODE.fd</loader>\n <nvram template=\"/run/libvirt/nix-ovmf/OVMF_VARS.fd\">/var/lib/libvirt/qemu/nvram/Windows11_VARS.fd</nvram>\n </os>\n <features>\n <acpi/>\n <apic/>\n <hyperv mode=\"custom\">\n <relaxed state=\"on\"/>\n <vapic state=\"on\"/>\n <spinlocks state=\"on\" retries=\"8191\"/>\n <vpindex state=\"on\"/>\n <runtime state=\"on\"/>\n <synic state=\"on\"/>\n <stimer state=\"on\"/>\n <reset state=\"on\"/>\n <vendor_id state=\"on\" value=\"GenuineIntel\"/>\n <frequencies state=\"on\"/>\n <tlbflush state=\"on\"/>\n </hyperv>\n <kvm>\n <hidden state=\"on\"/>\n </kvm>\n <vmport state=\"off\"/>\n </features>\n <cpu mode=\"host-passthrough\" check=\"none\" migratable=\"on\">\n <topology sockets=\"1\" dies=\"1\" cores=\"8\" threads=\"2\"/>\n </cpu>\n <clock offset=\"localtime\">\n <timer name=\"rtc\" tickpolicy=\"catchup\"/>\n <timer name=\"pit\" tickpolicy=\"delay\"/>\n <timer name=\"hpet\" present=\"no\"/>\n <timer name=\"hypervclock\" present=\"yes\"/>\n </clock>\n <on_poweroff>destroy</on_poweroff>\n <on_reboot>restart</on_reboot>\n <on_crash>destroy</on_crash>\n <pm>\n <suspend-to-mem enabled=\"no\"/>\n <suspend-to-disk enabled=\"no\"/>\n </pm>\n <devices>\n <emulator>/run/libvirt/nix-emulators/qemu-system-x86_64</emulator>\n <disk type=\"file\" device=\"disk\">\n <driver name=\"qemu\" type=\"qcow2\" discard=\"unmap\"/>\n <source 
file=\"/var/lib/libvirt/images/Windows11.qcow2\"/>\n <target dev=\"vda\" bus=\"virtio\"/>\n <boot order=\"1\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x04\" slot=\"0x00\" function=\"0x0\"/>\n </disk>\n <disk type=\"file\" device=\"cdrom\">\n <driver name=\"qemu\" type=\"raw\"/>\n <source file=\"/mnt/root/persistent/media/LegacyOS/Common/virtio-win-0.1.215.iso\"/>\n <target dev=\"sdb\" bus=\"sata\"/>\n <readonly/>\n <address type=\"drive\" controller=\"0\" bus=\"0\" target=\"0\" unit=\"1\"/>\n </disk>\n <controller type=\"usb\" index=\"0\" model=\"qemu-xhci\" ports=\"15\">\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x02\" slot=\"0x00\" function=\"0x0\"/>\n </controller>\n <controller type=\"pci\" index=\"0\" model=\"pcie-root\"/>\n <controller type=\"pci\" index=\"1\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"1\" port=\"0x10\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x0\" multifunction=\"on\"/>\n </controller>\n <controller type=\"pci\" index=\"2\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"2\" port=\"0x11\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x1\"/>\n </controller>\n <controller type=\"pci\" index=\"3\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"3\" port=\"0x12\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x2\"/>\n </controller>\n <controller type=\"pci\" index=\"4\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"4\" port=\"0x13\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x3\"/>\n </controller>\n <controller type=\"pci\" index=\"5\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"5\" port=\"0x14\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x4\"/>\n </controller>\n 
<controller type=\"pci\" index=\"6\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"6\" port=\"0x15\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x5\"/>\n </controller>\n <controller type=\"pci\" index=\"7\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"7\" port=\"0x16\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x6\"/>\n </controller>\n <controller type=\"pci\" index=\"8\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"8\" port=\"0x17\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x7\"/>\n </controller>\n <controller type=\"pci\" index=\"9\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"9\" port=\"0x18\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x03\" function=\"0x0\" multifunction=\"on\"/>\n </controller>\n <controller type=\"pci\" index=\"10\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"10\" port=\"0x19\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x03\" function=\"0x1\"/>\n </controller>\n <controller type=\"pci\" index=\"11\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"11\" port=\"0x1a\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x03\" function=\"0x2\"/>\n </controller>\n <controller type=\"pci\" index=\"12\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"12\" port=\"0x1b\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x03\" function=\"0x3\"/>\n </controller>\n <controller type=\"pci\" index=\"13\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"13\" port=\"0x1c\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x03\" function=\"0x4\"/>\n </controller>\n <controller 
type=\"pci\" index=\"14\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"14\" port=\"0x1d\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x03\" function=\"0x5\"/>\n </controller>\n <controller type=\"sata\" index=\"0\">\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x1f\" function=\"0x2\"/>\n </controller>\n <controller type=\"virtio-serial\" index=\"0\">\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x03\" slot=\"0x00\" function=\"0x0\"/>\n </controller>\n <interface type=\"network\">\n <mac address=\"52:54:00:f4:bf:15\"/>\n <source network=\"default\"/>\n <model type=\"virtio\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x01\" slot=\"0x00\" function=\"0x0\"/>\n </interface>\n <serial type=\"pty\">\n <target type=\"isa-serial\" port=\"0\">\n <model name=\"isa-serial\"/>\n </target>\n </serial>\n <console type=\"pty\">\n <target type=\"serial\" port=\"0\"/>\n </console>\n <channel type=\"spicevmc\">\n <target type=\"virtio\" name=\"com.redhat.spice.0\"/>\n <address type=\"virtio-serial\" controller=\"0\" bus=\"0\" port=\"1\"/>\n </channel>\n <input type=\"mouse\" bus=\"ps2\"/>\n <input type=\"mouse\" bus=\"virtio\">\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x06\" slot=\"0x00\" function=\"0x0\"/>\n </input>\n <input type=\"keyboard\" bus=\"ps2\"/>\n <input type=\"keyboard\" bus=\"virtio\">\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x07\" slot=\"0x00\" function=\"0x0\"/>\n </input>\n <tpm model=\"tpm-crb\">\n <backend type=\"passthrough\">\n <device path=\"/dev/tpm0\"/>\n </backend>\n </tpm>\n <graphics type=\"spice\" autoport=\"yes\">\n <listen type=\"address\"/>\n <image compression=\"off\"/>\n </graphics>\n <sound model=\"ich9\">\n <audio id=\"1\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x1b\" function=\"0x0\"/>\n </sound>\n <audio id=\"1\" type=\"spice\"/>\n <video>\n <model type=\"qxl\" ram=\"65536\" vram=\"65536\" vgamem=\"16384\" heads=\"1\" 
primary=\"yes\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x01\" function=\"0x0\"/>\n </video>\n <hostdev mode=\"subsystem\" type=\"pci\" managed=\"yes\">\n <source>\n <address domain=\"0x0000\" bus=\"0x01\" slot=\"0x00\" function=\"0x0\"/>\n </source>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x05\" slot=\"0x00\" function=\"0x0\"/>\n </hostdev>\n <redirdev bus=\"usb\" type=\"spicevmc\">\n <address type=\"usb\" bus=\"0\" port=\"2\"/>\n </redirdev>\n <redirdev bus=\"usb\" type=\"spicevmc\">\n <address type=\"usb\" bus=\"0\" port=\"3\"/>\n </redirdev>\n <watchdog model=\"itco\" action=\"reset\"/>\n <memballoon model=\"none\"/>\n </devices>\n <qemu:commandline>\n <qemu:arg value=\"-device\"/>\n <qemu:arg value=\"{"driver":"ivshmem-plain","id":"shmem0","memdev":"looking-glass"}\"/>\n <qemu:arg value=\"-object\"/>\n <qemu:arg value=\"{"qom-type":"memory-backend-file","id":"looking-glass","mem-path":"/dev/kvmfr0","size":67108864,"share":true}\"/>\n <qemu:arg value=\"-acpitable\"/>\n <qemu:arg value=\"file=/etc/ssdt1.dat\"/>\n </qemu:commandline>\n</domain>\n
A year ago, I tried GPU passthrough on my Lenovo R720 gaming laptop, so that I could browse the web and write code on my Arch Linux installation while using Windows for tasks that are infeasible on Linux (such as gaming). But since that laptop has an Optimus MUXless architecture (as mentioned in that post), its dedicated GPU doesn't have output ports, and the integrated GPU is in charge of all the displays. That setup was therefore severely limited, and I eventually gave up on it.
\nBut now, I've purchased a new laptop. The HDMI output port on this laptop is directly connected to its NVIDIA dedicated graphics card, or in other words, it has an Optimus MUXed architecture. Since there is a way to make the virtual machine aware of a \"monitor on the dedicated GPU\", most functionalities work normally. I am finally able to create a GPU passthrough setup that works long-term.
\nBefore following steps in this post, you need to prepare:
\nA laptop with the Optimus MUXed architecture. My laptop is a HP OMEN 17t-ck000 (i7-11800H, RTX 3070).
\nSet up a virtual machine of Windows 10 or Windows 11 with Libvirt (Virt-Manager). I'm using Windows 11.
\n(Optional) Depending on the video output ports on your computer, purchase an HDMI, DP, or USB Type-C dummy plug. You can get one for a few bucks on Amazon.
\n(Optional) A USB keyboard and mouse combo.
\nA reminder before we begin:
\nIf you are interested in GPU passthrough and are looking for a new laptop, you can refer to my guidelines.
\nThe prerequisites for laptop GPU passthrough are:
\nHowever, it's extremely rare for a laptop manufacturer to mention the port connection schemes on their product pages, so we have to infer from more common specifications:
\nPrefer a laptop with a MUX switch, that is, one that can switch its internal screen onto the dedicated GPU. In this case, the dedicated GPU must be capable of video output, and there's a high chance that the manufacturer connected the chassis video outputs to the dedicated GPU:
\nOr choose a laptop with a mid-range to high-end graphics card. For NVIDIA GPUs, the model number should end in 60 or higher.
\nTake advantage of unconditional return policies.
\n5th to 9th-Gen Intel integrated graphics support virtualizing the GPU itself, or in other words, splitting it into several virtual GPUs. The virtual GPUs can be passed through into VMs so they get GPU acceleration, while the host can still display stuff on the very same GPU.
\nHowever, the GVT-g driver in Linux doesn't support 10th-Gen or newer Intel CPUs, and Intel has no plan to support them. In addition, the GVT-g virtual GPU cannot form an Optimus configuration with an NVIDIA GPU, so it isn't useful anyway.
\nThat's why we're ignoring GVT-g and focusing on the NVIDIA GPU in this guide.
\n11th-Gen and later Intel integrated graphics support another form of virtualization: SR-IOV. Intel has officially released the source code of the kernel module with SR-IOV support, but it hasn't been merged into the Linux mainline yet. There's a third-party project that ports the code into a DKMS module, but the success rate is low according to reports in its Issues section. I tried it on my i7-11800H and didn't succeed. Therefore, this post doesn't cover SR-IOV on Intel GPUs.
\n\n\nMost of the content is the same as my post in 2021.
\n
The NVIDIA driver on the host OS holds control of the dGPU and stops the VM from using it. Therefore, you need to replace the driver with vfio-pci
, built solely for PCIe passthrough.
Here are the steps for disabling the NVIDIA driver and passing control to the PCIe passthrough module:
\nRun lspci -nn | grep NVIDIA
and obtain an output similar to:
0000:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA104M [GeForce RTX 3070 Mobile / Max-Q] [10de:249d] (rev a1)\n0000:01:00.1 Audio device [0403]: NVIDIA Corporation GA104 High Definition Audio Controller [10de:228b] (rev a1)\n
\nHere [10de:249d]
is the vendor ID and device ID of the dGPU, where 10de
means this device is manufactured by NVIDIA, and 249d
means this is an RTX 3070. 228b
is the audio output on the HDMI port, which should also be taken over by vfio-pci
.
Create /etc/modprobe.d/lantian.conf
with the following content:
options vfio-pci ids=10de:249d,10de:228b\n
\nThis configures vfio-pci
, the kernel module responsible for PCIe passthrough, to manage the dGPU. ids
is the vendor ID and device ID of the device to be passed through.
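If you'd rather not copy the IDs by hand, the option line can be derived from the lspci output. The sketch below is only an illustration: the helper names extract_id and vfio_options are hypothetical, and it assumes the vendor:device pair is always the last bracketed group on each line, as in the sample output above.

```python
# Sample lines copied from the lspci output above.
LSPCI_OUTPUT = [
    '0000:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation '
    'GA104M [GeForce RTX 3070 Mobile / Max-Q] [10de:249d] (rev a1)',
    '0000:01:00.1 Audio device [0403]: NVIDIA Corporation '
    'GA104 High Definition Audio Controller [10de:228b] (rev a1)',
]


def extract_id(line):
    '''Return the vendor:device pair from the last [...] group of a line.'''
    start = line.rfind('[')
    end = line.rfind(']')
    return line[start + 1:end]


def vfio_options(lines):
    '''Build the modprobe option line for all listed devices.'''
    ids = [extract_id(line) for line in lines]
    return 'options vfio-pci ids=' + ','.join(ids)
```

Both the GPU and its HDMI audio function end up in the same ids list, matching the configuration shown above.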
Modify /etc/mkinitcpio.conf
, add the following contents to MODULES
:
MODULES=(vfio_pci vfio vfio_iommu_type1 vfio_virqfd)\n
\nAnd remove anything related to NVIDIA drivers (such as nvidia
), or make sure they're listed after the VFIO drivers. The PCIe passthrough module will now take control of the dGPU early in the boot process, preventing the NVIDIA drivers from claiming it.
Run mkinitcpio -P
to update the initramfs.
Reboot.
\n(2023-05) If you're using NixOS, you can use the following config:
\n{\n boot.kernelModules = [\"vfio-pci\"];\n boot.extraModprobeConfig = ''\n # Change to your GPU's vendor ID and device ID\n options vfio-pci ids=10de:249d\n '';\n\n boot.blacklistedKernelModules = [\"nouveau\" \"nvidiafb\" \"nvidia\" \"nvidia-uvm\" \"nvidia-drm\" \"nvidia-modeset\"];\n}\n
\nIn my post in 2021, I described a lot of configuration to circumvent restrictions of the NVIDIA driver. But since version 465, NVIDIA has lifted most of the restrictions, so theoretically, you can simply pass a GPU into the VM and everything should just work.
\nBut that's just the theory.
\nI still recommend everyone to follow all the steps and hide the VM characteristics, because:
\n(2022-01) Not all restrictions are lifted for laptops.
\nEven if the NVIDIA driver doesn't detect VMs, the programs you run might. Hiding VM characteristics increases the chance of running them successfully.
\nAnd here we start:
\nUnlike with the Optimus MUXless architecture, this time I neither extracted the graphics card's BIOS manually nor modified the UEFI firmware, and GPU passthrough just worked.
\nPCI\\VEN_10DE&DEV_1C8D&SUBSYS_39D117AA&REV_A1
. If SUBSYS
is followed by a sequence of zeros, then the GPU video BIOS is missing, and you need the manual steps.Modify your VM configuration, virsh edit Windows
, and make the following changes:
<!-- Modify the features section, so QEMU will hide the fact that this is a VM -->\n<features>\n <acpi/>\n <apic/>\n <hyperv mode=\"custom\">\n <relaxed state=\"on\"/>\n <vapic state=\"on\"/>\n <spinlocks state=\"on\" retries=\"8191\"/>\n <vpindex state=\"on\"/>\n <runtime state=\"on\"/>\n <synic state=\"on\"/>\n <stimer state=\"on\"/>\n <reset state=\"on\"/>\n <vendor_id state=\"on\" value=\"GenuineIntel\"/>\n <frequencies state=\"on\"/>\n <tlbflush state=\"on\"/>\n </hyperv>\n <kvm>\n <hidden state=\"on\"/>\n </kvm>\n <vmport state=\"off\"/>\n</features>\n<!-- Add the PCIe passthrough device, must be below the hostdev for iGPU -->\n<hostdev mode='subsystem' type='pci' managed='yes'>\n <source>\n <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>\n </source>\n <rom bar='off'/>\n <!-- The PCIe bus address here MUST BE EXACTLY 01:00.0 -->\n <!-- If there is a PCIe bus address conflict when saving config changes, -->\n <!-- Remove <address> of all other devices -->\n <!-- And Libvirt will reallocate PCIe bus addresses -->\n <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0' multifunction='on'/>\n</hostdev>\n<!-- Add a shared memory between VM and host -->\n<!-- So VM can transfer its display content to host -->\n<shmem name='looking-glass'>\n <model type='ivshmem-plain'/>\n <!-- Size is calculated as: display resolution width x height / 131072 -->\n <!-- then round up to power of 2 -->\n <!-- Most HDMI dummy plugs have a resolution of 3840 x 2160 -->\n <!-- The result is 63.28MB which rounds up to 64MB -->\n <size unit='M'>64</size>\n</shmem>\n<!-- Disable memory ballooning, this hurts performance significantly -->\n<memballoon model=\"none\"/>\n<!-- Add these parameters before </qemu:commandline> -->\n<qemu:arg value='-acpitable'/>\n<qemu:arg value='file=/ssdt1.dat'/>\n
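The shared memory size from the comments in the configuration above can be computed with a few lines of code. This sketch follows the formula as stated in this post (width x height / 131072, rounded up to a power of 2); the function name shmem_size_mb is hypothetical.

```python
def shmem_size_mb(width, height):
    '''Return the shared memory size in MB for a given display resolution.'''
    raw_mb = width * height / 131072
    size = 1
    # Round up to the next power of 2.
    while size < raw_mb:
        size *= 2
    return size


# The value in bytes is needed if the size is passed to QEMU directly.
size_bytes_4k = shmem_size_mb(3840, 2160) * 1048576
```

For a 3840 x 2160 dummy plug this yields 64, the value used in the shmem element.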
\nThe ssdt1.dat file is an ACPI table that emulates a fully charged battery. It corresponds to the Base64 below, and can be converted to a binary file with a Base64 decoding website or downloaded from this site. Place it in the root folder (/), matching the file=/ssdt1.dat path in the configuration above.
\nU1NEVKEAAAAB9EJPQ0hTAEJYUENTU0RUAQAAAElOVEwYEBkgoA8AFVwuX1NCX1BDSTAGABBMBi5f\nU0JfUENJMFuCTwVCQVQwCF9ISUQMQdAMCghfVUlEABQJX1NUQQCkCh8UK19CSUYApBIjDQELcBcL\ncBcBC9A5C1gCCywBCjwKPA0ADQANTElPTgANABQSX0JTVACkEgoEAAALcBcL0Dk=\n
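If you would rather not paste the table into a website, the decoding can also be done locally. Below is a minimal Python sketch, assuming the Base64 text is copied exactly from above; the helper name decode_ssdt and the output file name in the current directory are choices made for this example.

```python
import base64
from pathlib import Path

# The Base64 text from this post, joined into one string.
SSDT_B64 = (
    "U1NEVKEAAAAB9EJPQ0hTAEJYUENTU0RUAQAAAElOVEwYEBkgoA8AFVwuX1NCX1BDSTAGABBMBi5f"
    "U0JfUENJMFuCTwVCQVQwCF9ISUQMQdAMCghfVUlEABQJX1NUQQCkCh8UK19CSUYApBIjDQELcBcL"
    "cBcBC9A5C1gCCywBCjwKPA0ADQANTElPTgANABQSX0JTVACkEgoEAAALcBcL0Dk="
)


def decode_ssdt(b64_text: str = SSDT_B64) -> bytes:
    """Decode the Base64 text and sanity-check the ACPI table header."""
    data = base64.b64decode(b64_text)
    # An ACPI table starts with a 4-byte signature and a 4-byte little-endian length.
    if data[:4] != b"SSDT":
        raise ValueError("decoded data does not start with the SSDT signature")
    if int.from_bytes(data[4:8], "little") != len(data):
        raise ValueError("table length field does not match the decoded size")
    return data


if __name__ == "__main__":
    # Written to the current directory; copy it to /ssdt1.dat as root afterwards.
    Path("ssdt1.dat").write_bytes(decode_ssdt())
```

The two header checks only catch copy-and-paste damage; they do not validate the table contents.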
\nModify permissions for the shared memory.
\nModify /etc/apparmor.d/local/abstractions/libvirt-qemu
and add this line:
/dev/shm/looking-glass rw,\n
\nThen execute sudo systemctl restart apparmor
to restart AppArmor.
Create /etc/tmpfiles.d/looking-glass.conf
with the following contents, replacing lantian
with your username:
f /dev/shm/looking-glass 0660 lantian kvm -\n
\nThen execute sudo systemd-tmpfiles /etc/tmpfiles.d/looking-glass.conf --create
to make it effective.
Start the VM and wait a while. Windows will automatically install NVIDIA drivers.
\nDevice by Connection
and verify that the NVIDIA GPU is at Bus 1, Slot 0, Function 0. The parent PCIe port to the dGPU should be at Bus 0, Slot 1, Function 0. SUBSYS
in its Hardware ID.\nShut down the virtual machine completely and start it again; a simple reboot is not enough. Confirm in Device Manager that the GPU is working.
\nDo one of the following:
\nC:\\IddSampleDriver
. Note that you must not move the folder anywhere else! C:\\IddSampleDriver\\option.txt
. You'll see the number 1 on the first line (don't change it), followed by a list of resolutions and refresh rates. Keep only the one resolution and refresh rate entry you want, and remove all other items. C:\\IddSampleDriver\\IddSampleDriver.inf
, and complete the installation. (2023-05) Newer versions of Looking Glass install the IVSHMEM driver (the driver for shared memory between VM and host) automatically, so you no longer need to install it manually. These manual installation steps are kept for reference only:
\n(2022-01) Download this Virtio driver, copy it into the VM, and extract it. You MUST use this copy, as no other copies have the IVSHMEM driver!
\n\nOpen Device Manager in the VM, and find System devices - PCI standard RAM controller
:
Virtio Drivers/Win10/amd64/ivshmem.inf
fileIVSHMEM
Install Looking Glass, a tool to transfer the display output from the VM to the host.
\n(2023-05) If you followed the steps from 2022-01, you won't be able to see the startup screen while the VM is booting and before Looking Glass starts. Therefore, I now recommend disabling the QXL virtual adapter in Device Manager instead. The following older steps are kept for reference only.
\n(2022-01) Turn off the VM and run virsh edit Windows
to edit the VM config.
Find <video><model type=\"qxl\" ...></video>
, change type
to none
to disable the QXL emulated GPU:
<video>\n <model type=\"none\"/>\n</video>\n
\nInstall the Looking Glass client on the host. Arch Linux users can simply install looking-glass
from AUR. Run looking-glass-client
to start the client.
Back in Virt-Manager, close the window of the VM (the window that shows the VM desktop and changes VM configurations), right-click on the VM in Virt-Manager's main window, and select Run.
\nIn a moment, you should see the VM's display in the Looking Glass client. Now the GPU passthrough setup is complete.
\nAlthough GPU passthrough is done, there is still room to improve the user experience. In particular:
\nWe will fix the problems one by one.
\n(2023-05) The latest Looking Glass can relay audio now. These steps are kept for reference only.
\nWhile Virt-Manager can connect to the VM with the SPICE protocol to get the VM's sound output, Looking Glass also relays keyboard and mouse events through SPICE. Since the VM only accepts one SPICE connection at a time, we cannot get the audio output with Virt-Manager.
\nWe can install Scream, a virtual sound card software in Windows, to transfer audio output over the network. A Scream client can be run on the host to receive the audio signal.
\nDownload the Scream installer from its download page on the VM, extract it, run Install-x64.bat
as administrator to install the driver, and reboot.
Install the Scream client on the host. Arch Linux users can install the scream
package from AUR.
Open a terminal on the host and run scream -v
. Test by playing some sound in the VM. If you can't hear anything, try specifying the network interface to the VM, like scream -i virbr0 -v
, where virbr0
is the default NAT network for Virt-Manager, and the network interface between the VM and the host.
Finally, you can create a systemd user service to run the Scream client conveniently later. Create ~/.config/systemd/user/scream.service
with the following content:
[Unit]\nDescription=Scream\n\n[Service]\nType=simple\nRestart=always\nRestartSec=1\nExecStart=/usr/bin/scream -i virbr0 -v\n\n[Install]\nWantedBy=graphical-session.target\n
\nYou will only need to run systemctl --user start scream
in the future.
The latest Looking Glass can relay keyboard and mouse events reliably now. These steps are kept for reference only.
\nThe keyboard and mouse relay in Looking Glass isn't very stable, as input events are occasionally dropped. Therefore, if you want to play some games in the VM, you need a more reliable way to pass your keyboard and mouse into the VM.
\nWe have two options: letting Libvirt capture the keyboard and mouse events, or simply passing your keyboard and mouse into the VM.
\nCapturing Keyboard and Mouse Events.
\nOn Linux, all keyboard and mouse operations are passed to the desktop environment via the evdev
(or Event Device
) framework. Libvirt can capture your operations and pass them to the VM. In addition, Libvirt switches control between the host and the VM whenever you press Left Ctrl and Right Ctrl together, so you can operate both the host and the VM with one keyboard-mouse combo.
First run ls -l /dev/input/by-path
on the host to see your present evdev
devices. For example, I have these:
pci-0000:00:14.0-usb-0:1:1.1-event-mouse # USB mouse\npci-0000:00:14.0-usb-0:1:1.1-mouse\npci-0000:00:14.0-usb-0:6:1.0-event\npci-0000:00:15.0-platform-i2c_designware.0-event-mouse # Builtin Touchpad\npci-0000:00:15.0-platform-i2c_designware.0-mouse\npci-0000:00:1f.3-platform-skl_hda_dsp_generic-event\nplatform-i8042-serio-0-event-kbd # Builtin Keyboard\nplatform-pcspkr-event-spkr\n
\nThose with event-mouse
are mice, and the event-kbd
ones are keyboards.
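The naming rule above can be applied automatically. The following is a small Python sketch, assuming only the suffix convention described here; classify_evdev is a name invented for this example.

```python
import os


def classify_evdev(names):
    """Group /dev/input/by-path entries into keyboards and mice by suffix."""
    devices = {"keyboards": [], "mice": []}
    for name in names:
        if name.endswith("-event-kbd"):
            devices["keyboards"].append(name)
        elif name.endswith("-event-mouse"):
            devices["mice"].append(name)
    return devices


if __name__ == "__main__":
    # Lists the evdev devices present on this host.
    found = classify_evdev(sorted(os.listdir("/dev/input/by-path")))
    for kind, names in found.items():
        for name in names:
            print(kind, "/dev/input/by-path/" + name)
```

Entries without an event-kbd or event-mouse suffix, such as the plain -mouse or -event ones, are skipped.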
Then, run virsh edit Windows
to edit the VM config. Add these into the <devices>
section:
<input type=\"evdev\">\n <!-- Change the mouse or keyboard path based on your ls result -->\n <source dev=\"/dev/input/by-path/platform-i8042-serio-0-event-kbd\" grab=\"all\" repeat=\"on\"/>\n</input>\n<!-- Repeat if you have multiple mice or keyboards -->\n<input type=\"evdev\">\n <source dev=\"/dev/input/by-path/pci-0000:00:15.0-platform-i2c_designware.0-event-mouse\" grab=\"all\" repeat=\"on\"/>\n</input>\n
\nStart the VM, and you should notice that your keyboard and mouse aren't working on the host. They're captured by the VM. Press Left Ctrl + Right Ctrl to return control to the host. Press again to control the VM.
\nNow we can disable the keyboard and mouse relay of Looking Glass. Create /etc/looking-glass-client.ini
with the following content:
[spice]\nenable=no\n
\nUSB Keyboard and Mouse Passthrough
\nCapturing keyboard and mouse operations doesn't always work. For example, my touchpad cannot be captured properly, as I can't move the cursor in the VM.
\nIf you run into the same issue and have a USB keyboard and mouse combo, you can pass them into the VM and use them specifically for it. USB passthrough to VM is a mature technology, so the chance of running into problems is very low.
\nSimply click Add Hardware - USB Host Device
in Virt-Manager and select your keyboard and mouse.
\n\nMost of the content in this section is from https://looking-glass.io/docs/B6/module/
\n
Looking Glass provides a kernel module for the IVSHMEM shared memory device. It allows Looking Glass to read the display output efficiently with DMA to improve the framerate.
\nInstall the Linux kernel headers and DKMS. These are the linux-headers
and dkms
packages on Arch Linux.
Install looking-glass-module-dkms
from AUR.
Set up an Udev rule: create /etc/udev/rules.d/99-kvmfr.rules
with the following content:
SUBSYSTEM==\"kvmfr\", OWNER=\"lantian\", GROUP=\"kvm\", MODE=\"0660\"\n
\nReplace lantian
with your own username.
Configure the memory size: create /etc/modprobe.d/looking-glass.conf
with the following content:
# The memory size is calculated in the same way as the VM's shmem.\noptions kvmfr static_size_mb=64\n
\nLoad the module automatically on boot: create /etc/modules-load.d/looking-glass.conf
with a single line of kvmfr
.
Run sudo modprobe kvmfr
to load the module. Now a kvmfr0
device should appear under /dev
, and this is the memory device for Looking Glass.
Edit /etc/apparmor.d/local/abstractions/libvirt-qemu
and add this line:
/dev/kvmfr0 rw,\n
\nIt allows the VM to access the device. Run sudo systemctl restart apparmor
to restart AppArmor.
Run virsh edit Windows
to change the VM's configuration:
Delete <shmem>
section from <devices>
:
<shmem name='looking-glass'>\n <model type='ivshmem-plain'/>\n <size unit='M'>64</size>\n</shmem>\n
\nAdd these lines under <qemu:commandline>
:
<qemu:arg value=\"-device\"/>\n<qemu:arg value='{\"driver\":\"ivshmem-plain\",\"id\":\"shmem-looking-glass\",\"memdev\":\"looking-glass\"}'/>\n<qemu:arg value=\"-object\"/>\n<!-- The number 67108864 in the next line is the size in bytes: 64 MB * 1048576 -->\n<!-- Change accordingly if you've set a different memory size -->\n<qemu:arg value='{\"qom-type\":\"memory-backend-file\",\"id\":\"looking-glass\",\"mem-path\":\"/dev/kvmfr0\",\"size\":67108864,\"share\":true}'/>\n
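If you picked a different memory size, the JSON for the -object argument has to carry the matching byte count. Here is a minimal Python sketch that rebuilds that JSON string from a size in MB; the function name kvmfr_object_arg is hypothetical, and the keys are copied from the configuration above.

```python
import json


def kvmfr_object_arg(size_mb: int, mem_path: str = "/dev/kvmfr0") -> str:
    """Build the JSON value of the -object argument for a given size in MB."""
    return json.dumps(
        {
            "qom-type": "memory-backend-file",
            "id": "looking-glass",
            "mem-path": mem_path,
            # QEMU expects the size in bytes: MB * 1048576.
            "size": size_mb * 1048576,
            "share": True,
        },
        separators=(",", ":"),
    )


if __name__ == "__main__":
    print(kvmfr_object_arg(64))
```

The size passed here should be the same value as static_size_mb in the kvmfr module options.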
\nStart the VM.
\nEdit /etc/looking-glass-client.ini
and add the following content:
[app]\nshmFile=/dev/kvmfr0\n
\nStart Looking Glass. You should see the VM display now.
\n(2023-05) If you use NixOS, you can directly use the config below:
\n{ config, ... }:\n{\n boot.extraModulePackages = with config.boot.kernelPackages; [\n kvmfr\n ];\n boot.extraModprobeConfig = ''\n # The memory size is calculated in the same way as the VM's shmem.\n options kvmfr static_size_mb=64\n '';\n boot.kernelModules = [\"kvmfr\"];\n services.udev.extraRules = ''\n SUBSYSTEM==\"kvmfr\", OWNER=\"root\", GROUP=\"libvirtd\", MODE=\"0660\"\n '';\n\n environment.etc.\"looking-glass-client.ini\".text = ''\n [app]\n shmFile=/dev/kvmfr0\n '';\n}\n
\n2022-01-26 Update: testing shows that the NVIDIA GPU still isn't completely shut down after applying the patch, and the power draw is the same as before. This section is now obsolete.
\n\n\nThis section only applies to NVIDIA 20-series GPUs or newer. They can shut themselves down with the official NVIDIA drivers. The 10-series or older GPUs don't support this feature.
\nThis section involves compiling a kernel yourself and using a patch that hasn't been extensively reviewed or tested. It is not intended for novice users. Evaluate the risks yourself.
\n
When you aren't using the VM, the vfio-pci
driver in charge of PCIe passthrough sets the device to the D3
mode, aka the power saving mode of PCIe devices. But there are two types of D3
modes: D3hot
, where the device is still powered, and D3cold
, where the device is shut off completely. Currently, the vfio-pci
driver in the kernel only supports D3hot
, and the NVIDIA GPU will still consume around 10 watts of power since its chip power isn't cut. This impacts the battery life of laptops.
An NVIDIA engineer posted a patchset for vfio-pci
's D3cold
support on the Linux kernel mailing list. With this patchset, the NVIDIA GPU will be shut down completely when the VM is off. It saves power for your battery.
The patchset can be found at https://lore.kernel.org/lkml/20211115133640.2231-1-abhsahu@nvidia.com/T/, which consists of three patches. I combined the three patches and uploaded the result to https://github.com/xddxdd/pkgbuild/blob/master/linux-xanmod-lantian/0007-vfio-pci-d3cold.patch.
\nPatching the kernel is relatively simple on Arch Linux. Most kernel PKGBUILDs in AUR can apply patches automatically. All you have to do is download the PKGBUILD for a kernel and add the patch to its source
section. See my commit for an example: https://github.com/xddxdd/pkgbuild/commit/406adb7bf5657cfe07bb17ff561d11ed97ebab39.
DO NOTE that this patch doesn't guarantee stability.
\nBased on mailing list discussions:
\n[RFC]
in the title of the e-mails. D3cold
mode, with this patch, there are risks of resetting the GPU, losing all state, and crashing the VM. Although I have never encountered such problems myself, you should be aware of the possible outcomes. Use this at your own risk.
Huge thanks to previous explorers on the topic of GPU passthrough. Without their efforts, this post wouldn't have existed in the first place.
\nHere are the sources I referenced when I did my configuration:
\n<domain xmlns:qemu=\"http://libvirt.org/schemas/domain/qemu/1.0\" type=\"kvm\">\n <name>Windows11</name>\n <uuid>5d5b00d8-475a-4b6c-8053-9dda30cd2f95</uuid>\n <metadata>\n <libosinfo:libosinfo xmlns:libosinfo=\"http://libosinfo.org/xmlns/libvirt/domain/1.0\">\n <libosinfo:os id=\"http://microsoft.com/win/11\"/>\n </libosinfo:libosinfo>\n </metadata>\n <memory unit=\"KiB\">16777216</memory>\n <currentMemory unit=\"KiB\">16777216</currentMemory>\n <vcpu placement=\"static\">16</vcpu>\n <os>\n <type arch=\"x86_64\" machine=\"pc-q35-8.0\">hvm</type>\n <loader readonly=\"yes\" type=\"pflash\">/run/libvirt/nix-ovmf/OVMF_CODE.fd</loader>\n <nvram template=\"/run/libvirt/nix-ovmf/OVMF_VARS.fd\">/var/lib/libvirt/qemu/nvram/Windows11_VARS.fd</nvram>\n </os>\n <features>\n <acpi/>\n <apic/>\n <hyperv mode=\"custom\">\n <relaxed state=\"on\"/>\n <vapic state=\"on\"/>\n <spinlocks state=\"on\" retries=\"8191\"/>\n <vpindex state=\"on\"/>\n <runtime state=\"on\"/>\n <synic state=\"on\"/>\n <stimer state=\"on\"/>\n <reset state=\"on\"/>\n <vendor_id state=\"on\" value=\"GenuineIntel\"/>\n <frequencies state=\"on\"/>\n <tlbflush state=\"on\"/>\n </hyperv>\n <kvm>\n <hidden state=\"on\"/>\n </kvm>\n <vmport state=\"off\"/>\n </features>\n <cpu mode=\"host-passthrough\" check=\"none\" migratable=\"on\">\n <topology sockets=\"1\" dies=\"1\" cores=\"8\" threads=\"2\"/>\n </cpu>\n <clock offset=\"localtime\">\n <timer name=\"rtc\" tickpolicy=\"catchup\"/>\n <timer name=\"pit\" tickpolicy=\"delay\"/>\n <timer name=\"hpet\" present=\"no\"/>\n <timer name=\"hypervclock\" present=\"yes\"/>\n </clock>\n <on_poweroff>destroy</on_poweroff>\n <on_reboot>restart</on_reboot>\n <on_crash>destroy</on_crash>\n <pm>\n <suspend-to-mem enabled=\"no\"/>\n <suspend-to-disk enabled=\"no\"/>\n </pm>\n <devices>\n <emulator>/run/libvirt/nix-emulators/qemu-system-x86_64</emulator>\n <disk type=\"file\" device=\"disk\">\n <driver name=\"qemu\" type=\"qcow2\" discard=\"unmap\"/>\n <source 
file=\"/var/lib/libvirt/images/Windows11.qcow2\"/>\n <target dev=\"vda\" bus=\"virtio\"/>\n <boot order=\"1\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x04\" slot=\"0x00\" function=\"0x0\"/>\n </disk>\n <disk type=\"file\" device=\"cdrom\">\n <driver name=\"qemu\" type=\"raw\"/>\n <source file=\"/mnt/root/persistent/media/LegacyOS/Common/virtio-win-0.1.215.iso\"/>\n <target dev=\"sdb\" bus=\"sata\"/>\n <readonly/>\n <address type=\"drive\" controller=\"0\" bus=\"0\" target=\"0\" unit=\"1\"/>\n </disk>\n <controller type=\"usb\" index=\"0\" model=\"qemu-xhci\" ports=\"15\">\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x02\" slot=\"0x00\" function=\"0x0\"/>\n </controller>\n <controller type=\"pci\" index=\"0\" model=\"pcie-root\"/>\n <controller type=\"pci\" index=\"1\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"1\" port=\"0x10\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x0\" multifunction=\"on\"/>\n </controller>\n <controller type=\"pci\" index=\"2\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"2\" port=\"0x11\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x1\"/>\n </controller>\n <controller type=\"pci\" index=\"3\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"3\" port=\"0x12\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x2\"/>\n </controller>\n <controller type=\"pci\" index=\"4\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"4\" port=\"0x13\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x3\"/>\n </controller>\n <controller type=\"pci\" index=\"5\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"5\" port=\"0x14\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x4\"/>\n </controller>\n 
<controller type=\"pci\" index=\"6\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"6\" port=\"0x15\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x5\"/>\n </controller>\n <controller type=\"pci\" index=\"7\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"7\" port=\"0x16\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x6\"/>\n </controller>\n <controller type=\"pci\" index=\"8\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"8\" port=\"0x17\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x02\" function=\"0x7\"/>\n </controller>\n <controller type=\"pci\" index=\"9\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"9\" port=\"0x18\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x03\" function=\"0x0\" multifunction=\"on\"/>\n </controller>\n <controller type=\"pci\" index=\"10\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"10\" port=\"0x19\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x03\" function=\"0x1\"/>\n </controller>\n <controller type=\"pci\" index=\"11\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"11\" port=\"0x1a\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x03\" function=\"0x2\"/>\n </controller>\n <controller type=\"pci\" index=\"12\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"12\" port=\"0x1b\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x03\" function=\"0x3\"/>\n </controller>\n <controller type=\"pci\" index=\"13\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"13\" port=\"0x1c\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x03\" function=\"0x4\"/>\n </controller>\n <controller 
type=\"pci\" index=\"14\" model=\"pcie-root-port\">\n <model name=\"pcie-root-port\"/>\n <target chassis=\"14\" port=\"0x1d\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x03\" function=\"0x5\"/>\n </controller>\n <controller type=\"sata\" index=\"0\">\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x1f\" function=\"0x2\"/>\n </controller>\n <controller type=\"virtio-serial\" index=\"0\">\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x03\" slot=\"0x00\" function=\"0x0\"/>\n </controller>\n <interface type=\"network\">\n <mac address=\"52:54:00:f4:bf:15\"/>\n <source network=\"default\"/>\n <model type=\"virtio\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x01\" slot=\"0x00\" function=\"0x0\"/>\n </interface>\n <serial type=\"pty\">\n <target type=\"isa-serial\" port=\"0\">\n <model name=\"isa-serial\"/>\n </target>\n </serial>\n <console type=\"pty\">\n <target type=\"serial\" port=\"0\"/>\n </console>\n <channel type=\"spicevmc\">\n <target type=\"virtio\" name=\"com.redhat.spice.0\"/>\n <address type=\"virtio-serial\" controller=\"0\" bus=\"0\" port=\"1\"/>\n </channel>\n <input type=\"mouse\" bus=\"ps2\"/>\n <input type=\"mouse\" bus=\"virtio\">\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x06\" slot=\"0x00\" function=\"0x0\"/>\n </input>\n <input type=\"keyboard\" bus=\"ps2\"/>\n <input type=\"keyboard\" bus=\"virtio\">\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x07\" slot=\"0x00\" function=\"0x0\"/>\n </input>\n <tpm model=\"tpm-crb\">\n <backend type=\"passthrough\">\n <device path=\"/dev/tpm0\"/>\n </backend>\n </tpm>\n <graphics type=\"spice\" autoport=\"yes\">\n <listen type=\"address\"/>\n <image compression=\"off\"/>\n </graphics>\n <sound model=\"ich9\">\n <audio id=\"1\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x1b\" function=\"0x0\"/>\n </sound>\n <audio id=\"1\" type=\"spice\"/>\n <video>\n <model type=\"qxl\" ram=\"65536\" vram=\"65536\" vgamem=\"16384\" heads=\"1\" 
primary=\"yes\"/>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x00\" slot=\"0x01\" function=\"0x0\"/>\n </video>\n <hostdev mode=\"subsystem\" type=\"pci\" managed=\"yes\">\n <source>\n <address domain=\"0x0000\" bus=\"0x01\" slot=\"0x00\" function=\"0x0\"/>\n </source>\n <address type=\"pci\" domain=\"0x0000\" bus=\"0x05\" slot=\"0x00\" function=\"0x0\"/>\n </hostdev>\n <redirdev bus=\"usb\" type=\"spicevmc\">\n <address type=\"usb\" bus=\"0\" port=\"2\"/>\n </redirdev>\n <redirdev bus=\"usb\" type=\"spicevmc\">\n <address type=\"usb\" bus=\"0\" port=\"3\"/>\n </redirdev>\n <watchdog model=\"itco\" action=\"reset\"/>\n <memballoon model=\"none\"/>\n </devices>\n <qemu:commandline>\n <qemu:arg value=\"-device\"/>\n <qemu:arg value='{\"driver\":\"ivshmem-plain\",\"id\":\"shmem0\",\"memdev\":\"looking-glass\"}'/>\n <qemu:arg value=\"-object\"/>\n <qemu:arg value='{\"qom-type\":\"memory-backend-file\",\"id\":\"looking-glass\",\"mem-path\":\"/dev/kvmfr0\",\"size\":67108864,\"share\":true}'/>\n <qemu:arg value=\"-acpitable\"/>\n <qemu:arg value=\"file=/etc/ssdt1.dat\"/>\n </qemu:commandline>\n</domain>\n