{"id":3474,"date":"2024-03-05T18:15:10","date_gmt":"2024-03-05T16:15:10","guid":{"rendered":"https:\/\/vasil.ludost.net\/blog\/?p=3474"},"modified":"2024-03-08T12:25:02","modified_gmt":"2024-03-08T10:25:02","slug":"2024-03-05-the-failed-live-migration-case","status":"publish","type":"post","link":"https:\/\/vasil.ludost.net\/blog\/?p=3474","title":{"rendered":"2024-03-05 The failed live migration case"},"content":{"rendered":"<p>There&#8217;s a purpose in everyone&#8217;s life. Some days, it seems like mine is to show people what not to do.<\/p>\n<p>I&#8217;ve written this explanation to show my mistakes and that, hopefully, other people don&#8217;t (there&#8217;s no hope for me). What follows is most of my debugging process, and it could be helpful to other people to see it (and point out what is wrong with it).<\/p>\n<p>This happens in a StorPool deployment for a customer who runs a public cloud service. We provide the shared storage via some drivers of our own, and the customer uses OpenNebula to control Linux\/KVM hypervisors. It&#8217;s all integrated very nicely and is used for our internal clouds.<\/p>\n<p>KVM VMs are a <code>qemu<\/code> process in the Linux userspace with multiple threads.<\/p>\n<p>The case started with a customer complaining that after an update we did to the storage software, the live migrations of his VMs started failing (the VMs would fail to move but would continue working on the original host). The problem was narrowed down to VMs started before a specific date (when we did the update mentioned above), and initially, it seemed to be contained to VMs that had QEMU CPU set (i.e., the least performant one).<\/p>\n<p>At this point, I asked &#8211; can&#8217;t they restart all such VMs? This CPU type is mainly contributing to global warming anyway.<br \/>\nThe reply was, &#8220;Are you insane? These are 900 VMs. We&#8217;ll reboot them, but we really don&#8217;t want to do that now&#8221;.<br \/>\nFair point.<\/p>\n<p>Then, some further tests showed that any VMs, not just the ones with the QEMU CPU, were failing the migration. Finally, someone looked into the (not easy to find) logs of <code>qemu<\/code> and found the following:<\/p>\n<pre>\n2024-02-20T09:46:41.938285Z qemu-kvm-one: qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1)\n<\/pre>\n<p>(seriously. Trying to trace anything via the logs in OpenNebula\/libvirt\/qemu feels like interrogating prisoners who don&#8217;t speak your language.)<\/p>\n<p>Two mistakes above that would&#8217;ve saved a day or two &#8211; I should&#8217;ve looked into the issue immediately myself instead of relying on hearsay and building the wrong picture (based on assumptions) in my head. I did not expect that the (routine) update we did could have any such effect.<\/p>\n<p>I specifically checked if the same problem existed in our labs and test environments, where we had done the same routine update, and it didn&#8217;t exist.<\/p>\n<p>Now, this was weird. The only mention of the above error was in <a href=\"https:\/\/bugzilla.redhat.com\/show_bug.cgi?id=2058982\">https:\/\/bugzilla.redhat.com\/show_bug.cgi?id=2058982<\/a> when the storage device on the originating side was broken. This was not the case here.<\/p>\n<p>My first step was to get permission to use one VM at the customer for tests, and after I had that, I ran strace on it while migrating.<\/p>\n<p>What I was able to see in the thread that performed the migration was:<\/p>\n<pre>\nprctl(PR_SET_NAME, \"live_migration\")    = 0\n\u2026\nsendmsg(99, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base=\"...\"..., iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 32768\nsendmsg(99, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base=\"...\"..., iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 32768\nwrite(2, \"2024-02-15T07:00:23.057678Z \", 28) = 28\nwrite(2, \"qemu-kvm-one:\", 13)           = 13\nwrite(2, \" \", 1)                        = 1\nwrite(2, \"qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1)\", 82) = 82\nwrite(2, \"\\n\", 1)                       = 1\nfutex(0x55586af89640, FUTEX_WAKE_PRIVATE, 1) = 1\nsendmsg(20, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base=\"{\\\"timestamp\\\": {\\\"seconds\\\": 1707980423, \\\"microseconds\\\": 57880}, \\\"event\\\": \\\"MIGRATION\\\", \\\"data\\\": {\\\"status\\\": \\\"failed\\\"}}\\r\\n\", iov_len=115}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 115\nfutex(0x55586af89640, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)\nsendmsg(20, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base=\"{\\\"timestamp\\\": {\\\"seconds\\\": 1707980423, \\\"microseconds\\\": 58133}, \\\"event\\\": \\\"RESUME\\\"}\\r\\n\", iov_len=82}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 82\n<\/pre>\n<p>Some digging showed that <code>99<\/code> was the file descriptor with the connection to the remote qemu. This was the memory transfer, followed almost immediately by writing out the error and nothing in between.<\/p>\n<p>So, being somewhat lost, I looked up debug symbols for qemu to see if I could trace this further. The usual tooling told me there weren&#8217;t:<\/p>\n<pre>\n[root@somewhere str3]# <b>gdb \/usr\/libexec\/qemu-kvm<\/b>\nGNU gdb (GDB) Rocky Linux 8.2-20.el8.0.1\nCopyright (C) 2018 Free Software Foundation, Inc.\nLicense GPLv3+: GNU GPL version 3 or later &lt;http:\/\/gnu.org\/licenses\/gpl.html&gt;\nThis is free software: you are free to change and redistribute it.\nThere is NO WARRANTY, to the extent permitted by law.\nType \"show copying\" and \"show warranty\" for details.\nThis GDB was configured as \"x86_64-redhat-linux-gnu\".\nType \"show configuration\" for configuration details.\nFor bug reporting instructions, please see:\n&lt;http:\/\/www.gnu.org\/software\/gdb\/bugs\/&gt;.\nFind the GDB manual and other documentation resources online at:\n    &lt;http:\/\/www.gnu.org\/software\/gdb\/documentation\/&gt;.\n\nFor help, type \"help\".\nType \"apropos word\" to search for commands related to \"word\"...\nReading symbols from \/usr\/libexec\/qemu-kvm...Reading symbols from .gnu_debugdata for \/usr\/libexec\/qemu-kvm...(no debugging symbols found)...done.\n(no debugging symbols found)...done.\nMissing separate debuginfos, use: yum debuginfo-install qemu-kvm-core-6.2.0-33.module+el8.8.0+1454+0b2cbfb8.x86_64\n(gdb) <b>quit<\/b>\n\n[root@somewhere str3]# <b>yum debuginfo-install qemu-kvm-core-6.2.0-33.module+el8.8.0+1454+0b2cbfb8.x86_64<\/b>\nRepository storpool-contrib is listed more than once in the configuration\nenabling epel-debuginfo repository\nExtra Packages for Enterprise Linux 8 - x86_64 - Debug                                                                                                                                                                                            4.0 MB\/s | 6.1 MB     00:01    \nLast metadata expiration check: 0:00:02 ago on Thu Feb 15 16:26:34 2024.\nCould not find debuginfo package for the following installed packages: qemu-kvm-core-15:6.2.0-33.module+el8.8.0+1454+0b2cbfb8.x86_64\nCould not find debugsource package for the following installed packages: qemu-kvm-core-15:6.2.0-33.module+el8.8.0+1454+0b2cbfb8.x86_64\nDependencies resolved.\nNothing to do.\nComplete!\n<\/pre>\n<p>(I can already hear people saying, &#8220;You&#8217;re an idiot&#8221;; someone had to point that out to me)<\/p>\n<p>Then, the colleague in communication with the customer said, &#8220;BTW, do you remember that at that time we did the other update, to migrate from one-type-of-volume-name to the-other-type-of-volume-name&#8221;? Of course, I didn&#8217;t. We ran the same migration in our lab, and the issue was reproduced immediately.<\/p>\n<p>The mistake was that I did not collect all available information (or even most of it), still relying on assumptions.<\/p>\n<p>I could deploy the latest qemu for AlmaLinux (almost the same as what the customer was using), download debug symbols, and start digging. Yay.<\/p>\n<p>On <code>strace<\/code>, the issue looked the same &#8211; lots of sending to the socket and then just the error. So, I started digging with <code>gdb<\/code> by attaching, adding a breakpoint, continuing the process, and then running a live migration to see how the situation looked like at different points of the program. I had something like this primed and pasted when I ran <code>gdb -p xxxx<\/code> (some comments added to explain what these are): <\/p>\n<pre>\nset print pretty\t\t\t<i># the usual gdb print is hard to read<\/i>\nset pagination off\t\t\t<i># pagination is annoying in this regard, my terminal can scroll<\/i>\nhandle SIGUSR1 nostop pass\t\t<i># qemu uses SIGUSR1 a lot<\/i>\nbreak block\/block-backend.c:253\t\t<i># stop here<\/i>\ncont\t\t\t\t\t<i># continue execution<\/i>\nbt full\t\t\t\t\t<i># print me a full backtrace with local variables when stopped<\/i>\n<\/pre>\n<p>It took some time to read through the relevant code and ensure there wasn&#8217;t something obvious (which took many wrong turns). The issue looked like that <code>bdrv_inactivate_recurse()<\/code> was trying to remove write permissions from the devices and was failing for some reason.<\/p>\n<pre>\n6523     \/* Inactivate this node *\/\n6524     if (bs-&gt;drv-&gt;bdrv_inactivate) {\n6525         ret = bs-&gt;drv-&gt;bdrv_inactivate(bs);\n6526         if (ret &lt; 0) {\n6527             return ret;\n6528         }\n6529     }\n6530 \n6531     QLIST_FOREACH(parent, &bs-&gt;parents, next_parent) {\n6532         if (parent-&gt;klass-&gt;inactivate) {\n6533             ret = parent-&gt;klass-&gt;inactivate(parent);\n6534             if (ret &lt; 0) {\n6535                 return ret;\n6536             }\n6537         }\n6538     }\n6539 \n6540     bdrv_get_cumulative_perm(bs, &cumulative_perms,\n6541                              &cumulative_shared_perms);\n6542     if (cumulative_perms & (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED)) {\n6543         \/* Our inactive parents still need write access. Inactivation failed. *\/\n6544         return -EPERM;\n6545     }\n\n<\/pre>\n<p>This is the code in question, and in <code>bdrv_get_cumulative_perms()<\/code>, there were some devices that still had r\/w set even after the stuff between lines 6531 and 6533 had passed:<\/p>\n<pre>\nbdrv_inactivate_recurse (bs=0x556e00221530) at ..\/block.c:6534\n6534                if (ret &lt; 0) {\n(gdb) \n6531        QLIST_FOREACH(parent, &bs-&gt;parents, next_parent) {\n(gdb) \n6540        bdrv_get_cumulative_perm(bs, &cumulative_perms,\n(gdb) \n6542        if (cumulative_perms & (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED)) {\n(gdb) \nbdrv_inactivate_all () at ..\/block.c:6591\n6591            if (ret &lt; 0) {\n(gdb) \n6592                bdrv_next_cleanup(&it);\n(gdb) \n\n<\/pre>\n<p>The <code>inactivate()<\/code> function above was in <code>block\/block-backend.c<\/code> and looks like this:<\/p>\n<pre>\n 250 static int blk_root_inactivate(BdrvChild *child)\n 251 {\n 252     BlockBackend *blk = child-&gt;opaque;\n 253 \n 254     if (blk-&gt;disable_perm) {\n 255         return 0;\n 256     }\n 257 \n 258     if (!blk_can_inactivate(blk)) {\n 259         return -EPERM;\n 260     }\n 261 \n 262     blk-&gt;disable_perm = true;\n 263     if (blk-&gt;root) {\n 264         bdrv_child_try_set_perm(blk-&gt;root, 0, BLK_PERM_ALL, &error_abort);\n 265     }\n 266 \n 267     return 0;\n 268 }\n<\/pre>\n<p>and what was happening was:<\/p>\n<pre>\n258         if (!blk_can_inactivate(blk)) {\n(gdb) \n262         blk-&gt;disable_perm = true;\n(gdb) \n263         if (blk-&gt;root) {\n(gdb) \n264             bdrv_child_try_set_perm(blk-&gt;root, 0, BLK_PERM_ALL, &error_abort);\n(gdb) \nbdrv_inactivate_recurse (bs=0x556e00221530) at ..\/block.c:6534\n<\/pre>\n<p>So <code>qemu<\/code> is calling it, and that was all OK. <\/p>\n<p>Here, a few times, I had missed what this was trying to do. Here&#8217;s <code>bdrv_child_try_set_perm()<\/code>:<\/p>\n<pre>\n2517 \n2518 int bdrv_child_try_set_perm(BdrvChild *c, uint64_t perm, uint64_t shared,\n2519                             Error **errp)\n2520 {\n2521     Error *local_err = NULL;\n2522     Transaction *tran = tran_new();\n2523     int ret;\n2524 \n2525     bdrv_child_set_perm(c, perm, shared, tran);\n2526 \n2527     ret = bdrv_refresh_perms(c-&gt;bs, &local_err);\n2528 \n2529     tran_finalize(tran, ret);\n2530 \n2531     if (ret &lt; 0) {\n2532         if ((perm & ~c-&gt;perm) || (c-&gt;shared_perm & ~shared)) {\n2533             \/* tighten permissions *\/\n2534             error_propagate(errp, local_err);\n2535         } else {\n2536             \/*\n2537              * Our caller may intend to only loosen restrictions and\n2538              * does not expect this function to fail.  Errors are not\n2539              * fatal in such a case, so we can just hide them from our\n2540              * caller.\n2541              *\/\n2542             error_free(local_err);\n2543             ret = 0;\n2544         }\n2545     }\n2546 \n2547     return ret;\n2548 }\n<\/pre>\n<p>So that <code>0<\/code> being passed is &#8220;remove all permissions&#8221;, but at some time late in the day, you can&#8217;t see it\u2026<\/p>\n<p>And now, I break into that one and try to see what happens:<\/p>\n<pre>\n(gdb) <b>n<\/b>\n2197        *s = (BdrvChildSetPermState) {\n(gdb) <b>print *(local_err)<\/b>\nNo symbol \"local_err\" in current context.\n(gdb) \nNo symbol \"local_err\" in current context.\n(gdb) <b>n<\/b>\n2203        c-&gt;perm = perm;\n(gdb) <b>print *(local_err)<\/b>\nNo symbol \"local_err\" in current context.\n(gdb) <b>n<\/b>\n2204        c-&gt;shared_perm = shared;\n(gdb) <b>print *(local_err)<\/b>\nNo symbol \"local_err\" in current context.\n(gdb) <b>n<\/b>\n2206        tran_add(tran, &bdrv_child_set_pem_drv, s);\n(gdb) <b>n<\/b>\nbdrv_child_try_set_perm (c=0x556e002e1730, perm=perm@entry=0, shared=shared@entry=31, errp=0x556dfeb99258 &lt;error_abort&gt;) at ..\/block.c:2527\n2527        ret = bdrv_refresh_perms(c-&gt;bs, &local_err);\n(gdb) print *(local_err)\nCannot access memory at address 0x0\n(gdb) <b>n<\/b>\n2529        tran_finalize(tran, ret);\n(gdb) <b>print *(local_err)<\/b>\n$1 = {\n  msg = 0x7f25a4041b70 \"Could not open '\/var\/lib\/one\/\/datastores\/0\/310\/disk.0': No such file or directory\", \n  err_class = ERROR_CLASS_GENERIC_ERROR, \n  src = 0x556dfe63ca2a \"..\/util\/osdep.c\", \n  func = 0x556dfe63cbc0 &lt;__func__.26488&gt; \"qemu_open_internal\", \n  line = 359, \n  hint = 0x0\n}\n(gdb) <b>q<\/b>\n<\/pre>\n<p>Now, I see it going through the function, and I&#8217;m trying to look at that <code>local_err<\/code> if there&#8217;s something in it. Finally, I get the actual error; it has happened somewhere and is not propagated &#8211; but <code>qemu<\/code> has tried opening a file, and that doesn&#8217;t exist.<\/p>\n<p>Two things here:<br \/>\n&#8211; This file does exist (it&#8217;s a symlink to a symlink to a block device, and I can access it with the permissions of the process);<br \/>\n&#8211; Why isn&#8217;t this error logged or propagated? I should probably send a patch for that\u2026<br \/>\n&#8211; This was not in what <code>strace<\/code> showed, so what the hell.<\/p>\n<p>So, I break at <code>util\/osdep.c<\/code> in <code>qemu_open_internal()<\/code>, to see what happens. I pass through that a few times, trying to see where that error comes from, because I still think there is no syscall there, and glibc is screwing with me. This goes to show that you shouldn&#8217;t debug when tired because glibc returning <code>ENOENT<\/code> on its own is a lot less likely than <code>strace<\/code> missing one syscall.<\/p>\n<p>I&#8217;m going to leave out a lot of unsuccessful attempts and show you how I got to the next clue in one large gdb session (commends with #):<\/p>\n<pre>\n(gdb) <b>break util\/osdep.c:312<\/b>\nBreakpoint 1 at 0x556dfe2c5c71: file ..\/util\/osdep.c, line 312.\n(gdb) handle SIGUSR1 nostop pass\nSignal        Stop      Print   Pass to program Description\nSIGUSR1       No        Yes     Yes             User defined signal 1\n(gdb) <b>set print pretty<\/b>\n(gdb) <b>set pagination off<\/b>\n(gdb) <b>cont<\/b>\nContinuing.\n[New Thread 0x7f259c9ba700 (LWP 1798462)]\n[New Thread 0x7f259c1b9700 (LWP 1798463)]\n\nThread 5 \"CPU 0\/KVM\" received signal SIGUSR1, User defined signal 1.\n\nThread 5 \"CPU 0\/KVM\" received signal SIGUSR1, User defined signal 1.\n\nThread 5 \"CPU 0\/KVM\" received signal SIGUSR1, User defined signal 1.\n[Switching to Thread 0x7f259c1b9700 (LWP 1798463)]\n\nThread 8 \"live_migration\" hit Breakpoint 1, qemu_open_internal (name=name@entry=0x556e0021b241 \"\/var\/lib\/one\/\/datastores\/0\/310\/disk.0\", flags=16384, mode=mode@entry=0, errp=errp@entry=0x7f259c1b85e0) at ..\/util\/osdep.c:318\n318         if (strstart(name, \"\/dev\/fdset\/\", &fdset_id_str)) {\n(gdb) <b>bt full<\/b>\n\u2026 \t\t\t\t<i># nothing interesting in the backtrace<\/i>\n(gdb) <b>disassemble <\/b>\nDump of assembler code for function qemu_open_internal:\n   0x0000556dfe2c5c40 &lt;+0&gt;:     push   %r15\n   0x0000556dfe2c5c42 &lt;+2&gt;:     push   %r14\n   0x0000556dfe2c5c44 &lt;+4&gt;:     mov    %rcx,%r14\n   0x0000556dfe2c5c47 &lt;+7&gt;:     push   %r13\n   0x0000556dfe2c5c49 &lt;+9&gt;:     mov    %edx,%r13d\n   0x0000556dfe2c5c4c &lt;+12&gt;:    push   %r12\n   0x0000556dfe2c5c4e &lt;+14&gt;:    mov    %esi,%r12d\n   0x0000556dfe2c5c51 &lt;+17&gt;:    lea    0x376dad(%rip),%rsi        # 0x556dfe63ca05\n   0x0000556dfe2c5c58 &lt;+24&gt;:    push   %rbp\n   0x0000556dfe2c5c59 &lt;+25&gt;:    mov    %rdi,%rbp\n   0x0000556dfe2c5c5c &lt;+28&gt;:    push   %rbx\n   0x0000556dfe2c5c5d &lt;+29&gt;:    sub    $0x28,%rsp\n   0x0000556dfe2c5c61 &lt;+33&gt;:    mov    %fs:0x28,%rax\n   0x0000556dfe2c5c6a &lt;+42&gt;:    mov    %rax,0x18(%rsp)\n   0x0000556dfe2c5c6f &lt;+47&gt;:    xor    %eax,%eax\n=&gt; 0x0000556dfe2c5c71 &lt;+49&gt;:    lea    0x10(%rsp),%rdx\n   0x0000556dfe2c5c76 &lt;+54&gt;:    callq  0x556dfe2c6a30 &lt;strstart&gt;\n   0x0000556dfe2c5c7b &lt;+59&gt;:    test   %eax,%eax\n   0x0000556dfe2c5c7d &lt;+61&gt;:    je     0x556dfe2c5cd8 &lt;qemu_open_internal+152&gt;\n   0x0000556dfe2c5c7f &lt;+63&gt;:    mov    0x10(%rsp),%rdi\n   0x0000556dfe2c5c84 &lt;+68&gt;:    callq  0x556dfe2c79b0 &lt;qemu_parse_fd&gt;\n   0x0000556dfe2c5c89 &lt;+73&gt;:    movslq %eax,%rdi\n   0x0000556dfe2c5c8c &lt;+76&gt;:    mov    %rdi,%rbx\n   0x0000556dfe2c5c8f &lt;+79&gt;:    cmp    $0xffffffffffffffff,%rdi\n   0x0000556dfe2c5c93 &lt;+83&gt;:    je     0x556dfe2c5d7c &lt;qemu_open_internal+316&gt;\n   0x0000556dfe2c5c99 &lt;+89&gt;:    mov    %r12d,%esi\n   0x0000556dfe2c5c9c &lt;+92&gt;:    callq  0x556dfe101b60 &lt;monitor_fdset_dup_fd_add&gt;\n   0x0000556dfe2c5ca1 &lt;+97&gt;:    mov    %eax,%ebx\n   0x0000556dfe2c5ca3 &lt;+99&gt;:    cmp    $0xffffffff,%eax\n   0x0000556dfe2c5ca6 &lt;+102&gt;:   je     0x556dfe2c5db3 &lt;qemu_open_internal+371&gt;\n   0x0000556dfe2c5cac &lt;+108&gt;:   mov    0x18(%rsp),%rcx\n   0x0000556dfe2c5cb1 &lt;+113&gt;:   xor    %fs:0x28,%rcx\n   0x0000556dfe2c5cba &lt;+122&gt;:   mov    %ebx,%eax\n   0x0000556dfe2c5cbc &lt;+124&gt;:   jne    0x556dfe2c5dd3 &lt;qemu_open_internal+403&gt;\n   0x0000556dfe2c5cc2 &lt;+130&gt;:   add    $0x28,%rsp\n   0x0000556dfe2c5cc6 &lt;+134&gt;:   pop    %rbx\n   0x0000556dfe2c5cc7 &lt;+135&gt;:   pop    %rbp\n   0x0000556dfe2c5cc8 &lt;+136&gt;:   pop    %r12\n   0x0000556dfe2c5cca &lt;+138&gt;:   pop    %r13\n   0x0000556dfe2c5ccc &lt;+140&gt;:   pop    %r14\n   0x0000556dfe2c5cce &lt;+142&gt;:   pop    %r15\n   0x0000556dfe2c5cd0 &lt;+144&gt;:   retq   \n   0x0000556dfe2c5cd1 &lt;+145&gt;:   nopl   0x0(%rax)\n   0x0000556dfe2c5cd8 &lt;+152&gt;:   mov    %r12d,%esi\n   0x0000556dfe2c5cdb &lt;+155&gt;:   mov    %r13d,%edx\n   0x0000556dfe2c5cde &lt;+158&gt;:   mov    %rbp,%rdi\n   0x0000556dfe2c5ce1 &lt;+161&gt;:   xor    %eax,%eax\n   0x0000556dfe2c5ce3 &lt;+163&gt;:   or     $0x80000,%esi\n   0x0000556dfe2c5ce9 &lt;+169&gt;:   callq  0x556dfdf0f190 &lt;open64@plt&gt;\n\n\u2026\n\n<\/pre>\n<p>I am not good at reading assembly, so if I&#8217;m doing this, I&#8217;m desperate (the colleague talking to the client was sitting next to me and asking why I was digging in the assembly). I see that it should call <code>open64@plt<\/code>, which looks like the syscall <code>open<\/code> to me.<\/p>\n<p>I spend some time stepping instruction by instruction and trying to understand (and failing) what <code>\/usr\/include\/bits\/fcntl2.h<\/code> is doing with the macros there, but instruction by instruction I get to this:<\/p>\n<pre>\n(gdb) <b>si<\/b>\n__libc_open64 (file=file@entry=0x556e0021b241 \"\/var\/lib\/one\/\/datastores\/0\/310\/disk.0\", oflag=oflag@entry=540672) at ..\/sysdeps\/unix\/sysv\/linux\/open64.c:37\n37      {\n(gdb) <b>si<\/b>\n0x00007f25d7a401d4      37      {\n(gdb) <b>si<\/b>\n0x00007f25d7a401d8      37      {\n(gdb) <b>disassemble <\/b>\nDump of assembler code for function __libc_open64:\n   0x00007f25d7a401d0 &lt;+0&gt;:     endbr64 \n   0x00007f25d7a401d4 &lt;+4&gt;:     sub    $0x68,%rsp\n=&gt; 0x00007f25d7a401d8 &lt;+8&gt;:     mov    %esi,%r10d\n   0x00007f25d7a401db &lt;+11&gt;:    mov    %rdx,0x40(%rsp)\n   0x00007f25d7a401e0 &lt;+16&gt;:    mov    %fs:0x28,%rax\n   0x00007f25d7a401e9 &lt;+25&gt;:    mov    %rax,0x28(%rsp)\n   0x00007f25d7a401ee &lt;+30&gt;:    xor    %eax,%eax\n   0x00007f25d7a401f0 &lt;+32&gt;:    and    $0x40,%r10d\n   0x00007f25d7a401f4 &lt;+36&gt;:    jne    0x7f25d7a40248 &lt;__libc_open64+120&gt;\n   0x00007f25d7a401f6 &lt;+38&gt;:    mov    %esi,%eax\n   0x00007f25d7a401f8 &lt;+40&gt;:    and    $0x410000,%eax\n   0x00007f25d7a401fd &lt;+45&gt;:    cmp    $0x410000,%eax\n   0x00007f25d7a40202 &lt;+50&gt;:    je     0x7f25d7a40248 &lt;__libc_open64+120&gt;\n   0x00007f25d7a40204 &lt;+52&gt;:    mov    0x20d236(%rip),%eax        # 0x7f25d7c4d440 &lt;__pthread_multiple_threads&gt;\n   0x00007f25d7a4020a &lt;+58&gt;:    test   %eax,%eax\n   0x00007f25d7a4020c &lt;+60&gt;:    jne    0x7f25d7a40273 &lt;__libc_open64+163&gt;\n   0x00007f25d7a4020e &lt;+62&gt;:    mov    %esi,%edx\n   0x00007f25d7a40210 &lt;+64&gt;:    mov    $0x101,%eax\n   0x00007f25d7a40215 &lt;+69&gt;:    mov    %rdi,%rsi\n   0x00007f25d7a40218 &lt;+72&gt;:    mov    $0xffffff9c,%edi\n   0x00007f25d7a4021d &lt;+77&gt;:    syscall \n   0x00007f25d7a4021f &lt;+79&gt;:    cmp    $0xfffffffffffff000,%rax\n   0x00007f25d7a40225 &lt;+85&gt;:    ja     0x7f25d7a402c8 &lt;__libc_open64+248&gt;\n   0x00007f25d7a4022b &lt;+91&gt;:    mov    0x28(%rsp),%rcx\n   0x00007f25d7a40230 &lt;+96&gt;:    xor    %fs:0x28,%rcx\n   0x00007f25d7a40239 &lt;+105&gt;:   jne    0x7f25d7a402f1 &lt;__libc_open64+289&gt;\n   0x00007f25d7a4023f &lt;+111&gt;:   add    $0x68,%rsp\n<\/pre>\n<p>And that, at <code>0x00007f25d7a4021d &lt;+77&gt;:<\/code> is obviously the entry point to the kernel (the syscall). So\u2026<\/p>\n<pre>\n(gdb) <b>si<\/b>\n0x00007f25d7a40289      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,\n(gdb) <b>si<\/b>\n0x00007f25d7a4028d      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,\n(gdb) <b>si<\/b>\n0x00007f25d7a40290      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,\n(gdb) <b>si<\/b>\n0x00007f25d7a40295      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,\n(gdb) <b>si<\/b>\n0x00007f25d7a4029a      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,\n(gdb) <b>si<\/b>\n0x00007f25d7a4029c      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,\n(gdb) <b>si<\/b>\n0x00007f25d7a4029f      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,\n(gdb) <b>si<\/b>\n0x00007f25d7a402a4      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,\n(gdb) <b>si<\/b>\n0x00007f25d7a402a6      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,\n(gdb) <b>disassemble <\/b>\nDump of assembler code for function __libc_open64:\n\u2026 \t\t\t<i># omitted<\/i>\n   0x00007f25d7a40285 &lt;+181&gt;:   mov    0xc(%rsp),%esi\n   0x00007f25d7a40289 &lt;+185&gt;:   mov    (%rsp),%rdi\n   0x00007f25d7a4028d &lt;+189&gt;:   mov    %eax,%r8d\n   0x00007f25d7a40290 &lt;+192&gt;:   mov    0x8(%rsp),%r10d\n   0x00007f25d7a40295 &lt;+197&gt;:   mov    $0x101,%eax\n   0x00007f25d7a4029a &lt;+202&gt;:   mov    %esi,%edx\n   0x00007f25d7a4029c &lt;+204&gt;:   mov    %rdi,%rsi\n   0x00007f25d7a4029f &lt;+207&gt;:   mov    $0xffffff9c,%edi\n   0x00007f25d7a402a4 &lt;+212&gt;:   syscall \n=&gt; 0x00007f25d7a402a6 &lt;+214&gt;:   cmp    $0xfffffffffffff000,%rax\n   0x00007f25d7a402ac &lt;+220&gt;:   ja     0x7f25d7a402de &lt;__libc_open64+270&gt;\n\n<\/pre>\n<p>Maybe not the one I expected, but by stepping instruction by instruction, I obviously called a syscall here. So qemu asks the kernel about the file, and the kernel says it&#8217;s not there.<\/p>\n<p>So for the hell of it I run <code>strace -e trace=open,openat -p PIDFILE<\/code>, and<\/p>\n<pre>\n[root@a8onekvm1 ~]# <b>strace -f -p 1463232 -o dbg.open -s 255 -e trace=open,openat<\/b>\nstrace: Process 1463232 attached with 6 threads\nstrace: Process 1832488 attached\nstrace: Process 1832489 attached\n^Cstrace: Process 1463232 detached\nstrace: Process 1463236 detached\nstrace: Process 1463237 detached\nstrace: Process 1463240 detached\nstrace: Process 1463242 detached\nstrace: Process 1463245 detached\n\n[root@a8onekvm1 ~]# <b>grep open dbg.open <\/b>\n1832489 openat(AT_FDCWD, \"\/var\/lib\/one\/\/datastores\/0\/310\/disk.0\", O_RDONLY|O_DIRECT|O_CLOEXEC) = -1 ENOENT (No such file or directory)\n<\/pre>\n<p>So, now <code>strace<\/code> sees it, but otherwise doesn&#8217;t?<\/p>\n<p>Cue 5 minutes of me, yelling about <code>strace<\/code> betraying me, then checking that the file exists, writing a small C program to execute the exact same syscall with the permissions of the process, and (being late Friday afternoon and being tired) complaining to other people. And someone says, &#8220;this is most probably mount namespaces issue&#8221;.<\/p>\n<p>And what do you know, it is:<\/p>\n<pre>\n[root@a8onekvm1 1463232]# <b>nsenter -m -t 1463232 ls -l \/var\/lib\/one\/\/datastores\/0\/310\/disk.0<\/b>\nlrwxrwxrwx 1 oneadmin oneadmin 28 Feb 16 11:02 \/var\/lib\/one\/\/datastores\/0\/310\/disk.0 -&gt; \/dev\/storpool-byid\/fir.b.f2g\n[root@a8onekvm1 1463232]# <b>nsenter -m -t 1463232 ls -l \/dev\/storpool-byid<\/b>\nls: cannot access '\/dev\/storpool-byid': No such file or directory\n<\/pre>\n<p>Sooo\u2026 what we have done is replace the symlink from pointing to \/dev\/storpool\/XXX to \/dev\/storpool-byid\/YYY. The namespace for qemu doesn&#8217;t have the second directory, it was not in use when the process was started, and we have just screwed it by moving the symlink.<\/p>\n<p>So, happy with that, on Monday I wrote a lousy script to recreate the links, which one of my colleagues made into a good script that people won&#8217;t be ashamed to show, for example, replacing<\/p>\n<pre>\nvirsh dumpxml one-316 |grep 'source dev.*datastores' | cut -d ' ' -f 2\n<\/pre>\n<p>with<\/p>\n<pre>\nxmlstarlet sel -t -m '\/\/devices\/disk\/source' -v '@dev' -n \"${DOMXML}\" 2&gt;\/dev\/null\n<\/pre>\n<p>Then, we apply this to the live environment to one VM, and happily tell the customer to try a live migration.<\/p>\n<p>It fails with the same error.<br \/>\n(cut some time for swearing at the world and all broken software)<\/p>\n<p>So I&#8217;m back to square one, as I try <code>strace<\/code>-ing the process in the live environment with <code>-e trace=open,openat<\/code>, and there is no such call. I also tried <code>ltrace<\/code>, but the only difference was that it was slower and didn&#8217;t show anything more.<\/p>\n<p>Then, a colleague says, &#8220;You know, these debuginfo packages should exist, they&#8217;re not that old&#8221;, goes looking in the mirrors, and FINDS THEM. For some reason, google doesn&#8217;t index that part, and I could not find them the lazy way.<\/p>\n<p>Here&#8217;s the obvious error that you should look for yourself. It would&#8217;ve been even easier if I logged in to the local mirror we have for Rocky Linux and ran a <code>find<\/code> for myself.<\/p>\n<p>Armed with debug symbols, I get the test VM again, attach to it, add a breakpoint at <code>qemu_internal_open()<\/code>, and nothing happens. It doesn&#8217;t get hit. It looks like strace wasn&#8217;t lying here.<\/p>\n<p>I go back to the <code>blk_root_inactivate()<\/code> function and I notice:<\/p>\n<pre>\nThread 8 \"live_migration\" hit Breakpoint 1, blk_root_inactivate (child=0x55586bf47c00) at ..\/block\/block-backend.c:252\n252         BlockBackend *blk = child-&gt;opaque;\n(gdb) <b>bt<\/b>\n#0  blk_root_inactivate (child=0x55586bf47c00) at ..\/block\/block-backend.c:252\n#1  0x000055586a5e8162 in bdrv_inactivate_recurse (bs=0x55586b297760) at ..\/block.c:6533\n#2  0x000055586a5e963f in bdrv_inactivate_all () at ..\/block.c:6590\n#3  0x000055586a37b535 in qemu_savevm_state_complete_precopy_non_iterable (f=0x55586c05d500, in_postcopy=&lt;optimized out&gt;, inactivate_disks=&lt;\/optimized&gt;&lt;optimized out&gt;) at ..\/migration\/savevm.c:1441\n#4  0x000055586a37b622 in qemu_savevm_state_complete_precopy (f=0x55586c05d500, iterable_only=iterable_only@entry=false, inactivate_disks=inactivate_disks@entry=true) at ..\/migration\/savevm.c:1493\n#5  0x000055586a374236 in migration_completion (s=0x55586b207000) at ..\/migration\/migration.c:3260\n#6  migration_iteration_run (s=0x55586b207000) at ..\/migration\/migration.c:3672\n#7  migration_thread (opaque=0x55586b207000) at ..\/migration\/migration.c:3908\n#8  0x000055586a6d9dc4 in qemu_thread_start (args=0x55586c0ee6c0) at ..\/util\/qemu-thread-posix.c:585\n#9  0x00007f4a951d11ca in start_thread () from target:\/lib64\/libpthread.so.0\n#10 0x00007f4a94e3de73 in clone () from target:\/lib64\/libc.so.6\n(gdb) <b>n<\/b>\n254         if (blk-&gt;disable_perm) {\n(gdb) <b>n<\/b>\nbdrv_inactivate_recurse (bs=0x55586b297760) at ..\/block.c:6534\n<\/pre>\n<p><code>blk-&gt;disable_perm<\/code> is already set, so it looks like in this version of qemu (a bit older than what I initially used), when the live migration fails, this flag doesn&#8217;t get reset. I made a diff between the two versions, and that part has enough changes to warrant this idea. So, VMs that didn&#8217;t fail the migration would be fine.<\/p>\n<p>Now, what about this one and a few more? They seem to be in a corrupt state. I read a bit more code, then bite the bullet and change the variable live:<\/p>\n<pre>\nThread 7 \"live_migration\" hit Breakpoint 1, blk_root_inactivate (child=0x55586bf47c00) at ..\/block\/block-backend.c:254\n254         if (blk-&gt;disable_perm) {\n(gdb) <b>bt<\/b>\n#0  blk_root_inactivate (child=0x55586bf47c00) at ..\/block\/block-backend.c:254\n#1  0x000055586a5e8162 in bdrv_inactivate_recurse (bs=0x55586b297760) at ..\/block.c:6533\n#2  0x000055586a5e963f in bdrv_inactivate_all () at ..\/block.c:6590\n#3  0x000055586a37b535 in qemu_savevm_state_complete_precopy_non_iterable (f=0x55586c05d500, in_postcopy=&lt;\/optimized&gt;&lt;optimized out&gt;, inactivate_disks=&lt;\/optimized&gt;&lt;optimized out&gt;) at ..\/migration\/savevm.c:1441\n#4  0x000055586a37b622 in qemu_savevm_state_complete_precopy (f=0x55586c05d500, iterable_only=iterable_only@entry=false, inactivate_disks=inactivate_disks@entry=true) at ..\/migration\/savevm.c:1493\n#5  0x000055586a374236 in migration_completion (s=0x55586b207000) at ..\/migration\/migration.c:3260\n#6  migration_iteration_run (s=0x55586b207000) at ..\/migration\/migration.c:3672\n#7  migration_thread (opaque=0x55586b207000) at ..\/migration\/migration.c:3908\n#8  0x000055586a6d9dc4 in qemu_thread_start (args=0x55586c2325e0) at ..\/util\/qemu-thread-posix.c:585\n#9  0x00007f4a951d11ca in start_thread () from target:\/lib64\/libpthread.so.0\n#10 0x00007f4a94e3de73 in clone () from target:\/lib64\/libc.so.6\n(gdb) <b>print ( (BlockBackend *) (child-&gt;opaque))-&gt;disable_perm<\/b>\n$1 = true\n(gdb) <b>print ( (BlockBackend *) (child-&gt;opaque))-&gt;perm<\/b>\n$2 = 3\n(gdb) <b>set: variable ( (BlockBackend *) (child-&gt;opaque))-&gt;disable_perm = false<\/b>\nNo symbol \"false\" in current context.\t<i># arghhhh<\/i>\n(gdb) <b>set variable ( (BlockBackend *) (child-&gt;opaque))-&gt;disable_perm = 0<\/b>\n(gdb) <b>cont<\/b>\nContinuing.\n\nThread 7 \"live_migration\" hit Breakpoint 1, blk_root_inactivate (child=0x55586bf51850) at ..\/block\/block-backend.c:254\n254         if (blk-&gt;disable_perm) {\n(gdb) <b>print ( (BlockBackend *) (child-&gt;opaque))-&gt;disable_perm<\/b>\n$3 = false\n(gdb) <b>print ( (BlockBackend *) (child-&gt;opaque))-&gt;perm<\/b>\n$4 = 1\n(gdb) <b>cont<\/b>\nContinuing.\n[Thread 0x7f4999ffb700 (LWP 144024) exited]\n[Thread 0x7f499a7fc700 (LWP 144025) exited]\n\nThread 1 \"qemu-kvm-one\" received signal SIGTERM, Terminated.\t<i># oops?<\/i>\n[Switching to Thread 0x7f4a985a6e40 (LWP 3460894)]\n0x00007f4a94f28036 in ppoll () from target:\/lib64\/libc.so.6\n(gdb) <b>q<\/b>\n\n<\/pre>\n<p>(imagine this being done live and as quickly as possible, as the VM is hanging. You also see two devices, OpenNebula adds a context image to all VMs, and it&#8217;s a CD-ROM device that the migration didn&#8217;t get to corrupt)<\/p>\n<p>Then I see SIGTERM and think, &#8220;crap, I killed it&#8221;. Then I look at the logs, and it has actually migrated successfully.<\/p>\n<p>So, finally, with some pointers (&#8220;the command in gdb is command&#8221; :) ) I write the following:<\/p>\n<pre>\nset print pretty\nset pagination off\nhandle SIGUSR1 nostop pass\nhandle SIGTERM nostop pass\nbreak block\/block-backend.c:253\ncommands\nprint ( (BlockBackend *) (child-&gt;opaque))-&gt;disable_perm\nprint ( (BlockBackend *) (child-&gt;opaque))-&gt;perm\nset variable ( (BlockBackend *) (child-&gt;opaque))-&gt;disable_perm = 0\ncont\nend\ncont\n<\/pre>\n<p>and it works like a charm on the 10-or-something VMs with a corrupt state.<\/p>\n<p>A list of errors and lessons learned:<\/p>\n<p>&#8211; Don&#8217;t assume that because there&#8217;s something not right, it&#8217;s the cause of the problem you&#8217;re looking at (QEMU processor for the VMs);<br \/>\n&#8211; Collect all relevant information (and maybe even irrelevant) &#8211; I keep telling people that it&#8217;s easier to work with too much information than with not enough. I should listen to myself one of these days;<br \/>\n&#8211; strace can hide syscalls and shouldn&#8217;t be fully trusted;<br \/>\n&#8211; If you&#8217;re searching for debuginfo (or anything like that), do your legwork, and don&#8217;t trust a single search engine;<br \/>\n&#8211; testing the wrong software version can reveal only a part of the problem (or a different one).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>There&#8217;s a purpose in everyone&#8217;s life. Some days, it seems like mine is to show people what not to do. I&#8217;ve written this explanation to show my mistakes and that, hopefully, other people don&#8217;t (there&#8217;s no hope for me). What follows is most of my debugging process, and it could be helpful to other people [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"0","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[16],"class_list":["post-3474","post","type-post","status-publish","format-standard","hentry","category-general","tag-16"],"_links":{"self":[{"href":"https:\/\/vasil.ludost.net\/blog\/index.php?rest_route=\/wp\/v2\/posts\/3474","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/vasil.ludost.net\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/vasil.ludost.net\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/vasil.ludost.net\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/vasil.ludost.net\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3474"}],"version-history":[{"count":0,"href":"https:\/\/vasil.ludost.net\/blog\/index.php?rest_route=\/wp\/v2\/posts\/3474\/revisions"}],"wp:attachment":[{"href":"https:\/\/vasil.ludost.net\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3474"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/vasil.ludost.net\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3474"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/vasil.ludost.net\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3474"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}