2024-03-05 The failed live migration case
by Vasil Kolev
There’s a purpose in everyone’s life. Some days, it seems like mine is to show people what not to do.
I’ve written this explanation to show my mistakes so that, hopefully, other people don’t repeat them (there’s no hope for me). What follows is most of my debugging process; it could be helpful for other people to see it (and to point out what’s wrong with it).
This happened in a StorPool deployment for a customer who runs a public cloud service. We provide the shared storage via some drivers of our own, and the customer uses OpenNebula to manage Linux/KVM hypervisors. It’s all integrated very nicely, and the same setup is used for our internal clouds.
KVM VMs are a qemu process in the Linux userspace with multiple threads.
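For reference, those thread names (like the live_migration one that shows up later in this story) can be read straight out of /proc. A minimal sketch, not from the original investigation, that lists the threads of a given PID:

/* list-threads.c -- print the thread IDs and names of a process; qemu names
 * its threads (e.g. "live_migration", "CPU 0/KVM"), which is what gdb and
 * strace show later in this story.
 * Build: cc -o list-threads list-threads.c
 * Run:   ./list-threads <pid>
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    char task_dir[64];
    snprintf(task_dir, sizeof(task_dir), "/proc/%s/task", argv[1]);

    DIR *dir = opendir(task_dir);
    if (!dir) {
        perror(task_dir);
        return 1;
    }

    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] == '.')
            continue;

        /* each thread's name lives in /proc/<pid>/task/<tid>/comm */
        char comm_path[128], comm[64] = "";
        snprintf(comm_path, sizeof(comm_path), "%s/%s/comm", task_dir, de->d_name);

        FILE *f = fopen(comm_path, "r");
        if (f) {
            if (fgets(comm, sizeof(comm), f))
                comm[strcspn(comm, "\n")] = '\0';
            fclose(f);
        }
        printf("%s\t%s\n", de->d_name, comm);
    }
    closedir(dir);
    return 0;
}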
The case started with a customer complaining that, after an update we did to the storage software, live migrations of his VMs started failing (the VMs would fail to move but would keep working on the original host). The problem was narrowed down to VMs started before a specific date (when we did the update mentioned above), and initially it seemed to be contained to VMs that had the QEMU CPU type set (i.e., the least performant one).
At this point, I asked – can’t they restart all such VMs? This CPU type is mainly contributing to global warming anyway.
The reply was, “Are you insane? These are 900 VMs. We’ll reboot them, but we really don’t want to do that now”.
Fair point.
Then, some further tests showed that any VMs, not just the ones with the QEMU CPU, were failing the migration. Finally, someone looked into the (not easy to find) logs of qemu and found the following:
2024-02-20T09:46:41.938285Z qemu-kvm-one: qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1)
(seriously. Trying to trace anything via the logs in OpenNebula/libvirt/qemu feels like interrogating prisoners who don’t speak your language.)
Two mistakes above cost me a day or two: I should’ve looked into the issue immediately myself instead of relying on hearsay and building the wrong picture (based on assumptions) in my head. And I did not expect that the (routine) update we did could have any such effect.
I specifically checked if the same problem existed in our labs and test environments, where we had done the same routine update, and it didn’t exist.
Now, this was weird. The only mention of the above error was in https://bugzilla.redhat.com/show_bug.cgi?id=2058982, where the storage device on the originating side was broken. This was not the case here.
My first step was to get permission to use one VM at the customer for tests, and after I had that, I ran strace on it while migrating.
What I was able to see in the thread that performed the migration was:
prctl(PR_SET_NAME, "live_migration") = 0
…
sendmsg(99, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="..."..., iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 32768
sendmsg(99, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="..."..., iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 32768
write(2, "2024-02-15T07:00:23.057678Z ", 28) = 28
write(2, "qemu-kvm-one:", 13) = 13
write(2, " ", 1) = 1
write(2, "qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1)", 82) = 82
write(2, "\n", 1) = 1
futex(0x55586af89640, FUTEX_WAKE_PRIVATE, 1) = 1
sendmsg(20, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="{\"timestamp\": {\"seconds\": 1707980423, \"microseconds\": 57880}, \"event\": \"MIGRATION\", \"data\": {\"status\": \"failed\"}}\r\n", iov_len=115}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 115
futex(0x55586af89640, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(20, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="{\"timestamp\": {\"seconds\": 1707980423, \"microseconds\": 58133}, \"event\": \"RESUME\"}\r\n", iov_len=82}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 82
Some digging showed that 99 was the file descriptor with the connection to the remote qemu. This was the memory transfer, followed almost immediately by writing out the error, and nothing in between.
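(As an aside, mapping a descriptor number like 99 to what it points to doesn’t need strace at all – /proc/<pid>/fd shows it as a symlink. A minimal sketch, not part of the original debugging, that just resolves that link:)

/* whatfd.c -- show what a file descriptor of another process points to.
 * For a socket this prints something like "socket:[<inode>]", which can then
 * be matched against the output of ss or lsof.
 * Build: cc -o whatfd whatfd.c
 * Run:   ./whatfd <pid> <fd>
 */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <fd>\n", argv[0]);
        return 1;
    }

    char link[64], target[4096];
    snprintf(link, sizeof(link), "/proc/%s/fd/%s", argv[1], argv[2]);

    /* the fd entries in /proc are symlinks to the underlying object */
    ssize_t n = readlink(link, target, sizeof(target) - 1);
    if (n < 0) {
        perror(link);
        return 1;
    }
    target[n] = '\0';

    printf("%s -> %s\n", link, target);
    return 0;
}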
So, being somewhat lost, I looked for debug symbols for qemu to see if I could trace this further. The usual tooling told me there weren’t any:
[root@somewhere str3]# gdb /usr/libexec/qemu-kvm
GNU gdb (GDB) Rocky Linux 8.2-20.el8.0.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/libexec/qemu-kvm...Reading symbols from .gnu_debugdata for /usr/libexec/qemu-kvm...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Missing separate debuginfos, use: yum debuginfo-install qemu-kvm-core-6.2.0-33.module+el8.8.0+1454+0b2cbfb8.x86_64
(gdb) quit
[root@somewhere str3]# yum debuginfo-install qemu-kvm-core-6.2.0-33.module+el8.8.0+1454+0b2cbfb8.x86_64
Repository storpool-contrib is listed more than once in the configuration
enabling epel-debuginfo repository
Extra Packages for Enterprise Linux 8 - x86_64 - Debug    4.0 MB/s | 6.1 MB    00:01
Last metadata expiration check: 0:00:02 ago on Thu Feb 15 16:26:34 2024.
Could not find debuginfo package for the following installed packages: qemu-kvm-core-15:6.2.0-33.module+el8.8.0+1454+0b2cbfb8.x86_64
Could not find debugsource package for the following installed packages: qemu-kvm-core-15:6.2.0-33.module+el8.8.0+1454+0b2cbfb8.x86_64
Dependencies resolved.
Nothing to do.
Complete!
(I can already hear people saying, “You’re an idiot”; someone had to point that out to me)
Then, the colleague in communication with the customer said, “BTW, do you remember that at that time we did the other update, to migrate from one-type-of-volume-name to the-other-type-of-volume-name”? Of course, I didn’t. We ran the same migration in our lab, and the issue was reproduced immediately.
The mistake was that I did not collect all available information (or even most of it), still relying on assumptions.
What I could do was deploy the latest qemu for AlmaLinux (almost the same as what the customer was using), download debug symbols, and start digging. Yay.
On strace, the issue looked the same – lots of sending to the socket and then just the error. So, I started digging with gdb by attaching, adding a breakpoint, continuing the process, and then running a live migration to see how the situation looked at different points of the program. I had something like this primed and pasted when I ran gdb -p xxxx (some comments added to explain what these are):
set print pretty                  # the usual gdb print is hard to read
set pagination off                # pagination is annoying in this regard, my terminal can scroll
handle SIGUSR1 nostop pass        # qemu uses SIGUSR1 a lot
break block/block-backend.c:253   # stop here
cont                              # continue execution
bt full                           # print me a full backtrace with local variables when stopped
It took some time to read through the relevant code and ensure there wasn’t something obvious (which took many wrong turns). The issue looked like bdrv_inactivate_recurse() was trying to remove write permissions from the devices and was failing for some reason.
6523     /* Inactivate this node */
6524     if (bs->drv->bdrv_inactivate) {
6525         ret = bs->drv->bdrv_inactivate(bs);
6526         if (ret < 0) {
6527             return ret;
6528         }
6529     }
6530
6531     QLIST_FOREACH(parent, &bs->parents, next_parent) {
6532         if (parent->klass->inactivate) {
6533             ret = parent->klass->inactivate(parent);
6534             if (ret < 0) {
6535                 return ret;
6536             }
6537         }
6538     }
6539
6540     bdrv_get_cumulative_perm(bs, &cumulative_perms,
6541                              &cumulative_shared_perms);
6542     if (cumulative_perms & (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED)) {
6543         /* Our inactive parents still need write access. Inactivation failed. */
6544         return -EPERM;
6545     }
This is the code in question, and in bdrv_get_cumulative_perm(), there were some devices that still had r/w set even after the stuff between lines 6531 and 6533 had passed:
bdrv_inactivate_recurse (bs=0x556e00221530) at ../block.c:6534
6534            if (ret < 0) {
(gdb)
6531        QLIST_FOREACH(parent, &bs->parents, next_parent) {
(gdb)
6540        bdrv_get_cumulative_perm(bs, &cumulative_perms,
(gdb)
6542        if (cumulative_perms & (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED)) {
(gdb)
bdrv_inactivate_all () at ../block.c:6591
6591            if (ret < 0) {
(gdb)
6592            bdrv_next_cleanup(&it);
(gdb)
The inactivate() function above was in block/block-backend.c and looks like this:
250 static int blk_root_inactivate(BdrvChild *child)
251 {
252     BlockBackend *blk = child->opaque;
253
254     if (blk->disable_perm) {
255         return 0;
256     }
257
258     if (!blk_can_inactivate(blk)) {
259         return -EPERM;
260     }
261
262     blk->disable_perm = true;
263     if (blk->root) {
264         bdrv_child_try_set_perm(blk->root, 0, BLK_PERM_ALL, &error_abort);
265     }
266
267     return 0;
268 }
and what was happening was:
258     if (!blk_can_inactivate(blk)) {
(gdb)
262     blk->disable_perm = true;
(gdb)
263     if (blk->root) {
(gdb)
264         bdrv_child_try_set_perm(blk->root, 0, BLK_PERM_ALL, &error_abort);
(gdb)
bdrv_inactivate_recurse (bs=0x556e00221530) at ../block.c:6534
So qemu is calling it, and that was all OK.
Here, a few times, I had missed what this was trying to do. Here’s bdrv_child_try_set_perm():
2517
2518 int bdrv_child_try_set_perm(BdrvChild *c, uint64_t perm, uint64_t shared,
2519                             Error **errp)
2520 {
2521     Error *local_err = NULL;
2522     Transaction *tran = tran_new();
2523     int ret;
2524
2525     bdrv_child_set_perm(c, perm, shared, tran);
2526
2527     ret = bdrv_refresh_perms(c->bs, &local_err);
2528
2529     tran_finalize(tran, ret);
2530
2531     if (ret < 0) {
2532         if ((perm & ~c->perm) || (c->shared_perm & ~shared)) {
2533             /* tighten permissions */
2534             error_propagate(errp, local_err);
2535         } else {
2536             /*
2537              * Our caller may intend to only loosen restrictions and
2538              * does not expect this function to fail. Errors are not
2539              * fatal in such a case, so we can just hide them from our
2540              * caller.
2541              */
2542             error_free(local_err);
2543             ret = 0;
2544         }
2545     }
2546
2547     return ret;
2548 }
So that 0 being passed means “remove all permissions”, but at some point late in the day, you just can’t see it…
And now, I break into that one and try to see what happens:
(gdb) n
2197            *s = (BdrvChildSetPermState) {
(gdb) print *(local_err)
No symbol "local_err" in current context.
(gdb)
No symbol "local_err" in current context.
(gdb) n
2203        c->perm = perm;
(gdb) print *(local_err)
No symbol "local_err" in current context.
(gdb) n
2204        c->shared_perm = shared;
(gdb) print *(local_err)
No symbol "local_err" in current context.
(gdb) n
2206        tran_add(tran, &bdrv_child_set_pem_drv, s);
(gdb) n
bdrv_child_try_set_perm (c=0x556e002e1730, perm=perm@entry=0, shared=shared@entry=31, errp=0x556dfeb99258 <error_abort>) at ../block.c:2527
2527        ret = bdrv_refresh_perms(c->bs, &local_err);
(gdb) print *(local_err)
Cannot access memory at address 0x0
(gdb) n
2529        tran_finalize(tran, ret);
(gdb) print *(local_err)
$1 = {
  msg = 0x7f25a4041b70 "Could not open '/var/lib/one//datastores/0/310/disk.0': No such file or directory",
  err_class = ERROR_CLASS_GENERIC_ERROR,
  src = 0x556dfe63ca2a "../util/osdep.c",
  func = 0x556dfe63cbc0 <__func__.26488> "qemu_open_internal",
  line = 359,
  hint = 0x0
}
(gdb) q
Now, I see it going through the function, and I’m trying to look at that local_err to see if there’s something in it. Finally, I get the actual error; it has happened somewhere and was not propagated – qemu has tried opening a file, and that file doesn’t exist.
A few things here:
– This file does exist (it’s a symlink to a symlink to a block device, and I can access it with the permissions of the process);
– Why isn’t this error logged or propagated? I should probably send a patch for that…
– This was not in what strace showed, so what the hell.
So, I break at util/osdep.c in qemu_open_internal(), to see what happens. I pass through that a few times, trying to see where that error comes from, because I still think there is no syscall there, and glibc is screwing with me. This goes to show that you shouldn’t debug when tired, because glibc returning ENOENT on its own is a lot less likely than strace missing one syscall.
I’m going to leave out a lot of unsuccessful attempts and show you how I got to the next clue in one large gdb session (comments with #):
(gdb) break util/osdep.c:312
Breakpoint 1 at 0x556dfe2c5c71: file ../util/osdep.c, line 312.
(gdb) handle SIGUSR1 nostop pass
Signal        Stop      Print   Pass to program Description
SIGUSR1       No        Yes     Yes             User defined signal 1
(gdb) set print pretty
(gdb) set pagination off
(gdb) cont
Continuing.
[New Thread 0x7f259c9ba700 (LWP 1798462)]
[New Thread 0x7f259c1b9700 (LWP 1798463)]

Thread 5 "CPU 0/KVM" received signal SIGUSR1, User defined signal 1.

Thread 5 "CPU 0/KVM" received signal SIGUSR1, User defined signal 1.

Thread 5 "CPU 0/KVM" received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0x7f259c1b9700 (LWP 1798463)]

Thread 8 "live_migration" hit Breakpoint 1, qemu_open_internal (name=name@entry=0x556e0021b241 "/var/lib/one//datastores/0/310/disk.0", flags=16384, mode=mode@entry=0, errp=errp@entry=0x7f259c1b85e0) at ../util/osdep.c:318
318         if (strstart(name, "/dev/fdset/", &fdset_id_str)) {
(gdb) bt full
…   # nothing interesting in the backtrace
(gdb) disassemble
Dump of assembler code for function qemu_open_internal:
   0x0000556dfe2c5c40 <+0>:     push   %r15
   0x0000556dfe2c5c42 <+2>:     push   %r14
   0x0000556dfe2c5c44 <+4>:     mov    %rcx,%r14
   0x0000556dfe2c5c47 <+7>:     push   %r13
   0x0000556dfe2c5c49 <+9>:     mov    %edx,%r13d
   0x0000556dfe2c5c4c <+12>:    push   %r12
   0x0000556dfe2c5c4e <+14>:    mov    %esi,%r12d
   0x0000556dfe2c5c51 <+17>:    lea    0x376dad(%rip),%rsi        # 0x556dfe63ca05
   0x0000556dfe2c5c58 <+24>:    push   %rbp
   0x0000556dfe2c5c59 <+25>:    mov    %rdi,%rbp
   0x0000556dfe2c5c5c <+28>:    push   %rbx
   0x0000556dfe2c5c5d <+29>:    sub    $0x28,%rsp
   0x0000556dfe2c5c61 <+33>:    mov    %fs:0x28,%rax
   0x0000556dfe2c5c6a <+42>:    mov    %rax,0x18(%rsp)
   0x0000556dfe2c5c6f <+47>:    xor    %eax,%eax
=> 0x0000556dfe2c5c71 <+49>:    lea    0x10(%rsp),%rdx
   0x0000556dfe2c5c76 <+54>:    callq  0x556dfe2c6a30 <strstart>
   0x0000556dfe2c5c7b <+59>:    test   %eax,%eax
   0x0000556dfe2c5c7d <+61>:    je     0x556dfe2c5cd8 <qemu_open_internal+152>
   0x0000556dfe2c5c7f <+63>:    mov    0x10(%rsp),%rdi
   0x0000556dfe2c5c84 <+68>:    callq  0x556dfe2c79b0 <qemu_parse_fd>
   0x0000556dfe2c5c89 <+73>:    movslq %eax,%rdi
   0x0000556dfe2c5c8c <+76>:    mov    %rdi,%rbx
   0x0000556dfe2c5c8f <+79>:    cmp    $0xffffffffffffffff,%rdi
   0x0000556dfe2c5c93 <+83>:    je     0x556dfe2c5d7c <qemu_open_internal+316>
   0x0000556dfe2c5c99 <+89>:    mov    %r12d,%esi
   0x0000556dfe2c5c9c <+92>:    callq  0x556dfe101b60 <monitor_fdset_dup_fd_add>
   0x0000556dfe2c5ca1 <+97>:    mov    %eax,%ebx
   0x0000556dfe2c5ca3 <+99>:    cmp    $0xffffffff,%eax
   0x0000556dfe2c5ca6 <+102>:   je     0x556dfe2c5db3 <qemu_open_internal+371>
   0x0000556dfe2c5cac <+108>:   mov    0x18(%rsp),%rcx
   0x0000556dfe2c5cb1 <+113>:   xor    %fs:0x28,%rcx
   0x0000556dfe2c5cba <+122>:   mov    %ebx,%eax
   0x0000556dfe2c5cbc <+124>:   jne    0x556dfe2c5dd3 <qemu_open_internal+403>
   0x0000556dfe2c5cc2 <+130>:   add    $0x28,%rsp
   0x0000556dfe2c5cc6 <+134>:   pop    %rbx
   0x0000556dfe2c5cc7 <+135>:   pop    %rbp
   0x0000556dfe2c5cc8 <+136>:   pop    %r12
   0x0000556dfe2c5cca <+138>:   pop    %r13
   0x0000556dfe2c5ccc <+140>:   pop    %r14
   0x0000556dfe2c5cce <+142>:   pop    %r15
   0x0000556dfe2c5cd0 <+144>:   retq
   0x0000556dfe2c5cd1 <+145>:   nopl   0x0(%rax)
   0x0000556dfe2c5cd8 <+152>:   mov    %r12d,%esi
   0x0000556dfe2c5cdb <+155>:   mov    %r13d,%edx
   0x0000556dfe2c5cde <+158>:   mov    %rbp,%rdi
   0x0000556dfe2c5ce1 <+161>:   xor    %eax,%eax
   0x0000556dfe2c5ce3 <+163>:   or     $0x80000,%esi
   0x0000556dfe2c5ce9 <+169>:   callq  0x556dfdf0f190 <open64@plt>
…
I am not good at reading assembly, so if I’m doing this, I’m desperate (the colleague talking to the client was sitting next to me and asking why I was digging in the assembly). I see that it should call open64@plt, which looks like the syscall open to me.
I spend some time stepping instruction by instruction and trying (and failing) to understand what /usr/include/bits/fcntl2.h is doing with the macros there, but instruction by instruction I get to this:
(gdb) si
__libc_open64 (file=file@entry=0x556e0021b241 "/var/lib/one//datastores/0/310/disk.0", oflag=oflag@entry=540672) at ../sysdeps/unix/sysv/linux/open64.c:37
37      {
(gdb) si
0x00007f25d7a401d4      37      {
(gdb) si
0x00007f25d7a401d8      37      {
(gdb) disassemble
Dump of assembler code for function __libc_open64:
   0x00007f25d7a401d0 <+0>:     endbr64
   0x00007f25d7a401d4 <+4>:     sub    $0x68,%rsp
=> 0x00007f25d7a401d8 <+8>:     mov    %esi,%r10d
   0x00007f25d7a401db <+11>:    mov    %rdx,0x40(%rsp)
   0x00007f25d7a401e0 <+16>:    mov    %fs:0x28,%rax
   0x00007f25d7a401e9 <+25>:    mov    %rax,0x28(%rsp)
   0x00007f25d7a401ee <+30>:    xor    %eax,%eax
   0x00007f25d7a401f0 <+32>:    and    $0x40,%r10d
   0x00007f25d7a401f4 <+36>:    jne    0x7f25d7a40248 <__libc_open64+120>
   0x00007f25d7a401f6 <+38>:    mov    %esi,%eax
   0x00007f25d7a401f8 <+40>:    and    $0x410000,%eax
   0x00007f25d7a401fd <+45>:    cmp    $0x410000,%eax
   0x00007f25d7a40202 <+50>:    je     0x7f25d7a40248 <__libc_open64+120>
   0x00007f25d7a40204 <+52>:    mov    0x20d236(%rip),%eax        # 0x7f25d7c4d440 <__pthread_multiple_threads>
   0x00007f25d7a4020a <+58>:    test   %eax,%eax
   0x00007f25d7a4020c <+60>:    jne    0x7f25d7a40273 <__libc_open64+163>
   0x00007f25d7a4020e <+62>:    mov    %esi,%edx
   0x00007f25d7a40210 <+64>:    mov    $0x101,%eax
   0x00007f25d7a40215 <+69>:    mov    %rdi,%rsi
   0x00007f25d7a40218 <+72>:    mov    $0xffffff9c,%edi
   0x00007f25d7a4021d <+77>:    syscall
   0x00007f25d7a4021f <+79>:    cmp    $0xfffffffffffff000,%rax
   0x00007f25d7a40225 <+85>:    ja     0x7f25d7a402c8 <__libc_open64+248>
   0x00007f25d7a4022b <+91>:    mov    0x28(%rsp),%rcx
   0x00007f25d7a40230 <+96>:    xor    %fs:0x28,%rcx
   0x00007f25d7a40239 <+105>:   jne    0x7f25d7a402f1 <__libc_open64+289>
   0x00007f25d7a4023f <+111>:   add    $0x68,%rsp
And that, at 0x00007f25d7a4021d <+77>, is obviously the entry point to the kernel (the syscall). So…
(gdb) si
0x00007f25d7a40289      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,
(gdb) si
0x00007f25d7a4028d      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,
(gdb) si
0x00007f25d7a40290      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,
(gdb) si
0x00007f25d7a40295      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,
(gdb) si
0x00007f25d7a4029a      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,
(gdb) si
0x00007f25d7a4029c      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,
(gdb) si
0x00007f25d7a4029f      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,
(gdb) si
0x00007f25d7a402a4      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,
(gdb) si
0x00007f25d7a402a6      48        return SYSCALL_CANCEL (openat, AT_FDCWD, file, oflag | EXTRA_OPEN_FLAGS,
(gdb) disassemble
Dump of assembler code for function __libc_open64:
…   # omitted
   0x00007f25d7a40285 <+181>:   mov    0xc(%rsp),%esi
   0x00007f25d7a40289 <+185>:   mov    (%rsp),%rdi
   0x00007f25d7a4028d <+189>:   mov    %eax,%r8d
   0x00007f25d7a40290 <+192>:   mov    0x8(%rsp),%r10d
   0x00007f25d7a40295 <+197>:   mov    $0x101,%eax
   0x00007f25d7a4029a <+202>:   mov    %esi,%edx
   0x00007f25d7a4029c <+204>:   mov    %rdi,%rsi
   0x00007f25d7a4029f <+207>:   mov    $0xffffff9c,%edi
   0x00007f25d7a402a4 <+212>:   syscall
=> 0x00007f25d7a402a6 <+214>:   cmp    $0xfffffffffffff000,%rax
   0x00007f25d7a402ac <+220>:   ja     0x7f25d7a402de <__libc_open64+270>
Maybe not the one I expected, but stepping instruction by instruction, I obviously went through a syscall here. So qemu asks the kernel about the file, and the kernel says it’s not there.
So, for the hell of it, I run strace -e trace=open,openat -p PIDFILE, and
[root@a8onekvm1 ~]# strace -f -p 1463232 -o dbg.open -s 255 -e trace=open,openat
strace: Process 1463232 attached with 6 threads
strace: Process 1832488 attached
strace: Process 1832489 attached
^Cstrace: Process 1463232 detached
strace: Process 1463236 detached
strace: Process 1463237 detached
strace: Process 1463240 detached
strace: Process 1463242 detached
strace: Process 1463245 detached
[root@a8onekvm1 ~]# grep open dbg.open
1832489 openat(AT_FDCWD, "/var/lib/one//datastores/0/310/disk.0", O_RDONLY|O_DIRECT|O_CLOEXEC) = -1 ENOENT (No such file or directory)
So, now strace sees it, but otherwise doesn’t?
Cue 5 minutes of me yelling about strace betraying me, then checking that the file exists, writing a small C program to execute the exact same syscall with the permissions of the process, and (it being late Friday afternoon and me being tired) complaining to other people. And someone says, “this is most probably a mount namespaces issue”.
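(The small C program mentioned above was essentially just the openat() call strace had shown, repeated by hand; this is a reconstruction for illustration, not the original:)

/* same-open.c -- repeat the exact openat() call strace showed qemu making,
 * to check whether the path can be opened at all from a normal shell
 * (run it as the same user as the qemu process, e.g. oneadmin).
 * Build: cc -o same-open same-open.c
 */
#define _GNU_SOURCE     /* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/var/lib/one//datastores/0/310/disk.0";

    int fd = openat(AT_FDCWD, path, O_RDONLY | O_DIRECT | O_CLOEXEC);
    if (fd < 0) {
        printf("openat(%s) failed: %s\n", path, strerror(errno));
        return 1;
    }

    printf("openat(%s) succeeded, fd %d\n", path, fd);
    close(fd);
    return 0;
}

Run from a normal shell in the root mount namespace, this opens the file just fine (the file is there, as noted above), which is exactly why the ENOENT coming out of qemu made no sense until the mount namespace suggestion.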
And what do you know, it is:
[root@a8onekvm1 1463232]# nsenter -m -t 1463232 ls -l /var/lib/one//datastores/0/310/disk.0
lrwxrwxrwx 1 oneadmin oneadmin 28 Feb 16 11:02 /var/lib/one//datastores/0/310/disk.0 -> /dev/storpool-byid/fir.b.f2g
[root@a8onekvm1 1463232]# nsenter -m -t 1463232 ls -l /dev/storpool-byid
ls: cannot access '/dev/storpool-byid': No such file or directory
Sooo… what we had done was change the symlink from pointing to /dev/storpool/XXX to pointing to /dev/storpool-byid/YYY. The mount namespace of the qemu process doesn’t have the second directory (it was not in use when the process was started), and we had just screwed it by moving the symlink.
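(What nsenter -m does can also be done in a few lines of C, which makes the “same path, different namespace, different answer” point very concrete. A minimal sketch, needs root, not part of the original debugging:)

/* ns-open.c -- enter another process's mount namespace and try to open a
 * path there; roughly what `nsenter -m -t <pid> <cmd>` does before exec'ing.
 * Build: cc -o ns-open ns-open.c
 * Run (as root): ./ns-open <pid> <path>
 */
#define _GNU_SOURCE     /* for setns() and CLONE_NEWNS */
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <path>\n", argv[0]);
        return 1;
    }

    char nspath[64];
    snprintf(nspath, sizeof(nspath), "/proc/%s/ns/mnt", argv[1]);

    int nsfd = open(nspath, O_RDONLY);
    if (nsfd < 0) {
        perror(nspath);
        return 1;
    }

    /* switch this process into the target's mount namespace */
    if (setns(nsfd, CLONE_NEWNS) < 0) {
        perror("setns");
        return 1;
    }
    close(nsfd);

    int fd = open(argv[2], O_RDONLY);
    if (fd < 0) {
        printf("open(%s) in the target's mount namespace failed: %s\n",
               argv[2], strerror(errno));
        return 1;
    }
    printf("open(%s) in the target's mount namespace succeeded (fd %d)\n",
           argv[2], fd);
    close(fd);
    return 0;
}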
So, happy with that, on Monday I wrote a lousy script to recreate the links, which one of my colleagues turned into a proper script that people won’t be ashamed to show – for example, replacing
virsh dumpxml one-316 |grep 'source dev.*datastores' | cut -d ' ' -f 2
with
xmlstarlet sel -t -m '//devices/disk/source' -v '@dev' -n "${DOMXML}" 2>/dev/null
Then, we apply this to one VM in the live environment, and happily tell the customer to try a live migration.
It fails with the same error.
(cut some time for swearing at the world and all broken software)
So I’m back to square one, as I try strace-ing the process in the live environment with -e trace=open,openat, and there is no such call. I also tried ltrace, but the only difference was that it was slower and didn’t show anything more.
Then, a colleague says, “You know, these debuginfo packages should exist, they’re not that old”, goes looking in the mirrors, and FINDS THEM. For some reason, google doesn’t index that part, and I could not find them the lazy way.
Here’s the obvious mistake – not looking for yourself. It would’ve been even easier if I had logged in to the local mirror we have for Rocky Linux and run a find for myself.
Armed with debug symbols, I get the test VM again, attach to it, add a breakpoint at qemu_open_internal(), and nothing happens. It doesn’t get hit. It looks like strace wasn’t lying here.
I go back to the blk_root_inactivate() function, and I notice:
Thread 8 "live_migration" hit Breakpoint 1, blk_root_inactivate (child=0x55586bf47c00) at ../block/block-backend.c:252
252         BlockBackend *blk = child->opaque;
(gdb) bt
#0  blk_root_inactivate (child=0x55586bf47c00) at ../block/block-backend.c:252
#1  0x000055586a5e8162 in bdrv_inactivate_recurse (bs=0x55586b297760) at ../block.c:6533
#2  0x000055586a5e963f in bdrv_inactivate_all () at ../block.c:6590
#3  0x000055586a37b535 in qemu_savevm_state_complete_precopy_non_iterable (f=0x55586c05d500, in_postcopy=<optimized out>, inactivate_disks=<optimized out>) at ../migration/savevm.c:1441
#4  0x000055586a37b622 in qemu_savevm_state_complete_precopy (f=0x55586c05d500, iterable_only=iterable_only@entry=false, inactivate_disks=inactivate_disks@entry=true) at ../migration/savevm.c:1493
#5  0x000055586a374236 in migration_completion (s=0x55586b207000) at ../migration/migration.c:3260
#6  migration_iteration_run (s=0x55586b207000) at ../migration/migration.c:3672
#7  migration_thread (opaque=0x55586b207000) at ../migration/migration.c:3908
#8  0x000055586a6d9dc4 in qemu_thread_start (args=0x55586c0ee6c0) at ../util/qemu-thread-posix.c:585
#9  0x00007f4a951d11ca in start_thread () from target:/lib64/libpthread.so.0
#10 0x00007f4a94e3de73 in clone () from target:/lib64/libc.so.6
(gdb) n
254         if (blk->disable_perm) {
(gdb) n
bdrv_inactivate_recurse (bs=0x55586b297760) at ../block.c:6534
blk->disable_perm is already set, so it looks like in this version of qemu (a bit older than what I initially used), when the live migration fails, this flag doesn’t get reset. I made a diff between the two versions, and that part has enough changes to warrant this idea. So, VMs that didn’t fail the migration would be fine.
Now, what about this one and a few more? They seem to be in a corrupt state. I read a bit more code, then bite the bullet and change the variable live:
Thread 7 "live_migration" hit Breakpoint 1, blk_root_inactivate (child=0x55586bf47c00) at ../block/block-backend.c:254
254         if (blk->disable_perm) {
(gdb) bt
#0  blk_root_inactivate (child=0x55586bf47c00) at ../block/block-backend.c:254
#1  0x000055586a5e8162 in bdrv_inactivate_recurse (bs=0x55586b297760) at ../block.c:6533
#2  0x000055586a5e963f in bdrv_inactivate_all () at ../block.c:6590
#3  0x000055586a37b535 in qemu_savevm_state_complete_precopy_non_iterable (f=0x55586c05d500, in_postcopy=<optimized out>, inactivate_disks=<optimized out>) at ../migration/savevm.c:1441
#4  0x000055586a37b622 in qemu_savevm_state_complete_precopy (f=0x55586c05d500, iterable_only=iterable_only@entry=false, inactivate_disks=inactivate_disks@entry=true) at ../migration/savevm.c:1493
#5  0x000055586a374236 in migration_completion (s=0x55586b207000) at ../migration/migration.c:3260
#6  migration_iteration_run (s=0x55586b207000) at ../migration/migration.c:3672
#7  migration_thread (opaque=0x55586b207000) at ../migration/migration.c:3908
#8  0x000055586a6d9dc4 in qemu_thread_start (args=0x55586c2325e0) at ../util/qemu-thread-posix.c:585
#9  0x00007f4a951d11ca in start_thread () from target:/lib64/libpthread.so.0
#10 0x00007f4a94e3de73 in clone () from target:/lib64/libc.so.6
(gdb) print ( (BlockBackend *) (child->opaque))->disable_perm
$1 = true
(gdb) print ( (BlockBackend *) (child->opaque))->perm
$2 = 3
(gdb) set variable ( (BlockBackend *) (child->opaque))->disable_perm = false
No symbol "false" in current context.    # arghhhh
(gdb) set variable ( (BlockBackend *) (child->opaque))->disable_perm = 0
(gdb) cont
Continuing.

Thread 7 "live_migration" hit Breakpoint 1, blk_root_inactivate (child=0x55586bf51850) at ../block/block-backend.c:254
254         if (blk->disable_perm) {
(gdb) print ( (BlockBackend *) (child->opaque))->disable_perm
$3 = false
(gdb) print ( (BlockBackend *) (child->opaque))->perm
$4 = 1
(gdb) cont
Continuing.
[Thread 0x7f4999ffb700 (LWP 144024) exited]
[Thread 0x7f499a7fc700 (LWP 144025) exited]

Thread 1 "qemu-kvm-one" received signal SIGTERM, Terminated.    # oops?
[Switching to Thread 0x7f4a985a6e40 (LWP 3460894)]
0x00007f4a94f28036 in ppoll () from target:/lib64/libc.so.6
(gdb) q
(Imagine this being done live and as quickly as possible, as the VM is hanging. You also see two devices here: OpenNebula adds a context image to all VMs as a CD-ROM device, and that’s the one the failed migration didn’t get to corrupt.)
Then I see SIGTERM and think, “crap, I killed it”. Then I look at the logs, and it has actually migrated successfully.
So, finally, with some pointers (“the command in gdb is command” :) ) I write the following:
set print pretty
set pagination off
handle SIGUSR1 nostop pass
handle SIGTERM nostop pass
break block/block-backend.c:253
commands
  print ( (BlockBackend *) (child->opaque))->disable_perm
  print ( (BlockBackend *) (child->opaque))->perm
  set variable ( (BlockBackend *) (child->opaque))->disable_perm = 0
  cont
end
cont
and it works like a charm on the 10-or-something VMs with a corrupt state.
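(If you don’t want to paste this interactively every time, the same command file can be fed to gdb with something like gdb -p <pid> -x fix-perm.gdb – the file name here is just an example, but -x reads commands from a file and -p attaches to a running process.)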
A list of errors and lessons learned:
– Don’t assume that just because something isn’t right, it’s the cause of the problem you’re looking at (the QEMU CPU type of the VMs here);
– Collect all relevant information (and maybe even irrelevant) – I keep telling people that it’s easier to work with too much information than with not enough. I should listen to myself one of these days;
– strace can hide syscalls and shouldn’t be fully trusted;
– If you’re searching for debuginfo (or anything like that), do your legwork, and don’t trust a single search engine;
– Testing the wrong software version can reveal only a part of the problem (or a different one entirely).
Tags: работа (work)
March 25th, 2024 at 14:28
Hi,
Thanks for sharing. Are you positive that “strace” indeed misses syscalls?
You said that it worked well with “strace -f -p 1463232 -o dbg.open -s 255 -e trace=open,openat”. What options did you use when it missed the open() syscall?
March 25th, 2024 at 15:28
“strace -f -p 1463232 -o dbg.open -s 255”, just without “-e”. And I’m sure, because I grepped for it in the debug outputs I had (I usually trace everything).