igb: Developing a new device for QEMU (part 4: Debugging and submitting patches)

This is the last post of a walkthrough of new device emulator development for QEMU which uses igb, a recently introduced Intel NIC emulation as an example.

The first post roughly described there are several development steps:

Writing a boilerplate and adding basic features
Adding QTest
Adding advanced features and libvirt support
Running more tests and increasing feature coverage
Debugging
Contributing to the upstream

This post discusses 5 and 6. Though this post is written with new device emulator development in mind, the debugging techniques shown here apply to other kinds of QEMU development as well.

Debugging

Debugging is the most interesting (and tiresome) part of the device emulator development.

The first step of debugging is to make a bug reproduction case. Usually, you will start debugging by creating a minimal reproduction case, but it is often impractical for low-level software like QEMU as the higher-level software is just too big; you cannot trim Linux down. The higher-level software may even lack source code (c.f., Windows). In such a situation, you can’t do anything but use the real workload as a reproduction case. That also means there can be bugs anywhere: guest user-space program, guest kernel, QEMU, or even compiler.

But don’t worry. It is always possible to figure out a bug no matter how the reproduction case is complex. You just need right tools, knowledge, methodology, and patience.

Tools

First, let me introduce useful tools. You can’t (or don’t want to) debug everything just with printf(), but you don’t need dozens of tools either. The most important thing is to choose tools according to the situation.

Hand-written scripts

The first thing to do, even before you try to figure out the bug, is to make the iterations fast. Write some scripts to automate repetitive tasks, especially the program launch. Scripts I used during development are available at: https://github.com/akihikodaki/q

You can choose whatever script language. Below is possible options:

Shell script

Shell is available everywhere so it may be handy. Unfortunately, it has various quirks and compatibility issues.

C

C is actually not that bad. Basically, you just need to call system() everywhere and omit error handling to write it “script-like”.

The downside is that it requires compiling and is bad at error handling.

Ruby

An interesting characteristic of Ruby is that it exposes various Unix functions. It can also call C functions without any external library so you can have unlimited access to the underlying system.

While Ruby is a good option for system scripting as described above, it is not as popular as Python or JavaScript. For example, Visual Studio Code comes with mature support for Python and JavaScript, but Microsoft does not directly provide support for Ruby.

Node.js

Node.js employs JavaScript language. Modern JavaScript is comfortable to write, and its rich documentation on MDN and formal specification are helpful if you are not familiar with the language.

Node.js’ asynchronous nature is also attractive for scripting. For example, if you need to buildProject() and copyFiles() (assuming both of them return Promise) in parallel, you can just write as follows:

await Promise.all([buildProject(), copyFiles()])

Google’s zx makes it even easier to write scripts.

However, Node.js is somewhat more restrictive than Ruby so you cannot easily interact with file descriptors, for example.

git-bisect

The man page says:

git-bisect - Use binary search to find the commit that introduced a bug

In principle, it’s so simple and very effective to debug regressions, but sometimes the bug you are debugging may involve several regressions introduced in different commits. Naively performing binary search may not be sufficient in such a case so you need to keep considering what is happening during git-bisect.

Sanitizers

Sanitizers detect common bugs in C code at runtime. AddressSanitizer and UndefinedBehaviorSanitizer can be enabled by passing the --enable-sanitizers flag to the configure script. I recommend keeping them enabled.

Note that the entire QEMU code base is not free of addressability issues and undefined behaviors. It’s better to keep your workload minimal to avoid noise from existing issues. There were several noisy bugs affecting my minimal workload so I fixed them to keep the console clean¹²³⁴.

printf()

It may sound stupid but, in fact, printf() is a great tool for debugging. It works everywhere, and it has all of the expressiveness of the programming language. Want the state of the program when the loop count is 42? Just write if (i == 42). Need to parse the bytes in an array? Write a parsing function and call it. It’s so simple and effective.

A downside is that such code can be often left after debugging. You may annotate the logging code with a debug flag to pollute the console, which mitigates the problem but does not solve it; it’ll still end up with a lot of logs you cannot really digest when the debug flag is set.

The rule of thumb for printf() debugging is to check in the code beforehand. You can then just git checkout . (or safer: git stash) to discard all of debug code at once. Another useful trick is to add your name to debug logs. A person’s name is often very distinguishable from the other code; you can easily find it with grep. And if you forgot to delete it, others will ask you: “Hey, there is your name in the code, did you forget to remove it?”

Another problem with printf() debugging is that it needs recompilation. There may be tools for live patching, but live patching has inherently confusing as the state of the old code persists. Don’t stick to printf(), and switch to debugging using GDB/LLDB when appropriate.

tracepoints

QEMU has tracing infrastructure. It is basically printf() with external tooling support, but it is still useful even if you just use the default “log” trace backend, which outputs traces to standard error as it allows filtering traces.

For example, you can get all traces from igb (except e1000x-common code) by adding the following to the command line:

-trace 'igb_*'

The name of tracepoints for igb is listed in hw/net/trace-events. Tracepoints for other devices are listed in corresponding trace-events files.

The output of this pattern will be huge as it enables all tracepoints igb has so you can first skim it and narrow down the code. If you find something related to IRQ suspicious, you can write:

-trace 'igb_irq_*'

…and keep narrowing it down. It is also possible to enable/disable tracepoints at runtime using the QEMU monitor, which will be described later.

To add a new trace point, you can add a new line to the trace-events file and call the corresponding trace function. For example,

igb_set_vfmailbox(uint32_t vf_num, uint32_t val) "VFMailbox[%d]: 0x%x"

It is easy to filter tracepoints, but it is still important not to add too many tracepoints. The documentation gives some hints for adding new trace events:

Trace state changes in the code. Interesting points in the code usually involve a state change like starting, stopping, allocating, freeing. State changes are good trace events because they can be used to understand the execution of the system.

Trace guest operations. Guest I/O accesses like reading device registers are good trace events because they can be used to understand guest interactions.

Use correlator fields so the context of an individual line of trace output can be understood. For example, trace the pointer returned by malloc and used as an argument to free. This way mallocs and frees can be matched up. Trace events with no context are not very useful.

Name trace events after their function. If there are multiple trace events in one function, append a unique distinguisher at the end of the name.

In other words, you should think carefully if you add a trace event that does not fit well with this description. For example, it usually does not make sense to add tracepoints for individual if statements or functions. If there are well-designed trace events to trace state changes and guest operations with meaningful correlator fields, you should be able to infer which if statements or functions will be evaluated.

GDB and LLDB

GDB and LLDB are debuggers for C programs. They both provide features you will expect; it allows you to pause the execution at some point of code (breakpoint) or to pause the execution when it accesses some region of memory (watchpoint) without recompilation. GDB is widely used so you can expect good support from debugged programs. LLDB is a more modern alternative and (in my experience) has superior reliability. You can pick either of them or use both of them, depending on the situation.

As QEMU debugging involves software at various levels, you need to attach GDB to some or even all of them, depending on the nature of the bug. In particular, there are three levels you may be interested in:

QEMU
Guest kernel
Guest userspace

QEMU

Attaching GDB or LLDB to QEMU is no different from attaching it to other userspace programs; simply put, you just type:

gdb build/qemu-system-aarch64

QEMU also has some GDB scripts that tweak it for debugging QEMU so it’s better to use them. To use them, load the .gdbinit file in the QEMU source tree. GDB has a feature to load .gdbinit automatically, but usually, it’s disabled for the security reason so type the following to load it manually:

source .gdbinit

Guest kernel

Kernels often implement remote debugging protocols; one of our blog posts covers such a protocol of Windows. Linux also comes with a similar feature called kgdb. However, they are sometimes unreliable when e.g., interrupts happen.

The remote debugging feature of QEMU is more reliable as it operates in a more lower-level. This feature is documented well, but, in short, start QEMU with option -s added to the QEMU command line, start gdb (with no argument), and type:

target remote :1234

Now GDB is attached to the guest and you can freely place breakpoints and watchpoints.

However, this comes with no debug information so it’s not practically usable in this state. To use debug information from a guest kernel, you need to take some procedure specific to it. Linux comes with a good documentation describing the procedure to debug it with QEMU so I recommend reading it. A common pitfall is that KASLR confuses the symbolizer, which is described in the documentation. I also had to disable in-kernel pointer authentication (CONFIG_ARM64_PTR_AUTH_KERNEL) for my Arm64 environment.

It may be still necessary to interrupt the guest execution without knowing the address to place a breakpoint. For example, Windows does support debugging the kernel with QEMU, but it requires another Windows machine and so troublesome to set up. It doesn’t come with source code so you cannot know the exact address to place a breakpoint anyway. A useful technique in such a case is to pause the execution from the QEMU device code.

First, add the following includes to the head of the source file:

#include "exec/gdbstub.h"
#include "hw/core/cpu.h"
#include "sysemu/runstate.h"

Now you can add the following to the point where you want to pause the guest:

gdb_set_stop_cpu(current_cpu);
vm_stop(RUN_STATE_PAUSED);

The first line switches the CPU GDB interacts with. The second line actually stops the VM. You can add conditions with if or do whatever since it’s C.

filter-dump

Sometimes there may be debug tools specialized for a particular feature. filter-dump is such a tool included in QEMU for networking and allows to dump network traffic in QEMU. It’s usually superior to dumpcap running on a tap device attached to QEMU netdev as it runs at the very end of QEMU; it can see corrupted packets sent from QEMU that will be filtered out by the tap device, for example.

This tool was not working with packets with virtio-net headers attached so I wrote a fix. Such investment for a specialized tool is not essential but pays off well.

QEMU monitor

QEMU monitor allows you to interact with a live QEMU session. You can change the configuration of QEMU by changing the command line, but it requires frequent restart so it’s recommended to learn how to change the configuration at runtime with the QEMU monitor. Sometimes live configuration change is inevitable; you may need to add and remove filter-dump to capture only a particular part of network traffic, for example.

Snapshotting and hotplugging are only available with interactive access with QEMU. You should occasionally test them as they are error-prone features.

Case study

This section describes actual bug hunting during the igb development. The cause of the bugs are not limited to QEMU, but include guest kernel and compiler.

DMA reentry bug

After sending igb patches to the upstream, several crash reports generated with fuzzer came up:

Heap-use-after-free in e1000e_receive_internal (#1543) · Issues · QEMU / QEMU · GitLab

This crash happened with e1000e, but I still got the report as I did lots of cleanup before copying the e1000e code for igb. It is also likely that a bug affecting e1000e affects igb as well.

Locally reproducing the crash resulted in the following logs:

==17147==ERROR: AddressSanitizer: heap-use-after-free on address 0xfffec4259798 at pc 0xaaabab108f88 bp 0xffffe3a4a780 sp 0xffffe3a4a790
READ of size 8 at 0xfffec4259798 thread T0
    #0 0xaaabab108f84 in e1000e_write_packet_to_guest ../hw/net/e1000e_core.c:1651
    #1 0xaaabab108f84 in e1000e_receive_internal ../hw/net/e1000e_core.c:1776
    #2 0xaaabab0cb978 in net_tx_pkt_send_custom ../hw/net/net_tx_pkt.c:823
    #3 0xaaabab0fe608 in e1000e_tx_pkt_send ../hw/net/e1000e_core.c:682
    #4 0xaaabab0fe608 in e1000e_process_tx_desc ../hw/net/e1000e_core.c:762
    #5 0xaaabab0fe608 in e1000e_start_xmit ../hw/net/e1000e_core.c:953
    #6 0xaaabab102694 in e1000e_set_tctl ../hw/net/e1000e_core.c:2496
    #7 0xaaabab10f918 in e1000e_core_write ../hw/net/e1000e_core.c:3349
    #8 0xaaababc913bc in memory_region_write_accessor ../softmmu/memory.c:493
(snip)

0xfffec4259798 is located 24 bytes inside of 208-byte region [0xfffec4259780,0xfffec4259850)
freed by thread T0 here:
    #0 0xfffecfc9d7cc in __interceptor_free /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:52
    #1 0xaaabab0cc3d0 in net_rx_pkt_iovec_realloc ../hw/net/net_rx_pkt.c:76
    #2 0xaaabab0cc3d0 in net_rx_pkt_pull_data ../hw/net/net_rx_pkt.c:99
    #3 0xaaabab0cf33c in net_rx_pkt_attach_iovec_ex ../hw/net/net_rx_pkt.c:153
    #4 0xaaabab102e70 in e1000e_receive_internal ../hw/net/e1000e_core.c:1764
    #5 0xaaabab0cb978 in net_tx_pkt_send_custom ../hw/net/net_tx_pkt.c:823
    #6 0xaaabab0fe608 in e1000e_tx_pkt_send ../hw/net/e1000e_core.c:682
    #7 0xaaabab0fe608 in e1000e_process_tx_desc ../hw/net/e1000e_core.c:762
    #8 0xaaabab0fe608 in e1000e_start_xmit ../hw/net/e1000e_core.c:953
    #9 0xaaabab102694 in e1000e_set_tctl ../hw/net/e1000e_core.c:2496
    #10 0xaaabab10f918 in e1000e_core_write ../hw/net/e1000e_core.c:3349
    #11 0xaaababc913bc in memory_region_write_accessor ../softmmu/memory.c:493
(snip)
    #21 0xaaabab0e5e5c in pci_dma_write /home/alarm/q/var/qemu/include/hw/pci/pci_device.h:271
    #22 0xaaabab0e5e5c in e1000e_write_to_rx_buffers ../hw/net/e1000e_core.c:1474
    #23 0xaaabab105218 in e1000e_write_packet_to_guest ../hw/net/e1000e_core.c:1646
    #24 0xaaabab105218 in e1000e_receive_internal ../hw/net/e1000e_core.c:1776
    #25 0xaaabab0cb978 in net_tx_pkt_send_custom ../hw/net/net_tx_pkt.c:823
    #26 0xaaabab0fe608 in e1000e_tx_pkt_send ../hw/net/e1000e_core.c:682
    #27 0xaaabab0fe608 in e1000e_process_tx_desc ../hw/net/e1000e_core.c:762
    #28 0xaaabab0fe608 in e1000e_start_xmit ../hw/net/e1000e_core.c:953
    #29 0xaaabab102694 in e1000e_set_tctl ../hw/net/e1000e_core.c:2496
    #30 0xaaabab10f918 in e1000e_core_write ../hw/net/e1000e_core.c:3349
    #31 0xaaababc913bc in memory_region_write_accessor ../softmmu/memory.c:493
(snip)

previously allocated by thread T0 here:
    #0 0xfffecfc9ed24 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0xfffece700ea4 in g_malloc (/usr/lib/libglib-2.0.so.0+0x70ea4)
    #2 0xaaabab0cc3d8 in net_rx_pkt_iovec_realloc ../hw/net/net_rx_pkt.c:77
    #3 0xaaabab0cc3d8 in net_rx_pkt_pull_data ../hw/net/net_rx_pkt.c:99
    #4 0xaaabab0cf33c in net_rx_pkt_attach_iovec_ex ../hw/net/net_rx_pkt.c:153
    #5 0xaaabab102e70 in e1000e_receive_internal ../hw/net/e1000e_core.c:1764
    #6 0xaaabab0cb978 in net_tx_pkt_send_custom ../hw/net/net_tx_pkt.c:823
    #7 0xaaabab0fe608 in e1000e_tx_pkt_send ../hw/net/e1000e_core.c:682
    #8 0xaaabab0fe608 in e1000e_process_tx_desc ../hw/net/e1000e_core.c:762
    #9 0xaaabab0fe608 in e1000e_start_xmit ../hw/net/e1000e_core.c:953
    #10 0xaaabab102694 in e1000e_set_tctl ../hw/net/e1000e_core.c:2496
    #11 0xaaabab10f918 in e1000e_core_write ../hw/net/e1000e_core.c:3349
    #12 0xaaababc913bc in memory_region_write_accessor ../softmmu/memory.c:493
(snip)

I omitted most stack trace entries of common QEMU code for brevity.

Looking at the logs, it seems use-after-free happened at line 1651 of hw/net/e1000e_core.c. The content of the line is:

if (iov_ofs == iov->iov_len) {

This means iov was a dangling pointer when AddressSanitizer got triggered. iov is assigned very early in the e1000e_write_packet_to_guest function and the function appears in the stack trace of free so it’s likely that the iov was valid at the beginning of the function but it got corrupted while the function progresses. This leads to two possible explanations for the bug:

It was a mistake to retrieve iov early and to keep it in the function.
It was a mistake to free iov in the function.

To prove explanation 2, I looked at the stack trace of free. The stack trace shows it tried to write the packet with DMA. In common sense, such DMA should write RAM, but the stack trace says it triggered another packet transmission. This sounds very wrong. In a nutshell, the following is what happened here:

The address of the MMIO register to trigger transmission was specified as the address of the received packet buffer.
Packet transmission was triggered.
The packet was looped back, and written to the address of the received packet buffer, which in reality points to the MMIO register specified in 1.
Another packet transmission and loopback happened, corrupting the first loopback.
The second loopback finishes, and the first loopback resumes.
The first loopback sees the corrupted state and AddressSanitizer triggers.

The first condition is very unlikely to happen in the real world, but a fuzzer and malicious actors it tries to mimic can create such an abnormal situation.

I could devise a fix with this explanation; such a fix should add a flag which indicates an MMIO access already happened and reject another MMIO access if the flag is set. The easiest way is to add such a flag to e1000e, but considering the entire QEMU program, it is likely that other devices need the same flag so I decided to add it to the common memory subsystem.

Eventually, I finished my fix and sent the upstream to find out there is already a fix for this kind of situation.

Lessons learned

Sanitizers and fuzzers do help.
It is sometimes better to fix the common code instead of fixing the device-specific code.
Don’t forget to look for patches and issue reports for a common code bug ;)

Linux bug

I first tried Linux Test Project with e1000e instead of igb. That led to a kernel warning and device reset:

[  194.808645] NETDEV WATCHDOG: enp0s4 (e1000e): transmit queue 0 timed out
[  194.809321] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x6f4/0x7f0
[  194.810092] Modules linked in:
[  194.810388] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.3.0-rc6+ #24
[  194.810987] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-stable202302-for-qemu 03/01/2023
[  194.811805] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[  194.812461] pc : dev_watchdog+0x6f4/0x7f0
[  194.812845] lr : dev_watchdog+0x6f4/0x7f0
[  194.813233] sp : ffff800013227ad0
[  194.813552] x29: ffff800013227ad0 x28: ffff0000ef528510 x27: ffff80000e7cc000
[  194.814237] x26: 0000000000000003 x25: 0000000000000000 x24: 0000000000000000
[  194.814916] x23: ffff80000e7cc520 x22: ffff80000e7afac0 x21: ffff0000ef528420
[  194.815602] x20: ffff0000ef528000 x19: 1ffff00002644f6a x18: 0000000000000000
[  194.816284] x17: 1ffff00001f49ab0 x16: 0000000000000000 x15: 00000000a0f72ce0
[  194.818185] x14: 0000000000000000 x13: 0000000000000001 x12: ffff700002644ef3
[  194.820795] x11: 1ffff00002644ef2 x10: ffff700002644ef2 x9 : ffff80000831e908
[  194.823443] x8 : ffff800013227797 x7 : 00008ffffd9bb10e x6 : 0000000000000001
[  194.826174] x5 : ffff800013227790 x4 : 1fffe000184e560a x3 : dfff800000000000
[  194.828648] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000c272b040
[  194.831117] Call trace:
[  194.831960]  dev_watchdog+0x6f4/0x7f0
[  194.833231]  call_timer_fn+0x1f4/0x860
[  194.834378]  __run_timers+0x73c/0xa50
[  194.835201]  run_timer_softirq+0x54/0xb8
[  194.836098]  __do_softirq+0x304/0xe4c
[  194.836986]  ____do_softirq+0x14/0x28
[  194.837879]  call_on_irq_stack+0x24/0x30
[  194.838782]  do_softirq_own_stack+0x20/0x30
[  194.839736]  __irq_exit_rcu+0x23c/0x5e8
[  194.840615]  irq_exit_rcu+0x18/0x88
[  194.841421]  el1_interrupt+0x34/0x50
[  194.842314]  el1h_64_irq_handler+0x14/0x20
[  194.843321]  el1h_64_irq+0x68/0x70
[  194.844117]  cpuidle_idle_call+0x254/0x358
[  194.844856]  do_idle+0x124/0x1e0
[  194.845460]  cpu_startup_entry+0x2c/0x38
[  194.846174]  secondary_start_kernel+0x12c/0x168
[  194.846976]  __secondary_switched+0xa4/0xa8
[  194.847741] irq event stamp: 447143
[  194.848385] hardirqs last  enabled at (447142): [<ffff80000b28db18>] el1_interrupt+0x40/0x50
[  194.849731] hardirqs last disabled at (447143): [<ffff80000b28d4e0>] el1_dbg+0x20/0x90
[  194.850987] softirqs last  enabled at (446880): [<ffff800008011128>] __do_softirq+0x8f0/0xe4c
[  194.852350] softirqs last disabled at (447119): [<ffff80000801b184>] ____do_softirq+0x14/0x28

The call trace originates from a watchdog timer and does not tell much. The error message implies something with the transmission is wrong, but it is not enough to identify the bug.

I first minimized the reproduction case because repetitively running LTP takes too long. The test that triggered the warning was a test to make parallel HTTP requests. In the end, I figured out that this bug can be reproduced by running Apache on e1000e and requesting a 2-GiB file twice in parallel.

As I wrote earlier, it originates from a watchdog timer, which triggers a bit after a problematic behavior occurs. This means it is unlikely that you can understand the issue by looking at the last few QEMU traces.

It is necessary to inspect the kernel to identify the true origin of the warning. I attached GDB to the guest and put the breakpoint at the line that emits the warning. Below is the snippet from the kernel:

for (i = 0; i < dev->num_tx_queues; i++) {
	struct netdev_queue *txq;

	txq = netdev_get_tx_queue(dev, i);
	trans_start = READ_ONCE(txq->trans_start);
	if (netif_xmit_stopped(txq) &&
	    time_after(jiffies, (trans_start +
				 dev->watchdog_timeo))) {
		some_queue_timedout = 1;
		atomic_long_inc(&txq->trans_timeout);
		break;
	}
}

if (unlikely(some_queue_timedout)) {
	trace_net_dev_xmit_timeout(dev, i);
	WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
	       dev->name, netdev_drivername(dev), i);
	netif_freeze_queues(dev);
	dev->netdev_ops->ndo_tx_timeout(dev, i);
	netif_unfreeze_queues(dev);
}

When the breakpoint hit, I typed: p *dev. This spams the console with all of the information related to the network device. Skimming the output, I found the value of the name member is enp0s4, which is the name of the e1000e device and confirms it is a problem with e1000e.

Investigating further requires a more educated methodology than just skimming p *dev. Below is the definition of netdev_get_tx_queue.

static inline
struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
					 unsigned int index)
{
	return &dev->_tx[index];
}

p dev->num_tx_queues tells it’s 1, so dev->_tx[0] should be the queue that timed out. Then I looked at netif_xmit_stopped:

static inline bool netif_xmit_stopped(const struct netdev_queue *dev_queue)
{
	return dev_queue->state & QUEUE_STATE_ANY_XOFF;
}

where QUEUE_STATE_ANY_XOFF is defined as QUEUE_STATE_DRV_XOFF | QUEUE_STATE_STACK_XOFF. p dev->_tx[0].state showed QUEUE_STATE_DRV_XOFF was set so certainly this transmit queue is stopped.

Searching the kernel code base for QUEUE_STATE_DRV_XOFF, it seemed that the bit is cleared with netif_tx_wake_queue and netif_tx_start_queue. Looking for these symbols in drivers/net/ethernet/intel/e1000e led to the e1000_clean_tx_irq function, which means the guest is not receiving tx IRQ after the queue is stopped. I looked for kernel code that requests transmission since tx IRQ will be generated only after that, finding such code in the e1000_xmit_frame function.

netdev_sent_queue(netdev, skb->len);
e1000_tx_queue(tx_ring, tx_flags, count);
/* Make sure there is space in the ring for the next send. */
e1000_maybe_stop_tx(tx_ring,
		    (MAX_SKB_FRAGS *
		     DIV_ROUND_UP(PAGE_SIZE,
				  adapter->tx_fifo_limit) + 2));

if (!netdev_xmit_more() ||
    netif_xmit_stopped(netdev_get_tx_queue(netdev, 0))) {
	if (adapter->flags2 & FLAG2_PCIM2PCI_ARBITER_WA)
		e1000e_update_tdt_wa(tx_ring,
				     tx_ring->next_to_use);
	else
		writel(tx_ring->next_to_use, tx_ring->tail);
}

It seemed the kernel requests transmission by writing the first TDT register, which is pointed to by tx_ring->tail. If QEMU works properly, the last call for the e1000e_set_tdt function in the device emulation code before the reset happens should be matched with a call for the e1000e_set_interrupt_cause function with E1000_ICR_TXQ0 as its argument. To confirm this, I added printf calls to print when e1000e_set_tdt is called or e1000e_set_interrupt_cause is called with E1000_ICR_TXQ0, and enabled e1000e_core_ctrl_sw_reset trace.

Surprisingly, e1000e_set_interrupt_cause was always happening after the last e1000e_set_tdt call, meaning QEMU emulates the device properly. This made me question the kernel driver implementation. There was an earlier call of the e1000_maybe_stop_tx kernel function, which stops the Tx queue, in e1000_xmit_frame, but it was not followed with TDT register write:

int count = 0;

/* snip */

/* reserve a descriptor for the offload context */
if ((mss) || (skb->ip_summed == CHECKSUM_PARTIAL))
	count++;
count++;

count += DIV_ROUND_UP(len, adapter->tx_fifo_limit);

nr_frags = skb_shinfo(skb)->nr_frags;
for (f = 0; f < nr_frags; f++)
	count += DIV_ROUND_UP(skb_frag_size(&skb_shinfo(skb)->frags[f]),
			      adapter->tx_fifo_limit);

if (adapter->hw.mac.tx_pkt_filtering)
	e1000_transfer_dhcp_info(adapter, skb);

/* need: count + 2 desc gap to keep tail from touching
 * head, otherwise try next time
 */
if (e1000_maybe_stop_tx(tx_ring, count + 2))
	return NETDEV_TX_BUSY;

I added writel(tx_ring->next_to_use, tx_ring->tail); before the return statement and it worked.

So this was a kernel bug. Before submitting it as a fix, I tried to understand the intention of e1000_xmit_frame. Below is the summary of this function:

Check if there is sufficient space to put the current packet in the tx ring. Otherwise, return NETDEV_TX_BUSY.
Queue the current packet to the tx ring.
Stop the tx queue if there is no sufficient space for the next packet in tx ring.
Trigger transmission if there are no more packets or the tx is stopped.

And what I observed is that the check in step 1 was failing. In such a case, the previous e1000_xmit_frame should have stopped the tx queue in step 3 triggered transmission in step 4. In fact, Documentation/networking/netdevices.rst states the check in step 1 should never fail:

NETDEV_TX_BUSY Cannot transmit packet, try later Usually a bug, means queue start/stop flow control is broken in the driver. Note: the driver must NOT put the skb in its DMA ring.

This implies it is a bug that the tx queue was not stopped in step 3 so I inspected the condition to stop the tx queue in step 3. e1000_maybe_stop_tx stops the queue if the tx queue has space specified with the size argument. In this case, the value is: (MAX_SKB_FRAGS * DIV_ROUND_UP(PAGE_SIZE, adapter->tx_fifo_limit) + 2)

This looks so complex and suspicious. I followed the computation of the size argument of the earlier e1000_maybe_stop_tx call triggering NETDEV_TX_BUSY in step 1 to find its maximum value. There are two increments of count at the beginning so that counts two. DIV_ROUND_UP(len, adapter->tx_fifo_limit) will be MAX_SKB_FRAGS at maximum. The later iterations on fragments will be summed into MAX_SKB_FRAGS * DIV_ROUND_UP(PAGE_SIZE, adapter->tx_fifo_limit) at maximum. In the end, 2 is added to count. Summarizing this, the maximum value will be:

2 + MAX_SKB_FRAGS + MAX_SKB_FRAGS * DIV_ROUND_UP(PAGE_SIZE, adapter->tx_fifo_limit) + 2
= (MAX_SKB_FRAGS + 1) * DIV_ROUND_UP(PAGE_SIZE, adapter->tx_fifo_limit) + 4

This is greater than the value used in step 4. A patch to update the value fixed the problem so I submitted the upstream and it’s now merged⁵. This bug was present since the introduction of the e1000e driver in 2007 and 15-year-old when it was fixed. It is surprising considering the affected workload is a plain common HTTP server.

Lessons learned

Don’t naively trust the kernel even if the code is mature. (I also hit several other kernel bugs⁶⁷⁸⁹.)
The kernel source visibility matters.
Sometimes it’s necessary to track the bug both from the host and guest side.

GCC bug

We covered a kernel bug so the next thing to cover is obviously (?) a compiler bug. This compiler bug was found when running the Internet Protocol (IP) Pipeline application of DPDK under DPDK Test Suite. A DPDK program has a somewhat unconventional architecture that polls the hardware with an infinite loop, which confused GCC.

It was quite easy to locate the bug except it was something that looked impossible. It was clear that the DPDK application hung since the QEMU trace and the output of the application stopped. That usually implies QEMU did something wrong and the application got confused by that and froze. However, I could not find any bug looking at the source code of the application or adding lots of printf() and attaching GDB. It seemed the request for the worker thread is silently ignored so I looked at the implementation of the request queue to find no bug.

After observing this impossible behavior for an hour, eventually, I decided to trim down the infinite loop of the worker thread. There were several conditions to process requests and function calls so I removed each of them and replaced the request processing function with simple puts(). Below is what I got in the end:

#include <stdio.h>

int g;
int h;

int main(void)
{
	for (int i = 0; ; i++) {
		for (int j = 0; j < g; j++);

		if (i & 1) {
			if (h)
				continue;

			if (g)
				puts("a");

			g = 1;
		}
	}
}

This code should indefinitely emit a, but it did nothing. It was indeed a compiler bug.

To identify the buggy optimization pass I replaced the -O3 compiler flag with individual optimization flags listed in the documentation and performed binary search and derived the minimal reproducing combination of flags: -O1 -ftree-pre -ftree-partial-pre

The bug was reported to the upstream and promptly fixed¹⁰. Meanwhile, as it takes a long for a fix to be available on a distribution, I added -fno-tree-partial-pre to the compiler flag to disable the faulty optimization.

Lessons learned

Something that looks impossible happens sometimes. (I hit yet another seemingly-impossible bug during igb development¹¹. Although it was not the fault of the compiler, it also teaches the same lesson.)
The usual methodologies like minimizing the reproduction case or binary search do apply to unusual cases.

Contributing to the upstream

With all known bugs fixed, it’s time to submit the changes to the upstream. QEMU has extensive documentation for patch submission.

This process is a repetition of getting reviews and submitting a new version. New device emulation code is often large as in the case of igb, and such code takes many rounds of reviews; it took 9 versions before igb gets merged¹². As igb and e1000e have generic virtual-device fuzzing enabled, I also received reports of bugs found with the fuzzer and had to fix each of them.

Patience you must have, my young Padawan. - Master Yoda

The merge of the new device code marks the end of the initial development, but it is not the end of the entire development process. Introducing a new device makes you its maintainer so you will get patches and bug reports for it. Community involvement keeps the code being improved.

Conclusion

This series covered what new device emulation code for QEMU looks like by depicting the introduction of igb, which was a 6-month journey¹³. It is complex and time-consuming, but tractable if you break it down into several steps and divide-and-conquer. The new device code will benefit the entire QEMU community and downstream once you finish the process and get it merged.

target/arm: Initialize debug capabilities only once (ad5c6dde) · Commits · QEMU / QEMU · GitLab ↩
vhost-user-gpio: Configure vhost_dev when connecting (daae36c1) · Commits · QEMU / QEMU · GitLab ↩
vhost-user-fs: Back up vqs before cleaning up vhost_dev (331acddc) · Commits · QEMU / QEMU · GitLab ↩
vhost-user-i2c: Back up vqs before cleaning up vhost_dev (0126793b) · Commits · QEMU / QEMU · GitLab ↩
e1000e: Fix TX dispatch condition ↩
igb: Allocate MSI-X vector when testing ↩
igb: Enable SR-IOV after reinit ↩
igbvf: Regard vf reset nack as success ↩
iommu/virtio: Detach domain on endpoint release ↩
109002 – [13 Regression] -O1 -ftree-pre -ftree-partial-pre results in stall value ↩
ip: Enforce strict aliasing (!131) · Merge requests · slirp / libslirp · GitLab ↩
[v9] Introduce igb | Patchew ↩
This does not include the time spent on the initial bring-up that was already done when I started working on igb. I also didn’t spend full-time for igb development. ↩