Monday, September 09, 2013

A real-life Heisenbug

I spent three or four days debugging (a while ago now, this blog post has been aging for longer than expected in my blog cellar) an interesting and hard to find bug on Windows. It became apparent due to intermittent failures on Try with Windows OMTC builds. Finding the route cause has been an adventure and turned out to have nothing much to do with compositing or graphics. All the action is in bug 896896, if you are interested to see some code.

A Heisenbug is a bug whose behaviour is changed by observation. In this case it would go away if I set more than a few breakpoints (breakpoints which print something and continue, not actually breaking, which also caused the bug to disappear). That made debugging annoying, I was reduced to using printfs for logging, and even that sometimes affected the frequency of reproduction.

The failure occurred for mochitests with a drop down select box. About one in five test runs was timing out and thus failing. They timed out with a crash, but there was no useful crash information. There were a few mochitests which failed and the thing they had in common was a drop down list box. The first step was to try and reproduce this locally. For the longest time I could not. I tried using test pages with list boxes, all worked fine. I tried to run the mochi tests in question and it always succeeded. The simplest of the tests simulated clicking on the select box to open and close it. By accident I noticed that sometimes, the mochitest would freeze, with the list box dropped down, until I moved the mouse. A bit more experimentation showed that if the cursor was outside the window then I could move it and it would remain frozen. A bit more experimentation and I found this was specific to OMTC (it never happened with main-thread composition) and didn't depend on the actual compositor (d3d9 and d3d11 had the same problem). So, it seemed to be a problem with simulated clicks on select boxes. I tried attaching a debugger and breaking when the mochitest froze. This didn't reveal anything interesting - the compositor thread was waiting for messages and the main thread was somewhere doing Windows message stuff or sometimes GC (which is where it usually is if you randomly pause Firefox).

I then spent a day or so figuring out exactly how those simulated clicks worked. The mochitest did four clicks, and usually we got stuck between the 3rd and 4th. The test set a 500ms timeout in JS and we never got to the forth click which should happen when the timeout timed out. It took me a while to figure this out, of course. At first it looked like a problem in the code which rolled up the select box, in part because we hit a different code path for that depending on whether there is a real click or a simulated click.

So the question now looked like why wasn't our timeout completing? More logging confirmed that it was being set correctly and (eventually, in the freezing case) that it correctly called back. Investigating with Spy++ showed that when we froze, we processed a WM_PAINT message and then just stopped. So why was our message queue empty? Well if the mouse isn't in the window and nothing much else is happening, then it should be empty, and Firefox should just wait until it gets a new message. Which is what was happening, and explained why it woke up with mouse movement. But, why weren't we waking up for the timeout?

Firefox waits for a message by calling the windows function WaitMessage. That puts our thread to sleep until a message appears in our queue. We were frozen here, waiting for a message to come in. Unfortunately there are other ways to check the queue. You can call PeekMessage which reads a message from the queue and either pops it off or leaves it in place. The sad thing is that if it leaves it in place, it still counts as 'read'. The contract for WaitMessage is a bit subtle - if there is anything in the queue that is unread, it returns immediately. If the queue is empty or (and this is the nasty bit) contains only read messages, it sleeps until another message comes in. So, you must call PeekMessage just before you call WaitMessage to ensure that the queue is empty.

The bug here was that another thread (the compositor thread) was checking PeekMessage. We couldn't avoid doing this, the call comes from deep in the thread management code. We freeze when, in our main thread message loop, we check that the queue is empty and it is, then we context switched to the compositor thread which also checks the message queue, but at some point our wake up call has arrived. That gets checked by the compositor thread's event loop and is ignored and left on the queue in the 'read' state. When we context switch back to the main thread, we call WaitMessage and there are no messages so we sleep forever. It might seem unlikely that a message arrives and is checked between the queue being checked and the WaitMessage call. Calling PeekMessage doesn't just tell you if any messages have been posted to the queue, it handles any sent messages (Windows has posted (async) and sent (sync) messages). That means that during the PeekMessage call a whole lot of unknown stuff can happen. I'm a bit hazy about exactly how this bit worked out, possibly we got a paint message and that caused us to do a sync composite which context switched to the compositor thread which checked its message loop and then PeekMessaged the wake up call. In any case, this happened frequently enough to cause intermittent failure. It was a Heisenbug because by adding debugger overhead we got the wake up message entirely after the composite (I'm guessing here, btw) so I saw the bug less often or not at all.

The fix was also kind of interesting, but in a knowing the details of the Windows APIs, rather than solving a mystery kind of way. It turns out you can call WaitForMultipleObjectsEx in such a way as to return even if there are only read messages in the queue, and then sidestep the whole business.

Tip of the hat to roc and bsmedberg who helped a lot with the diagnosis and cure of this problem. I would have been stuck without their Windows knowledge and Firefox-workings-insight.

1 comment:

Anonymous said...

Yeah, I've seen one of those before... the behavior was that a Java object wasn't serializing correctly across a remote link - but the problem disappeared as soon as logging was added to one end.

Eventually found that the toString method was modifying internal state... and that this instance of the problem had been caused by someone *removing* a debug line that had been added years earlier in a previous occurrence of the same issue.