Monday, September 09, 2013

A real-life Heisenbug

I spent three or four days debugging (a while ago now, this blog post has been aging for longer than expected in my blog cellar) an interesting and hard to find bug on Windows. It became apparent due to intermittent failures on Try with Windows OMTC builds. Finding the route cause has been an adventure and turned out to have nothing much to do with compositing or graphics. All the action is in bug 896896, if you are interested to see some code.

A Heisenbug is a bug whose behaviour is changed by observation. In this case it would go away if I set more than a few breakpoints (breakpoints which print something and continue, not actually breaking, which also caused the bug to disappear). That made debugging annoying, I was reduced to using printfs for logging, and even that sometimes affected the frequency of reproduction.

The failure occurred for mochitests with a drop down select box. About one in five test runs was timing out and thus failing. They timed out with a crash, but there was no useful crash information. There were a few mochitests which failed and the thing they had in common was a drop down list box. The first step was to try and reproduce this locally. For the longest time I could not. I tried using test pages with list boxes, all worked fine. I tried to run the mochi tests in question and it always succeeded. The simplest of the tests simulated clicking on the select box to open and close it. By accident I noticed that sometimes, the mochitest would freeze, with the list box dropped down, until I moved the mouse. A bit more experimentation showed that if the cursor was outside the window then I could move it and it would remain frozen. A bit more experimentation and I found this was specific to OMTC (it never happened with main-thread composition) and didn't depend on the actual compositor (d3d9 and d3d11 had the same problem). So, it seemed to be a problem with simulated clicks on select boxes. I tried attaching a debugger and breaking when the mochitest froze. This didn't reveal anything interesting - the compositor thread was waiting for messages and the main thread was somewhere doing Windows message stuff or sometimes GC (which is where it usually is if you randomly pause Firefox).

I then spent a day or so figuring out exactly how those simulated clicks worked. The mochitest did four clicks, and usually we got stuck between the 3rd and 4th. The test set a 500ms timeout in JS and we never got to the forth click which should happen when the timeout timed out. It took me a while to figure this out, of course. At first it looked like a problem in the code which rolled up the select box, in part because we hit a different code path for that depending on whether there is a real click or a simulated click.

So the question now looked like why wasn't our timeout completing? More logging confirmed that it was being set correctly and (eventually, in the freezing case) that it correctly called back. Investigating with Spy++ showed that when we froze, we processed a WM_PAINT message and then just stopped. So why was our message queue empty? Well if the mouse isn't in the window and nothing much else is happening, then it should be empty, and Firefox should just wait until it gets a new message. Which is what was happening, and explained why it woke up with mouse movement. But, why weren't we waking up for the timeout?

Firefox waits for a message by calling the windows function WaitMessage. That puts our thread to sleep until a message appears in our queue. We were frozen here, waiting for a message to come in. Unfortunately there are other ways to check the queue. You can call PeekMessage which reads a message from the queue and either pops it off or leaves it in place. The sad thing is that if it leaves it in place, it still counts as 'read'. The contract for WaitMessage is a bit subtle - if there is anything in the queue that is unread, it returns immediately. If the queue is empty or (and this is the nasty bit) contains only read messages, it sleeps until another message comes in. So, you must call PeekMessage just before you call WaitMessage to ensure that the queue is empty.

The bug here was that another thread (the compositor thread) was checking PeekMessage. We couldn't avoid doing this, the call comes from deep in the thread management code. We freeze when, in our main thread message loop, we check that the queue is empty and it is, then we context switched to the compositor thread which also checks the message queue, but at some point our wake up call has arrived. That gets checked by the compositor thread's event loop and is ignored and left on the queue in the 'read' state. When we context switch back to the main thread, we call WaitMessage and there are no messages so we sleep forever. It might seem unlikely that a message arrives and is checked between the queue being checked and the WaitMessage call. Calling PeekMessage doesn't just tell you if any messages have been posted to the queue, it handles any sent messages (Windows has posted (async) and sent (sync) messages). That means that during the PeekMessage call a whole lot of unknown stuff can happen. I'm a bit hazy about exactly how this bit worked out, possibly we got a paint message and that caused us to do a sync composite which context switched to the compositor thread which checked its message loop and then PeekMessaged the wake up call. In any case, this happened frequently enough to cause intermittent failure. It was a Heisenbug because by adding debugger overhead we got the wake up message entirely after the composite (I'm guessing here, btw) so I saw the bug less often or not at all.

The fix was also kind of interesting, but in a knowing the details of the Windows APIs, rather than solving a mystery kind of way. It turns out you can call WaitForMultipleObjectsEx in such a way as to return even if there are only read messages in the queue, and then sidestep the whole business.

Tip of the hat to roc and bsmedberg who helped a lot with the diagnosis and cure of this problem. I would have been stuck without their Windows knowledge and Firefox-workings-insight.

Saturday, September 07, 2013

An ode to Sublime Text

I use Sublime Text as my editor for about 80% of the work I do (sometimes on Windows I use Visual Studio because I use it for debugging and the integration with the debugger is convenient. Even so, I often use Sublime Text even then because it is so much nicer). Sublime Text is pretty much a perfect piece of software. It is not often I get that warm glow from using software that does exactly what I want it to, but Sublime is one time. Here are some good things about it:
  • It does one thing and it does it well (editing text for programmers).
  • It is super fast and super stable - I've never had it jank (actually when searching through an 80MB file, but that is understandable) and never had it crash.
  • It is beautiful - seriously, it looks really nice. That is important to me if I am going to spend most of my day looking at it.
  • It is multi-platform (but not, unfortunately, open source (so not quite perfect) so I can't use it on Linux/ARM).
  • It is extremely customisable. I have not found anything I want to change and can't.
  • It is easy to set up your customisations across platforms because they are stored in a plain text file.
  • It is smart (fuzzy search, etc.).
  • It has keyboard shortcuts for everything and they are all the ones you would expect (except maybe ctrl+t which I often hit trying to open a new tab, but hey, that is customisable too).
  • It's extensible with plugins.
Notice that none of the above are about editing text. Editing text is really nice too. But Sublime Text does all the 'meta' level things perfectly. That is necessary (but not sufficient) for a great piece of software. Of course you have to get the primary function right (and Sublime Text does), but so many pieces of software do their core function nicely but fail on the meta-level stuff.

Anyhow, if you haven't tried Sublime Text, you should. And if you are making software, you should strive to make it as nice as Sublime Text is.