Tuesday, February 05, 2013

A fun bug

(Actually this is a two for one kind of a deal)

I've spent the last two days finding two tricky bugs in my port of tiled Thebes layers to the async compositing API. I think they are kind of fun, so I'll try and describe them here. I'll try to elide the details a bit. If you want to check out the real code, look at ContentClient.cpp, ContentHost.cpp, and BasicTiledThebesLayer.cpp on the graphics branch.

First, the old way. A tile buffer keeps a bunch of tiles (the actual tiles, not references, that is important) and each tile keeps a reference to a gfxReusableSurfaceWrapper. A gfxReusableSurfaceWrapper is kind of neat, it keeps a reference to a surface and can be locked for reading. When we want to write to it we ask for its surface. If it is locked, then you get a fresh surface (with a new gfxReusableSurfaceWrapper to wrap it). If it is not locked, you get the same surface as last time.

To render the tiled layer, the content thread gets a surface for each tile and paints to it. When the tiled layer is rendered, a copy of the tile buffer is made in the heap and a reference is passed to the compositor thread. The compositor thread locks all of the gfxReusableSurfaceWrappers (via the tiles and buffers) and blits them to the screen.

Note that if the gfxReusableSurfaceWrapper is locked and we get a new surface when painting, then we store the new gfxReusableSurfaceWrapper in the tile and lose track of the old gfxReusableSurfaceWrapper. Also, gfxReusableSurfaceWrappers are reference counted. They are destroyed when there are no more references to them. Finally, it is very important that when a gfxReusableSurfaceWrapper is destroyed it is not locked for reading; we assert that.

This sounds fun already, right? But the fun bit is still to come...

As far as we are concerned, the main effect of refactoring into the new compositing API is that we add another layer between the tiles and the gfxReusableSurfaceWrappers. We add a TextureClient. The tile holds a reference to the TextureClient and the TextureClient holds a reference to the gfxReusableSurfaceWrapper. The TextureClient lives on the heap and is also reference counted.

What could go wrong?

What goes wrong is that we trigger an assertion by trying to destroy a locked gfxReusableSurfaceWrapper. Figuring out why took me a little while. What should happen is that the copy of the buffer and  its tiles on the compositor thread keeps the gfxReusableSurfaceWrappers alive once the tiles on the content tread forget about them. That works because we only lock the tiles for reading when we pass them to the compositor and because when we copy the buffer (a bitwise copy) we copy all the tiles, creating another reference to each gfxReusableSurfaceWrapper. But, with the TextureClients, the tiles are copied and we add another reference to the TextureClients, but they are not copied and so we only have one reference to the gfxReusableSurfaceWrappers. Thus, the next time around if we get new gfxReusableSurfaceWrappers and forget about the old ones, then they are destroyed, even though they are locked by the compositor! The fix is to do a 'deep' copy, copying the TextureClients rather than making another reference to them.

What could go wrong?

This gives rise to the really fun bug. Because if you do the 'deep' copy on the Compositor thread, you still hit the same assertions, just much less often. What is happening here is that there is a gap between when the tiles are locked (content thread) and when we make a copy (compositor thread). Sometimes we might get to repaint (content) before we composite the previous round and that means we un-reference the gfxReusableSurfaceWrappers after we lock and before we copy. That took a while to find, but in retrospect doing the 'deep' copy on the compositor thread was dumb, I'm not sure why I did that. The fix is easy, just move the deep copy to the content thread.