Tuesday, May 29, 2012

Building a mask layer

Mask layers are built in layout/base/FrameLayerBuilder.cpp. Most of the work constructing the mask layer is done in SetupMaskLayer (which I guess should actually be called SetupMaskLayerForRoundedRectClip or something). But before that, we have to figure out if making a mask layer makes sense or not.

This is easy for image, colour, and container layers. But for Thebes layers it is more complicated: there can be multiple display items in a single Thebes layer, and we can only create a mask for the rounded rect clips that all the items have in common. In practice this is usually a single clip, and often there is only one item in the layer, but we cope with more. As items are added to a Thebes layer we keep track of the number of rounded rect clips they have in common using ThebesLayerData::mCommonClipCount, mostly in ThebesLayerData::UpdateCommonClipCount. We keep the clip of one item in the layer in ThebesLayerData::mItemClip; the first mCommonClipCount rounded rects of that clip are common to all items in the layer (we only find common clips if they are the first rounded rects of each item's clip). As items are added to the layer, their clips are compared to mItemClip and mCommonClipCount is updated as necessary.

When a Thebes layer is popped from the stack (i.e., we are done building it), mCommonClipCount is moved to ThebesLayerEntry::mCommonClipCount (it must have been initialised, i.e., be greater than zero, by now). We create a mask layer for the first mCommonClipCount clipping rects. When the Thebes layer is drawn, we only clip for rounded rects other than the common ones.
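
To make that bookkeeping concrete, here is a rough sketch of how the common clip count might be maintained as items are added. This is my own illustration, not the real Gecko code; the Clip and RoundedRect types and the exact member names are stand-ins.

#include <vector>

// Stand-ins for Gecko's clip types, just for illustration.
struct RoundedRect {
  int x, y, w, h, radius;
  bool operator==(const RoundedRect& aOther) const {
    return x == aOther.x && y == aOther.y && w == aOther.w &&
           h == aOther.h && radius == aOther.radius;
  }
};

struct Clip {
  std::vector<RoundedRect> mRoundedRects;
};

struct ThebesLayerDataSketch {
  Clip mItemClip;            // clip of the first item added to the layer
  int mCommonClipCount = -1; // -1 means no items have been added yet

  // Called once for each display item added to the layer.
  void UpdateCommonClipCount(const Clip& aClip) {
    if (mCommonClipCount == -1) {
      // First item: all of its rounded rects are (so far) common.
      mItemClip = aClip;
      mCommonClipCount = int(aClip.mRoundedRects.size());
      return;
    }
    // Only a leading run of rounded rects that matches mItemClip stays common.
    int common = 0;
    while (common < mCommonClipCount &&
           common < int(aClip.mRoundedRects.size()) &&
           aClip.mRoundedRects[common] == mItemClip.mRoundedRects[common]) {
      ++common;
    }
    mCommonClipCount = common;
  }
};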

FrameLayerBuilder has a neat way of recycling layers, which is a combination of caching and memory management: we recycle and create layers in order, and if we get back the same layer as we had last time round (the common case) we don't bother to rebuild it, but leave it as is. We do a similar thing with mask layers. This is a bit complicated because we only guarantee that layers of the same type are recycled in order, so we need to keep a list of mask layers for each type of layer. We store information about the clips we are masking for in the user data on the mask layer; this is checked in SetupMaskLayer to see whether we have to build the mask afresh, or whether we can get away with re-using the old version. Thus, we avoid rebuilding the mask more often than necessary.
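
The re-use check is, roughly, a comparison of what the recycled mask was built for against what we want now; something like the following sketch (again my own illustration: MaskLayerUserData, the helper functions, and the exact fields are made up, and the RoundedRect type is the one from the sketch above).

class Layer; // opaque layer type, for illustration only

struct MaskLayerUserData {
  std::vector<RoundedRect> mRoundedClipRects; // the rects the current mask was painted for
  float mScaleX, mScaleY;                     // the resolution it was painted at
};

// Assumed helpers, standing in for the real user-data and painting machinery.
MaskLayerUserData* GetMaskLayerUserData(Layer* aMaskLayer);
void SetMaskLayerUserData(Layer* aMaskLayer, MaskLayerUserData* aData);
void PaintMask(Layer* aMaskLayer, const std::vector<RoundedRect>& aRects,
               float aScaleX, float aScaleY);

void SetupMaskLayerSketch(Layer* aMaskLayer,
                          const std::vector<RoundedRect>& aWantedRects,
                          float aScaleX, float aScaleY) {
  MaskLayerUserData* data = GetMaskLayerUserData(aMaskLayer);
  if (data && data->mRoundedClipRects == aWantedRects &&
      data->mScaleX == aScaleX && data->mScaleY == aScaleY) {
    // The recycled mask already holds exactly this clip, so re-use it as is.
    return;
  }
  // Otherwise paint the rounded rects into a fresh mask surface and remember
  // what we painted, so the next transaction can skip the work.
  PaintMask(aMaskLayer, aWantedRects, aScaleX, aScaleY);
  SetMaskLayerUserData(aMaskLayer,
                       new MaskLayerUserData{aWantedRects, aScaleX, aScaleY});
}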

When the mask is built in FrameLayerBuilder.cpp, we create a mask surface which is the same size as the masked layer's visible region. We paint the clip into this surface; only the relevant bits of the clip will be inside the surface. So the mask is in its own coordinate system, which is relative to the masked layer's visible region. We create a transform for the mask which maps the mask's coordinate system to the masked layer's coordinate system. (Note, I'm currently changing this stuff, so it might not be true for long.) After the layer tree is built, it is walked to compute the effective transform of each layer (this means different things for the different backends). We updated this walk to also calculate an effective transform for mask layers. So, when we come to render the layer tree, the effective transform for the mask layer is the one we need for drawing the mask layer 'on top of' the masked layer's visible region, however the masked layer is rendered (I'll describe the transforms in more detail in later posts).
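
As a rough illustration of the coordinate systems (my own sketch, 2D only; the real code uses 3D matrices and the names here are invented): because the mask surface's origin is the top-left of the masked layer's visible region, the mask's own transform is little more than a translation, and its effective transform is that composed with the masked layer's effective transform.

// A minimal row-vector style 2D affine matrix, for illustration only.
struct Matrix {
  float a = 1, b = 0, c = 0, d = 1, tx = 0, ty = 0;

  static Matrix Translation(float aX, float aY) {
    Matrix m;
    m.tx = aX;
    m.ty = aY;
    return m;
  }

  // Returns the transform "apply this, then apply aNext".
  Matrix Then(const Matrix& aNext) const {
    Matrix r;
    r.a = a * aNext.a + b * aNext.c;
    r.b = a * aNext.b + b * aNext.d;
    r.c = c * aNext.a + d * aNext.c;
    r.d = c * aNext.b + d * aNext.d;
    r.tx = tx * aNext.a + ty * aNext.c + aNext.tx;
    r.ty = tx * aNext.b + ty * aNext.d + aNext.ty;
    return r;
  }
};

// Mask-surface coordinates are relative to the visible region's top-left, so
// mapping them into the masked layer's space is just a translation by that origin.
Matrix MaskTransform(float aVisibleOriginX, float aVisibleOriginY) {
  return Matrix::Translation(aVisibleOriginX, aVisibleOriginY);
}

// The mask's effective transform is then that, followed by whatever effective
// transform the tree walk computed for the masked layer itself.
Matrix MaskEffectiveTransform(const Matrix& aMaskTransform,
                              const Matrix& aMaskedLayerEffectiveTransform) {
  return aMaskTransform.Then(aMaskedLayerEffectiveTransform);
}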

Sunday, May 27, 2012

Assembly - strings

I want to find out a little about how strings work in assembly, and as a side effect call some functions that I haven't defined. So I will try to encode the following rather contrived example:

  char* hello = "hello";
  char* world = "world";
  char* helloWorld = (char*)malloc(64);
  strcpy(helloWorld, hello);
  helloWorld[5] = ' ';
  strcpy(helloWorld+6, world);
  printf("%s!\n", helloWorld);

The interesting bits are stack allocation (the format string will go on the stack) and heap allocation (done by malloc), copying strings, doing some character manipulation, and calling a function which takes a variable number of arguments. I will make the hello and world strings global variables. Here is the assembly:

  // allocate on the heap for helloWorld
  push 64
  call malloc
  add esp, 4
  // save address of helloWorld on the stack
  push eax
  // copy hello to helloWorld
  mov ebx, 0
  mov ecx, hello
copyh:
  mov dl, BYTE PTR [ebx+ecx]
  mov BYTE PTR [eax], dl
  inc eax
  inc ebx
  cmp ebx, 5
  jl copyh
  // add a space
  mov BYTE PTR [eax], ' '
  inc eax
  // copy world to helloWorld
  mov ebx, 0
  mov ecx, world
copyw:
  mov dl, BYTE PTR [ecx+ebx]
  mov BYTE PTR [eax], dl
  inc eax
  inc ebx
  cmp ebx, 5
  jl copyw
  // null terminate helloWorld
  mov BYTE PTR [eax], 0
  // temporarily move address of helloWorld to eax
  pop eax
  // allocate format on the stack and save address in ebx
  sub esp, 8
  mov ebx, esp
  // create format
  mov BYTE PTR [ebx], '%'
  mov BYTE PTR [ebx+1], 's'
  mov BYTE PTR [ebx+2], '!'
  mov BYTE PTR [ebx+3], '\n'
  mov BYTE PTR [ebx+4], 0
  // call printf
  push eax
  push ebx
  call printf
  // tidy up printf call and format
  add esp, 16
  // should really call free, but meh

One thing that surprised me is that you don't need to pass the number of vararg arguments - the callee just has to work it out! Getting the call to malloc working was a pain: I had to link the standard library statically rather than as a DLL, plus a whole bunch of other compilation options which may or may not have mattered (compile as C, not C++, turn off randomised addressing, etc.).
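
For what it's worth, this is how the callee 'works it out' in C: a variadic function walks its extra arguments with va_arg, using some other convention (for printf, the format string) to discover how many there are and what type each one has. A tiny example of my own, unrelated to the code above:

#include <stdarg.h>
#include <stdio.h>

// Prints one int per "%d" in the format string; the number of extra
// arguments is recovered from the format, never passed explicitly.
void printInts(const char* aFormat, ...)
{
  va_list args;
  va_start(args, aFormat);
  for (const char* p = aFormat; *p; ++p) {
    if (p[0] == '%' && p[1] == 'd') {
      printf("%d\n", va_arg(args, int));
      ++p;
    }
  }
  va_end(args);
}

// printInts("%d %d %d", 1, 2, 3); prints 1, 2 and 3.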

Saturday, May 26, 2012

Mask Layers

The bulk of my work at Mozilla so far has been on mask layers. These just landed (bug 716439) and should be in Aurora at the next uplift. I'm going to try and explain the work in some detail over a series of blog posts. Here I'll try to give an overview.

The motivation for mask layers is to allow fast (hardware accelerated) clipping. Before mask layers, clipping to a rectangle was basically free, but clipping to any other shape was extremely expensive, because it was not done on graphics hardware. A particular example is rectangles with rounded corners: a video clipped to a rectangle played back very smoothly, but as soon as you used CSS border-radius, performance degraded horribly.

Mask layers solve this by adding a mask layer to each layer in the layer tree (it is just null if no masking is needed). The alpha of the mask is used when compositing layers, which clips out anything where the mask's alpha is zero.
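
Conceptually, the compositing step is just the usual 'over' operation with the source modulated by the mask's alpha at each pixel; where the mask alpha is zero the source contributes nothing. A sketch of the per-pixel maths (an illustration only, not how any of the backends is actually written):

struct Pixel { float r, g, b, a; }; // premultiplied alpha, for illustration

// Composite one source pixel over one destination pixel, modulated by the
// mask's alpha: 0 clips the source away entirely, 1 composites as normal.
Pixel CompositeWithMask(Pixel aDst, Pixel aSrc, float aMaskAlpha) {
  aSrc.r *= aMaskAlpha;
  aSrc.g *= aMaskAlpha;
  aSrc.b *= aMaskAlpha;
  aSrc.a *= aMaskAlpha;
  Pixel out;
  out.r = aSrc.r + aDst.r * (1.0f - aSrc.a);
  out.g = aSrc.g + aDst.g * (1.0f - aSrc.a);
  out.b = aSrc.b + aDst.b * (1.0f - aSrc.a);
  out.a = aSrc.a + aDst.a * (1.0f - aSrc.a);
  return out;
}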

At present there is only support for image layers as mask layers, so the mask is simply a bitmap in memory, but there is no reason other layer types cannot be supported. Mask layers are currently only used to support rounded corner rectangles, but the mechanism is fully general.

So, overview: when building an active layer (there are no masks on inactive layers) in FrameLayerBuilder, if an image, colour, or container layer has a rounded rect clip, or all the items in a Thebes layer have the same rounded rect clip, then we create a mask layer for that layer. When rendering the layers, the mask is used whilst compositing. This means that with OMTC, masking is done on the compositor thread, and the mask layer is passed across as part of the PLayers protocol.

In the hardware backends, the masking is done using shaders. For each shader that used to exist, we create an extra one which does the masking as well. In fact, for some shaders we need a third variant for masks with a 3D transform, because we must take account of the perspective-correct interpolation done by the GPU (which is actually incorrect for our purposes).

On the basic layers backend, masking is done using Cairo masking.

Inactive layers are done the old way, that is, when the layer is drawn, and rounded rect clips on content are clipped using Thebes. There is new work (bug 755078) to enable turning mask layers on or off at compile time using the MOZ_ENABLE_MASK_LAYERS flag. It might be necessary to do this on mobile because mask layers are not playing nice with tiled textures (although hopefully this is fixed now).

Assembly - calling conventions - callees

The callee has to do the other side of the handshake I discussed last time. But it also has to do a little more housekeeping. There are two registers which point into the stack: esp points to the top of the stack and is kept up to date by push/pop; ebp is the base pointer and points to the bottom of the current stack frame. It must be set when we create a new stack frame, such as when we enter a function. Since ebp is a callee-preserved register, we must save the old value of ebp to the stack and later restore it.

cdecl

void __declspec(naked) copy(...)
{
  __asm {
    //prologue
    push ebp
    mov ebp, esp
    push esi
    //copy
    mov eax, [ebp+12]
    mov esi, [ebp+8]
    cmp eax, esi
    je finish
    mov ecx, [ebp+16]
    dec ecx
start:
    mov dl, BYTE PTR [esi+ecx]
    mov BYTE PTR [eax+ecx], dl
    loop start
    mov dl, BYTE PTR [esi]
    mov BYTE PTR [eax], dl
finish:
    //epilogue
    pop esi
    pop ebp
    ret
  }
}

I've elided the arguments to the function (since we are doing the calling from assembly now, they are irrelevant). The __declspec(naked) declaration means that the compiler will not generate a prologue and epilogue for the function. Our prologue saves ebp to the stack, moves esp into ebp, and saves esi to the stack. In the epilogue, esi is restored (it is the only register we need to preserve), then ebp is restored. Because there are no pushes or pops in the body of the function, the stack pointer is correct. The body of the function is pretty much unchanged; the only difference is that we access the arguments by offset from ebp (i.e., on the stack), rather than by name.

Normally, we might copy the arguments onto the stack, somewhere between ebp and esp, but here we never take them out of the registers once they are loaded, so there is no need. We would also often allocate memory on the stack (again between ebp and esp) for local variables, but we don't need to do that either.
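
For comparison, a function that did need local variables would typically reserve space for them in the prologue by moving esp down, something like this (a generic sketch of my own, not one of the functions above):

  //prologue
  push ebp
  mov ebp, esp
  sub esp, 8        // room for two 4-byte locals at [ebp-4] and [ebp-8]
  push esi
  // ...body: arguments at [ebp+8], [ebp+12], ...; locals at [ebp-4], [ebp-8]
  //epilogue
  pop esi
  mov esp, ebp      // release the locals
  pop ebp
  ret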

fastcall

Here, we get the arguments in registers; we could copy them to the stack, but we don't need to, so we'll just use them where they are. Because we don't pass anything on the stack, there is nothing to tidy up. We could avoid a couple of moves by using different registers, but I haven't.

  //prologue
  push ebp
  mov ebp, esp
  push edi
  //fillWithCount
  mov edi, ecx
  mov ecx, edx
  dec ecx
start:
  mov BYTE PTR [edi+ecx], cl
  loop start
  mov BYTE PTR [edi], cl
  //epilogue
  pop edi
  pop ebp
  ret

stdcall

Here the arguments are on the stack, and we must restore it on exit.

  //prologue
  push ebp
  mov ebp, esp
  push edi
  push ebx
  //fill
  mov edi, [ebp+8]
  mov ebx, [ebp+12] 
  mov ecx, [ebp+16]
  dec ecx
start:
  mov BYTE PTR [edi+ecx], bl
  loop start
  mov BYTE PTR [edi], bl
  //epilogue
  pop ebx
  pop edi
  pop ebp
  ret 12

We use the arguments in the same way as for the cdecl function. The difference is that the ret instruction takes an argument, the number of bytes to pop from the stack in order to tidy up. We could also do this manually:

pop edx
add esp, 12
jmp edx

Monday, May 21, 2012

Assembly - calling conventions - callers

Next up, calling conventions: what does C do when it calls a function? Actually there are several ways of calling functions, called calling conventions. There are choices about whether the caller or the callee must restore registers, how arguments and the return value are passed, and how 'this' is passed if it's a class method (not to mention virtual dispatch, but that's beyond the scope of this blog post). I will look at the following common conventions: cdecl (the standard convention for C on x86), stdcall (used for the Win32 API), and fastcall (there are actually many varieties, but I'll look at the Microsoft version). There are many other conventions on different platforms and for different languages. In particular, there is thiscall, which is used by Microsoft for calling member functions, and a bunch of conventions for 64 bit processors, which favour registers because there are more of them (cite: http://msdn.microsoft.com/en-us/library/984x0h58.aspx).

We'll use the call and ret instructions. In all cases, call saves the return address to the stack and ret pops it, so the return address is always on top of the stack when a call is made.

I'm going to continue to use the functions from last time, but now I'm going to add a calling convention to the signature and call them from assembly, rather than C.

Here are the new signatures:

void __cdecl copy(unsigned char* aSrc, unsigned char* aDest, unsigned int aLength);
void __stdcall fill(unsigned char* aDest, unsigned char aVal, unsigned int aLength);
void __fastcall fillWithCount(unsigned char* aDest, unsigned int aLength);

And here is the current calling code in C (there is some allocation beforehand, and some printing afterwards, but that is not important):

  fillWithCount(src, 256);
  fill(dest, 5, 256);
  copy(src, dest, 200);

cdecl

Arguments are passed on the stack in right-to-left order. The return value is in eax. eax, ecx, and edx must be preserved by the caller; other registers are preserved by the callee. The caller must clear the arguments from the stack after the call returns. So, the call to copy looks like:

//copy(src, dest, 200);
__asm {
  push 200
  push dest
  push src
  call copy
  add esp, 12
}

Which is actually not very interesting because there is no return value, nor registers to restore. The only thing of note is that we pop the three params off the stack by manually fiddling with the stack pointer.

stdcall

stdcall is similar to cdecl, but the callee tidies up the stack. So we don't need to pop anything from the stack.

//fill(dest, 5, 256);
__asm {
  push 256
  push 5
  push dest
  call fill
}

fastcall

fastcall is similar to stdcall in that the callee tidies up the stack, but the first two arguments are passed in ecx and edx, and any remaining arguments go on the stack as in the other conventions. The return value is left in eax.

//fillWithCount(src, 256);
__asm {
  mov ecx, src
  mov edx, 256
  call fillWithCount
}

Thursday, May 17, 2012

Assembly, day 4 - memory modes

'Today' I wanted to learn about the different addressing modes available in x86 assembly, so I wrote a few functions which fill an array of bytes in various ways.

The addressing modes are: a constant (e.g., 5), a C variable (I suppose this is some inline assembly magic; e.g., x), the contents of a register (e.g., eax), the value in memory pointed to by a register (e.g., [eax]), that plus a fixed offset (e.g., [eax+5]), or that plus a variable offset (e.g., [eax+ecx]).
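
For concreteness, here is a little __asm block of my own (not part of the functions below) that touches each of those modes; x and buf are hypothetical locals:

void addressingModes()
{
  int x = 42;
  unsigned char buf[8] = {0};
  __asm {
    mov eax, 5                  // constant
    mov eax, x                  // C variable (the compiler turns this into a memory operand)
    mov ebx, eax                // contents of a register
    lea esi, buf                // put the address of buf into a register
    mov al, BYTE PTR [esi]      // the value in memory pointed to by a register
    mov al, BYTE PTR [esi+3]    // ...plus a fixed offset
    mov ecx, 2
    mov al, BYTE PTR [esi+ecx]  // ...plus a variable offset
  }
}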

void fill(unsigned char* aDest, unsigned char aVal, unsigned int aLength)
{
  __asm {
    mov edi, aDest
    movzx ebx, aVal  // because I can!
    mov ecx, aLength
    dec ecx
start:
    mov BYTE PTR [edi+ecx], bl
    loop start
    mov BYTE PTR [edi], bl
  }
}

This function fills the aLength bytes starting at aDest with aVal. We use loop to do this aLength times (which we set up by moving aLength into ecx). It turns out that loop is one-indexed, so we have to decrement before the loop and do the zero iteration manually. Is there a better way to do this?

Here we use the edi register plus an ecx offset to index into memory and set it to the value in bl. Note we have to use the BYTE PTR directive to move only a single byte. When we set up bl, we actually use movzx to fill the upper three bytes of ebx with zeroes; this was just for fun, we don't need to do it.
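
On the question above about the one-indexed loop: one alternative (my own sketch, doing the same job as fill) is to drop loop and count down with dec/jnz, bumping the destination pointer instead of indexing; it avoids the separate zero iteration and also copes with aLength being zero.

void fill2(unsigned char* aDest, unsigned char aVal, unsigned int aLength)
{
  __asm {
    mov edi, aDest
    mov al, aVal
    mov ecx, aLength
    test ecx, ecx      // nothing to do for a zero length
    jz done
fillLoop:
    mov BYTE PTR [edi], al
    inc edi
    dec ecx
    jnz fillLoop
done:
  }
}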

void fillWithCount(unsigned char* aDest, unsigned int aLength)
{
  __asm {
    mov ecx, aLength
    dec ecx
    mov edi, aDest
start:
    mov BYTE PTR [edi+ecx], cl
    loop start
    mov BYTE PTR [edi], cl
  }
}

fillWithCount puts the number n into the aDest+n'th address. It is pretty similar to the above function.

void copy(unsigned char* aSrc, unsigned char* aDest, unsigned int aLength)
{
  __asm {
    mov eax, aDest
    mov esi, aSrc
    cmp eax, esi
    je finish
    mov ecx, aLength
    dec ecx
start:
    mov dl, BYTE PTR [esi+ecx]
    mov BYTE PTR [eax+ecx], dl
    loop start
    mov dl, BYTE PTR [esi]
    mov BYTE PTR [eax], dl
finish:
  }
}

This one is a bit more fun: we copy aLength bytes from aSrc to aDest. We first check that aSrc and aDest are different. An interesting aside: instructions that use eax with an immediate operand (e.g., 'cmp eax, 0') often have special short encodings on x86, a single-byte opcode followed by the immediate, which is a legacy of eax's role as the accumulator; register-to-register comparisons like 'cmp eax, esi' don't get that treatment.

Then we loop through the data as in the previous functions. We have to do two moves here because x86 does not support memory to memory moves, so we move a byte into dl and then from there to the destination. I should probably have used eax as the temporary and edx to hold the destination address, rather than the other way around.

Wednesday, May 16, 2012

Assembly - interlude

I just want to mention that I am really just learning this assembly stuff, and making it up as I go along. If anyone is experienced, and sees better ways to do things, I would really appreciate a comment, thanks!

Tuesday, May 15, 2012

Assembly, day 3 - more control flow

So, to make things a bit more complicated, I wanted to simulate some basic parts of procedure calls and recursion. I'm not going to try to interact with or emulate C or anything (later, later); I just want some more complex control flow which includes parameter passing and so forth. The example will be the Fibonacci series (of course).

First in C (this isn't nice code, but it is close to the simple assembly coming next):

int fib(int n) {
  if (n == 0) return 0;
  if (n == 1) return 1;
  int result = fib(--n);
  result += fib(--n);
  return result;
}

And then in assembly:

    push done
    mov ebx, n
start:
    cmp ebx, 0
    je case0
    cmp ebx, 1
    je case1
    dec ebx
    push ret1
    jmp start
ret1:
    push eax
    push ret2
    dec ebx
    jmp start
ret2:
    pop ecx
    add eax, ecx
    add ebx, 2
    jmp fin
case0:
    mov eax, 0
    jmp fin
case1:
    mov eax, 1
fin:
    pop ecx
    jmp ecx
done:
    mov result, eax
   
We mostly use only two registers: ebx takes the argument to the function, and eax holds the result. The calling convention is that eax is preserved by the caller and ebx by the callee; ecx is used as a temporary. When a procedure is called, the argument is in ebx and the return address is on top of the stack (whatever is below that is none of the callee's business); the result is left in eax.

We start by 'calling' our procedure for the first time: we push the final return address ('done') and move the initial argument into ebx. Once the whole computation is finished we'll end up at 'done:', because that is the address we pushed; there we simply move the result from eax into 'result'.

'start:' is the entry point to the procedure; my invariant here is that the argument must be in ebx and the return address on top of the stack. We check if the argument (ebx) is 0 or 1 and jump to the special cases, otherwise we continue. The special cases just leave 0 or 1 in eax and then execute the epilogue code (at 'fin:'), which simply pops the return address from the stack and jumps there (this should probably have been inlined).

So, the interesting case is where ebx > 1: here we decrement ebx, push ret1 as the return address, and recurse. When we come back, we save eax (the result of the first call) to the stack, push ret2 as the return address, decrement ebx again, and recurse again. On return, we pop the old value of eax into ecx and add it to the new value of eax (which is the result of the second call). We then restore ebx by adding two (because we decremented it twice) and go to the epilogue to return. Note that we have fulfilled the calling convention by restoring ebx and leaving the result in eax.

Wednesday, May 09, 2012

Assembly, day 2 - simple control flow

Using the same inline assembly framework as yesterday, today I want to use a simple loop. The example will be to sum all the numbers between 0 and n, where n is passed in as a parameter. To do this I will simulate a 'backward' for loop: for (; n > 0; --n), and in the body just add n to the running total. The function looks like:

int sum(unsigned int n)
{
  int result;

  __asm {
    mov eax, 0
    mov ebx, n
  start:
    add eax, ebx
    dec ebx
    cmp ebx, 0
    jg start
    mov result, eax
  }

  return result;
}

Here, we keep the running total in eax, and the current value of n in ebx. The first two mov instructions initialise these registers. Then the 'start' label identifies the beginning of the loop. In the loop we add the current 'n' to the running result, and decrement 'n'. The 'cmp' instruction compares its two arguments and sets some flags accordingly; the following instruction (jg) uses those flags and jumps if ebx is greater than zero. If we don't jump, then we are done and we can copy the result out of the register and into memory (into 'result', to be precise).

An Alternative:

  mov eax, 0
  mov ecx, n
start:
  add eax, ecx
  loopnz start
  mov result, eax

In the alternative we make use of one of the x86 loop instructions (the C stuff is exactly the same as in the first version). Now we use ecx, rather than ebx, to store our counter ('n'). The loopnz instruction automatically decrements ecx (which is why we use it instead of ebx), and jumps to the given label as long as ecx is not zero and the zero flag is clear; here the sum in eax is only ever zero when n is zero, so in effect the loop runs until ecx reaches zero (plain loop would also do the job). This version uses fewer instructions and is perhaps a little clearer. I have no idea which version would be quicker; modern processors are far too complex to make such a judgement without measuring.

Tuesday, May 08, 2012

Inline assembly

A mini-quest of mine is to (re-)learn x86 assembly language. Today I tried to get a 'Hello World!' thing going. My chosen platform is inline assembly in MSVC, basically because it seems like the easiest way. And it was pretty easy. Since strings in assembly are pretty complex, my program will add two integers together.

Inline assembly is just put in a '__asm' block. And it is really simple. Variable names can just be used inside the block. So my whole program looks like:

int add(int x, int y)
{
  __asm {
    mov eax, y
    add x, eax
  }

  return x;
}

int _tmain(int argc, _TCHAR* argv[])
{
  printf("sum: %d\n", add(4, 23));
  return 0;
}

Where add is a C function whose implementation is in assembly (well, the prologue and epilogue are generated by the C++ compiler, but that's one reason to use inline assembly - told you it made things easy). The main function supplies the arguments and prints the result.

The actual assembly is really simple: 'add' adds its second operand to the first, and only one of the operands can be a memory location, so we have to use a register for the second one. The 'mov' instruction copies the value of y into eax, which then gets added to x, which holds the result.

Monday, May 07, 2012

How a debugger works

Debuggers are one of those parts of a computer that seem like magic. I've often wondered how they work, so I set out to find out. What I looked into is the Windows user-mode debugger; kernel-mode debugging is similar, but a bit more complicated. I imagine it works similarly on other OSs, but I don't really know.

Your standard debugger runs in a separate process from the debuggee, but it must have some special privileges over the debuggee process. While the debuggee is running, the debugger does nothing and execution proceeds as usual.

Debugging on Windows basically works via Windows exceptions, that is, the structured exception handling (SEH) mechanism. As an aside, I never realised C++ exceptions were compiled down to SEH OS calls; I always assumed they just used jumps - no wonder they are expensive. Anyway, when the debugger attaches to the debuggee, it is registered with the OS (Windows) as the first-chance exception handler for the debuggee process. Any exception thrown by the debuggee is caught by the debugger, which can do with it as it pleases. Usually it will print a notification and pass the exception on to the debuggee, which can then run its usual exception handling procedure.

If the debuggee doesn't catch the exception, which would usually cause a crash (or at least revert to OS handling), then the debugger gets another go with the exception, called second-chance exception handling. Usually, the debugger will stop execution and let the user debug the 'crash'.

Another, longer, aside: this gets complicated if the debuggee has some kind of crash reporter for dealing with unhandled exceptions. In normal execution, any unhandled exception is handled by the unhandled exception filter, which can be set with a system call. But when the process is being debugged this is always turned off, so the debuggee never gets its last-chance exception handler called; instead the debugger catches the exception, as described above.

There is a special set of exceptions which are used only for debugging. These are caught by the debugger process in its first-chance exception handler and are not passed on to the debuggee process. Depending on the exception, the debugger takes some action.
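
Putting the last few paragraphs together, the heart of a user-mode debugger is a small event loop: after attaching to the debuggee (e.g. with DebugActiveProcess), it waits for debug events and decides for each exception whether to swallow it or pass it back. A heavily simplified sketch of my own using the Win32 debugging API (error handling and everything else omitted):

#include <windows.h>
#include <stdio.h>

void DebugLoop()
{
  DEBUG_EVENT ev;
  for (;;) {
    if (!WaitForDebugEvent(&ev, INFINITE))
      break;
    DWORD continueStatus = DBG_CONTINUE;
    if (ev.dwDebugEventCode == EXCEPTION_DEBUG_EVENT) {
      DWORD code = ev.u.Exception.ExceptionRecord.ExceptionCode;
      if (code == EXCEPTION_BREAKPOINT || code == EXCEPTION_SINGLE_STEP) {
        // One of the 'special' exceptions: handle it ourselves (let the user
        // poke around, etc.) and do not pass it on to the debuggee.
        printf("break or single step in the debuggee\n");
      } else if (ev.u.Exception.dwFirstChance) {
        // First chance: report it, then hand it back to the debuggee's own
        // exception handling.
        printf("first-chance exception 0x%08lx\n", code);
        continueStatus = DBG_EXCEPTION_NOT_HANDLED;
      } else {
        // Second chance: the debuggee did not handle it - this is the 'crash'.
        printf("second-chance exception 0x%08lx\n", code);
      }
    } else if (ev.dwDebugEventCode == EXIT_PROCESS_DEBUG_EVENT) {
      break;
    }
    ContinueDebugEvent(ev.dwProcessId, ev.dwThreadId, continueStatus);
  }
}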

If the debugger wants to break, then a new thread is created in the debuggee process which just executes the debug break CPU instruction (int 3, apparently). The OS handles this and throws one of those special exceptions (STATUS_BREAKPOINT), which is caught by the debugger. The debugger can then use its special debugging privileges to pause all the threads in the debuggee process, and voila: a break has happened and the debugger is in control.

Breakpoints work similarly: when you set a breakpoint on a line of code, the debugger (presumably with some help from the compiler's debug information) finds the corresponding instruction and overwrites it with that 'int 3' instruction. When the 'int 3' gets executed (where the original instruction should be), the same thing happens as above. When the debugger is done, it overwrites the 'int 3' with the original instruction and control returns to the debuggee. The debugger must single step through that instruction (see next) so that control comes back to the debugger, which re-replaces the instruction with 'int 3'; then single stepping is stopped and control is properly returned to the debuggee.

To single step, a CPU flag is set which causes the 'int 1' interrupt to occur after every instruction; control then follows the same path as for 'int 3', above. The debugger can return control to the debuggee to execute all the instructions that correspond to a single C++ line (or whatever language and notion of 'single step' is being debugged). At the end of a step, execution is paused as for a break. When the debugger is done single stepping, the CPU flag is unset and control is transferred back to the debuggee.
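
For illustration, here is roughly what planting a breakpoint and requesting a single step look like with the Win32 API (my own sketch; error handling is omitted, and the process and thread handles come from the debug loop above):

#include <windows.h>

// Overwrite the first byte of the instruction at aAddress with int 3 (0xCC),
// remembering the original byte so it can be put back later.
unsigned char SetBreakpoint(HANDLE aProcess, LPVOID aAddress)
{
  unsigned char original = 0;
  unsigned char int3 = 0xCC;
  SIZE_T count = 0;
  ReadProcessMemory(aProcess, aAddress, &original, 1, &count);
  WriteProcessMemory(aProcess, aAddress, &int3, 1, &count);
  FlushInstructionCache(aProcess, aAddress, 1);
  return original;
}

// Ask the CPU to raise a single-step exception after the thread's next
// instruction by setting the trap flag (bit 8 of EFLAGS).
void SetSingleStep(HANDLE aThread)
{
  CONTEXT context;
  context.ContextFlags = CONTEXT_CONTROL;
  GetThreadContext(aThread, &context);
  context.EFlags |= 0x100; // the trap flag
  SetThreadContext(aThread, &context);
}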

Wednesday, May 02, 2012

Quick post - a question about Tribal Ownership

Tribal Ownership is probably the work I am most proud of. The idea is that syntactic nesting of classes gives semantic nesting of objects, with very little extra syntax. (Note: the original idea is not mine; it was proposed in the original Tribe paper, and may have been known before that.)

A challenge for this kind of approach has occurred to me: a common case is that we want a tree where the children are owned by their parents and every node has the same class. It is easy to have a tree where all the nodes are owned by a single object, but I think the common situation where each parent owns its children is not possible. It would be good to have a solution to this without changing the language too much, or to prove that we never really need it and that a tree where all nodes are owned by the root or a manager is OK in some situations (perhaps only for some very limited applications of ownership). No ideas right now though; if you have an idea, leave a comment, or write it up and publish it!