Automated binary analysis woes
On the other hand things seem to be simple. After spending some time contemplating these always-the-same code patterns one gets used to them. Many binary analysis problems start to look elementary, if not boring. One could even think: it is so obvious, why not automate it? Let the computer find things for us. Questions like where will this register be used? or where does that value come from? could be answered by the computer! Sure, there will be situations when it fails, but at least in other cases the answers could be calculated without any human intervention!…
When we implement the first naive approach, it turns out to work with our test program. Then we switch to another program and get our first surprise: it does not work at all! Switching to another file compiled by another compiler or even with the same compiler but with a different set of compiler options spoils everything. The patterns which worked for the first compiler do not work and the analysis fails.
The next step could be to add patterns for the second compiler. Then for the third. If we take this approach, our analysis tool would never be complete.
Imagine we want to find where a function expects its input arguments. This question arises very often during binary analysis and answering it correctly is very important. Without knowing where to look for the input we can never find out what the input values are. For example, this function
push ebp mov ebp, esp movsx eax, [ebp+8] ...
Another example. This function
push ebp mov ebp, esp call [ecx+10h] ...
INPUT ARGUMENTS RULEIf the function uses a register or stack location without initializing it, then it expects an input argument at that location.
We ‘just’ need to perform the data analysis and find all these registers. Unfortunately, this does not work at all. Here is a counterexample
push ebp mov ebp, esp push esi ... pop esi pop ebp ret
For any experienced IDA user all this is pretty obvious. We have to amend our rule about the function inputs. It could look like this:
INPUT ARGUMENTS RULE VERSION 2If the function uses a register or stack location without initializing it, then it is an input argument. Watch out for the preserved registers, the save-restore instructions should be ignored during this analysis!
We will build an algorithm to find preserved registers. In our first try at it we look for the push/pop pairs:
PRESERVED REGISTERS RULE VERSION 1A register is preserved by the function if there is a pop at the end of the function (before the return instruction) and a push instruction at the beginning of the function which corresponds to it. In this case we can remove the push/pop pair and repeat.
push edi push esi ... pop esi pop edi ret
push edi push esi ... ret1: pop esi pop edi ret ... ret2: pop esi pop edi ret
PRESERVED REGISTERS RULE VERSION 2A register is preserved by the function if for each return instruction there is a pop instruction and a push instruction at the beginning of the function which corresponds to it. In this case we can remove all pop instructions and the push instruction. Repeat the process while possible.
push ebp mov ebp, esp sub esp, 18h mov [ebp-8], ebx mov [ebp-4], esi ... mov ebx, [ebp-8] mov esi, [ebp-4] mov esp, ebp pop ebp retn
push ebp mov ebp, esp push esi mov [ebp-4], edi ... mov edi, [ebp-4] mov esp, ebp pop ebp retn
Here is the second one:
push ebp mov ebp, esp sub esp, 20h ... push esi ... pop esi ... mov esp, ebp pop ebp retn
- it doesn’t make sense to look only for push/pop instructions
other instructions can be used to preserve registers
- it doesn’t make sense to check only the function prolog/epilog
a register can be saved/restored anywhere in the function
- create the function flow chart
- add empty entry and exit blocks to it
- the entry block will pretend to define all registers used in the function
- the exit block will pretend to use all registers defined by the function
- for each block compute the use/def lists
- use the iterative algorithm to compute du-ud chains3.
PRESERVED REGISTERS – FINAL VERSIONa register is preserved by the function, iff:
Please note that we considered only compiler generated code. Hand crafted code is almost always impervious to automated analysis. Fortunately there is less and less assembler code.
I hope this small example illustrates my point: binary analysis problems are much more difficult than they look. Simple approaches simply do not work. Even if we strictly limit ourselves to compiler generated code, they still fail. They will work if we limit ourselves to a specific version of a specific compiler or even to a specific binary. So crafting a naive script to solve today’s problem might be the right thing to do. If not, expect monstrous rules like the above to haunt your logic…
P.S. The gcc compiler source code is abundant with lengthy and complex rules; does it mean that we have to live with that? 😉
1. There are many exceptions to this rule, just one of them: the trailing arguments of variadic functions are always accessed indirectly; but since all these arguments are anonymous, we do not lose anything. In fact there are many minor details which are irrelevant here.
2. If you wonder which compiler generates such a code, it is ICC, Intel’s compiler.
3. The iterative approach to the data flow analysis is explained in this book:
Muchnick, Steven S. Advanced Compiler Design and Implementation