Return to the sources?
A decompiler is commonly viewed as a tool to recover the source code of a program, in the same way that a disassembler is a tool to convert a binary executable into assembly text.
This is true in some cases, but only in some.
Most of the time the goal is not to get a compilable text. What would one do with such a text? Recompile and link it to get a copy of the input binary program? The only legitimate situation I can imagine is when the source text has been lost and the owner of the program tries to recover it. Even in this case I personally would rewrite everything from scratch because the disassembler/decompiler output is a nightmare to maintain: it lacks comments, inline functions, preprocessor symbols, abstractions, class inheritance, etc.
In short, I wouldn’t consider obtaining a compilable text as a major goal of a binary reverse engineering tool (RET).
One goal of a RET is to represent the program logic in the most precise and clear way. This representation should be easily understandable by a human being, which means that the fundamental blocks comprising the program logic should be easy to visualize on request. Questions like “what function is called from the current instruction”, “where is the current variable initialized or used”, and “how many times is the current instruction executed” arise continuously during the analysis of almost any program. In simple cases a disassembler can answer these questions; a decompiler is more intelligent and can perform deeper analysis (e.g. pointer analysis and value propagation can find the answer even in the presence of indirect addressing modes and loops). In both cases the RET helps the analyst grasp the meaning of the program and convert a sequence of instructions into very high-level descriptions like “send a mail message to all contacts” or “get a list of .doc files on the computer”. The clearest possible program representation is of paramount importance to understanding its purpose.
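To make the “where is the current variable initialized or used” question concrete, here is a toy illustration (my own sketch, not any real tool’s API or algorithm) of def-use chains over a tiny made-up register IR; real decompilers compute this kind of information on a much richer representation:

```python
# Toy sketch: def-use chains answer "where is this variable
# initialized or used". Instructions are (dest, op, args) tuples;
# string operands are variables, everything else is a constant.
# The IR shape is invented for illustration only.

def def_use_chains(instructions):
    """Map each variable to the instruction indices that define or use it."""
    defs, uses = {}, {}
    for i, (dest, _op, args) in enumerate(instructions):
        if dest is not None:
            defs.setdefault(dest, []).append(i)
        for a in args:
            if isinstance(a, str):  # variable operand, not a constant
                uses.setdefault(a, []).append(i)
    return defs, uses

program = [
    ("x", "const", (1,)),        # 0: x = 1
    ("y", "add",   ("x", 2)),    # 1: y = x + 2
    (None, "print", ("y",)),     # 2: print(y)
]
defs, uses = def_use_chains(program)
print(defs["x"], uses["x"])  # [0] [1]: x is initialized at 0, used at 1
```

A disassembler can answer such questions syntactically; the deeper analyses mentioned above (pointer analysis, value propagation) extend the same idea to cases where the definition or use is hidden behind indirect addressing.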
Another important goal of a RET is to automate program analysis and instrumentation. In this case a RET can be used as a basic building block for another program. We could analyze the input program in order to:
- find logically equivalent blocks (copyright breaches; recognition of library functions)
- find logical flaws (automatic vulnerability search; code quality assessment)
- deobfuscate it (malware analysis)
- add profiling instrumentation (optimization; code coverage)
- prepare various inputs (testing)
Obviously, we do not need a nice, readable program representation to achieve this second goal. Everything can stay in binary form, because the result of the analysis will be consumed by another computer program. One may even ask, “is this decompilation, after all?” The input program is not converted into a high-level language (HLL); more than that, a program text may not even be generated.
However, both cases use the same fundamental algorithms to analyze the input program. These algorithms are similar to the algorithms used in HLL compilers. The core of a RET builds an intermediate internal representation (usually a tree or graph-like data structure) which can then be either converted to HLL text or used to discover interesting properties of the input program, in the same way that a compiler builds an intermediate representation that is used to generate machine code.
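A minimal sketch may clarify how one intermediate representation serves both purposes. The expression-tree structure below is assumed for illustration (no particular tool’s IR): the same tree can be pretty-printed as HLL text for a human reader, or queried directly by another program.

```python
# Minimal illustrative IR: an expression tree that supports both the
# "decompiler" path (emit source text) and the "analysis" path
# (answer questions about the program without generating text).

class Var:
    def __init__(self, name):
        self.name = name
    def to_source(self):
        return self.name
    def variables(self):
        return {self.name}

class Binop:
    def __init__(self, op, lhs, rhs):
        self.op, self.lhs, self.rhs = op, lhs, rhs
    def to_source(self):
        return f"({self.lhs.to_source()} {self.op} {self.rhs.to_source()})"
    def variables(self):
        return self.lhs.variables() | self.rhs.variables()

expr = Binop("+", Var("x"), Binop("*", Var("y"), Var("x")))
print(expr.to_source())   # (x + (y * x))  -- readable text for a human
print(expr.variables())   # the set {x, y} -- raw data for a program
```

The point is that generating text is just one consumer of the representation; an automated client can walk the same tree and never produce a single line of HLL.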
Maybe we should stop using the word “decompiler” so widely to avoid confusion and to stop mixing concepts.