Introduction to Decompilation vs. Disassembly

A decompiler represents executable binary files in a readable form. More precisely, it transforms binary code into text that software developers can read and modify. The software security industry relies on this transformation to analyze and validate programs. The analysis is performed on the binary code because the source code (the text form of the software) is traditionally unavailable: vendors treat it as a commercial secret.

Programs to transform binary code into text form have always existed. Simple one-to-one mapping of processor instruction codes into instruction mnemonics is performed by disassemblers. Many disassemblers are available on the market, both free and commercial. The most powerful disassembler is our own IDA Pro. It can handle binary code for a huge number of processors and has an open architecture that allows developers to write add-on analytic modules.

Decompilers differ from disassemblers in one very important respect. While both generate human-readable text, decompilers generate much higher-level text, which is more concise and much easier to read.

Compared to low level assembly language, high level language representation has several advantages:

  • It is concise.
  • It is structured.
  • It doesn't require developers to know the assembly language.
  • It recognizes and converts low level idioms into high level notions.
  • It is less confusing and therefore easier to understand.
  • It is less repetitive and less distracting.
  • It uses data flow analysis.

Let's consider these points in detail.

Usually the decompiler's output is five to ten times shorter than the disassembler's output. For example, a typical modern program contains from 400KB to 5MB of binary code. The disassembler's output for such a program will include around 5-100MB of text, which can take anything from several weeks to several months to analyze completely. Analysts cannot spend this much time on a single program for economic reasons.

The decompiler's output for a typical program will be from 400KB to 10MB. Although this is still a large volume to read and understand (about the size of a thick book), the time needed for analysis is divided by 10 or more.

The second big difference is that the decompiler output is structured. Instead of a linear flow of instructions where each line is similar to all the others, the text is indented to make the program logic explicit. Control flow constructs such as conditional statements, loops, and switches are marked with the appropriate keywords.

The decompiler's output is easier to understand than the disassembler's output because it is high level. To be able to use a disassembler, an analyst must know the target processor's assembly language. Mainstream programmers do not use assembly languages for everyday tasks, but virtually everyone uses high level languages today. Decompilers remove the gap between the typical programming languages and the output language. More analysts can use a decompiler than a disassembler.

Decompilers convert assembly-level idioms into high-level abstractions. Some idioms can be quite long and time-consuming to analyze. The following one-line statement

x = y / 2;

can be transformed by the compiler into a series of 20-30 processor instructions. It takes at least 15-30 seconds for an experienced analyst to recognize the pattern and mentally replace it with the original line. If the code includes many such idioms, an analyst is forced to take notes and mark each pattern with its short representation. All this slows down the analysis tremendously. Decompilers remove this burden from the analysts.
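As a rough illustration of such an idiom, the sketch below shows one common way a compiler may lower a signed division by two into an add and a shift. The function name div2_idiom is hypothetical, and the code assumes 32-bit two's-complement integers with an arithmetic right shift for negative values, which mainstream compilers provide but the C standard does not guarantee.

```c
#include <stdint.h>

/* Signed division by 2 without a divide instruction.
   The sign bit is added as a bias so that the shift rounds toward zero,
   matching the result of y / 2 in C. */
static int32_t div2_idiom(int32_t y)
{
    uint32_t bias = (uint32_t)y >> 31;            /* 1 if y is negative, else 0 */
    int32_t sum = (int32_t)((uint32_t)y + bias);  /* wraps modulo 2^32 */
    return sum >> 1;                              /* assumes arithmetic shift */
}
```

An analyst reading the disassembly has to recognize the bias-and-shift pair as a division; the decompiler prints y / 2 directly.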

The number of assembler instructions to analyze is huge. They look very similar to each other and their patterns are very repetitive. Reading disassembler output is nothing like reading a captivating story. In a compiler-generated program, 95% of the code will be really boring to read and analyze. It is extremely easy for an analyst to confuse two similar-looking snippets of code and simply lose their way in the output. These two factors (the size and the boring nature of the text) lead to the following phenomenon: binary programs are never fully analyzed. Analysts try to locate suspicious parts by using some heuristics and some automation tools. Exceptions happen when the program is extremely small or an analyst devotes a disproportionately huge amount of time to the analysis. Decompilers alleviate both problems: their output is shorter and less repetitive. The output still contains some repetition, but it is manageable by a human being. Besides, this repetition can be addressed by automating the analysis.

Repetitive patterns in the binary code call for a solution. One obvious solution is to employ the computer to find patterns and somehow reduce them into something shorter and easier for human analysts to grasp. Some disassemblers (including IDA Pro) provide a means to automate analysis. However, the number of available analytical modules stays low, so repetitive code continues to be a problem. The main reason is that recognizing binary patterns is a surprisingly difficult task. Any “simple” action, including basic arithmetic operations such as addition and subtraction, can be represented in an endless number of ways in binary form. The compiler might use the addition operator for subtraction and vice versa. It can store constant numbers somewhere in its memory and load them when needed. It can use the fact that, after some operations, the register value can be proven to be a known constant, and just use the register without reinitializing it. The diversity of methods used explains the small number of available analytical modules.
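To make the point about endless encodings concrete, here is a small sketch of three ways the same subtraction can appear in compiled code. The function names are hypothetical, and the constant 5 is chosen only for illustration; unsigned 32-bit arithmetic is used so that wraparound is well defined in C.

```c
#include <stdint.h>

/* The plain form: sub reg, 5 */
static uint32_t sub_direct(uint32_t x)
{
    return x - 5u;
}

/* Addition used for subtraction: add reg, 0FFFFFFFBh (that is, -5 modulo 2^32) */
static uint32_t sub_via_add(uint32_t x)
{
    return x + 0xFFFFFFFBu;
}

/* Complement trick: ~(~x + 5) equals x - 5 modulo 2^32 */
static uint32_t sub_via_not(uint32_t x)
{
    return ~(~x + 5u);
}
```

A pattern matcher working on raw instructions would need a separate rule for each of these forms, and for every other form a compiler may choose.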

The situation is different with a decompiler. Automation becomes much easier because the decompiler provides the analyst with high level notions. Many patterns are automatically recognized and replaced with abstract notions. The remaining patterns can be detected easily because of the formalisms the decompiler introduces. For example, the notions of function parameters and calling conventions are strictly formalized. Decompilers make it extremely easy to find the parameters of any function call, even if those parameters are initialized far away from the call instruction. With a disassembler, this is a daunting task, which requires handling each case individually.

Decompilers, in contrast with disassemblers, perform extensive data flow analysis on the input. This means that questions such as, “Where is the variable initialized?” and, “Is this variable used?” can be answered immediately, without doing any extensive search over the function. Analysts routinely pose and answer these questions, and having the answers immediately increases their productivity.
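The two questions above can be sketched on a toy model. The code below is only an illustration under strong assumptions: it handles a straight-line sequence of instructions with no branches, and the struct insn type and both function names are invented for this example. A real decompiler performs this analysis over the whole control flow graph.

```c
#include <stddef.h>

/* A toy instruction: dst = src1 op src2.
   Variables are identified by small integers; -1 means "no operand". */
struct insn {
    int dst;
    int src1;
    int src2;
};

/* "Where is the variable initialized?"
   Returns the index of the closest instruction before `at` that writes
   `var`, or -1 if there is none. */
static int find_definition(const struct insn *code, size_t at, int var)
{
    for (size_t i = at; i > 0; --i) {
        if (code[i - 1].dst == var)
            return (int)(i - 1);
    }
    return -1;
}

/* "Is this variable used?"
   Returns 1 if the value written by instruction `def` is read before it
   is overwritten, 0 otherwise. */
static int is_used(const struct insn *code, size_t count, size_t def)
{
    int var = code[def].dst;
    for (size_t i = def + 1; i < count; ++i) {
        if (code[i].src1 == var || code[i].src2 == var)
            return 1;
        if (code[i].dst == var)
            return 0;   /* redefined before any read */
    }
    return 0;
}
```

With a disassembler, the analyst performs these scans by eye; with a decompiler, the answers are already reflected in the output.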

Side-by-side comparisons of disassembly and decompilation

Below you will find side-by-side comparisons of disassembly and decompilation outputs. The following examples are available:

  1. Division by two
  2. Simple enough?
  3. Where's my variable?
  4. Arithmetic is not rocket science
  5. Sample window procedure
  6. Short-circuit evaluation
  7. Inlined string operations

Division by two

Just note the difference in size! With the disassembler output, you not only need to know that compilers generate such convoluted code for signed divisions and modulo operations, you also have to spend time recognizing the patterns. Needless to say, the decompiler makes things really simple.

Assembler code
; =============== S U B R O U T I N E =======================================
; Attributes: bp-based frame
; mod_ll(long long)
    public __Z6mod_llx
__Z6mod_llx proc near

var_10 = dword ptr -10h
var_C = dword ptr -0Ch
arg_0 = qword ptr 8

    push ebp
    mov ebp, esp
    push ebx
    sub esp, 0Ch
    mov ecx, dword ptr [ebp+arg_0]
    mov ebx, dword ptr [ebp+arg_0+4]
    mov eax, ecx
    mov edx, ebx
    mov eax, edx
    mov edx, eax
    sar edx, 1Fh
    sar eax, 1Fh
    mov eax, edx
    mov edx, 0
    shr eax, 1Fh
    add eax, ecx
    adc edx, ebx
    shrd eax, edx, 1
    sar edx, 1
    mov [ebp+var_10], eax
    mov [ebp+var_C], edx
    mov eax, [ebp+var_10]
    mov edx, [ebp+var_C]
    shld edx, eax, 1
    add eax, eax
    sub ecx, eax
    sbb ebx, edx
    mov [ebp+var_10], ecx
    mov [ebp+var_C], ebx
    mov eax, [ebp+var_10]
    mov edx, [ebp+var_C]
    add esp, 0Ch
    pop ebx
    pop ebp
    retn
__Z6mod_llx endp
Pseudocode
__int64 __cdecl mod_ll(__int64 a1)
{
  return a1 % 2;
}
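For readers who want to see what the long listing computes, here is a hedged C sketch of the same arithmetic: bias by the sign bit, halve, double, and subtract. The name mod2_idiom is hypothetical, and the code assumes two's-complement 64-bit integers with an arithmetic right shift for negative values.

```c
#include <stdint.h>

/* a % 2 computed as a - 2 * (a / 2), with the division done by shifting. */
static int64_t mod2_idiom(int64_t a)
{
    /* 1 for negative input, else 0 (the sar/shr 1Fh pair in the listing) */
    uint64_t bias = (uint64_t)a >> 63;

    /* a / 2 rounded toward zero (add/adc followed by shrd/sar) */
    int64_t half = (int64_t)((uint64_t)a + bias) >> 1;

    /* a - 2 * half (shld/add followed by sub/sbb) */
    return (int64_t)((uint64_t)a - ((uint64_t)half << 1));
}
```

On a 32-bit target each of these 64-bit steps takes two instructions, one per register half, which is why the listing is so long.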

Simple enough?

Questions like

  • What are the possible return values of the function?
  • Does the function use any strings?
  • What does the function do?

can be answered almost instantly by looking at the decompiler output. Admittedly, it looks better because I renamed the local variables. In a disassembler, registers are rarely renamed because renaming hides register use and can lead to confusion.

Assembler code
; =============== S U B R O U T I N E =======================================
; int __cdecl sub_4061C0(char *Str, char *Dest)
sub_4061C0 proc near ; CODE XREF: sub_4062F0+15p
                     ; sub_4063D4+21p ...

Str = dword ptr 4
Dest = dword ptr 8

    push esi
    push offset aSmtp_ ; "smtp."
    push [esp+8+Dest] ; Dest
    call _strcpy
    mov esi, [esp+0Ch+Str]
    push esi ; Str
    call _strlen
    add esp, 0Ch
    xor ecx, ecx
    test eax, eax
    jle short loc_4061ED

loc_4061E2: ; CODE XREF: sub_4061C0+2Bj
    cmp byte ptr [ecx+esi], 40h
    jz short loc_4061ED
    inc ecx
    cmp ecx, eax
    jl short loc_4061E2

loc_4061ED: ; CODE XREF: sub_4061C0+20j
            ; sub_4061C0+26j
    dec eax
    cmp ecx, eax
    jl short loc_4061F6
    xor eax, eax
    pop esi
    retn
; ---------------------------------------------------------------------------

loc_4061F6: ; CODE XREF: sub_4061C0+30j
    lea eax, [ecx+esi+1]
    push eax ; Source
    push [esp+8+Dest] ; Dest
    call _strcat
    pop ecx
    pop ecx
    push 1
    pop eax
    pop esi
    retn
sub_4061C0 endp
Pseudocode
signed int __cdecl sub_4061C0(char *Str, char *Dest)
{
  int len;
  int i;
  char *str2;
  signed int result;

  strcpy(Dest, "smtp.");
  str2 = Str;
  len = strlen(Str);
  for ( i = 0; i < len; ++i )
  {
    if ( str2[i] == 64 )
      break;
  }
  if ( i < len - 1 )
  {
    strcat(Dest, &str2[i + 1]);
    result = 1;
  }
  else
  {
    result = 0;
  }
  return result;
}
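Once the logic is visible, the three questions are easy to check by experiment. The sketch below is a hand-written equivalent of the pseudocode, not decompiler output; the name make_smtp_host and the parameter names are hypothetical. Like the original, it performs no bounds checking, so dest must be large enough for the result.

```c
#include <string.h>

/* Builds "smtp." followed by whatever comes after the first '@' in address.
   Returns 1 if something follows the '@', 0 otherwise. */
static int make_smtp_host(const char *address, char *dest)
{
    size_t len = strlen(address);
    size_t i = 0;

    strcpy(dest, "smtp.");
    while (i < len && address[i] != '@')   /* 64 is the ASCII code of '@' */
        ++i;
    if (i + 1 < len) {                     /* at least one character after '@' */
        strcat(dest, address + i + 1);
        return 1;
    }
    return 0;
}
```

For the input "user@example.com" the function returns 1 and leaves "smtp.example.com" in dest; for input without an '@', or with nothing after it, it returns 0 and dest holds just "smtp.".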

Where's my variable?

IDA highlights the current identifier. This feature turns out to be much more useful with high-level output. In this sample, I tried to trace how the retrieved function pointer is used by the function. In the disassembly output, many unrelated occurrences of eax are highlighted, while the decompiler did exactly what I wanted.

Assembler code
; =============== S U B R O U T I N E =======================================
; int __cdecl myfunc(wchar_t *Str, int)
myfunc proc near ; CODE XREF: sub_4060+76p
                 ; .text:42E4p

Str = dword ptr 4
arg_4 = dword ptr 8

    mov eax, dword_1001F608
    cmp eax, 0FFFFFFFFh
    jnz short loc_10003AB6
    push offset aGetsystemwindo ; "GetSystemWindowsDirectoryW"
    push offset aKernel32_dll ; "KERNEL32.DLL"
    call ds:GetModuleHandleW
    push eax ; hModule
    call ds:GetProcAddress
    mov dword_1001F608, eax

loc_10003AB6: ; CODE XREF: myfunc+8j
    test eax, eax
    push esi
    mov esi, [esp+4+arg_4]
    push edi
    mov edi, [esp+8+Str]
    push esi
    push edi
    jz short loc_10003ACA
    call eax ; dword_1001F608
    jmp short loc_10003AD0
; ---------------------------------------------------------------------------

loc_10003ACA: ; CODE XREF: myfunc+34j
    call ds:GetWindowsDirectoryW

loc_10003AD0: ; CODE XREF: myfunc+38j
    sub esi, eax
    cmp esi, 5
    jnb short loc_10003ADD
    pop edi
    add eax, 5
    pop esi
    retn
; ---------------------------------------------------------------------------

loc_10003ADD: ; CODE XREF: myfunc+45j
    push offset aInf_0 ; "\\inf"
    push edi ; Dest
    call _wcscat
    push edi ; Str
    call _wcslen
    add esp, 0Ch
    pop edi
    pop esi
    retn
myfunc endp
Pseudocode
size_t __cdecl myfunc(wchar_t *buf, int bufsize)
{
  int (__stdcall *func)(_DWORD, _DWORD);
  wchar_t *buf2;
  int bufsize;
  UINT dirlen;
  size_t outlen;
  HMODULE h;

  func = g_fptr;
  if ( g_fptr == (int (__stdcall *)(_DWORD, _DWORD))-1 )
  {
    h = GetModuleHandleW(L"KERNEL32.DLL");
    func = (int (__stdcall *)(_DWORD, _DWORD))
           GetProcAddress(h, "GetSystemWindowsDirectoryW");
    g_fptr = func;
  }
  bufsize = bufsize;
  buf2 = buf;
  if ( func )
    dirlen = func(buf, bufsize);
  else
    dirlen = GetWindowsDirectoryW(buf, bufsize);
  if ( bufsize - dirlen >= 5 )
  {
    wcscat(buf2, L"\\inf");
    outlen = wcslen(buf2);
  }
  else
  {
    outlen = dirlen + 5;
  }
  return outlen;
}

Arithmetic is not rocket science

Arithmetic is not rocket science, but it is always better if someone handles it for you. You have more important things to focus on.

Assembler code
; =============== S U B R O U T I N E =======================================
; Attributes: bp-based frame
; sgell(__int64, __int64)
    public @sgell$qjj
@sgell$qjj proc near

arg_0 = dword ptr 8
arg_4 = dword ptr 0Ch
arg_8 = dword ptr 10h
arg_C = dword ptr 14h

    push ebp
    mov ebp, esp
    mov eax, [ebp+arg_0]
    mov edx, [ebp+arg_4]
    cmp edx, [ebp+arg_C]
    jnz short loc_10226
    cmp eax, [ebp+arg_8]
    setnb al
    jmp short loc_10229
; ---------------------------------------------------------------------------

loc_10226: ; CODE XREF: sgell(__int64,__int64)+Cj
    setnl al

loc_10229: ; CODE XREF: sgell(__int64,__int64)+14j
    and eax, 1
    pop ebp
    retn
@sgell$qjj endp
Pseudocode
bool __cdecl sgell(__int64 a1, __int64 a2)
{
  return a1 >= a2;
}
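The arithmetic the decompiler folded away here is a 64-bit signed comparison performed on 32-bit halves. The sketch below spells it out; the function name sge64_from_halves and its parameter layout are hypothetical, chosen to mirror the four dword arguments in the listing.

```c
#include <stdint.h>

/* Returns 1 if the 64-bit signed value (a_hi:a_lo) is greater than or
   equal to (b_hi:b_lo), 0 otherwise. */
static int sge64_from_halves(uint32_t a_lo, int32_t a_hi,
                             uint32_t b_lo, int32_t b_hi)
{
    if (a_hi != b_hi)
        return a_hi >= b_hi;   /* setnl: signed compare of the high halves */
    return a_lo >= b_lo;       /* setnb: unsigned compare of the low halves */
}
```

The high halves carry the sign and are compared as signed values; the low halves are plain magnitude bits and are compared as unsigned values. Mixing the two up is an easy mistake to make when reading the assembly by hand.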

Sample window procedure

The decompiler recognized a switch statement and nicely represented the window procedure. Without this little help the user would have to calculate the message numbers herself. Nothing particularly difficult, just time consuming and boring. What if she makes a mistake?...

Assembler code
; =============== S U B R O U T I N E =======================================
wndproc proc near ; DATA XREF: sub_4010E0+21o

Paint = tagPAINTSTRUCT ptr -0A4h
Buffer = byte ptr -64h
hWnd = dword ptr 4
Msg = dword ptr 8
wParam = dword ptr 0Ch
lParam = dword ptr 10h

    mov ecx, hInstance
    sub esp, 0A4h
    lea eax, [esp+0A4h+Buffer]
    push 64h ; nBufferMax
    push eax ; lpBuffer
    push 6Ah ; uID
    push ecx ; hInstance
    call ds:LoadStringA
    mov ecx, [esp+0A4h+Msg]
    mov eax, ecx
    sub eax, 2
    jz loc_4013E8
    sub eax, 0Dh
    jz loc_4013B2
    sub eax, 102h
    jz short loc_401336
    mov edx, [esp+0A4h+lParam]
    mov eax, [esp+0A4h+wParam]
    push edx ; lParam
    push eax ; wParam
    push ecx ; Msg
    mov ecx, [esp+0B0h+hWnd]
    push ecx ; hWnd
    call ds:DefWindowProcA
    add esp, 0A4h
    retn 10h
; ---------------------------------------------------------------------------

loc_401336: ; CODE XREF: wndproc+3Cj
    mov ecx, [esp+0A4h+wParam]
    mov eax, ecx
    and eax, 0FFFFh
    sub eax, 68h
    jz short loc_40138A
    dec eax
    jz short loc_401371
    mov edx, [esp+0A4h+lParam]
    mov eax, [esp+0A4h+hWnd]
    push edx ; lParam
    push ecx ; wParam
    push 111h ; Msg
    push eax ; hWnd
    call ds:DefWindowProcA
    add esp, 0A4h
    retn 10h
; ---------------------------------------------------------------------------

loc_401371: ; CODE XREF: wndproc+7Aj
    mov ecx, [esp+0A4h+hWnd]
    push ecx ; hWnd
    call ds:DestroyWindow
    xor eax, eax
    add esp, 0A4h
    retn 10h
; ---------------------------------------------------------------------------

loc_40138A: ; CODE XREF: wndproc+77j
    mov edx, [esp+0A4h+hWnd]
    mov eax, hInstance
    push 0 ; dwInitParam
    push offset DialogFunc ; lpDialogFunc
    push edx ; hWndParent
    push 67h ; lpTemplateName
    push eax ; hInstance
    call ds:DialogBoxParamA
    xor eax, eax
    add esp, 0A4h
    retn 10h
; ---------------------------------------------------------------------------

loc_4013B2: ; CODE XREF: wndproc+31j
    push esi
    mov esi, [esp+0A8h+hWnd]
    lea ecx, [esp+0A8h+Paint]
    push ecx ; lpPaint
    push esi ; hWnd
    call ds:BeginPaint
    push eax ; HDC
    push esi ; hWnd
    call my_paint
    add esp, 8
    lea edx, [esp+0A8h+Paint]
    push edx ; lpPaint
    push esi ; hWnd
    call ds:EndPaint
    pop esi
    xor eax, eax
    add esp, 0A4h
    retn 10h
; ---------------------------------------------------------------------------

loc_4013E8: ; CODE XREF: wndproc+28j
    push 0 ; nExitCode
    call ds:PostQuitMessage
    xor eax, eax
    add esp, 0A4h
    retn 10h
wndproc endp
Pseudocode
LRESULT __stdcall wndproc(HWND hWnd, UINT Msg, WPARAM wParam, LPARAM lParam)
{
  LRESULT result;
  HWND h;
  HDC dc;
  CHAR Buffer; // [sp+40h] [bp-64h]@1
  struct tagPAINTSTRUCT Paint; // [sp+0h] [bp-A4h]@10

  LoadStringA(hInstance, 0x6Au, &Buffer, 100);
  switch ( Msg )
  {
    case 2u:
      PostQuitMessage(0);
      result = 0;
      break;
    case 15u:
      h = hWnd;
      dc = BeginPaint(hWnd, &Paint);
      my_paint(h, dc);
      EndPaint(h, &Paint);
      result = 0;
      break;
    case 273u:
      if ( (_WORD)wParam == 104 )
      {
        DialogBoxParamA(hInstance, (LPCSTR)0x67, hWnd, DialogFunc, 0);
        result = 0;
      }
      else
      {
        if ( (_WORD)wParam == 105 )
        {
          DestroyWindow(hWnd);
          result = 0;
        }
        else
        {
          result = DefWindowProcA(hWnd, 0x111u, wParam, lParam);
        }
      }
      break;
    default:
      result = DefWindowProcA(hWnd, Msg, wParam, lParam);
      break;
  }
  return result;
}
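The calculation the analyst would otherwise do by hand is a running sum. The compiled code tests the message with a chain of subtractions (sub eax, 2; sub eax, 0Dh; sub eax, 102h), so each case value is the total of all the deltas up to that point. The helper below is a hypothetical sketch of that bookkeeping, not part of any tool.

```c
#include <stddef.h>

/* Converts a chain of "sub eax, delta / jz" tests into the case values
   they correspond to. cases must have room for count entries. */
static void recover_case_values(const unsigned *deltas, size_t count,
                                unsigned *cases)
{
    unsigned running = 0;
    for (size_t i = 0; i < count; ++i) {
        running += deltas[i];   /* each test sees the accumulated subtraction */
        cases[i] = running;
    }
}
```

Feeding it the deltas 2, 0x0D, and 0x102 yields 2, 15, and 273 (0x111), which are the values of WM_DESTROY, WM_PAINT, and WM_COMMAND in the Windows headers. The same applies to the inner chain on the low word of wParam: sub eax, 68h followed by dec eax gives the command identifiers 104 and 105.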

Short-circuit evaluation

This is an excerpt from a big function to illustrate short-circuit evaluation. Complex things happen in long functions, and it is very handy to have the decompiler represent things in a human-readable way. Please note how the code that was scattered over the address space is concisely displayed in two if statements.

Assembler code
loc_804BCC7: ; CODE XREF: sub_804BB10+A42j
    mov [esp+28h+var_24], offset aUnzip ; "unzip"
    xor eax, eax
    test esi, esi
    setnz al
    mov edx, 1
    mov ds:dword_804FBAC, edx
    lea eax, [eax+eax+1]
    mov ds:dword_804F780, eax
    mov eax, ds:dword_804FFD4
    mov [esp+28h+var_28], eax
    call _strstr
    test eax, eax
    jz loc_804C4F1

loc_804BCFF: ; CODE XREF: sub_804BB10+9F8j
    mov eax, 2
    mov ds:dword_804FBAC, eax

loc_804BD09: ; CODE XREF: sub_804BB10+9FEj
    mov [esp+28h+var_24], offset aZ2cat ; "z2cat"
    mov eax, ds:dword_804FFD4
    mov [esp+28h+var_28], eax
    call _strstr
    test eax, eax
    jz loc_804C495

loc_804BD26: ; CODE XREF: sub_804BB10+99Cj
             ; sub_804BB10+9B9j ...
    mov eax, 2
    mov ds:dword_804FBAC, eax
    xor eax, eax
    test esi, esi
    setnz al
    inc eax
    mov ds:dword_804F780, eax

.............................. SKIP ............................

loc_804C495: ; CODE XREF: sub_804BB10+210j
    mov [esp+28h+var_24], offset aZ2cat_0 ; "Z2CAT"
    mov eax, ds:dword_804FFD4
    mov [esp+28h+var_28], eax
    call _strstr
    test eax, eax
    jnz loc_804BD26
    mov [esp+28h+var_24], offset aZcat ; "zcat"
    mov eax, ds:dword_804FFD4
    mov [esp+28h+var_28], eax
    call _strstr
    test eax, eax
    jnz loc_804BD26
    mov [esp+28h+var_24], offset aZcat_0 ; "ZCAT"
    mov eax, ds:dword_804FFD4
    mov [esp+28h+var_28], eax
    call _strstr
    test eax, eax
    jnz loc_804BD26
    jmp loc_804BD3D
; ---------------------------------------------------------------------------

loc_804C4F1: ; CODE XREF: sub_804BB10+1E9j
    mov [esp+28h+var_24], offset aUnzip_0 ; "UNZIP"
    mov eax, ds:dword_804FFD4
    mov [esp+28h+var_28], eax
    call _strstr
    test eax, eax
    jnz loc_804BCFF
    jmp loc_804BD09
Pseudocode
dword_804F780 = 2 * (v9 != 0) + 1;
if ( strstr(dword_804FFD4, "unzip") || strstr(dword_804FFD4, "UNZIP") )
  dword_804FBAC = 2;
if ( strstr(dword_804FFD4, "z2cat")
  || strstr(dword_804FFD4, "Z2CAT")
  || strstr(dword_804FFD4, "zcat")
  || strstr(dword_804FFD4, "ZCAT") )
{
  dword_804FBAC = 2;
  dword_804F780 = (v9 != 0) + 1;
}
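To show why the two forms are equivalent, the sketch below writes the first test twice: once with the || operator, and once with explicit jumps laid out roughly the way the compiled code places them, with the second test moved away from the first. Both function names and the label names are hypothetical; the initial value 1 and the updated value 2 are taken from the listing.

```c
#include <string.h>

/* Structured form, as a decompiler would print it. */
static int mode_structured(const char *name)
{
    int mode = 1;
    if (strstr(name, "unzip") || strstr(name, "UNZIP"))
        mode = 2;
    return mode;
}

/* The same logic as a chain of conditional jumps. */
static int mode_branches(const char *name)
{
    int mode = 1;
    if (strstr(name, "unzip") == NULL)
        goto try_upper;         /* first operand false: evaluate the second */
set_mode:
    mode = 2;
done:
    return mode;
try_upper:                      /* out-of-line block, far from the first test */
    if (strstr(name, "UNZIP") != NULL)
        goto set_mode;
    goto done;
}
```

The second strstr call runs only when the first one fails, which is exactly the short-circuit rule of ||. In the binary the analyst has to follow the jumps to rediscover that rule; in the pseudocode it is stated by the operator.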

Inlined string operations

The decompiler tries to recognize frequently inlined string functions such as strcmp, strchr, strlen, etc. In this code snippet, calls to the strlen function have been recognized.

Assembler code
    mov eax, [esp+argc]
    sub esp, 8
    push ebx
    push ebp
    push esi
    lea ecx, ds:0Ch[eax*4]
    push edi
    push ecx ; unsigned int
    call [email protected]@Z ; operator new(uint)
    mov edx, [esp+1Ch+argv]
    mov ebp, eax
    or ecx, 0FFFFFFFFh
    xor eax, eax
    mov esi, [edx]
    add esp, 4
    mov edi, esi
    repne scasb
    not ecx
    dec ecx
    cmp ecx, 4
    jl short loc_401064
    cmp byte ptr [ecx+esi-4], '.'
    jnz short loc_401064
    mov al, [ecx+esi-3]
    cmp al, 'e'
    jz short loc_401047
    cmp al, 'E'
    jnz short loc_401064

loc_401047: ; CODE XREF: _main+41j
    mov al, [ecx+esi-2]
    cmp al, 'x'
    jz short loc_401053
    cmp al, 'X'
    jnz short loc_401064

loc_401053: ; CODE XREF: _main+4Dj
    mov al, [ecx+esi-1]
    cmp al, 'e'
    jz short loc_40105F
    cmp al, 'E'
    jnz short loc_401064

loc_40105F: ; CODE XREF: _main+59j
    mov byte ptr [ecx+esi-4], 0

loc_401064: ; CODE XREF: _main+32j _main+39j ...
    mov edi, esi
    or ecx, 0FFFFFFFFh
    xor eax, eax
    repne scasb
    not ecx
    add ecx, 3
    push ecx ; unsigned int
    call [email protected]@Z ; operator new(uint)
    mov edx, eax
Pseudocode
v4 = operator new(4 * argc + 12);
v5 = *argv;
v77 = strlen(*argv);
v3 = v77 - 1;
if ( (signed int)(v77 - 1) >= 4 )
{
  if ( v5[v3 - 4] == '.' )
  {
    chr = v5[v3 - 3];
    if ( chr == 'e' || chr == 'E' )
    {
      v7 = v5[v3 - 2];
      if ( v7 == 'x' || v7 == 'X' )
      {
        v8 = v5[v3 - 1];
        if ( v8 == 'e' || v8 == 'E' )
          v5[v3 - 4] = 0;
      }
    }
  }
}
v9 = operator new(strlen(v5) + 3);
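The inlined pattern in the listing is the sequence or ecx, 0FFFFFFFFh / xor eax, eax / repne scasb / not ecx / dec ecx. The sketch below emulates it step by step in C to show why it yields a string length; the function name strlen_scasb is hypothetical, and the variable ecx simply mirrors the register.

```c
#include <stdint.h>

/* Emulates the inlined strlen idiom built on repne scasb. */
static uint32_t strlen_scasb(const char *s)
{
    uint32_t ecx = 0xFFFFFFFFu;   /* or ecx, 0FFFFFFFFh */
    char c;

    do {                          /* repne scasb with al = 0 */
        c = *s++;                 /* compare the byte, advance edi */
        --ecx;                    /* one count per byte scanned */
    } while (c != 0 && ecx != 0);

    ecx = ~ecx;                   /* not ecx: bytes scanned, terminator included */
    return ecx - 1;               /* dec ecx: drop the terminator */
}
```

After the scan, ecx holds -1 minus the number of bytes examined, so the bitwise NOT recovers that number and the decrement removes the terminating zero byte.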