x86 Machine Code Statistics - strchr.com


Which instruction is the most common one in your code? In this test, three popular open-source applications were disassembled and analysed with a Basic script:

All programs were developed with Microsoft Visual C++ 6.0.

Most frequent instructions

Top 20 instructions of x86 architecture: mov constitutes 35% of all instructions, push accounts for 10%, call for 6%, cmp for 5%, and add, pop, and lea for 4% each

The most popular instruction is MOV (35% of all instructions). Note that PUSH is twice as common as POP. The two are used in pairs to preserve the EBP, ESI, EDI, and EBX registers across function calls, but PUSH is also used for passing arguments to functions, which is why it is more frequent. CALLs to functions are also very popular.

More than 50% of all code is dedicated to moving things between registers and memory (MOV), passing arguments and saving registers (PUSH, POP), and calling functions (CALL). Only the fourth most frequent instruction (CMP) and the ones that follow it (ADD, LEA, TEST, XOR) do actual calculations.
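A frequency table like the one above can be produced by counting the first word of every disassembled line. The following is a minimal sketch in Python, not the article's original Basic/AWK script; the listing format (one instruction per line, mnemonic first, `;` starting a comment) and the name `mnemonic_shares` are assumptions made for illustration.

```python
from collections import Counter

# Assumed set of prefixes that may precede the real mnemonic in a listing.
PREFIXES = {"rep", "repz", "repe", "repnz", "repne", "lock"}

def mnemonic_shares(listing):
    """Return {mnemonic: percentage}, most frequent first.

    Assumes one instruction per line, mnemonic first, ';' starting a comment.
    """
    counts = Counter()
    for line in listing.splitlines():
        code = line.split(";", 1)[0].strip()
        if not code:
            continue  # skip blank lines and comment-only lines
        words = code.lower().split()
        # Keep a prefix together with its instruction, e.g. "repz scasb".
        if words[0] in PREFIXES and len(words) > 1:
            mnemonic = words[0] + " " + words[1]
        else:
            mnemonic = words[0]
        counts[mnemonic] += 1
    total = sum(counts.values())
    return {m: 100.0 * n / total for m, n in counts.most_common()}

# Tiny hand-written sample, not taken from the analysed programs.
sample = """
push ebp
mov ebp, esp
mov eax, [ebp + 8]   ; first argument
cmp eax, 10
pop ebp
"""
shares = mnemonic_shares(sample)
for mnemonic, percent in shares.items():
    print(f"{mnemonic:6} {percent:5.1f}%")
```

Real disassembler output usually carries addresses and raw bytes before the mnemonic, so those columns would have to be stripped first.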

Among conditional jumps, JE and JNE (equal and not equal) are the most popular. CMP and TEST are commonly used to check conditions. The share of the LEA instruction is surprisingly high, because the MS VC++ compiler generates it for multiplications by a constant (e.g., LEA eax, [eax*4+eax], which multiplies eax by 5) and for additions and subtractions when the result should be saved to another register, e.g.:

LEA eax, [ecx+04]
LEA eax, [ecx+ecx]

The compiler also pads the code with harmless forms of LEA (for example, the padding may be LEA edi, [edi]). As is easy to see, the top 20 instructions include all logical operations (AND, XOR, OR) except NOT.
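The arithmetic behind these LEA forms is just the x86 effective-address formula, base + index * scale + displacement. A small sketch that models it (the function name `lea` and its keyword arguments are illustrative, not part of any real tool):

```python
MASK32 = 0xFFFFFFFF  # 32-bit registers wrap around on overflow

def lea(base=0, index=0, scale=1, disp=0):
    """Model of LEA: base + index * scale + disp, truncated to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 allows only the scale factors 1, 2, 4 and 8")
    return (base + index * scale + disp) & MASK32

eax = 7
ecx = 100

times_five = lea(base=eax, index=eax, scale=4)  # LEA eax, [eax*4+eax]
plus_four = lea(base=ecx, disp=4)               # LEA eax, [ecx+04]
doubled = lea(base=ecx, index=ecx)              # LEA eax, [ecx+ecx]
padding = lea(base=ecx)                         # LEA ecx, [ecx] changes nothing

print(times_five, plus_four, doubled, padding)
```

Because the result goes to a separate destination register and the flags are left untouched, one LEA can stand in for a MOV followed by an ADD or a shift.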

Though the LAME encoder uses MMX instructions, their share in the program's code as a whole is very low. Two FPU instructions (FLD and FSTP) appear in the top 20.

But what about other instructions? It turns out that integer multiplication and division are very rare: IMUL accounts for 0.13%, IDIV for 0.04%, and MUL and DIV for 0.02% each. Even string operations such as REPZ SCASB or REPZ MOVSB are more common (0.32%) than all IMULs and IDIVs combined. On the FPU side the picture is reversed: FMUL is more common than FADD (0.71% versus 0.27%).

Average instruction length

Distribution by length: one-byte instructions are seen in 16% of cases, two-byte ones in 29% of cases, and three-byte ones in 20% of cases. Instruction lengths range from 1 to 11 bytes.

75% of x86 instructions are shorter than 4 bytes. But if you weight each percentage by the instruction length, you will find that these short instructions take up only 53% of the code size. So the other half of a typical executable file consists of instructions with 32-bit immediate values, which are 5 bytes or longer.
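The difference between the share of instructions and the share of code bytes comes from weighting each length by its count. A sketch of that calculation follows; the histogram is made up for illustration and is NOT the article's data (the full distribution is in the Excel sheet, not in the text), and the function names are hypothetical.

```python
def instruction_share(histogram, max_length):
    """Fraction of instructions that are at most max_length bytes long."""
    total = sum(histogram.values())
    short = sum(n for length, n in histogram.items() if length <= max_length)
    return short / total

def byte_share(histogram, max_length):
    """Fraction of code bytes taken by instructions of at most max_length bytes."""
    total = sum(length * n for length, n in histogram.items())
    short = sum(length * n for length, n in histogram.items()
                if length <= max_length)
    return short / total

# Hypothetical histogram: instruction length in bytes -> number of instructions.
histogram = {1: 2, 2: 3, 3: 2, 5: 2, 6: 1}

by_count = instruction_share(histogram, 3)  # 7 of 10 instructions
by_bytes = byte_share(histogram, 3)         # 14 of 30 bytes

print(f"{by_count:.0%} of instructions, {by_bytes:.0%} of bytes")
```

In this toy data set the short instructions are a clear majority by count yet take less than half of the bytes, which is the same effect the article observes.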

The number and type of operands

Number of operands: one-operand instructions make up 37% and two-operand instructions make up 60%

Operand types: immediates constitute 20%, register operands constitute 56%, absolute addresses account for 1%, and indirect addresses account for 23%

Here are some examples of operand types:

  • immediate: 00000008, 00401024;
  • register: eax, esp;
  • absolute address: dword[00401024], byte[00401024];
  • indirect address: dword[esp + 10], dword[00401024 + eax * 4];

The parser is fairly limited: operands of the JMP and CALL instructions are counted as immediates, while in fact they are absolute addresses. Still, you can see that most operands are registers. Global variables are rare in modern programs.
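The four operand types can be told apart with a few textual rules. The sketch below follows the notation of the examples above (bare hexadecimal numbers, an optional size keyword before the brackets); it is an assumption about how such a parser might work, not the original script, and it knows only the general-purpose registers.

```python
import re

# General-purpose registers only; FPU, MMX and segment registers are left out.
REGISTERS = {
    "eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp",
    "ax", "bx", "cx", "dx", "si", "di", "bp", "sp",
    "al", "bl", "cl", "dl", "ah", "bh", "ch", "dh",
}
HEX_NUMBER = re.compile(r"[0-9a-f]+")
MEMORY = re.compile(r"(?:byte|word|dword|qword)?\s*\[(.*)\]")

def classify_operand(operand):
    """Return 'register', 'immediate', 'absolute address',
    'indirect address' or 'unknown' for one operand string."""
    text = operand.strip().lower()
    if text in REGISTERS:
        return "register"
    if HEX_NUMBER.fullmatch(text):
        return "immediate"
    memory = MEMORY.fullmatch(text)
    if memory:
        inside = memory.group(1).strip()
        # Only a constant inside the brackets: a fixed (absolute) address.
        if HEX_NUMBER.fullmatch(inside):
            return "absolute address"
        # Anything involving a register is an indirect address.
        return "indirect address"
    return "unknown"

for operand in ("00000008", "esp", "dword[00401024]",
                "dword[00401024 + eax * 4]"):
    print(operand, "->", classify_operand(operand))
```

Like the parser described above, this sketch would count a JMP or CALL target written as a bare number as an immediate.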

Instruction formats

Instruction formats: 'register-memory' format makes 35%, 'register-register' format makes 27%, 'register-immediate' makes 16%, 'memory-register' format makes 15%, and 'memory-immediate' format makes 7%

Examples of these instructions:

  • register-memory: ADD eax, [esp + 10]; MOV eax, [00401024]
  • register-register: ADD eax, ecx
  • register-immediate: CMP eax, 10
  • memory-register: MOV [esp + 10], eax; MOV [esi + ecx * 4], eax
  • memory-immediate: MOV [esp + 10], 0
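The format of a two-operand instruction is simply the pair (destination type, source type). A minimal sketch that reproduces the classification of the examples above; the helper names are hypothetical, only 32-bit general-purpose registers are recognised, and every bracketed operand is treated as memory.

```python
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}

def operand_kind(operand):
    """Coarse operand type: 'memory', 'register' or 'immediate'."""
    text = operand.strip().lower()
    if "[" in text:
        return "memory"
    if text in REGISTERS:
        return "register"
    return "immediate"

def instruction_format(instruction):
    """Return e.g. 'register-memory' for a two-operand instruction, else None."""
    parts = instruction.strip().split(None, 1)
    if len(parts) < 2:
        return None  # no operands at all, e.g. RET
    operands = parts[1].split(",")
    if len(operands) != 2:
        return None  # one-operand or three-operand instruction
    destination, source = operands
    return operand_kind(destination) + "-" + operand_kind(source)

examples = [
    "ADD eax, [esp + 10]",
    "ADD eax, ecx",
    "CMP eax, 10",
    "MOV [esi + ecx * 4], eax",
    "MOV [esp + 10], 0",
]
formats = [instruction_format(line) for line in examples]
for line, fmt in zip(examples, formats):
    print(f"{line:26} {fmt}")
```

There is no 'memory-memory' case to handle, because ordinary x86 instructions with two explicit operands accept at most one memory operand.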

Conclusion

Certainly, some observations are true only for the MSVC++ compiler. Other compilers will use other instructions; for example, some of them can't do the trick with the LEA instruction and will use IMUL or MOV/ADD instead. But you can see several general trends: most instructions have two operands; the memory-register format is less frequent than register-memory; MOV is the most popular instruction; and so on.

Download source code (Basic, AWK) and Excel sheet with all data (19 Kb)