As we're in the midst of preparing an "introduction to reverse engineering and assembly" workshop (to be officially announced soon), we've asked ourselves the obvious question: which of the hundreds of x86-64 assembly instructions should one start with?
Typically you would learn assembly instructions in groups. First the basic data moves, arithmetic, and bit operations. Then the unconditional jumps, followed by tests and comparisons paired with conditional jumps. Then probably stack instructions, followed by `CALL`/`RET`. And then it gets a bit vague. But is this the right order? Is it really more urgent to know what `JS` is than what `CALL` is, especially when learning assembly with reverse code engineering as the final goal?
These questions should be considered from a couple of angles, but one thing that might help with the answer is always data!
Given the above, Dan wrote a Python script for Ghidra and counted how many times each specific instruction appeared in common binaries found on both Windows and Ubuntu Linux. To be a bit more precise, by instruction we specifically mean a unique mnemonic as shown by Ghidra.
Here are the results:
# | Instruction | Share | |
---|---|---|---|
1. | MOV | 35.14% | |
2. | CALL | 7.97% | |
3. | LEA | 6.83% | |
See full top 100 at the bottom of this post...
Based on this data, if you just learn the top 10 assembly instructions on the list, you will understand over 75% of the instructions that make up the code you are reverse engineering. Learn the top 20 and that rises to over 90% of the instructions actually used. Of course, there's a long tail of instructions remaining: we've observed over 800 of them, and more do exist. And there's more to understanding an algorithm than just understanding the instructions, but this might be a good way to get a fast start.
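The coverage figures above are simply running sums of the "Share" column. A minimal sketch of that arithmetic, using the rounded shares of the first 20 entries copied from the full table in this post (the function name `cumulative_share` is made up for illustration, and this is not the script used to produce the data):

```python
# Shares (in percent) of the 20 most common mnemonics, copied from the
# table in this post. The values are rounded, so the sums are approximate.
TOP_SHARES = [
    ("MOV", 35.14), ("CALL", 7.97), ("LEA", 6.83), ("CMP", 4.98), ("JZ", 4.15),
    ("TEST", 4.04), ("POP", 4.03), ("JMP", 4.00), ("PUSH", 3.98), ("ADD", 3.07),
    ("JNZ", 2.79), ("XOR", 2.51), ("SUB", 1.76), ("RET", 1.15), ("MOVZX", 1.02),
    ("AND", 0.91), ("NOP", 0.82), ("MOVUPS", 0.74), ("ENDBR64", 0.67), ("MOVAPS", 0.61),
]


def cumulative_share(shares, n):
    """Return the summed share (in percent) of the first n entries."""
    return round(sum(share for _, share in shares[:n]), 2)


if __name__ == "__main__":
    for n in (10, 20):
        print(f"top {n}: {cumulative_share(TOP_SHARES, n)}%")
```

With the rounded figures from the table this comes out at roughly 78% for the top 10 and roughly 91% for the top 20, which is consistent with the "over 75%" and "over 90%" claims.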
A few notes about this table:
- While it's interesting to consider the popularity of operations, and it can serve as a good datapoint when designing a course, many applications require a more accurate approach (e.g. when designing CPUs and trying to figure out where to spend more time on optimisation).
- The "Share" column indicates the proportion of all observed instructions that this particular instruction accounts for. There is likely some error here, as described later in these notes, so don't read too much into the actual number. Also, it matters more whether a given instruction is common or not than whether it sits in, say, 55th or 57th place.
- On the same note, please remember that this is a list of most common instructions as observed in code (deadlistings). Specifically, this is not a list of most commonly executed instructions.
- Also, any instructions which dynamically appear in memory, for example as a result of code unpacking or just-in-time compilation, would also not make it to the list.
- As the instructions were sourced from Ghidra's output, some Ghidra-specific idiosyncrasies do apply. For example, the `LOCK` instruction prefix is output by Ghidra as a suffix in the instruction name. As such, e.g. `INC` and `INC.LOCK` appear as separate entries in the table, in 29th and 78th place. You could argue that this is the same instruction, in which case it would occupy 28th place.
- Note that you might get slightly different results with other disassemblers. For example, some disassemblers might use data-length suffixes (`MOVL` or `MOV.L`). Some disassemblers might represent multi-byte no-ops as `NOP` or `NOPn`, while others might show the actual aliased instructions.
- Given the problem of distinguishing what actually is and isn't code in a program (usually solved by a bunch of heuristics), there is likely a slight error introduced on this front as well. This also includes disassembler runs that start at a wrong offset (inside an actually generated instruction). Thankfully these tend to realign pretty quickly during the run.
- There are certain differences in the order and instructions that appear in the top 100 between Windows and Linux. This makes sense as different compiler suites are used to generate the majority of code on these platforms.
- Along the same lines, the results would be different on Gentoo, for example, especially if you use custom compilation settings in your `make.conf`.
- Lastly, the top 100 list of instructions in user-mode and kernel-mode code will probably be different as well. To state the obvious, privileged instructions do not appear in user-mode programs outside of errors (and local-privilege escalation exploits of course).
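To make the counting and the `LOCK` note above a bit more concrete, here is a minimal sketch of how mnemonics could be tallied and turned into shares, with an optional switch that folds Ghidra-style `.LOCK` entries into their base mnemonic. This is not the script used for this post: it takes mnemonics as plain strings rather than pulling them from a disassembler, and the names `count_mnemonics`, `shares` and `merge_lock` are made up for illustration.

```python
from collections import Counter


def count_mnemonics(mnemonics, merge_lock=False):
    """Count mnemonics; optionally fold '.LOCK' entries into the base name."""
    counts = Counter()
    for name in mnemonics:
        name = name.upper()
        if merge_lock and name.endswith(".LOCK"):
            name = name[: -len(".LOCK")]  # e.g. "INC.LOCK" -> "INC"
        counts[name] += 1
    return counts


def shares(counts):
    """Return (mnemonic, share in percent) pairs, most common first."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [(name, 100.0 * n / total) for name, n in counts.most_common()]


if __name__ == "__main__":
    # Hypothetical toy listing, not data from this post.
    sample = ["MOV", "MOV", "CALL", "INC", "INC.LOCK", "MOV", "LEA", "INC.LOCK"]
    for name, share in shares(count_mnemonics(sample, merge_lock=True)):
        print(f"{name:6} {share:5.1f}%")
```

Running the same input with and without `merge_lock=True` shows how a choice like this can shift an instruction's position in the ranking without changing the underlying code at all.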
Top 100 Most Common x86-64 Instructions
# | Instruction | Share | |
---|---|---|---|
1. | MOV | 35.14% | |
2. | CALL | 7.97% | |
3. | LEA | 6.83% | |
4. | CMP | 4.98% | |
5. | JZ | 4.15% | |
6. | TEST | 4.04% | |
7. | POP | 4.03% | |
8. | JMP | 4.00% | |
9. | PUSH | 3.98% | |
10. | ADD | 3.07% | |
11. | JNZ | 2.79% | |
12. | XOR | 2.51% | |
13. | SUB | 1.76% | |
14. | RET | 1.15% | |
15. | MOVZX | 1.02% | |
16. | AND | 0.91% | |
17. | NOP | 0.82% | |
18. | MOVUPS | 0.74% | |
19. | ENDBR64 | 0.67% | |
20. | MOVAPS | 0.61% | |
21. | JA | 0.52% | |
22. | INT3 | 0.45% | |
23. | SHR | 0.41% | |
24. | MOVSXD | 0.41% | |
25. | OR | 0.39% | |
26. | SHL | 0.36% | |
27. | JC | 0.31% | |
28. | JBE | 0.28% | |
29. | INC | 0.26% | |
30. | JNC | 0.25% | |
31. | MOVDQA | 0.23% | |
32. | XORPS | 0.22% | |
33. | ROR | 0.20% | |
34. | SAR | 0.19% | |
35. | IMUL | 0.19% | |
36. | JS | 0.19% | |
37. | JNS | 0.19% | |
38. | JLE | 0.17% | |
39. | MOVDQU | 0.14% | |
40. | JG | 0.13% | |
41. | MOVSD | 0.11% | |
42. | MOVQ | 0.11% | |
43. | CMOVZ | 0.10% | |
44. | MOVSS | 0.10% | |
45. | SETZ | 0.09% | |
46. | DEC | 0.09% | |
47. | UD1 | 0.09% | |
48. | BT | 0.09% | |
49. | MOVSX | 0.07% | |
50. | SETNZ | 0.07% | |
51. | CMOVNZ | 0.07% | |
52. | PXOR | 0.06% | |
53. | JL | 0.06% | |
54. | JGE | 0.06% | |
55. | XADD.LOCK | 0.06% | |
56. | NEG | 0.06% | |
57. | DEC.LOCK | 0.06% | |
58. | CMOVS | 0.05% | |
59. | CMOVNC | 0.05% | |
60. | SUB.LOCK | 0.05% | |
61. | PADDD | 0.05% | |
62. | NOT | 0.04% | |
63. | ROL | 0.04% | |
64. | MOVD | 0.04% | |
65. | PUNPCKLQDQ | 0.04% | |
66. | CDQE | 0.04% | |
67. | CMOVC | 0.04% | |
68. | SBB | 0.04% | |
69. | CMOVNS | 0.04% | |
70. | CMPXCHG.LOCK | 0.03% | |
71. | VMOVDQA | 0.03% | |
72. | ADC | 0.03% | |
73. | UCOMISS | 0.02% | |
74. | MOVAPD | 0.02% | |
75. | DIV | 0.02% | |
76. | CMOVA | 0.02% | |
77. | PSHUFD | 0.02% | |
78. | INC.LOCK | 0.02% | |
79. | PADDW | 0.02% | |
80. | CMOVBE | 0.02% | |
81. | BSWAP | 0.02% | |
82. | MULSS | 0.02% | |
83. | CMOVO | 0.02% | |
84. | MUL | 0.02% | |
85. | LFENCE | 0.02% | |
86. | SETC | 0.02% | |
87. | VPADDD | 0.02% | |
88. | CMOVL | 0.02% | |
89. | CVTSI2SD | 0.02% | |
90. | CVTSI2SS | 0.02% | |
91. | ADDSS | 0.02% | |
92. | UCOMISD | 0.02% | |
93. | FSTP | 0.02% | |
94. | PAND | 0.02% | |
95. | UD2 | 0.02% | |
96. | FLDZ | 0.02% | |
97. | MULSD | 0.02% | |
98. | POR | 0.02% | |
99. | CMOVGE | 0.01% | |
100. | PMADDWD | 0.01% | |