How Executable EXE Files Are Built? (Film, 11 minutes)

At least once in your life, you may have tried to open an .exe file in a text editor, just to see what happens, and it looks like this. It's a total mess. How does a computer make sense of any of this? The thing is, what you see on the screen is an illusion. It does not exist. Everything the computer ever sees and deals with is numbers. On a computer, there is nothing else but numbers. Anything else is a representation. There are agreed-upon standards describing which numbers are used to represent which letters, so that we can handle text on a computer. There is really no text, only numbers. But we agreed a long time ago that 65 means capital A, 70 means capital F, and so on. This agreement is called ASCII, American Standard Code for Information Interchange. Later it was extended and updated, but that's another topic.
When you tell your text editor to open an .exe file, you are assigning an interpretation to the number sequence. By opening the file in a text editor, you insist that these numbers should be interpreted as text. But they are not text. If you opened this file in an audio player, they would be interpreted as PCM samples, and you could listen to the sound made from these numbers. It usually does not sound very pleasant. But the file is not audio either. You could open it in a picture viewer, and it would look like this. Now the picture viewer is interpreting the numbers as colors, but the numbers are not colors either. We can use a hex editor to look at the numbers inside the file, and see the numbers as what they really are. Numbers. What do the numbers mean then? Answering that question is not exactly straightforward. If you look carefully, you can see that this .exe file does contain some text. I mean, it still contains numbers, but these particular numbers here make sense even when they are interpreted as text, as my hex editor helpfully does.
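To make the "same numbers, different interpretations" idea concrete, here is a minimal Python sketch. The eight bytes are an arbitrary example chosen for illustration, not taken from any real .exe file, and the 16-bit little-endian sample format is just one assumed PCM layout among many.

```python
import struct

# Eight arbitrary example bytes (an assumption for illustration only).
data = bytes([72, 101, 108, 108, 111, 33, 65, 70])

# View 1: the raw numbers, roughly as a hex editor would show them.
as_hex = data.hex(" ")

# View 2: the same numbers interpreted as ASCII text (65 is 'A', 70 is 'F').
as_text = data.decode("ascii")

# View 3: the same numbers interpreted as four signed 16-bit
# little-endian PCM samples, as an audio player might do.
as_samples = struct.unpack("<4h", data)

print(as_hex)      # 48 65 6c 6c 6f 21 41 46
print(as_text)     # Hello!AF
print(as_samples)  # (25928, 27756, 8559, 17985)
```

None of the three views is more "true" than the others. Each one is an interpretation applied to the same eight numbers.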
You may have thought of .exe files as buckets of computer code. A monoculture of an abstract thing called code. But that is wrong. They are actually structures. This file contains many kinds of content, and the operating system identifies these sections and handles them differently. For now, we will focus on just the program code. This is a disassembly of an .exe file. All disassemblies look something like this. On the left edge, you see an address. It might be a memory address, or an offset into a file. In this case, it is a memory address. The second column is those numbers from the file. If I open this disassembly and the hex editor view side by side, you can see how these contents are related. The code begins here. Here's the first four bytes, then the next four bytes, then two bytes and seven bytes, and so on. They are a perfect match, because it is the same data. The third column is assembler instructions. And this is the most mystifying part for many people. Remember earlier when I said that when the numbers are interpreted as text, 65 means A? When the numbers are interpreted as program code, this is the interpretation. The number sequence 4883EC48 means the assembler instruction that subtracts this number, hexadecimal 48, from RSP. The number sequence 4C8D4210 means the assembler instruction that adds 16 (hexadecimal 10) to RDX and stores the result into R8. The number sequence 8BCA means the assembler instruction that copies EDX into ECX.
But why? Also, you will notice that these number sequences have different lengths. How does it work? This is a modern-day 64-bit architecture called AMD64. The AMD64 architecture programmer's manual is six volumes long, almost 3000 pages in total. For simplicity, we'll take a look at the 8086 CPU datasheet instead. It is only 30 pages long. The 8086 is the CPU inside this computer. It is an ancestor of AMD64, released over 40 years ago. Actually, let's take a moment to appreciate the sounds of this beauty.
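The three-column layout of a disassembly (address, raw bytes, instruction) can be imitated with a short Python sketch. This is not a real disassembler: the `KNOWN` table is hand-written and covers only the three byte sequences quoted above, and the base address `0x1000` and the function name `split_instructions` are made up for illustration.

```python
# Hand-written table covering only the three sequences quoted in the text.
# A real disassembler derives length and meaning from the opcode bits instead.
KNOWN = {
    bytes.fromhex("4883EC48"): "sub rsp, 0x48",
    bytes.fromhex("4C8D4210"): "lea r8, [rdx+0x10]",
    bytes.fromhex("8BCA"): "mov ecx, edx",
}


def split_instructions(code: bytes, base_address: int = 0):
    """Return (address, hex bytes, mnemonic) rows, like a disassembly listing."""
    rows = []
    pos = 0
    while pos < len(code):
        for encoding, text in KNOWN.items():
            if code.startswith(encoding, pos):
                rows.append((base_address + pos, encoding.hex().upper(), text))
                pos += len(encoding)  # instructions have different lengths
                break
        else:
            raise ValueError(f"unknown bytes at offset {pos}")
    return rows


# 0x1000 is a hypothetical load address, not the one shown in the video.
listing = split_instructions(bytes.fromhex("4883EC48 4C8D4210 8BCA"), 0x1000)
for address, raw, text in listing:
    print(f"{address:08X}  {raw:<10} {text}")
```

The addresses advance by 4, 4 and 2, which is the "four bytes, four bytes, two bytes" pattern visible when the hex view and the disassembly are placed side by side.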
We can even have a look at this assembly. Here's one. The instructions look quite similar to those in the previous sample, and that is not a coincidence. These two processors are related after all. In this 30-page manual, the interesting bits begin at page 26. Know what? Let's do a side-by-side view. On the left side, the disassembler listing. On the right side, the datasheet. So the first instruction is 8CC8. Move the code segment register into the AX register. Can we find it in the datasheet? As luck would have it, we can actually find it on this very page. It's here. The instruction is MOV, and the order of operation is segment register into register or memory. If we then look at the right side of the page, we see this chart. This chart tells us that this particular instruction is encoded using two bytes. The bytes are described in binary, not hexadecimal, so the description goes 10001100. This is 8C in hexadecimal. And it matches exactly what we see in the disassembly. The second byte is something called mod R/M. We will get back to that later.
Now let's scroll down the disassembly a bit. How about this one? LODSB. That's byte AC, or 10101100 in binary. In the datasheet we can find this instruction on page 28. Here it is. LODS. The instruction encoding is 1010110W, where W denotes the size. Zero for byte size, and one for word size. This instruction had a zero here, so it is LODSB. Load String Byte. The next instruction is 49. This is 01001001 in binary. Can we find it in the datasheet? Indeed we can. Here it is on page 27. The first byte is 01001, and then something called REG. The instruction is called DEC, or decrement. Our byte ended with 001, so what does the REG number of 001 mean? The answer is found on the last page, page 30, in this table. REG is assigned according to the following table. 001 means CX. So our instruction was DEC CX.
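The three lookups just performed by hand can be written down as bit tests. The following Python sketch handles only those three encodings (MOV from a segment register with a register operand, LODS, and DEC register). The function name `decode` and the table names are made up for illustration, and the register tables follow the standard 8086 numbering rather than anything quoted in the transcript beyond "001 means CX".

```python
# Standard 8086 numbering of 16-bit registers (REG field) and segment registers.
REG16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]
SEGREG = ["ES", "CS", "SS", "DS"]


def decode(code: bytes) -> str:
    """Decode one instruction; only the three encodings discussed are handled."""
    first = code[0]
    if first == 0b10001100:                   # MOV register/memory, segment register
        second = code[1]                      # the mod R/M byte
        mod = second >> 6                     # top two bits
        seg = (second >> 3) & 0b11            # segment register number
        rm = second & 0b111                   # low three bits
        if mod != 0b11:
            raise NotImplementedError("memory operands are not covered in this sketch")
        return f"MOV {REG16[rm]}, {SEGREG[seg]}"
    if (first & 0b11111110) == 0b10101100:    # LODS: 1010110w
        return "LODSW" if first & 1 else "LODSB"
    if (first & 0b11111000) == 0b01001000:    # DEC register: 01001 reg
        return f"DEC {REG16[first & 0b111]}"
    raise NotImplementedError(f"opcode {first:08b} is not covered in this sketch")


print(decode(bytes.fromhex("8CC8")))  # MOV AX, CS
print(decode(bytes.fromhex("AC")))    # LODSB
print(decode(bytes.fromhex("49")))    # DEC CX
```

The masks mirror the datasheet notation: the fixed bits of each pattern are compared, and the variable bits (W, REG) are extracted separately.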
Did we interpret it correctly? Let's have a look. Oh yes. The instruction was DEC CX indeed. This page in the datasheet also describes how that mod R/M from earlier works. I'll leave that as an exercise to the reader.
Inside the microprocessor there is a lookup table. The processor reads the number from the program, and looks that number up in this table. Suppose it finds the number 11000011. It reads this table, and finds that this is RET, the subroutine return instruction. It therefore performs that operation. Then it reads the next number. Suppose that the next number it finds is 11101001. This is the unconditional jump instruction, but look, it needs more data. It needs to read two more numbers. The instruction is three bytes long. The two other numbers together constitute a bigger number that tells where it should jump next. Once it has read three bytes, it has enough information to perform the operation, which it will promptly do. And then it reads the next number. That's what the processor does for all eternity. Millions or billions of times per second. Read an instruction, interpret it, execute it, and do the same over and over and over again. Some of these instructions might be complex and require multiple bytes, others are short and only one or two bytes long. There are other processor architectures in which instructions are always the same width, but on the x86 architecture, instructions have varying lengths.
And that's how the numbers in your .exe file map into program code. That program code just happens to consist of millions of tiny instructions that do very small things. But when millions of them are performed in sequence, in rapid fashion, something more complex is achieved, such as the computer reacting to your mouse button click. I am Bisqwit. I do videos about topics that I hope are just as insightful and inspiring to you as they were to me when I originally learned the stuff. Have a blessed day, and see you again!
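As a footnote to the read, interpret, execute loop described above, here is a toy Python sketch of it. It knows only the two opcodes named in the narration, RET (11000011) and the three-byte unconditional jump (11101001). The function name `trace`, the example program, and the decision to stop at RET (the sketch has no stack) are all assumptions made for illustration. One detail comes from the 8086 encoding rather than the transcript: the two extra bytes form a little-endian 16-bit displacement relative to the address of the following instruction.

```python
def trace(memory: bytes, ip: int = 0, max_steps: int = 10):
    """Follow execution from ip and return a list of (address, description)."""
    steps = []
    for _ in range(max_steps):
        opcode = memory[ip]                   # read the next number
        if opcode == 0b11000011:              # RET: one byte, no further data
            steps.append((ip, "RET"))
            break                             # no stack in this sketch, so stop
        elif opcode == 0b11101001:            # JMP: needs two more bytes
            operand = memory[ip + 1:ip + 3]
            displacement = int.from_bytes(operand, "little", signed=True)
            target = (ip + 3 + displacement) & 0xFFFF
            steps.append((ip, f"JMP {target:04X}"))
            ip = target                       # continue reading at the target
        else:
            raise NotImplementedError(f"opcode {opcode:08b} is not in this toy table")
    return steps


# Hypothetical program: jump over two filler bytes, then return.
program = bytes([0xE9, 0x02, 0x00, 0x90, 0x90, 0xC3])
steps = trace(program)
print(steps)  # [(0, 'JMP 0005'), (5, 'RET')]
```

The two filler bytes at offsets 3 and 4 are never examined, because the jump moves the read position past them.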
