How Executable EXE Files Are Built? (Film, 11 minutes)
In his latest video, Bisqwit takes viewers into the world of .exe files and how computers interpret data. When you open an .exe file in a text editor, you see only a chaotic jumble of symbols, but that jumble is an illusion. Inside a computer there is nothing but numbers, and every letter or character is just a representation of those numbers. For instance, 65 means capital A under the ASCII standard, which has long been used to map numbers to characters. Each time you open an .exe file in a text editor, you are therefore forcing an incorrect interpretation of those numbers as text.
The video explains that .exe files are structures made up of several kinds of sections, which the operating system identifies and handles differently. Bisqwit focuses on the program code. Viewers may notice that an .exe file also contains pieces of text that are readable in a hex editor. He then uses a disassembly to show that, when the numbers are read as machine code, each one has a specific meaning.
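The idea that an .exe file is a structure with sections can be sketched in code. The video does not name the file format; the sketch below assumes the Windows PE layout (an `MZ` header, a `PE\0\0` signature, a 20-byte COFF header and 40-byte section headers), and the function name `list_sections` is made up for illustration.

```python
import struct


def list_sections(image: bytes):
    """Return (name, virtual_address, raw_size) for each section header.

    Assumes `image` holds a Windows PE file; this is an illustrative
    sketch, not a complete or validating parser.
    """
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # Offset 0x3C of the DOS header stores where the PE header starts.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # COFF header: machine, section count, three fields we skip,
    # size of the optional header, characteristics.
    _machine, count, _, _, _, opt_size, _ = struct.unpack_from(
        "<HHIIIHH", image, pe_offset + 4
    )
    # Section headers follow the optional header, 40 bytes each.
    table = pe_offset + 24 + opt_size
    sections = []
    for i in range(count):
        name, _vsize, vaddr, raw_size, _raw_ptr = struct.unpack_from(
            "<8sIIII", image, table + 40 * i
        )
        text = name.rstrip(b"\0").decode("ascii", "replace")
        sections.append((text, vaddr, raw_size))
    return sections
```

Calling `list_sections(open(path, "rb").read())` on a typical Windows executable would be expected to list names such as `.text` for code and `.rdata` or `.data` for data, though the exact set depends on the toolchain that built the file.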
Although the encoding can seem convoluted, the video shows that each sequence of numbers corresponds to a specific assembler instruction. For example, the sequence 4883EC48 is the instruction that subtracts a number from the RSP register on the AMD64 architecture. Bisqwit then turns to processor documentation to explain why the numbers map to instructions this way.
Because the AMD64 programmer's manual runs to almost 3000 pages, Bisqwit uses the 30-page datasheet of the 8086, an ancestor of AMD64 released over 40 years ago, as a simpler example. The two processors are related, so their instructions look similar, and the older documentation is a practical way into the modern one. He shows how the bits of each instruction byte can be looked up in the datasheet's encoding tables.
At the time of writing, Bisqwit's video has 148,447 views and 9,941 likes, which reflects the interest in the topics he covers. It is worth watching for anyone who wants to explore programming and computer architecture further.
Timeline summary
- Introduction to opening an .exe file in a text editor.
- The visual representation of an .exe file appears as a mess.
- Discussion on how computers only understand numbers.
- Explanation of ASCII as a standard for representing characters.
- Opening an .exe file in a text editor assigns an interpretation to its data.
- If opened in an audio player, the file is seen as PCM samples.
- Opening the file in a picture viewer reveals numbers interpreted as colors.
- Using a hex editor to view the raw numbers inside the .exe file.
- The .exe file contains both binary code and readable text.
- Explanation that .exe files are structures containing various content types.
- Visualizing the disassembly of an .exe file.
- Understanding assembler instructions from number sequences.
- A brief history of CPU architecture from 8086 to AMD64.
- Comparing the disassembly with the 8086 CPU datasheet.
- Instruction encoding and how it's reflected in the datasheet.
- Illustration of how numbers in an instruction map to operations.
- Complex tasks are achieved through sequences of tiny instructions.
- Conclusion by Bisqwit inviting further exploration of topics.
Transcription
At least once, at some point in your life, you may have tried to open an .exe file in a text editor, just to see what happens, and it looks like this. It's a total mess. How does a computer make sense of any of this? The thing is, what you see on the screen is an illusion. It does not exist. Everything the computer ever sees and deals with is numbers. On a computer, there is nothing else but numbers. Anything else is a representation. There are agreed-upon standards describing which numbers are used to represent which letters, so that we can handle text on a computer. There is really no text, only numbers. But we agreed a long time ago that 65 means capital A, 70 means capital F, and so on. This agreement is called ASCII, American Standard Code for Information Interchange. Later it was extended and updated, but that's another topic. When you tell your text editor to open an .exe file, you are assigning an interpretation to the number sequence. By opening the file in a text editor, you insist that these numbers should be interpreted as text. But they are not text. If you opened this file in an audio player, they would be interpreted as PCM samples, and you could listen to the sound made from these numbers. It usually does not sound very pleasant. But the file is not audio either. You could open it in a picture viewer, and it would look like this. Now the picture viewer is interpreting the numbers as colors, but the numbers are not colors either. We can use a hex editor to look at the numbers inside the file, and see the numbers as what they really are. Numbers. What do the numbers mean then? Answering that question is not exactly straightforward. If you look carefully, you can actually see that this .exe file does actually contain some text. I mean, it still contains numbers, but these particular numbers here actually make sense, even if they are interpreted as text, as my hex editor helpfully does.
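The point that one sequence of numbers can be read as text, audio, colors or plain numbers can be tried out directly. The following is a minimal sketch; the eight sample bytes are made up for illustration and do not come from the file shown in the video, and the choices of Latin-1 text, 16-bit little-endian PCM and RGB triples are assumptions about how a viewer might interpret them.

```python
import struct

# Eight arbitrary numbers in the range 0..255 (illustrative only).
data = bytes([72, 101, 108, 108, 111, 33, 0, 255])

# 1. As text: Latin-1 assigns a character to every byte value,
#    and agrees with ASCII for values below 128.
as_text = data.decode("latin-1")

# 2. As audio: four signed 16-bit little-endian PCM samples.
as_pcm = struct.unpack("<4h", data)

# 3. As colors: the first six bytes as two (red, green, blue) pixels.
as_rgb = [tuple(data[i:i + 3]) for i in range(0, 6, 3)]

# 4. As what they are: numbers, written in hexadecimal.
as_hex = data.hex(" ")

print(repr(as_text))
print(as_pcm)
print(as_rgb)
print(as_hex)
print(chr(65), chr(70))  # the ASCII agreement: 65 is A, 70 is F
```

None of the four views is more "true" than the others; the bytes stay the same and only the interpretation changes.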
You may have thought of .exe files as buckets of computer code. A monoculture of an abstract thing called code. But that is wrong. They are actually structures. This file contains many kinds of content, and the operating system identifies these sections, and handles them differently. For now, we will focus on just the program code. This is a disassembly of an .exe file. All disassemblies look something like this. On the left edge, you see an address. It might be a memory address, or an offset into a file. In this case, it is a memory address. The second column is those numbers from the file. If I open side-by-side this disassembly, and the hex editor view, you can see how these contents are related. The code begins here. Here's the first four bytes, then the next four bytes, then two bytes and seven bytes, and so on. They are a perfect match, because it is the same data. The third column is assembler instructions. And this is the most mystifying part for many people. Remember earlier when I said that when the numbers are interpreted as text, 65 means A? When the numbers are interpreted as program code, this is the interpretation. The number sequence 4883EC48 means the assembler instruction that subtracts this number from RSP. The number sequence 4C8D4210 means the assembler instruction that adds 16 to RDX and stores it into R8. The number sequence 8BCA means the assembler instruction that copies EDX into ECX. But why? Also you will notice that these number sequences have different lengths. How does it work? This is a modern day 64-bit architecture called AMD64. The AMD64 architecture programmer's manual is six volumes long, almost 3000 pages in total. For simplicity, we'll take a look at the 8086 CPU datasheet instead. It is only 30 pages long. The 8086 is the CPU inside this computer. It is an ancestor of AMD64, released over 40 years ago. Actually, let's take a moment to appreciate the sounds of this beauty.
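The three number sequences quoted above can be decoded by hand with very little code. The sketch below is a deliberately tiny decoder that understands only the three instruction forms needed for these examples (an optional REX prefix followed by opcode 83, 8B or 8D and a ModRM byte); the function name `decode` and the output formatting are made up for illustration, and a real disassembler handles far more cases.

```python
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]
REG32 = (["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
         + [f"r{n}d" for n in range(8, 16)])


def decode(code: bytes) -> str:
    """Decode one instruction; only three forms are supported."""
    pos, rex = 0, 0
    if code[pos] & 0xF0 == 0x40:          # REX prefix: 0100WRXB
        rex = code[pos]
        pos += 1
    w, r, b = (rex >> 3) & 1, (rex >> 2) & 1, rex & 1
    opcode, modrm = code[pos], code[pos + 1]
    pos += 2
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    names = REG64 if w else REG32         # REX.W selects 64-bit operands

    if opcode == 0x83 and reg == 5 and mod == 3:
        # SUB register, 8-bit immediate (sign-extended by the CPU)
        imm = int.from_bytes(code[pos:pos + 1], "little", signed=True)
        return f"sub {names[rm + 8 * b]}, {imm:#x}"
    if opcode == 0x8B and mod == 3:
        # MOV register, register
        return f"mov {names[reg + 8 * r]}, {names[rm + 8 * b]}"
    if opcode == 0x8D and mod == 1 and rm != 4:
        # LEA register, [base + 8-bit displacement]
        disp = int.from_bytes(code[pos:pos + 1], "little", signed=True)
        return f"lea {names[reg + 8 * r]}, [{REG64[rm + 8 * b]}{disp:+#x}]"
    raise ValueError("instruction form not covered by this sketch")


for sample in ("4883EC48", "4C8D4210", "8BCA"):
    print(sample, "->", decode(bytes.fromhex(sample)))
```

The different lengths fall out of the encoding: the first two samples carry a prefix byte and a trailing constant, while `8BCA` needs only an opcode and a ModRM byte. In the second sample the constant `0x10` is 16 in decimal.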
We can even have a look at this assembly. Here's one. If the instructions look quite similar to those in the previous sample, that is not a coincidence. These two processors are related after all. In this 30-page manual, the interesting bits begin at page 26. Know what? Let's do a side-by-side view. On the left side, the disassembler listing. On the right side, the datasheet. So the first instruction is 8CC8. Move the code segment register into the AX register. Can we find it in the datasheet? As luck would have it, we can actually find it on this very page. It's here. The instruction is MOV, and the order of operation is segment register into register or memory. If we then look at the right side of the page, we see this chart. This chart tells us that this particular instruction is encoded using two bytes. The bytes are described in binary, not hexadecimal, so the description goes 10001100. This is 8C in hexadecimal. And it matches exactly what we see in the disassembly. The second byte is something called ModRM. We will get back to that later. Now let's scroll down the disassembly a bit. How about this one? LODSB. That's byte AC, or 10101100 in binary. In the datasheet we can find this instruction on page 28. Here it is. LODS. The instruction encoding is 1010110W, where W denotes the size. Zero for byte size, and one for word size. This instruction had a zero here, so it is LODSB. Load String Byte. The next instruction is 49. This is 01001001 in binary. Can we find it in the datasheet? Indeed we can. Here it is on page 27. The first byte is 01001, and then something called REG. The instruction is called DEC or decrement. Our byte ended with 001, so what does the REG number of 001 mean? The answer is found on the last page, page 30, in this table. REG is assigned according to the following table. 001 means CX. So our instruction was DEC CX.
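The datasheet lookups walked through above (8CC8, AC and 49) amount to matching bit patterns and then reading register numbers out of the remaining bits. Here is a minimal sketch of that process covering only those three encodings; the function name `decode_8086` is made up, and the 16-bit and segment register orderings are the standard 8086 ones rather than something read out in the transcript.

```python
# Register numbering for 16-bit operands: 000=AX, 001=CX, and so on.
REG16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]
# Segment register numbering: 00=ES, 01=CS, 10=SS, 11=DS.
SEGREG = ["ES", "CS", "SS", "DS"]


def decode_8086(code: bytes):
    """Return (mnemonic, length in bytes) for three 8086 encodings."""
    first = code[0]

    if first == 0b10001100:
        # MOV register/memory, segment register: opcode + ModRM byte.
        modrm = code[1]
        mod = modrm >> 6
        sreg = (modrm >> 3) & 0b11
        rm = modrm & 0b111
        if mod != 0b11:
            raise ValueError("memory operands are not covered here")
        return f"MOV {REG16[rm]}, {SEGREG[sreg]}", 2

    if first & 0b11111110 == 0b10101100:
        # LODS is 1010110w: w=0 means byte, w=1 means word.
        return ("LODSW" if first & 1 else "LODSB"), 1

    if first & 0b11111000 == 0b01001000:
        # DEC register is 01001reg: the low three bits pick the register.
        return f"DEC {REG16[first & 0b111]}", 1

    raise ValueError("encoding not covered by this sketch")


for sample in ("8CC8", "AC", "49"):
    print(sample, "->", decode_8086(bytes.fromhex(sample)))
```

The masks (`0b11111110`, `0b11111000`) play the role of the letters in the datasheet's encoding column: they blank out the variable bits such as W or REG before comparing the fixed part.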
Did we interpret it correctly? Let's have a look. Oh yes. The instruction was DEC CX indeed. On this page in the datasheet, it also describes how that ModRM from earlier works. I'll leave that as an exercise to the reader. Inside the microprocessor there is a lookup table. The processor reads the number from the program, and compares that number against this lookup table. Suppose it finds the number 11000011. It reads this table, and finds that this is RET, the subroutine return instruction. It therefore performs that operation. Then it reads the next number. Suppose that the next number it finds is 11101001. This is the unconditional jump instruction, but look, it needs more data. It needs to read two more numbers. The instruction is three bytes long. The two other numbers together constitute a bigger number, that tells where it should jump next. Once it has read three bytes, it now has enough information to perform the operation, which it will promptly do. And then it reads the next number. That's what the processor does for all eternity. Millions or billions of times in a second. Read an instruction, interpret it, execute it, and do the same over and over and over again. Some of these instructions might be complex and require multiple bytes, others are short and only one or two bytes long. There are other processor architectures in which instructions are always the same width, but on the x86 architecture, instructions have varying lengths. And that's how the numbers in your .exe file map into program code. That program code just happens to consist of millions of tiny instructions that do very small things. But when millions of them are performed in a sequence, in a rapid fashion, something more complex is achieved, such as the computer reacting to your mouse button click. I am Bisqwit. I do videos about topics that I hope are just as insightful and inspiring to you as they were to me when I originally learned the stuff. Have a blessed day, and see you again!
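The read, interpret, execute cycle described above can be modeled in a few lines. This is a toy sketch, not an emulator: it knows only the two opcodes mentioned in the transcript plus NOP (byte 90, added here as filler), it treats RET as "stop" because it models no stack, and the function name `run` is made up. It does follow the 8086 convention that the two bytes after opcode 11101001 are a signed 16-bit displacement counted from the end of the jump instruction.

```python
def run(program: bytes, ip: int = 0, max_steps: int = 100):
    """Fetch, decode and 'execute' instructions; return a trace."""
    trace = []
    for _ in range(max_steps):
        opcode = program[ip]                  # fetch one number
        if opcode == 0b11000011:              # RET: one byte long
            trace.append((ip, "RET"))
            return trace                      # toy model: RET ends the run
        elif opcode == 0b11101001:            # JMP: needs two more bytes
            disp = int.from_bytes(program[ip + 1:ip + 3],
                                  "little", signed=True)
            target = (ip + 3 + disp) & 0xFFFF  # relative to next instruction
            trace.append((ip, f"JMP {target:#06x}"))
            ip = target
        elif opcode == 0x90:                  # NOP: one byte, does nothing
            trace.append((ip, "NOP"))
            ip += 1
        else:
            raise ValueError(f"opcode {opcode:#04x} not in the lookup table")
    raise RuntimeError("step limit reached")


# JMP over one RET, land on a NOP, then reach the final RET.
program = bytes.fromhex("E90100" "C3" "90" "C3")
for address, text in run(program):
    print(f"{address:04x}: {text}")
```

Because the jump is three bytes long and the others are one, the loop cannot know where the next instruction starts until it has decoded the current one, which is the practical consequence of variable-length instructions.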