X86es and Assemblers
2025-01-06 16:13 +0000
For those unaware, I have been working on an x86 assembler throughout the last year. For now, it’s nothing special. It doesn’t assemble most of the x86 instructions. It doesn’t support addressing. Neither can it assemble to an object file. And not too long ago, it didn’t even have a CLI. But, now it does. Even though it only outputs machine code, and nothing else.
If you want to check it out, you can. I named it rei. It’s written in the most amazing language, known as Go, and has four parts to it:
- cmd/rei – The CLI for the assembler.
- rasm – The assembler’s lexer, parser, code generator. Basically rei’s syntax.
- x86 – The x86 mnemonic/instruction to machine code translation layer.
- relf – The ELF object file writer library.
relf, at the time of writing, hasn’t even been seriously worked on. It’s not in the git repo, and locally only has like 10 lines of code. But the rest of it is – in my opinion – well-written. I didn’t know much about Go, but little by little I got used to it.
But this isn’t what this blog post is about. I won’t talk about how my assembler works. Instead, we will go through the overload of information about assemblers and x86 that I had to take in before I could even start writing it.
Let’s start from low abstraction and move upwards…
CISC and The x86 Architecture
If you didn’t know, x86 is a CISC, or a complex instruction set computer, architecture. What does this mean? It means that the translation from assembly to machine code is as annoying as it can get. Which is fun when that’s EXACTLY what you’re trying to do! Obviously that’s an oversimplification of what CISC is. To explain it more clearly, you need to understand another type of instruction set architecture, or ISA: RISC.
RISC, or reduced instruction set computer, is a style of instruction set where each instruction is encoded in a constant-sized area of bits. For example, the base RISC-V and ARM instruction sets encode every instruction in 32 bits (both also have optional 16-bit compressed encodings, the RISC-V C extension and ARM’s Thumb). These types of ISAs also tend to have a small number of instructions, which makes them easier to build assemblers for.
Now that you understand what RISC is, to understand CISC, take everything written above and turn it upside-down. Instructions have variable sizes, and the size can also change depending on the arguments (known as operands) passed to the instruction. In x86 the size of an instruction can go from a single byte to 15 BYTES. There are also almost 1000 unique mnemonics in x86 and MORE THAN 3000 INSTRUCTION VARIANTS.
To put into perspective how much that is: in the Intel manual, which is broken down into several volumes, the volume which explains the instruction encoding – the second one – is about 2500 PAGES LONG on its own. But, for some reason, I still tried to understand it, and now you will too!
In assembly, we have the so-called registers to store numerical values in. In x86, there are quite a few registers. Here are some of them:
- 64-bit: rax, rbx, rcx, rdx, r8 to r15, etc.
- 32-bit: eax, ebx, ecx, edx, eip, ebp, esi, etc.
- 16-bit: ax, bx, cx, dx, bp, sp, etc.
- 8-bit: al, ah, bl, bh, cl, ch, dl, dh, etc.
When translating to machine code, we also care about the endianness of the system. Endianness describes how the bytes of multibyte values are ordered. There are two types of endianness: big and little. For x86, which is little-endian, the bytes are written in reverse order, or in less layman speak, least-significant byte first. So, if you had the number 0xa1a2, and say that the value is 32 bits wide, it would be encoded as 0xa2 0xa1 0x00 0x00.
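To make the byte order concrete, here is a minimal Go sketch using the standard encoding/binary package. The helper name le32 is made up for this example and isn't taken from rei.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// le32 returns v as four bytes, least-significant byte first.
// (Hypothetical helper, only for illustration.)
func le32(v uint32) []byte {
	buf := make([]byte, 4)
	binary.LittleEndian.PutUint32(buf, v)
	return buf
}

func main() {
	fmt.Printf("% x\n", le32(0xa1a2)) // a2 a1 00 00
}
```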
Machine code for each and every instruction is structured like so:
- Prefixes;
- Opcode;
- ModR/M Byte;
- SIB Byte;
- Displacement; and
- Immediate.
Of these, only the opcode is guaranteed to exist.
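One way to picture this layout is as a struct whose parts get concatenated in order. The sketch below is only an illustration: the instr type and its field names are invented here, and the bytes in main belong to add rax, 0x7f, which gets assembled in the next example.

```go
package main

import "fmt"

// instr mirrors the parts listed above. Every field except opcode may be
// empty. (Hypothetical type, not rei's internal representation.)
type instr struct {
	prefixes []byte
	opcode   []byte
	modrm    []byte // zero or one byte
	sib      []byte // zero or one byte
	disp     []byte
	imm      []byte
}

// bytes concatenates the parts in the order they appear in machine code.
func (i instr) bytes() []byte {
	var out []byte
	for _, part := range [][]byte{i.prefixes, i.opcode, i.modrm, i.sib, i.disp, i.imm} {
		out = append(out, part...)
	}
	return out
}

func main() {
	// add rax, 0x7f: prefix 48, opcode 83, ModR/M c0, immediate 7f.
	in := instr{
		prefixes: []byte{0x48},
		opcode:   []byte{0x83},
		modrm:    []byte{0xc0},
		imm:      []byte{0x7f},
	}
	fmt.Printf("% x\n", in.bytes()) // 48 83 c0 7f
}
```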
Just to give you a little sneak peek at what you’re in for, look at the following assembly code:
add rax, 0xa1a2 ; rax += 0xa1a2
add rax, 0x7f ; rax += 0x7f
This assembles to:
0: 48 05 a2 a1 00 00 add rax,0xa1a2
6: 48 83 c0 7f add rax,0x7f
The 0x48 value is a prefix, and 0x05 and 0x83 are the identifiers for the instructions, called opcodes. As you can see, just changing the value of the constant that gets added to the rax register can make a huge difference in the outputted machine code. But, before we discuss why that is, let’s understand the only thing unchanged: the prefix.
The Prefixes
There are quite a few prefixes, but only two of them are of interest here: the REX byte and 0x66. As a heads-up, I haven’t yet seen or needed to use the other prefixes, so they may still become “of interest” later.
The REX byte is of form 0100WRXB, where:
- W – specifies that operands are 64-bit if 1; if 0, the operand size is the default specified by the code segment.
- R – is an extension to the ModR/M’s reg field (will be discussed later).
- X – is an extension to the SIB index field (will also be discussed).
- B – is an extension to ModR/M’s r/m, SIB’s base, or opcode’s reg fields.
For now, all you need to know is that if the instruction’s destination register, or the register where the result of the instruction will go to, is 64-bit, the W bit is set. From the assembly before, this is where we get the 0x48. If we had chosen one of the registers r8 to r15 as the destination, we would also have to set the B bit. But if we had
add rax, r13 ; rax += r13
we would also have to set the R bit, as r13 is the source register here and ends up in the reg field. Pretty cool, isn’t it?
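Putting the bit layout into code, a REX byte could be built roughly like this. The function name rex is an assumption made for the sketch, not something from rei.

```go
package main

import "fmt"

// rex builds a REX prefix of the form 0100WRXB.
// (Hypothetical helper, only for illustration.)
func rex(w, r, x, b bool) byte {
	v := byte(0x40) // the fixed 0100 upper nibble
	if w {
		v |= 0x08
	}
	if r {
		v |= 0x04
	}
	if x {
		v |= 0x02
	}
	if b {
		v |= 0x01
	}
	return v
}

func main() {
	fmt.Printf("%#x\n", rex(true, false, false, false)) // 0x48: 64-bit operands only
	fmt.Printf("%#x\n", rex(true, true, false, false))  // 0x4c: 64-bit operands, reg field extended
}
```

The second value is the prefix add rax, r13 ends up with when r13 is placed in the reg field.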
The 0x66 prefix byte is simple to explain. It’s the operand-size override prefix: in 32- and 64-bit code, it indicates that the instruction has 16-bit operands. As an example take the following assembly code:
mov eax, 0xbeef ; eax = 0xbeef
mov ax, 0xbeef ; ax = 0xbeef
Here eax is a 32-bit register, and ax is a 16-bit register. This is the machine code:
0: b8 ef be 00 00 mov eax,0xbeef
5: 66 b8 ef be mov ax,0xbeef
As you can see the only difference between the two instructions is the prefix and, of course, the constant’s size.
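As a rough sketch of that difference, the function below encodes mov into eax or ax from an operand size. The name movAImm and its signature are made up for this example, and it only handles these two cases.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// movAImm encodes `mov eax, imm` (bits == 32) or `mov ax, imm` (bits == 16).
// (Hypothetical helper, only for illustration.)
func movAImm(bits int, imm uint32) ([]byte, error) {
	switch bits {
	case 32:
		out := []byte{0xb8, 0, 0, 0, 0}
		binary.LittleEndian.PutUint32(out[1:], imm)
		return out, nil
	case 16:
		if imm > 0xffff {
			return nil, fmt.Errorf("immediate %#x does not fit in 16 bits", imm)
		}
		// Same opcode, but with the 0x66 prefix and a 2-byte immediate.
		out := []byte{0x66, 0xb8, 0, 0}
		binary.LittleEndian.PutUint16(out[2:], uint16(imm))
		return out, nil
	}
	return nil, fmt.Errorf("unsupported operand size %d", bits)
}

func main() {
	for _, bits := range []int{32, 16} {
		code, err := movAImm(bits, 0xbeef)
		if err != nil {
			panic(err)
		}
		fmt.Printf("% x\n", code) // b8 ef be 00 00, then 66 b8 ef be
	}
}
```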
The Opcodes
Obviously, there are thousands of opcodes, but behind all these opcodes, there are two types. These are static opcodes – they are always the same – and register-compact/dynamic opcodes, which encode the register into the opcode. To give you an example, take:
; static
mov eax, ebx ; 89 d8
mov ebx, ecx ; 89 cb
; dynamic
mov eax, 1 ; b8 01 00 00 00
mov ecx, 1 ; b9 01 00 00 00
mov edx, 1 ; ba 01 00 00 00
mov ebx, 1 ; bb 01 00 00 00
As you can see, the opcode for mov reg32, reg32 – 0x89 – stays static, while the opcode for mov reg32, imm32 – 0xb8 – changes depending on the register. In the Intel manual these opcodes are denoted as 89 /r and B8+rd, where /r implies a ModR/M byte, and +rd (like its siblings +rb, +rw, and +ro for the other operand sizes) means that the lower 3 bits of the opcode are used to encode the register in the opcode (if you remember the REX, its B bit can extend it for the r8-r15 registers).
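Here is a small sketch of how such a register-compact opcode could be assembled, including the REX.B extension. The function name movR32Imm32 and the idea of passing the register as a number from 0 to 15 are assumptions for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// movR32Imm32 encodes `mov r32, imm32`. reg is the register number 0..15,
// where 8..15 stand for r8d..r15d and need the REX.B bit.
// (Hypothetical helper, only for illustration.)
func movR32Imm32(reg byte, imm uint32) []byte {
	var out []byte
	if reg >= 8 {
		out = append(out, 0x41) // REX with only the B bit set
	}
	out = append(out, 0xb8+(reg&7)) // low 3 bits of the register go into the opcode
	var buf [4]byte
	binary.LittleEndian.PutUint32(buf[:], imm)
	return append(out, buf[:]...)
}

func main() {
	fmt.Printf("% x\n", movR32Imm32(3, 1)) // mov ebx, 1 -> bb 01 00 00 00
	fmt.Printf("% x\n", movR32Imm32(9, 1)) // mov r9d, 1 -> 41 b9 01 00 00 00
}
```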
Now you may be asking how these 3 bits (or 4 bits) are distributed between registers. Obviously, these bits alone can’t uniquely represent all the registers – the 64-bit registers by themselves already use up every value. This is why x86 instead divides all the registers into groups. I like to call these “register classes”, and they are:
- A-class (identifier 0b000): RAX, EAX, AX, and AL;
- C-class (identifier 0b001): RCX, ECX, CX, and CL;
- D-class (identifier 0b010): RDX, EDX, DX, and DL;
- B-class (identifier 0b011): RBX, EBX, BX, and BL;
- SP-class (identifier 0b100): RSP, ESP, SP, and AH;
- BP-class (identifier 0b101): RBP, EBP, BP, and CH;
- SI-class (identifier 0b110): RSI, ESI, SI, and DH;
- DI-class (identifier 0b111): RDI, EDI, DI, and BH.
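Now for clarity, these classes also include MM and XMM registers, but as we aren’t going to be discussing them, I decided to omit them. Also note that the 8-bit slots of the last four classes only mean AH, CH, DH, and BH when the instruction has no REX prefix; with one, they become SPL, BPL, SIL, and DIL. These classes are used any time a register is required to be encoded, such as in the ModR/M byte!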
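The classes above translate naturally into a lookup table. The sketch below is just one possible shape for it: the reg type, the classes table, and lookup are invented names, and the 8-bit column assumes no REX prefix.

```go
package main

import "fmt"

// reg describes a register by its class identifier and operand size.
// (Hypothetical type, only for illustration.)
type reg struct {
	class byte // 3-bit identifier
	bits  int  // operand size in bits
}

// classes[id] lists the 64-, 32-, 16- and 8-bit members of class id.
var classes = [8][4]string{
	{"rax", "eax", "ax", "al"},
	{"rcx", "ecx", "cx", "cl"},
	{"rdx", "edx", "dx", "dl"},
	{"rbx", "ebx", "bx", "bl"},
	{"rsp", "esp", "sp", "ah"},
	{"rbp", "ebp", "bp", "ch"},
	{"rsi", "esi", "si", "dh"},
	{"rdi", "edi", "di", "bh"},
}

var sizes = [4]int{64, 32, 16, 8}

// lookup finds the class identifier and size for a register name.
func lookup(name string) (reg, bool) {
	for id, row := range classes {
		for col, n := range row {
			if n == name {
				return reg{class: byte(id), bits: sizes[col]}, true
			}
		}
	}
	return reg{}, false
}

func main() {
	r, _ := lookup("bh")
	fmt.Printf("%03b %d\n", r.class, r.bits) // 111 8
}
```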
The ModR/M
The ModR/M byte is probably the most interesting part of an instruction. This byte is divided into three parts: mod (2 bits), reg (3 bits), r/m (3 bits). It is used for addressing or encoding registers, and the mod field is used to distinguish which we are looking for. If mod is
- 0b00, implies addressing from a register;
- 0b01, implies addressing from a register plus an 8-bit displacement;
- 0b10, same as above, but with a 32-bit displacement;
- 0b11, implies direct register access.
For those unfamiliar with addressing, it means taking the integer in a specified register, written [reg] in Intel syntax, and grabbing the value stored in memory at the address given by that integer. It works a lot like pointers do in higher-level languages, so you can treat them as such. As an example, let’s look at the following:
mov rbx, rcx ; 48 89 cb
mov [rbx], rcx ; 48 89 0b
As it is easy to see, the last bytes differ. These are the ModR/M bytes. Taking a look at their binary representations: 0xcb = 0b11001011 and 0x0b = 0b00001011. In the first one, we have mod = 11, meaning direct register access, reg = 001, meaning a C-class register, and r/m = 011, meaning a B-class register. The same can be said about the second one, the only difference being mod = 00, meaning it’s addressing from a register. As you might have noticed, with this opcode the destination register gets encoded in the r/m field, while the source register gets encoded in the reg field.
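The packing and unpacking of those three fields can be sketched in a few lines of Go. The names modrm and splitModRM are made up for this example.

```go
package main

import "fmt"

// modrm packs the mod (2 bits), reg (3 bits) and r/m (3 bits) fields.
// (Hypothetical helper, only for illustration.)
func modrm(mod, reg, rm byte) byte {
	return (mod&3)<<6 | (reg&7)<<3 | (rm & 7)
}

// splitModRM is the inverse: it pulls the three fields back out.
func splitModRM(b byte) (mod, reg, rm byte) {
	return b >> 6, (b >> 3) & 7, b & 7
}

func main() {
	// reg = 001 (C-class), r/m = 011 (B-class)
	fmt.Printf("%02x\n", modrm(0b11, 0b001, 0b011)) // cb, as in mov rbx, rcx
	fmt.Printf("%02x\n", modrm(0b00, 0b001, 0b011)) // 0b, as in mov [rbx], rcx

	mod, reg, rm := splitModRM(0xcb)
	fmt.Printf("%02b %03b %03b\n", mod, reg, rm) // 11 001 011
}
```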
There are two register classes that are not allowed in the r/m field of the ModR/M byte. These are the BP- (when mod = 00) and SP-classes. The identifier for the BP-class, when mod = 00, is used to encode a sole 32-bit displacement (in 64-bit mode, that displacement is relative to RIP). And the identifier for the SP-class, for any mod other than 11, is used to indicate that a SIB byte follows.
The SIB
Let’s say we want to do something like [rbx+rsi], where we use the rbx register as the base pointer for some sequence of data, and rsi as the index we want to access – something like rbx[rsi] in C, with rbx being a byte pointer. For this we use the SIB byte.
The SIB byte, or the scale-index-base byte, is a byte which, as the name implies, encodes a base register, an index register, and the scale for the index. It is defined as SSIIIBBB, where SS are the bits for the scale, III for index, and BBB for base. The index and base are encoded via the register classes, but the scale is encoded like so:
- scale = 00, implies that the scale is 1;
- scale = 01, implies 2;
- scale = 10, 4; and
- scale = 11, 8.
At first, these scales might seem weird, but as the data values being indexed will most likely be moved into registers, which hold 1, 2, 4, or 8 bytes, the sizes make sense.
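As a minimal sketch, the SIB byte could be packed like this. The function name sib is an assumption, and the sketch deliberately ignores the special cases for the SP- and BP-classes.

```go
package main

import "fmt"

// sib packs a scale (1, 2, 4 or 8), an index class and a base class into
// the SSIIIBBB layout. (Hypothetical helper, only for illustration.)
func sib(scale int, index, base byte) (byte, error) {
	var ss byte
	switch scale {
	case 1:
		ss = 0b00
	case 2:
		ss = 0b01
	case 4:
		ss = 0b10
	case 8:
		ss = 0b11
	default:
		return 0, fmt.Errorf("invalid scale %d", scale)
	}
	return ss<<6 | (index&7)<<3 | (base & 7), nil
}

func main() {
	// [rbx+rsi]: index = SI-class (110), base = B-class (011), scale 1.
	b, err := sib(1, 0b110, 0b011)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%02x\n", b) // 33
}
```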
Like the ModR/M byte, the SIB also restricts two register classes. The SP-class isn’t allowed in the index, as its identifier is used for specifying no index. Meanwhile, the BP-class can’t be used as a plain base, as its identifier implies one of the following, based on the mod:
- mod = 00, the scaled index plus a 32-bit displacement;
- mod = 01, the scaled index plus an 8-bit displacement and [EBP];
- mod = 10, the scaled index plus a 32-bit displacement and [EBP].
The Displacement and The Immediate
Both the displacement and immediate values are encoded in little-endian, as described above, and hold no complex structure. They are quite literally just numbers. They can be a single byte, two bytes, or four bytes, while eight-byte values only show up in a few forms of mov.
There’s nothing more that can really be said here.
Choosing a Register From a Register Class
If you still haven’t figured out how a register is chosen, or identified, from an instruction, let’s take a look at an example. Take the DI-class, with the identifier 111. This class includes the registers RDI, EDI, DI, and BH.
add rdi, 1 ; 48 83 c7 01
add edi, 1 ; 83 c7 01
add di, 1 ; 66 83 c7 01
add bh, 1 ; 80 c7 01
As you can see, for the 64-, 32-, and 16-bit registers we use the same opcode 0x83, just differing in the prefixes, while the 8-bit variant is given a different opcode.
Fun fact: the opcode 0x83 isn’t the standard opcode for add with an immediate, which is 0x81. Opcodes like 0x83 are what I call compressed, or optimized, opcodes. In this case, 0x83 is used if the immediate operand fits in a signed 8-bit integer, which the processor then sign-extends to the operand size. For clarity,
0: 81 c7 01 00 00 00 add edi,0x1
6: 83 c7 01 add edi,0x1
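The choice between the two opcodes can be sketched as a simple range check. The function name addR32Imm is made up here, and the sketch only covers 32-bit registers from the eight classes above.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// addR32Imm encodes `add r32, imm`, picking the short 0x83 form when the
// immediate fits in a signed byte and the full 0x81 form otherwise.
// rm is the 3-bit register class. (Hypothetical helper, only for illustration.)
func addR32Imm(rm byte, imm int32) []byte {
	modrm := byte(0xc0) | (rm & 7) // mod = 11, reg = 000 (the /0 of add)
	if imm >= -128 && imm <= 127 {
		return []byte{0x83, modrm, byte(imm)}
	}
	out := []byte{0x81, modrm, 0, 0, 0, 0}
	binary.LittleEndian.PutUint32(out[2:], uint32(imm))
	return out
}

func main() {
	fmt.Printf("% x\n", addR32Imm(0b111, 1))      // add edi, 1      -> 83 c7 01
	fmt.Printf("% x\n", addR32Imm(0b111, 0xa1a2)) // add edi, 0xa1a2 -> 81 c7 a2 a1 00 00
}
```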
COFFing With ELFs and MACH-Os
Terrible title I know, but now we will be discussing the output of assemblers: the object files. For anyone unaware of them, COFF, ELF, and MACH-O are object file formats.
What’s an object file? Unlike what you might have been taught by the big OOP, these aren’t instantiated classes. These are files that wrap around the instructions, which are generated by assemblers (or by hand if you’re brave enough). All of the above three are examples of currently used formats for object files.
- ELF is used for Linux and Unix-like systems;
- COFF/PE is used by Windows (PE is an extension on top of COFF);
- MACH-O is used by Apple.
As I have yet to go in depth with either COFF/PE or MACH-O, I will only briefly discuss ELF, or the Executable and Linkable Format.
ELF files are made up of four parts: the header, program header table, section header table, and the section data. The program header table is only needed when the file is an executable or a shared library, which means that assemblers, whose output is relocatable object files, don’t care about it.
In the section data, there exist several special sections. Some important ones are .shstrtab, .symtab, and .rel.text (or .rela.text, which is what x86-64 normally uses), which are the section header string table, the symbol table, and the relocation table. The string table contains the null-terminated section names, the symbol table contains the name of the assembled file and the labels with additional data, and the relocation table records the places in the code that refer to symbols and which symbol each of them refers to.
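To give a feel for how simple the string table is, here is a sketch that builds one. The function name buildStrtab is hypothetical and this isn’t code from relf. The only ELF rule it relies on is that the table starts with a null byte and that names are referenced by their byte offset.

```go
package main

import "fmt"

// buildStrtab lays out names as null-terminated strings and returns the
// table together with each name's offset into it.
// (Hypothetical helper, only for illustration.)
func buildStrtab(names []string) ([]byte, map[string]uint32) {
	tab := []byte{0} // offset 0 is the empty name
	offs := make(map[string]uint32, len(names))
	for _, n := range names {
		offs[n] = uint32(len(tab))
		tab = append(tab, n...)
		tab = append(tab, 0)
	}
	return tab, offs
}

func main() {
	tab, offs := buildStrtab([]string{".text", ".shstrtab"})
	fmt.Printf("%q\n", tab)                           // "\x00.text\x00.shstrtab\x00"
	fmt.Println(offs[".text"], offs[".shstrtab"])     // 1 7
}
```

A section header would then store the offset (1 for .text here) instead of the name itself.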
As this is already getting too long, I’m not going to explain this anymore. For the structure of each part of the ELF file, just check out the spec file, and make sure it’s the correct version (most spec files are 32-bit only).
Finalizing
This is a lot of information, but also it’s almost nothing. There’s SO MUCH MORE that is necessary to create an assembler. But, amazingly, in rei the mov and add instructions are written like so:
switch mnem {
case Add:
	return newOpFmt().
		withClass(0).
		addRI([]byte{0x80}, immFmtNative32).
		withARegCompressed([]byte{0x04}, immFmtNative32).
		withByteCompressed([]byte{0x83}).
		addRR([]byte{0x00}, true)
case Mov:
	return newOpFmt().
		withClass(opFmtClassCompactReg).
		addRI([]byte{0xB0}, immFmtNative).
		addRR([]byte{0x88}, true).
		addRA([]byte{0x8A})
}
As you can see, I have generalized a lot of the assembling. By doing this, I can add many more mnemonics/instructions to rei without much of a hassle – unless I realize that what I generalized can’t actually be generalized. Like withClass(byte), which is used for – what I call – opcode classes. At first, it was used for opcodes of form byte /n, where n = 0..7. Examples of such opcodes are the arithmetic/logical instructions.
add rax, 1 ; 48 83 c0 01
and rax, 1 ; 48 83 e0 01
or rax, 1 ; 48 83 c8 01
sub rax, 1 ; 48 83 e8 01
The n is used in the ModR/M byte’s reg field.
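To show how the /n slots into the ModR/M byte, here is a sketch for the 48 83 /n family with rax as the destination. The table groupExt and the function arithRaxImm8 are names invented for this example, not rei’s withClass machinery. The /n values themselves come from the Intel manual.

```go
package main

import "fmt"

// groupExt maps each mnemonic of the 0x80/0x81/0x83 group to its /n value.
var groupExt = map[string]byte{
	"add": 0, "or": 1, "adc": 2, "sbb": 3,
	"and": 4, "sub": 5, "xor": 6, "cmp": 7,
}

// arithRaxImm8 encodes `<mnem> rax, imm8` as REX.W + 83 /n ib.
// (Hypothetical helper, only for illustration.)
func arithRaxImm8(mnem string, imm int8) ([]byte, error) {
	n, ok := groupExt[mnem]
	if !ok {
		return nil, fmt.Errorf("unknown mnemonic %q", mnem)
	}
	modrm := byte(0xc0) | n<<3 // mod = 11, reg = n, r/m = 000 (rax)
	return []byte{0x48, 0x83, modrm, byte(imm)}, nil
}

func main() {
	for _, m := range []string{"add", "and", "or", "sub"} {
		code, err := arithRaxImm8(m, 1)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-3s rax, 1 ; % x\n", m, code)
	}
}
```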
But to end all this, learning this much about the lower-level stuff of computers has been a fun time. It’s not my first time creating an assembler, but it’s the first time it has actually gone somewhere. My second attempt at creating an assembler can also be seen in 1as, which was supposed to be a RISC-V assembler. I gave up on it as I had transitioned back to Windows for a bit, and as it was written in Hare, working on it on Windows was quite literally impossible. Still a learning experience, as the ELF generation was quite a headache. But now, I think I got it under control.
As to my last blog post, I’m working on that frontend, and at some point I will write about it. I chose to write it in Next.js to see what happened to React, as I haven’t used it for A LONG TIME. It’s going well, styling is pissing me off, but tailwind is quite a fun library to work with.
Till that time, I have another thing I want to do, and that’s exactly what my next post will be about. See ya then. And happy new year, btw.
When death comes knocking at your door
And nothing’s left worth fighting for
Say a little prayer and pull the rug
Show me what you’re made of

Kingdom come is falling down
No one’s left to save you now
Say a little prayer and pull the rug
Show me what you’re made of

What You’re Made Of – Arrested Youth