X86es and Assemblers


For those unaware, I have been working on an x86 assembler throughout the last year. For now, it’s nothing special. It doesn’t assemble most of the x86 instructions. It doesn’t support addressing. Neither can it assemble to an object file. And not too long ago, it didn’t even have a CLI. But, now it does. Even though it only outputs machine code, and nothing else.

If you want to check it out, you can. I named it rei. It’s written in the most amazing language, known as Go, and has four parts to it:

  • cmd/rei – The CLI for the assembler.
  • rasm – The assembler’s lexer, parser, code generator. Basically rei’s syntax.
  • x86 – The x86 mnemonic/instruction to machine code translation layer.
  • relf – The ELF object file writer library.

relf at the time of writing hasn’t even been seriously worked on. It’s not in the git repo, and locally only has like 10 lines of code. But the rest of it is – in my opinion – well-written. I didn’t know much about Go, but little by little I got used to it.

But this isn’t what this blog post is about. I won’t talk about how my assembler works. We will discuss the mental overload of information about assemblers and x86 that I had to learn to understand how to even write it.

Let’s start from low abstraction and move upwards…

CISC and The x86 Architecture

If you didn’t know, x86 is a CISC, or a complex instruction set computer, architecture. What does this mean? It means that the translation from assembly to machine code is as annoying as it can get. Which is fun when that’s EXACTLY what you’re trying to do! Obviously that’s an oversimplification of what CISC is. To explain it more clearly, you need to understand another type of instruction set architecture, or ISA: RISC.

RISC, or reduced instruction set computer, is a type of instruction set where each instruction is typically encoded in a fixed number of bits. For example, the base RISC-V and 64-bit ARM instruction sets encode every instruction in 32 bits. These types of ISAs also have a minimal number of instructions, which makes them easier to build assemblers for.

Now that you understand what RISC is, to understand CISC, take everything written above and turn it upside-down. Instructions have variable sizes, which can also change depending on the arguments (known as operands) passed to them. In x86 the size of an instruction can go from a single byte to 15 BYTES. There are also almost 1000 unique mnemonics in x86 and MORE THAN 3000 INSTRUCTION VARIANTS.

To put into perspective how much that is: the Intel manual is broken down into three volumes, and the volume which explains the instruction encoding, the second one, is only 2500 PAGES LONG. But, for some reason, I still tried to understand it, and now you will too!

In assembly, we have so-called registers to store numerical values in. In x86, there are quite a few registers. Here are some of them:

  • 64-bit: rax, rbx, rcx, rdx, r8 to r15, etc.
  • 32-bit: eax, ebx, ecx, edx, eip, ebp, esi, etc.
  • 16-bit: ax, bx, cx, dx, bp, sp, etc.
  • 8-bit: al, ah, bl, bh, cl, ch, dl, dh, etc.

When translating to machine code, we also care about the endianness of the system. Endianness describes how the bytes of multibyte values are ordered. There are two types of endianness: big and little. For x86, which is little-endian, the bytes are written in reverse order, or in less layman speak, least-significant byte first. So, if you had the number 0xa1a2, and say that the value is 32-bits, it would be encoded as 0xa2 0xa1 0x00 0x00.
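To make the byte order concrete, here is a minimal Go sketch that lays out a 32-bit value the way x86 expects it. The helper name leImm32 is made up for illustration; it is not taken from rei.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// leImm32 returns the four bytes of v in little-endian order,
// i.e. least-significant byte first.
// Hypothetical helper for illustration only.
func leImm32(v uint32) []byte {
	out := make([]byte, 4)
	binary.LittleEndian.PutUint32(out, v)
	return out
}

func main() {
	fmt.Printf("% x\n", leImm32(0xa1a2)) // a2 a1 00 00
}
```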

Machine code for each and every instruction is structured like so:

  • Prefixes;
  • Opcode;
  • ModR/M Byte;
  • SIB Byte;
  • Displacement; and
  • Immediate.

From which, only the opcode is guaranteed to exist.

Just to give you a little sneak peek to what you’re in for, look at the following assembly code

add rax, 0xa1a2 ; rax += 0xa1a2
add rax, 0x7f   ; rax += 0x7f

This assembles to:

0:  48 05 a2 a1 00 00       add    rax,0xa1a2
6:  48 83 c0 7f             add    rax,0x7f

The 0x48 value is a prefix and 0x05 and 0x83 are the identifiers for the instructions called opcodes. As you can see, just changing the constant’s value that gets added to the rax register can have a huge difference in the outputted machine code. But, before we discuss why that is, let’s understand the only thing unchanged: the prefix.

The Prefixes

There are quite a few prefixes, but of them only two are of interest. These two are: the REX byte and 0x66. As a heads-up, I haven’t yet seen or needed to use the other prefixes, so they may still become “of interest”.

The REX byte is of form 0100WRXB, where:

  • W – if 1, the operands are 64-bit; if 0, the operand size is determined by the code segment’s default.
  • R – is an extension to the ModR/M’s reg field (will be discussed later).
  • X – is an extension to the SIB index field (will also be discussed).
  • B – is an extension to ModR/M’s r/m, SIB’s base, or opcode’s reg fields.

For now, all you need to know is that if the instruction’s destination register, or the register where the result of the instruction goes, is 64-bit, the W bit is set. From the assembly before, this is where we get the 0x48. If we had chosen one of the registers r8 to r15 as the destination, we would also have to set the B bit. But if we had

add rax, r13 ; rax += r13

we would set the R bit instead, since here r13 is the source register and gets encoded in ModR/M’s reg field. Pretty cool, isn’t it?
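The 0100WRXB layout can be sketched as a tiny Go helper. This is only an illustration under my own naming (rex is a hypothetical function, not rei’s API); it just packs the four flags into the low nibble.

```go
package main

import "fmt"

// rex builds a REX prefix of the form 0100WRXB from its four flag bits.
// Hypothetical helper for illustration only.
func rex(w, r, x, b bool) byte {
	p := byte(0x40) // fixed upper nibble 0100
	if w {
		p |= 0x08
	}
	if r {
		p |= 0x04
	}
	if x {
		p |= 0x02
	}
	if b {
		p |= 0x01
	}
	return p
}

func main() {
	fmt.Printf("%#x\n", rex(true, false, false, false)) // 0x48: 64-bit operand, legacy registers
	fmt.Printf("%#x\n", rex(true, false, false, true))  // 0x49: destination is one of r8 to r15
	fmt.Printf("%#x\n", rex(true, true, false, false))  // 0x4c: source in ModR/M reg is one of r8 to r15
}
```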

The 0x66 prefix byte is simple to explain. It’s used to indicate that the instruction has 16-bit operands. As an example take the following assembly code:

mov eax, 0xbeef ; eax = 0xbeef
mov ax, 0xbeef  ; ax = 0xbeef

Here eax is a 32-bit register, and ax is a 16-bit register. This is the machine code:

0:  b8 ef be 00 00          mov    eax,0xbeef
5:  66 b8 ef be             mov    ax,0xbeef

As you can see the only difference between the two instructions is the prefix and, of course, the constant’s size.

The Opcodes

Obviously, there are thousands of opcodes, but behind all these opcodes, there are two types. These are static opcodes, which are always the same, and register-compact/dynamic opcodes, which encode the register into the opcode. To give you an example, take

; static
mov eax, ebx ; 89 d8
mov ebx, ecx ; 89 cb

; dynamic
mov eax, 1   ; b8 01 00 00 00
mov ecx, 1   ; b9 01 00 00 00
mov edx, 1   ; ba 01 00 00 00
mov ebx, 1   ; bb 01 00 00 00

As you can see, the opcode for mov reg32, reg32 (0x89) stays static, while the opcode for mov reg32, imm32 (0xb8) changes depending on the register. In the Intel manual these opcodes are denoted as 0x89 /r and 0xb8+rd, where /r implies a ModR/M byte, and +rd means that the lower 3 bits of the opcode are used to encode the register (if you remember the REX, its B bit can extend this for the r8-r15 registers).
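As a rough sketch of the dynamic case, the following Go function folds a 3-bit register identifier into the 0xb8 opcode and appends the immediate in little-endian. The name movImm32 and its parameters are assumptions made for this example, not rei’s code, and it only covers the eight legacy 32-bit registers.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// movImm32 encodes "mov r32, imm32" (opcode b8 plus the register identifier).
// regID is the 3-bit identifier: eax=0, ecx=1, edx=2, ebx=3, and so on.
// Hypothetical helper for illustration only.
func movImm32(regID byte, imm uint32) []byte {
	out := []byte{0xb8 | (regID & 7), 0, 0, 0, 0}
	binary.LittleEndian.PutUint32(out[1:], imm)
	return out
}

func main() {
	fmt.Printf("% x\n", movImm32(0, 1)) // b8 01 00 00 00 (mov eax, 1)
	fmt.Printf("% x\n", movImm32(3, 1)) // bb 01 00 00 00 (mov ebx, 1)
}
```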

Now you may be asking how these 3 bits (or 4 bits) are distributed between registers. Obviously, these bits alone can’t uniquely represent all the registers, as just the 64-bit registers already use them all up. This is why x86 instead divides all the registers into groups. I like to call these “register classes”, and they are:

  • A-class (identifier 0b000): RAX, EAX, AX, and AL;
  • C-class (identifier 0b001): RCX, ECX, CX, and CL;
  • D-class (identifier 0b010): RDX, EDX, DX, and DL;
  • B-class (identifier 0b011): RBX, EBX, BX, and BL;
  • SP-class (identifier 0b100): RSP, ESP, SP, and AH;
  • BP-class (identifier 0b101): RBP, EBP, BP, and CH;
  • SI-class (identifier 0b110): RSI, ESI, SI, and DH;
  • DI-class (identifier 0b111): RDI, EDI, DI, and BH.

Now for clarity, these classes also include MM and XMM registers, but as we aren’t going to be discussing them, I decided to omit them. These classes are used any time a register is required to be encoded, such as in the ModR/M byte!
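One way to picture the register classes is as a table with one row per identifier and one column per width. The Go sketch below does exactly that for the registers listed above; the names classes, widths, and lookup are mine and purely illustrative.

```go
package main

import "fmt"

// classes[id] lists the 64-, 32-, 16-, and 8-bit register of each class,
// in the same order as the list above.
var classes = [8][4]string{
	{"rax", "eax", "ax", "al"},
	{"rcx", "ecx", "cx", "cl"},
	{"rdx", "edx", "dx", "dl"},
	{"rbx", "ebx", "bx", "bl"},
	{"rsp", "esp", "sp", "ah"},
	{"rbp", "ebp", "bp", "ch"},
	{"rsi", "esi", "si", "dh"},
	{"rdi", "edi", "di", "bh"},
}

// widths gives the operand width in bits for each column of classes.
var widths = [4]int{64, 32, 16, 8}

// lookup returns the 3-bit class identifier and the width of a register name.
// Hypothetical helper for illustration only.
func lookup(name string) (id byte, bits int, ok bool) {
	for i, row := range classes {
		for j, n := range row {
			if n == name {
				return byte(i), widths[j], true
			}
		}
	}
	return 0, 0, false
}

func main() {
	id, bits, _ := lookup("edi")
	fmt.Printf("edi: id=%03b, %d-bit\n", id, bits) // edi: id=111, 32-bit
}
```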

The ModR/M

The ModR/M byte is probably the most interesting part of an instruction. This byte is divided into three parts: mod (2 bits), reg (3 bits), r/m (3 bits). It is used for addressing or encoding registers, and the mod field is used to distinguish which we are looking for. If mod is

  • 0b00, implies addressing from a register;
  • 0b01, implies addressing from a register plus an 8-bit displacement;
  • 0b10, same as above, but with a 32-bit displacement;
  • 0b11, implies direct register access.

For those unfamiliar with addressing, it means taking the integer in a specified register, written [reg] in Intel syntax, and grabbing the value stored in memory at that address. It works a lot like pointers do in higher-level languages, so you can treat it as such. As an example, let’s look at the following

mov rbx, rcx   ; 48 89 cb
mov [rbx], rcx ; 48 89 0b

As it is easy to see, the last bytes differ. These are the ModR/M bytes. Taking a look at their binary representations: 0xcb = 0b11001011 and 0x0b = 0b00001011. In the first one, we have mod = 11, meaning direct register access, reg = 001, meaning a C-class register, and r/m = 011, meaning a B-class register. The same can be said about the second one, only difference being mod = 00, meaning it’s addressing from a register. As you might have noticed, the destination register gets encoded in the r/m field, while the source register gets encoded in the reg field.
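Packing and unpacking the three fields is just shifting and masking. Here is a small Go sketch of both directions, checked against the two bytes above; modrm and splitModRM are hypothetical names for this example.

```go
package main

import "fmt"

// modrm packs the mod (2 bits), reg (3 bits), and r/m (3 bits) fields.
// Hypothetical helper for illustration only.
func modrm(mod, reg, rm byte) byte {
	return (mod&3)<<6 | (reg&7)<<3 | rm&7
}

// splitModRM is the inverse: it pulls the three fields out of a ModR/M byte.
func splitModRM(b byte) (mod, reg, rm byte) {
	return b >> 6, (b >> 3) & 7, b & 7
}

func main() {
	fmt.Printf("%02x\n", modrm(3, 1, 3)) // cb: mov rbx, rcx
	fmt.Printf("%02x\n", modrm(0, 1, 3)) // 0b: mov [rbx], rcx

	mod, reg, rm := splitModRM(0x0b)
	fmt.Printf("mod=%02b reg=%03b r/m=%03b\n", mod, reg, rm) // mod=00 reg=001 r/m=011
}
```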

There are two register classes that are not allowed in the ModR/M byte’s r/m field when addressing. These are the BP- (when mod = 00) and SP-classes. The identifier for the BP-class, when mod = 00, is used to encode a sole 32-bit displacement. And the identifier for the SP-class, for any mod other than 11, is used to indicate that a SIB byte follows.

The SIB

Let’s say we want to do something like [rbx+rsi], where we use the rbx register as the base pointer for some sequence of data, and rsi as the index we want to access, something like rbx[rsi] in C. For this we use the SIB byte.

The SIB byte, or the scale-index-base byte, is a byte, which, as the name implies, encodes a base register, an index register, and the scale for the index. It is defined as SSIIIBBB, where SS are the bits for the scale, III for index, and BBB for base. The index and base are encoded via the register classes, but the scale is encoded like so:

  • scale = 00, implies that the scale is 1;
  • scale = 01, implies 2;
  • scale = 10, 4; and
  • scale = 11, 8.

At first, these scales might seem weird, but they match the sizes in bytes (1, 2, 4, or 8) of the values that will most likely be moved into registers, so the sizes make sense.
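The SIB byte packs the same way the ModR/M byte does. The Go sketch below takes the scale as a factor and maps it to the two SS bits; the function name sib and its signature are assumptions for this example. For [rbx+rsi] the index is the SI-class (110) and the base is the B-class (011).

```go
package main

import "fmt"

// sib packs a scale factor (1, 2, 4, or 8), an index register identifier,
// and a base register identifier into a SIB byte of the form SSIIIBBB.
// Hypothetical helper for illustration only.
func sib(scale int, index, base byte) (byte, error) {
	var ss byte
	switch scale {
	case 1:
		ss = 0
	case 2:
		ss = 1
	case 4:
		ss = 2
	case 8:
		ss = 3
	default:
		return 0, fmt.Errorf("invalid scale %d: must be 1, 2, 4, or 8", scale)
	}
	return ss<<6 | (index&7)<<3 | base&7, nil
}

func main() {
	// mov [rbx+rsi], rcx: the ModR/M byte is 0x0c (mod=00, reg=rcx, r/m=100
	// meaning "a SIB byte follows"), and the SIB byte comes right after it.
	s, _ := sib(1, 6, 3)
	fmt.Printf("48 89 0c %02x\n", s) // 48 89 0c 33
}
```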

Like the ModR/M byte, the SIB byte also treats two register classes specially. The SP-class can’t be used as the index, as its identifier is used for specifying no index. And the BP-class identifier in the base field implies one of the following, based on the mod:

  • mod = 00, the scaled index plus a 32-bit displacement;
  • mod = 01, the scaled index plus an 8-bit displacement and [EBP];
  • mod = 10, the scaled index plus a 32-bit displacement and [EBP].

The Displacement and The Immediate

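The displacement can be a single byte or four bytes (two in 16-bit addressing), and the immediate can be a single byte, two bytes, or four bytes. Eight bytes only show up in a few special cases, such as mov r64, imm64. These bytes, as described above, are encoded in little-endian, and hold no complex structure. They are quite literally just numbers.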
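Which displacement size gets written ties back to the mod field of the ModR/M byte. As a sketch, assuming the base register is rbx and the source is rcx (so none of the BP/SP special cases apply), a hypothetical encoder for mov [rbx+disp], rcx could pick mod like this:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// movStoreRBX encodes "mov [rbx+disp], rcx" and picks the shortest form:
// mod=00 for no displacement, mod=01 for a signed 8-bit one, mod=10 otherwise.
// Hypothetical helper for illustration only.
func movStoreRBX(disp int32) []byte {
	const reg, rm = 1, 3          // rcx in reg, rbx in r/m
	out := []byte{0x48, 0x89}     // REX.W, then mov r/m64, r64
	switch {
	case disp == 0:
		out = append(out, 0<<6|reg<<3|rm)
	case disp >= -128 && disp <= 127:
		out = append(out, 1<<6|reg<<3|rm, byte(disp))
	default:
		var d [4]byte
		binary.LittleEndian.PutUint32(d[:], uint32(disp))
		out = append(out, 2<<6|reg<<3|rm)
		out = append(out, d[:]...)
	}
	return out
}

func main() {
	fmt.Printf("% x\n", movStoreRBX(0))      // 48 89 0b
	fmt.Printf("% x\n", movStoreRBX(0x10))   // 48 89 4b 10
	fmt.Printf("% x\n", movStoreRBX(0x1000)) // 48 89 8b 00 10 00 00
}
```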

There’s nothing more that can really be said here.

Choosing a Register From a Register Class

If you still haven’t figured out how a register is chosen, or identified, from an instruction, let’s take a look at an example. Take the DI-class, with the identifier 111. This class includes the registers RDI, EDI, DI, and BH.

add rdi, 1 ; 48 83 c7 01
add edi, 1 ;    83 c7 01
add di, 1  ; 66 83 c7 01
add bh, 1  ;    80 c7 01

As you can see, for the 64-, 32-, and 16-bit registers we use the same opcode 0x83, just differing in the prefixes, while the 8-bit variant is given a different opcode.
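Put as code, the operand width decides the prefix and the opcode, while the class identifier only ends up in the ModR/M byte. The following Go sketch reproduces the four encodings above; addImm8 is a made-up name, and it only handles the legacy registers.

```go
package main

import "fmt"

// addImm8 encodes "add reg, imm8" for a register given by its 3-bit class
// identifier and its width in bits.
// Hypothetical helper for illustration only.
func addImm8(id byte, bits int, imm int8) ([]byte, error) {
	modrm := 0xc0 | (id & 7) // mod=11, reg=/0 (add), r/m=register identifier
	switch bits {
	case 64:
		return []byte{0x48, 0x83, modrm, byte(imm)}, nil // REX.W prefix
	case 32:
		return []byte{0x83, modrm, byte(imm)}, nil // no prefix
	case 16:
		return []byte{0x66, 0x83, modrm, byte(imm)}, nil // 0x66 prefix
	case 8:
		return []byte{0x80, modrm, byte(imm)}, nil // separate 8-bit opcode
	}
	return nil, fmt.Errorf("unsupported operand width: %d bits", bits)
}

func main() {
	for _, bits := range []int{64, 32, 16, 8} {
		code, _ := addImm8(7, bits, 1) // DI-class
		fmt.Printf("%2d-bit: % x\n", bits, code)
	}
}
```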

Fun fact: the opcode 0x83 isn’t the standard opcode for add, which is 0x81. Opcodes like 0x83 are what I call compressed, or optimized, opcodes. In this case, 0x83 is used if the immediate operand fits in a signed 8-bit integer, which the processor then sign-extends to the operand size. For clarity,

0:  81 c7 01 00 00 00       add    edi,0x1
6:  83 c7 01                add    edi,0x1
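Choosing between the two forms boils down to a range check on the immediate. A minimal sketch, with addR32Imm being a hypothetical name and only 32-bit legacy registers covered:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// addR32Imm encodes "add r32, imm" and prefers the short 0x83 form whenever
// the immediate fits in a signed 8-bit integer.
// Hypothetical helper for illustration only.
func addR32Imm(id byte, imm int32) []byte {
	modrm := 0xc0 | (id & 7) // mod=11, reg=/0 (add), r/m=register identifier
	if imm >= -128 && imm <= 127 {
		return []byte{0x83, modrm, byte(imm)}
	}
	out := []byte{0x81, modrm, 0, 0, 0, 0}
	binary.LittleEndian.PutUint32(out[2:], uint32(imm))
	return out
}

func main() {
	fmt.Printf("% x\n", addR32Imm(7, 1))      // 83 c7 01
	fmt.Printf("% x\n", addR32Imm(7, 0x1234)) // 81 c7 34 12 00 00
}
```

Note that 128 already needs the long form, as a sign-extended 0x80 would mean -128.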

COFFing With ELFs and MACH-Os

Terrible title I know, but now we will be discussing the output of assemblers: the object files. For anyone unaware of them, COFF, ELF, and MACH-O are object file formats.

What’s an object file? Unlike what you might have been taught by the big OOP, these aren’t instantiated classes. These are files that wrap around the instructions, which are generated by assemblers (or by hand if you’re brave enough). All of the above three are examples of currently used formats for object files.

  • ELF is used for Linux and Unix-like systems;
  • COFF/PE is used by Windows (PE is an extension on top of COFF);
  • MACH-O is used by Apple.

As I have yet to go in depth with either COFF/PE or MACH-O, I will only discuss ELF, or the Executable and Linkable Format, and only briefly.

ELF files are made up of four parts: the header, the program header table, the section header table, and the section data. The program header table is only needed if the file is meant to be loaded and run (an executable or a shared object), which means that assemblers, which output relocatable object files, don’t care about it.
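To give a feel for the first of those parts, here is a sketch that fills in the 64-bit ELF header of a relocatable x86-64 object file using the types and constants from Go’s standard debug/elf package. This is not how relf does it (relf barely exists yet); relocatableHeader and its parameters are made up for the example.

```go
package main

import (
	"bytes"
	"debug/elf"
	"encoding/binary"
	"fmt"
)

// relocatableHeader returns the 64-byte ELF header of an x86-64 object file
// whose section header table starts at file offset shoff.
// Hypothetical helper for illustration only.
func relocatableHeader(shoff uint64, shnum, shstrndx uint16) []byte {
	var h elf.Header64
	copy(h.Ident[:], elf.ELFMAG) // "\x7fELF"
	h.Ident[elf.EI_CLASS] = byte(elf.ELFCLASS64)
	h.Ident[elf.EI_DATA] = byte(elf.ELFDATA2LSB) // little-endian
	h.Ident[elf.EI_VERSION] = byte(elf.EV_CURRENT)
	h.Type = uint16(elf.ET_REL) // relocatable: no program header table
	h.Machine = uint16(elf.EM_X86_64)
	h.Version = uint32(elf.EV_CURRENT)
	h.Shoff = shoff
	h.Ehsize = 64    // size of this header
	h.Shentsize = 64 // size of one 64-bit section header
	h.Shnum = shnum
	h.Shstrndx = shstrndx // index of the .shstrtab section

	var buf bytes.Buffer
	// Writing a fixed-size struct into a bytes.Buffer does not fail.
	_ = binary.Write(&buf, binary.LittleEndian, &h)
	return buf.Bytes()
}

func main() {
	hdr := relocatableHeader(64, 5, 4)
	fmt.Printf("%d bytes, ident: % x\n", len(hdr), hdr[:16])
}
```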

In the section data, there exist several special sections. Some important ones are .shstrtab, .symtab, and .rel.text (or .rela.text, which is what x86-64 typically uses), which are the section header string table, the symbol table, and the relocation table for .text. The string table contains the null-terminated names of the sections, the symbol table contains an entry for the assembled file and for each label along with additional data, and the relocation table records where in the code symbols are used, so the linker can patch those places.
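The string table is the simplest of the three. As a sketch, assuming nothing beyond the layout described above (null-terminated names, with a leading null byte so that offset 0 means “no name”), it can be built like this; buildStrtab is a hypothetical helper, not relf’s API.

```go
package main

import "fmt"

// buildStrtab lays out names as null-terminated strings and returns the
// table bytes together with the offset of each name inside the table.
// Section headers then refer to their name by that offset.
// Hypothetical helper for illustration only.
func buildStrtab(names []string) ([]byte, map[string]uint32) {
	tab := []byte{0} // offset 0 is the empty name
	offs := make(map[string]uint32, len(names))
	for _, n := range names {
		offs[n] = uint32(len(tab))
		tab = append(tab, n...)
		tab = append(tab, 0)
	}
	return tab, offs
}

func main() {
	tab, offs := buildStrtab([]string{".text", ".shstrtab", ".symtab"})
	fmt.Printf("%q\n", tab)
	fmt.Println(offs[".text"], offs[".shstrtab"], offs[".symtab"]) // 1 7 17
}
```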

As this is already getting too long, I’m not going to explain this anymore. For the structure of each part of the ELF file, just check out the spec file, and make sure it’s the correct version (most spec files are 32-bit only).

Finalizing

This is a lot of information, but also it’s almost nothing. There’s SO MUCH MORE that is necessary to create an assembler. But, amazingly, in rei the mov and add instructions are written like so

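switch mnem {
  case Add:
    return newOpFmt().
      withClass(0).                                     // /0 in the ModR/M reg field
      addRI([]byte{0x80}, immFmtNative32).              // 80 /0: add r/m8, imm8 (81 for wider operands)
      withARegCompressed([]byte{0x04}, immFmtNative32). // 04: add al, imm8 (05 for ax/eax/rax)
      withByteCompressed([]byte{0x83}).                 // 83 /0: add r/m, sign-extended imm8
      addRR([]byte{0x00}, true)                         // 00 /r: add r/m8, r8 (01 for wider operands)
  case Mov:
    return newOpFmt().
      withClass(opFmtClassCompactReg).
      addRI([]byte{0xB0}, immFmtNative). // B0+r: mov r8, imm8 (B8+r for wider operands)
      addRR([]byte{0x88}, true).         // 88 /r: mov r/m8, r8 (89 for wider operands)
      addRA([]byte{0x8A})                // 8A /r: mov r8, r/m8 (8B for wider operands)
}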

As you can see, I have generalized a lot of the assembling. By doing this, I can add many more mnemonics/instructions to rei without much of a hassle, unless I realize that what I generalized can’t actually be generalized. Take withClass(byte), for example, which is used for what I call opcode classes. At first, it was used for opcodes of form byte /n, where n = 0..7. Examples of such opcodes are the arithmetic/logical instructions.

add rax, 1 ; 48 83 c0 01
and rax, 1 ; 48 83 e0 01
or  rax, 1 ; 48 83 c8 01
sub rax, 1 ; 48 83 e8 01

The n is used in the ModR/M byte’s reg field.
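In other words, all four instructions share the opcode 0x83 and differ only in the reg field. A small Go sketch of that idea, limited to the four mnemonics shown and to rax as the destination (groupExt and arithRAXImm8 are made-up names, not rei’s):

```go
package main

import "fmt"

// groupExt maps a mnemonic to its /n value, the number placed in the reg
// field of the ModR/M byte.
var groupExt = map[string]byte{"add": 0, "or": 1, "and": 4, "sub": 5}

// arithRAXImm8 encodes "<mnem> rax, imm8" as REX.W, 83 /n, imm8.
// Hypothetical helper for illustration only.
func arithRAXImm8(mnem string, imm int8) ([]byte, bool) {
	n, ok := groupExt[mnem]
	if !ok {
		return nil, false
	}
	// mod=11, reg=n, r/m=000 (rax)
	return []byte{0x48, 0x83, 0xc0 | n<<3, byte(imm)}, true
}

func main() {
	for _, m := range []string{"add", "and", "or", "sub"} {
		code, _ := arithRAXImm8(m, 1)
		fmt.Printf("%-3s rax, 1 ; % x\n", m, code)
	}
}
```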

But to end all this, learning this much about the lower-level stuff of computers has been a fun time. It’s not my first time creating an assembler, but it’s the first time it has actually gone somewhere. My second attempt at creating an assembler can also be seen in 1as, which was supposed to be a RISC-V assembler. I gave up on it as I had transitioned back to Windows for a bit, and as it was written in Hare, working on it on Windows was quite literally impossible. Still a learning experience, as the ELF generation was quite a headache. But now, I think I got it under control.

As to my last blog post, I’m working on that frontend, and at some point I will write about it. I chose to write it in Next.js to see what happened to React, as I haven’t used it for A LONG TIME. It’s going well, styling is pissing me off, but tailwind is quite a fun library to work with.

Till that time, I have another thing I want to do, and that’s exactly what my next post will be about. See ya till then. And happy New Year, btw.


When death comes knocking at your door
And nothing’s left worth fighting for
Say a little prayer and pull the rug
Show me what you’re made of

Kingdom come is falling down
No one’s left to save you now
Say a little prayer and pull the rug
Show me what you’re made of

What You’re Made Of — Arrested Youth