Z80 emulation in Rust, Milestone 1

…wherein: I proudly present the first milestone in my current adventure of emulating the Z80 in Rust, look at the different 8-bit CPU emulation strategies, have a look at the current performance, and talk about what’s coming next.

But first…

…let me say a big thank-you to everyone who provided hints and feedback (and a pull-request!) to my last blog post via twitter and github! A quick run-down, apologies for just going through the feedback omitting the names:

I got a pull-request demonstrating how a ‘multi-dimensional’ match can be used to simplify the instruction decoder
for my problem of how to load Z80 code at compile time I was pointed to the include_bytes! macro (which I then used successfully), and cargo’s build-script feature which can be used for code-generation (see here: http://doc.crates.io/build-script.html)
the various smaller ‘borrow-checker is too conservative’ problems I stumbled over are going to be fixed as part of the MIR work (see here: https://blog.rust-lang.org/2016/04/19/MIR.html)

The WIP Rust Z80 emulator is here: https://github.com/floooh/rz80

…the C++ counterpart is in this repository: https://github.com/floooh/yakc.

Z80 registers in a nutshell

A little refresher since I’ll be talking a lot about the Z80 registers:

PC: the ‘program counter’ or ‘instruction pointer, this is a 16-bit register pointing to the next instruction to be executed
SP: the ‘stack pointer’, a 16-bit register pointing to the top of the stack, which grows downward
F: the flag register, this contains 6 documented and 2 undocumented status flag bits about the last arithmetic operation
A: the special 8-bit ‘accumulator’ register which serves as one operand and the result of arithmetic instructions
B, C, D, E, H, L: 6 general-purpose (well, almost) 8-bit registers, which can also be accessed as 3 16-bit registers BC, DE and HL
IX, IY: two 16-bit address registers used for the Z80’s powerful ‘indexed addressing modes’
AF’, BC’, DE’, HL’: the ‘shadow register set’, these are not directly accessible, but can be swapped very quickly with their counterparts AF, BC, DE and HL
WZ, WZ’: a mysterious, undocumented internal register and its shadow counterpart used for address computations; which has been painstakingly reverse-engineered by emulation enthusiasts during the 90’s and early 2000’s
I’m leaving out the R, I, IM, IFF1 and IFF2 registers and status bits, R is the memory refresh counter, and the others are related to interrupts

ZEXDOC & ZEXALL

Ok, back to the emulator. The big milestone number one is that the Rust Z80 emulator is now successfully running through the ZEXDOC and ZEXALL conformance tests, meaning that the bulk of the emulation works correctly both for the documented and the undocumented parts of the CPU.

ZEX is the ‘Z80 instruction set exerciser’, a hallowed ancient CP/M* program written by one Mr. Frank D. Cringle in 1994 to run through (nearly) all Z80 instructions and compare the resulting CPU state against a real Z80 CPU. ZEXDOC is masking out the 2 undocumented flag bits XF and YF, while ZEXALL is testing all flag bits.

*CP/M was the dominating business operating system for 8-bit machines in the 70’s, before there was MS-DOS.

On a real Z80, a single test run takes several hours. On a modern CPU running in an emulator, this is reduced to under a minute. Since the test runs for such a long time, and uses many different instructions it is also an excellent benchmark. More on that later in the performance section.

The interested reader will now ask ‘How do you run a CP/M program on a naked Z80 emulator without emulating a complete CP/M system?’. The answer is surprisingly simple. Here are the ingredients:

the CPU emulator, attached to 64 KByte of empty, flat memory
a raw dump of the ZEXALL or ZEXDOC Z80 binary code which must be loaded at address 0x0100 into the emulated RAM (this is where CP/M programs usually are located)
set the CPU’s PC register (program counter, or instruction pointer) to the program start at 0x0100, and the stack pointer to 0xF000 (I have no idea where the stack pointer is in CP/M, it just has to be in an area that’s not overwritten while the test runs)
start executing instructions… this will properly update the CPU state, but we can’t see anything from the outside…
at some point, the code will call into CP/M operating system functions by jumping to address 0x0005, with a number in register C that identifies the requested function. Thankfully ZEX only calls 2 different functions, one is 0x02 which outputs a single character to the terminal, and the other is 0x09, which outputs a string (terminated by the character ‘$’, not a null byte)

So to emulate the required CP/M calls I need to:

after each instruction, check whether the PC register is 0x0005
if the C register is 2, output the value in register E as ASCII code to the console
if the C register is 9, start reading bytes at the address in register pair DE until ‘$’ and output them as ASCII to the console

Here’s some Rust code, first the very useful include_bytes! macro to load the Z80 code dumps into a byte array at compile time:

    static ZEXDOC: &'static [u8] = include_bytes!("zexdoc.com");

That’s it! In C++ I have written a python code-generator script and a custom cmake build step to provide the same functionality, and I need this surprisingly often for my small demos. It’s really nice that Rust has this already built in.

Next a new CPU object is created, the code dump is copied to address 0x0100, and PC and SP registers are setup:

    let mut cpu = rz80::CPU::new(dummy_in, dummy_out);
    cpu.mem.write(0x0100, &ZEXDOC);
    cpu.reg.set_sp(0xF000);
    cpu.reg.set_pc(0x0100);

dummy_in and dummy_out can be ignored, these are callback functions for the Z80 IN and OUT instructions, which are not called by the ZEX tests.

Next is the loop that executes emulated instructions via cpu.step(), checks if a CP/M function is called (if PC is at 0x0005), or if the program is finished (if PC is at 0x0000). Otherwise just continue the loop:

    loop {
        cpu.step();
        match cpu.reg.pc() {
            0x0005 => { cpm_bdos(&mut cpu); },
            0x0000 => { break; },
            _ => { },
        }
    }

The call into the cpm_bdos() function takes a ‘mutable reference’ to the CPU object. It has to be mutable because the function must emulate a RET instruction to properly return from the CP/M function.

The cpm_bdos() function to implement the 2 possible CP/M text output calls is also very simple:

    // get the function call id from register C
    match cpu.reg.c() {
        2 => {
            // output character in register E
            print!("{}", cpu.reg.e() as u8 as char);
        },
        9 => {
            // output a string at register DE until '$'
            let mut addr = cpu.reg.de();
            loop {
                let c = cpu.mem.r8(addr) as u8 as char;
                addr = (addr + 1) & 0xFFFF;
                if c != '$' {
                    print!("{}", c);
                }
                else {
                    break;
                }
            }
        },
        _ => {
            panic!("Unknown CP/M call {}!", cpu.reg.c());
        }
    }

The double cast ‘as u8 as char’ is a bit wacky, but Rust doesn’t seem to allow casting directly from an i64 to a char.

Also note the ‘manual 16-bit overflow fix’ in addr = (addr + 1) & 0xFFFF. By default, integer overflow in Rust is a runtime error, and this is one possible workaround (more on that later, this is basically also the reason why registers are accessed via setter/getter methods, and not directly exposed as data members).

Finally I need to emulate a RET instruction in order to return to the ZEX program body, this is the reason why the cpu argument had to be passed as a mutable reference:

    let sp = cpu.reg.sp();              // get current stack pointer
    cpu.reg.set_pc(cpu.mem.r16(sp));    // set PC to return address from stack
    cpu.reg.set_sp(sp + 2);             // bump stack pointer

When I implemented the C++ Z80 emulator end of last year, getting the ZEX tests right was really hard work, before actually attempting to even run ZEXDOC (not to mention ZEXALL) I wrote 2.5kloc worth of tests where I’m running little Z80 code snippets for each instruction group and check the resulting register values and status flags. Writing the tests involved writing Z80 assembly code snippets, translate them using my little KC85 SDK which I created roughly 2 years ago (here: https://github.com/floooh/kc85sdk), and finally load the code into the MAME/MESS emulator, stepping through MAME’s debugger and copying the register and status flags over into the C++ unit tests.

One of the simpler of those roughly 100 ‘opcode test’ looks like this (when converted to Rust):

    #[test]
    fn test_ld_ihl_n() {
        let mut cpu = rz80::CPU::new(in_fn, out_fn);
        let prog = [
            0x21, 0x00, 0x20,   // LD HL,0x2000
            0x36, 0x33,         // LD (HL),0x33
            0x21, 0x00, 0x10,   // LD HL,0x1000
            0x36, 0x65,         // LD (HL),0x65
        ];
        cpu.mem.write(0x0000, &prog);

        assert!(10==cpu.step()); assert!(0x2000 == cpu.reg.hl());    
        assert!(10==cpu.step()); assert!(0x33 == cpu.mem.r8(0x2000));
        assert!(10==cpu.step()); assert!(0x1000 == cpu.reg.hl());    
        assert!(10==cpu.step()); assert!(0x65 == cpu.mem.r8(0x1000));
    }

With those opcode tests in place, ZEXDOC was (to my big surprise) running without errors at the first attempt, and ZEXALL only had 2 minor bugs in the CPD and CPI instructions. This is actually a nice example where writing unit tests in parallel to the implementation really makes sense, and saves a lot of time afterwards (e.g. when playing around with optimizations, the tests catch problems early on).

Btw: the ‘manual’ opcode tests are invoked as usual:

> cargo test
...

Invoking the ZEX tests is a little bit more involved:

> cargo test --release -- --ignored --nocapture
...

Giant-Switch-Case vs the Algorithmic Decoder

This is about the two basic strategies I have tried out for writing Z80 emulators, first in C++, then in Rust.

As the name says, ‘Giant Switch Case’ is using a huge switch-case statement, where each case-branch implements one instruction (on top of that, the whole thing is actually nested for the multibyte-instructions). This was my first attempt for the C++ emulator, but I wasn’t as desperate to write this all by hand, instead I wrote a (huge) python script which spits out a (slightly huger) C++ source file with the giant, nested switch-case in it.

I basically took the official Z80 User Manual (here: http://www.z80.info/zip/z80cpui_um.pdf) went through it from start to end and implemented the instructions as described there. The result was a python script that was nearly as big as the generated C++ code, which was kinda dumb and labour-intensive, but worked. And it was damn fast out-of-the-box, up to 1.2 emulated-Z80-GHz on my 2.8 GHz Intel i5 MBP for the ZEXDOC test (so ‘real’ code, not just NOPs, and running through a proper memory subsystem with a ‘page-table’, which is pretty bad for performance as I found out later).

I was happy with the performance, but not very happy with the code. So after a while (this was after I had already finished the KC85 emulator) I tried out a more elegant ‘proper’ instruction decoder.

The idea is to look at an instruction byte not as a simple value, but as a bunch of bit-groups, where the different bit groups describe what the instruction does.

A Z80 instruction is built from 3 bit groups, the topmost two bits split the instruction space into 4 broad instruction groups, the other 6 bits form two 3-bit groups which have a different meaning based on the instruction group:

    +---+---+ +---+---+---+ +---+---+---+
    | x | x | | y | y | y | | z | z | z |
    +---+---+ +---+---+---+ +---+---+---+
      7   6     5   4   3     2   1   0

The instruction group with the upper bits (x) set to 01 contains all the 8-bit load/store instructions. For those, the 3 (y) bits define the destination register, and the lower (z) bits define the source.

The register indices are as follows:

    B   C   D   E   H   L  (HL)  A
    0   1   2   3   4   5   6    7

As you can see, position 6 is special, this is where the F register would actually be located, but the status flags register cannot be written or read directly, so this slot is assigned to (HL), which is not a register, but the content of the memory location pointed to by HL. So whenever (HL) is involved, an 8-bit memory load or store takes places instead of accessing a register.

Let’s try this out by encoding the instruction LD A,C (load register A with content of register C):

x is binary 01, since this is the 8-bit load group
y is destination A, which is at slot 7 (binary 111)
z is source C, at slot 1 (binary 001)

This results in binary ‘01 111 001’ for the whole instruction, which is

    0   1   1   1   1   0   0   1

       40 +20 +10  +8          +1 = 79 hex

Reverse-looking up 79 in a translation table (e.g. here: http://www.z80.info/z80oplist.txt) yields LD A,C!

There is one interesting caveat, which demonstrates how the Z80 instruction set has been riddled with exceptions (painfully so when writing an emulator):

The instruction LD (HL),(HL) would have the bit pattern ‘01 110 110’, which is 76 hex. Now this instructions doesn’t make any sense, and has been repurposed to the HALT instruction, which has nothing at all to do with 8-bit loads.

Another instruction group are the 8-bit arithmetic instructions (or ALU instructions, for Arithemtic-Logic-Unit), these are in the group x=2 (binary 10). For this group, the (y) bits define the type of instruction, and the (z) bits define operand register, or (HL) if z==6. The other operand is always the accumulator register A, and this is also where the result ends up.

The 3 (y) bits are assigned to the ALU operation as follows:

 => ADD
 => ADC
 => SUB
 => SBC
 => AND
 => XOR
 => OR
 => CP

With this information we can figure out the opcode of the SUB A,C instruction as before with LD A,C:

x is binary 10, since this is the ALU instruction group
y is the ALU code for SUB, at slot 2, binary 010
z is the register C, which is at slot 1, binary 001

This yields the binary value ‘10 010 001’ = 91 hex, and reverse lookup of 91 in the Z80 instruction table http://www.z80.info/z80oplist.txt yields SUB A,C. Great success!

Of course I didn’t figure this stuff out myself :D Decoding Z80 instruction is described in detail here: http://www.z80.info/decoding.htm

The whole point of such a ‘smart’ algorithmic decoder is to cover large areas of the instruction set with very few lines of code. Unfortunately with all the exceptions and ‘hole-plugging’ in the Z80 instruction set it’s still a lot of code (about 1.5kloc of Rust to be exact).

Now the big question is of course: which of the two implementation strategies is better?

When implemented in C++, the giant-switch-case is about 25..30% faster than the algorithmic decoder while only being 6 KByte bigger (measured as compressed asm.js code size, which is comparable to compressed native x86 code). My hope was that the algorithmic decoder would have an advantage by being much smaller and thus have better caching effect, but that didn’t happen.

As is often the case, brute force clearly wins over elegance :)

For the C++ emulator, I ended up with a nice compromise. I rewrote the python code generator script as an ‘algorithmic decoder’, which writes a big, dumb but fast switch-case source file, so basically best of both worlds!

For Rust, the jury is still out. I have only implemented the algorithmic decoder so far, without any code generation (so the version which was 25..30% slower in C++). Since I’m currently more interested in learning the language this felt like the right thing to do, especially since there’s more than enough performance headroom for a full emulator, 30% slower is still several hundred times faster than an actual Z80!

…which brings me to…

Performance!

Ok, so the original C++ emulator was running the ZEXDOC test at an equivalent speed of an Z80 running somewhere between 750 MHz and 1200 MHz, depending on the stage of development and what implementation strategy was used (on my 2.8 GHz i5 MBP).

Quite a lot of tweaking went into the 1200 MHz number, but it was still ‘real-world’ code, with a real page-table memory system underneath which is required for bank-switching in many 8-bit computer systems.

The Rust emulator is currently running at about 850 MHz on the same system, this is with the ‘slow’ arithmetic instruction decoder, but with a ‘fast’ simplified memory system which doesn’t have an indirection through a page-table (so this is not yet useful for a real-world 8-bit emulator).

When I removed the page-table indirection in the C++ emulator, I could boost the speed to slightly over emulated 1.5GHz, but this would then really be a meaningless ‘synthetic benchmark’.

I also learned that both the C++ and the Rust emulator are extremely sensitve to small manual tweaks:

Rust doesn’t do cross-module inlining, just forcing most methods to be inlined more than doubled the performance from around 365 MHz to 850 MHz! I saw similar, but not quite as dramatic effects in the C++ code when playing around with inlining.
I haven’t tried compiling Rust with LTO (link-time optimization), this should hopefully do cross-module inlining on its own without having to annotate every single function.
Just something simple as a single pointer-indirection in the memory system affected performance in the C++ emulator by 25%, in Rust I’m currently not doing this indirection, but I will have to once I start implementing a real home computer emulator with memory-bank-switching.
When playing around with computing the status flags versus storing them into lookup tables I was seeing diminishing returns because of (I think) cache effects in the C++ emulator. It was generally not worth it to precompute flags which required a 256x256 (64 KByte) lookup table, but I have a single 256-byte lookup table which contains parity-flag state (this defines whether the number of 1-bits in a result are even or odd, so this would require bit-counting). In Rust I’m currently not using flag lookup tables at all, but instead use the handy count_ones() function, however I must confess that I haven’t checked yet what assembler code is generated from this, so this is definitely something I need to look into.

Another important difference between the C++ and Rust emulator are the data types used for register values and how overflow is handled:

In C++, the 8-bit registers are uint8_t, grouped as pairs into a union with the associated 16-bit register as uint16_t. This means I can access the 8- and 16-bit registers directly and I don’t need to care about 8- and 16-bit overflow wrapping, this will simply happen automatically. For some 8-bit ALU instructions (those which set the carry flag) I need an imaginary bit #9 of the result, so I need to do the math operation in a wider data type (currently 32-bit integer). This expansion could actually be a hidden performance trap, I haven’t played around with this yet.

On Rust, I have decided to use the full i64 type for all registers (i32 would do as well, and it doesn’t make much of a performance difference, or rather now that I tried it: no performance difference at all). The reasons for this are:

Rust doesn’t automatically cast between integer types, which led to a lot of ‘noisy’ manual type casts
there’s no union in Rust, so the trick to access an 8-bit register pair directly doesn’t work
integer overflow is a runtime error, there are wrapping data types and operations, but combined with the other 2 points it made more sense to go with a wider data type for 8- and 16-bit registers

This means:

I need to build 16-bit register values manually involving bit-wise operations (shift and &)
I need to manually take care of overflow wrapping by anding values against 0xFF or 0xFFFF.

These are the reasons why I’m hiding register accesses behind setters and getters, since every time a register is set it is anded with 0xFF or 0xFFFF, and setting or getting 16-bit registers involves splitting or merging them to and from 8-bit register values. All this hoopla is probably the main reason why the Rust implementation might be slower than the C++ implementation, on the other hand the Rust implementation doesn’t need to expand and shorten values for arithmetic ops all the time.

I really need to take me some time to ponder over the generated assembler code…

But all in all, the current speed (without much tweaking) of the Rust implementation is really not too shabby, especially since there’s not a single line of unsafe code in the whole thing.

To Be Continued…

…ok that’s more than enough for today, I think next time will be a bit about Rust closures (more or less the same thing as C++11 lambdas), since this was the first time I had to wrestle the borrow checker, and I’ll have to start implementing the missing parts on the road to a full 8-bit emulator: adding a page-table to the memory system, the PIO and CTC chips, and the interrupt daisy chain to bind them all together.