I have to admit that the title is not 100% correct. To be honest, I wrote something that can play some ROMs to some extent. Writing a complete emulator with all features, including sound and all available co-processors, would probably be a full-time project for many years. Still, I'm super happy with the results and would never have imagined getting this far.

Why?

When I grew up, there was a SNES at my grandparents' home. Playing (and failing!) Super Mario World is probably my earliest childhood memory of playing video games. Thus, the console and game have a special place in my heart. When I wanted to play this game later in my life, I didn't have access to the real console any more, so emulating the game was my only chance. Having been interested in software development since childhood, I was quickly fascinated by the technology behind emulators. I always wanted to build one myself, but I lacked the skills. After finishing my studies and becoming a professional software developer, I felt that it was time to pick up this project again.

I know that Nintendo sometimes takes harsh action against emulator development. That's why I won't publish mine. Also, all code examples are heavily simplified.

Getting started

My very first task was to choose a programming language. I decided to use C++ since I work with this language professionally and know it rather well. I wanted the emulator to be cross-platform, so I refrained from using OS-specific APIs.

The ROM-loader

To get started with the actual emulator project, I needed some way to load the ROM file into my software. I studied the console's memory layout and came up with a simple loader. This was as simple as reading the ROM file (starting at byte 512, which skips the 512-byte SMC header) and writing the contents to the internal memory (starting at 0x8000) byte by byte. The loader also extracts the ROM header, a part of the game memory that contains information like interrupt addresses.
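A minimal sketch of such a loader, assuming a flat byte array for a single 64 KiB bank. The names `loadRom`, `kSmcHeaderSize`, `kRomBase` and `kMemorySize` are hypothetical and not taken from the actual emulator:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constants for this sketch.
constexpr std::size_t kSmcHeaderSize = 512;   // copier header in front of the ROM data
constexpr std::size_t kRomBase = 0x8000;      // where the ROM contents start in memory
constexpr std::size_t kMemorySize = 0x10000;  // one 64 KiB bank is enough for the sketch

// Copies everything after the SMC header into memory, starting at 0x8000.
// Bytes that do not fit into the bank are ignored in this simplified version.
std::vector<std::uint8_t> loadRom(const std::vector<std::uint8_t>& file)
{
    std::vector<std::uint8_t> memory(kMemorySize, 0);
    if (file.size() <= kSmcHeaderSize)
    {
        return memory; // nothing to load
    }
    const std::size_t romSize = file.size() - kSmcHeaderSize;
    const std::size_t count = std::min(romSize, kMemorySize - kRomBase);
    std::copy_n(file.begin() + static_cast<std::ptrdiff_t>(kSmcHeaderSize),
                count,
                memory.begin() + static_cast<std::ptrdiff_t>(kRomBase));
    return memory;
}
```

A real loader would also have to spread larger ROMs across several banks and cope with files that carry no SMC header at all.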

The CPU

The CPU class contains the same registers as the physical hardware.

struct Registers
{
    Register pc;
    Register db;
    Register pb;
    Register sp;
    Register p;
    Register a;
    Register d;
    Register x,y;
};

class Cpu
{
public:
    explicit Cpu(System& system);
    bool runClockCycle();

private:
    [[nodiscard]] types::opcode fetchOpcode();
    Registers registers{};
    bool emulationMode{ true };
};

When the CPU class is instantiated, it first sets the program counter register to the reset vector from the ROM header, which is the game's entry point.

Cpu(System& system) : system{ system }
{
    romHeader = system.getMemoryBus().getMemory().getHeader();
    registers.pc = { romHeader.emulationResetVector };
}
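For reference, the emulation-mode reset vector is a little-endian 16-bit value stored at 0xFFFC/0xFFFD in bank 0. A small sketch of reading it from a flat memory image (the helper names `read16` and `readResetVector` are made up for this example):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads a little-endian 16-bit value: low byte first, then high byte.
std::uint16_t read16(const std::vector<std::uint8_t>& memory, std::size_t addr)
{
    return static_cast<std::uint16_t>(memory.at(addr) | (memory.at(addr + 1) << 8));
}

// The emulation-mode reset vector lives at 0xFFFC/0xFFFD in bank 0.
std::uint16_t readResetVector(const std::vector<std::uint8_t>& memory)
{
    return read16(memory, 0xFFFC);
}
```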

What makes the 5A22 hard to emulate is the fact that the CPU can be in one of three states:

Emulation mode: The CPU behaves like a 6502
8-bit mode: All arithmetic operations are 8-bit
16-bit mode: All arithmetic operations are 16-bit

Initially, the CPU is in emulation mode until a special instruction disables it.
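On the underlying 65C816 core, the width is selected by two bits of the P register once emulation mode is off: M (0x20) for the accumulator and memory operations, X (0x10) for the index registers. A sketch of how a helper like `use16bit()` could be built on top of that; the function names here are hypothetical:

```cpp
#include <cstdint>

constexpr std::uint8_t kFlagM = 0x20; // set = 8-bit accumulator and memory accesses
constexpr std::uint8_t kFlagX = 0x10; // set = 8-bit index registers

// Emulation mode always forces 8-bit behavior, regardless of the flag bits.
bool accumulatorIs16Bit(bool emulationMode, std::uint8_t p)
{
    return !emulationMode && (p & kFlagM) == 0;
}

bool indexIs16Bit(bool emulationMode, std::uint8_t p)
{
    return !emulationMode && (p & kFlagX) == 0;
}
```

Because the two bits are independent, accumulator and index registers can have different widths at the same time.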

Implementing The First Instruction

After loading the Super Mario World ROM and initializing the CPU, it's time to fetch the first instruction from memory. The memory address is composed of the program counter register and the value in the program bank register.

uint8_t fetch8(int offset) const
{
    return system.getMemoryBus().read8(MemoryAddr(((*registers.pb << 16) | *registers.pc) + offset));
}
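One detail worth keeping in mind: on the 65C816 the program counter wraps around inside the current program bank. In the simplified snippet above the offset is added after the bank byte has been merged in, so a fetch near the end of a bank would carry into the next bank. A sketch of a variant that wraps within the bank (the function name `programAddress` is made up for this example):

```cpp
#include <cstdint>

// Builds the 24-bit address of the byte `offset` bytes after the program counter.
// The sum is truncated to 16 bits first, so the bank byte never changes.
std::uint32_t programAddress(std::uint8_t pb, std::uint16_t pc, std::uint16_t offset)
{
    const std::uint16_t inBank = static_cast<std::uint16_t>(pc + offset);
    return (static_cast<std::uint32_t>(pb) << 16) | inBank;
}
```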

The first instruction in Super Mario World is SEI which disables interrupts.
The first opcode handler looks as follows:

bool runClockCycle()
{
    opcode op = fetchOpcode();
    switch (op)
    {
        case 0x78: OP(SEI)
        {
            finishOp(1, {{ CpuFlag::I, true }});
            break;
        }
    }
    return true;
}

finishOp is a little helper method that increments the program counter (SEI is a 1-byte instruction) and sets CPU flags (bits of the P register), in this case the interrupt disable flag.
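A possible shape for such a helper, as a self-contained sketch. `MiniCpu` is a stand-in with just a program counter and a P register, not the real `Cpu` class, and the flag masks follow the 65C816 bit layout:

```cpp
#include <cstdint>
#include <initializer_list>
#include <utility>

// Bit masks of the P register flags used in this sketch.
enum class CpuFlag : std::uint8_t
{
    C = 0x01, Z = 0x02, I = 0x04, D = 0x08, V = 0x40, N = 0x80
};

struct MiniCpu
{
    std::uint16_t pc{ 0 };
    std::uint8_t p{ 0 };

    // Advances the program counter by the instruction length,
    // then sets or clears every flag listed in `flags`.
    void finishOp(std::uint16_t length,
                  std::initializer_list<std::pair<CpuFlag, bool>> flags)
    {
        pc = static_cast<std::uint16_t>(pc + length);
        for (const auto& [flag, value] : flags)
        {
            const auto mask = static_cast<std::uint8_t>(flag);
            p = value ? static_cast<std::uint8_t>(p | mask)
                      : static_cast<std::uint8_t>(p & ~mask);
        }
    }
};
```

With this shape, `finishOp(1, {{ CpuFlag::I, true }})` and `finishOp(3, {})` both compile as written in the handlers.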

The next instruction is STZ which stores a zero to an absolute address in memory.

In the opcode handler switch statement:

case 0x9c: OP(STZ)
{
    stz(addrAbsolute());
    finishOp(3, {});
    break;
}

Helper methods:

void stz(address addr)
{
    system.getMemoryBus().write8or16(MemoryAddr{addr}, 0x0, use16bit());
}

address addrAbsolute() const
{
    return fetch16(1) | (*registers.db << 16);
}

This is one of the instructions that behave differently in 8- and 16-bit mode:

In 8-bit mode, 0x00 is written to memory
In 16-bit mode, 0x0000 is written to memory
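The width switch can be pushed down into the memory write itself. A sketch of what a `write8or16`-style function might do on a flat byte array (the signature is an assumption; the real memory bus takes a `MemoryAddr`):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Writes the low byte at addr. In 16-bit mode the high byte follows at
// addr + 1, since the CPU stores values in little-endian order.
void write8or16(std::vector<std::uint8_t>& memory, std::size_t addr,
                std::uint16_t value, bool use16bit)
{
    memory.at(addr) = static_cast<std::uint8_t>(value & 0xFF);
    if (use16bit)
    {
        memory.at(addr + 1) = static_cast<std::uint8_t>(value >> 8);
    }
}
```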

All the other 254 instruction handlers are written in a similar way. I proceeded by going through the ROM instruction by instruction and implementing missing ones. I later discovered a special CPU test ROM which was a great help in fixing some mistakes I made during implementation.

DMA

After implementing most CPU instructions, I was eager to see some graphics output.
The SNES has special PPU (Picture Processing Unit) chips that render graphics to the screen. The CPU writes to special memory addresses (PPU Registers) to control the PPU.
This is mostly done via DMA instead of directly writing the registers. That means that I had to implement the DMA controller first. I came up with the following architecture:

In a very simplified form, the communication looks like this:
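To make the idea concrete, here is a minimal sketch of one transfer in its simplest form: a run of bytes is read from consecutive CPU-side addresses and written to a single PPU register in the 0x21xx range. `DmaChannel`, `ReadFn`, `WriteFn` and `runDma` are hypothetical names, and transfer modes, address wrapping and the special meaning of a zero byte count are all ignored:

```cpp
#include <cstdint>
#include <functional>

// Hypothetical, heavily reduced view of one DMA channel.
struct DmaChannel
{
    std::uint32_t sourceAddr;   // 24-bit address on the CPU side
    std::uint8_t destRegister;  // low byte of the target register (0x21xx)
    std::uint16_t byteCount;    // number of bytes to transfer
};

using ReadFn = std::function<std::uint8_t(std::uint32_t)>;
using WriteFn = std::function<void(std::uint16_t, std::uint8_t)>;

// Copies byteCount bytes to one register and returns how many were moved.
std::uint32_t runDma(const DmaChannel& channel, const ReadFn& read, const WriteFn& write)
{
    const auto dest = static_cast<std::uint16_t>(0x2100 | channel.destRegister);
    for (std::uint32_t i = 0; i < channel.byteCount; ++i)
    {
        write(dest, read(channel.sourceAddr + i));
    }
    return channel.byteCount;
}
```

Passing the bus accesses in as callbacks keeps the controller independent of the memory and PPU classes, which makes it easy to test in isolation.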

The PPU

The graphics of SNES games consist of Backgrounds and Sprites.
The number and order of background and sprite layers are determined by the graphics mode (my emulator currently supports modes 0 and 1).

Backgrounds

Backgrounds are divided into 8x8 tiles. The tilemap specifies the tile used for every 8x8 region on screen. Each tile has a palette which assigns a color to each pixel. Tiles and tilemaps are stored in video RAM (VRAM), while palettes are stored separately in CGRAM.
Rendering a background layer works like this:

Load the tilemap of the layer from VRAM
Iterate over all 8x8 pixel regions of the screen
Read the tile id for each region from the tilemap
Load the tile from VRAM
Load the palette of the tile from CGRAM
Read the color for each of the tile's pixels and draw them to the screen
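Two of these steps involve unpacking bit fields, which is easy to get wrong. The sketch below decodes a 16-bit tilemap entry and one row of a 2-bits-per-pixel tile (the format used by the 4-color layers). The struct and function names are made up for this example:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fields of one 16-bit tilemap entry.
struct TilemapEntry
{
    std::uint16_t tileId;  // bits 0-9
    std::uint8_t palette;  // bits 10-12
    bool priority;         // bit 13
    bool flipX;            // bit 14
    bool flipY;            // bit 15
};

TilemapEntry decodeTilemapEntry(std::uint16_t raw)
{
    return TilemapEntry{
        static_cast<std::uint16_t>(raw & 0x03FF),
        static_cast<std::uint8_t>((raw >> 10) & 0x07),
        (raw & 0x2000) != 0,
        (raw & 0x4000) != 0,
        (raw & 0x8000) != 0 };
}

// Decodes one row of a 2bpp tile into 8 palette indices (0-3).
// A 2bpp tile is 16 bytes: two bytes per row, one for each bitplane.
std::array<std::uint8_t, 8> decodeTileRow2bpp(const std::uint8_t* tile, int row)
{
    assert(row >= 0 && row < 8);
    std::array<std::uint8_t, 8> pixels{};
    const std::uint8_t plane0 = tile[row * 2];
    const std::uint8_t plane1 = tile[row * 2 + 1];
    for (int x = 0; x < 8; ++x)
    {
        const int bit = 7 - x; // the leftmost pixel is the most significant bit
        pixels[static_cast<std::size_t>(x)] = static_cast<std::uint8_t>(
            ((plane0 >> bit) & 1) | (((plane1 >> bit) & 1) << 1));
    }
    return pixels;
}
```

The palette index from the tile row is combined with the palette number from the tilemap entry to find the final color in CGRAM.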

Sprites

Sprites are easier since they're just a bunch of tiles (without tilemaps) that have an X and Y position assigned. Sprites are stored in their own memory called OAM.
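OAM holds 128 sprites: a 512-byte table with four bytes per sprite, followed by a 32-byte table that adds two more bits per sprite. A sketch of decoding one entry, with a hypothetical `Sprite` struct and `decodeSprite` function:

```cpp
#include <cassert>
#include <cstdint>

struct Sprite
{
    std::uint16_t x;       // 9-bit X position
    std::uint8_t y;        // Y position
    std::uint16_t tileId;  // 9-bit tile number
    std::uint8_t palette;  // 0-7
    std::uint8_t priority; // 0-3
    bool flipX;
    bool flipY;
    bool large;            // selects the larger of the two configured sizes
};

// oam must point to all 544 bytes of OAM.
Sprite decodeSprite(const std::uint8_t* oam, int index)
{
    assert(index >= 0 && index < 128);
    const std::uint8_t* entry = oam + index * 4;
    // The second table packs four sprites into each byte, two bits per sprite.
    const int extra = (oam[512 + index / 4] >> ((index % 4) * 2)) & 0x03;
    const std::uint8_t attr = entry[3];

    Sprite s{};
    s.x = static_cast<std::uint16_t>(entry[0] | ((extra & 0x01) << 8));
    s.y = entry[1];
    s.tileId = static_cast<std::uint16_t>(entry[2] | ((attr & 0x01) << 8));
    s.palette = static_cast<std::uint8_t>((attr >> 1) & 0x07);
    s.priority = static_cast<std::uint8_t>((attr >> 4) & 0x03);
    s.flipX = (attr & 0x40) != 0;
    s.flipY = (attr & 0x80) != 0;
    s.large = (extra & 0x02) != 0;
    return s;
}
```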

The rendering method for a single horizontal line (row) of the screen is implemented like this:

    void renderRow(uint16_t row)
    {
        for (const auto& toRender : MODES[activeBackgroundMode].layerOrder)
        {
            if (const auto& background = backgrounds[toRender.bgLayerNumber];
                background.bgEnableOnMainSceen || background.bgEnableOnWindow)
            {
                renderBackgroundRow(
                    toRender.bgLayerNumber,
                    row,
                    toRender.bgLayerColorDepth,
                    toRender.bgLayerPriority);
            }
        }

        // Sprites
        for (int i = 0; i < 128; i++)
        {
            renderSpriteRow(i, row);
        }
    }

Please note that this is not correct yet: It assumes that sprites are rendered after all backgrounds. In reality, there can be background layers over the sprite layer.
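One way to address this, sketched here under the assumption that every layer can report its pixel together with a depth value: collect one candidate per non-transparent layer and keep the front-most one. The `depth` numbers are hypothetical and would have to be derived from the layer order of the active graphics mode:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct PixelCandidate
{
    std::uint16_t color; // 15-bit color as stored in CGRAM
    int depth;           // hypothetical: a larger value is closer to the viewer
};

// Returns the color of the front-most candidate,
// or nothing if every layer was transparent at this pixel.
std::optional<std::uint16_t> composePixel(const std::vector<PixelCandidate>& candidates)
{
    const PixelCandidate* best = nullptr;
    for (const auto& candidate : candidates)
    {
        if (best == nullptr || candidate.depth > best->depth)
        {
            best = &candidate;
        }
    }
    if (best == nullptr)
    {
        return std::nullopt;
    }
    return best->color;
}
```

With this approach the order in which backgrounds and sprites are drawn no longer matters, at the cost of more work per pixel.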

First Successful rendering

For the rendering itself, I use OpenGL and plot the picture pixel-by-pixel.
After weeks of work, this was my first (semi-)successful rendering. The colors were wrong because I had an error in the palette handling, and the scaling was also off.
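For context, CGRAM stores colors as 15-bit values with five bits per channel and red in the lowest bits, so they have to be expanded before being handed to a renderer that expects eight bits per channel. A minimal sketch of that conversion (the names `Rgb` and `toRgb888` are hypothetical, and this is not necessarily related to the bug mentioned above):

```cpp
#include <cstdint>

struct Rgb
{
    std::uint8_t r;
    std::uint8_t g;
    std::uint8_t b;
};

// Converts a 15-bit color (0bbbbbgggggrrrrr) to 8 bits per channel.
Rgb toRgb888(std::uint16_t color)
{
    // Repeating the top bits in the low bits maps 0..31 onto the full 0..255 range.
    const auto expand = [](std::uint16_t c5) {
        return static_cast<std::uint8_t>((c5 << 3) | (c5 >> 2));
    };
    return Rgb{ expand(color & 0x1F),
                expand((color >> 5) & 0x1F),
                expand((color >> 10) & 0x1F) };
}
```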

Nevertheless, I was super happy to finally see some progress.

The Nintendo logo on this screen is a sprite. Rendering background layers was still barely working at this point. After improving my code over and over, fixing bugs in the CPU and PPU, I got more stuff to render properly every week:

Controller Input

After fixing the graphics to a certain extent, I finally wanted to play a level. To my surprise, controller input is dead simple on the SNES.
The CPU determines the status of each button (pressed or released) by reading the individual bits of the values stored at memory addresses 0x4218 and 0x4219, which map to hardware registers. I use the OpenGL renderer to detect keyboard events in the emulator window. Therefore, the implementation of the controller class is very simple:

uint8_t read(const MemoryAddr &addr)
{
    if (addr.getAddr() == 0x4219)
    {
        uint8_t result{ 0 };
        // Note: on real hardware, bit 7 of 0x4219 reports the B button; A is bit 7 of 0x4218.
        result |= (screen->isKeyDown(render::SNES_KEY_A) << 7);
        result |= (screen->isKeyDown(render::SNES_KEY_UP) << 3);
        result |= (screen->isKeyDown(render::SNES_KEY_DOWN) << 2);
        result |= (screen->isKeyDown(render::SNES_KEY_LEFT) << 1);
        result |= (screen->isKeyDown(render::SNES_KEY_RIGHT));
        return result;
    }

    return 0;
}
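For completeness, a sketch covering both registers with the full button layout of the first controller. The `Buttons` struct and `readJoypad1` function are hypothetical stand-ins for the keyboard queries above:

```cpp
#include <cstdint>

// Hypothetical snapshot of the controller state.
struct Buttons
{
    bool b{}, y{}, select{}, start{}, up{}, down{}, left{}, right{};
    bool a{}, x{}, l{}, r{};
};

std::uint8_t readJoypad1(std::uint16_t addr, const Buttons& s)
{
    if (addr == 0x4219) // high byte: B, Y, Select, Start, Up, Down, Left, Right
    {
        return static_cast<std::uint8_t>(
            (s.b << 7) | (s.y << 6) | (s.select << 5) | (s.start << 4) |
            (s.up << 3) | (s.down << 2) | (s.left << 1) | (s.right << 0));
    }
    if (addr == 0x4218) // low byte: A, X, L, R in the upper four bits
    {
        return static_cast<std::uint8_t>(
            (s.a << 7) | (s.x << 6) | (s.l << 5) | (s.r << 4));
    }
    return 0;
}
```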

Refactoring

After months of developing features, trying things and fixing bugs, the code became almost unmaintainable and refactoring became necessary. After discussion with a lot of friends and colleagues, I came up with the following component structure:

After the refactoring, the main method is pretty minimalistic:

int main(int argc, char** argv)
{
    auto memory = snes::loadRomFromFile("rom.smc", false);

    std::shared_ptr<render::Renderer> renderer = render::Renderer::getRenderer();
    renderer->init();

    snes::System system{};
    system.createTracer();
    system.loadMemory(std::move(memory));
    system.createApu();
    system.createPpu(renderer);
    system.createStatusRegisters();
    system.createController(renderer);
    system.createDmaController();
    system.createMemoryBus();
    system.createCpu();

    snes::SystemError reason = system.run();
    if (reason == snes::SystemError::UNKNOWN_CPU_INSTRUCTION)
    {
        std::cout << "Unknown CPU instruction found; System halted." << std::endl;
        system.getTracer()->printDump();
    }

    return 0;
}

What's missing

I'm already very happy with my progress, which allows me to play Super Mario World more or less well. However, there are still plenty of features to tackle when I find some time again:

Cycle accuracy: Not all instructions take the same amount of time in reality, so game speed is off.
Sound: The implementations of the SPC coprocessor and DSP are missing completely.
Graphics modes: Currently only 0 and 1 are implemented. 7 would be especially challenging.
Windowing and Mosaic features of the PPU: These normally great-looking effects currently look super odd.
Most of HDMA (since it's not used very much in SMW, I only implemented the bare minimum).
Performance optimizations, Refactoring...

I Wrote a Super Nintendo Emulator From Scratch

Matthias Rupp