For an upcoming project, I need to extract level data from the classic 1985
video game Super Mario Bros (SMB).
More precisely, I want to extract the background imagery for each stage of the
game, excluding HUD elements and moving sprites, etc.
Of course, I could just stitch together images from the game, and perhaps
automate this process with computer vision techniques. But I think the method
described below is more interesting, and allows for inspection of
elements of the level perhaps not exposed through screenshots.
In this first stage of the project, we will explore 6502 assembly and an
emulator written in Python. Full source code is available
here.
Source code analysis
Reverse engineering any program is a lot easier if you have the source code
available, as we have here in the form of 17,000 lines of 6502 (the NES CPU) assembly code, posted by
doppelganger. Because Nintendo has never made an official source release, it
was created by disassembling the SMB machine code, painstakingly deciphering
what each part means, and inserting comments and meaningful symbol names along
the way.
A quick search through the file and we find something that looks like it might
be the level data we seek:
If you are not familiar with assembly, this is just saying “insert these bytes
verbatim into the compiled program, and then allow other parts of the program to
refer to it via the symbol L_GroundArea6”. You can think of it as an array
where each element is a single byte.
The first thing to note is that this is a very small amount of data (about 100
bytes). This rules out any sort of encoding which allows arbitrary placement of
blocks in the level. After a bit of searching, I found
that this data is actually read (after some indirection) in
AreaParserCore.
This subroutine in turn calls lots of other subroutines, eventually calling a
specific subroutine for each type of object allowed in a scene (eg.
StaircaseObject, VerticalPipe, RowOfBricks) for over 40 different objects:
Abbreviated call graph for AreaParserCore
The routine writes into a MetatileBuffer: a 13-byte-long section of memory,
which respresents a single column of blocks in a level, each byte representing a
single block. A metatile is a 16x16 block that makes up the backgrounds in SMB:
Level with boxes drawn around the metatiles
They are called metatiles because each one consists of four 8x8 pixel tiles —
but more on this later.
The fact that the decoder works in terms of predefined objects explains the
small level size: the level data needs to refer just to object types and
locations, for example, “place a pipe at location (20, 16), a row of blocks at
location (10, 5), …”, however, it does mean that there is a lot of code
required to turn the raw level data into metatiles.
Porting this amount of code in order to write my level extractor would take far
too long, so let’s try a different approach.
py65emu
If we had an interface between Python and 6502 assembly we could call the
AreaParserCore subroutine for each column of the level, and then use the the
more concise Python to convert the block information into the desired image.
Enter py65emu, a succinct 6502
emulator with a Python interface. Here is how to set up py65emu with the same
memory map configuration as the NES:
from py65emu.cpu import CPU
from py65emu.mmu import MMU
# Load in the program ROM (ie. the compiled assembly)
with open("program.bin", "rb") as f:
prg_rom = f.read()
# Define the memory mapping.
mmu = MMU([
# Create 2K RAM, mapped to address 0x0.
(0x0, 2048, False, []),
# Map the program ROM to 0x8000.
(0x8000, len(prg_rom), True, list(prg_rom))
])
# Create the CPU, telling it to start executing at location 0x8000
cpu = CPU(mmu, 0x8000)
After this we can execute a single instruction with the cpu.step() method, and
we can inspect the memory mmu.read(), and check the machine’s registers with
cpu.r.a, cpu.r.pc, etc. In addition we can write to the memory with
mmu.write().
It should be noted that this is only an emulator for the CPU of the NES: it
does not emulate other parts of the hardware such as the PPU, so it cannot be
used to emulate the entire game. It should, however, be sufficient for calling
the parsing subroutine, as this does not rely on any hardware beyond the CPU and
memory.
The plan is to set up the CPU as above, and then for each column of the level,
initialize sections of the memory with the inputs required by AreaParserCore,
call AreaParserCore, and then read back the column data. Once this is done
we will use Python to compose the result into an image.
But before we do this we will need to compile the assembly listing into machine
code.
x816
As noted in the source, the assembly compiles with x816. x816 is an MS-DOS based
6502 assembler used by the NES homebrew and ROM hacking community, and it works
great in DOSBox.
Along with the program ROM required by py65emu, x816 produces a symbol file
which maps symbols to their memory locations in the CPU’s address space. Here
is an excerpt:
Here we can see the AreaParserCore function will be accessible at address
0x93fc in the source.
For convenience, I put together a parser for the symbol file, which maps
between symbol names and addresses:
sym_file = SymbolFile('SMBDIS.SYM')
print("0x{:x}".format(sym_file['AREAPARSERCORE'])) # prints 0x93fc
print(sym_file.lookup_address(0x93fc)) # prints "AREAPARSERCORE"
Subroutines
As mentioned in the plan above, we want to be able to call the AreaParserCore
subroutine from Python.
To understand the mechanics of a subroutine let us look at a short subroutine
and its corresponding call:
WritePPUReg1:
sta PPU_CTRL_REG1 ;write contents of A to PPU register 1
sta Mirror_PPU_CTRL_REG1 ;and its mirror
rts
...
jsr WritePPUReg1
The jsr (jump to subroutine) instruction pushes the PC register onto the
stack, and sets the PC register to the address referred to by WritePPUReg1.
The PC register tells the CPU the address of the next instruction to load, so
that the next instruction executed after the jsr instruction will be the first
line of WritePPUReg1.
At the end of the subroutine an rts (return from subroutine) instruction is
executed. This command pops the saved value from the stack and stores it in the
PC register, which makes the CPU execute the instruction following the jsr
call.
The great thing about subroutines is that you can have nested calls, ie. calls
to subroutines within subroutines. Return addresses will be pushed onto the
stack and then popped off in the correct order, in the way you would expect of
function calls in a high level language.
Here then is the code for executing a subroutine from Python:
This works by saving the current value of the stack pointer register (s),
emulating a jsr call, and then executing instructions until the stack has
returned to its initial height, which will only happen when the first subroutine
has returned. This is useful, as we now have a way of directly calling 6502
subroutines from within Python.
However, we are forgetting something: how to provide inputs to this subroutine.
We need of telling the routine what level we are trying to render, or which
particular column we want to parse.
Unlike a high-level language such as C or Python, subroutines in 6502 assembly
do not take explicit inputs. Rather, inputs are passed by setting memory
locations at some point prior to the call, which are then read anywhere within
the subroutine call. Given the size of AreaParserCore, reverse engineering
the required inputs purely by looking at the source would be difficult and open
to human error.
Valgrind for NES?
To work out what the inputs to AreaParserCore were, I drew inspiration from
the memcheck tool for
Valgrind. Memcheck detects accesses to uninitialized memory by storing “shadow”
memory alongside every piece of actual memory allocated. The shadow memory
records whether the corresponding real memory has ever been written to. If
the program reads from an address that has never been written to, an
uninitialized memory error is printed. If we could run AreaParserCore with
such a tool it would tell us the inputs to the subroutine that should be set
before being called.
It is actually very easy to write a simple version of memcheck for py65emu:
def format_addr(addr):
try:
symbol_name = sym_file.lookup_address(addr)
s = "0x{:04x} ({}):".format(addr, symbol_name)
except KeyError:
s = "0x{:04x}:".format(addr)
return s
class MemCheckMMU(MMU):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._uninitialized = array.array('B', [1] * 2048)
def read(self, addr):
val = super().read(addr)
if addr < 2048:
if self._uninitialized[addr]:
print("Uninitialized read! {}".format(format_addr(addr)))
return val
def write(self, addr, val):
super().write(addr, val)
if addr < 2048:
self._uninitialized[addr] = 0
Here we have wrapped py65emu’s memory management unit. This class maintains an
array, _uninitialized, whose entries tell us whether the corresponding byte of
the emulated RAM has ever been written to. When an uninitialized read occurs,
the address of the invalid read and the corresponding symbol name are printed.
Searching around the code, we see that many of these values are set by the
subroutine InitializeArea, so let’s re-run the script with a call to this
function first. Repeating this process, we end up with the following
arrangement of calls, which needs just the world number and area number to be
set:
mmu.write(sym_file['WORLDNUMBER'], 0) # World number, minus 1
mmu.write(sym_file['AREANUMBER'], 0) # Level number, minus 1
execute_subroutine(sym_file['LOADAREAPOINTER'])
execute_subroutine(sym_file['INITIALIZEAREA'])
metatile_data = []
for column_pos in range(48):
execute_subroutine(sym_file['AREAPARSERCORE'])
metatile_data.append([mmu.read_no_debug(sym_file['METATILEBUFFER'] + i)
for i in range(13)])
execute_subroutine(sym_file['INCREMENTCOLUMNPOS'])
This writes the first 48 columns of World 1-1 into metatile_data, using the
subroutine IncrementColumnPos to increment the internal variables used to
track the current column.
And here are the contents of metatile_data, overlaid onto some screenshots
from the game (bytes with the value 0 are not shown):
metatile_data clearly corresponds with the background information.
Now let’s look at how we turn the metatile numbers we fetched above into actual
images. The following steps were found by analysing the source and studying the
documentation on the excellent Nesdev Wiki.
To understand how to render each metatile, we first need to talk about colour
palettes on the NES. The NES PPU has the ability to render 64 different colours
in total, although black is duplicated a few times (see Nesdev for details):
Each Mario level is limited to using only 10 of these 64 colours for its
background, divided into 4 four colour palettes; the first colour is always the
same. Here are the 4 palettes for World 1-1:
Now let’s take a look at an example metatile number which has been expressed
in binary. Here is the metatile number for the cracked rock tile that runs
along the ground of World 1-1:
The palette index tells us which palette to use when rendering the metatile, in
this case palette 1. The palette index also an index into these two arrays:
These arrays combined give a 16-bit address, which for our example points to
Palette1_Mtiles:
Palette1_MTiles:
.db $a2, $a2, $a3, $a3 ;vertical rope
.db $99, $24, $99, $24 ;horizontal rope
.db $24, $a2, $3e, $3f ;left pulley
.db $5b, $5c, $24, $a3 ;right pulley
.db $24, $24, $24, $24 ;blank used for balance rope
.db $9d, $47, $9e, $47 ;castle top
.db $47, $47, $27, $27 ;castle window left
.db $47, $47, $47, $47 ;castle brick wall
.db $27, $27, $47, $47 ;castle window right
.db $a9, $47, $aa, $47 ;castle top w/ brick
.db $9b, $27, $9c, $27 ;entrance top
.db $27, $27, $27, $27 ;entrance bottom
.db $52, $52, $52, $52 ;green ledge stump
.db $80, $a0, $81, $a1 ;fence
.db $be, $be, $bf, $bf ;tree trunk
.db $75, $ba, $76, $bb ;mushroom stump top
.db $ba, $ba, $bb, $bb ;mushroom stump bottom
.db $45, $47, $45, $47 ;breakable brick w/ line
.db $47, $47, $47, $47 ;breakable brick
.db $45, $47, $45, $47 ;breakable brick (not used)
.db $b4, $b6, $b5, $b7 ;cracked rock terrain <--- This is the 20th line
.db $45, $47, $45, $47 ;brick with line (power-up)
.db $45, $47, $45, $47 ;brick with line (vine)
.db $45, $47, $45, $47 ;brick with line (star)
.db $45, $47, $45, $47 ;brick with line (coins)
...
The metatile index when multipled by 4 is an index into this array. The data is
formatted with 4 entries per line, so our example metatile refers to the 20th
line, reassuringly commented with cracked rock terrain.
The four entries on this line are in fact tile IDs: Each metatile consists of
four 8x8 pixel tiles, in the order top-left, bottom-left, top-right, and
bottom-right. These IDs are sent directly to the NES’s PPU and the ID refers to
16 bytes of data in the NES’s CHR-ROM, with each entry starting at address
0x1000 + 16 * <tile id>:
The CHR-ROM is a piece of read-only memory that can be accessed only by the PPU,
and is separate to the PRG-ROM where the program code is stored. As such, the
above data is not available in the source code, and needs to be retrieved from a
ROM dump of SMB.
The 16 bytes for each tile make up a 2-bit 8x8 tile: the first bit is the first
8 bytes, and the second bit is the second 8 bytes: