Spoiler tag

Monday, March 20, 2017

How I reverse engineered JSRF file formats [Part 1]

Time for a retrospect on this adventure!

This is a long time coming, some people have asked over time how I went about reverse engineering the game Jet Set Radio Future file formats.
Well here it is and is it going to be lonnnnng, trying to keep it short but there is much to explain.

Note: the stuff in this post is probably very basic for experienced programmers, sorry might not be so interesting for you fellow developpers ;)

 
I'll be skipping some more advanced technical details as the article is more aimed at the general public and the curious, for technical keywords I have embedded links to articles with more details on the subject if you want to know more.



Something around seven years ago I set out to reverse engineer some of the JSRF file formats, simply because I love the game and wanted to get a closer look at the game's artwork and its unique artstyle.





Videogames load things such as the 3D models and textures (images) from files.

While many file formats are standard and well documented, JSRF as most videogames, has its own custom game engine and its own custom set of file formats, which means we don't have any documentation on how to read/write those file formats, hence why we resort to reverse engineering them to read and extract the data we want, in this case we're looking for 3D models and textures(images).

Accessorily for me, this was also a good excuse to learn and practice programming, as for the life of me at the time I could hardly pickup a book on such a technical subject and do meaningless exercises to study this subject.
Instead, here was some real incentive, to learn and practice programming while having a real goal, real obstacles and a real reward if I get through the brick walls: reverse engineer the JSRF file format to get to extract and view the 3D models and textures.

I already had some beginner experience in reverse engineering game file formats from modding Halo: CE and Halo 2, writing some plugins, mapping "tag" (object type) file structures.
Modding Halo 2 taught me some basics about game engines, 'data types' and 'binary files', what a byte, an integer,  a float are, what an offset is, how data can be arranged in blocks and nested blocks within a file etc

What's the point of all this?

The main goal of this project was being able to get a closer look at the 3D models of the game, for instance character models, vehicles etc
In order to do that we'd need to figure out what model format the game uses, how data is structured within the different files and once we know that, code a program to do the file reading and extracting/converting data for us into standard readable formats.

In this first part, we'll focus on getting straight to the model data, without knowing the file structure at all, that's how I started with JSRF.

So for instance we would eventually be able to extract the models and textures to get a closer look at the game's character models as follows:

 

So how did I start reverse engineering JSRF?

-I didn't really have much of a clue where to start
-I was going to do it the dumb brute force way
-The only compiled programming language I roughly knew at the time was VB.net

Before I explain how I went about it, lets explain why I chose to do it the "dumb brute force way".

A computer program is created by using a programming language, in that form its human-readable and understandable, when the code is "compiled" it is converted to CPU language, so to speak, and the compiled code is written into a binary file (zeroes and ones) as an executable file (.exe) that the operative system will be able to open and load as a program.

Usually most file format or software reverse engineering is done by decompiling the program executable file and reading the program's CPU commands to figure out what the program is doing.


That would have been a much "easier" and better way to reverse engineer the file formats in the long run, and I mean in the long run because in my case I would have first needed to learn Assembly Machine Code(ASM), in order to be able to read and understand CPU code/commands.
Disassembling/Decompiling the Xbox executable (.xbe) file and by analyzing the program's long and twisted list of CPU commands/operations, we could figure out how the program reads the files and loads the data, easier said than done though.

For instance this is what ASM instructions (from a de-sasssembled executable file) look like:



What may be a line of code before compiling the program can turn into multiple or several dozens of lines of CPU instructions once compiled into an executable file.
The cherry on the cake, the Xbox CPU (a modified x86 Intel Pentium 3) has its very own set of custom CPU commands unique to the xbox which makes things a bit more complicated.


I knew reverse engineering the game's file format was going to take time, but learning ASM and being able to read and understanding xbox CPU commands was going to take me at least a year or two before I was even able to get remotely anything out of it.

Given the steep learning curve time required and me having no programming experience then, ASM was too daunting and unreasonable path for me to pick at the time, besides this was more of a hobby and my main study/work was focused on art and 3D modelling/animation/painting.

I have nothing but the utmost respect for people who know ASM, can read/write it and reverse engineer that way.
Maybe today I could learn it but its not something I would personally make use of often enough to justify learning and mastering such a complex skill.


And so, instead... I chose to reverse engineer things by brute-force, figuring out file structures by reading files manually, hex editor in one hand, notepad in the other, an eye for seeing patterns and puzzling things together!
Oh and patience for trial and error, a whole lot of that.... really.


What is binary data, hexadecimal etc?

Before we go into what and how I was reverse engineering, lets explain some basics of how computer files work.

Computers files are stored as a list of zeroes and ones, because that's how hard disk drives or flash memory, transistors etc work, they store and transfer zeroes and ones.

To get an idea how data is stored here is a visual representation of how the text word "Hello" is stored into a file (as binary) translates in different forms, from binary to hexadecimal to integers values to ASCII characters.

 

In green, the binary data, this is really how computer files are stored, a list of zeroes and ones, and that's how many 0's and 1s it takes to store the word Hello as ASCII text.

In this example we show that reading  0100 1000  which is 1 Byte (8 bits) of data, translates to   48  as Hexadecimal, which translates to   72   as an Integer number, which translates to the character H, given that in the ASCII character table the number   72   corresponds to the capital letter "H".

So for instance a classic text editor as notepad, will open a file and try to read the binary data as list of integer numbers and then display the corresponding characters based on the ASCII character table.


At its core the data is always stored as binary but it can be translated into Hexadecimal, especially for debugging or reverse engineering, hex is easier to read than a long list of seemingly random zeroes and ones.

This Hello example is pretty basic and there are many other forms to encode and read the data (integers, decimals, boolean, text etc) and any programmer can come up with their very own format and way to store any type of data.

Now that you have an idea of what a computer file is like, lets get to reading the data! :)


What/Where to start looking for?

I decided to start by looking for 3D model (polygon mesh) data, and try to visualize it, or at least part of it as 3D points.

When starting I of course had strictly no clue how the JSRF files are structured nor what type of data exactly they contain, so I wasn't even sure what files might contain model data or even if they did, if the model data would be interpret-able as well known standard formats, most games have their own formats.

Doesn't matter though! 
I just wanted to first make sure that the files contained readable 3D models and confirm that "I was onto something" for starters. So I chose a few files which I thought might contain models, based on how they were named.

 

THE Hex Editor
In order to read the data of a (binary) file, I used a program/tool known as a Hex Editor.
This tool reads and displays the binary file data as hexadecimal code, as well as text, it also lets us modify the files.
The hex editor also serves to "translate" how the data can read, we can select part of the data and it displays how it would read as different types of values (i.e: integer, decimal, text/string etc), in the "Data Inspector" panel on the right.



So what kind of data type are we looking for 3D models?

We're first going to look for 3D points data, because its easy to recognize as generally most 3D models are stored as lists of 3D points and triangles definitions(how the points are connected to form edges/triangles).

Luckily, JSRF files don't have any form of data compression nor encryption as some more modern games can have, so the data is "raw" and quite accessible, otherwise I would have first needed to reverse engineer the executable file and reading CPU instructions.

Anyways, generally a 3D point is stored as a '3D Vector', that means 3 floats: X, Y, Z, that is as three values of the type "float", one for each axis (X,Y,Z) to define the position of a point in a 3D space.

So I opened up JSRF files in a hex editor where I thought (based on the file name) models would be stored, looking for data patterns that looked like lists of floats.

This is what scrolling through the start of a JSRF file binary file looks like in a hex editor:



This is what the different columns represent:


The left column represents the "offsets" a list of positions/address in the file for each line start, in this case we have 16 bytes (0 to 15) of data displayed per line, so the offsets on the left are incremented by +16 for each line.

The central column displays the binary file data, Binary (ones and zeros) read as Hexadecimal.
We usually count and work with "Base 10" (0123456789) but for computers, we translate binary "Base 2": zeros and ones into in "Base 16" aka hexadecimal (From 0 to 16: 0123456789 ABCDEF).

The right column represents the binary data read as text, it can be useful to find actual text data, but also helps noticing patterns of the binary data, for the most part it doesn't make any sense because its binary data (not text).


In the central column, we can change the amount of bytes we want to display per line, by doing this, in some cases, we can more easily spot structured data and its patterns, illustrated as follows:


So I went through files looking for patterns and lists of floats and found this section.
We can expect structures, because in programming we tend to organize data in structures or lists that have items of a fixed size.

You'll notice there clearly is a pattern when I set the number of collumns to 23:


Its actually 24 bytes being displayed per row, the way computer memory is defined and with most programming languages, for lists of objects/data the first object's identifier or starting position is defined as starting from 0 rather than 1.


How to read the data and what does it mean?

Notice the selection in black in the previous image, I have selected 4 bytes of data, a float is encoded with 4 bytes of data.

Once selected the Hex Editor has another panel where it reads the selection as various types of data of the selected length or less.

On the left, we have the list of "types" of data, on the right what the data reads as.



int8 type is a signed integer, its stored as 1 byte of data, it can store an integer number that can go from -128 to 127.

int16 type is a signed integer, its stored as 2 byte of data it can store an integer number that can go from -32 768 to 32 767.

uint For the other "int" starting by "u", it means unsigned, has no sign, which means it can only be a positive number, this allows to have a larger range from 0 to 4 294 967 295

uint8 is stored as 1 byte but can have a value from 0 to 255, in other words its always positive.

 
float type can store a decimal number, it is stored in 4 bytes and its value can range from -3.4E+38 to +3.4E+38 (E for exponent, its quite long).

Note: the E stands for Exponent notation, for instance 6E-06 would equal to 0.000006, the exponent notation is just a shorter way to represent the float.

You can find more information about data types here.
 
You get the idea, this is how different types of values can be stored in computer file formats, in memory and programming in general, the types like int or float are standard data types and it's what CPU are built to work with.

Moving onto the next image, I have selected
4 bytes (40 BE 5B 41) that read -13.73395 as a float, this seems like quite a plausible float number.
If I read the next 4 bytes (70 F8 9B C1) it also gives a plausible float: -19.49631 and so on more floats for the next 4 bytes etc

I didn't select these 4 Bytes for no reason in particular, notice the pattern changes before the black selection:
 



Well, in red its mostly empty data, but there is clearly a pattern and structure.

Here for every line, we have 24 bytes displayed for which we find a similar pattern that starts with 4 bytes being 00 00 00 00, then 00 00 80 BF, and the 16 other bytes are for the most part different for each line.

If we read every 4 bytes of the first line as floats, this is what the data looks like:

This does look like it could be float data, as the numbers aren't too high nor too low, generally if we tried to read data that was an Int as a float we could get a very large or small number for a float such as 1.1204101e-038, generally we would expect float numbers to only have a few decimals after the comma, not 38 decimals after the comma, that's one way to determine that the data at that position most likely isn't a float.


So at this point we've established there is a pattern and this is a list of floats, we're looking for 3D point positions, so XYZ floats.

 



 
Lets take the first 15 lines of hex code and read them as floats, as seen in black text.
Note for the sake of readability I arranged the hex bytes in groups of 4 bytes. 



Ok so now that we have a block of data that looks more or less like a list of floats that could be 3D points , we want to visualize these floats as 3D points to see if it looks like anything coherent and that is seen in the game.

As I said earlier a 3D point is described by 3 floats:
X Y Z, but here we have a pattern of 24 bytes per line, or 24/4 = 6, so that's 6 floats per line, where as a 3D point is described by 3 floats.

So which part of that is the actual point data? is it a sequence of
XYZ XYZ in a single line?

Lets find out!  by importing the list of floats as 3D points into a 3D application which can display these points in 3D in space.

You might have noticed I colored
XYZ, that's because its the color codes used for each axis in geometry and most 3D applications use this color scheme too.



So for starters, let's try to import the floats list as a linear sequence of floats XYZ
XYZ ... etc
We read the entire listas follows, though I only listed the first three lines of data to keep it short:

So we're going to import the points in this order:
Point 0: (0, 0, 13.73395)
Point 1: (-19.49631, 0.203655, -1E-06)
Point 2:(0, -1, 13.88524)
Point 3:(-19.4553, 0.203655, -1E-06)
Point 4:(0, -1, 13.62626)
Point 5:(-19.60399, 0.203655, -1E-06) 
... (and +44  other points)

And this is what we get:

Hmmm.. so it doesn't look like much, definitely nothing coherent as most of the points align either in the X or Z axis, this doesn't make any sense :(

What now? 
Well, remember we figured there's a structure and a pattern of 24 bytes repeating, what if those 24 bytes (6 floats) have only one 3D point and the rest of the floats are another type of data?

I figured that would make more sense, as the first 2 floats are always 0 or -1, that didn't make much sense for the points to have different positions you would expect at least 2 or 3 of the floats to have quite different or at least opposite values as the third and fourth do.

So lets then try to only read one 3D point from a single line as follows:

We already tried reading XYZ starting from position 0 (and it looked like nothing) so instead we try reading starting from the second float (so from the byte position = 4)

We get this:


Hey! now that looks like, something, lets get a closer look:



Now we definitely seem to finally be onto something.


The file from which I got this float data is under the path and file name Media\Disp\map\Stg00.dat 
I guessed Disp probably stands for "display" and might be related to the game's UI.

I wasn't quite sure what this could be at first, so I searched more float lists in the same file and imported more 3D points.

And sure enough! those are the icon models from one of the game's menu as seen in the youtube video screenshot of the game's menu below:



Pssshhht! little secret, actually the proper position to read XYZ is this:

Starting at offset 8 and not 4, the model is no longer flipped and we get the correct Z value too.

Now you might be wondering, so what's the rest of the data ??? 
At the time I guessed probably more 3D model data, but lets not over complicate things, for now we don't need it.

Getting the triangle data

So far we have the 3D points aka vertices, but as you can see from the menu:
 

These icons have more details to them, they have colors and separate shapes, so what are we still missing?

Triangles! that's how 3D polygonal meshes work in computer graphics, points connect to one another to form edges, that form triangles - the facets we'll be able to see.

At the time I had to learn about 3D models, how are the triangle faces stored and defined?
Most game engines use triangles to represent 3D models as that is how graphics cards work, with triangles, which are composed of, points(vertices) and edges.

The triangles are defined by a list of indices(number of the vertices) for how the points are connected to form edges that compose one or multiple triangles.

Vertex indices are just integer numbers, so the first point (0, 013.73395) will be referenced as "0" in the triangle list, second point as "1" etc

Back to the hex editor

So let's get back to the hex editor and look for something that looks like a list of integers, the most logical place to look for that would be before or right after the float list we initially found.
But to keep this short, I'll skip ahead, I know the triangle data is after the floats list.

Here we have the end of the floats list, so the 3D points (encapsulated in the green outline)


And after it, you might notice this pattern 

Not that it means anything in ASCII text, but it helps us notice and differentiate a change in pattern or type of data.

so lets try to read the hex data of this section:

At first we see a bunch of 000000 zeroes, but then, 01... 02...02..01...03..03..01..04...01...05...06...06
So first thing that comes in mind is, incrementing numbers, but there's a bunch of zeroes between them, so lets try to read this as intergers with the hex editor.

So we try reading as Int16, that is a integer encoded in 2 bytes of data:


0000 (hex) = 0   (Int16)
0100 (hex) = 1   (Int16)
0200 (hex) = 2   (Int16)
0300 (hex) = 3   (Int16)
0900 (hex) = 3   (Int16)
0A00 (hex) = 10 (Int16)
0B00 (hex) = 11 (Int16)
0C00 (hex) = 12 (Int16)
...

So, pretty sure these are indeed Int16, if we read the data as Int16

So what does that look like if I select a few bytes and read it as integers?
It looks like a list of: 0 1 2 2 1 3 3 1 4 4 1 5 5 1 6 6...

Here's the whole block, I chose to start and stop at these specific locations because you can clearly see it starts and ends with a lot of zeroes, this is usually called "padding" in file structures and in many cases it helps us recognize when a section of data starts and ends.



Once again for the sake of ease of readability  I have here paired the hex code as groups of 2 bytes (the size of an Int16).

So as you can see, the numbers increase but it goes back to lower numbers sometimes.
It starts by 0 1 2, then follows by 2 1 3, these are most likely triangle definitions, because we cannot find the same number for every 3 int16.

A triangle definition of a polygon mesh references to the vertex(Point) number and defines in what order to connect vertices, to form the triangles.

Clearly though, the start being several zeroes, that's not exactly the start of the triangle list, it starts at the fifth 0, which the 8th byte, right before 1.

The highest number we get is 50, this would mean that the model has 51 vertices(points), enumerated 0 to 50.

Remember the floats list? 


We figured there's 24 bytes per line, a line has a definition for a vertex aka 3D point (XYZ) floats.

How many lines are there?
Notice I have highlighted the position in yellow: 00043512  to 00044712
So 44712 - 43512 = 1200 bytes

1200 / 24 = 50 lines, that is 50 vertices.

So we have 50 vertices, and the data we think is the triangle definitions has values up to 50, actually 51 if we count the one starting at 0, seems it is what we think it is.
It all comes together now :)

So now that we have the full lists of 3D points aka vertices and the list of triangle definitions, we import the vertices list and the triangle definitions into a 3D program, I was using good old Softimage at the time.

I won't go into details of how this works, basically I just coded a little script that would read the JSRF binary file, then created an empty mesh in the 3d program, then feed it the vertices list (floats for the points positions) and also the triangle definitions, the list of integers for each triangle.
And the 3D program would then display the 3D mesh object.


And this is what we get tadaaaaaaaa:



I manually added the colors so its more readable and as ingame:



The mesh with the vertices (red) and the edges/triangles in yellow:





An animation showing the order in which triangles are built:





Okay, the animation is just for the kicks, the modelling software can import and display the model in milliseconds of course :)

There's the other vertices/triangles list I manually extracted and imported:

Perfectly extracted as on the game's menu :D



I picked these menu meshes as an example because they are fairly simple and smaller models in terms of amounts of data, its easier to present/explain, but it really was one of the first meshes I stumbled upon and used to test reading/importing because of its simplicity.

Of course, this lead to extracting much more interesting 3D meshes such as characters, vehicles etc






But those models are somewhat different and more complex and I extracted later after figuring out more of the file structures, which we'll see in Part 2.



This concludes Part 1, that was the easy part, we were just trying to get raw polygon mesh data and importing it into a 3D software to view it to see if it really existed in the JSRF files and was readable as common 3D meshes.

The game has hundreds of files, and within each file there can be dozens or hundreds of 3D meshes, so searching for lists of floats and triangles definitions as explained in this post is not a solution, this was simply a means to getting started and making sure there was readable 3D polygon mesh data.

In Part 2, we'll get to the real puzzle, the next step is being able to understand the general file structure, to know where a block of data starts and ends within the file, how many models the file contains etc
Then to understand the model's file structure to be sure where the vertices list starts and ends, same for triangles etc


Anyways, that's it for now!
If you've gotten this far, bravo! I tried to keep it short, but its not my strongest suit... and I guess this techy stuff is not the most interesting to some so I tired to have some images to make it more bearable.

Feel free to ask questions and/or leave feedback, not sure if I over-explained things or missed on some things, let me know! :)