(Not) Fixing the Final Bug

This is the story of the end of a project. We were about to release the last DLC before we moved on to the next game. Everything was ready to go, except for one single, game-breaking bug that would cause us to fail compliance testing.

I had developed something of a reputation for fixing bugs, and many other developers were already taking time off, so I was assigned the last bug. It was a tricky one.

PS3 version of the game crashes on boot if DLC 2 is installed.

We had two DLCs (downloadable content packs) for the game. The first one had been released, and we were about to release the second one. QA were reporting that the PS3 version of the game would crash if the second DLC was installed.

Reproducing


The first step in investigating almost any bug is to reproduce it. If you can make it happen on your developer machine, you then have a few debugging tools you can use to figure out the bug. So I built the game, built the DLCs, and installed them all on my developer machine.

I couldn’t reproduce the bug. No matter what I did, the game would run every time. This wasn’t good. It meant there was something causing a crash, but no way for us to test and fix it.

I went back to QA and asked them to reproduce the bug with the latest build. They loaded up the game on a PS3, and sure enough it crashed. So it definitely was happening. We went back and forth for a while trying to figure out if I was doing anything different. They noticed that I had installed both DLCs, but they only saw the crash with just the second one installed.

Aha! So that was it! I went back to my desk. Reset everything. Installed the game and just DLC2. Ran the game. And then…

…it didn’t crash.

Damn. We were still missing something. I went back and forth with QA for a while. Neither of us could figure out what the problem was. Then they figured it out. For them, it only crashed if they loaded the game from a Blu-ray disc. If they loaded a test build from the hard drive, it didn’t crash.

This was something. During development we always load a test build from the hard drive. If we were to burn a new disc every time we wanted to test the game, we would burn (ha!) through a ridiculous number of discs. So this would explain why I couldn’t reproduce it.

I asked QA to burn a disc for me and send it back. I got the disc, attached a debugger to my PS3, and ran the game. It crashed.

Finally, I could start to figure out what was causing the crash.

Investigation


Debugging release builds can be difficult. You usually don’t have debug symbols that allow you to see what source code is currently being run. They are stripped out mainly to reduce the size of the executable, and release builds are also optimized so the game will run faster. Development builds run slower, but give you more information that makes it easier to see what is happening under the hood.

By the time a game gets onto Blu-ray, all this debug information is gone. It makes it much harder to understand what code is being run and what the current state of the program’s memory is. So I started to create builds that included debug information, and asked QA to create Blu-ray discs with these builds on them.

This stopped the crash from happening. I went through several Blu-ray discs before I gave up on trying to get a debug build that would reproduce the crash. I was going to have to do this the hard way.

So I spent over a day poring over the assembly code where the game was crashing. The PS3 was good at giving me a vague location in the code base, but I still had to interpret the machine code in that area.

I figured out that it was loading a shader into the render buffer. I reproduced the crash several times to make sure it always crashed in the same place, and sure enough it did. I had found where the crash was coming from.

The (Not) Fix


The renderer isn’t my area, so I talked to one of the rendering programmers who was still around. I showed him exactly where in the code the crash was happening. He looked at it and had no idea why it would crash there. Still, he would take a look and try to make a fix.

He took over and I watched the ticket as he submitted a fix, and QA checked and confirmed the bug was fixed. Curious after this, I asked him how he fixed it. “I don’t know,” he said. “I just changed it so the shader was hard coded.”

Basically, he had no idea how to fix it, so he just rewrote it so it did exactly the same thing in a slightly different way. Since it worked, and we were a few days from release, that was the version of the game that shipped.

We’ll never know exactly what went wrong here. Maybe the shader file was corrupt. Maybe it was some bizarre byte-alignment error. Maybe cosmic rays just happened to strike every Blu-ray disc as it was burned.

But it was fixed. The game was shipped. We used our bonuses and time-off-in-lieu to take a nice holiday. We lived happily ever after and didn’t wonder for years after the fact what on earth caused this bizarre, hard to debug and fix, bug.