Hunting the Pointer Wumpus

First, a bit of scheduling news: the monthly update of Clockwork Empires, revision 40, will be released next week. Due to E3 and a giant Steam sale there’s not a lot of point in releasing anything this week.

Last time I wrote a blog post, I had written half of the Clockwork Empires codebase in order to fix various issues with autosaves being reported by users. Naturally, these things never happen when I am playing the game; instead, we got DMP files and sorted through them. (I wrote a little in house tool to process these for me; now I can run a batch file and it runs all the DMP files we get through a little script which automatically outputs their call stacks and other pertinent information.)

The moment of truth.

The moment of truth.

To try and fix these issues, I rewrote the entire jobs system in order to move everything into nice, flat arrays, and to get rid of direct references to pointers. My original hope was that this would fix the save game crashes we were experiencing, which were due to various instances of dead pointers to jobs being left in Lua space. Well, it fixed *some* of those crashes. Once Revision 39A came out, we got a bunch more crash reports: various people’s assignments were crashing, due to assignments being held in Lua after they were dead. Well, it wasn’t part of my original plan, but they got the same treatment – assignment handles are now first class citizens, and we can now actually tell if an assignment handle exists or not, and is deleted or not. Starting from Revision 39B, all the workshop assignments were also moved over to a new system where they were controlled by C++, and we were able to remove huge chunks of broken Lua code. Now the workshops were starting to feel good, in a way that they hadn’t felt since… well, since we first put them in – but yet, we still had save game crashes.

Crashes!

Crashes!

Ugh.

What’s even weirder, of course, is that the save game crashes seem to be saving a building footprint – the little data structure full of points that is used to regenerate the shape of workshops and houses when you draw them. What the heck was going on?

So at first, I couldn’t figure out what was going on. I could draw buildings, save buildings and load buildings just fine, and nobody could reproduce the crash. Revision 39C, therefore, was spent ignoring the autosave crash. What changed is that somebody sent me a replay for something *else* – some sort of stockpile related issue, I believe. The replay system is still not perfectly deterministic – sometimes, somehow, determinism will fail, and the replay doesn’t match up perfectly on my machine. We’re still sorting this out, and this is one of the major stumbling blocks towards getting multiplayer in. I got it on my computer, I left for lunch while it replayed 100,000 frames of Clockwork Empires save game… and what happens when I come back from lunch? Well, wouldn’t you know it – it had crashed saving the game. Furthermore, it had crashed saving a building footprint.

Operating the debugger.

Operating the debugger.

Finally, I thought to myself, a repro case. Quickly pulling up the watch window in Microsoft Visual Studio, which is used to inspect the state of variables in the program when debugging, I saw the problem: the array of points thought it had, uh, 2,576,320 points in it. Yikes.

First order of business, see if it was reproducible. Quit the replay, run the replay again. Fortunately, it showed up after only about 20,000 frames of the game. Unfortunately, it only showed up sometimes, which puts it in the category of “heisenbug” – a technical term which refers to a bug which stops appearing when you’re looking for it. When you get one of these, you cannot change anything else in the program, or you will lose any hope you have of ever seeing it again – you are now stuck, working on this problem, until you solve it (or are a mewling, quivering wreck.) A bit more inspection revealed that the pointer at the *start* of the array of points (pointing to the first element in the array) was set to NULL, meaning that something *else* had trashed it. That something else could be anywhere else in the program.

At this point, I tried my luck re-running the application with AppVerifier enabled to see if it would turn up anything interesting. Sometimes you get lucky; AppVerifier lets you set guard pages around heap allocations, and occasionally, the thing that is misbehaving will write to one of those guard pages and it will instantly throw an exception at the point in your program which is misbehaving. So, it’s worth a shot – but no dice.

After gently poking the code some more, and sort of randomly guessing at things that might be broken, I decided there was really only one option left. I wrote a little C++ function called ValidateBuildings(). What it does, when you call it, is it runs through every building in the game to try and make sure that its Points() array is valid. If it’s not valid (if the size of the points array is over, say, 1000 points), we throw a breakpoint. So, I put it in, and… ugh, the program starts working again and not crashing.

At this point, you just curse and swear a bit, and hope that it starts crashing again.  Sometimes, the nature of computer memory allocation is such that a random heap corrupting thing will just start writing into different areas of program memory that do not, obviously, affect the state of the program. Fortunately, after a few more run it started crashing again, and I started sprinkling the code with the ValidateBuilding() all over the place. The first place I put it was in the message broadcasting loop, which sends the Update messages to everything in the world that is supposed to be updated. Sure enough, it caught the problem, after a citizen updated. I ran the code again, and it caught the problem after a module updated. Conclusion: it’s not the gameplay thread causing this bug, it’s something on the renderer side.

I sprinkled the renderer full of ValidateBuildings() calls, and none of it really seemed to narrow it down much. The renderer and the game are both large, large functions, and it’s hard to figure out what goes where, when and how. Finally, I decided to try something different – I started dumping consoles from various runs that crashed or ended in a crash, trying to figure out what the last thing was before ValidateBuildings() failed.

The bug! The horror.

The horror! The horror.

It seemed like the culprit, every time, was when a workshop pickled Black Fungus. My first thought was one of horror – was it the new workshop code? Drat! But, no, it was something on the workshops. I grabbed the script, and started trying to figure out what calls to the renderer were being made from the workshop script object. The culprit? The blasted Machine Expression message had a use-after-deletion in it; a rare condition, but one that could happen if the renderer was running at certain speeds, or under certain conditions. That bug had probably been in there for ten versions of Clockwork Empires, and only now did it get caught. A one line fix, and the rendBuildingFootprint crash went away (and, consequently, the crash when autosaving which started all of this went away.)

Total time for this one line fix: I started this Friday, and finished it early Sunday morning – so, several hours.

What ended up being the key to tracking it? All three debug files, working in tandem. We used the DMP to figure out that the building footprints were crashing. Somebody gave me a replay for another, unrelated thing, which just happened to crash in the right place (the same place the game was crashing for everybody else, but not me); the console was what eventually got used to track down the exact point in the renderer that the game crashed, once I had a nice function added to the renderer to automatically detect when the save game would crash. It took all three of these little pieces of debug information to finally figure it out.

As a consequence of all this rewriting, most of the work I’ve done on this month’s patch is invisible. Revision 39E generated a bug report from one of our users, which was that “In limited testing, I’ve discovered that I have to quit and save when it is time for me to head to bed rather than being able to rely on a crash to break me free of the game.” This is a good sign – when users start complaining about the game being too stable AND keeping them up all night, you know you’re doing something right!

Otherwise, I’ve been dealing with small niceties – making sure various controls work correctly, making the module and building creator feel and look better, fixing text rendering issues, that sort of thing. Most of what I have done in this patch isn’t actually on my sprint, so I guess it’s been one of those months, but the game is much more stable and feels much more polished as a result – hopefully folks will like it.

Explore the ruins, they said.

Explore the ruins, they said.

Posted in Clockwork Empires | Tagged , , , , , , , , ,
7 Comments

7 Responses to “Hunting the Pointer Wumpus”

  1. Abudefduf says:

    As someone who works on codes that are less obviously engaging than what you guys work on, I really appreciate reading these kind of stories. It’s like… we’re all in this together.

    { reply }
  2. Daniel says:

    Any chance of an option for a Civ IV style clock in the corner? You know, to prevent players from dying at their computer desks from playing too long.

    { reply }
  3. JoshJers says:

    Reminds me of an issue I debugged at one of my previous jobs:

    We had memory that typically ended up being used for color palettes that would, occasionally, crash with a heap validation error on free.

    Turns out that whole area of memory was getting trashed after > 24 hours of running the program, but only sometimes (about a 2% repro rate, which is awful).

    After a few weeks of getting very scattered reports of this and being unable to repro it myself (or catch someone’s repro while it was still running), I finally bagged a repro on my own machine!

    Looking at the heap, the pattern looked very regular and I thought “I wonder if it’s a bitmap?” So I copied the memory bytes and slapped a bmp header onto the top of it (educated-guessing at the width/height based on the pattern and the size of the corruption) and sure enough, it was a rectangular image that said “OK” in the middle of it.

    Yep, my memory was getting trashed by a rogue OK button (When discovering this, tired and hopped up on caffeine, I laughed at this like it was the funniest thing I had ever seen in my life).

    It was easy enough, once I knew WHAT the culprit was, to find the memory free that wasn’t properly notifying the UI subsystem that its memory was free and fix it, but it was a long time to get to that point.

    So, yeah. I hate memory corruption heisenbugs.

    { reply }
  4. Antsoldier says:

    Wow, thats the best thriller I have read in years!

    { reply }
  5. Skrylar says:

    I can’t help but think a linux build being run under Valgrind would find these things faster. I used to write unit tests to run with memory checking mallocs and it really killed the amount of heisenbugs I had to deal with in projects.

    { reply }
  6. Bettik says:

    I only just got around to reading this and it was excellent. Thanks for going into the nitty gritty sometimes, makes me feel a warm and glowy sense of empathy.

    { reply }
  7. Ingmar says:

    valgrind will probably slow the game too much down, but at least heap-use-after-free can also be found by AdressSanitizer (both clang and gcc variants) with much less slowdown (only 2-3x slowdown vs. vanilla)

    I’m working in an embedded software shop, and I try hard to get everything runnable also on Linux in addition to whatever embedded OS we are targetting for each piece of code, just to be able to use AddressSanitizer and LeakSanitizer.

    { reply }

Leave a Reply

Your email address will not be published. Required fields are marked *