Ask YC: Your most interesting bugs / bug fixes?

tx · on June 13, 2008

The visual plotting component I was working on would suddenly stop producing output: just a black screen but everything else was normal - no leaks, race conditions, crashes, etc - just no video. The bug was so rare that only customers ever saw it, but they were pissed (serious manufacturing folks, they ran our software 24/7 on production floors)

Eventually I wrote some debugging code around it to take screenshots of the main window and compare them with a reference image about 20 times a second (after each blit), and left it running for months on an old PC.

One night I got an SMS sent by my debugging script with the call stack attached, but I knew what had happened without even reading it: it arrived 1 second after daylight saving time kicked in. The plotter didn't handle the change properly and was trying to plot data from the future, which, of course, hadn't happened yet.

qwph · on June 14, 2008

One bug I had to fix recently was in the commentary system of a PS2 soccer game (with an ancient codebase, parts of which hadn't been touched in 10 years).

The commentator would always refer to a particular player on a particular team by the name of a different player on that team. No-one had any idea why.

This issue had been knocking around for years and somehow ended up on my plate. After about 3 or 4 days of digging around and finding nothing, I finally got to the bottom of it. It just so happened that subtracting the IDs of the two players in question gave exactly 65536. It didn't take me much longer to figure out that the game's database used 32-bit integers for player IDs, and the commentary system was using 16-bit integers, so from the commentary's point of view, the two players had the same ID.

Luckily for me, I was able to fix it by swapping in a slightly-less-ancient commentary system which used 32-bit IDs instead.
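
For anyone who wants to see the collision, here is a minimal sketch. The ID values are made up; the only thing taken from the story is that the two IDs differ by exactly 65536, so their low 16 bits are identical.

```python
def to_uint16(player_id: int) -> int:
    """Keep only the low 16 bits, as a 16-bit ID field would."""
    return player_id & 0xFFFF


# Hypothetical IDs, chosen so that they differ by exactly 65536 (2**16).
striker_id = 70_001
defender_id = striker_id + 65_536

print(striker_id == defender_id)                        # False
print(to_uint16(striker_id) == to_uint16(defender_id))  # True
```

Any two IDs that differ by a multiple of 65536 collide the same way.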

bayareaguy · on June 13, 2008

I once had to fix a random memory corruption in some X windows code which used a proprietary third party library to which we didn't have the source. After linking with my own malloc/free library that would carefully add lots of guard space around every allocation I discovered that it tended to overwrite an area just past a structure it was responsible for allocating. We reported the bug to the vendor but we had a demo coming up so in the meantime we just wrapped calls to their stuff with some code that swapped in my allocator. Simply allocating 10% more than what the library asked for made the problem go away.
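
The guard-space idea can be sketched in a few lines. This is only a toy model in Python over a bytearray, not the real malloc/free replacement: the class name, the guard pattern, and the stand-in "library" function are all invented for illustration.

```python
GUARD = 0xA5     # arbitrary guard fill pattern (an assumption)
GUARD_LEN = 8    # guard bytes placed before and after every block


class GuardedArena:
    """Toy allocator that surrounds each block with guard bytes."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.top = 0
        self.blocks = []  # (start, length) of each user block

    def alloc(self, n):
        start = self.top + GUARD_LEN
        end = start + n
        if end + GUARD_LEN > len(self.mem):
            raise MemoryError("arena exhausted")
        self.mem[self.top:start] = bytes([GUARD]) * GUARD_LEN
        self.mem[end:end + GUARD_LEN] = bytes([GUARD]) * GUARD_LEN
        self.top = end + GUARD_LEN
        self.blocks.append((start, n))
        return start

    def corrupted(self):
        """Start offsets of blocks whose trailing guard was overwritten."""
        bad = []
        for start, n in self.blocks:
            tail = self.mem[start + n:start + n + GUARD_LEN]
            if any(b != GUARD for b in tail):
                bad.append(start)
        return bad


def buggy_library_fill(arena, start, n):
    """Stand-in for the third-party code: writes one byte too many."""
    for i in range(n + 1):
        arena.mem[start + i] = 0xFF


arena = GuardedArena(256)
block = arena.alloc(20)
buggy_library_fill(arena, block, 20)
print(arena.corrupted())   # [8]

# The stopgap from the story: hand out a bit more than was asked for.
padded_arena = GuardedArena(256)
padded = padded_arena.alloc(20 + 2)
buggy_library_fill(padded_arena, padded, 20)
print(padded_arena.corrupted())   # []
```

The padding only hides the overrun, which is why it was a stopgap while waiting on the vendor.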

davidw · on June 13, 2008

Ooh, those kinds of bugs are evil. I once had a bug in Rivet caused by Apache and Tcl linking to different versions of the same struct, getting confused, and stomping on some memory. It was really a bitch to track down:-/

jrockway · on June 13, 2008

This is pretty common. I used to name my submit buttons "submit", but when I started using JavaScript to do the submission, I found that form.submit() didn't work everywhere. The reason is that form fields are accessed as form.field_name, so the field named "submit" replaces the function stored in the "submit" slot. Sigh.

jfarmer · on June 13, 2008

When I was working on Adonomics there was a bug that caused certain virus scanners to claim that we were trying to install spyware on the visitor's computer.

It just so happened that there was a string on the page that matched the scanners' virus signature. It was in the HTML so between a few of the offending lines I inserted . Problem solved.

I'd never, ever seen a website that did that, let alone one that I created!

wataguy · on June 14, 2008

My company made a machine for medical labs, and I wrote an app that monitored the performance of those machines. Our basic metric was samples analyzed per hour, and my app would grab the runlogs from all the machines (we needed the data in the logs for other purposes, or I'd have done the analysis on the machines and just transferred the results), count how many samples were analyzed each calendar day, and then divide by 24. Easy, simple, and worked a treat. That is, until one fine spring Monday.

The service manager walked into my office bearing the graph of machine performance, and said there was something wrong with my script. He showed me the graph, and pointed out that most of the machines had exhibited a nearly 5% drop in productivity on Sunday. This wasn't usual; the machines either worked, or they didn't, and they were expensive enough that labs generally ran them continuously. We looked at the graph for a while, and realized that the affected machines were all in the US; the ones in Asia looked normal.

This was baffling, but I did a little checking of the data, and it looked like the script was counting samples correctly. The nearly 5% drop was real.

But the next day everything was back to normal and people stopped bugging me about this, so I just let it go, though I couldn't get rid of a nagging feeling that I was missing something.

I didn't realize what it was until six months later, when those selfsame machines exhibited a nearly 5% rise in productivity, again reported to me by the service manager on a Monday. I looked at him for a moment, then said, "Not all days are 24 hours long!" I'd forgotten about daylight saving time (not observed in Asia), which makes one day in the spring only 23 hours long, and one day in the fall 25 hours long.

I fixed the calculation by using time()s, so the script'll work for any time zone. As an added bonus, even the occasional leap second is handled correctly.
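
A rough sketch of the corrected calculation, dividing by the real elapsed time between the two midnights instead of a fixed 24. The sample count is invented, and US Eastern time is only an assumed example zone; fixed UTC offsets are used so the sketch doesn't depend on a time zone database.

```python
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5))  # US Eastern standard time (assumed zone)
EDT = timezone(timedelta(hours=-4))  # US Eastern daylight time


def samples_per_hour(sample_count, day_start, day_end):
    """Divide by the real elapsed time between two instants."""
    elapsed_hours = (day_end.timestamp() - day_start.timestamp()) / 3600
    return sample_count / elapsed_hours


# Sunday 9 March 2008: US clocks sprang forward, so the calendar day
# ran from midnight EST to the next midnight EDT.
start = datetime(2008, 3, 9, tzinfo=EST)
end = datetime(2008, 3, 10, tzinfo=EDT)

count = 1150  # hypothetical number of samples logged that day

print(round(count / 24, 2))                  # 47.92, the naive figure
print(samples_per_hour(count, start, end))   # 50.0
```

The naive figure is 23/24 of the real rate, a drop of a little over 4%, which matches the "nearly 5%" dip on the graph.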

davidw · on June 13, 2008

I love doing quick hacks to improve stuff. Sometimes it's even more fun if it's a system I'm not familiar with. Not really a 'bug', but I made Linux boot faster off of USB pen drive type devices by adding a wait queue:

http://www.welton.it/freesoftware/patches/blkdev_wakeup-2.6....

I don't know if it ever made it into the official kernel, as I got tired of prodding those guys to either accept it or reject it (and in the meantime they put in some lame hack telling the kernel to wait N seconds before proceeding, in the hope that that was long enough to mount the device), but a number of people have thanked me for it over the years.

BTW, I'm in awe of people who spend all their time hacking kernel stuff. It's hard work, and for many things, you risk crashing your whole computer if things go wrong.

pmjordan · on June 13, 2008

At my game development job we got some incredibly hard to reproduce crashes every now and then on the game that was supposed to be shipping soon. We were using Lua for scripting, and the stack traces showed that it was happening somewhere in Lua. This was on the Wii, so dropping to a debugger is only possible if you happen to be running the right build on the right hardware attached to a PC running the right software. Which meant not in the QA department.

Not that a debugger helped that much after they managed to get a fairly reliable but convoluted repro. It turned out that really stressing the scripting system with certain patterns would cause the crash to happen much more frequently, so I could see it in the debugger. It didn't help all that much: it was an apparently random memory stomp, and by the time it crashed, it was much too late to tell where it came from. I forget how I figured this out, but I eventually managed to narrow the cause down to garbage collection runs. Now, GC runs were periodic, but consoles place hard limits on memory. You can't just swap to disk when the going gets tough, so we had memory budgets for each game component, including the scripting system. So if scripts got particularly greedy, they'd run out of memory before the next GC run.

Now, as the memory limits were hard, some clever sod had put a GC call in the malloc hook that supplied Lua's memory, to run when there was no memory available (and the game would otherwise have crashed) - no doubt in order to fix an earlier bug. Most of our scripts didn't create hash tables, arrays, and strings frequently, so this bug hadn't been a big enough problem for what must have been years. In Lua, those types of objects require two allocations, one for the base object and one for the data storage. You can see where this is going.

If Lua ran out of memory halfway through creating a hash table, array, or string, that is, after successfully creating the base object, but failing on the data store, it would trigger a GC run. Thankfully this was actually not that hard to hit, as the data store memory generally was way bigger than the 16 or so bytes used for primitive types (i.e. base objects, numbers, ...) so the probability of not having enough contiguous space was much higher than not having a 16-byte slot. In any case, the hash table (etc) constructor had of course not returned yet, and therefore there were no references to the hash table object yet, and it promptly got collected. The freed memory was then initialised as a hash table and returned from the constructor anyway, and it was just a matter of time until another allocation wrote straight over that. Not just any allocation of course, as re-allocating it as a (legal) primitive type wouldn't have caused a crash.

The fix was of course easy once the cause was known: don't put the base object in the allocated list for GC consideration until the whole object has been assembled.
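
To make the sequence of events concrete, here is a toy model of it. This is not how Lua's allocator or collector actually looks inside: the heap class, the sizes, and the function names are all invented, and "collection" is reduced to freeing anything tracked that no root refers to.

```python
class ToyHeap:
    """Fixed memory budget plus a collector that runs from inside alloc()."""

    def __init__(self, budget):
        self.budget = budget
        self.used = 0
        self.tracked = []  # objects the collector is allowed to free
        self.roots = []    # objects the running program still references

    def collect(self):
        reachable = set()
        for root in self.roots:
            reachable.add(id(root))
            if "store" in root:
                reachable.add(id(root["store"]))
        survivors = []
        for obj in self.tracked:
            if id(obj) in reachable:
                survivors.append(obj)
            else:
                obj["freed"] = True
                self.used -= obj["size"]
        self.tracked = survivors

    def alloc(self, size):
        if self.used + size > self.budget:
            self.collect()  # stand-in for the GC call in the malloc hook
        if self.used + size > self.budget:
            raise MemoryError("out of script memory")
        self.used += size
        return {"size": size, "freed": False}


def new_table_buggy(heap, store_size):
    base = heap.alloc(16)
    heap.tracked.append(base)               # collectable before anyone refers to it
    base["store"] = heap.alloc(store_size)  # may run collect(), which frees base
    heap.tracked.append(base["store"])
    return base


def new_table_fixed(heap, store_size):
    base = heap.alloc(16)
    base["store"] = heap.alloc(store_size)
    heap.tracked.append(base)               # tracked only once fully assembled
    heap.tracked.append(base["store"])
    return base


def demo(constructor):
    heap = ToyHeap(budget=100)
    heap.tracked.append(heap.alloc(80))     # unreferenced garbage awaiting collection
    table = constructor(heap, store_size=64)
    heap.roots.append(table)
    return table["freed"]


print(demo(new_table_buggy))  # True: the caller gets an already-freed object
print(demo(new_table_fixed))  # False
```

In the buggy path the caller ends up holding memory the allocator considers free, which is exactly the setup for a later allocation to stomp on it.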

Took me days. And I wasn't even the first person to be assigned the bug, it was one of those hot potatoes that went round all the senior people until it landed on the junior tech programmer's list. (mine)

I looked in the checkin history for the malloc hook, and they had shipped at least one game with that bug in. (records didn't go back far enough to rule out the game before) If you figure out what scripts to trigger repeatedly, you can make that game crash.

I can't really blame just one person for this. Putting the GC call in the malloc hook was thoughtless. Maybe I would have done the same without checking that it was safe. In Lua itself, that was a pretty careless way to handle object creation given that the malloc hook is user-defined, so Lua has no control over what goes on in there.

More bedtime war stories another time.

rtf · on June 13, 2008

On the game project I'm working on (as a designer) we use an internally developed scripting language that was originally meant for simple timelined cinematics. It was, according to the author, made in one day.

Each line gets a time in seconds followed by a call on some entity: hello world is "0.0 Script:Print("hello world")". The syntax quickly gets verbose from there - and no multi-line statements allowed!

Some of the flaws of this language:

-Miserable computational abilities. Until just this morning we were unable to add two numbers together. The main limitation is poor access to information - getting things like the x position of an entity requires a language extension - an easy one, but when it's not there....

-A vague notion of types. Variables attached on entities are typically enclosed in quotes: "true", "100", etc. But in arbitrary instances you can also use space-delimited vectors, for example: Script:Goto(100) or Entity.SimpleMovableInst:SetPosition(0 0 0). If you use the wrong thing, you will be lucky to get an error message. Usually it just silently fails.

-Lexical macros. These were added in order to cut down on the number of custom scripts we were using: instead of cut-and-pasting "switch1...door2....etc." variables we could say "$[target]" and it would be replaced at runtime with the contents of variable "target" on that entity instance. But they have the effect of giving a lot of false errors in the cases where we don't need to use those variables on a rich, multi-purpose entity. Worse, if I forget to add a variable necessary for the script, that instance will fail and I may not realize why.

-The parser doesn't recognize delimitation of strings, among other things. Putting a colon inside a print statement causes an error message.
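
To show what recognizing string delimiters buys you, here is a minimal sketch of a line parser that treats quoted text as a single argument, so a colon inside a print statement is harmless. The grammar is guessed from the two example calls above and is an assumption, not the real language's rules.

```python
import re

# Assumed line shape: <seconds> <Entity.Path>:<Method>(<args>)
LINE_RE = re.compile(r'^\s*(\d+(?:\.\d+)?)\s+([\w.]+):(\w+)\((.*)\)\s*$')
# An argument is either a double-quoted string or a run of non-space characters.
ARG_RE = re.compile(r'"([^"]*)"|([^\s"]+)')


def parse_line(line):
    """Split one script line into (time, entity, method, args)."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"cannot parse script line: {line!r}")
    time_s, entity, method, raw_args = match.groups()
    args = [
        arg.group(1) if arg.group(1) is not None else arg.group(2)
        for arg in ARG_RE.finditer(raw_args)
    ]
    return float(time_s), entity, method, args


print(parse_line('0.0 Script:Print("score: 3")'))
# (0.0, 'Script', 'Print', ['score: 3'])
print(parse_line('1.5 Entity.SimpleMovableInst:SetPosition(0 0 0)'))
# (1.5, 'Entity.SimpleMovableInst', 'SetPosition', ['0', '0', '0'])
```

Raising on an unparseable line, instead of failing silently, also addresses the missing error messages mentioned above.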

It is no surprise that endless grief has resulted from these various flaws. I fixed bugs resulting from all of the above just today.

It works fine for timelines, but I'm hoping to get Scheme (or something with similar potency) for the next project.

LogicHoleFlaw · on June 14, 2008

Lua is designed for exactly the type of work you describe here. I wrote a Lua interface for The Nebula Device (Open-sourced 3d game engine from Radon Labs.) It took about a weekend, starting from scratch. Interfacing to C or C++ code is dead simple. And the language itself draws a lot of inspiration from Scheme without abandoning a more familiar syntax. You get first-class functions, closures, lexical scoping, coroutines, and a clever "metamethod" system for defining object behavior. It's an extremely quick and efficient implementation written in strict ANSI C, and it works well in constrained environments like games. It's quite popular in both console and PC games for higher level functions on top of a low level engine. Give Lua a look-see!

rtf · on June 14, 2008

I integrated Lua once for a project of my own. But on this project the choice wasn't mine - we ported over existing technology to a new console to meet a five-month ship deadline.

Also, Lua is imperfect for games: Squirrel is considerably more attractive because it is designed to work in real-time situations (no GC interference). I also had issues with Lua's error handling.

toxik · on June 14, 2008

And now with Lunatic Python, you can even use Python from Lua or Lua from Python, two-ways! So you can do Lua-in-Python-in-Lua-in-Python if you have that desire.

Try it out, it's darn fun: http://labix.org/lunatic-python

pm · on June 14, 2008

Seconded. Lua is especially popular for games, and for good reason. However, I've had peculiar bugs when interfacing Lua with C code on occasion (related to my own incompetence, not Lua itself), so do be aware of where you place your code.

tlrobinson · on June 14, 2008

Debugging VHDL on an FPGA (which ran correctly in a simulator) is, uh, fun. I don't remember the details, but it was a single character fix.

Also, debugging MS Office file format generating code is painful when the only feedback you have is Office giving you "corrupted file" errors.

thaumaturgy · on June 14, 2008

I'm ashamed to admit that I've still got a hefty little piece of VBScript at a grocery store that does some price & cost number crunching for them; it features a very funky two-stage inner loop which will occasionally produce random numbers if you feed in very large data sets. Two different pretty sharp people have spent a lot of time trying to track it, and nobody can find it. One of them looked at it and about cried when he realized how it was supposed to work.

Not too long ago, I had to figure out a timing issue in some JavaScript for a photo gallery. It's about as multi-tasking as JavaScript can get, and the code has its own task manager. Each new task was supplied with a UID, which was just the number of milliseconds since 01/01/1970. Occasionally, however, a task wouldn't fire. It would get queued, and then disappear from the queue, but never run. Every time I set breakpoints in the script, used alerts, or otherwise examined the code as it ran, everything worked fine.

It took me most of an afternoon to realize that occasionally two different tasks were being created within the same millisecond, and getting the same UID. Subsequently, one of the tasks would stomp on the other one. I kicked myself for knowing better than that and wrote a more reasonable UID hash.
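
A small sketch of the collision and of one possible fix. The original code was JavaScript and the actual replacement was some kind of UID hash; the counter suffix below is a swapped-in alternative, and the task names and timestamp are made up.

```python
import itertools


def timestamp_uid(now_ms):
    """The scheme from the story: the UID is just the millisecond timestamp."""
    return now_ms


class UidFactory:
    """Timestamp plus a running counter, so same-millisecond UIDs differ."""

    def __init__(self):
        self._counter = itertools.count()

    def make(self, now_ms):
        return f"{now_ms}-{next(self._counter)}"


now_ms = 1_213_400_000_000  # both tasks are created within this one millisecond
tasks = ("load-thumbnail", "preload-next")  # hypothetical task names

colliding_queue = {}
for name in tasks:
    colliding_queue[timestamp_uid(now_ms)] = name  # second task overwrites the first
print(len(colliding_queue))  # 1

factory = UidFactory()
safe_queue = {}
for name in tasks:
    safe_queue[factory.make(now_ms)] = name
print(len(safe_queue))  # 2
```

It also explains why breakpoints and alerts hid the bug: pausing between the two task creations pushed them into different milliseconds.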

Years ago, one of the first languages I got to play with was Pascal, on a pre-7.5 Mac OS. I was using THINK Pascal at the time, and having fun, but occasionally my program would bomb out in really interesting ways. It took me a long while to trace the bug to a problem in the compiler, where a struct that was supposed to be 4 bytes long was getting loaded on to the stack in 6 bytes. Usually this wouldn't be an issue, except that one of my functions was popping the struct off the stack as a longint, instead of as the actual struct. Whoops.

But, probably the most fun "bug" I ever figured out wasn't really a software issue at all. I worked in the data processing department of a Bay Area school district. Despite being one of the better school districts in the area, and having enough spare change for student A/V programs and the like, a lot of their systems ran on an ancient Unisys mainframe, COBOL74, and reel tapes. (Hey, it worked. Unisys mainframes have incredible recovery abilities.)

The reel tapes had a little reflective strip near the end of the tape, and the tape drive had this optical sensor that would pick up the reflection as the strip flew by, and dismount the tape.

It was the damnedest thing -- sometimes, usually in late Spring or early Summer, a tape would be in the middle of some operation, and it would dismount, even though it was nowhere near the end of the tape.

I was sitting in the room one day when it happened, and turned around in time to see a perfect little sunbeam sneaking through the blinds and shining right on the drive's optical sensor, fooling it.

CRASCH · on June 13, 2008

Some of the more interesting bugs are the ones not in your code.

I decided to list the ones outside my code that I've found in the last year or so. These are windows / .net bugs.

1. When you add a program to run at startup to the registry and its name has a space in it, like "CoolCompany CoolSoftware.exe", it will also launch the program in the same directory called "CoolCompany ReallyCoolSoftware.exe".

2. The FILETIME data type in .net is mis-defined as signed. This leads to conversion problems that were really hard to track down. File times would magically change in certain conditions by a few seconds to a few minutes. This was still in the 3.0 version which is the last I checked.

3. .net Socket library pins the buffer in memory to prevent garbage collection and allow the call to not lose the buffer. Unfortunately it never unpins the memory. This looks like a memory leak and fragments the heap badly.

4. The split control width is reported differently on XP and Vista. Hiding the split on one platform will give a width of zero. On the other one it reports the width.

5. The zip lib on .net CLR ends up with much worse compression than a properly implemented zip lib. Ironically the J# zip lib works well.

6. The windows setup in visual studio release build fails to install required components like .net. The debug build will?

7. The .net file and directory objects don't support long paths. A user can easily create a path longer than the limit by dragging another user's directory to their desktop. These files will (un)conveniently be unreachable by any .net app not making direct win32 calls.

8. I don't even want to think about the threading model in .net with delegates (they don't always work) etc. Bonus they work differently on Vista yea! Give me a semaphore, a mutex, etc.

9. Vista UAC is such an embarrassment. I'm completely astonished it got released. In my opinion everyone who even saw it should be fired. The design is worse than privilege elevation designed over a quarter century earlier. The implementation is so bad any still sane person that uses their computer for more than light work will turn it off.
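
On item 2 above, here is a sketch of one way a signed definition can skew a file time. A Win32 FILETIME is two 32-bit halves that together count 100-nanosecond ticks since 1601. The sketch assumes the low half is read as a signed 32-bit integer before the halves are combined; whether that is the exact mechanism behind the bug described is a guess, and the sample values are invented.

```python
def to_int32(dword):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return dword - (1 << 32) if dword & 0x80000000 else dword


def combine_unsigned(high, low):
    """Correct: both halves are treated as unsigned."""
    return (high << 32) | low


def combine_signed(high, low):
    """Buggy: a low half with its top bit set turns negative."""
    return (to_int32(high) << 32) + to_int32(low)


TICKS_PER_SECOND = 10_000_000  # a FILETIME tick is 100 nanoseconds

high, low = 0x01C8CD00, 0x90000000  # made-up sample halves, top bit of low is set

correct = combine_unsigned(high, low)
wrong = combine_signed(high, low)
print(correct - wrong)                       # 4294967296
print((correct - wrong) / TICKS_PER_SECOND)  # roughly 429.5 seconds, about 7 minutes
```

Under this assumption the error only appears when the top bit of the low half is set, which would make it look intermittent.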

Windows aside, if anyone is developing a commercial app on Windows, I strongly recommend writing directly to the win32 API. If you try to implement anything non-trivial in .net (like I did) you may end up with a huge rewrite. In my opinion the convenience of .net is just not worth the loss of control.

I'm in mid port to a native code version of my application. I've lost 95% of the re-distributable size and increased performance by 50%.

Seriously though. I think there is a general problem with programming.

Part of the problem is the workaround. There is a certain amount of pride / respect someone gets for figuring out a workaround to a tough problem. The issue with this is that it discourages the fixing of the root problem.

The other part of the problem is consistency. "I don't care what the interface is as long as it is concise and consistent." I often feel like a turn of the century mechanic trying to build a car from a box of randomly sourced parts. At the turn of the century there were not any standards for important things like bolts. Everything varied from thread width to thread angle to thread spacing. The situation with non standard mechanical parts is pretty close to what we have today with software. Sometimes I think all I do is write code to convert data from one type to another. All to appease the requirements of the various APIs I need to use.

Someone will probably note that I'm on Windows and X is so much better to code to. I will preemptively and completely agree. I typically hack on my Mac and my grid runs on Linux boxes. I need a Windows component for market reasons.

Bonus!!! The Unknown or expired link when posting to Hacker News after typing for a long time.

xirium · on June 14, 2008

> Bonus!!! The Unknown or expired link when posting to Hacker News after typing for a long time.

Definitely. I hit it after reading your comment. If you ever want a technique to discourage long and insightful comments, this is it.

xirium · on June 14, 2008

> Sometimes I think all I do is write code to convert data from one type to another.

You'd be good at ETL.