Our continuous integration efforts came crashing down around our ears yesterday. Our build started failing on Wednesday evening, after someone checked in a ton of changes all at once (not very continuous, I know). That snowballed into a pile of issues that took Scott and me all day yesterday to unravel. Even after we'd fixed all the consequences of such a large code change (that was done maybe by noon), things continued not to work. To make matters dicier, at some point during the day our CVS server went TU and had to be rebooted. Builds continued to fail, mostly with weird timeout problems. We were also seeing lots of locks being held open, which slowed the build down, contributed to the timeouts, etc.
We ended up building manually just to get a build out, which was suboptimal but necessary.
Thankfully, Scott was able to track down the problem last night. It turns out that when the CVS box went down, our build server was in the middle of a CVS "tag" operation, a.k.a. labeling. That left a bunch of crud (stale lock files) behind on the CVS box, which made subsequent tagging operations fail miserably, thus causing the timeouts, etc. A few well-placed file deletions on the CVS server cleaned things up, and we're (relatively) back to normal now.
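For the curious: when a CVS operation dies mid-flight, it tends to leave its lock artifacts (#cvs.lock directories, #cvs.rfl and #cvs.wfl files) scattered through the repository, and later operations wait on them until they time out. Here's a rough sketch in Python of the kind of cleanup that got us going again. This isn't exactly what Scott did, and the repository path is hypothetical, so adjust to taste:

```python
"""Sketch: find (and optionally remove) stale CVS lock artifacts
left behind when the server dies mid-operation. The repository
root below is hypothetical -- substitute your real CVSROOT."""

import os
import sys

CVSROOT = "/var/cvs/repo"  # hypothetical path

# CVS creates these while an operation (e.g. a tag) is in flight;
# after a crash they linger and block later lock attempts.
LOCK_PREFIXES = ("#cvs.lock", "#cvs.rfl", "#cvs.wfl")

def find_stale_locks(root):
    """Walk the repository and yield paths that look like CVS locks."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames + dirnames:
            if name.startswith(LOCK_PREFIXES):
                yield os.path.join(dirpath, name)

if __name__ == "__main__":
    remove = "--remove" in sys.argv
    for path in find_stale_locks(CVSROOT):
        print(path)
        if remove:
            if os.path.isdir(path):
                os.rmdir(path)   # "#cvs.lock" is a directory
            else:
                os.remove(path)
```

Run it without arguments first to see what it would delete; only pass --remove once you're sure nothing is actually using the repository.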
While I think that continuous integration is a fabulous idea, it's times like these that bring home just how many things have to go right at the same time for it to work. What a tangled web we weave.