May 9, 2010

What If the Rules for Team Communication Aren’t Enough?

Effective software development teams have defined communication paths that help guide how projects are run. One important path governs what to do when a customer finds a defect in the software. The bug gets reported, logged, prioritized. The priority of the bug determines when it will be evaluated and fixed.

But what about when the rules don’t apply? What if a low-priority bug is frustrating a high-priority person – for example, an executive under pressure to deliver who may unreasonably question the project design and has the political means to damage the overall project? Suddenly it’s time to escalate and go outside normal protocol.

This is exactly the situation that arose this week. I’ll tell you how we handled it, but first some background.

A pilot program under threat

One of our development teams is piloting a recently completed large application. The pilot was going well, or at least it was until I got a call from our client’s Program Manager. This Program Manager is as steady as a rock, so if he calls and says he’s concerned, I sit up and pay attention. On this call, he was concerned.

The issue was that two of the pilot users were having a problem with the application and were making their frustration known to a lot of people, including the CIO. Although the problem had been logged with the development team, it did not get a high priority based on its description. The development team was focused on completing an iteration, so this problem was on a list of bugs that would be considered for a future iteration. Standard development methodology here.

The symptom that the two users were experiencing was this: a request to the application that should normally take less than two seconds was taking 45 seconds. As you can imagine, they found this troublesome.

Redirecting the team

After I got off the phone with the Program Manager, I immediately contacted the manager of our development team in Eastern Europe and explained the situation. We agreed to redirect the development team and the QA team to fixing this bug right away.

Naturally, when our engineers tried to replicate the problem at their lab, they couldn’t. The application performed perfectly.

It wasn’t until they gained remote access to the pilot users’ computers that they could see the problem. What did it turn out to be? The local database for the application wasn’t indexing properly, so some queries to the database were taking much longer than they should. Once the developers identified the source, they were able to fix the problem quickly by running a simple script. We included the script in a later patch to make sure the problem would not resurface, and we reviewed our release procedures and patch creation process to look for anything we could improve in a future release. Finally, we checked back with the two pilot users to make sure they were satisfied with the software.
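
For readers who want to see what an indexing problem like this looks like in practice, here is a minimal sketch in Python using SQLite. It is an illustration only: the post does not say which database the application used or what the fix script contained (it may have rebuilt an existing index rather than created one), so the `orders` table, its columns, the index name, and the `query_plan` helper are all hypothetical.

```python
import sqlite3

# Hypothetical schema: the real application's database, tables, and
# columns are not described in the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10000)],
)


def query_plan(connection, sql, params=()):
    """Return the human-readable detail text from EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail string is the last column of each plan row.
    return [row[-1] for row in rows]


lookup = "SELECT total FROM orders WHERE customer_id = ?"

# Without an index on customer_id, SQLite has to scan the whole table.
plan_before = query_plan(conn, lookup, (42,))

# Stand-in for the "simple script": make sure the index exists.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders (customer_id)"
)

# With the index in place, the same query becomes an index search.
plan_after = query_plan(conn, lookup, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing the query plan on a lab machine with the plan on an affected user's machine is one way a problem like this can stay hidden: the lab copy has a healthy index, so the same query looks fine there.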

What’s the takeaway?

Once the developers fixed the problem and restored the users’ applications to their normal state, their job was done. As for me, I still had plenty of work to do to reassure the Program Manager, our sponsor, and the CIO that the problem was isolated, that it was simple, and that it did not reflect any fundamental underlying issues with the application.

What does this experience tell me? I don’t think it’s that we should change our procedure for fixing bugs. My takeaway is that even if a development team has a good set of procedures and communication policies in place, some situations will arise that require escalation. In these cases, the Client Manager has to change the team’s priorities and put all its energy into fixing the problem immediately. The experience also shows the importance of an effective escalation path, so that the client always has someone to go to when conditions are extraordinary.

In the grand scheme of the project, this kind of escalation is unproductive. The problem the two pilot users were experiencing would have been resolved in due course. Pulling an entire QA team onto one bug for 36 hours prevented other issues from being worked on, so there was likely a small schedule hit from redirecting the team quickly, then redirecting it again once the bug was resolved. But sometimes, despite the productivity hit, it’s necessary to parachute in a SWAT team so that the client can manage internal expectations and politics. We understand this is the reality for our clients, and we work to support them as best we can. But for everyone’s sanity and for the continuity of the project, we try to keep these fire drills to a minimum.