Written by Michael Hall
On this Halloween, we bring you the story of a high school, a mysterious recurring network issue known as the Phantom of Fifth Period, and an admin who realized the power of a community watch group.
With grid computing, P2P networks and the Amazon Mechanical Turk, it seems like everything is distributed these days. Everything, that is, except help with your networking woes. Unless you take advantage of the distributed troubleshooting network running right under your nose. I learned all about this thanks to the Phantom of Fifth Period.
My first job out of the army, where I’d spent four years running and troubleshooting FM battle networks, involved something a lot hairier than freezing up on mountaintops in Korea for weeks at a time: I had to act as a one-man technical support operation for a high school.
One day I came back to my office to find a blinking voicemail light and an inbox with a dozen messages. They had all come in within 10 minutes of each other, and they all reported that the school attendance database had mysteriously stopped working, then just as mysteriously begun working again.
The network I had inherited was a pretty messy affair. It had gone up in the early ’90s, kludged together from cut-rate equipment. Sticking your head above the ceiling tile to get a look at a hub (we weren’t even running switches) usually involved getting tangled up in a ball of cable someone had just wadded up and stuffed up over a light fixture. The previous database specialist had decided to save the time of updating the database application by letting users run the entire 20 MB binary over the network. Dropouts and molasses speeds were just part of the game.
So when all these calls came in about a phantom dropout that fixed itself, I deleted them, confident that, whatever it was, we’d fix it when we got the budget approved to move to a switched network.
Imagine my surprise when I got another dozen calls the next day. I rolled my eyes, called the teachers back and promised to look into it, then I checked my outbox to make sure I’d filed that request for better hardware. Then the calls came again the next day. And the next. And on the following Monday, there were fewer calls, but only because a teacher who didn’t call bumped into me in the hallway and told me she was tired of calling about the problem.
So I grabbed a stepladder and began visiting rooms, prodding around the dusty ceilings, making a show of trying to fix the problem. The two bits of information we had to work on, which took a few days to isolate, were that the problem happened every day about ten minutes into fifth period — while the teachers were taking attendance after the tardy bell had rung — and the problem was happening on two floors of one wing of the three-story, three-wing building.
My first thought was that the hallways in question were unusually busy with classes that period, and that perhaps there was a better mix of teachers on prep periods (and hence taking no attendance) on the other floors and wings. I even ran a query on active classrooms through the database, confident that must be it … too many teachers hitting our fragile network and clobbering the four hubs that serviced them. That wasn’t the case, though.
I sat in classrooms eyeing my wristwatch and reloading Web pages to time when the connection would drop out, and began to notice that there was an eerily predictable regularity involved. At 2:15, every single day, the affected computers would drop off the net, and they’d regain their connections about a minute later.
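The wristwatch-and-reload routine can also be left to a script. What follows is only a minimal sketch of that idea, not what was actually done at the school: the names probe, find_outages and monitor are hypothetical, and it assumes the machine running it can normally open a TCP connection to some known host on port 80.

```python
import socket
import time
from datetime import datetime


def probe(host, port=80, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_outages(samples):
    """Collapse (timestamp, ok) samples into (start, end) outage windows.

    start is the first failed sample of a run; end is the first successful
    sample after it, or None if the link was still down at the last sample.
    """
    outages = []
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts
        elif ok and down_since is not None:
            outages.append((down_since, ts))
            down_since = None
    if down_since is not None:
        outages.append((down_since, None))
    return outages


def monitor(host, duration_s, interval_s=5.0):
    """Probe host every interval_s seconds for duration_s seconds."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((datetime.now(), probe(host)))
        time.sleep(interval_s)
    return find_outages(samples)
```

Run across a school day on a machine in the affected wing, something like monitor("some-server", 8 * 3600) would hand back the start and end of each dropout, which makes a same-time-every-day pattern easy to spot.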
From there, it was pretty easy to at least physically observe the hubs on those halls (as easy as it could be, dodging balled-up cable), and it became clear that the hubs were all fine, but they were losing their uplink at 2:15 p.m. every day, too. So the next step was the router that serviced those hubs, which was in a locked closet. The district support people and I got a key, got into the closet, and learned that it wasn’t a closet at all. It was a large room with a very large air handler. Sitting in a rack next to the air handler, sharing an outlet, was the router. I glanced at my wristwatch, and we stood around waiting for 2:15.
At 2:15, the air handler shuddered to life, the lights in the room flickered, and the router’s lights dimmed, then dipped out — just for a second. Then it commenced to reboot. About a minute later, it gave an OK light, and all its link-lights were active.
The air handler was on a timer, running some sort of cycle every day at 2:15. Fifth period. While the teachers were taking attendance.
We put the router behind a line-conditioning surge protector, and the Phantom of Fifth Period never came back.
So what’s that got to do with distributed support networks?
About a week into the continual support calls, I realized there was just one of me vs. dozens of teachers. A lot of them were reporting the same problems due to our creaky network. Calling them all back, triaging their problems, and reassuring entire hallways’ worth of people when a hub broke or something like our Fifth Period Phantom manifested was impossible.
So I walked the halls one afternoon looking for the teachers on each floor and each hallway who always seemed to have done plenty of their own troubleshooting before calling me. They were deputized as “hall captains,” and they were given a simple form to fill out when a problem came up. All the other teachers on their floor were told to go to them when a problem occurred. The hall captains could help with common user errors and “reboot it and try again” problems, and when it came to widespread problems they were in a better position to collate the complaints and pass them along, offering a much easier-to-parse overview of where the problems were occurring. My inbox was considerably decluttered, the users got better service and we were able to smooth out problems much more quickly. All because we took advantage of the intelligence distributed throughout our small organization instead of relying on one point of failure (that’d be me … not that flaky router).
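For anyone who would rather keep the hall captains’ forms in a file than on paper, the collating step might look something like the sketch below. It is purely illustrative: the Report fields (wing, floor, room, period, symptom) and the three-room threshold for calling a problem widespread are assumptions, not a description of the form we actually used.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """One line from a hall captain's form (all fields are hypothetical)."""
    wing: str
    floor: int
    room: str
    period: int
    symptom: str


def collate(reports, widespread_at=3):
    """Group reports by (wing, floor, period, symptom).

    Returns only the groups reported from at least widespread_at distinct
    rooms, largest first, each as (key, sorted list of rooms).
    """
    rooms_by_key = defaultdict(set)
    for r in reports:
        rooms_by_key[(r.wing, r.floor, r.period, r.symptom)].add(r.room)

    widespread = [
        (key, sorted(rooms))
        for key, rooms in rooms_by_key.items()
        if len(rooms) >= widespread_at
    ]
    # Most-affected groups first; ties are broken by key for a stable order.
    widespread.sort(key=lambda item: (-len(item[1]), item[0]))
    return widespread
```

A one-off complaint from a single room stays with the hall captain, while the same symptom turning up in several rooms on one floor during the same period rises to the top of the list, which is the overview the hall captains were giving me.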
If you’re running a network or handling support for a small organization, consider doing the same thing. You’ll like the results. And if you like that idea, don’t let me take all the credit for it. I learned it from the excellent essay “How to help someone use a computer,” by UCLA professor Phil Agre, who notes “knowledge lives in communities, not individuals. A computer user who’s part of a community of computer users will have an easier time than one who isn’t.”
Same goes for the person supporting that user.