How Many Administrators are Enough?

MARK VERBER
Appeared in Unix Review, April 1991
Minor Revisions May 1, 1997.

The Enigmatic Question

``How many system administrators does a site need?'' is a commonly asked and difficult-to-answer question. There is no magic ratio. The appropriate number of administrators depends on what each system administrator is responsible for and on the level of service expected in each area of responsibility.

The best way to estimate the number of administrators needed is to figure out what level of service is required and how various factors (for instance, networking infrastructure and heterogeneity of the machines being supported) will affect the fulfillment of those responsibilities. Rarely are system administrators doing only ``administrator'' tasks. The first part of this article will detail the tasks that I find myself performing in addition to the normal ``administrator'' tasks, such as backups, installing new users, operating-system maintenance, and so forth. Additional tasks are presented (for the most part) in the form of questions. The second part details some of the various factors that will affect staff levels. The third part details some simple perspectives that system administrators can adopt to make their environment more easily administrable.

What System Administrators Do

User Services

How much hand-holding is expected? Some sites have users who are pretty self-sufficient; other sites have users who need assistance for everything they do. Can your users take care of themselves or do they need and want the administrator to perform even the simplest tasks for them? For example, I have a friend whose users demand that he perform the most basic tasks for them (such as moving their files from one directory to another). Anything that isn't simply invoking the text editor or reading mail is ``UNIX'' and hence a job for the administrator. This sort of support requires a ratio something like one administrator for every four users.

Does the site want you to conduct workshops or prepare extensive local documentation? To what extent are you expected to consult on technical issues? Do you concern yourself with just UNIX or other realms? For example, let's say your site has heavy users of TeX, Mathematica, Common LISP, C++, X11, PostScript, and Sybase. Are you supposed to be able to answer detailed questions on all those topics? Few people are experts at all these things. Something that many people don't appreciate is that development of expertise in any given topic area requires time to play, experiment, and mature in that area.

Software Support

How much public domain software or freeware do people want installed? What level of support are they expecting? Just compiling and installing software doesn't take much time. Often, though, software doesn't just compile and install properly. There are often assumptions in the software which need to be changed before the software can be used at a given site. In addition, administrators are often expected (and rightly so) to continue maintenance of the software (bug fixes and whatnot) and to become an expert in the use of the software. Compiling and installing (coupled with frequent patches) on many hardware/software platforms can make this incredibly time-consuming for even just a few software packages. The time this takes varies with the quality and complexity of the software. Keeping a current version of kermit or perl isn't hard (I wish everyone did as nice a job as Larry Wall has with perl); keeping up with g++ is much more time-consuming.

Custom Software

Most places not only expect the system administrators to keep their world running, but also to create -- on demand -- tools for the user population. This is understandable, especially in small sites where the administrator might be the only professional programmer. If there is this expectation, time must be allocated for this development process.

Site Planning/Administration Overhead

How much site planning is the administrator expected to handle? Must the administrator know about AC/heating loads and power? How much paperwork is there?

Hardware/Network Maintenance

Who crawls through the ceiling to pull wires? Who finds the flaky transceiver when the Ethernet starts to go crazy? When a terminal or workstation dies, does a secretary just call your vendor and wait, or are more creative solutions required? Does your site buy all its peripherals ready-to-install, or do you save money by purchasing components and doing the integration yourself? Having a system administrator do any of these things takes time.

Anticipate Technology

Is the administrator supposed to anticipate new technology and advise the company about new approaches? Most places I have worked expect administrators to have a good feel for the state of the art and new technology that looks promising (not just products, but research, too). Anticipation is often necessary given that many sites have a two-to-five year planning or depreciation schedule. Keeping up with our field isn't easy, and there are a variety of sources one must draw upon to stay current. Trade rags can give you a picture of what is being sold; Usenet (and other electronic media) is great for questions regarding current issues and problems. Professional journals from ACM, IEEE, etc. are useful to see what is happening on the almost-done research front. There is no substitute, however, for a good network of professional contacts. This network can be maintained with phone calls, electronic mail, and conference attendance.

Other Issues

Sites with one administrator are not very desirable.

They are a fact of life since many small sites can neither afford nor justify more than one system administrator. It is difficult for one person to have the breadth of knowledge and experience to run a really first-class site, no matter how few machines it has. There will always be some area that is not the strength of a sole administrator.

Another problem is that the site with a single system administrator has a single point of failure: when the administrator is on vacation (or gets run over by a bus), the site is vulnerable. Carrying a pager on vacation isn't my idea of fun; however, no one can predict when a crisis might occur. Of course, it's hard to interest a high-level person in a job that also involves changing the backup tapes and crawling through the ceilings.

The more homogeneous a site is, the easier it is to support.

The number of different platforms supported (different machine architectures or different operating systems) increases the complexity of the support task. Upgrading the operating system will have to be done at least once by hand for each platform. Each operating system has its own idiosyncrasies that must be learned and mastered. Most sites want all the platforms to appear identical so that their users can sit down on any of the workstations and get work done. This requires that each platform have identical tools, window systems, etc. This can greatly increase the amount of work the administrator must do. In the best of circumstances this means recompiling programs for each platform. In the worst circumstances, it involves porting software and fighting with vendor-supplied software. My personal nightmare is trying to support all of X11R4 (from MIT), DECwindows, OSF/Motif, and Sun's OpenWindows on three different platforms.

Larger sites can exploit economies of scale.

Large sites can expand their administration staffs less rapidly than the number of users (or workstations) grows. The reason for this is that as your staff gets larger it is possible for people to specialize. This specialization permits individual staff members to develop a depth of expertise that enables them to understand all the issues on a given topic and solve more quickly whatever problems crop up.

Secondly, larger sites can leverage off previous work. The first installation of a machine or piece of software is always the most difficult. The second is easier. By the time you have done 50 or 100 installations, you have developed automatic scripts and can do installations in your sleep. I have seen large sites at a 1:100 administrator-to-machine ratio where things ran pretty well. I must caution the reader though: this sort of ratio is only feasible with top-notch people working in a carefully engineered environment with many hundreds of users. Most sites can't get productive work done with this sort of ratio. This sort of ratio also limits the professional growth of members of the system staff because they will spend most of their time with the day-to-day issues and fire-fighting. This is a shame since an organization's most valuable resource is its people.

High-availability sites require higher staffing.

Sites that need to be highly available (e.g. greater than 99.9% service delivery) will require a higher level of staffing. The reason for this is that you need people who can respond almost immediately to any service issues (e.g. 24x7 coverage, ideally at least 2 people deep, who can do first- and second-level resolution and escalate to subject area experts). You also need to have multiple people for each subject area who are able to diagnose and resolve complex issues quickly.
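
To make the 99.9% figure concrete, it helps to turn an availability target into the amount of downtime it allows. The following is a minimal sketch in Python; the function name and the choice of a 365-day year are assumptions made for illustration, not part of any standard definition.

```python
# Hypothetical helper: convert an availability target into allowed downtime.
HOURS_PER_YEAR = 365 * 24  # assumes a 365-day year (leap years ignored)


def downtime_budget_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime permitted over the period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_hours * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = downtime_budget_hours(target)
        print(f"{target}% available allows about {hours:.2f} hours down per year")
```

At 99.9% the budget works out to under nine hours for the whole year, which is why someone has to be able to respond at any hour rather than on the next business day.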

Hints for Making Administration Easier

As is sadly too often the case with support staff, system administrators are not highly regarded (even though everyone at the site depends on them). My experience is that there are never enough system administrators. Because staffing levels are not what they should be, a system administrator needs to take all possible (productive) short cuts and have a proactive rather than reactive approach to system-administration tasks. If administrators do not employ a proactive approach, they will find themselves constantly in a ``fire-fighting mode,'' which is counter-productive. System administrators need to leverage their time as much as possible. Here are some of the things that help me survive at my site.

Build Tools! Always do your work with scripts or tools. If you have to install a program or modify a set of configuration files, you will most likely have to do it again. Build small tools to do the work for you. Never do things by hand (or at least never do things by hand more than once).
Automate Everything! Use tools that take care of things automatically. Clean your logs with a shell script that runs from cron. Use programs that will automatically update workstations from a ``master'' so you only have to install software on one machine and the software is automatically ``distributed'' to all the other machines. Berkeley's rdist program does this by ``pushing'' new copies of software. CMU's sup does this by having the workstations ``pull'' new copies of a program to the workstation.
Carefully Encapsulate Localizations! Minimize the number of nonstandard pieces you have to add when you perform new installations of the operating system. Concentrate local changes in /usr/local (or use some similar scheme) as much as possible. Try to refrain from hacking on and reinstalling local versions of things in /bin, /usr/ucb, and so on. Throw out vendor-supplied /etc/rc* files and create your own. Your /etc/rc* should gather into a single file all of the parameters that you need to change to localize your machines.
Standardize Environments and Configurations! If each of your machines is configured differently (such as different swap sizes; some diskless, some dataless, some diskful; different software installed), you are creating headaches for yourself. If you can have a single ``prototypical'' machine from which you can clone distributions and upgrades, software dissemination can be performed automatically. For example, let's say you run a dataless configuration with /, /var, and parts of /usr on a local disk (all other files are accessed via the automounter). You could configure a diskless partition on a server that would boot up, install your localized operating system on the small local disk, and reboot as a newly configured workstation ready for action just by editing two or three files on your server that specify how to boot your diskless client. If you have to configure and install each machine by hand, you will waste time whenever you install a new machine or have to do an OS upgrade. Leverage uniformity!
Document your Environment! If you regularly get the same questions from your users, you have failed to document effectively. Spend the up-front time to document; you will save that time (and more) on the back end through fewer and shorter support calls on those topics.
Share the Work! Most sites have a number of highly motivated and clueful users. Harness their energy. Find ways for them to help out and find ways to encourage people to serve themselves.
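
As one illustration of the ``clean your logs from cron'' advice above, here is a minimal sketch in Python (the article itself suggests a shell script, and either would do). The function name, the "*.log.*" file pattern, and the paths in the sample crontab line are all hypothetical; adjust them to whatever your site actually uses.

```python
#!/usr/bin/env python3
"""Sketch: delete old rotated log files. Intended to be run from cron.

A hypothetical crontab entry (script path and log directory are assumptions):
    0 3 * * * /usr/local/sbin/prune_logs.py /var/log/myapp 14
"""
import sys
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def prune_old_logs(log_dir, max_age_days, pattern="*.log.*", now=None, dry_run=False):
    """Remove files in log_dir matching pattern that are older than max_age_days.

    Returns the list of paths removed (or that would be removed if dry_run).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    removed = []
    for path in sorted(Path(log_dir).glob(pattern)):
        # Only touch regular files whose last modification predates the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed


# Run only when invoked as: prune_logs.py DIRECTORY DAYS
if __name__ == "__main__" and len(sys.argv) == 3:
    for victim in prune_old_logs(sys.argv[1], int(sys.argv[2])):
        print(f"removed {victim}")
```

Passing dry_run=True the first few times is a cheap way to see what would be deleted before trusting the job to run unattended.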

What About Other Platforms?

The platform which is being supported makes a great deal of difference. My experience is that support of Macintosh and UNIX communities takes approximately the same staffing levels. Support of PCs (running any Microsoft OS) seems to require at least double the staffing and delivers a lower level of service.

Other People's Ratios

In the last few years there have been a lot of people who have talked about the ratios they think are reasonable. It is common to hear people talking about staff/user ratios of 1:60 where there is some variation in the population, and staff/user ratios of 1:150 (or higher) in locations that can use "cookie cutter" solutions, e.g. universities with hordes of undergraduates. These ratios give, at best, a very crude benchmark indicating the minimal staffing required to deliver very basic service.

David Cappuccio of the Gartner Group suggested in his article Know The Types: Sizing up Support Staffs that there are two ratios that you need to consider. The first ratio is staff to users, an attempt to capture the human part of the equation. This ratio looks at how many people you need to do what is often called Tier I, help desk, or user support. The second ratio is the number of machines and subsystems per staff member, which captures how many people are needed to take care of the technical infrastructure. While I like David's framework, I think that his ratios are too high for user support, and that he has failed to capture the diverse set of technologies most organizations deploy: there is much more than print, file, web, and database servers. There are directory, security, messaging, and collaborative services. To complicate matters, many sites are heterogeneous, requiring extra effort to make one service work for all clients or, worse, resulting in the need for separate services based on the client platform. A final complicating factor is that these services often have complex interactions and dependencies, which make them more difficult to deploy and maintain. The result is that David's ratios will yield staffing that is able to deliver only the most basic services at an adequate level.

My rule of thumb is:

Unit counting:
    1 unit for each 40 users.
    1 unit for each make of OS (2 for Windows*)
    1 unit for each 40 boxes if you can protect the OS from the users without
           hindering the user. 1 unit for each 20 boxes if you can't protect
           the OS and system configurations from the users.
    n units (where n is # of OS) if tight coupling (shared filing, etc.).
    1 unit for major subsystems that are set up network wide
           instead of machine wide. E.g. newsspool, httpd, DNS, mail,
           printing, SAMBA.
    1 unit for each additional subnet or segment if multisegmented LAN.
    n*2 units (where n is # of OS) if you care about security.

Junior admins can handle 4 units. A really good experienced admin can hack 8-12 units. This is based on an equation proposed by Sherwood Botsford and found in the comp.unix.admin FAQ.
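
The unit counting above can be sketched as a small Python calculator. This is one reading of the rule of thumb, not a definitive implementation: the class and function names are hypothetical, fractional units are kept rather than rounded, the subsystem line is read as one unit per network-wide subsystem, and Windows is handled by adding one extra unit for each Windows variant. The site in the example is invented purely to show the arithmetic.

```python
import math
from dataclasses import dataclass


@dataclass
class Site:
    """Inputs to the unit count. All field names are hypothetical."""
    users: int
    boxes: int
    os_count: int                 # distinct operating systems, Windows included
    windows_os_count: int = 0     # how many of those are Windows (counted double)
    os_protected: bool = True     # can the OS be protected from the users?
    tightly_coupled: bool = False # shared filing and the like
    network_subsystems: int = 0   # news, httpd, DNS, mail, printing, ...
    extra_subnets: int = 0        # subnets or segments beyond the first
    security_matters: bool = False


def site_units(site):
    """Total the units for a site, keeping fractions (an assumption)."""
    units = site.users / 40
    units += site.os_count + site.windows_os_count  # Windows counts as 2
    units += site.boxes / (40 if site.os_protected else 20)
    if site.tightly_coupled:
        units += site.os_count
    units += site.network_subsystems
    units += site.extra_subnets
    if site.security_matters:
        units += 2 * site.os_count
    return units


def admins_needed(units, units_per_admin):
    """Round up: a fraction of an administrator still has to be hired."""
    return math.ceil(units / units_per_admin)


if __name__ == "__main__":
    example = Site(users=120, boxes=120, os_count=3, tightly_coupled=True,
                   network_subsystems=3, extra_subnets=1, security_matters=True)
    total = site_units(example)
    print(f"units: {total}")
    print(f"junior admins (4 units each): {admins_needed(total, 4)}")
    print(f"experienced admins (8 to 12 units each): "
          f"{admins_needed(total, 12)} to {admins_needed(total, 8)}")
```

Keeping the per-admin capacity as a parameter makes it easy to see how much the answer swings between a staff of junior administrators and a staff of experienced ones.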

Conclusion

The number of administrators required varies greatly from site to site. The one constant is that there are rarely enough system administrators for the responsibilities that they have. My personal experience is that it is possible for a single person to maintain up to 120 machines (with three different platforms) and give adequate user services to a fairly sophisticated user population. My time is divided between user services (30 percent), general system administration tasks (20 percent), installing new machines and hardware/network support (10 percent), software installation and maintenance (40 percent), custom software development and tracking of trends (25 percent), and site planning (10 percent). You will note that this adds up to 135 percent.

About the Author

Mark Verber wrote this article while he was working for the Physics Department at The Ohio State University. Mark now leads the Operations Team at Tellme Networks. If you would like to work with a great operations team, drop Mark some email. We are hiring.