How Many Administrators are Enough?
MARK VERBER
Appeared in Unix Review, April 1991
Minor Revisions May 1, 1997.
The Enigmatic Question
``How many system administrators does a site
need?'' is a commonly asked question that is difficult to answer. There is no magic
ratio. The appropriate number of administrators depends on what each system
administrator is responsible for and on the level of service expected in each
area of responsibility.
The best way to estimate the number of administrators needed is to figure out
what level of service is required and how various factors (for instance
networking infrastructure and heterogeneity of the machines being supported)
will affect the fulfillment of those responsibilities. Rarely are system
administrators doing only ``administrator'' tasks. The first part of this
article will detail the tasks that I find myself performing in addition to the
normal ``administrator'' tasks, such as backups, installing new users,
operating-system maintenance, and so forth. Additional tasks are presented (for
the most part) in the form of questions. The second part details some of the
various factors that will affect staff levels. The third part details some
simple perspectives that system administrators can adopt to make their
environment more easily administrable.
What System Administrators Do
User Services
How much hand-holding is expected? Some sites have users
who are pretty self-sufficient; other sites have users who need assistance for
everything they do. Can your users take care of themselves or do they need and
want the administrator to perform even the simplest tasks for them? For example,
I have a friend whose users demand that he perform the most basic tasks for them
(such as moving their files from one directory to another). Anything that isn't
simply invoking the text editor or reading mail is ``UNIX'' and hence a job for
the administrator. This sort of support requires a ratio something like one
administrator for every four users.
Does the site want you to conduct workshops or prepare extensive local
documentation? To what extent are you expected to consult on technical issues?
Do you concern yourself with just UNIX or other realms? For example, let's say
your site has heavy users of TeX, Mathematica, Common LISP, C++, X11,
PostScript, and Sybase. Are you supposed to be able to answer detailed questions
on all those topics? Few people are experts at all these things. Something that
many people don't appreciate is that development of expertise in any given topic
area requires time to play, experiment, and mature in that area.
Software Support
How much public domain software or freeware do people
want installed? What level of support are they expecting? Just compiling and
installing software doesn't take much time. Often though, software doesn't just
compile and install properly. There are often assumptions in the software which
need to be changed before the software can be used at a given site. In addition,
administrators are often expected (and rightly so) to continue maintenance of
the software (bug fixes and what not) and to become an expert in the use of the
software. Compiling and installing (coupled with frequent patches) on many
hardware/software platforms can make this incredibly time consuming for even
just a few software packages. The time this takes varies with the quality and
complexity of the software. Keeping a current version of kermit or
perl isn't hard (I wish everyone did as nice a job as Larry Wall has
with perl); keeping up with g++ is much more time-consuming.
Custom Software
Most places not only expect the system administrators to
keep their world running, but also to create -- on demand -- tools for the user
population. This is understandable, especially in small sites where the
administrator might be the only professional programmer. If there is this
expectation, time must be allocated for this development process.
Site Planning/Administration Overhead
How much site planning is the
administrator expected to handle? Must the administrator know about AC/heating
loads and power? How much paperwork is there?
Hardware/Network Maintenance
Who crawls through the ceiling to pull
wires? Who finds the flaky transceiver when the Ethernet starts to go crazy?
When a terminal or workstation dies, does a secretary just call your vendor and
wait, or are more creative solutions required? Does your site buy all its
peripherals ready-to-install or do you save money by purchasing components and
do the integration yourself? Having a system administrator do any of these
things takes time.
Anticipate Technology
Is the administrator supposed to anticipate new
technology and advise the company about new approaches? Most places I have
worked expect administrators to have a good feel for the state of the art and
new technology that looks promising (not just products, but research, too).
Anticipation is often necessary given that many sites have a two-to-five-year
planning or depreciation schedule. Keeping up with our field isn't easy. There
are a variety of sources one must draw upon to stay current, and I have found
several good ones. Trade rags can give you a picture of what is being sold.
Usenet (and other electronic media) is great for questions regarding current
issues and problems. Professional journals from ACM, IEEE, etc. are useful to
see what is happening on the almost-done research front.
There is no substitute, however, for a good network of professional contacts.
This network can be maintained with phone calls, electronic mail, and attending
conferences.
Other Issues
Sites with one administrator are not very desirable.
They are a fact of life, since many small sites can neither afford nor justify more than one system
administrator. It is difficult for one person to have the breadth of knowledge
and experience to run a really first-class site, no matter how few machines it
has. There will always be some area that is not the strength of a sole
administrator.
Another problem is that the site with a single system administrator has a
single point of failure: when the administrator is on vacation (or gets run over
by a bus), the site is vulnerable. Carrying a pager on vacation isn't my idea of
fun; however, no one can predict when a crisis might occur. Of course, it's hard
to interest a high-level person in a job that also involves changing the backup
tapes and crawling through the ceilings.
The more homogeneous a site is, the easier it is to support.
The number of different platforms supported (different machine architectures or different
operating systems) increases the complexity of the support task. Upgrading the
operating system will have to be done at least once by hand for each platform.
Each operating system has its own idiosyncrasies that must be learned and
mastered. Most sites want all the platforms to appear identical so that their
users can sit down on any of the workstations and get work done. This requires
that each platform have identical tools, window systems, etc. This can greatly
increase the amount of work the administrator must do. In the best of
circumstances this means recompiling programs for each platform. In the worst
circumstances, it involves porting software, and fighting with vendor-supplied
software. My personal nightmare is trying to support all of X11R4 (from MIT),
DECwindows, OSF/Motif, and Sun's OpenWindows on three different platforms.
Larger sites can exploit economies of scale.
Large sites can expand their administration staffs less rapidly than the number of users (or
workstations) grows. The reason for this is that as your staff gets larger it is
possible for people to specialize. This specialization permits individual staff
members to develop a depth of expertise that enables them to understand all the
issues on a given topic and solve more quickly whatever problems crop up.
Secondly, larger sites can leverage off previous work. The first installation
of a machine or piece of software is always the most difficult. The second is
easier. By the time you have done 50 or 100 installations, you have developed
automatic scripts and can do installations in your sleep. I have seen large
sites at a 1:100 administrator-to-machine ratio where things ran pretty well. I
must caution the reader though: this sort of ratio is only feasible with
top-notch people working in a carefully engineered environment with many
hundreds of users. Most sites can't get productive work done with this sort of
ratio. This sort of ratio also limits the professional growth of members of the
system staff because they will spend most of their time with the day-to-day
issues and fire-fighting. This is a shame since an organization's most valuable
resource is its people.
High-availability sites require higher staffing.
Sites that need to be highly available (e.g. greater than 99.9% service
delivery) will require a higher level of staffing. The reason is that
you need people who can respond almost immediately to any service issue (e.g.
24x7 coverage, ideally at least two people deep, who can do first- and second-level
resolution and escalate to subject-area experts). You also
need multiple people for each subject area who are able to diagnose and
resolve complex issues quickly.
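To put the 99.9% figure in perspective, here is a minimal sketch (in Python) that converts an availability target into a yearly downtime budget. The function name and the non-leap-year assumption are illustrative, not part of the original article.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Compare a few common targets side by side.
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% available -> {hours:.2f} hours of downtime per year")
```

At 99.9% the budget is roughly 8.76 hours for the whole year, which is why a single outage that waits until morning for someone to respond can consume most of it.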
Hints for Making Administration Easier
As is sadly too often the case
with support staff, system administrators are not highly regarded (even though
everyone at the site depends on them). My experience is that there are never
enough system administrators. Because staffing levels are not what they should
be, a system administrator needs to take all possible (productive) short cuts
and have a proactive rather than reactive approach to system-administration
tasks. If administrators do not employ a proactive approach, they will find
themselves constantly in a ``fire-fighting mode,'' which is counter-productive.
System administrators need to leverage their time as much as possible. Here are
some of the things that help me survive at my site.
- Build Tools! Always do your work with scripts or tools. If you have
to install a program or modify a set of configuration files, you will most
likely have to do it again. Build small tools to do the work for you. Never do
things by hand (or at least never do things by hand more than once).
- Automate Everything! Use tools that take care of things
automatically. Clean your logs with a shell script that runs from cron. Use
programs that will automatically update workstations from a ``master'' so you
only have to install software on one machine and the software is automatically
``distributed'' to all the other machines. Berkeley's rdist program
does this by ``pushing'' new copies of software. CMU's sup does this
by having the workstations ``pull'' new copies of a program to the
workstation.
- Carefully Encapsulate Localizations. Minimize the number of
nonstandard pieces you have to add when you perform new installations on the
operating system. Concentrate local changes in /usr/local (or use some
similar scheme) as much as possible. Try to refrain from hacking on and
reinstalling local versions of things in /bin, /usr/ucb, and so
on. Throw out vendor-supplied /etc/rc* files and create your own. Your
/etc/rc* files should collect, in a single file, all of the parameters that you
need to change to localize your machines.
- Standardize Environments and Configurations. If each of your
machines is configured differently (such as different swap sizes; some
diskless, some dataless, some diskful; different software installed), you are
creating headaches for yourself. If you can have a single ``prototypical''
machine from which you can clone distributions and upgrades, software
dissemination can be performed automatically. For example, let's say you run a
dataless configuration with /, /var, and parts of /usr on
a local disk (all other files are accessed via the automounter). You could
configure a diskless partition on a server that would boot up, install your
localized operating system on the small local disk, and reboot as a newly
configured workstation ready for action just by editing two or three files on
your server that specify how to boot your diskless client. If you have to
configure and install each machine by hand, you will waste time whenever you
install a new machine or have to do an OS upgrade. Leverage uniformity!
- Document your Environment. If you regularly get the same
questions from your users, you have failed to document effectively.
Spend the up-front time to document; you will save that time (and more) on the
back end through fewer and shorter support calls on those topics.
- Share the Work. Most sites have a number of highly motivated
and clueful users. Harness their energy. Find ways for them to
help out and find ways to encourage people to serve themselves.
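As one illustration of the ``Automate Everything'' hint, here is a minimal sketch of a log-cleaning tool meant to be run from cron. It is written in Python rather than as a shell script, and the function name, the 14-day retention period, and the "*.log" pattern are all assumptions chosen for the example.

```python
import time
from pathlib import Path


def prune_old_logs(log_dir, max_age_days=14, pattern="*.log", now=None):
    """Delete files in log_dir matching pattern that are older than max_age_days.

    Returns the list of removed paths so a cron job can report what it did.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    removed = []
    for path in sorted(Path(log_dir).glob(pattern)):
        # Only touch regular files whose last modification predates the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): prune_logs.py /path/to/logs
    for gone in prune_old_logs(sys.argv[1]):
        print(f"removed {gone}")
```

Returning the removed paths rather than printing inside the function keeps the tool easy to reuse from other scripts, in the spirit of building small tools once and running them everywhere.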
What About Other Platforms?
The platform which is being supported makes
a great deal of difference. My experience is that support of Macintosh and UNIX
communities takes approximately the same staffing levels. Support of PCs (running
any Microsoft OS) seems to require at least double the staffing and delivers a
lower level of service.
Other People's Ratios
In the last few years there have been a lot of
people who have talked about the ratios they think are reasonable. It is common
to hear people talking about staff/user ratios of 1:60 where there is some
variation in the population, and staff/user ratios of 1:150 (or higher) in
locations that can use ``cookie cutter'' solutions, e.g. universities with hordes of
undergraduates. These ratios give, at best, a very crude benchmark indicating
the minimal staffing required to deliver very basic service.
David Cappuccio of the Gartner Group suggested in his article ``Know The Types:
Sizing up Support Staffs'' that there are two ratios that you need to
consider. The first ratio is staff to users, an attempt to capture the human
part of the equation. This ratio looks at how many people you need to do
what is often called Tier I, help desk, or user support. The second ratio is the
number of machines and subsystems per staff member, which captures how many people
are needed to take care of the technical infrastructure. While I like David's
framework, I think that his ratios are too high for user support, and that he
has failed to capture the diverse set of technologies most organizations deploy:
there is much more than print, file, web, and database servers. There are
directory, security, messaging, and collaborative services. To complicate
matters, many sites are heterogeneous, requiring extra effort to make one
service work for all clients or, worse, resulting in the need for services which are
based on the client platform. A final complicating factor is that these services
often have complex interactions and dependencies, which make them more difficult
to deploy and maintain. The result is that David's ratios will yield staffing
which will be able to deliver only the most basic services at an adequate
level.
My rule of thumb is:
Unit counting:
- 1 unit for each 40 users.
- 1 unit for each make of OS (2 for Windows*).
- 1 unit for each 40 boxes if you can protect the OS from the users without
hindering the user. 1 unit for each 20 boxes if you can't protect the OS
and system configurations from the users.
- n units (where n is # of OS) if tight coupling (shared filing, etc.).
- 1 unit for major subsystems that are set up network wide instead of
machine wide, e.g. newsspool, httpd, DNS, mail, printing, SAMBA.
- 1 unit for each additional subnet or segment if multisegmented LAN.
- n*2 units (where n is # of OS) if you care about security.
Junior admins can handle 4 units. A really good experienced admin can hack
8-12 units. This is based on an equation proposed by Sherwood Botsford and
found in the comp.unix.admin FAQ.
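The unit-counting rule is easy to turn into a small calculator. The sketch below is one possible reading of it in Python, not a definitive implementation: the Site fields are hypothetical names, users and boxes are prorated rather than rounded up to whole blocks of 40 or 20, each network-wide subsystem is counted as one unit, and the final head count is rounded up. All of those choices are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Site:
    users: int
    boxes: int
    os_count: int                 # number of distinct operating systems
    windows_count: int = 0        # how many of those are Windows variants
    os_protected: bool = True     # can the OS be protected from the users?
    tight_coupling: bool = False  # shared filing, etc.
    network_subsystems: int = 0   # news, httpd, DNS, mail, printing, ...
    extra_subnets: int = 0        # subnets or segments beyond the first
    security_matters: bool = False


def site_units(site: Site) -> float:
    """Total support units for a site under the rule of thumb (prorated)."""
    units = site.users / 40
    # 1 unit per OS, with each Windows variant counting as 2.
    units += site.os_count + site.windows_count
    units += site.boxes / (40 if site.os_protected else 20)
    if site.tight_coupling:
        units += site.os_count
    units += site.network_subsystems  # assumption: one unit per subsystem
    units += site.extra_subnets
    if site.security_matters:
        units += 2 * site.os_count
    return units


def admins_needed(units: float, units_per_admin: float) -> int:
    """Round up: a fraction of an administrator still has to be hired."""
    return math.ceil(units / units_per_admin)


if __name__ == "__main__":
    # A made-up site used only to exercise the arithmetic.
    example = Site(users=80, boxes=60, os_count=2, os_protected=False,
                   tight_coupling=True, network_subsystems=3,
                   extra_subnets=1, security_matters=True)
    units = site_units(example)
    print(f"{units:.1f} units")
    print("junior admins (4 units each):", admins_needed(units, 4))
    print("experienced admins (8 units each):", admins_needed(units, 8))
```

For this made-up site the total is 17 units (2 for users, 2 for operating systems, 3 for unprotected boxes, 2 for tight coupling, 3 for subsystems, 1 for the extra subnet, 4 for security), which works out to five junior administrators or three experienced ones at 8 units each.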
Conclusion
The number of administrators required varies greatly from
site to site. The one constant is that there are rarely enough system
administrators for the responsibilities that they have. My personal experience
is that it is possible for a single person to maintain up to 120 machines
(with three different platforms) and give adequate user services to a
fairly sophisticated user population. My time is divided between user services
(30 percent), general system administration tasks (20 percent), installing new
machines and hardware/network support (10 percent), software installation and
maintenance (40 percent), custom software development and tracking of trends (25
percent), and site planning (10 percent). You will note that this adds up to 135
percent.
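The overcommitment can be checked with a few lines of arithmetic. The sketch below sums the percentages quoted above and, assuming a nominal 40-hour week (an assumption, not a figure from the article), converts the total into hours.

```python
# Time allocation from the paragraph above, in percent.
time_allocation = {
    "user services": 30,
    "general system administration": 20,
    "new machines and hardware/network support": 10,
    "software installation and maintenance": 40,
    "custom software and tracking trends": 25,
    "site planning": 10,
}

NOMINAL_WEEK_HOURS = 40  # assumption: a standard full-time week

total_percent = sum(time_allocation.values())
overcommit_percent = total_percent - 100
# Multiply before dividing so the result stays exact.
implied_weekly_hours = total_percent * NOMINAL_WEEK_HOURS / 100

if __name__ == "__main__":
    print(f"total: {total_percent}%  (over by {overcommit_percent}%)")
    print(f"implied week: {implied_weekly_hours:.0f} hours")
```

Under that assumption, 135 percent corresponds to a 54-hour week.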
About the Author
Mark Verber wrote this article while he was working for the Physics Department at The Ohio
State University. Mark now leads the Operations Team at Tellme Networks. If you would like to
work with a great operations team, drop Mark some email. We are hiring.