Developing fail-safes
What is a fail-safe and why install it?
What is a fail-safe?
All network equipment will fail eventually. The equipment may reach the end of its useful life, it may fail because of physical (malicious or accidental) damage, or it may be the victim of environmental damage such as a power spike, lightning strike or overload from adjacent equipment.

A fail-safe is any method or device used to reduce the chance of a failure or to limit the disruption it causes.

Why install fail-safes?
Installing fail-safes on your network will involve extra cost, but this must be weighed against the cost of replacing expensive network devices, the time taken to track down the problem, and the disruption caused by network downtime while the problem is resolved.
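The weighing-up described above can be sketched as a simple comparison. The Python example below is illustrative only: the function names, the parameters and every figure are assumptions made for the example, not values taken from this guidance. It also assumes, for simplicity, that the fail-safe would prevent the failure entirely.

```python
def expected_annual_failure_cost(failures_per_year, replacement_cost,
                                 hours_to_diagnose, downtime_hours,
                                 staff_cost_per_hour, disruption_cost_per_hour):
    """Estimate the yearly cost of leaving a component unprotected.

    Every parameter is a hypothetical input that a school would estimate itself.
    """
    cost_per_failure = (
        replacement_cost                              # replacing the device
        + hours_to_diagnose * staff_cost_per_hour     # tracking down the problem
        + downtime_hours * disruption_cost_per_hour   # disruption during downtime
    )
    return failures_per_year * cost_per_failure


def fail_safe_is_worthwhile(annual_fail_safe_cost, annual_failure_cost):
    """Return True when the fail-safe costs less per year than the failures it guards against.

    Simplifying assumption: the fail-safe prevents the failure entirely.
    """
    return annual_fail_safe_cost < annual_failure_cost


# Made-up figures, for illustration only.
failure_cost = expected_annual_failure_cost(
    failures_per_year=0.5, replacement_cost=800,
    hours_to_diagnose=4, downtime_hours=8,
    staff_cost_per_hour=25, disruption_cost_per_hour=100)
print(failure_cost)                                # 850.0
print(fail_safe_is_worthwhile(150, failure_cost))  # True
```

In practice the disruption figure is the hardest to estimate, so it may be worth running the comparison with a low and a high guess to see whether the conclusion changes.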

While some fail-safe techniques can be expensive to implement, many are well within the range of a school budget. The following section outlines some methods of minimising the chances of network failure and will help you decide the most appropriate level of protection for your school’s network.

Choosing appropriate fail-safes for your school
There are four types of fail-safe:
  • a protective device such as an uninterruptible power supply (UPS) to protect network equipment from a specific threat (in this case, power failure or power fluctuations)
  • a duplicate device connected and working as part of the network, such as a RAID array or a secondary cable run
  • a hot standby: a duplicate device ready to connect in place of a failed component (already loaded with the correct software and configuration to enable the quickest possible swap)
  • a spare device that you can configure to replace a failed network component.
The following tables list the network components most likely to require fail-safes and identify possible solutions to protect against component failure:

Components requiring protective devices
| Component | Risk | Fail-safe options | Pros and cons |
|---|---|---|---|
| Server | Power surge or failure | Uninterruptible power supply (UPS) | UPSs can be expensive, but one UPS can supply power to multiple servers and critical devices such as switches. |
| Workstations | Power surge or failure | Mains surge protectors | Surge protectors are often incorporated in power distribution strips. |

Components requiring duplicate devices
| Component | Risk | Fail-safe options | Pros and cons |
|---|---|---|---|
| Data | Loss of data due to corruption or disk failure | RAID array (see Setting up RAID) | RAID arrays can be configured with relatively inexpensive disks and provide a high level of protection for your data. |
| Cables | Malicious or accidental break to cable runs | Alternative cable path to destination | Duplicating cable runs may be expensive, but a break in a cable can be very time-consuming to trace and may result in large parts of the network being unavailable for a long time. |
| Printers | General failure of printer, such as a paper jam | Duplicate printing facility, if possible on a different part of the network | Additional printers for critical administration tasks can be connected at low cost. Being mechanical devices, printers are prone to failure. |

Components requiring hot standby
| Component | Risk | Fail-safe options | Pros and cons |
|---|---|---|---|
| Data | Loss of data due to corruption or disk failure | Data back-up (see Implementing a back-up and restore process) | Of all the fail-safe methods, this is probably the most important to implement. |
| Servers | General failure | Duplicate spare server preloaded with required software and configured with correct interface boards | A server is an expensive part of the network but is often the most critical component, managing internet access, email and so on. You will need to weigh the cost of dedicating a spare computer to this task against the level of disruption likely to be caused by server failure and the time it would take to source and build a replacement. |
| Workstations | General failure | Duplicate spare workstation preloaded with required software and configured with correct interface boards | This option is only worth considering if you have a large number of workstations that have common applications and set-up. |

Components requiring available spares
Most components of your network require some spare equipment to ensure a speedy recovery from component failure. The level of spares you should keep depends on a number of factors:
  • the number of such devices in use in the school (it is probably uneconomic to have spare devices available for one-off specialist equipment such as the robot arm used in the science lab)
  • the cost of the components (obviously the lower the cost of the individual devices such as workstations, the greater the level of spares you can afford)
  • the failure rate of certain network devices (this information can be obtained from the Problem Management process)
  • the ratio of mechanical to electronic components in the device (for instance, printers are more prone to fail than hubs)
  • the speed of availability of replacement parts from suppliers
  • the possibility of sharing spare equipment with other local schools
  • the importance of the device to the smooth running of your network.

The above factors may help you determine the level of spare equipment you should budget for, but an absolute minimum should be:
  • for primary schools – two spare workstations
  • for secondary schools – one spare workstation for every fifty workstations.
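The minimum levels above amount to a small calculation, sketched here in Python. The function name and the school-type labels are hypothetical, and rounding a partial group of fifty up to a whole spare is an assumption: the guidance does not say how to round.

```python
import math


def minimum_spare_workstations(school_type, workstation_count):
    """Return the absolute minimum number of spare workstations to budget for.

    school_type is "primary" or "secondary" (labels chosen for this sketch).
    """
    if school_type == "primary":
        # Primary schools: two spare workstations, whatever the network size.
        return 2
    if school_type == "secondary":
        # Secondary schools: one spare for every fifty workstations.
        # Assumption: a partial group of fifty still earns a spare (round up).
        return math.ceil(workstation_count / 50)
    raise ValueError(f"unknown school type: {school_type!r}")


print(minimum_spare_workstations("primary", 30))      # 2
print(minimum_spare_workstations("secondary", 120))   # 3
```

These figures are a floor, not a target; the factors listed earlier (failure rate, supplier lead times, device importance) may justify holding more.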
Maintaining your fail-safes to keep them effective
Protective devices, duplicates and hot and cold spares should be subject to the same preventative maintenance routines as any other device on your network.

It is tempting to see spare equipment as an easy way to increase network capacity. After all, the equipment is not being used and is probably just sitting in a cupboard gathering dust. But the cupboard is exactly where a spare needs to be when it is required to fix a problem, so it is important not to press spare equipment into service as network usage increases. In addition, spares should be replenished as soon as possible whenever they are used to repair the network.
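One way to keep track of replenishment is a simple spares register that is checked after every repair. The Python sketch below is only an illustration: the function name, the item names and the stock figures are all hypothetical.

```python
def spares_to_reorder(minimum_levels, current_stock):
    """Return how many of each item must be bought to get back to the minimum level.

    Both arguments map an item name to a quantity. Items missing from
    current_stock are treated as having no spares at all.
    """
    shortfalls = {}
    for item, minimum in minimum_levels.items():
        in_stock = current_stock.get(item, 0)
        if in_stock < minimum:
            shortfalls[item] = minimum - in_stock
    return shortfalls


# Hypothetical register: two spare workstations have just been used for repairs.
minimum_levels = {"workstation": 3, "switch": 1, "patch cable": 10}
current_stock = {"workstation": 1, "switch": 1, "patch cable": 6}
print(spares_to_reorder(minimum_levels, current_stock))
# {'workstation': 2, 'patch cable': 4}
```

Running a check like this whenever a spare is issued makes it harder for spares to be quietly absorbed into everyday use.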

The requirement for additional fail-safes should be considered as part of Release Management.