Most discussion of troubleshooting, and especially training in formal troubleshooting procedures, tends to be domain-specific, even though the basic principles are universally applicable.
Usually troubleshooting is applied to something that has suddenly stopped working, since its previously working state forms the expectations about its continued behavior. The initial focus is therefore often on recent changes to the system or to the environment in which it exists (for example, a printer that "was working when it was plugged in over there"). However, there is a well-known principle that correlation does not imply causality. For example, the failure of a device shortly after it has been plugged into a different outlet does not necessarily mean that the two events were related; the failure could have been a coincidence. Troubleshooting therefore demands critical thinking rather than magical thinking.
It is useful to consider the common experience we have with light bulbs. Light bulbs "burn out" more or less at random; eventually the repeated heating and cooling of the filament, together with fluctuations in the power supplied to it, causes the filament to crack or vaporize. The same principle applies to most other electronic devices, and similar principles apply to mechanical devices. Some failures are part of the normal wear and tear of components in a system.
A basic principle in troubleshooting is to start with the simplest and most probable problems. This is illustrated by the old saying "When you see hoof prints, look for horses, not zebras", or, to use another maxim, the KISS principle. This principle leads to a common complaint about help desks and manuals: that they sometimes begin by asking "Is it plugged in, and does that receptacle have power?" This should not be taken as an affront; rather, it should serve as a reminder, or conditioning, to always check the simple things first before calling for help.
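One way to make "simplest and most probable first" concrete is to rank candidate checks by how likely each is to be the cause relative to how much effort it takes to test. The sketch below is only an illustration: the `Check` class, the probability and cost figures, and the ratio used for ranking are all assumptions made for the example, not an established procedure.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    probability: float   # assumed chance that this is the cause
    cost_minutes: float  # assumed effort needed to test it


def order_checks(checks):
    """Sort checks so that likely, cheap ones come first."""
    return sorted(
        checks,
        key=lambda c: c.probability / c.cost_minutes,
        reverse=True,
    )


# Hypothetical figures for a printer that has stopped working.
checks = [
    Check("replace the printer's main board", 0.05, 60.0),
    Check("confirm the power cable is plugged in", 0.30, 1.0),
    Check("reinstall the printer driver", 0.20, 15.0),
]

ordered = [c.name for c in order_checks(checks)]
for name in ordered:
    print(name)
```

With these made-up numbers the power cable check comes first and the board replacement last, which mirrors the help desk's opening question.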
A troubleshooter could check each component in a system one by one, substituting known good components for each potentially suspect one. However, this process of "serial substitution" can be considered degenerate when components are substituted without regard to a hypothesis about how their failure could produce the symptoms being diagnosed.
Efficient methodical troubleshooting starts with a clear understanding of the expected behavior of the system and the symptoms being observed. From there the troubleshooter forms hypotheses on potential causes, and devises (or perhaps references a standardized checklist of) tests to eliminate these prospective causes. Two common strategies used by troubleshooters are to check for frequently encountered or easily tested conditions first (for example, checking that a printer's light is on and that its cable is firmly seated at both ends), and to "bisect" the system (for example, in a network printing system, checking whether the job reached the server to determine whether the problem lies in the subsystems "towards" the user's end or "towards" the device).
This latter technique can be particularly efficient in systems with long chains of serialized dependencies or interactions among their components. It is simply the application of a binary search across the range of dependencies.
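The bisection idea can be sketched as an ordinary binary search over a serial chain of stages. The stage names and the `reaches` probe below are hypothetical; the sketch also assumes a strictly serial chain, so that a job seen at one stage must have passed through every earlier stage.

```python
def first_failing_stage(stages, reaches):
    """Return the index of the first stage the job did not reach, or None.

    Assumes a strictly serial chain: if the job reached a stage, it also
    reached every earlier stage.
    """
    lo, hi = 0, len(stages)
    while lo < hi:
        mid = (lo + hi) // 2
        if reaches(stages[mid]):
            lo = mid + 1  # fault lies further "towards" the device
        else:
            hi = mid      # fault lies here or "towards" the user
    return lo if lo < len(stages) else None


# Hypothetical network printing chain, ordered from user to device.
stages = ["application", "print spooler", "network",
          "print server", "printer queue", "printer"]
reached = {"application", "print spooler", "network"}
probes = []


def reaches(stage):
    probes.append(stage)  # record each test the troubleshooter performs
    return stage in reached


idx = first_failing_stage(stages, reaches)
print(stages[idx], "after", len(probes), "tests")
```

Each test halves the remaining range, so a chain of six stages needs at most three tests rather than up to six when every stage is checked in turn.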
Simple and intermediate systems are characterized by lists or trees of dependencies among their components or subsystems. More complex systems contain cyclical dependencies or interactions (feedback loops). Such systems are less amenable to "bisection" troubleshooting techniques.
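Whether a system's dependencies form a list or tree, or instead contain a loop, can be checked mechanically when the dependencies are written down. The following is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later); the component names and both dependency maps are invented for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def has_feedback_loop(dependencies):
    """Return True if the dependency map contains a cycle.

    `dependencies` maps each component to the set of components it
    depends on.
    """
    try:
        TopologicalSorter(dependencies).prepare()
    except CycleError:
        return True
    return False


# Hypothetical serial chain: a candidate for bisection.
chain = {
    "printer": {"print server"},
    "print server": {"network"},
    "network": set(),
}

# Hypothetical control loop: each part depends on the next in a cycle.
loop = {
    "thermostat": {"room temperature"},
    "room temperature": {"heater"},
    "heater": {"thermostat"},
}

print(has_feedback_loop(chain), has_feedback_loop(loop))
```

A result of `True` suggests that a fault cannot be cleanly localized to one "side" of a test point, since a symptom may propagate around the loop.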
It also helps to start from a known good state, the best example being a computer reboot. A cognitive walkthrough is also worth trying. Comprehensive documentation produced by proficient technical writers is very helpful, especially if it provides a theory of operation for the subject device or system.
A common cause of problems is bad design, for example bad human factors design, where a device can be inserted backward or upside down because it lacks an appropriate forcing function (behavior-shaping constraint), or a lack of error-tolerant design. This is especially bad when accompanied by habituation, where the user simply does not notice the incorrect usage, for instance when two parts have different functions but share a common case, so that it is not apparent on casual inspection which part is being used.
Troubleshooting can also take the form of a systematic checklist, troubleshooting procedure, flowchart or table that is made before a problem occurs. Developing troubleshooting procedures in advance allows sufficient thought to be given to the steps to take and to organizing them into the most efficient process. Troubleshooting tables can be computerized to make them more efficient for users.
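As a rough illustration of a computerized troubleshooting table, a flowchart can be stored as nested data and walked by a small function. The questions, actions and the `walk` helper below are hypothetical and reuse the printer example only for continuity; a real table would be written by someone who knows the device.

```python
# Each node is either a question with "yes"/"no" branches or a final action.
PRINTER_TABLE = {
    "question": "Is the printer's power light on?",
    "no": {"action": "Check the power cable and the outlet."},
    "yes": {
        "question": "Is the data cable firmly seated at both ends?",
        "no": {"action": "Reseat the cable."},
        "yes": {"action": "Check whether the job reached the print server."},
    },
}


def walk(table, answer):
    """Follow the table, calling answer(question) at each step.

    `answer` returns a truthy value for "yes" and a falsy value for "no".
    """
    node = table
    while "action" not in node:
        node = node["yes" if answer(node["question"]) else "no"]
    return node["action"]


# Answers a user might give; in practice these would come from a prompt.
answers = {
    "Is the printer's power light on?": True,
    "Is the data cable firmly seated at both ends?": False,
}
result = walk(PRINTER_TABLE, answers.get)
print(result)
```

Keeping the table as data rather than code means the procedure can be reviewed and revised in advance without touching the function that walks it.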