OVERVIEW
Distributed systems contain multiple hardware and software components
that can interact across multiple nodes/subsystems in sometimes
unforeseen and complicated ways. As a result, determining the root
cause of a failure in such a system can be frustrating and can take
several hours or even days.
Problem diagnosis (or fingerpointing) involves instrumenting systems
to yield meaningful data, detecting errors and/or failures within
these systems, and ascertaining their root cause, i.e., the underlying
fault. Fingerpointing is difficult because the distributed
interactions, protocols and inter-component dependencies in computer
systems can cause a problem to change "shape" or manifestation,
leading to potential red herrings in problem determination. A single
outward manifestation can have many possible root causes, and there
might be insufficient information to distinguish among them. On the
other hand, too much monitoring and too many
error messages might overwhelm the system, obscure the root cause, and
lead to increased latencies and additional resource costs.
We are currently developing a variety of techniques for automated
fingerpointing in distributed systems. The aim is to perform online
and offline root-cause analyses in order to identify a faulty
node or process, diagnose the source of the problem, and report it to
the user or administrator in a meaningful manner.
We ultimately aim for a preemptive strategy (where we need not wait
for an instability or problem to escalate into a system-wide outage
before taking remedial action) that might improve the system's overall
responsiveness and availability. The idea is to observe the trends of
various key metrics in the system to ascertain which of these can be
good indicators of the overall health of the system, and which metrics
(if monitored appropriately and at the right frequency) could herald a
potential outage. Thus, our techniques aim for two key elements:
diagnosis of the root cause of the problem, and (where possible) an
early indication of an imminent critical problem in the system, so
that a total system failure can be averted.
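
To make the trend-monitoring idea concrete, here is a minimal sketch in
Python. It is an illustration only, not the technique used in this
project: it fits a least-squares line to a sliding window of samples of
one metric and estimates how soon that metric would reach a failure
threshold. The names TrendMonitor and fit_line, the threshold, the
window size and the warning horizon are all hypothetical choices made
for this example.

```python
from collections import deque
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


def fit_line(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line through samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        # All samples share one timestamp: no trend can be estimated.
        return 0.0, mean_v
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


class TrendMonitor:
    """Hypothetical monitor for one metric (assumed: higher is worse)."""

    def __init__(self, threshold: float, window: int = 5,
                 horizon: float = 60.0) -> None:
        self.threshold = threshold   # value treated as an outage (assumed)
        self.horizon = horizon       # warn only if crossing is this close
        self.window = window         # number of samples used for the fit
        self.samples: deque = deque(maxlen=window)

    def observe(self, t: float, value: float) -> Optional[float]:
        """Record a sample.

        Returns the estimated number of seconds until the threshold is
        reached if that lies within the horizon, otherwise None.
        """
        self.samples.append((t, value))
        if value >= self.threshold:
            return 0.0               # already at or past the threshold
        if len(self.samples) < self.window:
            return None              # not enough history yet
        slope, intercept = fit_line(list(self.samples))
        if slope <= 0:
            return None              # metric is flat or improving
        eta = (self.threshold - intercept) / slope - t
        return max(eta, 0.0) if eta <= self.horizon else None
```

For example, with a threshold of 100, a window of 3 and samples of 50,
60 and 70 taken 10 seconds apart, the third call to observe estimates
30 seconds until the threshold is reached. A linear fit is the simplest
possible predictor; which metrics to watch and how often to sample them
remain the open questions described above.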