Node state transitions
introduction:
The lifecycle of a node comprises instantiation and runtime state. The
runtime state of an uninstantiated node is UNDEFINED. An instantiated node
is typically STOPPED, RUNNING, or SUSPENDED. The remaining states represent
transitions; they are made explicit because there can be a lag between
receipt of a control command and completion of the state change. This
document describes the valid runtime states, the valid transitions, and the
use of relative control state information by cooperating nodes.
node state transitions:
When a node has not yet been instantiated, or its control state does not
map to one of the known states, it is UNDEFINED. The FAILED state is
reserved for catastrophic failure of a node (the node cannot continue
without manual intervention). A node can transition to a FAILED state from
any other state, even though these are not all diagrammed.
When transitioning to the FAILED state, the intermediate FAILING state is
optional. If the node needs to perform some action before failure is
complete, it must first set its state to FAILING, then take the action, and
finally set its state to FAILED. Actions may take time to complete,
especially under failure circumstances, and the transitional state
provides information to the control system in the interim.
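The optional FAILING step can be sketched as follows. This is a minimal illustration in Python, not SAND code: the Node class, its fail method, and the final_action parameter are hypothetical names, and swallowing an exception raised by the final action is an assumption made so that the node always reaches FAILED.

```python
from enum import Enum
from typing import Callable, List, Optional


class State(Enum):
    RUNNING = "running"
    FAILING = "failing"
    FAILED = "failed"


class Node:
    """Hypothetical node that records every state it passes through."""

    def __init__(self) -> None:
        self.state = State.RUNNING
        self.history: List[State] = [self.state]

    def _set_state(self, state: State) -> None:
        self.state = state
        self.history.append(state)

    def fail(self, final_action: Optional[Callable[[], None]] = None) -> None:
        """Enter FAILED, passing through FAILING only if there is work to do."""
        if final_action is not None:
            # Publish FAILING first so the control system can see that
            # final actions are in progress while they run.
            self._set_state(State.FAILING)
            try:
                final_action()
            except Exception:
                # Assumption: a failing final action must not keep the
                # node from reaching FAILED.
                pass
        self._set_state(State.FAILED)
```

A node with nothing to clean up calls fail() and goes straight to FAILED; a node that passes a final action is visible as FAILING for as long as that action runs.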
[Figure: Node state transitions]
Meanings inherent in the states and valid transitions:
- undefined: no state information for this node is available
- undefined -> stopped: standard node instantiation
- undefined -> failing: catastrophic failure on instantiation
- stopped: the node is instantiated but must be initialized to
become operational. It is not processing input or producing output of
any kind.
- stopped -> starting: initialization begins
- starting -> failing
- starting -> running: successful initialization
- running: the node is fully operational
- running -> stopping: the node is being shut down normally, no
new input is being accepted, outstanding work is being completed and
resources are being relinquished.
- stopping -> failing
- stopping -> stopped: all outstanding operations have been completed
and resources have been relinquished.
- running -> failing: a catastrophic failure occurred during
processing
- running -> suspending: the node has been suspended and is cleaning
up its remaining outstanding synchronous input
- suspending -> failing
- suspending -> suspended: processing of remaining synchronous input
has been completed
- suspended: synchronous input not accepted, resources requiring
extensive initialization continue to be held
- suspended -> failing
- suspended -> resuming: setting up for acceptance of synchronous
input.
- resuming -> failing
- resuming -> running: all suspended functions restored
For brevity, the failure transitions from any state xyz are understood
as follows:
- xyz -> failing: a catastrophic failure occurred while in state
xyz and the node is performing final actions
- xyz -> failed: a catastrophic failure occurred while in state
xyz, no cleanup actions were taken or attempted
- failing -> failed: final failure actions completed.
- failed: the node is not processing input or producing output of
any kind. External action is required.
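The transitions listed above can be captured in a small lookup table. The sketch below is illustrative Python, not part of SAND; the names State, NORMAL_TRANSITIONS, and is_valid_transition are hypothetical. It assumes, following the text, that FAILING and FAILED can be entered from any state that has not already failed, and that no transition out of FAILED is possible without external action.

```python
from enum import Enum


class State(Enum):
    UNDEFINED = "undefined"
    STOPPED = "stopped"
    STARTING = "starting"
    RUNNING = "running"
    STOPPING = "stopping"
    SUSPENDING = "suspending"
    SUSPENDED = "suspended"
    RESUMING = "resuming"
    FAILING = "failing"
    FAILED = "failed"


# Non-failure transitions, taken from the list above.
NORMAL_TRANSITIONS = {
    State.UNDEFINED: {State.STOPPED},
    State.STOPPED: {State.STARTING},
    State.STARTING: {State.RUNNING},
    State.RUNNING: {State.STOPPING, State.SUSPENDING},
    State.STOPPING: {State.STOPPED},
    State.SUSPENDING: {State.SUSPENDED},
    State.SUSPENDED: {State.RESUMING},
    State.RESUMING: {State.RUNNING},
    State.FAILING: set(),
    State.FAILED: set(),
}


def is_valid_transition(current: State, target: State) -> bool:
    """Return True if current -> target is allowed by the rules above."""
    if current is State.FAILED:
        # FAILED is terminal until external action is taken.
        return False
    if target is State.FAILED:
        # Any non-failed state, including FAILING, may end in FAILED.
        return True
    if target is State.FAILING:
        return current is not State.FAILING
    return target in NORMAL_TRANSITIONS[current]
```

For example, is_valid_transition(State.SUSPENDED, State.RESUMING) is accepted, while is_valid_transition(State.SUSPENDED, State.RUNNING) is rejected because a suspended node must pass through RESUMING.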
A suspended node does not process synchronous input (input declared
using @receive). This input is automatically suspended when the node
changes to the SUSPENDING state. A node may optionally process
asynchronous input (input declared via @subscribe). This input is not
automatically suspended; the node can choose to continue processing
this input or ignore it as application logic dictates.
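The input rule for suspension can be sketched as a single predicate. This is an illustrative Python sketch with hypothetical names (InputKind, accepts_new_input, handle_async_while_suspended); it covers only the RUNNING, SUSPENDING, SUSPENDED, and RESUMING states, and it assumes the node's choice about asynchronous input applies equally to all three non-running states.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    SUSPENDING = "suspending"
    SUSPENDED = "suspended"
    RESUMING = "resuming"


class InputKind(Enum):
    SYNCHRONOUS = "receive"     # input declared using @receive
    ASYNCHRONOUS = "subscribe"  # input declared via @subscribe


def accepts_new_input(
    state: State,
    kind: InputKind,
    handle_async_while_suspended: bool = True,
) -> bool:
    """Decide whether new input of the given kind is processed."""
    if state is State.RUNNING:
        # A fully operational node processes both kinds of input.
        return True
    if kind is InputKind.SYNCHRONOUS:
        # Synchronous input is suspended automatically from SUSPENDING
        # until the node is RUNNING again.
        return False
    # Asynchronous input is left to application logic.
    return handle_async_while_suspended
```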
notes on heartbeat requirements:
Heartbeat monitoring is not explicitly supported within SAND, but is
generally expected to be available as part of the underlying control
technology. The reasoning is:
- You don't need a heartbeat for asynchronous communication. Message
delivery is guaranteed, so if the subscriber is up it will get the
message. If the sender goes down, and it is expected to be producing
regular output, then its output is already a heartbeat and can be
detected. If the sender does not produce regular output then its
health can be monitored using the control system directly.
- You don't need a heartbeat for synchronous communication. The call
either succeeds or not, and state information can be pulled from the
call failure details.
With the timer utility, any node can be programmed to send out a regular
signal (as with Stats messages). This is useful in cases where the
application itself needs to be reactive to specific situations.
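Treating regular output as an implicit heartbeat can be sketched as a watchdog that remembers when each sender last produced output. The Python below is illustrative only: OutputWatchdog, record_output, and silent_senders are hypothetical names, not SAND APIs, and timestamps are passed in explicitly so the sketch does not depend on any particular timer utility.

```python
from typing import Dict, List


class OutputWatchdog:
    """Flags senders whose regular output has gone quiet."""

    def __init__(self, max_silence_seconds: float) -> None:
        self.max_silence_seconds = max_silence_seconds
        self._last_seen: Dict[str, float] = {}

    def record_output(self, sender: str, timestamp: float) -> None:
        """Note that a sender produced output at the given time."""
        self._last_seen[sender] = timestamp

    def silent_senders(self, now: float) -> List[str]:
        """Return, in sorted order, senders quiet for longer than allowed."""
        return sorted(
            sender
            for sender, seen in self._last_seen.items()
            if now - seen > self.max_silence_seconds
        )
```

A subscriber would call record_output each time a message arrives and check silent_senders periodically; a sender that appears in the result could then be queried through the control system.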
helper nodes and state:
When a node requires other nodes to perform its work, some degree of
cross-node state checking is necessary for due diligence during the
initialization process. The amount of checking is application dependent,
but tends to fall into two categories of node relationship:
- cooperative nodes: declared within the same NodeInstance
array in a ServerDeclaration. They are generally considered to be
loosely coupled, and do not usually check one another's state
during the initialization process.
- helper nodes: declared within the same NodeInstance
array within a NodeInstance. The parent node cannot function
without the children, and the children are not directly used by
nodes other than the parent. Helper nodes are generally considered
to be tightly coupled, and the parent must verify that the state of
its children is consistent with its own.
The common logic to verify child nodes is supported within SAND:
- When a parent node is instantiated, its children are also instantiated.
- The children of a parent move synchronously with the parent through
recursive state transitions. For example, to start a node:
  - The parent moves to the STARTING state
  - Each child is started
  - The parent moves to the RUNNING state
- The failure of a child cascades upwards to the parent state.
- Deletion of any node instance in the group results in all instances
(parent and all children) being deleted.
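The parent and child rules above can be sketched as a small tree of nodes. This is an illustrative Python model with hypothetical names (Node, start, fail, delete), not the SAND implementation. It assumes children are instantiated in the STOPPED state, and it uses a deleted flag rather than a state because deletion is not one of the runtime states.

```python
from enum import Enum
from typing import List, Optional


class State(Enum):
    STOPPED = "stopped"
    STARTING = "starting"
    RUNNING = "running"
    FAILED = "failed"


class Node:
    """Hypothetical parent or helper node."""

    def __init__(self, name: str, children: Optional[List["Node"]] = None) -> None:
        self.name = name
        self.parent: Optional["Node"] = None
        self.children: List["Node"] = list(children or [])
        for child in self.children:
            child.parent = self
        self.state = State.STOPPED
        self.deleted = False

    def start(self) -> None:
        """Start this node and, recursively, each of its children."""
        self.state = State.STARTING
        for child in self.children:
            child.start()
            if self.state is State.FAILED:
                # A child failure cascaded up while starting.
                return
        self.state = State.RUNNING

    def fail(self) -> None:
        """Mark this node FAILED and cascade the failure upwards."""
        self.state = State.FAILED
        if self.parent is not None and self.parent.state is not State.FAILED:
            self.parent.fail()

    def delete(self) -> None:
        """Delete the whole group, whichever member was targeted."""
        root = self
        while root.parent is not None:
            root = root.parent
        root._delete_subtree()

    def _delete_subtree(self) -> None:
        self.deleted = True
        for child in self.children:
            child._delete_subtree()
```

Starting the parent leaves every node in the group RUNNING. Calling fail() on one child marks the parent FAILED but leaves sibling children untouched in this sketch, since the text describes the cascade as upward only.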
implementation notes:
If the underlying control system provides comprehensive state management
using a standard that differs from the SAND state specification, the SAND
environment is expected to provide both SAND state management and a mapping
to and from the other standard. A best-effort mapping between the standards
gives the least astonishment and the most utility to those responsible for
system maintenance, even where the two standards differ in expressive
power.
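A best-effort mapping might look like the sketch below. The external state names (UP, DOWN, PAUSED, CHANGING, ERROR, UNKNOWN) are invented for illustration and do not refer to any real standard; the functions to_external and to_sand are likewise hypothetical. External states with no SAND equivalent map to undefined, following the rule that a control state which does not map to a known state is UNDEFINED.

```python
# Hypothetical external standard with fewer states than SAND.
SAND_TO_EXTERNAL = {
    "undefined": "UNKNOWN",
    "stopped": "DOWN",
    "starting": "CHANGING",
    "running": "UP",
    "stopping": "CHANGING",
    "suspending": "CHANGING",
    "suspended": "PAUSED",
    "resuming": "CHANGING",
    "failing": "ERROR",
    "failed": "ERROR",
}

# Reverse mapping: only external states with one clear SAND equivalent.
EXTERNAL_TO_SAND = {
    "DOWN": "stopped",
    "UP": "running",
    "PAUSED": "suspended",
    "ERROR": "failed",
}


def to_external(sand_state: str) -> str:
    """Map a SAND state name to the hypothetical external standard."""
    return SAND_TO_EXTERNAL.get(sand_state, "UNKNOWN")


def to_sand(external_state: str) -> str:
    """Map an external state back to SAND, or undefined if ambiguous."""
    return EXTERNAL_TO_SAND.get(external_state, "undefined")
```

Because the external standard in this sketch is less expressive, the round trip is lossy: running survives it unchanged, but starting comes back as undefined and failing comes back as failed.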
© 2002 SAND Services Inc.
All Rights Reserved.