Ironic Node Error State
Last updated
Last updated
Summary: an Ironic node has entered the error
. Per the docs:
This is the state a node will move into when deleting an active deployment fails.
Consequences: users will not be able to launch instances on these Ironic nodes. However, they will still be able to reserve the nodes, which can lead to confusion when trying to utilize the reservation.
Temporary IPMI connectivity disruption: In some cases, the power status of the node cannot be synced during a deployment or undeployment, and the node can enter an error state as a precaution. There is a that should attempt to "reset" this state, as it can and does happen periodically simply due to network contention or interruption on the provisioning network.
Check the "extra" field on the node: openstack baremetal node show $node -f json | jq .extra
. A node that has been reset by the hammer will have a "hammer_error_resets" key with timestamps for each time a reset was performed.
If there are more than (3 at time of writing), then this node could have an issue with its IPMI interface and should be put into maintenance.
Temporary API connectivity disruption: Many OpenStack services are involved in instance tear-down (e.g., Keystone, Nova, Ironic, Neutron)--if any of those cannot be reached, the instance can fail to tear down.
IPMI interface failure: If the node has a pattern of issues with IPMI, there could be an issue with the BMC, the IPMI NIC, or even the physical cable or connection on the switch that provides IPMI connectivity. All of these issues require maintenance of the node.
To put the node back into the available
state, you can trigger an undeploy
of the node. This works even if the node doesn't have an instance; it essentially performs a clean
and then delete
if there is an instance, then resets the state.