
Ironic Node Error State


Last updated 2 years ago

Summary: an Ironic node has entered the error provision state. Per the docs:

This is the state a node will move into when deleting an active deployment fails.

Consequences: users will not be able to launch instances on these Ironic nodes. However, they will still be able to reserve the nodes, which can lead to confusion when trying to utilize the reservation.
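To see which nodes are currently affected, one option is to filter the node list by provision state. This is a minimal sketch, assuming the openstack CLI with the baremetal plugin is installed and admin credentials are loaded in the environment; the helper name list_error_nodes is hypothetical.

```shell
# Hypothetical helper: print the UUID and name of every node whose
# provision state is "error".
# Assumption: admin credentials (e.g. an openrc file) are already sourced.
list_error_nodes() {
  openstack baremetal node list --provision-state error -f value -c UUID -c Name
}
```

Running list_error_nodes with no output means no node is currently in the error state.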

Possible causes

Temporary IPMI connectivity disruption: In some cases, the power status of the node cannot be synced during a deployment or undeployment, and the node enters the error state as a precaution. There is a hammer that should attempt to "reset" this state, because it can and does happen periodically simply due to network contention or interruption on the provisioning network.

  1. Check the "extra" field on the node: openstack baremetal node show $node -f json | jq .extra. A node that has been reset by the hammer will have a "hammer_error_resets" key with timestamps for each time a reset was performed.

  2. If there are more than max_attempts resets (3 at time of writing), then this node could have an issue with its IPMI interface and should be put into maintenance.
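The two steps above can be sketched as a pair of shell helpers. This is only an illustration: the function names are hypothetical, and the exact shape of the "hammer_error_resets" value is an assumption (either a JSON list of timestamps, or a string that contains such a list). The default limit of 3 mirrors the value quoted above and may be out of date for your deployment.

```shell
# Hypothetical helper: count the hammer resets recorded on a node.
# Assumption: "hammer_error_resets" is a JSON list of timestamps, or a string
# holding such a list; a missing key counts as zero resets.
count_hammer_resets() {
  openstack baremetal node show "$1" -f json |
    jq '(.extra.hammer_error_resets // [])
        | if type == "string" then fromjson else . end
        | length'
}

# Hypothetical helper: succeed (exit 0) when the reset count is above the limit.
# Returns 2 if the count could not be determined.
exceeds_reset_limit() {
  local node="$1" limit="${2:-3}"
  local resets
  resets="$(count_hammer_resets "$node")" || return 2
  [ "$resets" -gt "$limit" ]
}
```

For example, `exceeds_reset_limit "$node" && echo "consider maintenance for $node"` prints the message only for nodes that have been reset more often than the limit.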

Temporary API connectivity disruption: Many OpenStack services are involved in instance tear-down (e.g., Keystone, Nova, Ironic, Neutron). If any of those cannot be reached, the instance can fail to tear down.
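A quick way to rule out an ongoing API outage is to issue one cheap, read-only call against each of the services named above. The sketch below is an assumption about what a reasonable smoke test looks like, not part of the documented procedure; the helper name check_openstack_apis is hypothetical, and the calls require admin credentials.

```shell
# Hypothetical helper: run one read-only call per service and report failures.
# Keystone: token issue, Nova: compute service list,
# Ironic: baremetal driver list, Neutron: network agent list.
check_openstack_apis() {
  local failed=0 check
  for check in \
    "token issue" \
    "compute service list" \
    "baremetal driver list" \
    "network agent list"; do
    # Word splitting of $check is intentional: each entry is a subcommand.
    if openstack $check >/dev/null 2>&1; then
      echo "ok:   openstack $check"
    else
      echo "FAIL: openstack $check"
      failed=1
    fi
  done
  return "$failed"
}
```

A passing check only shows that the APIs are reachable now; it does not prove they were reachable when the tear-down failed.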

IPMI interface failure: If the node has a pattern of issues with IPMI, there could be an issue with the BMC, the IPMI NIC, or even the physical cable or connection on the switch that provides IPMI connectivity. All of these issues require maintenance of the node.
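When a node does need hands-on work, it can be flagged for maintenance with a reason so that other operators know why. A minimal sketch, assuming admin credentials are loaded; the helper name and the default reason text are hypothetical.

```shell
# Hypothetical helper: put a node into maintenance and record why.
# Assumption: the default reason text is only a placeholder.
set_node_maintenance() {
  local node="$1" reason="${2:-Suspected IPMI failure}"
  openstack baremetal node maintenance set --reason "$reason" "$node"
}
```

Once the hardware has been repaired, `openstack baremetal node maintenance unset $node` clears the flag again.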

Clearing the error state

To put the node back into the available state, you can trigger an undeploy of the node. This works even if the node doesn't have an instance: it essentially cleans the node, deletes the instance if there is one, and then resets the state.

openstack baremetal node undeploy $node
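Cleaning can take a while, so it may help to confirm that the node really returns to the available state. The sketch below wraps the command above in a polling loop; the helper name, the poll count, and the interval are assumptions to tune for your hardware, not values from this runbook.

```shell
# Hypothetical helper: undeploy a node, then poll until it is "available".
# Assumption: 60 polls at 30-second intervals (about 30 minutes) is enough.
undeploy_and_wait() {
  local node="$1" max_polls="${2:-60}" interval="${3:-30}"
  local state="" i=0
  openstack baremetal node undeploy "$node" || return 1
  while [ "$i" -lt "$max_polls" ]; do
    sleep "$interval"
    state="$(openstack baremetal node show "$node" -f value -c provision_state)"
    case "$state" in
      available)
        return 0 ;;
      error | "clean failed")
        # The undeploy or the cleaning step failed; stop polling.
        echo "node $node is in state: $state" >&2
        return 1 ;;
    esac
    i=$((i + 1))
  done
  echo "timed out waiting for $node (last state: $state)" >&2
  return 1
}
```

If the helper reports error or clean failed, the node most likely needs the maintenance path described under "Possible causes" rather than another undeploy.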