OpenStack API Down


Last updated 2 years ago

Summary: One of the OpenStack APIs is returning a different HTTP code than expected, OR we are failing to get any response from it.

Consequences: If one of the OpenStack APIs is indeed down, or returning error codes, it will affect the reliability of the cluster as a whole, and large portions of the testbed may not be working.

Possible causes

Failure of monitoring to contact cluster: this alert depends on metrics that are gathered by an external source (a Prometheus OpenStack metric exporter), and if the exporter fails to authenticate or otherwise cannot reach the cluster, this alert can fire, though nothing is really on 🔥.

Recent upgrade changed default HTTP codes: If an OpenStack API is returning an unexpected code, it could be a side effect of a deploy. For example, a service that used to return 200 on the root (/) endpoint may now return 204.

  1. What happens when you curl -i $endpoint/ for the service? See what code is returned, and if it makes sense to return that code.

  2. If this is indeed an instance of bad alerting, update the monitoring code to look for the new return code.
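The check in step 1 can be scripted so the observed code is compared against the expected one in a single step. This is a minimal sketch, not the alert's actual logic: the function names classify_status and probe_status are hypothetical, and the expected code (200 in the example) is an assumption you should replace with whatever the monitoring currently expects.

```shell
# Classify an observed HTTP status against the code the alert expects.
# $1 = observed status code, $2 = expected status code.
# An empty or "000" observed code means curl never received a response.
classify_status() {
  if [ -z "$1" ] || [ "$1" = "000" ]; then
    echo "no-response"
  elif [ "$1" = "$2" ]; then
    echo "ok"
  else
    echo "mismatch"
  fi
}

# Fetch only the status code of the service's root endpoint.
# -s silences progress output, -o /dev/null discards the body,
# -w '%{http_code}' prints the status code, --max-time bounds the wait.
# $1 = base URL of the service (hypothetical parameter).
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1/"
}

# Example, assuming $endpoint holds the service's base URL
# and 200 is the expected code:
#   classify_status "$(probe_status "$endpoint")" 200
```

A "mismatch" result points toward the changed-default-code case described here, while "no-response" suggests the service or the path to it is down.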

Service failure: It could be that the OpenStack service is actually not healthy. Try the following if this is the case:

  1. Verify that the service appears healthy; usually a smoke test through the system is enough.

  2. If not, is the service running? Check the running Docker containers for any containers in a restart loop.

  3. Examine the service's logs for errors, which can usually tell you what's wrong: docker run --rm -v kolla_logs:/logs centos:7 grep -i ERROR /logs/$service/$component
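Step 2 can be narrowed down with a small filter over the docker ps output. This is a sketch under assumptions: the helper name restarting_containers is hypothetical, and it relies on Docker reporting a restart-looping container with a status that begins with the word "Restarting".

```shell
# Read "<name> <status...>" lines on stdin and print the names of
# containers whose status begins with the word "Restarting".
restarting_containers() {
  awk '$2 == "Restarting" { print $1 }'
}

# Example (requires access to the Docker daemon on the host):
#   docker ps -a --format '{{.Names}} {{.Status}}' | restarting_containers
```

For a quick one-off check, docker ps --filter status=restarting gives a similar view without the helper. Any container it lists is a good candidate for the log search in step 3.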
