It seems a little odd that, with Proxmox 4.x+ out, there’s still a need to fix Proxmox 2.3 cluster issues, but hey, never change a running system, right?

The Issue

Good ol’ Proxmox 2.3 Cluster wasn’t as resilient as its successor is now in Proxmox 4.x. The Cluster tends to lose connectivity with the other nodes, Proxmox’s Cluster Filesystem casually denies write access, Corosync enjoys crowding your syslog with error messages, and so on. The problem here is, you can’t do shit on your machines because “FUCK YEAH…”

cluster not ready – no quorum?

So awesome, and endless debugging possibilities. Not!

The Fix

TBH, this is not the go-to fix that repairs your Cluster no matter what. I mostly have issues with cman and pve-cluster which can be resolved by restarting both services on all Cluster nodes. With pve-cluster this can lead to undesired behaviour though.

I’m managing tasks like these with Ansible, as my Clusters are usually composed of 10+ nodes and doing it manually on each node is annoying. And by “managing with Ansible” I mean I’m using Ansible to execute whatever command I need in parallel on multiple nodes at once. This usually works fine, especially for all things cman, but it might put your shared /etc/pve filesystem under pressure, as you’re probably producing deadlocks and race conditions when 5 nodes try to re-sync their filesystem at once.

So here’s a small Ansible command to restart pve-cluster one node after another (Yes, it can be done in a while-loop with Bash, too, but hey: Ansible is Love, Ansible is Life).

Requirements

  • Working Ansible
  • Group “pve” in your Ansible Inventory (as you’re probably pretty smart, you surely already realized that this changes depending on your setup and Inventory)

ansible pve -m service -a "name=pve-cluster state=restarted" -f 1

This way, pve-cluster will be restarted one node at a time and the Cluster Filesystem has enough time to sync and report “notice: received all states” after each restart which, in the end, fixes any problem related to the shared filesystem.
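For the Ansible-less among you, here’s a rough sketch of the Bash loop mentioned above. Take it as an illustration, not a tested recipe: the function name restart_serially, the RUN variable and the node names are made up for this example, and it assumes you can SSH to every node as root and that "service <name> restart" works there (on some setups you may need the init script under /etc/init.d instead).

```shell
# Restart one service on a list of nodes, strictly one node after another.
# Hypothetical helper; the name, the RUN variable and the root@ login are assumptions.
# Usage: restart_serially <service> <node> [<node> ...]
restart_serially() {
  service_name="$1"
  shift
  # RUN is the remote runner and defaults to ssh; override it for a dry run.
  run="${RUN:-ssh}"
  for node in "$@"; do
    echo "restarting ${service_name} on ${node}"
    # Stop at the first failure instead of marching on through a broken Cluster.
    if ! "$run" "root@${node}" "service ${service_name} restart"; then
      echo "restart of ${service_name} failed on ${node}, stopping here" >&2
      return 1
    fi
  done
}
```

Called as "restart_serially pve-cluster node1 node2 node3" (placeholder hostnames), it does roughly what -f 1 does: the next node is only touched once the previous restart has returned. Bailing out on the first error is a deliberate choice here, so you can look at the failing node before going on.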

Additionally:

Executing

ansible pve -m service -a "name=cman state=restarted" -f 1

afterwards doesn’t hurt either.
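To see whether a node has actually recovered, a quick write test against the shared filesystem is a cheap sanity check, since a denied write is one of the symptoms described at the top. This is only a sketch: the function name pve_fs_writable and the probe file name are invented for this example, and it assumes that creating and deleting a throwaway file in /etc/pve is fine on your setup.

```shell
# Return 0 if a file can be created in the given directory, 1 otherwise.
# Hypothetical helper; defaults to /etc/pve, the shared Cluster Filesystem mount.
pve_fs_writable() {
  dir="${1:-/etc/pve}"
  # Throwaway probe file, made unique per shell via the PID.
  probe="${dir}/write-probe.$$"
  if touch "$probe" 2>/dev/null; then
    rm -f "$probe"
    return 0
  fi
  return 1
}
```

Run on a node, something like "pve_fs_writable && echo writable || echo read-only" tells you whether writes go through again. It doesn’t replace looking at the syslog, it just saves you from finding out the hard way.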
