MySQL/MariaDB

MariaDB Galera Cluster Crash. Recover a Galera Cluster.

MariaDB Galera Clusters are designed to be online all of the time. A MariaDB Galera cluster can deal well with a node failure. But if you experience a total outage of all nodes you could be in trouble. The MariaDB Galera Cluster will likely crash. So for this article let’s assume you have a three-node MariaDB Galera cluster and you lost power to all nodes. You are likely in a position that the servers are now back online but MariaDB won’t start. In my development Galera cluster I just powered off all of my nodes to cause this problem.

It’s important to point out that even though the cluster was running in Master-Master replication due to the crash one of the nodes will have newer information than the others. In this development environment, I have two NVMe Servers in France and one in the UK. So which one has the latest information? Check the status of MariaDB with systemctl mysqld status.

root@frn-bk:~# service mysqld status
× mariadb.service - MariaDB 10.6.16 database server
     Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sun 2024-03-03 12:35:49 UTC; 37min ago
       Docs: man:mariadbd(8)
             https://mariadb.com/kb/en/library/systemd/
    Process: 1224 ExecStartPre=/usr/bin/install -m 755 -o mysql -g root -d /var/run/mysqld (code=exited, status=0/SUCCESS)
    Process: 1225 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
    Process: 1227 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= ||   VAR=`cd /usr/bin/..; /usr/bin/galera_recovery`; [ $? -eq 0 ]   && systemctl set-environ>
    Process: 1320 ExecStart=/usr/sbin/mariadbd $MYSQLD_OPTS $_WSREP_NEW_CLUSTER $_WSREP_START_POSITION (code=exited, status=1/FAILURE)
   Main PID: 1320 (code=exited, status=1/FAILURE)
     Status: "MariaDB server is down"
        CPU: 127ms

Mar 03 12:35:16 frn-bk.f2h.cloud systemd[1]: Starting MariaDB 10.6.16 database server...
Mar 03 12:35:16 frn-bk.f2h.cloud sh[1228]: WSREP: Recovered position e8eee71f-d664-11ee-8fde-7fd997630df6:116324
Mar 03 12:35:49 frn-bk.f2h.cloud systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
Mar 03 12:35:49 frn-bk.f2h.cloud systemd[1]: mariadb.service: Failed with result 'exit-code'.
Mar 03 12:35:49 frn-bk.f2h.cloud systemd[1]: Failed to start MariaDB 10.6.16 database server.

Find Higher Sequence number

So from the output, I can see two bits of information that are useful to me. The first is obvious. The cluster node is goosed. But the second piece of information is critical, in the status output, MariaDB has provided a UUID and a sequence number. The UUID is the e8eee71f-d664-11ee-8fde-7fd997630df6 section and the sequence number is the data appended after the UUID. In this case 116324.

Mar 03 12:35:16 frn-bk.f2h.cloud sh[1228]: WSREP: Recovered position e8eee71f-d664-11ee-8fde-7fd997630df6:116324

Check the status of the mysqld service on all nodes. You need to find the higher sequence number. In my nodes, the London node had the higher sequence number so this is the node that has to latest information on it. It was the last node to go down and will be the node we bring up first.

Mar 03 13:20:15 uklon-bk.f2h.cloud sh[1817]: WSREP: Recovered position e8eee71f-d664-11ee-8fde-7fd997630df6:116326

So inside the node with the higher sequence number now cat the /var/lib/mysql/grastate.dat file. The returned data will look something like this.

cat /var/lib/mysql/grastate.dat

# GALERA saved state
version: 2.1
uuid:    e8eee71f-d664-11ee-8fde-7fd997630df6
seqno:   -1
safe_to_bootstrap: 1

Recreate Galera Cluster

On this node, you should see that the safe_to_bootstrap line has a 1. That means we can proceed. However, if you see a 0 here double-check that this is the node with the higher seqno number. Assuming it is you can edit the above file and place a 1 in the file with the below command.

sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat

Because your MariaDB Galera Cluster has suffered a total crash, you must create the cluster again. All of your settings are already in place and your data is safe. Issue the below command.

galera_new_cluster && systemctl status mysqld

Assuming this went well, you should now see that the MariaDB service has started. You can start MariaDB as normal on the remaining nodes in the MariaDB Galera Cluster to recover from the crash. To help combat this issue, ensure MariaDB Galera restarts automatically on failure. Most of the time, automatically restarting MariaDB Galera after a crash will bring the server back online.

Going Further

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button