Use pacemaker and corosync on Illumos (OmniOS) to run a HA active/passive cluster

In the Linux world, a popular approach to build highly available clusters is with a set of software tools that include pacemaker (as resource manager) and corosync (as the group communication system), plus other libraries on which they depend and some configuration utilities.

On Illumos (and in our particular case, OmniOS), the ihac project is abandoned and I couldn’t find any other platform-specific open source and mature framework for clustering. Porting pacemaker to OmniOS is an option and this post is about our experience with this task.

The objective of the post is to describe how to get an active/passive pacemaker cluster running on OmniOS and to test it with a Dummy resource agent. The use case (or test case) is not relevant, but what should be achieved in a correctly configured cluster is that, if the node of the cluster running the Dummy resource (active node) fails, then that resource should fail over and be started on the other node (high availability).

I will assume that you start from a fresh installation of OmniOS 151012 with a working network configuration (and ssh, for your comfort!). Check the general administration guide if needed.

This is what we will cover:

  • Configuring the machines
  • Patching and compiling the tools
  • Running pacemaker and corosync from SMF
  • Running an active/passive cluster with two nodes to manage the Dummy resource

Installing packages

Some packages are installed from the default repositories, but others need to be retrieved from opencsw.

Install from default repositories:

# pkg install developer/gnu-binutils text/gnu-grep \
 rsync gnu-tar wget text/gnu-sed compatibility/ucb text/gawk\
 autoconf gnu-m4 system/header header-math \
 ipmitool gnu-make developer/build/libtool library/libtool/libltdl \
 library/ncurses library/security/openssl text/gnu-gettext \
 developer/versioning/mercurial \
 developer/versioning/git \
 SUNWcs driver/network/ofk system/header \
 developer/library/lint developer/object-file \
 system/library/mozilla-nss/header-nss library/nspr/header-nspr \
 xz pkg://omnios/developer/swig \
 package/pkg file/gnu-coreutils \
 system/header/header-picl developer/gcc48 \
 developer/build/automake

Install the OpenCSW utility and update the repos:

# pkgadd -d http://get.opencsw.org/now
# /opt/csw/bin/pkgutil -U

Install the packages from CSW:

# /opt/csw/bin/pkgutil -i ggettext pkgconfig libnet gnutls libgnutls_dev libgnutls13 libev_dev libevent_dev libgcrypt11

Configure the environment

Variables

Some environment variables will be needed by the tools and scripts, but also during the building process. The easiest thing is to create a file (e.g., pacemaker.rc) and source it to get the pacemaker environment ready. You may want to separate the variables needed only for running the tools from the various flags needed during the build.

NOTE: there should be no particular reason to change the installation prefix (PREFIX), but if you need to, please adapt the rest of the instructions to that change where needed.

Content of pacemaker.rc:

export PCMK_ipc_type=socket
export PREFIX=/opt
export CFLAGS='-D__EXTENSIONS__ -D_POSIX_PTHREAD_SEMANTICS -DNAME_MAX=255 -DHOST_NAME_MAX=255 -I/opt/gcc-4.8.1/include -I/usr/include -I${PREFIX}/include -I/opt/ha/include -I/opt/gcc-4.8.1/lib/gcc/i386-pc-solaris2.11/4.8.1/include/ -lsocket -lnsl'
export LDFLAGS='-R/usr/gnu/lib -L${PREFIX}/lib -L/opt/gcc-4.8.1 -L/usr/gnu/lib -L/lib -L/usr/lib'
export PATH=/usr/gnu/bin:/opt/gcc-4.8.1/bin/:/opt/csw/bin:/usr/gnu/bin:/usr/bin:/usr/sbin:/usr/local/bin:$PREFIX/bin:/sbin/:/opt/csw/gnu/:${PREFIX}/sbin
export PKG_CONFIG_PATH='/opt/lib/pkgconfig:/usr/lib/pkgconfig:/usr/local/lib/pkgconfig'
export PKG_CONFIG_LIBDIR='/opt/lib/pkgconfig:/usr/lib/pkgconfig:/usr/local/lib/pkgconfig'
export PKG_CONFIG_ALLOW_SYSTEM_CFLAGS=yes
export PKG_CONFIG_ALLOW_SYSTEM_LIBS=yes
export LCRSODIR=/usr/libexec/lcrso 
export CLUSTER_USER=hacluster
export CLUSTER_GROUP=haclient
export BUILDPATH=/export/builds
export LD_ALTEXEC=/usr/gnu/i386-pc-solaris2.11/bin/ld
export CONFIG_SHELL=/usr/gnu/bin/sh
export PYTHONPATH=${PREFIX}/lib/python2.6/site-packages
export OCF_ROOT=/opt/usr/lib/ocf

Then source the file to have the configuration on the current shell:

# source pacemaker.rc
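
Since most of the later steps silently depend on these variables, a quick check after sourcing the file can save some debugging. This is only a minimal sketch: check_env is a helper name made up for this example, and the list of variables is just a subset of pacemaker.rc that the rest of the post relies on.

```shell
# check_env (hypothetical helper): print the name of every expected
# variable that is unset or empty. Returns 0 if all are set, 1 otherwise.
check_env() {
    missing=0
    for name in PREFIX BUILDPATH CLUSTER_USER CLUSTER_GROUP OCF_ROOT PCMK_ipc_type; do
        # indirect lookup of the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

Run check_env in the same shell where you sourced pacemaker.rc; it prints nothing when everything is in place.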

Now update the library path on the system to include the CSW objects:

# crle -l /opt/csw/lib/ -u

Folders

# mkdir -p $BUILDPATH
# mkdir -p $PREFIX/var
# mkdir -p $PREFIX/lib/heartbeat/cores/$CLUSTER_USER

Cluster user and group

We create the hacluster user and the haclient group that will run the cluster, then set permissions on the folders created above.

Note that the corosync and pacemaker processes will be run as hacluster user (as per the SMF script that comes later), so a common problem when using resource agents will be about missing permission on directories or executables.

# getent group ${CLUSTER_GROUP} >/dev/null || groupadd ${CLUSTER_GROUP}
# getent passwd ${CLUSTER_USER} >/dev/null || useradd -g ${CLUSTER_GROUP} -d $PREFIX/lib/heartbeat/cores/$CLUSTER_USER -s /bin/bash -c "cluster user" ${CLUSTER_USER}
# chown $CLUSTER_USER:$CLUSTER_GROUP $PREFIX/var/
# chown $CLUSTER_USER:$CLUSTER_GROUP $PREFIX/lib/heartbeat/cores/$CLUSTER_USER

Similarly, hacluster won’t have enough rights to run commands that modify the system, such as ipadm create-addr, so we give it passwordless sudo. If a resource agent that you want to run needs sudo for some of its instructions, it will need to be patched accordingly.

Run

# visudo

then append this at the end of the file to have the passwordless sudo:

hacluster    ALL=(ALL) NOPASSWD: ALL

An alternative would be to set appropriate Role-Based Access Control (RBAC) authorizations.

UPDATE: running pacemaker and corosync as root should work without issues. So you can edit the SMF script and use “root” instead of “hacluster” in the “CLUSTER_USER” variable.

Hostnames

Set the hostnames of both machines by appending an entry at the end of /etc/hosts (use the output of uname -n to get the symbolic name), example:

10.0.100.10     ha-test-1

Check that from each machine you can ping the other with the symbolic name.
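
Before pinging, it can help to confirm that both names are really present in /etc/hosts on each machine. The following is a small sketch; host_in_file is a name invented for this example, and ha-test-2 in the usage note is a hypothetical name for the second node.

```shell
# host_in_file NAME FILE (hypothetical helper): succeed if NAME appears
# as a hostname or alias in a hosts-format FILE, ignoring comments.
host_in_file() {
    awk -v name="$1" '
        {
            sub(/#.*/, "")                 # drop comments
            for (i = 2; i <= NF; i++)      # field 1 is the address
                if ($i == name) found = 1
        }
        END { exit !found }
    ' "$2"
}
```

For example: for n in ha-test-1 ha-test-2; do host_in_file "$n" /etc/hosts || echo "no entry for $n"; done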

Installation of the tools

This section will provide indications on which tools to build and install and how. Install them in the order shown here below.

This information is re-elaborated from Andreas’s page on libqb (check credits and references).

I will mention the version of the tools that I used on my setup (or explicitly add a checkout command). You are encouraged to try the latest “masters/tips” when available, but that might need patching work not documented here.

Also, the fact that the code compiles doesn’t mean much. There are differences between Linux and Solaris, such as expected return values, that will break the code at runtime. With the patches described here, I managed to run the Dummy, IPaddr and ZFS resources correctly, but the line of code that crashes everything will be executed sooner or later; I just haven’t traversed that code yet :D!

General note on configure.ac and Makefile.am

I had the need to do this change for many packages, so I will document it here as a general note and reference this paragraph if this change is needed to compile a certain package.

So if you see the note “apply the changes described in the general section about configure.ac and Makefile.am” during the installation instructions of a package, come back to this paragraph and do the two changes described here below.

Add the following line in configure.ac, after the AC_INIT directive:

AC_CONFIG_MACRO_DIR([/opt/csw/share/aclocal/])

If Makefile.am already has an ACLOCAL_AMFLAGS variable, then append

-I/opt/csw/share/aclocal/

to that line, otherwise add the complete entry

ACLOCAL_AMFLAGS=-I/opt/csw/share/aclocal/
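
Because the same two edits come back for several packages, they can be scripted. What follows is a sketch under a few assumptions, not something I used for the original setup: apply_csw_aclocal is a made-up name, the AC_INIT call is assumed to start at the beginning of a line, and the ACLOCAL_AMFLAGS assignment is assumed to fit on one line.

```shell
# apply_csw_aclocal DIR (hypothetical helper): apply the two edits
# from this note to the source tree in DIR. Safe to run more than once.
apply_csw_aclocal() {
    dir=${1:-.}
    macro_dir=/opt/csw/share/aclocal/

    # heartbeat ships configure.in instead of configure.ac
    conf=$dir/configure.ac
    [ -f "$conf" ] || conf=$dir/configure.in

    # 1. Add AC_CONFIG_MACRO_DIR after the (possibly multi-line) AC_INIT call.
    if ! grep -F -q "AC_CONFIG_MACRO_DIR([$macro_dir])" "$conf"; then
        awk -v line="AC_CONFIG_MACRO_DIR([$macro_dir])" '
            { print }
            !inserted && (/^AC_INIT/ || depth > 0) {
                rest = $0
                # track open parentheses until the AC_INIT call is closed
                depth += gsub(/\(/, "", rest) - gsub(/\)/, "", rest)
                if (depth <= 0) { print line; inserted = 1 }
            }
        ' "$conf" > "$conf.tmp" && mv "$conf.tmp" "$conf"
    fi

    # 2. Append to ACLOCAL_AMFLAGS in Makefile.am, or add the entry.
    mk=$dir/Makefile.am
    if grep -F -q -- "-I$macro_dir" "$mk"; then
        :   # already patched
    elif grep -q '^ACLOCAL_AMFLAGS' "$mk"; then
        sed "/^ACLOCAL_AMFLAGS/ s|\$| -I$macro_dir|" "$mk" > "$mk.tmp" && mv "$mk.tmp" "$mk"
    else
        echo "ACLOCAL_AMFLAGS=-I$macro_dir" >> "$mk"
    fi
}
```

From inside a source directory, apply_csw_aclocal . does both changes; review the result with a diff before running autoreconf or autogen.sh.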

help2man

# cd $BUILDPATH
# wget http://ftp.hawo.stw.uni-erlangen.de/gnu/help2man/help2man-1.46.1.tar.xz
# tar xf help2man-1.46.1.tar.xz
# cd help2man-1.46.1
# ./configure
# make
# make install

libtool

# cd $BUILDPATH
# wget http://ftp.gnu.org/gnu/libtool/libtool-2.4.2.tar.gz
# tar zxf libtool-2.4.2.tar.gz
# cd libtool-2.4.2
# ACLOCAL=aclocal-1.14 AUTOMAKE=automake-1.14 ./bootstrap
# chmod +x libltdl/config/install-sh
# ./configure
# make install

libesmtp

# cd $BUILDPATH
# export CFLAGS='-std=c89 -D__EXTENSIONS__ -DNAME_MAX=255 -DHOST_NAME_MAX=255'
# wget http://www.stafford.uklinux.net/libesmtp/libesmtp-1.0.6.tar.gz
# gtar zxf libesmtp-1.0.6.tar.gz
# cd libesmtp-1.0.6
# mkdir m4
# autoreconf -i
# ./configure --prefix=$PREFIX 
# perl -pi -e 's#// TODO: handle GEN_IPADD##' smtp-tls.c
# gmake
# gmake install 
# cp auth-client.h $PREFIX/include
# cp auth-plugin.h $PREFIX/include
# cp libesmtp.h $PREFIX/include
# unset CFLAGS
# export CFLAGS='-D__EXTENSIONS__ -D_POSIX_PTHREAD_SEMANTICS -DNAME_MAX=255 -DHOST_NAME_MAX=255 -I/opt/gcc-4.8.1/include -I/usr/include -I${PREFIX}/include -I/opt/ha/include -I/opt/gcc-4.8.1/lib/gcc/i386-pc-solaris2.11/4.8.1/include/ -lsocket -lnsl'

check

# cd $BUILDPATH
# wget http://sourceforge.net/projects/check/files/check/0.9.8/check-0.9.8.tar.gz
# gtar zxf check-0.9.8.tar.gz
# cd check-0.9.8

Edit the configure.ac file to add two lines (just after AC_CONFIG_MACRO_DIR([m4]) ):

 m4_pattern_allow([AM_PROG_AR])
 AM_PROG_AR

Then continue with the build:

# ACLOCAL=aclocal-1.14 AUTOMAKE=automake-1.14 autoreconf --install
# ./configure 
# make
# make install

asciidoc

# cd $BUILDPATH
# wget http://sourceforge.net/projects/asciidoc/files/asciidoc/8.6.8/asciidoc-8.6.8.tar.gz
# gtar zxf asciidoc-8.6.8.tar.gz
# cd asciidoc-8.6.8
# ./configure
# gmake install

cluster glue

(Note: I used version 2ce85bfab4c1 for my setup)

# cd $BUILDPATH
# wget -O cluster-glue.tar.bz2 http://hg.linux-ha.org/glue/archive/tip.tar.bz2
# gtar jxf cluster-glue.tar.bz2
# cd Reusable-Cluster-Components-*
# perl -pi -e 's#\$\(XSLTPROC\) \\#\$\(XSLTPROC\) --novalid \\#g' doc/Makefile.am

Search for “solaris” in configure.ac and match that section with the following (you should only add the CFLAGS line):

*solaris*)
       REBOOT_OPTIONS="-n"
       POWEROFF_OPTIONS="-n"
       CFLAGS="$CFLAGS -D__EXTENSIONS__"

Search for “cc_supports_flag()” in configure.ac and check that it matches the following:

cc_supports_flag() {
       local CFLAGS="$@"
       AC_MSG_CHECKING(whether $CC supports "$@")
       AC_COMPILE_IFELSE([AC_LANG_SOURCE(int main(){return 0;})] ,[RC=0; AC_MSG_RESULT(yes)],[RC=1; AC_MSG_RESULT(no)])
       return $RC
}

Run:

# ./autogen.sh 
# chmod +x install-sh
# sed -i 's/-fstack-protector-all//g' configure.ac

Then apply the changes described in the general section about configure.ac and Makefile.am.

Complete the build:

# ACLOCAL=aclocal-1.14 AUTOMAKE=automake-1.14 autoreconf --install
# LDFLAGS='-L/opt/csw/lib' ./configure --prefix=$PREFIX --enable-fatal-warnings=no --enable-doc=no --with-daemon-user=${CLUSTER_USER} --with-daemon-group=${CLUSTER_GROUP}
# make
# make install

Resource agents

(Note: I used version b644395 for my setup)

# cd $BUILDPATH
# wget -O resource-agents.tar.gz https://github.com/ClusterLabs/resource-agents/tarball/master
# gtar zxvf resource-agents.tar.gz
# cd ClusterLabs-resource-agents-*/
# ACLOCAL=aclocal-1.14 AUTOMAKE=automake-1.14 ./autogen.sh
# chmod +x install-sh
# ./configure --prefix=$PREFIX
# gmake clean
# gmake
# gmake install

libqb

# cd $BUILDPATH 
# git clone https://github.com/ClusterLabs/libqb.git libqb
# cd libqb
# git checkout v0.17.1

Then apply the changes described in the general section about configure.ac and Makefile.am.

Complete the build:

# ACLOCAL=aclocal-1.14 AUTOMAKE=automake-1.14 ./autogen.sh
# LDFLAGS='-R/opt/gcc-4.8.1/lib' CFLAGS="-D_REENTRANT -D_POSIX_PTHREAD_SEMANTICS -D__EXTENSIONS__ -march=i486 -mtune=native" ./configure --prefix=$PREFIX --enable-debug --with-check=yes --enable-slow-tests
# make clean
# make
# make install

libstatgrab

# cd $BUILDPATH
# wget http://dl.ambiweb.de/mirrors/ftp.i-scream.org/libstatgrab/libstatgrab-0.91.tar.gz
# gtar zxvf libstatgrab-0.91.tar.gz
# cd libstatgrab-0.91
# ACLOCAL=aclocal-1.14 AUTOMAKE=automake-1.14 ./configure --prefix=$PREFIX 
# make
# make install

corosync

Get the sources:

# cd $BUILDPATH
# git clone https://github.com/corosync/corosync.git corosync
# cd corosync
# git checkout v2.3.4

Set the environment:

# export LDFLAGS='-R/opt/gcc-4.8.1/lib -R/usr/lib/mps -R/opt/lib -L/opt/gcc-4.8.1/lib -L/usr/lib/mps -L/opt/lib -lnss3 -lsmime3 -lssl3 -lnssutil3 -lplds4 -lplc4 -lnspr4 -lpthread -ldl -lposix4'
# export nss_CFLAGS='-I/usr/include/mps'
# export nss_LIBS='-R/usr/lib/mps -L/usr/lib/mps'
# export PKG_CONFIG_PATH='/opt/lib/pkgconfig:/usr/lib/pkgconfig:/usr/local/lib/pkgconfig'
# export PKG_CONFIG_LIBDIR='/opt/lib/pkgconfig:/usr/lib/pkgconfig:/usr/local/lib/pkgconfig'
# export PKG_CONFIG_ALLOW_SYSTEM_CFLAGS=yes
# export PKG_CONFIG_ALLOW_SYSTEM_LIBS=yes

In case of previous failed attempts, clean the configuration cache:

# rm -f config.status
# rm -rf autom4te.cache

Apply the changes described in the general section about configure.ac and Makefile.am.

Complete the build:

# ACLOCAL=aclocal-1.14 AUTOMAKE=automake-1.14 ./autogen.sh
# ./configure --prefix=$PREFIX --localstatedir=$PREFIX/var --enable-monitoring --enable-snmp --enable-xmlconf --enable-testagents --enable-augeas --enable-debug --enable-coverage
# make
# make install

Now logout and login again from your shell, then source pacemaker.rc to continue from a clean environment.

heartbeat

Get the sources:

# cd $BUILDPATH
# wget http://hg.linux-ha.org/heartbeat-STABLE_3_0/archive/tip.tar.bz2
# gtar jxf tip.tar.bz2
# cd Heartbeat-3-0-*/

Apply the changes described in the general section about configure.ac and Makefile.am (for heartbeat the file is configure.in).

In case of previous failed attempts, clean the configuration cache:

# rm -f config.status
# rm -rf autom4te.cache

Complete the build:

# ACLOCAL=aclocal-1.14 AUTOMAKE=automake-1.14 autoreconf -i
# CFLAGS="-I/opt/include -I/opt/csw/include/ " LDFLAGS='-R/opt/lib -L/opt/lib -L/opt/csw/lib/ -lnsl /opt/csw/lib/libgnutls.so.13 -lsocket' ./configure --prefix=$PREFIX --enable-quorumd
# chmod +x install-sh
# make CPPFLAGS="-L/opt/csw/lib/ -lgnutls -I/usr/include/glib-2.0/ -I/usr/lib/glib-2.0/include/"
# make install

pacemaker

Get the sources:

# cd $BUILDPATH
# git clone https://github.com/ClusterLabs/pacemaker.git pacemaker
# cd pacemaker
# git checkout 272814b6423d4cdc21a0a83cd9007a4d57bd542d

Set the environment:

# export CFLAGS='-O3 -D_REENTRANT -D_POSIX_PTHREAD_SEMANTICS -march=i486 -mtune=native -I/usr/include/ncurses/'
# export LDFLAGS="-R/usr/lib/mps:/opt/gcc-4.8.1/lib -L'/usr/lib/mps:/opt/gcc-4.8.1/lib' -lssp_nonshared"
# export PKG_CONFIG_PATH='/opt/lib/pkgconfig:/usr/lib/pkgconfig:/usr/local/lib/pkgconfig'
# export CONFIG_SHELL=/usr/gnu/bin/sh

Patch:

# perl -pi -e 's/-Wunsigned-char//g' configure.ac
# perl -pi -e 's#-Wunused-but-set-variable##' configure.ac
# perl -pi -e 's/-fstack-protector-all//g' configure.ac
 
# sed -i 's/\(ACLOCAL_AMFLAGS\s*=\s*\-I\s*m4\)/\1 \-I\/opt\/csw\/share\/aclocal\//g' Makefile.am
 
# find . -name "*.c" -o -name "*.h" | xargs sed -i 's/syscall\.h/sys\/syscall\.h/g'
# sed -i 's/reboot(RB_AUTOBOOT)/reboot(RB_AUTOBOOT, \"pacemaker\")/g' lib/common/watchdog.c
# sed -i 's/\(sysrq_init()\)/\/\/\1/g' mcp/pacemaker.c

Apply the hack of shame: this is a terrible workaround for the missing signalfd system call in Illumos (the patch target is a file named services_linux.c!). We just wait 5 seconds for the forked process to finish providing stdout (instead of listening to signals)…

Get the patch file from the gist, extract it and apply it (this will also do minor changes to the Dummy resource agent):

# wget https://gist.githubusercontent.com/vincepii/b5d8f356a35d535313b5/raw/5a4a1e8df5691c39531eb5ffb7f6f0a5c0769a0b/pacemaker.patch
# git apply pacemaker.patch

Complete the build:

# ACLOCAL=aclocal-1.14 AUTOMAKE=automake-1.14 ./autogen.sh
# ./configure --prefix=$PREFIX --enable-fatal-warnings=no --with-corosync --with-cs-quorum --with-acl=no --enable-debug
# make CPPFLAGS="-I/usr/include/ -I/usr/include/glib-2.0/ -I/usr/lib/glib-2.0/include/ -I/usr/include/libxml2/ -I/opt/include $CFLAGS"
# make install

Post install:

# mkdir -p $PREFIX/etc/corosync/uidgid.d
# (
echo "uidgid {"
echo " uid: `id -u ${CLUSTER_USER}`"
echo " gid: `id -g ${CLUSTER_USER}`"
echo "}"
) > $PREFIX/etc/corosync/uidgid.d/uid.conf

Now logout and login again from your shell, then source pacemaker.rc to continue from a clean environment.

crm

# cd $BUILDPATH
# git clone https://github.com/crmsh/crmsh.git crmsh
# cd crmsh
# git checkout 0d631cb36655695a67c940cf02c3fabccff705da
# perl -pi -e 's#ps -e -o pid,command#ps -e -o pid,comm#' ./modules/utils.py
# perl -pi -e 's#a2x -f manpage#a2x -L -f manpage#' doc/Makefile.am
# ACLOCAL=aclocal-1.14 AUTOMAKE=automake-1.14 ./autogen.sh
# ./configure --prefix=$PREFIX

Edit the file doc/Makefile.am:

# sed -i 's/a2x -L -f manpage $</a2x --no-xmllint -f manpage $</g' doc/Makefile.am

Complete the build:

# make
# make install

Post install:

# mkdir -p /root/.config/crm/
# cp /opt/etc/crm/crm.conf /root/.config/crm/

And this should complete the installation part.

NOTE: if crm does not run and complains about a missing readline module, you can use crm with the CSW python (for some reason readline.so does not appear in /usr/lib/python2.6/lib-dynload).

To do this (only if crm is not working), use CSW python as interpreter and reinstall crm:

# /opt/csw/bin/pkgutil -i libreadline6 libreadline_dev python py_lxml
# cd $BUILDPATH/crmsh
# sed -i 's/\#\!\/usr\/bin\/python/\#\!\/opt\/csw\/bin\/python/' crm
# make
# make install

Corosync configuration

Get a sample corosync configuration file from this gist and put it in place:

# wget https://gist.github.com/vincepii/86f60ff4ff912a782a67/raw/c70603c7af996494ad490aa5ef16613babfe4572/corosync.conf
# mv corosync.conf ${PREFIX}/etc/corosync/corosync.conf

Then edit the file and change the following fields:

memberaddr: use the addresses of your members
bindnetaddr: use the address of your network
ring0_addr: set the hostname of each node
nodeid: not really necessary to change it, use values that you prefer
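
To see at a glance which values the sample file currently carries for these four fields, a grep is enough. This is a minimal sketch; show_corosync_fields is a name made up for the example, and it assumes each field sits on its own line in the usual "key: value" form.

```shell
# show_corosync_fields FILE (hypothetical helper): print, with line
# numbers, every line of FILE that sets one of the fields listed above.
show_corosync_fields() {
    grep -n -E '^[[:space:]]*(memberaddr|bindnetaddr|ring0_addr|nodeid)[[:space:]]*:' "$1"
}
```

For example: show_corosync_fields ${PREFIX}/etc/corosync/corosync.conf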

Fixing permissions

All these files should already exist (except for corosync.pid, unless you had already run corosync for some reason). If one of these commands gives an error, do not ignore it!

# chown -R hacluster:haclient ${BUILDPATH}/corosync/exec
# chown -R hacluster:haclient ${BUILDPATH}/corosync/common_lib/.libs
# chown -R hacluster:haclient ${PREFIX}/var/log/cluster
# chown hacluster:haclient ${PREFIX}/var/run
# touch ${PREFIX}/var/run/corosync.pid
# chown hacluster:haclient ${PREFIX}/var/run/corosync.pid
# chown -R hacluster:haclient /opt/var/lib
# chown -R hacluster:haclient /opt/var/run/resource-agents/

Setting up Corosync in SMF

To run corosync as a service in SMF, you will need the manifest and the executable script. You can find both of them on gist.

Download the manifest and the script and put them in place:

# wget https://gist.github.com/vincepii/2771b79dddd18adb1e51/raw/f0cc2d8c08419dd44ee9b4e1c2b6290d5c8859f0/corosync.xml
# wget https://gist.github.com/vincepii/2771b79dddd18adb1e51/raw/f744a4ef02fa2e7e4d5a2f9c403cb9e9ff411617/corosyncd
# mkdir ${PREFIX}/etc/smf
# mv corosyncd ${PREFIX}/etc/smf/
# mv corosync.xml ${PREFIX}/etc/smf/
# chmod u+x ${PREFIX}/etc/smf/corosyncd

Validate the SMF manifest, hopefully you will get no errors.

# svccfg validate ${PREFIX}/etc/smf/corosync.xml

Import and enable the service:

# svccfg import ${PREFIX}/etc/smf/corosync.xml
# svcadm enable corosync

Check if the service started:

# svcs | grep corosync

If everything went well, you should see an output like the following:

online 12:35:55 svc:/application/hacluster/corosync:default
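
If you want to script this check (for example after a reboot), you can poll the service state instead of reading the svcs output by eye. This is a sketch and not part of the original setup: wait_online is a made-up name, and it relies on the -H (no header) and -o state options of svcs as described in svcs(1).

```shell
# wait_online FMRI [TRIES] (hypothetical helper): poll the SMF state of
# a service once per second. Returns 0 as soon as it is online, 1 if it
# enters maintenance or is still not online after TRIES attempts.
wait_online() {
    fmri=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        state=$(svcs -H -o state "$fmri" 2>/dev/null)
        if [ "$state" = "online" ]; then
            return 0
        fi
        if [ "$state" = "maintenance" ]; then
            return 1
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}
```

For example: wait_online corosync 60 || echo "corosync did not come online"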

If something went wrong, try to get the output of the SMF script with

# cat `svcs -L corosync`

Now corosync and pacemaker should start at boot. You can disable and enable the service with:

# svcadm disable corosync
# svcadm enable corosync

Running the cluster

You can check the cluster status with:

# crm_mon

(remember to source pacemaker.rc if the command is not available!)

The output should look similar to the following (note that for this run I had 3 nodes, 2 of which offline, and I set the expected_votes to 1 to have a partition with quorum):

You need to have curses available at compile time to enable console mode
Last updated: Thu Nov 6 12:44:26 2014
Last change: Thu Nov 6 12:42:01 2014
Stack: corosync
Current DC: ha-test-1 (80) - partition with quorum
Version: 1.1.12-272814b
3 Nodes configured
0 Resources configured
Node omni-pcm (20): UNCLEAN (offline)
Node omni-pcm-2 (40): UNCLEAN (offline)
Online: [ ha-test-1 ]

Administer the Dummy resource!

If you now have configured two nodes to create a pacemaker cluster, the next step is to check that the cluster can administer a resource.

We will use the Dummy resource, which does nothing other than verifying that it is running. When we start the resource, it will run on one of the nodes. If we kill that node in some way, the Dummy resource should fail over and start running on the other node.

This demonstrates that pacemaker operations are working.

First, let’s create some symlinks and set some permissions to be sure that pacemaker will find everything accessible:

# ln -s /usr/lib/ocf/resource.d/pacemaker /opt/usr/lib/ocf/resource.d/pacemaker
# ln -s /opt/usr/lib/ocf/lib/ /usr/lib/ocf/lib
# ln -s /opt/usr/lib/ocf/resource.d/heartbeat /usr/lib/ocf/resource.d/heartbeat
# chown -R hacluster:haclient /usr/lib/ocf
# chown -R hacluster:haclient /opt/usr/lib/ocf

As a basic configuration for our cluster, we disable STONITH and ignore the no-quorum state (with two nodes, we cannot have a quorum if one of them fails!):

# crm configure property stonith-enabled=false
# crm configure property no-quorum-policy=ignore
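
The arithmetic behind this is simple. As a sketch, assuming the usual majority rule (more than half of the expected votes, one vote per node), which is how I understand corosync to compute quorum; quorum_votes is just a name for this example:

```shell
# quorum_votes EXPECTED (hypothetical helper): number of votes needed
# for a majority, assuming the rule floor(EXPECTED / 2) + 1.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}
```

With two nodes, quorum_votes 2 gives 2, so a single surviving node (one vote) can never have quorum on its own; this is why we tell pacemaker to ignore the no-quorum state. With three nodes the result is still 2, so one node can fail without losing quorum.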

Verify that our configuration so far is correct with

# crm_verify -L -V

which should print nothing if everything is fine.

Now configure the Dummy resource:

# crm configure primitive dummy ocf:pacemaker:Dummy op monitor interval=120s

and then check its status (it should be running on one of the nodes):

# crm resource status dummy
resource dummy is running on: ha-test-1

Now you can verify that if one node stops working (you can simulate this with svcadm disable corosync), the Dummy resource will be started on the other node.
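
To compare the node before and after the simulated failure, you can extract the node name from the status line shown above. A minimal sketch, assuming the output keeps the "resource NAME is running on: NODE" format; running_node is a name made up for this example.

```shell
# running_node (hypothetical helper): read "crm resource status" output
# on stdin and print the node name, or nothing if it is not running.
running_node() {
    sed -n 's/^resource .* is running on: *\([^ ]*\).*$/\1/p'
}
```

For example, run before=$(crm resource status dummy | running_node) on a node that will stay up, disable corosync on the active node, and then compare $before with the output of the same command.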

You can check the status also with crm_mon, that should show something similar to this:

# crm_mon
Defaulting to one-shot mode
You need to have curses available at compile time to enable console mode
Last updated: Thu Nov 6 16:39:58 2014
Last change: Thu Nov 6 16:34:43 2014
Stack: corosync
Current DC: ha-test-1 (80) - partition with quorum
Version: 1.1.12-272814b
3 Nodes configured
1 Resources configured

Online: [ ha-test-1 ]
OFFLINE: [ omni-pcm omni-pcm-2 ]

dummy (ocf::pacemaker:Dummy): Started ha-test-1

Conclusions

If everything worked, you should now have a pacemaker cluster running on OmniOS.

If you now need different (i.e., useful) resource agents (e.g., IPaddr), some patching may be needed because RA scripts are run by the hacluster user, which, according to this post, doesn’t have the authorizations to perform system changes. Also expect problems if the hacluster user doesn’t have enough rights to read, write or traverse folders that the RA script wants to access. Check the logs (tail -f /opt/var/log/cluster/corosync.log), debug and fix :). UPDATE: running corosync and pacemaker as root should also work, and this simplifies using RAs. Check the “Cluster user and group” section.

For a better understanding of pacemaker, please refer to the official documentation.

Credits

I want to thank Andreas Grüninger for his great support in helping me with this setup and his contributions to the pacemaker tools that made it possible to run them on Illumos. I re-used the SMF manifest and script for corosync that he shared with me (with his permission :)). Also, the general procedure was made available by Andreas at this page.

Many thanks also to Sašo Kiselkov for his in-depth blog post on building a HA ZFS storage appliance.

References

12 thoughts on “Use pacemaker and corosync on Illumos (OmniOS) to run a HA active/passive cluster”

  1. Hi,
    It’s Awesome guide, to install the HA-cluster on OmniOS. Btw, I follow your guide and successfully add the Dummy resource. but when I’ve tried to add IPaddr2 resource its return with error on “crm status” http://pastebin.com/bKYru6iD
    and also I Can’t find the ZFS resource, can you share the modification you made on the resource script ?

    • Hi Adhi, I’m very happy to hear that this helped :)!

      One first big recommendation to have RAs working more easily is to run pacemaker and corosync as root.

      Then, I can share my modified IPaddr script, hoping that this would also work for you, otherwise you may need to debug :).

      You can find it here: https://gist.github.com/vincepii/6763170efa5050d2d73d

      There are still debug lines (commented out) in there :D.

      • thanks piiv for your response. More question , how about using stonith ?
        just like in Sašo Kiselkov blog post, configure stonith for ipmi resource :
        crm(live)configure# primitive head1-stonith stonith:external/ipmi

        with this guide my crm only show this option:
        crm(live)configure# primitive test_stonith stonith:fence_
        stonith:fence_legacy stonith:fence_pcmk

        even I link stonith plugins resource from :
        ln -s /opt/lib/stonith /opt/usr/lib/

        can you help me ?

          • I’m Sorry for the next question, but actually I’m not expert in clustering.

            How to enable or add the “node-level fencing” ?

          • You need to get STONITH to run on your pacemaker setup, you can follow any of the available tutorials.

            I haven’t covered this, so I don’t know what kind of issues you may run into and if more patching will be needed.

  2. Hi,
    Very nice writeup. I have tried to follow all the steps literally but got stuck building pacemaker.
    I get error messages like:
    ../lib/cluster/.libs/libcrmcluster.so: undefined reference to `cl_get_string'
    ../lib/cluster/.libs/libcrmcluster.so: undefined reference to `ha_msg_expand'
    ../lib/cluster/.libs/libcrmcluster.so: undefined reference to `ha_msg_addstruct_compress'
    .
    . etc
    Now, I am not a programmer so troubleshooting this will take me a long time probably. Any tips?
    Since this page has been here for a while, might it be that following your instructions, I get a newer version of pacemaker than the one that you are using? I get version 1.1.12. If so, can you please tell me which versions of the tools you have used to accomplish this? I will try to download them then.

    Thanks in advance.
    Sim

    • The linker cannot find the libraries that define those symbols.

      Most likely it is one of two possible causes:

      1. You have the libraries, but they are in a path that the linker is not aware of (missing -L option) or the library is not used for linking (missing -l option)
      2. You don’t have the libraries/object files and you need to compile the source code that provides them

      I see that all those functions are defined in the heartbeat package, so that’s what you are missing here (http://sourcecodebrowser.com/heartbeat/2.1.3/ha__msg_8h.html#a37d2ff7771f225e758a14e4c390b53f5).
      Did heartbeat build and install correctly?

      About the version of the tools, the git checkout command on the pacemaker source code should bring your working copy to the same state as I used.

      What version of OmniOS are you using?

      • Hi,
        Thank you very much for your reply!
        I have used the commands copy paste from above when building pacemaker. Do you mean a -L or -l option in the make command of pacemaker? Pardon the question, since I haven’t looked at your link regarding the heartbeat package yet. There was one diff while building heartbeat though. The source does not contain a config.in as stated in your red message. I had to apply the changes to the Makefile.am. I am using omnios r12 (same as you stated in this post), though I downloaded one of the first images available. Maybe I should use a more recent r12 if still available? Do you mind if I send you a mail with some questions about your own implementation to your normal mail account? They might clutter your nice page if I did it here 🙂 and are not specific to building the source but more to your experience.
        High regards!

        • The -L and/or -l should go in the LDFLAGS, with the right parameters (e.g., -L/path/to/libs or -llibname).

          But most likely, something went wrong with the installation of heartbeat.

          I have checked the download link and that actually points to the tip of the branch, so your first hypothesis is correct: that is not exactly the same archive that I used (but a more recent one).

          You can find the package that I used here: https://owncloud.engineering.zhaw.ch/public.php?service=files&t=bd374d54710dd131bc81ef8baf3d7c59

          The OmniOS version that you use should not give problems, I have done this setup there as well.

          For the last question, feel free to write to me at “|my four letters username as you can read it on top of this message|” at zhaw dot ch 🙂

    • I find that my stonith:external/ipmi resource can’t start, and the error log was: stonith-ng: error: get_agent_metadata: Could not retrieve metadata for fencing agent fence_legacy.
      could you help to give some light?
