Wednesday 9 February 2011

Making BASH Scripts HA Compatible - Daemonising BASH

Quite a few times on our clusters we've needed to make a cron job or a shell script HA compatible. We'd like the cluster to be able to start and stop it, so that it can fail over with other resources if required.


It's actually a lot easier than it seems. The easiest way is to write a while true loop with a sleep in it, and on each iteration check the current time against the time the script is due to run. It's effectively replacing cron, but needs must.


This is how I did it.

Stage 1 - Make a standard LSB compatible init script carcass


#!/bin/bash
# description: Start or stop your res name
#                                                        
### BEGIN INIT INFO                                      
# Provides: your_res_name                     
# Required-Start: $network $syslog                       
# Required-Stop: $network                                
# Default-Start: 3                                       
# Default-Stop: 0                                        
# Description: Start or stop your res name      
### END INIT INFO    

RUNFILE="/var/run/your_res_name"


NAME="YourResName"
case "$1" in
'start')
        CHECKSTATUS
        [ "$RUNNING" ] && echo "$0 is already running" && exit 0
        echo "Starting $0"
        touch "$RUNFILE"
        MAINLOOP &
        ;;
'stop')
        rm -f "$RUNFILE"
        # Kill any running job processes; the main loop exits on its next pass once the run file is gone
        pkill -f "$NAME "
        echo "$NAME stopped"
        ;;
'restart')
        $0 stop
        sleep 5
        $0 start
        ;;
'status')
        CHECKSTATUS
        [ "$RUNNING" ] && echo "$NAME is running" && exit 0 || echo "$NAME is stopped" && exit 3;;
*)
        echo
        echo "Usage: $0 {start|stop|restart|status}"
        echo
        exit 1;;
esac

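There are no CHECKSTATUS or MAINLOOP functions yet; we need to add those next. They have to be defined above the case statement, since the shell needs a function defined before it's called.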

MAINLOOP


You need a while true; do ... done loop that sits there checking whatever you need to check, and then does the work at the appropriate time.


RUNTIME="003000" # 00:30:00, written as HHMMSS
MAINLOOP() {
        LOG="/var/log/$NAME.log"

        while true
        do
                # Check for permission to run: stop removes the run file.
                [ ! -f "$RUNFILE" ] && exit 0

                # Check if we've already run today.
                # OUTPUTFILE is whatever file your job produces; set it to suit your script.
                if [ ! -f "$OUTPUTFILE" ]
                then
                        # Or if we're still running
                        NUMPROCS=`pgrep -f "$NAME " | wc -l`
                        if [ "$NUMPROCS" -lt 1 ]
                        then
                                THETIME=`date +%H%M%S` # Get a numerically comparable time.
                                if [ "$THETIME" -gt "$RUNTIME" ]
                                then
                                        echo -e "\nApparently $THETIME is greater than $RUNTIME so it's time to do our thang" >> "$LOG"
                                        echo -e "------------------------" >> "$LOG"
                                        echo -e "\n*** Starting process.***\nThe time : $THETIME" >> "$LOG"
                                        echo -e "\nNumber of existing processes : $NUMPROCS" >> "$LOG"
                                        echo -e "\nLet's GO!\n" >> "$LOG"
                                        # RUN_OUTPUT is your own function that does the actual work.
                                        RUN_OUTPUT
                                fi
                        fi
                fi
                sleep 1m
        done
}
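The loop above calls RUN_OUTPUT and tests $OUTPUTFILE without defining either, since both depend on your job. As a rough sketch only, assuming a date-stamped output file so that the "already run today" check resets at midnight, they might look something like this. OUTPUT_DIR, SET_OUTPUTFILE and the body of RUN_OUTPUT are made up for illustration and aren't from the original script.

```shell
# Hypothetical names throughout: OUTPUT_DIR, SET_OUTPUTFILE and the job body
# are illustrations, not part of the original script.
NAME="${NAME:-YourResName}"
OUTPUT_DIR="${OUTPUT_DIR:-/var/tmp}"

# Point OUTPUTFILE at today's file, named <NAME>-<YYYYMMDD>.out
SET_OUTPUTFILE() {
        OUTPUTFILE="$OUTPUT_DIR/$NAME-$(date +%Y%m%d).out"
}

# Do the actual work and leave today's output file behind.
RUN_OUTPUT() {
        SET_OUTPUTFILE
        # Replace this line with your real job.
        echo "Job ran at $(date)" > "$OUTPUTFILE"
}
```

Calling SET_OUTPUTFILE at the top of each pass through the while loop would keep the "already run today" test pointed at the current day's file.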


CHECKSTATUS


You can just use something simple here, like checking that the run file exists and that the background process is running.



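CHECKSTATUS () {
        unset RUNNING
        if [ -f "$RUNFILE" ]
        then
                # Look for the backgrounded main loop: its command line is this
                # init script's file name followed by "start".
                if [ `pgrep -f "your_res_name start" | wc -l` -gt 0 ]
                then
                        RUNNING="yes"
                fi
        fi
}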


That's pretty much it. Test each action of the init script from both states, echoing out the return code each time, e.g.


/etc/init.d/your_res start ; echo $? # from stopped, should be 0
/etc/init.d/your_res start ; echo $? # from started, should be 0
/etc/init.d/your_res stop ; echo $? # from started, should be 0
/etc/init.d/your_res stop ; echo $? # from stopped, should be 0
/etc/init.d/your_res status ; echo $? # from started, should be 0
/etc/init.d/your_res status ; echo $? # from stopped, should be 3
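If you'd rather not eyeball the return codes, the checks above can be wrapped in a small helper. This is only a sketch; check_rc is a made-up name, and it simply runs one action and compares the exit status with the one you expect.

```shell
# check_rc SCRIPT ACTION EXPECTED
# Hypothetical helper: runs "SCRIPT ACTION" and compares its exit status
# with EXPECTED, printing PASS or FAIL.
check_rc() {
        check_rc_actual=0
        "$1" "$2" >/dev/null 2>&1 || check_rc_actual=$?
        if [ "$check_rc_actual" -eq "$3" ]
        then
                echo "PASS: $2 returned $check_rc_actual"
        else
                echo "FAIL: $2 returned $check_rc_actual, expected $3"
                return 1
        fi
}
```

You'd then run, for example, check_rc /etc/init.d/your_res status 3 while the resource is stopped, and work through the other five cases the same way.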


Once this is done, you can just add it to the cluster as a primitive LSB resource, add a monitor on it and let the cluster take care of your script.
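The post doesn't say which cluster stack is in use. Assuming Pacemaker with the crm shell installed, adding the script as an LSB primitive with a monitor might look roughly like this. The resource name your_res and the 30 second interval are placeholders, so check them against your own setup before running anything.

```shell
# Assumption: Pacemaker with the crm shell (crmsh). RES is a placeholder and
# must match the init script's file name under /etc/init.d/.
RES="your_res"
INTERVAL="30s"   # how often the cluster calls "status"; pick what suits you

CRM_CMD="crm configure primitive $RES lsb:$RES op monitor interval=$INTERVAL"

# Print the command for review rather than running it blind.
echo "$CRM_CMD"
```

Once you're happy with the printed command, run it on one node of the cluster.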