`The monitor process`

monitor.c

Data Structures/Variables:

ipAddr : A string that contains the ip-address of the co-ordinator that this monitor
process has to monitor.
msgQueueId : The identifier of the message queue from which the monitor process
picks up messages.

nodeId : The id of the co-ordinator that this monitor has to monitor.

A monitor process acts only as an RPC client. It periodically checks its adjacent coordinator process in the ring to determine whether it is alive or has failed. If the failed coordinator was the leader, the live coordinator that most recently acted as leader resumes as the new leader.

When the monitor process is started by the local co-ordinator process, it is given the
id of the last co-ordinator in the ring that it has to monitor. The co-ordinator and the
monitor processes running on a machine communicate with each other through a local message queue. The monitors form a monitoring ring as shown below:

Following are the steps/functions performed by the monitor process:

Get the handle to the message queue with a predefined known key.
Get the pid of the monitor process (itself).
Register itself with the local co-ordinator process. This is done by calling the RPC function register_timer_pid(int pid) on the local coordinator, where pid is the process identifier of the monitor process. This registration is required because the server must deliver a signal to the monitor process whenever it sends the ip-address of a coordinator to be monitored through the local Unix message queue. This may happen more than once, since new coordinators may join the system and old coordinators may leave or fail, requiring readjustments to the ring.
Install signal handlers for the following two signals:

            1) SIGUSR1 : Whenever a new co-ordinator joins the ARC ring, the ring
            changes at one link. When this happens, the last monitor process
            must start monitoring the newly joined co-ordinator. The last leader
            local co-ordinator tells the monitor about this by putting the
            information in the message queue and sending this signal to the
            monitor process. The last leader co-ordinator sends the address of
            the new coordinator to the monitor process when it receives the
            broadcast_join() from the newly joined co-ordinator.

The handler for this signal is the function handler().
handler() is invoked every time the signal SIGUSR1 is received.

            2) SIGALRM : This is an alarm signal that the monitor process sets for
            itself, so that it receives the alarm periodically. Whenever the monitor
            process receives this signal, it tests whether the co-ordinator it
            has to monitor is alive or has failed, by trying to create a client handle
            to the coordinator. This is done in the function live_test().
            The handler for this signal is time_out(), and it is invoked whenever
            the alarm signal is received.

Enter an infinite sleep loop, in which the process sets the alarm for TIMEOUT seconds and then sleeps until the alarm arrives. (Open question: why not simply sleep and carry out the action when the sleep is interrupted? Why SIGALRM?)

Following are the helper functions used:

void handler() : This is the handler for the SIGUSR1 signal. It fetches the message from the message queue. The message contains the ip-address of the coordinator that the monitor should now start monitoring. From this point onwards, the monitor monitors this new co-ordinator.
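A minimal sketch of the queue read that handler() performs is given below. The message layout (struct monitor_msg), the buffer length IP_LEN, and the function name fetch_new_target() are assumptions made for illustration; the source only says that the message carries the ip-address of the coordinator to monitor.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define IP_LEN 64 /* assumed buffer size */

/* Hypothetical message layout. System V messages must start with a
 * long type field; the payload follows it. */
struct monitor_msg {
    long mtype;
    char ip[IP_LEN];
};

char ipAddr[IP_LEN]; /* address of the co-ordinator being monitored */

/* Take one pending message from the queue without blocking and make
 * its address the new monitoring target.
 * Returns 0 if ipAddr was updated, -1 if nothing could be read. */
int fetch_new_target(int msgQueueId)
{
    struct monitor_msg m;
    ssize_t n;

    memset(&m, 0, sizeof m);
    n = msgrcv(msgQueueId, &m, sizeof m.ip, 0, IPC_NOWAIT);
    if (n < 0)
        return -1;

    m.ip[IP_LEN - 1] = '\0'; /* guard against an unterminated payload */
    memcpy(ipAddr, m.ip, IP_LEN);
    return 0;
}
```

Passing 0 as the message type makes msgrcv return the first message on the queue regardless of its type, and IPC_NOWAIT keeps the call from blocking if the signal arrives before the message does.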

void timeout() : This function is invoked whenever the alarm signal arrives. It calls the function live_test(char* ipAddress) to check the liveness of the co-ordinator running on the machine with the given ip-address. If the machine has failed, it calls the function handleNodeFailure(int nodeId), where nodeId is the id of the co-ordinator that has failed.
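The decision made on each alarm can be sketched as below. The real live_test() and handleNodeFailure() use RPC, so the versions here are stand-ins that only record what happened. The convention that live_test() returns non-zero for "alive", the example address and id, and the name check_coordinator() are all assumptions.

```c
char ipAddr[64] = "192.0.2.10"; /* example address, not from the source */
int nodeId = 3;                 /* example id, not from the source */

/* Stand-in state so the sketch runs without any RPC server. */
int stub_alive = 1;     /* what the fake liveness test reports */
int last_reported = -1; /* id passed to the fake failure report */

/* Stand-in for the RPC-based test; assumed to return non-zero when a
 * client handle to the co-ordinator could be created. */
int live_test(char *ipAddress)
{
    (void)ipAddress;
    return stub_alive;
}

/* Stand-in for the RPC call node_failed(nodeId) on the local
 * co-ordinator. */
void handleNodeFailure(int failedId)
{
    last_reported = failedId;
}

/* Body of the SIGALRM handler: test the monitored co-ordinator and
 * report it to the local co-ordinator only if it has failed. */
void check_coordinator(void)
{
    if (!live_test(ipAddr))
        handleNodeFailure(nodeId);
}
```

With stub_alive set to 1 nothing is reported; setting it to 0 causes the next call to pass nodeId to handleNodeFailure().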

The following functions are defined in the file "/usr/src/generalRoutine.c":

void handleNodeFailure(int nodeId) : This function creates a client handle to the local coordinator and then calls the RPC node_failed(int nodeId). This tells the local coordinator that the coordinator with the given id has failed.
int live_test(char* ipAddress) : This function checks the status of the coordinator running on the machine with the given ip-address. It tries to create a client handle to the co-ordinator; if this succeeds, the co-ordinator is alive. The integer returned indicates whether creation of the client handle was successful.