The monitor process
Data Structures/Variables:
ipAddr : A string that contains the ip-address of the co-ordinator
that this monitor process has to monitor.
msgQueueId : A reference to the message queue from which
the monitor process picks up messages.
nodeId : The id of the co-ordinator that this
monitor has to monitor.
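As a rough illustration, the three variables above could be grouped as follows. This is only a sketch: the struct name monitor_state, the helper monitor_set_target() and the buffer size IP_ADDR_LEN are assumptions made for this example and do not come from the monitor source.

```c
#include <assert.h>
#include <string.h>

#define IP_ADDR_LEN 64   /* assumed buffer size; the text does not give one */

/* Hypothetical grouping of the three variables described above. */
struct monitor_state {
    char ipAddr[IP_ADDR_LEN]; /* ip-address of the monitored co-ordinator */
    int  msgQueueId;          /* message queue shared with the local co-ordinator */
    int  nodeId;              /* id of the monitored co-ordinator */
};

/* Point the monitor at a new co-ordinator; over-long addresses are truncated. */
void monitor_set_target(struct monitor_state *st, const char *ip, int nodeId)
{
    assert(st != NULL && ip != NULL);
    strncpy(st->ipAddr, ip, IP_ADDR_LEN - 1);
    st->ipAddr[IP_ADDR_LEN - 1] = '\0';   /* strncpy does not always terminate */
    st->nodeId = nodeId;
}
```

Keeping the address and the id together means both are updated in one place whenever the ring is readjusted.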
A monitor process is an RPC client only. It
periodically checks its adjacent coordinator process in a ring
to determine whether it is alive or has failed. If the failed coordinator
is a leader, the live coordinator that most recently acted as leader takes
over as the new leader.
When the monitor process is started by the local co-ordinator process,
it is given the
id of the last co-ordinator in the ring that it has to monitor.
The co-ordinator and the
monitor processes running on a machine communicate with each other
through a local message queue. The monitors form a monitoring ring as shown
below:

Following are the steps/functions performed by the monitor process:
-
Get the handle to the message queue with a predefined known key.
-
Get the pid of the monitor process (itself).
-
Register itself with the local co-ordinator process. This is done by
calling the RPC function register_timer_pid(int pid) on the local coordinator,
where pid is the process identifier of the monitor process. This registration
is required because the local coordinator (the server) must deliver a signal
to the monitor process whenever it sends the ip-address of a coordinator to
be monitored through the local Unix message queue. This may happen more
than once, since new coordinators may join the system and old coordinators
may leave or fail, requiring readjustments to the ring.
-
Install signal handlers for the following two signals:
1) SIGUSR1 : Whenever a new co-ordinator joins the ARC ring, the ring
changes at one link. When this happens, the last monitor process
must start monitoring the newly joined co-ordinator. The monitor is told
this by its local co-ordinator (the last leader), which puts the information
in the message queue and sends this signal to the monitor process. The
last leader co-ordinator sends the address of the new coordinator to the
monitor process when it receives the broadcast_join() from the newly joined
co-ordinator.
The handler for this signal is the function handler(), which is invoked
every time SIGUSR1 is received.
2) SIGALRM : This is an alarm signal that the monitor process sets for
itself so that it receives the alarm periodically. Whenever the monitor
process receives this signal, it tests whether the co-ordinator it
has to monitor is alive or has failed by trying to create a client handle
to that coordinator. This is done in the function live_test().
The handler for this signal is time_out(), which is invoked whenever
the alarm signal is received.
-
Enter an infinite loop in which the process sets the alarm for
TIMEOUT seconds and then sleeps until a signal arrives. (Why not simply
sleep and carry out the action when the sleep is interrupted, rather than
using SIGALRM?)
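The signal set-up and the sleep loop in the steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the monitor's actual code: the value of TIMEOUT, the names on_sigusr1(), on_sigalrm(), install_handlers(), wait_for_event() and monitor_loop(), and the two flags are all invented for this example. It also differs from the description in one respect: the handlers here only set a flag and the loop does the work, whereas the text has handler() and time_out() do the work directly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define TIMEOUT 5  /* seconds; assumed value, the text does not give one */

/* Flags set by the handlers and consumed by the loop. */
volatile sig_atomic_t new_target_pending = 0;
volatile sig_atomic_t alarm_fired = 0;

void on_sigusr1(int signo) { (void)signo; new_target_pending = 1; }
void on_sigalrm(int signo) { (void)signo; alarm_fired = 1; }

/* Install both handlers; returns 0 on success, -1 on failure. */
int install_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);

    sa.sa_handler = on_sigusr1;
    if (sigaction(SIGUSR1, &sa, NULL) == -1) return -1;

    sa.sa_handler = on_sigalrm;
    if (sigaction(SIGALRM, &sa, NULL) == -1) return -1;
    return 0;
}

/* One iteration: arm the alarm and sleep until some handler has run. */
void wait_for_event(unsigned timeout_secs)
{
    assert(timeout_secs > 0);   /* alarm(0) would cancel the alarm instead */
    alarm(timeout_secs);
    pause();
}

/* The infinite loop: sleep, then act on whichever signal arrived. */
void monitor_loop(void)
{
    for (;;) {
        wait_for_event(TIMEOUT);
        if (new_target_pending) {
            new_target_pending = 0;
            /* here the new ip-address would be read from the message queue */
        }
        if (alarm_fired) {
            alarm_fired = 0;
            /* here the liveness test would run */
        }
    }
}
```

Setting only a flag in each handler keeps the handlers short and avoids calling functions that are not safe inside a signal handler.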
Following are the helper functions used:
-
void handler() : This is the handler for the SIGUSR1 signal.
This function reads the message from the message queue. The message contains
the ip-address of the coordinator that the monitor should now start monitoring.
From this point onwards, the monitor monitors this new co-ordinator.
-
void timeout() : This function is invoked whenever
the alarm signal arrives. It calls the function live_test(char* ipAddress)
to check the liveness of the co-ordinator running on the machine
with the given ip-address. If the check fails, it calls the function
handleNodeFailure(int nodeId), where nodeId is the id of the co-ordinator
that has failed.
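The message-queue read performed by handler() might look roughly like the sketch below, assuming a System V message queue. The message layout struct addr_msg, the size IP_ADDR_LEN and the function name read_new_target() are assumptions made for this example; the text does not specify the actual message format.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define IP_ADDR_LEN 64  /* assumed buffer size */

/* Hypothetical message layout; System V requires a leading long type tag. */
struct addr_msg {
    long mtype;
    char ipAddr[IP_ADDR_LEN];
};

/*
 * Fetch the next ip-address from the queue without blocking.
 * Returns 0 and fills 'out' on success, -1 if no message is waiting
 * or the read fails.
 */
int read_new_target(int msgQueueId, char *out, size_t out_len)
{
    struct addr_msg msg;
    ssize_t n;

    assert(out != NULL && out_len > 0);
    n = msgrcv(msgQueueId, &msg, sizeof msg.ipAddr, 0, IPC_NOWAIT);
    if (n == -1)
        return -1;
    if ((size_t)n >= out_len)          /* leave room for the terminator */
        n = (ssize_t)(out_len - 1);
    memcpy(out, msg.ipAddr, (size_t)n);
    out[n] = '\0';
    return 0;
}
```

IPC_NOWAIT is used so that a signal arriving without a queued message does not leave the monitor blocked inside the read.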
The following functions are defined in the file "/usr/src/generalRoutine.c":
-
void handleNodeFailure(int nodeId) : This function creates
a client handle to the local coordinator and then calls the RPC node_failed(int
nodeId). This informs the local coordinator that the coordinator with the
given id has failed.
-
int live_test(char* ipAddress) : This function checks the status
of the coordinator running on the machine with the given
ip-address. It tries to create a client handle to the co-ordinator;
if this succeeds, the co-ordinator is alive. The integer returned
by the function indicates whether creation of the
client handle was successful.
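The way the timeout path ties live_test() and handleNodeFailure() together can be sketched as below. The RPC calls themselves are left out, since the text does not give the RPC program or version numbers; the probe and the failure report are passed in as function pointers instead. The names check_target(), fake_live_test() and fake_handle_node_failure() are invented for this example, and the convention that a nonzero return from the probe means "alive" is an assumption, because the text does not say which value signals success.

```c
#include <assert.h>
#include <string.h>

typedef int  (*live_test_fn)(char *ipAddress);   /* same shape as live_test() */
typedef void (*node_failed_fn)(int nodeId);      /* same shape as handleNodeFailure() */

/*
 * Probe the monitored co-ordinator and report it if the probe fails.
 * Returns 1 if the co-ordinator is alive, 0 if a failure was reported.
 */
int check_target(char *ipAddr, int nodeId,
                 live_test_fn probe, node_failed_fn report)
{
    assert(probe != NULL && report != NULL);
    if (probe(ipAddr) != 0)     /* assumed convention: nonzero means alive */
        return 1;
    report(nodeId);
    return 0;
}

/* Stand-ins that let the flow be exercised without an RPC library. */
int last_failed_node = -1;

int fake_live_test(char *ipAddress)
{
    /* pretend that only 10.0.0.9 is unreachable */
    return strcmp(ipAddress, "10.0.0.9") != 0;
}

void fake_handle_node_failure(int nodeId)
{
    last_failed_node = nodeId;
}
```

In the real monitor the two function pointers would be live_test and handleNodeFailure from generalRoutine.c.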